11 Jul 2020

Intro

Fasta file is the common type of file we need to deal with in Bio computing, one fasta file will typically contains thousands lines of Gene sequences, and it is dumb to just use the 1G source file. In order to separate one big file into small parts and make it easier to edit, I have used python and perl to deal with this issue. Below are the notes in learning.

Python

Since each sequence starts with a “>”, it would be easy for us to identify each sequence and set them into an array

if Y[0] == ">":
  fa_Info.append(Y)
  fa_Num = fa_Num + 1
  fa_Seq.append("")
else:
  fa_Seq[fa_Num] = fa_Seq[fa_Num] + Y

and thus we can get our sequence in the array fa_Seq.

Thus, combining with the open file and the output component, the entire code looks like this:

# set parameter
# X = sequence number in each file
# input_file =  file name
# prefix = split parts' prefix

X = 2500
input_file = "final_genome.fasta"
prefix = "split_"

FA_in_file = open(input_file, "r")

# read fasta file to a list
fa_Info = []
fa_Seq = []
fa_Num = -1

for Y in FA_in_file.readlines():
    Y = Y.rstrip()
    if Y[0] == ">":
        fa_Info.append(Y)
        fa_Num = fa_Num + 1
        fa_Seq.append("")
    else:
        fa_Seq[fa_Num] = fa_Seq[fa_Num] + Y

print("seq-num:", fa_Num+1)

# split the fasta list to multipe files
file_Num = (fa_Num + 1)//X + 1
print("file-num:", file_Num)

for i in range(file_Num):
    print("processing file no:", i+1)
    exec(prefix + str(i + 1) + ' = open("' + prefix + str(i + 1) + '.fasta"' + ', "w")')
    start = i * X
    end = (i + 1) * X
    if end > fa_Num + 1:
        end = fa_Num + 1
    for j in range(start, end, 1):
        exec(prefix + str(i + 1) + '.write(fa_Info[j] + "\\n")')
        # output the seq with 60 chars in each line
        # while len(fa_Seq[j]) > 60:
        #     exec(prefix + str(i + 1) + '.write(fa_Seq[j][:60] + "\\n")')
        #     fa_Seq[j] = fa_Seq[j][60:]
        # else:
        #     exec(prefix + str(i + 1) + '.write(fa_Seq[j] + "\\n")')
        exec(prefix + str(i + 1) + '.write(fa_Seq[j] + "\\n")')
    exec(prefix + str(i + 1) + '.close()')

FA_in_file.close()

Perl

But the way python deal with the data is not efficient, when dealing with long chains of DNAs, it will soon run out of time. Thus, I have found a faster way online in handling the fasta file - use perl

Here is the source of the code: reference

Final words

Great experience in finding ways to cut a big fasta file into smaller parts.

Many thanks to help from Mr chen and tutorials on CSDN

How to handle fasta file

Intro

Python

Perl

Final words