
I want to compare each header in the fasta file against file1 and, if there is a match, reorder the header so that the matching name comes first. If nothing in file1 matches, I want to check file2 and, if there is a match, put that name first in the header.

The goal is to reorganize each fasta header so that the matching name is listed as its first entry. How can I do this with my code, or is there another way to write it in Python or Biopython?

BioPython solutions welcome.

Current files look like this:

seqs.fasta:

>ant| bee| fly 
AGCT... 
>bird| eagle| hawk| vulture 
GATCG...

file1.txt:

mouse 
hawk 
cow

file2.txt:

crane 
fly

The output file should look like this, so that the order of the header starts with the match from file 1 or file 2:

> fly| ant| bee 
AGCT... 
> hawk| bird| eagle| vulture 
GATCG...

My code is:

fasta= open('seq.fasta', 'r')
one= open('file1.txt' , 'r')
two= open('file2.txt' , 'r')
output= open('newseq.fasta', 'w')

for line in fasta:
    if line.startswith('>'): # getting just the header
        header= line.strip().split("|") # there are several entries on the header separated by the |
        for name in one:
           if name in header: # looking to see if file1.txt has match in fasta file
              header.insert(0,line.pop(header.index(name))) # insert the name from file1 as the first header entry
        else: # if no match in file1 look at file2
            for name in two:
               if name in header:
                    header.insert(0,line.pop(header.index(name))) # insert the name from file2 as first header entry   
    else:
output.write(line)
M__
nora job

1 Answer

There is nothing wrong with the concept of the code; that is exactly how I would do it. The problem is here:

    else:
output.write(line)
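
As a quick check of that claim, here is a minimal sketch that compiles a reduced copy of the question's loop without running it. The `snippet` string and the `<question>` filename label are made up for illustration; only the `else:` followed by an unindented `output.write(line)` is taken from the question.

```python
# A reduced copy of the question's loop: the final else: has no indented body.
snippet = (
    "for line in ['>ant| bee| fly']:\n"
    "    if line.startswith('>'):\n"
    "        pass\n"
    "    else:\n"
    "output.write(line)\n"
)

try:
    # compile() only parses the source, so nothing is written anywhere.
    compile(snippet, "<question>", "exec")
    error_name = None
except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
    error_name = type(err).__name__

print(error_name)
```

Python rejects the snippet before a single line runs, which is why the script fails regardless of what the input files contain.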

Corrections

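import os

output = 'newseq.fasta'
if os.path.exists(output):
    os.unlink(output)  # start from an empty file, since we append below

# Read the names once: a file object can only be iterated a single time,
# and every raw line still ends with a newline that would prevent a match.
names_one = [name.strip() for name in one]
names_two = [name.strip() for name in two]

for line in fasta:
    if line.startswith('>'):  # getting just the header
        # there are several entries on the header separated by the |
        header = [entry.strip() for entry in line[1:].split('|')]
        for names in (names_one, names_two):  # file1 first, then file2
            matches = [name for name in names if name in header]
            if matches:
                # move the matching name to the front of the header
                header.insert(0, header.pop(header.index(matches[0])))
                break  # a match in file1 means file2 is not consulted
        line = '>' + '| '.join(header) + '\n'

    # The indentation must be absolutely correct,
    # otherwise the code falls over. This block must be one indent in from
    # the first for statement so that every line is written.
    with open(output, 'a') as fout:
        fout.write(line)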

I assume that fasta, one and two have been opened successfully, as in the question.
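
For comparison, here is a self-contained sketch of the same idea that keeps the header logic in a small function so it can be tried without any files. The function names `reorder_header` and `reorder_fasta` are my own, and the in-memory lists stand in for seqs.fasta, file1.txt and file2.txt from the question. It assumes that when several names match, the one appearing first in the header should win, which the question does not specify.

```python
def reorder_header(header_line, first_names, second_names):
    """Return the header with the first matching name moved to the front."""
    entries = [entry.strip() for entry in header_line.lstrip(">").split("|")]
    # Check the names from file1 first and fall back to file2 only if needed.
    for names in (first_names, second_names):
        match = next((entry for entry in entries if entry in names), None)
        if match is not None:
            entries.remove(match)
            entries.insert(0, match)
            break
    return ">" + "| ".join(entries)


def reorder_fasta(lines, first_names, second_names):
    """Yield FASTA lines, rewriting headers and passing sequences through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = reorder_header(line, first_names, second_names)
        yield line + "\n"


# In-memory stand-ins for the three files shown in the question.
fasta_lines = [
    ">ant| bee| fly \n",
    "AGCT...\n",
    ">bird| eagle| hawk| vulture \n",
    "GATCG...\n",
]
file1_names = {"mouse", "hawk", "cow"}
file2_names = {"crane", "fly"}

result = list(reorder_fasta(fasta_lines, file1_names, file2_names))
print("".join(result), end="")
```

With real files, the name sets could be built with something like `{line.strip() for line in open('file1.txt')}`, and the generator's output passed to `writelines` on an output file opened once in `'w'` mode, which avoids reopening the file for every line.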

M__