I would like to download all the ITS1 and ITS2 genes from NCBI in a FASTA file. I'd also like to download the taxonomy associated with each sequence.
Thanks,
Marco
I recommend using Biopython:
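from Bio import Entrez

Entrez.email = "your@somemail.com"  # contact address sent to NCBI with each request
Entrez.api_key = "YOUR_API_KEY"
filepath = "ITS1.fasta"  # output file; any writable path works

# esearch returns only 20 IDs by default; retmax raises that cap
search_result = Entrez.read(
    Entrez.esearch(db="nucleotide", term="ITS1[Gene Name]", retmax=500)
)
with Entrez.efetch(
    db="nucleotide", id=search_result["IdList"], rettype="fasta", retmode="text"
) as handle, open(filepath, "w") as file:
    file.write(handle.read())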
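For larger result sets it may help to fetch the IDs in chunks rather than in one request. The following is only a sketch under a few assumptions: the helper names `batches` and `fetch_fasta`, the batch size of 200, and the `retmax` of 10000 are my own choices, not anything NCBI or Biopython prescribes.

```python
def batches(ids, size=200):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_fasta(term, filepath, retmax=10000, batch_size=200):
    """Search `term` in the nucleotide database and append the FASTA records to `filepath`."""
    # Imported here so the batching helper above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = "your@somemail.com"  # placeholder address, replace with your own
    search = Entrez.read(Entrez.esearch(db="nucleotide", term=term, retmax=retmax))
    with open(filepath, "a") as out:
        for chunk in batches(search["IdList"], batch_size):
            # One efetch request per chunk of IDs
            with Entrez.efetch(
                db="nucleotide", id=chunk, rettype="fasta", retmode="text"
            ) as handle:
                out.write(handle.read())
```

Something like `fetch_fasta("ITS1[Gene Name] OR ITS2[Gene Name]", "ITS.fasta")` would then collect both gene names into one file, assuming that query matches the records you are after.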
For batch operations, it would be better to acquire an NCBI API key and combine GNU parallel with Entrez Direct. If you already have the accession numbers of the ITS1 and ITS2 genes, put them in a text file such as "accession-number.txt". Otherwise, search for the ITS genes you want to download and save their accession numbers:
esearch -db nuccore -query 'your query term[Gene Name]' |
efetch -format acc >accession-number.txt
To download all the sequences in FASTA format:
parallel -a accession-number.txt -j8 'efetch -db nuccore -id {} -format fasta >{}.fa'
Here -j8 sets the number of jobs run in parallel; it should stay below 10, the number of requests per second NCBI allows with an API key.
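The parallel command writes one .fa file per accession, while the question asks for a single FASTA file. A minimal sketch for merging them afterwards is below; the function name `merge_fasta` and the output name `all_sequences.fasta` are hypothetical, and it assumes all the .fa files sit in one directory.

```python
from pathlib import Path


def merge_fasta(directory, output_name="all_sequences.fasta"):
    """Concatenate every *.fa file in `directory` into one FASTA file.

    Returns the number of files merged.
    """
    directory = Path(directory)
    parts = sorted(directory.glob("*.fa"))
    with open(directory / output_name, "w") as out:
        for part in parts:
            text = part.read_text()
            # Ensure each file ends with a newline so the next header
            # is never glued to the previous sequence.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return len(parts)
```

Calling `merge_fasta(".")` in the download directory would produce `all_sequences.fasta` next to the individual files. The output name ends in `.fasta`, so it is not picked up by the `*.fa` pattern on a second run.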
To get the related taxonomy (e.g., taxonomy rank, scientific name, taxonomy ID, genus, and species) of each sequence:
body='BEGIN { OFS = "\t" } { print "{}", $0}'
parallel -a accession-number.txt -j8 "elink -db nuccore -id {} -target taxonomy -name nuccore_taxonomy |
esummary |
xtract -pattern DocumentSummary -element Rank ScientificName TaxId Genus Species |
awk '$body'"
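If the output of that command is redirected to a file, the tab-separated rows can be loaded into a lookup table keyed by accession. This is a rough sketch that assumes every complete row has six columns in the order accession, rank, scientific name, taxonomy ID, genus, species; the names `FIELDS` and `parse_taxonomy` are made up for illustration. Rows with a different column count (for example, when a record lacks a genus or species entry) are set aside rather than guessed at.

```python
FIELDS = ["accession", "rank", "scientific_name", "tax_id", "genus", "species"]


def parse_taxonomy(text):
    """Parse tab-separated taxonomy rows.

    Returns (table, skipped): `table` maps accession -> dict of fields,
    `skipped` lists the rows whose column count did not match FIELDS.
    """
    table, skipped = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        row = line.split("\t")
        if len(row) != len(FIELDS):
            skipped.append(row)
            continue
        record = dict(zip(FIELDS, row))
        table[record["accession"]] = record
    return table, skipped
```

All values stay as strings, so convert `tax_id` with `int()` if you need a number.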