How can I download from NCBI all the ITS genes and the related taxonomy?

Question

I would like to download all the ITS1 and ITS2 genes from NCBI in a FASTA file. I would also like to download the related taxonomy of each sequence.

Thanks,

Marco

Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, Jun 08 '22 at 17:52

score 4 · Answer 1 · answered Jun 08 '22 at 16:26

I recommend using Biopython:

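from Bio import Entrez

Entrez.email = "your@somemail.com"  # contact address sent with each request
Entrez.api_key = "YOUR_API_KEY"
filepath = "ITS1.fasta"  # output file; any writable path works

# Note: esearch returns only the first 20 IDs unless retmax is set.
search_result = Entrez.read(Entrez.esearch(db="nucleotide", term="ITS1[Gene Name]"))
# Fetch the matching records as FASTA text and save them to filepath.
with Entrez.efetch(
    db="nucleotide", id=search_result["IdList"], rettype="fasta", retmode="text"
) as handle, open(filepath, "w") as file:
    file.write(handle.read())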
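
A search for ITS sequences is likely to match far more records than a single esearch call returns by default, so the download usually has to be paged. The following is a minimal sketch of one way to do that with the Entrez history server, not a definitive implementation. The function names `batch_starts` and `download_fasta`, the batch size of 500, and the output path are assumptions made for illustration.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to page through `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return range(0, total, batch_size)


def download_fasta(term, out_path, email, api_key=None, batch_size=500):
    """Write every nucleotide record matching `term` to `out_path` as FASTA."""
    # Imported here so that batch_starts works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    if api_key:
        Entrez.api_key = api_key

    # usehistory="y" keeps the result set on NCBI's history server, so the
    # ID list does not have to be sent back with every efetch call.
    search = Entrez.read(Entrez.esearch(db="nucleotide", term=term, usehistory="y"))
    total = int(search["Count"])

    with open(out_path, "w") as out:
        for start in batch_starts(total, batch_size):
            # Each call fetches one page of the stored result set.
            with Entrez.efetch(
                db="nucleotide",
                rettype="fasta",
                retmode="text",
                retstart=start,
                retmax=batch_size,
                webenv=search["WebEnv"],
                query_key=search["QueryKey"],
            ) as handle:
                out.write(handle.read())
    return total
```

A call such as `download_fasta("ITS1[Gene Name] OR ITS2[Gene Name]", "its.fasta", "your@somemail.com", "YOUR_API_KEY")` would cover both regions from the question in one file, assuming that search term matches the records you want.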

score 3 · Answer 2 · answered Jun 15 '22 at 01:42

For batch operations, it is better to get an NCBI API key and combine GNU parallel with Entrez Direct. If you already have the accession numbers of the ITS1 and ITS2 genes, put them in a text file such as "accession-number.txt". Otherwise, search for the ITS genes you want to download and save their accession numbers:

esearch -db nuccore -query 'your query term[Gene Name]' |
  efetch -format acc >accession-number.txt

To download all the sequences in FASTA format:

parallel -a accession-number.txt -j8 'efetch -db nuccore -id {} -format fasta >{}.fa'

Here -j8 sets the number of jobs run in parallel. Keep it below 10, because NCBI allows at most 10 requests per second with an API key.
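
The number of parallel jobs and the number of requests per second are not the same thing: eight fast jobs can still exceed the limit, and eight slow ones may stay well under it. As a rough illustration of enforcing the rate itself, here is a minimal sketch of a shared throttle in Python. The class name `RateLimiter` and its parameters are hypothetical, and the figure of 10 requests per second is the one quoted in this answer.

```python
import threading
import time


class RateLimiter:
    """Space calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        """Block until this caller's time slot arrives; return the delay."""
        with self._lock:
            now = self._clock()
            # Take the next free slot and reserve the one after it.
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Every worker thread would call `limiter.wait()` on one shared `RateLimiter(10)` before sending a request, so the combined rate stays at or below 10 per second no matter how many workers are running.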

To get the related taxonomy (for example, taxonomy rank, scientific name, taxonomy ID, genus and species) of each sequence:

body='BEGIN { OFS = "\t" } { print "{}", $0}'
parallel -a accession-number.txt -j8 "elink -db nuccore -id {} -target taxonomy -name nuccore_taxonomy |
  esummary |
  xtract -pattern DocumentSummary -element Rank ScientificName TaxId Genus Species |
  awk '$body'"
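
For readers who prefer to stay in Python, the same lookup can be sketched with Biopython. This is an illustration under assumptions, not a tested equivalent of the pipeline above: the function names are hypothetical, the dictionary keys (`AccessionVersion`, `TaxId`, `LineageEx`, and so on) are the ones I recall from Biopython's parsed output and should be checked against a real response, and the species column here holds the full species name rather than whatever the taxonomy summary's Species field contains.

```python
def taxonomy_row(accession, tax_record):
    """Build one TSV row: accession, rank, name, taxid, genus, species."""
    # LineageEx lists the ancestors of the taxon. The taxon itself is added
    # so that a species-level record reports its own name as the species.
    lineage = list(tax_record.get("LineageEx", []))
    lineage.append(
        {"Rank": tax_record["Rank"], "ScientificName": tax_record["ScientificName"]}
    )
    by_rank = {node["Rank"]: node["ScientificName"] for node in lineage}
    fields = [
        accession,
        tax_record["Rank"],
        tax_record["ScientificName"],
        str(tax_record["TaxId"]),
        by_rank.get("genus", ""),
        by_rank.get("species", ""),
    ]
    return "\t".join(fields)


def fetch_taxonomy_rows(accessions, email, api_key=None):
    """Look up the taxonomy of each accession and return TSV rows."""
    # Imported here so that taxonomy_row works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    if api_key:
        Entrez.api_key = api_key

    # Step 1: the nucleotide summaries carry the TaxId of each sequence.
    summaries = Entrez.read(Entrez.esummary(db="nucleotide", id=",".join(accessions)))
    acc_to_taxid = {s["AccessionVersion"]: str(int(s["TaxId"])) for s in summaries}

    # Step 2: fetch each distinct taxon once from the taxonomy database.
    taxids = sorted(set(acc_to_taxid.values()))
    records = Entrez.read(
        Entrez.efetch(db="taxonomy", id=",".join(taxids), retmode="xml")
    )
    by_taxid = {str(r["TaxId"]): r for r in records}

    # Accessions whose taxon was not returned are skipped.
    return [
        taxonomy_row(acc, by_taxid[taxid])
        for acc, taxid in acc_to_taxid.items()
        if taxid in by_taxid
    ]
```

For a long accession list, `fetch_taxonomy_rows` would need to be called on smaller chunks of the list rather than on everything at once.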

Thank you! I thought they do not allow more than 10 requests per second per API key, regardless of whether the requests are made in parallel or serially. What is the real restriction? — Vovin, Jun 15 '22 at 08:21
I think the real restriction is 10 requests per second. You can try `-j` with a number larger than 10 to see if there are any errors. — Forrest Vigor, Jun 16 '22 at 02:11
To get taxonomy info without awk: `parallel -a accession-number.txt -j8 "elink -db nuccore -id {} -target taxonomy -name nuccore_taxonomy | esummary | xtract -pattern DocumentSummary -pfx '{}\t' -element Rank ScientificName TaxId Genus Species"` — Forrest Vigor, Jun 17 '22 at 04:15
