FASTA files can be very big and unwieldy. Since sequence lines are typically wrapped at 80 characters, one can't even speed up browsing with `less -S` to get one sequence every two lines.

How can I extract just the strain names (or sequence names, i.e. the string on the line starting with >) into a list? I'd like to use seqkit since I'm already familiar with it and know some of the flags and idioms it uses.

I googled the question but couldn't find an answer using seqkit - the examples all use awk, grep, sed or similar, which are alright for copy/pasting but not very memorable.

terdon
Cornelius Roemer

3 Answers


The nice thing about seqkit is that the answer is really simple and memorable:

seqkit seq -n sequences.fasta

This prints one sequence name per line to standard output:

MPX/PT0001/2022
MPX/PT0002/2022
MPX/PT0003/2022
Cornelius Roemer

You can trivially do this with standard *nix tools. Here are a few options:

sed -n 's/^>//p' file.fa 
awk 'sub(/^>/,"")' file.fa
grep '^>' file.fa | tr -d '>'
perl -ne 'print if s/^>//' file.fa 

The advantage of these is that they work out of the box on any POSIX-compliant system, and on most other *nix flavors (such as Linux) even where they aren't strictly POSIX, without needing to install any external tools.

Another thing to take into account is how fast each command is, especially when dealing with large files. I ran a few tests using the solutions presented here on the human hg38 genome assembly, which can be downloaded here, and seqkit version 2.2.0, which was downloaded from here. The file was stored locally on my laptop's SSD drive, and the laptop itself is:

$ inxi 
CPU: quad core Intel Core i7-6820HQ (-MT MCP-)
speed/min/max: 2896/800/3600 MHz Kernel: 5.17.5-arch1-1 x86_64
Up: 25d 10h 27m Mem: 15035.9/31958.1 MiB (47.0%)
Storage: 476.94 GiB (103.1% used) Procs: 452 Shell: Bash inxi: 3.3.15

To get more accurate results, I used a little function I have which will run the command you give it a specified number of times and then report the average real time across all runs. Here's the code if you want to try it:

timethis(){
    max=$1
    shift

    for com
    do
        printf '\nCOMMAND: %s\n' "$com"
        c=0
        while [[ $c -lt $max ]]
        do
            let c++
            ( time -p eval "$com" ) 2>&1 | grep -oP 'real.*?\K[\d\.]+'
        done |
            awk -v m="$max" '{k+=$1} END {print (k/m)}'
    done
}

And the results. These are the average "real" (i.e. wall clock) times taken across 10 runs:

$ timethis 10 \
  "rg '^>' GCA_000001405.29_GRCh38.p14_genomic.fna | tr -d '>' >/dev/null" \
  "grep '^>' GCA_000001405.29_GRCh38.p14_genomic.fna | tr -d '>' >/dev/null" \
  "sed -n 's/^>//p' GCA_000001405.29_GRCh38.p14_genomic.fna >/dev/null" \
  "perl -ne 'print if s/^>//' GCA_000001405.29_GRCh38.p14_genomic.fna >/dev/null" \
  "./seqkit seq -n GCA_000001405.29_GRCh38.p14_genomic.fna >/dev/null" \
  "awk 'sub(/^>/,\"\")' GCA_000001405.29_GRCh38.p14_genomic.fna >/dev/null" \
  "perl -nle 'push @res, \$_ while s/^>//; END {print join \"\n\",@res}' GCA_000001405.29_GRCh38.p14_genomic.fna"

COMMAND: rg '^>' GCA_000001405.29_GRCh38.p14_genomic.fna | tr -d '>' >/dev/null
0.474

COMMAND: grep '^>' GCA_000001405.29_GRCh38.p14_genomic.fna | tr -d '>' >/dev/null 
0.583

COMMAND: sed -n 's/^>//p' GCA_000001405.29_GRCh38.p14_genomic.fna >/dev/null 
5.62

COMMAND: perl -ne 'print if s/^>//' GCA_000001405.29_GRCh38.p14_genomic.fna  >/dev/null 
6.408

COMMAND: ./seqkit seq -n GCA_000001405.29_GRCh38.p14_genomic.fna >/dev/null 
4.326

COMMAND: awk 'sub(/^>/,"")' GCA_000001405.29_GRCh38.p14_genomic.fna >/dev/null 
13.518

COMMAND: perl -nle 'push @res, $_ while s/^>//; END {print join "\n",@res}' GCA_000001405.29_GRCh38.p14_genomic.fna
8.455

As you can see above, the rg + tr and grep + tr approaches took only around half a second on average, so they are clearly much faster than any of the other choices. However, seqkit holds up well: at around 4 seconds it was faster than all of the other ad hoc solutions.

terdon
  • Seqkit is arguably a bioinfo standard tool ;) – Cornelius Roemer May 30 '22 at 20:08
  • @CorneliusRoemer I guess. I have never used it myself and have been doing bioinformatics for more than 20 years. But by "standard" I mean "standard *nix tools", not bioinformatics tools. I edited to clarify. – terdon May 31 '22 at 08:34
  • @CorneliusRoemer there are weaknesses in relying on a developer's 'kit'. Flexibility: if you wanted to count the number of pangotype A.1 in SARS-CoV-2, my code is easy to modify to `/>.*A\.1.*/` with a `$count++`, whereas you would need to search seqkit to see if the developer had the foresight. Bugs: are 'kits' completely bug free, and how can you be sure? Upgrades: the promised upgrade is never released (very, very common). Distributing your algorithm: this is a headache because you've now got a dependency which you don't have permission to redistribute. – M__ May 31 '22 at 14:17
  • @M__ of course there are use cases for *nix tools. But there are also use cases for *kits. In this case I wanted to use a memorable solution that will not require lookup every time. The two other answers are arguably off topic because they don't address the question - which explicitly asked for `seqkit` usage. Another *kit would have been more on-topic than *nix tools. I don't mind them, it's just a bit typical for SO, you ask a question and you get a solution that's for a different question ;) – Cornelius Roemer May 31 '22 at 14:34
  • @CorneliusRoemer answers providing different approaches are welcome across the Stack Exchange network (note that this site is _not_ [SO]). Especially when there is already a fine answer using the requested tool. As for what is memorable, that is entirely personal. You are familiar with `seqt` so that seems easy and memorable for you. I have been using *nix tools for 20 years, so nothing is more memorable than a simple `sed s/foo/bar/`. But that's fine! That's the beauty of it: we can all provide answers and people can choose whichever approach works well for them. Everybody wins! – terdon May 31 '22 at 14:51
  • Fair :) though it's `seqkit` not `seqt` – Cornelius Roemer May 31 '22 at 15:03
  • Heh, you see how memorable I find it, @CorneliusRoemer! :P – terdon May 31 '22 at 15:05
  • @CorneliusRoemer you've got to agree with terdon: seqkit is far too slow. It is worrying. This would cause problems on the data sets in question, where a million-strong fasta file is normal. Take your pick, but realistically it's the end of this discussion; there's no point using an inefficient algorithm against huge data sets, sooner or later there'll be a RAM issue. – M__ May 31 '22 at 17:42
  • That's a good reason to use an alternative tool, I agree, if speed matters. If you use the command line interactively though, terseness can be more important. It depends on the use case I suppose. I'd be curious how fast `rg` would be. The slowdown in `seqkit` seems to be related to file access? Large `sys` time. Can you add a link on how to download the benchmark data so I can compare on my machine and try `rg`? – Cornelius Roemer May 31 '22 at 18:10
  • @CorneliusRoemer good catch! The test results I show were using a genome file stored on a network drive, which would of course affect the times. I had assumed that wouldn't be relevant since all tools were accessing the same file over the same network. However, I downloaded a different genome file locally and on my local SSD, `seqkit` is far more competitive. I'll update in a few minutes with the new test results, but do you have any idea why one tool would be affected by the network I/O times more than another? That's very surprising to me. – terdon May 31 '22 at 18:34
  • Ah interesting! I thought you were the bioinformatician with 20y of experience. I barely have 1 :D Seqkit could read in chunks? No idea, one should maybe even open this as a bug on Github. By the way there are other similar tools out there. Please try `rg` too, it's a Rust port of grep and likely faster than grep. Also best to link to the data file so it's reproducible. Performance could differ on macOS and *nix. Ah yes, that's an advantage of `seqkit` it could possibly work even on windows... – Cornelius Roemer May 31 '22 at 18:50
  • @CorneliusRoemer Ha! Serves me right for throwing my age around :) I do indeed have 20 years of experience, but I still come from the world of biology, not programming so there will be holes in my knowledge. I have updated the post with the new results, including all data needed to reproduce them, and `seqkit` now only loses out to `grep` + `tr` but is faster than all the others. So much for my years of experience, I have to eat my words now! – terdon May 31 '22 at 18:54
  • @M__ my initial numbers don't hold up when doing the same thing with a fasta file stored on a local SSD. So it looks like `seqkit` is slower than expected when reading data over a network but is faster than everything except the simple `grep` + `tr` when running on a local file. – terdon May 31 '22 at 18:55
  • @CorneliusRoemer if it is chunking it at source, it's not a bug. It would make sense. Does it parallelise? Chunking and parallelising would be a technique to speed up the operation and minimise RAM issues. Obviously it will fall over via a network. – M__ May 31 '22 at 22:08
  • Thanks @terdon, it's surely grep/ripgrep. If `seqkit` is chunking at source, that is really interesting (it's clearly not pulling the whole file in and then chunking). – M__ May 31 '22 at 22:41

Or a bit more verbose ...

perl -nle 'push @res, $_  while s/^>//; END {print  join "\n",@res}' myfile.fa
terdon
M__