pigz | awk | wc is the fastest method
First off for benchmarks with FASTQ it's best to use a specific real-world example with a known answer. I've chosen this file:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG01815/sequence_read/ERR047740_1.filt.fastq.gz
as my test file, the correct answers being:
Number of reads: 67051220
Number of bases in reads: 6034609800
Next we want to find the fastest way possible to count these, all timings are the average wall-clock time (real) of 10 runs collected with the bash time on an otherwise unloaded system:
zgrep
zgrep . ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
This is the slowest method with an average run-time of 125.35 seconds
gzip awk
Using gzip we gain about another 10 seconds:
gzip -dc ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
Average run-time is 116.69 seconds
Konrad's gzip awk wc variant
fix_base_count() {
local counts=($(cat))
echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}
gzip -dc ERR047740_1.filt.fastq.gz \
| awk 'NR % 4 == 2' \
| wc -cl \
| fix_base_count
This runs slower on this test file than the gzip awk variant of the solution, average run-time is 122.28 seconds.
kseq_test using latest kseq.h from klib
Code compiled with: gcc -O2 -o kseq_test kseq_test.c -lz where kseq_test.c is Simon's adaptation of Heng Li's FASTQ parser.
kseq_test ERR047740_1.filt.fastq.gz
Average run-time is 99.14 seconds, which is better than the gzip core utilities based solution so far, but we can do better!
piz awk
Using Mark Adler's pigz as a drop-in replacement for gzip, note that pigz gives us a speed gain as on top of gzip as in addition to the main deflate thread it uses another 3 threads for reading, writing and checksum calculations, see the man page for details.
pigz -dc ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
Average run-time is now 93.86 seconds, this is ~5 seconds faster than the kseq based C code but we can further improve the benchmark.
pigz awk wc
Next we use pigz as a drop in replacment for Konrad's wc variant of the awk based solution.
fix_base_count() {
local counts=($(cat))
echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}
gzip -dc ERR047740_1.filt.fastq.gz \
| awk 'NR % 4 == 2' \
| wc -cl \
| fix_base_count
Average run-time is now down to 83.03 seconds, this is ~16 seconds faster than the kseq based solution and ~42 seconds faster than the OPs zgrep based solution.
Next as a baseline lets see just how much of this run-time is due to decompression of the input fastq.gz file.
gzip alone
gzip -dc ERR047740_1.filt.fastq.gz > /dev/null
Average run-time: 105.95 seconds, so the gzip based solutions (which also includes zcat and zgrep as these are provided by gzip) are never going to be faster than kseq_test.
pigz alone
pigz -dc ERR047740_1.filt.fastq.gz > /dev/null
Average run-time: 77.66 seconds, so quite clearly the additional three threads for read, write and checksum calculation offer a useful advantage. What's more this speed-up is greater when leveraging the awk | wc based solution, it's not clear why, but I expect this is due to the extra write thread.
Interestingly average CPU usage across all threads is quite revealing for the various answers, I've collated these stats using GNU time /usr/bin/time --verbose
zgrep based solution 133% - must be more than one thread somehow
gzip | awk based solution 99% - all gzip based solutions run single-threaded at 99% CPU usage
pigz | awk 147%
gzip | awk | wc 99% as with gzip
pgiz | awk | wc 155%
kseq_test 99%
gzip > dev/null 99%
pigz > dev/null 155%
Whilst the main deflate thread in pigz will run at 100% CPU load the extra 3 don't quite fully occupy additional cores to 100% (as is evidenced by average CPU usage of ~150%) they do however clearly result in reduced run-time.
I'm using Ubuntu 16.04.2 LTS**, my gzip, zcat, zgrep versions are all gzip 1.6 and pigz is version 2.3.1. gcc is version 5.4.0
** I think my patch level is actually 16.04.4 but I've not rebooted for 170 days :p