14

Is there any resource (paper, blogpost, Github gist, etc.) describing the BWA-MEM algorithm for assigning mapping qualities? I vaguely remember that I have somewhere seen a formula for SE reads, which looked like

$C * (s_1 - s_2) / s_1,$

where $s_1$ and $s_2$ denoted the alignment scores of two best alignments and C was some constant.

I believe that a reimplementation of this algorithm in some scripting language could be very useful for the bioinfo community. For instance, I sometimes test various mapping methods and some of them tend to find good alignments, but fail in assigning appropriate qualities. Therefore, I would like to re-assign all the mapping qualities in a SAM file with the BWA-MEM algorithm.

Btw. This algorithm must already have been implemented outside BWA, see the BWA-MEM paper:

GEM does not compute mapping quality. Its mapping quality is estimated with a BWA-like algorithm with suboptimal alignments available.

Unfortunately, the BWA-MEM paper repo contains only the resulting .eval files.

Update: The question is not about the algorithm for computing alignment scores. Mapping qualities and alignment scores are two different things:

  • Alignment score quantifies the similarity between two sequences (e.g., a read and a reference sequence)
  • Mapping quality (MAQ) quantifies the probability that a read is aligned to a wrong position.

Even alignments with high scores can have a very low mapping quality.

terdon
  • 8,869
  • 3
  • 16
  • 44
Karel Břinda
  • 1,889
  • 8
  • 18
  • 2
    Unfortunately I don’t know the answer for BWA-MEM (since it differs from BWA!) but pretty much all other tools are described here: https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/ – Konrad Rudolph May 31 '17 at 13:06
  • Maybe this page can help http://genome.sph.umich.edu/wiki/Mapping_Quality_Scores . Once you know the best and alternative positions where a read can align (or even the best and second best only?) it's not too difficult to implement I guess. – dariober Jun 01 '17 at 13:44

1 Answers1

3

Yes, there bwa-mem was published as a preprint

BWA-MEM’s seed extension differs from the standard seed extension in two aspects. Firstly, suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y.

...

Secondly, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence

And there is a description of the scoring algorithm directly in the source code of bwa-mem (lines 22 - 44), but maybe the only solution is really to go though the source code.

Kamil S Jaron
  • 5,437
  • 1
  • 22
  • 57
  • Thank you for your answer. However, the question is more about assigning mapping qualities. Even reads with a very high alignment score can have a mapping quality equal to zero. – Karel Břinda May 31 '17 at 00:47
  • 1
    Have you checked the source code? line 22 - 44. – Kamil S Jaron May 31 '17 at 05:33
  • 5
    @KamilSJaron Wow, that’s terribly hard to understand. The [actual code](https://github.com/lh3/bwa/blob/master/bwamem.c#L945-L969) unfortunately isn’t better. :-( At any rate, could you update your answer to include this more prominently? – Konrad Rudolph May 31 '17 at 13:01
  • @ KamilSJaron I did and it is still not completely clear for me even in the easier case of single-end reads. – Karel Břinda May 31 '17 at 13:15
  • Well I had no intention to actually explain the score (since I really do not know and the question explicitly asked about resources). I just knew about the preprint and I also got the idea to look at the source code, where I found those 22 lines of maths that seemed to explain the scoring. – Kamil S Jaron May 31 '17 at 19:44