2

I have many amino acids sequences in a fasta format that I did multiple sequence alignment. I was trying to plot something like binary a code as heatmap. If it had a change in the AA compared to the reference sequence it would be red, if it did not change would be yellow, for example.

I came across to msaplot from ggtree package. I also checked ggmsa package for that. But so far, I did not get what I wanted. So basically I wanted to:

  1. change the multiple sequence alignment to a binary matrix (if the amino differs from the reference sequence, plot x, if not y)

  2. plot as a heatmap with two colors based if the amino acid changed or not

here's some data example of what I doing, just instal ggmsa and you can try this:

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
ggmsa(protein_sequences, start = 265, end = 300, color = "Chemistry_AA",font = NULL)
rodrigarco
  • 71
  • 7

1 Answers1

5

@StupidWolf answered me on stackoverflow and he's answer solved my question. I will share it here as well.

@StupidWolf:

We read in the alignment:

library(Biostrings)
library(ggmsa)
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
aln = readAAMultipleAlignment(protein_sequences)
ggmsa(protein_sequences, start = 265, end = 300) 

enter image description here Set the reference as the 1st sequence, some Rattus, you can also use the consensus with consensusString() :

aln = unmasked(aln)
names(aln)[1]
[1] "PH4H_Rattus_norvegicus"

ref = aln[1]

Here we iterate through the sequence and make the binary for where the sequences are the same as the reference:

bm = sapply(1:length(aln),function(i){
as.numeric(as.matrix(aln[i])==as.matrix(ref))
})

bm = t(bm)
rownames(bm) = names(aln)

The plot you see above has sequences reversed, so to visualize the same thing we reverse it, and also subset on 265 - 300:

library(pheatmap)
pheatmap(bm[nrow(bm):1,265:300],cluster_rows=FALSE,cluster_cols=FALSE)

enter image description here The last row, is Rattus, the reference, and anything similar to that is read, as you can see in the alignment above, last 4 sequences are identical.

rodrigarco
  • 71
  • 7