ProbCons
{{short description|Protein multiple-sequence alignment program}}
In bioinformatics and proteomics, ProbCons is open-source software for probabilistic consistency-based multiple alignment of amino acid sequences. It is considered one of the most accurate protein multiple sequence alignment programs, since it has repeatedly demonstrated a statistically significant advantage in accuracy over similar tools, including Clustal and MAFFT.{{cite journal |doi=10.1101/gr.2821705 |vauthors=Do CB, Mahabhashyam MS, Brudno M, Batzoglou S |year=2005 |title=PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment |journal=Genome Research |volume=15 |issue=2 |pages=330–340 |pmid=15687296 |pmc=546535}}{{Cite book|title=Multiple Sequence Alignment Methods|volume = 1079|last=Roshan|first=Usman|date=2014-01-01|publisher=Humana Press|isbn=9781627036450|editor-last=Russell|editor-first=David J|series=Methods in Molecular Biology|pages=147–153|language=English|doi=10.1007/978-1-62703-646-7_9|pmid = 24170400|chapter = Multiple Sequence Alignment Using Probcons and Probalign}}
Algorithm
The following describes the basic outline of the ProbCons algorithm.[http://www.bioinf.uni-freiburg.de//Lehre/Courses/2011_WS/V_BioinfoII/slides_probcons.pdf Lecture "Bioinformatics II" at University of Freiburg]
=Step 1: Reliability of an alignment edge=
For every pair of sequences x, y, compute the probability that letters x_i and y_j are paired in an alignment a that is generated by the model.
\begin{align}
P(x_i \sim y_j|x,y) \ \overset{\underset{\mathrm{def}}{}}{=}& \ \Pr[x_i \sim y_j \text{ in some } a|x,y] \\[8pt]
=& \ \sum_{\text{alignment } a \atop {\text{with }x_i \sim y_j}} \Pr[a|x,y] \\[2pt]
=& \ \sum_{\text{alignment } a} \mathbf{1}\{x_i \sim y_j \in a\} \Pr[a|x,y]
\end{align}
(Here \mathbf{1}\{x_i \sim y_j \in a\} is equal to 1 if x_i and y_j are paired in the alignment a and 0 otherwise.)
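The sum over all alignments does not have to be enumerated explicitly; it can be computed with a forward-backward style dynamic program. The following Python sketch illustrates the idea under a loud simplification: it assumes a toy single-state model in which the weight of an alignment is the product of its column weights, whereas ProbCons itself uses a pair hidden Markov model with trained parameters. The function name `posterior_matrix` and the weights `match`, `mismatch` and `gap` are hypothetical and chosen only for illustration.

```python
def posterior_matrix(x, y, match=4.0, mismatch=1.0, gap=0.5):
    """Return P with P[i][j] = posterior probability that x[i] pairs with y[j].

    Assumption: a toy single-state model where an alignment's weight is the
    product of its column weights. This is NOT the ProbCons pair-HMM.
    """
    n, m = len(x), len(y)

    def pair_weight(a, b):
        return match if a == b else mismatch

    # forward[i][j]: summed weight of all alignments of x[:i] with y[:j]
    forward = [[0.0] * (m + 1) for _ in range(n + 1)]
    forward[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += forward[i - 1][j - 1] * pair_weight(x[i - 1], y[j - 1])
            if i > 0:
                total += forward[i - 1][j] * gap
            if j > 0:
                total += forward[i][j - 1] * gap
            forward[i][j] = total

    # backward[i][j]: summed weight of all alignments of x[i:] with y[j:]
    backward = [[0.0] * (m + 1) for _ in range(n + 1)]
    backward[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:
                total += pair_weight(x[i], y[j]) * backward[i + 1][j + 1]
            if i < n:
                total += gap * backward[i + 1][j]
            if j < m:
                total += gap * backward[i][j + 1]
            backward[i][j] = total

    z = forward[n][m]  # total weight of all alignments (normalising constant)
    # Weight of all alignments that contain the pair (i, j), divided by z.
    return [
        [forward[i][j] * pair_weight(x[i], y[j]) * backward[i + 1][j + 1] / z
         for j in range(m)]
        for i in range(n)
    ]
```

Because each alignment pairs a letter with at most one partner, every row and every column of the returned matrix sums to at most 1.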
=Step 2: Maximum expected accuracy=
The accuracy of an alignment a^* with respect to another alignment a is defined as the number of common aligned pairs divided by the length of the shorter sequence.
The expected accuracy of a candidate alignment a^* under the distribution \Pr[a|x,y] is:
\begin{align}
E_{\Pr[a|x,y]}(\operatorname{acc}(a^*,a)) & = \sum_{a}\Pr[a|x,y] \operatorname{acc}(a^*,a) \\
& = \frac{1}{\min(|x|,|y|)} \cdot \sum_{a} \sum_{x_i \sim y_j \in a^*} \mathbf{1}\{x_i \sim y_j \in a\} \Pr[a|x,y]\\
& = \frac{1}{\min(|x|,|y|)} \cdot \sum_{x_i \sim y_j \in a^*} P(x_i \sim y_j|x,y)
\end{align}
This yields a maximum expected accuracy (MEA) alignment:
E(x,y) = \arg\max_{a^*} \; E_{\Pr[a|x,y]}(\operatorname{acc}(a^*,a))
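Since the expected accuracy is a sum of pairing probabilities over the aligned pairs, the maximisation can be carried out by a Needleman-Wunsch style dynamic program that uses the pairing probabilities as match scores and no gap penalty. The Python sketch below is a minimal illustration under that reading; the function name `mea_alignment` and the input layout (a list of rows, with `P[i][j]` the probability that x_i pairs with y_j) are assumptions made for this example.

```python
def mea_alignment(P):
    """Return (pairs, expected_accuracy) for a posterior matrix P.

    pairs is a list of 0-based (i, j) index pairs in increasing order.
    Assumption: P[i][j] is the pairing probability of x[i] and y[j].
    """
    n = len(P)
    m = len(P[0]) if n else 0
    if n == 0 or m == 0:
        return [], 0.0

    # best[i][j]: largest sum of pairing probabilities over alignments
    # of the first i letters of x with the first j letters of y.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + P[i - 1][j - 1],  # pair x[i-1] with y[j-1]
                best[i - 1][j],                        # leave x[i-1] unpaired
                best[i][j - 1],                        # leave y[j-1] unpaired
            )

    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()

    # Normalise by the length of the shorter sequence, as in acc(a*, a).
    return pairs, best[n][m] / min(n, m)
```

The second return value is the expected accuracy of the recovered alignment, which can serve as a similarity score between the two sequences.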
=Step 3: Probabilistic Consistency Transformation=
For every pair of sequences x,y from the set of all sequences \mathcal{S}, the pairing probabilities are re-estimated using all intermediate sequences z:
P'(x_i \sim y_j|x,y) = \frac{1}{|\mathcal{S}|} \sum_{z \in \mathcal{S}} \sum_{1 \le k \le |z|} P(x_i \sim z_k|x,z) \cdot P(z_k \sim y_j|z,y)
This step can be iterated.
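The re-estimation amounts to averaging, over all intermediate sequences z, the product of the x-to-z and z-to-y probability matrices. The Python sketch below illustrates one such pass. The function name `consistency_transform` and the dictionary layout are hypothetical, and the sketch assumes that a sequence compared with itself pairs each letter with itself with probability 1, so that the terms z = x and z = y contribute the original matrix.

```python
def consistency_transform(P, lengths):
    """Apply one pass of the consistency transformation.

    P[(a, b)] is a lengths[a] x lengths[b] matrix of pairing probabilities
    for every ordered pair of distinct sequence indices a != b.
    Assumption: P[(a, a)] is taken to be the identity matrix.
    """
    count = len(lengths)

    def matrix(a, b):
        if a == b:
            size = lengths[a]
            return [[1.0 if i == k else 0.0 for k in range(size)]
                    for i in range(size)]
        return P[(a, b)]

    transformed = {}
    for x in range(count):
        for y in range(count):
            if x == y:
                continue
            acc = [[0.0] * lengths[y] for _ in range(lengths[x])]
            for z in range(count):
                xz = matrix(x, z)
                zy = matrix(z, y)
                # Accumulate the matrix product xz * zy into acc.
                for i in range(lengths[x]):
                    for k in range(lengths[z]):
                        if xz[i][k] == 0.0:
                            continue
                        for j in range(lengths[y]):
                            acc[i][j] += xz[i][k] * zy[k][j]
            # Average over all intermediate sequences, including x and y.
            transformed[(x, y)] = [[v / count for v in row] for row in acc]
    return transformed
```

Iterating the step corresponds to feeding the returned dictionary back into the function.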
=Step 4: Computation of guide tree=
Construct a guide tree by hierarchical clustering, using the MEA score as the sequence similarity score. The similarity between clusters is defined as a weighted average of the pairwise sequence similarities.
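A greedy agglomerative clustering of this kind can be sketched in Python as follows. The exact weighting used by ProbCons is not specified here, so the sketch assumes a cluster-size-weighted average (as in UPGMA); the function name `build_guide_tree` and the nested-tuple tree representation are likewise hypothetical.

```python
def build_guide_tree(sim):
    """Build a guide tree from a symmetric similarity matrix.

    Leaves are sequence indices; internal nodes are (left, right) tuples.
    Assumption: cluster similarity is the size-weighted average of the
    similarities of the merged clusters (UPGMA-style), an illustrative choice.
    """
    n = len(sim)
    clusters = {i: (i, 1) for i in range(n)}  # id -> (subtree, size)
    # Similarities between current clusters, keyed by (smaller id, larger id).
    scores = {(i, j): sim[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n

    while len(clusters) > 1:
        a, b = max(scores, key=scores.get)  # most similar pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del scores[(a, b)]
        for c in clusters:
            s_ac = scores.pop((min(a, c), max(a, c)))
            s_bc = scores.pop((min(b, c), max(b, c)))
            # New ids are always larger than existing ones, so c < next_id.
            scores[(c, next_id)] = (size_a * s_ac + size_b * s_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1

    return next(iter(clusters.values()))[0]
```

For example, if sequences 0 and 1 are the most similar pair among three sequences, they are merged first and the result is then joined with sequence 2.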
=Step 5: Compute MSA=
Finally, compute the multiple sequence alignment (MSA) using progressive alignment along the guide tree or iterative alignment.
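In a progressive scheme, the guide tree determines the order in which sequences or groups of sequences are aligned: the children of a node are aligned before the node itself. The Python sketch below only derives that merge order and does not perform the group-to-group alignment. It assumes a guide tree given as nested tuples with integer leaves; the function names `leaves` and `merge_order` are hypothetical.

```python
def leaves(tree):
    """Return the sequence indices below a guide-tree node, left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def merge_order(tree):
    """Return the progressive merge steps as (left_group, right_group) pairs.

    Steps are listed in post-order, so both groups of a step have already
    been aligned internally by earlier steps.
    """
    if not isinstance(tree, tuple):
        return []  # a single sequence needs no alignment step
    left, right = tree
    return merge_order(left) + merge_order(right) + [(leaves(left), leaves(right))]
```

Each returned step names the two groups that a group-to-group alignment routine would combine at that point.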
References
{{Reflist}}
External links
- {{Official website|http://probcons.stanford.edu/}}