ProbCons
ProbCons is an open-source program for probabilistic consistency-based multiple alignment of amino acid sequences. It is regarded as one of the most accurate protein multiple sequence alignment programs, having repeatedly demonstrated a statistically significant advantage in accuracy over similar tools, including Clustal and MAFFT.[1][2]
Algorithm
The following describes the basic outline of the ProbCons algorithm.[3]
Step 1: Reliability of an alignment edge
For every pair of sequences [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math], compute the probability that letters [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ y_j }[/math] are paired in [math]\displaystyle{ a^* }[/math], an alignment generated by the model.
[math]\displaystyle{ \begin{align} P(x_i \sim y_j|x,y) & \stackrel{def}{=} Pr[x_i \sim y_j \text{ in some } a|x,y] \\ & = \sum_{\text{alignment } a \text{ with } x_i \sim y_j} Pr[a|x,y]\\ & = \sum_{\text{alignment } a} \mathbf{1}\{x_i \sim y_j \in a\} Pr[a|x,y] \end{align} }[/math]
(Here [math]\displaystyle{ \mathbf{1}\{x_i \sim y_j \in a\} }[/math] is equal to 1 if [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ y_j }[/math] are aligned with each other in [math]\displaystyle{ a }[/math] and 0 otherwise.)
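The sum over all alignments can be evaluated with a forward/backward dynamic program rather than by enumeration. The following is a minimal sketch, not the ProbCons implementation: it assumes a toy one-state model in which Pr[a|x,y] is proportional to the exponential of an alignment score, and the function name `match_posteriors` and the parameters `match`, `mismatch` and `gap` are hypothetical illustration values.

```python
import math


def match_posteriors(x, y, match=2.0, mismatch=-1.0, gap=-1.5):
    """Posterior P(x_i ~ y_j | x, y) under a toy one-state alignment model.

    Assumption: Pr[a | x, y] is proportional to exp(score(a)), where the score
    adds `match` or `mismatch` per aligned pair and `gap` per unaligned letter.
    """
    n, m = len(x), len(y)
    w_gap = math.exp(gap)

    def w(i, j):
        # Weight of aligning x[i] with y[j] (0-based indices).
        return math.exp(match if x[i] == y[j] else mismatch)

    # fwd[i][j]: total weight of all alignments of x[:i] with y[:j].
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                fwd[i][j] += fwd[i - 1][j] * w_gap
            if j > 0:
                fwd[i][j] += fwd[i][j - 1] * w_gap
            if i > 0 and j > 0:
                fwd[i][j] += fwd[i - 1][j - 1] * w(i - 1, j - 1)

    # bwd[i][j]: total weight of all alignments of x[i:] with y[j:].
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n:
                bwd[i][j] += bwd[i + 1][j] * w_gap
            if j < m:
                bwd[i][j] += bwd[i][j + 1] * w_gap
            if i < n and j < m:
                bwd[i][j] += bwd[i + 1][j + 1] * w(i, j)

    total = fwd[n][m]  # summed weight of every alignment of x with y
    # Weight of all alignments that contain the pair (i, j), normalised.
    return [[fwd[i][j] * w(i, j) * bwd[i + 1][j + 1] / total
             for j in range(m)]
            for i in range(n)]


posteriors = match_posteriors("ACG", "AG")
```

Each entry `posteriors[i][j]` is the summed probability of all alignments that pair the two letters, so every row and column sums to at most 1.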
Step 2: Maximum expected accuracy
The accuracy of an alignment [math]\displaystyle{ a^* }[/math] with respect to another alignment [math]\displaystyle{ a }[/math] is defined as the number of common aligned pairs divided by the length of the shorter sequence.
Calculate the expected accuracy of a candidate alignment [math]\displaystyle{ a^* }[/math] for each pair of sequences:
[math]\displaystyle{ \begin{align} E_{Pr[a|x,y]}(acc(a^*,a)) & = \sum_{a}Pr[a|x,y] \, acc(a^*,a) \\ & = \frac{1}{\min(|x|,|y|)} \cdot \sum_{a} \sum_{x_i \sim y_j \in a^*} \mathbf{1}\{x_i \sim y_j \in a\} Pr[a|x,y]\\ & = \frac{1}{\min(|x|,|y|)} \cdot \sum_{x_i \sim y_j \in a^*} P(x_i \sim y_j|x,y) \end{align} }[/math]
This yields a maximum expected accuracy (MEA) alignment:
[math]\displaystyle{ E(x,y) = \arg\max_{a^*} \; E_{Pr[a|x,y]}(acc(a^*,a)) }[/math]
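Because the expected accuracy is a sum of per-pair posterior probabilities, the maximising alignment can be found with a Needleman-Wunsch-style recursion that has no gap penalty. The sketch below illustrates this under that assumption; the function name `mea_alignment` and the example matrix are hypothetical and not taken from the ProbCons source.

```python
def mea_alignment(post):
    """Return (pairs, expected_accuracy) for a posterior matrix.

    post[i][j] is assumed to hold P(x_i ~ y_j | x, y) with 0-based indices.
    """
    n = len(post)
    m = len(post[0]) if n else 0
    if n == 0 or m == 0:
        return [], 0.0

    # best[i][j]: largest summed posterior for aligning x[:i] with y[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + post[i - 1][j - 1],
                             best[i - 1][j],
                             best[i][j - 1])

    # Trace back, preferring gaps on ties so only useful pairs are kept.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return pairs, best[n][m] / min(n, m)


example = [[0.9, 0.05],
           [0.1, 0.1],
           [0.0, 0.8]]
pairs, accuracy = mea_alignment(example)
```

For the example matrix, the recursion pairs the first letters and the last letters of the two sequences and leaves the middle letter of the longer sequence unaligned.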
Step 3: Probabilistic consistency transformation
The match probabilities for all pairs of sequences x, y from the set of all sequences [math]\displaystyle{ \mathcal{S} }[/math] are now re-estimated using every intermediate sequence z:
[math]\displaystyle{ P'(x_i \sim y_j|x,y) = \frac{1}{|\mathcal{S}|} \sum_{z \in \mathcal{S}} \sum_{1 \leq k \leq |z|} P(x_i \sim z_k|x,z) \cdot P(z_k \sim y_j|z,y) }[/math]
This step can be iterated.
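In matrix terms, the inner sum over positions of z is a product of two posterior matrices, averaged over all choices of z. The sketch below illustrates one pass of the transformation in plain Python. It assumes that the cases z = x and z = y are handled with an identity matrix (each letter matched to itself with probability 1); the function name `consistency_transform` and the dictionary layout are hypothetical.

```python
def consistency_transform(post, lengths):
    """One pass of the consistency transformation.

    post[(a, b)] is assumed to be the |a|-by-|b| matrix P(a_i ~ b_j | a, b)
    for every ordered pair of distinct sequence names; lengths[a] is |a|.
    """
    names = list(lengths)

    def matrix(a, b):
        if a == b:
            # Assumption: a sequence aligns to itself position by position.
            size = lengths[a]
            return [[1.0 if i == j else 0.0 for j in range(size)]
                    for i in range(size)]
        return post[(a, b)]

    updated = {}
    for x in names:
        for y in names:
            if x == y:
                continue
            acc = [[0.0] * lengths[y] for _ in range(lengths[x])]
            for z in names:
                p_xz, p_zy = matrix(x, z), matrix(z, y)
                for i in range(lengths[x]):
                    for k in range(lengths[z]):
                        if p_xz[i][k] == 0.0:
                            continue  # skip terms that contribute nothing
                        for j in range(lengths[y]):
                            acc[i][j] += p_xz[i][k] * p_zy[k][j]
            # Average over all intermediate sequences, including x and y.
            updated[(x, y)] = [[v / len(names) for v in row] for row in acc]
    return updated


# Toy input: three sequences of length 1 with weak x~y but strong x~z, z~y.
values = {("x", "y"): 0.2, ("x", "z"): 0.9, ("y", "z"): 0.9}
post = {}
for (a, b), v in values.items():
    post[(a, b)] = [[v]]
    post[(b, a)] = [[v]]
new_post = consistency_transform(post, {"x": 1, "y": 1, "z": 1})
```

In the toy input, the weak direct evidence for pairing x with y is raised by the strong evidence that both align to the same letter of z. Iterating the step amounts to calling the function again on its own output.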
Step 4: Computation of guide tree
Construct a guide tree by hierarchical clustering, using the MEA score as the sequence similarity score. Cluster similarity is defined as a weighted average of the pairwise sequence similarities.
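A greedy agglomerative clustering of this kind can be sketched as follows. This is an illustration only: it assumes a cluster-size-weighted average (as in UPGMA) for the similarity between a merged cluster and the rest, which may differ from the exact weighting used by ProbCons, and the function name `guide_tree` and the input layout are hypothetical.

```python
def guide_tree(names, sim):
    """Build a guide tree as nested 2-tuples of sequence names.

    sim[frozenset((a, b))] is assumed to hold the pairwise similarity
    (for example an MEA score) for every pair of distinct names.
    """
    # Each cluster id maps to (subtree, number of sequences in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    score = {frozenset((i, j)): sim[frozenset((names[i], names[j]))]
             for i in clusters for j in clusters if i < j}
    next_id = len(names)

    while len(clusters) > 1:
        pair = max(score, key=score.get)  # most similar pair of clusters
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del score[pair]
        for c in clusters:
            s_a = score.pop(frozenset((a, c)))
            s_b = score.pop(frozenset((b, c)))
            # Size-weighted average of the two merged clusters' similarities.
            score[frozenset((next_id, c))] = (
                (size_a * s_a + size_b * s_b) / (size_a + size_b))
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree


similarities = {frozenset(("a", "b")): 0.9,
                frozenset(("a", "c")): 0.3,
                frozenset(("b", "c")): 0.5}
tree = guide_tree(["a", "b", "c"], similarities)
```

In the example, the two most similar sequences are merged first, and the remaining sequence joins them at the root.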
Step 5: Compute MSA
Finally, compute the multiple sequence alignment (MSA) using progressive alignment or iterative alignment.
References
- "PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment". Genome Research 15 (2): 330–340. 2005. doi:10.1101/gr.2821705. PMID 15687296.
- Roshan, Usman (2014-01-01). "Multiple Sequence Alignment Using Probcons and Probalign". In Russell, David J. Multiple Sequence Alignment Methods. Methods in Molecular Biology. 1079. Humana Press. pp. 147–153. doi:10.1007/978-1-62703-646-7_9. ISBN 9781627036450.
- Lecture "Bioinformatics II" at University of Freiburg
Original source: https://en.wikipedia.org/wiki/ProbCons.