Probalign
Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.[1] Base pair probabilities are estimated using an estimate similar to Boltzmann distribution. The partition function is calculated using a dynamic programming approach.
Algorithm
The following describes the algorithm used by probalign to determine the base pair probabilities.[2]
Alignment score
To score an alignment of two sequences two things are needed:
- a similarity function [math]\displaystyle{ \sigma(x,y) }[/math] (e.g. PAM, BLOSUM,...)
- affine gap penalty: [math]\displaystyle{ g(k) = \alpha + \beta k }[/math]
The score [math]\displaystyle{ S(a) }[/math] of an alignment a is defined as:
[math]\displaystyle{ S(a) = \sum_{x_i-y_j \in a} \sigma(x_i,y_j) + \text{gap cost} }[/math]
Now the boltzmann weighted score of an alignment a is:
[math]\displaystyle{ e^{\frac{S(a)}{T}} = e^{\frac{\sum_{x_i-y_j \in a} \sigma(x_i,y_j) + \text{gap cost}}{T}} = \left( \prod_{x_i - y_i \in a} e^{\frac{\sigma(x_i,y_j)}{T}} \right) \cdot e^{\frac{gapcost}{T}} }[/math]
Where [math]\displaystyle{ T }[/math] is a scaling factor.
The probability of an alignment assuming boltzmann distribution is given by
[math]\displaystyle{ Pr[a|x,y] = \frac{e^{\frac{S(a)}{T}}}{Z} }[/math]
Where [math]\displaystyle{ Z }[/math] is the partition function, i.e. the sum of the boltzmann weights of all alignments.
Dynamic programming
Let [math]\displaystyle{ Z_{i,j} }[/math] denote the partition function of the prefixes [math]\displaystyle{ x_0,x_1,...,x_i }[/math] and [math]\displaystyle{ y_0,y_1,...,y_j }[/math]. Three different cases are considered:
- [math]\displaystyle{ Z^{M}_{i,j}: }[/math] the partition function of all alignments of the two prefixes that end in a match.
- [math]\displaystyle{ Z^{I}_{i,j}: }[/math] the partition function of all alignments of the two prefixes that end in an insertion [math]\displaystyle{ (-,y_j) }[/math].
- [math]\displaystyle{ Z^{D}_{i,j}: }[/math] the partition function of all alignments of the two prefixes that end in a deletion [math]\displaystyle{ (x_i,-) }[/math].
Then we have: [math]\displaystyle{ Z_{i,j} = Z^{M}_{i,j} + Z^{D}_{i,j} + Z^{I}_{i,j} }[/math]
Initialization
The matrixes are initialized as follows:
- [math]\displaystyle{ Z^{M}_{0,j} = Z^{M}_{i,0} = 0 }[/math]
- [math]\displaystyle{ Z^{M}_{0,0} = 1 }[/math]
- [math]\displaystyle{ Z^{D}_{0,j} = 0 }[/math]
- [math]\displaystyle{ Z^{I}_{i,0} = 0 }[/math]
Recursion
The partition function for the alignments of two sequences [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] is given by [math]\displaystyle{ Z_{|x|,|y|} }[/math], which can be recursively computed:
- [math]\displaystyle{ Z^{M}_{i,j} = Z_{i-1,j-1} \cdot e^{\frac{\sigma(x_i,y_j)}{T}} }[/math]
- [math]\displaystyle{ Z^{D}_{i,j} = Z^{D}_{i-1,j} \cdot e^{\frac{\beta}{T}} + Z^{M}_{i-1,j} \cdot e^{\frac{g(1)}{T}} + Z^{I}_{i-1,j} \cdot e^{\frac{g(1)}{T}} }[/math]
- [math]\displaystyle{ Z^{I}_{i,j} }[/math] analogously
Base pair probability
Finally the probability that positions [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ y_j }[/math] form a base pair is given by:
[math]\displaystyle{ P(x_i - y_j|x,y) = \frac{Z_{i-1,j-1} \cdot e^{\frac{\sigma(x_i,y_j)}{T}} \cdot Z'_{i',j'}}{Z_{|x|,|y|}} }[/math]
[math]\displaystyle{ Z', i', j' }[/math] are the respective values for the recalculated [math]\displaystyle{ Z }[/math] with inversed base pair strings.
See also
- ProbCons
- Multiple Sequence Alignment
References
- ↑ U. Roshan and D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, 22(22):2715-21, 2006 (PDF)
- ↑ Lecture "Bioinformatics II" at University of Freiburg
External links
Original source: https://en.wikipedia.org/wiki/Probalign.
Read more |