Viterbi semiring
The Viterbi semiring is a commutative semiring defined over the set of probabilities (typically the unit interval [0, 1]) with the maximum (max) as the addition operation and the usual real multiplication as the multiplication. Formally, it can be denoted as the 5-tuple ([0, 1], max, ×, 0, 1), where:[1][2]
- Carrier set: [0, 1], the set of probability values from 0 to 1 (inclusive).
- Additive operation (⊕): defined as the maximum of two elements. For any a, b ∈ [0, 1], a ⊕ b = max(a, b). This operation is idempotent, since max(a, a) = a (taking the max of an element with itself yields the same element). The additive identity is 0, because max(a, 0) = a for any a ∈ [0, 1].
- Multiplicative operation (⊗): defined as the standard product of real numbers. For a, b ∈ [0, 1], a ⊗ b = a · b. The multiplicative identity is 1, since a · 1 = a for any a. The additive identity 0 serves as the multiplicative zero (absorbing element) as well: a · 0 = 0.
This structure satisfies all semiring axioms. Addition (max) is associative, commutative, and has identity 0; multiplication is associative (and commutative in this case, since real multiplication is commutative) with identity 1; and multiplication distributes over addition (for example, a · max(b, c) = max(a · b, a · c)). Importantly, the max operation makes the semiring additively idempotent (a ⊕ a = a), imparting a natural partial order: a ≤ b iff max(a, b) = b. In this semiring, multiplying two values yields a value that is no greater than either factor, ensuring a · b ≤ min(a, b) for a, b ∈ [0, 1] (this property is sometimes called multiplicative subidempotence in the literature).[3]
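The two operations and the axioms above can be written down directly; the following is a minimal Python sketch (the names `oplus` and `otimes` are illustrative, not from any library) that checks the stated properties on a few values:

```python
# Viterbi semiring on [0, 1]: "addition" is max, "multiplication" is the
# ordinary product, with identities 0 (for max) and 1 (for the product).
def oplus(a, b):
    return max(a, b)

def otimes(a, b):
    return a * b

ZERO, ONE = 0.0, 1.0

a, b, c = 0.3, 0.5, 0.8
assert oplus(a, a) == a                   # additive idempotence
assert oplus(a, ZERO) == a                # 0 is the additive identity
assert otimes(a, ONE) == a                # 1 is the multiplicative identity
assert otimes(a, ZERO) == ZERO            # 0 is absorbing
# distributivity of the product over max:
assert otimes(a, oplus(b, c)) == oplus(otimes(a, b), otimes(a, c))
# the product never exceeds either factor (multiplicative subidempotence):
assert otimes(a, b) <= min(a, b)
```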
Because max behaves like a "logical OR" over weighted probabilities and multiplication behaves like "AND" (combining independent probabilities), the Viterbi semiring is also known as the "max-times" semiring.[2] It is closely related to the tropical semiring used in optimization: in fact, it is isomorphic to a tropical semiring via a logarithmic transformation.[4] For example, mapping each probability p to the log-cost −log p turns maximizing a probability into minimizing a cost, and products of probabilities into sums of log-costs.[4] This means algorithms formulated in the Viterbi semiring have equivalents in the min-plus (tropical) semiring commonly used for shortest path and other optimization problems.
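The isomorphism is easy to verify numerically; a small sketch (the probabilities are arbitrary illustrative values):

```python
import math

# Map a probability p to the cost -log(p).  Under this map, max over
# probabilities becomes min over costs, and products become sums.
cost = lambda p: -math.log(p)

p, q = 0.4, 0.9
assert math.isclose(cost(max(p, q)), min(cost(p), cost(q)))
assert math.isclose(cost(p * q), cost(p) + cost(q))
```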
Dynamic programming and the Viterbi algorithm
The Viterbi semiring provides the algebraic framework underlying a class of dynamic programming (DP) algorithms that compute optimal paths or optimal solutions by combining local decisions. In a DP context, the semiring's additive operation (max) is used to select the best (highest-weight) outcome among alternatives, and the multiplicative operation (×) is used to accumulate weights (probabilities or scores) along a path. The classic Viterbi algorithm – which finds the most probable sequence of states in a probabilistic model – can be understood as a DP computation over this semiring.[5] At each step, the algorithm applies the distributive law of the semiring to efficiently combine sub-solutions: rather than enumerating all possible state sequences, it uses the distributivity of multiplication over max to propagate only the best partial probabilities forward.[2] This is analogous to how other DP algorithms (like shortest path algorithms) operate over different semirings (e.g. the Floyd–Warshall algorithm uses the (min, +) semiring for path costs).[2]
In practical terms, any DP recurrence that involves taking a maximum over weighted combinations can be seen as an instance of computing over the Viterbi semiring. For example, consider a recurrence relation such as:
V_t(s) = max over s′ of [ V_{t−1}(s′) · w(s′, s) ]
This is exactly an expression in the Viterbi semiring – the max over s′ corresponds to the semiring addition (choosing the best predecessor state s′) and the multiplication corresponds to semiring multiplication (concatenating the weight/probability of transitioning from state s′ to s). The Viterbi algorithm for sequence models uses this template: it iteratively computes the best path probability to each state s at time t by taking the max over all previous-state probabilities times the transition probability into the current state.[6] Because the addition is idempotent (max), once a better path to a given state is found, worse alternatives are discarded, which is key to the efficiency of the algorithm.
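The recurrence can be checked against brute-force enumeration; a small sketch (the 3-state transition weights are invented for illustration) showing that the semiring DP agrees with an exhaustive max over all paths, despite inspecting exponentially fewer combinations:

```python
from itertools import product

# Hypothetical transition weights w[s_prev][s] on 3 states (invented numbers).
w = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.4, 0.1, 0.5]]
start = [0.5, 0.3, 0.2]   # initial weight of each state
T = 4                     # number of transition steps

# DP in the Viterbi semiring: V[s] = max over s' of V_prev[s'] * w[s'][s]
V = start[:]
for _ in range(T):
    V = [max(V[sp] * w[sp][s] for sp in range(3)) for s in range(3)]

# Brute force: explicitly maximize over all 3 * 3**T start/path combinations.
best = [0.0, 0.0, 0.0]
for s0 in range(3):
    for path in product(range(3), repeat=T):
        p, prev = start[s0], s0
        for s in path:
            p, prev = p * w[prev][s], s
        best[path[-1]] = max(best[path[-1]], p)

assert all(abs(V[s] - best[s]) < 1e-12 for s in range(3))
```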
Notably, this semiring perspective reveals that the forward-sum algorithm and the Viterbi-max algorithm are parallel concepts: they differ only in the choice of semiring. The forward (or inside) algorithm that sums over all paths uses the sum-product semiring (with ordinary + as its addition), whereas the Viterbi algorithm uses the max-product semiring (the Viterbi semiring, ([0, 1], max, ×)).[7][5] In fact, many parsing and inference algorithms can be generalized such that swapping out the semiring from sum to max turns a marginal probability computation into a maximum probability computation (this insight is sometimes called the semiring parsing framework).[5] The distributive property of semirings is what enables these DP algorithms to aggregate a potentially exponential number of possibilities efficiently.
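This sum-vs-max parallelism can be made concrete by parameterizing a single DP routine over the semiring operations; a sketch (the function name `chain_dp` and the 2-state stochastic matrix are invented for illustration):

```python
def chain_dp(start, w, steps, add, mul):
    """Fold a chain of `steps` transitions in the semiring given by (add, mul)."""
    V = list(start)
    n = len(start)
    for _ in range(steps):
        V = [add([mul(V[sp], w[sp][s]) for sp in range(n)]) for s in range(n)]
    return V

# Invented 2-state stochastic transition matrix.
w = [[0.7, 0.3],
     [0.4, 0.6]]
start = [1.0, 0.0]

forward = chain_dp(start, w, 3, add=sum, mul=lambda a, b: a * b)  # forward algorithm
viterbi = chain_dp(start, w, 3, add=max, mul=lambda a, b: a * b)  # Viterbi algorithm

# Forward sums probability mass over all paths (rows of w are stochastic,
# so the values form a distribution); Viterbi keeps only the best path
# into each state, so its values can never exceed the forward values.
assert abs(sum(forward) - 1.0) < 1e-12
assert all(v <= f + 1e-12 for v, f in zip(viterbi, forward))
```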
Role in hidden Markov models (HMMs)
Hidden Markov Models provide a quintessential use case for the Viterbi semiring. In an HMM, we have a sequence of observed outputs and a sequence of hidden states with transition probabilities and emission probabilities. The decoding problem for HMMs asks: what is the most likely sequence of hidden states that could produce a given observation sequence? The Viterbi algorithm answers this by maximizing the joint probability of the state-path and observations. Algebraically, this is exactly a computation in the (max, ×) semiring on the space of state sequences. At each time step, the algorithm selects the highest probability path that ends in each state, using max to choose the best predecessor state and multiplying by the transition and emission probabilities to extend the path.
Using the Viterbi semiring for HMM decoding means that instead of summing over all possible state paths (as one would in the forward algorithm for total likelihood), we track only the maximum probability for each partial path.[6] In other words, the algorithm "forgets" the lower-probability paths at every step, keeping only the best path to each state. This approach finds the single most likely path (often called the Viterbi path or MAP sequence), along with its probability.[8] Formally, if V_t(s) is the maximum probability of ending in state s at time t (over all state sequences up to t), the Viterbi update is:
V_t(s) = max over s′ of [ V_{t−1}(s′) · a(s′, s) · b_s(o_t) ]
where the max runs over all previous states s′, and a(s′, s) and b_s(o_t) are the transition and emission probabilities, respectively. This recurrence uses max as the combine operation and multiplication to accumulate probabilities along a path – precisely the Viterbi semiring operations. As a result, the most probable state sequence for the whole observation sequence is obtained at the final time step by taking the best ending-state probability.[9]
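The update translates directly into code; a toy two-state sketch (all probabilities invented for illustration):

```python
# Toy 2-state HMM with invented probabilities.
A = [[0.7, 0.3],     # A[s_prev][s]: transition probabilities
     [0.4, 0.6]]
B = [[0.9, 0.1],     # B[s][o]: emission probabilities
     [0.2, 0.8]]
pi = [0.6, 0.4]      # initial state distribution
obs = [0, 1, 1]      # observed symbol sequence

# Initialization, then the Viterbi update
#   V_t(s) = max over s' of V_{t-1}(s') * A[s'][s] * B[s][o_t]
V = [pi[s] * B[s][obs[0]] for s in range(2)]
for o in obs[1:]:
    V = [max(V[sp] * A[sp][s] for sp in range(2)) * B[s][o] for s in range(2)]

best_final = max(V)  # probability of the single most likely state path
assert 0.0 < best_final <= 1.0
```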
In HMM applications, the Viterbi semiring is contrasted with the inside (forward) semiring: the forward algorithm computes the total observation likelihood by summing over all paths, whereas the Viterbi algorithm (using the max-times semiring) computes the maximum over paths, i.e. the probability of the single best path.[10] Both are dynamic programming algorithms on the same HMM trellis, but instantiated with different semiring operations. It is also common to refer to the outcome of the Viterbi algorithm as a Viterbi alignment or Viterbi parse in contexts like speech or parsing, emphasizing that it represents an optimal sequence rather than a probability sum. The term "Viterbi" has thus become synonymous with the max-product computation on probabilistic models.
Viterbi training is a related concept in HMM learning: instead of the Expectation–Maximization (EM) algorithm, which sums over all paths (the Baum–Welch algorithm uses forward–backward sums), one can train by considering only the single best path for each training example (using the Viterbi semiring in the E-step). This is essentially a hard approximation to EM. While Viterbi training can converge faster, it may find a suboptimal local maximum, since it ignores probability mass off the best path. Nevertheless, it highlights again how simply switching to the max operation (the Viterbi semiring) in the algorithm yields a different approach to the problem.[11]
Implementation considerations
In software, the Viterbi semiring is typically implemented implicitly through the use of standard data types (for probabilities or log-probabilities) combined with appropriate operations. The key consideration is numerical stability: probabilities multiplied together can become extremely small for long sequences, so it is common to perform computations in the log domain. Implementing the Viterbi algorithm in log-space means using log probabilities, converting multiplication into addition of logs, and using max on log-probabilities (which corresponds to max on the original probabilities, since log is monotonic). In practice, one might represent each score as a floating-point log-probability, use addition for transitions, and use the standard max function to select the best path. This is equivalent to using an isomorphic semiring often called the log-max semiring, which corresponds to a tropical semiring with min if negative logs are used.[4] Working in log-space prevents underflow and turns multiplications into faster additions. Indeed, many speech and language processing systems compute a max over sums of log probabilities; Viterbi decoders for speech recognition, for example, typically use log probabilities to avoid numerical underflow.[9]
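The underflow problem, and the log-space remedy, can be demonstrated in a few lines (the per-step probability 1e-5 and sequence length 100 are illustrative):

```python
import math

probs = [1e-5] * 100   # 100 steps, each with (invented) probability 1e-5

# Direct max-times arithmetic underflows: the true product 1e-500 is far
# below the smallest positive double, so it collapses to exactly 0.0.
direct = 1.0
for p in probs:
    direct *= p
assert direct == 0.0

# Log-space: multiplication becomes addition of logs, and max is unchanged
# because log is monotonic.  The score stays a well-scaled ordinary float.
log_score = sum(math.log(p) for p in probs)
assert math.isclose(log_score, 100 * math.log(1e-5))
```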
Another important implementation detail is retrieving the argmax path (the actual sequence of states or decisions) and not just its probability. The Viterbi semiring itself computes the max probability values, but to recover the optimal sequence, the standard technique is to maintain backpointers: along with each V_t(s), store the argmax choice of predecessor state that achieved that maximum. After the DP table is filled, one can backtrack via these pointers to reconstruct the optimal path. This can be seen as augmenting the semiring with extra information (the argmax derivation), effectively working in an extended semiring that carries both the probability and the argmax history.[2] Some formalisms define an argmax semiring or use tuples to carry this information, but in practice a separate parallel array of backpointers is straightforward and efficient.
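Backpointers are a small addition to the value recurrence; a self-contained sketch with invented probabilities:

```python
# Viterbi with backpointers (invented probabilities): alongside each max
# we record which predecessor achieved it, then backtrack for the path.
A = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities A[s_prev][s]
B = [[0.9, 0.1], [0.2, 0.8]]   # emission probabilities B[s][o]
pi = [0.6, 0.4]                # initial state distribution
obs = [0, 1, 1]                # observed symbol sequence

V = [pi[s] * B[s][obs[0]] for s in range(2)]
back = []                      # back[t][s] = best predecessor of state s
for o in obs[1:]:
    bp, nV = [], []
    for s in range(2):
        scores = [V[sp] * A[sp][s] for sp in range(2)]
        best_prev = max(range(2), key=lambda sp: scores[sp])
        bp.append(best_prev)
        nV.append(scores[best_prev] * B[s][o])
    back.append(bp)
    V = nV

# Backtrack from the best final state through the stored pointers.
state = max(range(2), key=lambda s: V[s])
path = [state]
for bp in reversed(back):
    state = bp[state]
    path.append(state)
path.reverse()   # path is now the argmax state sequence
```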
Higher-level libraries and frameworks sometimes make the semiring abstraction explicit. For instance, in finite-state machine libraries like OpenFST, the default tropical weight is essentially the Viterbi semiring implemented in log-space (using negative log probabilities as weights, so that the shortest path corresponds to the highest-probability path).[12] The library provides generic algorithms (like composition or shortest-path) that work with any specified semiring, and using the tropical semiring yields Viterbi-style results. In another example, the torch_semiring_einsum library for PyTorch allows tensor operations to be carried out in different semirings; it provides a Viterbi semiring implementation out-of-the-box.[5] This means a programmer can switch from summing to maximizing simply by changing the semiring context, illustrating the flexibility of abstracting DP algorithms by semiring type.
In code, implementing the Viterbi semiring might involve a small wrapper or class that defines the two operations. However, since max and × (or + in log-space) are primitive operations, many implementations of the Viterbi algorithm just use control flow to apply these operations without an explicit "Semiring" object. The pattern typically looks like nested loops over states and time, with an inner calculation that takes exactly the max over (previous score × transition × emission). Pseudocode or high-level languages often make this clear. The benefit of understanding it as a semiring is more apparent when writing generic algorithms: for example, a CKY parser or a matrix multiplication can be generalized to different semirings. Indeed, by plugging the Viterbi semiring into a generic parsing algorithm, one obtains a parser that returns the single best parse (often called a Viterbi parse) instead of a distribution over parses.
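The genericity is easy to see with matrix "multiplication" over an arbitrary semiring; a sketch (the function name `semiring_matmul` and the weights are invented): plugging in different operation pairs changes what the same code computes.

```python
def semiring_matmul(X, Y, add, mul, zero):
    """Generic matrix product: out[i][j] = add over k of mul(X[i][k], Y[k][j])."""
    n, m, p = len(X), len(Y), len(Y[0])
    out = [[zero] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = zero
            for k in range(m):
                acc = add(acc, mul(X[i][k], Y[k][j]))
            out[i][j] = acc
    return out

# Invented edge weights between 2 nodes.
W = [[0.9, 0.2],
     [0.1, 0.8]]

# (+, *) gives the ordinary matrix product; (max, *) computes, in the
# Viterbi semiring, the weight of the best length-2 path between nodes.
usual = semiring_matmul(W, W, add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0)
best2 = semiring_matmul(W, W, add=max, mul=lambda a, b: a * b, zero=0.0)
```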
Applications in NLP, bioinformatics, and speech
The Viterbi semiring (max-times) is widely used in areas that require decoding the most likely sequence or structure under probabilistic models. Below are key application domains and examples:
Natural Language Processing (NLP)
In computational linguistics, the Viterbi algorithm is used for tasks like part-of-speech tagging, named entity recognition, and probabilistic parsing.[2] For instance, in a probabilistic context-free grammar (PCFG), one can use a dynamic programming parser with the Viterbi semiring to find the single most likely parse tree for a sentence (often called the Viterbi parse). Similarly, an HMM-based tagger uses Viterbi to find the most likely sequence of tags for a given word sequence. The semiring framework also appears in probabilistic parsing algorithms more generally – by switching between the inside semiring (summing probabilities of all parses) and the Viterbi semiring (taking max), systems can compute either parse probabilities or the best parse without changing the core algorithm. Overall, whenever a "best sequence" needs to be inferred (as opposed to a sum or expectation), NLP systems employ the Viterbi/max-product approach.
Bioinformatics
Many problems in bioinformatics involve finding an optimal sequence alignment or annotation, often using HMM variants.[6] For example, profile HMMs (used to model protein or DNA sequence families) apply the Viterbi algorithm to align a new sequence to the profile, yielding the single best alignment path (the highest scoring alignment). This is analogous to finding the most likely evolutionary path or gene structure. Gene prediction can also involve HMMs where states represent coding or non-coding regions; Viterbi decoding will produce the most likely segmentation of a DNA sequence into genes. The principle is the same: instead of summing over all gene architectures, the algorithm picks the highest probability gene model. Notably, sequence alignment algorithms like Needleman–Wunsch or Smith–Waterman (though often not framed in probability terms) are effectively using a max-plus semiring (max for the best score, plus for adding match/mismatch scores), which is conceptually similar to the Viterbi semiring approach.[13] Thus, the Viterbi semiring underlies a range of bioinformatics tools for sequence analysis, from multiple sequence alignment to protein structure prediction, whenever a best hypothesis needs to be identified.
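The max-plus analogy can be seen in a minimal global-alignment scorer in the Needleman–Wunsch style (a sketch with invented match/mismatch/gap scores): max chooses the best predecessor cell where the Viterbi semiring would, and + accumulates scores where the Viterbi semiring multiplies.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score: a DP over the (max, +) semiring."""
    n, m = len(a), len(b)
    # F[i][j] = best score aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # (mis)match: + plays "times"
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a; max plays "plus"
    return F[n][m]

score = align_score("GATTACA", "GATACA")  # 6 matches and one gap
```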
Speech recognition
Automatic speech recognition (ASR) relies heavily on the Viterbi semiring in decoding. In HMM-based speech recognition (e.g. with Gaussian mixture or acoustic models), the goal is to find the most likely word sequence given an acoustic signal. The search through the state-space of all possible word pronunciations is conducted with the Viterbi algorithm, which finds the best path through a lattice of HMM states representing words or sub-word units.[9] The Viterbi decoder uses max (taking the most probable hypothesis) at each time frame rather than summing over alternate hypotheses (which would correspond to the impractical task of computing the total probability of all possible utterances). This yields the single best transcription. Even in modern deep learning-based ASR, decoding often involves a Viterbi or Viterbi-like beam search. The semiring concept also appears in weighted finite-state transducer (WFST) frameworks for speech: the tropical semiring (a variant of the Viterbi semiring with log weights) is used to find the lowest-cost path, which corresponds to the highest-probability sentence.[12] ASR systems thus use Viterbi decoding as a core algorithm – taking the max rather than the sum when choosing among state transitions[14] – to output the single best recognition hypothesis.
Other
Beyond these areas, the max-times semiring has applications in machine learning and AI wherever one needs maximum a posteriori (MAP) inference. For example, in graphical models or factor graphs, the max-product algorithm (for MAP estimation) is essentially an instance of propagating messages in the Viterbi semiring (as opposed to the sum-product semiring for marginal inference). In domains like computer vision (e.g. decoding the best segmentation of an image) or computational finance (optimal policy paths), one can often spot the same algebraic pattern. The Viterbi semiring's reach is so broad that it has been described as a "universal" tool for maximization problems in sequence analysis and beyond.
Variations and related semirings
The Viterbi semiring is one member of a family of semirings used in weighted computations. Several variations and related semirings are worth noting:
- Inside/forward (sum-product) semiring: As mentioned, the complement to the Viterbi semiring's max-product is the sum-product semiring. In this semiring, the additive operation is ordinary addition (summing probabilities) and multiplication is the same (product of probabilities). This semiring underlies algorithms that compute total probabilities or expectations (like the forward algorithm in HMMs or the inside algorithm in probabilistic grammars).[7] It is not additively idempotent (since a + a = 2a ≠ a for a > 0), which leads to different behavior (accumulating mass from multiple paths rather than picking one). Many algorithms are duals: e.g., forward vs. Viterbi, inside vs. Viterbi parse, sum-product vs. max-product message passing – distinguished only by using sum vs. max in the semiring.
- Tropical and log semirings:[4] The tropical semiring is usually defined as (ℝ ∪ {+∞}, min, +, +∞, 0) (or sometimes with max and −∞ as the additive identity). If we take logarithms of probabilities, the log-domain version of the Viterbi semiring becomes a tropical semiring: max in probability space corresponds to min in negative log-space, and multiplication corresponds to addition of log values. Thus, the Viterbi semiring is often referred to as a tropical semiring in contexts like automata and path-finding (the only difference is the choice of the max vs. min convention, and whether one is dealing with probabilities or costs). In practice, many implementations use the tropical semiring for decoding tasks; for example, finding the shortest path in a graph (min-plus) is mathematically analogous to finding the highest-probability path (max-times) by a suitable sign inversion. The tropical semiring has applications from discrete event systems to optimization problems, and the Viterbi semiring is essentially the "probability-maximization" incarnation of that idea.
- Max–min and fuzzy semirings: A variant sometimes encountered in AI is a max–min semiring (often used in fuzzy logic or constraint satisfaction), where the additive operation is max and the multiplicative operation is min. An example is the fuzzy semiring ([0, 1], max, min, 0, 1), which is suitable for certain logic inference where conjunction is modeled by min (the truth value of a conjunction is the minimum truth of the operands) and disjunction by max. This is not the Viterbi semiring (since Viterbi uses actual multiplication), but it shares the use of max as an idempotent sum. Both Viterbi's max-times and the max–min semiring belong to the class of idempotent semirings, and both are useful for "best-match" computations under different interpretations of "combine" (multiply vs. min). The max–min semiring is sometimes called the Gödel or fuzzy logic semiring, whereas the Viterbi semiring could be seen as a probabilistic max semiring, because it multiplies probabilities rather than taking minima.
- k-best and N-best semirings:[2] In some applications, one might want not just the single best result but the top-k best alternatives (for example, the top k parses of a sentence, or the k-best hypotheses in speech recognition). This can be achieved by generalizing the semiring to keep sets of the top k scores instead of one. The k-best semiring is a construction where each element is a list of up to k highest values instead of a single value. The additive operation (which returns the top k of the union of two lists) and the multiplicative operation (which forms combined scores and then selects the top k) are defined to propagate multiple best solutions. Goodman (1998) introduced this for parsing, so one can compute k-best parses by essentially running a Viterbi-like DP that keeps k hypotheses at each step. Libraries and algorithms have been developed to implement lazy k-best extraction efficiently, since generating all intermediate candidates can be expensive. The k-best semiring is not a true semiring in the strictest algebraic sense unless k = 1 or some special conditions hold (because it may violate distributivity or idempotence in a formal way), but it is a very useful generalized semiring in practice for extending Viterbi-style algorithms to produce ranked lists of solutions.
- Other related semirings: The Boolean semiring can be seen as a special case of a max-times semiring on {0, 1} (with 1 and 0 as truth values, max = OR, times = AND). In fact, the Viterbi semiring can be thought of as a "weighted" generalization of the Boolean semiring, where the truth value is replaced by a probability in [0, 1], OR is replaced by max, and AND by multiplication.[2] Another related structure is the min-max semiring (also called the "bottleneck" semiring) used in certain path problems, where the score of a path is the minimum edge weight on that path and one wants to maximize that – this uses max and min (or vice versa, depending on convention). The Viterbi semiring is often mentioned alongside such semirings as an example of an idempotent semiring useful in optimisation, parsing, and search problems.
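Several of the constructions above are easy to prototype; for instance, the k-best generalization keeps sorted lists of the top k values (a sketch following Goodman's construction informally; the helper names `kbest_add` and `kbest_mul` are invented):

```python
import heapq

K = 3  # number of best hypotheses to keep

def kbest_add(xs, ys):
    """Semiring 'addition': the top K of the union of two K-best lists."""
    return heapq.nlargest(K, xs + ys)

def kbest_mul(xs, ys):
    """Semiring 'multiplication': the top K of all pairwise products."""
    return heapq.nlargest(K, (x * y for x in xs for y in ys))

a = [0.5, 0.25]     # a 2-best list (invented scores)
b = [0.5, 0.125]
assert kbest_add(a, b) == [0.5, 0.5, 0.25]
assert kbest_mul(a, b) == [0.25, 0.125, 0.0625]
```

With K = 1 this collapses back to the plain Viterbi semiring: each list holds a single value, addition is max, and multiplication is the product.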
References
- ↑ Droste, Manfred; Kuich, Werner; Vogler, Heiko, eds (2009) (in en). Handbook of Weighted Automata. 7. doi:10.1007/978-3-642-01492-5. ISBN 978-3-642-01491-8. Bibcode: 2009hwa..book.....D. https://link.springer.com/book/10.1007/978-3-642-01492-5.
- ↑ 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 Vieira, Timothy (2023-07-20). Automating the Analysis and Improvement of Dynamic Programming Algorithms with Applications to Natural Language Processing (Thesis). Johns Hopkins University.
- ↑ Lakshmi, C. Venkata; Vasanthi, T. (2014-02-05). "Some Special Classes of Semirings" (in en). International Journal of Applied Information Systems 6 (8): 27–29. doi:10.5120/ijais14-451093. https://www.ijais.org/archives/volume6/number8/598-1093/.
- ↑ 4.0 4.1 4.2 4.3 Tannen, Val. "Provenance Analysis for First-Order Model Checking". p. 6. https://courses.cs.washington.edu/courses/cse599c/18sp/lectures/t.pdf.
- ↑ 5.0 5.1 5.2 5.3 "Semiring Einsum (torch_semiring_einsum) — Semiring Einsum documentation". https://bdusell.github.io/semiring-einsum/.
- ↑ 6.0 6.1 6.2 Christopher, Vivek Vinushanth. "Understanding the Hidden Markov Model". https://builtin.com/articles/hidden-markov-model.
- ↑ 7.0 7.1 Gimpel, Kevin; Smith, Noah A.. "Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings". p. 2. https://homes.cs.washington.edu/~nasmith/papers/gimpel+smith.eacl09.pdf.
- ↑ Gogioso, Stefano. "Categorical Quantum Dynamics". p. 13. https://www.cs.ox.ac.uk/people/aleks.kissinger/theses/bob/Stefano.pdf.
- ↑ 9.0 9.1 9.2 Renals, Steve. "Search and Decoding". p. 2. https://www.inf.ed.ac.uk/teaching/courses/asr/2011-12/asr-search-nup4.pdf.
- ↑ Nakhleh, Luay. "Profile HMMs for Sequence Families". p. 7. https://www.cs.rice.edu/~nakhleh/COMP571/Slides/ProfileHMMs-Handout.pdf.
- ↑ Saers, Markus; Wu, Dekai. "Handling Ties Correctly and Efficiently in Viterbi Training Using the Viterbi Semiring". p. 285. https://cse.hkust.edu.hk/~dekai/library/WU_Dekai/SaersWu_Lata2018.pdf.
- ↑ 12.0 12.1 Hannemann, Mirko. "Weighted Finite State Transducers in Automatic Speech Recognition". https://www.cs.brandeis.edu/~cs136a/CS136a_Slides/zre_lecture_asr_wfst.pdf.
- ↑ "HMM : Viterbi algorithm - a toy example". https://www.cis.upenn.edu/~cis2620/notes/Example-Viterbi-DNA.pdf.
- ↑ Ward, Wayne. "Hidden Markov Models in Speech Recognition". https://www.math.uci.edu/~yqi/lect/SpeechRec2.pdf.
