Kneser–Ney smoothing
Kneser–Ney smoothing, also known as Kneser–Essen–Ney smoothing, is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.[1] It is widely considered the most effective method of smoothing due to its use of absolute discounting, in which a fixed value is subtracted from the count of each observed n-gram and the freed probability mass is redistributed, through lower-order models, to n-grams with lower frequencies. This approach has been considered equally effective for both higher- and lower-order n-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney.[2]
A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying only on the unigram frequency to predict the frequencies of n-grams leads to skewed results;[3] Kneser–Ney smoothing corrects this by considering how many distinct words the unigram follows: since "Francisco" occurs almost exclusively after "San", it is given a low probability of appearing in new contexts.
Method
Let [math]\displaystyle{ c(w,w') }[/math] be the number of occurrences of the word [math]\displaystyle{ w }[/math] followed by the word [math]\displaystyle{ w' }[/math] in the corpus.
The equation for bigram probabilities is as follows:
[math]\displaystyle{ p_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_{i}) - \delta,0)}{\sum_{w'} c(w_{i-1},w')} + \lambda_{w_{i-1}} p_{KN}(w_i) }[/math][4]
Here the unigram probability [math]\displaystyle{ p_{KN}(w_i) }[/math] is a continuation probability: it measures how likely the word [math]\displaystyle{ w_i }[/math] is to appear in an unfamiliar context, and is estimated as the number of distinct words that [math]\displaystyle{ w_i }[/math] appears after, divided by the number of distinct pairs of consecutive words in the corpus:
[math]\displaystyle{ p_{KN}(w_i) = \frac { |\{ w' : 0 \lt c(w',w_i) \} | } { |\{ (w',w'') : 0 \lt c(w',w'') \} | } }[/math]
Note that [math]\displaystyle{ p_{KN} }[/math] is a proper distribution, as the values defined in the above way are non-negative and sum to one.
The parameter [math]\displaystyle{ \delta }[/math] is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.
The value of the normalizing constant [math]\displaystyle{ \lambda_{w_{i-1}} }[/math] is calculated to make the sum of conditional probabilities [math]\displaystyle{ p_{KN}(w_i|w_{i-1}) }[/math] over all [math]\displaystyle{ w_i }[/math] equal to one. Observe that (provided [math]\displaystyle{ \delta \lt 1 }[/math]) for each [math]\displaystyle{ w_i }[/math] which occurs at least once in the context of [math]\displaystyle{ w_{i-1} }[/math] in the corpus we discount the probability by exactly the same constant amount [math]\displaystyle{ {\delta} / \left(\sum_{w'} c(w_{i-1},w')\right) }[/math], so the total discount depends linearly on the number of unique words [math]\displaystyle{ w_i }[/math] that can occur after [math]\displaystyle{ w_{i-1} }[/math]. This total discount is a budget we can spread over all [math]\displaystyle{ p_{KN}(w_i|w_{i-1}) }[/math] proportionally to [math]\displaystyle{ p_{KN}(w_i) }[/math]. As the values of [math]\displaystyle{ p_{KN}(w_i) }[/math] sum to one, we can simply define [math]\displaystyle{ \lambda_{w_{i-1}} }[/math] to be equal to this total discount:
[math]\displaystyle{ \lambda_{w_{i-1}} = \frac {\delta} {\sum_{w'} c(w_{i-1},w')} |\{ w' : 0 \lt c(w_{i-1},w') \} | }[/math]
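For concreteness, the bigram formula can be computed directly from corpus counts. The following Python sketch is a minimal illustration under simple assumptions (the function names and toy corpus are invented here, and a context that never occurs falls back to the continuation probability, a choice not dictated by the formula above):

```python
from collections import Counter, defaultdict

def train_bigram_kn(tokens, delta=0.75):
    """Return p_KN(word | prev) for interpolated bigram Kneser-Ney."""
    bigram_count = Counter(zip(tokens, tokens[1:]))   # c(w_{i-1}, w_i)
    context_total = Counter(tokens[:-1])              # sum_{w'} c(w_{i-1}, w')
    followers = defaultdict(set)                      # distinct words seen after each context
    preceders = defaultdict(set)                      # distinct words seen before each word
    for prev, cur in bigram_count:
        followers[prev].add(cur)
        preceders[cur].add(prev)
    distinct_bigrams = len(bigram_count)              # |{(w', w'') : c(w', w'') > 0}|

    def p_continuation(word):
        # Unigram continuation probability p_KN(w_i).
        return len(preceders[word]) / distinct_bigrams

    def p_kn(word, prev):
        total = context_total[prev]
        if total == 0:                                # unseen context: fall back entirely
            return p_continuation(word)
        discounted = max(bigram_count[(prev, word)] - delta, 0) / total
        lam = delta * len(followers[prev]) / total    # normalizing constant lambda_{w_{i-1}}
        return discounted + lam * p_continuation(word)

    return p_kn

# Toy usage: the unseen bigram "ate fish" still receives non-zero probability.
p_kn = train_bigram_kn("the cat sat on the mat the cat ate the fish".split())
print(p_kn("cat", "the"), p_kn("fish", "ate"))
```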
This equation can be extended to n-grams. Let [math]\displaystyle{ w_{i-n+1}^{i-1} }[/math] be the [math]\displaystyle{ n-1 }[/math] words before [math]\displaystyle{ w_i }[/math]:
[math]\displaystyle{ p_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1},w_{i}) - \delta,0)}{\sum_{w'} c(w_{i-n+1}^{i-1},w')} + \delta \frac{| \{ w' : 0 \lt c(w_{i-n+1}^{i-1},w') \} |}{\sum_{w'} c(w_{i-n+1}^{i-1},w')} p_{KN}(w_i|w_{i-n+2}^{i-1}) }[/math][5]
This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The addition of the term for lower-order n-grams adds more weight to the overall probability when the count for the higher-order n-grams is zero.[6] Similarly, the weight of the lower-order model decreases when the count of the n-gram is non-zero.
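The recursion in the preceding equation can be sketched in the same style. The helper below (hypothetical names, a toy corpus, and a brute-force scan of the count tables, so it is not efficient) follows the equation as written: raw corpus counts with an absolute discount at every order, backing off to a shorter context, with the continuation-count unigram distribution as the base case:

```python
from collections import Counter, defaultdict

def make_kn_lm(tokens, order=3, delta=0.75):
    """Return p_KN(word | context) implementing the recursive equation above."""
    # Index k-1 holds the Counter of k-gram tuples, for k = 1..order.
    ngram_counts = [Counter(zip(*(tokens[j:] for j in range(k))))
                    for k in range(1, order + 1)]

    # Continuation counts for the unigram base case p_KN(w_i).
    preceders = defaultdict(set)
    for prev, cur in ngram_counts[1]:
        preceders[cur].add(prev)
    distinct_bigrams = len(ngram_counts[1])

    def p_kn(word, context):
        if not context:                               # base case: continuation unigram
            return len(preceders[word]) / distinct_bigrams
        context = tuple(context)
        counts = ngram_counts[len(context)]           # counts of (len(context)+1)-grams
        total = sum(c for ng, c in counts.items() if ng[:-1] == context)
        if total == 0:                                # unseen context: back off directly
            return p_kn(word, context[1:])
        followers = sum(1 for ng in counts if ng[:-1] == context)
        discounted = max(counts[context + (word,)] - delta, 0) / total
        lam = delta * followers / total
        return discounted + lam * p_kn(word, context[1:])

    return p_kn

# Toy usage with a trigram model.
p = make_kn_lm("the cat sat on the mat the cat ate the fish".split(), order=3)
print(p("sat", ("the", "cat")))
```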
Modified Kneser–Ney smoothing
Modifications of this method also exist. Chen and Goodman's 1998 paper lists and benchmarks several such modifications; the best-performing of them, known as modified Kneser–Ney smoothing, replaces the single discount [math]\displaystyle{ \delta }[/math] with three separate discounts for n-grams whose count is one, two, or three and above.[7] This approach has been used for Google Translate under a MapReduce implementation.[8] KenLM is a performant open-source implementation.[9]
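The three discounts of modified Kneser–Ney are typically estimated from the counts-of-counts of the n-gram table, following Chen and Goodman. A minimal sketch, assuming the table is large enough that n1 through n4 are all non-zero (the function name is invented here):

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Estimate the discounts D1, D2, D3+ of modified Kneser-Ney smoothing.

    n_r below is the number of distinct n-grams occurring exactly r times.
    """
    count_of_counts = Counter(ngram_counts.values())
    n1, n2, n3, n4 = (count_of_counts[r] for r in (1, 2, 3, 4))
    y = n1 / (n1 + 2 * n2)
    return (1 - 2 * y * n2 / n1,   # D1: applied to n-grams seen once
            2 - 3 * y * n3 / n2,   # D2: applied to n-grams seen twice
            3 - 4 * y * n4 / n3)   # D3+: applied to n-grams seen three or more times
```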
References
- ↑ "A Bayesian Interpretation of Interpolated Kneser-Ney". NUS School of Computing Technical Report TRA2/06.
- ↑ Ney, Hermann; Essen, Ute; Kneser, Reinhard (January 1994). "On structuring probabilistic dependences in stochastic language modelling". Computer Speech & Language 8 (1): 1–38. doi:10.1006/csla.1994.1001.
- ↑ "Brown University: Introduction to Computational Linguistics".
- ↑ "Kneser Ney Smoothing Explained".
- ↑ "NLP Tutorial: Smoothing".
- ↑ Chen, Stanley F.; Goodman, Joshua (1998). "An Empirical Study of Smoothing Techniques for Language Modeling".
- ↑ Chen, Stanley F.; Goodman, Joshua (1998). "An Empirical Study of Smoothing Techniques for Language Modeling", p. 21.
- ↑ Brants, Thorsten; et al. (2007). "Large Language Models in Machine Translation".
- ↑ "Estimation . Kenlm . Code . Kenneth Heafield". https://kheafield.com/code/kenlm/estimation/.
Original source: https://en.wikipedia.org/wiki/Kneser–Ney_smoothing.