n-gram language model


An n-gram language model is a language model that models sequences of words as a Markov process. It relies on the simplifying assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. A bigram model considers one previous word, a trigram model considers two, and in general, an n-gram model considers n − 1 words of previous context.[1]

For example, a bigram language model models the probability of the sentence I saw the red house as:

[math]\displaystyle{ P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house}) }[/math]

where [math]\displaystyle{ \langle s\rangle }[/math] and [math]\displaystyle{ \langle /s\rangle }[/math] are special tokens denoting the start and end of the sentence, respectively.

These conditional probabilities may be estimated from frequency counts in some text corpus. For example, [math]\displaystyle{ P(\text{saw}\mid \text{I}) }[/math] can be naively estimated as the proportion of occurrences of the word I that are followed by saw in the corpus. The problem of sparsity (for example, if the bigram "red house" has zero occurrences in our corpus) may necessitate modifying the basic Markov model with smoothing techniques, particularly when larger context windows are used.[1]
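As a minimal sketch of this counting approach (the toy corpus and function names below are hypothetical, not part of any standard library), a bigram probability such as P(saw | I) can be estimated from padded sentences as follows:

    from collections import Counter

    # Toy corpus: each sentence is padded with start/end markers.
    corpus = [
        ["<s>", "I", "saw", "the", "red", "house", "</s>"],
        ["<s>", "I", "saw", "a", "dog", "</s>"],
        ["<s>", "the", "dog", "saw", "me", "</s>"],
    ]

    unigram_counts = Counter(w for sentence in corpus for w in sentence)
    bigram_counts = Counter(
        (sentence[i], sentence[i + 1])
        for sentence in corpus
        for i in range(len(sentence) - 1)
    )

    def bigram_mle(prev, word):
        """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(bigram_mle("I", "saw"))  # 1.0 -- every "I" in the toy corpus is followed by "saw"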

n-gram models are no longer commonly used in natural language processing research and applications, as they have been supplanted by state-of-the-art deep learning methods, most recently large language models.

Method

In an n-gram model, the probability [math]\displaystyle{ P(w_1,\ldots,w_m) }[/math] of observing the sentence [math]\displaystyle{ w_1,\ldots,w_m }[/math] is approximated as

[math]\displaystyle{ P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1})\approx \prod^m_{i=1} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) }[/math]

It is assumed that the probability of observing the ith word wi given the full history of the preceding i − 1 words can be approximated by the probability of observing it given only the shortened history of the preceding n − 1 words (an (n − 1)th-order Markov property). For example, for n = 3 and i = 2, only one preceding word is available, so the term reduces to [math]\displaystyle{ P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})=P(w_2\mid w_1) }[/math].

The conditional probability can be calculated from n-gram model frequency counts:

[math]\displaystyle{ P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})} }[/math]

The terms bigram and trigram language models denote n-gram models with n = 2 and n = 3, respectively.

Typically, the n-gram model probabilities are not derived directly from frequency counts, because models derived this way have severe problems when confronted with any n-grams that have not been explicitly seen before. Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or n-grams. Various methods are used, from simple "add-one" smoothing (add one to every n-gram count, so unseen n-grams receive a small non-zero probability) to more sophisticated models, such as Good–Turing discounting or back-off models.
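A minimal sketch of add-one smoothing for bigram estimates, assuming counts such as those in the earlier sketch and a known vocabulary size (the function name and example numbers are illustrative):

    def bigram_addone(prev, word, bigram_counts, unigram_counts, vocab_size):
        """Add-one (Laplace) smoothed estimate of P(word | prev).

        `bigram_counts` and `unigram_counts` are collections.Counter objects
        (as in the earlier sketch), which return 0 for unseen keys.  Every
        possible bigram receives a pseudo-count of 1, so an unseen bigram
        such as ("red", "dog") gets a small non-zero probability instead of zero.
        """
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

    # With the toy counts from the earlier sketch and a vocabulary of 10 types,
    # bigram_addone("red", "dog", bigram_counts, unigram_counts, 10) is small but non-zero.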

Example

In a bigram (n = 2) language model, the probability of the sentence I saw the red house is approximated as

[math]\displaystyle{ P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house}) }[/math]

whereas in a trigram (n = 3) language model, the approximation is

[math]\displaystyle{ P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid \langle s\rangle,\langle s\rangle) P(\text{saw}\mid\langle s\rangle,\text{I}) P(\text{the}\mid\text{I, saw}) P(\text{red}\mid\text{saw, the}) P(\text{house}\mid\text{the, red}) P(\langle /s\rangle\mid\text{red, house}) }[/math]

Note that the context of the first n – 1 n-grams is filled with start-of-sentence markers, typically denoted <s>.

Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house.
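To make the role of the padding and end-of-sentence markers concrete, the following sketch scores a sentence under a bigram model by summing log conditional probabilities along the chain shown above (log probabilities are used to avoid numerical underflow; the estimator prob is assumed to be something like the smoothed estimate sketched earlier):

    import math

    def bigram_sentence_logprob(sentence, prob):
        """Log-probability of a token list under a bigram model.

        `prob(prev, word)` is any estimator of P(word | prev); a smoothed
        estimator avoids log(0) for unseen bigrams.  The sentence is padded
        so the first word is conditioned on <s> and </s> is predicted from
        the last word.
        """
        padded = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(prob(prev, word)) for prev, word in zip(padded, padded[1:]))

    # Example (with some estimator `prob`):
    # bigram_sentence_logprob(["I", "saw", "the", "red", "house"], prob)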

Unigram model

A special case of an n-gram model is the unigram model, where n = 1. A unigram model can be treated as the combination of several one-state finite automata.[2] It assumes that the probabilities of tokens in a sequence are independent, e.g.:

[math]\displaystyle{ P_\text{uni}(t_1t_2t_3)=P(t_1)P(t_2)P(t_3). }[/math]

In this model, the probability of each word only depends on that word's own probability in the document, so we only have one-state finite automata as units. The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1. The following is an illustration of a unigram model of a document.

Term     Probability in doc
a        0.1
world    0.2
likes    0.05
we       0.05
share    0.3
...      ...

[math]\displaystyle{ \sum_{\text{term in doc}} P(\text{term}) = 1 }[/math]

The probability generated for a specific query is calculated as

[math]\displaystyle{ P(\text{query}) = \prod_{\text{term in query}} P(\text{term}) }[/math]
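For instance, using the illustrative probabilities from the table above, a query such as we share would be assigned

[math]\displaystyle{ P(\text{we share}) = P(\text{we})\,P(\text{share}) = 0.05 \times 0.3 = 0.015. }[/math]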

Each document has its own unigram model, with different hit probabilities for the words in it. The probability distributions of the different documents are used to generate a hit probability for each query, and documents can then be ranked for a query according to those probabilities. Example unigram models of two documents:

Term     Probability in Doc1    Probability in Doc2
a        0.1                    0.3
world    0.2                    0.1
likes    0.05                   0.03
we       0.05                   0.02
share    0.3                    0.2
...      ...                    ...

In information retrieval contexts, unigram language models are often smoothed to avoid instances where P(term) = 0. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to smooth the model.[3]
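A minimal sketch of this query-likelihood ranking with linear interpolation (Jelinek–Mercer) smoothing; the document models reuse the illustrative numbers from the table above, while the collection model and the weight lambda_ are made up for the example:

    import math

    def query_log_likelihood(query_terms, doc_model, collection_model, lambda_=0.8):
        """Score a document for a query under a smoothed unigram model.

        Each term's probability is a linear interpolation of the document's
        maximum-likelihood model and the collection-wide model, so terms
        missing from the document still receive non-zero probability.
        """
        score = 0.0
        for term in query_terms:
            p = (lambda_ * doc_model.get(term, 0.0)
                 + (1.0 - lambda_) * collection_model.get(term, 0.0))
            score += math.log(p)
        return score

    # Illustrative models (Doc1/Doc2 values from the table above; collection model invented):
    doc1 = {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3}
    doc2 = {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2}
    collection = {"a": 0.2, "world": 0.15, "likes": 0.04, "we": 0.035, "share": 0.25}

    query = ["we", "share"]
    scores = {name: query_log_likelihood(query, model, collection)
              for name, model in [("Doc1", doc1), ("Doc2", doc2)]}
    print(max(scores, key=scores.get))  # Doc1 ranks higher for this query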

References

  1. Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (3rd edition draft). https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf. Retrieved 24 May 2022.
  2. Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2009). An Introduction to Information Retrieval. Cambridge University Press. pp. 237–240.
  3. Büttcher, Stefan; Clarke, Charles L. A.; Cormack, Gordon V. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. pp. 289–291.