Language model

Short description: Statistical model of structure of language

A language model is a probabilistic model of a natural language.[1] In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.[2]

Language models are useful for a variety of tasks, including speech recognition[3] (helping prevent predictions of low-probability (e.g. nonsense) sequences), machine translation,[4] natural language generation (generating more human-like text), optical character recognition, handwriting recognition,[5] grammar induction,[6] and information retrieval.[7][8]

Large language models, currently the most advanced form of language model, combine larger datasets (frequently text scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded purely statistical models such as the word n-gram language model.

Pure statistical models

Models based on word n-grams

A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network-based models, which have in turn been superseded by large language models.[9] It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.[10] Special tokens, [math]\displaystyle{ \langle s\rangle }[/math] and [math]\displaystyle{ \langle /s\rangle }[/math], are introduced to denote the start and end of a sentence.

To prevent a zero probability being assigned to unseen words, each word's probability is set slightly lower than its relative frequency in a corpus, which leaves some probability mass for unseen events. Various methods are used to calculate it, from simple "add-one" smoothing (add 1 to every n-gram count, so that unseen n-grams receive a count of 1, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
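The bigram case with add-one smoothing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the three-sentence corpus, the function names `train_bigram` and `bigram_prob`, and the choice to count the padding tokens as part of the vocabulary are all hypothetical simplifications.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams, padding each sentence with <s> and </s>."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)  # simplification: padding tokens count as vocabulary
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts, vocab

def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """Add-one (Laplace) estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
history_counts, bigram_counts, vocab = train_bigram(corpus)

p_seen = bigram_prob("the", "cat", history_counts, bigram_counts, vocab)
p_unseen = bigram_prob("the", "ran", history_counts, bigram_counts, vocab)
```

In this toy corpus "the cat" occurs in two of the three contexts where "the" appears, so its unsmoothed relative frequency is 2/3; the smoothed estimate `p_seen` is lower, and the unseen bigram "the ran" receives a small non-zero probability `p_unseen` instead of zero.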


Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

[math]\displaystyle{ P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp (a^T f(w_1,\ldots,w_m)) }[/math]

where [math]\displaystyle{ Z(w_1,\ldots,w_{m-1}) }[/math] is the partition function, [math]\displaystyle{ a }[/math] is the parameter vector, and [math]\displaystyle{ f(w_1,\ldots,w_m) }[/math] is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on [math]\displaystyle{ a }[/math] or some form of regularization.
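As an illustration of the equation above, the following Python sketch evaluates a maximum entropy model over a three-word vocabulary using indicator features for bigrams. The vocabulary, the feature set, and the weights in `WEIGHTS` are hypothetical values chosen for the example; in practice the parameter vector would be fitted to data, usually with regularization.

```python
import math

# Hypothetical vocabulary of candidate next words.
VOCAB = ["cat", "dog", "sat"]

# Hypothetical indicator features: each fires when a particular
# (previous word, candidate word) bigram is present.
FEATURE_BIGRAMS = [("the", "cat"), ("the", "dog"), ("cat", "sat")]

# Hypothetical parameter vector a, one weight per feature.
WEIGHTS = [1.2, 0.4, 2.0]

def features(history, word):
    """Feature function f: a 0/1 vector over FEATURE_BIGRAMS."""
    prev = history[-1]
    return [1.0 if (prev, word) == bigram else 0.0 for bigram in FEATURE_BIGRAMS]

def score(history, word):
    """Inner product a^T f(history, word)."""
    return sum(a_i * f_i for a_i, f_i in zip(WEIGHTS, features(history, word)))

def maxent_prob(history, word):
    """P(word | history) = exp(a^T f) / Z(history)."""
    z = sum(math.exp(score(history, w)) for w in VOCAB)  # partition function
    return math.exp(score(history, word)) / z

probs = {w: maxent_prob(["the"], w) for w in VOCAB}
```

The partition function is computed by summing the exponentiated scores over every candidate word, which is what makes the resulting values a proper probability distribution for a given history.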

The log-bilinear model is another example of an exponential language model.

Skip-gram model

Neural models

Recurrent neural network

Continuous representations or embeddings of words are produced in recurrent neural network-based language models (known also as continuous space language models).[11] Such continuous space embeddings help to alleviate the curse of dimensionality: the number of possible word sequences grows exponentially with sequence length (a vocabulary of [math]\displaystyle{ V }[/math] words admits [math]\displaystyle{ V^n }[/math] sequences of [math]\displaystyle{ n }[/math] words), which in turn causes a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.[12]
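The scale of the sparsity problem can be made concrete with a little arithmetic. The sketch below compares the number of distinct trigrams a count-based model would have to cover with the size of a word embedding table; the vocabulary size of 10,000 and the embedding dimension of 100 are assumed figures for illustration, and the embedding table is only one part of a neural model's parameters.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct word sequences of length n over the vocabulary."""
    return vocab_size ** n

def embedding_parameters(vocab_size, dim):
    """Number of values in an embedding table with one dim-sized vector per word."""
    return vocab_size * dim

vocab_size = 10_000  # assumed vocabulary size
embedding_dim = 100  # assumed embedding dimension

trigram_table = possible_ngrams(vocab_size, 3)
embedding_table = embedding_parameters(vocab_size, embedding_dim)
```

Under these assumptions there are 10^12 possible trigrams, far more than any realistic corpus could cover, whereas the embedding table holds 10^6 values that are shared across every context in which a word appears.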

Large language models

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[13] LLMs are artificial neural networks, the largest and most capable of which are built with a transformer-based architecture. Some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state space model).[14][15][16]

LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[17] Up to 2020, fine-tuning was the only way a model could be adapted to accomplish specific tasks. Larger models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[18] They are thought to acquire knowledge about the syntax, semantics and "ontology" inherent in human language corpora, but also the inaccuracies and biases present in those corpora.[19]
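The generation loop described above, in which the model repeatedly predicts the next token and appends it to the input, can be sketched as follows. The lookup table `TOY_MODEL` is a hypothetical stand-in for a trained network: it conditions only on the most recent token, whereas an actual LLM conditions on the whole preceding context, and the greedy choice shown here is only one of several decoding strategies.

```python
# Hypothetical next-token table standing in for a trained model:
# maps the most recent token to a distribution over next tokens.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def next_token(tokens):
    """Greedy decoding: return the most probable next token."""
    distribution = TOY_MODEL[tokens[-1]]
    return max(distribution, key=distribution.get)

def generate(prompt, max_new_tokens=10):
    """Extend the prompt one token at a time until </s> or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "</s>":
            break
        tokens.append(token)
    return tokens

output = generate(["<s>", "the"])
```

Each iteration feeds the extended sequence back into the predictor, so the text produced so far becomes part of the input for the next prediction.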

Some notable LLMs are OpenAI's GPT series of models (e.g., GPT-3.5 and GPT-4, used in ChatGPT and Microsoft Copilot), Google's PaLM and Gemini (used in Bard), Meta's LLaMA family of open-source models, and Anthropic's Claude models.

Although language models sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not learn, but fail to learn patterns that humans typically do learn.[20]

Evaluation and benchmarks

Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed models investigate the rate of learning, e.g. through inspection of learning curves.[21]

Various data sets have been developed for evaluating language processing systems.[22] These include:

  • Corpus of Linguistic Acceptability[23]
  • GLUE benchmark[24]
  • Microsoft Research Paraphrase Corpus[25]
  • Multi-Genre Natural Language Inference
  • Question Natural Language Inference
  • Quora Question Pairs[26]
  • Recognizing Textual Entailment[27]
  • Semantic Textual Similarity Benchmark
  • SQuAD question answering Test[28]
  • Stanford Sentiment Treebank[29]
  • Winograd NLI
  • BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Massive Multitask Language Understanding), BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs (LLaMA benchmarks).[30]
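
Comparison against a human-created benchmark, as described at the start of this section, often reduces to scoring model outputs against reference labels. The sketch below computes plain accuracy; the label lists are hypothetical data invented for the example, and individual benchmarks define their own metrics, which may differ from accuracy.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the human reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical binary labels, for illustration only.
human_labels = [1, 0, 1, 1]
model_labels = [1, 0, 0, 1]

benchmark_score = accuracy(model_labels, human_labels)
```

Here the model agrees with the human labels on three of the four examples, so `benchmark_score` is 0.75.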

References


  1. Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models". Speech and Language Processing (3rd ed.). Retrieved 24 May 2022. 
  2. Rosenfeld, Ronald (2000). "Two decades of statistical language modeling: Where do we go from here?". Proceedings of the IEEE 88 (8): 1270–1278. doi:10.1109/5.880083. 
  3. Kuhn, Roland; De Mori, Renato (1990). "A cache-based natural language model for speech recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (6): 570–583.
  4. Andreas, Jacob; Vlachos, Andreas; Clark, Stephen (2013). "Semantic parsing as machine translation". Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  5. Pham, Vu; et al. (2014). "Dropout improves recurrent neural networks for handwriting recognition". 14th International Conference on Frontiers in Handwriting Recognition. IEEE.
  6. Htut, Phu Mon; Cho, Kyunghyun; Bowman, Samuel R. (2018). "Grammar induction with neural language models: An unusual replication". arXiv:1808.10000.
  7. Ponte, Jay M.; Croft, W. Bruce (1998). "A language modeling approach to information retrieval". Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008. 
  8. Hiemstra, Djoerd (1998). "A linguistically motivated probabilistic model of information retrieval". Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34.
  9. Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (March 1, 2003). "A neural probabilistic language model". The Journal of Machine Learning Research 3: 1137–1155. 
  10. Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models". Speech and Language Processing (3rd ed.).
  11. Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks". 
  12. Bengio, Yoshua (2008). "Neural net language models". Scholarpedia. 3. p. 3881. doi:10.4249/scholarpedia.3881. Bibcode: 2008SchpJ...3.3881B. Retrieved 28 August 2015.
  13. "Better Language Models and Their Implications". 2019-02-14. 
  14. Peng, Bo; et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era". arXiv:2305.13048 [cs.CL].
  15. Merritt, Rick (2022-03-25). "What Is a Transformer Model?" (in en-US). 
  16. Gu, Albert; Dao, Tri (2023-12-01), Mamba: Linear-Time Sequence Modeling with Selective State Spaces 
  17. Bowman, Samuel R. (2023). "Eight Things to Know about Large Language Models". arXiv:2304.00612 [cs.CL].
  18. Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav et al. (Dec 2020). Larochelle, H.; Ranzato, M.; Hadsell, R. et al.. eds. "Language Models are Few-Shot Learners". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 33: 1877–1901. 
  19. Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus 151 (2): 127–138. doi:10.1162/daed_a_01905. 
  20. Hornstein, Norbert; Lasnik, Howard; Patel-Grosz, Pritty; Yang, Charles (2018-01-09) (in en). Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Walter de Gruyter GmbH & Co KG. ISBN 978-1-5015-0692-5. Retrieved 11 December 2021. 
  21. Karlgren, Jussi; Schutze, Hinrich (2015), "Evaluating Learning Language Representations", International Conference of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science, Springer International Publishing, pp. 254–260, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055 
  22. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018-10-10). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
  23. "The Corpus of Linguistic Acceptability (CoLA)". 
  24. "GLUE Benchmark" (in en). 
  25. "Microsoft Research Paraphrase Corpus" (in en-us). 
  26. Aghaebrahimian, Ahmad (2017), "Quora Question Answer Dataset", Text, Speech, and Dialogue, Lecture Notes in Computer Science, 10415, Springer International Publishing, pp. 66–73, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055 
  27. Sammons, Mark; Vydiswaran, V.G. Vinod; Roth, Dan. "Recognizing Textual Entailment".
  28. "The Stanford Question Answering Dataset". 
  29. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". 
  30. Hendrycks, Dan (2023-03-14), Measuring Massive Multitask Language Understanding, retrieved 2023-03-15

Further reading

  • J M Ponte; W B Croft (1998). "A Language Modeling Approach to Information Retrieval". pp. 275–281. 
  • F Song; W B Croft (1999). "A General Language Model for Information Retrieval". pp. 279–280.