Language identification

Short description: Determination of language from a text sample

In natural language processing, language identification or language guessing is the problem of determining which natural language a given piece of content is written in. Computational approaches treat this problem as a special case of text categorization, solved with various statistical methods.

Overview

Logical approach

A common non-statistical, intuitive approach (though a highly uncertain one) is to look for common letter combinations, distinctive diacritics, or punctuation.^[1]^[2]
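As a rough illustration of this cue-based guessing, the sketch below counts how many distinctive characters of each language occur in a text. The cue table `DISTINCTIVE_CHARS` and the function `guess_by_cues` are hypothetical names invented for this example, and the three character sets are small hand-picked assumptions, not a vetted resource.

```python
# Hypothetical cue table (assumption): characters that are typical of,
# but not exclusive to, each language.
DISTINCTIVE_CHARS = {
    "German": set("äöüß"),
    "Spanish": set("ñ¿¡"),
    "Polish": set("ąęłśźż"),
}


def guess_by_cues(text):
    """Return the languages whose cue characters occur most often in `text`.

    The result is a sorted list; it is empty when no cue character is found,
    and may contain several languages when they tie.
    """
    chars = set(text.lower())
    scores = {lang: len(cues & chars) for lang, cues in DISTINCTIVE_CHARS.items()}
    best = max(scores.values())
    if best == 0:
        return []
    return sorted(lang for lang, score in scores.items() if score == best)


if __name__ == "__main__":
    print(guess_by_cues("¿Dónde está el niño?"))
    print(guess_by_cues("plain unaccented text"))
```

Text without any of the listed characters yields an empty list, which is one reason the approach is considered highly uncertain: many texts in a language simply do not contain its distinctive characters.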

Statistical approach

There are several statistical approaches to language identification.

An older statistical method by Grefenstette^[3] was based on the frequency of short n-grams, which are often function morphemes. For example, "ing" is more common in English than in French, while the sequence "que" is more common in French. Given a new page found on the Web, one counts the number of occurrences of each such short sequence and picks the language whose frequency table the counts match best.
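The counting-and-matching idea can be sketched as follows. This is a toy illustration, not Grefenstette's actual procedure: the training sentences in `SAMPLES`, the function names, and the `floor` probability used for unseen trigrams are all assumptions made for this example, and real systems train on far larger corpora.

```python
import math
from collections import Counter

# Toy training samples (assumption): real frequency tables need much more text.
SAMPLES = {
    "English": "the king is singing and the queen is bringing something interesting",
    "French": "la reine dit que le roi pense que la musique est magnifique parce que",
}


def trigrams(text):
    """Return all overlapping character trigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Build one relative-frequency table of trigrams per language."""
    tables = {}
    for lang, sample in samples.items():
        counts = Counter(trigrams(sample))
        total = sum(counts.values())
        tables[lang] = {gram: count / total for gram, count in counts.items()}
    return tables


def identify_by_trigrams(text, tables, floor=1e-6):
    """Pick the language whose table gives the text the highest log-likelihood.

    Trigrams missing from a table get the small probability `floor`
    (an assumed smoothing constant).
    """
    scores = {
        lang: sum(math.log(table.get(gram, floor)) for gram in trigrams(text))
        for lang, table in tables.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    tables = train(SAMPLES)
    for word in ("thinking", "magnifique"):
        print(word, "->", identify_by_trigrams(word, tables))
```

Summing log-probabilities is only one way to decide which table "matches best"; it is chosen here because it keeps the sketch short.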

One technique is to compare the compressibility of the text with the compressibility of texts in a set of known languages. This approach is known as a mutual-information-based distance measure. The same technique can also be used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods.^[4] The mutual-information-based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered to be either novel or better than simpler techniques.
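A minimal sketch of the compression idea, assuming Python's `zlib` as a stand-in compressor: the unknown text is appended to a reference text in each language, and the language whose reference needs the fewest extra compressed bytes is chosen. The reference sentences in `REFERENCES` and all function names are assumptions for illustration; with such short references the measure is noisy.

```python
import zlib

# Toy reference texts (assumption): practical use needs much longer samples.
REFERENCES = {
    "English": (
        "Language identification is the problem of determining which natural "
        "language a given text is written in. Statistical methods compare the "
        "text with models of known languages."
    ),
    "French": (
        "L'identification de la langue consiste à déterminer dans quelle langue "
        "naturelle un texte donné est écrit. Les méthodes statistiques comparent "
        "le texte à des modèles de langues connues."
    ),
}


def compressed_size(data):
    """Length in bytes of the zlib-compressed data at maximum compression."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference, text):
    """Extra compressed bytes needed when `text` is appended to `reference`."""
    ref = reference.encode("utf-8")
    return compressed_size(ref + text.encode("utf-8")) - compressed_size(ref)


def identify_by_compression(text, references):
    """Return (best language, per-language extra-byte costs)."""
    costs = {lang: extra_bytes(ref, text) for lang, ref in references.items()}
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    guess, costs = identify_by_compression(
        "which natural language a given text is written in", REFERENCES
    )
    print(guess, costs)
```

The intuition is that a compressor primed with text in the right language has already seen many of the character sequences in the new text, so it needs fewer additional bytes to encode it.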

Another technique, described by Cavnar and Trenkle (1994) and Dunning (1994), is to create a language n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter, language identification and character encoding detection are integrated. Then, for any piece of text needing to be identified, a similar model is made, and that model is compared to each stored language model. The most likely language is the one whose model is most similar to the model built from the text. This approach can be problematic when the input text is in a language for which there is no model. In that case, the method may return another, "most similar" language as its result. Also problematic for any approach are pieces of input text that are composed of several languages, as is common on the Web.
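The build-a-model-and-compare procedure can be sketched with ranked character n-gram profiles, in the spirit of Cavnar and Trenkle. This is a simplified illustration under stated assumptions: the training sentences in `TRAINING`, the function names, the profile size `top_k`, and the fixed penalty for n-grams missing from a language profile are choices made for this example rather than details taken from the article.

```python
from collections import Counter

# Toy training texts (assumption): real profiles are built from large corpora.
TRAINING = {
    "English": "the quick brown fox jumps over the lazy dog while the cat "
               "sleeps on the warm window sill",
    "French": "le renard brun saute par dessus le chien paresseux pendant "
              "que le chat dort sur la fenêtre",
}


def profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    text = " ".join(text.lower().split())  # lowercase, normalise whitespace
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def rank_distance(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams absent from the language profile
    cost a fixed max_penalty (an assumed constant)."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify_by_profile(text, lang_profiles):
    """Return the language whose stored profile is closest to the text's."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: rank_distance(doc, lang_profiles[lang]))


if __name__ == "__main__":
    lang_profiles = {lang: profile(text) for lang, text in TRAINING.items()}
    print(identify_by_profile("the cat sleeps on the window", lang_profiles))
    print(identify_by_profile("le chat dort sur la fenêtre", lang_profiles))
```

Because the function always returns the closest stored profile, it also exhibits the weakness noted above: a text in a language without a profile is still assigned to whichever known language happens to be nearest.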

As of 2025, a commonly used baseline is the fastText library, which achieves classification accuracy comparable to that of deep learning techniques while being much faster.^[5]

Identifying similar languages

One of the main bottlenecks of language identification systems is distinguishing between closely related languages. Similar languages, such as Bulgarian and Macedonian or Indonesian and Malay, show significant lexical and structural overlap, making it challenging for systems to discriminate between them.

In 2014, the DSL shared task^[6] was organized, providing a dataset (Tan et al., 2014) containing 13 different languages (and language varieties) in six language groups: Group A (Bosnian, Croatian, Serbian), Group B (Indonesian, Malaysian), Group C (Czech, Slovak), Group D (Brazilian Portuguese, European Portuguese), Group E (Peninsular Spanish, Argentine Spanish), Group F (American English, British English). The best system reached an accuracy of over 95% (Goutte et al., 2014). The results of the DSL shared task are described in Zampieri et al. (2014).

Software

Apache OpenNLP includes a character n-gram based statistical detector and comes with a model that can distinguish 103 languages.
Apache Tika contains a language detector for 18 languages.

References

Benedetto, D., E. Caglioti and V. Loreto. Language trees and zipping. Physical Review Letters, 88:4 (2002), Complexity theory.
Cavnar, William B. and John M. Trenkle. "N-Gram-Based Text Categorization". Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994).
Cilibrasi, Rudi and Paul M.B. Vitanyi. "Clustering by compression". IEEE Transactions on Information Theory 51(4), April 2005, 1523–1545.
Dunning, T. (1994) "Statistical Identification of Language". Technical Report MCCS 94-273, New Mexico State University, 1994.
Goodman, Joshua. (2002) Extended comment on "Language Trees and Zipping". Microsoft Research, Feb 21 2002. (This is a criticism of the data compression in favor of the Naive Bayes method.)
Goutte, C.; Leger, S.; Carpuat, M. (2014) The NRC System for Discriminating Similar Languages. Proceedings of the Coling 2014 workshop "Applying NLP Tools to Similar Languages, Varieties and Dialects".
Grefenstette, Gregory. (1995) Comparing two language identification schemes. Proceedings of the 3rd International Conference on the Statistical Analysis of Textual Data (JADT 1995).
Poutsma, Arjen. (2001) Applying Monte Carlo techniques to language identification. SmartHaven, Amsterdam. Presented at CLIN 2001.
Tan, L.; Zampieri, M.; Ljubešić, N.; Tiedemann, J. (2014) Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). Reykjavik, Iceland. pp. 6–10.
The Economist. (2002) "The elements of style: Analysing compressed data leads to impressive results in linguistics"
Radim Řehůřek and Milan Kolkus. (2009) "Language Identification on the Web: Extending the Dictionary Method" Computational Linguistics and Intelligent Text Processing.
Zampieri, M.; Tan, L.; Ljubešić, N.; Tiedemann, J. (2014) A Report on the DSL Shared Task 2014. Proceedings of the 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial). Dublin, Ireland. pp. 58–67.

References

[1] Stock, Wolfgang G.; Stock, Mechtild (2013-07-31). Handbook of Information Science. Walter de Gruyter. pp. 180–181. ISBN 978-3-11-023500-5. https://books.google.com/books?id=d1PnBQAAQBAJ&pg=PA180.
[2] Hagiwara, Masato (2021-12-14). Real-World Natural Language Processing: Practical Applications with Deep Learning. Simon and Schuster. pp. 105–106. ISBN 978-1-61729-642-0. https://books.google.com/books?id=Ok5NEAAAQBAJ&pg=PA105.
[3] Grefenstette, Gregory (December 1995). "Comparing two language identification schemes". Rome. pp. 263–268. https://www.xrce.xerox.com/competencies/contentanalysis/publications/Documents/P49030/content/gg_aslib.pdf.
[4] Benedetto, Dario; Caglioti, Emanuele; Loreto, Vittorio (January 2002). "Language Trees and Zipping". Physical Review Letters 88 (4). doi:10.1103/PhysRevLett.88.048702. ISSN 0031-9007. PMID 11801178. Bibcode: 2002PhRvL..88d8702B. https://link.aps.org/doi/10.1103/PhysRevLett.88.048702.
[5] Joulin, Armand; Grave, Edouard; Bojanowski, Piotr; Mikolov, Tomas (April 2017). Lapata, Mirella; Blunsom, Phil; Koller, Alexander. eds. "Bag of Tricks for Efficient Text Classification". Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers (Valencia, Spain: Association for Computational Linguistics): 427–431. https://aclanthology.org/E17-2068/.
[6] "VarDial Workshop @ COLING 2014". http://corporavm.uni-koeln.de/vardial/sharedtask.html.

Original source: https://en.wikipedia.org/wiki/Language identification.
