Social:List of text corpora

From HandWiki
Revision as of 16:49, 5 February 2024 by Raymond Straus (talk | contribs) (change)
Short description: Overview of data sets of languages

Text corpora (singular: text corpus) are large and structured sets of texts, which have been systematically collected. Text corpora are used by corpus linguists and within other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.[1]

English language

European languages

Slavic

East Slavic

South Slavic

West Slavic

German

Middle Eastern Languages

Devanagari

East Asian Languages

South Asian Languages

African languages

Parallel corpora of diverse languages

  • Chinese/English Political Interpreting Corpus (CEPIC) [28][29] consists of transcripts of speeches delivered by top political figures from Hong Kong, Beijing, Washington DC and London, as well as their translated/interpreted texts. Developed by Jun Pan and HKBU Library.
  • Europarl Corpus - proceedings of the European Parliament from 1996 to 2012
  • EUR-Lex corpus - collection of all official languages of the European Union, created from the EUR-Lex database[30]
  • OPUS: Open source Parallel Corpus in many languages[31]
  • Tatoeba - a parallel corpus containing over 8.9 million sentences in multiple languages; 107 languages have more than 1,000 sentences each, and a further 81 languages have from 100 to 1,000 sentences each.[32]
  • NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie)[33] (legacy repo)
  • SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[34]
  • GRALIS parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)
  • The ACTRES Parallel Corpus (P-ACTRES 2.0) is a bidirectional English-Spanish corpus consisting of original texts in one language and their translation into the other. P-ACTRES 2.0 contains over 6 million words considering both directions together.[35]
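
Parallel corpora are often distributed as pairs of plain-text files in which line n of one file is the translation of line n of the other. The following is a minimal sketch of reading such data, assuming that line-aligned layout; the function name `read_aligned_pairs` and the sample sentences are illustrative and not taken from any corpus listed above.

```python
from io import StringIO
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_aligned_pairs(
    source: Iterable[str], target: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned inputs.

    Assumes line n of `source` corresponds to line n of `target`.
    """
    sentinel = object()
    for src, tgt in zip_longest(source, target, fillvalue=sentinel):
        # A length mismatch means the two sides are no longer aligned.
        if src is sentinel or tgt is sentinel:
            raise ValueError("inputs have different numbers of lines")
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is empty after trimming.
        if src and tgt:
            yield src, tgt


# Invented sample data standing in for two aligned corpus files.
english = StringIO("Resumption of the session\nThank you.\n")
spanish = StringIO("Reanudación del período de sesiones\nGracias.\n")

pairs = list(read_aligned_pairs(english, spanish))
for en, es in pairs:
    print(f"{en}\t{es}")
```

With real data, the two `StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8")`. Raising an error on a length mismatch is a deliberate choice here: silently truncating to the shorter side would hide a broken alignment.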

Comparable Corpora

  • Corpus of Political Speeches contains four collections of political speeches in English and Chinese from The Corpus of U.S. Presidential Speeches (1789–2015), The Corpus of Policy Address by Hong Kong Governors (1984–1996) and Hong Kong Chief Executives (1997–2014), The Corpus of Speeches given on New Year's days and Double Tenth days by Taiwan Presidents (1978–2014), and The Corpus of Report on the Work of the Government by Premiers of the People's Republic of China (1984–2013). Developed by HKBU Library.
  • WaCky - The Web-As-Corpus Kool Yinitiative (eng, fre, deu, ita)
  • Disambiguating Similar Language Corpora Collection (DSLCC)[36] (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish)
  • Wikipedia Comparable Corpora (registration required) (41 million aligned Wikipedia articles for 253 language pairs)
  • The TenTen Corpus Family – comparable web corpora with a target size of 10 billion words. These corpora are available in the corpus management system Sketch Engine. TenTen corpora currently exist for more than 30 languages (such as the English TenTen corpus,[37] Arabic TenTen corpus,[38] Spanish TenTen corpus,[39] and Russian TenTen corpus[40][41]). An overview of existing TenTen corpora can be found at https://www.sketchengine.co.uk/documentation/tenten-corpora/
  • Timestamped JSI web corpora – web corpora of news articles crawled from a list of RSS feeds. The newsfeed corpora are prepared within a project run by the Jožef Stefan Institute, a Slovenian scientific research institute,[42] and published in Sketch Engine. More information is available on the project websites.

L2 (English) Corpora

  • Cambridge Learner Corpus[43]
  • Corpus of Academic Written and Spoken English (CAWSE),[44] a collection of Chinese students’ English language samples in academic settings. Freely downloadable online.  
  • English as a Lingua Franca in Academic Settings (ELFA),[45] an academic ELF corpus.[46][47]
  • International Corpus of Learner English (ICLE),[48] a corpus of learner written English.
  • Louvain International Database of Spoken English Interlanguage (LINDSEI),[49] a corpus of learner spoken English.
  • Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.[50][51]
  • University of Pittsburgh English Language Institute Corpus (PELIC)[52]
  • Vienna-Oxford International Corpus of English (VOICE),[53] an ELF corpus.[46]

References

  1. Leech, Geoffrey (2007). "Teaching and language corpora: a convergence". In Wichmann, A. Teaching and Language Corpora. London: Longman. p. 9. 
  2. "Corpus Resource Database (CoRD)". Department of English, University of Helsinki. http://www.helsinki.fi/varieng/CoRD/corpora/. 
  3. Wahle, Jan Philip; Ruas, Terry; Mohammad, Saif; Gipp, Bela (2022). "D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research". Proceedings of the Thirteenth Language Resources and Evaluation Conference (Marseille, France: European Language Resources Association): 2642–2651. https://aclanthology.org/2022.lrec-1.283. 
  4. Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
  5. "PhraseFinder". http://phrasefinder.io/.  A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  6. [1],Basque corpora
  7. (in Spanish) "Molinolabs - corpus". molinolabs.com. http://www.molinolabs.com/corpus.html. 
  8. "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". coralit.lt. http://coralit.lt/en/node/18. 
  9. "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". tnc.org.tr. http://www.tnc.org.tr. 
  10. Glazkova, A (2020). "Topical Classification of Text Fragments Accounting for Their Nearest Context". Automation and Remote Control 81 (12): 2262–2276. doi:10.1134/S0005117920120097. https://www.researchgate.net/publication/348432654. 
  11. Rubtsova, Yu (2015). "Constructing a corpus for sentiment classification training". Software & Systems 1: 72–78. doi:10.15827/0236-235X.109.072-078. http://www.swsys.ru/index.php?page=article&id=3962&lang=&lang=en. 
  12. "Under Update". search.dcl.bas.bg. http://search.dcl.bas.bg. 
  13. "Електронски корпус на македонски книжевни текстови". http://drmj.manu.edu.mk/%d0%b5%d0%bb%d0%b5%d0%ba%d1%82%d1%80%d0%be%d0%bd%d1%81%d0%ba%d0%b8-%d0%ba%d0%be%d1%80%d0%bf%d1%83%d1%81-%d0%bd%d0%b0-%d0%bc%d0%b0%d0%ba%d0%b5%d0%b4%d0%be%d0%bd%d1%81%d0%ba%d0%b8-%d0%ba%d0%bd%d0%b8/. 
  14. "Portál | Český národní korpus". http://korpus.cz/. 
  15. Zdravkova, Katrina; Tufiş, Dan; Simov, Kiril; Radziszewski, Adam; Qasemizadeh, Behrang; Priest-Dorman, Greg; Petkevič, Vladimír; Oravecz, Csaba et al. (2010-05-14). "Available from CLARIN". http://nl.ijs.si/me/v4/. https://www.clarin.si/repository/xmlui/handle/11356/1043. 
  16. "University of Tehran NLP Lab". ece.ut.ac.ir. http://ece.ut.ac.ir/nlp/. 
  17. Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074
  18. "KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言". kotonoha.gr.jp. http://www.kotonoha.gr.jp/shonagon/. 
  19. https://wortschatz.uni-leipzig.de/en/download/Hindi
  20. D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.
  21. Glossa (uio.no)
  22. https://aclanthology.org/L14-1376/
  23. https://arxiv.org/pdf/2102.06991.pdf, https://wortschatz.uni-leipzig.de/en/download/Hausa
  24. https://www.sketchengine.eu/igtenten-igbo-corpus/
  25. https://www.sketchengine.eu/corpora-and-languages/oromo-text-corpora/
  26. https://www.researchgate.net/publication/336274457_Digital_Yoruba_Corpus, https://www.sketchengine.eu/corpora-and-languages/yoruba-text-corpora/
  27. https://wortschatz.uni-leipzig.de/en/download/Zulu
  28. Pan, Jun (2019). "The Chinese/English Political Interpreting Corpus (CEPIC). Hong Kong Baptist University Library". https://digital.lib.hkbu.edu.hk/cepic/. 
  29. Pan, Jun (2019-10-30). "The Chinese/English Political Interpreting Corpus (CEPIC): A New Electronic Resource for Translators and Interpreters". Proceedings of the Second Workshop Human-Informed Translation and Interpreting Technology Associated with RANLP 2019 (Incoma Ltd., Shoumen, Bulgaria): 82–88. doi:10.26615/issn.2683-0078.2019_010. 
  30. "EUR-Lex Corpus". sketchengine.co.uk. 2 June 2016. https://www.sketchengine.co.uk/eurlex-corpus/. 
  31. "OPUS - an open source parallel corpus". opus.lingfil.uu.se. http://opus.lingfil.uu.se/. 
  32. "Tatoeba - Number of sentences per language". tatoeba.org. http://tatoeba.org/eng/stats/sentences_by_language. 
  33. Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)". International Journal of Asian Language Processing 22 (4): 161–174. http://www.colips.org/journal/volume22/22.4.2.NTU-MC%20Tan%20final.pdf. Retrieved 12 January 2014. 
  34. Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  35. H. Sanjurjo-González and M. Izquierdo. 2019. P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications (pp. 215-231). John Benjamins Publishing.
  36. Liling Tan, Marcos Zampieri, Nikola Ljubešic, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  37. Kilgarriff, Adam (2012). "Getting to Know Your Corpus". Text, Speech and Dialogue. Lecture Notes in Computer Science. 7499. pp. 3–15. doi:10.1007/978-3-642-32790-2_1. ISBN 978-3-642-32789-6. 
  38. Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTen-Ten: a new, vast corpus for Arabic. Proceedings of WACL.
  39. Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia - Social and Behavioral Sciences, 95, 12-19.
  40. Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов. In Материалы научной конференции" Интернет и современное общество" (pp. 74-77).
  41. Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  42. Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434)
  43. "Cambridge English Corpus" (in en), Wikipedia, 2019-09-27, https://en.wikipedia.org/w/index.php?title=Cambridge_English_Corpus&oldid=918173927, retrieved 2020-01-07 
  44. "CAWSE Corpus - The University of Nottingham Ningbo China - 宁波诺丁汉大学". https://www.nottingham.edu.cn/en/education-and-english/research/cawse/cawse-corpus.aspx. 
  45. "English as a Lingua Franca in Academic Settings" (in en). 2018-03-23. https://www.helsinki.fi/en/researchgroups/english-as-a-lingua-franca-in-academic-settings. 
  46. "English as a lingua franca" (in en), Wikipedia, 2019-12-14, https://en.wikipedia.org/w/index.php?title=English_as_a_lingua_franca&oldid=930727312, retrieved 2020-01-07 
  47. Mauranen, A (2010). "English as an academic lingua franca: The ELFA project". English for Specific Purposes 29 (3): 183–190. doi:10.1016/j.esp.2009.10.001. 
  48. "ICLE" (in en). https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html. 
  49. "LINDSEI" (in fr). https://uclouvain.be/fr/node/11968. 
  50. "Trinity Lancaster Corpus | ESRC Centre for Corpus Approaches to Social Science (CASS)" (in en-US). http://cass.lancs.ac.uk/trinity-lancaster-corpus/. 
  51. Gablasova, D (2019). "The Trinity Lancaster Corpus: Development, Description and Application". International Journal of Learner Corpus Research 5 (2): 126–158. doi:10.1075/ijlcr.19001.gab. 
  52. Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. doi:10.5281/zenodo.3991977
  53. "Project". https://www.univie.ac.at/voice/page/index.php. 

See also