AsoSoft text corpus

Short description: Kurdish text corpus

The AsoSoft text corpus is the first large-scale Kurdish text corpus, collected and processed by the AsoSoft research and development group. It contains 458,000 documents (188 million tokens) collected from sources such as websites, news agencies, books, and magazines. The corpus is partially tagged by topic, so it can be used for topic identification tasks. It can also be used to extract language model and computational lexicon information. Part of the corpus (75 million tokens) is available online for non-commercial use. The corpus uses the Text Encoding Initiative (TEI) format.[1]
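As an illustration of how a TEI-encoded document could feed a simple language model, the following minimal Python sketch reads the body text of one TEI file and estimates unigram probabilities. It is not the AsoSoft group's own pipeline. The sample document, the helper names `body_tokens` and `unigram_probabilities`, the placeholder tokens standing in for Kurdish text, and the use of whitespace tokenization are all assumptions made for the example; the actual corpus files may be organized differently and would normally need Kurdish-specific normalization first.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document; placeholder tokens stand in for Kurdish text.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>sample document</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>alpha beta alpha</p>
      <p>gamma alpha</p>
    </body>
  </text>
</TEI>"""


def body_tokens(tei_xml):
    """Return whitespace-separated tokens from the <body> of a TEI document."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    # itertext() walks every text node under <body>, including nested elements.
    text = " ".join(body.itertext())
    return text.split()


def unigram_probabilities(tokens):
    """Estimate unigram probabilities as relative token frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


tokens = body_tokens(SAMPLE_TEI)
probs = unigram_probabilities(tokens)
for tok, p in sorted(probs.items()):
    print(tok, round(p, 3))
```

The same token counts could serve as a starting point for a frequency-based lexicon; topic tags, where present, would have to be read from wherever the corpus stores them, which this sketch does not assume.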

References

[1] Veisi, Hadi; MohammadAmini, Mohammad; Hosseini, Hawre (8 February 2019). "Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus". Digital Scholarship in the Humanities. doi:10.1093/llc/fqy074.

Original source: https://en.wikipedia.org/wiki/AsoSoft text corpus.