BulSemCor

The Bulgarian Sense-annotated Corpus (BulSemCor) (Bulgarian: Български семантично анотиран корпус (БулСемКор)) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics[1] at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.

Structure

BulSemCor was created as part of a nationally funded project titled "BulNet – A lexico-semantic network for the Bulgarian Language" (2005–2010). It follows the general methodology of SemCor[2] combined with some specific principles.[3] The corpus selected for annotation consists of 101,791 tokens excerpted from the Bulgarian "Brown" Corpus,[4] which is modelled on the Brown Corpus. An important feature of BulSemCor is that the samples were selected using heuristics that aim at optimal coverage of ambiguous lexis.

BulSemCor is manually sense-annotated according to the Bulgarian WordNet. Its size is comparable to that of other contemporary semantically annotated corpora. The semantic annotation consists in associating each lexical item in the corpus with exactly one synonym set (synset) in the Bulgarian WordNet, namely the one that best describes its sense in the particular context. The selection of the best match among the suggested candidates is based on a set of criteria, such as the other members of the synset, the synset gloss (explanatory definition) and the position of the candidate in the WordNet structure.
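
The one-synset-per-item scheme can be pictured with a small data model. The sketch below is illustrative only: the class and function names, the synset identifiers and the English toy entries are all hypothetical and do not reflect the actual BulSemCor or Bulgarian WordNet formats. It shows how the evidence an annotator weighs for each candidate (other members, gloss, position in the hierarchy) might be gathered, and how a single choice per token might be recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Synset:
    synset_id: str                   # hypothetical identifier
    members: tuple                   # literals (synonyms) in the synset
    gloss: str                       # explanatory definition
    hypernym: Optional[str] = None   # id of the parent synset, if any


# Toy inventory standing in for a wordnet (invented entries).
INVENTORY = {
    "n1": Synset("n1", ("bank", "depository"), "a financial institution", "n3"),
    "n2": Synset("n2", ("bank", "riverside"), "sloping land beside a river", "n4"),
    "n3": Synset("n3", ("institution",), "an established organisation"),
    "n4": Synset("n4", ("slope",), "an inclined surface"),
}


def candidates(lemma, inventory=INVENTORY):
    """Return every synset that lists the lemma among its members."""
    return [s for s in inventory.values() if lemma in s.members]


def describe(lemma, synset, inventory=INVENTORY):
    """Collect the evidence inspected for one candidate synset."""
    parent = inventory.get(synset.hypernym)
    return {
        "id": synset.synset_id,
        "other_members": [m for m in synset.members if m != lemma],
        "gloss": synset.gloss,
        "hypernym_members": list(parent.members) if parent else [],
    }


def annotate(token_index, lemma, chosen_id, annotations, inventory=INVENTORY):
    """Record exactly one synset for a token; reject non-candidates."""
    valid = {s.synset_id for s in candidates(lemma, inventory)}
    if chosen_id not in valid:
        raise ValueError(f"{chosen_id} is not a candidate for {lemma!r}")
    annotations[token_index] = chosen_id  # overwrites, so one sense per token


annotations = {}
for cand in candidates("bank"):
    print(describe("bank", cand))
annotate(0, "bank", "n2", annotations)  # the human choice, made in context
```

Keying the annotations by token position keeps the "exactly one synset" constraint structural: a second decision for the same token replaces the first rather than adding to it.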

Scale

The number of annotated tokens is 99,480; the difference from the initial 101,791 tokens arises because some tokens are not linguistic items. Of the annotated units, 86,842 are simple words and 5,797 are multiword expressions (MWEs), the latter spanning 12,638 tokens.
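
The figures above can be cross-checked with a few lines of arithmetic. The snippet below uses only the counts quoted in this article; the variable names are illustrative, and the derived values (non-linguistic tokens, total annotated units, average MWE length) are computed here rather than taken from the source. It assumes each simple word corresponds to one token, which the totals bear out.

```python
# Counts quoted in the article.
initial_tokens = 101_791
annotated_tokens = 99_480
simple_words = 86_842
mwe_count = 5_797
mwe_tokens = 12_638

# Simple words plus tokens inside MWEs account for every annotated token.
assert simple_words + mwe_tokens == annotated_tokens

non_linguistic = initial_tokens - annotated_tokens   # tokens left unannotated
annotated_units = simple_words + mwe_count           # each MWE counted once
avg_mwe_length = mwe_tokens / mwe_count              # tokens per MWE

print(non_linguistic)             # 2311
print(annotated_units)            # 92639
print(round(avg_mwe_length, 2))   # 2.18
```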

Specific features

Every word in BulSemCor is assigned a sense, whereas established practice is to annotate only simple content words or selected content word classes (typically nouns and verbs). Since 2000, the development of language resources has broadened to include annotation of function words and multiword expressions, covering particular senses or types of words and expressions. In this respect, BulSemCor's annotation is more exhaustive and hence provides greater opportunities for linguistic observation and natural language processing (NLP) applications.

Annotated items inherit the linguistic information associated with the corresponding synset, which along with morphological and semantic tags may include annotation on one or more of the following additional levels:[5]

  • Partial information about the syntactic structure of MWE types – particularly, information about syntactic heads and their dependents;
  • Information about the category of the named entities – names, locations, organisations, dates, numbers, etc.;
  • Information about the taxonomic category of adverbs, such as time, place, manner, degree, quantity, etc.;
  • Information about the type of the syntactic relationships – coordination or subordination – expressed by conjunctions;
  • Information about the original part-of-speech of substantivised words (non-nouns that act as nouns in a particular context);
  • Stylistic/register, grammatical and other information about synsets or individual synset members.
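
The inheritance of synset-level information by annotated items can be sketched as follows. This is a hypothetical model, not the corpus's actual schema: the class names, the feature keys (`adverb_class`, `named_entity`) and the English example words are assumptions chosen to mirror the levels listed above.

```python
from dataclasses import dataclass, field


@dataclass
class SynsetInfo:
    synset_id: str                              # hypothetical identifier
    pos: str                                    # part of speech of the synset
    extra: dict = field(default_factory=dict)   # optional additional levels


@dataclass
class AnnotatedItem:
    form: str            # the word or MWE as it appears in the text
    synset: SynsetInfo   # the single synset assigned to it

    def features(self):
        """Return the information the item inherits from its synset."""
        merged = {"pos": self.synset.pos}
        merged.update(self.synset.extra)
        return merged


time_adverb = SynsetInfo("a1", "adverb", {"adverb_class": "time"})
place_name = SynsetInfo("n9", "noun", {"named_entity": "location"})

print(AnnotatedItem("yesterday", time_adverb).features())
print(AnnotatedItem("Sofia", place_name).features())
```

Storing the extra levels on the synset rather than on each occurrence means a correction to a synset's attributes propagates to every item annotated with it.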

References
