A quantitative study of Mongolian.


2013. №5, 46-57

Abstract:

The paper describes the General Corpus of the Modern Mongolian Language (GCML), which contains 966 texts totalling 1 155 583 words. We also report a morphological analyzer for the Modern Mongolian language (MML), a grammatical dictionary of 63 071 lexemes, and a general table of morphological homonymy. The processor successfully analyzes 95% of the word forms occurring in the texts, which corresponds to 76% of the word-form entries in the concordance to the GCML.
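The gap between the two percentages is what one would expect if the first figure counts running word forms (tokens) and the second counts distinct concordance entries (types). That reading is an assumption, not something the abstract states. The sketch below uses toy data and a hypothetical `is_analyzed` predicate standing in for the morphological processor.

```python
from collections import Counter

def coverage(tokens, is_analyzed):
    """Return (token_coverage, type_coverage) as fractions in [0, 1].

    tokens      -- list of running word forms from a corpus
    is_analyzed -- hypothetical predicate: True if the analyzer handles the form
    """
    counts = Counter(tokens)                      # type -> number of occurrences
    analyzed_types = [w for w in counts if is_analyzed(w)]
    # Token coverage weights each analyzed type by how often it occurs.
    token_cov = sum(counts[w] for w in analyzed_types) / sum(counts.values())
    # Type coverage counts each distinct form once.
    type_cov = len(analyzed_types) / len(counts)
    return token_cov, type_cov

# Toy data, not taken from the GCML.
tokens = ["a", "a", "a", "b", "b", "c", "d"]
known = {"a", "b"}
token_cov, type_cov = coverage(tokens, lambda w: w in known)
```

Because frequent forms are the ones most likely to be analyzed, token coverage is typically higher than type coverage, as in the toy data here.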

MML can be described quantitatively by means of a structural-probabilistic model (SPM). The SPM comprises frequency dictionaries (FDs) of several types: FDs of word forms, lexemes, grammatemes, root morphemes and allomorphemes, affixal morphemes and allomorphemes, flexionemes, and grammemes.

The SPM makes it possible to describe the behavior of various language units in written text from a quantitative point of view: their frequency, their distribution across texts, their compatibility with other units, etc. A conventional structural model can be transformed into an SPM on the basis of statistical analysis of texts; in such a model, units of language are treated as having a «weight», and linguistic oppositions and relations are measured.

The paper reports the top lists of several FDs: the FD of word forms (the top 32 word forms, with frequencies above 2091 ipm), the FD of lexemes (the top 32 lexemes, with frequencies above 2627 ipm), and the FD of grammatemes (the top 32 grammatemes, with frequencies above 3920 ipm).
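As a minimal sketch of how such a top list might be derived, assume that "ipm" means instances per million running words and that frequencies are normalized over the whole corpus; the abstract does not spell out either point. The function names `ipm_table` and `top_list` are illustrative, and the data are toy values.

```python
from collections import Counter

def ipm_table(tokens):
    """Map each unit to its frequency in instances per million tokens."""
    total = len(tokens)
    counts = Counter(tokens)
    return {unit: count * 1_000_000 / total for unit, count in counts.items()}

def top_list(ipm, threshold):
    """Units with ipm strictly above the threshold, most frequent first."""
    selected = [(unit, freq) for unit, freq in ipm.items() if freq > threshold]
    # Sort by descending frequency; break ties alphabetically for stability.
    return sorted(selected, key=lambda pair: (-pair[1], pair[0]))

# Toy data, not taken from the GCML: 10 tokens in total.
tokens = ["x"] * 6 + ["y"] * 3 + ["z"]
ipm = ipm_table(tokens)
top = top_list(ipm, threshold=200_000)
```

Under the same assumption, a threshold of 2091 ipm in a corpus of 1 155 583 words would correspond to roughly 2 416 occurrences.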