Data on distribution of polysemy in Russian dictionaries and their practical application in automated text analysis

Nikolay Golovko

North Caucasus Federal University, Humanities Institute, Pushkin street, 1, 355009, Stavropol

(RUSSIAN FEDERATION)

E-mail: nvgolovko@inbox.ru

ABSTRACT

Statistical analysis of Russian dictionary databases has revealed certain patterns in the distribution of polysemous lexical units across dictionary sections. Correlations were found between the initial graphemes of Russian words, the number of their meanings, and their borrowedness; these correlations were investigated further in order to determine their possible practical application. It has been found that calculations based on the percentage shares of words whose initial graphemes are associated with lower and greater degrees of polysemy can be used in automated text analysis, in combination with average word length values, to reliably differentiate samples of formal speech from samples of informal speech in the Russian language.

Keywords: lexical polysemy, lexicography, statistical analysis in linguistics, automated text analysis, average word length, stylistics.

1. INTRODUCTION

In recent years, we have been collecting statistical data on the evolution of the Russian lexicon. We were particularly interested in the historical development of polysemy and in the way it is represented in scientifically recognized dictionaries. The data and their analysis have revealed certain tendencies that we describe in this article.

Since these tendencies are considered in relation to automatic text analysis, it should be noted that various research efforts have sought to establish connections between the formal aspects of natural-language texts on the one hand and the content of these texts on the other. Historically, the rapid development of computing and the substantial increase in the amount of information to be processed by analytical systems led to the first research in this area in the second half of the 20th century [Rocchio 1971; Schank, Abelson 1977]. It was soon followed, before the end of the century, by a number of attempts in frame research [Gruber 1985; Hoffmann 1985; Ôim, Saluveer 1985; Tannen 1985], accompanied by considerations on the differentiation between various types of meanings [Hudson 1985; Kiefer 1985] and on the discovery of inner structures within texts [Wilks 1985]. This field of research then evolved in the direction of text clustering and categorization [Apte, Damerau, Weiss 1994; Beil, Ester, Xu 2002] and formal conceptual analysis [Cimiano, Staab, Tane 2003; Ferrucci 2004; Mehler, Waltinger, Wegner 2007]. Certain attempts have also been made in the area of automatic recognition and evaluation of styles [Holmes 1998; Peng, Hengartner 2002].

2. MATERIALS AND METHODS

The most influential, comprehensive, and widely accepted Russian dictionaries (Ozhegov’s, Kuznetsov’s, and the Minor Academic Dictionary) were selected as the database sources for the initial research. We processed these databases and determined the absolute numbers of monosemous words and of words with a lower degree of polysemy (2 meanings), a medium degree of polysemy (3 to 5 meanings), and a greater degree of polysemy (6 or more meanings). In addition, we calculated the relative numbers of polysemous words as shares of the total number of dictionary entries. The statistical analysis followed the alphabetical arrangement of words in these three dictionaries; thus, the data were collected for each alphabetical group of words separately.
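As an illustration, the counting procedure described above can be sketched in Python as follows. The function names and the toy entries (with invented numbers of meanings) are assumptions introduced for this sketch only; they do not reproduce the software actually used to process the dictionary databases.

```python
from collections import Counter, defaultdict


def polysemy_class(n_meanings):
    """Map a number of meanings to the categories M, LP, MP, GP."""
    if n_meanings < 1:
        raise ValueError("an entry must have at least one meaning")
    if n_meanings == 1:
        return "M"   # monosemous
    if n_meanings == 2:
        return "LP"  # lower degree of polysemy
    if n_meanings <= 5:
        return "MP"  # medium degree of polysemy (3 to 5 meanings)
    return "GP"      # greater degree of polysemy (6 meanings and above)


def tally_by_initial(entries):
    """Count entries per initial grapheme and polysemy category.

    `entries` is an iterable of (headword, number_of_meanings) pairs.
    """
    table = defaultdict(Counter)
    for headword, n_meanings in entries:
        initial = headword[0].upper()
        if initial == "Ё":
            initial = "Е"  # the Е and Ё sections are reported together
        table[initial][polysemy_class(n_meanings)] += 1
    return table


def polysemous_share(counts):
    """Percentage of polysemous entries (LP + MP + GP) in one section."""
    total = sum(counts.values())
    return 100.0 * (total - counts["M"]) / total


# Toy entries with invented sense counts, for illustration only.
sample = [("арка", 1), ("атом", 2), ("ёж", 3), ("ель", 1)]
sections = tally_by_initial(sample)
a_share = polysemous_share(sections["А"])
```

Applied to the counts reported for the А section of Ozhegov’s dictionary (575 monosemous, 130 LP and 21 MP entries), `polysemous_share` gives 20.80%, the value listed in Table 1.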

The determined and calculated quantities for each dictionary can be found in Tables 1-3. All tables share the following abbreviations: M – quantity of monosemous words, P – quantity of polysemous words, LP – lower degree of polysemy, MP – medium degree of polysemy, GP – greater degree of polysemy.

Table 1. Results of statistical analysis for Ozhegov’s Dictionary of the Russian Language

Initial grapheme	Words	M	P	LP	MP	GP	% P	% LP	% MP	% GP
А	726	575	151	130	21	0	20.80%	17.91%	2.89%	0.00%
Б	1497	1176	321	250	67	4	21.44%	16.70%	4.48%	0.27%
В	2514	1915	599	414	169	16	23.83%	16.47%	6.72%	0.64%
Г	993	769	224	156	63	5	22.56%	15.71%	6.34%	0.50%
Д	1396	1030	366	275	76	15	26.22%	19.70%	5.44%	1.07%
Е (Ё)	133	94	39	26	11	2	29.32%	19.55%	8.27%	1.50%
Ж	301	206	95	63	28	4	31.56%	20.93%	9.30%	1.33%
З	1788	1314	474	340	123	11	26.51%	19.02%	6.88%	0.62%
И	935	719	216	172	40	4	23.10%	18.40%	4.28%	0.43%
Й	5	4	1	1	0	0	20.00%	20.00%	0.00%	0.00%
К	2259	1706	553	396	149	8	24.48%	17.53%	6.60%	0.35%
Л	781	592	189	138	46	5	24.20%	17.67%	5.89%	0.64%
М	1404	1104	300	212	84	4	21.37%	15.10%	5.98%	0.28%
Н	2229	1670	559	411	142	6	25.08%	18.44%	6.37%	0.27%
О	2748	1974	774	560	193	21	28.17%	20.38%	7.02%	0.76%
П	6287	4685	1602	1135	429	38	25.48%	18.05%	6.82%	0.60%
Р	2102	1444	658	458	187	13	31.30%	21.79%	8.90%	0.62%
С	3607	2691	916	624	258	34	25.40%	17.30%	7.15%	0.94%
Т	1272	936	336	212	113	11	26.42%	16.67%	8.88%	0.86%
У	1056	752	304	218	80	6	28.79%	20.64%	7.58%	0.57%
Ф	490	358	132	97	32	3	26.94%	19.80%	6.53%	0.61%
Х	441	321	120	82	29	9	27.21%	18.59%	6.58%	2.04%
Ц	196	142	54	38	16	0	27.55%	19.39%	8.16%	0.00%
Ч	488	374	114	77	32	5	23.36%	15.78%	6.56%	1.02%
Ш	509	391	118	94	23	1	23.18%	18.47%	4.52%	0.20%
Щ	72	46	26	17	8	1	36.11%	23.61%	11.11%	1.39%
Э	323	255	68	53	15	0	21.05%	16.41%	4.64%	0.00%
Ю	45	34	11	11	0	0	24.44%	24.44%	0.00%	0.00%
Я	135	90	45	30	14	1	33.33%	22.22%	10.37%	0.74%
Total	36732	27367	9365	6690	2448	227

Table 2. Results of statistical analysis for Kuznetsov’s Contemporary Dictionary of the Russian Language

Initial grapheme	Words	M	P	LP	MP	GP	% P	% LP	% MP	% GP
А	884	593	291	224	67	0	32.92%	25.34%	7.58%	0.00%
Б	1623	1023	600	414	171	15	36.97%	25.51%	10.54%	0.92%
В	2654	1662	992	630	317	45	37.38%	23.74%	11.94%	1.70%
Г	1083	656	427	278	129	20	39.43%	25.67%	11.91%	1.85%
Д	1506	968	538	335	184	19	35.72%	22.24%	12.22%	1.26%
Е (Ё)	136	83	53	31	19	3	38.97%	22.79%	13.97%	2.21%
Ж	281	166	115	76	34	5	40.93%	27.05%	12.10%	1.78%
З	1971	1190	781	490	265	26	39.62%	24.86%	13.44%	1.32%
И	1064	664	400	274	119	7	37.59%	25.75%	11.18%	0.66%
Й	6	4	2	2	0	0	33.33%	33.33%	0.00%	0.00%
К	2548	1527	1021	614	361	46	40.07%	24.10%	14.17%	1.81%
Л	841	509	332	200	120	12	39.48%	23.78%	14.27%	1.43%
М	1564	874	690	427	241	22	44.12%	27.30%	15.41%	1.41%
Н	2457	1474	983	636	313	34	40.01%	25.89%	12.74%	1.38%
О	2960	1685	1275	769	455	51	43.07%	25.98%	15.37%	1.72%
П	6903	3846	3057	1795	1106	156	44.29%	26.00%	16.02%	2.26%
Р	2299	1324	975	593	341	41	42.41%	25.79%	14.83%	1.78%
С	4060	2271	1789	1005	654	130	44.06%	24.75%	16.11%	3.20%
Т	1390	769	621	344	232	45	44.68%	24.75%	16.69%	3.24%
У	1160	587	573	318	220	35	49.40%	27.41%	18.97%	3.02%
Ф	574	339	235	146	82	7	40.94%	25.44%	14.29%	1.22%
Х	454	242	212	122	67	23	46.70%	26.87%	14.76%	5.07%
Ц	208	99	109	55	47	7	52.40%	26.44%	22.60%	3.37%
Ч	469	248	221	123	86	12	47.12%	26.23%	18.34%	2.56%
Ш	571	324	247	168	74	5	43.26%	29.42%	12.96%	0.88%
Щ	74	39	35	16	17	2	47.30%	21.62%	22.97%	2.70%
Э	410	261	149	104	42	3	36.34%	25.37%	10.24%	0.73%
Ю	54	33	21	13	8	0	38.89%	24.07%	14.81%	0.00%
Я	145	85	60	15	42	3	41.38%	10.34%	28.97%	2.07%
Total	40349	23545	16804	10217	5813	774

Table 3. Results of statistical analysis for Minor Academic Dictionary of the Russian Language

Initial grapheme	Words	M	P	LP	MP	GP	% P	% LP	% MP	% GP
А	1600	1338	262	222	40	0	16.38%	13.88%	2.50%	0.00%
Б	3122	2577	545	409	128	8	17.46%	13.10%	4.10%	0.26%
В	6049	4586	1463	1095	337	31	24.19%	18.10%	5.57%	0.51%
Г	2274	1821	453	306	137	10	19.92%	13.46%	6.02%	0.44%
Д	3528	2838	690	525	149	16	19.56%	14.88%	4.22%	0.45%
Е (Ё)	226	165	61	43	16	2	26.99%	19.03%	7.08%	0.88%
Ж	630	489	141	91	46	4	22.38%	14.44%	7.30%	0.63%
З	4828	3673	1155	846	284	25	23.92%	17.52%	5.88%	0.52%
И	2250	1654	596	455	131	10	26.49%	20.22%	5.82%	0.44%
Й	18	16	2	2	0	0	11.11%	11.11%	0.00%	0.00%
К	5056	3870	1186	800	350	36	23.46%	15.82%	6.92%	0.71%
Л	1768	1357	411	298	100	13	23.25%	16.86%	5.66%	0.74%
М	3473	2661	812	557	242	13	23.38%	16.04%	6.97%	0.37%
Н	5846	4386	1460	1047	379	34	24.97%	17.91%	6.48%	0.58%
О	6646	4794	1852	1347	459	46	27.87%	20.27%	6.91%	0.69%
П	16594	12415	4179	3199	890	90	25.18%	19.28%	5.36%	0.54%
Р	4725	3284	1441	980	418	43	30.50%	20.74%	8.85%	0.91%
С	8611	6317	2294	1481	701	112	26.64%	17.20%	8.14%	1.30%
Т	2969	2188	781	491	252	38	26.31%	16.54%	8.49%	1.28%
У	2582	1686	896	637	238	21	34.70%	24.67%	9.22%	0.81%
Ф	1347	1053	294	214	74	6	21.83%	15.89%	5.49%	0.45%
Х	1008	776	232	141	74	17	23.02%	13.99%	7.34%	1.69%
Ц	491	349	142	91	48	3	28.92%	18.53%	9.78%	0.61%
Ч	1134	861	273	174	87	12	24.07%	15.34%	7.67%	1.06%
Ш	1212	904	308	213	91	4	25.41%	17.57%	7.51%	0.33%
Щ	156	113	43	24	17	2	27.56%	15.38%	10.90%	1.28%
Э	881	686	195	148	43	4	22.13%	16.80%	4.88%	0.45%
Ю	74	50	24	18	6	0	32.43%	24.32%	8.11%	0.00%
Я	193	111	82	54	25	3	42.49%	27.98%	12.95%	1.55%
Total	89291	67018	22273	15908	5762	603

Further analysis of these data has revealed certain peculiar tendencies in the distribution of polysemous words across dictionary sections.

It has become evident that in certain cases the absolute and relative quantities of polysemous words seem to correlate with the initial grapheme. While the majority of dictionary sections demonstrate a similar and homogeneous distribution of words characterized by various degrees of polysemy, a number of sections deviate notably from the general picture. It is also important that these deviations can be observed in all three dictionary databases, which suggests a potentially universal nature of the correlations between initial graphemes and degrees of polysemy.

Dictionary sections А and Й can serve as examples of obviously low degrees of polysemy. None of the three dictionaries under analysis contains any words with these initial graphemes that have 6 or more meanings; the А section is also characterized by a relatively low percentage share of words that possess 3 or more meanings, and the Й section consists entirely of words that have 1 or 2 meanings.

Dictionary sections Щ and Я represent a notable counterexample. The relative quantities of words with medium and greater degrees of polysemy (i.e. having 3 or more meanings) with these initial graphemes clearly exceed the corresponding shares in other sections.

These results and considerations led us to investigate the described correlations more deeply.

3. DISCUSSION

Although interdependencies between lexical polysemy, which represents the ideal side of language symbols, and the initial graphemes of words, which belong to the material side, might be regarded as unusual or strange, there are theoretical grounds to claim that they are not coincidental or arbitrary. Both Russian and international studies in quantitative linguistics include successful efforts to establish links between the material form of lexical units on the one hand and linguistic or even extra-linguistic properties of language symbols on the other. A remarkable example is the adaptation of the Zipf-Mandelbrot law to linguistic phenomena, demonstrating that the frequency of a word’s appearance in both oral and written speech is inversely related to its length (i.e. the number of graphemes and/or sounds contained in it) [Seleznev, Isaeva 2005; Köhler, Altmann, Piotrowski 2005]. Word length, being a purely physical attribute that can be measured with relative ease, is also traditionally recognized as a valuable and reliable criterion for determining various inner characteristics of texts, including their functional style.

The validity of the Zipf-Mandelbrot law in its application to natural languages could be explained philosophically, in terms of meta-science. It is widely known that any natural language is in part driven by the principle of linguistic economy, which encourages speakers and writers to minimize their effort and use the smallest possible number of language units to convey the intended message. Consequently, a language community tends to prefer simpler means of expression to more complicated ones, thus using shorter words more frequently.

The same approach of correlation between linguistic and extra-linguistic phenomena could be applied to the statistical data that we have collected.

All the non-standard dictionary sections mentioned above share a certain peculiar feature. The initial graphemes А and Й, which show the lowest relative degrees of polysemy in all three dictionaries, are not typical of the Russian language. Russian vocabulary contains only a minuscule number of native words that begin with the grapheme А (and, consequently, with the corresponding sound); the majority of words in the А section and all lexical units in the Й section are borrowed from other languages. In addition, these borrowed words are either formal or used as scientific terms.

In contrast to А and Й, the initial graphemes Щ and Я, which are characterized by some of the greatest percentage shares of polysemous words, represent typical Russian sounds that are frequently found in native lexical units but are seldom or never associated with borrowed words. The grapheme Щ is especially notable in this respect because the dictionaries contain no borrowed words with this initial grapheme.

We consider it reasonable to hypothesize that lexical polysemy is related to the borrowedness of words. Indeed, a lexical unit obtains new meanings through the historical development of the language system, as well as through frequent use; thus, older and more actively used words should generally possess a more extended semantic system than newer and less ubiquitous lexemes. A word’s age can be connected to its origin, assuming that native words are relatively older and borrowed words are relatively newer, and frequency of use is closely associated with word length, as noted above.

We have summarized these considerations and have arrived at the following statements.

1.1. It can be expected that a Russian word in possession of 3 meanings and above would be native and more frequent in speech practice, and a Russian word characterized by 1 or 2 meanings would be borrowed and less frequent in speech practice.

1.2. A Russian word’s frequency of use can be estimated by its length, and the Russian word’s origin could be estimated by its initial grapheme.

1.3. A Russian word’s length and initial grapheme could be used as formal physical markers to estimate the number of meanings it is associated with.

These statements were used as the foundation for further research and analysis.

In order to use the initial graphemes as formal markers, we needed to determine the degrees of reliability associated with the corresponding dictionary sections. This was necessary to exclude random deviations in the statistical data caused by peculiarities of individual dictionaries; in addition, the selected graphemes had to show distinct positive or negative differences from the typical percentage shares found in the majority of sections. For this purpose, the initial graphemes with the largest and the smallest relative numbers of polysemous entries, as well as of words with medium and greater degrees of polysemy (both separately and combined), were selected from each dictionary database and are presented in the following tables.

All tables share abbreviations with Tables 1-3. The values in columns where difference exceeds 1% are rounded. The grey zone indicates graphemes that are located significantly close to the general threshold (0.5% or less below or above the threshold).

Table 4. Selections from statistical data based on Ozhegov’s Dictionary of the Russian Language

Initial grapheme	% P	Initial grapheme	% MP	Initial grapheme	% GP	Initial grapheme	% MP+GP
Щ	36%	Щ	11%	Х	2.04%	Щ	13%
Я	33%	Я	10%	ЕЁ	1.50%	Я	11%
Ж	32%	Ж	9%	Щ	1.39%	Ж	11%
Р	31%	Р	9%	Ж	1.33%	ЕЁ	10%
ЕЁ	29%	Т	9%	Д	1.07%	Т	10%
У	29%			Ч	1.02%	Р	10%
						Б	5%
						Ш	5%
Б	21%	Б	4%	Ц	0.00%	И	5%
М	21%	И	4%	Э	0.00%	Э	5%
Э	21%	А	3%	А	0.00%	А	3%
А	21%	Ю	0%	Ю	0.00%	Ю	0%
Й	20%	Й	0%	Й	0.00%	Й	0%

Table 5. Selections from statistical data based on Kuznetsov’s Contemporary Dictionary of the Russian Language

Initial grapheme	% P	Initial grapheme	% MP	Initial grapheme	% GP	Initial grapheme	% MP+GP
Ц	52%	Я	29%	Х	5%	Я	31%
У	49%	Щ	23%	Ц	3%	Ц	26%
Щ	47%	Ц	23%	Т	3%	Щ	26%
Ч	47%	У	19%	С	3%	У	22%
Х	47%	Ч	18%	У	3%	Ч	21%
				Б	0.92%
В	37%			Ш	0.88%
Б	37%	И	11%	Э	0.73%	И	12%
Э	36%	Б	11%	И	0.66%	Б	11%
Д	36%	Э	10%	Ю	0.00%	Э	11%
Й	33%	А	8%	А	0.00%	А	8%
А	33%	Й	0%	Й	0.00%	Й	0%

Table 6. Selections from statistical data based on Minor Academic Dictionary of the Russian Language

Initial grapheme	% P	Initial grapheme	% MP	Initial grapheme	% GP	Initial grapheme	% MP+GP
Я	42%	Я	13%	Х	1.69%	Я	15%
У	35%	Щ	11%	Я	1.55%	Щ	12%
Ю	32%	Ц	10%	С	1.30%	Ц	10%
Р	30%	У	9%	Щ	1.28%	У	10%
Ц	29%	Р	9%	Т	1.28%	Т	10%
				Ч	1.06%	Р	10%
		Ф	5%
		П	5%	М	0.37%
Г	20%	Э	5%	Ш	0.33%	Э	5%
Д	20%	Д	4%	Б	0.26%	Д	5%
Б	17%	Б	4%	Ю	0.00%	Б	4%
А	16%	А	3%	А	0.00%	А	3%
Й	11%	Й	0%	Й	0.00%	Й	0%

We have set the following criteria to determine degrees of reliability:

1) Initial graphemes that are found in the white zone of all four types of percentage shares are recognized as the most reliable.

2) Initial graphemes that are found in the white zone of three types of percentage shares out of four are recognized as more reliable.

3) Initial graphemes that are found in the white or grey zones of all four types of percentage shares are recognized as reliable.

4) Initial graphemes that are found in the white or grey zones of three types of percentage shares out of four are recognized as less reliable.

5) All the remaining initial graphemes are recognized as the least reliable and are not taken into account in further analysis.
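The five criteria above can be read as an ordered decision rule. The following Python sketch is one possible encoding, given as an illustration only: the function name and the zone labels `"white"`, `"grey"` and `None` (grapheme absent from the selection) are assumptions of this sketch, and the rule is assumed to be applied separately to the selections of highest and of lowest shares.

```python
def reliability(zones):
    """Return the degree of reliability of one grapheme in one dictionary.

    `zones` holds four labels, one per type of percentage share
    (% P, % MP, % GP, % MP+GP): "white", "grey", or None when the
    grapheme does not appear in that selection at all.
    """
    if len(zones) != 4:
        raise ValueError("exactly four percentage-share types are expected")
    white = sum(z == "white" for z in zones)
    listed = sum(z in ("white", "grey") for z in zones)
    # The criteria are checked in order; the first one that holds wins.
    if white == 4:
        return "most reliable"    # criterion 1
    if white == 3:
        return "more reliable"    # criterion 2
    if listed == 4:
        return "reliable"         # criterion 3
    if listed == 3:
        return "less reliable"    # criterion 4
    return "least reliable"       # criterion 5, dropped from further analysis


example = reliability(["white", "white", "white", None])
```

In this encoding a grapheme with three white-zone appearances is rated "more reliable" whether its fourth appearance is grey or missing, because criterion 2 is checked before criterion 3.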

The criteria were applied to each selection separately. The resulting picture is represented in Table 7. The table contains the following abbreviations: (+) – higher degrees of polysemy, (-) – lower degrees of polysemy, MtR – most reliable graphemes, MrR – more reliable graphemes, R – reliable graphemes, LR – less reliable graphemes.

Table 7. Distribution of selected initial graphemes across all dictionaries

Dictionary/ Graphemes	Ozhegov		Kuznetsov		Minor Academic
Dictionary/ Graphemes	(+)	(-)	(+)	(-)	(+)	(-)
MtR	Ж, Щ	А, Й	У, Ц	А, Й, Э	Я	А, Б, Й
MrR	ЕЁ, Я	Э, Ю	Ч, Щ	И	У, Ц, Щ	Д
R	-	-	-	Б	-	-
LR	Р	Б	-	-	-	-

The results were compared and systematized in order to formulate the following final statements:

2.1. Initial graphemes Щ, А and Й are characterized by the greatest degree of reliability as possible formal markers of lexical polysemy. These graphemes are either the most reliable or more reliable in all three dictionaries.

2.2. Initial graphemes У, Ц, Я and Э are characterized by greater degree of reliability. These graphemes are either the most reliable or more reliable in two dictionaries out of three.

2.3. Initial grapheme Б is characterized by medium degree of reliability. This grapheme meets reliability criteria in all three dictionaries, but is the most reliable in only one of them.

2.4. Initial graphemes Е(Ё), Ж, Ч, Д, И and Ю are characterized by lower degree of reliability. These graphemes meet reliability criteria in one dictionary out of three and are either the most reliable or more reliable in it.

2.5. Initial grapheme Р is characterized by the lowest degree of reliability and is excluded from the selection due to apparently coincidental nature of its appearance in the list.

2.6. Initial graphemes Е(Ё), Ж, У, Ц, Ч, Щ and Я could be used as formal markers of greater degrees of polysemy, while graphemes А, Б, Д, И, Й, Э and Ю could be used as formal markers of lower degrees of polysemy.

Having obtained this information, we commenced a search for possible practical application of these formal markers.

As mentioned above, word length is traditionally used in Russian automated text analysis to distinguish formal texts from informal speech. Formal texts are represented by scientific and juridical speech, both of which are characterized by considerable numbers of terminological lexemes; these lexemes have extended graphical and sound forms and are usually borrowed from the corpus of internationally recognized words originating from Ancient Greek and Latin. In addition, formal texts are expected, and frequently required, to be as monosemantic as possible in order to prevent ambiguous perception and improper understanding. Informal texts are associated with journalism and fiction, which are not limited by formal requirements and thus make use of shorter colloquial words that are frequently of native origin; moreover, textual polysemy is encouraged by the genres that exist in the corresponding area of speech culture [Kozhina 2008].

At the same time, word length is not reliable enough to be applied as the only detection criterion. For this reason, in common practice it is supported by several other criteria, which vary from one analytical system to another. For example, the well-known Russian public text analyzer Hudlomer, available at URL http://teneta.rinet.ru/hudlomer/, uses the so-called Fomenko’s invariant to supplement its word length spectrum mechanism. Some researchers have suggested using the chi-squared distribution, Fisher’s hypergeometric criterion, etc. [Shevelev 2006]. The drawback of these additional criteria is that they deal solely with the external aspect of texts and have no considerable connection to their semantics. The marker of initial graphemes, in turn, could establish a link to the internal aspect of the text under analysis, thus properly supplementing the external criterion of word length.

In this respect, we partially follow a trend in Russian linguistics that associates textual polysemy with the concept of entropy. Certain recent findings in this area of Russian linguistic research are related to Dr. Sergey Gusarenko’s scientific school [Gusarenko 2009]. According to this view, the appropriate understanding of natural-language texts is influenced by entropy, i.e. the amount of chaos introduced by external and internal factors. One such internal factor is believed to be polysemy. Each polysemous word introduces an amount of chaos into the text, making it more difficult to process and understand by natural or artificial means; correspondingly, since entropy is an additive quantity, a text’s total entropy is determined by the number of polysemous words in it. Similarly, by measuring the percentage shares of words that possess lower and higher degrees of polysemy, we estimate the general polysemousness of the text as reflected in the number of meanings each word could possibly have. We use the term “potential polysemousness” to denote this concept [Golovko 2012].

Since the principal properties of lexemes that determine the characteristic traits of formal and informal texts (i.e. length, borrowedness and number of meanings) correspond to the Zipf-Mandelbrot law in its application to linguistics and to statements 1.1 – 1.3 presented above, we decided to test whether the formal markers of word length and of initial graphemes allow Russian texts to be correctly classified as formal or informal. We were particularly interested in determining whether these two markers would constitute a basis solid enough to support reliable classification.

4. RESULTS

In order to perform the test and gather statistical data for further analysis, we collected a selection of 100 texts in the Russian language. The selection includes 25 samples of poetry and fiction, 25 publicistic texts, 25 juridical texts and 25 scientific texts; accordingly, formal and informal speech are represented by 50 samples each.

At the initial stage, we applied a verification procedure in order to ensure that all the selected initial graphemes are indeed reliable. The results of the separate data investigation for each grapheme are presented in Table 8.

Table 8. Statistical data for initial graphemes associated with lower and greater degrees of polysemy (hereafter: in the meaning represented in statement 2.6)

Initial grapheme / Texts	Literary	Publicistic	Scientific	Juridical	Total informal	Total formal
А	0.3818%	0.7792%	1.2733%	1.5392%	0.5805%	1.4062%
Б	3.2023%	2.8651%	1.9355%	1.3503%	3.0337%	1.6429%
Д	3.4709%	3.2472%	3.1434%	3.9605%	3.3591%	3.5519%
И	1.4698%	2.2922%	3.3330%	3.6504%	1.8810%	3.4917%
Й	0.0147%	0.0000%	0.0220%	0.0000%	0.0073%	0.0110%
Э	0.1964%	0.4390%	0.9470%	0.4518%	0.3177%	0.6994%
Ю	0.0433%	0.0474%	0.0811%	0.1941%	0.0453%	0.1376%
Е	1.5353%	1.4675%	0.8500%	0.6964%	1.5014%	0.7732%
Ж	0.8424%	0.6390%	0.2657%	0.1723%	0.7407%	0.2190%
У	1.7155%	1.8030%	2.6991%	3.0701%	1.7593%	2.8846%
Ц	0.2339%	0.3430%	0.4894%	0.5139%	0.2885%	0.5016%
Ч	2.4071%	2.4605%	1.6164%	0.8647%	2.4338%	1.2406%
Щ	0.0607%	0.0237%	0.0209%	0.0018%	0.0422%	0.0113%
Я	1.4625%	0.6685%	0.5591%	0.2802%	1.0655%	0.4197%

It has been discovered that certain graphemes exhibit behaviour that contradicts the initial assumptions or does not entirely correspond to them. Consequently, we have excluded such graphemes as Б (previously noted as possessing medium level of reliability), Д (lower degree of reliability), У and Ц (greater degrees of reliability) from the selection.
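The exact verification rule is not spelled out in the text. One possible check, sketched below in Python as an assumption rather than as the procedure actually used, averages the literary and publicistic shares (informal speech) and the scientific and juridical shares (formal speech) from Table 8 and keeps a grapheme only if its share in the expected group exceeds the share in the other group by a margin. The margin `MIN_RATIO = 1.2` and all names are hypothetical choices made for this sketch.

```python
# Percentage shares from Table 8: (literary, publicistic, scientific, juridical).
TABLE_8 = {
    "А": (0.3818, 0.7792, 1.2733, 1.5392),
    "Б": (3.2023, 2.8651, 1.9355, 1.3503),
    "Д": (3.4709, 3.2472, 3.1434, 3.9605),
    "И": (1.4698, 2.2922, 3.3330, 3.6504),
    "Й": (0.0147, 0.0000, 0.0220, 0.0000),
    "Э": (0.1964, 0.4390, 0.9470, 0.4518),
    "Ю": (0.0433, 0.0474, 0.0811, 0.1941),
    "Е": (1.5353, 1.4675, 0.8500, 0.6964),
    "Ж": (0.8424, 0.6390, 0.2657, 0.1723),
    "У": (1.7155, 1.8030, 2.6991, 3.0701),
    "Ц": (0.2339, 0.3430, 0.4894, 0.5139),
    "Ч": (2.4071, 2.4605, 1.6164, 0.8647),
    "Щ": (0.0607, 0.0237, 0.0209, 0.0018),
    "Я": (1.4625, 0.6685, 0.5591, 0.2802),
}

# Markers of lower polysemy per statement 2.6; all other graphemes
# in Table 8 are markers of greater polysemy.
LOWER = set("АБДИЙЭЮ")

MIN_RATIO = 1.2  # hypothetical margin, not taken from the article


def behaves_as_expected(grapheme):
    """Lower-polysemy markers should prevail in formal texts,
    greater-polysemy markers in informal texts."""
    literary, publicistic, scientific, juridical = TABLE_8[grapheme]
    informal = (literary + publicistic) / 2
    formal = (scientific + juridical) / 2
    if grapheme in LOWER:
        return formal >= MIN_RATIO * informal
    return informal >= MIN_RATIO * formal


rejected = sorted(g for g in TABLE_8 if not behaves_as_expected(g))
```

With this margin the sketch rejects Б, Д, У and Ц, the same four graphemes that were excluded. Б, У and Ц run against the expected direction, while Д has nearly equal shares in formal and informal texts.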

At the next stage, the following values were determined for each text: total text length in symbols, total number of words, average word length, the percentage shares of initial graphemes associated with lower and greater degrees of polysemy, the difference between these shares, and the ratio of the share for greater degrees of polysemy to the share for lower degrees of polysemy. The results are displayed in Tables 9-12 for each functional style of texts separately. All tables share the following abbreviations: AWL – average word length, %LP – percentage share of initial graphemes associated with lower degrees of polysemy, %GP – percentage share of initial graphemes associated with greater degrees of polysemy, D – difference between %LP and %GP (%LP minus %GP), R – %GP/%LP ratio.
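The per-text values can be computed along the following lines. This Python sketch is illustrative only: the marker sets follow statement 2.6 with Б, Д, У and Ц removed, while the tokenization (a word is any run of Cyrillic letters) and the function name are assumptions of the sketch, since the original counting software is not described.

```python
import re

# Marker sets after the exclusions made during verification.
LOWER_MARKERS = set("АИЙЭЮ")      # lower degrees of polysemy
GREATER_MARKERS = set("ЕЁЖЧЩЯ")   # greater degrees; Ё is counted with Е

# Assumption: a word is a maximal run of Russian Cyrillic letters.
WORD_RE = re.compile(r"[А-Яа-яЁё]+")


def text_metrics(text):
    """Return AWL, %LP, %GP, D and R for a Russian text."""
    words = WORD_RE.findall(text)
    if not words:
        raise ValueError("no Cyrillic words found")
    n = len(words)
    awl = sum(len(w) for w in words) / n
    initials = [w[0].upper() for w in words]
    lp = 100.0 * sum(c in LOWER_MARKERS for c in initials) / n
    gp = 100.0 * sum(c in GREATER_MARKERS for c in initials) / n
    d = lp - gp                              # D = %LP - %GP
    r = gp / lp if lp else float("inf")      # R = %GP / %LP; inf if %LP is 0
    return {"AWL": awl, "%LP": lp, "%GP": gp, "D": d, "R": r}


metrics = text_metrics("Это яблоко и ещё щука")
```

For the five-word toy sentence above the sketch yields AWL = 3.4, %LP = 40, %GP = 60, D = -20 and R = 1.5. Real texts of several hundred words or more are needed for the shares to be meaningful.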

Table 9. Statistical data for poetry and fiction

Text #	1	2	3	4	5	6	7
Symbols	359964	400357	236025	447871	269075	25415	81711
Words	53913	63380	35110	65386	34534	2931	9809
AWL	4.7831	4.8953	4.9669	5.0885	3.9340	3.9710	3.8842
%LP	2.3186%	2.3825%	2.3697%	2.5021%	2.0212%	1.8765%	1.5700%
%GP	8.6565%	7.3872%	8.5246%	6.4387%	4.3609%	4.1624%	4.1493%
D	-6.3379	-5.0047	-6.1549	-3.9366	-2.3397	-2.2859	-2.5793
R	3.7336	3.1007	3.5974	2.5733	2.1576	2.2182	2.6429
Text #	8	9	10	11	12	13	14
Symbols	22095	376723	90674	183978	165793	23287	114118
Words	3554	39600	14394	27688	31699	3587	16839
AWL	4.6629	3.4766	4.7587	4.5648	3.8478	4.4957	4.8982
%LP	1.8008%	1.3586%	2.1120%	1.9864%	2.8708%	2.0072%	1.7459%
%GP	5.9932%	4.4141%	7.6699%	8.6752%	8.3441%	6.2448%	5.0716%
D	-4.1924	-3.0555	-5.5579	-6.6888	-5.4733	-4.2376	-3.3257
R	3.3281	3.2491	3.6316	4.3673	2.9066	3.1111	2.9048
Text #	15	16	17	18	19	20	21
Symbols	313322	26592	75628	15183	17401	29917	20800
Words	46329	3808	10983	1784	2859	4693	3171
AWL	4.8489	5.0089	4.9390	5.0639	4.4008	4.0324	4.8802
%LP	1.9642%	2.4422%	1.8028%	3.0830%	1.2592%	1.9604%	2.4913%
%GP	5.5386%	8.1933%	5.1261%	5.7175%	2.9731%	4.9435%	5.3611%
D	-3.5744	-5.7511	-3.3233	-2.6345	-1.7139	-2.9831	-2.8698
R	2.8198	3.3548	2.8434	1.8545	2.3611	2.5217	2.1519
Text #	22	23	24	25	Average value
Symbols	118760	107931	72194	101108	147836.9
Words	17920	16403	11278	16748	21536
AWL	4.2513	4.5122	4.4370	4.1192	4.5089
%LP	3.4208%	2.0484%	1.6049%	1.6480%	2.1059%
%GP	7.6228%	7.0841%	6.7122%	8.3234%	6.3075%
D	-4.2020	-5.0357	-5.1073	-6.6754	-4.2016
R	2.2284	3.4583	4.1823	5.0507	3.0540

Table 10. Statistical data for publicistic texts

Text #	1	2	3	4	5	6	7
Symbols	7230	8655	15477	29767	11681	5459	6486
Words	964	1268	2251	4856	1632	752	848
AWL	4.9481	5.4495	5.4860	4.7780	5.8658	5.9535	5.8255
%LP	2.0747%	4.6530%	3.7317%	2.4918%	3.9828%	5.3191%	4.9528%
%GP	6.1203%	4.6530%	4.4425%	6.8987%	4.8407%	4.6543%	7.9009%
D	-4.0456	0.0000	-0.7108	-4.4069	-0.8579	0.6648	-2.9481
R	2.9500	1.0000	1.1905	2.7686	1.2154	0.8750	1.5952
Text #	8	9	10	11	12	13	14
Symbols	6623	15360	13956	19398	6621	10501	12455
Words	1055	2149	1959	2908	819	1513	1822
AWL	4.8275	5.8381	5.7626	5.3855	5.4481	5.5056	5.4748
%LP	3.2227%	4.4672%	3.8285%	3.2325%	4.2735%	2.6438%	4.1164%
%GP	6.8246%	5.4444%	5.3088%	4.5392%	2.6862%	4.3622%	5.8178%
D	-3.6019	-0.9772	-1.4803	-1.3067	1.5873	-1.7184	-1.7014
R	2.1176	1.2188	1.3867	1.4043	0.6286	1.6500	1.4133
Text #	15	16	17	18	19	20	21
Symbols	11695	15516	6054	5257	5468	6675	6685
Words	1826	2256	881	753	843	952	936
AWL	4.9578	5.4756	5.5392	5.6521	5.1851	5.1134	5.2799
%LP	3.0120%	3.1472%	2.8377%	3.4529%	3.2028%	2.4160%	4.0598%
%GP	5.3669%	5.4965%	7.0375%	4.6481%	7.2361%	4.0966%	4.5940%
D	-2.3549	-2.3493	-4.1998	-1.1952	-4.0333	-1.6806	-0.5342
R	1.7818	1.7465	2.4800	1.3462	2.2593	1.6957	1.1316
Text #	22	23	24	25	Average value
Symbols	8503	7547	11982	8662	10548.52
Words	1205	1120	1877	1225	1546.8
AWL	5.7734	5.3848	4.9989	5.7241	5.4253
%LP	3.6515%	3.0357%	3.7294%	3.3469%	3.5553%
%GP	4.3154%	3.4821%	6.7128%	4.0000%	5.2592%
D	-0.6639	-0.4464	-2.9834	-0.6531	-1.7039
R	1.1818	1.1471	1.8000	1.1951	1.5672

Table 11. Statistical data for scientific texts

Text #	1	2	3	4	5	6	7
Symbols	8082	11794	7151	6307	5871	153660	522418
Words	1016	1320	773	746	649	18607	62956
AWL	6.6181	7.4258	7.9405	7.1032	7.5932	6.5052	6.9933
%LP	6.3976%	5.0000%	5.1746%	7.2386%	6.0092%	5.7505%	7.8277%
%GP	4.0354%	2.7273%	1.5524%	3.0831%	2.4653%	3.2300%	2.4477%
D	2.3622	2.2727	3.6222	4.1555	3.5439	2.5205	5.3800
R	0.6308	0.5455	0.3000	0.4259	0.4103	0.5617	0.3127
Text #	8	9	10	11	12	13	14
Symbols	640763	611274	294539	251950	8339	158064	342976
Words	86264	76340	32636	29555	988	18346	38439
AWL	5.8213	6.4292	7.0700	6.9543	6.7247	7.2798	7.4348
%LP	5.7892%	6.6138%	6.1864%	4.4561%	5.1619%	6.2411%	8.1766%
%GP	3.5461%	4.8913%	1.9672%	3.3869%	1.5182%	3.2868%	2.7004%
D	2.2431	1.7225	4.2192	1.0692	3.6437	2.9543	5.4762
R	0.6125	0.7396	0.3180	0.7601	0.2941	0.5266	0.3303
Text #	15	16	17	18	19	20	21
Symbols	576735	538073	638815	434023	10314	10577	9600
Words	76182	67485	79477	49650	1183	1247	1130
AWL	6.1412	6.5304	6.5416	7.2522	7.4142	7.1460	6.9956
%LP	5.0287%	5.3434%	5.3047%	8.0685%	6.5089%	5.1323%	6.0177%
%GP	5.6352%	2.8110%	3.9986%	2.5277%	2.1978%	3.4483%	4.7788%
D	-0.6065	2.5324	1.3061	5.5408	4.3111	1.6840	1.2389
R	1.1206	0.5261	0.7538	0.3133	0.3377	0.6719	0.7941
Text #	22	23	24	25	Average value
Symbols	9142	12669	6324	5623	211003.3
Words	1143	1444	667	654	25955.88
AWL	6.4182	7.4017	8.0435	7.2446	7.0009
%LP	3.3246%	5.0554%	2.9985%	2.5994%	5.6562%
%GP	5.3368%	3.1856%	2.9985%	5.0459%	3.3121%
D	-2.0122	1.8698	0.0000	-2.4465	2.3441
R	1.6053	0.6301	1.0000	1.9412	0.6585

Table 12. Statistical data for juridical texts

Text #	1	2	3	4	5	6	7
Symbols	174699	52122	9220	27346	131609	41427	36367
Words	20664	6716	1066	3410	15172	4872	4659
AWL	6.9340	6.4242	6.8405	6.2651	6.9853	6.4002	6.3634
%LP	7.7478%	7.8023%	4.5028%	4.2522%	6.7888%	3.2430%	4.9152%
%GP	1.5970%	3.2162%	1.2195%	3.6070%	1.9641%	2.9762%	3.5415%
D	6.1508	4.5861	3.2833	0.6452	4.8247	0.2668	1.3737
R	0.2061	0.4122	0.2708	0.8483	0.2893	0.9177	0.7205
Text #	8	9	10	11	12	13	14
Symbols	67000	11092	22194	13540	54614	26047	308198
Words	5866	1242	2543	1505	6304	2241	26883
AWL	5.8529	7.2262	7.0822	7.5056	6.3165	6.9942	5.0633
%LP	2,7787%	3,6232%	6,3704%	7,5083%	4,6003%	4,9041%	5,0329%
%GP	0,5967%	1,0467%	2,1628%	1,3289%	1,6022%	0,4904%	0,8109%
D	2,1820	2,5765	4,2076	6,1794	2,9981	4,4137	4,2220
R	0,2147	0,2889	0,3395	0,1770	0,3483	0,1000	0,1611
Text #	15	16	17	18	19	20	21
Symbols	35743	463287	105466	102350	26770	68811	47690
Words	3890	39250	11425	11922	3013	7466	5739
AWL	7.6270	6.2498	7.6229	7.0478	7.6777	6.8924	7.0585
%LP	5,5270%	4,5172%	14,0919%	5,7876%	9,8241%	3,8843%	4,6001%
%GP	1,6710%	1,8497%	1,7155%	1,8202%	0,9957%	1,3662%	2,9273%
D	3,8560	2,6675	12,3764	3,9674	8,8284	2,5181	1,6728
R	0,3023	0,4095	0,1217	0,3145	0,1014	0,3517	0,6364
Text #	22	23	24	25	Average value
Symbols	69531	140400	80977	47342	86553.68
Words	8277	17074	9739	5411	9053.96
AWL	6.9447	6.7259	6.8607	7.2689	6.8092
%LP	5,5817%	8,3870%	7,2081%	2,3840%	5,8345%
%GP	1,5344%	3,0749%	1,8072%	5,3964%	2,0127%
D	4,0473	5,3121	5,4009	-3,0124	3,8218
R	0,2749	0,3666	0,2507	2,2636	0,4275

The data demonstrate clear differences between formal and informal texts both in terms of average word length and potential polysemousness.

1) In 100% of informal texts, AWL value is below 6; in 94% of formal texts, AWL value is above 6.

2) In 86% of informal texts, %LP is below 4%; in 84% of formal texts, %LP is above 4%.

3) In 94% of informal texts, %GP is above 4%; in 86% of formal texts, %GP is below 4%.

4) In 94% of informal texts, D is below 0; in 90% of formal texts, D is above 0.

5) Consistent with the D values, R is above 1 in informal texts and below 1 in formal texts.
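The five criteria above can be sketched in Python. The sketch assumes that D is %LP minus %GP and that R is %GP divided by %LP, which is how the tabulated values relate to one another (for example, text 1 of Table 12). The names `TextStats` and `formal_votes` are illustrative and not part of the original study, and the thresholds are simply those quoted in the list above.

```python
from dataclasses import dataclass


@dataclass
class TextStats:
    """Per-text statistics; percentages are given as plain numbers (7.7478 = 7.7478%)."""
    awl: float  # average word length
    lp: float   # %LP: share of words with "lower polysemy" initial graphemes
    gp: float   # %GP: share of words with "greater polysemy" initial graphemes

    @property
    def d(self) -> float:
        # Assumed definition: D = %LP - %GP
        return self.lp - self.gp

    @property
    def r(self) -> float:
        # Assumed definition: R = %GP / %LP (undefined when %LP is zero)
        return self.gp / self.lp if self.lp else float("inf")


def formal_votes(stats: TextStats) -> dict:
    """For each criterion, report whether the value falls on the 'formal' side."""
    return {
        "AWL": stats.awl > 6.0,
        "%LP": stats.lp > 4.0,
        "%GP": stats.gp < 4.0,
        "D": stats.d > 0.0,
        "R": stats.r < 1.0,
    }


# Values taken from Table 12 (juridical texts 1 and 25).
text1 = TextStats(awl=6.9340, lp=7.7478, gp=1.5970)
text25 = TextStats(awl=7.2689, lp=2.3840, gp=5.3964)

for name, stats in (("text 1", text1), ("text 25", text25)):
    print(name, round(stats.d, 4), round(stats.r, 4), formal_votes(stats))
```

For text 1 every criterion falls on the formal side. For text 25 only the AWL criterion does, which makes it one of the anomalous formal texts in terms of potential polysemousness.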

5. CONCLUSIONS

The gathered data and their analysis support the following conclusions:

5.1. The difference between the percentage share of initial graphemes associated with lower degrees of polysemy and the share of initial graphemes associated with greater degrees of polysemy can be used in automated text processing as the most reliable and most easily computed value reflecting the potential polysemousness of a Russian-language text.

5.2. The combination of average word length and potential polysemousness enables reliable differentiation of Russian formal texts from Russian informal texts. In the selection of texts analyzed above, anomalies of average word length are always offset by correct values of the marker of potential polysemousness, while anomalies of potential polysemousness are always offset by correct values of average word length; therefore, no incorrect verdicts would have been produced if these two criteria had been used together within a detection algorithm.
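The claim that the two criteria compensate for each other can be checked against the tabulated data. The sketch below does this for the juridical texts only, using the AWL and D rows transcribed from Table 12 and the thresholds quoted in the observations (AWL above 6, D above 0). The function name `anomalies` and the variable names are illustrative, and the check is not the detection algorithm itself, which the text does not specify in detail.

```python
# AWL and D rows of Table 12 (juridical texts 1..25), transcribed as given.
AWL = [
    6.9340, 6.4242, 6.8405, 6.2651, 6.9853, 6.4002, 6.3634,
    5.8529, 7.2262, 7.0822, 7.5056, 6.3165, 6.9942, 5.0633,
    7.6270, 6.2498, 7.6229, 7.0478, 7.6777, 6.8924, 7.0585,
    6.9447, 6.7259, 6.8607, 7.2689,
]
D = [
    6.1508, 4.5861, 3.2833, 0.6452, 4.8247, 0.2668, 1.3737,
    2.1820, 2.5765, 4.2076, 6.1794, 2.9981, 4.4137, 4.2220,
    3.8560, 2.6675, 12.3764, 3.9674, 8.8284, 2.5181, 1.6728,
    4.0473, 5.3121, 5.4009, -3.0124,
]

AWL_THRESHOLD = 6.0  # formal texts are expected to lie above this value
D_THRESHOLD = 0.0    # formal texts are expected to lie above this value


def anomalies(values, threshold):
    """Return the 1-based numbers of texts whose value is not above the threshold."""
    return [i for i, v in enumerate(values, start=1) if v <= threshold]


awl_anomalies = anomalies(AWL, AWL_THRESHOLD)
d_anomalies = anomalies(D, D_THRESHOLD)
# Texts that would look informal on both criteria at once.
both = sorted(set(awl_anomalies) & set(d_anomalies))

print("AWL anomalies:", awl_anomalies)
print("D anomalies:", d_anomalies)
print("Anomalous on both criteria:", both)
```

On these data, texts 8 and 14 fall below the AWL threshold and text 25 has a negative D, but no juridical text is anomalous on both criteria at once.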

6. REFERENCES

1. S.I. Ozhegov. Dictionary of the Russian language. Edited by professor L.I. Skvortsov. 25^th edition, modified and extended. Oniks, 2008.

2. The contemporary dictionary of the Russian language. Supervised by S.I. Kuznetsov. Reader’s Digest, 2004.

3. The minor academic dictionary of the Russian language. (In 4 volumes.) Edited by A.P. Yevgenyeva. 4^th edition, stereotypical. Russkiy Yazyk, 1999.

4. V.A. Seleznev, Ye.V. Isaeva. Hurst parameter of the dictionary sequence. Materials of scientific conference “Quantitative linguistics: researches and models” (КЛИМ-2005): 146-152 (2005). (In Russian)

5. M.N. Kozhina. Stylistics of the Russian language. Flinta, 2008. (In Russian)

6. O.G. Shevelev. Research and development on algorithms of comparison of styles of textual compositions. Abstract of Ph.D. thesis in applied sciences. Tomsk State University, 2006. (In Russian)

7. S.V. Gusarenko. Systematic interaction and entropy of cognitive and semantic structures of the discourse. Abstract of Ph.D. thesis in philology. Stavropol State University, 2009. (In Russian)

8. N.V. Golovko. Estimation of semantic potential of texts in analytical systems. Lambert Academic Publishing, 2012. (In Russian)

9. Quantitative Linguistics: an international handbook. Edited by R. Köhler, G. Altmann, R.G. Piotrowski. [Google Books] http://books.google.ru/books?id=-7Z-GA73MMAC&dprintsec=frontcover&hl=ru

10. C. Apte, F. Damerau, S. Weiss. Automated Learning of Decision Rules for Text Categorization. [Penn State College of Information Sciences and Technology] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.3129&rep=rep1&type=pdf

11. F. Beil, M. Ester, X. Xu. Frequent Term-Based Text Clustering. [University of Arkansas at Little Rock] https://www.cs.sfu.ca/~ester/papers/KDD02.Clustering.final.pdf

12. P. Cimiano, S. Staab, J. Tane. Deriving Concept Hierarchies from Text by Smooth Formal Concept Analysis. [University of Karlsruhe] http://www.aifb.kit.edu/images/a/a5/2003_156_Cimiano_Deriving_Concep_1.ps

13. D. Ferrucci. Text analysis as formal inference for the purposes of uniform tracing and explanation generation. [IBM] http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/aba6e5969bd737fe85256f2c006d96fe?OpenDocument

14. A. Mehler, U. Waltinger, A. Wegner. A Formal Text Representation Model Based on Lexical Chaining. [Bielefeld University] http://www.ulliwaltinger.de/pdf/LNVD07MehlerWaltingerWegner.pdf

15. S.J. Gruber. Frame information and lexically-based inference. Quaderni di semantica. 2(6): 58-78 (1985).

16. Th.R. Hofmann. Semantic frames and content representation. Quaderni di semantica. 2(6): 267-284 (1985).

17. D.I. Holmes. The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic Computing. 13(3): 111-117 (1998).

18. R. Hudson. Some basic assumptions about linguistic and non-linguistic knowledge. Quaderni di semantica. 2(6): 284-287 (1985).

19. F. Kiefer. How to account for situational meaning? Quaderni di semantica. 2(6): 288-295 (1985).

20. H. Ôim, M. Saluveer. Frames in linguistic descriptions. Quaderni di semantica. 2(6): 295-305 (1985).

21. R.D. Peng, N.W. Hengartner. Quantitative analysis of literary styles. The American Statistician. 56(3): 175-185 (2002).

22. J.J. Rocchio, Jr. Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing. 313-323 (1971).

23. R. Schank, R. Abelson. Scripts, Plans, Goals and Understanding. Hillsdale, 1977.

24. D. Tannen. Frames and schemas in interaction. Quaderni di semantica. 2(6): 326-335 (1985).

25. Y. Wilks. Text structures and knowledge structures. Quaderni di semantica. 2(6): 335-344 (1985).