DATA ON DISTRIBUTION OF POLYSEMY IN RUSSIAN DICTIONARIES AND THEIR PRACTICAL APPLICATION IN AUTOMATED TEXT ANALYSIS
Nikolay Golovko
North Caucasus Federal University, Humanities Institute
E-mail: nvgolovko@inbox.ru
ABSTRACT
Statistical analysis of Russian dictionary databases has revealed certain patterns in the distribution of polysemous lexical units across dictionary sections. Correlations have been found between the initial graphemes of Russian words, the number of their meanings and their borrowed or native origin; these correlations were investigated further in order to determine their possible practical application. It has been found that calculations based on the percentage shares of words whose initial graphemes are associated with lower and greater degrees of polysemy can be used in automated text analysis, in combination with average word length values, to reliably differentiate samples of formal speech from samples of informal speech in the Russian language.
Keywords: lexical polysemy, lexicography, statistical analysis in linguistics, automated text analysis, average word length, stylistics.
1. INTRODUCTION
In recent years, we have been collecting statistical data on the evolution of the Russian lexicon. We were particularly interested in the historical development of polysemy and in the way it is represented in scientifically recognized dictionaries. The data and their analysis have revealed certain tendencies, which we describe in this article.
Since these tendencies are considered here in relation to automated text analysis, it should be noted that there is a substantial body of research on the connections between the formal aspects of natural-language texts and the content of these texts. Historically, the rapid development of computing and the substantial increase in the amount of information to be processed by analytical systems led to the first studies in this area in the second half of the 20th century [Rocchio 1971; Schank, Abelson 1977]. These were soon followed by a number of frame-based studies [Gruber 1985; Hoffmann 1985; Õim, Saluveer 1985; Tannen 1985], accompanied by work on the differentiation between various types of meanings [Hudson 1985; Kiefer 1985] and on the discovery of inner structures within texts [Wilks 1985]. The field then evolved in the direction of text clustering and categorization [Apte, Damerau, Weiss 1994; Beil, Ester, Xu 2002] and formal conceptual analysis [Cimiano, Staab, Tane 2003; Ferrucci 2004; Mehler, Waltinger, Wegner 2007]. Attempts have also been made at automatic recognition and evaluation of styles [Holmes 1998; Peng, Hengartner 2002].
2. MATERIALS AND METHODS
The most influential, comprehensive and widely accepted Russian dictionaries (Ozhegov’s, Kuznetsov’s and the Minor Academic) were selected as the database sources for the initial research. We processed these databases and determined the absolute quantities of monosemous words and of words characterized by a lower degree of polysemy (2 possible meanings), a medium degree of polysemy (3 to 5 possible meanings) and a greater degree of polysemy (6 meanings and above). In addition, we calculated the relative quantities of polysemous words as shares of the total number of dictionary entries. The statistical analysis followed the alphabetical arrangement of words in these three dictionaries; thus, the data were collected separately for each alphabetical group of words.
The quantities determined and calculated for each dictionary can be found in Tables 1-3. All tables share the following abbreviations: M – quantity of monosemous words, P – quantity of polysemous words, LP – lower degree of polysemy, MP – medium degree of polysemy, GP – greater degree of polysemy.
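The counting procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names `degree` and `polysemy_profile`, the input format (pairs of a headword and its number of meanings) and the sample entries are assumptions made for the example.

```python
from collections import Counter, defaultdict


def degree(meanings: int) -> str:
    """Map a number of meanings to the classes used in Tables 1-3."""
    if meanings <= 1:
        return "M"   # monosemous
    if meanings == 2:
        return "LP"  # lower degree of polysemy
    if meanings <= 5:
        return "MP"  # medium degree of polysemy (3 to 5 meanings)
    return "GP"      # greater degree of polysemy (6 meanings and above)


def polysemy_profile(entries):
    """Count words per class for each initial grapheme and compute % P.

    `entries` is an iterable of (headword, number_of_meanings) pairs;
    this input format is an assumption made for the example.
    """
    counts = defaultdict(Counter)
    for word, meanings in entries:
        counts[word[0].upper()][degree(meanings)] += 1
    profile = {}
    for grapheme, c in counts.items():
        total = sum(c.values())
        polysemous = c["LP"] + c["MP"] + c["GP"]
        profile[grapheme] = {
            "Words": total, "M": c["M"], "P": polysemous,
            "LP": c["LP"], "MP": c["MP"], "GP": c["GP"],
            "% P": round(100 * polysemous / total, 2),
        }
    return profile


# Hypothetical sample: the meaning counts are illustrative, not dictionary data.
sample = [("щи", 1), ("щека", 3), ("щит", 6), ("щель", 2)]
profile = polysemy_profile(sample)
```

Applied to a full dictionary database, such a profile would contain one row per initial grapheme with the same columns as Tables 1-3 (the % LP, % MP and % GP shares can be added in the same way as % P).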
Table 1. Results of statistical analysis for Ozhegov’s Dictionary of the Russian Language
| Initial grapheme | Words | M | P | LP | MP | GP | % P | % LP | % MP | % GP |
| А | 726 | 575 | 151 | 130 | 21 | 0 | 20.80% | 17.91% | 2.89% | 0.00% |
| Б | 1497 | 1176 | 321 | 250 | 67 | 4 | 21.44% | 16.70% | 4.48% | 0.27% |
| В | 2514 | 1915 | 599 | 414 | 169 | 16 | 23.83% | 16.47% | 6.72% | 0.64% |
| Г | 993 | 769 | 224 | 156 | 63 | 5 | 22.56% | 15.71% | 6.34% | 0.50% |
| Д | 1396 | 1030 | 366 | 275 | 76 | 15 | 26.22% | 19.70% | 5.44% | 1.07% |
| Е (Ё) | 133 | 94 | 39 | 26 | 11 | 2 | 29.32% | 19.55% | 8.27% | 1.50% |
| Ж | 301 | 206 | 95 | 63 | 28 | 4 | 31.56% | 20.93% | 9.30% | 1.33% |
| З | 1788 | 1314 | 474 | 340 | 123 | 11 | 26.51% | 19.02% | 6.88% | 0.62% |
| И | 935 | 719 | 216 | 172 | 40 | 4 | 23.10% | 18.40% | 4.28% | 0.43% |
| Й | 5 | 4 | 1 | 1 | 0 | 0 | 20.00% | 20.00% | 0.00% | 0.00% |
| К | 2259 | 1706 | 553 | 396 | 149 | 8 | 24.48% | 17.53% | 6.60% | 0.35% |
| Л | 781 | 592 | 189 | 138 | 46 | 5 | 24.20% | 17.67% | 5.89% | 0.64% |
| М | 1404 | 1104 | 300 | 212 | 84 | 4 | 21.37% | 15.10% | 5.98% | 0.28% |
| Н | 2229 | 1670 | 559 | 411 | 142 | 6 | 25.08% | 18.44% | 6.37% | 0.27% |
| О | 2748 | 1974 | 774 | 560 | 193 | 21 | 28.17% | 20.38% | 7.02% | 0.76% |
| П | 6287 | 4685 | 1602 | 1135 | 429 | 38 | 25.48% | 18.05% | 6.82% | 0.60% |
| Р | 2102 | 1444 | 658 | 458 | 187 | 13 | 31.30% | 21.79% | 8.90% | 0.62% |
| С | 3607 | 2691 | 916 | 624 | 258 | 34 | 25.40% | 17.30% | 7.15% | 0.94% |
| Т | 1272 | 936 | 336 | 212 | 113 | 11 | 26.42% | 16.67% | 8.88% | 0.86% |
| У | 1056 | 752 | 304 | 218 | 80 | 6 | 28.79% | 20.64% | 7.58% | 0.57% |
| Ф | 490 | 358 | 132 | 97 | 32 | 3 | 26.94% | 19.80% | 6.53% | 0.61% |
| Х | 441 | 321 | 120 | 82 | 29 | 9 | 27.21% | 18.59% | 6.58% | 2.04% |
| Ц | 196 | 142 | 54 | 38 | 16 | 0 | 27.55% | 19.39% | 8.16% | 0.00% |
| Ч | 488 | 374 | 114 | 77 | 32 | 5 | 23.36% | 15.78% | 6.56% | 1.02% |
| Ш | 509 | 391 | 118 | 94 | 23 | 1 | 23.18% | 18.47% | 4.52% | 0.20% |
| Щ | 72 | 46 | 26 | 17 | 8 | 1 | 36.11% | 23.61% | 11.11% | 1.39% |
| Э | 323 | 255 | 68 | 53 | 15 | 0 | 21.05% | 16.41% | 4.64% | 0.00% |
| Ю | 45 | 34 | 11 | 11 | 0 | 0 | 24.44% | 24.44% | 0.00% | 0.00% |
| Я | 135 | 90 | 45 | 30 | 14 | 1 | 33.33% | 22.22% | 10.37% | 0.74% |
| Total | 36732 | 27367 | 9365 | 6690 | 2448 | 227 | | | | |
Table 2. Results of statistical analysis for Kuznetsov’s Contemporary Dictionary of the Russian Language
| Initial grapheme | Words | M | P | LP | MP | GP | % P | % LP | % MP | % GP |
| А | 884 | 593 | 291 | 224 | 67 | 0 | 32.92% | 25.34% | 7.58% | 0.00% |
| Б | 1623 | 1023 | 600 | 414 | 171 | 15 | 36.97% | 25.51% | 10.54% | 0.92% |
| В | 2654 | 1662 | 992 | 630 | 317 | 45 | 37.38% | 23.74% | 11.94% | 1.70% |
| Г | 1083 | 656 | 427 | 278 | 129 | 20 | 39.43% | 25.67% | 11.91% | 1.85% |
| Д | 1506 | 968 | 538 | 335 | 184 | 19 | 35.72% | 22.24% | 12.22% | 1.26% |
| Е (Ё) | 136 | 83 | 53 | 31 | 19 | 3 | 38.97% | 22.79% | 13.97% | 2.21% |
| Ж | 281 | 166 | 115 | 76 | 34 | 5 | 40.93% | 27.05% | 12.10% | 1.78% |
| З | 1971 | 1190 | 781 | 490 | 265 | 26 | 39.62% | 24.86% | 13.44% | 1.32% |
| И | 1064 | 664 | 400 | 274 | 119 | 7 | 37.59% | 25.75% | 11.18% | 0.66% |
| Й | 6 | 4 | 2 | 2 | 0 | 0 | 33.33% | 33.33% | 0.00% | 0.00% |
| К | 2548 | 1527 | 1021 | 614 | 361 | 46 | 40.07% | 24.10% | 14.17% | 1.81% |
| Л | 841 | 509 | 332 | 200 | 120 | 12 | 39.48% | 23.78% | 14.27% | 1.43% |
| М | 1564 | 874 | 690 | 427 | 241 | 22 | 44.12% | 27.30% | 15.41% | 1.41% |
| Н | 2457 | 1474 | 983 | 636 | 313 | 34 | 40.01% | 25.89% | 12.74% | 1.38% |
| О | 2960 | 1685 | 1275 | 769 | 455 | 51 | 43.07% | 25.98% | 15.37% | 1.72% |
| П | 6903 | 3846 | 3057 | 1795 | 1106 | 156 | 44.29% | 26.00% | 16.02% | 2.26% |
| Р | 2299 | 1324 | 975 | 593 | 341 | 41 | 42.41% | 25.79% | 14.83% | 1.78% |
| С | 4060 | 2271 | 1789 | 1005 | 654 | 130 | 44.06% | 24.75% | 16.11% | 3.20% |
| Т | 1390 | 769 | 621 | 344 | 232 | 45 | 44.68% | 24.75% | 16.69% | 3.24% |
| У | 1160 | 587 | 573 | 318 | 220 | 35 | 49.40% | 27.41% | 18.97% | 3.02% |
| Ф | 574 | 339 | 235 | 146 | 82 | 7 | 40.94% | 25.44% | 14.29% | 1.22% |
| Х | 454 | 242 | 212 | 122 | 67 | 23 | 46.70% | 26.87% | 14.76% | 5.07% |
| Ц | 208 | 99 | 109 | 55 | 47 | 7 | 52.40% | 26.44% | 22.60% | 3.37% |
| Ч | 469 | 248 | 221 | 123 | 86 | 12 | 47.12% | 26.23% | 18.34% | 2.56% |
| Ш | 571 | 324 | 247 | 168 | 74 | 5 | 43.26% | 29.42% | 12.96% | 0.88% |
| Щ | 74 | 39 | 35 | 16 | 17 | 2 | 47.30% | 21.62% | 22.97% | 2.70% |
| Э | 410 | 261 | 149 | 104 | 42 | 3 | 36.34% | 25.37% | 10.24% | 0.73% |
| Ю | 54 | 33 | 21 | 13 | 8 | 0 | 38.89% | 24.07% | 14.81% | 0.00% |
| Я | 145 | 85 | 60 | 15 | 42 | 3 | 41.38% | 10.34% | 28.97% | 2.07% |
| Total | 40349 | 23545 | 16804 | 10217 | 5813 | 774 | | | | |
Table 3. Results of statistical analysis for Minor Academic Dictionary of the Russian Language
| Initial grapheme | Words | M | P | LP | MP | GP | % P | % LP | % MP | % GP |
| А | 1600 | 1338 | 262 | 222 | 40 | 0 | 16.38% | 13.88% | 2.50% | 0.00% |
| Б | 3122 | 2577 | 545 | 409 | 128 | 8 | 17.46% | 13.10% | 4.10% | 0.26% |
| В | 6049 | 4586 | 1463 | 1095 | 337 | 31 | 24.19% | 18.10% | 5.57% | 0.51% |
| Г | 2274 | 1821 | 453 | 306 | 137 | 10 | 19.92% | 13.46% | 6.02% | 0.44% |
| Д | 3528 | 2838 | 690 | 525 | 149 | 16 | 19.56% | 14.88% | 4.22% | 0.45% |
| Е (Ё) | 226 | 165 | 61 | 43 | 16 | 2 | 26.99% | 19.03% | 7.08% | 0.88% |
| Ж | 630 | 489 | 141 | 91 | 46 | 4 | 22.38% | 14.44% | 7.30% | 0.63% |
| З | 4828 | 3673 | 1155 | 846 | 284 | 25 | 23.92% | 17.52% | 5.88% | 0.52% |
| И | 2250 | 1654 | 596 | 455 | 131 | 10 | 26.49% | 20.22% | 5.82% | 0.44% |
| Й | 18 | 16 | 2 | 2 | 0 | 0 | 11.11% | 11.11% | 0.00% | 0.00% |
| К | 5056 | 3870 | 1186 | 800 | 350 | 36 | 23.46% | 15.82% | 6.92% | 0.71% |
| Л | 1768 | 1357 | 411 | 298 | 100 | 13 | 23.25% | 16.86% | 5.66% | 0.74% |
| М | 3473 | 2661 | 812 | 557 | 242 | 13 | 23.38% | 16.04% | 6.97% | 0.37% |
| Н | 5846 | 4386 | 1460 | 1047 | 379 | 34 | 24.97% | 17.91% | 6.48% | 0.58% |
| О | 6646 | 4794 | 1852 | 1347 | 459 | 46 | 27.87% | 20.27% | 6.91% | 0.69% |
| П | 16594 | 12415 | 4179 | 3199 | 890 | 90 | 25.18% | 19.28% | 5.36% | 0.54% |
| Р | 4725 | 3284 | 1441 | 980 | 418 | 43 | 30.50% | 20.74% | 8.85% | 0.91% |
| С | 8611 | 6317 | 2294 | 1481 | 701 | 112 | 26.64% | 17.20% | 8.14% | 1.30% |
| Т | 2969 | 2188 | 781 | 491 | 252 | 38 | 26.31% | 16.54% | 8.49% | 1.28% |
| У | 2582 | 1686 | 896 | 637 | 238 | 21 | 34.70% | 24.67% | 9.22% | 0.81% |
| Ф | 1347 | 1053 | 294 | 214 | 74 | 6 | 21.83% | 15.89% | 5.49% | 0.45% |
| Х | 1008 | 776 | 232 | 141 | 74 | 17 | 23.02% | 13.99% | 7.34% | 1.69% |
| Ц | 491 | 349 | 142 | 91 | 48 | 3 | 28.92% | 18.53% | 9.78% | 0.61% |
| Ч | 1134 | 861 | 273 | 174 | 87 | 12 | 24.07% | 15.34% | 7.67% | 1.06% |
| Ш | 1212 | 904 | 308 | 213 | 91 | 4 | 25.41% | 17.57% | 7.51% | 0.33% |
| Щ | 156 | 113 | 43 | 24 | 17 | 2 | 27.56% | 15.38% | 10.90% | 1.28% |
| Э | 881 | 686 | 195 | 148 | 43 | 4 | 22.13% | 16.80% | 4.88% | 0.45% |
| Ю | 74 | 50 | 24 | 18 | 6 | 0 | 32.43% | 24.32% | 8.11% | 0.00% |
| Я | 193 | 111 | 82 | 54 | 25 | 3 | 42.49% | 27.98% | 12.95% | 1.55% |
| Total | 89291 | 67018 | 22273 | 15908 | 5762 | 603 | | | | |
Further analysis of these data revealed certain peculiar tendencies in the distribution of polysemous words across dictionary sections.
It has become evident that in certain cases the absolute and relative quantities of polysemous words seem to correlate with the initial grapheme. While the majority of dictionary sections demonstrate a similar, homogeneous distribution of words characterized by various degrees of polysemy, a number of sections deviate notably from the general picture. It is also important that these deviations can be observed in all three dictionary databases, which suggests a potentially universal nature of the correlations between initial graphemes and degrees of polysemy.
Dictionary sections А and Й can serve as examples of obviously low degrees of polysemy. None of the three dictionaries under analysis records any word with these initial graphemes that has 6 meanings or more; the А section is also characterized by a relatively low percentage share of words that possess 3 or more meanings, and the Й section consists entirely of words that have 1 or 2 meanings.
Dictionary sections Щ and Я represent a notable counterexample. The relative quantities of moderately and greatly polysemous words (i.e. having 3 meanings and above) with these initial graphemes clearly exceed the values found in other sections.
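As a small worked illustration of these deviations, the share of entries with 3 or more meanings (MP + GP) can be compared with the dictionary-wide share. The figures below are copied from Table 1 (Ozhegov’s dictionary); the helper name `share_3plus` and the choice of the dictionary total as a baseline are assumptions made for this sketch.

```python
# (Words, MP, GP) copied from Table 1 (Ozhegov's dictionary).
ozhegov = {
    "А": (726, 21, 0),
    "Й": (5, 0, 0),
    "Щ": (72, 8, 1),
    "Я": (135, 14, 1),
}
total = (36732, 2448, 227)  # "Total" row of Table 1


def share_3plus(words, mp, gp):
    """Percentage of entries that have 3 or more meanings (MP + GP)."""
    return 100 * (mp + gp) / words


# Baseline: share of such entries in the whole dictionary (an assumption
# of this sketch, not a threshold defined in the study).
baseline = share_3plus(*total)

# Deviation of each section from the baseline, in percentage points.
deviation = {
    grapheme: round(share_3plus(*row) - baseline, 2)
    for grapheme, row in ozhegov.items()
}
```

Under these assumptions, the sections Щ and Я lie above the dictionary-wide share, while А and Й lie below it.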
These results and considerations led us to investigate the described correlations in more depth.
3. DISCUSSION
Although an interdependence between lexical polysemy, which represents the ideal side of language symbols, and the initial graphemes of words, which are associated with the material side, might be regarded as unusual, there are theoretical grounds to claim that it is neither coincidental nor arbitrary. Both Russian and international studies in quantitative linguistics include successful efforts to establish links between the material form of lexical units on the one hand and linguistic or even extra-linguistic properties of language symbols on the other. A remarkable example is the adaptation of the Zipf-Mandelbrot law to linguistic phenomena, which demonstrates that the frequency of a word in both oral and written speech is inversely related to its length (i.e. the number of graphemes and/or sounds contained in it) [Seleznev, Isaeva 2005; Köhler, Altmann, Piotrowski 2005]. Word length, being a purely physical attribute that can be measured with relative ease, is also traditionally recognized as a valuable and reliable criterion for determining various inner characteristics of texts, including their functional style.
The validity of the Zipf-Mandelbrot law in its application to natural languages can be explained philosophically, in terms of meta-science. It is widely known that any natural language is driven in part by the principle of linguistic economy, which encourages speakers and writers to minimize their effort and to use the smallest possible quantity of language units to convey the intended message. A speech community therefore tends to prefer simpler methods of expression to more complicated ones, and thus uses shorter words more frequently.
The same approach of correlating linguistic and extra-linguistic phenomena can be applied to the statistical data that we have collected.
All the non-standard dictionary sections mentioned above share one peculiar feature. The initial graphemes А and Й, which show the lowest relative degrees of polysemy in all three dictionaries, are not typical of the Russian language. The Russian vocabulary contains only a minuscule number of native words that begin with the grapheme А (and, consequently, with the corresponding sound), whereas the majority of words in the А section and all lexical units in the Й section are borrowed from other languages. In addition, these borrowed words are either formal or used as scientific terms.
In contrast to А and Й, the initial graphemes Щ and Я, which are characterized by the greatest percentage shares of polysemous words, represent typical Russian sounds that are frequently found in native lexical units but are seldom or never associated with borrowed words. The grapheme Щ is especially notable in this respect, because the dictionaries contain no borrowed words with this initial grapheme.
We consider it reasonably safe to suppose that lexical polysemy is related to the borrowed or native origin of words. Indeed, a lexical unit acquires new meanings through the historical development of the language system, as well as through frequent use; thus, older and more actively used words should generally possess a more extended semantic structure than newer and less common lexemes. A word’s age can be connected to its origin, assuming that native words are relatively older and borrowed words are relatively newer, and frequency of use is closely associated with word length, as discussed above.
Summarizing these considerations, we arrived at the following statements.
1.1. It can be expected that a Russian word with 3 or more meanings would be native and more frequent in speech practice, and that a Russian word with 1 or 2 meanings would be borrowed and less frequent in speech practice.
1.2. A Russian word’s frequency of use can be estimated by its length, and its origin can be estimated by its initial grapheme.
1.3. A Russian word’s length and initial grapheme can be used as formal physical markers to estimate the number of meanings it is associated with.
These
statements were used as the foundation for further research and analysis.
In order to use the initial graphemes as formal markers, we needed to determine the degree of reliability that could be associated with each of the corresponding dictionary sections. This was necessary to exclude random deviations in the statistical data caused by the peculiarities of individual dictionaries; in addition, the graphemes had to show distinct positive or negative differences from the typical percentage shares found in the majority of sections. For this purpose, the initial graphemes with the highest and the lowest relative amounts of polysemous entries, as well as of words with medium and greater degrees of polysemy (both separately and combined), were selected from each dictionary database and are presented in the following tables.
All
tables share abbreviations with Tables 1-3. The values in columns where
difference exceeds 1% are rounded. The grey zone indicates graphemes that are
located significantly close to the general threshold (0.5% or less below or
above the threshold).
Table 4. Selections from statistical data based on Ozhegov’s Dictionary of the Russian Language
| Initial grapheme | % P | Initial grapheme | % MP | Initial grapheme | % GP | Initial grapheme | % MP+GP |
| Щ | 36% | Щ | 11% | Х | 2.04% | Щ | 13% |
| Я | 33% | Я | 10% | Е(Ё) | 1.50% | Я | 11% |
| Ж | 32% | Ж | 9% | Щ | 1.39% | Ж | 11% |
| Р | 31% | Р | 9% | Ж | 1.33% | Е(Ё) | 10% |
| Е(Ё) | 29% | Т | 9% | Д | 1.07% | Т | 10% |
| У | 29% | | | Ч | 1.02% | Р | 10% |
| | | | | | | Б | 5% |
| | | | | | | Ш | 5% |
| Б | 21% | Б | 4% | Ц | 0.00% | И | 5% |
| М | 21% | И | 4% | Э | 0.00% | Э | 5% |
| Э | 21% | А | 3% | А | 0.00% | А | 3% |
| А | 21% | Ю | 0% | Ю | 0.00% | Ю | 0% |
| Й | 20% | Й | 0% | Й | 0.00% | Й | 0% |
Table 5. Selections from statistical data based on Kuznetsov’s Contemporary Dictionary of the Russian Language
| Initial grapheme | % P | Initial grapheme | % MP | Initial grapheme | % GP | Initial grapheme | % MP+GP |
| Ц | 52% | Я | 29% | Х | 5% | Я | 31% |
| У | 49% | Щ | 23% | Ц | 3% | Ц | 26% |
| Щ | 47% | Ц | 23% | Т | 3% | Щ | 26% |
| Ч | 47% | У | 19% | С | 3% | У | 22% |
| Х | 47% | Ч | 18% | У | 3% | Ч | 21% |
| | | | | Б | 0.92% | | |
| В | 37% | | | Ш | 0.88% | | |
| Б | 37% | И | 11% | Э | 0.73% | И | 12% |
| Э | 36% | Б | 11% | И | 0.66% | Б | 11% |
| Д | 36% | Э | 10% | Ю | 0.00% | Э | 11% |
| Й | 33% | А | 8% | А | 0.00% | А | 8% |
| А | 33% | Й | 0% | Й | 0.00% | Й | 0% |
Table 6. Selections from statistical data based on Minor Academic Dictionary of the Russian Language
| Initial grapheme | % P | Initial grapheme | % MP | Initial grapheme | % GP | Initial grapheme | % MP+GP |
| Я | 42% | Я | 13% | Х | 1.69% | Я | 15% |
| У | 35% | Щ | 11% | Я | 1.55% | Щ | 12% |
| Ю | 32% | Ц | 10% | С | 1.30% | Ц | 10% |
| Р | 30% | У | 9% | Щ | 1.28% | У | 10% |
| Ц | 29% | Р | 9% | Т | 1.28% | Т | 10% |
| | | | | Ч | 1.06% | Р | 10% |
| | | Ф | 5% | | | | |
| | | П | 5% | М | 0.37% | | |
| Г | 20% | Э | 5% | Ш | 0.33% | Э | 5% |
| Д | 20% | Д | 4% | Б | 0.26% | Д | 5% |
| Б | 17% | Б | 4% | Ю | 0.00% | Б | 4% |
| А | 16% | А | 3% | А | 0.00% | А | 3% |
| Й | 11% | Й | 0% | Й | 0.00% | Й | 0% |
We
have set the following criteria to determine degrees of reliability:
1) Initial graphemes that are found in the white zone of all four types
of percentage shares are recognized as the most reliable.
2) Initial graphemes that are found in the white zone of three types of
percentage shares out of four are recognized as more reliable.
3) Initial graphemes that are found in the white or grey zones of all
four types of percentage shares are recognized as reliable.
4) Initial graphemes that are found in the white or grey zones of three
types of percentage shares out of four are recognized as less reliable.
5) All the remaining initial graphemes are recognized as the least
reliable and are not taken into account in further analysis.
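The five criteria can be read as a simple decision rule, sketched below. This is an illustrative reading and not the procedure actually used for the study: the function name `reliability`, the encoding of zones as the strings "white" and "grey" (with `None` for a grapheme that is absent from a selection) and the order in which the criteria are checked are assumptions.

```python
def reliability(zones):
    """Classify a grapheme from its zone in the four share types.

    `zones` holds four values, one per share type (% P, % MP, % GP,
    % MP+GP): "white", "grey", or None when the grapheme is not listed.
    The criteria are checked in the order 1 to 5 (an assumption).
    """
    if len(zones) != 4:
        raise ValueError("expected one zone per share type")
    white = sum(z == "white" for z in zones)
    listed = sum(z in ("white", "grey") for z in zones)
    if white == 4:
        return "most reliable"    # criterion 1
    if white == 3:
        return "more reliable"    # criterion 2
    if listed == 4:
        return "reliable"         # criterion 3
    if listed == 3:
        return "less reliable"    # criterion 4
    return "least reliable"       # criterion 5: excluded from further analysis


# Hypothetical inputs, for illustration only.
examples = {
    "all white": reliability(["white", "white", "white", "white"]),
    "three white": reliability(["white", "white", None, "white"]),
    "white or grey": reliability(["white", "grey", "grey", "white"]),
    "three listed": reliability(["grey", "white", None, "white"]),
    "two listed": reliability(["white", None, None, "white"]),
}
```

The sketch does not check whether the four positions lie on the same side (higher or lower polysemy); in practice that would have to be verified as well.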
The
criteria were applied to each selection separately. The resulting picture is
represented in Table 7. The table contains the following abbreviations: (+) –
higher degrees of polysemy, (-) – lower degrees of polysemy, MtR – most reliable graphemes, MrR – more reliable graphemes, R – reliable graphemes, LR – less reliable graphemes.
Table 7. Distribution of selected initial graphemes across all dictionaries
| Dictionary / Graphemes | Ozhegov (+) | Ozhegov (-) | Kuznetsov (+) | Kuznetsov (-) | Minor Academic (+) | Minor Academic (-) |
| MtR | Ж, Щ | А, Й | У, Ц | А, Й, Э | Я | А, Б, Й |
| MrR | Е(Ё), Я | Э, Ю | Ч, Щ | И | У, Ц, Щ | Д |
| R | - | - | - | Б | - | - |
| LR | Р | Б | - | - | - | - |
The results were compared and systematized in order to formulate the final conclusions:
2.1. Initial graphemes Щ, А and Й are characterized by the greatest degree of reliability as possible formal markers of lexical polysemy. These graphemes are either the most reliable or more reliable in all three dictionaries.
2.2. Initial graphemes У, Ц, Я and Э are characterized by a greater degree of reliability. These graphemes are either the most reliable or more reliable in two dictionaries out of three.
2.3. Initial grapheme Б is characterized by a medium degree of reliability. This grapheme meets the reliability criteria in all three dictionaries, but is the most reliable in only one of them.
2.4. Initial graphemes Е(Ё), Ж, Ч, Д, И and Ю are characterized by a lower degree of reliability. These graphemes meet the reliability criteria in one dictionary out of three and are either the most reliable or more reliable in it.
2.5. Initial grapheme Р is characterized by the lowest degree of reliability and is excluded from the selection due to the apparently coincidental nature of its appearance in the list.
2.6. Initial graphemes Е(Ё), Ж, У, Ц, Ч, Щ and Я could be used as formal markers of greater degrees of polysemy, while graphemes А, Б, Д, И, Й, Э and Ю could be used as formal markers of lower degrees of polysemy.
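Statements 2.1-2.5 can be read as an aggregation of the per-dictionary classes from Table 7. The sketch below shows one possible formalization; the function name `overall_degree`, the labels it returns and the exact order of the checks are assumptions, while the class assignments are transcribed from Table 7.

```python
# Reliability classes per dictionary, transcribed from Table 7.
# Order: Ozhegov, Kuznetsov, Minor Academic; None = grapheme not selected.
table7 = {
    "Щ": ("MtR", "MrR", "MrR"),
    "А": ("MtR", "MtR", "MtR"),
    "Й": ("MtR", "MtR", "MtR"),
    "У": (None, "MtR", "MrR"),
    "Ц": (None, "MtR", "MrR"),
    "Я": ("MrR", None, "MtR"),
    "Э": ("MrR", "MtR", None),
    "Б": ("LR", "R", "MtR"),
    "Е(Ё)": ("MrR", None, None),
    "Ж": ("MtR", None, None),
    "Ч": (None, "MrR", None),
    "Д": (None, None, "MrR"),
    "И": (None, "MrR", None),
    "Ю": ("MrR", None, None),
    "Р": ("LR", None, None),
}


def overall_degree(classes):
    """Combine per-dictionary classes into an overall degree of reliability.

    This is one reading of statements 2.1-2.5, not the authors' own rule.
    """
    listed = sum(c is not None for c in classes)      # meets criteria at all
    top = sum(c in ("MtR", "MrR") for c in classes)   # most or more reliable
    if top == 3:
        return "greatest"   # statement 2.1
    if top == 2:
        return "greater"    # statement 2.2
    if listed == 3:
        return "medium"     # statement 2.3
    if top == 1:
        return "lower"      # statement 2.4
    return "lowest"         # statement 2.5


degrees = {g: overall_degree(c) for g, c in table7.items()}
```

With the Table 7 data, this reading reproduces the grouping given in statements 2.1-2.5.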
Having
obtained this information, we commenced a search for possible practical
application of these formal markers.
It has already been mentioned that word length is traditionally used as a marker in Russian automated text analysis to distinguish formal texts from informal speech. Formal texts are represented by scientific and juridical speech, both of which are characterized by considerable amounts of terminological lexemes; these lexemes have longer graphical and sound forms and are usually borrowed from the stock of internationally recognized words originating from Ancient Greek and Latin. In addition, formal texts are expected, and frequently required, to be as monosemantic as possible in order to prevent ambiguity and misunderstanding. Informal texts are associated with journalism and fiction, which are not limited by formal requirements and thus make use of shorter colloquial words that are frequently of native origin; in addition, textual polysemy is encouraged by the genres that exist in the corresponding area of speech culture [Kozhina 2008].
At the same time, word length is not reliable enough to be applied as the only detection criterion. For this reason, in common practice it is supported by several other criteria, which vary from one analytical system to another. For example, the well-known Russian public text analyzer Hudlomer, available at URL http://teneta.rinet.ru/hudlomer/, utilizes the so-called Fomenko’s invariant to supplement its word length spectrum mechanism. Certain researchers have suggested using the chi-squared distribution, Fisher’s hypergeometric criterion, etc. [Shevelev 2006]. The weakness of these additional criteria is that they deal only with the external aspect of texts and have no considerable connection to their semantics. The marker of initial graphemes, in turn, could establish a link to the internal aspect of the text under analysis and thus properly supplement the external criterion of word length.
In this respect, we partially follow a trend in Russian linguistics that associates textual polysemy with the concept of entropy. Certain recent findings in this area of Russian linguistic research are related to Dr. Sergey Gusarenko’s scientific school [Gusarenko 2009]. According to this view, the appropriate understanding of texts in natural language is influenced by entropy, i.e. the amount of chaos introduced by external and internal factors. One of these internal factors is believed to be polysemy. Each polysemous word introduces an amount of chaos into the text, making it more difficult to process and understand by natural or artificial means; correspondingly, a text’s total entropy is determined by the number of polysemous words in it, since entropy is an additive quantity. Similarly, by measuring the percentage shares of words that possess lower and higher degrees of polysemy, we estimate the general polysemousness of the text as reflected in the number of meanings each word could possibly have. We use the term “potential polysemousness” to denote this concept [Golovko 2012].
Since the principal properties of lexemes that determine the characteristic traits of formal and informal texts (i.e. length, borrowedness and number of meanings) correspond to the Zipf-Mandelbrot law in its application to linguistics and to statements 1.1 – 1.3 presented above, we decided to test whether the formal markers of word length and of initial graphemes make it possible to classify Russian texts correctly into formal and informal ones. We were particularly interested in determining whether these two markers would constitute a basis solid enough to support reliable classification.
4. RESULTS
In order to perform the test and gather statistical data for further analysis, we collected a selection of 100 texts in the Russian language. The selection includes 25 samples of poetry and fiction, 25 publicistic texts, 25 juridical texts and 25 scientific texts; correspondingly, formal and informal speech are represented by 50 samples each.
At the initial stage, we applied a verification procedure in order to ensure that all the selected initial graphemes are indeed reliable. The results of the separate data investigation for each grapheme are presented in Table 8.
Table 8. Statistical data for initial graphemes associated with lower and greater degrees of polysemy (hereafter: in the meaning represented in statement 2.6)
| Initial grapheme / Texts | Literary | Publicistic | Scientific | Juridical | Total informal | Total formal |
| А | 0.3818% | 0.7792% | 1.2733% | 1.5392% | 0.5805% | 1.4062% |
| Б | 3.2023% | 2.8651% | 1.9355% | 1.3503% | 3.0337% | 1.6429% |
| Д | 3.4709% | 3.2472% | 3.1434% | 3.9605% | 3.3591% | 3.5519% |
| И | 1.4698% | 2.2922% | 3.3330% | 3.6504% | 1.8810% | 3.4917% |
| Й | 0.0147% | 0.0000% | 0.0220% | 0.0000% | 0.0073% | 0.0110% |
| Э | 0.1964% | 0.4390% | 0.9470% | 0.4518% | 0.3177% | 0.6994% |
| Ю | 0.0433% | 0.0474% | 0.0811% | 0.1941% | 0.0453% | 0.1376% |
| Е | 1.5353% | 1.4675% | 0.8500% | 0.6964% | 1.5014% | 0.7732% |
| Ж | 0.8424% | 0.6390% | 0.2657% | 0.1723% | 0.7407% | 0.2190% |
| У | 1.7155% | 1.8030% | 2.6991% | 3.0701% | 1.7593% | 2.8846% |
| Ц | 0.2339% | 0.3430% | 0.4894% | 0.5139% | 0.2885% | 0.5016% |
| Ч | 2.4071% | 2.4605% | 1.6164% | 0.8647% | 2.4338% | 1.2406% |
| Щ | 0.0607% | 0.0237% | 0.0209% | 0.0018% | 0.0422% | 0.0113% |
| Я | 1.4625% | 0.6685% | 0.5591% | 0.2802% | 1.0655% | 0.4197% |
It was discovered that certain graphemes behave in a way that contradicts the initial assumptions or does not entirely correspond to them. Consequently, we excluded the graphemes Б (previously noted as possessing a medium degree of reliability), Д (lower degree of reliability), У and Ц (greater degree of reliability) from the selection.
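The verification step can be illustrated with the aggregate shares from Table 8. In the sketch below, the informal value is the mean of the Literary and Publicistic columns and the formal value is the mean of the Scientific and Juridical columns. The study does not state a numerical exclusion rule, so the cut-off `MIN_RATIO` and the function name `rejected` are purely hypothetical; they only show how such a check could be expressed.

```python
# Mean percentage shares per register, transcribed from Table 8 as
# (informal, formal): informal = mean of Literary and Publicistic,
# formal = mean of Scientific and Juridical.
LOWER = {  # markers of lower polysemy (statement 2.6), expected in formal texts
    "А": (0.5805, 1.4062), "Б": (3.0337, 1.6429), "Д": (3.3591, 3.5519),
    "И": (1.8810, 3.4917), "Й": (0.0073, 0.0110), "Э": (0.3177, 0.6994),
    "Ю": (0.0453, 0.1376),
}
GREATER = {  # markers of greater polysemy, expected in informal texts
    "Е": (1.5014, 0.7732), "Ж": (0.7407, 0.2190), "У": (1.7593, 2.8846),
    "Ц": (0.2885, 0.5016), "Ч": (2.4338, 1.2406), "Щ": (0.0422, 0.0113),
    "Я": (1.0655, 0.4197),
}
MIN_RATIO = 1.25  # hypothetical cut-off, not a value taken from the study


def rejected(markers, expect_formal):
    """Return the graphemes whose shares do not clearly favour the expected register."""
    out = set()
    for grapheme, (informal, formal) in markers.items():
        ratio = formal / informal if expect_formal else informal / formal
        if ratio < MIN_RATIO:
            out.add(grapheme)
    return out


excluded = rejected(LOWER, expect_formal=True) | rejected(GREATER, expect_formal=False)
```

With this particular cut-off the sketch singles out Б, Д, У and Ц: Б, У and Ц are more frequent in the register opposite to the expected one, while Д differs between the registers only marginally.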
At the next stage, the total text length in symbols, the total number of words, the average word length, the percentage shares of initial graphemes associated with lower and greater degrees of polysemy, the difference between these shares and the ratio of the share of greater polysemy to the share of lower polysemy were determined for each text. The results are displayed in Tables 9-12 for each functional style separately. All tables share the following abbreviations: AWL – average word length, %LP – percentage share of initial graphemes associated with lower degrees of polysemy, %GP – percentage share of initial graphemes associated with greater degrees of polysemy, D – difference between %LP and %GP (%LP minus %GP), R – %GP/%LP ratio.
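The per-text quantities defined above can be computed along the following lines. This is a minimal sketch under stated assumptions, not the software used in the study: the tokenization (a word is a maximal run of Cyrillic letters), the treatment of Ё together with Е, and the names `text_metrics`, `LP_INITIALS` and `GP_INITIALS` are assumptions. The two marker sets are the graphemes of statement 2.6 that remain after the exclusion of Б, Д, У and Ц.

```python
import re

LP_INITIALS = set("АИЙЭЮ")   # markers of lower polysemy kept after verification
GP_INITIALS = set("ЕЁЖЧЩЯ")  # markers of greater polysemy; counting Ё with Е is an assumption


def text_metrics(text):
    """Return AWL, %LP, %GP, D and R for a Russian text."""
    # Assumption: a word is a maximal run of Cyrillic letters.
    words = re.findall(r"[А-Яа-яЁё]+", text)
    if not words:
        raise ValueError("no Cyrillic words found")
    n = len(words)
    awl = sum(len(w) for w in words) / n
    lp = 100 * sum(w[0].upper() in LP_INITIALS for w in words) / n
    gp = 100 * sum(w[0].upper() in GP_INITIALS for w in words) / n
    return {
        "AWL": awl,                              # average word length
        "%LP": lp,
        "%GP": gp,
        "D": lp - gp,                            # difference %LP - %GP
        "R": gp / lp if lp else float("inf"),    # ratio %GP / %LP
    }


# Toy input for illustration; real samples are whole texts.
m = text_metrics("Я читаю эту книгу и жду ещё одну")
```

For the toy sentence, 2 of the 8 words begin with lower-polysemy markers and 4 begin with greater-polysemy markers, which gives %LP = 25, %GP = 50, D = -25 and R = 2.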
Table 9. Statistical data for poetry and fiction
| Text # | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| Symbols | 359964 | 400357 | 236025 | 447871 | 269075 | 25415 | 81711 |
| Words | 53913 | 63380 | 35110 | 65386 | 34534 | 2931 | 9809 |
| AWL | 4.7831 | 4.8953 | 4.9669 | 5.0885 | 3.9340 | 3.9710 | 3.8842 |
| %LP | 2.3186% | 2.3825% | 2.3697% | 2.5021% | 2.0212% | 1.8765% | 1.5700% |
| %GP | 8.6565% | 7.3872% | 8.5246% | 6.4387% | 4.3609% | 4.1624% | 4.1493% |
| D | -6.3379 | -5.0047 | -6.1549 | -3.9366 | -2.3397 | -2.2859 | -2.5793 |
| R | 3.7336 | 3.1007 | 3.5974 | 2.5733 | 2.1576 | 2.2182 | 2.6429 |
| Text # | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| Symbols | 22095 | 376723 | 90674 | 183978 | 165793 | 23287 | 114118 |
| Words | 3554 | 39600 | 14394 | 27688 | 31699 | 3587 | 16839 |
| AWL | 4.6629 | 3.4766 | 4.7587 | 4.5648 | 3.8478 | 4.4957 | 4.8982 |
| %LP | 1.8008% | 1.3586% | 2.1120% | 1.9864% | 2.8708% | 2.0072% | 1.7459% |
| %GP | 5.9932% | 4.4141% | 7.6699% | 8.6752% | 8.3441% | 6.2448% | 5.0716% |
| D | -4.1924 | -3.0555 | -5.5579 | -6.6888 | -5.4733 | -4.2376 | -3.3257 |
| R | 3.3281 | 3.2491 | 3.6316 | 4.3673 | 2.9066 | 3.1111 | 2.9048 |
| Text # | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| Symbols | 313322 | 26592 | 75628 | 15183 | 17401 | 29917 | 20800 |
| Words | 46329 | 3808 | 10983 | 1784 | 2859 | 4693 | 3171 |
| AWL | 4.8489 | 5.0089 | 4.9390 | 5.0639 | 4.4008 | 4.0324 | 4.8802 |
| %LP | 1.9642% | 2.4422% | 1.8028% | 3.0830% | 1.2592% | 1.9604% | 2.4913% |
| %GP | 5.5386% | 8.1933% | 5.1261% | 5.7175% | 2.9731% | 4.9435% | 5.3611% |
| D | -3.5744 | -5.7511 | -3.3233 | -2.6345 | -1.7139 | -2.9831 | -2.8698 |
| R | 2.8198 | 3.3548 | 2.8434 | 1.8545 | 2.3611 | 2.5217 | 2.1519 |
| Text # | 22 | 23 | 24 | 25 | Average value |
| Symbols | 118760 | 107931 | 72194 | 101108 | 147836.9 |
| Words | 17920 | 16403 | 11278 | 16748 | 21536 |
| AWL | 4.2513 | 4.5122 | 4.4370 | 4.1192 | 4.5089 |
| %LP | 3.4208% | 2.0484% | 1.6049% | 1.6480% | 2.1059% |
| %GP | 7.6228% | 7.0841% | 6.7122% | 8.3234% | 6.3075% |
| D | -4.2020 | -5.0357 | -5.1073 | -6.6754 | -4.2016 |
| R | 2.2284 | 3.4583 | 4.1823 | 5.0507 | 3.0540 |
Table 10. Statistical data for publicistic texts
| Text # | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| Symbols | 7230 | 8655 | 15477 | 29767 | 11681 | 5459 | 6486 |
| Words | 964 | 1268 | 2251 | 4856 | 1632 | 752 | 848 |
| AWL | 4.9481 | 5.4495 | 5.4860 | 4.7780 | 5.8658 | 5.9535 | 5.8255 |
| %LP | 2.0747% | 4.6530% | 3.7317% | 2.4918% | 3.9828% | 5.3191% | 4.9528% |
| %GP | 6.1203% | 4.6530% | 4.4425% | 6.8987% | 4.8407% | 4.6543% | 7.9009% |
| D | -4.0456 | 0.0000 | -0.7108 | -4.4069 | -0.8579 | 0.6648 | -2.9481 |
| R | 2.9500 | 1.0000 | 1.1905 | 2.7686 | 1.2154 | 0.8750 | 1.5952 |
| Text # | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| Symbols | 6623 | 15360 | 13956 | 19398 | 6621 | 10501 | 12455 |
| Words | 1055 | 2149 | 1959 | 2908 | 819 | 1513 | 1822 |
| AWL | 4.8275 | 5.8381 | 5.7626 | 5.3855 | 5.4481 | 5.5056 | 5.4748 |
| %LP | 3.2227% | 4.4672% | 3.8285% | 3.2325% | 4.2735% | 2.6438% | 4.1164% |
| %GP | 6.8246% | 5.4444% | 5.3088% | 4.5392% | 2.6862% | 4.3622% | 5.8178% |
| D | -3.6019 | -0.9772 | -1.4803 | -1.3067 | 1.5873 | -1.7184 | -1.7014 |
| R | 2.1176 | 1.2188 | 1.3867 | 1.4043 | 0.6286 | 1.6500 | 1.4133 |
| Text # | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| Symbols | 11695 | 15516 | 6054 | 5257 | 5468 | 6675 | 6685 |
| Words | 1826 | 2256 | 881 | 753 | 843 | 952 | 936 |
| AWL | 4.9578 | 5.4756 | 5.5392 | 5.6521 | 5.1851 | 5.1134 | 5.2799 |
| %LP | 3.0120% | 3.1472% | 2.8377% | 3.4529% | 3.2028% | 2.4160% | 4.0598% |
| %GP | 5.3669% | 5.4965% | 7.0375% | 4.6481% | 7.2361% | 4.0966% | 4.5940% |
| D | -2.3549 | -2.3493 | -4.1998 | -1.1952 | -4.0333 | -1.6806 | -0.5342 |
| R | 1.7818 | 1.7465 | 2.4800 | 1.3462 | 2.2593 | 1.6957 | 1.1316 |
| Text # | 22 | 23 | 24 | 25 | Average value |
| Symbols | 8503 | 7547 | 11982 | 8662 | 10548.52 |
| Words | 1205 | 1120 | 1877 | 1225 | 1546.8 |
| AWL | 5.7734 | 5.3848 | 4.9989 | 5.7241 | 5.4253 |
| %LP | 3.6515% | 3.0357% | 3.7294% | 3.3469% | 3.5553% |
| %GP | 4.3154% | 3.4821% | 6.7128% | 4.0000% | 5.2592% |
| D | -0.6639 | -0.4464 | -2.9834 | -0.6531 | -1.7039 |
| R | 1.1818 | 1.1471 | 1.8000 | 1.1951 | 1.5672 |
Table 11. Statistical data for scientific
texts
Text # |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
Symbols |
8082 |
11794 |