DATA ON DISTRIBUTION OF POLYSEMY IN RUSSIAN DICTIONARIES AND THEIR PRACTICAL APPLICATION IN AUTOMATED TEXT ANALYSIS

 

Nikolay Golovko

 

North Caucasus Federal University, Humanities Institute, Pushkin street, 1, 355009, Stavropol

(RUSSIAN FEDERATION)

E-mail: nvgolovko@inbox.ru

 

ABSTRACT

 

Statistical analysis of Russian dictionary databases has revealed certain patterns in the distribution of polysemous lexical units across dictionary sections. Correlations were found between the initial graphemes of Russian words, the number of their meanings, and their borrowedness; these correlations were then investigated further to determine their possible practical application. We found that the percentage shares of words whose initial graphemes are associated with lower and greater degrees of polysemy can be used in automated text analysis, in combination with average word length, to reliably differentiate samples of formal speech from samples of informal speech in Russian.

 

Keywords: lexical polysemy, lexicography, statistical analysis in linguistics, automated text analysis, average word length, stylistics.

 

1. INTRODUCTION

 

In recent years, we have been collecting statistical data on the evolution of the Russian lexicon. We were particularly interested in the historical development of polysemy and in the way it is represented in academically recognized dictionaries. The data and their analysis have revealed certain tendencies that we describe in this article.

Because these tendencies are considered in relation to automated text analysis, it should be noted that considerable research has addressed the connection between the formal aspects of natural-language texts and their content. The rapid development of computing and the substantial increase in the volume of information to be processed by analytical systems led to the first studies in this area in the second half of the 20th century [Rocchio 1971; Schank, Abelson 1977]. These were soon followed by a number of studies in frame research [Gruber 1985; Hoffmann 1985; Õim, Saluveer 1985; Tannen 1985], accompanied by work on differentiating various types of meaning [Hudson 1985; Kiefer 1985] and on discovering inner structures within texts [Wilks 1985]. The field then evolved towards text clustering and categorization [Apte, Damerau, Weiss 1994; Beil, Ester, Xu 2002] and formal conceptual analysis [Cimiano, Staab, Tane 2003; Ferrucci 2004; Mehler, Waltinger, Wegner 2007]. Attempts have also been made at automatic recognition and evaluation of styles [Holmes 1998; Peng, Hengartner 2002].

 

2. MATERIALS AND METHODS

 

The most influential, comprehensive, and widely accepted Russian dictionaries (Ozhegov’s, Kuznetsov’s, and the Minor Academic Dictionary) were selected as database sources for the initial research. We processed these databases and determined the absolute quantities of monosemous words and of words characterized by a lower degree of polysemy (2 meanings), a medium degree of polysemy (3 to 5 meanings), and a greater degree of polysemy (6 meanings and above). In addition, we calculated the relative quantities of polysemous words as shares of the total number of dictionary entries. The analysis followed the alphabetical arrangement of words in the three dictionaries; thus, the data were collected for each alphabetical group of words separately.

The determined and calculated quantities for each dictionary can be found in Tables 1-3. All tables share the following abbreviations: M – quantity of monosemous words, P – quantity of polysemous words, LP – lower degree of polysemy, MP – medium degree of polysemy, GP – greater degree of polysemy.
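The bucketing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used for the dictionaries: the function names `polysemy_bucket` and `polysemy_shares` and the input format (a list holding the number of meanings of each entry in one dictionary section) are hypothetical.

```python
from collections import Counter


def polysemy_bucket(meanings: int) -> str:
    """Map a number of meanings to the abbreviations used in Tables 1-3."""
    if meanings < 1:
        raise ValueError("an entry must have at least one meaning")
    if meanings == 1:
        return "M"   # monosemous
    if meanings == 2:
        return "LP"  # lower degree of polysemy
    if meanings <= 5:
        return "MP"  # medium degree of polysemy (3 to 5 meanings)
    return "GP"      # greater degree of polysemy (6 meanings and above)


def polysemy_shares(meaning_counts: list[int]) -> dict[str, float]:
    """Return absolute counts and percentage shares for one section."""
    buckets = Counter(polysemy_bucket(n) for n in meaning_counts)
    total = len(meaning_counts)
    result = {"Words": total, "M": buckets["M"], "P": total - buckets["M"]}
    for key in ("LP", "MP", "GP"):
        result[key] = buckets[key]
    for key in ("P", "LP", "MP", "GP"):
        result[f"% {key}"] = round(100 * result[key] / total, 2) if total else 0.0
    return result


# Section А of Ozhegov's dictionary (Table 1): 575 monosemous words,
# 130 words with 2 meanings and 21 words with 3 to 5 meanings
# (the value 3 stands in for any count in the 3-5 range).
section_a = [1] * 575 + [2] * 130 + [3] * 21
shares = polysemy_shares(section_a)
print(shares["% P"], shares["% LP"], shares["% MP"], shares["% GP"])
```

Applied to the counts of the А section, the sketch yields the shares listed in the first row of Table 1.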

 

Table 1. Results of statistical analysis for Ozhegov’s Dictionary of the Russian Language

 

Initial grapheme | Words | M | P | LP | MP | GP | % P | % LP | % MP | % GP
А | 726 | 575 | 151 | 130 | 21 | 0 | 20.80% | 17.91% | 2.89% | 0.00%
Б | 1497 | 1176 | 321 | 250 | 67 | 4 | 21.44% | 16.70% | 4.48% | 0.27%
В | 2514 | 1915 | 599 | 414 | 169 | 16 | 23.83% | 16.47% | 6.72% | 0.64%
Г | 993 | 769 | 224 | 156 | 63 | 5 | 22.56% | 15.71% | 6.34% | 0.50%
Д | 1396 | 1030 | 366 | 275 | 76 | 15 | 26.22% | 19.70% | 5.44% | 1.07%
Е (Ё) | 133 | 94 | 39 | 26 | 11 | 2 | 29.32% | 19.55% | 8.27% | 1.50%
Ж | 301 | 206 | 95 | 63 | 28 | 4 | 31.56% | 20.93% | 9.30% | 1.33%
З | 1788 | 1314 | 474 | 340 | 123 | 11 | 26.51% | 19.02% | 6.88% | 0.62%
И | 935 | 719 | 216 | 172 | 40 | 4 | 23.10% | 18.40% | 4.28% | 0.43%
Й | 5 | 4 | 1 | 1 | 0 | 0 | 20.00% | 20.00% | 0.00% | 0.00%
К | 2259 | 1706 | 553 | 396 | 149 | 8 | 24.48% | 17.53% | 6.60% | 0.35%
Л | 781 | 592 | 189 | 138 | 46 | 5 | 24.20% | 17.67% | 5.89% | 0.64%
М | 1404 | 1104 | 300 | 212 | 84 | 4 | 21.37% | 15.10% | 5.98% | 0.28%
Н | 2229 | 1670 | 559 | 411 | 142 | 6 | 25.08% | 18.44% | 6.37% | 0.27%
О | 2748 | 1974 | 774 | 560 | 193 | 21 | 28.17% | 20.38% | 7.02% | 0.76%
П | 6287 | 4685 | 1602 | 1135 | 429 | 38 | 25.48% | 18.05% | 6.82% | 0.60%
Р | 2102 | 1444 | 658 | 458 | 187 | 13 | 31.30% | 21.79% | 8.90% | 0.62%
С | 3607 | 2691 | 916 | 624 | 258 | 34 | 25.40% | 17.30% | 7.15% | 0.94%
Т | 1272 | 936 | 336 | 212 | 113 | 11 | 26.42% | 16.67% | 8.88% | 0.86%
У | 1056 | 752 | 304 | 218 | 80 | 6 | 28.79% | 20.64% | 7.58% | 0.57%
Ф | 490 | 358 | 132 | 97 | 32 | 3 | 26.94% | 19.80% | 6.53% | 0.61%
Х | 441 | 321 | 120 | 82 | 29 | 9 | 27.21% | 18.59% | 6.58% | 2.04%
Ц | 196 | 142 | 54 | 38 | 16 | 0 | 27.55% | 19.39% | 8.16% | 0.00%
Ч | 488 | 374 | 114 | 77 | 32 | 5 | 23.36% | 15.78% | 6.56% | 1.02%
Ш | 509 | 391 | 118 | 94 | 23 | 1 | 23.18% | 18.47% | 4.52% | 0.20%
Щ | 72 | 46 | 26 | 17 | 8 | 1 | 36.11% | 23.61% | 11.11% | 1.39%
Э | 323 | 255 | 68 | 53 | 15 | 0 | 21.05% | 16.41% | 4.64% | 0.00%
Ю | 45 | 34 | 11 | 11 | 0 | 0 | 24.44% | 24.44% | 0.00% | 0.00%
Я | 135 | 90 | 45 | 30 | 14 | 1 | 33.33% | 22.22% | 10.37% | 0.74%
Total | 36732 | 27367 | 9365 | 6690 | 2448 | 227 | | | |

 

 

Table 2. Results of statistical analysis for Kuznetsov’s Contemporary Dictionary of the Russian Language

 

Initial grapheme | Words | M | P | LP | MP | GP | % P | % LP | % MP | % GP
А | 884 | 593 | 291 | 224 | 67 | 0 | 32.92% | 25.34% | 7.58% | 0.00%
Б | 1623 | 1023 | 600 | 414 | 171 | 15 | 36.97% | 25.51% | 10.54% | 0.92%
В | 2654 | 1662 | 992 | 630 | 317 | 45 | 37.38% | 23.74% | 11.94% | 1.70%
Г | 1083 | 656 | 427 | 278 | 129 | 20 | 39.43% | 25.67% | 11.91% | 1.85%
Д | 1506 | 968 | 538 | 335 | 184 | 19 | 35.72% | 22.24% | 12.22% | 1.26%
Е (Ё) | 136 | 83 | 53 | 31 | 19 | 3 | 38.97% | 22.79% | 13.97% | 2.21%
Ж | 281 | 166 | 115 | 76 | 34 | 5 | 40.93% | 27.05% | 12.10% | 1.78%
З | 1971 | 1190 | 781 | 490 | 265 | 26 | 39.62% | 24.86% | 13.44% | 1.32%
И | 1064 | 664 | 400 | 274 | 119 | 7 | 37.59% | 25.75% | 11.18% | 0.66%
Й | 6 | 4 | 2 | 2 | 0 | 0 | 33.33% | 33.33% | 0.00% | 0.00%
К | 2548 | 1527 | 1021 | 614 | 361 | 46 | 40.07% | 24.10% | 14.17% | 1.81%
Л | 841 | 509 | 332 | 200 | 120 | 12 | 39.48% | 23.78% | 14.27% | 1.43%
М | 1564 | 874 | 690 | 427 | 241 | 22 | 44.12% | 27.30% | 15.41% | 1.41%
Н | 2457 | 1474 | 983 | 636 | 313 | 34 | 40.01% | 25.89% | 12.74% | 1.38%
О | 2960 | 1685 | 1275 | 769 | 455 | 51 | 43.07% | 25.98% | 15.37% | 1.72%
П | 6903 | 3846 | 3057 | 1795 | 1106 | 156 | 44.29% | 26.00% | 16.02% | 2.26%
Р | 2299 | 1324 | 975 | 593 | 341 | 41 | 42.41% | 25.79% | 14.83% | 1.78%
С | 4060 | 2271 | 1789 | 1005 | 654 | 130 | 44.06% | 24.75% | 16.11% | 3.20%
Т | 1390 | 769 | 621 | 344 | 232 | 45 | 44.68% | 24.75% | 16.69% | 3.24%
У | 1160 | 587 | 573 | 318 | 220 | 35 | 49.40% | 27.41% | 18.97% | 3.02%
Ф | 574 | 339 | 235 | 146 | 82 | 7 | 40.94% | 25.44% | 14.29% | 1.22%
Х | 454 | 242 | 212 | 122 | 67 | 23 | 46.70% | 26.87% | 14.76% | 5.07%
Ц | 208 | 99 | 109 | 55 | 47 | 7 | 52.40% | 26.44% | 22.60% | 3.37%
Ч | 469 | 248 | 221 | 123 | 86 | 12 | 47.12% | 26.23% | 18.34% | 2.56%
Ш | 571 | 324 | 247 | 168 | 74 | 5 | 43.26% | 29.42% | 12.96% | 0.88%
Щ | 74 | 39 | 35 | 16 | 17 | 2 | 47.30% | 21.62% | 22.97% | 2.70%
Э | 410 | 261 | 149 | 104 | 42 | 3 | 36.34% | 25.37% | 10.24% | 0.73%
Ю | 54 | 33 | 21 | 13 | 8 | 0 | 38.89% | 24.07% | 14.81% | 0.00%
Я | 145 | 85 | 60 | 15 | 42 | 3 | 41.38% | 10.34% | 28.97% | 2.07%
Total | 40349 | 23545 | 16804 | 10217 | 5813 | 774 | | | |

 

 

Table 3. Results of statistical analysis for Minor Academic Dictionary of the Russian Language

 

Initial grapheme | Words | M | P | LP | MP | GP | % P | % LP | % MP | % GP
А | 1600 | 1338 | 262 | 222 | 40 | 0 | 16.38% | 13.88% | 2.50% | 0.00%
Б | 3122 | 2577 | 545 | 409 | 128 | 8 | 17.46% | 13.10% | 4.10% | 0.26%
В | 6049 | 4586 | 1463 | 1095 | 337 | 31 | 24.19% | 18.10% | 5.57% | 0.51%
Г | 2274 | 1821 | 453 | 306 | 137 | 10 | 19.92% | 13.46% | 6.02% | 0.44%
Д | 3528 | 2838 | 690 | 525 | 149 | 16 | 19.56% | 14.88% | 4.22% | 0.45%
Е (Ё) | 226 | 165 | 61 | 43 | 16 | 2 | 26.99% | 19.03% | 7.08% | 0.88%
Ж | 630 | 489 | 141 | 91 | 46 | 4 | 22.38% | 14.44% | 7.30% | 0.63%
З | 4828 | 3673 | 1155 | 846 | 284 | 25 | 23.92% | 17.52% | 5.88% | 0.52%
И | 2250 | 1654 | 596 | 455 | 131 | 10 | 26.49% | 20.22% | 5.82% | 0.44%
Й | 18 | 16 | 2 | 2 | 0 | 0 | 11.11% | 11.11% | 0.00% | 0.00%
К | 5056 | 3870 | 1186 | 800 | 350 | 36 | 23.46% | 15.82% | 6.92% | 0.71%
Л | 1768 | 1357 | 411 | 298 | 100 | 13 | 23.25% | 16.86% | 5.66% | 0.74%
М | 3473 | 2661 | 812 | 557 | 242 | 13 | 23.38% | 16.04% | 6.97% | 0.37%
Н | 5846 | 4386 | 1460 | 1047 | 379 | 34 | 24.97% | 17.91% | 6.48% | 0.58%
О | 6646 | 4794 | 1852 | 1347 | 459 | 46 | 27.87% | 20.27% | 6.91% | 0.69%
П | 16594 | 12415 | 4179 | 3199 | 890 | 90 | 25.18% | 19.28% | 5.36% | 0.54%
Р | 4725 | 3284 | 1441 | 980 | 418 | 43 | 30.50% | 20.74% | 8.85% | 0.91%
С | 8611 | 6317 | 2294 | 1481 | 701 | 112 | 26.64% | 17.20% | 8.14% | 1.30%
Т | 2969 | 2188 | 781 | 491 | 252 | 38 | 26.31% | 16.54% | 8.49% | 1.28%
У | 2582 | 1686 | 896 | 637 | 238 | 21 | 34.70% | 24.67% | 9.22% | 0.81%
Ф | 1347 | 1053 | 294 | 214 | 74 | 6 | 21.83% | 15.89% | 5.49% | 0.45%
Х | 1008 | 776 | 232 | 141 | 74 | 17 | 23.02% | 13.99% | 7.34% | 1.69%
Ц | 491 | 349 | 142 | 91 | 48 | 3 | 28.92% | 18.53% | 9.78% | 0.61%
Ч | 1134 | 861 | 273 | 174 | 87 | 12 | 24.07% | 15.34% | 7.67% | 1.06%
Ш | 1212 | 904 | 308 | 213 | 91 | 4 | 25.41% | 17.57% | 7.51% | 0.33%
Щ | 156 | 113 | 43 | 24 | 17 | 2 | 27.56% | 15.38% | 10.90% | 1.28%
Э | 881 | 686 | 195 | 148 | 43 | 4 | 22.13% | 16.80% | 4.88% | 0.45%
Ю | 74 | 50 | 24 | 18 | 6 | 0 | 32.43% | 24.32% | 8.11% | 0.00%
Я | 193 | 111 | 82 | 54 | 25 | 3 | 42.49% | 27.98% | 12.95% | 1.55%
Total | 89291 | 67018 | 22273 | 15908 | 5762 | 603 | | | |

 

 

Further analysis of these data revealed certain peculiar tendencies in the distribution of polysemous words across dictionary sections.

In certain cases, the absolute and relative quantities of polysemous words appear to correlate with the initial grapheme. While the majority of dictionary sections show a similar, homogeneous distribution of words with various degrees of polysemy, several sections deviate notably from the general picture. Importantly, these deviations can be observed in all three dictionary databases, which suggests that the correlations between initial graphemes and degrees of polysemy may be universal in nature.

Dictionary sections А and Й exemplify markedly low degrees of polysemy. None of the three dictionaries records any word with these initial graphemes that has 6 or more meanings; the А section is also characterized by a relatively low percentage share of words with 3 or more meanings, and the Й section consists entirely of words that have 1 or 2 meanings.

Dictionary sections Щ and Я represent the notable counterexample. The relative quantities of words with medium and greater degrees of polysemy (i.e. having 3 meanings and above) with these initial graphemes clearly exceed those of most other sections.

These results led us to investigate the described correlations in greater depth.

 

3. DISCUSSION

 

Although an interdependency between lexical polysemy, which represents the ideal side of language symbols, and the initial graphemes of words, which belong to their material side, might seem unusual, there are theoretical grounds to claim that it is neither coincidental nor arbitrary. Both Russian and international studies in quantitative linguistics include successful efforts to establish links between the material form of lexical units and the linguistic or even extra-linguistic properties of language symbols. A notable example is the adaptation of the Zipf-Mandelbrot law to linguistic phenomena, which demonstrates that the frequency of a word’s appearance in both oral and written speech is inversely related to its length (i.e. the number of graphemes and / or sounds it contains) [Seleznev, Isaeva 2005; Köhler, Altmann, Piotrowski 2005]. Word length, being a purely physical attribute that can be measured with relative ease, is also traditionally recognized as a valuable and reliable criterion for determining various inner characteristics of texts, including their functional style.

The validity of the Zipf-Mandelbrot law in its application to natural languages can be explained philosophically, in terms of meta-science. It is widely known that any natural language is driven in part by the principle of linguistic economy, which encourages speakers and writers to minimize their effort and use the smallest possible quantity of language units to convey the intended message. The language community therefore prefers simpler means of expression to more complicated ones, and thus uses shorter words more frequently.

The same approach, correlating linguistic and extra-linguistic phenomena, can be applied to the statistical data that we have collected.

All the non-standard dictionary sections mentioned above share a peculiar feature. The initial graphemes А and Й, which show the lowest relative degrees of polysemy in all three dictionaries, are not typical of Russian. Russian vocabulary contains a minuscule number of native words that begin with the grapheme А (and, consequently, with the corresponding sound); the majority of words in the А section and all lexical units in the Й section are borrowed from other languages. In addition, these borrowed words are either formal or used as scientific terms.

In contrast to А and Й, the initial graphemes Щ and Я, which are characterized by some of the greatest percentage shares of polysemous words, represent typical Russian sounds that are frequently found in native lexical units but are seldom or never associated with borrowed words. The grapheme Щ is especially notable in this respect, because the dictionaries contain no borrowed words with this initial grapheme.

We consider it reasonably safe to hypothesize that lexical polysemy is related to the borrowedness of words. A lexical unit acquires new meanings through the historical development of the language system and through frequent use; thus, older and more actively used words should generally possess a more extensive semantic structure than newer and less common lexemes. A word’s age can be connected to its origin, assuming that native words are relatively older and borrowed words relatively newer, while frequency of use is closely associated with word length, as noted above.

Summarizing these considerations, we arrived at the following statements.

1.1. It can be expected that a Russian word with 3 or more meanings is native and more frequent in speech practice, whereas a Russian word with 1 or 2 meanings is borrowed and less frequent in speech practice.

1.2. A Russian word’s frequency of use can be estimated from its length, and its origin can be estimated from its initial grapheme.

1.3. A Russian word’s length and initial grapheme can therefore be used as formal physical markers to estimate the number of meanings associated with it.

These statements were used as the foundation for further research and analysis.

To use initial graphemes as formal markers, we needed to determine the degree of reliability of the corresponding dictionary sections. This was necessary to exclude random deviations caused by the peculiarities of individual dictionaries; in addition, the selected graphemes had to show distinct positive or negative differences from the typical percentage shares found in the majority of sections. For this purpose, the initial graphemes with the largest and the smallest relative amounts of polysemous entries, as well as of words with medium and greater degrees of polysemy (both separately and combined), were selected from each dictionary database and are presented in the following tables.

Tables 4-6 use the same abbreviations as Tables 1-3. The values in columns where the difference exceeds 1% are rounded. The grey zone indicates graphemes that lie close to the general threshold (0.5% or less below or above it).
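One way to reproduce such a selection is to rank the sections by a given share and keep the extremes. The sketch below is an assumption about the mechanics rather than the exact procedure used here: the cut-off of five graphemes per side (`k=5`) and the function name `select_extremes` are hypothetical, and the grey-zone threshold is not modelled. The input values are the % P column of Table 1.

```python
# % P values from Table 1 (Ozhegov's dictionary), keyed by initial grapheme.
OZHEGOV_P = {
    "А": 20.80, "Б": 21.44, "В": 23.83, "Г": 22.56, "Д": 26.22, "Е": 29.32,
    "Ж": 31.56, "З": 26.51, "И": 23.10, "Й": 20.00, "К": 24.48, "Л": 24.20,
    "М": 21.37, "Н": 25.08, "О": 28.17, "П": 25.48, "Р": 31.30, "С": 25.40,
    "Т": 26.42, "У": 28.79, "Ф": 26.94, "Х": 27.21, "Ц": 27.55, "Ч": 23.36,
    "Ш": 23.18, "Щ": 36.11, "Э": 21.05, "Ю": 24.44, "Я": 33.33,
}


def select_extremes(shares: dict[str, float], k: int = 5) -> tuple[list[str], list[str]]:
    """Return the k graphemes with the highest and the k with the lowest share.

    Both lists are ordered from the highest share to the lowest.
    """
    ranked = sorted(shares, key=shares.get, reverse=True)
    return ranked[:k], ranked[-k:]


highest, lowest = select_extremes(OZHEGOV_P)
print(highest)
print(lowest)
```

For the % P column this ranking returns Щ, Я, Ж, Р, Е at the top and Б, М, Э, А, Й at the bottom, in line with the first column pair of Table 4.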

 

Table 4. Selections from statistical data based on Ozhegov’s Dictionary of the Russian Language

 

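Initial grapheme | % P | Initial grapheme | % MP | Initial grapheme | % GP | Initial grapheme | % MP+GP
Щ | 36% | Щ | 11% | Х | 2.04% | Щ | 13%
Я | 33% | Я | 10% | Е (Ё) | 1.50% | Я | 11%
Ж | 32% | Ж | 9% | Щ | 1.39% | Ж | 11%
Р | 31% | Р | 9% | Ж | 1.33% | Е (Ё) | 10%
Е (Ё) | 29% | Т | 9% | Д | 1.07% | Т | 10%
У | 29% | | | Ч | 1.02% | Р | 10%
 | | | | | | Б | 5%
 | | | | | | Ш | 5%
Б | 21% | Б | 4% | Ц | 0.00% | И | 5%
М | 21% | И | 4% | Э | 0.00% | Э | 5%
Э | 21% | А | 3% | А | 0.00% | А | 3%
А | 21% | Ю | 0% | Ю | 0.00% | Ю | 0%
Й | 20% | Й | 0% | Й | 0.00% | Й | 0%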

 

Table 5. Selections from statistical data based on Kuznetsov’s Contemporary Dictionary of the Russian Language

 

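Initial grapheme | % P | Initial grapheme | % MP | Initial grapheme | % GP | Initial grapheme | % MP+GP
Ц | 52% | Я | 29% | Х | 5% | Я | 31%
У | 49% | Щ | 23% | Ц | 3% | Ц | 26%
Щ | 47% | Ц | 23% | Т | 3% | Щ | 26%
Ч | 47% | У | 19% | С | 3% | У | 22%
Х | 47% | Ч | 18% | У | 3% | Ч | 21%
 | | | | Б | 0.92% | |
В | 37% | | | Ш | 0.88% | |
Б | 37% | И | 11% | Э | 0.73% | И | 12%
Э | 36% | Б | 11% | И | 0.66% | Б | 11%
Д | 36% | Э | 10% | Ю | 0.00% | Э | 11%
Й | 33% | А | 8% | А | 0.00% | А | 8%
А | 33% | Й | 0% | Й | 0.00% | Й | 0%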

 

Table 6. Selections from statistical data based on Minor Academic Dictionary of the Russian Language

 

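Initial grapheme | % P | Initial grapheme | % MP | Initial grapheme | % GP | Initial grapheme | % MP+GP
Я | 42% | Я | 13% | Х | 1.69% | Я | 15%
У | 35% | Щ | 11% | Я | 1.55% | Щ | 12%
Ю | 32% | Ц | 10% | С | 1.30% | Ц | 10%
Р | 30% | У | 9% | Щ | 1.28% | У | 10%
Ц | 29% | Р | 9% | Т | 1.28% | Т | 10%
 | | | | Ч | 1.06% | Р | 10%
 | | Ф | 5% | | | |
 | | П | 5% | М | 0.37% | |
Г | 20% | Э | 5% | Ш | 0.33% | Э | 5%
Д | 20% | Д | 4% | Б | 0.26% | Д | 5%
Б | 17% | Б | 4% | Ю | 0.00% | Б | 4%
А | 16% | А | 3% | А | 0.00% | А | 3%
Й | 11% | Й | 0% | Й | 0.00% | Й | 0%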

 

We have set the following criteria to determine degrees of reliability:

1) Initial graphemes that are found in the white zone of all four types of percentage shares are recognized as the most reliable.

2) Initial graphemes that are found in the white zone of three types of percentage shares out of four are recognized as more reliable.

3) Initial graphemes that are found in the white or grey zones of all four types of percentage shares are recognized as reliable.

4) Initial graphemes that are found in the white or grey zones of three types of percentage shares out of four are recognized as less reliable.

5) All the remaining initial graphemes are recognized as the least reliable and are not taken into account in further analysis.

The criteria were applied to each selection separately. The resulting picture is represented in Table 7. The table contains the following abbreviations: (+) – higher degrees of polysemy, (-) – lower degrees of polysemy, MtR – most reliable graphemes, MrR – more reliable graphemes, R – reliable graphemes, LR – less reliable graphemes.
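The five criteria can be read as a small decision procedure. The following sketch assumes that the criteria are checked in the order listed, so that a grapheme satisfying several of them receives the highest applicable grade; the function name `reliability` and the encoding of zones as the strings "white" and "grey" (with `None` for a share type in which the grapheme was not selected) are hypothetical.

```python
from typing import Optional


def reliability(zones: list[Optional[str]]) -> Optional[str]:
    """Grade a grapheme from its zone in each of the four share types.

    `zones` holds one entry per share type (% P, % MP, % GP, % MP+GP):
    "white", "grey", or None when the grapheme is absent from that selection.
    """
    if len(zones) != 4:
        raise ValueError("expected one zone per share type (four in total)")
    white = sum(z == "white" for z in zones)
    selected = sum(z in ("white", "grey") for z in zones)
    if white == 4:
        return "MtR"  # criterion 1: most reliable
    if white == 3:
        return "MrR"  # criterion 2: more reliable
    if selected == 4:
        return "R"    # criterion 3: reliable
    if selected == 3:
        return "LR"   # criterion 4: less reliable
    return None       # criterion 5: not taken into account


print(reliability(["white", "white", "white", "white"]))
print(reliability(["white", "white", None, "white"]))
print(reliability(["white", "grey", "white", "grey"]))
print(reliability(["white", "white", None, "grey"]))
```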

 

Table 7. Distribution of selected initial graphemes across all dictionaries

 

Graphemes / Dictionary | Ozhegov (+) | Ozhegov (-) | Kuznetsov (+) | Kuznetsov (-) | Minor Academic (+) | Minor Academic (-)
MtR | Ж, Щ | А, Й | У, Ц | А, Й, Э | Я | А, Б, Й
MrR | Е (Ё), Я | Э, Ю | Ч, Щ | И | У, Ц, Щ | Д
R | - | - | - | Б | - | -
LR | Р | Б | - | - | - | -

 

The results were compared and systematized in order to formulate the final conclusions:

2.1. Initial graphemes Щ, А and Й are characterized by the greatest degree of reliability as possible formal markers of lexical polysemy. These graphemes are either the most reliable or more reliable in all three dictionaries.

2.2. Initial graphemes У, Ц, Я and Э are characterized by a greater degree of reliability. These graphemes are either the most reliable or more reliable in two dictionaries out of three.

2.3. Initial grapheme Б is characterized by a medium degree of reliability. This grapheme meets the reliability criteria in all three dictionaries, but is the most reliable in only one of them.

2.4. Initial graphemes Е (Ё), Ж, Ч, Д, И and Ю are characterized by a lower degree of reliability. These graphemes meet the reliability criteria in one dictionary out of three and are either the most reliable or more reliable in it.

2.5. Initial grapheme Р is characterized by the lowest degree of reliability and is excluded from the selection due to the apparently coincidental nature of its appearance in the list.

2.6. Initial graphemes Е (Ё), Ж, У, Ц, Ч, Щ and Я could be used as formal markers of greater degrees of polysemy, while graphemes А, Б, Д, И, Й, Э and Ю could be used as formal markers of lower degrees of polysemy.
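The cross-dictionary comparison behind statements 2.1-2.5 can be retraced from Table 7 by counting, for each grapheme, in how many dictionaries it received each grade. The data structure and the helper name `tally` below are illustrative assumptions; the grade assignments are copied from Table 7, with Е standing for Е (Ё).

```python
from collections import defaultdict

# Grade assignments from Table 7. The (+) and (-) columns are merged,
# because no grapheme appears with both signs.
TABLE_7 = {
    "Ozhegov": {"MtR": "ЖЩАЙ", "MrR": "ЕЯЭЮ", "R": "", "LR": "РБ"},
    "Kuznetsov": {"MtR": "УЦАЙЭ", "MrR": "ЧЩИ", "R": "Б", "LR": ""},
    "Minor Academic": {"MtR": "ЯАБЙ", "MrR": "УЦЩД", "R": "", "LR": ""},
}


def tally(table):
    """Count, per grapheme, the dictionaries where it is MtR or MrR
    and the dictionaries where it is listed with any grade."""
    top = defaultdict(int)
    listed = defaultdict(int)
    for grades in table.values():
        for grade, graphemes in grades.items():
            for g in graphemes:
                listed[g] += 1
                if grade in ("MtR", "MrR"):
                    top[g] += 1
    return top, listed


top, listed = tally(TABLE_7)
in_all_three = sorted(g for g in top if top[g] == 3)
in_two = sorted(g for g in top if top[g] == 2)
print(in_all_three)  # graphemes covered by statement 2.1
print(in_two)        # graphemes covered by statement 2.2
```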

Having obtained this information, we began to look for possible practical applications of these formal markers.

It has already been mentioned that word length is traditionally used in Russian automated text analysis to distinguish formal texts from informal ones. Formal texts are represented by scientific and juridical writing, both of which are characterized by considerable amounts of terminology; such terms have longer graphical and sound forms and are usually borrowed from the stock of internationally recognized words originating from ancient Greek and Latin. In addition, formal texts are expected, and frequently required, to be as monosemantic as possible in order to prevent ambiguous perception and misunderstanding. Informal texts are associated with journalism and fiction, which are not limited by formal requirements and thus make use of shorter colloquial words that are frequently of native origin; moreover, textual polysemy is encouraged by the genres that exist in the corresponding area of speech culture [Kozhina 2008].

At the same time, word length is not reliable enough to serve as the only detection criterion. For this reason, in common practice it is supported by several other criteria, which vary between analytical systems. For example, the well-known Russian public text analyzer Hudlomer, available at http://teneta.rinet.ru/hudlomer/, uses the so-called Fomenko’s invariant to supplement its word length spectrum mechanism. Some researchers have suggested using the chi-squared distribution, Fisher’s hypergeometric criterion, etc. [Shevelev 2006]. The drawback of these additional criteria is that they deal solely with the external aspect of texts and have no substantial connection to their semantics. The marker of initial graphemes, in turn, could establish a link to the internal aspect of the text under analysis, thus properly supplementing the external criterion of word length.

In this respect, we partially follow a trend in Russian linguistics that associates textual polysemy with the concept of entropy. Certain recent findings in this area are related to Dr. Sergey Gusarenko’s scientific school [Gusarenko 2009]. According to this view, the proper understanding of natural-language texts is influenced by entropy, i.e. the amount of chaos introduced by external and internal factors. One such internal factor is believed to be polysemy. Each polysemous word introduces an amount of chaos into the text, making it more difficult to process and understand by natural or artificial means; correspondingly, since entropy is an additive quantity, a text’s total entropy is determined by the number of polysemous words in it. Similarly, by measuring the percentage shares of words associated with lower and higher degrees of polysemy, we estimate the general polysemousness of the text, reflected in the number of meanings each word could possibly have. We use the term “potential polysemousness” to denote this concept [Golovko 2012].

Since the principal properties of lexemes that determine the characteristic traits of formal and informal texts (i.e. length, borrowedness and number of meanings) correspond to the Zipf-Mandelbrot law in its application to linguistics and to statements 1.1 – 1.3 presented above, we decided to test whether the formal markers of word length and initial grapheme allow Russian texts to be classified correctly as formal or informal. We were particularly interested in determining whether these two markers would constitute a basis solid enough to support reliable classification.

 

4. RESULTS

 

To perform the test and gather statistical data for further analysis, we collected a selection of 100 texts in Russian. The selection includes 25 samples of poetry and fiction, 25 publicistic texts, 25 juridical texts and 25 scientific texts; formal and informal speech are thus represented by 50 samples each.

At the initial stage, we applied a verification procedure to ensure that all selected initial graphemes are indeed reliable. The results of the separate investigation for each grapheme are presented in Table 8.

 

Table 8. Statistical data for initial graphemes associated with lower and greater degrees of polysemy (here and below, in the sense of statement 2.6)

 

Initial grapheme / Texts | Literary | Publicistic | Scientific | Juridical | Total informal | Total formal
А | 0.3818% | 0.7792% | 1.2733% | 1.5392% | 0.5805% | 1.4062%
Б | 3.2023% | 2.8651% | 1.9355% | 1.3503% | 3.0337% | 1.6429%
Д | 3.4709% | 3.2472% | 3.1434% | 3.9605% | 3.3591% | 3.5519%
И | 1.4698% | 2.2922% | 3.3330% | 3.6504% | 1.8810% | 3.4917%
Й | 0.0147% | 0.0000% | 0.0220% | 0.0000% | 0.0073% | 0.0110%
Э | 0.1964% | 0.4390% | 0.9470% | 0.4518% | 0.3177% | 0.6994%
Ю | 0.0433% | 0.0474% | 0.0811% | 0.1941% | 0.0453% | 0.1376%
Е | 1.5353% | 1.4675% | 0.8500% | 0.6964% | 1.5014% | 0.7732%
Ж | 0.8424% | 0.6390% | 0.2657% | 0.1723% | 0.7407% | 0.2190%
У | 1.7155% | 1.8030% | 2.6991% | 3.0701% | 1.7593% | 2.8846%
Ц | 0.2339% | 0.3430% | 0.4894% | 0.5139% | 0.2885% | 0.5016%
Ч | 2.4071% | 2.4605% | 1.6164% | 0.8647% | 2.4338% | 1.2406%
Щ | 0.0607% | 0.0237% | 0.0209% | 0.0018% | 0.0422% | 0.0113%
Я | 1.4625% | 0.6685% | 0.5591% | 0.2802% | 1.0655% | 0.4197%

 

It was discovered that certain graphemes exhibit behaviour that contradicts the initial assumptions or does not entirely correspond to them. Consequently, we excluded the graphemes Б (previously noted as having a medium degree of reliability), Д (lower degree of reliability), У and Ц (greater degree of reliability) from the selection.
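The verification can be expressed as a simple directional check on the Table 8 data: a marker of lower polysemy is expected to be noticeably more frequent in formal texts, and a marker of greater polysemy in informal texts. The sketch below is only an illustration. The exact criterion applied is not stated in the text, and the minimum ratio of 1.2 (`MIN_RATIO`) is a hypothetical threshold chosen for the example; informal and formal shares are computed as the means of the literary and publicistic columns and of the scientific and juridical columns respectively.

```python
# Table 8 shares (%) per grapheme: literary, publicistic, scientific, juridical.
TABLE_8 = {
    "А": (0.3818, 0.7792, 1.2733, 1.5392),
    "Б": (3.2023, 2.8651, 1.9355, 1.3503),
    "Д": (3.4709, 3.2472, 3.1434, 3.9605),
    "И": (1.4698, 2.2922, 3.3330, 3.6504),
    "Й": (0.0147, 0.0000, 0.0220, 0.0000),
    "Э": (0.1964, 0.4390, 0.9470, 0.4518),
    "Ю": (0.0433, 0.0474, 0.0811, 0.1941),
    "Е": (1.5353, 1.4675, 0.8500, 0.6964),
    "Ж": (0.8424, 0.6390, 0.2657, 0.1723),
    "У": (1.7155, 1.8030, 2.6991, 3.0701),
    "Ц": (0.2339, 0.3430, 0.4894, 0.5139),
    "Ч": (2.4071, 2.4605, 1.6164, 0.8647),
    "Щ": (0.0607, 0.0237, 0.0209, 0.0018),
    "Я": (1.4625, 0.6685, 0.5591, 0.2802),
}
LOWER = set("АБДИЙЭЮ")  # markers of lower polysemy (statement 2.6)
MIN_RATIO = 1.2         # hypothetical threshold, not taken from the text


def is_consistent(grapheme: str) -> bool:
    """Check that the grapheme leans towards the expected kind of text."""
    literary, publicistic, scientific, juridical = TABLE_8[grapheme]
    informal = (literary + publicistic) / 2
    formal = (scientific + juridical) / 2
    if grapheme in LOWER:
        return formal >= MIN_RATIO * informal
    return informal >= MIN_RATIO * formal  # markers of greater polysemy


rejected = [g for g in TABLE_8 if not is_consistent(g)]
print(rejected)
```

With this threshold the check rejects Б, Д, У and Ц, the same four graphemes that are excluded above; a different threshold could give a different list.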

At the next stage, the total text length in symbols, the total number of words, the average word length, the percentage shares of initial graphemes associated with lower and greater degrees of polysemy, the difference between these shares, and the ratio of the share of greater degree of polysemy to the share of lower degree of polysemy were determined for each text. The results are displayed in Tables 9-12 for each functional style separately. All tables share the following abbreviations: AWL – average word length, %LP – percentage share of initial graphemes associated with lower degrees of polysemy, %GP – percentage share of initial graphemes associated with greater degrees of polysemy, D – difference between %LP and %GP, R – %GP/%LP ratio.
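The per-text quantities can be computed along the following lines. This is a sketch under stated assumptions rather than the tool actually used: a word is taken to be an uninterrupted run of Cyrillic letters (so hyphenated words are split), the marker sets are those of statement 2.6 minus the excluded graphemes Б, Д, У and Ц, and the names `text_metrics`, `LP_MARKERS` and `GP_MARKERS` are hypothetical.

```python
import re

LP_MARKERS = set("АИЙЭЮ")   # lower polysemy, after excluding Б and Д
GP_MARKERS = set("ЕЁЖЧЩЯ")  # greater polysemy, after excluding У and Ц

# Assumption: a word is a run of Cyrillic letters.
WORD_RE = re.compile(r"[А-Яа-яЁё]+")


def text_metrics(text: str) -> dict[str, float]:
    """Compute Symbols, Words, AWL, %LP, %GP, D and R for one text."""
    words = WORD_RE.findall(text)
    if not words:
        raise ValueError("no Cyrillic words found")
    n = len(words)
    lp = 100 * sum(w[0].upper() in LP_MARKERS for w in words) / n
    gp = 100 * sum(w[0].upper() in GP_MARKERS for w in words) / n
    return {
        "Symbols": len(text),
        "Words": n,
        "AWL": sum(len(w) for w in words) / n,
        "%LP": lp,
        "%GP": gp,
        "D": lp - gp,                          # difference %LP - %GP
        "R": gp / lp if lp else float("inf"),  # ratio %GP / %LP
    }


sample = "Я жду эту идею"
metrics = text_metrics(sample)
print(metrics)
```

In the four-word sample, two words begin with markers of lower polysemy (Э, И) and two with markers of greater polysemy (Я, Ж), so both shares are 50%, D is 0 and R is 1.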

 

Table 9. Statistical data for poetry and fiction

 

Text # | Symbols | Words | AWL | %LP | %GP | D | R
1 | 359964 | 53913 | 4.7831 | 2.3186% | 8.6565% | -6.3379 | 3.7336
2 | 400357 | 63380 | 4.8953 | 2.3825% | 7.3872% | -5.0047 | 3.1007
3 | 236025 | 35110 | 4.9669 | 2.3697% | 8.5246% | -6.1549 | 3.5974
4 | 447871 | 65386 | 5.0885 | 2.5021% | 6.4387% | -3.9366 | 2.5733
5 | 269075 | 34534 | 3.9340 | 2.0212% | 4.3609% | -2.3397 | 2.1576
6 | 25415 | 2931 | 3.9710 | 1.8765% | 4.1624% | -2.2859 | 2.2182
7 | 81711 | 9809 | 3.8842 | 1.5700% | 4.1493% | -2.5793 | 2.6429
8 | 22095 | 3554 | 4.6629 | 1.8008% | 5.9932% | -4.1924 | 3.3281
9 | 376723 | 39600 | 3.4766 | 1.3586% | 4.4141% | -3.0555 | 3.2491
10 | 90674 | 14394 | 4.7587 | 2.1120% | 7.6699% | -5.5579 | 3.6316
11 | 183978 | 27688 | 4.5648 | 1.9864% | 8.6752% | -6.6888 | 4.3673
12 | 165793 | 31699 | 3.8478 | 2.8708% | 8.3441% | -5.4733 | 2.9066
13 | 23287 | 3587 | 4.4957 | 2.0072% | 6.2448% | -4.2376 | 3.1111
14 | 114118 | 16839 | 4.8982 | 1.7459% | 5.0716% | -3.3257 | 2.9048
15 | 313322 | 46329 | 4.8489 | 1.9642% | 5.5386% | -3.5744 | 2.8198
16 | 26592 | 3808 | 5.0089 | 2.4422% | 8.1933% | -5.7511 | 3.3548
17 | 75628 | 10983 | 4.9390 | 1.8028% | 5.1261% | -3.3233 | 2.8434
18 | 15183 | 1784 | 5.0639 | 3.0830% | 5.7175% | -2.6345 | 1.8545
19 | 17401 | 2859 | 4.4008 | 1.2592% | 2.9731% | -1.7139 | 2.3611
20 | 29917 | 4693 | 4.0324 | 1.9604% | 4.9435% | -2.9831 | 2.5217
21 | 20800 | 3171 | 4.8802 | 2.4913% | 5.3611% | -2.8698 | 2.1519
22 | 118760 | 17920 | 4.2513 | 3.4208% | 7.6228% | -4.2020 | 2.2284
23 | 107931 | 16403 | 4.5122 | 2.0484% | 7.0841% | -5.0357 | 3.4583
24 | 72194 | 11278 | 4.4370 | 1.6049% | 6.7122% | -5.1073 | 4.1823
25 | 101108 | 16748 | 4.1192 | 1.6480% | 8.3234% | -6.6754 | 5.0507
Average value | 147836.9 | 21536 | 4.5089 | 2.1059% | 6.3075% | -4.2016 | 3.0540

 

Table 10. Statistical data for publicistic texts

 

Text # | Symbols | Words | AWL | %LP | %GP | D | R
1 | 7230 | 964 | 4.9481 | 2.0747% | 6.1203% | -4.0456 | 2.9500
2 | 8655 | 1268 | 5.4495 | 4.6530% | 4.6530% | 0.0000 | 1.0000
3 | 15477 | 2251 | 5.4860 | 3.7317% | 4.4425% | -0.7108 | 1.1905
4 | 29767 | 4856 | 4.7780 | 2.4918% | 6.8987% | -4.4069 | 2.7686
5 | 11681 | 1632 | 5.8658 | 3.9828% | 4.8407% | -0.8579 | 1.2154
6 | 5459 | 752 | 5.9535 | 5.3191% | 4.6543% | 0.6648 | 0.8750
7 | 6486 | 848 | 5.8255 | 4.9528% | 7.9009% | -2.9481 | 1.5952
8 | 6623 | 1055 | 4.8275 | 3.2227% | 6.8246% | -3.6019 | 2.1176
9 | 15360 | 2149 | 5.8381 | 4.4672% | 5.4444% | -0.9772 | 1.2188
10 | 13956 | 1959 | 5.7626 | 3.8285% | 5.3088% | -1.4803 | 1.3867
11 | 19398 | 2908 | 5.3855 | 3.2325% | 4.5392% | -1.3067 | 1.4043
12 | 6621 | 819 | 5.4481 | 4.2735% | 2.6862% | 1.5873 | 0.6286
13 | 10501 | 1513 | 5.5056 | 2.6438% | 4.3622% | -1.7184 | 1.6500
14 | 12455 | 1822 | 5.4748 | 4.1164% | 5.8178% | -1.7014 | 1.4133
15 | 11695 | 1826 | 4.9578 | 3.0120% | 5.3669% | -2.3549 | 1.7818
16 | 15516 | 2256 | 5.4756 | 3.1472% | 5.4965% | -2.3493 | 1.7465
17 | 6054 | 881 | 5.5392 | 2.8377% | 7.0375% | -4.1998 | 2.4800
18 | 5257 | 753 | 5.6521 | 3.4529% | 4.6481% | -1.1952 | 1.3462
19 | 5468 | 843 | 5.1851 | 3.2028% | 7.2361% | -4.0333 | 2.2593
20 | 6675 | 952 | 5.1134 | 2.4160% | 4.0966% | -1.6806 | 1.6957
21 | 6685 | 936 | 5.2799 | 4.0598% | 4.5940% | -0.5342 | 1.1316
22 | 8503 | 1205 | 5.7734 | 3.6515% | 4.3154% | -0.6639 | 1.1818
23 | 7547 | 1120 | 5.3848 | 3.0357% | 3.4821% | -0.4464 | 1.1471
24 | 11982 | 1877 | 4.9989 | 3.7294% | 6.7128% | -2.9834 | 1.8000
25 | 8662 | 1225 | 5.7241 | 3.3469% | 4.0000% | -0.6531 | 1.1951
Average value | 10548.52 | 1546.8 | 5.4253 | 3.5553% | 5.2592% | -1.7039 | 1.5672

 

Table 11. Statistical data for scientific texts

 

Text # | Symbols | Words | AWL | %LP | %GP | D | R
1 | 8082 | 1016 | 6.6181 | 6.3976% | 4.0354% | 2.3622 | 0.6308
2 | 11794 | 1320 | 7.4258 | 5.0000% | 2.7273% | 2.2727 | 0.5455
3 | 7151 | 773 | 7.9405 | 5.1746% | 1.5524% | 3.6222 | 0.3000
4 | 6307 | 746 | 7.1032 | 7.2386% | 3.0831% | 4.1555 | 0.4259
5 | 5871 | 649 | 7.5932 | 6.0092% | 2.4653% | 3.5439 | 0.4103
6 | 153660 | 18607 | 6.5052 | 5.7505% | 3.2300% | 2.5205 | 0.5617
7 | 522418 | 62956 | 6.9933 | 7.8277% | 2.4477% | 5.3800 | 0.3127
8 | 640763 | 86264 | 5.8213 | 5.7892% | 3.5461% | 2.2431 | 0.6125
9 | 611274 | 76340 | 6.4292 | 6.6138% | 4.8913% | 1.7225 | 0.7396
10 | 294539 | 32636 | 7.0700 | 6.1864% | 1.9672% | 4.2192 | 0.3180
11 | 251950 | 29555 | 6.9543 | 4.4561% | 3.3869% | 1.0692 | 0.7601
12 | 8339 | 988 | 6.7247 | 5.1619% | 1.5182% | 3.6437 | 0.2941
13 | 158064 | 18346 | 7.2798 | 6.2411% | 3.2868% | 2.9543 | 0.5266
14 | 342976 | 38439 | 7.4348 | 8.1766% | 2.7004% | 5.4762 | 0.3303
15 | 576735 | 76182 | 6.1412 | 5.0287% | 5.6352% | -0.6065 | 1.1206
16 | 538073 | 67485 | 6.5304 | 5.3434% | 2.8110% | 2.5324 | 0.5261
17 | 638815 | 79477 | 6.5416 | 5.3047% | 3.9986% | 1.3061 | 0.7538
18 | 434023 | 49650 | 7.2522 | 8.0685% | 2.5277% | 5.5408 | 0.3133
19 | 10314 | 1183 | 7.4142 | 6.5089% | 2.1978% | 4.3111 | 0.3377
20 | 10577 | 1247 | 7.1460 | 5.1323% | 3.4483% | 1.6840 | 0.6719
21 | 9600 | 1130 | 6.9956 | 6.0177% | 4.7788% | 1.2389 | 0.7941
22 | 9142 | 1143 | 6.4182 | 3.3246% | 5.3368% | -2.0122 | 1.6053
23 | 12669 | 1444 | 7.4017 | 5.0554% | 3.1856% | 1.8698 | 0.6301
24 | 6324 | 667 | 8.0435 | 2.9985% | 2.9985% | 0.0000 | 1.0000
25 | 5623 | 654 | 7.2446 | 2.5994% | 5.0459% | -2.4465 | 1.9412
Average value | 211003.3 | 25955.88 | 7.0009 | 5.6562% | 3.3121% | 2.3441 | 0.6585

 

Table 12. Statistical data for juridical texts

 

Text # | 1 | 2 | 3 | 4 | 5 | 6 | 7
Symbols | 174699 | 52122 | 9220 | 27346 | 131609 | 41427 | 36367
Words | 20664 | 6716 | 1066 | 3410 | 15172 | 4872 | 4659
AWL | 6.9340 | 6.4242 | 6.8405 | 6.2651 | 6.9853 | 6.4002 | 6.3634
%LP | 7.7478% | 7.8023% | 4.5028% | 4.2522% | 6.7888% | 3.2430% | 4.9152%
%GP | 1.5970% | 3.2162% | 1.2195% | 3.6070% | 1.9641% | 2.9762% | 3.5415%
D | 6.1508 | 4.5861 | 3.2833 | 0.6452 | 4.8247 | 0.2668 | 1.3737
R | 0.2061 | 0.4122 | 0.2708 | 0.8483 | 0.2893 | 0.9177 | 0.7205

Text # | 8 | 9 | 10 | 11 | 12 | 13 | 14
Symbols | 67000 | 11092 | 22194 | 13540 | 54614 | 26047 | 308198
Words | 5866 | 1242 | 2543 | 1505 | 6304 | 2241 | 26883
AWL | 5.8529 | 7.2262 | 7.0822 | 7.5056 | 6.3165 | 6.9942 | 5.0633
%LP | 2.7787% | 3.6232% | 6.3704% | 7.5083% | 4.6003% | 4.9041% | 5.0329%
%GP | 0.5967% | 1.0467% | 2.1628% | 1.3289% | 1.6022% | 0.4904% | 0.8109%
D | 2.1820 | 2.5765 | 4.2076 | 6.1794 | 2.9981 | 4.4137 | 4.2220
R | 0.2147 | 0.2889 | 0.3395 | 0.1770 | 0.3483 | 0.1000 | 0.1611

Text # | 15 | 16 | 17 | 18 | 19 | 20 | 21
Symbols | 35743 | 463287 | 105466 | 102350 | 26770 | 68811 | 47690
Words | 3890 | 39250 | 11425 | 11922 | 3013 | 7466 | 5739
AWL | 7.6270 | 6.2498 | 7.6229 | 7.0478 | 7.6777 | 6.8924 | 7.0585
%LP | 5.5270% | 4.5172% | 14.0919% | 5.7876% | 9.8241% | 3.8843% | 4.6001%
%GP | 1.6710% | 1.8497% | 1.7155% | 1.8202% | 0.9957% | 1.3662% | 2.9273%
D | 3.8560 | 2.6675 | 12.3764 | 3.9674 | 8.8284 | 2.5181 | 1.6728
R | 0.3023 | 0.4095 | 0.1217 | 0.3145 | 0.1014 | 0.3517 | 0.6364

Text # | 22 | 23 | 24 | 25 | Average value
Symbols | 69531 | 140400 | 80977 | 47342 | 86553.68
Words | 8277 | 17074 | 9739 | 5411 | 9053.96
AWL | 6.9447 | 6.7259 | 6.8607 | 7.2689 | 6.8092
%LP | 5.5817% | 8.3870% | 7.2081% | 2.3840% | 5.8345%
%GP | 1.5344% | 3.0749% | 1.8072% | 5.3964% | 2.0127%
D | 4.0473 | 5.3121 | 5.4009 | -3.0124 | 3.8218
R | 0.2749 | 0.3666 | 0.2507 | 2.2636 | 0.4275

 

The data demonstrate clear differences between formal and informal texts both in terms of average word length and potential polysemousness.

1) In 100% of informal texts, AWL value is below 6; in 94% of formal texts, AWL value is above 6.

2) In 86% of informal texts, %LP is below 4%; in 84% of formal texts, %LP is above 4%.

3) In 94% of informal texts, %GP is above 4%; in 86% of formal texts, %GP is below 4%.

4) In 94% of informal texts, D is below 0; in 90% of formal texts, D is above 0.

5) Consistently with D, R is above 1 in informal texts and below 1 in formal texts, since R exceeds 1 exactly when D is negative.
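The five observations above can be expressed as simple per-text checks. The following Python sketch is illustrative only and is not the implementation used in this study. It assumes, as the tabulated values suggest, that D is the difference %LP - %GP and R is the ratio %GP / %LP; the names `TextStats` and `formality_votes` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TextStats:
    """Per-text statistics as reported in the tables (hypothetical container)."""
    awl: float  # average word length
    lp: float   # %LP, in percent
    gp: float   # %GP, in percent

    @property
    def d(self) -> float:
        # Assumption: D = %LP - %GP (matches the tabulated values).
        return self.lp - self.gp

    @property
    def r(self) -> float:
        # Assumption: R = %GP / %LP; undefined for %LP = 0, reported as infinity.
        return self.gp / self.lp if self.lp else float("inf")


def formality_votes(s: TextStats) -> Dict[str, bool]:
    """Return one vote per criterion; True means 'points to a formal text'."""
    return {
        "AWL > 6": s.awl > 6,
        "%LP > 4": s.lp > 4,
        "%GP < 4": s.gp < 4,
        "D > 0": s.d > 0,
        "R < 1": s.r < 1,
    }


# Values taken from Table 10 (text 1) and Table 11 (text 1).
publicistic_1 = TextStats(awl=4.9481, lp=2.0747, gp=6.1203)
scientific_1 = TextStats(awl=6.6181, lp=6.3976, gp=4.0354)

for name, stats in (("publicistic 1", publicistic_1), ("scientific 1", scientific_1)):
    print(name, round(stats.d, 4), round(stats.r, 4), formality_votes(stats))
```

For the scientific text, four criteria point to a formal text while %GP (4.0354%) falls just above the 4% boundary, which illustrates why no single criterion reaches 100% on the formal texts.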

 

5. CONCLUSIONS

 

On the basis of the gathered data and of their analysis it is possible to formulate the following conclusions:

5.1. The difference between the percentage shares of initial graphemes associated with lower degrees of polysemy and of initial graphemes associated with greater degrees of polysemy (the value D) can be applied in automated text processing as the most reliable and most easily computed value that reflects the potential polysemousness of a text in the Russian language.

5.2. The combination of average word length and potential polysemousness enables reliable differentiation of Russian formal texts from Russian informal texts. In the texts analyzed above, an anomalous average word length is always offset by a correct value of the potential polysemousness marker, and an anomalous value of potential polysemousness is always offset by a correct average word length; therefore, a detection algorithm using the two criteria together would have produced no incorrect verdicts on this selection.
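The way the two criteria are combined is not spelled out above, so the following Python sketch shows only one conservative reading, given here as an assumption: a verdict is issued when average word length and D agree, and the text is left undetermined when they disagree. The function name `ensemble_verdict`, the treatment of boundary values, and the "undetermined" outcome (returned as `None`) are hypothetical choices made for illustration.

```python
from typing import Optional

AWL_THRESHOLD = 6.0  # formal texts tend to have AWL above 6
D_THRESHOLD = 0.0    # formal texts tend to have D above 0


def ensemble_verdict(awl: float, d: float) -> Optional[str]:
    """Combine AWL and D; return None when the two markers disagree.

    Assumption: values exactly on a threshold count as 'informal' votes.
    """
    awl_formal = awl > AWL_THRESHOLD
    d_formal = d > D_THRESHOLD
    if awl_formal and d_formal:
        return "formal"
    if not awl_formal and not d_formal:
        return "informal"
    return None  # one marker is anomalous; no verdict is issued


# (AWL, D) pairs taken from the tables above.
samples = {
    "Table 12, text 1": (6.9340, 6.1508),
    "Table 10, text 1": (4.9481, -4.0456),
    "Table 11, text 15": (6.1412, -0.6065),
    "Table 10, text 6": (5.9535, 0.6648),
}

for label, (awl, d) in samples.items():
    print(label, ensemble_verdict(awl, d))
```

Under this reading, a text that is anomalous on one marker is never misclassified, because the other marker blocks the verdict; resolving such cases into a definite class would require an additional rule that is not given here.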

 

6. REFERENCES

 

1. S.I. Ozhegov. Dictionary of the Russian language. Edited by professor L.I. Skvortsov. 25th edition, modified and extended. Oniks, 2008.

2. The contemporary dictionary of the Russian language. Supervised by S.I. Kuznetsov. Reader’s Digest, 2004.

3. The minor academic dictionary of the Russian language. (In 4 volumes.) Edited by A.P. Yevgenyeva. 4th stereotype edition. Russkiy Yazyk, 1999.

4. V.A. Seleznev, Ye.V. Isaeva. Hurst parameter of the dictionary sequence. Materials of scientific conference “Quantitative linguistics: researches and models” (КЛИМ-2005): 146-152 (2005). (In Russian)

5. M.N. Kozhina. Stylistics of the Russian language. Flinta, 2008. (In Russian)

6. O.G. Shevelev. Research and development on algorithms of comparison of styles of textual compositions. Abstract of Ph.D. thesis in applied sciences. Tomsk State University, 2006. (In Russian)

7. S.V. Gusarenko. Systematic interaction and entropy of cognitive and semantic structures of the discourse. Abstract of Ph.D. thesis in philology. Stavropol State University, 2009. (In Russian)

8. N.V. Golovko. Estimation of semantic potential of texts in analytical systems. Lambert Academic Publishing, 2012. (In Russian)

9. Quantitative Linguistics: an international handbook. Edited by R. Köhler, G. Altmann, R.G. Piotrowski. [Google Books] http://books.google.ru/books?id=-7Z-GA73MMAC&dprintsec=frontcover&hl=ru

10. C. Apte, F. Damerau, S. Weiss. Automated Learning of Decision Rules for Text Categorization. [Penn State College of Information Sciences and Technology] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.3129&rep=rep1&type=pdf

11. F. Beil, M. Ester, X. Xu. Frequent Term-Based Text Clustering. [University of Arkansas at Little Rock] https://www.cs.sfu.ca/~ester/papers/KDD02.Clustering.final.pdf

12. P. Cimiano, S. Staab, J. Tane. Deriving Concept Hierarchies from Text by Smooth Formal Concept Analysis. [University of Karlsruhe] http://www.aifb.kit.edu/images/a/a5/2003_156_Cimiano_Deriving_Concep_1.ps

13. D. Ferrucci. Text analysis as formal inference for the purposes of uniform tracing and explanation generation. [IBM] http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/aba6e5969bd737fe85256f2c006d96fe?OpenDocument

14. A. Mehler, U. Waltinger, A. Wegner. A Formal Text Representation Model Based on Lexical Chaining. [Bielefeld University] http://www.ulliwaltinger.de/pdf/LNVD07MehlerWaltingerWegner.pdf

15. S.J. Gruber. Frame information and lexically-based inference. Quaderni di semantica. 2(6): 58-78 (1985).

16. Th.R. Hofmann. Semantic frames and content representation. Quaderni di semantica. 2(6): 267-284 (1985).

17. D.I. Holmes. The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic Computing. 13(3): 111-117 (1998).

18. R. Hudson. Some basic assumptions about linguistic and non-linguistic knowledge. Quaderni di semantica. 2(6): 284-287 (1985).

19. F. Kiefer. How to account for situational meaning? Quaderni di semantica. 2(6): 288-295 (1985).

20. H. Ôim, M. Saluveer. Frames in linguistic descriptions. Quaderni di semantica. 2(6): 295-305 (1985).

21. R.D. Peng, N.W. Hengartner. Quantitative analysis of literary styles. The American Statistician. 56(3): 175-185 (2002).

22. J.J. Rocchio, Jr. Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing. 313-323 (1971).

23. R. Schank, R. Abelson. Scripts, Plans, Goals and Understanding. Hillsdale, 1977.

24. D. Tannen. Frames and schemas in interaction. Quaderni di semantica. 2(6): 326-335 (1985).

25. Y. Wilks. Text structures and knowledge structures. Quaderni di semantica. 2(6): 335-344 (1985).