Overview of the status quo in Russian associative lexicography

 

Throughout our research into automated text processing, we have been concerned with the correct interpretation of indirect lexical meanings (i.e. those that are not directly motivated by the word's denotation and consequently cannot be immediately derived from its primary meaning). An automated analyzer can process such meanings only if its database contains a separate description of each of them. It is hardly possible, however, to codify them all, and even if some group of researchers succeeded in such an enterprise, the database would soon become obsolete owing to the volatile nature of secondary connotations.

Briefly, the problem can be described as follows: an automated text analyzer encounters a word combination whose elements are not detected as compatible with each other. If the combination is not listed in the database of known phrases, and the lexical units it consists of cannot be related to each other by meaning, the analyzer faces a fragment of text that it cannot properly understand (in the sense of so-called machine understanding) and therefore cannot process accurately. The resulting errors are evident and commonplace in fields such as automated translation: phrases that the analyzer did not recognize are rendered in a weird or senseless fashion.

Thus, a proactive algorithm needs to be developed and implemented in automated analyzers in addition to the existing retroactive mechanisms; such an algorithm would be able to establish links between lexical units wherever this task cannot be fulfilled by ordinary means. To function properly, the algorithm would require knowledge of the ways in which the language community creates new connotations. In the majority of cases this process is based upon associative rethinking of the denotation, and upon metaphor in particular. [1] It seems possible that associative dictionaries (i.e. databases of the notions that speakers associate with particular words) could provide a solid foundation for such algorithms.

Consequently, we are interested in the possibilities that existing works of Russian associative lexicography offer for developing a proactive algorithm that meets the actual needs of automated text analysis.
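The division of labour between the retroactive lookup and the proactive step discussed above can be illustrated with a minimal Python sketch. The phrase list, the association sets, and the function name link_words are hypothetical and serve only as an illustration; linking two words through a shared association is one simple assumed strategy, not an established method.

```python
# Hypothetical data: a list of known phrases and a tiny associative dictionary.
KNOWN_PHRASES = {("heavy", "rain")}
ASSOCIATIONS = {
    "iron": {"metal", "hard", "strong"},
    "will": {"determination", "strong", "desire"},
}

def link_words(first, second):
    """Try to relate two words; return (kind of link, linking associations)."""
    # Retroactive mechanism: the combination is already listed.
    if (first, second) in KNOWN_PHRASES:
        return "listed", set()
    # Proactive step (assumed strategy): look for associations shared by both words.
    shared = ASSOCIATIONS.get(first, set()) & ASSOCIATIONS.get(second, set())
    if shared:
        return "associative", shared
    # Neither mechanism found a link: the fragment stays unintelligible.
    return "unresolved", set()

print(link_words("heavy", "rain"))
print(link_words("iron", "will"))
print(link_words("iron", "rain"))
```

In this toy setting the combination "iron will" is absent from the phrase list but is linked through the shared association "strong", whereas "iron rain" remains unresolved. How useful such a step is depends entirely on the quality of the associations stored in the dictionary.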

Historically, the first notable attempt to compose an associative dictionary of the Russian language was made in 1977 by the recognized Soviet psychologist Aleksej N. Leontev. [2] The dictionary presented results obtained by the researcher in a number of associative experiments required for the study of language consciousness; naturally, the methodology behind it was purely psychological and was developed in accord with the behaviourist approach. The database of this dictionary includes a limited set of stimuli (200 entries in total) that were presented to test subjects, who had to call out the very first association that a stimulus triggered in their minds. The associations were then processed and their frequency among the test audience was calculated. Reactions given by more than one test subject were placed at the beginning of a dictionary entry together with the number of their occurrences, followed by the so-called solitary associations (i.e. those mentioned by a single person only).
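The structure of such an entry can be sketched in a few lines of Python. The function name compile_entry and the sample responses are hypothetical, and ordering the shared reactions by descending frequency is an assumption; the description above states only that they precede the solitary ones and carry their counts.

```python
from collections import Counter

def compile_entry(responses):
    """Build one dictionary entry from the first associations named by test subjects."""
    counts = Counter(responses)
    # Reactions given by more than one subject, with their number of occurrences.
    # Assumed ordering: most frequent first, ties broken alphabetically.
    shared = sorted(
        ((word, n) for word, n in counts.items() if n > 1),
        key=lambda item: (-item[1], item[0]),
    )
    # Solitary associations: named by exactly one subject.
    solitary = sorted(word for word, n in counts.items() if n == 1)
    return shared, solitary

# Hypothetical responses to the stimulus "big".
responses = ["small", "house", "small", "elephant", "house", "small", "city"]
shared, solitary = compile_entry(responses)
print(shared)    # [('small', 3), ('house', 2)]
print(solitary)  # ['city', 'elephant']
```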

A brief analysis of the entries in Leontev's dictionary reveals that the most frequent associations are antonyms of the stimuli ("grandmother" → "grandfather", "big" → "small") or words that commonly occur in the stimulus's distribution because they are connected to it through a set phrase or a widespread word combination ("to come" → "home", "to meet" → "friend"). This phenomenon follows directly from the methodology behind the dictionary: required to name an association immediately, test subjects naturally resort to the most easily recalled reactions. This principle was entirely satisfactory for the purposes of Leontev's research, but it is not viable for building the aforementioned algorithm, since the experiment is not based upon linguistic methodology, was conducted in pursuit of notably different objectives, and does not provide a reliable source of associations or association routes. The word "small" is undoubtedly a valid psychological reaction to the stimulus "big", but it can hardly serve as a basis for rethinking the denotation and subsequently developing any connotations for that lexical unit. In addition, the number of stimuli is minuscule and the words themselves seem to have been chosen at random; this severely limits the usefulness of the database, which is suitable neither for full-scale analysis nor for a possible built-in language of semantic primitives within the algorithm.

The same problems can be observed in another significant effort in the area, the Russian associative dictionary supervised by Yuri N. Karaulov. The dictionary was originally issued in 1994 and subsequently updated in 2002. [3] The latest edition includes around 7,000 stimuli and more than 1,000,000 associations related to them; the database of this thesaurus is thus considerably larger than that of Leontev's dictionary, which makes its application in the heuristic algorithm under discussion conceivable. It should also be noted that basic lexical units ("to be", for example) are present in the stimulus list; the dictionary could therefore support the algorithm's system of semantic primitives.

Nevertheless, the methodology underlying the thesaurus has not undergone notable modifications since Leontev's dictionary was issued. The psychological nature of the dictionary remains evident; the experiment behind it was significantly expanded in scale, but in essence it originated from the same behaviourist approach and was still grounded in the stimulus-reaction principle. Consequently, from the point of view of proactive analysis of lexical connotations, Karaulov's thesaurus shares the major drawbacks of Leontev's dictionary. It is thus not surprising that the most frequent associations for many stimuli, including the words quoted above as examples, coincide exactly in both dictionaries.

One of the most recent associative databases worth mentioning is the Siberian associative dictionary. [4] The experiment that provided the basis for this thesaurus has been conducted since 2008; full public access to the entire database was granted in 2015. The dictionary does not differ substantially from Leontev's and Karaulov's thesauri in methodology, although it may be of interest because it offers the most contemporary material on the subject. On the whole, the issues caused by the psychological essence of the experiment persist in this dictionary.

Attempts have also been made to compile web-based associative dictionaries, either supported by methods of corpus linguistics or relying upon user-generated content. For instance, a database of this kind is currently available at http://slovesa.ru/. [5]

Corpus technologies represent an interesting direction of associative research, especially where they are combined with syntax analyzers or context-driven data mining mechanisms capable of establishing links between lexical units that are not directly adjacent to each other. Traditional corpus research usually depends upon a word's distribution, which is not a sufficiently reliable guide to associations; deeper analysis is required to extract associations from the context they belong to. It should also be noted that automated reference to corpus databases fills the dictionary with a considerable number of superfluous associations that are either not entirely accurate or not related to the word's associative field at all. Thus, even though corpus-based thesauri lie notably closer to the area of linguistics and contain more objective language material, their databases would still require preliminary processing and cleaning before they could be applied in heuristic algorithms.
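A purely distributional baseline of the kind criticized above can be sketched as a windowed co-occurrence count. The function name cooccurrences, the window size, and the sample sentence are assumptions made for illustration; a real corpus tool would add tokenization, lemmatization, and syntactic analysis.

```python
from collections import Counter

def cooccurrences(tokens, target, window=3):
    """Count the words found within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the occurrence of the target itself
                counts[tokens[j]] += 1
    return counts

# Hypothetical one-sentence "corpus".
tokens = "the old sea was calm but the sea can be cruel".split()
counts = cooccurrences(tokens, "sea")
print(counts["calm"], counts["the"], counts["cruel"])  # 2 2 1
```

Even in this toy example the function word "the" scores as high as "calm", which illustrates why raw distributional counts need cleaning before they can be treated as associations.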

User-generated content, in turn, presupposes a more conscious approach of the audience to the associations they provide. This leads to the coexistence of two opposing tendencies in the development of such dictionaries. The first is the provision of a greater number of reactions that are not necessarily limited to the most easily and quickly recalled associations; this improves the dictionary. The second is the addition of incoherent, senseless responses submitted for the user's own amusement or in order to inflate the number of associations; this correspondingly lowers the quality of the database. While the active and conscious participation of language speakers is beneficial, certain issues remain in such dictionaries: linguistic methodology is missing, the vocabulary is not consistent, and the available user-generated associative dictionaries would still require preliminary preparation similar to that necessary for corpus-based thesauri.

This overview leads us to the following conclusions:

1.  Russian associative lexicography in its current state is overwhelmingly dominated by psychological research methods, and particularly by the stimulus-reaction approach, which is based upon subconscious mental activity rather than on conscious consideration and introspection by test subjects. While such methodology is historically refined, objective, and generally satisfactory for the purposes of the research it originated from, it does not provide substantial data on the mechanisms governing the associative rethinking that leads to the creation of new connotations in lexical units. In addition, the application of external research methods is not appropriate for an established branch of scientific knowledge; a linguistic discipline such as lexicography should make use of linguistic methodology instead of borrowing its research methods from psychology.

2.  In certain respects, web-based associative dictionaries can be regarded as an advance over traditional associative thesauri, owing to their detachment from purely psychological research methods and their active use of objective distributional data from corpus databases and of the audience's conscious linguistic self-analysis. Nevertheless, these dictionaries are not grounded in reliable lexicographic principles, are not compiled in accord with established and organized procedures, and contain plenty of random, incoherent associations.

3.  No publicly available database of associations meets the criteria set by the task of developing a heuristic algorithm for establishing links between words whose connotations are not represented in the ordinary databases of automated text analyzers, owing to significant drawbacks in their methodological foundations.

On the basis of these conclusions, the following suggestions can be derived:

1.  A linguistic methodology for the retrieval, processing, and explanation of associations related to lexical units should be developed, and a new associative dictionary of the Russian language should be created on the foundation of this methodology. Research into semantic indicators capable of aligning the routes of possible denotation rethinking should be an integral part of the methodology.

2.  Introspection and a conscious approach by language speakers to the discovery of semantic associations should be encouraged in the procedures for gathering information for the dictionary database. The audience should be deliberately focused on the search for semantically relevant associations (i.e. those that are meaningful language-wise). To ensure a certain quality of introspection, a professional audience such as philologists might be selected.

3.  Database compilation could be improved and facilitated by means of Web 2.0 technologies such as the wiki principle of content generation.

Thus, a more accurate, linguistically grounded, professional community-driven associative dictionary could be created to fill certain gaps that currently exist in Russian associative lexicography.

 

References

 

1. .. // SCI-ARTICLE.RU: . 2014. URL: http://sci-article.ru/stat.php?i=1395142238 ( : 01.02.2016)

2. .. . .: , 1977. 192 .

3. / . . . . .: , 2002. 755 .

4. (2008 2015) (- .., ..). URL: http://adictru.nsu.ru ( : 01.02.2016)

5. . URL: http://slovesa.ru ( : 01.02.2016)