Overview of the status quo in Russian associative lexicography
Throughout our research on various issues related to automated text
processing, we have been concerned with the problem of correctly interpreting
indirect lexical meanings (i.e. those that are not directly motivated by the
word’s denotation and consequently cannot be immediately derived from the
primary meaning). Such meanings can be processed by an automated analyzer only
if its database contains separate descriptions of them. It is hardly possible
to codify all of them, however, and even if some group of researchers succeeded
in such an enterprise, the database would soon become obsolete due to the
volatile nature of secondary connotations.
Briefly, the problem can be described as follows: an automated text analyzer
encounters a word combination whose elements are not detected as compatible
with each other. If this combination is not listed in the database of known
phrases, and the lexical units it consists of cannot be related to each other
in terms of their meaning, the analyzer faces an unintelligible fragment of
text that cannot be properly “understood” (in terms of so-called machine
understanding) and thus cannot be processed accurately. The resulting errors
are commonplace in fields such as automated translation: phrases that the
analyzer fails to recognize are rendered in an odd or senseless fashion.
Thus, a proactive algorithm needs to be developed and implemented in automated
analyzers in addition to the existing retroactive mechanisms; such an algorithm
would be able to establish links between lexical units wherever this task
cannot be accomplished by ordinary means. In order to function properly, the
algorithm would require certain knowledge of the ways in which new connotations
are created by the language community. In the majority of cases this process is
based upon associative rethinking of the denotation, and upon metaphor in
particular. [1] It seems possible that associative dictionaries (i.e. databases
of the notions that language speakers associate with certain words) could
provide a solid foundation for such algorithms.
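The interplay of the retroactive and the proactive step can be illustrated with
a minimal sketch. It is not an implementation of the algorithm itself; the
function name link_words, the toy data and the rule "two words are linked if
they share at least one associated notion" are assumptions made purely for
illustration.

```python
# Hypothetical toy data; a real analyzer would load these from its databases.
KNOWN_PHRASES = {("strong", "tea"), ("heavy", "rain")}
ASSOCIATIONS = {
    "fox": {"cunning", "red", "forest"},
    "politician": {"cunning", "speech", "election"},
}


def link_words(first, second, known_phrases, associations):
    """Try to relate two words; return (link_type, shared_notions) or None."""
    # Retroactive step: the combination is already codified in the database.
    if (first, second) in known_phrases:
        return ("known_phrase", set())
    # Proactive step (assumed rule): look for notions that both words are
    # associated with in the associative dictionary.
    shared = associations.get(first, set()) & associations.get(second, set())
    if shared:
        return ("association", shared)
    # No link found: the fragment stays unintelligible to the analyzer.
    return None


print(link_words("strong", "tea", KNOWN_PHRASES, ASSOCIATIONS))
print(link_words("fox", "politician", KNOWN_PHRASES, ASSOCIATIONS))
print(link_words("fox", "tea", KNOWN_PHRASES, ASSOCIATIONS))
```

In this sketch the quality of the proactive step depends entirely on the
content of the association table, which is why the properties of existing
associative dictionaries matter for the task.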
Consequently, we are currently interested in examining the possibilities that
existing works of Russian associative lexicography offer for the development of
such a proactive algorithm for the practical needs of automated text analysis.
Historically, the first notable attempt to compile an associative dictionary of
the Russian language was made in 1977 by the recognized Soviet psychologist
Aleksej N. Leont’ev. [2] The dictionary presented the results that the
researcher obtained in a series of associative experiments conducted for the
study of language consciousness; it is therefore natural that the methodology
behind the dictionary was purely psychological and was developed in accordance
with the behaviourist approach. The database of this dictionary includes a
limited set of stimuli (200 entries in total) that were presented to test
subjects; the latter had to call out the very first association that a stimulus
triggered in their minds. The associations were then processed and their
frequency among the test subjects was calculated; reactions that were found in
multiple test subjects were placed at the beginning of dictionary entries
together with the number of their occurrences, and then the so-called “solitary
associations” (i.e. those that were mentioned by a single person only) were
enumerated.
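The entry layout described above can be reproduced with a short sketch. The
function name build_entry and the sample responses are hypothetical, and the
ordering of ties (alphabetical) is an assumption, since the source does not
state how equally frequent reactions were ordered.

```python
from collections import Counter


def build_entry(responses):
    """Split the reactions to one stimulus into repeated and solitary ones."""
    counts = Counter(responses)
    # Reactions given by several subjects come first, with their frequency;
    # ties are broken alphabetically (an assumption of this sketch).
    repeated = sorted(
        ((word, n) for word, n in counts.items() if n > 1),
        key=lambda pair: (-pair[1], pair[0]),
    )
    # "Solitary associations" were named by a single subject only.
    solitary = sorted(word for word, n in counts.items() if n == 1)
    return repeated, solitary


# Invented responses to the stimulus "big", for illustration only.
responses = ["small", "small", "house", "small", "house", "city"]
repeated, solitary = build_entry(responses)
print(repeated)   # [('small', 3), ('house', 2)]
print(solitary)   # ['city']
```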
A brief analysis of the entries in Leont’ev’s dictionary reveals that the most
frequent associations are antonyms of the stimuli (бабушка [grandmother] –
дедушка [grandfather], большой [big] – маленький [small]) or words that
commonly occur in the stimuli’s distribution because they are connected with
them through a set phrase or a widespread word combination (приходить [to come]
– домой [home], встречать [to meet] – друга [friend]). The nature of this
phenomenon is directly connected to the methodology behind the dictionary: when
required to name associations immediately, test subjects naturally resort to
the most easily recalled reactions. This principle was entirely satisfactory
for the purposes of Leont’ev’s research, but it cannot be considered viable for
building the aforementioned algorithm: it is not based upon linguistic
methodology, it was applied in pursuit of notably different objectives, and it
does not provide a reliable source of associations or association routes. The
word “small” is undoubtedly a valid psychological reaction to the stimulus
“big”, but it can hardly be imagined as a basis for rethinking the denotation
and subsequently developing any connotations for that lexical unit. In
addition, the number of stimuli is minuscule and the words themselves seem to
have been chosen randomly; this severely limits the possibilities of using the
database, which can serve neither for full-scale analysis nor as the basis for
a possible built-in language of semantic primitives within the algorithm.
The same problems can be observed in another significant effort in the area,
the “Russian associative dictionary” supervised by Yuri N. Karaulov. The
dictionary was originally issued in 1994 and was subsequently updated in 2002.
[3] The latest edition includes around 7 000 stimuli and more than 1 000 000
associations related to them; the database of this thesaurus is thus
considerably larger than that of Leont’ev’s dictionary, which makes its
application in the heuristic algorithm under discussion possible. It should
also be noted that basic lexical units (быть [to be], for example) are present
in the stimuli list; thus, the dictionary could support the algorithm’s system
of semantic primitives.
Nevertheless, the methodology underlying the thesaurus has not undergone
notable modifications since Leont’ev’s dictionary was issued. The psychological
nature of the dictionary remains evident in this work of Russian associative
lexicography; the experiment behind it was significantly expanded in terms of
numbers, but in essence it originated from the same behaviourist approach and
was still grounded in the “stimulus – reaction” core principle. Consequently,
from the point of view of proactive analysis of lexical connotations,
Karaulov’s thesaurus shares the major drawbacks found in Leont’ev’s dictionary.
It is thus not surprising that the most frequent associations for many stimuli,
including the words that we quoted above as examples, coincide exactly in both
dictionaries.
One of the most recent associative databases worth mentioning is СИБАС
(Siberian associative dictionary). [4] The experiment that provided the basis
for this thesaurus has been conducted since 2008; full public access to the
entire database was granted in 2015. The dictionary does not differ
substantially from Leont’ev’s and Karaulov’s thesauri in terms of methodology,
although it may be of some interest because it offers the most recent material
on the subject. In general, the issues caused by the psychological nature of
the experiment persist in this dictionary.
Certain attempts have also been made to compile web-based associative
dictionaries, either supported by methods of corpus linguistics or relying upon
user-generated content. For instance, a database of this kind is currently
available at http://slovesa.ru/. [5]
Corpus technologies represent an interesting direction of associative research,
especially when they are combined with syntax analyzers or context-driven data
mining mechanisms capable of establishing links between lexical units that are
not directly adjacent to each other. Traditional corpus research usually
depends on a word’s distribution, which is not particularly reliable as far as
associations are concerned; deeper analysis is required to extract associations
from the context they belong to. It should also be noted that automated
reference to corpus databases fills the dictionary with a considerable number
of superfluous associations that are either not entirely accurate or not
related to the word’s associative field at all. Thus, even though corpus-based
thesauri are notably closer to the area of linguistics and contain more
objective language material, their databases would still require preliminary
processing and cleaning prior to their application in heuristic algorithms.
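To make the need for cleaning concrete, the following sketch collects
association candidates for a word from a tokenized corpus, including words that
are not directly adjacent to it, and then discards rare candidates. The
function name, the window size and the frequency threshold are assumptions
chosen for illustration; a simple frequency cut-off stands in here for the
deeper contextual analysis that a real system would need.

```python
from collections import Counter


def cooccurrence_candidates(sentences, target, window=3, min_count=2):
    """Count words found within `window` tokens of `target`; drop rare ones."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            # Look at neighbours on both sides, not only adjacent words.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    # Crude cleaning step (assumption): keep only repeated candidates.
    return {word: n for word, n in counts.items() if n >= min_count}


# Invented mini-corpus, already split into tokens.
corpus = [
    ["the", "sly", "old", "fox", "ran"],
    ["a", "fox", "is", "sly"],
    ["the", "fox", "slept"],
]
print(cooccurrence_candidates(corpus, "fox", window=2, min_count=2))
# {'sly': 2}
```

With the threshold lowered to 1, function words such as "the" and "a" would
enter the result, which illustrates the kind of superfluous material mentioned
above.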
User-generated content, in its turn, presupposes a more conscious approach of
the audience to the associations they provide. This leads to the coexistence of
two opposing tendencies in the development of such dictionaries. The first
consists in providing a greater number of reactions that are not necessarily
limited to the most easily and quickly recalled associations; it improves the
dictionary. The second is the addition of incoherent, senseless responses that
are submitted for the user’s own amusement or in order to inflate the number of
associations as much as possible; correspondingly, it decreases the quality of
the database. Whilst the active and conscious participation of language
speakers is beneficial, certain issues persist in such dictionaries: linguistic
methodology is missing, the vocabulary is not consistent, and the available
examples of user-generated associative lexicography would still require
preliminary preparation similar to that necessary for corpus-based thesauri.
This overview leads us to the following conclusions:
1. Russian associative lexicography in its current state is overwhelmingly
dominated by psychological methods of research, and particularly by the
“stimulus – reaction” approach, which is based upon subconscious mental
activity rather than on conscious consideration and introspection by test
subjects. Whilst this methodology is historically refined, objective, and
generally satisfactory for the purposes of the research it originated from, it
does not provide substantial data on the mechanisms that govern the associative
rethinking leading to the creation of new connotations in lexical units. In
addition, relying on external research methods is hardly appropriate for an
established branch of scientific knowledge; a linguistic discipline such as
lexicography should make use of linguistic methodology instead of borrowing its
methods of research from psychology.
2. In certain respects, web-based associative dictionaries can be regarded as
an advance over traditional associative thesauri owing to their detachment from
purely psychological research methods and their active use of objective
distributional data from corpus databases and of the conscious linguistic
self-analysis of the audience. Nevertheless, these dictionaries are not
grounded in reliable lexicographic principles, are not compiled in accordance
with established and organized procedures, and contain plenty of random,
incoherent associations.
3. Owing to significant drawbacks in their methodological foundations, no
existing publicly accessible database of associations meets the criteria set by
the task of developing a heuristic algorithm for establishing links between
words whose connotations are not represented in the ordinary databases of
automated text analyzers.
On the basis of these conclusions, the following suggestions can be made:
1. A linguistic methodology for the retrieval, processing, and explanation of
associations related to lexical units should be developed, and a new
associative dictionary of the Russian language should be created on the
foundation of this methodology. Research into semantic indicators capable of
tracing the routes of possible denotation rethinking processes should be an
integral part of the methodology.
2. Introspection and a conscious approach of language speakers to the discovery
of semantic associations should be encouraged in the procedures of gathering
information for the database of the dictionary. The audience should be
deliberately focused on the search for semantically relevant associations (i.e.
those that are meaningful language-wise). To ensure a certain quality of
introspection, a professional audience such as philologists might be selected.
3. Database compilation could be improved and facilitated by means of Web 2.0
technologies such as the wiki principle of content generation.
Thus, a more accurate, linguistically grounded, and professional
community-driven associative dictionary could be created to fill certain gaps
that currently exist in Russian associative lexicography.
References
1. Гончарова К.Н. Процессы полисемантизации в русском и английском языках как основа построения универсального алгоритма прогнозирования коннотативных значений // SCI-ARTICLE.RU: научный периодический электронный журнал. 2014. URL: http://sci-article.ru/stat.php?i=1395142238 (дата обращения: 01.02.2016)
2. Леонтьев А.Н. Словарь ассоциативных норм русского языка. М.: МГУ, 1977. 192 с.
3. Русский ассоциативный словарь / под рук. Ю. Н. Караулова. М.: Издательство «Астрель», 2002. 755 с.
4. СИБАС – Русская региональная ассоциативная база данных (2008–2015) / авторы-составители И.В. Шапошникова, А.А. Романенко. URL: http://adictru.nsu.ru (дата обращения: 01.02.2016)
5. Словарь ассоциаций. URL: http://slovesa.ru (дата обращения: 01.02.2016)