Vitalijs Sadovskis and Dr. Royal Skousen, Department of Linguistics and English Language
The primary purpose of this project was to explain the common phenomenon of variation in Russian morphology. The research is based on empirical data from the language output of native speakers of Russian. The data sets were prepared for and processed by the Analogical Modeling software, which is designed specifically for predicting language behavior. The results of the computer analysis were then compared with actual language usage.
More specifically, this study attempted to explain variation in the morphological formation of Russian secondary imperfective verbs. A number of imperfective verbs in Russian show three different kinds of variation when derived from their perfective counterparts. For example, the Russian perfective verb sosredotočit’ ‘concentrate’ forms two imperfectives with different surface representations: sosredotočivat’ and sosredotačivat’. The former takes the derivational suffix orthographically represented as –iva– or –yva–, while the latter, in addition to taking this suffix, also undergoes the morphonological vowel alternation <o||a> (orthographically either –a– or –ja–) in the verb stem (sosredotočivat’ versus sosredotačivat’). A different derivational strategy is employed in the case of the perfective izgotovit’ ‘produce’, which forms the imperfective variants izgotavlivat’ and izgotovljat’. This exemplifies suffixal variation, where a derivative can surface either with the suffix –iva–/–yva– or with the suffix orthographically represented as –a– or –ja–: izgotavlivat’ versus izgotovljat’. In rare instances, native Russian speakers may use all three derivational strategies to produce different imperfective forms from one perfective, e.g., prisposoblivat’sja (–iva–), prisposablivat’sja (–iva– and <o||a>), and prisposobljat’sja (–ja–) as imperfective counterparts of the perfective prisposoblit’sja ‘adapt’.
To account for variation in actual language usage, the Russian National Corpus (http://ruscorpora.ru/) was employed as a source of empirical data on language behavior. Specifically, three different corpora were used: the Main Corpus, the Newspaper Corpus, and the Spoken Corpus. The selected time frame for the Main and Spoken corpora is 1992 to 2012, whereas the Newspaper Corpus is based on news and media materials produced in the 2000s. The total number of words in all three corpora within the specified time frames exceeds 261 million (Main Corpus: 81,701,836 words; Newspaper Corpus: 173,520,540 words; Spoken Corpus: 6,240,071 words). The goal was to capture the degree of usage of imperfective variants within the most recent 20 years of corpus data.
Based on this data, the total number of entries across the three corpora was calculated for each variant of the same verb, as well as for all variants of the same verb combined. For each variant, its percentage of the total number of cross-corpora entries was calculated as well. This made it possible to capture the actual language behavior, that is, to see how many variants of each verb are in use, what the formation strategy for each variant is, and how frequent each variant is. The search was conducted for all inflectional and participial forms of a particular variant (including verbs with the postfix –sja in the infinitive).
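The per-variant calculation described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function name `variant_shares`, the corpus keys, the variant labels, and all counts are hypothetical and were invented for the example.

```python
from collections import defaultdict


def variant_shares(counts_by_corpus):
    """Sum each variant's entries across corpora and compute its share.

    counts_by_corpus maps a corpus name to {variant: number_of_entries}.
    Returns (totals per variant, combined total, percentage per variant).
    """
    totals = defaultdict(int)
    for corpus_counts in counts_by_corpus.values():
        for variant, entries in corpus_counts.items():
            totals[variant] += entries
    combined = sum(totals.values())
    shares = {
        variant: (100.0 * n / combined if combined else 0.0)
        for variant, n in totals.items()
    }
    return dict(totals), combined, shares


# Hypothetical counts, invented for illustration only (not corpus figures).
example_counts = {
    "main": {"variant_a": 30, "variant_b": 10},
    "newspaper": {"variant_a": 40, "variant_b": 15},
    "spoken": {"variant_a": 5},
}

totals, combined, shares = variant_shares(example_counts)
for variant in sorted(totals):
    print(f"{variant}: {totals[variant]} of {combined} ({shares[variant]:.1f}%)")
```

A variant missing from one corpus simply contributes nothing for that corpus, which is why the spoken-corpus entry above can omit `variant_b`.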
The vast majority of imperfective variants currently used by native speakers are derivatives of type 4 perfectives. Type 4 verbs are one of the inflectional types of perfectives from which imperfectives can be derived. Thus, for the purpose of constructing the dataset, several hundred type 4 verbs were selected. The dataset comprised only type 4 perfectives that do not produce multiple outcomes, i.e., more than one imperfective form. This dataset served as the basis for predicting the outcomes for the verbs in question (in this case, imperfective derivatives that have two or more variants according to the Russian National Corpus). Each verb in the dataset has one of three derivational strategies as its specified outcome: the suffix –iva–/–yva–, the suffix –a–/–ja–, or a combination of the suffix –iva–/–yva– and the <o||a> alternation. To predict the behavior of perfectives that form more than one imperfective (referred to as test items), these verbs were represented with the same set of variables as the dataset items, but with no outcome specified. After the outcomes for the test items were predicted, the results were checked against the corpus data to see whether the predictions matched the actual language behavior.
Three tests were designed for Analogical Modeling outcome predictions. The goal of the first test was to predict the outcomes for infrequent forms that show no variation. The dataset in this case consisted of the 500 most frequent (based on the type frequency data from the Russian National Corpus) no-variation type 4 perfective verbs that have imperfective counterparts (that is, for which the outcomes can be specified). The test items were 100 infrequent no-variation type 4 perfectives that can also form imperfective derivatives. Analogical Modeling correctly predicted 91% of the outcomes (selected by plurality) for the infrequent no-variation verbs based on the behavior of the 500 most frequent ones.
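Selection by plurality and the resulting accuracy figure can be illustrated with a short sketch. It assumes that each test item comes with one numeric score per outcome; how the Analogical Modeling software actually reports its predictions is not specified here, so the data layout, the outcome labels, the item names, and the scores below are all hypothetical.

```python
def plurality_outcome(outcome_scores):
    """Return the outcome with the highest score (first one wins on a tie)."""
    return max(outcome_scores, key=outcome_scores.get)


def plurality_accuracy(predictions, attested):
    """Share of test items whose plurality outcome equals the attested one."""
    correct = sum(
        1
        for item, scores in predictions.items()
        if plurality_outcome(scores) == attested[item]
    )
    return correct / len(predictions)


# Hypothetical scores and attested outcomes, invented for illustration only.
example_predictions = {
    "verb_1": {"iva": 70, "ja": 20, "iva+o>a": 10},
    "verb_2": {"iva": 15, "ja": 80, "iva+o>a": 5},
    "verb_3": {"iva": 40, "ja": 10, "iva+o>a": 50},
    "verb_4": {"iva": 55, "ja": 45, "iva+o>a": 0},
}
example_attested = {
    "verb_1": "iva",
    "verb_2": "ja",
    "verb_3": "iva+o>a",
    "verb_4": "ja",
}

accuracy = plurality_accuracy(example_predictions, example_attested)
print(f"accuracy: {accuracy:.0%}")
```

In this invented example three of the four items are predicted correctly; `verb_4` is counted as a miss because its attested outcome is only the second-ranked one.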
The second test was designed to predict the outcomes for perfective verbs whose derivatives show a degree of variation no more skewed than 9 to 1, i.e., the least frequent variant accounts for at least 10% of the combined entries in the Russian National Corpus. The total number of occurrences for each selected verb was at least 100. According to the corpus data, there are 11 such verbs. As in the first test, the dataset of the 500 most frequent type 4 verbs was used to predict the outcomes for the 11 test items.
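The two selection criteria for test items (at least 100 combined entries, and a least frequent variant that accounts for at least 10% of them) can be expressed as a simple filter. This is a sketch under stated assumptions: the function name `qualifies_as_test_item`, its parameter names, and the counts in the examples are hypothetical.

```python
def qualifies_as_test_item(variant_counts, min_total=100, min_share=0.10):
    """Check the two selection criteria for a verb with several variants.

    variant_counts maps each imperfective variant to its combined
    cross-corpora entries. The verb qualifies if it has at least two
    variants, at least min_total entries overall, and its least frequent
    variant makes up at least min_share of the total.
    """
    if len(variant_counts) < 2:
        return False
    total = sum(variant_counts.values())
    if total < min_total:
        return False
    return min(variant_counts.values()) / total >= min_share


# Hypothetical counts, invented for illustration only.
print(qualifies_as_test_item({"variant_a": 90, "variant_b": 10}))  # exactly 9 to 1
print(qualifies_as_test_item({"variant_a": 95, "variant_b": 5}))   # too skewed
print(qualifies_as_test_item({"variant_a": 30, "variant_b": 20}))  # too few entries
```

For verbs with three variants, this sketch applies the 10% threshold to the rarest of the three, which is one possible reading of the criterion.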
Finally, the last test predicted the outcomes for the same 11 test items but used a different dataset, one tailored to the specific verbs in question. Specifically, it contained all type 4 verbs that have the same prefixes as the test items, regardless of how frequent they are. The frequency of exemplars in this dataset was taken into account by taking each verb's number of cross-corpora entries, dividing it by 10, rounding the result to the nearest whole number, and including that many identical entries of the verb in the dataset. Any verb with fewer than 10 corpus entries was included once.
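The frequency weighting described above can be sketched as follows. The source does not say how ties such as 15 entries were rounded, so rounding halves upward is an assumption of this sketch; the function names, record labels, and entry counts are likewise hypothetical.

```python
def copies_for(entries):
    """Number of identical dataset entries for a verb with this many corpus entries.

    Verbs with fewer than 10 entries are included once. Otherwise the count
    is divided by 10 and rounded to the nearest whole number. Halves are
    rounded up here, which is an assumption, not something the text states.
    """
    if entries < 10:
        return 1
    return (entries + 5) // 10


def weighted_dataset(records):
    """Repeat each (record, entries) pair according to copies_for(entries)."""
    expanded = []
    for record, entries in records:
        expanded.extend([record] * copies_for(entries))
    return expanded


# Hypothetical verbs and counts, invented for illustration only.
example_records = [("verb_x", 3), ("verb_y", 14), ("verb_z", 237)]
expanded = weighted_dataset(example_records)
print(len(expanded))
```

Integer arithmetic is used instead of Python's built-in `round`, which rounds halves to the nearest even number and would therefore treat 25 entries differently from 15.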
According to the test results, the prefix-based, frequency-weighted dataset used in the third test accounted better for the actual language behavior documented in the corpus. In the case of this variation, two factors seem to be the most relevant, namely verb prefixation and the actual frequency of verb usage. The first factor suggests that, at least in this case, speakers perceive similarity not only in terms of the phonological shape of the word but also in terms of its morphological structure (i.e., they store the prefix in memory as a unit rather than as a sequence of phonemes). This is also supported by the fact that many Russian verbs with the same root (and closely related meanings) but different prefixes use different imperfectivation strategies. This derivational behavior was the initial basis for constructing the prefixal dataset. The second factor implies that incorporating actual frequency into the dataset matters, as opposed to merely listing the most frequent verbs. Since these two factors were not isolated, it is hard to determine at this point their relative importance for the outcome predictions. Further research is needed to establish the relevance of each factor independently. For instance, incorporating the actual frequency of the 500 most frequent type 4 verbs might substantially improve the prediction accuracy and thereby undermine the suggested importance of prefixation.