Number Use in Language:

a Quantitative and Typological Investigation

 

Project funded by the ESRC (R000222419)

directed by Professor G. G. Corbett, University of Surrey

Research Fellow: Dr Andrew Hippisley

 

Dataset deliverable

 

Description document

 

Andrew Hippisley, University of Surrey

 

 

Abstract

 

The researchers Corbett, Hippisley, Brown and Marriott have investigated the relationship between number availability and number use. One of the deliverables promised was a dataset of nouns from the Uppsala corpus encoded for frequency information, case and number features, and semantic information, i.e. animacy category. This document describes the dataset column by column and, in some cases, adds a note on methodology. The researchers are grateful for the support of the ESRC (grant no. R000222419).

 

1. Background

 

An important contribution to linguistic typology was Smith-Stark's hierarchy of number availability, an extended version of which is given in (1). Nouns with number marking (formally distinguishing singular and plural) typically occupy some top portion. Different languages make the 'split' at different points on the hierarchy (e.g. only Speaker, Addressee, and Kin terms may mark number).

 

(1)

The chief aim of the ESRC project was to investigate to what extent Smith-Stark's hierarchy of number availability impacted on the way number was used. The general methodology was to analyse the way in which the nominals of a one million word Russian corpus distributed their singular and plural forms, and compare that with the nominals' position on the Smith-Stark hierarchy. This task was carried out using the concordance and word list tools in the WordSmith concordance package.

 

 

2. The corpus

 

The Uppsala corpus is a set of sub-corpora of various genres, containing in total about 1 million words. It is considered the best Russian corpus available, in terms of scope and design. For information on the Uppsala corpus, see Lönngren (1993) and Maier (1994).

 

3. The dataset

 

The dataset is in the form of a Microsoft Excel document in which case, number (singular and plural), and animacy information about the nouns occurring in the Uppsala corpus is recorded numerically: animacy as a numeric index, and each case and number combination as a frequency count. The lexemes recorded in the dataset are those represented by a word form occurring more than five times. The dataset contains around 5,440 lexemes, accounting for around 243,000 word forms from the entire 1 million word corpus.

 

In order to present the data as a Microsoft Excel document (diacritics cannot be used), we adopted the following transliteration system.

4. Description of the columns in the dataset

 

We consider each column in the dataset in turn. The information recorded for some columns involved additional analysis of the corpus (for example affixal homonymy) and in these cases we outline the methodology adopted for retrieving the required information.

 

4.1 Columns: Lexeme, Gloss

 

The lexemes, each the amalgamation of all the word forms of a noun, are arranged in frequency order. Each appears with a gloss.

 

 

 

4.2 Column: Animacy

 

Each lexeme has been recorded for animacy category, based on our extended version of the Smith-Stark hierarchy as shown in (1) above. Animacy has been recorded numerically, and the correspondence between animacy and numerical index is given in Table 1.

 

Animacy               Numeric index

Kin                   3
Non-human rational    4
Human rational        5
Human non-rational    6
Animate               7
Concrete inanimate    8
Abstract inanimate    9

 

TABLE 1: Animacy category and its numeric index in the dataset
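
For readers who process the dataset programmatically, the correspondence in Table 1 can be held in a small lookup. The following is a minimal sketch in Python; the names ANIMACY_INDEX and animacy_label are illustrative assumptions and are not part of the dataset.

```python
# Numeric animacy indices exactly as listed in Table 1.
# The names below are illustrative; they do not occur in the dataset.
ANIMACY_INDEX = {
    3: "Kin",
    4: "Non-human rational",
    5: "Human rational",
    6: "Human non-rational",
    7: "Animate",
    8: "Concrete inanimate",
    9: "Abstract inanimate",
}


def animacy_label(index: int) -> str:
    """Return the animacy category for a numeric index from the dataset."""
    try:
        return ANIMACY_INDEX[index]
    except KeyError:
        raise ValueError(f"no animacy category listed in Table 1 for index {index}")
```

Indices that Table 1 does not list raise an error rather than being silently mapped to a category.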

 

 

4.3 Columns: ‘Frequency’, ‘Sg’, ‘Pl’, ‘Pl/freq’

 

Frequency information for each lexeme is recorded in the last four columns. This information is broken down into overall frequency (‘Frequency’), all singular occurrences (‘Sg’), all plural occurrences (‘Pl’), and the proportion of all the occurrences of the lexeme which are plural (‘Pl/freq’).
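
The relationship between these columns can be sketched as follows. This is an illustration in Python of the proportion as described above, not the procedure used to build the dataset; the function name and the example figures are hypothetical.

```python
def plural_proportion(frequency: int, pl: int) -> float:
    """Proportion of a lexeme's occurrences that are plural ('Pl/freq').

    `frequency` stands for the 'Frequency' column and `pl` for the 'Pl' column.
    """
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    if not 0 <= pl <= frequency:
        raise ValueError("plural count must lie between 0 and the overall frequency")
    return pl / frequency


# Hypothetical lexeme: 40 occurrences overall, 10 of them plural.
example = plural_proportion(frequency=40, pl=10)
```

With the hypothetical figures above, a quarter of the occurrences are plural, so the value is 0.25.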

 

 

4.4 Columns: ‘NomSg’, ‘GenSg’, ‘DatSg’, ‘InstSg’, ‘LocSg’, ‘NomPl’, ‘GenPl’, ‘DatPl’, ‘InstPl’, ‘LocPl’

 

In Russian there are two number values (singular and plural) and six cases: nominative, accusative, genitive, dative, instrumental, and locative. As a typical member of Indo-European, Russian is a language of the fusional type, where a single ending fuses case and number information. The columns above correspond to the case and number combinations.

 

Methodological implication

 

The case/number endings fall into a number of paradigms. The main noun classes in Russian are given in Table 2.

 

 

        I               II              III             IV
        stol ‘table’    karta ‘map’     kost´ ‘bone’    okno ‘window’

SG
nom     stol            kart-a          kost´           okn-o
acc     stol            kart-u          kost´           okn-o
gen     stol-a          kart-y          kost-i          okn-a
dat     stol-u          kart-e          kost-i          okn-u
inst    stol-om         kart-oj         kost-´ju        okn-om
loc     stol-e          kart-e          kost-i          okn-e

PL
nom     stol-y          kart-y          kost-i          okn-a
acc     stol-y          kart-y          kost-i          okn-a
gen     stol-ov         kart            kost-ej         okon
dat     stol-am         kart-am         kost-jam        okn-am
inst    stol-ami        kart-ami        kost-jami       okn-ami
loc     stol-ax         kart-ax         kost-jax        okn-ax

TABLE 2: Russian noun classes

 

From Table 2 we see that there are four main classes, represented here by stol, karta, kost´, and okno. In classes I and IV each case/number combination is marked by a separate form, except for the direct cases (nominative and accusative). In the other classes the endings tend to merge. For example, in class II the suffix -y marks the genitive singular and the direct cases in the plural. In class III the merging of endings is widespread. This makes the analysis of the nouns in the corpus a more complex task: each occurrence of a word form which does not uniquely mark case and number had to be disambiguated by carefully examining the context in which it appears.
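
To make the merging of endings concrete, the sketch below groups the cells of one paradigm by word form and reports the forms shared by more than one case/number combination. It is an illustration in Python of which forms call for disambiguation, not the procedure used in the project (the disambiguation itself was done by examining contexts); the names KARTA and syncretic_forms are assumptions. The forms are those of class II karta in Table 2.

```python
from collections import defaultdict

# Class II paradigm (karta 'map'), copied from Table 2.
KARTA = {
    ("nom", "sg"): "kart-a",
    ("acc", "sg"): "kart-u",
    ("gen", "sg"): "kart-y",
    ("dat", "sg"): "kart-e",
    ("inst", "sg"): "kart-oj",
    ("loc", "sg"): "kart-e",
    ("nom", "pl"): "kart-y",
    ("acc", "pl"): "kart-y",
    ("gen", "pl"): "kart",
    ("dat", "pl"): "kart-am",
    ("inst", "pl"): "kart-ami",
    ("loc", "pl"): "kart-ax",
}


def syncretic_forms(paradigm):
    """Map each word form shared by several cells to the list of those cells."""
    cells_by_form = defaultdict(list)
    for cell, form in paradigm.items():
        cells_by_form[form].append(cell)
    # Keep only forms that stand for more than one case/number combination.
    return {form: cells for form, cells in cells_by_form.items() if len(cells) > 1}


ambiguous = syncretic_forms(KARTA)
```

For karta this picks out kart-y (genitive singular, nominative plural, accusative plural) and kart-e (dative singular, locative singular); every corpus occurrence of such a form needs its context checked.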

 

In addition to homonymy within the lexeme, there is homonymy across lexemes, and further analysis had to be done to disambiguate examples of this kind. One actual example from the corpus is the word form vek-i. This can be either the nominative plural of vek-o ‘eyelid’ or an archaic nominative plural of vek ‘century’, and has been disambiguated accordingly.

 

 

4.5 Columns ‘Gen2’ and ‘Loc2’

 

In addition to the six cases shown in Table 2, Russian has two sub-cases: the second genitive (a sub-case of the genitive) and the second locative (a sub-case of the locative). The sub-cases occur in the singular paradigm of class I. An example of a noun with a second genitive and a second locative is glaz ‘eye’; its singular paradigm is given in Table 3. Columns ‘Gen2’ and ‘Loc2’ record occurrences of the second genitive and second locative respectively.

 

Methodological implication

 

As can be seen from Table 3, the sub-case endings are both in -u which in class I is homonymous with the dative singular (Table 2). For nouns known to contain a sub-case we have disambiguated word forms in -u by checking the contexts in which they occur. Zaliznjak (1977), a morphological dictionary, has been used as a guide to which nouns contain sub-cases.

 

 

         I
         glaz ‘eye’

SG
nom      glaz
acc      glaz
gen      glaz-a
gen 2    glaz-u
dat      glaz-u
inst     glaz-om
loc      glaz-e
loc 2    glaz-ú

 

TABLE 3: sub-cases

 

4.6 Columns: ‘AccSg’, ‘Ref to NomSg’, ‘Ref to Nom1Pl’

 

Generally the accusative singular is syncretic with the nominative singular. The column ‘AccSg’ records occurrences of the accusative singular of a lexeme where the accusative singular is morphologically distinct from the nominative singular. This distinction is restricted to class II nouns (inanimate and animate) and class I animate nouns. Where the accusative singular is not morphologically distinguished from the nominative singular, the following procedure has been adopted: (i) the ‘AccSg’ column is given a zero value; (ii) the ‘Ref to NomSg’ column is given the value 1. In other words, for lexemes with a 1 in the ‘Ref to NomSg’ column, the value in the ‘NomSg’ column corresponds to the occurrences of nominative singular and accusative singular together. If a zero appears in the ‘Ref to NomSg’ column, then any value in the ‘NomSg’ column is a record solely of nominative singular occurrences. The plural is treated similarly, where the referring column is ‘Ref to Nom1Pl’. All animates have accusative plurals distinct from the nominative plural, and where this is the case a zero has been recorded in the ‘Ref to Nom1Pl’ column.
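
The reading of the ‘Ref to NomSg’ flag can be sketched as follows. This is a hedged illustration in Python: it assumes a row of the dataset has been loaded as a dictionary keyed by the column names given above, and the function names and the example counts are hypothetical.

```python
from typing import Optional, Tuple


def nom_sg_covers(row: dict) -> Tuple[str, ...]:
    """Return the cases whose singular occurrences are counted in 'NomSg'."""
    flag = row["Ref to NomSg"]
    if flag == 1:
        # Accusative singular is not morphologically distinct from the
        # nominative singular, so 'NomSg' counts both together.
        return ("nom", "acc")
    if flag == 0:
        return ("nom",)
    raise ValueError(f"unexpected 'Ref to NomSg' value: {flag!r}")


def acc_sg_count(row: dict) -> Optional[int]:
    """Return the separately recorded accusative singular count, or None
    when accusative singular occurrences are included in 'NomSg'."""
    if nom_sg_covers(row) == ("nom", "acc"):
        return None
    return row["AccSg"]


# Hypothetical rows; the counts are invented for illustration only.
merged = {"NomSg": 30, "AccSg": 0, "Ref to NomSg": 1}
distinct = {"NomSg": 12, "AccSg": 7, "Ref to NomSg": 0}
```

Returning None for merged rows keeps a recorded zero in ‘AccSg’ from being read as "no accusative singular occurrences". The same pattern would apply to the plural with ‘Ref to Nom1Pl’.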

 

Methodological implication

 

In the singular, the accusative case of class I animate nouns is syncretic with the genitive. All class I animate nouns have been disambiguated for accusative and genitive singular by carefully examining the contexts in which they appear. In the plural, all animates have accusative / genitive syncretism, and have been disambiguated in the same way.

 

 

4.7 Column: ‘InstSg2’

 

Some class II nouns have instrumental singular word forms in the optional -oju ending, usually in addition to the general -oj ending. We have recorded these occurrences separately. Occurrences in the general ending are recorded in the ‘InstSg’ column, and occurrences in the optional ending are recorded in a separate column ‘InstSg2’.

 

 

4.8 Column: ‘Vocative’

 

So-called vocatives are restricted to class II animate nouns, and formed by shortening the citation nominative singular to the bare stem. Very few occurrences were found, and they are recorded in the ‘Vocative’ column.

 

4.9 Columns: ‘Nom2Pl’, ‘Acc2Pl’, ‘Gen2Pl’, ‘Inst2Pl’

In the plural, some nouns have alternative nominative, accusative, genitive and instrumental forms, and occurrences of these have been recorded in the columns ‘Nom2Pl’, ‘Acc2Pl’, ‘Gen2Pl’ and ‘Inst2Pl’. The general word forms and the alternative word forms for these nouns are given in the look-up tables below (Tables 4 to 7). Zaliznjak (1977) has been used as a guide as to what counts as the alternative form. Note that not all of the alternative forms are recorded in Zaliznjak. Note also that the general form does not always correspond to the most frequent form. In some cases, an alternative gloss is associated with the alternative word form, and this is also given.

 

 

Lexeme      Gloss              General form    Alternative form    Alternative gloss
                               (‘NomPl’)       (‘Nom2Pl’)

God         year               gody            goda                -
Chelovek    person             ljudi           cheloveki           -
Vek         century            veka            veki                (used in expressions)
Direktor    director           direktora       direktory           (not in Zaliznjak)
Cvet        flower / colour    cvety           cveta               (when ‘colour’)
Zub         tooth              zuby            zubja               cog (in machine)
Traktor     tractor            traktora        traktory            -
Shtorm      gale               shtormy         shtorma             -
Zarja       dawn               zari            zori                -
Jastreb     hawk               jastreba        jastreby            -
Shchenok    puppy              shchenki        shchenjata          -
Shtabel´    stack              shtabelja       shtabeli            -

 

 

TABLE 4: Nominative plural alternatives

 

 

Lexeme      Gloss          General form    Alternative form    Alternative gloss
                           (‘AccPl’)       (‘Acc2Pl’)

Ptichka     bird / tick    ptichek         ptichki             tick (only)
Shchenok    puppy          shchenkov       shchenjat           puppy

 

TABLE 5: Accusative plural alternatives

 

 

Lexeme       Gloss     General form    Alternative form    Alternative gloss
                       (‘GenPl’)       (‘Gen2Pl’)

God          year      let             godov               -
Chelovek     person    ljudej          chelovek            (used with numerals)
Kurica       hen       kur             kuric               -
Korol´       king      korolej         korolev             (not in Zaliznjak)
Prostynja    sheet     prostynej       prostyn´            -

 

TABLE 6: alternative genitive plurals

 

 

Lexeme    Gloss    General form    Alternative form    Alternative gloss
                   (‘InstPl’)      (‘Inst2Pl’)

Sleza     tear     slezami         slez´mi             -
Kost´     bone     kostjami        kost´mi             -

 

TABLE 7: alternative instrumental plurals

 

 

References

 

Lönngren, Lennart 1993. Chastotnyj slovar´ sovremennogo russkogo jazyka. (=Acta Universitatis Upsaliensis, Studia Slavica Upsaliensis 33). Uppsala.

Maier, Ingrid 1994. Review of Lennart Lönngren (ed.) Chastotnyj slovar´ sovremennogo russkogo jazyka. Rusistika Segodnja, 1. 130-6.

Zaliznjak, A. A. 1977. Grammaticheskij slovar´ russkogo jazyka. Moscow: Russkij jazyk.