Dr Matteo Pellegrini
Pronouns: he/him
About
Biography
I am a Research Fellow at the Surrey Morphology Group, where I work as the Language Data Scientist of the NILOMORPH project, funded by an ERC Synergy Grant (cPI: Matthew Baerman; PIs: Bert Remijsen, Lameen Souag). Before that, I worked as a Postdoctoral Researcher at the CIRCSE Research Centre (Catholic University of the Sacred Heart, Milan), within the LiLa project, funded by an ERC Consolidator Grant (PI: Marco Passarotti).
PhD: University of Bergamo / University of Pavia (joint PhD program, supervisor: Pierluigi Cuzzolin)
Master's and Bachelor's degrees: University of Turin (supervisor: Davide Ricca)
Research interests
My research interests include:
- the structure of implication between word forms in morphological paradigms, mostly focusing on inflection, but also exploring interactions with derivation;
- overabundance, i.e. the availability of several variants to express the same grammatical properties for a given lexeme;
- how those aspects can be modelled with computational techniques.
To investigate such issues, I also value and enjoy the hard work necessary for the creation of large, high-quality datasets, and I participate in efforts to make them openly available and interoperable with other linguistic resources, by developing and applying open standards and vocabularies and by adhering to the principles of the Linguistic Linked Open Data framework.
During my PhD, I applied information-theoretic methodology and used conditional entropy to estimate uncertainty in predicting word forms from each other in Latin verb and noun paradigms.
Within the LiLa project, I worked on the creation of PrinParLat, a large lexicon of inflected forms of Latin verbs, nouns and adjectives, featuring a systematic documentation of overabundance using the theoretical notion of flexemes.
Within the NILOMORPH project, I curate the data necessary for a reconstruction of how the complex non-concatenative morphology of West Nilotic languages has developed.
Publications
This paper provides a fully word-based, abstractive analysis of predictability in Latin verb paradigms. After reviewing previous traditional and theoretically grounded accounts of Latin verb inflection, a procedure is outlined where the uncertainty in guessing the content of paradigm cells given knowledge of one or more inflected wordforms is measured by means of the information-theoretic notions of unary and n-ary implicative entropy, respectively, in a quantitative approach that uses the type frequency of alternation patterns between wordforms as an estimate of their probability of application. Entropy computations are performed by using the Qumin toolkit on data taken from the inflected lexicon LatInfLexi. Unary entropy values are used to draw a mapping of the verbal paradigm into zones of full interpredictability, composed of cells that can be inferred from one another with no uncertainty. Values of n-ary entropy are used to extract categorical and near-principal part sets, which allow the rest of the paradigm to be filled with little or no uncertainty. Lastly, the issue of the impact of information on the derivational relatedness of lexemes on uncertainty in inflectional predictions is tackled, showing that adding a classification of verbs in derivational families allows for a relevant reduction of entropy, not only for derived verbs, but also for simple ones.
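The type-frequency estimate mentioned in the abstract above can be illustrated with a toy computation. This is a minimal sketch, not the Qumin implementation: the function name `conditional_entropy`, the hand-assigned context labels and pattern labels, and the six-verb sample are all assumptions made for illustration. It computes the conditional entropy of an alternation pattern given a class of the known wordform, with probabilities estimated by counting lexemes (types).

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(pairs):
    """Return H(pattern | context) in bits.

    pairs: iterable of (context, pattern) tuples, one per lexeme.
    Probabilities are estimated from type frequency, i.e. by counting lexemes.
    """
    pairs = list(pairs)
    total = len(pairs)
    by_context = defaultdict(Counter)
    for context, pattern in pairs:
        by_context[context][pattern] += 1

    entropy = 0.0
    for patterns in by_context.values():
        n_context = sum(patterns.values())
        for count in patterns.values():
            p_joint = count / total      # p(context, pattern)
            p_cond = count / n_context   # p(pattern | context)
            entropy -= p_joint * log2(p_cond)
    return entropy


# Hypothetical toy sample: predicting the present infinitive from the
# first person singular present. Contexts and patterns are hand-labelled
# here; a real analysis would infer them automatically from the lexicon.
toy_pairs = [
    ("-o", "o>are"),    # amo, amare
    ("-o", "o>are"),    # laudo, laudare
    ("-o", "o>ere"),    # rego, regere
    ("-o", "o>ere"),    # lego, legere
    ("-eo", "eo>ere"),  # moneo, monere
    ("-eo", "eo>ere"),  # video, videre
]

print(round(conditional_entropy(toy_pairs), 3))  # 0.667
```

In this toy sample, forms in "-eo" leave no uncertainty, while forms in "-o" are split evenly between two patterns and contribute one bit each, weighted by their share of the lexicon (4/6).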
This chapter is devoted to a detailed description of the methodology that will be applied in this work to perform an analysis of Latin inflectional paradigms that satisfies the theoretical desiderata outlined in Chap. 1. The applied method makes use of notions and procedures taken from information theory (Shannon 1948), notably surprisal and entropy. In Sect. 2.1, a basic introduction to such information-theoretic notions is offered, pointing out the properties that make them useful to investigate topics related to implicative relations and the Paradigm Cell Filling Problem. The first proposal to use entropy for this purpose was outlined in Ackerman et al. (2009), whose procedure, aiming at an estimate of the degree of uncertainty associated with morphological realizations, is described in Sect. 2.2. However, the tools and algorithms that are used throughout this work are based on a similar but refined procedure (cf. Bonami 2014, Bonami and Boyé 2014, Beniamine 2018), which only requires a lexicon of inflected wordforms with no a priori morphological analysis. This method can be used not only to estimate the uncertainty in guessing the content of the paradigm cell of a lexeme knowing one inflected wordform, as explained in Sect. 2.3, but also given knowledge of multiple wordforms, as detailed in Sect. 2.4. The possible impact of additional information in making such tasks easier will be discussed in Sect. 2.5, where two possible variables are mentioned, namely i) the gender of a noun, and ii) the fact that a given lexeme is derivationally related to another one. In Sect. 2.6, the most important characteristics of the adopted methodology will be summarized, clarifying how they relate to the theoretical principles discussed in the first chapter.
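For readers unfamiliar with the notions named in this abstract, the standard definitions can be stated briefly. The notation below ($X$, $Y$, $p$) is an assumption chosen for illustration and is not taken from the chapter itself: $X$ stands for what is known (e.g. a class of the given wordform) and $Y$ for what is predicted (e.g. an alternation pattern).

```latex
% Surprisal of an outcome x, in bits
I(x) = -\log_2 p(x)

% Entropy: expected surprisal of a random variable X
H(X) = -\sum_{x} p(x) \log_2 p(x)

% Conditional entropy: remaining uncertainty about Y once X is known
H(Y \mid X) = -\sum_{x} \sum_{y} p(x, y) \log_2 p(y \mid x)
```

A conditional entropy of zero means that the predicted outcome is fully determined by what is known, which is the situation described elsewhere on this page as full interpredictability between cells.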
In this paper, we investigate the value of derivational information in predicting the inflectional behavior of lexemes. We focus on Latin, for which large-scale data on both inflection and derivation are easily available. We train boosting tree classifiers to predict the inflection class of verbs and nouns with and without different pieces of derivational information. For verbs, we also model inflectional behavior in a word-based fashion, training the same type of classifier to predict wordforms given knowledge of other wordforms of the same lexemes. We find that derivational information is indeed helpful, and document an asymmetry between the beginning and the end of words, in that the final element in a word is highly predictive, while prefixes prove to be uninformative. The results obtained with the word-based methodology also allow for a finer-grained description of the behavior of different pairs of cells.
In this paper, we present PrinParLat, a new resource listing the principal parts of Latin verbs (i.e., a set of forms from which the full paradigm can be obtained), as well as a fine-grained classification of their inflectional behavior. The resource is released both as a set of .csv tables in the Paralex standard format, and as RDF data linked to the LiLa Knowledge Base of interoperable resources for Latin. After introducing the notions of theoretical morphology of which this resource makes crucial use to be able to code complex morphological information in a compact way, the details of the procedure followed to obtain the data are outlined, as well as the modelling decisions taken regarding knowledge representation. The outcome of our work is evaluated by performing both an intrinsic evaluation of the data in the resource, and an extrinsic evaluation of their added value in the larger ecosystem of the LiLa Knowledge Base.
In this chapter, we will investigate how the picture of interpredictability between paradigm cells changes when considering not only the phonotactic shape of inflected wordforms, but also additional information of a different kind, namely the derivational relatedness of the lexemes involved. In Sect. 6.1, we will use some examples to illustrate the question and to stress the potential relevance of knowing whether two lexemes are ultimately derived from the same base or whether they are formed by means of the same derivational process. We will propose a method to take this information into account and briefly discuss the difference with the standard procedure for entropy computation. We will start from verbal lexemes that ultimately derive from the same base in Sect. 6.2, proposing a working definition of the notion of derivational-inflectional family and showing how our data were coded so as to include a classification in such families in Sect. 6.2.1, briefly providing a qualitative picture of the inflectional behaviour of verbs that belong to the same derivational-inflectional family in Sect. 6.2.2 and presenting our results in Sect. 6.2.3. The same line of reasoning will be followed in Sect. 6.3 for nouns that are formed by means of the same derivational process, i.e., that belong to the same derivational-inflectional series, as defined in Sect. 6.2.1. We will then present in Sect. 6.4 some of the results obtained on the same topic with a different methodology by Bonami and Pellegrini (2022), where the issue of inflectional predictions is framed as a classification problem: a classifier is trained to predict various correlates of the inflectional behaviour of a lexeme from several derivational predictors. We discuss how these results can complement those of the other sections of this chapter by avoiding some of the problems of the entropy-based approach to the PCFP and by allowing for a quantitative evaluation of the role of many more facets of derivational history. Lastly, in Sect. 6.5, we will discuss the theoretical and methodological implications of our results, also highlighting some problems that we leave to further research.
In this chapter, we will locate the approach that is adopted in this work in the larger field of theoretical morphology. To do so, we will discuss several points: each of them will be illustrated using examples taken from Latin inflectional morphology. The point of departure will be an overview of the terminology that is used throughout this work (Sect. 1.1). We will then discuss several classifications of morphological theories (Sect. 1.2), from the traditional distinction between Item-and-Arrangement, Item-and-Process and Word-and-Paradigm models operated by Hockett (1954) up to the recent characterization of constructive and abstractive approaches proposed in Blevins (2006, 2016), highlighting the specific aspects on which each of the discussed classifications is based. In Sect. 1.3, we will focus on implicative relations, contrasting different ways in which they can be formulated, in terms of generalizations on exponents (Sect. 1.3.1), on stems (Sect. 1.3.2) or on inflected wordforms (Sect. 1.3.3). We will then add a quantitative dimension to the picture (Sect. 1.4), showing the importance of considering also non-categorical implicative relations. Lastly, in Sect. 1.5 we will explain the choices that have been made in this work regarding each of the topics discussed in the previous sections.
To perform a quantitative, entropy-based analysis like the one that has been sketched out in the previous chapter, a large, representative lexicon of inflected wordforms in phonetic transcription is needed. Similar resources are increasingly being developed for modern languages: cf., among others, Zanchetta and Baroni (2005) and Calderone et al. (2017) for Italian, Bonami et al. (2014) and Hathout et al. (2014) for French, and the lexicon described in Bonami and Luís (2014) for Portuguese.
This paper provides an in-depth investigation of the possibility of systematically using flexemes - i.e., lexical units characterized in terms of form, as opposed to lexemes, characterized in terms of meaning - to model overabundance - i.e., the availability of more than one form in the same paradigm cell. The starting point is a preliminary evaluation of the advantages and disadvantages of using flexemes to account for different overabundance phenomena, showing that flexemes are a good way to capture the systematicity of overabundance, either across lexemes or across cells. Consequently, it is suggested that flexemes can be an interesting technical solution for the creation of a lexicon of Latin verbs that not only documents all the competing wordforms available as principal parts, but also captures the systematic relationship that sometimes holds between variants filling different cells. A principled method to identify such systematicity is then described in detail. It is argued that a constructive approach based on the identity of stems and/or inflection class is not fully adequate for the data at hand. Therefore, the proposed procedure adopts an abstractive, word-based perspective that only relies on alternation patterns between unsegmented wordforms. Practical and theoretical implications of the work are finally discussed, particularly regarding the usefulness of a formal approach to the identification of lexical units and paradigm cells.
This paper investigates a case of overabundance in the plural cell of an open subclass of Italian VN compounds. The empirical basis includes: (i) a 163-item list of relevant compounds, for which the relative frequency of cell mates has been estimated by means of Web data; (ii) a naming questionnaire based on visual input, with 30 images submitted to about 200 informants, including those of several objects whose names are scarcely established in the lexicon; (iii) a further questionnaire, adapted to each informant, asking for acceptability judgements to detect overabundance at the single speaker's level. Results show that the given subclass of VN compounds provides an instance of systematic and productive overabundance in the Italian morphological system, differently from the examples usually discussed for this language.
In this paper, we provide an extensive analysis of the formal features displayed by Italian diminutives (especially allomorphy) by using surface alternation patterns automatically extracted from a dataset of base-diminutive pairs. Applying the theoretical tools provided by Construction Morphology (CxM), we take alternation patterns as a proxy for paradigmatic relations and locate them into separate hierarchies, exploiting the mechanism of multiple inheritance in order to express generalizations on different, mutually independent factors at play (i.e., lexical category, gender and final segment of the base and derivative, suffixes and antesuffixal phonological material displayed by the derivative). In doing so, we contribute to the exploration of the formal side of Italian diminutives, which have been so far addressed mostly from a semantic perspective, and to the refinement of the representation of formal phenomena such as allomorphy according to the CxM framework.
This work presents an analysis of interpredictability between paradigm cells in Italian verbal inflection. The approach is abstractive: the starting point is inflected forms, without assuming any segmentation a priori. The methodology is based on information-theoretic notions: the uncertainty in predicting a form from another one is measured by means of conditional entropy. The results are used to draw a mapping of the verbal paradigm of Italian into zones, such that cells that belong to the same zone are systematically predictable from one another. This mapping is compared to the one of important previous studies that have adopted a constructive approach, based on predictability between stems. Lastly, the quantification of uncertainty through conditional entropy is used as an evaluation metric to establish which approach yields a more economical analysis: the abstractive one of the present work or the constructive one of previous studies.
This chapter recaps the crucial arguments of the book and summarizes its contribution to the advancement of research in different areas. It is shown that the book provides scholars working on Latin with the description of a freely available resource listing the inflected wordforms of verbs and nouns and with new descriptive insights on the structure of paradigms in terms of predictability; and that it is also interesting for theoretical morphologists in that it explores the possibilities opened up by the methodological innovation of taking into account other pieces of information besides the phonotactic shape of wordforms when predicting paradigm cells, focusing on the role of such important factors as gender and derivation.
This chapter is devoted to Latin verb inflection. As a starting point, it is necessary to provide some preliminary information on the verbal system, as it is outlined in traditional descriptions. We will do so in Sect. 4.1, where we will also review previous theoretically grounded studies on Latin inflectional morphology regarding verbs. We will then move to our analysis of implicative relations, which is performed not on the full paradigm of Latin verbs but on a reduced version that abstracts away from all cases of systematic syncretism, called the “cell paradigm” following Boyé and Schalchli (2016) (see Sect. 4.2 for a more detailed elaboration). Results on various fragments of the Latin paradigm will be presented in Sect. 4.3; in Sect. 4.3.1, we will focus on the alternation patterns that hold between wordforms that are based on different stems, and consequently on the uncertainty in predicting the cells involved from one another; in Sect. 4.3.2, we will look at the situation in wordforms that are based on the same stem. In Sect. 4.4, we will try to give an idea of the overall structure of the Latin verb paradigm, as it emerges from our entropy-based analysis. Firstly, we will draw a map of the paradigm in different zones that contain cells between which there is full mutual predictability. Secondly, we will compute entropy values on a so-called “distillation” (cf. Stump and Finkel 2013) of the paradigm, where we keep only one cell for each zone. Lastly, in Sect. 4.5 we will extend our investigation to predictions from more than one cell, whose uncertainty will be measured by means of n-ary implicative entropy. These results will also be exploited to extract principal part sets and near-principal part sets and compare them to the ones of the traditional analysis and to the ones that have been found with different methodologies—notably, Stump and Finkel (2013)’s Principal Part Analysis.
We will now switch our focus from verbs to nouns. The structure of the first part of this chapter is analogous to the previous one. In Sect. 5.1, we provide some preliminary information on the inflectional behaviour of Latin nouns, as it is described in traditional grammars. We also review some (more or less) recent theoretical accounts of the well-known facts. In Sect. 5.2, we present the results of our systematic analysis of implicative relations and predictability in noun inflection. The shape of the cell paradigm of Latin nouns, on which our computations are done, is shown in Sect. 5.2.1. The results concerning unary and n-ary implicative entropy are then given in turn in Sects. 5.2.2 and 5.2.3. In the latter section, the principal parts and near principal parts that can be inferred from the entropy-based analysis are discussed. The remaining part of the chapter is devoted to a topic that proves to be relevant for nouns, unlike for verbs: the role of information on a lexeme’s gender in reducing uncertainty in inflectional predictions. How do entropy values change if we assume that we do not only know the phonotactic shape of an inflected wordform but also the gender of the lexeme it belongs to? We begin by providing some examples that show that information on gender is at least potentially available to speakers and capable of yielding a reduction in implicative entropy (Sect. 5.3.1). We then briefly show some quantitative data on the distribution of Latin nouns among the three genders, both in absolute terms and in their relationship with the classification in the traditional five declensions, highlighting some clear preferential associations between a noun’s declension and its gender and their impact on uncertainty in the PCFP (Sect. 5.3.2). We finally present the results that are obtained by taking gender information into account in Sect. 5.3.3, discussing the relevance of such results in the debate on the function(s) of gender, but also some caveats that should be made on their reliability, in Sect. 5.3.4. The findings of the chapter are then summarized in Sect. 5.4.
This paper presents an approach to integrating Latin inflected forms and corpus attestations within a Linked Open Data (LOD) framework, enhancing interoperability between Wikidata and the LiLa knowledge base. Building on the PrinParLat lexicon of Latin verb principal parts, we generate the complete set of inflected forms for over 8,000 verbs, encoded as RDF in a dedicated Wikibase instance. These forms are linked to the Index Thomisticus Treebank (ITTB), whose morphologically annotated tokens are related to corresponding forms based on segmental identity, lemma alignment, and mapped morphological features. Our generation and linking process achieves over 95% coverage of ITTB verbal tokens, demonstrating the robustness of our pipeline even for Medieval Latin data. By aligning Paralex, Wikidata, and LiLa ontologies, we ensure semantic interoperability and facilitate future integration into Wikidata. Beyond Latin, this workflow provides a reproducible model for linking inflectional paradigms and corpus attestations in other languages.