Dr Matteo Pellegrini

Pronouns: he/him

Research Fellow in Language Data Science

PhD, University of Bergamo / University of Pavia; BA, MA, University of Turin

matteo.pellegrini@surrey.ac.uk

Academic and research departments

Literature and Languages.

About

Biography

I am a Research Fellow at the Surrey Morphology Group, where I work as the Language Data Scientist of the NILOMORPH project, funded by an ERC Synergy Grant (cPI: Matthew Baerman; PIs: Bert Remijsen, Lameen Souag). Before that, I worked as a Postdoctoral Researcher at the CIRCSE Research Centre (Catholic University of the Sacred Heart, Milan), within the LiLa project, funded by an ERC Consolidator Grant (PI: Marco Passarotti).

PhD: University of Bergamo / University of Pavia (joint PhD program, supervisor: Pierluigi Cuzzolin)
Master's and Bachelor's degrees: University of Turin (supervisor: Davide Ricca)

Research

Research interests

My research interests include:

the structure of implication between word forms in morphological paradigms, mostly focusing on inflection, but also exploring interactions with derivation;
overabundance, i.e. the availability of several variants to express the same grammatical properties for a given lexeme;
how those aspects can be modelled with computational techniques.

To investigate such issues, I also value and enjoy the hard work necessary for the creation of large, high-quality datasets, and I participate in efforts to make them openly available and interoperable with other linguistic resources, by developing and applying open standards and vocabularies and by adhering to the principles of the framework of Linguistic Linked Open Data.
During my PhD, I applied information-theoretic methodology and used conditional entropy to estimate uncertainty in predicting word forms from each other in Latin verb and noun paradigms.
Within the LiLa project, I worked on the creation of PrinParLat, a large lexicon of inflected forms of Latin verbs, nouns and adjectives, featuring a systematic documentation of overabundance using the theoretical notion of flexemes.
Within the NILOMORPH project, I curate the data necessary for a reconstruction of how the complex non-concatenative morphology of West Nilotic languages has developed.

Publications

Javier de Torres, Marco Passarotti, Giovanni Moretti, Francesco Mambrini, Matteo Pellegrini A Dataset of Latin Etymologies Extracted from Wiktionary, In: Proceedings of the 11th Edition of the Swiss Text Analytics Conference pp. 226-233 Association for Computational Linguistics (ACL)

We present a curated resource of Latin etymologies automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline the data was generated with, as it can be reused to extract Wiktionary's etymologies for other languages. The etymology chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, Egyptian, and others. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we have introduced strong rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains etymological chains for 9,684 lemmas, which can be used to support research in Historical Linguistics, Computational Etymology and language learning, among other applications.

Matteo Pellegrini, Francesco Mambrini, Giovanni Moretti, Marco Passarotti (2026) Gazetteers of Latin Authors and Works for Chronological Modelling in the LiLa Knowledge Base, In: Journal of Open Humanities Data 1270 Ubiquity Press Ltd

DOI: 10.5334/johd.558

To enhance the potential for diachronic investigation of the LiLa knowledge base for Latin, we present gazetteers of authors and works documented therein. Those include 173 authors and 539 works, covering a time span from the 3rd century BCE up to present-day ecclesiastical Latin. Items in the gazetteers are linked to their identifiers in other knowledge bases (e.g., Wikidata). A model is proposed to allow for the coding in RDF of chronological information for different objects (authors and works) and at different levels of granularity (e.g., the exact publication date of a work or the life span of its author).

Keywords: Linguistic Linked Open Data, Latin, gazetteers

Marco Passarotti, Francesco Mambrini, Matteo Pellegrini, Eleonora Litta, Giovanni Moretti (2026) PREMOVE in LiLa: Integrating Latin Preverbed Motion Verbs with WordNet and VerbNet, Place: Palma de Mallorca

PREMOVE is a diachronic dataset of Ancient Greek and Latin PREverbed MOtion VErbs, providing manually curated morphological, syntactic, and semantic annotations for almost three thousand verbal occurrences. This paper presents the integration of PREMOVE into the LiLa Knowledge Base of Latin, linking its semantic annotations to WordNet (WN) and VerbNet (VN). We describe the RDF conversion using OntoLex-Lemon and FrAC, enabling explicit modelling of token-level attestations and dataset-level provenance. The resulting linked resource achieves full FAIR compliance and supports complex SPARQL queries, allowing users to explore motion semantics across lexical, textual, and semantic layers. Example SPARQL queries demonstrate how researchers can retrieve attested forms for specific WN synsets or VN classes, supporting reproducible linguistic research and cross-resource exploration of motion semantics in ancient languages.

Matteo Pellegrini, Matthew Baerman, Oliver Bond (2026) Linked Open Data for West Nilotic Languages: The NILOMORPH project, In: 10th Workshop on Linked Data in Linguistics (LDL-2026) @ LREC 2026: Workshop Proceedings

In this paper, we present the NILOMORPH project, which aims at describing the complex non-concatenative morphology of West Nilotic languages and reconstructing the dynamics of its evolution from a more straightforward concatenative system. The project adopts techniques from several methodologies and draws on many kinds of data displaying different formats, tagsets and conventions. Data are also multilingual, documenting different West Nilotic varieties, and multimodal, also including audio and video recordings. This makes the process of integration of these data particularly challenging. We first describe how the data can be converted to standard formats such as CLDF and Paralex, to achieve interoperability between resources of the same kind. We then discuss how they can be modelled as Linguistic Linked Open Data in the Resource Description Framework, reusing already existing vocabularies and defining new classes and properties to meet the needs of the project, to also achieve interoperability between resources of different kinds.

Alessandra Teresa Cignarella, Matteo Pellegrini (2026) An Overview of Current Practices and Recommendations for Working with Stereotypes in NLP, In: NLPerspectives @ LREC 2026: Workshop Proceedings pp. 112-123 ELRA Language Resources Association (ELRA)

This article presents a discussion on the main challenges and considerations involved in addressing stereotypes within Natural Language Processing (NLP), and proposes a set of guidelines and recommendations for their treatment in research and resource development. On the one hand, the growing interest in fairness, bias mitigation, and inclusivity has led to an increasing number of studies and datasets dealing with stereotypes; on the other hand, their conceptualization and operationalization remain highly heterogeneous across works. The aim of this article is therefore twofold: (1) to provide a concise yet comprehensive overview of existing annotation schemes highlighting their key features and offering a comparative analysis and (2) to propose a set of tentative guidelines and recommendations to foster clarity when working with stereotypes in NLP. Furthermore, as a case study, we conduct an annotation exercise of a subset of texts from the QUEEREOTYPES dataset, containing stereotypes targeting LGBTQIA+ people, using all labels proposed in prior work to assess their clarity, overlap, and practical usefulness.

Andrea Farina, Marco Passarotti, Francesco Mambrini, Matteo Pellegrini, Giovanni Moretti, Eleonora Litta (2026) PREMOVE In LiLa: Integrating Latin Preverbed Motion Verbs With WordNet And VerbNet, In: The Fifteenth Language Resources and Evaluation Conference (LREC 2026) lrec2026-main-294 pp. 3672-3683 European Language Resources Association (ELRA)

DOI: 10.63317/3ifm66wvmf86

Matteo Pellegrini, Eleonora Litta, Federica Iurescia, Marco Passarotti (2026) Towards a unified approach to inflectional and derivational predictions: a case study on Latin, In: Morphology 36(1) 10 Springer

DOI: 10.1007/s11525-026-09458-5

In this paper, we build on previous work on predictability in morphology, where entropy-based techniques have first been introduced to measure predictability in inflectional predictions, and then extended to predictability of the citation form of derivatives, and we apply those techniques to explore variation in the predictability of derivatives from different inflected forms of their bases. As a case study, we choose a paradigmatic system that includes the principal parts of Latin verbs and related agent and action nouns, where interesting variation can be found according to the forms considered for the predictions. Our results confirm many of the expectations that can be formulated on the basis of a qualitative analysis and additionally highlight less evident characteristics of the distribution of morphological patterns in our data. We argue that a unified approach like the one adopted in this paper is capable of capturing the complex and bidirectional interaction between inflectional and derivational information, and we point to interesting issues that arise from such a unified approach, highlighting their broader implications for morphological theory and computational modelling.

Matteo Pellegrini, Eleonora Litta, Federica Iurescia, Marco Passarotti (2026) On-line materials for the paper "Towards a unified approach to inflectional and derivational predictions: a case study on Latin" Zenodo

DOI: 10.5281/zenodo.15090158

On-line materials accompanying the paper "Towards a unified approach to inflectional and derivational predictions: a case study on Latin", submitted to the journal "Morphology".

Matteo Pellegrini (2020) Patterns of interpredictability and principal parts in Latin verb paradigms: an entropy-based approach, In: Journal of Latin linguistics 19(2) pp. 195-229 De Gruyter

DOI: 10.1515/joll-2020-2014

This paper provides a fully word-based, abstractive analysis of predictability in Latin verb paradigms. After reviewing previous traditional and theoretically grounded accounts of Latin verb inflection, a procedure is outlined where the uncertainty in guessing the content of paradigm cells given knowledge of one or more inflected wordforms is measured by means of the information-theoretic notions of unary and n-ary implicative entropy, respectively, in a quantitative approach that uses the type frequency of alternation patterns between wordforms as an estimate of their probability of application. Entropy computations are performed by using the Qumin toolkit on data taken from the inflected lexicon LatInfLexi. Unary entropy values are used to draw a mapping of the verbal paradigm in zones of full interpredictability, composed of cells that can be inferred from one another with no uncertainty. n-ary entropy values are used to extract categorical and near principal part sets, which allow the rest of the paradigm to be filled with little or no uncertainty. Lastly, the issue of the impact of information on the derivational relatedness of lexemes on uncertainty in inflectional predictions is tackled, showing that adding a classification of verbs in derivational families allows for a relevant reduction of entropy, not only for derived verbs, but also for simple ones.
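
As a rough illustration of the idea of unary implicative entropy described above, the following minimal Python sketch estimates the uncertainty in predicting one paradigm cell from another using the type frequency of alternation patterns. It is not the Qumin algorithm: the function names `alternation` and `implicative_entropy`, the naive "strip the common prefix" pattern extraction, and the orthographic toy forms are all assumptions made only for this example.

```python
from collections import Counter
from math import log2
import os


def alternation(a, b):
    """Naive alternation pattern (an assumption of this sketch):
    strip the longest common prefix and keep the two differing endings."""
    i = len(os.path.commonprefix([a, b]))
    return (a[i:], b[i:])


def implicative_entropy(pairs):
    """Estimate H(pattern | class of predictor form) from type frequencies.

    `pairs` is a list of (predictor_form, target_form), one per lexeme.
    The class of a predictor form is the set of patterns that could apply
    to it, i.e. whose input ending matches the end of the form.
    """
    patterns = [alternation(a, b) for a, b in pairs]
    inventory = set(patterns)
    joint = Counter()
    for (a, _), p in zip(pairs, patterns):
        cls = frozenset(q for q in inventory if a.endswith(q[0]))
        joint[(cls, p)] += 1
    class_totals = Counter()
    for (cls, _), n in joint.items():
        class_totals[cls] += n
    total = len(pairs)
    h = 0.0
    for (cls, _), n in joint.items():
        # p(class, pattern) * log2 p(pattern | class)
        h -= (n / total) * log2(n / class_totals[cls])
    return h


# Toy data in plain orthography (1sg present -> 1sg perfect), for illustration only.
print(implicative_entropy([("amo", "amavi"), ("laudo", "laudavi")]))  # 0.0
print(implicative_entropy([("amo", "amavi"), ("seco", "secui")]))     # 1.0
```

In the second call, both predictor forms end in -o and are compatible with both patterns, so the two equally frequent outcomes yield one bit of uncertainty.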

Matteo Pellegrini (2023) The Method, In: Paradigm Structure and Predictability in Latin Inflection pp. 23-42 Springer International Publishing

DOI: 10.1007/978-3-031-24844-3_2

This chapter is devoted to a detailed description of the methodology that will be applied in this work to perform an analysis of Latin inflectional paradigms that satisfies the theoretical desiderata outlined in Chap. 1. The applied method makes use of notions and procedures taken from information-theory (Shannon 1948), notably surprisal and entropy. In Sect. 2.1, a basic introduction to such information-theoretic notions is offered, pointing out the properties that make them useful to investigate topics related to implicative relations and the Paradigm Cell Filling Problem. The first proposal to use entropy for this purpose was outlined in Ackerman et al. (2009), whose procedure—aiming at an estimate of the degree of uncertainty associated with morphological realizations—is described in Sect. 2.2. However, the tools and algorithm that are used throughout this work are based on a similar but refined procedure (cf. Bonami 2014, Bonami and Boyé 2014, Beniamine 2018), that only requires a lexicon of inflected wordforms with no a priori morphological analysis. This method can be used not only to estimate the uncertainty in guessing the content of the paradigm cell of a lexeme knowing one inflected wordform, as explained in Sect. 2.3, but also given knowledge of multiple wordforms, as detailed in Sect. 2.4. The possible impact of additional information in making such tasks easier will be discussed in Sect. 2.5, where two possible variables are mentioned, namely i) the gender of a noun, and ii) the fact that a given lexeme is derivationally related to another one. In Sect. 2.6, the most important characteristics of the adopted methodology will be summarized, clarifying how they relate to the theoretical principles discussed in the first chapter.

Olivier Bonami, Matteo Pellegrini (2022) Derivation predicting inflection: A quantitative study of the relation between derivational history and inflectional behavior in Latin, In: Studies in language 46(4) pp. 753-792 John Benjamins Publishing Co

DOI: 10.1075/sl.21002.bon

In this paper, we investigate the value of derivational information in predicting the inflectional behavior of lexemes. We focus on Latin, for which large-scale data on both inflection and derivation are easily available. We train boosting tree classifiers to predict the inflection class of verbs and nouns with and without different pieces of derivational information. For verbs, we also model inflectional behavior in a word-based fashion, training the same type of classifier to predict wordforms given knowledge of other wordforms of the same lexemes. We find that derivational information is indeed helpful, and document an asymmetry between the beginning and the end of words, in that the final element in a word is highly predictive, while prefixes prove to be uninformative. The results obtained with the word-based methodology also allow for a finer-grained description of the behavior of different pairs of cells.

Matteo Pellegrini, Marco Passarotti, Francesco Mambrini, Giovanni Moretti (2025) PrinParLat: a lexicon of principal parts of Latin verbs linked to the LiLa Knowledge Base, In: Language resources and evaluation 59(4) pp. 3555-3595 Springer Nature

DOI: 10.1007/s10579-025-09847-y

In this paper, we present PrinParLat, a new resource listing the principal parts of Latin verbs (i.e., a set of forms from which the full paradigm can be obtained), as well as a fine-grained classification of their inflectional behavior. The resource is released both as a set of .csv tables in the Paralex standard format, and as RDF data linked to the LiLa Knowledge Base of interoperable resources for Latin. After introducing the notions of theoretical morphology of which this resource makes crucial use to be able to code complex morphological information in a compact way, the details of the procedure followed to obtain the data are outlined, as well as the modelling decisions taken regarding knowledge representation. The outcome of our work is evaluated by performing both an intrinsic evaluation of the data in the resource, and an extrinsic evaluation of their added value in the larger ecosystem of the LiLa Knowledge Base.

Matteo Pellegrini, Marco Passarotti (2018) LatInfLexi: An inflected Lexicon of Latin verbs, In: CEUR workshop proceedings 2253 pp. 324-329

DOI: 10.4000/books.aaccademia.3582

Matteo Pellegrini (2023) The Impact of Derivational Relatedness on Inflectional Predictions, In: Paradigm Structure and Predictability in Latin Inflection pp. 145-178 Springer International Publishing

DOI: 10.1007/978-3-031-24844-3_6

In this chapter, we will investigate how the picture of interpredictability between paradigm cells changes when considering not only the phonotactic shape of inflected wordforms, but also additional information of a different kind, namely the derivational relatedness of the lexemes involved. In Sect. 6.1, we will use some examples to illustrate the question and to stress the potential relevance of knowing whether two lexemes are ultimately derived from the same base or whether they are formed by means of the same derivational process. We will propose a method to take this information into account and briefly discuss the difference with the standard procedure for entropy computation. We will start from verbal lexemes that ultimately derive from the same base in Sect. 6.2, proposing a working definition of the notion of derivational-inflectional family and showing how our data were coded so as to include a classification in such families in Sect. 6.2.1, briefly providing a qualitative picture of the inflectional behaviour of verbs that belong to the same derivational-inflectional family in Sect. 6.2.2 and presenting our results in Sect. 6.2.3. The same line of reasoning will be followed in Sect. 6.3 for nouns that are formed by means of the same derivational process (i.e., that belong to the same derivational-inflectional series, as defined in Sect. 6.2.1). We will then present in Sect. 6.4 some of the results obtained on the same topic with a different methodology by Bonami and Pellegrini (2022), where the issue of inflectional predictions is framed as a classification problem: a classifier is trained to predict various correlates of the inflectional behaviour of a lexeme from several derivational predictors. We discuss how these results can integrate those of the other sections of this chapter by avoiding some of the problems of the entropy-based approach to the PCFP and by allowing for a quantitative evaluation of the role of many more facets of derivational history. Lastly, in Sect. 6.5, we will discuss the theoretical and methodological implications of our results, also highlighting some problems that we leave to further research.

Matteo Pellegrini (2023) Paradigm Structure and Predictability in Latin Inflection Springer International Publishing

DOI: 10.1007/978-3-031-24844-3

Matteo Pellegrini (2023) The Theoretical Framework, In: Paradigm Structure and Predictability in Latin Inflection pp. 1-21 Springer International Publishing

DOI: 10.1007/978-3-031-24844-3_1

In this chapter, we will locate the approach that is adopted in this work in the larger field of theoretical morphology. To do so, we will discuss several points: each of them will be illustrated using examples taken from Latin inflectional morphology. The point of departure will be an overview of the terminology that is used throughout this work (Sect. 1.1). We will then discuss several classifications of morphological theories (Sect. 1.2), from the traditional distinction between Item-and-Arrangement, Item-and-Process and Word-and-Paradigm models operated by Hockett (1954) up to the recent characterization of constructive and abstractive approaches proposed in Blevins (2006, 2016), highlighting the specific aspects on which each of the discussed classifications is based. In Sect. 1.3, we will focus on implicative relations, contrasting different ways in which they can be formulated, in terms of generalizations on exponents (Sect. 1.3.1), on stems (Sect. 1.3.2) or on inflected wordforms (Sect. 1.3.3). We will then add a quantitative dimension to the picture (Sect. 1.4), showing the importance of considering also non-categorical implicative relations. Lastly, in Sect. 1.5 we will explain the choices that have been made in this work regarding each of the topics discussed in the previous sections.

Matteo Pellegrini (2023) The Data and the Tools, In: Paradigm Structure and Predictability in Latin Inflection pp. 43-67 Springer International Publishing

DOI: 10.1007/978-3-031-24844-3_3

To perform a quantitative, entropy-based analysis like the one that has been sketched out in the previous chapter, a large, representative lexicon of inflected wordforms in phonetic transcription is needed. Similar resources are increasingly being developed for modern languages: cf. among others Zanchetta and Baroni (2005) and Calderone et al. (2017) for Italian, Bonami et al. (2014) and Hathout et al. (2014) for French, and the lexicon described in Bonami and Luís (2014) for Portuguese.

Matteo Pellegrini (2023) Flexemes in theory and in practice: Modelling overabundance in Latin verb paradigms, In: Morphology (Dordrecht) 33(3) pp. 361-395 Springer Nature

DOI: 10.1007/s11525-023-09414-7

This paper provides an in-depth investigation of the possibility of systematically using flexemes - i.e., lexical units characterized in terms of form, as opposed to lexemes, characterized in terms of meaning - to model overabundance - i.e., the availability of more than one form in the same paradigm cell. The starting point is a preliminary evaluation of the advantages and disadvantages of using flexemes to account for different overabundance phenomena, showing that flexemes are a good way to capture the systematicity of overabundance, either across lexemes or across cells. Consequently, it is suggested that flexemes can be an interesting technical solution for the creation of a lexicon of Latin verbs that not only documents all the competing wordforms available as principal parts, but also captures the systematic relationship that sometimes holds between variants filling different cells. A principled method to identify such systematicity is then described in detail. It is argued that a constructive approach based on the identity of stems and/or inflection class is not fully adequate for the data at hand. Therefore, the proposed procedure adopts an abstractive, word-based perspective that only relies on alternation patterns between unsegmented wordforms. Practical and theoretical implications of the work are finally discussed, particularly regarding the usefulness of a formal approach to the identification of lexical units and paradigm cells.

Matteo Pellegrini, Davide Ricca (2019) An instance of productive overabundance: The plural of some Italian VN compounds, In: Word structure 12(1) pp. 94-126 Edinburgh Univ Press

DOI: 10.3366/word.2019.0140

This paper investigates a case of overabundance in the plural cell of an open subclass of Italian VN compounds. The empirical basis includes: (i) a 163-item list of relevant compounds, for which the relative frequency of cell mates has been estimated by means of Web data; (ii) a naming questionnaire based on visual input, with 30 images submitted to about 200 informants, including those of several objects whose names are scarcely established in the lexicon; (iii) a further questionnaire, adapted to each informant, asking for acceptability judgements to detect overabundance at the single speaker's level. Results show that the given subclass of VN compounds provides an instance of systematic and productive overabundance in the Italian morphological system, unlike the examples usually discussed for this language.

M. Silvia Micheli, Matteo Pellegrini (2025) Exploring the form of Italian diminutives, In: Constructions and frames 17(2) pp. 175-210 John Benjamins Publishing Co

DOI: 10.1075/cf.23028.mic

In this paper, we provide an extensive analysis of the formal features displayed by Italian diminutives (especially allomorphy) by using surface alternation patterns automatically extracted from a dataset of base-diminutive pairs. Applying the theoretical tools provided by Construction Morphology (CxM), we take alternation patterns as a proxy for paradigmatic relations and locate them into separate hierarchies, exploiting the mechanism of multiple inheritance in order to express generalizations on different, mutually independent factors at play (i.e., lexical category, gender and final segment of the base and derivative, suffixes and antesuffixal phonological material displayed by the derivative). In doing so, we contribute to the exploration of the formal side of Italian diminutives, which have been so far addressed mostly from a semantic perspective, and to the refinement of the representation of formal phenomena such as allomorphy according to the CxM framework.

Matteo Pellegrini (2024) From Stems to Forms: Paradigm Zones in Italian Verb Inflection, In: ETUDES ROMANES DE BRNO 45(3) pp. 70-101 Masaryk Univ, Fac Arts

DOI: 10.5817/ERB2024-3-5

This work presents an analysis of interpredictability between paradigm cells in Italian verbal inflection. The approach is abstractive: the starting point is inflected forms, without assuming any segmentation a priori. The methodology is based on information-theoretic notions: the uncertainty in predicting a form from another one is measured by means of conditional entropy. The results are used to draw a mapping of the verbal paradigm of Italian in zones, such that cells that belong to the same zone are systematically predictable from one another. This mapping is compared to the one of important previous studies that have adopted a constructive approach, based on predictability between stems. Lastly, the quantification of uncertainty through conditional entropy is used as an evaluation metric to establish which approach yields a more economical analysis: the abstractive one of the present work or the constructive one of previous studies.
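
The mapping of a paradigm into zones described above can be sketched in a few lines of Python, under the assumption that pairwise conditional entropies between cells have already been computed. The function name `zones`, the cell labels and the entropy values below are made up for illustration and are not results from the paper.

```python
def zones(cells, entropy, tol=1e-9):
    """Group cells into zones: two cells share a zone when each predicts
    the other with (near-)zero conditional entropy.

    `entropy[(a, b)]` is assumed to hold H(b | a) for every ordered pair
    of distinct cells.
    """
    parent = {c: c for c in cells}

    def find(c):
        # Union-find lookup with path compression.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a in cells:
        for b in cells:
            if a != b and entropy[(a, b)] <= tol and entropy[(b, a)] <= tol:
                parent[find(a)] = find(b)

    groups = {}
    for c in cells:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical cells and entropy values, for illustration only.
cells = ["PRS.1SG", "PRS.1PL", "PST.1SG"]
H = {
    ("PRS.1SG", "PRS.1PL"): 0.0, ("PRS.1PL", "PRS.1SG"): 0.0,
    ("PRS.1SG", "PST.1SG"): 0.4, ("PST.1SG", "PRS.1SG"): 0.1,
    ("PRS.1PL", "PST.1SG"): 0.4, ("PST.1SG", "PRS.1PL"): 0.1,
}
print(zones(cells, H))  # [['PRS.1PL', 'PRS.1SG'], ['PST.1SG']]
```

Merging cells through connected components is a simplification chosen for this sketch; it treats mutual zero-entropy predictability as transitive.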

Matteo Pellegrini (2023) Conclusions, In: Paradigm Structure and Predictability in Latin Inflection pp. 179-183 Springer International Publishing

DOI: 10.1007/978-3-031-24844-3_7

This chapter recaps the crucial arguments of the book and summarizes its contribution to the advancement of research in different areas. It is shown that the book provides scholars working on Latin with the description of a freely available resource listing the inflected wordforms of verbs and nouns and with new descriptive insights on the structure of paradigms in terms of predictability; and that it is also interesting for theoretical morphologists in that it explores the possibilities opened up by the methodological innovation of taking into account other pieces of information beside the phonotactic shape of wordforms when predicting paradigm cells, focusing on the role of such important factors as gender and derivation.

Matteo Pellegrini (2023) Predictability and Paradigm Organization in Latin Verb Inflection, In: Paradigm Structure and Predictability in Latin Inflection pp. 69-108 Springer International Publishing

DOI: 10.1007/978-3-031-24844-3_4

This chapter is devoted to Latin verb inflection. As a starting point, it is necessary to provide some preliminary information on the verbal system, as it is outlined in traditional descriptions. We will do so in Sect. 4.1, where we will also review previous theoretically grounded studies on Latin inflectional morphology regarding verbs. We will then move to our analysis of implicative relations, which is performed not on the full paradigm of Latin verbs but on a reduced version that abstracts away from all cases of systematic syncretism, called the “cell paradigm” following Boyé and Schalchli (2016) (see Sect. 4.2 for a more detailed elaboration). Results on various fragments of the Latin paradigm will be presented in Sect. 4.3; in Sect. 4.3.1, we will focus on the alternation patterns that hold between wordforms that are based on different stems, and consequently on the uncertainty in predicting the cells involved from one another; in Sect. 4.3.2, we will look at the situation in wordforms that are based on the same stem. In Sect. 4.4, we will try to give an idea of the overall structure of the Latin verb paradigm, as it emerges from our entropy-based analysis. Firstly, we will draw a map of the paradigm in different zones that contain cells between which there is full mutual predictability. Secondly, we will compute entropy values on a so-called “distillation” (cf. Stump and Finkel 2013) of the paradigm, where we keep only one cell for each zone. Lastly, in Sect. 4.5 we will extend our investigation to predictions from more than one cell, whose uncertainty will be measured by means of n-ary implicative entropy. These results will also be exploited to extract principal part sets and near-principal part sets and compare them to the ones of the traditional analysis and to the ones that have been found with different methodologies—notably, Stump and Finkel (2013)’s Principal Part Analysis.

Matteo Pellegrini (2023) Predictability in Latin Noun Inflection and the Role of Gender, In: Paradigm Structure and Predictability in Latin Inflection pp. 109-144 Springer International Publishing

DOI: 10.1007/978-3-031-24844-3_5

We will now switch our focus from verbs to nouns. The structure of the first part of this chapter is analogous to the previous one. In Sect. 5.1, we provide some preliminary information on the inflectional behaviour of Latin nouns, as it is described in traditional grammars. We also review some (more or less) recent theoretical accounts of the well-known facts. In Sect. 5.2, we present the results of our systematic analysis of implicative relations and predictability in noun inflection. The shape of the cell paradigm of Latin nouns, on which our computations are done, is shown in Sect. 5.2.1. The results concerning unary and n-ary implicative entropy are then given in turn in Sects. 5.2.2 and 5.2.3. In the latter section, the principal parts and near principal parts that can be inferred from the entropy-based analysis are discussed. The remaining part of the chapter is devoted to a topic that proves to be relevant for nouns but not for verbs: the role of information on a lexeme’s gender in reducing uncertainty in inflectional predictions. How do entropy values change if we assume that we do not only know the phonotactic shape of an inflected wordform but also the gender of the lexeme it belongs to? We begin by providing some examples that show that information on gender is at least potentially available to speakers and capable of yielding a reduction in implicative entropy (Sect. 5.3.1). We then briefly show some quantitative data on the distribution of Latin nouns among the three genders, both in absolute terms and in their relationship with the classification in the traditional five declensions, highlighting some clear preferential associations between a noun’s declension and its gender and their impact on uncertainty in the PCFP (Sect. 5.3.2). We finally present the results that are obtained by taking gender information into account in Sect. 5.3.3, discussing the relevance of such results in the debate on the function(s) of gender, but also some caveats that should be made on their reliability, in Sect. 5.3.4. The findings of the chapter are then summarized in Sect. 5.4.

David Lindemann, Matteo Pellegrini, Francesco Mambrini, Marco Passarotti (2025) Wikidata and LiLa for Latin: Enabling Interoperability and Access to Inflected Forms and Corpus Attestations, In: Journal of open humanities data 11(2) 83 pp. 1-17 Ubiquity Press Ltd

DOI: 10.5334/johd.464

This paper presents an approach to integrating Latin inflected forms and corpus attestations within a Linked Open Data (LOD) framework, enhancing interoperability between Wikidata and the LiLa knowledge base. Building on the PrinParLat lexicon of Latin verb principal parts, we generate the complete set of inflected forms for over 8,000 verbs, encoded as RDF in a dedicated Wikibase instance. These forms are linked to the Index Thomisticus Treebank (ITTB), whose morphologically annotated tokens are related to corresponding forms based on segmental identity, lemma alignment, and mapped morphological features. Our generation and linking process achieves over 95% coverage of ITTB verbal tokens, demonstrating the robustness of our pipeline even for Medieval Latin data. By aligning Paralex, Wikidata, and LiLa ontologies, we ensure semantic interoperability and facilitate future integration into Wikidata. Beyond Latin, this workflow provides a reproducible model for linking inflectional paradigms and corpus attestations in other languages.