case study
Published: 01 March 2021

A corpus approach to Roman law: Legal history meets computational linguistics

Dr Marton Ribary, Leverhulme Early Career Research Fellow, focuses on the empirical characteristics of legal texts which are sometimes neglected by traditional research methods 

Dr Marton Ribary

Our aim 

Since the 19th century, Roman legal scholarship has been carried out predominantly with traditional philological methods, including close reading and strict juristic reasoning. Its results are published in traditional print formats such as single-author journal articles and monographs.

Our research focuses on the empirical characteristics of legal texts which are sometimes neglected by traditional research methods. We demonstrate how combining computational methods with an open research strategy can revitalise both research and dissemination in the humanities, social sciences and beyond.

Our approach and challenges 

We created a relational database of the Latin text of Justinian’s Digest (533 CE) and carried out computational analysis in collaboration with the University of Cambridge’s Dr Barbara McGillivray. We investigated the structure of the Digest corpus by automatically clustering sections and explored the characteristics of Roman legal language with computational distributional semantics. By embracing the principles of open research, our goal was to start a conversation between Roman law scholars and computational linguists. 
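The idea of clustering sections by their vocabulary can be illustrated with a minimal sketch. It is not the project's actual pipeline: the section labels, the lemma lists, the similarity threshold and the simple single-link grouping are all assumptions chosen for illustration, and the real analysis is documented in the research paper [2].

```python
import math
from collections import Counter

# Hypothetical toy "sections": short lists of Latin lemmas invented for
# illustration; they are not taken from the Digest database.
sections = {
    "A": ["emptio", "venditio", "pretium", "res"],
    "B": ["emptio", "pretium", "venditor", "res"],
    "C": ["heres", "testamentum", "legatum"],
    "D": ["testamentum", "heres", "hereditas"],
}


def cosine(a, b):
    """Cosine similarity between two bag-of-lemmas Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold):
    """Single-link grouping: a section joins every cluster that contains
    at least one member whose similarity to it reaches the threshold."""
    clusters = []
    for name in vectors:
        linked = [
            c for c in clusters
            if any(cosine(vectors[name], vectors[m]) >= threshold for m in c)
        ]
        merged = {name}
        for c in linked:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters


vectors = {name: Counter(lemmas) for name, lemmas in sections.items()}
groups = cluster(vectors, threshold=0.4)  # threshold is an arbitrary choice
print(sorted(sorted(g) for g in groups))  # [['A', 'B'], ['C', 'D']]
```

In this toy example the two sections sharing sale vocabulary and the two sharing inheritance vocabulary end up in separate groups, which is the kind of structure automatic clustering is meant to surface.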

Open review has enabled us to share ideas with our meticulous reviewers and explain theoretical and technical choices in more detail. Thanks to the project’s carefully maintained GitLab repository, we were able to correct and rebuild the project with relative ease after a user spotted a bug in the first release of our database. By creating fully transparent and accessible documentation on GitLab and Figshare, we also managed to reach users with no prior experience with relational databases or computational methods. 

In a traditional field like Roman legal scholarship, we expected some resistance to our research methods and dissemination strategy. We now realise that, to attract new audiences, we will still need to present and disseminate some aspects of our research in more traditional ways.


Our results were published as gold open access pieces in rigorous and rapid online-first journals. This included a data paper explaining technical details and the reuse potential of the Roman law database [1], and a full research paper about our computational analysis [2]. The database was published on the Figshare public repository with a digital object identifier [3]. The detailed documentation includes a discussion of free online [4] and other tools [5,6] for loading the database, as well as sample SQL queries for using it. The project’s GitLab repository [7] includes all files and code with full documentation, demo files and a workflow pipeline graph.
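As a rough illustration of what querying a SQLite database of this kind might look like, the sketch below uses Python's built-in sqlite3 module. The table names (jurist, text_unit), their columns and the inserted rows are hypothetical placeholders rather than the published schema or data; the documentation on Figshare [3] and GitLab [7] describes the actual tables and sample queries.

```python
import sqlite3

# An in-memory stand-in so the sketch runs on its own. With the published
# database file, one would pass its local path to sqlite3.connect() instead.
conn = sqlite3.connect(":memory:")

# Hypothetical two-table schema, NOT the schema of the published database.
conn.executescript("""
CREATE TABLE jurist (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE text_unit (
    id        INTEGER PRIMARY KEY,
    jurist_id INTEGER REFERENCES jurist(id),
    reference TEXT
);
""")

# Placeholder rows invented for illustration.
conn.executemany("INSERT INTO jurist VALUES (?, ?)",
                 [(1, "Ulpianus"), (2, "Paulus")])
conn.executemany("INSERT INTO text_unit VALUES (?, ?, ?)",
                 [(1, 1, "unit 1"), (2, 1, "unit 2"), (3, 2, "unit 3")])

# Sample query: count text units per jurist, most frequent first.
rows = conn.execute("""
SELECT j.name, COUNT(*) AS n
FROM text_unit AS t
JOIN jurist AS j ON j.id = t.jurist_id
GROUP BY j.name
ORDER BY n DESC, j.name
""").fetchall()

print(rows)  # [('Ulpianus', 2), ('Paulus', 1)]
conn.close()
```

The same SELECT statement could equally be pasted into the SQLite command line shell [5] or DB Browser for SQLite [6] once adapted to the real table names.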

Our basic resource files were freely available [8,9], apart from one Latin corpus [10] for which we only had access to language models trained on it [11]. While this has not affected the robustness of our results, we wanted to train our own models on the original corpus in the spirit of full transparency and reproducibility. Since then, we have successfully negotiated a data agreement with the research institute which maintains the annotated corpus.

We are already seeing clear signs that our commitment to open research principles is initiating promising conversations. These exchanges are maturing into collaborative research projects which combine traditional philological methods and innovative computational methods in Roman legal scholarship. 

Read the full research paper

Images and diagrams

  1. A simplified empirical conceptual tree-map of Roman law
  2. Pre-processing and the creation of lemmatized texts on the example of a passage from the Digest (Ulpian, Inst. 1, D.1.1.6.1)
  3. Schema graph of the relational database based on Justinian’s Digest
  4. Processing flowchart: From resource texts to database

Reference list

[1] Ribary, M. A Relational Database on Roman Law Based on Justinian’s Digest. Journal of Open Humanities Data 2020, 6, 1–5. DOI: 

[2] Ribary, M.; McGillivray, B. A Corpus Approach to Roman Law Based on Justinian’s Digest. Informatics 2020, 7, 44. DOI: 

[3] Ribary, M. A relational database of Roman law based on Justinian’s Digest. figshare. Dataset. DOI: 

[4] Free online SQLite tool, 

[5] Command Line Shell for SQLite, 

[6] DB Browser for SQLite, 

[7] Ribary, M. pyDigest, A GitLab Repository of Scripts, Files and Documentation. URL: 

[8] Riedlberger, P.; Rosenbaum, G. Amanuensis, V5.0. URL:

[9] McGillivray, B. LatinISE Corpus, Version 4; LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic. URL: 

[10] Laboratoire d'Analyse Statistique des Langues Anciennes, 

[11] Sprugnoli, R.; Passarotti, M.; Moretti, G. Lemma Embeddings for Latin. 2019. Version v.1.0.0. DOI: 

