case study
Published: 13 May 2024

Open research in lexicography: The ColloCaid project

Dr Ana Frankenberg-Garcia led the ColloCaid project, which incorporated open practices into lexicography research. Her team compiled the largest open-access lexical database of academic English collocations to date and built a free prototype that lets writers consult the database as they write.

Ana Frankenberg-Garcia


Writers often struggle to find effective word combinations. The ColloCaid project (AHRC AH/P003508/1) investigated how to provide lexical assistance of this kind during writing with minimal disruption. The team compiled a database of over 32,000 effective word combinations, known as collocations, curated from corpora of expertly written formal and academic texts. They then developed a prototype that integrates interactive vocabulary assistance into a text editor, helping users improve the fluency and idiomaticity of their writing as they write.

Approaches and challenges

Many core academic English lexical resources are already available, and we did not want to duplicate previous efforts. Our team combined three open-access general academic English wordlists as a starting point for the research. One challenge was that lists aiming to cover similar vocabulary turned out to have fewer words in common than expected.

We compiled our lexical database by consulting and analysing two expert academic writing corpora: PICAE (The Pearson International Corpus of Academic English, Longman Pearson) and OCAE (Oxford Corpus of Academic English, Oxford). We had to seek permission to use these high-quality proprietary resources. Permission took time to obtain, but we were able to focus on other aspects of the project while waiting for the copyright holders’ responses.

We developed a vocabulary assistance tool incorporating feedback from both experts and end users. To maximise participation, we needed to find a solution that was straightforward to implement without any software installation.  

We conducted a participatory design workshop and a number of writing workshops to test and evaluate the tool at different stages of development. 

Finally, we wanted to make our research widely accessible without compromising its potential commercial value.     


We created a project website (see [1] below) to disseminate our work. We discussed the rationale for the project in [2], detailed the approach to compiling our lexical database in [3], and discussed specific issues encountered along the way in [4], enabling our work to be replicable and providing novel lexicographic insights to researchers reusing wordlists and working with corpora.  

We adapted an open-source online editor for our prototype so that anyone could test it directly in their browser, without installing any software [5]. We made early versions of the prototype available for testing before the lengthy process of compiling the database was complete [6], and acted on feedback from experts and end users in participatory workshops to fine-tune later versions [7].

We openly shared datasets, code and documentation in a public repository [8]. To protect the potential value of the complete lexical database, we offer a representative sample, and researchers interested in the full dataset are asked to contact us to establish its terms of use.     


Our project website includes links to:  

  • The free-to-use ColloCaid prototype
  • An introductory video about ColloCaid
  • Openly shared project datasets and code
  • Numerous conference presentations, webinars and invited talks attended by experts and non-experts
  • More than 10 open-access research publications arising from the project, which are attracting a growing number of citations from researchers in lexicography and writing pedagogy.

More presentations and publications are forthcoming.  

To monitor impact, we keep track of registered tool users. Nearly 9,000 users from all over the world have used the ColloCaid prototype. They include writers who access the tool independently and writers encouraged to use it by academic writing tutors at universities across the globe. Additionally, the full collocation database has been shared with the European Lexicographic Infrastructure (ELEXIS) project, which has reused our data in the Word of Games app.

Key references and further information

[1] ColloCaid website (n.d.)

[2] Frankenberg-Garcia, A. (2020) Combining user needs, lexicographic data and digital writing environments. Language Teaching, 53/1: 29-43.

[3] Frankenberg-Garcia, A., Lew, R., Roberts, J., Rees, G. and Sharma, N. (2019) Developing a writing assistant to help EAP writers with collocations in real time. ReCALL, 32/2: 23-39.

[4] Frankenberg-Garcia, A., Rees, G. and Lew, R. (2021) Slipping through the cracks in e-Lexicography. International Journal of Lexicography, 34/2: 206-234.

[5] ColloCaid prototype (2021) 

[6] Roberts, J. C., Butcher, P., Rees, G., Lew, R., Sharma, N. & Frankenberg-Garcia, A. (2023). Less is more: Focused Design and Problem Framing in Visualisation – Developing the ColloCaid Collocation Editor. Proceedings of EG UK Computer Graphics & Visual Computing (2023), pp. 53-55.  

[7] Frankenberg-Garcia, A., Pinto, P.T., Bocorny, A.E. and Sarmento, S. (2022). Corpus-aided EAP Writing Workshops to Support International Scholarly Publication. Applied Corpus Linguistics, 2/3: 1-13. 

[8] ColloCaid open datasets, code and documentation (2020-2021) 

Project team

Principal investigator  


