First-ever dataset to improve English-to-Malayalam machine translation fills critical gap for low-resource languages
Researchers at the University of Surrey have developed the world’s first dataset aimed at improving the quality of machine translation from English into Malayalam, a long-overlooked language spoken by more than 38 million people in India.
Malayalam is considered a low-resource language in machine translation. Until now, almost no data has been available for evaluating the accuracy of text machine-translated from English into Malayalam, limiting progress for communities that rely on digital translation tools.
Funded by the European Association for Machine Translation (EAMT), the Surrey-led research, published in the ACL Anthology, focused on two key areas – Quality Estimation (QE), which predicts how good a translation is without the need for a reference text, and Automatic Post-Editing (APE), which automatically corrects errors.
The team curated 8,000 English-to-Malayalam translation segments across finance, legal and news – domains where accuracy is essential. Each segment was reviewed by professional annotators at TechLiebe, an industry partner, who provided three human quality scores and a corrected “post-edited” version of the machine-translated text.
An additional layer of annotation known as ‘Weak Error Remarks’ was also introduced, allowing human annotators to quickly note and describe the types of errors they spotted, such as mistranslations, missing words or added phrases. Early findings show that when these notes are combined with large language models, systems can better identify where a translation went wrong – a method that is already outperforming current approaches.
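As a rough illustration of the annotation layers described above, one annotated segment could be represented as follows. This is a minimal sketch only: the released dataset's actual format has not been described here, so the class name, field names, score scale and prompt wording are all hypothetical, and the Malayalam text is left as placeholders.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AnnotatedSegment:
    """One annotated English-to-Malayalam segment (hypothetical schema)."""
    domain: str                  # "finance", "legal" or "news"
    source_en: str               # English source text
    machine_translation: str     # raw machine-translated Malayalam
    post_edit: str               # annotator-corrected Malayalam
    quality_scores: list[float]  # three human quality scores (scale is an assumption)
    weak_error_remarks: list[str] = field(default_factory=list)

    def mean_score(self) -> float:
        """Average the human scores into a single quality label."""
        return mean(self.quality_scores)


def build_qe_prompt(segment: AnnotatedSegment) -> str:
    """Assemble a plain-text prompt that passes the error remarks to a language model."""
    remarks = "; ".join(segment.weak_error_remarks) or "none"
    return (
        "Rate the quality of this English-to-Malayalam translation.\n"
        f"Source: {segment.source_en}\n"
        f"Translation: {segment.machine_translation}\n"
        f"Annotator error remarks: {remarks}\n"
    )


# Illustrative record with placeholder text, not real dataset content.
example = AnnotatedSegment(
    domain="finance",
    source_en="The bank raised interest rates.",
    machine_translation="<machine-translated Malayalam>",
    post_edit="<post-edited Malayalam>",
    quality_scores=[60.0, 70.0, 80.0],
    weak_error_remarks=["missing word", "mistranslation of 'interest'"],
)
print(example.mean_score())
print(build_qe_prompt(example))
```

The sketch stops at building the prompt; which language model is used, and how its output is scored, are left open because those details are not given above.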
The research team have completed the majority of annotations, with a public release of the dataset planned for April 2026. The methodology could serve as a blueprint for other low-resource languages, including many across India, Africa and Creole-speaking regions, where high-quality translation data is urgently needed.
###
Notes to editors
- Dr Diptesh Kanojia and Archchana Sindhujan are available for interview; please contact mediarelations@surrey.ac.uk to arrange.
- The paper can be found here.
Media Contacts
External Communications and PR team
Phone: +44 (0)1483 684380 / 688914 / 684378
Email: mediarelations@surrey.ac.uk
Out of hours: +44 (0)7773 479911