Dr Shenbin Qian

Postgraduate Research Student

s.qian@surrey.ac.uk

Please visit https://shenbinqian.github.io/ for more information

Academic and research departments

Centre for Translation Studies, Literature and Languages.

About

My research project

Emotion preservation in neural machine translation

Despite the progress of neural machine translation in recent years, the translation of emotion-intensive texts like microblogs on social media is still not satisfactory. While current studies in natural language processing focus on developing larger general-purpose language models and using fewer parallel data, this project investigates the errors made by current neural machine translation systems and its linguistic background, and then develops a neural machine translation system specifically targeted to emotion-intensive texts to better convey human emotions.

Supervisors

Constantin Orasan

Félix do Carmo

Diptesh Kanojia

My qualifications

2022

MSc in Applied Data Science and Statistics with Distinction

University of Exeter

2017

MA in Translation and Interpreting

Xihua University

Research

Research interests

Machine Translation;
Sentiment Analysis;
Topic Modeling;
Text Mining;
Topics Related to Language Technology

Publications

Shenbin Qian, Constantin Orasan, Diptesh Kanojia, Félix do Carmo (2025)Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems, In: JinYeong Bak, Rob van der Goot, Hyeju Jang, Weerayut Buaphet, Alan Ramponi, Wei Xu, Alan Ritter (eds.), Proceedings of the Tenth Workshop on Noisy and User-generated Textpp. 97-107 Association for Computational Linguistics

Evaluating machine translation (MT) of user-generated content (UGC) involves unique challenges such as checking whether the nuance of emotions from the source are preserved in the target text. Recent studies have proposed emotion-related datasets, frameworks and models to automatically evaluate MT quality of Chinese UGC, without relying on reference translations. However, whether these models are robust to the challenge of preserving emotional nuances has been left largely unexplored. To address this gap, we introduce a novel method inspired by information theory which generates challenging Chinese homophone words related to emotions, by leveraging the concept of self-information. Our approach generates homophones that were observed to cause translation errors in emotion preservation, and exposes vulnerabilities in MT systems and their evaluation methods when tackling emotional UGC. We evaluate the efficacy of our method using human evaluation for the quality of these generated homophones, and compare it with an existing one, showing that our method achieves higher correlation with human judgments. The generated Chinese homophones, along with their manual translations, are utilized to generate perturbations and to probe the robustness of existing quality evaluation models, including models trained using multi-task learning, fine-tuned variants of multilingual language models, as well as large language models (LLMs). Our results indicate that LLMs with larger size exhibit higher stability and robustness to such perturbations. We release our data and code for reproducibility and further research.

Shenbin Qian, Archchana Sindhujan, Minnie Kabra, Diptesh Kanojia, Constantin Orasan, Tharindu Ranasinghe, Fred Blain (2024)What do Large Language Models Need for Machine Translation Evaluation?, In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processingpp. 3660-3674 Association for Computational Linguistics

DOI: 10.18653/v1/2024.emnlp-main.214

Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium-and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting, than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which poses a question on their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.

Archchana Sindhujan, Diptesh Kanojia, Constantin Orasan, Shenbin Qian (2025)When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages, In: Proceedings of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)pp. 437-459

This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that provides a quality score (0 − 100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We release the data, and models trained publicly for further research.

Shenbin Qian, Constantin Orasan, Felix do Carmo, Qiuliang Li, Diptesh Kanojia Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation

DOI: 10.48550/arxiv.2306.11900

In this paper, we focus on how current Machine Translation (MT) tools perform on the translation of emotion-loaded texts by evaluating outputs from Google Translate according to a framework proposed in this paper. We propose this evaluation framework based on the Multidimensional Quality Metrics (MQM) and perform a detailed error analysis of the MT outputs. From our analysis, we observe that about 50% of the MT outputs fail to preserve the original emotion. After further analysis of the errors, we find that emotion carrying words and linguistic phenomena such as polysemous words, negation, abbreviation etc., are common causes for these translation errors.

Shenbin Qian, Constantin Orasan, Diptesh Kanojia, Hadeel Saadany, Felix do Carmo (2022)SURREY-CTS-NLP at WASSA2022:An Experiment of Discourse and Sentiment Analysis for the Prediction of Empathy, Distress and Emotion, In: PROCEEDINGS OF THE 12TH WORKSHOP ON COMPUTATIONAL APPROACHES TO SUBJECTIVITY, SENTIMENT & SOCIAL MEDIA ANALYSISpp. 271-275 Assoc Computational Linguistics-Acl

This paper summarises the submissions our team, SURREY-CTS-NLP has made for the WASSA 2022 Shared Task for the prediction of empathy, distress and emotion. In this work, we tested different learning strategies, like ensemble learning and multi-task learning, as well as several large language models, but our primary focus was on analysing and extracting emotion-intensive features from both the essays in the training data and the news articles, to better predict empathy and distress scores from the perspective of discourse and sentiment analysis. We propose several text feature extraction schemes to compensate the small size of training examples for fine-tuning pretrained language models, including methods based on Rhetorical Structure Theory (RST) parsing, cosine similarity and sentiment score. Our best submissions achieve an average Pearson correlation score of 0.518 for the empathy prediction task and an F1 score of 0.571 for the emotion prediction task(1), indicating that using these schemes to extract emotion-intensive information can help improve model performance.