Research

Publications

Jian He Low, Harry Thomas Walsh, Ozge Mercanoglu Sincan, Richard Bowden (2025) Hands-On: Segmenting Individual Signs from Continuous Sequences, In: 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), IEEE

This work tackles the challenge of continuous sign language segmentation, a key task with significant implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages HaMeR hand features, complemented with 3D angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on the BSL Corpus.
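
To make the sequence-labeling formulation concrete, the sketch below shows how per-frame features (e.g. precomputed HaMeR hand features concatenated with 3D joint angles) could be fed to a small transformer encoder that predicts a Begin/In/Out tag for every frame. This is an illustrative outline only, not the authors' implementation; the feature dimension, model sizes, and random inputs are placeholder assumptions.

# Illustrative sketch (not the authors' code): frame-level BIO tagging with a
# transformer encoder over precomputed per-frame features. Dimensions are assumptions.
import torch
import torch.nn as nn

BIO_TAGS = {"B": 0, "I": 1, "O": 2}  # Begin / Inside / Outside a sign

class BIOSegmenter(nn.Module):
    def __init__(self, feat_dim=1024, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, len(BIO_TAGS))

    def forward(self, frame_feats):            # (batch, time, feat_dim)
        x = self.proj(frame_feats)
        x = self.encoder(x)                    # temporal context across frames
        return self.classifier(x)              # (batch, time, 3) BIO logits

# Training reduces to per-frame cross-entropy against BIO labels:
model = BIOSegmenter()
feats = torch.randn(2, 300, 1024)              # 300 frames of pooled features
logits = model(feats)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3),
                             torch.randint(0, 3, (2 * 300,)))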

Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi, Richard Bowden (2025) Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field, In: Computer Vision and Image Understanding, 104498 (ahead of print), Elsevier

Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available to support transparency and reproducibility in SLT research.
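
One concrete source of the evaluation inconsistencies discussed above is how BLEU is computed. The snippet below is a minimal sketch, assuming the sacrebleu library (my assumption, not necessarily the paper's released codebase), of scoring every model with a single fixed metric implementation and tokenizer so that metric-calculation differences cannot be mistaken for modelling gains.

# Illustrative sketch: one shared BLEU computation for all compared models.
import sacrebleu

def evaluate_translations(hypotheses, references, tokenize="13a"):
    """hypotheses: list[str]; references: list[str] (one reference per sample)."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize=tokenize)
    return bleu.score

hyps = ["the weather will be sunny tomorrow"]
refs = ["tomorrow the weather will be sunny"]
print(f"BLEU-4: {evaluate_translations(hyps, refs):.2f}")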

Jian He Low, Ozge Mercanoglu Sincan, Richard Bowden (2025) Sign Spotting Disambiguation using Large Language Models, In: ACM International Conference on Intelligent Virtual Agents (IVA Adjunct ’25), ACM

Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method's superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
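
As a rough illustration of the dictionary-matching stage described above, the sketch below aligns a query clip against dictionary sign exemplars with dynamic time warping over cosine distances between per-frame features. It is not the authors' implementation: feature extraction is assumed to happen elsewhere, helper names and shapes are placeholders, and the subsequent LLM disambiguation step is omitted.

# Illustrative sketch: dictionary-based sign spotting via DTW over cosine distances.
import numpy as np

def cosine_distance_matrix(query, exemplar):
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    e = exemplar / np.linalg.norm(exemplar, axis=1, keepdims=True)
    return 1.0 - q @ e.T                          # (len_query, len_exemplar)

def dtw_cost(dist):
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)                    # length-normalised alignment cost

def spot_candidates(query_feats, dictionary, top_k=5):
    """dictionary: {gloss: exemplar_feats}; returns the k best-matching glosses."""
    scores = {g: dtw_cost(cosine_distance_matrix(query_feats, ex))
              for g, ex in dictionary.items()}
    return sorted(scores, key=scores.get)[:top_k]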

Jian He Low, Ozge Mercanoglu Sincan, Richard Bowden (2025) SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation, In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025), Institute of Electrical and Electronics Engineers (IEEE)

Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performance without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67× lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence lengths, validating the potential of our tokenization and alignment strategies.
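
A minimal sketch of the two ideas named above, segment-aware pooling and token-level contrastive alignment, is given below. It assumes segment boundaries come from an external sign segmenter and uses a generic symmetric InfoNCE-style loss; the dimensions, names, and exact loss form are illustrative assumptions rather than the released SAGE model.

# Illustrative sketch: pool frame features within predicted sign segments into one
# visual token each, then align visual and text tokens with a contrastive objective.
import torch
import torch.nn.functional as F

def pool_segments(frame_feats, segments):
    """frame_feats: (T, D); segments: list of (start, end) frame indices."""
    return torch.stack([frame_feats[s:e].mean(dim=0) for s, e in segments])

def token_contrastive_loss(visual_tokens, text_tokens, temperature=0.07):
    """Both inputs: (N, D), with the i-th visual and text tokens paired."""
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    logits = v @ t.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

frames = torch.randn(120, 512)                     # 120 frames of visual features
segments = [(0, 30), (30, 75), (75, 120)]          # e.g. from a sign segmenter
visual_tokens = pool_segments(frames, segments)    # 3 tokens instead of 120 frames
text_tokens = torch.randn(3, 512)                  # matched language embeddings
loss = token_contrastive_loss(visual_tokens, text_tokens)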