11am - 12 noon

Wednesday 13 August 2025

Representations for Sign Language Production

PhD Viva Open Presentation - Harry Walsh

Hybrid event - All Welcome!

Free

21BA02 - Arthur C Clarke building
University of Surrey
Guildford
Surrey
GU2 7XH

Speakers

Harry Walsh

Representations for Sign Language Production

Abstract:
Sign languages are complex languages with their own grammatical structure and vocabulary. As a result, communication between the Deaf and hearing communities can be challenging, requiring translation rather than a simple word-to-sign substitution, typically performed by an expert human interpreter or translator. However, the demand for interpreters exceeds the supply, with only an estimated 1 interpreter for every 3 Deaf individuals in the UK, leading to a lack of access to information for the Deaf community. This disparity has motivated the development of automatic translation systems for sign language.

For decades, researchers have sought to develop automatic sign language translation systems, aiming to bridge communication gaps between Deaf and hearing communities. However, translating between spoken language text and continuous sign language video presents significant challenges due to the multi-channelled nature of sign language and the limited annotated data available. Research has focused on translating signed to spoken language, known as Sign Language Translation (SLT), while the reverse task, Sign Language Production (SLP), has relied on more deterministic, graphics-based approaches. Early SLP systems, relying on avatar-based approaches and hand-crafted rules, often produced unnatural signing. While more recent advances in deep learning have improved the naturalness of productions, they still fall short of human expressiveness. This thesis addresses this gap by exploring representations for sign that can aid in translating spoken language into photo-realistic sign language video.

Historically, the SLP task has been broken into three steps: an initial translation from a spoken language sentence to a gloss sequence, followed by the production of a skeleton pose sequence, which is finally used to drive a photo-realistic signer or avatar. This thesis begins by investigating two representations that can be used in the first step of this pipeline, namely gloss and the Hamburg Notation System (HamNoSys). The contribution also explores the influence of tokenization, the process of segmenting a sequence into discrete units, which further alters the representation. We also leverage recent developments in pre-trained language models, such as BERT and Word2Vec, to generate improved word- and sentence-level embeddings. The findings demonstrate the effectiveness of the proposed techniques and help to set a baseline for the contributions that follow.
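
As a rough illustration of this first step, the sketch below extracts contextual word embeddings from a pre-trained BERT model, which could then replace embeddings learned from scratch in a text-to-gloss encoder. The library and checkpoint (Hugging Face transformers, bert-base-cased) are assumptions made for the example, not necessarily those used in the thesis.

    # Illustrative only: contextual word embeddings as translation-model inputs.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    bert = AutoModel.from_pretrained("bert-base-cased")

    sentence = "the weather will be sunny tomorrow"
    encoded = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        outputs = bert(**encoded)

    # One contextual vector per sub-word token: (batch, seq_len, hidden_size).
    # These can be fed to a text-to-gloss encoder in place of embeddings
    # learned from scratch on a small sign language corpus; mean-pooling
    # them gives a simple sentence-level embedding.
    word_embeddings = outputs.last_hidden_state
    sentence_embedding = word_embeddings.mean(dim=1)
    print(word_embeddings.shape, sentence_embedding.shape)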

The second contribution introduces a novel approach to Text-to-Gloss (T2G) translation, called Select and Reorder (S&R). This approach breaks down the translation process into two distinct steps: Gloss Selection (GS) and Gloss Reordering (GR). The GS model is responsible for selecting the most appropriate glosses for a given spoken language sentence, while the GR model reorders the glosses to create a more natural sign language sequence. This disentanglement of tasks allows each model to specialize in a specific aspect of the translation. For this task, we create two new representations: Spoken Language Order (SPO) gloss and Sign Language Order (SIO) text. To build these representations, we once again leverage pre-trained language models to construct an alignment between text and gloss. Through this, we show significant improvements in our metrics while also reducing the computational cost of the translation pipeline.
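
The decomposition can be illustrated with a deliberately simplified sketch, in which a hand-written toy lexicon and a fixed reordering rule stand in for the learned GS and GR models; the vocabulary and ordering rule below are invented purely for illustration.

    # Toy sketch of the Select & Reorder decomposition (illustrative stand-ins
    # for the learned Gloss Selection and Gloss Reordering models).
    TOY_LEXICON = {"tomorrow": "TOMORROW", "weather": "WEATHER", "sunny": "SUNNY"}

    def gloss_selection(sentence: str) -> list[str]:
        # Keep only words that map to a gloss, in Spoken Language Order (SPO).
        words = sentence.lower().replace(".", "").split()
        return [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]

    def gloss_reordering(glosses: list[str]) -> list[str]:
        # Permute the selected glosses into Sign Language Order (SIO);
        # a fixed time-topic-first rule stands in for the learned model.
        time_glosses = [g for g in glosses if g == "TOMORROW"]
        rest = [g for g in glosses if g != "TOMORROW"]
        return time_glosses + rest

    spo = gloss_selection("The weather will be sunny tomorrow.")
    sio = gloss_reordering(spo)
    print(spo)  # ['WEATHER', 'SUNNY', 'TOMORROW']
    print(sio)  # ['TOMORROW', 'WEATHER', 'SUNNY']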

Moving to the next stage of the pipeline, the third contribution addresses a problem that has plagued previous works: regression to the mean, which leads to under-articulated and incomprehensible signing. It uses dictionary examples and a learned codebook of facial expressions to create expressive sign language sequences. However, simply concatenating signs and adding a face creates robotic and unnatural sequences. To address this, we present a novel 7-step approach to stitching signs together. First, by normalizing each sign into a canonical pose, cropping, and stitching, we create a continuous sequence. Then, by applying filtering in the frequency domain and resampling each sign, we create cohesive, natural sequences that mimic the prosody found in the original data. We leverage a SignGAN model to map the output to a photo-realistic signer, presenting a complete SLP pipeline.
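
A simplified sketch of the stitching idea on skeleton pose sequences is given below; it covers only the normalisation, concatenation, frequency-domain filtering, and resampling steps, and the joint layout, cut-off frequency, and frame rate are assumptions made for the example.

    # Simplified sketch of stitching dictionary signs into one smooth sequence.
    import numpy as np
    from scipy.signal import butter, filtfilt, resample

    def normalise(sign: np.ndarray, root_joint: int = 0) -> np.ndarray:
        # Translate every frame so a chosen root joint sits at the origin,
        # putting each isolated sign into a shared canonical pose space.
        return sign - sign[:, root_joint:root_joint + 1, :]

    def stitch(signs: list[np.ndarray], out_len: int = 120,
               cutoff_hz: float = 4.0, fps: float = 25.0) -> np.ndarray:
        # Normalise and concatenate the isolated signs into one sequence.
        seq = np.concatenate([normalise(s) for s in signs], axis=0)
        frames, joints, dims = seq.shape

        # Low-pass filter each joint coordinate over time to smooth the
        # abrupt transitions introduced by naive concatenation.
        b, a = butter(4, cutoff_hz / (fps / 2.0), btype="low")
        smoothed = filtfilt(b, a, seq.reshape(frames, joints * dims), axis=0)

        # Resample to the target length so the overall timing can be matched
        # to the prosody observed in continuous signing data.
        return resample(smoothed, out_len, axis=0).reshape(out_len, joints, dims)

    # Usage with two dummy dictionary signs of shape (frames, joints, 3).
    sign_a, sign_b = np.random.randn(40, 21, 3), np.random.randn(55, 21, 3)
    print(stitch([sign_a, sign_b]).shape)  # (120, 21, 3)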

Finally, the thesis explores the creation of a data-driven representation of sign language as a substitute for resource-intensive linguistic annotations. This involves employing a vector quantization technique to learn a lexicon of motions that can be assembled into natural and meaningful sign language sequences. The lexicon can be directly mapped to a sequence of poses, allowing the translation to be performed by a single network. Furthermore, by leveraging the limited linguistic annotations that are available, we can enhance the representation by imposing additional constraints on the model during training. The development of a data-driven representation offers a potential route to overcoming the reliance on linguistic annotation. The results show that the learned representation is a suitable replacement for gloss, outperforming previous methods.
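
A minimal sketch of the vector quantization step is shown below, written in the style of a VQ-VAE codebook lookup with a straight-through gradient estimator; the codebook size and feature dimensions are illustrative assumptions rather than the settings used in the thesis.

    # Minimal VQ-style codebook ("motion lexicon") lookup with a
    # straight-through estimator; sizes are illustrative.
    import torch
    import torch.nn as nn

    class MotionCodebook(nn.Module):
        def __init__(self, num_codes: int = 512, dim: int = 256):
            super().__init__()
            self.codes = nn.Embedding(num_codes, dim)  # the learned lexicon

        def forward(self, z: torch.Tensor):
            # z: (batch, time, dim) continuous motion features from an encoder.
            flat = z.reshape(-1, z.size(-1))
            dists = torch.cdist(flat, self.codes.weight)    # distance to every code
            ids = dists.argmin(dim=-1).view(z.shape[:-1])   # nearest entry per frame
            quantised = self.codes(ids)                     # snap to the lexicon
            # Straight-through estimator: gradients reach the encoder as if
            # quantisation were the identity mapping.
            return z + (quantised - z).detach(), ids

    codebook = MotionCodebook()
    features = torch.randn(2, 50, 256)       # e.g. 50 frames of motion features
    quantised, ids = codebook(features)
    print(quantised.shape, ids.shape)        # (2, 50, 256) and (2, 50)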