Publications

Mohamed Ilyes Lakhal, Richard Bowden (2024) Diversity-Aware Sign Language Production Through a Pose Encoding Variational Autoencoder, In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pp. 1-10, IEEE

This paper addresses the problem of diversity-aware sign language production: given an image (or sequence) of a signer, produce another image with the same pose but different attributes (e.g. gender, skin color). To this end, we extend the variational inference paradigm to include information about the pose and the conditioning of the attributes. This formulation improves the quality of the synthesised images. The generator is a UNet architecture that ensures spatial preservation of the input pose, and we include the visual features from the variational inference to maintain control over appearance and style. Each body part is generated with a separate decoder, which allows the generator to deliver better overall results. Experiments on the SMILE II dataset show that the proposed model performs quantitatively better than state-of-the-art baselines regarding diversity, per-pixel image quality, and pose estimation. Qualitatively, it faithfully reproduces non-manual features for signers.
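As a rough illustration of the pose-conditioned variational inference described above, the sketch below shows how a latent appearance code and a pose encoding can be combined so that sampling the latent varies attributes while the pose stays fixed. It is not the paper's implementation; the module names, feature dimensions, and simple MLP encoder/decoder are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PoseConditionedVAE(nn.Module):
    """Minimal sketch: a VAE whose latent captures appearance attributes and
    whose decoder is conditioned on a pose encoding (illustrative, not the
    paper's architecture)."""

    def __init__(self, feat_dim=256, pose_dim=128, latent_dim=64):
        super().__init__()
        # Encoder maps appearance features (+ pose) to a Gaussian posterior.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),          # mean and log-variance
        )
        # Decoder reconstructs appearance features from latent + pose.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, appearance, pose):
        stats = self.encoder(torch.cat([appearance, pose], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        recon = self.decoder(torch.cat([z, pose], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

# At test time, sampling z ~ N(0, I) with a fixed pose yields
# appearance-diverse outputs for the same sign pose.
model = PoseConditionedVAE()
appearance, pose = torch.randn(4, 256), torch.randn(4, 128)
recon, kl = model(appearance, pose)
```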

Mohamed Ilyes Lakhal, Richard Bowden (2025) GaussianGAN: Real-Time Photorealistic Controllable Human Avatars, In: 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG)

Photorealistic and controllable human avatars have gained popularity in the research community thanks to rapid advances in neural rendering, which provide fast and realistic synthesis tools. However, a limitation of current solutions is the presence of noticeable blurring. To solve this problem, we propose GaussianGAN, an animatable avatar approach for photorealistic rendering of people in real time. We introduce a novel Gaussian splatting densification strategy that builds Gaussian points on the surface of cylindrical structures around estimated skeletal limbs. Given the camera calibration, we render an accurate semantic segmentation with our novel view segmentation module. Finally, a UNet generator combines the rendered Gaussian splatting features and the segmentation maps to create photorealistic digital avatars. Our method runs in real time with a rendering speed of 79 FPS. It outperforms previous methods in visual perception and quality, achieving state-of-the-art pixel fidelity of 32.94 dB on the ZJU Mocap dataset and 33.39 dB on the Thuman4 dataset.
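A minimal sketch of the cylinder-based densification idea follows, assuming a single bone given by two 3D joint positions; the function name, radius, and sample counts are illustrative placeholders rather than the paper's settings.

```python
import numpy as np

def cylinder_points(joint_a, joint_b, radius=0.05, n_rings=8, n_per_ring=16):
    """Illustrative sketch: sample candidate Gaussian centres on the surface
    of a cylinder wrapped around one skeletal limb (bone joint_a -> joint_b)."""
    axis = joint_b - joint_a
    length = np.linalg.norm(axis)
    axis = axis / length
    # Build an orthonormal frame (u, v) perpendicular to the bone axis.
    helper = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, helper); u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    points = []
    for t in np.linspace(0.0, 1.0, n_rings):            # steps along the bone
        centre = joint_a + t * length * axis
        for theta in np.linspace(0.0, 2 * np.pi, n_per_ring, endpoint=False):
            points.append(centre + radius * (np.cos(theta) * u + np.sin(theta) * v))
    return np.stack(points)                              # (n_rings * n_per_ring, 3)

# Example: densify around a single limb defined by two 3D joints.
pts = cylinder_points(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.3, 0.0]))
print(pts.shape)  # (128, 3)
```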

Sobhan Asasi, Mohamed Ilyes Lakhal, Richard Bowden (2025) Hierarchical Feature Alignment for Gloss-Free Sign Language Translation, In: IVA '25: Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, Association for Computing Machinery (ACM)

Sign Language Translation (SLT) aims to convert sign language videos into spoken sentences. However, many existing methods struggle with the disparity between visual and textual representations during end-to-end learning. Gloss-based approaches help to bridge this gap by leveraging structured linguistic information, while gloss-free methods offer greater flexibility and remove the annotation burden but require effective alignment strategies. Recent advances in Large Language Models (LLMs) have enabled gloss-free SLT by generating text-like representations from sign videos. In this work, we introduce a novel hierarchical pre-training strategy inspired by the structure of sign language, incorporating pseudo-glosses and contrastive video-language alignment. Our method hierarchically extracts features at the frame, segment, and video levels and aligns them with pseudo-glosses and the spoken sentence to enhance translation quality. Experiments demonstrate that our approach improves BLEU-4 and ROUGE scores while maintaining efficiency.
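The contrastive video-language alignment component can be illustrated with a standard symmetric InfoNCE-style objective. The sketch below is an assumption about the general form of such a loss, not the paper's exact formulation; the embedding dimensions and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Hedged sketch: a symmetric InfoNCE loss that pulls each video-level
    feature towards its paired sentence (or pseudo-gloss) embedding and
    pushes apart the other pairs in the batch."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0))            # matching pairs on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Usage with a batch of 8 paired video / sentence embeddings of dimension 512:
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```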

Harry Thomas Walsh, Edward Fish, Ozge Mercanoglu Sincan, Mohamed Ilyes Lakhal, Richard Bowden (2025) SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work, In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR 2025), Institute of Electrical and Electronics Engineers (IEEE)

Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition's aim is to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language (Deutsche Gebärdensprache, DGS) weather broadcast dataset, and we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving a BLEU-1 score of 31.40 and a DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints, establishing a consistent baseline for the SLP field that will enable future researchers to compare their work against a broader range of methods.
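For readers unfamiliar with the reported metric, the following is a simplified, illustrative computation of DTW-MJE (dynamic time warping over per-frame mean joint error) between a predicted and a reference skeleton sequence. The official challenge evaluation code defines the exact normalisation and joint set; this sketch only conveys the general idea.

```python
import numpy as np

def dtw_mje(pred, ref):
    """Illustrative DTW-MJE sketch: dynamic time warping over per-frame mean
    joint (Euclidean) error between two pose sequences of shape (frames, joints, 3)."""
    T1, T2 = len(pred), len(ref)
    # Pairwise cost: mean Euclidean distance over joints for each frame pair.
    cost = np.array([[np.linalg.norm(p - r, axis=-1).mean() for r in ref] for p in pred])
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[T1, T2] / max(T1, T2)   # length-normalised warping cost (assumption)

# Example with random sequences of 50 joints in 3D:
score = dtw_mje(np.random.rand(40, 50, 3), np.random.rand(45, 50, 3))
```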