11am - 12 noon

Tuesday 16 December 2025

Improving 3D Human Pose Estimation and Rendering for Sign Language

PhD Viva Open Presentation - Maxim Ivashechkin

Hybrid Meeting (21BA02 & Teams) - All Welcome!

Free

21BA02 - Arthur C Clarke building
University of Surrey
Guildford
Surrey
GU2 7XH

Improving 3D Human Pose Estimation and Rendering for Sign Language

Abstract:
Accurately estimating and reconstructing 3D human poses in images and video remains a highly challenging task due to limited image resolution, occlusions, motion blur, data scarcity, and other factors. In the context of sign language, the problem becomes even more complex due to high articulation and frequent hand-to-hand and hand-to-face interactions. State-of-the-art methods are often trained on benchmark datasets featuring simple motions, such as walking, which can lead to overfitting and poor generalization to more complex gestures. Moreover, existing 3D reconstruction and rendering methods tend to produce noisy or physically implausible results, contributing to the uncanny valley effect. While some recent approaches leverage more expressive datasets, these still fall short of capturing the intricacies of sign language articulation. Given the critical role of sign language as a primary mode of communication for millions, advancing 3D modelling in this domain holds significant importance. This thesis addresses key challenges through multiple contributions.

Our primary objective for human reconstruction is to estimate the 3D structure from a single image, while leveraging multiview data for regularization during training. We begin by introducing a 3D skeletal model constrained by physically plausible parameters, such as limited degrees of freedom for human joints, angular constraints, and body symmetry, to ensure realism. The model uses forward kinematics to derive 3D poses from joint angles and bone lengths. Conditioned solely on a 2D skeleton, our approach generalizes well to unseen data and produces realistic 3D reconstructions.
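To make the forward-kinematics idea concrete, the sketch below unrolls a single kinematic chain from per-joint Euler angles and bone lengths, clamping the angles to plausible ranges before composing rotations. It is an illustrative simplification, not the thesis implementation; the joint ordering, axis conventions, and angle limits are assumptions.

```python
# Illustrative sketch (not the thesis code): forward kinematics for one
# kinematic chain, mapping per-joint rotations and bone lengths to 3D joint
# positions. Axis conventions and clamping ranges are hypothetical.
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Compose a rotation from Euler angles (radians), XYZ order."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def forward_kinematics(angles, bone_lengths, angle_limits):
    """angles: (J, 3) Euler angles per joint; bone_lengths: (J,) in metres;
    angle_limits: list of (lo, hi) per joint.

    Angles are clamped to plausible ranges (angular constraints) before the
    chain is unrolled, so the output pose stays physically valid.
    """
    positions = [np.zeros(3)]           # root joint at the origin
    R = np.eye(3)                       # accumulated orientation
    for a, length, (lo, hi) in zip(angles, bone_lengths, angle_limits):
        a = np.clip(a, lo, hi)          # enforce limited degrees of freedom
        R = R @ rotation_matrix(*a)     # rotate relative to the parent joint
        offset = R @ np.array([0.0, 0.0, length])  # bone along the local z-axis
        positions.append(positions[-1] + offset)
    return np.stack(positions)          # (J + 1, 3) joint positions
```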

A key limitation of this pipeline is its dependence on off-the-shelf 2D detectors, whose failures and inaccuracies, especially under occlusion or motion blur, propagate to the final output. To address this, we integrate an image-based model that extracts hand-centric features from cropped regions, which are then used to drive the 3D skeletal estimation. Temporal consistency is further enhanced using a temporal transformer that refines sequences of 3D hand poses. To improve per-joint accuracy over the baseline, we also explore a diffusion-based model; however, diffusion models commonly suffer from long inference times, creating a bottleneck for real-time systems. Despite this, our baseline model generalizes effectively to unseen sign language datasets.
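The sketch below illustrates what such a temporal refinement stage might look like: a transformer encoder attends across a window of per-frame 3D hand poses and predicts residual corrections. The layer sizes, joint count, and residual formulation are assumptions made for exposition, not the architecture used in the thesis.

```python
# Minimal sketch of temporal refinement with a transformer: a sequence of
# per-frame 3D hand poses is linearly embedded, passed through a transformer
# encoder, and decoded back to refined poses. Dimensions are illustrative.
import torch
import torch.nn as nn

class TemporalPoseRefiner(nn.Module):
    def __init__(self, num_joints=21, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(num_joints * 3, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decode = nn.Linear(d_model, num_joints * 3)

    def forward(self, poses):                 # poses: (B, T, J, 3)
        B, T, J, _ = poses.shape
        x = self.embed(poses.reshape(B, T, J * 3))
        x = self.encoder(x)                   # attend across the T frames
        delta = self.decode(x).reshape(B, T, J, 3)
        return poses + delta                  # residual refinement

# Example: refine a batch of 2 clips, 16 frames each, 21 hand joints.
refined = TemporalPoseRefiner()(torch.randn(2, 16, 21, 3))
```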

In sign language, precise hand interactions such as finger spelling are critical, where even minor finger misalignments can alter the meaning of a sign. Most state-of-the-art methods rely on implicit modelling of hand interactions, which can still result in unrealistic intersections between hands. Instead, we propose an explicit strategy that minimizes the probability of such intersections. These probabilities are estimated using an occupancy network conditioned on a 3D skeleton. We develop a mesh parameterization over the 3D hand kinematic model to train the occupancy network. An intersection loss is then introduced, computed by applying the occupancy network to pairs of interacting hand skeletons. Training the 3D hand model jointly with this loss significantly reduces hand intersections while maintaining or even improving per-joint accuracy.
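A minimal sketch of the idea, assuming a skeleton-conditioned occupancy network and using the other hand's joints as query points, is shown below; the network interface and loss form are illustrative, not the exact formulation in the thesis.

```python
# Illustrative sketch of an intersection penalty built on a skeleton-conditioned
# occupancy network. The architecture and the choice of query points
# (the other hand's joints) are simplifying assumptions for exposition.
import torch
import torch.nn as nn

class HandOccupancy(nn.Module):
    """Predicts P(point lies inside the hand | 3D hand skeleton)."""
    def __init__(self, num_joints=21, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, skeleton, points):        # (B, J, 3), (B, N, 3)
        B, N, _ = points.shape
        cond = skeleton.flatten(1).unsqueeze(1).expand(-1, N, -1)
        logits = self.net(torch.cat([cond, points], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)   # (B, N) occupancy prob.

def intersection_loss(occ, skel_left, skel_right):
    """Penalise points of one hand that the other hand claims as occupied."""
    p_right_in_left = occ(skel_left, skel_right)   # right joints inside left hand
    p_left_in_right = occ(skel_right, skel_left)   # left joints inside right hand
    return p_right_in_left.mean() + p_left_in_right.mean()
```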

The developed occupancy model exhibits strong generalization when conditioned on 3D skeletal inputs and produces accurate volumetric hand geometry. We leverage this to train a neural radiance field (NeRF) model for rendering 3D hands. The continuous probability map from the occupancy network allows extraction of a hand surface point cloud, which serves as input to the NeRF. Additionally, the NeRF is conditioned on occupancy features and point probabilities, where the latter help to indicate hand intersections. To further guide appearance modelling, we incorporate embeddings from a convolutional variational autoencoder trained on hand images. Our method achieves state-of-the-art results without relying on explicit MANO meshes.
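The sketch below shows one way such conditioning could be wired: the radiance MLP takes the positionally encoded sample alongside occupancy features, the point's occupancy probability, and a VAE appearance embedding, and outputs colour and density. The feature dimensions and network depth are assumptions for illustration.

```python
# Sketch of a conditioned radiance field: inputs are concatenated and mapped
# to RGB and density by a small MLP. Dimensions are assumed for illustration.
import torch
import torch.nn as nn

class ConditionedRadianceField(nn.Module):
    def __init__(self, pos_dim=63, occ_feat_dim=64, app_dim=32, hidden=256):
        super().__init__()
        in_dim = pos_dim + occ_feat_dim + 1 + app_dim   # +1 for the probability
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                       # RGB + density

    def forward(self, pos_enc, occ_feat, occ_prob, app_embed):
        # pos_enc: (..., 63), occ_feat: (..., 64), occ_prob: (..., 1), app_embed: (..., 32)
        x = torch.cat([pos_enc, occ_feat, occ_prob, app_embed], dim=-1)
        rgb_sigma = self.mlp(x)
        rgb = torch.sigmoid(rgb_sigma[..., :3])         # colour in [0, 1]
        sigma = torch.relu(rgb_sigma[..., 3:])          # non-negative density
        return rgb, sigma
```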

Although NeRF models provide strong generalization to novel views and poses, their computational demands limit scalability, particularly for real-time full-body rendering. The emergence of 3D Gaussian Splatting (3DGS) for rasterized rendering presents an efficient alternative. We integrate SMPL-X human priors and attach 3DGS primitives to mesh vertices, applying constraints and regularization based on physical human properties to prevent overfitting to training views and poses. The proposed method delivers high-quality rendering and introduces a novel sign language gloss stitching technique in 3D space. By interpolating SMPL-X parameters between two keyframes, we improve articulation and human motion and eliminate artifacts typically found in state-of-the-art 2D GAN-based approaches.
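As an illustration of the stitching step, the sketch below interpolates SMPL-X body-pose rotations between two boundary keyframes with quaternion slerp and linearly interpolates the translation. The parameter layout is an assumption for exposition and does not reproduce the full SMPL-X parameter set.

```python
# Minimal sketch of stitching two glosses by interpolating pose parameters
# between their boundary keyframes. The dictionary layout is assumed;
# only body pose (axis-angle) and translation are shown.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def stitch_keyframes(params_a, params_b, num_frames=10):
    """params_*: dict with 'body_pose' (J, 3) axis-angle and 'transl' (3,)."""
    ts = np.linspace(0.0, 1.0, num_frames)
    pose_a, pose_b = params_a["body_pose"], params_b["body_pose"]
    # Slerp each joint rotation between the two keyframes.
    joint_tracks = []
    for ra, rb in zip(pose_a, pose_b):
        key_rots = Rotation.from_rotvec(np.stack([ra, rb]))
        joint_tracks.append(Slerp([0.0, 1.0], key_rots)(ts).as_rotvec())
    body_pose = np.stack(joint_tracks, axis=1)          # (T, J, 3)
    # Linear interpolation suffices for the translation parameters.
    transl = (1 - ts)[:, None] * params_a["transl"] + ts[:, None] * params_b["transl"]
    return {"body_pose": body_pose, "transl": transl}
```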

The 3DGS framework enables fast training and fitting to multiview imagery. However, such multiview training data is costly to collect and often restricted in its usability. To address this, we aim to generate diverse human avatars from text prompts, thereby enriching signer representation in sign language production. We propose a weakly-supervised pipeline that first generates synthetic human images via a text-conditioned diffusion model. Then, we reconstruct 3D appearance from a single image using a transformer-based model regularized with a human prior. Finally, we close the loop with a transformer-based diffusion model that synthesizes full 3D human appearances directly from text. Compared with state-of-the-art approaches for 3D human generation, our method achieves better text-prompt alignment and realism.
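The pipeline can be summarized at a high level as three pluggable stages, sketched below with placeholder callables; the names and interfaces are invented purely for illustration and do not correspond to a released implementation.

```python
# High-level sketch of the three-stage text-to-avatar pipeline described above.
# The three callables are placeholders for the models mentioned in the abstract;
# their names and signatures are hypothetical.
from typing import Any, Callable

def build_text_to_avatar_pipeline(
    text_to_image: Callable[[str], Any],   # stage 1: text-conditioned image diffusion
    image_to_3d: Callable[[Any], Any],     # stage 2: single-image 3D reconstruction with a human prior
    text_to_3d: Callable[[str], Any],      # stage 3: direct text-to-3D diffusion model
):
    def generate(prompt: str, use_direct_model: bool = True):
        if use_direct_model:
            return text_to_3d(prompt)      # one-shot generation once stage 3 is trained
        image = text_to_image(prompt)      # weakly-supervised path via a synthetic image
        return image_to_3d(image)
    return generate
```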