University roles and responsibilities
- PGR Student Rep
Demonstrated for the following modules:
- Computer Vision & Graphics (EEE2041)
- AR, VR AND THE METAVERSE (EEEM067)
We aim to estimate the pose of dogs from videos using a temporal deep learning model as this can result in more accurate pose predictions when temporary occlusions or substantial movements occur. Generally, deep learning models require a lot of data to perform well. To our knowledge, public pose datasets containing videos of dogs are non existent. To solve this problem, and avoid manually labelling videos as it can take a lot of time, we generated a synthetic dataset containing 500 videos of dogs performing different actions using Unity3D. Diversity is achieved by randomising parameters such as lighting, backgrounds, camera parameters and the dog’s appearance and pose. We evaluate the quality of our synthetic dataset by assessing the model’s capacity to generalise to real data. Usually, networks trained on synthetic data perform poorly when evaluated on real data, this is due to the domain gap. As there was still a domain gap after improving the quality of the synthetic dataset and inserting diversity, we bridged the domain gap by applying 2 different methods: fine-tuning and using a mixed dataset to train the network. Additionally, we compare the model pre-trained on synthetic data with models pre-trained on a real-world animal pose datasets. We demonstrate that using the synthetic dataset is beneficial for training models with (small) real-world datasets. Furthermore, we show that pre-training the model with the synthetic dataset is the go to choice rather than pre-training on real-world datasets for solving the pose estimation task from videos of dogs.
We propose an approach to automatically extract the 3D pose of dogs from single-view RGB images using only synthetic data for training. Due to the lack of suitable 3D datasets, previous approaches have predominantly relied on 2D weakly supervised methods. While these approaches demonstrate promising results, some depth ambiguities still persist indicating the neural network's limited understanding of the 3D environment. To tackle these depth ambiguities, we generate a synthetic 3D pose dataset (DigiDogs) by modifying the popular video game Grand Theft Auto. Additionally, to address the domain gap between synthetic and real data, we harness the power of Meta's foundation model DINOv2 due to its generalisation capability and fine-tune it for the application of 3D pose estimation. Through a combination of qualitative and quantitative analyses, we demonstrate the viability of estimating the 3D pose of dogs from real-world images using synthetic training data.
Estimating the pose of animals can facilitate the understanding of animal motion which is fundamental in disciplines such as biomechanics, neuroscience, ethology, robotics and the entertainment industry. Human pose estimation models have achieved high performance due to the huge amount of training data available. Achieving the same results for animal pose estimation is challenging due to the lack of animal pose datasets. To address this problem we introduce SyDog: a synthetic dataset of dogs containing ground truth pose and bounding box coordinates which was generated using the game engine, Unity. We demonstrate that pose estimation models trained on SyDog achieve better performance than models trained purely on real data and significantly reduce the need for the labour intensive labelling of images. We release the SyDog dataset as a training and evaluation benchmark for research in animal motion.