11am - 12 noon
Friday 4 July 2025
Audio-Visual Speaker Localization and Tracking
PhD Viva Open Presentation - Jinzheng Zhao
Online event - All Welcome!
Free
Abstract:
In recent years, audio-visual speaker tracking has gained significant attention due to its academic importance and broad range of applications. The integration of audio and visual modalities offers complementary information for effective localization and tracking. By combining audio and visual data as measurements, Bayesian filters can address challenges such as data association, audio-visual fusion, and track management. This thesis explores audio-visual speaker tracking by focusing on three key aspects.
First, we focus on general scenarios where the sensors (e.g., cameras and microphone arrays) are fixed. For extracting visual measurements, face detection is a commonly used and powerful tool for obtaining speakers' positions. For extracting audio measurements, the global coherence field (GCF) is a popular choice due to its robustness against noise and reverberation, but its performance degrades as the number of speakers increases. To mitigate this problem, we first propose a separation-before-localization method that improves the performance of GCF for multi-speaker localization. In a second step, to accurately estimate the number of speakers and compute their positions, we fuse the audio measurements from GCF with the visual measurements and employ the Poisson Multi-Bernoulli Mixture (PMBM) filter for multi-speaker tracking. The proposed method achieves state-of-the-art performance on the AV16.3 dataset.
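For readers unfamiliar with the GCF, the sketch below is a minimal illustration of the underlying idea rather than the implementation described in the thesis: GCC-PHAT correlations between microphone pairs are averaged at the time delays implied by each candidate source position, and peaks in the resulting acoustic map indicate likely speaker locations. All function names and simplifications here are our own assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gcc_phat(x, y, n_fft):
    """GCC-PHAT cross-correlation of two microphone signals (zero lag at index n_fft // 2)."""
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12          # phase transform (PHAT) weighting
    cc = np.fft.irfft(cross, n=n_fft)
    return np.concatenate((cc[-n_fft // 2:], cc[:n_fft // 2]))

def gcf_map(signals, mic_pos, candidates, fs):
    """Score each candidate position by the average GCC-PHAT value at its implied TDOAs.

    signals:    (n_mics, n_samples) time-domain frames
    mic_pos:    (n_mics, 3) microphone coordinates in metres
    candidates: (n_candidates, 3) grid of candidate source positions
    """
    n_mics, n_samples = signals.shape
    n_fft = 2 * n_samples
    pairs = [(i, j) for i in range(n_mics) for j in range(i + 1, n_mics)]
    ccs = {p: gcc_phat(signals[p[0]], signals[p[1]], n_fft) for p in pairs}
    scores = np.zeros(len(candidates))
    for k, c in enumerate(candidates):
        total = 0.0
        for i, j in pairs:
            # time-difference of arrival (seconds) expected if the source were at c
            # (sign convention simplified for illustration)
            tdoa = (np.linalg.norm(c - mic_pos[i]) - np.linalg.norm(c - mic_pos[j])) / SPEED_OF_SOUND
            total += ccs[(i, j)][n_fft // 2 + int(round(tdoa * fs))]
        scores[k] = total / len(pairs)
    return scores   # peaks of this acoustic map indicate likely speaker positions
```

The separation-before-localization and PMBM stages described above then operate on measurements derived from maps of this kind together with the visual detections.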
In the second stage, we combine the traditional Bayesian filter with deep learning methods. With the help of deep learning, Bayesian filters can be designed in an end-to-end way rather than as a two-stage process. Specifically, we propose an end-to-end differentiable particle filter that combines the particle filter with the Transformer. The traditional particle filter is not end-to-end: feature extraction must be completed before filtering, and the transition and update processes are often pre-defined, making it difficult to adapt them to varied scenarios. We mitigate this problem by using self-attention to model the transition process and cross-attention to model the update process, so that both processes can adapt to different scenarios and the feature extraction and Bayesian filtering parts can be optimized jointly. Experimental results on the simulated Two-Ears dataset and the real AVRI dataset show that the proposed model is superior to the baselines in terms of tracking accuracy, robustness to noise, and long-term modeling.
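To make the idea of attention-based transition and update concrete, here is a minimal PyTorch-style sketch of one filtering step under our own assumptions (module names, dimensions, and the residual and weighting choices are illustrative, not the thesis architecture): self-attention lets particles interact to predict their next states, while cross-attention against observation features produces differentiable particle weights.

```python
import torch
import torch.nn as nn

class AttentionParticleFilterStep(nn.Module):
    """One predict/update step of a differentiable particle filter, where
    self-attention over particles plays the role of the transition model and
    cross-attention to observation features plays the role of the update."""

    def __init__(self, state_dim=4, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        self.transition = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.update = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_state = nn.Linear(d_model, state_dim)
        self.to_weight = nn.Linear(d_model, 1)

    def forward(self, particles, obs_feats):
        # particles: (batch, n_particles, state_dim)
        # obs_feats: (batch, n_obs_tokens, d_model)  audio-visual observation features
        h = self.embed(particles)
        # "transition": particles exchange information via self-attention
        h_pred, _ = self.transition(h, h, h)
        pred_particles = particles + self.to_state(h_pred)   # residual motion update
        # "update": predicted particles query the observation features
        h_upd, _ = self.update(h_pred, obs_feats, obs_feats)
        log_w = self.to_weight(h_upd).squeeze(-1)             # (batch, n_particles)
        weights = torch.softmax(log_w, dim=-1)                # differentiable weighting
        estimate = (weights.unsqueeze(-1) * pred_particles).sum(dim=1)
        return pred_particles, weights, estimate
```

Because every operation above is differentiable, the feature extractor producing obs_feats and the filtering step itself can be trained together with a single tracking loss.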
In the third stage, we focus on speaker localization in the egocentric scenario, a new task that has not yet been fully explored. The egocentric scenario poses challenges such as motion blur, surrounding noise, and frequent disappearance of the speaker from view, and the available data for this task is limited. First, to mitigate the data scarcity problem, we create a dataset using Unity that simulates a wearer walking and recording video while a speaker walks randomly. Then, to address the frequent disappearance of the speaker, we propose an attention-based audio-visual fusion model, which outperforms previous methods in localization accuracy. We further extend our model to the realistic EasyCom dataset, where experimental results show superior performance on the speaker localization and wearer activity detection tasks.
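As a loose illustration of attention-based audio-visual fusion (again our own sketch, not the model proposed in the thesis), visual tokens can query audio tokens via cross-attention, so that when the speaker leaves the field of view the audio stream still drives the location estimate; the module names, dimensions, and output head below are assumed.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Minimal cross-modal attention fusion: visual tokens (e.g. patch features
    from an egocentric frame) query audio tokens (e.g. per-frame spectrogram
    embeddings), so audio cues can dominate when the speaker is out of view."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, 2)    # 2-D speaker location on the image plane (assumed output)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (batch, n_patches, d_model), audio_tokens: (batch, n_frames, d_model)
        fused, _ = self.cross_attn(visual_tokens, audio_tokens, audio_tokens)
        fused = self.norm(visual_tokens + fused)   # residual, audio-conditioned visual features
        pooled = fused.mean(dim=1)                 # simple pooling over tokens
        return self.head(pooled)                   # estimated speaker location
```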
In summary, we focus on audio-visual speaker tracking in both general and egocentric scenarios. In the general scenario, where the cameras and microphone arrays are fixed, we first improve the performance of the statistics-based GCF method in multi-speaker scenarios and propose AV-PMBM for multi-speaker tracking. We then propose a differentiable particle filter that combines a statistics-based method with a learning-based one, leveraging the strengths of both the particle filter and the Transformer. For the egocentric scenario, we first propose a new dataset that covers frequent speaker disappearance, and then a Transformer-based audio-visual fusion method for speaker localization that mitigates this problem. Experimental results on both simulated and real datasets show the effectiveness of our methods.