
Jinzheng Zhao
Academic and research departments
Department of Electrical and Electronic Engineering, Centre for Vision, Speech and Signal Processing (CVSSP)

About
My research project
Audio-Visual Multi-speaker Tracking

Automatic tracking of multiple speakers has many applications, for example in security surveillance, human-machine interaction, and robotics. Different sensors (such as microphones and cameras) have been used jointly to track multiple speakers in cluttered environments with a number of moving speakers and background noise.
There are, however, a number of challenges associated with this problem. For example, how can the unknown and time-varying number of speakers be estimated? How should the uncertainties associated with the audio-visual measurements, such as false detections, missed detections, noise, and clutter, be handled?
The aim of my project is to develop novel ideas to address these challenges by building on a recent baseline developed in the Centre for Vision, Speech and Signal Processing at the University of Surrey, namely the particle flow probability hypothesis density (PHD) filtering algorithms, which fuse the audio-visual measurements and estimate the time-varying number of targets in the presence of measurement uncertainties.
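For readers unfamiliar with PHD filtering, the sketch below shows a generic particle-based PHD measurement update in NumPy. It is only a simplified illustration, not the particle flow PHD algorithm used as the baseline, and the function and variable names are hypothetical.

```python
import numpy as np

def phd_update(particles, weights, measurements, likelihood,
               p_detect=0.9, clutter_density=1e-3):
    """One measurement-update step of a generic particle PHD filter.

    particles    : (N, d) array of particle states (e.g. 2D position + velocity)
    weights      : (N,) array; their sum approximates the expected target count
    measurements : list of audio-visual measurement vectors
    likelihood   : callable g(z, particles) -> (N,) array of likelihoods
    """
    # Missed-detection term: every particle keeps a (1 - p_D) share of its weight.
    new_weights = (1.0 - p_detect) * weights

    # Detection terms: one normalised contribution per measurement.
    for z in measurements:
        g = likelihood(z, particles)          # g(z | x_i) for each particle
        num = p_detect * g * weights
        new_weights += num / (clutter_density + num.sum())

    # The integral of the PHD (sum of weights) estimates the number of speakers.
    est_num_targets = new_weights.sum()
    return new_weights, est_num_targets
```

In a full tracker this update would be preceded by a prediction step (survival and birth particles propagated through a motion model) and followed by resampling and clustering to extract speaker positions.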
Publications
Audio and visual signals can be used jointly to provide complementary information for multi-speaker tracking. Face detectors and color histograms can provide visual measurements, while Direction of Arrival (DOA) lines and global coherence field (GCF) maps can provide audio measurements. GCF, as a traditional sound source localization method, has been widely used to provide audio measurements in audio-visual speaker tracking by estimating the positions of speakers. However, GCF cannot directly deal with scenarios of multiple speakers due to the emergence of spurious peaks on the GCF map, making it difficult to find the non-dominant speakers. To overcome this limitation, we propose a phase-aware VoiceFilter and a separation-before-localization method, which enables the audio mixture to be separated into individual speech sources while retaining their phases. This allows us to calculate the GCF map for multiple speakers, thereby estimating their positions accurately and concurrently. Based on this method, we design an adaptive audio measurement likelihood for audio-visual multiple speaker tracking using the Poisson multi-Bernoulli mixture (PMBM) filter. The experiments demonstrate that our proposed tracker achieves state-of-the-art results on the AV16.3 dataset.
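For illustration, the snippet below sketches how a GCF (steered-response-power style) map is typically computed from GCC-PHAT correlations; it does not reproduce the phase-aware VoiceFilter or the PMBM tracker from the paper, and the function names are hypothetical.

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """GCC-PHAT cross-correlation between two microphone signals."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))   # centre zero lag
    lags = np.arange(-n // 2, n // 2) / fs
    return lags, cc

def gcf_map(signals, mic_positions, grid, fs, c=343.0):
    """Global coherence field: for every candidate grid point, sum the
    GCC-PHAT value at the TDOA that point would induce for each mic pair."""
    gcf = np.zeros(len(grid))
    pairs = [(i, j) for i in range(len(signals))
             for j in range(i + 1, len(signals))]
    for i, j in pairs:
        lags, cc = gcc_phat(signals[i], signals[j], fs)
        for k, p in enumerate(grid):
            tau = (np.linalg.norm(p - mic_positions[i])
                   - np.linalg.norm(p - mic_positions[j])) / c
            gcf[k] += cc[np.argmin(np.abs(lags - tau))]
    return gcf / len(pairs)   # peaks indicate likely speaker positions
```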
This paper presents a distributed multi-class Gaussian process (MCGP) algorithm for ground vehicle classification using acoustic data. In this algorithm, harmonic structure analysis is used to extract features for GP classifier training. The predictions from local classifiers are then aggregated into a high-level prediction to achieve decision-level fusion, following the idea of divide-and-conquer. Simulations based on the acoustic-seismic classification identification data set (ACIDS) confirm that the proposed algorithm provides competitive performance in terms of classification error and negative log-likelihood (NLL), as compared to an MCGP based on data-level fusion, where only one global MCGP is trained using data from all the sensors.
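As a minimal sketch of decision-level fusion (the paper's exact aggregation rule is not reproduced here), local per-sensor class probabilities can be combined into a single high-level prediction, for example via a geometric mean:

```python
import numpy as np

def fuse_local_predictions(local_probs):
    """Decision-level fusion of per-sensor class probabilities.

    local_probs : (num_sensors, num_classes) array, each row the predictive
                  distribution from one locally trained classifier.
    Returns the fused class distribution (geometric mean of the local
    predictions, renormalised).
    """
    log_p = np.log(np.clip(local_probs, 1e-12, 1.0))
    fused = np.exp(log_p.mean(axis=0))
    return fused / fused.sum()

# Hypothetical example: three sensors, four vehicle classes.
local = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.55, 0.25, 0.10, 0.10],
                  [0.40, 0.30, 0.20, 0.10]])
print(fuse_local_predictions(local))   # fused high-level prediction
```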
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., “a man tells a joke followed by people laughing”). A unique challenge in LASS is associated with the complexity of natural language descriptions and their relation to the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information and separate the target source that is consistent with the language query from an audio mixture. We evaluate the performance of our proposed system with a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios. The separated audio samples and source code are available at https://liuxubo717.github.io/LASS-demopage.
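The sketch below is not LASS-Net itself; it only illustrates, in PyTorch, the general idea of conditioning a spectrogram masking network on a text-query embedding, with hypothetical module names and sizes.

```python
import torch
import torch.nn as nn

class LanguageConditionedMasker(nn.Module):
    """Toy language-queried separation head: a text embedding modulates
    spectrogram features (FiLM-style), and a mask for the target source
    is predicted."""
    def __init__(self, n_freq=257, text_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.film = nn.Linear(text_dim, 2 * hidden)     # scale/shift from the query
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, text_emb):
        # mix_spec: (batch, frames, n_freq) magnitude spectrogram of the mixture
        # text_emb: (batch, text_dim) embedding of the natural-language query
        h = self.audio_enc(mix_spec)
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)  # query-dependent modulation
        mask = self.mask_head(h)                        # (batch, frames, n_freq)
        return mask * mix_spec                          # estimated target magnitude

mix = torch.rand(2, 100, 257)
query = torch.rand(2, 512)
est = LanguageConditionedMasker()(mix, query)           # (2, 100, 257)
```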
Audio captioning aims at using language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and a language decoder is then used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder, compared with the encoder. BERT is a pre-trained language model that has been extensively used in natural language processing tasks. Nevertheless, the potential of using BERT as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from the publicly available pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model. Our models achieve competitive results with the existing audio captioning methods on the AudioCaps dataset.
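As a rough sketch of this encoder-decoder wiring (assuming the HuggingFace transformers library; the paper's exact configuration and the PANNs feature extractor are not reproduced), a BERT checkpoint can be loaded as a causal decoder with cross-attention over audio-encoder features:

```python
import torch
from transformers import BertConfig, BertLMHeadModel, BertTokenizer

# Configure BERT as an autoregressive decoder with cross-attention,
# so it can attend to audio features from a PANNs-style encoder.
config = BertConfig.from_pretrained(
    "bert-base-uncased", is_decoder=True, add_cross_attention=True)
decoder = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical audio-encoder output: (batch, time_steps, hidden_size).
audio_feats = torch.randn(2, 32, config.hidden_size)

captions = tokenizer(["a dog barks", "rain falls on a roof"],
                     return_tensors="pt", padding=True)
out = decoder(input_ids=captions.input_ids,
              attention_mask=captions.attention_mask,
              encoder_hidden_states=audio_feats,
              labels=captions.input_ids)
out.loss.backward()   # caption loss used to fine-tune the BERT decoder
```

The newly added cross-attention layers have no pre-trained weights and are learned from the captioning data, which is one reason a pre-trained decoder still needs fine-tuning.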
Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristics of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving the classification accuracy include integrating auxiliary methods such as attention mechanisms, pre-trained models, and ensembles of multiple sub-networks. However, due to the complexity of audio clips captured from different environments, it is difficult for existing deep learning models that use only a single classifier to distinguish their categories without such auxiliary methods. In this paper, we propose a novel approach for ASC using a deep neural decision forest (DNDF). DNDF combines a fixed number of convolutional layers and a decision forest as the final classifier. The decision forest consists of a fixed number of decision tree classifiers, which have been shown to offer better classification performance than a single classifier on some datasets. In particular, the decision forest differs substantially from traditional random forests as it is stochastic, differentiable, and capable of using back-propagation to update and learn feature representations in a neural network. Experimental results on the DCASE2019 and ESC-50 datasets demonstrate that our proposed DNDF method improves the ASC performance in terms of classification accuracy and shows competitive performance as compared with state-of-the-art baselines.
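To make the stochastic, differentiable forest concrete, the following is a minimal PyTorch sketch of a single soft decision tree of the kind used in deep neural decision forests; the paper's exact architecture, depth, and training scheme are not reproduced, and the names are illustrative.

```python
import torch
import torch.nn as nn

class SoftDecisionTree(nn.Module):
    """A single differentiable decision tree: sigmoid routing at internal
    nodes, learnable class distributions at the leaves."""
    def __init__(self, in_dim, n_classes, depth=4):
        super().__init__()
        self.depth = depth
        n_internal, n_leaves = 2 ** depth - 1, 2 ** depth
        self.decisions = nn.Linear(in_dim, n_internal)       # routing logits
        self.leaf_logits = nn.Parameter(torch.zeros(n_leaves, n_classes))

    def forward(self, x):
        d = torch.sigmoid(self.decisions(x))                 # (B, n_internal)
        # Probability of reaching each node, built level by level.
        reach = torch.ones(x.size(0), 1, device=x.device)
        idx = 0
        for level in range(self.depth):
            n_nodes = 2 ** level
            d_level = d[:, idx:idx + n_nodes]                 # (B, n_nodes)
            reach = torch.stack((reach * d_level,
                                 reach * (1 - d_level)), dim=2).flatten(1)
            idx += n_nodes
        leaf_dist = torch.softmax(self.leaf_logits, dim=-1)   # (n_leaves, C)
        return reach @ leaf_dist                              # (B, n_classes)

# A forest averages several trees over CNN features (hypothetical sizes).
trees = nn.ModuleList(SoftDecisionTree(128, 10) for _ in range(5))
feats = torch.randn(8, 128)
probs = torch.stack([t(feats) for t in trees]).mean(0)        # forest output
```

Because routing and leaf distributions are smooth functions of the inputs and parameters, the whole forest can be trained end-to-end with back-propagation together with the convolutional feature extractor.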
Training a robust tracker of objects (such as vehicles and people) using audio and visual information often needs a large amount of labelled data, which is difficult to obtain as manual annotation is expensive and time-consuming. The natural synchronization of the audio and visual modalities enables the object tracker to be trained in a self-supervised manner. In this work, we propose to localize an audio source (i.e., a speaker) using a teacher-student paradigm, where the visual network teaches the audio network by knowledge distillation to localize speakers. The introduction of multi-task learning, by training the audio network to perform source localization and semantic segmentation jointly, further improves the model performance. Experimental results show that the audio localization network can learn from visual information and achieve competitive tracking performance as compared to baseline methods that are based on audio-only measurements. The proposed method can provide more reliable measurements for tracking than traditional sound source localization methods, and the generated audio features aid in visual tracking.
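A minimal sketch of such a teacher-student training step is shown below, assuming an MSE distillation loss between the audio and visual localization maps and a cross-entropy segmentation loss; the networks, loss choices, and names are hypothetical rather than those used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_step(audio_net, visual_net, audio_feats, frames, seg_labels,
                      alpha=0.5):
    """One self-supervised training step: the audio (student) network mimics
    the visual (teacher) network's localization map, with an auxiliary
    semantic-segmentation head trained jointly (multi-task learning)."""
    with torch.no_grad():
        teacher_map = visual_net(frames)                  # (B, H, W) soft targets

    student_map, seg_logits = audio_net(audio_feats)      # localization + segmentation

    loss_kd = F.mse_loss(student_map, teacher_map)        # knowledge distillation
    loss_seg = F.cross_entropy(seg_logits, seg_labels)    # auxiliary task
    return alpha * loss_kd + (1 - alpha) * loss_seg
```

The key point is that no manual position labels are required: the synchronized video frames supply the supervision signal for the audio network.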