Dr Peipei Wu

Postgraduate Research Student

p.wu@surrey.ac.uk

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP).

Research

Research interests

My research interest is audio-visual tracking. Recent years, different sensors (such as microphones and cameras) have been used jointly to track multiple speakers in cluttered environments with a number of moving speakers and background noise. My work is to find a solution to augment the performance of audio-visual tracking.

Publications

Peipei Wu, Jinzheng Zhao, Shidrokh Goudarzi, Wenwu Wang, PEIPEI WU (2022)Partial Arithmetic Consensus based Distributed Intensity Particle Flow SMC-PHD Filter for Multi-Target Tracking, In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processingpp. 5078-5082 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP43922.2022.9746344

Intensity Particle Flow (IPF) SMC-PHD has been proposed recently for multi-target tracking. In this paper, we extend IPF-SMC-PHD filter to distributed setting, and develop a novel consensus method for fusing the estimates from individual sensors, based on Arithmetic Average (AA) fusion. Different from conventional AA method which may be degraded when unreliable estimates are presented, we develop a novel arithmetic consensus method to fuse estimates from each individual IPF-SMC-PHD filter with partial consensus. The proposed method contains a scheme for evaluating the reliability of the sensor nodes and preventing unreliable sensor information to be used in fusion and communication in sensor network, which help improve fusion accuracy and reduce sensor communication costs. Numerical simulations are performed to demonstrate the advantages of the proposed algorithm over the uncooperative IPF-SMC-PHD and distributed particle-PHD with AA fusion.

Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip Jackson, Wenwu Wang (2023)Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2310.14778

Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, the Bayesian-based filter can solve the problem of data association, audio-visual fusion and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which also boosts the development of audio visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. At last, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.

Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson (2024)Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection, In: Proceedings of the ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024) Institute of Electrical and Electronics Engineers (IEEE)

Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included. Few audio-visual (AV)-SELD works have been published and most employ vision via face/object bounding boxes, or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive amongst state-of-the-art results of the AV challenge, without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques and further pre-training could be applied as next steps to improve performance.

Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, Yong Xu, Wenwu Wang (2022)Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter, In: INTERSPEECH 20223704pp. 3704-3708 Isca-Int Speech Communication Assoc

DOI: 10.21437/Interspeech.2022-10190

Audio and visual signals can be used jointly to provide complementary information for multi-speaker tracking. Face detectors and color histogram can provide visual measurements while Direction of Arrival (DOA) lines and global coherence field (GCF) maps can provide audio measurements. GCF, as a traditional sound source localization method, has been widely used to provide audio measurements in audio-visual speaker tracking by estimating the positions of speakers. However, GCF cannot directly deal with the scenarios of multiple speakers due to the emergence of spurious peaks on the GCF map, making it difficult to find the non-dominant speakers. To overcome this limitation, we propose a phase-aware VoiceFilter and a separation-before-localization method, which enables the audio mixture to be separated into individual speech sources while retaining their phases. This allows us to calculate the GCF map for multiple speakers, thereby their positions accurately and concurrently. Based on this method, we design an adaptive audio measurement likelihood for audio-visual multiple speaker tracking using Poisson multi-Bernoulli mixture (PMBM) filter. The experiments demonstrate that our proposed tracker achieves state-of-the-art results on the AV16.3 dataset.

Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark Plumbley, Wenwu Wang (2023)Text-Driven Foley Sound Generation With Latent Diffusion Model, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2306.10359

Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pertaining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks \({1}^{st}\) among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The codes for implementing the proposed system are available online.

Xingchi Liu, Qing Li, Jiaming Liang, Jinzheng Zhao, Peipei Wu, Chenyi Lyu, Shidrokh Goudarzi, Jemin George, Tien Pham, Wenwu Wang, Lyudmila Mihaylova, Simon Godsill (2022)Advanced Machine Learning Methods for Autonomous Classification of Ground Vehicles with Acoustic Data, In: T Pham, L Solomon (eds.), ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING FOR MULTI-DOMAIN OPERATIONS APPLICATIONS IV12113121131pp. 121131P-121131P-10 Spie-Int Soc Optical Engineering

DOI: 10.1117/12.2618105

This paper presents a distributed multi-class Gaussian process (MCGP) algorithm for ground vehicle classification using acoustic data. In this algorithm, the harmonic structure analysis is used to extract features for GP classifier training. The predictions from local classifiers are then aggregated into a high-level prediction to achieve the decision-level fusion, following the idea of divide-and-conquer. Simulations based on the acoustic-seismic classification identification data set (ACIDS) confirm that the proposed algorithm provides competitive performance in terms of classification error and negative log-likelihood (NLL), as compared to an MCGP based on the data-level fusion where only one global MCGP is trained using data from all the sensors.

Yang Liu, Yong Xu, Peipei Wu, Wenwu Wang (2023)Labelled Non-Zero Diffusion Particle Flow SMC-PHD Filtering for Multi-Speaker Tracking, In: IEEE transactions on multimediapp. 2544-2559 IEEE

DOI: 10.1109/TMM.2023.3301221

Particle flow (PF) is a method originally proposed for single target tracking, and used recently to address the weight degeneracy problem of the sequential Monte Carlo probability hypothesis density (SMC-PHD) filter for audio-visual (AV) multi-speaker tracking, where the particle flow is calculated by using only the measurements near the particle, assuming that the target is detected, as in a recent method based on non-zero particle flow (NPF), i.e. the AV-NPF-SMC-PHD filter. This, however, can be problematic when occlusion happens and the occluded speaker may not be detected. To address this issue, we propose a new method where the labels of the particles are estimated using the likelihood function, and the particle flow is calculated in terms of the selected particles with the same labels. As a result, the particles associated with detected speakers and undetected speakers are distinguished based on the particle labels. With this novel method, named as AV-LPF-SMC-PHD, the speaker states can be estimated as the weighted mean of the labelled particles, which is computationally more efficient than using a clustering method as in the AV-NPF-SMC-PHD filter. The proposed algorithm is compared systematically with several baseline tracking methods using the AV16.3, AVDIAR and CLEAR datasets, and is shown to offer improved tracking accuracy with a lower computational cost.

Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, Wenwu Wang (2023)CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2310.07517

Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting them as audible or visible events. Recent methods for this task leverage the attention mechanism to capture the semantic correlations among the whole video across the audio-visual modalities. However, these approaches have overlooked the importance of individual segments within a video and the relationship among them, and tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method~(CM-PIE), which can learn fine-grained features by applying a segment-based attention module. Furthermore, a cross-modal aggregation block is introduced to jointly optimize the semantic representation of audio and visual signals by enhancing inter-modal interactions. The experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse dataset compared to other methods.

Peipei Wu, Jinzheng Zhao, Yaru Chen, Davide Berghi, Yi Yuan, Chenfei Zhu, Yin Cao, Yang Liu, Philip J B Jackson, Mark David Plumbley, Wenwu Wang (2023)PLDISET: Probabilistic Localization and Detection of Independent Sound Events with Transformers

Sound Event Localization and Detection (SELD) is a task that involves detecting different types of sound events along with their temporal and spatial information, specifically, detecting the classes of events and estimating their corresponding direction of arrivals at each frame. In practice, real-world sound scenes might be complex as they may contain multiple overlapping events. For instance, in DCASE challenges task 3, each clip may involve simultaneous occurrences of up to five events. To handle multiple overlapping sound events, current methods prefer multiple output branches to estimate each event, which increases the size of the models. Therefore, current methods are often difficult to be deployed on the edge of sensor networks. In this paper, we propose a method called Probabilistic Localization and Detection of Independent Sound Events with Transformers (PLDISET), which estimates numerous events by using one output branch. The method has three stages. First, we introduce the track generation module to obtain various tracks from extracted features. Then, these tracks are fed into two transformers for sound event detection (SED) and localization, respectively. Finally, one output system, including a linear Gaussian system and regression network, is used to estimate each track. We give the evaluation resn results of our model on DCASE 2023 Task 3 development dataset.

Jinzheng Zhao, Peipei Wu, Shidrokh Goudarzi, Xubo Liu, Jianyuan Sun, Yong Xu, Wenwu Wang (2022)Visually Assisted Self-supervised Audio Speaker Localization and Tracking, In: 2022 30th European Signal Processing Conference (EUSIPCO) EUSIPCO

DOI: 10.23919/EUSIPCO55093.2022.9909535

—Training a robust tracker of objects (such as vehicles and people) using audio and visual information often needs a large amount of labelled data, which is difficult to obtain as manual annotation is expensive and time-consuming. The natural synchronization of the audio and visual modalities enables the object tracker to be trained in a self-supervised manner. In this work, we propose to localize an audio source (i.e., speaker) using a teacher-student paradigm, where the visual network teaches the audio network by knowledge distillation to localize speakers. The introduction of multi-task learning, by training the audio network to perform source localization and semantic segmentation jointly, further improves the model performance. Experimental results show that the audio localization network can learn from visual information and achieve competitive tracking performance as compared to the baseline methods that are based on the audio-only measurements. The proposed method can provide more reliable measurements for tracking than the traditional sound source localization methods, and the generated audio features aid in visual tracking.