Peng Zhang
About
My research project
Deep learning for audio-visual scene analysis

Future sound systems are expected to recommend “what” media content users may choose to consume, yet typical audio reproduction systems cannot readily adapt “how” they present the experience. This project aims to design advanced audio-visual signal processing algorithms to analyse the acoustic environment and user context, including sound events within the room (e.g. people talking while music plays in the background). This information will be used to inform the spatial audio system for sound reproduction, with awareness of the user's intentions, choices and environment. The engineering challenge is to give users the controls they would like in order to adjust the behaviour of their audio system: for example, boosting the bass response as it interacts with the room to provide greater envelopment, or exploiting the spatial extent of an immersive sound reproduction system to render content more intelligibly.
Supervisors
Publications
Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling hierarchical and compositional structure of long-form audio. A promising direction for future work is the exploration of methods better suited to capturing the complex, long-range relationships in long-form audio.
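The three-level annotation scheme described above (activities, sub-activities and events) can be pictured as nested time segments. The following is a minimal, hypothetical sketch of such a hierarchy; the `Segment` structure, labels and timestamps are illustrative assumptions, not the actual MultiAct annotation format:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """A labelled time span; children hold the next semantic level down."""
    label: str
    start: float  # seconds
    end: float
    children: list = field(default_factory=list)

# Hypothetical three-level annotation for a kitchen recording
# (labels and times are made up for illustration).
activity = Segment("make_tea", 0.0, 120.0, children=[
    Segment("boil_water", 0.0, 60.0, children=[
        Segment("kettle_switch_click", 1.2, 1.5),
        Segment("water_boiling", 10.0, 58.0),
    ]),
    Segment("pour_water", 60.0, 75.0, children=[
        Segment("pouring", 61.0, 70.0),
    ]),
])

def flatten(seg, level=0):
    """Yield (level, label, start, end) for a segment and its descendants."""
    yield (level, seg.label, seg.start, seg.end)
    for child in seg.children:
        yield from flatten(child, level + 1)

for level, label, start, end in flatten(activity):
    print("  " * level + f"{label}: {start:.1f}-{end:.1f}s")
```

A flattened view like this is one natural target for joint detection and sequence prediction: each semantic level becomes a sequence of labelled spans that a hierarchical model can predict at its own temporal resolution.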