Haosen Yang
Academic and research departments
Surrey Institute for People-Centred Artificial Intelligence (PAI), Centre for Vision, Speech and Signal Processing (CVSSP)
About
My research project
Visual Representation Learning
The primary objective of my research is to investigate effective systems for understanding the real world in a mutually beneficial and interpretable manner. By combining the strengths of video, language and audio, I aim to enhance their individual performances and foster a collaborative learning process.
Publications
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level. Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion. This limits their scalability, since acquiring such cross-modality pixel-level labels is time-consuming and tedious. To overcome this obstacle, in this work we introduce unsupervised audio-visual segmentation, with no need for task-specific data annotations or model training. To tackle this newly proposed problem, we formulate a novel Cross-Modality Semantic Filtering (CMSF) approach to accurately associate the underlying audio-mask pairs by leveraging off-the-shelf multi-modal foundation models (e.g., detection [1], open-world segmentation [2] and multi-modal alignment [3]). Guiding the proposal generation by either audio or visual cues, we design two training-free variants: AT-GDINO-SAM and OWOD-BIND. Extensive experiments on the AVS-Bench dataset show that our unsupervised approach performs well in comparison to prior supervised counterparts across complex scenarios with multiple auditory objects. In particular, in situations where existing supervised AVS methods struggle with overlapping foreground objects, our models still accurately segment the overlapped auditory objects. Our code will be publicly released.
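At a high level, the approach described above can be read as a propose-then-filter loop: generate candidate object regions with off-the-shelf models, turn them into pixel-level masks, and keep only the masks whose content agrees with the audio. The sketch below is a hypothetical illustration of that loop, not the paper's implementation: the callables `detect_boxes`, `propose_masks` and `audio_region_score` are placeholders standing in for a detector, a promptable segmenter and an audio-visual alignment model respectively.

```python
# Hypothetical sketch of training-free cross-modality semantic filtering.
# The foundation-model calls are abstracted as callables supplied by the user.

from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) candidate region
Mask = np.ndarray                 # boolean H x W segmentation mask


def cross_modality_filter(
    frame: np.ndarray,
    audio: np.ndarray,
    detect_boxes: Callable[[np.ndarray], List[Box]],
    propose_masks: Callable[[np.ndarray, Box], Mask],
    audio_region_score: Callable[[np.ndarray, np.ndarray, Mask], float],
    score_threshold: float = 0.5,
) -> List[Mask]:
    """Keep only the region proposals whose visual content aligns with the audio.

    1. Generate candidate regions with a detector / open-world segmenter.
    2. Convert each region into a pixel-level mask with a promptable segmenter.
    3. Score each mask against the audio with a multi-modal alignment model
       and keep the masks above a similarity threshold.
    """
    masks: List[Mask] = []
    for box in detect_boxes(frame):                      # proposals guided by audio or visual cues
        mask = propose_masks(frame, box)                 # pixel-level mask for this proposal
        score = audio_region_score(audio, frame, mask)   # audio-visual semantic agreement
        if score >= score_threshold:                     # filter out silent / irrelevant objects
            masks.append(mask)
    return masks
```

In this framing, the two variants named in the abstract would differ mainly in how the candidate proposals are produced (audio-tag-guided versus open-world detection), while the alignment score supplies the semantic filtering in both cases; this reading is inferred from the abstract rather than taken from the paper itself.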
Additional publications
- Temporal action proposal generation with background constraint
- Temporal action proposal generation with transformers
- MIFNet: Multiple Instances Focused Temporal Action Proposal Generation
- NSNet: Non-saliency suppression sampler for efficient video recognition
- Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoder