
Haohe Liu
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering.
About
My research project
Automatic sound labelling for broadcast audio
I’m Haohe Liu, a first-year Ph.D. student at the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. My research covers topics in speech, music, and general audio. In my spare time I enjoy 🎧 🎼 🎹 🏃🏼 🏂🏼. Most of my work is fully open-sourced. I am also interested in, and actively researching, other novel topics in audio. If you would like to discuss or collaborate, please drop me an email.
Highlighted research as first author:
- AudioLDM: State-of-the-art text-to-audio generation model.
- NaturalSpeech: The first text-to-speech model to achieve CMOS on par with human recordings.
- VoiceFixer: Restores the quality of degraded human speech, regardless of how the signal was degraded.
- CWS-PResUNet: A music source separation system that achieved leading performance in the Music Demixing Challenge 2021.
- NVSR: Speech super-resolution.
- DiffRes: A module that makes the temporal resolution of the spectrogram differentiable for efficient audio classification.
- Few-shot bioacoustic detection: The 2nd-ranked system in DCASE 2022 Challenge Task 5.
- …
At the University of Surrey, I am fortunate to be co-advised by Prof. Mark D. Plumbley and Prof. Wenwu Wang, and to be jointly funded by BBC Research & Development (R&D) and the Doctoral College. Within CVSSP, I work on the AI for Sound project, whose goal is to develop new methods for the automatic labelling of sound environments and events in broadcast audio, helping production staff find and search through content and helping the general public access archive content. I also work closely with the BBC R&D Audio Team on putting our audio recognition algorithms into production, for example by generating machine tags for the BBC Sound Effects library.
Internships
2021.10-2022.04 Microsoft Research Asia, Beijing - Research Intern - with Xu Tan
2020.07-2021.10 ByteDance AI Lab, Shanghai - Research Intern - with Qiuqiang Kong
Challenges
- (main contributor) 1st in DCASE 2022 Challenge Task 5: Few-shot Bioacoustic Event Detection. [code][details][leaderboard]
- (main contributor) 2nd on the vocal score and 5th on the overall score in the ISMIR 2021 Music Demixing Challenge. [code][details][leaderboard]
- 3rd in DCASE 2022 Challenge Task 6A: Automated Audio Captioning.
- 2nd in DCASE 2022 Challenge Task 6B: Language-Based Audio Retrieval.
Supervisors
- Prof. Mark D. Plumbley
- Prof. Wenwu Wang
Teaching
2023/02-2023/05 Demonstrator for EEEM068 Applied Machine Learning
2022/10-2023/01 Demonstrator for EEE1033 Computer and Digital Logic
2022/10-2023/01 Demonstrator for EEE3008 Fundamentals of Digital Signal Processing
Publications
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., “a man tells a joke followed by people laughing”). A unique challenge in LASS is associated with the complexity of natural language descriptions and their relation to the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network trained to jointly process acoustic and linguistic information and separate the target source that is consistent with the language query from an audio mixture. We evaluate the performance of our proposed system with a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios. The separated audio samples and source code are available at https://liuxubo717.github.io/LASS-demopage.
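As a rough illustration of the query-conditioned separation idea described above, the PyTorch sketch below predicts a time-frequency mask from the mixture spectrogram and a text-query embedding using FiLM-style conditioning. The module names, sizes, and conditioning scheme are illustrative assumptions; this is not the released LASS-Net implementation.
```python
# A minimal sketch of query-conditioned source separation in the spirit of LASS-Net.
# All module names, sizes, and the FiLM-style conditioning are assumptions.
import torch
import torch.nn as nn

class QueryConditionedSeparator(nn.Module):
    def __init__(self, n_freq=513, text_dim=768, hidden=256):
        super().__init__()
        # Stand-in projection of a query embedding; in practice the embedding
        # would come from a pretrained text encoder.
        self.text_proj = nn.Linear(text_dim, hidden)
        # Toy "audio encoder" over magnitude-spectrogram frames.
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        # FiLM-style conditioning: scale and shift audio features with the query.
        self.film = nn.Linear(hidden, 2 * hidden)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mixture_spec, query_embedding):
        # mixture_spec: (batch, time, n_freq); query_embedding: (batch, text_dim)
        h = self.audio_enc(mixture_spec)
        gamma, beta = self.film(self.text_proj(query_embedding)).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)
        mask = self.mask_head(h)          # time-frequency mask in [0, 1]
        return mask * mixture_spec        # estimated target-source spectrogram

model = QueryConditionedSeparator()
estimate = model(torch.rand(2, 100, 513), torch.rand(2, 768))
print(estimate.shape)  # torch.Size([2, 100, 513])
```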
Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking. Recent machine learning methods, such as convolutional neural networks (CNNs), have been shown to be able to automatically recognize sound activities, a task known as audio tagging. One such method, pre-trained audio neural networks (PANNs), provides a neural network which has been pre-trained on over 500 sound classes from the publicly available AudioSet dataset, and can be used as a baseline or starting point for other tasks. However, the existing PANNs model has a high computational complexity and large storage requirement. This could limit the potential for deploying PANNs on resource-constrained devices, such as on-the-edge sound sensors, and could lead to high energy consumption if many such devices were deployed. In this paper, we reduce the computational complexity and memory requirement of the PANNs model by taking a pruning approach to eliminate redundant parameters from the PANNs model. The resulting Efficient PANNs (E-PANNs) model, which requires 36% less computation and 70% less memory, also slightly improves the sound recognition (audio tagging) performance. The code for the E-PANNs model has been released under an open source license.
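The snippet below sketches the general pruning idea on a toy convolutional audio tagger using PyTorch's built-in pruning utilities; the model, the 50% ratio, and the one-shot recipe are assumptions for illustration, not the actual E-PANNs code or settings.
```python
# Illustrative sketch of filter pruning on a small CNN audio tagger, in the spirit
# of the pruning approach described above; not the E-PANNs implementation.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in convolutional classifier (PANNs itself is a much larger CNN).
model = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 527),  # 527 AudioSet classes
)

# Structured L1 pruning: zero out the 50% of filters with the smallest L1 norm in
# each conv layer, then make the masks permanent. The real compute/memory savings
# come from physically removing the zeroed filters, typically followed by fine-tuning.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.5, n=1, dim=0)
        prune.remove(module, "weight")

with torch.no_grad():
    logits = model(torch.rand(2, 1, 64, 100))  # (batch, channel, mel bins, frames)
print(logits.shape)  # torch.Size([2, 527])
```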
Audio captioning aims at using language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and a language decoder is then used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder, compared with the encoder. BERT is a pre-trained language model that has been extensively used in natural language processing tasks. Nevertheless, the potential of using BERT as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from the publicly available pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model. Our models achieve competitive results with the existing audio captioning methods on the AudioCaps dataset.
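As a sketch of the idea above (an audio encoder feeding a caption decoder initialised from pretrained BERT), the snippet below wires a stand-in for PANNs-style audio features into Hugging Face's BertLMHeadModel with cross-attention. The projection layer, feature sizes, and training step are illustrative assumptions rather than the paper's exact setup.
```python
# Minimal sketch of an audio-captioning decoder initialised from pretrained BERT,
# cross-attending to audio-encoder features. Requires the `transformers` package.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertLMHeadModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# BERT re-purposed as an autoregressive decoder with cross-attention layers.
decoder = BertLMHeadModel.from_pretrained(
    "bert-base-uncased", is_decoder=True, add_cross_attention=True
)

# Stand-in for audio-encoder output: (batch, time steps, feature dim), projected
# to BERT's hidden size so the decoder can cross-attend to it.
audio_features = torch.rand(2, 50, 2048)
proj = nn.Linear(2048, decoder.config.hidden_size)
encoder_hidden_states = proj(audio_features)

captions = ["a dog barks while birds sing", "rain falls on a metal roof"]
inputs = tokenizer(captions, return_tensors="pt", padding=True)

# Teacher-forced training step: the decoder is trained to reproduce the reference
# caption tokens while attending to the audio features.
outputs = decoder(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    encoder_hidden_states=encoder_hidden_states,
    labels=inputs["input_ids"],
)
print(outputs.loss)
```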
Additional publications
Preprints
- Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models, arXiv preprint 2023.
- Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, and Mark D. Plumbley. Learning the Spectrogram Temporal Resolution for Audio Classification, arXiv preprint 2022.
- Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic, ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech, arXiv preprint 2022.
- Xu Tan*, Jiawei Chen*, Haohe Liu*, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu, NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality, arXiv preprint 2022. [pdf][demo]
- Haohe Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang, VoiceFixer: Toward General Speech Restoration with Neural Vocoder, arXiv preprint 2021. [pdf][code][demo]
Conference papers
- Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley, Ontology-aware learning and evaluation for audio tagging, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. [pdf] (Under Review)
- Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang, Simple Pooling Front-ends For Efficient Audio Classification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. [pdf] (Under Review)
- Xubo Liu*, Qiushi Huang*, Xinhao Mei*, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang, Visually-Aware Audio Captioning with Adaptive Audio-Visual Attention, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023 (Under Review)
- Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley, Segment-level Metric Learning for Few-shot Bioacoustic Event Detection, DCASE Workshop 2022
- Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu, BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis, Conference on Neural Information Processing Systems (NeurIPS) 2022. [pdf][demo]
- Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, Deliang Wang, Chuanzeng Huang, Yuxuan Wang, VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration, in Proceedings of INTERSPEECH 2022. [pdf][code][demo]
- Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang, Neural Vocoder is All You Need for Speech Super-resolution, in Proceedings of INTERSPEECH 2022. [pdf][code][demo]
- Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, Yong Xu, Wenwu Wang, Multiple Speakers Tracking with Audio and Visual Signals, in Proceedings of INTERSPEECH 2022.
- Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang, Separate What You Describe: Language-Queried Audio Source Separation, in Proceedings of INTERSPEECH 2022. [pdf][code][demo]
- Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang, Leveraging Pre-trained BERT for Audio Captioning, in Proceedings of EUSIPCO 2022. [pdf]
- Haohe Liu, Qiuqiang Kong, Jiafeng Liu, CWS-PResUNet: Music Source Separation with Channel-wise Subband Phase-aware ResUNet, ISMIR Music Demixing Workshop 2021. [pdf][code]
- Qiuqiang Kong, Yin Cao, Haohe Liu, Keunwoo Choi, Yuxuan Wang, Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation, ISMIR 2021. [pdf]
- Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang, Speech Enhancement with Weakly Labelled Data from AudioSet, INTERSPEECH 2021. [pdf]
- Haohe Liu, Lei Xie, Jian Wu, Geng Yang, Channel-wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music, INTERSPEECH 2020. [pdf][code][demo]
- Haohe Liu, Siqi Yao, Yulin Wang, Design and Visualization of Guided GAN on MNIST dataset, Proceedings of the 3rd international conference on Graphics and Signal Processing 2019.
Technical report
- Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley, Surrey System for DCASE 2022 Task 5: Few-shot Bio-acoustic Event Detection with Segment-level Metric Learning, DCASE2022 Challenge Technical Report 2022. [pdf]
- Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang, Automated Audio Captioning with Keywords Guidance, DCASE2022 Challenge Technical Report 2022. [pdf]
- Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang, Language-Based Audio Retrieval with Pre-trained Models, DCASE2022 Challenge Technical Report 2022. [pdf]
Updated: February 2023