Professor Mark Plumbley FIET, FIEEE

Professor of Signal Processing, EPSRC Fellow in "AI for Sound"

https://www.kcl.ac.uk/people/mark-plumbley

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP), School of Computer Science and Electronic Engineering.

About

Biography

Until September 2025, Mark Plumbley was Professor of Signal Processing at the Centre for Vision, Speech and Signal Processing (CVSSP), School of Computer Science and Electronic Engineering, at the University of Surrey, in Guildford, UK.

After receiving his PhD degree in neural networks in 1991, he became a Lecturer at King’s College London, before moving to Queen Mary University of London in 2002. He subsequently became Professor and Director of the Centre for Digital Music, before joining the University of Surrey in 2015.

His current research concerns AI for Sound: using machine learning, AI and signal processing for analysis and recognition of sounds, particularly real-world everyday sounds. He led the first data challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2013).

He holds an EPSRC Fellowship in "AI for Sound", recently led EPSRC-funded projects on making sense of everyday sounds and on audio source separation, and two EU-funded research training networks in sparse representations and compressed sensing. He is a co-editor of the Springer book on "Computational Analysis of Sound Scenes and Events". He is a Fellow of the IET and IEEE.

Previous roles

2019 - 2022

Head of School

School of Computer Science and Electronic Engineering, University of Surrey

2017 (Jan-Sep)

Interim Head of Department

Department of Computer Science, University of Surrey

2010 - 2014

Director, Centre for Digital Music

Queen Mary University of London

2002 - 2014

Professor of Machine Learning and Signal Processing

School of Electronic Engineering and Computer Science, Queen Mary University of London

Affiliations and memberships

Fellow, Institution of Engineering and Technology (IET)

Fellow, Institute of Electrical and Electronics Engineers (IEEE)

News and events

2025

1 Oct 2025: Moved to King's College London to become Head of Department of Informatics.
24 Jun 2025: Plenary lecture on "AI for Acoustics" at Forum Acusticum / Euronoise 2025, Malaga, Spain, 23-26 June 2025
13 May 2025: Recruiting! Deadline for Research Fellow in Generative Audio AI
2 May 2025: Extended deadline for Call for Expressions of Interest in Noise Network Plus Working Groups [LinkedIn]
28-30 Apr 2025: See you at the Urban Sound Symposium 2025, Zurich, Switzerland.
8-11 Apr 2025: Papers presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), Hyderabad, India:
- Apr 11: A decade of DCASE: Achievements, practices, evaluations and future challenges (Annamaria Mesaros, Romain Serizel, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley) Special Session: 50 years of Audio and Acoustic Signal Processing [Paper] [AAM] [Preprint]
- Apr 11: Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection (Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang). SPS Journal Paper, published as: IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, pp 4313-4327, 2024. [Paper] [AAM] [Preprint] [Code]
- Apr 11: Personalized live sound recognition using efficient PANNs (Arshdeep Singh, Haohe Liu, Gabriel Bibbó, Thomas Deacon, Mark D. Plumbley) Show and Tell [LinkedIn]
- Apr 10: Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhuo Chen, Wang Yuping, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang) [Paper] [AAM] [Preprint] [Code] [Dataset] [Demos]
- Apr 8: FlowSep: Language-Queried Sound Separation with Rectified Flow Matching (Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang) [Paper] [AAM] [Preprint] [Code] [Demos]
19 Mar 2025: See you at the UKAN+ Celebration event, Institute of Physics, London
18 Mar 2025: Thanks to everyone who attended the Launch Meeting of Noise Network Plus, Royal Academy of Engineering, London / Hybrid [Slides & Photos] [X/Twitter, 2, 3]
4 Mar 2025: Launch Event for the AI Hub in Generative Models (Gen AI Hub), Francis Crick Institute, London.
11 Feb 2025: EPSRC TERC Network Plus Launch Event, Royal Academy of Engineering, London
- Bringing together the six funded projects, including our Noise Network Plus

2024

10-15 Dec 2024: See you at the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, Canada.
6 Dec 2024: See you at the NeurIPS @ Cambridge meetup.
23-25 Oct 2024: See you at DCASE 2024 Workshop on Detection and Classification of Acoustic Scenes and Events, Tokyo, Japan
23-24 Sep 2024: See you at the 34th IEEE International Workshop on Machine Learning for Signal Processing (IEEE MLSP 2024), London, UK
12-13 Sep 2024: See you at the IOA ACOUSTICS 2024 50th Anniversary Conference, including the UKAN+ Annual Meeting 2024
3-7 Sept 2024: See you at the 27th International Conference on Digital Audio Effects (DAFx24), University of Surrey, UK.
Paper submission deadline: 20 Mar 2024
14-19 Apr 2024: At the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), Seoul, Korea. Accepted papers from the group:
- Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley and Wenwu Wang. Retrieval-Augmented Text-to-Audio Generation. [arXiv]
- Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang and Mark D. Plumbley. AudioSR: Versatile Audio Super-resolution at Scale. [arXiv]
20-27 Feb 2024: Xubo Liu will be at the 38th AAAI Conference on Artificial Intelligence (AAAI-24), Vancouver, Canada, presenting our paper: Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang and Mark D. Plumbley. Learning Temporal Resolution in Spectrogram for Audio Classification. [arXiv]
31 Jan 2024: Attending the University of Surrey Annual Open Research Culture Event, University of Surrey, UK.
Haohe Liu presenting, shortlisted for an Open Research Award for his case study: Haohe Liu, Wenwu Wang, Mark D. Plumbley: Building the research community with open-source practice.
16-17 Jan 2024: Attending Workshop on Interdisciplinary Perspectives on Soundscapes and Wellbeing, University of Surrey, UK.
(Day 1: Workshop & Webinar; Day 2: World Café & Sandpit Sessions)

2023

26 Oct 2023: Attending Speech and Audio in the Northeast (SANE 2023), NYU, Brooklyn, New York.
Presentations from the group:
- Arshdeep Singh, Haohe Liu and Mark D. Plumbley. E-PANNs: An Efficient version of PANNs for Audio Tagging (Poster). [Twitter/X]
- Gabriel Bibbo, Arshdeep Singh and Mark D. Plumbley. Recognise and Notify Sound Events using a Raspberry PI based Standalone Device (Poster). [Twitter/X]
22-25 Oct 2023: Attending IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2023), New Paltz, New York, USA. Presentations from the group:
- James A. King, Arshdeep Singh and Mark D. Plumbley. Compressing audio CNNs with graph centrality based filter pruning. [arXiv] [Twitter/X]
- Gabriel Bibbo, Arshdeep Singh and Mark D. Plumbley. Recognise and Notify Sound Events using a Raspberry PI based Standalone Device (Demo) [Twitter/X]
20-22 Sep 2023: Attending DCASE 2023 Workshop on Detection and Classification of Acoustic Scenes and Events, 20-22 September 2023, Tampere, Finland. [Twitter/X]
Papers from the group:
- Thu 21 Sep, P1-6: Jinbo Hu, Yin Cao, Ming Wu, Feiran Yang, Ziying Yu, Wenwu Wang, Mark D. Plumbley, Jun Yang: META-SELD: Meta-Learning for Fast Adaptation to the New Environment in Sound Event Localization and Detection
- Thu 21 Sep, P2-20: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang: Text-Driven Foley Sound Generation with Latent Diffusion Model
- Fri 22 Sep, P3-25: Peipei Wu, Jinzheng Zhao, Yaru Chen, Davide Berghi, Yi Yuan, Chenfei Zhu, Yin Cao, Yang Liu, Philip JB Jackson, Mark D. Plumbley, Wenwu Wang: PLDISET: Probabilistic Localization and Detection of Independent Sound Events with Transformers
- Also participating in Discussion Panel (16:00 Thu 21 Sep)
19 Jul 2023: Great to see everyone on the CVSSP Summer Walk! [Twitter/X]
9-13 Jul 2023: Presenting Distinguished Plenary Lecture at 29th International Congress on Sound and Vibration (ICSV29), Prague.
22 Jun 2023: Presenting at the C4DM 20th Anniversary Celebration, Centre for Digital Music, Queen Mary University of London. [Twitter/X]
5-10 Jun 2023: Attending ICASSP 2023, Rhodes Island, Greece.
Presentations from the group:
- Arshdeep Singh and Mark D. Plumbley: Efficient similarity-based passive filter pruning for compressing CNNs. [arXiv] [Open Access] [Twitter/X]
- Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley and Wenwu Wang. Simple pooling front-ends for efficient audio classification. [arXiv] [Code]
- Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang and Mark D Plumbley. Generating sound effects, music, speech, and beyond, with text [Show and Tell Demo, presenting AudioLDM]. [Twitter/X]
31 May 2023: Team placed first in DCASE Challenge Task 7 Track A "Foley Sound Synthesis" for submission: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang: Latent Diffusion Model Based Foley Sound Generation System for DCASE Challenge 2023 Task 7 [Paper] [Code]
24 Apr 2023: Paper accepted for ICML 2023: Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D Plumbley. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models [OpenReview]
19-21 Apr 2023: Presenting invited plenary talk at Urban Sound Symposium 2023, Barcelona, Spain. [Twitter/X]
20 Mar 2023: Open Research Case Study with Arshdeep Singh: Efficient Audio-based CNNs via Filter Pruning
14-21 Mar 2023: Presenting series of talks during visits to Redmond, WA & San Francisco Bay Area, CA, USA. Visits include:
- 14 Mar: Microsoft Research Lab, Redmond - with thanks to host: Ivan Tashev
- 15 Mar: Meta Reality Labs, Redmond - with thanks to host: Buye Xu
- 16 Mar: Adobe Research, San Francisco - with thanks to hosts: Gautham Mysore, Justin Salamon & Nick Bryan
- 16 Mar: Dolby Laboratories, San Francisco - with thanks to host: Mark Thomas
- 17 Mar: Stanford University, Center for Computer Research in Music and Acoustics (CCRMA) - with thanks to host: Malcolm Slaney
- 20 Mar: ByteDance, San Francisco - with thanks to host: Yuxuan Wang
- 20 Mar: Apple, Cupertino - with thanks to hosts: Miquel Espi Marques & Ahmed Tewfik
- 21 Mar: Amazon Lab126, Sunnyvale - with thanks to host: Trausti Kristjansson
29 Jan 2023: New paper on arXiv: Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. arXiv:2301.12503 [cs.SD]. [Project Page] [Code on GitHub] [Hugging Face Space] [Twitter]
18 Jan 2023: Invited talk at LISTEN Lab Workshop, Télécom Paris.
12 Jan 2023: Invited talk at RISE Learning Machines Seminar series, RISE Research Institutes of Sweden. [YouTube] [Twitter]
9 Jan 2023: Welcome to Thomas Deacon, joining the AI for Sound project today!

2022

26 Dec 2022: Annamaria Mesaros awarded First Prize in the IEEE Finland Joint Chapter on Signal Processing and Circuits and Systems (SP/CAS) Paper Award 2022 for our paper: A. Mesaros, T. Heittola, T. Virtanen and M. D. Plumbley, Sound Event Detection: A tutorial, IEEE Signal Processing Magazine, vol. 38, no. 5, pp. 67-83, Sept. 2021 [arXiv] [Open Access] [Twitter/X]
12 Dec 2022: Congratulations to former PhD students Qiuqiang Kong and Turab Iqbal on being selected for the IEEE Signal Processing Young Author Best Paper Award, for the paper: Qiuqiang Kong*; Yin Cao; Turab Iqbal*; Yuxuan Wang; Wenwu Wang; Mark D. Plumbley: "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition" IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2880-2894, 19 October 2020. DOI: 10.1109/TASLP.2020.3030497 [IEEE SPS Newsletter] [IEEE award page] [Paper] [arXiv] [Open Access] [Code]
1 Dec 2022: Virtual Keynote talk at the Amazon Audio Tech Summit, on "AI for Sound"
29 Nov 2022: Talk at the Centre for Digital Music at Queen Mary University of London, plus external PhD examiner.
21 Nov 2022: Welcome to Gabriel Bibbó, joining the AI for Sound project today!
3-4 Nov 2022: Attending (virtually) the DCASE 2022 Workshop on Detection and Classification of Acoustic Scenes and Events. Papers from the group:
- Arshdeep Singh, Mark D. Plumbley. Low-complexity CNNs for Acoustic Scene Classification. [arXiv]
- Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang. Continual Learning For On-Device Environmental Sound Classification. [arXiv]
- Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley. Segment-level Metric Learning for Few-shot Bioacoustic Event Detection. [arXiv]
- Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang. Sound event localization and detection for real spatial sound scenes: event-independent network and data augmentation chains. [arXiv]
21-24 Aug 2022: Attending 51st International Congress and Exposition on Noise Control Engineering (Inter-noise 2022), Glasgow, Scotland [Full Programme]
- 22 Aug: Presenting our paper: Creating a new research community on detection and classification of acoustic scenes and events: Lessons from the first ten years of DCASE challenges and workshops (Mark D. Plumbley and Tuomas Virtanen)
  Session: MON9.1/d - Advances in Environmental Noise (14:40-17:00)
17-23 Jul 2022: Attending (virtually) 39th International Conference on Machine Learning (ICML 2022).
13 Jul 2022: On CVSSP Summer Walk to Albury [Twitter]
11-12 Jul 2022: At two-day face-to-face workshop on "Designing AI for Home Wellbeing"
- 12 Jul: "Designing AI for Home Wellbeing" Seminar Day, supported by the Surrey Institute for People-Centred AI [LinkedIn]
- 11 Jul: "Designing AI for Home Wellbeing" World Cafe, supported by the University of Surrey Institute of Advanced Studies [LinkedIn]
30 Jun 2022: Dr Emily Corrigan-Kavanagh co-organizing in-person "Designing AI for Home Wellbeing" World Cafe at the Design Research Society conference DRS 2022 Bilbao (25 Jun - 3 Jul)
27 Jun - 1 Jul 2022: Interviewed by Tech Monitor (27 Jun) and BBC Tech Tent [0:17:20-0:22:00] (1 Jul) on virtual reality sound.
16 Jun 2022: Paper accepted for 32nd IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2022), Xi'an, China, 22-25 Aug 2022:
- Meng Cui, Xubo Liu, Jinzheng Zhao, Jianyuan Sun, Guoping Lian, Tao Chen, Mark D. Plumbley, Daoliang Li, Wenwu Wang. Fish Feeding Intensity Assessment in Aquaculture: A New Audio Dataset AFFIA3K and A Deep Learning Algorithm
15 Jun 2022: Three papers accepted for INTERSPEECH 2022, Incheon, Korea, 18-22 September 2022:
- Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang: Separate What You Describe: Language-Queried Audio Source Separation [arXiv] [Code] [Demo]
- Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark Plumbley and Wenwu Wang. On Metric Learning for Audio-Text Cross-Modal Retrieval [arXiv] [Code]
- Arshdeep Singh and Mark D. Plumbley. A Passive Similarity based CNN Filter Pruning for Efficient Acoustic Scene Classification [arXiv]
16 May 2022: Papers accepted for 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 30 Aug - 2 Sep 2022:
- Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang: Leveraging Pre-trained BERT for Audio Captioning [arXiv]
- Jianyuan Sun, Xubo Liu, Xinhao Mei, Jinzheng Zhao, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang. Deep Neural Decision Forest for Acoustic Scene Classification [arXiv]
7-13 May 2022: At IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022) virtual sessions
- 13 May: Bin Li presents our Journal of Selected Topics in Signal Processing article: Sparse Analysis Model Based Dictionary Learning for Signal Declipping (Bin Li, Lucas Rencker, Jing Dong, Yuhui Luo, Mark D. Plumbley and Wenwu Wang) [Presentation] [Paper] [Open Access] [Code]
- 11 May: Xinhao Mei presents our paper: Diverse Audio Captioning via Adversarial Training (Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark Plumbley and Wenwu Wang) [Presentation] [Paper] [Open Access] [arXiv]
- 10 May: Zhao Ren presents our IEEE Transactions on Multimedia article: CAA-Net: Conditional Atrous CNNs With Attention for Explainable Device-Robust Acoustic Scene Classification (Zhao Ren, Qiuqiang Kong, Jing Han, Mark D. Plumbley and Bjorn W Schuller) [Presentation] [Paper] [Open Access] [arXiv] [Code]
- 8 May: Annamaria Mesaros presents our Signal Processing Magazine article: Sound Event Detection: A Tutorial (Annamaria Mesaros, Toni Heittola, Tuomas Virtanen and Mark D Plumbley) [Presentation] [Paper] [Open Access] [arXiv]
- 7 May: Jinbo Hu presents our challenge paper: A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection (Jinbo Hu, Ming Wu, Feiran Yang, Jun Yang, Yin Cao, Mark Plumbley) [Presentation] [Paper] [Open Access] [arXiv] [Code]
5 Apr 2022: At UKAN Workshop on Soundscapes, including leading the afternoon in-person World Cafe, at Imperial College London. [Flyer (PDF)] [Twitter] [2] [3]
7 Mar 2022: At Virtual World Cafe on Exploring Sounds Sensing to Improve Workplace Wellbeing, organised by Dr Emily Corrigan-Kavanagh

2021

3 Dec 2021: Dr Emily Corrigan-Kavanagh presents our paper: Envisioning Sound Sensing Technologies for Enhancing Urban Living (Emily Corrigan-Kavanagh, Andres Fernandez, Mark D. Plumbley) at Environments By Design: Health, Wellbeing And Place, Virtual, 1-3 Dec 2021
15-19 Nov 2021: At the DCASE 2021 Workshop on Detection and Classification of Acoustic Scenes and Events, Online
- 19 Nov: Moderating the Town Hall discussion
  [Twitter] [Webpage]
- 18 Nov: Turab Iqbal presents our poster: ARCA23K: An Audio Dataset for Investigating Open-Set Label Noise (Turab Iqbal, Yin Cao, Andrew Bailey, Mark D. Plumbley, Wenwu Wang)
  [Twitter] [Paper] [Poster] [Video] [Code] [Data]
- 18 Nov: Xinhao Mei presents our poster: Audio Captioning Transformer (Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang)
  [Twitter] [Paper] [Poster] [Video]
- 17 Nov: Andres Fernandez presents our poster: Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection Under Domain-Shift Conditions (Andres Fernandez, Mark D. Plumbley)
  [Twitter] [Paper] [Poster] [arXiv] [Video] [Code] [Data] [Webpage]
- 17 Nov: Xubo Liu and Qiushi Huang present our poster: CL4AC: A Contrastive Loss for Audio Captioning (Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H. Tang, Mark D. Plumbley, Wenwu Wang)
  [Paper] [Poster] [Video] [Code]
- 17 Nov: Xinhao Mei presents our poster: An Encoder-Decoder Based Audio Captioning System with Transfer and Reinforcement Learning (Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H. Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang)
  [Paper] [Poster] [Video] [Code]
12 Nov 2021: At "Envisioning Future Homes for Wellbeing" online event, as a "virtual world cafe". [News] [Graphic]
2 Nov 2021: At "Technologies for Home Wellbeing" in-person event, including "Research and Innovation in Technologies for Home Wellbeing" and "Future Technologies for Home Wellbeing" [Programme (PDF)] [News]
18-20 Oct 2021: At IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2021) (virtual)
15 Sep 2021: Francesco Renna presents our paper: Francesco Renna, Mark Plumbley and Miguel Coimbra: Source separation of the second heart sound via alternating optimization, Proc Computing in Cardiology - CinC 2021, Brno, Czech Republic [Abstract] [PDF Abstract] [Preprint] [Preprint]
23-27 Aug 2021: At 29th European Signal Processing Conference (EUSIPCO 2021) (virtual)
- 27 Aug: Andrew Bailey presents our paper: Andrew Bailey and Mark D. Plumbley. Gender bias in depression detection using audio features. [Paper]
- 25 Aug: Lam Pham presents our paper: Lam Pham, Chris Baume, Qiuqiang Kong, Tassadaq Hussain, Wenwu Wang, Mark Plumbley. An Audio-Based Deep Learning Framework For BBC Television Programme Classification. [Paper]
30 July 2021: Presenting invited talk on "Machine Learning for Sound Sensing" in the Workshop on Applications of Machine Learning at the Twenty Seventh National Conference on Communications (NCC-2021) Virtual Conference, organized by IIT Kanpur and IIT Roorkee, India, 27-30 July 2021 [Twitter]
6-11 June 2021: At the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), Toronto (Virtual), Canada
- 10 June: Yin Cao, Qiuqiang Kong and Turab Iqbal present our poster: An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection (Yin Cao, Turab Iqbal, Qiuqiang Kong, Fengyan An, Wenwu Wang, Mark D. Plumbley) [Twitter] [Paper] [arXiv] [Code]
- 8 June: Jingshu Zhang presents our poster: Weighted magnitude-phase loss for speech dereverberation (Jingshu Zhang, Mark D. Plumbley, Wenwu Wang) [Twitter] [Paper] [Open Access]
19-21 April 2021: See you at the Urban Sound Symposium
- Accepted poster with Dr Emily Corrigan-Kavanagh: "Exploring Sound Sensing to Improve Quality of Life in Urban Living"
24 Mar 2021: Presenting Connected Places Catapult Breakfast Briefing: Enabling Thriving Urban Communities through Sound Sensing AI Technology, with Dr Emily Corrigan-Kavanagh [Twitter] [LinkedIn] [Video]
23-24 Mar 2021: See you at The Turing Presents: AI UK 2021
- Tue 23 Mar (Day 1) at 16:00 Presenting a Spotlight Talk on "AI for Sound"
19-22 Jan 2021: At EUSIPCO 2020: 28th European Signal Processing Conference, Virtual Amsterdam.
- Wed 20 Jan: Emad Grais presents our paper in session ASMSP-7: Speech and Audio Separation: Multi-Band Multi-Resolution Fully Convolutional Neural Networks for Singing Voice Separation (Emad M. Grais, Fei Zhao, Mark D. Plumbley) [Paper]
- Fri 22 Jan: Attending meeting of EURASIP Technical Area Committee on Acoustic, Speech and Music Signal Processing (ASMSP TAC)

2020

2 Dec 2020: Plenary lecture at iTWIST 2020: international Traveling Workshop on Interactions between low-complexity data models and Sensing Techniques, Virtual Nantes, France, December 2-4, 2020
12 Nov 2020: Invited talk with Dr Emily Corrigan-Kavanagh at #LboroAppliedAI seminar series on Applied AI, Univ of Loughborough (virtual)
2-4 Nov 2020: At Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020) Virtual Tokyo, Japan
- Yin Cao presents our paper: Event-independent network for polyphonic sound event localization and detection (Yin Cao, Turab Iqbal, Qiuqiang Kong, Yue Zhong, Wenwu Wang, Mark D. Plumbley) [Paper] [Video] [Code]
- Saeid Safavi presents our poster: Open-window: A sound event dataset for window state detection and recognition (Saeid Safavi, Turab Iqbal, Wenwu Wang, Philip Coleman, Mark D. Plumbley) [Paper] [Video] [Dataset] [Code]
- Our team wins Judges’ Award for DCASE 2020 Challenge Task 5 “Urban Sound Tagging with Spatiotemporal Context”!
  Congratulations to Turab Iqbal, Yin Cao, and Wenwu Wang [News Item] [Twitter]
- Our team wins Reproducible System Award for DCASE 2020 Challenge Task 3 “Sound Event Localization and Detection”!
  Congratulations to Yin Cao, Turab Iqbal, Qiuqiang Kong, Zhong Yue, and Wenwu Wang [News Item] [Twitter] [Paper] [Code]
- Joint 1st place team in DCASE 2020 Challenge Task 5 “Urban Sound Tagging with Spatiotemporal Context” with entry: Turab Iqbal, Yin Cao, Mark D. Plumbley and Wenwu Wang: Incorporating Auxiliary Data for Urban Sound Tagging [Paper] [Code]
14-16 Oct 2020: Attending BBC Sounds Amazing 2020
1 Oct 2020: New award of £2.2 million Multimodal Video Search by Examples (MVSE) [2][3], collaborative project including Hui Wang (Ulster), Mark Gales (Cambridge) and Josef Kittler, Miroslav Bober & Wenwu Wang (Surrey)
1 Oct 2020: Welcome to Marc Green, joining the AI for Sound project
24 Sep 2020: New award of £1.4 million UK Acoustics Network Plus (UKAN+) [News Item]
21 Sep 2020: Welcome to Andres Fernandez, joining the AI for Sound project
16 Sep 2020: Presentation at Huawei Future Device Technology Summit 2020
2 Sep 2020: Welcome to Dr Emily Corrigan-Kavanagh, joining the AI for Sound project
27 Aug 2020: Presentation at online meeting of the Association of Noise Consultants (ANC) [LinkedIn]
1 Jul 2020: Presentation to Samsung AI Centre Cambridge, partners of the EPSRC Fellowship in "AI for Sound".
17 Jun 2020: Recruiting for three researchers on AI for Sound project (deadline 17 July 2020)
5 Jun 2020: Starting new EPSRC Fellowship in "AI for Sound". Press Release: Fellowship to advance sound to new frontiers using AI
27 May 2020: Recruiting for Research Fellow on Advanced Machine Learning for Audio Tagging (deadline 27 June 2020)
4-8 May 2020: (Virtually) at 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)
[Postponed to 2021: 24 Mar 2020: Talk on "AI for Sound" at The Turing Presents: AI UK, London, UK.]
24 Jan 2020: Papers accepted for ICASSP 2020:
- Turab Iqbal, Yin Cao, Qiuqiang Kong, Mark D. Plumbley and Wenwu Wang. Learning with out-of-distribution data for audio classification.
- Qiuqiang Kong, Yuxuan Wang, Xuchen Song, Yin Cao, Wenwu Wang and Mark D. Plumbley. Source separation with weakly labelled data: An approach to computational auditory scene analysis. [arXiv:2002.02065]
15 Jan 2020: At PHEE 2020 Conference, London, UK

2019

22 Nov 2019: Zhao Ren presented our paper at 9th International Conference on Digital Public Health (DPH 2019), Marseille, France: Zhao Ren, Jing Han, Nicholas Cummins, Qiuqiang Kong, Mark D. Plumbley and Björn W. Schuller. Multi-instance learning for bipolar disorder diagnosis using weakly labelled speech data. [Paper] [Open Access]
1 Nov 2019: At UK Acoustics Network (UKAN) 2nd Anniversary Event, London, UK [Twitter]
25-26 Oct 2019: At DCASE2019 Workshop on Detection and Classification of Acoustic Scenes and Events, 25-26 October 2019, New York, USA [Twitter]
- Our team wins the Reproducible System Award for DCASE2019 Task 3!
  Well done to Yin Cao, Turab Iqbal, Qiuqiang Kong, Miguel Blanco Galindo, Wenwu Wang.
  [Paper and Code (reproducible!)]
- Yin Cao presenting our poster: Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy.
  [Paper] [Code] [Twitter]
- Francois Grondin presenting our paper with Iwona Sobieraj and James Glass: Sound event localization and detection using CRNN on pairs of microphones [Paper] [Twitter]
24 Oct 2019: At Speech and Audio in the Northeast (SANE 2019), Columbia University, New York City, USA [Twitter: #SANE2019]
- Yin Cao and Saeid Safavi presented our demo on sound recognition, generalisation and visualisations [Twitter] [Slides] [YouTube]
- Turab Iqbal presented our poster: Learning with Out-of-Distribution Data for Audio Classification [Twitter]
- Yin Cao presented our poster: Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems [Twitter]
20-23 Oct 2019: At IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2019), New Paltz, New York, USA [Twitter: #WASPAA2019]
16 Oct 2019: Talk at Sound Sensing in Smart Cities 2 at Connected Places Catapult, London, on "AI for Sound: A Future Technology for Smart Cities"
30 Sep 2019: Opening Keynote at the 50th Annual Conference of the International Association of Sound and Audiovisual Archives (IASA 2019), Hilversum, The Netherlands [Photos]
5 Sep 2019: Emad Grais presented our paper at EUSIPCO 2019: Grais, Wierstorf, Ward, Mason & Plumbley. Referenceless performance evaluation of audio source separation using deep neural networks [Session: ThAP1: ASMSP PIII, Thursday, September 5 13:40 - 15:40, Poster Area 1] [PDF] [Open Access]
15 Aug 2019: Wenwu Wang presented our paper at IJCAI-19: Kong, Xu, Jackson, Wang, Plumbley. Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks [Session: ML|DL - Deep Learning 7 (L), Thu 15 Aug, 17:15] [Abstract] [Proceedings] [PDF] [arXiv:1906.07552] [Source Code]
23-25 Jul 2019: Co-organized AI Summer School at University of Surrey [Twitter: #AISurreySummerSchool] [Videos]
28 Jun 2019: Great result in DCASE2019 Task 3 "Sound Event Localization and Detection" for CVSSP Team (Yin Cao, Turab Iqbal, Qiuqiang Kong, Miguel Blanco Galindo, Wenwu Wang, Mark Plumbley). Top academic team and 2nd overall [Abstract] [Report] [Source Code]
26 Jun 2019: At Human Motion Analysis for Healthcare Applications Conference, IET London. [Twitter: #Human_Motion_Analysis]
25 Jun 2019: At Soundscape Workshop 2019, London [Twitter: #SoundscapeWorkshop2019]
20 Jun 2019: Talk at Connected Places Catapult, London, on "AI for Sound: A Future Technology for Smarter Living and Travelling" [Twitter (2) (3) (4)]
- See report by UKAuthority: Sound sensors ‘could support smart cities and assistive tech’
13 Jun 2019: Talk at the International Hearing Instruments Developer Forum (HADF 2019), Oldenburg, Germany on "Computational analysis of sound scenes and events"
3 Jun 2019: Paper accepted for EUSIPCO 2019: Emad M. Grais, Hagen Wierstorf, Dominic Ward, Russell Mason and Mark D. Plumbley. Referenceless performance evaluation of audio source separation using deep neural networks. [arXiv:1811.00454]
12-17 May 2019: At IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019), Brighton, UK
- Fri 17 May (pm): Qiuqiang Kong presented our poster: Kong, Xu, Iqbal, Cao, Wang & Plumbley: Acoustic Scene Generation with Conditional SampleRNN [Twitter] [Open Access] [Source Code] [Sound Samples]
- Fri 17 May (am): Christian Kroos presented our talk: Kroos, Bones, Cao, Harris, Jackson, Davies, Wang, Cox & Plumbley: Generalisation in Environmental Sound Classification: The 'Making Sense of Sounds' Data Set and Challenge [Open Access] [MSoS Challenge] [Dataset]
- Tue 14 May: Co-authoring three papers in session "Detection and Classification of Acoustic Scenes and Events II"
  - Zhao Ren presented our talk: Ren, Kong, Han, Plumbley, Schuller: Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes [Twitter] [Open Access]
  - Yuanbo Hou presented our talk: Hou, Kong, Li, Plumbley: Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering [Twitter] [Open Access]
  - Zuzanna Podwinska presented our talk: Podwinska, Sobieraj, Fazenda, Davies, Plumbley: Acoustic Event Detection from Weakly Labeled Data Using Auditory Salience [Twitter] [Open Access]
- 13 May: Many congratulations to Zhijin Qin, presented with IEEE SPS Young Author Best Paper Award for: Qin, Gao, Plumbley, Parini. Wideband spectrum sensing on real-time signals at sub-Nyquist sampling rates in single and cooperative multiple nodes. IEEE Transactions on Signal Processing. 64: 3106-3117, 2016 [Twitter]
- 12-13 May: Tutorials Co-chair
10 May 2019: Paper accepted for IJCAI-19: Qiuqiang Kong, Yong Xu, Wenwu Wang, Philip J.B. Jackson and Mark D. Plumbley. Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks. [Abstract] [arXiv:1906.07552] [Source Code] (Acceptance rate: 18%)
18 Apr 2019: Now available Open Access: Chungeun Kim, Emad M. Grais, Russell Mason and Mark D. Plumbley. Perception of phase changes in the context of musical audio source separation. AES 145th Convention, New York, 17-20 October 2018. [Open Access]
16 Apr 2019: Cross-task baseline code for all DCASE 2019 tasks released. Report: Qiuqiang Kong, Yin Cao, Turab Iqbal, Yong Xu, Wenwu Wang and Mark D. Plumbley. Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems [arXiv:1904.03476]
Source Code:
- Task 1: https://github.com/qiuqiangkong/dcase2019_task1
- Task 2: https://github.com/qiuqiangkong/dcase2019_task2
- Task 3: https://github.com/qiuqiangkong/dcase2019_task3
- Task 4: https://github.com/qiuqiangkong/dcase2019_task4
- Task 5: https://github.com/qiuqiangkong/dcase2019_task5
4 Apr 2019: Presenting work from our sound recognition projects at the CVSSP 30th Anniversary event [Twitter (2)]
13 Mar 2019: Co-organizing kick-off meeting of AI@Surrey, bringing together AI related research across the University of Surrey [Twitter (2)]
20 Feb 2019: Paper published: Qiuqiang Kong, Yong Xu, Iwona Sobieraj, Wenwu Wang and Mark D. Plumbley. Sound event detection and time-frequency segmentation from weakly labelled data. IEEE/ACM Transactions on Audio, Speech and Language Processing 27(4): 777-787, April 2019. DOI:10.1109/TASLP.2019.2895254 [Open Access] [Source Code]
31 Jan 2019: AudioCommons project completed. See also: Deliverables & Papers; Tools & Resources
2 Jan 2019: Paper published: Estefanía Cano, Derry FitzGerald, Antoine Liutkus, Mark D. Plumbley and Fabian-Robert Stöter. Musical Source Separation: An Introduction. IEEE Signal Processing Magazine 36(1):31-40, January 2019. DOI:10.1109/MSP.2018.2874719 [Open Access]

2018

18 Dec 2018: Tropical-themed CVSSP Christmas Party! [Twitter]
21-22 Nov 2018: Hosting the 5th General Meeting of the AudioCommons project at the University of Surrey [Twitter]
19-20 Nov 2018: Co-chair of DCASE 2018 Workshop on Detection and Classification of Acoustic Scenes and Events, Surrey, UK [Twitter: #DCASE2018] [Proceedings]
- 20 Nov:
  - Chairing the final Panel Session [Twitter]
  - Qiuqiang Kong presenting our poster: Kong, Iqbal, Xu, Wang, Plumbley: DCASE 2018 Challenge Surrey cross-task convolutional neural network baseline [Twitter]
    Code:
    - Task 1: https://github.com/qiuqiangkong/dcase2018_task1
    - Task 2: https://github.com/qiuqiangkong/dcase2018_task2
    - Task 3: https://github.com/qiuqiangkong/dcase2018_task3
    - Task 4: https://github.com/qiuqiangkong/dcase2018_task4
    - Task 5: https://github.com/qiuqiangkong/dcase2018_task5
- 19 Nov:
  - Turab Iqbal and Qiuqiang Kong presenting our poster: Iqbal, Kong, Plumbley, Wang: General-purpose audio tagging from noisy labels using convolutional neural networks
  - Christian Kroos presenting our poster: Kroos, Bones, Cao, Harris, Jackson, Davies, Wang, Cox, Plumbley: The Making Sense of Sounds Challenge
  - Zhao Ren presenting our talk: Ren, Kong, Qian, Plumbley, Schuller: Attention-based convolutional neural networks for acoustic scene classification [Twitter]
  - Opening the workshop [Twitter]
18 Sep 2018: CVSSP team (with Turab Iqbal, Qiuqiang Kong, Wenwu Wang) ranked 3rd out of 558 systems on Kaggle for DCASE 2018 Challenge Task 2: Freesound General-Purpose Audio Tagging [News item]
- Report: Turab Iqbal, Qiuqiang Kong, Mark D. Plumbley and Wenwu Wang. Stacked convolutional neural networks for general-purpose audio tagging. Technical Report, Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) Challenge, September 2018 [Software Code]
8 Aug 2018: Making Sense of Sounds Challenge announced
26 Jul 2018: Shengchen Li presents our paper at the 37th Chinese Control Conference: Li, Dixon & Plumbley. A demonstration of hierarchical structure usage in expressive timing analysis by model selection tests. In: Proceedings of the 37th Chinese Control Conference, CCC2018, Wuhan, China, 25-27 July 2018, pp 3190-3195. DOI:10.23919/ChiCC.2018.8483169 [Open Access]
6 Jul 2018: Co-Chair of Audio Day 2018, University of Surrey, Guildford, UK [Twitter: #SurreyAudioDay]
- Hosting the closing Panel Discussion [Twitter]
- From the AudioCommons project:
  - Postdoc Saeid Safavi presents work on predicting the perceived level of reverberation using machine learning [Twitter]
  - Postdoc Andy Pearce presents work on timbral attributes for sound effects [Twitter]
  - Wenwu Wang introduces the project [Twitter]
- I present an overview of our work on the Making Sense of Sounds project [Twitter (2)]
- From the Musical Audio Repurposing using Source Separation project:
  - Postdoc Ryan Kim presents our work on perceptual evaluation of audio source separation [Twitter (2)]
  - Emad Grais, former postdoc, presents our work on audio source separation using deep learning [Twitter]
2-5 Jul 2018: Co-Chair of LVA/ICA 2018: 14th International Conference on Latent Variable Analysis and Signal Separation, University of Surrey, Guildford, UK [Twitter: #LVAICA2018]
- See also: Dan Stowell: Notes from LVA-ICA conference 2018
- 5 Jul
  - Congratulations to Lucas Rencker for a "Best Student Paper Award"!
  - Lucas Rencker presents our talk: Rencker, Bach, Wang & Plumbley. Consistent dictionary learning for signal declipping [Twitter] [Open Access]
  - Alfredo Zermini presents our poster: Zermini, Kong, Xu, Plumbley & Wang. Improving Reverberant Speech Separation with Binaural Cues Using Temporal Context and Convolutional Neural Networks [Twitter] [Open Access]
- 4 Jul
  - Cian O’Brien presenting our late-breaking poster: O’Brien & Plumbley: Latent Mixture Models for Automatic Music Transcription
  - Qiuqiang Kong presenting our SiSEC poster: Ward, Kong & Plumbley. Source Separation with Long Time Dependency Gated Recurrent Units Neural Networks [Description] [Twitter]
- 3 Jul
  - Emad M. Grais presenting our poster: Grais, Wierstorf, Ward & Plumbley. Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation [Twitter] [Open Access]
  - Iwona Sobieraj presenting our late-breaking poster: Sobieraj, Rencker & Plumbley. Orthogonality-Regularized Masked NMF with KL-Divergence for Learning on Weakly Labeled Audio Data [Twitter]
  - Opening the conference [Twitter]
5 Jun 2018: At introductory meeting of Communication and Room Acoustics SIG of UK Acoustics Network
23-26 May 2018: Group members at AES Milan 2018: 144th International Pro Audio Convention.
- Fri 25 May: Emad Grais presented our poster Grais & Plumbley: Combining Fully Convolutional and Recurrent Neural Networks for Single Channel Audio Source Separation [Twitter] [Details] [Poster (pdf)]
- Thu 24 May: PhD student Qiuqiang Kong (Panellist) at Workshop W11: Deep Learning for Audio Applications [Twitter]
- Wed 23 May: Workshop (panel discussion) on Audio Repurposing Using Source Separation organized by our EPSRC project Musical Audio Repurposing using Source Separation, Chaired by Phil Coleman with postdoc Chungeun (Ryan) Kim (Panellist) [Twitter] [Blog from Phil Coleman]
18 May 2018: Visit to Audio Analytic at end of successful stay by PhD student Iwona Sobieraj.
18 May 2018: Papers accepted for EUSIPCO 2018:
- Emad M. Grais, Dominic Ward and Mark D. Plumbley. Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders.
- Cian O’Brien and Mark D. Plumbley. A Hierarchical Latent Mixture Model for Polyphonic Music Analysis.
3 May 2018: Congratulations to Prof Trevor Cox, collaborator on the EPSRC-funded Making Sense of Sounds project, on new book Now You're Talking: Human Conversation from the Neanderthals to Artificial Intelligence.
30 Apr 2018: Congratulations to Dr Chris Baume on official award of PhD, for his thesis Semantic Audio Tools for Radio Production, funded by BBC R&D as part of the BBC Audio Research Partnership.
29 Apr 2018: Now Published (Open Access): Chris Baume, Mark D. Plumbley, David Frohlich, Janko Ćalić. PaperClip: A digital pen interface for semantic speech editing in radio production. Journal of the Audio Engineering Society 66(4):241-252, April 2018. DOI:10.17743/jaes.2018.0006 [Full text also at: http://epubs.surrey.ac.uk/845786/]
16-20 Apr 2018: At IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, Alberta, Canada. [Twitter: #ICASSP2018]
- Iwona Sobieraj presented our poster Sobieraj, Rencker & Plumbley: Orthogonality-regularized masked NMF for learning on weakly labeled audio data [Tue 17 April, 13:30 - 15:30, Poster Area E] [Twitter]
- Qiuqiang Kong & Yong Xu presented two posters: (a) Kong, Xu, Wang & Plumbley: Audio Set classification with attention model: A probabilistic perspective and (b) Kong, Xu, Wang & Plumbley: A joint separation-classification model for sound event detection of weakly labelled data [Wed 18 Apr, 08:30 - 10:30, Poster Area C] [Twitter]
- Wenwu Wang & Philip Jackson presented our poster Huang, Jackson, Plumbley & Wang: Synthesis of images by two-stage generative adversarial networks [Wed 18 Apr, 13:30 - 15:30, Poster Area G] [Twitter]
- Yong Xu presented talk Xu, Kong, Wang & Plumbley: Large-scale weakly supervised audio classification using gated convolutional neural network [Thu 19 Apr, 13:30 - 13:50] [Twitter]
- Emad Grais presented our poster Ward, Wierstorf, Mason, Grais & Plumbley. BSS Eval or PEASS? Predicting the perception of singing-voice separation [Thu 19 April, 16:00 - 18:00, Poster Area D] [Twitter] [Details] [Study info] [Code]
- Cian O'Brien presented our poster O'Brien & Plumbley: Inexact proximal operators for lp-quasinorm minimization [Fri 20 Apr, 16:00 - 18:00, Poster Area G] [Twitter]
- Also: Springer stand with our new book: Tuomas Virtanen, Mark D. Plumbley & Dan Ellis: Computational Analysis of Sound Scenes and Events [Twitter]
1 Apr 2018: Now Published (Open Access): Chris Baume, Mark D. Plumbley, Janko Ćalić and David Frohlich. A contextual study of semantic speech editing in radio production. International Journal of Human-Computer Studies 115:67-80, July 2018. DOI:10.1016/j.ijhcs.2018.03.006 [Full text also at: http://epubs.surrey.ac.uk/846079/]
1 Apr 2018: Paper accepted for 37th Chinese Control Conference, CCC2018: Shengchen Li, Simon Dixon and Mark D. Plumbley. A demonstration of hierarchical structure usage in expressive timing analysis by model selection tests.
23 Mar 2018: At SpaRTaN-MacSeNet Workshop on Sparse Representations and Compressed Sensing in Paris, organized by our SpaRTaN and MacSeNet Initial/Innovative Training Networks. [Twitter: #SpartanMacsenet]
- Included presentations from Surrey PhD students Cian O'Brien [Twitter], Alfredo Zermini [Twitter], Lucas Rencker [Twitter] and Iwona Sobieraj [Twitter]
22 Mar 2018: At SpaRTaN-MacSeNet Training Workshop, Paris [Twitter]
19 Mar 2018: Papers accepted for LVA/ICA 2018:
- Emad M. Grais, Hagen Wierstorf, Dominic Ward and Mark D. Plumbley. Multi-resolution fully convolutional neural networks for monaural audio source separation.
- Lucas Rencker, Francis Bach, Wenwu Wang and Mark D. Plumbley. Consistent dictionary learning for signal declipping.
- Alfredo Zermini, Qiuqiang Kong, Yong Xu, Mark D. Plumbley and Wenwu Wang. Improving reverberant speech separation with binaural cues using temporal context and convolutional neural networks.
14 Mar 2018: At launch of the Digital Catapult's Machine Intelligence Garage, London [Twitter: #MachineIntelligenceGarage]
6 Mar 2018: Welcome to Saeid Safavi, joining the EU H2020 project Audio Commons.
5 Mar 2018: Welcome to Ryan Kim, joining the EPSRC-funded project Musical Audio Repurposing using Source Separation
23 Feb 2018: Paper accepted for HCI International 2018: Tijs Duel, David M. Frohlich, Christian Kroos, Yong Xu, Philip J. B. Jackson and Mark D. Plumbley. Supporting audiography: Design of a system for sentimental sound recording, classification and playback.
22 Feb 2018: At 4th General Meeting of the AudioCommons project in Barcelona [Twitter (2)]
14 Feb 2018: Paper accepted for AES Milan 2018: Emad M. Grais and Mark D. Plumbley. Combining fully convolutional and recurrent neural networks for single channel audio source separation.
12 Feb 2018: Welcome to Manal Helal, joining the EPSRC-funded project Musical Audio Repurposing using Source Separation
29 Jan 2018: Papers accepted for ICASSP 2018:
- Qiang Huang, Philip Jackson, Mark D. Plumbley and Wenwu Wang. Synthesis of images by two-stage generative adversarial networks.
- Qiuqiang Kong, Yong Xu, Wenwu Wang and Mark D. Plumbley. A joint separation-classification model for sound event detection of weakly labelled data.
- Qiuqiang Kong, Yong Xu, Wenwu Wang and Mark D. Plumbley. Audio Set classification with attention model: A probabilistic perspective.
- Cian O'Brien and Mark D. Plumbley. Inexact proximal operators for lp-quasinorm minimization.
- Iwona Sobieraj, Lucas Rencker and Mark D. Plumbley. Orthogonality-regularized masked NMF for learning on weakly labeled audio data.
- Dominic Ward, Hagen Wierstorf, Russell D. Mason, Emad M. Grais and Mark D. Plumbley. BSS Eval or PEASS? Predicting the perception of singing-voice separation.
- Yong Xu, Qiuqiang Kong, Wenwu Wang and Mark D. Plumbley. Large-scale weakly supervised audio classification using gated convolutional neural network.

2017

4 Dec 2017: Iwona Sobieraj presents poster at Women in Machine Learning Workshop (WiML 2017): Masked Non-negative Matrix Factorization for Bird Detection Using Weakly Labelled Data
27 Nov 2017: Introducing the Communication Acoustics SIG at the Launch of UK Acoustics Network [Event] [Photos] [Twitter: #UKAcousticsLaunch]
16-17 Nov 2017: At Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2017), Munich, Germany. [Twitter: #DCASE2017]
- Christian Kroos presented our paper: Kroos & Plumbley: Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study [Twitter (2) (3) (4)] [Slides] [Video]
- Yong Xu and Qiuqiang Kong presented our technical report poster: Surrey-CVSSP System for DCASE2017 Challenge Task4 [Twitter]
15 Nov 2017: At Meeting of RCUK Digital Economy Programme Advisory Board (DE PAB), Swindon.
14 Nov 2017: Emad Grais at GlobalSIP 2017 presenting our poster: Grais & Plumbley: Single Channel Audio Source Separation using Convolutional Denoising Autoencoders.
1 Nov 2017: Chairing Advisory Board of the Software Sustainability Institute, Southampton.
25 Oct 2017: At Dutch Design Week 2017, Eindhoven, The Netherlands.
23-27 Oct 2017: Presentations from group at ISMIR 2017, Suzhou, China
- 27 Oct: Qiuqiang Kong presents our Late-Breaking poster "Music source separation using weakly labelled data" [Extended abstract]
- 25 Oct: Shengchen Li presents our poster "Clustering expressive timing with regressed polynomial coefficients demonstrated by a model selection test" [Paper]
20 Oct 2017: Hagen Wierstorf presents our talk "Perceptual Evaluation of Source Separation for Remixing Music" at AES New York 2017 [Abstract] [Twitter] [Paper] [Open Access]
19 Oct 2017: At Speech and Audio in the North East (SANE 2017), New York, NY, USA [Twitter: #SANE2017]
- Presented our poster: Signal Processing, Psychoacoustic Engineering and Digital Worlds: Interdisciplinary Audio Research at the University of Surrey [Twitter]
16-18 Oct 2017: At WASPAA 2017, New Paltz, NY, USA
- 18 Oct: Keynote talk: Making Sense of Sounds: Machine Listening in the Real World [Twitter]
6 Oct 2017: Delighted that Department of Computer Science receives Athena SWAN Bronze award, glad to have helped the April 2017 submission.
4-7 Sep 2017: Handing back to Helen Treharne as returning Head of Department of Computer Science.
28 Aug - 2 Sep 2017: Presentations from group at EUSIPCO 2017:
- Qiuqiang Kong, Yong Xu and Mark D. Plumbley. Joint detection and classification convolutional neural network on weakly labelled bird audio detection.
- Cian O’Brien and Mark D. Plumbley. Automatic music transcription using low rank non-negative matrix decomposition.
- Lucas Rencker, Wenwu Wang and Mark D. Plumbley. Multivariate Iterative Hard Thresholding for sparse decomposition with flexible sparsity patterns.
- Iwona Sobieraj, Qiuqiang Kong and Mark D. Plumbley. Masked non-negative matrix factorization for bird detection using weakly labeled data.
22 Jun 2017: At 2017 AES International Conference on Semantic Audio, Erlangen, Germany [Twitter: #aessa17]
- See also: Report from Brecht De Man
- Presenting Opening Keynote: "Audio Event Detection and Scene Recognition" [Twitter]
8 Jun 2017: At Cheltenham Science Festival for Panel Discussion on Is Your Tech Listening To You? with Jason Nurse and Rory Cellan-Jones. Featured with interview in BBC Tech Tent 9 Jun 2017

News

11 FEB 2025

Noise Network Plus: The £1.8 million initiative to engineer a quieter future

21 JUN 2024

Nature sounds, noise filters and enhanced doorbells - how could AI change the way your home sounds?

20 MAR 2023

Efficient audio-based CNNs via filter pruning

The distress analysis interview corpus of human and computer interviews

07 JUL 2021

Open research principles for depression diagnosis with machine learning

18 NOV 2020

CVSSP researchers win two awards at DCASE 2020

24 SEP 2020

University of Surrey to boost UK's world-leading acoustic technologies, thanks to EPSRC grant

05 JUN 2020

Fellowship to advance sound to new frontiers using AI

In the media

25 March 2025

Meet the researcher – Mark Plumbley

gen ai - The AI Hub in Generative Models

21 February 2025

That's TV spoke to Mark Plumbley about a project aiming to reduce noise pollution for future generations

That's TV Surrey

12 February 2025

GB: al via nuovo progetto contro inquinamento acustico [New project against noise pollution underway]

[In Italian]

30Science.com

12 February 2025

University leads £1.8m noise pollution project

BBC News, South East

11 February 2025

Noise Network Plus: the £1.8 million initiative to engineer a quieter future

pro-manchester

11 February 2025

£1.8 million will support initiative to engineer a quieter future

Today Headline

23 January 2025

EPSRC funding work into tomorrow’s engineering research challenges

The Engineer

04 July 2022

Making the virtual reality world sound right

[Interviewed 0:17:20-0:22:00]

BBC Tech Tent

27 June 2022

Sound of the metaverse: Meta creates AI models to improve virtual audio

Tech Monitor

24 June 2019

Sound sensors ‘could support smart cities and assistive tech’

UKAuthority

09 June 2017

Tech Tent: fake news, algorithms and listening gadgets

BBC Tech Tent [Overview]

09 June 2017

Cheltenham Science Festival 2017

BBC Tech Tent [Podcast]

09 June 2017

Hackers could use Alexa or Siri to spy on you, cybersecurity expert warns

The Times

02 December 2016

Computer learns to recognize sounds by watching video

MIT News

03 November 2016

Binge-watching videos teaches computers to recognise sounds

New Scientist

Research

Research interests

My research concerns AI for Sound: using machine learning and signal processing for analysis and recognition of sounds. My focus is on detection, classification and separation of acoustic scenes and events, particularly real-world sounds, using methods such as deep learning, sparse representations and probabilistic models.

I have published over 400 papers in journals, conferences and books, including over 70 journal papers and the recent Springer co-edited book on Computational Analysis of Sound Scenes and Events.

Much of my research is funded by grants from EPSRC, the EU, Innovate UK and other sources. I currently hold an EPSRC Fellowship on "AI for Sound", and recently led EPSRC projects Making Sense of Sounds and Musical Audio Repurposing using Source Separation, and two EU research training networks, SpaRTaN and MacSeNet. My total grant funding is around £54M, including £20M as Principal Investigator, Coordinator or Lead Applicant.

I was co-Chair of the DCASE 2018 Workshop on Detection and Classification of Acoustic Scenes and Events, co-Chair of the 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2018) and co-Chair of the Signal Processing with Adaptive Sparse Structured Representations (SPARS 2017) workshop.

Research projects

UK Acoustics Network Plus

EPSRC grant EP/V007866/1, £1.4M total, £40k to Surrey, Apr 2021 – Mar 2025. Joint with Horoshenkov (Sheffield), Bristow (Surrey), plus 10 others at Imperial, Loughborough, Salford, LSBU, Manchester, Bath, Reading, Southampton, Nottingham Trent.

Multimodal Video Search by Examples (MVSE)

EPSRC grant EP/V002856/1, £2.2M total, £863k to Surrey, Apr 2021 – Sep 2024. Joint with Kittler, Wang & Bober (Surrey); Wang (QUB), Bond & Mulvenna (Ulster); Gales (Cambridge), BBC R&D.

AI for Sound

EPSRC Fellowship EP/T019751/1, £2.1M, May 2020 – Apr 2025.

Making Sense of Sounds

EPSRC grant EP/N014111/1, £1.28M total, £875k to Surrey, Mar 2016 – Mar 2019. Joint with University of Salford.

Musical Audio Repurposing using Source Separation

EPSRC grant EP/L027119/1 (Queen Mary) & EP/L027119/2 (Surrey), £887,607, Nov 2014 – Oct 2018.

Audio-Visual Media Research Platform

EPSRC grant EP/P022529/1, £1.58M, Aug 2017 – Jul 2022.

Audio Commons: Bringing Creative Commons audio content to the creative industries

EU H2020 Research and Innovation Grant 688382, total €2.98M (~£2.2M), £470k to Surrey, Feb 2016 – Jan 2019.

MacSeNet: Machine Sensing Training Network

EU Marie Skłodowska-Curie Action (MSCA) Innovative Training Network H2020-MSCA-ITN-2014 project 642685, Total €3.8M (£3.0M), £560k to Surrey, 2015-2018.

SpaRTaN: Sparse Representations and Compressed Sensing Training Network

EU Marie Curie Actions Initial Training Networks, FP7-PEOPLE-2013-ITN project 607290. Total €2.8M (£2.1M), €840k (£630k) to Surrey, 2014-2018.

Supervision

Postgraduate research supervision

I welcome applications from excellent students who wish to study for a PhD in my areas of research interest. Some example potential PhD projects are given below. I am also happy to discuss your own ideas for a project that you think may be of interest.

For more information on undertaking PhD study see: Centre for Vision, Speech and Signal Processing: PhD Study. For information on how to apply, see: Vision, Speech and Signal Processing PhD.

Example potential PhD projects

Deep learning for audio models and representations

The aim of this project is to investigate new machine learning methods for discovering good representations from audio sensor data. Deep learning has demonstrated very good performance on many machine learning problems, including audio event detection and acoustic scene classification. However, the theory of the deep learning process is not well understood, and these models typically require large amounts of training data. Recent advances based on information theory, such as the Information Bottleneck method, and model compression offer a promising direction for learning deep learning models and representations. The project will pay particular attention to methods for learning without labels, such as unsupervised and self-supervised learning. Project outcomes are expected to include new methods for constructing and learning efficient models and representations of audio data, making best use of available training data and labels, where available.

Privacy-preserving machine learning for sound sensing

The aim of this project is to build methods for machine learning for sound sensors, while preserving privacy of people or activities that are part of the sensed sound environment. For sound monitoring, we may wish to deploy sound sensors in public spaces, homes, or on personal devices. If the traditional approach of a large central data store were used for machine learning on deployed sensors, it may be difficult to achieve privacy of individuals whose sounds were picked up by sensors. Recently, methods have been proposed that attempt to preserve privacy. For example, federated learning allows training to take place on distributed client devices, without the data itself being transferred to a central server, while differential privacy allows aggregate information about a dataset to be disclosed without disclosing information on individual records. This project will use these and related methods to investigate and develop new methods to allow sound sensors to be updated based on sounds captured by deployed sensors, while avoiding distribution of private or sensitive information.
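
As a rough illustration of the two techniques named above, the sketch below shows (a) federated averaging, where only model weights leave each device, and (b) the Laplace mechanism for releasing a differentially private count. The function names, the plain-list weight vectors and the `epsilon` value are assumptions made for this example; they are not taken from the project itself.

```python
import random


def federated_average(client_weights, client_sizes):
    """Average per-device weight vectors, weighted by local example counts.

    Only the weights reach the server; the raw audio stays on each device.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(flags, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    A count changes by at most 1 when one record is added or removed, so
    the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    return sum(1 for f in flags if f) + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical weights from two devices holding 1 and 3 local clips.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
    # Hypothetical per-clip flags: did the clip contain speech?
    rng = random.Random(0)
    print(private_count([True, False, True, True], epsilon=1.0, rng=rng))
```

A smaller `epsilon` adds more noise and gives stronger privacy, at the cost of a less accurate count.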

Fair machine learning of sound data

The aim of this project is to investigate potential biases in sound-related datasets and challenges, and develop ways to overcome such biases to ensure fair machine learning approaches for sound data. When machine learning is used to tackle real-world problems, we should ensure that these models do not incorporate bias. Biases in datasets may not only result in skewed classification performance, such as over-reporting of performance, but can also result in subjects with protected characteristics, such as race or gender, being unfairly penalised. Biases can occur due to many factors, such as imbalanced data collection, correlations of factors with protected characteristics, or the machine learning models themselves. Fair Machine Learning (FML) is an area of research that explores the idea of fair, unbiased classification, in an attempt to work towards a fairer society through the use of machine learning. This project will investigate potential biases in databases such as DCASE challenge databases, and investigate methodologies and tools to document, measure and counteract biases. We will also consider how to make such methods understandable, including investigating metrics to evaluate the transparency of different approaches, and explore interpretability frameworks for deep learning models used for audio classification, and how to improve these.

Active Learning for sounds

We will investigate methods for active learning for sound classification, where users are queried interactively for labels which are used to update labels and class boundaries. As well as reducing the burden of labelling compared to "brute force" labelling from scratch, active learning offers the potential for individual users to create personalized sets of classes which are more useful to their intended application. Active learning can also help preserve user privacy, using models created locally based on the user data, without transmitting this data outside of the device. Initial investigation may start from simulated semi-supervised active learning, where datasets are used which are fully labelled but where the labels are queried one at a time. Developed methods will be tested on real subjects, which may include users recruited via crowdsourcing approaches. We will also investigate combinations of active learning with methods designed to work with small numbers of labelled data, such as few-shot learning and self-supervised learning, to develop efficient methods that can learn sound classes from a small number of user queries.
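
The simulated setting described above, where a fully labelled dataset has its labels revealed one query at a time, can be sketched as a pool-based loop. This is a minimal sketch only: the nearest-centroid classifier, the margin-based uncertainty score and all function names are assumptions chosen to keep the example short, not the methods the project would necessarily use.

```python
import math


def distance(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def class_centroids(points, labels):
    """Mean feature vector of each class seen so far."""
    grouped = {}
    for p, y in zip(points, labels):
        grouped.setdefault(y, []).append(p)
    return {
        y: [sum(col) / len(ps) for col in zip(*ps)]
        for y, ps in grouped.items()
    }


def margin(point, centroids):
    """Gap between the two nearest class centroids; small means uncertain."""
    nearest = sorted(distance(point, c) for c in centroids.values())
    return nearest[1] - nearest[0]


def simulate_active_learning(features, hidden_labels, seed_indices, n_queries):
    """Return the order in which items get labelled.

    The seed items must cover at least two classes. Labels of pool items
    stay hidden until the item is queried (the simulated "user" answers).
    """
    labelled = list(seed_indices)
    pool = [i for i in range(len(features)) if i not in labelled]
    for _ in range(min(n_queries, len(pool))):
        centroids = class_centroids(
            [features[i] for i in labelled],
            [hidden_labels[i] for i in labelled],
        )
        # Query the pool item the current model is least sure about.
        query = min(pool, key=lambda i: margin(features[i], centroids))
        pool.remove(query)
        labelled.append(query)  # hidden_labels[query] is now revealed
    return labelled


if __name__ == "__main__":
    # Hypothetical one-dimensional features for six clips in two classes.
    feats = [[0.0], [1.0], [4.0], [5.5], [9.0], [10.0]]
    labels = [0, 0, 0, 1, 1, 1]
    print(simulate_active_learning(feats, labels, seed_indices=[0, 5], n_queries=2))
```

In this toy run the loop asks about the clips nearest the class boundary first, which is the behaviour that reduces labelling effort compared with labelling every clip.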

Measurement of soundscapes and noise

The aim of this PhD project is to use sound recognition to develop new estimates of sound impacts on people, to enable improved measurement and design of soundscapes. Exposure to noise can have significant physical and mental health implications: The annual UK cost of noise impacts is 7 billion–10 billion GBP. Most current noise measurements and noise policies rely on simple loudness measures (Sound Pressure Level, measured in e.g. dBA), which do not fully reflect the real impact on individual people. Automatically recognizing different sound sources, and creating new measures that reflect the impact of these sounds on different people, could help us to improve people’s experience of our sonic environment. In this project we will investigate and propose predictive soundscape models based on sound source recognition. The project may start from baseline sound recognition models, mapping these onto a valence-arousal affect (emotion) space, and would include soundwalks and laboratory studies to inform the development of new models, and to validate models on unseen data and surveys on indoor and/or outdoor settings. Methods produced by this project may contribute to future work on soundscape standards, and how these are used in future noise and soundscape policies to improve wellbeing of citizens.
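
To make the limitation of a single level figure concrete, the sketch below computes an equivalent continuous sound pressure level relative to the standard 20 µPa reference. It is a simplified illustration: the A-weighting filter that dBA requires is omitted, and the function name and sample values are assumptions for this example.

```python
import math

P_REF = 20e-6  # reference pressure in pascals (20 micropascals)


def equivalent_level_db(pressures_pa):
    """Unweighted equivalent continuous level, in dB re 20 uPa."""
    mean_square = sum(p * p for p in pressures_pa) / len(pressures_pa)
    return 10.0 * math.log10(mean_square / (P_REF ** 2))


if __name__ == "__main__":
    # Two hypothetical pressure sequences with the same mean-square value:
    steady = [1.0] * 8                     # constant 1 Pa
    bursty = [2.0, 0.0, 0.0, 0.0] * 2      # short bursts separated by silence
    print(round(equivalent_level_db(steady), 2))
    print(round(equivalent_level_db(bursty), 2))
```

Both sequences give the same level even though one is steady and the other is intermittent, which is the kind of difference a source-aware measure is intended to capture.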

Teaching

My main teaching is for EEE3008 Fundamentals of Digital Signal Processing.

I also supervise Year 3 Projects for EEE programmes (EEE3017) and MSc Dissertation Projects for EEE and CS (EEEM004 and COMM002).

Publications

Rhys Burchett-Vass, Arshdeep Singh, Gabriel Bibbó, Mark D. Plumbley (2025) Integrating IP broadcasting with audio tags: Workflow and challenges, In: Audio Engineering Society Conference on AI and Machine Learning for Audio (AES AIMLA) Audio Engineering Society (AES)

The broadcasting industry has adopted IP technologies, revolutionising both live and pre-recorded content production, from news gathering to live music events. IP broadcasting allows for the transport of audio and video signals in an easily configurable way, aligning with modern networking techniques. This shift towards an IP workflow allows for much greater flexibility, not only in routing signals but with the integration of tools using standard web development techniques. One possible tool could include the use of live audio tagging, which has a number of uses in the production of content. These could include adding sound effects to automated closed captioning or identifying unwanted sound events within a scene. In this paper, we describe the process of containerising an audio tagging model into a microservice, a small segregated code module that can be integrated into a multitude of different network setups. The goal is to develop a modular, accessible, and flexible tool capable of seamless deployment into broadcasting workflows of all sizes, from small productions to large corporations. Challenges surrounding latency of the selected audio tagging model and its effect on the usefulness of the end product are discussed.

Paul A. Bereuter, Benjamin Stahl, Mark D. Plumbley, Alois Sontacchi (2025) Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models, In: Proceedings of the 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2025) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/WASPAA66052.2025.11230934

Traditional Blind Source Separation Evaluation (BSS-Eval) metrics were originally designed to evaluate linear audio source separation models based on methods such as time-frequency masking. However, recent generative models may introduce nonlinear relationships between the separated and reference signals, limiting the reliability of these metrics for objective evaluation. To address this issue, we conduct a Degradation Category Rating listening test and analyze correlations between the obtained degradation mean opinion scores (DMOS) and a set of objective audio quality metrics for the task of singing voice separation. We evaluate three state-of-the-art discriminative models and two new, competitive generative models. For both discriminative and generative models, intrusive embedding-based metrics show higher correlations with DMOS than conventional intrusive metrics such as BSS-Eval metrics. For discriminative models, the highest correlation is achieved by the MSE computed on Music2Latent embeddings. When it comes to the evaluation of generative models, the strongest correlations are evident for the multi-resolution STFT loss and the MSE calculated on MERT-L12 embeddings, with the latter also providing the most balanced correlation across both model types. Our results highlight the limitations of BSS-Eval metrics for evaluating generative singing voice separation models and emphasize the need for careful selection and validation of alternative evaluation metrics for the task of singing voice separation.

Gabriel Bibbó, Craig Cieciura, Mark D. Plumbley (2025) Room Acoustics and Microphone Characteristics Show Systematic Impact on Sound Event Recognition, In: Proceedings of the 54th International Congress and Exposition on Noise Control Engineering (INTER-NOISE 2025) International Institute of Noise Control Engineering (I-INCE)

DOI: 10.3397/IN_2025_1070751

The robustness of audio pattern recognition systems under varying acoustic conditions and hardware remains a critical challenge for real-world applications. We examine how room acoustics, microphone characteristics, and overlapping events affect classification performance for domestic events. We conducted experiments in four rooms at the University of Surrey—with reverberation times (RT60: 0.27–0.78 s, 50 Hz–10 kHz) and clarity indices (C50: 11.6–18.5 dB; C80: 13.1–25.9 dB, 500 Hz–1 kHz)—using four microphones: USB Condenser, ICS-43432 stereo, AudioMoth, and Earthworks M23 reference. For two CNN-14 architectures, baseline performance obtained from the original audio was used for comparison with different microphone/room configurations. Results expressed as the percentage of audio frames correctly detected versus ground truth show: First, high RT60 degraded detection of impulsive events (e.g., door knocks) by approximately 50%, while sustained events (e.g., speech, music) remained above 90%. Second, overlapping events produced masking effects that reduced performance by about 20%. Third, while microphone differences affect accuracy, low-cost devices matched reference performance for speech and music classes. Both CNN-14 architectures exhibited similar degradation patterns across conditions. These results underscore the need for improved acoustic characterization and hardware-aware processing. We suggest that future work should integrate adaptive feature extraction and training strategies to mitigate reverberation and overlap in complex environments.

Thomas Edward Deacon, David Mark Frohlich, Mark David Plumbley (2025)Soundscape Experience Mapping

This paper presents the Soundscape Experience Mapping method for understanding how older adults perceive their indoor acoustic environments. With global ageing, populations are increasingly choosing to age in place. This creates an opportunity to enhance older life through the intentional design of supportive soundscapes for the home. Through a mixed-method approach combining Ecological Momentary Assessment with Cultural Probe methods, we engaged eight older adults (age: 56-76) in Belgium to document their domestic soundscape experiences. Our analysis identifies four themes of older adults' indoor soundscape experiences: personal agency in shaping acoustic environments, temporal patterns that structure daily routines, sound-memory associations creating place attachment, and social presence through acoustic monitoring that connects individual experiences to broader contexts. Findings demonstrate how sounds function as spatiotemporal anchors, structuring daily routines while fostering place attachment through memory. This design research provides foundational insights for Audio AI applications that support independence for older adults.

Thomas Deacon, Jennifer Williams, Jason R. C. Nurse, Christopher Hicks, Gabriel Bibbo, Arshdeep Singh, Mark David Plumbley Privacy for Audio AI: Risks, Challenges, and Emerging Solutions in the Era of Audio AI

Advanced audio AI technologies, including speech generation, source separation, and acoustic scene understanding, could drive innovations across smart devices, assisted living, smart buildings, and urban noise monitoring. But a growing body of research indicates that privacy is still an important concern for a variety of stakeholders as AI solutions become part of the real world. This concern is heightened as conventional voice privacy approaches remain vulnerable to attacks, and there is a tension between offering novel functionality and privacy protection. To address this emerging problem for AI audio technology, this panel brings together experts in cybersecurity, voice privacy, responsible AI, audio forensics, and audio AI to examine current issues and emerging challenges.

Emily Mary Corrigan-Kavanagh, Mark David Plumbley, David Mark Frohlich, Arshdeep Singh (2024)Designing AI for Home Wellbeing and Implications for Future Healthcare Technologies, In: Equilibrium in a time of permacrisis1pp. 35-38

This paper proposes a new design research approach for “Designing Artificial Intelligence (AI) for Home Wellbeing” as a new research field to collaboratively develop AI technologies for home wellbeing with stakeholders, such as end-users, and disciplinary experts, rather than leaving design solely to AI experts. Designing AI for home wellbeing is significant as AI applications progressively affect home life. AI can identify trends and/or themes from live data, such as images, video clips, sounds, or text (Samoili et al. 2021), and is increasingly deployed in the home, such as recognising abnormal activity for security or monitoring occupants’ behavioural patterns for managing physical health (Guo et al. 2019). The home can play a significant part in supporting both psychological and physiological wellbeing through its central role in everyday living. For example, the home can satisfy fundamental physiological needs for wellbeing, such as shelter and security through its physicality, and psychological needs for wellbeing, such as social connection and intimacy, by facilitating communal eating and social interaction between occupants (Corrigan-Kavanagh and Escobar-Tello 2018). Given the strong influence of home on wellbeing, AI for the home should enhance, introduce, or support home wellbeing. However, there currently exists limited research on design methods and approaches for engaging stakeholders, including end-users, and different disciplines in AI development and deployment (Hossain and Ahmed 2021; Delgado, Barocas, and Levy 2022), creating barriers for designing AI for home wellbeing. This paper calls for further scholarship in “Designing AI for Home Wellbeing” while presenting initial findings from research developing this new field.

Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D Plumbley, Wenwu Wang (2024)Retrieval-Augmented Text-to-Audio Generation

Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.
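The retrieval step described above can be sketched as a nearest-neighbour search over embeddings. This is a minimal illustration, assuming that the prompt and the database captions have already been embedded (for example by a CLAP text encoder) and that relevance is measured by cosine similarity; the function names, the dictionary keys, and the toy vectors used to exercise it are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_embedding, database, k):
    """Return the k database items whose text embedding best matches the query.

    Each item is a dict with a "text_emb" vector; any other keys (such as an
    audio embedding used later as a condition) are passed through untouched.
    """
    ranked = sorted(
        database,
        key=lambda item: cosine_similarity(query_embedding, item["text_emb"]),
        reverse=True,
    )
    return ranked[:k]
```

In a retrieval-augmented system the features of the returned pairs would then be supplied to the generator as extra conditioning; that part is model-specific and is not shown here.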

Han Yin, Xiao Yang, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D. Plumbley (2025)EnvSDD: Benchmarking Environmental Sound Deepfake Detection, In: Interspeech 2025 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2025-1143

Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.

James A King, Arshdeep Singh, Mark D. Plumbley (2023)Compressing audio CNNS with graph centrality based filter pruning, In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/WASPAA58266.2023.10248103

Convolutional neural networks (CNNs) are popular in high-performing solutions to many real-world problems, such as audio classification. CNNs have many parameters and filters, with some having a larger impact on the performance than others. This means that networks may contain many unnecessary filters, increasing a CNN's computation and memory requirements while providing limited performance benefits. To make CNNs more efficient, we propose a pruning framework that eliminates filters with the highest "commonality". We measure this commonality using the graph-theoretic concept of centrality. We hypothesise that a filter with a high centrality should be eliminated as it represents commonality and can be replaced by other filters without affecting the performance of a network much. An experimental evaluation of the proposed framework is performed on acoustic scene classification and audio tagging. On the DCASE 2021 Task 1A baseline network, our proposed method reduces computations per inference by 71% with 50% fewer parameters with less than a two percentage point drop in accuracy compared to the original network. For large-scale CNNs such as PANNs designed for audio tagging, our method reduces computations per inference by 24% with 41% fewer parameters at a slight improvement in performance.
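The idea of pruning the most "common" filters can be illustrated with a small sketch. It assumes that each filter is flattened to a vector, that the graph has one node per filter with edges weighted by absolute cosine similarity, and that centrality is the weighted degree of a node. These choices, and all function names, are assumptions made for illustration; the paper may use a different graph construction or centrality measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def weighted_degree_centrality(filters):
    """Centrality of each filter: sum of its edge weights to all other filters."""
    n = len(filters)
    return [
        sum(abs(cosine_similarity(filters[i], filters[j])) for j in range(n) if j != i)
        for i in range(n)
    ]


def select_filters_to_prune(filters, num_prune):
    """Indices of the num_prune filters with the highest centrality."""
    centrality = weighted_degree_centrality(filters)
    order = sorted(range(len(filters)), key=lambda i: centrality[i], reverse=True)
    return sorted(order[:num_prune])
```

A filter that resembles many others accumulates a large weighted degree and is selected first, while a filter unlike the rest keeps a low score and is retained.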

Emily Mary Corrigan-Kavanagh, Mark David Plumbley, David Mark Frohlich (2025)Applying a Virtual World Café Method for Participatory Design of AI Systems, In: Journal of participatory research methods6(3) Specialty Publications

DOI: 10.35844/001c.140957

Designing artificial intelligence (AI) using participatory design (PD) methods is becoming fundamental as AI increasingly augments everyday life. Well documented cases of machine bias, where AI systems informing hiring, loan approvals and prison sentencing have discriminated against people with certain demographics (i.e. gender, race), have highlighted the need to engage end-users in AI design. PD methods show promise in designing AI systems for maximum societal benefits as they allow users to collaborate with researchers and make decisions about how AI systems should be designed. This paper presents a modified virtual world café method, based on the World Café method, as a PD method to identify relevant design requirements for designing AI systems from the beginning of development. Specifically, we describe how this method was developed and used to create design requirements for sound sensing AI for the home with UK-based residents. Findings suggest that the method can be used as a PD method at the beginning of AI system development to define design requirements. The paper concludes with reflections on how the virtual world café method performs as a PD method for designing AI and how findings might be taken forward in future PD research for designing AI systems more generally.

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark Plumbley (2023)AudioLDM: Text-to-Audio Generation with Latent Diffusion Models, In: Pre-print International Machine Learning Society (IMLS)

Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., Frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.

Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang (2022)Continual Learning For On-Device Environmental Sound Classification, In: M Lagrange, Annamaria Mesaros, Thomas Pellegrini, Gaël Richard, Romain Serizel, Dan Stowell (eds.), Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)pp. 211-215

DOI: 10.48550/arXiv.2207.07429

Continuously learning new classes without catastrophic forgetting is a challenging problem for on-device environmental sound classification given the restrictions on computation resources (e.g., model size, running memory). To address this issue, we propose a simple and efficient continual learning method. Our method selects the historical data for the training by measuring the per-sample classification uncertainty. Specifically, we measure the uncertainty by observing how the classification probability of data fluctuates against the parallel perturbations added to the classifier embedding. In this way, the computation cost can be significantly reduced compared with adding perturbation to the raw data. Experimental results on the DCASE 2019 Task 1 and ESC-50 dataset show that our proposed method outperforms baseline continual learning methods on classification accuracy and computational efficiency, indicating our method can efficiently and incrementally learn new classes without the catastrophic forgetting problem for on-device environmental sound classification.
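The uncertainty measure described above can be sketched as follows. This is a minimal illustration that assumes a linear softmax classifier on top of the embedding, Gaussian perturbations, and the variance of the predicted-class probability as the fluctuation measure. These details, the function names, and the default parameter values are assumptions, not the paper's exact formulation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights):
    """Class probabilities from a linear classifier (one weight row per class)."""
    logits = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    return softmax(logits)


def embedding_uncertainty(embedding, weights, num_perturb=32, sigma=0.5, seed=0):
    """Variance of the predicted-class probability under embedding perturbations."""
    rng = random.Random(seed)
    base = classify(embedding, weights)
    predicted = base.index(max(base))
    probs = []
    for _ in range(num_perturb):
        # Perturb the embedding rather than the raw audio, so only the
        # classifier head has to be re-evaluated for each perturbed copy.
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        probs.append(classify(noisy, weights)[predicted])
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)
```

Samples near a decision boundary produce a larger variance than samples far from it, which gives a per-sample score that a replay buffer could use when choosing which historical data to keep.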

Thomas Edward Deacon, Gabriel Bibbo, Arshdeep Singh, Mark D. Plumbley (2025)Soundscape experience mapping: A deep listening approach for eliciting older adults’ perceptions of indoor soundscapes

Indoor soundscapes significantly impact wellbeing, yet methodologies for understanding their perception among older adults remain underdeveloped. This paper presents Soundscape Experience Mapping (SEM), combining ecological momentary assessment with participatory design methods to capture and analyse indoor acoustic environments. Through structured listening activities and audio data collection, participants document their acoustic experiences in context. Our pilot study engaged eight older adults (57+) in a town in Belgium, collecting continuous audio recordings and qualitative data over one week. Using momentary judgements, retrospective evaluations, and sound journalling, we gained insights into how older adults perceive their indoor soundscapes. The method produced findings on the personal control of sound environments, soundscape preferences, and how situational factors influence acoustic perception. Participants demonstrated agency in curating their sonic environments, while expressing frustration with uncontrollable sounds. Daily routines and domestic rhythms emerged as key contextual factors shaping soundscape experiences. This work advances AI-assisted indoor soundscape design by providing evidence-based methods to understand occupant needs, particularly for older adults who could benefit from tailored acoustic environments.

Gabriel Bibbó, Thomas Edward Deacon, Arshdeep Singh, Mark D. Plumbley (2024)The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection, In: Proceedings of the 8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024)pp. 49-53 International Speech Communication Association (ISCA)

DOI: 10.21437/CHiME.2024-11

This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.
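A cascaded speech removal step of the kind described above might look like the sketch below. It assumes that each detector is a callable returning a speech probability for a segment and that a segment is kept only if every stage scores it below a threshold. The detectors here are hypothetical stand-ins that read precomputed scores; in practice they would wrap pre-trained audio tagging models, and the threshold value is an assumption.

```python
def remove_speech(segments, detectors, threshold=0.5):
    """Keep only the segments that every detector in the cascade scores below threshold.

    all() stops at the first detector that flags speech, so later (typically
    more expensive) stages run only on segments that earlier stages passed.
    """
    kept = []
    for segment in segments:
        if all(detector(segment) < threshold for detector in detectors):
            kept.append(segment)
    return kept


# Hypothetical detectors for illustration: they return scores stored on the
# segment instead of running a neural network.
def coarse_detector(segment):
    return segment["coarse_speech_prob"]


def fine_detector(segment):
    return segment["fine_speech_prob"]
```

Erring on the side of removal (a low threshold, or discarding when any stage fires) favours privacy at the cost of losing some non-speech audio.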

Arshdeep Singh, Thomas Edward Deacon, Mark D. Plumbley (2024)Environmental Sound Classification Using Raw-audio Based Ensemble Framework, In: INTER-NOISE and NOISE-CON Congress and Conference Proceedings, INTER-NOISE24pp. 6402-6410

DOI: 10.3397/IN_2024_3740

Environmental sound classification (ESC) aims to automatically recognize audio recordings from the underlying environment, such as "urban park" or "city centre". Most of the existing methods for ESC use hand-crafted time-frequency features such as log-mel spectrogram to represent audio recordings. However, the hand-crafted features rely on transformations that are defined beforehand and do not consider the variability in the environment due to differences in recording conditions or recording devices. To overcome this, we present an alternative representation framework by leveraging a pre-trained convolutional neural network, SoundNet, trained on a large-scale audio dataset to represent raw audio recordings. We observe that the representations obtained from the intermediate layers of SoundNet lie in a low-dimensional subspace. However, the dimensionality of the low-dimensional subspace is not known. To address this, an automatic compact dictionary learning framework is utilized that gives the dimensionality of the underlying subspace. The low-dimensional embeddings are then aggregated in a late-fusion manner in the ensemble framework to incorporate hierarchical information learned at various intermediate layers of SoundNet. We perform experimental evaluation on publicly available DCASE 2017 and 2018 ASC datasets. The proposed ensemble framework improves performance between 1 and 4 percentage points compared to that of existing time-frequency representations.

Jinbo Hu, Yin Cao, Ming Wu, Fang Kang, Feiran Yang, Wenwu Wang, Mark D. Plumbley, Jun Yang (2025)PSELDNets: Pre-trained Neural Networks on a Large-scale Synthetic Dataset for Sound Event Localization and Detection, In: IEEE Transactions on Audio, Speech and Language Processing33pp. 2845 -2860 IEEE

DOI: 10.1109/TASLPRO.2025.3587446

Sound event localization and detection (SELD) has seen substantial advancements through learning-based methods. These systems, typically trained from scratch on specific datasets, have shown considerable generalization capabilities. Recently, deep neural networks trained on large-scale datasets have achieved remarkable success in the sound event classification (SEC) field, prompting an open question of whether these advances can be extended to the development of SELD foundation models. In this paper, leveraging the power of pre-trained SEC models, we propose pre-trained SELD networks (PSELDNets) on a large-scale synthetic dataset. The synthetic dataset, generated by convolving sound events with simulated spatial room impulse responses (SRIRs), contains 1,167 hours of audio clips with an ontology of 170 sound classes. These PSELDNets are applied to various SELD scenarios. When we adapt PSELDNets to specific scenarios, particularly in cases of low-resource data, we introduce a data-efficient fine-tuning method, AdapterBit. PSELDNets are evaluated on synthetic-test-set using collected SRIRs from the TAU Spatial Room Impulse Response Database (TAU-SRIR DB) and achieve satisfactory performance. We also carried out experiments to validate the transferability of PSELDNets to three publicly available datasets and our own real-world recordings. The results demonstrate that PSELDNets surpass state-of-the-art systems across all publicly available datasets. Given the need for direction-of-arrival estimation, SELD generally relies on sufficient multi-channel audio clips. However, incorporating the AdapterBit, PSELDNets show more efficient adaptability to various scenarios using minimal multi-channel or even just monophonic audio clips, outperforming traditional fine-tuning approaches.
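The synthetic data generation mentioned above, convolving sound events with spatial room impulse responses, can be illustrated with a direct-form convolution. This is a minimal sketch for short signals; the function names are hypothetical, and a practical pipeline would use FFT-based convolution on real recordings and measured or simulated SRIRs.

```python
def convolve(signal, kernel):
    """Full linear convolution of two sequences (direct form, O(N*M))."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def spatialize(event, srir):
    """Render a mono sound event to multiple channels.

    srir is a list with one impulse response per microphone channel; each
    output channel is the event convolved with that channel's response.
    """
    return [convolve(event, channel_ir) for channel_ir in srir]
```

Differences in delay and gain between the channel responses are what encode the direction of arrival that a SELD model has to learn.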

Arshdeep Singh, Mark D. Plumbley (2022)A Passive Similarity based CNN Filter Pruning for Efficient Acoustic Scene Classification, In: Proceedings of Interspeech 2022pp. 2433-2437 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2022-10714

We present a method to develop low-complexity convolutional neural networks (CNNs) for acoustic scene classification (ASC). The large size and high computational complexity of typical CNNs is a bottleneck for their deployment on resource-constrained devices. We propose a passive filter pruning framework, where a few convolutional filters from the CNNs are eliminated to yield compressed CNNs. Our hypothesis is that similar filters produce similar responses and give redundant information allowing such filters to be eliminated from the network. To identify similar filters, a cosine distance based greedy algorithm is proposed. A fine-tuning process is then performed to regain much of the performance lost due to filter elimination. To perform efficient fine-tuning, we analyze how the performance varies as the number of fine-tuning training examples changes. An experimental evaluation of the proposed framework is performed on the publicly available DCASE 2021 Task 1A baseline network trained for ASC. The proposed method is simple, reduces computations per inference by 27%, with 25% fewer parameters, with less than 1% drop in accuracy.
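A greedy selection based on cosine distance could be sketched as follows. This illustration assumes flattened, non-zero filters and a simple rule: repeatedly find the closest remaining pair and drop the second filter of that pair. The rule for choosing which filter of a pair to drop, and the function names, are assumptions; the paper's algorithm may differ in these details.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def greedy_similarity_pruning(filters, num_prune):
    """Indices of filters to remove, chosen from the most similar pairs first."""
    remaining = list(range(len(filters)))
    pruned = []
    while len(pruned) < num_prune and len(remaining) > 1:
        best = None  # (distance, index of the filter to drop)
        for a in range(len(remaining)):
            for b in range(a + 1, len(remaining)):
                d = cosine_distance(filters[remaining[a]], filters[remaining[b]])
                if best is None or d < best[0]:
                    best = (d, remaining[b])
        pruned.append(best[1])
        remaining.remove(best[1])
    return sorted(pruned)
```

The selection needs only the trained weights and no training data, which is what makes this kind of pruning "passive"; fine-tuning afterwards is a separate step.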

Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark Plumbley (2023)AudioSR: Versatile Audio Super-resolution at Scale, In: AudioSR: Versatile Audio Super-Resolution at Scale Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2309.07314

Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods have limitations such as the limited scope of audio types (e.g., music, speech) and specific bandwidth settings they can handle (e.g., 4kHz to 8kHz). In this paper, we introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types, including sound effects, music, and speech. Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2kHz to 16kHz to a high-resolution audio signal at 24kHz bandwidth with a sampling rate of 48kHz. Extensive objective evaluation on various audio super-resolution benchmarks demonstrates the strong result achieved by the proposed model. In addition, our subjective evaluation shows that AudioSR can act as a plug-and-play module to enhance the generation quality of a wide range of audio generative models, including AudioLDM, Fastspeech2, and MusicGen. Our code and demo are available at https://audioldm.github.io/audiosr.

Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley (2024)AudioSR: Versatile Audio Super-Resolution at Scale, In: AudioSR: Versatile Audio Super-resolution at Scale Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP48485.2024.10447246

Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods have limitations such as the limited scope of audio types (e.g., music, speech) and specific bandwidth settings they can handle (e.g., 4 kHz to 8 kHz). In this paper, we introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types, including sound effects, music, and speech. Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2 kHz to 16 kHz to a high-resolution audio signal at 24 kHz bandwidth with a sampling rate of 48 kHz. Extensive objective evaluation on various audio super-resolution benchmarks demonstrates the strong result achieved by the proposed model. In addition, our subjective evaluation shows that AudioSR can act as a plug-and-play module to enhance the generation quality of a wide range of audio generative models, including AudioLDM, Fastspeech2, and MusicGen. Our code and demo are available at https://audioldm.github.io/audiosr.

Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Qiushi Huang, Meng Cui, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang (2025)WavJourney: Compositional Audio Creation with Large Language Models, In: WavJourney: Compositional Audio Creation with Large Language Models Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLPRO.2025.3574867

Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. In a real-world scenario, however, we often need to generate audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages Large Language Models (LLMs) to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements, simply based on textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program, where each line of the program calls a task-specific audio generation model or computational operation function. The computer program is then executed to obtain a compositional and interpretable solution for audio creation. Experimental results suggest that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions, achieving state-of-the-art results on text-to-audio generation benchmarks. Additionally, we introduce a new multi-genre story benchmark. Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text. We further demonstrate that WavJourney can facilitate human-machine co-creation in multi-round dialogues.

Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, Mark D. Plumbley, Woon-Seng Gan, Jianfeng Chen (2025)AudioSetCaps: An Enriched Audio-Caption Dataset Using Automated Generation Pipeline With Large Audio and Language Models, In: IEEE Transactions on Audio, Speech and Language Processing33pp. 2817-2829 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLPRO.2025.3583354

With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential yet challenging for model development, primarily due to the time-intensive and labour-heavy demands involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to effectively extract and incorporate detailed audio information. In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a contrastive language-audio pretraining (CLAP) model-based refinement process to improve the quality of captions. Specifically, we employ prompt chaining techniques in the content extraction stage to obtain accurate and fine-grained audio information, while we use the refinement process to mitigate potential hallucinations in the generated captions. Leveraging the AudioSet dataset and the proposed approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs, the largest audio-caption dataset at the time of writing. The models trained with AudioSetCaps achieve state-of-the-art performance on audio-text retrieval with R@1 scores of 46.3% for text-to-audio and 59.7% for audio-to-text retrieval and automated audio captioning with the CIDEr score of 84.8. As our approach has shown promising results with AudioSetCaps, we create another dataset containing 4.1 million synthetic audio-language pairs based on the YouTube-8M and VGGSound datasets. To facilitate research in audio-language learning, we have made our pipeline, datasets with 6 million audio-language pairs,

Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark Plumbley, Wenwu Wang (2023)WavJourney: Compositional Audio Creation with Large Language Models, In: WavJourney: Compositional Audio Creation with LLMs Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2307.14335

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
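The compile-and-execute idea described above can be illustrated with a toy example in which a structured script is turned into calls to generation and mixing functions. Everything here is hypothetical: the script fields ("kind", "text", "samples", "layout"), the foreground and background convention, and the stand-in generator that returns a constant signal in place of a real speech, music, or sound effect model.

```python
def concatenate(clips):
    """Join clips end to end."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(clips):
    """Sum clips sample by sample; shorter clips are treated as padded with silence."""
    length = max(len(clip) for clip in clips)
    return [sum(clip[i] for clip in clips if i < len(clip)) for i in range(length)]


def fake_generate(kind, text, num_samples):
    """Hypothetical stand-in for a task-specific audio generation model."""
    level = {"speech": 0.5, "sound_effect": 0.25, "music": 0.125}[kind]
    return [level] * num_samples


def run_script(script):
    """Execute a script: foreground items play in sequence, background items are mixed in."""
    foreground = concatenate(
        [fake_generate(e["kind"], e["text"], e["samples"])
         for e in script if e["layout"] == "foreground"]
    )
    background = [
        fake_generate(e["kind"], e["text"], e["samples"])
        for e in script if e["layout"] == "background"
    ]
    return mix([foreground] + background)
```

Because the script is an explicit data structure, a user can inspect or edit an individual entry and re-run the program, which is the sense in which such a pipeline is interpretable and interactive.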

Xuenan Xu, Arshdeep Singh, Mengyue Wu, Wenwu Wang, Mark D. Plumbley (2025)Investigating Passive Filter Pruning for Efficient CNN-Transformer Audio Captioning, In: 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP 2024)pp. 175-180 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/MLSP58920.2024.10734745

Although automated audio captioning (AAC) has achieved remarkable performance improvement in recent years, concerns about the complexity of AAC models have drawn little attention from the research community. To reduce the number of model parameters, passive filter pruning has been successfully applied to convolution neural networks (CNNs) in audio classification tasks. However, due to the discrepancy between audio classification and AAC, these pruning methods are not necessarily suitable for captioning. In this work, we investigate the effectiveness of several passive filter pruning approaches on an efficient CNN-Transformer-based AAC architecture. Through extensive experiments, we find that under the same pruning ratio, pruning from the later convolution blocks significantly improves the performance. Utilizing the norm-based pruning method, our pruned model reduces the parameter number by 15% compared to that of the original model while maintaining a similar performance.
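Norm-based pruning, mentioned above, can be sketched as ranking filters by the size of their weights and removing the smallest. This illustration assumes the L1 norm and a fixed pruning ratio for a single layer; the abstract does not specify which norm was used, and the function names are hypothetical.

```python
def l1_norm(filter_weights):
    """Sum of absolute values of a flattened filter."""
    return sum(abs(w) for w in filter_weights)


def prune_by_norm(filters, prune_ratio):
    """Return the indices of the filters kept after removing the lowest-norm ones."""
    num_prune = int(len(filters) * prune_ratio)
    order = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    pruned = set(order[:num_prune])
    return [i for i in range(len(filters)) if i not in pruned]
```

Applying the same ratio to different convolution blocks, for example only to the later ones, changes which filters are removed without changing the selection rule itself.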

Aryan Choudhary, Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley (2024)Efficient CNNs with Quaternion Transformations and Pruning for Audio Tagging, In: Interspeech 2024pp. 1150-1154 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2024-1331

This paper presents a novel approach to make convolutional neural networks (CNNs) efficient by reducing their computational cost and memory footprint. Even though large-scale CNNs show state-of-the-art performance in many tasks, high computational costs and the requirement of a large memory footprint make them resource-hungry. Therefore, deploying large-scale CNNs on resource-constrained devices poses significant challenges. To address this challenge, we propose to use quaternion CNNs, where quaternion algebra enables the memory footprint to be reduced. Furthermore, we investigate methods to reduce the memory footprint and computational cost further through pruning the quaternion CNNs. Experimental evaluation of the audio tagging task involving the classification of 527 audio events from AudioSet shows that the quaternion algebra and pruning reduce memory footprint by 90% and computational cost by 70% compared to the original CNN model while maintaining similar performance.
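To see why quaternion algebra reduces the memory footprint, consider the following sketch. It is an illustration under stated assumptions, not code from the paper: the helper names are hypothetical, and the parameter count assumes that both channel counts are divisible by 4 and that biases are ignored. One quaternion weight (4 real numbers) acts on a 4-channel input through the Hamilton product, where an unconstrained real-valued map would need 16 weights.

```python
def hamilton_matrix(q):
    """Real 4x4 matrix that applies left multiplication by quaternion q."""
    r, i, j, k = q
    return [
        [r, -i, -j, -k],
        [i,  r, -k,  j],
        [j,  k,  r, -i],
        [k, -j,  i,  r],
    ]


def quaternion_apply(weight, x):
    """Hamilton product weight * x, with both given as (r, i, j, k)."""
    m = hamilton_matrix(weight)
    return [sum(m[row][col] * x[col] for col in range(4)) for row in range(4)]


def conv_param_counts(in_channels, out_channels, kernel_size):
    """Weight counts (no bias) for a real vs. a quaternion 2D convolution."""
    if in_channels % 4 or out_channels % 4:
        raise ValueError("channel counts must be divisible by 4")
    real = out_channels * in_channels * kernel_size * kernel_size
    # (out/4) x (in/4) quaternion kernels, each storing 4 real numbers per tap.
    quaternion = (out_channels // 4) * (in_channels // 4) * kernel_size ** 2 * 4
    return real, quaternion


product = quaternion_apply((0, 1, 0, 0), (0, 0, 1, 0))  # i * j
real_params, quat_params = conv_param_counts(64, 64, 3)
```

Under these assumptions the quaternion layer stores a quarter of the weights of its real-valued counterpart; the larger reductions reported in the abstract also rely on pruning.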

Jianyuan Sun, Wenwu Wang, Mark D. Plumbley (2024) PFCA-Net: Pyramid Feature Fusion and Cross Content Attention Network for Automated Audio Captioning, In: Interspeech 2024 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2024-1268

Automated audio captioning (AAC) aims to generate textual descriptions for a given audio clip. Although existing AAC models obtain promising performance, they struggle to capture intricate audio patterns because they use only a high-dimensional representation. In this paper, we propose a new encoder-decoder model for AAC, called the Pyramid Feature Fusion and Cross Content Attention Network (PFCA-Net). In PFCA-Net, the encoder is constructed using a pyramid network, facilitating the extraction of audio features across multiple scales. It achieves this by combining top-down and bottom-up connections to fuse features across scales, resulting in feature maps at various scales. In the decoder, cross-content attention is designed to fuse the features from different scales, which allows the propagation of information from a low scale to a high scale. Experimental results show that PFCA-Net achieves considerable improvement over existing models.

Zhaoyu Wang, Haohe Liu, Harry Coppock, Björn Schuller, Mark D. Plumbley (2024) Neural Compression Augmentation for Contrastive Audio Representation Learning, In: Interspeech 2024 pp. 3335-3339 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2024-1156

The choice of data augmentation is pivotal in contrastive self-supervised learning. Current augmentation techniques for audio data, such as the widely used Random Resize Crop (RRC), underperform in pitch-sensitive music tasks and lack generalisation across various types of audio. This study aims to address these limitations by introducing Neural Compression Augmentation (NCA), an approach based on lossy neural compression. We use the Audio Barlow Twins (ABT), a contrastive self-supervised framework for audio, as our backbone. We experiment with both NCA and several baseline augmentation methods in the augmentation block of ABT and train the models on AudioSet. Experimental results show that models integrated with NCA considerably surpass the original performance of ABT, especially in the music tasks of the HEAR benchmark, demonstrating the effectiveness of compression-based augmentation for audio contrastive self-supervised learning.

Thomas Deacon, Arshdeep Singh, Gabriel Bibbo, Mark D. Plumbley (2024) Soundscape Personalisation at Work: Designing AI-Enabled Sound Technologies for the Workplace, In: Proceedings of the 21st Sound and Music Computing Conference pp. 116-126

DOI: 10.5281/zenodo.14336495

Poor workplace soundscapes can negatively impact productivity and employee satisfaction. While current regulations and physical acoustic treatments are beneficial, the potential of AI sound systems to enhance worker wellbeing is not fully explored. This paper investigates the use of AI-enabled sound technologies in workplaces, aiming to boost wellbeing and productivity through a soundscape approach while addressing user concerns. To evaluate these systems, we used scenario-based design and focus groups with knowledge workers from open-plan offices and those working remotely. Participants were presented with initial design concepts for AI sound analysis and control systems. This paper outlines user requirements and recommendations gathered from these focus groups, with a specific emphasis on soundscape personalisation and the creation of relevant datasets.

Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley (2024) Efficient Audio Captioning with Encoder-Level Knowledge Distillation, In: Interspeech 2024 pp. 1160-1164 International Speech Communication Association

DOI: 10.21437/Interspeech.2024-1680

Significant improvement has been achieved in automated audio captioning (AAC) with recent models. However, these models have become increasingly large as their performance is enhanced. In this work, we propose a knowledge distillation (KD) framework for AAC. Our analysis shows that in encoder-decoder based AAC models, it is more effective to distill knowledge into the encoder than into the decoder. To this end, we incorporate an encoder-level KD loss into training, in addition to the standard supervised loss and sequence-level KD loss. We investigate two encoder-level KD methods, based on mean squared error (MSE) loss and contrastive loss, respectively. Experimental results demonstrate that contrastive KD is more robust than MSE KD, exhibiting superior performance in data-scarce situations. By incorporating audio-only data into training in the KD framework, our student model achieves competitive performance, with an inference speed that is 19 times faster.

Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang (2025) T-CLAP: Temporal-enhanced contrastive language-audio pretraining, In: 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP 2024) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/MLSP58920.2024.10734763

Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin. Our code and models will be released soon.

Zihang Song, Han Zhang, Sean Fuller, Andrew Lambert, Zhinong Ying, Petri Mahonen, Yonina Eldar, Shuguang Cui, Mark D. Plumbley, Clive Parini, Arumugam Nallanathan, Yue Gao (2023) Numerical evaluation on sub-Nyquist spectrum reconstruction methods, In: Frontiers of Computer Science 17(6) 176504 Higher Education Press

DOI: 10.1007/s11704-023-2520-3

As wireless technology continues to expand, there is a growing concern about the efficient use of spectrum resources. Even though a significant portion of the spectrum is allocated to licensed primary users (PUs), studies indicate that their actual utilization is often limited to between 5% and 10% [1]. The underutilization of spectrum has given rise to cognitive radio (CR) technology, which allows secondary users (SUs) to opportunistically access these underused resources [2]. However, wideband spectrum sensing, the key to CR, is limited by the need for high-speed analog-to-digital converters (ADCs), which are costly and power-hungry. Compressed spectrum sensing (CSS) addresses this challenge by employing sub-Nyquist rate sampling. The efficiency of active transmission detection heavily depends on the quality of spectrum reconstruction. There are various reconstruction methods in CSS, each with its merits and drawbacks. Still, existing algorithms have not tapped into the full potential of sub-sampling sequences, and their performance notably drops in noisy environments [3,4]. The GHz Bandwidth Sensing (GBSense) project introduces an innovative approach for GHz bandwidth sensing. GBSense incorporates advanced sub-Nyquist sampling methods and is compatible with low-power devices. This project also prompted the GBSense Challenge 2021, which centered on sub-Nyquist reconstruction algorithms; the four leading algorithms are presented and evaluated in this paper.

Andres Fernandez, Mark D. Plumbley (2021) Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions, In: Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)

The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S task presented at the 2021 edition of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we visually inspect Uniform Manifold Approximations and Projections (UMAPs) for log-STFT, log-mel and pretrained Look, Listen and Learn (L3) representations of the DCASE UAD-S dataset. In our exploratory investigation, we look for two qualities, Separability (SEP) and Discriminative Support (DSUP), and formulate several hypotheses that could facilitate diagnosis and development of further representation and detection approaches. In particular, we hypothesize that input length and pretraining may regulate a relevant tradeoff between SEP and DSUP. Our code as well as the resulting UMAPs and plots are publicly available.

Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark David Plumbley, Wenwu Wang (2021) CL4AC: A Contrastive Loss for Audio Captioning, In: Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)

Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text descriptions, and a decoder is then used to generate the captions. However, training of an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representation and audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and texts, even when trained with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.

Turab Iqbal, Yin Cao, Andrew Bailey, Mark D. Plumbley, Wenwu Wang (2021) ARCA23K: An Audio Dataset for Investigating Open-Set Label Noise, In: Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)

The availability of audio data on sound sharing platforms such as Freesound gives users access to large amounts of annotated audio. Utilising such data for training is becoming increasingly popular, but the problem of label noise that is often prevalent in such datasets requires further investigation. This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprised of over 23 000 labelled Freesound clips. Unlike past datasets such as FSDKaggle2018 and FSDnoisy18K, ARCA23K facilitates the study of label noise in a more controlled manner. We describe the entire process of creating the dataset such that it is fully reproducible, meaning researchers can extend our work with little effort. We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise. Experiments are carried out in which we study the impact of label noise in terms of classification performance and representation learning.

Arshdeep Singh, Mark D. Plumbley (2023) Efficient Similarity-Based Passive Filter Pruning for Compressing CNNs, In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023) pp. 1-5 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP49357.2023.10095560

Convolutional neural networks (CNNs) have shown great success in various applications. However, the computational complexity and memory storage of CNNs are a bottleneck for their deployment on resource-constrained devices. Recent efforts towards reducing the computation cost and the memory overhead of CNNs involve similarity-based passive filter pruning methods. These methods compute a pairwise similarity matrix for the filters and eliminate a few similar filters to obtain a small pruned CNN. However, the computational complexity of computing the pairwise similarity matrix is high, particularly when a convolutional layer has many filters. To reduce this complexity, we propose an efficient method in which the complete pairwise similarity matrix is approximated from only a few of its columns using a Nyström approximation. The proposed efficient similarity-based passive filter pruning method is 3 times faster than the similarity-based pruning method that computes a complete pairwise similarity matrix, and gives the same accuracy at the same reduction in computations. Apart from this, the proposed method performs similarly to or better than existing norm-based pruning methods. The efficacy of the proposed pruning method is evaluated on CNNs such as the DCASE 2021 Task 1A baseline network and a VGGish network designed for acoustic scene classification.
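The Nyström idea referred to in this abstract can be written out in its standard textbook form. The notation below ($S$, $C$, $W$, $n$, $m$, $\mathcal{I}$) is assumed for illustration and is not taken from the paper, whose exact column-selection scheme may differ.

```latex
% S: n x n pairwise similarity matrix of the n filters in a layer (symmetric).
% I: index set of m << n sampled columns.
\[
  C = S_{:,\mathcal{I}} \in \mathbb{R}^{n \times m}, \qquad
  W = S_{\mathcal{I},\mathcal{I}} \in \mathbb{R}^{m \times m}, \qquad
  S \approx \tilde{S} = C \, W^{\dagger} C^{\top},
\]
% where W^{\dagger} is the Moore-Penrose pseudo-inverse of W.
```

Only the $nm$ entries of $C$ need to be computed directly, instead of all $n^2$ entries of $S$, which is where the saving comes from when $m \ll n$.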

Gabriel Bibbó, Arshdeep Singh, Mark David Plumbley (2023) Recognise and Notify Sound Events using a Raspberry PI based Standalone Device

DOI: 10.5281/zenodo.15465882

Convolutional neural networks (CNNs) have exhibited state-of-the-art performance in various audio classification tasks. However, their real-time deployment remains a challenge on resource-constrained devices like embedded systems. In this paper, we present a demonstration of our standalone hardware device designed for real-time recognition of sound events, commonly known as audio tagging. Our system incorporates a real-time implementation of CNN-based pre-trained audio neural networks (PANNs) on an embedded hardware device, the Raspberry Pi. We refer to our standalone device as the "PiSoundSensing" system, which makes sense of surrounding sounds using Raspberry Pi based hardware. Users can interact with the system through a physical button or an online web interface. The web interface allows users to remotely control the standalone device and visualize sound events detected over time. We provide a detailed description of the hardware and software used to build the PiSoundSensing device. We also highlight useful observations, including a comparison of the performance of the hardware-based standalone device with that of the software-based implementation.

Gabriel Bibbó, Arshdeep Singh, Mark D. Plumbley (2024) Environmental sound classification on an embedded hardware platform, In: INTER-NOISE and NOISE-CON Congress and Conference Proceedings 270(5) pp. 6376-6385 Institute of Noise Control Engineering (INCE-USA)

DOI: 10.3397/IN_2024_3723

Convolutional neural networks (CNNs) have exhibited state-of-the-art performance in various audio classification tasks. However, their real-time deployment remains a challenge on resource-constrained devices such as embedded systems. In this paper, we analyze how the performance of large-scale pre-trained audio neural networks designed for audio pattern recognition changes when deployed on hardware such as a Raspberry Pi. We empirically study the role of CPU temperature, microphone quality and audio signal volume on performance. Our experiments reveal that continuous CPU usage results in an increased temperature that can trigger an automated slowdown mechanism in the Raspberry Pi, impacting inference latency. The quality of the microphone, specifically with affordable devices such as the Google AIY Voice Kit, and the audio signal volume both affect system performance. In the course of our investigation, we encountered substantial complications linked to library compatibility and the unique processor architecture requirements of the Raspberry Pi, making the process less straightforward than on conventional computers (PCs). Our observations, while presenting challenges, pave the way for future researchers to develop more compact machine learning models, design heat-dissipative hardware, and select appropriate microphones when AI models are deployed for real-time applications on edge devices.

Meng Cui, Xubo Liu, Jinzheng Zhao, Jianyuan Sun, Guoping Lian, Tao Chen, Mark D. Plumbley, Daoliang Li, Wenwu Wang (2023) Fish Feeding Intensity Assessment in Aquaculture: A New Audio Dataset AFFIA3K and a Deep Learning Algorithm, In: 2022 IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP 2022) pp. 1-6 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/MLSP55214.2022.9943405

Fish feeding intensity assessment (FFIA) aims to evaluate the change of fish appetite during the feeding process, which is potentially useful in industrial aquaculture. Previous methods are mainly based on computer vision techniques. However, these methods are limited by water refraction and uneven illumination. In this paper, we introduce a new approach for FFIA using audio. We create a new audio dataset for FFIA, namely AFFIA3K, which contains 3000 labelled audio clips of different fish feeding intensity (None, Weak, Medium, Strong). We present a deep learning framework for FFIA, where the audio signal is first transformed into acoustic features, i.e. mel spectrogram, then a convolutional neural network (CNN)-based model is used to classify the fish feeding intensity. Experimental results show that our approach achieves an mAP of 0.74 on the test set of AFFIA3K, and considerably outperforms baseline systems. This indicates the potential of our proposed approach in aquaculture applications.

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang (2022) On Metric Learning for Audio-Text Cross-Modal Retrieval, In: Interspeech 2022 pp. 4142-4146 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2022-11115

Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates given a query in another modality. Solving such a cross-modal retrieval task is challenging because it requires not only learning robust feature representations for both modalities, but also capturing the fine-grained alignment between these two modalities. Existing cross-modal retrieval models are mostly optimized by metric learning objectives, as both attempt to map data to an embedding space where similar data are close together and dissimilar data are far apart. Unlike other cross-modal retrieval tasks such as image-text and video-text retrieval, audio-text retrieval is still an unexplored task. In this work, we aim to study the impact of different metric learning objectives on the audio-text retrieval task. We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets. We demonstrate that NT-Xent loss adapted from self-supervised learning shows stable performance across different datasets and training settings, and outperforms the popular triplet-based losses. Our code is available at https://github.com/XinhaoMei/audio-text_retrieval.
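As a rough illustration of an NT-Xent-style objective for paired audio and text embeddings, the sketch below computes a symmetric, temperature-scaled cross-entropy over cosine similarities. It is an assumption-laden toy, not the code from the linked repository: the function names, the default temperature of 0.07, and the use of cross-modal negatives only are choices made here for illustration, and the paper's formulation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def nt_xent(audio_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; item i of each list forms a positive pair."""
    n = len(audio_embs)
    sim = [[cosine(a, t) / temperature for t in text_embs] for a in audio_embs]
    # Audio-to-text: each audio clip should pick out its own caption.
    a2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-audio: each caption should pick out its own audio clip.
    t2a = sum(cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (a2t + t2a)


audio = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(audio, [[1.0, 0.0], [0.0, 1.0]])
loss_shuffled = nt_xent(audio, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each caption embedding matches its audio embedding and becomes large when the pairs are swapped, which is the behaviour a retrieval objective of this kind is meant to reward.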

Jinhua Liang, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos (2025) Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities, In: IEEE Transactions on Audio, Speech and Language Processing 33 pp. 949-961 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLPRO.2025.3533375

The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of language and vision understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capability. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by injecting audio embeddings to the input of LLMs, namely soft prompting. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as the inputs to the language model. To mitigate data scarcity in the audio domain, a curriculum learning strategy is proposed by formulating diverse audio tasks in a sequential manner. Moreover, we improve the audio language model by using interleaved audio-text embeddings as the input sequence. In this improved model, zero constraints are imposed on the input format, thus it is capable of tackling diverse modelling tasks, such as few-shot audio classification and audio comparison. To further evaluate the advanced ability of the audio networks, we introduce natural language audio reasoning (NLAR), a new task that analyses two audio clips by comparison and summarisation. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the target datasets) across various tasks. We finally demonstrate APT's ability in extending frozen VLMs to the audio domain without fine-tuning, achieving promising results in audiovisual question and answering. Our code and model weights will be released at https://github.com/JinhuaLiang/APT. Index Terms—Audio understanding, large language model, audio-language learning, audio recognition, automated audio captioning, natural language audio reasoning.

Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kilic, Wenwu Wang (2022) Leveraging Pre-trained BERT for Audio Captioning, In: 2022 30th European Signal Processing Conference (EUSIPCO 2022) pp. 1145-1149 European Signal Processing Conference (EUSIPCO)

DOI: 10.23919/EUSIPCO55093.2022.9909761

Audio captioning aims at using language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and then a language decoder is used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder than for the encoder. BERT is a pre-trained language model that has been extensively used in natural language processing tasks. Nevertheless, the potential of using BERT as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from the publicly available pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model. Our models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.

Jianyuan Sun, Xubo Liu, Xinhao Mei, Jinzheng Zhao, Mark D. Plumbley, Volkan Kilic, Wenwu Wang (2022) Deep Neural Decision Forest for Acoustic Scene Classification, In: 2022 30th European Signal Processing Conference (EUSIPCO 2022) pp. 772-776 European Signal Processing Conference (EUSIPCO)

DOI: 10.23919/EUSIPCO55093.2022.9909575

Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristics of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving the classification accuracy include integrating auxiliary methods such as attention mechanisms, pre-trained models and ensembles of multiple sub-networks. However, due to the complexity of audio clips captured from different environments, it is difficult for existing deep learning models that use only a single classifier to distinguish their categories without any auxiliary methods. In this paper, we propose a novel approach for ASC using a deep neural decision forest (DNDF). DNDF combines a fixed number of convolutional layers and a decision forest as the final classifier. The decision forest consists of a fixed number of decision tree classifiers, which have been shown to offer better classification performance than a single classifier on some datasets. In particular, the decision forest differs substantially from traditional random forests as it is stochastic, differentiable, and capable of using back-propagation to update and learn feature representations in the neural network. Experimental results on the DCASE2019 and ESC-50 datasets demonstrate that our proposed DNDF method improves the ASC performance in terms of classification accuracy and shows competitive performance compared with state-of-the-art baselines.

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang (2022) Diverse Audio Captioning Via Adversarial Training, In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing pp. 8882-8886 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP43922.2022.9746894

Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips. To address this problem, we propose an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions. Unlike processing data of continuous values in a classical GAN, a sentence is composed of discrete tokens and the discrete sampling process is non-differentiable. To address this issue, policy gradient, a reinforcement learning technique, is used to back-propagate the reward to the generator. The results show that our proposed model can generate more diverse captions, as compared to state-of-the-art methods.

Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang (2021) Audio Captioning Transformer, In: Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)

DOI: 10.48550/arXiv.2107.09817

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.

Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang (2021) An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning, In: Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)

DOI: 10.48550/arXiv.2108.02752

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced to mitigate the problem induced by data scarcity. Moreover, evaluation metrics are incorporated into the optimization of the model with reinforcement learning, which helps address the problem of "exposure bias" induced by the "teacher forcing" training strategy and the mismatch between the evaluation metrics and the loss function. The resulting system was ranked 3rd in DCASE 2021 Task 6. Ablation studies are carried out to investigate how much each component in the proposed system contributes to final performance. The results show that the proposed techniques significantly improve the scores of the evaluation metrics; however, reinforcement learning may adversely affect the quality of the generated captions.

Abigail Bristow, Mark David Plumbley, Charlotte Clark, Simone Graetzer, Alan Hunter, Antonio J. Torija Martinez (2025) Noise Network Plus: A New Interdisciplinary Network to Address the Grand Challenges of Noise Pollution Importance of Noise Planned Network Activities

DOI: 10.5281/zenodo.14773447

Noise Network Plus is a new interdisciplinary network designed to address the grand challenges for noise pollution for the next 10-15 years, supported by the UK Engineering and Physical Sciences Research Council (EPSRC). Noise has a major impact on public health, society and wildlife. It has a profound effect on human health and wellbeing, causing heart disease, high annoyance and sleep disturbance. Road noise in England alone is estimated to cause £7-10bn of health costs, with 130,000 healthy life years lost each year. It affects wildlife, including birds and marine mammals, damaging wildlife health and reproduction. Noise also affects AI systems and sensors, including sonar echo sounders and underwater acoustic modems. As well as existing sources of human-made noise, such as road, rail, and air transport, new technologies such as drones, air taxis and air source heat pumps could introduce new sources of noise. However, noise is often neglected until late in the engineering design process, resulting in products, systems and buildings that may create or transmit unnecessary noise. To address these challenges, we will create a new research and innovation network, bringing together diverse, dynamic teams from across disciplines to promote dialogue, co-design missions, form lasting and inclusive collaborations, and build unprecedented noise research capabilities. The Network will support pilot projects to test new research ideas, gather missing evidence and carry out feasibility studies. We will also reimagine the education and training of engineers, to include systems thinking in general, and sound and noise in particular, across engineering education. By understanding the complex systems that build noise into the world, we will begin a long-lasting programme of research and engagement to reduce noise and its impact on people, the environment, and the economy.

Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang (2024)Separate Anything You Describe, In: Separate Anything You Describe Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2024.3520017

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. Specifically, AudioSep achieved strong results including a Signal-to-Distortion Ratio Improvement (SDRi) of 7.74 dB across 527 sound classes of the AudioSet; 9.14 dB on the VGGSound dataset; 8.22 dB on the AudioCaps dataset; 6.85 dB on the Clotho dataset; 10.51 dB on the MUSIC dataset; 10.04 dB on the ESC-50 dataset; 8.16 dB on the DCASE 2024 Task 9 dataset; and an SSNR of 9.21 dB on the VoicebankDemand dataset.
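The SDRi figures quoted above measure how much a separator raises the signal-to-distortion ratio relative to the unprocessed mixture. The following is a minimal sketch of that arithmetic, assuming the plain energy-ratio definition of SDR; the function names and toy signals are illustrative only, and published evaluations may use other SDR variants.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB: 10*log10(||s||^2 / ||s - s_hat||^2).

    Assumes the estimate is not identical to the reference (non-zero error).
    """
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_power / error_power)


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the separated output minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical): the mixture adds a constant 0.5 interference,
# and the separated output leaves a residual of 0.1 on every sample.
target = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]
separated = [1.1, -0.9, 1.1, -0.9]

sdri = sdr_improvement_db(target, mixture, separated)
print(f"SDRi = {sdri:.2f} dB")
```

Here the mixture has an SDR of about 6 dB and the separated output about 20 dB, so the improvement is roughly 14 dB.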

Thomas Edward Deacon, Leen Broeckx, Kim Helsen, Mark David Plumbley (2025)Sound Awareness Activities: Towards a design toolkit for understanding personal soundscape experiences

DOI: 10.5281/zenodo.15365060

Understanding how individuals experience their acoustic environments is essential for sound planning and design. While standardised methods exist for measuring and analysing soundscapes, AI audio systems require more design-oriented, inclusive approaches to capture personal listening experiences. This poster introduces accessible Sound Awareness Activities that help participants explore and reflect on sounds in their homes and neighbourhoods to support indoor soundscape improvements. Our approach, originally part of a participatory design study, combines deep listening, sound journaling, memory-sound associations, and structured soundscape listening activities. Each activity is adaptable for different abilities and contexts to support inclusive participation. These activities go beyond measuring sound to capture personal meanings, emotional responses, and contextual factors that shape acoustic experiences. Though designed for a particular study, activities should work equally well in workshops and other engagement formats. The University of Surrey team conducted a week-long experience sampling study at Belgium's Living and Care Lab (LiCaLab) to explore the activities' effectiveness. Drawing on a thorough understanding of the local participant panel, the LiCaLab team helped refine Surrey's original protocol by providing insights for translation, localization, and adaptation. This collaboration demonstrated how activities can be carefully tailored while preserving core methodological objectives. The poster presentation provides an opportunity to share progress and gather feedback on a novel soundscape research methodology.
We will provide practical implementation guidance, including strategies for making activities accessible, supporting diverse participation styles, and creating comfortable environments for sound exploration. This methodological contribution advances soundscape research by offering structured yet adaptable ways to gather rich qualitative data about residential sound experiences. For future work, we aim to consolidate the approach into a toolkit that supports the development of inclusive acoustic design approaches that consider diverse ways of listening and experiencing sound environments.

Andrew Bailey, Mark D. Plumbley (2022)Gender Bias in Depression Detection Using Audio Features, In: 2021 29th European Signal Processing Conference (EUSIPCO)pp. 596-600 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.23919/EUSIPCO54536.2021.9615933

Depression is a large-scale mental health problem, and detecting it is a challenging area for machine learning researchers. Datasets such as Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) have been created to aid research in this area. However, on top of the challenges inherent in accurately detecting depression, biases in datasets may result in skewed classification performance. In this paper we examine gender bias in the DAIC-WOZ dataset. We show that gender biases in DAIC-WOZ can lead to an overreporting of performance. By applying different concepts from Fair Machine Learning, such as data re-distribution, and using raw audio features, we can mitigate the harmful effects of bias.

Lam Pham, Chris Baume, Qiuqiang Kong, Tassadaq Hussain, Wenwu Wang, Mark D. Plumbley (2022)An Audio-Based Deep Learning Framework For BBC Television Programme Classification, In: Proceedings of the 29th European Signal Processing Conference (EUSIPCO 2021), Dublin, Ireland, 23-27, August, 2021 European Association for Signal Processing (EURASIP)

DOI: 10.23919/EUSIPCO54536.2021.9616310

This paper proposes a deep learning framework for classification of BBC television programmes using audio. The audio is firstly transformed into spectrograms, which are fed into a pre-trained Convolutional Neural Network (CNN), obtaining predicted probabilities of sound events occurring in the audio recording. Statistics for the predicted probabilities and detected sound events are then calculated to extract discriminative features representing the television programmes. Finally, the embedded features extracted are fed into a classifier for classifying the programmes into different genres. Our experiments are conducted over a dataset of 6,160 programmes belonging to nine genres labelled by the BBC. We achieve an average classification accuracy of 93.7% over 14-fold cross validation. This demonstrates the efficacy of the proposed framework for the task of audio-based classification of television programmes.

Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark David Plumbley, Wenwu Wang (2023)Adapting Language-Audio Models as Few-Shot Audio Learners, In: Proceedings of Interspeech 2023 pp. 276-280 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2023-1082

Contrastive language-audio pretraining (CLAP) has become a new paradigm to learn audio concepts with audio-text pairs. CLAP models have shown unprecedented performance as zero-shot classifiers on downstream tasks. To further adapt CLAP with domain-specific knowledge, a popular method is to fine-tune its audio encoder with available labelled examples. However, this is challenging in low-shot scenarios, as the number of annotations is limited compared to the model size. In this work, we introduce a Training-efficient (Treff) adapter to rapidly learn with a small set of examples while maintaining the capacity for zero-shot classification. First, we propose a cross-attention linear model (CALM) to map a set of labelled examples and test audio to test labels. Second, we find initialising CALM as a cosine measurement improves our Treff adapter even without training. The Treff adapter outperforms metric-based methods in few-shot settings and yields competitive results to fully-supervised methods.

Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D Plumbley (2023)Ontology-aware Learning and Evaluation for Audio Tagging, In: INTERSPEECH 2023 pp. 3799-3803 International Speech Communication Association (ISCA)

DOI: 10.21437/Interspeech.2023-979

This study defines a new evaluation metric for audio tagging tasks to alleviate the limitation of the mean average precision (mAP) metric. The mAP metric treats different kinds of sound as independent classes without considering their relations. The proposed metric, ontology-aware mean average precision (OmAP), addresses the weaknesses of mAP by utilizing additional ontology during evaluation. Specifically, we reweight the false positive events in the model prediction based on the AudioSet ontology graph distance to the target classes. The OmAP also provides insights into model performance by evaluating different coarse-grained levels in the ontology graph. We conduct a human assessment and show that OmAP is more consistent with human perception than mAP. We also propose an ontology-based loss function (OBCE) that reweights binary cross entropy (BCE) loss based on the ontology distance. Our experiment shows that OBCE can improve both mAP and OmAP metrics on the AudioSet tagging task.

Arshdeep Singh, Mark David Plumbley (2025)Efficient CNNs via Passive Filter Pruning, In: Efficient CNNs via Passive Filter Pruning IEEE

DOI: 10.1109/TASLPRO.2025.3561589

Convolutional neural networks (CNNs) have shown state-of-the-art performance in various applications. However, CNNs are resource-hungry due to their requirement of high computational complexity and memory storage. Recent efforts toward achieving computational and memory efficiency in CNNs involve filter pruning methods that eliminate some of the filters in CNNs based on the "importance" of the filters. The majority of existing filter pruning methods are either "active", which use a dataset and generate feature maps to quantify filter importance, or "passive", which compute filter importance using the entry-wise norm of the filters or by measuring similarity among filters without involving data. However, the existing passive filter pruning methods eliminate relatively smaller norm filters or similar filters without considering the significance of the filters in producing the node output, resulting in degradation in the performance. To address this, we present a passive filter pruning method where the least significant filters with relatively smaller contribution in producing output are pruned away by incorporating the operator norm of the filters. The proposed pruning method results in better performance across various CNNs compared to that of the existing passive filter pruning methods. In comparison to the existing active filter pruning methods, the proposed pruning method is more efficient and achieves similar performance as well. The efficacy of the proposed pruning method is evaluated on audio scene classification and audio tagging tasks using various CNN architectures such as VGGish, DCASE21 Net and PANNs. The proposed pruning method reduces the number of computations and parameters of the unpruned CNNs by at least 40% and 50% respectively, improving inference latency while maintaining similar performance as obtained using the unpruned CNNs.
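For orientation, the entry-wise norm baseline that this abstract contrasts against can be sketched in a few lines. This illustrates only the existing passive approach of ranking filters by the L1 norm of their weights; it is not the operator-norm method proposed in the paper, and the function names and toy weights are assumptions made for the example.

```python
def l1_norm(filter_weights):
    """Entry-wise L1 norm of one filter's flattened weights."""
    return sum(abs(w) for w in filter_weights)


def select_filters_to_keep(filters, prune_fraction):
    """Return sorted indices of the filters kept after dropping the
    smallest-norm ones. No data or feature maps are used, which is what
    makes this kind of criterion "passive"."""
    n_prune = int(len(filters) * prune_fraction)
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    return sorted(ranked[n_prune:])


# Toy layer with four flattened filters (hypothetical weights).
filters = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [0.05, -0.03, 0.02],
]

kept = select_filters_to_keep(filters, prune_fraction=0.5)
print(kept)
```

With half of the filters pruned, the two filters with the smallest L1 norms (indices 1 and 3) are removed. The abstract's point is that such a ranking ignores how much each filter contributes to the layer output.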

Arshdeep Singh, Haohe Liu, Mark D. Plumbley (2023)E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks, In: INTER-NOISE and NOISE-CON Congress and Conference Proceedings268(1)pp. 7220-7228

DOI: 10.3397/IN_2023_1083

Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking. Recent machine learning methods, such as convolutional neural networks (CNNs), have been shown to be able to automatically recognize sound activities, a task known as audio tagging. One such method, pre-trained audio neural networks (PANNs), provides a neural network which has been pre-trained on over 500 sound classes from the publicly available AudioSet dataset, and can be used as a baseline or starting point for other tasks. However, the existing PANNs model has a high computational complexity and large storage requirement. This could limit the potential for deploying PANNs on resource-constrained devices, such as on-the-edge sound sensors, and could lead to high energy consumption if many such devices were deployed. In this paper, we reduce the computational complexity and memory requirement of the PANNs model by taking a pruning approach to eliminate redundant parameters from the PANNs model. The resulting Efficient PANNs (E-PANNs) model, which requires 36% fewer computations and 70% less memory, also slightly improves the sound recognition (audio tagging) performance. The code for the E-PANNs model has been released under an open source license.

Annamaria Mesaros, Romain Serizel, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley (2025)A decade of DCASE: Achievements, practices, evaluations and future challenges, In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP49660.2025.10887673

This paper introduces briefly the history and growth of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, workshop, research area and research community. Created in 2013 as a data evaluation challenge, DCASE has become a major research topic in the Audio and Acoustic Signal Processing area. Its success comes from a combination of factors: the challenge offers a large variety of tasks that are renewed each year; and the workshop offers a channel for dissemination of related work, engaging a young and dynamic community. At the same time, DCASE faces its own challenges, growing and expanding to different areas. One of the core principles of DCASE is open science and reproducibility: publicly available datasets, baseline systems, technical reports and workshop publications. While the DCASE challenge and workshop are independent of IEEE SPS, the challenge receives annual endorsement from the AASP TC, and the DCASE community contributes significantly to the ICASSP flagship conference and the success of SPS in many of its activities.

Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang (2025)Sound-VECaps: Improving Audio Generation with Visually Enhanced Captions, In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP49660.2025.10889473

Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details including audio event orders, occurred places and environment information. We then demonstrate that training the text-to-audio generation models with Sound-VECaps significantly improves the performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks, showing the potential of Sound-VECaps in advancing audio-text representation learning. Dataset and demos are available at https://yyua8222.github.io/Sound-VECaps-demo/.

Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang (2025)FlowSep: Language-Queried Sound Separation with Rectified Flow Matching, In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP49660.2025.10890129

Language-queried audio source separation (LASS) focuses on separating sounds using textual descriptions of the desired sources. Current methods mainly use discriminative approaches, such as time-frequency masking, to separate target sounds and minimize interference from other sources. However, these models face challenges when separating overlapping soundtracks, which may lead to artifacts such as spectral holes or incomplete separation. Rectified flow matching (RFM), a generative model that establishes linear relations between the distribution of data and noise, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns linear flow trajectories from noise to target source features within the variational autoencoder (VAE) latent space. During inference, the RFM-generated latent features are reconstructed into a mel-spectrogram via the pre-trained VAE decoder, followed by a pre-trained vocoder to synthesize the waveform. Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art models across multiple benchmarks, as evaluated with subjective and objective metrics. Additionally, our results show that FlowSep surpasses a diffusion-based LASS model in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks. Code, pre-trained models and demos can be found at: https://audio-agi.github.io/FlowSep_demo/.

Jie Jiang, Qiuqiang Kong, Mark D. Plumbley, Nigel Gilbert, Mark Hoogendoorn, Diederik M. Roijers (2021)Deep Learning-Based Energy Disaggregation and On/Off Detection of Household Appliances, In: ACM Transactions on Knowledge Discovery from Data15(3)50pp. 1-21 Association for Computing Machinery

DOI: 10.1145/3441300

Energy disaggregation, a.k.a. Non-Intrusive Load Monitoring, aims to separate the energy consumption of individual appliances from the readings of a mains power meter measuring the total energy consumption of, e.g., a whole house. Energy consumption of individual appliances can be useful in many applications, e.g., providing appliance-level feedback to the end users to help them understand their energy consumption and ultimately save energy. Recently, with the availability of large-scale energy consumption datasets, various neural network models such as convolutional neural networks and recurrent neural networks have been investigated to solve the energy disaggregation problem. Neural network models can learn complex patterns from large amounts of data and have been shown to outperform the traditional machine learning methods such as variants of hidden Markov models. However, current neural network methods for energy disaggregation are either computationally expensive or are not capable of handling long-term dependencies. In this article, we investigate the application of the recently developed WaveNet models for the task of energy disaggregation. Based on a real-world energy dataset collected from 20 households over 2 years, we show that WaveNet models outperform the state-of-the-art deep learning methods proposed in the literature for energy disaggregation in terms of both error measures and computational cost. On the basis of energy disaggregation, we then investigate the performance of two deep-learning based frameworks for the task of on/off detection which aims at estimating whether an appliance is in operation or not. The first framework obtains the on/off states of an appliance by binarising the predictions of a regression model trained for energy disaggregation, while the second framework obtains the on/off states of an appliance by directly training a binary classifier with binarised energy readings of the appliance serving as the target values. 
Based on the same dataset, we show that for the task of on/off detection the second framework, i.e., directly training a binary classifier, achieves better performance in terms of F1 score.
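The first on/off detection framework described above (binarising the output of a regression model) and its F1 evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the 10 W threshold, the function names and the toy readings are hypothetical and are not taken from the article.

```python
def binarise(power_readings, threshold):
    """Map power readings to on/off states: 1 if at or above threshold, else 0."""
    return [1 if p >= threshold else 0 for p in power_readings]


def f1_score(y_true, y_pred):
    """F1 score for the 'on' class from paired binary state sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


THRESHOLD_WATTS = 10.0  # hypothetical on/off threshold

# Toy appliance readings (hypothetical): measured power, and the power
# predicted by a regression model trained for energy disaggregation.
true_power = [0.0, 2.0, 150.0, 160.0, 5.0, 140.0]
predicted_power = [1.0, 12.0, 130.0, 155.0, 3.0, 8.0]

true_state = binarise(true_power, THRESHOLD_WATTS)
pred_state = binarise(predicted_power, THRESHOLD_WATTS)
score = f1_score(true_state, pred_state)
print(true_state, pred_state, round(score, 3))
```

In the second framework, the binarised readings (`true_state` here) would instead serve as training targets for a classifier that predicts the on/off state directly.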

Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley (2024)SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound, In: IEEE Journal of Selected Topics in Signal Processing18(8)pp. 1448-1461 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/JSTSP.2024.3506286

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general sound, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder (AudioMAE), discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs, even at significantly lower bitrates. Our code and demos are available at https://haoheliu.github.io/SemantiCodec/.

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley (2024)AudioLDM 2: Learning holistic audio generation with self-supervised pretraining, In: AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2024.3399607

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a holistic framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework utilizes a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate other modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on the LOA of audio in our training set. The proposed framework naturally brings advantages such as reusable self-supervised pretrained latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech with three AudioLDM 2 variants demonstrate competitive performance of the AudioLDM 2 framework against previous approaches.

Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark Plumbley, Wenwu Wang (2023)Separate Anything You Describe, In: Official implementation of "Separate Anything You Describe": source code, evaluation benchmark and pre-trained model Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2308.05037

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.

Andres Fernandez, Mark D. Plumbley (2021)Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions Center for Open Science

DOI: 10.31219/osf.io/ren6d

The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S task presented at the 2021 edition of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we visually inspect Uniform Manifold Approximations and Projections (UMAPs) for log-STFT, log-mel and pretrained Look, Listen and Learn (L3) representations of the DCASE UAD-S dataset. In our exploratory investigation, we look for two qualities, Separability (SEP) and Discriminative Support (DSUP), and formulate several hypotheses that could facilitate diagnosis and development of further representation and detection approaches. In particular, we hypothesize that input length and pretraining may regulate a relevant tradeoff between SEP and DSUP. Our code as well as the resulting UMAPs and plots are publicly available. Accepted at the DCASE2021 Workshop.

Xinhao Mei, Haohe Liu, Qiuqiang Kong, Tom Ko, Mark D. Plumbley, Yuexian Zou, Wenwu Wang (2024)WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing32pp. 3339-3354 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2024.3419446

The advancement of audio-language (AL) multi-modal learning tasks has been significant in recent years, yet the limited size of existing audio-language datasets poses challenges for researchers due to the costly and time-consuming collection process. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is for the WavCaps dataset we have proposed to facilitate research in audio-language multimodal learning and demonstrate the potential of utilizing large language models (LLMs) to enhance academic research. Our dataset and codes are available at https://github.com/XinhaoMei/WavCaps.

Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang (2024)Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing32pp. 4313-4327 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2024.3451974

Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, it is notably costly to obtain annotated samples for spatial sound events. Deploying a SELD system in a new environment requires extensive time for re-training and fine-tuning. To overcome these challenges, we propose environment-adaptive Meta-SELD, designed for efficient adaptation to new environments using minimal data. Our method specifically utilizes computationally synthesized spatial data and employs Model-Agnostic Meta-Learning (MAML) on a pre-trained, environment-independent model. The method then utilizes fast adaptation to unseen real-world environments using limited samples from the respective environments. Inspired by the Learning-to-Forget approach, we introduce the concept of selective memory as a strategy for resolving conflicts across environments. This approach involves selectively memorizing target-environment-relevant information and adapting to the new environments through the selective attenuation of model parameters. In addition, we introduce environment representations to characterize different acoustic settings, enhancing the adaptability of our attenuation approach to various environments. We evaluate our proposed method on the development set of the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset and computationally synthesized scenes. Experimental results demonstrate the superior performance of the proposed method compared to conventional supervised learning methods, particularly in localization

Yuanbo Hou, Qiuqiang Kong, Shengchen Li, Mark D. Plumbley (2019)Sound event detection with sequentially labelled data based on Connectionist temporal classification and unsupervised clustering, In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2019.8683627

Sound event detection (SED) methods typically rely on either strongly labelled data or weakly labelled data. As an alternative, sequentially labelled data (SLD) was proposed. In SLD, the events and the order of events in audio clips are known, without knowing the occurrence time of events. This paper proposes a connectionist temporal classification (CTC) based SED system that uses SLD instead of strongly labelled data, with a novel unsupervised clustering stage. Experiments on 41 classes of sound events show that the proposed two-stage method trained on SLD achieves performance comparable to the previous state-of-the-art SED system trained on strongly labelled data, and is far better than another state-of-the-art SED system trained on weakly labelled data, which indicates the effectiveness of the proposed two-stage method trained on SLD without any onset/offset time of sound events.

Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D Plumbley, Wenwu Wang (2024)Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

Universal sound separation (USS) is the task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance. We employ two strategies to utilize SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning. The SSL embeddings are concatenated with the short-time Fourier transform (STFT) to serve as input features for the separation model. We evaluate our methods on the AudioSet dataset, and the experimental results indicate that the proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.

D Giannoulis, E Benetos, A Klapuri, MD Plumbley (2014) Improving instrument recognition in polyphonic music through system integration, pp. 5222-5226

DOI: 10.1109/ICASSP.2014.6854599

A method is proposed for instrument recognition in polyphonic music which combines two independent detector systems: a polyphonic musical instrument recognition system using a missing feature approach, and an automatic music transcription system based on shift-invariant probabilistic latent component analysis that includes instrument assignment. We propose a method to integrate the two systems by fusing the instrument contributions estimated by the first system onto the transcription system in the form of Dirichlet priors. Both systems, as well as the integrated system, are evaluated using a dataset of continuous polyphonic music recordings. Detailed results that highlight a clear improvement in the performance of the integrated system are reported for different training conditions.

Y Nishimori, S Akaho, MD Plumbley (2006) Riemannian optimization method on the flag manifold for independent subspace analysis, In: J Rosca, D Erdogmus, JC Principe, S Haykin (eds.), INDEPENDENT COMPONENT ANALYSIS AND BLIND SIGNAL SEPARATION, PROCEEDINGS 3889, pp. 295-302

AM Stark, MEP Davies, MD Plumbley (2008) Rhythmic analysis for real-time audio effects, In: International Computer Music Conference, ICMC 2008

We outline a set of audio effects that use rhythmic analysis, in particular the extraction of beat and tempo information, to automatically synchronise temporal parameters to the input signal. We demonstrate that this analysis, known as beat-tracking, can be used to create adaptive parameters that adjust themselves according to changes in the properties of the input signal. We present common audio effects such as delay, tremolo and auto-wah augmented in this fashion and discuss their real-time implementation as Audio Unit plug-ins and objects for Max/MSP.
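
A minimal sketch of the idea of tying an effect parameter to a tracked tempo: here a delay time is derived from a tempo estimate so that it stays in step with the beat. The function name and the choice of expressing the delay as a fraction of a beat are assumptions for illustration, and the beat tracker itself is not shown.

```python
def beat_synced_delay_samples(tempo_bpm, sample_rate, beats=0.5):
    """Delay length in samples for a delay lasting `beats` beats.

    `tempo_bpm` is assumed to come from a beat tracker; `beats`
    is a hypothetical user setting (0.5 = half a beat).
    """
    if tempo_bpm <= 0:
        raise ValueError("tempo must be positive")
    seconds_per_beat = 60.0 / tempo_bpm
    return round(seconds_per_beat * beats * sample_rate)


# As the tracked tempo changes, the delay follows it.
for tempo in (90.0, 120.0, 150.0):
    print(tempo, beat_synced_delay_samples(tempo, 44100))
```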

Andrew Nesbit, Mark D. Plumbley, Mike E. Davies (2007) Audio source separation with a signal-adaptive local cosine transform, In: Signal Processing 87(8), pp. 1848-1858 Elsevier

DOI: 10.1016/j.sigpro.2007.02.011

Audio source separation is a very challenging problem, and many different approaches have been proposed in attempts to solve it. We consider the problem of separating sources from two-channel instantaneous audio mixtures. One approach to this is to transform the mixtures into the time-frequency domain to obtain approximately disjoint representations of the sources, and then separate the sources using time-frequency masking. We focus on demixing the sources by binary masking, and assume that the mixing parameters are known. In this paper, we investigate the application of cosine packet (CP) trees as a foundation for the transform. We determine an appropriate transform by applying a computationally efficient best basis algorithm to a set of possible local cosine bases organised in a tree structure. We develop a heuristically motivated cost function which maximises the energy of the transform coefficients associated with a particular source. Finally, we evaluate objectively our proposed transform method by comparing it against fixed-basis transforms such as the short-time Fourier transform (STFT) and modified discrete cosine transform (MDCT). Evaluation results indicate that our proposed transform method outperforms MDCT and is competitive with the STFT, and informal listening tests suggest that the proposed method exhibits less objectionable noise than the STFT.
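
The binary-masking step can be sketched as follows, assuming real-valued transform coefficients and known mixing directions given as angles. Each coefficient pair is projected onto every mixing direction and assigned entirely to the source with the largest projection. This is one common formulation, not necessarily the exact rule used in the paper, and the adaptive cosine packet transform is not shown.

```python
import math


def binary_mask_demix(coeffs_left, coeffs_right, mixing_angles):
    """Demix two-channel transform coefficients by binary masking.

    mixing_angles: one angle per source (assumed known); source j
    is mixed with gains (cos(angle_j), sin(angle_j)).
    Returns one list of estimated coefficients per source.
    """
    directions = [(math.cos(t), math.sin(t)) for t in mixing_angles]
    estimates = [[0.0] * len(coeffs_left) for _ in directions]
    for k, (x1, x2) in enumerate(zip(coeffs_left, coeffs_right)):
        # Projection of the observed pair onto each mixing direction.
        projections = [a1 * x1 + a2 * x2 for a1, a2 in directions]
        # Binary mask: the whole coefficient goes to one source.
        winner = max(range(len(directions)), key=lambda j: abs(projections[j]))
        estimates[winner][k] = projections[winner]
    return estimates
```

When the sources are approximately disjoint in the transform domain, each coefficient is dominated by one source, which is why a transform that improves disjointness helps this kind of masking.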

D Stowell, MD Plumbley (2011) Learning Timbre Analogies from Unlabelled Data by Multivariate Tree Regression, In: JOURNAL OF NEW MUSIC RESEARCH 40(4), pp. 325-336 ROUTLEDGE JOURNALS, TAYLOR & FRANCIS LTD

DOI: 10.1080/09298215.2011.596938

Y Nishimori, S Akaho, S Abdallah, MD Plumbley (2007) Flag manifolds for subspace ICA problems, In: 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol IV, Pts 1-3, pp. 1417-1420

A Robertson, MD Plumbley (2015) Event-based Multitrack Alignment using a Probabilistic Framework, In: Journal of New Music Research 44(2), pp. 71-82

DOI: 10.1080/09298215.2015.1009839

© 2015 Taylor & Francis. This paper presents a Bayesian probabilistic framework for real-time alignment of a recording or score with a live performance using an event-based approach. Multitrack audio files are processed using existing onset detection and harmonic analysis algorithms to create a representation of a musical performance as a sequence of time-stamped events. We propose the use of distributions for the position and relative speed which are sequentially updated in real-time according to Bayes’ theorem. We develop the methodology for this approach by describing its application in the case of matching a single MIDI track and then extend this to the case of multitrack recordings. An evaluation is presented that contrasts our multitrack alignment method with state-of-the-art alignment techniques.
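
The sequential update can be illustrated with a discretised position distribution: each new observed event multiplies the prior by a likelihood and renormalises, following Bayes' theorem. The Gaussian likelihood, the grid of candidate positions and the parameter names are assumptions made for this sketch; the relative-speed distribution described in the paper is omitted for brevity.

```python
import math


def bayes_update(prior, positions, observed, sigma):
    """One Bayesian update of a discrete position distribution.

    prior: probabilities over `positions` (sums to 1).
    observed: position suggested by a newly detected event.
    sigma: assumed spread of the Gaussian observation likelihood.
    """
    likelihood = [math.exp(-0.5 * ((p - observed) / sigma) ** 2) for p in positions]
    unnormalised = [pr * li for pr, li in zip(prior, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


positions = [0.0, 1.0, 2.0, 3.0, 4.0]   # candidate score positions
belief = [0.2] * 5                      # uniform prior
for event in (3.0, 3.0):                # two events near position 3
    belief = bayes_update(belief, positions, event, sigma=0.5)
print(belief)
```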

K O’Hanlon, H Nagano, MD Plumbley (2013) Using Oracle Analysis for Decomposition-Based Automatic Music Transcription 7900, pp. 353-365

DOI: 10.1007/978-3-642-41248-6_19

One approach to Automatic Music Transcription (AMT) is to decompose a spectrogram with a dictionary matrix that contains a pitch-labelled note spectrum atom in each column. AMT performance is typically measured using frame-based comparison, while an alternative perspective is to use an event-based analysis. We have previously proposed an AMT system, based on the use of structured sparse representations. The method is described and experimental results are given, which are seen to be promising. An inspection of the graphical AMT output known as a piano roll may lead one to think that the performance may be slightly better than is suggested by the AMT metrics used. This leads us to perform an oracle analysis of the AMT system, with some interesting outcomes which may have implications for decomposition based AMT in general.

SJ Welburn, MD Plumbley (2010) Improving the performance of pitch estimators, In: 128th Audio Engineering Society Convention 2010 2, pp. 1319-1332

We are looking to use pitch estimators to provide an accurate high-resolution pitch track for resynthesis of musical audio. We found that current evaluation measures such as gross error rate (GER) are not suitable for algorithm selection. In this paper we examine the issues relating to evaluating pitch estimators and use these insights to improve performance of existing algorithms such as the well-known YIN pitch estimation algorithm.
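
For readers unfamiliar with YIN, the following is a simplified sketch of its core steps: the difference function, the cumulative mean normalised difference, and an absolute threshold followed by a descent to the local minimum. It omits parabolic interpolation and the other refinements of the published algorithm, and the threshold value is only a typical choice, so it should not be read as the estimator evaluated in the paper.

```python
import math


def yin_pitch(signal, sample_rate, tau_max, threshold=0.1):
    """Estimate f0 (Hz) of one frame with a simplified YIN; None if unvoiced."""
    window = len(signal) - tau_max
    # Step 1: difference function d(tau).
    diff = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff[tau] = sum((signal[j] - signal[j + tau]) ** 2 for j in range(window))
    # Step 2: cumulative mean normalised difference d'(tau).
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / running if running > 0 else 1.0
    # Step 3: first lag below the threshold, then walk down to the local minimum.
    for tau in range(2, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sample_rate / tau
    return None


# Synthetic test tone: 100 Hz sine sampled at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 100 * n / sr) for n in range(400)]
print(yin_pitch(tone, sr, tau_max=200))
```

Because the lag is restricted to whole samples here, the pitch resolution is coarse at high frequencies, which is one reason a high-resolution pitch track needs more than this basic form.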

Christian Kroos, RL Bundgaard-Nielsen, CT Best, Mark Plumbley (2017) Using deep neural networks to estimate tongue movements from speech face motion, In: Proceedings of AVSP 2017 KTH

This study concludes a tripartite investigation into the indirect visibility of the moving tongue in human speech as reflected in co-occurring changes of the facial surface. We were in particular interested in how the shared information is distributed over the range of contributing frequencies. In the current study we examine the degree to which tongue movements during speech can be reliably estimated from face motion using artificial neural networks. We simultaneously acquired data for both movement types; tongue movements were measured with Electromagnetic Articulography (EMA), face motion with a passive marker-based motion capture system. A multiresolution analysis using wavelets provided the desired decomposition into frequency subbands. In the two earlier studies of the project we established linear and non-linear relations between lingual and facial speech motions, as predicted and compatible with previous research in auditory-visual speech. The results of the current study using a Deep Neural Network (DNN) for prediction show that a substantive amount of variance can be recovered (between 13.9 and 33.2% dependent on the speaker and tongue sensor location). Importantly, however, the recovered variance values and the root mean squared error values of the Euclidean distances between the measured and the predicted tongue trajectories are in the range of the linear estimations of our earlier study.

K O'Hanlon, MD Plumbley (2013) Learning overcomplete dictionaries with ℓ0-sparse Non-negative Matrix Factorisation, In: 2013 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP), pp. 977-980 IEEE

DOI: 10.1109/GlobalSIP.2013.6737056

Non-negative Matrix Factorisation (NMF) is a popular tool in which a ‘parts-based’ representation of a non-negative matrix is sought. NMF tends to produce sparse decompositions. This sparsity is a desirable property in many applications, and Sparse NMF (S-NMF) methods have been proposed to enhance this feature. Typically these enforce sparsity through use of a penalty term, and an ℓ1 norm penalty term is often used. However, an ℓ1 penalty term may not be appropriate in a non-negative framework. In this paper the use of an ℓ0 norm penalty for NMF is proposed, approximated using backwards elimination from an initial NNLS decomposition. Dictionary recovery experiments using overcomplete dictionaries show that this method outperforms both NMF and a state-of-the-art S-NMF method, in particular when the dictionary to be learnt is dense.

KC Tsui, B Azvine, M Plumbley (1996) The roles of neural and evolutionary computing in intelligent software systems, In: BT TECHNOLOGY JOURNAL 14(4), pp. 46-54 SPRINGER

M Barthet, MD Plumbley, A Kachkaev, J Dykes, D Wolff, T Weyde (2014) Big chord data extraction and mining

Harmonic progression is one of the cornerstones of tonal music composition and is thereby essential to many musical styles and traditions. Previous studies have shown that musical genres and composers could be discriminated based on chord progressions modeled as chord n-grams. These studies were however conducted on small-scale datasets and using symbolic music transcriptions. In this work, we apply pattern mining techniques to over 200,000 chord progression sequences out of 1,000,000 extracted from the I Like Music (ILM) commercial music audio collection. The ILM collection spans 37 musical genres and includes pieces released between 1907 and 2013. We developed a single program multiple data parallel computing approach whereby audio feature extraction tasks are split up and run simultaneously on multiple cores. An audio-based chord recognition model (Vamp plugin Chordino) was used to extract the chord progressions from the ILM set. To keep low-weight feature sets, the chord data were stored using a compact binary format. We used the CM-SPADE algorithm, which performs a vertical mining of sequential patterns using co-occurrence information, and which is fast and efficient enough to be applied to big data collections like the ILM set. In order to derive key-independent frequent patterns, the transitions between chords are modeled by changes of qualities (e.g. major, minor, etc.) and root keys (e.g. fourth, fifth, etc.). The resulting key-independent chord progression patterns vary in length (from 2 to 16) and frequency (from 2 to 19,820) across genres. As illustrated by graphs generated to represent frequent 4-chord progressions, some patterns like circle-of-fifths movements are well represented in most genres but in varying degrees. These large-scale results offer the opportunity to uncover similarities and discrepancies between sets of musical pieces and therefore to build classifiers for search and recommendation. They also support the empirical testing of music theory. It is however more difficult to derive new hypotheses from such a dataset due to its size. This can be addressed by using pattern detection algorithms or suitable visualisation which we present in a companion study.
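
The key-independent encoding can be sketched as follows: each transition is described by the root movement and the two chord qualities, so the same progression in different keys maps to the same pattern. Measuring root movement in semitones modulo 12 and the tuple layout are assumptions made for this illustration; the paper's own encoding may differ.

```python
PITCH_CLASSES = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}


def key_independent_transitions(chords):
    """Encode a chord sequence as key-independent transitions.

    chords: list of (root, quality) pairs, e.g. ("C", "maj").
    Returns (root movement in semitones, quality before, quality after).
    """
    transitions = []
    for (root_a, qual_a), (root_b, qual_b) in zip(chords, chords[1:]):
        interval = (PITCH_CLASSES[root_b] - PITCH_CLASSES[root_a]) % 12
        transitions.append((interval, qual_a, qual_b))
    return transitions


in_c = [("C", "maj"), ("F", "maj"), ("G", "maj")]
in_d = [("D", "maj"), ("G", "maj"), ("A", "maj")]
print(key_independent_transitions(in_c) == key_independent_transitions(in_d))
```

Sequences encoded this way could then be passed to a sequential pattern miner; CM-SPADE itself is not shown.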

MG Jafari, MD Plumbley (2008) Separation of stereo speech signals based on a sparse dictionary algorithm, In: EUSIPCO, pp. 1-5

Zuzanna Podwinska, Iwona Sobieraj, Bruno M Fazenda, William J Davies, Mark D. Plumbley (2019) Acoustic event detection from weakly labeled data using auditory salience, In: Proceedings of the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2019.8683586

Acoustic Event Detection (AED) is an important task of machine listening which, in recent years, has been addressed using common machine learning methods like Non-negative Matrix Factorization (NMF) or deep learning. However, most of these approaches do not take into consideration the way that the human auditory system detects salient sounds. In this work, we propose a method for AED using weakly labeled data that combines a Non-negative Matrix Factorization model with a salience model based on predictive coding in the form of Kalman filters. We show that models of auditory perception, particularly auditory salience, can be successfully incorporated into existing AED methods and improve their performance on rare event detection. We evaluate the method on Task 2 of the DCASE 2017 Challenge.
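
To give a feel for salience as prediction error, the sketch below runs a generic scalar Kalman filter with a random-walk state model over a feature trajectory and reports the normalised innovation at each step. The state model, the variances and the use of the normalised innovation as a salience score are all assumptions for illustration; the paper's salience model and its coupling to NMF are not reproduced here.

```python
import math


def kalman_salience(observations, process_var=1e-3, obs_var=1e-1):
    """Normalised prediction error of a scalar random-walk Kalman filter.

    Large values mark observations that the filter did not predict.
    """
    mean, var = observations[0], 1.0
    salience = [0.0]
    for y in observations[1:]:
        # Predict: the state is assumed to stay put, with growing uncertainty.
        var_pred = var + process_var
        innovation = y - mean
        innovation_var = var_pred + obs_var
        salience.append(abs(innovation) / math.sqrt(innovation_var))
        # Update the state estimate with the new observation.
        gain = var_pred / innovation_var
        mean = mean + gain * innovation
        var = (1.0 - gain) * var_pred
    return salience


# A steady feature value with one sudden, rare deviation.
trajectory = [1.0] * 20 + [5.0] + [1.0] * 5
scores = kalman_salience(trajectory)
print(scores.index(max(scores)))
```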

Yizhou Tan, Haojun Ai, Shengchen Li, Mark D. Plumbley (2024) Acoustic Scene Classification across Cities and Devices via Feature Disentanglement, In: IEEE/ACM Transactions on Audio, Speech and Language Processing 32, pp. 1286-1297 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2024.3353578

Acoustic Scene Classification (ASC) is a task that classifies a scene according to environmental acoustic signals. Audios collected from different cities and devices often exhibit biases in feature distributions, which may negatively impact ASC performance. Taking the city and device of the audio collection as two types of data domain, this paper attempts to disentangle the audio features of each domain to remove the related feature biases. A dual-alignment framework is proposed to generalize the ASC system on new devices or cities, by aligning boundaries across domains and decision boundaries within each domain. During the alignment, the maximum classifier discrepancy and gradient reversed layer are used for the feature disentanglement of scene, city and device, while four candidate domain classifiers are proposed to explore the optimal solution of feature disentanglement. To evaluate the dual-alignment framework, three experiments of biased ASC tasks are designed: 1) cross-city ASC in new cities; 2) cross-device ASC in new devices; 3) cross-city-device ASC in new cities and new devices. Results demonstrate the superiority of the proposed framework, showcasing performance improvements of 0.9%, 19.8%, and 10.7% on classification accuracy, respectively. The effectiveness of the proposed feature disentanglement approach is further evaluated in both biased and unbiased ASC problems, and the results demonstrate that better-disentangled audio features can lead to a more robust ASC system across different devices and cities. This paper advocates for the integration of feature disentanglement in ASC systems to achieve more reliable performance.

J Fritsch, J Ganseman, MD Plumbley (2012) A comparison of two different methods for score-informed source separation, In: Proc. 5th International Workshop on Machine Learning and Music (MML 2012), pp. 11-12

We present a new method for score-informed source separation, combining ideas from two previous approaches: one based on parametric modeling of the score which constrains the NMF updating process, the other based on PLCA that uses synthesized scores as prior probability distributions. We experimentally show improved separation results using the BSS EVAL and PEASS toolkits, and discuss strengths and weaknesses compared with the previous PLCA-based approach.

Zhao Ren, Qiuqiang Kong, Jing Han, Mark Plumbley, Björn W Schuller (2019) ATTENTION-BASED ATROUS CONVOLUTIONAL NEURAL NETWORKS: VISUALISATION AND UNDERSTANDING PERSPECTIVES OF ACOUSTIC SCENES, In: 2019 Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) IEEE

DOI: 10.1109/ICASSP.2019.8683434

The goal of Acoustic Scene Classification (ASC) is to recognise the environment in which an audio waveform has been recorded. Recently, deep neural networks have been applied to ASC and have achieved state-of-the-art performance. However, few works have investigated how to visualise and understand what a neural network has learnt from acoustic scenes. Previous work applied local pooling after each convolutional layer, therefore reduced the size of the feature maps. In this paper, we suggest that local pooling is not necessary, but the size of the receptive field is important. We apply atrous Convolutional Neural Networks (CNNs) with global attention pooling as the classification model. The internal feature maps of the attention model can be visualised and explained. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 dataset, our proposed method achieves an accuracy of 72.7 %, significantly outperforming the CNNs without dilation at 60.4 %. Furthermore, our results demonstrate that the learnt feature maps contain rich information on acoustic scenes in the time-frequency domain.
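
The point about receptive field size can be made concrete with the usual formula for a stack of stride-1 convolutions, where each layer adds (kernel size - 1) x dilation to the receptive field. The layer configuration below is hypothetical and is not the architecture used in the paper.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (along one axis) of stacked stride-1 convolutions."""
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        field += (kernel - 1) * dilation
    return field


kernels = [3, 3, 3, 3]
print(receptive_field(kernels, [1, 1, 1, 1]))  # plain convolutions
print(receptive_field(kernels, [1, 2, 4, 8]))  # atrous (dilated) convolutions
```

With dilation, the receptive field grows without any pooling, so the feature maps keep the full time-frequency resolution of the input.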

M.G. Jafari, F.L. Hedayioglu, M.T. Coimbra, Mark D. Plumbley (2011) Blind source separation of periodic sources from sequentially recorded instantaneous mixtures, In: Proceedings of the 7th International Symposium on Image and Signal Processing and Analysis (ISPA 2011), pp. 540-545

We consider the separation of sources when only one movable sensor is available to record a set of mixtures at distinct locations. A single mixture signal is acquired, which is firstly segmented. Then, based on the assumption that the underlying sources are temporally periodic, we align the resulting signals and form a measurement vector on which source separation can be performed. We demonstrate that this approach can successfully recover the original sources both when working with simulated data, and for a real problem of heart sound separation. © 2011 University of Zagreb.

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D Plumbley (2020) PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2880-2894 IEEE

DOI: 10.1109/TASLP.2020.3030497

Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs: https://github.com/qiuqiangkong/audioset_tagging_cnn.
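
Since the headline result is reported as mean average precision, a small sketch of that metric may help. It uses the common non-interpolated definition (precision averaged over the ranks of the positive clips, then averaged over classes) and assumes no tied scores; the released PANNs code may compute the metric differently, for example via a library routine. The data below are made up.

```python
def average_precision(labels, scores):
    """Non-interpolated average precision for one class."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits if hits else 0.0


def mean_average_precision(label_rows, score_rows):
    """mAP over classes; rows are clips, columns are classes."""
    n_classes = len(label_rows[0])
    per_class = [
        average_precision([row[c] for row in label_rows],
                          [row[c] for row in score_rows])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


labels = [[1, 0], [0, 1], [1, 0], [0, 0]]
scores = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.1, 0.3]]
print(mean_average_precision(labels, scores))
```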

D Stowell, D Giannoulis, E Benetos, M Lagrange, MD Plumbley (2015) Detection and Classification of Acoustic Scenes and Events, In: IEEE Transactions on Multimedia 17(10), pp. 1733-1746

DOI: 10.1109/TMM.2015.2428998

For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.

S Ewert, MD Plumbley, M Sandler (2014) Accounting for phase cancellations in non-negative matrix factorization using weighted distances, pp. 649-653

DOI: 10.1109/ICASSP.2014.6853676

Techniques based on non-negative matrix factorization (NMF) have been successfully used to decompose a spectrogram of a music recording into a dictionary of templates and activations. While advanced NMF variants often yield robust signal models, there are usually some inaccuracies in the factorization since the underlying methods are not prepared for phase cancellations that occur when sounds with similar frequency are mixed. In this paper, we present a novel method that takes phase cancellations into account to refine dictionaries learned by NMF-based methods. Our approach exploits the fact that advanced NMF methods are often robust enough to provide information about how sound sources interact in a spectrogram, where they overlap, and thus where phase cancellations could occur. Using this information, the distances used in NMF are weighted entry-wise to attenuate the influence of regions with phase cancellations. Experiments on full-length, polyphonic piano recordings indicate that our method can be successfully used to refine NMF-based dictionaries.

MG Jafari, MD Plumbley (2011) Fast dictionary learning for sparse representations of speech signals, In: IEEE Journal on Selected Topics in Signal Processing 5(5), pp. 1025-1031 IEEE

DOI: 10.1109/JSTSP.2011.2157892

For dictionary-based decompositions of certain types, it has been observed that there might be a link between sparsity in the dictionary and sparsity in the decomposition. Sparsity in the dictionary has also been associated with the derivation of fast and efficient dictionary learning algorithms. Therefore, in this paper we present a greedy adaptive dictionary learning algorithm that sets out to find sparse atoms for speech signals. The algorithm learns the dictionary atoms on data frames taken from a speech signal. It iteratively extracts the data frame with minimum sparsity index, and adds this to the dictionary matrix. The contribution of this atom to the data frames is then removed, and the process is repeated. The algorithm is found to yield a sparse signal decomposition, supporting the hypothesis of a link between sparsity in the decomposition and dictionary. The algorithm is applied to the problem of speech representation and speech denoising, and its performance is compared to other existing methods. The method is shown to find dictionary atoms that are sparser than their time-domain waveform, and also to result in a sparser speech representation. In the presence of noise, the algorithm is found to have similar performance to the well established principal component analysis. © 2011 IEEE.
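
The greedy loop described above can be sketched as follows. The abstract does not define the sparsity index, so the ratio of the ℓ1 norm to the ℓ2 norm is assumed here (smaller means sparser); atom normalisation and the stopping rule are likewise illustrative choices rather than details confirmed by the source.

```python
import math

EPS = 1e-12  # frames with less energy than this are treated as empty


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def sparsity_index(frame):
    """Assumed sparsity index: l1 norm divided by l2 norm."""
    norm = l2_norm(frame)
    if norm < EPS:
        return math.inf
    return sum(abs(x) for x in frame) / norm


def greedy_adaptive_dictionary(frames, n_atoms):
    """Learn up to n_atoms atoms by repeatedly taking the sparsest residual frame."""
    residual = [[float(x) for x in frame] for frame in frames]
    atoms = []
    for _ in range(n_atoms):
        best = min(residual, key=sparsity_index)
        norm = l2_norm(best)
        if norm < EPS:
            break  # nothing left to explain
        atom = [x / norm for x in best]
        atoms.append(atom)
        # Remove this atom's contribution from every frame.
        for frame in residual:
            coeff = sum(a * x for a, x in zip(atom, frame))
            for i, a in enumerate(atom):
                frame[i] -= coeff * a
    return atoms
```

Because each atom is taken from a residual that has already had the earlier atoms projected out, the atoms produced by this sketch are mutually orthogonal.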

G Roma, Emad M Grais, AJR Simpson, Mark D Plumbley (2016) MUSIC REMIXING AND UPMIXING USING SOURCE SEPARATION, In: Proceedings of the 2nd AES Workshop on Intelligent Music Production

Current research on audio source separation provides tools to estimate the signals contributed by different instruments in polyphonic music mixtures. Such tools can already be incorporated in music production and post-production workflows. In this paper, we describe recent experiments where audio source separation is applied to remixing and upmixing existing mono and stereo music content.

Tijs Duel, David M. Frohlich, Christian Kroos, Yong Xu, Philip J. B. Jackson, Mark D. Plumbley (2018) Supporting audiography: Design of a system for sentimental sound recording, classification and playback, In: Communications in Computer and Information Science: HCI International 2018 - Posters' Extended Abstracts 850, pp. 24-31 Scientific Publishing Services, on behalf of Springer

DOI: 10.1007/978-3-319-92270-6_4

It is now commonplace to capture and share images in photography as triggers for memory. In this paper we explore the possibility of using sound in the same sort of way, in a practice we call audiography. We report an initial design activity to create a system called Audio Memories comprising a ten second sound recorder, an intelligent archive for auto-classifying sound clips, and a multi-layered sound player for the social sharing of audio souvenirs around a table. The recorder and player components are essentially user experience probes that provide tangible interfaces for capturing and interacting with audio memory cues. We discuss our design decisions and process in creating these tools that harmonize user interaction and machine listening to evoke rich memories and conversations in an exploratory and open-ended way.

SA Abdallah, MD Plumbley (2010) A measure of statistical complexity based on predictive information

We introduce an information theoretic measure of statistical structure, called 'binding information', for sets of random variables, and compare it with several previously proposed measures including excess entropy, Bialek et al.'s predictive information, and the multi-information. We derive some of the properties of the binding information, particularly in relation to the multi-information, and show that, for finite sets of binary random variables, the processes which maximise binding information are the 'parity' processes. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity.
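
A small numerical check of the parity claim, assuming the binding information of a set of variables is defined as the joint entropy minus the sum of each variable's entropy conditioned on all the others (the abstract itself does not spell out the formula). The helper names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, keep):
    """Marginal distribution over the variable indices in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out


def binding_information(joint):
    """H(X) - sum_i H(X_i | all other variables), in bits (assumed definition)."""
    n = len(next(iter(joint)))
    h_joint = entropy(joint)
    conditional_sum = 0.0
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        # H(X_i | rest) = H(X) - H(rest)
        conditional_sum += h_joint - entropy(marginal(joint, rest))
    return h_joint - conditional_sum


outcomes = list(product((0, 1), repeat=3))
independent = {o: 1 / 8 for o in outcomes}
parity = {o: 0.25 for o in outcomes if sum(o) % 2 == 0}
print(binding_information(independent), binding_information(parity))
```

For three independent fair bits every variable is unpredictable from the others, so nothing is bound; in the even-parity process any variable is fully determined by the other two.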

Emad M Grais, Mark D Plumbley (2017) Single Channel Audio Source Separation using Convolutional Denoising Autoencoders, In: GlobalSIP2017 Proceedings, pp. 1265-1269 IEEE

DOI: 10.1109/GlobalSIP.2017.8309164

Deep learning techniques have been used recently to tackle the audio source separation problem. In this work, we propose to use deep fully convolutional denoising autoencoders (CDAEs) for monaural audio source separation. We use as many CDAEs as the number of sources to be separated from the mixed signal. Each CDAE is trained to separate one source and treats the other sources as background noise. The main idea is to allow each CDAE to learn suitable spectral-temporal filters and features to its corresponding source. Our experimental results show that CDAEs perform source separation slightly better than the deep feedforward neural networks (FNNs) even with fewer parameters than FNNs.

AM Stark, MD Plumbley (2012) Performance Following: Real-Time Prediction of Musical Sequences Without a Score, In: IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 20(1), pp. 190-199 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASL.2011.2159593

Alfredo Zermini, Wenwu Wang, Qiuqiang Kong, Yong Xu, Mark Plumbley (2017) Audio source separation with deep neural networks using the dropout algorithm, In: Signal Processing with Adaptive Sparse Structured Representations (SPARS) 2017 Book of Abstracts, pp. 1-2 Instituto de Telecomunicações

A method based on Deep Neural Networks (DNNs) and time-frequency masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft time-frequency masks for the recovery/estimation of the individual audio sources. In this paper, an algorithm called ‘dropout’ is applied to the hidden layers, affecting the sparsity of hidden unit activations: randomly selected neurons and their connections are dropped during the training phase, preventing feature co-adaptation. These methods are evaluated on binaural mixtures generated with Binaural Room Impulse Responses (BRIRs), accounting for a certain level of room reverberation. The results show that the proposed DNN system with randomly deleted neurons is able to achieve higher SDR performance compared to the baseline method without the dropout algorithm.
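
A minimal sketch of dropout on one layer's activations. It uses the "inverted" variant, which rescales the surviving units during training so that nothing needs to change at test time; the original formulation instead rescales at test time. The drop probability and function name are illustrative and are not taken from the paper.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero hidden-unit activations during training (inverted dropout)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Kept units are scaled by 1/keep_prob so the expected value is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
print(dropout(hidden, 0.5, rng))                  # some units zeroed
print(dropout(hidden, 0.5, rng, training=False))  # unchanged at test time
```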

Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark Plumbley, Wenwu Wang (2023) Text-Driven Foley Sound Generation With Latent Diffusion Model, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2306.10359

Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pretraining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks 1st among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The codes for implementing the proposed system are available online.
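
The candidate-selection step can be sketched as below, assuming cosine similarity as the similarity score and treating the embeddings as plain vectors. In the described system the embeddings would come from audio and text encoders, which are not modelled here; the vectors in the example are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_best_candidate(candidate_embeddings, text_embedding):
    """Return (index, score) of the candidate most similar to the text label."""
    scores = [cosine_similarity(c, text_embedding) for c in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Three hypothetical candidate clips and one target text embedding.
candidates = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
target = [0.0, 1.0]
print(select_best_candidate(candidates, target))
```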

Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark Plumbley, Wenwu Wang (2023) Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2305.15905

Foley sound presents the background sound for multimedia content and the generation of Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we propose a system for DCASE 2023 challenge task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, which is a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system is first trained with large-scale datasets and then adapted to this DCASE task via transfer learning. Through experiments, we found that the feature extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by leveraging the input label with related text embedding features obtained by a significant language model, i.e., contrastive language-audio pretraining (CLAP). In addition, we utilize a filtering strategy to further refine the output, i.e. by selecting the best results from the candidate clips generated in terms of the similarity score between the sound and target labels. The overall system achieves a Fréchet audio distance (FAD) score of 4.765 on average among all seven different classes, substantially outperforming the baseline system, which has a FAD score of 9.7.

Arshdeep Singh, Haohe Liu, Mark D. Plumbley (2023)E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks

K Toyama, MD Plumbley (2009)Estimating Phase Linearity in the Frequency-Domain ICA Demixing Matrix, In: T Adali, C Jutten, JMT Romano, AK Barros (eds.), INDEPENDENT COMPONENT ANALYSIS AND SIGNAL SEPARATION, PROCEEDINGS5441pp. 362-370

Dominic Ward, Hagen Wierstorf, Russell Mason, Mark Plumbley, Christopher Hummersone (2017)Estimating the loudness balance of musical mixtures using audio source separation, In: Proceedings of the 3rd Workshop on Intelligent Music Production (WIMP 2017)

To assist with the development of intelligent mixing systems, it would be useful to be able to extract the loudness balance of sources in an existing musical mixture. The relative-to-mix loudness level of four instrument groups was predicted using the sources extracted by 12 audio source separation algorithms. The predictions were compared with the ground truth loudness data of the original unmixed stems obtained from a recent dataset involving 100 mixed songs. It was found that the best source separation system could predict the relative loudness of each instrument group with an average root-mean-square error of 1.2 LU, with superior performance obtained on vocals.

S Li, S Dixon, Mark D Plumbley (2018)A demonstration of hierarchical structure usage in expressive timing analysis by model selection tests, In: Proceedings of (CCC2018) IEEE

DOI: 10.23919/ChiCC.2018.8483169

Analysing expressive timing in performed music can help machines to perform various perceptual tasks such as identifying performers and understanding music structures in classical music. A hierarchical structure is commonly used for expressive timing analysis. This paper provides a statistical demonstration to support the use of hierarchical structure in expressive timing analysis by presenting two groups of model selection tests. The first model selection test uses expressive timing to determine the location of music structure boundaries. The second model selection test matches a performance with the same performer playing another given piece. Comparing the results of the model selection tests, the preferred hierarchical structures in these two tests are not the same. While determining music structure boundaries demands a hierarchical structure with more levels in the expressive timing analysis, a hierarchical structure with fewer levels helps to identify the performer in most cases.

Christian Kroos, Oliver Bones, Yin Cao, Lara Harris, Philip J. B. Jackson, William J. Davies, Wenwu Wang, Trevor J. Cox, Mark D. Plumbley (2019)GENERALISATION IN ENVIRONMENTAL SOUND CLASSIFICATION: THE 'MAKING SENSE OF SOUNDS' DATA SET AND CHALLENGE, In: 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)2019-pp. 8082-8086 IEEE

DOI: 10.1109/ICASSP.2019.8683292

Humans are able to identify a large number of environmental sounds and categorise them according to high-level semantic categories, e.g. urban sounds or music. They are also capable of generalising from past experience to new sounds when applying these categories. In this paper we report on the creation of a data set that is structured according to the top-level of a taxonomy derived from human judgements and the design of an associated machine learning challenge, in which strong generalisation abilities are required to be successful. We introduce a baseline classification system, a deep convolutional network, which showed strong performance with an average accuracy on the evaluation data of 80.8%. The result is discussed in the light of two alternative explanations: An unlikely accidental category bias in the sound recordings or a more plausible true acoustic grounding of the high-level categories.

D Ellis, T Virtanen, Mark Plumbley, B Raj (2018)Future Perspective, In: T Virtanen, Mark Plumbley, D Ellis (eds.), Computational Analysis of Sound Scenes and Eventspp. 401-415 Springer International Publishing

DOI: 10.1007/978-3-319-63450-0_14

F Hedayioglu, MG Jafari, SS Mattos, MD Plumbley, MT Coimbra (2012)Denoising and Segmentation of the Second Heart Sound Using Matching Pursuit, In: 2012 ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC)pp. 3440-3443

D Stowell, MD Plumbley (2014)Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning, In: PeerJ2e488 PeerJ

DOI: 10.7717/peerj.488

Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not discernible. We study the interaction between dataset characteristics and choice of feature representation through further empirical analysis.

D Stowell, MD Plumbley (2014)Audio-only bird classification using unsupervised feature learningpp. 673-684

We describe our method for automatic bird species classification, which uses raw audio without segmentation and without using any auxiliary metadata. It successfully classifies among 501 bird categories, and was by far the highest scoring audio-only bird recognition algorithm submitted to BirdCLEF 2014. Our method uses unsupervised feature learning, a technique which learns regularities in spectro-temporal content without reference to the training labels, which helps a classifier to generalise to further content of the same type. Our strongest submission uses two layers of feature learning to capture regularities at two different time scales.

Yong Xu, Qiuqiang Kong, Wenwu Wang, Mark Plumbley (2018)Large-scale weakly supervised audio classification using gated convolutional neural network, In: Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)pp. 121-125 IEEE

DOI: 10.1109/ICASSP.2018.8461975

In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, which are extracted from YouTube videos, are manually labelled with one or more audio tags, but without time stamps of the audio events, and are hence referred to as weakly labelled data. Two subtasks are defined in this challenge: audio tagging and sound event detection using this weakly labelled data. We propose a convolutional recurrent neural network (CRNN) with learnable gated linear unit (GLU) non-linearities applied on the log Mel spectrogram. In addition, we propose a temporal attention method along the frames to predict the locations of each audio event in a chunk from the weakly labelled data. As a team, our systems ranked 1st and 2nd in these two sub-tasks of the DCASE 2017 challenge, with an F value of 55.6% and an equal error of 0.73, respectively.
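The gated linear unit mentioned in this abstract multiplies a linear path by a sigmoid gate, so the gate decides how much of each feature passes to the next layer. The sketch below shows only that element-wise operation, assuming the two paths have already been computed by separate convolutions; the function names and the toy values are hypothetical and this is not the authors' network.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_linear_unit(linear_out, gate_out):
    """Element-wise GLU: linear path scaled by a sigmoid gate in (0, 1)."""
    return [a * sigmoid(b) for a, b in zip(linear_out, gate_out)]


# Toy feature values: the gate is half open, fully open, then closed.
linear_path = [2.0, -1.0, 3.0]
gate_path = [0.0, 100.0, -1000.0]
gated = gated_linear_unit(linear_path, gate_path)
```

A gate input near zero passes half of the feature, a large positive input passes almost all of it, and a large negative input suppresses it.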

K O'Hanlon, MD Plumbley (2014)Polyphonic piano transcription using non-negative Matrix Factorisation with group sparsitypp. 3112-3116

DOI: 10.1109/ICASSP.2014.6854173

Non-negative Matrix Factorisation (NMF) is a popular tool in musical signal processing. However, problems using this methodology in the context of Automatic Music Transcription (AMT) have been noted resulting in the proposal of supervised and constrained variants of NMF for this purpose. Group sparsity has previously been seen to be effective for AMT when used with stepwise methods. In this paper group sparsity is introduced to supervised NMF decompositions and a dictionary tuning approach to AMT is proposed based upon group sparse NMF using the β-divergence. Experimental results are given showing improved AMT results over the state-of-the-art NMF-based AMT system.

Chris Baume, Mark D. Plumbley, Janko Ćalić, David Frohlich (2018)A Contextual Study of Semantic Speech Editing in Radio Production, In: International Journal of Human-Computer Studies115pp. 67-80 Elsevier

DOI: 10.1016/j.ijhcs.2018.03.006

Radio production involves editing speech-based audio using tools that represent sound using simple waveforms. Semantic speech editing systems allow users to edit audio using an automatically generated transcript, which has the potential to improve the production workflow. To investigate this, we developed a semantic audio editor based on a pilot study. Through a contextual qualitative study of five professional radio producers at the BBC, we examined the existing radio production process and evaluated our semantic editor by using it to create programmes that were later broadcast. We observed that the participants in our study wrote detailed notes about their recordings and used annotation to mark which parts they wanted to use. They collaborated closely with the presenter of their programme to structure the contents and write narrative elements. Participants reported that they often work away from the office to avoid distractions, and print transcripts so they can work away from screens. They also emphasised that listening is an important part of production, to ensure high sound quality. We found that semantic speech editing with automated speech recognition can be used to improve the radio production workflow, but that annotation, collaboration, portability and listening were not well supported by current semantic speech editing systems. In this paper, we make recommendations on how future semantic speech editing systems can better support the requirements of radio production.

AM Stark, MEP Davies, MD Plumbley (2009)Real-time beat-synchronous analysis of musical audio, In: Proceedings of the 12th International Conference on Digital Audio Effects, DAFx 2009pp. 299-304

In this paper we present a model for beat-synchronous analysis of musical audio signals. Introducing a real-time beat tracking model with performance comparable to offline techniques, we discuss its application to the analysis of musical performances segmented by beat. We discuss the various design choices for beat-synchronous analysis and their implications for real-time implementations before presenting some beat-synchronous harmonic analysis examples. We make available our beat tracker and beat-synchronous analysis techniques as externals for Max/MSP.

YIN CAO, TURAB IQBAL, Qiuqiang Kong, YUE ZHONG, WENWU WANG, MARK DAVID PLUMBLEY (2020)EVENT-INDEPENDENT NETWORK FOR POLYPHONIC SOUND EVENT LOCALIZATION AND DETECTION, In: Proceedings of the 5th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020)

DOI: 10.5281/zenodo.4061782

Polyphonic sound event localization and detection involves not only detecting what sound events are happening but also localizing the corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduces additional challenges in moving sound sources and overlapping-event cases, which include two events of the same type with two different direction-of-arrival (DoA) angles. In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method we proposed in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are then fed into a 1-D convolutional layer to extract acoustic features. The network is then split into two parallel branches. The first branch is for sound event detection (SED), and the second branch is for DoA estimation. There are three types of predictions from the network: SED predictions, DoA predictions, and event activity detection (EAD) predictions that are used to combine the SED and DoA features for onset and offset estimation. All of these predictions have the format of two tracks indicating that there are at most two overlapping events. Within each track, there could be at most one event happening. This architecture introduces a problem of track permutation. To address this problem, a frame-level permutation invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Its performance on the Task 3 dataset is greatly increased as compared with that of the baseline method. Index Terms: Sound event localization and detection, direction of arrival, event-independent, permutation invariant training.
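The idea of frame-level permutation invariant training with two tracks can be sketched as follows: for every frame, the loss is computed for both possible assignments of predicted tracks to target tracks, and the smaller one is used. This is an illustrative sketch only. Squared error on scalar values stands in for the actual SED and DoA losses, and the function name and toy data are hypothetical.

```python
def frame_level_pit_loss(pred_tracks, target_tracks):
    """Mean over frames of the smaller of the two track-assignment losses.

    pred_tracks, target_tracks: lists of (track_a, track_b) pairs, one per frame.
    Squared error stands in for the real SED/DoA losses.
    """
    def sq(a, b):
        return (a - b) ** 2

    total = 0.0
    assignments = []
    for (p1, p2), (t1, t2) in zip(pred_tracks, target_tracks):
        keep = sq(p1, t1) + sq(p2, t2)  # track order kept
        swap = sq(p1, t2) + sq(p2, t1)  # track order swapped
        assignments.append("keep" if keep <= swap else "swap")
        total += min(keep, swap)
    return total / len(pred_tracks), assignments


# Toy example: in the second frame the network put the event on the other track.
preds = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]
targets = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
loss, assignments = frame_level_pit_loss(preds, targets)
```

Because the minimum is taken independently in each frame, a prediction is not penalised for placing an event on a different track from the one used in the reference labels.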

AM Stark, MD Plumbley (2009)Real-time chord recognition for live performance, In: Proceedings of the 2009 International Computer Music Conference, ICMC 2009pp. 85-88

This paper describes work aimed at creating an efficient, real-time, robust and high performance chord recognition system for use on a single instrument in a live performance context. An improved chroma calculation method is combined with a classification technique based on masking out expected note positions in the chromagram and minimising the residual energy. We demonstrate that our approach can be used to classify a wide range of chords, in real-time, on a frame by frame basis. We present these analysis techniques as externals for Max/MSP.
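The classification step described in this abstract, masking out the expected note positions and minimising the residual energy, can be sketched as below. This is a simplified illustration under stated assumptions: it uses a 12-bin chroma vector, considers only major and minor triads (so every mask removes the same number of bins and no normalisation is needed), and omits the improved chroma calculation. The names and the toy chroma frame are hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
TRIADS = {"maj": (0, 4, 7), "min": (0, 3, 7)}  # intervals in semitones above the root


def residual_energy(chroma, chord_bins):
    """Energy left in the chroma vector after masking out the chord's bins."""
    return sum(v * v for i, v in enumerate(chroma) if i not in chord_bins)


def classify_chord(chroma):
    """Return the triad whose mask leaves the least residual energy."""
    best_name, best_residual = None, float("inf")
    for root in range(12):
        for quality, intervals in TRIADS.items():
            bins = {(root + k) % 12 for k in intervals}
            r = residual_energy(chroma, bins)
            if r < best_residual:
                best_name, best_residual = NOTE_NAMES[root] + quality, r
    return best_name


# Toy chroma frame with energy mostly on C, E and A (an A minor triad).
frame = [0.9, 0.0, 0.1, 0.0, 0.8, 0.0, 0.0, 0.1, 0.0, 1.0, 0.0, 0.0]
chord = classify_chord(frame)
```

Supporting chord types with different numbers of notes would need some compensation for mask size, since a larger mask always leaves less residual energy.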

Hagen Wierstorf, Dominic Ward, Russell Mason, Emad M Grais, Christopher Hummersone, Mark Plumbley (2017)Perceptual Evaluation of Source Separation for Remixing Music, In: 143rd AES Convention Paper No 9880 Audio Engineering Society

Music remixing is difficult when the original multitrack recording is not available. One solution is to estimate the elements of a mixture using source separation. However, existing techniques suffer from imperfect separation and perceptible artifacts on single separated sources. To investigate their influence on a remix, five state-of-the-art source separation algorithms were used to remix six songs by increasing the level of the vocals. A listening test was conducted to assess the remixes in terms of loudness balance and sound quality. The results show that some source separation algorithms are able to increase the level of the vocals by up to 6 dB at the cost of introducing a small but perceptible degradation in sound quality.

Iwona Sobieraj, Lucas Rencker, Mark D Plumbley (2018)Orthogonality-regularized masked NMF for learning on weakly labeled audio data, In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)pp. 2436-2440 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2018.8461293

Non-negative Matrix Factorization (NMF) is a well established tool for audio analysis. However, it is not well suited for learning on weakly labeled data, i.e. data where the exact timestamp of the sound of interest is not known. In this paper we propose a novel extension to NMF, that allows it to extract meaningful representations from weakly labeled audio data. Recently, a constraint on the activation matrix was proposed to adapt for learning on weak labels. To further improve the method we propose to add an orthogonality regularizer of the dictionary in the cost function of NMF. In that way we obtain appropriate dictionaries for the sounds of interest and background sounds from weakly labeled data. We demonstrate that the proposed Orthogonality-Regularized Masked NMF (ORM-NMF) can be used for Audio Event Detection of rare events and evaluate the method on the development data from Task2 of DCASE2017 Challenge.

EM Grais, G Roma, AJR Simpson, MD Plumbley (2016)Discriminative Enhancement for Single Channel Audio Source Separation using Deep Neural Networks, In: arXiv

J-B Thiebaut, SA Abdallah, A Robertson, N Bryan-Kinns, MD Plumbley (2008)Real Time Gesture Learning and Recognition: Towards Automatic Categorization., In: NIMEpp. 215-218

B Mailhe, B Sturm, MD Plumbley (2013)BEHAVIOR OF GREEDY SPARSE REPRESENTATION ALGORITHMS ON NESTED SUPPORTS, In: 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 5710-5714

D Barchiesi, MD Plumbley (2013)Learning incoherent subspaces for classification via supervised iterative projections and rotations, In: Proceedings of MLSP 2013 IEEE

DOI: 10.1109/MLSP.2013.6661981

In this paper we present the supervised iterative projections and rotations (S-IPR) algorithm, a method to optimise a set of discriminative subspaces for supervised classification. We show how the proposed technique is based on our previous unsupervised iterative projections and rotations (IPR) algorithm for incoherent dictionary learning, and how projecting the features onto the learned subspaces can be employed as a feature transform algorithm in the context of classification. Numerical experiments on the FISHERIRIS and USPS datasets, and a comparison with the PCA and LDA methods for feature transform, demonstrate the value of the proposed technique and its potential as a tool for machine learning.

MD Plumbley (2007)Dictionary learning for L1-exact sparse coding, In: ME Davies, CJ James, SA Abdallah, MD Plumbley (eds.), Independent Component Analysis and Signal Separation, Proceedings4666pp. 406-413

D Stowell, MD Plumbley (2014)Robust bird species recognition: making it work for dawn chorus audio archives

The recent (2013) bird species recognition challenges organised by the SABIOD project attracted some strong performances from automatic classifiers applied to short audio excerpts from passive acoustic monitoring stations. Can such strong results be achieved for dawn chorus field recordings in audio archives? The question is important because archives (such as the British Library Sound Archive) hold thousands of such recordings, covering many decades and many countries, but they are mostly unlabelled. Automatic labelling holds the potential to unlock their value to ecological studies. Audio in such archives is quite different from passive acoustic monitoring data: importantly, the recording conditions vary randomly (and are usually unknown), making the scenario a “cross-condition” rather than “single-condition” train/test task. Dawn chorus recordings are generally long, and the annotations often indicate which birds are in a 20-minute recording but not within which 5-second segments they are active. Further, the amount of annotation available is very small. We report on experiments to evaluate a variety of classifier configurations for automatic multilabel species annotation in dawn chorus archive recordings. The audio data is an order of magnitude larger than the SABIOD challenges, but the ground-truth data is an order of magnitude smaller. We report some surprising findings, including clear variation in the benefits of some analysis choices (audio features, pooling techniques, noise-robustness techniques) as we move to handle the specific multi-condition case relevant for audio archives.

N Keriven, K O'Hanlon, MD Plumbley (2013)STRUCTURED SPARSITY USING BACKWARDS ELIMINATION FOR AUTOMATIC MUSIC TRANSCRIPTION, In: S Sanei, P Smaragdis, A Nandi, ATS Ho, J Larsen (eds.), 2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP)

DOI: 10.1109/MLSP.2013.6661917

D Stowell, MD Plumbley (2010)Cross-associating unlabelled timbre distributions to create expressive musical mappings., In: T Diethe, N Cristianini, J Shawe-Taylor (eds.), WAPA11pp. 28-35

M Plumbley (2002)Conditions for nonnegative independent component analysis, In: IEEE SIGNAL PROCESSING LETTERS9(6)pp. 177-180 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/LSP.2002.800502

Lucas Rencker, F Bach, Wenwu Wang, Mark D Plumbley (2018)Consistent dictionary learning for signal declipping, In: Y Deville, S Gannot, R Mason, Mark Plumbley, D Ward (eds.), Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science10891pp. 446-455 Springer Verlag

DOI: 10.1007/978-3-319-93764-9_41

Clipping, or saturation, is a common nonlinear distortion in signal processing. Recently, declipping techniques have been proposed based on sparse decomposition of the clipped signals on a fixed dictionary, with additional constraints on the amplitude of the clipped samples. Here we propose a dictionary learning approach, where the dictionary is directly learned from the clipped measurements. We propose a soft-consistency metric that minimizes the distance to a convex feasibility set, and takes into account our knowledge about the clipping process. We then propose a gradient descent-based dictionary learning algorithm that minimizes the proposed metric, and is thus consistent with the clipping measurement. Experiments show that the proposed algorithm outperforms other dictionary learning algorithms applied to clipped signals. We also show that learning the dictionary directly from the clipped signals outperforms consistent sparse coding with a fixed dictionary.
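The soft-consistency idea in this abstract can be illustrated with a small sketch: the cost is zero whenever the estimate could have produced the observed clipped samples, and otherwise grows with the squared distance to that feasible set. The code assumes symmetric hard clipping at a known level and uses hypothetical function names; it shows only the cost evaluation, not the authors' dictionary learning algorithm.

```python
def clip(signal, level):
    """Hard-clip every sample to the range [-level, +level]."""
    return [max(-level, min(level, s)) for s in signal]


def soft_consistency_cost(estimate, clipped, level):
    """Half the squared distance from `estimate` to the set of signals
    that would produce `clipped` when hard-clipped at +/- `level`."""
    cost = 0.0
    for x, y in zip(estimate, clipped):
        if y >= level:        # positively clipped: any x >= level is consistent
            err = max(0.0, level - x)
        elif y <= -level:     # negatively clipped: any x <= -level is consistent
            err = max(0.0, x + level)
        else:                 # unclipped sample: x must match y
            err = x - y
        cost += 0.5 * err * err
    return cost


# Toy signal clipped at 1.0.
original = [0.2, 1.5, -2.0, 0.7]
observed = clip(original, 1.0)
cost_true = soft_consistency_cost(original, observed, 1.0)
cost_bad = soft_consistency_cost([0.2, 0.5, -1.0, 0.7], observed, 1.0)
```

Here `cost_true` is zero because the original signal is consistent with its own clipped version, while `cost_bad` is positive because the second sample falls below the clipping level.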

P Brossier, JP Bello, MD Plumbley (2004)Real-time temporal segmentation of note objects in music signals, In: Proceedings of ICMC 2004, the 30th Annual International Computer Music Conference

Segmenting note objects in a real-time context is useful for live performances, audio broadcasting, or object-based coding. This temporal segmentation relies upon the correct detection of onsets and offsets of musical notes, an area of much research over recent years. However, the low-latency requirements of real-time systems impose new, tight constraints on this process. In this paper, we present a system for the segmentation of note objects with very short delays, using recent developments in onset detection, specially modified to work in a real-time context. A portable and open C implementation is presented.

Lucas Rencker, Francis Bach, Wenwu Wang, Mark D. Plumbley (2019)Sparse Recovery and Dictionary Learning From Nonlinear Compressive Measurements, In: IEEE Transactions on Signal Processing67(21)pp. 5659-5670 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TSP.2019.2941070

Sparse coding and dictionary learning are popular techniques for linear inverse problems such as denoising or inpainting. However in many cases, the measurement process is nonlinear, for example for clipped, quantized or 1-bit measurements. These problems have often been addressed by solving constrained sparse coding problems, which can be difficult to solve, and assuming that the sparsifying dictionary is known and fixed. Here we propose a simple and unified framework to deal with nonlinear measurements. We propose a cost function that minimizes the distance to a convex feasibility set, which models our knowledge about the nonlinear measurement. This provides an unconstrained, convex, and differentiable cost function that is simple to optimize, and generalizes the linear least squares cost commonly used in sparse coding. We then propose proximal based sparse coding and dictionary learning algorithms, that are able to learn directly from nonlinearly corrupted signals. We show how the proposed framework and algorithms can be applied to clipped, quantized and 1-bit data.

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark Plumbley (2017)A joint detection-classification model for audio tagging of weakly labelled data, In: Proceedings of ICASSP 2017 IEEE

DOI: 10.1109/ICASSP.2017.7952234

Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labelling of an audio clip is often based on the audio events in the clip, and no event-level label is provided to the user. Previous works have used the bag of frames model, which assumes the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative sounds and ignore uninformative ones. Then only informative regions are used for classification. Experimental results on the “CHiME Home” dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing the event-level label.
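One way to picture the joint detection-classification idea is as a weighted average of frame-level classifier outputs, where the weights come from a detector that marks which frames are informative. The sketch below is a hypothetical illustration of that combination for a single tag, compared with a plain bag-of-frames average. The normalisation used here is an assumption and the toy numbers are made up.

```python
def jdc_clip_probability(classifier_out, detector_out):
    """Combine frame-level classifier and detector outputs for one tag.

    The detector outputs are normalised into weights that sum to one, so frames
    the detector finds uninformative contribute little to the clip-level score.
    """
    total = sum(detector_out)
    if total == 0.0:
        raise ValueError("detector output must not be all zeros")
    weights = [d / total for d in detector_out]
    clip_prob = sum(w * c for w, c in zip(weights, classifier_out))
    return clip_prob, weights


# Toy frame-level outputs: the tag is only audible in the third and fourth frames.
classifier = [0.1, 0.2, 0.9, 0.8, 0.1]
detector = [0.0, 0.0, 1.0, 1.0, 0.0]
clip_prob, weights = jdc_clip_probability(classifier, detector)
bag_of_frames = sum(classifier) / len(classifier)
```

In this toy case the weighted score is much higher than the bag-of-frames average, because silent frames no longer dilute the evidence, and the weights themselves indicate where the event occurs.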

Qiuqiang Kong, Yong Xu, Mark Plumbley (2017)Joint Detection and Classification Convolutional Neural Network on Weakly Labelled Bird Audio Detection, In: EUSIPCO 2017 Proceedings IEEE

DOI: 10.23919/EUSIPCO.2017.8081509

Bird audio detection (BAD) aims to detect whether there is a bird call in an audio recording or not. One difficulty of this task is that the bird sound datasets are weakly labelled, that is, only the presence or absence of a bird in a recording is known, without knowing when the birds call. We propose to apply a joint detection and classification (JDC) model to the weakly labelled data (WLD) to detect and classify an audio clip at the same time. First, we apply a VGG-like convolutional neural network (CNN) on the mel spectrogram as a baseline. Then we propose a JDC-CNN model with VGG as a classifier and CNN as a detector. We report that denoising methods, including optimally-modified log-spectral amplitude (OM-LSA), median filter and spectral spectrogram, worsen the classification accuracy, contrary to previous work. JDC-CNN can predict the time stamps of the events from weakly labelled data, so it is able to do sound event detection from WLD. We obtained an area under curve (AUC) of 95.70% on the development data and 81.36% on the unseen evaluation data, which is nearly comparable to the baseline CNN model.

Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang (2022)Sound Event Localization And Detection For Real Spatial Sound Scenes: Event-independent Network And Data Augmentation Chains, In: Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)

Sound event localization and detection (SELD) is a joint task of sound event detection and direction-of-arrival estimation. In DCASE 2022 Task 3, the type of data changes from computationally generated spatial recordings to recordings of real sound scenes. Our system submitted to DCASE 2022 Task 3 is based on our previously proposed Event-Independent Network V2 (EINV2) with a novel data augmentation method. Our method employs EINV2 with a track-wise output format, permutation-invariant training, and a soft parameter-sharing strategy, to detect different sound events of the same class but in different locations. The Conformer structure is used for extending EINV2 to learn local and global features. A data augmentation method, which contains several data augmentation chains composed of stochastic combinations of several different data augmentation operations, is utilized to generalize the model. To mitigate the lack of real-scene recordings in the development dataset and the unbalanced presence of sound events, we exploit FSD50K, AudioSet, and the TAU Spatial Room Impulse Response Database (TAU-SRIR DB) to generate simulated datasets for training. We present results on the validation set of Sony-TAU Realistic Spatial Soundscapes 2022 (STARSS22) in detail. Experimental results indicate that the ability to generalize to different environments and unbalanced performance among different classes are two main challenges. We evaluate our proposed method in Task 3 of the DCASE 2022 challenge and rank second in the team ranking.

A Adler, V Emiya, MG Jafari, M Elad, R Gribonval, MD Plumbley (2012)Audio Inpainting, In: IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING20(3)pp. 922-932 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASL.2011.2168211

We propose the audio inpainting framework that recovers portions of audio data distorted due to impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted data are treated as missing and their location is assumed to be known. The signal is decomposed into overlapping time-domain frames and the restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The Signal-to-Noise Ratio performance of this algorithm is shown to be comparable to or better than state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio.

D Barchiesi, MD Plumbley (2013)Learning Incoherent Dictionaries for Sparse Approximation Using Iterative Projections and Rotations, In: IEEE TRANSACTIONS ON SIGNAL PROCESSING61(8)pp. 2055-2065 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TSP.2013.2245663

This article deals with learning dictionaries for sparse approximation whose atoms are both adapted to a training set of signals and mutually incoherent. To meet this objective, we employ a dictionary learning scheme consisting of sparse approximation followed by dictionary update and we add to the latter a decorrelation step in order to reach a target mutual coherence level. This step is accomplished by an iterative projection method complemented by a rotation of the dictionary. Experiments on musical audio data and a comparison with the method of optimal coherence-constrained directions (MOCOD) and the incoherent K-SVD (INK-SVD) illustrate that the proposed algorithm can learn dictionaries that exhibit a low mutual coherence while providing a sparse approximation with better signal-to-noise ratio (SNR) than the benchmark techniques.
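The mutual coherence that this abstract aims to reduce is the largest absolute inner product between two distinct unit-norm atoms of the dictionary. A minimal sketch of how it can be measured is given below. The function name and the toy dictionaries are hypothetical, and the code only evaluates coherence; it does not implement the iterative projections and rotations algorithm.

```python
import math


def mutual_coherence(atoms):
    """Largest absolute normalised inner product between two distinct atoms."""
    def normalise(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    unit = [normalise(a) for a in atoms]
    best = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            inner = abs(sum(x * y for x, y in zip(unit[i], unit[j])))
            best = max(best, inner)
    return best


# Toy dictionaries: an orthonormal basis, and one with two correlated atoms.
mu_orthonormal = mutual_coherence([[1.0, 0.0], [0.0, 1.0]])
mu_correlated = mutual_coherence([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
```

A value of 0 means all atoms are mutually orthogonal, and a value of 1 means at least two atoms are parallel.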

MD Plumbley (2004)Optimization using Fourier expansion over a geodesic for non-negative ICA, In: CG Puntonet, A Prieto (eds.), INDEPENDENT COMPONENT ANALYSIS AND BLIND SIGNAL SEPARATION3195pp. 49-56

B Mailhé, MD Plumbley (2012)Dictionary Learning with Large Step Gradient Descent for Sparse Representations., In: FJ Theis, A Cichocki, A Yeredor, M Zibulevsky (eds.), LVA/ICA7191pp. 231-238

DOI: 10.1007/978-3-642-28551-6_29

K O'Hanlon, MD Plumbley (2013)AUTOMATIC MUSIC TRANSCRIPTION USING ROW WEIGHTED DECOMPOSITIONS, In: 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 16-20

MEP Davies, MD Plumbley (2006)A spectral difference approach to downbeat extraction in musical audio., In: EUSIPCOpp. 1-4

Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang (2022)Separate What You Describe: Language-Queried Audio Source Separation, In: Interspeech 2022pp. 1801-1805

DOI: 10.21437/Interspeech.2022-10894

In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., “a man tells a joke followed by people laughing”). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information, and to separate the target source that is consistent with the language query from an audio mixture. We evaluate the performance of our proposed system with a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios. The separated audio samples and source code are available at https://liuxubo717.github.io/LASS-demopage.

P Brossier, JP Bello, MD Plumbley (2004)Fast labelling of notes in music signals., In: ISMIR

D Barchiesi, MD Plumbley (2011)Dictionary learning of convolved signals, In: 2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSINGpp. 5812-5815 IEEE

DOI: 10.1109/ICASSP.2011.5947682

Assuming that a set of source signals is sparsely representable in a given dictionary, we show how their sparse recovery fails whenever we can only measure a convolved observation of them. Starting from this motivation, we develop a block coordinate descent method which aims to learn a convolved dictionary and provide a sparse representation of the observed signals with small residual norm. We compare the proposed approach to the K-SVD dictionary learning algorithm and show through numerical experiments on synthetic signals that, provided some conditions on the problem data are met, our technique converges in a fixed number of iterations to a sparse representation with smaller residual norm.

D Stowell, MD Plumbley, N Bryan-Kinns (2008)Discourse Analysis Evaluation Method for Expressive Musical Interfaces., In: NIMEpp. 81-86

Christian Kroos, Mark Plumbley (2017)Neuroevolution for sound event detection in real life audio: A pilot study, In: T Virtanen, A Mesaros, T Heittola, A Diment, E Vincent, E Benetos, B Elizalde (eds.), Detection and Classification of Acoustic Scenes and Events (DCASE 2017) Proceedings 2017 Tampere University of Technology

Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44.9% and achieved rank 15 out of 34 on the ER error metric with a value of 0.891. We discuss the question of evolving versus learning for supervised tasks.

J Orwell, MD Plumbley (1999)Maximizing information about a noisy signal with a single non-linear neuron., In: NINTH INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS (ICANN99), VOLS 1 AND 2(470)pp. 581-586

Iwona Sobieraj, Mark Plumbley (2016)Coupled Sparse NMF vs. Random Forest Classification for Real Life Acoustic Event Detection, In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)pp. 90-94

In this paper, we propose two methods for polyphonic Acoustic Event Detection (AED) in real life environments. The first method is based on Coupled Sparse Non-negative Matrix Factorization (CSNMF) of spectral representations and their corresponding class activity annotations. The second method is based on Multi-class Random Forest (MRF) classification of time-frequency patches. We compare the performance of the two methods on a recently published dataset TUT Sound Events 2016 containing data from home and residential area environments. Both methods show comparable performance to the baseline system proposed for DCASE 2016 Challenge on the development dataset with MRF outperforming the baseline on the evaluation dataset.

MD Plumbley, SA Abdallah, JP Bello, ME Davies, G Monti, MB Sandler (2002)Automatic music transcription and audio source separation, In: CYBERNETICS AND SYSTEMS33(6)pp. 603-627

DOI: 10.1080/01969720290040777

B Mailhe, D Barchiesi, MD Plumbley (2012)INK-SVD: LEARNING INCOHERENT DICTIONARIES FOR SPARSE REPRESENTATIONS, In: 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 3573-3576

AJR Simpson, G Roma, MD Plumbley (2015)Deep Remix: Remixing Musical Mixtures Using a Convolutional Deep Neural Network

Audio source separation is a difficult machine learning problem and performance is measured by comparing extracted signals with the component source signals. However, if separation is motivated by the ultimate goal of re-mixing then complete separation is not necessary and hence separation difficulty and separation quality are dependent on the nature of the re-mix. Here, we use a convolutional deep neural network (DNN), trained to estimate 'ideal' binary masks for separating voice from music, to perform re-mixing of the vocal balance by operating directly on the individual magnitude components of the musical mixture spectrogram. Our results demonstrate that small changes in vocal gain may be applied with very little distortion to the ultimate re-mix. Our method may be useful for re-mixing existing mixes.
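The re-mixing operation described here, changing the vocal balance directly on the mixture's magnitude spectrogram, can be sketched as follows. The binary mask is assumed to be given (in the paper it is estimated by the convolutional DNN), and the function name and the toy 2x2 "spectrogram" are hypothetical.

```python
import numpy as np

def remix_vocals(mix_mag, vocal_mask, vocal_gain_db):
    """Scale the time-frequency bins flagged as vocal by a gain given in dB."""
    gain = 10.0 ** (vocal_gain_db / 20.0)        # dB to linear amplitude
    scale = np.where(vocal_mask, gain, 1.0)      # non-vocal bins pass unchanged
    return mix_mag * scale

# Toy magnitude spectrogram (frequency x time) and an assumed binary vocal mask.
mix_mag = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
vocal_mask = np.array([[True, False],
                       [False, True]])
louder = remix_vocals(mix_mag, vocal_mask, vocal_gain_db=20.0)
same = remix_vocals(mix_mag, vocal_mask, vocal_gain_db=0.0)
```

Because only the masked bins are scaled, errors in the mask matter less for small gain changes, which is consistent with the observation in the abstract that small changes in vocal gain introduce little distortion.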

AM Stark, MD Plumbley (2010)PERFORMANCE FOLLOWING: TRACKING A PERFORMANCE WITHOUT A SCORE, In: 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSINGpp. 2482-2485

DOI: 10.1109/ICASSP.2010.5494895

Christopher Baume, Mark Plumbley, David Frohlich, Janko Calic (2018)PaperClip: A Digital Pen Interface for Semantic Speech Editing in Radio Production, In: Journal of the Audio Engineering Society66(4)pp. 241-252 Audio Engineering Society

DOI: 10.17743/jaes.2018.0006

We introduce 'PaperClip', a novel digital pen interface for semantic editing of speech recordings for radio production. We explain how we designed and developed our system, then present the results of a contextual qualitative user study of eight professional radio producers that compared editing using PaperClip to a screen-based interface and normal paper. As in many other paper-versus-screen studies, we found no overall preferences but rather advantages and disadvantages of both in different contexts. We discuss these relative benefits and make recommendations for future development.

Bin Li, Lucas Rencker, Jing Dong, Yuhui Luo, Mark D Plumbley, Wenwu Wang (2021)Sparse Analysis Model Based Dictionary Learning for Signal Declipping, In: IEEE Journal of Selected Topics in Signal Processing15(1)pp. 25-36 IEEE

DOI: 10.1109/JSTSP.2021.3051746

Clipping is a common type of distortion in which the amplitude of a signal is truncated if it exceeds a certain threshold. Sparse representation has underpinned several algorithms developed recently for reconstruction of the original signal from clipped observations. However, these declipping algorithms are often built on a synthesis model, where the signal is represented by a dictionary weighted by sparse coding coefficients. In contrast to these works, we propose a sparse analysis-model-based declipping (SAD) method, where the declipping model is formulated on an analysis (i.e. transform) dictionary, and additional constraints characterizing the clipping process. The analysis dictionary is updated using the Analysis SimCO algorithm, and the signal is recovered by using a least-squares based method or a projected gradient descent method, incorporating the observable signal set. Numerical experiments on speech and music are used to demonstrate improved performance in signal to distortion ratio (SDR) compared to recent state-of-the-art methods including A-SPADE and ConsDL.
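As a small illustration of the clipping model in the first sentence, the sketch below applies hard clipping and checks whether a candidate reconstruction is consistent with the clipped observation. The consistency rule used here (unclipped samples must match, clipped samples must lie at or beyond the threshold with the same sign) is a commonly used formulation and is an assumption about what "constraints characterizing the clipping process" means; the function names are hypothetical.

```python
def hard_clip(signal, threshold):
    """Truncate every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in signal]

def is_consistent(estimate, clipped, threshold, tol=1e-12):
    """True if estimate could have produced the clipped observation."""
    for e, c in zip(estimate, clipped):
        if c >= threshold:            # positively clipped sample
            if e < threshold:
                return False
        elif c <= -threshold:         # negatively clipped sample
            if e > -threshold:
                return False
        elif abs(e - c) > tol:        # reliable sample must be reproduced
            return False
    return True

original = [0.2, 1.5, -2.0, -0.3]
observed = hard_clip(original, threshold=1.0)
ok = is_consistent(original, observed, threshold=1.0)
bad = is_consistent([0.2, 0.5, -2.0, -0.3], observed, threshold=1.0)
```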

S Becker, M Plumbley (1996)Unsupervised neural network learning procedures for feature extraction and classification, In: APPLIED INTELLIGENCE6(3)pp. 185-203 KLUWER ACADEMIC PUBL

DOI: 10.1007/BF00126625

D Barchiesi, MD Plumbley (2015)Learning Incoherent Subspaces: Classification via Incoherent Dictionary Learning, In: Journal of Signal Processing Systems79(2)pp. 189-199

DOI: 10.1007/s11265-014-0937-5

In this article we present the supervised iterative projections and rotations (S-IPR) algorithm, a method for learning discriminative incoherent subspaces from data. We derive S-IPR as a supervised extension of our previously proposed iterative projections and rotations (IPR) algorithm for incoherent dictionary learning, and we employ it to learn incoherent subspaces that model signals belonging to different classes. We test our method as a feature transform for supervised classification, first by visualising transformed features from a synthetic dataset and from the ‘iris’ dataset, then by using the resulting features in a classification experiment.

T Weyde, S Cottrell, J Dykes, E Benetos, D Wolff, D Tidhar, A Kachkaev, M Plumbley, S Dixon, M Barthet, N Gold, S Abdallah, A Alancar-Brayner, M Mahey, A Tovell (2014)Big Data for Musicology

DOI: 10.1145/2660168.2660187

Digital music libraries and collections are growing quickly and are increasingly made available for research. We argue that the use of large data collections will enable a better understanding of music performance and music in general, which will benefit areas such as music search and recommendation, music archiving and indexing, music production and education. However, to achieve these goals it is necessary to develop new musicological research methods, to create and adapt the necessary technological infrastructure, and to find ways of working with legal limitations. Most of the necessary basic technologies exist, but they need to be brought together and applied to musicology. We aim to address these challenges in the Digital Music Lab project, and we feel that with suitable methods and technology Big Music Data can provide new opportunities to musicology.

SA Abdallah, MD Plumbley (2006)Unsupervised analysis of polyphonic music by sparse coding, In: IEEE TRANSACTIONS ON NEURAL NETWORKS17(1)pp. 179-196 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TNN.2005.861031

MD Plumbley, SA Abdallah, T Blumensath, ME Davies (2006)Sparse representations of polyphonic music, In: SIGNAL PROCESSING86(3)pp. 417-431 ELSEVIER SCIENCE BV

DOI: 10.1016/j.sigpro.2005.06.007

MD Plumbley, T Blumensath, L Daudet, R Gribonval, ME Davies (2010)Sparse Representations in Audio and Music: From Coding to Source Separation, In: PROCEEDINGS OF THE IEEE98(6)pp. 995-1005 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/JPROC.2009.2030345

Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley (2024)Learning Temporal Resolution in Spectrogram for Audio Classification, In: Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI-24)38(12: AAAI-24 Technical Tracks 12) Association for the Advancement of Artificial Intelligence (AAAI)

DOI: 10.1609/aaai.v38i12.29294

The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.
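The link between hop size, temporal resolution and computational cost can be made concrete with a frame count. The sketch assumes an STFT without padding, so that a signal of `n_samples` samples yields `1 + (n_samples - win_length) // hop_length` frames; the function name and the 16 kHz example values are illustrative and are not taken from the paper.

```python
def num_frames(n_samples, win_length, hop_length):
    """Number of STFT frames for an unpadded signal."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length

sample_rate = 16000
n_samples = 10 * sample_rate            # a 10 s clip
win_length = 400                        # 25 ms window
frames_10ms = num_frames(n_samples, win_length, hop_length=160)   # 10 ms hop
frames_20ms = num_frames(n_samples, win_length, hop_length=320)   # 20 ms hop
```

Doubling the hop size roughly halves the number of frames a classifier has to process, at the price of coarser temporal resolution; DiffRes instead starts from a fixed hop size and learns which frames can be merged.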

Cian O’Brien, Mark D Plumbley (2018)A Hierarchical Latent Mixture Model for Polyphonic Music Analysis, In: Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO)pp. 1910-1914 IEEE

DOI: 10.23919/EUSIPCO.2018.8553244

Polyphonic music transcription is a challenging problem, requiring the identification of a collection of latent pitches which can explain an observed music signal. Many state-of-the-art methods are based on the Non-negative Matrix Factorization (NMF) framework, which itself can be cast as a latent variable model. However, the basic NMF algorithm fails to consider many important aspects of music signals such as low-rank or hierarchical structure and temporal continuity. In this work we propose a probabilistic model to address some of the shortcomings of NMF. Probabilistic Latent Component Analysis (PLCA) provides a probabilistic interpretation of NMF and has been widely applied to problems in audio signal processing. Based on PLCA, we propose an algorithm which represents signals using a collection of low-rank dictionaries built from a base pitch dictionary. This allows each dictionary to specialize to a given chord or interval template which will be used to represent collections of similar frames. Experiments on a standard music transcription data set show that our method can successfully decompose signals into a hierarchical and smooth structure, improving the quality of the transcription.

Zhao Ren, Qiuqiang Kong, Jing Han, Mark D. Plumbley, Bjorn W Schuller (2020)CAA-Net: Conditional Atrous CNNs with Attention for Explainable Device-robust Acoustic Scene Classification, In: IEEE Transactions on Multimedia23(X)pp. 4131-4142 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TMM.2020.3037534

Recently, Convolutional Neural Networks (CNNs) have been successfully applied to acoustic scene classification (ASC). However, the data distributions of the audio signals recorded with multiple devices are different. There has been little research on the training of robust neural networks on acoustic scene datasets recorded with multiple devices, and on explaining the operation of the internal layers of the neural networks. In this article, we focus on training and explaining device-robust CNNs on multi-device acoustic scene data. We propose conditional atrous CNNs with attention for multi-device ASC. Our proposed system contains an ASC branch and a device classification branch, both modelled by CNNs. We visualise and analyse the intermediate layers of the atrous CNNs. A time-frequency attention mechanism is employed to analyse the contribution of each time-frequency bin of the feature maps in the CNNs. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 ASC dataset, recorded with three devices, our proposed model performs significantly better than CNNs trained on single-device data.

MD PLUMBLEY, F FALLSIDE (1991)THE EFFECT OF RECEPTOR SIGNAL-TO-NOISE LEVELS ON OPTIMAL FILTERING IN A SENSORY SYSTEM, In: ICASSP 91, VOLS 1-5pp. 2321-2324

D Giannoulis, A Klapuri, MD Plumbley (2013)Recognition of harmonic sounds in polyphonic audio using a missing feature approach, In: 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 8658-8662 IEEE

DOI: 10.1109/ICASSP.2013.6639356

A method based on local spectral features and missing feature techniques is proposed for the recognition of harmonic sounds in mixture signals. A mask estimation algorithm is proposed for identifying spectral regions that contain reliable information for each sound source and then bounded marginalization is employed to treat the feature vector elements that are determined as unreliable. The proposed method is tested on musical instrument sounds due to the extensive availability of data but it can be applied to other sounds (e.g. animal sounds, environmental sounds), whenever these are harmonic. In simulations the proposed method clearly outperformed a baseline method for mixture signals.

BL Sturm, B Mailhe, MD Plumbley (2013)On Theorem 10 in "On Polar Polytopes and the Recovery of Sparse Representations" (vol 50, pg 2231, 2004), In: IEEE TRANSACTIONS ON INFORMATION THEORY59(8)pp. 5206-5209 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TIT.2013.2259294

Emad M Grais, Mark Plumbley (2018)Combining Fully Convolutional and Recurrent Neural Networks for Single Channel Audio Source Separation, In: Proceedings of 144th AES Convention Audio Engineering Society

Combining different models is a common strategy to build a good audio source separation system. In this work, we combine two powerful deep neural networks for single-channel audio source separation (SCSS). Namely, we combine fully convolutional neural networks (FCNs) and recurrent neural networks, specifically, bidirectional long short-term memory recurrent neural networks (BLSTMs). FCNs are good at extracting useful features from the audio data and BLSTMs are good at modeling the temporal structure of the audio signals. Our experimental results show that combining FCNs and BLSTMs achieves better separation performance than using each model individually.

MD PLUMBLEY (1993)EFFICIENT INFORMATION-TRANSFER AND ANTI-HEBBIAN NEURAL NETWORKS, In: NEURAL NETWORKS6(6)pp. 823-833 PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/S0893-6080(05)80127-3

E Vincent, MD Plumbley (2007)Fast factorization-based inference for bayesian harmonic models, In: Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, MLSP 2006pp. 117-122

DOI: 10.1109/MLSP.2006.275533

Harmonic sinusoidal models are a fundamental tool for audio signal analysis. Bayesian harmonic models guarantee a good resynthesis quality and allow joint use of learnt parameter priors and auditory motivated distortion measures. However, inference algorithms based on Monte Carlo sampling are rather slow for realistic data. In this paper, we investigate fast inference algorithms based on approximate factorization of the joint posterior into a product of independent distributions on small subsets of parameters. We discuss the conditions under which these approximations hold true and evaluate their performance experimentally. We suggest how they could be used together with Monte Carlo algorithms for a faster sampling-based inference.

S Ewert, MD Plumbley, M Sandler (2015)A dynamic programming variant of non-negative matrix deconvolution for the transcription of struck string instrumentspp. 569-573

Given a musical audio recording, the goal of music transcription is to determine a score-like representation of the piece underlying the recording. Most current transcription methods employ variants of non-negative matrix factorization (NMF), which often fails to robustly model instruments producing non-stationary sounds. Using entire time-frequency patterns to represent sounds, non-negative matrix deconvolution (NMD) can capture certain types of non-stationary behavior but is only applicable if all sounds have the same length. In this paper, we present a novel method that combines the non-stationarity modeling capabilities available with NMD with the variable note lengths possible with NMF. Identifying frames in NMD patterns with states in a dynamical system, our method iteratively generates sound-object candidates separately for each pitch, which are then combined in a global optimization. We demonstrate the transcription capabilities of our method using piano pieces, assuming the availability of single note recordings as training data.

(2016)Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary, 3 Sep 2016. Tampere University of Technology. Department of Signal Processing

LA Figueira, C Cannam, MD Plumbley (2013)Software techniques for good practice in audio and music research, In: 134th Audio Engineering Society Convention 2013pp. 273-280 Audio Engineering Society

In this paper we discuss how software development can be improved in the audio and music research community by implementing tighter and more effective development feedback loops. We suggest first that researchers in an academic environment can benefit from the straightforward application of peer code review, even for ad-hoc research software; and second, that researchers should adopt automated software unit testing from the start of research projects. We discuss and illustrate how to adopt both code reviews and unit testing in a research environment. Finally, we observe that the use of a software version control system provides support for the foundations of both code reviews and automated unit tests. We therefore also propose that researchers should use version control with all their projects from the earliest stage.

D Stowell, MD Plumbley (2010)Timbre remapping through a regression-tree technique, In: Proceedings of the 7th Sound and Music Computing Conference, SMC 2010pp. 68-?

We consider the task of inferring associations between two differently-distributed and unlabelled sets of timbre data. This arises in applications such as concatenative synthesis/audio mosaicing in which one audio recording is used to control sound synthesis through concatenating fragments of an unrelated source recording. Timbre is a multidimensional attribute with interactions between dimensions, so it is non-trivial to design a search process which makes best use of the timbral variety available in the source recording. We must be able to map from control signals whose timbre features have different distributions from the source material, yet labelling large collections of timbral sounds is often impractical, so we seek an unsupervised technique which can infer relationships between distributions. We present a regression tree technique which learns associations between two unlabelled multidimensional distributions, and apply the technique to a simple timbral concatenative synthesis system. We demonstrate numerically that the mapping makes better use of the source material than a nearest-neighbour search.

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark Plumbley (2020)Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization, In: IEEE/ACM Transactions on Audio, Speech and Language Processing28pp. 2450-2460 Institute of Electrical and Electronics Engineers

DOI: 10.1109/TASLP.2020.3014737

Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets are weakly labelled. That is, there are only audio tags for each audio clip without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming that without threshold optimization of 0.629, and a sound event detection F1 of 0.584, outperforming that without threshold optimization of 0.564.
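The second stage described above, choosing thresholds to maximise a threshold-dependent metric, can be illustrated with a deliberately simple stand-in: an exhaustive search over candidate thresholds for one class, scored by F1 on held-out data. This grid search is an assumption made for illustration and is not claimed to be the optimisation procedure used in the paper; the function names and toy scores are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores, candidates):
    """Return (threshold, F1) for the candidate threshold with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        f1 = f1_score(y_true, [s >= t for s in scores])
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy validation data for a single sound class.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.1]
chosen = best_threshold(y_true, scores, candidates=[0.2, 0.38, 0.5])
```

With these toy scores the default-looking threshold of 0.5 misses one positive clip, whereas 0.38 separates the two classes perfectly.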

Annamaria Mesaros, Toni Heittola, Emmanouil Benetos, Peter Foster, Mathieu Lagrange, Tuomas Virtanen, Mark D. Plumbley (2017)Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge, In: IEEE/ACM Transactions on Audio, Speech and Language Processing26(2)pp. 379-393 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2017.2778423

Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016) has offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present in detail each task and analyse the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

D Stowell, MD Plumbley (2009)Fast Multidimensional Entropy Estimation by k-d Partitioning, In: IEEE SIGNAL PROCESSING LETTERS16(6)pp. 537-540 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/LSP.2009.2017346

Alfredo Zermini, Yang Yu, Yong Xu, Mark Plumbley, Wenwu Wang (2016)Deep neural network based audio source separation, In: Proceedings of the 11th IMA International Conference on Mathematics in Signal Processingpp. 1-4 Institute of Mathematics & its Applications (IMA)

Audio source separation aims to extract individual sources from mixtures of multiple sound sources. Many techniques have been developed such as independent component analysis, computational auditory scene analysis, and non-negative matrix factorisation. A method based on Deep Neural Networks (DNNs) and time-frequency (T-F) masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener which is then used to generate soft T-F masks for the recovery/estimation of the individual audio sources.

John G. Taylor, Mark D. Plumbley (1993)Information Theory and Neural Networks, In: Mathematical Approaches to Neural Networks51pp. 307-340 Elsevier Science Publishers

DOI: 10.1016/S0924-6509(08)70042-4

This chapter discusses the role of information theory for analysis of neural networks using differential geometric ideas. Information theory is useful for understanding preprocessing, in terms of predictive coding in the retina and principal component analysis and decorrelation processing in early visual cortex. The chapter introduces some concepts from information theory. In particular, it focuses on the entropy of a random variable and the mutual information between two random variables. One of the major uses for information theory has been in interpretation and guidance for unsupervised neural networks: networks that are not provided with a teacher or target output that they are to emulate. The chapter describes how information relates to the more familiar supervised learning schemes, and discusses the use of error back propagation (BackProp) to minimize mean squared error (MSE) in a multi-layer perceptron (MLP). Other distortion measures are possible in place of MSE. In particular, the chapter focuses on the information-theoretic cross-entropy distortion.
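The two quantities named in this abstract have standard definitions for discrete variables: the entropy H(X) = -Σ p(x) log2 p(x) and the mutual information I(X;Y) = Σ p(x,y) log2 [p(x,y) / (p(x) p(y))]. The sketch below computes both in bits from explicit probability tables; it illustrates the textbook definitions only, and the function names are not taken from the chapter.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """Mutual information in bits from a joint probability table (list of rows)."""
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

h_coin = entropy([0.5, 0.5])                                   # a fair coin
mi_copy = mutual_information([[0.5, 0.0], [0.0, 0.5]])         # Y is a copy of X
mi_indep = mutual_information([[0.25, 0.25], [0.25, 0.25]])    # X, Y independent
```

A fair coin carries one bit; a perfect copy shares that full bit with its source, while independent variables share none.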

E Vincent, MG Jafari, SA Abdallah, MD Plumbley, ME Davies (2010)Probabilistic modeling paradigms for audio source separation, In: W Wang (eds.), Machine Audition: Principles, Algorithms and Systemspp. 162-185 IGI Global

DOI: 10.4018/978-1-61520-919-4.ch007

Most sound scenes result from the superposition of several sources, which can be separately perceived and analyzed by human listeners. Source separation aims to provide machine listeners with similar skills by extracting the sounds of individual sources from a given scene. Existing separation systems operate either by emulating the human auditory system or by inferring the parameters of probabilistic sound models. In this chapter, the authors focus on the latter approach and provide a joint overview of established and recent models, including independent component analysis, local time-frequency models and spectral template-based models. They show that most models are instances of one of the following two general paradigms: linear modeling or variance modeling. They compare the merits of either paradigm and report objective performance figures. They conclude by discussing promising combinations of probabilistic priors and inference algorithms that could form the basis of future state-of-the-art systems.

T Weyde, S Cottrell, E Benetos, D Wolff, D Tidhar, J Dykes, M Plumbley, S Dixon, M Barthet, N Gold, S Abdallah, M Mahey (2014)Digital Music Lab: A Framework for Analysing Big Music Data

D Stowell, MD Plumbley (2010)Delayed Decision-making in Real-time Beatbox Percussion Classification, In: JOURNAL OF NEW MUSIC RESEARCH39(3)PII 929374pp. 203-213 ROUTLEDGE JOURNALS, TAYLOR & FRANCIS LTD

DOI: 10.1080/09298215.2010.512979

MD Plumbley (2005)Polar Polytopes and Recovery of Sparse Representations

MD Plumbley (2004)Lie group methods for optimization with orthogonality constraints, In: CG Puntonet, A Prieto (eds.), INDEPENDENT COMPONENT ANALYSIS AND BLIND SIGNAL SEPARATION3195pp. 1245-1252

E Benetos, D Stowell, Mark Plumbley (2018)Approaches to complex sound scene analysis, In: T Virtanen, Mark Plumbley, D Ellis (eds.), Computational Analysis of Sound Scenes and Eventspp. 215-242 Springer International Publishing

DOI: 10.1007/978-3-319-63450-0_8

D Giannoulis, E Benetos, D Stowell, M Rossignol, M Lagrange, MD Plumbley (2013)DETECTION AND CLASSIFICATION OF ACOUSTIC SCENES AND EVENTS: AN IEEE AASP CHALLENGE, In: 2013 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA)

Yue Gao, Zihang Song, Han Zhang, Sean Fuller, Andrew Lambert, Zhinong Ying, Petri Mähönen, Yonina Eldar, Shuguang Cui, Mark D. Plumbley, Clive Parini, Arumugam Nallanathan (2021)Sub-Nyquist spectrum sensing and learning challenge, In: Frontiers of Computer Science15154504 Higher Education Press

DOI: 10.1007/s11704-021-1275-y

Sub-Nyquist spectrum sensing and learning are investigated from theory to practice as a promising approach enabling cognitive and intelligent radios and wireless systems to work in GHz channel bandwidth. These techniques would be helpful for future electromagnetic spectrum sensing in sub-6 GHz, millimetre-wave, and Terahertz frequency bands. However, challenges such as computational complexity, real-time processing and theoretical sampling limits still exist. We issued a challenge with a reference sub-Nyquist algorithm, open data set and awards up to 10,000 USD to stimulate novel approaches and designs on sub-Nyquist spectrum sensing and learning algorithms, to promote related research and to facilitate the theory-to-practice process of promising ideas.

K O'Hanlon, H Nagano, MD Plumbley (2012)Group non-negative basis pursuit for automatic music transcription, In: Proceedings of 5th International Workshop on Machine Learning and Music (MML 2012)pp. 15-16

Automatic Music Transcription is often performed by decomposing a spectrogram over a dictionary of note specific atoms. Several note template atoms may be used to represent one note, and a group structure may be imposed on the dictionary. We propose a group sparse algorithm based on a multiplicative update and thresholding and show transcription results on a challenging dataset.

Francesco Renna, Mark D. Plumbley, Miguel T. Coimbra (2022)Source separation of the second heart sound via alternating optimization

DOI: 10.23919/CinC53138.2021.9662814

A novel algorithm to separate S2 heart sounds into their aortic and pulmonary components is proposed. This approach is based on the assumption that, in different heartbeats of a given recording, aortic and pulmonary components maintain the same waveform but with different relative delays, which are induced by the variation of the thoracic pressure at different respiration phases. The proposed algorithm then retrieves the aortic and pulmonary components as the solution of an optimization problem which is approximated via alternating optimization. The proposed approach is shown to provide reconstructions of aortic and pulmonary components with normalized root mean-squared error consistently below 10% in various operational regimes.

G Roma, EM Grais, AJR Simpson, I Sobieraj, MD Plumbley (2016)UNTWIST: A NEW TOOLBOX FOR AUDIO SOURCE SEPARATION

Untwist is a new open source toolbox for audio source separation. The library provides a self-contained object-oriented framework including common source separation algorithms as well as input/output functions, data management utilities and time-frequency transforms. Everything is implemented in Python, facilitating research, experimentation and prototyping across platforms. The code is available on GitHub.

(2018)Latent Variable Analysis and Signal Separation: 14th International Conference, LVA/ICA 2018, Guildford, UK, July 2–5, 2018, Proceedings, In: Yannick Deville, Sharon Gannot, Russell Mason, Mark D. Plumbley, Dominic Ward (eds.), Latent Variable Analysis and Signal Separation10891 Springer International Publishing

DOI: 10.1007/978-3-319-93764-9

This book constitutes the proceedings of the 14th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2018, held in Guildford, UK, in July 2018. The 52 full papers were carefully reviewed and selected from 62 initial submissions. As research topics the papers encompass a wide range of general mixtures of latent variables models but also theories and tools drawn from a great variety of disciplines such as structured tensor decompositions and applications; matrix and tensor factorizations; ICA methods; nonlinear mixtures; audio data and methods; signal separation evaluation campaign; deep learning and data-driven methods; advances in phase retrieval and applications; sparsity-related methods; and biomedical data and methods.

B Mailhe, MD Plumbley (2013)Dictionary Learning via Projected Maximal Exploration, In: 2013 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP)pp. 626-626 IEEE

DOI: 10.1109/GlobalSIP.2013.6736963

This work presents a geometrical analysis of the Large Step Gradient Descent (LGD) dictionary learning algorithm. LGD updates the atoms of the dictionary using a gradient step with a step size equal to twice the optimal step size. We show that the large step gradient descent can be understood as a maximal exploration step where one goes as far away as possible without increasing the error. We also show that the LGD iteration is monotonic when the algorithm used for the sparse approximation step is close enough to orthogonal.
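The "twice the optimal step size" property can be checked numerically on a toy problem. The sketch below is not the LGD algorithm itself: it assumes a single atom d with fixed coefficients a, omits the sparse approximation and any atom normalisation step, and uses made-up data. It only shows that, for a quadratic error, a gradient step of twice the exact line-search step returns to the starting error.

```python
def error(X, d, a):
    """Squared Frobenius error ||X - d a^T||^2 for one atom d and its coefficients a."""
    return sum((X[i][t] - d[i] * a[t]) ** 2
               for i in range(len(d)) for t in range(len(a)))

def gradient(X, d, a):
    """Gradient of the error with respect to the atom d (coefficients a held fixed)."""
    return [-2.0 * sum((X[i][t] - d[i] * a[t]) * a[t] for t in range(len(a)))
            for i in range(len(d))]

def step(d, g, alpha):
    """Plain gradient step d - alpha * g."""
    return [di - alpha * gi for di, gi in zip(d, g)]

# Hypothetical toy data: 2-dimensional signals at 3 time points.
X = [[1.0, 2.0, 0.5], [0.0, -1.0, 1.5]]
a = [0.5, 1.0, -0.5]
d = [0.3, 0.2]

g = gradient(X, d, a)
# Exact line-search step for this quadratic: 1 / (2 * ||a||^2).
alpha_opt = 1.0 / (2.0 * sum(at * at for at in a))

e_start = error(X, d, a)
e_opt = error(X, step(d, g, alpha_opt), a)          # lowest error along the gradient
e_large = error(X, step(d, g, 2.0 * alpha_opt), a)  # back to the starting error
print(e_start, e_opt, e_large)
```

Any step longer than twice the optimal one would increase the error, which is one way to read the "maximal exploration" interpretation in the abstract.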

GE Robbins, MD Plumbley, JC Hughes, F Fallside, RW Prager (1993)Generation and Adaptation of Neural Networks by Evolutionary Techniques (GANNET)., In: Neural Computing and Applications11pp. 23-31

DOI: 10.1007/BF01411372

E Vincent, R Gribonval, MD Plumbley (2007)Oracle estimators for the benchmarking of source separation algorithms, In: SIGNAL PROCESSING87(8)pp. 1933-1950 ELSEVIER SCIENCE BV

DOI: 10.1016/j.sigpro.2007.01.016

Sara Atito Ali Ahmed, Muhammad Awais, Wenwu Wang, Mark D. Plumbley, Josef Kittler (2024)ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing32pp. 3684-3693 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2024.3428908

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labelled data, most transformer-based models for audio tasks are finetuned from ImageNet pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated the research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance in five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining.

Andrew J. R. Simpson, Gerard Roma, Emad M. Grais, Russell D. Mason, Chris Hummersone, Antoine Liutkus, Mark D. Plumbley (2016)Evaluation of Audio Source Separation Models Using Hypothesis-Driven Non-Parametric Statistical Methods, In: 2016 24TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO)2016-pp. 1763-1767 IEEE

DOI: 10.1109/EUSIPCO.2016.7760551

Audio source separation models are typically evaluated using objective separation quality measures, but rigorous statistical methods have yet to be applied to the problem of model comparison. As a result, it can be difficult to establish whether or not reliable progress is being made during the development of new models. In this paper, we provide a hypothesis-driven statistical analysis of the results of the recent source separation SiSEC challenge involving twelve competing models tested on separation of voice and accompaniment from fifty pieces of "professionally produced" contemporary music. Using non-parametric statistics, we establish reliable evidence for meaningful conclusions about the performance of the various models.

Emad M Grais, Hagen Wierstorf, Dominic Ward, Mark D Plumbley (2018)Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation, In: Proceedings of LVA/ICA 2018 (Lecture Notes in Computer Science)10891pp. 340-350 Springer Verlag

DOI: 10.1007/978-3-319-93764-9_32

In deep neural networks with convolutional layers, all the neurons in each layer typically have the same size receptive fields (RFs) with the same resolution. Convolutional layers with neurons that have large RF capture global information from the input features, while layers with neurons that have small RF size capture local details with high resolution from the input features. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCN), where each layer has a range of neurons with different RF sizes to extract multi-resolution features that capture the global and local information from its input features. The proposed MR-FCN is applied to separate the singing voice from mixtures of music sources. Experimental results show that using MR-FCN improves the performance compared to feedforward deep neural networks (DNNs) and single resolution deep fully convolutional neural networks (FCNs) on the audio source separation problem.

Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang (2023)Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention, In: Proceedings of the 24th Annual Conference of the International Speech Communication Association, INTERSPEECH (INTERSPEECH 2023)pp. 2838-2842 International Speech Communication Association (ISCA)

DOI: 10.21437/interspeech.2023-914

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audiovisual contexts, we propose an audiovisual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.

Jianyuan Sun, Xubo Liu, Xinhao Mei, Volkan Kılıç, Mark D. Plumbley, Wenwu Wang (2023)Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning, In: Proceedings of the 24th Annual Conference of the International Speech Communication Association, INTERSPEECH (INTERSPEECH 2023)pp. 4164-4168 International Speech Communication Association (ISCA)

DOI: 10.21437/interspeech.2023-943

Automated audio captioning (AAC) generates textual descriptions of audio content. Existing AAC models achieve good results but only use the high-dimensional representation of the encoder. Because high-dimensional representations contain a large amount of information, methods that rely on them alone tend to learn that information insufficiently. In this paper, a new encoder-decoder model called the Low- and High-Dimensional Feature Fusion (LHDFF) is proposed. LHDFF uses a new PANNs encoder called Residual PANNs (RPANNs) to fuse low- and high-dimensional features. Low-dimensional features contain limited information about specific audio scenes. The fusion of low- and high-dimensional features can improve model performance by repeatedly emphasizing specific audio scene information. To fully exploit the fused features, LHDFF uses a dual transformer decoder structure to generate captions in parallel. Experimental results show that LHDFF outperforms existing audio captioning models.

EMILY MARY CORRIGAN-KAVANAGH, MARK DAVID PLUMBLEY, MARC GREEN, ANDRES FERNANDEZ (2021)Exploring Sound Sensing to Improve Quality of Life in Urban Living

Following the successful application of AI and machine learning technologies to the recognition of speech and images, computer systems can now automatically analyse and recognise everyday real-world sound scenes and events. This new technology presents promising potential applications in environmental sensing and urban living. Specifically, urban soundscape analysis could be used to monitor and improve soundscapes experienced by people in towns and cities, helping to identify and formulate strategies for enhancing quality of life through future urban planning and development.

K O'Hanlon, H Nagano, MD Plumbley (2012)STRUCTURED SPARSITY FOR AUTOMATIC MUSIC TRANSCRIPTION, In: 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 441-444

Emad M Grais, G Roma, AJR Simpson, Mark Plumbley (2016)Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks, In: Interspeech2016 Proceedings ISCA

Deep neural networks (DNNs) are usually used for single channel source separation to predict either soft or binary time frequency masks. The masks are used to separate the sources from the mixed signal. Binary masks produce separated sources with more distortion and less interference than soft masks. In this paper, we propose to use another DNN to combine the estimates of binary and soft masks to achieve the advantages and avoid the disadvantages of using each mask individually. We aim to achieve separated sources with low distortion and low interference between each other. Our experimental results show that combining the estimates of binary and soft masks using DNN achieves lower distortion than using each estimate individually and achieves as low interference as the binary mask.
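To make the distinction between the two mask types concrete, the sketch below computes "oracle" soft and binary masks from known source magnitudes in a handful of time-frequency bins. It assumes two sources whose magnitudes add in the mixture, and uses made-up values; in the paper the masks are predicted by DNNs and combined by a further DNN, neither of which is shown here.

```python
def soft_mask(source, interferer, eps=1e-12):
    """Ratio mask: fraction of each time-frequency bin attributed to the source."""
    return [s / (s + n + eps) for s, n in zip(source, interferer)]

def binary_mask(source, interferer):
    """Binary mask: 1 where the source dominates the bin, 0 elsewhere."""
    return [1.0 if s > n else 0.0 for s, n in zip(source, interferer)]

def apply_mask(mask, mixture):
    """Element-wise masking of the mixture magnitudes."""
    return [m * x for m, x in zip(mask, mixture)]

# Hypothetical magnitudes for four time-frequency bins.
voice = [3.0, 1.0, 0.2, 2.0]
music = [1.0, 3.0, 0.1, 2.5]
mixture = [v + m for v, m in zip(voice, music)]  # additive-magnitude assumption

print(apply_mask(soft_mask(voice, music), mixture))
print(apply_mask(binary_mask(voice, music), mixture))
```

The binary mask discards every bin where the interferer dominates, which suppresses interference but also removes part of the target, while the soft mask keeps a share of every bin.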

A Gretsistas, MD Plumbley (2010)A Multichannel Spatial Compressed Sensing Approach for Direction of Arrival Estimation, In: V Vigneron, V Zarzoso, E Moreau, R Gribonval, E Vincent (eds.), LATENT VARIABLE ANALYSIS AND SIGNAL SEPARATION6365pp. 458-465

DOI: 10.1007/978-3-642-15995-4_57

D Barchiesi, D Giannoulis, D Stowell, MD Plumbley (2015)Acoustic Scene Classification: Classifying environments from the sounds they produce, In: IEEE Signal Processing Magazine32(3)pp. 16-34

DOI: 10.1109/MSP.2014.2326181

In this article, we present an account of the state of the art in acoustic scene classification (ASC), the task of classifying environments from the sounds they produce. Starting from a historical review of previous research in this area, we define a general framework for ASC and present different implementations of its components. We then describe a range of different algorithms submitted for a data challenge that was held to provide a general and fair benchmark for ASC techniques. The data set recorded for this purpose is presented along with the performance metrics that are used to evaluate the algorithms and statistical significance tests to compare the submitted methods.

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark Plumbley (2018)Audio set classification with attention model: a probabilistic perspective, In: Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)pp. 316-320 IEEE

DOI: 10.1109/ICASSP.2018.8461392

This paper investigates Audio Set classification. Audio Set is a large scale weakly labelled dataset (WLD) of audio clips. In a WLD only the presence of a label is known, without knowing when the labelled events occur. We propose an attention model to solve this WLD problem and explain the attention model from a novel probabilistic perspective. Each audio clip in Audio Set consists of a collection of features. We call each feature an instance and the collection a bag, following the terminology in multiple instance learning. In the attention model, each instance in the bag has a trainable probability measure for each class. The classification of the bag is the expectation of the classification output of the instances in the bag with respect to the learned probability measure. Experiments show that the proposed attention model achieves a mAP of 0.327 on Audio Set, outperforming Google’s baseline of 0.314.
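The "expectation under a learned probability measure" can be sketched for a single class as follows. The code assumes the attention values are non-negative and are normalised to sum to one over the bag; both the normalisation and the numbers are illustrative assumptions, and in the model the per-instance outputs and attention values would come from trained networks.

```python
def attention_pool(instance_probs, attention_values):
    """Bag-level probability for one class: the expectation of the per-instance
    outputs under a probability measure obtained by normalising the
    (non-negative) attention values over the instances in the bag."""
    total = sum(attention_values)
    weights = [v / total for v in attention_values]
    return sum(w * p for w, p in zip(weights, instance_probs))

# Hypothetical bag of four instances (e.g. frames of one clip) for one class.
instance_probs = [0.9, 0.1, 0.2, 0.8]
uniform = [1.0, 1.0, 1.0, 1.0]   # uniform attention reduces to average pooling
peaked = [5.0, 0.1, 0.1, 5.0]    # attention concentrated on the confident instances

print(attention_pool(instance_probs, uniform))
print(attention_pool(instance_probs, peaked))
```

With uniform attention the result is the plain average of the instance outputs; concentrating the attention on a few instances moves the bag-level output towards those instances.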

N Cleju, MG Jafari, MD Plumbley (2012)ANALYSIS-BASED SPARSE RECONSTRUCTION WITH SYNTHESIS-BASED SOLVERS, In: 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 5401-5404

MD PLUMBLEY (1995)LYAPUNOV FUNCTIONS FOR CONVERGENCE OF PRINCIPAL COMPONENT ALGORITHMS, In: NEURAL NETWORKS8(1)pp. 11-23 PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/0893-6080(95)91644-9

Recent theoretical analyses of a class of unsupervized Hebbian principal component algorithms have identified its local stability conditions. The only locally stable solution for the subspace P extracted by the network is the principal component subspace P∗. In this paper we use the Lyapunov function approach to discover the global stability characteristics of this class of algorithms. The subspace projection error, least mean squared projection error, and mutual information I are all Lyapunov functions for convergence to the principal subspace, although the various domains of convergence indicated by these Lyapunov functions leave some of P-space uncovered. A modification to I yields a principal subspace information Lyapunov function I′ with a domain of convergence that covers almost all of P-space. This shows that this class of algorithms converges to the principal subspace from almost everywhere.

T Virtanen, Mark Plumbley, D Ellis (2018)Introduction to sound scene and event analysis, In: T Virtanen, Mark Plumbley, D Ellis (eds.), Computational Analysis of Sound Scenes and Eventspp. 3-12 Springer

DOI: 10.1007/978-3-319-63450-0_1

Saeid Safavi, Wenwu Wang, Mark Plumbley, Ali Janalizadeh Choobbasti, George Fazekas (2018)Predicting the Perceived Level of Reverberation using Features from Nonlinear Auditory Model, In: Proceedings of the 23rd FRUCT conferencepp. 527-531 Institute of Electrical and Electronics Engineers (IEEE)

Perceptual measurements have typically been recognized as the most reliable measurements in assessing perceived levels of reverberation. In this paper, a combination of blind RT60 estimation method and a binaural, nonlinear auditory model is employed to derive signal-based measures (features) that are then utilized in predicting the perceived level of reverberation. Such measures lack the excess of effort necessary for calculating perceptual measures, not to mention the variations in either stimuli or assessors that may cause such measures to be statistically insignificant. As a result, the automatic extraction of objective measurements that can be applied to predict the perceived level of reverberation becomes of vital significance. Consequently, this work is aimed at discovering measurements such as clarity, reverberance, and RT60 which can automatically be derived directly from audio data. These measurements along with labels from human listening tests are then forwarded to a machine learning system seeking to build a model to estimate the perceived level of reverberation, which is labeled by an expert, autonomously. The data has been labeled by an expert human listener for a unilateral set of files from arbitrary audio source types. By examining the results, it can be observed that the automatically extracted features can aid in estimating the perceptual rates.

S Abdallah, MD Plumbley (2004)Application of geometric dependency analysis to the separation of convolved mixtures, In: CG Puntonet, A Prieto (eds.), INDEPENDENT COMPONENT ANALYSIS AND BLIND SIGNAL SEPARATION3195pp. 540-547

Qiang Huang, Yong Xu, Philip Jackson, Wenwu Wang, Mark Plumbley (2017)Fast Tagging of Natural Sounds Using Marginal Co-regularization, In: Proceedings of ICASSP2017 IEEE

DOI: 10.1109/ICASSP.2017.7952705

Automatic and fast tagging of natural sounds in audio collections is a very challenging task due to wide acoustic variations, the large number of possible tags, and the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier, and augmenting the tags with additional relevant tags. Training the classifiers is implemented using marginal co-regularization, which draws the pair of classifiers into agreement by a joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task4 of DCASE2016. The results obtained show that marginal co-regularization outperforms the baseline GMM in both efficiency and effectiveness.

E Vincent, MD Plumbley (2008)Efficient Bayesian inference for harmonic models via adaptive posterior factorization, In: NEUROCOMPUTING72(1-3)pp. 79-87 ELSEVIER SCIENCE BV

DOI: 10.1016/j.neucom.2007.12.050

MG Jafari, MD Plumbley (2008)An adaptive orthogonal sparsifying transform for speech signals, In: 2008 3RD INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS, CONTROL AND SIGNAL PROCESSING, VOLS 1-3pp. 786-790

DOI: 10.1109/ISCCSP.2008.4537329

Saeid Safavi, Turab Iqbal, Wenwu Wang, Philip Coleman, Mark D. Plumbley (2020)Open-Window: A Sound Event Dataset For Window State Detection and Recognition, In: Proc. 5th International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020)

DOI: 10.5281/zenodo.3620748

Situated in the domain of urban sound scene classification by humans and machines, this research is the first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people’s homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations and four different window states: two stationary states (open and closed) and two transitional states (open to closed and closed to open). We have then built our machine recognition baselines for different scenarios (open set versus closed set) using a deep learning framework. A human listening test is also performed to compare human and machine performance in detecting the window state using only acoustic cues. Our experimental results reveal that when using a simple machine baseline system, humans and machines achieve similar average performance for closed set experiments.

Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang (2023)Leveraging Pre-Trained AudioLDM for Sound Generation: A Benchmark Study, In: 2023 31st European Signal Processing Conference (EUSIPCO)pp. 765-769 EURASIP

DOI: 10.23919/EUSIPCO58844.2023.10289975

Deep neural networks have recently achieved breakthroughs in sound generation. Despite the outstanding sample quality, current sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting performance. In this paper, we make the first attempt to investigate the benefits of pre-training on sound generation with AudioLDM, the cutting-edge model for audio generation, as the backbone. Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarcity scenarios. In addition, the baselines and evaluation protocol for sound generation systems are not consistent enough to compare different studies directly. Aiming to facilitate further study on sound generation tasks, we benchmark the sound generation task on various frequently-used datasets. We hope our results on transfer learning and benchmarks can provide references for further research on conditional sound generation.

Y Nishimori, S Akaho, MD Plumbley (2008)Natural Conjugate Gradient on Complex Flag Manifolds for Complex Independent Subspace Analysis, In: V Kurkova, R Neruda, J Koutnik (eds.), ARTIFICIAL NEURAL NETWORKS - ICANN 2008, PT I5163pp. 165-174

R Badeau, MD Plumbley (2013)MULTICHANNEL HR-NMF FOR MODELLING CONVOLUTIVE MIXTURES OF NON-STATIONARY SIGNALS IN THE TIME-FREQUENCY DOMAIN, In: 2013 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA)

AM Stark, MD Plumbley, MEP Davies (2007)Real-Time Beat-Synchronous Audio Effects., In: NIMEpp. 344-345

DOI: 10.1145/1279740.1279816

Yin Cao, Qiuqiang Kong, Turab Iqbal, Fengyan An, Wenwu Wang, Mark D. Plumbley (2019)Polyphonic sound event detection and localization using a two-stage strategy, In: Michael Mandel, Justin Salamon, Daniel P. W Ellis (eds.), Proceedings of Detection and Classification of Acoustic Scenes and Events Workshop (DCASE 2019)pp. 30-34 New York University

DOI: 10.33682/4jhy-bj81

Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
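The idea of using the SED ground truth as a mask when training the DOA branch can be sketched as a masked loss. The mean absolute error, the angle units, the array layout and all values below are assumptions made for illustration; the abstract does not specify the loss, and a real system would also need to handle angle wrap-around.

```python
def masked_doa_loss(predicted, target, active):
    """Mean absolute DOA error counted only where the SED ground truth marks
    an event as active. All arguments are lists of frames, each a list with
    one entry per event class. Angle wrap-around is ignored in this sketch."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(predicted, target, active):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:  # inactive (frame, class) pairs contribute nothing to the loss
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0

# Hypothetical azimuths in degrees for 2 frames x 2 event classes.
predicted = [[10.0, 20.0], [30.0, 40.0]]
target = [[12.0, 0.0], [30.0, 50.0]]
active = [[1, 0], [0, 1]]  # ground-truth event activity used as the mask

print(masked_doa_loss(predicted, target, active))
```

Masking in this way means the DOA estimates for inactive events, whose target direction is undefined, do not influence training.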

D Stowell, MD Plumbley (2012)Multi-target pitch tracking of vibrato sources in noise using the GM-PHD filter, In: Proceedings of 5th International Workshop on Machine Learning and Music (MML 2012)pp. 27-28

Probabilistic approaches to tracking often use single-source Bayesian models; applying these to multi-source tasks is problematic. We apply a principled multi-object tracking implementation, the Gaussian mixture probability hypothesis density filter, to track multiple sources having fixed pitch plus vibrato. We demonstrate high-quality filtering in a synthetic experiment, and find improved tracking using a richer feature set which captures underlying dynamics. Our implementation is available as open-source Python code.

MD Plumbley, M Bevilacqua (2009)SPARSE RECONSTRUCTION FOR COMPRESSED SENSING USING STAGE WISE POLYTOPE FACES PURSUIT, In: 2009 16TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING, VOLS 1 AND 2pp. 14-21

Qiuqiang Kong, Changsong Yu, Yong Xu, Turab Iqbal, Wenwu Wang, Mark D. Plumbley (2019)Weakly Labelled AudioSet Tagging With Attention Neural Networks, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing27(11)pp. 1791-1802 IEEE

DOI: 10.1109/TASLP.2019.2930913

Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, while the onset and offset times are unknown. To address the weakly-labelled audio tagging problem, we propose attention neural networks as a way to attend to the most salient parts of an audio clip. We bridge the connection between attention neural networks and multiple instance learning (MIL) methods, and propose decision-level and feature-level attention neural networks for audio tagging. We investigate attention neural networks modelled by different functions, depths and widths. Experiments on AudioSet show that the feature-level attention neural network achieves a state-of-the-art mean average precision (mAP) of 0.369, outperforming the best multiple instance learning (MIL) method of 0.317 and Google’s deep neural network baseline of 0.314. In addition, we discover that the audio tagging performance on AudioSet embedding features has a weak correlation with the number of training examples and the quality of labels of each sound class.

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D Plumbley (2018)A joint separation-classification model for sound event detection of weakly labelled data, In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)pp. 321-325 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2018.8462448

Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events from an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, only the tags of an audio recording are known but the times of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio recording to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of each audio event. In the source separation stage, sources of audio events and times of sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming a deep neural network baseline of 0.29. A source separation SDR of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as the probability mapping, outperforming the global max pooling (GMP) based probability mapping, which gives an SDR of 0.03 dB. Source code of our work is published.
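The two probability mappings compared in this abstract can be sketched as pooling functions over the values of one T-F segmentation mask. The GWRP formula below follows its usual definition (values sorted in descending order and weighted by powers of a decay parameter r); the parameter name r and the toy mask values are assumptions, since the abstract does not give them.

```python
def global_max_pooling(values):
    """GMP: the clip-level probability is the single largest mask value."""
    return max(values)

def global_weighted_rank_pooling(values, r):
    """GWRP: sort values in descending order and weight the j-th largest by r**j.
    r = 0 recovers max pooling and r = 1 recovers average pooling."""
    ranked = sorted(values, reverse=True)
    weights = [r ** j for j in range(len(ranked))]
    return sum(w * v for w, v in zip(weights, ranked)) / sum(weights)

# Hypothetical flattened T-F segmentation mask for one event class.
mask_values = [0.2, 0.8, 0.5]

print(global_max_pooling(mask_values))
print(global_weighted_rank_pooling(mask_values, 0.5))
print(global_weighted_rank_pooling(mask_values, 1.0))
```

Because GWRP lets many T-F bins contribute to the clip-level probability, the training signal reaches more of the mask than with max pooling, where only the single largest bin is involved.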

W Wang, J Dong, MD Plumbley (2015)Analysis SimCO Algorithms for Sparse Analysis Model Based Dictionary Learning, In: IEEE Transactions on Signal Processing IEEE

DOI: 10.1109/TSP.2015.2483480

In this paper, we consider the dictionary learning problem for the sparse analysis model. A novel algorithm is proposed by adapting the simultaneous codeword optimization (SimCO) algorithm, based on the sparse synthesis model, to the sparse analysis model. This algorithm assumes that the analysis dictionary contains unit ℓ2-norm atoms and learns the dictionary by optimization on manifolds. This framework allows multiple dictionary atoms to be updated simultaneously in each iteration. However, similar to several existing analysis dictionary learning algorithms, dictionaries learned by the proposed algorithm may contain similar atoms, leading to a degenerate (coherent) dictionary. To address this problem, we also consider restricting the coherence of the learned dictionary and propose Incoherent Analysis SimCO by introducing an atom decorrelation step following the update of the dictionary. We demonstrate the competitive performance of the proposed algorithms using experiments with synthetic data.

MEP Davies, MD Plumbley (2005)Beat tracking with a two state model, In: 2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5pp. 241-244

Saeid Safavi, Andy Pearce, Wenwu Wang, Mark Plumbley (2018)Predicting the perceived level of reverberation using machine learning, In: Proceedings of the 52nd Asilomar Conference on Signals, Systems and Computers (ACSSC 2018) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ACSSC.2018.8645201

Perceptual measures are usually considered more reliable than instrumental measures for evaluating the perceived level of reverberation. However, such measures are time consuming and expensive, and, due to variations in stimuli or assessors, the resulting data is not always statistically significant. Therefore, an (objective) measure of the perceived level of reverberation becomes desirable. In this paper, we develop a new method to predict the level of reverberation from audio signals by relating the perceptual listening test results with those obtained from a machine learned model. More specifically, we compare the use of a multiple stimuli test for within and between class architectures to evaluate the perceived level of reverberation. An expert set of 16 human listeners rated the perceived level of reverberation for the same set of files from different audio source types. We then train a machine learning model using the training data gathered for the same set of files and a variety of reverberation-related features extracted from the data, such as reverberation time and direct-to-reverberation ratio. The results suggest that the machine learned model offers an accurate prediction of the perceptual scores.

N Degara, E Argones Rua, A Pena, S Torres-Guijarro, MEP Davies, MD Plumbley (2012)Reliability-Informed Beat Tracking of Musical Signals, In: IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING20(1)pp. 290-301 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASL.2011.2160854

Emad M. Grais, Gerard Roma, Andrew J. R. Simpson, Mark D. Plumbley (2017)Two-Stage Single-Channel Audio Source Separation using Deep Neural Networks, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing25(9)pp. 1773-1783 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2017.2716443

Most single channel audio source separation (SCASS) approaches produce separated sources accompanied by interference from other sources and other distortions. To tackle this problem, we propose to separate the sources in two stages. In the first stage, the sources are separated from the mixed signal. In the second stage, the interference between the separated sources and the distortions are reduced using deep neural networks (DNNs). We propose two methods that use DNNs to improve the quality of the separated sources in the second stage. In the first method, each separated source is improved individually using its own trained DNN, while in the second method all the separated sources are improved together using a single DNN. To further improve the quality of the separated sources, the DNNs in the second stage are trained discriminatively to further decrease the interference and the distortions of the separated sources. Our experimental results show that using two stages of separation improves the quality of the separated signals by decreasing the interference between the separated sources and distortions compared to separating the sources using a single stage of separation.

AJR Simpson, G Roma, Emad M Grais, Russell Mason, Christopher Hummersone, A Liutkus, Mark Plumbley (2016)Evaluation of Audio Source Separation Models Using Hypothesis-Driven Non-Parametric Statistical Methods, In: European Signal Processing Conference (EUSIPCO) 2016

Audio source separation models are typically evaluated using objective separation quality measures, but rigorous statistical methods have yet to be applied to the problem of model comparison. As a result, it can be difficult to establish whether or not reliable progress is being made during the development of new models. In this paper, we provide a hypothesis-driven statistical analysis of the results of the recent source separation SiSEC challenge involving twelve competing models tested on separation of voice and accompaniment from fifty pieces of “professionally produced” contemporary music. Using nonparametric statistics, we establish reliable evidence for meaningful conclusions about the performance of the various models.

Yong Xu, Qiuqiang Kong, Wenwu Wang, Mark Plumbley (2017)Surrey-CVSSP system for DCASE2017 challenge task 4, In: Tuomas Virtanen, Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Emmanuel Vincent, Emmanouil Benetos, Benjamin Martinez Elizalde (eds.), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017) Tampere University of Technology. Laboratory of Signal Processing

In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings due to their industry applications. There are two subtasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found to be effective in improving the performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.73 error rate (ER) for the sound event detection subtask on the development set. By comparison, the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-value for audio tagging and an ER of 1.02 for sound event detection.

A Kachkaev, D Wolff, M Barthet, MD Plumbley, J Dykes, T Weyde (2014)Visualising chord progressions in music collections: a big data approach

The analysis of large datasets of music audio and other representations entails the need for techniques that support musicologists and other users in interpreting extracted data. We explore and develop visualisation techniques of chord sequence patterns mined from a corpus of over one million tracks. The visualisations use different representations of root movements and chord qualities with geometrical representations, and mostly colour mappings for pattern support. The presented visualisations are being developed in close collaboration with musicologists and can help gain insights into the differences of musical genres and styles as well as support the development of new classification methods.

K Toyama, MD Plumbley (2009)USING PHASE LINEARITY IN FREQUENCY-DOMAIN ICA TO TACKLE THE PERMUTATION PROBLEM, In: 2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGSpp. 3165-3168

DOI: 10.1109/ICASSP.2009.4960296

I Johnson, M Plumbley (2000)On-line connectionist Q-learning produces unreliable performance with a synonym finding task., In: SI Amari, CL Giles, M Mori, V Piuri (eds.), IJCNN 2000: PROCEEDINGS OF THE IEEE-INNS-ENNS INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOL IIIpp. 451-456

Z Qin, Y Gao, MD Plumbley, CG Parini (2015)Wideband spectrum sensing on real-time signals at sub-Nyquist sampling rates in single and cooperative multiple nodes, In: IEEE Transactions on Signal Processing64(12)pp. 3106-3117 IEEE

DOI: 10.1109/TSP.2015.2512562

This paper presents two new algorithms for wideband spectrum sensing at sub-Nyquist sampling rates, for both single nodes and cooperative multiple nodes. In single-node spectrum sensing, a two-phase spectrum sensing algorithm based on compressive sensing is proposed to reduce the computational complexity and improve the robustness at secondary users (SUs). In the cooperative multiple nodes case, the signals received at SUs exhibit a sparsity property that yields a low-rank matrix of compressed measurements at the fusion center. This therefore leads to a two-phase cooperative spectrum sensing algorithm for cooperative multiple SUs based on low-rank matrix completion. In addition, the two proposed spectrum sensing algorithms are evaluated on the TV white space (TVWS), in which pioneering work aimed at enabling dynamic spectrum access into practice has been promoted by both the Federal Communications Commission and the U.K. Office of Communications. The proposed algorithms are tested on the real-time signals after they have been validated by the simulated signals in TVWS. The numerical results show that our proposed algorithms are more robust to channel noise and have lower computational complexity than the state-of-the-art algorithms.

D Stowell, M Plumbley (2007)Adaptive whitening for improved real-time audio onset detection, In: Proceedings of the 2007 International Computer Music Conference, ICMC 2007pp. 312-319

We describe a new method for preprocessing STFT phase vocoder frames for improved performance in real-time onset detection, which we term "adaptive whitening". The procedure involves normalising the magnitude of each bin according to a recent maximum value for that bin, with the aim of allowing each bin to achieve a similar dynamic range over time, which helps to mitigate the influence of spectral roll-off and strongly-varying dynamics. Adaptive whitening requires no training, is relatively lightweight to compute, and can run in real-time. Yet it can improve onset detector performance by more than ten percentage points (peak F-measure) in some cases, and improves the performance of most of the onset detectors tested. We present results demonstrating that adaptive whitening can significantly improve the performance of various STFT-based onset detection functions, including functions based on the power, spectral flux, phase deviation, and complex deviation measures. Our results find the process to be especially beneficial for certain types of audio signal (e.g. complex mixtures such as pop music).
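The normalisation described above can be sketched as a per-bin peak follower. This is an illustrative reading of "a recent maximum value for that bin", assuming the recent maximum decays geometrically between frames and is bounded below by a small floor. The function name and the `decay` and `floor` values are assumptions, not parameters taken from the paper.

```python
import numpy as np

def adaptive_whiten(magnitudes, decay=0.997, floor=1e-3):
    """Divide each STFT bin magnitude by a decaying running maximum for that bin.

    magnitudes: array of shape (n_frames, n_bins), processed frame by frame
    so that the procedure can run causally (in real time).
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    peaks = np.full(magnitudes.shape[1], floor)
    whitened = np.empty_like(magnitudes)
    for n, frame in enumerate(magnitudes):
        # Recent maximum: current magnitude, floor, or decayed previous peak.
        peaks = np.maximum(np.maximum(frame, floor), decay * peaks)
        whitened[n] = frame / peaks
    return whitened

# Two bins with the same temporal shape but a 100x difference in level.
frames = np.array([[1.0, 0.010],
                   [0.5, 0.005],
                   [1.0, 0.010]])
out = adaptive_whiten(frames)
print(out)
```

After whitening, the quiet bin and the loud bin follow the same trajectory, which illustrates how each bin is given a similar dynamic range regardless of spectral roll-off.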

Arshdeep Singh, Mark Plumbley (2023)Efficient CNNs via Passive Filter Pruning, In: Efficient CNNs via Passive Filter Pruning Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2304.02319

Convolutional neural networks (CNNs) have shown state-of-the-art performance in various applications. However, CNNs are resource-hungry due to their requirement of high computational complexity and memory storage. Recent efforts toward achieving computational efficiency in CNNs involve filter pruning methods that eliminate some of the filters in CNNs based on the "importance" of the filters. The majority of existing filter pruning methods are either "active", which use a dataset and generate feature maps to quantify filter importance, or "passive", which compute filter importance using the entry-wise norm of the filters without involving data. Under a high pruning ratio, where a large number of filters are to be pruned from the network, the entry-wise norm methods eliminate relatively smaller norm filters without considering the significance of the filters in producing the node output, resulting in degradation in the performance. To address this, we present a passive filter pruning method where the filters are pruned based on their contribution in producing output by considering the operator norm of the filters. The proposed pruning method generalizes better across various CNNs compared to that of the entry-wise norm-based pruning methods. In comparison to the existing active filter pruning methods, the proposed pruning method is at least 4.5 times faster in computing filter importance and is able to achieve similar performance compared to that of the active filter pruning methods. The efficacy of the proposed pruning method is evaluated on audio scene classification and image classification using various CNN architectures such as VGGish, DCASE21_Net, VGG-16 and ResNet-50.
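The difference between an entry-wise norm and an operator norm can be illustrated on two toy filters. The sketch below is one possible reading, not the paper's procedure: it assumes each filter of shape (channels, height, width) is reshaped to a channels-by-(height × width) matrix, uses the ℓ2 norm of all entries as the entry-wise score, and uses the largest singular value as the operator norm. How the paper actually constructs the operator is not stated in the abstract.

```python
import numpy as np

def entrywise_norm(filt):
    """l2 norm of all filter entries (equivalent to the Frobenius norm)."""
    return float(np.linalg.norm(filt.ravel()))

def operator_norm(filt):
    """Largest singular value of the filter reshaped to (channels, h * w)."""
    matrix = filt.reshape(filt.shape[0], -1)
    return float(np.linalg.norm(matrix, 2))

# Filter A: every entry equal, so the reshaped matrix has rank one.
rank_one = np.ones((3, 3, 3)) / np.sqrt(27)

# Filter B: same total energy, spread over three mutually orthogonal rows.
spread = np.zeros((3, 3, 3))
for i in range(3):
    spread[i, i, i] = 1.0 / np.sqrt(3)

for name, filt in [("rank_one", rank_one), ("spread", spread)]:
    print(name, entrywise_norm(filt), operator_norm(filt))
```

Both filters have the same entry-wise norm, so an entry-wise criterion cannot rank them, whereas the operator norm separates them by their largest gain on any unit-norm input.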

E BENETOS, G Lafay, M Lagrange, MD Plumbley (2016)Detection of overlapping acoustic events using a temporally-constrained probabilistic model, In: ICASSP 2016 Proceedings

In this paper, a system for overlapping acoustic event detection is proposed, which models the temporal evolution of sound events. The system is based on probabilistic latent component analysis, supporting the use of a sound event dictionary where each exemplar consists of a succession of spectral templates. The temporal succession of the templates is controlled through event class-wise Hidden Markov Models (HMMs). As input time/frequency representation, the Equivalent Rectangular Bandwidth (ERB) spectrogram is used. Experiments are carried out on polyphonic datasets of office sounds generated using an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the proposed system outperforms several state-of-the-art methods for overlapping acoustic event detection on the same task, using both frame-based and event-based metrics, and is robust to varying event density and noise levels.

S Sigtia, AM Stark, S Krstulovic, Mark Plumbley (2016)Automatic environmental sound recognition: Performance versus computational cost, In: IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING24(11)pp. 2096-2107 IEEE

DOI: 10.1109/TASLP.2016.2592698

In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where notions of product pricing and form factor impose hard constraints on the available computing power. Whereas Automatic Environmental Sound Recognition (AESR) algorithms are most often developed with limited consideration for computational cost, this article seeks which AESR algorithm can make the most of a limited amount of computing power by comparing the sound classification performance as a function of its computational cost. Results suggest that Deep Neural Networks yield the best ratio of sound classification accuracy across a range of computational costs, while Gaussian Mixture Models offer a reasonable accuracy at a consistently small cost, and Support Vector Machines stand between both in terms of compromise between accuracy and computational cost.

AN Robertson, MD Plumbley (2009)Post-processing fiddle~: A real-time multi-pitch tracking technique using harmonic partial subtraction for use within live performance systems, In: Proceedings of the 2009 International Computer Music Conference (ICMC 2009)pp. 227-230

We present a method for real-time pitch-tracking which generates an estimation of the relative amplitudes of the partials relative to the fundamental for each detected note. We then employ a subtraction method, whereby lower fundamentals in the spectrum are accounted for when looking at higher fundamental notes. By tracking notes which are playing, we look for note off events and continually update our expected partial weightings for each note. The resulting algorithm makes use of these relative partial weightings within its decision process. We have evaluated the system against a data set and compared it with specialised offline pitch-trackers.

François Grondin, François Glass, Iwona Sobieraj, Mark D. Plumbley (2019)Sound Event Localization and Detection Using CRNN on Pairs of Microphones, In: Michael Mandel, Justin Salamon, Daniel P. W. Ellis (eds.), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)pp. 84-88 New York University

DOI: 10.33682/4v2a-7q02

This paper proposes sound event localization and detection methods for multichannel recordings. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system.
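The count of six pairs follows from choosing 2 microphones out of 4. A short sketch, with microphone indices chosen purely for illustration:

```python
from itertools import combinations

# Hypothetical microphone indices for a four-microphone array.
microphones = [0, 1, 2, 3]

# Every unordered pair of distinct microphones: C(4, 2) = 6 pairs.
pairs = list(combinations(microphones, 2))
for first, second in pairs:
    print(f"microphone pair ({first}, {second})")
```

In general an array of M microphones yields M(M - 1)/2 pairs, so the number of per-pair estimates to combine grows quadratically with the array size.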

Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D Plumbley (2018)DCASE 2018 Challenge Surrey Cross-task convolutional neural network baseline, In: DCASE2018 Workshop

DOI: 10.48550/arXiv.1808.00773

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that the deeper CNN with 8 layers performs better than the CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.

D Stowell, MD Plumbley (2008)Robustness and independence of voice timbre features under live performance acoustic degradations, In: Proceedings - 11th International Conference on Digital Audio Effects, DAFx 2008pp. 325-332

Live performance situations can lead to degradations in the vocal signal from a typical microphone, such as ambient noise or echoes due to feedback. We investigate the robustness of continuous-valued timbre features measured on vocal signals (speech, singing, beatboxing) under simulated degradations. We also consider nonparametric dependencies between features, using information theoretic measures and a feature-selection algorithm. We discuss how robustness and independence issues reflect on the choice of acoustic features for use in constructing a continuous-valued vocal timbre space. While some measures (notably spectral crest factors) emerge as good candidates for such a task, others are poor, and some features such as ZCR exhibit an interaction with the type of voice signal being analysed.

F Font, Tim Brookes, G Fazekas, M Guerber, A La Burthe, D Plans, MD Plumbley, M Shaashua, W Wang, X Serra (2016)Audio Commons: bringing Creative Commons audio content to the creative industries, In: AES E-Library Audio Engineering Society

Significant amounts of user-generated audio content, such as sound effects, musical samples and music pieces, are uploaded to online repositories and made available under open licenses. Moreover, a constantly increasing amount of multimedia content, originally released with traditional licenses, is becoming public domain as its license expires. Nevertheless, the creative industries are not yet using much of all this content in their media productions. There is still a lack of familiarity and understanding of the legal context of all this open content, but there are also problems related to its accessibility. A large percentage of this content remains unreachable either because it is not published online or because it is not well organised and annotated. In this paper we present the Audio Commons Initiative, which is aimed at promoting the use of open audio content and at developing technologies with which to support the ecosystem composed of content repositories, production tools and users. These technologies should enable the reuse of this audio material, facilitating its integration in the production workflows used by the creative industries. This is a position paper in which we describe the core ideas behind this initiative and outline the ways in which we plan to address the challenges it poses.

Thomas Deacon, Mark D Plumbley (2024)Working with AI Sound: Exploring the Future of Workplace AI Sound Technologies, In: CHIWORK '24: Proceedings of the 3rd Annual Meeting of the Symposium on Human-Computer Interaction for Work2pp. 1-21

DOI: 10.1145/3663384.3663391

The workplace is a site for the rapid development and deployment of Artificial Intelligence (AI) systems. However, our research suggests that their adoption could already be hindered by critical issues such as trust, privacy, and security. This paper examines the integration of AI-enabled sound technologies in the workplace, with a focus on enhancing well-being and productivity through a soundscape approach while addressing ethical concerns. To explore these concepts, we used scenario-based design and structured feedback sessions with knowledge workers from open-plan offices and those working from home. To do this, we present initial design concepts for AI sound analysis and control systems. Based on the perspectives gathered, we present user requirements and concerns, particularly regarding privacy and the potential for workplace surveillance, emphasising the need for user consent and levels of transparency in AI deployments. Navigating these ethical considerations is a key implication of the study. We advocate for novel ways to incorporate people’s involvement in the design process through co-design and serious games to shape the future of AI audio technologies in the workplace.

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark Plumbley (2023)AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining, In: AudioLDM 2: Learning holistic audio generation with self-supervised pretraining Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2308.05734

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at https://audioldm.github.io/audioldm2.

Estefanía Cano, Derry FitzGerald, Antoine Liutkus, Mark D. Plumbley, Fabian-Robert Stöter (2019)Musical Source Separation: An Introduction, In: IEEE Signal Processing Magazine36(1)pp. 31-40 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/MSP.2018.2874719

Many people listen to recorded music as part of their everyday lives, for example from radio or TV programmes, CDs, downloads or increasingly from online streaming services. Sometimes we might want to remix the balance within the music, perhaps to make the vocals louder or to suppress an unwanted sound, or we might want to upmix a 2-channel stereo recording to a 5.1-channel surround sound system. We might also want to change the spatial location of a musical instrument within the mix. All of these applications are relatively straightforward, provided we have access to separate sound channels (stems) for each musical audio object. However, if we only have access to the final recording mix, which is usually the case, this is much more challenging. To estimate the original musical sources, which would allow us to remix, suppress or upmix the sources, we need to perform musical source separation (MSS). In the general source separation problem, we are given one or more mixture signals that contain different mixtures of some original source signals. This is illustrated in Figure 1 where four sources, namely vocals, drums, bass and guitar, are all present in the mixture. The task is to recover one or more of the source signals given the mixtures. In some cases, this is relatively straightforward, for example, if there are at least as many mixtures as there are sources, and if the mixing process is fixed, with no delays, filters or non-linear mastering [1]. However, MSS is normally more challenging. Typically, there may be many musical instruments and voices in a 2-channel recording, and the sources have often been processed with the addition of filters and reverberation (sometimes nonlinear) in the recording and mixing process. In some cases, the sources may move, or the production parameters may change, meaning that the mixture is time-varying. All of these issues make MSS a very challenging problem.
Nevertheless, musical sound sources have particular properties and structures that can help us. For example, musical source signals often have a regular harmonic structure of frequencies at regular intervals, and can have frequency contours characteristic of each musical instrument. They may also repeat in particular temporal patterns based on the musical structure. In this paper we will explore the MSS problem and introduce approaches to tackle it. We will begin by introducing characteristics of music signals, then give an introduction to MSS, and finally consider a range of MSS models. We will also discuss how to evaluate MSS approaches, and discuss limitations and directions for future research.

K O'Hanlon, H Nagano, MD Plumbley (2012)Oracle analysis of sparse automatic music transcriptionpp. 591-598

We have previously proposed a structured sparse approach to piano transcription with promising results recorded on a challenging dataset. The approach taken was measured in terms of both frame-based and onset-based metrics. Close inspection of the results revealed problems in capturing frames in which a given note has low energy, for example in sustained notes. Further problems were also noticed in the onset detection, where for many notes seen to be active in the output transcription an onset was not detected. A brief description of the approach is given here, and further analysis of the system is given by considering an oracle transcription, derived from the ground truth piano roll and the given dictionary of spectral template atoms, which gives a clearer indication of the problems which need to be overcome in order to improve the proposed approach.

K O'Hanlon, H Nagano, N Keriven, MD Plumbley (2016)Non-Negative Group Sparsity with Subspace Note Modelling for Polyphonic Transcription, In: IEEE/ACM Transactions on Audio, Speech and Language Processing24(3)pp. 530-542 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2016.2515514

Automatic music transcription (AMT) can be performed by deriving a pitch-time representation through decomposition of a spectrogram with a dictionary of pitch-labelled atoms. Typically, non-negative matrix factorisation (NMF) methods are used to decompose magnitude spectrograms. One atom is often used to represent each note. However, the spectrum of a note may change over time. Previous research considered this variability using different atoms to model specific parts of a note, or large dictionaries comprised of datapoints from the spectrograms of full notes. In this paper, the use of subspace modelling of note spectra is explored, with group sparsity employed as a means of coupling activations of related atoms into a pitched subspace. Stepwise and gradient-based methods for non-negative group sparse decompositions are proposed. Finally, a group sparse NMF approach is used to tune a generic harmonic subspace dictionary, leading to improved NMF-based AMT results.
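The idea of decomposing a spectrogram with a fixed dictionary of pitch-labelled atoms, then pooling the activations of atoms that belong to the same pitch, can be sketched as below. This is a plain (non-sparse) NMF activation update with a Euclidean cost, used here as a stand-in; it is not the stepwise or gradient-based group sparse methods proposed in the paper. The toy dictionary, the grouping, and all function names are assumptions for illustration.

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H with V ~ W @ H, keeping W fixed.

    Uses the standard multiplicative update for the Euclidean cost.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def pitch_salience(H, groups):
    """l2 norm of the activations within each pitch group, per time frame."""
    return np.stack([np.linalg.norm(H[idx], axis=0) for idx in groups])

# Toy dictionary: 6 frequency bins, 2 pitches, 2 atoms per pitch.
W = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.2, 0.1, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.3],
              [0.0, 0.0, 0.4, 1.0],
              [0.0, 0.0, 0.1, 0.2]])
groups = [[0, 1], [2, 3]]            # atom indices belonging to each pitch

# Pitch 0 sounds in frame 0, pitch 1 sounds in frame 1.
H_true = np.array([[1.0, 0.0],
                   [0.5, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.5]])
V = W @ H_true

H = nmf_activations(V, W)
salience = pitch_salience(H, groups)
print(salience.argmax(axis=0))       # most salient pitch in each frame
```

Pooling with a group norm is what couples the atoms of one pitch into a single pitch-time activation, which is the role group sparsity plays in the abstract.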

MD Plumbley, E Oja (2004)A "nonnegative PCA" algorithm for independent component analysis, In: IEEE TRANSACTIONS ON NEURAL NETWORKS15(1)pp. 66-76 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TNN.2003.820672

JA Hockman, JP Bello, MEP Davies, MD Plumbley (2008)Automated rhythmic transformation of musical audio, In: Proceedings - 11th International Conference on Digital Audio Effects, DAFx 2008pp. 177-180

Time-scale transformations of audio signals have traditionally relied exclusively upon manipulations of tempo. We present a novel technique for automatic mixing and synchronization between two musical signals. In this transformation, the original signal assumes the tempo, meter, and rhythmic structure of the model signal, while the extracted downbeats and salient intra-measure infrastructure of the original are maintained.

MG Jafari, MD Plumbley, ME Davies (2008)Speech separation using an adaptive sparse dictionary algorithm, In: 2008 HANDS-FREE SPEECH COMMUNICATION AND MICROPHONE ARRAYSpp. 26-29

MD Plumbley (2006)Recovery of sparse representations by polytope faces pursuit, In: J Rosca, D Erdogmus, JC Principe, S Haykin (eds.), INDEPENDENT COMPONENT ANALYSIS AND BLIND SIGNAL SEPARATION, PROCEEDINGS3889pp. 206-213

E Benetos, G Lafay, M Lagrange, Mark Plumbley (2017)Polyphonic Sound Event Tracking using Linear Dynamical Systems, In: IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING25(6)pp. 1266-1277 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2017.2690576

In this paper, a system for polyphonic sound event detection and tracking is proposed, based on spectrogram factorisation techniques and state space models. The system extends probabilistic latent component analysis (PLCA) and is modelled around a 4-dimensional spectral template dictionary of frequency, sound event class, exemplar index, and sound state. In order to jointly track multiple overlapping sound events over time, the integration of linear dynamical systems (LDS) within the PLCA inference is proposed. The system assumes that the PLCA sound event activation is the (noisy) observation in an LDS, with the latent states corresponding to the true event activations. LDS training is achieved using fully observed data, making use of ground truth-informed event activations produced by the PLCA-based model. Several LDS variants are evaluated, using polyphonic datasets of office sounds generated from an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the integration of LDS tracking within PLCA leads to an improvement of +8.5-10.5% in terms of frame-based F-measure as compared to the use of the PLCA model alone. In addition, the proposed system outperforms several state-of-the-art methods for the task of polyphonic sound event detection.

Qiang Huang, Philip Jackson, Mark D. Plumbley, Wenwu Wang (2018)Synthesis of images by two-stage generative adversarial networks, In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)pp. 1593-1597 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2018.8461984

In this paper, we propose a divide-and-conquer approach using two generative adversarial networks (GANs) to explore how a machine can draw colorful pictures (bird) using a small amount of training data. In our work, we simulate the procedure of an artist drawing a picture, where one begins with drawing objects’ contours and edges and then paints them different colors. We adopt two GAN models to process basic visual features including shape, texture and color. We use the first GAN model to generate object shape, and then paint the black and white image based on the knowledge learned using the second GAN model. We run our experiments on 600 color images. The experimental results show that the use of our approach can generate good quality synthetic images, comparable to real ones.

MG Jafari, E Vincent, SA Abdallah, MD Plumbley, ME Davies (2008)An adaptive stereo basis method for convolutive blind audio source separation, In: NEUROCOMPUTING71(10-12)pp. 2087-2097 ELSEVIER SCIENCE BV

DOI: 10.1016/j.neucom.2007.08.029

S Li, S Dixon, Mark Plumbley (2017)Clustering expressive timing with regressed polynomial coefficients demonstrated by a model selection test, In: Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017)pp. 457-463

Though many past works have tried to cluster expressive timing within a phrase, there have been few attempts to cluster features of expressive timing with constant dimensions regardless of phrase lengths. For example, used as a way to represent expressive timing, tempo curves can be regressed by a polynomial function such that the number of regressed polynomial coefficients remains constant with a given order regardless of phrase lengths. In this paper, clustering the regressed polynomial coefficients is proposed for expressive timing analysis. A model selection test is presented to compare Gaussian Mixture Models (GMMs) fitting regressed polynomial coefficients and fitting expressive timing directly. As there are no expected results of clustering expressive timing, the proposed method is demonstrated by how well the expressive timing is approximated by the centroids of GMMs. The results show that GMMs fitting the regressed polynomial coefficients outperform GMMs fitting expressive timing directly. This conclusion suggests that it is possible to use regressed polynomial coefficients to represent expressive timing within a phrase and cluster expressive timing within phrases of different lengths.
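
The fixed-dimension representation described above can be sketched with an ordinary least-squares polynomial fit. The polynomial order, the normalised time axis, the function name `tempo_curve_to_coeffs`, and the tempo values are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def tempo_curve_to_coeffs(tempo_curve, order=2):
    """Fit a polynomial to one phrase's tempo curve on a normalised time axis.

    Returns order + 1 coefficients whatever the phrase length.
    """
    tempo_curve = np.asarray(tempo_curve, dtype=float)
    # Map beat positions to [0, 1] so phrases of different lengths are comparable.
    t = np.linspace(0.0, 1.0, len(tempo_curve))
    return np.polyfit(t, tempo_curve, order)


# Hypothetical tempo curves (beats per minute) for phrases of 4, 6 and 9 beats.
phrases = [
    [100, 108, 110, 102],
    [96, 104, 110, 112, 106, 98],
    [90, 95, 101, 106, 108, 107, 103, 97, 92],
]
# Every phrase becomes one row of three coefficients.
features = np.vstack([tempo_curve_to_coeffs(p) for p in phrases])
```

Each row of `features` has the same length, so the rows could be passed to any clustering method that expects fixed-size vectors, such as a Gaussian mixture model.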

H Ekeus, MD Plumbley, PW McOwan (2012)Eye Tracking as Interface for the Design of Generative Visual Forms and Patternspp. 117-121

When working with generative systems, designers enter into a loop of discrete steps; external evaluations of the output feed back into the system, and new outputs are subsequently reevaluated. In such systems, interacting low-level elements can engender a difficult-to-predict emergence of macro-level characteristics. Furthermore, the state space of some systems can be vast. Consequently, designers generally rely on trial-and-error, experience or intuition in selecting parameter values to develop the aesthetic aspects of their designs. We investigate an alternative means of exploring the state spaces of generative visual systems by using a gaze-contingent display. A user's gaze continuously controls and directs an evolution of visual forms and patterns on screen. As time progresses and the viewer and system remain coupled in this evolution, a population of generative artefacts tends towards an area of their state space that is 'of interest', as defined by the eye tracking data. The evaluation-feedback loop is continuous and uninterrupted, with gaze as the guiding feedback mechanism in the exploration of state space.

Mark D. Plumbley (2005)Geometrical methods for non-negative ICA: Manifolds, Lie groups and toral subalgebras, In: Neurocomputing67pp. 161-197 Elsevier

DOI: 10.1016/j.neucom.2004.11.040

We explore the use of geometrical methods to tackle the non-negative independent component analysis (non-negative ICA) problem, without assuming the reader has an existing background in differential geometry. We concentrate on methods that achieve this by minimizing a cost function over the space of orthogonal matrices. We introduce the idea of the manifold and Lie group SO(n) of special orthogonal matrices that we wish to search over, and explain how this is related to the Lie algebra so(n) of skew-symmetric matrices. We describe how familiar optimization methods such as steepest-descent and conjugate gradients can be transformed into this Lie group setting, and how the Newton update step has an alternative Fourier version in SO(n). Finally we introduce the concept of a toral subgroup generated by a particular element of the Lie group or Lie algebra, and explore how this commutative subgroup might be used to simplify searches on our constraint surface. No proofs are presented in this article.
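
The relationship between the Lie algebra so(n) and the Lie group SO(n) can be checked numerically: the matrix exponential of a skew-symmetric matrix is a special orthogonal matrix. The sketch below is illustrative only; the helper names `expm_series` and `skew_part`, the truncated Taylor series, and the random test matrix are assumptions, not the article's algorithms.

```python
import numpy as np


def expm_series(a, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small matrices)."""
    result = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ a / k      # term now holds a^k / k!
        result = result + term
    return result


def skew_part(m):
    """Project a square matrix onto the skew-symmetric matrices so(n)."""
    return 0.5 * (m - m.T)


rng = np.random.default_rng(0)
b = skew_part(rng.standard_normal((3, 3)))  # an element of so(3)
w = expm_series(b)                          # the corresponding element of SO(3)
```

Because `w` stays orthogonal for any skew-symmetric `b`, a search over `b` never leaves the constraint surface, which is what makes the Lie group setting convenient for optimisation over rotations.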

MEP Davies, MD Plumbley (2008)Exploring the effect of rhythmic style classification on automatic tempo estimation., In: EUSIPCOpp. 1-5

SJ Welburn, MD Plumbley (2009)Rendering audio using expressive MIDI, In: 127th Audio Engineering Society Convention 20091pp. 176-184

MIDI renderings of audio are traditionally regarded as lifeless and unnatural - lacking in expression. However, MIDI is simply a protocol for controlling a synthesizer. Lack of expression is caused by either an expressionless synthesizer or by the difficulty in setting the MIDI parameters to provide expressive output. We have developed a system to construct an expressive MIDI representation of an audio signal, i.e. an audio representation which uses tailored pitch variations in addition to the note base pitch parameters which audio-to-MIDI systems usually attempt. A pitch envelope is estimated from the original audio, and a genetic algorithm is then used to estimate pitch modulator parameters from that envelope. These pitch modulations are encoded in a MIDI file and rerendered using a sampler. We present some initial comparisons between the final output audio and the estimated pitch envelopes.

Dominic Ward, Russell D. Mason, Ryan Chungeun Kim, Fabian-Robert Stöter, Antoine Liutkus, Mark D. Plumbley (2018)SISEC 2018: state of the art in musical audio source separation - Subjective selection of the best algorithm, In: Proceedings of the 4th Workshop on Intelligent Music Production, Huddersfield, UK, 14 September 2018 University of Huddersfield

The Signal Separation Evaluation Campaign (SiSEC) is a large-scale regular event aimed at evaluating current progress in source separation through a systematic and reproducible comparison of the participants’ algorithms, providing the source separation community with an invaluable glimpse of recent achievements and open challenges. This paper focuses on the music separation task from SiSEC 2018, which compares algorithms aimed at recovering instrument stems from a stereo mix. In this context, we conducted a subjective evaluation whereby 34 listeners picked which of six competing algorithms, with high objective performance scores, best separated the singing-voice stem from 13 professionally mixed songs. The subjective results reveal strong differences between the algorithms, and highlight the presence of song-dependent performance for state-of-the-art systems. Correlations between the subjective results and the scores of two popular performance metrics are also presented.

MG Jafari, MD Plumbley (2009)Speech denoising based on a greedy adaptive dictionary algorithm., In: EUSIPCOpp. 1423-1426

Christian Kroos, Oliver Bones, Yin Cao, Lara Harris, Philip J. B. Jackson, William J. Davies, Wenwu Wang, Trevor J. Cox, Mark D. Plumbley (2019)Generalisation in environmental sound classification: the 'making sense of sounds' data set and challenge, In: Proceedings of the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2019.8683292

D Stowell, A Robertson, N Bryan-Kinns, MD Plumbley (2009)Evaluation of live human-computer music-making: Quantitative and qualitative approaches, In: INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES67(11)pp. 960-975 ACADEMIC PRESS LTD- ELSEVIER SCIENCE LTD

DOI: 10.1016/j.ijhcs.2009.05.007

MD Plumbley (2014)Separating Musical Audio Signals, In: Acoustics Bulletin39(6)pp. 44-47 Institute of Acoustics

As consumers move increasingly to multichannel and surround-sound reproduction of sound, and also perhaps wish to remix their music to suit their own tastes, there will be an increasing need for high quality automatic source separation to recover sound sources from legacy mono or 2-channel stereo recordings. In this contribution, we will give an overview of some approaches to audio source separation, and some of the remaining research challenges in this area.

A Robertson, AM Stark, MD Plumbley (2011)Real-time Visual Beat Tracking using a Comb Filter Matrix., In: Proceedings of the International Computer Music Conference 2011pp. 617-620

This paper describes an algorithm for real-time beat tracking with a visual interface. Multiple tempo and phase hypotheses are represented by a comb filter matrix. The user can interact by specifying the tempo and phase to be tracked by the algorithm, which will seek to find a continuous path through the space. We present results from evaluating the algorithm on the Hainsworth database and offer a comparison with another existing real-time beat tracking algorithm and offline algorithms.

A Gretsistas, I Damnjanovic, MD Plumbley (2010)GRADIENT POLYTOPE FACES PURSUIT FOR LARGE SCALE SPARSE RECOVERY PROBLEMS, In: 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSINGpp. 2030-2033

DOI: 10.1109/ICASSP.2010.5494955

C Baume, MD Plumbley, J Calic (2015)Use of audio editors in radio production, In: AES E-Library

Audio editing is performed at scale in the production of radio, but often the tools used are poorly targeted toward the task at hand. There are a number of audio analysis techniques that have the potential to aid radio producers, but without a detailed understanding of their process and requirements, it can be difficult to apply these methods. To aid this understanding, a study of radio production practice was conducted on three varied case studies: a news bulletin, drama, and documentary. It examined the audio/metadata workflow, the roles and motivations of the producers, and environmental factors. The study found that producers prefer to interact with higher-level representations of audio content like transcripts and enjoy working on paper. The study also identified opportunities to improve the workflow with tools that link audio to text, highlight repetitions, compare takes, and segment speakers.

Emad M Grais, Dominic Ward, Mark D Plumbley (2018)Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders, In: Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO)pp. 1577-1581 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.23919/EUSIPCO.2018.8553571

Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multiresolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multiresolution features for separating the singing-voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.

D Giannoulis, D Stowell, E Benetos, M Rossignol, M Lagrange, MD Plumbley (2013)A DATABASE AND CHALLENGE FOR ACOUSTIC SCENE CLASSIFICATION AND EVENT DETECTION, In: 2013 PROCEEDINGS OF THE 21ST EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO)pp. ?-?

An increasing number of researchers work in computational auditory scene analysis (CASA). However, a set of tasks, each with a well-defined evaluation framework and commonly used datasets, does not yet exist. Thus, it is difficult for results and algorithms to be compared fairly, which hinders research in the field. In this paper we introduce a newly launched public evaluation challenge dealing with two closely related tasks of the field: acoustic scene classification and event detection. We give an overview of the tasks involved, describe the processes of creating the dataset, and define the evaluation metrics. Finally, illustrative results for both tasks using baseline methods applied to this dataset are presented, accompanied by open-source code. © 2013 EURASIP.

H Laurberg, MG Christensen, MD Plumbley, LK Hansen, SH Jensen (2008)Theorems on positive data: on the uniqueness of NMF., In: Comput Intell Neurosci Hindawi Publishing Corporation

DOI: 10.1155/2008/764206

We investigate the conditions for which nonnegative matrix factorization (NMF) is unique and introduce several theorems which can determine whether the decomposition is in fact unique or not. The theorems are illustrated by several examples showing the use of the theorems and their limitations. We have shown that corruption of a unique NMF matrix by additive noise leads to a noisy estimation of the noise-free unique solution. Finally, we use a stochastic view of NMF to analyze which characterization of the underlying model will result in an NMF with small estimation errors.

S Li, D Black, E Chew, Mark Plumbley (2014)Evidence that phrase-level tempo variation may be represented using a limited dictionary

Phrases are common musical units akin to those in speech and text. In music performance, performers often change the way they vary the tempo from one phrase to the next in order to choreograph patterns of repetition and contrast. This activity is commonly referred to as expressive music performance. Despite its importance, expressive performance is still poorly understood. No formal models exist that would explain, or at least quantify and characterise, aspects of commonalities and differences in performance style. In this paper we present such a model for tempo variation between phrases in a performance. We demonstrate that the model provides a good fit with a performance database of 25 pieces and that perceptually important information is not lost through the modelling process.

A Nesbit, E Vincent, MD Plumbley (2009)Extension of Sparse, Adaptive Signal Decompositions to Semi-blind Audio Source Separation, In: T Adali, C Jutten, JMT Romano, AK Barros (eds.), INDEPENDENT COMPONENT ANALYSIS AND SIGNAL SEPARATION, PROCEEDINGS5441pp. 605-612

MEP Davies, PM Brossier, MD Plumbley (2005)Beat tracking towards automatic musical accompaniment, In: Audio Engineering Society - 118th Convention Spring Preprints 20052pp. 751-757

In this paper we address the issue of causal rhythmic analysis, primarily towards predicting the locations of musical beats such that they are consistent with a musical audio input. This will be a key component required for a system capable of automatic accompaniment with a live musician. We are implementing our approach as part of the aubio real-time audio library. While performance for this causal system is reduced in comparison to our previous non-causal system, it is still suitable for our intended purpose.

T Murray-Browne, D Mainstone, N Bryan-Kinns, MD Plumbley (2013)The serendiptichord: Reflections on the collaborative design process between artist and researcher, In: Leonardo46(1)pp. 86-87

DOI: 10.1162/LEON_a_00494

The Serendiptichord is a wearable instrument, resulting from a collaboration crossing fashion, technology, music and dance. This paper reflects on the collaborative process and how defining both creative and research roles for each party led to a successful creative partnership built on mutual respect and open communication. After a brief snapshot of the instrument in performance, the instrument is considered within the context of dance-driven interactive music systems followed by a discussion on the nature of the collaboration and its impact upon the design process and final piece. © 2013 ISAST.

D Stowell, MD Plumbley (2013)Segregating Event Streams and Noise with a Markov Renewal Process Model, In: Journal of Machine Learning Research14pp. 2213-2238

We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via synthetic experiments as well as an experiment to track a mixture of singing birds. Source code is available.

Lucas Rencker, Wenwu Wang, Mark Plumbley (2017)Multivariate Iterative Hard Thresholding for sparse decomposition with flexible sparsity patterns, In: Proceedings of the European Signal Processing Conference (EUSIPCO) 2017pp. 2220-2224 EUSIPCO

DOI: 10.23919/EUSIPCO.2017.8081591

We address the problem of decomposing several consecutive sparse signals, such as audio time frames or image patches. A typical approach is to process each signal sequentially and independently, with an arbitrary sparsity level fixed for each signal. Here, we propose to process several frames simultaneously, allowing for more flexible sparsity patterns to be considered. We propose a multivariate sparse coding approach, where sparsity is enforced on average across several frames. We propose a Multivariate Iterative Hard Thresholding algorithm to solve this problem. The usefulness of the proposed approach is demonstrated on audio coding and denoising tasks. Experiments show that the proposed approach leads to better results when the signal contains both transients and tonal components.
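
Enforcing sparsity on average across frames can be sketched as an iterative hard thresholding loop in which the keep-the-largest step is applied to the whole coefficient matrix rather than to each column. This is a simplified illustration under stated assumptions: the function names, the fixed step size, and the toy identity dictionary are hypothetical, and the paper's algorithm may differ in its details.

```python
import numpy as np


def keep_largest(x, count):
    """Zero all but the `count` largest-magnitude entries of x (any shape)."""
    flat = x.ravel()
    out = np.zeros(flat.shape)
    if count > 0:
        idx = np.argpartition(np.abs(flat), -count)[-count:]
        out[idx] = flat[idx]
    return out.reshape(x.shape)


def multivariate_iht(y, d, avg_sparsity, step=1.0, iters=50):
    """Sparse-code all frames of y jointly over dictionary d.

    y: (n_features, n_frames), d: (n_features, n_atoms).
    The total budget is avg_sparsity * n_frames non-zeros, shared by all frames.
    """
    x = np.zeros((d.shape[1], y.shape[1]))
    budget = int(avg_sparsity * y.shape[1])
    for _ in range(iters):
        # Gradient step on the squared error, then joint hard thresholding.
        x = keep_largest(x + step * d.T @ (y - d @ x), budget)
    return x


# Toy example: frame 0 is energetic, frame 1 is nearly silent.
y = np.array([[5.0, 0.1], [4.0, 0.2], [3.0, 0.3]])
x = multivariate_iht(y, np.eye(3), avg_sparsity=1)
```

With an average of one coefficient per frame, both non-zeros go to the energetic frame and the quiet frame receives none, which a fixed per-frame sparsity level would not allow.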

D Stowell, MD Plumbley (2014)Large-scale analysis of frequency modulation in birdsong data bases, In: Methods in Ecology and Evolution5(9)pp. 901-912

DOI: 10.1111/2041-210X.12223

Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment and also that there are specific social uses of FM such as trills in aggressive territorial encounters. Yet temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction. Hence, it is important to consider high-resolution signal processing techniques for analysis of FM in bird vocalizations. If such methods can be applied at big data scales, this offers a further advantage as large data sets become available. We introduce methods from the signal processing literature which can go beyond spectrogram representations to analyse the fine modulations present in a signal at very short time-scales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. We evaluate this through a feature selection technique and an automatic classification experiment. In order to find tools useful in practical analysis of large data bases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression. We find three methods which can robustly represent species-correlated FM attributes and can be applied to large data sets, and that the simplest method tested also appears to perform the best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information. FM analysis can extract information useful for bioacoustic studies, in addition to measures more commonly used to characterize vocalizations. Further, it can be applied efficiently across very large data sets and archives.

I Damnjanovic, MEP Davies, MD Plumbley (2010)SMALLbox - An Evaluation Framework for Sparse Representations and Dictionary Learning Algorithms, In: V Vigneron, V Zarzoso, E Moreau, R Gribonval, E Vincent (eds.), LATENT VARIABLE ANALYSIS AND SIGNAL SEPARATION6365pp. 418-425

DOI: 10.1007/978-3-642-15995-4_52

Z Qin, Y Gao, MD Plumbley, CG Parini, LG Cuthbert (2013)Low-rank Matrix Completion based Malicious User Detection in Cooperative Spectrum Sensing, In: 2013 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP)pp. 1186-1189 IEEE

DOI: 10.1109/GlobalSIP.2013.6737119

In a cognitive radio (CR) system, cooperative spectrum sensing (CSS) is the key to improving sensing performance in deep fading channels. In CSS networks, signals received at the secondary users (SUs) are sent to a fusion center to make a final decision of the spectrum occupancy. In this process, the presence of malicious users sending false sensing samples can severely degrade the performance of the CSS network. In this paper, with the compressive sensing (CS) technique being implemented at each SU, we build a CSS network with double sparsity property. A new malicious user detection scheme is proposed by utilizing the adaptive outlier pursuit (AOP) based low-rank matrix completion in the CSS network. In the proposed scheme, the malicious users are removed in the process of signal recovery at the fusion center. The numerical analysis of the proposed scheme is carried out and compared with an existing malicious user detection algorithm.

C Cannam, LA Figueira, MD Plumbley (2012)SOUND SOFTWARE: TOWARDS SOFTWARE REUSE IN AUDIO AND MUSIC RESEARCH, In: 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 2745-2748

MD PLUMBLEY (1993)A HEBBIAN ANTI-HEBBIAN NETWORK WHICH OPTIMIZES INFORMATION CAPACITY BY ORTHONORMALIZING THE PRINCIPAL SUBSPACE, In: THIRD INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS372pp. 86-90

R Badeau, MD Plumbley (2013)PROBABILISTIC TIME-FREQUENCY SOURCE-FILTER DECOMPOSITION OF NON-STATIONARY SIGNALS, In: 2013 PROCEEDINGS OF THE 21ST EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO)pp. ?-?

AJR Simpson, G Roma, Emad M Grais, Russell Mason, Christopher Hummersone, Mark Plumbley (2017)Psychophysical Evaluation of Audio Source Separation Methods, In: LNCS: Latent Variable Analysis and Signal Separation10169pp. 211-221 Springer

DOI: 10.1007/978-3-319-53547-0_21

Source separation evaluation is typically a top-down process, starting with perceptual measures which capture fitness-for-purpose and followed by attempts to find physical (objective) measures that are predictive of the perceptual measures. In this paper, we take a contrasting bottom-up approach. We begin with the physical measures provided by the Blind Source Separation Evaluation Toolkit (BSS Eval) and we then look for corresponding perceptual correlates. This approach is known as psychophysics and has the distinct advantage of leading to interpretable, psychophysical models. We obtained perceptual similarity judgments from listeners in two experiments featuring vocal sources within musical mixtures. In the first experiment, listeners compared the overall quality of vocal signals estimated from musical mixtures using a range of competing source separation methods. In a loudness experiment, listeners compared the loudness balance of the competing musical accompaniment and vocal. Our preliminary results provide provisional validation of the psychophysical approach.

Arshdeep Singh, Mark D. Plumbley (2022)Low-complexity CNNs for acoustic scene classification

This paper presents a low-complexity framework for acoustic scene classification (ASC). Most of the frameworks designed for ASC use convolutional neural networks (CNNs) due to their learning ability and improved performance compared to hand-engineered features. However, CNNs are resource hungry due to their large size and high computational complexity. Therefore, CNNs are difficult to deploy on resource constrained devices. This paper addresses the problem of reducing the computational complexity and memory requirement in CNNs. We propose a low-complexity CNN architecture, and apply pruning and quantization to further reduce the parameters and memory. We then propose an ensemble framework that combines various low-complexity CNNs to improve the overall performance. An experimental evaluation of the proposed framework is performed on the publicly available DCASE 2022 Task 1 that focuses on ASC. The proposed ensemble framework has approximately 60K parameters, requires 19M multiply-accumulate operations and improves the performance by approximately 2-4 percentage points compared to the DCASE 2022 Task 1 baseline network.

AR Simpson, G Roma, M Plumbley (2015)Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network, In: E Vincent, A Yeredor, Z Koldovský, P Tichavský (eds.), Lecture Notes in Computer Science9237pp. 429-436

DOI: 10.1007/978-3-319-22482-4_50

Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNN) have been used to estimate 'ideal' binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a convolutional DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.
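
For readers unfamiliar with the ideal binary mask mentioned above, the sketch below computes one from known source magnitudes and applies it to a mixture. It only illustrates the training target, not the network: the function names and the tiny 2x3 magnitude "spectrograms" are hypothetical, and the mask is "ideal" precisely because it needs the separate sources, which are not available at test time.

```python
def ideal_binary_mask(vocal_mag, accomp_mag):
    """Return 1 where the vocal magnitude exceeds the accompaniment, else 0."""
    return [
        [1 if v > a else 0 for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(vocal_mag, accomp_mag)
    ]


def apply_mask(mask, mixture_mag):
    """Keep the time-frequency cells of the mixture that the mask selects."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_mag)
    ]


# Hypothetical magnitudes: rows are frequency bins, columns are time frames.
vocal = [[3, 0, 1], [0, 2, 5]]
accomp = [[1, 1, 1], [4, 1, 0]]
mixture = [[v + a for v, a in zip(vr, ar)] for vr, ar in zip(vocal, accomp)]

mask = ideal_binary_mask(vocal, accomp)
vocal_estimate = apply_mask(mask, mixture)
```

A network trained to give probabilistic estimates of this mask would output values between 0 and 1 for each cell, which can then be thresholded or used directly as a soft mask.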

Jingshu Zhang, Mark D. Plumbley, Wenwu Wang (2021)Weighted Magnitude-Phase Loss for Speech Dereverberation

DOI: 10.1109/ICASSP39728.2021.9414929

In real rooms, recorded speech usually contains reverberation, which degrades the quality and intelligibility of the speech. It has proven effective to use neural networks to estimate complex ideal ratio masks (cIRMs) using mean square error (MSE) loss for speech dereverberation. However, in some cases, when using MSE loss to estimate complex-valued masks, phase may have a disproportionate effect compared to magnitude. We propose a new weighted magnitude-phase loss function, which is divided into a magnitude component and a phase component, to train a neural network to estimate complex ideal ratio masks. A weight parameter is introduced to adjust the relative contribution of magnitude and phase to the overall loss. We find that our proposed loss function outperforms the regular MSE loss function for speech dereverberation.
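
One way to picture a loss that is divided into a magnitude component and a phase component with an adjustable weight is sketched below. The abstract does not give the exact formula, so everything specific here is an assumption: the convex combination with weight `alpha`, the squared magnitude error, the `1 - cos` phase term, and the function name.

```python
import cmath
import math


def weighted_mag_phase_loss(estimate, target, alpha=0.5):
    """Mean of alpha * magnitude error + (1 - alpha) * phase error.

    estimate, target: equal-length sequences of complex mask values.
    alpha (assumed form): 1.0 uses magnitude only, 0.0 uses phase only.
    """
    total = 0.0
    for e, t in zip(estimate, target):
        mag_err = (abs(e) - abs(t)) ** 2
        # 1 - cos(phase difference): 0 for matching phase, 2 for opposite phase.
        phase_err = 1.0 - math.cos(cmath.phase(e) - cmath.phase(t))
        total += alpha * mag_err + (1.0 - alpha) * phase_err
    return total / len(estimate)


# Hypothetical complex mask values for two time-frequency cells.
target_mask = [1 + 1j, 2j]
estimated_mask = [1.2 + 0.8j, 0.1 + 1.9j]
loss = weighted_mag_phase_loss(estimated_mask, target_mask, alpha=0.7)
```

Separating the two terms makes the trade-off explicit, whereas a plain mean square error on the real and imaginary parts mixes magnitude and phase errors in a fixed proportion.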

Cian O'Brien, Mark Plumbley (2017)Automatic Music Transcription Using Low Rank Non-Negative Matrix Decomposition, In: EUSIPCO 2017 Proceedings

DOI: 10.23919/EUSIPCO.2017.8081529

Automatic Music Transcription (AMT) is concerned with the problem of producing the pitch content of a piece of music given a recorded signal. Many methods rely on sparse or low rank models, where the observed magnitude spectra are represented as a linear combination of dictionary atoms corresponding to individual pitches. Some of the most successful approaches use Non-negative Matrix Decomposition (NMD) or Factorization (NMF), which can be used to learn a dictionary and pitch activation matrix from a given signal. Here we introduce a further refinement of NMD in which we assume the transcription itself is approximately low rank. The intuition behind this approach is that the total number of distinct activation patterns should be relatively small since the pitch content between adjacent frames should be similar. A rank penalty is introduced into the NMD objective function and solved using an iterative algorithm based on Singular Value thresholding. We find that the low rank assumption leads to a significant increase in performance compared to NMD using β-divergence on a standard AMT dataset.
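
The singular value thresholding step referred to above can be sketched on its own. This shows only the generic operator, which shrinks the singular values of a matrix and so reduces its rank. How it is combined with the NMD updates is not described in the abstract and is not shown here; the function name, the threshold `tau`, and the toy activation matrix are assumptions.

```python
import numpy as np


def singular_value_threshold(h, tau):
    """Soft-threshold the singular values of h by tau."""
    u, s, vt = np.linalg.svd(h, full_matrices=False)
    # Scale each left singular vector by its shrunken singular value.
    return (u * np.maximum(s - tau, 0.0)) @ vt


# Hypothetical activation matrix: one repeated pattern plus small noise.
rng = np.random.default_rng(0)
clean = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0])
noisy = clean + 0.01 * rng.standard_normal(clean.shape)
low_rank = singular_value_threshold(noisy, tau=0.5)
```

The output of this operator is not guaranteed to be non-negative, so a full transcription algorithm would still need a step that restores non-negativity.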

Iwona Sobieraj, Qiuqiang Kong, Mark Plumbley (2017)Masked Non-negative Matrix Factorization for Bird Detection Using Weakly Labeled Data, In: EUSIPCO 2017 Proceedings IEEE

DOI: 10.23919/EUSIPCO.2017.8081513

Acoustic monitoring of bird species is an increasingly important field in signal processing. Many available bird sound datasets do not contain the exact timestamp of the bird call but have a coarse weak label instead. Traditional Non-negative Matrix Factorization (NMF) models are not well designed to deal with weakly labeled data. In this paper we propose a novel Masked Non-negative Matrix Factorization (Masked NMF) approach for bird detection using weakly labeled data. During dictionary extraction we introduce a binary mask on the activation matrix. In that way we are able to control which parts of the dictionary are used to reconstruct the training data. We compare our method with conventional NMF approaches and current state-of-the-art methods. The proposed method outperforms the NMF baseline and offers a parsimonious model for bird detection on weakly labeled data. Moreover, to our knowledge, the proposed Masked NMF achieved the best result among non-deep learning methods on a test dataset used for the recent Bird Audio Detection Challenge.
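
The effect of a binary mask on the activation matrix can be sketched with standard multiplicative NMF updates. This is an illustration under stated assumptions rather than the paper's method: the Euclidean cost, the function name `masked_nmf`, the way the mask is built from labels, and the toy spectrogram are all hypothetical.

```python
import numpy as np


def masked_nmf(v, mask, n_iter=200, seed=0, eps=1e-9):
    """Factorise v ~= w @ h with h forced to zero wherever mask is zero.

    v:    (n_freq, n_frames) non-negative spectrogram
    mask: (n_atoms, n_frames) binary matrix; 0 forbids an atom in a frame
    """
    rng = np.random.default_rng(seed)
    w = rng.random((v.shape[0], mask.shape[0])) + eps
    h = (rng.random(mask.shape) + eps) * mask
    for _ in range(n_iter):
        # Multiplicative updates for the Euclidean cost keep zeros at zero,
        # so masked activations never become active.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h


# Toy data: a background spectrum in every frame, a "bird" spectrum in frames 3-5.
background = np.array([1.0, 1.0, 0.0, 0.0])
bird = np.array([0.0, 0.0, 1.0, 2.0])
bird_present = np.array([0, 0, 0, 1, 1, 1])
v = np.outer(background, np.ones(6)) + np.outer(bird, bird_present)

# Atom 0 may be used anywhere; atom 1 only in frames labelled as containing a bird.
mask = np.vstack([np.ones(6), bird_present]).astype(float)
w, h = masked_nmf(v, mask)
```

Restricting atom 1 to the positively labelled frames encourages it to capture what is specific to those frames, while atom 0 has to explain the frames where no bird is labelled.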

MD Plumbley, F Fallside (1989)Sensory adaptation: An information-theoretic viewpoint, In: IJCNN Int Jt Conf Neural Networkpp. 598-?

Summary form only given. The authors examine the goals of early stages of a perceptual system, before the signal reaches the cortex, and describe its operation in information-theoretic terms. The effects of receptor adaptation, lateral inhibition, and decorrelation can all be seen as part of an optimization of information throughput, given that available resources such as average power and maximum firing rates are limited. The authors suggest a modification to Gabor functions which improves their performance as band-pass filters.

Zhao Ren, Jing Han, Nicholas Cummins, Qiuqiang Kong, Mark Plumbley, Björn W. Schuller (2019)Multi-instance Learning for Bipolar Disorder Diagnosis using Weakly Labelled Speech Data, In: Proceedings of DPH 2019: 9th International Digital Public Health Conference 2019 Association for Computing Machinery (ACM)

DOI: 10.1145/3357729.3357743

While deep learning is undoubtedly the predominant learning technique across speech processing, it is still not widely used in health-based applications. The corpora available for health-style recognition problems are often small, both concerning the total amount of data available and the number of individuals present. The Bipolar Disorder corpus, used in the 2018 Audio/Visual Emotion Challenge, contains only 218 audio samples from 46 individuals. Herein, we present a multi-instance learning framework aimed at constructing more reliable deep learning-based models in such conditions. First, we segment the speech files into multiple chunks. However, the problem is that each of the individual chunks is weakly labelled, as they are annotated with the label of the corresponding speech file, but may not be indicative of that label. We then train the deep learning-based (ensemble) multi-instance learning model, aiming at solving such a weakly labelled problem. The presented results demonstrate that this approach can improve the accuracy of feedforward, recurrent, and convolutional neural nets on the 3-class mania classification tasks undertaken on the Bipolar Disorder corpus.
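
The weak-labelling setup described above can be sketched in a few lines: a recording is cut into chunks that each inherit the file-level label, and chunk-level predictions are pooled back into one decision per file. The chunk length, the mean pooling rule, the function names, and the probabilities are assumptions for illustration; the paper's model and aggregation may differ.

```python
def split_into_chunks(samples, chunk_len):
    """Cut a recording into non-overlapping chunks, dropping any short remainder."""
    return [
        samples[i:i + chunk_len]
        for i in range(0, len(samples) - chunk_len + 1, chunk_len)
    ]


def bag_prediction(chunk_probs):
    """Pool per-chunk class probabilities by their mean and pick the best class."""
    n_classes = len(chunk_probs[0])
    mean = [
        sum(p[c] for p in chunk_probs) / len(chunk_probs)
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean


# A hypothetical 10-sample "recording" with file-level label 2.
chunks = split_into_chunks(list(range(10)), chunk_len=4)
weak_labels = [2] * len(chunks)  # each chunk inherits the file label

# Hypothetical 3-class probabilities that a chunk-level model might output.
probs = [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]
predicted_class, mean_probs = bag_prediction(probs)
```

Pooling lets a single unrepresentative chunk, such as the first one here, be outvoted by the others.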

Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark Plumbley, Volkan Kilic, Wenwu Wang (2022)Automated Audio Captioning via Fusion of Low- and High- Dimensional Features, In: JOURNAL OF LATEX CLASS FILES14(8) Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2210.05037

Automated audio captioning (AAC) aims to describe the content of an audio clip using simple sentences. Existing AAC methods are based on an encoder-decoder architecture whose success is attributed to the use of a pre-trained CNN10 called PANNs as the encoder to learn rich audio representations. AAC is a highly challenging task because its high-dimensional latent space involves audio of various scenarios. Existing methods only use the high-dimensional representation of the PANNs as the input of the decoder. However, the low-dimensional representation, which may retain as much audio information as the high-dimensional representation, may be neglected. In addition, the high-dimensional approach predicts audio captions by learning from existing audio captions, which lacks robustness and efficiency. To deal with these challenges, this paper proposes a new encoder-decoder framework for AAC, called the Low- and High-Dimensional Feature Fusion (LHDFF) model, which integrates low- and high-dimensional features. Moreover, in LHDFF, a new PANNs encoder called Residual PANNs (RPANNs) is proposed, which fuses the low-dimensional feature from the intermediate convolution layer output and the high-dimensional feature from the final layer output of PANNs. To fully explore the information of the low- and high-dimensional fusion feature and the high-dimensional feature respectively, we propose dual transformer decoder structures to generate the captions in parallel. In particular, a probabilistic fusion approach is proposed that can ensure the overall performance of the system is improved by concentrating on the respective advantages of the two transformer decoders. Experimental results show that LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models.

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang (2024)Towards Generating Diverse Audio Captions Via Adversarial Training, In: IEEE/ACM transactions on audio, speech, and language processing32pp. 3311-3323 IEEE

DOI: 10.1109/TASLP.2024.3416686

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe an audio clip diversely from various aspects using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip, or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions, and the hybrid discriminators assess the generated captions according to different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity compared to state-of-the-art methods.

MD Plumbley (2013)Hearing the shape of a room, In: PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA110(30)pp. 12162-12163 NATL ACAD SCIENCES

DOI: 10.1073/pnas.1309932110

Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang (2021)Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learningpp. 25-28

DOI: 10.1109/MLSP52302.2021.9596430

Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared to the generation of those domain-specific sounds, generating general sounds (such as sirens and gunshots) has received less attention, despite their wide applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is potentially limited in capturing long-range dependencies within sounds as it only back-propagates through a limited number of samples. In this work, we propose a method for generating sounds via neural discrete time-frequency representation learning, conditioned on sound classes. This offers an advantage in efficiently modelling long-range dependencies and retaining local fine-grained structures within sound clips. We evaluate our approach on the UrbanSound8K dataset, compared to SampleRNN, with the performance metrics measuring the quality and diversity of generated sounds. Experimental results show that our method offers comparable performance in quality and significantly better performance in diversity.

MEP Davies, MD Plumbley, D Eck (2009)Towards a musical beat emphasis function., In: WASPAA pp. 61-64

DOI: 10.1109/ASPAA.2009.5346462

Cian O’Brien, Mark Plumbley (2018)Inexact proximal operators for ℓp-quasinorm minimization, In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)pp. 4724-4728 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2018.8462524

Proximal methods are an important tool in signal processing applications, where many problems can be characterized by the minimization of an expression involving a smooth fitting term and a convex regularization term – for example the classic ℓ1-Lasso. Such problems can be solved using the relevant proximal operator. Here we consider the use of proximal operators for the ℓp-quasinorm where 0 ≤ p ≤ 1. Rather than seek a closed form solution, we develop an iterative algorithm using a Majorization-Minimization procedure which results in an inexact operator. Experiments on image denoising show that for p ≤ 1 the algorithm is effective in the high-noise scenario, outperforming the Lasso despite the inexactness of the proximal step.
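
To illustrate what an inexact, iteratively computed proximal step can look like, the sketch below handles the scalar problem of minimising 0.5*(x - y)^2 + lam*|x|^p for 0 < p <= 1. It uses a generic majorization-minimization scheme (a tangent-line bound on the concave penalty, giving repeated soft thresholding); this is an assumption chosen for illustration and is not claimed to be the algorithm of the paper. The function names and the `eps` safeguard are hypothetical.

```python
def soft_threshold(y, t):
    """Proximal operator of t*|x|: shrink y towards zero by t."""
    if y > t:
        return y - t
    if y < -t:
        return y + t
    return 0.0


def inexact_lp_prox(y, lam, p, n_iter=50, eps=1e-8):
    """Approximately minimise 0.5*(x - y)**2 + lam*|x|**p for 0 < p <= 1.

    Each step majorises |x|**p by its tangent in |x| at the current iterate
    (valid because t**p is concave for t >= 0). The surrogate is a weighted
    l1 problem, which soft thresholding solves exactly.
    """
    x = float(y)
    for _ in range(n_iter):
        # Slope of the tangent; eps avoids division by zero when x == 0.
        weight = p * (abs(x) + eps) ** (p - 1.0)
        x = soft_threshold(y, lam * weight)
    return x
```

With p = 1 the weight is always 1 and the routine reduces to ordinary soft thresholding, i.e. the Lasso proximal step. For p < 1, large inputs are shrunk less than under the Lasso while small inputs are still set to zero.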

Qiuqiang Kong, Yong Xu, Turab Iqbal, Yin Cao, Wenwu Wang, Mark D. Plumbley (2019)Acoustic scene generation with conditional SampleRNN, In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2019.8683727

Acoustic scene generation (ASG) is a task to generate waveforms for acoustic scenes. ASG can be used to generate audio scenes for movies and computer games. Recently, neural networks such as SampleRNN have been used for speech and music generation. However, ASG is more challenging due to its wide variety. In addition, evaluating a generative model is also difficult. In this paper, we propose to use a conditional SampleRNN model to generate acoustic scenes conditioned on the input classes. We also propose objective criteria to evaluate the quality and diversity of the generated samples based on classification accuracy. The experiments on the DCASE 2016 Task 1 acoustic scene data show that with the generated audio samples, a classification accuracy of 65.5% can be achieved, compared to 6.7% for samples generated by a random model and 83.1% for samples from real recordings. A classifier trained only on generated samples achieves an accuracy of 51.3%, as opposed to an accuracy of 6.7% with samples generated by a random model.

Annamaria Mesaros, Toni Heittola, Tuomas Virtanen, Mark D Plumbley (2021)Sound Event Detection: A Tutorial, In: IEEE Signal Processing Magazine38(5)pp. 67-83 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/MSP.2021.3090678

Imagine standing on a street corner in the city. With your eyes closed, you can hear and recognize a succession of sounds: cars passing by, people speaking, their footsteps when they walk by, and the continuously falling rain. Recognition of all these sounds and interpretation of the perceived scene as a city street soundscape comes naturally to humans. It is, however, the result of years of “training”: encountering and learning associations between the vast variety of sounds in everyday life, the sources producing these sounds, and the names given to them. Our everyday environment consists of many sound sources that create a complex mixture audio signal. Human auditory perception is highly specialized in segregating the sound sources and directing attention to the sound source of interest. This phenomenon is called cocktail party effect, as an analogy to being able to focus on a single conversation in a noisy room. Perception groups the spectro-temporal information in acoustic signals into auditory objects such that sounds or groups of sounds are perceived as a coherent whole [1]. This determines for example a complex sequence of sounds to be perceived as a single sound event instance, be it “bird singing” or “footsteps”. The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening. In practice, the goal is to recognize at what temporal instances different sounds are active within an audio signal.

Emad M Grais, G Roma, AJR Simpson, Mark Plumbley (2017)Discriminative Enhancement for Single Channel Audio Source Separation using Deep Neural Networks, In: LNCS10169pp. 236-246

DOI: 10.1007/978-3-319-53547-0_23

The sources separated by most single channel audio source separation techniques are usually distorted and each separated source contains residual signals from the other sources. To tackle this problem, we propose to enhance the separated sources to decrease the distortion and interference between the separated sources using deep neural networks (DNNs). Two different DNNs are used in this work. The first DNN is used to separate the sources from the mixed signal. The second DNN is used to enhance the separated signals. To consider the interactions between the separated sources, we propose to use a single DNN to enhance all the separated sources together. To reduce the residual signals of one source from the other separated sources (interference), we train the DNN for enhancement discriminatively to maximize the dissimilarity between the predicted sources. The experimental results show that using discriminative enhancement decreases the distortion and interference between the separated sources.

P Foster, A Klapuri, MD Plumbley (2011)Causal Prediction of Continuous-Valued Music Features., In: A Klapuri, C Leider (eds.), ISMIRpp. 501-506

SA Abdallah, MD Plumbley (2012)Predictive Information Rate in Discrete-time Gaussian Processes arXiv

We derive expressions for the predictive information rate (PIR) for the class of autoregressive Gaussian processes AR(N), both in terms of the prediction coefficients and in terms of the power spectral density. The latter result suggests a duality between the PIR and the multi-information rate for processes with mutually inverse power spectra (i.e. with poles and zeros of the transfer function exchanged). We investigate the behaviour of the PIR in relation to the multi-information rate for some simple examples, which suggest, somewhat counter-intuitively, that the PIR is maximised for very 'smooth' AR processes whose power spectra have multiple poles at zero frequency. We also obtain results for moving average Gaussian processes which are consistent with the duality conjectured earlier. One consequence of this is that the PIR is unbounded for MA(N) processes.

Alfredo Zermini, Qiuqiang Kong, Yong Xu, Mark D. Plumbley, Wenwu Wang (2018)Improving reverberant speech separation with binaural cues using temporal context and convolutional neural networks, In: Y Deville, S Gannot, R Mason, Mark Plumbley, D Ward (eds.), Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science10891pp. 361-371 Springer

DOI: 10.1007/978-3-319-93764-9_34

Given binaural features as input, such as interaural level difference and interaural phase difference, Deep Neural Networks (DNNs) have been recently used to localize sound sources in a mixture of speech signals and/or noise, and to create time-frequency masks for the estimation of the sound sources in reverberant rooms. Here, we explore a more advanced system, where feed-forward DNNs are replaced by Convolutional Neural Networks (CNNs). In addition, the adjacent frames of each time frame (occurring before and after this frame) are used to exploit contextual information, thus improving the localization and separation for each source. The quality of the separation results is evaluated in terms of Signal to Distortion Ratio (SDR).

Y Xu, Q Huang, W Wang, PJB Jackson, MD Plumbley (2016)Fully DNN-based Multi-label regression for audio tagging, In: T Virtanen, A Mesaros, A Heittola, MD Plumbley, P Foster, E Benetos, M Lagrange (eds.), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)pp. 110-114

Acoustic event detection for content analysis in most cases relies on lots of labeled data. However, manually annotating data is a time-consuming task, which thus makes few annotated resources available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable to some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, the whole or almost whole frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can well map the audio features sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as the Dropout and background noise aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could well utilize the long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of DCASE 2016 challenge.

Emad M. Grais, Hagen Wierstorf, Dominic Ward, Russell Mason, Mark Plumbley (2019)Referenceless Performance Evaluation of Audio Source Separation using Deep Neural Networks, In: Proceedings 2019 27th European Signal Processing Conference (EUSIPCO) IEEE

DOI: 10.23919/EUSIPCO.2019.8902932

Current performance evaluation for audio source separation depends on comparing the processed or separated signals with reference signals. Therefore, common performance evaluation toolkits are not applicable to real-world situations where the ground truth audio is unavailable. In this paper, we propose a performance evaluation technique that does not require reference signals in order to assess separation quality. The proposed technique uses a deep neural network (DNN) to map the processed audio to its quality score. Our experimental results show that the DNN is capable of predicting the sources-to-artifacts ratio from the blind source separation evaluation toolkit [1] for singing-voice separation without the need for reference signals.

E Vincent, MD Plumbley (2006)Single-channel mixture decomposition using Bayesian harmonic models, In: J Rosca, D Erdogmus, JC Principe, S Haykin (eds.), INDEPENDENT COMPONENT ANALYSIS AND BLIND SIGNAL SEPARATION, PROCEEDINGS3889pp. 722-730

Francesco Renna, Alex Gaudio, Sandra Mattos, Mark D. Plumbley, Miguel Tavares Coimbra (2024)Separation of the Aortic and Pulmonary Components of the Second Heart Sound via Alternating Optimization, In: IEEE Access12pp. 34632-34643 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ACCESS.2024.3371510

An algorithm for blind source separation (BSS) of the second heart sound (S2) into aortic and pulmonary components is proposed. It recovers aortic (A2) and pulmonary (P2) waveforms, as well as their relative delays, by solving an alternating optimization problem on the set of S2 sounds, without the use of auxiliary ECG or respiration phase measurement data. This unsupervised and data-driven approach assumes that the A2 and P2 components maintain the same waveform across heartbeats and that the relative delay between onset of the components varies according to respiration phase. The proposed approach is applied to synthetic heart sounds and to real-world heart sounds from 43 patients. It improves over two state-of-the-art BSS approaches by 10% normalized root mean-squared error in the reconstruction of aortic and pulmonary components using synthetic heart sounds, demonstrates robustness to noise, and recovery of splitting delays. The detection of pulmonary hypertension (PH) in a Brazilian population is demonstrated by training a classifier on three scalar features from the recovered A2 and P2 waveforms, and this yields an auROC of 0.76.

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark Plumbley (2017)Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging, In: IJCNN 2017 Conference Proceedings IEEE

DOI: 10.1109/IJCNN.2017.7966291

Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.

Qiuqiang Kong, Iwona Sobieraj, Wenwu Wang, Mark Plumbley (2016)Deep Neural Network Baseline for DCASE Challenge 2016, In: Proceedings of DCASE 2016

The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 81.0% using Mel + DNN against 77.2% using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 12.6% using Mel + DNN against 37.0% using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained an F value of 36.3% using Mel + DNN against 23.7% using MFCCs + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 18.9% using Mel + DNN against 20.9% using MFCCs + GMM. Therefore the DNN improves on the baseline in Tasks 1, 3, and 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.

D Stowell, S Musevic, J Bonada, MD Plumbley (2013)IMPROVED MULTIPLE BIRDSONG TRACKING WITH DISTRIBUTION DERIVATIVE METHOD AND MARKOV RENEWAL PROCESS CLUSTERING, In: 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 468-472

D Stowell, MD Plumbley (2014)An Open Dataset for Research on Audio Field Recording Archives: freefield1010, In: Proceedings of the AES 53rd International Conference: Semantic Audiopp. 80-86

We introduce a free and open dataset of 7690 audio clips sampled from the field-recording tag in the Freesound audio archive. The dataset is designed for use in research related to data mining in audio archives of field recordings / soundscapes. Audio is standardised, and audio and metadata are Creative Commons licensed. We describe the data preparation process, characterise the dataset descriptively, and illustrate its use through an auto-tagging experiment.

J Fritsch, MD Plumbley (2013)Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis, In: 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 888-891 IEEE

DOI: 10.1109/ICASSP.2013.6637776

In this paper we present a new method for musical audio source separation, using the information from the musical score to supervise the decomposition process. An original framework using nonnegative matrix factorization (NMF) is presented, where the components are initially learnt on synthetic signals with temporal and harmonic constraints. A new dataset of multitrack recordings with manually aligned MIDI scores is created (TRIOS), and we compare our separation results with other methods from the literature using the BSS EVAL and PEASS evaluation toolboxes. The results show a general improvement of the BSS EVAL metrics for the various instrumental configurations used.

Emad M Grais, G Roma, AJR Simpson, Mark Plumbley (2016)Single Channel Audio Source Separation using Deep Neural Network Ensembles, In: AES Convention Proceedings Audio Engineering Society

Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper, a combination of different DNNs’ predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.
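
For readers unfamiliar with time-frequency masking, the sketch below shows how reference binary and soft (ratio) masks are commonly computed from the magnitude spectrograms of two sources, and one simple way to combine several masks. The plain element-wise average in `average_masks` is an assumption for illustration; the abstract does not state how the DNN predictions are combined. Spectrograms are represented as nested lists (frames by frequency bins) and all function names are hypothetical.

```python
def binary_mask(target_mag, other_mag):
    """1 where the target source dominates a time-frequency bin, else 0."""
    return [[1.0 if t > o else 0.0 for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_mag, other_mag)]


def soft_mask(target_mag, other_mag, eps=1e-12):
    """Ratio mask: the target's share of the total magnitude in each bin."""
    return [[t / (t + o + eps) for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_mag, other_mag)]


def average_masks(masks):
    """Combine several masks of equal shape by element-wise averaging."""
    n = len(masks)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*masks)]


def apply_mask(mask, mixture_mag):
    """Estimate the target magnitude by scaling each mixture bin by the mask."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture_mag)]
```

A binary mask tends to suppress interference strongly at the cost of more distortion, while a soft mask does the opposite, which is the trade-off that motivates combining them.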

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark Plumbley (2017)Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging, In: Proceedings of Interspeech 2017pp. 3083-3087 ISCA

DOI: 10.21437/Interspeech.2017-486

Audio tagging aims to perform multi-label classification on audio chunks and it is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has a chunk-level label without a frame-level label. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurred acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on the Task 4 of DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set with equal error rate (EER) reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline system.
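
The idea of attention over frames can be illustrated with a small pooling function: frame-level tag probabilities are averaged with softmax weights, so frames with high attention scores dominate the chunk-level prediction, and the weights themselves give a rough indication of where the event occurs. This is a generic sketch under that assumption, not the architecture of the paper; in a real system both inputs would be produced by the network, and the names below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_probs, frame_scores):
    """Pool frame-level probabilities for one tag into a chunk-level probability.

    frame_probs  : per-frame probability that the tag is active
    frame_scores : per-frame unnormalised attention scores
    Returns the chunk-level probability and the attention weights.
    """
    weights = softmax(frame_scores)
    chunk_prob = sum(w * p for w, p in zip(weights, frame_probs))
    return chunk_prob, weights
```

With equal scores this reduces to mean pooling; with one dominant score the chunk-level output follows that frame, which is how unrelated frames can be ignored when only a chunk-level label is available for training.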

MEP Davies, MD Plumbley (2007)On the use of entropy for beat tracking evaluation, In: 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol IV, Pts 1-3pp. 1305-1308

G Evangelista, S Marchand, MD Plumbley, E Vincent (2011)Sound Source Separation, In: U Zölzer (eds.), DAFX: Digital Audio Effectspp. 551-588 John Wiley & Sons, Ltd

DOI: 10.1002/9781119991298.ch14

FL Hedayioglu, MG Jafari, SS Mattos, MD Plumbley, MT Coimbra (2011)SEPARATING SOURCES FROM SEQUENTIALLY ACQUIRED MIXTURES OF HEART SIGNALS, In: 2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSINGpp. 653-656

MD Plumbley (2003)Algorithms for nonnegative independent component analysis, In: IEEE TRANSACTIONS ON NEURAL NETWORKS14(3)pp. 534-543 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TNN.2003.810616

A Nesbit, ME Davies, MD Plumbley, MB Sandler (2006)Source extraction from two-channel mixtures by joint cosine packet analysis., In: EUSIPCOpp. 1-5

SA Abdallah, MD Plumbley (2013)Predictive Information in Gaussian Processes with Application to Music Analysis., In: F Nielsen, F Barbaresco (eds.), GSI8085pp. 650-657

DOI: 10.1007/978-3-642-40020-9_72

MD Plumbley, A Cichocki, R Bro (2010)Non-negative mixtures, In: Handbook of Blind Source Separationpp. 515-547

DOI: 10.1016/B978-0-12-374726-6.00018-7

This chapter discusses some algorithms for the use of non-negativity constraints in unmixing problems, including positive matrix factorization, nonnegative matrix factorization (NMF), and their combination with other unmixing methods such as non-negative independent component analysis and sparse non-negative matrix factorization. The 2D models can be naturally extended to multiway array (tensor) decompositions, especially non-negative tensor factorization (NTF) and non-negative Tucker decomposition (NTD). The standard NMF model has been extended in various ways, including semi-NMF, multilayer NMF, tri-NMF, orthogonal NMF, nonsmooth NMF, and convolutive NMF. While gradient descent is a simple procedure, convergence can be slow and sensitive to the step size. This can be overcome by applying multiplicative update rules, which have proved particularly popular in NMF. These rules are attractive because they are simple and do not need the selection of an update parameter, and because their multiplicative nature and the non-negative terms on the right-hand side ensure that the elements cannot become negative.
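
As a concrete instance of multiplicative update rules, the sketch below implements the widely used updates for the squared-error NMF cost, H <- H * (W^T V) / (W^T W H) and W <- W * (V H^T) / (W H H^T), in plain Python. It is offered as an illustration of the general idea rather than as the specific algorithms covered in the chapter; the helper names, the random initialisation and the small `eps` in the denominators are assumptions.

```python
import random


def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf_multiplicative(v, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise non-negative V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H); every factor is non-negative.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h
```

No step size appears anywhere, and because each update multiplies a non-negative entry by a ratio of non-negative terms, W and H stay non-negative throughout.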

(2018)Computational Analysis of Sound Scenes and Events Springer International Publishing

This book presents computational methods for extracting the useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms. The book gives an overview of methods for computational analysis of sound scenes and events, allowing those new to the field to become fully informed; covers all the aspects of the machine learning approach to computational analysis of sound scenes and events, ranging from data capture and labeling process to development of algorithms; and includes descriptions of algorithms accompanied by a website from which software implementations can be downloaded, facilitating practical interaction with the techniques.

H Fujihara, A Klapuri, MD Plumbley (2012)INSTRUMENTATION-BASED MUSIC SIMILARITY USING SPARSE REPRESENTATIONS, In: 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)pp. 433-436

Chungeun Kim, Emad M Grais, Russell Mason, Mark D Plumbley (2018)Perception of phase changes in the context of musical audio source separation, In: 145th AES Convention10031 AES

This study investigates the perceptual consequences of phase change in conventional magnitude-based source separation. A listening test was conducted, where the participants compared three different source separation scenarios, each with two phase retrieval cases: phase from the original mix or from the target source. The participants’ responses regarding their similarity to the reference showed that 1) the difference between the mix phase and the perfect target phase was perceivable in the majority of cases with some song-dependent exceptions, and 2) use of the mix phase degraded the perceived quality even in the case of perfect magnitude separation. The findings imply that there is room for perceptual improvement by attempting correct phase reconstruction, in addition to achieving better magnitude-based separation.

YIN CAO, TURAB IQBAL, QIUQIANG KONG, Fengyan An, WENWU WANG, Mark D Plumbley (2021)An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection, In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP39728.2021.9413473

Polyphonic sound event localization and detection (SELD), which jointly performs sound event detection (SED) and direction-of-arrival (DoA) estimation, detects the type and occurrence time of sound events as well as their corresponding DoA angles simultaneously. We study the SELD task from a multi-task learning perspective. Two open problems are addressed in this paper. Firstly, to detect overlapping sound events of the same type but with different DoAs, we propose to use a trackwise output format and solve the accompanying track permutation problem with permutation-invariant training. Multi-head self-attention is further used to separate tracks. Secondly, a previous finding is that, by using hard parameter-sharing, SELD suffers from a performance loss compared with learning the subtasks separately. This is solved by a soft parameter-sharing scheme. We term the proposed method as Event Independent Network V2 (EINV2), which is an improved version of our previously-proposed method and an end-to-end network for SELD. We show that our proposed EINV2 for joint SED and DoA estimation outperforms previous methods by a large margin, and has comparable performance to state-of-the-art ensemble models. Index Terms— Sound event localization and detection, direction of arrival, event-independent, permutation-invariant training, multi-task learning.
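
The track permutation problem mentioned above arises because the order of the output tracks is arbitrary. Permutation-invariant training addresses it by scoring every assignment of predicted tracks to target tracks and keeping the best one. The sketch below shows this for a generic mean-squared-error loss; the choice of loss and the function names are assumptions for illustration, not details of EINV2.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(pred_tracks, target_tracks):
    """Smallest average loss over all assignments of predictions to targets.

    Returns the loss and the permutation achieving it, where perm[i] is the
    index of the target assigned to predicted track i.
    """
    n = len(pred_tracks)
    best_loss, best_perm = None, None
    for perm in permutations(range(n)):
        loss = sum(mse(pred_tracks[i], target_tracks[j])
                   for i, j in enumerate(perm)) / n
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

The exhaustive search grows factorially with the number of tracks, which is acceptable when only a few overlapping events per frame are modelled.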

Turab Iqbal, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang (2020)Learning with Out-of-Distribution Data for Audio Classification, In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

DOI: 10.1109/ICASSP40776.2020.9054444

In supervised machine learning, the assumption that training data is labelled correctly is not always satisfied. In this paper, we investigate an instance of labelling error for classification tasks in which the dataset is corrupted with out-of-distribution (OOD) instances: data that does not belong to any of the target classes, but is labelled as such. We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method uses an auxiliary classifier, trained on data that is known to be in-distribution, for detection and relabelling. The amount of data required for this is shown to be small. Experiments are carried out on the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin. Comparisons with other noise-robust techniques are similarly encouraging.

MG Jafari, SA Abdallah, MD Plumbley, ME Davies (2006)Sparse coding for convolutive blind audio source separation, In: J Rosca, D Erdogmus, JC Principe, S Haykin (eds.), INDEPENDENT COMPONENT ANALYSIS AND BLIND SIGNAL SEPARATION, PROCEEDINGS3889pp. 132-139

Turab Iqbal, Qiuqiang Kong, Mark D Plumbley, Wenwu Wang (2018)General-purpose audio tagging from noisy labels using convolutional neural networks, In: Mark D Plumbley, Christian Kroos, JP Bello, G Richard, DPW Ellis, A Mesaros (eds.), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)pp. 212-216 Tampere University of Technology

General-purpose audio tagging refers to classifying sounds that are of a diverse nature, and is relevant in many applications where domain-specific information cannot be exploited. The DCASE 2018 challenge introduces Task 2 for this very problem. In this task, there are a large number of classes and the audio clips vary in duration. Moreover, a subset of the labels is noisy. In this paper, we propose a system to address these challenges. The basis of our system is an ensemble of convolutional neural networks trained on log-scaled mel spectrograms. We use preprocessing and data augmentation methods to improve the performance further. To reduce the effects of label noise, two techniques are proposed: loss function weighting and pseudo-labeling. Experiments on the private test set of this task show that our system achieves state-of-the-art performance with a mean average precision score of 0.951.

Dominic Ward, Hagen Wierstorf, Russell Mason, Emad M. Grais, Mark Plumbley (2018)BSS Eval or PEASS? Predicting the perception of singing-voice separation, In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)pp. 596-600 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP.2018.8462194

There is some uncertainty as to whether objective metrics for predicting the perceived quality of audio source separation are sufficiently accurate. This issue was investigated by employing a revised experimental methodology to collect subjective ratings of sound quality and interference of singing-voice recordings that have been extracted from musical mixtures using state-of-the-art audio source separation. A correlation analysis between the experimental data and the measures of two objective evaluation toolkits, BSS Eval and PEASS, was performed to assess their performance. The artifacts-related perceptual score of the PEASS toolkit had the strongest correlation with the perception of artifacts and distortions caused by singing-voice separation. Both the source-to-interference ratio of BSS Eval and the interference-related perceptual score of PEASS showed comparable correlations with the human ratings of interference.

D Giannoulis, D Barchiesi, A Klapuri, MD Plumbley (2011)On the disjointess of sources in music using different time-frequency representations, In: 2011 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA)pp. 261-264 IEEE

DOI: 10.1109/ASPAA.2011.6082321

This paper studies the disjointness of the time-frequency representations of simultaneously playing musical instruments. As a measure of disjointness, we use the approximate W-disjoint orthogonality as proposed by Yilmaz and Rickard [1], which (loosely speaking) measures the degree of overlap of different sources in the time-frequency domain. The motivation for this study is to find a maximally disjoint representation in order to facilitate the separation and recognition of musical instruments in mixture signals. The transforms investigated in this paper include the short-time Fourier transform (STFT), constant-Q transform, modified discrete cosine transform (MDCT), and pitch-synchronous lapped orthogonal transforms. Simulation results are reported for a database of polyphonic music where the multitrack data (instrument signals before mixing) were available. Absolute performance varies depending on the instrument source in question, but on the average MDCT with 93 ms frame size performed best.
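As a rough illustration of the disjointness measure discussed above, the sketch below computes an approximate W-disjoint orthogonality score for one source against the sum of the others, using a binary mask that keeps the time-frequency bins where the target dominates. The formulation (target energy kept minus interference energy leaked, divided by total target energy) is stated here as an assumption about the measure, and the function name and toy magnitudes are hypothetical.

```python
def approx_wdo(target_mag, interference_mag):
    """Approximate W-disjoint orthogonality for flattened magnitude spectrograms.

    Assumes a 0 dB binary mask: a bin is kept when the target is stronger
    than the interference in that bin.
    """
    mask = [1.0 if s > y else 0.0 for s, y in zip(target_mag, interference_mag)]
    kept_target = sum(m * s ** 2 for m, s in zip(mask, target_mag))
    leaked_interference = sum(m * y ** 2 for m, y in zip(mask, interference_mag))
    total_target = sum(s ** 2 for s in target_mag)
    return (kept_target - leaked_interference) / total_target


# Sources occupying different bins are perfectly disjoint.
disjoint_score = approx_wdo([1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0])
# Sources with identical magnitudes in every bin cannot be separated by masking.
overlap_score = approx_wdo([1.0, 1.0], [1.0, 1.0])
```

A score near 1 suggests the representation keeps the sources apart, which is the property the study looks for when comparing transforms.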

Qiuqiang Kong, Yong Xu, Wenwu Wang, PHILIP J B JACKSON, Mark D. Plumbley (2019)Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks, In: Sarit Kraus (eds.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligencepp. 2747-2753 International Joint Conferences on Artificial Intelligence

DOI: 10.24963/ijcai.2019/381

Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synthesizing-decomposition (S-D) approach to solve the single-channel separation and deconvolution problem. In synthesizing, a generative model for sources is built using a generative adversarial network (GAN). In decomposition, both mixing filters and sources are optimized to minimize the reconstruction error of the mixture. The proposed S-D approach achieves a peak signal-to-noise ratio (PSNR) of 18.9 dB and 15.4 dB in image inpainting and completion, outperforming a baseline convolutional neural network PSNR of 15.3 dB and 12.2 dB, respectively, and achieves a PSNR of 13.2 dB in source separation together with deconvolution, outperforming a convolutive non-negative matrix factorization (NMF) baseline of 10.1 dB.

MEP Davies, N Degara, MD Plumbley (2011)Measuring the Performance of Beat Tracking Algorithms Using a Beat Error Histogram, In: IEEE SIGNAL PROCESSING LETTERS18(3)pp. 157-160 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/LSP.2010.2102018

Qiuqiang Kong, Yuxuan Wang, Xuchen Song, Yin Cao, Wenwu Wang, Mark D. Plumbley (2020)Source separation with weakly labelled data: An approach to computational auditory scene analysis, In: ICASSP 2020

DOI: 10.1109/ICASSP40776.2020.9053396

Source separation is the task of separating an audio recording into individual sound sources. Source separation is fundamental for computational auditory scene analysis. Previous work on source separation has focused on separating particular sound classes such as speech and music. Much previous work requires mixtures and clean source pairs for training. In this work, we propose a source separation framework trained with weakly labelled data. Weakly labelled data only contains the tags of an audio clip, without the occurrence time of sound events. We first train a sound event detection system with AudioSet. The trained sound event detection system is used to detect segments that are most likely to contain a target sound event. Then a regression is learnt from a mixture of two randomly selected segments to a target segment conditioned on the audio tagging prediction of the target segment. Our proposed system can separate 527 kinds of sound classes from AudioSet within a single system. A U-Net is adopted for the separation system and achieves an average SDR of 5.67 dB over 527 sound classes in AudioSet.

Peipei Wu, Jinzheng Zhao, Yaru Chen, Davide Berghi, Yi Yuan, Chenfei Zhu, Yin Cao, Yang Liu, Philip J B Jackson, Mark David Plumbley, Wenwu Wang (2023)PLDISET: Probabilistic Localization and Detection of Independent Sound Events with Transformers

Sound Event Localization and Detection (SELD) is a task that involves detecting different types of sound events along with their temporal and spatial information, specifically, detecting the classes of events and estimating their corresponding directions of arrival at each frame. In practice, real-world sound scenes might be complex as they may contain multiple overlapping events. For instance, in DCASE challenges task 3, each clip may involve simultaneous occurrences of up to five events. To handle multiple overlapping sound events, current methods prefer multiple output branches to estimate each event, which increases the size of the models. Therefore, current methods are often difficult to deploy on the edge of sensor networks. In this paper, we propose a method called Probabilistic Localization and Detection of Independent Sound Events with Transformers (PLDISET), which estimates numerous events by using one output branch. The method has three stages. First, we introduce the track generation module to obtain various tracks from extracted features. Then, these tracks are fed into two transformers for sound event detection (SED) and localization, respectively. Finally, one output system, including a linear Gaussian system and regression network, is used to estimate each track. We give the evaluation results of our model on the DCASE 2023 Task 3 development dataset.

(2018)Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), In: Mark D. Plumbley, Christian Kroos, Juan P. Bello, Gaël Richard, Daniel P.W. Ellis, Annamaria Mesaros (eds.), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) Tampere University of Technology

N Degara, A Pena, MEP Davies, MD Plumbley (2010)NOTE ONSET DETECTION USING RHYTHMIC STRUCTURE, In: 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSINGpp. 5526-5529

DOI: 10.1109/ICASSP.2010.5495220

Y Nishimori, S Akaho, MD Plumbley (2006)Riemannian optimization method on generalized flag manifolds for complex and subspace ICA, In: A MohammadDjafari (eds.), Bayesian Inference and Maximum Entropy Methods in Science and Engineering872pp. 89-96

R Badeau, MD Plumbley (2014)Multichannel high-resolution NMF for modeling convolutive mixtures of non-stationary signals in the Time-Frequency domain, In: IEEE/ACM Transactions on Audio, Speech and Language Processing22(11)pp. 1670-1680

DOI: 10.1109/TASLP.2014.2341920

Several probabilistic models involving latent components have been proposed for modeling time-frequency (TF) representations of audio signals such as spectrograms, notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high-resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. The new model can represent a variety of stationary and non-stationary signals, including autoregressive moving average (ARMA) processes and mixtures of damped sinusoids. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to piano signals, and proves capable of accurately modeling reverberation, restoring missing observations, and separating pure tones with close frequencies.

A Nesbit, MG Jafari, E Vincent, MD Plumbley (2010)Audio source separation using sparse representations, In: Machine Audition: Principles, Algorithms and Systemspp. 246-264

DOI: 10.4018/978-1-61520-919-4.ch010

The authors address the problem of audio source separation, namely, the recovery of audio signals from recordings of mixtures of those signals. The sparse component analysis framework is a powerful method for achieving this. Sparse orthogonal transforms, in which only few transform coefficients differ significantly from zero, are developed; once the signal has been transformed, energy is apportioned from each transform coefficient to each estimated source, and, finally, the signal is reconstructed using the inverse transform. The overriding aim of this chapter is to demonstrate how this framework, as exemplified here by two different decomposition methods which adapt to the signal to represent it sparsely, can be used to solve different problems in different mixing scenarios. To address the instantaneous (neither delays nor echoes) and underdetermined (more sources than mixtures) mixing model, a lapped orthogonal transform is adapted to the signal by selecting a basis from a library of predetermined bases. This method is highly related to the windowing methods used in the MPEG audio coding framework. In considering the anechoic (delays but no echoes) and determined (equal number of sources and mixtures) mixing case, a greedy adaptive transform is used based on orthogonal basis functions that are learned from the observed data, instead of being selected from a predetermined library of bases. This is found to encode the signal characteristics, by introducing a feedback system between the bases and the observed data. Experiments on mixtures of speech and music signals demonstrate that these methods give good signal approximations and separation performance, and indicate promising directions for future research. © 2011, IGI Global.

G Wilson, DA Aruliah, CT Brown, NPC Hong, M Davis, RT Guy, SHD Haddock, KD Huff, IM Mitchell, MD Plumbley, B Waugh, EP White, P Wilson (2014)Best Practices for Scientific Computing, In: PLOS BIOLOGY12(1)ARTN e PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pbio.1001745

AM Stark, MD Plumbley, MEP Davies (2007)Audio effects for real-time performance using beat tracking, In: Audio Engineering Society - 122nd Audio Engineering Society Convention 20072pp. 866-872

We present a new class of digital audio effects which can automatically relate parameter values to the tempo of a musical input in real-time. Using a beat tracking system as the front end, we demonstrate a tempo-dependent delay effect and a set of beat-synchronous low frequency oscillator (LFO) effects including auto-wah, tremolo and vibrato. The effects show better performance than might be expected as they are blind to certain beat tracker errors. All effects are implemented as VST plug-ins which operate in real-time, enabling their use both in live musical performance and the off-line modification of studio recordings.
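The idea of tying an effect parameter to tempo can be shown with a small sketch that converts a tracked tempo into a delay length in samples. It assumes the beat tracker supplies a tempo in beats per minute; the function name and parameters are hypothetical and are not taken from the plug-ins described.

```python
def beat_synced_delay_samples(tempo_bpm, sample_rate, beats=1.0):
    """Return a delay length in samples equal to `beats` beats at `tempo_bpm`."""
    if tempo_bpm <= 0:
        raise ValueError("tempo_bpm must be positive")
    seconds_per_beat = 60.0 / tempo_bpm
    return round(seconds_per_beat * beats * sample_rate)


# Half a beat at 120 BPM and 44.1 kHz is a quarter of a second.
delay = beat_synced_delay_samples(120.0, 44100, beats=0.5)
```

In a real-time effect this value would be recomputed whenever the tempo estimate changes, so that the echoes stay aligned with the music.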

P Foster, S Sigtia, S Krstulovic, J Barker, MD Plumbley (2015)CHiME-home: A dataset for sound source recognition in a domestic environment., In: WASPAApp. 1-5

DOI: 10.1109/WASPAA.2015.7336899

For the task of sound source recognition, we introduce a novel data set based on 6.8 hours of domestic environment audio recordings. We describe our approach of obtaining annotations for the recordings. Further, we quantify agreement between obtained annotations. Finally, we report baseline results for sound source recognition using the obtained dataset. Our annotation approach associates each 4-second excerpt from the audio recordings with multiple labels, on a set of 7 labels associated with sound sources in the acoustic environment. With the aid of 3 human annotators, we obtain 3 sets of multi-label annotations, for 4378 4-second audio excerpts. We evaluate agreement between annotators by computing Jaccard indices between sets of label assignments. Observing varying levels of agreement across labels, with a view to obtaining a representation of ‘ground truth’ in annotations, we refine our dataset to obtain a set of multi-label annotations for 1946 audio excerpts. For the set of 1946 annotated audio excerpts, we predict binary label assignments using Gaussian mixture models estimated on MFCCs. Evaluated using the area under receiver operating characteristic curves, across considered labels we observe performance scores in the range 0.76 to 0.98.
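The annotator-agreement measure mentioned above, the Jaccard index, is the size of the intersection of two label sets divided by the size of their union. The sketch below is a minimal version; the label strings are hypothetical placeholders rather than the dataset's actual label codes, and treating two empty sets as full agreement is an assumption.

```python
def jaccard_index(labels_a, labels_b):
    """Return |A intersect B| / |A union B| for two collections of labels."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 1.0  # assumption: two empty annotations count as full agreement
    return len(a & b) / len(a | b)


# Two annotators label the same 4-second excerpt.
agreement = jaccard_index({"speech", "television"}, {"speech"})
```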

Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark Plumbley (2023)Universal Source Separation with Weakly Labelled Data, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2305.07447

Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. There are three potential challenges awaiting the solution to the audio source separation task. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources. There is a lack of research on building a unified system that can separate arbitrary sources via a single model. Second, most previous systems require clean source data to train a separator, while clean source data are scarce. Third, there is a lack of USS systems that can automatically detect and separate active sound classes at a hierarchical level. To use large-scale weakly labeled/unlabeled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labeled data as a query net; and 2) a conditional source separation model that takes query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. By solely leveraging the weakly labelled AudioSet, our USS system is successful in separating a wide variety of sound classes, including sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the MUSDB18 dataset; an SDRi of 7.28 dB on the Slakh2100 dataset; and an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at https://github.com/bytedance/uss
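For readers unfamiliar with the SDRi figures quoted above, the sketch below shows one common way to compute a signal-to-distortion ratio and its improvement over the unprocessed mixture. It assumes the simple energy-ratio definition of SDR; evaluation toolkits may use a more elaborate decomposition, and the toy signals are invented for illustration.

```python
import math


def sdr(reference, estimate):
    """Signal-to-distortion ratio in dB; assumes the estimate is not exact."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement(reference, mixture, estimate):
    """SDRi: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)


target = [1.0, 0.0, -1.0, 0.0]
mixture = [1.5, 0.5, -0.5, 0.5]    # target plus a constant interferer
estimate = [1.1, 0.1, -0.9, 0.1]   # separator output, closer to the target
sdri = sdr_improvement(target, mixture, estimate)
```

In this toy case the separator reduces the error energy by a factor of 25, which corresponds to an improvement of roughly 14 dB.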

A Nesbit, MD Plumbley (2008)Oracle estimation of adaptive cosine packet transforms for underdetermined audio source separation, In: 2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12pp. 41-44

DOI: 10.1109/ICASSP.2008.4517541

A Adler, V Emiya, MG Jafari, M Elad, R Gribonval, MD Plumbley (2011)A CONSTRAINED MATCHING PURSUIT APPROACH TO AUDIO DECLIPPING, In: 2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSINGpp. 329-332

Lucas Rencker, Francis Bach, Wenwu Wang, Mark D. Plumbley (2018)Fast Iterative Shrinkage for Signal Declipping and Dequantization, In: Proceedings of iTWIST’18 - International Traveling Workshop on Interactions between low-complexity data models and Sensing Techniques iTWIST

DOI: 10.48550/arXiv.1812.01540

We address the problem of recovering a sparse signal from clipped or quantized measurements. We show how these two problems can be formulated as minimizing the distance to a convex feasibility set, which provides a convex and differentiable cost function. We then propose a fast iterative shrinkage/thresholding algorithm that minimizes the proposed cost, which provides a fast and efficient algorithm to recover sparse signals from clipped and quantized measurements.
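The cost described above, the squared distance to a convex set of signals consistent with the clipped measurements, can be sketched as follows. For simplicity the sketch assumes an identity dictionary (the signal itself is treated as the sparse vector) and symmetric clipping at a known threshold `theta`; these simplifications and all function names are assumptions made for illustration, not the paper's implementation.

```python
import math


def project_onto_consistent_set(x, y, theta):
    """Project x onto the signals that would clip to the observation y."""
    projected = []
    for xi, yi in zip(x, y):
        if yi >= theta:        # clipped from above: true sample is >= theta
            projected.append(max(xi, theta))
        elif yi <= -theta:     # clipped from below: true sample is <= -theta
            projected.append(min(xi, -theta))
        else:                  # unclipped sample: must match the observation
            projected.append(yi)
    return projected


def distance_cost_and_gradient(x, y, theta):
    """Return 0.5 * squared distance to the set, and its gradient x - P(x)."""
    projection = project_onto_consistent_set(x, y, theta)
    residual = [xi - pi for xi, pi in zip(x, projection)]
    return 0.5 * sum(r * r for r in residual), residual


def ista_step(x, y, theta, step=1.0, lam=0.01):
    """One gradient step on the distance cost followed by soft thresholding."""
    _, grad = distance_cost_and_gradient(x, y, theta)
    moved = [xi - step * gi for xi, gi in zip(x, grad)]
    return [math.copysign(max(abs(v) - step * lam, 0.0), v) for v in moved]


y = [0.5, 1.0, -1.0]  # observation clipped at theta = 1.0
cost_consistent, _ = distance_cost_and_gradient([0.5, 1.3, -1.2], y, 1.0)
cost_violating, _ = distance_cost_and_gradient([0.5, 0.8, -1.2], y, 1.0)
```

A candidate that exceeds the threshold where the observation is clipped costs nothing, whereas one that falls below it is penalised, which is what makes the cost both convex and differentiable.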

Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang (2023)Simple Pooling Front-Ends for Efficient Audio Classification, In: Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)pp. 1-5 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ICASSP49357.2023.10096211

Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., mel-spectrogram) could be an effective approach for efficient audio classification. To do so, we proposed a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information within the mel-spectrogram. We perform extensive experiments on four audio classification tasks to evaluate the performance of SimPFs. Experimental results show that SimPFs can achieve a reduction in more than half of the number of floating point operations (FLOPs) for off-the-shelf audio neural networks, with negligible degradation or even some improvements in audio classification performance.
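A non-parametric pooling front-end of the kind described above can be sketched as mean pooling along the time axis of a mel-spectrogram before it reaches the classifier. The sketch assumes the spectrogram is a list of frames, each a list of mel-bin values, and drops any incomplete trailing block; the function name and pooling factor are illustrative only.

```python
def mean_pool_time(mel_frames, factor):
    """Average every `factor` consecutive frames, reducing the frame count."""
    pooled = []
    for start in range(0, len(mel_frames) - factor + 1, factor):
        block = mel_frames[start:start + factor]
        # zip(*block) groups the values of each mel bin across the block.
        pooled.append([sum(bin_values) / factor for bin_values in zip(*block)])
    return pooled


# Four frames with two mel bins each, pooled by a factor of two.
mel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
compressed = mean_pool_time(mel, 2)
```

Halving the number of frames roughly halves the work done by every later layer that scales with sequence length, which is the source of the FLOPs saving the abstract reports.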

MD Plumbley (1997)Communications and neural networks: Theory and practice, In: 1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - Vpp. 135-138

Tomoya Koike, Kun Qian, Qiuqiang Kong, Mark D. Plumbley, Björn W. Schuller, Yoshiharu Yamamoto (2020)Audio for Audio is Better? An Investigation on Transfer Learning Models for Heart Sound Classification, In: 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)

DOI: 10.1109/EMBC44109.2020.9175450

Cardiovascular disease is one of the leading causes of death in humans. In the past decade, heart sound classification has been increasingly studied for its feasibility as a non-invasive approach to monitoring a subject’s health status. In particular, relevant studies have benefited from the fast development of wearable devices and machine learning techniques. Nevertheless, finding and designing efficient acoustic properties from heart sounds is an expensive and time-consuming task. It is known that transfer learning methods can help extract higher representations automatically from the heart sounds without any human domain knowledge. However, most existing studies are based on models pre-trained on images, which may not fully represent the characteristics inherited from audio. To this end, we propose a novel transfer learning model pre-trained on large-scale audio data for a heart sound classification task. In this study, the PhysioNet CinC Challenge Dataset is used for evaluation. Experimental results demonstrate that our proposed pre-trained audio models can outperform other popular models pre-trained on images, achieving the highest unweighted average recall of 89.7%.

Qiuqiang Kong, Yong Xu, Iwona Sobieraj, Wenwu Wang, Mark D. Plumbley (2019)Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing27(4)pp. 777-787 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TASLP.2019.2895254

Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip. Many supervised SED algorithms rely on strongly labelled data which contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without knowing their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied on a T-F representation, such as log mel spectrogram of an audio clip to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used for separating the sound events from the background scenes in the time-frequency domain. Then a classification mapping is applied on the T-F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping using a convolutional neural network and the classification mapping using a global weighted rank pooling (GWRP). In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound events data. When mixing under 0 dB, the proposed method achieved F1 scores of 0.534, 0.398 and 0.167 in audio tagging, frame-wise SED and event-wise SED, outperforming the fully connected deep neural network baseline of 0.331, 0.237 and 0.120, respectively. In T-F segmentation, we achieved an F1 score of 0.218, where previous methods were not able to do T-F segmentation.
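Global weighted rank pooling, used above for the classification mapping, sorts the mask values in descending order and takes a weighted average in which the j-th largest value receives weight d^(j-1). The sketch below is a minimal version; the decay value `d` and the toy mask values are illustrative.

```python
def global_weighted_rank_pooling(values, d):
    """Weighted average of the descending-sorted values, weights d**rank.

    d = 0 reduces to max pooling; d = 1 reduces to average pooling.
    """
    ranked = sorted(values, reverse=True)
    weights = [d ** rank for rank in range(len(ranked))]
    return sum(w * v for w, v in zip(weights, ranked)) / sum(weights)


mask_values = [0.2, 0.9, 0.4]
as_max = global_weighted_rank_pooling(mask_values, 0.0)
as_mean = global_weighted_rank_pooling(mask_values, 1.0)
in_between = global_weighted_rank_pooling(mask_values, 0.5)
```

Intermediate values of `d` sit between the two extremes, so a clip-level probability is driven mostly by the strongest time-frequency bins without relying on a single one.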

(2007)Independent Component Analysis and Signal Separation, 7th International Conference, ICA 2007, London, UK, September 9-12, 2007.4666 Springer

Emad M. Grais, Fei Zhao, Mark D. Plumbley (2020)Multi-Band Multi-Resolution Fully Convolutional Neural Networks for Singing Voice Separation, In: 28th European Signal Processing Conference (EUSIPCO 2020)

DOI: 10.23919/Eusipco47968.2020.9287367

Deep neural networks with convolutional layers usually process the entire spectrogram of an audio signal with the same time-frequency resolutions, number of filters, and dimensionality reduction scale. According to the constant-Q transform, good features can be extracted from audio signals if the low frequency bands are processed with high frequency resolution filters and the high frequency bands with high time resolution filters. In the spectrogram of a mixture of singing voices and music signals, there is usually more information about the voice in the low frequency bands than the high frequency bands. These raise the need for processing each part of the spectrogram differently. In this paper, we propose a multi-band multi-resolution fully convolutional neural network (MBR-FCN) for singing voice separation. The MBR-FCN processes the frequency bands that have more information about the target signals with more filters and smaller dimensionality reduction scale than the bands with less information. Furthermore, the MBR-FCN processes the low frequency bands with high frequency resolution filters and the high frequency bands with high time resolution filters. Our experimental results show that the proposed MBR-FCN with very few parameters achieves better singing voice separation performance than other deep neural networks.

SA Abdallah, MD Plumbley (2004)Polyphonic music transcription by non-negative sparse coding of power spectra, In: ISMIR

SJ Welburn, MD Plumbley (2009)Estimating parameters from audio for an EG+LFO model of pitch envelopes, In: Proceedings of the 12th International Conference on Digital Audio Effects, DAFx 2009pp. 451-455

Envelope generator (EG) and Low Frequency Oscillator (LFO) parameters give a compact representation of audio pitch envelopes. By estimating these parameters from audio per-note, they could be used as part of an audio coding scheme. Recordings of various instruments and articulations were examined, and pitch envelopes found. Using an evolutionary algorithm, EG and LFO parameters for the envelopes were estimated. The resulting estimated envelopes are compared to both the original envelope, and to a fixed-pitch estimate. Envelopes estimated using EG+LFO can closely represent the envelope from the original audio and provide a more accurate estimate than the mean pitch.

AJR Simpson, G Roma, MD Plumbley (2015)Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network., In: CoRRabs/15

Jinbo Hu, Yin Cao, Ming Wu, Feiran Yang, Ziying Yu, Wenwu Wang, Mark Plumbley, Jun Yang (2023)META-SELD: Meta-Learning for Fast Adaptation to the new environment in Sound Event Localization and Detection, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2308.08847

For learning-based sound event localization and detection (SELD) methods, different acoustic environments in the training and test sets may result in large performance differences in the validation and evaluation stages. Different environments, such as different sizes of rooms, different reverberation times, and different background noise, may be reasons for a learning-based system to fail. On the other hand, acquiring annotated spatial sound event samples, which include onset and offset time stamps, class types of sound events, and direction-of-arrival (DOA) of sound sources, is very expensive. In addition, deploying a SELD system in a new environment often poses challenges due to time-consuming training and fine-tuning processes. To address these issues, we propose Meta-SELD, which applies meta-learning methods to achieve fast adaptation to new environments. More specifically, based on Model Agnostic Meta-Learning (MAML), the proposed Meta-SELD aims to find good meta-initialized parameters to adapt to new environments with only a small number of samples and parameter updating iterations. We can then quickly adapt the meta-trained SELD model to unseen environments. Our experiments compare fine-tuning methods from pre-trained SELD models with our Meta-SELD on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset. The evaluation results demonstrate the effectiveness of Meta-SELD when adapting to new environments.

K O'Hanlon, MD Plumbley (2011)STRUCTURE-AWARE DICTIONARY LEARNING WITH HARMONIC ATOMS, In: X Mestre, J Hernando, M Pardas (eds.), 19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011)pp. 1761-1765

S Li, S Dixon, DAA Black, MD Plumbley (2016)A model selection test on effective factors of the choice of expressive timing clusters for a phrase, In: Proceedings of the 13th Sound and Music Conference (SMC 2016)

We model expressive timing for a phrase in performed classical music as being dependent on two factors: the expressive timing in the previous phrase and the position of the phrase within the piece. We present a model selection test for evaluating candidate models that assert different dependencies for deciding the Cluster of Expressive Timing (CET) for a phrase. We use cross entropy and Kullback Leibler (KL) divergence to evaluate the resulting models: with these criteria we find that both the expressive timing in the previous phrase and the position of the phrase in the music score affect expressive timing in a phrase. The results show that the expressive timing in the previous phrase has a greater effect on timing choices than the position of the phrase, as the phrase position only impacts the choice of expressive timing in combination with the choice of expressive timing in the previous phrase.

SJ Welburn, MD Plumbley, E Vincent (2007)Object-Coding for Resolution-Free Musical Audio, In: Proceedings of the AES International Conference

Object-based coding of audio represents the signal as a parameter stream for a set of sound-producing objects. Encoding in this manner can provide a resolution-free representation of an audio signal. Given a robust estimation of the object-parameters and a multi-resolution synthesis engine, the signal can be "intelligently" upsampled, extending the bandwidth and getting best use out of a high-resolution signal-chain. We present some initial findings on extending bandwidth using harmonic models.

MG Jafari, MD Plumbley (2007)Convolutive blind source separation of speech signals in the low frequency bands, In: Audio Engineering Society - 123rd Audio Engineering Society Convention 20073pp. 1195-1198

Sub-band methods are often used to address the problem of convolutive blind speech separation, as they offer the computational advantage of approximating convolutions by multiplications. The computational load, however, often remains quite high, because separation is performed on several sub-bands. In this paper, we exploit the well-known fact that the high frequency content of speech signals typically conveys little information, since most of the speech power is found in frequencies up to 4 kHz, and consider separation only in frequency bands below a certain threshold. We investigate the effect of changing the threshold, and find that separation performed only in the low frequencies can lead to the recovered signals being similar in quality to those extracted from all frequencies.

E Vincent, MD Plumbley (2005)A prototype system for object coding of musical audio, In: 2005 WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA)pp. 239-242

DOI: 10.1109/ASPAA.2005.1540214

MEP Davies, MD Plumbley (2004)Causal Tempo Tracking of Audio., In: ISMIR

T Murray-Browne, M Plumbley (2014)Harmonic Motion: A Toolkit for Processing Gestural Data for Interactive Soundpp. 213-216

We introduce Harmonic Motion, a free open source toolkit for artists, musicians and designers working with gestural data. Extracting musically useful features from captured gesture data can be challenging, with projects often requiring bespoke processing techniques developed through iterations of tweaking equations involving a number of constant values – sometimes referred to as ‘magic numbers’. Harmonic Motion provides a robust interface for rapid prototyping of patches to process gestural data and a framework through which approaches may be encapsulated, reused and shared with others. In addition, we describe our design process in which both personal experience and a survey of potential users informed a set of specific goals for the software.

H Ekeus, SA Abdallah, MD Plumbley, PW McOwan (2012)The Melody Triangle: Exploring Pattern and Predictability in Music, In: Musical Metacreation: Papers from the 2012 AIIDE Workshoppp. 35-42 AAAI

The Melody Triangle is an interface for the discovery of melodic materials, where the input positions within a triangle directly map to information theoretic properties of the output. A model of human expectation and surprise in the perception of music, information dynamics, is used to 'map out' a musical generative system's parameter space. This enables a user to explore the possibilities afforded by a generative algorithm, in this case Markov chains, not by directly selecting parameters, but by specifying the subjective predictability of the output sequence. We describe some of the relevant ideas from information dynamics and how the Melody Triangle is defined in terms of these. We describe its incarnation as a screen-based performance tool and compositional aid for the generation of musical textures, the user's control at the abstract level of randomness and predictability, and some pilot studies carried out with it. We also briefly outline a multi-user installation, where collaboration in a performative setting provides a playful yet informative way to explore expectation and surprise in music, and a forthcoming mobile phone version of the Melody Triangle.
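
To illustrate what an information theoretic property of a Markov chain can look like, the following minimal sketch computes the entropy rate of a first-order chain from its transition matrix. It is an assumption of this example that entropy rate is a suitable stand-in for the measures used in the Melody Triangle; the function names and the two example matrices are hypothetical.

```python
import math

def stationary_distribution(T, iters=1000):
    """Approximate the stationary distribution of a row-stochastic matrix T
    by power iteration from the uniform distribution. This assumes the
    iteration converges for the chain in question."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_rate(T):
    """Entropy rate in bits per symbol: sum_i pi_i * H(T[i])."""
    pi = stationary_distribution(T)
    return -sum(
        pi[i] * p * math.log2(p)
        for i in range(len(T))
        for p in T[i]
        if p > 0
    )

predictable = [[0.0, 1.0], [1.0, 0.0]]  # strict alternation between two notes
random_like = [[0.5, 0.5], [0.5, 0.5]]  # each note chosen by a fair coin flip

print(entropy_rate(predictable), entropy_rate(random_like))
```

A chain with a low entropy rate produces highly predictable sequences, while a high entropy rate corresponds to sequences that sound closer to random.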

A Gretsistas, MD Plumbley (2012)Group Polytope Faces Pursuit for Recovery of Block-Sparse Signals., In: FJ Theis, A Cichocki, A Yeredor, M Zibulevsky (eds.), LVA/ICA7191pp. 255-262

DOI: 10.1007/978-3-642-28551-6_32

Jinzheng Zhao, Yong Xu, Xinyuan Qian, Haohe Liu, Mark D. Plumbley, Wenwu Wang (2024)Attention-Based End-to-End Differentiable Particle Filter for Audio Speaker Tracking, In: IEEE Open Journal of Signal Processing5pp. 449-458 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/OJSP.2024.3363649

Particle filters (PFs) have been widely used in speaker tracking due to their capability in modeling a non-linear process or a non-Gaussian environment. However, particle filters are limited by several issues. For example, pre-defined handcrafted measurements are often used, which can limit the model performance. In addition, the transition and update models are often preset, which makes PFs less flexible in adapting to different scenarios. To address these issues, we propose an end-to-end differentiable particle filter framework that employs multi-head attention to model long-range dependencies. The proposed model employs self-attention as the learned transition model and cross-attention as the learned update model. To our knowledge, this is the first proposal combining a particle filter and a transformer for speaker tracking, where the measurement extraction, transition and update steps are integrated into an end-to-end architecture. Experimental results show that the proposed model achieves superior performance over the recurrent baseline models.

T Murray-Browne, D Mainstone, N Bryan-Kinns, MD Plumbley (2011)The medium is the message: Composing instruments and performing mappingspp. 56-59

Many performers of novel musical instruments find it difficult to engage audiences beyond those in the field. Previous research points to a failure to balance complexity with usability, and a loss of transparency due to the detachment of the controller and sound generator. The issue is often exacerbated by an audience's lack of prior exposure to the instrument and its workings. However, we argue that there is a conflict underlying many novel musical instruments in that they are intended to be both a tool for creative expression and a creative work of art in themselves, resulting in incompatible requirements. By considering the instrument, the composition and the performance together as a whole with careful consideration of the rate of learning demanded of the audience, we propose that a lack of transparency can become an asset rather than a hindrance. Our approach calls for not only controller and sound generator to be designed in sympathy with each other, but composition, performance and physical form too. Identifying three design principles, we illustrate this approach with the Serendiptichord, a wearable instrument for dancers created by the authors.

MD Plumbley (2007)On polar polytopes and the recovery of sparse representations, In: IEEE TRANSACTIONS ON INFORMATION THEORY53(9)pp. 3188-3195 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TIT.2007.903129

A Robertson, MD Plumbley (2007)B-Keeper: A Beat-Tracker for Live Performance., In: NIMEpp. 234-237

DOI: 10.1145/1279740.1279787

K O'Hanlon, MD Plumbley, M Sandler (2015)Non-negative Matrix Factorisation incorporating greedy Hellinger sparse coding applied to polyphonic music transcription, In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015)pp. 2214-2218

DOI: 10.1109/ICASSP.2015.7178364

Non-negative Matrix Factorisation (NMF) is a commonly used tool in many musical signal processing tasks, including Automatic Music Transcription (AMT). However unsupervised NMF is seen to be problematic in this context, and harmonically constrained variants of NMF have been proposed. While useful, the harmonic constraints may be constrictive in mixed signals. We have previously observed that recovery of overlapping signal elements using NMF is improved through introduction of a sparse coding step, and propose here the incorporation of a sparse coding step using the Hellinger distance into a NMF algorithm. Improved AMT results for unsupervised NMF are reported.
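
As a small aid to the abstract above, the Hellinger distance between two non-negative vectors can be sketched as follows. This shows only the distance itself, not the greedy sparse coding step or the NMF algorithm from the paper. Normalising each input to sum to one is an assumption of this sketch, and the function name is hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two non-negative vectors.

    Each vector is first normalised to sum to 1 (an assumption of this
    sketch), so the result lies in [0, 1].
    """
    sp, sq = sum(p), sum(q)
    return math.sqrt(
        0.5 * sum((math.sqrt(a / sp) - math.sqrt(b / sq)) ** 2 for a, b in zip(p, q))
    )

# Two spectra with no overlapping energy are maximally distant.
print(hellinger([1.0, 0.0], [0.0, 1.0]))
# Spectra that differ only in overall scale have distance zero.
print(hellinger([2.0, 2.0], [1.0, 1.0]))
```

Because the distance compares square roots of the entries, it is less dominated by the largest spectral peaks than a Euclidean distance on the raw values.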

MD Plumbley (1996)Information processing in negative feedback neural networks, In: NETWORK-COMPUTATION IN NEURAL SYSTEMS7(2)pp. 301-305

DOI: 10.1088/0954-898X/7/2/010

MD Plumbley (1999)Do cortical maps adapt to optimize information density?, In: NETWORK-COMPUTATION IN NEURAL SYSTEMS10(1)pp. 41-58 IOP PUBLISHING LTD

DOI: 10.1088/0954-898X/10/1/003

A Robertson, MD Plumbley (2013)Synchronizing Sequencing Software to a Live Drummer, In: COMPUTER MUSIC JOURNAL37(2)pp. 46-60 MIT PRESS

DOI: 10.1162/COMJ_a_00178

Zhao Ren, Qiuqiang Kong, Kun Qian, Mark D Plumbley, Björn W Schuller (2018)Attention-based Convolutional Neural Networks for Acoustic Scene Classification, In: DCASE 2018 Workshop Proceedings

We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From the attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy of subtask A is 72.6%, which is an improvement of 12.9% when compared with the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is a significant improvement over the baseline as well, in which the accuracies are 71.8%, 58.3%, and 58.3% for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).

H Ekeus, SA Abdallah, PW Mcowan, MD Plumbley (2013)How Predictable Do We Like Our Music? Eliciting Aesthetic Preferences With The Melody Triangle Mobile Apppp. 80-85

The Melody Triangle is a smartphone application for Android that lets users easily create musical patterns and textures. The user creates melodies by specifying positions within a triangle, and these positions correspond to the information theoretic properties of generated musical sequences. A model of human expectation and surprise in the perception of music, information dynamics, is used to 'map out' a musical generative system's parameter space, in this case Markov chains. This enables a user to explore the possibilities afforded by Markov chains, not by directly selecting their parameters, but by specifying the subjective predictability of the output sequence. As users of the app find melodies and patterns they like, they are encouraged to press a 'like' button, whereupon their settings are uploaded to our servers for analysis. Collecting the 'liked' settings of many users worldwide will allow us to elicit trends and commonalities in aesthetic preferences across users of the app, and to investigate how these might relate to the information-dynamic model of human expectation and surprise. We outline some of the relevant ideas from information dynamics and how the Melody Triangle is defined in terms of these. We then describe the Melody Triangle mobile application, how it is being used to collect research data and how the collected data will be evaluated.

F Jaillet, R Gribonval, MD Plumbley, H Zayyani (2010)AN L1 CRITERION FOR DICTIONARY LEARNING BY SUBSPACE IDENTIFICATION, In: 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSINGpp. 5482-5485

DOI: 10.1109/ICASSP.2010.5495206

A Robertson, MD Plumbley, N Bryan-Kinns (2008)A Turing Test for B-Keeper: Evaluating an Interactive Real-Time Beat-Tracker, In: Proceedings of the 8th International Conference on New Interfaces for Musical Expression (NIME 2008)pp. 319-324

S Li, DAA Black, Mark Plumbley (2016)The Clustering of Expressive Timing Within a Phrase in Classical Piano Performances by Gaussian Mixture Models, In: M Aramaki, R Kronland-Martinet, S Ystad (eds.), Music, Mind, and Embodiment: 11th International Symposium, CMMR 2015, Plymouth, UK, June 16-19, 2015, Revised Selected Papers. (Lecture Notes in Computer Science, vol. 9617)9617pp. 322-345

DOI: 10.1007/978-3-319-46282-0_21

In computational musicology research, clustering is a common approach to the analysis of expression. Our research uses mathematical model selection criteria to evaluate the performance of clustered and non-clustered models applied to intra-phrase tempo variations in classical piano performances. By engaging different standardisation methods for the tempo variations and engaging different types of covariance matrices, multiple pieces of performances are used for evaluating the performance of candidate models. The results of tests suggest that the clustered models perform better than the non-clustered models and the original tempo data should be standardised by the mean of tempo within a phrase.

C Barabasa, M Jafari, MD Plumbley (2012)A Robust Method for S1/S2 Heart Sounds Detection Without ECG Reference Based on Music Beat Tracking, In: 2012 10TH INTERNATIONAL SYMPOSIUM ON ELECTRONICS AND TELECOMMUNICATIONSpp. 307-310

DOI: 10.1109/ISETC.2012.6408110

We present a robust method for the detection of the first and second heart sounds (S1 and S2), without ECG reference, based on a music beat tracking algorithm. An intermediate representation of the input signal is first calculated by using an onset detection function based on complex spectral difference. A music beat tracking algorithm is then used to determine the location of the first heart sound. The beat tracker works in two steps: it first calculates the beat period and then finds the temporal beat alignment. Once the first sound is detected, inverse Gaussian weights are applied to the onset function at the detected positions and the algorithm is run again to find the second heart sound. In the last step, S1 and S2 labels are attributed to the detected sounds. The algorithm was evaluated in terms of location accuracy as well as sensitivity and specificity, and performed well even in the presence of murmurs or noisy signals.

A Gretsistas, MD Plumbley (2012)AN ALTERNATING DESCENT ALGORITHM FOR THE OFF-GRID DOA ESTIMATION PROBLEM WITH SPARSITY CONSTRAINTS, In: 2012 PROCEEDINGS OF THE 20TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO)pp. 874-878

SA Abdallah, MD Plumbley (2012)A measure of statistical complexity based on predictive information with application to finite spin systems, In: PHYSICS LETTERS A376(4)pp. 275-281 ELSEVIER SCIENCE BV

DOI: 10.1016/j.physleta.2011.10.066

Alfredo Zermini, Qingju Liu, Yong Xu, Mark Plumbley, Dave Betts, Wenwu Wang (2017)Binaural and Log-Power Spectra Features with Deep Neural Networks for Speech-Noise Separation, In: Proceedings of MMSP 2017 - IEEE 19th International Workshop on Multimedia Signal Processing IEEE

DOI: 10.1109/MMSP.2017.8122280

Binaural features of interaural level difference and interaural phase difference have proved to be very effective in training deep neural networks (DNNs) to generate time-frequency masks for target speech extraction in speech-speech mixtures. However, the effectiveness of binaural features is reduced in more common speech-noise scenarios, since the noise may overshadow the speech in adverse conditions. In addition, reverberation also decreases the sparsity of binaural features and therefore adds difficulties to the separation task. To address the above limitations, we highlight the spectral difference between speech and noise spectra and incorporate the log-power spectra features to extend the DNN input. Tested on two different reverberant rooms at different signal to noise ratios (SNR), our proposed method shows advantages over the baseline method using only binaural features in terms of signal to distortion ratio (SDR) and Short-Time Objective Intelligibility (STOI).

B Ophir, M Elad, N Bertin, MD Plumbley (2011)SEQUENTIAL MINIMAL EIGENVALUES - AN APPROACH TO ANALYSIS DICTIONARY LEARNING, In: X Mestre, J Hernando, M Pardas (eds.), 19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011)pp. 1465-1469

Mark D. PLUMBLEY, Tuomas Virtanen (2023)Creating a new research community on detection and classification of acoustic scenes and events: Lessons from the first ten years of DCASE challenges and workshops, In: NOISE-CON proceedings / National Conference on Noise Control Engineering

DOI: 10.3397/IN_2022_0643

Research work on automatic speech recognition and automatic music transcription has been around for several decades, supported by dedicated conferences or conference sessions. However, while individual researchers have been working on recognition of more general environmental sounds, until ten years ago there were no regular workshops or conference sessions where this research, or its researchers, could be found. There was also little available data for researchers to work on or to benchmark their work. In this paper we will outline how a new research community working on Detection and Classification of Acoustic Scenes and Events (DCASE) has grown over the last ten years, from two challenges on acoustic scene classification and sound event detection with a small workshop poster session, to an annual data challenge with six tasks and a dedicated annual workshop, attracting hundreds of delegates and strong industry interest. We will also describe how the analysis methods have evolved, from mel frequency cepstral coefficients (MFCCs) or cochleagrams classified by support vector machines (SVMs) or hidden Markov models (HMMs), to deep learning methods such as transfer learning, transformers, and self-supervised learning. We will finish by suggesting some potential future directions for automatic sound recognition and the DCASE community.

Yong Xu, Qiang Huang, Wenwu Wang, Peter Foster, S Sigtia, Philip Jackson, Mark Plumbley (2017)Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging, In: IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING25(6)pp. 1230-1241 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2017.2690563

Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper we make contributions to audio tagging in two parts: acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multi-label classification for the expected tags, considering that only chunk-level (or utterance-level) rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-Filter Bank (MFB) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system achieves a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task, while the EER of the first-prize system of this challenge is 0.17.

G Bugmann, P Sojka, M Reiss, M Plumbley, JG Taylor (1992)Direct Approaches to Improving the Robustness of Multilayer Neural Networkspp. 1063-1066

DOI: 10.1016/B978-0-444-89488-5.50049-X

Multilayer neural networks trained with backpropagation are in general not robust against the loss of a hidden neuron. In this paper we define a form of robustness called 1-node robustness and propose methods to improve it. One approach is based on a modification of the error function by the addition of a "robustness error". It leads to more robust networks but at the cost of a reduced accuracy. A second approach, "pruning-and-duplication", consists of duplicating the neurons whose loss is the most damaging for the network. Pruned neurons are used for the duplication. This procedure leads to robust and accurate networks at low computational cost. It may also prove beneficial for generalisation. Both methods are evaluated on the XOR function.

R Hamon, Lucas Rencker, V Emiya, Wenwu Wang, Mark Plumbley (2017)Assessment of musical noise using localization of isolated peaks in time-frequency domain, In: ICASSP2017 Proceedings IEEE

DOI: 10.1109/ICASSP.2017.7952245

Musical noise is a recurrent issue that appears in spectral techniques for denoising or blind source separation. Due to localised errors of estimation, isolated peaks may appear in the processed spectrograms, resulting in annoying tonal sounds after synthesis known as “musical noise”. In this paper, we propose a method to assess the amount of musical noise in an audio signal, by characterising the impact of these artificial isolated peaks on the processed sound. It turns out that because of the constraints between STFT coefficients, the isolated peaks are described as time-frequency “spots” in the spectrogram of the processed audio signal. The quantification of these “spots”, achieved through the adaptation of a method for localisation of significant STFT regions, allows for an evaluation of the amount of musical noise. We believe that this will pave the way to an objective measure and a better understanding of this phenomenon.

D Stowell, MD Plumbley (2013)Segregating event streams and noise with a Markov renewal process model., In: Journal of Machine Learning Research141pp. 2213-2238

Z Qin, Yue Gao, Mark D. Plumbley (2018)Malicious User Detection Based on Low-Rank Matrix Completion in Wideband Spectrum Sensing, In: IEEE Transactions on Signal Processing66(1)pp. 5-17 IEEE

DOI: 10.1109/TSP.2017.2759082

In cognitive radio networks, cooperative spectrum sensing (CSS) has been a promising approach to improve sensing performance by utilizing spatial diversity of participating secondary users (SUs). In current CSS networks, all cooperative SUs are assumed to be honest and genuine. However, the presence of malicious users sending out dishonest data can severely degrade the performance of CSS networks. In this paper, a framework with high detection accuracy and low costs of data acquisition at SUs is developed, with the purpose of mitigating the influences of malicious users. More specifically, a low-rank matrix completion based malicious user detection framework is proposed. In the proposed framework, in order to avoid requiring any prior information about the CSS network, a rank estimation algorithm and an estimation strategy for the number of corrupted channels are proposed. Numerical results show that the proposed malicious user detection framework achieves high detection accuracy with lower data acquisition costs in comparison with the conventional approach. After being validated by simulations, the proposed malicious user detection framework is tested on the real-world signals over TV white space spectrum.

MG Jafari, MD Plumbley (2010)A doubly sparse greedy adaptive dictionary learning algorithm for music and large-scale data, In: 128th Audio Engineering Society Convention 20102pp. 940-945

We consider the extension of the greedy adaptive dictionary learning algorithm that we introduced previously, to applications other than speech signals. The algorithm learns a dictionary of sparse atoms, while yielding a sparse representation for the speech signals. We investigate its behavior in the analysis of music signals, and propose a different dictionary learning approach that can be applied to large data sets. This facilitates the application of the algorithm to problems that generate large amounts of data, such as multimedia or multi-channel application areas.

EMILY MARY CORRIGAN-KAVANAGH, ANDRES FERNANDEZ, Mark D. PLUMBLEY (2021)ENVISIONING SOUND SENSING TECHNOLOGY FOR ENHANCING URBAN LIVING, In: Proceedings of Environments by Designpp. 186-192

G Roma, AJR Simpson, Emad M Grais, Mark D Plumbley (2016)Remixing musical audio on the web using source separation, In: Proceedings of the 2nd Web Audio Conference (WAC-2016)

Research in audio source separation has progressed a long way, producing systems that are able to approximate the component signals of sound mixtures. In recent years, many efforts have focused on learning time-frequency masks that can be used to filter a monophonic signal in the frequency domain. Using current web audio technologies, time-frequency masking can be implemented in a web browser in real time. This allows applying source separation techniques to arbitrary audio streams, such as internet radios, depending on cross-domain security configurations. While producing good quality separated audio from monophonic music mixtures is still challenging, current methods can be applied to remixing scenarios, where part of the signal is emphasized or deemphasized. This paper describes a system for remixing musical audio on the web by applying time-frequency masks estimated using deep neural networks. Our example prototype, implemented in client-side Javascript, provides reasonable quality results for small modifications.
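
The remixing idea described above, where part of the signal is emphasized or deemphasized, can be sketched as a per-bin gain applied through a time-frequency mask. This is a minimal illustration in Python rather than the client-side Javascript of the prototype. The remix rule `(gain * m + (1 - m)) * x` and all names are assumptions of this example, and the mask is taken as given rather than estimated by a neural network.

```python
def remix(spectrogram, mask, gain):
    """Scale the masked part of each time-frequency bin by `gain`.

    spectrogram: list of frames, each a list of complex STFT coefficients.
    mask: same shape, values in [0, 1] estimating the target source's share.
    gain: > 1 emphasizes the target source, < 1 deemphasizes it.
    """
    return [
        [(gain * m + (1.0 - m)) * x for x, m in zip(frame, mask_frame)]
        for frame, mask_frame in zip(spectrogram, mask)
    ]

# One frame with two frequency bins: the first bin belongs entirely to the
# target source, the second entirely to the rest of the mixture.
mixture = [[1 + 1j, 2 + 0j]]
target_mask = [[1.0, 0.0]]
print(remix(mixture, target_mask, gain=0.5))
```

With `gain=1.0` the mixture is returned unchanged, which is consistent with the abstract's observation that small modifications are the easier case.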

C. O’Brien, Mark D. Plumbley (2016)Sparse Kernel Dictionary Learning, In: Proceedings of the 11th IMA International Conference on Mathematics in Signal Processing

Dictionary Learning (DL) has seen widespread use in signal processing and machine learning. Given a data set, DL seeks to find a so-called ‘dictionary’ such that the data can be well represented by a sparse linear combination of dictionary elements. The representational power of DL may be extended by the use of kernel mappings, which implicitly map the data to some high dimensional feature space. In Kernel DL we wish to learn a dictionary in this underlying high-dimensional feature space, which can often model the data more accurately than learning in the original space. Kernel DL is more challenging than the linear case, however, since we no longer have access to the dictionary atoms directly, only to their relationship to the data via the kernel matrix. One strategy is therefore to represent the dictionary as a linear combination of the input data whose coefficients can be learned during training [1], relying on the fact that any optimal dictionary lies in the span of the data. A difficulty in Kernel DL is that given a data set of size N, the full (N × N) kernel matrix needs to be manipulated at each iteration, and dealing with such a large dense matrix can be extremely slow for big datasets. Here, we impose an additional constraint of sparsity on the coefficients so that the learned dictionary is given by a sparse linear combination of the input data. This greatly speeds up learning, and furthermore the speed-up is greater for larger datasets and can be tuned via a dictionary-sparsity parameter. The proposed approach thus combines Kernel DL with the ‘double sparse’ DL model [2] in which the learned dictionary is given by a sparse reconstruction over some base dictionary (in this case, the data itself).
We investigate the use of sparse Kernel DL as a feature learning step for a music transcription task and compare it to another Kernel DL approach based on the K-SVD algorithm [1] (which does not lead to sparse dictionaries in general), in terms of computation time and performance. Initial experiments show that Sparse Kernel DL is significantly faster than the non-sparse Kernel DL approach (6× to 8× speed-up depending on the size of the training data and the sparsity level) while leading to similar performance.

T Murray-Browne, D Mainstone, N Bryan-Kinns, MD Plumbley (2010)The Serendiptichord: A wearable instrument for contemporary dance performance, In: 128th Audio Engineering Society Convention 20103pp. 1547-1554

We describe a novel musical instrument designed for use in contemporary dance performance. This instrument, the Serendiptichord, takes the form of a headpiece plus associated pods which sense movements of the dancer, together with associated audio processing software driven by the sensors. Movements such as translating the pods or shaking the trunk of the headpiece cause selection and modification of sampled sounds. We discuss how we have closely integrated physical form, sensor choice and positioning and software to avoid issues which otherwise arise with disconnection of the innate physical link between action and sound, leading to an instrument that non-musicians (in this case, dancers) are able to enjoy using immediately.

Lucas Rencker, Wenwu Wang, Mark Plumbley (2017)A greedy algorithm with learned statistics for sparse signal reconstruction, In: ICASSP 2017 Proceedings IEEE

We address the problem of sparse signal reconstruction from a few noisy samples. Recently, a Covariance-Assisted Matching Pursuit (CAMP) algorithm has been proposed, improving the sparse coefficient update step of the classic Orthogonal Matching Pursuit (OMP) algorithm. CAMP allows the a-priori mean and covariance of the non-zero coefficients to be considered in the coefficient update step. In this paper, we analyze CAMP, which leads to a new interpretation of the update step as a maximum-a-posteriori (MAP) estimation of the non-zero coefficients at each step. We then propose to leverage this idea, by finding a MAP estimate of the sparse reconstruction problem, in a greedy OMP-like way. Our approach allows the statistical dependencies between sparse coefficients to be modelled, while keeping the practicality of OMP. Experiments show improved performance when reconstructing the signal from a few noisy samples.
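
For context, the classic Orthogonal Matching Pursuit (OMP) baseline named in this abstract can be sketched as follows. This is not CAMP or the MAP-based method proposed in the paper: it is only the standard greedy procedure, written with NumPy, and the function name and the toy dictionary are hypothetical.

```python
import numpy as np

def omp(D, y, k):
    """Classic OMP: greedily pick k columns (atoms) of D to approximate y."""
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(k):
        # Select the atom most correlated with the current residual.
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0  # never reselect a chosen atom
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected coefficients jointly by least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x = np.zeros(D.shape[1])
    x[support] = coeffs
    return x

# Toy overcomplete dictionary with three unit-norm atoms in two dimensions.
D = np.array([[1.0, 0.0, 2 ** -0.5],
              [0.0, 1.0, 2 ** -0.5]])
y = np.array([2.0, 0.0])  # exactly twice the first atom
print(omp(D, y, k=1))
```

The least-squares re-fit is the coefficient update step that the abstract says CAMP replaces with one that takes the prior mean and covariance of the non-zero coefficients into account.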

Christian Kroos, Mark Plumbley (2017)Learning the mapping function from voltage amplitudes to sensor positions in 3D-EMA using deep neural networks, In: Interspeech 2017 Proceedingspp. 454-458 ISCA

DOI: 10.21437/Interspeech.2017-1681

The first generation of three-dimensional Electromagnetic Articulography devices (Carstens AG500) suffered from occasional critical tracking failures. Although now superseded by new devices, the AG500 is still in use in many speech labs and many valuable data sets exist. In this study we investigate whether deep neural networks (DNNs) can learn the mapping function from raw voltage amplitudes to sensor positions based on a comprehensive movement data set. This is compared to arriving sample by sample at individual position values via direct optimisation as used in previous methods. We found that with appropriate hyperparameter settings a DNN was able to approximate the mapping function with good accuracy, leading to a smaller error than the previous methods, but that the DNN-based approach was not able to solve the tracking problem completely.

S Ewert, B Pardo, M Mueller, MD Plumbley (2014)Score-Informed Source Separation for Musical Audio Recordings [An overview], In: IEEE SIGNAL PROCESSING MAGAZINE31(3)pp. 116-124 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/MSP.2013.2296076

E Cano, M Plumbley, C Dittmar (2014)Phase-based harmonic/percussive separationpp. 1628-1632

In this paper, a method for separation of harmonic and percussive elements in music recordings is presented. The proposed method is based on a simple spectral peak detection step followed by a phase expectation analysis that discriminates between harmonic and percussive components. The proposed method was tested on a database of 10 audio tracks and has shown superior results to the reference state-of-the-art approach.

Y Xu, Q Huang, W Wang, MD Plumbley (2016) Hierarchical Learning for DNN-Based Acoustic Scene Classification, In: T Virtanen, A Mesaros, T Heittola, MD Plumbley, P Foster, E Benetos, M Lagrange (eds.), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pp. 105-109

In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. Firstly, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add further constraints to the cross-entropy-based loss function. A series of experiments were conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement on average scene classification error as compared with the Gaussian Mixture Model (GMM)-based benchmark system across four standard folds.
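One plausible form of a multi-level objective is sketched below: the usual cross-entropy on the fine scene class plus a weighted cross-entropy on a coarse class obtained by summing the fine-class probabilities within each group. This is an illustrative assumption rather than the formulation used in the paper; the function name `multilevel_loss`, the `weight` parameter and the fine-to-coarse mapping are hypothetical.

```python
import numpy as np

def multilevel_loss(logits, fine_label, fine_to_coarse, weight=0.5):
    """Fine-level cross-entropy plus a weighted coarse-level cross-entropy (illustrative)."""
    logits = np.asarray(logits, dtype=float)
    fine_to_coarse = np.asarray(fine_to_coarse)
    # Numerically stable softmax over the fine classes.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    fine_ce = -np.log(p[fine_label])
    # Coarse-class probability: total mass of the fine classes in the same group.
    coarse_label = fine_to_coarse[fine_label]
    p_coarse = p[fine_to_coarse == coarse_label].sum()
    coarse_ce = -np.log(p_coarse)
    return fine_ce + weight * coarse_ce

# Hypothetical taxonomy: four fine scenes grouped into two coarse categories.
loss = multilevel_loss([2.0, 0.5, -1.0, 0.0], fine_label=0, fine_to_coarse=[0, 0, 1, 1])
print(loss)
```

Under this form, confusing two scenes within the same coarse group is penalised less than confusing scenes across groups, which is the kind of taxonomy-based constraint the abstract describes.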

SA Abdallah, MD Plumbley (2008) Information dynamics and the perception of temporal structure 18, pp. 179-190

DOI: 10.1142/9789812834232_0015

P Brossier, M Sandler, M Plumbley (2003) Matching live sources with physical models, pp. 305-307

A Nesbit, E Vincent, MD Plumbley (2009) BENCHMARKING FLEXIBLE ADAPTIVE TIME-FREQUENCY TRANSFORMS FOR UNDERDETERMINED AUDIO SOURCE SEPARATION, In: 2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-8, PROCEEDINGS, pp. 37-40

DOI: 10.1109/ICASSP.2009.4959514

Philip Jackson, Mark D Plumbley, Wenwu Wang, Tim Brookes, Philip Coleman, Russell Mason, David Frohlich, Carla Bonina, David Plans (2017) Signal Processing, Psychoacoustic Engineering and Digital Worlds: Interdisciplinary Audio Research at the University of Surrey

At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.

CK Yong, CM Lim, M Plumbley, D Beighton, R Davidson (2002) Identification of dental bacteria using statistical and neural approaches, In: Proceedings of the 9th International Conference on Neural Information Processing (ICONIP '02) 2, pp. 606-610, IEEE

DOI: 10.1109/ICONIP.2002.1198129

This paper is devoted to enhancing rapid decision-making and identification of lactobacilli from dental plaque using statistical and neural network methods. Current techniques of identification such as clustering and principal component analysis are discussed with respect to the field of bacterial taxonomy. Decision-making using a multilayer perceptron neural network and a Kohonen self-organizing feature map is highlighted. Simulation work and corresponding results are presented with main emphasis on neural network convergence and identification capability using resubstitution, leave-one-out and cross-validation techniques. Rapid analyses on two separate sets of bacterial data from dental plaque revealed accuracy of more than 90% in the identification process. The risk of misdiagnosis was estimated at 14% in the worst case. Tests with unknown strains yield close correlation to cluster dendrograms. The use of the AXEON VindAX simulator indicated close correlations of the results. The paper concludes that artificial neural networks are suitable for use in the rapid identification of dental bacteria.

SA Abdallah, H Ekeus, P Foster, A Robertson, MD Plumbley (2012) Cognitive music modelling: An information dynamics approach, In: 2012 3rd International Workshop on Cognitive Information Processing, CIP 2012, IEEE

DOI: 10.1109/CIP.2012.6232940

We describe an information-theoretic approach to the analysis of music and other sequential data, which emphasises the predictive aspects of perception, and the dynamic process of forming and modifying expectations about an unfolding stream of data, characterising these using the tools of information theory: entropies, mutual informations, and related quantities. After reviewing the theoretical foundations, we discuss a few emerging areas of application, including musicological analysis, real-time beat-tracking analysis, and the generation of musical materials as a cognitively-informed compositional aid.

MG Jafari, MD Plumbley (2007) The role of high frequencies in convolutive blind source separation of speech signals, In: ME Davies, CJ James, SA Abdallah, MD Plumbley (eds.), Independent Component Analysis and Signal Separation, Proceedings 4666, pp. 488-494

Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang (2022) Automated Audio Captioning: An Overview of Recent Progress and New Challenges, In: EURASIP journal on audio, speech, and music processing 2022 (Recent advances in computational sound scene analysis), 26, Springer Open

DOI: 10.1186/s13636-022-00259-2

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.

S Abdallah, M Plumbley (2009) Information dynamics: patterns of expectation and surprise in the perception of music, In: CONNECTION SCIENCE 21(2-3), PII 911580, pp. 89-117, TAYLOR & FRANCIS LTD

DOI: 10.1080/09540090902733756

MEP Davies, MD Plumbley (2007) Context-dependent beat tracking of musical audio, In: IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 15(3), pp. 1009-1020, IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASL.2006.885257

Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang (2022) A track-wise ensemble event independent network for polyphonic sound event localization and detection, In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

DOI: 10.1109/ICASSP43922.2022.9747283

Polyphonic sound event localization and detection (SELD) aims at detecting types of sound events with corresponding temporal activities and spatial locations. In this paper, a track-wise ensemble event independent network with a novel data augmentation method is proposed. The proposed model is based on our previously proposed Event-Independent Network V2 and is extended by conformer blocks and dense blocks. A track-wise ensemble model is proposed to address a problem that arises when ensembling models with a track-wise output format: track permutation may occur among different models. The data augmentation approach contains several data augmentation chains, which are composed of random combinations of several data augmentation operations. The method also utilizes log-mel spectrograms, intensity vectors, and Spatial Cues-Augmented Log-Spectrogram (SALSA) for different models. We evaluate our proposed method in the Task of the L3DAS22 challenge and obtain the top ranking solution with a location-dependent F-score of 0.699. Source code is released.

MD Plumbley (2007) Geometry and manifolds for independent component analysis, In: 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol IV, Pts 1-3, pp. 1397-1400

E Oja, M Plumbley (2004) Blind separation of positive sources by globally convergent gradient search, In: NEURAL COMPUTATION 16(9), pp. 1811-1825, M I T PRESS

DOI: 10.1162/0899766041336413

The instantaneous noise-free linear mixing model in independent component analysis is largely a solved problem under the usual assumption of independent nongaussian sources and full column rank mixing matrix. However, with some prior information on the sources, like positivity, new analysis and perhaps simplified solution methods may yet become possible. In this letter, we consider the task of independent component analysis when the independent sources are known to be nonnegative and well grounded, which means that they have a nonzero pdf in the region of zero. It can be shown that in this case, the solution method is basically very simple: an orthogonal rotation of the whitened observation vector into nonnegative outputs will give a positive permutation of the original sources. We propose a cost function whose minimum coincides with nonnegativity and derive the gradient algorithm under the whitening constraint, under which the separating matrix is orthogonal. We further prove that in the Stiefel manifold of orthogonal matrices, the cost function is a Lyapunov function for the matrix gradient flow, implying global convergence. Thus, this algorithm is guaranteed to find the nonnegative well-grounded independent sources. The analysis is complemented by a numerical simulation, which illustrates the algorithm.
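The geometric claim in this abstract can be checked numerically in two dimensions. The sketch below is an illustration, not the gradient algorithm of the paper: it scans rotation angles on a one-degree grid and evaluates a cost that penalises negative outputs. The names `rotation` and `nonneg_cost`, the exponential source model, and the assumption that the observations are already whitened (the mixing matrix is itself a rotation) are all illustrative choices.

```python
import numpy as np

def rotation(angle_deg):
    """2-D rotation matrix for an angle given in degrees."""
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def nonneg_cost(W, Z):
    """Half the mean squared negative part of the outputs Y = W Z."""
    Y = W @ Z
    neg = np.minimum(Y, 0.0)
    return 0.5 * np.mean(np.sum(neg ** 2, axis=0))

# Nonnegative, well-grounded, unit-variance sources (exponential, rate 1).
rng = np.random.default_rng(0)
S = rng.exponential(1.0, size=(2, 2000))

# Assumed already-whitened observations: the sources rotated by 30 degrees.
Q = rotation(30.0)
Z = Q @ S

# Scan candidate rotations and keep the one with the smallest cost.
angles = np.arange(360)
costs = np.array([nonneg_cost(rotation(a), Z) for a in angles])
best_angle = int(angles[np.argmin(costs)])
print("best rotation angle (degrees):", best_angle)
```

The cost vanishes only when every output is nonnegative, so the minimising rotation should be the one that undoes the 30-degree mixing. The paper replaces this brute-force scan with a gradient flow on the orthogonal matrices and proves its convergence.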

D Stowell, MD Plumbley (2012) FRAMEWISE HETERODYNE CHIRP ANALYSIS OF BIRDSONG, In: 2012 PROCEEDINGS OF THE 20TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), pp. 2694-2698

N Cleju, MG Jafari, MD Plumbley (2012) CHOOSING ANALYSIS OR SYNTHESIS RECOVERY FOR SPARSE RECONSTRUCTION, In: 2012 PROCEEDINGS OF THE 20TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), pp. 869-873

H Ekeus, PW Mcowan, MD Plumbley (2013) Eye Tracking as Interface for Parametric Design

This research investigates the potential of eye tracking as an interface to parameter search in visual design. We outline our experimental framework where a user's gaze acts as a guiding feedback mechanism in an exploration of the state space of parametric designs. A small-scale pilot study was carried out where participants influence the evolution of generative patterns by looking at a screen while having their eyes tracked. Preliminary findings suggest that although our eye tracking system can be used to effectively navigate small areas of a parametric design's state space, there are challenges to overcome before such a system is practical in a design context. Finally, we outline future directions of this research.

MD Plumbley (1993) Information theory and neural network learning algorithms, pp. 145-155

There have been a number of recent papers on information theory and neural networks, especially in a perceptual system such as vision. Some of these approaches are examined, and their implications for neural network learning algorithms are considered. Existing supervised learning algorithms such as Back Propagation to minimize mean squared error can be viewed as attempting to minimize an upper bound on information loss. By making an assumption of noise either at the input or the output to the system, unsupervised learning algorithms such as those based on Hebbian (principal component analysing) or anti-Hebbian (decorrelating) approaches can also be viewed in a similar light. The optimization of information by the use of interneurons to decorrelate output units suggests a role for inhibitory interneurons and cortical loops in biological sensory systems.

N Degara, MEP Davies, A Pena, MD Plumbley (2011) Onset Event Decoding Exploiting the Rhythmic Structure of Polyphonic Music, In: IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING 5(6), pp. 1228-1239, IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/JSTSP.2011.2146229

E Vincent, MD Plumbley (2007) Low bit-rate object coding of musical audio using Bayesian harmonic models, In: IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 15(4), pp. 1273-1282, IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASL.2006.889792

MD Plumbley, S Abdallah, T Blumensath, MG Jafari, A Nesbit, E Vincent, B Wang (2006) Musical audio analysis using sparse representations, pp. 105-117

DOI: 10.1007/978-3-7908-1709-6_9

Sparse representations are becoming an increasingly useful tool in the analysis of musical audio signals. In this paper we give an overview of work by ourselves and others in this area, to give a flavour of the work being undertaken, and to give some pointers for further information about this interesting and challenging research topic.