My publications


Kroos Christian, Bones Oliver, Cao Yin, Harris Lara, Jackson Philip J. B., Davies William J., Wang Wenwu, Cox Trevor J., Plumbley Mark D. (2019) Generalisation in environmental sound classification: the 'making sense of sounds' data set and challenge, Proceedings of the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Humans are able to identify a large number of environmental
sounds and categorise them according to high-level semantic
categories, e.g. urban sounds or music. They are also capable
of generalising from past experience to new sounds when
applying these categories. In this paper we report on the creation
of a data set that is structured according to the top-level
of a taxonomy derived from human judgements and the design
of an associated machine learning challenge, in which
strong generalisation abilities are required to be successful.
We introduce a baseline classification system, a deep convolutional
network, which showed strong performance with an
average accuracy on the evaluation data of 80.8%. The result
is discussed in the light of two alternative explanations:
An unlikely accidental category bias in the sound recordings
or a more plausible true acoustic grounding of the high-level
Kong Qiuqiang, Xu Yong, Iqbal Turab, Cao Yin, Wang Wenwu, Plumbley Mark D. (2019) Acoustic scene generation with conditional SampleRNN, Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Acoustic scene generation (ASG) is a task to generate waveforms
for acoustic scenes. ASG can be used to generate audio
scenes for movies and computer games. Recently, neural networks
such as SampleRNN have been used for speech and
music generation. However, ASG is more challenging due to
its wide variety. In addition, evaluating a generative model is
also difficult. In this paper, we propose to use a conditional
SampleRNN model to generate acoustic scenes conditioned on
the input classes. We also propose objective criteria to evaluate
the quality and diversity of the generated samples based on
classification accuracy. The experiments on the DCASE 2016
Task 1 acoustic scene data show that with the generated audio
samples, a classification accuracy of 65.5% can be achieved
compared to samples generated by a random model of 6.7%
and samples from real recording of 83.1%. The performance
of a classifier trained only on generated samples achieves an
accuracy of 51.3%, as opposed to an accuracy of 6.7% with
samples generated by a random model.
Cao Yin, Kong Qiuqiang, Iqbal Turab, An Fengyan, Wang Wenwu, Plumbley Mark D. (2019) Polyphonic sound event detection and localization using a two-stage strategy, Proceedings of Detection and Classification of Acoustic Scenes and Events Workshop (DCASE 2019) pp. 30-34 New York University
Sound event detection (SED) and localization refer to recognizing
sound events and estimating their spatial and temporal locations.
Using neural networks has become the prevailing method for SED.
In the area of sound localization, which is usually performed by estimating
the direction of arrival (DOA), learning-based methods have
recently been developed. In this paper, it is experimentally shown
that the trained SED model is able to contribute to the direction
of arrival estimation (DOAE). However, joint training of SED and
DOAE degrades the performance of both. Based on these results, a
two-stage polyphonic sound event detection and localization method
is proposed. The method learns SED first, after which the learned
feature layers are transferred for DOAE. It then uses the SED ground
truth as a mask to train DOAE. The proposed method is evaluated on
the DCASE 2019 Task 3 dataset, which contains different overlapping
sound events in different environments. Experimental results
show that the proposed method is able to improve the performance
of both SED and DOAE, and also performs significantly better than
the baseline method.
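The masking step in the abstract above (using the SED ground truth as a mask when training DOAE) can be illustrated with a minimal NumPy sketch. The function name `masked_doa_loss`, the array layout (frames, classes, azimuth/elevation), and the use of mean squared error are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def masked_doa_loss(doa_pred, doa_true, sed_mask):
    """Mean squared DOA error over active (frame, class) entries only.

    doa_pred, doa_true: arrays of shape (frames, classes, 2) holding
        azimuth and elevation (hypothetical layout).
    sed_mask: binary array of shape (frames, classes), 1 where the
        ground-truth sound event is active.
    """
    # Broadcast the mask over the last (angle) axis.
    mask = sed_mask[..., np.newaxis].astype(float)
    squared_error = (doa_pred - doa_true) ** 2 * mask
    # Number of angle values that actually contribute to the loss.
    n_active = mask.sum() * doa_pred.shape[-1]
    if n_active == 0:
        return 0.0
    return float(squared_error.sum() / n_active)


# Usage: only the single active (frame, class) entry contributes.
pred = np.zeros((2, 2, 2))
true = np.ones((2, 2, 2))
mask = np.array([[1, 0], [0, 0]])
loss = masked_doa_loss(pred, true, mask)
```

With this kind of masking, errors at frames and classes where no event is active do not influence the localization loss.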
Kong Qiuqiang, Wang Yuxuan, Song Xuchen, Cao Yin, Wang Wenwu, Plumbley Mark D. (2020) Source separation with weakly labelled data: An approach to computational auditory scene analysis, ICASSP 2020
Source separation is the task of separating an audio recording into individual sound sources, and is fundamental for computational auditory scene analysis. Previous work on source separation has focused on separating particular sound classes such as speech and music, and much of it requires pairs of mixtures and clean sources for training. In this work, we propose a source separation framework trained with weakly labelled data. Weakly labelled data contains only the tags of an audio clip, without the occurrence times of sound events. We first train a sound event detection system with AudioSet. The trained sound event detection system is used to detect segments that are most likely to contain a target sound event. Then a regression is learnt from a mixture of two randomly selected segments to a target segment, conditioned on the audio tagging prediction of the target segment. Our proposed system can separate 527 sound classes from AudioSet within a single system. A U-Net is adopted for the separation system and achieves an average SDR of 5.67 dB over the 527 sound classes in AudioSet.
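For readers unfamiliar with the SDR figure quoted above, the following is a minimal sketch of one simple and common definition of signal-to-distortion ratio, 10 log10(||s||^2 / ||s - s_hat||^2). The function name `sdr_db` and the `eps` stabiliser are assumptions for illustration, and the paper may compute SDR with a different variant or toolkit.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB (simple energy-ratio form).

    reference: the clean target source signal.
    estimate: the separated signal produced by a system.
    eps: small constant (an assumption here) to avoid division by zero.
    """
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    return float(10.0 * np.log10((signal_power + eps) / (error_power + eps)))


# Usage: an estimate whose amplitude is 10% too low has an error
# energy 100 times smaller than the signal energy, i.e. about 20 dB.
target = np.array([1.0, 0.0, -1.0, 0.0])
value = sdr_db(target, 0.9 * target)
```

A higher SDR means the separated signal is closer to the target source.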
Iqbal Turab, Cao Yin, Kong Qiuqiang, Plumbley Mark D., Wang Wenwu (2020) Learning with Out-of-Distribution Data for Audio Classification, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
In supervised machine learning, the assumption that training data is labelled correctly is not always satisfied. In this paper, we investigate an instance of labelling error for classification tasks in which the dataset is corrupted with out-of-distribution
(OOD) instances: data that does not belong to any of the target classes, but is labelled as such. We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method uses an auxiliary classifier, trained on data that is known to be
in-distribution, for detection and relabelling. The amount of data required for this is shown to be small. Experiments are carried out on the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin. Comparisons with other noise-robust techniques are similarly encouraging.
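The idea of relabelling with an auxiliary classifier, as described in the abstract above, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `relabel_with_auxiliary`, the confidence threshold, and the hard-label replacement rule are hypothetical and are not the procedure specified in the paper.

```python
import numpy as np


def relabel_with_auxiliary(labels, aux_probs, threshold=0.9):
    """Relabel instances that an auxiliary classifier confidently disputes.

    labels: integer class labels of possibly mislabelled instances.
    aux_probs: array of shape (instances, classes) with class
        probabilities from an auxiliary classifier trained on data
        known to be in-distribution.
    threshold: assumed confidence level above which a disagreeing
        prediction replaces the original label.
    """
    labels = np.asarray(labels)
    predicted = aux_probs.argmax(axis=1)
    confidence = aux_probs.max(axis=1)
    # Relabel only confident predictions that differ from the given label.
    relabel = (confidence >= threshold) & (predicted != labels)
    new_labels = np.where(relabel, predicted, labels)
    return new_labels, relabel


# Usage: only the second instance is confidently disputed.
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.95, 0.03, 0.02],
    [0.40, 0.30, 0.30],
])
new_labels, changed = relabel_with_auxiliary([0, 1, 2], probs)
# new_labels is [0, 0, 2]; changed is [False, True, False]
```

Instances that are not relabelled keep their original labels instead of being discarded.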