Dr Qiuqiang Kong

My publications


Xu Y, Kong Q, Huang Q, Wang W, Plumbley M (2017) Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging, IJCNN 2017 Conference Proceedings IEEE
Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we achieve state-of-the-art performance with an EER of 0.12, while the best existing system obtains an EER of 0.15.
Xu Yong, Kong Qiuqiang, Huang Qiang, Wang Wenwu, Plumbley Mark (2017) Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging, Proceedings of Interspeech 2017 pp. 3083-3087 ISCA
Audio tagging aims to perform multi-label classification on audio chunks and is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has chunk-level labels, without frame-level labels. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurring acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on Task 4 of the DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set, with the equal error rate (EER) reduced from 0.13 to 0.11 compared with the convolutional recurrent baseline system.
Bird audio detection (BAD) aims to detect whether there is a bird call in an audio recording or not. One difficulty of this task is that the bird sound datasets are weakly labelled, that is, only the presence or absence of a bird in a recording is known, without knowing when the birds call. We propose to apply a joint detection and classification (JDC) model on the weakly labelled data (WLD) to detect and classify an audio clip at the same time. First, we apply a VGG-like convolutional neural network (CNN) on the mel spectrogram as a baseline. Then we propose a JDC-CNN model with VGG as a classifier and a CNN as a detector. We report that denoising methods, including optimally-modified log-spectral amplitude (OM-LSA) and median filtering of the spectrogram, worsen the classification accuracy, contrary to previous work. JDC-CNN can predict the time stamps of the events from weakly labelled data, and so is able to perform sound event detection from WLD. We obtained an area under the curve (AUC) of 95.70% on the development data and 81.36% on the unseen evaluation data, which is comparable to the baseline CNN model.
Kong Q, Xu Y, Wang W, Plumbley MD (2017) A joint detection-classification model for audio tagging of weakly labelled data, Proceedings of ICASSP 2017 IEEE
Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labelling of an audio clip is often based on the audio events in the clip, and no event-level label is provided to the user. Previous works have used the bag-of-frames model, which assumes the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative sounds and ignore uninformative sounds. Then only informative regions are used for classification. Experimental results on the 'CHiME Home' dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing the event-level label.
Kong Q, Sobieraj I, Wang W, Plumbley MD (2016) Deep Neural Network Baseline for DCASE Challenge 2016, Proceedings of DCASE 2016
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 81.0% using Mel + DNN against 77.2% using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 12.6% using Mel + DNN against 37.0% using Constant Q Transform (CQT) + Non-negative Matrix Factorization (NMF). In Task 3 we obtained an F value of 36.3% using Mel + DNN against 23.7% using MFCCs + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 18.9% using Mel + DNN against 20.9% using MFCCs + GMM. Therefore the DNN improves on the baseline in Tasks 1, 3, and 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.
Zermini A, Wang W, Kong Q, Xu Y, Plumbley M (2017) Audio source separation with deep neural networks using the dropout algorithm, Signal Processing with Adaptive Sparse Structured Representations (SPARS) 2017 Book of Abstracts pp. 1-2 Instituto de Telecomunicações
A method based on Deep Neural Networks (DNNs) and time-frequency masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft time-frequency masks for the recovery/estimation of the individual audio sources. In this paper, an algorithm called 'dropout' is applied to the hidden layers, affecting the sparsity of hidden unit activations: randomly selected neurons and their connections are dropped during the training phase, preventing feature co-adaptation. These methods are evaluated on binaural mixtures generated with Binaural Room Impulse Responses (BRIRs), accounting for a certain level of room reverberation. The results show that the proposed DNN system with randomly deleted neurons is able to achieve higher SDR performance compared to the baseline method without the dropout algorithm.
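The dropout idea used above can be sketched in a few lines (a minimal NumPy illustration of inverted dropout, not the authors' implementation; the function name is hypothetical):

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: randomly zero units with probability p during
    training, scaling kept units by 1/(1-p) so the expectation is unchanged."""
    if not train or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)  # kept units scaled up
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((4, 8))                     # toy hidden-layer activations
y = dropout(x, 0.5, rng)                # roughly half the units are zeroed
z = dropout(x, 0.5, rng, train=False)   # identity at test time
```

Because each kept unit is scaled by 1/(1-p), no rescaling is needed at test time, where the layer reduces to the identity.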
Sobieraj Iwona, Kong Qiuqiang, Plumbley Mark (2017) Masked Non-negative Matrix Factorization for Bird Detection Using Weakly Labeled Data, EUSIPCO 2017 Proceedings IEEE
Acoustic monitoring of bird species is an increasingly important field in signal processing. Many available bird sound datasets do not contain exact timestamps of the bird calls but have a coarse weak label instead. Traditional Non-negative Matrix Factorization (NMF) models are not well designed to deal with weakly labeled data. In this paper we propose a novel Masked Non-negative Matrix Factorization (Masked NMF) approach for bird detection using weakly labeled data. During dictionary extraction we introduce a binary mask on the activation matrix. In that way we are able to control which parts of the dictionary are used to reconstruct the training data. We compare our method with conventional NMF approaches and current state-of-the-art methods. The proposed method outperforms the NMF baseline and offers a parsimonious model for bird detection on weakly labeled data. Moreover, to our knowledge, the proposed Masked NMF achieved the best result among non-deep-learning methods on a test dataset used for the recent Bird Audio Detection challenge.
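The masking idea — a fixed binary mask on the activation matrix that controls which dictionary atoms may reconstruct which frames — can be sketched with standard multiplicative NMF updates (a toy NumPy sketch under assumed update rules, not the paper's exact algorithm):

```python
import numpy as np

def masked_nmf(V, mask, n_iter=200, eps=1e-9):
    """NMF (V ~= W @ H) with a fixed binary mask on the activation matrix H.

    Zero entries of the mask stop the corresponding dictionary atoms from
    being used to reconstruct the masked-out training frames."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], mask.shape[0])) + eps
    H = rng.random(mask.shape) * mask + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)   # standard H update
        H *= mask                                 # re-impose the mask
        W *= (V @ H.T) / ((W @ H) @ H.T + eps)   # standard W update
    return W, H

V = np.random.default_rng(1).random((12, 20))     # toy non-negative spectrogram
mask = np.ones((4, 20))
mask[2:, 10:] = 0                                 # atoms 2-3 disabled for frames 10+
W, H = masked_nmf(V, mask)
```

Because multiplicative updates preserve zeros, applying the mask once per iteration keeps the masked activations exactly zero throughout training.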
Xu Yong, Kong Qiuqiang, Wang Wenwu, Plumbley Mark (2017) Surrey-CVSSP system for DCASE2017 challenge task 4, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017) Tampere University of Technology. Laboratory of Signal Processing
In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings, due to their industry applications. There are two subtasks, audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly-supervised mode. A new batch-level balancing strategy is also proposed to tackle the data unbalancing problem. Fusion of posteriors from different systems is found effective in improving the performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.73 error rate (ER) for the sound event detection subtask on the development set, while the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-value for audio tagging and 1.02 for sound event detection.
Xu Yong, Kong Qiuqiang, Wang Wenwu, Plumbley Mark (2018) Large-scale weakly supervised audio classification using gated convolutional neural network, Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 121-125 IEEE
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, which are extracted from YouTube videos, are manually labelled with one or more audio tags, but without time stamps of the audio events, and are hence referred to as weakly labelled data. Two subtasks are defined in this challenge, audio tagging and sound event detection, using this weakly labelled data. We propose a convolutional recurrent neural network (CRNN) with learnable gated linear unit (GLU) non-linearities applied on the log mel spectrogram. In addition, we propose a temporal attention method along the frames to predict the locations of each audio event in a chunk from the weakly labelled data. Our systems were ranked 1st and 2nd as a team in these two subtasks of the DCASE 2017 challenge, with an F-value of 55.6% and an equal error of 0.73, respectively.
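The gated linear unit in this abstract pairs a linear projection with a learned sigmoid gate, which acts like an attention over time-frequency units. A minimal NumPy sketch of the mechanism (the weights here are random stand-ins for trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, w_lin, w_gate):
    """Gated linear unit: a linear projection modulated elementwise by a
    learned sigmoid gate, letting the network pass or suppress features."""
    return (x @ w_lin) * sigmoid(x @ w_gate)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 64))        # e.g. 5 frames of log-mel features
w_lin = rng.standard_normal((64, 32))   # linear-path weights (random stand-in)
w_gate = rng.standard_normal((64, 32))  # gate-path weights (random stand-in)
out = glu(x, w_lin, w_gate)
```

Since the gate lies in (0, 1), the output magnitude never exceeds that of the linear path; the gate learns which local features to attenuate.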
Kong Qiuqiang, Xu Yong, Wang Wenwu, Plumbley Mark D (2018) A joint separation-classification model for sound event detection of weakly labelled data, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 321-325 Institute of Electrical and Electronics Engineers (IEEE)
Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events from an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, only the tags of an audio recording are known but the times of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio clip to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of each audio event. In the source separation stage, the sources of audio events and the times of sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming a deep neural network baseline of 0.29. A source separation SDR of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as the probability mapping, outperforming the global max pooling (GMP) based probability mapping, which gives an SDR of 0.03 dB. Source code of our work is available.
Kong Qiuqiang, Xu Yong, Wang Wenwu, Plumbley Mark (2018) Audio set classification with attention model: a probabilistic perspective, Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 316-320 IEEE
This paper investigates Audio Set classification. Audio Set is a large-scale weakly labelled dataset (WLD) of audio clips. In WLD only the presence of a label is known, without knowing the time at which the labelled events happen. We propose an attention model to solve this WLD problem and explain the attention model from a novel probabilistic perspective. Each audio clip in Audio Set consists of a collection of features. We call each feature an instance and the collection a bag, following the terminology in multiple instance learning. In the attention model, each instance in the bag has a trainable probability measure for each class. The classification of the bag is the expectation of the classification outputs of the instances in the bag with respect to the learned probability measure. Experiments show that the proposed attention model achieves a mAP of 0.327 on Audio Set, outperforming Google's baseline of 0.314.
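The "expectation under a learned probability measure" can be sketched directly: a softmax over instances gives the measure, and the bag prediction is the weighted sum of instance outputs (a NumPy sketch with random stand-in values, not the trained model):

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(instance_probs, att_logits):
    """Bag-level prediction as the expectation of instance-level outputs
    under a probability measure learned per class over the instances."""
    weights = softmax(att_logits, axis=0)          # normalise over instances
    return (weights * instance_probs).sum(axis=0)  # expectation for each class

rng = np.random.default_rng(0)
T, C = 10, 4                                # 10 instances (frames), 4 classes
instance_probs = rng.random((T, C))         # stand-in instance classifications
att_logits = rng.standard_normal((T, C))    # stand-in attention scores
clip_probs = attention_pool(instance_probs, att_logits)
```

Because the weights form a probability distribution per class, the clip-level prediction is always a convex combination of the instance predictions.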
Zermini Alfredo, Kong Qiuqiang, Xu Yong, Plumbley Mark D., Wang Wenwu (2018) Improving reverberant speech separation with binaural cues using temporal context and convolutional neural networks, In: Deville Y, Gannot S, Mason R, Plumbley Mark, Ward D (eds.), Latent Variable Analysis and Signal Separation: LVA/ICA 2018. Lecture Notes in Computer Science 10891 pp. 361-371 Springer
Given binaural features as input, such as interaural level difference and interaural phase difference, Deep Neural Networks (DNNs) have been recently used to localize sound sources in a mixture of speech signals and/or noise, and to create time-frequency masks for the estimation of the sound sources in reverberant rooms. Here, we explore a more advanced system, where feed-forward DNNs are replaced by Convolutional Neural Networks (CNNs). In addition, the adjacent frames of each time frame (occurring before and after this frame) are used to exploit contextual information, thus improving the localization and separation for each source. The quality of the separation results is evaluated in terms of Signal to Distortion Ratio (SDR).
Iqbal Turab, Xu Yong, Kong Qiuqiang, Wang Wenwu (2018) Capsule Routing for Sound Event Detection, Proceedings of 2018 26th European Signal Processing Conference (EUSIPCO) pp. 2255-2259 IEEE
The detection of acoustic scenes is a challenging problem in which environmental sound events must be detected from a given audio signal. This includes classifying the events as well as estimating their onset and offset times. We approach this problem with a neural network architecture that uses the recently-proposed capsule routing mechanism. A capsule is a group of activation units representing a set of properties for an entity of interest, and the purpose of routing is to identify part-whole relationships between capsules. That is, a capsule in one layer is assumed to belong to a capsule in the layer above in terms of the entity being represented. Using capsule routing, we wish to train a network that can learn global coherence implicitly, thereby improving generalization performance. Our proposed method is evaluated on Task 4 of the DCASE 2017 challenge. Results show that classification performance is state-of-the-art, achieving an F-score of 58.6%. In addition, overfitting is reduced considerably compared to other architectures.
Kong Qiuqiang, Iqbal Turab, Xu Yong, Wang Wenwu, Plumbley Mark D (2018) DCASE 2018 Challenge Surrey Cross-task convolutional neural network baseline, DCASE2018 Workshop
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a 'CNN Baseline' system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that the deeper CNN with 8 layers performs better than the CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
Wei Shengyun, Xu Kele, Wang Dezhi, Liao Feifan, Wang Huaimin, Kong Qiuqiang (2018) Sample mixed-based data augmentation for domestic audio tagging, DCASE 2018 Workshop
Audio tagging has attracted increasing attention over the last decade and has various potential applications in many fields. The objective of audio tagging is to predict the labels of an audio clip. Recently, deep learning methods have been applied to audio tagging and have achieved state-of-the-art performance. However, due to the limited size of audio tagging data such as the DCASE data, the trained models tend to overfit, which results in poor generalization on new data. Previous data augmentation methods such as pitch shifting, time stretching and adding background noise do not show much improvement in audio tagging. In this paper, we explore sample-mixed data augmentation for the domestic audio tagging task, including mixup, SamplePairing and extrapolation. We apply a convolutional recurrent neural network (CRNN) with an attention module, with the log-scaled mel spectrum as input, as a baseline system. In our experiments, we achieve a state-of-the-art equal error rate (EER) of 0.10 on the DCASE 2016 Task 4 dataset with the mixup approach, outperforming the baseline system without data augmentation.
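Of the sample-mixing schemes listed above, mixup is the simplest to sketch: each training pair is a convex combination of two samples and their labels, with the mixing weight drawn from a Beta distribution (a generic NumPy illustration, not this paper's exact setup):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: train on a convex combination of two samples and their labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)   # mixing weight in [0, 1]
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix

rng = np.random.default_rng(0)
x1, x2 = rng.random(100), rng.random(100)   # two toy audio feature vectors
y1 = np.array([1.0, 0.0])                   # one-hot tags
y2 = np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2, rng=rng)
```

With one-hot targets, the mixed label is a soft probability vector that still sums to one, so the usual cross-entropy loss applies unchanged.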
Ren Zhao, Kong Qiuqiang, Qian Kun, Plumbley Mark D, Schuller Björn W (2018) Attention-based Convolutional Neural Networks for Acoustic Scene Classification, DCASE 2018 Workshop Proceedings
We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From an attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy of subtask A is 72.6%, which is an improvement of 12.9% when compared with the official baseline.
Yu Changsong, Barsim Karim Said, Kong Qiuqiang, Yang Bin (2018) Multi-level attention model for weakly supervised audio classification, DCASE 2018 Workshop
In this paper, we propose a multi-level attention model for the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or the absence of sound events in an audio clip. Recently, Google published the large-scale weakly labelled AudioSet dataset containing 2 million audio clips with only the presence or the absence labels of the sound events, without the onset and offset times of the sound events. Previously proposed attention models only applied a single attention module on the last layer of a neural network, which limited the capacity of the attention model. In this paper, we propose a multi-level attention model which consists of multiple attention modules applied on the intermediate neural network layers. The outputs of these attention modules are concatenated into a vector, followed by a fully connected layer, to obtain the final prediction of each class. Experiments show that the proposed multi-level attention model achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model and the Google baseline system of 0.327 and 0.314, respectively.
Hou Yuanbo, Kong Qiuqiang, Wang Jun, Li Shengchen (2018) Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units, DCASE2018 Workshop
Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag polyphonic audio recordings, we propose to use a Connectionist Temporal Classification (CTC) loss function on top of a Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units (GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data (SLD). In GLU-CTC, the CTC objective function maps the frame-level probability of labels to the clip-level probability of labels. To compare the mapping ability of GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling (GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). We also compare the proposed GLU-CTC system with the baseline system, which is a CRNN trained using the CTC loss function without GLU. The experiments show that GLU-CTC achieves an Area Under the Curve (AUC) score of 0.882 in audio tagging, outperforming GLU-GMP at 0.803, GLU-GAP at 0.766 and the baseline system at 0.837. This means that, based on the same CRNN model with GLU, CTC mapping performs better than GMP and GAP mapping. With both based on CTC mapping, the CRNN with GLU outperforms the CRNN without GLU.
Iqbal Turab, Kong Qiuqiang, Plumbley Mark D, Wang Wenwu (2018) General-purpose audio tagging from noisy labels using convolutional neural networks, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) pp. 212-216 Tampere University of Technology
General-purpose audio tagging refers to classifying sounds that are of a diverse nature, and is relevant in many applications where domain-specific information cannot be exploited. The DCASE 2018 challenge introduces Task 2 for this very problem. In this task, there are a large number of classes and the audio clips vary in duration. Moreover, a subset of the labels is noisy. In this paper, we propose a system to address these challenges. The basis of our system is an ensemble of convolutional neural networks trained on log-scaled mel spectrograms. We use preprocessing and data augmentation methods to improve the performance further. To reduce the effects of label noise, two techniques are proposed: loss function weighting and pseudo-labeling. Experiments on the private test set of this task show that our system achieves state-of-the-art performance with a mean average precision score of 0.951.
Kong Qiuqiang, Xu Yong, Sobieraj Iwona, Wang Wenwu, Plumbley Mark D. (2019) Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (4) pp. 777-787 Institute of Electrical and Electronics Engineers (IEEE)
Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip. Many supervised SED algorithms rely on strongly labelled data which contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without knowing their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied on a T-F representation, such as the log mel spectrogram of an audio clip, to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used for separating the sound events from the background scenes in the time-frequency domain. Then a classification mapping is applied on the T-F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping using a convolutional neural network and the classification mapping using a global weighted rank pooling (GWRP). In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound events data. When mixing under 0 dB, the proposed method achieved F1 scores of 0.534, 0.398 and 0.167 in audio tagging, frame-wise SED and event-wise SED, outperforming the fully connected deep neural network baseline of 0.331, 0.237 and 0.120, respectively. In T-F segmentation, we achieved an F1 score of 0.218, where previous methods were not able to do T-F segmentation.
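Global weighted rank pooling, used above as the classification mapping, sorts the mask values and weights the i-th largest by a decay factor r**i, interpolating between max pooling and average pooling (a minimal NumPy sketch of the pooling operator only, with an assumed decay parameter r):

```python
import numpy as np

def gwrp(scores, r=0.5):
    """Global weighted rank pooling: sort scores in descending order and
    weight the i-th largest by r**i; r=1 recovers average pooling and
    r=0 recovers max pooling."""
    s = np.sort(scores.ravel())[::-1]   # descending rank order
    w = r ** np.arange(s.size)          # decaying rank weights
    return float((w * s).sum() / w.sum())

seg_mask = np.random.default_rng(0).random((6, 8))  # toy T-F segmentation mask
p = gwrp(seg_mask, r=0.5)                           # pooled presence score
```

Unlike global max pooling, which backpropagates through a single T-F unit, GWRP spreads the gradient over many high-ranked units, which is why it can produce fuller segmentation masks.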
Ren Zhao, Kong Qiuqiang, Han Jing, Plumbley Mark, Schuller Björn W (2019) Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes, 2019 Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) IEEE
The goal of Acoustic Scene Classification (ASC) is to recognise the environment in which an audio waveform has been recorded. Recently, deep neural networks have been applied to ASC and have achieved state-of-the-art performance. However, few works have investigated how to visualise and understand what a neural network has learnt from acoustic scenes. Previous work applied local pooling after each convolutional layer, thereby reducing the size of the feature maps. In this paper, we suggest that local pooling is not necessary, but the size of the receptive field is important. We apply atrous Convolutional Neural Networks (CNNs) with global attention pooling as the classification model. The internal feature maps of the attention model can be visualised and explained. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 dataset, our proposed method achieves an accuracy of 72.7%, significantly outperforming the CNNs without dilation at 60.4%. Furthermore, our results demonstrate that the learnt feature maps contain rich information on acoustic scenes in the time-frequency domain.
Hou Yuanbo, Kong Qiuqiang, Li Shengchen, Plumbley Mark D. (2019) Sound event detection with sequentially labelled data based on connectionist temporal classification and unsupervised clustering, Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Sound event detection (SED) methods typically rely on either strongly labelled data or weakly labelled data. As an alternative, sequentially labelled data (SLD) was proposed. In SLD, the events and the order of events in audio clips are known, without knowing the occurrence time of events. This paper proposes a connectionist temporal classification (CTC) based SED system that uses SLD instead of strongly labelled data, with a novel unsupervised clustering stage. Experiments on 41 classes of sound events show that the proposed two-stage method trained on SLD achieves performance comparable to the previous state-of-the-art SED system trained on strongly labelled data, and is far better than another state-of-the-art SED system trained on weakly labelled data, which indicates the effectiveness of the proposed two-stage method trained on SLD without any onset/offset times of sound events.
Kong Qiuqiang, Xu Yong, Iqbal Turab, Cao Yin, Wang Wenwu, Plumbley Mark D. (2019) Acoustic scene generation with conditional SampleRNN, Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Acoustic scene generation (ASG) is a task to generate waveforms for acoustic scenes. ASG can be used to generate audio scenes for movies and computer games. Recently, neural networks such as SampleRNN have been used for speech and music generation. However, ASG is more challenging due to its wide variety. In addition, evaluating a generative model is also difficult. In this paper, we propose to use a conditional SampleRNN model to generate acoustic scenes conditioned on the input classes. We also propose objective criteria to evaluate the quality and diversity of the generated samples based on classification accuracy. The experiments on the DCASE 2016 Task 1 acoustic scene data show that with the generated audio samples, a classification accuracy of 65.5% can be achieved, compared to samples generated by a random model at 6.7% and samples from real recordings at 83.1%. The performance of a classifier trained only on generated samples achieves an accuracy of 51.3%, as opposed to an accuracy of 6.7% with samples generated by a random model.
Kong Qiuqiang, Xu Yong, Jackson Philip J.B., Wang Wenwu, Plumbley Mark D. (2019) Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19) pp. 2747-2753 International Joint Conferences on Artificial Intelligence
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture, and is a challenging problem in which no prior knowledge of the mixing filters is available. Both the individual sources and the mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synthesizing-decomposition (S-D) approach to solve the single-channel separation and deconvolution problem. In synthesizing, a generative model for sources is built using a generative adversarial network (GAN). In decomposition, both the mixing filters and the sources are optimized to minimize the reconstruction error of the mixture. The proposed S-D approach achieves a peak signal-to-noise ratio (PSNR) of 18.9 dB and 15.4 dB in image inpainting and completion, outperforming a baseline convolutional neural network with PSNR of 15.3 dB and 12.2 dB, respectively, and achieves a PSNR of 13.2 dB in source separation together with deconvolution, outperforming a convolutive non-negative matrix factorization (NMF) baseline of 10.1 dB.
Kong Qiuqiang, Yu Changsong, Xu Yong, Iqbal Turab, Wang Wenwu, Plumbley Mark D. (2019) Weakly Labelled AudioSet Tagging With Attention Neural Networks,IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (11) pp. 1791-1802 IEEE
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, while the onset and offset times are unknown. To address the weakly-labelled audio tagging problem, we propose attention neural networks as a way to attend the most salient parts of an audio clip. We bridge the connection between attention neural networks and multiple instance learning (MIL) methods, and propose decision-level and feature-level attention neural networks for audio tagging. We investigate attention neural networks modelled by different functions, depths and widths. Experiments on AudioSet show that the feature-level attention neural network achieves a state-of-the-art mean average precision (mAP) of 0.369, outperforming the best multiple instance learning (MIL) method of 0.317 and Google's deep neural network baseline of 0.314. In addition, we discover that the audio
tagging performance on AudioSet embedding features has a weak correlation with the number of training examples and the quality of labels of each sound class.
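The decision-level attention pooling described in the abstract above can be illustrated with a minimal numpy sketch (not the authors' implementation; the function name and the toy segment probabilities and attention scores are invented for the example):

```python
import numpy as np

def attention_pooling(seg_probs, seg_logits):
    """Decision-level attention pooling over T segments of a clip.

    seg_probs:  (T, K) segment-wise class probabilities.
    seg_logits: (T, K) unnormalised attention scores per segment.
    Returns clip-level probabilities of shape (K,).
    """
    # Softmax over time gives each segment's attention weight per class.
    w = np.exp(seg_logits - seg_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    # Clip prediction is the attention-weighted average of segment predictions,
    # so salient segments dominate and silent segments are down-weighted.
    return (w * seg_probs).sum(axis=0)

# Toy example: 4 segments, 2 classes; segment 0 is salient for class 0,
# segment 3 for class 1, the rest are near-silent.
probs = np.array([[0.9, 0.1], [0.2, 0.1], [0.1, 0.1], [0.1, 0.8]])
logits = np.array([[3.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 3.0]])
clip = attention_pooling(probs, logits)  # both classes pushed above 0.5
```

With uniform (all-zero) attention scores this reduces to average pooling over segments, which is the standard MIL baseline the paper compares against.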
Ren Zhao, Han Jing, Cummins Nicholas, Kong Qiuqiang, Plumbley Mark, Schuller Björn W. (2019) Multi-instance Learning for Bipolar Disorder Diagnosis using Weakly Labelled Speech Data,Proceedings of DPH 2019: 9th International Digital Public Health Conference 2019 Association for Computing Machinery (ACM)
While deep learning is undoubtedly the predominant learning technique across speech processing, it is still not widely used in health-based applications. The corpora available for health-style recognition problems are often small, both concerning the total amount of data available and the number of individuals present. The Bipolar Disorder corpus, used in the 2018 Audio/Visual Emotion Challenge, contains only 218 audio samples from 46 individuals. Herein, we present a multi-instance learning framework aimed at constructing more reliable deep learning-based models in such conditions. First, we segment the speech files into multiple chunks. However, the problem is that each of the individual chunks is weakly labelled, as they are annotated with the label of the corresponding speech file, but may not be indicative of that label. We then train the deep learning-based (ensemble) multi-instance learning model, aiming at solving such a weakly labelled problem. The presented results demonstrate that this approach can improve the accuracy of feedforward, recurrent, and convolutional neural nets on the 3-class mania classification tasks undertaken on the Bipolar Disorder corpus.
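The bag-of-chunks formulation described above can be sketched as a simple multi-instance aggregation (a hypothetical numpy illustration; the pooling choice and the toy probabilities are made up, and the paper's actual models are deep ensembles):

```python
import numpy as np

def bag_prediction(chunk_probs, pool="max"):
    """Multi-instance aggregation: a speech file (bag) is split into
    chunks (instances) that all inherit the file-level label, even
    though not every chunk is indicative of it.

    chunk_probs: (n_chunks, n_classes) per-chunk class probabilities.
    Returns bag-level class probabilities of shape (n_classes,).
    """
    if pool == "max":
        # Bag evidence is driven by the most confident chunk per class.
        return chunk_probs.max(axis=0)
    # Average pooling treats every chunk as equally informative.
    return chunk_probs.mean(axis=0)

# Toy bag: 3 chunks of one recording, 3-class mania task;
# only the last chunk carries a clear class-2 cue.
probs = np.array([[0.3, 0.4, 0.3],
                  [0.4, 0.3, 0.3],
                  [0.1, 0.1, 0.8]])
label = int(np.argmax(bag_prediction(probs, pool="max")))
```

Max pooling lets a single indicative chunk decide the bag label, which matches the weak-labelling assumption that the file-level annotation need only be supported by part of the recording.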
Cao Yin, Kong Qiuqiang, Iqbal Turab, An Fengyan, Wang Wenwu, Plumbley Mark D. (2019) Polyphonic sound event detection and localization using a two-stage strategy,Proceedings of Detection and Classification of Acoustic Scenes and Events Workshop (DCASE 2019) pp. 30-34 New York University
Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
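The masking step in the second stage can be illustrated with a minimal numpy sketch (not the authors' code; the function name, error measure and shapes are simplified assumptions for illustration):

```python
import numpy as np

def masked_doa_loss(doa_pred, doa_true, sed_mask):
    """DOA regression loss masked by the SED ground truth, so that
    direction errors only count in frames where the event is active.

    doa_pred, doa_true: (T, 2) azimuth/elevation estimates in degrees.
    sed_mask: (T,) binary event-activity mask taken from the SED labels.
    Returns the mean per-active-frame absolute error.
    """
    err = np.abs(doa_pred - doa_true).sum(axis=1)  # (T,) per-frame error
    active = sed_mask.astype(float)
    # Average only over active frames; silent frames contribute nothing.
    return (err * active).sum() / max(active.sum(), 1.0)

# Frames 0-1 are active; frame 2 is silent, so its large DOA error
# (the model has nothing to localize there) is ignored by the mask.
pred = np.array([[10.0, 0.0], [12.0, 0.0], [90.0, 45.0]])
true = np.array([[10.0, 0.0], [10.0, 0.0], [0.0, 0.0]])
mask = np.array([1, 1, 0])
loss = masked_doa_loss(pred, true, mask)  # only the 2-degree error counts
```

Masking with the ground-truth activity prevents silent frames, where DOA is undefined, from dominating the regression target during training.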
Iqbal Turab, Cao Yin, Kong Qiuqiang, Plumbley Mark D., Wang Wenwu (2020) Learning with Out-of-Distribution Data for Audio Classification,International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
In supervised machine learning, the assumption that training data is labelled correctly is not always satisfied. In this paper, we investigate an instance of labelling error for classification tasks in which the dataset is corrupted with out-of-distribution
(OOD) instances: data that does not belong to any of the target classes, but is labelled as such. We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method uses an auxiliary classifier, trained on data that is known to be
in-distribution, for detection and relabelling. The amount of data required for this is shown to be small. Experiments are carried out on the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin. Comparisons with other noise-robust techniques are similarly encouraging.
Sound event detection (SED) is the problem of detecting the onset and offset times of sound events in an audio recording. SED has many applications in both academia and industry, such as multimedia information retrieval and the monitoring of domestic and public security. However, compared to speech signal processing, which has been researched for many years, the classification and detection of general sounds received little research attention until recent years.

One limitation of research on audio classification and sound event detection has been that few datasets were publicly available until the release of the Detection and Classification of Acoustic Scenes and Events (DCASE) dataset. The DCASE dataset consists of data for acoustic scene classification (ASC), audio tagging (AT) and sound event detection. ASC and AT are tasks in which systems predict pre-defined labels for an audio clip, while SED requires systems to predict both the presence or absence of sound events in an audio clip and the onset and offset times of those events.
One difficulty of the audio classification and SED tasks is that many datasets, such as the DCASE dataset, are weakly labelled. That is, only the presence or absence of sound events in an audio clip is known, without the onset and offset annotations of the sound events. This thesis focused on solving the audio tagging and sound event detection problem using only weakly labelled data. It proposed attention neural networks to solve the general weakly labelled AT and SED problem; the attention neural networks can automatically learn to attend to important segments and to ignore silence and irrelevant segments in an audio clip. We developed a set of weak learning methods for AT and SED using attention neural networks. The proposed methods have achieved state-of-the-art performance in audio tagging and sound event detection.