problem in which environmental sound events must be detected
from a given audio signal. This includes classifying the events as
well as estimating their onset and offset times. We approach this
problem with a neural network architecture that uses the recently proposed
capsule routing mechanism. A capsule is a group of
activation units representing a set of properties for an entity of
interest, and the purpose of routing is to identify part-whole
relationships between capsules. That is, a capsule in one layer is
assumed to belong to a capsule in the layer above in terms of the
entity being represented. Using capsule routing, we wish to train
a network that can learn global coherence implicitly, thereby
improving generalization performance. Our proposed method is
evaluated on Task 4 of the DCASE 2017 challenge. Results show
that classification performance is state-of-the-art, achieving an F-score
of 58.6%. In addition, overfitting is reduced considerably
compared to other architectures.
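
The routing idea described above can be illustrated with routing-by-agreement, one widely used capsule routing scheme. The following is a minimal pure-Python sketch. The text here does not say which routing variant or hyperparameters are used, so the function names (`squash`, `route`), the three iterations, and the nested-list input layout are assumptions made for illustration.

```python
import math


def squash(v):
    """Rescale a vector so its length lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0 for _ in v]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def route(u_hat, n_iters=3):
    """Routing-by-agreement (illustrative sketch).

    u_hat[i][j] is the vector that lower capsule i predicts for upper capsule j.
    Returns the upper-capsule outputs v[j] and the coupling coefficients c[i][j]
    used in the final iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    for _ in range(n_iters):
        # Each lower capsule distributes its output over the upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        # Raise the logit where a prediction agrees with the upper capsule output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c
```

When several lower capsules make similar predictions for the same upper capsule, their coupling coefficients for that capsule grow over the iterations, which is how the part-whole assignment emerges.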
of a diverse nature, and is relevant in many applications where
domain-specific information cannot be exploited. The DCASE 2018
challenge introduces Task 2 for this very problem. In this task, there
are a large number of classes and the audio clips vary in duration.
Moreover, a subset of the labels are noisy. In this paper, we propose
a system to address these challenges. The basis of our system is
an ensemble of convolutional neural networks trained on log-scaled
mel spectrograms. We use preprocessing and data augmentation
methods to improve the performance further. To reduce the effects
of label noise, two techniques are proposed: loss function weighting
and pseudo-labeling. Experiments on the private test set of this task
show that our system achieves state-of-the-art performance with a
mean average precision score of 0.951.
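
Loss function weighting for noisy labels can be sketched as a per-clip weighted cross-entropy. This is a minimal illustration only: the text does not give the weighting scheme, so the `verified` flags, the `noisy_weight` value of 0.5, and the function name are assumptions.

```python
import math


def weighted_cross_entropy(probs, labels, verified, noisy_weight=0.5):
    """Weighted mean cross-entropy in which unverified labels count less.

    probs[k]    : predicted class probabilities for clip k
    labels[k]   : integer class index assigned to clip k
    verified[k] : True if the label of clip k is trusted (assumed to be known)
    """
    total, weight_sum = 0.0, 0.0
    for p, y, ok in zip(probs, labels, verified):
        w = 1.0 if ok else noisy_weight  # down-weight possibly noisy labels
        total += -w * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        weight_sum += w
    return total / weight_sum if weight_sum > 0.0 else 0.0
```

Setting `noisy_weight` to 1.0 recovers the ordinary mean cross-entropy, while 0.0 ignores the unverified clips entirely.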
for acoustic scenes. ASG can be used to generate audio
scenes for movies and computer games. Recently, neural networks
such as SampleRNN have been used for speech and
music generation. However, ASG is more challenging due to
the wide variety of acoustic scenes. Evaluating a generative
model is also difficult. In this paper, we propose to use a conditional
SampleRNN model to generate acoustic scenes conditioned on
the input classes. We also propose objective criteria to evaluate
the quality and diversity of the generated samples based on
classification accuracy. The experiments on the DCASE 2016
Task 1 acoustic scene data show that the generated audio
samples achieve a classification accuracy of 65.5%, compared
to 6.7% for samples generated by a random model and 83.1%
for samples from real recordings. A classifier trained only
on generated samples achieves an accuracy of 51.3%, as
opposed to 6.7% with samples generated by a random model.
tagging performance on AudioSet embedding features has a weak correlation with the number of training examples and the quality of labels of each sound class.
sound events and estimating their spatial and temporal locations.
Using neural networks has become the prevailing method for SED.
In the area of sound localization, which is usually performed by estimating
the direction of arrival (DOA), learning-based methods have
recently been developed. In this paper, it is experimentally shown
that the trained SED model is able to contribute to the direction
of arrival estimation (DOAE). However, joint training of SED and
DOAE degrades the performance of both. Based on these results, a
two-stage polyphonic sound event detection and localization method
is proposed. The method learns SED first, after which the learned
feature layers are transferred for DOAE. It then uses the SED ground
truth as a mask to train DOAE. The proposed method is evaluated on
the DCASE 2019 Task 3 dataset, which contains different overlapping
sound events in different environments. Experimental results
show that the proposed method is able to improve the performance
of both SED and DOAE, and also performs significantly better than
the baseline method.
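
Using the SED ground truth as a mask during DOAE training can be sketched as a loss that only counts (frame, class) cells where an event is actually active. This is a hedged illustration: the squared-error regression on (azimuth, elevation) pairs, the nested-list layout, and the function name are assumptions, and angle wrap-around is ignored for brevity.

```python
def masked_doa_loss(doa_pred, doa_true, sed_true):
    """Mean squared DOA error over active (frame, class) cells only.

    doa_pred[t][k], doa_true[t][k] : (azimuth, elevation) for frame t, class k
    sed_true[t][k]                 : 1 if class k is active in frame t, else 0
    """
    total, n_active = 0.0, 0
    for pred_f, true_f, act_f in zip(doa_pred, doa_true, sed_true):
        for pred, true, active in zip(pred_f, true_f, act_f):
            if active:  # inactive cells carry no meaningful direction
                total += sum((p - t) ** 2 for p, t in zip(pred, true))
                n_active += 1
    return total / n_active if n_active else 0.0
```

With the mask in place, the direction estimates for classes that are silent in a frame do not contribute to the loss, whatever values the network outputs for them.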
(OOD) instances: data that does not belong to any of the target classes, but is labelled as such. We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method uses an auxiliary classifier, trained on data that is known to be
in-distribution, for detection and relabelling. The amount of data required for this is shown to be small. Experiments are carried out on the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin. Comparisons with other noise-robust techniques are similarly encouraging.
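
One simple way to picture detection and relabelling with an auxiliary classifier is a confidence threshold on its predictions. The sketch below is an assumption-laden illustration, not the method as specified here: the threshold rule, the value 0.9, the hard (rather than soft) relabelling, and the function name are all hypothetical.

```python
def relabel(aux_probs, noisy_labels, threshold=0.9):
    """Replace a label when the auxiliary classifier confidently disagrees.

    aux_probs[k]    : class probabilities from the auxiliary classifier for clip k
    noisy_labels[k] : the original, possibly incorrect, class index of clip k
    """
    new_labels = []
    for probs, label in zip(aux_probs, noisy_labels):
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != label and probs[best] >= threshold:
            new_labels.append(best)   # confident disagreement: relabel
        else:
            new_labels.append(label)  # otherwise keep the original label
    return new_labels
```

Instances that the auxiliary classifier is unsure about keep their original labels, so only confident disagreements alter the training set.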