Polyphonic sound event localization and detection is not only detecting what sound events are happening but localizing corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduces additional challenges in moving sound sources and overlapping-event cases, which include two events of the same type with two different direction-of-arrival (DoA) angles. In this paper, a novel event-independent network for polyphonic sound event lo-calization and detection is proposed. Unlike the two-stage method we proposed in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are then fed into a 1-D convolutional layer to extract acoustic features. The network is then split into two parallel branches. The first branch is for sound event detection (SED), and the second branch is for DoA estimation. There are three types of predictions from the network, SED predictions, DoA predictions , and event activity detection (EAD) predictions that are used to combine the SED and DoA features for onset and offset estimation. All of these predictions have the format of two tracks indicating that there are at most two overlapping events. Within each track, there could be at most one event happening. This architecture introduces a problem of track permutation. To address this problem, a frame-level permutation invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Its performance on the Task 3 dataset is greatly increased as compared with that of the baseline method. Index Terms— Sound event localization and detection, direction of arrival, event-independent, permutation invariant training.
Kong Qiuqiang, Xu Yong, Iqbal Turab, Cao Yin, Wang Wenwu, Plumbley Mark D. (2019)Acoustic scene generation with conditional SampleRNN, In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019)
Institute of Electrical and Electronics Engineers (IEEE)
Acoustic scene generation (ASG) is a task to generate waveforms for acoustic scenes. ASG can be used to generate audio scenes for movies and computer games. Recently, neural networks such as SampleRNN have been used for speech and music generation. However, ASG is more challenging due to its wide variety. In addition, evaluating a generative model is also difficult. In this paper, we propose to use a conditional SampleRNN model to generate acoustic scenes conditioned on the input classes. We also propose objective criteria to evaluate the quality and diversity of the generated samples based on classification accuracy. The experiments on the DCASE 2016 Task 1 acoustic scene data show that with the generated audio samples, a classification accuracy of 65:5% can be achieved compared to samples generated by a random model of 6:7% and samples from real recording of 83:1%. The performance of a classifier trained only on generated samples achieves an accuracy of 51:3%, as opposed to an accuracy of 6:7% with samples generated by a random model.
Iqbal Turab, Kong Qiuqiang, Plumbley Mark D, Wang Wenwu (2018)General-purpose audio tagging from noisy labels using convolutional neural networks, In: Plumbley Mark D, Kroos Christian, Bello JP, Richard G, Ellis DPW, Mesaros A (eds.), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018pp. 212-216
Tampere University of Technology
General-purpose audio tagging refers to classifying sounds that are of a diverse nature, and is relevant in many applications where domain-specific information cannot be exploited. The DCASE 2018 challenge introduces Task 2 for this very problem. In this task, there are a large number of classes and the audio clips vary in duration. Moreover, a subset of the labels are noisy. In this paper, we propose a system to address these challenges. The basis of our system is an ensemble of convolutional neural networks trained on log-scaled mel spectrograms. We use preprocessing and data augmentation methods to improve the performance further. To reduce the effects of label noise, two techniques are proposed: loss function weighting and pseudo-labeling. Experiments on the private test set of this task show that our system achieves state-of-the-art performance with a mean average precision score of 0.951
In supervised machine learning, the assumption that training data is labelled correctly is not always satisfied. In this paper, we investigate an instance of labelling error for classification tasks in which the dataset is corrupted with out-of-distribution (OOD) instances: data that does not belong to any of the target classes, but is labelled as such. We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method uses an auxiliary classifier, trained on data that is known to be in-distribution, for detection and relabelling. The amount of data required for this is shown to be small. Experiments are carried out on the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin. Comparisons with other noise-robust techniques are similarly encouraging.
The Detection and Classiﬁcation of Acoustic Scenes and Events (DCASE) consists of ﬁve audio classiﬁcation and sound event detectiontasks: 1)Acousticsceneclassiﬁcation,2)General-purposeaudio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classiﬁcation. In this paper, we create a cross-task baseline system for all ﬁve tasks based on a convlutional neural network (CNN): a “CNN Baseline” system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same conﬁguration of neural networks. Experiments show that deeper CNN with 8 layers performs better than CNN with 4 layers on all tasks except Task 1. Using CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 andanareaunderthecurve(AUC)of0.854onTask3,asoundevent detectionF1scoreof20.8%onTask4,andanF1scoreof87.75%on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
The detection of acoustic scenes is a challenging problem in which environmental sound events must be detected from a given audio signal. This includes classifying the events as well as estimating their onset and offset times. We approach this problem with a neural network architecture that uses the recentlyproposed capsule routing mechanism. A capsule is a group of activation units representing a set of properties for an entity of interest, and the purpose of routing is to identify part-whole relationships between capsules. That is, a capsule in one layer is assumed to belong to a capsule in the layer above in terms of the entity being represented. Using capsule routing, we wish to train a network that can learn global coherence implicitly, thereby improving generalization performance. Our proposed method is evaluated on Task 4 of the DCASE 2017 challenge. Results show that classification performance is state-of-the-art, achieving an Fscore of 58.6%. In addition, overfitting is reduced considerably compared to other architectures.
Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
—Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification , speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs: https://github.com/qiuqiangkong/audioset_tagging_cnn.
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, while the onset and offset times are unknown. To address the weakly-labelled audio tagging problem, we propose attention neural networks as a way to attend the most salient parts of an audio clip. We bridge the connection between attention neural networks and multiple instance learning (MIL) methods, and propose decision-level and feature-level attention neural networks for audio tagging. We investigate attention neural networks modelled by different functions, depths and widths. Experiments on AudioSet show that the feature-level attention neural network achieves a state-of-the-art mean average precision (mAP) of 0.369, outperforming the best multiple instance learning (MIL) method of 0.317 and Google’s deep neural network baseline of 0.314. In addition, we discover that the audio tagging performance on AudioSet embedding features has a weak correlation with the number of training examples and the quality of labels of each sound class.
Situated in the domain of urban sound scene classiﬁcation by humans and machines, this research is the ﬁrst step towards mapping urban noise pollution experienced indoors and ﬁnding ways to reduce its negative impact in peoples’ homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations and four different window states; two stationary states (open and close) and two transitional states (open to close and close to open). We have then built our machine recognition base lines for different scenarios (open set versus closed set) using a deep learning framework. The human listening test is also performed to be able to compare the human and machine performance for detecting the window state just using the acoustic cues. Our experimental results reveal that when using a simple machine baseline system, humans and machines are achieving similar average performance for closed set experiments.
Approximate message passing (AMP) algorithms have shown great promise in sparse signal reconstruction due to their low computational requirements and fast convergence to an exact solution. Moreover, they provide a probabilistic framework that is often more intuitive than alternatives such as convex optimisation. In this paper, AMP is used for audio source separation from underdetermined instantaneous mixtures. In the time-frequency domain, it is typical to assume a priori that the sources are sparse, so we solve the corresponding sparse linear inverse problem using AMP. We present a block-based approach that uses AMP to process multiple time-frequency points simultaneously. Two algorithms known as AMP and vector AMP (VAMP) are evaluated in particular. Results show that they are promising in terms of artefact suppression.