Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.
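The equal error rate (EER) quoted above is the operating point at which the false positive rate equals the false negative rate. A minimal sketch of how it could be estimated for a single tag is given below; the function name `equal_error_rate` and the threshold-sweep approach are illustrative assumptions, not the challenge's official scoring code, and both positive and negative examples are assumed to be present.

```python
import numpy as np

def equal_error_rate(labels, scores):
    """Approximate the EER for one tag by sweeping every observed score as a threshold.

    Hypothetical helper for illustration; assumes both classes occur in `labels`.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)
    # Add one threshold above the maximum so "reject everything" is also considered.
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        predicted = scores >= t
        fpr = np.mean(predicted[~labels])   # negatives wrongly accepted
        fnr = np.mean(~predicted[labels])   # positives wrongly rejected
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (fpr + fnr) / 2.0
    return eer
```

In a multi-tag setting such as the one described, a per-tag value like this would typically be averaged over tags; that averaging step is also an assumption here.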
In this paper, we present a deep neural network (DNN)-based acoustic
scene classification framework. Two hierarchical learning methods
are proposed to improve the DNN baseline performance by incorporating
the hierarchical taxonomy information of environmental
sounds. Firstly, the parameters of the DNN are initialized by the
proposed hierarchical pre-training. A multi-level objective function
is then adopted to add further constraints to the cross-entropy-based
loss function. A series of experiments was conducted on Task 1
of the Detection and Classification of Acoustic Scenes and Events
(DCASE) 2016 challenge. The final DNN-based system achieved
a 22.9% relative improvement on average scene classification error
as compared with the Gaussian Mixture Model (GMM)-based
benchmark system across four standard folds.
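The abstract does not give the exact form of the multi-level objective, so the following is only one plausible sketch: a cross-entropy term on the fine scene label plus a weighted cross-entropy term on a coarser category, whose probability is obtained by summing the fine-class posteriors that belong to it. The names `multi_level_loss`, `fine_to_coarse` and `weight` are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def multi_level_loss(logits, fine_label, fine_to_coarse, weight=0.5):
    """Fine-level cross-entropy plus a weighted coarse-level cross-entropy.

    fine_to_coarse[i] is the (assumed) coarse category index of fine class i.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    fine_to_coarse = np.asarray(fine_to_coarse)
    fine_loss = -np.log(probs[fine_label])
    # Coarse posterior: total probability of all fine classes sharing the parent.
    coarse_label = fine_to_coarse[fine_label]
    coarse_prob = probs[fine_to_coarse == coarse_label].sum()
    coarse_loss = -np.log(coarse_prob)
    return fine_loss + weight * coarse_loss
```

With this form, confusing two scenes within the same coarse category is penalised less than confusing scenes across categories, which is one way taxonomy information can constrain the loss.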
Acoustic event detection for content analysis in most cases relies on large amounts of labeled data. However, manually annotating data is time-consuming, so few annotated resources are available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, relies only on weakly labeled data. This is highly desirable for some practical applications of audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task as a regression problem. Considering that only chunk-level rather than frame-level labels are available, all or almost all frames of the chunk were fed into the DNN to perform multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can map the audio feature sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improvements, such as Dropout and background noise aware training, were adopted to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could make good use of long-term temporal information by taking the whole chunk as input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.
Yan F, Kittler J, Windridge D, Christmas W, Mikolajczyk K, Cox S, Huang Q (2014) Automatic annotation of tennis games: An integration of audio, vision, and learning, IMAGE AND VISION COMPUTING 32 (11) pp. 896-903 ELSEVIER SCIENCE BV
Audio tagging aims to perform multi-label classification on audio
chunks and it is a newly proposed task in the Detection and
Classification of Acoustic Scenes and Events 2016 (DCASE
2016) challenge. This task encourages research efforts to better
analyze and understand the content of the huge amounts of
audio data on the web. The difficulty in audio tagging is that
it only has a chunk-level label without a frame-level label. This
paper presents a weakly supervised method to not only predict
the tags but also indicate the temporal locations of the acoustic
events that occur. The attention scheme is found to be effective
in identifying the important frames while ignoring the unrelated
frames. The proposed framework is a deep convolutional recurrent
model with two auxiliary modules: an attention module
and a localization module. The proposed algorithm was evaluated
on Task 4 of the DCASE 2016 challenge. State-of-the-art
performance was achieved on the evaluation set with equal error
rate (EER) reduced from 0.13 to 0.11, compared with the
convolutional recurrent baseline system.
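To illustrate how an attention scheme can emphasise important frames when only chunk-level labels exist, the sketch below pools frame-level tag probabilities into a chunk-level prediction using softmax weights over time. This is a generic attention-pooling formulation written under that assumption, not necessarily the exact module used in the paper; `attention_pool` and its argument names are hypothetical.

```python
import numpy as np

def attention_pool(frame_probs, attention_logits):
    """Pool frame-level tag probabilities into chunk-level ones.

    frame_probs:      array of shape (T, K), per-frame probability of each of K tags
    attention_logits: array of shape (T, K), unnormalised importance of each frame
    Returns (chunk_probs of shape (K,), weights of shape (T, K)).
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    logits = np.asarray(attention_logits, dtype=float)
    # Softmax over the time axis, separately for every tag.
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Frames with larger weights contribute more to the chunk-level decision.
    chunk_probs = (weights * frame_probs).sum(axis=0)
    return chunk_probs, weights
```

The returned weights can be inspected along the time axis to see which frames the model attends to for each tag, which is the sense in which such a scheme can hint at temporal locations.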
Zhao L, Luo D, Wu J, Huang Q, Zhang W, Chen K, Liu T, Liu L, Zhang Y, Lui F, Russell T, Snaith H, Zhu R, Gong Q (2016) High-Performance Inverted Planar Heterojunction Perovskite Solar Cells Based on Lead Acetate Precursor with Efficiency Exceeding 18%, Advanced Functional Materials 26 (20) pp. 3508-3514
Organic-inorganic lead halide perovskites are emerging materials for the next-generation photovoltaics. Lead halides are the most commonly used lead precursors for perovskite active layers. Recently, lead acetate (Pb(Ac)2) has shown its superiority as the potential replacement for traditional lead halides. Here, we demonstrate a strategy to improve the efficiency for the perovskite solar cell based on lead acetate precursor. We utilized methylammonium bromide as an additive in the Pb(Ac)2 and methylammonium iodide precursor solution, resulting in uniform, compact and pinhole-free perovskite films. We observed enhanced charge carrier extraction between the perovskite layer and charge collection layers and delivered a champion power conversion efficiency of 18.3% with a stabilized output efficiency of 17.6% at the maximum power point. The optimized devices also exhibited negligible current density-voltage (J-V) hysteresis under the scanning conditions.
Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper we make contributions to audio tagging in two areas: acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multi-label classification for the expected tags, considering that only chunk (or utterance) level rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter bank (MFB) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system achieves a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task, while the EER of the first-prize system in this challenge is 0.17.
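For readers comparing the absolute and relative figures above, the usual definition of relative reduction is (baseline - proposed) / baseline. The short sketch below applies it to the two EER pairs quoted in the abstract; the helper name `relative_reduction` is hypothetical, and the inputs are the rounded values as printed, so the results are approximate.

```python
def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` lowers an error measure relative to `baseline`."""
    return (baseline - proposed) / baseline

# Development set: GMM baseline 0.21 EER versus the proposed method at 0.13 EER.
dev_gain = relative_reduction(0.21, 0.13)    # roughly 0.38, i.e. about 38% relative

# Evaluation set: first-prize system 0.17 EER versus the proposed method at 0.15 EER.
eval_gain = relative_reduction(0.17, 0.15)   # roughly 0.12, i.e. about 12% relative
```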
Automatic and fast tagging of natural sounds in audio collections is a very challenging task due to wide acoustic variations, the large number of possible tags, and the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier and augmenting the tags with additional relevant tags. The classifiers are trained using marginal co-regularization, which draws the pair of classifiers into agreement through joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task 4 of DCASE 2016. The results show that marginal co-regularization outperforms the baseline GMM in both efficiency and effectiveness.
In this paper, we propose a divide-and-conquer approach using
two generative adversarial networks (GANs) to explore
how a machine can draw colorful pictures (bird) using a small
amount of training data. In our work, we simulate the procedure
of an artist drawing a picture, where one begins with
drawing objects' contours and edges and then paints them
different colors. We adopt two GAN models to process basic
visual features including shape, texture and color. We use
the first GAN model to generate object shape, and then paint
the black and white image based on the knowledge learned
using the second GAN model. We run our experiments on
600 color images. The experimental results show that the use
of our approach can generate good quality synthetic images,
comparable to real ones.