I appreciate the financial support for my research from the following bodies (since 2008): Engineering and Physical Science Research Council (EPSRC), Ministry of Defence (MOD), Defence Science and Technology Laboratory (DSTL), Home Office (HO), Royal Academy of Engineering (RAENG), European Commission (EC), Samsung Electronics Research Institute UK (SAMSUNG), National Natural Science Foundation of China (NSFC), the University Research Support Fund (URSF), and the Ohio State University (OSU). [Total award to Surrey where I am a PI/CI: approximately £5.4M (as PI £1.2M, as CI £4.2M). As PI/CI, on a total grant award portfolio: approximately £15M]
Note: EEM - Master students module; EE1 - First-year undergraduate students module.
Find me on campus Room: 04 BB 01
"This book covers advances in algorithmic developments, theoretical frameworks, andexperimental research findings to assist professionals who want an improved ...
With the fast development of information acquisition, there is a rapid growth of multimodality data, e.g., text, audio, image and even video, in fields of health care, multimedia retrieval and scientific research. Confronted with the challenges of clustering, classification or regression with multi-modality information, it is essential to effectively measure the distance or similarity between objects described with heterogeneous features. Metric learning, aimed at finding a task-oriented distance function, is a hot topic in machine learning. However, most existing algorithms lack efficiency for highdimensional multi-modality tasks. In this work, we develop an effective and efficient metric learning algorithm for multi-modality data, i.e., Efficient Multi-modal Geometric Mean Metric Learning (EMGMML). The proposed algorithm learns a distinctive distance metric for each view by minimizing the distance between similar pairs while maximizing the distance between dissimilar pairs. To avoid overfitting, the optimization objective is regularized by symmetrized LogDet divergence. EMGMML is very efficient in that there is a closed-formsolution for each distance metric. Experiment results show that the proposed algorithm outperforms the state-of-the-art metric learning methods in terms of both accuracy and efficiency.
Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the interested acoustic scene. In this paper we make contributions to audio tagging in two parts, respectively, acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multi-label classification for the expected tags, considering that only chunk (or utterance) level rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-Filter Banks (MFBs) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system can get a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains the state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task while EER of the first prize of this challenge is 0.17.
Using an acoustic vector sensor (AVS), an efficient method has been presented recently for direction-of-arrival (DOA) estimation of multiple speech sources via the clustering of the inter-sensor data ratio (AVS-ISDR). Through extensive experiments on simulated and recorded data, we observed that the performance of the AVS-DOA method is largely dependent on the reliable extraction of the target speech dominated time-frequency points (TD-TFPs) which, however, may be degraded with the increase in the level of additive noise and room reverberation in the background. In this paper, inspired by the great success of deep learning in speech recognition, we design two new soft mask learners, namely deep neural network (DNN) and DNN cascaded with a support vector machine (DNN-SVM), for multi-source DOA estimation, where a novel feature, namely, the tandem local spectrogram block (TLSB) is used as the input to the system. Using our proposed soft mask learners, the TD-TFPs can be accurately extracted under different noisy and reverberant conditions. Additionally, the generated soft masks can be used to calculate the weighted centers of the ISDR-clusters for better DOA estimation as compared with the original center used in our previously proposed AVS-ISDR. Extensive experiments on simulated and recorded data have been presented to show the improved performance of our proposed methods over two baseline AVS-DOA methods in presence of noise and reverberation.
Panning techniques, such as vector base amplitude panning (VBAP) are a widely-used practical approach for spatial sound reproduction using multiple loudspeakers. Although limited to a relatively small listening area, they are very efficient and offer good localisation accuracy, timbral quality as well as a graceful degradation of quality outside the sweet spot. The aim of this paper is to investigate optimal sound reproduction techniques that adopt some of the advantageous properties of VBAP, such as the sparsity and the locality of the active loudspeakers for the reproduction of a single audio object. To this end, we state the task of multi-loudspeaker panning as an `1 optimization problem. We demonstrate and prove that the resulting solutions are exactly sparse. Moreover, we show the effect of adding a nonnegativity constraint on the loudspeaker gains in order to preserve the locality of the panning solution. Adding this constraint, `1- optimal panning can be formulated as a linear program. Using this representation, we prove that unique `1-optimal panning solutions incorporating a nonnegativity constraint are identical to VBAP using a Delaunay triangulation for the loudspeaker setup. Using results from linear programming and duality theory, we describe properties and special cases, such as solution ambiguity, of the VBAP solution.
Blind deconvolution is an ill-posed problem. To solve such a prob- lem, prior information, such as, the sparseness of the source (i.e. input) signal or channel impulse responses, is usually adopted. In speech deconvolution, the source signal is not naturally sparse. However, the direct impulse and early reflections of the impulse responses of an acoustic system can be considered as sparse. In this paper, we exploit the channel sparsity and present an algorithm for speech deconvolution, where the dynamic range of the convolutive speech is also used as the prior information. In this algorithm, the estimation of the impulse response and the source signal is achieved by alternating between two steps, namely, the ℓ1 regularized least squares optimization and a proximal operation. As demonstrated in our experiments, the proposed method pro- vides superior performance for deconvolution of a sparse acoustic system, as compared with two state-of-the-art methods.
The multiplicative noise removal problem for a corrupted image has recently been considered under the framework of regularization based approaches, where the regularizations are typically de ned on sparse dictionaries and/or total va- riation (TV). This framework was demonstrated to be e ective. However, the sparse regularizers used so far are based overwhelmingly on the synthesis model, and the TV based regularizer may induce the stair-casing e ect in the recon- structed image. In this paper, we propose a new method using a sparse analysis model. Our formulation contains a data delity term derived from the distri- bution of the noise and two regularizers. One regularizer employs a learned analysis dictionary, and the other regularizer is an enhanced TV by introducing a parameter to control the smoothness constraint de ned on pixel-wise di er- ences. To address the resulting optimization problem, we adapt the alternating direction method of multipliers (ADMM) framework, and present a new method where a relaxation technique is developed to update the variables exibly with either image patches or the whole image, as required by the learned dictionary and the enhanced TV regularizers, respectively. Experimental results demon- strate the improved performance of the proposed method as compared with several recent baseline methods, especially for relatively high noise levels.
Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as “image-source reversion”, which localizes the image source before finding the reflector position, and “direct localization”, which localizes the reflector without intermediate steps. We present five new contributions. First, an onset detector, called the clustered dynamic programming projected phase-slope algorithm, is proposed to automatically extract the time of arrival for early reflections within the RIRs of a compact microphone array. Second, we propose an image-source reversion method that uses the RIRs from a single loudspeaker. It is constructed by combining an image source locator (the image source direction and range (ISDAR) algorithm), and a reflector locator (using the loudspeaker-image bisection (LIB) algorithm). Third, two variants of it, exploiting multiple loudspeakers, are proposed. Fourth, we present a direct localization method, the ellipsoid tangent sample consensus (ETSAC), exploiting ellipsoid properties to localize the reflector. Finally, systematic experiments on simulated and measured RIRs are presented, comparing the proposed methods with the state-of-the-art. ETSAC generates errors lower than the alternative methods compared through our datasets. Nevertheless, the ISDAR-LIB combination performs well and has a run time 200 times faster than ETSAC.
Metric learning plays a fundamental role in the fields of multimedia retrieval and pattern recognition. Recently, an online multi-kernel similarity (OMKS) learning method has been presented for content-based image retrieval (CBIR), which was shown to be promising for capturing the intrinsic nonlinear relations within multimodal features from large-scale data. However, the similarity function in this method is learned only from labeled images. In this paper, we present a new framework to exploit unlabeled images and develop a semi-supervised OMKS algorithm. The proposed method is a multi-stage algorithm consisting of feature selection, selective ensemble learning, active sample selection and triplet generation. The novel aspects of our work are the introduction of classification confidence to evaluate the labeling process and select the reliably labeled images to train the metric function, and a method for reliable triplet generation, where a new criterion for sample selection is used to improve the accuracy of label prediction for unlabelled images. Our proposed method offers advantages in challenging scenarios, in particular, for a small set of labeled images with high-dimensional features. Experimental results demonstrate the effectiveness of the proposed method as compared with several baseline methods.
Head pose is an important cue in many applications such as, speech recognition and face recognition. Most approaches to head pose estimation to date have focussed on the use of visual information of a subject’s head. These visual approaches have a number of limitations such as, an inability to cope with occlusions, changes in the appearance of the head, and low resolution images. We present here a novel method for determining coarse head pose orientation purely from audio information, exploiting the direct to reverberant speech energy ratio (DRR) within a reverberant room environment. Our hypothesis is that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This method has the advantage of actually exploiting the reverberations within a room rather than trying to suppress them. This also has the practical advantage that most enclosed living spaces, such as meeting rooms or offices are highly reverberant environments. In order to test this hypothesis we also present a new data set featuring 56 subjects recorded in three different rooms, with different acoustic properties, adopting 8 different head poses in 4 different room positions captured with a 16 element microphone array. As far as the authors are aware this data set is unique and will make a significant contribution to further work in the area of audio head pose estimation. Using this data set we demonstrate that our proposed method of using the DRR for audio head pose estimation provides a significant improvement over previous methods.
The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles as they need to be propagated over the entire image in every frame to detect the new speaker presence in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (VSMC- PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named as AVMS-SMCPHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD based on the AV16.3, AMI and CLEAR datasets.
© 1999-2012 IEEE.The problem of tracking multiple moving speakers in indoor environments has received much attention. Earlier techniques were based purely on a single modality, e.g., vision. Recently, the fusion of multi-modal information has been shown to be instrumental in improving tracking performance, as well as robustness in the case of challenging situations like occlusions (by the limited field of view of cameras or by other speakers). However, data fusion algorithms often suffer from noise corrupting the sensor measurements which cause non-negligible detection errors. Here, a novel approach to combining audio and visual data is proposed. We employ the direction of arrival angles of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. This approach is further improved by solving a typical problem associated with the PF, whose efficiency and accuracy usually depend on the number of particles and noise variance used in state estimation and particle propagation. Both parameters are specified beforehand and kept fixed in the regular PF implementation which makes the tracker unstable in practice. To address these problems, we design an algorithm which adapts both the number of particles and noise variance based on tracking error and the area occupied by the particles in the image. Experiments on the AV16.3 dataset show the advantage of our proposed methods over the baseline PF method and an existing adaptive PF algorithm for tracking occluded speakers with a significantly reduced number of particles.
Recent studies show that facial information contained in visual speech can be helpful for the performance enhancement of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterization of the coherence between the audio and visual speech using, e.g., a Gaussian mixture model (GMM). In this paper, we present three contributions. With the synchronized features, we propose an adapted expectation maximization (AEM) algorithm to model the audiovisual coherence in the off-line training process. To improve the accuracy of this coherence model, we use a frame selection scheme to discard nonstationary features. Then with the coherence maximization technique, we develop a new sorting method to solve the permutation problem in the frequency domain. We test our algorithm on a multimodal speech database composed of different combinations of vowels and consonants. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS, which confirms the benefit of using visual speech to assist in separation of the audio. © 2011 Elsevier B.V. All rights reserved.
We consider the data-driven dictionary learning problem. The goal is to seek an over-complete dictionary from which every training signal can be best approximated by a linear combination of only a few codewords. This task is often achieved by iteratively executing two operations: sparse coding and dictionary update. The focus of this paper is on the dictionary update step, where the dictionary is optimized with a given sparsity pattern. We propose a novel framework where an arbitrary set of codewords and the corresponding sparse coefficients are simultaneously updated, hence the term simultaneous codeword optimization (SimCO). The SimCO formulation not only generalizes benchmark mechanisms MOD and K-SVD, but also allows the discovery that singular points, rather than local minima, are the major bottleneck of dictionary update. To mitigate the problem caused by the singular points, regularized SimCO is proposed. First and second order optimization procedures are designed to solve regularized SimCO. Simulations show that regularization substantially improves the performance of dictionary learning. © 1991-2012 IEEE.
Binaural features of interaural level difference and interaural phase difference have proved to be very effective in training deep neural networks (DNNs), to generate timefrequency masks for target speech extraction in speech-speech mixtures. However, effectiveness of binaural features is reduced in more common speech-noise scenarios, since the noise may over-shadow the speech in adverse conditions. In addition, the reverberation also decreases the sparsity of binaural features and therefore adds difficulties to the separation task. To address the above limitations, we highlight the spectral difference between speech and noise spectra and incorporate the log-power spectra features to extend the DNN input. Tested on two different reverberant rooms at different signal to noise ratios (SNR), our proposed method shows advantages over the baseline method using only binaural features in terms of signal to distortion ratio (SDR) and Short-Time Perceptual Intelligibility (STOI).
We study the problem of wideband direction of arrival (DoA) estimation by joint optimisation of array and spatial sparsity. Two-step iterative process is proposed. In the first step, the wideband signal is reshaped and used as the input to derive the weight coefficients using a sparse array optimisation method. The weights are then used to scale the observed signal model for which a compressive sensing based spatial sparsity optimisation method is used for DoA estimation. Simulations are provided to demonstrate the performance of the proposed method for both stationary and moving sources.
Deep neural networks (DNN) have recently been shown to give state-of-the-art performance in monaural speech enhancement. However in the DNN training process, the perceptual difference between different components of the DNN output is not fully exploited, where equal importance is often assumed. To address this limitation, we have proposed a new perceptually-weighted objective function within a feedforward DNN framework, aiming to minimize the perceptual difference between the enhanced speech and the target speech. A perceptual weight is integrated into the proposed objective function, and has been tested on two types of output features: spectra and ideal ratio masks. Objective evaluations for both speech quality and speech intelligibility have been performed. Integration of our perceptual weight shows consistent improvement on several noise levels and a variety of different noise types.
We study the problem of dictionary learning for signals that can be represented as polynomials or polynomial matrices, such as convolutive signals with time delays or acoustic impulse responses. Recently, we developed a method for polynomial dictionary learning based on the fact that a polynomial matrix can be expressed as a polynomial with matrix coefficients, where the coefficient of the polynomial at each time lag is a scalar matrix. However, a polynomial matrix can be also equally represented as a matrix with polynomial elements. In this paper, we develop an alternative method for learning a polynomial dictionary and a sparse representation method for polynomial signal reconstruction based on this model. The proposed methods can be used directly to operate on the polynomial matrix without having to access its coefficients matrices. We demonstrate the performance of the proposed method for acoustic impulse response modeling.
Audio tagging aims to perform multi-label classification on audio chunks and it is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has a chunk-level label without a frame-level label. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurred acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on the Task 4 of DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set with equal error rate (EER) reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline system.
A method based on Deep Neural Networks (DNNs) and time-frequency masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener which is then used to generate soft time-frequency masks for the recovery/estimation of the individual audio sources. In this paper, an algorithm called ‘dropout’ will be applied to the hidden layers, affecting the sparsity of hidden units activations: randomly selected neurons and their connections are dropped during the training phase, preventing feature co-adaptation. These methods are evaluated on binaural mixtures generated with Binaural Room Impulse Responses (BRIRs), accounting a certain level of room reverberation. The results show that the proposed DNNs system with randomly deleted neurons is able to achieve higher SDRs performances compared to the baseline method without the dropout algorithm.
Sequential Monte Carlo probability hypothesis density (SMC- PHD) ltering has been recently exploited for audio-visual (AV) based tracking of multiple speakers, where audio data are used to inform the particle distribution and propagation in the visual SMC-PHD lter. How- ever, the performance of the AV-SMC-PHD lter can be a ected by the mismatch between the proposal and the posterior distribution. In this pa- per, we present a new method to improve the particle distribution where audio information (i.e. DOA angles derived from microphone array mea- surements) is used to detect new born particles and visual information (i.e. histograms) is used to modify the particles with particle ow (PF). Using particle ow has the bene t of migrating particles smoothly from the prior to the posterior distribution. We compare the proposed algo- rithm with the baseline AV-SMC-PHD algorithm using experiments on the AV16.3 dataset with multi-speaker sequences.
Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labeling of an audio clip is often based on the audio events in the clip and no event level label is provided to the user. Previous works have used the bag of frames model assume the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative and ignore uninformative sounds. Then only informative regions are used for classification. Experimental results on the “CHiME Home” dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing the event level label.
Musical noise is a recurrent issue that appears in spectral techniques for denoising or blind source separation. Due to localised errors of estimation, isolated peaks may appear in the processed spectrograms, resulting in annoying tonal sounds after synthesis known as “musical noise”. In this paper, we propose a method to assess the amount of musical noise in an audio signal, by characterising the impact of these artificial isolated peaks on the processed sound. It turns out that because of the constraints between STFT coefficients, the isolated peaks are described as time-frequency “spots” in the spectrogram of the processed audio signal. The quantification of these “spots”, achieved through the adaptation of a method for localisation of significant STFT regions, allows for an evaluation of the amount of musical noise. We believe that this will pave the way to an objective measure and a better understanding of this phenomenon.
Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.
Target tracking is a challenging task and generally no analytical solution is available, especially for the multi-target tracking systems. To address this problem, probability hypothesis density (PHD) filter is used by propagating the PHD instead of the full multi-target posterior. Recently, the particle flow filter based on the log homotopy provides a new way for state estimation. In this paper, we propose a novel sequential Monte Carlo (SMC) implementation for the PHD filter assisted by the particle flow (PF), which is called PF-SMCPHD filter. Experimental results show that our proposed filter has higher accuracy than the SMC-PHD filter and is computationally cheaper than the Gaussian mixture PHD (GM-PHD) filter.
Automatic and fast tagging of natural sounds in audio collections is a very challenging task due to wide acoustic variations, the large number of possible tags, the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier, and to augment the tags with additional relevant tags. Training the classifiers is implemented using marginal co-regularization, pair of which draws the two classifiers into agreement by a joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task4 of DCASE2016. The results obtained show that marginal co-regularization outperforms the baseline GMM in both ef- ficiency and effectiveness.
We address the problem of sparse signal reconstruction from a few noisy samples. Recently, a Covariance-Assisted Matching Pursuit (CAMP) algorithm has been proposed, improving the sparse coefficient update step of the classic Orthogonal Matching Pursuit (OMP) algorithm. CAMP allows the a-priori mean and covariance of the non-zero coefficients to be considered in the coefficient update step. In this paper, we analyze CAMP, which leads to a new interpretation of the update step as a maximum-a-posteriori (MAP) estimation of the non-zero coefficients at each step. We then propose to leverage this idea, by finding a MAP estimate of the sparse reconstruction problem, in a greedy OMP-like way. Our approach allows the statistical dependencies between sparse coefficients to be modelled, while keeping the practicality of OMP. Experiments show improved performance when reconstructing the signal from a few noisy samples.
Audio source separation aims to extract individual sources from mixtures of multiple sound sources. Many techniques have been developed such as independent compo- nent analysis, computational auditory scene analysis, and non-negative matrix factorisa- tion. A method based on Deep Neural Networks (DNNs) and time-frequency (T-F) mask- ing has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener which is then used to generate soft T-F masks for the recovery/estimation of the individual audio sources.
State-of-the-art binaural objective intelligibility measures (OIMs) require individual source signals for making intelligibility predictions, limiting their usability in real-time online operations. This limitation may be addressed by a blind source separation (BSS) process, which is able to extract the underlying sources from a mixture. In this study, a speech source is presented with either a stationary noise masker or a fluctuating noise masker whose azimuth varies in a horizontal plane, at two speech-to-noise ratios (SNRs). Three binaural OIMs are used to predict speech intelligibility from the signals separated by a BSS algorithm. The model predictions are compared with listeners' word identification rate in a perceptual listening experiment. The results suggest that with SNR compensation to the BSS-separated speech signal, the OIMs can maintain their predictive power for individual maskers compared to their performance measured from the direct signals. It also reveals that the errors in SNR between the estimated signals are not the only factors that decrease the predictive accuracy of the OIMs with the separated signals. Artefacts or distortions on the estimated signals caused by the BSS algorithm may also be concerns.
In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. Firstly, the parameters of the DNN are initialized by the proposed hierarchical pre-training. Multi-level objective function is then adopted to add more constraint on the cross-entropy based loss function. A series of experiments were conducted on the Task1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement on average scene classification error as compared with the Gaussian Mixture Model (GMM)-based benchmark system across four standard folds.
Acoustic event detection for content analysis in most cases relies on lots of labeled data. However, manually annotating data is a time-consuming task, which thus makes few annotated resources available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable to some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, the whole or almost whole frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can well map the audio features sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as the Dropout and background noise aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could well utilize the long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of DCASE 2016 challenge.
Significant amounts of user-generated audio content, such as sound effects, musical samples and music pieces, are uploaded to online repositories and made available under open licenses. Moreover, a constantly increasing amount of multimedia content, originally released with traditional licenses, is becoming public domain as its license expires. Nevertheless, the creative industries are not yet using much of all this content in their media productions. There is still a lack of familiarity and understanding of the legal context of all this open content, but there are also problems related with its accessibility. A big percentage of this content remains unreachable either because it is not published online or because it is not well organised and annotated. In this paper we present the Audio Commons Initiative, which is aimed at promoting the use of open audio content and at developing technologies with which to support the ecosystem composed by content repositories, production tools and users. These technologies should enable the reuse of this audio material, facilitating its integration in the production workflows used by the creative industries. This is a position paper in which we describe the core ideas behind this initiative and outline the ways in which we plan to address the challenges it poses.
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer visions, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. In Task 1 we obtained accuracy of 81.0% using Mel + DNN against 77.2% by using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained F value of 12.6% using Mel + DNN against 37.0% by using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained F value of 36.3% using Mel + DNN against 23.7% by using MFCCs + GMM. In Task 4 we obtained Equal Error Rate (ERR) of 18.9% using Mel + DNN against 20.9% by using MFCCs + GMM. Therefore the DNN improves the baseline in Task 1, 3, and 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.
In this paper, we propose a new method for underdetermined blind source separation of reverberant speech mixtures by classifying each time-frequency (T-F) point of the mixtures according to a combined variational Bayesian model of spatial cues, under sparse signal representation assumption. We model the T-F observations by a variational mixture of circularly-symmetric complex-Gaussians. The spatial cues, e.g. interaural level difference (ILD), interaural phase difference (IPD) and mixing vector cues, are modelled by a variational mixture of Gaussians. We then establish appropriate conjugate prior distributions for the parameters of all the mixtures to create a variational Bayesian framework. Using the Bayesian approach we then iteratively estimate the hyper-parameters for the prior distributions by optimizing the variational posterior distribution. The main advantage of this approach is that no prior knowledge of the number of sources is needed, and it will be automatically determined by the algorithm. The proposed approach does not suffer from overfitting problem, as opposed to the Expectation-Maximization (EM) algorithm, therefore it is not sensitive to initializations.
Reverberant speech source separation has been of great interest for over a decade, leading to two major approaches. One of them is based on statistical properties of the signals and mixing process known as blind source separation (BSS). The other approach named as computational auditory scene analysis (CASA) is inspired by human auditory system and exploits monaural and binaural cues. In this paper these two approaches are studied and compared in more depth.
Most of the binaural source separation algorithms only consider the dissimilarities between the recorded mixtures such as interaural phase and level differences (IPD, ILD) to classify and assign the time-frequency (T-F) regions of the mixture spectrograms to each source. However, in this paper we show that the coherence between the left and right recordings can provide extra information to label the T-F units from the sources. This also reduces the effect of reverberation which contains random reflections from different directions showing low correlation between the sensors. Our algorithm assigns the T-F regions into original sources based on weighted combination of IPD, ILD, the observation vectors models and the estimated interaural coherence (IC) between the left and right recordings. The binaural room impulse responses measured in four rooms with various acoustic conditions have been used to evaluate the performance of the proposed method which shows an improvement of more than 1:4 dB in signal-to-distortion ratio (SDR) in room D with T60 = 0:89 s over the state-of-the-art algorithms.
Acoustic vector sensor (AVS) based convolutive blind source separation problem has been recently addressed under the framework of probabilistic time-frequency (T-F) masking, where both the DOA and the mixing vector cues are modelled by Gaussian distributions. In this paper, we show that the distributions of these cues vary with room acoustics, such as reverberation. Motivated by this observation, we propose a mixed model of Laplacian and Gaussian distributions to provide a better fit for these cues. The parameters of the mixed model are estimated and refined iteratively by an expectation-maximization (EM) algorithm. Experiments performed on the speech mixtures in simulated room environments show that the mixed model offers an average of about 0.68 dB and 1.18 dB improvements in signal-to-distotion (SDR) over the Gaussian and Laplacian model, respectively. © 2013 IEEE.
Most existing speech source separation algorithms have been developed for separating sound mixtures acquired by using a conventional microphone array. In contrast, little attention has been paid to the problem of source separation using an acoustic vector sensor (AVS). We propose a new method for the separation of convolutive mixtures by incorporating the intensity vector of the acoustic field, obtained using spatially co-located microphones which carry the direction of arrival (DOA) information. The DOA cues from the intensity vector, together with the frequency bin-wise mixing vector cues, are then used to determine the probability of each time-frequency (T-F) point of the mixture being dominated by a specific source, based on the Gaussian mixture models (GMM), whose parameters are evaluated and refined iteratively using an expectation-maximization (EM) algorithm. Finally, the probability is used to derive the T-F masks for recovering the sources. The proposed method is evaluated in simulated reverberant environments in terms of signal-to-distortion ratio (SDR), giving an average improvement of approximately 1:5 dB as compared with a related T-F mask approach based on a conventional microphone setting. © 2013 EURASIP.
In the past, both theoretical work and practical implementation of particle filtering (PF) method have been extensively studied. However, its application in underwater signal processing has received much less attention. This paper intends to introduce PF approach for underwater acoustic signal processing. Particularly, we are interested in direction of arrival (DOA) estimation using PF. A detailed introduction along with this perspective is presented in this paper. Since the noise usually spreads the mainlobe of likelihood function and causes problem in subsequent particle resampling step, an exponential weighted likelihood model is developed to emphasize particles at more relevant area. Hence, the the effect due to background noise can be reduced. Real underwater acoustic data collected in SWELLEx-96 experiment are employed to demonstrate the performance of the proposed PF approaches for underwater DOA tracking. © 2013 IEEE.
Dictionary learning algorithms are typically derived for dealing with one or two dimensional signals using vector-matrix operations. Little attention has been paid to the problem of dictionary learning over high dimensional tensor data. We propose a new algorithm for dictionary learning based on tensor factorization using a TUCKER model. In this algorithm, sparseness constraints are applied to the core tensor, of which the n-mode factors are learned from the input data in an alternate minimization manner using gradient descent. Simulations are provided to show the convergence and the reconstruction performance of the proposed algorithm. We also apply our algorithm to the speaker identification problem and compare the discriminative ability of the dictionaries learned with those of TUCKER and K-SVD algorithms. The results show that the classification performance of the dictionaries learned by our proposed algorithm is considerably better as compared to the two state of the art algorithms. © 2013 IEEE.
Particle filtering has emerged as a useful tool for tracking problems. However, the efficiency and accuracy of the filter usually depend on the number of particles and noise variance used in the estimation and propagation functions for re-allocating these particles at each iteration. Both of these parameters are specified beforehand and are kept fixed in the regular implementation of the filter which makes the tracker unstable in practice. In this paper we are interested in the design of a particle filtering algorithm which is able to adapt the number of particles and noise variance. The new filter, which is based on audio-visual (AV) tracking, uses information from the tracking errors to modify the number of particles and noise variance used. Its performance is compared with a previously proposed audio-visual particle filtering algorithm with a fixed number of particles and an existing adaptive particle filtering algorithm, using the AV 16.3 dataset with single and multi-speaker sequences. Our proposed approach demonstrates good tracking performance with a significantly reduced number of particles. © 2013 EURASIP.
Probabilistic models of binaural cues, such as the interaural phase difference (IPD) and the interaural level difference (ILD), can be used to obtain the audio mask in the time-frequency (TF) domain, for source separation of binaural mixtures. Those models are, however, often degraded by acoustic noise. In contrast, the video stream contains relevant information about the synchronous audio stream that is not affected by acoustic noise. In this paper, we present a novel method for modeling the audio-visual (AV) coherence based on dictionary learning. A visual mask is constructed from the video signal based on the learnt AV dictionary, and incorporated with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation. We tested our algorithm on the XM2VTS database, and observed considerable performance improvement for noise corrupted signals.
Dictionary learning aims to adapt elementary codewords directly from training data so that each training signal can be best approximated by a linear combination of only a few codewords. Following the two-stage iterative processes: sparse coding and dictionary update, that are commonly used, for example, in the algorithms of MOD and K-SVD, we propose a novel framework that allows one to update an arbitrary set of codewords and the corresponding sparse coefficients simultaneously, hence termed simultaneous codeword optimization (SimCO). Under this framework, we have developed two algorithms, namely the primitive and the regularized SimCO. Simulations are provided to show the advantages of our approach over the K-SVD algorithm in terms of both learning performance and running speed. © 2012 IEEE.
This paper proposes a method for jointly performing blind source separation (BSS) and blind dereverberation (BD) for speech mixtures. In most of the previous studies, BSS and BD have been explored separately. It is common that the performance of the speech separation algorithms deteriorates with the increase of room reverberations. Also most of the dereverberation algorithms rely on the availability of room impulse responses (RIRs) which are not readily accessible in practice. Therefore in this work the dereverberation and separation method are combined to mitigate the effects of room reverberations on the speech mixtures and hence to improve the separation performance. As required by the dereverberation algorithm, a step for blind estimation of reverberation time (RT) is used to estimate the decay rate of reverberations directly from the reverberant speech signal (i.e., speech mixtures) by modeling the decay as a Laplacian random process modulated by a deterministic envelope. Hence the developed algorithm works in a blind manner, i.e., directly dealing with the reverberant speech signals without explicit information from the RIRs. Evaluation results in terms of signal to distortion ratio (SDR) and segmental signal to reverberation ratio (SegSRR) reveal that using this method the performance of the separation algorithm that we have developed previously can be further enhanced. © 2012 EURASIP.
Suppression of late reverberations is a challenging problem in reverberant speech enhancement. A promising recent approach to this problem is to apply a spectral subtraction mask to the spectrum of the reverberant speech, where the spectral variance of the late reverberations was estimated based on a frequency independent statistical model of the decay rate of the late reverberations. In this paper, we develop a dereverberation algorithm by following a similar process. Instead of using the frequency independent model, however, we estimate the frequency dependent reverberation time and decay rate, and use them for the estimation of the spectral subtraction mask. In order to remove the processing artifacts, the mask is further filtered by a smoothing function, and then applied to reduce the late reverberations from the reverberant speech. The performance of the proposed algorithm, measured by the segmental signal to reverberation ratio (SegSRR) and the signal to distortion ratio (SDR), is evaluated for both simulated and real data. As compared with the related frequency indepenent algorithm, the proposed algorithm offers considerable performance improvement.
We propose a novel algorithm for the enhancement of noisy reverberant speech using empirical-mode-decomposition (EMD) based subband processing. The proposed algorithm is a one-microphone multistage algorithm. In the first step, noisy reverberant speech is decomposed adaptively into oscillatory components called intrinsic mode functions (IMFs) via an EMD algorithm. Denoising is then applied to selected high frequency IMFs using EMD-based minimum mean-squared error (MMSE) filter, followed by spectral subtraction of the resulting denoised high-frequency IMFs and low-frequency IMFs. Finally, the enhanced speech signal is reconstructed from the processed IMFs. The method was motivated by our observation that the noise and reverberations are disproportionally distributed across the IMF components. Therefore, different levels of suppression can be applied to the additive noise and reverberation in each IMF. This leads to an improved enhancement performance as shown in comparison to a related recent approach, based on the measurements by the signal-to-noise ratio (SNR). © 2011 EURASIP.
Despite being studied extensively, the performance of blind source separation (BSS) is still limited especially for the sensor data collected in adverse environments. Recent studies show that such an issue can be mitigated by incorporating multimodal information into the BSS process. In this paper, we propose a method for the enhancement of the target speech separated by a BSS algorithm from sound mixtures, using visual voice activity detection (VAD) and spectral subtraction. First, a classifier for visual VAD is formed in the off-line training stage, using labelled features extracted from the visual stimuli. Then we use this visual VAD classifier to detect the voice activity of the target speech. Finally we apply a multi-band spectral subtraction algorithm to enhance the BSS-separated speech signal based on the detected voice activity. We have tested our algorithm on the mixtures generated artificially by the mixing filters with different reverberation times, and the results show that our algorithm improves the quality of the separated target signal. © 2011 IEEE.
A novel multimodal (audio-visual) approach to the problem of blind source separation (BSS) is evaluated in room environments. The main challenges of BSS in realistic environments are: 1) sources are moving in complex motions and 2) the room impulse responses are long. For moving sources the unmixing filters to separate the audio signals are difficult to calculate from only statistical information available from a limited number of audio samples. For physically stationary sources measured in rooms with long impulse responses, the performance of audio only BSS methods is limited. Therefore, visual modality is utilized to facilitate the separation. The movement of the sources is detected with a 3-D tracker based on a Markov Chain Monte Carlo particle filter (MCMC-PF), and the direction of arrival information of the sources to the microphone array is estimated. A robust least squares frequency invariant data independent (RLSFIDI) beamformer is implemented to perform real time speech enhancement. The uncertainties in source localization and direction of arrival information are also controlled by using a convex optimization approach in the beamformer design. A 16 element circular array configuration is used. Simulation studies based on objective and subjective measures confirm the advantage of beamforming based processing over conventional BSS methods. © 2011 EURASIP.
This paper presents a new method for reverberant speech separation, based on the combination of binaural cues and blind source separation (BSS) for the automatic classification of the time-frequency (T-F) units of the speech mixture spectrogram. The main idea is to model interaural phase difference, interaural level difference and frequency bin-wise mixing vectors by Gaussian mixture models for each source and then evaluate that model at each T-F point and assign the units with high probability to that source. The model parameters and the assigned regions are refined iteratively using the Expectation-Maximization (EM) algorithm. The proposed method also addresses the permutation problem of the frequency domain BSS by initializing the mixing vectors for each frequency channel. The EM algorithm starts with binaural cues and after a few iterations the estimated probabilistic mask is used to initialize and re-estimate the mixing vector model parameters. We performed experiments on speech mixtures, and showed an average of about 0.8 dB improvement in signal-to-distortion (SDR) over the binaural-only baseline. © 2011 IEEE.
Information from video has been used recently to address the issue of scaling ambiguity in convolutive blind source separation (BSS) in the frequency domain, based on statistical modeling of the audio-visual coherence with Gaussian mixture models (GMMs) in the feature space. However, outliers in the feature space may greatly degrade the system performance in both training and separation stages. In this paper, a new feature selection scheme is proposed to discard non-stationary features, which improves the robustness of the coherence model and reduces its computational complexity. The scaling parameters obtained by coherence maximization and non-linear interpolation from the selected features are applied to the separated frequency components to mitigate the scaling ambiguity. A multimodal database composed of different combinations of vowels and consonants was used to test our algorithm. Experimental results show the performance improvement with our proposed algorithm.
Separating multiple music sources from a single channel mixture is a challenging problem. We present a new approach to this problem based on non-negative matrix factorization (NMF) and note classification, assuming that the instruments used to play the sound signals are known a priori. The spectrogram of the mixture signal is first decomposed into building components (musical notes) using an NMF algorithm. The Mel frequency cepstrum coefficients (MFCCs) of both the decomposed components and the signals in the training dataset are extracted. The mean squared errors (MSEs) between the MFCC feature space of the decomposed music component and those of the training signals are used as the similarity measures for the decomposed music notes. The notes are then labelled to the corresponding type of instruments by the K nearest neighbors (K-NN) classification algorithm based on the MSEs. Finally, the source signals are reconstructed from the classified notes and the weighting matrices obtained from the NMF algorithm. Simulations are provided to show the performance of the proposed system. © 2011 Springer-Verlag Berlin Heidelberg.
Spontaneous speech in videos capturing the speaker's mouth provides bimodal information. Exploiting the relationship between the audio and visual streams, we propose a new visual voice activity detection (VAD) algorithm, to overcome the vulnerability of conventional audio VAD techniques in the presence of background interference. First, a novel lip extraction algorithm combining rotational templates and prior shape constraints with active contours is introduced. The visual features are then obtained from the extracted lip region. Second, with the audio voice activity vector used in training, adaboosting is applied to the visual features, to generate a strong final voice activity classifier by boosting a set of weak classifiers. We have tested our lip extraction algorithm on the XM2VTS database (with higher resolution) and some video clips from YouTube (with lower resolution). The visual VAD was shown to offer low error rates.
We present a novel method for speech separation from their audio mixtures using the audio-visual coherence. It consists of two stages: in the off-line training process, we use the Gaussian mixture model to characterise statistically the audio-visual coherence with features obtained from the training set; at the separation stage, likelihood maximization is performed on the independent component analysis (ICA)-separated spectral components. To address the permutation and scaling indeterminacies of the frequency-domain blind source separation (BSS), a new sorting and rescaling scheme using the bimodal coherence is proposed.We tested our algorithm on the XM2VTS database, and the results show that our algorithm can address the permutation problem with high accuracy, and mitigate the scaling problem effectively.
Recent studies show that visual information contained in visual speech can be helpful for the performance enhancement of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterisation of the coherence between the audio and visual speech using, e.g. a Gaussian mixture model (GMM). In this paper, we present two new contributions. An adapted expectation maximization (AEM) algorithm is proposed in the training process to model the audio-visual coherence upon the extracted features. The coherence is exploited to solve the permutation problem in the frequency domain using a new sorting scheme. We test our algorithm on the XM2VTS multimodal database. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS.
We present a novel method for extracting target speech from auditory mixtures using bimodal coherence, which is statistically characterised by a Gaussian mixture modal (GMM) in the offline training process, using the robust features obtained from the audio-visual speech. We then adjust the ICA-separated spectral components using the bimodal coherence in the time-frequency domain, to mitigate the scale ambiguities in different frequency bins. We tested our algorithm on the XM2VTS database, and the results show the performance improvement with our proposed algorithm in terms of SIR measurements.
"This book covers advances in algorithmic developments, theoretical frameworks, andexperimental research findings to assist professionals who want an improved ...
Sparse representation has been studied extensively in the past decade in a variety of applications, such as denoising, source separation and classification. Earlier effort has been focused on the well-known synthesis model, where a signal is decomposed as a linear combination of a few atoms of a dictionary. However, the analysis model, a counterpart of the synthesis model, has not received much attention until recent years. The analysis model takes a different viewpoint to sparse representation, and it assumes that the product of an analysis dictionary and a signal is sparse. Compared with the synthesis model, this model tends to be more expressive to represent signals, as a much richer union of subspaces can be described. This thesis focuses on the analysis model and aims to address the two main challenges: analysis dictionary learning (ADL) and signal reconstruction. In the ADL problem, the dictionary is learned from a set of training samples so that the signals can be represented sparsely based on the analysis model, thus offering the potential to fit the signals better than pre-defined dictionaries. Among the existing ADL algorithms, such as the well-known Analysis K-SVD, the dictionary atoms are updated sequentially. The first part of this thesis presents two novel analysis dictionary learning algorithms to update the atoms simultaneously. Specifically, the Analysis Simultaneous Codeword Optimization (Analysis SimCO) algorithm is proposed, by adapting the SimCO algorithm which is proposed originally for the synthesis model. In Analysis SimCO, the dictionary is updated using optimization on manifolds, under the $\ell_2$-norm constraints on the dictionary atoms. This framework allows multiple dictionary atoms to be updated simultaneously in each iteration. However, similar to the existing ADL algorithms, the dictionary learned by Analysis SimCO may contain similar atoms. To address this issue, Incoherent Analysis SimCO is proposed by employing a coherence constraint and introducing a decorrelation step to enforce this constraint. The competitive performance of the proposed algorithms is demonstrated in the experiments for recovering synthetic dictionaries and removing additional noise in images, as compared with existing ADL methods. The second part of this thesis studies how to reconstruct signals with learned dictionaries under the analysis model. This is demonstrated by a challenging application problem: multiplicative noise removal (MNR) of images
Page Owner: ww0003
Page Created: Thursday 16 September 2010 14:47:13 by lb0014
Last Modified: Monday 21 March 2016 13:55:59 by jg0036
Expiry Date: Friday 16 December 2011 14:42:55
Assembly date: Mon Aug 21 13:00:02 BST 2017
Content ID: 37289