Orwell J, Plumbley MD (1999) Maximizing information about a noisy signal with a single non-linear neuron, Ninth International Conference on Artificial Neural Networks (ICANN99), Vols 1 and 2 (470) pp. 581-586 Institution of Electrical Engineers
Gretsistas A, Plumbley MD (2012) Group Polytope Faces Pursuit for Recovery of Block-Sparse Signals., LVA/ICA 7191 pp. 255-262 Springer
We have previously proposed a structured sparse approach to piano transcription, with promising results recorded on a challenging dataset. The approach was evaluated in terms of both frame-based and onset-based metrics. Close inspection of the results revealed problems in capturing frames with low energy for a given note, for example in sustained notes. Further problems were noticed in onset detection: for many notes seen to be active in the output transcription, no onset was detected. A brief description of the approach is given here, and further analysis of the system is given by considering an oracle transcription, derived from the ground-truth piano roll and the given dictionary of spectral template atoms, which gives a clearer indication of the problems that need to be overcome in order to improve the proposed approach.
Hedayioglu FDL, Jafari MG, Mattos SDS, Plumbley MD, Coimbra MT (2011) Separating sources from sequentially acquired mixtures of heart signals., ICASSP pp. 653-656 IEEE
Plumbley MD (2013) Hearing the shape of a room., Proc Natl Acad Sci U S A 110 (30) pp. 12162-12163
Nesbit A, Vincent E, Plumbley MD (2009) Extension of Sparse, Adaptive Signal Decompositions to Semi-blind Audio Source Separation., ICA 5441 pp. 605-612 Springer
Welburn SJ, Plumbley MD (2009) Estimating parameters from audio for an EG+LFO model of pitch envelopes, Proceedings of the 12th International Conference on Digital Audio Effects, DAFx 2009 pp. 451-455
Envelope generator (EG) and Low Frequency Oscillator (LFO) parameters give a compact representation of audio pitch envelopes. By estimating these parameters from audio per-note, they could be used as part of an audio coding scheme. Recordings of various instruments and articulations were examined, and pitch envelopes found. Using an evolutionary algorithm, EG and LFO parameters for the envelopes were estimated. The resulting estimated envelopes are compared to both the original envelope, and to a fixed-pitch estimate. Envelopes estimated using EG+LFO can closely represent the envelope from the original audio and provide a more accurate estimate than the mean pitch.
We address the problem of sparse signal reconstruction from a few noisy samples. Recently, a Covariance-Assisted Matching Pursuit (CAMP) algorithm has been proposed, improving the sparse coefficient update step of the classic Orthogonal Matching Pursuit (OMP) algorithm. CAMP allows the a-priori mean and covariance of the non-zero coefficients to be considered in the coefficient update step. In this paper, we analyze CAMP, which leads to a new interpretation of the update step as a maximum-a-posteriori (MAP) estimation of the non-zero coefficients at each step. We then propose to leverage this idea, by finding a MAP estimate of the sparse reconstruction problem, in a greedy OMP-like way. Our approach allows the statistical dependencies between sparse coefficients to be modelled, while keeping the practicality of OMP. Experiments show improved performance when reconstructing the signal from a few noisy samples.
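The MAP coefficient update described above can be sketched as a small variation on classic OMP (an illustrative reconstruction, not the paper's exact algorithm: the function name, the fixed-sparsity stopping rule, and the Gaussian-prior parameterisation are our assumptions):

```python
import numpy as np

def map_omp(D, y, k, mu, cov, noise_var=1e-2):
    """OMP-style greedy pursuit with a MAP coefficient update.

    Sketch: assumes a Gaussian prior N(mu, cov) on the non-zero
    coefficients; the update solves a regularised least-squares
    problem instead of the plain OMP least-squares step.
    """
    support = []
    residual = y.copy()
    x = None
    for _ in range(k):
        # atom selection as in classic OMP
        corr = np.abs(D.T @ residual)
        corr[support] = -np.inf
        support.append(int(np.argmax(corr)))
        Ds = D[:, support]
        prior_mean = mu[support]
        prior_prec = np.linalg.inv(cov[np.ix_(support, support)])
        # MAP estimate of the coefficients on the current support
        A = Ds.T @ Ds / noise_var + prior_prec
        b = Ds.T @ y / noise_var + prior_prec @ prior_mean
        x = np.linalg.solve(A, b)
        residual = y - Ds @ x
    return support, x
```

With a weak prior (large covariance), the update reduces to the ordinary OMP least-squares step; a tighter prior pulls the estimate towards the prior mean.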
We describe our method for automatic bird species classification, which uses raw audio without segmentation and without using any auxiliary metadata. It successfully classifies among 501 bird categories, and was by far the highest scoring audio-only bird recognition algorithm submitted to BirdCLEF 2014. Our method uses unsupervised feature learning, a technique which learns regularities in spectro-temporal content without reference to the training labels, which helps a classifier to generalise to further content of the same type. Our strongest submission uses two layers of feature learning to capture regularities at two different time scales.
This paper is devoted to enhancing rapid decision-making and identification of lactobacilli from dental plaque using statistical and neural network methods. Current identification techniques such as clustering and principal component analysis are discussed with respect to the field of bacterial taxonomy. Decision-making using a multilayer perceptron neural network and a Kohonen self-organizing feature map is highlighted. Simulation work and corresponding results are presented, with the main emphasis on neural network convergence and identification capability using resubstitution, leave-one-out and cross-validation techniques. Rapid analyses on two separate sets of bacterial data from dental plaque revealed accuracy of more than 90% in the identification process. The risk of misdiagnosis was estimated at 14% in the worst case. Tests with unknown strains yield close correlation to cluster dendrograms. The use of the AXEON VindAX simulator indicated close correlation with these results. The paper concludes that artificial neural networks are suitable for use in the rapid identification of dental bacteria.
Several probabilistic models involving latent components have been proposed for modeling time-frequency (TF) representations of audio signals such as spectrograms, notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high-resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. The new model can represent a variety of stationary and non-stationary signals, including autoregressive moving average (ARMA) processes and mixtures of damped sinusoids. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to piano signals, and proves capable of accurately modeling reverberation, restoring missing observations, and separating pure tones with close frequencies.
Stowell D, Robertson A, Bryan-Kinns N, Plumbley MD (2009) Evaluation of live human-computer music-making: Quantitative and qualitative approaches., Int. J. Hum.-Comput. Stud. 67 11 pp. 960-975
Nishimori Y, Akaho S, Plumbley MD (2006) Riemannian optimization method on generalized flag manifolds for complex and subspace ICA, AIP Conference Proceedings 872 pp. 89-96
In this paper we introduce a new class of manifolds, generalized flag manifolds, for the complex and subspace ICA problems. A generalized flag manifold is a manifold consisting of subspaces which are orthogonal to each other. The class of generalized flag manifolds includes the class of Grassmann manifolds. We extend the Riemannian optimization method to include this new class of manifolds by deriving the formulas for the natural gradient and geodesics on these manifolds. We show how the complex and subspace ICA problems can be solved by optimization of cost functions on a generalized flag manifold. Computer simulations demonstrate that our algorithm gives good performance compared with the ordinary gradient descent method. © 2006 American Institute of Physics.
Toyama K, Plumbley MD (2009) Using phase linearity in frequency-domain ICA to tackle the permutation problem., ICASSP pp. 3165-3168 IEEE
Plumbley MD (2007) Geometry and manifolds for independent component analysis, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 4
In the last few years, there has been a great interest in the use of geometrical methods for independent component analysis (ICA), both to gain insight into the optimization process and to develop more efficient optimization algorithms. Much of this work involves concepts from differential geometry, such as Lie groups, Stiefel manifolds, or tangent planes that may be unfamiliar to signal processing researchers. The purpose of this tutorial paper is to introduce some of these geometry concepts to signal processing and ICA researchers, without assuming any existing background in differential geometry. The emphasis of the paper is on making the important concepts in this field accessible, rather than mathematical rigour. © 2007 IEEE.
Automatic music transcription (AMT) can be performed by deriving a pitch-time representation through decomposition of a spectrogram with a dictionary of pitch-labelled atoms. Typically, non-negative matrix factorisation (NMF) methods are used to decompose magnitude spectrograms. One atom is often used to represent each note. However, the spectrum of a note may change over time. Previous research considered this variability using different atoms to model specific parts of a note, or large dictionaries comprised of datapoints from the spectrograms of full notes. In this paper, the use of subspace modelling of note spectra is explored, with group sparsity employed as a means of coupling activations of related atoms into a pitched subspace. Stepwise and gradient-based methods for non-negative group sparse decompositions are proposed. Finally, a group sparse NMF approach is used to tune a generic harmonic subspace dictionary, leading to improved NMF-based AMT results.
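The group-sparse decomposition idea above can be sketched with a multiplicative NMF update carrying a group penalty on the activations (an illustrative adaptation under our own assumptions: the Euclidean cost, the penalty form lam * sum_g ||H_g||_2, and all names are ours, not the paper's exact update rules):

```python
import numpy as np

def group_sparse_nmf(V, n_groups, atoms_per_group, n_iter=200, lam=0.1, seed=0):
    """Euclidean NMF with a group-sparsity penalty on activations (sketch).

    Atoms are organised into pitched subspaces of `atoms_per_group`
    atoms each; the penalty couples the activations of atoms within
    a group, so a whole subspace switches on or off together.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    K = n_groups * atoms_per_group
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, T)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        # standard multiplicative update for the dictionary W
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # group penalty gradient: H_g / ||H_g||_2, per group and frame
        Hg = H.reshape(n_groups, atoms_per_group, T)
        norms = np.linalg.norm(Hg, axis=1, keepdims=True) + eps
        pen = (Hg / norms).reshape(K, T)
        H *= (W.T @ V) / (W.T @ W @ H + lam * pen + eps)
    return W, H
```

Because every factor in the updates is non-negative, non-negativity of W and H is preserved automatically throughout the iterations.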
For dictionary-based decompositions of certain types, it has been observed that there might be a link between sparsity in the dictionary and sparsity in the decomposition. Sparsity in the dictionary has also been associated with the derivation of fast and efficient dictionary learning algorithms. Therefore, in this paper we present a greedy adaptive dictionary learning algorithm that sets out to find sparse atoms for speech signals. The algorithm learns the dictionary atoms on data frames taken from a speech signal. It iteratively extracts the data frame with minimum sparsity index, and adds this to the dictionary matrix. The contribution of this atom to the data frames is then removed, and the process is repeated. The algorithm is found to yield a sparse signal decomposition, supporting the hypothesis of a link between sparsity in the decomposition and in the dictionary. The algorithm is applied to the problem of speech representation and speech denoising, and its performance is compared to other existing methods. The method is shown to find dictionary atoms that are sparser than their time-domain waveform, and also to result in a sparser speech representation. In the presence of noise, the algorithm is found to have similar performance to the well established principal component analysis. © 2011 IEEE.
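The extract-and-deflate loop described above can be sketched as follows (a minimal illustration assuming the common l1/l2 ratio as the sparsity index; function names and the orthogonal-deflation details are ours):

```python
import numpy as np

def sparsity_index(x):
    """l1/l2 ratio: smaller values indicate a sparser frame."""
    return np.sum(np.abs(x)) / np.linalg.norm(x)

def gad_dictionary(frames, n_atoms):
    """Greedy adaptive dictionary learning sketch.

    frames: (frame_len, n_frames) matrix of signal frames.
    At each step, pick the residual frame with the smallest
    sparsity index, normalise it as a new atom, then remove
    its contribution from all remaining residual frames.
    """
    residual = frames.astype(float).copy()
    atoms = []
    for _ in range(n_atoms):
        norms = np.linalg.norm(residual, axis=0)
        valid = np.where(norms > 1e-12)[0]
        if valid.size == 0:
            break
        scores = [sparsity_index(residual[:, j]) for j in valid]
        k = valid[int(np.argmin(scores))]
        atom = residual[:, k] / norms[k]
        atoms.append(atom)
        # orthogonal deflation: remove this atom's contribution
        coeffs = atom @ residual
        residual -= np.outer(atom, coeffs)
    return np.column_stack(atoms)
```

Since each deflation step projects out the selected atom exactly, the learned atoms form an orthonormal set, consistent with the orthogonal-by-construction transform the abstract describes.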
Vincent E, Plumbley MD (2007) Fast factorization-based inference for bayesian harmonic models, Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, MLSP 2006 pp. 117-122
Harmonic sinusoidal models are a fundamental tool for audio signal analysis. Bayesian harmonic models guarantee a good resynthesis quality and allow joint use of learnt parameter priors and auditory-motivated distortion measures. However, inference algorithms based on Monte Carlo sampling are rather slow for realistic data. In this paper, we investigate fast inference algorithms based on approximate factorization of the joint posterior into a product of independent distributions on small subsets of parameters. We discuss the conditions under which these approximations hold true and evaluate their performance experimentally. We suggest how they could be used together with Monte Carlo algorithms for a faster sampling-based inference. © 2006 IEEE.
We introduce a free and open dataset of 7690 audio clips sampled from the field-recording tag in the Freesound audio archive. The dataset is designed for use in research related to data mining in audio archives of field recordings / soundscapes. Audio is standardised, and audio and metadata are Creative Commons licensed. We describe the data preparation process, characterise the dataset descriptively, and illustrate its use through an auto-tagging experiment.
Plumbley, Mark D (2007) Independent Component Analysis and Signal Separation, 7th International Conference, ICA 2007, London, UK, September 9-12, 2007., 4666 Springer
Font F, Brookes T, Fazekas G, Guerber M, La Burthe A, Plans D, Plumbley MD, Shaashua M, Wang W, Serra X (2016) Audio Commons: bringing Creative Commons audio content to the creative industries, AES E-Library, Audio Engineering Society
Significant amounts of user-generated audio content, such as sound effects, musical samples and music pieces, are uploaded to online repositories and made available under open licenses. Moreover, a constantly increasing amount of multimedia content, originally released with traditional licenses, is becoming public domain as its license expires. Nevertheless, the creative industries are not yet using much of all this content in their media productions. There is still a lack of familiarity with and understanding of the legal context of all this open content, and there are also problems related to its accessibility. A large percentage of this content remains unreachable, either because it is not published online or because it is not well organised and annotated. In this paper we present the Audio Commons Initiative, which is aimed at promoting the use of open audio content and at developing technologies to support the ecosystem composed of content repositories, production tools and users. These technologies should enable the reuse of this audio material, facilitating its integration into the production workflows used by the creative industries. This is a position paper in which we describe the core ideas behind this initiative and outline the ways in which we plan to address the challenges it poses.
Nesbit A, Jafari MG, Vincent E, Plumbley MD (2010) Audio source separation using sparse representations, In: Machine Audition: Principles, Algorithms and Systems pp. 246-264
The authors address the problem of audio source separation, namely, the recovery of audio signals from recordings of mixtures of those signals. The sparse component analysis framework is a powerful method for achieving this. Sparse orthogonal transforms, in which only few transform coefficients differ significantly from zero, are developed; once the signal has been transformed, energy is apportioned from each transform coefficient to each estimated source, and, finally, the signal is reconstructed using the inverse transform. The overriding aim of this chapter is to demonstrate how this framework, as exemplified here by two different decomposition methods which adapt to the signal to represent it sparsely, can be used to solve different problems in different mixing scenarios. To address the instantaneous (neither delays nor echoes) and underdetermined (more sources than mixtures) mixing model, a lapped orthogonal transform is adapted to the signal by selecting a basis from a library of predetermined bases. This method is closely related to the windowing methods used in the MPEG audio coding framework. In considering the anechoic (delays but no echoes) and determined (equal number of sources and mixtures) mixing case, a greedy adaptive transform is used based on orthogonal basis functions that are learned from the observed data, instead of being selected from a predetermined library of bases. This is found to encode the signal characteristics, by introducing a feedback system between the bases and the observed data. Experiments on mixtures of speech and music signals demonstrate that these methods give good signal approximations and separation performance, and indicate promising directions for future research. © 2011, IGI Global.
Plumbley MD (1997) Communications and neural networks: Theory and practice, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 1 pp. 135-138
In this paper we shall see that neural networks and communications are interlinked in a number of ways, towards the goal of efficient communication of information. One concrete example of this is the use of neural networks to ensure efficient use of communication channels, through connection admission control in ATM networks. In addition, however, efficient communication is also important within a decision making system such as a neural network. Finally we examine what type of neural network solutions are suggested by this approach.
Davies MEP, Plumbley MD (2007) Context-Dependent Beat Tracking of Musical Audio., IEEE Transactions on Audio, Speech & Language Processing 15 3 pp. 1009-1020
Abdallah SM, Plumbley MD (2009) Information dynamics: patterns of expectation and surprise in the perception of music., Connect. Sci. 21 2&3 pp. 89-117
Sturm BL, Mailhe B, Plumbley MD (2013) On Theorem 10 in "On Polar Polytopes and the Recovery of Sparse Representations" (vol 50, pg 2231, 2004), IEEE Transactions on Information Theory 59 (8) pp. 5206-5209 IEEE
Plumbley MD, Fallside F (1991) The effect of receptor signal-to-noise levels on optimal filtering in a sensory system, Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing 4 pp. 2321-2324
Consideration is given to image filtering (temporal and spatial) in a neural system for transmitting images through a limited capacity channel, in the case of a noisy image at the receptors. The authors use an extension of Shannon's formula for the capacity of a Gaussian channel to determine the optimum filter to be used. For realistic image statistics, they show that the bandwidth of this filter is self-limiting, and it has a high frequency boost that disappears at low signal levels. This behavior is mirrored in biological retinas.
Toyama K, Plumbley MD (2009) Estimating Phase Linearity in the Frequency-Domain ICA Demixing Matrix., ICA 5441 pp. 362-370 Springer
In this paper, a system for overlapping acoustic event detection is proposed, which models the temporal evolution of sound events. The system is based on probabilistic latent component analysis, supporting the use of a sound event dictionary where each exemplar consists of a succession of spectral templates. The temporal succession of the templates is controlled through event class-wise Hidden Markov Models (HMMs). As input time/frequency representation, the Equivalent Rectangular Bandwidth (ERB) spectrogram is used. Experiments are carried out on polyphonic datasets of office sounds generated using an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the proposed system outperforms several state-of-the-art methods for overlapping acoustic event detection on the same task, using both frame-based and event-based metrics, and is robust to varying event density and noise levels.
Abdallah SA, Plumbley MD (2013) Predictive Information in Gaussian Processes with Application to Music Analysis., GSI 8085 pp. 650-657 Springer
Abdallah SA, Plumbley MD (2006) Unsupervised analysis of polyphonic music by sparse coding., IEEE Trans Neural Netw 17 (1) pp. 179-196
We investigate a data-driven approach to the analysis and transcription of polyphonic music, using a probabilistic model which is able to find sparse linear decompositions of a sequence of short-term Fourier spectra. The resulting system represents each input spectrum as a weighted sum of a small number of "atomic" spectra chosen from a larger dictionary; this dictionary is, in turn, learned from the data in such a way as to represent the given training set in an (information theoretically) efficient way. When exposed to examples of polyphonic music, most of the dictionary elements take on the spectral characteristics of individual notes in the music, so that the sparse decomposition can be used to identify the notes in a polyphonic mixture. Our approach differs from other methods of polyphonic analysis based on spectral decomposition by combining all of the following: (a) a formulation in terms of an explicitly given probabilistic model, in which the process estimating which notes are present corresponds naturally with the inference of latent variables in the model; (b) a particularly simple generative model, motivated by very general considerations about efficient coding, that makes very few assumptions about the musical origins of the signals being processed; and (c) the ability to learn a dictionary of atomic spectra (most of which converge to harmonic spectral profiles associated with specific notes) from polyphonic examples alone-no separate training on monophonic examples is required.
Segmenting note objects in a real-time context is useful for live performances, audio broadcasting, or object-based coding. This temporal segmentation relies upon the correct detection of onsets and offsets of musical notes, an area of much research over recent years. However, the low-latency requirements of real-time systems impose new, tight constraints on this process. In this paper, we present a system for the segmentation of note objects with very short delays, using recent developments in onset detection, specially modified to work in a real-time context. A portable and open C implementation is provided.
Weyde T, Cottrell S, Benetos E, Wolff D, Tidhar D, Dykes J, Plumbley M, Dixon S, Barthet M, Gold N, Abdallah S, Mahey M (2014) Digital Music Lab: A Framework for Analysing Big Music Data,
We present a new method for score-informed source separation, combining ideas from two previous approaches: one based on parametric modeling of the score, which constrains the NMF updating process; the other based on PLCA, which uses synthesized scores as prior probability distributions. We experimentally show improved separation results using the BSS EVAL and PEASS toolkits, and discuss strengths and weaknesses compared with the previous PLCA-based approach.
Jafari MG, Plumbley MD (2009) Speech denoising based on a greedy adaptive dictionary algorithm, European Signal Processing Conference pp. 1423-1426
In this paper we consider the problem of speech denoising based on a greedy adaptive dictionary (GAD) algorithm. The transform is orthogonal by construction, and is found to give a sparse representation of the data being analysed, and to be robust to additive Gaussian noise. The performance of the algorithm is compared to that of the principal component analysis (PCA) method, for a speech denoising application. It is found that the GAD algorithm offers a sparser solution than PCA, while having a similar performance in the presence of noise. © EURASIP, 2009.
Adler A, Emiya V, Jafari MG, Elad M, Gribonval R, Plumbley MD (2011) A constrained matching pursuit approach to audio declipping., ICASSP pp. 329-332 IEEE
Degara N, Davies MEP, Pena A, Plumbley MD (2011) Onset Event Decoding Exploiting the Rhythmic Structure of Polyphonic Music., J. Sel. Topics Signal Processing 5 6 pp. 1228-1239
Musical noise is a recurrent issue that appears in spectral techniques for denoising or blind source separation. Due to localised estimation errors, isolated peaks may appear in the processed spectrograms, resulting in annoying tonal sounds after synthesis known as "musical noise". In this paper, we propose a method to assess the amount of musical noise in an audio signal by characterising the impact of these artificial isolated peaks on the processed sound. It turns out that, because of the constraints between STFT coefficients, the isolated peaks are described as time-frequency "spots" in the spectrogram of the processed audio signal. The quantification of these "spots", achieved through the adaptation of a method for localisation of significant STFT regions, allows for an evaluation of the amount of musical noise. We believe that this will pave the way to an objective measure and a better understanding of this phenomenon.
Plumbley MD (2004) Lie group methods for optimization with orthogonality constraints, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3195 pp. 1245-1252
Optimization of a cost function J(W) under an orthogonality constraint WWT = I is a common requirement for ICA methods. In this paper, we will review the use of Lie group methods to perform this constrained optimization. Instead of searching in the space of n × n matrices W, we will introduce the concept of the Lie group SO(n) of orthogonal matrices, and the corresponding Lie algebra so(n). Using so(n) for our coordinates, we can multiplicatively update W by a rotation matrix R so that W2 = RW always remains orthogonal. Steepest descent and conjugate gradient algorithms can be used in this framework.
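The multiplicative update W2 = RW described above can be sketched as follows (a minimal illustration assuming the usual projection of the Euclidean gradient onto the Lie algebra so(n); the function name and step size are ours):

```python
import numpy as np
from scipy.linalg import expm

def rotate_update(W, grad, mu=0.1):
    """Multiplicative update on SO(n): W <- R W with R = exp(-mu * B).

    grad is the Euclidean gradient of the cost J at W; B is its
    skew-symmetric projection, an element of the Lie algebra so(n),
    so R = expm(-mu * B) is a rotation and orthogonality of W is
    preserved exactly at every step.
    """
    # project the Euclidean gradient onto so(n)
    G = grad @ W.T
    B = G - G.T          # skew-symmetric: B lies in so(n)
    R = expm(-mu * B)    # rotation matrix in SO(n)
    return R @ W
```

Because R is exactly orthogonal, no re-orthogonalisation of W is needed after the update, unlike additive gradient steps followed by projection.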
Acoustic event detection for content analysis in most cases relies on large amounts of labeled data. However, manually annotating data is a time-consuming task, so few annotated resources are available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, relies only on weakly labeled data. This is highly desirable for some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, all or almost all frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can effectively map the audio feature sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improvements were adopted, such as Dropout and background-noise-aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method can make good use of long-term temporal information by taking the whole chunk as input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.
Stark AM, Davies MEP, Plumbley MD (2009) Real-time beat-synchronous analysis of musical audio, Proceedings of the 12th International Conference on Digital Audio Effects, DAFx 2009 pp. 299-304
In this paper we present a model for beat-synchronous analysis of musical audio signals. Introducing a real-time beat tracking model with performance comparable to offline techniques, we discuss its application to the analysis of musical performances segmented by beat. We discuss the various design choices for beat-synchronous analysis and their implications for real-time implementations before presenting some beat-synchronous harmonic analysis examples. We make available our beat tracker and beat-synchronous analysis techniques as externals for Max/MSP.
Welburn SJ, Plumbley MD, Vincent E (2007) Object-Coding for Resolution-Free Musical Audio, Proceedings of the AES International Conference
Object-based coding of audio represents the signal as a parameter stream for a set of sound-producing objects. Encoding in this manner can provide a resolution-free representation of an audio signal. Given a robust estimation of the object-parameters and a multi-resolution synthesis engine, the signal can be "intelligently" upsampled, extending the bandwidth and getting best use out of a high-resolution signal-chain. We present some initial findings on extending bandwidth using harmonic models.
We outline a set of audio effects that use rhythmic analysis, in particular the extraction of beat and tempo information, to automatically synchronise temporal parameters to the input signal. We demonstrate that this analysis, known as beat-tracking, can be used to create adaptive parameters that adjust themselves according to changes in the properties of the input signal. We present common audio effects such as delay, tremolo and auto-wah augmented in this fashion and discuss their real-time implementation as Audio Unit plug-ins and objects for Max/MSP.
Jafari MG, Plumbley MD (2007) Convolutive blind source separation of speech signals in the low frequency bands, Audio Engineering Society - 123rd Audio Engineering Society Convention 2007 3 pp. 1195-1198
Sub-band methods are often used to address the problem of convolutive blind speech separation, as they offer the computational advantage of approximating convolutions by multiplications. The computational load, however, often remains quite high, because separation is performed on several sub-bands. In this paper, we exploit the well known fact that the high frequency content of speech signals typically conveys little information, since most of the speech power is found in frequencies up to 4kHz, and consider separation only in frequency bands below a certain threshold. We investigate the effect of changing the threshold, and find that separation performed only in the low frequencies can lead to the recovered signals being similar in quality to those extracted from all frequencies.
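Restricting separation to the low-frequency bands amounts to selecting the STFT bins below the chosen threshold; a small helper for this bin selection might look as follows (an illustrative sketch; the function name and parameters are ours):

```python
import numpy as np

def low_band_bins(n_fft, fs, f_max=4000.0):
    """Indices of STFT bins at or below f_max (sketch).

    Frequency-domain separation would then be run only on these
    bins, since most speech power lies below about 4 kHz; the
    remaining high-frequency bins can be left unprocessed.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return np.where(freqs <= f_max)[0]
```

For a 512-point FFT at 16 kHz sampling, this keeps roughly half of the positive-frequency bins, which is where the corresponding computational saving comes from.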
A method is proposed for instrument recognition in polyphonic music which combines two independent detector systems: a polyphonic musical instrument recognition system using a missing-feature approach, and an automatic music transcription system based on shift-invariant probabilistic latent component analysis that includes instrument assignment. We propose a method to integrate the two systems by fusing the instrument contributions estimated by the first system into the transcription system in the form of Dirichlet priors. Both systems, as well as the integrated system, are evaluated using a dataset of continuous polyphonic music recordings. Detailed results that highlight a clear improvement in the performance of the integrated system are reported for different training conditions.
Most sound scenes result from the superposition of several sources, which can be separately perceived and analyzed by human listeners. Source separation aims to provide machine listeners with similar skills by extracting the sounds of individual sources from a given scene. Existing separation systems operate either by emulating the human auditory system or by inferring the parameters of probabilistic sound models. In this chapter, the authors focus on the latter approach and provide a joint overview of established and recent models, including independent component analysis, local time-frequency models and spectral template-based models. They show that most models are instances of one of the following two general paradigms: linear modeling or variance modeling. They compare the merits of either paradigm and report objective performance figures. They also conclude by discussing promising combinations of probabilistic priors and inference algorithms that could form the basis of future state-of-the-art systems.
Plumbley MD (1993) Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace, IEE Conference Publication (372) pp. 86-90
A number of recent papers have used the approach of maximizing information capacity or mutual information (MI) to examine unsupervised neural networks. In this paper we extend this work to develop an algorithm for the case of both input and output noise, with an output power constraint. We find that it is possible to simplify the obvious algorithm obtained by concatenating the two previous solutions.
In this paper, we consider the dictionary learning problem for the sparse analysis model. A novel algorithm is proposed by adapting the simultaneous codeword optimization (SimCO) algorithm, based on the sparse synthesis model, to the sparse analysis model. This algorithm assumes that the analysis dictionary contains unit l2-norm atoms and learns the dictionary by optimization on manifolds. This framework allows multiple dictionary atoms to be updated simultaneously in each iteration. However, similar to several existing analysis dictionary learning algorithms, dictionaries learned by the proposed algorithm may contain similar atoms, leading to a degenerate (coherent) dictionary. To address this problem, we also consider restricting the coherence of the learned dictionary and propose Incoherent Analysis SimCO by introducing an atom decorrelation step following the update of the dictionary. We demonstrate the competitive performance of the proposed algorithms using experiments with synthetic data and image denoising as compared with existing algorithms.
Thiebaut J-B, Abdallah SA, Robertson A, Bryan-Kinns N, Plumbley MD (2008) Real Time Gesture Learning and Recognition: Towards Automatic Categorization., NIME pp. 215-218 nime.org
Welburn SJ, Plumbley MD (2010) Improving the performance of pitch estimators, 128th Audio Engineering Society Convention 2010 2 pp. 1319-1332
We are looking to use pitch estimators to provide an accurate high-resolution pitch track for resynthesis of musical audio. We found that current evaluation measures such as gross error rate (GER) are not suitable for algorithm selection. In this paper we examine the issues relating to evaluating pitch estimators and use these insights to improve performance of existing algorithms such as the well-known YIN pitch estimation algorithm.
We consider the task of solving the independent component analysis (ICA) problem x=As given observations x, with a constraint of nonnegativity of the source random vector s. We refer to this as nonnegative independent component analysis and we consider methods for solving this task. For independent sources with nonzero probability density function (pdf) p(s) down to s=0 it is sufficient to find the orthonormal rotation y=Wz of prewhitened sources z=Vx, which minimizes the mean squared error of the reconstruction of z from the rectified version y+ of y. We suggest some algorithms which perform this, both based on a nonlinear principal component analysis (PCA) approach and on a geodesic search method driven by differential geometry considerations. We demonstrate the operation of these algorithms on an image separation problem, which shows in particular the fast convergence of the rotation and geodesic methods, and we apply the approach to a musical audio analysis task.
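As an illustrative sketch of the rectified-reconstruction criterion described above: a crude grid search over rotation angles stands in for the paper's nonlinear-PCA and geodesic algorithms, and all sizes and mixing values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nonnegative independent sources with pdf nonzero down to s = 0,
# scaled to unit variance (all sizes and mixing values hypothetical)
s = rng.exponential(1.0, size=(2, 2000))
s /= s.std(axis=1, keepdims=True)
A = np.array([[1.0, 0.6],
              [0.3, 1.0]])           # det > 0, so V @ A below is a rotation
x = A @ s                            # observed mixtures

# Prewhiten with V = cov(x)^(-1/2), applied to the UNcentred data so
# that z = V @ A @ s remains an orthonormal transform of the sources
d, E = np.linalg.eigh(np.cov(x))
V = E @ np.diag(d ** -0.5) @ E.T
z = V @ x

# The criterion from the paper: find the rotation y = W z minimising the
# reconstruction error ||z - W^T y+||^2, where y+ = max(y, 0)
def recon_error(theta):
    c, si = np.cos(theta), np.sin(theta)
    W = np.array([[c, -si], [si, c]])
    y_plus = np.maximum(W @ z, 0.0)
    return np.sum((z - W.T @ y_plus) ** 2)

thetas = np.linspace(0.0, 2 * np.pi, 720, endpoint=False)
best = min(thetas, key=recon_error)  # grid search stands in for the
                                     # rotation / geodesic methods
```

At the correct rotation the outputs are (approximately) the nonnegative sources, so rectification changes nothing and the reconstruction error is near zero.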
O'Hanlon K, Plumbley MD, Sandler M (2015) Non-negative Matrix Factorisation incorporating greedy Hellinger sparse coding applied to polyphonic music transcription, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015) pp. 2214-2218
Non-negative Matrix Factorisation (NMF) is a commonly used tool in many musical signal processing tasks, including Automatic Music Transcription (AMT). However, unsupervised NMF is seen to be problematic in this context, and harmonically constrained variants of NMF have been proposed. While useful, the harmonic constraints may be constrictive in mixed signals. We have previously observed that recovery of overlapping signal elements using NMF is improved through introduction of a sparse coding step, and propose here the incorporation of a sparse coding step using the Hellinger distance into an NMF algorithm. Improved AMT results for unsupervised NMF are reported.
The Melody Triangle is an interface for the discovery of melodic materials, where the input (positions within a triangle) directly maps to information theoretic properties of the output. A model of human expectation and surprise in the perception of music, information dynamics, is used to 'map out' a musical generative system's parameter space. This enables a user to explore the possibilities afforded by a generative algorithm, in this case Markov chains, not by directly selecting parameters, but by specifying the subjective predictability of the output sequence. We describe some of the relevant ideas from information dynamics and how the Melody Triangle is defined in terms of these. We describe its incarnation as a screen-based performance tool and compositional aid for the generation of musical textures, in which the user works at the abstract level of randomness and predictability, and report some pilot studies carried out with it. We also briefly outline a multi-user installation, where collaboration in a performative setting provides a playful yet informative way to explore expectation and surprise in music, and a forthcoming mobile phone version of the Melody Triangle.
Stowell D, Plumbley MD (2014) Robust bird species recognition: making it work for dawn chorus audio archives, pp. 94-94
The recent (2013) bird species recognition challenges organised by the SABIOD project attracted some strong performances from automatic classifiers applied to short audio excerpts from passive acoustic monitoring stations. Can such strong results be achieved for dawn chorus field recordings in audio archives? The question is important because archives (such as the British Library Sound Archive) hold thousands of such recordings, covering many decades and many countries, but they are mostly unlabelled. Automatic labelling holds the potential to unlock their value to ecological studies.
Audio in such archives is quite different from passive acoustic monitoring data: importantly, the recording conditions vary randomly (and are usually unknown), making the scenario a 'cross-condition' rather than 'single-condition' train/test task. Dawn chorus recordings are generally long, and the annotations often indicate which birds are in a 20-minute recording but not within which 5-second segments they are active. Further, the amount of annotation available is very small.
We report on experiments to evaluate a variety of classifier configurations for automatic multilabel species annotation in dawn chorus archive recordings. The audio data is an order of magnitude larger than the SABIOD challenges, but the ground-truth data is an order of magnitude smaller. We report some surprising findings, including clear variation in the benefits of some analysis choices (audio features, pooling techniques, noise-robustness techniques) as we move to handle the specific multi-condition case relevant for audio archives.
Robertson A, Plumbley MD, Bryan-Kinns N (2008) A Turing Test for B-Keeper: Evaluating an Interactive Real-Time Beat-Tracker, Proceedings of the 8th International Conference on New Interfaces for Musical Expression (NIME 2008) pp. 319-324
Techniques based on non-negative matrix factorization (NMF) have been successfully used to decompose a spectrogram of a music recording into a dictionary of templates and activations. While advanced NMF variants often yield robust signal models, there are usually some inaccuracies in the factorization since the underlying methods are not prepared for phase cancellations that occur when sounds with similar frequency are mixed. In this paper, we present a novel method that takes phase cancellations into account to refine dictionaries learned by NMF-based methods. Our approach exploits the fact that advanced NMF methods are often robust enough to provide information about how sound sources interact in a spectrogram, where they overlap, and thus where phase cancellations could occur. Using this information, the distances used in NMF are weighted entry-wise to attenuate the influence of regions with phase cancellations. Experiments on full-length, polyphonic piano recordings indicate that our method can be successfully used to refine NMF-based dictionaries.
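The entry-wise weighting described above can be sketched with generic weighted-Euclidean NMF multiplicative updates. This is not the paper's exact scheme (there the weights come from phase-cancellation estimates; here they are random placeholders, and all matrix sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, K = 40, 60, 5
V = np.abs(rng.normal(size=(F, T)))       # magnitude spectrogram (random stand-in)
# Entry-wise weights Phi: low values would down-weight TF regions where
# phase cancellation is suspected (random here, purely for illustration)
Phi = rng.uniform(0.2, 1.0, size=(F, T))

W = np.abs(rng.normal(size=(F, K))) + 1e-3
H = np.abs(rng.normal(size=(K, T))) + 1e-3
eps = 1e-9

def cost():
    # Entry-wise weighted Euclidean distance between V and the model W @ H
    return 0.5 * np.sum(Phi * (V - W @ H) ** 2)

cost_before = cost()
for _ in range(200):
    # Multiplicative updates for the weighted Euclidean objective;
    # they preserve non-negativity and do not increase the cost
    H *= (W.T @ (Phi * V)) / (W.T @ (Phi * (W @ H)) + eps)
    W *= ((Phi * V) @ H.T) / ((Phi * (W @ H)) @ H.T + eps)
cost_after = cost()
```

Setting `Phi` to all ones recovers the usual unweighted Euclidean NMF updates.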
Plumbley MD (2004) Optimization using fourier expansion over a geodesic for non-negative ICA, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3195 pp. 49-56
We propose a new algorithm for the non-negative ICA problem, based on the rotational nature of optimization over a set of square orthogonal (orthonormal) matrices W, i.e. where WTW = WWT = In. Using a truncated Fourier expansion of J(t), we obtain a Newton-like update step along the steepest-descent geodesic, which automatically approximates to a usual (Taylor expansion) Newton update step near to a minimum. Experiments confirm that this algorithm is effective, and it compares favourably with existing non-negative ICA algorithms. We suggest that this approach could be modified for other algorithms, such as the normal ICA task. © Springer-Verlag 2004.
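The geodesic idea above can be sketched with a plain steepest-descent step along a geodesic of the orthogonal group (not the paper's Fourier-expansion Newton-like step). The toy objective and all sizes are hypothetical stand-ins for the non-negative ICA cost:

```python
import numpy as np

def expm_skew(B, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small ||B||);
    for exactly skew-symmetric B the result is numerically orthogonal."""
    out = np.eye(B.shape[0])
    term = np.eye(B.shape[0])
    for k in range(1, terms):
        term = term @ B / k
        out = out + term
    return out

# Toy objective on the orthogonal group: J(W) = ||W - Q||_F^2 for a target
# rotation Q, standing in for the non-negative ICA cost in the paper
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]               # force det(Q) = +1 (a proper rotation)

W = np.eye(3)
eta = 0.1
J = lambda W: np.sum((W - Q) ** 2)
J0 = J(W)
for _ in range(300):
    G = 2.0 * (W - Q)                # Euclidean gradient of J at W
    B = G @ W.T - W @ G.T            # skew-symmetric projection: tangent direction
    W = expm_skew(-0.5 * eta * B) @ W  # move along the steepest-descent geodesic
```

Because each update multiplies by the exponential of a skew-symmetric matrix, W stays exactly on the orthogonal group throughout, unlike an unconstrained gradient step followed by re-orthonormalisation.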
Jafari MG, Plumbley MD (2010) A doubly sparse greedy adaptive dictionary learning algorithm for music and large-scale data, 128th Audio Engineering Society Convention 2010 2 pp. 940-945
We consider the extension of the greedy adaptive dictionary learning algorithm that we introduced previously to applications other than speech signals. The algorithm learns a dictionary of sparse atoms, while yielding a sparse representation for the signals being analysed. We investigate its behavior in the analysis of music signals, and propose a different dictionary learning approach that can be applied to large data sets. This facilitates the application of the algorithm to problems that generate large amounts of data, such as multimedia or multi-channel application areas.
Plumbley MD, Fallside F (1989) Sensory adaptation: An information-theoretic viewpoint, IJCNN Int Jt Conf Neural Network
Summary form only given. The authors examine the goals of early stages of a perceptual system, before the signal reaches the cortex, and describe its operation in information-theoretic terms. The effects of receptor adaptation, lateral inhibition, and decorrelation can all be seen as part of an optimization of information throughput, given that available resources such as average power and maximum firing rates are limited. The authors suggest a modification to Gabor functions which improves their performance as band-pass filters.
Welburn SJ, Plumbley MD (2009) Rendering audio using expressive MIDI, 127th Audio Engineering Society Convention 2009 1 pp. 176-184
MIDI renderings of audio are traditionally regarded as lifeless and unnatural - lacking in expression. However, MIDI is simply a protocol for controlling a synthesizer. Lack of expression is caused by either an expressionless synthesizer or by the difficulty in setting the MIDI parameters to provide expressive output. We have developed a system to construct an expressive MIDI representation of an audio signal, i.e. an audio representation which uses tailored pitch variations in addition to the note base pitch parameters which audio-to-MIDI systems usually attempt. A pitch envelope is estimated from the original audio, and a genetic algorithm is then used to estimate pitch modulator parameters from that envelope. These pitch modulations are encoded in a MIDI file and rerendered using a sampler. We present some initial comparisons between the final output audio and the estimated pitch envelopes.
Davies MEP, Degara N, Plumbley MD (2011) Measuring the Performance of Beat Tracking Algorithms Using a Beat Error Histogram., IEEE Signal Process. Lett. 18 3 pp. 157-160
Jafari MG, Abdallah SA, Plumbley MD, Davies ME (2006) Sparse Coding for Convolutive Blind Audio Source Separation., ICA 3889 pp. 132-139 Springer
Kachkaev A, Wolff D, Barthet M, Plumbley MD, Dykes J, Weyde T (2014) Visualising chord progressions in music collections: a big data approach,
In the Digital Music Lab project we work on the automatic analysis of large audio databases that results in rich annotations for large corpora of music. The musicological interpretation of this data from thousands of pieces is a challenging task that can benefit greatly from specifically designed interactive visualisation. Most existing big music data visualisation focuses on cultural attributes, mood, or listener behaviour. In this ongoing work we explore chord sequence patterns extracted by sequential pattern mining of more than one million tracks from the I Like Music commercial music collection. We present here several new visual representations that summarise chord patterns according to chord types, chroma, pattern structure and support, enabling musicologists to develop and answer questions about chord patterns in music collections. Our visualisations represent root movement and chord qualities mostly in a geometrical way and use colour to represent pattern support. We use two individually configurable views in parallel to encourage comparisons, either between different representations of one corpus, highlighting complementary musical aspects, or between different datasets, here representing different genres. We adapt several visualisation techniques to chord pattern sets using some novel layouts to support musicologists with their exploration and interpretation of the corpora. We found that differences between chord patterns of different genres, e.g. Rock & Roll vs. Jazz, are visible and can be used to generate hypotheses for the study of individual pieces, further statistical investigations or new data processing and visualisation. Our designs will be adapted as user needs are established through ongoing work. Means of aggregating, focusing and filtering by selected characteristics (such as key, melodic patterns etc.) will be added as we develop our design for the visualisation of chord patterns in close collaboration with musicologists.
The visualisations are available as a web application at http://dml.city.ac.uk/csvd/
Degara N, Pena A, Davies MEP, Plumbley MD (2010) Note onset detection using rhythmic structure., ICASSP pp. 5526-5529 IEEE
Audio editing is performed at scale in the production of radio, but often the tools used are poorly targeted toward the task at hand. There are a number of audio analysis techniques that have the potential to aid radio producers, but without a detailed understanding of their process and requirements, it can be difficult to apply these methods. To aid this understanding, a study of radio production practice was conducted on three varied case studies - a news bulletin, drama, and documentary. It examined the audio/metadata workflow, the roles and motivations of the producers, and environmental factors. The study found that producers prefer to interact with higher-level representations of audio content like transcripts and enjoy working on paper. The study also identified opportunities to improve the workflow with tools that link audio to text, highlight repetitions, compare takes, and segment speakers.
Barchiesi D, Plumbley MD (2015) Learning incoherent subspaces: Classification via incoherent dictionary learning, Journal of Signal Processing Systems 79 (2) pp. 189-199 Springer
In this article we present the supervised iterative projections and rotations (S-IPR) algorithm, a method for learning discriminative incoherent subspaces from data. We derive S-IPR as a supervised extension of our previously proposed iterative projections and rotations (IPR) algorithm for incoherent dictionary learning, and we employ it to learn incoherent subspaces that model signals belonging to different classes. We test our method as a feature transform for supervised classification, first by visualising transformed features from a synthetic dataset and from the 'iris' dataset, then by using the resulting features in a classification experiment.
Nishimori Y, Akaho S, Plumbley MD (2006) Riemannian Optimization Method on the Flag Manifold for Independent Subspace Analysis., ICA 3889 pp. 295-302 Springer
Jafari MG, Plumbley MD (2008) An adaptive orthogonal sparsifying transform for speech signals, 2008 3rd International Symposium on Communications, Control, and Signal Processing, ISCCSP 2008 pp. 786-790
In this paper we consider the problem of representing a speech signal with an adaptive transform that captures the main features of the data. The transform is orthogonal by construction, and is found to give a sparse representation of the data being analysed. The orthogonality property implies that evaluation of both the forward and inverse transform involve a simple matrix multiplication. The proposed dictionary learning algorithm is compared to the K singular value decomposition (K-SVD) method, which is found to yield very sparse representations, at the cost of a high approximation error. The proposed algorithm is shown to have a much lower computational complexity than K-SVD, while the resulting signal representation remains relatively sparse. ©2008 IEEE.
Stowell D, Plumbley MD (2008) Robustness and independence of voice timbre features under live performance acoustic degradations, Proceedings - 11th International Conference on Digital Audio Effects, DAFx 2008 pp. 325-332
Live performance situations can lead to degradations in the vocal signal from a typical microphone, such as ambient noise or echoes due to feedback. We investigate the robustness of continuous-valued timbre features measured on vocal signals (speech, singing, beatboxing) under simulated degradations. We also consider nonparametric dependencies between features, using information theoretic measures and a feature-selection algorithm. We discuss how robustness and independence issues reflect on the choice of acoustic features for use in constructing a continuous-valued vocal timbre space. While some measures (notably spectral crest factors) emerge as good candidates for such a task, others are poor, and some features such as ZCR exhibit an interaction with the type of voice signal being analysed.
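The spectral crest factor mentioned above is simply the peak-to-mean ratio of the magnitude spectrum. A minimal sketch on synthetic signals (the sample rate, frame length and window choice are illustrative, not taken from the paper):

```python
import numpy as np

def spectral_crest(frame):
    """Spectral crest factor: peak-to-mean ratio of the magnitude spectrum.
    Near 1 for a flat (noise-like) spectrum; large for strongly tonal sounds."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return float(mag.max() / (mag.mean() + 1e-12))

sr = 8000
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 440.0 * t)                 # tonal test signal
noise = np.random.default_rng(3).normal(size=2048)   # noise-like test signal
```

A pure tone concentrates its energy in a few bins and so scores a much higher crest than broadband noise, which is why the feature discriminates tonal voice qualities from noisy ones.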
Plumbley MD (2007) Dictionary Learning for L1-Exact Sparse Coding., ICA 4666 pp. 406-413 Springer
Jafari MG, Plumbley MD, Davies ME (2008) Speech separation using an adaptive sparse dictionary algorithm, 2008 Hands-free Speech Communication and Microphone Arrays, Proceedings, HSCMA 2008 pp. 25-28
We present a greedy adaptive algorithm that builds a sparse orthogonal dictionary from the observed data. In this paper, the algorithm is used to separate stereo speech signals, and the phase information that is inherent to the extracted atom pairs is used for clustering and identification of the original sources. The performance of the algorithm is compared to that of the adaptive stereo basis algorithm, when the sources are mixed in echoic and anechoic environments. We find that the algorithm correctly separates the sources, and can do this even with a relatively small number of atoms. ©2008 IEEE.
Laurberg H, Christensen MG, Plumbley MD, Hansen LK, Jensen SH (2008) Theorems on positive data: on the uniqueness of NMF., Comput Intell Neurosci
We investigate the conditions for which nonnegative matrix factorization (NMF) is unique and introduce several theorems which can determine whether the decomposition is in fact unique or not. The theorems are illustrated by several examples showing their use and limitations. We show that corruption of a unique NMF matrix by additive noise leads to a noisy estimation of the noise-free unique solution. Finally, we use a stochastic view of NMF to analyze which characterization of the underlying model will result in an NMF with small estimation errors.
Murray-Browne T, Mainstone D, Bryan-Kinns N, Plumbley MD (2010) The Serendiptichord: A wearable instrument for contemporary dance performance, 128th Audio Engineering Society Convention 2010 3 pp. 1547-1554
We describe a novel musical instrument designed for use in contemporary dance performance. This instrument, the Serendiptichord, takes the form of a headpiece plus associated pods which sense movements of the dancer, together with associated audio processing software driven by the sensors. Movements such as translating the pods or shaking the trunk of the headpiece cause selection and modification of sampled sounds. We discuss how we have closely integrated physical form, sensor choice and positioning and software to avoid issues which otherwise arise with disconnection of the innate physical link between action and sound, leading to an instrument that non-musicians (in this case, dancers) are able to enjoy using immediately.
Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, Haddock SH, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P (2014) Best practices for scientific computing.,PLoS Biol 12 (1)
Plumbley MD, Abdallah SA, Bello JP, Davies ME, Monti G, Sandler MB (2002) Automatic music transcription and audio source separation, CYBERNETICS AND SYSTEMS 33 (6) pp. 603-627 TAYLOR & FRANCIS INC
Badeau R, Plumbley MD (2013) Multichannel HR-NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Several probabilistic models involving latent components have been proposed for modelling time-frequency (TF) representations of audio signals (such as spectrograms), notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to a stereophonic piano signal, and proves capable of accurately modelling reverberation and restoring missing observations. © 2013 IEEE.
Brossier P, Bello JP, Plumbley MD (2004) Fast labelling of notes in music signals., ISMIR
This chapter discusses the role of information theory for analysis of neural networks using differential geometric ideas. Information theory is useful for understanding preprocessing, in terms of predictive coding in the retina and principal component analysis and decorrelation processing in early visual cortex. The chapter introduces some concepts from information theory, focusing in particular on the entropy of a random variable and the mutual information between two random variables. One of the major uses for information theory has been in interpretation and guidance for unsupervised neural networks: networks that are not provided with a teacher or target output that they are to emulate. The chapter describes how information relates to the more familiar supervised learning schemes, and discusses the use of error back propagation (BackProp) to minimize mean squared error (MSE) in a multi-layer perceptron (MLP). Other distortion measures are possible in place of MSE; in particular, the chapter focuses on the information theoretic cross-entropy distortion.
Vincent E, Plumbley MD (2007) Low Bit-Rate Object Coding of Musical Audio Using Bayesian Harmonic Models., IEEE Transactions on Audio, Speech & Language Processing 15 4 pp. 1273-1282
Gretsistas A, Plumbley MD (2012) AN ALTERNATING DESCENT ALGORITHM FOR THE OFF-GRID DOA ESTIMATION PROBLEM WITH SPARSITY CONSTRAINTS, 2012 PROCEEDINGS OF THE 20TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO) pp. 874-878 IEEE COMPUTER SOC
Plumbley MD (2005) On polar polytopes and the recovery of sparse representations, IEEE Transactions on Information Theory 53 (9) pp. 3188-3195
Suppose we have a signal y which we wish to represent using a linear combination of a number of basis atoms ai, y = Σi xi ai = Ax. The problem of finding the minimum l0 norm representation for y is a hard problem. The basis pursuit (BP) approach proposes to find the minimum l1 norm representation instead, which corresponds to a linear program (LP) that can be solved using modern LP techniques, and several recent authors have given conditions for the BP (minimum l1 norm) and sparse (minimum l0 norm) representations to be identical. In this paper, we explore this sparse representation problem using the geometry of convex polytopes, as recently introduced into the field by Donoho. By considering the dual LP we find that the so-called polar polytope P* of the centrally symmetric polytope P whose vertices are the atom pairs ±ai is particularly helpful in providing us with geometrical insight into optimality conditions given by Fuchs and Tropp for non-unit-norm atom sets. In exploring this geometry, we are able to tighten some of these earlier results, showing for example that a condition due to Fuchs is both necessary and sufficient for l1-unique-optimality, and there are cases where orthogonal matching pursuit (OMP) can eventually find all l1-unique-optimal solutions with m nonzeros even if the exact recovery condition (ERC) fails for m. © 2007 IEEE.
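The orthogonal matching pursuit (OMP) algorithm discussed above can be sketched in a few lines. The dictionary, sparsity level and coefficient values below are hypothetical; no claim is made that this toy instance satisfies the recovery conditions analysed in the paper.

```python
import numpy as np

def omp(A, y, m):
    """Orthogonal Matching Pursuit: greedily add the atom most correlated
    with the current residual, then refit all selected coefficients by
    least squares."""
    residual = y.astype(float).copy()
    support = []
    for _ in range(m):
        k = int(np.argmax(np.abs(A.T @ residual)))
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(30, 60))
A /= np.linalg.norm(A, axis=0)          # unit-norm atoms (hypothetical dictionary)
x_true = np.zeros(60)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]  # an m = 3 sparse representation
y = A @ x_true
x_hat = omp(A, y, 3)                    # at most 3 nonzeros in the estimate
```

The least-squares refit at each step is what distinguishes OMP from plain matching pursuit, and it guarantees the residual norm is non-increasing.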
Ekeus H, Abdallah SA, Mcowan PW, Plumbley MD (2013) How Predictable Do We Like Our Music? Eliciting Aesthetic Preferences With The Melody Triangle Mobile App, pp. 80-85 Logos Verlag Berlin
The Melody Triangle is a smartphone application for Android that lets users easily create musical patterns and textures. The user creates melodies by specifying positions within a triangle, and these positions correspond to the information theoretic properties of generated musical sequences. A model of human expectation and surprise in the perception of music, information dynamics, is used to 'map out' a musical generative system's parameter space, in this case Markov chains. This enables a user to explore the possibilities afforded by Markov chains, not by directly selecting their parameters, but by specifying the subjective predictability of the output sequence. As users of the app find melodies and patterns they like, they are encouraged to press a 'like' button, where their settings are uploaded to our servers for analysis. Collecting the 'liked' settings of many users worldwide will allow us to elicit trends and commonalities in aesthetic preferences across users of the app, and to investigate how these might relate to the information-dynamic model of human expectation and surprise. We outline some of the relevant ideas from information dynamics and how the Melody Triangle is defined in terms of these. We then describe the Melody Triangle mobile application, how it is being used to collect research data and how the collected data will be evaluated.
Stowell D, Plumbley MD (2010) Delayed decision-making in real-time beatbox percussion classification, Journal of New Music Research 39 (3) pp. 203-213
Real-time classification applied to a vocal percussion signal holds potential as an interface for live musical control. In this article we propose a novel approach to resolving the tension between the needs for low-latency reaction and reliable classification, by deferring the final classification decision until after a response has been initiated. We introduce a new dataset of annotated human beatbox recordings, and use it to study the optimal delay for classification accuracy. We then investigate the effect of such delayed decision-making on the quality of the audio output of a typical reactive system, via a MUSHRA-type listening test. Our results show that the effect depends on the output audio type: for popular dance/pop drum sounds the acceptable delay is on the order of 12-35 ms. © 2010 Taylor & Francis.
Nesbit A, Davies M, Plumbley M, Sandler M (2006) Source extraction from two-channel mixtures by joint cosine packet analysis, European Signal Processing Conference
This paper describes novel, computationally efficient approaches to source separation of underdetermined instantaneous two-channel mixtures. A best basis algorithm is applied to trees of local cosine bases to determine a sparse transform. We assume that the mixing parameters are known and focus on demixing sources by binary time-frequency masking. We describe a method for deriving a best local cosine basis from the mixtures by minimising an l1 norm cost function. This basis is adapted to the input of the masking process. Then, we investigate how to increase sparsity by adapting local cosine bases to the expected output of a single source instead of to the input mixtures. The heuristically derived cost function maximises the energy of the transform coefficients associated with a particular direction. Experiments on a mixture of four musical instruments are performed, and results are compared. It is shown that local cosine bases can give better results than fixed-basis representations.
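The binary time-frequency masking step described above can be sketched in a generic form: assign each TF bin to whichever known mixing direction its inter-channel ratio is closest to. This sketch uses a plain STFT rather than the paper's adapted local cosine bases, and all signals and mixing values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
sr, n, win = 8000, 4096, 256
t = np.arange(n) / sr
s1 = np.sin(2 * np.pi * 312.5 * t)        # lands exactly on bin 10 of a 256-pt frame
s2 = rng.normal(size=n)                   # broadband stand-in for source 2
a1 = np.array([1.0, 0.4])                 # mixing directions, assumed known
a2 = np.array([0.3, 1.0])                 # (as in the paper's setting)
X = np.outer(a1, s1) + np.outer(a2, s2)   # two-channel instantaneous mixture

def stft(x):                              # rectangular frames, for brevity
    return np.fft.rfft(x[: n // win * win].reshape(-1, win), axis=1)

Xf0, Xf1 = stft(X[0]), stft(X[1])

# Assign each TF bin to whichever mixing direction its inter-channel
# ratio is closer to, then demix source 1 by binary masking channel 0
ratio = Xf1 / (Xf0 + 1e-12)
r1, r2 = a1[1] / a1[0], a2[1] / a2[0]
mask = np.abs(ratio - r1) < np.abs(ratio - r2)
est1 = np.where(mask, Xf0, 0.0)           # masked estimate of source 1
```

The sparser the representation, the fewer TF bins contain energy from both sources at once, which is exactly why the paper adapts the basis to sharpen sparsity before masking.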
Stark AM, Plumbley MD, Davies MEP (2007) Audio effects for real-time performance using beat tracking, Audio Engineering Society - 122nd Audio Engineering Society Convention 2007 2 pp. 866-872
We present a new class of digital audio effects which can automatically relate parameter values to the tempo of a musical input in real-time. Using a beat tracking system as the front end, we demonstrate a tempo-dependent delay effect and a set of beat-synchronous low frequency oscillator (LFO) effects including auto-wah, tremolo and vibrato. The effects show better performance than might be expected as they are blind to certain beat tracker errors. All effects are implemented as VST plug-ins which operate in real-time, enabling their use both in live musical performance and the off-line modification of studio recordings.
O'Hanlon K, Plumbley MD (2011) Structure-aware dictionary learning with harmonic atoms, European Signal Processing Conference pp. 1761-1765
Non-negative blind signal decomposition methods are widely used for musical signal processing tasks, such as automatic transcription and source separation. A spectrogram can be decomposed into a dictionary of full spectrum basis atoms and their corresponding time activation vectors using methods such as Non-negative Matrix Factorisation (NMF) and Non-negative K-SVD (NN-K-SVD). These methods are constrained by their learning order and problems posed by overlapping sources in the time and frequency domains of the source spectrogram. We consider that it may be possible to improve on current results by providing prior knowledge on the number of sources in a given spectrogram and on the individual structure of the basis atoms, an approach we refer to as structure-aware dictionary learning. In this work we consider dictionary recoverability of harmonic atoms, as harmonicity is a common structure in music signals. We present results showing improvements in recoverability using structure-aware decomposition methods, based on NN-K-SVD and NMF. Finally we propose an alternative structure-aware dictionary learning algorithm incorporating the advantages of NMF and NN-K-SVD. © EURASIP, 2011.
This paper describes work aimed at creating an efficient, real-time, robust and high performance chord recognition system for use on a single instrument in a live performance context. An improved chroma calculation method is combined with a classification technique based on masking out expected note positions in the chromagram and minimising the residual energy. We demonstrate that our approach can be used to classify a wide range of chords, in real-time, on a frame by frame basis. We present these analysis techniques as externals for Max/MSP. © July 2009- All copyright remains with the individual authors.
Abdallah SA, Plumbley MD (2012) A measure of statistical complexity based on predictive information with application to finite spin systems, Physics Letters, Section A: General, Atomic and Solid State Physics 376 (4) pp. 275-281
We propose the binding information as an information theoretic measure of complexity between multiple random variables, such as those found in the Ising or Potts models of interacting spins, and compare it with several previously proposed measures of statistical complexity, including excess entropy, Bialek et al.'s predictive information, and the multi-information. We discuss and prove some of the properties of binding information, particularly in relation to multi-information and entropy, and show that, in the case of binary random variables, the processes which maximise binding information are the 'parity' processes. The computation of binding information is demonstrated on Ising models of finite spin systems, showing that various upper and lower bounds are respected and also that there is a strong relationship between the introduction of high-order interactions and an increase of binding-information. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity. © 2011 Elsevier B.V. All rights reserved.
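The binding information above can be computed directly for small systems; on the reading that it coincides with the dual total correlation, B(X) = H(X) - Σi H(Xi | X-i), a short sketch reproduces the paper's claim that parity processes are the maximisers for binary variables (the n = 3 example is illustrative):

```python
import numpy as np
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of a probability vector, ignoring zero entries."""
    p = np.asarray([q for q in probs if q > 0.0])
    return float(-np.sum(p * np.log2(p)))

def binding_information(pmf):
    """Binding information of a joint pmf over n discrete variables, given as
    {state-tuple: probability}: B(X) = H(X) - sum_i H(X_i | X_not_i),
    i.e. the dual total correlation."""
    n = len(next(iter(pmf)))
    H_joint = entropy(pmf.values())
    total_cond = 0.0
    for i in range(n):
        marg = {}                         # marginal over all variables but i
        for s, p in pmf.items():
            rest = s[:i] + s[i + 1:]
            marg[rest] = marg.get(rest, 0.0) + p
        total_cond += H_joint - entropy(marg.values())  # H(X_i | X_not_i)
    return H_joint - total_cond

n = 3
# Parity process: uniform over even-parity strings (the maximiser, per the paper)
parity = {s: 1 / 2 ** (n - 1) if sum(s) % 2 == 0 else 0.0
          for s in product((0, 1), repeat=n)}
# Independent fair bits: no statistical binding at all
independent = {s: 1 / 2 ** n for s in product((0, 1), repeat=n)}
```

For the parity process each variable is fully determined by the others, so every conditional entropy vanishes and the binding information attains n - 1 bits, while independent bits give zero.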
Phrases are common musical units akin to those in speech and text. In music performance, performers often change the way they vary the tempo from one phrase to the next in order to choreograph patterns of repetition and contrast. This activity is commonly referred to as expressive music performance. Despite its importance, expressive performance is still poorly understood. No formal models exist that would explain, or at least quantify and characterise, aspects of commonalities and differences in performance style. In this paper we present such a model for tempo variation between phrases in a performance. We demonstrate that the model provides a good fit with a performance database of 25 pieces and that perceptually important information is not lost through the modelling process.
Stowell D, Plumbley MD (2009) Fast Multidimensional Entropy Estimation by k-d Partitioning., IEEE Signal Process. Lett. 16 6 pp. 537-540
Plumbley MD (1995) Lyapunov functions for convergence of principal component algorithms, Neural Networks 8 (1) pp. 11-23
Recent theoretical analyses of a class of unsupervised Hebbian principal component algorithms have identified its local stability conditions. The only locally stable solution for the subspace P extracted by the network is the principal component subspace P*. In this paper we use the Lyapunov function approach to discover the global stability characteristics of this class of algorithms. The subspace projection error, least mean squared projection error, and mutual information I are all Lyapunov functions for convergence to the principal subspace, although the various domains of convergence indicated by these Lyapunov functions leave some of P-space uncovered. A modification to I yields a principal subspace information Lyapunov function I2 with a domain of convergence that covers almost all of P-space. This shows that this class of algorithms converges to the principal subspace from almost everywhere. © 1994.
Abdallah SA, Plumbley MD (2010) A measure of statistical complexity based on predictive information,
We introduce an information theoretic measure of statistical structure, called 'binding information', for sets of random variables, and compare it with several previously proposed measures including excess entropy, Bialek et al.'s predictive information, and the multi-information. We derive some of the properties of the binding information, particularly in relation to the multi-information, and show that, for finite sets of binary random variables, the processes which maximise binding information are the 'parity' processes. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity.
We present a method for real-time pitch-tracking which generates an estimate of the amplitudes of the partials relative to the fundamental for each detected note. We then employ a subtraction method, whereby lower fundamentals in the spectrum are accounted for when looking at higher fundamental notes. By tracking notes which are playing, we look for note-off events and continually update our expected partial weightings for each note. The resulting algorithm makes use of these relative partial weightings within its decision process. We have evaluated the system against a data set and compared it with specialised offline pitch-trackers. © July 2009- All copyright remains with the individual authors.
Stowell D, Plumbley MD (2010) Cross-associating unlabelled timbre distributions to create expressive musical mappings., WAPA 11 pp. 28-35 JMLR.org
Sparse representations are becoming an increasingly useful tool in the analysis of musical audio signals. In this paper we give an overview of work by ourselves and others in this area, to give a flavour of the work being undertaken and to offer some pointers for further information about this interesting and challenging research topic.
Plumbley MD, Blumensath T, Daudet L, Gribonval R, Davies ME (2010) Sparse Representations in Audio and Music: From Coding to Source Separation., Proceedings of the IEEE 98 6 pp. 995-1005
Johnson I, Plumbley MD (2000) On-Line Connectionist Q-Learning Produces Unreliable Performance with A Synonym Finding Task., IJCNN (3) pp. 451-458
Vincent E, Gribonval R, Plumbley MD (2007) Oracle estimators for the benchmarking of source separation algorithms., Signal Processing 87 8 pp. 1933-1950
Degara N, Rua EA, Pena A, Torres-Guijarro S, Davies MEP, Plumbley MD (2012) Reliability-informed beat tracking of musical signals, IEEE Transactions on Audio, Speech and Language Processing 20 (1) pp. 278-289
A new probabilistic framework for beat tracking of musical audio is presented. The method estimates the time between consecutive beat events and exploits both beat and non-beat information by explicitly modeling non-beat states. In addition to the beat times, a measure of the expected accuracy of the estimated beats is provided. The quality of the observations used for beat tracking is measured and the reliability of the beats is automatically calculated. A k-nearest neighbor regression algorithm is proposed to predict the accuracy of the beat estimates. The performance of the beat tracking system is statistically evaluated using a database of 222 musical signals of various genres. We show that modeling non-beat states leads to a significant increase in performance. In addition, a large experiment where the parameters of the model are automatically learned has been completed. Results show that simple approximations for the parameters of the model can be used. Furthermore, the performance of the system is compared with existing algorithms. Finally, a new perspective for beat tracking evaluation is presented. We show how reliability information can be successfully used to increase the mean performance of the proposed algorithm and discuss how far automatic beat tracking is from human tapping. © 2011 IEEE.
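Among the components described above, the k-nearest-neighbour regression used to predict the accuracy of beat estimates admits a very small sketch. This is a generic illustration under hypothetical placeholder features, not the paper's actual feature set or implementation:

```python
import numpy as np

def knn_regress(train_feats, train_acc, query, k=3):
    """Predict the expected beat-tracking accuracy of a query excerpt as
    the mean accuracy of its k nearest training excerpts in feature space."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest excerpts
    return float(train_acc[nearest].mean())
```

Here `train_feats` would hold observation-quality features for excerpts whose tracking accuracy `train_acc` is known, giving a reliability estimate for new excerpts without ground-truth beats.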
Davies MEP, Plumbley MD (2006) A spectral difference approach to downbeat extraction in musical audio, European Signal Processing Conference
We introduce a method for detecting downbeats in musical audio given a sequence of beat times. Using musical knowledge that lower frequency bands are perceptually more important, we find the spectral difference between band-limited beat synchronous analysis frames as a robust downbeat indicator. Initial results are encouraging for this type of system.
When working with generative systems, designers enter into a loop of discrete steps; external evaluations of the output feed back into the system, and new outputs are subsequently reevaluated. In such systems, interacting low-level elements can engender a difficult-to-predict emergence of macro-level characteristics. Furthermore, the state space of some systems can be vast. Consequently, designers generally rely on trial-and-error, experience or intuition in selecting parameter values to develop the aesthetic aspects of their designs. We investigate an alternative means of exploring the state spaces of generative visual systems by using a gaze-contingent display. A user's gaze continuously controls and directs an evolution of visual forms and patterns on screen. As time progresses and the viewer and system remain coupled in this evolution, a population of generative artefacts tends towards an area of their state space that is 'of interest', as defined by the eye tracking data. The evaluation-feedback loop is continuous and uninterrupted, with gaze as the guiding feedback mechanism in the exploration of state space.
Murray-Browne T, Plumbley M (2014) Harmonic Motion: A Toolkit for Processing Gestural Data for Interactive Sound, pp. 213-216
We introduce Harmonic Motion, a free open source toolkit for artists, musicians and designers working with gestural data. Extracting musically useful features from captured gesture data can be challenging, with projects often requiring bespoke processing techniques developed through iterations of tweaking equations involving a number of constant values, sometimes referred to as 'magic numbers'. Harmonic Motion provides a robust interface for rapid prototyping of patches to process gestural data and a framework through which approaches may be encapsulated, reused and shared with others. In addition, we describe our design process in which both personal experience and a survey of potential users informed a set of specific goals for the software.
Davies MEP, Plumbley MD (2007) On the use of entropy for beat tracking evaluation, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 4
Despite continued attention toward the problem of automatic beat detection in musical audio, the issue of how to evaluate beat tracking systems remains pertinent and controversial. As yet no consistent evaluation metric has been adopted by the research community. To this aim, we propose a new method for beat tracking evaluation by measuring beat accuracy in terms of the entropy of a beat error histogram. We demonstrate the ability of our approach to address several shortcomings of existing methods. © 2007 IEEE.
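The idea of scoring beat accuracy via the entropy of a beat error histogram can be sketched as follows. The bin count and the normalisation by the local inter-annotation interval are our own assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def beat_error_entropy(beats, annotations, n_bins=40):
    """Entropy (bits) of the histogram of timing errors between tracked
    beats and their nearest annotated beats, each error normalised by the
    local inter-annotation interval. Lower entropy = more consistent tracking."""
    annotations = np.asarray(annotations, dtype=float)
    errors = []
    for b in beats:
        i = int(np.argmin(np.abs(annotations - b)))
        # local inter-annotation interval around annotation i
        if i + 1 < len(annotations):
            interval = annotations[i + 1] - annotations[i]
        else:
            interval = annotations[i] - annotations[i - 1]
        e = (b - annotations[i]) / interval
        errors.append(((e + 0.5) % 1.0) - 0.5)  # wrap into [-0.5, 0.5)
    hist, _ = np.histogram(errors, bins=n_bins, range=(-0.5, 0.5))
    p = hist / hist.sum()
    return float(-np.sum(p[p > 0] * np.log2(p[p > 0])))
```

A perfect tracker puts all errors in one bin (entropy 0), while scattered, inconsistent errors spread across many bins and yield high entropy.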
Plumbley MD (1993) Information theory and neural network learning algorithms, pp. 145-155 Institute of Physics Publishing
There have been a number of recent papers on information theory and neural networks, especially in a perceptual system such as vision. Some of these approaches are examined, and their implications for neural network learning algorithms are considered. Existing supervised learning algorithms such as Back Propagation to minimize mean squared error can be viewed as attempting to minimize an upper bound on information loss. By making an assumption of noise either at the input or the output to the system, unsupervised learning algorithms such as those based on Hebbian (principal component analysing) or anti-Hebbian (decorrelating) approaches can also be viewed in a similar light. The optimization of information by the use of interneurons to decorrelate output units suggests a role for inhibitory interneurons and cortical loops in biological sensory systems.
Damnjanovic I, Davies MEP, Plumbley MD (2010) SMALLbox - An Evaluation Framework for Sparse Representations and Dictionary Learning Algorithms., LVA/ICA 6365 pp. 418-425 Springer
Stowell D, Plumbley MD, Bryan-Kinns N (2008) Discourse Analysis Evaluation Method for Expressive Musical Interfaces., NIME pp. 81-86 nime.org
Spectrum used for Machine-to-Machine (M2M) communications should be as cheap as possible or even free in order to connect billions of devices. Recently, both UK and US regulators have conducted trials and pilots to release the UHF TV spectrum for secondary licence-exempt applications. However, it is a very challenging task to implement wideband spectrum sensing in compact and low power M2M devices as high sampling rates are very expensive and difficult to achieve. In recent years, the compressive sensing (CS) technique has made fast wideband spectrum sensing possible by taking samples at sub-Nyquist sampling rates. In this paper, we propose a two-step CS based spectrum sensing algorithm. In the first step, CS is implemented at a secondary user (SU) and only part of the spectrum of interest is sensed by an SU in each sensing period, to reduce the complexity of the signal recovery process. In the second step, a denoising algorithm is proposed to improve the detection performance of spectrum sensing. The proposed two-step CS based spectrum sensing scheme is compared with the traditional scheme and with the theoretical curves.
Nishimori Y, Akaho S, Abdallah S, Plumbley MD (2007) Flag manifolds for subspace ICA problems, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 4
We investigate the use of the Riemannian optimization method over the flag manifold in subspace ICA problems such as independent subspace analysis (ISA) and complex ICA. In the ISA experiment, we use the Riemannian approach over the flag manifold together with an MCMC method to overcome the problem of local minima of the ISA cost function. Experiments demonstrate the effectiveness of both Riemannian methods - simple geodesic gradient descent and hybrid geodesic gradient descent, compared with the ordinary gradient method. © 2007 IEEE.
Mailhé B, Barchiesi D, Plumbley MD (2012) INK-SVD: Learning incoherent dictionaries for sparse representations, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 3573-3576
This work considers the problem of learning an incoherent dictionary that is both adapted to a set of training data and incoherent so that existing sparse approximation algorithms can recover the sparsest representation. A new decorrelation method is presented that computes a fixed coherence dictionary close to a given dictionary. That step iterates pairwise decorrelations of atoms in the dictionary. Dictionary learning is then performed by adding this decorrelation method as an intermediate step in the K-SVD learning algorithm. The proposed algorithm INK-SVD is tested on musical data and compared to another existing decorrelation method. INK-SVD can compute a dictionary that approximates the training data as well as K-SVD while decreasing the coherence from 0.6 to 0.2. © 2012 IEEE.
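The pairwise decorrelation step can be illustrated geometrically: two unit-norm atoms are rotated symmetrically within the plane they span until their inner product equals the target coherence. The sketch below is our own rendering of that idea, not the INK-SVD reference implementation:

```python
import numpy as np

def decorrelate_pair(a, b, mu):
    """Rotate two unit atoms symmetrically in their common plane so that
    their inner product drops to the target coherence mu."""
    u = (a + b) / np.linalg.norm(a + b)
    v = (a - b) / np.linalg.norm(a - b)
    theta = 0.5 * np.arccos(mu)  # a'.b' = cos(2*theta) = mu
    return (np.cos(theta) * u + np.sin(theta) * v,
            np.cos(theta) * u - np.sin(theta) * v)

def reduce_coherence(D, mu, n_iter=100):
    """Repeatedly decorrelate the most correlated pair of dictionary
    atoms (columns of D) until no pair exceeds coherence mu."""
    D = np.array(D, dtype=float)
    D /= np.linalg.norm(D, axis=0)  # unit-norm atoms
    for _ in range(n_iter):
        G = D.T @ D
        np.fill_diagonal(G, 0.0)
        i, j = np.unravel_index(np.argmax(np.abs(G)), G.shape)
        if abs(G[i, j]) <= mu:
            break
        s = np.sign(G[i, j])  # handle anti-correlated pairs
        ai, aj = decorrelate_pair(D[:, i], s * D[:, j], mu)
        D[:, i], D[:, j] = ai, s * aj
    return D
```

With a target of mu = 0.2 this mirrors the coherence level reported in the abstract, while each rotation preserves the unit norm of both atoms.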
The analysis of large datasets of music audio and other representations entails the need for techniques that support musicologists and other users in interpreting extracted data. We explore and develop visualisation techniques of chord sequence patterns mined from a corpus of over one million tracks. The visualisations use different representations of root movements and chord qualities with geometrical representations, and mostly colour mappings for pattern support. The presented visualisations are being developed in close collaboration with musicologists and can help gain insights into the differences of musical genres and styles as well as support the development of new classification methods.
This paper describes an algorithm for real-time beat tracking with a visual interface. Multiple tempo and phase hypotheses are represented by a comb filter matrix. The user can interact by specifying the tempo and phase to be tracked by the algorithm, which will seek to find a continuous path through the space. We present results from evaluating the algorithm on the Hainsworth database and offer a comparison with another existing real-time beat tracking algorithm and offline algorithms.
Ophir B, Elad M, Bertin N, Plumbley MD (2011) Sequential minimal eigenvalues - An approach to analysis dictionary learning, European Signal Processing Conference pp. 1465-1469
Over the past decade there has been a great interest in a synthesis-based model for signals, based on sparse and redundant representations. Such a model assumes that the signal of interest can be decomposed as a linear combination of few columns from a given matrix (the dictionary). An alternative, analysis-based, model can be envisioned, where an analysis operator multiplies the signal, leading to a sparse outcome. In this paper we propose a simple but effective analysis operator learning algorithm, where analysis "atoms" are learned sequentially by identifying directions that are orthogonal to a subset of the training data. We demonstrate the effectiveness of the algorithm in three experiments, treating synthetic data and real images, showing a successful and meaningful recovery of the analysis operator. © 2011 EURASIP.
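Learning one analysis "atom" as a direction orthogonal to a subset of the training data amounts to taking a minimal eigenvector of the subset's covariance. This is our own minimal sketch of a single step of that sequential procedure, not the authors' algorithm:

```python
import numpy as np

def minimal_eigenvalue_atom(X):
    """One analysis 'atom': the unit vector w minimising ||w^T X||_2,
    i.e. the eigenvector of X X^T with the smallest eigenvalue, which is
    (nearly) orthogonal to the training subset X (signals as columns)."""
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)  # eigh sorts eigenvalues ascending
    return eigvecs[:, 0]
```

Applied to a subset of signals lying near a low-dimensional subspace, the returned vector annihilates that subset, producing a sparse analysis outcome.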
Plumbley MD (1996) Information processing in negative feedback neural networks, NETWORK-COMPUTATION IN NEURAL SYSTEMS 7 (2) pp. 301-305 IOP PUBLISHING LTD
Fujihara H, Klapuri A, Plumbley MD (2012) Instrumentation-based music similarity using sparse representations, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 433-436
This paper describes a novel music similarity calculation method that is based on the instrumentation of music pieces. The approach taken here is based on the idea that sparse representations of musical audio signals are a rich source of information regarding the elements that constitute the observed spectra. We propose a method to extract feature vectors based on sparse representations and use these to calculate a similarity measure between songs. To train a dictionary for sparse representations from a large amount of training data, a novel dictionary-initialization method based on agglomerative clustering is proposed. An objective evaluation shows that the new features improve the performance of similarity calculation compared to the standard mel-frequency cepstral coefficients features. © 2012 IEEE.
In a cognitive radio (CR) system, cooperative spectrum sensing (CSS) is the key to improving sensing performance in deep fading channels. In CSS networks, signals received at the secondary users (SUs) are sent to a fusion center to make a final decision of the spectrum occupancy. In this process, the presence of malicious users sending false sensing samples can severely degrade the performance of the CSS network. In this paper, with the compressive sensing (CS) technique being implemented at each SU, we build a CSS network with double sparsity property. A new malicious user detection scheme is proposed by utilizing the adaptive outlier pursuit (AOP) based low-rank matrix completion in the CSS network. In the proposed scheme, the malicious users are removed in the process of signal recovery at the fusion center. The numerical analysis of the proposed scheme is carried out and compared with an existing malicious user detection algorithm.
Stowell D, Mušević S, Bonada J, Plumbley MD (2013) Improved multiple birdsong tracking with distribution derivative method and Markov renewal process clustering, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 468-472
Segregating an audio mixture containing multiple simultaneous bird sounds is a challenging task. However, birdsong often contains rapid pitch modulations, and these modulations carry information which may be of use in automatic recognition. In this paper we demonstrate that an improved spectrogram representation, based on the distribution derivative method, leads to improved performance of a segregation algorithm which uses a Markov renewal process model to track vocalisation patterns consisting of singing and silences. © 2013 IEEE.
Weyde T, Cottrell S, Dykes J, Benetos E, Wolff D, Tidhar D, Kachkaev A, Plumbley M, Dixon S, Barthet M, Gold N, Abdallah S, Alancar-Brayner A, Mahey M, Tovell A (2014) Big Data for Musicology,
Digital music libraries and collections are growing quickly and are increasingly made available for research. We argue that the use of large data collections will enable a better understanding of music performance and music in general, which will benefit areas such as music search and recommendation, music archiving and indexing, music production and education. However, to achieve these goals it is necessary to develop new musicological research methods, to create and adapt the necessary technological infrastructure, and to find ways of working with legal limitations. Most of the necessary basic technologies exist, but they need to be brought together and applied to musicology. We aim to address these challenges in the Digital Music Lab project, and we feel that with suitable methods and technology Big Music Data can provide new opportunities to musicology.
Plumbley MD, Abdallah SA, Blumensath T, Davies ME (2006) Sparse representations of polyphonic music., Signal Processing 86 3 pp. 417-431
Simpson AR, Roma G, Plumbley M (2015) Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network, 9237 pp. 429-436 Springer International Publishing
Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNN) have been used to estimate 'ideal' binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a convolutional DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.
Cleju N, Jafari MG, Plumbley MD (2012) Analysis-based sparse reconstruction with synthesis-based solvers., ICASSP pp. 5401-5404 IEEE
Vincent E, Plumbley MD (2006) Single-Channel Mixture Decomposition Using Bayesian Harmonic Models., ICA 3889 pp. 722-730 Springer
Robertson A, Plumbley M (2007) B-Keeper: A beat-tracker for live performance, Proceedings of the 7th International Conference on New Interfaces for Musical Expression, NIME '07 pp. 234-237
This paper describes the development of B-Keeper, a real-time beat tracking system implemented in Java and Max/MSP, which is capable of maintaining synchronisation between an electronic sequencer and a drummer. This enables musicians to interact with electronic parts which are triggered automatically by the computer from performance information. We describe an implementation which functions with the sequencer Ableton Live.
Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not discernible. We study the interaction between dataset characteristics and choice of feature representation through further empirical analysis.
Stowell D, Plumbley MD (2012) Framewise heterodyne chirp analysis of birdsong, European Signal Processing Conference pp. 2694-2698
Harmonic birdsong is often highly nonstationary, which suggests that standard FFT representations may be of limited suitability. Wavelet and chirplet techniques exist in the literature, but are not often applied to signals such as bird vocalisations, perhaps due to analysis complexity. In this paper we develop a single-scale chirp analysis (computationally accelerated using FFT) which can be treated as an ordinary time-series. We then study a sinusoidal representation simply derived from the peak bins of this time-series. We show that it can lead to improved species classification from birdsong. © 2012 EURASIP.
Murray-Browne T, Mainstone D, Bryan-Kinns N, Plumbley MD (2013) The serendiptichord: Reflections on the collaborative design process between artist and researcher, Leonardo 46 (1) pp. 86-87
The Serendiptichord is a wearable instrument, resulting from a collaboration crossing fashion, technology, music and dance. This paper reflects on the collaborative process and how defining both creative and research roles for each party led to a successful creative partnership built on mutual respect and open communication. After a brief snapshot of the instrument in performance, the instrument is considered within the context of dance-driven interactive music systems followed by a discussion on the nature of the collaboration and its impact upon the design process and final piece. © 2013 ISAST.
In this article, we present an account of the state of the art in acoustic scene classification (ASC), the task of classifying environments from the sounds they produce. Starting from a historical review of previous research in this area, we define a general framework for ASC and present different implementations of its components. We then describe a range of different algorithms submitted for a data challenge that was held to provide a general and fair benchmark for ASC techniques. The data set recorded for this purpose is presented along with the performance metrics that are used to evaluate the algorithms and statistical significance tests to compare the submitted methods.
O'Hanlon K, Plumbley MD (2013) Automatic Music Transcription using row weighted decompositions, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 16-20
Automatic Music Transcription (AMT) seeks to understand a musical piece in terms of note activities. Matrix decomposition methods are often used for AMT, seeking to decompose a spectrogram over a dictionary matrix of note-specific template vectors. The performance of these methods can suffer due to the large harmonic overlap found in tonal musical spectra. We propose a row weighting scheme that transforms each spectrogram frame and the dictionary, with the weighting determined by the effective correlations in the decomposition. Experiments show improved AMT performance. © 2013 IEEE.
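The core operation described above, decomposing a spectrogram frame over a dictionary under a row weighting, can be sketched with multiplicative updates for a weighted Euclidean cost. The weight vector below is a generic placeholder; in the paper the weighting is determined from the effective correlations in the decomposition:

```python
import numpy as np

def weighted_nnls(D, v, w, n_iter=500):
    """Nonnegative decomposition of spectrum v over dictionary D (atoms as
    columns) under a row weighting w, via multiplicative updates for the
    weighted Euclidean cost ||diag(w) (v - D h)||^2."""
    W2 = w ** 2
    h = np.full(D.shape[1], 0.1)  # positive initialisation
    for _ in range(n_iter):
        num = D.T @ (W2 * v)
        den = D.T @ (W2 * (D @ h)) + 1e-12  # avoid division by zero
        h *= num / den  # multiplicative update preserves nonnegativity
    return h
```

Up-weighting rows (frequency bins) that discriminate between overlapping harmonic templates is the kind of transformation the weighting scheme applies to each frame and the dictionary.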
Ekeus H, Mcowan PW, Plumbley MD (2013) Eye Tracking as Interface for Parametric Design,
This research investigates the potential of eye tracking as an interface to parameter search in visual design. We outline our experimental framework where a user's gaze acts as guiding feedback mechanism in an exploration of the state space of parametric designs. A small scale pilot study was carried out where participants influence the evolution of generative patterns by looking at a screen while having their eyes tracked. Preliminary findings suggest that although our eye tracking system can be used to effectively navigate small areas of a parametric design's state-space, there are challenges to overcome before such a system is practical in a design context. Finally we outline future directions of this research.
Stark AM, Plumbley MD (2010) Performance following: Tracking a performance without a score., ICASSP pp. 2482-2485 IEEE
Jafari MG, Plumbley MD (2007) The Role of High Frequencies in Convolutive Blind Source Separation of Speech Signals., ICA 4666 pp. 488-494 Springer
In this paper, a method for separation of harmonic and percussive elements in music recordings is presented. The proposed method is based on a simple spectral peak detection step followed by a phase expectation analysis that discriminates between harmonic and percussive components. The proposed method was tested on a database of 10 audio tracks and has shown superior results to the reference state-of-the-art approach.
Vincent E, Plumbley MD (2008) Efficient Bayesian inference for harmonic models via adaptive posterior factorization., Neurocomputing 72 1-3 pp. 79-87
Gretsistas A, Damnjanovic I, Plumbley MD (2010) Gradient Polytope Faces Pursuit for large scale sparse recovery problems., ICASSP pp. 2030-2033 IEEE
Audio source separation is a difficult machine learning problem and performance is measured by comparing extracted signals with the component source signals. However, if separation is motivated by the ultimate goal of re-mixing then complete separation is not necessary and hence separation difficulty and separation quality are dependent on the nature of the re-mix. Here, we use a convolutional deep neural network (DNN), trained to estimate 'ideal' binary masks for separating voice from music, to perform re-mixing of the vocal balance by operating directly on the individual magnitude components of the musical mixture spectrogram. Our results demonstrate that small changes in vocal gain may be applied with very little distortion to the ultimate re-mix. Our method may be useful for re-mixing existing mixes.
Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment and also that there are specific social uses of FM such as trills in aggressive territorial encounters. Yet temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction. Hence, it is important to consider high-resolution signal processing techniques for analysis of FM in bird vocalizations. If such methods can be applied at big data scales, this offers a further advantage as large data sets become available. We introduce methods from the signal processing literature which can go beyond spectrogram representations to analyse the fine modulations present in a signal at very short time-scales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. We evaluate this through a feature selection technique and an automatic classification experiment. In order to find tools useful in practical analysis of large data bases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression. We find three methods which can robustly represent species-correlated FM attributes and can be applied to large data sets, and that the simplest method tested also appears to perform the best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information. FM analysis can extract information useful for bioacoustic studies, in addition to measures more commonly used to characterize vocalizations. Further, it can be applied efficiently across very large data sets and archives.
Given a musical audio recording, the goal of music transcription is to determine a score-like representation of the piece underlying the recording. Most current transcription methods employ variants of non-negative matrix factorization (NMF), which often fails to robustly model instruments producing non-stationary sounds. Using entire time-frequency patterns to represent sounds, non-negative matrix deconvolution (NMD) can capture certain types of nonstationary behavior but is only applicable if all sounds have the same length. In this paper, we present a novel method that combines the non-stationarity modeling capabilities available with NMD with the variable note lengths possible with NMF. Identifying frames in NMD patterns with states in a dynamical system, our method iteratively generates sound-object candidates separately for each pitch, which are then combined in a global optimization. We demonstrate the transcription capabilities of our method using piano pieces assuming the availability of single note recordings as training data.
We present a robust method for the detection of the first and second heart sounds (S1 and S2), without an ECG reference, based on a music beat tracking algorithm. An intermediate representation of the input signal is first calculated using an onset detection function based on the complex spectral difference. A music beat tracking algorithm is then used to determine the location of the first heart sound. The beat tracker works in two steps: it first calculates the beat period and then finds the temporal beat alignment. Once the first sound is detected, inverse Gaussian weights are applied to the onset function at the detected positions and the algorithm is run again to find the second heart sound. In the last step, S1 and S2 labels are attributed to the detected sounds. The algorithm was evaluated in terms of location accuracy as well as sensitivity and specificity, and it performed well even in the presence of murmurs or noisy signals.
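The pipeline above starts from an onset detection function based on the complex spectral difference: each bin's magnitude and phase are predicted from the two previous frames, and the per-frame deviation from that prediction is summed. A minimal sketch of such a function (illustrative parameter values; not the authors' implementation):

```python
import numpy as np

def complex_spectral_difference(x, n_fft=1024, hop=512):
    """Onset detection function based on complex spectral difference.
    Each bin's complex value is predicted from the previous two frames
    (constant magnitude, constant phase increment), and the deviation
    from that prediction is summed per frame."""
    win = np.hanning(n_fft)
    frames = [win * x[i:i + n_fft]
              for i in range(0, len(x) - n_fft + 1, hop)]
    X = np.array([np.fft.rfft(f) for f in frames])   # (n_frames, n_bins)
    mag, phase = np.abs(X), np.angle(X)
    odf = np.zeros(len(X))
    for n in range(2, len(X)):
        # constant-rate phase prediction from the two previous frames
        phase_pred = 2.0 * phase[n - 1] - phase[n - 2]
        target = mag[n - 1] * np.exp(1j * phase_pred)
        odf[n] = np.sum(np.abs(X[n] - target))
    return odf
```

Peaks of this function mark sudden spectral change, which is what the beat tracker then operates on.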
Plumbley MD (1999) Do cortical maps adapt to optimize information density?, Network: Computation in Neural Systems 10 (1) pp. 41-58
Topographic maps are found in many biological and artificial neural systems. In biological systems, some parts of these can form a significantly expanded representation of their sensory input, such as the representation of the fovea in the visual cortex. We propose that a cortical feature map should be organized to optimize the efficiency of information transmission through it. This leads to a principle of uniform cortical information density across the map as the desired optimum. An expanded representation in the cortex for a particular sensory area (i.e. a high magnification factor) means that a greater information density is concentrated in that sensory area, leading to finer discrimination thresholds. Improvement may ultimately be limited by the construction of the sensors themselves. This approach gives a good fit to threshold versus cortical area data of Recanzone et al. on owl monkeys trained on a tactile frequency-discrimination task.
Stowell D, Plumbley MD (2011) Learning Timbre Analogies from Unlabelled Data by Multivariate Tree Regression, Journal of New Music Research 40 (4) pp. 325-336
Applications such as concatenative synthesis (audio mosaicing) and query-by-example require the ability to search a database using a sound which is qualitatively different from the actual desired result; for example, when using vocal queries to retrieve nonvocal sound. Standard query techniques such as nearest neighbours do not account for this difference between source and target; they perform retrieval but do not learn to make timbral analogies. This paper addresses this issue by considering timbral query as a multivariate regression problem from one timbre distribution onto another. We develop a novel variant of multivariate tree regression: given only a set of unlabelled and unpaired samples from two distributions on the same space, the regression learns a cross-associative mapping which assumes general similarities in structure of the two distributions, yet can accommodate differences in shape at various scales. We demonstrate the technique with a synthetic example and with a concatenative synthesizer. © 2011 Copyright Taylor and Francis Group, LLC.
Davies MEP, Plumbley MD (2008) Exploring the effect of rhythmic style classification on automatic tempo estimation, European Signal Processing Conference
Within ballroom dance music, tempo and rhythmic style are strongly related. In this paper we explore this relationship, by using knowledge of rhythmic style to improve tempo estimation in musical audio signals. We demonstrate how the use of a simple 1-NN classification method, able to determine rhythmic style with 75% accuracy, can lead to an 8 percentage point improvement over existing tempo estimation algorithms, with further gains possible through the use of more sophisticated classification techniques.
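The style-conditioned tempo estimation described above can be sketched as follows; the feature vectors, style names and tempo ranges here are made up for illustration and stand in for the real training data:

```python
import numpy as np

# Hypothetical training data: one rhythmic-pattern feature vector per
# excerpt, with its ballroom style label (features are illustrative).
train_feats = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]])
train_styles = ["jive", "jive", "waltz", "waltz"]
# Prior tempo range (BPM) per style, used to constrain the tempo estimate.
style_tempo_range = {"jive": (152, 176), "waltz": (84, 96)}

def classify_1nn(feat):
    """1-nearest-neighbour style classification by Euclidean distance."""
    d = np.linalg.norm(train_feats - feat, axis=1)
    return train_styles[int(np.argmin(d))]

def constrained_tempo(feat, tempo_candidates):
    """Pick the strongest tempo candidate inside the predicted style's range."""
    lo, hi = style_tempo_range[classify_1nn(feat)]
    in_range = [t for t in tempo_candidates if lo <= t <= hi]
    return in_range[0] if in_range else tempo_candidates[0]
```

The style prior resolves octave ambiguities: of two candidate tempi related by a factor of two, only one usually falls inside the predicted style's range.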
Nesbit A, Plumbley MD (2008) Oracle estimation of adaptive cosine packet transforms for underdetermined audio source separation., ICASSP pp. 41-44 IEEE
Mailhé B, Plumbley MD (2012) Dictionary Learning with Large Step Gradient Descent for Sparse Representations., LVA/ICA 7191 pp. 231-238 Springer
This work presents a geometrical analysis of the Large Step Gradient Descent (LGD) dictionary learning algorithm. LGD updates the atoms of the dictionary using a gradient step with a step size equal to twice the optimal step size. We show that the large step gradient descent can be understood as a maximal exploration step, where one goes as far away as possible without increasing the error. We also show that the LGD iteration is monotonic when the algorithm used for the sparse approximation step is close enough to orthogonal.
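The geometric claim above, that doubling the optimal step size goes as far as possible without increasing the error, can be checked directly on the quadratic error ||X - DA||_F². A sketch of one LGD dictionary update (not the authors' code):

```python
import numpy as np

def lgd_dictionary_update(X, D, A):
    """One LGD dictionary update: a gradient step on ||X - D A||_F^2 with
    step size twice the optimal (line-search) step size.  Along the
    gradient direction the error is quadratic in the step size, so twice
    the optimal step returns the error exactly to its starting value:
    the update moves as far as possible without increasing the error."""
    R = X - D @ A                          # residual
    G = R @ A.T                            # descent direction (-1/2 gradient)
    denom = np.sum((G @ A) ** 2)
    if denom == 0:
        return D
    mu_opt = np.sum(R * (G @ A)) / denom   # optimal step along G
    return D + 2.0 * mu_opt * G            # large step: twice optimal
```

Because E(mu) = ||R - mu*G@A||² is a parabola in mu with minimum at mu_opt, E(2*mu_opt) = E(0), which a quick numerical check confirms.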
We propose a denoising and segmentation technique for the second heart sound (S2). To denoise, Matching Pursuit (MP) was applied using a set of non-linear chirp signals as atoms. We show that the proposed method can be used to segment the phonocardiogram of the second heart sound into its two clinically meaningful components: the aortic (A2) and pulmonary (P2) components. © 2012 IEEE.
Giannoulis D, Benetos E, Stowell D, Rossignol M, Lagrange M, Plumbley MD (2013) Detection and classification of acoustic scenes and events: An IEEE AASP challenge, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
This paper describes a newly-launched public evaluation challenge on acoustic scene classification and detection of sound events within a scene. Systems dealing with such tasks are far from exhibiting human-like performance and robustness. Undermining factors are numerous: the extreme variability of sources of interest possibly interfering, the presence of complex background noise as well as room effects like reverberation. The proposed challenge is an attempt to help the research community move forward in defining and studying the aforementioned tasks. Apart from the challenge description, this paper provides an overview of systems submitted to the challenge as well as a detailed evaluation of the results achieved by those systems. © 2013 IEEE.
Automatic Music Transcription is often performed by decomposing a spectrogram over a dictionary of note-specific atoms. Several note template atoms may be used to represent one note, and a group structure may be imposed on the dictionary. We propose a group sparse algorithm based on a multiplicative update and thresholding and show transcription results on a challenging dataset.
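The decomposition described above can be sketched with a plain Euclidean multiplicative update over a fixed dictionary, followed by a simple per-group energy threshold; parameters and the thresholding rule here are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def group_sparse_decompose(V, W, groups, n_iter=200, thresh=1e-3):
    """Decompose a spectrogram V over a fixed dictionary W of note-template
    atoms using multiplicative updates (Euclidean NMF with W fixed), then
    threshold whole groups of activations: all atoms belonging to one note
    are kept or zeroed together in each frame."""
    H = np.full((W.shape[1], V.shape[1]), 1.0)
    for _ in range(n_iter):
        # standard multiplicative update for H, W held fixed
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    for g in groups:                     # g = atom indices for one note
        energy = np.sqrt(np.sum(H[g] ** 2, axis=0))   # per-frame group energy
        weak = np.where(energy < thresh * energy.max())[0]
        H[np.ix_(g, weak)] = 0.0
    return H
```

Grouping the atoms of each note means a note is switched on or off as a whole, rather than atom by atom.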
Barchiesi D, Plumbley MD (2013) Learning incoherent dictionaries for sparse approximation using iterative projections and rotations, IEEE Transactions on Signal Processing 61 (8) pp. 2055-2065 IEEE
This article deals with learning dictionaries for sparse approximation whose atoms are both adapted to a training set of signals and mutually incoherent. To meet this objective, we employ a dictionary learning scheme consisting of sparse approximation followed by dictionary update and we add to the latter a decorrelation step in order to reach a target mutual coherence level. This step is accomplished by an iterative projection method complemented by a rotation of the dictionary. Experiments on musical audio data and a comparison with the method of optimal coherence-constrained directions (MOCOD) and the incoherent K-SVD (INK-SVD) illustrate that the proposed algorithm can learn dictionaries that exhibit a low mutual coherence while providing a sparse approximation with better signal-to-noise ratio (SNR) than the benchmark techniques. © 1991-2012 IEEE.
Oja E, Plumbley M (2004) Blind separation of positive sources by globally convergent gradient search., Neural Comput 16 (9) pp. 1811-1825
The instantaneous noise-free linear mixing model in independent component analysis is largely a solved problem under the usual assumption of independent nongaussian sources and full column rank mixing matrix. However, with some prior information on the sources, like positivity, new analysis and perhaps simplified solution methods may yet become possible. In this letter, we consider the task of independent component analysis when the independent sources are known to be nonnegative and well grounded, which means that they have a nonzero pdf in the region of zero. It can be shown that in this case, the solution method is basically very simple: an orthogonal rotation of the whitened observation vector into nonnegative outputs will give a positive permutation of the original sources. We propose a cost function whose minimum coincides with nonnegativity and derive the gradient algorithm under the whitening constraint, under which the separating matrix is orthogonal. We further prove that in the Stiefel manifold of orthogonal matrices, the cost function is a Lyapunov function for the matrix gradient flow, implying global convergence. Thus, this algorithm is guaranteed to find the nonnegative well-grounded independent sources. The analysis is complemented by a numerical simulation, which illustrates the algorithm.
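For two sources, the idea above reduces to finding the orthogonal rotation of the whitened observations that minimises the energy of the negative parts of the outputs. The sketch below replaces the paper's Stiefel-manifold gradient flow with a simple grid search over rotation angles (illustrative only, not the authors' algorithm):

```python
import numpy as np

def nonneg_ica_2d(Z, n_grid=3600):
    """Recover two nonnegative well-grounded sources from whitened
    observations Z (2 x n) by searching for the rotation that minimises
    J = E[||y - y_+||^2], the energy of the negative output parts.
    At the separating rotation the outputs are a positive permutation
    of the sources, so J is (close to) zero."""
    best_J, best_Y = np.inf, None
    for theta in np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        W = np.array([[c, -s], [s, c]])
        Y = W @ Z
        J = np.mean(np.minimum(Y, 0.0) ** 2)   # negative-part energy
        if J < best_J:
            best_J, best_Y = J, Y
    return best_Y
```

The grid search stands in for the gradient flow only to keep the sketch short; the cost being minimised is the same.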
For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.
Plumbley MD (2005) Polar Polytopes and Recovery of Sparse Representations,
In this paper we present the supervised iterative projections and rotations (S-IPR) algorithm, a method to optimise a set of discriminative subspaces for supervised classification. We show how the proposed technique is based on our previous unsupervised iterative projections and rotations (IPR) algorithm for incoherent dictionary learning, and how projecting the features onto the learned subspaces can be employed as a feature transform algorithm in the context of classification. Numerical experiments on the FISHERIRIS and on the USPS datasets, and a comparison with the PCA and LDA methods for feature transform, demonstrate the value of the proposed technique and its potential as a tool for machine learning. © 2013 IEEE.
Barthet M, Plumbley MD, Kachkaev A, Dykes J, Wolff D, Weyde T (2014) Big chord data extraction and mining,
Harmonic progression is one of the cornerstones of tonal music composition and is thereby essential to many musical styles and traditions. Previous studies have shown that musical genres and composers could be discriminated based on chord progressions modeled as chord n-grams. These studies were however conducted on small-scale datasets and using symbolic music transcriptions. In this work, we apply pattern mining techniques to over 200,000 chord progression sequences out of 1,000,000 extracted from the I Like Music (ILM) commercial music audio collection. The ILM collection spans 37 musical genres and includes pieces released between 1907 and 2013. We developed a single program multiple data parallel computing approach whereby audio feature extraction tasks are split up and run simultaneously on multiple cores. An audio-based chord recognition model (Vamp plugin Chordino) was used to extract the chord progressions from the ILM set. To keep low-weight feature sets, the chord data were stored using a compact binary format. We used the CM-SPADE algorithm, which performs a vertical mining of sequential patterns using co-occurrence information, and which is fast and efficient enough to be applied to big data collections like the ILM set. In order to derive key-independent frequent patterns, the transitions between chords are modeled by changes of qualities (e.g. major, minor, etc.) and root keys (e.g. fourth, fifth, etc.). The resulting key-independent chord progression patterns vary in length (from 2 to 16) and frequency (from 2 to 19,820) across genres. As illustrated by graphs generated to represent frequent 4-chord progressions, some patterns like circle-of-fifths movements are well represented in most genres but in varying degrees. These large-scale results offer the opportunity to uncover similarities and discrepancies between sets of musical pieces and therefore to build classifiers for search and recommendation. They also support the empirical testing of music theory. It is, however, more difficult to derive new hypotheses from such a dataset due to its size. This can be addressed by using pattern detection algorithms or suitable visualisation, which we present in a companion study.
Stowell D, Plumbley MD (2010) Timbre remapping through a regression-tree technique, Proceedings of the 7th Sound and Music Computing Conference, SMC 2010
We consider the task of inferring associations between two differently-distributed and unlabelled sets of timbre data. This arises in applications such as concatenative synthesis/ audio mosaicing in which one audio recording is used to control sound synthesis through concatenating fragments of an unrelated source recording. Timbre is a multidimensional attribute with interactions between dimensions, so it is non-trivial to design a search process which makes best use of the timbral variety available in the source recording. We must be able to map from control signals whose timbre features have different distributions from the source material, yet labelling large collections of timbral sounds is often impractical, so we seek an unsupervised technique which can infer relationships between distributions. We present a regression tree technique which learns associations between two unlabelled multidimensional distributions, and apply the technique to a simple timbral concatenative synthesis system. We demonstrate numerically that the mapping makes better use of the source material than a nearest-neighbour search. © 2010 Dan Stowell et al.
A method based on local spectral features and missing feature techniques is proposed for the recognition of harmonic sounds in mixture signals. A mask estimation algorithm is proposed for identifying spectral regions that contain reliable information for each sound source, and bounded marginalization is then employed to treat the feature vector elements that are determined to be unreliable. The proposed method is tested on musical instrument sounds, owing to the extensive availability of data, but it can be applied to other harmonic sounds (e.g. animal sounds, environmental sounds). In simulations the proposed method clearly outperformed a baseline method for mixture signals.
Cannam C, Figueira LA, Plumbley MD (2012) Sound software: Towards software reuse in audio and music research, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 2745-2748
Although researchers are increasingly aware of the need to publish and maintain software code alongside their results, practical barriers prevent this from happening in many cases. We examine these barriers, propose an incremental approach to overcoming some of them, and describe the Sound Software project, an effort to support software development practice in the UK audio and music research community. Finally we make some recommendations for research groups seeking to improve their own researchers' software practice. © 2012 IEEE.
Mailhe B, Sturm B, Plumbley MD (2013) Recovery of nested supports by greedy sparse representation algorithms over non-normalized dictionaries,
We prove that if Orthogonal Matching Pursuit (OMP) recovers all s-sparse signals for a given dictionary, then it also recovers s′-sparse signals on the same dictionary for any s′ < s. We also extend Tropp's Exact Recovery Condition (ERC) to dictionaries with non-normalized atoms. Our result does not contradict an earlier result stating that there are dictionaries and cardinalities s for which all s-size supports satisfy Tropp's ERC but not all smaller supports do: that result was proved using non-normalized dictionaries, and in that case Tropp's ERC is not linked to recovery by OMP.
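For reference, a textbook sketch of OMP with unit-norm atoms (the paper's analysis concerns the non-normalized case, which this sketch does not cover):

```python
import numpy as np

def omp(D, x, s):
    """Orthogonal Matching Pursuit: greedily select the atom most
    correlated with the residual, then re-fit the coefficients on the
    selected support by least squares.  Assumes unit-norm atoms."""
    support = []
    r = x.copy()
    for _ in range(s):
        k = int(np.argmax(np.abs(D.T @ r)))      # most correlated atom
        if k not in support:
            support.append(k)
        # orthogonal projection onto the span of the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        r = x - D[:, support] @ coef
    return support, coef
```

With an orthonormal dictionary, OMP recovers the true support exactly in s steps, which makes for an easy sanity check.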
Cleju N, Jafari MG, Plumbley MD (2012) Analysis-based sparse reconstruction with synthesis-based solvers, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 5401-5404
Analysis based reconstruction has recently been introduced as an alternative to the well-known synthesis sparsity model used in a variety of signal processing areas. In this paper we convert the analysis exact-sparse reconstruction problem to an equivalent synthesis recovery problem with a set of additional constraints. We are therefore able to use existing synthesis-based algorithms for analysis-based exact-sparse recovery. We call this the Analysis-By-Synthesis (ABS) approach. We evaluate our proposed approach by comparing it against the recent Greedy Analysis Pursuit (GAP) analysis-based recovery algorithm. The results show that our approach is a viable option for analysis-based reconstruction, while at the same time allowing many algorithms that have been developed for synthesis reconstruction to be directly applied for analysis reconstruction as well. © 2012 IEEE.
Robertson A, Plumbley MD (2013) Synchronizing sequencing software to a live drummer, Computer Music Journal 37 (2) pp. 46-60
This article presents a method of adjusting the tempo of a music software sequencer so that it remains synchronized with a drummer's musical pulse. This allows music sequencer technology to be integrated into a band scenario without the compromise of using click tracks or triggering loops with a fixed tempo. Our design implements real-time mechanisms for both underlying tempo and phase adjustment using adaptable parameters that control its behavior. The aim is to create a system that responds to timing variations in the drummer's playing but is also stable during passages of syncopation and fills. We present an evaluation of the system using a stochastic drum machine that incorporates a level of noise in the underlying tempo and phase of the beat. We measure synchronization error between the output of the system and the underlying pulse of the drum machine and contrast this with other real-time beat trackers. The software, B-Keeper, has been released as a Max for Live device, available online at www.b-keeper.org. © 2013 Massachusetts Institute of Technology.
Keriven N, O'Hanlon K, Plumbley MD (2013) Structured sparsity using backwards elimination for Automatic Music Transcription, IEEE International Workshop on Machine Learning for Signal Processing, MLSP
Musical signals can be thought of as being sparse and structured, with few elements active at a given instant and temporal continuity of active elements observed. Greedy algorithms such as Orthogonal Matching Pursuit (OMP), and structured variants, have previously been proposed for Automatic Music Transcription (AMT), however some problems have been noted. Hence, we propose the use of a backwards elimination strategy in order to perform sparse decompositions for AMT, in particular with a proposed alternative sparse cost function. However, the main advantage of this approach is the ease with which structure can be incorporated. The use of group sparsity is shown to give increased AMT performance, while a molecular method incorporating onset information is seen to provide further improvements with little computational effort. © 2013 IEEE.
In this paper we present a new method for musical audio source separation, using the information from the musical score to supervise the decomposition process. An original framework using nonnegative matrix factorization (NMF) is presented, where the components are initially learnt on synthetic signals with temporal and harmonic constraints. A new dataset of multitrack recordings with manually aligned MIDI scores is created (TRIOS), and we compare our separation results with other methods from the literature using the BSS EVAL and PEASS evaluation toolboxes. The results show a general improvement of the BSS EVAL metrics for the various instrumental configurations used.
We describe an information-theoretic approach to the analysis of music and other sequential data, which emphasises the predictive aspects of perception, and the dynamic process of forming and modifying expectations about an unfolding stream of data, characterising these using the tools of information theory: entropies, mutual informations, and related quantities. After reviewing the theoretical foundations, we discuss a few emerging areas of application, including musicological analysis, real-time beat-tracking analysis, and the generation of musical materials as a cognitively-informed compositional aid. © 2012 IEEE.
Music remixing is difficult when the original multitrack recording is not available. One solution is to estimate the elements of a mixture using source separation. However, existing techniques suffer from imperfect separation and perceptible artifacts on single separated sources. To investigate their influence on a remix, five state-of-the-art source separation algorithms were used to remix six songs by increasing the level of the vocals. A listening test was conducted to assess the remixes in terms of loudness balance and sound quality. The results show that some source separation algorithms are able to increase the level of the vocals by up to 6 dB at the cost of introducing a small but perceptible degradation in sound quality.
Plumbley MD, Bevilacqua M (2009) Sparse reconstruction for compressed sensing using stagewise polytope faces pursuit, DSP 2009: 16th International Conference on Digital Signal Processing, Proceedings
Compressed sensing, also known as compressive sampling, is an approach to the measurement of signals which have a sparse representation, and it can reduce the number of measurements that are needed to reconstruct the signal. The signal reconstruction part requires efficient methods to perform sparse reconstruction, such as those based on linear programming. In this paper we present a method for sparse reconstruction which is an extension of our earlier Polytope Faces Pursuit algorithm, based on the polytope geometry of the dual linear program. The new algorithm adds several basis vectors at each stage, in a similar way to the recent Stagewise Orthogonal Matching Pursuit (StOMP) algorithm. We demonstrate the application of the algorithm to some standard compressed sensing problems. © 2009 IEEE.
Roma G, Grais EM, Simpson AJR, Sobieraj I, Plumbley MD (2016) Untwist: A new toolbox for audio source separation,
Untwist is a new open source toolbox for audio source separation. The library provides a self-contained object-oriented framework including common source separation algorithms as well as input/output functions, data management utilities and time-frequency transforms. Everything is implemented in Python, facilitating research, experimentation and prototyping across platforms. The code is available on GitHub.
Barchiesi D, Giannoulis D, Stowell D, Plumbley MD (2015) Acoustic Scene Classification: Classifying environments from the sounds they produce., IEEE Signal Process. Mag. 32 (3) pp. 16-34
Stark AM, Plumbley MD, Davies MEP (2007) Real-time beat-synchronous audio effects, Proceedings of the 7th International Conference on New Interfaces for Musical Expression, NIME '07 pp. 344-345
We present a new group of audio effects that use beat tracking, the detection of beats in an audio signal, to relate effect parameters to the beats in an input signal. Conventional audio effects are augmented so that their operation is related to the output of a beat tracking system. We present a tempo-synchronous delay effect and a set of beat synchronous low frequency oscillator effects including tremolo, vibrato and auto-wah. All effects are implemented in real-time as VST plug-ins to allow for their use in live performance.
Cleju N, Jafari MG, Plumbley MD (2012) Choosing analysis or synthesis recovery for sparse reconstruction, European Signal Processing Conference pp. 869-873
The analysis sparsity model is a recently introduced alternative to the standard synthesis sparsity model frequently used in signal processing. However, the exact conditions when analysis-based recovery is better than synthesis recovery are still not known. This paper constitutes an initial investigation into determining when one model is better than the other, under similar conditions. We perform separate analysis and synthesis recovery on a large number of randomly generated signals that are simultaneously sparse in both models and we compare the average reconstruction errors with both recovery methods. The results show that analysis-based recovery is the better option for a large number of signals, but it is less robust with signals that are only approximately sparse or when fewer measurements are available. © 2012 EURASIP.
Stowell D, Musevic S, Bonada J, Plumbley MD (2013) Improved multiple birdsong tracking with distribution derivative method and Markov renewal process clustering., ICASSP pp. 468-472 IEEE
Jaillet F, Gribonval R, Plumbley MD, Zayyani H (2010) An L1 criterion for dictionary learning by subspace identification., ICASSP pp. 5482-5485 IEEE
Vincent E, Plumbley MD (2005) A prototype system for object coding of musical audio, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics pp. 239-242
This article deals with low bitrate object coding of musical audio, and more precisely with the extraction of pitched sound objects in polyphonic music. After a brief review of existing methods, we discuss the potential benefits of recasting this problem in a Bayesian framework. We define pitched objects by a set of probabilistic priors and derive efficient algorithms to infer active objects and their parameters. Preliminary experiments suggest that the proposed method results in a better sound quality than simple sinusoidal coding while achieving a lower bitrate. © 2005 IEEE.
Davies MEP, Plumbley MD, Eck D (2009) Towards a musical beat emphasis function., WASPAA pp. 61-64 IEEE
O'Hanlon K, Nagano H, Plumbley MD (2012) Structured sparsity for automatic music transcription, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 441-444
Sparse representations have previously been applied to the automatic music transcription (AMT) problem. Structured sparsity, such as group and molecular sparsity allows the introduction of prior knowledge to sparse representations. Molecular sparsity has previously been proposed for AMT, however the use of greedy group sparsity has not previously been proposed for this problem. We propose a greedy sparse pursuit based on nearest subspace classification for groups with coherent blocks, based in a non-negative framework, and apply this to AMT. Further to this, we propose an enhanced molecular variant of this group sparse algorithm and demonstrate the effectiveness of this approach. © 2012 IEEE.
We describe a new method for preprocessing STFT phase-vocoder frames for improved performance in real-time onset detection, which we term "adaptive whitening". The procedure involves normalising the magnitude of each bin according to a recent maximum value for that bin, with the aim of allowing each bin to achieve a similar dynamic range over time, which helps to mitigate the influence of spectral roll-off and strongly-varying dynamics. Adaptive whitening requires no training, is relatively lightweight to compute, and can run in real-time. Yet it can improve onset detector performance by more than ten percentage points (peak F-measure) in some cases, and improves the performance of most of the onset detectors tested. We present results demonstrating that adaptive whitening can significantly improve the performance of various STFT-based onset detection functions, including functions based on the power, spectral flux, phase deviation, and complex deviation measures. Our results find the process to be especially beneficial for certain types of audio signal (e.g. complex mixtures such as pop music).
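The per-bin normalisation described above can be sketched with a slowly-decaying running peak per bin; the `memory` and `floor` values here are illustrative, not the paper's:

```python
import numpy as np

def adaptive_whitening(S, memory=0.997, floor=1e-3):
    """Adaptive whitening of an STFT magnitude spectrogram S (frames x bins):
    each bin is divided by a decaying running peak of its own recent
    magnitude, so every bin tends towards a similar dynamic range over
    time.  A floor keeps silent bins from being amplified to full scale."""
    peak = np.full(S.shape[1], floor)
    out = np.empty_like(S)
    for n in range(S.shape[0]):
        # running peak: current magnitude, the floor, or the decayed peak
        peak = np.maximum(np.maximum(S[n], floor), memory * peak)
        out[n] = S[n] / peak
    return out
```

Since the peak is never smaller than the current magnitude, the whitened output always lies in [0, 1], regardless of each bin's absolute level.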
Abdallah S, Plumbley MD (2004) Application of geometric dependency analysis to the separation of convolved mixtures, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3195 pp. 540-547
We investigate a generalisation of the structure of frequency domain ICA as applied to the separation of convolved mixtures, and show how a geometric representation of residual dependency can be used both as an aid to visualisation and intuition, and as a tool for clustering components into independent subspaces, thus providing a solution to the source separation problem. © Springer-Verlag 2004.
Murray-Browne T, Mainstone D, Bryan-Kinns N, Plumbley MD (2011) The medium is the message: Composing instruments and performing mappings, pp. 56-59
Many performers of novel musical instruments find it difficult to engage audiences beyond those in the field. Previous research points to a failure to balance complexity with usability, and a loss of transparency due to the detachment of the controller and sound generator. The issue is often exacerbated by an audience's lack of prior exposure to the instrument and its workings. However, we argue that there is a conflict underlying many novel musical instruments in that they are intended to be both a tool for creative expression and a creative work of art in themselves, resulting in incompatible requirements. By considering the instrument, the composition and the performance together as a whole with careful consideration of the rate of learning demanded of the audience, we propose that a lack of transparency can become an asset rather than a hindrance. Our approach calls for not only controller and sound generator to be designed in sympathy with each other, but composition, performance and physical form too. Identifying three design principles, we illustrate this approach with the Serendiptichord, a wearable instrument for dancers created by the authors.
Non-negative Matrix Factorisation (NMF) is a popular tool in musical signal processing. However, problems using this methodology in the context of Automatic Music Transcription (AMT) have been noted, resulting in the proposal of supervised and constrained variants of NMF for this purpose. Group sparsity has previously been seen to be effective for AMT when used with stepwise methods. In this paper group sparsity is introduced to supervised NMF decompositions and a dictionary tuning approach to AMT is proposed based upon group sparse NMF using the β-divergence. Experimental results are given showing improved AMT results over the state-of-the-art NMF-based AMT system.
Bird audio detection (BAD) aims to detect whether or not there is a bird call in an audio recording. One difficulty of this task is that bird sound datasets are weakly labelled: only the presence or absence of a bird in a recording is known, without knowing when the birds call. We propose to apply a joint detection and classification (JDC) model to the weakly labelled data (WLD) to detect and classify an audio clip at the same time. First, we apply a VGG-like convolutional neural network (CNN) on the mel spectrogram as a baseline. Then we propose a JDC-CNN model with VGG as a classifier and a CNN as a detector. We report that, contrary to previous work, denoising methods including optimally-modified log-spectral amplitude (OM-LSA), median filtering and spectral subtraction worsen the classification accuracy. JDC-CNN can predict the time stamps of events from weakly labelled data, and so is able to perform sound event detection from WLD. We obtained an area under the curve (AUC) of 95.70% on the development data and 81.36% on the unseen evaluation data, which is nearly comparable to the baseline CNN model.
Audio source separation models are typically evaluated using objective separation quality measures, but rigorous statistical methods have yet to be applied to the problem of model comparison. As a result, it can be difficult to establish whether or not reliable progress is being made during the development of new models. In this paper, we provide a hypothesis-driven statistical analysis of the results of the recent source separation SiSEC challenge involving twelve competing models tested on separation of voice and accompaniment from fifty pieces of "professionally produced" contemporary music. Using nonparametric statistics, we establish reliable evidence for meaningful conclusions about the performance of the various models.
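The abstract does not name its exact tests, but a typical paired nonparametric comparison of two separation models over the same tracks is a Wilcoxon signed-rank test; the per-track SDR numbers below are invented for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Hypothetical per-track SDR scores (dB) for two models on the same 50 pieces.
n_tracks = 50
model_a = rng.normal(5.0, 1.5, n_tracks)
model_b = model_a + rng.normal(0.8, 1.0, n_tracks)  # model B ~0.8 dB better

# Paired nonparametric comparison: no normality assumption on the differences.
stat, p_value = wilcoxon(model_b, model_a, alternative="greater")
```

Because the same tracks are scored by both models, a paired test is appropriate; the nonparametric choice avoids assuming the score differences are Gaussian.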
Giannoulis D, Stowell D, Benetos E, Rossignol M, Lagrange M, Plumbley MD (2013) A DATABASE AND CHALLENGE FOR ACOUSTIC SCENE CLASSIFICATION AND EVENT DETECTION, 2013 PROCEEDINGS OF THE 21ST EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO) IEEE
An increasing number of researchers work in computational auditory scene analysis (CASA). However, a set of tasks, each with a well-defined evaluation framework and commonly used datasets, does not yet exist. Thus, it is difficult for results and algorithms to be compared fairly, which hinders research in the field. In this paper we introduce a newly-launched public evaluation challenge dealing with two closely related tasks of the field: acoustic scene classification and event detection. We give an overview of the tasks involved; describe the processes of creating the dataset; and define the evaluation metrics. Finally, illustrations of results for both tasks using baseline methods applied on this dataset are presented, accompanied by open-source code. © 2013 EURASIP
Nesbit A, Vincent E, Plumbley MD (2009) Benchmarking flexible adaptive time-frequency transforms for underdetermined audio source separation, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 37-40
We have implemented several fast and flexible adaptive lapped orthogonal transform (LOT) schemes for underdetermined audio source separation. This is generally addressed by time-frequency masking, requiring the sources to be disjoint in the time-frequency domain. We have already shown that disjointness can be increased via adaptive dyadic LOTs. By taking inspiration from the windowing schemes used in many audio coding frameworks, we improve on earlier results in two ways. Firstly, we consider non-dyadic LOTs which match the time-varying signal structures better. Secondly, we allow for a greater range of overlapping window profiles to decrease window boundary artifacts. This new scheme is benchmarked through oracle evaluations, and is shown to decrease computation time by over an order of magnitude compared to using very general schemes, whilst maintaining high separation performance and flexible signal adaptivity. As the results demonstrate, this work may find practical applications in high fidelity audio source separation. ©2009 IEEE.
Figueira LA, Cannam C, Plumbley MD (2013) Software techniques for good practice in audio and music research, 134th Audio Engineering Society Convention 2013 pp. 273-280
In this paper we discuss how software development can be improved in the audio and music research community by implementing tighter and more effective development feedback loops. We suggest first that researchers in an academic environment can benefit from the straightforward application of peer code review, even for ad-hoc research software; and second, that researchers should adopt automated software unit testing from the start of research projects. We discuss and illustrate how to adopt both code reviews and unit testing in a research environment. Finally, we observe that the use of a software version control system provides support for the foundations of both code reviews and automated unit tests. We therefore also propose that researchers should use version control with all their projects from the earliest stage.
Hockman JA, Bello JP, Davies MEP, Plumbley MD (2008) Automated rhythmic transformation of musical audio, Proceedings - 11th International Conference on Digital Audio Effects, DAFx 2008 pp. 177-180
Time-scale transformations of audio signals have traditionally relied exclusively upon manipulations of tempo. We present a novel technique for automatic mixing and synchronization between two musical signals. In this transformation, the original signal assumes the tempo, meter, and rhythmic structure of the model signal, while the extracted downbeats and salient intra-measure infrastructure of the original are maintained.
Plumbley MD (2014) Separating Musical Audio Signals, Acoustics Bulletin 39 (6) pp. 44-47 Institute of Acoustics
As consumers move increasingly to multichannel and surround-sound reproduction of sound, and
also perhaps wish to remix their music to suit their own tastes, there will be an increasing need for
high quality automatic source separation to recover sound sources from legacy mono or 2-channel
stereo recordings. In this Contribution, we will give an overview of some approaches for audio source separation,
and some of the remaining research challenges in this area.
Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper, a combination of different DNNs' predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.
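A minimal numpy sketch of the two mask families the DNNs are trained towards (binary and soft/ratio masks), applied pointwise to a toy magnitude mixture; real systems estimate these masks from the mixture rather than computing them from the reference sources as done here.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy magnitude spectrograms of two sources and their mixture (F x T).
S1 = rng.random((128, 50))
S2 = rng.random((128, 50))
mix = S1 + S2

# Soft (ratio) mask and binary mask for source 1, computed from references.
soft_mask = S1 / (S1 + S2 + 1e-12)
binary_mask = (S1 >= S2).astype(float)

# A mask is applied pointwise to the mixture to estimate a source.
est_soft = soft_mask * mix
est_binary = binary_mask * mix
```

On this additive toy mixture the soft mask reconstructs source 1 almost exactly, while the binary mask keeps only time-frequency bins where source 1 dominates, illustrating the distortion/interference trade-off the paper exploits.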
Dictionary Learning (DL) has seen widespread use in signal processing and machine learning. Given a data set, DL seeks to find a so-called "dictionary" such that the data can be well represented by a sparse linear combination of dictionary elements. The representational power of DL may be extended by the use of kernel mappings, which implicitly map the data to some high-dimensional feature space. In Kernel DL we wish to learn a dictionary in this underlying high-dimensional feature space, which can often model the data more accurately than learning in the original space. Kernel DL is more challenging than the linear case, however, since we no longer have access to the dictionary atoms directly, only to their relationship to the data via the kernel matrix. One strategy is therefore to represent the dictionary as a linear combination of the input data whose coefficients can be learned during training, relying on the fact that any optimal dictionary lies in the span of the data. A difficulty in Kernel DL is that given a data set of size N, the full (N × N) kernel matrix needs to be manipulated at each iteration, and dealing with such a large dense matrix can be extremely slow for big datasets. Here, we impose an additional constraint of sparsity on the coefficients so that the learned dictionary is given by a sparse linear combination of the input data. This greatly speeds up learning; furthermore, the speed-up is greater for larger datasets and can be tuned via a dictionary-sparsity parameter. The proposed approach thus combines Kernel DL with the "double sparse" DL model, in which the learned dictionary is given by a sparse reconstruction over some base dictionary (in this case, the data itself). We investigate the use of sparse Kernel DL as a feature learning step for a music transcription task and compare it to another Kernel DL approach based on the K-SVD algorithm (which does not lead to sparse dictionaries in general), in terms of computation time and performance.
Initial experiments show that Sparse Kernel DL is significantly faster than the non-sparse Kernel DL approach (6× to 8× speed-up depending on the size of the training data and the sparsity level) while leading to similar performance.
Mailhe B, Sturm B, Plumbley MD (2013) Behavior of greedy sparse representation algorithms on nested supports, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 5710-5714
In this work, we study the links between the recovery properties of sparse signals for Orthogonal Matching Pursuit (OMP) and the whole General MP class over nested supports. We show that the optimality of those algorithms is not locally nested: there is a dictionary and supports I and J with J included in I such that OMP will recover all signals of support I, but not all signals of support J. We also show that the optimality of OMP is globally nested: if OMP can recover all s-sparse signals, then it can recover all s2-sparse signals with s2 smaller than s. We also provide a tighter version of Donoho and Elad's spark theorem, which allows us to complete Tropp's proof that sparse representation algorithms can only be optimal for all s-sparse signals if s is strictly lower than half the spark of the dictionary. © 2013 IEEE.
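For reference, a bare-bones OMP of the kind analysed in the paper; to keep greedy recovery guaranteed in this toy example the dictionary is orthonormal, whereas the paper's interesting cases concern general overcomplete dictionaries. All sizes and supports are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Orthonormal dictionary (guarantees exact greedy recovery in this toy case).
D, _ = np.linalg.qr(rng.standard_normal((40, 40)))

# A 3-sparse signal on a known support.
x_true = np.zeros(40)
x_true[[3, 17, 33]] = [1.5, -2.0, 1.0]
y = D @ x_true

def omp(D, y, s):
    """Orthogonal Matching Pursuit: pick the atom most correlated with the
    residual, then re-fit all selected atoms jointly by least squares."""
    residual, selected = y.copy(), []
    for _ in range(s):
        k = int(np.argmax(np.abs(D.T @ residual)))
        selected.append(k)
        coef, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
        residual = y - D[:, selected] @ coef
    x = np.zeros(D.shape[1])
    x[selected] = coef
    return x

x_hat = omp(D, y, 3)
```

The nested-support question in the paper is whether recovery of all signals on a support I implies recovery on every sub-support J ⊂ I, which the authors show fails locally but holds globally over sparsity levels.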
Plumbley MD (2006) Recovery of Sparse Representations by Polytope Faces Pursuit., ICA 3889 pp. 206-213 Springer
Jafari MG, Vincent E, Abdallah SA, Plumbley MD, Davies ME (2008) An adaptive stereo basis method for convolutive blind audio source separation., Neurocomputing 71 10-12 pp. 2087-2097
Jafari MG, Plumbley MD (2008) Separation of stereo speech signals based on a sparse dictionary algorithm, European Signal Processing Conference
We address the problem of source separation in echoic and anechoic environments, with a new algorithm which adaptively learns a set of sparse stereo dictionary elements, which are then clustered to identify the original sources. The atom pairs learned by the algorithm are found to capture information about the direction of arrival of the source signals, which allows us to determine the clusters. A similar approach is also used here to extend the K-singular value decomposition (K-SVD) dictionary learning algorithm to address the source separation problem, and results from the two methods are compared. Computer simulations indicate that the proposed adaptive sparse stereo dictionary (ASSD) algorithm yields good performance in both anechoic and echoic environments. Copyright by EURASIP.
In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. Firstly, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add more constraints to the cross-entropy based loss function. A series of experiments were conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement in average scene classification error as compared with the Gaussian Mixture Model (GMM)-based benchmark system across four standard folds.
Adler A, Emiya V, Jafari MG, Elad M, Gribonval R, Plumbley MD (2012) Audio inpainting,IEEE Transactions on Audio, Speech and Language Processing 20 (3) pp. 922-932
We propose the audio inpainting framework that recovers portions of audio data distorted due to impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted data are treated as missing and their location is assumed to be known. The signal is decomposed into overlapping time-domain frames and the restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The Signal-to-Noise Ratio performance of this algorithm is shown to be comparable or better than state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio. © 2006 IEEE.
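A small sketch of the per-frame restoration idea under the stated assumptions (missing-sample locations known, frame sparse in a discrete cosine dictionary); the greedy loop is a stripped-down Orthogonal Matching Pursuit, and the frame, sparsity level and gap location are all invented.

```python
import numpy as np

N = 64
n = np.arange(N)
# Orthonormal DCT-II dictionary: column k is a cosine atom.
D = np.sqrt(2.0 / N) * np.cos(
    np.pi * (2 * n[:, None] + 1) * np.arange(N)[None, :] / (2 * N))
D[:, 0] /= np.sqrt(2.0)

# A frame that is 2-sparse in this dictionary (atoms 5 and 20 are arbitrary).
x_true = np.zeros(N)
x_true[5], x_true[20] = 1.0, -0.7
frame = D @ x_true

# A burst of samples is missing; its location is assumed known.
observed = np.ones(N, dtype=bool)
observed[25:35] = False

# OMP-style recovery using only the observed rows of the dictionary.
Do, yo = D[observed], frame[observed]
residual, selected = yo.copy(), []
for _ in range(2):
    k = int(np.argmax(np.abs(Do.T @ residual)))
    selected.append(k)
    coef, *_ = np.linalg.lstsq(Do[:, selected], yo, rcond=None)
    residual = yo - Do[:, selected] @ coef

restored = D[:, selected] @ coef  # full frame, missing samples included
```

The key point mirrored from the paper: the sparse code is fitted on the observed samples only, and the full dictionary then synthesises the missing ones.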
We consider the separation of sources when only one movable sensor is available to record a set of mixtures at distinct locations. A single mixture signal is acquired, which is firstly segmented. Then, based on the assumption that the underlying sources are temporally periodic, we align the resulting signals and form a measurement vector on which source separation can be performed. We demonstrate that this approach can successfully recover the original sources both when working with simulated data, and for a real problem of heart sound separation. © 2011 University of Zagreb.
Gretsistas A, Plumbley MD (2010) A Multichannel Spatial Compressed Sensing Approach for Direction of Arrival Estimation., LVA/ICA 6365 pp. 458-465 Springer
Davies MEP, Plumbley MD (2005) Beat tracking with a two state model [music applications]., ICASSP (3) pp. 241-244 IEEE
O'Hanlon K, Nagano H, Plumbley MD (2013) Using Oracle Analysis for Decomposition-Based Automatic Music Transcription, 7900 pp. 353-365 Springer Berlin Heidelberg
One approach to Automatic Music Transcription (AMT) is to decompose a spectrogram with a dictionary matrix that contains a pitch-labelled note spectrum atom in each column. AMT performance is typically measured using frame-based comparison, while an alternative perspective is to use an event-based analysis. We have previously proposed an AMT system, based on the use of structured sparse representations. The method is described and experimental results are given, which are seen to be promising. An inspection of the graphical AMT output known as a piano roll may lead one to think that the performance may be slightly better than is suggested by the AMT metrics used. This leads us to perform an oracle analysis of the AMT system, with some interesting outcomes which may have implications for decomposition based AMT in general.
Nishimori Y, Akaho S, Plumbley MD (2008) Natural Conjugate Gradient on Complex Flag Manifolds for Complex Independent Subspace Analysis., ICANN (1) 5163 pp. 165-174 Springer
Bugmann G, Sojka P, Reiss M, Plumbley M, Taylor JG (1992) Direct Approaches to Improving the Robustness of Multilayer Neural Networks, pp. 1063-1066 North-Holland
Multilayer neural networks trained with backpropagation are in general not robust against the loss of a hidden neuron. In this paper we define a form of robustness called 1-node robustness and propose methods to improve it. One approach is based on a modification of the error function by the addition of a "robustness error". It leads to more robust networks but at the cost of a reduced accuracy. A second approach, "pruning-and-duplication", consists of duplicating the neurons whose loss is the most damaging for the network. Pruned neurons are used for the duplication. This procedure leads to robust and accurate networks at low computational cost. It may also prove beneficial for generalisation. Both methods are evaluated on the XOR function.
Davies MEP, Plumbley MD (2004) Causal Tempo Tracking of Audio., ISMIR
Plumbley MD, Cichocki A, Bro R (2010) Non-negative mixtures, In: Handbook of Blind Source Separation pp. 515-547
This chapter discusses some algorithms for the use of non-negativity constraints in unmixing problems, including positive matrix factorization, non-negative matrix factorization (NMF), and their combination with other unmixing methods such as non-negative independent component analysis and sparse non-negative matrix factorization. The 2D models can be naturally extended to multiway array (tensor) decompositions, especially non-negative tensor factorization (NTF) and non-negative Tucker decomposition (NTD). The standard NMF model has been extended in various ways, including semi-NMF, multilayer NMF, tri-NMF, orthogonal NMF, nonsmooth NMF, and convolutive NMF. While gradient descent is a simple procedure, convergence can be slow, and the convergence can be sensitive to the step size. This can be overcome by applying multiplicative update rules, which have proved particularly popular in NMF. These multiplicative update rules have proved to be attractive since they are simple, do not need the selection of an update parameter, and their multiplicative nature and the non-negative terms on the RHS ensure that the elements cannot become negative. © 2010 Elsevier Ltd. All rights reserved.
Stark AM, Plumbley MD (2012) Performance following: Real-time prediction of musical sequences without a score, IEEE Transactions on Audio, Speech and Language Processing 20 (1) pp. 178-187
This paper introduces a technique for predicting harmonic sequences in a musical performance for which no score is available, using real-time audio signals. Recent short-term information is aligned with longer term information, contextualizing the present within the past, allowing predictions about the future of the performance to be made. Using a mid-level representation in the form of beat-synchronous harmonic sequences, we reduce the size of the information needed to represent the performance. This allows the implementation of real-time performance following in live performance situations. We conduct an objective evaluation on a database of rock, pop, and folk music. Our results show that we are able to predict a large majority of repeated harmonic content with no prior knowledge in the form of a score. © 2011 IEEE.
Badeau R, Plumbley MD (2013) Probabilistic time-frequency source-filter decomposition of non-stationary signals, Proceedings of the 21st European Signal Processing Conference 2013
Probabilistic modelling of non-stationary signals in the time-frequency (TF) domain has been an active research topic recently. Various models have been proposed, notably in the nonnegative matrix factorization (NMF) literature. In this paper, we propose a new TF probabilistic model that can represent a variety of stationary and non-stationary signals, such as autoregressive moving average (ARMA) processes, uncorrelated noise, damped sinusoids, and transient signals. This model also generalizes and improves both the Itakura-Saito (IS)-NMF and high resolution (HR)-NMF models. © 2013 EURASIP.
Neuroevolution techniques combine genetic algorithms with artificial
neural networks, some of them evolving network topology
along with the network weights. One of these latter techniques is
the NeuroEvolution of Augmenting Topologies (NEAT) algorithm.
For this pilot study we devised an extended variant (joint NEAT,
J-NEAT), introducing dynamic cooperative co-evolution, and applied
it to sound event detection in real life audio (Task 3) in the
DCASE 2017 challenge. Our research question was whether small
networks could be evolved that would be able to compete with the
much larger networks now typical for classification and detection
tasks. We used the wavelet-based deep scattering transform and
k-means clustering across the resulting scales (not across samples)
to provide J-NEAT with a compact representation of the acoustic
input. The results show that for the development data set J-NEAT
was capable of evolving small networks that match the performance
of the baseline system in terms of the segment-based error metrics,
while exhibiting a substantially better event-related error rate. In
the challenge, J-NEAT took first place overall according to the F1
error metric with an F1 of 44.9% and achieved rank 15 out of 34 on
the ER error metric with a value of 0.891. We discuss the question
of evolving versus learning for supervised tasks.
Foster P, Klapuri A, Plumbley MD (2011) Causal prediction of continuous-valued music features, Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011 pp. 501-506
This paper investigates techniques for predicting sequences of continuous-valued feature vectors extracted from musical audio. In particular, we consider prediction of beatsynchronous Mel-frequency cepstral coefficients and chroma features in a causal setting, where features are predicted as they unfold in time. The methods studied comprise autoregressive models, N-gram models incorporating a smoothing scheme, and a novel technique based on repetition detection using a self-distance matrix. Furthermore, we propose a method for combining predictors, which relies on a running estimate of the error variance of the predictors to inform a linear weighting of the predictor outputs. Results indicate that incorporating information on long-term structure improves the prediction performance for continuous-valued, sequential musical data. For the Beatles data set, combining the proposed self-distance based predictor with both N-gram and autoregressive methods results in an average of 13% improvement compared to a linear predictive baseline. © 2011 International Society for Music Information Retrieval.
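The predictor-combination rule described above can be sketched as inverse-error-variance weighting with a running (exponentially smoothed) variance estimate; the two predictors, the smoothing constant and the toy feature sequence below are invented stand-ins for the paper's autoregressive, N-gram and self-distance predictors.

```python
import numpy as np

rng = np.random.default_rng(4)

T = 500
truth = np.sin(np.linspace(0, 20, T))  # toy continuous-valued feature track

# Two hypothetical causal predictors with different error variances.
pred_a = truth + rng.normal(0, 0.1, T)
pred_b = truth + rng.normal(0, 0.3, T)

# Running estimate of each predictor's error variance informs a linear
# weighting of the predictor outputs, updated causally at each step.
alpha = 0.05
var_a = var_b = 1.0
combined = np.empty(T)
for t in range(T):
    w_a = (1 / var_a) / (1 / var_a + 1 / var_b)
    combined[t] = w_a * pred_a[t] + (1 - w_a) * pred_b[t]
    # Update variance estimates once the true value is revealed.
    var_a = (1 - alpha) * var_a + alpha * (pred_a[t] - truth[t]) ** 2
    var_b = (1 - alpha) * var_b + alpha * (pred_b[t] - truth[t]) ** 2

mse = lambda p: np.mean((p - truth) ** 2)
```

The combination remains causal: the weight at time t depends only on errors observed before t.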
Probabilistic approaches to tracking often use single-source Bayesian models; applying these to multi-source tasks is problematic. We apply a principled multi-object tracking implementation, the Gaussian mixture probability hypothesis density filter, to track multiple sources having fixed pitch plus vibrato. We demonstrate high-quality filtering in a synthetic experiment, and find improved tracking using a richer feature set which captures underlying dynamics. Our implementation is available as open-source Python code.
Foster P, Sigtia S, Krstulovic S, Barker J, Plumbley MD (2015) CHiME-Home: A dataset for sound source recognition in a domestic environment., Proc 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 18-21 Oct. 2015 IEEE
Evangelista G, Marchand S, Plumbley MD, Vincent E (2011) Sound Source Separation,In: Zölzer U (eds.), DAFX: Digital Audio Effects 14 pp. 551-588
John Wiley & Sons, Ltd
Abdallah SA, Plumbley MD (2012) Predictive Information Rate in Discrete-time Gaussian Processes,
We derive expressions for the predictive information rate (PIR) for the
class of autoregressive Gaussian processes AR(N), both in terms of the
prediction coefficients and in terms of the power spectral density. The latter
result suggests a duality between the PIR and the multi-information rate for
processes with mutually inverse power spectra (i.e. with poles and zeros of the
transfer function exchanged). We investigate the behaviour of the PIR in
relation to the multi-information rate for some simple examples, which suggest,
somewhat counter-intuitively, that the PIR is maximised for very "smooth" AR
processes whose power spectra have multiple poles at zero frequency. We also
obtain results for moving average Gaussian processes which are consistent with
the duality conjectured earlier. One consequence of this is that the PIR is
unbounded for MA(N) processes.
Non-negative Matrix Factorisation (NMF) is a popular tool in which a "parts-based" representation of a non-negative matrix is sought. NMF tends to produce sparse decompositions. This sparsity is a desirable property in many applications, and Sparse NMF (S-NMF) methods have been proposed to enhance this feature. Typically these enforce sparsity through use of a penalty term, and an ℓ1 norm penalty term is often used. However, an ℓ1 penalty term may not be appropriate in a non-negative framework. In this paper the use of an ℓ0 norm penalty for NMF is proposed, approximated using backwards elimination from an initial NNLS decomposition. Dictionary recovery experiments using overcomplete dictionaries show that this method outperforms both NMF and a state-of-the-art S-NMF method, in particular when the dictionary to be learnt is dense.
Stowell D, Plumbley MD (2013) Segregating Event Streams and Noise with a Markov Renewal Process Model, Journal of Machine Learning Research 14 pp. 2213-2238
We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via synthetic experiments as well as an experiment to track a mixture of singing birds. Source code is available.
Davies MEP, Brossier PM, Plumbley MD (2005) Beat tracking towards automatic musical accompaniment, Audio Engineering Society - 118th Convention Spring Preprints 2005 2 pp. 751-757
In this paper we address the issue of causal rhythmic analysis, primarily towards predicting the locations of musical beats such that they are consistent with a musical audio input. This will be a key component required for a system capable of automatic accompaniment with a live musician. We are implementing our approach as part of the aubio real-time audio library. While performance for this causal system is reduced in comparison to our previous non-causal system, it is still suitable for our intended purpose.
This study concludes a tripartite investigation into the indirect
visibility of the moving tongue in human speech as reflected in
co-occurring changes of the facial surface. We were in particular
interested in how the shared information is distributed over
the range of contributing frequencies. In the current study we
examine the degree to which tongue movements during speech
can be reliably estimated from face motion using artificial neural
networks. We simultaneously acquired data for both movement
types; tongue movements were measured with Electromagnetic
Articulography (EMA), face motion with a passive
marker-based motion capture system. A multiresolution analysis
using wavelets provided the desired decomposition into frequency
subbands. In the two earlier studies of the project we
established linear and non-linear relations between lingual and
facial speech motions, as predicted and compatible with previous
research in auditory-visual speech. The results of the current
study using a Deep Neural Network (DNN) for prediction
show that a substantive amount of variance can be recovered
(between 13.9 and 33.2% dependent on the speaker and tongue
sensor location). Importantly, however, the recovered variance
values and the root mean squared error values of the Euclidean
distances between the measured and the predicted tongue trajectories
are in the range of the linear estimations of our earlier studies.
The sources separated by most single channel audio source separation techniques are usually distorted and each separated source contains residual signals from the other sources. To tackle this problem, we propose to enhance the separated sources to decrease the distortion and interference between the separated sources using deep neural networks (DNNs). Two different DNNs are used in this work. The first DNN is used to separate the sources from the mixed signal. The second DNN is used to enhance the separated signals. To consider the interactions between the separated sources, we propose to use a single DNN to enhance all the separated sources together. To reduce the residual signals of one source from the other separated sources (interference), we train the DNN for enhancement discriminatively to maximize the dissimilarity between the predicted sources. The experimental results show that using discriminative enhancement decreases the distortion and interference between the separated sources.
In this paper, we propose two methods for polyphonic Acoustic Event Detection (AED) in real life environments. The first method is based on Coupled Sparse Non-negative Matrix Factorization (CSNMF) of spectral representations and their corresponding class activity annotations. The second method is based on Multi-class Random Forest (MRF) classification of time-frequency patches. We compare the performance of the two methods on a recently published dataset TUT Sound Events 2016 containing data from home and residential area environments. Both methods show comparable performance to the baseline system proposed for DCASE 2016 Challenge on the development dataset with MRF outperforming the baseline on the evaluation dataset.
Most single channel audio source separation (SCASS) approaches produce separated sources accompanied by interference from other sources and other distortions. To tackle this problem, we propose to separate the sources in two stages. In the first stage, the sources are separated from the mixed signal. In the second stage, the interference between the separated sources and the distortions are reduced using deep neural networks (DNNs). We propose two methods that use DNNs to improve the quality of the separated sources in the second stage. In the first method, each separated source is improved individually using its own trained DNN, while in the second method all the separated sources are improved together using a single DNN. To further improve the quality of the separated sources, the DNNs in the second stage are trained discriminatively to further decrease the interference and the distortions of the separated sources. Our experimental results show that using two stages of separation improves the quality of the separated signals by decreasing the interference between the separated sources and distortions compared to separating the sources using a single stage of separation.
Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labeling of an audio clip is often based on the audio events in the clip and no event-level label is provided to the user. Previous works using the bag-of-frames model assume the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative sounds and ignore uninformative sounds. Then only informative regions are used for classification. Experimental results on the "CHiME Home" dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing the event-level label.
Though many past works have tried to cluster expressive timing within a phrase, there have been few attempts to cluster features of expressive timing with constant dimensions regardless of phrase lengths. For example, tempo curves, used as a way to represent expressive timing, can be regressed by a polynomial function such that the number of regressed polynomial coefficients remains constant with a given order regardless of phrase lengths. In this paper, clustering the regressed polynomial coefficients is proposed for expressive timing analysis. A model selection test is presented to compare Gaussian Mixture Models (GMMs) fitting regressed polynomial coefficients and fitting expressive timing directly. As there are no expected results of clustering expressive timing, the proposed method is demonstrated by how well the expressive timing is approximated by the centroids of the GMMs. The results show that GMMs fitting the regressed polynomial coefficients outperform GMMs fitting expressive timing directly. This conclusion suggests that it is possible to use regressed polynomial coefficients to represent expressive timing within a phrase and to cluster expressive timing within phrases of different lengths.
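The fixed-dimension representation can be sketched in a few lines: normalise phrase time to [0, 1] and regress each tempo curve with a fixed-order polynomial, so phrases of different lengths yield coefficient vectors of equal size. The tempo curves and polynomial order below are invented.

```python
import numpy as np

# Two phrases of different lengths, each with a tempo curve (BPM per beat).
phrase_short = np.array([118.0, 121.0, 124.0, 122.0, 117.0])
phrase_long = np.array([119.0, 120.0, 123.0, 125.0, 126.0, 124.0, 121.0, 116.0])

def tempo_coeffs(curve, order=3):
    # Normalise time to [0, 1] so phrase length does not affect the fit,
    # then regress the curve with a fixed-order polynomial.
    t = np.linspace(0.0, 1.0, len(curve))
    return np.polyfit(t, curve, order)

c_short = tempo_coeffs(phrase_short)
c_long = tempo_coeffs(phrase_long)
```

The resulting coefficient vectors have identical dimension, which is what makes them directly comparable inputs for GMM clustering across phrases of different lengths.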
Automatic and fast tagging of natural sounds in audio collections is a very challenging task due to wide acoustic variations, the large number of possible tags, and the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier and augmenting the tags with additional relevant tags. Training of the classifiers is implemented using marginal co-regularization, which draws the two classifiers into agreement by a joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task 4 of DCASE 2016. The results obtained show that marginal co-regularization outperforms the baseline GMM in both efficiency and effectiveness.
Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper we make contributions to audio tagging in two parts: acoustic modelling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modelling, a large set of contextual frames of the chunk are fed into the DNN to perform multi-label classification for the expected tags, considering that only chunk-level (or utterance-level) rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-Filter Bank (MFB) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system achieves a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task, while the EER of the first-prize system in this challenge is 0.17.
In this paper, a system for polyphonic sound event detection and tracking is proposed, based on spectrogram factorisation techniques and state space models. The system extends probabilistic latent component analysis (PLCA) and is modelled around a 4-dimensional spectral template dictionary of frequency, sound event class, exemplar index, and sound state. In order to jointly track multiple overlapping sound events over time, the integration of linear dynamical systems (LDS) within the PLCA inference is proposed. The system assumes that the PLCA sound event activation is the (noisy) observation in an LDS, with the latent states corresponding to the true event activations. LDS training is achieved using fully observed data, making use of ground truth-informed event activations produced by the PLCA-based model. Several LDS variants are evaluated, using polyphonic datasets of office sounds generated from an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the integration of LDS tracking within PLCA leads to an improvement of +8.5-10.5% in terms of frame-based F-measure as compared to the use of the PLCA model alone. In addition, the proposed system outperforms several state-of-the-art methods for the task of polyphonic sound event detection.
In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where notions of product pricing and form factor impose hard constraints on the available computing power. Whereas Automatic Environmental Sound Recognition (AESR) algorithms are most often developed with limited consideration for computational cost, this article investigates which AESR algorithm can make the most of a limited amount of computing power, by comparing sound classification performance as a function of computational cost. Results suggest that Deep Neural Networks yield the best ratio of sound classification accuracy to computational cost across a range of costs, while Gaussian Mixture Models offer reasonable accuracy at a consistently small cost, and Support Vector Machines stand between both in terms of the compromise between accuracy and computational cost.
Li S, Black D, Plumbley MD (2016) The Clustering of Expressive Timing Within a Phrase in Classical Piano Performances by Gaussian Mixture Models, Music, Mind, and Embodiment: 11th International Symposium, CMMR 2015, Plymouth, UK, June 16-19, 2015, Revised Selected Papers (Lecture Notes in Computer Science, vol. 9617) pp. 322-345
In computational musicology research, clustering is a common approach to the analysis of expression. Our research uses mathematical model selection criteria to evaluate the performance of clustered and non-clustered models applied to intra-phrase tempo variations in classical piano performances. Applying different standardisation methods to the tempo variations and different types of covariance matrices, multiple performed pieces are used to evaluate the candidate models. The results of the tests suggest that the clustered models perform better than the non-clustered models, and that the original tempo data should be standardised by the mean tempo within a phrase.
Deep learning techniques have recently been used to tackle the audio source separation problem. In this work, we propose to use deep fully convolutional denoising autoencoders (CDAEs) for monaural audio source separation. We use as many CDAEs as there are sources to be separated from the mixed signal. Each CDAE is trained to separate one source and treats the other sources as background noise. The main idea is to allow each CDAE to learn spectral-temporal filters and features suited to its corresponding source. Our experimental results show that CDAEs perform source separation slightly better than deep feedforward neural networks (FNNs), even with fewer parameters.
Binaural features of interaural level difference and interaural phase difference have proved very effective in training deep neural networks (DNNs) to generate time-frequency masks for target speech extraction in speech-speech mixtures. However, the effectiveness of binaural features is reduced in the more common speech-noise scenarios, since the noise may overshadow the speech in adverse conditions. In addition, reverberation decreases the sparsity of binaural features and therefore adds difficulty to the separation task. To address these limitations, we highlight the spectral difference between speech and noise spectra and incorporate log-power spectral features to extend the DNN input. Tested in two different reverberant rooms at different signal-to-noise ratios (SNRs), our proposed method shows advantages over the baseline method using only binaural features in terms of signal-to-distortion ratio (SDR) and Short-Time Objective Intelligibility (STOI).
A method based on Deep Neural Networks (DNNs) and time-frequency masking has recently been developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft time-frequency masks for the recovery/estimation of the individual audio sources. In this paper, an algorithm called 'dropout' is applied to the hidden layers, affecting the sparsity of hidden unit activations: randomly selected neurons and their connections are dropped during the training phase, preventing feature co-adaptation. These methods are evaluated on binaural mixtures generated with Binaural Room Impulse Responses (BRIRs), accounting for a certain level of room reverberation. The results show that the proposed DNN system with randomly deleted neurons achieves higher SDR performance than the baseline method without dropout.
Zermini A, Yu Y, Xu Y, Plumbley M, Wang W (2016) Deep neural network based audio source separation, Proceedings of the 11th IMA International Conference on Mathematics in Signal Processing pp. 1-4
Institute of Mathematics & its Applications (IMA)
Audio source separation aims to extract individual sources from mixtures of multiple sound sources. Many techniques have been developed, such as independent component analysis, computational auditory scene analysis, and non-negative matrix factorisation. A method based on Deep Neural Networks (DNNs) and time-frequency (T-F) masking has recently been developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft T-F masks for the recovery/estimation of the individual audio sources.
Automatic Music Transcription (AMT) is concerned with the problem of producing the pitch content of a piece of music given a recorded signal. Many methods rely on sparse or low-rank models, where the observed magnitude spectra are represented as a linear combination of dictionary atoms corresponding to individual pitches. Some of the most successful approaches use Non-negative Matrix Decomposition (NMD) or Factorization (NMF), which can be used to learn a dictionary and pitch activation matrix from a given signal. Here we introduce a further refinement of NMD in which we assume the transcription itself is approximately low rank. The intuition behind this approach is that the total number of distinct activation patterns should be relatively small, since the pitch content of adjacent frames should be similar. A rank penalty is introduced into the NMD objective function and solved using an iterative algorithm based on singular value thresholding. We find that the low-rank assumption leads to a significant increase in performance compared to NMD using β-divergence on a standard AMT dataset.
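The singular value thresholding step at the heart of this refinement can be sketched as follows. This is a generic soft-thresholding of singular values, not the paper's full NMD algorithm, and the matrix sizes and threshold are illustrative.

```python
import numpy as np

def singular_value_threshold(H, tau):
    """Soft-threshold the singular values of the activation matrix H.

    Shrinking small singular values towards zero encourages the
    transcription-like activations to be approximately low rank.
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

rng = np.random.default_rng(0)
H = rng.random((12, 100))          # pitches x frames activation matrix
H_low = singular_value_threshold(H, tau=2.5)
# Every singular value is reduced by tau (or zeroed), so the result
# has rank no greater than that of the input.
assert np.linalg.matrix_rank(H_low) <= np.linalg.matrix_rank(H)
```

In the full algorithm this step would alternate with the usual NMD multiplicative updates of the dictionary and activations.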
To assist with the development of intelligent mixing systems, it would be useful to be able to extract the loudness balance of sources in an existing musical mixture. The relative-to-mix loudness level of four instrument groups was predicted using the sources extracted by 12 audio source separation algorithms. The predictions were compared with the ground truth loudness data of the original unmixed stems obtained from a recent dataset involving 100 mixed songs. It was found that the best source separation system could predict the relative loudness of each instrument group with an average root-mean-square error of 1.2 LU, with superior performance obtained on vocals.
Audio tagging aims to perform multi-label classification on audio chunks and is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that only a chunk-level label is available, without frame-level labels. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurring acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on Task 4 of the DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set, with the equal error rate (EER) reduced from 0.13 to 0.11 compared with the convolutional recurrent baseline system.
The first generation of three-dimensional Electromagnetic Articulography devices (Carstens AG500) suffered from occasional critical tracking failures. Although now superseded by new devices, the AG500 is still in use in many speech labs and many valuable data sets exist. In this study we investigate whether deep neural networks (DNNs) can learn the mapping function from raw voltage amplitudes to sensor positions based on a comprehensive movement data set. This is compared to arriving sample by sample at individual position values via direct optimisation, as used in previous methods. We found that with appropriate hyperparameter settings a DNN was able to approximate the mapping function with good accuracy, leading to a smaller error than the previous methods, but that the DNN-based approach was not able to solve the tracking problem.
Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.
Source separation evaluation is typically a top-down process, starting with perceptual measures which capture fitness-for-purpose, followed by attempts to find physical (objective) measures that are predictive of the perceptual measures. In this paper, we take a contrasting bottom-up approach. We begin with the physical measures provided by the Blind Source Separation Evaluation Toolkit (BSS Eval) and then look for corresponding perceptual correlates. This approach is known as psychophysics and has the distinct advantage of leading to interpretable, psychophysical models. We obtained perceptual similarity judgments from listeners in two experiments featuring vocal sources within musical mixtures. In the first experiment, listeners compared the overall quality of vocal signals estimated from musical mixtures using a range of competing source separation methods. In the second, a loudness experiment, listeners compared the loudness balance of the competing musical accompaniment and vocal. Our preliminary results provide provisional validation of the psychophysical approach.
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 81.0% using Mel + DNN against 77.2% using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 12.6% using Mel + DNN against 37.0% using Constant Q Transform (CQT) + Non-negative Matrix Factorization (NMF). In Task 3 we obtained an F value of 36.3% using Mel + DNN against 23.7% using MFCCs + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 18.9% using Mel + DNN against 20.9% using MFCCs + GMM. The DNN therefore improves on the baseline in Tasks 1, 3, and 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.
Deep neural networks (DNNs) are usually used for single channel source separation to predict either soft or binary time frequency masks. The masks are used to separate the sources from the mixed signal. Binary masks produce separated sources with more distortion and less interference than soft masks. In this paper, we propose to use another DNN to combine the estimates of binary and soft masks to achieve the advantages and avoid the disadvantages of using each mask individually. We aim to achieve separated sources with low distortion and low interference between each other. Our experimental results show that combining the estimates of binary and soft masks using DNN achieves lower distortion than using each estimate individually and achieves as low interference as the binary mask.
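The two mask types being combined above can be illustrated directly. This is a minimal NumPy sketch of ideal ratio (soft) and ideal binary masks computed from known source magnitudes; the DNN that combines the two estimates is not reproduced here.

```python
import numpy as np

def soft_mask(mag_target, mag_other, eps=1e-8):
    """Ratio (soft) mask: fraction of magnitude attributed to the target."""
    return mag_target / (mag_target + mag_other + eps)

def binary_mask(mag_target, mag_other):
    """Binary mask: 1 where the target dominates the other source."""
    return (mag_target >= mag_other).astype(float)

rng = np.random.default_rng(0)
s1, s2 = rng.random((2, 4, 6))       # magnitude spectrograms of two sources
mix = s1 + s2                         # magnitude of the (idealised) mixture
est_soft = soft_mask(s1, s2) * mix    # less distortion, more interference
est_bin = binary_mask(s1, s2) * mix   # more distortion, less interference
assert np.allclose(est_soft, s1, atol=1e-6)
```

With these idealised magnitudes the soft mask recovers the target almost exactly; in practice the masks are estimated by the networks, and the combining DNN trades off the error profiles of the two estimates.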
We address the problem of decomposing several consecutive sparse signals, such as audio time frames or image patches. A typical approach is to process each signal sequentially and independently, with an arbitrary sparsity level fixed for each signal. Here, we propose to process several frames simultaneously, allowing more flexible sparsity patterns to be considered. We propose a multivariate sparse coding approach, where sparsity is enforced on average across several frames, and a Multivariate Iterative Hard Thresholding algorithm to solve this problem. The usefulness of the proposed approach is demonstrated on audio coding and denoising tasks. Experiments show that the proposed approach leads to better results when the signal contains both transients and tonal components.
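The key operator — a sparsity budget enforced on average across frames rather than per frame — can be sketched as follows. This shows only the joint thresholding step with illustrative data; the full iterative algorithm alternates it with gradient updates on the residual.

```python
import numpy as np

def hard_threshold_joint(coeffs, avg_sparsity):
    """Keep the avg_sparsity * n_frames largest-magnitude coefficients
    jointly across all frames (columns), so the sparsity budget holds
    on average while individual frames may use more or fewer atoms."""
    k_total = avg_sparsity * coeffs.shape[1]
    flat = np.abs(coeffs).ravel()
    if k_total >= flat.size:
        return coeffs.copy()
    thresh = np.partition(flat, -k_total)[-k_total]
    return np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)

# 3 atoms x 3 frames; an average sparsity of 1 keeps 3 nonzeros in total.
C = np.array([[3.0, 0.1, 0.2],
              [2.5, 0.4, 0.3],
              [1.0, 0.2, 4.0]])
S = hard_threshold_joint(C, avg_sparsity=1)
assert np.count_nonzero(S) == 3
assert np.count_nonzero(S[:, 0]) == 2   # a transient-rich frame keeps more atoms
assert np.count_nonzero(S[:, 1]) == 0   # a quiet frame may keep none
```

A per-frame thresholding would force exactly one nonzero in every column; the joint version lets the budget flow to the frames that need it.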
Acoustic monitoring of bird species is an increasingly important field in signal processing. Many available bird sound datasets do not contain exact timestamps of the bird calls but have a coarse weak label instead. Traditional Non-negative Matrix Factorization (NMF) models are not well designed to deal with weakly labeled data. In this paper we propose a novel Masked Non-negative Matrix Factorization (Masked NMF) approach for bird detection using weakly labeled data. During dictionary extraction we introduce a binary mask on the activation matrix. In that way we are able to control which parts of the dictionary are used to reconstruct the training data. We compare our method with conventional NMF approaches and current state-of-the-art methods. The proposed method outperforms the NMF baseline and offers a parsimonious model for bird detection on weakly labeled data. Moreover, to our knowledge, the proposed Masked NMF achieved the best result among non-deep-learning methods on a test dataset used for the recent Bird Audio Detection challenge.
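The masking idea can be sketched with standard multiplicative NMF updates — here Euclidean-cost updates with a fixed binary mask re-imposed on the activations after each iteration. The dimensions, mask pattern, and cost function are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_nmf(V, rank, mask, n_iter=200, eps=1e-9):
    """NMF V ~= W @ H with a fixed binary mask on the activations H,
    so weak labels can forbid certain atoms from explaining certain frames."""
    rng = np.random.default_rng(0)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = (rng.random((rank, n_frames)) + eps) * mask
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        H *= mask                      # re-impose the weak-label mask
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((6, 8))   # toy magnitude spectrogram
mask = np.ones((2, 8))
mask[1, :4] = 0.0          # atom 1 may not be active in frames 0-3
W, H = masked_nmf(V, rank=2, mask=mask)
assert np.all(H[1, :4] == 0.0)
assert np.all(W >= 0.0) and np.all(H >= 0.0)
```

In a bird-detection setting the mask would be derived from the weak labels, e.g. zeroing the bird atom's activations in recordings labelled as containing no bird.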
The sources separated by most single channel audio source separation techniques are usually distorted and each separated source contains residual signals from the other sources. To tackle this problem, we propose to enhance the separated sources by decreasing the distortion and interference between the separated sources using deep neural networks (DNNs). Two different DNNs are used in this work. The first DNN is used to separate the sources from the mixed signal. The second DNN is used to enhance the ...
In cognitive radio networks, cooperative spectrum sensing (CSS) has been a promising approach to improve sensing performance by utilizing the spatial diversity of participating secondary users (SUs). In current CSS networks, all cooperative SUs are assumed to be honest and genuine. However, the presence of malicious users sending out dishonest data can severely degrade the performance of CSS networks. In this paper, a framework with high detection accuracy and low costs of data acquisition at SUs is developed, with the purpose of mitigating the influence of malicious users. More specifically, a low-rank matrix completion based malicious user detection framework is proposed. In the proposed framework, in order to avoid requiring any prior information about the CSS network, a rank estimation algorithm and an estimation strategy for the number of corrupted channels are proposed. Numerical results show that the proposed malicious user detection framework achieves high detection accuracy with lower data acquisition costs in comparison with the conventional approach. After being validated by simulations, the proposed malicious user detection framework is tested on real-world signals over TV white space spectrum.
Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016) has offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present in detail each task and analyse the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.
Xu Yong, Kong Qiuqiang, Wang Wenwu, Plumbley Mark (2017) Surrey-CVSSP system for DCASE2017 challenge task 4, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)
Tampere University of Technology. Laboratory of Signal Processing
In this technical report, we present several methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings, due to their industrial applications. There are two subtasks, audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly-supervised mode. A new batch-level balancing strategy is also proposed to tackle the data unbalancing problem. Fusion of posteriors from different systems is found effective in improving the performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.73 error rate (ER) for the sound event detection subtask on the development set, while the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-value for audio tagging and a 1.02 ER for sound event detection.
Benetos E, Stowell D, Plumbley M (2018) Approaches to complex sound scene analysis, In: Virtanen T, Plumbley M, Ellis D (eds.), Computational Analysis of Sound Scenes and Events pp. 215-242
Springer International Publishing
This paper studies the disjointness of the time-frequency representations of simultaneously playing musical instruments. As a measure of disjointness, we use the approximate W-disjoint orthogonality proposed by Yilmaz and Rickard, which (loosely speaking) measures the degree of overlap of different sources in the time-frequency domain. The motivation for this study is to find a maximally disjoint representation in order to facilitate the separation and recognition of musical instruments in mixture signals. The transforms investigated in this paper include the short-time Fourier transform (STFT), constant-Q transform, modified discrete cosine transform (MDCT), and pitch-synchronous lapped orthogonal transforms. Simulation results are reported for a database of polyphonic music where the multitrack data (instrument signals before mixing) were available. Absolute performance varies depending on the instrument source in question, but on average the MDCT with a 93 ms frame size performed best.
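The disjointness measure can be sketched as follows. This is one common reading of approximate W-disjoint orthogonality built on the ideal binary mask, given here as an illustration rather than the paper's exact definition.

```python
import numpy as np

def wdo(S_target, S_interf):
    """Approximate W-disjoint orthogonality of a target source against an
    interferer. Under the ideal binary mask M (target dominates), WDO is
    the preserved target energy minus the leaked interferer energy,
    normalised by total target energy; 1.0 means perfectly disjoint."""
    M = (np.abs(S_target) > np.abs(S_interf)).astype(float)
    e_target = np.sum((M * np.abs(S_target)) ** 2)
    e_leak = np.sum((M * np.abs(S_interf)) ** 2)
    return (e_target - e_leak) / np.sum(np.abs(S_target) ** 2)

# Sources with non-overlapping time-frequency supports give WDO = 1.
S1 = np.array([[1.0, 0.0], [0.0, 0.0]])
S2 = np.array([[0.0, 0.0], [0.0, 1.0]])
assert abs(wdo(S1, S2) - 1.0) < 1e-12
```

Comparing this score across STFT, constant-Q, and MDCT coefficients of the same multitrack data is the kind of experiment the paper reports.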
Ellis D, Virtanen T, Plumbley M, Raj B (2018) Future Perspective, In: Virtanen T, Plumbley M, Ellis D (eds.), Computational Analysis of Sound Scenes and Events pp. 401-415
Springer International Publishing
Assuming that a set of source signals is sparsely representable in a given dictionary, we show how their sparse recovery fails whenever we can only measure a convolved observation of them. Starting from this motivation, we develop a block coordinate descent method which aims to learn a convolved dictionary and provide a sparse representation of the observed signals with small residual norm. We compare the proposed approach to the K-SVD dictionary learning algorithm and show through numerical experiments on synthetic signals that, provided some conditions on the problem data hold, our technique converges in a fixed number of iterations to a sparse representation with smaller residual norm.
This book presents computational methods for extracting the useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms.
Gives an overview of methods for computational analysis of sounds scenes and events, allowing those new to the field to become fully informed;
Covers all the aspects of the machine learning approach to computational analysis of sound scenes and events, ranging from data capture and labeling process to development of algorithms;
Includes descriptions of algorithms accompanied by a website from which software implementations can be downloaded, facilitating practical interaction with the techniques.
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, which are extracted from YouTube videos, are manually labelled with one or more audio tags but without time stamps of the audio events, hence referred to as weakly labelled data. Two subtasks are defined in this challenge using this weakly labelled data: audio tagging and sound event detection. We propose a convolutional recurrent neural network (CRNN) with a learnable gated linear unit (GLU) non-linearity applied to the log mel spectrogram. In addition, we propose a temporal attention method along the frames to predict the locations of each audio event in a chunk from the weakly labelled data. Our systems were ranked 1st and 2nd as a team in these two subtasks of the DCASE 2017 challenge, with an F value of 55.6% and an equal error rate of 0.73, respectively.
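The gated non-linearity has a simple form: a linear path modulated element-wise by a sigmoid gate. Below is a NumPy sketch of a dense GLU layer; the paper applies the gating inside convolutional layers, and the weights here are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(X, W_lin, b_lin, W_gate, b_gate):
    """GLU: (X @ W_lin + b_lin) * sigmoid(X @ W_gate + b_gate).
    The learned gate can suppress uninformative time-frequency regions."""
    return (X @ W_lin + b_lin) * sigmoid(X @ W_gate + b_gate)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))            # batch of log-mel feature rows
W_lin, W_gate = rng.standard_normal((2, 8, 4))
b_lin, b_gate = np.zeros((2, 4))
Y = gated_linear_unit(X, W_lin, b_lin, W_gate, b_gate)
# The sigmoid gate lies in (0, 1), so it can only attenuate the linear path.
assert np.all(np.abs(Y) <= np.abs(X @ W_lin + b_lin))
```

Because the gate is learned, the network itself decides which local features pass through — the "learnable gating activation" referred to above.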
We introduce 'PaperClip', a novel digital pen interface for semantic editing of speech recordings for radio production. We explain how we designed and developed our system, then present the results of a contextual qualitative user study of eight professional radio producers that compared editing using PaperClip to a screen-based interface and normal paper. As in many other paper-versus-screen studies, we found no overall preferences but rather advantages and disadvantages of both in different contexts. We discuss these relative benefits and make recommendations for future development.
There is some uncertainty as to whether objective metrics for predicting the perceived quality of audio source separation are sufficiently accurate. This issue was investigated by employing a revised experimental methodology to collect subjective ratings of sound quality and interference of singing-voice recordings that have been extracted from musical mixtures using state-of-the-art audio source separation. A correlation analysis between the experimental data and the measures of two objective evaluation toolkits, BSS Eval and PEASS, was performed to assess their performance. The artifacts-related perceptual score of the PEASS toolkit had the strongest correlation with the perception of artifacts and distortions caused by singing-voice separation. Both the source-to-interference ratio of BSS Eval and the interference-related perceptual score of PEASS showed comparable correlations with the human ratings of interference.
Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events from an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, only the tags of an audio recording are known but the times of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio clip to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of each audio event. In the source separation stage, the sources of audio events and the times of sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming a deep neural network baseline of 0.29. A source separation SDR of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as the probability mapping, outperforming the global max pooling (GMP) based probability mapping, which gives an SDR of 0.03 dB. Source code of our work is available.
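Global weighted rank pooling interpolates between max and average pooling. Below is a NumPy sketch; the decay factor r is a hyperparameter, set to 0.5 here purely for illustration.

```python
import numpy as np

def gwrp(values, r=0.5):
    """Global weighted rank pooling: sort values in descending order and
    take a weighted mean with geometric weights r**n. r = 1 recovers
    global average pooling; r -> 0 approaches global max pooling."""
    v = np.sort(values.ravel())[::-1]
    w = r ** np.arange(v.size)
    return float(np.sum(w * v) / np.sum(w))

probs = np.array([0.1, 0.9, 0.5])     # toy T-F segmentation-mask values
assert abs(gwrp(probs, r=1.0) - probs.mean()) < 1e-12
assert abs(gwrp(probs, r=1e-9) - probs.max()) < 1e-6
```

Pooling the T-F mask of each event class this way yields the clip-level presence probability, which is why it can replace global max pooling as the probability mapping.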
In this paper, we propose a divide-and-conquer approach using two generative adversarial networks (GANs) to explore how a machine can draw colorful pictures (of birds) using a small amount of training data. In our work, we simulate the procedure of an artist drawing a picture, who begins by drawing the objects' contours and edges and then paints them in different colors. We adopt two GAN models to process basic visual features including shape, texture and color. We use the first GAN model to generate the object shape, and then paint the black-and-white image based on the knowledge learned using the second GAN model. We run our experiments on 600 color images. The experimental results show that our approach can generate good quality synthetic images, comparable to real ones.
This paper investigates Audio Set classification. Audio
Set is a large-scale weakly labelled dataset (WLD) of audio
clips. In a WLD, only the presence of a label is known, without
knowing the occurrence times of the labels. We propose an
attention model to solve this WLD problem and explain the
attention model from a novel probabilistic perspective. Each
audio clip in Audio Set consists of a collection of features. We
call each feature an instance and the collection a bag, following
the terminology in multiple instance learning. In the
attention model, each instance in the bag has a trainable probability
measure for each class. The classification of the bag is
the expectation of the classification output of the instances in
the bag with respect to the learned probability measure. Experiments
show that the proposed attention model achieves
a mAP of 0.327 on Audio Set, outperforming the Google
baseline of 0.314.
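As a minimal sketch of the attention mechanism described above (illustrative only, not the authors' implementation), the bag-level probability for one class can be written as the expectation of per-instance outputs under a normalized, trainable attention measure; the function name and toy numbers below are assumptions for demonstration:

```python
import numpy as np

def attention_bag_probability(instance_probs, attention_logits):
    """Bag-level presence probability as the expectation of per-instance
    outputs under a learned probability measure over the instances.

    instance_probs:   (T,) per-instance class probabilities.
    attention_logits: (T,) unnormalized, trainable attention scores.
    """
    # Normalize the attention scores into a probability measure.
    w = np.exp(attention_logits - attention_logits.max())
    w = w / w.sum()
    # Expectation of the instance outputs with respect to that measure.
    return float(np.sum(w * instance_probs))

# Toy bag of four instances; the event is clearly present in instance 2,
# and the attention measure has learned to concentrate there.
probs = np.array([0.1, 0.2, 0.9, 0.1])
logits = np.array([0.0, 0.0, 3.0, 0.0])
p_bag = attention_bag_probability(probs, logits)
```

Because the measure concentrates on the informative instance, the bag probability sits close to that instance's output rather than being diluted by the average, which is the advantage over plain average pooling.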
Combining different models is a common strategy to build a good audio source separation system. In this work,
we combine two powerful deep neural networks for audio single channel source separation (SCSS). Namely, we
combine fully convolutional neural networks (FCNs) and recurrent neural networks, specifically, bidirectional
long short-term memory recurrent neural networks (BLSTMs). FCNs are good at extracting useful features from
the audio data and BLSTMs are good at modeling the temporal structure of the audio signals. Our experimental
results show that combining FCNs and BLSTMs achieves better separation performance than using each model alone.
Current research on audio source separation provides tools to estimate the signals contributed by different instruments in polyphonic music mixtures. Such tools can be already incorporated in music production and post-production workflows. In this paper, we describe recent experiments where audio source separation is applied to remixing and upmixing existing mono and stereo music content.
Radio production involves editing speech-based audio using tools
that represent sound using simple waveforms. Semantic speech editing systems allow users to edit audio using an automatically generated
transcript, which has the potential to improve the production workflow. To investigate this, we developed a semantic audio editor based
on a pilot study. Through a contextual qualitative study of five professional radio producers at the BBC, we examined the existing radio
production process and evaluated our semantic editor by using it to
create programmes that were later broadcast.
We observed that the participants in our study wrote detailed notes
about their recordings and used annotation to mark which parts they
wanted to use. They collaborated closely with the presenter of their
programme to structure the contents and write narrative elements.
Participants reported that they often work away from the office to
avoid distractions, and print transcripts so they can work away from
screens. They also emphasised that listening is an important part
of production, to ensure high sound quality. We found that semantic speech editing with automated speech recognition can be used to improve the radio production workflow, but that annotation, collaboration, portability and listening were not well supported by current
semantic speech editing systems. In this paper, we make recommendations on how future semantic speech editing systems can better
support the requirements of radio production.
Clipping, or saturation, is a common nonlinear distortion in
signal processing. Recently, declipping techniques have been proposed
based on sparse decomposition of the clipped signals on a fixed dictionary,
with additional constraints on the amplitude of the clipped samples.
Here we propose a dictionary learning approach, where the dictionary
is directly learned from the clipped measurements. We propose a soft-consistency
metric that minimizes the distance to a convex feasibility
set, and takes into account our knowledge about the clipping process.
We then propose a gradient descent-based dictionary learning algorithm
that minimizes the proposed metric, and is thus consistent with the clipping
measurement. Experiments show that the proposed algorithm outperforms
other dictionary learning algorithms applied to clipped signals.
We also show that learning the dictionary directly from the clipped signals
outperforms consistent sparse coding with a fixed dictionary.
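The soft-consistency idea can be pictured with a toy metric (a simplified, hypothetical form assuming a single known symmetric clip level, not the paper's exact formulation): on unclipped samples the reconstruction must match the measurement, while on clipped samples it is penalized only if it falls back inside the clipped range.

```python
import numpy as np

def soft_consistency(recon, y, clip_level):
    """Distance-like cost from a reconstruction to the convex feasibility
    set implied by clipping (simplified: one symmetric clip level).
    Unclipped samples must match the measurement; clipped samples are
    penalized only if the reconstruction re-enters the clip range."""
    hi = y >= clip_level
    lo = y <= -clip_level
    ok = ~(hi | lo)
    cost = np.sum((recon[ok] - y[ok]) ** 2)
    cost += np.sum(np.maximum(clip_level - recon[hi], 0.0) ** 2)
    cost += np.sum(np.maximum(recon[lo] + clip_level, 0.0) ** 2)
    return float(cost)

theta = 0.8
x_true = np.array([0.2, 1.1, -1.3, 0.5])     # original (unknown) signal
y_clip = np.clip(x_true, -theta, theta)      # what we actually measure
```

Note that the true signal incurs zero cost: it lies inside the feasibility set, which is what makes a gradient-based dictionary learning algorithm on this metric consistent with the clipping process.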
Non-negative Matrix Factorization (NMF) is a well established
tool for audio analysis. However, it is not well suited
for learning on weakly labeled data, i.e. data where the exact
timestamp of the sound of interest is not known. In this paper
we propose a novel extension to NMF, that allows it to extract
meaningful representations from weakly labeled audio data.
Recently, a constraint on the activation matrix was proposed
to adapt for learning on weak labels. To further improve the
method we propose to add an orthogonality regularizer of the
dictionary in the cost function of NMF. In that way we obtain
appropriate dictionaries for the sounds of interest and background
sounds from weakly labeled data. We demonstrate
that the proposed Orthogonality-Regularized Masked NMF
(ORM-NMF) can be used for Audio Event Detection of rare
events and evaluate the method on the development data from
Task2 of DCASE2017 Challenge.
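One way to picture the orthogonality regularizer is as an extra penalty on the off-diagonal energy of the dictionary's Gram matrix, which discourages the event and background atoms from overlapping. The cost below is an illustrative sketch under that assumption, not the exact ORM-NMF objective:

```python
import numpy as np

def orm_nmf_cost(V, W, H, lam):
    """Illustrative NMF objective with an orthogonality regularizer on
    the dictionary W: a Euclidean fitting term plus a penalty on the
    off-diagonal energy of W^T W, which discourages correlated atoms."""
    fit = np.sum((V - W @ H) ** 2)
    gram = W.T @ W
    ortho = np.sum(gram ** 2) - np.sum(np.diag(gram) ** 2)
    return fit + lam * ortho

V = np.abs(np.random.default_rng(0).normal(size=(4, 6)))
H = np.ones((2, 6))
W_orth = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])  # disjoint atoms
W_corr = np.array([[1., 1.], [1., 1.], [0., 0.], [0., 0.]])  # identical atoms
```

A dictionary with disjoint atoms pays no regularization penalty, while one with duplicated atoms does; the trade-off against the fitting term is controlled by lam.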
Proximal methods are an important tool in signal processing
applications, where many problems can be characterized by
the minimization of an expression involving a smooth fitting
term and a convex regularization term, for example the classic
ℓ1-Lasso. Such problems can be solved using the relevant
proximal operator. Here we consider the use of proximal operators
for the ℓp-quasinorm where 0 ≤ p ≤ 1. Rather than
seek a closed form solution, we develop an iterative algorithm
using a Majorization-Minimization procedure which results
in an inexact operator. Experiments on image denoising show
that for p ≤ 1 the algorithm is effective in the high-noise scenario,
outperforming the Lasso despite the inexactness of the operator.
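A minimal sketch of such an inexact Majorization-Minimization operator (an assumed elementwise tangent majorization of |x|^p; the paper's algorithm may differ in detail): each step replaces |x|^p by its tangent at the current iterate, which reduces the update to a reweighted soft-thresholding.

```python
import numpy as np

def prox_lp_mm(y, lam, p, n_iter=50, eps=1e-8):
    """Inexact proximal operator for lam * |x|^p (0 < p <= 1), computed
    by Majorization-Minimization: at each step |x|^p is majorized by its
    tangent at the current iterate, so the update is a reweighted
    soft-thresholding applied elementwise."""
    x = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        w = lam * p * (np.abs(x) + eps) ** (p - 1.0)   # tangent slope
        x = np.sign(y) * np.maximum(np.abs(y) - w, 0.0)
    return x

# For p = 1 the weight is constant and this is the exact
# soft-thresholding used by the Lasso.
out_l1 = prox_lp_mm(np.array([3.0, -0.5, 0.0]), lam=1.0, p=1.0)
# For p < 1 the shrinkage is weaker on large entries.
out_lp = prox_lp_mm(np.array([3.0, -0.5, 0.0]), lam=1.0, p=0.5)
```

The weaker shrinkage of large coefficients for p < 1 is exactly why such operators can outperform the Lasso, which shrinks all coefficients by the same amount.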
In deep neural networks with convolutional layers, all the
neurons in each layer typically have the same size receptive fields (RFs)
with the same resolution. Convolutional layers with neurons that have
large RF capture global information from the input features, while layers
with neurons that have small RF size capture local details with high
resolution from the input features. In this work, we introduce novel deep
multi-resolution fully convolutional neural networks (MR-FCN), where
each layer has a range of neurons with different RF sizes to extract multi-
resolution features that capture the global and local information from its
input features. The proposed MR-FCN is applied to separate the singing
voice from mixtures of music sources. Experimental results show that
using MR-FCN improves the performance compared to feedforward deep
neural networks (DNNs) and single resolution deep fully convolutional
neural networks (FCNs) on the audio source separation problem.
Radio production is a creative pursuit that uses sound to inform, educate and entertain an audience. Radio producers use audio editing tools to visually select, re-arrange and assemble sound recordings into programmes. However, current tools represent audio using waveform visualizations that display limited information about the sound.
Semantic audio analysis can be used to extract useful information from audio recordings, including when people are speaking and what they are saying. This thesis investigates how such information can be applied to create semantic audio tools that improve the radio production process.
An initial ethnographic study of radio production at the BBC reveals that producers use textual representations and paper transcripts to interact with audio, and waveforms to edit programmes. Based on these findings, three methods for improving radio production are developed and evaluated, which form the primary contribution of this thesis.
Audio visualizations can be enhanced by mapping semantic audio features to colour, but this approach had not been formally tested. We show that with an enhanced audio waveform, a typical radio production task can be completed faster, with less effort and with greater accuracy than a normal waveform.
Speech recordings can be represented and edited using transcripts, but this approach had not been formally evaluated for radio production. By developing and testing a semantic speech editor, we show that automatically-generated transcripts can be used to semantically edit speech in a professional radio production context, and identify requirements for annotation, collaboration, portability and listening.
Finally, we present a novel approach for editing audio on paper that combines semantic speech editing with a digital pen interface. Through a user study with radio producers, we compare the relative benefits of semantic speech editing using paper and screen interfaces. We find that paper is better for simple edits of familiar audio with accurate transcripts.
Zermini Alfredo, Kong Qiuqiang, Xu Yong, Plumbley Mark D., Wang Wenwu (2018) Improving reverberant speech separation with binaural cues using temporal context and convolutional neural networks,In: Deville Y, Gannot S, Mason R, Plumbley Mark, Ward D (eds.), Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science Latent Variable Analysis and Signal Separation: LVA/ICA 2018. Lecture Notes in Computer Science 10891 pp. 361-371
Given binaural features as input, such as interaural level difference
and interaural phase difference, Deep Neural Networks (DNNs)
have been recently used to localize sound sources in a mixture of speech
signals and/or noise, and to create time-frequency masks for the estimation
of the sound sources in reverberant rooms. Here, we explore a
more advanced system, where feed-forward DNNs are replaced by Convolutional
Neural Networks (CNNs). In addition, the adjacent frames
of each time frame (occurring before and after this frame) are used to
exploit contextual information, thus improving the localization and separation
for each source. The quality of the separation results is evaluated
in terms of Signal to Distortion Ratio (SDR).
It is now commonplace to capture and share images in photography as triggers for memory. In this paper we explore the possibility of using sound in the same sort of way, in a practice we call audiography. We report an initial design activity to create a system called Audio Memories comprising a ten second sound recorder, an intelligent archive for auto-classifying sound clips, and a multi-layered sound player for the social sharing of audio souvenirs around a table. The recorder and player components are essentially user experience probes that provide tangible interfaces for capturing and interacting with audio memory cues. We discuss our design decisions and process in creating these tools that harmonize user interaction and machine listening to evoke rich memories and conversations in an exploratory and open-ended way.
Analysing expressive timing in performed music can help machines to perform various perceptual tasks
such as identifying performers and understanding music structures in classical music. A hierarchical structure is
commonly used for expressive timing analysis. This paper provides a statistical demonstration to support the use
of hierarchical structure in expressive timing analysis by presenting two groups of model selection tests. The first
model selection test uses expressive timing to determine the location of music structure boundaries. The second
model selection test matches a piece of performance with the same performer playing another given piece.
Comparing the results of the model selection tests, the preferred hierarchical structures in these two model selection
tests are not the same. While determining music structure boundaries demands a hierarchical structure with more
levels in the expressive timing analysis, a hierarchical structure with fewer levels helps identify the dedicated
performer in most cases.
We explore the use of geometrical methods to tackle the non-negative independent
component analysis (non-negative ICA) problem, without assuming the reader has
an existing background in differential geometry. We concentrate on methods that
achieve this by minimizing a cost function over the space of orthogonal matrices.
We introduce the idea of the manifold and Lie group SO(n) of special orthogonal
matrices that we wish to search over, and explain how this is related to the Lie
algebra so(n) of skew-symmetric matrices. We describe how familiar optimization
methods such as steepest-descent and conjugate gradients can be transformed into
this Lie group setting, and how the Newton update step has an alternative Fourier
version in SO(n). Finally we introduce the concept of a toral subgroup generated
by a particular element of the Lie group or Lie algebra, and explore how this commutative
subgroup might be used to simplify searches on our constraint surface. No
proofs are presented in this article.
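The steepest-descent idea on SO(n) described above can be sketched as follows (a toy 2-D example under assumed function names, not code from the article): project the Euclidean gradient into the Lie algebra so(n) of skew-symmetric matrices, then step along the group geodesic via the matrix exponential, which keeps the iterate exactly orthogonal.

```python
import numpy as np

def expm_skew(B, terms=20):
    """Matrix exponential via truncated Taylor series (adequate for the
    small skew-symmetric steps taken here)."""
    out, term = np.eye(B.shape[0]), np.eye(B.shape[0])
    for k in range(1, terms):
        term = term @ B / k
        out = out + term
    return out

def so_n_descent(cost_grad, W, lr=0.005, n_iter=100):
    """Steepest descent constrained to SO(n): map the Euclidean gradient
    G to a skew-symmetric direction in the Lie algebra so(n), then move
    along the group geodesic, so W stays exactly orthogonal."""
    for _ in range(n_iter):
        G = cost_grad(W)
        B = G @ W.T - W @ G.T          # skew-symmetric: B lies in so(n)
        W = expm_skew(-lr * B) @ W     # geodesic step on SO(n)
    return W

# Toy problem: recover a planted 2-D rotation R from pairs (X, Y = R X).
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = np.random.default_rng(0).normal(size=(2, 50))
Y = R @ X
grad = lambda W: 2.0 * (W @ X - Y) @ X.T   # gradient of ||WX - Y||^2
W_hat = so_n_descent(grad, np.eye(2))
```

Because every update is a product of orthogonal matrices, no re-orthogonalization step is needed, which is the practical appeal of searching directly on the constraint surface.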
Supervised multi-channel audio source separation
requires extracting useful spectral, temporal, and spatial features
from the mixed signals. The success of many existing systems is
therefore largely dependent on the choice of features used for
training. In this work, we introduce a novel multi-channel, multiresolution
convolutional auto-encoder neural network that works
on raw time-domain signals to determine appropriate multiresolution
features for separating the singing-voice from stereo
music. Our experimental results show that the proposed method
can achieve multi-channel audio source separation without the
need for hand-crafted features or any pre- or post-processing.
Jackson Philip, Plumbley Mark D, Wang Wenwu, Brookes Tim, Coleman Philip, Mason Russell, Frohlich David, Bonina Carla, Plans David (2017) Signal Processing, Psychoacoustic Engineering and Digital Worlds: Interdisciplinary Audio Research at the University of Surrey,
At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.
This book constitutes the proceedings of the 14th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2018, held in Guildford, UK, in July 2018. The 52 full papers were carefully reviewed and selected from 62 initial submissions. As research topics the papers encompass a wide range of general mixtures of latent variables models but also theories and tools drawn from a great variety of disciplines such as structured tensor decompositions and applications; matrix and tensor factorizations; ICA methods; nonlinear mixtures; audio data and methods; signal separation evaluation campaign; deep learning and data-driven methods; advances in phase retrieval and applications; sparsity-related methods; and biomedical data and methods.
Polyphonic music transcription is a challenging
problem, requiring the identification of a collection of latent
pitches which can explain an observed music signal. Many
state-of-the-art methods are based on the Non-negative Matrix
Factorization (NMF) framework, which itself can be cast as a
latent variable model. However, the basic NMF algorithm fails
to consider many important aspects of music signals such as low-rank
or hierarchical structure and temporal continuity. In this
work we propose a probabilistic model to address some of the
shortcomings of NMF. Probabilistic Latent Component Analysis
(PLCA) provides a probabilistic interpretation of NMF and has
been widely applied to problems in audio signal processing. Based
on PLCA, we propose an algorithm which represents signals
using a collection of low-rank dictionaries built from a base
pitch dictionary. This allows each dictionary to specialize to a
given chord or interval template which will be used to represent
collections of similar frames. Experiments on a standard music
transcription data set show that our method can successfully
decompose signals into a hierarchical and smooth structure,
improving the quality of the transcription.
The Signal Separation Evaluation Campaign (SiSEC) is a
large-scale regular event aimed at evaluating current progress
in source separation through a systematic and reproducible
comparison of the participants' algorithms, providing the
source separation community with an invaluable glimpse of
recent achievements and open challenges. This paper focuses
on the music separation task from SiSEC 2018, which
compares algorithms aimed at recovering instrument stems
from a stereo mix. In this context, we conducted a subjective
evaluation whereby 34 listeners picked which of six competing
algorithms, with high objective performance scores,
best separated the singing-voice stem from 13 professionally
mixed songs. The subjective results reveal strong differences
between the algorithms, and highlight the presence
of song-dependent performance for state-of-the-art systems.
Correlations between the subjective results and the scores of
two popular performance metrics are also presented.
We address the problem of recovering a sparse
signal from clipped or quantized measurements. We show how
these two problems can be formulated as minimizing the distance
to a convex feasibility set, which provides a convex and
differentiable cost function. We then propose a fast iterative
shrinkage/thresholding algorithm that minimizes the proposed
cost, which provides a fast and efficient algorithm to recover
sparse signals from clipped and quantized measurements.
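An illustrative sketch of this scheme (hypothetical function names; the authors' exact formulation may differ): the gradient of the squared distance to the feasibility set C is A^T (Ax - P_C(Ax)), so a standard FISTA iteration with soft-thresholding applies. In the no-clipping limit the cost reduces to an ordinary Lasso, which gives a convenient sanity check.

```python
import numpy as np

def project_consistent(z, y, theta):
    """Project z onto the convex set of signals consistent with the
    clipped measurement y (clip level theta): unclipped entries must
    equal y, clipped entries need only lie beyond +/- theta."""
    hi, lo = y >= theta, y <= -theta
    out = y.copy()
    out[hi] = np.maximum(z[hi], theta)
    out[lo] = np.minimum(z[lo], -theta)
    return out

def fista_declip(A, y, theta, lam, n_iter=400):
    """FISTA on 0.5 * dist(Ax, C)^2 + lam * ||x||_1, where C is the
    clipping-consistent set above; the distance term is convex and
    differentiable, with gradient A^T (Ax - P_C(Ax))."""
    L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant
    x = np.zeros(A.shape[1])
    v, t = x.copy(), 1.0
    for _ in range(n_iter):
        r = A @ v
        u = v - A.T @ (r - project_consistent(r, y, theta)) / L
        x_new = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# Sanity check in the no-clipping limit (theta above all |y|): the cost
# reduces to an ordinary Lasso, which recovers a sparse vector.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x0 = np.zeros(20)
x0[[2, 7, 15]] = [1.0, -1.0, 0.5]
y = A @ x0
x_hat = fista_declip(A, y, theta=10.0, lam=0.01)
```

With a realistic theta, only the projection changes: clipped entries of the residual are penalized solely when the reconstruction falls short of the clip level, so the same iteration handles clipped (and, with a different feasibility set, quantized) measurements.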
Perceptual measures are usually considered more
reliable than instrumental measures for evaluating the perceived
level of reverberation. However, such measures are time consuming
and expensive, and, due to variations in stimuli or assessors,
the resulting data is not always statistically significant. Therefore,
an (objective) measure of the perceived level of reverberation
becomes desirable. In this paper, we develop a new method to
predict the level of reverberation from audio signals by relating
the perceptual listening test results with those obtained from a
machine learned model. More specifically, we compare the use of
a multiple stimuli test for within and between class architectures
to evaluate the perceived level of reverberation. An expert set
of 16 human listeners rated the perceived level of reverberation
for a same set of files from different audio source types. We
then train a machine learning model using the training data
gathered for the same set of files and a variety of reverberation
related features extracted from the data such as reverberation
time, and direct-to-reverberation ratio. The results suggest that
the machine learned model offers an accurate prediction of the perceived level of reverberation.
The Detection and Classification of Acoustic Scenes and Events (DCASE) challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a "CNN Baseline" system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that a deeper CNN with 8 layers performs better than a CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
Perceptual measurements have typically been recognized
as the most reliable measurements in assessing perceived
levels of reverberation. In this paper, a combination of blind
RT60 estimation method and a binaural, nonlinear auditory
model is employed to derive signal-based measures (features)
that are then utilized in predicting the perceived level of reverberation.
Such measures lack the excess of effort necessary for
calculating perceptual measures; not to mention the variations
in either stimuli or assessors that may cause such measures to
be statistically insignificant. As a result, the automatic extraction
of objective measurements that can be applied to predict the
perceived level of reverberation becomes of vital significance.
Consequently, this work is aimed at discovering measurements
such as clarity, reverberance, and RT60 which can automatically
be derived directly from audio data. These measurements along
with labels from human listening tests are then forwarded to a
machine learning system seeking to build a model to estimate
the perceived level of reverberation, which is labeled by an
expert, autonomously. The data has been labeled by an expert
human listener for a unilateral set of files from arbitrary audio
source types. By examining the results, it can be observed that
the automatically extracted features can aid in estimating the perceived level of reverberation.
We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From an attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy on subtask A is 72.6%, an improvement of 12.9% over the official baseline.
General-purpose audio tagging refers to classifying sounds that are
of a diverse nature, and is relevant in many applications where
domain-specific information cannot be exploited. The DCASE 2018
challenge introduces Task 2 for this very problem. In this task, there
are a large number of classes and the audio clips vary in duration.
Moreover, a subset of the labels are noisy. In this paper, we propose
a system to address these challenges. The basis of our system is
an ensemble of convolutional neural networks trained on log-scaled
mel spectrograms. We use preprocessing and data augmentation
methods to improve the performance further. To reduce the effects
of label noise, two techniques are proposed: loss function weighting
and pseudo-labeling. Experiments on the private test set of this task
show that our system achieves state-of-the-art performance with a
mean average precision score of 0.951.
Cano Estefanía, FitzGerald Derry, Liutkus Antoine, Plumbley Mark D., Stöter Fabian-Robert (2018) Musical Source Separation: An Introduction,IEEE Signal Processing Magazine
Institute of Electrical and Electronics Engineers (IEEE)
Many people listen to recorded music as part of their everyday lives, for example from radio
or TV programmes, CDs, downloads or increasingly from online streaming services. Sometimes
we might want to remix the balance within the music, perhaps to make the vocals louder or to
suppress an unwanted sound, or we might want to upmix a 2-channel stereo recording to a 5.1-
channel surround sound system. We might also want to change the spatial location of a musical
instrument within the mix. All of these applications are relatively straightforward, provided we
have access to separate sound channels (stems) for each musical audio object.
However, if we only have access to the final recording mix, which is usually the case, this
is much more challenging. To estimate the original musical sources, which would allow us to
remix, suppress or upmix the sources, we need to perform musical source separation (MSS).
In the general source separation problem, we are given one or more mixture signals that
contain different mixtures of some original source signals. This is illustrated in Figure 1 where
four sources, namely vocals, drums, bass and guitar, are all present in the mixture. The task is
to recover one or more of the source signals given the mixtures. In some cases, this is relatively
straightforward, for example, if there are at least as many mixtures as there are sources, and if
the mixing process is fixed, with no delays, filters or non-linear mastering.
However, MSS is normally more challenging. Typically, there may be many musical instruments
and voices in a 2-channel recording, and the sources have often been processed with the addition
of filters and reverberation (sometimes nonlinear) in the recording and mixing process. In some
cases, the sources may move, or the production parameters may change, meaning that the mixture
is time-varying. All of these issues make MSS a very challenging problem.
Nevertheless, musical sound sources have particular properties and structures that can help
us. For example, musical source signals often have a regular harmonic structure of frequencies
at regular intervals, and can have frequency contours characteristic of each musical instrument.
They may also repeat in particular temporal patterns based on the musical structure.
In this paper we will explore the MSS problem and introduce approaches to tackle it. We
will begin by introducing characteristics of music signals, we will then give an introduction to
MSS, and finally consider a range of MSS models. We will also discuss how to evaluate MSS
approaches, and discuss limitations and directions for future research.
Audio source separation is a very challenging problem, and many different approaches have been proposed in attempts to solve it. We consider the problem of separating sources from two-channel instantaneous audio mixtures. One approach to this is to transform the mixtures into the time-frequency domain to obtain approximately disjoint representations of the sources, and then separate the sources using time-frequency masking. We focus on demixing the sources by binary masking, and assume that the mixing parameters are known. In this paper, we investigate the application of cosine packet (CP) trees as a foundation for the transform.
We determine an appropriate transform by applying a computationally efficient best basis algorithm to a set of possible local cosine bases organised in a tree structure. We develop a heuristically motivated cost function which maximises the energy of the transform coefficients associated with a particular source. Finally, we evaluate objectively our proposed transform method by comparing it against fixed-basis transforms such as the short-time Fourier transform (STFT) and modified discrete cosine transform (MDCT). Evaluation results indicate that our proposed transform method outperforms MDCT and is competitive with the STFT, and informal listening tests suggest that the proposed method exhibits less objectionable noise than the STFT.
Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip. Many supervised SED algorithms rely on strongly labelled data which contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without knowing their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied on a T-F representation, such as log mel spectrogram of an audio clip to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used for separating the sound events from the background scenes in the time-frequency domain. Then a classification mapping is applied on the T-F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping using a convolutional neural network and the classification mapping using a global weighted rank pooling (GWRP). In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound events data. When mixing under 0 dB, the proposed method achieved F1 scores of 0.534, 0.398 and 0.167 in audio tagging, frame-wise SED and event-wise SED, outperforming the fully connected deep neural network baseline of 0.331, 0.237 and 0.120, respectively. In T-F segmentation, we achieved an F1 score of 0.218, where previous methods were not able to do T-F segmentation.
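Global weighted rank pooling itself is simple to state: sort the mask values in descending order and average them with geometrically decaying weights, so that high responses dominate without discarding everything but the maximum. A sketch (the decay ratio r is a hyperparameter, and the function name is ours):

```python
import numpy as np

def gwrp(mask, r=0.5):
    """Global weighted rank pooling: sort values in descending order and
    average with geometrically decaying weights r**j. r = 1 recovers
    average pooling; r = 0 recovers global max pooling."""
    v = np.sort(mask.ravel())[::-1]
    w = r ** np.arange(v.size)
    return float(np.sum(w * v) / np.sum(w))

m = np.array([1.0, 0.0, 0.0, 0.0])   # event active in a single T-F cell
```

Interpolating between max and average pooling is what makes GWRP suited to weak labels: a brief event is not washed out by averaging, yet the gradient is not concentrated on a single cell as with max pooling.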
The goal of Acoustic Scene Classification (ASC) is to recognise
the environment in which an audio waveform has been
recorded. Recently, deep neural networks have been applied to
ASC and have achieved state-of-the-art performance. However,
few works have investigated how to visualise and understand
what a neural network has learnt from acoustic scenes. Previous
work applied local pooling after each convolutional layer,
therefore reduced the size of the feature maps. In this paper,
we suggest that local pooling is not necessary, but the size of
the receptive field is important. We apply atrous Convolutional
Neural Networks (CNNs) with global attention pooling as the
classification model. The internal feature maps of the attention
model can be visualised and explained. On the Detection
and Classification of Acoustic Scenes and Events (DCASE)
2018 dataset, our proposed method achieves an accuracy of
72.7 %, significantly outperforming the CNNs without dilation
at 60.4 %. Furthermore, our results demonstrate that the learnt
feature maps contain rich information on acoustic scenes in
the time-frequency domain.
Kroos Christian, Bones Oliver, Cao Yin, Harris Lara, Jackson Philip J. B., Davies William J., Wang Wenwu, Cox Trevor J., Plumbley Mark D. (2019) Generalisation in environmental sound classification: the 'making sense of sounds' data set and challenge,Proceedings of the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019)
Institute of Electrical and Electronics Engineers (IEEE)
Humans are able to identify a large number of environmental sounds and categorise them according to high-level semantic categories, e.g. urban sounds or music. They are also capable of generalising from past experience to new sounds when applying these categories. In this paper we report on the creation of a data set that is structured according to the top level of a taxonomy derived from human judgements, and the design of an associated machine learning challenge in which strong generalisation abilities are required to be successful. We introduce a baseline classification system, a deep convolutional network, which showed strong performance with an average accuracy on the evaluation data of 80.8%. The result is discussed in the light of two alternative explanations: an unlikely accidental category bias in the sound recordings, or a more plausible true acoustic grounding of the high-level categories.
Acoustic Event Detection (AED) is an important task of machine listening which, in recent years, has been addressed using common machine learning methods like Non-negative Matrix Factorization (NMF) or deep learning. However, most of these approaches do not take into consideration the way that the human auditory system detects salient sounds. In this work, we propose a method for AED using weakly labeled data that combines a Non-negative Matrix Factorization model with a salience model based on predictive coding in the form of Kalman filters. We show that models of auditory perception, particularly auditory salience, can be successfully incorporated into existing AED methods and improve their performance on rare event detection. We evaluate the method on Task 2 of the DCASE 2017 challenge.
Sound event detection (SED) methods typically rely on either strongly labelled data or weakly labelled data. As an alternative, sequentially labelled data (SLD) was proposed. In SLD, the events and the order of events in audio clips are known, without knowing the occurrence time of events. This paper proposes a connectionist temporal classification (CTC) based SED system that uses SLD instead of strongly labelled data, with a novel unsupervised clustering stage. Experiments on 41 classes of sound events show that the proposed two-stage method trained on SLD achieves performance comparable to the previous state-of-the-art SED system trained on strongly labelled data, and is far better than another state-of-the-art SED system trained on weakly labelled data, which indicates the effectiveness of the proposed two-stage method trained on SLD without any onset/offset time of sound events.
Kong Qiuqiang, Xu Yong, Iqbal Turab, Cao Yin, Wang Wenwu, Plumbley Mark D. (2019) Acoustic scene generation with conditional SampleRNN, Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019)
Institute of Electrical and Electronics Engineers (IEEE)
Acoustic scene generation (ASG) is a task to generate waveforms for acoustic scenes. ASG can be used to generate audio scenes for movies and computer games. Recently, neural networks such as SampleRNN have been used for speech and music generation. However, ASG is more challenging due to its wide variety. In addition, evaluating a generative model is also difficult. In this paper, we propose to use a conditional SampleRNN model to generate acoustic scenes conditioned on the input classes. We also propose objective criteria to evaluate the quality and diversity of the generated samples based on classification accuracy. The experiments on the DCASE 2016 Task 1 acoustic scene data show that with the generated audio samples, a classification accuracy of 65.5% can be achieved, compared to samples generated by a random model at 6.7% and samples from real recordings at 83.1%. The performance of a classifier trained only on generated samples achieves an accuracy of 51.3%, as opposed to an accuracy of 6.7% with samples generated by a random model.
This study investigates the perceptual consequences of phase change in conventional magnitude-based source separation. A listening test was conducted in which the participants compared three different source separation scenarios, each with two phase retrieval cases: phase from the original mix or from the target source. The participants' responses regarding similarity to the reference showed that 1) the difference between the mix phase and the perfect target phase was perceivable in the majority of cases, with some song-dependent exceptions, and 2) use of the mix phase degraded the perceived quality even in the case of perfect magnitude separation. The findings imply that there is room for perceptual improvement by attempting correct phase reconstruction, in addition to achieving better magnitude-based separation.
Current performance evaluation for audio source separation depends on comparing the processed or separated signals with reference signals. Therefore, common performance evaluation toolkits are not applicable to real-world situations where the ground truth audio is unavailable. In this paper, we propose a performance evaluation technique that does not require reference signals in order to assess separation quality. The proposed technique uses a deep neural network (DNN) to map the processed audio into its quality score. Our experimental results show that the DNN is capable of predicting the sources-to-artifacts ratio from the blind source separation evaluation toolkit for singing-voice separation, without the need for reference signals.
We propose a denoising and segmentation technique for the second heart sound (S2). To denoise, Matching Pursuit (MP) was applied using a set of non-linear chirp signals as atoms. We show that the proposed method can be used to segment the phonocardiogram of the second heart sound into its two clinically meaningful components: the aortic (A2) and pulmonary (P2) components.
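A minimal sketch of matching pursuit over a chirp dictionary may help make the atom-selection step concrete. For simplicity it uses linear chirp atoms, whereas the paper uses non-linear chirps, and all names and parameter values below are illustrative:

```python
import numpy as np

def chirp_atom(n, f0, f1, fs):
    """Unit-norm linear chirp sweeping from f0 to f1 Hz over n samples."""
    t = np.arange(n) / fs
    dur = n / fs
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / dur * t ** 2)
    a = np.sin(phase)
    return a / np.linalg.norm(a)

def matching_pursuit(x, atoms, n_iter=3):
    """Greedy MP: repeatedly subtract the best-correlated unit-norm atom."""
    residual = x.astype(float).copy()
    approx = np.zeros_like(residual)
    for _ in range(n_iter):
        corrs = atoms @ residual           # correlation with each atom
        k = int(np.argmax(np.abs(corrs)))  # best-matching atom
        approx += corrs[k] * atoms[k]
        residual -= corrs[k] * atoms[k]
    return approx, residual

fs, n = 1000, 256
pairs = [(50, 100), (100, 200), (200, 80)]        # (f0, f1) per atom
atoms = np.stack([chirp_atom(n, f0, f1, fs) for f0, f1 in pairs])
x = 0.7 * atoms[1]                                # signal = a single atom
approx, residual = matching_pursuit(x, atoms, n_iter=1)
```

When the signal is exactly one atom, a single MP iteration removes it and the residual vanishes; denoising relies on noise correlating poorly with every atom.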
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synthesizing-decomposition (S-D) approach to solve the single-channel separation and deconvolution problem. In synthesizing, a generative model for sources is built using a generative adversarial network (GAN). In decomposition, both mixing filters and sources are optimized to minimize the reconstruction error of the mixture. The proposed S-D approach achieves a peak signal-to-noise ratio (PSNR) of 18.9 dB and 15.4 dB in image inpainting and completion, outperforming a baseline convolutional neural network PSNR of 15.3 dB and 12.2 dB, respectively, and achieves a PSNR of 13.2 dB in source separation together with deconvolution, outperforming a convolutive non-negative matrix factorization (NMF) baseline of 10.1 dB.
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, while the onset and offset times are unknown. To address the weakly-labelled audio tagging problem, we propose attention neural networks as a way to attend the most salient parts of an audio clip. We bridge the connection between attention neural networks and multiple instance learning (MIL) methods, and propose decision-level and feature-level attention neural networks for audio tagging. We investigate attention neural networks modelled by different functions, depths and widths. Experiments on AudioSet show that the feature-level attention neural network achieves a state-of-the-art mean average precision (mAP) of 0.369, outperforming the best multiple instance learning (MIL) method of 0.317 and Google's deep neural network baseline of 0.314. In addition, we discover that the audio tagging performance on AudioSet embedding features has a weak correlation with the number of training examples and the quality of labels of each sound class.
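Decision-level attention pooling of this kind reduces to a weighted average of segment-level probabilities. In the sketch below, `seg_probs` and `att_logits` stand in for the per-segment outputs of the classification and attention branches; the names and toy numbers are ours:

```python
import numpy as np

def attention_pooling(seg_probs, att_logits):
    """Decision-level attention for MIL-style audio tagging:
    p_c = sum_t a[t,c] * q[t,c] / sum_t a[t,c],
    so segments with large attention weights dominate the clip prediction
    while silent or irrelevant segments are effectively ignored."""
    att = np.exp(att_logits)                     # positive attention weights
    return (att * seg_probs).sum(axis=0) / att.sum(axis=0)

# Three segments, one class: the event occurs only in the second segment,
# and the attention branch assigns that segment a large weight.
q = np.array([[0.1], [0.9], [0.1]])              # segment probabilities
a = np.array([[0.0], [8.0], [0.0]])              # attention logits
print(attention_pooling(q, a))                    # close to 0.9
```

With uniform attention logits the pooled value falls back to the plain average over segments, which is the connection to classical MIL mean pooling.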
Sparse coding and dictionary learning are popular techniques for linear inverse problems such as denoising or inpainting. However, in many cases the measurement process is nonlinear, for example for clipped, quantized or 1-bit measurements. These problems have often been addressed by solving constrained sparse coding problems, which can be difficult to solve, and by assuming that the sparsifying dictionary is known and fixed. Here we propose a simple and unified framework to deal with nonlinear measurements. We propose a cost function that minimizes the distance to a convex feasibility set, which models our knowledge about the nonlinear measurement. This provides an unconstrained, convex, and differentiable cost function that is simple to optimize, and generalizes the linear least squares cost commonly used in sparse coding. We then propose proximal-based sparse coding and dictionary learning algorithms that are able to learn directly from nonlinearly corrupted signals. We show how the proposed framework and algorithms can be applied to clipped, quantized and 1-bit data.
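For the clipping case, the convex feasibility set and its projection are simple to write down, which makes the distance-to-set cost and its gradient concrete. A sketch under our own naming (a hardclip at ±theta; the paper's notation may differ):

```python
import numpy as np

def project_clip_set(u, y, theta):
    """Project u onto the feasibility set of a measurement y clipped at
    +/- theta: unclipped entries must equal y exactly; entries clipped at
    +theta must be >= theta; entries clipped at -theta must be <= -theta."""
    p = u.copy()
    unclipped = np.abs(y) < theta
    p[unclipped] = y[unclipped]
    p[(y >= theta) & (u < theta)] = theta      # pull up to the clip level
    p[(y <= -theta) & (u > -theta)] = -theta   # pull down to the clip level
    return p

def clip_cost_grad(D, z, y, theta):
    """Gradient of f(z) = 0.5 * ||Dz - P_C(Dz)||^2, the squared distance of
    the reconstruction Dz to the feasibility set C."""
    u = D @ z
    return D.T @ (u - project_clip_set(u, y, theta))

y = np.array([0.2, 1.0, -1.0])    # measurement clipped at theta = 1.0
u = np.array([0.5, 1.3, 0.1])
p = project_clip_set(u, y, 1.0)   # -> 0.2 (match y), 1.3 (kept), -1.0
```

Because the gradient vanishes exactly when the reconstruction is feasible, this cost is unconstrained and differentiable, so standard proximal sparse coding steps apply unchanged.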
This paper proposes sound event localization and detection methods for multichannel recordings. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) that perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system.
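Pairwise TDOAs of the kind estimated here are classically computed with generalized cross-correlation with phase transform (GCC-PHAT). The sketch below shows that classical estimator, not necessarily the paper's exact pipeline; names and parameters are ours:

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """Estimate the delay of x2 relative to x1 (in seconds) via GCC-PHAT:
    whiten the cross-spectrum to unit magnitude, so the inverse FFT is a
    sharp peak at the true lag."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12                 # phase transform
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs

rng = np.random.default_rng(0)
x1 = rng.standard_normal(256)
x2 = np.concatenate([np.zeros(5), x1[:-5]])        # channel 2 lags 5 samples
tdoa = gcc_phat_tdoa(x1, x2, fs=1000)              # ~0.005 s
```

With four microphones there are six such pairs, and the per-pair TDOAs are what the system combines into a 3-D DOA estimate.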
While deep learning is undoubtedly the predominant learning technique across speech processing, it is still not widely used in health-based applications. The corpora available for health-style recognition problems are often small, both concerning the total amount of data available and the number of individuals present. The Bipolar Disorder corpus, used in the 2018 Audio/Visual Emotion Challenge, contains only 218 audio samples from 46 individuals. Herein, we present a multi-instance learning framework aimed at constructing more reliable deep learning-based models in such conditions. First, we segment the speech files into multiple chunks. However, the problem is that each of the individual chunks is weakly labelled, as they are annotated with the label of the corresponding speech file, but may not be indicative of that label. We then train the deep learning-based (ensemble) multi-instance learning model, aiming at solving such a weakly labelled problem. The presented results demonstrate that this approach can improve the accuracy of feedforward, recurrent, and convolutional neural nets on the 3-class mania classification tasks undertaken on the Bipolar Disorder corpus.
Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
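The masking step, training DOAE only on frames where the SED ground truth marks an event as active, can be sketched as a masked regression loss. The names and the plain MSE choice below are ours, for illustration:

```python
import numpy as np

def masked_doa_loss(doa_pred, doa_true, sed_mask):
    """Mean squared DOA error computed only over frames where the SED
    ground-truth mask is 1; inactive frames contribute nothing, so the
    DOA branch never trains on frames with no source present."""
    err = (doa_pred - doa_true) ** 2
    return float((err * sed_mask).sum() / max(sed_mask.sum(), 1))

pred = np.array([0.0, 10.0, 20.0])   # predicted DOA per frame (degrees)
true = np.array([0.0, 0.0, 0.0])
mask = np.array([1.0, 0.0, 1.0])     # event active in frames 0 and 2 only
print(masked_doa_loss(pred, true, mask))   # (0 + 400) / 2 = 200.0
```

Masking with the SED ground truth keeps the DOA targets well defined, which is one way joint training's mutual interference can be avoided.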
We consider the task of independent component analysis when the independent sources are known to be nonnegative and well-grounded, so that they have a nonzero probability density function (pdf) in the region of zero. We propose the use of a "nonnegative principal component analysis (nonnegative PCA)" algorithm, which is a special case of the nonlinear PCA algorithm, but with a rectification nonlinearity, and we conjecture that this algorithm will find such nonnegative well-grounded independent sources, under reasonable initial conditions. While the algorithm has proved difficult to analyze in the general case, we give some analytical results that are consistent with this conjecture and some numerical simulations that illustrate its operation.
Source separation is the task of separating an audio recording into individual sound sources. Source separation is fundamental for computational auditory scene analysis. Previous work on source separation has focused on separating particular sound classes such as speech and music. Much previous work requires mixtures and clean source pairs for training. In this work, we propose a source separation framework trained with weakly labelled data. Weakly labelled data only contains the tags of an audio clip, without the occurrence time of sound events. We first train a sound event detection system with AudioSet. The trained sound event detection system is used to detect segments that are most likely to contain a target sound event. Then a regression is learnt from a mixture of two randomly selected segments to a target segment, conditioned on the audio tagging prediction of the target segment. Our proposed system can separate 527 kinds of sound classes from AudioSet within a single system. A U-Net is adopted for the separation system and achieves an average SDR of 5.67 dB over the 527 sound classes in AudioSet.
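The training example construction described above can be sketched in a few lines. This is a simplified illustration: in the paper the segments come from the trained SED system and the conditioning vector from the trained audio tagging model, while `dummy_tagger` and all names here are placeholders of ours:

```python
import numpy as np

def make_training_example(seg_target, seg_other, tagger):
    """One tag-conditioned separation example: the network input is the
    mixture of two detected segments plus the (soft) tag vector of the
    target segment; the regression target is the target segment itself."""
    mixture = seg_target + seg_other
    condition = tagger(seg_target)      # audio-tagging prediction of target
    return mixture, condition, seg_target

def dummy_tagger(seg):
    """Hypothetical stand-in tagger returning a fixed one-hot tag vector."""
    v = np.zeros(4)
    v[0] = 1.0
    return v

a, b = np.ones(8), 2 * np.ones(8)
mix, cond, target = make_training_example(a, b, dummy_tagger)
```

Because the conditioning vector selects which source to extract, a single network can be trained to separate any of the 527 classes from the same mixture.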
In supervised machine learning, the assumption that training data is labelled correctly is not always satisfied. In this paper, we investigate an instance of labelling error for classification tasks in which the dataset is corrupted with out-of-distribution (OOD) instances: data that does not belong to any of the target classes, but is labelled as such. We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method uses an auxiliary classifier, trained on data that is known to be in-distribution, for detection and relabelling. The amount of data required for this is shown to be small. Experiments are carried out on the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin. Comparisons with other noise-robust techniques are similarly encouraging.
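One plausible relabelling policy of this kind can be sketched as follows; the confidence threshold and function names are hypothetical, not taken from the paper:

```python
import numpy as np

def detect_and_relabel(probs, threshold=0.85):
    """Given the auxiliary classifier's probabilities over the target
    classes for suspected OOD instances, relabel an instance with the
    predicted class when the classifier is confident, and drop the rest.
    Returns (indices of kept instances, their new labels)."""
    kept, labels = [], []
    for i, p in enumerate(probs):
        c = int(np.argmax(p))
        if p[c] >= threshold:
            kept.append(i)
            labels.append(c)
    return kept, labels

probs = np.array([[0.95, 0.05],    # confident -> relabel as class 0
                  [0.55, 0.45]])   # uncertain -> discard
print(detect_and_relabel(probs))   # ([0], [0])
```

The intuition is that a confidently classified "OOD" instance is probably a mislabelled in-distribution sample, so relabelling recovers training data that discarding would waste.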
Sound event detection (SED) is the problem of detecting the onset and offset times of sound events in an audio recording. SED has many applications in both academia and industry, such as multimedia information retrieval and monitoring of domestic and public security. However, compared to speech signal processing, which has been researched for many years, the classification and detection of general sounds received little attention until recent years.
One limitation of research on audio classification and sound event detection was that few datasets were publicly available until the release of the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets. The DCASE datasets consist of data for acoustic scene classification (ASC), audio tagging (AT) and sound event detection. ASC and AT are tasks to design systems that predict pre-defined labels for an audio clip. SED is a task to design systems that predict both the presence or absence of sound events in an audio clip and the onset and offset times of the sound events.
One difficulty of the audio classification and SED task is that many datasets such as the DCASE dataset are weakly labelled. That is, only the presence or absence of sound events in an audio clip is known, without knowing the onset and offset annotations of the sound events. This thesis focused on solving the audio tagging and sound event detection problem using only weakly labelled data. This thesis proposed attention neural networks to solve the general weakly labelled AT and SED problem. The attention neural networks can automatically learn to attend to important segments and ignore silence and irrelevant segments in an audio clip. We developed a set of weak learning methods for AT and SED using attention neural networks. The proposed methods have achieved state-of-the-art performance in audio tagging and sound event detection.
Speech source separation aims to estimate one or more individual sources from mixtures of multiple sound sources, e.g. speech, noise and music. While humans have an innate ability to separate sources in a sound mixture, this is not a trivial task for machines.
In this thesis, we study the problem of speech separation, with a varying degree of complexity with respect to room reverberation, the number of speech sources and the number of microphones available for capturing the sources. We focus on state-of-the-art deep learning techniques, and investigate the problem of separating speech sources from binaural and B-format mixtures obtained in real reverberant rooms.
First, we evaluate a baseline system for binaural speech separation, where fully-connected Deep Neural Networks (DNNs) and spatial features, such as Interaural Level Difference (ILD) and Interaural Phase Difference (IPD), are used. We further extend this baseline by using the dropout technique to mitigate the overfitting problem and adding spectral features, such as the Log-Power Spectrogram (LPS), to improve the separation performance.
Second, we develop a Convolutional Neural Networks (CNNs)-based binaural speech separation system. We then study the potential of using data augmentation techniques to improve speech separation quality. In particular, we introduce contextual frames expansion, by including the information from neighbouring time frames before and after a given time frame.
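The contextual frames expansion described above, stacking each time frame with its neighbours before and after it, can be sketched as follows. The padding strategy and window sizes here are illustrative choices of ours:

```python
import numpy as np

def expand_context(frames, left=2, right=2):
    """Contextual frames expansion: stack each time frame with `left`
    preceding and `right` following frames, (T, F) -> (T, (left+1+right)*F).
    Edges are padded by repeating the first/last frame."""
    padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                             frames,
                             np.repeat(frames[-1:], right, axis=0)])
    T, F = frames.shape
    width = left + 1 + right
    return np.stack([padded[t:t + width].ravel() for t in range(T)])

frames = np.arange(12.0).reshape(4, 3)   # 4 frames, 3 features each
out = expand_context(frames)
print(out.shape)                          # (4, 15)
```

Each expanded frame now carries short-term temporal context, which gives the frame-wise separation network information it could not see from a single frame.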
Finally, we study the use of deep learning methods for B-format recordings. This allows the pressure gradient information to be exploited, in addition to the widely used acoustic pressure information, for deriving the angular features for source separation. Extensive experiments have been performed on two datasets captured in five different rooms at the University of Surrey. The proposed methods are shown to offer improved performance over the state-of-the-art in terms of separation quality.
Deep neural networks with convolutional layers usually process the entire spectrogram of an audio signal with the same time-frequency resolutions, number of filters, and dimensionality reduction scale. According to the constant-Q transform, good features can be extracted from audio signals if the low frequency bands are processed with high frequency resolution filters and the high frequency bands with high time resolution filters. In the spectrogram of a mixture of singing voices and music signals, there is usually more information about the voice in the low frequency bands than in the high frequency bands. This raises the need for processing each part of the spectrogram differently. In this paper, we propose a multi-band multi-resolution fully convolutional neural network (MBR-FCN) for singing voice separation. The MBR-FCN processes the frequency bands that have more information about the target signals with more filters and a smaller dimensionality reduction scale than the bands with less information. Furthermore, the MBR-FCN processes the low frequency bands with high frequency resolution filters and the high frequency bands with high time resolution filters. Our experimental results show that the proposed MBR-FCN with very few parameters achieves better singing voice separation performance than other deep neural networks.
Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled. That is, there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming the system without threshold optimization at 0.629, and a sound event detection F1 of 0.584, outperforming that without threshold optimization at 0.564.
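The second stage, choosing thresholds against a threshold-dependent metric, can be illustrated with a simple grid search over a single decision threshold. The grid, the toy data and the plain F1 implementation below are ours; the paper's optimization is over per-class thresholds:

```python
import numpy as np

def f1(preds, targets):
    """Binary F1 from boolean prediction and target arrays."""
    tp = int(np.sum(preds & targets))
    fp = int(np.sum(preds & ~targets))
    fn = int(np.sum(~preds & targets))
    return 2 * tp / max(2 * tp + fp + fn, 1)

def optimize_threshold(probs, targets, grid=None):
    """Second stage: after training against a threshold-free metric
    (e.g. mAP), search for the decision threshold maximizing F1 on
    held-out predictions, instead of fixing it empirically at 0.5."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    scores = [f1(probs >= t, targets) for t in grid]
    return float(grid[int(np.argmax(scores))])

targets = np.array([True, True, True, False])
probs = np.array([0.40, 0.45, 0.90, 0.20])
best = optimize_threshold(probs, targets)
print(round(best, 2))   # 0.25: any threshold in (0.20, 0.40] gives F1 = 1.0
```

Note that the fixed threshold 0.5 would miss the two positives at 0.40 and 0.45, which is exactly the failure mode the automatic search corrects.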
Cardiovascular disease is one of the leading causes of death in humans. In the past decade, heart sound classification has been increasingly studied for its feasibility in developing a non-invasive approach to monitor a subject's health status. In particular, relevant studies have benefited from the fast development of wearable devices and machine learning techniques. Nevertheless, finding and designing efficient acoustic properties from heart sounds is an expensive and time-consuming task. It is known that transfer learning methods can help extract higher representations automatically from heart sounds without any human domain knowledge. However, most existing studies are based on models pre-trained on images, which may not fully represent the characteristics inherited from audio. To this end, we propose a novel transfer learning model pre-trained on large-scale audio data for a heart sound classification task. In this study, the PhysioNet CinC Challenge Dataset is used for evaluation. Experimental results demonstrate that our proposed pre-trained audio models can outperform other popular models pre-trained on images, achieving the highest unweighted average recall of 89.7%.
Situated in the domain of urban sound scene classification by humans and machines, this research is the first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations and four different window states: two stationary states (open and closed) and two transitional states (open to closed and closed to open). We have then built machine recognition baselines for different scenarios (open set versus closed set) using a deep learning framework. A human listening test was also performed to compare human and machine performance in detecting the window state using only acoustic cues. Our experimental results reveal that, when using a simple machine baseline system, humans and machines achieve similar average performance in closed set experiments.