Professor Mark Plumbley

Head of School of Computer Science and Electronic Engineering, Professor of Signal Processing
03 BB 01
Personal Assistant: Kelly Green
+44 (0)1483 689120


Research interests

My research concerns AI for Sound: using machine learning and signal processing for analysis and recognition of sounds. My focus is on detection, classification and separation of acoustic scenes and events, particularly real-world sounds, using methods such as deep learning, sparse representations and probabilistic models.

I have published over 350 papers in journals, conferences and books, including over 60 journal papers and the recent Springer co-edited book on Computational Analysis of Sound Scenes and Events.

Much of my research is funded by grants from EPSRC, the EU, Innovate UK and other sources. I currently hold an EPSRC Fellowship on "AI for Sound", and recently led the EPSRC projects Making Sense of Sounds and Musical Audio Repurposing using Source Separation, and two EU research training networks, SpaRTaN and MacSeNet. My total grant funding over the last 12 years is around £37M, including £18M as Principal Investigator, Coordinator or Lead Applicant.

I was recently co-Chair of the DCASE 2018 Workshop on Detection and Classification of Acoustic Scenes and Events, co-Chair of the 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2018) and co-Chair of the Signal Processing with Adaptive Sparse Structured Representations (SPARS 2017) workshop.

Publication highlights


For preprints of recent papers, see Surrey Research Insight, Plumbley, M.

For preprints of earlier papers, see my earlier publications page.


Mark Plumbley is Professor of Signal Processing at the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, in Guildford, UK. After receiving his Ph.D. degree in neural networks in 1991, he became a Lecturer at King’s College London, before moving to Queen Mary University of London in 2002. He subsequently became Professor and Director of the Centre for Digital Music, before joining the University of Surrey in 2015. His current research concerns AI for Sound: using machine learning, AI and signal processing for analysis and recognition of sounds, particularly real-world everyday sounds. He led the first data challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2013). He leads EPSRC-funded projects on making sense of everyday sounds and on audio source separation, and leads two EU-funded research training networks in sparse representations and compressed sensing. He is a co-editor of the recent Springer book on "Computational Analysis of Sound Scenes and Events". He is a Fellow of the IET and IEEE.

Previous roles

2017 (Jan-Sep)
Interim Head of Department
Department of Computer Science, University of Surrey
2010 - 2014
Director, Centre for Digital Music
Queen Mary University of London
2002 - 2014
Professor of Machine Learning and Signal Processing
Department of Electronic Engineering / School of Electronic Engineering and Computer Science, Queen Mary University of London

Affiliations and memberships

Fellow, Institution of Engineering and Technology (IET)
Fellow, Institute of Electrical and Electronics Engineers (IEEE)

My publications


Orwell J, Plumbley MD (1999) Maximizing information about a noisy signal with a single non-linear neuron., NINTH INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS (ICANN99), VOLS 1 AND 2(470)pp. 581-586 INST ELECTRICAL ENGINEERS INSPEC INC
Gretsistas A, Plumbley MD (2012) Group Polytope Faces Pursuit for Recovery of Block-Sparse Signals., LVA/ICA7191pp. 255-262 Springer
O'Hanlon K, Nagano H, Plumbley MD (2012) Oracle analysis of sparse automatic music transcription,pp. 591-598
We have previously proposed a structured sparse approach to piano transcription with promising results recorded on a challenging dataset. The approach taken was measured in terms of both frame-based and onset-based metrics. Close inspection of the results revealed problems in capturing frames in which a given note has low energy, for example in sustained notes. Further problems were also noticed in the onset detection, where for many notes seen to be active in the output transcription an onset was not detected. A brief description of the approach is given here, and further analysis of the system is given by considering an oracle transcription, derived from the ground truth piano roll and the given dictionary of spectral template atoms, which gives a clearer indication of the problems which need to be overcome in order to improve the proposed approach.
Hedayioglu FDL, Jafari MG, Mattos SDS, Plumbley MD, Coimbra MT (2011) Separating sources from sequentially acquired mixtures of heart signals., ICASSPpp. 653-656 IEEE
Plumbley MD (2013) Hearing the shape of a room., Proc Natl Acad Sci U S A110(30)pp. 12162-12163
Nesbit A, Vincent E, Plumbley MD (2009) Extension of Sparse, Adaptive Signal Decompositions to Semi-blind Audio Source Separation., ICA5441pp. 605-612 Springer
Welburn SJ, Plumbley MD (2009) Estimating parameters from audio for an EG+LFO model of pitch envelopes, Proceedings of the 12th International Conference on Digital Audio Effects, DAFx 2009pp. 451-455
Envelope generator (EG) and Low Frequency Oscillator (LFO) parameters give a compact representation of audio pitch envelopes. By estimating these parameters from audio per-note, they could be used as part of an audio coding scheme. Recordings of various instruments and articulations were examined, and pitch envelopes found. Using an evolutionary algorithm, EG and LFO parameters for the envelopes were estimated. The resulting estimated envelopes are compared to both the original envelope, and to a fixed-pitch estimate. Envelopes estimated using EG+LFO can closely represent the envelope from the original audio and provide a more accurate estimate than the mean pitch.
Rencker L, Wang W, Plumbley MD (2017) A greedy algorithm with learned statistics for sparse signal reconstruction,ICASSP 2017 Proceedings IEEE
We address the problem of sparse signal reconstruction from a few noisy samples. Recently, a Covariance-Assisted Matching Pursuit (CAMP) algorithm has been proposed, improving the sparse coefficient update step of the classic Orthogonal Matching Pursuit (OMP) algorithm. CAMP allows the a-priori mean and covariance of the non-zero coefficients to be considered in the coefficient update step. In this paper, we analyze CAMP, which leads to a new interpretation of the update step as a maximum-a-posteriori (MAP) estimation of the non-zero coefficients at each step. We then propose to leverage this idea, by finding a MAP estimate of the sparse reconstruction problem, in a greedy OMP-like way. Our approach allows the statistical dependencies between sparse coefficients to be modelled, while keeping the practicality of OMP. Experiments show improved performance when reconstructing the signal from a few noisy samples.
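As a rough illustration of the MAP coefficient update described in this abstract, the following is a sketch under assumed notation, not necessarily the exact form used in the paper. Assume the observation model $y = A x + n$ with noise $n \sim \mathcal{N}(0, \sigma^2 I)$, a current support set $S$ with sub-dictionary $A_S$, and a Gaussian prior $x_S \sim \mathcal{N}(\mu_S, \Sigma_S)$ on the non-zero coefficients (all symbols here are assumptions introduced for illustration).

```latex
\hat{x}_S
  = \arg\min_{x_S} \; \frac{1}{\sigma^2}\,\lVert y - A_S x_S \rVert_2^2
      + (x_S - \mu_S)^{\top} \Sigma_S^{-1} (x_S - \mu_S)
  = \bigl( A_S^{\top} A_S + \sigma^2 \Sigma_S^{-1} \bigr)^{-1}
    \bigl( A_S^{\top} y + \sigma^2 \Sigma_S^{-1} \mu_S \bigr)
```

When the prior is uninformative ($\Sigma_S^{-1} \to 0$), this reduces to the least-squares update $\hat{x}_S = (A_S^{\top} A_S)^{-1} A_S^{\top} y$ used in standard OMP.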
We describe our method for automatic bird species classification, which uses raw audio without segmentation and without using any auxiliary metadata. It successfully classifies among 501 bird categories, and was by far the highest scoring audio-only bird recognition algorithm submitted to BirdCLEF 2014. Our method uses unsupervised feature learning, a technique which learns regularities in spectro-temporal content without reference to the training labels, which helps a classifier to generalise to further content of the same type. Our strongest submission uses two layers of feature learning to capture regularities at two different time scales.
Yong C, Lim C, Plumbley M, Beighton D, Davidson R (2002) Identification of dental bacteria using statistical and neural approaches,Proceedings of the 9th International Conference on Neural Information Processing (ICONIP '02)2pp. 606-610 IEEE
This paper is devoted to enhancing rapid decision-making and identification of lactobacilli from dental plaque using statistical and neural network methods. Current techniques of identification such as clustering and principal component analysis are discussed with respect to the field of bacterial taxonomy. Decision-making using multilayer perceptron neural network and Kohonen self-organizing feature map is highlighted. Simulation work and corresponding results are presented with main emphasis on neural network convergence and identification capability using resubstitution, leave-one-out and cross validation techniques. Rapid analyses on two separate sets of bacterial data from dental plaque revealed accuracy of more than 90% in the identification process. The risk of misdiagnosis was estimated at 14% worst case. Tests with unknown strains yield close correlation to cluster dendrograms. The use of the AXEON VindAX simulator indicated close correlations of the results. The paper concludes that artificial neural networks are suitable for use in the rapid identification of dental bacteria.
Badeau R, Plumbley MD (2014) Multichannel high-resolution NMF for modeling convolutive mixtures of non-stationary signals in the Time-Frequency domain,IEEE/ACM Transactions on Audio, Speech and Language Processing22(11)pp. 1670-1680
Several probabilistic models involving latent components have been proposed for modeling time-frequency (TF) representations of audio signals such as spectrograms, notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high-resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. The new model can represent a variety of stationary and non-stationary signals, including autoregressive moving average (ARMA) processes and mixtures of damped sinusoids. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to piano signals, and proves capable of accurately modeling reverberation, restoring missing observations, and separating pure tones with close frequencies.
Stowell D, Robertson A, Bryan-Kinns N, Plumbley MD (2009) Evaluation of live human-computer music-making: Quantitative and qualitative approaches., Int. J. Hum.-Comput. Stud.67(11)pp. 960-975
Nishimori Y, Akaho S, Plumbley MD (2006) Riemannian optimization method on generalized flag manifolds for complex and subspace ICA, AIP Conference Proceedings872pp. 89-96
In this paper we introduce a new class of manifolds, generalized flag manifolds, for the complex and subspace ICA problems. A generalized flag manifold is a manifold consisting of subspaces which are orthogonal to each other. The class of generalized flag manifolds includes the class of Grassmann manifolds. We extend the Riemannian optimization method to include this new class of manifolds by deriving the formulas for the natural gradient and geodesics on these manifolds. We show how the complex and subspace ICA problems can be solved by optimization of cost functions on a generalized flag manifold. Computer simulations demonstrate our algorithm gives good performance compared with the ordinary gradient descent method. © 2006 American Institute of Physics.
Robbins GE, Plumbley MD, Hughes JC, Fallside F, Prager RW (1993) Generation and Adaptation of Neural Networks by Evolutionary Techniques (GANNET).,Neural Computing and Applications1(1)pp. 23-31
Toyama K, Plumbley MD (2009) Using phase linearity in frequency-domain ICA to tackle the permutation problem., ICASSPpp. 3165-3168 IEEE
Plumbley MD (2007) Geometry and manifolds for independent component analysis, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings4
In the last few years, there has been a great interest in the use of geometrical methods for independent component analysis (ICA), both to gain insight into the optimization process and to develop more efficient optimization algorithms. Much of this work involves concepts from differential geometry, such as Lie groups, Stiefel manifolds, or tangent planes that may be unfamiliar to signal processing researchers. The purpose of this tutorial paper is to introduce some of these geometry concepts to signal processing and ICA researchers, without assuming any existing background in differential geometry. The emphasis of the paper is on making the important concepts in this field accessible, rather than mathematical rigour. © 2007 IEEE.
O'Hanlon K, Nagano H, Keriven N, Plumbley MD (2016) Non-Negative Group Sparsity with Subspace Note Modelling for Polyphonic Transcription,IEEE/ACM Transactions on Audio, Speech and Language Processing24(3)pp. 530-542 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Automatic music transcription (AMT) can be performed by deriving a pitch-time representation through decomposition of a spectrogram with a dictionary of pitch-labelled atoms. Typically, non-negative matrix factorisation (NMF) methods are used to decompose magnitude spectrograms. One atom is often used to represent each note. However, the spectrum of a note may change over time. Previous research considered this variability using different atoms to model specific parts of a note, or large dictionaries comprised of datapoints from the spectrograms of full notes. In this paper, the use of subspace modelling of note spectra is explored, with group sparsity employed as a means of coupling activations of related atoms into a pitched subspace. Stepwise and gradient-based methods for non-negative group sparse decompositions are proposed. Finally, a group sparse NMF approach is used to tune a generic harmonic subspace dictionary, leading to improved NMF-based AMT results.
Jafari MG, Plumbley MD (2011) Fast dictionary learning for sparse representations of speech signals,IEEE Journal on Selected Topics in Signal Processing5(5)pp. 1025-1031 IEEE
For dictionary-based decompositions of certain types, it has been observed that there might be a link between sparsity in the dictionary and sparsity in the decomposition. Sparsity in the dictionary has also been associated with the derivation of fast and efficient dictionary learning algorithms. Therefore, in this paper we present a greedy adaptive dictionary learning algorithm that sets out to find sparse atoms for speech signals. The algorithm learns the dictionary atoms on data frames taken from a speech signal. It iteratively extracts the data frame with minimum sparsity index, and adds this to the dictionary matrix. The contribution of this atom to the data frames is then removed, and the process is repeated. The algorithm is found to yield a sparse signal decomposition, supporting the hypothesis of a link between sparsity in the decomposition and dictionary. The algorithm is applied to the problem of speech representation and speech denoising, and its performance is compared to other existing methods. The method is shown to find dictionary atoms that are sparser than their time-domain waveform, and also to result in a sparser speech representation. In the presence of noise, the algorithm is found to have similar performance to the well established principal component analysis. © 2011 IEEE.
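The greedy loop described in this abstract (pick the frame with the minimum sparsity index, add it to the dictionary, remove its contribution, repeat) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the sparsity index is assumed here to be the ratio of the l1 norm to the l2 norm, and the function names and the `tol` threshold are hypothetical.

```python
import math


def sparsity_index(frame, tol=1e-9):
    """Assumed sparsity index: l1 norm / l2 norm (smaller means sparser).

    Frames whose energy is (numerically) exhausted get an infinite index
    so that they are never selected as atoms.
    """
    l2 = math.sqrt(sum(x * x for x in frame))
    if l2 <= tol:
        return float("inf")
    return sum(abs(x) for x in frame) / l2


def greedy_adaptive_dictionary(frames, n_atoms):
    """Learn up to n_atoms atoms from a list of equal-length data frames."""
    residual = [list(f) for f in frames]  # work on a copy of the data
    atoms = []
    for _ in range(n_atoms):
        # 1. Extract the residual frame with minimum sparsity index.
        best = min(residual, key=sparsity_index)
        if sparsity_index(best) == float("inf"):
            break  # nothing left to represent
        # 2. Normalise it and add it to the dictionary.
        norm = math.sqrt(sum(x * x for x in best))
        atom = [x / norm for x in best]
        atoms.append(atom)
        # 3. Remove the atom's contribution from every frame.
        for r in residual:
            coeff = sum(a * x for a, x in zip(atom, r))
            for i, a in enumerate(atom):
                r[i] -= coeff * a
    return atoms


# Toy data: three frames of length three.
toy_frames = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
toy_atoms = greedy_adaptive_dictionary(toy_frames, n_atoms=3)
```

Because each selected atom is projected out of all remaining frames, the atoms produced by this sketch are mutually orthogonal.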
Vincent E, Plumbley MD (2007) Fast factorization-based inference for bayesian harmonic models, Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, MLSP 2006pp. 117-122
Harmonic sinusoidal models are a fundamental tool for audio signal analysis. Bayesian harmonic models guarantee a good resynthesis quality and allow joint use of learnt parameter priors and auditory motivated distortion measures. However, inference algorithms based on Monte Carlo sampling are rather slow for realistic data. In this paper, we investigate fast inference algorithms based on approximate factorization of the joint posterior into a product of independent distributions on small subsets of parameters. We discuss the conditions under which these approximations hold true and evaluate their performance experimentally. We suggest how they could be used together with Monte Carlo algorithms for a faster sampling-based inference. © 2006 IEEE.
We introduce a free and open dataset of 7690 audio clips sampled from the field-recording tag in the Freesound audio archive. The dataset is designed for use in research related to data mining in audio archives of field recordings / soundscapes. Audio is standardised, and audio and metadata are Creative Commons licensed. We describe the data preparation process, characterise the dataset descriptively, and illustrate its use through an auto-tagging experiment.
Plumbley, Mark D (2007) Independent Component Analysis and Signal Separation, 7th International Conference, ICA 2007, London, UK, September 9-12, 2007., 4666 Springer
Font F, Brookes Tim, Fazekas G, Guerber M, La Burthe A, Plans D, Plumbley MD, Shaashua M, Wang W, Serra X (2016) Audio Commons: bringing Creative Commons audio content to the creative industries,AES E-Library Audio Engineering Society
Significant amounts of user-generated audio content, such as sound effects, musical samples and music pieces, are uploaded to online repositories and made available under open licenses. Moreover, a constantly increasing amount of multimedia content, originally released with traditional licenses, is becoming public domain as its license expires. Nevertheless, the creative industries are not yet using much of all this content in their media productions. There is still a lack of familiarity and understanding of the legal context of all this open content, but there are also problems related with its accessibility. A big percentage of this content remains unreachable either because it is not published online or because it is not well organised and annotated. In this paper we present the Audio Commons Initiative, which is aimed at promoting the use of open audio content and at developing technologies with which to support the ecosystem composed by content repositories, production tools and users. These technologies should enable the reuse of this audio material, facilitating its integration in the production workflows used by the creative industries. This is a position paper in which we describe the core ideas behind this initiative and outline the ways in which we plan to address the challenges it poses.
Nesbit A, Jafari MG, Vincent E, Plumbley MD (2010) Audio source separation using sparse representations, In: Machine Audition: Principles, Algorithms and Systemspp. 246-264
The authors address the problem of audio source separation, namely, the recovery of audio signals from recordings of mixtures of those signals. The sparse component analysis framework is a powerful method for achieving this. Sparse orthogonal transforms, in which only few transform coefficients differ significantly from zero, are developed; once the signal has been transformed, energy is apportioned from each transform coefficient to each estimated source, and, finally, the signal is reconstructed using the inverse transform. The overriding aim of this chapter is to demonstrate how this framework, as exemplified here by two different decomposition methods which adapt to the signal to represent it sparsely, can be used to solve different problems in different mixing scenarios. To address the instantaneous (neither delays nor echoes) and underdetermined (more sources than mixtures) mixing model, a lapped orthogonal transform is adapted to the signal by selecting a basis from a library of predetermined bases. This method is highly related to the windowing methods used in the MPEG audio coding framework. In considering the anechoic (delays but no echoes) and determined (equal number of sources and mixtures) mixing case, a greedy adaptive transform is used based on orthogonal basis functions that are learned from the observed data, instead of being selected from a predetermined library of bases. This is found to encode the signal characteristics, by introducing a feedback system between the bases and the observed data. Experiments on mixtures of speech and music signals demonstrate that these methods give good signal approximations and separation performance, and indicate promising directions for future research. © 2011, IGI Global.
Plumbley MD (1997) Communications and neural networks: Theory and practice, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings1pp. 135-138
In this paper we shall see that neural networks and communications are interlinked in a number of ways, towards the goal of efficient communication of information. One concrete example of this is the use of neural networks to ensure efficient use of communication channels, through connection admission control in ATM networks. In addition, however, efficient communication is also important within a decision making system such as a neural network. Finally we examine what type of neural network solutions are suggested by this approach.
Davies MEP, Plumbley MD (2007) Context-Dependent Beat Tracking of Musical Audio., IEEE Transactions on Audio, Speech & Language Processing15(3)pp. 1009-1020
Abdallah SM, Plumbley MD (2009) Information dynamics: patterns of expectation and surprise in the perception of music., Connect. Sci.21(2&3)pp. 89-117
Sturm BL, Mailhe B, Plumbley MD (2013) On Theorem 10 in "On Polar Polytopes and the Recovery of Sparse Representations" (vol 50, pg 2231, 2004), IEEE TRANSACTIONS ON INFORMATION THEORY59(8)pp. 5206-5209 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Plumbley MD, Fallside F (1991) The effect of receptor signal-to-noise levels on optimal filtering in a sensory system, Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing4pp. 2321-2324
Consideration is given to image filtering (temporal and spatial) in a neural system for transmitting images through a limited capacity channel, in the case of a noisy image at the receptors. The authors use an extension of Shannon's formula for the capacity of a Gaussian channel to determine the optimum filter to be used. For realistic image statistics, they show that the bandwidth of this filter is self-limiting, and it has a high frequency boost that disappears at low signal levels. This behavior is mirrored in biological retinas.
Toyama K, Plumbley MD (2009) Estimating Phase Linearity in the Frequency-Domain ICA Demixing Matrix., ICA5441pp. 362-370 Springer
Benetos E, Lafay G, Lagrange M, Plumbley MD (2016) Detection of overlapping acoustic events using a temporally-constrained probabilistic model,Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China, 20-25 March 2016pp. 6450-6454 IEEE
In this paper, a system for overlapping acoustic event detection is proposed, which models the temporal evolution of sound events. The system is based on probabilistic latent component analysis, supporting the use of a sound event dictionary where each exemplar consists of a succession of spectral templates. The temporal succession of the templates is controlled through event class-wise Hidden Markov Models (HMMs). As input time/frequency representation, the Equivalent Rectangular Bandwidth (ERB) spectrogram is used. Experiments are carried out on polyphonic datasets of office sounds generated using an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the proposed system outperforms several state-of-the-art methods for overlapping acoustic event detection on the same task, using both frame-based and event-based metrics, and is robust to varying event density and noise levels.
Abdallah SA, Plumbley MD (2013) Predictive Information in Gaussian Processes with Application to Music Analysis., GSI8085pp. 650-657 Springer
Abdallah SA, Plumbley MD (2006) Unsupervised analysis of polyphonic music by sparse coding., IEEE Trans Neural Netw17(1)pp. 179-196
We investigate a data-driven approach to the analysis and transcription of polyphonic music, using a probabilistic model which is able to find sparse linear decompositions of a sequence of short-term Fourier spectra. The resulting system represents each input spectrum as a weighted sum of a small number of "atomic" spectra chosen from a larger dictionary; this dictionary is, in turn, learned from the data in such a way as to represent the given training set in an (information theoretically) efficient way. When exposed to examples of polyphonic music, most of the dictionary elements take on the spectral characteristics of individual notes in the music, so that the sparse decomposition can be used to identify the notes in a polyphonic mixture. Our approach differs from other methods of polyphonic analysis based on spectral decomposition by combining all of the following: (a) a formulation in terms of an explicitly given probabilistic model, in which the process estimating which notes are present corresponds naturally with the inference of latent variables in the model; (b) a particularly simple generative model, motivated by very general considerations about efficient coding, that makes very few assumptions about the musical origins of the signals being processed; and (c) the ability to learn a dictionary of atomic spectra (most of which converge to harmonic spectral profiles associated with specific notes) from polyphonic examples alone; no separate training on monophonic examples is required.
Brossier P, Bello JP, Plumbley MD (2004) Real-time temporal segmentation of note objects in music signals,Proceedings of ICMC 2004, the 30th Annual International Computer Music Conference Michigan Publishing
Segmenting note objects in a real-time context is useful for live performances, audio broadcasting, or object-based coding. This temporal segmentation relies upon the correct detection of onsets and offsets of musical notes, an area of much research over recent years. However, the low-latency requirements of real-time systems impose new, tight constraints on this process. In this paper, we present a system for the segmentation of note objects with very short delays, using recent developments in onset detection, specially modified to work in a real-time context. A portable and open C implementation is presented.
Weyde T, Cottrell S, Benetos E, Wolff D, Tidhar D, Dykes J, Plumbley M, Dixon S, Barthet M, Gold N, Abdallah S, Mahey M (2014) Digital Music Lab: A Framework for Analysing Big Music Data,
Fritsch J, Ganseman J, Plumbley M (2012) A comparison of two different methods for score-informed source separation,Proc. 5th International Workshop on Machine Learning and Music (MML 2012)pp. 11-12
We present a new method for score-informed source separation, combining ideas from two previous approaches: one based on parametric modeling of the score which constrains the NMF updating process, the other based on PLCA that uses synthesized scores as prior probability distributions. We experimentally show improved separation results using the BSS EVAL and PEASS toolkits, and discuss strengths and weaknesses compared with the previous PLCA-based approach.
Jafari MG, Plumbley MD (2009) Speech denoising based on a greedy adaptive dictionary algorithm, European Signal Processing Conferencepp. 1423-1426
In this paper we consider the problem of speech denoising based on a greedy adaptive dictionary (GAD) algorithm. The transform is orthogonal by construction, and is found to give a sparse representation of the data being analysed, and to be robust to additive Gaussian noise. The performance of the algorithm is compared to that of the principal component analysis (PCA) method, for a speech denoising application. It is found that the GAD algorithm offers a sparser solution than PCA, while having a similar performance in the presence of noise. © EURASIP, 2009.
Adler A, Emiya V, Jafari MG, Elad M, Gribonval R, Plumbley MD (2011) A constrained matching pursuit approach to audio declipping., ICASSPpp. 329-332 IEEE
Degara N, Davies MEP, Pena A, Plumbley MD (2011) Onset Event Decoding Exploiting the Rhythmic Structure of Polyphonic Music., J. Sel. Topics Signal Processing5(6)pp. 1228-1239
Hamon R, Rencker L, Emiya V, Wang W, Plumbley MD (2017) Assessment of musical noise using localization of isolated peaks in time-frequency domain,ICASSP2017 Proceedings IEEE
Musical noise is a recurrent issue that appears in spectral techniques for denoising or blind source separation. Due to localised errors of estimation, isolated peaks may appear in the processed spectrograms, resulting in annoying tonal sounds after synthesis known as "musical noise". In this paper, we propose a method to assess the amount of musical noise in an audio signal, by characterising the impact of these artificial isolated peaks on the processed sound. It turns out that because of the constraints between STFT coefficients, the isolated peaks are described as time-frequency "spots" in the spectrogram of the processed audio signal. The quantification of these "spots", achieved through the adaptation of a method for localisation of significant STFT regions, allows for an evaluation of the amount of musical noise. We believe that this will pave the way to an objective measure and a better understanding of this phenomenon.
Plumbley MD (2004) Lie group methods for optimization with orthogonality constraints, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)3195pp. 1245-1252
Optimization of a cost function J(W) under an orthogonality constraint WW^T = I is a common requirement for ICA methods. In this paper, we will review the use of Lie group methods to perform this constrained optimization. Instead of searching in the space of n × n matrices W, we will introduce the concept of the Lie group SO(n) of orthogonal matrices, and the corresponding Lie algebra so(n). Using so(n) for our coordinates, we can multiplicatively update W by a rotation matrix R so that W2 = RW always remains orthogonal. Steepest descent and conjugate gradient algorithms can be used in this framework.
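The multiplicative rotation update described in this abstract might look like the sketch below. It is a minimal illustration under stated assumptions, not the paper's algorithm: the skew-symmetric direction B = G W^T - W G^T (with G the ordinary gradient of J with respect to W) is one common choice of so(n) coordinates, the toy cost J(W) = 0.5 * ||W - T||^2 and the step size are invented for the demo, and the matrix exponential is a simple truncated Taylor series with scaling and squaring.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]


def expm(B, terms=20, squarings=8):
    """Matrix exponential: truncated Taylor series with scaling and squaring."""
    scale = 2 ** squarings
    A = [[x / scale for x in row] for row in B]
    result = identity(len(B))
    term = identity(len(B))
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def lie_group_step(W, G, eta):
    """One steepest-descent step W <- R W, with R = exp(-eta * B) a rotation."""
    GWt = matmul(G, transpose(W))
    WGt = matmul(W, transpose(G))
    # B is skew-symmetric, i.e. an element of the Lie algebra so(n).
    B = [[g - w for g, w in zip(gr, wr)] for gr, wr in zip(GWt, WGt)]
    R = expm([[-eta * b for b in row] for row in B])
    return matmul(R, W)


# Toy problem (assumed for illustration): rotate W towards a target rotation T.
phi = 0.5
T = [[math.cos(phi), -math.sin(phi)], [math.sin(phi), math.cos(phi)]]
W = identity(2)
for _ in range(20):
    G = [[w - t for w, t in zip(wr, tr)] for wr, tr in zip(W, T)]  # gradient of J
    W = lie_group_step(W, G, eta=0.5)
```

Because every update multiplies W by the exponential of a skew-symmetric matrix, W stays orthogonal up to rounding error without any re-orthogonalisation step.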
Xu Y, Huang Q, Wang W, Jackson PJB, Plumbley MD (2016) Fully DNN-based Multi-label regression for audio tagging,Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)pp. 110-114 Tampere University of Technology
Acoustic event detection for content analysis in most cases relies on lots of labeled data. However, manually annotating data is a time-consuming task, which thus makes few annotated resources available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable to some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, the whole or almost whole frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can well map the audio features sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as the Dropout and background noise aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could well utilize the long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of DCASE 2016 challenge.
Qin Z, Gao Y, Plumbley MD, Parini CG (2016) Wideband Spectrum Sensing on Real-Time Signals at Sub-Nyquist Sampling Rates in Single and Cooperative Multiple Nodes, IEEE Transactions on Signal Processing 64(12) pp. 3106-3117 IEEE
Stark AM, Davies MEP, Plumbley MD (2009) Real-time beat-synchronous analysis of musical audio, Proceedings of the 12th International Conference on Digital Audio Effects, DAFx 2009 pp. 299-304
In this paper we present a model for beat-synchronous analysis of musical audio signals. Introducing a real-time beat tracking model with performance comparable to offline techniques, we discuss its application to the analysis of musical performances segmented by beat. We discuss the various design choices for beat-synchronous analysis and their implications for real-time implementations before presenting some beat-synchronous harmonic analysis examples. We make available our beat tracker and beat-synchronous analysis techniques as externals for Max/MSP.
Welburn SJ, Plumbley MD, Vincent E (2007) Object-Coding for Resolution-Free Musical Audio, Proceedings of the AES International Conference
Object-based coding of audio represents the signal as a parameter stream for a set of sound-producing objects. Encoding in this manner can provide a resolution-free representation of an audio signal. Given a robust estimation of the object-parameters and a multi-resolution synthesis engine, the signal can be "intelligently" upsampled, extending the bandwidth and getting best use out of a high-resolution signal-chain. We present some initial findings on extending bandwidth using harmonic models.
Stark AM, Davies MEP, Plumbley MD (2008) Rhythmic analysis for real-time audio effects, International Computer Music Conference, ICMC 2008 Michigan Publishing
We outline a set of audio effects that use rhythmic analysis, in particular the extraction of beat and tempo information, to automatically synchronise temporal parameters to the input signal. We demonstrate that this analysis, known as beat-tracking, can be used to create adaptive parameters that adjust themselves according to changes in the properties of the input signal. We present common audio effects such as delay, tremolo and auto-wah augmented in this fashion and discuss their real-time implementation as Audio Unit plug-ins and objects for Max/MSP.
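As a rough illustration of how a tempo estimate can drive effect parameters, the sketch below converts a tempo in beats per minute into a delay length and a tremolo rate. The function names and the choice of subdivisions are assumptions made for this example; the plug-ins described in the abstract may map tempo to parameters differently.

```python
def beat_period_seconds(bpm):
    """Duration of one beat in seconds for a tempo given in beats per minute."""
    return 60.0 / bpm

def synced_delay_samples(bpm, beats, sample_rate):
    """Delay length in samples spanning `beats` beats at the given tempo."""
    return round(beat_period_seconds(bpm) * beats * sample_rate)

def synced_tremolo_hz(bpm, cycles_per_beat):
    """Tremolo modulation rate in Hz for a number of cycles per beat."""
    return bpm / 60.0 * cycles_per_beat

# Hypothetical tracker output: 120 BPM, audio at 44.1 kHz
delay = synced_delay_samples(120.0, 0.5, 44100)   # eighth-note delay
rate = synced_tremolo_hz(120.0, 2)                # two tremolo cycles per beat
```

When the beat tracker reports a new tempo, recomputing these values keeps the effect aligned with the music without manual adjustment.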
Jafari MG, Plumbley MD (2007) Convolutive blind source separation of speech signals in the low frequency bands, Audio Engineering Society - 123rd Audio Engineering Society Convention 2007 3 pp. 1195-1198
Sub-band methods are often used to address the problem of convolutive blind speech separation, as they offer the computational advantage of approximating convolutions by multiplications. The computational load, however, often remains quite high, because separation is performed on several sub-bands. In this paper, we exploit the well known fact that the high frequency content of speech signals typically conveys little information, since most of the speech power is found in frequencies up to 4kHz, and consider separation only in frequency bands below a certain threshold. We investigate the effect of changing the threshold, and find that separation performed only in the low frequencies can lead to the recovered signals being similar in quality to those extracted from all frequencies.
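The saving from restricting separation to low-frequency sub-bands can be estimated with a small sketch. The sample rate, FFT size and cutoff below are assumed values chosen for illustration (the abstract only mentions that most speech power lies below 4 kHz), and `low_band_bins` is a hypothetical helper.

```python
def low_band_bins(sample_rate, fft_size, cutoff_hz):
    """Indices of one-sided FFT bins whose centre frequency is <= cutoff_hz."""
    n_bins = fft_size // 2 + 1            # bins from 0 Hz to Nyquist inclusive
    bin_hz = sample_rate / fft_size       # frequency spacing between bins
    return [k for k in range(n_bins) if k * bin_hz <= cutoff_hz]

# Assumed setup: 16 kHz speech, 512-point FFT, 4 kHz threshold
bins = low_band_bins(16000, 512, 4000.0)
fraction = len(bins) / (512 // 2 + 1)     # share of sub-bands actually separated
```

With these assumed settings roughly half of the sub-bands need a separation step, which gives a sense of the computational saving when the threshold is lowered.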
Giannoulis D, Benetos E, Klapuri A, Plumbley MD (2014) Improving instrument recognition in polyphonic music through system integration, pp. 5222-5226
A method is proposed for instrument recognition in polyphonic music which combines two independent detector systems: a polyphonic musical instrument recognition system using a missing feature approach, and an automatic music transcription system based on shift-invariant probabilistic latent component analysis that includes instrument assignment. We propose a method to integrate the two systems by fusing the instrument contributions estimated by the first system onto the transcription system in the form of Dirichlet priors. Both systems, as well as the integrated system, are evaluated using a dataset of continuous polyphonic music recordings. Detailed results that highlight a clear improvement in the performance of the integrated system are reported for different training conditions.
Vincent E, Jafari MG, Abdallah SA, Plumbley MD, Davies ME (2010) Probabilistic modeling paradigms for audio source separation, In: Wang W (eds.), Machine Audition: Principles, Algorithms and Systems pp. 162-185 IGI Global
Most sound scenes result from the superposition of several sources, which can be separately perceived and analyzed by human listeners. Source separation aims to provide machine listeners with similar skills by extracting the sounds of individual sources from a given scene. Existing separation systems operate either by emulating the human auditory system or by inferring the parameters of probabilistic sound models. In this chapter, the authors focus on the latter approach and provide a joint overview of established and recent models, including independent component analysis, local time-frequency models and spectral template-based models. They show that most models are instances of one of the following two general paradigms: linear modeling or variance modeling. They compare the merits of either paradigm and report objective performance figures. They conclude by discussing promising combinations of probabilistic priors and inference algorithms that could form the basis of future state-of-the-art systems.
Plumbley MD (1993) Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace, IEE Conference Publication (372) pp. 86-90
A number of recent papers have used the approach of maximizing information capacity or mutual information (MI) to examine unsupervised neural networks. In this paper we extend this work to develop an algorithm for the case of both input and output noise, with an output power constraint. We find that it is possible to simplify the obvious algorithm obtained by concatenating the two previous solutions.
Dong J, Wang W, Dai W, Plumbley MD, Han Z-F, Chambers J (2016) Analysis SimCO Algorithms for Sparse Analysis Model Based Dictionary Learning, IEEE Transactions on Signal Processing 64(2) pp. 417-431 IEEE
In this paper, we consider the dictionary learning problem for the sparse analysis model. A novel algorithm is proposed by adapting the simultaneous codeword optimization (SimCO) algorithm, based on the sparse synthesis model, to the sparse analysis model. This algorithm assumes that the analysis dictionary contains unit l2-norm atoms and learns the dictionary by optimization on manifolds. This framework allows multiple dictionary atoms to be updated simultaneously in each iteration. However, similar to several existing analysis dictionary learning algorithms, dictionaries learned by the proposed algorithm may contain similar atoms, leading to a degenerate (coherent) dictionary. To address this problem, we also consider restricting the coherence of the learned dictionary and propose Incoherent Analysis SimCO by introducing an atom decorrelation step following the update of the dictionary. We demonstrate the competitive performance of the proposed algorithms using experiments with synthetic data and image denoising as compared with existing algorithms.
Thiebaut J-B, Abdallah SA, Robertson A, Bryan-Kinns N, Plumbley MD (2008) Real Time Gesture Learning and Recognition: Towards Automatic Categorization, NIME pp. 215-218
Welburn SJ, Plumbley MD (2010) Improving the performance of pitch estimators, 128th Audio Engineering Society Convention 2010 2 pp. 1319-1332
We are looking to use pitch estimators to provide an accurate high-resolution pitch track for resynthesis of musical audio. We found that current evaluation measures such as gross error rate (GER) are not suitable for algorithm selection. In this paper we examine the issues relating to evaluating pitch estimators and use these insights to improve performance of existing algorithms such as the well-known YIN pitch estimation algorithm.
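To make the limitation of gross error rate concrete, here is a minimal sketch of one common formulation: the fraction of voiced frames whose estimate deviates from the reference by more than a relative tolerance. The 20% tolerance, the convention that a reference of 0 Hz marks an unvoiced frame, and the function names are assumptions for illustration; the paper may define the measure differently.

```python
import math

def gross_error_rate(ref_hz, est_hz, tol=0.2):
    """Fraction of voiced frames where |est - ref| exceeds tol * ref."""
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        raise ValueError("no voiced reference frames")
    errors = sum(1 for r, e in voiced if abs(e - r) > tol * r)
    return errors / len(voiced)

def error_cents(ref_hz, est_hz):
    """Signed pitch deviation in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(est_hz / ref_hz)

# Hypothetical frame-wise pitch tracks in Hz (0 marks an unvoiced frame)
ref = [220.0, 220.0, 440.0, 0.0, 330.0]
est = [221.0, 110.0, 452.0, 100.0, 330.5]

ger = gross_error_rate(ref, est)     # only the octave error at frame 2 counts
fine = error_cents(440.0, 452.0)     # deviation that GER treats as correct
```

In this toy example the estimate of 452 Hz against a 440 Hz reference is almost half a semitone out yet contributes nothing to the gross error rate, which illustrates why a coarse measure may not discriminate between estimators intended for high-resolution resynthesis.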
Plumbley MD (2003) Algorithms for nonnegative independent component analysis, IEEE Transactions on Neural Networks 14(3) pp. 534-543 IEEE
We consider the task of solving the independent component analysis (ICA) problem x = As given observations x, with a constraint of nonnegativity of the source random vector s. We refer to this as nonnegative independent component analysis and we consider methods for solving this task. For independent sources with nonzero probability density function (pdf) p(s) down to s = 0 it is sufficient to find the orthonormal rotation y = Wz of prewhitened sources z = Vx, which minimizes the mean squared error of the reconstruction of z from the rectified version y^+ of y. We suggest some algorithms which perform this, both based on a nonlinear principal component analysis (PCA) approach and on a geodesic search method driven by differential geometry considerations. We demonstrate the operation of these algorithms on an image separation problem, which shows in particular the fast convergence of the rotation and geodesic methods, and apply the approach to a musical audio analysis task.
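The reconstruction criterion in this abstract can be sketched for the two-dimensional case, where W is a plane rotation. The example assumes the cost is the mean of |z - W^T y^+|^2 with y = Wz and y^+ the element-wise rectification of y; the sample values and the mixing angle are made up, and prewhitening is taken as already done.

```python
import math

def rot(theta):
    """2 x 2 rotation matrix for angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def apply(M, v):
    """Multiply a 2 x 2 matrix by a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def transpose(M):
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]

def cost(W, Z):
    """Mean squared error of reconstructing z from the rectified y = W z."""
    Wt = transpose(W)
    total = 0.0
    for z in Z:
        y = apply(W, z)
        y_plus = [max(v, 0.0) for v in y]      # rectification y^+
        z_hat = apply(Wt, y_plus)              # reconstruction W^T y^+
        total += sum((a - b) ** 2 for a, b in zip(z, z_hat))
    return total / len(Z)

# Made-up nonnegative source samples, some lying on the axes (density down to 0)
S = [[0.0, 1.2], [0.7, 0.0], [1.5, 0.3], [0.2, 0.9], [0.0, 0.0], [2.0, 1.1]]
Q = rot(0.6)                                   # unknown rotation to be undone
Z = [apply(Q, s) for s in S]

J_true = cost(rot(-0.6), Z)    # W = Q^T recovers s >= 0, so the error vanishes
J_wrong = cost(rot(0.0), Z)    # W = I leaves negative outputs, so the error > 0
```

The cost is zero exactly when every output y is nonnegative, which is why minimizing it over rotations can recover the nonnegative sources.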
O'Hanlon K, Plumbley MD, Sandler M (2015) Non-negative Matrix Factorisation incorporating greedy Hellinger sparse coding applied to polyphonic music transcription, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015) pp. 2214-2218
Non-negative Matrix Factorisation (NMF) is a commonly used tool in many musical signal processing tasks, including Automatic Music Transcription (AMT). However unsupervised NMF is seen to be problematic in this context, and harmonically constrained variants of NMF have been proposed. While useful, the harmonic constraints may be constrictive in mixed signals. We have previously observed that recovery of overlapping signal elements using NMF is improved through introduction of a sparse coding step, and propose here the incorporation of a sparse coding step using the Hellinger distance into a NMF algorithm. Improved AMT results for unsupervised NMF are reported.
Ekeus H, Abdallah S, Plumbley M, McOwan P (2012) The Melody Triangle: Exploring Pattern and Predictability in Music, Musical Metacreation: Papers from the 2012 AIIDE Workshop pp. 35-42 AAAI
The Melody Triangle is an interface for the discovery of melodic materials, where the input (positions within a triangle) directly maps to information theoretic properties of the output. A model of human expectation and surprise in the perception of music, information dynamics, is used to 'map out' a musical generative system's parameter space. This enables a user to explore the possibilities afforded by a generative algorithm, in this case Markov chains, not by directly selecting parameters, but by specifying the subjective predictability of the output sequence. We describe some of the relevant ideas from information dynamics and how the Melody Triangle is defined in terms of these. We describe its incarnation as a screen based performance tool and compositional aid for the generation of musical textures; the user's control at the abstract level of randomness and predictability, and some pilot studies carried out with it. We also briefly outline a multi-user installation, where collaboration in a performative setting provides a playful yet informative way to explore expectation and surprise in music, and a forthcoming mobile phone version of the Melody Triangle.
Stowell D, Plumbley MD (2014) Robust bird species recognition: making it work for dawn chorus audio archives, pp. 94-94
The recent (2013) bird species recognition challenges organised by the SABIOD project attracted some strong performances from automatic classifiers applied to short audio excerpts from passive acoustic monitoring stations. Can such strong results be achieved for dawn chorus field recordings in audio archives? The question is important because archives (such as the British Library Sound Archive) hold thousands of such recordings, covering many decades and many countries, but they are mostly unlabelled. Automatic labelling holds the potential to unlock their value to ecological studies. Audio in such archives is quite different from passive acoustic monitoring data: importantly, the recording conditions vary randomly (and are usually unknown), making the scenario a "cross-condition" rather than "single-condition" train/test task. Dawn chorus recordings are generally long, and the annotations often indicate which birds are in a 20-minute recording but not within which 5-second segments they are active. Further, the amount of annotation available is very small. We report on experiments to evaluate a variety of classifier configurations for automatic multilabel species annotation in dawn chorus archive recordings. The audio data is an order of magnitude larger than the SABIOD challenges, but the ground-truth data is an order of magnitude smaller. We report some surprising findings, including clear variation in the benefits of some analysis choices (audio features, pooling techniques, noise-robustness techniques) as we move to handle the specific multi-condition case relevant for audio archives.
Robertson A, Plumbley MD, Bryan-Kinns N (2008) A Turing Test for B-Keeper: Evaluating an Interactive Real-Time Beat-Tracker, Proceedings of the 8th International Conference on New Interfaces for Musical Expression (NIME 2008)pp. 319-324
Techniques based on non-negative matrix factorization (NMF) have been successfully used to decompose a spectrogram of a music recording into a dictionary of templates and activations. While advanced NMF variants often yield robust signal models, there are usually some inaccuracies in the factorization since the underlying methods are not prepared for phase cancellations that occur when sounds with similar frequency are mixed. In this paper, we present a novel method that takes phase cancellations into account to refine dictionaries learned by NMF-based methods. Our approach exploits the fact that advanced NMF methods are often robust enough to provide information about how sound sources interact in a spectrogram, where they overlap, and thus where phase cancellations could occur. Using this information, the distances used in NMF are weighted entry-wise to attenuate the influence of regions with phase cancellations. Experiments on full-length, polyphonic piano recordings indicate that our method can be successfully used to refine NMF-based dictionaries.
Plumbley MD (2004) Optimization using Fourier expansion over a geodesic for non-negative ICA, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3195 pp. 49-56
We propose a new algorithm for the non-negative ICA problem, based on the rotational nature of optimization over a set of square orthogonal (orthonormal) matrices W, i.e. where W^T W = W W^T = I_n. Using a truncated Fourier expansion of J(t), we obtain a Newton-like update step along the steepest-descent geodesic, which automatically approximates to a usual (Taylor expansion) Newton update step near to a minimum. Experiments confirm that this algorithm is effective, and it compares favourably with existing non-negative ICA algorithms. We suggest that this approach could be modified for other algorithms, such as the normal ICA task. © Springer-Verlag 2004.
Jafari MG, Plumbley MD (2010) A doubly sparse greedy adaptive dictionary learning algorithm for music and large-scale data, 128th Audio Engineering Society Convention 2010 2 pp. 940-945
We consider the extension of the greedy adaptive dictionary learning algorithm that we introduced previously, to applications other than speech signals. The algorithm learns a dictionary of sparse atoms, while yielding a sparse representation for the speech signals. We investigate its behavior in the analysis of music signals, and propose a different dictionary learning approach that can be applied to large data sets. This facilitates the application of the algorithm to problems that generate large amounts of data, such as multimedia or multi-channel application areas.
Ewert S, Pardo B, Mueller M, Plumbley MD (2014) Score-Informed Source Separation for Musical Audio Recordings [An overview], IEEE Signal Processing Magazine 31(3) pp. 116-124 IEEE
Plumbley MD, Fallside F (1989) Sensory adaptation: An information-theoretic viewpoint, IJCNN Int Jt Conf Neural Network
Summary form only given. The authors examine the goals of early stages of a perceptual system, before the signal reaches the cortex, and describe its operation in information-theoretic terms. The effects of receptor adaptation, lateral inhibition, and decorrelation can all be seen as part of an optimization of information throughput, given that available resources such as average power and maximum firing rates are limited. The authors suggest a modification to Gabor functions which improves their performance as band-pass filters.
Welburn SJ, Plumbley MD (2009) Rendering audio using expressive MIDI, 127th Audio Engineering Society Convention 2009 1 pp. 176-184
MIDI renderings of audio are traditionally regarded as lifeless and unnatural - lacking in expression. However, MIDI is simply a protocol for controlling a synthesizer. Lack of expression is caused by either an expressionless synthesizer or by the difficulty in setting the MIDI parameters to provide expressive output. We have developed a system to construct an expressive MIDI representation of an audio signal, i.e. an audio representation which uses tailored pitch variations in addition to the note base pitch parameters which audio-to-MIDI systems usually attempt. A pitch envelope is estimated from the original audio, and a genetic algorithm is then used to estimate pitch modulator parameters from that envelope. These pitch modulations are encoded in a MIDI file and rerendered using a sampler. We present some initial comparisons between the final output audio and the estimated pitch envelopes.
Davies MEP, Degara N, Plumbley MD (2011) Measuring the Performance of Beat Tracking Algorithms Using a Beat Error Histogram, IEEE Signal Process. Lett. 18(3) pp. 157-160
Jafari MG, Abdallah SA, Plumbley MD, Davies ME (2006) Sparse Coding for Convolutive Blind Audio Source Separation, ICA 3889 pp. 132-139 Springer
Kachkaev A, Wolff D, Barthet M, Plumbley MD, Dykes J, Weyde T (2014) Visualising chord progressions in music collections: a big data approach,
In the Digital Music Lab project we work on the automatic analysis of large audio databases that results in rich annotations for large corpora of music. The musicological interpretation of this data from thousands of pieces is a challenging task that can benefit greatly from specifically designed interactive visualisation. Most existing big music data visualisation focuses on cultural attributes, mood, or listener behaviour. In this ongoing work we explore chord sequence patterns extracted by sequential pattern mining of more than one million tracks from the I Like Music commercial music collection. We present here several new visual representations that summarise chord patterns according to chord types, chroma, pattern structure and support, enabling musicologists to develop and answer questions about chord patterns in music collections. Our visualisations represent root movement and chord qualities mostly in a geometrical way and use colour to represent pattern support. We use two individually configurable views in parallel to encourage comparisons, either between different representations of one corpus, highlighting complementary musical aspects, or between different datasets, here representing different genres. We adapt several visualisation techniques to chord pattern sets using some novel layouts to support musicologists with their exploration and interpretation of the corpora. We found that differences between chord patterns of different genres, e.g. Rock & Roll vs. Jazz, are visible and can be used to generate hypotheses for the study of individual pieces, further statistical investigations or new data processing and visualisation. Our designs will be adapted as user needs are established through ongoing work. Means of aggregating, focusing and filtering by selected characteristics (such as key, melodic patterns etc.) will be added as we develop our design for the visualisation of chord patterns in close collaboration with musicologists.
The visualisations are available as a web application at
Degara N, Pena A, Davies MEP, Plumbley MD (2010) Note onset detection using rhythmic structure, ICASSP pp. 5526-5529 IEEE
Baume C, Plumbley MD, Calic J (2015) Use of audio editors in radio production,
Audio editing is performed at scale in the production of radio, but often the tools used are poorly targeted toward the task at hand. There are a number of audio analysis techniques that have the potential to aid radio producers, but without a detailed understanding of their process and requirements, it can be difficult to apply these methods. To aid this understanding, a study of radio production practice was conducted on three varied case studies: a news bulletin, drama, and documentary. It examined the audio/metadata workflow, the roles and motivations of the producers, and environmental factors. The study found that producers prefer to interact with higher-level representations of audio content like transcripts and enjoy working on paper. The study also identified opportunities to improve the workflow with tools that link audio to text, highlight repetitions, compare takes, and segment speakers.
Barchiesi D, Plumbley MD (2015) Learning incoherent subspaces: Classification via incoherent dictionary learning, Journal of Signal Processing Systems 79(2) pp. 189-199 Springer
In this article we present the supervised iterative projections and rotations (S-IPR) algorithm, a method for learning discriminative incoherent subspaces from data. We derive S-IPR as a supervised extension of our previously proposed iterative projections and rotations (IPR) algorithm for incoherent dictionary learning, and we employ it to learn incoherent sub-spaces that model signals belonging to different classes. We test our method as a feature transform for supervised classification, first by visualising transformed features from a synthetic dataset and from the "iris" dataset, then by using the resulting features in a classification experiment.
Nishimori Y, Akaho S, Plumbley MD (2006) Riemannian Optimization Method on the Flag Manifold for Independent Subspace Analysis, ICA 3889 pp. 295-302 Springer
Jafari MG, Plumbley MD (2008) An adaptive orthogonal sparsifying transform for speech signals, 2008 3rd International Symposium on Communications, Control, and Signal Processing, ISCCSP 2008 pp. 786-790
In this paper we consider the problem of representing a speech signal with an adaptive transform that captures the main features of the data. The transform is orthogonal by construction, and is found to give a sparse representation of the data being analysed. The orthogonality property implies that evaluation of both the forward and inverse transform involve a simple matrix multiplication. The proposed dictionary learning algorithm is compared to the K singular value decomposition (K-SVD) method, which is found to yield very sparse representations, at the cost of a high approximation error. The proposed algorithm is shown to have a much lower computational complexity than K-SVD, while the resulting signal representation remains relatively sparse. ©2008 IEEE.
Stowell D, Plumbley MD (2008) Robustness and independence of voice timbre features under live performance acoustic degradations, Proceedings - 11th International Conference on Digital Audio Effects, DAFx 2008 pp. 325-332
Live performance situations can lead to degradations in the vocal signal from a typical microphone, such as ambient noise or echoes due to feedback. We investigate the robustness of continuous-valued timbre features measured on vocal signals (speech, singing, beatboxing) under simulated degradations. We also consider nonparametric dependencies between features, using information theoretic measures and a feature-selection algorithm. We discuss how robustness and independence issues reflect on the choice of acoustic features for use in constructing a continuous-valued vocal timbre space. While some measures (notably spectral crest factors) emerge as good candidates for such a task, others are poor, and some features such as ZCR exhibit an interaction with the type of voice signal being analysed.
Plumbley MD (2007) Dictionary Learning for L1-Exact Sparse Coding, ICA 4666 pp. 406-413 Springer
Jafari MG, Plumbley MD, Davies ME (2008) Speech separation using an adaptive sparse dictionary algorithm, 2008 Hands-free Speech Communication and Microphone Arrays, Proceedings, HSCMA 2008 pp. 25-28
We present a greedy adaptive algorithm that builds a sparse orthogonal dictionary from the observed data. In this paper, the algorithm is used to separate stereo speech signals, and the phase information that is inherent to the extracted atom pairs is used for clustering and identification of the original sources. The performance of the algorithm is compared with that of the adaptive stereo basis algorithm, when the sources are mixed in echoic and anechoic environments. We find that the algorithm correctly separates the sources, and can do this even with a relatively small number of atoms. ©2008 IEEE.
Laurberg H, Christensen MG, Plumbley MD, Hansen LK, Jensen SH (2008) Theorems on positive data: on the uniqueness of NMF, Comput Intell Neurosci
We investigate the conditions for which nonnegative matrix factorization (NMF) is unique and introduce several theorems which can determine whether the decomposition is in fact unique or not. The theorems are illustrated by several examples showing the use of the theorems and their limitations. We have shown that corruption of a unique NMF matrix by additive noise leads to a noisy estimation of the noise-free unique solution. Finally, we use a stochastic view of NMF to analyze which characterization of the underlying model will result in an NMF with small estimation errors.
Murray-Browne T, Mainstone D, Bryan-Kinns N, Plumbley MD (2010) The Serendiptichord: A wearable instrument for contemporary dance performance, 128th Audio Engineering Society Convention 2010 3 pp. 1547-1554
We describe a novel musical instrument designed for use in contemporary dance performance. This instrument, the Serendiptichord, takes the form of a headpiece plus associated pods which sense movements of the dancer, together with associated audio processing software driven by the sensors. Movements such as translating the pods or shaking the trunk of the headpiece cause selection and modification of sampled sounds. We discuss how we have closely integrated physical form, sensor choice and positioning and software to avoid issues which otherwise arise with disconnection of the innate physical link between action and sound, leading to an instrument that non-musicians (in this case, dancers) are able to enjoy using immediately.
Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, Haddock SH, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P (2014) Best practices for scientific computing, PLoS Biol 12(1)
Plumbley MD, Abdallah SA, Bello JP, Davies ME, Monti G, Sandler MB (2002) Automatic music transcription and audio source separation, Cybernetics and Systems 33(6) pp. 603-627 Taylor & Francis
Badeau R, Plumbley MD (2013) Multichannel HR-NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Several probabilistic models involving latent components have been proposed for modelling time-frequency (TF) representations of audio signals (such as spectrograms), notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to a stereophonic piano signal, and proves capable of accurately modelling reverberation and restoring missing observations. © 2013 IEEE.
Brossier P, Bello JP, Plumbley MD (2004) Fast labelling of notes in music signals, ISMIR
Taylor John G., Plumbley Mark D. (1993) Information Theory and Neural Networks, In: Mathematical Approaches to Neural Networks 51 pp. 307-340 Elsevier Science Publishers
This chapter discusses the role of information theory for analysis of neural networks using differential geometric ideas. Information theory is useful for understanding preprocessing, in terms of predictive coding in the retina and principal component analysis and decorrelation processing in early visual cortex. The chapter introduces some concepts from information theory, focusing in particular on the entropy of a random variable and the mutual information between two random variables. One of the major uses for information theory has been in interpretation and guidance for unsupervised neural networks: networks that are not provided with a teacher or target output that they are to emulate. The chapter describes how information relates to the more familiar supervised learning schemes, and discusses the use of error back propagation (BackProp) to minimize mean squared error (MSE) in a multi-layer perceptron (MLP). Other distortion measures are possible in place of MSE; in particular, the chapter focuses on the information theoretic cross-entropy distortion.
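The quantities named in this abstract have standard textbook definitions for discrete distributions, sketched below. The function names and the example distributions are illustrative and are not taken from the chapter.

```python
import math

def entropy(p):
    """Shannon entropy H(X) in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(p, q):
    """Cross-entropy in bits of a model q measured against a target p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table."""
    px = [sum(row) for row in joint]           # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]     # marginal of Y (columns)
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

h = entropy([0.5, 0.5])                                  # one fair binary choice
mi_copy = mutual_information([[0.5, 0.0], [0.0, 0.5]])   # Y is a copy of X
mi_indep = mutual_information([[0.25, 0.25], [0.25, 0.25]])
ce = cross_entropy([0.5, 0.5], [0.25, 0.75])             # mismatched model
```

Cross-entropy is never smaller than the entropy of the target distribution, with equality only when the model matches the target, which is what makes it usable as a distortion measure in place of mean squared error.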
Vincent E, Plumbley MD (2007) Low Bit-Rate Object Coding of Musical Audio Using Bayesian Harmonic Models, IEEE Transactions on Audio, Speech & Language Processing 15(4) pp. 1273-1282
Plumbley MD (2005) On polar polytopes and the recovery of sparse representations, IEEE Transactions on Information Theory 53(9) pp. 3188-3195
Suppose we have a signal y which we wish to represent using a linear combination of a number of basis atoms a_i, y = Σ_i x_i a_i = Ax. The problem of finding the minimum l0 norm representation for y is a hard problem. The basis pursuit (BP) approach proposes to find the minimum l1 norm representation instead, which corresponds to a linear program (LP) that can be solved using modern LP techniques, and several recent authors have given conditions for the BP (minimum l1 norm) and sparse (minimum l0 norm) representations to be identical. In this paper, we explore this sparse representation problem using the geometry of convex polytopes, as recently introduced into the field by Donoho. By considering the dual LP we find that the so-called polar polytope P* of the centrally symmetric polytope P whose vertices are the atom pairs ±a_i is particularly helpful in providing us with geometrical insight into optimality conditions given by Fuchs and Tropp for non-unit-norm atom sets. In exploring this geometry, we are able to tighten some of these earlier results, showing for example that a condition due to Fuchs is both necessary and sufficient for l1-unique-optimality, and there are cases where orthogonal matching pursuit (OMP) can eventually find all l1-unique-optimal solutions with m nonzeros even if the exact recovery condition (ERC) fails for m. © 2007 IEEE.
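For orientation, the linear program behind basis pursuit and its dual can be written in a standard textbook form. The splitting x = u - v and the dual variable c are notation introduced here for illustration and need not match the paper's own symbols.

```latex
% Basis pursuit: minimum l1-norm representation of y
\min_{x} \; \|x\|_1 \quad \text{subject to} \quad Ax = y

% Equivalent linear program, using x = u - v with u, v \ge 0
\min_{u, v} \; \mathbf{1}^T (u + v)
\quad \text{subject to} \quad
\begin{bmatrix} A & -A \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix} = y,
\qquad u \ge 0, \; v \ge 0

% Dual linear program
\max_{c} \; y^T c
\quad \text{subject to} \quad
|a_i^T c| \le 1 \;\; \text{for all } i

% The dual feasible set is the polar of P = \mathrm{conv}\{\pm a_i\}
P^* = \{\, c : |a_i^T c| \le 1 \;\; \text{for all } i \,\}
```

Written this way, the dual constraints describe exactly the polar polytope of the polytope whose vertices are the atom pairs, which is the geometric object the abstract refers to.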
Ekeus H, Abdallah SA, Mcowan PW, Plumbley MD (2013) How Predictable Do We Like Our Music? Eliciting Aesthetic Preferences With The Melody Triangle Mobile App, pp. 80-85 Logos Verlag Berlin
The Melody Triangle is a smartphone application for Android that lets users easily create musical patterns and textures. The user creates melodies by specifying positions within a triangle, and these positions correspond to the information theoretic properties of generated musical sequences. A model of human expectation and surprise in the perception of music, information dynamics, is used to 'map out' a musical generative system's parameter space, in this case Markov chains. This enables a user to explore the possibilities afforded by Markov chains, not by directly selecting their parameters, but by specifying the subjective predictability of the output sequence. As users of the app find melodies and patterns they like, they are encouraged to press a 'like' button, which uploads their settings to our servers for analysis. Collecting the 'liked' settings of many users worldwide will allow us to elicit trends and commonalities in aesthetic preferences across users of the app, and to investigate how these might relate to the information-dynamic model of human expectation and surprise. We outline some of the relevant ideas from information dynamics and how the Melody Triangle is defined in terms of these. We then describe the Melody Triangle mobile application, how it is being used to collect research data and how the collected data will be evaluated.
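As a rough illustration of how the predictability of a Markov chain can be quantified, the sketch below computes the entropy rate of a chain from its transition matrix. The abstract does not state which information-theoretic measures the app uses, so the choice of entropy rate, the function names and the power-iteration approach are all assumptions made for this example. It also assumes the chain settles to a stationary distribution when started from the uniform distribution.

```python
import math


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic
    matrix P (list of rows) by repeated multiplication (power iteration).
    Assumption: the iteration converges from a uniform start."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate(P):
    """Entropy rate in bits per symbol:
    H = -sum_i pi_i sum_j P_ij log2 P_ij."""
    pi = stationary_distribution(P)
    h = 0.0
    for i, row in enumerate(P):
        for p in row:
            if p > 0:
                h -= pi[i] * p * math.log2(p)
    return h


# A chain that picks the next note uniformly at random is maximally
# unpredictable; a fixed cycle of notes is perfectly predictable.
random_chain = [[0.5, 0.5], [0.5, 0.5]]
cycle_chain = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(entropy_rate(random_chain), entropy_rate(cycle_chain))
```

A low entropy rate corresponds to a highly predictable sequence, and a high entropy rate to an unpredictable one.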
Stowell D, Plumbley MD (2010) Delayed decision-making in real-time beatbox percussion classification, Journal of New Music Research39(3)pp. 203-213
Real-time classification applied to a vocal percussion signal holds potential as an interface for live musical control. In this article we propose a novel approach to resolving the tension between the needs for low-latency reaction and reliable classification, by deferring the final classification decision until after a response has been initiated. We introduce a new dataset of annotated human beatbox recordings, and use it to study the optimal delay for classification accuracy. We then investigate the effect of such delayed decision-making on the quality of the audio output of a typical reactive system, via a MUSHRA-type listening test. Our results show that the effect depends on the output audio type: for popular dance/pop drum sounds the acceptable delay is on the order of 12-35 ms. © 2010 Taylor & Francis.
Nesbit A, Davies M, Plumbley M, Sandler M (2006) Source extraction from two-channel mixtures by joint cosine packet analysis, European Signal Processing Conference
This paper describes novel, computationally efficient approaches to source separation of underdetermined instantaneous two-channel mixtures. A best basis algorithm is applied to trees of local cosine bases to determine a sparse transform. We assume that the mixing parameters are known and focus on demixing sources by binary time-frequency masking. We describe a method for deriving a best local cosine basis from the mixtures by minimising an l1 norm cost function. This basis is adapted to the input of the masking process. Then, we investigate how to increase sparsity by adapting local cosine bases to the expected output of a single source instead of to the input mixtures. The heuristically derived cost function maximises the energy of the transform coefficients associated with a particular direction. Experiments on a mixture of four musical instruments are performed, and results are compared. It is shown that local cosine bases can give better results than fixed-basis representations.
Stark AM, Plumbley MD, Davies MEP (2007) Audio effects for real-time performance using beat tracking, Audio Engineering Society - 122nd Audio Engineering Society Convention 20072pp. 866-872
We present a new class of digital audio effects which can automatically relate parameter values to the tempo of a musical input in real-time. Using a beat tracking system as the front end, we demonstrate a tempo-dependent delay effect and a set of beat-synchronous low frequency oscillator (LFO) effects including auto-wah, tremolo and vibrato. The effects show better performance than might be expected as they are blind to certain beat tracker errors. All effects are implemented as VST plug-ins which operate in real-time, enabling their use both in live musical performance and the off-line modification of studio recordings.
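The basic arithmetic behind a tempo-dependent effect can be sketched as follows. This is only an illustration of how a tempo estimate might be mapped to a delay length or an LFO rate; the function names, the default sample rate and the idea of expressing settings as fractions of a beat are assumptions for this example and are not taken from the paper.

```python
def beat_period_seconds(tempo_bpm):
    """Duration of one beat in seconds for a tempo in beats per minute."""
    if tempo_bpm <= 0:
        raise ValueError("tempo must be positive")
    return 60.0 / tempo_bpm


def delay_in_samples(tempo_bpm, beats=0.5, sample_rate=44100):
    """Delay length in samples for a delay lasting `beats` beats
    (hypothetical parameterisation: 0.5 means half a beat)."""
    return round(beat_period_seconds(tempo_bpm) * beats * sample_rate)


def lfo_frequency_hz(tempo_bpm, cycles_per_beat=1.0):
    """Frequency of a beat-synchronous LFO completing
    `cycles_per_beat` cycles in every beat."""
    return cycles_per_beat * tempo_bpm / 60.0


# At 120 BPM a beat lasts 0.5 s, so a half-beat delay is 0.25 s.
print(delay_in_samples(120, beats=0.5), lfo_frequency_hz(120, cycles_per_beat=2))
```

In a real-time setting these values would be recomputed whenever the beat tracker updates its tempo estimate.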
O'Hanlon K, Plumbley MD (2011) Structure-aware dictionary learning with harmonic atoms, European Signal Processing Conferencepp. 1761-1765
Non-negative blind signal decomposition methods are widely used for musical signal processing tasks, such as automatic transcription and source separation. A spectrogram can be decomposed into a dictionary of full spectrum basis atoms and their corresponding time activation vectors using methods such as Non-negative Matrix Factorisation (NMF) and Non-negative K-SVD (NN-K-SVD). These methods are constrained by their learning order and problems posed by overlapping sources in the time and frequency domains of the source spectrogram. We consider that it may be possible to improve on current results by providing prior knowledge on the number of sources in a given spectrogram and on the individual structure of the basis atoms, an approach we refer to as structure-aware dictionary learning. In this work we consider dictionary recoverability of harmonic atoms, as harmonicity is a common structure in music signals. We present results showing improvements in recoverability using structure-aware decomposition methods, based on NN-K-SVD and NMF. Finally we propose an alternative structure-aware dictionary learning algorithm incorporating the advantages of NMF and NN-K-SVD. © EURASIP, 2011.
Stark AM, Plumbley MD (2009) Real-time chord recognition for live performance,Proceedings of the 2009 International Computer Music Conference, ICMC 2009pp. 85-88 Michigan Publishing
This paper describes work aimed at creating an efficient, real-time, robust and high performance chord recognition system for use on a single instrument in a live performance context. An improved chroma calculation method is combined with a classification technique based on masking out expected note positions in the chromagram and minimising the residual energy. We demonstrate that our approach can be used to classify a wide range of chords, in real-time, on a frame by frame basis. We present these analysis techniques as externals for Max/MSP. © July 2009- All copyright remains with the individual authors.
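The classification idea described above (mask out the notes a chord is expected to contain, then pick the chord that leaves the least residual energy) can be sketched for a single 12-bin chroma frame. This is a simplified illustration, not the paper's system: it assumes only major and minor triads, uses squared chroma values as energy, and all names are hypothetical. Because every triad masks the same number of bins, no normalisation across chord types is applied here.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Semitone intervals above the root for each chord type (assumed set).
CHORD_INTERVALS = {"major": (0, 4, 7), "minor": (0, 3, 7)}


def residual_energy(chroma, root, intervals):
    """Energy left in the chroma frame after masking out the chord's notes."""
    chord_bins = {(root + step) % 12 for step in intervals}
    return sum(value * value
               for bin_index, value in enumerate(chroma)
               if bin_index not in chord_bins)


def classify_chord(chroma):
    """Return (root name, chord type) with the smallest residual energy."""
    if len(chroma) != 12:
        raise ValueError("expected a 12-bin chroma vector")
    best = None
    for quality, intervals in CHORD_INTERVALS.items():
        for root in range(12):
            energy = residual_energy(chroma, root, intervals)
            if best is None or energy < best[0]:
                best = (energy, NOTE_NAMES[root], quality)
    return best[1], best[2]


# Energy concentrated on C, E and G, with a little leakage elsewhere.
frame = [1.0, 0.05, 0.0, 0.0, 0.8, 0.0, 0.0, 0.9, 0.0, 0.1, 0.0, 0.0]
print(classify_chord(frame))
```

Running this once per chroma frame gives a frame-by-frame chord estimate.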
Abdallah SA, Plumbley MD (2012) A measure of statistical complexity based on predictive information with application to finite spin systems, Physics Letters, Section A: General, Atomic and Solid State Physics376(4)pp. 275-281
We propose the binding information as an information theoretic measure of complexity between multiple random variables, such as those found in the Ising or Potts models of interacting spins, and compare it with several previously proposed measures of statistical complexity, including excess entropy, Bialek et al.'s predictive information, and the multi-information. We discuss and prove some of the properties of binding information, particularly in relation to multi-information and entropy, and show that, in the case of binary random variables, the processes which maximise binding information are the 'parity' processes. The computation of binding information is demonstrated on Ising models of finite spin systems, showing that various upper and lower bounds are respected and also that there is a strong relationship between the introduction of high-order interactions and an increase of binding information. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity. © 2011 Elsevier B.V. All rights reserved.
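For small sets of discrete variables the binding information can be computed directly from a joint distribution. The sketch below assumes the definition B(X) = H(X) - sum_i H(X_i | all other variables), which is not spelled out in the abstract, and represents the joint distribution as a dictionary from outcome tuples to probabilities. The function names and data layout are choices made for this example only.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(dist, keep):
    """Marginal distribution over the variable positions listed in `keep`."""
    out = defaultdict(float)
    for outcome, p in dist.items():
        out[tuple(outcome[i] for i in keep)] += p
    return dict(out)


def binding_information(dist):
    """B = H(X) - sum_i H(X_i | rest), using
    H(X_i | rest) = H(X) - H(rest)."""
    n = len(next(iter(dist)))
    h_joint = entropy(dist)
    total = h_joint
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total -= h_joint - entropy(marginal(dist, rest))
    return total


# Three binary variables constrained to even parity, versus three
# independent fair bits.
parity = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
independent = {(a, b, c): 0.125
               for a in (0, 1) for b in (0, 1) for c in (0, 1)}
print(binding_information(parity), binding_information(independent))
```

In this toy case the parity process has a binding information of 2 bits while the independent bits have none, which is consistent with the abstract's statement that parity processes are the maximisers for binary variables.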
Li S, Black D, Chew E, Plumbley M (2014) Evidence that phrase-level tempo variation may be represented using a limited dictionary,Presented at: 13th International Conference for Music Perception and Cognition (ICMPC13-APSCOM5), Seoul, South Korea, 4-8 August 2014
Phrases are common musical units akin to those in speech and text. In music performance, performers often change the way they vary the tempo from one phrase to the next in order to choreograph patterns of repetition and contrast. This activity is commonly referred to as expressive music performance. Despite its importance, expressive performance is still poorly understood. No formal models exist that would explain, or at least quantify and characterise, aspects of commonalities and differences in performance style. In this paper we present such a model for tempo variation between phrases in a performance. We demonstrate that the model provides a good fit with a performance database of 25 pieces and that perceptually important information is not lost through the modelling process.
Stowell D, Plumbley MD (2009) Fast Multidimensional Entropy Estimation by k -d Partitioning., IEEE Signal Process. Lett.166pp. 537-540
Plumbley MD (1995) Lyapunov functions for convergence of principal component algorithms, Neural Networks8(1)pp. 11-23
Recent theoretical analyses of a class of unsupervised Hebbian principal component algorithms have identified its local stability conditions. The only locally stable solution for the subspace P extracted by the network is the principal component subspace P*. In this paper we use the Lyapunov function approach to discover the global stability characteristics of this class of algorithms. The subspace projection error, least mean squared projection error, and mutual information I are all Lyapunov functions for convergence to the principal subspace, although the various domains of convergence indicated by these Lyapunov functions leave some of P-space uncovered. A modification to I yields a principal subspace information Lyapunov function I2 with a domain of convergence that covers almost all of P-space. This shows that this class of algorithms converges to the principal subspace from almost everywhere. © 1994.
Abdallah SA, Plumbley MD (2010) A measure of statistical complexity based on predictive information,
We introduce an information theoretic measure of statistical structure, called 'binding information', for sets of random variables, and compare it with several previously proposed measures including excess entropy, Bialek et al.'s predictive information, and the multi-information. We derive some of the properties of the binding information, particularly in relation to the multi-information, and show that, for finite sets of binary random variables, the processes which maximise binding information are the 'parity' processes. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity.
Robertson AN, Plumbley MD (2009) Post-processing fiddle~: A real-time multi-pitch tracking technique using harmonic partial subtraction for use within live performance systems,Proceedings of the 2009 International Computer Music Conference (ICMC 2009)pp. 227-230 Michigan Publishing
We present a method for real-time pitch-tracking which generates an estimate of the amplitudes of the partials relative to the fundamental for each detected note. We then employ a subtraction method, whereby lower fundamentals in the spectrum are accounted for when looking at higher fundamental notes. By tracking notes which are playing, we look for note off events and continually update our expected partial weightings for each note. The resulting algorithm makes use of these relative partial weightings within its decision process. We have evaluated the system against a data set and compared it with specialised offline pitch-trackers. © July 2009- All copyright remains with the individual authors.
Stowell D, Plumbley MD (2010) Cross-associating unlabelled timbre distributions to create expressive musical mappings., WAPA11pp. 28-35
Plumbley M, Abdallah S, Blumensath T, Jafari M, Nesbit A, Vincent E, Wang B (2006) Musical audio analysis using sparse representations,pp. 105-117 Physica-Verlag HD
Sparse representations are becoming an increasingly useful tool in the analysis of musical audio signals. In this paper we give an overview of work by ourselves and others in this area, to give a flavour of the work being undertaken, and to give some pointers for further information about this interesting and challenging research topic.
Plumbley MD, Blumensath T, Daudet L, Gribonval R, Davies ME (2010) Sparse Representations in Audio and Music: From Coding to Source Separation., Proceedings of the IEEE986pp. 995-1005
Johnson I, Plumbley MD (2000) On-Line Connectionist Q-Learning Produces Unreliable Performance with A Synonym Finding Task., IJCNN (3)pp. 451-458
Vincent E, Gribonval R, Plumbley MD (2007) Oracle estimators for the benchmarking of source separation algorithms., Signal Processing878pp. 1933-1950
Degara N, Rua EA, Pena A, Torres-Guijarro S, Davies MEP, Plumbley MD (2012) Reliability-informed beat tracking of musical signals, IEEE Transactions on Audio, Speech and Language Processing20(1)pp. 278-289
A new probabilistic framework for beat tracking of musical audio is presented. The method estimates the time between consecutive beat events and exploits both beat and non-beat information by explicitly modeling non-beat states. In addition to the beat times, a measure of the expected accuracy of the estimated beats is provided. The quality of the observations used for beat tracking is measured and the reliability of the beats is automatically calculated. A k-nearest neighbor regression algorithm is proposed to predict the accuracy of the beat estimates. The performance of the beat tracking system is statistically evaluated using a database of 222 musical signals of various genres. We show that modeling non-beat states leads to a significant increase in performance. In addition, a large experiment where the parameters of the model are automatically learned has been completed. Results show that simple approximations for the parameters of the model can be used. Furthermore, the performance of the system is compared with existing algorithms. Finally, a new perspective for beat tracking evaluation is presented. We show how reliability information can be successfully used to increase the mean performance of the proposed algorithm and discuss how far automatic beat tracking is from human tapping. © 2011 IEEE.
Davies MEP, Plumbley MD (2006) A spectral difference approach to downbeat extraction in musical audio, European Signal Processing Conference
We introduce a method for detecting downbeats in musical audio given a sequence of beat times. Using musical knowledge that lower frequency bands are perceptually more important, we find the spectral difference between band-limited beat synchronous analysis frames as a robust downbeat indicator. Initial results are encouraging for this type of system.
When working with generative systems, designers enter into a loop of discrete steps; external evaluations of the output are fed back into the system, and new outputs are subsequently reevaluated. In such systems, interacting low level elements can engender a difficult to predict emergence of macro-level characteristics. Furthermore, the state space of some systems can be vast. Consequently, designers generally rely on trial-and-error, experience or intuition in selecting parameter values to develop the aesthetic aspects of their designs. We investigate an alternative means of exploring the state spaces of generative visual systems by using a gaze-contingent display. A user's gaze continuously controls and directs an evolution of visual forms and patterns on screen. As time progresses and the viewer and system remain coupled in this evolution, a population of generative artefacts tends towards an area of their state space that is 'of interest', as defined by the eye tracking data. The evaluation-feedback loop is continuous and uninterrupted, with gaze as the guiding feedback mechanism in the exploration of state space.
Murray-Browne T, Plumbley M (2014) Harmonic Motion: A Toolkit for Processing Gestural Data for Interactive Sound, pp. 213-216
We introduce Harmonic Motion, a free open source toolkit for artists, musicians and designers working with gestural data. Extracting musically useful features from captured gesture data can be challenging, with projects often requiring bespoke processing techniques developed through iterations of tweaking equations involving a number of constant values, sometimes referred to as 'magic numbers'. Harmonic Motion provides a robust interface for rapid prototyping of patches to process gestural data and a framework through which approaches may be encapsulated, reused and shared with others. In addition, we describe our design process in which both personal experience and a survey of potential users informed a set of specific goals for the software.
Davies MEP, Plumbley MD (2007) On the use of entropy for beat tracking evaluation, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings4
Despite continued attention toward the problem of automatic beat detection in musical audio, the issue of how to evaluate beat tracking systems remains pertinent and controversial. As yet no consistent evaluation metric has been adopted by the research community. To this aim, we propose a new method for beat tracking evaluation by measuring beat accuracy in terms of the entropy of a beat error histogram. We demonstrate the ability of our approach to address several shortcomings of existing methods. © 2007 IEEE.
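The idea of scoring a beat tracker by the entropy of its beat error histogram can be sketched as below. This is a simplified reading of the abstract rather than the paper's exact procedure: the normalisation of each error by the local inter-annotation interval, the number of histogram bins and all function names are assumptions for this example.

```python
import math


def normalised_beat_errors(beats, annotations):
    """Timing error of each detected beat relative to its nearest
    annotation, as a fraction of the local inter-annotation interval.
    Requires at least two annotations."""
    errors = []
    for b in beats:
        k = min(range(len(annotations)), key=lambda i: abs(annotations[i] - b))
        err = b - annotations[k]
        if err >= 0 and k + 1 < len(annotations):
            interval = annotations[k + 1] - annotations[k]
        elif k > 0:
            interval = annotations[k] - annotations[k - 1]
        else:
            interval = annotations[1] - annotations[0]
        errors.append(err / interval)
    return errors


def error_histogram_entropy(errors, num_bins=40):
    """Entropy in bits of a histogram of errors over [-0.5, 0.5);
    values outside that range are clamped into the end bins."""
    counts = [0] * num_bins
    for e in errors:
        index = int((e + 0.5) * num_bins)
        counts[min(max(index, 0), num_bins - 1)] += 1
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


annotations = [0.0, 1.0, 2.0, 3.0, 4.0]
steady = [0.06, 1.06, 2.06, 3.06, 4.06]   # consistent small offset
print(error_histogram_entropy(normalised_beat_errors(steady, annotations)))
```

A histogram concentrated in a few bins gives low entropy, indicating beats that bear a consistent relationship to the annotations, while errors spread evenly across the bins give the maximum entropy of log2(num_bins) bits.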
Plumbley MD (1993) Information theory and neural network learning algorithms, pp. 145-155 Institute of Physics Publishing
There have been a number of recent papers on information theory and neural networks, especially in a perceptual system such as vision. Some of these approaches are examined, and their implications for neural network learning algorithms are considered. Existing supervised learning algorithms such as Back Propagation to minimize mean squared error can be viewed as attempting to minimize an upper bound on information loss. By making an assumption of noise either at the input or the output to the system, unsupervised learning algorithms such as those based on Hebbian (principal component analysing) or anti-Hebbian (decorrelating) approaches can also be viewed in a similar light. The optimization of information by the use of interneurons to decorrelate output units suggests a role for inhibitory interneurons and cortical loops in biological sensory systems.
Damnjanovic I, Davies MEP, Plumbley MD (2010) SMALLbox - An Evaluation Framework for Sparse Representations and Dictionary Learning Algorithms., LVA/ICA6365pp. 418-425 Springer
Stowell D, Plumbley MD, Bryan-Kinns N (2008) Discourse Analysis Evaluation Method for Expressive Musical Interfaces., NIMEpp. 81-86
Zhijin Qin, Yue Gao, Plumbley MD, Parini CG, Cuthbert LG (2014) Efficient compressive spectrum sensing algorithm for M2M devices,pp. 1170-1174 IEEE
Spectrum used for Machine-to-Machine (M2M) communications should be as cheap as possible or even free in order to connect billions of devices. Recently, both UK and US regulators have conducted trials and pilots to release the UHF TV spectrum for secondary licence-exempt applications. However, it is a very challenging task to implement wideband spectrum sensing in compact and low power M2M devices as high sampling rates are very expensive and difficult to achieve. In recent years, the compressive sensing (CS) technique has made fast wideband spectrum sensing possible by taking samples at sub-Nyquist sampling rates. In this paper, we propose a two-step CS based spectrum sensing algorithm. In the first step, the CS is implemented in a secondary user (SU) and only part of the spectrum of interest is supposed to be sensed by an SU in each sensing period to reduce the complexity in the signal recovery process. In the second step, a denoising algorithm is proposed to improve the detection performance of spectrum sensing. The proposed two-step CS based spectrum sensing is compared with the traditional scheme and the theoretical curves.
Nishimori Y, Akaho S, Abdallah S, Plumbley MD (2007) Flag manifolds for subspace ICA problems, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings4
We investigate the use of the Riemannian optimization method over the flag manifold in subspace ICA problems such as independent subspace analysis (ISA) and complex ICA. In the ISA experiment, we use the Riemannian approach over the flag manifold together with an MCMC method to overcome the problem of local minima of the ISA cost function. Experiments demonstrate the effectiveness of both Riemannian methods - simple geodesic gradient descent and hybrid geodesic gradient descent, compared with the ordinary gradient method. © 2007 IEEE.
Plumbley M (2002) Conditions for nonnegative independent component analysis,IEEE SIGNAL PROCESSING LETTERS9(6)pp. 177-180 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Mailhé B, Barchiesi D, Plumbley MD (2012) INK-SVD: Learning incoherent dictionaries for sparse representations, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedingspp. 3573-3576
This work considers the problem of learning an incoherent dictionary that is both adapted to a set of training data and incoherent so that existing sparse approximation algorithms can recover the sparsest representation. A new decorrelation method is presented that computes a fixed coherence dictionary close to a given dictionary. That step iterates pairwise decorrelations of atoms in the dictionary. Dictionary learning is then performed by adding this decorrelation method as an intermediate step in the K-SVD learning algorithm. The proposed algorithm INK-SVD is tested on musical data and compared to another existing decorrelation method. INK-SVD can compute a dictionary that approximates the training data as well as K-SVD while decreasing the coherence from 0.6 to 0.2. © 2012 IEEE.
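The coherence figures quoted above refer to how similar the most similar pair of atoms in a dictionary is. A minimal sketch of that measurement is given below, assuming the usual definition of mutual coherence as the largest absolute inner product between distinct unit-norm atoms. It does not implement the INK-SVD decorrelation step itself, and the function names are hypothetical.

```python
import math


def normalise(atom):
    """Scale an atom (list of floats) to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in atom))
    if norm == 0:
        raise ValueError("cannot normalise a zero atom")
    return [x / norm for x in atom]


def mutual_coherence(atoms):
    """Largest absolute inner product between two distinct
    normalised atoms; 0 for an orthogonal set, 1 if two atoms
    are collinear."""
    unit = [normalise(a) for a in atoms]
    worst = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            inner = abs(sum(x * y for x, y in zip(unit[i], unit[j])))
            worst = max(worst, inner)
    return worst


orthogonal = [[1.0, 0.0], [0.0, 1.0]]
overlapping = [[1.0, 0.0], [1.0, 1.0]]
print(mutual_coherence(orthogonal), mutual_coherence(overlapping))
```

A dictionary learning loop could call a function like this after each update to check whether a target coherence has been reached.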
Kachkaev A, Wolff D, Barthet M, Plumbley MD, Dykes J, Weyde T (2014) Visualising chord progressions in music collections: a big data approach,
The analysis of large datasets of music audio and other representations entails the need for techniques that support musicologists and other users in interpreting extracted data. We explore and develop visualisation techniques of chord sequence patterns mined from a corpus of over one million tracks. The visualisations use different representations of root movements and chord qualities with geometrical representations, and mostly colour mappings for pattern support. The presented visualisations are being developed in close collaboration with musicologists and can help gain insights into the differences of musical genres and styles as well as support the development of new classification methods.
Robertson A, Stark AM, Plumbley MD (2011) Real-time Visual Beat Tracking using a Comb Filter Matrix,Proceedings of the International Computer Music Conference 2011pp. 617-620 Michigan Publishing
This paper describes an algorithm for real-time beat tracking with a visual interface. Multiple tempo and phase hypotheses are represented by a comb filter matrix. The user can interact by specifying the tempo and phase to be tracked by the algorithm, which will seek to find a continuous path through the space. We present results from evaluating the algorithm on the Hainsworth database and offer a comparison with another existing real-time beat tracking algorithm and offline algorithms.
Ophir B, Elad M, Bertin N, Plumbley MD (2011) Sequential minimal eigenvalues - An approach to analysis dictionary learning, European Signal Processing Conferencepp. 1465-1469
Over the past decade there has been a great interest in a synthesis-based model for signals, based on sparse and redundant representations. Such a model assumes that the signal of interest can be decomposed as a linear combination of few columns from a given matrix (the dictionary). An alternative, analysis-based, model can be envisioned, where an analysis operator multiplies the signal, leading to a sparse outcome. In this paper we propose a simple but effective analysis operator learning algorithm, where analysis "atoms" are learned sequentially by identifying directions that are orthogonal to a subset of the training data. We demonstrate the effectiveness of the algorithm in three experiments, treating synthetic data and real images, showing a successful and meaningful recovery of the analysis operator. © 2011 EURASIP.
Plumbley MD (1996) Information processing in negative feedback neural networks, NETWORK-COMPUTATION IN NEURAL SYSTEMS7(2)pp. 301-305 IOP PUBLISHING LTD
Fujihara H, Klapuri A, Plumbley MD (2012) Instrumentation-based music similarity using sparse representations, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedingspp. 433-436
This paper describes a novel music similarity calculation method that is based on the instrumentation of music pieces. The approach taken here is based on the idea that sparse representations of musical audio signals are a rich source of information regarding the elements that constitute the observed spectra. We propose a method to extract feature vectors based on sparse representations and use these to calculate a similarity measure between songs. To train a dictionary for sparse representations from a large amount of training data, a novel dictionary-initialization method based on agglomerative clustering is proposed. An objective evaluation shows that the new features improve the performance of similarity calculation compared to the standard mel-frequency cepstral coefficients features. © 2012 IEEE.
Qin Z, Gao Y, Plumbley M, Parini C, Cuthbert L (2013) Low-rank Matrix Completion based Malicious User Detection in Cooperative Spectrum Sensing,2013 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP)pp. 1186-1189 IEEE
In a cognitive radio (CR) system, cooperative spectrum sensing (CSS) is the key to improving sensing performance in deep fading channels. In CSS networks, signals received at the secondary users (SUs) are sent to a fusion center to make a final decision on the spectrum occupancy. In this process, the presence of malicious users sending false sensing samples can severely degrade the performance of the CSS network. In this paper, with the compressive sensing (CS) technique being implemented at each SU, we build a CSS network with a double sparsity property. A new malicious user detection scheme is proposed by utilizing the adaptive outlier pursuit (AOP) based low-rank matrix completion in the CSS network. In the proposed scheme, the malicious users are removed in the process of signal recovery at the fusion center. The numerical analysis of the proposed scheme is carried out and compared with an existing malicious user detection algorithm.
Stowell D, Muševič S, Bonada J, Plumbley MD (2013) Improved multiple birdsong tracking with distribution derivative method and Markov renewal process clustering, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedingspp. 468-472
Segregating an audio mixture containing multiple simultaneous bird sounds is a challenging task. However, birdsong often contains rapid pitch modulations, and these modulations carry information which may be of use in automatic recognition. In this paper we demonstrate that an improved spectrogram representation, based on the distribution derivative method, leads to improved performance of a segregation algorithm which uses a Markov renewal process model to track vocalisation patterns consisting of singing and silences. © 2013 IEEE.
Weyde T, Cottrell S, Dykes J, Benetos E, Wolff D, Tidhar D, Kachkaev A, Plumbley M, Dixon S, Barthet M, Gold N, Abdallah S, Alencar-Brayner A, Mahey M, Tovell A (2014) Big Data for Musicology,
Digital music libraries and collections are growing quickly and are increasingly made available for research. We argue that the use of large data collections will enable a better understanding of music performance and music in general, which will benefit areas such as music search and recommendation, music archiving and indexing, music production and education. However, to achieve these goals it is necessary to develop new musicological research methods, to create and adapt the necessary technological infrastructure, and to find ways of working with legal limitations. Most of the necessary basic technologies exist, but they need to be brought together and applied to musicology. We aim to address these challenges in the Digital Music Lab project, and we feel that with suitable methods and technology Big Music Data can provide new opportunities to musicology.
Plumbley MD, Abdallah SA, Blumensath T, Davies ME (2006) Sparse representations of polyphonic music., Signal Processing863pp. 417-431
Plumbley MD (1999) Do cortical maps adapt to optimize information density?,NETWORK-COMPUTATION IN NEURAL SYSTEMS10(1)pp. 41-58 IOP PUBLISHING LTD
Simpson AR, Roma G, Plumbley M (2015) Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network, 9237pp. 429-436 Springer International Publishing
Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNN) have been used to estimate 'ideal' binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a convolutional DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.
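The notion of an ideal binary mask, and of thresholding a network's probabilistic estimate of it, can be illustrated on small magnitude spectrograms stored as nested lists. This sketch assumes the common definition in which a time-frequency cell is assigned to the vocal when the vocal magnitude exceeds the accompaniment magnitude. The 0.5 threshold and all function names are illustrative assumptions, and no neural network is involved.

```python
def ideal_binary_mask(vocal_mag, accomp_mag):
    """1 where the vocal dominates a time-frequency cell, else 0.
    Needs the separate sources, so it is only available for
    training or evaluation data."""
    return [[1 if v > a else 0 for v, a in zip(v_row, a_row)]
            for v_row, a_row in zip(vocal_mag, accomp_mag)]


def threshold_mask(probabilities, threshold=0.5):
    """Turn per-cell probabilities (for example, a model's output)
    into a binary mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def apply_mask(mixture_mag, mask):
    """Keep mixture cells where the mask is 1 and zero the rest."""
    return [[m * k for m, k in zip(m_row, k_row)]
            for m_row, k_row in zip(mixture_mag, mask)]


vocal = [[0.9, 0.1], [0.2, 0.7]]
accomp = [[0.3, 0.8], [0.6, 0.1]]
mixture = [[1.2, 0.9], [0.8, 0.8]]
mask = ideal_binary_mask(vocal, accomp)
print(mask, apply_mask(mixture, mask))
```

For a 'karaoke' style result the mask would be inverted, keeping the cells where the accompaniment dominates.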
Cleju N, Jafari MG, Plumbley MD (2012) Analysis-based sparse reconstruction with synthesis-based solvers., ICASSPpp. 5401-5404 IEEE
Vincent E, Plumbley MD (2006) Single-Channel Mixture Decomposition Using Bayesian Harmonic Models., ICA3889pp. 722-730 Springer
Robertson A, Plumbley M (2007) B-Keeper: A beat-tracker for live performance, Proceedings of the 7th International Conference on New Interfaces for Musical Expression, NIME '07pp. 234-237
This paper describes the development of B-Keeper, a real-time beat tracking system implemented in Java and Max/MSP, which is capable of maintaining synchronisation between an electronic sequencer and a drummer. This enables musicians to interact with electronic parts which are triggered automatically by the computer from performance information. We describe an implementation which functions with the sequencer Ableton Live.
Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not discernible. 
We study the interaction between dataset characteristics and choice of feature representation through further empirical analysis.
Stowell D, Plumbley MD (2012) Framewise heterodyne chirp analysis of birdsong, European Signal Processing Conference pp. 2694-2698
Harmonic birdsong is often highly nonstationary, which suggests that standard FFT representations may be of limited suitability. Wavelet and chirplet techniques exist in the literature, but are not often applied to signals such as bird vocalisations, perhaps due to analysis complexity. In this paper we develop a single-scale chirp analysis (computationally accelerated using FFT) which can be treated as an ordinary time-series. We then study a sinusoidal representation simply derived from the peak bins of this time-series. We show that it can lead to improved species classification from birdsong. © 2012 EURASIP.
Murray-Browne T, Mainstone D, Bryan-Kinns N, Plumbley MD (2013) The serendiptichord: Reflections on the collaborative design process between artist and researcher, Leonardo 46(1) pp. 86-87
The Serendiptichord is a wearable instrument, resulting from a collaboration crossing fashion, technology, music and dance. This paper reflects on the collaborative process and how defining both creative and research roles for each party led to a successful creative partnership built on mutual respect and open communication. After a brief snapshot of the instrument in performance, the instrument is considered within the context of dance-driven interactive music systems followed by a discussion on the nature of the collaboration and its impact upon the design process and final piece. © 2013 ISAST.
Brossier P, Sandler M, Plumbley M (2003) Matching live sources with physical models, pp. 305-307
Barchiesi D, Giannoulis D, Stowell D, Plumbley MD (2015) Acoustic Scene Classification: Classifying environments from the sounds they produce, IEEE Signal Processing Magazine 32(3) pp. 16-34
In this article, we present an account of the state of the art in acoustic scene classification (ASC), the task of classifying environments from the sounds they produce. Starting from a historical review of previous research in this area, we define a general framework for ASC and present different implementations of its components. We then describe a range of different algorithms submitted for a data challenge that was held to provide a general and fair benchmark for ASC techniques. The data set recorded for this purpose is presented along with the performance metrics that are used to evaluate the algorithms and statistical significance tests to compare the submitted methods.
O'Hanlon K, Plumbley MD (2013) Automatic Music Transcription using row weighted decompositions, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 16-20
Automatic Music Transcription (AMT) seeks to understand a musical piece in terms of note activities. Matrix decomposition methods are often used for AMT, seeking to decompose a spectrogram over a dictionary matrix of note-specific template vectors. The performance of these methods can suffer due to the large harmonic overlap found in tonal musical spectra. We propose a row weighting scheme that transforms each spectrogram frame and the dictionary, with the weighting determined by the effective correlations in the decomposition. Experiments show improved AMT performance. © 2013 IEEE.
Ekeus H, Mcowan PW, Plumbley MD (2013) Eye Tracking as Interface for Parametric Design,
This research investigates the potential of eye tracking as an interface to parameter search in visual design. We outline our experimental framework where a user's gaze acts as guiding feedback mechanism in an exploration of the state space of parametric designs. A small scale pilot study was carried out where participants influence the evolution of generative patterns by looking at a screen while having their eyes tracked. Preliminary findings suggest that although our eye tracking system can be used to effectively navigate small areas of a parametric design's state-space, there are challenges to overcome before such a system is practical in a design context. Finally we outline future directions of this research.
Stark AM, Plumbley MD (2010) Performance following: Tracking a performance without a score, ICASSP pp. 2482-2485 IEEE
Jafari MG, Plumbley MD (2007) The Role of High Frequencies in Convolutive Blind Source Separation of Speech Signals, ICA 4666 pp. 488-494 Springer
Cano E, Plumbley M, Dittmar C (2014) Phase-based harmonic/percussive separation, pp. 1628-1632
In this paper, a method for separation of harmonic and percussive elements in music recordings is presented. The proposed method is based on a simple spectral peak detection step followed by a phase expectation analysis that discriminates between harmonic and percussive components. The proposed method was tested on a database of 10 audio tracks and has shown superior results to the reference state-of-the-art approach.
Vincent E, Plumbley MD (2008) Efficient Bayesian inference for harmonic models via adaptive posterior factorization, Neurocomputing 72(1-3) pp. 79-87
Stowell D, Plumbley MD (2013) Segregating event streams and noise with a Markov renewal process model, Journal of Machine Learning Research 14(1) pp. 2213-2238
Gretsistas A, Damnjanovic I, Plumbley MD (2010) Gradient Polytope Faces Pursuit for large scale sparse recovery problems, ICASSP pp. 2030-2033 IEEE
Audio source separation is a difficult machine learning problem and performance is measured by comparing extracted signals with the component source signals. However, if separation is motivated by the ultimate goal of re-mixing then complete separation is not necessary and hence separation difficulty and separation quality are dependent on the nature of the re-mix. Here, we use a convolutional deep neural network (DNN), trained to estimate 'ideal' binary masks for separating voice from music, to perform re-mixing of the vocal balance by operating directly on the individual magnitude components of the musical mixture spectrogram. Our results demonstrate that small changes in vocal gain may be applied with very little distortion to the ultimate re-mix. Our method may be useful for re-mixing existing mixes.
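The re-mixing idea in the abstract above, changing the vocal level by operating directly on masked spectrogram magnitudes, can be illustrated with a minimal sketch. It assumes the binary vocal mask is already available (here it is hard-coded, not estimated by a network), and the names `remix_vocal_gain`, `mixture_mag`, `vocal_mask` and `gain_db` are hypothetical, chosen only for this illustration.

```python
def remix_vocal_gain(mixture_mag, vocal_mask, gain_db):
    """Scale the time-frequency bins flagged as vocal by gain_db decibels.

    mixture_mag: list of rows of magnitude values (frequency x time).
    vocal_mask:  same shape, 1 where a bin is assigned to the voice, else 0.
    """
    gain = 10.0 ** (gain_db / 20.0)  # decibels to linear amplitude factor
    return [
        [mag * gain if is_vocal else mag for mag, is_vocal in zip(mag_row, mask_row)]
        for mag_row, mask_row in zip(mixture_mag, vocal_mask)
    ]


# Toy 2x3 magnitude spectrogram and a hand-written (hypothetical) mask.
mixture_mag = [[1.0, 2.0, 0.5],
               [0.2, 4.0, 1.0]]
vocal_mask = [[1, 0, 0],
              [0, 1, 1]]

louder_vocals = remix_vocal_gain(mixture_mag, vocal_mask, gain_db=20.0)
unchanged = remix_vocal_gain(mixture_mag, vocal_mask, gain_db=0.0)
```

With a gain of 0 dB the mixture is returned unchanged, which matches the intuition that small gain changes leave most of the mixture untouched even when the mask is imperfect.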
Stowell D, Plumbley MD (2014) Large-scale analysis of frequency modulation in birdsong data bases, Methods in Ecology and Evolution 5(9) pp. 901-912
* Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment and also that there are specific social uses of FM such as trills in aggressive territorial encounters. Yet temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction. Hence, it is important to consider high-resolution signal processing techniques for analysis of FM in bird vocalizations. If such methods can be applied at big data scales, this offers a further advantage as large data sets become available. * We introduce methods from the signal processing literature which can go beyond spectrogram representations to analyse the fine modulations present in a signal at very short time-scales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. We evaluate this through a feature selection technique and an automatic classification experiment. In order to find tools useful in practical analysis of large data bases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression. * We find three methods which can robustly represent species-correlated FM attributes and can be applied to large data sets, and that the simplest method tested also appears to perform the best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information. * FM analysis can extract information useful for bioacoustic studies, in addition to measures more commonly used to characterize vocalizations. Further, it can be applied efficiently across very large data sets and archives.
Tsui KC, Azvine B, Plumbley M (1996) The roles of neural and evolutionary computing in intelligent software systems, BT TECHNOLOGY JOURNAL 14(4) pp. 46-54 SPRINGER
Given a musical audio recording, the goal of music transcription is to determine a score-like representation of the piece underlying the recording. Most current transcription methods employ variants of non-negative matrix factorization (NMF), which often fails to robustly model instruments producing non-stationary sounds. Using entire time-frequency patterns to represent sounds, non-negative matrix deconvolution (NMD) can capture certain types of nonstationary behavior but is only applicable if all sounds have the same length. In this paper, we present a novel method that combines the non-stationarity modeling capabilities available with NMD with the variable note lengths possible with NMF. Identifying frames in NMD patterns with states in a dynamical system, our method iteratively generates sound-object candidates separately for each pitch, which are then combined in a global optimization. We demonstrate the transcription capabilities of our method using piano pieces assuming the availability of single note recordings as training data.
Barabasa C, Jafari M, Plumbley M (2012) A Robust Method for S1/S2 Heart Sounds Detection Without ECG Reference Based on Music Beat Tracking, 2012 10TH INTERNATIONAL SYMPOSIUM ON ELECTRONICS AND TELECOMMUNICATIONS pp. 307-310
We present a robust method for the detection of the first and second heart sounds (S1 and S2), without ECG reference, based on a music beat tracking algorithm. An intermediate representation of the input signal is first calculated by using an onset detection function based on complex spectral difference. A music beat tracking algorithm is then used to determine the location of the first heart sound. The beat tracker works in two steps: it first calculates the beat period and then finds the temporal beat alignment. Once the first sound is detected, inverse Gaussian weights are applied to the onset function on the detected positions and the algorithm is run again to find the second heart sound. At the last step, S1 and S2 labels are attributed to the detected sounds. The algorithm was evaluated in terms of location accuracy as well as sensitivity and specificity, and it performed well even in the presence of murmurs or noisy signals.
Plumbley MD (1999) Do cortical maps adapt to optimize information density?, Network: Computation in Neural Systems 10(1) pp. 41-58
Topographic maps are found in many biological and artificial neural systems. In biological systems, some parts of these can form a significantly expanded representation of their sensory input, such as the representation of the fovea in the visual cortex. We propose that a cortical feature map should be organized to optimize the efficiency of information transmission through it. This leads to a principle of uniform cortical information density across the map as the desired optimum. An expanded representation in the cortex for a particular sensory area (i.e. a high magnification factor) means that a greater information density is concentrated in that sensory area, leading to finer discrimination thresholds. Improvement may ultimately be limited by the construction of the sensors themselves. This approach gives a good fit to threshold versus cortical area data of Recanzone et al on owl monkeys trained on a tactile frequency-discrimination task.
Stowell D, Plumbley MD (2011) Learning Timbre Analogies from Unlabelled Data by Multivariate Tree Regression, Journal of New Music Research 40(4) pp. 325-336
Applications such as concatenative synthesis (audio mosaicing) and query-by-example require the ability to search a database using a sound which is qualitatively different from the actual desired result, for example when using vocal queries to retrieve nonvocal sound. Standard query techniques such as nearest neighbours do not account for this difference between source and target; they perform retrieval but do not learn to make timbral analogies. This paper addresses this issue by considering timbral query as a multivariate regression problem from one timbre distribution onto another. We develop a novel variant of multivariate tree regression: given only a set of unlabelled and unpaired samples from two distributions on the same space, the regression learns a cross-associative mapping which assumes general similarities in structure of the two distributions, yet can accommodate differences in shape at various scales. We demonstrate the technique with a synthetic example and with a concatenative synthesizer. © 2011 Copyright Taylor and Francis Group, LLC.
Davies MEP, Plumbley MD (2008) Exploring the effect of rhythmic style classification on automatic tempo estimation, European Signal Processing Conference
Within ballroom dance music, tempo and rhythmic style are strongly related. In this paper we explore this relationship, by using knowledge of rhythmic style to improve tempo estimation in musical audio signals. We demonstrate how the use of a simple 1-NN classification method, able to determine rhythmic style with 75% accuracy, can lead to an 8 percentage point improvement over existing tempo estimation algorithms, with further gains possible through the use of more sophisticated classification techniques.
Nesbit A, Plumbley MD (2008) Oracle estimation of adaptive cosine packet transforms for underdetermined audio source separation, ICASSP pp. 41-44 IEEE
Mailhé B, Plumbley MD (2012) Dictionary Learning with Large Step Gradient Descent for Sparse Representations, LVA/ICA 7191 pp. 231-238 Springer
Mailhe B, Plumbley M (2013) Dictionary Learning via Projected Maximal Exploration, 2013 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP) pp. 626-626 IEEE
This work presents a geometrical analysis of the Large Step Gradient Descent (LGD) dictionary learning algorithm. LGD updates the atoms of the dictionary using a gradient step with a step size equal to twice the optimal step size. We show that the large step gradient descent can be understood as a maximal exploration step where one goes as far away as possible without increasing the error. We also show that the LGD iteration is monotonic when the algorithm used for the sparse approximation step is close enough to orthogonal.
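The "twice the optimal step size" property can be checked on a toy problem. The sketch below is not the LGD dictionary update itself: it assumes a generic quadratic objective f(x) = ½ xᵀAx − bᵀx with a hypothetical symmetric positive definite matrix `A` and vector `b`, and shows that stepping twice as far as the exact line-search minimiser brings the objective back to its starting value, so the step goes as far as possible without increasing the error.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(A, b, x):
    """Quadratic objective f(x) = 0.5 * x^T A x - b^T x."""
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)


def gradient(A, b, x):
    """Gradient of the quadratic: A x - b."""
    return [ax - bi for ax, bi in zip(matvec(A, x), b)]


def optimal_step(A, g):
    """Exact line-search step size along -g for a quadratic objective."""
    return dot(g, g) / dot(g, matvec(A, g))


def take_step(x, g, alpha):
    return [xi - alpha * gi for xi, gi in zip(x, g)]


# Hypothetical toy problem (A is symmetric positive definite).
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, -1.0]
x0 = [2.0, 2.0]

g = gradient(A, b, x0)
alpha = optimal_step(A, g)

f_start = objective(A, b, x0)
f_optimal = objective(A, b, take_step(x0, g, alpha))
f_double = objective(A, b, take_step(x0, g, 2 * alpha))
```

Along the gradient direction the objective is a parabola in the step size, so the point at twice the minimising step is the mirror image of the starting point and has the same objective value.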
Hedayioglu F, Jafari MG, Mattos SS, Plumbley MD, Coimbra MT (2012) Denoising and Segmentation of the Second Heart Sound Using Matching Pursuit, 2012 ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC) pp. 3440-3443
We propose a denoising and segmentation technique for the second heart sound (S2). To denoise, Matching Pursuit (MP) was applied using a set of non-linear chirp signals as atoms. We show that the proposed method can be used to segment the phonocardiogram of the second heart sound into its two clinically meaningful components: the aortic (A2) and pulmonary (P2) components. © 2012 IEEE.
Giannoulis D, Benetos E, Stowell D, Rossignol M, Lagrange M, Plumbley MD (2013) Detection and classification of acoustic scenes and events: An IEEE AASP challenge, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
This paper describes a newly-launched public evaluation challenge on acoustic scene classification and detection of sound events within a scene. Systems dealing with such tasks are far from exhibiting human-like performance and robustness. Undermining factors are numerous: the extreme variability of sources of interest possibly interfering, the presence of complex background noise as well as room effects like reverberation. The proposed challenge is an attempt to help the research community move forward in defining and studying the aforementioned tasks. Apart from the challenge description, this paper provides an overview of systems submitted to the challenge as well as a detailed evaluation of the results achieved by those systems. © 2013 IEEE.
O'Hanlon K, Nagano H, Plumbley M (2012) Group non-negative basis pursuit for automatic music transcription, Proceedings of 5th International Workshop on Machine Learning and Music (MML 2012) pp. 15-16
Automatic Music Transcription is often performed by decomposing a spectrogram over a dictionary of note specific atoms. Several note template atoms may be used to represent one note, and a group structure may be imposed on the dictionary. We propose a group sparse algorithm based on a multiplicative update and thresholding and show transcription results on a challenging dataset.
Barchiesi D, Plumbley MD (2013) Learning incoherent dictionaries for sparse approximation using iterative projections and rotations, IEEE Transactions on Signal Processing 61(8) pp. 2055-2065 IEEE
This article deals with learning dictionaries for sparse approximation whose atoms are both adapted to a training set of signals and mutually incoherent. To meet this objective, we employ a dictionary learning scheme consisting of sparse approximation followed by dictionary update and we add to the latter a decorrelation step in order to reach a target mutual coherence level. This step is accomplished by an iterative projection method complemented by a rotation of the dictionary. Experiments on musical audio data and a comparison with the method of optimal coherence-constrained directions (MOCOD) and the incoherent K-SVD (INK-SVD) illustrate that the proposed algorithm can learn dictionaries that exhibit a low mutual coherence while providing a sparse approximation with better signal-to-noise ratio (SNR) than the benchmark techniques. © 1991-2012 IEEE.
Oja E, Plumbley M (2004) Blind separation of positive sources by globally convergent gradient search, Neural Comput 16(9) pp. 1811-1825
The instantaneous noise-free linear mixing model in independent component analysis is largely a solved problem under the usual assumption of independent nongaussian sources and full column rank mixing matrix. However, with some prior information on the sources, like positivity, new analysis and perhaps simplified solution methods may yet become possible. In this letter, we consider the task of independent component analysis when the independent sources are known to be nonnegative and well grounded, which means that they have a nonzero pdf in the region of zero. It can be shown that in this case, the solution method is basically very simple: an orthogonal rotation of the whitened observation vector into nonnegative outputs will give a positive permutation of the original sources. We propose a cost function whose minimum coincides with nonnegativity and derive the gradient algorithm under the whitening constraint, under which the separating matrix is orthogonal. We further prove that in the Stiefel manifold of orthogonal matrices, the cost function is a Lyapunov function for the matrix gradient flow, implying global convergence. Thus, this algorithm is guaranteed to find the nonnegative well-grounded independent sources. The analysis is complemented by a numerical simulation, which illustrates the algorithm.
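The central observation in the abstract above, that a rotation of the whitened data into nonnegative outputs recovers the sources, can be illustrated in two dimensions. The sketch below makes several labelled assumptions: the whitening step is taken as already done, so the observations are an unknown rotation of the sources; the cost is the mean squared negative part of the outputs, one plausible choice whose minimum coincides with nonnegativity; and a plain grid search over the rotation angle is swapped in for the paper's gradient algorithm. All names and data values are hypothetical.

```python
import math


def rotate(theta, point):
    """Rotate a 2-D point counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = point
    return (c * x - s * y, s * x + c * y)


def negativity_cost(theta, observations):
    """Mean squared negative part of the rotated observations.

    The cost is zero exactly when every rotated output is nonnegative.
    """
    total = 0.0
    for z in observations:
        y = rotate(theta, z)
        total += sum(min(0.0, v) ** 2 for v in y)
    return total / len(observations)


# Hypothetical nonnegative, "well grounded" sources: some samples lie on the axes.
sources = [(0.0, 1.0), (1.0, 0.0), (0.5, 2.0), (2.0, 0.2), (0.0, 0.0), (1.5, 1.5)]

# Assumed unknown rotation left over after whitening.
mixing_angle = 0.6
observations = [rotate(mixing_angle, s) for s in sources]

# Grid search over the rotation angle (a stand-in for gradient descent).
n_grid = 3600
angles = [-math.pi + 2 * math.pi * k / n_grid for k in range(n_grid)]
best_angle = min(angles, key=lambda t: negativity_cost(t, observations))
```

In this toy case the search returns an angle close to `-mixing_angle`, undoing the rotation. In higher dimensions the search space is the set of orthogonal matrices, which is where the gradient flow analysed in the paper applies.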
Stowell D, Giannoulis D, Benetos E, Lagrange M, Plumbley MD (2015) Detection and Classification of Acoustic Scenes and Events, IEEE Transactions on Multimedia 17(10) pp. 1733-1746 IEEE
For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.
Becker S, Plumbley M (1996) Unsupervised neural network learning procedures for feature extraction and classification, APPLIED INTELLIGENCE 6(3) pp. 185-203 KLUWER ACADEMIC PUBL
Plumbley MD (2005) Polar Polytopes and Recovery of Sparse Representations,
In this paper we present the supervised iterative projections and rotations (S-IPR) algorithm, a method to optimise a set of discriminative subspaces for supervised classification. We show how the proposed technique is based on our previous unsupervised iterative projections and rotations (IPR) algorithm for incoherent dictionary learning, and how projecting the features onto the learned sub-spaces can be employed as a feature transform algorithm in the context of classification. Numerical experiments on the FISHERIRIS and on the USPS datasets, and a comparison with the PCA and LDA methods for feature transform, demonstrate the value of the proposed technique and its potential as a tool for machine learning. © 2013 IEEE.
Barthet M, Plumbley MD, Kachkaev A, Dykes J, Wolff D, Weyde T (2014) Big chord data extraction and mining,
Harmonic progression is one of the cornerstones of tonal music composition and is thereby essential to many musical styles and traditions. Previous studies have shown that musical genres and composers could be discriminated based on chord progressions modeled as chord n-grams. These studies were however conducted on small-scale datasets and using symbolic music transcriptions. In this work, we apply pattern mining techniques to over 200,000 chord progression sequences out of 1,000,000 extracted from the I Like Music (ILM) commercial music audio collection. The ILM collection spans 37 musical genres and includes pieces released between 1907 and 2013. We developed a single program multiple data parallel computing approach whereby audio feature extraction tasks are split up and run simultaneously on multiple cores. An audio-based chord recognition model (Vamp plugin Chordino) was used to extract the chord progressions from the ILM set. To keep low-weight feature sets, the chord data were stored using a compact binary format. We used the CM-SPADE algorithm, which performs a vertical mining of sequential patterns using co-occurrence information, and which is fast and efficient enough to be applied to big data collections like the ILM set. In order to derive key-independent frequent patterns, the transitions between chords are modeled by changes of qualities (e.g. major, minor, etc.) and root keys (e.g. fourth, fifth, etc.). The resulting key-independent chord progression patterns vary in length (from 2 to 16) and frequency (from 2 to 19,820) across genres. As illustrated by graphs generated to represent frequent 4-chord progressions, some patterns like circle-of-fifths movements are well represented in most genres but in varying degrees. These large-scale results offer the opportunity to uncover similarities and discrepancies between sets of musical pieces and therefore to build classifiers for search and recommendation.
They also support the empirical testing of music theory. It is however more difficult to derive new hypotheses from such dataset due to its size. This can be addressed by using pattern detection algorithms or suitable visualisation which we present in a companion study.
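The key-independent encoding described in the abstract above, where transitions are expressed as changes of root and quality, can be sketched as follows. This is an illustrative assumption about the representation, not the encoding used in the study: root movement is measured in semitones modulo 12, and the function name `key_independent_transitions` and the chord tuples are hypothetical.

```python
# Pitch classes for chord roots (enharmonic spellings map to the same number).
PITCH_CLASS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}


def key_independent_transitions(chords):
    """Encode a chord sequence as (root interval, quality before, quality after).

    chords: list of (root, quality) pairs, e.g. ("C", "maj").
    The root interval is the upward distance in semitones, modulo 12,
    so the same progression in any key yields the same encoding.
    """
    transitions = []
    for (root_a, quality_a), (root_b, quality_b) in zip(chords, chords[1:]):
        interval = (PITCH_CLASS[root_b] - PITCH_CLASS[root_a]) % 12
        transitions.append((interval, quality_a, quality_b))
    return transitions


# The same I-IV-V-I progression written in C major and in G major.
in_c = [("C", "maj"), ("F", "maj"), ("G", "maj"), ("C", "maj")]
in_g = [("G", "maj"), ("C", "maj"), ("D", "maj"), ("G", "maj")]

pattern_c = key_independent_transitions(in_c)
pattern_g = key_independent_transitions(in_g)
```

Both progressions map to the same pattern, so a sequential pattern miner counting these tuples would treat them as one frequent pattern regardless of key.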
Stowell D, Plumbley MD (2010) Timbre remapping through a regression-tree technique, Proceedings of the 7th Sound and Music Computing Conference, SMC 2010
We consider the task of inferring associations between two differently-distributed and unlabelled sets of timbre data. This arises in applications such as concatenative synthesis/ audio mosaicing in which one audio recording is used to control sound synthesis through concatenating fragments of an unrelated source recording. Timbre is a multidimensional attribute with interactions between dimensions, so it is non-trivial to design a search process which makes best use of the timbral variety available in the source recording. We must be able to map from control signals whose timbre features have different distributions from the source material, yet labelling large collections of timbral sounds is often impractical, so we seek an unsupervised technique which can infer relationships between distributions. We present a regression tree technique which learns associations between two unlabelled multidimensional distributions, and apply the technique to a simple timbral concatenative synthesis system. We demonstrate numerically that the mapping makes better use of the source material than a nearest-neighbour search. © 2010 Dan Stowell et al.
Giannoulis D, Klapuri A, Plumbley M (2013) Recognition of harmonic sounds in polyphonic audio using a missing feature approach, 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) pp. 8658-8662 IEEE
A method based on local spectral features and missing feature techniques is proposed for the recognition of harmonic sounds in mixture signals. A mask estimation algorithm is proposed for identifying spectral regions that contain reliable information for each sound source and then bounded marginalization is employed to treat the feature vector elements that are determined as unreliable. The proposed method is tested on musical instrument sounds due to the extensive availability of data but it can be applied to other sounds (e.g. animal sounds, environmental sounds), whenever these are harmonic. In simulations the proposed method clearly outperformed a baseline method for mixture signals.
Cannam C, Figueira LA, Plumbley MD (2012) Sound software: Towards software reuse in audio and music research, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 2745-2748
Although researchers are increasingly aware of the need to publish and maintain software code alongside their results, practical barriers prevent this from happening in many cases. We examine these barriers, propose an incremental approach to overcoming some of them, and describe the Sound Software project, an effort to support software development practice in the UK audio and music research community. Finally we make some recommendations for research groups seeking to improve their own researchers' software practice. © 2012 IEEE.
Mailhe B, Sturm B, Plumbley MD (2013) Recovery of nested supports by greedy sparse representation algorithms over non-normalized dictionaries,
We prove that if Orthogonal Matching Pursuit (OMP) recovers all s-sparse signals for a given dictionary, then it also recovers all s'-sparse signals on the same dictionary for any s' < s. We also extend Tropp's Exact Recovery Condition (ERC) to dictionaries with non-normalized atoms. Our result does not contradict an earlier result stating that there are dictionaries and cardinalities s' < s such that all s-size supports satisfy Tropp's ERC but not all s'-size supports do: that result was proved using non-normalized dictionaries and in that case Tropp's ERC is not linked to the recovery by OMP.
Cleju N, Jafari MG, Plumbley MD (2012) Analysis-based sparse reconstruction with synthesis-based solvers, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 5401-5404
Analysis based reconstruction has recently been introduced as an alternative to the well-known synthesis sparsity model used in a variety of signal processing areas. In this paper we convert the analysis exact-sparse reconstruction problem to an equivalent synthesis recovery problem with a set of additional constraints. We are therefore able to use existing synthesis-based algorithms for analysis-based exact-sparse recovery. We call this the Analysis-By-Synthesis (ABS) approach. We evaluate our proposed approach by comparing it against the recent Greedy Analysis Pursuit (GAP) analysis-based recovery algorithm. The results show that our approach is a viable option for analysis-based reconstruction, while at the same time allowing many algorithms that have been developed for synthesis reconstruction to be directly applied for analysis reconstruction as well. © 2012 IEEE.
Robertson A, Plumbley MD (2013) Synchronizing sequencing software to a live drummer, Computer Music Journal 37(2) pp. 46-60
This article presents a method of adjusting the tempo of a music software sequencer so that it remains synchronized with a drummer's musical pulse. This allows music sequencer technology to be integrated into a band scenario without the compromise of using click tracks or triggering loops with a fixed tempo. Our design implements real-time mechanisms for both underlying tempo and phase adjustment using adaptable parameters that control its behavior. The aim is to create a system that responds to timing variations in the drummer's playing but is also stable during passages of syncopation and fills. We present an evaluation of the system using a stochastic drum machine that incorporates a level of noise in the underlying tempo and phase of the beat. We measure synchronization error between the output of the system and the underlying pulse of the drum machine and contrast this with other real-time beat trackers. The software, B-Keeper, has been released as a Max for Live device, available online at © 2013 Massachusetts Institute of Technology.
Keriven N, O'Hanlon K, Plumbley MD (2013) Structured sparsity using backwards elimination for Automatic Music Transcription, IEEE International Workshop on Machine Learning for Signal Processing, MLSP
Musical signals can be thought of as being sparse and structured, with few elements active at a given instant and temporal continuity of active elements observed. Greedy algorithms such as Orthogonal Matching Pursuit (OMP), and structured variants, have previously been proposed for Automatic Music Transcription (AMT), however some problems have been noted. Hence, we propose the use of a backwards elimination strategy in order to perform sparse decompositions for AMT, in particular with a proposed alternative sparse cost function. However, the main advantage of this approach is the ease with which structure can be incorporated. The use of group sparsity is shown to give increased AMT performance, while a molecular method incorporating onset information is seen to provide further improvements with little computational effort. © 2013 IEEE.
Fritsch J, Plumbley M (2013) Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis, 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) pp. 888-891 IEEE
In this paper we present a new method for musical audio source separation, using the information from the musical score to supervise the decomposition process. An original framework using nonnegative matrix factorization (NMF) is presented, where the components are initially learnt on synthetic signals with temporal and harmonic constraints. A new dataset of multitrack recordings with manually aligned MIDI scores is created (TRIOS), and we compare our separation results with other methods from the literature using the BSS EVAL and PEASS evaluation toolboxes. The results show a general improvement of the BSS EVAL metrics for the various instrumental configurations used.
Abdallah SA, Ekeus H, Foster P, Robertson A, Plumbley MD (2012) Cognitive music modelling: An information dynamics approach, 2012 3rd International Workshop on Cognitive Information Processing, CIP 2012 IEEE
We describe an information-theoretic approach to the analysis of music and other sequential data, which emphasises the predictive aspects of perception, and the dynamic process of forming and modifying expectations about an unfolding stream of data, characterising these using the tools of information theory: entropies, mutual informations, and related quantities. After reviewing the theoretical foundations, we discuss a few emerging areas of application, including musicological analysis, real-time beat-tracking analysis, and the generation of musical materials as a cognitively-informed compositional aid. © 2012 IEEE.
Wierstorf H, Ward D, Mason R, Girgis E, Hummersone C, Plumbley M (2017) Perceptual Evaluation of Source Separation for Remixing Music, 143rd AES Convention Paper No 9880 Audio Engineering Society
Music remixing is difficult when the original multitrack recording is not available. One solution is to estimate the elements of a mixture using source separation. However, existing techniques suffer from imperfect separation and perceptible artifacts on single separated sources. To investigate their influence on a remix, five state-of-the-art source separation algorithms were used to remix six songs by increasing the level of the vocals. A listening test was conducted to assess the remixes in terms of loudness balance and sound quality. The results show that some source separation algorithms are able to increase the level of the vocals by up to 6 dB at the cost of introducing a small but perceptible degradation in sound quality.
Plumbley MD, Bevilacqua M (2009) Sparse reconstruction for compressed sensing using stagewise polytope faces pursuit, DSP 2009: 16th International Conference on Digital Signal Processing, Proceedings
Compressed sensing, also known as compressive sampling, is an approach to the measurement of signals which have a sparse representation, that can reduce the number of measurements that are needed to reconstruct the signal. The signal reconstruction part requires efficient methods to perform sparse reconstruction, such as those based on linear programming. In this paper we present a method for sparse reconstruction which is an extension of our earlier Polytope Faces Pursuit algorithm, based on the polytope geometry of the dual linear program. The new algorithm adds several basis vectors at each stage, in a similar way to the recent Stagewise Orthogonal Matching Pursuit (StOMP) algorithm. We demonstrate the application of the algorithm to some standard compressed sensing problems. © 2009 IEEE.
Roma G, Grais EM, Simpson AJR, Sobieraj I, Plumbley MD (2016) Untwist: A new toolbox for audio source separation
Untwist is a new open source toolbox for audio source separation. The library provides a self-contained object-oriented framework including common source separation algorithms as well as input/output functions, data management utilities and time-frequency transforms. Everything is implemented in Python, facilitating research, experimentation and prototyping across platforms. The code is available on GitHub.
Barchiesi D, Giannoulis D, Stowell D, Plumbley MD (2015) Acoustic Scene Classification: Classifying environments from the sounds they produce, IEEE Signal Process. Mag. 32(3) pp. 16-34
Stark AM, Plumbley MD, Davies MEP (2007) Real-time beat-synchronous audio effects, Proceedings of the 7th International Conference on New Interfaces for Musical Expression, NIME '07 pp. 344-345
We present a new group of audio effects that use beat tracking, the detection of beats in an audio signal, to relate effect parameters to the beats in an input signal. Conventional audio effects are augmented so that their operation is related to the output of a beat tracking system. We present a tempo-synchronous delay effect and a set of beat-synchronous low frequency oscillator effects including tremolo, vibrato and auto-wah. All effects are implemented in real-time as VST plug-ins to allow for their use in live performance.
Cleju N, Jafari MG, Plumbley MD (2012) Choosing analysis or synthesis recovery for sparse reconstruction, European Signal Processing Conference pp. 869-873
The analysis sparsity model is a recently introduced alternative to the standard synthesis sparsity model frequently used in signal processing. However, the exact conditions when analysis-based recovery is better than synthesis recovery are still not known. This paper constitutes an initial investigation into determining when one model is better than the other, under similar conditions. We perform separate analysis and synthesis recovery on a large number of randomly generated signals that are simultaneously sparse in both models and we compare the average reconstruction errors with both recovery methods. The results show that analysis-based recovery is the better option for a large number of signals, but it is less robust with signals that are only approximately sparse or when fewer measurements are available. © 2012 EURASIP.
Stowell D, Musevic S, Bonada J, Plumbley MD (2013) Improved multiple birdsong tracking with distribution derivative method and Markov renewal process clustering, ICASSP pp. 468-472 IEEE
Jaillet F, Gribonval R, Plumbley MD, Zayyani H (2010) An L1 criterion for dictionary learning by subspace identification, ICASSP pp. 5482-5485 IEEE
Robertson A, Plumbley MD (2015) Event-based Multitrack Alignment using a Probabilistic Framework, Journal of New Music Research 44(2) pp. 71-82 Routledge Journals, Taylor & Francis Ltd
Vincent E, Plumbley MD (2005) A prototype system for object coding of musical audio, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics pp. 239-242
This article deals with low bitrate object coding of musical audio, and more precisely with the extraction of pitched sound objects in polyphonic music. After a brief review of existing methods, we discuss the potential benefits of recasting this problem in a Bayesian framework. We define pitched objects by a set of probabilistic priors and derive efficient algorithms to infer active objects and their parameters. Preliminary experiments suggest that the proposed method results in a better sound quality than simple sinusoidal coding while achieving a lower bitrate. © 2005 IEEE.
Davies MEP, Plumbley MD, Eck D (2009) Towards a musical beat emphasis function, WASPAA pp. 61-64 IEEE
O'Hanlon K, Nagano H, Plumbley MD (2012) Structured sparsity for automatic music transcription, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 441-444
Sparse representations have previously been applied to the automatic music transcription (AMT) problem. Structured sparsity, such as group and molecular sparsity, allows the introduction of prior knowledge to sparse representations. Molecular sparsity has previously been proposed for AMT, but the use of greedy group sparsity has not previously been proposed for this problem. We propose a greedy sparse pursuit based on nearest subspace classification for groups with coherent blocks, based in a non-negative framework, and apply this to AMT. Further to this, we propose an enhanced molecular variant of this group sparse algorithm and demonstrate the effectiveness of this approach. © 2012 IEEE.
Stowell D, Plumbley M (2007) Adaptive whitening for improved real-time audio onset detection, Proceedings of the 2007 International Computer Music Conference, ICMC 2007 pp. 312-319 Michigan Publishing
We describe a new method for preprocessing STFT phase vocoder frames for improved performance in real-time onset detection, which we term "adaptive whitening". The procedure involves normalising the magnitude of each bin according to a recent maximum value for that bin, with the aim of allowing each bin to achieve a similar dynamic range over time, which helps to mitigate the influence of spectral roll-off and strongly-varying dynamics. Adaptive whitening requires no training, is relatively lightweight to compute, and can run in real-time. Yet it can improve onset detector performance by more than ten percentage points (peak F-measure) in some cases, and improves the performance of most of the onset detectors tested. We present results demonstrating that adaptive whitening can significantly improve the performance of various STFT-based onset detection functions, including functions based on the power, spectral flux, phase deviation, and complex deviation measures. Our results find the process to be especially beneficial for certain types of audio signal (e.g. complex mixtures such as pop music).
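As an illustration of the per-bin normalisation described in this abstract, here is a minimal Python sketch. It is not the authors' implementation: the function name `adaptive_whiten`, the exponential decay used to make the running maximum "recent", and the `floor` value are all assumptions made for this example.

```python
import numpy as np

def adaptive_whiten(mags, decay=0.997, floor=1e-3):
    """Normalise each STFT bin by a decaying running maximum of that bin.

    mags  : array of shape (frames, bins) holding non-negative magnitudes.
    decay : per-frame decay of the remembered peak (assumed parameter).
    floor : lower bound on the peak, avoiding division by ~0 (assumed).
    """
    mags = np.asarray(mags, dtype=float)
    peaks = np.full(mags.shape[1], floor)      # running peak per bin
    out = np.empty_like(mags)
    for n, frame in enumerate(mags):
        # The peak follows the current magnitude upwards immediately and
        # otherwise decays slowly, so it tracks a "recent" maximum.
        peaks = np.maximum(np.maximum(frame, floor), decay * peaks)
        out[n] = frame / peaks                 # whitened magnitudes in [0, 1]
    return out
```

Because every bin is divided by its own recent peak, a quiet high-frequency bin and a loud low-frequency bin end up with a similar dynamic range before an onset detection function is applied.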
Abdallah S, Plumbley MD (2004) Application of geometric dependency analysis to the separation of convolved mixtures, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3195 pp. 540-547
We investigate a generalisation of the structure of frequency domain ICA as applied to the separation of convolved mixtures, and show how a geometric representation of residual dependency can be used both as an aid to visualisation and intuition, and as a tool for clustering components into independent subspaces, thus providing a solution to the source separation problem. © Springer-Verlag 2004.
Murray-Browne T, Mainstone D, Bryan-Kinns N, Plumbley MD (2011) The medium is the message: Composing instruments and performing mappings, pp. 56-59
Many performers of novel musical instruments find it difficult to engage audiences beyond those in the field. Previous research points to a failure to balance complexity with usability, and a loss of transparency due to the detachment of the controller and sound generator. The issue is often exacerbated by an audience's lack of prior exposure to the instrument and its workings. However, we argue that there is a conflict underlying many novel musical instruments in that they are intended to be both a tool for creative expression and a creative work of art in themselves, resulting in incompatible requirements. By considering the instrument, the composition and the performance together as a whole with careful consideration of the rate of learning demanded of the audience, we propose that a lack of transparency can become an asset rather than a hindrance. Our approach calls for not only controller and sound generator to be designed in sympathy with each other, but composition, performance and physical form too. Identifying three design principles, we illustrate this approach with the Serendiptichord, a wearable instrument for dancers created by the authors.
Non-negative Matrix Factorisation (NMF) is a popular tool in musical signal processing. However, problems using this methodology in the context of Automatic Music Transcription (AMT) have been noted, resulting in the proposal of supervised and constrained variants of NMF for this purpose. Group sparsity has previously been seen to be effective for AMT when used with stepwise methods. In this paper group sparsity is introduced to supervised NMF decompositions and a dictionary tuning approach to AMT is proposed based upon group sparse NMF using the β-divergence. Experimental results are given showing improved AMT results over the state-of-the-art NMF-based AMT system.
Bird audio detection (BAD) aims to detect whether or not there is a bird call in an audio recording. One difficulty of this task is that bird sound datasets are weakly labelled: only the presence or absence of a bird in a recording is known, not when the birds call. We propose to apply a joint detection and classification (JDC) model to the weakly labelled data (WLD) to detect and classify an audio clip at the same time. First, we apply a VGG-like convolutional neural network (CNN) to the mel spectrogram as a baseline. Then we propose a JDC-CNN model with VGG as a classifier and a CNN as a detector. We report that denoising methods, including optimally-modified log-spectral amplitude (OM-LSA), median filtering and spectral spectrogram, worsen the classification accuracy, contrary to previous work. JDC-CNN can predict the time stamps of events from weakly labelled data, so it is able to perform sound event detection from WLD. We obtained an area under the curve (AUC) of 95.70% on the development data and 81.36% on the unseen evaluation data, which is nearly comparable to the baseline CNN model.
Simpson A, Roma G, Girgis E, Mason R, Hummersone C, Liutkus A, Plumbley M (2016) Evaluation of Audio Source Separation Models Using Hypothesis-Driven Non-Parametric Statistical Methods, European Signal Processing Conference (EUSIPCO) 2016
Audio source separation models are typically evaluated using objective separation quality measures, but rigorous statistical methods have yet to be applied to the problem of model comparison. As a result, it can be difficult to establish whether or not reliable progress is being made during the development of new models. In this paper, we provide a hypothesis-driven statistical analysis of the results of the recent source separation SiSEC challenge involving twelve competing models tested on separation of voice and accompaniment from fifty pieces of "professionally produced" contemporary music. Using nonparametric statistics, we establish reliable evidence for meaningful conclusions about the performance of the various models.
An increasing number of researchers work in computational auditory scene analysis (CASA). However, a set of tasks, each with a well-defined evaluation framework and commonly used datasets, does not yet exist. Thus, it is difficult for results and algorithms to be compared fairly, which hinders research in the field. In this paper we introduce a newly-launched public evaluation challenge dealing with two closely related tasks of the field: acoustic scene classification and event detection. We give an overview of the tasks involved; describe the processes of creating the dataset; and define the evaluation metrics. Finally, illustrations of results for both tasks using baseline methods applied to this dataset are presented, accompanied by open-source code. © 2013 EURASIP
Nesbit A, Vincent E, Plumbley MD (2009) Benchmarking flexible adaptive time-frequency transforms for underdetermined audio source separation, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 37-40
We have implemented several fast and flexible adaptive lapped orthogonal transform (LOT) schemes for underdetermined audio source separation. This is generally addressed by time-frequency masking, requiring the sources to be disjoint in the time-frequency domain. We have already shown that disjointness can be increased via adaptive dyadic LOTs. By taking inspiration from the windowing schemes used in many audio coding frameworks, we improve on earlier results in two ways. Firstly, we consider non-dyadic LOTs which match the time-varying signal structures better. Secondly, we allow for a greater range of overlapping window profiles to decrease window boundary artifacts. This new scheme is benchmarked through oracle evaluations, and is shown to decrease computation time by over an order of magnitude compared to using very general schemes, whilst maintaining high separation performance and flexible signal adaptivity. As the results demonstrate, this work may find practical applications in high fidelity audio source separation. ©2009 IEEE.
Figueira LA, Cannam C, Plumbley MD (2013) Software techniques for good practice in audio and music research, 134th Audio Engineering Society Convention 2013 pp. 273-280
In this paper we discuss how software development can be improved in the audio and music research community by implementing tighter and more effective development feedback loops. We suggest first that researchers in an academic environment can benefit from the straightforward application of peer code review, even for ad-hoc research software; and second, that researchers should adopt automated software unit testing from the start of research projects. We discuss and illustrate how to adopt both code reviews and unit testing in a research environment. Finally, we observe that the use of a software version control system provides support for the foundations of both code reviews and automated unit tests. We therefore also propose that researchers should use version control with all their projects from the earliest stage.
Hockman JA, Bello JP, Davies MEP, Plumbley MD (2008) Automated rhythmic transformation of musical audio, Proceedings - 11th International Conference on Digital Audio Effects, DAFx 2008 pp. 177-180
Time-scale transformations of audio signals have traditionally relied exclusively upon manipulations of tempo. We present a novel technique for automatic mixing and synchronization between two musical signals. In this transformation, the original signal assumes the tempo, meter, and rhythmic structure of the model signal, while the extracted downbeats and salient intra-measure infrastructure of the original are maintained.
Plumbley MD (2014) Separating Musical Audio Signals, Acoustics Bulletin 39(6) pp. 44-47 Institute of Acoustics
As consumers move increasingly to multichannel and surround-sound reproduction of sound, and also perhaps wish to remix their music to suit their own tastes, there will be an increasing need for high quality automatic source separation to recover sound sources from legacy mono or 2-channel stereo recordings. In this contribution, we give an overview of some methods for audio source separation, and some of the remaining research challenges in this area.
Girgis E, Roma G, Simpson A, Plumbley M (2016) Single Channel Audio Source Separation using Deep Neural Network Ensembles, AES Convention Proceedings Audio Engineering Society
Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper, a combination of different DNNs' predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.
O'Brien C, Plumbley M (2016) Sparse Kernel Dictionary Learning, Proceedings of the 11th IMA International Conference on Mathematics in Signal Processing
Dictionary Learning (DL) has seen widespread use in signal processing and machine learning. Given a data set, DL seeks to find a so-called "dictionary" such that the data can be well represented by a sparse linear combination of dictionary elements. The representational power of DL may be extended by the use of kernel mappings, which implicitly map the data to some high dimensional feature space. In Kernel DL we wish to learn a dictionary in this underlying high-dimensional feature space, which can often model the data more accurately than learning in the original space. Kernel DL is more challenging than the linear case, however, since we no longer have access to the dictionary atoms directly, only their relationship to the data via the kernel matrix. One strategy is therefore to represent the dictionary as a linear combination of the input data whose coefficients can be learned during training [1], relying on the fact that any optimal dictionary lies in the span of the data. A difficulty in Kernel DL is that given a data set of size N, the full (N × N) kernel matrix needs to be manipulated at each iteration, and dealing with such a large dense matrix can be extremely slow for big datasets. Here, we impose an additional constraint of sparsity on the coefficients so that the learned dictionary is given by a sparse linear combination of the input data. This greatly speeds up learning, and furthermore the speed-up is greater for larger datasets and can be tuned via a dictionary-sparsity parameter. The proposed approach thus combines Kernel DL with the "double sparse" DL model [2] in which the learned dictionary is given by a sparse reconstruction over some base dictionary (in this case, the data itself).
We investigate the use of sparse Kernel DL as a feature learning step for a music transcription task and compare it to another Kernel DL approach based on the K-SVD algorithm [1] (which does not lead to sparse dictionaries in general), in terms of computation time and performance. Initial experiments show that Sparse Kernel DL is significantly faster than the non-sparse Kernel DL approach (6× to 8× speed-up depending on the size of the training data and the sparsity level) while leading to similar performance.
Mailhe B, Sturm B, Plumbley MD (2013) Behavior of greedy sparse representation algorithms on nested supports, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 5710-5714
In this work, we study the links between the recovery properties of sparse signals for Orthogonal Matching Pursuit (OMP) and the whole General MP class over nested supports. We show that the optimality of those algorithms is not locally nested: there is a dictionary and supports I and J with J included in I such that OMP will recover all signals of support I, but not all signals of support J. We also show that the optimality of OMP is globally nested: if OMP can recover all s-sparse signals, then it can recover all s2-sparse signals with s2 smaller than s. We also provide a tighter version of Donoho and Elad's spark theorem, which allows us to complete Tropp's proof that sparse representation algorithms can only be optimal for all s-sparse signals if s is strictly lower than half the spark of the dictionary. © 2013 IEEE.
Plumbley MD (2006) Recovery of Sparse Representations by Polytope Faces Pursuit, ICA 3889 pp. 206-213 Springer
Jafari MG, Vincent E, Abdallah SA, Plumbley MD, Davies ME (2008) An adaptive stereo basis method for convolutive blind audio source separation, Neurocomputing 71(10-12) pp. 2087-2097
Jafari MG, Plumbley MD (2008) Separation of stereo speech signals based on a sparse dictionary algorithm, European Signal Processing Conference
We address the problem of source separation in echoic and anechoic environments, with a new algorithm which adaptively learns a set of sparse stereo dictionary elements, which are then clustered to identify the original sources. The atom pairs learned by the algorithm are found to capture information about the direction of arrival of the source signals, which allows the clusters to be determined. A similar approach is also used here to extend the dictionary learning K singular value decomposition (K-SVD) algorithm to address the source separation problem, and results from the two methods are compared. Computer simulations indicate that the proposed adaptive sparse stereo dictionary (ASSD) algorithm yields good performance in both anechoic and echoic environments. © EURASIP.
Xu Y, Huang Q, Wang W, Plumbley MD (2016) Hierarchical Learning for DNN-Based Acoustic Scene Classification, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016) pp. 105-109 Tampere University of Technology
In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. Firstly, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add more constraints to the cross-entropy based loss function. A series of experiments were conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement on average scene classification error as compared with the Gaussian Mixture Model (GMM)-based benchmark system across four standard folds.
Adler A, Emiya V, Jafari MG, Elad M, Gribonval R, Plumbley MD (2012) Audio inpainting, IEEE Transactions on Audio, Speech and Language Processing 20(3) pp. 922-932
We propose the audio inpainting framework that recovers portions of audio data distorted due to impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted data are treated as missing and their location is assumed to be known. The signal is decomposed into overlapping time-domain frames and the restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The Signal-to-Noise Ratio performance of this algorithm is shown to be comparable to or better than state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio. © 2006 IEEE.
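The per-frame step described in this abstract (Orthogonal Matching Pursuit over a discrete cosine dictionary, using only the samples known to be reliable) might be sketched in Python as follows. This is an illustrative reading, not the paper's code: the names `dct_dictionary` and `omp_inpaint` and the fixed `n_atoms` stopping rule are assumptions, and framing, overlap-add and the declipping constraints are omitted.

```python
import numpy as np

def dct_dictionary(n):
    """Square DCT-II dictionary with unit-norm columns (one atom per column)."""
    t = np.arange(n)[:, None] + 0.5
    k = np.arange(n)[None, :]
    D = np.cos(np.pi * t * k / n)
    return D / np.linalg.norm(D, axis=0)

def omp_inpaint(frame, observed, D, n_atoms):
    """Fill the missing samples of one frame with OMP on the observed rows.

    frame    : 1-D signal frame (values at missing positions are ignored).
    observed : boolean mask, True where the sample is reliable.
    n_atoms  : number of atoms to select (assumed stopping rule).
    """
    D_obs = D[observed]                        # dictionary seen through the mask
    norms = np.linalg.norm(D_obs, axis=0)      # renormalise the truncated atoms
    y = frame[observed]
    residual = y.copy()
    support, coeffs = [], np.zeros(0)
    for _ in range(n_atoms):
        corr = np.abs(D_obs.T @ residual) / norms
        if support:
            corr[support] = 0.0                # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Orthogonal step: least-squares fit on all atoms selected so far.
        coeffs, *_ = np.linalg.lstsq(D_obs[:, support], y, rcond=None)
        residual = y - D_obs[:, support] @ coeffs
    restored = frame.copy()
    restored[~observed] = (D[:, support] @ coeffs)[~observed]
    return restored
```

Only the missing positions are overwritten, so reliable samples pass through unchanged; the sparse model fitted to the observed samples supplies the estimate inside the gap.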
Jafari M, Hedayioglu F, Coimbra M, Plumbley M (2011) Blind source separation of periodic sources from sequentially recorded instantaneous mixtures, Proceedings of the 7th International Symposium on Image and Signal Processing and Analysis (ISPA 2011) pp. 540-545
We consider the separation of sources when only one movable sensor is available to record a set of mixtures at distinct locations. A single mixture signal is acquired, which is firstly segmented. Then, based on the assumption that the underlying sources are temporally periodic, we align the resulting signals and form a measurement vector on which source separation can be performed. We demonstrate that this approach can successfully recover the original sources both when working with simulated data, and for a real problem of heart sound separation. © 2011 University of Zagreb.
Gretsistas A, Plumbley MD (2010) A Multichannel Spatial Compressed Sensing Approach for Direction of Arrival Estimation, LVA/ICA 6365 pp. 458-465 Springer
Davies MEP, Plumbley MD (2005) Beat tracking with a two state model [music applications], ICASSP (3) pp. 241-244 IEEE
O'Hanlon K, Nagano H, Plumbley MD (2013) Using Oracle Analysis for Decomposition-Based Automatic Music Transcription, 7900 pp. 353-365 Springer Berlin Heidelberg
One approach to Automatic Music Transcription (AMT) is to decompose a spectrogram with a dictionary matrix that contains a pitch-labelled note spectrum atom in each column. AMT performance is typically measured using frame-based comparison, while an alternative perspective is to use an event-based analysis. We have previously proposed an AMT system, based on the use of structured sparse representations. The method is described and experimental results are given, which are seen to be promising. An inspection of the graphical AMT output known as a piano roll may lead one to think that the performance may be slightly better than is suggested by the AMT metrics used. This leads us to perform an oracle analysis of the AMT system, with some interesting outcomes which may have implications for decomposition based AMT in general.
Nishimori Y, Akaho S, Plumbley MD (2008) Natural Conjugate Gradient on Complex Flag Manifolds for Complex Independent Subspace Analysis, ICANN (1) 5163 pp. 165-174 Springer
Bugmann G, Sojka P, Reiss M, Plumbley M, Taylor JG (1992) Direct Approaches to Improving the Robustness of Multilayer Neural Networks, pp. 1063-1066 North-Holland
Multilayer neural networks trained with backpropagation are in general not robust against the loss of a hidden neuron. In this paper we define a form of robustness called 1-node robustness and propose methods to improve it. One approach is based on a modification of the error function by the addition of a "robustness error". It leads to more robust networks but at the cost of a reduced accuracy. A second approach, "pruning-and-duplication", consists of duplicating the neurons whose loss is the most damaging for the network. Pruned neurons are used for the duplication. This procedure leads to robust and accurate networks at low computational cost. It may also prove beneficial for generalisation. Both methods are evaluated on the XOR function.
Davies MEP, Plumbley MD (2004) Causal Tempo Tracking of Audio., ISMIR
Plumbley MD, Cichocki A, Bro R (2010) Non-negative mixtures, In: Handbook of Blind Source Separation pp. 515-547
This chapter discusses some algorithms for the use of non-negativity constraints in unmixing problems, including positive matrix factorization, nonnegative matrix factorization (NMF), and their combination with other unmixing methods such as non-negative independent component analysis and sparse non-negative matrix factorization. The 2D models can be naturally extended to multiway array (tensor) decompositions, especially non-negative tensor factorization (NTF) and non-negative Tucker decomposition (NTD). The standard NMF model has been extended in various ways, including semi-NMF, multilayer NMF, tri-NMF, orthogonal NMF, nonsmooth NMF, and convolutive NMF. While gradient descent is a simple procedure, convergence can be slow and sensitive to the step size. This can be overcome by applying multiplicative update rules, which have proved particularly popular in NMF. These multiplicative update rules are attractive since they are simple, do not need the selection of an update parameter, and their multiplicative nature and the non-negative terms on the RHS ensure that the elements cannot become negative. © 2010 Elsevier Ltd. All rights reserved.
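To make the point about multiplicative update rules concrete, the following Python sketch uses the widely known Lee and Seung updates for the Euclidean cost. The chapter covers many variants, so this is one representative form rather than the chapter's own algorithm; the function name `nmf_multiplicative`, the random initialisation and the small constant `eps` are assumptions of the example.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V as W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps   # strictly positive start
    H = rng.random((rank, V.shape[1])) + eps
    errors = []
    for _ in range(n_iter):
        # Each factor is multiplied by a ratio of non-negative terms, so no
        # step size is needed and no element can become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors
```

The returned `errors` list records the Frobenius reconstruction error after each iteration, which is a convenient way to watch convergence without tuning any learning rate.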
Stark AM, Plumbley MD (2012) Performance following: Real-time prediction of musical sequences without a score, IEEE Transactions on Audio, Speech and Language Processing 20(1) pp. 178-187
This paper introduces a technique for predicting harmonic sequences in a musical performance for which no score is available, using real-time audio signals. Recent short-term information is aligned with longer term information, contextualizing the present within the past, allowing predictions about the future of the performance to be made. Using a mid-level representation in the form of beat-synchronous harmonic sequences, we reduce the size of the information needed to represent the performance. This allows the implementation of real-time performance following in live performance situations. We conduct an objective evaluation on a database of rock, pop, and folk music. Our results show that we are able to predict a large majority of repeated harmonic content with no prior knowledge in the form of a score. © 2011 IEEE.
Badeau R, Plumbley MD (2013) Probabilistic time-frequency source-filter decomposition of non-stationary signals, Proceedings of the 21st European Signal Processing Conference 2013
Probabilistic modelling of non-stationary signals in the time-frequency (TF) domain has been an active research topic recently. Various models have been proposed, notably in the nonnegative matrix factorization (NMF) literature. In this paper, we propose a new TF probabilistic model that can represent a variety of stationary and non-stationary signals, such as autoregressive moving average (ARMA) processes, uncorrelated noise, damped sinusoids, and transient signals. This model also generalizes and improves both the Itakura-Saito (IS)-NMF and high resolution (HR)-NMF models. © 2013 EURASIP.
Kroos Christian, Plumbley Mark (2017) Neuroevolution for sound event detection in real life audio: A pilot study, Detection and Classification of Acoustic Scenes and Events (DCASE 2017) Proceedings 2017 Tampere University of Technology
Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44.9% and achieved rank 15 out of 34 on the ER error metric with a value of 0.891. We discuss the question of evolving versus learning for supervised tasks.
Foster P, Klapuri A, Plumbley MD (2011) Causal prediction of continuous-valued music features, Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, pp. 501-506
This paper investigates techniques for predicting sequences of continuous-valued feature vectors extracted from musical audio. In particular, we consider prediction of beat-synchronous Mel-frequency cepstral coefficients and chroma features in a causal setting, where features are predicted as they unfold in time. The methods studied comprise autoregressive models, N-gram models incorporating a smoothing scheme, and a novel technique based on repetition detection using a self-distance matrix. Furthermore, we propose a method for combining predictors, which relies on a running estimate of the error variance of the predictors to inform a linear weighting of the predictor outputs. Results indicate that incorporating information on long-term structure improves the prediction performance for continuous-valued, sequential musical data. For the Beatles data set, combining the proposed self-distance based predictor with both N-gram and autoregressive methods results in an average of 13% improvement compared to a linear predictive baseline. © 2011 International Society for Music Information Retrieval.
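The idea of weighting predictors by a running estimate of their error variance, as described in the abstract above, can be sketched as follows. This is only one plausible reading: the inverse-variance weighting, the exponential smoothing with `decay`, the initial variance of 1.0 and the function name are assumptions for illustration, not the scheme from the paper.

```python
def combine_predictions(pred_streams, targets, decay=0.9, eps=1e-9):
    """Causally combine several scalar predictors (hypothetical helper).

    pred_streams[k][t] is predictor k's forecast of targets[t].
    The weights used at time t depend only on errors observed before t.
    """
    n_pred = len(pred_streams)
    var = [1.0] * n_pred  # running error-variance estimates (assumed initial value)
    combined = []
    for t, target in enumerate(targets):
        # Linear weights proportional to inverse running error variance.
        inv = [1.0 / (v + eps) for v in var]
        total = sum(inv)
        weights = [w / total for w in inv]
        combined.append(sum(w * p[t] for w, p in zip(weights, pred_streams)))
        # Only after the true value is observed, update each variance estimate.
        for k in range(n_pred):
            err = pred_streams[k][t] - target
            var[k] = decay * var[k] + (1.0 - decay) * err * err
    return combined


# Example: an accurate predictor and a biased predictor of a constant signal.
targets = [1.0] * 20
combined = combine_predictions([[1.0] * 20, [3.0] * 20], targets)
```

In this toy example the combined forecast starts as the plain average of the two predictors and moves towards the accurate one as evidence about the errors accumulates.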
Stowell D, Plumbley M (2012) Multi-target pitch tracking of vibrato sources in noise using the GM-PHD filter, Proceedings of 5th International Workshop on Machine Learning and Music (MML 2012) pp. 27-28
Probabilistic approaches to tracking often use single-source Bayesian models; applying these to multi-source tasks is problematic. We apply a principled multi-object tracking implementation, the Gaussian mixture probability hypothesis density filter, to track multiple sources having fixed pitch plus vibrato. We demonstrate high-quality filtering in a synthetic experiment, and find improved tracking using a richer feature set which captures underlying dynamics. Our implementation is available as open-source Python code.
Foster P, Sigtia S, Krstulovic S, Barker J, Plumbley MD (2015) CHiME-Home: A dataset for sound source recognition in a domestic environment, Proc 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 18-21 Oct. 2015 IEEE
Evangelista G, Marchand S, Plumbley MD, Vincent E (2011) Sound Source Separation, In: Zölzer U (ed.), DAFX: Digital Audio Effects 14 pp. 551-588 John Wiley & Sons, Ltd
Abdallah SA, Plumbley MD (2012) Predictive Information Rate in Discrete-time Gaussian Processes,
We derive expressions for the predictive information rate (PIR) for the class of autoregressive Gaussian processes AR(N), both in terms of the prediction coefficients and in terms of the power spectral density. The latter result suggests a duality between the PIR and the multi-information rate for processes with mutually inverse power spectra (i.e. with poles and zeros of the transfer function exchanged). We investigate the behaviour of the PIR in relation to the multi-information rate for some simple examples, which suggest, somewhat counter-intuitively, that the PIR is maximised for very 'smooth' AR processes whose power spectra have multiple poles at zero frequency. We also obtain results for moving average Gaussian processes which are consistent with the duality conjectured earlier. One consequence of this is that the PIR is unbounded for MA(N) processes.
O'Hanlon K, Plumbley M (2013) Learning overcomplete dictionaries with ℓ0-sparse Non-negative Matrix Factorisation, 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP) pp. 977-980 IEEE
Non-negative Matrix Factorisation (NMF) is a popular tool in which a 'parts-based' representation of a non-negative matrix is sought. NMF tends to produce sparse decompositions. This sparsity is a desirable property in many applications, and Sparse NMF (S-NMF) methods have been proposed to enhance this feature. Typically these enforce sparsity through use of a penalty term, and an ℓ1 norm penalty term is often used. However, an ℓ1 penalty term may not be appropriate in a non-negative framework. In this paper the use of an ℓ0 norm penalty for NMF is proposed, approximated using backwards elimination from an initial NNLS decomposition. Dictionary recovery experiments using overcomplete dictionaries show that this method outperforms both NMF and a state-of-the-art S-NMF method, in particular when the dictionary to be learnt is dense.
Stowell D, Plumbley MD (2013) Segregating Event Streams and Noise with a Markov Renewal Process Model, Journal of Machine Learning Research 14 pp. 2213-2238
We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via synthetic experiments as well as an experiment to track a mixture of singing birds. Source code is available.
Roma G, Simpson A, Girgis E, Plumbley M (2016) Remixing musical audio on the web using source separation, Proceedings of the 2nd Web Audio Conference (WAC-2016)
Research in audio source separation has progressed a long way, producing systems that are able to approximate the component signals of sound mixtures. In recent years, many efforts have focused on learning time-frequency masks that can be used to filter a monophonic signal in the frequency domain. Using current web audio technologies, time-frequency masking can be implemented in a web browser in real time. This allows applying source separation techniques to arbitrary audio streams, such as internet radios, depending on cross-domain security configurations. While producing good quality separated audio from monophonic music mixtures is still challenging, current methods can be applied to remixing scenarios, where part of the signal is emphasized or deemphasized. This paper describes a system for remixing musical audio on the web by applying time-frequency masks estimated using deep neural networks. Our example prototype, implemented in client-side JavaScript, provides reasonable quality results for small modifications.
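To make the remixing idea in the abstract above concrete, here is a minimal sketch of applying a soft time-frequency mask to emphasise or de-emphasise one source in a mixture spectrogram. It is written in Python rather than the JavaScript of the described prototype, and the ratio-mask formula, the `gain` parameter and both function names are assumptions for illustration; in the described system the mask would come from a deep neural network rather than from known source magnitudes.

```python
def soft_mask(target_mag, other_mag, eps=1e-12):
    """Ratio mask: per-bin share of the target in the estimated magnitudes."""
    return [[t / (t + o + eps) for t, o in zip(rt, ro)]
            for rt, ro in zip(target_mag, other_mag)]


def remix(mixture_spec, mask, gain=1.0):
    """Rescale the masked part of each complex bin by gain.

    gain = 1 leaves the mixture unchanged, gain < 1 de-emphasises the
    target source, and gain > 1 emphasises it.
    """
    return [[x * (1.0 + (gain - 1.0) * m) for x, m in zip(rx, rm)]
            for rx, rm in zip(mixture_spec, mask)]


# Example: one frame with two frequency bins (complex STFT values).
mixture = [[1 + 1j, 2 + 0j]]
mask = soft_mask([[1.0, 0.0]], [[1.0, 2.0]])
quieter_target = remix(mixture, mask, gain=0.0)
```

The modified spectrogram would then be passed through an inverse short-time Fourier transform to obtain the remixed audio; that step is omitted here.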
Davies MEP, Brossier PM, Plumbley MD (2005) Beat tracking towards automatic musical accompaniment, Audio Engineering Society - 118th Convention Spring Preprints 2005 2 pp. 751-757
In this paper we address the issue of causal rhythmic analysis, primarily towards predicting the locations of musical beats such that they are consistent with a musical audio input. This will be a key component required for a system capable of automatic accompaniment with a live musician. We are implementing our approach as part of the aubio real-time audio library. While performance for this causal system is reduced in comparison to our previous non-causal system, it is still suitable for our intended purpose.
Kroos Christian, Bundgaard-Nielsen RL, Best CT, Plumbley Mark (2017) Using deep neural networks to estimate tongue movements from speech face motion, Proceedings of AVSP 2017 KTH
This study concludes a tripartite investigation into the indirect visibility of the moving tongue in human speech as reflected in co-occurring changes of the facial surface. We were in particular interested in how the shared information is distributed over the range of contributing frequencies. In the current study we examine the degree to which tongue movements during speech can be reliably estimated from face motion using artificial neural networks. We simultaneously acquired data for both movement types; tongue movements were measured with Electromagnetic Articulography (EMA), face motion with a passive marker-based motion capture system. A multiresolution analysis using wavelets provided the desired decomposition into frequency subbands. In the two earlier studies of the project we established linear and non-linear relations between lingual and facial speech motions, as predicted and compatible with previous research in auditory-visual speech. The results of the current study using a Deep Neural Network (DNN) for prediction show that a substantive amount of variance can be recovered (between 13.9 and 33.2% dependent on the speaker and tongue sensor location). Importantly, however, the recovered variance values and the root mean squared error values of the Euclidean distances between the measured and the predicted tongue trajectories are in the range of the linear estimations of our earlier study.
Grais Emad M, Roma G, Simpson AJR, Plumbley Mark (2017) Discriminative Enhancement for Single Channel Audio Source Separation using Deep Neural Networks, LNCS 10169 pp. 236-246
The sources separated by most single channel audio source separation techniques are usually distorted and each separated source contains residual signals from the other sources. To tackle this problem, we propose to enhance the separated sources to decrease the distortion and interference between the separated sources using deep neural networks (DNNs). Two different DNNs are used in this work. The first DNN is used to separate the sources from the mixed signal. The second DNN is used to enhance the separated signals. To consider the interactions between the separated sources, we propose to use a single DNN to enhance all the separated sources together. To reduce the residual signals of one source from the other separated sources (interference), we train the DNN for enhancement discriminatively to maximize the dissimilarity between the predicted sources. The experimental results show that using discriminative enhancement decreases the distortion and interference between the separated sources.
Sobieraj I, Plumbley MD (2016) Coupled Sparse NMF vs. Random Forest Classification for Real Life Acoustic Event Detection, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016) pp. 90-94
In this paper, we propose two methods for polyphonic Acoustic Event Detection (AED) in real life environments. The first method is based on Coupled Sparse Non-negative Matrix Factorization (CSNMF) of spectral representations and their corresponding class activity annotations. The second method is based on Multi-class Random Forest (MRF) classification of time-frequency patches. We compare the performance of the two methods on a recently published dataset TUT Sound Events 2016 containing data from home and residential area environments. Both methods show comparable performance to the baseline system proposed for DCASE 2016 Challenge on the development dataset with MRF outperforming the baseline on the evaluation dataset.
Grais Emad M, Roma Gerard, Simpson Andrew J. R., Plumbley Mark (2017) Two Stage Single Channel Audio Source Separation using Deep Neural Networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(9) pp. 1469-1479 IEEE
Most single channel audio source separation (SCASS) approaches produce separated sources accompanied by interference from other sources and other distortions. To tackle this problem, we propose to separate the sources in two stages. In the first stage, the sources are separated from the mixed signal. In the second stage, the interference between the separated sources and the distortions are reduced using deep neural networks (DNNs). We propose two methods that use DNNs to improve the quality of the separated sources in the second stage. In the first method, each separated source is improved individually using its own trained DNN, while in the second method all the separated sources are improved together using a single DNN. To further improve the quality of the separated sources, the DNNs in the second stage are trained discriminatively to further decrease the interference and the distortions of the separated sources. Our experimental results show that using two stages of separation improves the quality of the separated signals by decreasing the interference between the separated sources and distortions compared to separating the sources using a single stage of separation.
Kong Q, Xu Y, Wang W, Plumbley MD (2017) A joint detection-classification model for audio tagging of weakly labelled data, Proceedings of ICASSP 2017 IEEE
Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labeling of an audio clip is often based on the audio events in the clip and no event level label is provided to the user. Previous works using the bag of frames model assume that the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative sounds and ignore uninformative sounds. Then only informative regions are used for classification. Experimental results on the 'CHiME Home' dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing the event level label.
Li S, Dixon S, Plumbley M (2017) Clustering expressive timing with regressed polynomial coefficients demonstrated by a model selection test, Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017) pp. 457-463
Though many past works have tried to cluster expressive timing within a phrase, there have been few attempts to cluster features of expressive timing with constant dimensions regardless of phrase lengths. For example, used as a way to represent expressive timing, tempo curves can be regressed by a polynomial function such that the number of regressed polynomial coefficients remains constant with a given order regardless of phrase lengths. In this paper, clustering the regressed polynomial coefficients is proposed for expressive timing analysis. A model selection test is presented to compare Gaussian Mixture Models (GMMs) fitting regressed polynomial coefficients and fitting expressive timing directly. As there are no expected results of clustering expressive timing, the proposed method is demonstrated by how well the expressive timing is approximated by the centroids of GMMs. The results show that GMMs fitting the regressed polynomial coefficients outperform GMMs fitting expressive timing directly. This conclusion suggests that it is possible to use regressed polynomial coefficients to represent expressive timing within a phrase and cluster expressive timing within phrases of different lengths.
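The point that a polynomial fit yields a fixed number of coefficients regardless of phrase length can be illustrated with a short sketch. The normalisation of beat positions to [0, 1], the default order of 2, the least-squares solution via normal equations and the function name are assumptions for this example rather than details from the paper.

```python
def fit_tempo_polynomial(tempo, order=2):
    """Least-squares polynomial fit of a tempo curve (hypothetical helper).

    Beat positions are normalised to [0, 1], so phrases of any length map
    to order + 1 coefficients, returned with the constant term first.
    """
    n = len(tempo)
    if n <= order:
        raise ValueError("need more tempo values than the polynomial order")
    xs = [i / (n - 1) for i in range(n)]
    size = order + 1
    # Normal equations A c = b with A[j][k] = sum x^(j+k), b[j] = sum y x^j.
    A = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    b = [sum(y * x ** j for x, y in zip(xs, tempo)) for j in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, size):
            f = A[r][col] / A[col][col]
            for k in range(col, size):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for j in reversed(range(size)):
        s = sum(A[j][k] * coeffs[k] for k in range(j + 1, size))
        coeffs[j] = (b[j] - s) / A[j][j]
    return coeffs


# Example: two phrases of different lengths give feature vectors of equal size.
short_phrase = [100 + 20 * (i / 7) - 30 * (i / 7) ** 2 for i in range(8)]
long_phrase = [100 + 20 * (i / 11) - 30 * (i / 11) ** 2 for i in range(12)]
features = [fit_tempo_polynomial(short_phrase), fit_tempo_polynomial(long_phrase)]
```

Fixed-length vectors of this kind could then be passed to a clustering model such as a Gaussian mixture model, which is the step the abstract goes on to describe.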
Huang Q, Xu Y, Jackson P, Wang W, Plumbley M (2017) Fast Tagging of Natural Sounds Using Marginal Co-regularization, Proceedings of ICASSP2017 IEEE
Automatic and fast tagging of natural sounds in audio collections is a very challenging task due to wide acoustic variations, the large number of possible tags, and the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier, and augmenting the tags with additional relevant tags. Training the classifiers is implemented using marginal co-regularization, which draws the two classifiers into agreement by a joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task4 of DCASE2016. The results obtained show that marginal co-regularization outperforms the baseline GMM in both efficiency and effectiveness.
Xu Yong, Huang Qiang, Wang Wenwu, Foster Peter, Sigtia S, Jackson Philip, Plumbley Mark (2017) Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(6) pp. 1230-1241 IEEE
Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper we make contributions to audio tagging in two parts, respectively, acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multi-label classification for the expected tags, considering that only chunk (or utterance) level rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-Filter Banks (MFBs) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system can get a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains the state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task while EER of the first prize of this challenge is 0.17.
Benetos E, Lafay G, Lagrange M, Plumbley MD (2017) Polyphonic Sound Event Tracking using Linear Dynamical Systems, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(6) pp. 1266-1277 IEEE
In this paper, a system for polyphonic sound event detection and tracking is proposed, based on spectrogram factorisation techniques and state space models. The system extends probabilistic latent component analysis (PLCA) and is modelled around a 4-dimensional spectral template dictionary of frequency, sound event class, exemplar index, and sound state. In order to jointly track multiple overlapping sound events over time, the integration of linear dynamical systems (LDS) within the PLCA inference is proposed. The system assumes that the PLCA sound event activation is the (noisy) observation in an LDS, with the latent states corresponding to the true event activations. LDS training is achieved using fully observed data, making use of ground truth-informed event activations produced by the PLCA-based model. Several LDS variants are evaluated, using polyphonic datasets of office sounds generated from an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the integration of LDS tracking within PLCA leads to an improvement of +8.5-10.5% in terms of frame-based F-measure as compared to the use of the PLCA model alone. In addition, the proposed system outperforms several state-of-the-art methods for the task of polyphonic sound event detection.
Sigtia S, Stark A, Krstulovic S, Plumbley MD (2016) Automatic environmental sound recognition: Performance versus computational cost, IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(11) pp. 2096-2107 IEEE
In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where notions of product pricing and form factor impose hard constraints on the available computing power. Whereas Automatic Environmental Sound Recognition (AESR) algorithms are most often developed with limited consideration for computational cost, this article seeks which AESR algorithm can make the most of a limited amount of computing power by comparing the sound classification performance as a function of its computational cost. Results suggest that Deep Neural Networks yield the best ratio of sound classification accuracy across a range of computational costs, while Gaussian Mixture Models offer a reasonable accuracy at a consistently small cost, and Support Vector Machines stand between both in terms of compromise between accuracy and computational cost.
Li S, Black D, Plumbley MD (2016) The Clustering of Expressive Timing Within a Phrase in Classical Piano Performances by Gaussian Mixture Models, Music, Mind, and Embodiment: 11th International Symposium, CMMR 2015, Plymouth, UK, June 16-19, 2015, Revised Selected Papers (Lecture Notes in Computer Science, vol. 9617) pp. 322-345
In computational musicology research, clustering is a common approach to the analysis of expression. Our research uses mathematical model selection criteria to evaluate the performance of clustered and non-clustered models applied to intra-phrase tempo variations in classical piano performances. By engaging different standardisation methods for the tempo variations and engaging different types of covariance matrices, multiple pieces of performances are used for evaluating the performance of candidate models. The results of tests suggest that the clustered models perform better than the non-clustered models and the original tempo data should be standardised by the mean of tempo within a phrase.
Grais Emad M, Plumbley Mark D (2017) Single Channel Audio Source Separation using Convolutional Denoising Autoencoders, GlobalSIP 2017 Proceedings pp. 1265-1269 IEEE
Deep learning techniques have been used recently to tackle the audio source separation problem. In this work, we propose to use deep fully convolutional denoising autoencoders (CDAEs) for monaural audio source separation. We use as many CDAEs as the number of sources to be separated from the mixed signal. Each CDAE is trained to separate one source and treats the other sources as background noise. The main idea is to allow each CDAE to learn suitable spectral-temporal filters and features to its corresponding source. Our experimental results show that CDAEs perform source separation slightly better than the deep feedforward neural networks (FNNs) even with fewer parameters than FNNs.
Zermini Alfredo, Liu Qingju, Xu Yong, Plumbley Mark, Betts Dave, Wang Wenwu (2017) Binaural and Log-Power Spectra Features with Deep Neural Networks for Speech-Noise Separation, Proceedings of MMSP 2017 - IEEE 19th International Workshop on Multimedia Signal Processing IEEE
Binaural features of interaural level difference and interaural phase difference have proved to be very effective in training deep neural networks (DNNs) to generate time-frequency masks for target speech extraction in speech-speech mixtures. However, the effectiveness of binaural features is reduced in more common speech-noise scenarios, since the noise may overshadow the speech in adverse conditions. In addition, reverberation also decreases the sparsity of binaural features and therefore adds difficulties to the separation task. To address the above limitations, we highlight the spectral difference between speech and noise spectra and incorporate the log-power spectra features to extend the DNN input. Tested on two different reverberant rooms at different signal to noise ratios (SNR), our proposed method shows advantages over the baseline method using only binaural features in terms of signal to distortion ratio (SDR) and Short-Time Objective Intelligibility (STOI).
Zermini A, Wang W, Kong Q, Xu Y, Plumbley M (2017) Audio source separation with deep neural networks using the dropout algorithm, Signal Processing with Adaptive Sparse Structured Representations (SPARS) 2017 Book of Abstracts pp. 1-2 Instituto de Telecomunicações
A method based on Deep Neural Networks (DNNs) and time-frequency masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener which is then used to generate soft time-frequency masks for the recovery/estimation of the individual audio sources. In this paper, an algorithm called 'dropout' will be applied to the hidden layers, affecting the sparsity of hidden unit activations: randomly selected neurons and their connections are dropped during the training phase, preventing feature co-adaptation. These methods are evaluated on binaural mixtures generated with Binaural Room Impulse Responses (BRIRs), accounting for a certain level of room reverberation. The results show that the proposed DNN system with randomly deleted neurons is able to achieve higher SDR performance compared to the baseline method without the dropout algorithm.
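The dropout step described in the abstract above, in which randomly selected hidden units are dropped during training only, can be sketched as follows. This uses the common "inverted" variant that rescales the surviving activations during training; that rescaling choice, the function name and its parameters are assumptions for illustration and are not taken from the paper.

```python
import random


def dropout(activations, drop_prob, training=True, rng=None):
    """Apply inverted dropout to a list of hidden-unit activations.

    During training each unit is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob); at test time the
    activations pass through unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Example: half of the units are dropped on average during training.
hidden = [0.2, 1.5, 0.7, 0.9]
train_out = dropout(hidden, drop_prob=0.5, rng=random.Random(0))
test_out = dropout(hidden, drop_prob=0.5, training=False)
```

Because a different random subset of units is active on each training pass, no unit can rely on the presence of particular other units, which is the co-adaptation effect the abstract refers to.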
Zermini A, Yu Y, Xu Y, Plumbley M, Wang W (2016) Deep neural network based audio source separation, Proceedings of the 11th IMA International Conference on Mathematics in Signal Processing pp. 1-4 Institute of Mathematics & its Applications (IMA)
Audio source separation aims to extract individual sources from mixtures of multiple sound sources. Many techniques have been developed such as independent component analysis, computational auditory scene analysis, and non-negative matrix factorisation. A method based on Deep Neural Networks (DNNs) and time-frequency (T-F) masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener which is then used to generate soft T-F masks for the recovery/estimation of the individual audio sources.
O'Brien C, Plumbley M (2017) Automatic Music Transcription Using Low Rank Non-Negative Matrix Decomposition, EUSIPCO 2017 Proceedings
Automatic Music Transcription (AMT) is concerned with the problem of producing the pitch content of a piece of music given a recorded signal. Many methods rely on sparse or low rank models, where the observed magnitude spectra are represented as a linear combination of dictionary atoms corresponding to individual pitches. Some of the most successful approaches use Non-negative Matrix Decomposition (NMD) or Factorization (NMF), which can be used to learn a dictionary and pitch activation matrix from a given signal. Here we introduce a further refinement of NMD in which we assume the transcription itself is approximately low rank. The intuition behind this approach is that the total number of distinct activation patterns should be relatively small since the pitch content between adjacent frames should be similar. A rank penalty is introduced into the NMD objective function and solved using an iterative algorithm based on Singular Value thresholding. We find that the low rank assumption leads to a significant increase in performance compared to NMD using β-divergence on a standard AMT dataset.
Ward Dominic, Wierstorf Hagen, Mason Russell, Plumbley Mark, Hummersone Christopher (2017) Estimating the loudness balance of musical mixtures using audio source separation, Proceedings of the 3rd Workshop on Intelligent Music Production (WIMP 2017)
To assist with the development of intelligent mixing systems, it would be useful to be able to extract the loudness balance of sources in an existing musical mixture. The relative-to-mix loudness level of four instrument groups was predicted using the sources extracted by 12 audio source separation algorithms. The predictions were compared with the ground truth loudness data of the original unmixed stems obtained from a recent dataset involving 100 mixed songs. It was found that the best source separation system could predict the relative loudness of each instrument group with an average root-mean-square error of 1.2 LU, with superior performance obtained on vocals.
Xu Yong, Kong Qiuqiang, Huang Qiang, Wang Wenwu, Plumbley Mark (2017) Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging, Proceedings of Interspeech 2017 pp. 3083-3087 ISCA
Audio tagging aims to perform multi-label classification on audio chunks and it is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has a chunk-level label without a frame-level label. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the acoustic events that occurred. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on Task 4 of the DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set with equal error rate (EER) reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline system.
The first generation of three-dimensional Electromagnetic Articulography devices (Carstens AG500) suffered from occasional critical tracking failures. Although now superseded by new devices, the AG500 is still in use in many speech labs and many valuable data sets exist. In this study we investigate whether deep neural networks (DNNs) can learn the mapping function from raw voltage amplitudes to sensor positions based on a comprehensive movement data set. This is compared to arriving sample by sample at individual position values via direct optimisation as used in previous methods. We found that with appropriate hyperparameter settings a DNN was able to approximate the mapping function with good accuracy, leading to a smaller error than the previous methods, but that the DNN-based approach was not able to solve the tracking problem completely.
Xu Y, Kong Q, Huang Q, Wang W, Plumbley M (2017) Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging, IJCNN 2017 Conference Proceedings IEEE
Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.
Simpson A, Roma G, Girgis E, Mason R, Hummersone C, Plumbley M (2017) Psychophysical Evaluation of Audio Source Separation Methods, LNCS: Latent Variable Analysis and Signal Separation 10169 pp. 211-221 Springer
Source separation evaluation is typically a top-down process, starting with perceptual measures which capture fitness-for-purpose and followed by attempts to find physical (objective) measures that are predictive of the perceptual measures. In this paper, we take a contrasting bottom-up approach. We begin with the physical measures provided by the Blind Source Separation Evaluation Toolkit (BSS Eval) and we then look for corresponding perceptual correlates. This approach is known as psychophysics and has the distinct advantage of leading to interpretable, psychophysical models. We obtained perceptual similarity judgments from listeners in two experiments featuring vocal sources within musical mixtures. In the first experiment, listeners compared the overall quality of vocal signals estimated from musical mixtures using a range of competing source separation methods. In a loudness experiment, listeners compared the loudness balance of the competing musical accompaniment and vocal. Our preliminary results provide provisional validation of the psychophysical approach.
Kong Q, Sobieraj I, Wang W, Plumbley MD (2016) Deep Neural Network Baseline for DCASE Challenge 2016, Proceedings of DCASE 2016
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 81.0% using Mel + DNN against 77.2% using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 12.6% using Mel + DNN against 37.0% using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained an F value of 36.3% using Mel + DNN against 23.7% using MFCCs + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 18.9% using Mel + DNN against 20.9% using MFCCs + GMM. Therefore the DNN improves on the baseline in Tasks 1, 3 and 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.
Girgis E, Roma G, Simpson A, Plumbley M (2016) Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks, Interspeech 2016 Proceedings ISCA
Deep neural networks (DNNs) are usually used for single channel source separation to predict either soft or binary time frequency masks. The masks are used to separate the sources from the mixed signal. Binary masks produce separated sources with more distortion and less interference than soft masks. In this paper, we propose to use another DNN to combine the estimates of binary and soft masks to achieve the advantages and avoid the disadvantages of using each mask individually. We aim to achieve separated sources with low distortion and low interference between each other. Our experimental results show that combining the estimates of binary and soft masks using DNN achieves lower distortion than using each estimate individually and achieves as low interference as the binary mask.
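As a rough illustration of the two mask types discussed above, the following Python sketch computes a soft (ratio) mask and a binary mask from known source magnitudes. This is a simplification for illustration only: in the paper the masks are predicted by DNNs, the mixture magnitude is approximated here as the sum of the source magnitudes, and all function names and toy values are hypothetical.

```python
def soft_mask(target_mag, other_mag, eps=1e-12):
    """Ratio mask in [0, 1]: share of each time-frequency bin owned by the target."""
    return [[t / (t + o + eps) for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_mag, other_mag)]


def binary_mask(target_mag, other_mag):
    """Binary mask: 1 where the target is strictly louder than the other source."""
    return [[1.0 if t > o else 0.0 for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_mag, other_mag)]


def apply_mask(mask, mixture_mag):
    """Element-wise product of a mask with the mixture magnitude spectrogram."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture_mag)]


# Toy 2x2 magnitude spectrograms (rows: frequency bins, columns: time frames).
vocals = [[3.0, 1.0], [0.0, 2.0]]
accompaniment = [[1.0, 1.0], [4.0, 2.0]]
# Assumption: magnitudes add, which only holds approximately for real audio.
mixture = [[v + a for v, a in zip(v_row, a_row)]
           for v_row, a_row in zip(vocals, accompaniment)]

soft = soft_mask(vocals, accompaniment)
hard = binary_mask(vocals, accompaniment)
vocals_soft = apply_mask(soft, mixture)  # shares each bin between the sources
vocals_hard = apply_mask(hard, mixture)  # assigns each bin to one source only
```

The binary mask hands whole bins to one source, so the other source is either fully kept or fully removed in each bin, whereas the soft mask shares every bin between the two sources.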
Rencker Lucas, Wang Wenwu, Plumbley Mark (2017) Multivariate Iterative Hard Thresholding for sparse decomposition with flexible sparsity patterns, Proceedings of the European Signal Processing Conference (EUSIPCO) 2017 pp. 2220-2224 EUSIPCO
We address the problem of decomposing several consecutive sparse signals, such as audio time frames or image patches. A typical approach is to process each signal sequentially and independently, with an arbitrary sparsity level fixed for each signal. Here, we propose to process several frames simultaneously, allowing for more flexible sparsity patterns to be considered. We propose a multivariate sparse coding approach, where sparsity is enforced on average across several frames. We propose a Multivariate Iterative Hard Thresholding to solve this problem. The usefulness of the proposed approach is demonstrated on audio coding and denoising tasks. Experiments show that the proposed approach leads to better results when the signal contains both transients and tonal components.
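The difference between a fixed per-frame sparsity level and sparsity enforced on average across frames can be sketched as follows. This shows only a thresholding step, not the full iterative algorithm (which would alternate it with a gradient step), and the function names and toy coefficients are hypothetical.

```python
def threshold_per_frame(frames, k):
    """Keep the k largest-magnitude coefficients in each frame independently."""
    result = []
    for frame in frames:
        order = sorted(range(len(frame)), key=lambda j: abs(frame[j]), reverse=True)
        keep = set(order[:k])
        result.append([c if j in keep else 0.0 for j, c in enumerate(frame)])
    return result


def threshold_joint(frames, k_average):
    """Keep the k_average * len(frames) largest coefficients across all frames."""
    budget = k_average * len(frames)
    positions = [(i, j) for i, frame in enumerate(frames) for j in range(len(frame))]
    positions.sort(key=lambda p: abs(frames[p[0]][p[1]]), reverse=True)
    keep = set(positions[:budget])
    return [[c if (i, j) in keep else 0.0 for j, c in enumerate(frame)]
            for i, frame in enumerate(frames)]


frames = [[5.0, 4.0, 3.0, 0.1],   # frame with many significant coefficients
          [0.2, 0.1, 2.0, 0.05]]  # frame with a single significant coefficient
per_frame = threshold_per_frame(frames, 2)
joint = threshold_joint(frames, 2)
```

With the same total budget of four coefficients, the joint version spends three on the first frame and one on the second, instead of forcing two on each.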
Sobieraj Iwona, Kong Qiuqiang, Plumbley Mark (2017) Masked Non-negative Matrix Factorization for Bird Detection Using Weakly Labeled Data, EUSIPCO 2017 Proceedings IEEE
Acoustic monitoring of bird species is an increasingly important field in signal processing. Many available bird sound datasets do not contain the exact timestamp of the bird call but have a coarse weak label instead. Traditional Non-negative Matrix Factorization (NMF) models are not well designed to deal with weakly labeled data. In this paper we propose a novel Masked Non-negative Matrix Factorization (Masked NMF) approach for bird detection using weakly labeled data. During dictionary extraction we introduce a binary mask on the activation matrix. In that way we are able to control which parts of the dictionary are used to reconstruct the training data. We compare our method with conventional NMF approaches and current state-of-the-art methods. The proposed method outperforms the NMF baseline and offers a parsimonious model for bird detection on weakly labeled data. Moreover, to our knowledge, the proposed Masked NMF achieved the best result among non-deep learning methods on a test dataset used for the recent Bird Audio Detection Challenge.
The sources separated by most single channel audio source separation techniques are usually distorted and each separated source contains residual signals from the other sources. To tackle this problem, we propose to enhance the separated sources by decreasing the distortion and interference between the separated sources using deep neural networks (DNNs). Two different DNNs are used in this work. The first DNN is used to separate the sources from the mixed signal. The second DNN is used to enhance the ...
Qin Z, Gao Yue, Plumbley Mark D. (2018) Malicious User Detection Based on Low-Rank Matrix Completion in Wideband Spectrum Sensing, IEEE Transactions on Signal Processing 66(1) pp. 5-17 IEEE
In cognitive radio networks, cooperative spectrum sensing (CSS) has been a promising approach to improve sensing performance by utilizing spatial diversity of participating secondary users (SUs). In current CSS networks, all cooperative SUs are assumed to be honest and genuine. However, the presence of malicious users sending out dishonest data can severely degrade the performance of CSS networks. In this paper, a framework with high detection accuracy and low costs of data acquisition at SUs is developed, with the purpose of mitigating the influences of malicious users. More specifically, a low-rank matrix completion based malicious user detection framework is proposed. In the proposed framework, in order to avoid requiring any prior information about the CSS network, a rank estimation algorithm and an estimation strategy for the number of corrupted channels are proposed. Numerical results show that the proposed malicious user detection framework achieves high detection accuracy with lower data acquisition costs in comparison with the conventional approach. After being validated by simulations, the proposed malicious user detection framework is tested on the real-world signals over TV white space spectrum.
Mesaros Annamaria, Heittola Toni, Benetos Emmanouil, Foster Peter, Lagrange Mathieu, Virtanen Tuomas, Plumbley Mark D. (2017) Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge, IEEE/ACM Transactions on Audio, Speech and Language Processing 26(2) pp. 379-393 Institute of Electrical and Electronics Engineers (IEEE)
Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016) has offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present in detail each task and analyse the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.
Plumbley, Mark D (2016) Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary, 3 Sep 2016, Tampere University of Technology. Department of Signal Processing
Xu Yong, Kong Qiuqiang, Wang Wenwu, Plumbley Mark (2017) Surrey-CVSSP system for DCASE2017 challenge task 4, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017) Tampere University of Technology. Laboratory of Signal Processing
In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings due to their industry applications. There are two subtasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found to be effective in improving performance. In summary, we get a 61% F-value for the audio tagging subtask and a 0.73 error rate (ER) for the sound event detection subtask on the development set, while the official multilayer perceptron (MLP) based baseline obtained just a 13.1% F-value for audio tagging and an ER of 1.02 for sound event detection.
Benetos E, Stowell D, Plumbley M (2018) Approaches to complex sound scene analysis, In: Virtanen T, Plumbley M, Ellis D (eds.), Computational Analysis of Sound Scenes and Events pp. 215-242 Springer International Publishing
Giannoulis D, Barchiesi D, Klapuri A, Plumbley M (2011) On the disjointess of sources in music using different time-frequency representations, 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) pp. 261-264 IEEE
This paper studies the disjointness of the time-frequency representations of simultaneously playing musical instruments. As a measure of disjointness, we use the approximate W-disjoint orthogonality as proposed by Yilmaz and Rickard [1], which (loosely speaking) measures the degree of overlap of different sources in the time-frequency domain. The motivation for this study is to find a maximally disjoint representation in order to facilitate the separation and recognition of musical instruments in mixture signals. The transforms investigated in this paper include the short-time Fourier transform (STFT), constant-Q transform, modified discrete cosine transform (MDCT), and pitch-synchronous lapped orthogonal transforms. Simulation results are reported for a database of polyphonic music where the multitrack data (instrument signals before mixing) were available. Absolute performance varies depending on the instrument source in question, but on the average MDCT with 93 ms frame size performed best.
Ellis D, Virtanen T, Plumbley M, Raj B (2018) Future Perspective, In: Virtanen T, Plumbley M, Ellis D (eds.), Computational Analysis of Sound Scenes and Events pp. 401-415 Springer International Publishing
Virtanen T, Plumbley M, Ellis D (2018) Introduction to sound scene and event analysis, In: Virtanen T, Plumbley M, Ellis D (eds.), Computational Analysis of Sound Scenes and Events pp. 3-12 Springer
Barchiesi D, Plumbley M (2011) Dictionary learning of convolved signals, 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing pp. 5812-5815 IEEE
Assuming that a set of source signals is sparsely representable in a given dictionary, we show how their sparse recovery fails whenever we can only measure a convolved observation of them. Starting from this motivation, we develop a block coordinate descent method which aims to learn a convolved dictionary and provide a sparse representation of the observed signals with small residual norm. We compare the proposed approach to the K-SVD dictionary learning algorithm and show through numerical experiment on synthetic signals that, provided some conditions on the problem data, our technique converges in a fixed number of iterations to a sparse representation with smaller residual norm.
Plumbley, Mark D (2018) Computational Analysis of Sound Scenes and Events, Springer International Publishing
This book presents computational methods for extracting the useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms. Gives an overview of methods for computational analysis of sounds scenes and events, allowing those new to the field to become fully informed; Covers all the aspects of the machine learning approach to computational analysis of sound scenes and events, ranging from data capture and labeling process to development of algorithms; Includes descriptions of algorithms accompanied by a website from which software implementations can be downloaded, facilitating practical interaction with the techniques.
Xu Yong, Kong Qiuqiang, Wang Wenwu, Plumbley Mark (2018) Large-scale weakly supervised audio classification using gated convolutional neural network, Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 121-125 IEEE
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly supervised sound event detection task of Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, which are extracted from YouTube videos, are manually labelled with one or more audio tags, but without time stamps of the audio events, hence referred to as weakly labelled data. Two subtasks are defined in this challenge including audio tagging and sound event detection using this weakly labelled data. We propose a convolutional recurrent neural network (CRNN) with learnable gated linear units (GLUs) non-linearity applied on the log Mel spectrogram. In addition, we propose a temporal attention method along the frames to predict the locations of each audio event in a chunk from the weakly labelled data. The performances of our systems were ranked the 1st and the 2nd as a team in these two sub-tasks of DCASE 2017 challenge with F value 55.6% and Equal error 0.73, respectively.
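The gated linear unit non-linearity mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the general GLU idea (a linear output multiplied by a sigmoid gate), not the authors' network: the two input lists stand in for the outputs of two parallel convolution branches, and all names and values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_unit(linear_out, gate_out):
    """Multiply each linear output by a sigmoid gate from a parallel branch."""
    return [a * sigmoid(b) for a, b in zip(linear_out, gate_out)]


# Hypothetical outputs of two parallel branches for three features.
linear_branch = [2.0, -1.0, 0.5]
gate_branch = [0.0, 10.0, -10.0]
gated = gated_linear_unit(linear_branch, gate_branch)
# A gate input of 0 halves the feature, a large positive input passes it
# almost unchanged, and a large negative input suppresses it.
```

Because the gate values are learned, the network can decide per feature and per frame how much information to pass on, which is the sense in which the gating selects informative local features.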
Baume Christopher, Plumbley Mark, Frohlich David, Calic Janko (2018) PaperClip: A Digital Pen Interface for Semantic Speech Editing in Radio Production, Journal of the Audio Engineering Society 66(4) pp. 241-252 Audio Engineering Society
We introduce 'PaperClip', a novel digital pen interface for semantic editing of speech recordings for radio production. We explain how we designed and developed our system, then present the results of a contextual qualitative user study of eight professional radio producers that compared editing using PaperClip to a screen-based interface and normal paper. As in many other paper-versus-screen studies, we found no overall preferences but rather advantages and disadvantages of both in different contexts. We discuss these relative benefits and make recommendations for future development.
Ward Dominic, Wierstorf Hagen, Mason Russell, Grais Emad M., Plumbley Mark (2018) BSS Eval or PEASS? Predicting the perception of singing-voice separation, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 596-600 Institute of Electrical and Electronics Engineers (IEEE)
There is some uncertainty as to whether objective metrics for predicting the perceived quality of audio source separation are sufficiently accurate. This issue was investigated by employing a revised experimental methodology to collect subjective ratings of sound quality and interference of singing-voice recordings that have been extracted from musical mixtures using state-of-the-art audio source separation. A correlation analysis between the experimental data and the measures of two objective evaluation toolkits, BSS Eval and PEASS, was performed to assess their performance. The artifacts-related perceptual score of the PEASS toolkit had the strongest correlation with the perception of artifacts and distortions caused by singing-voice separation. Both the source-to-interference ratio of BSS Eval and the interference-related perceptual score of PEASS showed comparable correlations with the human ratings of interference.
Kong Qiuqiang, Xu Yong, Wang Wenwu, Plumbley Mark D (2018) A joint separation-classification model for sound event detection of weakly labelled data, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 321-325 Institute of Electrical and Electronics Engineers (IEEE)
Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events from an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, only the tags of an audio recording are known but the times of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio recording to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of each audio event. In the source separation stage, the sources of audio events and the times of sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming a deep neural network baseline of 0.29. A source separation SDR of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as the probability mapping, outperforming the global max pooling (GMP) based probability mapping, which gives an SDR of 0.03 dB. Source code of our work is published.
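Global weighted rank pooling, as it is commonly defined, can be sketched as below: the mask values are sorted in descending order and averaged with geometrically decaying weights. The decay parameter name `r` and the toy values are assumptions for illustration, and the published source code may differ in detail.

```python
def global_weighted_rank_pooling(mask_values, r):
    """Pool T-F mask values into one probability using rank-based weights.

    Values are sorted in descending order and the j-th largest gets weight
    r ** j, so r = 0 reduces to max pooling and r = 1 to average pooling.
    """
    ranked = sorted(mask_values, reverse=True)
    weights = [r ** j for j in range(len(ranked))]
    return sum(w * v for w, v in zip(weights, ranked)) / sum(weights)


mask = [0.2, 0.9, 0.4]  # flattened toy segmentation mask for one event class
as_max = global_weighted_rank_pooling(mask, 0.0)
as_mean = global_weighted_rank_pooling(mask, 1.0)
in_between = global_weighted_rank_pooling(mask, 0.5)
```

Intermediate values of `r` sit between the two extremes, so the pooled probability depends on more than the single largest mask value without treating all bins equally.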
Huang Qiang, Jackson Philip, Plumbley Mark D., Wang Wenwu (2018) Synthesis of images by two-stage generative adversarial networks, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 1593-1597 Institute of Electrical and Electronics Engineers (IEEE)
In this paper, we propose a divide-and-conquer approach using two generative adversarial networks (GANs) to explore how a machine can draw colorful pictures (bird) using a small amount of training data. In our work, we simulate the procedure of an artist drawing a picture, where one begins with drawing objects' contours and edges and then paints them different colors. We adopt two GAN models to process basic visual features including shape, texture and color. We use the first GAN model to generate object shape, and then paint the black and white image based on the knowledge learned using the second GAN model. We run our experiments on 600 color images. The experimental results show that the use of our approach can generate good quality synthetic images, comparable to real ones.
Kong Qiuqiang, Xu Yong, Wang Wenwu, Plumbley Mark (2018) Audio set classification with attention model: a probabilistic perspective, Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 316-320 IEEE
This paper investigates Audio Set classification. Audio Set is a large-scale weakly labelled dataset (WLD) of audio clips. In a WLD only the presence of a label is known, without knowing when the labelled events occur. We propose an attention model to solve this WLD problem and explain the attention model from a novel probabilistic perspective. Each audio clip in Audio Set consists of a collection of features. We call each feature an instance and the collection a bag, following the terminology in multiple instance learning. In the attention model, each instance in the bag has a trainable probability measure for each class. The classification of the bag is the expectation of the classification output of the instances in the bag with respect to the learned probability measure. Experiments show that the proposed attention model achieves a mAP of 0.327 on Audio Set, outperforming Google's baseline of 0.314.
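The bag-level prediction described above, an expectation of instance-level outputs under a learned probability measure, can be sketched for a single class as follows. In the paper both quantities come from trainable networks; here they are plain lists, and the function name and toy numbers are hypothetical.

```python
def attention_pooling(instance_probs, attention_scores):
    """Bag-level probability as an attention-weighted average of instances.

    instance_probs: per-instance classification outputs in [0, 1].
    attention_scores: non-negative scores, normalised here to sum to one so
    that they form a probability measure over the instances in the bag.
    """
    total = sum(attention_scores)
    weights = [s / total for s in attention_scores]
    return sum(w * p for w, p in zip(weights, instance_probs))


# One class, three instances (for example, three segments of an audio clip).
instance_probs = [0.9, 0.1, 0.5]
focused = attention_pooling(instance_probs, [3.0, 1.0, 0.0])
uniform = attention_pooling(instance_probs, [1.0, 1.0, 1.0])
```

With uniform scores the result is the plain average of the instance outputs; non-uniform scores let the model concentrate on the instances most likely to contain the labelled event.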
Grais Emad M, Plumbley Mark (2018) Combining Fully Convolutional and Recurrent Neural Networks for Single Channel Audio Source Separation, Proceedings of 144th AES Convention Audio Engineering Society
Combining different models is a common strategy to build a good audio source separation system. In this work, we combine two powerful deep neural networks for audio single channel source separation (SCSS). Namely, we combine fully convolutional neural networks (FCNs) and recurrent neural networks, specifically, bidirectional long short-term memory recurrent neural networks (BLSTMs). FCNs are good at extracting useful features from the audio data and BLSTMs are good at modeling the temporal structure of the audio signals. Our experimental results show that combining FCNs and BLSTMs achieves better separation performance than using each model individually.
Roma G, Girgis E, Simpson A, Plumbley M (2016) Music Remixing and Upmixing Using Source Separation, Proceedings of the 2nd AES Workshop on Intelligent Music Production
Current research on audio source separation provides tools to estimate the signals contributed by different instruments in polyphonic music mixtures. Such tools can already be incorporated in music production and post-production workflows. In this paper, we describe recent experiments where audio source separation is applied to remixing and upmixing existing mono and stereo music content.
Baume Chris, Plumbley Mark D., Ćalić Janko, Frohlich David (2018) A Contextual Study of Semantic Speech Editing in Radio Production, International Journal of Human-Computer Studies 115 pp. 67-80 Elsevier
Radio production involves editing speech-based audio using tools that represent sound using simple waveforms. Semantic speech editing systems allow users to edit audio using an automatically generated transcript, which has the potential to improve the production workflow. To investigate this, we developed a semantic audio editor based on a pilot study. Through a contextual qualitative study of five professional radio producers at the BBC, we examined the existing radio production process and evaluated our semantic editor by using it to create programmes that were later broadcast. We observed that the participants in our study wrote detailed notes about their recordings and used annotation to mark which parts they wanted to use. They collaborated closely with the presenter of their programme to structure the contents and write narrative elements. Participants reported that they often work away from the office to avoid distractions, and print transcripts so they can work away from screens. They also emphasised that listening is an important part of production, to ensure high sound quality. We found that semantic speech editing with automated speech recognition can be used to improve the radio production workflow, but that annotation, collaboration, portability and listening were not well supported by current semantic speech editing systems. In this paper, we make recommendations on how future semantic speech editing systems can better support the requirements of radio production.
Rencker Lucas, Bach F, Wang Wenwu, Plumbley Mark D (2018) Consistent dictionary learning for signal declipping, Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science 10891 pp. 446-455 Springer Verlag
Clipping, or saturation, is a common nonlinear distortion in signal processing. Recently, declipping techniques have been proposed based on sparse decomposition of the clipped signals on a fixed dictionary, with additional constraints on the amplitude of the clipped samples. Here we propose a dictionary learning approach, where the dictionary is directly learned from the clipped measurements. We propose a soft-consistency metric that minimizes the distance to a convex feasibility set, and takes into account our knowledge about the clipping process. We then propose a gradient descent-based dictionary learning algorithm that minimizes the proposed metric, and is thus consistent with the clipping measurement. Experiments show that the proposed algorithm outperforms other dictionary learning algorithms applied to clipped signals. We also show that learning the dictionary directly from the clipped signals outperforms consistent sparse coding with a fixed dictionary.
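One way to read the idea of a distance to a feasibility set is sketched below: unclipped samples must match the observation, while clipped samples only need to lie at or beyond the clipping level. This is an illustrative interpretation of the abstract, not necessarily the exact metric used in the paper, and the function and variable names are hypothetical.

```python
def soft_consistency_cost(reconstruction, observed, clip_level):
    """Squared distance from a reconstruction to the set of signals that
    would produce the observed clipped samples."""
    cost = 0.0
    for x_hat, y in zip(reconstruction, observed):
        if y >= clip_level:
            # Clipped from above: any value >= clip_level is consistent.
            residual = max(clip_level - x_hat, 0.0)
        elif y <= -clip_level:
            # Clipped from below: any value <= -clip_level is consistent.
            residual = max(x_hat + clip_level, 0.0)
        else:
            # Reliable sample: the reconstruction should match it exactly.
            residual = y - x_hat
        cost += residual ** 2
    return cost


observed = [0.2, 1.0, -1.0]  # toy signal clipped at +/- 1.0
consistent = soft_consistency_cost([0.2, 1.5, -1.3], observed, 1.0)
inconsistent = soft_consistency_cost([0.1, 0.8, -0.5], observed, 1.0)
```

A reconstruction that exceeds the clipping level at the clipped positions incurs no penalty, which is what distinguishes this cost from a plain squared error against the clipped samples.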
Sobieraj Iwona, Rencker Lucas, Plumbley Mark D (2018) Orthogonality-regularized masked NMF for learning on weakly labeled audio data, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 2436-2440 Institute of Electrical and Electronics Engineers (IEEE)
Non-negative Matrix Factorization (NMF) is a well established tool for audio analysis. However, it is not well suited for learning on weakly labeled data, i.e. data where the exact timestamp of the sound of interest is not known. In this paper we propose a novel extension to NMF, that allows it to extract meaningful representations from weakly labeled audio data. Recently, a constraint on the activation matrix was proposed to adapt for learning on weak labels. To further improve the method we propose to add an orthogonality regularizer of the dictionary in the cost function of NMF. In that way we obtain appropriate dictionaries for the sounds of interest and background sounds from weakly labeled data. We demonstrate that the proposed Orthogonality-Regularized Masked NMF (ORM-NMF) can be used for Audio Event Detection of rare events and evaluate the method on the development data from Task2 of DCASE2017 Challenge.
O'Brien Cian, Plumbley Mark (2018) Inexact proximal operators for ℓp-quasinorm minimization, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 4724-4728 Institute of Electrical and Electronics Engineers (IEEE)
Proximal methods are an important tool in signal processing applications, where many problems can be characterized by the minimization of an expression involving a smooth fitting term and a convex regularization term, for example the classic ℓ1-Lasso. Such problems can be solved using the relevant proximal operator. Here we consider the use of proximal operators for the ℓp-quasinorm where 0 ≤ p ≤ 1. Rather than seek a closed form solution, we develop an iterative algorithm using a Majorization-Minimization procedure which results in an inexact operator. Experiments on image denoising show that for p ≤ 1 the algorithm is effective in the high-noise scenario, outperforming the Lasso despite the inexactness of the proximal step.
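For the classic Lasso case mentioned above, the proximal operator of the scaled 1-norm has a well-known closed form, soft thresholding, sketched below. The inexact operator for the p-quasinorm developed in the paper is not reproduced here; the function name and toy values are chosen for illustration only.

```python
import math


def soft_threshold(values, lam):
    """Proximal operator of lam * ||x||_1: shrink each entry towards zero by
    lam and set entries with magnitude below lam to zero."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in values]


noisy = [3.0, -0.5, -2.0]
denoised = soft_threshold(noisy, 1.0)
```

Small entries are removed entirely while large entries are shrunk by a fixed amount, which is the behaviour that promotes sparse solutions.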
Grais Emad M, Wierstorf Hagen, Ward Dominic, Plumbley Mark D (2018) Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation, Proceedings of LVA/ICA 2018 (Lecture Notes in Computer Science) 10891 pp. 340-350 Springer Verlag
In deep neural networks with convolutional layers, all the neurons in each layer typically have the same size receptive fields (RFs) with the same resolution. Convolutional layers with neurons that have large RFs capture global information from the input features, while layers with neurons that have small RFs capture local details with high resolution from the input features. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCN), where each layer has a range of neurons with different RF sizes to extract multi-resolution features that capture the global and local information from its input features. The proposed MR-FCN is applied to separate the singing voice from mixtures of music sources. Experimental results show that using MR-FCN improves the performance compared to feedforward deep neural networks (DNNs) and single resolution deep fully convolutional neural networks (FCNs) on the audio source separation problem.
Radio production is a creative pursuit that uses sound to inform, educate and entertain an audience. Radio producers use audio editing tools to visually select, re-arrange and assemble sound recordings into programmes. However, current tools represent audio using waveform visualizations that display limited information about the sound. Semantic audio analysis can be used to extract useful information from audio recordings, including when people are speaking and what they are saying. This thesis investigates how such information can be applied to create semantic audio tools that improve the radio production process. An initial ethnographic study of radio production at the BBC reveals that producers use textual representations and paper transcripts to interact with audio, and waveforms to edit programmes. Based on these findings, three methods for improving radio production are developed and evaluated, which form the primary contribution of this thesis. Audio visualizations can be enhanced by mapping semantic audio features to colour, but this approach had not been formally tested. We show that with an enhanced audio waveform, a typical radio production task can be completed faster, with less effort and with greater accuracy than a normal waveform. Speech recordings can be represented and edited using transcripts, but this approach had not been formally evaluated for radio production. By developing and testing a semantic speech editor, we show that automatically-generated transcripts can be used to semantically edit speech in a professional radio production context, and identify requirements for annotation, collaboration, portability and listening. Finally, we present a novel approach for editing audio on paper that combines semantic speech editing with a digital pen interface. Through a user study with radio producers, we compare the relative benefits of semantic speech editing using paper and screen interfaces. 
We find that paper is better for simple edits of familiar audio with accurate transcripts.
Zermini Alfredo, Kong Qiuqiang, Xu Yong, Plumbley Mark D., Wang Wenwu (2018) Improving reverberant speech separation with binaural cues using temporal context and convolutional neural networks, In: Deville Y, Gannot S, Mason R, Plumbley Mark, Ward D (eds.), Latent Variable Analysis and Signal Separation: LVA/ICA 2018. Lecture Notes in Computer Science 10891 pp. 361-371 Springer
Given binaural features as input, such as interaural level difference and interaural phase difference, Deep Neural Networks (DNNs) have been recently used to localize sound sources in a mixture of speech signals and/or noise, and to create time-frequency masks for the estimation of the sound sources in reverberant rooms. Here, we explore a more advanced system, where feed-forward DNNs are replaced by Convolutional Neural Networks (CNNs). In addition, the adjacent frames of each time frame (occurring before and after this frame) are used to exploit contextual information, thus improving the localization and separation for each source. The quality of the separation results is evaluated in terms of Signal to Distortion Ratio (SDR).
Duel Tijs, Frohlich David M., Kroos Christian, Xu Yong, Jackson Philip J. B., Plumbley Mark D. (2018) Supporting audiography: Design of a system for sentimental sound recording, classification and playback, Communications in Computer and Information Science: HCI International 2018 - Posters' Extended Abstracts 850 pp. 24-31 Scientific Publishing Services, on behalf of Springer
It is now commonplace to capture and share images in photography as triggers for memory. In this paper we explore the possibility of using sound in the same sort of way, in a practice we call audiography. We report an initial design activity to create a system called Audio Memories comprising a ten second sound recorder, an intelligent archive for auto-classifying sound clips, and a multi-layered sound player for the social sharing of audio souvenirs around a table. The recorder and player components are essentially user experience probes that provide tangible interfaces for capturing and interacting with audio memory cues. We discuss our design decisions and process in creating these tools that harmonize user interaction and machine listening to evoke rich memories and conversations in an exploratory and open-ended way.
Li S, Dixon S, Plumbley Mark D (2018) A demonstration of hierarchical structure usage in expressive timing analysis by model selection tests, Proceedings of (CCC2018) IEEE
Analysing expressive timing in performed music can help machines perform various perceptual tasks, such as identifying performers and understanding music structures in classical music. A hierarchical structure is commonly used for expressive timing analysis. This paper provides a statistical demonstration to support the use of hierarchical structure in expressive timing analysis by presenting two groups of model selection tests. The first model selection test uses expressive timing to determine the location of music structure boundaries. The second model selection test matches a performance with the same performer playing another given piece. Comparing the results, the preferred hierarchical structures in these two model selection tests are not the same. While determining music structure boundaries demands a hierarchical structure with more levels in the expressive timing analysis, a hierarchical structure with fewer levels helps identify the performer in most cases.
Plumbley Mark D. (2005) Geometrical methods for non-negative ICA: Manifolds, Lie groups and toral subalgebras, Neurocomputing 67 pp. 161-197 Elsevier
We explore the use of geometrical methods to tackle the non-negative independent component analysis (non-negative ICA) problem, without assuming the reader has an existing background in differential geometry. We concentrate on methods that achieve this by minimizing a cost function over the space of orthogonal matrices. We introduce the idea of the manifold and Lie group SO(n) of special orthogonal matrices that we wish to search over, and explain how this is related to the Lie algebra so(n) of skew-symmetric matrices. We describe how familiar optimization methods such as steepest-descent and conjugate gradients can be transformed into this Lie group setting, and how the Newton update step has an alternative Fourier version in SO(n). Finally we introduce the concept of a toral subgroup generated by a particular element of the Lie group or Lie algebra, and explore how this commutative subgroup might be used to simplify searches on our constraint surface. No proofs are presented in this article.
Grais Emad M, Ward Dominic, Plumbley Mark D (2018) Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders, Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO) pp. 1577-1581 Institute of Electrical and Electronics Engineers (IEEE)
Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multiresolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multiresolution features for separating the singing-voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.
Jackson Philip, Plumbley Mark D, Wang Wenwu, Brookes Tim, Coleman Philip, Mason Russell, Frohlich David, Bonina Carla, Plans David (2017) Signal Processing, Psychoacoustic Engineering and Digital Worlds: Interdisciplinary Audio Research at the University of Surrey,
At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.
This book constitutes the proceedings of the 14th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2018, held in Guildford, UK, in July 2018. The 52 full papers were carefully reviewed and selected from 62 initial submissions. As research topics the papers encompass a wide range of general mixtures of latent variables models but also theories and tools drawn from a great variety of disciplines such as structured tensor decompositions and applications; matrix and tensor factorizations; ICA methods; nonlinear mixtures; audio data and methods; signal separation evaluation campaign; deep learning and data-driven methods; advances in phase retrieval and applications; sparsity-related methods; and biomedical data and methods.
O'Brien Cian, Plumbley Mark D (2018) A Hierarchical Latent Mixture Model for Polyphonic Music Analysis, Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO) pp. 1910-1914 IEEE
Polyphonic music transcription is a challenging problem, requiring the identification of a collection of latent pitches which can explain an observed music signal. Many state-of-the-art methods are based on the Non-negative Matrix Factorization (NMF) framework, which itself can be cast as a latent variable model. However, the basic NMF algorithm fails to consider many important aspects of music signals such as low-rank or hierarchical structure and temporal continuity. In this work we propose a probabilistic model to address some of the shortcomings of NMF. Probabilistic Latent Component Analysis (PLCA) provides a probabilistic interpretation of NMF and has been widely applied to problems in audio signal processing. Based on PLCA, we propose an algorithm which represents signals using a collection of low-rank dictionaries built from a base pitch dictionary. This allows each dictionary to specialize to a given chord or interval template which will be used to represent collections of similar frames. Experiments on a standard music transcription data set show that our method can successfully decompose signals into a hierarchical and smooth structure, improving the quality of the transcription.
Ward Dominic, Mason Russell D., Kim Ryan Chungeun, Stöter Fabian-Robert, Liutkus Antoine, Plumbley Mark D. (2018) SISEC 2018: state of the art in musical audio source separation - Subjective selection of the best algorithm, Proceedings of the 4th Workshop on Intelligent Music Production, Huddersfield, UK, 14 September 2018 University of Huddersfield
The Signal Separation Evaluation Campaign (SiSEC) is a large-scale regular event aimed at evaluating current progress in source separation through a systematic and reproducible comparison of the participants' algorithms, providing the source separation community with an invaluable glimpse of recent achievements and open challenges. This paper focuses on the music separation task from SiSEC 2018, which compares algorithms aimed at recovering instrument stems from a stereo mix. In this context, we conducted a subjective evaluation whereby 34 listeners picked which of six competing algorithms, with high objective performance scores, best separated the singing-voice stem from 13 professionally mixed songs. The subjective results reveal strong differences between the algorithms, and highlight the presence of song-dependent performance for state-of-the-art systems. Correlations between the subjective results and the scores of two popular performance metrics are also presented.
Rencker Lucas, Bach Francis, Wang Wenwu, Plumbley Mark D. (2018) Fast Iterative Shrinkage for Signal Declipping and Dequantization, Proceedings of iTWIST'18 - International Traveling Workshop on Interactions between low-complexity data models and Sensing Techniques iTWIST
We address the problem of recovering a sparse signal from clipped or quantized measurements. We show how these two problems can be formulated as minimizing the distance to a convex feasibility set, which provides a convex and differentiable cost function. We then propose a fast iterative shrinkage/thresholding algorithm that minimizes the proposed cost, which provides a fast and efficient algorithm to recover sparse signals from clipped and quantized measurements.
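The formulation in this abstract can be illustrated with a minimal sketch for the declipping case only. It uses plain (non-accelerated) iterative shrinkage rather than the fast variant named in the title, and the cost 0.5 * dist(Dx, C)^2 + lam * ||x||_1 with a known clipping level `theta` is an assumption of this sketch; all function names, the dictionary `D` and the parameter values are hypothetical.

```python
import numpy as np

def project_consistent(z, y, theta):
    """Project z onto the set C of signals consistent with clipped samples y.

    Samples with |y| < theta are reliable and must equal y; samples clipped
    at +theta must be >= theta, and samples clipped at -theta must be <= -theta.
    """
    p = np.where(np.abs(y) < theta, y, z)
    p = np.where(y >= theta, np.maximum(z, theta), p)
    p = np.where(y <= -theta, np.minimum(z, -theta), p)
    return p

def soft_threshold(v, t):
    """Elementwise soft thresholding (proximal operator of t * |x|)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, D, y, theta, lam):
    """0.5 * squared distance of D @ x to C, plus an l1 sparsity penalty."""
    z = D @ x
    r = z - project_consistent(z, y, theta)
    return 0.5 * np.dot(r, r) + lam * np.sum(np.abs(x))

def ista_declip(D, y, theta, lam=0.01, n_iter=200):
    """Plain ISTA on the cost above, with step 1 / ||D||_2^2."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = D @ x
        # Gradient of 0.5 * dist(z, C)^2 with respect to x.
        grad = D.T @ (z - project_consistent(z, y, theta))
        x = soft_threshold(x - step * grad, step * lam)
    return x
```

The smooth term is differentiable because its gradient is the residual between the current estimate and its projection onto the feasibility set, which is what makes a shrinkage iteration applicable here.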
Safavi Saeid, Pearce Andy, Wang Wenwu, Plumbley Mark (2018) Predicting the perceived level of reverberation using machine learning, Proceedings of the 52nd Asilomar Conference on Signals, Systems and Computers (ACSSC 2018) Institute of Electrical and Electronics Engineers (IEEE)
Perceptual measures are usually considered more reliable than instrumental measures for evaluating the perceived level of reverberation. However, such measures are time consuming and expensive, and, due to variations in stimuli or assessors, the resulting data is not always statistically significant. Therefore, an (objective) measure of the perceived level of reverberation becomes desirable. In this paper, we develop a new method to predict the level of reverberation from audio signals by relating the perceptual listening test results with those obtained from a machine learned model. More specifically, we compare the use of a multiple stimuli test for within and between class architectures to evaluate the perceived level of reverberation. An expert set of 16 human listeners rated the perceived level of reverberation for the same set of files from different audio source types. We then train a machine learning model using the training data gathered for the same set of files and a variety of reverberation related features extracted from the data such as reverberation time, and direct to reverberation ratio. The results suggest that the machine learned model offers an accurate prediction of the perceptual scores.
Kong Qiuqiang, Iqbal Turab, Xu Yong, Wang Wenwu, Plumbley Mark D (2018) DCASE 2018 Challenge Surrey Cross-task convolutional neural network baseline, DCASE2018 Workshop
The Detection and Classification of Acoustic Scenes and Events (DCASE) consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a "CNN Baseline" system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that deeper CNN with 8 layers performs better than CNN with 4 layers on all tasks except Task 1. Using CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
Safavi Saeid, Wang Wenwu, Plumbley Mark, Choobbasti Ali Janalizadeh, Fazekas George (2018) Predicting the Perceived Level of Reverberation using Features from Nonlinear Auditory Model, Proceedings of the 23rd FRUCT conference pp. 527-531 Institute of Electrical and Electronics Engineers (IEEE)
Perceptual measurements have typically been recognized as the most reliable measurements in assessing perceived levels of reverberation. In this paper, a combination of a blind RT60 estimation method and a binaural, nonlinear auditory model is employed to derive signal-based measures (features) that are then utilized in predicting the perceived level of reverberation. Such measures avoid the considerable effort needed to obtain perceptual measures, as well as the variations in stimuli or assessors that can make perceptual measures statistically insignificant. As a result, the automatic extraction of objective measurements that can be applied to predict the perceived level of reverberation becomes vitally important. Consequently, this work is aimed at discovering measurements such as clarity, reverberance, and RT60 which can automatically be derived directly from audio data. These measurements, along with labels from human listening tests, are then forwarded to a machine learning system that seeks to build a model to autonomously estimate the perceived level of reverberation as labeled by an expert. The data has been labeled by an expert human listener for a unilateral set of files from arbitrary audio source types. By examining the results, it can be observed that the automatically extracted features can aid in estimating the perceptual ratings.
Ren Zhao, Kong Qiuqiang, Qian Kun, Plumbley Mark D, Schuller Björn W (2018) Attention-based Convolutional Neural Networks for Acoustic Scene Classification, DCASE 2018 Workshop Proceedings
We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From the attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy of subtask A is 72.6%, which is an improvement of 12.9% when compared with the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is a significant improvement over the baseline as well, in which the accuracies are 71.8%, 58.3%, and 58.3% for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).
Iqbal Turab, Kong Qiuqiang, Plumbley Mark D, Wang Wenwu (2018) General-purpose audio tagging from noisy labels using convolutional neural networks, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) pp. 212-216 Tampere University of Technology
General-purpose audio tagging refers to classifying sounds that are of a diverse nature, and is relevant in many applications where domain-specific information cannot be exploited. The DCASE 2018 challenge introduces Task 2 for this very problem. In this task, there are a large number of classes and the audio clips vary in duration. Moreover, a subset of the labels are noisy. In this paper, we propose a system to address these challenges. The basis of our system is an ensemble of convolutional neural networks trained on log-scaled mel spectrograms. We use preprocessing and data augmentation methods to improve the performance further. To reduce the effects of label noise, two techniques are proposed: loss function weighting and pseudo-labeling. Experiments on the private test set of this task show that our system achieves state-of-the-art performance with a mean average precision score of 0.951.
Plumbley, Mark D (2018) Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) Tampere University of Technology
Cano Estefanía, FitzGerald Derry, Liutkus Antoine, Plumbley Mark D., Stöter Fabian-Robert (2018) Musical Source Separation: An Introduction, IEEE Signal Processing Magazine Institute of Electrical and Electronics Engineers (IEEE)

Many people listen to recorded music as part of their everyday lives, for example from radio or TV programmes, CDs, downloads or increasingly from online streaming services. Sometimes we might want to remix the balance within the music, perhaps to make the vocals louder or to suppress an unwanted sound, or we might want to upmix a 2-channel stereo recording to a 5.1-channel surround sound system. We might also want to change the spatial location of a musical instrument within the mix. All of these applications are relatively straightforward, provided we have access to separate sound channels (stems) for each musical audio object.

However, if we only have access to the final recording mix, which is usually the case, this is much more challenging. To estimate the original musical sources, which would allow us to remix, suppress or upmix the sources, we need to perform musical source separation (MSS).

In the general source separation problem, we are given one or more mixture signals that contain different mixtures of some original source signals. This is illustrated in Figure 1 where four sources, namely vocals, drums, bass and guitar, are all present in the mixture. The task is to recover one or more of the source signals given the mixtures. In some cases, this is relatively straightforward, for example, if there are at least as many mixtures as there are sources, and if the mixing process is fixed, with no delays, filters or non-linear mastering [1].

However, MSS is normally more challenging. Typically, there may be many musical instruments and voices in a 2-channel recording, and the sources have often been processed with the addition of filters and reverberation (sometimes nonlinear) in the recording and mixing process. In some cases, the sources may move, or the production parameters may change, meaning that the mixture is time-varying. All of these issues make MSS a very challenging problem.

Nevertheless, musical sound sources have particular properties and structures that can help us. For example, musical source signals often have a regular harmonic structure of frequencies at regular intervals, and can have frequency contours characteristic of each musical instrument. They may also repeat in particular temporal patterns based on the musical structure.

In this paper we will explore the MSS problem and introduce approaches to tackle it. We will begin by introducing characteristics of music signals, then give an introduction to MSS, and finally consider a range of MSS models. We will also discuss how to evaluate MSS approaches, and discuss limitations and directions for future research.

Nesbit Andrew, Plumbley Mark D., Davies Mike E. (2007) Audio source separation with a signal-adaptive local cosine transform, Signal Processing 87(8) pp. 1848-1858 Elsevier

Audio source separation is a very challenging problem, and many different approaches have been proposed in attempts to solve it. We consider the problem of separating sources from two-channel instantaneous audio mixtures. One approach to this is to transform the mixtures into the time-frequency domain to obtain approximately disjoint representations of the sources, and then separate the sources using time-frequency masking. We focus on demixing the sources by binary masking, and assume that the mixing parameters are known. In this paper, we investigate the application of cosine packet (CP) trees as a foundation for the transform.

We determine an appropriate transform by applying a computationally efficient best basis algorithm to a set of possible local cosine bases organised in a tree structure. We develop a heuristically motivated cost function which maximises the energy of the transform coefficients associated with a particular source. Finally, we evaluate objectively our proposed transform method by comparing it against fixed-basis transforms such as the short-time Fourier transform (STFT) and modified discrete cosine transform (MDCT). Evaluation results indicate that our proposed transform method outperforms MDCT and is competitive with the STFT, and informal listening tests suggest that the proposed method exhibits less objectionable noise than the STFT.

Kong Qiuqiang, Xu Yong, Sobieraj Iwona, Wang Wenwu, Plumbley Mark D. (2019) Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(4) pp. 777-787 Institute of Electrical and Electronics Engineers (IEEE)
Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip. Many supervised SED algorithms rely on strongly labelled data which contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without knowing their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied on a T-F representation, such as log mel spectrogram of an audio clip to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used for separating the sound events from the background scenes in the time-frequency domain. Then a classification mapping is applied on the T-F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping using a convolutional neural network and the classification mapping using a global weighted rank pooling (GWRP). In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound events data. When mixing under 0 dB, the proposed method achieved F1 scores of 0.534, 0.398 and 0.167 in audio tagging, frame-wise SED and event-wise SED, outperforming the fully connected deep neural network baseline of 0.331, 0.237 and 0.120, respectively. In T-F segmentation, we achieved an F1 score of 0.218, where previous methods were not able to do T-F segmentation.
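The global weighted rank pooling (GWRP) step mentioned in this abstract can be illustrated with a small sketch. It follows the usual definition of GWRP (sort the mask values in descending order and average them with geometrically decaying weights); the function name, the decay parameter `r` and the example mask are assumptions for illustration, not the authors' code.

```python
import numpy as np

def gwrp(mask, r):
    """Global weighted rank pooling of a T-F mask with decay 0 <= r <= 1.

    r = 0 reduces to global max pooling and r = 1 to global average pooling.
    """
    values = np.sort(np.ravel(mask))[::-1]   # mask values, largest first
    weights = r ** np.arange(values.size)    # 1, r, r^2, ...
    return float(np.sum(weights * values) / np.sum(weights))

# Hypothetical 2x2 segmentation mask for one sound class.
mask = np.array([[0.1, 0.9],
                 [0.4, 0.2]])
presence = gwrp(mask, 0.5)  # lies between the mean and the max of the mask
```

Intermediate values of `r` let a few strongly activated time-frequency bins dominate the clip-level presence probability without ignoring the rest of the mask.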
Ren Zhao, Kong Qiuqiang, Han Jing, Plumbley Mark, Schuller Björn W (2019) Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes, 2019 Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) IEEE
The goal of Acoustic Scene Classification (ASC) is to recognise the environment in which an audio waveform has been recorded. Recently, deep neural networks have been applied to ASC and have achieved state-of-the-art performance. However, few works have investigated how to visualise and understand what a neural network has learnt from acoustic scenes. Previous work applied local pooling after each convolutional layer, thereby reducing the size of the feature maps. In this paper, we suggest that local pooling is not necessary, but the size of the receptive field is important. We apply atrous Convolutional Neural Networks (CNNs) with global attention pooling as the classification model. The internal feature maps of the attention model can be visualised and explained. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 dataset, our proposed method achieves an accuracy of 72.7%, significantly outperforming the CNNs without dilation at 60.4%. Furthermore, our results demonstrate that the learnt feature maps contain rich information on acoustic scenes in the time-frequency domain.
Kroos Christian, Bones Oliver, Cao Yin, Harris Lara, Jackson Philip J. B., Davies William J., Wang Wenwu, Cox Trevor J., Plumbley Mark D. (2019) Generalisation in environmental sound classification: the 'making sense of sounds' data set and challenge, Proceedings of the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Humans are able to identify a large number of environmental sounds and categorise them according to high-level semantic categories, e.g. urban sounds or music. They are also capable of generalising from past experience to new sounds when applying these categories. In this paper we report on the creation of a data set that is structured according to the top-level of a taxonomy derived from human judgements and the design of an associated machine learning challenge, in which strong generalisation abilities are required to be successful. We introduce a baseline classification system, a deep convolutional network, which showed strong performance with an average accuracy on the evaluation data of 80.8%. The result is discussed in the light of two alternative explanations: An unlikely accidental category bias in the sound recordings or a more plausible true acoustic grounding of the high-level categories.
Podwinska Zuzanna, Sobieraj Iwona, Fazenda Bruno M, Davies William J, Plumbley Mark D. (2019) Acoustic event detection from weakly labeled data using auditory salience, Proceedings of the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Acoustic Event Detection (AED) is an important task of machine listening which, in recent years, has been addressed using common machine learning methods like Non-negative Matrix Factorization (NMF) or deep learning. However, most of these approaches do not take into consideration the way that the human auditory system detects salient sounds. In this work, we propose a method for AED using weakly labeled data that combines a Non-negative Matrix Factorization model with a salience model based on predictive coding in the form of Kalman filters. We show that models of auditory perception, particularly auditory salience, can be successfully incorporated into existing AED methods and improve their performance on rare event detection. We evaluate the method on Task 2 of the DCASE 2017 Challenge.
Hou Yuanbo, Kong Qiuqiang, Li Shengchen, Plumbley Mark D. (2019) Sound event detection with sequentially labelled data based on Connectionist temporal classification and unsupervised clustering, Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Sound event detection (SED) methods typically rely on either strongly labelled data or weakly labelled data. As an alternative, sequentially labelled data (SLD) was proposed. In SLD, the events and the order of events in audio clips are known, without knowing the occurrence time of events. This paper proposes a connectionist temporal classification (CTC) based SED system that uses SLD instead of strongly labelled data, with a novel unsupervised clustering stage. Experiments on 41 classes of sound events show that the proposed two-stage method trained on SLD achieves performance comparable to the previous state-of-the-art SED system trained on strongly labelled data, and is far better than another state-of-the-art SED system trained on weakly labelled data, which indicates the effectiveness of the proposed two-stage method trained on SLD without any onset/offset time of sound events.
Kong Qiuqiang, Xu Yong, Iqbal Turab, Cao Yin, Wang Wenwu, Plumbley Mark D. (2019) Acoustic scene generation with conditional SampleRNN,Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) Institute of Electrical and Electronics Engineers (IEEE)
Acoustic scene generation (ASG) is a task to generate waveforms for acoustic scenes. ASG can be used to generate audio scenes for movies and computer games. Recently, neural networks such as SampleRNN have been used for speech and music generation. However, ASG is more challenging due to its wide variety. In addition, evaluating a generative model is also difficult. In this paper, we propose to use a conditional SampleRNN model to generate acoustic scenes conditioned on the input classes. We also propose objective criteria to evaluate the quality and diversity of the generated samples based on classification accuracy. The experiments on the DCASE 2016 Task 1 acoustic scene data show that with the generated audio samples, a classification accuracy of 65.5% can be achieved, compared to 6.7% for samples generated by a random model and 83.1% for samples from real recordings. A classifier trained only on generated samples achieves an accuracy of 51.3%, as opposed to an accuracy of 6.7% with samples generated by a random model.
Kim Chungeun, Grais Emad M, Mason Russell, Plumbley Mark D (2018) Perception of phase changes in the context of musical audio source separation,145th AES Convention 10031 AES
This study investigates the perceptual consequences of phase change in conventional magnitude-based source separation. A listening test was conducted, where the participants compared three different source separation scenarios, each with two phase retrieval cases: phase from the original mix or from the target source. The participants' responses regarding their similarity to the reference showed that 1) the difference between the mix phase and the perfect target phase was perceivable in the majority of cases with some song-dependent exceptions, and 2) use of the mix phase degraded the perceived quality even in the case of perfect magnitude separation. The findings imply that there is room for perceptual improvement by attempting correct phase reconstruction, in addition to achieving better magnitude-based separation.
Grais Emad M., Wierstorf Hagen, Ward Dominic, Mason Russell, Plumbley Mark (2019) Referenceless Performance Evaluation of Audio Source Separation using Deep Neural Networks,Proceedings 2019 27th European Signal Processing Conference (EUSIPCO) IEEE
Current performance evaluation for audio source separation depends on comparing the processed or separated signals with reference signals. Therefore, common performance evaluation toolkits are not applicable to real-world situations where the ground truth audio is unavailable. In this paper, we propose a performance evaluation technique that does not require reference signals in order to assess separation quality. The proposed technique uses a deep neural network (DNN) to map the processed audio into its quality score. Our experimental results show that the DNN is capable of predicting the sources-to-artifacts ratio from the blind source separation evaluation toolkit [1] for singing-voice separation without the need for reference signals.
Hedayioglu F, Jafari MG, Mattos SS, Plumbley MD, Coimbra MT (2012) Denoising and segmentation of the second heart sound using matching pursuit,Conference proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference pp. 3440-3443
We propose a denoising and segmentation technique for the second heart sound (S2). To denoise, Matching Pursuit (MP) was applied using a set of non-linear chirp signals as atoms. We show that the proposed method can be used to segment the phonocardiogram of the second heart sound into its two clinically meaningful components: the aortic (A2) and pulmonary (P2) components.
Kong Qiuqiang, Xu Yong, Jackson Philip J.B., Wang Wenwu, Plumbley Mark D. (2019) Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks,Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19) pp. 2747-2753 International Joint Conferences on Artificial Intelligence
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synthesizing-decomposition (S-D) approach to solve the single-channel separation and deconvolution problem. In synthesizing, a generative model for sources is built using a generative adversarial network (GAN). In decomposition, both mixing filters and sources are optimized to minimize the reconstruction error of the mixture. The proposed S-D approach achieves a peak signal-to-noise ratio (PSNR) of 18.9 dB and 15.4 dB in image inpainting and completion, outperforming a baseline convolutional neural network PSNR of 15.3 dB and 12.2 dB, respectively, and achieves a PSNR of 13.2 dB in source separation together with deconvolution, outperforming a convolutive non-negative matrix factorization (NMF) baseline of 10.1 dB.
Kong Qiuqiang, Yu Changsong, Xu Yong, Iqbal Turab, Wang Wenwu, Plumbley Mark D. (2019) Weakly Labelled AudioSet Tagging With Attention Neural Networks,IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(11) pp. 1791-1802 IEEE
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, while the onset and offset times are unknown. To address the weakly-labelled audio tagging problem, we propose attention neural networks as a way to attend to the most salient parts of an audio clip. We bridge the connection between attention neural networks and multiple instance learning (MIL) methods, and propose decision-level and feature-level attention neural networks for audio tagging. We investigate attention neural networks modelled by different functions, depths and widths. Experiments on AudioSet show that the feature-level attention neural network achieves a state-of-the-art mean average precision (mAP) of 0.369, outperforming the best multiple instance learning (MIL) method of 0.317 and Google's deep neural network baseline of 0.314. In addition, we discover that the audio tagging performance on AudioSet embedding features has a weak correlation with the number of training examples and the quality of labels of each sound class.
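As a rough illustration of the decision-level attention idea in the abstract above, the following NumPy sketch pools segment-level predictions into a clip-level prediction using normalised attention weights. It is not the authors' implementation: the function name, the array shapes and the toy numbers are assumptions made for this example, and in a real system both inputs would be produced by network layers.

```python
import numpy as np

def decision_level_attention(seg_probs, att_scores):
    """Pool segment-level predictions into one clip-level prediction.

    seg_probs:  (T, K) per-segment class probabilities in [0, 1].
    att_scores: (T, K) non-negative attention scores (hypothetical inputs).
    """
    seg_probs = np.asarray(seg_probs, dtype=float)
    att_scores = np.asarray(att_scores, dtype=float)
    # Normalise attention over time so the weights for each class sum to 1.
    weights = att_scores / att_scores.sum(axis=0, keepdims=True)
    # Clip-level probability is the attention-weighted average over segments.
    return (weights * seg_probs).sum(axis=0)

# Toy clip with T=3 segments and K=2 classes (made-up values).
seg_probs = np.array([[0.9, 0.1], [0.2, 0.1], [0.1, 0.8]])
att_scores = np.array([[3.0, 1.0], [1.0, 1.0], [1.0, 2.0]])
clip_probs = decision_level_attention(seg_probs, att_scores)
print(clip_probs)  # approximately [0.6, 0.45]
```

Segments with larger attention scores dominate the clip-level output, which is how such a model can emphasise the salient parts of a weakly labelled clip.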
Rencker Lucas, Bach Francis, Wang Wenwu, Plumbley Mark D. (2019) Sparse Recovery and Dictionary Learning From Nonlinear Compressive Measurements,IEEE TRANSACTIONS ON SIGNAL PROCESSING 67(21) pp. 5659-5670 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Sparse coding and dictionary learning are popular techniques for linear inverse problems such as denoising or inpainting. However, in many cases the measurement process is nonlinear, for example for clipped, quantized or 1-bit measurements. These problems have often been addressed by solving constrained sparse coding problems, which can be difficult to solve, and by assuming that the sparsifying dictionary is known and fixed. Here we propose a simple and unified framework to deal with nonlinear measurements. We propose a cost function that minimizes the distance to a convex feasibility set, which models our knowledge about the nonlinear measurement. This provides an unconstrained, convex, and differentiable cost function that is simple to optimize, and generalizes the linear least squares cost commonly used in sparse coding. We then propose proximal-based sparse coding and dictionary learning algorithms that are able to learn directly from nonlinearly corrupted signals. We show how the proposed framework and algorithms can be applied to clipped, quantized and 1-bit data.
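To make the "distance to a convex feasibility set" idea concrete, here is a minimal NumPy sketch for the clipping case only. The feasibility set used here (unclipped samples must equal the measurement, clipped samples must lie at or beyond the clipping level) and the function name are assumptions chosen for illustration, not taken from the paper.

```python
import numpy as np

def clipping_consistency_cost(x, y, theta):
    """Half squared distance from x to the signals consistent with y.

    y is assumed to be some signal clipped to [-theta, theta].
    Returns (cost, residual); the residual x - proj(x) is also the
    gradient of the cost with respect to x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    proj = y.copy()                      # unclipped samples must match y
    hi = y >= theta                      # positively clipped samples
    lo = y <= -theta                     # negatively clipped samples
    proj[hi] = np.maximum(x[hi], theta)  # any value >= theta is feasible
    proj[lo] = np.minimum(x[lo], -theta) # any value <= -theta is feasible
    residual = x - proj
    return 0.5 * np.sum(residual ** 2), residual

y = np.array([0.5, 1.0, -1.0])           # measurement clipped at theta = 1
x = np.array([0.4, 1.7, -0.6])           # candidate reconstruction
cost, grad = clipping_consistency_cost(x, y, theta=1.0)
```

When no sample is clipped, the projection is simply `y` and the cost reduces to the usual least squares term, which matches the abstract's remark that the cost generalizes linear least squares.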
Grondin François, Glass François, Sobieraj Iwona, Plumbley Mark D. (2019) Sound Event Localization and Detection Using CRNN on Pairs of Microphones,Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019) pp. 84-88 New York University
This paper proposes sound event localization and detection methods from multichannel recordings. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system.
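The count of six pairs follows from choosing 2 microphones out of 4. A short Python check, with hypothetical microphone labels, is:

```python
from itertools import combinations

mics = ["m0", "m1", "m2", "m3"]       # hypothetical microphone labels
pairs = list(combinations(mics, 2))   # every unordered pair of microphones
print(len(pairs))  # 6
```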
Ren Zhao, Han Jing, Cummins Nicholas, Kong Qiuqiang, Plumbley Mark, Schuller Björn W. (2019) Multi-instance Learning for Bipolar Disorder Diagnosis using Weakly Labelled Speech Data,Proceedings of DPH 2019: 9th International Digital Public Health Conference 2019 Association for Computing Machinery (ACM)
While deep learning is undoubtedly the predominant learning technique across speech processing, it is still not widely used in health-based applications. The corpora available for health-style recognition problems are often small, both concerning the total amount of data available and the number of individuals present. The Bipolar Disorder corpus, used in the 2018 Audio/Visual Emotion Challenge, contains only 218 audio samples from 46 individuals. Herein, we present a multi-instance learning framework aimed at constructing more reliable deep learning-based models in such conditions. First, we segment the speech files into multiple chunks. However, the problem is that each of the individual chunks is weakly labelled, as they are annotated with the label of the corresponding speech file, but may not be indicative of that label. We then train the deep learning-based (ensemble) multi-instance learning model, aiming at solving such a weakly labelled problem. The presented results demonstrate that this approach can improve the accuracy of feedforward, recurrent, and convolutional neural nets on the 3-class mania classification tasks undertaken on the Bipolar Disorder corpus.
Cao Yin, Kong Qiuqiang, Iqbal Turab, An Fengyan, Wang Wenwu, Plumbley Mark D. (2019) Polyphonic sound event detection and localization using a two-stage strategy,Proceedings of Detection and Classification of Acoustic Scenes and Events Workshop (DCASE 2019) pp. 30-34 New York University
Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
Plumbley MD, Oja E (2004) A "nonnegative PCA" algorithm for independent component analysis,IEEE TRANSACTIONS ON NEURAL NETWORKS 15(1) pp. 66-76 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
We consider the task of independent component analysis when the independent sources are known to be nonnegative and well-grounded, so that they have a nonzero probability density function (pdf) in the region of zero. We propose the use of a "nonnegative principal component analysis (nonnegative PCA)" algorithm, which is a special case of the nonlinear PCA algorithm, but with a rectification nonlinearity, and we conjecture that this algorithm will find such nonnegative well-grounded independent sources, under reasonable initial conditions. While the algorithm has proved difficult to analyze in the general case, we give some analytical results that are consistent with this conjecture and some numerical simulations that illustrate its operation.
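The abstract above describes a nonlinear PCA rule with a rectification nonlinearity. A minimal NumPy sketch of one stochastic update is given below. The exact form used here, W + eta * g(y) (z - W^T g(y))^T with y = W z and g(y) = max(y, 0), is an assumption based on the standard nonlinear PCA rule and should be checked against the paper; the function name and step size are hypothetical, and z is assumed to be pre-whitened.

```python
import numpy as np

def nonnegative_pca_step(W, z, eta=0.01):
    """One update of a rectified nonlinear PCA rule (illustrative sketch).

    W: (n, n) current separating matrix.
    z: (n,) observation, assumed pre-whitened.
    """
    y = W @ z
    g = np.maximum(y, 0.0)             # rectification nonlinearity
    # Assumed update: dW = eta * g(y) (z - W^T g(y))^T
    return W + eta * np.outer(g, z - W.T @ g)

W = np.eye(2)
W_same = nonnegative_pca_step(W, np.array([1.0, 2.0]))   # outputs already nonnegative
W_new = nonnegative_pca_step(W, np.array([1.0, -2.0]))   # one negative output
```

With an orthonormal W and nonnegative outputs the reconstruction error z - W^T g(y) is zero, so the update leaves W unchanged. W only moves when some output is negative.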
Kong Qiuqiang, Wang Yuxuan, Song Xuchen, Cao Yin, Wang Wenwu, Plumbley Mark D. (2020) Source separation with weakly labelled data: An approach to computational auditory scene analysis,ICASSP 2020
Source separation is the task of separating an audio recording into individual sound sources. Source separation is fundamental for computational auditory scene analysis. Previous work on source separation has focused on separating particular sound classes such as speech and music. Much previous work requires mixtures and clean source pairs for training. In this work, we propose a source separation framework trained with weakly labelled data. Weakly labelled data only contains the tags of an audio clip, without the occurrence time of sound events. We first train a sound event detection system with AudioSet. The trained sound event detection system is used to detect segments that are most likely to contain a target sound event. Then a regression is learnt from a mixture of two randomly selected segments to a target segment conditioned on the audio tagging prediction of the target segment. Our proposed system can separate 527 kinds of sound classes from AudioSet within a single system. A U-Net is adopted for the separation system and achieves an average SDR of 5.67 dB over 527 sound classes in AudioSet.
Iqbal Turab, Cao Yin, Kong Qiuqiang, Plumbley Mark D., Wang Wenwu (2020) Learning with Out-of-Distribution Data for Audio Classification,International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
In supervised machine learning, the assumption that training data is labelled correctly is not always satisfied. In this paper, we investigate an instance of labelling error for classification tasks in which the dataset is corrupted with out-of-distribution (OOD) instances: data that does not belong to any of the target classes, but is labelled as such. We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method uses an auxiliary classifier, trained on data that is known to be in-distribution, for detection and relabelling. The amount of data required for this is shown to be small. Experiments are carried out on the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin. Comparisons with other noise-robust techniques are similarly encouraging.
Sound event detection (SED) is the problem of detecting the onset and offset times of sound events in an audio recording. SED has many applications in both academia and industry, such as multimedia information retrieval and monitoring domestic and public security. However, compared to speech signal processing, which has been researched for many years, the classification and detection of general sounds was not researched much until recent years. One limitation of the study of audio classification and sound event detection is that there were limited datasets publicly available until the release of the detection and classification of acoustic scenes and events (DCASE) dataset. The DCASE dataset consists of data for acoustic scene classification (ASC), audio tagging (AT) and sound event detection. ASC and AT are tasks to design systems to predict pre-defined labels in an audio clip. SED is a task to design systems to predict both the presence or absence of sound events in an audio clip and the onset and offset times of the sound events. One difficulty of the audio classification and SED tasks is that many datasets such as the DCASE dataset are weakly labelled. That is, only the presence or absence of sound events in an audio clip is known, without knowing the onset and offset annotations of the sound events. This thesis focused on solving the audio tagging and sound event detection problems using only weakly labelled data. This thesis proposed attention neural networks to solve the general weakly labelled AT and SED problem. The attention neural networks can automatically learn to attend to important segments and ignore silence and irrelevant segments in an audio clip. We developed a set of weak learning methods for AT and SED using attention neural networks. The proposed methods have achieved state-of-the-art performance in audio tagging and sound event detection.
Zermini Alfredo (2020) Deep learning for speech separation.
Speech source separation aims to estimate one or more individual sources from mixtures of multiple sound sources, e.g. speech, noise and music. While humans have an innate ability to separate sources in a sound mixture, this is not a trivial task for computers. In this thesis, we study the problem of speech separation, with a varying degree of complexity with respect to room reverberation, the number of speech sources and the number of microphones available for capturing the sources. We focus on state-of-the-art deep learning techniques, and investigate the problem of separating speech sources from binaural and B-format mixtures obtained in real reverberant rooms. First, we evaluate a baseline system for binaural speech separation, where fully connected Deep Neural Networks (DNNs) and spatial features, such as Interaural Level Difference (ILD) and Interaural Phase Difference (IPD), are used. We further extend this baseline by using the dropout technique to mitigate the overfitting problem and adding spectral features, such as the Log-Power Spectrogram (LPS), to improve the separation performance. Second, we develop a Convolutional Neural Network (CNN)-based binaural speech separation system. We then study the potential of using data augmentation techniques to improve speech separation quality. In particular, we introduce contextual frame expansion, by including the information from neighbouring time frames, before and after a given time frame. Finally, we study the use of deep learning methods for B-format recordings. This allows the pressure gradient information to be exploited, in addition to the widely used acoustic pressure information, for deriving the angular features for source separation. Extensive experiments have been performed on two datasets captured in five different rooms at the University of Surrey. The proposed methods are shown to offer improved performance over the state-of-the-art, in terms of separation quality and intelligibility.
Grais Emad M., Zhao Fei, Plumbley Mark D. (2020) Multi-Band Multi-Resolution Fully Convolutional Neural Networks for Singing Voice Separation,28th European Signal Processing Conference (EUSIPCO 2020)
Deep neural networks with convolutional layers usually process the entire spectrogram of an audio signal with the same time-frequency resolutions, number of filters, and dimensionality reduction scale. According to the constant-Q transform, good features can be extracted from audio signals if the low frequency bands are processed with high frequency resolution filters and the high frequency bands with high time resolution filters. In the spectrogram of a mixture of singing voices and music signals, there is usually more information about the voice in the low frequency bands than the high frequency bands. These observations raise the need for processing each part of the spectrogram differently. In this paper, we propose a multi-band multi-resolution fully convolutional neural network (MBR-FCN) for singing voice separation. The MBR-FCN processes the frequency bands that have more information about the target signals with more filters and smaller dimensionality reduction scale than the bands with less information. Furthermore, the MBR-FCN processes the low frequency bands with high frequency resolution filters and the high frequency bands with high time resolution filters. Our experimental results show that the proposed MBR-FCN with very few parameters achieves better singing voice separation performance than other deep neural networks.
Kong Qiuqiang, Xu Yong, Wang Wenwu, Plumbley Mark (2020) Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization,IEEE/ACM Transactions on Audio, Speech and Language Processing 28 pp. 2450-2460 Institute of Electrical and Electronics Engineers
Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets are weakly labelled. That is, there are only audio tags for each audio clip without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous work. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming that without threshold optimization of 0.629, and a sound event detection F1 of 0.584, outperforming that without threshold optimization of 0.564.
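The second stage described above, tuning thresholds against a threshold-dependent metric, can be sketched for a single class as follows. The abstract does not say which optimizer is used, so a plain grid search stands in for it here; the function names, candidate grid and toy scores are all assumptions for illustration.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 computed from true/false positives and false negatives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold that maximises F1 for one class."""
    f1s = [f1_score(y_true, (scores >= t).astype(int)) for t in candidates]
    best = int(np.argmax(f1s))          # first maximum wins on ties
    return float(candidates[best]), f1s[best]

# Made-up labels and model scores for six clips of one class.
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.42, 0.37, 0.83, 0.47, 0.12, 0.31])
candidates = np.linspace(0.1, 0.9, 9)   # hypothetical search grid
thr, f1 = best_threshold(y_true, scores, candidates)
f1_default = f1_score(y_true, (scores >= 0.5).astype(int))
```

On this toy data the fixed threshold of 0.5 misses two of the three positive clips, while the searched threshold separates the classes, which illustrates why an empirically chosen threshold can be far from optimal.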
Koike Tomoya, Kong Qiuqiang, Plumbley Mark (2020) Audio for Audio is Better? An Investigation on Transfer Learning Models for Heart Sound Classification,42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)
Cardiovascular disease is one of the leading causes of death. In the past decade, heart sound classification has been increasingly studied for its feasibility to develop a non-invasive approach to monitor a subject's health status. Particularly, relevant studies have benefited from the fast development of wearable devices and machine learning techniques. Nevertheless, finding and designing efficient acoustic properties from heart sounds is an expensive and time-consuming task. It is known that transfer learning methods can help extract higher-level representations automatically from the heart sounds without any human domain knowledge. However, most existing studies are based on models pre-trained on images, which may not fully represent the characteristics inherited from audio. To this end, we propose a novel transfer learning model pre-trained on large-scale audio data for a heart sound classification task. In this study, the PhysioNet CinC Challenge Dataset is used for evaluation. Experimental results demonstrate that our proposed pre-trained audio models can outperform other popular models pre-trained on images, achieving the highest unweighted average recall of 89.7%.
Safavi Saeid, Iqbal Turab, Wang Wenwu, Coleman Philip, Plumbley Mark D. (2020) Open-Window: A Sound Event Data Set For Window State Detection And Recognition,Proc. 5th International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020)
Situated in the domain of urban sound scene classification by humans and machines, this research is the first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations and four different window states: two stationary states (open and closed) and two transitional states (open to close and close to open). We have then built our machine recognition baselines for different scenarios (open set versus closed set) using a deep learning framework. A human listening test is also performed to compare human and machine performance in detecting the window state using only acoustic cues. Our experimental results reveal that when using a simple machine baseline system, humans and machines achieve similar average performance for closed set experiments.
Kong Qiuqiang, Cao Yin, Iqbal Turab, Wang Yuxuan, Wang Wenwu, Plumbley Mark D (2020) PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition,IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 pp. 2880-2894 IEEE
Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs.