# Dr Emad M. Grais

Research Fellow

### Publications

Grais EM, Sen MU, Erdogan H (2014) Deep neural networks for single channel source separation, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 3734-3738

In this paper, a novel approach for single channel source separation (SCSS) using a deep neural network (DNN) architecture is introduced. Unlike previous studies in which DNNs and other classifiers were used for classifying time-frequency bins to obtain hard masks for each source, we use the DNN to classify estimated source spectra to check their validity during separation. In the training stage, the training data for the source signals are used to train a DNN. In the separation stage, the trained DNN is utilized to aid in estimation of each source in the mixed signal. The single channel source separation problem is formulated as an energy minimization problem where each source spectrum estimate is encouraged to fit the trained DNN model and the mixed signal spectrum is encouraged to be written as a weighted sum of the estimated source spectra. The proposed approach works regardless of the energy scale differences between the source signals in the training and separation stages. Nonnegative matrix factorization (NMF) is used to initialize the DNN estimate for each source. The experimental results show that using a DNN initialized by NMF for source separation improves the quality of the separated signal compared with using NMF alone. © 2014 IEEE.

Grais EM, Erdogan H (2013) Discriminative nonnegative dictionary learning using cross-coherence penalties for single channel source separation, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH pp. 808-812

In this work, we introduce a new discriminative training method for nonnegative dictionary learning. The new method can be used in single channel source separation (SCSS) applications. In SCSS, nonnegative matrix factorization (NMF) is used to learn a dictionary (a set of basis vectors) for each source in the magnitude spectrum domain. The trained dictionaries are then used in decomposing the mixed signal to find the estimate for each source. Learning discriminative dictionaries for the source signals can improve the separation performance. To achieve discriminative dictionaries, we try to prevent the basis set of one source's dictionary from representing the other source signals. We propose to minimize the cross-coherence between the dictionaries of all sources in the mixed signal. We incorporate a simplified cross-coherence penalty into a regularized NMF cost function to simultaneously learn discriminative and reconstructive dictionaries. The new regularized NMF update rules that are used to discriminatively train the dictionaries are introduced in this work. Experimental results show that using discriminative training gives better separation results than using conventional NMF. Copyright © 2013 ISCA.
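As a toy illustration of the discriminative idea (this is not the paper's exact penalty, and the column normalization here is an assumption), the cross-coherence between two dictionaries can be measured as the sum of squared cosine similarities between their columns:

```python
import numpy as np

def cross_coherence(B1, B2):
    """Sum of squared cosine similarities between the columns of two
    dictionaries. Lower values mean more discriminative dictionaries."""
    n1 = B1 / (np.linalg.norm(B1, axis=0, keepdims=True) + 1e-12)
    n2 = B2 / (np.linalg.norm(B2, axis=0, keepdims=True) + 1e-12)
    return float(np.sum((n1.T @ n2) ** 2))

# Identical dictionaries are maximally coherent; orthogonal columns score ~0.
high = cross_coherence(np.eye(3), np.eye(3))                 # ~3.0
low = cross_coherence(np.eye(3)[:, :1], np.eye(3)[:, 1:2])   # ~0.0
```

A penalty proportional to this quantity can be added to the NMF cost, trading off reconstruction accuracy against discrimination between the dictionaries.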

Grais EM, Erdogan H (2012) Hidden Markov models as priors for regularized nonnegative matrix factorization in single-channel source separation, 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 2 pp. 1534-1537

We propose a new method to incorporate rich statistical priors, modeling temporal gain sequences in the solutions of nonnegative matrix factorization (NMF). The proposed method can be used for single-channel source separation (SCSS) applications. In NMF based SCSS, NMF is used to decompose the spectra of the observed mixed signal as a weighted linear combination of a set of trained basis vectors. In this work, the NMF decomposition weights are enforced to consider statistical and temporal prior information on the weight combination patterns that the trained basis vectors can jointly receive for each source in the observed mixed signal. The Hidden Markov Model (HMM) is used as a log-normalized gains (weights) prior model for the NMF solution. The normalization makes the prior models energy independent. HMM is used as a rich model that characterizes the statistics of sequential data. The NMF solutions for the weights are encouraged to increase the log-likelihood with the trained gain prior HMMs while reducing the NMF reconstruction error at the same time.

Grais EM, Erdogan H (2011) Single channel speech music separation using nonnegative matrix factorization and spectral masks, 17th DSP 2011 International Conference on Digital Signal Processing, Proceedings

A single channel speech-music separation algorithm based on nonnegative matrix factorization (NMF) with spectral masks is proposed in this work. The proposed algorithm uses training data of speech and music signals with nonnegative matrix factorization followed by masking to separate the mixed signal. In the training stage, NMF uses the training data to train a set of basis vectors for each source. These bases are trained using NMF in the magnitude spectrum domain. After observing the mixed signal, NMF is used to decompose its magnitude spectra into a linear combination of the trained bases for both sources. The decomposition results are used to build a mask, which explains the contribution of each source in the mixed signal. Experimental results show that using masks after NMF improves the separation process even when calculating NMF with fewer iterations, which yields a faster separation process. © 2011 IEEE.
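The train-then-mask pipeline described in this abstract can be sketched with plain NumPy (an illustrative toy, not the paper's implementation; the random "spectrograms", ranks, and iteration counts are made up):

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-12):
    """Multiplicative-update NMF (Euclidean cost): V ~= B @ G."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    B = rng.random((F, rank)) + eps
    G = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        G *= (B.T @ V) / (B.T @ B @ G + eps)
        B *= (V @ G.T) / (B @ G @ G.T + eps)
    return B, G

rng = np.random.default_rng(1)
V_speech = np.abs(rng.normal(size=(64, 100)))   # toy "speech" spectrogram
V_music = np.abs(rng.normal(size=(64, 100)))    # toy "music" spectrogram

B_s, _ = nmf(V_speech, rank=10)   # training stage: one basis set per source
B_m, _ = nmf(V_music, rank=10)

V_mix = V_speech + V_music        # magnitude spectra assumed additive
B = np.hstack([B_s, B_m])
G = rng.random((B.shape[1], V_mix.shape[1])) + 1e-12
for _ in range(200):              # separation stage: bases fixed, gains updated
    G *= (B.T @ V_mix) / (B.T @ B @ G + 1e-12)

S_speech = B_s @ G[:10]           # each source's reconstruction
S_music = B_m @ G[10:]
mask = S_speech / (S_speech + S_music + 1e-12)   # soft spectral mask in [0, 1]
est_speech = mask * V_mix         # speech estimate: masked mixture
```

The mask re-weights the observed mixture rather than using the NMF reconstruction directly, which is why, as the abstract notes, fewer NMF iterations can still yield acceptable separation.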

Roma G, Grais EM, Simpson AJR, Sobieraj I, Plumbley MD (2016) Untwist: A new toolbox for audio source separation

Untwist is a new open source toolbox for audio source separation. The library provides a self-contained object-oriented framework including common source separation algorithms as well as input/output functions, data management utilities and time-frequency transforms. Everything is implemented in Python, facilitating research, experimentation and prototyping across platforms. The code is available on GitHub.

Roma G, Simpson A, Girgis E, Plumbley M (2016) Remixing musical audio on the web using source separation, Proceedings of the 2nd Web Audio Conference (WAC-2016)

Research in audio source separation has progressed a long way, producing systems that are able to approximate the component signals of sound mixtures. In recent years, many efforts have focused on learning time-frequency masks that can be used to filter a monophonic signal in the frequency domain. Using current web audio technologies, time-frequency masking can be implemented in a web browser in real time. This allows applying source separation techniques to arbitrary audio streams, such as internet radios, depending on cross-domain security configurations. While producing good quality separated audio from monophonic music mixtures is still challenging, current methods can be applied to remixing scenarios, where part of the signal is emphasized or deemphasized. This paper describes a system for remixing musical audio on the web by applying time-frequency masks estimated using deep neural networks. Our example prototype, implemented in client-side Javascript, provides reasonable quality results for small modifications.

Grais EM, Erdogan H (2013) Regularized nonnegative matrix factorization using Gaussian mixture priors for supervised single channel source separation, Computer Speech and Language 27 (3) pp. 746-762

© 2012 Elsevier Ltd. We introduce a new regularized nonnegative matrix factorization (NMF) method for supervised single-channel source separation (SCSS). We propose a new multi-objective cost function which includes the conventional divergence term for the NMF together with a prior likelihood term. The first term measures the divergence between the observed data and the multiplication of basis and gains matrices. The novel second term encourages the log-normalized gain vectors of the NMF solution to increase their likelihood under a prior Gaussian mixture model (GMM) which is used to encourage the gains to follow certain patterns. In this model, the parameters to be estimated are the basis vectors, the gain vectors and the parameters of the GMM prior. We introduce two different ways to train the model parameters, sequential training and joint training. In sequential training, after finding the basis and gains matrices, the gains matrix is then used to train the prior GMM in a separate step. In joint training, within each NMF iteration the basis matrix, the gains matrix and the prior GMM parameters are updated jointly using the proposed regularized NMF. The normalization of the gains makes the prior models energy independent, which is an advantage as compared to earlier proposals. In addition, GMM is a much richer prior than the previously considered alternatives such as conjugate priors which may not represent the distribution of the gains in the best possible way. In the separation stage after observing the mixed signal, we use the proposed regularized cost function with a combined basis and the GMM priors for all sources that were learned from training data for each source. Only the gain vectors are estimated from the mixed data by minimizing the joint cost function. We introduce novel update rules that solve the optimization problem efficiently for the new regularized NMF problem.
This optimization is challenging due to using energy normalization and GMM for prior modeling, which makes the problem highly nonlinear and non-convex. The experimental results show that the introduced methods improve the performance of single channel source separation for speech separation and speech-music separation with different NMF divergence functions. The experimental results also show that, using the GMM prior gives better separation results than using the conjugate prior.

Grais EM, Erdogan H (2012) Gaussian mixture gain priors for regularized nonnegative matrix factorization in single-channel source separation, 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 2 pp. 1518-1521

We propose a new method to incorporate statistical priors on the solution of the nonnegative matrix factorization (NMF) for single-channel source separation (SCSS) applications. The Gaussian mixture model (GMM) is used as a log-normalized gain prior model for the NMF solution. The normalization makes the prior models energy independent. In NMF based SCSS, NMF is used to decompose the spectra of the observed mixed signal as a weighted linear combination of a set of trained basis vectors. In this work, the NMF decomposition weights are enforced to consider statistical prior information on the weight combination patterns that the trained basis vectors can jointly receive for each source in the observed mixed signal. The NMF solutions for the weights are encouraged to increase the log-likelihood with the trained gain prior GMMs while reducing the NMF reconstruction error at the same time.

Grais EM, Erdogan H (2011) Single channel speech music separation using nonnegative matrix factorization with sliding windows and spectral masks, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH pp. 1773-1776

A single channel speech-music separation algorithm based on nonnegative matrix factorization (NMF) with sliding windows and spectral masks is proposed in this work. We train a set of basis vectors for each source signal using NMF in the magnitude spectral domain. Rather than forming the columns of the matrices to be decomposed by NMF of a single spectral frame, we build them with multiple spectral frames stacked in one column. After observing the mixed signal, NMF is used to decompose its magnitude spectra into a weighted linear combination of the trained basis vectors for both sources. An initial spectrogram estimate for each source is found, and a spectral mask is built using these initial estimates. This mask is used to weight the mixed signal spectrogram to find the contributions of each source signal in the mixed signal. The method is shown to perform better than the conventional NMF approach. Copyright © 2011 ISCA.
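The sliding-window idea of stacking neighboring spectral frames into each column can be sketched as follows (a hypothetical helper; the toy spectrogram and context length are made up for illustration):

```python
import numpy as np

def stack_frames(S, context):
    """Stack `context` consecutive spectral frames into each column,
    so NMF bases capture multi-frame spectro-temporal patterns."""
    F, T = S.shape
    cols = [S[:, t:t + context].reshape(-1) for t in range(T - context + 1)]
    return np.stack(cols, axis=1)   # shape: (F * context, T - context + 1)

S = np.arange(12.0).reshape(3, 4)   # toy spectrogram: 3 bins x 4 frames
X = stack_frames(S, context=2)      # columns now span 2 frames each
```

The stacked matrix X is what would be fed to NMF in place of the single-frame spectrogram; the trained bases then describe short spectro-temporal segments rather than isolated frames.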

Grais EM, Erdogan H (2014) Source separation using regularized NMF with MMSE estimates under GMM priors with online learning for the uncertainties, Digital Signal Processing: A Review Journal 29 (1) pp. 20-34

We propose a new method to incorporate priors on the solution of nonnegative matrix factorization (NMF). The NMF solution is guided to follow the minimum mean square error (MMSE) estimates of the weight combinations under a Gaussian mixture model (GMM) prior. The proposed algorithm can be used for denoising or single-channel source separation (SCSS) applications. NMF is used in SCSS in two main stages, the training stage and the separation stage. In the training stage, NMF is used to decompose the training data spectrogram for each source into a multiplication of a trained basis and gains matrices. In the separation stage, the mixed signal spectrogram is decomposed as a weighted linear combination of the trained basis matrices for the source signals. In this work, to improve the separation performance of NMF, the trained gains matrices are used to guide the solution of the NMF weights during the separation stage. The trained gains matrix is used to train a prior GMM that captures the statistics of the valid weight combinations that the columns of the basis matrix can receive for a given source signal. In the separation stage, the prior GMMs are used to guide the NMF solution of the gains/weights matrices using MMSE estimation. The NMF decomposition weights matrix is treated as a distorted image by a distortion operator, which is learned directly from the observed signals. The MMSE estimate of the weights matrix under the trained GMM prior and log-normal distribution for the distortion is then found to improve the NMF decomposition results. The MMSE estimate is embedded within the optimization objective to form a novel regularized NMF cost function. The corresponding update rules for the new objectives are derived in this paper. The proposed MMSE estimates based regularization avoids the problem of computing the hyper-parameters and the regularization parameters. MMSE also provides a better estimate for the valid gains matrix. 
Experimental results show that the proposed regularized NMF algorithm improves the source separation performance compared with using NMF without a prior or with other prior models. © 2014 Elsevier Inc.

Grais EM (2013) Incorporating Prior Information in Nonnegative Matrix Factorization for Audio Source Separation

In this work, we propose solutions to the problem of audio source separation from a single recording. The audio source signals can be speech, music or any other audio signals. We assume training data for the individual source signals that are present in the mixed signal are available. The training data are used to build a representative model for each source. In most cases, these models are sets of basis vectors in magnitude or power spectral domain. The proposed algorithms basically depend on decomposing the spectrogram of the mixed signal with the trained basis models for all observed sources in the mixed signal. Nonnegative matrix factorization (NMF) is used to train the basis models for the source signals. NMF is then used to decompose the mixed signal spectrogram as a weighted linear combination of the trained basis vectors for each observed source in the mixed signal. After decomposing the mixed signal, spectral masks are built and used to reconstruct the source signals.

In this thesis, we improve the performance of NMF for source separation by incorporating more constraints and prior information related to the source signals to the NMF decomposition results. The NMF decomposition weights are encouraged to satisfy some prior information that is related to the nature of the source signals. The priors are modeled using Gaussian mixture models or hidden Markov models. These priors basically represent valid weight combination sequences that the basis vectors can receive for a certain type of source signal. The prior models are incorporated with the NMF cost function using either log-likelihood or minimum mean squared error estimation (MMSE). We also incorporate the prior information as a post processing. We incorporate the smoothness prior on the NMF solutions by using post smoothing processing. We also introduce post enhancement using MMSE estimation to obtain better separation for the source signals.

In this thesis, we also improve the NMF training for the basis models. In cases when enough training data are not available, we introduce two different adaptation methods for the trained basis to better fit the sources in the mixed signal. We also improve the training procedures for the sources by learning more discriminative dictionaries for the source signals. In addition, to consider a larger context in the models, we concatenate neighboring spectra together and train basis sets from them instead of a single frame, which makes it possible to directly mod ...

Grais EM, Erdogan H (2013) Initialization of nonnegative matrix factorization dictionaries for single channel source separation, 2013 21st Signal Processing and Communications Applications Conference, SIU 2013

In this work, we study different initialization methods for the nonnegative matrix factorization (NMF) dictionaries or bases. There is a need for good initializations for the NMF dictionary because NMF decomposition is a non-convex problem which has many local minima. The effect of the initialization of NMF is evaluated in this work on audio source separation applications. In supervised audio source separation, NMF is used to train a set of basis vectors (basis matrix) for each source in an iterative fashion. Then NMF is used to decompose the mixed signal spectrogram as a weighted linear combination of the trained basis vectors for all sources in the mixed signal. The estimate for each source is computed by summing the decomposition terms that include its corresponding trained bases. In this work, we use principal component analysis (PCA), spherical K-means, and fuzzy C-means (FCM) to initialize the NMF basis matrices during the training procedures. Experimental results show that better initialization for NMF bases gives better audio separation performance than using NMF with random initialization. © 2013 IEEE.
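A minimal spherical K-means initializer for an NMF basis matrix might look like this (an illustrative sketch, not the paper's code; the toy training spectra, cluster count and iteration limit are made up):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Spherical k-means on the columns of X: cluster by cosine similarity
    and return unit-norm centroids, usable as an NMF basis initialization."""
    rng = np.random.default_rng(seed)
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    C = Xn[:, rng.choice(Xn.shape[1], k, replace=False)]  # seed centroids
    for _ in range(n_iter):
        labels = np.argmax(C.T @ Xn, axis=0)   # assign by cosine similarity
        for j in range(k):
            members = Xn[:, labels == j]
            if members.size:                   # re-normalize the mean direction
                c = members.sum(axis=1)
                C[:, j] = c / (np.linalg.norm(c) + 1e-12)
    return C

rng = np.random.default_rng(1)
V = np.abs(rng.normal(size=(32, 200)))   # toy nonnegative training spectra
B0 = spherical_kmeans(V, k=8)            # nonnegative init for the NMF bases
```

Because the training spectra are nonnegative, the centroids are nonnegative too, so B0 can be passed directly as the initial basis matrix of an NMF training run.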

Girgis E, Roma G, Simpson A, Plumbley M (2016) Single Channel Audio Source Separation using Deep Neural Network Ensembles, AES Convention Proceedings Audio Engineering Society

Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper, a combination of different DNNs' predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.

Erdogan H, Grais EM (2010) Semi-blind speech-music separation using sparsity and continuity priors, 20th International Conference on Pattern Recognition Proceedings pp. 4573-4576 IEEE

In this paper we propose an approach for the problem of single channel source separation of speech and music signals. Our approach is based on representing each source's power spectral density using dictionaries and nonlinearly projecting the mixture signal spectrum onto the combined span of the dictionary entries. We encourage sparsity and continuity of the dictionary coefficients using penalty terms (or log-priors) in an optimization framework. We propose to use a novel coordinate descent technique for ...

Grais EM, Topkaya IS, Erdogan H (2012) Audio-Visual speech recognition with background music using single-channel source separation, IEEE

In this paper, we consider audio-visual speech recognition with background music. The proposed algorithm is an integration of audio-visual speech recognition and single channel source separation (SCSS). We apply the proposed algorithm to recognize spoken speech that is mixed with music signals. First, the SCSS algorithm based on nonnegative matrix factorization (NMF) and spectral masks is used to separate the audio speech signal from the background music in magnitude spectral domain. After speech ...

Grais EM, Erdogan H (2013) Spectro-temporal post-enhancement using MMSE estimation in NMF based single-channel source separation, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH pp. 3279-3283

We propose to use minimum mean squared error (MMSE) estimates to enhance the signals that are separated by nonnegative matrix factorization (NMF). In single channel source separation (SCSS), NMF is used to train a set of basis vectors for each source from their training spectrograms. Then NMF is used to decompose the mixed signal spectrogram as a weighted linear combination of the trained basis vectors from which estimates of each corresponding source can be obtained. In this work, we deal with the spectrogram of each separated signal as a 2D distorted signal that needs to be restored. A multiplicative distortion model is assumed where the logarithm of the true signal distribution is modeled with a Gaussian mixture model (GMM) and the distortion is modeled as having a log-normal distribution. The parameters of the GMM are learned from training data whereas the distortion parameters are learned online from each separated signal. The initial source estimates are improved and replaced with their MMSE estimates under this new probabilistic framework. The experimental results show that using the proposed MMSE estimation technique as a post enhancement after NMF improves the quality of the separated signal. Copyright © 2013 ISCA.

Grais EM, Erdogan H (2011) Adaptation of speaker-specific bases in non-negative matrix factorization for single channel speech-music separation, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH pp. 569-572

This paper introduces a speaker adaptation algorithm for nonnegative matrix factorization (NMF) models. The proposed adaptation algorithm is a combination of Bayesian and subspace model adaptation. The adapted model is used to separate a speech signal from a background music signal in a single recording. Training speech data for multiple speakers is used with NMF to train a set of basis vectors as a general model for speech signals. The probabilistic interpretation of NMF is used to achieve Bayesian adaptation to adjust the general model with respect to the actual properties of the speech signal that is observed in the mixed signal. The Bayesian adapted model is adapted again by a linear transform, which changes the subspace that the Bayesian adapted model spans to better match the speech signal that is in the mixed signal. The experimental results show that combining Bayesian with linear transform adaptation improves the separation results. Copyright © 2011 ISCA.

Grais EM, Erdogan H (2012) Spectro-temporal post-smoothing in NMF based single-channel source separation, European Signal Processing Conference pp. 584-588

In this paper, we propose a new, simple, fast, and effective method to enforce temporal smoothness on nonnegative matrix factorization (NMF) solutions by post-smoothing the NMF decomposition results. In NMF based single-channel source separation, NMF is used to decompose the magnitude spectra of the mixed signal as a weighted linear combination of the trained basis vectors. The decomposition results are used to build spectral masks. To get temporal smoothness of the estimated sources, we deal with the spectral masks as 2-D images, and we pass the masks through a smoothing filter. The smoothing direction of the filter is the time direction of the spectral masks. The smoothed masks are used to find estimates for the source signals. Experimental results show that using the smoothed masks gives better separation results than enforcing a temporal smoothness prior using regularized NMF. © 2012 EURASIP.
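The post-smoothing step can be sketched as a moving-average filter applied along the time axis of the mask (an illustrative toy; the filter shape and window length are assumptions, not the paper's exact filter):

```python
import numpy as np

def smooth_mask(mask, win=5):
    """Moving-average filter along the time axis (axis 1) of a spectral mask,
    treating the mask as a 2-D image and smoothing each frequency row."""
    kernel = np.ones(win) / win
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, mask)

rng = np.random.default_rng(0)
mask = rng.random((16, 100))          # toy soft mask, values in [0, 1]
smoothed = smooth_mask(mask, win=5)   # frame-to-frame fluctuations attenuated
```

Because averaging values in [0, 1] stays in [0, 1], the smoothed mask remains a valid soft mask and can be applied to the mixture spectrogram directly.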

Grais EM, Erdogan H (2011) Single channel speech-music separation using matching pursuit and spectral masks, 2011 IEEE 19th Signal Processing and Communications Applications Conference, SIU 2011 pp. 323-326

A single-channel speech music separation algorithm based on matching pursuit (MP) with multiple dictionaries and spectral masks is proposed in this work. Training data for speech and music signals are used to build two sets of magnitude spectral vectors, one for each source signal. These sets are called dictionaries, and the vectors are called atoms. Matching pursuit is used to sparsely decompose the magnitude spectrum of the observed mixed signal as a nonnegative weighted linear combination of the best atoms in the two dictionaries that match the mixed signal structure. The weighted sum of the resulting decomposition terms that include atoms from the speech dictionary is used as an initial estimate of the speech signal contribution in the mixed signal, and the weighted sum of the remaining terms for the music signal contribution. The initial estimate of each source is used to build a spectral mask that is used to reconstruct the source signals. Experimental results show that integrating MP with spectral masks gives good separation results. © 2011 IEEE.
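A greedy nonnegative matching-pursuit decomposition can be sketched as follows (illustrative only; the paper's dictionaries are learned from speech and music training data, whereas here the dictionary and signal are toy values):

```python
import numpy as np

def nn_matching_pursuit(x, D, n_atoms=5):
    """Greedy nonnegative matching pursuit: repeatedly pick the atom with the
    largest positive correlation to the residual and accumulate its weight."""
    r = x.copy()
    weights = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ r
        j = np.argmax(corr)
        if corr[j] <= 0:              # no atom improves the fit nonnegatively
            break
        w = corr[j] / (D[:, j] @ D[:, j])
        weights[j] += w
        r = r - w * D[:, j]           # subtract the explained component
    return weights

x = np.array([3.0, 0.0, 2.0, 0.0])    # toy mixture magnitude spectrum
w = nn_matching_pursuit(x, np.eye(4), n_atoms=2)
# w ≈ [3, 0, 2, 0]: the two matching atoms are selected greedily
```

Summing the weighted atoms that came from each source's dictionary gives the initial per-source estimates from which the spectral mask is built.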

Grais Emad M, Plumbley Mark D (2017) Single Channel Audio Source Separation using Convolutional Denoising Autoencoders, GlobalSIP2017 Proceedings pp. 1265-1269 IEEE

Deep learning techniques have been used recently to tackle the audio source separation problem. In this work, we propose to use deep fully convolutional denoising autoencoders (CDAEs) for monaural audio source separation. We use as many CDAEs as the number of sources to be separated from the mixed signal. Each CDAE is trained to separate one source and treats the other sources as background noise. The main idea is to allow each CDAE to learn suitable spectral-temporal filters and features to its corresponding source. Our experimental results show that CDAEs perform source separation slightly better than the deep feedforward neural networks (FNNs) even with fewer parameters than FNNs.

Girgis E, Roma G, Simpson A, Plumbley M (2016) Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks, Interspeech2016 Proceedings ISCA

Deep neural networks (DNNs) are usually used for single channel source separation to predict either soft or binary time frequency masks. The masks are used to separate the sources from the mixed signal. Binary masks produce separated sources with more distortion and less interference than soft masks. In this paper, we propose to use another DNN to combine the estimates of binary and soft masks to achieve the advantages and avoid the disadvantages of using each mask individually. We aim to achieve separated sources with low distortion and low interference between each other. Our experimental results show that combining the estimates of binary and soft masks using DNN achieves lower distortion than using each estimate individually and achieves as low interference as the binary mask.

Grais Emad M, Plumbley Mark (2018) Combining Fully Convolutional and Recurrent Neural Networks for Single Channel Audio Source Separation, Proceedings of 144th AES Convention Audio Engineering Society

Combining different models is a common strategy to build a good audio source separation system. In this work, we combine two powerful deep neural networks for audio single channel source separation (SCSS). Namely, we combine fully convolutional neural networks (FCNs) and recurrent neural networks, specifically, bidirectional long short-term memory recurrent neural networks (BLSTMs). FCNs are good at extracting useful features from the audio data and BLSTMs are good at modeling the temporal structure of the audio signals. Our experimental results show that combining FCNs and BLSTMs achieves better separation performance than using each model individually.

Grais Emad M, Roma G, Simpson AJR, Plumbley Mark (2017) Discriminative Enhancement for Single Channel Audio Source Separation using Deep Neural Networks, LNCS 10169 pp. 236-246

The sources separated by most single channel audio source separation techniques are usually distorted and each separated source contains residual signals from the other sources. To tackle this problem, we propose to enhance the separated sources to decrease the distortion and interference between the separated sources using deep neural networks (DNNs). Two different DNNs are used in this work. The first DNN is used to separate the sources from the mixed signal. The second DNN is used to enhance the separated signals. To consider the interactions between the separated sources, we propose to use a single DNN to enhance all the separated sources together. To reduce the residual signals of one source from the other separated sources (interference), we train the DNN for enhancement discriminatively to maximize the dissimilarity between the predicted sources. The experimental results show that using discriminative enhancement decreases the distortion and interference between the separated sources.
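The discriminative objective described above can be sketched numerically: alongside the usual reconstruction error between each enhanced source and its target, a penalty term rewards dissimilarity between the predicted sources. The trade-off weight and the exact dissimilarity term below are illustrative assumptions, not the paper's precise cost function:

```python
import numpy as np

def discriminative_loss(pred1, target1, pred2, target2, lam=0.1):
    """Reconstruction error for each source, minus a cross term that
    penalizes predicted source 1 for resembling target source 2 (and
    vice versa). `lam` is an assumed trade-off weight."""
    recon = np.mean((pred1 - target1) ** 2) + np.mean((pred2 - target2) ** 2)
    cross = np.mean((pred1 - target2) ** 2) + np.mean((pred2 - target1) ** 2)
    return recon - lam * cross  # minimizing this pushes predictions apart

# Toy check: predictions that leak the other source score worse than clean ones.
t1, t2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
clean = discriminative_loss(t1, t1, t2, t2)
leaky = discriminative_loss(0.5 * t1 + 0.5 * t2, t1, 0.5 * t1 + 0.5 * t2, t2)
print(clean < leaky)  # True: clean estimates achieve a lower loss
```

Minimizing such a loss encourages each output to fit its own target while staying dissimilar from the other sources, which is the intuition behind the discriminative enhancement stage.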

Roma G, Grais E, Simpson A, Plumbley M (2016) Music Remixing and Upmixing Using Source Separation, Proceedings of the 2nd AES Workshop on Intelligent Music Production

Current research on audio source separation provides tools to estimate the signals contributed by different instruments in polyphonic music mixtures. Such tools can already be incorporated in music production and post-production workflows. In this paper, we describe recent experiments where audio source separation is applied to remixing and upmixing existing mono and stereo music content.

Grais Emad M, Roma Gerard, Simpson Andrew J. R., Plumbley Mark (2017) Two Stage Single Channel Audio Source Separation using Deep Neural Networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (9) pp. 1469-1479 IEEE

Most single channel audio source separation (SCASS) approaches produce separated sources accompanied by interference from other sources and other distortions. To tackle this problem, we propose to separate the sources in two stages. In the first stage, the sources are separated from the mixed signal. In the second stage, the interference between the separated sources and the distortions are reduced using deep neural networks (DNNs). We propose two methods that use DNNs to improve the quality of the separated sources in the second stage. In the first method, each separated source is improved individually using its own trained DNN, while in the second method all the separated sources are improved together using a single DNN. To further improve the quality of the separated sources, the DNNs in the second stage are trained discriminatively to further decrease the interference and the distortions of the separated sources. Our experimental results show that using two stages of separation improves the quality of the separated signals by decreasing the interference between the separated sources and distortions compared to separating the sources using a single stage of separation.

Wierstorf H, Ward D, Mason R, Grais E, Hummersone C, Plumbley M (2017) Perceptual Evaluation of Source Separation for Remixing Music, 143rd AES Convention Paper No 9880 Audio Engineering Society

Music remixing is difficult when the original multitrack recording is not available. One solution is to estimate the elements of a mixture using source separation. However, existing techniques suffer from imperfect separation and perceptible artifacts on single separated sources. To investigate their influence on a remix, five state-of-the-art source separation algorithms were used to remix six songs by increasing the level of the vocals. A listening test was conducted to assess the remixes in terms of loudness balance and sound quality. The results show that some source separation algorithms are able to increase the level of the vocals by up to 6 dB at the cost of introducing a small but perceptible degradation in sound quality.

Simpson A, Roma G, Grais E, Mason R, Hummersone C, Liutkus A, Plumbley M (2016) Evaluation of Audio Source Separation Models Using Hypothesis-Driven Non-Parametric Statistical Methods, European Signal Processing Conference (EUSIPCO) 2016

Audio source separation models are typically evaluated using objective separation quality measures, but rigorous statistical methods have yet to be applied to the problem of model comparison. As a result, it can be difficult to establish whether or not reliable progress is being made during the development of new models. In this paper, we provide a hypothesis-driven statistical analysis of the results of the recent source separation SiSEC challenge involving twelve competing models tested on separation of voice and accompaniment from fifty pieces of "professionally produced" contemporary music. Using nonparametric statistics, we establish reliable evidence for meaningful conclusions about the performance of the various models.

Simpson A, Roma G, Grais E, Mason R, Hummersone C, Plumbley M (2017) Psychophysical Evaluation of Audio Source Separation Methods, LNCS: Latent Variable Analysis and Signal Separation 10169 pp. 211-221 Springer

Source separation evaluation is typically a top-down process, starting with perceptual measures which capture fitness-for-purpose and followed by attempts to find physical (objective) measures that are predictive of the perceptual measures. In this paper, we take a contrasting bottom-up approach. We begin with the physical measures provided by the Blind Source Separation Evaluation Toolkit (BSS Eval) and we then look for corresponding perceptual correlates. This approach is known as psychophysics and has the distinct advantage of leading to interpretable, psychophysical models. We obtained perceptual similarity judgments from listeners in two experiments featuring vocal sources within musical mixtures. In the first experiment, listeners compared the overall quality of vocal signals estimated from musical mixtures using a range of competing source separation methods. In a loudness experiment, listeners compared the loudness balance of the competing musical accompaniment and vocal. Our preliminary results provide provisional validation of the psychophysical approach.

Grais Emad M, Wierstorf Hagen, Ward Dominic, Plumbley Mark D (2018) Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation, Proceedings of LVA/ICA 2018 (Lecture Notes in Computer Science) 10891 pp. 340-350 Springer Verlag

In deep neural networks with convolutional layers, all the neurons in each layer typically have the same size receptive fields (RFs) with the same resolution. Convolutional layers with neurons that have large RFs capture global information from the input features, while layers with neurons that have small RFs capture local details with high resolution from the input features. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCN), where each layer has a range of neurons with different RF sizes to extract multi-resolution features that capture the global and local information from its input features. The proposed MR-FCN is applied to separate the singing voice from mixtures of music sources. Experimental results show that using MR-FCN improves the performance compared to feedforward deep neural networks (DNNs) and single resolution deep fully convolutional neural networks (FCNs) on the audio source separation problem.
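The multi-resolution idea can be illustrated with plain 1-D convolutions: filters of different lengths in the same layer see different receptive-field sizes, and their outputs are stacked as multi-resolution features. The filter lengths and averaging weights below are arbitrary illustrative stand-ins for the learned filters in an MR-FCN:

```python
import numpy as np

def multi_resolution_features(signal, filter_lengths=(3, 9)):
    """Convolve the input with one filter per receptive-field size and
    stack the outputs: short filters keep local detail, long filters
    summarize wider context."""
    feats = []
    for n in filter_lengths:
        kernel = np.ones(n) / n  # simple averaging filter as a stand-in for learned weights
        feats.append(np.convolve(signal, kernel, mode="same"))
    return np.stack(feats)  # shape: (num_resolutions, len(signal))

# Noisy sinusoid as a toy input signal.
x = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.3 * np.random.default_rng(0).normal(size=64)
F = multi_resolution_features(x)
print(F.shape)  # (2, 64)
# The wide-RF channel is smoother: its output fluctuates less than the narrow-RF channel.
print(np.std(np.diff(F[1])) < np.std(np.diff(F[0])))  # True
```

A downstream layer that sees both channels thus gets local detail and broader context at once, which is the intuition behind mixing RF sizes within a single layer.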

Grais Emad M, Ward Dominic, Plumbley Mark D (2018) Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders, Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO) pp. 1577-1581 Institute of Electrical and Electronics Engineers (IEEE)

Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multi-resolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multi-resolution features for separating the singing voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.

Kim Chungeun, Grais Emad M, Mason Russell, Plumbley Mark D (2018) Perception of phase changes in the context of musical audio source separation, 145th AES Convention 10031 AES

This study investigates the perceptual consequences of phase change in conventional magnitude-based source separation. A listening test was conducted, where the participants compared three different source separation scenarios, each with two phase retrieval cases: phase from the original mix or from the target source. The participants' responses regarding their similarity to the reference showed that 1) the difference between the mix phase and the perfect target phase was perceivable in the majority of cases, with some song-dependent exceptions, and 2) use of the mix phase degraded the perceived quality even in the case of perfect magnitude separation. The findings imply that there is room for perceptual improvement by attempting correct phase reconstruction, in addition to achieving better magnitude-based separation.
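The two phase-retrieval cases compared in the listening test can be mimicked on a toy spectrum: keep a perfectly separated target magnitude, but pair it with either the mixture's phase or the target's own phase. A small NumPy sketch (the complex spectra here are random illustrative values, not STFTs of real songs):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy complex spectra for a target source and an interfering source.
target = rng.normal(size=8) + 1j * rng.normal(size=8)
interference = rng.normal(size=8) + 1j * rng.normal(size=8)
mix = target + interference

# Case 1: perfect magnitude, phase taken from the mix (common in magnitude-based separation).
mix_phase_estimate = np.abs(target) * np.exp(1j * np.angle(mix))
# Case 2: perfect magnitude with the target's own phase recovers the target exactly.
target_phase_estimate = np.abs(target) * np.exp(1j * np.angle(target))

err_mix_phase = np.sum(np.abs(mix_phase_estimate - target) ** 2)
err_target_phase = np.sum(np.abs(target_phase_estimate - target) ** 2)
print(err_target_phase < err_mix_phase)  # True: the mix phase leaves a residual error
```

Even with a perfect magnitude estimate, borrowing the mixture's phase leaves a nonzero error in each bin, which matches the paper's finding that the mix phase degrades perceived quality.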