
Alfredo Zermini


Postgraduate Research Student

Publications

Zermini Alfredo, Liu Qingju, Xu Yong, Plumbley Mark, Betts Dave, Wang Wenwu (2017) Binaural and Log-Power Spectra Features with Deep Neural Networks for Speech-Noise Separation, Proceedings of MMSP 2017 - IEEE 19th International Workshop on Multimedia Signal Processing, IEEE
Binaural features of interaural level difference and interaural phase difference have proved to be very effective in training deep neural networks (DNNs) to generate time-frequency masks for target speech extraction in speech-speech mixtures. However, the effectiveness of binaural features is reduced in more common speech-noise scenarios, since the noise may overshadow the speech in adverse conditions. In addition, reverberation decreases the sparsity of binaural features and therefore adds difficulty to the separation task. To address these limitations, we highlight the spectral difference between speech and noise spectra and incorporate log-power spectra features to extend the DNN input. Tested in two different reverberant rooms at different signal-to-noise ratios (SNRs), our proposed method shows advantages over the baseline method using only binaural features in terms of signal-to-distortion ratio (SDR) and Short-Time Objective Intelligibility (STOI).
Zermini A, Wang W, Kong Q, Xu Y, Plumbley M (2017) Audio source separation with deep neural networks using the dropout algorithm, Signal Processing with Adaptive Sparse Structured Representations (SPARS) 2017 Book of Abstracts, pp. 1-2, Instituto de Telecomunicações
A method based on Deep Neural Networks (DNNs) and time-frequency masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft time-frequency masks for the recovery/estimation of the individual audio sources. In this paper, an algorithm called "dropout" is applied to the hidden layers, affecting the sparsity of hidden unit activations: randomly selected neurons and their connections are dropped during the training phase, preventing feature co-adaptation. These methods are evaluated on binaural mixtures generated with Binaural Room Impulse Responses (BRIRs), accounting for a certain level of room reverberation. The results show that the proposed DNN system with randomly deleted neurons achieves higher SDR performance than the baseline method without the dropout algorithm.
Zermini A, Yu Y, Xu Y, Plumbley M, Wang W (2016) Deep neural network based audio source separation, Proceedings of the 11th IMA International Conference on Mathematics in Signal Processing, pp. 1-4, Institute of Mathematics & its Applications (IMA)
Audio source separation aims to extract individual sources from mixtures of multiple sound sources. Many techniques have been developed, such as independent component analysis, computational auditory scene analysis, and non-negative matrix factorisation. A method based on Deep Neural Networks (DNNs) and time-frequency (T-F) masking has been recently developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft T-F masks for the recovery/estimation of the individual audio sources.
Zermini Alfredo, Kong Qiuqiang, Xu Yong, Plumbley Mark D., Wang Wenwu (2018) Improving reverberant speech separation with binaural cues using temporal context and convolutional neural networks, In: Deville Y, Gannot S, Mason R, Plumbley Mark, Ward D (eds.), Latent Variable Analysis and Signal Separation: LVA/ICA 2018. Lecture Notes in Computer Science 10891, pp. 361-371, Springer
Given binaural features as input, such as interaural level difference and interaural phase difference, Deep Neural Networks (DNNs) have been recently used to localize sound sources in a mixture of speech signals and/or noise, and to create time-frequency masks for the estimation of the sound sources in reverberant rooms. Here, we explore a more advanced system, where feed-forward DNNs are replaced by Convolutional Neural Networks (CNNs). In addition, the adjacent frames of each time frame (occurring before and after this frame) are used to exploit contextual information, thus improving the localization and separation for each source. The quality of the separation results is evaluated in terms of Signal to Distortion Ratio (SDR).
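The use of adjacent frames as temporal context, mentioned in the abstract above, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stack_context`, the `context` parameter, the edge-padding choice, and the feature layout (one row per time frame) are all assumptions made for illustration.

```python
import numpy as np

def stack_context(features, context=2):
    """Concatenate each frame with its neighbours (hypothetical helper).

    features: array of shape (n_frames, n_features), e.g. per-frame
              binaural features (layout is an assumption).
    context:  number of frames taken before and after each frame.
    Returns an array of shape (n_frames, (2 * context + 1) * n_features).
    """
    n_frames = features.shape[0]
    # Repeat the first and last frames so edge frames also get full context.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Each slice is the feature matrix shifted by one frame offset.
    windows = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

# Toy example: 4 frames with 3 features each, one frame of context per side.
toy = np.arange(12).reshape(4, 3)
stacked = stack_context(toy, context=1)
print(stacked.shape)
```

Each output row holds the preceding frame, the current frame, and the following frame side by side, so a network receiving it sees a short time window rather than a single frame.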