
Alfredo Zermini


Postgraduate Research Student

Publications

Zermini Alfredo, Liu Qingju, Xu Yong, Plumbley Mark, Betts Dave, Wang Wenwu (2017) Binaural and Log-Power Spectra Features with Deep Neural Networks for Speech-Noise Separation, Proceedings of MMSP 2017 - IEEE 19th International Workshop on Multimedia Signal Processing, IEEE
Binaural features of interaural level difference and
interaural phase difference have proved to be very effective
in training deep neural networks (DNNs) to generate time-frequency
masks for target speech extraction in speech-speech
mixtures. However, the effectiveness of binaural features is reduced
in more common speech-noise scenarios, since the noise may
overshadow the speech in adverse conditions. In addition,
reverberation decreases the sparsity of binaural features and
therefore makes the separation task more difficult. To address the
above limitations, we highlight the spectral difference between
speech and noise spectra and incorporate the log-power spectra
features to extend the DNN input. Tested on two different
reverberant rooms at different signal-to-noise ratios (SNRs), our
proposed method shows advantages over the baseline method
using only binaural features in terms of signal-to-distortion ratio
(SDR) and Short-Time Objective Intelligibility (STOI).
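The feature set described in this abstract can be illustrated with a minimal sketch. It assumes the standard definitions of interaural level difference (ILD, in dB) and interaural phase difference (IPD, in radians) computed per time-frequency bin from the left and right STFT coefficients. The function names, the `EPS` floor, and the choice of taking the log-power spectrum from the left channel only are assumptions made for illustration and are not taken from the paper.

```python
import cmath
import math

EPS = 1e-12  # assumed floor to avoid log(0); not taken from the paper


def binaural_features(left, right):
    """Return (ILD in dB, IPD in radians) for one time-frequency bin.

    left, right: complex STFT coefficients of the left and right channels.
    """
    ild = 20.0 * math.log10((abs(left) + EPS) / (abs(right) + EPS))
    # Phase of left / right, wrapped to [-pi, pi].
    ipd = cmath.phase(left * right.conjugate())
    return ild, ipd


def log_power(coeff):
    """Natural-log power of one complex STFT coefficient."""
    return math.log(abs(coeff) ** 2 + EPS)


def feature_vector(left_frame, right_frame):
    """Concatenate ILD, IPD and log-power features for one time frame.

    Assumption: the log-power spectrum is taken from the left channel only.
    """
    ilds, ipds = [], []
    for l, r in zip(left_frame, right_frame):
        ild, ipd = binaural_features(l, r)
        ilds.append(ild)
        ipds.append(ipd)
    lps = [log_power(l) for l in left_frame]
    return ilds + ipds + lps
```

For a frame with F frequency bins, this hypothetical `feature_vector` returns 3F values, which would form the extended input to the network.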
Zermini A, Wang W, Kong Q, Xu Y, Plumbley M (2017) Audio source separation with deep neural networks using the dropout algorithm, Signal Processing with Adaptive Sparse Structured Representations (SPARS) 2017 Book of Abstracts pp. 1-2 Instituto de Telecomunicações
A method based on Deep Neural Networks (DNNs) and
time-frequency masking has been recently developed for binaural audio
source separation. In this method, the DNNs are used to predict the
Direction Of Arrival (DOA) of the audio sources with respect to the
listener, which is then used to generate soft time-frequency masks for
the recovery/estimation of the individual audio sources. In this paper, an
algorithm called "dropout" is applied to the hidden layers, affecting
the sparsity of hidden unit activations: randomly selected neurons and
their connections are dropped during the training phase, preventing
feature co-adaptation. These methods are evaluated on binaural mixtures
generated with Binaural Room Impulse Responses (BRIRs), accounting
for a certain level of room reverberation. The results show that the proposed
DNN system with randomly dropped neurons achieves higher
SDR performance than the baseline method without the dropout
algorithm.
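The dropout step described above can be sketched as follows. This is a minimal illustration of the widely used "inverted dropout" variant, in which surviving activations are rescaled during training so that no rescaling is needed at test time; whether the paper uses this exact variant is an assumption. The function name and parameters are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero hidden-unit activations during training.

    activations: list of floats from one hidden layer.
    drop_prob:   probability of dropping each unit (0 <= drop_prob < 1).
    rng:         a random.Random instance, passed in for reproducibility.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At test time every unit is kept unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Kept units are scaled by 1 / keep_prob (inverted dropout).
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

A call such as `dropout(hidden, 0.5, random.Random(0))` would zero roughly half of the units in `hidden` on each training pass, which is what discourages units from co-adapting.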
Zermini A, Yu Y, Xu Y, Plumbley M, Wang W (2016) Deep neural network based audio source separation, Proceedings of the 11th IMA International Conference on Mathematics in Signal Processing pp. 1-4 Institute of Mathematics & its Applications (IMA)
Audio source separation aims to extract individual sources from mixtures of
multiple sound sources. Many techniques have been developed, such as independent
component analysis, computational auditory scene analysis, and non-negative matrix
factorisation. A method based on Deep Neural Networks (DNNs) and time-frequency (T-F)
masking has been recently developed for binaural audio source separation. In this method, the
DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect
to the listener, which is then used to generate soft T-F masks for the recovery/estimation
of the individual audio sources.
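One way to picture the step from DOA predictions to soft T-F masks is the sketch below. It assumes that the network outputs, for each T-F bin, a probability for each candidate DOA class, and that the mask for a source is the probability of that source's DOA, normalised across sources. This interpretation, the function names, and the flat list-of-bins layout are assumptions for illustration; the abstract does not specify these details.

```python
def soft_masks(doa_probs, source_doas):
    """Build one soft mask per source from per-bin DOA probabilities.

    doa_probs:   list over T-F bins; each entry is a list of probabilities,
                 one per candidate DOA class (assumed network output).
    source_doas: DOA class index assigned to each source.
    Returns masks[s][b], with values in [0, 1] that sum to 1 over sources.
    """
    masks = [[] for _ in source_doas]
    for probs in doa_probs:
        weights = [probs[d] for d in source_doas]
        total = sum(weights)
        for s, w in enumerate(weights):
            # Fall back to an equal split if no source DOA has any mass.
            masks[s].append(w / total if total > 0 else 1.0 / len(source_doas))
    return masks


def apply_mask(mask, mixture):
    """Estimate a source by weighting each mixture STFT bin by its mask."""
    return [m * x for m, x in zip(mask, mixture)]
```

The masked spectrogram would then be inverted back to the time domain to obtain the estimated source signal.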
Zermini Alfredo, Kong Qiuqiang, Xu Yong, Plumbley Mark D., Wang Wenwu (2018) Improving reverberant speech separation with binaural cues using temporal context and convolutional neural networks, In: Deville Y, Gannot S, Mason R, Plumbley Mark, Ward D (eds.), Latent Variable Analysis and Signal Separation: LVA/ICA 2018. Lecture Notes in Computer Science 10891 pp. 361-371 Springer
Given binaural features as input, such as interaural level difference
and interaural phase difference, Deep Neural Networks (DNNs)
have been recently used to localize sound sources in a mixture of speech
signals and/or noise, and to create time-frequency masks for the estimation
of the sound sources in reverberant rooms. Here, we explore a
more advanced system, where feed-forward DNNs are replaced by Convolutional
Neural Networks (CNNs). In addition, the adjacent frames
of each time frame (occurring before and after this frame) are used to
exploit contextual information, thus improving the localization and separation
for each source. The quality of the separation results is evaluated
in terms of Signal to Distortion Ratio (SDR).
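The use of adjacent frames as temporal context can be sketched as a simple frame-stacking step. This is an illustration only: the function name, the `context` parameter, and the choice to repeat the first and last frames at the edges of the signal are assumptions, not details taken from the paper.

```python
def stack_context(frames, context):
    """Concatenate each frame with its neighbours before and after it.

    frames:  list of per-frame feature lists (all of equal length).
    context: number of frames taken on each side of the current frame.
    Edge frames are padded by repeating the first or last frame (assumed).
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked
```

With `context=1`, each stacked frame holds three times as many features as the original frame, arranged in time order so that a convolutional layer can operate across the neighbouring frames.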