To assist with the development of intelligent mixing systems,
it would be useful to extract the loudness
balance of sources in an existing musical mixture. The
relative-to-mix loudness level of four instrument groups was
predicted using the sources extracted by 12 audio source
separation algorithms. The predictions were compared with
the ground truth loudness data of the original unmixed stems
obtained from a recent dataset involving 100 mixed songs.
It was found that the best source separation system could
predict the relative loudness of each instrument group with
an average root-mean-square error of 1.2 LU, with superior
performance obtained on vocals.
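
The evaluation described above can be illustrated with a minimal sketch. It assumes plain RMS level in dB as a simplified stand-in for loudness; the LU figures quoted in the abstract refer to a standardized loudness measure, which this sketch does not implement. All function names and the toy signals are hypothetical and are not taken from the study.

```python
import math

def level_db(samples):
    """RMS level in dB (a simplified stand-in for a loudness measure)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square + 1e-12)  # small offset avoids log(0)

def relative_level(source, mixture):
    """Level of a source relative to the full mixture, in dB."""
    return level_db(source) - level_db(mixture)

def rmse(predicted, reference):
    """Root-mean-square error between two equal-length lists."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Toy signals (hypothetical data): a true stem, an imperfect estimate, and the mix.
true_vocals = [0.5 * math.sin(0.05 * n) for n in range(2000)]
backing = [0.3 * math.sin(0.21 * n) for n in range(2000)]
mixture = [v + b for v, b in zip(true_vocals, backing)]
estimated_vocals = [0.9 * v for v in true_vocals]  # the estimate lost some energy

truth = relative_level(true_vocals, mixture)
prediction = relative_level(estimated_vocals, mixture)
error = rmse([prediction], [truth])
print(f"relative level error: {error:.2f} dB")
```

In a full evaluation, the lists passed to `rmse` would hold one prediction and one ground-truth value per song for a given instrument group.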
Music remixing is difficult when the original multitrack recording is not available. One solution is to estimate the elements of a mixture using source separation. However, existing techniques suffer from imperfect separation and perceptible artifacts on single separated sources. To investigate their influence on a remix, five state-of-the-art source separation algorithms were used to remix six songs by increasing the level of the vocals. A listening test was conducted to assess the remixes in terms of loudness balance and sound quality. The results show that some source separation algorithms are able to increase the level of the vocals by up to 6 dB at the cost of introducing a small but perceptible degradation in sound quality.
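
The remixing step can be sketched as follows, assuming the separation stage returns a vocal estimate and an accompaniment estimate of equal length. The function names and sample values are hypothetical; the study's actual processing chain is not described in the abstract.

```python
def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)

def remix(vocals, accompaniment, vocal_boost_db):
    """Scale the vocal estimate and add it back to the accompaniment estimate."""
    gain = db_to_gain(vocal_boost_db)
    return [gain * v + a for v, a in zip(vocals, accompaniment)]

# Hypothetical separated estimates (a few samples only).
vocals = [0.10, -0.20, 0.30]
accompaniment = [0.05, 0.05, -0.10]

boosted = remix(vocals, accompaniment, 6.0)  # +6 dB is roughly a doubling of amplitude
print(boosted)
```

Because the gain is applied to the estimate rather than the true stem, any interference or artifacts left in the vocal estimate are amplified along with the vocals, which is one plausible route to the quality degradation reported above.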
There is some uncertainty as to whether objective metrics for
predicting the perceived quality of audio source separation are
sufficiently accurate. This issue was investigated by employing
a revised experimental methodology to collect subjective
ratings of sound quality and interference of singing-voice
recordings that have been extracted from musical mixtures
using state-of-the-art audio source separation. A correlation
analysis between the experimental data and the measures of
two objective evaluation toolkits, BSS Eval and PEASS, was
performed to assess their performance. The artifacts-related
perceptual score of the PEASS toolkit had the strongest correlation
with the perception of artifacts and distortions caused
by singing-voice separation. Both the source-to-interference
ratio of BSS Eval and the interference-related perceptual
score of PEASS showed comparable correlations with the
human ratings of interference.
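
A correlation analysis of this kind can be sketched as below. The sketch assumes a Pearson coefficient; the abstract does not state which correlation measure was used. The rating and metric values are hypothetical placeholders, not data from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: one mean listener rating and one metric score per stimulus.
subjective_ratings = [72.0, 55.0, 40.0, 81.0, 63.0]
objective_scores = [9.1, 6.4, 5.0, 10.2, 6.9]

r = pearson(subjective_ratings, objective_scores)
print(f"correlation between metric and ratings: {r:.2f}")
```

A coefficient near 1 would indicate that the objective metric ranks and scales the stimuli much as listeners do; a rank-based coefficient could be substituted if only monotonic agreement matters.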
In deep neural networks with convolutional layers, all the
neurons in each layer typically have receptive fields (RFs) of the same size
and the same resolution. Convolutional layers whose neurons have
large RFs capture global information from the input features, while layers
whose neurons have small RFs capture local details with high
resolution from the input features. In this work, we introduce novel deep
multi-resolution fully convolutional neural networks (MR-FCN), where
each layer has a range of neurons with different RF sizes to extract
multi-resolution features that capture the global and local information from its
input features. The proposed MR-FCN is applied to separate the singing
voice from mixtures of music sources. Experimental results show that
using MR-FCN improves performance over feedforward deep
neural networks (DNNs) and single-resolution deep fully convolutional
neural networks (FCNs) on the audio source separation problem.
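
The idea of a layer whose neurons have different RF sizes can be illustrated with a minimal one-dimensional sketch: filters of several lengths are applied in parallel to the same input, and their outputs are kept as separate feature maps. This is an assumption-laden toy, not the MR-FCN architecture itself; the filter lengths, the averaging weights, and all function names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1-D correlation with zero padding; assumes an odd kernel length
    so that the output has the same length as the input."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]

def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def multi_resolution_layer(signal, kernels):
    """Apply filters of different lengths in parallel; one feature map per filter."""
    return [relu(conv1d_same(signal, kernel)) for kernel in kernels]

# Hypothetical averaging filters with small, medium, and large receptive fields.
kernels = [[1.0 / size] * size for size in (3, 7, 15)]
signal = [0.0] * 10 + [1.0] + [0.0] * 10  # unit impulse

features = multi_resolution_layer(signal, kernels)
for kernel, feature_map in zip(kernels, features):
    spread = sum(1 for v in feature_map if v > 0.0)
    print(f"filter length {len(kernel)}: impulse spreads over {spread} samples")
```

The short filter keeps the impulse localized while the long filter spreads it over a wide context, which mirrors the local-versus-global trade-off described above. In a trained network the filter weights would be learned rather than fixed.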
Supervised multi-channel audio source separation
requires extracting useful spectral, temporal, and spatial features
from the mixed signals. The success of many existing systems is
therefore largely dependent on the choice of features used for
training. In this work, we introduce a novel multi-channel, multiresolution
convolutional auto-encoder neural network that works
on raw time-domain signals to determine appropriate multiresolution
features for separating the singing-voice from stereo
music. Our experimental results show that the proposed method
can achieve multi-channel audio source separation without the
need for hand-crafted features or any pre- or post-processing.
This book constitutes the proceedings of the 14th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2018, held in Guildford, UK, in July 2018. The 52 full papers were carefully reviewed and selected from 62 initial submissions. The papers cover a wide range of topics, from general mixtures of latent variable models to theories and tools drawn from a great variety of disciplines, such as structured tensor decompositions and applications; matrix and tensor factorizations; ICA methods; nonlinear mixtures; audio data and methods; signal separation evaluation campaign; deep learning and data-driven methods; advances in phase retrieval and applications; sparsity-related methods; and biomedical data and methods.
The Signal Separation Evaluation Campaign (SiSEC) is a
large-scale regular event aimed at evaluating current progress
in source separation through a systematic and reproducible
comparison of the participants' algorithms, providing the
source separation community with an invaluable glimpse of
recent achievements and open challenges. This paper focuses
on the music separation task from SiSEC 2018, which
compares algorithms aimed at recovering instrument stems
from a stereo mix. In this context, we conducted a subjective
evaluation whereby 34 listeners picked which of six competing
algorithms, with high objective performance scores,
best separated the singing-voice stem from 13 professionally
mixed songs. The subjective results reveal strong differences
between the algorithms, and highlight the presence
of song-dependent performance for state-of-the-art systems.
Correlations between the subjective results and the scores of
two popular performance metrics are also presented.