
Dr Christian Kroos


Research Software Developer
+44 (0)1483 684740
09 BB 01

Publications

Kroos Christian, Bundgaard-Nielsen R. L., Best C. T., Plumbley Mark (2017) Using deep neural networks to estimate tongue movements from speech face motion, Proceedings of AVSP 2017, KTH
This study concludes a tripartite investigation into the indirect visibility of the moving tongue in human speech as reflected in co-occurring changes of the facial surface. We were particularly interested in how the shared information is distributed over the range of contributing frequencies. In the current study we examine the degree to which tongue movements during speech can be reliably estimated from face motion using artificial neural networks. We simultaneously acquired data for both movement types; tongue movements were measured with Electromagnetic Articulography (EMA), face motion with a passive marker-based motion capture system. A multiresolution analysis using wavelets provided the desired decomposition into frequency subbands. In the two earlier studies of the project we established linear and non-linear relations between lingual and facial speech motions, as predicted by and compatible with previous research in auditory-visual speech. The results of the current study, using a Deep Neural Network (DNN) for prediction, show that a substantive amount of variance can be recovered (between 13.9% and 33.2%, depending on the speaker and tongue sensor location). Importantly, however, the recovered variance values and the root mean squared error values of the Euclidean distances between the measured and the predicted tongue trajectories are in the range of the linear estimations of our earlier study.
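As a rough illustration of the wavelet-based multiresolution analysis mentioned above, the minimal sketch below decomposes a one-dimensional movement trajectory into frequency subbands and reconstructs each subband separately. This is not the authors' code; the wavelet family ('db4'), the decomposition depth and the synthetic trajectory are assumptions made for the example only.

```python
# Minimal sketch: split a movement trajectory into wavelet subbands
# and reconstruct each subband separately (illustrative assumptions only).
import numpy as np
import pywt

fs = 100.0                                   # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
trajectory = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.sin(2 * np.pi * 8 * t)

level = 4
coeffs = pywt.wavedec(trajectory, 'db4', level=level)

# Reconstruct one subband at a time by zeroing all other coefficient sets.
subbands = []
for i in range(len(coeffs)):
    kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
    subbands.append(pywt.waverec(kept, 'db4')[: len(trajectory)])

# The subbands sum back (approximately) to the original trajectory,
# so variance can be attributed per frequency band.
reconstruction = np.sum(subbands, axis=0)
print(np.allclose(reconstruction, trajectory, atol=1e-6))
```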
The first generation of three-dimensional Electromagnetic Articulography devices (Carstens AG500) suffered from occasional critical tracking failures. Although now superseded by newer devices, the AG500 is still in use in many speech labs and many valuable data sets exist. In this study we investigate whether deep neural networks (DNNs) can learn the mapping function from raw voltage amplitudes to sensor positions based on a comprehensive movement data set. This is compared to arriving sample by sample at individual position values via direct optimisation, as used in previous methods. We found that with appropriate hyperparameter settings a DNN was able to approximate the mapping function with good accuracy, leading to a smaller error than the previous methods, but that the DNN-based approach was not able to solve the tracking problem completely.
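The sketch below shows the general shape of the learned mapping described above: a small fully connected network regressing sensor positions from raw amplitude measurements. It is not the paper's configuration; the synthetic forward model (a random linear map followed by a mild nonlinearity), the layer sizes and the training settings are all illustrative assumptions.

```python
# Minimal sketch: learn a voltage-amplitude -> sensor-position mapping
# with a small MLP on synthetic data (assumptions noted in the lead-in).
import torch
import torch.nn as nn

torch.manual_seed(0)

n_amps, n_coords = 6, 3                       # e.g. 6 amplitudes -> x, y, z
positions = torch.rand(5000, n_coords) * 2 - 1
mixing = torch.randn(n_coords, n_amps)
amplitudes = torch.tanh(positions @ mixing)   # stand-in forward model

model = nn.Sequential(
    nn.Linear(n_amps, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_coords),
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(amplitudes), positions)
    loss.backward()
    optimiser.step()

# Mean Euclidean error between predicted and true positions, analogous to
# the RMSE of Euclidean distances reported in the study.
with torch.no_grad():
    err = torch.linalg.norm(model(amplitudes) - positions, dim=1).mean()
print(f"mean Euclidean error: {err.item():.4f}")
```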
Kroos Christian, Plumbley Mark (2017) Neuroevolution for sound event detection in real life audio: A pilot study, Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE 2017) Proceedings, Tampere University of Technology
Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 metric with an F1 of 44.9% and achieved rank 15 out of 34 on the ER metric with a value of 0.891. We discuss the question of evolving versus learning for supervised tasks.
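To illustrate the feature compaction described above, the minimal sketch below clusters the scales (rows) of a time-frequency-like representation with k-means and collapses each cluster of scales to its mean, yielding a small fixed-size input of the kind an evolved network could consume. The random stand-in matrix and the cluster count are assumptions, not the paper's settings.

```python
# Minimal sketch: k-means across scales (rows), not across samples,
# to compact a scattering-like representation (illustrative assumptions).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_scales, n_frames = 80, 500
scattering = rng.random((n_scales, n_frames))   # stand-in for a deep scattering transform

k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scattering)

# Average the scales assigned to each cluster -> k rows instead of n_scales.
compact = np.vstack([scattering[labels == c].mean(axis=0) for c in range(k)])
print(compact.shape)    # (10, 500)
```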
Duel Tijs, Frohlich David M., Kroos Christian, Xu Yong, Jackson Philip J. B., Plumbley Mark D. (2018) Supporting audiography: Design of a system for sentimental sound recording, classification and playback, Communications in Computer and Information Science: HCI International 2018 - Posters' Extended Abstracts, 850, pp. 24-31, Scientific Publishing Services, on behalf of Springer
It is now commonplace to capture and share images in photography as triggers for memory. In this paper we explore the possibility of using sound in the same sort of way, in a practice we call audiography. We report an initial design activity to create a system called Audio Memories comprising a ten second sound recorder, an intelligent archive for auto-classifying sound clips, and a multi-layered sound player for the social sharing of audio souvenirs around a table. The recorder and player components are essentially user experience probes that provide tangible interfaces for capturing and interacting with audio memory cues. We discuss our design decisions and process in creating these tools that harmonize user interaction and machine listening to evoke rich memories and conversations in an exploratory and open-ended way.
Iqbal Turab, Kong Qiuqiang, Plumbley Mark D., Wang Wenwu (2018) General-purpose audio tagging from noisy labels using convolutional neural networks, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pp. 212-216, Tampere University of Technology
General-purpose audio tagging refers to classifying sounds that are of a diverse nature, and is relevant in many applications where domain-specific information cannot be exploited. The DCASE 2018 challenge introduces Task 2 for this very problem. In this task, there are a large number of classes and the audio clips vary in duration. Moreover, a subset of the labels is noisy. In this paper, we propose a system to address these challenges. The basis of our system is an ensemble of convolutional neural networks trained on log-scaled mel spectrograms. We use preprocessing and data augmentation methods to improve the performance further. To reduce the effects of label noise, two techniques are proposed: loss function weighting and pseudo-labeling. Experiments on the private test set of this task show that our system achieves state-of-the-art performance with a mean average precision score of 0.951.
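The minimal sketch below illustrates two ingredients named in the abstract: the log-scaled mel spectrogram input and the loss function weighting idea for noisy labels. It is not the authors' system; the synthetic audio, the mel-band count and the down-weighting factor for noisy-labelled clips (0.5) are assumptions made for the example.

```python
# Minimal sketch: log-mel features plus a simple per-example loss weighting
# that down-weights clips whose labels come from the noisy subset.
import numpy as np
import librosa

sr = 32000
y = np.random.randn(sr * 2).astype(np.float32)          # stand-in 2 s clip

# Log-scaled mel spectrogram, the representation fed to the CNN ensemble.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)

def weighted_loss(per_example_loss, is_noisy, noisy_weight=0.5):
    """Average loss with noisy-labelled examples contributing less."""
    weights = np.where(is_noisy, noisy_weight, 1.0)
    return float(np.mean(weights * per_example_loss))

print(log_mel.shape)
print(weighted_loss(np.array([0.2, 0.9]), np.array([False, True])))
```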
Kroos Christian (2018) Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Tampere University of Technology