
Dr Christian Kroos

Research Software Developer

My publications


Kroos C, Bundgaard-Nielsen R, Best C, Plumbley MD (2017) Using deep neural networks to estimate tongue movements from speech face motion, Proceedings of AVSP 2017, KTH
This study concludes a tripartite investigation into the indirect visibility of the moving tongue in human speech as reflected in co-occurring changes of the facial surface. We were particularly interested in how the shared information is distributed over the range of contributing frequencies. In the current study we examine the degree to which tongue movements during speech can be reliably estimated from face motion using artificial neural networks. We simultaneously acquired data for both movement types; tongue movements were measured with Electromagnetic Articulography (EMA), face motion with a passive marker-based motion capture system. A multiresolution analysis using wavelets provided the desired decomposition into frequency subbands. In the two earlier studies of the project we established linear and non-linear relations between lingual and facial speech motions, as predicted by and compatible with previous research in auditory-visual speech. The results of the current study, using a Deep Neural Network (DNN) for prediction, show that a substantive amount of variance can be recovered (between 13.9 and 33.2%, depending on the speaker and tongue sensor location). Importantly, however, the recovered variance values and the root mean squared error values of the Euclidean distances between the measured and the predicted tongue trajectories are in the range of the linear estimations of our earlier studies.

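The evaluation quantities named above can be sketched numerically. Assuming trajectories are stored as (n_samples, 3) arrays of 3D sensor positions, the RMSE of per-sample Euclidean distances and the percentage of variance recovered might be computed as follows (the helper and the synthetic data are illustrative, not taken from the paper's code):

```python
import numpy as np

def trajectory_errors(measured, predicted):
    """RMSE of per-sample Euclidean distances and percent variance recovered
    between measured and predicted 3D trajectories of shape (n_samples, 3).
    Hypothetical helper for illustration; names are not from the paper.
    """
    dists = np.linalg.norm(measured - predicted, axis=1)
    rmse = np.sqrt(np.mean(dists ** 2))
    # variance recovered: 1 - residual variance / total variance, in percent
    ss_res = np.sum((measured - predicted) ** 2)
    ss_tot = np.sum((measured - measured.mean(axis=0)) ** 2)
    var_recovered = 100.0 * (1.0 - ss_res / ss_tot)
    return rmse, var_recovered

rng = np.random.default_rng(0)
measured = np.cumsum(rng.normal(size=(500, 3)), axis=0)      # synthetic smooth trajectory
predicted = measured + rng.normal(scale=0.5, size=(500, 3))  # noisy model estimate
rmse, var_rec = trajectory_errors(measured, predicted)
print(rmse, var_rec)
```

Both the measured and predicted trajectories here are stand-ins; with real EMA data the same two numbers would be computed per speaker and per sensor.
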
Kroos C, Plumbley MD (2017) Learning the mapping function from voltage amplitudes to sensor positions in 3D-EMA using deep neural networks, Proceedings of Interspeech 2017, pp. 454-458, ISCA
The first generation of three-dimensional Electromagnetic Articulography devices (Carstens AG500) suffered from occasional critical tracking failures. Although now superseded by newer devices, the AG500 is still in use in many speech labs and many valuable data sets exist. In this study we investigate whether deep neural networks (DNNs) can learn the mapping function from raw voltage amplitudes to sensor positions based on a comprehensive movement data set. This is compared to arriving sample by sample at individual position values via direct optimisation, as used in previous methods. We found that with appropriate hyperparameter settings a DNN was able to approximate the mapping function with good accuracy, leading to a smaller error than the previous methods, but that the DNN-based approach was not able to solve the tracking problem.

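The core idea of fitting a mapping function with a network can be sketched minimally: a one-hidden-layer MLP trained by plain gradient descent on a synthetic six-amplitude-to-3D-position mapping. The target function, layer sizes, and hyperparameters below are arbitrary stand-ins for illustration, not the paper's AG500 setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 6 voltage amplitudes -> 3D position.
# The true EMA mapping is nonlinear; this smooth function only
# illustrates learning a mapping with a small network.
X = rng.uniform(-1, 1, size=(2000, 6))
Y = np.stack([np.tanh(X[:, 0] + X[:, 1]),
              np.sin(X[:, 2] * X[:, 3]),
              X[:, 4] * X[:, 5]], axis=1)

H = 32                                     # hidden units
W1 = rng.normal(scale=0.5, size=(6, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 3)); b2 = np.zeros(3)
lr = 0.05

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

_, pred0 = forward(X)
loss0 = np.mean((pred0 - Y) ** 2)          # MSE before training

for _ in range(500):
    h, pred = forward(X)
    err = (pred - Y) / len(X)              # dL/dpred for MSE (up to a factor 2)
    gW2 = h.T @ err; gb2 = err.sum(0)
    dh = err @ W2.T * (1 - h ** 2)         # backprop through tanh
    gW1 = X.T @ dh; gb1 = dh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred1 = forward(X)
loss1 = np.mean((pred1 - Y) ** 2)          # MSE after training
print(loss0, loss1)
```

Full-batch gradient descent keeps the sketch short; the paper's hyperparameter search and the tracking-failure issue it reports are outside this toy's scope.
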
Kroos C, Plumbley M (2017) Neuroevolution for sound event detection in real life audio: A pilot study, Detection and Classification of Acoustic Scenes and Events (DCASE 2017) Proceedings, Tampere University of Technology
Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44.9% and achieved rank 15 out of 34 on the ER error metric with a value of 0.891. We discuss the question of evolving versus learning for supervised tasks.

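The segment-based metrics mentioned above (F1 and error rate ER) can be sketched for binary activity matrices. This is an illustrative re-implementation of the DCASE-style definitions, not the official evaluation code:

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based F1 and error rate (ER) for binary activity matrices
    of shape (n_segments, n_events). DCASE-style definitions: per segment,
    substitutions S pair one miss with one false alarm, the remainder count
    as deletions D or insertions I; ER = (S + D + I) / N active references.
    """
    ref = ref.astype(bool)
    est = est.astype(bool)
    tp = (ref & est).sum()
    fp = (~ref & est).sum()
    fn = (ref & ~est).sum()
    fn_seg = (ref & ~est).sum(axis=1)      # misses per segment
    fp_seg = (~ref & est).sum(axis=1)      # false alarms per segment
    s = np.minimum(fn_seg, fp_seg)
    S, D, I = s.sum(), (fn_seg - s).sum(), (fp_seg - s).sum()
    f1 = 2 * tp / (2 * tp + fp + fn)
    er = (S + D + I) / ref.sum()
    return f1, er

ref = np.array([[1, 0], [1, 1], [0, 1]])   # ground-truth activity
est = np.array([[1, 1], [1, 0], [0, 1]])   # system output
f1, er = segment_metrics(ref, est)
print(f1, er)  # → 0.75 0.5
```

The event-related error rate reported in the paper uses event-level matching instead of fixed segments, which this segment-level sketch does not cover.
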
Duel T, Frohlich D, Kroos C, Xu Y, Jackson P, Plumbley M (2018) Supporting audiography: Design of a system for sentimental sound recording, classification and playback, Communications in Computer and Information Science: HCI International 2018 - Posters' Extended Abstracts, vol. 850, Scientific Publishing Services, on behalf of Springer
It is now commonplace to capture and share images in photography as triggers for memory. In this paper we explore the possibility of using sound in the same sort of way, in a practice we call audiography. We report an initial design activity to create a system called Audio Memories comprising a ten second sound recorder, an intelligent archive for auto-classifying sound clips, and a multi-layered sound player for the social sharing of audio souvenirs around a table. The recorder and player components are essentially user experience probes that provide tangible interfaces for capturing and interacting with audio memory cues. We discuss our design decisions and process in creating these tools that harmonize user interaction and machine listening to evoke rich memories and conversations in an exploratory and open-ended way.