
Dr Annika Neidhardt
Publications
A data set was created to study the position-dependent perception of room acoustics in a small conference room. Binaural room impulse responses (BRIRs) were measured with a KEMAR 45BA at 5 different potential listening positions. To keep the direct sound similar, the Genelec 1030A two-way loudspeaker was always placed at a distance of 2.5 m. In one condition, the loudspeaker was turned towards the listening position; in the other, it was turned by 180° to achieve an indirect reproduction with low direct sound energy. Furthermore, an mh acoustics Eigenmike was placed at each of the five listening positions, and 32-channel directional room impulse responses (DRIRs) as well as an omnidirectional room impulse response (RIR) were captured at the point where the center of the head had been located. Additionally, 360° visual footage was captured with a GoPro Omni spherical camera array to provide audiovisual impressions of the listening situation at the five positions.
Introduction: The following data set was collected to study the influence of several open headphone models on a real sound source. This investigation was motivated by studies on augmented acoustic reality (AAR) scenarios. In AAR applications, virtual sound sources are usually played back using binaural synthesis over headphones with the goal of blending in with the real acoustic environment. Due to the presence of headphones, additional scattering, resonance and shadowing effects are introduced into the real sound field. The data set consists of binaural room impulse responses (BRIRs) measured with a KEMAR 45BA wearing different open headphone models. The measurements were carried out in the listening laboratory at the TU Ilmenau. Two distances (75 cm and 200 cm) and 92 angles (4° azimuthal resolution incl. 90° and 270°) were measured. Each configuration was measured three times with a repositioning of the headphones.
Important notes:
- The mat-files contain the BRIR matrix with the dimensions samples x channels x angles
- The mat-files contain an angle vector for the corresponding azimuthal steps
- The head rotation angle runs clockwise
The data set was presented at DAGA 21 [1]. A perceptual analysis using this data set can be found in [2].
References:
[1] Schneiderwind, C., Neidhardt, A., "Data Set: A Collection of BRIRs for a Listener Wearing Different Types of Open Headphones," DAGA 2021, Vienna, Austria, August 2021.
[2] Schneiderwind, C., Neidhardt, A., and Meyer, D., "Comparing the effect of different open headphone models on the perception of a real sound source," 150th AES Convention, Online, June 2021.
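For illustration, a minimal Python sketch of loading and using the data as described in the notes above; the file and variable names are assumptions, not the actual naming in the data set:

```python
import numpy as np
from scipy.io import loadmat
from scipy.signal import fftconvolve

# Hypothetical file and variable names; dimensions follow the notes above.
data = loadmat("brir_open_headphones_200cm.mat")
brirs = data["BRIR"]              # samples x channels x angles
angles = data["angles"].ravel()   # azimuth in degrees, clockwise head rotation

def auralize(mono, head_azimuth_deg):
    """Convolve a mono signal with the BRIR closest to the given head rotation."""
    diff = np.abs((angles - head_azimuth_deg + 180.0) % 360.0 - 180.0)
    idx = int(np.argmin(diff))    # nearest measured azimuth (wrap-around safe)
    left = fftconvolve(mono, brirs[:, 0, idx])
    right = fftconvolve(mono, brirs[:, 1, idx])
    return np.stack([left, right], axis=-1)
```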
This study investigates the plausibility of dynamic binaural audio scenarios in which the listener interactively walks towards a virtual sound source. An originally measured BRIR set was manipulated and simplified systematically to challenge plausibility, explore its limits, and examine the relevance of selected acoustic properties. After a first investigation in a quite dry listening laboratory, this second exploratory study repeats and extends the experiment in a considerably more reverberant room. The participants had to rate externalization, continuity, stability of the apparent sound source, the impression of walking towards the sound source, and the plausibility of the virtual acoustic scene. The results confirm the observations of the first study in this different acoustic environment. Both studies indicate considerable room for simplification, but certain modifications seriously affect plausibility. Even inexperienced listeners notice if the progress of the auditory distance change does not match their own walking motion. In addition, the importance of context and expectation for the perception of binaural audio is highlighted.
A mixing console (300) for processing at least a first and a second source signal and for providing a mixed audio signal comprises an audio signal generator (100) for providing an audio signal (120) for a virtual listening position (202) within a space (200), in which an acoustic scene is recorded by at least a first microphone (204) at a first known position within the space (200) as the first source signal (210) and by at least a second microphone (206) at a second known position within the space (200) as the second source signal (212). The audio signal generator (100) comprises an input interface (102) configured to receive the first source signal (210) recorded by the first microphone (204) and the second source signal (212) recorded by the second microphone (206), and a geometry processor (104) configured to determine a first piece of geometry information (110) based on the first position and the virtual listening position (202) and a second piece of geometry information (112) based on the second position and the virtual listening position (202). A signal generator (106) for providing the audio signal (120) is configured to combine at least the first source signal (210) and the second source signal (212) according to a combination rule using the first piece of geometry information (110) and the second piece of geometry information (112).
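A minimal sketch of one possible combination rule, using per-microphone distance delay and 1/r attenuation; the claim itself leaves the concrete rule open, so this is only an illustration:

```python
import numpy as np

def virtual_listening_signal(sig1, pos1, sig2, pos2, listen_pos, fs, c=343.0):
    """Combine two microphone signals for a virtual listening position
    using per-microphone propagation delay and 1/r attenuation."""
    sources = ((np.asarray(sig1, float), np.asarray(pos1, float)),
               (np.asarray(sig2, float), np.asarray(pos2, float)))
    out = np.zeros(max(len(s) for s, _ in sources) + int(0.1 * fs))  # delay headroom
    for sig, pos in sources:
        r = float(np.linalg.norm(np.asarray(listen_pos, float) - pos))
        delay = int(round(fs * r / c))                # propagation delay in samples
        n = min(len(sig), len(out) - delay)
        out[delay:delay + n] += sig[:n] / max(r, 1e-3)  # 1/r spreading loss
    return out
```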
It is pointed out that beyond reproducing the physically correct sound pressure at the eardrums, further effects play a significant role in the quality of the auditory illusion. In some cases, these can dominate perception and even outweigh physical deviations. Perceptual effects like the room-divergence effect, additional visual influences, personalization, pose and position tracking, as well as adaptation processes are discussed. These effects are described individually, and the interconnections between them are highlighted. With the results from experiments performed by the authors, the perceptual effects can be quantified. Furthermore, concepts are proposed to optimize reproduction systems with regard to those effects. One example could be a system that adapts to varying listening situations as well as individual listening habits, experience and preference.
The aim of auditory augmented reality is to create a highly immersive and plausible auditory illusion combining virtual audio objects and scenarios with the real acoustic surroundings. For this use case it is necessary to estimate the acoustics of the current room. A mismatch between real and simulated acoustics will easily be detected by the listener and will probably lead to in-head localization or an unrealistic acoustic envelopment of the virtual sound sources. This publication investigates state-of-the-art algorithms for blind reverberation time estimation, which are commonly used for speech enhancement or speech dereverberation, and applies them to binaural ear signals. The outcome of these algorithms can be used, for example, to select the most appropriate room out of a room database. Such a database could include pre-measured or simulated binaural room impulse responses that could directly be used to realize a binaural reproduction. First results are promising and come at low computational effort. Further strategies for enhancing the used method are proposed in order to achieve a more precise reverberation time estimation.
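The blind estimators themselves are beyond the scope of this summary, but the classical non-blind reference they are typically compared against is Schroeder backward integration on a measured impulse response; a minimal sketch:

```python
import numpy as np

def rt60_schroeder(rir, fs):
    """RT60 from a measured impulse response via Schroeder backward
    integration and a line fit between -5 and -35 dB of the decay curve."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]            # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0])
    t = np.arange(len(rir)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)      # evaluation range
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # decay rate in dB per second
    return -60.0 / slope                             # time for a 60 dB decay
```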
This BRIR data set was measured with a KEMAR 45BA at 9 different positions along a defined translation line with a length of 2 m in two different rooms, a listening laboratory (T60 = 0.27 s) and a seminar room (T60 = 1.0 s). At each position, BRIRs for a full 360° head rotation with an azimuth resolution of 4° were captured. Two Genelec 1030A loudspeakers served as the sound sources, one placed in front of the line, one at the side. In addition, omnidirectional room impulse responses were measured at the 9 positions, at the location of the center of the head, in both rooms. The data set is of interest for studying dynamic binaural synthesis for walking towards or past a sound source in two different room acoustical settings. The same source and receiver setup was used in both rooms; therefore, the direct sound and its progression along the given translation line are similar.
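For position-dynamic rendering with this data, tracked listener coordinates must be mapped to a measured BRIR; a minimal lookup, assuming equidistant measurement positions along the line:

```python
import numpy as np

positions = np.linspace(0.0, 2.0, 9)   # assumed equidistant points on the 2 m line
angles = np.arange(0, 360, 4)          # 4 degree azimuth grid

def nearest_brir_index(x_m, azimuth_deg):
    """Map a tracked position on the line and a head azimuth to the indices
    of the nearest measured BRIR (position index, angle index)."""
    pos_idx = int(np.argmin(np.abs(positions - np.clip(x_m, 0.0, 2.0))))
    ang_idx = int(round((azimuth_deg % 360.0) / 4.0)) % len(angles)
    return pos_idx, ang_idx
```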
Interactive auditory displays are an interesting possibility for presenting information in an alternative way. There has been a lot of interesting work using binaural techniques. The use of a loudspeaker system has the advantage that more people can listen to the same data simultaneously. One application where this is very important is the audio gaming domain, as multiplayer games are usually more exciting. Additionally, the use of a loudspeaker system opens up different dimensions of game design. The main challenge in developing an interactive auditory display for a loudspeaker system is the design of the data sonification and the interaction for data exploration. In this paper we present an example implementation of such an interactive auditory display. The famous game Pong has been implemented using an audio-only loudspeaker display instead of a graphical one. The goal of this investigation is to gather more experience in the perception of spatial audio-only representations of information.
At the beginning of the twentieth century, the physicist Erich Waetzmann had already identified limitations in the functionality of the human ear in his "School of Listening", where he provided instructions on "how the severely neglected auditory organ could be trained and strengthened in its performance" when compared to our sense of sight. He therefore decided to address the shortcomings of the auditory sense with his booklet.
Spatial audio representations have been shown to positively impact user experience in traditional, non-immersive communication media. While spatial audio also contributes to presence in single-user immersive VR, its impact in virtual communication scenarios has not yet been fully understood. This work aims to further investigate which communication scenarios benefit from spatial audio representations. We present a study in which pairs of interlocutors undertake a collaborative task in an audiovisual Virtual Environment (VE) under different auralization and scene arrangement conditions. The novel task is designed to encourage simultaneous conversation and movement, with the aim of increasing the relevance of spatial hearing. Results are obtained through questionnaires measuring social presence and plausibility, as well as through conversational and behavioral analysis. Although participants are shown to favor binaural auralization over diotic audio in a direct active-listening comparison, no significant differences in social presence, plausibility, or communication behavior could be found. Our results suggest that spatial audio may not affect user experience in dyadic communication scenarios where spatial auditory information is not directly relevant to the considered task.
Virtual Reality (VR) has become an important tool for conducting behavioral studies in realistic, reproducible environments. In this paper, we present ISA, an Immersive Study Analyzer system designed for the comprehensive analysis of social VR studies. For in-depth analysis of participant behavior, ISA records all user actions, speech, and the contextual environment of social VR studies. A key feature is the ability to review and analyze such immersive recordings collaboratively in VR, through support of behavioral coding and user-defined analysis queries for efficient identification of complex behavior. Respatialization of the recorded audio streams enables analysts to follow study participants' conversations in a natural and intuitive way. To support phases of close and loosely coupled collaboration, ISA allows joint and individual temporal navigation, and provides tools to facilitate collaboration among users at different temporal positions. An expert review confirms that ISA effectively supports collaborative immersive analysis, providing a novel and effective tool for nuanced understanding of user behavior in social VR studies.
Featured Application: In Auditory Augmented Reality (AAR), the real room is enriched by virtual audio objects. Position-dynamic binaural synthesis is used to auralize the audio objects for moving listeners and to create a plausible experience of the mixed reality scenario. For spatial audio reproduction in the context of augmented reality, a position-dynamic binaural synthesis system can be used to synthesize the ear signals for a moving listener. The goal is the fusion of the auditory perception of the virtual audio objects with the real listening environment. Such a system has several components, each of which helps to enable a plausible auditory simulation. For each possible position of the listener in the room, a set of binaural room impulse responses (BRIRs) congruent with the expected auditory environment is required to avoid room divergence effects. Adequate and efficient approaches synthesize new BRIRs from very few measurements of the listening room. The required spatial resolution of the BRIR positions can be estimated from spatial auditory perception thresholds. Retrieving and processing the tracking data of the listener's head pose and position, as well as convolving BRIRs with an audio signal, needs to be done in real time. This contribution presents work done by the authors, covering several technical components of such a system in detail. It shows how the individual components are affected by psychoacoustics. Furthermore, the paper discusses the perceptual effects by means of listening tests demonstrating the appropriateness of the approaches.
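The real-time convolution component can be illustrated as a stateful block convolver that exchanges BRIRs with a crossfade when the tracked pose changes; block size and crossfade shape below are assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.signal import fftconvolve

def make_dynamic_convolver(brirs, block_len=512):
    """brirs: samples x 2 x states. Returns a per-block renderer that
    crossfades between BRIRs whenever the tracked pose index changes."""
    tail = np.zeros((brirs.shape[0] - 1, 2))   # overlap-add memory
    prev = None

    def render(block, idx):
        return np.stack([fftconvolve(block, brirs[:, ch, idx]) for ch in (0, 1)], axis=-1)

    def process(block, idx):
        nonlocal tail, prev
        wet = render(block, idx)
        if prev is not None and idx != prev:
            fade = np.minimum(np.arange(len(wet)) / block_len, 1.0)[:, None]
            wet = fade * wet + (1.0 - fade) * render(block, prev)  # crossfade old/new BRIR
        prev = idx
        wet[:len(tail)] += tail                # add tail of the previous block
        tail = wet[block_len:].copy()          # keep new tail for the next block
        return wet[:block_len]

    return process
```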
For the realization of auditory augmented reality (AAR), it is important that the room acoustical properties of the virtual elements are perceived in agreement with the acoustics of the actual environment. This perceptual matching of room acoustics is the subject reviewed in this paper. Realizations of AAR that fulfill the listeners' expectations were achieved based on pre-characterization of the room acoustics, for example, by measuring acoustic impulse responses or creating detailed room models for acoustic simulations. For future applications, the goal is to realize an online adaptation in (close to) real-time. Perfect physical matching is hard to achieve with these practical constraints. For this reason, an understanding of the essential psychoacoustic cues is of interest and will help to explore options for simplifications. This paper reviews a broad selection of previous studies and derives a theoretical framework to examine possibilities for psychoacoustical optimization of room acoustical matching.
When a sound source is occluded, diffraction replaces direct sound as the first wavefront arrival and can influence important aspects of perception such as localisation. Few experiments have investigated how diffraction modelling influences the perceived plausibility of an acoustic simulation. In this paper, an experiment was run to investigate the plausibility of an acoustic simulation with and without diffraction in an L-shaped room in VR. The rendering was carried out using a real-time 6DOF geometrical acoustics and feedback-delay-network hybrid model, and diffraction was modelled using the physically accurate Biot-Tolstoy-Medwin model. The results show that diffraction increases the perceived plausibility of the acoustic simulation. In addition, the study compared diffraction of the direct sound alone and diffraction of both direct and reflected sound. A significant increase in plausibility was found by the addition of diffracted reflection paths, but only in the so-called shadow zone.
The Spatial Audio Quality Inventory (SAQI, Lindau et al. 2014 [1]) defines a comprehensive list of attributes for quality assessment of spatial audio. These attributes are traditionally used in perceptual experiments. However, automatic evaluation is a common alternative to assess spatial audio algorithms by means of acoustic recordings and numerical methods. This study aims at bridging the gap between perceptual evaluation and automatic assessment methods. We performed a focused literature review of available auditory models and propose a list to cover the attributes in SAQI, based on self-imposed selection criteria such as binaural compatibility. The selected models are publicly available and ready to be used in automatic assessment methods. This Spatial Audio Models' Inventory (SAMI) could provide relevant metrics to train and/or optimise machine-learning and deep-learning algorithms when the objective is to improve the perceived quality of reproduction in spatial audio applications. Moreover, SAMI constitutes a benchmark to challenge novel models.
In loudspeaker-based sound field reproduction, the perceived sound quality deteriorates significantly when listeners move outside of the sweet spot. Although a substantial increase in the number of loudspeakers enables rendering methods that can mitigate this issue, such a solution is not feasible for most real-life applications. This study aims to extend the listening area by finding a panning strategy that optimises an objective function reflecting localisation and localisation uncertainty over a listening area. To that end, we first introduce a psychoacoustic localisation model that outperforms existing models in the context of multichannel loudspeaker setups. Leveraging this model and an existing model of localisation uncertainty, we optimise inter-channel time and level differences for a stereophonic system. The outcome is a new panning approach that depends on the listening area and the most suitable trade-off between localisation and localisation uncertainty.
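For context, the classical single-sweet-spot baseline that such an optimised strategy replaces is the stereophonic tangent law; a minimal sketch of its constant-power gains:

```python
import numpy as np

def tangent_law_gains(phi_deg, base_deg=30.0):
    """Constant-power stereo gains from the stereophonic tangent law:
    tan(phi)/tan(phi0) = (gL - gR)/(gL + gR), normalised to gL^2 + gR^2 = 1."""
    k = np.tan(np.radians(phi_deg)) / np.tan(np.radians(base_deg))
    g_left, g_right = 1.0 + k, 1.0 - k      # satisfies the tangent-law ratio
    norm = np.hypot(g_left, g_right)        # constant-power normalisation
    return g_left / norm, g_right / norm
```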
Spatial audio for Augmented and Extended Reality (AR/XR) is still limited in its suitability for everyday use due to the perceptual and computational demands of such applications. A number of recent studies have highlighted the potential of simplified room acoustic modelling and its perceptual optimisation for making AR/XR applications more accessible with affordable, mobile hardware. This paper presents a perceptual evaluation of real-time acoustic modelling based on feedback delay networks (FDN), scattering delay networks (SDN), and a hybrid of the image source method (ISM) combined with an FDN. The acoustics were modelled using a shoebox approximation of the original room's geometry, two different approximations of the original loudspeaker directivity, and generic HRTFs. We chose a flipped loudspeaker scenario as a critical test case. In the listening experiment, we assessed the Audiovisual Plausibility, Externalisation and Naturalness of the modelled sound fields. The SDN implementation with the subcardioid directivity was perceived as similarly external and audiovisually plausible as the measured reference. The hybrid ISM-FDN method and the SDN, both with the directivity of a different loudspeaker, were perceived as natural as the reference. The tested FDN cases exhibited significantly lower ratings than the measurement for all three attributes. The chosen simplifications were not perceptually sufficient in the tested scenario, matching some attributes only. More research on efficient perceptual matching of modelled room acoustics for AR/XR is needed.
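Among the compared structures, the FDN is the most compact to sketch: parallel delay lines whose per-line gains realise a target T60, mixed by a unitary feedback matrix. A minimal version, not the implementation evaluated in the paper:

```python
import numpy as np
from scipy.linalg import hadamard

def fdn_reverb(x, fs, t60=0.5, delays=(1031, 1327, 1523, 1871)):
    """Minimal 4-line feedback delay network with a Hadamard feedback matrix."""
    n = len(delays)
    A = hadamard(n) / np.sqrt(n)                                  # unitary mixing
    g = np.array([10.0 ** (-3.0 * d / (fs * t60)) for d in delays])  # per-line decay
    bufs = [np.zeros(d) for d in delays]
    idx = [0] * n
    y = np.zeros(len(x))
    for t in range(len(x)):
        outs = np.array([bufs[i][idx[i]] for i in range(n)])      # delay-line outputs
        y[t] = outs.sum()
        fb = A @ (g * outs)                                       # attenuate and mix
        for i in range(n):
            bufs[i][idx[i]] = x[t] + fb[i]                        # write back with input
            idx[i] = (idx[i] + 1) % len(bufs[i])
    return y
```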
In 1971, Barron published a study on "The subjective effects of first reflections in concert halls", comprising a lead/lag paradigm experiment with two loudspeakers set up in an anechoic room. As a result, he presented the determined audibility threshold, as well as a figure showing the audible effects caused by the first reflection (lag) depending on its delay and level relative to the direct sound (lead). This study gave an inspiring first insight into prominent perceptual effects like spatial impression, colouration, image shift, and 'disturbance'. However, the diagram was created based on the responses of only two listeners, evaluating the various attributes of a single item of programme material. To assess the reproducibility and generalisability of the results, we repeated and extended Barron's experiment with a larger panel of participants and a slightly revised test method. Besides ensemble music, a solo piece played on an electric bass guitar was considered. The analysis confirmed a signal dependency of the estimated thresholds. Furthermore, despite intensive training, mapping the specific attributes to the perceptual effects remained challenging for the complex signals. Considerable individual differences were observed. We present an updated version of Barron's graph as a result of our study.
RoomAcoustiC++ (Real-time Acoustics Library), an open-source C++ library for real-time room acoustic modelling, is introduced; it implements a hybrid geometric acoustics and feedback delay network model. The geometric acoustics component uses the image edge model to locate geometry-dependent early specular reflections and edge diffraction. The feedback delay network models late reverberation for a target reverberation time. The model is capable of dynamic simulation, including moving geometry, sources and listener as well as changing wall absorption, with binaural spatialisation over headphones and customisable head-related transfer functions using the 3D Tune-In toolbox. A comparison with existing closed-source and open-source projects is presented. It found that many state-of-the-art room acoustic models for real-time applications are closed-source, limiting reproducibility. RoomAcoustiC++ offers an improved room acoustic model compared to existing open-source projects. The library was validated against physical measurements from the Benchmark for Room Acoustic Simulations (BRAS) database. An analysis of the real-time performance shows that the software is capable of binaural rendering for scenes with occluding geometries and multiple sources.
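To illustrate the geometric-acoustics idea, here is a generic first-order shoebox image-source sketch; it is independent of the RoomAcoustiC++ API, whose image edge model additionally handles edge diffraction:

```python
import numpy as np

def shoebox_ism_first_order(src, rcv, room_dims, beta, fs, c=343.0, n_taps=4800):
    """Direct sound plus the six first-order wall reflections of a shoebox
    room as a sparse impulse response (generic image-source sketch)."""
    src = np.asarray(src, float)
    images = [(src, 1.0)]                        # (image position, reflection gain)
    for ax, L in enumerate(room_dims):
        for wall in (0.0, L):
            img = src.copy()
            img[ax] = 2.0 * wall - img[ax]       # mirror the source at the wall
            images.append((img, beta))
    h = np.zeros(n_taps)
    for img, gain in images:
        r = float(np.linalg.norm(img - np.asarray(rcv, float)))
        tap = int(round(fs * r / c))             # propagation delay in samples
        if tap < n_taps:
            h[tap] += gain / max(r, 1e-3)        # 1/r spreading loss
    return h
```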
The present work investigates the influence of different open headphone models on the perception of real sound sources in augmented audio reality applications. A set of binaural room impulse responses was measured with a dummy head wearing eight different open headphone configurations. A spectral error analysis showed strong differences between the physical distortions of the real sound field caused by the different headphone models. The resulting perceptual effects were evaluated in a MUSHRA-like psychoacoustic experiment. The results show that all headphones introduce audible distortions. The extra-aural BK211 was found to be the one with the least audible corruption. In contrast, some of the circumaural headphones caused strong coloration and seriously affected the spatial cues of the real sound sources.
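One simple way to realise such a spectral error analysis is the per-bin magnitude deviation between a reference BRIR and one measured with a headphone worn; smoothing and band-averaging are omitted, and the details here are assumptions:

```python
import numpy as np

def spectral_error_db(ref_brir, hp_brir, n_fft=4096):
    """dB magnitude deviation introduced by the worn headphone, per frequency bin."""
    ref = np.abs(np.fft.rfft(ref_brir, n_fft)) + 1e-12   # avoid log of zero
    test = np.abs(np.fft.rfft(hp_brir, n_fft)) + 1e-12
    return 20.0 * np.log10(test / ref)
```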
This exploratory study investigates people's ability to discriminate between real and virtual sound sources in a position-dynamic, headphone-based augmented audio scene. For this purpose, an acoustic scene was created consisting of two loudspeakers at different positions in a small seminar room. To account for the presence of the headphones, non-individualized BRIRs measured along a line with a dummy head wearing AKG K1000 headphones were used, allowing for head rotation and translation. In a psychoacoustic experiment, participants had to explore the acoustic scene and tell which sound source they believed to be real or virtual. The test cases included a dialog scenario, stereo pop music, and one person speaking while the other loudspeaker simultaneously played mono music. Results show that the participants tended to be able to identify individual virtual sources. However, for the cases where both sound sources reproduced sound simultaneously, lower discrimination rates were observed.
This paper presents a proof-of-concept study conducted to analyze the effect of simple diotic vs. spatial, position-dynamic binaural synthesis on social presence in VR, in comparison with face-to-face communication in the real world, for a sample two-party scenario. A conversational task with shared visual reference was realized. The collected data includes questionnaires for direct assessment, tracking data, and audio and video recordings of the individual participants’ sessions for indirect evaluation. While tendencies for improvements with binaural over diotic presentation can be observed, no significant difference in social presence was found for the considered scenario. The gestural analysis revealed that participants used the same amount and type of gestures in face-to-face as in VR, highlighting the importance of non-verbal behavior in communication. As part of the research, an end-to-end framework for conducting communication studies and analysis has been developed.
This study examines the plausibility of Auditory Augmented Reality (AAR) realized with position-dynamic binaural synthesis over headphones. An established method to evaluate the plausibility of AAR asks participants to decide whether they are listening to the virtual or the real version of a sound object. To date, this method has only been used to evaluate AAR systems for seated listeners. The AAR realization examined in this study instead allows listeners to turn in arbitrary directions and walk towards, past, and away from a real loudspeaker that reproduced sound only virtually. The experiment was conducted in two parts. In the first part, the subjects were asked whether they were listening to the real or the virtual version, not knowing that it was always the virtual version. In the second part, real versions of the scenes, in which the loudspeaker actually reproduced sound, were added. Two different source positions, three different test stimuli, and two different sound levels were considered. Seventeen volunteers, including five experts, participated. In the first part, none of the participants noticed that the virtual reproduction was active throughout the different test scenes. The inexperienced listeners tended to accept the virtual reproduction as real, while experts distributed their answers approximately equally. In the second part, experts could identify the virtual version quite reliably. For inexperienced listeners, the individual results varied enormously. Since the presence of the headphones influences the perception of the real sound field, this shadowing effect had to be considered in the creation of the virtual sound source as well. This requirement still limits the ecological validity of test methods that include the real version. Although the results indicate that the availability of a hidden real reference leads to a more critical evaluation, it is crucial to be aware that the presence of the headphones slightly distorts the reference. This issue seems more critical for the plausibility estimates achieved with this evaluation method than the increased freedom of motion.
This paper presents a headphone-based audio augmented reality demonstrator showcasing the effects of manipulated late reverberation in rendering virtual sound sources. The setup is based on a data set of binaural room impulse responses measured along a 2 m long line, which is used to imitate the reproduction of a pair of loudspeakers. Listeners can explore the virtual sources by moving back and forth and rotating arbitrarily on this line. The demo allows the user to adjust the late reverberation tail of the auralizations interactively from shorter to longer decay times relative to the baseline decay behavior. Modification of the decay times is based on resynthesizing the late reverberation using frequency-dependent shaping of binaural white noise and modal reconstruction. The paper includes descriptions of the frameworks used for this demo and an overview of the required data and processing steps.
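A simplified take on the frequency-dependent noise shaping, for a monaural tail with assumed band edges and filter order; the demo additionally uses binaural noise and modal reconstruction:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def synthesize_tail(fs, duration_s, band_t60s, edges=(250.0, 1000.0, 4000.0)):
    """Late-reverberation tail: per-band white noise weighted with the
    exponential envelope 10^(-3 t / T60) of the desired band decay."""
    n = int(fs * duration_s)
    t = np.arange(n) / fs
    bands = [(0.0, edges[0])] + list(zip(edges[:-1], edges[1:])) + [(edges[-1], fs / 2.0)]
    tail = np.zeros(n)
    for (lo, hi), t60 in zip(bands, band_t60s):
        if lo == 0.0:
            sos = butter(4, hi, "lowpass", fs=fs, output="sos")
        elif hi >= fs / 2.0:
            sos = butter(4, lo, "highpass", fs=fs, output="sos")
        else:
            sos = butter(4, [lo, hi], "bandpass", fs=fs, output="sos")
        tail += sosfilt(sos, np.random.randn(n)) * 10.0 ** (-3.0 * t / t60)
    return tail
```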
Augmented or mixed reality (AR/MR) is emerging as one of the key technologies in the future of computing. Audio cues are critical for maintaining a high degree of realism, social connection, and spatial awareness for various AR/MR applications, such as education and training, gaming, remote work, and virtual social gatherings to transport the user to an alternate world called the metaverse. Motivated by a wide variety of AR/MR listening experiences delivered over hearables, this article systematically reviews the integration of fundamental and advanced signal processing techniques for AR/MR audio to equip researchers and engineers in the signal processing community for the next wave of AR/MR.
The presented study examines the maximum distance at which listeners can still localize the direction of a nearby wall if their own mouth is the sound source. For this investigation, oral binaural room impulse responses (OBRIRs) were measured with a KEMAR dummy head with a mouth simulator at eight different distances to a wall, in an anechoic chamber and in two rooms with different reverberation properties. Using a headphone-based dynamic auralization, the participants had to turn until they thought they were facing the wall. In a staircase-inspired procedure, the test always started with the shortest distance of 25 cm. After a successful localization in at least two of three trials, the distance was increased in intervals of 25 cm, up to about 2 m. The results exhibit considerable differences in the individual performances, which is in line with the results of earlier studies. At a distance of 25 cm, all participants could localize the direction of the reflecting wall. From 50 cm onward, more and more participants found it difficult to determine the correct direction. In the anechoic room, four of the 22 participants succeeded in the localization at the 2 m distance. In the reverberant rooms, the localizability decreased significantly.
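The adaptive distance progression can be summarised in a short schematic, with `localize_trial` as a hypothetical stand-in for a single listener response:

```python
def max_localizable_distance(localize_trial, step_cm=25, max_cm=200):
    """Schematic of the procedure: starting at 25 cm, the distance is
    increased by 25 cm whenever the wall direction is found in at least
    two of three trials; returns the largest distance still localized."""
    d = step_cm
    last_success = 0
    while d <= max_cm:
        hits = sum(bool(localize_trial(d)) for _ in range(3))
        if hits < 2:
            break                    # localization failed at this distance
        last_success = d
        d += step_cm
    return last_success
```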
Binaural synthesis systems can create virtual sound sources that are indistinguishable from reality. In Augmented Reality (AR) applications, virtual sound sources need to blend in with the real environment to create plausible illusions. However, in some scenarios, it may be desirable to enhance the natural acoustic properties of the virtual content to improve speech intelligibility, alleviate listener fatigue, or achieve a specific artistic effect. Previous research has shown that deviating from the original room acoustics can degrade the quality of the auditory illusion, an effect often referred to as the room divergence effect. This study investigates whether it is possible to modify the auditory aesthetics of a room environment without compromising the plausibility of a sound event in AR. To accomplish this, the reverberation tails of measured binaural room impulse responses are modified after the mixing time to change reverberance. A listening test was conducted to evaluate the externalization and audio-visual plausibility of an exemplary AR scene for different degrees of reverberation modification. The results indicate that externalization is unaffected even by extreme modifications (such as a stretch ratio of 1.8). However, audio-visual plausibility is only maintained for moderate modifications (such as stretch ratios of 0.8 and 1.2).
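For illustration, a crude way to realise such a stretch is to resample the BRIR after an assumed mixing time; the study's actual method avoids the colouration that plain resampling introduces:

```python
import numpy as np
from scipy.signal import resample

def stretch_brir_tail(brir, fs, mixing_time_ms=80.0, ratio=1.2):
    """Stretch the part of a single-channel BRIR after the mixing time by
    `ratio` (>1 lengthens the decay); direct sound and early reflections
    are left untouched. The 80 ms mixing time is an assumption."""
    split = int(fs * mixing_time_ms / 1000.0)
    head, tail = brir[:split], brir[split:]
    tail_stretched = resample(tail, int(round(len(tail) * ratio)))
    return np.concatenate([head, tail_stretched])
```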
This paper presents a method for sound field interpolation/extrapolation from a spatially sparse set of binaural room impulse responses (BRIRs). The method focuses on the direct component and early reflections, and is framed as an inverse problem seeking the weight signals of an acoustic model based on the time-domain equivalent source (TES). Once the weight signals are estimated, the (continuous) sound field can be reconstructed and BRIRs can be synthesised at any position and orientation in a source-free volume bounded by the TESs. The L1-norm, sum of L2-norm, and Tikhonov regularisation functions were tested, with L1-norm (imposing spatio-temporal sparsity) performing the best. Simulations exhibit lower normalised mean squared error (NMSE) compared to a nearest-neighbour approach, which uses the spatially closest BRIR measurement for rendering. Results show good temporal alignment of direct sound and reflections, even when a non-individualised head-related impulse response (HRIR) was used for system inversion and BRIR synthesis. The performance is also assessed using an objective measure of perceived coloration called the predicted binaural coloration (PBC) model, which reveals a good perceptual match between interpolated/extrapolated and true BRIRs.
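As a sketch of the inverse problem, the Tikhonov variant has a closed-form solution; the L1 penalty that performed best in the paper requires an iterative solver instead:

```python
import numpy as np

def tikhonov_weights(A, p, lam=1e-2):
    """Equivalent-source weights w minimising ||A w - p||^2 + lam ||w||^2,
    where A maps source weights to the measured BRIR samples p."""
    n = A.shape[1]
    # Normal equations with L2 regularisation (closed form).
    return np.linalg.solve(A.conj().T @ A + lam * np.eye(n), A.conj().T @ p)
```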
Proceedings of the ICA 2019 and EAA Euroregio: 23rd International Congress on Acoustics, integrating 4th EAA Euroregio 2019, 9-13 September 2019, Aachen, Germany. Proceedings editors: Martin Ochmann, Michael Vorländer, Janina Fels. Aachen, 2019.