Object-based audio is an emerging paradigm for representing audio content. However, the limited availability of high-quality object-based content and the need for usable production and reproduction tools impede the exploration and evaluation of object-based audio. This engineering brief introduces the S3A object-based production dataset. It comprises a set of object-based scenes as projects for the Reaper digital audio workstation (DAW). They are accompanied by a set of open-source DAW plugins–—the VISR Production Suite—–for creating and reproducing object-based audio. In combination, these resources provide a practical way to experiment with object-based audio and facilitate loudspeaker and headphone reproduction. The dataset is provided to enable a larger audience to experience object-based audio, for use in perceptual experiments, and for audio system evaluation.
A study was conducted in participants’ homes to ascertain how they would position one to eight compact wireless loudspeakers, with the goal of enhancing their existing system. The eleven participants described three key themes, creating an arrangement that: was spatially balanced and evenly distributed; maintained the room’s aesthetics; maintained the room’s functionality. In practice, the results showed that participants prioritised aesthetics and functionality, whilst balance was not usually achieved. It was concluded that a hierarchy of preferred positions in each space exists, as the same positions were reused whilst positioning differing numbers of loudspeakers and by different participants in each location. Consistencies were observed between the locations which can be used to estimate loudspeaker positions for a given living room layout.
Planarity panning (PP) and planarity control (PC) have previously been shown to be efficient methods for focusing directional sound energy into listening zones. In this paper, we consider sound field control for two listeners. First, PP is extended to create spatial audio for two listeners consuming the same spatial audio content. Then, PC is used to create highly directional sound and cancel interfering audio. Simulation results compare PP and PC against pressure matching (PM) solutions. For multiple listeners listening to the same content, PP creates directional sound at lower effort than the PM counterpart. When listeners consume different audio, PC produces greater acoustic contrast than PM, with excellent directional control except for frequencies where grating lobes generate problematic interference patterns.
Personal audio systems generate a local sound ﬁeld for a listener while attenuating the sound energy at pre-deﬁned quiet zones. Their performance can be sensitive to errors in the acoustic transfer functions between the sources and the zones. In this paper, we model the acoustic transfer functions as a superposition of multipoles with a term to describe errors in the actual gain and phase. We then propose a design framework for robust reproduction, incorporating additional prior knowledge about the error distribution where available. We combine acoustic contrast control with worst-case and probability-model optimization, exploiting limited knowledge of the error distribution. Monte-Carlo simulations over 10000 test cases show that the method increases system robustness when errors are present in the assumed transfer functions.
It is of interest to create regions of increased and reduced sound pressure ('sound zones') in an enclosure such that different audio programs can be simultaneously delivered over loudspeakers, thus allowing listeners sharing a space to receive independent audio without physical barriers or headphones. Where previous comparisons of sound zoning techniques exist, they have been conducted under favorable acoustic conditions, utilizing simulations based on theoretical transfer functions or anechoic measurements. Outside of these highly specified and controlled environments, real-world factors including reflections, measurement errors, matrix conditioning and practical filter design degrade the realizable performance. This study compares the performance of sound zoning techniques when applied to create two sound zones in simulated and real acoustic environments. In order to compare multiple methods in a common framework without unduly hindering performance, an optimization procedure for each method is first used to select the best loudspeaker positions in terms of robustness, efficiency and the acoustic contrast deliverable to both zones. The characteristics of each control technique are then studied, noting the contrast and the impact of acoustic conditions on performance.
In our information-overloaded daily lives, unwanted sounds create confusion, disruption and fatigue in what do and experience. Taking control of your own sound environment, you can design what information to hear and how. Providing personalised sound to different people over loudspeakers enables communication, human connection and social activity in a shared space, meanwhile addressing the individuals’ needs. Recent developments in object-based audio, robust sound zoning algorithms, computer vision, device synchronisation and electronic hardware facilitate personal control of immersive and interactive reproduction techniques. Accordingly, the creative sector is moving towards more demand for personalisation and personalisable content. This tutorial offers participants a novel and timely introduction to the increasingly valuable capability to personalise sound over loudspeakers, alongside resources for the audio signal processing community. Presenting the science behind personalising sound technologies and providing insights for making sound zones in practice, we hope to create better listening experiences. The tutorial attempts a holistic exposition of techniques for producing personal sound over loudspeakers. It incorporates a practical step-by-step guide to digital filter design for real-world multizone sound reproduction and relates various approaches to one another thereby enabling comparison of the listener benefits.
Microphone array beamforming can be used to enhance and separate sound sources, with applications in the capture of object-based audio. Many beamforming methods have been proposed and assessed against each other. However, the effects of compact microphone array design on beamforming performance have not been studied for this kind of application. This study investigates how to maximize the quality of audio objects extracted from a horizontal sound field by filter-and-sum beamforming, through appropriate choice of microphone array design. Eight uniform geometries with practical constraints of a limited number of microphones and maximum array size are evaluated over a range of physical metrics. Results show that baffled circular arrays outperform the other geometries in terms of perceptually relevant frequency range, spatial resolution, directivity and robustness. Moreover, a subjective evaluation of microphone arrays and beamformers is conducted with regards to the quality of the target sound, interference suppression and overall quality of simulated music performance recordings. Baffled circular arrays achieve higher target quality and interference suppression than alternative geometries with wideband signals. Furthermore, subjective scores of beamformers regarding target quality and interference suppression agree well with beamformer onaxis and off-axis responses; with wideband signals the superdirective beamformer achieves the highest overall quality.
Philip Coleman, A Franck, Jon Francombe, Qingju Liu, Teofilo de Campos, R Hughes, D Menzies, M Simon Galvez,, Y Tang, J Woodcock, Philip Jackson, F Melchior, C Pike, F Fazi, T Cox, Adrian Hilton (2018)An Audio-Visual System for Object-Based Audio: From Recording to Listening, In: IEEE Transactions on Multimedia20(8)pp. 1919-1931
Object-based audio is an emerging representation for audio content, where content is represented in a reproductionformat- agnostic way and thus produced once for consumption on many different kinds of devices. This affords new opportunities for immersive, personalized, and interactive listening experiences. This article introduces an end-to-end object-based spatial audio pipeline, from sound recording to listening. A high-level system architecture is proposed, which includes novel audiovisual interfaces to support object-based capture and listenertracked rendering, and incorporates a proposed component for objectification, i.e., recording content directly into an object-based form. Text-based and extensible metadata enable communication between the system components. An open architecture for object rendering is also proposed. The system’s capabilities are evaluated in two parts. First, listener-tracked reproduction of metadata automatically estimated from two moving talkers is evaluated using an objective binaural localization model. Second, object-based scene capture with audio extracted using blind source separation (to remix between two talkers) and beamforming (to remix a recording of a jazz group), is evaluated with perceptually-motivated objective and subjective experiments. These experiments demonstrate that the novel components of the system add capabilities beyond the state of the art. Finally, we discuss challenges and future perspectives for object-based audio workflows.
Multi-point approaches for sound field control generally sample the listening zone(s) with pressure microphones, and use these measurements as an input for an optimisation cost function. A number of techniques are based on this concept, for single-zone (e.g. least-squares pressure matching (PM), brightness control, planarity panning) and multi-zone (e.g. PM, acoustic contrast control, planarity control) reproduction. Accurate performance predictions are obtained when distinct microphone positions are employed for setup versus evaluation. While, in simulation, one can afford a dense sampling of virtual microphones, it is desirable in practice to have a microphone array which can be positioned once in each zone to measure the setup transfer functions between each loudspeaker and that zone. In this contribution, we present simulation results over a fixed dense set of evaluation points comparing the performance of several multi-point optimisation approaches for 2D reproduction with a 60 channel circular loudspeaker arrangement. Various regular setup microphone arrays are used to calculate the sound zone filters: circular grid, circular, dual-circular, and spherical arrays, each with different numbers of microphones. Furthermore, the effect of a rigid spherical baffle is studied for the circular and spherical arrangements. The results of this comparative study show how the directivity and effective frequency range of multi-point optimisation techniques depend on the microphone array used to sample the zones. In general, microphone arrays with dense spacing around the boundary give better angular discrimination, leading to more accurate directional sound reproduction, while those distributed around the whole zone enable more accurate prediction of the reproduced target sound pressure level.
Whilst sound zoning methods have typically been studied under anechoic conditions, it is desirable to evaluate the performance of various methods in a real room. Three control methods were implemented (delay and sum, DS; acoustic contrast control, ACC; and pressure matching, PM) on two regular 24-element loudspeaker arrays (line and circle). The acoustic contrast between two zones was evaluated and the reproduced sound fields compared for uniformity of energy distribution. ACC generated the highest contrast, whilst PM produced a uniform bright zone. Listening tests were also performed using monophonic auralisations from measured system responses to collect ratings of perceived distraction due to the alternate audio programme. Distraction ratings were affected by control method and programme material. Copyright © (2013) by the Audio Engineering Society.
Philip Coleman, Andreas Franck, Jon Francombe, Qingju Liu, Teofilo de Campos, Richard Hughes, Dylan Menzies, Marcos Simón Gálvez, Yan Tang, James Woodcock, Frank Melchior, Chris Pike, Filippo Fazi, Trevor Cox, Adrian Hilton, Philip Jackson (2020)S3A Audio-Visual System for Object-Based Audio
University of Surrey
Audio production is moving toward an object-based approach, where content is represented as audio together with metadata that describe the sound scene. From current object definitions, it would usually be expected that the audio portion of the object is free from interfering sources. This poses a potential problem for object-based capture, if microphones cannot be placed close to a source. This paper investigates the application of microphone array beamforming to separate a mixture into distinct audio objects. Real mixtures recorded by a 48-channel microphone array in reflective rooms were separated, and the results were evaluated using perceptual models in addition to physical measures based on the beam pattern. The effect of interfering objects was reduced by applying the beamforming techniques.
Estimating and parameterizing the early and late reflections of an enclosed space is an interesting topic in acoustics. With a suitable set of parameters, the current concept of a spatial audio object (SAO), which is typically limited to either direct (dry) sound or diffuse field components, could be extended to afford an editable spatial description of the room acoustics. In this paper we present an analysis/synthesis method for parameterizing a set of measured room impulse responses (RIRs). RIRs were recorded in a medium-sized auditorium, using a uniform circular array of microphones representing the perspective of a listener in the front row. During the analysis process, these RIRs were decomposed, in time, into three parts: the direct sound, the early reflections, and the late reflections. From the direct sound and early reflections, parameters were extracted for the length, amplitude, and direction of arrival (DOA) of the propagation paths by exploiting the dynamic programming projected phase-slope algorithm (DYPSA) and classical delay-and-sum beamformer (DSB). Their spectral envelope was calculated using linear predictive coding (LPC). Late reflections were modeled by frequency-dependent decays excited by band-limited Gaussian noise. The combination of these parameters for a given source position and the direct source signal represents the reverberant or “wet” spatial audio object. RIRs synthesized for a specified rendering and reproduction arrangement were convolved with dry sources to form reverberant components of the sound scene. The resulting signals demonstrated potential for these techniques, e.g., in SAO reproduction over a 22.2 surround sound system.
Personal audio generates sound zones in a shared space to provide private and personalized listening experiences with minimized interference between consumers. Regularization has been commonly used to increase the robustness of such systems against potential perturbations in the sound reproduction. However, the performance is limited by the system geometry such as the number and location of the loudspeakers and controlled zones. This paper proposes a geometry optimization method to find the most geometrically robust approach for personal audio amongst all available candidate system placements. The proposed method aims to approach the most “natural” sound reproduction so that the solo control of the listening zone coincidently accompanies the preferred quiet zone. Being formulated in the SVD-based modal domain, the method is demonstrated by applications in three typical personal audio optimizations, i.e., the acoustic contrast control, the pressure matching, and the planarity control. Simulation results show that the proposed method can obtain the system geometry with better avoidance of “occlusion,” improved robustness to regularization, and improved broadband equalization.
Sound field control to create multiple personal audio spaces (sound zones) in a shared listening environment is an active research topic. Typically, sound zones in the literature have aimed to reproduce monophonic audio programme material. The planarity control optimization approach can reproduce sound zones with high levels of acoustic contrast, while constraining the energy flux distribution in the target zone to impinge from a certain range of azimuths. Such a constraint has been shown to reduce problematic self-cancellation artefacts such as uneven sound pressure levels and complex phase patterns within the target zone. Furthermore, multichannel reproduction systems have the potential to reproduce spatial audio content at arbitrary listening positions (although most exclusively target a `sweet spot'). By designing the planarity control to constrain the impinging energy rather tightly, a sound field approximating a plane-wave can be reproduced for a listener in an arbitrarily-placed target zone. In this study, the application of planarity control for stereo reproduction in the context of a personal audio system was investigated. Four solutions, to provide virtual left and right channels for two audio programmes, were calculated and superposed to achieve the stereo effect in two separate sound zones. The performance was measured in an acoustically treated studio using a 60 channel circular array, and compared against a least-squares pressure matching solution whereby each channel was reproduced as a plane wave field. Results demonstrate that planarity control achieved 6 dB greater mean contrast than the least-squares case over the range 250-2000 Hz. Based on the principal directions of arrival across frequency, planarity control produced azimuthal RMSE of 4.2/4.5 degrees for the left/right channels respectively (least-squares 2.8/3.6 degrees). Future work should investigate the perceived spatial quality of the implemented system with respect to a reference stereophonic setup.
Studies on sound field control methods able to create independent listening zones in a single acoustic space have recently been undertaken due to the potential of such methods for various practical applications, such as individual audio streams in home entertainment. Existing solutions to the problem have shown to be effective in creating high and low sound energy regions under anechoic conditions. Although some case studies in a reflective environment can also be found, the capabilities of sound zoning methods in rooms have not been fully explored. In this paper, the influence of low-order (early) reflections on the performance of key sound zone techniques is examined. Analytic considerations for small-scale systems reveal strong dependence of performance on parameters such as source positioning with respect to zone locations and room surfaces, as well as the parameters of the receiver configuration. These dependencies are further investigated through numerical simulation to determine system configurations which maximize the performance in terms of acoustic contrast and array control effort. The design rules for source and receiver positioning are suggested, for improved performance under a given set of constraints such as a number of available sources, zone locations and the direction of the dominant reflection. © 2013 Acoustical Society of America.
Frequency-invariant beamformers are useful for spatial audio capture since their attenuation of sources outside the look direction is consistent across frequency. In particular, the least-squares beamformer (LSB) approximates arbitrary frequency-invariant beampatterns with generic microphone configurations. This paper investigates the effects of array geometry, directivity order and regularization for robust hypercardioid synthesis up to 15th order with the LSB, using three 2D 32-microphone array designs (rectangular grid, open circular, and circular with cylindrical baffle). While the directivity increases with order, the frequency range is inversely proportional to the order and is widest for the cylindrical array. Regularization results in broadening of the mainlobe and reduced on-axis response at low frequencies. The PEASS toolkit was used to evaluate perceptually beamformed speech signals.
Sound zone systems aim to produce regions within a room where listeners may consume separate audio programs with minimal acoustical interference. Often, there is a trade-off between the acoustic contrast achieved between the zones, and the fidelity of the reproduced audio program (the target quality). An open question is whether reducing contrast (i.e. allowing greater interference) can improve target quality. The planarity control sound zoning method can be used to improve spatial reproduction, though at the expense of decreased contrast. Hence, this can be used to investigate the relationship between target quality (which is affected by the spatial presentation) and distraction (which is related to the perceived effect of interference). An experiment was conducted investigating target quality and distraction, and examining their relationship with overall quality within sound zones. Sound zones were reproduced using acoustic contrast control, planarity control and pressure matching applied to a circular loudspeaker array. Overall quality was related to target quality and distraction, each having a similar magnitude of effect; however, the result was dependent upon program combination. The highest mean overall quality was a compromise between distraction and target quality, with energy arriving from up to 15 degrees either side of the target direction.
Personal audio systems generate a local sound field for a listener while attenuating the sound energy at pre-defined quiet zones. In practice, system performance is sensitive to errors in the acoustic transfer functions between the sources and the zones. Regularization is commonly used to improve robustness, however, selecting a regularization parameter is not always straightforward. In this paper, a design framework for robust reproduction is proposed, combining transfer function and error modelling. The framework allows a physical perspective on the regularization required for a system, based on the bound of assumed additive or multiplicative errors, which is obtained by acoustic modelling. Acoustic contrast control is separately combined with worst-case and probability-model optimization, exploiting limited knowledge of the potential error distribution. Monte-Carlo simulations show that these approaches give increased system robustness compared to the state of the art approaches for regularization parameter estimation, and experimental results verify that robust sound zone control is achieved in the presence of loudspeaker gain errors. Furthermore, by applying the proposed framework, in-situ transfer function measurements were reduced to a single measurement per loudspeaker, per zone, with limited acoustic contrast degradation of less than 2 dB over 100–3000 Hz compared to the fully measured regularized case.
Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as “image-source reversion”, which localizes the image source before finding the reflector position, and “direct localization”, which localizes the reflector without intermediate steps. We present five new contributions. First, an onset detector, called the clustered dynamic programming projected phase-slope algorithm, is proposed to automatically extract the time of arrival for early reflections within the RIRs of a compact microphone array. Second, we propose an image-source reversion method that uses the RIRs from a single loudspeaker. It is constructed by combining an image source locator (the image source direction and range (ISDAR) algorithm), and a reflector locator (using the loudspeaker-image bisection (LIB) algorithm). Third, two variants of it, exploiting multiple loudspeakers, are proposed. Fourth, we present a direct localization method, the ellipsoid tangent sample consensus (ETSAC), exploiting ellipsoid properties to localize the reflector. Finally, systematic experiments on simulated and measured RIRs are presented, comparing the proposed methods with the state-of-the-art. ETSAC generates errors lower than the alternative methods compared through our datasets. Nevertheless, the ISDAR-LIB combination performs well and has a run time 200 times faster than ETSAC.
Object-based audio promises format-agnostic reproduction and extensive personalization of spatial audio content. However, in practical listening scenarios, such as in consumer audio, ideal reproduction is typically not possible. To maximize the quality of listening experience, a different approach is required, for example modifications of metadata to adjust for the reproduction layout or personalization choices. In this paper we propose a novel system architecture for semantically informed rendering (SIR), that combines object audio rendering with high-level processing of object metadata. In many cases, this processing uses novel, advanced metadata describing the objects to optimally adjust the audio scene to the reproduction system or listener preferences. The proposed system is evaluated with several adaptation strategies, including semantically motivated downmix to layouts with few loudspeakers, manipulation of perceptual attributes, perceptual reverberation compensation, and orchestration of mobile devices for immersive reproduction. These examples demonstrate how SIR can significantly improve the media experience and provide advanced personalization controls, for example by maintaining smooth object trajectories on systems with few loudspeakers, or providing personalized envelopment levels. An example implementation of the proposed system architecture is described and provided as an open, extensible software framework that combines object-based audio rendering and high-level processing of advanced object metadata.
Media Device Orchestration (MDO) makes use of interconnected devices to augment a reproduction system, and could be used to deliver more immersive audio experiences to domestic audiences. To investigate optimal rendering on an MDO-based system, stimuli were created via: 1) object-based audio (OBA) mixes undertaken in a reference listening room; and 2) up to 13 rendered versions of these employing a range of installed and ad-hoc loudspeakers with varying cost, quality and position. The programme items include audio-visual material (short film trailer and big band performance) and audio-only material (radio panel show, pop track, football match, and orchestral performance). The object-based programme items and alternate MDO configurations are made available for testing and demonstrating OBA systems.
Homes contain a plethora of devices for audio-visual content consumption, which intelligent reproduction systems can exploit to give the best possible experience. To investigate media device ownership in the home, media service-types usage and solitary versus group audio/audio-visual media consumption, a survey of UK households with 1102 respondents was undertaken. The results suggest that there is already significant ownership of wireless and smart loudspeakers, as well as other interconnected devices containing loudspeakers such as smartphones and tablets. Questions on group media consumption suggest that the majority of listeners spend more time consuming media with others than alone, demonstrating an opportunity for systems which can adapt to varying audience requirements within the same environment.
This engineering brief reports on the production of 3 object-based audio drama scenes, commissioned as part of the S3A project. 3D reproduction and an object-based workflow were considered and implemented from the initial script commissioning through to the final mix of the scenes. The scenes are being made available as Broadcast Wave Format files containing all objects as separate tracks and all metadata necessary to render the scenes as an XML chunk in the header conforming to the Audio Definition Model specification (Recommendation ITU-R BS.2076 ). It is hoped that these scenes will find use in perceptual experiments and in the testing of 3D audio systems. The scenes are available via the following link: http://dx.doi.org/10.17866/rd.salford.3043921.
The topic of sound zone reproduction, whereby listeners sharing an acoustic space can receive personalized audio content, has been researched for a number of years. Recently, a number of sound zone systems have been realized, moving the concept towards becoming a practical reality. Current implementations of sound zone systems have relied upon conventional loudspeaker geometries such as linear and circular arrays. Line arrays may be compact, but do not necessarily give the system the opportunity to compensate for room reflections in real-world environments. Circular arrays give this opportunity, and also give greater flexibility for spatial audio reproduction, but typically require large numbers of loudspeakers in order to reproduce sound zones over an acceptable bandwidth. Therefore, one key area of research standing between the ideal capability and the performance of a physical system is that of establishing the number and location of the loudspeakers comprising the reproduction array. In this study, the topic of loudspeaker configurations was considered for two-zone reproduction, using a circular array of 60 loudspeakers as the candidate set for selection. A numerical search procedure was used to select a number of loudspeakers from the candidate set. The novel objective function driving the search comprised terms relating to the acoustic contrast between the zones, array effort, matrix condition number, and target zone planarity. The performance of the selected sets using acoustic contrast control was measured in an acoustically treated studio. Results demonstrate that the loudspeaker selection process has potential for maximising the contrast over frequency by increasing the minimum contrast over the frequency range 100--4000 Hz. The array effort and target planarity can also be optimised, depending on the formulation of the objective function. Future work should consider greater diversity of candidate locations.
Techniques such as multi-point optimization, wave field synthesis and ambisonics attempt to create spatial effects by synthesizing a sound field over a listening region. In this paper, we propose planarity panning, which uses superdirective microphone array beamforming to focus the sound from the specified direction, as an alternative approach. Simulations compare performance against existing strategies, considering the cases where the listener is central and non-central in relation to a 60 channel circular loudspeaker array. Planarity panning requires low control effort and provides high sound field planarity over a large frequency range, when the zone positions match the target regions specified for the filter calculations. Future work should implement and validate the perceptual properties of the method.
Personal audio provides private and personalized listening experiences by generating sound zones in a shared space with minimal interference between zones. One challenge of the design is to achieve the best performance with a limited number of microphones and loudspeakers. In this paper, two modal domain methods for personal audio reproduction are compared. One is the spatial harmonic decomposition (SHD) based method and the other is the singular value decomposition (SVD) based method. It is demonstrated that the SVD based method provides a more efficient modal domain decomposition than the SHD method for 2.5 dimensional personal audio design. Simulation results show that the SVD based method outperforms the SHD one by up to 10 dB in terms of acoustic contrast and up to 17 dB in terms of reproduction error for a compact arc array with five loudspeakers, while requiring fewer microphones around the zone boundaries. The SVD based method retains the inherent efficiency of optimizing in a modal domain while avoiding the inherent geometric limitations of using SHD basis functions. Thus, this approach is advantageous for applications with flexible system geometries and a small number of loudspeakers and microphones.
Reproduction of multiple sound zones, in which personal audio programs may be consumed without the need for headphones, is an active topic in acoustical signal processing. Many approaches to sound zone reproduction do not consider control of the bright zone phase, which may lead to self-cancellation problems if the loudspeakers surround the zones. Conversely, control of the phase in a least-squares sense comes at a cost of decreased level difference between the zones and frequency range of cancellation. Single-zone approaches have considered plane wave reproduction by focusing the sound energy in to a point in the wavenumber domain. In this article, a planar bright zone is reproduced via planarity control, which constrains the bright zone energy to impinge from a narrow range of angles via projection in to a spatial domain. Simulation results using a circular array surrounding two zones show the method to produce superior contrast to the least-squares approach, and superior planarity to the contrast maximization approach. Practical performance measurements obtained in an acoustically treated room verify the conclusions drawn under free-field conditions.
Object-based audio is gaining momentum as a means for future audio productions to be format-agnostic and interactive. Recent standardization developments make recommendations for object formats, however the capture, production and reproduction of reverberation is an open issue. In this paper, we review approaches for recording, transmitting and rendering reverberation over a 3D spatial audio system. Techniques include channel-based approaches where room signals intended for a specific reproduction layout are transmitted, and synthetic reverberators where the room effect is constructed at the renderer. We consider how each approach translates into an object-based context considering the end-to-end production chain of capture, representation, editing, and rendering. We discuss some application examples to highlight the implications of the various approaches.
Room Impulse Responses (RIRs) measured with microphone arrays capture spatial and nonspatial information, e.g. the early reflections’ directions and times of arrival, the size of the room and its absorption properties. The Reverberant Spatial Audio Object (RSAO) was proposed as a method to encode room acoustic parameters from measured array RIRs. As the RSAO is object-based audio compatible, its parameters can be rendered to arbitrary reproduction systems and edited to modify the reverberation characteristics, to improve the user experience. Various microphone array designs have been proposed for sound field and room acoustic analysis, but a comparative performance evaluation is not available. This study assesses the performance of five regular microphone array geometries (linear, rectangular, circular, dual-circular and spherical) to capture RSAO parameters for the direct sound and early reflections of RIRs. The image source method is used to synthesise RIRs at the microphone positions as well as at the centre of the array. From the array RIRs, the RSAO parameters are estimated and compared to the reference parameters at the centre of the array. A performance comparison among the five arrays is established as well as the effect of a rigid spherical baffle for the circular and spherical arrays. The effects of measurement uncertainties, such as microphone misplacement and sensor noise errors, are also studied. The results show that planar arrays achieve the most accurate horizontal localisation whereas the spherical arrays perform best in elevation. Arrays with smaller apertures achieve a higher number of detected reflections, which becomes more significant for the smaller room with higher reflection density.
Boundary estimation from an acoustic room impulse response (RIR), exploiting known sound propagation behavior, yields useful information for various applications: e.g., source separation, simultaneous localization and mapping, and spatial audio. The baseline method, an algorithm proposed by Antonacci et al., uses reflection times of arrival (TOAs) to hypothesize reflector ellipses. Here, we modify the algorithm for 3-D environments and for enhanced noise robustness: DYPSA and MUSIC for epoch detection and direction of arrival (DOA) respectively are combined for source localization, and numerical search is adopted for reflector estimation. Both methods, and other variants, are tested on measured RIR data; the proposed method performs best, reducing the estimation error by 30%.
Recent work on a reverberant spatial audio object (RSAO) encoded spatial room impulse responses (RIRs) as object-based metadata which can be synthesized in an object-based renderer. Encoding reverberation into metadata presents new opportunities for end users to interact with and personalize reverberant content. The RSAO models an RIR as a set of early re ections together with a late reverberation filter. Previous work to encode the RSAO parameters was based on recordings made with a dense array of omnidirectional microphones. This paper describes RSAO parameterization from first-order Ambisonic (B-Format) RIRs, making the RSAO compatible with existing spatial reverb libraries. The object-based implementation achieves reverberation time, early decay time, clarity and interaural cross-correlation similar to direct Ambisonic rendering of 13 test RIRs.
James Woodcock, Jon Franombe, Andreas Franck, Philip Coleman, Richard Hughes, Hansung Kim, Qingju Liu, Dylan Menzies, Marcos F Simón Gálvez, Yan Tang, Tim Brookes, William J Davies, Bruno M Fazenda, Russell Mason, Trevor J Cox, Filippo Maria Fazi, Philip Jackson, Chris Pike, Adrian Hilton (2018)A Framework for Intelligent Metadata Adaptation in Object-Based Audio, In: AES E-Librarypp. P11-3
Audio Engineering Society
Object-based audio can be used to customize, personalize, and optimize audio reproduction depending on the speci?c listening scenario. To investigate and exploit the bene?ts of object-based audio, a framework for intelligent metadata adaptation was developed. The framework uses detailed semantic metadata that describes the audio objects, the loudspeakers, and the room. It features an extensible software tool for real-time metadata adaptation that can incorporate knowledge derived from perceptual tests and/or feedback from perceptual meters to drive adaptation and facilitate optimal rendering. One use case for the system is demonstrated through a rule-set (derived from perceptual tests with experienced mix engineers) for automatic adaptation of object levels and positions when rendering 3D content to two- and ?ve-channel systems.
Reproduction of personal sound zones can be attempted by sound field synthesis, energy control, or a combination of both. Energy control methods can create an unpredictable pressure distribution in the listening zone. Sound field synthesis methods may be used to overcome this problem, but tend to produce a lower acoustic contrast between the zones. Here, we present a cost function to optimize the cancellation and the plane wave energy over a range of incoming azimuths, producing a planar sound field without explicitly specifying the propagation direction. Simulation results demonstrate the performance of the methods in comparison with the current state of the art. The method produces consistent high contrast and a consistently planar target sound zone across the frequency range 80-7000Hz. Copyright © (2013) by the Audio Engineering Society.
Recent work into 3D audio reproduction has considered the definition of a set of parameters to encode reverberation into an object-based audio scene. The reverberant spatial audio object (RSAO) describes the reverberation in terms of a set of localised, delayed and filtered (early) reflections, together with a late energy envelope modelling the diffuse late decay. The planarity metric, originally developed to evaluate the directionality of reproduced sound fields, is used to analyse a set of multichannel room impulse responses (RIRs) recorded at a microphone array. Planarity describes the spatial compactness of incident sound energy, which tends to decrease as the reflection density and diffuseness of the room response develop over time. Accordingly, planarity complements intensity-based diffuseness estimators, which quantify the degree to which the sound field at a discrete frequency within a particular time window is due to an impinging coherent plane wave. In this paper, we use planarity as a tool to analyse the sound field in relation to the RSAO parameters. Specifically, we use planarity to estimate two important properties of the sound field. First, as high planarity identifies the most localised reflections along the RIR, we estimate the most planar portions of the RIR, corresponding to the RSAO early reflection model and increasing the likelihood of detecting prominent specular reflections. Second, as diffuse sound fields give a low planarity score, we investigate planarity for data-based mixing time estimation. Results show that planarity estimates on measured multichannel RIR datasets represent a useful tool for room acoustics analysis and RSAO parameterisation.
Pressure matching (PM) and planarity control (PC) methods can be used to re- produce local sound with a certain orientation at the listening zone, while suppressing the sound energy at the quiet zone. In this letter, regularized PM and PC, incorporating coarse error estimation, are introduced to increase the robustness in non-ideal reproduction scenarios. Facilitated by this, the interaction between regularization, robustness, (tuned) personal audio optimization and local directional performance is explored. Simulations show that under certain conditions, PC and weighted PM achieve comparable performance, while PC is more robust to a poorly selected regularization parameter.
We present a novel pipeline to estimate reverberant spatial audio object (RSAO) parameters given room impulse responses (RIRs) recorded by ad-hoc microphone arrangements. The proposed pipeline performs three tasks: direct-to-reverberant-ratio (DRR) estimation; microphone localization; RSAO parametrization. RIRs recorded at Bridgewater Hall by microphones arranged for a BBC Philharmonic Orchestra performance were parametrized. Objective measures of the rendered RSAO reverberation characteristics were evaluated and compared with reverberation recorded by a Soundfield microphone. Alongside informal listening tests, the results confirmed that the rendered RSAO gave a plausible reproduction of the hall, comparable to the measured response. The objectification of the reverb from in-situ RIR measurements unlocks customization and personalization of the experience for different audio systems, user preferences and playback environments.
In this paper, we propose an iterative deep neural network (DNN)-based binaural source separation scheme, for recovering two concurrent speech signals in a room environment. Besides the commonly-used spectral features, the DNN also takes non-linearly wrapped binaural spatial features as input, which are refined iteratively using parameters estimated from the DNN output via a feedback loop. Different DNN structures have been tested, including a classic multilayer perception regression architecture as well as a new hybrid network with both convolutional and densely-connected layers. Objective evaluations in terms of PESQ and STOI showed consistent improvement over baseline methods using traditional binaural features, especially when the hybrid DNN architecture was employed. In addition, our proposed scheme is robust to mismatches between the training and testing data.
Sound eld control methods can be used to create multiple zones of audio in the same room. Separation achieved by such systems has classically been evaluated using physical metrics including acoustic contrast and target-to-interferer ratio (TIR). However, to optimise the experience for a listener it is desirable to consider perceptual factors. A search procedure was used to select 5 loudspeakers for production of 2 sound zones using acoustic contrast control. Comparisons were made between searches driven by physical (programme-independent TIR) and perceptual (distraction predictions from a statistical model) cost func- Tions. Performance was evaluated on TIR and predicted distraction in addition to subjective ratings. The perceptual cost function showed some benefits over physical optimisation, although the model used needs further work. Copyright © (2013) by the Audio Engineering Society.
The ability to replicate a plane wave represents an essential element of spatial sound field reproduction. In sound field synthesis, the desired field is often formulated as a plane wave and the error minimized; for other sound field control methods, the energy density or energy ratio is maximized. In all cases and further to the reproduction error, it is informative to characterize how planar the resultant sound field is. This paper presents a method for quantifying a region's acoustic planarity by superdirective beamforming with an array of microphones, which analyzes the azimuthal distribution of impinging waves and hence derives the planarity. Estimates are obtained for a variety of simulated sound field types, tested with respect to array orientation, wavenumber, and number of microphones. A range of microphone configurations is examined. Results are compared with delay-and-sum beamforming, which is equivalent to spatial Fourier decomposition. The superdirective beamformer provides better characterization of sound fields, and is effective with a moderate number of omni-directional microphones over a broad frequency range. Practical investigation of planarity estimation in real sound fields is needed to demonstrate its validity as a physical sound field evaluation measure. © 2013 Acoustical Society of America.
Since the mid 1990s, acoustics research has been undertaken relating to the sound zone problem—using loudspeakers to deliver a region of high sound pressure while simultaneously creating an area where the sound is suppressed—in order to facilitate independent listening within the same acoustic enclosure. The published solutions to the sound zone problem are derived from areas such as wave field synthesis and beamforming. However, the properties of such methods differ and performance tends to be compared against similar approaches. In this study, the suitability of energy focusing, energy cancelation, and synthesis approaches for sound zone reproduction is investigated. Anechoic simulations based on two zones surrounded by a circular array show each of the methods to have a characteristic performance, quantified in terms of acoustic contrast, array control effort and target sound field planarity. Regularization is shown to have a significant effect on the array effort and achieved acoustic contrast, particularly when mismatched conditions are considered between calculation of the source weights and their application to the system.
The acoustic environment affects the properties of the audio signals recorded. Generally, given room impulse responses (RIRs), three sets of parameters have to be extracted in order to create an acoustic model of the environment: sources, sensors and reflector positions. In this paper, the cross-correlation based iterative sensor position estimation (CISPE) algorithm is presented, a new method to estimate a microphone configuration, together with source and reflector position estimators. A rough measurement of the microphone positions initializes the process; then a recursive algorithm is applied to improve the estimates, exploiting a delay-and-sum beamformer. Knowing where the microphones lie in the space, the dynamic programming projected phase slope algorithm (DYPSA) extracts the times of arrival (TOAs) of the direct sounds from the RIRs, and multiple signal classification (MUSIC) extracts the directions of arrival (DOAs). A triangulation technique is then applied to estimate the source positions. Finally, exploiting properties of 3D quadratic surfaces (namely, ellipsoids), reflecting planes are localized via a technique ported from image processing, by random sample consensus (RANSAC). Simulation tests were performed on measured RIR datasets acquired from three different rooms located at the University of Surrey, using either a uniform circular array (UCA) or uniform rectangular array (URA) of microphones. Results showed small improvements with CISPE pre-processing in almost every case.
For many audio applications, availability of recorded multi-channel room impulse responses (MC-RIRs) is fundamental. They enable development and testing of acoustic systems for reflective rooms. We present multiple MC-RIR datasets recorded in diverse rooms, using up to 60 loudspeaker positions and various uniform compact microphone arrays. These datasets complement existing RIR libraries and have dense spatial sampling of a listening position. To reveal the encapsulated spatial information, several state of the art room visualization methods are presented. Results confirm the measurement fidelity and graphically depict the geometry of the recorded rooms. Further investigation of these recordings and visualization methods will facilitate object-based RIR encoding, integration of audio with other forms of spatial information, and meaningful extrapolation and manipulation of recorded compact microphone array RIRs.
At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.
Object-based audio is gaining momentum as a means for future audio content to be more immersive, interactive, and accessible. Recent standardization developments make recommendations for object formats, however, the capture, production and reproduction of reverberation is an open issue. In this paper, parametric approaches for capturing, representing, editing, and rendering reverberation over a 3D spatial audio system are reviewed. A framework is proposed for a Reverberant Spatial Audio Object (RSAO), which synthesizes reverberation inside an audio object renderer. An implementation example of an object scheme utilising the RSAO framework is provided, and supported with listening test results, showing that: the approach correctly retains the sense of room size compared to a convolved reference; editing RSAO parameters can alter the perceived room size and source distance; and, format-agnostic rendering can be exploited to alter listener envelopment.
For subjective experimentation on 3D audio systems, suitable programme material is needed. A large-scale recording session was performed in which four ensembles were recorded with a range of existing microphone techniques (aimed at mono, stereo, 5.0, 9.0, 22.0, ambisonic, and headphone reproduction) and a novel 48-channel circular microphone array. Further material was produced by remixing and augmenting pre-existing multichannel content. To mix and monitor the programme items (which included classical, jazz, pop and experimental music, and excerpts from a sports broadcast and a lm soundtrack), a flexible 3D audio reproduction environment was created. Solutions to the following challenges were found: level calibration for different reproduction formats; bass management; and adaptable signal routing from different software and fille formats.
Situated in the domain of urban sound scene classiﬁcation by humans and machines, this research is the ﬁrst step towards mapping urban noise pollution experienced indoors and ﬁnding ways to reduce its negative impact in peoples’ homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations and four different window states; two stationary states (open and close) and two transitional states (open to close and close to open). We have then built our machine recognition base lines for different scenarios (open set versus closed set) using a deep learning framework. The human listening test is also performed to be able to compare the human and machine performance for detecting the window state just using the acoustic cues. Our experimental results reveal that when using a simple machine baseline system, humans and machines are achieving similar average performance for closed set experiments.
Recent attention to the problem of controlling multiple loudspeakers to create sound zones has been directed toward practical issues arising from system robustness concerns. In this study, the effects of regularization are analyzed for three representative sound zoning methods. Regularization governs the control effort required to drive the loudspeaker array, via a constraint in each optimization cost function. Simulations show that regularization has a significant effect on the sound zone performance, both under ideal anechoic conditions and when systematic errors are introduced between calculation of the source weights and their application to the system. Results are obtained for speed of sound variations and loudspeaker positioning errors with respect to the source weights calculated. Judicious selection of the regularization parameter is shown to be a primary concern for sound zone system designers-the acoustic contrast can be increased by up to 50 dB with proper regularization in the presence of errors. A frequency-dependent minimum regularization parameter is determined based on the conditioning of the matrix inverse. The regularization parameter can be further increased to improve performance depending on the control effort constraints, expected magnitude of errors, and desired sound field properties of the system.
Philip Coleman, Qingju Liu, Jon Francombe, Philip Jackson (2018)Perceptual evaluation of blind source separation in object-based audio production, In: Latent Variable Analysis and Signal Separation - 14th International Conference, LVA/ICA 2018, Guildford, UK, July 2–5, 2018, Proceedingspp. 558-567
Object-based audio has the potential to enable multime- dia content to be tailored to individual listeners and their reproduc- tion equipment. In general, object-based production assumes that the objects|the assets comprising the scene|are free of noise and inter- ference. However, there are many applications in which signal separa- tion could be useful to an object-based audio work ow, e.g., extracting individual objects from channel-based recordings or legacy content, or recording a sound scene with a single microphone array. This paper de- scribes the application and evaluation of blind source separation (BSS) for sound recording in a hybrid channel-based and object-based workflow, in which BSS-estimated objects are mixed with the original stereo recording. A subjective experiment was conducted using simultaneously spoken speech recorded with omnidirectional microphones in a rever- berant room. Listeners mixed a BSS-extracted speech object into the scene to make the quieter talker clearer, while retaining acceptable au- dio quality, compared to the raw stereo recording. Objective evaluations show that the relative short-term objective intelligibility and speech qual- ity scores increase using BSS. Further objective evaluations are used to discuss the in uence of the BSS method on the remixing scenario; the scenario shown by human listeners to be useful in object-based audio is shown to be a worse-case scenario.
Studies on sound field control methods able to create independent listening zones in a single acoustic space have recently been undertaken due to the potential of such methods for various practical applications, such as individual audio streams in home entertainment. Existing solutions to the problem have shown to be effective in creating high and low sound energy regions under anechoic conditions. Although some case studies in a reflective environment can also be found, the capabilities of sound zoning methods in rooms have not been fully explored. In this paper, the influence of low-order (early) reflections on the performance of key sound zone techniques is examined. Analytic considerations for small-scale systems reveal strong dependence of performance on parameters such as source positioning with respect to zone locations and room surfaces, as well as the parameters of the receiver configuration. These dependencies are further investigated through numerical simulation to determine system configurations which maximize the performance in terms of acoustic contrast and array control effort. The design rules for source and receiver positioning are suggested, for improved performance under a given set of constraints such as a number of available sources, zone locations, and the direction of the dominant reflection.