Studies on sound field control methods able to create independent listening zones in a single acoustic space have recently been undertaken due to the potential of such methods for various practical applications, such as individual audio streams in home entertainment. Existing solutions to the problem have been shown to be effective in creating high and low sound energy regions under anechoic conditions. Although some case studies in a reflective environment can also be found, the capabilities of sound zoning methods in rooms have not been fully explored. In this paper, the influence of low-order (early) reflections on the performance of key sound zone techniques is examined. Analytic considerations for small-scale systems reveal strong dependence of performance on parameters such as source positioning with respect to zone locations and room surfaces, as well as the parameters of the receiver configuration. These dependencies are further investigated through numerical simulation to determine system configurations which maximize the performance in terms of acoustic contrast and array control effort. Design rules for source and receiver positioning are suggested for improved performance under a given set of constraints, such as the number of available sources, zone locations and the direction of the dominant reflection. © 2013 Acoustical Society of America.
For many audio applications, availability of recorded multi-channel room impulse responses (MC-RIRs) is fundamental. They enable development and testing of acoustic systems for reflective rooms. We present multiple MC-RIR datasets recorded in diverse rooms, using up to 60 loudspeaker positions and various uniform compact microphone arrays. These datasets complement existing RIR libraries and have dense spatial sampling of a listening position. To reveal the encapsulated spatial information, several state of the art room visualization methods are presented. Results confirm the measurement fidelity and graphically depict the geometry of the recorded rooms. Further investigation of these recordings and visualization methods will facilitate object-based RIR encoding, integration of audio with other forms of spatial information, and meaningful extrapolation and manipulation of recorded compact microphone array RIRs.
Since the mid 1990s, acoustics research has been undertaken relating to the sound zone problem (using loudspeakers to deliver a region of high sound pressure while simultaneously creating an area where the sound is suppressed) in order to facilitate independent listening within the same acoustic enclosure. The published solutions to the sound zone problem are derived from areas such as wave field synthesis and beamforming. However, the properties of such methods differ and performance tends to be compared against similar approaches. In this study, the suitability of energy focusing, energy cancelation, and synthesis approaches for sound zone reproduction is investigated. Anechoic simulations based on two zones surrounded by a circular array show each of the methods to have a characteristic performance, quantified in terms of acoustic contrast, array control effort and target sound field planarity. Regularization is shown to have a significant effect on the array effort and achieved acoustic contrast, particularly when mismatched conditions are considered between calculation of the source weights and their application to the system.
Reproduction of personal sound zones can be attempted by sound field synthesis, energy control, or a combination of both. Energy control methods can create an unpredictable pressure distribution in the listening zone. Sound field synthesis methods may be used to overcome this problem, but tend to produce a lower acoustic contrast between the zones. Here, we present a cost function to optimize the cancellation and the plane wave energy over a range of incoming azimuths, producing a planar sound field without explicitly specifying the propagation direction. Simulation results demonstrate the performance of the methods in comparison with the current state of the art. The method produces consistent high contrast and a consistently planar target sound zone across the frequency range 80-7000 Hz. Copyright © (2013) by the Audio Engineering Society.
Sound field control methods can be used to create multiple zones of audio in the same room. Separation achieved by such systems has classically been evaluated using physical metrics including acoustic contrast and target-to-interferer ratio (TIR). However, to optimise the experience for a listener it is desirable to consider perceptual factors. A search procedure was used to select 5 loudspeakers for production of 2 sound zones using acoustic contrast control. Comparisons were made between searches driven by physical (programme-independent TIR) and perceptual (distraction predictions from a statistical model) cost functions. Performance was evaluated on TIR and predicted distraction in addition to subjective ratings. The perceptual cost function showed some benefits over physical optimisation, although the model used needs further work. Copyright © (2013) by the Audio Engineering Society.
Whilst sound zoning methods have typically been studied under anechoic conditions, it is desirable to evaluate the performance of various methods in a real room. Three control methods were implemented (delay and sum, DS; acoustic contrast control, ACC; and pressure matching, PM) on two regular 24-element loudspeaker arrays (line and circle). The acoustic contrast between two zones was evaluated and the reproduced sound fields compared for uniformity of energy distribution. ACC generated the highest contrast, whilst PM produced a uniform bright zone. Listening tests were also performed using monophonic auralisations from measured system responses to collect ratings of perceived distraction due to the alternate audio programme. Distraction ratings were affected by control method and programme material. Copyright © (2013) by the Audio Engineering Society.
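As a rough illustration of the acoustic contrast metric and of acoustic contrast control (ACC) mentioned above, the following sketch computes ACC weights as the principal generalized eigenvector of the bright-zone and dark-zone spatial correlation matrices. Everything specific here is an assumption made for illustration: free-field monopole transfer functions, an 8-element line array (not the 24-element arrays measured in the paper), the zone positions, the 500 Hz frequency, and the relative regularization `rel_reg`. It is not the processing used in the study.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def transfer_matrix(sources, mics, freq):
    """Free-field monopole transfer functions, shape (n_mics, n_sources)."""
    k = 2 * np.pi * freq / C
    r = np.linalg.norm(mics[:, None, :] - sources[None, :, :], axis=-1)
    return np.exp(-1j * k * r) / (4 * np.pi * r)


def contrast_db(q, G_bright, G_dark):
    """Mean squared pressure in the bright zone relative to the dark zone, in dB."""
    e_bright = np.mean(np.abs(G_bright @ q) ** 2)
    e_dark = np.mean(np.abs(G_dark @ q) ** 2)
    return 10 * np.log10(e_bright / e_dark)


def acc_weights(G_bright, G_dark, rel_reg=1e-6):
    """Maximise q^H R_b q / q^H (R_d + reg I) q via a Cholesky-whitened eigenproblem."""
    R_b = G_bright.conj().T @ G_bright
    R_d = G_dark.conj().T @ G_dark
    n = R_d.shape[0]
    R_d = R_d + rel_reg * np.trace(R_d).real / n * np.eye(n)
    L = np.linalg.cholesky(R_d)                    # R_d = L L^H
    X = np.linalg.solve(L, R_b)                    # L^-1 R_b
    M = np.linalg.solve(L, X.conj().T).conj().T    # L^-1 R_b L^-H
    M = 0.5 * (M + M.conj().T)                     # remove rounding asymmetry
    _, vecs = np.linalg.eigh(M)                    # eigenvalues in ascending order
    q = np.linalg.solve(L.conj().T, vecs[:, -1])   # map back to source weights
    return q / np.linalg.norm(q)


def zone_grid(centre, half_width=0.1, n=3):
    """Small square grid of control microphones around a zone centre (2D)."""
    offsets = np.linspace(-half_width, half_width, n)
    xx, yy = np.meshgrid(offsets, offsets)
    return np.column_stack([centre[0] + xx.ravel(), centre[1] + yy.ravel()])


# Hypothetical geometry: line array on the x-axis, two zones mirrored about x = 0.
sources = np.column_stack([np.linspace(-0.7, 0.7, 8), np.zeros(8)])
G_b = transfer_matrix(sources, zone_grid((-0.5, 1.5)), 500.0)
G_d = transfer_matrix(sources, zone_grid((0.5, 1.5)), 500.0)

q_acc = acc_weights(G_b, G_d)
q_uniform = np.ones(8) / np.sqrt(8)
print("ACC contrast (dB):", contrast_db(q_acc, G_b, G_d))
print("Uniform-weight contrast (dB):", contrast_db(q_uniform, G_b, G_d))
```

Because the assumed geometry is mirror-symmetric, equal weights deliver the same energy to both zones, which gives a convenient baseline against which the ACC weights can be compared.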
Object-based audio is gaining momentum as a means for future audio content to be more immersive,
interactive, and accessible. Recent standardization developments make recommendations for object
formats; however, the capture, production and reproduction of reverberation is an open issue. In this
paper, parametric approaches for capturing, representing, editing, and rendering reverberation over a
3D spatial audio system are reviewed. A framework is proposed for a Reverberant Spatial Audio Object
(RSAO), which synthesizes reverberation inside an audio object renderer. An implementation example
of an object scheme utilising the RSAO framework is provided, and supported with listening test
results, showing that: the approach correctly retains the sense of room size compared to a convolved
reference; editing RSAO parameters can alter the perceived room size and source distance; and,
format-agnostic rendering can be exploited to alter listener envelopment.
The acoustic environment affects the properties of the audio signals recorded. Generally, given room impulse responses (RIRs), three sets of parameters have to be extracted in order to create an acoustic model of the environment: source, sensor and reflector positions. In this paper, the cross-correlation based iterative sensor position estimation (CISPE) algorithm is presented, a new method to estimate a microphone configuration, together with source and reflector position estimators. A rough measurement of the microphone positions initializes the process; then a recursive algorithm is applied to improve the estimates, exploiting a delay-and-sum beamformer. Knowing where the microphones lie in the space, the dynamic programming projected phase slope algorithm (DYPSA) extracts the times of arrival (TOAs) of the direct sounds from the RIRs, and multiple signal classification (MUSIC) extracts the directions of arrival (DOAs). A triangulation technique is then applied to estimate the source positions. Finally, exploiting properties of 3D quadratic surfaces (namely, ellipsoids), reflecting planes are localized via a technique ported from image processing, by random sample consensus (RANSAC). Simulation tests were performed on measured RIR datasets acquired from three different rooms located at the University of Surrey, using either a uniform circular array (UCA) or uniform rectangular array (URA) of microphones. Results showed small improvements with CISPE pre-processing in almost every case.
Recent attention to the problem of controlling multiple loudspeakers to create sound zones has been directed towards practical issues arising from system robustness concerns. In this study, the effects of regularization are analyzed for three representative sound zoning methods. Regularization governs the control effort required to drive the loudspeaker array, via a constraint in each optimization cost function. Simulations show that regularization has a significant effect on the sound zone performance, both under ideal anechoic conditions and when systematic errors are introduced between calculation of the source weights and their application to the system. Results are obtained for speed of sound variations and loudspeaker positioning errors with respect to the source weights calculated. Judicious selection of the regularization parameter is shown to be a primary concern for sound zone system designers: the acoustic contrast can be increased by up to 50 dB with proper regularization in the presence of errors. A frequency-dependent minimum regularization parameter is determined based on the conditioning of the matrix inverse. The regularization parameter can be further increased to improve performance depending on the control effort constraints, expected magnitude of errors, and desired sound field properties of the system. © 2013 Acoustical Society of America.
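The link between regularization, control effort and matrix conditioning described above can be sketched with a Tikhonov-regularized pressure-matching solution. This is only an illustrative sketch under stated assumptions: free-field monopole sources, a hypothetical 8-source circular array and two six-microphone zones, a 1 kHz plane-wave target, and absolute regularization values chosen arbitrarily. The effort is reported as 10 log10 of the squared norm of the weights, without the reference-source normalization a full study would use.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def transfer_matrix(sources, mics, k):
    """Free-field monopole transfer functions, shape (n_mics, n_sources)."""
    r = np.linalg.norm(mics[:, None, :] - sources[None, :, :], axis=-1)
    return np.exp(-1j * k * r) / (4 * np.pi * r)


def pressure_matching(G, d, reg):
    """Minimise ||G q - d||^2 + reg * ||q||^2; also return cond of the inverted matrix."""
    A = G.conj().T @ G + reg * np.eye(G.shape[1])
    q = np.linalg.solve(A, G.conj().T @ d)
    return q, np.linalg.cond(A)


def effort_db(q):
    """Control effort as the sum of squared source weights, in dB re unity."""
    return 10 * np.log10(np.real(np.vdot(q, q)))


def ring(centre, radius, n):
    """n points equally spaced on a circle (2D)."""
    ang = np.linspace(0, 2 * np.pi, n, endpoint=False)
    return np.column_stack([centre[0] + radius * np.cos(ang),
                            centre[1] + radius * np.sin(ang)])


# Hypothetical setup: 8 sources on a 1.5 m circle, two zones of 6 microphones each.
k = 2 * np.pi * 1000.0 / C
sources = ring((0.0, 0.0), 1.5, 8)
bright, dark = ring((-0.4, 0.0), 0.1, 6), ring((0.4, 0.0), 0.1, 6)
G = transfer_matrix(sources, np.vstack([bright, dark]), k)

# Target: plane wave travelling along +y in the bright zone, silence in the dark zone.
d = np.concatenate([np.exp(-1j * k * bright[:, 1]), np.zeros(len(dark))])

regs = [1e-8, 1e-6, 1e-4, 1e-2]
results = [pressure_matching(G, d, reg) for reg in regs]
efforts = [effort_db(q) for q, _ in results]
conds = [c for _, c in results]
for reg, e, c in zip(regs, efforts, conds):
    print(f"reg={reg:.0e}  effort={e:7.2f} dB  cond={c:.3e}")
```

Increasing the regularization parameter lowers both the effort and the condition number of the matrix being inverted, at the cost of a larger reproduction error; choosing the smallest value that keeps the inverse well conditioned is in the spirit of the frequency-dependent minimum described in the abstract.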
Coleman P Loudspeaker array processing for personal sound zone reproduction,
Sound zone reproduction facilitates listeners wishing to consume personal audio content within the same acoustic enclosure by filtering loudspeaker signals to create constructive and destructive interference in different spatial regions. Published solutions to the sound zone problem are derived from areas such as sound field synthesis and beamforming. The first contribution of this thesis is a comparative study of multi-point approaches. A new metric of planarity is adopted to analyse the spatial distribution of energy in the target zone, and the well-established metrics of acoustic contrast and control effort are also used. Simulations and experimental results demonstrate the advantages and disadvantages of the approaches. Energy cancellation produces good acoustic contrast but allows very little control over the target sound field; synthesis-derived approaches precisely control the target sound field but produce less contrast.
Motivated by the limitations of the existing optimization methods, the central contribution of this thesis is a proposed optimization cost function, 'planarity control', which maximizes the acoustic contrast between the zones while controlling sound field planarity by projecting the target zone energy into a spatial domain. Planarity control is shown to achieve good contrast and high target zone planarity over a large frequency range. The method also has potential for reproducing stereophonic material in the context of sound zones.
The remaining contributions consider two further practical concerns. First, judicious choice of the regularization parameter is shown to have a significant effect on the contrast, effort and robustness. Second, attention is given to the problem of optimally positioning the loudspeakers via a numerical framework and objective function.
The simulation and experimental results presented in this thesis represent a significant addition to the literature and will influence the future choices of control methods, regularization and loudspeaker placement for personal audio. Future systems may incorporate 3D rendering and listener tracking.
Object-based audio is gaining momentum as a means for future audio productions to be format-agnostic and interactive. Recent standardization developments make recommendations for object formats; however, the capture, production and reproduction of reverberation is an open issue. In this paper, we review approaches for recording, transmitting and rendering reverberation over a 3D spatial audio system. Techniques include channel-based approaches where room signals intended for a specific reproduction layout are transmitted, and synthetic reverberators where the room effect is constructed at the renderer. We consider how each approach translates into an object-based context considering the end-to-end production chain of capture, representation, editing, and rendering. We discuss some application examples to highlight the implications of the various approaches.
The topic of sound zone reproduction, whereby listeners sharing an acoustic space can receive personalized audio content, has been researched for a number of years. Recently, a number of sound zone systems have been realized, moving the concept towards becoming a practical reality. Current implementations of sound zone systems have relied upon conventional loudspeaker geometries such as linear and circular arrays. Line arrays may be compact, but do not necessarily give the system the opportunity to compensate for room reflections in real-world environments. Circular arrays give this opportunity, and also give greater flexibility for spatial audio reproduction, but typically require large numbers of loudspeakers in order to reproduce sound zones over an acceptable bandwidth. Therefore, one key area of research standing between the ideal capability and the performance of a physical system is that of establishing the number and location of the loudspeakers comprising the reproduction array. In this study, the topic of loudspeaker configurations was considered for two-zone reproduction, using a circular array of 60 loudspeakers as the candidate set for selection. A numerical search procedure was used to select a number of loudspeakers from the candidate set. The novel objective function driving the search comprised terms relating to the acoustic contrast between the zones, array effort, matrix condition number, and target zone planarity. The performance of the selected sets using acoustic contrast control was measured in an acoustically treated studio. Results demonstrate that the loudspeaker selection process has potential for maximising the contrast over frequency by increasing the minimum contrast over the frequency range 100-4000 Hz. The array effort and target planarity can also be optimised, depending on the formulation of the objective function. Future work should consider greater diversity of candidate locations.
Remaggi L, Jackson PJB, Coleman P, Wang W (2014) Room boundary estimation from acoustic room impulse responses, Proc. Sensor Signal Processing for Defence (SSPD 2014) pp. 1-5 IEEE
Boundary estimation from an acoustic room impulse response (RIR), exploiting known sound propagation behavior, yields useful information for various applications: e.g., source separation, simultaneous localization and mapping, and spatial audio. The baseline method, an algorithm proposed by Antonacci et al., uses reflection times of arrival (TOAs) to hypothesize reflector ellipses. Here, we modify the algorithm for 3-D environments and for enhanced noise robustness: DYPSA and MUSIC for epoch detection and direction of arrival (DOA) respectively are combined for source localization, and numerical search is adopted for reflector estimation. Both methods, and other variants, are tested on measured RIR data; the proposed method performs best, reducing the estimation error by 30%.
Coleman P, Jackson PJB, Francombe J (2015) Audio Object Separation Using Microphone Array Beamforming, Audio Engineering Society
Audio production is moving toward an object-based approach, where content is represented as audio together with metadata that describe the sound scene. From current object definitions, it would usually be expected that the audio portion of the object is free from interfering sources. This poses a potential problem for object-based capture, if microphones cannot be placed close to a source. This paper investigates the application of microphone array beamforming to separate a mixture into distinct audio objects. Real mixtures recorded by a 48-channel microphone array in reflective rooms were separated, and the results were evaluated using perceptual models in addition to physical measures based on the beam pattern. The effect of interfering objects was reduced by applying the beamforming techniques.
Sound field control to create multiple personal audio spaces (sound zones) in a shared listening environment is an active research topic. Typically, sound zones in the literature have aimed to reproduce monophonic audio programme material. The planarity control optimization approach can reproduce sound zones with high levels of acoustic contrast, while constraining the energy flux distribution in the target zone to impinge from a certain range of azimuths. Such a constraint has been shown to reduce problematic self-cancellation artefacts such as uneven sound pressure levels and complex phase patterns within the target zone. Furthermore, multichannel reproduction systems have the potential to reproduce spatial audio content at arbitrary listening positions (although most exclusively target a 'sweet spot'). By designing the planarity control to constrain the impinging energy rather tightly, a sound field approximating a plane-wave can be reproduced for a listener in an arbitrarily-placed target zone. In this study, the application of planarity control for stereo reproduction in the context of a personal audio system was investigated. Four solutions, to provide virtual left and right channels for two audio programmes, were calculated and superposed to achieve the stereo effect in two separate sound zones. The performance was measured in an acoustically treated studio using a 60 channel circular array, and compared against a least-squares pressure matching solution whereby each channel was reproduced as a plane wave field. Results demonstrate that planarity control achieved 6 dB greater mean contrast than the least-squares case over the range 250-2000 Hz. Based on the principal directions of arrival across frequency, planarity control produced azimuthal RMSE of 4.2/4.5 degrees for the left/right channels respectively (least-squares 2.8/3.6 degrees).
Future work should investigate the perceived spatial quality of the implemented system with respect to a reference stereophonic setup.
Jackson PJ, Jacobsen F, Coleman P, Pedersen JA (2013) Sound field planarity characterized by superdirective beamforming, Proceedings of Meetings on Acoustics 19
The ability to replicate a plane wave represents an essential element of spatial sound field reproduction. In sound field synthesis, the desired field is often formulated as a plane wave and the error minimized; for other sound field control methods, the energy density or energy ratio is maximized. In all cases and further to the reproduction error, it is informative to characterize how planar the resultant sound field is. This paper presents a method for quantifying a region's acoustic planarity by superdirective beamforming with an array of microphones, which analyzes the azimuthal distribution of impinging waves and hence derives the planarity. Estimates are obtained for a variety of simulated sound field types, tested with respect to array orientation, wavenumber, and number of microphones. A range of microphone configurations is examined. Results are compared with delay-and-sum beamforming, which is equivalent to spatial Fourier decomposition. The superdirective beamformer provides better characterization of sound fields, and is effective with a moderate number of omni-directional microphones over a broad frequency range. Practical investigation of planarity estimation in real sound fields is needed to demonstrate its validity as a physical sound field evaluation measure. © 2013 Acoustical Society of America.
Techniques such as multi-point optimization, wave field synthesis and ambisonics attempt to create spatial effects by synthesizing a sound field over a listening region. In this paper, we propose planarity panning, which uses superdirective microphone array beamforming to focus the sound from the specified direction, as an alternative approach. Simulations compare performance against existing strategies, considering the cases where the listener is central and non-central in relation to a 60 channel circular loudspeaker array. Planarity panning requires low control effort and provides high sound field planarity over a large frequency range, when the zone positions match the target regions specified for the filter calculations. Future work should implement and validate the perceptual properties of the method.
Sound zone systems aim to produce regions within a room where listeners may consume separate audio programs with minimal acoustical interference. Often, there is a trade-off between the acoustic contrast achieved between the zones, and the fidelity of the reproduced audio program (the target quality). An open question is whether reducing contrast (i.e. allowing greater interference) can improve target quality. The planarity control sound zoning method can be used to improve spatial reproduction, though at the expense of decreased contrast. Hence, this can be used to investigate the relationship between target quality (which is affected by the spatial presentation) and distraction (which is related to the perceived effect of interference). An experiment was conducted investigating target quality and distraction, and examining their relationship with overall quality within sound zones. Sound zones were reproduced using acoustic contrast control, planarity control and pressure matching applied to a circular loudspeaker array. Overall quality was related to target quality and distraction, each having a similar magnitude of effect; however, the result was dependent upon program combination. The highest mean overall quality was a compromise between distraction and target quality, with energy arriving from up to 15 degrees either side of the target direction.
It is of interest to create regions of increased and reduced sound pressure ('sound zones') in an enclosure such that different audio programs can be simultaneously delivered over loudspeakers, thus allowing listeners sharing a space to receive independent audio without physical barriers or headphones. Where previous comparisons of sound zoning techniques exist, they have been conducted under favorable acoustic conditions, utilizing simulations based on theoretical transfer functions or anechoic measurements. Outside of these highly specified and controlled environments, real-world factors including reflections, measurement errors, matrix conditioning and practical filter design degrade the realizable performance. This study compares the performance of sound zoning techniques when applied to create two sound zones in simulated and real acoustic environments. In order to compare multiple methods in a common framework without unduly hindering performance, an optimization procedure for each method is first used to select the best loudspeaker positions in terms of robustness, efficiency and the acoustic contrast deliverable to both zones. The characteristics of each control technique are then studied, noting the contrast and the impact of acoustic conditions on performance.
The problem of delivering personal audio content to listeners sharing the same acoustic space has recently attracted attention. It has been shown that a perceptually acceptable level of acoustic separation between the listening zones is difficult to achieve with active control in non-anechoic conditions. A common problem of strong first order reflections has not been examined in detail for systems with practical constraints. Acoustic contrast maximization combined with optimization of source positions is identified as a potentially effective control strategy when strong individual reflections occur. An analytic study is carried out to describe the relationship between the performance of a 2 × 2 (two sources and two control sensors) system and its geometry in a single-reflection scenario. The expression for acoustic contrast is used to formulate guidelines for optimizing source positions, based on three distinct techniques: Null-Split, Far-Align, and Near-Align. The applicability of the techniques to larger systems with up to two reflections is demonstrated using numerical optimization. Simulation results show that optimized systems produce higher acoustic contrast than non-optimized source arrangements and an alternative method for reducing the impact of reflections (sound power minimization).
For subjective experimentation on 3D audio systems, suitable programme material is needed. A large-scale recording session was performed in which four ensembles were recorded with a range of existing microphone techniques (aimed at mono, stereo, 5.0, 9.0, 22.0, ambisonic, and headphone reproduction) and a novel 48-channel circular microphone array. Further material was produced by remixing and augmenting pre-existing multichannel content. To mix and monitor the programme items (which included classical, jazz, pop and experimental music, and excerpts from a sports broadcast and a film soundtrack), a flexible 3D audio reproduction environment was created. Solutions to the following challenges were found: level calibration for different reproduction formats; bass management; and adaptable signal routing from different software and file formats.
Reproduction of multiple sound zones, in which personal audio programs may be consumed without the need for headphones, is an active topic in acoustical signal processing. Many approaches to sound zone reproduction do not consider control of the bright zone phase, which may lead to self-cancellation problems if the loudspeakers surround the zones. Conversely, control of the phase in a least-squares sense comes at a cost of decreased level difference between the zones and frequency range of cancellation. Single-zone approaches have considered plane wave reproduction by focusing the sound energy in to a point in the wavenumber domain. In this article, a planar bright zone is reproduced via planarity control, which constrains the bright zone energy to impinge from a narrow range of angles via projection in to a spatial domain. Simulation results using a circular array surrounding two zones show the method to produce superior contrast to the least-squares approach, and superior planarity to the contrast maximization approach. Practical performance measurements obtained in an acoustically treated room verify the conclusions drawn under free-field conditions.
Planarity panning (PP) and planarity control (PC) have previously been shown to be efficient methods for focusing directional sound energy into listening zones. In this paper, we consider sound field control for two listeners. First, PP is extended to create spatial audio for two listeners consuming the same spatial audio content. Then, PC is used to create highly directional sound and cancel interfering audio. Simulation results compare PP and PC against pressure matching (PM) solutions. For multiple listeners listening to the same content, PP creates directional sound at lower effort than the PM counterpart. When listeners consume different audio, PC produces greater acoustic contrast than PM, with excellent directional control except for frequencies where grating lobes generate problematic interference patterns.
Estimating and parameterizing the early and late reflections of an enclosed space is an interesting topic in acoustics. With a suitable set of parameters, the current concept of a spatial audio object (SAO), which is typically limited to either direct (dry) sound or diffuse field components, could be extended to afford an editable spatial description of the room acoustics. In this paper we present an analysis/synthesis method for parameterizing a set of measured room impulse responses (RIRs). RIRs were recorded in a medium-sized auditorium, using a uniform circular array of microphones representing the perspective of a listener in the front row. During the analysis process, these RIRs were decomposed, in time, into three parts: the direct sound, the early reflections, and the late reflections. From the direct sound and early reflections, parameters were extracted for the length, amplitude, and direction of arrival (DOA) of the propagation paths by exploiting the dynamic programming projected phase-slope algorithm (DYPSA) and classical delay-and-sum beamformer (DSB). Their spectral envelope was calculated using linear predictive coding (LPC). Late reflections were modeled by frequency-dependent decays excited by band-limited Gaussian noise. The combination of these parameters for a given source position and the direct source signal represents the reverberant or 'wet' spatial audio object. RIRs synthesized for a specified rendering and reproduction arrangement were convolved with dry sources to form reverberant components of the sound scene. The resulting signals demonstrated potential for these techniques, e.g., in SAO reproduction over a 22.2 surround sound system.
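The late-reflection model described above (frequency-dependent decays excited by band-limited Gaussian noise) can be sketched as follows. This is a minimal sketch, not the authors' synthesis code: the band edges, the RT60 values, the FFT-mask band limiting and the function name `late_reverb` are all assumptions chosen for illustration.

```python
import numpy as np


def late_reverb(fs, duration, band_edges, rt60s, seed=0):
    """Sum of band-limited Gaussian noise bursts, each with its own exponential decay.

    band_edges: list of (low_hz, high_hz) tuples (hypothetical band layout).
    rt60s: decay time in seconds for each band (time to fall by 60 dB).
    """
    rng = np.random.default_rng(seed)
    n = int(fs * duration)
    t = np.arange(n) / fs
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for (lo, hi), rt60 in zip(band_edges, rt60s):
        # Band-limit white Gaussian noise by zeroing FFT bins outside [lo, hi).
        spectrum = np.fft.rfft(rng.standard_normal(n))
        spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
        band_noise = np.fft.irfft(spectrum, n=n)
        # Amplitude envelope that reaches -60 dB (a factor of 1/1000) at t = rt60.
        out += band_noise * np.exp(-np.log(1000.0) * t / rt60)
    return out


# Example with assumed values: a longer decay at low frequencies than at high ones.
tail = late_reverb(fs=16000, duration=1.0,
                   band_edges=[(125.0, 1000.0), (1000.0, 4000.0)],
                   rt60s=[0.6, 0.3])
print(tail.shape)
```

A synthesized tail of this kind would be appended after the direct sound and early reflections when rebuilding an RIR for rendering; a practical implementation would likely use a proper filter bank rather than a hard FFT mask.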
Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as 'image-source reversion', which localizes the image source before finding the reflector position, and 'direct localization', which localizes the reflector without intermediate steps. We present five new contributions. First, an onset detector, called the clustered dynamic programming projected phase-slope algorithm, is proposed to automatically extract the time of arrival for early reflections within the RIRs of a compact microphone array. Second, we propose an image-source reversion method that uses the RIRs from a single loudspeaker. It is constructed by combining an image source locator (the image source direction and range (ISDAR) algorithm), and a reflector locator (using the loudspeaker-image bisection (LIB) algorithm). Third, two variants of it, exploiting multiple loudspeakers, are proposed. Fourth, we present a direct localization method, the ellipsoid tangent sample consensus (ETSAC), exploiting ellipsoid properties to localize the reflector. Finally, systematic experiments on simulated and measured RIRs are presented, comparing the proposed methods with the state-of-the-art. ETSAC yields lower errors than the alternative methods across our datasets. Nevertheless, the ISDAR-LIB combination performs well and has a run time 200 times faster than ETSAC.
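The geometry behind image-source reversion can be sketched in a few lines: an image source is placed along the reflection's direction of arrival at a range given by its time of arrival, and the reflector is then taken as the plane that perpendicularly bisects the segment between the loudspeaker and its image. This is a geometric sketch only, under the assumptions of a known loudspeaker position, a TOA measured from the emission instant, and noise-free DOA/TOA values; the function names are hypothetical and it does not reproduce the published ISDAR or LIB algorithms.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def image_source_from_doa_toa(array_centre, doa_unit, toa, c=C):
    """Place the image source along the DOA, at range c * toa from the array centre."""
    return array_centre + c * toa * doa_unit


def reflector_from_image(source, image):
    """Plane bisecting the source-image segment, as (unit normal n, offset d), n . x = d."""
    n = image - source
    n = n / np.linalg.norm(n)
    midpoint = 0.5 * (source + image)
    return n, float(n @ midpoint)


# Hypothetical test case: a wall in the plane x = 0.
source = np.array([1.0, 2.0, 1.5])
array_centre = np.array([2.0, 1.0, 1.2])
true_image = np.array([-1.0, 2.0, 1.5])  # mirror of the source in x = 0

# Ideal DOA and TOA that a compact array would observe for this reflection.
path = true_image - array_centre
toa = np.linalg.norm(path) / C
doa = path / np.linalg.norm(path)

est_image = image_source_from_doa_toa(array_centre, doa, toa)
normal, offset = reflector_from_image(source, est_image)
print("normal:", normal, "offset:", offset)
```

With measured data the DOA and TOA estimates are noisy, so the recovered plane would only approximate the true reflector; the multiple-loudspeaker variants mentioned in the abstract presumably help by combining several such estimates.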
Personal audio systems generate a local sound field for a listener while attenuating the sound energy at pre-defined quiet zones. In practice, system performance is sensitive to errors in the acoustic transfer functions between the sources and the zones. Regularization is commonly used to improve robustness; however, selecting a regularization parameter is not always straightforward. In this paper, a design framework for robust reproduction is proposed, combining transfer function and error modelling. The framework allows a physical perspective on the regularization required for a system, based on the bound of assumed additive or multiplicative errors, which is obtained by acoustic modelling. Acoustic contrast control is separately combined with worst-case and probability-model optimization, exploiting limited knowledge of the potential error distribution. Monte-Carlo simulations show that these approaches give increased system robustness compared to the state-of-the-art approaches for regularization parameter estimation, and experimental results verify that robust sound zone control is achieved in the presence of loudspeaker gain errors. Furthermore, by applying the proposed framework, in-situ transfer function measurements were reduced to a single measurement per loudspeaker, per zone, with limited acoustic contrast degradation of less than 2 dB over 100-3000 Hz compared to the fully measured regularized case.
Multi-point approaches for sound field control generally sample the listening zone(s) with pressure
microphones, and use these measurements as an input for an optimisation cost function.
A number of techniques are based on this concept, for single-zone (e.g. least-squares pressure
matching (PM), brightness control, planarity panning) and multi-zone (e.g. PM, acoustic contrast
control, planarity control) reproduction. Accurate performance predictions are obtained when distinct
microphone positions are employed for setup versus evaluation. While, in simulation, one
can afford a dense sampling of virtual microphones, it is desirable in practice to have a microphone
array which can be positioned once in each zone to measure the setup transfer functions
between each loudspeaker and that zone. In this contribution, we present simulation results over
a fixed dense set of evaluation points comparing the performance of several multi-point optimisation
approaches for 2D reproduction with a 60 channel circular loudspeaker arrangement. Various
regular setup microphone arrays are used to calculate the sound zone filters: circular grid, circular,
dual-circular, and spherical arrays, each with different numbers of microphones. Furthermore, the
effect of a rigid spherical baffle is studied for the circular and spherical arrangements. The results
of this comparative study show how the directivity and effective frequency range of multi-point
optimisation techniques depend on the microphone array used to sample the zones. In general,
microphone arrays with dense spacing around the boundary give better angular discrimination,
leading to more accurate directional sound reproduction, while those distributed around the whole
zone enable more accurate prediction of the reproduced target sound pressure level.
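To make the multi-point idea concrete, the following minimal sketch solves a single-frequency pressure matching problem by Tikhonov-regularized least squares over a set of setup microphones. The 16-loudspeaker circle, the 3 x 3 microphone grid, the free-field monopole model, and the regularization value are assumptions for illustration. They do not correspond to the 60-channel arrangement or the array geometries compared in the study.

```python
import numpy as np

def pressure_matching(G, target, reg):
    """Minimize ||G q - target||^2 + reg * ||q||^2 over source weights q."""
    n_src = G.shape[1]
    lhs = G.conj().T @ G + reg * np.eye(n_src)
    rhs = G.conj().T @ target
    return np.linalg.solve(lhs, rhs)

c, freq = 343.0, 500.0
k = 2.0 * np.pi * freq / c

# Hypothetical layout: 16 loudspeakers on a 1.5 m circle and a
# 3 x 3 grid of setup microphones sampling a zone at the centre.
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
loudspeakers = 1.5 * np.column_stack([np.cos(angles), np.sin(angles)])
grid = np.linspace(-0.1, 0.1, 3)
mics = np.array([[x, y] for x in grid for y in grid])

# Assumed free-field monopole transfer functions, shape (n_mics, n_sources).
r = np.linalg.norm(mics[:, None, :] - loudspeakers[None, :, :], axis=-1)
G = np.exp(-1j * k * r) / (4.0 * np.pi * r)

# Target: unit-amplitude plane wave travelling along +x, sampled at the microphones.
target = np.exp(-1j * k * mics[:, 0])

q = pressure_matching(G, target, reg=1e-4)
residual = G @ q - target
error_db = 10.0 * np.log10(np.sum(np.abs(residual) ** 2) / np.sum(np.abs(target) ** 2))
print("normalized reproduction error at setup microphones (dB):", error_db)
```

The error printed here is measured at the same points used to compute the weights. As the abstract notes, a realistic performance prediction would evaluate the weights at a separate, denser set of evaluation points.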
Recent work into 3D audio reproduction has considered the definition of a set of parameters to encode
reverberation into an object-based audio scene. The reverberant spatial audio object (RSAO)
describes the reverberation in terms of a set of localised, delayed and filtered (early) reflections,
together with a late energy envelope modelling the diffuse late decay. The planarity metric, originally
developed to evaluate the directionality of reproduced sound fields, is used to analyse a set of
multichannel room impulse responses (RIRs) recorded at a microphone array. Planarity describes
the spatial compactness of incident sound energy, which tends to decrease as the reflection density
and diffuseness of the room response develop over time. Accordingly, planarity complements
intensity-based diffuseness estimators, which quantify the degree to which the sound field at a
discrete frequency within a particular time window is due to an impinging coherent plane wave.
In this paper, we use planarity as a tool to analyse the sound field in relation to the RSAO parameters.
Specifically, we use planarity to estimate two important properties of the sound field. First,
as high planarity identifies the most localised reflections along the RIR, we estimate the most
planar portions of the RIR, corresponding to the RSAO early reflection model and increasing the
likelihood of detecting prominent specular reflections. Second, as diffuse sound fields give a low
planarity score, we investigate planarity for data-based mixing time estimation. Results show
that planarity estimates on measured multichannel RIR datasets represent a useful tool for room
acoustics analysis and RSAO parameterisation.
Room Impulse Responses (RIRs) measured with microphone arrays capture spatial and nonspatial
information, e.g. the early reflections' directions and times of arrival, the size of the
room and its absorption properties. The Reverberant Spatial Audio Object (RSAO) was proposed
as a method to encode room acoustic parameters from measured array RIRs. As the RSAO is
object-based audio compatible, its parameters can be rendered to arbitrary reproduction systems
and edited to modify the reverberation characteristics, to improve the user experience. Various
microphone array designs have been proposed for sound field and room acoustic analysis, but a
comparative performance evaluation is not available. This study assesses the performance of five
regular microphone array geometries (linear, rectangular, circular, dual-circular and spherical) to
capture RSAO parameters for the direct sound and early reflections of RIRs. The image source
method is used to synthesise RIRs at the microphone positions as well as at the centre of the array.
From the array RIRs, the RSAO parameters are estimated and compared to the reference parameters
at the centre of the array. A performance comparison among the five arrays is established
as well as the effect of a rigid spherical baffle for the circular and spherical arrays. The effects
of measurement uncertainties, such as microphone misplacement and sensor noise errors, are also
studied. The results show that planar arrays achieve the most accurate horizontal localisation
whereas the spherical arrays perform best in elevation. Arrays with smaller apertures achieve a
higher number of detected reflections, which becomes more significant for the smaller room with
higher reflection density.
Recent work on a reverberant spatial audio object (RSAO) encoded spatial room impulse responses
(RIRs) as object-based metadata which can be synthesized in an object-based renderer. Encoding
reverberation into metadata presents new opportunities for end users to interact with and personalize
reverberant content. The RSAO models an RIR as a set of early reflections together with a late
reverberation filter. Previous work to encode the RSAO parameters was based on recordings made
with a dense array of omnidirectional microphones. This paper describes RSAO parameterization from
first-order Ambisonic (B-Format) RIRs, making the RSAO compatible with existing spatial reverb
libraries. The object-based implementation achieves reverberation time, early decay time, clarity and
interaural cross-correlation similar to direct Ambisonic rendering of 13 test RIRs.
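Two of the objective measures named above, reverberation time and early decay time, are commonly estimated from the Schroeder energy decay curve. The sketch below shows that standard procedure on a synthetic exponential decay. The test signal, sample rate, and fitting ranges (0 to -10 dB for early decay time, -5 to -35 dB for reverberation time) are assumptions for illustration, not necessarily the evaluation settings used in the paper.

```python
import numpy as np

def energy_decay_curve_db(rir):
    """Schroeder backward integration of the squared RIR, normalized to 0 dB at t = 0."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])

def decay_time(edc_db, fs, start_db, stop_db):
    """Fit a line to the decay curve between start_db and stop_db and
    extrapolate the fitted slope to a 60 dB decay."""
    idx = np.where((edc_db <= start_db) & (edc_db >= stop_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)
    return -60.0 / slope

# Synthetic decay envelope (an assumption): energy falls by 60 dB in true_rt seconds.
fs = 8000
true_rt = 0.5
t = np.arange(fs) / fs  # one second of samples
rir = np.exp(-3.0 * np.log(10.0) * t / true_rt)

edc = energy_decay_curve_db(rir)
edt = decay_time(edc, fs, 0.0, -10.0)   # early decay time
rt = decay_time(edc, fs, -5.0, -35.0)   # reverberation time from a 30 dB range
print("EDT (s):", edt, "RT (s):", rt)
```

For a purely exponential decay the two estimates coincide. In measured rooms they generally differ, which is one reason both are reported.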
This engineering brief reports on the production of 3 object-based audio drama scenes, commissioned as part of the S3A project. 3D reproduction and an object-based workflow were considered and implemented from the initial script commissioning through to the final mix of the scenes. The scenes are being made available as Broadcast Wave Format files containing all objects as separate tracks and all metadata necessary to render the scenes as an XML chunk in the header conforming to the Audio Definition Model specification (Recommendation ITU-R BS.2076). It is hoped that these scenes will find use in perceptual experiments and in the testing of 3D audio systems. The scenes are available via the following link: http://dx.doi.org/10.17866/rd.salford.3043921.
Pressure matching (PM) and planarity control (PC) methods can be used to reproduce local sound with a certain orientation at the listening zone, while suppressing the sound energy at the quiet zone. In this letter, regularized PM and PC, incorporating coarse error estimation, are introduced to increase the robustness in non-ideal reproduction scenarios. Facilitated by this, the interaction between regularization, robustness, (tuned) personal audio optimization and local directional performance is explored. Simulations show that under certain conditions, PC and weighted PM achieve comparable performance, while PC is more robust to a poorly selected regularization parameter.
Personal audio systems generate a local sound field for a listener while attenuating the sound energy at pre-defined quiet zones. Their performance can be sensitive to errors in the acoustic transfer functions between the sources and the zones. In this paper, we model the acoustic transfer functions as a superposition of multipoles with a term to describe errors in the actual gain and phase. We then propose a design framework for robust reproduction, incorporating additional prior knowledge about the error distribution where available. We combine acoustic contrast control with worst-case and probability-model optimization, exploiting limited knowledge of the error distribution. Monte-Carlo simulations over 10000 test cases show that the method increases system robustness when errors are present in the assumed transfer functions.
Coleman Philip, Franck A, Francombe Jon, Liu Qingju, de Campos Teofilo, Hughes R, Menzies D, Simón Gálvez M, Tang Y, Woodcock J, Jackson Philip, Melchior F, Pike C, Fazi F, Cox T, Hilton Adrian (2018) An Audio-Visual System for Object-Based Audio: From Recording to Listening, IEEE Transactions on Multimedia 20 (8) pp. 1919-1931
Object-based audio is an emerging representation
for audio content, where content is represented in a reproduction-format-agnostic
way and thus produced once for consumption on
many different kinds of devices. This affords new opportunities
for immersive, personalized, and interactive listening experiences.
This article introduces an end-to-end object-based spatial audio
pipeline, from sound recording to listening. A high-level
system architecture is proposed, which includes novel audio-visual
interfaces to support object-based capture and listener-tracked
rendering, and incorporates a proposed component for
objectification, i.e., recording content directly into an object-based
form. Text-based and extensible metadata enable communication
between the system components. An open architecture for object
rendering is also proposed.
The system's capabilities are evaluated in two parts. First,
listener-tracked reproduction of metadata automatically estimated
from two moving talkers is evaluated using an objective
binaural localization model. Second, object-based scene capture
with audio extracted using blind source separation (to remix
between two talkers) and beamforming (to remix a recording of
a jazz group), is evaluated with perceptually-motivated objective
and subjective experiments. These experiments demonstrate that
the novel components of the system add capabilities beyond
the state of the art. Finally, we discuss challenges and future
perspectives for object-based audio workflows.
Coleman Philip, Franck Andreas, Francombe Jon, Liu Qingju, de Campos Teofilo, Hughes Richard, Menzies Dylan, Simón Gálvez Marcos, Tang Yan, Woodcock James, Melchior Frank, Pike Chris, Fazi Filippo, Cox Trevor, Hilton Adrian, Jackson Philip (2018) S3A Audio-Visual System for Object-Based Audio,
University of Surrey
Coleman Philip, Liu Qingju, Francombe Jon, Jackson Philip (2018) Perceptual evaluation of blind source separation in object-based audio production, Latent Variable Analysis and Signal Separation - 14th International Conference, LVA/ICA 2018, Guildford, UK, July 2–5, 2018, Proceedings pp. 558-567
Object-based audio has the potential to enable multimedia content to be tailored to individual
listeners and their reproduction equipment. In general, object-based production assumes that the
objects (the assets comprising the scene) are free of noise and interference. However, there are
many applications in which signal separation could be useful to an object-based audio workflow,
e.g., extracting individual objects from channel-based recordings or legacy content, or recording
a sound scene with a single microphone array. This paper describes the application and evaluation
of blind source separation (BSS) for sound recording in a hybrid channel-based and object-based
workflow, in which BSS-estimated objects are mixed with the original stereo recording. A subjective
experiment was conducted using simultaneously spoken speech recorded with omnidirectional
microphones in a reverberant room. Listeners mixed a BSS-extracted speech object into the scene to
make the quieter talker clearer, while retaining acceptable audio quality, compared to the raw
stereo recording. Objective evaluations show that the relative short-term objective intelligibility
and speech quality scores increase using BSS. Further objective evaluations are used to discuss the
influence of the BSS method on the remixing scenario; the scenario shown by human listeners to be
useful in object-based audio is shown to be a worse-case scenario.
In this paper, we propose an iterative deep neural network
(DNN)-based binaural source separation scheme, for recovering
two concurrent speech signals in a room environment.
Besides the commonly-used spectral features, the DNN also
takes non-linearly wrapped binaural spatial features as input,
which are refined iteratively using parameters estimated from
the DNN output via a feedback loop. Different DNN structures
have been tested, including a classic multilayer perceptron
regression architecture as well as a new hybrid network
with both convolutional and densely-connected layers. Objective
evaluations in terms of PESQ and STOI showed consistent
improvement over baseline methods using traditional
binaural features, especially when the hybrid DNN architecture
was employed. In addition, our proposed scheme is robust
to mismatches between the training and testing data.
We present a novel pipeline to estimate reverberant
spatial audio object (RSAO) parameters given room
impulse responses (RIRs) recorded by ad-hoc microphone
arrangements. The proposed pipeline performs
three tasks: direct-to-reverberant-ratio (DRR) estimation;
microphone localization; RSAO parametrization.
RIRs recorded at Bridgewater Hall by microphones
arranged for a BBC Philharmonic Orchestra performance
were parametrized. Objective measures of
the rendered RSAO reverberation characteristics were
evaluated and compared with reverberation recorded
by a Soundfield microphone. Alongside informal listening
tests, the results confirmed that the rendered
RSAO gave a plausible reproduction of the hall, comparable
to the measured response. The objectification
of the reverb from in-situ RIR measurements unlocks
customization and personalization of the experience
for different audio systems, user preferences and playback
Jackson Philip, Plumbley Mark D, Wang Wenwu, Brookes Tim, Coleman Philip, Mason Russell, Frohlich David, Bonina Carla, Plans David (2017) Signal Processing, Psychoacoustic Engineering and Digital Worlds: Interdisciplinary Audio Research at the University of Surrey,
At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.
Woodcock James, Francombe Jon, Franck Andreas, Coleman Philip, Hughes Richard, Kim Hansung, Liu Qingju, Menzies Dylan, Simón Gálvez Marcos F, Tang Yan, Brookes Tim, Davies William J, Fazenda Bruno M, Mason Russell, Cox Trevor J, Fazi Filippo Maria, Jackson Philip, Pike Chris, Hilton Adrian (2018) A Framework for Intelligent Metadata Adaptation in Object-Based Audio, AES E-Library pp. P11-3
Audio Engineering Society
Object-based audio can be used to customize, personalize, and optimize audio reproduction depending on the specific listening scenario. To investigate and exploit the benefits of object-based audio, a framework for intelligent metadata adaptation was developed. The framework uses detailed semantic metadata that describes the audio objects, the loudspeakers, and the room. It features an extensible software tool for real-time metadata adaptation that can incorporate knowledge derived from perceptual tests and/or feedback from perceptual meters to drive adaptation and facilitate optimal rendering. One use case for the system is demonstrated through a rule-set (derived from perceptual tests with experienced mix engineers) for automatic adaptation of object levels and positions when rendering 3D content to two- and five-channel systems.
Media Device Orchestration (MDO) makes use of interconnected devices to augment a reproduction system, and
could be used to deliver more immersive audio experiences to domestic audiences. To investigate optimal rendering
on an MDO-based system, stimuli were created via: 1) object-based audio (OBA) mixes undertaken in a reference
listening room; and 2) up to 13 rendered versions of these employing a range of installed and ad-hoc loudspeakers
with varying cost, quality and position. The programme items include audio-visual material (short film trailer
and big band performance) and audio-only material (radio panel show, pop track, football match, and orchestral
performance). The object-based programme items and alternate MDO configurations are made available for testing
and demonstrating OBA systems.
Homes contain a plethora of devices for audio-visual content consumption, which intelligent reproduction systems
can exploit to give the best possible experience. To investigate media device ownership in the home, media
service-types usage and solitary versus group audio/audio-visual media consumption, a survey of UK households
with 1102 respondents was undertaken. The results suggest that there is already significant ownership of wireless
and smart loudspeakers, as well as other interconnected devices containing loudspeakers such as smartphones and
tablets. Questions on group media consumption suggest that the majority of listeners spend more time consuming
media with others than alone, demonstrating an opportunity for systems which can adapt to varying audience
requirements within the same environment.
Personal audio generates sound zones in a shared space to provide private and personalized listening experiences with minimized interference between consumers. Regularization has been commonly used to increase the robustness of such systems against potential perturbations in the sound reproduction. However, the performance is limited by the system geometry such as the number and location of the loudspeakers and controlled zones. This paper proposes a geometry optimization method to find the most geometrically robust approach for personal audio amongst all available candidate system placements. The proposed method aims to approach the most "natural" sound reproduction so that the solo control of the listening zone coincidently accompanies the preferred quiet zone. Being formulated in the SVD-based modal domain, the method is demonstrated by applications in three typical personal audio optimizations, i.e., the acoustic contrast control, the pressure matching, and the planarity control. Simulation results show that the proposed method can obtain the system geometry with better avoidance of "occlusion," improved robustness to regularization, and improved broadband equalization.
Frequency-invariant beamformers are useful for spatial audio capture since their attenuation of sources outside
the look direction is consistent across frequency. In particular, the least-squares beamformer (LSB) approximates
arbitrary frequency-invariant beampatterns with generic microphone configurations. This paper investigates the
effects of array geometry, directivity order and regularization for robust hypercardioid synthesis up to 15th order
with the LSB, using three 2D 32-microphone array designs (rectangular grid, open circular, and circular with
cylindrical baffle). While the directivity increases with order, the frequency range is inversely proportional to the
order and is widest for the cylindrical array. Regularization results in broadening of the mainlobe and reduced
on-axis response at low frequencies. The PEASS toolkit was used to evaluate perceptually beamformed speech
In our information-overloaded daily lives, unwanted sounds create confusion, disruption and fatigue in what we do and experience. Taking control of your own sound environment, you can design what information to hear and how. Providing personalised sound to different people over loudspeakers enables communication, human connection and social activity in a shared space, meanwhile addressing the individuals' needs. Recent developments in object-based audio, robust sound zoning algorithms, computer vision, device synchronisation and electronic hardware facilitate personal control of immersive and interactive reproduction techniques. Accordingly, the creative sector is moving towards more demand for personalisation and personalisable content.
This tutorial offers participants a novel and timely introduction to the increasingly valuable capability to personalise sound over loudspeakers, alongside resources for the audio signal processing community. Presenting the science behind personalising sound technologies and providing insights for making sound zones in practice, we hope to create better listening experiences. The tutorial attempts a holistic exposition of techniques for producing personal sound over loudspeakers. It incorporates a practical step-by-step guide to digital filter design for real-world multizone sound reproduction and relates various approaches to one another thereby enabling comparison of the listener benefits.
Object-based audio promises format-agnostic reproduction and extensive personalization of
spatial audio content. However, in practical listening scenarios, such as in consumer audio,
ideal reproduction is typically not possible. To maximize the quality of listening experience,
a different approach is required, for example modifications of metadata to adjust for the
reproduction layout or personalization choices. In this paper we propose a novel system architecture
for semantically informed rendering (SIR), that combines object audio rendering with
high-level processing of object metadata. In many cases, this processing uses novel, advanced
metadata describing the objects to optimally adjust the audio scene to the reproduction system
or listener preferences. The proposed system is evaluated with several adaptation strategies,
including semantically motivated downmix to layouts with few loudspeakers, manipulation
of perceptual attributes, perceptual reverberation compensation, and orchestration of mobile
devices for immersive reproduction. These examples demonstrate how SIR can significantly
improve the media experience and provide advanced personalization controls, for example by
maintaining smooth object trajectories on systems with few loudspeakers, or providing personalized
envelopment levels. An example implementation of the proposed system architecture is
described and provided as an open, extensible software framework that combines object-based
audio rendering and high-level processing of advanced object metadata.
Microphone arrays can capture a sound scene and can be combined with signal processing to spatially filter or beamform the scene to extract the source of interest by suppressing unwanted sounds.
Microphone array beamforming has been widely used for speech enhancement, giving rise to a vast number of beamforming methods to optimally suppress interfering sounds. However, the opportunities of these systems in broadcast and consumer audio recording have not been investigated, where wideband capture is a requirement. In this case, the microphone array design plays a significant role, yet despite the various designs from the literature, it is not clear which geometry provides the best performance under a range of criteria relevant for these applications. Moreover, the interactions between the array geometry, the beamformer and other design parameters and their impact on both physical and perceptual quality of extracted audio sources have not been established.
The main contribution of this thesis is to determine the uniform microphone array design that maximises the quality of extracted audio sources (or objects) from horizontal sound scenes, since most sound scenes have much larger variation in azimuth than elevation. Both physical and perceptual performance evaluations are conducted with a range of microphone geometries and beamforming methods, showing that baffled circular arrays outperform alternative geometries both objectively (in terms of frequency range, spatial resolution, directivity and robustness) and perceptually (based on interference suppression and quality of target and overall sounds). New insights into the interactions between array geometries and beamformers are provided. Moreover, a subjective evaluation of beamforming methods is undertaken, showing the benefits of the on-axis distortionless response in combination with very high directivity from the superdirective beamformer, particularly for wideband signals.
In addition to the array geometry, the effects of directivity order and regularisation are further investigated to synthesise frequency-invariant directional responses with the least-squares beamformer. The results exhibit the trade-offs between directivity and robustness with regularisation and between directivity and frequency range with directivity order. Baffled circular arrays perform best consistently for different orders and regularisation parameters. Furthermore, an optimal regularisation parameter is derived that minimises the error between the target and synthesised responses in the presence of manifold errors, outperforming constant robustness constraints particularly for gain and positioning errors whose optimal regularised responses are frequency dependent.
The combination of simulation and perceptual results presented in this thesis represents a significant addition to the beamforming literature, potentially influencing the design of future compact microphone arrays.