I am interested in what you can do with acoustical signals, including speech, music and the everyday sounds all around us. Through research on projects such as Nephthys, Columbo, BALTHASAR, DANSA, SAVEE, DynamicFaces, QESTRAL, POSZ, UDRC2 and S3A, I have contributed to active noise control for aircraft, speech aero-acoustics, source separation and articulatory models for automatic speech recognition, audio-visual emotion classification and visual speech synthesis, including new techniques for spatial audio and personal sound.
I joined CVSSP in 2002 after a UK postdoctoral fellowship at the University of Birmingham, with a PhD in Electronics & Computer Science from the University of Southampton (2000) and an MA from Cambridge University Engineering Department (1997). I now have over 100 journal, patent, conference and book publications (Google h-index=12) and serve as associate editor for Computer Speech & Language (Elsevier), and as reviewer for the Journal of the Acoustical Society of America, IEEE/ACM Transactions on Audio, Speech & Language Processing, IEEE Signal Processing Letters, InterSpeech and ICASSP.
Further details can be found on my personal web page.
Virtual Reality (VR) systems have been intensely explored, with several research communities investigating the different modalities involved. Regarding the audio modality, one of the main issues is the generation of sound that is perceptually coherent with the visual reproduction. Here, we propose a pipeline for creating plausible interactive reverb using visual information: first, we characterize real environment acoustics given a pair of spherical cameras; then, we reproduce reverberant spatial sound, by using the estimated acoustics, within a VR scene. The evaluation is made by extracting the room impulse responses (RIRs) of four virtually rendered rooms. Results show agreement, in terms of objective metrics, between the synthesized acoustics and the ones calculated from RIRs recorded within the respective real rooms.
In order to maximise the immersion in VR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this work, we propose a pipeline to create plausible interactive audio from a pair of 360 degree cameras.
Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as “image-source reversion”, which localizes the image source before finding the reflector position, and “direct localization”, which localizes the reflector without intermediate steps. We present five new contributions. First, an onset detector, called the clustered dynamic programming projected phase-slope algorithm, is proposed to automatically extract the time of arrival for early reflections within the RIRs of a compact microphone array. Second, we propose an image-source reversion method that uses the RIRs from a single loudspeaker. It is constructed by combining an image source locator (the image source direction and range (ISDAR) algorithm), and a reflector locator (using the loudspeaker-image bisection (LIB) algorithm). Third, two variants of it, exploiting multiple loudspeakers, are proposed. Fourth, we present a direct localization method, the ellipsoid tangent sample consensus (ETSAC), exploiting ellipsoid properties to localize the reflector. Finally, systematic experiments on simulated and measured RIRs are presented, comparing the proposed methods with the state-of-the-art. ETSAC generates errors lower than the alternative methods compared through our datasets. Nevertheless, the ISDAR-LIB combination performs well and has a run time 200 times faster than ETSAC.
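The image-source reversion idea can be illustrated with its underlying geometry: for a specular reflection from a planar surface, the reflector lies on the plane that perpendicularly bisects the line from the loudspeaker to its image source. The sketch below shows only that geometric relation, assuming the image-source position has already been estimated; it is not the published ISDAR or LIB algorithm, and the function names are hypothetical.

```python
import math

def bisecting_plane(loudspeaker, image_source):
    """Return (point, unit_normal) of the plane midway between two 3D points."""
    diff = [i - s for s, i in zip(loudspeaker, image_source)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("image source coincides with the loudspeaker")
    midpoint = tuple((s + i) / 2.0 for s, i in zip(loudspeaker, image_source))
    normal = tuple(d / norm for d in diff)
    return midpoint, normal

def mirror(point, plane_point, normal):
    """Reflect a point across the plane; useful for checking an estimate."""
    dist = sum((p - q) * n for p, q, n in zip(point, plane_point, normal))
    return tuple(p - 2.0 * dist * n for p, n in zip(point, normal))

# Assumed example: a loudspeaker and its image in the wall x = 0.
src = (1.0, 2.0, 1.5)
img = (-1.0, 2.0, 1.5)
plane_point, plane_normal = bisecting_plane(src, img)
mirrored = mirror(src, plane_point, plane_normal)
```

Mirroring the loudspeaker across the recovered plane should return the image-source position, which gives a quick consistency check on any estimate.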
Recent work on a reverberant spatial audio object (RSAO) encoded spatial room impulse responses (RIRs) as object-based metadata which can be synthesized in an object-based renderer. Encoding reverberation into metadata presents new opportunities for end users to interact with and personalize reverberant content. The RSAO models an RIR as a set of early reflections together with a late reverberation filter. Previous work to encode the RSAO parameters was based on recordings made with a dense array of omnidirectional microphones. This paper describes RSAO parameterization from first-order Ambisonic (B-Format) RIRs, making the RSAO compatible with existing spatial reverb libraries. The object-based implementation achieves reverberation time, early decay time, clarity and interaural cross-correlation similar to direct Ambisonic rendering of 13 test RIRs.
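Three of the objective metrics named above (reverberation time, early decay time and clarity) follow standard room-acoustics definitions. The sketch below is a minimal illustration, assuming a single-channel RIR that starts at the direct sound, a synthetic exponential decay as test input, and a simple two-point slope estimate rather than the regression fit normally used in practice. The function names are hypothetical and do not come from the paper.

```python
import math

def energy_decay_curve_db(rir):
    """Schroeder backward integration, normalised to 0 dB at the first sample."""
    energy = [x * x for x in rir]
    edc = [0.0] * len(energy)
    tail = 0.0
    for n in range(len(energy) - 1, -1, -1):
        tail += energy[n]
        edc[n] = tail
    total = edc[0]
    return [10.0 * math.log10(e / total) if e > 0.0 else float("-inf") for e in edc]

def decay_time(edc_db, fs, start_db, stop_db):
    """Extrapolate the time for a 60 dB decay from the slope between two levels."""
    i1 = next(i for i, v in enumerate(edc_db) if v <= start_db)
    i2 = next(i for i, v in enumerate(edc_db) if v <= stop_db)
    slope = (edc_db[i2] - edc_db[i1]) / ((i2 - i1) / fs)  # dB per second
    return -60.0 / slope

def clarity_db(rir, fs, split_ms=50.0):
    """Early-to-late energy ratio in dB (C50 when split_ms is 50)."""
    split = int(round(fs * split_ms / 1000.0))
    early = sum(x * x for x in rir[:split])
    late = sum(x * x for x in rir[split:])
    return 10.0 * math.log10(early / late)

# Assumed test signal: an ideal exponential decay with a 0.5 s reverberation time.
fs = 8000
rt_true = 0.5
a = 3.0 * math.log(10.0) / rt_true  # amplitude decay rate giving -60 dB at rt_true
rir = [math.exp(-a * n / fs) for n in range(fs)]

edc = energy_decay_curve_db(rir)
rt60 = decay_time(edc, fs, -5.0, -35.0)  # T30-style estimate
edt = decay_time(edc, fs, 0.0, -10.0)    # early decay time
c50 = clarity_db(rir, fs)
```

For a purely exponential decay the two decay-time estimates coincide; in measured rooms they generally differ, which is why both are reported.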
Object-based audio is gaining momentum as a means for future audio content to be more immersive, interactive, and accessible. Recent standardization developments make recommendations for object formats; however, the capture, production and reproduction of reverberation remain an open issue. In this paper, parametric approaches for capturing, representing, editing, and rendering reverberation over a 3D spatial audio system are reviewed. A framework is proposed for a Reverberant Spatial Audio Object (RSAO), which synthesizes reverberation inside an audio object renderer. An implementation example of an object scheme utilising the RSAO framework is provided, and supported with listening test results, showing that: the approach correctly retains the sense of room size compared to a convolved reference; editing RSAO parameters can alter the perceived room size and source distance; and, format-agnostic rendering can be exploited to alter listener envelopment.
We propose a new method for source separation by synthesizing the source from a speech mixture corrupted by various environmental noise. Unlike traditional source separation methods which estimate the source from the mixture as a replica of the original source (e.g. by solving an inverse problem), our proposed method is a synthesis-based approach which aims to generate a new signal (i.e. “fake” source) that sounds similar to the original source. The proposed system has an encoder-decoder topology, where the encoder predicts intermediate-level features from the mixture, i.e. Mel-spectrum of the target source, using a hybrid recurrent and hourglass network, while the decoder is a state-of-the-art WaveNet speech synthesis network conditioned on the Mel-spectrum, which directly generates time-domain samples of the sources. Both objective and subjective evaluations were performed on the synthesized sources, and show great advantages of our proposed method for high-quality speech source separation and generation.
Reproduction of multiple sound zones, in which personal audio programs may be consumed without the need for headphones, is an active topic in acoustical signal processing. Many approaches to sound zone reproduction do not consider control of the bright zone phase, which may lead to self-cancellation problems if the loudspeakers surround the zones. Conversely, control of the phase in a least-squares sense comes at a cost of decreased level difference between the zones and frequency range of cancellation. Single-zone approaches have considered plane wave reproduction by focusing the sound energy into a point in the wavenumber domain. In this article, a planar bright zone is reproduced via planarity control, which constrains the bright zone energy to impinge from a narrow range of angles via projection into a spatial domain. Simulation results using a circular array surrounding two zones show the method to produce superior contrast to the least-squares approach, and superior planarity to the contrast maximization approach. Practical performance measurements obtained in an acoustically treated room verify the conclusions drawn under free-field conditions.
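The contrast figure used to compare these methods is commonly defined as the ratio of mean squared pressure in the bright zone to that in the dark zone, expressed in decibels. A minimal sketch under that assumption, taking complex pressures sampled at microphone positions in each zone (the function name and the example values are illustrative only):

```python
import math

def acoustic_contrast_db(bright_pressures, dark_pressures):
    """Ratio of mean squared pressure, bright zone over dark zone, in dB."""
    bright = sum(abs(p) ** 2 for p in bright_pressures) / len(bright_pressures)
    dark = sum(abs(p) ** 2 for p in dark_pressures) / len(dark_pressures)
    return 10.0 * math.log10(bright / dark)

# Assumed example: dark-zone pressures one tenth the magnitude of the bright zone.
contrast = acoustic_contrast_db([1.0 + 0j, 0.0 + 1.0j], [0.1 + 0j, 0.0 - 0.1j])
```

Because only magnitudes enter the ratio, this metric says nothing about the phase or direction of the bright-zone field, which is the gap that planarity control addresses.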
Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audiovisual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates. Monaural audio may successfully detect the presence of speech activity but fails in localizing the speaker due to the lack of spatial cues. Our solution extends the audio front-end using a microphone array. We train an audio convolutional neural network (CNN) in combination with beamforming techniques to regress the speaker's horizontal position directly in the video frames. We propose to generate weak labels using a pre-trained active speaker detector on pre-extracted face tracks. Our pipeline embraces the "student-teacher" paradigm, where a trained "teacher" network is used to produce pseudo-labels visually. The "student" network is an audio network trained to generate the same results. At inference, the student network can independently localize the speaker in the visual frames directly from the audio input. Experimental results on newly collected data prove that our approach significantly outperforms a variety of other baselines as well as the teacher network itself. It results in an excellent speech activity detector too.
Recent work into 3D audio reproduction has considered the definition of a set of parameters to encode reverberation into an object-based audio scene. The reverberant spatial audio object (RSAO) describes the reverberation in terms of a set of localised, delayed and filtered (early) reflections, together with a late energy envelope modelling the diffuse late decay. The planarity metric, originally developed to evaluate the directionality of reproduced sound fields, is used to analyse a set of multichannel room impulse responses (RIRs) recorded at a microphone array. Planarity describes the spatial compactness of incident sound energy, which tends to decrease as the reflection density and diffuseness of the room response develop over time. Accordingly, planarity complements intensity-based diffuseness estimators, which quantify the degree to which the sound field at a discrete frequency within a particular time window is due to an impinging coherent plane wave. In this paper, we use planarity as a tool to analyse the sound field in relation to the RSAO parameters. Specifically, we use planarity to estimate two important properties of the sound field. First, as high planarity identifies the most localised reflections along the RIR, we estimate the most planar portions of the RIR, corresponding to the RSAO early reflection model and increasing the likelihood of detecting prominent specular reflections. Second, as diffuse sound fields give a low planarity score, we investigate planarity for data-based mixing time estimation. Results show that planarity estimates on measured multichannel RIR datasets represent a useful tool for room acoustics analysis and RSAO parameterisation.
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.
A binaural sound source localization method is proposed that uses interaural and spectral cues for localization of sound sources with any direction of arrival on the full-sphere. The method is designed to be robust to the presence of reverberation, additive noise and different types of sounds. The method uses the interaural phase difference (IPD) for lateral angle localization, then interaural and spectral cues for polar angle localization. The method applies different weighting to the interaural and spectral cues depending on the estimated lateral angle. In particular, only the spectral cues are used for sound sources near or on the median plane.
In our information-overloaded daily lives, unwanted sounds create confusion, disruption and fatigue in what we do and experience. Taking control of your own sound environment, you can design what information to hear and how. Providing personalised sound to different people over loudspeakers enables communication, human connection and social activity in a shared space, meanwhile addressing the individuals’ needs. Recent developments in object-based audio, robust sound zoning algorithms, computer vision, device synchronisation and electronic hardware facilitate personal control of immersive and interactive reproduction techniques. Accordingly, the creative sector is moving towards more demand for personalisation and personalisable content. This tutorial offers participants a novel and timely introduction to the increasingly valuable capability to personalise sound over loudspeakers, alongside resources for the audio signal processing community. Presenting the science behind personalising sound technologies and providing insights for making sound zones in practice, we hope to create better listening experiences. The tutorial attempts a holistic exposition of techniques for producing personal sound over loudspeakers. It incorporates a practical step-by-step guide to digital filter design for real-world multizone sound reproduction and relates various approaches to one another thereby enabling comparison of the listener benefits.
Object-based audio is an emerging representation for audio content, where content is represented in a reproduction-format-agnostic way and thus produced once for consumption on many different kinds of devices. This affords new opportunities for immersive, personalized, and interactive listening experiences. This article introduces an end-to-end object-based spatial audio pipeline, from sound recording to listening. A high-level system architecture is proposed, which includes novel audiovisual interfaces to support object-based capture and listener-tracked rendering, and incorporates a proposed component for objectification, i.e., recording content directly into an object-based form. Text-based and extensible metadata enable communication between the system components. An open architecture for object rendering is also proposed. The system’s capabilities are evaluated in two parts. First, listener-tracked reproduction of metadata automatically estimated from two moving talkers is evaluated using an objective binaural localization model. Second, object-based scene capture with audio extracted using blind source separation (to remix between two talkers) and beamforming (to remix a recording of a jazz group), is evaluated with perceptually-motivated objective and subjective experiments. These experiments demonstrate that the novel components of the system add capabilities beyond the state of the art. Finally, we discuss challenges and future perspectives for object-based audio workflows.
Spatial audio processes (SAPs) commonly encountered in consumer audio reproduction systems are known to generate a range of impairments to spatial quality. Two listening tests (involving two listening positions, six 5-channel audio recordings, and 48 SAPs) indicate that the degree of quality degradation is determined largely by the nature of the SAP but that the effect of a particular SAP can depend on program material and on listening position. Combining off-center listening with another SAP can reduce spatial quality significantly compared to auditioning that SAP centrally. These findings, and the associated listening test data, can guide the development of an artificial-listener-based spatial audio quality evaluation system.
In this paper, a novel probabilistic Bayesian tracking scheme is proposed and applied to bimodal measurements consisting of tracking results from the depth sensor and audio recordings collected using binaural microphones. We use random finite sets to cope with a varying number of tracking targets. A measurement-driven birth process is integrated to quickly localize any emerging person. A new bimodal fusion method that prioritizes the most confident modality is employed. The approach was tested on real room recordings and experimental results show that the proposed combination of audio and depth outperforms individual modalities, particularly when there are multiple people talking simultaneously and when occlusions are frequent.
Planarity panning (PP) and planarity control (PC) have previously been shown to be efficient methods for focusing directional sound energy into listening zones. In this paper, we consider sound field control for two listeners. First, PP is extended to create spatial audio for two listeners consuming the same spatial audio content. Then, PC is used to create highly directional sound and cancel interfering audio. Simulation results compare PP and PC against pressure matching (PM) solutions. For multiple listeners listening to the same content, PP creates directional sound at lower effort than the PM counterpart. When listeners consume different audio, PC produces greater acoustic contrast than PM, with excellent directional control except for frequencies where grating lobes generate problematic interference patterns.
In an object-based spatial audio system, positions of the audio objects (e.g. speakers/talkers or voices) presented in the sound scene are required as important metadata attributes for object acquisition and reproduction. Binaural microphones are often used as a physical device to mimic human hearing and to monitor and analyse the scene, including localisation and tracking of multiple speakers. The binaural audio tracker, however, is usually prone to the errors caused by room reverberation and background noise. To address this limitation, we present a multimodal tracking method by fusing the binaural audio with depth information (from a depth sensor, e.g., Kinect). More specifically, the PHD filtering framework is first applied to the depth stream, and a novel clutter intensity model is proposed to improve the robustness of the PHD filter when an object is occluded either by other objects or due to the limited field of view of the depth sensor. To compensate for mis-detections in the depth stream, a novel gap filling technique is presented to map audio azimuths obtained from the binaural audio tracker to 3D positions, using speaker-dependent spatial constraints learned from the depth stream. With our proposed method, both the errors in the binaural tracker and the mis-detections in the depth tracker can be significantly reduced. Real-room recordings are used to show the improved performance of the proposed method in removing outliers and reducing mis-detections.
Deep neural networks (DNN) have recently been shown to give state-of-the-art performance in monaural speech enhancement. However in the DNN training process, the perceptual difference between different components of the DNN output is not fully exploited, where equal importance is often assumed. To address this limitation, we have proposed a new perceptually-weighted objective function within a feedforward DNN framework, aiming to minimize the perceptual difference between the enhanced speech and the target speech. A perceptual weight is integrated into the proposed objective function, and has been tested on two types of output features: spectra and ideal ratio masks. Objective evaluations for both speech quality and speech intelligibility have been performed. Integration of our perceptual weight shows consistent improvement on several noise levels and a variety of different noise types.
Frequency-invariant beamformers are useful for spatial audio capture since their attenuation of sources outside the look direction is consistent across frequency. In particular, the least-squares beamformer (LSB) approximates arbitrary frequency-invariant beampatterns with generic microphone configurations. This paper investigates the effects of array geometry, directivity order and regularization for robust hypercardioid synthesis up to 15th order with the LSB, using three 2D 32-microphone array designs (rectangular grid, open circular, and circular with cylindrical baffle). While the directivity increases with order, the frequency range is inversely proportional to the order and is widest for the cylindrical array. Regularization results in broadening of the mainlobe and reduced on-axis response at low frequencies. The PEASS toolkit was used to evaluate perceptually beamformed speech signals.
Object-based audio can be used to customize, personalize, and optimize audio reproduction depending on the specific listening scenario. To investigate and exploit the benefits of object-based audio, a framework for intelligent metadata adaptation was developed. The framework uses detailed semantic metadata that describes the audio objects, the loudspeakers, and the room. It features an extensible software tool for real-time metadata adaptation that can incorporate knowledge derived from perceptual tests and/or feedback from perceptual meters to drive adaptation and facilitate optimal rendering. One use case for the system is demonstrated through a rule-set (derived from perceptual tests with experienced mix engineers) for automatic adaptation of object levels and positions when rendering 3D content to two- and five-channel systems.
We propose a novel method for full-sphere binaural sound source localization that is designed to be robust to real world recording conditions. A mask is proposed that is designed to remove diffuse noise and early room reflections. The method makes use of the interaural phase difference (IPD) for lateral angle localization and spectral cues for polar angle localization. The method is tested using different HRTF datasets to generate the test data and training data. The method is also tested with the presence of additive noise and reverberation. The method outperforms the state of the art binaural localization methods for most testing conditions.
Recent attention to the problem of controlling multiple loudspeakers to create sound zones has been directed towards practical issues arising from system robustness concerns. In this study, the effects of regularization are analyzed for three representative sound zoning methods. Regularization governs the control effort required to drive the loudspeaker array, via a constraint in each optimization cost function. Simulations show that regularization has a significant effect on the sound zone performance, both under ideal anechoic conditions and when systematic errors are introduced between calculation of the source weights and their application to the system. Results are obtained for speed of sound variations and loudspeaker positioning errors with respect to the source weights calculated. Judicious selection of the regularization parameter is shown to be a primary concern for sound zone system designers - the acoustic contrast can be increased by up to 50dB with proper regularization in the presence of errors. A frequency-dependent minimum regularization parameter is determined based on the conditioning of the matrix inverse. The regularization parameter can be further increased to improve performance depending on the control effort constraints, expected magnitude of errors, and desired sound field properties of the system. © 2013 Acoustical Society of America.
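The trade-off between regularization and control effort can be seen in a small worked case. The sketch below assumes a Tikhonov-regularized pressure-matching cost, which gives source weights q = (G^H G + λI)^{-1} G^H d, and restricts the problem to two sources so the 2x2 inverse can be written in closed form. The transfer matrix G, the target d and the function names are invented for illustration; this is one representative formulation, not a reproduction of the three methods analyzed in the study.

```python
def regularized_weights(G, d, lam):
    """Solve (G^H G + lam*I) q = G^H d for a two-source array.

    G is a list of rows [g1, g2] (one row per microphone), d the target pressures.
    """
    a = sum(abs(r[0]) ** 2 for r in G) + lam          # (G^H G)[0][0] + lam
    c = sum(abs(r[1]) ** 2 for r in G) + lam          # (G^H G)[1][1] + lam
    b = sum(r[0].conjugate() * r[1] for r in G)       # (G^H G)[0][1]
    y0 = sum(r[0].conjugate() * t for r, t in zip(G, d))
    y1 = sum(r[1].conjugate() * t for r, t in zip(G, d))
    det = a * c - abs(b) ** 2
    q0 = (c * y0 - b * y1) / det
    q1 = (a * y1 - b.conjugate() * y0) / det
    return q0, q1

def control_effort(q):
    """Sum of squared source-weight magnitudes."""
    return sum(abs(w) ** 2 for w in q)

# Assumed, deliberately ill-conditioned example: one bright-zone microphone
# (target 1) and one dark-zone microphone (target 0) with nearly equal paths.
G = [[1.0 + 0j, 1.01 + 0j], [1.0 + 0j, 0.99 + 0j]]
d = [1.0 + 0j, 0.0 + 0j]
effort_light = control_effort(regularized_weights(G, d, 1e-6))
effort_heavy = control_effort(regularized_weights(G, d, 1e-1))
```

With very light regularization the solution relies on large, nearly cancelling source weights, which is exactly the regime where small errors in sound speed or loudspeaker position degrade the achieved contrast; increasing λ reduces the effort at the cost of matching accuracy.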
The process of understanding acoustic properties of environments is important for several applications, such as spatial audio, augmented reality and source separation. In this paper, multichannel room impulse responses are recorded and transformed into their direction of arrival (DOA)-time domain, by employing a superdirective beamformer. This domain can be represented as a 2D image. Hence, a novel image processing method is proposed to analyze the DOA-time domain, and estimate the reflection times of arrival and DOAs. The main acoustically reflective objects are then localized. Recent studies in acoustic reflector localization usually assume the room to be free from furniture. Here, by analyzing the scattered reflections, an algorithm is also proposed to binary classify reflectors into room boundaries and interior furniture. Experiments were conducted in four rooms. The classification algorithm showed high quality performance, also improving the localization accuracy, for non-static listener scenarios.
Humans are able to identify a large number of environmental sounds and categorise them according to high-level semantic categories, e.g. urban sounds or music. They are also capable of generalising from past experience to new sounds when applying these categories. In this paper we report on the creation of a data set that is structured according to the top-level of a taxonomy derived from human judgements and the design of an associated machine learning challenge, in which strong generalisation abilities are required to be successful. We introduce a baseline classification system, a deep convolutional network, which showed strong performance with an average accuracy on the evaluation data of 80.8%. The result is discussed in the light of two alternative explanations: An unlikely accidental category bias in the sound recordings or a more plausible true acoustic grounding of the high-level categories.
The ability to predict the acoustics of a room without acoustical measurements is a useful capability. The motivation here stems from spatial audio reproduction, where knowledge of the acoustics of a space could allow for more accurate reproduction of a captured environment, or for reproduction room compensation techniques to be applied. A cuboid-based room geometry estimation method using a spherical camera is proposed, assuming a room and objects inside can be represented as cuboids aligned to the main axes of the coordinate system. The estimated geometry is used to produce frequency-dependent acoustic predictions based on geometrical room modelling techniques. Results are compared to measurements through calculated reverberant spatial audio object parameters used for reverberation reproduction customized to the given loudspeaker set up.
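One common geometrical room modelling technique for a cuboid room is the image-source method, in which each wall reflection is represented by a mirrored copy of the source. The sketch below covers first-order reflections only, assuming an empty axis-aligned room with one corner at the origin, a sound speed of 343 m/s, and no frequency-dependent absorption. It illustrates the general technique rather than the specific predictor used in the paper, and all names and dimensions are hypothetical.

```python
import math

def first_order_images(source, room):
    """Mirror the source in each of the six walls of a cuboid [0,Lx]x[0,Ly]x[0,Lz]."""
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = list(source)
            img[axis] = 2.0 * wall - source[axis]
            images.append(tuple(img))
    return images

def arrival_times(source, receiver, room, c=343.0):
    """Sorted times of arrival (s) of the direct path and first-order reflections."""
    paths = [source] + first_order_images(source, room)
    return sorted(math.dist(p, receiver) / c for p in paths)

# Assumed example room of 5 m x 4 m x 3 m.
room = (5.0, 4.0, 3.0)
source = (1.0, 1.0, 1.0)
receiver = (3.0, 1.0, 1.0)
images = first_order_images(source, room)
times = arrival_times(source, receiver, room)
```

The resulting delays correspond to the early-reflection part of a room response; objects inside the room, modelled as additional cuboids, would add further reflecting surfaces and occlusions that this sketch ignores.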
In this paper, we propose a divide-and-conquer approach using two generative adversarial networks (GANs) to explore how a machine can draw colorful pictures (bird) using a small amount of training data. In our work, we simulate the procedure of an artist drawing a picture, where one begins with drawing objects’ contours and edges and then paints them different colors. We adopt two GAN models to process basic visual features including shape, texture and color. We use the first GAN model to generate object shape, and then paint the black and white image based on the knowledge learned using the second GAN model. We run our experiments on 600 color images. The experimental results show that the use of our approach can generate good quality synthetic images, comparable to real ones.
The ability to replicate a plane wave represents an essential element of spatial sound field reproduction. In sound field synthesis, the desired field is often formulated as a plane wave and the error minimized; for other sound field control methods, the energy density or energy ratio is maximized. In all cases and further to the reproduction error, it is informative to characterize how planar the resultant sound field is. This paper presents a method for quantifying a region's acoustic planarity by superdirective beamforming with an array of microphones, which analyzes the azimuthal distribution of impinging waves and hence derives the planarity. Estimates are obtained for a variety of simulated sound field types, tested with respect to array orientation, wavenumber, and number of microphones. A range of microphone configurations is examined. Results are compared with delay-and-sum beamforming, which is equivalent to spatial Fourier decomposition. The superdirective beamformer provides better characterization of sound fields and is effective with a moderate number of omni-directional microphones over a broad frequency range. Practical investigation of planarity estimation in real sound fields is needed to demonstrate its validity as a physical sound field evaluation measure.
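To make the idea concrete, the sketch below assumes planarity is computed as the energy-weighted mean projection of each look direction onto the direction of maximum energy, so that a single plane wave scores 1 and an evenly distributed or standing-wave field scores close to 0. The beamforming stage is omitted: the energies per azimuth are taken as given, and the function name and test distributions are illustrative assumptions.

```python
import math

def planarity(energies, azimuths_deg):
    """Energy-weighted projection onto the peak-energy direction (0 to 1 for typical fields)."""
    total = sum(energies)
    if total <= 0.0:
        raise ValueError("energy distribution must have positive total energy")
    peak = max(range(len(energies)), key=energies.__getitem__)
    peak_az = math.radians(azimuths_deg[peak])
    projected = sum(
        w * math.cos(math.radians(az) - peak_az)
        for w, az in zip(energies, azimuths_deg)
    )
    return projected / total

azimuths = list(range(0, 360, 10))

# Assumed test fields: one plane wave, an isotropic field, two opposing waves.
plane_wave = [1.0 if az == 40 else 0.0 for az in azimuths]
diffuse = [1.0 for _ in azimuths]
standing = [1.0 if az in (0, 180) else 0.0 for az in azimuths]

eta_plane = planarity(plane_wave, azimuths)
eta_diffuse = planarity(diffuse, azimuths)
eta_standing = planarity(standing, azimuths)
```

In practice the quality of the estimate depends on how sharply the beamformer resolves the energy distribution, which is why the choice between superdirective and delay-and-sum beamforming matters.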
Recent progress in Virtual Reality (VR) and Augmented Reality (AR) allows us to experience various VR/AR applications in our daily life. In order to maximise the immersiveness of the user in VR/AR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this paper, we propose a simple and efficient system to estimate room acoustics for plausible reproduction of spatial audio using 360° cameras for VR/AR applications. A pair of 360° images is used for room geometry and acoustic property estimation. A simplified 3D geometric model of the scene is estimated by depth estimation from captured images and semantic labelling using a convolutional neural network (CNN). The real environment acoustics are characterised by frequency-dependent acoustic predictions of the scene. Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties in the scene. The reconstructed scenes are rendered with synthesised spatial audio as VR/AR content. The results of estimated room geometry and simulated spatial audio are evaluated against the actual measurements and audio calculated from ground-truth Room Impulse Responses (RIRs) recorded in the rooms.
Studies on sound field control methods able to create independent listening zones in a single acoustic space have recently been undertaken due to the potential of such methods for various practical applications, such as individual audio streams in home entertainment. Existing solutions to the problem have been shown to be effective in creating high and low sound energy regions under anechoic conditions. Although some case studies in a reflective environment can also be found, the capabilities of sound zoning methods in rooms have not been fully explored. In this paper, the influence of low-order (early) reflections on the performance of key sound zone techniques is examined. Analytic considerations for small-scale systems reveal strong dependence of performance on parameters such as source positioning with respect to zone locations and room surfaces, as well as the parameters of the receiver configuration. These dependencies are further investigated through numerical simulation to determine system configurations which maximize the performance in terms of acoustic contrast and array control effort. Design rules for source and receiver positioning are suggested, for improved performance under a given set of constraints such as the number of available sources, zone locations and the direction of the dominant reflection. © 2013 Acoustical Society of America.
At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.
The challenge of installing and setting up dedicated spatial audio systems can make it difficult to deliver immersive listening experiences to the general public. However, the proliferation of smart mobile devices and the rise of the Internet of Things mean that there are increasing numbers of connected devices capable of producing audio in the home. "Media device orchestration" (MDO) is the concept of utilizing an ad hoc set of devices to deliver or augment a media experience. In this paper, the concept is evaluated by implementing MDO for augmented spatial audio reproduction using object-based audio with semantic metadata. A thematic analysis of positive and negative listener comments about the system revealed three main categories of response: perceptual, technical, and content-dependent aspects. MDO performed particularly well in terms of immersion/envelopment, but the quality of listening experience was partly dependent on loudspeaker quality and listener position. Suggestions for further development based on these categories are given.
Automatic and fast tagging of natural sounds in audio collections is a very challenging task due to wide acoustic variations, the large number of possible tags, and the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier, and augmenting the tags with additional relevant tags. Training the classifiers is implemented using marginal co-regularization, which draws the pair of classifiers into agreement by a joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task4 of DCASE2016. The results obtained show that marginal co-regularization outperforms the baseline GMM in both efficiency and effectiveness.
We present a novel pipeline to estimate reverberant spatial audio object (RSAO) parameters given room impulse responses (RIRs) recorded by ad-hoc microphone arrangements. The proposed pipeline performs three tasks: direct-to-reverberant-ratio (DRR) estimation; microphone localization; RSAO parametrization. RIRs recorded at Bridgewater Hall by microphones arranged for a BBC Philharmonic Orchestra performance were parametrized. Objective measures of the rendered RSAO reverberation characteristics were evaluated and compared with reverberation recorded by a Soundfield microphone. Alongside informal listening tests, the results confirmed that the rendered RSAO gave a plausible reproduction of the hall, comparable to the measured response. The objectification of the reverb from in-situ RIR measurements unlocks customization and personalization of the experience for different audio systems, user preferences and playback environments.
In applications such as virtual and augmented reality, a plausible and coherent audio-visual reproduction can be achieved by deeply understanding the reference scene acoustics. This requires knowledge of the scene geometry and related materials. In this paper, we present an audio-visual approach for acoustic scene understanding. We propose a novel material recognition algorithm that exploits information carried by acoustic signals. The acoustic absorption coefficients are selected as features. The training dataset was constructed by combining information available in the literature, and additional labeled data that we recorded in a small room having short reverberation time (RT60). Classic machine learning methods are used to validate the model, by employing data recorded in five rooms, having different sizes and RT60s. The estimated materials are utilized to label room boundaries, reconstructed by a vision-based method. Results show 89% and 80% agreement between the estimated and reference room volumes and materials, respectively.
Binaural recording technology offers an inexpensive, portable solution for spatial audio capture. In this paper, a full-sphere 2D localization method is proposed which utilizes the Model-Based Expectation-Maximization Source Separation and Localization system (MESSL). The localization model is trained using a full-sphere head related transfer function dataset and produces localization estimates by maximum-likelihood of frequency-dependent interaural parameters. The model’s robustness is assessed using matched and mismatched HRTF datasets between test and training data, with environmental sounds and speech. Results show that the majority of sounds are estimated correctly with the matched condition in low noise levels; for the mismatched condition, a ‘cone of confusion’ arises, albeit with effective estimation of lateral angles. Additionally, the results show a relationship between the spectral content of the test data and the performance of the proposed method.
Previously-obtained data, quantifying the degree of quality degradation resulting from a range of spatial audio processes (SAPs), can be used to build a regression model of perceived spatial audio quality in terms of previously developed spatially and timbrally relevant metrics. A generalizable model thus built, employing just five metrics and two principal components, performs well in its prediction of the quality of a range of program types degraded by a multitude of SAPs commonly encountered in consumer audio reproduction, auditioned at both central and off-center listening positions. Such a model can provide a correlation to listening test data of r = 0.89, with a root mean square error (RMSE) of 11%, making its performance comparable to that of previous audio quality models and making it a suitable core for an artificial-listener-based spatial audio quality evaluation system.
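The two figures of merit quoted above, the correlation r and the root mean square error, can be computed from paired model predictions and listening test scores. The sketch below is a minimal illustration using only the standard definitions; the function names and any data passed to them are hypothetical and are not drawn from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
```

If quality is scored on a 0 to 100 scale, an RMSE expressed in scale units can be read directly as a percentage of the scale; whether the paper uses that convention is not stated in the abstract.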
Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper we make contributions to audio tagging in two areas: acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multi-label classification for the expected tags, considering that only chunk (or utterance) level rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-Filter Banks (MFBs) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system can get a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task, while the EER of the first-prize system in this challenge is 0.17.
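For readers unfamiliar with the metric, the sketch below shows one simple way to approximate an equal error rate from per-chunk scores and binary labels, together with the relative reduction implied by the reported drop from 0.21 to 0.13. It is an illustrative threshold sweep under stated assumptions, not the DCASE 2016 evaluation code; the function names and inputs are hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a decision threshold over the observed scores.

    A chunk is tagged positive when its score is >= the threshold.
    labels holds 1 for tag present and 0 for tag absent.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    best_gap, eer = None, None
    for t in thresholds:
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fpr = false_pos / n_neg
        fnr = false_neg / n_pos
        gap = abs(fpr - fnr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2.0
    return eer


def relative_reduction(baseline, proposed):
    """Fractional reduction of an error rate relative to a baseline."""
    return (baseline - proposed) / baseline
```

Under this definition, moving from an EER of 0.21 to 0.13 corresponds to a relative reduction of roughly 38%.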
In this paper, we propose an iterative deep neural network (DNN)-based binaural source separation scheme, for recovering two concurrent speech signals in a room environment. Besides the commonly-used spectral features, the DNN also takes non-linearly wrapped binaural spatial features as input, which are refined iteratively using parameters estimated from the DNN output via a feedback loop. Different DNN structures have been tested, including a classic multilayer perceptron regression architecture as well as a new hybrid network with both convolutional and densely-connected layers. Objective evaluations in terms of PESQ and STOI showed consistent improvement over baseline methods using traditional binaural features, especially when the hybrid DNN architecture was employed. In addition, our proposed scheme is robust to mismatches between the training and testing data.
State-of-the-art binaural objective intelligibility measures (OIMs) require individual source signals for making intelligibility predictions, limiting their usability in real-time online operations. This limitation may be addressed by a blind source separation (BSS) process, which is able to extract the underlying sources from a mixture. In this study, a speech source is presented with either a stationary noise masker or a fluctuating noise masker whose azimuth varies in a horizontal plane, at two speech-to-noise ratios (SNRs). Three binaural OIMs are used to predict speech intelligibility from the signals separated by a BSS algorithm. The model predictions are compared with listeners' word identification rate in a perceptual listening experiment. The results suggest that with SNR compensation to the BSS-separated speech signal, the OIMs can maintain their predictive power for individual maskers compared to their performance measured from the direct signals. They also reveal that the errors in SNR between the estimated signals are not the only factors that decrease the predictive accuracy of the OIMs with the separated signals. Artefacts or distortions on the estimated signals caused by the BSS algorithm may also be concerns.
Typical methods for binaural source separation consider only the direct sound as the target signal in a mixture. However, in most scenarios, this assumption limits the source separation performance. It is well known that the early reflections interact with the direct sound, producing acoustic effects at the listening position, e.g. the so-called comb filter effect. In this article, we propose a novel source separation model that utilizes both the direct sound and the first early reflection information to model the comb filter effect. This is done by observing the interaural phase difference obtained from the time-frequency representation of binaural mixtures. Furthermore, a method is proposed to model the interaural coherence of the signals. Including information related to the sound multipath propagation, the performance of the proposed separation method is improved with respect to the baselines that did not use such information, as illustrated by using binaural recordings made in four rooms, having different sizes and reverberation times.
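The comb filter effect mentioned above can be illustrated with a single-channel toy model: a direct sound plus one delayed, attenuated reflection, giving the transfer function H(f) = 1 + g·exp(-j2πfτ). The sketch below evaluates its magnitude and lists the notch frequencies. It is only a textbook illustration of the effect; it is not the binaural model proposed in the paper, and the function names, delay and gain values are assumptions.

```python
import cmath
import math


def comb_filter_magnitude(freq_hz, delay_s, reflection_gain):
    """Magnitude of H(f) = 1 + g * exp(-j 2 pi f tau) for one reflection."""
    h = 1.0 + reflection_gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return abs(h)


def notch_frequencies(delay_s, max_freq_hz):
    """Frequencies up to max_freq_hz where a positive-gain reflection arrives in anti-phase."""
    notches = []
    k = 0
    while (2 * k + 1) / (2.0 * delay_s) <= max_freq_hz:
        notches.append((2 * k + 1) / (2.0 * delay_s))
        k += 1
    return notches
```

For a reflection delayed by 1 ms with gain 0.5, the response dips to 0.5 at 500 Hz and peaks at 1.5 at 1 kHz, and the pattern repeats every 1 kHz.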
Object-based audio promises format-agnostic reproduction and extensive personalization of spatial audio content. However, in practical listening scenarios, such as in consumer audio, ideal reproduction is typically not possible. To maximize the quality of listening experience, a different approach is required, for example modifications of metadata to adjust for the reproduction layout or personalization choices. In this paper we propose a novel system architecture for semantically informed rendering (SIR), that combines object audio rendering with high-level processing of object metadata. In many cases, this processing uses novel, advanced metadata describing the objects to optimally adjust the audio scene to the reproduction system or listener preferences. The proposed system is evaluated with several adaptation strategies, including semantically motivated downmix to layouts with few loudspeakers, manipulation of perceptual attributes, perceptual reverberation compensation, and orchestration of mobile devices for immersive reproduction. These examples demonstrate how SIR can significantly improve the media experience and provide advanced personalization controls, for example by maintaining smooth object trajectories on systems with few loudspeakers, or providing personalized envelopment levels. An example implementation of the proposed system architecture is described and provided as an open, extensible software framework that combines object-based audio rendering and high-level processing of advanced object metadata.
Object-based audio has the potential to enable multimedia content to be tailored to individual listeners and their reproduction equipment. In general, object-based production assumes that the objects (the assets comprising the scene) are free of noise and interference. However, there are many applications in which signal separation could be useful to an object-based audio workflow, e.g., extracting individual objects from channel-based recordings or legacy content, or recording a sound scene with a single microphone array. This paper describes the application and evaluation of blind source separation (BSS) for sound recording in a hybrid channel-based and object-based workflow, in which BSS-estimated objects are mixed with the original stereo recording. A subjective experiment was conducted using simultaneously spoken speech recorded with omnidirectional microphones in a reverberant room. Listeners mixed a BSS-extracted speech object into the scene to make the quieter talker clearer, while retaining acceptable audio quality, compared to the raw stereo recording. Objective evaluations show that the relative short-term objective intelligibility and speech quality scores increase using BSS. Further objective evaluations are used to discuss the influence of the BSS method on the remixing scenario; the scenario shown by human listeners to be useful in object-based audio is shown to be a worst-case scenario.
For a sound source on the median-plane of a binaural system, interaural localization cues are absent. So, for robust binaural localization of sound sources on the median-plane, localization methods need to be designed with this in consideration. We compare four median-plane binaural sound source localization methods. Where appropriate, adjustments to the methods have been made to improve their robustness to real world recording conditions. The methods are tested using different HRTF datasets to generate the test data and training data. Each method uses a different combination of spectral and interaural localization cues, allowing for a comparison of the effect of spectral and interaural cues on median-plane localization. The methods are tested for their robustness to different levels of additive noise and different categories of sound.
Studies on perceived audio-visual spatial coherence in the literature have commonly employed continuous judgment scales. This method requires listeners to detect and to quantify their perception of a given feature and is a difficult task, particularly for untrained listeners. An alternative method is the quantification of a percept by conducting a simple forced choice test with subsequent modeling of the psychometric function. An experiment to validate this alternative method for the perception of azimuthal audio-visual spatial coherence was performed. Furthermore, information on participant training and localization ability was gathered. The results are consistent with previous research and show that the proposed methodology is suitable for this kind of test. The main differences between participants result from the presence or absence of musical training.
Whilst it is possible to create exciting, immersive listening experiences with current spatial audio technology, the required systems are generally difficult to install in a standard living room. However, in any living room there is likely to already be a range of loudspeakers (such as mobile phones, tablets, laptops, and so on). "Media device orchestration" (MDO) is the concept of utilising all available devices to augment the reproduction of a media experience. In this demonstration, MDO is used to augment low channel count renderings of various programme material, delivering immersive three-dimensional audio experiences.
Recent attention to the problem of controlling multiple loudspeakers to create sound zones has been directed toward practical issues arising from system robustness concerns. In this study, the effects of regularization are analyzed for three representative sound zoning methods. Regularization governs the control effort required to drive the loudspeaker array, via a constraint in each optimization cost function. Simulations show that regularization has a significant effect on the sound zone performance, both under ideal anechoic conditions and when systematic errors are introduced between calculation of the source weights and their application to the system. Results are obtained for speed of sound variations and loudspeaker positioning errors with respect to the source weights calculated. Judicious selection of the regularization parameter is shown to be a primary concern for sound zone system designers: the acoustic contrast can be increased by up to 50 dB with proper regularization in the presence of errors. A frequency-dependent minimum regularization parameter is determined based on the conditioning of the matrix inverse. The regularization parameter can be further increased to improve performance depending on the control effort constraints, expected magnitude of errors, and desired sound field properties of the system.
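To make the link between regularization and control effort concrete, the sketch below solves a Tikhonov-regularized least-squares pressure-matching problem for the special case of two sources, minimizing ||p - Gq||² + λ||q||² via the normal equations (GᴴG + λI)q = Gᴴp. This is a generic textbook formulation chosen for illustration; the abstract does not specify the cost functions of the three methods studied, and the function names and the two-source restriction are assumptions.

```python
def regularized_weights_2src(G, p_target, lam):
    """Solve min ||p - G q||^2 + lam * ||q||^2 for exactly two sources.

    G is a list of rows [g1, g2] of complex transfer functions, one row per
    control microphone; p_target is the desired pressure at each microphone.
    """
    # Entries of the 2x2 Hermitian matrix A = G^H G + lam * I.
    a = sum(abs(row[0]) ** 2 for row in G) + lam
    d = sum(abs(row[1]) ** 2 for row in G) + lam
    b = sum(row[0].conjugate() * row[1] for row in G)
    c = b.conjugate()
    # Right-hand side y = G^H p.
    y0 = sum(row[0].conjugate() * p for row, p in zip(G, p_target))
    y1 = sum(row[1].conjugate() * p for row, p in zip(G, p_target))
    # Closed-form inverse of a 2x2 matrix.
    det = a * d - b * c
    q0 = (d * y0 - b * y1) / det
    q1 = (a * y1 - c * y0) / det
    return [q0, q1]


def effort(q):
    """Sum of squared source weight magnitudes."""
    return sum(abs(w) ** 2 for w in q)
```

In this toy setting, raising λ from 0 to 1 with an identity transfer matrix halves each source weight, so the effort falls from 2 to 0.5 at the cost of a larger reproduction error.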
The ventriloquism effect describes the phenomenon of audio and visual signals with common features, such as a voice and a talking face merging perceptually into one percept even if they are spatially misaligned. The boundaries of the fusion of spatially misaligned stimuli are of interest for the design of multimedia products to ensure a perceptually satisfactory product. They have mainly been studied using continuous judgment scales and forced-choice measurement methods. These results vary greatly between different studies. The current experiment aims to evaluate audio-visual fusion using reaction time (RT) measurements as an indirect method of measurement to overcome these great variances. A two-alternative forced-choice (2AFC) word recognition test was designed and tested with noise and multi-talker speech background distractors. Visual signals were presented centrally and audio signals were presented between 0° and 31° audio-visual offset in azimuth. RT data were analyzed separately for the underlying Simon effect and attentional effects. In the case of the attentional effects, three models were identified but no single model could explain the observed RTs for all participants so data were grouped and analyzed accordingly. The results show that significant differences in RTs are measured from 5° to 10° onwards for the Simon effect. The attentional effect varied at the same audio-visual offset for two out of the three defined participant groups. In contrast with the prior research, these results suggest that, even for speech signals, small audio-visual offsets influence spatial integration subconsciously.
In this paper, we compare different deep neural networks (DNN) in extracting speech signals from competing speakers in room environments, including the conventional fully-connected multilayer perceptron (MLP) network, convolutional neural network (CNN), recurrent neural network (RNN), and the recently proposed capsule network (CapsNet). Each DNN takes input of both spectral features and converted spatial features that are robust to position mismatch, and outputs the separation mask for target source estimation. In addition, a psychoacoustically-motivated objective function is integrated in each DNN, which explores perceptual importance of each TF unit in the training process. Objective evaluations are performed on the separated sounds using the converged models, in terms of PESQ, SDR as well as STOI. Overall, all the implemented DNNs have greatly improved the quality and speech intelligibility of the embedded target source as compared to the original recordings. In particular, bidirectional RNN, either along the temporal direction or along the frequency bins, outperforms the other DNN structures with consistent improvement.
Microphone array beamforming can be used to enhance and separate sound sources, with applications in the capture of object-based audio. Many beamforming methods have been proposed and assessed against each other. However, the effects of compact microphone array design on beamforming performance have not been studied for this kind of application. This study investigates how to maximize the quality of audio objects extracted from a horizontal sound field by filter-and-sum beamforming, through appropriate choice of microphone array design. Eight uniform geometries with practical constraints of a limited number of microphones and maximum array size are evaluated over a range of physical metrics. Results show that baffled circular arrays outperform the other geometries in terms of perceptually relevant frequency range, spatial resolution, directivity and robustness. Moreover, a subjective evaluation of microphone arrays and beamformers is conducted with regards to the quality of the target sound, interference suppression and overall quality of simulated music performance recordings. Baffled circular arrays achieve higher target quality and interference suppression than alternative geometries with wideband signals. Furthermore, subjective scores of beamformers regarding target quality and interference suppression agree well with beamformer on-axis and off-axis responses; with wideband signals the superdirective beamformer achieves the highest overall quality.
Room Impulse Responses (RIRs) measured with microphone arrays capture spatial and nonspatial information, e.g. the early reflections’ directions and times of arrival, the size of the room and its absorption properties. The Reverberant Spatial Audio Object (RSAO) was proposed as a method to encode room acoustic parameters from measured array RIRs. As the RSAO is object-based audio compatible, its parameters can be rendered to arbitrary reproduction systems and edited to modify the reverberation characteristics, to improve the user experience. Various microphone array designs have been proposed for sound field and room acoustic analysis, but a comparative performance evaluation is not available. This study assesses the performance of five regular microphone array geometries (linear, rectangular, circular, dual-circular and spherical) to capture RSAO parameters for the direct sound and early reflections of RIRs. The image source method is used to synthesise RIRs at the microphone positions as well as at the centre of the array. From the array RIRs, the RSAO parameters are estimated and compared to the reference parameters at the centre of the array. A performance comparison among the five arrays is established as well as the effect of a rigid spherical baffle for the circular and spherical arrays. The effects of measurement uncertainties, such as microphone misplacement and sensor noise errors, are also studied. The results show that planar arrays achieve the most accurate horizontal localisation whereas the spherical arrays perform best in elevation. Arrays with smaller apertures achieve a higher number of detected reflections, which becomes more significant for the smaller room with higher reflection density.
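The image source method referred to above can be sketched for first-order reflections in a rectangular (shoebox) room: the source is mirrored across each wall, and each image's distance to the receiver gives a reflection's time of arrival. The code below is a minimal illustration only. It ignores wall absorption, higher-order reflections and the microphone array itself, and the speed of sound, function names and room coordinates are assumptions rather than details from the study.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def first_order_images(source, room):
    """Mirror the source across the six walls of a shoebox room.

    room = (Lx, Ly, Lz); walls lie at 0 and L along each axis.
    """
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            image[axis] = 2.0 * wall - source[axis]
            images.append(tuple(image))
    return images


def arrival_times(source, receiver, room):
    """Return the direct-path delay and the sorted first-order reflection delays, in seconds."""
    direct = math.dist(source, receiver) / SPEED_OF_SOUND
    reflections = sorted(
        math.dist(image, receiver) / SPEED_OF_SOUND
        for image in first_order_images(source, room)
    )
    return direct, reflections
```

The direction from the receiver to each image source gives the corresponding reflection's direction of arrival, which is the kind of parameter the abstract describes estimating from array RIRs.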
In this paper we examine how the term ‘Audio Augmented Reality’ (AAR) is used in the literature, and how the concept is used in practice. In particular, AAR seems to refer to a variety of closely related concepts. In order to gain a deeper understanding of disparate work surrounding AAR, we present a taxonomy of these concepts and highlight both canonical examples in each category, as well as edge cases that help define the category boundaries.
The ability to replicate a plane wave represents an essential element of spatial sound field reproduction. In sound field synthesis, the desired field is often formulated as a plane wave and the error minimized; for other sound field control methods, the energy density or energy ratio is maximized. In all cases and further to the reproduction error, it is informative to characterize how planar the resultant sound field is. This paper presents a method for quantifying a region's acoustic planarity by superdirective beamforming with an array of microphones, which analyzes the azimuthal distribution of impinging waves and hence derives the planarity. Estimates are obtained for a variety of simulated sound field types, tested with respect to array orientation, wavenumber, and number of microphones. A range of microphone configurations is examined. Results are compared with delay-and-sum beamforming, which is equivalent to spatial Fourier decomposition. The superdirective beamformer provides better characterization of sound fields, and is effective with a moderate number of omni-directional microphones over a broad frequency range. Practical investigation of planarity estimation in real sound fields is needed to demonstrate its validity as a physical sound field evaluation measure. © 2013 Acoustical Society of America.
In this paper we propose a cuboid-based air-tight indoor room geometry estimation method using a combination of audio-visual sensors. Existing vision-based 3D reconstruction methods are not applicable for scenes with transparent or reflective objects such as windows and mirrors. In this work we fuse multi-modal sensory information to overcome the limitations of purely visual reconstruction for reconstruction of complex scenes including transparent and mirror surfaces. A full scene is captured by 360 cameras and acoustic room impulse responses (RIRs) recorded by a loudspeaker and compact microphone array. Depth information of the scene is recovered by stereo matching from the captured images and estimation of major acoustic reflector locations from the sound. The coordinate systems for audio-visual sensors are aligned into a unified reference frame and plane elements are reconstructed from audio-visual data. Finally, cuboid proxies are fitted to the planes to generate a complete room model. Experimental results show that the proposed system generates complete representations of the room structures regardless of transparent windows, featureless walls and shiny surfaces.
Multi-point approaches for sound field control generally sample the listening zone(s) with pressure microphones, and use these measurements as an input for an optimisation cost function. A number of techniques are based on this concept, for single-zone (e.g. least-squares pressure matching (PM), brightness control, planarity panning) and multi-zone (e.g. PM, acoustic contrast control, planarity control) reproduction. Accurate performance predictions are obtained when distinct microphone positions are employed for setup versus evaluation. While, in simulation, one can afford a dense sampling of virtual microphones, it is desirable in practice to have a microphone array which can be positioned once in each zone to measure the setup transfer functions between each loudspeaker and that zone. In this contribution, we present simulation results over a fixed dense set of evaluation points comparing the performance of several multi-point optimisation approaches for 2D reproduction with a 60 channel circular loudspeaker arrangement. Various regular setup microphone arrays are used to calculate the sound zone filters: circular grid, circular, dual-circular, and spherical arrays, each with different numbers of microphones. Furthermore, the effect of a rigid spherical baffle is studied for the circular and spherical arrangements. The results of this comparative study show how the directivity and effective frequency range of multi-point optimisation techniques depend on the microphone array used to sample the zones. In general, microphone arrays with dense spacing around the boundary give better angular discrimination, leading to more accurate directional sound reproduction, while those distributed around the whole zone enable more accurate prediction of the reproduced target sound pressure level.
As audio-visual systems increasingly bring immersive and interactive capabilities into our work and leisure activities, so the need for naturalistic test material grows. New volumetric datasets have captured high-quality 3D video, but accompanying audio is often neglected, making it hard to test an integrated bimodal experience. Designed to cover diverse sound types and features, the presented volumetric dataset was constructed from audio and video studio recordings of scenes to yield forty short action sequences. Potential uses in technical and scientific tests are discussed.