It is known that applying a time-frequency binary mask to very noisy speech can improve its intelligibility but results in poor perceptual quality. In this paper we propose a new approach to applying a binary mask that combines the intelligibility gains of conventional binary masking with the perceptual quality gains of a classical speech enhancer. The binary mask is not applied directly as a time-frequency gain as in most previous studies. Instead, the mask is used to supply prior information to a classical speech enhancer about the probability of speech presence in different time-frequency regions. Using an oracle ideal binary mask, we show that the proposed method results in a higher predicted quality than other methods of applying a binary mask whilst preserving the improvements in predicted intelligibility.
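The mask-as-prior idea can be sketched in a few lines; this is a minimal illustration, not the paper's actual enhancer. The binary mask is mapped to a soft speech-presence probability that weights a Wiener-style gain, instead of being applied as a hard 0/1 gain (the array shapes, the 0.9/0.1 probabilities, and the gain floor are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical STFT power spectra of noisy speech (frequency bins x frames)
noisy_power = rng.rayleigh(1.0, size=(257, 100)) ** 2
noise_power = np.ones_like(noisy_power)  # assumed noise PSD estimate

# Oracle-style binary mask: 1 marks bins assumed to be speech-dominated
mask = (noisy_power > 2.0 * noise_power).astype(float)

# Instead of applying the mask as a hard 0/1 gain, map it to a soft
# speech-presence probability (the 0.9/0.1 values are assumptions)
p_speech = np.where(mask == 1.0, 0.9, 0.1)

# Wiener-style gain, weighted by the speech-presence probability,
# with a small gain floor so that no bin is zeroed out entirely
snr = np.maximum(noisy_power / noise_power - 1.0, 1e-3)
wiener_gain = snr / (1.0 + snr)
gain = p_speech * wiener_gain + (1.0 - p_speech) * 0.05

enhanced_power = gain ** 2 * noisy_power
```

Unlike direct binary masking, no time-frequency bin is ever fully suppressed, which is one way the musical-noise artifacts of hard masking can be avoided.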
Immersive audio technologies have been evolving in two directions: physically motivated and perceptually motivated systems. Physically motivated techniques aim to reproduce a physically accurate approximation of the desired sound field, at the cost of a high equipment load and sophisticated, computationally intensive algorithms. Perceptually motivated techniques, on the other hand, aim to render only the perceptually relevant aspects of the sound scene by means of a modest computational and equipment load. This article presents an overview of perceptually motivated techniques, with a focus on multichannel audio recording and reproduction, audio source and reflection culling, and artificial reverberators.
Parametric modeling of room acoustics aims at representing room transfer functions (RTFs) by means of digital filters and finds application in many acoustic signal enhancement algorithms. In previous work by other authors, the use of orthonormal basis functions (OBFs) for modeling room acoustics has been proposed. Some advantages of OBF models over all-zero and pole-zero models have been illustrated, mainly focusing on the fact that OBF models typically require fewer model parameters to provide the same model accuracy. In this paper, it is shown that the orthogonality of the OBF model brings several additional advantages, which can be exploited if a suitable algorithm for identifying the OBF model parameters is applied. Specifically, the orthogonality of OBF models not only leads to improved model efficiency (as pointed out in previous work), but also to improved model scalability and model stability. The appealing scalability property derives from a previously unexplored interpretation of the OBF model as an approximation to a solution of the inhomogeneous acoustic wave equation. Following this interpretation, a novel identification algorithm is proposed that takes advantage of the OBF model orthogonality to deliver efficient, scalable, and stable OBF model estimates, which is not necessarily the case for the nonlinear estimation techniques that are normally applied.
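The role of orthogonality can be illustrated with a generic orthonormal basis standing in for the OBFs (the actual OBFs are pole-based IIR responses, not constructed here): least-squares coefficients reduce to plain inner products, and a lower-order model is obtained by simple truncation, which is the scalability property mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 256, 16
# Stand-in orthonormal basis (columns); the real OBFs are pole-based IIR
# responses orthonormalized on the unit circle, which we do not build here
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
basis = Q[:, :K]

h = rng.standard_normal(N)  # a hypothetical room response to be modeled

# Orthogonality: the least-squares coefficients are plain inner products,
# with no matrix inversion and no risk of ill-conditioning
coeffs = basis.T @ h

# Scalability: a lower-order model reuses the same coefficients;
# no re-estimation of the remaining parameters is needed
coeffs_low = basis[:, :8].T @ h
scalable = np.allclose(coeffs_low, coeffs[:8])
```

For a non-orthogonal basis, truncating the model would require re-solving the least-squares problem, since all coefficients change when basis functions are removed.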
This report introduces a new database of room impulse responses (RIRs) measured in an empty rectangular room using subwoofers as sound sources. The purpose of this database, publicly available for download, is to provide acoustic measurements within the frequency region of modal resonances. Performing acoustic measurements at low frequencies presents many difficulties, mainly related to ambient noise and to unavoidable nonlinearities of the subwoofer. In this report, it is shown that these issues can be addressed and partially solved by means of the exponential sine-sweep technique and a careful calibration of the measurement equipment. A procedure for estimating the reverberation time at very low frequencies is proposed, which uses a cosine-modulated filterbank and an approximation of the RIRs using parametric models in order to reduce problems related to low signal-to-noise ratio and to the length of typical band-pass filter responses.
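The exponential sine-sweep technique mentioned above can be sketched as follows; this is a toy illustration with a synthetic impulse response, not the report's calibrated measurement chain (the sampling rate, sweep range, and toy "room" are assumptions):

```python
import numpy as np

fs = 4000                    # assumed sampling rate (Hz)
T = 1.0                      # sweep duration (s)
f1, f2 = 20.0, 1500.0        # assumed start and end frequencies (Hz)
t = np.arange(int(T * fs)) / fs
R = np.log(f2 / f1)

# Exponential sine sweep (Farina's formulation)
sweep = np.sin(2 * np.pi * f1 * T / R * (np.exp(t * R / T) - 1.0))

# Inverse filter: time-reversed sweep with exponential amplitude compensation
inv = sweep[::-1] * np.exp(-t * R / T)

# Toy "room": a direct path plus one reflection (a stand-in for a real RIR)
h_true = np.zeros(400)
h_true[0], h_true[250] = 1.0, 0.5
recorded = np.convolve(sweep, h_true)

# Deconvolution: the linear impulse response appears around the sweep length,
# while harmonic distortion products (if any) are pushed earlier in time
rir = np.convolve(recorded, inv)
rir /= np.abs(rir).max()
peak = int(np.argmax(np.abs(rir)))
```

The separation of the nonlinear distortion products from the linear response is the property that makes this technique attractive when driving a subwoofer near its excursion limits.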
Room Impulse Responses (RIRs) are typically measured using a set of microphones and a loudspeaker. When RIRs spanning a large volume are needed, many microphone measurements must be used to spatially sample the sound field. In order to reduce the number of microphone measurements, RIRs can be spatially interpolated. In the present study, RIR interpolation is formulated as an inverse problem. This inverse problem relies on a particular acoustic model capable of representing the measurements. Two different acoustic models are compared: the plane wave decomposition model and a novel time-domain model that consists of a collection of equivalent sources creating spherical waves. These acoustic models can both approximate any reverberant sound field created by a far-field sound source. In order to produce an accurate RIR interpolation, sparsity regularization is employed when solving the inverse problem. In particular, by combining different acoustic models with different sparsity-promoting regularizations, spatial sparsity, spatio-spectral sparsity, and spatio-temporal sparsity are compared. The inverse problem is solved using a matrix-free large-scale optimization algorithm. Simulations show that the best RIR interpolation is obtained when combining the novel time-domain acoustic model with the spatio-temporal sparsity regularization, outperforming the results of the plane wave decomposition model even when far fewer microphone measurements are available.
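The sparsity-regularized inverse problem can be illustrated in miniature. The sketch below uses a random dictionary as a stand-in for the plane-wave or equivalent-source model, and plain ISTA (iterative soft thresholding) instead of the matrix-free large-scale solver used in the study; all dimensions and the regularization weight are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical setup: p = D @ w, where p stacks the microphone pressure
# samples, D is the acoustic-model dictionary, and w is sparse
m, n, k = 60, 200, 5
D = rng.standard_normal((m, n)) / np.sqrt(m)
w_true = np.zeros(n)
w_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
p = D @ w_true

# ISTA: proximal gradient descent on 0.5*||D w - p||^2 + lam*||w||_1
lam = 0.01
step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of gradient
w = np.zeros(n)
for _ in range(500):
    w = w - step * (D.T @ (D @ w - p))                       # gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold

residual = np.linalg.norm(D @ w - p) / np.linalg.norm(p)
```

Because only k of the n coefficients are active, the field can be recovered from far fewer measurements (m < n) than an unregularized least-squares fit would require, which is the mechanism behind the reduced microphone count reported above.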
This research focuses on sound localization experiments in which subjects report the position of an active sound source by turning toward it. A statistical framework for the analysis of the data is presented, together with a case study from a large-scale listening experiment. The statistical framework is based on a model that is robust to the presence of front/back confusions and random errors. Closed-form natural estimators are derived, and one-sample and two-sample statistical tests are described. The framework is used to analyze the data of an auralized experiment undertaken by nearly nine hundred subjects. The objective was to explore localization performance in the horizontal plane in an informal setting and with little training, conditions similar to those typically encountered in consumer applications of binaural audio. Results show that responses had a rightward bias and that speech was harder to localize than percussion sounds, both findings consistent with the literature. Results also show that it was harder to localize sound in a simulated room with a high ceiling, despite that room having a higher direct-to-reverberant ratio than the other simulated rooms.
In this paper, source localization and dereverberation are formulated jointly as an inverse problem. The inverse problem consists of the interpolation of the sound field measured by a set of microphones, by matching the recorded sound pressure with that of a particular acoustic model. This model is based on a collection of equivalent sources creating either spherical or plane waves. In order to achieve meaningful results, spatial, spatio-temporal, and spatio-spectral sparsity can be promoted in the signals originating from the equivalent sources. The resulting large-scale optimization problem is solved using a first-order matrix-free optimization algorithm. It is shown that once the equivalent source signals capable of effectively interpolating the sound field are obtained, they can be readily used to localize a speech sound source in terms of direction of arrival (DOA) and to perform dereverberation in a highly reverberant environment.
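Once such equivalent-source weight signals are available, the DOA readout amounts to picking the direction whose weight signal carries the most energy. A minimal sketch with synthetic weights (the grid resolution, noise level, and active direction are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical outcome of the inverse problem: one weight signal per candidate
# plane-wave direction (rows); sparsity concentrates energy in few directions
angles = np.arange(0, 360, 5)
weights = 0.01 * rng.standard_normal((len(angles), 1000))
weights[27] += rng.standard_normal(1000)  # assumed active direction: 135 degrees

# DOA estimate: the direction whose weight signal carries the most energy
energy = np.sum(weights ** 2, axis=1)
doa = int(angles[int(np.argmax(energy))])
```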
The active sensing and perception of the environment by auditory means is typically known as echolocation, and it can be acquired by humans, who can profit from it in the absence of vision. We investigated the ability of twenty-one untrained sighted participants to use echolocation with self-generated oral clicks for aligning themselves within the horizontal plane towards a virtual wall, emulated with an acoustic virtual reality system, at distances between 1 and 32 m, in the absence of background noise and reverberation. Participants were able to detect the virtual wall on 61% of the trials, although with large differences across individuals and distances. The use of louder and shorter clicks led to increased performance, whereas the use of clicks with lower frequency content allowed for the use of interaural time differences to improve the accuracy of reflection localization at very long distances. The distance of 2 m was the most difficult to detect and localize, whereas the furthest distances of 16 and 32 m were the easiest ones. Thus, echolocation may be used effectively to identify large distant environmental landmarks such as buildings.
Parametric equalization of an acoustic system aims to compensate for the deviations of its response from a desired target response using parametric digital filters. An optimization procedure is presented for the automatic design of a low-order equalizer using parametric infinite impulse response (IIR) filters, specifically second-order peaking filters and first-order shelving filters. The proposed procedure minimizes the sum of square errors (SSE) between the system and the target complex frequency responses, instead of the commonly used difference in magnitudes, and exploits a previously unexplored orthogonality property of one particular type of parametric filter. This brings a series of advantages over state-of-the-art procedures: improved mathematical tractability of the equalization problem, with the possibility of computing analytical expressions for the gradients; improved initialization of the parameters, including the global gain of the equalizer; the incorporation of shelving filters in the optimization procedure; and a sharper focus on the equalization of the more perceptually relevant frequency peaks. Examples of loudspeaker and room equalization are provided, as well as a note about extending the procedure to multi-point equalization and transfer function modeling.
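The complex-response SSE criterion can be illustrated with a toy example; the actual procedure, its orthogonality property, and its analytical gradients are not reproduced here. The sketch below uses a standard (RBJ audio-EQ-cookbook style) second-order peaking filter and a plain grid search over the equalizer gain; the system resonance and all numeric values are assumptions.

```python
import numpy as np

def peaking_filter(fc, gain_db, q, fs=48000.0):
    """Second-order peaking filter (RBJ audio-EQ-cookbook style)."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * fc / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0 + alpha * a_lin, -2.0 * np.cos(w0), 1.0 - alpha * a_lin])
    a = np.array([1.0 + alpha / a_lin, -2.0 * np.cos(w0), 1.0 - alpha / a_lin])
    return b / a[0], a / a[0]

def freq_response(b, a, f, fs=48000.0):
    """Complex frequency response of a biquad at frequencies f (Hz)."""
    zinv = np.exp(-2j * np.pi * f / fs)
    return (b[0] + b[1] * zinv + b[2] * zinv**2) / \
           (a[0] + a[1] * zinv + a[2] * zinv**2)

f = np.logspace(np.log10(20.0), np.log10(20000.0), 200)

# Hypothetical system with a +6 dB resonance at 1 kHz to be flattened
b_sys, a_sys = peaking_filter(1000.0, 6.0, 2.0)
h_sys = freq_response(b_sys, a_sys, f)

# Complex-domain SSE against a flat target, over a grid of equalizer gains
gains = np.linspace(-12.0, 0.0, 121)
sse = [np.sum(np.abs(h_sys * freq_response(*peaking_filter(1000.0, g, 2.0), f)
                     - 1.0) ** 2)
       for g in gains]
best_gain = float(gains[int(np.argmin(sse))])  # expected near -6 dB
```

Because the error is taken on the complex responses rather than on magnitudes only, phase deviations are also penalized, which is one reason the criterion is more tractable than magnitude-difference formulations.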
The image-source method models the specular reflection from a plane by means of a secondary source positioned at the source's reflected image. The method has been widely used in acoustics to model the reverberant field of rectangular rooms, but can also be used for general-shaped rooms and non-flat reflectors. This paper explores the relationship between the physical properties of a non-flat reflector and the statistical properties of the associated cloud of image sources. It is shown here that the standard deviation of the image sources is strongly correlated with the ratio between the depth and width of the reflector's spatial features.
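The reflection operation underlying the method can be sketched directly; the example source position and plane are arbitrary:

```python
import numpy as np

def image_source(src, plane_point, plane_normal):
    """Reflect a source position across a plane to obtain its image source."""
    n = np.asarray(plane_normal, float)
    n = n / np.linalg.norm(n)
    src = np.asarray(src, float)
    d = np.dot(src - np.asarray(plane_point, float), n)  # signed distance
    return src - 2.0 * d * n

# A source 1 m above the floor (plane z = 0): its image lies 1 m below
img = image_source([2.0, 3.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# img -> array([ 2.,  3., -1.])
```

For a non-flat reflector, each local facet contributes its own image position, and it is the spread of this cloud of images that the paper relates to the depth-to-width ratio of the surface features.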
Auralization schemes in the domain of automotive audio have primarily relied on dummy-head recordings in the past. Recently, spatial reproduction has allowed the auralization of cabin acoustics over large loudspeaker arrays, yet no direct comparisons between those methods exist. In this study, the efficacy of headphone presentation is explored in this context. Six acoustical conditions were presented over headphones to experienced assessors (n=23), who were asked to compare them on six elicited perceptual attributes. In 24 out of 36 cases, the results indicate agreement between headphone- and loudspeaker-based auralization of identical stimulus sets. It is concluded that, compared to loudspeaker-based rendering, headphone-based rendering yields similar judgments of timbral attributes, while certain spatial attributes should be assessed with caution.
Perceptual sound field reconstruction (PSR) is a spatial audio recording and reproduction method based on the application of stereophonic panning laws in microphone array design. PSR allows rendering a perceptually veridical and stable auditory perspective in the horizontal plane of the listener, and involves recording using near-coincident microphone arrays. This paper extends the PSR concept to three dimensions using sound field extrapolation carried out in the spherical-harmonic domain. Sound field rendering is performed using a two-level loudspeaker rig. An active-intensity-based analysis of the rendered sound field shows that the proposed approach can accurately render the direction of monochromatic plane waves.
Acoustic source localization and dereverberation are formulated jointly as an inverse problem. The inverse problem consists of the approximation of the sound field measured by a set of microphones. The recorded sound pressure is matched with that of a particular acoustic model based on a collection of plane waves arriving from different directions at the microphone positions. In order to achieve meaningful results, spatial and spatio-spectral sparsity can be promoted in the weight signals controlling the plane waves. The large-scale optimization problem resulting from the inverse problem formulation is solved using a first-order optimization algorithm combined with a weighted overlap-add procedure. It is shown that once the weight signals capable of effectively approximating the sound field are obtained, they can be readily used to localize a moving sound source in terms of direction of arrival (DOA) and to perform dereverberation in a highly reverberant environment. Results from simulation experiments and from real measurements show that the proposed algorithm is robust against both localized and diffuse noise, exhibiting noise reduction in the dereverberated signals.
This paper studies the effects of inter-channel time and level differences in stereophonic reproduction on perceived localization uncertainty, which is defined as how difficult it is for a listener to tell where a sound source is located. Towards this end, a computational model of localization uncertainty is first proposed. The model calculates inter-aural time and level difference cues and compares them to those associated with free-field point-like sources. The comparison is carried out using a particular distance functional that replicates the increased uncertainty observed experimentally with inconsistent inter-aural time and level difference cues. The model is validated by formal listening tests, achieving a Pearson correlation of 0.99. The model is then used to predict localization uncertainty for stereophonic setups with a listener in central and off-central positions. Results show that amplitude methods achieve a slightly lower localization uncertainty for a listener positioned exactly in the center of the sweet spot. As soon as the listener moves away from that position, the situation reverses, with time-amplitude methods achieving a lower localization uncertainty.
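The paper's uncertainty model itself is not reproduced here; as background, the classical stereophonic tangent law, which relates the inter-channel gains of an amplitude-panned source to the perceived azimuth of the phantom image, can be sketched as follows (the ±30° base angle is the standard stereo setup):

```python
import numpy as np

def tangent_law_angle(gain_l, gain_r, base_angle_deg=30.0):
    """Perceived azimuth (deg) predicted by the stereophonic tangent law
    for a loudspeaker pair at +/- base_angle_deg."""
    t = (gain_l - gain_r) / (gain_l + gain_r) * np.tan(np.radians(base_angle_deg))
    return float(np.degrees(np.arctan(t)))

centre = tangent_law_angle(1.0, 1.0)      # equal gains -> phantom centre (0 deg)
hard_left = tangent_law_angle(1.0, 0.0)   # left channel only -> left loudspeaker
```

Such panning laws predict only the mean perceived direction; the model described above additionally predicts how uncertain that direction becomes when the inter-aural time and level cues it induces disagree.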