Dr Enzo De Sena
Enzo De Sena received the B.Sc. in 2007 and M.Sc. (cum laude) in 2009, both from the Università degli Studi di Napoli "Federico II" in Telecommunication Engineering. In 2013, he received the Ph.D. degree in Electronic Engineering from King’s College London, where he was also a Teaching Fellow from 2012 to 2013. Between 2013 and 2016 he was a Postdoctoral Research Fellow at the Katholieke Universiteit Leuven. Since 2016 he has been a Lecturer in Audio at the Institute of Sound Recording at the University of Surrey. He is a former Marie Curie Fellow and has held visiting positions at Stanford University, Aalborg University and Imperial College London.
University roles and responsibilities
- International Relations Officer for the Department of Music and Media
His current research interests include room acoustics modelling, multichannel audio, microphone beamforming and binaural modelling.
Computer simulations of room acoustics suffer from an efficiency vs accuracy trade-off, with highly accurate wave-based models being computationally expensive, and delay-network-based models lacking in physical accuracy. The Scattering Delay Network (SDN) is a highly efficient recursive structure that renders first-order reflections exactly while approximating higher-order ones. With the purpose of improving the accuracy of SDNs, in this paper, several variations on SDNs are investigated, including appropriate node placement for exact modeling of higher-order reflections, redesigned scattering matrices for physically-motivated scattering, and pruned network connections for reduced computational complexity. The results of these variations are compared to state-of-the-art geometric acoustic models for different shoebox room simulations. Objective measures (Normalized Echo Densities (NEDs) and Energy Decay Curves (EDCs)) showed a close match between the proposed methods and the references. A formal listening test was carried out to evaluate differences in perceived naturalness of the synthesized Room Impulse Responses. Results show that increasing SDNs' order and adding directional scattering in a fully-connected network improves perceived naturalness, and higher-order pruned networks give similar performance at a much lower computational cost.
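As an illustration of one of the objective measures named above, an Energy Decay Curve is commonly obtained by Schroeder backward integration of the squared impulse response. The sketch below is not taken from the paper; the function name and the test signal are hypothetical, and it assumes the response is given as a plain list of samples.

```python
import math

def energy_decay_curve_db(rir):
    """Schroeder backward integration of an impulse response, in dB."""
    # Accumulate the squared samples from the end of the response backwards.
    tail = []
    running = 0.0
    for sample in reversed(rir):
        running += sample * sample
        tail.append(running)
    tail.reverse()
    total = tail[0]
    if total == 0.0:
        raise ValueError("impulse response has no energy")
    # Normalise so that the curve starts at 0 dB; guard against log(0).
    return [10.0 * math.log10(t / total) if t > 0.0 else float("-inf")
            for t in tail]

# Hypothetical test signal: an exponentially decaying sequence.
rir = [0.5 ** n for n in range(8)]
edc = energy_decay_curve_db(rir)
```

The curve starts at 0 dB and never increases, which is what makes it convenient for comparing the decay of two synthesized responses.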
Geometric acoustic models have a lower computational complexity than wave-based methods because they assume that sound propagates as rays; however, this assumption fails to capture wave-like properties of sound such as diffraction. Historically, the Biot-Tolstoy-Medwin (BTM) model and the Uniform Theory of Diffraction (UTD) have been used to augment geometric acoustic models with diffraction. Computational efficiency is essential for real-time application, and recently two more efficient models, the Volumetric Diffraction and Transmission (VDaT) model and an infinite impulse response (IIR) filter approximation, were proposed to approximate these solutions. A higher-order IIR filter approximation is proposed in this paper. An experiment is carried out to evaluate the perceived naturalness of these approximations compared to the more accurate analytical solutions. Stationary and moving receivers were considered in simple geometries with a single edge. The results suggest that the higher-order IIR approximation is perceptually similar to the BTM model. VDaT and the low-order IIR approximation were found to be less natural in some cases, while in dynamic scenes VDaT was found to be significantly more natural than the other models. The experiment was limited in scope by the simplicity of the scenes considered; however, the results suggest the models are perceptually similar. Improvements to the higher-order IIR approximation are suggested and a recommendation is made for future perceptual evaluations.
A study was performed using a virtual environment to investigate the relative importance of spatial audio fidelity and video resolution on perceived audiovisual quality and immersion. Subjects wore a head-mounted display and headphones and were presented with a virtual environment featuring music and speech stimuli using three levels each of spatial audio quality and video resolution. Spatial audio was rendered monaurally, binaurally with head-tracking, and binaurally with head-tracking and room acoustic rendering. Video was rendered at resolutions of 0.5 megapixels per eye, 1.5 megapixels per eye, and 2.5 megapixels per eye. Results showed that both video resolution and spatial audio rendering had a statistically significant effect on both immersion and audiovisual quality. Most strikingly, the results showed that under the conditions that were tested in the experiment, the addition of room acoustic rendering to head-tracked binaural audio had the same improvement on immersion as increasing the video resolution five-fold, from 0.5 megapixels per eye to 2.5 megapixels per eye.
The active sensing and perception of the environment by auditory means is typically known as echolocation and it can be acquired by humans, who can profit from it in the absence of vision. We investigated the ability of twenty-one untrained sighted participants to use echolocation with self-generated oral clicks for aligning themselves within the horizontal plane towards a virtual wall, emulated with an acoustic virtual reality system, at distances between 1 and 32 m, in the absence of background noise and reverberation. Participants were able to detect the virtual wall on 61% of the trials, although with large differences across individuals and distances. The use of louder and shorter clicks led to an increased performance, whereas the use of clicks with lower frequency content allowed for the use of interaural time differences to improve the accuracy of reflection localization at very long distances. The distance of 2 m was the most difficult to detect and localize, whereas the furthest distances of 16 and 32 m were the easiest ones. Thus, echolocation may be used effectively to identify large distant environmental landmarks such as buildings.
This research focuses on sound localization experiments in which subjects report the position of an active sound source by turning toward it. A statistical framework for the analysis of the data is presented together with a case study from a large-scale listening experiment. The statistical framework is based on a model that is robust to the presence of front/back confusions and random errors. Closed-form natural estimators are derived, and one-sample and two-sample statistical tests are described. The framework is used to analyze the data of an auralized experiment undertaken by nearly nine hundred subjects. The objective was to explore localization performance in the horizontal plane in an informal setting and with little training, which are conditions that are similar to those typically encountered in consumer applications of binaural audio. Results show that responses had a rightward bias and that speech was harder to localize than percussion sounds, which are results consistent with the literature. Results also show that it was harder to localize sound in a simulated room with a high ceiling despite having a higher direct-to-reverberant ratio than other simulated rooms.
Auralization schemes in the domain of automotive audio have primarily utilized dummy head recordings in the past. Recently, spatial reproduction allowed the auralization of cabin acoustics over large loudspeaker arrays. Yet no direct comparisons between those methods exist. In this study, the efficacy of headphone presentation is explored in this context. Six acoustical conditions were presented over headphones to experienced assessors (n=23), who were asked to compare them over six elicited perceptual attributes. In 24 out of 36 cases, the results indicate an agreement between headphone- and loudspeaker-based auralization of identical stimuli sets. It is concluded that, when compared to loudspeaker-based rendering, headphone-based rendering reveals similar judgment on timbral attributes, while certain spatial attributes should be assessed with caution.
Room Impulse Responses (RIRs) are typically measured using a set of microphones and a loudspeaker. When RIRs spanning a large volume are needed, many microphone measurements must be used to spatially sample the sound field. In order to reduce the number of microphone measurements, RIRs can be spatially interpolated. In the present study, RIR interpolation is formulated as an inverse problem. This inverse problem relies on a particular acoustic model capable of representing the measurements. Two different acoustic models are compared: the plane wave decomposition model and a novel time-domain model that consists of a collection of equivalent sources creating spherical waves. These acoustic models can both approximate any reverberant sound field created by a far field sound source. In order to produce an accurate RIR interpolation, sparsity regularization is employed when solving the inverse problem. In particular, by combining different acoustic models with different sparsity promoting regularizations, spatial sparsity, spatio-spectral sparsity and spatio-temporal sparsity are compared. The inverse problem is solved using a matrix-free large scale optimization algorithm. Simulations show that the best RIR interpolation is obtained when combining the novel time-domain acoustic model with the spatio-temporal sparsity regularization, outperforming the results of the plane wave decomposition model even when far fewer microphone measurements are available.
The multipole expansion method (MEM) is a spatial discretization technique that is widely used in applications that feature scattering of waves from circular cylinders. Moreover, it also serves as a key component in several other numerical methods in which scattering computations involving arbitrarily shaped objects are accelerated by enclosing the objects in artificial cylinders. A fundamental question is that of how fast the approximation error of the MEM converges to zero as the truncation number goes to infinity. Despite the fact that the MEM was introduced in 1913, and has been in widespread usage as a numerical technique since as far back as 1955, a precise characterization of the asymptotic rate of convergence of the MEM has not been obtained. In this work, we provide a resolution to this issue. While our focus in this paper is on the Dirichlet scattering problem, this is merely for convenience and our results actually establish convergence rates that hold for all MEM formulations irrespective of the specific boundary conditions or boundary integral equation solution representation chosen.
Acoustic source localization and dereverberation are formulated jointly as an inverse problem. The inverse problem consists of the approximation of the sound field measured by a set of microphones. The recorded sound pressure is matched with that of a particular acoustic model based on a collection of plane waves arriving from different directions at the microphone positions. In order to achieve meaningful results, spatial and spatio-spectral sparsity can be promoted in the weight signals controlling the plane waves. The large-scale optimization problem resulting from the inverse problem formulation is solved using a first order optimization algorithm combined with a weighted overlap-add procedure. It is shown that once the weight signals capable of effectively approximating the sound field are obtained, they can be readily used to localize a moving sound source in terms of direction of arrival (DOA) and to perform dereverberation in a highly reverberant environment. Results from simulation experiments and from real measurements show that the proposed algorithm is robust against both localized and diffuse noise exhibiting a noise reduction in the dereverberated signals.
A new publicly available dataset of microphone impulse responses (IRs) has been generated. The dataset covers 25 microphones, including a Class-1 measurement microphone, plus polar pattern variations for 7 of the microphones. Microphones were included having: omnidirectional, cardioid, supercardioid and bidirectional polar patterns; condenser, moving-coil and ribbon transduction types; single and dual diaphragms; multiple body and head basket shapes; small and large diaphragms; and end-address and side-address designs. Using a custom-developed computer-controlled precision turntable, IRs were captured quasi-anechoically at incident angles from 0º to 355º in steps of 5º, and at source-to-microphone distances of 0.5 m, 1.25 m and 5 m. The resulting dataset is suitable for perceptual and objective studies related to the incident-angle-dependent response of microphones, as well as for the development of tools for predicting and emulating on- and off-axis microphone characteristics. The captured IRs allow generation of frequency response plots with a degree of detail not commonly available in manufacturer-supplied data sheets, and are also particularly well suited to harmonic distortion analysis.
In this paper, source localization and dereverberation are formulated jointly as an inverse problem. The inverse problem consists in the interpolation of the sound field measured by a set of microphones by matching the recorded sound pressure with that of a particular acoustic model. This model is based on a collection of equivalent sources creating either spherical or plane waves. In order to achieve meaningful results, spatial, spatio-temporal and spatio-spectral sparsity can be promoted in the signals originating from the equivalent sources. The inverse problem consists of a large-scale optimization problem that is solved using a first order matrix-free optimization algorithm. It is shown that once the equivalent source signals capable of effectively interpolating the sound field are obtained, they can be readily used to localize a speech sound source in terms of Direction of Arrival (DOA) and to perform dereverberation in a highly reverberant environment.
Perceptual sound field reconstruction (PSR) is a spatial audio recording and reproduction method based on the application of stereophonic panning laws in microphone array design. PSR allows rendering a perceptually veridical and stable auditory perspective in the horizontal plane of the listener, and involves recording using near-coincident microphone arrays. This paper extends the PSR concept to three dimensions using sound field extrapolation carried out in the spherical-harmonic domain. Sound field rendering is performed using a two-level loudspeaker rig. An active-intensity-based analysis of the rendered sound field shows that the proposed approach can render direction of monochromatic plane waves accurately.
The steered response power (SRP) approach to acoustic source localization computes a map of the acoustic scene from the frequency-weighted output power of a beamformer steered towards a set of candidate locations. Equivalently, SRP may be expressed in terms of time-domain generalized cross-correlations (GCCs) at lags equal to the candidate locations' time-differences of arrival (TDOAs). Due to the dense grid of candidate locations, each of which requires inverse Fourier transform (IFT) evaluations, conventional SRP exhibits a high computational complexity. In this paper, we propose a low-complexity SRP approach based on Nyquist-Shannon sampling. Noting that on the one hand the range of possible TDOAs is physically bounded, while on the other hand the GCCs are band-limited, we critically sample the GCCs around their TDOA interval and approximate the SRP map by interpolation. In usual setups, the number of sample points can be orders of magnitude less than the number of candidate locations and frequency bins, yielding a significant reduction of IFT computations at a limited interpolation cost. Simulations comparing the proposed approximation with conventional SRP indicate low approximation errors and equal localization performance. A MATLAB implementation is available online.
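The physical bound on the TDOAs mentioned above follows from the triangle inequality: for a microphone pair separated by a distance d, no candidate location can produce a lag larger than d/c. The sketch below only illustrates that bound and is not the published MATLAB implementation; the microphone positions, candidate locations and the speed of sound are assumed values.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def tdoa(candidate, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Time-difference of arrival (seconds) of a candidate location."""
    return (math.dist(candidate, mic_a) - math.dist(candidate, mic_b)) / c

def max_tdoa(mic_a, mic_b, c=SPEED_OF_SOUND):
    """Largest possible |TDOA| for a microphone pair (triangle inequality)."""
    return math.dist(mic_a, mic_b) / c

# Hypothetical microphone pair 20 cm apart and three candidate locations.
mic_a = (0.0, 0.0, 0.0)
mic_b = (0.2, 0.0, 0.0)
candidates = [(1.0, 2.0, 0.5), (-3.0, 0.1, 1.0), (0.1, 5.0, 0.0)]

lags = [tdoa(p, mic_a, mic_b) for p in candidates]
bound = max_tdoa(mic_a, mic_b)
```

However dense the candidate grid, every lag falls inside the interval from -bound to +bound, so the GCC only ever needs to be evaluated on that interval.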
Media and entertainment companies have embraced immersive audio technology for cinema, television, games, and music. Meanwhile, in recent years there has been a rise in the number of organizations welcoming underrepresented groups to the field of audio. However, although some disciplines such as music recording are seeing an increase in participation, others are not keeping pace. Immersive and spatial audio are disciplines in which diversity is measurably lacking. Audio-based mixed-gender social media groups are comprised of less than 10% women and minorities, and groups dedicated to immersive audio exhibit poorer representation. Barriers to entry are societal as well as economic; however, outreach, networking opportunities, mentoring, and affordable education are remedies that have been shown to be effective for related industries and should be adopted by the immersive audio industry.
This report introduces a new database of room impulse responses (RIRs) measured in an empty rectangular room using subwoofers as sound sources. The purpose of this database, publicly available for download, is to provide acoustic measurements within the frequency region of modal resonances. Performing acoustic measurements at low frequencies presents many difficulties, mainly related to ambient noise and to unavoidable nonlinearities of the subwoofer. In this report, it is shown that these issues can be addressed and partially solved by means of the exponential sine-sweep technique and a careful calibration of the measurement equipment. A procedure for estimating the reverberation time at very low frequencies is proposed, which uses a cosine-modulated filterbank and an approximation of the RIRs using parametric models in order to reduce problems related to low signal-to-noise ratio and to the length of typical band-pass filter responses.
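For readers unfamiliar with the exponential sine-sweep technique mentioned above, the excitation signal has an instantaneous frequency that grows exponentially from a start frequency to an end frequency. The sketch below uses the standard closed-form expression for such a sweep; the frequency range, duration and sampling rate are hypothetical and are not the settings used for the database.

```python
import math

def exponential_sine_sweep(f_start, f_end, duration, fs):
    """Sine sweep whose frequency rises exponentially from f_start to f_end."""
    rate = math.log(f_end / f_start)
    scale = 2.0 * math.pi * f_start * duration / rate
    n_samples = int(round(duration * fs))
    # Phase is scale * (exp(t * rate / duration) - 1), which is zero at t = 0.
    return [math.sin(scale * (math.exp(n / fs / duration * rate) - 1.0))
            for n in range(n_samples)]

# Hypothetical low-frequency sweep: 10 Hz to 200 Hz over 2 s at 1 kHz.
sweep = exponential_sine_sweep(10.0, 200.0, 2.0, 1000)
```

Differentiating the phase gives an instantaneous frequency of f_start * exp(t * rate / duration), which equals f_end at the end of the sweep.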
Parametric modeling of room acoustics aims at representing room transfer functions (RTFs) by means of digital filters and finds application in many acoustic signal enhancement algorithms. In previous work by other authors, the use of orthonormal basis functions (OBFs) for modeling room acoustics has been proposed. Some advantages of OBF models over all-zero and pole-zero models have been illustrated, mainly focusing on the fact that OBF models typically require less model parameters to provide the same model accuracy. In this paper, it is shown that the orthogonality of the OBF model brings several additional advantages, which can be exploited if a suitable algorithm for identifying the OBF model parameters is applied. Specifically, the orthogonality of OBF models does not only lead to improved model efficiency (as pointed out in previous work), but also leads to improved model scalability and model stability. Its appealing scalability property derives from a previously unexplored interpretation of the OBF model as an approximation to a solution of the inhomogeneous acoustic wave equation. Following this interpretation, a novel identification algorithm is proposed that takes advantage of the OBF model orthogonality to deliver efficient, scalable and stable OBF model estimates, which is not necessarily the case for nonlinear estimation techniques that are normally applied.
Developments in immersive audio technologies have been evolving in two directions: physically motivated and perceptually motivated systems. Physically motivated techniques aim to reproduce a physically accurate approximation of desired sound fields by employing a very high equipment load and sophisticated computationally intensive algorithms. Perceptually motivated techniques, on the other hand, aim to render only the perceptually relevant aspects of the sound scene by means of modest computational and equipment load. This article presents an overview of perceptually motivated techniques, with a focus on multichannel audio recording and reproduction, audio source and reflection culling, and artificial reverberators.
This paper studies the effects of inter-channel time and level differences in stereophonic reproduction on perceived localization uncertainty, which is defined as how difficult it is for a listener to tell where a sound source is located. Towards this end, a computational model of localization uncertainty is proposed first. The model calculates inter-aural time and level difference cues, and compares them to those associated with free-field point-like sources. The comparison is carried out using a particular distance functional that replicates the increased uncertainty observed experimentally with inconsistent inter-aural time and level difference cues. The model is validated by formal listening tests, achieving a Pearson correlation of 0.99. The model is then used to predict localization uncertainty for stereophonic setups and a listener in central and off-central positions. Results show that amplitude methods achieve a slightly lower localization uncertainty for a listener positioned exactly in the center of the sweet spot. As soon as the listener moves away from that position, the situation reverses, with time-amplitude methods achieving a lower localization uncertainty.
Artificial reverberators provide a computationally viable alternative to full-scale room acoustics simulation methods for deployment in interactive, immersive systems. The scattering delay network (SDN) is an artificial reverberator that allows direct parametric control over the geometry of a simulated cuboid enclosure, as well as the directional characteristics of the simulated sound sources and microphones. This paper extends the concept of SDN reverberators to multiple enclosures coupled via an aperture. The extension allows independent control of the acoustical properties of the coupled enclosures and the size of the connecting aperture. Transfer functions of the coupled-volume SDN are derived. The effectiveness of the proposed method is evaluated in terms of rendered energy decay curves in comparison to full-scale ray-tracing models and scale model measurements.
The image-source method models the specular reflection from a plane by means of a secondary source positioned at the source’s reflected image. The method has been widely used in acoustics to model the reverberant field of rectangular rooms, but can also be used for general-shaped rooms and nonflat reflectors. This paper explores the relationship between the physical properties of a non-flat reflector and the statistical properties of the associated cloud of image-sources. It is shown here that the standard deviation of the image-sources is strongly correlated with the ratio between depth and width of the reflector’s spatial features.
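The basic step of the image-source method described above, mirroring the source across a flat reflecting plane, can be sketched as follows. This covers only the flat-reflector case, not the non-flat reflectors studied in the paper; the function name and the room coordinates are hypothetical.

```python
def image_source(source, plane_point, plane_normal):
    """Mirror a point source across a plane given by a point and a normal."""
    norm_sq = sum(n * n for n in plane_normal)
    # Signed offset of the source from the plane, scaled by |normal|^2 so
    # that the normal does not need to have unit length.
    offset = sum((s - p) * n
                 for s, p, n in zip(source, plane_point, plane_normal)) / norm_sq
    return tuple(s - 2.0 * offset * n for s, n in zip(source, plane_normal))

# Hypothetical source in a room with the floor at z = 0 and a wall at x = 4.
source = (1.0, 2.0, 1.5)
floor_image = image_source(source, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
wall_image = image_source(source, (4.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

Here the floor image lies at (1.0, 2.0, -1.5) and the wall image at (7.0, 2.0, 1.5); applying the function again to an image recovers the original source.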
Parametric equalization of an acoustic system aims to compensate for the deviations of its response from a desired target response using parametric digital filters. An optimization procedure is presented for the automatic design of a low-order equalizer using parametric infinite impulse response (IIR) filters, specifically second-order peaking filters and first-order shelving filters. The proposed procedure minimizes the sum of square errors (SSE) between the system and the target complex frequency responses, instead of the commonly used difference in magnitudes, and exploits a previously unexplored orthogonality property of one particular type of parametric filter. This brings a series of advantages over the state-of-the-art procedures, such as an improved mathematical tractability of the equalization problem, with the possibility of computing analytical expressions for the gradients, an improved initialization of the parameters, including the global gain of the equalizer, the incorporation of shelving filters in the optimization procedure, and a more accentuated focus on the equalization of the more perceptually relevant frequency peaks. Examples of loudspeaker and room equalization are provided, as well as a note about extending the procedure to multi-point equalization and transfer function modeling.
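To make the cost function above concrete, the sketch below evaluates the sum of square errors between the equalized system and a target using complex frequency responses. It is not the paper's optimization procedure: the second-order peaking filter uses the widely used Audio EQ Cookbook parametrization, which may differ from the one in the paper, and the system, frequencies and filter settings are hypothetical.

```python
import cmath
import math

def peaking_coefficients(f0, gain_db, q, fs):
    """Second-order peaking filter (Audio EQ Cookbook form, an assumption)."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = (1.0 + alpha * amp, -2.0 * math.cos(w0), 1.0 - alpha * amp)
    a = (1.0 + alpha / amp, -2.0 * math.cos(w0), 1.0 - alpha / amp)
    return b, a

def frequency_response(b, a, f, fs):
    """Complex response of a second-order section at frequency f (Hz)."""
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    num = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
    den = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return num / den

def sum_of_square_errors(system, equalizer, target):
    """SSE between the equalized system and the target, on complex values."""
    return sum(abs(s * e - t) ** 2 for s, e, t in zip(system, equalizer, target))

fs = 48000.0
freqs = [100.0, 500.0, 1000.0, 2000.0, 8000.0]
# Hypothetical system with a +6 dB resonance at 1 kHz and a flat target.
sys_b, sys_a = peaking_coefficients(1000.0, 6.0, 2.0, fs)
eq_b, eq_a = peaking_coefficients(1000.0, -6.0, 2.0, fs)
system = [frequency_response(sys_b, sys_a, f, fs) for f in freqs]
equalizer = [frequency_response(eq_b, eq_a, f, fs) for f in freqs]
target = [1.0 + 0.0j for _ in freqs]

sse_unequalized = sum_of_square_errors(system, target, target)
sse_equalized = sum_of_square_errors(system, equalizer, target)
peak_gain = abs(frequency_response(sys_b, sys_a, 1000.0, fs))
```

In this parametrization a cut of -6 dB with the same centre frequency and Q is the exact inverse of the +6 dB boost, so the equalized error drops to numerical zero. Because the error is computed on complex values, phase deviations are penalized as well as magnitude deviations.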
It is known that applying a time-frequency binary mask to very noisy speech can improve its intelligibility but results in poor perceptual quality. In this paper we propose a new approach to applying a binary mask that combines the intelligibility gains of conventional binary masking with the perceptual quality gains of a classical speech enhancer. The binary mask is not applied directly as a time-frequency gain as in most previous studies. Instead, the mask is used to supply prior information to a classical speech enhancer about the probability of speech presence in different time-frequency regions. Using an oracle ideal binary mask, we show that the proposed method results in a higher predicted quality than other methods of applying a binary mask whilst preserving the improvements in predicted intelligibility.
- E. De Sena, H. Hacıhabiboğlu, Z. Cvetković, and J. O. Smith III "Efficient Synthesis of Room Acoustics via Scattering Delay Networks," IEEE/ACM Trans. on Audio Speech Language Process., vol. 23, no. 9, pp 1478 - 1492, Sept. 2015.
- E. De Sena, Niccoló Antonello, Marc Moonen, and Toon van Waterschoot, "On the modeling of rectangular geometries in room acoustic simulations," IEEE/ACM Trans. on Audio Speech Language Process., vol. 23, no. 4, Apr. 2015.
- E. De Sena, H. Hacıhabiboğlu, and Z. Cvetković - “Analysis and Design of Multichannel Systems for Perceptual Sound Field Reconstruction,” IEEE Trans. on Audio, Speech and Language Process., vol. 21 , no. 8, pp 1653-1665, Aug. 2013.
- E. De Sena, H. Hacıhabiboğlu and Z. Cvetković - "On the design and implementation of higher-order differential microphones," IEEE Trans. on Audio, Speech and Language Process., vol. 20, no. 1, pp 162-174, Jan. 2012.
- G. Vairetti, E. De Sena, M. Catrysse, S. H. Jensen, M. Moonen, and T. van Waterschoot, “Multichannel Identification of Room Acoustic Systems with Adaptive Filters based on Orthonormal Basis Functions,” IEEE Int. Conf. on Acoust. Speech and Signal Process. (ICASSP-16), Mar. 2016.
- N. Antonello, E. De Sena, M. Moonen, P. A. Naylor, T. van Waterschoot, "Sound field control in a reverberant room using the Finite Difference Time Domain method" in AES 60th Int. Conf., Leuven, Belgium, Feb. 2016.
- G. Vairetti, E. De Sena, M. Catrysse, S. H. Jensen, M. Moonen, T. van Waterschoot, "Room acoustic system identification using orthonormal basis function models," in AES 60th Int. Conf., Leuven, Belgium, Feb. 2016.
- C. S. J. Doire, M. Brookes, P. A. Naylor, E. De Sena, T. van Waterschoot, S. H. Jensen, “Acoustic Environment Control: Implementation of a Reverberation Enhancement System,” in AES 60th Int. Conf., Leuven, Belgium, Feb. 2016.
- E. De Sena, N. Kaplanis, P. A. Naylor, T. van Waterschoot, “Large-scale auralised sound localisation experiment,” in AES 60th Int. Conf., Leuven, Belgium, Feb. 2016.
- G. Vairetti, E. De Sena, T. van Waterschoot, M. Moonen, M. Catrysse, N. Kaplanis and S.H. Jensen, "A Physically-motivated Parametric Model for Compact Representation of Room Impulse Responses based on Orthonormal Basis Functions," in Proc. 10th European Congress and Exposition on Noise Control Engineering Maastricht, The Netherlands, June 2015.
- E. De Sena and Z. Cvetković, "A Computational Model for the Estimation of Localisation Uncertainty," IEEE Int. Conf. on Acoust. Speech and Signal Process. (ICASSP-13), May 2013, Vancouver, Canada.
- H. Hacıhabiboğlu, E. De Sena, and Z. Cvetković, "Frequency-Domain Scattering Delay Networks for Simulating Room Acoustics in Virtual Environments," in proceedings of the 7th ACM/IEEE International Conference on Signal Image Tech. and Internet-based Syst. (SITIS'11), Dijon, France, November 2011.
- E. De Sena, H. Hacıhabiboğlu and Z. Cvetković - “A Generalized Design Method for Directivity Patterns of Spherical Microphone Arrays”, in proceedings of IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP-11), May 2011, Prague, Czech Republic.
- E. De Sena, H. Hacıhabiboğlu and Z. Cvetković - “Scattering Delay Network: an Interactive Reverberator for Computer Games”, in AES 41st Int. Conf., February 2011, London, UK.
- H. Hacıhabiboğlu, E. De Sena and Z. Cvetković - “Design of a Circular Microphone Array for Panoramic Audio Recording and Reproduction: Microphone Directivity”, presented at the 128th Audio Engineering Society Convention, May 2010, London, UK.
- E. De Sena, H. Hacıhabiboğlu and Z. Cvetković - “Design of a Circular Microphone Array for Panoramic Audio Recording and Reproduction: Array Radius”, presented at the 128th Audio Engineering Society Convention, May 2010, London, UK.
- E. De Sena, H. Hacıhabiboğlu and Z. Cvetković - “Perceptual Evaluation of a Circularly Symmetric Microphone Array for Panoramic Recording of Audio”, in proceedings of the 2nd Int. Symposium on Ambisonics and Spherical Acoustics, May 2010, Paris, France.
- E. Giordano, E. De Sena, G. Pau and M. Gerla - “Vergilius: A Scenario Generator for Vanet”, in proceedings of IEEE 71st Vehicular Technology Conference (VTC), May 2010, Taipei, Taiwan.
- G. Marfia, G. Pau, E. Giordano, E. De Sena, M. Gerla – “VANET: On Mobility Scenarios and Urban Infrastructure. A Case Study”, in proceedings of MOVE Workshop in conjunction with IEEE INFOCOM 2007, May 2007, Alaska, USA.
- G. Marfia, G. Pau, E. De Sena, E. Giordano, M. Gerla – “Evaluating Vehicle Network Strategies for Downtown Portland: opportunistic infrastructure and the importance of realistic mobility models”, in proceedings of the First International Workshop on Mobile Opportunistic Networking ACM/SIGMOBILE MobiOpp 2007, in conjunction with MobiSys 2007, June 2007, Puerto Rico, USA.
- E. De Sena, H. Hacıhabiboğlu, and Z. Cvetković, inventors; King's College London, assignee, "Electronic Device with Digital Reverberator and Method", USPTO Patent n. 8,908,875, filed 2/2/2012, granted 09/12/2014.
- H. Hacıhabiboğlu, E. De Sena, and Z. Cvetković, inventors; King's College London, assignee, "Microphone array", USPTO Patent n. 8,976,977, filed 15/10/2010, granted 10/3/2015.