Enzo De Sena

Dr Enzo De Sena

Senior Lecturer in Audio, International Relations Officer (Department of Music and Media)


University roles and responsibilities

  • International Relations Officer for the Department of Music and Media


    Research interests


    Matteo Scerbo, Orchisama Das, Patrick Friend, Enzo De Sena (2022) Higher-order scattering delay networks for artificial reverberation, In: Proceedings of the 25th International Conference on Digital Audio Effects (DAFx20in22), Vienna, Austria, September 2022, Universität für Musik und darstellende Kunst Wien

    Computer simulations of room acoustics suffer from an efficiency vs accuracy trade-off, with highly accurate wave-based models being highly computationally expensive, and delay-network-based models lacking in physical accuracy. The Scattering Delay Network (SDN) is a highly efficient recursive structure that renders first order reflections exactly while approximating higher order ones. With the purpose of improving the accuracy of SDNs, in this paper, several variations on SDNs are investigated, including appropriate node placement for exact modeling of higher order reflections, redesigned scattering matrices for physically-motivated scattering, and pruned network connections for reduced computational complexity. The results of these variations are compared to state-of-the-art geometric acoustic models for different shoebox room simulations. Objective measures (Normalized Echo Densities (NEDs) and Energy Decay Curves (EDCs)) showed a close match between the proposed methods and the references. A formal listening test was carried out to evaluate differences in perceived naturalness of the synthesized Room Impulse Responses. Results show that increasing SDNs' order and adding directional scattering in a fully-connected network improves perceived naturalness, and higher-order pruned networks give similar performance at a much lower computational cost.

    Joshua Mannall, Orchisama Das, Paul Calamia, Enzo De Sena (2022) Perceptual evaluation of low-complexity diffraction models from a single edge

    Geometric acoustic models have a lower computational complexity than wave-based methods due to the assumption that sound propagates as rays; however, this fails to consider the wave-like properties of sound such as diffraction. Historically, the Biot-Tolstoy-Medwin (BTM) model and the Uniform Theory of Diffraction (UTD) have been used to augment geometric acoustic models with diffraction. Computational efficiency is essential for real-time application and recently two more efficient models, the Volumetric Diffraction and Transmission (VDaT) model and an infinite impulse response filter (IIR) approximation, were proposed to approximate these solutions. A higher-order IIR filter approximation is proposed in this paper. An experiment is carried out to evaluate the perceived naturalness of these approximations compared to the more accurate analytical solutions. Stationary and moving receivers were considered in simple geometries with a single edge. The results suggest that the higher-order IIR approximation is perceptually similar to the BTM model. VDaT and the low-order IIR approximation were found to be less natural in some cases, while in dynamic scenes VDaT was found to be significantly more natural than the other models. The experiment was limited in scope by the simplicity of the scenes considered; however, the results suggest the models are perceptually similar. Improvements to the higher-order IIR approximation are suggested and a recommendation is made for future perceptual evaluations.

    Thomas Potter, Zoran Cvetkovic, Enzo De Sena (2022) On the Relative Importance of Visual and Spatial Audio Rendering on VR Immersion, In: Front. Signal Process. - Audio and Acoustic Signal Processing, Frontiers Media

    A study was performed using a virtual environment to investigate the relative importance of spatial audio fidelity and video resolution on perceived audiovisual quality and immersion. Subjects wore a head-mounted display and headphones and were presented with a virtual environment featuring music and speech stimuli using three levels each of spatial audio quality and video resolution. Spatial audio was rendered monaurally, binaurally with head-tracking, and binaurally with head-tracking and room acoustic rendering. Video was rendered at resolutions of 0.5 megapixels per eye, 1.5 megapixels per eye, and 2.5 megapixels per eye. Results showed that both video resolution and spatial audio rendering had a statistically significant effect on both immersion and audiovisual quality. Most strikingly, the results showed that under the conditions that were tested in the experiment, the addition of room acoustic rendering to head-tracked binaural audio had the same improvement on immersion as increasing the video resolution five-fold, from 0.5 megapixels per eye to 2.5 megapixels per eye.

    D Pelegrín-García, Enzo De Sena, T van Waterschoot, M Rychtarikova, C Glorieux (2018) Localization of a Virtual Wall by Means of Active Echolocation by Untrained Sighted Persons, In: Applied Acoustics 139, pp. 82-92, Elsevier

    The active sensing and perception of the environment by auditory means is typically known as echolocation and it can be acquired by humans, who can profit from it in the absence of vision. We investigated the ability of twenty-one untrained sighted participants to use echolocation with self-generated oral clicks for aligning themselves within the horizontal plane towards a virtual wall, emulated with an acoustic virtual reality system, at distances between 1 and 32 m, in the absence of background noise and reverberation. Participants were able to detect the virtual wall on 61% of the trials, although with large differences across individuals and distances. The use of louder and shorter clicks led to an increased performance, whereas the use of clicks with lower frequency content allowed for the use of interaural time differences to improve the accuracy of reflection localization at very long distances. The distance of 2 m was the most difficult to detect and localize, whereas the furthest distances of 16 and 32 m were the easiest ones. Thus, echolocation may be used effectively to identify large distant environmental landmarks such as buildings.

    Enzo De Sena, Mike Brookes, Patrick A. Naylor, Toon van Waterschoot (2017) Localization experiments with reporting by head orientation: statistical framework and case study, In: Journal of the Audio Engineering Society 65(12), pp. 982-996, Audio Engineering Society

    This research focuses on sound localization experiments in which subjects report the position of an active sound source by turning toward it. A statistical framework for the analysis of the data is presented together with a case study from a large-scale listening experiment. The statistical framework is based on a model that is robust to the presence of front/back confusions and random errors. Closed-form natural estimators are derived, and one-sample and two-sample statistical tests are described. The framework is used to analyze the data of an auralized experiment undertaken by nearly nine hundred subjects. The objective was to explore localization performance in the horizontal plane in an informal setting and with little training, which are conditions that are similar to those typically encountered in consumer applications of binaural audio. Results show that responses had a rightward bias and that speech was harder to localize than percussion sounds, which are results consistent with the literature. Results also show that it was harder to localize sound in a simulated room with a high ceiling despite having a higher direct-to-reverberant ratio than other simulated rooms.

    Jessica Camilleri, Neofytos Kaplanis, Enzo De Sena (2019) Evaluation of Car Cabin Acoustics Using Auralisation over Headphones, In: Tony Tew, Duncan Williams (eds.), Proceedings 2019 AES International Conference on Immersive and Interactive Audio, Audio Engineering Society

    The auralization schemes in the domain of automotive audio have primarily utilized dummy head recordings in the past. Recently, spatial reproduction allowed the auralization of cabin acoustics over large loudspeaker arrays. Yet no direct comparisons between those methods exist. In this study, the efficacy of headphone presentation is explored in this context. Six acoustical conditions were presented over headphones to experienced assessors (n=23), who were asked to compare them over six elicited perceptual attributes. In 24 out of 36 cases, the results indicate an agreement between headphone- and loudspeaker-based auralisation of identical stimuli sets. It is concluded that, when compared to loudspeakers-based rendering, headphones-based rendering reveals similar judgment on timbral attributes, while certain spatial attributes should be assessed with caution.

    N Antonello, Enzo De Sena, M Moonen, PA Naylor, T van Waterschoot (2017) Room impulse response interpolation using a sparse spatio-temporal representation of the sound field, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(10), pp. 1929-1941, IEEE

    Room Impulse Responses (RIRs) are typically measured using a set of microphones and a loudspeaker. When RIRs spanning a large volume are needed, many microphone measurements must be used to spatially sample the sound field. In order to reduce the number of microphone measurements, RIRs can be spatially interpolated. In the present study, RIR interpolation is formulated as an inverse problem. This inverse problem relies on a particular acoustic model capable of representing the measurements. Two different acoustic models are compared: the plane wave decomposition model and a novel time-domain model that consists of a collection of equivalent sources creating spherical waves. These acoustic models can both approximate any reverberant sound field created by a far field sound source. In order to produce an accurate RIR interpolation, sparsity regularization is employed when solving the inverse problem. In particular, by combining different acoustic models with different sparsity promoting regularizations, spatial sparsity, spatio-spectral sparsity and spatio-temporal sparsity are compared. The inverse problem is solved using a matrix-free large scale optimization algorithm. Simulations show that the best RIR interpolation is obtained when combining the novel time-domain acoustic model with the spatio-temporal sparsity regularization, outperforming the results of the plane wave decomposition model even when far fewer microphone measurements are available.

    Enzo De Sena, Brian Fitzpatrick, Toon Van Waterschoot (2021) On the Convergence of the Multipole Expansion Method, In: SIAM Journal on Numerical Analysis, Society for Industrial and Applied Mathematics

    The multipole expansion method (MEM) is a spatial discretization technique that is widely used in applications that feature scattering of waves from circular cylinders. Moreover, it also serves as a key component in several other numerical methods in which scattering computations involving arbitrarily shaped objects are accelerated by enclosing the objects in artificial cylinders. A fundamental question is that of how fast the approximation error of the MEM converges to zero as the truncation number goes to infinity. Despite the fact that the MEM was introduced in 1913, and has been in widespread usage as a numerical technique since as far back as 1955, a precise characterization of the asymptotic rate of convergence of the MEM has not been obtained. In this work, we provide a resolution to this issue. While our focus in this paper is on the Dirichlet scattering problem, this is merely for convenience and our results actually establish convergence rates that hold for all MEM formulations irrespective of the specific boundary conditions or boundary integral equation solution representation chosen.

    Niccolò Antonello, Enzo De Sena, Marc Moonen, Patrick A. Naylor, Toon van Waterschoot (2019) Joint acoustic localization and dereverberation through plane wave decomposition and sparse regularization, In: IEEE Transactions on Audio, Speech and Language Processing 27(12), pp. 1893-1905, IEEE

    Acoustic source localization and dereverberation are formulated jointly as an inverse problem. The inverse problem consists of the approximation of the sound field measured by a set of microphones. The recorded sound pressure is matched with that of a particular acoustic model based on a collection of plane waves arriving from different directions at the microphone positions. In order to achieve meaningful results, spatial and spatio-spectral sparsity can be promoted in the weight signals controlling the plane waves. The large-scale optimization problem resulting from the inverse problem formulation is solved using a first order optimization algorithm combined with a weighted overlap-add procedure. It is shown that once the weight signals capable of effectively approximating the sound field are obtained, they can be readily used to localize a moving sound source in terms of direction of arrival (DOA) and to perform dereverberation in a highly reverberant environment. Results from simulation experiments and from real measurements show that the proposed algorithm is robust against both localized and diffuse noise exhibiting a noise reduction in the dereverberated signals.

    Juan Franco, Bogdan Băcilă, Tim Brookes, Enzo De Sena (2022) A multi-angle, multi-distance dataset of microphone impulse responses, In: Journal of the Audio Engineering Society

    A new publicly available dataset of microphone impulse responses (IRs) has been generated. The dataset covers 25 microphones, including a Class-1 measurement microphone, plus polar pattern variations for 7 of the microphones. Microphones were included having: omnidirectional, cardioid, supercardioid and bidirectional polar patterns; condenser, moving-coil and ribbon transduction types; single and dual diaphragms; multiple body and head basket shapes; small and large diaphragms; and end-address and side-address designs. Using a custom-developed computer-controlled precision turntable, IRs were captured quasi-anechoically at incident angles from 0° to 355° in steps of 5°, and at source-to-microphone distances of 0.5 m, 1.25 m and 5 m. The resulting dataset is suitable for perceptual and objective studies related to the incident-angle-dependent response of microphones, as well as for the development of tools for predicting and emulating on- and off-axis microphone characteristics. The captured IRs allow generation of frequency response plots with a degree of detail not commonly available in manufacturer-supplied data sheets, and are also particularly well suited to harmonic distortion analysis.

    Niccolò Antonello, Enzo De Sena, Marc Moonen, Patrick A. Naylor, Toon van Waterschoot (2018) Joint source localization and dereverberation by sound field interpolation using sparse regularization, In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing 2018, pp. 6892-6896, Institute of Electrical and Electronics Engineers (IEEE)

    In this paper, source localization and dereverberation are formulated jointly as an inverse problem. The inverse problem consists in the interpolation of the sound field measured by a set of microphones by matching the recorded sound pressure with that of a particular acoustic model. This model is based on a collection of equivalent sources creating either spherical or plane waves. In order to achieve meaningful results, spatial, spatio-temporal and spatio-spectral sparsity can be promoted in the signals originating from the equivalent sources. The inverse problem consists of a large-scale optimization problem that is solved using a first order matrix-free optimization algorithm. It is shown that once the equivalent source signals capable of effectively interpolating the sound field are obtained, they can be readily used to localize a speech sound source in terms of Direction of Arrival (DOA) and to perform dereverberation in a highly reverberant environment.

    Ege Erdem, Enzo De Sena, Huseyin Hacihabiboglu, Zoran Cvetkovic (2019) Perceptual Soundfield Reconstruction in Three Dimensions via Sound Field Extrapolation, In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8023-8027, IEEE

    Perceptual sound field reconstruction (PSR) is a spatial audio recording and reproduction method based on the application of stereophonic panning laws in microphone array design. PSR allows rendering a perceptually veridical and stable auditory perspective in the horizontal plane of the listener, and involves recording using near-coincident microphone arrays. This paper extends the PSR concept to three dimensions using sound field extrapolation carried out in the spherical-harmonic domain. Sound field rendering is performed using a two-level loudspeaker rig. An active-intensity-based analysis of the rendered sound field shows that the proposed approach can render direction of monochromatic plane waves accurately.


    The steered response power (SRP) approach to acoustic source localization computes a map of the acoustic scene from the frequency-weighted output power of a beamformer steered towards a set of candidate locations. Equivalently, SRP may be expressed in terms of time-domain generalized cross-correlations (GCCs) at lags equal to the candidate locations' time-differences of arrival (TDOAs). Due to the dense grid of candidate locations, each of which requires inverse Fourier transform (IFT) evaluations, conventional SRP exhibits a high computational complexity. In this paper, we propose a low-complexity SRP approach based on Nyquist-Shannon sampling. Noting that on the one hand the range of possible TDOAs is physically bounded, while on the other hand the GCCs are band-limited, we critically sample the GCCs around their TDOA interval and approximate the SRP map by interpolation. In usual setups, the number of sample points can be orders of magnitude less than the number of candidate locations and frequency bins, yielding a significant reduction of IFT computations at a limited interpolation cost. Simulations comparing the proposed approximation with conventional SRP indicate low approximation errors and equal localization performance. A MATLAB implementation is available online.
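    The interpolation idea in the abstract above can be illustrated with a generic Whittaker-Shannon (sinc) interpolator: a band-limited function sampled on a uniform grid is evaluated at an arbitrary lag. This is only a minimal sketch and not the paper's MATLAB implementation; the function names `sinc` and `sinc_interpolate`, the finite sample window, and the cosine test signal are assumptions made for illustration.

```python
import math

def sinc(x):
    """Normalised sinc function, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_interpolate(samples, first_index, period, t):
    """Evaluate a band-limited signal at time t from uniform samples.

    samples[k] is the value at time (first_index + k) * period.
    Only a finite window of samples is used, so the result is approximate.
    """
    return sum(
        value * sinc((t - (first_index + k) * period) / period)
        for k, value in enumerate(samples)
    )

# Hypothetical test signal: a cosine at 0.1 cycles per sample,
# well below the Nyquist limit of 0.5 cycles per sample.
freq = 0.1
grid = range(-50, 51)
samples = [math.cos(2.0 * math.pi * freq * n) for n in grid]

# Evaluate between grid points and compare with the true value.
t = 0.25
approx = sinc_interpolate(samples, -50, 1.0, t)
exact = math.cos(2.0 * math.pi * freq * t)
print(abs(approx - exact))  # small truncation error from the finite window
```

    In the setting the abstract describes, the samples would be GCC values taken around the physically possible TDOA interval, and the interpolator would be evaluated at each candidate location's TDOA.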

    Leslie Gaston-Bird, Russell David Mason, Enzo De Sena (2021) Inclusivity in Immersive Audio: Current Participation and Barriers to Entry

    Media and entertainment companies have embraced immersive audio technology for cinema, television, games, and music. Meanwhile, in recent years there has been a rise in the number of organizations welcoming underrepresented groups to the field of audio. However, although some disciplines such as music recording are seeing an increase in participation, others are not keeping pace. Immersive and spatial audio are disciplines in which diversity is measurably lacking. Audio-based mixed-gender social media groups are comprised of less than 10% women and minorities, and groups dedicated to immersive audio exhibit poorer representation. Barriers to entry are societal as well as economic; however, outreach, networking opportunities, mentoring, and affordable education are remedies that have been shown to be effective for related industries and should be adopted by the immersive audio industry.

    G Vairetti, N Kaplanis, Enzo De Sena, SH Jensen, S Bech, M Moonen, T van Waterschoot (2017) The Subwoofer Room Impulse Response (SUBRIR) database, In: Journal of the Audio Engineering Society 65(5), pp. 389-401, Audio Engineering Society

    This report introduces a new database of room impulse responses (RIRs) measured in an empty rectangular room using subwoofers as sound sources. The purpose of this database, publicly available for download, is to provide acoustic measurements within the frequency region of modal resonances. Performing acoustic measurements at low frequencies presents many difficulties, mainly related to ambient noise and to unavoidable nonlinearities of the subwoofer. In this report, it is shown that these issues can be addressed and partially solved by means of the exponential sine-sweep technique and a careful calibration of the measurement equipment. A procedure for estimating the reverberation time at very low frequencies is proposed, which uses a cosine-modulated filterbank and an approximation of the RIRs using parametric models in order to reduce problems related to low signal-to-noise ratio and to the length of typical band-pass filter responses.

    G Vairetti, Enzo De Sena, M Catrysse, SH Jensen, M Moonen, T van Waterschoot (2017) A Scalable Algorithm for Physically Motivated and Sparse Approximation of Room Impulse Responses with Orthonormal Basis Functions, In: IEEE/ACM Trans. Audio, Speech and Language Processing 25(7), pp. 1547-1561, IEEE

    Parametric modeling of room acoustics aims at representing room transfer functions (RTFs) by means of digital filters and finds application in many acoustic signal enhancement algorithms. In previous work by other authors, the use of orthonormal basis functions (OBFs) for modeling room acoustics has been proposed. Some advantages of OBF models over all-zero and pole-zero models have been illustrated, mainly focusing on the fact that OBF models typically require fewer model parameters to provide the same model accuracy. In this paper, it is shown that the orthogonality of the OBF model brings several additional advantages, which can be exploited if a suitable algorithm for identifying the OBF model parameters is applied. Specifically, the orthogonality of OBF models does not only lead to improved model efficiency (as pointed out in previous work), but also leads to improved model scalability and model stability. Its appealing scalability property derives from a previously unexplored interpretation of the OBF model as an approximation to a solution of the inhomogeneous acoustic wave equation. Following this interpretation, a novel identification algorithm is proposed that takes advantage of the OBF model orthogonality to deliver efficient, scalable and stable OBF model estimates, which is not necessarily the case for nonlinear estimation techniques that are normally applied.

    H Hacıhabiboglu, Enzo De Sena, Z Cvetkovic, J Johnston, JO Smith (2017) Perceptual Spatial Audio Recording, Simulation, and Rendering: An overview of spatial-audio techniques based on psychoacoustics, In: IEEE Signal Processing Magazine 34(3), pp. 36-54, IEEE

    Developments in immersive audio technologies have been evolving in two directions: physically motivated and perceptually motivated systems. Physically motivated techniques aim to reproduce a physically accurate approximation of desired sound fields by employing a very high equipment load and sophisticated computationally intensive algorithms. Perceptually motivated techniques, on the other hand, aim to render only the perceptually relevant aspects of the sound scene by means of modest computational and equipment load. This article presents an overview of perceptually motivated techniques, with a focus on multichannel audio recording and reproduction, audio source and reflection culling, and artificial reverberators.

    Enzo De Sena, Zoran Cvetkovic, Huseyin Hacıhabiboglu, Marc Moonen, Toon van Waterschoot (2020) Localization Uncertainty in Time-Amplitude Stereophonic Reproduction, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, IEEE

    This paper studies the effects of inter-channel time and level differences in stereophonic reproduction on perceived localization uncertainty, which is defined as how difficult it is for a listener to tell where a sound source is located. Towards this end, a computational model of localization uncertainty is proposed first. The model calculates inter-aural time and level difference cues, and compares them to those associated with free-field point-like sources. The comparison is carried out using a particular distance functional that replicates the increased uncertainty observed experimentally with inconsistent inter-aural time and level difference cues. The model is validated by formal listening tests, achieving a Pearson correlation of 0.99. The model is then used to predict localization uncertainty for stereophonic setups and a listener in central and off-central positions. Results show that amplitude methods achieve a slightly lower localization uncertainty for a listener positioned exactly in the center of the sweet spot. As soon as the listener moves away from that position, the situation reverses, with time-amplitude methods achieving a lower localization uncertainty.

    Timuçin Berk Atalay, Zühre Sü Gül, Enzo De Sena, Zoran Cvetkovic, Hüseyin Hacıhabiboğlu (2021) Scattering Delay Network Simulator of Coupled Volume Acoustics, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, IEEE

    Artificial reverberators provide a computationally viable alternative to full-scale room acoustics simulation methods for deployment in interactive, immersive systems. Scattering delay network (SDN) is an artificial reverberator that allows direct parametric control over the geometry of a simulated cuboid enclosure, as well as the directional characteristics of the simulated sound sources and microphones. This paper extends the concept of SDN reverberators to multiple enclosures coupled via an aperture. The extension allows independent control of the acoustical properties of the coupled enclosures and the size of the connecting aperture. Transfer functions of the coupled-volume SDN are derived. The effectiveness of the proposed method is evaluated in terms of rendered energy decay curves in comparison to full-scale ray-tracing models and scale model measurements.

    P.J. Dawson, E. De Sena, P. A. Naylor (2018) An acoustic image-source characterisation of surface profiles, In: Proceedings 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2130-2134, IEEE

    The image-source method models the specular reflection from a plane by means of a secondary source positioned at the source’s reflected image. The method has been widely used in acoustics to model the reverberant field of rectangular rooms, but can also be used for general-shaped rooms and non-flat reflectors. This paper explores the relationship between the physical properties of a non-flat reflector and the statistical properties of the associated cloud of image-sources. It is shown here that the standard deviation of the image-sources is strongly correlated with the ratio between depth and width of the reflector’s spatial features.
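    The basic construction mentioned in the abstract above, placing a secondary source at the mirror image of the real source, can be sketched for a single flat plane as follows. This is a minimal illustration only; the function name `image_source` and the description of the plane by a point and a normal vector are assumptions for this example and are not taken from the paper.

```python
import math

def image_source(source, plane_point, plane_normal):
    """Mirror `source` across the plane through `plane_point` with normal `plane_normal`."""
    # Normalise the plane normal so the projection below is a true distance.
    norm = math.sqrt(sum(c * c for c in plane_normal))
    n = [c / norm for c in plane_normal]
    # Signed distance from the source to the plane along the unit normal.
    d = sum((s - p) * c for s, p, c in zip(source, plane_point, n))
    # Move the source twice that distance through the plane.
    return [s - 2.0 * d * c for s, c in zip(source, n)]

# A source 1.5 m above a floor at z = 0 has its image 1.5 m below the floor.
print(image_source([2.0, 3.0, 1.5], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]))
# -> [2.0, 3.0, -1.5]
```

    For a non-flat reflector, one possible approach is to apply the same mirroring to each locally planar facet, which yields a cloud of image positions rather than a single point.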

    Giacomo Vairetti, Enzo De Sena, Michael Catrysse, Soren Holdt Jensen, Marc Moonen, Toon Van Waterschoot (2018) An Automatic Design Procedure for Low-order IIR Parametric Equalizers, In: Journal of the Audio Engineering Society 66(11), pp. 935-952, Audio Engineering Society

    Parametric equalization of an acoustic system aims to compensate for the deviations of its response from a desired target response using parametric digital filters. An optimization procedure is presented for the automatic design of a low-order equalizer using parametric infinite impulse response (IIR) filters, specifically second-order peaking filters and first-order shelving filters. The proposed procedure minimizes the sum of square errors (SSE) between the system and the target complex frequency responses, instead of the commonly used difference in magnitudes, and exploits a previously unexplored orthogonality property of one particular type of parametric filter. This brings a series of advantages over the state-of-the-art procedures, such as an improved mathematical tractability of the equalization problem, with the possibility of computing analytical expressions for the gradients, an improved initialization of the parameters, including the global gain of the equalizer, the incorporation of shelving filters in the optimization procedure, and a more accentuated focus on the equalization of the more perceptually relevant frequency peaks. Examples of loudspeaker and room equalization are provided, as well as a note about extending the procedure to multi-point equalization and transfer function modeling.

    L Lightburn, Enzo De Sena, A Moore, PA Naylor, M Brookes (2017) Improving the perceptual quality of ideal binary masked speech, In: Proceedings of ICASSP 2017, IEEE

    It is known that applying a time-frequency binary mask to very noisy speech can improve its intelligibility but results in poor perceptual quality. In this paper we propose a new approach to applying a binary mask that combines the intelligibility gains of conventional binary masking with the perceptual quality gains of a classical speech enhancer. The binary mask is not applied directly as a time-frequency gain as in most previous studies. Instead, the mask is used to supply prior information to a classical speech enhancer about the probability of speech presence in different time-frequency regions. Using an oracle ideal binary mask, we show that the proposed method results in a higher predicted quality than other methods of applying a binary mask whilst preserving the improvements in predicted intelligibility.

    Additional publications