4pm - 5pm
Thursday 20 November 2025
High Quality Audio Generation with Latent Diffusion Models
PhD Viva Open Presentation - Haohe Liu
Online event - All Welcome!
Free
Abstract:
High-quality audio generation remains a core challenge in artificial intelligence, with applications in content creation, virtual environments, and assistive technologies. Addressing this challenge effectively requires intuitive user interaction, and our preliminary user study on AI-assisted sound-search systems suggests that natural language provides a flexible and user-friendly means of interacting with audio. Motivated by this insight, we focus on text-to-audio generation and explore how diffusion-based models can improve the diversity, quality, and efficiency of audio generation. This thesis introduces AudioLDM and its successor AudioLDM 2 for text-to-audio generation, followed by AudioSR for audio super-resolution, and SemantiCodec for efficient, semantically aware audio compression.
Previous work on text-to-audio generation is often limited in the scope of audio it can generate: models may not follow the text prompt well, and the audio they produce is often restricted to a narrow set of categories. To address these challenges, we propose AudioLDM, a text-to-audio model that accepts open-domain natural language as input and generates audio that follows the given description. To further explore combining the advantages of diffusion modelling and language modelling, we introduce AudioLDM 2, which enables the generation of speech, music, and environmental sound effects from text within a shared architecture, with improved generation quality over AudioLDM.
To further enhance the output quality of audio generation models, we propose AudioSR, a latent diffusion-based model specialised in audio super-resolution. Unlike previous methods that focus on narrow domains such as speech, AudioSR is designed to enhance a broad spectrum of sounds, including speech, music, and general sound effects, and uniquely supports input audio at flexible sampling rates, making it robust for diverse real-world use cases. Our experiments show that AudioSR effectively restores missing high-frequency details with significantly improved perceptual quality, as verified on both real audio recordings and generative model outputs.
Finally, we explore compact and semantically rich latent spaces for future audio generation models. Existing language-modelling-based audio generation models often rely on high-bitrate codecs, leading to high computational costs, limited semantic abstraction in the audio tokens, and inefficient downstream model training. To address these challenges, we propose SemantiCodec, an ultra-low-bitrate codec that encodes audio into compressed but semantically rich representations. Building upon self-supervised learning features, SemantiCodec achieves a good trade-off between compression efficiency and reconstruction quality, outperforming existing neural codecs at significantly lower bitrates.
