1:30pm - 2:30pm

Friday 14 June 2024

Sound to Text: Automated Audio Captioning using Deep Learning

PhD Viva Open Presentation for Xinhao Mei

Online event - All Welcome!


University of Surrey



Automated audio captioning (AAC) is the task of describing the sound within an audio clip using a natural language sentence, bridging the gap between auditory perception and linguistic expression. AAC requires not only identifying the sound events and acoustic scenes within an audio clip but also interpreting their relationships and summarising the audio content in descriptive language. AAC has attracted significant attention and seen considerable progress in recent years, yet the field continues to face numerous challenges. This thesis studies automated audio captioning from three perspectives: model architectures, the data scarcity issue, and diversity in generated captions.

Since AAC translates sound into text, it is a sequence-to-sequence task, and existing approaches follow an encoder-decoder paradigm built on deep learning. Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are commonly employed as the audio encoder, but each has limitations in modelling lengthy audio signals. We introduce the Audio Captioning Transformer (ACT), a novel fully Transformer-based model that overcomes these limitations. The self-attention mechanism of the ACT model enables better modelling of both local and global dependencies in audio signals. Our findings highlight the critical role of the audio encoder in an AAC system.
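The advantage of self-attention mentioned above can be illustrated with a minimal sketch (not the actual ACT implementation): in scaled dot-product attention, every audio frame attends directly to every other frame, so long-range dependencies are captured in a single step rather than through a CNN's limited receptive field or an RNN's sequential bottleneck. All array shapes and the toy input are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention over a sequence of feature frames.
    Each output frame is a weighted mix of ALL input frames, so both
    nearby (local) and distant (global) dependencies are modelled."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ v, weights

# Toy "audio": 6 frames of 4-dimensional spectrogram-like features.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape)   # one contextualised vector per frame: (6, 4)
print(attn.shape)  # attention map, each row sums to 1: (6, 6)
```

In the self-attention case used here, queries, keys, and values all come from the same audio sequence; in a real Transformer each would first pass through a learned linear projection.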

Data scarcity is a major issue in the field of AAC. Collecting audio captioning datasets manually is expensive and time-consuming; as a result, existing audio captioning datasets are all limited in size. To address this issue, we source audio clips and their metadata (e.g., raw descriptions and audio tags) from three web platforms and one audio tagging dataset. We devise a three-stage processing pipeline that filters and transforms noisy raw descriptions into audio captions with the help of ChatGPT, a powerful conversational large language model. The result is the WavCaps dataset, the first large-scale, weakly-labelled audio captioning dataset for audio-language multimodal research, containing 403,050 audio clips with paired captions. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and achieve new state-of-the-art results on the main AAC benchmarks.
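A pipeline of this shape might be sketched as follows. This is a hypothetical illustration, not the WavCaps code: the filtering rules are assumptions, and the ChatGPT rewriting stage is stubbed with a simple normaliser rather than a real API call.

```python
import re
from typing import Optional

def stage1_filter(raw: Optional[str]) -> Optional[str]:
    """Stage 1 (sketch): drop clips whose metadata is unusable,
    e.g. missing, too short, or consisting only of a link."""
    if raw is None or len(raw.split()) < 3:
        return None
    if re.search(r"https?://", raw):
        return None
    return raw.strip()

def stage2_clean(raw: str) -> str:
    """Stage 2 (sketch): rule-based cleaning before the LLM pass,
    e.g. collapsing whitespace and stripping tags/handles."""
    text = re.sub(r"[#@]\w+", "", raw)
    return re.sub(r"\s+", " ", text).strip()

def stage3_rewrite(text: str) -> str:
    """Stage 3 (sketch): in WavCaps this step prompts ChatGPT to rewrite
    the cleaned description into a fluent caption; stubbed here with
    simple sentence-casing so the pipeline runs offline."""
    caption = text.rstrip(".").lower()
    return caption[0].upper() + caption[1:] + "."

def process(raw: Optional[str]) -> Optional[str]:
    """Run a raw web description through all three stages."""
    kept = stage1_filter(raw)
    if kept is None:
        return None
    return stage3_rewrite(stage2_clean(kept))

print(process("dog barking in   the park #fieldrecording"))
print(process("my clip"))  # too short: dropped
```

The key design point the sketch conveys is that cheap rule-based stages discard clearly unusable metadata first, so the expensive LLM rewriting stage only sees candidates worth keeping.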

Finally, different people may interpret and describe the same audio scene in diverse ways, leading to a wide range of possible captions for a single audio clip. However, captions generated by existing audio captioning systems are deterministic, simple and generic. We argue that an effective audio captioning system should be capable of producing diverse captions for both a single audio clip and across similar clips. To achieve this, we introduce an adversarial training framework utilizing a conditional generative adversarial network (C-GAN) to enhance the diversity of audio captioning systems. Our experiments on the Clotho dataset demonstrate that our model outperforms state-of-the-art methods in generating more diverse captions.
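The conditional adversarial objective behind such a framework can be sketched in a few lines. This is a toy illustration of the standard C-GAN losses only, with embeddings and a linear discriminator invented for the example; the actual system must also handle the discreteness of text (e.g. via Gumbel-softmax or policy gradients), which is omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(caption_emb, audio_emb, w):
    """Toy conditional discriminator: scores how plausible a caption
    embedding is GIVEN its conditioning audio embedding. A linear
    stand-in for a learned network."""
    return sigmoid(w @ np.concatenate([caption_emb, audio_emb]))

def cgan_losses(d_real, d_fake):
    """Standard (non-saturating) conditional GAN losses: the
    discriminator learns to separate human captions from generated
    ones, while the generator is rewarded for captions that fool it,
    pushing it away from a single safe, generic output."""
    d_loss = -np.log(d_real) - np.log(1.0 - d_fake)
    g_loss = -np.log(d_fake)
    return d_loss, g_loss

# Toy embeddings for one audio clip, a human caption, and a generated one.
rng = np.random.default_rng(1)
w = rng.standard_normal(8)
audio = rng.standard_normal(4)
d_real = discriminator(rng.standard_normal(4), audio, w)
d_fake = discriminator(rng.standard_normal(4), audio, w)
d_loss, g_loss = cgan_losses(d_real, d_fake)
print(d_loss, g_loss)
```

Conditioning the discriminator on the audio is what distinguishes a C-GAN from a plain GAN here: a caption is judged not just for fluency but for matching this clip, so diverse yet relevant captions score well.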