Xubo Liu

Postgraduate Research Student


My research project


Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark David Plumbley (2024)AudioLDM 2: Learning holistic audio generation with self-supervised pretraining, In: AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining IEEE

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a holistic framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework utilizes a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate other modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on the LOA of audio in our training set. The proposed framework naturally brings advantages such as reusable self-supervised pretrained latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech with three AudioLDM 2 variants demonstrate competitive performance of the AudioLDM 2 framework against previous approaches.

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark Plumbley (2023)AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining, In: AudioLDM 2: Learning holistic audio generation with self-supervised pretraining Cornell University Library, arXiv.org

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at https://audioldm.github.io/audioldm2.

Jinzheng Zhao, Peipei Wu, Shidrokh Goudarzi, Xubo Liu, Jianyuan Sun, Yong Xu, Wenwu Wang (2022)Visually Assisted Self-supervised Audio Speaker Localization and Tracking, In: 2022 30th European Signal Processing Conference (EUSIPCO) EUSIPCO

—Training a robust tracker of objects (such as vehicles and people) using audio and visual information often needs a large amount of labelled data, which is difficult to obtain as manual annotation is expensive and time-consuming. The natural synchronization of the audio and visual modalities enables the object tracker to be trained in a self-supervised manner. In this work, we propose to localize an audio source (i.e., speaker) using a teacher-student paradigm, where the visual network teaches the audio network by knowledge distillation to localize speakers. The introduction of multi-task learning, by training the audio network to perform source localization and semantic segmentation jointly, further improves the model performance. Experimental results show that the audio localization network can learn from visual information and achieve competitive tracking performance as compared to the baseline methods that are based on the audio-only measurements. The proposed method can provide more reliable measurements for tracking than the traditional sound source localization methods, and the generated audio features aid in visual tracking.

Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D Plumbley, Wenwu Wang (2023)Retrieval-Augmented Text-to-Audio Generation

Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.


Audio captioning aims at using language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and then a language decoder is used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) have recently emerged as a useful method to mitigate this issue. However, there is less attention on exploiting pre-trained language models for the decoder, compared with the encoder. BERT is a pre-trained language model that has been extensively used in natural language processing tasks. Nevertheless, the potential of using BERT as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from the publicly available pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model. Our models achieve competitive results with the existing audio captioning methods on the AudioCaps dataset.

JIANYUAN SUN, XUBO LIU, XINHAO MEI, JINZHENG ZHAO, Mark D. PLUMBLEY, Volkan Kılıc, WENWU WANG (2022)Deep Neural Decision Forest for Acoustic Scene Classification

—Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristic of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving the classification accuracy include integrating auxiliary methods such as attention mechanism, pre-trained models and ensemble multiple sub-networks. However, due to the complexity of audio clips captured from different environments, it is difficult to distinguish their categories without using any auxiliary methods for existing deep learning models using only a single classifier. In this paper, we propose a novel approach for ASC using deep neural decision forest (DNDF). DNDF combines a fixed number of convolutional layers and a decision forest as the final classifier. The decision forest consists of a fixed number of decision tree classifiers, which have been shown to offer better classification performance than a single classifier in some datasets. In particular, the decision forest differs substantially from traditional random forests as it is stochastic, differentiable, and capable of using the back-propagation to update and learn feature representations in neural network. Experimental results on the DCASE2019 and ESC-50 datasets demonstrate that our proposed DNDF method improves the ASC performance in terms of classification accuracy and shows competitive performance as compared with state-of-the-art baselines.

Meng Cui, Xubo Liu, Haohe Liu, Zhuangzhuang Du, Tao Chen, Guoping Lian, Daoliang Li, Wenwu Wang Multimodal Fish Feeding Intensity Assessment in Aquaculture

Fish feeding intensity assessment (FFIA) aims to evaluate the intensity change of fish appetite during the feeding process, which is vital in industrial aquaculture applications. The main challenges surrounding FFIA are two-fold. 1) robustness: existing work has mainly leveraged single-modality (e.g., vision, audio) methods, which have a high sensitivity to input noise. 2) efficiency: FFIA models are generally expected to be employed on devices. This presents a challenge in terms of computational efficiency. In this work, we first introduce an audio-visual dataset, called AV-FFIA. AV-FFIA consists of 27,000 labeled audio and video clips that capture different levels of fish feeding intensity. To our knowledge, AV-FFIA is the first large-scale multimodal dataset for FFIA research. Then, we introduce a multi-modal approach for FFIA by leveraging single-modality pre-trained models and modality-fusion methods, with benchmark studies on AV-FFIA. Our experimental results indicate that the multi-modal approach substantially outperforms the single-modality based approach, especially in noisy environments. While multimodal approaches provide a performance gain for FFIA, it inherently increase the computational cost. To overcome this issue, we further present a novel unified model, termed as U-FFIA. U-FFIA is a single model capable of processing audio, visual, or audio-visual modalities, by leveraging modality dropout during training and knowledge distillation from single-modality pre-trained models. We demonstrate that U-FFIA can achieve performance better than or on par with the state-of-the-art modality-specific FFIA models, with significantly lower computational overhead. Our proposed U-FFIA approach enables a more robust and efficient method for FFIA, with the potential to contribute to improved management practices and sustainability in aquaculture.

Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Xubo Liu, Haohe Liu, Kejia Zhang, Wenwu Wang (2023)Synth-AC: Enhancing Audio Captioning with Synthetic Supervision, In: arXiv.org Cornell University Library, arXiv.org

Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which leverages recent advances in audio generative models and commonly available text corpus to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, the text-to-audio generation model, i.e., AudioLDM, is used to generate synthetic audio signals with captions from an image captioning dataset. Our SynthAC expands the availability of well-annotated captions from the text-vision domain to audio captioning, thus enhancing text-audio representation by learning relations within synthetic text-audio pairs. Experiments demonstrate that our SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpus from the text-vision domain, offering a promising solution to the challenge caused by data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.

Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark Plumbley, Wenwu Wang (2023)Text-Driven Foley Sound Generation With Latent Diffusion Model, In: arXiv.org Cornell University Library, arXiv.org

Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pertaining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks \({1}^{st}\) among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The codes for implementing the proposed system are available online.

Özkan Çaylı, Xubo Liu, Volkan Kılıç, Wenwu Wang (2023)Knowledge Distillation for Efficient Audio-Visual Video Captioning, In: arXiv.org Cornell University Library, arXiv.org

Automatically describing audio-visual content with texts, namely video captioning, has received significant attention due to its potential applications across diverse fields. Deep neural networks are the dominant methods, offering state-of-the-art performance. However, these methods are often undeployable in low-power devices like smartphones due to the large size of the model parameters. In this paper, we propose to exploit simple pooling front-end and down-sampling algorithms with knowledge distillation for audio and visual attributes using a reduced number of audio-visual frames. With the help of knowledge distillation from the teacher model, our proposed method greatly reduces the redundant information in audio-visual streams without losing critical contexts for caption generation. Extensive experimental evaluations on the MSR-VTT dataset demonstrate that our proposed approach significantly reduces the inference time by about 80% with a small sacrifice (less than 0.02%) in captioning accuracy.

Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark Plumbley, Wenwu Wang (2023)WavJourney: Compositional Audio Creation with Large Language Models, In: WavJourney: Compositional Audio Creation with LLMs Cornell University Library, arXiv.org

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.

Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark Plumbley, Wenwu Wang (2023)Separate Anything You Describe, In: Official implementation of "Separate Anything You Describe": source code, evaluation benchmark and pre-trained model Cornell University Library, arXiv.org

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.

Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark Plumbley, Wenwu Wang (2023)Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7, In: arXiv.org Cornell University Library, arXiv.org

Foley sound presents the background sound for multimedia content and the generation of Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we proposed a system for DCASE 2023 challenge task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, which is a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system first trained with large-scale datasets and then downstreamed into this DCASE task via transfer learning. Through experiments, we found out that the feature extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by leveraging the input label with related text embedding features obtained by a significant language model, i.e., contrastive language-audio pertaining (CLAP). In addition, we utilize a filtering strategy to further refine the output, i.e. by selecting the best results from the candidate clips generated in terms of the similarity score between the sound and target labels. The overall system achieves a Frechet audio distance (FAD) score of 4.765 on average among all seven different classes, substantially outperforming the baseline system which performs a FAD score of 9.7.

Hejing Zhang, Qiaoxi Zhu, Jian Guan, Haohe Liu, Feiyang Xiao, Jiantong Tian, Xinhao Mei, Xubo Liu, Wenwu Wang (2023)First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation, In: arXiv.org Cornell University Library, arXiv.org

First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, it becomes challenging when adapting the existing ASD methods to the first-shot task. In this paper, we propose a new framework for the first-shot unsupervised ASD, where metadata-assisted audio generation is used to estimate unknown anomalies, by utilising the available machine information (i.e., metadata and sound data) to fine-tune a text-to-audio generation model for generating the anomalous sounds that contain unique acoustic characteristics accounting for each different machine types. We then use the method of Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to achieve the first-shot unsupervised ASD. Our proposed FS-TWFR-GMM method achieves competitive performance amongst top systems in DCASE 2023 Challenge Task 2, while requiring only 1% model parameters for detection, as validated in our experiments.

Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark Plumbley, Volkan Kilic, Wenwu Wang (2022)Automated Audio Captioning via Fusion of Low- and High- Dimensional Features, In: JOURNAL OF LATEX CLASS FILES14(8) Cornell University Library, arXiv.org

Automated audio captioning (AAC) aims to describe the content of an audio clip using simple sentences. Existing AAC methods are developed based on an encoder-decoder architecture that success is attributed to the use of a pre-trained CNN10 called PANNs as the encoder to learn rich audio representations. AAC is a highly challenging task due to its high-dimensional talent space involves audio of various scenarios. Existing methods only use the high-dimensional representation of the PANNs as the input of the decoder. However, the low-dimension representation may retain as much audio information as the high-dimensional representation may be neglected. In addition, although the high-dimensional approach may predict the audio captions by learning from existing audio captions, which lacks robustness and efficiency. To deal with these challenges, a fusion model which integrates low- and high-dimensional features AAC framework is proposed. In this paper, a new encoder-decoder framework is proposed called the Low- and High-Dimensional Feature Fusion (LHDFF) model for AAC. Moreover, in LHDFF, a new PANNs encoder is proposed called Residual PANNs (RPANNs) by fusing the low-dimensional feature from the intermediate convolution layer output and the high-dimensional feature from the final layer output of PANNs. To fully explore the information of the low- and high-dimensional fusion feature and high-dimensional feature respectively, we proposed dual transformer decoder structures to generate the captions in parallel. Especially, a probabilistic fusion approach is proposed that can ensure the overall performance of the system is improved by concentrating on the respective advantages of the two transformer decoders. Experimental results show that LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models

Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, Wenwu Wang (2023)CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing, In: arXiv.org Cornell University Library, arXiv.org

Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting them as audible or visible events. Recent methods for this task leverage the attention mechanism to capture the semantic correlations among the whole video across the audio-visual modalities. However, these approaches have overlooked the importance of individual segments within a video and the relationship among them, and tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method~(CM-PIE), which can learn fine-grained features by applying a segment-based attention module. Furthermore, a cross-modal aggregation block is introduced to jointly optimize the semantic representation of audio and visual signals by enhancing inter-modal interactions. The experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse dataset compared to other methods.

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark Plumbley (2023)AudioLDM: Text-to-Audio Generation with Latent Diffusion Models, In: arXiv.org Cornell University Library, arXiv.org

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark Plumbley, Wenwu Wang (2022)Towards Generating Diverse Audio Captions via Adversarial Training, In: arXiv.org Cornell University Library, arXiv.org

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips, however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe an audio clip diversely from various aspects using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip, or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions, and the hybrid discriminators assess the generated captions from different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.

Haohe Liu, QIUQIANG KONG, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark David Plumbley (2023)Ontology-aware Learning and Evaluation for Audio Tagging

This study defines a new evaluation metric for audio tagging tasks to alleviate the limitation of the mean average precision (mAP) metric. The mAP metric treats different kinds of sound as independent classes without considering their relations. The proposed metric, ontology-aware mean average precision (OmAP), addresses the weaknesses of mAP by utilizing additional on-tology during evaluation. Specifically, we reweight the false positive events in the model prediction based on the AudioSet ontology graph distance to the target classes. The OmAP also provides insights into model performance by evaluating different coarse-grained levels in the ontology graph. We conduct a human assessment and show that OmAP is more consistent with human perception than mAP. We also propose an ontology-based loss function (OBCE) that reweights binary cross entropy (BCE) loss based on the ontology distance. Our experiment shows that OBCE can improve both mAP and OmAP metrics on the AudioSet tagging task.

Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, Yong Xu, Wenwu Wang (2022)Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter, In: INTERSPEECH 20223704pp. 3704-3708 Isca-Int Speech Communication Assoc

Audio and visual signals can be used jointly to provide complementary information for multi-speaker tracking. Face detectors and color histogram can provide visual measurements while Direction of Arrival (DOA) lines and global coherence field (GCF) maps can provide audio measurements. GCF, as a traditional sound source localization method, has been widely used to provide audio measurements in audio-visual speaker tracking by estimating the positions of speakers. However, GCF cannot directly deal with the scenarios of multiple speakers due to the emergence of spurious peaks on the GCF map, making it difficult to find the non-dominant speakers. To overcome this limitation, we propose a phase-aware VoiceFilter and a separation-before-localization method, which enables the audio mixture to be separated into individual speech sources while retaining their phases. This allows us to calculate the GCF map for multiple speakers, thereby their positions accurately and concurrently. Based on this method, we design an adaptive audio measurement likelihood for audio-visual multiple speaker tracking using Poisson multi-Bernoulli mixture (PMBM) filter. The experiments demonstrate that our proposed tracker achieves state-of-the-art results on the AV16.3 dataset.

Qiushi Huang, Yu Zhang, Tom Ko, Xubo Liu, Bo Wu, Wenwu Wang, Lilian Tang Personalized Dialogue Generation with Persona-Adaptive Attention, In: arXiv (Cornell University)

Persona-based dialogue systems aim to generate consistent responses based on historical context and predefined persona. Unlike conventional dialogue generation, the persona-based dialogue needs to consider both dialogue context and persona, posing a challenge for coherent training. Specifically, this requires a delicate weight balance between context and persona. To achieve that, in this paper, we propose an effective framework with Persona-Adaptive Attention (PAA), which adaptively integrates the weights from the persona and context information via our designed attention. In addition, a dynamic masking mechanism is applied to the PAA to not only drop redundant information in context and persona but also serve as a regularization mechanism to avoid overfitting. Experimental results demonstrate the superiority of the proposed PAA framework compared to the strong baselines in both automatic and human evaluation. Moreover, the proposed PAA approach can perform equivalently well in a low-resource regime compared to models trained in a full-data setting, which achieve a similar result with only 20% to 30% of data compared to the larger models trained in the full-data setting. To fully exploit the effectiveness of our design, we designed several variants for handling the weighted information in different ways, showing the necessity and sufficiency of our weighting and masking designs.

Qiushi Huang, Tom Ko, H Tang, Xubo Liu, Bo Wu (2021)Token-Level Supervised Contrastive Learning for Punctuation Restoration, In: arXiv.org2pp. 936-940 Cornell University Library, arXiv.org

Punctuation is critical in understanding natural language text. Currently, most automatic speech recognition (ASR) systems do not generate punctuation, which affects the performance of downstream tasks, such as intent detection and slot filling. This gives rise to the need for punctuation restoration. Recent work in punctuation restoration heavily utilizes pre-trained language models without considering data imbalance when predicting punctuation classes. In this work, we address this problem by proposing a token-level supervised contrastive learning method that aims at maximizing the distance of representation of different punctuation marks in the embedding space. The result shows that training with token-level supervised contrastive learning obtains up to 3.2% absolute F1 improvement on the test set.

Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang (2023)Leveraging Pre-Trained AudioLDM for Sound Generation: A Benchmark Study, In: 2023 31st European Signal Processing Conference (EUSIPCO)pp. 765-769 EURASIP

Deep neural networks have recently achieved break-throughs in sound generation. Despite the outstanding sample quality, current sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting performance. In this paper, we make the first attempt to investigate the benefits of pre-training on sound generation with AudioLDM, the cutting-edge model for audio generation, as the backbone. Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarcity scenarios. In addition, the baselines and evaluation protocol for sound generation systems are not consistent enough to compare different studies directly. Aiming to facilitate further study on sound generation tasks, we benchmark the sound generation task on various frequently-used datasets. We hope our results on transfer learning and benchmarks can provide references for further research on conditional sound generation.

Ke Wang, XUBO LIU, Chien-Ming Chen, Saru Kumari, MOHAMMAD SHOJAFAR, Mohammed Alamgir Hossain (2020)Voice-Transfer Attacking on Industrial Voice Control Systems in 5G-Aided IIoT Domain, In: IEEE transactions on industrial informaticspp. 1-1 IEEE

At present, specific voice control has gradually become an important means for 5G-IoT-aided industrial control systems. However, the security of specific voice control system needs to be improved, because the voice cloning technology may lead to industrial accidents and other potential security risks. In this paper, we propose a transductive voice transfer learning method to learn the predictive function from the source domain and fine-tune in the target domain adaptively. The target learning task and source learning task are both synthesizing speech signals from the given audio while the data sets of both domains are different. By adding different penalty values to each instances and minimizing the expected risk, an optimal precise model can be learned. Many details of the experimental results show that our method can effectively synthesize the speech of the target speaker with small samples.

Meng Cui, XUBO LIU, JINZHENG ZHAO, JIANYUAN SUN, GUOPING LIAN, TAO CHEN, Mark D. PLUMBLEY, Daoliang Li, WENWU WANG (2022)Fish Feeding Intensity Assessment in Aquaculture: A New Audio Dataset AFFIA3K and a Deep Learning Algorithm
Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark David Plumbley, Wenwu Wang (2023)Adapting Language-Audio Models as Few-Shot Audio Learners, In: Proceedings of Interspeech 2023pp. 276-280 International Speech Communication Association (ISCA)

Contrastive language-audio pretraining (CLAP) has become a new paradigm to learn audio concepts with audio-text pairs. CLAP models have shown unprecedented performance as zero-shot classifiers on downstream tasks. To further adapt CLAP with domain-specific knowledge, a popular method is to fine-tune its audio encoder with available labelled examples. However, this is challenging in low-shot scenarios, as the amount of annotations is limited compared to the model size. In this work, we introduce a Training-efficient (Treff) adapter to rapidly learn with a small set of examples while maintaining the capacity for zero-shot classification. First, we propose a crossattention linear model (CALM) to map a set of labelled examples and test audio to test labels. Second, we find initialising CALM as a cosine measurement improves our Treff adapter even without training. The Treff adapter outperforms metricbased methods in few-shot settings and yields competitive results to fully-supervised methods.

Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang (2023)Simple Pooling Front-Ends for Efficient Audio Classification, In: Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)pp. 1-5 Institute of Electrical and Electronics Engineers (IEEE)

Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., mel-spectrogram) could be an effective approach for efficient audio classification. To do so, we proposed a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information within the mel-spectrogram. We perform extensive experiments on four audio classification tasks to evaluate the performance of SimPFs. Experimental results show that SimPFs can achieve a reduction in more than half of the number of floating point operations (FLOPs) for off-the-shelf audio neural networks, with negligible degradation or even some improvements in audio classification performance.

Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang (2023)Learning Retrieval Augmentation for Personalized Dialogue Generation, In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processingpp. 2523-2540 Association for Computational Linguistics

Personalized dialogue generation, focusing on generating highly tailored responses by lever-aging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles , a prevalent setting in current personal-ized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose Learning Retrieval Augmentation for Personalized DialOgue Generation (LAPDOG), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization , we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration. 1

Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang (2022)Separate What You Describe: Language-Queried Audio Source Separation, In: Interspeech 2022pp. 1801-1805

In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., “a man tells a joke followed by people laughing”). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To address this issue, we proposed LASSNet, an end-to-end neural network that is learned to jointly process acoustic and linguistic information, and separate the target source that is consistent with the language query from an audio mixture. We evaluate the performance of our proposed system with a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios. The separated audio samples and source code are available at https://liuxubo717.github.io/LASS-demopage.

Y Cao, Zhili Sun, Ning Wang, Maryam Riaz, Haitham Cruickshank, X Liu (2015)Geographic-Based Spray-and-Relay (GSaR): An efficient routing scheme for DTNs, In: IEEE Transactions on Vehicular Technology64(4)pp. 1548-1564 IEEE

In this paper, we design and evaluate the proposed geographic-based spray-and-relay (GSaR) routing scheme in delay/disruption-tolerant networks. To the best of our knowledge, GSaR is the first spray-based geographic routing scheme using historical geographic information for making a routing decision. Here, the term spray means that only a limited number of message copies are allowed for replication in the network. By estimating a movement range of destination via the historical geographic information, GSaR expedites the message being sprayed toward this range, meanwhile prevents that away from and postpones that out of this range. As such, the combination of them intends to fast and efficiently spray the limited number of message copies toward this range and effectively spray them within range, to reduce the delivery delay and increase the delivery ratio. Furthermore, GSaR exploits delegation forwarding to enhance the reliability of the routing decision and handle the local maximum problem, which is considered to be the challenges for applying the geographic routing scheme in sparse networks. We evaluate GSaR under three city scenarios abstracted from real world, with other routing schemes for comparison. Results show that GSaR is reliable for delivering messages before the expiration deadline and efficient for achieving low routing overhead ratio. Further observation indicates that GSaR is also efficient in terms of a low and fair energy consumption over the nodes in the network.

Xiaoran Liu, Xiaoying Zhang, Lei Zhang, Pei Xiao, Jibo Wei, Haijun Zhang, Victor C.M Leung (2020)PAPR Reduction Using Iterative Clipping/Filtering and ADMM Approaches for OFDM-Based Mixed-Numerology Systems, In: IEEE Transactions on Wireless Communications

Mixed-numerology transmission is proposed to support a variety of communication scenarios with diverse requirements. However, as the orthogonal frequency division multiplexing (OFDM) remains as the basic waveform, the peak-to average power ratio (PAPR) problem is still cumbersome. In this paper, based on the iterative clipping and filtering (ICF) and optimization methods, we investigate the PAPR reduction in the mixed numerology systems.We first illustrate that the direct extension of classical ICF brings about the accumulation of inter-numerology interference (INI) due to the repeated execution. By exploiting the clipping noise rather than the clipped signal, the noiseshaped ICF (NS-ICF) method is then proposed without increasing the INI. Next, we address the in-band distortion minimization problem subject to the PAPR constraint. By reformulation, the resulting model is separable in both the objective function and the constraints, and well suited for the alternating direction method of multipliers (ADMM) approach. The ADMM-based algorithms are then developed to split the original problem into several subproblems which can be easily solved with closedform solutions. Furthermore, the applications of the proposed PAPR reduction methods combined with filtering and windowing techniques are also shown to be effective.

X Liu, BG Evans, K Moessner (2013)Comparison of reliability, delay and complexity for standalone cognitive radio spectrum sensing schemes, In: IET COMMUNICATIONS7(9)pp. 799-807 INST ENGINEERING TECHNOLOGY-IET
X Liu, P Barnaghi, B Cheng, L Wan, Y Yang (2015)OMI-DL: An Ontology Matching Framework, In: Services Computing, IEEE Transactions onPP99pp. 1-1
X Liu, BG Evans, K Moessner (2015)Energy-Efficient Sensor Scheduling Algorithm in Cognitive Radio Networks Employing Heterogeneous Sensors, In: IEEE Transactions on Vehicular Technology64(3)pp. 1243-1249 Institute of Electrical and Electronics Engineers (IEEE)

We consider, in this paper, the maximization of throughput in a dense network of collaborative cognitive radio (CR) sensors with limited energy supply. In our case, the sensors are mixed varieties (heterogeneous) and are battery powered. We propose an ant colony-based energy-efficient sensor scheduling algorithm (ACO-ESSP) to optimally schedule the activities of the sensors to provide the required sensing performance and increase the overall secondary system throughput. The proposed algorithm is an improved version of the conventional ant colony optimization (ACO) algorithm, specifically tailored to the formulated sensor scheduling problem. We also use a more realistic sensor energy consumption model and consider CR networks employing heterogeneous sensors (CRNHSs). Simulations demonstrate that our approach improves the system throughput efficiently and effectively compared with other algorithms.

C Han, M Dianati, R Tafazolli, X Liu, X Shen (2012)A Novel Distributed Asynchronous Multichannel MAC Scheme for Large-Scale Vehicular Ad Hoc Networks., In: IEEE T. Vehicular Technology617pp. 3125-3138
Y Wu, Y Jin, X Liu (2015)A directed search strategy for evolutionary dynamic multiobjective optimization, In: SOFT COMPUTING19(11)pp. 3221-3235 SPRINGER
X. Liu, Y. L. Guan, S.N. Koh, D. Teo, Zilong Liu (2018)On the Performance of Single-Channel Source Separation of Two Co-Frequency PSK Signals with Carrier Frequency Offsets, In: Proceedings of IEEE MILCOM 2018 Institute of Electrical and Electronics Engineers (IEEE)

Continuously learning new classes without catastrophic forgetting is a challenging problem for on-device environmental sound classification given the restrictions on computation resources (e.g., model size, running memory). To address this issue, we propose a simple and efficient continual learning method. Our method selects the historical data for the training by measuring the per-sample classification uncertainty. Specifically, we measure the uncertainty by observing how the classification probability of data fluctuates against the parallel perturbations added to the classifier embedding. In this way, the computation cost can be significantly reduced compared with adding perturbation to the raw data. Experimental results on the DCASE 2019 Task 1 and ESC-50 dataset show that our proposed method outperforms baseline continual learning methods on classification accuracy and computational efficiency, indicating our method can efficiently and incrementally learn new classes without the catastrophic forgetting problem for on-device environmental sound classification.


Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced to mitigate the problem induced by data scarcity. Moreover, evaluation metrics are incorporated into the optimization of the model with reinforcement learning, which helps address the problem of " exposure bias " induced by " teacher forcing " training strategy and the mismatch between the evaluation metrics and the loss function. The resulting system was ranked 3rd in DCASE 2021 Task 6. Abla-tion studies are carried out to investigate how much each component in the proposed system can contribute to final performance. The results show that the proposed techniques significantly improve the scores of the evaluation metrics, however, reinforcement learning may impact adversely on the quality of the generated captions.


Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.

Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang (2022)Automated Audio Captioning: An Overview of Recent Progress and New Challenges, In: EURASIP journal on audio, speech, and music processing2022(Recent advances in computational sound scene analysis)26 Springer Open

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.

Xiaolan Liu, Jiadong Yu, Jian Wang, Yue Gao (2020)Resource Allocation With Edge Computing in IoT Networks via Machine Learning, In: IEEE internet of things journal7(4)pp. 3415-3426 IEEE

In this article, we investigate resource allocation with edge computing in Internet-of-Things (IoT) networks via machine learning approaches. Edge computing is playing a promising role in IoT networks by providing computing capabilities close to users. However, the massive number of users in IoT networks requires sufficient spectrum resource to transmit their computation tasks to an edge server, while the IoT users were developed to have more powerful computation ability recently, which makes it possible for them to execute some tasks locally. Then, the design of computation task offloading policies for such IoT edge computing systems remains challenging. In this article, centralized user clustering is explored to group the IoT users into different clusters according to users' priorities. The cluster with the highest priority is assigned to offload computation tasks and executed at the edge server, while the lowest priority cluster executes computation tasks locally. For the other clusters, the design of distributed task offloading policies for the IoT users is modeled by a Markov decision process, where each IoT user is considered as an agent which makes a series of decisions on task offloading by minimizing the system cost based on the environment dynamics. To deal with the curse of high dimensionality, we use a deep Q -network to learn the optimal policy in which deep neural network is used to approximate the Q -function in Q -learning. Simulations show that users are grouped into clusters with optimal number of clusters. Moreover, our proposed computation offloading algorithm outperforms the other baseline schemes under the same system costs.

Jiadong Yu, Xiaolan Liu, Yue Gao, Xuemin Shen (2020)3D Channel Tracking for UAV-Satellite Communications in Space-Air-Ground Integrated Networks, In: IEEE journal on selected areas in communications38(12)pp. 2810-2823 IEEE

The space-air-ground integrated network (SAGIN) aims to provide seamless wide-area connections, high throughput and strong resilience for 5G and beyond communications. Acting as a crucial link segment of the SAGIN, unmanned aerial vehicle (UAV)-satellite communication has drawn much attention. However, it is a key challenge to track dynamic channel information due to the low earth orbit (LEO) satellite orbiting and three-dimensional (3D) UAV trajectory. In this paper, we explore the 3D channel tracking for a Ka-band UAV-satellite communication system. We firstly propose a statistical dynamic channel model called 3D two-dimensional Markov model (3D-2D-MM) for the UAV-satellite communication system by exploiting the probabilistic insight relationship of both hidden value vector and joint hidden support vector. Specifically, for the joint hidden support vector, we consider a more realistic 3D support vector in both azimuth and elevation direction. Moreover, the spatial sparsity structure and the time-varying probabilistic relationship between degree patterns named the spatial and temporal correlation, respectively, are studied for each direction. Furthermore, we derive a novel 3D dynamic turbo approximate message passing (3D-DTAMP) algorithm to recursively track the dynamic channel with the 3D-2D-MM priors. Numerical results show that our proposed algorithm achieves superior channel tracking performance to the state-of-the-art algorithms with lower pilot overhead and comparable complexity.

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang (2022)On Metric Learning for Audio-Text Cross-Modal Retrieval

Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates given a query in another modality. Solving such cross-modal retrieval task is challenging because it not only requires learning robust feature representations for both modalities, but also requires capturing the fine-grained alignment between these two modalities. Existing cross-modal retrieval models are mostly optimized by metric learning objectives as both of them attempt to map data to an embedding space, where similar data are close together and dissimilar data are far apart. Unlike other cross-modal retrieval tasks such as image-text and video-text retrievals, audio-text retrieval is still an unexplored task. In this work, we aim to study the impact of different metric learning objectives on the audio-text retrieval task. We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets. We demonstrate that NT-Xent loss adapted from self-supervised learning shows stable performance across different datasets and training settings, and outperforms the popular triplet-based losses. Our code is available at https://github.com/XinhaoMei/ audio-text_retrieval.

L Lever, Y Hu, M Myronov, X Liu, N Owens, FY Gardes, IP Marko, SJ Sweeney, Z Ikonić, DR Leadley, GT Reed, RW Kelsall (2011)Strain engineering of the electroabsorption response in Ge/SiGe multiple quantum well heterostructures, In: 8th IEEE International Conference on Group IV Photonicspp. 107-108

Many fibre-optic telecommunications systems exploit the spectral `window' at 1310 nm, which corresponds to zero dispersion in standard single-mode fibres (SMFs). In particular, several passive optical network (PON) architectures use 1310 nm for upstream signals,1 and so compact, low-cost and low-power modulators operating at 1310 nm that can be integrated into Si electronic-photonic integrated circuits would be extremely desireable for future fibre-to-the-home (FTTH) applications.

P Barnaghi, X Liu, K Moessner, J Liao (2012)Using Concept and Structure Similarities for Ontology Integration, In: P Shvaiko, J Euzenat, F Giunchiglia, H Stuckenschmidt, M Mao, I Cruz (eds.), CEUR Workshop Proceedings: Proceedings of the 5th International Workshop on Ontology Matching689

We propose a method to align different ontologies in similar domains and then define correspondence between concepts in two different ontologies using the SKOS model.


Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips. To address this problem, we propose an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions. Unlike processing data of continuous values in a classical GAN, a sentence is composed of discrete tokens and the discrete sampling process is non-differentiable. To address this issue, policy gradient, a reinforcement learning technique, is used to back-propagate the reward to the generator. The results show that our proposed model can generate more diverse captions, as compared to state-of-the-art methods.

Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang (2021)Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learningpp. 25-28

Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared to the generation of those domain-specific sounds, generating general sounds (such as siren, gunshots) has received less attention , despite their wide applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is potentially limited in capturing long-range dependencies within sounds as it only back-propagates through a limited number of samples. In this work, we propose a method for generating sounds via neural discrete time-frequency representation learning, conditioned on sound classes. This offers an advantage in efficiently modelling long-range dependencies and retaining local fine-grained structures within sound clips. We evaluate our approach on the UrbanSound8K dataset, compared to SampleRNN, with the performance metrics measuring the quality and diversity of generated sounds. Experimental results show that our method offers comparable performance in quality and significantly better performance in diversity.


Automated Audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has received increasing interest in the community. The existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation, and aligned with its corresponding text descriptions, then a decoder is used to generate the captions. However, training of an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representation and audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts by contrasting samples, which can improve the quality of latent representation and the alignment between audio and texts, while trained with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.