My research project
Cross-modal translations between audio and texts
My research topic is cross-modal translation between audio and text, including text-to-sound generation and automated audio captioning. I am interested in self-supervised learning and representation learning for audio and text data.
Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips. To address this problem, we propose an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions. Unlike the continuous-valued data processed by a classical GAN, a sentence is composed of discrete tokens, and the discrete sampling process is non-differentiable. To address this issue, policy gradient, a reinforcement learning technique, is used to back-propagate the reward to the generator. The results show that our proposed model can generate more diverse captions than state-of-the-art methods.
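As a rough illustration of how a policy gradient lets a reward reach the generator despite the non-differentiable sampling step, the sketch below applies the REINFORCE estimator to a single token choice. It is a minimal sketch under stated assumptions, not the C-GAN model from the abstract: the function names, the toy three-token vocabulary and the hand-written reward are all hypothetical, and in an adversarial setup the reward would come from a discriminator rather than a fixed rule.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(probs, rng):
    # Discrete sampling: no gradient flows through this step.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def reinforce_gradient(logits, token, reward, baseline=0.0):
    # Gradient of (reward - baseline) * log p(token) w.r.t. the logits,
    # which for a softmax policy equals (reward - baseline) * (onehot - probs).
    probs = softmax(logits)
    advantage = reward - baseline
    return [advantage * ((1.0 if i == token else 0.0) - p)
            for i, p in enumerate(probs)]

def train_step(logits, reward_fn, rng, lr=0.1):
    # Sample a token, score it, and take one gradient-ascent step.
    token = sample_token(softmax(logits), rng)
    grad = reinforce_gradient(logits, token, reward_fn(token))
    return [z + lr * g for z, g in zip(logits, grad)]

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 0.0]
    for _ in range(500):
        # Hypothetical reward: only token 2 is judged "natural".
        logits = train_step(logits, lambda t: 1.0 if t == 2 else 0.0, rng)
    print(softmax(logits))
```

The update only needs the log-probability gradient of the sampled token, so the sampling operation itself never has to be differentiated.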
Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared to the generation of those domain-specific sounds, generating general sounds (such as sirens and gunshots) has received less attention, despite their wide applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is potentially limited in capturing long-range dependencies within sounds as it only back-propagates through a limited number of samples. In this work, we propose a method for generating sounds via neural discrete time-frequency representation learning, conditioned on sound classes. This offers an advantage in efficiently modelling long-range dependencies and retaining local fine-grained structures within sound clips. We evaluate our approach on the UrbanSound8K dataset, compared to SampleRNN, with the performance metrics measuring the quality and diversity of generated sounds. Experimental results show that our method offers comparable performance in quality and significantly better performance in diversity.
Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced to mitigate the problem induced by data scarcity. Moreover, evaluation metrics are incorporated into the optimization of the model with reinforcement learning, which helps address the problem of "exposure bias" induced by the "teacher forcing" training strategy and the mismatch between the evaluation metrics and the loss function. The resulting system was ranked 3rd in DCASE 2021 Task 6. Ablation studies are carried out to investigate how much each component in the proposed system contributes to the final performance. The results show that the proposed techniques significantly improve the scores of the evaluation metrics; however, reinforcement learning may adversely affect the quality of the generated captions.
Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text descriptions, and a decoder is then used to generate the captions. However, training of an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representation and audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and texts, even when trained with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
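The idea of contrasting matched and mismatched audio-text pairs can be sketched with a generic InfoNCE-style loss. This is an assumption-laden illustration rather than the CL4AC objective, whose exact form is not given in the abstract: the function names, the temperature value and the toy two-dimensional embeddings are all hypothetical, and only the audio-to-text direction is shown.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    # InfoNCE-style loss: for audio clip i, the caption at index i is the
    # positive and every other caption in the batch is a negative.
    audio = [normalize(a) for a in audio_embs]
    text = [normalize(t) for t in text_embs]
    loss = 0.0
    for i, a in enumerate(audio):
        sims = [dot(a, t) / temperature for t in text]
        m = max(sims)
        # Stable log-sum-exp over all captions in the batch.
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denominator - sims[i]
    return loss / len(audio)

if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(audio, aligned), contrastive_loss(audio, shuffled))
```

The loss is small when each audio embedding is closest to its own caption and large when the pairing is shuffled, which is the signal a contrastive term adds on top of the captioning loss.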
Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
The space-air-ground integrated network (SAGIN) aims to provide seamless wide-area connections, high throughput and strong resilience for 5G and beyond communications. Acting as a crucial link segment of the SAGIN, unmanned aerial vehicle (UAV)-satellite communication has drawn much attention. However, it is a key challenge to track dynamic channel information due to the low earth orbit (LEO) satellite orbiting and three-dimensional (3D) UAV trajectory. In this paper, we explore the 3D channel tracking for a Ka-band UAV-satellite communication system. We firstly propose a statistical dynamic channel model called 3D two-dimensional Markov model (3D-2D-MM) for the UAV-satellite communication system by exploiting the probabilistic insight relationship of both hidden value vector and joint hidden support vector. Specifically, for the joint hidden support vector, we consider a more realistic 3D support vector in both azimuth and elevation direction. Moreover, the spatial sparsity structure and the time-varying probabilistic relationship between degree patterns named the spatial and temporal correlation, respectively, are studied for each direction. Furthermore, we derive a novel 3D dynamic turbo approximate message passing (3D-DTAMP) algorithm to recursively track the dynamic channel with the 3D-2D-MM priors. Numerical results show that our proposed algorithm achieves superior channel tracking performance to the state-of-the-art algorithms with lower pilot overhead and comparable complexity.
In this article, we investigate resource allocation with edge computing in Internet-of-Things (IoT) networks via machine learning approaches. Edge computing is playing a promising role in IoT networks by providing computing capabilities close to users. However, the massive number of users in IoT networks requires sufficient spectrum resources to transmit their computation tasks to an edge server, while IoT devices have recently gained more powerful computation ability, which makes it possible for them to execute some tasks locally. As a result, the design of computation task offloading policies for such IoT edge computing systems remains challenging. In this article, centralized user clustering is explored to group the IoT users into different clusters according to users' priorities. The cluster with the highest priority is assigned to offload its computation tasks for execution at the edge server, while the lowest priority cluster executes computation tasks locally. For the other clusters, the design of distributed task offloading policies for the IoT users is modeled by a Markov decision process, where each IoT user is considered as an agent which makes a series of decisions on task offloading by minimizing the system cost based on the environment dynamics. To deal with the curse of high dimensionality, we use a deep Q-network to learn the optimal policy, in which a deep neural network is used to approximate the Q-function in Q-learning. Simulations show that users are grouped into clusters with the optimal number of clusters. Moreover, our proposed computation offloading algorithm outperforms the other baseline schemes under the same system costs.
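The Q-learning update that a deep Q-network approximates can be sketched in tabular form. This is a minimal sketch, assuming a reward defined as the negative of the system cost; the state names, the two actions (0 for local execution, 1 for offloading) and the hyperparameter values are hypothetical and not taken from the article.

```python
def td_target(reward, next_q_values, gamma=0.9, done=False):
    # Temporal-difference target: immediate reward plus the discounted
    # value of the best action available in the next state.
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def q_update(q_table, state, action, reward, next_state,
             alpha=0.5, gamma=0.9, done=False):
    # Move Q(state, action) a fraction alpha toward the TD target.
    target = td_target(reward, q_table[next_state], gamma, done)
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

if __name__ == "__main__":
    # Hypothetical states; action 0 = execute locally, action 1 = offload.
    q = {"busy_channel": [0.0, 0.0], "idle_channel": [1.0, 2.0]}
    # Assumed reward: negative system cost of the chosen action.
    print(q_update(q, "busy_channel", 1, reward=-1.0, next_state="idle_channel"))
```

A deep Q-network keeps the same target but replaces the table with a neural network that is trained to regress toward it, which avoids enumerating a high-dimensional state space.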
In this paper, we design and evaluate the proposed geographic-based spray-and-relay (GSaR) routing scheme in delay/disruption-tolerant networks. To the best of our knowledge, GSaR is the first spray-based geographic routing scheme using historical geographic information for making a routing decision. Here, the term spray means that only a limited number of message copies are allowed for replication in the network. By estimating the movement range of the destination via the historical geographic information, GSaR expedites messages being sprayed toward this range, while preventing them from being sprayed away from it and postponing their being sprayed out of it. Together, these mechanisms aim to spray the limited number of message copies quickly and efficiently toward this range and then effectively within it, to reduce the delivery delay and increase the delivery ratio. Furthermore, GSaR exploits delegation forwarding to enhance the reliability of the routing decision and handle the local maximum problem, which are considered challenges for applying geographic routing schemes in sparse networks. We evaluate GSaR under three city scenarios abstracted from the real world, with other routing schemes for comparison. Results show that GSaR is reliable for delivering messages before the expiration deadline and efficient in achieving a low routing overhead ratio. Further observation indicates that GSaR is also efficient in terms of low and fair energy consumption across the nodes in the network.
Many fibre-optic telecommunications systems exploit the spectral 'window' at 1310 nm, which corresponds to zero dispersion in standard single-mode fibres (SMFs). In particular, several passive optical network (PON) architectures use 1310 nm for upstream signals, and so compact, low-cost and low-power modulators operating at 1310 nm that can be integrated into Si electronic-photonic integrated circuits would be extremely desirable for future fibre-to-the-home (FTTH) applications.
At present, specific voice control has gradually become an important means for 5G-IoT-aided industrial control systems. However, the security of specific voice control systems needs to be improved, because voice cloning technology may lead to industrial accidents and other potential security risks. In this paper, we propose a transductive voice transfer learning method to learn the predictive function from the source domain and fine-tune it in the target domain adaptively. The target learning task and source learning task are both synthesizing speech signals from the given audio, while the data sets of the two domains are different. By adding different penalty values to each instance and minimizing the expected risk, an optimal precise model can be learned. Detailed experimental results show that our method can effectively synthesize the speech of the target speaker with small samples.
Mixed-numerology transmission is proposed to support a variety of communication scenarios with diverse requirements. However, as orthogonal frequency division multiplexing (OFDM) remains the basic waveform, the peak-to-average power ratio (PAPR) problem is still cumbersome. In this paper, based on the iterative clipping and filtering (ICF) and optimization methods, we investigate PAPR reduction in mixed-numerology systems. We first illustrate that the direct extension of classical ICF brings about the accumulation of inter-numerology interference (INI) due to the repeated execution. By exploiting the clipping noise rather than the clipped signal, the noise-shaped ICF (NS-ICF) method is then proposed without increasing the INI. Next, we address the in-band distortion minimization problem subject to the PAPR constraint. By reformulation, the resulting model is separable in both the objective function and the constraints, and well suited for the alternating direction method of multipliers (ADMM) approach. The ADMM-based algorithms are then developed to split the original problem into several subproblems which can be easily solved with closed-form solutions. Furthermore, the applications of the proposed PAPR reduction methods combined with filtering and windowing techniques are also shown to be effective.
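The quantities involved in clipping-based PAPR reduction can be illustrated with a few lines of code. The sketch below computes the PAPR of a sampled signal, applies one amplitude-clipping pass, and extracts the clipping noise (clipped signal minus original). It is only a sketch under stated assumptions: the function names and the toy signal are hypothetical, the filtering stage of ICF is omitted, and nothing here reproduces the NS-ICF or ADMM algorithms of the paper.

```python
import math

def papr_db(signal):
    # PAPR in dB: peak instantaneous power over mean power.
    powers = [abs(x) ** 2 for x in signal]
    return 10.0 * math.log10(max(powers) / (sum(powers) / len(powers)))

def clip(signal, threshold):
    # Limit each sample's magnitude to the threshold while keeping its phase.
    return [x if abs(x) <= threshold else threshold * x / abs(x)
            for x in signal]

def clipping_noise(signal, threshold):
    # Difference between the clipped and the original signal.
    return [c - x for c, x in zip(clip(signal, threshold), signal)]

if __name__ == "__main__":
    # Hypothetical time-domain samples with one large peak.
    samples = [1 + 0j, 1j, -1 + 0j, 3j]
    clipped = clip(samples, 2.0)
    print(papr_db(samples), papr_db(clipped))
    print(clipping_noise(samples, 2.0))
```

In a full ICF loop, the clipped signal would next be filtered in the frequency domain to remove out-of-band distortion, which regrows some peaks and motivates the iteration.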
In this paper, we consider the maximization of throughput in a dense network of collaborative cognitive radio (CR) sensors with limited energy supply. In our case, the sensors are of mixed varieties (heterogeneous) and are battery powered. We propose an ant colony-based energy-efficient sensor scheduling algorithm (ACO-ESSP) to optimally schedule the activities of the sensors to provide the required sensing performance and increase the overall secondary system throughput. The proposed algorithm is an improved version of the conventional ant colony optimization (ACO) algorithm, specifically tailored to the formulated sensor scheduling problem. We also use a more realistic sensor energy consumption model and consider CR networks employing heterogeneous sensors (CRNHSs). Simulations demonstrate that our approach improves the system throughput efficiently and effectively compared with other algorithms.
We propose a method to align different ontologies in similar domains and then define correspondence between concepts in two different ontologies using the SKOS model.