Junqi Zhao

Postgraduate Research Student

junqi.zhao@surrey.ac.uk

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP).

Publications

Yun Chen, Haohe Liu, Qi Chen, Arshdeep Singh, Junqi Zhao, Wenwu Wang, Philip J B Jackson, Mark D. Plumbley (2026) SFM-Adapter: Style-aware Feature Manipulation Adapter for Speech Style Editing, In: SFM-Adapter: Style-aware Feature Manipulation Adapter for Speech Style Editing IEEE

DOI: 10.1109/TASLPRO.2026.3698957

Speech Style Editing (SSE) aims to modify selected style attributes (e.g., timbre, emotion, pitch) while preserving the linguistic content and all other style attributes that are not given. Many speech applications require flexible control over speech style, making SSE increasingly important. Existing SSE approaches typically follow a style-generation paradigm that synthesizes non-linguistic attributes from style conditions. However, this often results in limited preservation of source attributes and insufficient flexibility when only a subset of style attributes is specified. To overcome these limitations, we adopt a style editing paradigm, in which the target style is achieved by adjusting the source speech instead of producing speech from scratch. Building on this paradigm, we propose a diffusion-based framework with a Style-aware Feature Manipulation Adapter (SFM-Adapter). The SFM-Adapter performs feature-level modulation by integrating user-provided style information with source speech features through multi-layer cross-attention. The resulting modulated features are incorporated into the generation process via mask attention. During inference, a Large Audio-Language Model (LALM)-based length regulation is designed to predict speaking speed and adjust duration. Experiments across multiple speech style editing tasks demonstrate that the SFM-Adapter achieves more natural, accurate, and source-preserving style editing compared with existing methods. Speech samples are provided in https://ychenn1.github.io/SFM-Adapter/.

Junqi Zhao, Chenxing Li, Jinzheng Zhao, Rilin Chen, Dong Yu, Mark D. Plumbley, Wenwu Wang (2026) FEEDBACK-DRIVEN RETRIEVAL-AUGMENTED AUDIO GENERATION WITH LARGE AUDIO LANGUAGE MODELS, In: 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing IEEE

We propose a general feedback-driven retrieval-augmented generation (RAG) approach that leverages Large Audio Language Models (LALMs) to address the missing or imperfect synthesis of specific sound events in text-to-audio (TTA) generation. Unlike previous RAG-based TTA methods that typically train specialized models from scratch, we utilize LALMs to analyze audio generation outputs, retrieve concepts that pre-trained models struggle to generate from an external database, and incorporate the retrieved information into the generation process. Experimental results show that our method not only enhances the ability of LALMs to identify missing sound events but also delivers improvements across different models, outperforming existing RAG-specialized approaches.

Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D Plumbley, Wenwu Wang (2024) Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

Universal sound separation (USS) is the task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance. We employ two strategies to utilize SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning. The SSL embeddings are concatenated with the short-time Fourier transform (STFT) to serve as input features for the separation model. We evaluate our methods on the AudioSet dataset, and the experimental results indicate that the proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.
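
The feature concatenation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frame sizes, the placeholder embedding values, and the nearest-neighbour step used to align embedding frames with STFT frames are all assumptions made for illustration, and no real A-MAE model is involved.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=8, hop=4):
    """Magnitude STFT with a Hann window: one list of frame_len // 2 + 1 bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def align_frames(embeddings, num_frames):
    """Nearest-neighbour resampling of embedding frames to num_frames (assumed alignment step)."""
    src = len(embeddings)
    return [embeddings[min(src - 1, t * src // num_frames)] for t in range(num_frames)]


def concat_features(stft_frames, embeddings):
    """Concatenate each STFT frame with its time-aligned embedding along the feature axis."""
    aligned = align_frames(embeddings, len(stft_frames))
    return [spec + emb for spec, emb in zip(stft_frames, aligned)]


# Toy input: a short sine wave and two placeholder 3-dimensional "SSL embeddings".
signal = [math.sin(2 * math.pi * 0.1 * n) for n in range(32)]
spec = stft_magnitude(signal)
ssl_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # hypothetical values, not model outputs
features = concat_features(spec, ssl_embeddings)
```

In this sketch each frame of `features` holds the STFT magnitude bins followed by the embedding dimensions, which is the kind of combined input the abstract says is passed to the separation model.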