10:30am - 11:30am
Friday 14 November 2025
Towards Explainable Speaker Recognition Neural Networks
PhD Viva Open Presentation - Yanze Xu
Online event - All Welcome!
Free
Abstract:
Speaker recognition systems powered by neural networks can identify individuals from their voice signals. However, these models often function as black boxes, with decision-making processes that remain opaque. The field of Explainable AI (XAI) seeks to explain these processes, particularly in neural networks, in order to make them transparent and understandable to humans. This thesis investigates two central questions in the context of XAI for speaker recognition neural networks: (1) how a well-trained network, which learns voice representations relevant to its decisions, organises these diverse representations, and (2) what specific information the network selectively processes when making decisions. We refer to the first question as exploring the network's representation organisation and the second as exploring the network's attention mechanism.
Regarding representation organisation, prior studies showed that such networks tend to group representations of similar voices into individual clusters, but few have investigated the relationships between these clusters. In this thesis, we first analyse diverse representations extracted from a speaker recognition neural network using two hierarchical clustering algorithms (HDBSCAN and SLINK). This analysis reveals that the representations form clusters with hierarchical relationships. Secondly, we evaluate the hierarchical structures obtained separately by HDBSCAN and SLINK and visualise the best-performing case as a dendrogram, a tree-like structure. Thirdly, we propose a method called hierarchical cluster-class matching (HCCM) to explain the unknown hierarchical representation clusters of that case at the semantic level. The matching results reveal that some unknown clusters are well explained not only by individual semantic classes covering speaker identity, nationality, and gender, but also by conjunctions of these classes.
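The hierarchical clustering step above can be sketched with off-the-shelf tools. The snippet below is a minimal illustration, not the thesis's actual pipeline: it runs single-linkage agglomerative clustering (scipy's "single" method is an implementation of SLINK) on synthetic stand-ins for speaker embeddings; the real work uses embeddings from a trained speaker recognition network, and HDBSCAN as a second algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic stand-ins for speaker embeddings: 3 "speakers", 10 utterances each,
# drawn around well-separated centres in an 8-dimensional space
centres = rng.normal(size=(3, 8)) * 5.0
embeddings = np.vstack([c + rng.normal(scale=0.3, size=(10, 8)) for c in centres])

# Single-linkage hierarchical clustering (SLINK); Z encodes the full merge tree
# that a dendrogram would visualise
Z = linkage(embeddings, method="single", metric="euclidean")

# Cut the tree into three flat clusters; utterances of one "speaker" should
# land in the same cluster
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would render the tree-like structure mentioned in the abstract.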
Regarding the network's attention, prior studies have applied methods based on the Class Activation Maps (CAM) technique to visualise a two-dimensional matrix, referred to as an attention map or saliency map, which highlights the regions of the input that a Convolutional Neural Network (CNN) attends to when classifying the input as a certain class. However, CAM-based methods have not been explored in depth for CNN models trained with a contrastive loss function, and only a few works evaluate and characterise the attention maps they visualise. In this thesis, we first develop a new algorithm, abbreviated as Modified RISE-eval, to evaluate the quality of attention maps visualised by CAM-based methods in the context of speaker recognition. Secondly, we apply two popular CAM methods (GradCAM and LayerCAM) to analyse and visualise the attention mechanism of a contrastive-learned CNN model for speaker recognition. Lastly, we propose a new qualitative analysis method and introduce a quantitative method to characterise the attention maps, giving us a better understanding of the attention mechanism of the contrastive-learned network that these maps represent.
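The CAM idea can be illustrated in a few lines. The sketch below is a generic Grad-CAM computation on a toy PyTorch CNN, assuming a spectrogram-like 2-D input; the tiny model, input size, and target class are placeholders and bear no relation to the thesis's actual speaker model or its contrastive training.

```python
import torch
import torch.nn as nn

# Toy stand-in CNN; a real speaker model would be far deeper
model = nn.Sequential(
    nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
    nn.Conv2d(4, 8, 3, padding=1),          # target layer for Grad-CAM
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
target_layer = model[2]

# Hooks capture the target layer's activations and their gradients
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 1, 32, 32)   # e.g. a log-mel spectrogram patch (placeholder)
score = model(x)[0, 1]          # score of the class being explained
score.backward()

# Grad-CAM: per-channel weights = spatially averaged gradients,
# then a ReLU of the weighted sum over channels gives the saliency map
w = grads["g"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((w * acts["a"]).sum(dim=1)).squeeze(0)
print(cam.shape)                # one saliency value per time-frequency bin
```

The resulting map is what the abstract calls an attention or saliency map; LayerCAM differs mainly in weighting activations element-wise by the (rectified) gradients instead of using channel-averaged weights.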