Muhammad Awais, Centre for Vision, Speech and Signal Processing (CVSSP) and Surrey Institute for People-Centred AI. Research interests: self-supervised learning; deep learning; machine learning; foundation models; multimodal learning and analysis

Dr Muhammad Awais


Senior Lecturer in Trustworthy and Responsible AI, leading research on foundation models and self-supervised learning
PhD in AI, MSc in AI, BSc in Computer Engineering, BSc in Mathematics and Physics

About

Research

Research interests

Sustainable development goals

My research interests are related to the following:

Industry, Innovation, and Infrastructure (UN Sustainable Development Goal 9)
Sustainable Cities and Communities (UN Sustainable Development Goal 11)

Publications

Fatemeh Nazarieh, Josef Kittler, Muhammad Awais Rana, Diptesh Kanojia, Zhenhua Feng (2024) StableTalk: Advancing Audio-to-Talking Face Generation with Stable Diffusion and Vision Transformer, In: Apostolos Antonacopoulos, Subhasis Chaudhuri, Rama Chellappa, Cheng-Lin Liu, Saumik Bhattacharya, Umapada Pal (eds.), Pattern Recognition, pp. 271-286, Springer Nature Switzerland

Audio-to-talking face generation stands at the forefront of advancements in generative AI. It bridges the gap between audio and visual representations by generating synchronized and realistic talking faces. Despite recent progress, the lack of realism in animated faces, asynchronous audio-lip movements, and computational burden remain key barriers to practical applications. To address these challenges, we introduce a novel approach, StableTalk, leveraging the emerging capabilities of Stable Diffusion models and vision transformers for talking face generation. We also integrate the Re-attention mechanism and adversarial loss to improve the consistency of facial animations and synchronization with a given audio input. More importantly, the computational efficiency of our method has been notably enhanced by optimizing operations within the latent space and dynamically adjusting the focus on different parts of the visual content based on the provided conditions. Our experimental results demonstrate the superiority of StableTalk over the existing approaches in image quality, audio-lip synchronization, and computational efficiency.

Ali Akbari, Muhammad Awais, Manijeh Bashar, Josef Kittler (2021) A Theoretical Insight Into the Effect of Loss Function for Deep Semantic-Preserving Learning, In: IEEE Transactions on Neural Networks and Learning Systems 34(1), pp. 119-133, IEEE

Good generalization performance is the fundamental goal of any machine learning algorithm. Using the uniform stability concept, this article theoretically proves that the choice of loss function impacts the generalization performance of a trained deep neural network (DNN). The adopted stability-based framework provides an effective tool for comparing the generalization error bound with respect to the utilized loss function. The main result of our analysis is that using an effective loss function makes stochastic gradient descent more stable, which consequently leads to a tighter generalization error bound and hence better generalization performance. To validate our analysis, we study learning problems in which the classes are semantically correlated. To capture this semantic similarity of neighboring classes, we adopt the well-known semantics-preserving learning framework, namely label distribution learning (LDL). We propose two novel loss functions for the LDL framework and theoretically show that they provide stronger stability than the other widely used loss functions adopted for training DNNs. The experimental results on three applications with semantically correlated classes, including facial age estimation, head pose estimation, and image esthetic assessment, validate the theoretical insights gained by our analysis and demonstrate the usefulness of the proposed loss functions in practical applications.
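
The label distribution learning setup referred to above can be illustrated with a minimal sketch: the hard age label is replaced by a discretised Gaussian over neighbouring ages, and the network is trained with a distribution-matching loss. The Gaussian target, the sigma value, and the KL-divergence baseline below are illustrative assumptions, not the paper's proposed loss functions.

```python
import torch
import torch.nn.functional as F

def gaussian_label_distribution(true_age, num_classes, sigma=2.0):
    """Soft target: a discretised Gaussian centred on the true age label."""
    ages = torch.arange(num_classes, dtype=torch.float32)
    logits = -(ages - float(true_age)) ** 2 / (2.0 * sigma ** 2)
    return torch.softmax(logits, dim=0)

def ldl_kl_loss(pred_logits, true_ages, num_classes, sigma=2.0):
    """KL divergence between predicted and target label distributions."""
    targets = torch.stack([
        gaussian_label_distribution(a, num_classes, sigma) for a in true_ages
    ])
    log_pred = F.log_softmax(pred_logits, dim=1)
    return F.kl_div(log_pred, targets, reduction="batchmean")

# toy usage: a batch of 4 face embeddings mapped to 101 age bins (0-100)
pred_logits = torch.randn(4, 101, requires_grad=True)
loss = ldl_kl_loss(pred_logits, true_ages=[23, 47, 31, 65], num_classes=101)
loss.backward()
```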

Tony Alex, Sara Atito Ali Ahmed, Armin Mustafa, Muhammad Awais, Philip J B Jackson (2025) SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes, In: ICLR 2025 - The Thirteenth International Conference on Learning Representations - Proceedings, ICLR

Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the SSL pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio SSL methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against SOTA methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on the AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1% (mAP).
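
The core data operation implied above, creating polyphonic views by mixing two clips before self-supervised pretraining, can be sketched as follows. The mixing ratio, sample rate and function names are assumptions for illustration, not the SSLAM implementation.

```python
import torch

def mix_audio(wave_a: torch.Tensor, wave_b: torch.Tensor, snr_db: float = 0.0) -> torch.Tensor:
    """Mix two mono waveforms at a given signal-to-noise ratio (in dB)."""
    # scale the second source so the energy ratio matches the target SNR
    power_a = wave_a.pow(2).mean()
    power_b = wave_b.pow(2).mean().clamp_min(1e-12)
    scale = torch.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    mixture = wave_a + scale * wave_b
    # normalise to avoid clipping
    return mixture / mixture.abs().max().clamp_min(1e-9)

# toy usage: two 10-second clips at 16 kHz mixed at 5 dB
a, b = torch.randn(160000), torch.randn(160000)
polyphonic_view = mix_audio(a, b, snr_db=5.0)
```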

Sergio Sanchez Santiesteban, Sara Atito, Muhammad Awais, Yi-Zhe Song, Josef Kittler (2024) Improved Image Captioning Via Knowledge Graph-Augmented Models, In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), pp. 4290-4294, Institute of Electrical and Electronics Engineers (IEEE)

Multimodal foundation models, pre-trained on large-scale data, effectively capture vast amounts of factual and commonsense knowledge. However, these models store all their knowledge within their parameters, requiring increasingly larger models and training data to capture more knowledge. To address this limitation and achieve a more scalable and modular integration of knowledge, we propose a novel knowledge graph-augmented multimodal model. This approach enables a base multimodal model to access pertinent information from an external knowledge graph. Our methodology leverages existing general domain knowledge to facilitate vision-language pre-training using paired images and text descriptions. We conduct comprehensive evaluations demonstrating that our model outperforms state-of-the-art models and yields comparable results to much larger models trained on more extensive datasets. Notably, our model reached a CIDEr score of 145 on MS COCO Captions using only 2.9 million samples, outperforming a 1.4B parameter model by 1.7% despite having 11 times fewer parameters.

Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson (2024) Max-AST: Combining Convolution, Local and Global Self-Attentions for Audio Event Classification, In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), pp. 1061-1065, Institute of Electrical and Electronics Engineers (IEEE)

In the domain of audio transformer architectures, prior research has extensively investigated isotropic architectures that capture the global context through full self-attention and hierarchical architectures that progressively transition from local to global context utilising hierarchical structures with convolutions or window-based attention. However, the idea of imbuing each individual block with both local and global contexts, thereby creating a hybrid transformer block, remains relatively under-explored in the field. To facilitate this exploration, we introduce the Multi-Axis Audio Spectrogram Transformer (Max-AST), an adaptation of MaxViT to the audio domain. Our approach leverages convolution, local window-attention, and global grid-attention in all the transformer blocks. The proposed model excels in efficiency compared to prior methods and consistently outperforms state-of-the-art techniques, achieving significant gains of up to 2.6% on the AudioSet full set. Further, we performed detailed ablations to analyse the impact of each of these components on audio feature learning. The source code is available at https://github.com/ta012/MaxAST.git

Cong Wu, Xiao-Jun Wu, Josef Kittler, Tianyang Xu, Sara Ahmed, Muhammad Awais, Zhenhua Feng (2024) SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition, In: Thirty-Eighth AAAI Conference on Artificial Intelligence, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, Fourteenth Symposium on Educational Advances in Artificial Intelligence, 38(6), pp. 5949-5957, AAAI Press

Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly. Our code and supplementary material can be found at https://github.com/cong-wu/SCD-Net.

Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip J. B. Jackson (2024) DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification, In: AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, 38(16), 1968, pp. 17647-17655, AAAI Press

Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored for the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer) that facilitates interactions across time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pre-trained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available on https://github.com/ta012/DTFAT.git

Xue-Feng Zhu, Tianyang Xu, Sara Atito, Muhammad Awais, Xiao-Jun Wu, Zhenhua Feng, Josef Kittler (2024) Self-supervised learning for RGB-D object tracking, In: Pattern Recognition 155, 110543, Elsevier

Recently, there has been a growing interest in RGB-D object tracking thanks to its promising performance achieved by combining visual information with auxiliary depth cues. However, the limited volume of annotated RGB-D tracking data for offline training has hindered the development of a dedicated end-to-end RGB-D tracker design. Consequently, the current state-of-the-art RGB-D trackers mainly rely on the visual branch to support the appearance modelling, with the depth map utilised for elementary information fusion or failure reasoning of online tracking. Despite the achieved progress, the current paradigms for RGB-D tracking have not fully harnessed the inherent potential of depth information, nor fully exploited the synergy of vision-depth information. Considering the availability of ample unlabelled RGB-D data and the advancement in self-supervised learning, we address the problem of self-supervised learning for RGB-D object tracking. Specifically, an RGB-D backbone network is trained on unlabelled RGB-D datasets using masked image modelling. To train the network, the masking mechanism creates a selective occlusion of the input visible image to force the corresponding aligned depth map to help with discerning and learning vision-depth cues for the reconstruction of the masked visible image. As a result, the pre-trained backbone network is capable of cooperating with crucial visual and depth features of the diverse objects and background in the RGB-D image. The intermediate RGB-D features output by the pre-trained network can effectively be used for object tracking. We thus embed the pre-trained RGB-D network into a transformer-based tracking framework for stable tracking. Comprehensive experiments and the analysis of the results obtained on several RGB-D tracking datasets demonstrate the effectiveness and superiority of the proposed RGB-D self-supervised learning framework and the following tracking approach. Highlights: a novel RGB-D backbone network based on self-supervised learning; joint extraction of RGB-D feature representation for object localisation; a Transformer-based tracking method for RGB-D object tracking; extensive experiments and analyses on four RGB-D tracking benchmarks.
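
The masking mechanism described above can be sketched in a few lines: visible-image patches are randomly occluded while the aligned depth map is left intact, so reconstructing the masked RGB content has to draw on depth cues. The patch size, mask ratio and tensor shapes below are illustrative assumptions, not the authors' code.

```python
import torch

def mask_rgb_keep_depth(rgb, depth, patch=16, mask_ratio=0.6):
    """Randomly zero out RGB patches; the aligned depth map stays fully visible."""
    B, _, H, W = rgb.shape
    gh, gw = H // patch, W // patch
    keep = torch.rand(B, gh, gw) > mask_ratio              # True = patch stays visible
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    masked_rgb = rgb * mask.unsqueeze(1).float()
    return masked_rgb, depth, ~keep                        # ~keep marks patches to reconstruct

# toy usage: a batch of two 224x224 RGB-D frames
rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224)
masked_rgb, depth_in, target_mask = mask_rgb_keep_depth(rgb, depth)
```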

Srinivasa Rao Nandam, Sara Atito Ali Ahmed, Zhen-Hua Feng, Josef Kittler, Muhammad Awais (2024) Enhanced Weakly Supervised Few-shot Classification & Segmentation, In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Proceedings, Institute of Electrical and Electronics Engineers (IEEE)

The emergence of vision-language foundation models has enabled the integration of textual information into vision-based applications. However, in few-shot classification and segmentation (FS-CS), this potential remains underutilised. Commonly, self-supervised vision models have been employed, particularly in weakly-supervised scenarios, to generate pseudo-segmentation masks, as ground truth masks are typically unavailable and only target classification is provided. Despite their success, such models find it difficult to capture accurate semantics when compared to vision-language models. To address this limitation, we propose a novel FS-CS approach that leverages the rich semantic alignment of vision-language models to generate more precise pseudo ground-truth masks. While current vision-language models excel in global visual-text alignment, they struggle with finer, patch-level alignment, which is crucial for detailed segmentation tasks. To overcome this, we introduce a method that enhances patch-level alignment without requiring additional training. In addition, existing FS-CS frameworks typically lack multi-scale information, limiting their ability to capture fine and coarse features simultaneously. To overcome this, we incorporate a module based on atrous convolutions to inject multi-scale information into the feature maps. Together, these contributions - text-enhanced pseudo-mask generation and improved multi-scale feature representation - significantly boost the performance of our model in weakly-supervised settings, surpassing state-of-the-art methods and demonstrating the importance of integrating multi-modal information for robust FS-CS solutions.

Rongchang Li, Zhenhua Feng, Tianyang Xu, Linze Li, Xiao-Jun Wu, Muhammad Awais, Sara Atito, Josef Kittler (2024) C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition, In: Computer Vision - ECCV 2024, Part XXXVIII, 15096, pp. 369-388, Springer Nature

Compositional actions consist of dynamic (verbs) and static (objects) concepts. Humans can easily recognize unseen compositions using the learned concepts. For machines, solving such a problem requires a model to recognize unseen actions composed of previously observed verbs and objects, thus requiring so-called compositional generalization ability. To facilitate this research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. For evaluating the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method to solve the new ZS-CAR task. C2C includes an independent component learning module and a composition inference module. Last, we devise an enhanced training strategy to address the challenges of component variations between seen and unseen compositions and to handle the subtle balance between learning seen and unseen actions. The experimental results demonstrate that the proposed framework significantly surpasses the existing compositional generalization methods and sets a new state-of-the-art. The new Sth-com benchmark and code are available at https://github.com/RongchangLi/ZSCAR_C2C.

Srinivasa Rao Nandam, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammad Awais (2025) Investigating Self-Supervised Methods for Label-Efficient Learning, In: International Journal of Computer Vision 133(7), pp. 4522-4537, Springer

Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks, including classification, segmentation, and detection. However, the potential of these models for low-shot learning across several downstream tasks remains largely underexplored. In this work, we conduct a systematic examination of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, to assess their low-shot capabilities by comparing different pretrained models. In addition, we explore the impact of various collapse avoidance techniques, such as centring, ME-MAX, and Sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework that combines masked image modelling and clustering as pretext tasks. This framework demonstrates superior performance across all examined low-shot downstream tasks, including multi-class classification, multi-label classification and semantic segmentation. Furthermore, when testing the model on large-scale datasets, we show performance gains in various tasks.

Chunyang Cheng, Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Zhangyong Tang, Hui Li, Zeyang Zhang, Sara Atito Ali Ahmed, Muhammad Awais, Josef Kittler (2025) One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion, In: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025, pp. 28102-28112, Institute of Electrical and Electronics Engineers (IEEE)

Advanced image fusion methods mostly prioritise high-level missions, where task interaction struggles with semantic gaps, requiring complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owing to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be available at https://github.com/AWCXV/GIFNet.

Srinivasa Rao Nandam, Sara Atito Ali Ahmed, Zhen-Hua Feng, Josef Vaclav Kittler, Muhammad Awais (2025) Text Augmented Correlation Transformer for Few-shot Classification & Segmentation, In: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 - Proceedings, pp. 25357-25366, Institute of Electrical and Electronics Engineers (IEEE)

Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20^i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5^i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. By rethinking the role of text in FS-CS, our work establishes new benchmarks for multi-modal few-shot learning and demonstrates the efficacy of textual cues for improving model generalization and segmentation accuracy.

Ali Akbari, Muhammad Awais Tanvir Rana, Soroush Fatemifar, Syed Safwan Khalid, Josef Vaclav Kittler (2022) A Novel Ground Metric for Optimal Transport-Based Chronological Age Estimation, In: IEEE Transactions on Cybernetics 52(10), pp. 9986-9999, IEEE

Label Distribution Learning (LDL) is the state-of-the-art approach to deal with a number of real-world applications, such as chronological age estimation from a face image, where there is an inherent similarity among adjacent age labels. LDL takes into account the semantic similarity by assigning a label distribution to each instance. The well-known Kullback-Leibler (KL) divergence is the widely used loss function for the LDL framework. However, the KL divergence does not fully and effectively capture the semantic similarity among age labels, thus leading to sub-optimal performance. In this paper, we propose a novel loss function based on optimal transport theory for LDL-based age estimation. A ground metric function plays an important role in the optimal transport formulation. It should be carefully determined based on the underlying geometric structure of the label space of the application in hand. The label space in the age estimation problem has a specific geometric structure, i.e. closer ages have a stronger inherent semantic relationship. Inspired by this, we devise a novel ground metric function, which enables the loss function to increase the influence of highly correlated ages, thus exploiting the semantic similarity among ages more effectively than the existing loss functions. We then use the proposed loss function, namely the γ-Wasserstein loss, for training a deep neural network (DNN). This leads to a notoriously computationally expensive and non-convex optimisation problem. Following the standard methodology, we formulate the optimisation function as a convex problem and then use an efficient iterative algorithm to update the parameters of the DNN. Extensive experiments in age estimation on different benchmark datasets validate the effectiveness of the proposed method, which consistently outperforms state-of-the-art approaches.

Rafaella E. Sigala, Vasiliki Lagou, Aleksey Shmeliov, Sara Atito, Samaneh Kouchaki, Muhammad Awais, Inga Prokopenko, Adam Mahdi, Ayse Demirkan (2023) Machine Learning to Advance Human Genome-Wide Association Studies, In: Genes 15(1)

Machine learning, including deep learning, reinforcement learning, and generative artificial intelligence, is revolutionising every area of our lives when data are made available. With the help of these methods, we can decipher information from larger datasets while addressing the complex nature of biological systems in a more efficient way. Although machine learning methods were introduced to human genetic epidemiological research as early as 2004, they have never been used to their full capacity. In this review, we outline some of the main applications of machine learning to assigning human genetic loci to health outcomes. We summarise widely used methods and discuss their advantages and challenges. We also identify several tools, such as Combi, GenNet, and GMSTool, specifically designed to integrate these methods for hypothesis-free analysis of genetic variation data. We elaborate on the additional value and limitations of these tools from a geneticist’s perspective. Finally, we discuss the fast-moving field of foundation models and large multi-modal omics biobank initiatives.

Michael Danner, Patrik Huber, Muhammad Awais, Matthias Rätsch, Josef Kittler (2023) GAN-Powered Model & Landmark-Free Reconstruction: A Versatile Approach for High-Quality 3D Facial and Object Recovery from Single Images, In: Deep Learning Theory and Applications, pp. 403-418, Springer

In recent years, 3D facial reconstructions from single images have garnered significant interest. Most of the approaches are based on 3D Morphable Model (3DMM) fitting to reconstruct the 3D face shape. Concurrently, the adoption of Generative Adversarial Networks (GAN) has been gaining momentum to improve the texture of reconstructed faces. In this paper, we propose a fundamentally different approach to reconstructing the 3D head shape from a single image by harnessing the power of GAN. Our method predicts three maps of normal vectors of the head’s frontal, left, and right poses. We are thus presenting a model-free method that does not require any prior knowledge of the object’s geometry to be reconstructed. The key advantage of our proposed approach is the substantial improvement in reconstruction quality compared to existing methods, particularly in the case of facial regions that are self-occluded in the input image. Our method is not limited to 3D face reconstruction. It is generic and applicable to multiple kinds of 3D objects. To illustrate the versatility of our method, we demonstrate its efficacy in reconstructing the entire human body. By delivering a model-free method capable of generating high-quality 3D reconstructions, this paper not only advances the field of 3D facial reconstruction but also provides a foundation for future research and applications spanning multiple object types. The implications of this work have the potential to extend far beyond facial reconstruction, paving the way for innovative solutions and discoveries in various domains.

Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Hyung Jin Chang, Martin Danelljan, Luka Cehovin Zajc, Alan Lukezic, Ondrej Drbohlav, Jani Kapyla, Gustav Hager, Song Yan, Jinyu Yang, Zhongqun Zhang, Gustavo Fernandez, Mohamed Abdelpakey, Goutam Bhat, Llukman Cerkezi, Hakan Cevikalp, Shengyong Chen, Xin Chen, Miao Cheng, Ziyi Cheng, Yu-Chen Chiu, Ozgun Cirakman, Yutao Cui, Kenan Dai, Mohana Murali Dasari, Qili Deng, Xingping Dong, Daniel K. Du, Matteo Dunnhofer, Zhen-Hua Feng, Zhiyong Feng, Zhihong Fu, Shiming Ge, Rama Krishna Gorthi, Yuzhang Gu, Bilge Gunsel, Qing Guo, Filiz Gurkan, Wencheng Han, Yanyan Huang, Felix Jaremo Lawin, Shang-Jhih Jhang, Rongrong Ji, Cheng Jiang, Yingjie Jiang, Felix Juefei-Xu, Yin Jun, Xiao Ke, Fahad Shahbaz Khan, Byeong Hak Kim, Josef Kittler, Xiangyuan Lan, Jun Ha Lee, Bastian Leibe, Hui Li, Jianhua Li, Xianxian Li, Yuezhou Li, Bo Liu, Chang Liu, Jingen Liu, Li Liu, Qingjie Liu, Huchuan Lu, Wei Lu, Jonathon Luiten, Jie Ma, Ziang Ma, Niki Martinel, Christoph Mayer, Alireza Memarmoghadam, Christian Micheloni, Yuzhen Niu, Danda Paudel, Houwen Peng, Shoumeng Qiu, Aravindh Rajiv, Muhammad Rana, Andreas Robinson, Hasan Saribas, Ling Shao, Mohamed Shehata, Furao Shen, Jianbing Shen, Kristian Simonato, Xiaoning Song, Zhangyong Tang, Radu Timofte, Philip Torr, Chi-Yi Tsai, Bedirhan Uzun, Luc Van Gool, Paul Voigtlaender, Dong Wang, Guangting Wang, Liangliang Wang, Lijun Wang, Limin Wang, Linyuan Wang, Yong Wang, Yunhong Wang, Chenyan Wu, Gangshan Wu, Xiao-Jun Wu, Fei Xie, Tianyang Xu, Xiang Xu, Wanli Xue, Bin Yan, Wankou Yang, Xiaoyun Yang, Yu Ye, Jun Yin, Chengwei Zhang, Chunhui Zhang, Haitao Zhang, Kaihua Zhang, Kangkai Zhang, Xiaohan Zhang, Xiaolin Zhang, Xinyu Zhang, Zhibin Zhang, Shaochuan Zhao, Ming Zhen, Bineng Zhong, Jiawen Zhu, Xue-Feng Zhu (2021) The Ninth Visual Object Tracking VOT2021 Challenge Results, In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) 2021, pp. 2711-2738, IEEE

The Visual Object Tracking challenge VOT2021 is the ninth annual tracker benchmarking activity organized by the VOT initiative. Results of 71 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2021 challenge was composed of four sub-challenges focusing on different tracking domains: (i) VOT-ST2021 challenge focused on short-term tracking in RGB, (ii) VOT-RT2021 challenge focused on "real-time" short-term tracking in RGB, (iii) VOT-LT2021 focused on long-term tracking, namely coping with target disappearance and reappearance and (iv) VOT-RGBD2021 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2021 dataset was refreshed, while VOT-RGBD2021 introduces a training dataset and sequestered dataset for winner identification. The source code for most of the trackers, the datasets, the evaluation kit and the results are publicly available at the challenge website.

Fatemeh Nazarieh, Zhenhua Feng, Muhammad Awais, Wenwu Wang, Josef Vaclav Kittler (2024) A Survey of Cross-Modal Visual Content Generation, In: IEEE Transactions on Circuits and Systems for Video Technology, Institute of Electrical and Electronics Engineers (IEEE)

Cross-modal content generation has become very popular in recent years. To generate high-quality and realistic content, a variety of methods have been proposed. Among these approaches, visual content generation has attracted significant attention from academia and industry due to its vast potential in various applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the existing publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models. We provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Various evaluation metrics are also introduced along with the datasets. Furthermore, we discuss the challenges and limitations encountered in the area, such as modality alignment and semantic coherence. Last, we outline possible future directions for synthesizing visual content from other modalities including the exploration of new modalities, and the development of multi-task multi-modal networks. This survey serves as a resource for researchers interested in quickly gaining insights into this burgeoning field.

Recently, impressively growing efforts have been devoted to the challenging task of facial age estimation. The improvements in performance achieved by new algorithms are measured on several benchmarking test databases with different characteristics to check on consistency. While this is a valuable methodology in itself, a significant issue in the most age estimation related studies is that the reported results lack an assessment of intrinsic system uncertainty. Hence, a more in-depth view is required to examine the robustness of age estimation systems in different scenarios. The purpose of this paper is to conduct an evaluative and comparative analysis of different age estimation systems to identify trends, as well as the points of their critical vulnerability. In particular, we investigate four age estimation systems, including the online Microsoft service, two best state-of-the-art approaches advocated in the literature, as well as a novel age estimation algorithm. We analyse the effect of different internal and external factors, including gender, ethnicity, expression, makeup, illumination conditions, quality and resolution of the face images, on the performance of these age estimation systems. The goal of this sensitivity analysis is to provide the biometrics community with the insight and understanding of the critical subject-, camera- and environmental-based factors that affect the overall performance of the age estimation system under study.

Cong Wu, Xiao-Jun Wu, Josef Kittler, Tianyang Xu, Sara Atito, Muhammad Awais, Zhenhua Feng SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition, In: arXiv.org

Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly.

Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Zhenhua Feng, Josef Kittler Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding, In: arXiv.org

Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and other patches, after which a simple thresholding is applied to segment the target. Another evaluation is intra-object and inter-object similarity to gauge the discriminatory ability of SSP ViTs. Insights from zero-shot segmentation from prompting and the discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling for encouraging similarity of local features, Momentum based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
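
The zero-shot segmentation protocol described above reduces to a similarity map between one prompted patch token and all other patch tokens, followed by thresholding. The sketch below assumes ViT patch features have already been extracted; the grid size, threshold and names are illustrative, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def prompt_patch_segmentation(patch_feats, prompt_idx, grid=14, threshold=0.6):
    """patch_feats: (N, D) patch embeddings from a ViT; prompt_idx: index of the prompted patch."""
    feats = F.normalize(patch_feats, dim=1)
    sim = feats @ feats[prompt_idx]                              # cosine similarity to the prompt patch
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-9)     # rescale to [0, 1]
    mask = (sim >= threshold).float().reshape(grid, grid)        # binary segmentation on the patch grid
    return mask

# toy usage: 14x14 = 196 patch tokens of dimension 384, prompt on patch 100
feats = torch.randn(196, 384)
binary_mask = prompt_patch_segmentation(feats, prompt_idx=100)
```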

Sara Atito Ali Ahmed, Muhammad Awais, Wenwu Wang, Mark D. Plumbley, Josef Kittler (2024) ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, pp. 3684-3693, Institute of Electrical and Electronics Engineers (IEEE)

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labelled data, most transformer-based models for audio tasks are finetuned from ImageNet pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated the research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose the Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance in five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining.

Ali Akbari, Muhammad Awais, Soroush Fatemifar, Syed Safwan Khalid, Josef Kittler (2022) RAgE: Robust Age Estimation Through Subject Anchoring with Consistency Regularisation, In: IEEE Transactions on Pattern Analysis and Machine Intelligence, PP, pp. 1-15, IEEE

Modern facial age estimation systems can achieve high accuracy when training and test datasets are identically distributed and captured under similar conditions. However, domain shifts in data, encountered in practice, lead to a sharp drop in accuracy of most existing age estimation algorithms. In this work, we propose a novel method, namely RAgE, to improve the robustness and reduce the uncertainty of age estimates by leveraging unlabelled data through a subject anchoring strategy and a novel consistency regularisation term. First, we propose a similarity-preserving pseudo-labelling algorithm by which the model generates pseudo-labels for a cohort of unlabelled images belonging to the same subject, while taking into account the similarity among age labels. In order to improve the robustness of the system, a consistency regularisation term is then used to simultaneously encourage the model to produce invariant outputs for the images in the cohort with respect to an anchor image. We propose a novel consistency regularisation term whose noise-tolerant property effectively mitigates the so-called confirmation bias caused by incorrect pseudo-labels. Experiments on multiple benchmark ageing datasets demonstrate substantial improvements over the state-of-the-art methods and robustness to confounding external factors, including the subject's head pose, illumination variation and appearance of expression in the face image.
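
A minimal sketch of the consistency idea above: predictions for several unlabelled images of the same subject are pulled towards the prediction for an anchor image of that subject. The absolute-difference penalty and the toy model below are illustrative simplifications, not the RAgE formulation.

```python
import torch

def anchored_consistency_loss(model, anchor, cohort):
    """Encourage cohort predictions (same subject) to agree with the anchor prediction."""
    with torch.no_grad():
        anchor_pred = model(anchor)            # pseudo-label from the anchor image
    cohort_pred = model(cohort)                # predictions for the other images of the subject
    return (cohort_pred - anchor_pred).abs().mean()

# toy usage: a linear "age regressor" over flattened 64x64 images
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
anchor = torch.randn(1, 3, 64, 64)
cohort = torch.randn(4, 3, 64, 64)
loss = anchored_consistency_loss(model, anchor, cohort)
loss.backward()
```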

Lei Ju, Josef Vaclav Kittler, Muhammad Awais Tanvir Rana, Wankou Yang, Zhenhua Feng (2023) Keep an eye on faces: Robust face detection with heatmap-assisted spatial attention and scale-aware layer attention, In: Pattern Recognition 140, Elsevier

Highlights: we propose supervised spatial attention that employs a heatmap generator for instructive feature learning; we formulate a rectified Gaussian scoring function to generate informative heatmaps; we present scale-aware layer attention that eliminates redundant information from pyramid features; a voting strategy is designed to produce more reliable classification results; our face detector achieves encouraging performance in accuracy and speed on several benchmarks. Modern anchor-based face detectors learn discriminative features using large-capacity networks and extensive anchor settings. In spite of their promising results, they are not without problems. First, most anchors extract redundant features from the background. As a consequence, the performance improvements are achieved at the expense of a disproportionate computational complexity. Second, the predicted face boxes are only distinguished by a classifier supervised by pre-defined positive, negative and ignored anchors. This strategy may ignore potential contributions from cohorts of anchors labelled negative/ignored during inference simply because of their inferior initialisation, although they can regress well to a target. In other words, true positives and representative features may get filtered out by unreliable confidence scores. To deal with the first concern and achieve more efficient face detection, we propose a Heatmap-assisted Spatial Attention (HSA) module and a Scale-aware Layer Attention (SLA) module to extract informative features using lower computational costs. To be specific, SLA incorporates the information from all the feature pyramid layers, weighted adaptively to remove redundant layers. HSA predicts a reshaped Gaussian heatmap and employs it to facilitate a spatial feature selection by better highlighting facial areas. For more reliable decision-making, we merge the predicted heatmap scores and classification results by voting. Since our heatmap scores are based on the distance to the face centres, they are able to retain all the well-regressed anchors. The experiments obtained on several well-known benchmarks demonstrate the merits of the proposed method.
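
The heatmap supervision discussed above can be illustrated with a small sketch that places a Gaussian bump at each face-box centre and suppresses its long tail. The truncation rule, sigma scaling and function names here are assumptions for illustration, not the paper's exact rectified Gaussian scoring function.

```python
import numpy as np

def face_centre_heatmap(h, w, boxes, sigma_scale=0.25, floor=0.05):
    """boxes: list of (x1, y1, x2, y2). Returns an (h, w) heatmap in [0, 1]."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sigma = sigma_scale * max(x2 - x1, y2 - y1)       # spread scales with face size
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        g[g < floor] = 0.0                                 # rectify: suppress the long Gaussian tail
        heat = np.maximum(heat, g)                         # keep the strongest response per pixel
    return heat

# toy usage: two faces in a 320x320 image
heatmap = face_centre_heatmap(320, 320, boxes=[(40, 60, 120, 160), (200, 80, 280, 170)])
```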

Cemre Zor, Muhammad Awais, Josef Kittler, Miroslaw Bober, Sameed Husain, Qiuqiang Kong, Christian Kroos (2019) Divergence Based Weighting for Information Channels in Deep Convolutional Neural Networks for Bird Audio Detection, In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, pp. 3052-3056, IEEE

In this paper, we address the problem of bird audio detection and propose a new convolutional neural network architecture together with a divergence based information channel weighting strategy in order to achieve improved state-of-the-art performance and faster convergence. The effectiveness of the methodology is shown on the Bird Audio Detection Challenge 2018 (Detection and Classification of Acoustic Scenes and Events Challenge, Task 3) development data set.
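
One way to read the divergence-based weighting idea is that each information channel is scored by how much its activation distributions differ between the two classes (bird vs. no bird), and the scores are normalised into weights. The histogram-based KL estimate and names below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kl_channel_weights(pos_feats, neg_feats, bins=32, eps=1e-8):
    """pos_feats, neg_feats: (N, C) channel activations for positive / negative clips.
    Returns a (C,) weight vector; channels whose class distributions differ more get higher weight."""
    C = pos_feats.shape[1]
    scores = np.zeros(C)
    for c in range(C):
        lo = min(pos_feats[:, c].min(), neg_feats[:, c].min())
        hi = max(pos_feats[:, c].max(), neg_feats[:, c].max())
        p, _ = np.histogram(pos_feats[:, c], bins=bins, range=(lo, hi))
        q, _ = np.histogram(neg_feats[:, c], bins=bins, range=(lo, hi))
        p = p / p.sum() + eps
        q = q / q.sum() + eps
        scores[c] = np.sum(p * np.log(p / q))   # KL(p || q) estimated from histograms
    return scores / scores.sum()

# toy usage: 64 channels, 500 positive and 500 negative examples
w = kl_channel_weights(np.random.randn(500, 64) + 0.5, np.random.randn(500, 64))
```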

Ali Akbari, Muhammad Awais, Soroush Fatemifar, Josef Kittler (2023) Deep Order-Preserving Learning With Adaptive Optimal Transport Distance, In: IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1), pp. 313-328, IEEE

We consider a framework for taking into consideration the relative importance (ordinality) of object labels in the process of learning a label predictor function. The commonly used loss functions are not well matched to this problem, as they exhibit deficiencies in capturing natural correlations of the labels and the corresponding data. We propose to incorporate such correlations into our learning algorithm using an optimal transport formulation. Our approach is to learn the ground metric, which is partly involved in forming the optimal transport distance, by leveraging ordinality as a general form of side information in its formulation. Based on this idea, we then develop a novel loss function for training deep neural networks. A highly efficient alternating learning method is then devised to alternatively optimise the ground metric and the deep model in an end-to-end learning manner. This scheme allows us to adaptively adjust the shape of the ground metric, and consequently the shape of the loss function for each application. We back up our approach by theoretical analysis and verify the performance of our proposed scheme by applying it to two learning tasks, i.e. chronological age estimation from the face and image aesthetic assessment. The numerical results on several benchmark datasets demonstrate the superiority of the proposed algorithm.
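
When the target is a single ground-truth label, the optimal transport cost with a given ground metric reduces to the expected cost of the predicted distribution, which gives a compact way to illustrate the ordinal loss discussed above. The absolute-difference ground metric below is a fixed example; the paper's contribution is to learn this metric adaptively, which the sketch does not do.

```python
import torch

def ordinal_ot_loss(pred_logits, true_labels, num_classes):
    """Expected transport cost from the predicted distribution to a one-hot target,
    using the ground metric C(i, j) = |i - j| (ordinal distance between labels)."""
    labels = torch.arange(num_classes, dtype=torch.float32)
    cost = (labels.unsqueeze(0) - true_labels.float().unsqueeze(1)).abs()   # (B, K) costs
    probs = torch.softmax(pred_logits, dim=1)
    return (probs * cost).sum(dim=1).mean()

# toy usage: 4 samples, 101 age classes
logits = torch.randn(4, 101, requires_grad=True)
loss = ordinal_ot_loss(logits, torch.tensor([23, 47, 31, 65]), num_classes=101)
loss.backward()
```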

Jiantao Wu, Shentong Mo, Sara Ahmed, Zhen-Hua Feng, Josef Vaclav Kittler, Muhammad Awais (2024) DailyMAE: Towards Pretraining Masked Autoencoders in One Day

Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data. Numerous studies underscore the advantages of MIM, highlighting how models pretrained on extensive datasets can enhance the performance of downstream tasks. However, the high computational demands of pretraining pose significant challenges, particularly within academic environments, thereby impeding the SSL research progress. In this study, we propose efficient training recipes for MIM-based SSL that focus on mitigating data loading bottlenecks and employing progressive training techniques and other tricks to closely maintain pretraining performance. Our library enables the training of a MAE-Base/16 model on the ImageNet-1K dataset for 800 epochs within just 18 hours, using a single machine equipped with 8 A100 GPUs. By achieving speed gains of up to 5.8 times, this work not only demonstrates the feasibility of conducting high-efficiency SSL training but also paves the way for broader accessibility and promotes advancement in SSL research, particularly for prototyping and initial testing of SSL ideas.

Soroush Fatemifar, Muhammad Awais, Ali Akbari, Josef Kittler (2020) A Stacking Ensemble for Anomaly Based Client-Specific Face Spoofing Detection, In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 1371-1375, IEEE

To counteract spoofing attacks, the majority of recent approaches to face spoofing attack detection formulate the problem as a binary classification task in which real data and attack-accesses are both used to train spoofing detectors. Although the classical training framework has been demonstrated to deliver satisfactory results, its robustness to unseen attacks is debatable. Inspired by the recent success of anomaly detection models in face spoofing detection, we propose an ensemble of one-class classifiers fused by a Stacking ensemble method to reduce the generalisation error in the more realistic unseen attack scenario. To be consistent with this scenario, anomalous samples are considered neither for training the component anomaly classifiers nor for the design of the Stacking ensemble. To achieve better face anti-spoofing results, we adopt client-specific information to build both constituent classifiers as well as the Stacking combiner. Besides, we propose a novel 2-stage Genetic Algorithm to further improve the generalisation performance of the Stacking ensemble. We evaluate the effectiveness of the proposed systems on publicly available face anti-spoofing databases including Replay-Attack, Replay-Mobile and Rose-Youtu. The experimental results following the unseen attack evaluation protocol confirm the merits of the proposed model.

Sara Atito Ali Ahmed, Cemre Zor, Muhammad Awais, Berrin Yanikoglu, Josef Kittler (2021) Deep Convolutional Neural Network Ensembles Using ECOC, In: IEEE Access 9, pp. 86083-86095, IEEE

Deep neural networks have enhanced the performance of decision making systems in many applications, including image understanding, and further gains can be achieved by constructing ensembles. However, designing an ensemble of deep networks is often not very beneficial since the time needed to train the networks is generally very high or the performance gain obtained is not very significant. In this paper, we analyse an error correcting output coding (ECOC) framework for constructing ensembles of deep networks and propose different design strategies to address the accuracy-complexity trade-off. We carry out an extensive comparative study between the introduced ECOC designs and the state-of-the-art ensemble techniques such as ensemble averaging and gradient boosting decision trees. Furthermore, we propose a fusion technique, that is shown to achieve the highest classification performance.
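
Error correcting output coding can be sketched independently of the deep networks involved: each class is assigned a binary codeword, one binary learner is trained per codeword column, and a test sample is assigned to the class whose codeword is nearest, in Hamming distance, to the vector of learner outputs. The random code matrix and soft distance below are a generic ECOC illustration, not the specific designs compared in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, code_len = 10, 15
code_matrix = rng.integers(0, 2, size=(num_classes, code_len))   # one binary codeword per class

def ecoc_decode(bit_scores):
    """bit_scores: (code_len,) outputs in [0, 1] from the binary learners.
    Returns the class whose codeword is closest in (soft) Hamming distance."""
    dists = np.abs(code_matrix - bit_scores).sum(axis=1)
    return int(np.argmin(dists))

# toy usage: noisy learner outputs centred on the codeword of class 3
scores = np.clip(code_matrix[3] + rng.normal(0, 0.3, size=code_len), 0, 1)
predicted = ecoc_decode(scores)
```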

Soroush Fatemifar, Muhammad Awais, Ali Akbari, Josef Kittler (2022) Developing a generic framework for anomaly detection, In: Pattern Recognition 124, Elsevier

The fusion of one-class classifiers (OCCs) has been shown to exhibit promising performance in a variety of machine learning applications. The ability to assess the similarity or correlation between the outputs of various OCCs is an important prerequisite for building a meaningful OCC ensemble. However, this aspect of the OCC fusion problem has been mostly ignored so far. In this paper, we propose a new method of constructing a fusion of OCCs with three contributions: (a) As a key contribution, enabling an OCC ensemble design using exclusively non-anomalous samples, we propose a novel fitness function to evaluate the competency of OCCs without requiring samples from the anomalous class; (b) As a minor, but impactful contribution, we investigate alternative forms of score normalisation of OCCs, and identify a novel two-sided normalisation method as the best in coping with long-tail non-anomalous data distributions; (c) In the context of building our proposed OCC fusion system based on the weighted averaging approach, we find that the weights optimised using a particle swarm optimisation algorithm produce the most effective solution. We evaluate the merits of the proposed method on 15 benchmarking datasets from different application domains including medical, anti-spam and face spoofing detection. The comparison of the proposed approach with state-of-the-art methods alongside the statistical analysis confirms the effectiveness of the proposed model.
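
A minimal sketch of the fusion step: each one-class classifier's scores are normalised and then combined by a weighted average. The median/IQR two-sided normalisation and the fixed weights below are illustrative stand-ins; the paper's normalisation differs and the weights are optimised with particle swarm optimisation rather than set by hand.

```python
import numpy as np

def two_sided_normalise(scores, ref_scores):
    """Centre on the median of the reference (training) scores and scale each side
    by its own spread, so a long one-sided tail does not dominate."""
    med = np.median(ref_scores)
    upper = np.percentile(ref_scores, 75) - med + 1e-9
    lower = med - np.percentile(ref_scores, 25) + 1e-9
    return np.where(scores >= med, (scores - med) / upper, (scores - med) / lower)

def fuse_occ_scores(score_matrix, ref_matrix, weights):
    """score_matrix: (N, M) raw scores of N samples from M one-class classifiers."""
    normalised = np.column_stack([
        two_sided_normalise(score_matrix[:, m], ref_matrix[:, m])
        for m in range(score_matrix.shape[1])
    ])
    return normalised @ (weights / weights.sum())   # weighted-average fused score

# toy usage: 3 OCCs, 100 reference (training) samples, 5 test samples
ref = np.random.randn(100, 3)
test = np.random.randn(5, 3)
fused = fuse_occ_scores(test, ref, weights=np.array([0.5, 0.3, 0.2]))
```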

Syed Safwan Khalid, Muhammad Awais Tanvir Rana, Zhenhua Feng, Chi Ho Chan, Ammarah Farooq, Ali Akbari, Josef Vaclav Kittler (2022) NPT-Loss: Demystifying face recognition losses with Nearest Proxies Triplet, In: IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE

Face recognition (FR) using deep convolutional neural networks (DCNNs) has seen remarkable success in recent years. One key ingredient of DCNN-based FR is the design of a loss function that ensures discrimination between various identities. The state-of-the-art (SOTA) solutions utilise normalised Softmax loss with additive and/or multiplicative margins. Despite being popular and effective, these losses are justified only intuitively with little theoretical explanations. In this work, we show that under the LogSumExp (LSE) approximation, the SOTA Softmax losses become equivalent to a proxy-triplet loss that focuses on nearest-neighbour negative proxies only. This motivates us to propose a variant of the proxy-triplet loss, entitled Nearest Proxies Triplet (NPT) loss, which unlike SOTA solutions, converges for a wider range of hyper-parameters and offers flexibility in proxy selection and thus outperforms SOTA techniques. We generalise many SOTA losses into a single framework and give theoretical justifications for the assertion that minimising the proposed loss ensures a minimum separability between all identities. We also show that the proposed loss has an implicit mechanism of hard-sample mining. We conduct extensive experiments using various DCNN architectures on a number of FR benchmarks to demonstrate the efficacy of the proposed scheme over SOTA methods.
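
The nearest-proxies triplet idea can be sketched as a proxy-based loss in which each embedding is pulled towards its identity proxy and pushed away from its single most similar negative proxy. The margin, cosine similarity and names below are a simplified reading of the loss family discussed above, not the exact NPT-Loss.

```python
import torch
import torch.nn.functional as F

def nearest_proxy_triplet_loss(embeddings, labels, proxies, margin=0.2):
    """embeddings: (B, D); proxies: (K, D), one per identity; labels: (B,) identity indices."""
    emb = F.normalize(embeddings, dim=1)
    prx = F.normalize(proxies, dim=1)
    sims = emb @ prx.t()                                   # (B, K) cosine similarities
    pos = sims.gather(1, labels.unsqueeze(1)).squeeze(1)   # similarity to the own-class proxy
    sims_neg = sims.scatter(1, labels.unsqueeze(1), float("-inf"))
    hardest_neg = sims_neg.max(dim=1).values               # nearest negative proxy only
    return F.relu(hardest_neg - pos + margin).mean()

# toy usage: 8 embeddings, 100 identities, 128-d embedding space
emb = torch.randn(8, 128, requires_grad=True)
proxies = torch.randn(100, 128, requires_grad=True)
labels = torch.randint(0, 100, (8,))
loss = nearest_proxy_triplet_loss(emb, labels, proxies)
loss.backward()
```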