Dr Muhammad Awais
Academic and research departments
Surrey Institute for People-Centred Artificial Intelligence (PAI), Centre for Vision, Speech and Signal Processing (CVSSP).

About
Biography
I was lucky to be part of the research (together with colleagues Sara Atito and Josef Kittler) which resulted in the first state-of-the-art (SOTA) masked image modelling (MIM) approach for vision transformers, using the simple principles of heavy masking and recovery of information without human-annotated labels. The proposed MIM approach outperformed all existing self-supervised learning (SSL) SOTA methods, including joint embedding-based architectures. It marked a milestone in computer vision as the first method to use self-supervised pretraining (SSP) to outperform supervised pretraining (SP). According to Yann LeCun, a leading artificial intelligence (AI) researcher and Meta's Chief AI Scientist, MIM has revolutionised SSL. To put this in context, I will first introduce the challenges of AI and SSL, and then the breakthrough.
The challenge: AI has seen phenomenal growth over the last decade, mainly thanks to supervised learning (SL) and supervised pretraining (SP) of deep neural networks (DNNs). After the first few years, this growth stagnated somewhat due to the lack of labelled data available for training DNNs. The tech giants' solution was to collect millions or even billions of weakly labelled samples to pretrain DNNs. Despite this, there was an anticipation that un/self-supervised learning (SSL), i.e., learning without human-annotated labels, was the way forward for AI because of its closeness to human-like learning. In the words of Yann LeCun, “The revolution will not be supervised”. However, the problem with self-supervised pretraining (SSP) of DNNs was that it did not outperform supervised pretraining (SP) on downstream tasks, particularly in computer vision, despite significant efforts from tech giants and top AI researchers. In 2018, SSL saw huge success in natural language processing (NLP) with models like BERT and GPT, which are trained using masked language modelling (MLM) and auto-regressive generative pretraining. However, this MLM principle was not easy to adapt to computer vision, as Yann LeCun noted in his March 2021 blog post (The dark matter of intelligence): “But we cannot use this trick (MLM) for images because we cannot enumerate all possible images. Is there a solution to this problem? The short answer is no. There are interesting ideas in this direction, but they have not yet led to results that are as good as joint embedding architectures.”
The breakthrough: At the beginning of 2021, we conducted the research that produced the first working version of masked image modelling (MIM), which we dubbed GMML (Group Masked Model Learning), in our seminal work SiT (Self-Supervised vIsion Transformers; released on 8 April 2021 along with code). GMML marked a milestone in computer vision by being the first SSP method to outperform SP across multiple tasks, and it also outperformed all existing SSL methods, including joint embedding architectures. In this sense, GMML is the first working foundation model for the vision (image-only) modality. Prior to GMML, SSL methods in computer vision were unsustainable and largely limited to large groups and AI tech giants, owing to the complexity of SSL algorithms and their resource requirements (large batch sizes, model sizes, dataset sizes, etc.). GMML democratised SSL and made it sustainable by enabling SSP on a single GPU and on small datasets. In fact, the GMML principles of heavy masking and recovery of information have been shown to be the best way to exploit information, both for small and medium amounts of data by our research group and, later, for large datasets and large models by tech giants such as Microsoft, Meta and Nvidia. The strength of GMML over SOTA SSL, including joint embedding-based methods, is particularly evident when it is combined with self-supervised clustering: GMML then shows remarkable improvements over joint embedding-based SOTA methods proposed by tech giants. A March 2023 tweet by Yann LeCun acknowledges that SSL for computer vision has been revolutionised by MIM, the principles of which we laid down at the beginning of 2021.
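The masking-and-recovery principle behind GMML can be sketched in a few lines of numpy. This is an illustrative toy, not the SiT/GMML implementation: the 70% mask ratio, zero-replacement of masked patches and the L1 recovery loss are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.7):
    """Heavily mask a high proportion of patch tokens (GMML-style masking).
    patches: (num_patches, patch_dim) array. Returns corrupted copy + bool mask."""
    num_patches = patches.shape[0]
    num_masked = int(mask_ratio * num_patches)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    corrupted = patches.copy()
    corrupted[mask] = 0.0  # masked patches are blanked (noise is also common)
    return corrupted, mask

def reconstruction_loss(predicted, original, mask):
    """L1 recovery loss computed on the masked patches only."""
    return np.abs(predicted[mask] - original[mask]).mean()

# toy example: 196 tokens of a 14x14 patch grid, 768-dim each
patches = rng.standard_normal((196, 768))
corrupted, mask = mask_patches(patches, mask_ratio=0.7)
print(mask.sum())  # at a 70% ratio, 137 of the 196 patches are masked
```

A transformer is then trained to predict `patches` from `corrupted`, with the loss taken only over the masked positions, so the model must recover the missing content from visible context.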
I am a senior lecturer (associate professor) jointly at the Centre for Vision, Speech & Signal Processing (CVSSP) and the Surrey Institute for People-Centred AI (SI-PAI), where I lead research on foundation models and self-supervised learning. I am also responsible for technical aspects of trustworthy and explainable AI.
Building next-generation AI algorithms is futile unless it benefits society. Therefore, I couple fundamental AI development with applications for the benefit of society. This is evident from the applicability of my research to a range of application areas, from healthcare to security. One example is the startup Sensus Futuris, which I co-founded with Prof. Kittler; its focus is to use innovative AI algorithms to make society safer and more efficient. Another aspect of people-centred AI is evident in the industrial funding (e.g., Innovate UK, Ignite, etc.) I have co-designed and co-led, with a particular focus on the people-centred nature of the AI products. Some of the AI algorithms we worked on (at an Imperial College London startup) were deployed at huge scale at eBay, Macy's, Zalando, etc. for recommending visually related items to their users. My experience of working in both industry and academia places me well to develop innovative AI algorithms, advance their theoretical underpinnings, transfer technology and have a bigger impact on society.
Research

Research interests
The focus of my research is on core AI/ML/DL algorithms and their applicability to a wide range of application areas. My research interests include foundation models, un/self-supervised learning, cross/multi-modal learning, theoretical insights and understanding of deep learning, computer vision, NLP, medical image analysis, audio, retrieval, biometrics, security.
You can find some of my interesting research work on my Google Scholar profile. (Note that it is usually not up to date: if I enable auto-update, I start receiving dozens of papers that do not belong to me, and I only very occasionally find the time to add a few interesting papers.)
Publications
Face recognition (FR) using deep convolutional neural networks (DCNNs) has seen remarkable success in recent years. One key ingredient of DCNN-based FR is the design of a loss function that ensures discrimination between various identities. The state-of-the-art (SOTA) solutions utilise normalised Softmax loss with additive and/or multiplicative margins. Despite being popular and effective, these losses are justified only intuitively, with little theoretical explanation. In this work, we show that under the LogSumExp (LSE) approximation, the SOTA Softmax losses become equivalent to a proxy-triplet loss that focuses on nearest-neighbour negative proxies only. This motivates us to propose a variant of the proxy-triplet loss, entitled Nearest Proxies Triplet (NPT) loss, which, unlike SOTA solutions, converges for a wider range of hyper-parameters and offers flexibility in proxy selection, and thus outperforms SOTA techniques. We generalise many SOTA losses into a single framework and give theoretical justifications for the assertion that minimising the proposed loss ensures a minimum separability between all identities. We also show that the proposed loss has an implicit mechanism of hard-sample mining. We conduct extensive experiments using various DCNN architectures on a number of FR benchmarks to demonstrate the efficacy of the proposed scheme over SOTA methods.
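The LSE connection mentioned in the abstract can be illustrated numerically. The sketch below (hypothetical function names, not the paper's code) shows that softmax cross-entropy over class proxies upper-bounds a triplet term involving only the nearest, i.e. highest-scoring, negative proxy, which is the sense in which LSE approximates the max.

```python
import numpy as np

def softmax_ce(scores, y):
    """Normalised-softmax cross-entropy for one sample: -log p_y."""
    s = scores - scores.max()  # numerical stability
    return -(s[y] - np.log(np.exp(s).sum()))

def nearest_proxy_triplet(scores, y):
    """Proxy-triplet surrogate: gap between the hardest (highest-scoring)
    negative proxy and the positive proxy, hinged at zero."""
    negatives = np.delete(scores, y)
    return max(0.0, negatives.max() - scores[y])

rng = np.random.default_rng(1)
scores = rng.standard_normal(1000)  # similarity scores against 1000 class proxies
y = 3

# LSE >= max, so the softmax loss always upper-bounds the nearest-negative term
assert softmax_ce(scores, y) >= nearest_proxy_triplet(scores, y)
```

The bound holds because log-sum-exp over all scores is at least the largest individual score, so minimising the softmax loss implicitly pushes the positive-proxy score above the hardest negative proxy.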
Highlights:
• We propose supervised spatial attention that employs a heatmap generator for instructive feature learning.
• We formulate a rectified Gaussian scoring function to generate informative heatmaps.
• We present scale-aware layer attention that eliminates redundant information from pyramid features.
• A voting strategy is designed to produce more reliable classification results.
• Our face detector achieves encouraging performance in accuracy and speed on several benchmarks.

Modern anchor-based face detectors learn discriminative features using large-capacity networks and extensive anchor settings. In spite of their promising results, they are not without problems. First, most anchors extract redundant features from the background. As a consequence, the performance improvements are achieved at the expense of a disproportionate computational complexity. Second, the predicted face boxes are only distinguished by a classifier supervised by pre-defined positive, negative and ignored anchors. This strategy may ignore potential contributions from cohorts of anchors labelled negative/ignored during inference simply because of their inferior initialisation, although they can regress well to a target. In other words, true positives and representative features may get filtered out by unreliable confidence scores. To deal with the first concern and achieve more efficient face detection, we propose a Heatmap-assisted Spatial Attention (HSA) module and a Scale-aware Layer Attention (SLA) module to extract informative features using lower computational costs. To be specific, SLA incorporates the information from all the feature pyramid layers, weighted adaptively to remove redundant layers. HSA predicts a reshaped Gaussian heatmap and employs it to facilitate a spatial feature selection by better highlighting facial areas. For more reliable decision-making, we merge the predicted heatmap scores and classification results by voting.
Since our heatmap scores are based on the distance to the face centres, they are able to retain all the well-regressed anchors. The experiments obtained on several well-known benchmarks demonstrate the merits of the proposed method.
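A toy sketch of the heatmap-and-voting idea follows. It is illustrative only: the plain 2-D Gaussian and the fixed voting weight are stand-ins for the paper's rectified Gaussian scoring function and its learned fusion.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma):
    """2-D Gaussian heatmap peaked at the face centre (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def vote(cls_score, heat_score, w_cls=0.5):
    """Merge classifier confidence with the heatmap score by weighted voting."""
    return w_cls * cls_score + (1 - w_cls) * heat_score

hm = gaussian_heatmap(64, 64, cx=32, cy=32, sigma=8.0)
assert hm[32, 32] == 1.0        # score peaks at the face centre
assert hm[0, 0] < hm[32, 40]    # and decays with distance from it
# an anchor near the centre keeps a usable fused score despite a weak classifier
assert vote(0.4, hm[32, 34]) > vote(0.4, hm[0, 0])
```

Because the heatmap score depends only on distance to the face centre, a well-regressed anchor keeps a high fused score even if the classifier's confidence alone would have filtered it out.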
Label Distribution Learning (LDL) is the state-of-the-art approach for a number of real-world applications, such as chronological age estimation from a face image, where there is an inherent similarity among adjacent age labels. LDL takes the semantic similarity into account by assigning a label distribution to each instance. The well-known Kullback–Leibler (KL) divergence is the widely used loss function for the LDL framework. However, the KL divergence does not fully and effectively capture the semantic similarity among age labels, leading to sub-optimal performance. In this paper, we propose a novel loss function based on optimal transport theory for LDL-based age estimation. A ground metric function plays an important role in the optimal transport formulation. It should be carefully determined based on the underlying geometric structure of the label space of the application in hand. The label space in the age estimation problem has a specific geometric structure, i.e., closer ages have a stronger inherent semantic relationship. Inspired by this, we devise a novel ground metric function which enables the loss function to increase the influence of highly correlated ages, thus exploiting the semantic similarity among ages more effectively than existing loss functions. We then use the proposed loss function, namely the γ–Wasserstein loss, for training a deep neural network (DNN). This leads to a notoriously computationally expensive and non-convex optimisation problem. Following the standard methodology, we formulate the optimisation function as a convex problem and then use an efficient iterative algorithm to update the parameters of the DNN. Extensive experiments in age estimation on different benchmark datasets validate the effectiveness of the proposed method, which consistently outperforms state-of-the-art approaches.
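To see why a ground metric over the label space matters, the sketch below compares KL divergence with a plain 1-D Wasserstein distance under the |i - j| ground metric. This is not the paper's γ–Wasserstein loss, only a minimal illustration of the underlying idea.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon for numerical safety."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

def wasserstein_1d(p, q):
    """W1 between distributions over ordered labels (ages) with ground
    metric |i - j|: the sum of absolute CDF differences."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

target = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # true age = 2
near   = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # prediction off by one year
far    = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # prediction off by two years

# KL sees both errors as equally bad; Wasserstein penalises the far miss more
assert np.isclose(kl_divergence(target, near), kl_divergence(target, far))
assert wasserstein_1d(target, near) < wasserstein_1d(target, far)
```

KL compares the distributions bin by bin and ignores how far apart the ages are, whereas the Wasserstein distance weights errors by the ground metric, so a prediction one year off costs less than one several years off.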
Recently, impressively growing efforts have been devoted to the challenging task of facial age estimation. The improvements in performance achieved by new algorithms are measured on several benchmark test databases with different characteristics to check for consistency. While this is a valuable methodology in itself, a significant issue in most age estimation studies is that the reported results lack an assessment of intrinsic system uncertainty. Hence, a more in-depth view is required to examine the robustness of age estimation systems in different scenarios. The purpose of this paper is to conduct an evaluative and comparative analysis of different age estimation systems to identify trends, as well as the points of their critical vulnerability. In particular, we investigate four age estimation systems, including the online Microsoft service, two of the best state-of-the-art approaches advocated in the literature, and a novel age estimation algorithm. We analyse the effect of different internal and external factors, including gender, ethnicity, expression, makeup, illumination conditions, and the quality and resolution of the face images, on the performance of these age estimation systems. The goal of this sensitivity analysis is to provide the biometrics community with insight into and understanding of the critical subject-, camera- and environment-based factors that affect the overall performance of the age estimation system under study.
To counteract spoofing attacks, the majority of recent approaches to face spoofing attack detection formulate the problem as a binary classification task in which real data and attack accesses are both used to train spoofing detectors. Although this classical training framework has been demonstrated to deliver satisfactory results, its robustness to unseen attacks is debatable. Inspired by the recent success of anomaly detection models in face spoofing detection, we propose an ensemble of one-class classifiers fused by a Stacking ensemble method to reduce the generalisation error in the more realistic unseen-attack scenario. To be consistent with this scenario, anomalous samples are considered neither for training the component anomaly classifiers nor for the design of the Stacking ensemble. To achieve better face anti-spoofing results, we adopt client-specific information to build both the constituent classifiers and the Stacking combiner. In addition, we propose a novel two-stage Genetic Algorithm to further improve the generalisation performance of the Stacking ensemble. We evaluate the effectiveness of the proposed systems on publicly available face anti-spoofing databases, including Replay-Attack, Replay-Mobile and Rose-Youtu. The experimental results, following the unseen-attack evaluation protocol, confirm the merits of the proposed model.
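The one-class ensemble idea can be sketched as follows. The two component scorers (a per-dimension Gaussian fit and a k-NN distance) and the fixed averaging fusion are illustrative stand-ins; the paper's client-specific classifiers, Stacking combiner and genetic algorithm are not reproduced here. Crucially, as in the abstract, only genuine samples are used for fitting.

```python
import numpy as np

rng = np.random.default_rng(2)

# bona fide training features (e.g. embeddings of real faces); attacks unseen
real = rng.normal(0.0, 1.0, size=(500, 8))

mu, sigma = real.mean(axis=0), real.std(axis=0)

def gauss_score(x):
    """Anomaly score 1: deviation from a per-dimension Gaussian fit."""
    return np.abs((x - mu) / sigma).mean(axis=1)

def knn_score(x, k=5):
    """Anomaly score 2: mean distance to the k nearest genuine samples."""
    d = np.linalg.norm(x[:, None, :] - real[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

def ensemble_score(x):
    """Fuse the one-class scores; a Stacking combiner would learn this
    fusion from genuine data, here we z-normalise and average."""
    z = lambda s, ref: (s - ref.mean()) / ref.std()
    return 0.5 * z(gauss_score(x), gauss_score(real)) \
         + 0.5 * z(knn_score(x), knn_score(real))

# simulated unseen spoofing attacks score higher than genuine samples
attacks = rng.normal(4.0, 1.0, size=(50, 8))
assert ensemble_score(attacks).mean() > ensemble_score(real).mean()
```

Thresholding the fused score then separates genuine accesses from attacks that were never seen during training, which is the unseen-attack setting the abstract targets.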