
Srinivasa Rao Nandam
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), Surrey Institute for People-Centred Artificial Intelligence (PAI)

About
My research project
Foundation models for multimodal understanding

Foundation models for natural language processing (NLP) have already seen huge success following the seminal work of BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) and GPT (Improving Language Understanding by Generative Pre-Training), which are recognised as the early foundation models for NLP, both introduced in 2018.
However, foundation models for computer vision started to emerge three years later, at the beginning of 2021, with the seminal work of SiT (SiT: Self-supervised vIsion Transformer, under review), which proposed the idea of group masked model learning (GMML).
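As an illustration of the GMML idea, the following is a minimal sketch of group-masked pretraining, assuming a ViT-style encoder that returns patch tokens and a lightweight pixel-reconstruction head; the block size, masking ratio, and loss are illustrative defaults rather than the settings used in SiT.

```python
# Minimal sketch of group masked model learning (GMML)-style pretraining.
# Assumes `encoder` is a ViT-like module mapping an image to (B, N, dim) patch
# tokens; block size, masking ratio, and loss are illustrative, not SiT's values.
import torch
import torch.nn as nn


def group_mask(images, block=32, ratio=0.5):
    """Zero out random square groups of pixels (connected groups of patches)."""
    b, c, h, w = images.shape
    mask = torch.zeros(b, 1, h, w, device=images.device)
    n_blocks = max(1, int(ratio * (h // block) * (w // block)))
    for i in range(b):
        for _ in range(n_blocks):
            y = torch.randint(0, h - block + 1, (1,)).item()
            x = torch.randint(0, w - block + 1, (1,)).item()
            mask[i, :, y:y + block, x:x + block] = 1.0
    return images * (1 - mask), mask


class GroupMaskedPretrainer(nn.Module):
    def __init__(self, encoder, dim=768, patch=16):
        super().__init__()
        self.encoder = encoder                            # ViT-like: image -> (B, N, dim)
        self.decoder = nn.Linear(dim, 3 * patch * patch)  # per-patch pixel decoder
        self.patch = patch

    def forward(self, images):
        corrupted, _ = group_mask(images)
        tokens = self.encoder(corrupted)                  # encode the corrupted image
        recon = self.decoder(tokens)                      # predicted pixels for every patch
        target = nn.functional.unfold(images, self.patch, stride=self.patch).transpose(1, 2)
        # plain MSE over all patches; in practice the loss is focused on the masked regions
        return ((recon - target) ** 2).mean()
```

The distinguishing point of GMML is that whole connected groups of patches are dropped, rather than isolated tokens, so the model has to recover semantic context instead of interpolating locally.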
Equipped with both NLP and vision foundation models, the aim of the PhD will be to study the role of these foundation models (e.g., for NLP, vision, audio, etc.) in multimodal analysis and understanding. For example, the initial work of the current research team (CLMIU: Commonsense Learning in Multimodal Image Understanding, under review) has already established that using vision foundation models for multimodal image understanding is beneficial and alleviates the need for an object detector, which is otherwise considered a critical pre-processing step for the visual input.
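A minimal sketch of the detector-free setup described above: patch tokens from a pretrained vision transformer are concatenated with text tokens and passed through a fusion transformer, so no object detector is needed to produce region features. The module names, dimensions, and task head are illustrative assumptions and do not reflect the CLMIU architecture.

```python
# Minimal sketch of detector-free vision-language fusion: ViT patch tokens stand
# in for object-detector region features as the visual input to a multimodal
# transformer. Both encoders are assumed to output features of width `dim`.
import torch
import torch.nn as nn


class DetectorFreeFusion(nn.Module):
    def __init__(self, vision_encoder, text_encoder, dim=768, layers=4):
        super().__init__()
        self.vision_encoder = vision_encoder    # pretrained ViT: image -> (B, N, dim)
        self.text_encoder = text_encoder        # BERT-like: token ids -> (B, T, dim)
        fusion_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=layers)
        self.head = nn.Linear(dim, 2)           # illustrative task head (e.g. yes/no answer)

    def forward(self, images, token_ids):
        v = self.vision_encoder(images)         # patch tokens, no detector required
        t = self.text_encoder(token_ids)        # word-piece tokens
        fused = self.fusion(torch.cat([v, t], dim=1))
        return self.head(fused[:, 0])           # predict from the first fused token
```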
The PhD research will build more advanced multimodal and cross-modal algorithms, suitable for several downstream applications, on top of these foundation models. The explainability of the decisions and tasks performed by the multimodal algorithms will also be a particular focus of the PhD study.
Supervisors
Publications
Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks, including classification, segmentation, and detection. However, the potential of these models for low-shot learning across several downstream tasks remains largely underexplored. In this work, we conduct a systematic examination of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, to assess their low-shot capabilities by comparing different pretrained models. In addition, we explore the impact of various collapse-avoidance techniques, such as centring, ME-MAX, and Sinkhorn normalisation, on these downstream tasks. Based on our detailed analysis, we introduce a framework that combines masked image modelling and clustering as pretext tasks. This framework demonstrates superior performance across all examined low-shot downstream tasks, including multi-class classification, multi-label classification, and semantic segmentation. Furthermore, when testing the model on large-scale datasets, we show performance gains on various tasks.
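To make the collapse-avoidance techniques mentioned in the abstract concrete, below is a minimal sketch of Sinkhorn-Knopp normalisation applied to a batch of prototype scores, in the style used by clustering-based self-supervised methods; the temperature and iteration count are illustrative defaults, not the values used in this work.

```python
# Minimal sketch of Sinkhorn-Knopp normalisation as a collapse-avoidance step
# for clustering-based pretext tasks: it turns a batch of prototype scores into
# soft assignments in which all prototypes are used roughly equally.
import torch


@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """scores: (batch, n_prototypes) similarity logits -> balanced soft assignments."""
    q = torch.exp(scores / eps).t()             # (n_prototypes, batch)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)         # normalise rows: equal prototype usage
        q /= k
        q /= q.sum(dim=0, keepdim=True)         # normalise columns: one unit per sample
        q /= b
    return (q * b).t()                          # (batch, n_prototypes), rows sum to 1
```

Centring and ME-MAX serve the same purpose by different means, keeping the assignment distribution over prototypes from degenerating to a single cluster.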
Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20^i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5^i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. By rethinking the role of text in FS-CS, our work establishes new benchmarks for multi-modal few-shot learning and demonstrates the efficacy of textual cues for improving model generalization and segmentation accuracy.
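As a rough illustration of the dual-modal prediction idea, the sketch below fuses class prototypes derived from label text embeddings with visual prototypes from the support images and scores queries against the fused prototypes; the encoders, fusion weight, and scoring rule are assumptions for illustration, not the paper's exact mechanism.

```python
# Minimal sketch of dual-modal few-shot prediction: text-derived and visual
# class prototypes are blended, and queries (image features for classification,
# per-pixel features for segmentation) are scored against the fused prototypes.
# The fusion weight `alpha` and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


def fused_prototypes(text_feats, visual_feats, alpha=0.5):
    """text_feats, visual_feats: (n_classes, dim) L2-normalised class prototypes."""
    protos = alpha * text_feats + (1 - alpha) * visual_feats
    return F.normalize(protos, dim=-1)


def classify_query(query_feats, protos, temperature=0.07):
    """query_feats: (n_queries, dim); returns per-query class probabilities."""
    logits = F.normalize(query_feats, dim=-1) @ protos.t() / temperature
    return logits.softmax(dim=-1)
```

For segmentation, the same scoring can be applied to per-patch features of shape (H*W, dim), giving an (H*W, n_classes) soft mask that is reshaped back to the spatial grid.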