Dr Anjan Dutta
Academic and research departments
Surrey Institute for People-Centred Artificial Intelligence, Centre for Vision, Speech and Signal Processing (CVSSP), School of Veterinary Medicine.
Dr. Anjan Dutta is a Senior Lecturer (Assistant Professor) in Artificial Intelligence at the University of Surrey in the United Kingdom. He received a PhD in Computer Science from the Autonomous University of Barcelona (UAB) in 2014, awarded with the qualification Excellent Cum Laude (the highest grade) and an International mention. He is also a recipient of the UAB's Extraordinary PhD Thesis Award for the year 2013-14. Before his PhD, he obtained an MSc in Computer Vision and Artificial Intelligence from the UAB, an MCA in Computer Applications from the Maulana Abul Kalam Azad University of Technology and a BSc in Mathematics (Honours) from the University of Calcutta, in 2010, 2009 and 2006, respectively. His main research interests revolve around computer vision and machine learning. Specifically, he works on deep multi-modal embedding, zero-shot learning and graph neural networks for various computer vision tasks.
Areas of specialism
Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of this modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results on three fine-grained sketch-based image retrieval benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at https://github.com/abhrac/xmodal-vit.
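The contrastive part of the distillation step described above can be illustrated with a generic InfoNCE-style objective. This is a minimal sketch and not the XModalViT implementation: the function names, the temperature value and the toy embeddings are all assumptions, and the relational distillation term is omitted. It only shows the idea that a single-modality "student" embedding is pulled towards the fused "teacher" embedding of its own photo-sketch pair and pushed away from those of other pairs.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_distillation_loss(student, teacher, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's exact objective).

    student[i] and teacher[i] are embeddings of the same photo-sketch pair;
    every teacher[j] with j != i acts as a negative for student[i].
    """
    s = [normalize(v) for v in student]
    t = [normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all teacher embeddings.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(s)

# Toy 2-D embeddings standing in for fused (teacher) representations.
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_student = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled_student = [aligned_student[1], aligned_student[2], aligned_student[0]]

loss_aligned = contrastive_distillation_loss(aligned_student, teacher)
loss_shuffled = contrastive_distillation_loss(shuffled_student, teacher)
```

In this toy setting the aligned student yields a much lower loss than the shuffled one, which is the signal a distillation procedure of this kind would minimise.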
Fine-grained categories that largely share the same set of parts cannot be discriminated based on part information alone, as they mostly differ in the way the local parts relate to the overall global structure of the object. We propose Relational Proxies, a novel approach that leverages the relational information between the global and local views of an object for encoding its semantic label. Starting with a rigorous formalization of the notion of distinguishability between fine-grained categories, we prove the necessary and sufficient conditions that a model must satisfy in order to learn the underlying decision boundaries in the fine-grained setting. We design Relational Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets, achieving state-of-the-art results on all of them and surpassing the performance of all existing works by a margin exceeding 4% in some cases. We also experimentally validate our theory on fine-grained distinguishability and obtain consistent results across multiple benchmarks. Implementation is available at https://github.com/abhrac/relational-proxies.
Humans show a high level of abstraction capability in games that require quickly communicating object information. They decompose the message content into multiple parts and communicate them in an interpretable protocol. Toward equipping machines with such capabilities, we propose the Primitive-based Sketch Abstraction task, where the goal is to represent sketches using a fixed set of drawing primitives under the influence of a budget. To solve this task, our Primitive-Matching Network (PMN) learns interpretable abstractions of a sketch in a self-supervised manner. Specifically, PMN maps each stroke of a sketch to its most similar primitive in a given set, predicting an affine transformation that aligns the selected primitive to the target stroke. We learn this stroke-to-primitive mapping end-to-end with a distance-transform loss that is minimal when the original sketch is precisely reconstructed with the predicted primitives. Our PMN abstraction empirically achieves the highest performance on sketch recognition and sketch-based image retrieval given a communication budget, while at the same time being highly interpretable. This opens up new possibilities for sketch analysis, such as comparing sketches by extracting the most relevant primitives that define an object category. Code is available at https://github.com/ExplainableML/sketch-primitives.
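The stroke-to-primitive mapping described above can be illustrated without a neural network. The sketch below is not PMN: instead of predicting the affine transformation, it fits one by least squares, and instead of the distance-transform loss it uses a mean squared error between corresponding points. The primitive set, the point sampling and all function names are assumptions made for illustration, and strokes and primitives are assumed to be resampled to the same number of ordered points.

```python
import math

def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def fit_affine(src, dst, ridge=1e-9):
    """Least-squares affine map [[a, b, tx], [c, d, ty]] sending src onto dst."""
    rows = [(x, y, 1.0) for x, y in src]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    for i in range(3):
        ata[i][i] += ridge  # keeps degenerate (collinear) primitives solvable
    params = []
    for k in range(2):
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params

def apply_affine(params, pts):
    """Apply the fitted affine map to a list of 2-D points."""
    return [tuple(p[0] * x + p[1] * y + p[2] for p in params) for x, y in pts]

def mean_sq_error(a, b):
    """Reconstruction error; a simple stand-in for a distance-transform loss."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)

def match_primitive(stroke, primitives):
    """Return (name, affine params, error) of the best-aligning primitive."""
    best = None
    for name, pts in primitives.items():
        params = fit_affine(pts, stroke)
        err = mean_sq_error(apply_affine(params, pts), stroke)
        if best is None or err < best[2]:
            best = (name, params, err)
    return best

N = 8  # points per stroke (hypothetical resampling resolution)
primitives = {
    "line": [(i / (N - 1), 0.0) for i in range(N)],
    "arc": [(math.cos(math.pi * i / (N - 1)), math.sin(math.pi * i / (N - 1)))
            for i in range(N)],
}
# A stroke that is a stretched and shifted half-circle.
stroke = [(2.0 * x + 5.0, 3.0 * y - 1.0) for x, y in primitives["arc"]]
name, params, err = match_primitive(stroke, primitives)
```

Here the stroke is reconstructed almost exactly by the transformed arc, while the line primitive leaves a clear residual, so the arc is selected. A learned model replaces the closed-form fit so that the mapping can be trained end-to-end.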