About

Areas of specialism

Computer Vision; Machine Learning; Representation Learning; Learning from Fewer Labels; Structured Representation Learning

My qualifications

2021
Postgraduate Certification in Academic Practice (PCAP / FHEA)
University of Exeter
2014
PhD in Computer Science
Autonomous University of Barcelona
2010
MSc in Computer Vision and Artificial Intelligence
Autonomous University of Barcelona
2009
MCA in Computer Application
Maulana Abul Kalam Azad University of Technology
2006
BSc in Mathematics (Honours)
University of Calcutta

Previous roles

2019 - 2022
Lecturer of Computer Vision & Machine Learning
University of Exeter
2017 - 2019
Marie Curie Fellow
Computer Vision Centre
2016 - 2017
Postdoctoral Researcher
Computer Vision Centre
2014 - 2015
Postdoctoral Researcher
Télécom ParisTech

Sustainable development goals

My research interests are related to the following:

Quality Education (UN Sustainable Development Goal 4)
Affordable and Clean Energy (UN Sustainable Development Goal 7)
Sustainable Cities and Communities (UN Sustainable Development Goal 11)
Life on Land (UN Sustainable Development Goal 15)

Publications

Anjan Dutta, Josep Llados, Umapada Pal (2011) A Bag-of-Paths Based Serialized Subgraph Matching for Symbol Spotting in Line Drawings, In: J Vitria, J M Sanches, M Hernandez (eds.), PATTERN RECOGNITION AND IMAGE ANALYSIS: 5TH IBERIAN CONFERENCE, IBPRIA 2011, vol. 6669, pp. 620-627 Springer Nature

In this paper we propose an error-tolerant subgraph matching algorithm based on bag-of-paths for solving the problem of symbol spotting in line drawings. Bag-of-paths is a factorized representation of graphs where the factorization is done by considering all the acyclic paths between each pair of connected nodes. Similar paths within the whole collection of documents are clustered and organized in a lookup table for efficient indexing. Each table entry holds the index key of a cluster together with the corresponding list of locations; the mean path of the cluster serves as the index key. The spotting method is then formulated as a spatial voting scheme over the lists of locations of the paths retrieved by searching for paths similar to those that compose the query symbol. Efficient indexing of common substructures helps to reduce the computational burden of usual graph-based methods. The proposed method can also be seen as a way to serialize graphs, which reduces the complexity of subgraph isomorphism. We have encoded the paths in terms of both attributed strings and turning functions, and present comparative results between them within the symbol spotting framework. Experiments on matching different shape silhouettes are also reported, and the method is shown to work in noisy environments as well.
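
As a rough illustration of the factorization step only, the sketch below enumerates every acyclic path between each pair of nodes in a toy graph. It is not the authors' implementation: the adjacency-dictionary format, the names `simple_paths` and `bag_of_paths`, and the toy graph are assumptions made for this example, and the clustering, indexing and voting stages are left out.

```python
from itertools import combinations

def simple_paths(adj, src, dst, path=None):
    """Yield every acyclic path from src to dst in an undirected graph."""
    path = [src] if path is None else path
    if src == dst:
        yield list(path)
        return
    for nxt in adj[src]:
        if nxt not in path:  # never revisit a node, so paths stay acyclic
            path.append(nxt)
            yield from simple_paths(adj, nxt, dst, path)
            path.pop()

def bag_of_paths(adj):
    """Collect the acyclic paths between every unordered pair of nodes."""
    bag = []
    for u, v in combinations(sorted(adj), 2):
        bag.extend(simple_paths(adj, u, v))
    return bag

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
toy_bag = bag_of_paths(toy_graph)
print(len(toy_bag))  # 11 paths for this toy graph
```

The number of acyclic paths grows quickly with graph size, which is one reason an indexed lookup table of clustered paths is attractive.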

Anjan Dutta, Josep Llados, Horst Bunke, Umapada Pal (2014) A Product Graph Based Method for Dual Subgraph Matching Applied to Symbol Spotting, In: B Lamiroy, J M Ogier (eds.), GRAPHICS RECOGNITION: CURRENT TRENDS AND CHALLENGES, vol. 8746, pp. 11-24 Springer Nature

Product graphs have been shown to be a way of matching subgraphs. This paper reports an extension of the product graph methodology for subgraph matching applied to symbol spotting in graphical documents. Here we focus on the two major limitations of the previous version of the algorithm: (1) spurious nodes and edges in the graph representation and (2) inefficient node and edge attributes. To deal with the noisy information of vectorized graphical documents, we consider a dual edge graph representation of the original graph representing the graphical information, and the product graph is computed between the dual edge graphs of the pattern graph and the target graph. The dual edge graph with redundant edges is helpful for an efficient and error-tolerant encoding of the structural information of the graphical documents. The adjacency matrix of the product graph locates the pairs of similar edges of the two operand graphs, and exponentiating the adjacency matrix finds similar random walks of greater lengths. Nodes joining similar random walks between two graphs are found by combining different weighted exponentials of adjacency matrices. An experimental investigation reveals that the recall obtained by this approach is quite encouraging.
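
The role of the adjacency matrix and its powers can be illustrated with a small sketch. It makes several simplifying assumptions that are not in the paper: it uses the plain tensor (Kronecker) product of two unlabelled node graphs rather than dual edge graphs, it ignores attributes and weighting, and the helper names `kron` and `matmul` are invented for this example.

```python
def kron(a, b):
    """Kronecker product of two square adjacency matrices (lists of lists)."""
    n, m = len(a), len(b)
    return [[a[i][k] * b[j][l] for k in range(n) for l in range(m)]
            for i in range(n) for j in range(m)]

def matmul(x, y):
    """Multiply two square matrices of the same size."""
    size = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(size)) for j in range(size)]
            for i in range(size)]

# Hypothetical operands: the pattern is a single edge 0-1, the target is a path 0-1-2.
pattern = [[0, 1], [1, 0]]
target = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

# Product node (i, j) pairs pattern node i with target node j at index i * 3 + j.
product = kron(pattern, target)

# Squaring the product adjacency counts pairs of simultaneous two-step walks.
walks2 = matmul(product, product)
print(product[0][4])  # 1: edges 0-1 (pattern) and 0-1 (target) can be walked together
print(walks2[1][1])   # 2: two joint walks of length 2 return to the pair (0, 1)
```

Higher powers of `product` count longer joint walks in the same way, which is the property the weighted combination of exponentials builds on.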

Pau Riba, Josep Lladós, Alicia Fornés, Anjan Dutta (2015) Large-scale graph indexing using binary embeddings of node contexts, In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9069, pp. 208-217

Palaiahnakote Shivakumara, Anjan Dutta, Trung Quy Phan, Chew Lim Tan, Umapada Pal (2011) A novel mutual nearest neighbor based symmetry for text frame classification in video, In: Pattern recognition, vol. 44(8), pp. 1671-1683 Elsevier Ltd

In the field of multimedia retrieval in video, text frame classification is essential for text detection, event detection, event boundary detection, etc. We propose a new text frame classification method that introduces a combination of wavelet and median moment with k-means clustering to select probable text blocks among 16 equally sized blocks of a video frame. The same feature combination is used with a new Max–Min clustering at the pixel level to choose probable dominant text pixels in the selected probable text blocks. For the probable text pixels, a so-called mutual nearest neighbor based symmetry is explored with a four-quadrant formation centered at the centroid of the probable dominant text pixels to determine whether a block is a true text block or not. If a frame produces at least one true text block, it is considered a text frame; otherwise it is a non-text frame. Experimental results on different text and non-text datasets, including two public datasets and our own data, show that the proposed method gives promising results in terms of recall and precision at the block and frame levels. Further, we also show how existing text detection methods tend to misclassify non-text frames as text frames in terms of recall and precision at both the block and frame levels.
► A new wavelet–median moment feature to enhance the gap between text and non-text pixels.
► Probable text block selection (PTBS) using k-means clustering among 16 blocks.
► Max–Min clustering to obtain dominant and high-contrast pixels.
► A new mutual nearest neighbor symmetry concept (MNNS) to identify a true text block.
► The combination of PTBS and MNNS for achieving better results.

Klaus Broelemann, Anjan Dutta, Xiaoyi Jiang, Josep Llados (2012) Hierarchical Graph Representation for Symbol Spotting in Graphical Document Images, In: G Gimelfarb, E Hancock, A Imiya, A Kuijper, M Kudo, S Omachi, T Windeatt, K Yamada (eds.), STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, vol. 7626, pp. 529-538 Springer Nature

Symbol spotting can be defined as locating a given query symbol in a large collection of graphical documents. In this paper we present a hierarchical graph representation for symbols. This representation allows graph matching methods to deal with low-level vectorization errors and, thus, to perform robust symbol spotting. To show the potential of this approach, we conduct an experiment with the SESYD dataset.

Anjan Dutta, Josep Llados, Horst Bunke, Umapada Pal (2013) Near Convex Region Adjacency Graph and Approximate Neighborhood String Matching for Symbol Spotting in Graphical Documents, In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 1078-1082 IEEE

This paper deals with a subgraph matching problem in Region Adjacency Graphs (RAGs) applied to symbol spotting in graphical documents. A RAG is a very important, efficient and natural way of representing graphical information with a graph, but it is limited to cases where the information is well defined with perfectly delineated regions. What if the information we are interested in is not confined within well-defined regions? This paper addresses this particular problem and solves it by defining a near convex grouping of oriented line segments, which results in near convex regions. Pure convexity imposes hard constraints and cannot handle all the cases efficiently. Hence, to solve this problem we have defined a new type of convexity of regions, which allows convex regions to have concavity to some extent. We call regions of this kind Near Convex Regions (NCRs). These NCRs are then used to create the Near Convex Region Adjacency Graph (NCRAG), and with this representation we have formulated the problem of symbol spotting in graphical documents as a subgraph matching problem. For subgraph matching we have used the Approximate Edit Distance Algorithm (AEDA) on the neighborhood string, which starts working after finding a key node in the input or target graph and iteratively identifies similar nodes of the query graph in the neighborhood of the key node. The experiments are performed on artificial, real and distorted datasets.

Alicia Fornés, Anjan Dutta, Albert Gordo, Josep Lladós (2013) The 2012 Music Scores Competitions: Staff Removal and Writer Identification, In: Young-Bin Kwon, Jean-Marc Ogier (eds.), Graphics Recognition. New Trends and Challenges, pp. 173-186 Springer Berlin Heidelberg

Since there has been a growing interest in the analysis of handwritten music scores, we have tried to foster this interest by proposing in ICDAR and GREC two different competitions: Staff removal and Writer identification. Both competitions have been tested on the CVC-MUSCIMA database of handwritten music score images. In the corresponding ICDAR publication, we have described the ground-truth, the evaluation metrics, the participants’ methods and results. As a result of the discussions with attendees in ICDAR and GREC concerning our music competition, we decided to propose a new experiment for an extended competition. Thus, this paper is focused on this extended competition, describing the new set of images and analyzing the new results.

Pau Riba, Josep Llados, Alicia Fornes, Anjan Dutta (2017) Large-scale graph indexing using binary embeddings of node contexts for information spotting in document image databases, In: Pattern recognition letters, vol. 87, pp. 203-211 Elsevier

Graph-based representations are experiencing growing usage in visual recognition and retrieval due to their representational power compared with classical appearance-based representations. However, retrieving a query graph from a large dataset of graphs implies a high computational complexity. The most important property for large-scale retrieval is that the search time complexity be sub-linear in the number of database examples. With this aim, in this paper we propose a graph indexation formalism applied to visual retrieval. A binary embedding is defined as hashing keys for graph nodes. Given a database of labeled graphs, graph nodes are complemented with vectors of attributes representing their local context. Then, each attribute vector is converted to a binary code by applying a binary-valued hash function. Graph retrieval is thus formulated in terms of finding target graphs in the database whose nodes have a small Hamming distance from the query nodes, which is easily computed with bitwise logical operators. As an application example, we validate the performance of the proposed methods in different real scenarios such as handwritten word spotting in images of historical documents or symbol spotting in architectural floor plans.
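
The Hamming-distance comparison with bitwise operators can be sketched as follows. This is a minimal illustration, not the paper's indexing scheme: the 8-bit codes, the node names and the function `nearest_nodes` are hypothetical, and the linear scan shown here does not provide the sub-linear search time that the paper targets.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")  # XOR marks differing bits, then count them

def nearest_nodes(query_code, database, radius):
    """Return the database nodes whose code lies within `radius` bits of the query."""
    return [node for node, code in database.items()
            if hamming(query_code, code) <= radius]

# Hypothetical 8-bit codes for three graph nodes.
db = {"n1": 0b10110010, "n2": 0b10110011, "n3": 0b01001101}
hits = nearest_nodes(0b10110000, db, radius=2)
print(hits)  # ['n1', 'n2']
```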

Klaus Broelemann, Anjan Dutta, Xiaoyi Jiang, Josep Llados (2014) Hierarchical Plausibility-Graphs for Symbol Spotting in Graphical Documents, In: B Lamiroy, J M Ogier (eds.), GRAPHICS RECOGNITION: CURRENT TRENDS AND CHALLENGES, vol. 8746, pp. 25-37 Springer Nature

Graph representations of graphical documents often suffer from noise such as spurious nodes and edges and their discontinuity. In general, these errors occur during low-level image processing, viz. binarization, skeletonization, vectorization, etc. Hierarchical graph representation is an efficient way to address this kind of problem by hierarchically merging node-node and node-edge pairs depending on their distance. However, the creation of a hierarchical graph representing the graphical information often uses hard thresholds on the distance to create the hierarchical nodes (next state) from the lower nodes (or states) of a graph. As a result, the representation often loses useful information. This paper introduces plausibilities to the nodes of the hierarchical graph as a function of distance and proposes a modified algorithm for matching subgraphs of the hierarchical graphs. The plausibility-annotated nodes help to improve the performance of the matching algorithm on two hierarchical structures. To show the potential of this approach, we conduct an experiment with the SESYD dataset.

Palaiahnakote Shivakumara, Anjan Dutta, Umapada Pal, Chew Lim Tan (2010) A New Method for Handwritten Scene Text Detection in Video, In: 2010 International Conference on Frontiers in Handwriting Recognition, pp. 387-392 IEEE

There are many video images in which handwritten text may appear. Handwritten scene text detection in video is therefore essential and useful for many applications such as efficient indexing and retrieval. There are also many video frames in which text lines may be multi-oriented. To the best of our knowledge, there is no prior work on the detection of multi-oriented handwritten text in video. In this paper, we present a new method based on maximum color difference and boundary growing for the detection of multi-oriented handwritten scene text in video. The method computes the maximum color difference for the average of the R, G and B channels of the original frame to enhance the text information. The output of the maximum color difference is fed to a K-means algorithm with K = 2 to separate text and non-text clusters. Text candidates are obtained by intersecting the text cluster with the Sobel output of the original frame. To tackle the fundamental problem of different orientations and skews of handwritten text, a boundary growing method based on a nearest neighbor concept is employed. We evaluate the proposed method by testing on our own handwritten text database and publicly available video data (Hua's data). Experimental results obtained from the proposed method are promising.

Alicia Fornes, Anjan Dutta, Albert Gordo, Josep Llados (2012) CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal, In: International journal on document analysis and recognition, vol. 15(3), pp. 243-251 Springer Nature

The analysis of music scores has been an active research field in the last decades. However, there are no publicly available databases of handwritten music scores for the research community. In this paper, we present the CVC-MUSCIMA database and ground truth of handwritten music score images. The dataset consists of 1,000 music sheets written by 50 different musicians. It has been especially designed for writer identification and staff removal tasks. In addition to the description of the dataset, ground truth, partitioning, and evaluation metrics, we also provide some baseline results for easing the comparison between different approaches.

Palaiahnakote Shivakumara, Anjan Dutta, Chew Lim Tan, Umapada Pal (2010) A new wavelet-median-moment based method for multi-oriented video text detection, In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 279-286

Anjan Dutta, Umapada Pal, Alicia Fornes, Josep Llados (2010) An Efficient Staff Removal Approach from Printed Musical Documents, In: 2010 20th International Conference on Pattern Recognition, pp. 1965-1968 IEEE

Staff removal is an important preprocessing step of Optical Music Recognition (OMR). The process aims to remove the stafflines from a musical document and retain only the musical symbols; these symbols are later used to identify the music information. This paper proposes a simple but robust method to remove stafflines from printed musical scores. In the proposed methodology we consider a staffline segment as a horizontal linkage of vertical black runs with uniform height. We use the neighbouring properties of a staffline segment to validate it as a true segment. We have considered the dataset along with the deformations described in for evaluation purpose. The experiments yield encouraging results.
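
The notion of a vertical black run can be made concrete with a short sketch. It only extracts the runs in a single pixel column of a binary image; the linking of runs into staffline segments and their validation are not shown. The function name `vertical_black_runs`, the list-of-lists image format (1 for black) and the tiny image are assumptions made for this example.

```python
def vertical_black_runs(image, col):
    """Return (start_row, height) for each run of black pixels (1) in one column."""
    runs, start = [], None
    column = [row[col] for row in image] + [0]  # sentinel closes a trailing run
    for r, px in enumerate(column):
        if px and start is None:
            start = r                       # a new run begins
        elif not px and start is not None:
            runs.append((start, r - start))  # the run just ended
            start = None
    return runs

# Hypothetical binary image: two one-pixel stafflines (rows 1 and 4)
# crossed by a vertical stroke in column 2.
image = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(vertical_black_runs(image, 0))  # [(1, 1), (4, 1)]
print(vertical_black_runs(image, 2))  # [(1, 4)]
```

In this toy image, columns 0, 1 and 3 yield runs of uniform height 1 that line up horizontally, whereas column 2 yields a single taller run where a symbol stroke crosses the stafflines.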

Anjan Dutta, Hichem Sahbi (2015) High order graphlets for pattern classification, In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 7486495, pp. 206-210 IEEE

Graph-based methods are known to be successful for pattern description and comparison. Their general principle consists in using graphs to model local features as well as their structural relationships, and achieving pattern comparison with graph matching. Among these methods, subgraph isomorphism is particularly effective but intractable for general and unconstrained graph structures. In this paper, we introduce an efficient and effective method for graph-based pattern comparison. The main contribution is a new stochastic search procedure that allows us to efficiently extract, hash and measure the distribution of increasing-order subgraphs (a.k.a. graphlets) in large graph collections. We consider both low- and high-order graphlets in order to model local features as well as their complex interactions. These graphlets are partitioned into sets of isomorphic and non-isomorphic subgraphs using well-designed hash functions with a low probability of collision, resulting in accurate graph descriptions. When combined with support vector machines, these high-order graphlet-based descriptions have a positive impact on the performance of pattern comparison and classification, as corroborated through experiments on different standard databases.

Alicia Fornes, Van Cuong Kieu, Muriel Visani, Nicholas Journet, Anjan Dutta (2014) The ICDAR/GREC 2013 Music Scores Competition: Staff Removal, In: B Lamiroy, J M Ogier (eds.), GRAPHICS RECOGNITION: CURRENT TRENDS AND CHALLENGES, vol. 8746, pp. 207-220 Springer Nature

The first competition on music scores, organized at ICDAR and GREC in 2011, awoke the interest of researchers, who participated in both staff removal and writer identification tasks. In this second edition, we focus on the staff removal task and simulate a real case scenario concerning old and degraded music scores. For this purpose, we have generated a new set of semi-synthetic images using two degradation models that we previously introduced: local noise and 3D distortions. In this extended paper we provide a fuller description of the dataset, degradation models, evaluation metrics, the participants' methods and the obtained results, which could not be presented in the ICDAR and GREC proceedings due to page limitations.

Anjan Dutta, Josep Llados, Umapada Pal (2011) Symbol Spotting in Line Drawings Through Graph Paths Hashing, In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 982-986 IEEE

In this paper we propose a symbol spotting technique based on hashing the shape descriptors of graph paths (Hamiltonian paths). Complex graphical structures in line drawings can be efficiently represented by graphs, which eases the accurate localization of the model symbol. Graph paths are the factorized substructures of graphs, which enable robust recognition even in the presence of noise and distortion. In our framework, the entire database of graphical documents is indexed in hash tables by locality sensitive hashing (LSH) of the shape descriptors of the paths. The hashing data structure aims to execute an approximate k-NN search in sub-linear time. The spotting method is formulated as a spatial voting scheme over the lists of locations of the paths retrieved during the hash table lookup process. We perform detailed experiments with various datasets of line drawings, and the results demonstrate the effectiveness and efficiency of the technique.

10.1109/ICPR.2010.40 Proceedings - International Conference on Pattern Recognition 129-132 PICRE

Anjan Dutta, Josep Lladós, Umapada Pal (2013) Bag-of-GraphPaths Descriptors for Symbol Recognition and Spotting in Line Drawings, In: Young-Bin Kwon, Jean-Marc Ogier (eds.), Graphics Recognition. New Trends and Challenges, pp. 208-217 Springer Berlin Heidelberg

Graphical symbol recognition and spotting have recently become an important research activity. In this work we present a descriptor for symbols, especially for line drawings. The descriptor is based on the graph representation of graphical objects. We construct graphs from the vectorized information of the binarized images, where the critical points detected by the vectorization algorithm are considered as nodes and the lines joining them are considered as edges. Graph paths between two nodes in a graph are the finite sequences of nodes following the order from the starting node to the final node. The occurrences of different graph paths in a given graph are an important feature, as they capture the geometrical and structural attributes of the graph. The graph representing a symbol can therefore be efficiently represented by the occurrences of its different paths. These occurrences can be expressed as a histogram counting the number of some fixed prototype paths; we call this histogram the Bag-of-GraphPaths (BOGP). The BOGP histograms are used as a descriptor to measure the distance among symbols in vector space. We use the descriptor for three applications: (1) classification of graphical symbols, (2) spotting of architectural symbols on floorplans, and (3) classification of historical handwritten words.

Palaiahnakote Shivakumara, Anjan Dutta, Chew Lim Tan, Umapada Pal (2014) Multi-oriented scene text detection in video based on wavelet and angle projection boundary growing, In: Multimedia tools and applications, vol. 72(1), pp. 515-539 Springer Nature

In this paper, we address two complex issues: (1) text frame classification and (2) multi-oriented text detection in video text frames. We first divide a video frame into 16 blocks and propose a combination of wavelet and median moments with k-means clustering at the block level to identify probable text blocks. For each probable text block, the method applies the same combination of features with k-means clustering over a sliding window running through the blocks to identify potential text candidates. We introduce a new idea of symmetry on text candidates in each block, based on the observation that pixel distribution in text exhibits a symmetric pattern. The method integrates all blocks containing text candidates in the frame, and all text candidates are then mapped onto a Sobel edge map of the original frame to obtain text representatives. To tackle the multi-orientation problem, we present a new method called Angle Projection Boundary Growing (APBG), an iterative algorithm based on a nearest neighbor concept. APBG is then applied to the text representatives to fix the bounding box for multi-oriented text lines in the video frame. Directional information is used to eliminate false positives. Experimental results on a variety of datasets, such as non-horizontal, horizontal, publicly available data (Hua's data) and ICDAR-03 competition data (camera images), show that the proposed method outperforms existing methods proposed for video as well as state-of-the-art methods for scene text.

Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, Anjan Dutta (2025) CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming layout-based and gradient-guided baselines with a score of 0.97.

Silpa Vadakkeeveetil Sreelatha, Sauradip Nag, Muhammad Awais, Serge Belongie, Anjan Dutta RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation

The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.

Abhra Chaudhuri, Massimiliano Mancini, Yanbei Chen, Zeynep Akata, Anjan Dutta (2022) Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval, In: The 33rd British Machine Vision Conference Proceedings

Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of this modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results on three fine-grained sketch-based image retrieval benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at https://github.com/abhrac/xmodal-vit.

Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta (2025) OmniCount: Multi-label Object Counting with Semantic-Geometric Priors, In: TBC

Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging varied interactive prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions.

Abhra Chaudhuri, Serban Georgescu, Anjan Dutta Learning Conditional Invariances through Non-Commutativity, In: arXiv.org

Invariance learning algorithms that conditionally filter out domain-specific random variables as distractors, do so based only on the data semantics, and not the target domain under evaluation. We show that a provably optimal and sample-efficient way of learning conditional invariances is by relaxing the invariance criterion to be non-commutatively directed towards the target domain. Under domain asymmetry, i.e., when the target domain contains semantically relevant information absent in the source, the risk of the encoder $\varphi^*$ that is optimal on average across domains is strictly lower-bounded by the risk of the target-specific optimal encoder $\Phi^*_\tau$. We prove that non-commutativity steers the optimization towards $\Phi^*_\tau$ instead of $\varphi^*$, bringing the $\mathcal{H}$-divergence between domains down to zero, leading to a stricter bound on the target risk. Both our theory and experiments demonstrate that non-commutative invariance (NCI) can leverage source domain samples to meet the sample complexity needs of learning $\Phi^*_\tau$, surpassing SOTA invariance learning algorithms for domain adaptation, at times by over $2\%$, approaching the performance of an oracle. Implementation is available at https://github.com/abhrac/nci.

Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, Yi-Zhe Song (2019) Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval, In: 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), pp. 2174-2183 IEEE

In this paper, we investigate the problem of zero-shot sketch-based image retrieval (ZS-SBIR), where human sketches are used as queries to retrieve photos from unseen categories. We advance prior art by proposing a novel ZS-SBIR scenario that represents a firm step forward in its practical application. The new setting uniquely recognizes two important yet often neglected challenges of practical ZS-SBIR: (i) the large domain gap between amateur sketch and photo, and (ii) the necessity of moving towards large-scale retrieval. We first contribute to the community a novel ZS-SBIR dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000 photos spanning 110 categories. Highly abstract amateur human sketches are purposefully sourced to maximize the domain gap, instead of the ones included in existing datasets, which can often be semi-photorealistic. We then formulate a ZS-SBIR framework to jointly model sketches and photos in a common embedding space. A novel strategy to mine the mutual information among domains is specifically engineered to alleviate the domain gap. External semantic knowledge is further embedded to aid semantic transfer. We show that, rather surprisingly, retrieval performance that significantly outperforms the state of the art on existing datasets can already be achieved using a reduced version of our model. We further demonstrate the superior performance of our full model by comparing it with a number of alternatives on the newly proposed dataset. The new dataset, plus all training and testing code of our model, will be publicly released to facilitate future research.

Anindya Mondal, Sauradip Nag, Joaquin Prada, Xiatian Zhu, Anjan Dutta (2023) Actor-agnostic Multi-label Action Recognition with Multi-modal Query, In: arXiv.org arXiv

Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code will be released at https://github.com/mondalanindya/MSQNet.

Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta (2023) Actor-agnostic Multi-label Action Recognition with Multi-modal Query, In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 784-794, IEEE

Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.

Silpa Vadakkeeveetil Sreelatha, Adarsh Kappiyath, Abhra Chaudhuri, Anjan Dutta (2024) DeNetDM: Debiasing by Network Depth Modulation

Neural networks trained on biased datasets tend to inadvertently learn spurious correlations, hindering generalization. We formally prove that (1) samples that exhibit spurious correlations lie on a lower rank manifold relative to the ones that do not; and (2) the depth of a network acts as an implicit regularizer on the rank of the attribute subspace that is encoded in its representations. Leveraging these insights, we present DeNetDM, a novel debiasing method that uses network depth modulation as a way of developing robustness to spurious correlations. Using a training paradigm derived from Product of Experts, we create both biased and debiased branches with deep and shallow architectures and then distill knowledge to produce the target debiased model. Our method requires no bias annotations or explicit data augmentation while performing on par with approaches that require either or both. We demonstrate that DeNetDM outperforms existing debiasing techniques on both synthetic and real-world datasets by 5%. Source code will be available upon acceptance.

Abhra Chaudhuri, Ayan Kumar Bhunia, Yi-Zhe Song, Anjan Dutta (2023) Data-Free Sketch-Based Image Retrieval, In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12084-12093, IEEE

Rising concerns about privacy and anonymity preservation of deep learning models have facilitated research in data-free learning (DFL). For the first time, we identify that for data-scarce tasks like Sketch-Based Image Retrieval (SBIR), where the difficulty in acquiring paired photos and hand-drawn sketches limits data-dependent cross-modal learning algorithms, DFL can prove to be a much more practical paradigm. We thus propose Data-Free (DF)-SBIR, where, unlike existing DFL problems, pre-trained, single-modality classification models have to be leveraged to learn a cross-modal metric-space for retrieval without access to any training data. The widespread availability of pre-trained classification models, along with the difficulty in acquiring paired photo-sketch datasets for SBIR justify the practicality of this setting. We present a methodology for DF-SBIR, which can leverage knowledge from models independently trained to perform classification on photos and sketches. We evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks, designing a variety of baselines based on state-of-the-art DFL literature, and observe that our method surpasses all of them by significant margins. Our method also achieves mAPs competitive with data-dependent approaches, all the while requiring no training data. Implementation is available at https://github.com/abhrac/data-free-sbir.

Abhra Chaudhuri, Massimiliano Mancini, Zeynep Akata, Anjan Dutta (2024) Relational Proxies: Fine-Grained Relationships as Zero-Shot Discriminators, In: IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE

Visual categories that largely share the same set of local parts cannot be discriminated based on part information alone, as they mostly differ in the way the local parts relate to the overall global structure of the object. We propose Relational Proxies, a novel approach that leverages the relational information between the global and local views of an object for encoding its semantic label, even for categories it has not encountered during training. Starting with a rigorous formalization of the notion of distinguishability between categories that share attributes, we prove the necessary and sufficient conditions that a model must satisfy in order to learn the underlying decision boundaries to tell them apart. We design Relational Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets and achieve state-of-the-art results on all of them, surpassing the performance of all existing works with a margin exceeding 4% in some cases. We additionally show that Relational Proxies also generalizes to the zero-shot setting, where it can efficiently leverage emergent relationships among attributes and image views to generalize to unseen categories, surpassing current state-of-the-art in both the non-generative and generative settings. Implementation will be made public upon acceptance.

Abhra Chaudhuri, Massimiliano Mancini, Zeynep Akata, Anjan Dutta (2023) Transitivity Recovering Decompositions: Interpretable and Robust Fine-Grained Relationships, In: arXiv.org Cornell University Library, arXiv.org

Recent advances in fine-grained representation learning leverage local-to-global (emergent) relationships for achieving state-of-the-art results. The relational representations relied upon by such methods, however, are abstract. We aim to deconstruct this abstraction by expressing them as interpretable graphs over image views. We begin by theoretically showing that abstract relational representations are nothing but a way of recovering transitive relationships among local views. Based on this, we design Transitivity Recovering Decompositions (TRD), a graph-space search algorithm that identifies interpretable equivalents of abstract emergent relationships at both instance and class levels, and with no post-hoc computations. We additionally show that TRD is provably robust to noisy views, with empirical evidence also supporting this finding. The latter allows TRD to perform at par or even better than the state-of-the-art, while being fully interpretable. Implementation is available at https://github.com/abhrac/trd.

Anjan Dutta, Zeynep Akata (2020) Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-Based Image Retrieval, In: International journal of computer vision 128(10-11), pp. 2684-2703, Springer Nature

Low-shot sketch-based image retrieval is an emerging task in computer vision that allows retrieving natural images relevant to hand-drawn sketch queries that are rarely seen during the training phase. Related prior works either require aligned sketch-image pairs that are costly to obtain or rely on an inefficient memory fusion layer for mapping the visual information to a semantic space. In this paper, we address any-shot, i.e., zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks, where we introduce the few-shot setting for SBIR. For solving these tasks, we propose a semantically aligned paired cycle-consistent generative adversarial network (SEM-PCYC) for any-shot SBIR, where each branch of the generative adversarial network maps the visual information from sketch and image to a common semantic space via adversarial training. Each of these branches maintains cycle consistency that only requires supervision at the category level, and avoids the need for aligned sketch-image pairs. A classification criterion on the generators' outputs ensures that the visual to semantic space mapping is class-specific. Furthermore, we propose to combine textual and hierarchical side information via an auto-encoder that selects discriminating side information within the same end-to-end model. Our results demonstrate a significant boost in any-shot SBIR performance over the state-of-the-art on the extended version of the challenging Sketchy, TU-Berlin and QuickDraw datasets.

Abhra Chaudhuri, Massimiliano Mancini, Zeynep Akata, Anjan Dutta (2022) Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Fine-grained categories that largely share the same set of parts cannot be discriminated based on part information alone, as they mostly differ in the way the local parts relate to the overall global structure of the object. We propose Relational Proxies, a novel approach that leverages the relational information between the global and local views of an object for encoding its semantic label. Starting with a rigorous formalization of the notion of distinguishability between fine-grained categories, we prove the necessary and sufficient conditions that a model must satisfy in order to learn the underlying decision boundaries in the fine-grained setting. We design Relational Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets and achieve state-of-the-art results on all of them, surpassing the performance of all existing works with a margin exceeding 4% in some cases. We also experimentally validate our theory on fine-grained distinguishability and obtain consistent results across multiple benchmarks. Implementation is available at https://github.com/abhrac/relational-proxies.

Faisal Alamri, Anjan Dutta (2023) Implicit and explicit attention mechanisms for zero-shot learning, In: Neurocomputing (Amsterdam) 534, pp. 55-66, Elsevier B.V.

Zero-Shot Learning (ZSL) aims to recognise unseen object classes which are not observed during the training phase. Most of the existing methods on ZSL focus on learning a compatibility function between the image representation and class semantic information. A few others concentrate on learning image representation by combining local and global features. However, the existing approaches still fail to address the bias issue towards the seen classes. This paper proposes implicit and explicit attention mechanisms to address the existing bias problem in generalised ZSL models. We formulate the implicit attention mechanism with a self-supervised image angle rotation task, which focuses on specific image features that aid in solving the task. The explicit attention mechanism, on the other hand, is built on the multi-headed self-attention mechanism of the Vision Transformer model, which learns to attend to important image locations and map global image features to semantic space during the training stage. We have conducted comprehensive experiments on three popular benchmarks: AWA2, CUB and SUN, where the effectiveness of our proposed attention mechanisms is shown in both discriminative and generative settings. Our extensive experiments show that our method achieves state-of-the-art performance, obtaining the highest harmonic mean on all three datasets, which encourages considering ViT-based attention mechanisms for ZSL tasks in the future.
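
The harmonic mean mentioned in the abstract above is the usual summary metric in generalised ZSL, and a short sketch may help show why it penalises bias towards seen classes. The function name and the accuracy values below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracies."""
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Hypothetical accuracies, for illustration only (not taken from the paper).
biased = harmonic_mean(0.90, 0.10)    # strong on seen classes, weak on unseen
balanced = harmonic_mean(0.60, 0.50)  # moderate on both

print(f"biased model:   H = {biased:.3f}")
print(f"balanced model: H = {balanced:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two accuracies, a model that is strong only on seen classes scores lower than a more balanced one, even when its arithmetic mean is similar.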

Anjan Dutta, Massimiliano Mancini, Zeynep Akata (2021) Concurrent Discrimination and Alignment for Self-Supervised Feature Learning, In: 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), pp. 2189-2198, IEEE

Existing self-supervised learning methods learn representations by means of pretext tasks that are either (1) discriminating, explicitly specifying which features should be separated, or (2) aligning, precisely indicating which features should be close together, but they do not address how to jointly and in a principled way define which features should be repelled and which should be attracted. In this work, we combine the positive aspects of the discriminating and aligning methods, and design a hybrid method that addresses the above issue. Our method explicitly specifies the repulsion and attraction mechanisms, respectively, by a discriminative predictive task and by concurrently maximizing mutual information between paired views sharing redundant information. We qualitatively and quantitatively show that our proposed model learns better features that are more effective for diverse downstream tasks ranging from classification to semantic segmentation. Our experiments on nine established benchmarks show that the proposed model consistently outperforms the existing state-of-the-art results under self-supervised and transfer learning protocols.

Ushasi Chaudhuri, Ruchika Chavan, Biplab Banerjee, Anjan Dutta, Zeynep Akata (2022) BDA-SketRet: Bi-level domain adaptation for zero-shot SBIR, In: Neurocomputing (Amsterdam) 514, pp. 245-255, Elsevier B.V.

The efficacy of zero-shot sketch-based image retrieval (ZS-SBIR) models is governed by two challenges. The immense distribution gap between the sketches and the images requires a proper domain alignment. Moreover, the fine-grained nature of the task and the high intra-class variance of many categories necessitate a class-wise discriminative mapping among the sketch, image, and semantic spaces. Under this premise, we propose BDA-SketRet, a novel ZS-SBIR framework performing a bi-level domain adaptation for progressively aligning the spatial and semantic features of the visual data pairs. In order to highlight the shared features and reduce the effects of any sketch or image-specific artifacts, we propose a novel symmetric loss function based on the notion of information bottleneck for aligning the semantic features, while a cross-entropy-based adversarial loss is introduced to align the spatial feature maps. Finally, our CNN-based model confirms the discriminativeness of the shared latent space through a novel topology-preserving semantic projection network. Experimental results on the extended Sketchy, TU-Berlin, and QuickDraw datasets exhibit sharp improvements over the literature.

Swapnil Bhosale, Abhra Chaudhuri, Alex Williams, Divyank Tiwari, Anjan Dutta, Xiatian Zhu, Pushpak Bhattacharyya, Diptesh Kanojia (2023) Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection, In: arXiv.org Cornell University Library, arXiv.org

The introduction of the MUStARD dataset, and its emotion recognition extension MUStARD++, have identified sarcasm to be a multi-modal phenomenon, expressed not only in natural language text, but also through manners of speech (like tonality and intonation) and visual cues (facial expression). With this work, we aim to perform a rigorous benchmarking of the MUStARD++ dataset by considering state-of-the-art language, speech, and visual encoders, for fully utilizing the totality of the multi-modal richness that it has to offer, achieving a 2% improvement in macro-F1 over the existing benchmark. Additionally, to cure the imbalance in the 'sarcasm type' category in MUStARD++, we propose an extension, which we call MUStARD++ Balanced, benchmarking the same with instances from the extension split across both train and test sets, achieving a further 2.4% macro-F1 boost. The new clips were taken from a novel source, the TV show House MD, which adds to the diversity of the dataset, and were manually annotated by multiple annotators with substantial inter-annotator agreement in terms of Cohen's kappa and Krippendorff's alpha. Our code, extended data, and SOTA benchmark models are made public.
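
As a reminder of what the agreement statistic above measures, the following is a minimal sketch of Cohen's kappa for two annotators. The function name and the toy labels are hypothetical and are not drawn from MUStARD++ Balanced. Cohen's kappa is defined for a pair of annotators, so with more annotators it would typically be computed per pair, while Krippendorff's alpha handles multiple annotators directly.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations (1 = sarcastic, 0 = not sarcastic), for illustration only.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(annotator_a, annotator_b))
```

In this toy case the annotators agree on 7 of 8 items (0.875) while chance agreement is 0.5, so the statistic corrects the raw agreement rate downwards.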

Faisal Alamri, Anjan Dutta (2021) Implicit and Explicit Attention for Zero-Shot Learning, In: Pattern Recognition, pp. 467-483, Springer International Publishing

Most of the existing Zero-Shot Learning (ZSL) methods focus on learning a compatibility function between the image representation and class attributes. A few others concentrate on learning image representation combining local and global features. However, the existing approaches still fail to address the bias issue towards the seen classes. In this paper, we propose implicit and explicit attention mechanisms to address the existing bias problem in ZSL models. We formulate the implicit attention mechanism with a self-supervised image angle rotation task, which focuses on specific image features that aid in solving the task. The explicit attention mechanism is built on the multi-headed self-attention mechanism of the Vision Transformer model, which learns to map image features to semantic space during the training stage. We conduct comprehensive experiments on three popular benchmarks: AWA2, CUB and SUN. Our proposed attention mechanisms prove effective, achieving the state-of-the-art harmonic mean on all three datasets.

Abhra Chaudhuri, Massimiliano Mancini, Zeynep Akata, Anjan Dutta (2022) Relational Proxies: Emergent Relationships as Fine-Grained Discriminators, In: arXiv.org Cornell University Library, arXiv.org

Fine-grained categories that largely share the same set of parts cannot be discriminated based on part information alone, as they mostly differ in the way the local parts relate to the overall global structure of the object. We propose Relational Proxies, a novel approach that leverages the relational information between the global and local views of an object for encoding its semantic label. Starting with a rigorous formalization of the notion of distinguishability between fine-grained categories, we prove the necessary and sufficient conditions that a model must satisfy in order to learn the underlying decision boundaries in the fine-grained setting. We design Relational Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets and achieve state-of-the-art results on all of them, surpassing the performance of all existing works with a margin exceeding 4% in some cases. We also experimentally validate our theory on fine-grained distinguishability and obtain consistent results across multiple benchmarks. Implementation is available at https://github.com/abhrac/relational-proxies.

Stephan Alaniz, Massimiliano Mancini, Anjan Dutta, Diego Marcos, Zeynep Akata (2022) Abstracting Sketches Through Simple Primitives, In: S Avidan, G Brostow, M Cisse, G M Farinella, T Hassner (eds.), COMPUTER VISION, ECCV 2022, PT XXIX 13689, pp. 396-412, Springer Nature

Humans show a high level of abstraction capabilities in games that require quickly communicating object information. They decompose the message content into multiple parts and communicate them in an interpretable protocol. Toward equipping machines with such capabilities, we propose the Primitive-based Sketch Abstraction task where the goal is to represent sketches using a fixed set of drawing primitives under the influence of a budget. To solve this task, our Primitive-Matching Network (PMN) learns interpretable abstractions of a sketch in a self-supervised manner. Specifically, PMN maps each stroke of a sketch to its most similar primitive in a given set, predicting an affine transformation that aligns the selected primitive to the target stroke. We learn this stroke-to-primitive mapping end-to-end with a distance-transform loss that is minimal when the original sketch is precisely reconstructed with the predicted primitives. Our PMN abstraction empirically achieves the highest performance on sketch recognition and sketch-based image retrieval given a communication budget, while at the same time being highly interpretable. This opens up new possibilities for sketch analysis, such as comparing sketches by extracting the most relevant primitives that define an object category.

Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, Zeynep Akata (2020) Learning Robust Representations via Multi-View Information Bottleneck

The information bottleneck principle provides an information-theoretic method for representation learning, by training an encoder to retain all information which is relevant for predicting the label while minimizing the amount of other, excess information in the representation. The original formulation, however, requires labeled data to identify the superfluous information. In this work, we extend this ability to the multi-view unsupervised setting, where two views of the same underlying entity are provided but the label is unknown. This enables us to identify superfluous information as that not shared by both views. A theoretical analysis leads to the definition of a new multi-view model that produces state-of-the-art results on the Sketchy dataset and label-limited versions of the MIR-Flickr dataset. We also extend our theory to the single-view setting by taking advantage of standard data augmentation techniques, empirically showing better generalization capabilities when compared to common unsupervised approaches for representation learning.

Anjan Dutta, Pau Riba, Josep Llados, Alicia Fornes (2020) Hierarchical stochastic graphlet embedding for graph-based pattern recognition, In: Neural computing & applications 32(15), pp. 11579-11596, Springer Nature

Despite being very successful within the pattern recognition and machine learning community, graph-based methods are often unusable because of the lack of mathematical operations defined in the graph domain. Graph embedding, which maps graphs to a vectorial space, has been proposed as a way to tackle these difficulties, enabling the use of standard machine learning techniques. However, it is well known that graph embedding functions usually suffer from the loss of structural information. In this paper, we consider the hierarchical structure of a graph as a way to mitigate this loss of information. The hierarchical structure is constructed by topologically clustering the graph nodes and considering each cluster as a node in the upper hierarchical level. Once this hierarchical structure is constructed, we consider several configurations to define the mapping into a vector space given a classical graph embedding; in particular, we propose to make use of the stochastic graphlet embedding (SGE). Broadly speaking, SGE produces a distribution of uniformly sampled low-to-high-order graphlets as a way to embed graphs into the vector space. In what follows, the coarse-to-fine structure of a graph hierarchy and the statistics fetched by the SGE complement each other and include important structural information with varied contexts. Altogether, these two techniques substantially mitigate the usual information loss involved in graph embedding techniques, obtaining a more robust graph representation. This fact has been corroborated through a detailed experimental evaluation on various benchmark graph datasets, where we outperform the state-of-the-art methods.

Stephan Alaniz, Massimiliano Mancini, Anjan Dutta, Diego Marcos, Zeynep Akata (2022) Abstracting Sketches through Simple Primitives

Humans show a high level of abstraction capabilities in games that require quickly communicating object information. They decompose the message content into multiple parts and communicate them in an interpretable protocol. Toward equipping machines with such capabilities, we propose the Primitive-based Sketch Abstraction task where the goal is to represent sketches using a fixed set of drawing primitives under the influence of a budget. To solve this task, our Primitive-Matching Network (PMN) learns interpretable abstractions of a sketch in a self-supervised manner. Specifically, PMN maps each stroke of a sketch to its most similar primitive in a given set, predicting an affine transformation that aligns the selected primitive to the target stroke. We learn this stroke-to-primitive mapping end-to-end with a distance-transform loss that is minimal when the original sketch is precisely reconstructed with the predicted primitives. Our PMN abstraction empirically achieves the highest performance on sketch recognition and sketch-based image retrieval given a communication budget, while at the same time being highly interpretable. This opens up new possibilities for sketch analysis, such as comparing sketches by extracting the most relevant primitives that define an object category. Code is available at https://github.com/ExplainableML/sketch-primitives.