Ayan Kumar Bhunia
Academic and research departments: Faculty of Engineering and Physical Sciences; Centre for Vision, Speech and Signal Processing (CVSSP); Department of Electrical and Electronic Engineering.
My research project
Computer vision and deep learning
I am a first-year PhD student, focusing on Computer Vision and Deep Learning, at the Centre for Vision, Speech and Signal Processing (CVSSP) of the University of Surrey. My supervisors are Dr Yi-Zhe Song and Prof. Tao Xiang. Prior to that, I worked as a research assistant at the Institute for Media Innovation (IMI) Lab of Nanyang Technological University, Singapore, under the supervision of Prof. Tat-Jen Cham and Prof. Jianfei Cai. I completed my B.Tech (Bachelor of Technology) at the Institute of Engineering & Management (IEM), Kolkata (India), under West Bengal University of Technology, majoring in Electronics and Communication Engineering.
During my undergraduate days, I worked with Dr Partha Pratim Roy, Assistant Professor, CSE Department, Indian Institute of Technology, Roorkee, and Prof. Umapada Pal, Head and Professor, CVPR Unit, Indian Statistical Institute, Kolkata. I also did a three-month internship on Facial Expression Recognition under Prof. Dipti Prasad Mukherjee, Head and Professor, Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, as part of an Indian Academy of Sciences Summer Research Fellowship.
Several of my research works have been published or accepted in journals such as Pattern Recognition (Elsevier), Neurocomputing (Elsevier), Expert Systems with Applications (Elsevier) and Multimedia Tools and Applications (Springer), and in popular conferences such as IEEE WACV, ICPR and ICDAR. Recently, one of my papers was accepted at CVPR 2019, a top venue in my research field.
J12: Ayan Kumar Bhunia*, Subham Mukherjee, Aneeshan Sain, Ankan Kumar Bhunia, Partha Pratim Roy, Umapada Pal, “Indic Handwritten Script Identification using Offline-Online Multi-modal Deep Network”, Information Fusion, Elsevier, 2019. (Impact Factor - 10.716)
C11: Ankan Kumar Bhunia, Ayan Kumar Bhunia*, Aneeshan Sain, Partha Pratim Roy, “Improving Document Binarization via Adversarial Noise-Texture Augmentation”, IEEE International Conference on Image Processing (ICIP), 2019. [PDF] [Github]
C10: Pranay Mukherjee, Abhirup Das, Ayan Kumar Bhunia*, Partha Pratim Roy, “Cogni-Net: Cognitive Feature Learning through Deep Visual Perception”, IEEE International Conference on Image Processing (ICIP), 2019. [PDF] [Github]
J11: Ayan Kumar Bhunia, Shuvozit Ghose, Partha Pratim Roy, Subrahmanyam Murala, “A Novel Feature Descriptor for Image Retrieval by Combining Modified Color Histogram and Diagonally Symmetric Co-occurrence Texture Pattern”, Pattern Analysis and Applications, 2019, Springer. (Impact Factor-1.28) [PDF]
C9: Ayan Kumar Bhunia, Abhirup Das, Ankan Kumar Bhunia, Perla Sai Raj Kishore, Partha Pratim Roy, “Handwriting Recognition in Low-resource Scripts using Adversarial Learning”, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019. [PDF]
C8: Sauradeep Nag, Ayan Kumar Bhunia*, Aishik Konwer, Partha Pratim Roy, “Facial Micro-expression Spotting and Recognition using Time Contrasted Feature with Visual Memory”, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [PDF]
C7: Perla Sai Raj Kishore, Ayan Kumar Bhunia*, Shuvozit Ghose, Partha Pratim Roy, “User Constrained Thumbnail Generation using Adaptive Convolutions”, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. [PDF] [Oral] [Code]
C6: Ayan Kumar Bhunia, Perla Sai Raj Kishore, Pranay Mukherjee, Abhirup Das, Partha Pratim Roy “Texture Synthesis Guided Deep Hashing for Texture Image Retrieval”, IEEE Winter Conference on Applications of Computer Vision (WACV), 2019. [PDF, Youtube]
C5: Ayan Kumar Bhunia, Abir Bhowmick, Ankan Kumar Bhunia, Aishik Konwer, Prithaj Banerjee, Partha Pratim Roy, Umapada Pal, “Handwriting Trajectory Recovery using End-to-End Deep Encoder-Decoder Network”, 24th International Conference on Pattern Recognition (ICPR), 2018. [PDF, CODE]
C4: Ankan Kumar Bhunia, Ayan Kumar Bhunia, Prithaj Banerjee, Aishik Konwer, Abir Bhowmick, Partha Pratim Roy, Umapada Pal, “Word Level Font-to-Font Image Translation using Convolutional Recurrent Generative Adversarial Networks”, 24th International Conference on Pattern Recognition (ICPR), 2018. [PDF]
C3: Aishik Konwer, Ayan Kumar Bhunia, Abir Bhowmick, Ankan Kumar Bhunia, Prithaj Banerjee, Partha Pratim Roy, Umapada Pal, “Staff line Removal using Generative Adversarial Networks”, 24th International Conference on Pattern Recognition (ICPR), 2018. [PDF, Oral ]
J10: Partha Pratim Roy, Ayan Kumar Bhunia, Avirup Bhattacharyya, Umapada Pal, “Word Searching in Scene Image and Video Frame in Multi-Script Scenario using Dynamic Shape Coding”, Multimedia Tools and Applications, 2018, Springer. (Impact Factor-1.541) (DOI:10.1007/s11042-018-6484-5) [PDF]
J9: Ankan Kumar Bhunia*, Aishik Konwer*, Ayan Kumar Bhunia, Abir Bhowmick, Partha Pratim Roy, Umapada Pal, “Script Identification in Natural Scene Image and Video Frame using Attention based Convolutional-LSTM Network", Pattern Recognition, Elsevier, 2018. (Impact Factor - 3.962) (DOI:10.1016/j.patcog.2018.07.034) [PDF, Github]
J8: Prithaj Banerjee, Ayan Kumar Bhunia, Avirup Bhattacharyya, Partha Pratim Roy, Subrahmanyam Murala, “Local Neighborhood Intensity Pattern – A new texture feature descriptor for image retrieval", Expert Systems with Applications, 2018. (Impact Factor - 3.928) (DOI:10.1016/j.eswa.2018.06.044) [PDF]
J7: Ayan Kumar Bhunia, Partha Pratim Roy, Akash Mohta, Umapada Pal, “Cross-language Framework for Word Recognition and Spotting of Indic Scripts", Pattern Recognition, 2018, Elsevier. (Impact Factor - 4.582) (DOI:10.1016/j.patcog.2018.01.034) [PDF]
J6: Aneeshan Sain, Ayan Kumar Bhunia, Partha Pratim Roy, Umapada Pal, “Multi-Oriented Text Detection and Verification in Video Frames and Scene Images", Neurocomputing, Elsevier. (Impact Factor - 3.317) (DOI:10.1016/j.neucom.2017.09.089) [PDF]
J5: Ayan Kumar Bhunia, Gautam Kumar, Partha Pratim Roy, R. Balasubramanian, Umapada Pal, “Text Recognition in Scene Image and Video Frames using Color Channel Selection”, Multimedia Tools and Applications, Volume 77, Pages 8551–8578, 2018, Springer. (Impact Factor-1.530) (DOI:10.1007/s11042-017-4750-6) [PDF]
J4: Partha Pratim Roy, Ayan Kumar Bhunia, Umapada Pal, “Date-Field Retrieval in Scene Image and Video Frames using Text Enhancement and Shape Coding", Neurocomputing, Elsevier. (Impact Factor - 2.392) (DOI:10.1016/j.neucom.2016.08.141) [PDF]
J3: Partha Pratim Roy, Ayan Kumar Bhunia, Umapada Pal, “HMM-based Writer Identification in Music Score Documents without Staff-Line Removal", Expert Systems with Applications, Volume-89, Pages 222-240, 2017, Elsevier. (Impact Factor - 3.928) (DOI:10.1016/j.eswa.2017.07.031) [PDF]
J2: Partha Pratim Roy, Ayan Kumar Bhunia, Ayan Das, Prithviraj Dhar, Umapada Pal, “Keyword Spotting in Doctor's Handwriting on Medical Prescriptions”, Expert Systems with Applications, Volume-76, Pages 113-128, 2017, Elsevier. (Impact Factor 2.981). (DOI:10.1016/j.eswa.2017.01.027) [PDF]
J1: Partha Pratim Roy, Ayan Kumar Bhunia, Ayan Das, Prasenjit Dey, Umapada Pal, “HMM-based Indic Handwriting Recognition using Zone Segmentation", Pattern Recognition, Volume-60, Pages 1057-1075, 2016, Elsevier. (Impact Factor-3.399) (DOI:10.1016/j.patcog.2016.04.012)[PDF]
C1: Ayan Kumar Bhunia, Ayan Das, Partha Pratim Roy, Umapada Pal, “A Comparative Study of Features for Handwritten Bangla Text Recognition”, 13th International Conference on Document Analysis and Recognition (ICDAR), Pages 636-640, Nancy, France, 2015, IEEE. (DOI:10.1109/ICDAR.2015.7333839) [PDF, Oral]
C2: Ayan Das, Ayan Kumar Bhunia, Partha Pratim Roy, Umapada Pal, “Handwritten Word Spotting in Indic Scripts using Foreground and Background Information", 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Pages 426-430, Kuala Lumpur, Malaysia, 2015, IEEE. (DOI:10.1109/ACPR.2015.7486539) [PDF]
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a particular photo instance given a user’s query sketch. Its widespread applicability is however hindered by the fact that drawing a sketch takes time, and most people struggle to draw a complete and faithful sketch. In this paper, we reformulate the conventional FG-SBIR framework to tackle these challenges, with the ultimate goal of retrieving the target photo with the least number of strokes possible. We further propose an on-the-fly design that starts retrieving as soon as the user starts drawing. To accomplish this, we devise a reinforcement learning based cross-modal retrieval framework that directly optimizes the rank of the ground-truth photo over a complete sketch drawing episode. Additionally, we introduce a novel reward scheme that circumvents the problems related to irrelevant sketch strokes, and thus provides us with a more consistent rank list during retrieval. We achieve superior early-retrieval efficiency over state-of-the-art methods and alternative baselines on two publicly available fine-grained sketch retrieval datasets.
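The rank-based objective above can be illustrated with a minimal numpy sketch. All names, the Euclidean distance metric, and the inverse-rank reward here are illustrative assumptions, not the paper's exact formulation: at each stroke step the partial sketch is embedded, the gallery is ranked by distance, and the agent is rewarded more when the ground-truth photo ranks earlier.

```python
import numpy as np

def rank_of_target(sketch_feat, gallery_feats, target_idx):
    """Rank (1 = best) of the target photo under Euclidean distance."""
    dists = np.linalg.norm(gallery_feats - sketch_feat, axis=1)
    order = np.argsort(dists)
    return int(np.where(order == target_idx)[0][0]) + 1

def episode_rewards(step_feats, gallery_feats, target_idx):
    """Inverse-rank reward collected over one sketching episode."""
    return [1.0 / rank_of_target(f, gallery_feats, target_idx)
            for f in step_feats]

# Toy example: 3-photo gallery, target is photo 0; as strokes accumulate,
# the sketch embedding drifts towards the target photo's embedding.
gallery = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
steps = [np.array([2.0, 2.0]), np.array([1.0, 1.0]), np.array([0.1, 0.1])]
rewards = episode_rewards(steps, gallery, target_idx=0)
```

An episode-level return built from these per-step rewards is what a policy-gradient method would then maximize, pushing the model to rank the target highly as early in the drawing as possible.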
We present the first competitive drawing agent Pixelor that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is a winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly as or faster than a human competitor. The key to victory for the agent is to learn the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator. Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors’ strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance. The code and the dataset are available at http://sketchx.ai/pixelor.
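The "most recognizable strokes first" idea can be sketched very simply. In this toy numpy illustration (the per-stroke scores and the greedy rule are assumptions for illustration; the paper infers recognizability with a model and searches orderings via neural sorting), strokes are reordered by a stand-in recognizability gain, and an ordering is scored by its cumulative early recognizability.

```python
import numpy as np

def greedy_early_order(gain):
    """Order strokes so the most recognizable come first.

    gain[i] is a stand-in score for how much stroke i contributes to
    early recognizability of the sketch."""
    return [int(i) for i in np.argsort(-np.asarray(gain))]

def early_recognizability(order, gain):
    """Sum of cumulative gains after each stroke: an 'area under the
    recognizability curve' style summary of an ordering."""
    g = np.asarray(gain)[order]
    return float(np.cumsum(g).sum())

gain = [0.1, 0.6, 0.3]                # stroke 1 is the most informative
order = greedy_early_order(gain)      # draw it first
```

With the informative strokes front-loaded, the cumulative-recognizability score is strictly higher than with the original drawing order, which is exactly the property the supervised stroke generator is then trained to reproduce.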
Sketch as an image search query is an ideal alternative to text in capturing the fine-grained visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have demonstrated the importance of tackling the unique traits of sketches as opposed to photos, e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a further trait of sketches that has been overlooked to date, that is, they are hierarchical in terms of the levels of detail – a person typically sketches up to various extents of detail to depict an object. This hierarchical structure is often visually distinct. In this paper, we design a novel network that is capable of cultivating sketch-specific hierarchies and exploiting them to match sketch with photo at corresponding hierarchical levels. In particular, features from a sketch and a photo are enriched using cross-modal co-attention, coupled with hierarchical node fusion at every level to form a better embedding space in which to conduct retrieval. Experiments on common benchmarks show our method outperforms the state of the art by a significant margin.
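The cross-modal co-attention step can be sketched in numpy roughly as follows. The shapes, the scaled dot-product form, and the residual update are illustrative assumptions rather than the paper's exact architecture: each sketch-region feature attends over photo-region features and is enriched with the attended photo context.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attend(sketch, photo):
    """Enrich sketch-region features with attended photo context.

    sketch: (n, d) region features from the sketch branch
    photo:  (m, d) region features from the photo branch
    Returns (n, d) enriched sketch features via a residual update."""
    d = sketch.shape[1]
    attn = softmax(sketch @ photo.T / np.sqrt(d), axis=1)  # (n, m) weights
    return sketch + attn @ photo

rng = np.random.default_rng(0)
s, p = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
enriched = co_attend(s, p)
```

A symmetric call, `co_attend(p, s)`, would enrich the photo features with sketch context; both enriched sets can then be embedded and compared for retrieval.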
A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) models is the data scarcity – model performances are largely bottlenecked by the lack of sketch-photo pairs. Whilst the number of photos can be easily scaled, each corresponding sketch still needs to be individually produced. In this paper, we aim to mitigate such an upper-bound on sketch data, and study whether unlabelled photos alone (of which there are many) can be cultivated for performance gain. In particular, we introduce a novel semi-supervised framework for cross-modal retrieval that can additionally leverage large-scale unlabelled photos to account for data scarcity. At the center of our semi-supervision design is a sequential photo-to-sketch generation model that aims to generate paired sketches for unlabelled photos. Importantly, we further introduce a discriminator-guided mechanism to guide against unfaithful generation, together with a distillation loss-based regularizer to provide tolerance against noisy training samples. Last but not least, we treat generation and retrieval as two conjugate problems, where a joint learning procedure is devised for each module to mutually benefit from each other. Extensive experiments show that our semi-supervised model yields a significant performance boost over the state-of-the-art supervised alternatives, as well as existing methods that can exploit unlabelled photos for FG-SBIR.
Sketch-based image retrieval (SBIR) is a cross-modal matching problem which is typically solved by learning a joint embedding space where the semantic content shared between photo and sketch modalities are preserved. However, a fundamental challenge in SBIR has been largely ignored so far, that is, sketches are drawn by humans and considerable style variations exist amongst different users. An effective SBIR model needs to explicitly account for this style diversity, crucially, to generalise to unseen user styles. To this end, a novel style-agnostic SBIR model is proposed. Different from existing models, a cross-modal variational autoencoder (VAE) is employed to explicitly disentangle each sketch into a semantic content part shared with the corresponding photo, and a style part unique to the sketcher. Importantly, to make our model dynamically adaptable to any unseen user styles, we propose to meta-train our cross-modal VAE by adding two style-adaptive components: a set of feature transformation layers to its encoder and a regulariser to the disentangled semantic content latent code. With this meta-learning framework, our model can not only disentangle the cross-modal shared semantic content for SBIR, but can adapt the disentanglement to any unseen user style as well, making the SBIR model truly style-agnostic. Extensive experiments show that our style-agnostic model yields state-of-the-art performance for both category-level and instance-level SBIR.
Self-supervised learning has gained prominence due to its efficacy at learning powerful representations from unlabelled data that achieve excellent performance on many challenging downstream tasks. However, supervision-free pretext tasks are challenging to design and usually modality specific. Although there is a rich literature of self-supervised methods for either spatial (such as images) or temporal data (sound or text) modalities, a common pretext task that benefits both modalities is largely missing. In this paper, we are interested in defining a self-supervised pretext task for sketches and handwriting data. This data is uniquely characterised by its existence in dual modalities of rasterized images and vector coordinate sequences. We address and exploit this dual representation by proposing two novel cross-modal translation pretext tasks for self-supervised feature learning: Vectorization and Rasterization. Vectorization learns to map image space to vector coordinates and rasterization maps vector coordinates to image space. We show that our learned encoder modules benefit both raster-based and vector-based downstream approaches to analysing hand-drawn data. Empirical evidence shows that our novel pretext tasks surpass existing single and multi-modal self-supervision methods.
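To make the dual representation concrete, here is a toy numpy sketch of the rasterization direction, using a nearest-pixel renderer (the pretext tasks in the paper are learned encoder-decoder models, not a fixed renderer like this): vector coordinates in [0, 1]² are mapped onto a binary image grid.

```python
import numpy as np

def rasterize(points, size=8):
    """Render a sequence of (x, y) vector coordinates in [0, 1]^2
    onto a size x size binary raster by nearest-pixel rounding."""
    img = np.zeros((size, size), dtype=np.uint8)
    for x, y in points:
        col = min(int(round(x * (size - 1))), size - 1)
        row = min(int(round(y * (size - 1))), size - 1)
        img[row, col] = 1
    return img

stroke = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]  # a diagonal stroke
img = rasterize(stroke)
```

Vectorization is the inverse mapping, from raster image back to a coordinate sequence; it is the lossiness and ambiguity of both directions that make them non-trivial pretext tasks for representation learning.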
Handwritten Text Recognition (HTR) remains a challenging problem to date, largely due to the varying writing styles that exist amongst us. Prior works however generally operate with the assumption that there is a limited number of styles, most of which have already been captured by existing datasets. In this paper, we take a completely different perspective – we work on the assumption that there is always a new style that is drastically different, and that we will only have very limited data during testing to perform adaptation. This creates a commercially viable solution – being exposed to the new style, the model has the best shot at adaptation, and the few-sample nature makes it practical to implement. We achieve this via a novel meta-learning framework which exploits additional new-writer data via a support set, and outputs a writer-adapted model via a single gradient step update, all during inference (see Figure 1). We discover and leverage the important insight that there exist a few key characters per writer that exhibit relatively larger style discrepancies. For that, we additionally propose to meta-learn instance-specific weights for a character-wise cross-entropy loss, which is specifically designed to work with the sequential nature of text data. Our writer-adaptive MetaHTR framework can be easily implemented on top of most state-of-the-art HTR models. Experiments show an average performance gain of 5-7% can be obtained by observing very few new-style samples (≤ 16).
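The single-gradient-step adaptation at inference can be sketched as follows. This is a generic MAML-style inner update on a toy scalar regression with an analytic gradient (the paper applies the same idea to a full HTR model with a learned, character-wise weighted loss; all names here are illustrative): the writer's small support set yields one gradient step, and the adapted weights are used on that writer's remaining data.

```python
import numpy as np

def mse(w, x, y):
    """Mean squared error of the toy model y_hat = w * x."""
    return float(np.mean((w * x - y) ** 2))

def adapt_one_step(w, x_support, y_support, lr=0.1):
    """One MAML-style inner update: a single gradient step on the
    support set produces writer-adapted parameters."""
    grad = np.mean(2 * (w * x_support - y_support) * x_support)
    return w - lr * grad

# Toy 'writer': targets follow y = 2x; the generic model starts at w = 0.
x_s, y_s = np.array([1.0, 2.0]), np.array([2.0, 4.0])
w0 = 0.0
w_adapted = adapt_one_step(w0, x_s, y_s)
```

In meta-training, the outer loop would optimize the initialization `w0` (and, in MetaHTR, the per-character loss weights) so that this single inner step helps as much as possible on each new writer.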