Professor Tao Xiang


Professor of Computer Vision and Machine Learning
BSc, MSc, PhD
+44 (0)1483 684263
40 BA 01

Publications

Sen He, Wentong Liao, Michael Ying Yang, Yongxin Yang, Yi-Zhe Song, Bodo Rosenhahn, Tao Xiang (2021) Context-Aware Layout to Image Generation with Enhanced Object Appearance

A layout to image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff), conditioned on a given layout. Built upon the recent advances in generative adversarial networks (GANs), existing L2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) the object-to-object as well as object-to-stuff relations are often broken and (2) each object's appearance is typically distorted, lacking the key defining characteristics associated with the object class. We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators. To address these limitations, two new modules are proposed in this work. First, a context-aware feature transformation module is introduced in the generator to ensure that the generated feature encoding of either object or stuff is aware of other co-existing objects/stuff in the scene. Second, instead of feeding location-insensitive image features to the discriminator, we use the Gram matrix computed from the feature maps of the generated object images to preserve location-sensitive information, resulting in much enhanced object appearance. Extensive experiments show that the proposed method achieves state-of-the-art performance on the COCO-Thing-Stuff and Visual Genome benchmarks. Code available at: https://github.com/wtliao/layout2img.
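
As a rough illustration of the Gram matrix mentioned in the abstract above, the following sketch computes channel-by-channel inner products of a flattened feature map. It is not taken from the linked repository: the function name `gram_matrix`, the toy data, and the normalisation by the number of spatial positions are all assumptions made for this example.

```python
def gram_matrix(features):
    """Return the C x C Gram matrix of a feature map.

    `features` is a list of C channels, each a flat list of H*W activations.
    Entry (i, j) is the inner product of channels i and j, divided by the
    number of spatial positions (an assumed normalisation, not the paper's).
    """
    n_positions = len(features[0])
    if any(len(channel) != n_positions for channel in features):
        raise ValueError("all channels must have the same number of positions")
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n_positions for cj in features]
        for ci in features
    ]


# Hypothetical toy feature map: 2 channels over a flattened 2x2 grid.
toy_features = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
]
toy_gram = gram_matrix(toy_features)
```

In this toy case the two channels never fire at the same position, so the off-diagonal entries are zero.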

Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song (2021) Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting

Self-supervised learning has gained prominence due to its efficacy at learning powerful representations from unlabelled data that achieve excellent performance on many challenging downstream tasks. However, supervision-free pretext tasks are challenging to design and usually modality specific. Although there is a rich literature of self-supervised methods for either spatial (such as images) or temporal data (sound or text) modalities, a common pretext task that benefits both modalities is largely missing. In this paper, we are interested in defining a self-supervised pretext task for sketches and handwriting data. This data is uniquely characterised by its existence in dual modalities of rasterized images and vector coordinate sequences. We address and exploit this dual representation by proposing two novel cross-modal translation pretext tasks for self-supervised feature learning: Vectorization and Rasterization. Vectorization learns to map image space to vector coordinates and rasterization maps vector coordinates to image space. We show that our learned encoder modules benefit both raster-based and vector-based downstream approaches to analysing hand-drawn data. Empirical evidence shows that our novel pretext tasks surpass existing single and multi-modal self-supervision methods.

Ayan Das, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song (2021) Cloud2Curve: Generation and Vectorization of Parametric Sketches

Analysis of human sketches in deep learning has advanced immensely through the use of waypoint-sequences rather than raster-graphic representations. We further aim to model sketches as a sequence of low-dimensional parametric curves. To this end, we propose an inverse graphics framework capable of approximating a raster or waypoint based stroke encoded as a point-cloud with a variable-degree Bézier curve. Building on this module, we present Cloud2Curve, a generative model for scalable high-resolution vector sketches that can be trained end-to-end using point-cloud data alone. As a consequence, our model is also capable of deterministic vectorization which can map novel raster or waypoint based sketches to their corresponding high-resolution scalable Bézier equivalent. We evaluate the generation and vectorization capabilities of our model on Quick, Draw! and K-MNIST datasets. The analysis of free-hand sketches using deep learning [40] has flourished over the past few years, with sketches now being well analysed from classification [43, 42] and retrieval [27, 12, 4] perspectives. Sketches for digital analysis have always been acquired in two primary modalities - raster (pixel grids) and vector (line segments). Raster sketches have mostly been the modality of choice for sketch recognition and retrieval [43, 27]. However, generative sketch models began to advance rapidly [16] after focusing on vector representations and generating sketches as sequences [7, 37] of waypoints/line segments, similarly to how humans sketch. As a happy byproduct, this paradigm leads to clean and blur-free image generation as opposed to direct raster-graphic generations [30]. Recent works have studied creativity in sketch generation [16], learning to sketch raster photo input images [36], learning efficient

Aneeshan Sain, Ayan Kumar Bhunia, Yongxin Yang, Tao Xiang, Yi-Zhe Song (2021) StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval

Sketch-based image retrieval (SBIR) is a cross-modal matching problem which is typically solved by learning a joint embedding space where the semantic content shared between photo and sketch modalities is preserved. However, a fundamental challenge in SBIR has been largely ignored so far, that is, sketches are drawn by humans and considerable style variations exist amongst different users. An effective SBIR model needs to explicitly account for this style diversity and, crucially, to generalise to unseen user styles. To this end, a novel style-agnostic SBIR model is proposed. Different from existing models, a cross-modal variational autoencoder (VAE) is employed to explicitly disentangle each sketch into a semantic content part shared with the corresponding photo, and a style part unique to the sketcher. Importantly, to make our model dynamically adaptable to any unseen user styles, we propose to meta-train our cross-modal VAE by adding two style-adaptive components: a set of feature transformation layers to its encoder and a regulariser to the disentangled semantic content latent code. With this meta-learning framework, our model can not only disentangle the cross-modal shared semantic content for SBIR, but can adapt the disentanglement to any unseen user style as well, making the SBIR model truly style-agnostic. Extensive experiments show that our style-agnostic model yields state-of-the-art performance for both category-level and instance-level SBIR.

Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Aneeshan Sain, Yongxin Yang, Tao Xiang, Yi-Zhe Song (2021) More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based Image Retrieval

A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) models is data scarcity – model performance is largely bottlenecked by the lack of sketch-photo pairs. Whilst the number of photos can be easily scaled, each corresponding sketch still needs to be individually produced. In this paper, we aim to mitigate such an upper-bound on sketch data, and study whether unlabelled photos alone (of which there are many) can be cultivated for performance gain. In particular, we introduce a novel semi-supervised framework for cross-modal retrieval that can additionally leverage large-scale unlabelled photos to account for data scarcity. At the center of our semi-supervision design is a sequential photo-to-sketch generation model that aims to generate paired sketches for unlabelled photos. Importantly, we further introduce a discriminator-guided mechanism to guide against unfaithful generation, together with a distillation loss-based regularizer to provide tolerance against noisy training samples. Last but not least, we treat generation and retrieval as two conjugate problems, where a joint learning procedure is devised for each module to mutually benefit from each other. Extensive experiments show that our semi-supervised model yields a significant performance boost over the state-of-the-art supervised alternatives, as well as existing methods that can exploit unlabelled photos for FG-SBIR.

Zhihe Lu, Yongxin Yang, Xiatian Zhu, Cong Liu, Yi-Zhe Song, Tao Xiang (2020) Stochastic Classifiers for Unsupervised Domain Adaptation, In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9108-9117, IEEE

A common strategy adopted by existing state-of-the-art unsupervised domain adaptation (UDA) methods is to employ two classifiers to identify the misaligned local regions between source and target domain. Following the 'wisdom of the crowd' principle, one has to ask: why stop at two? Indeed, we find that using more classifiers leads to better performance, but also introduces more model parameters, therefore risking overfitting. In this paper, we introduce a novel method called STochastic clAssifieRs (STAR) for addressing this problem. Instead of representing one classifier as a weight vector, STAR models it as a Gaussian distribution with its variance representing the inter-classifier discrepancy. With STAR, we can now sample an arbitrary number of classifiers from the distribution, whilst keeping the model size the same as having two classifiers. Extensive experiments demonstrate that a variety of existing UDA methods can greatly benefit from STAR and achieve the state-of-the-art performance on both image classification and semantic segmentation tasks.
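
The idea of treating a classifier as a Gaussian distribution and sampling from it can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `sample_classifiers`, the per-weight standard deviation, and the toy numbers are hypothetical and do not come from the paper's code.

```python
import random


def sample_classifiers(mean, std, num_samples, rng):
    """Draw weight vectors w = mean + std * eps, with eps ~ N(0, 1) per weight.

    Only `mean` and `std` are stored, so the parameter count matches that of
    two ordinary weight vectors regardless of how many classifiers are drawn.
    """
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        for _ in range(num_samples)
    ]


def score(weights, feature):
    """Linear score of one feature vector under one sampled classifier."""
    return sum(w * x for w, x in zip(weights, feature))


# Hypothetical toy parameters for a single class.
rng = random.Random(0)
mean = [0.5, -1.0, 2.0]
std = [0.1, 0.2, 0.05]
classifiers = sample_classifiers(mean, std, num_samples=5, rng=rng)
scores = [score(w, [1.0, 0.0, 1.0]) for w in classifiers]
```

The spread of `scores` across the sampled classifiers gives a simple view of how much the classifiers disagree on one input.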

Yanwei Fu, Yongxin Yang, Timothy Hospedales, Tao Xiang, Shaogang Gong (2014) Transductive multi-label zero-shot learning, In: Proceedings of the British Machine Vision Conference 2014, BMVA Press

Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate semantic representations in the form of attributes and more recently, semantic word vectors. However, they have thus far been constrained to the single-label case, in contrast to the growing popularity and importance of more realistic multi-label data. In this paper, for the first time, we investigate and formalise a general framework for multi-label zero-shot learning, addressing the unique challenge therein: how to exploit multi-label correlation at test time with no training data for those classes? In particular, we propose (1) a multi-output deep regression model to project an image into a semantic word space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors; (2) a novel zero-shot learning algorithm for multi-label data that exploits the unique compositionality property of semantic word vector representations; and (3) a transductive learning strategy to enable the regression model learned from seen classes to generalise well to unseen classes. Our zero-shot learning experiments on a number of standard multi-label datasets demonstrate that our method outperforms a variety of baselines.

Mingyu Ding, An Zhao, Zhiwu Lu, Tao Xiang, Ji-Rong Wen (2019) Face-Focused Cross-Stream Network for Deception Detection in Videos, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794-7803, IEEE

Automated deception detection (ADD) from real-life videos is a challenging task. It specifically needs to address two problems: (1) Both face and body contain useful cues regarding whether a subject is deceptive. How to effectively fuse the two is thus key to the effectiveness of an ADD model. (2) Real-life deceptive samples are hard to collect; learning with limited training data thus challenges most deep learning based ADD models. In this work, both problems are addressed. Specifically, for face-body multimodal learning, a novel face-focused cross-stream network (FFCSN) is proposed. It differs significantly from the popular two-stream networks in that: (a) face detection is added into the spatial stream to capture the facial expressions explicitly, and (b) correlation learning is performed across the spatial and temporal streams for joint deep feature learning across both face and body. To address the training data scarcity problem, our FFCSN model is trained with both meta learning and adversarial learning. Extensive experiments show that our FFCSN model achieves state-of-the-art results. Further, the proposed FFCSN model as well as its robust training strategy are shown to be generally applicable to other human-centric video analysis tasks such as emotion recognition from user-generated videos.

Aoxue Li, Tiange Luo, Zhiwu Lu, Tao Xiang, Liwei Wang (2019) Large-Scale Few-Shot Learning: Knowledge Transfer With Class Hierarchy, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7205-7213, IEEE

Recently, large-scale few-shot learning (FSL) has become topical. It has been found that, for a large-scale FSL problem with 1,000 classes in the source domain, a strong baseline emerges: simply training a deep feature embedding model using the aggregated source classes and performing nearest neighbor (NN) search using the learned features on the target classes. The state-of-the-art large-scale FSL methods struggle to beat this baseline, indicating intrinsic limitations on scalability. To overcome the challenge, we propose a novel large-scale FSL model by learning transferable visual features with the class hierarchy which encodes the semantic relations between source and target classes. Extensive experiments show that the proposed model significantly outperforms not only the NN baseline but also the state-of-the-art alternatives. Furthermore, we show that the proposed model can be easily extended to the large-scale zero-shot learning (ZSL) problem and also achieves state-of-the-art results.
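
The nearest-neighbour step of the baseline described above can be sketched in a few lines. The sketch assumes that feature vectors have already been extracted by some embedding model; the function name `nearest_neighbour_label`, the Euclidean distance, and the toy support set are assumptions for illustration only.

```python
import math


def nearest_neighbour_label(query, support):
    """Return the label of the support feature closest to `query`.

    `support` is a list of (feature_vector, label) pairs for the few labelled
    target-class samples. Euclidean distance is an assumed choice of metric.
    """
    best_label, best_dist = None, math.inf
    for feature, label in support:
        dist = math.dist(query, feature)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical 2-D features for two target classes.
support_set = [
    ([0.0, 0.0], "cat"),
    ([1.0, 1.0], "dog"),
]
prediction = nearest_neighbour_label([0.9, 0.8], support_set)
```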

Aoxue Li, Tiange Luo, Tao Xiang, Weiran Huang, Liwei Wang (2019) Few-Shot Learning With Global Class Representations, In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9714-9723, IEEE

In this paper, we propose to tackle the challenging few-shot learning (FSL) problem by learning global class representations using both base and novel class training samples. In each training episode, an episodic class mean computed from a support set is registered with the global representation via a registration module. This produces a registered global class representation for computing the classification loss using a query set. Though following a similar episodic training pipeline as existing meta learning based approaches, our method differs significantly in that novel class training samples are involved in the training from the beginning. To compensate for the lack of novel class training samples, an effective sample synthesis strategy is developed to avoid overfitting. Importantly, by joint base-novel class training, our approach can be easily extended to a more practical yet challenging FSL setting, i.e., generalized FSL, where the label space of test data is extended to both base and novel classes. Extensive experiments show that our approach is effective for both of the two FSL settings.

Umar Riaz Muhammad, Yongxin Yang, Timothy Hospedales, Tao Xiang, Yi-Zhe Song (2019) Goal-Driven Sequential Data Abstraction, In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 71-80, IEEE

Automatic data abstraction is an important capability for both benchmarking machine intelligence and supporting summarization applications. In the former, one asks whether a machine can 'understand' enough about the meaning of input data to produce a meaningful but more compact abstraction. In the latter, this capability is exploited for saving space or human time by summarizing the essence of input data. In this paper, we study a general reinforcement learning based framework for learning to abstract sequential data in a goal-driven way. The ability to define different abstraction goals uniquely allows different aspects of the input data to be preserved according to the ultimate purpose of the abstraction. Our reinforcement learning objective does not require human-defined examples of ideal abstraction. Importantly, our model processes the input sequence holistically without being constrained by the original input order. Our framework is also domain agnostic: we demonstrate applications to sketch, video and text data and achieve promising results in all domains.

Tianyuan Yu, Da Li, Yongxin Yang, Timothy Hospedales, Tao Xiang (2019) Robust Person Re-Identification by Modelling Feature Uncertainty, In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 552-561, IEEE

We aim to learn deep person re-identification (ReID) models that are robust against noisy training data. Two types of noise are prevalent in practice: (1) label noise caused by human annotator errors and (2) data outliers caused by person detector errors or occlusion. Both types of noise pose serious problems for training ReID models, yet have been largely ignored so far. In this paper, we propose a novel deep network termed DistributionNet for robust ReID. Instead of representing each person image as a feature vector, DistributionNet models it as a Gaussian distribution with its variance representing the uncertainty of the extracted features. A carefully designed loss is formulated in DistributionNet to unevenly allocate uncertainty across training samples. Consequently, noisy samples are assigned large variance/uncertainty, which effectively alleviates their negative impacts on model fitting. Extensive experiments demonstrate that our model is more effective than alternative noise-robust deep models. The source code is available at: https://github.com/TianyuanYu/DistributionNet.

Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, Tao Xiang (2019) Omni-Scale Feature Learning for Person Re-Identification, In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3701-3711, IEEE

As an instance-level recognition problem, person re-identification (ReID) relies on discriminative features, which not only capture different spatial scales but also encapsulate an arbitrary combination of multiple scales. We call features of both homogeneous and heterogeneous scales omni-scale features. In this paper, a novel deep ReID CNN is designed, termed Omni-Scale Network (OSNet), for omni-scale feature learning. This is achieved by designing a residual block composed of multiple convolutional feature streams, each detecting features at a certain scale. Importantly, a novel unified aggregation gate is introduced to dynamically fuse multi-scale features with input-dependent channel-wise weights. To efficiently learn spatial-channel correlations and avoid overfitting, the building block uses both pointwise and depthwise convolutions. By stacking such blocks layer-by-layer, our OSNet is extremely lightweight and can be trained from scratch on existing ReID benchmarks. Despite its small model size, our OSNet achieves state-of-the-art performance on six person-ReID datasets. Code and models are available at: https://github.com/KaiyangZhou/deep-person-reid.

Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, Tao Xiang (2020) Deep Domain-Adversarial Image Generation for Domain Generalisation, In: Proceedings of the ... AAAI Conference on Artificial Intelligence 34(7), pp. 13025-13032

Machine learning models typically suffer from the domain shift problem when trained on a source dataset and evaluated on a target dataset of different distribution. To overcome this problem, domain generalisation (DG) methods aim to leverage data from multiple source domains so that a trained model can generalise to unseen domains. In this paper, we propose a novel DG approach based on Deep Domain-Adversarial Image Generation (DDAIG). Specifically, DDAIG consists of three components, namely a label classifier, a domain classifier and a domain transformation network (DoTNet). The goal for DoTNet is to map the source training data to unseen domains. This is achieved by having a learning objective formulated to ensure that the generated data can be correctly classified by the label classifier while fooling the domain classifier. By augmenting the source training data with the generated unseen domain data, we can make the label classifier more robust to unknown domain changes. Extensive experiments on four DG datasets demonstrate the effectiveness of our approach.

Kaiyue Pang, Yongxin Yang, Timothy M Hospedales, Tao Xiang, Yi-Zhe Song (2020) Solving Mixed-Modal Jigsaw Puzzle for Fine-Grained Sketch-Based Image Retrieval, In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10344-10352, IEEE

ImageNet pre-training has long been considered crucial by the fine-grained sketch-based image retrieval (FG-SBIR) community due to the lack of large sketch-photo paired datasets for FG-SBIR training. In this paper, we propose a self-supervised alternative for representation pre-training. Specifically, we consider the jigsaw puzzle game of recomposing images from shuffled parts. We identify two key facets of jigsaw task design that are required for effective FG-SBIR pre-training. The first is formulating the puzzle in a mixed-modality fashion. Second, we show that framing the optimisation as permutation matrix inference via Sinkhorn iterations is more effective than the common classifier formulation of Jigsaw self-supervision. Experiments show that this self-supervised pre-training strategy significantly outperforms the standard ImageNet-based pipeline across all four product-level FG-SBIR benchmarks. Interestingly, it also leads to improved cross-category generalisation across both pre-train/fine-tune and fine-tune/testing stages.
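
For readers unfamiliar with Sinkhorn iterations, the sketch below shows the generic procedure: exponentiate a square score matrix, then alternately normalise its rows and columns so that it approaches a doubly stochastic matrix, a relaxed form of a permutation matrix. It is a textbook-style illustration, not the paper's implementation; the function name `sinkhorn`, the iteration count, and the toy scores are assumptions.

```python
import math


def sinkhorn(scores, num_iters=50):
    """Approximately project exp(scores) onto doubly stochastic matrices.

    `scores` is a square list of lists. Each iteration normalises every row
    to sum to 1, then every column to sum to 1.
    """
    n = len(scores)
    mat = [[math.exp(value) for value in row] for row in scores]
    for _ in range(num_iters):
        # Row normalisation.
        row_sums = [sum(row) for row in mat]
        mat = [[mat[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Column normalisation.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


# Hypothetical scores for placing 3 shuffled patches into 3 slots.
toy_scores = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.2],
    [0.1, 0.4, 1.8],
]
soft_permutation = sinkhorn(toy_scores)
# Taking the largest entry per row gives a hard slot assignment.
assignment = [row.index(max(row)) for row in soft_permutation]
```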

Ayan Kumar Bhunia, Yongxin Yang, T.M. Hospedales, Tao Xiang, Yi-Zhe Song (2020) Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval, In: Proceedings Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a particular photo instance given a user’s query sketch. Its widespread applicability is however hindered by the fact that drawing a sketch takes time, and most people struggle to draw a complete and faithful sketch. In this paper, we reformulate the conventional FG-SBIR framework to tackle these challenges, with the ultimate goal of retrieving the target photo with the least number of strokes possible. We further propose an on-the-fly design that starts retrieving as soon as the user starts drawing. To accomplish this, we devise a reinforcement learning based cross-modal retrieval framework that directly optimizes rank of the ground-truth photo over a complete sketch drawing episode. Additionally, we introduce a novel reward scheme that circumvents the problems related to irrelevant sketch strokes, and thus provides us with a more consistent rank list during the retrieval. We achieve superior early-retrieval efficiency over state-of-the-art methods and alternative baselines on two publicly available fine-grained sketch retrieval datasets.

Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M Hospedales, Tao Xiang (2020) Incremental Few-Shot Object Detection, In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 13852, IEEE

Existing object detection methods typically rely on the availability of abundant labelled training samples per class and offline model training in a batch mode. These requirements substantially limit their scalability to open-ended accommodation of novel classes with limited labelled training data, both in terms of model accuracy and training efficiency during deployment. We present the first study aiming to go beyond these limitations by considering the Incremental Few-Shot Detection (iFSD) problem setting, where new classes must be registered incrementally (without revisiting base classes) and with few examples. To this end we propose OpeN-ended Centre nEt (ONCE), a detector designed for incrementally learning to detect novel class objects with few examples. This is achieved by an elegant adaptation of the efficient CentreNet detector to the few-shot learning scenario, and meta-learning a class-wise code generator model for registering novel classes. ONCE fully respects the incremental learning paradigm, with novel class registration requiring only a single forward pass of few-shot training samples, and no access to base classes - thus making it suitable for deployment on embedded devices, etc. Extensive experiments conducted on both the standard object detection (COCO, PASCAL VOC) and fashion landmark detection (DeepFashion2) tasks show the feasibility of iFSD for the first time, opening an interesting and very important line of research.

Yulei Niu, Zhiwu Lu, Ji-Rong Wen, Tao Xiang, Shih-Fu Chang (2019) Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation, In: IEEE Transactions on Image Processing 28(4), pp. 1720-1731

Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from objects and scenes to abstract concepts and 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Specifically, a novel two-branch deep neural network architecture is proposed, which comprises a very deep main network branch and a companion feature fusion network branch designed for fusing the multi-scale features computed from the main branch. The deep model is also made multi-modal by taking noisy user-provided tags as model input to complement the image input. For tackling the second issue, we introduce a label quantity prediction auxiliary task to the main label prediction task to explicitly estimate the optimal label number for a given image. Extensive experiments are carried out on two large-scale image annotation benchmark datasets, and the results show that our method significantly outperforms the state of the art.

K. Li, K. Pang, Yi-Zhe Song, Tao Xiang, T.M. Hospedales, H. Zhang (2019) Toward Deep Universal Sketch Perceptual Grouper, In: IEEE Transactions on Image Processing 28(7), pp. 3219-3231, Institute of Electrical and Electronics Engineers (IEEE)

Human free-hand sketches provide useful data for studying human perceptual grouping, where grouping principles such as the Gestalt laws of grouping are naturally in play during both the perception and sketching stages. In this paper, we make the first attempt to develop a universal sketch perceptual grouper. That is, a grouper that can be applied to sketches of any category created with any drawing style and ability, to group constituent strokes/segments into semantically meaningful object parts. The first obstacle to achieving this goal is the lack of large-scale datasets with grouping annotation. To overcome this, we contribute the largest sketch perceptual grouping dataset to date, consisting of 20,000 unique sketches evenly distributed over 25 object categories. Furthermore, we propose a novel deep perceptual grouping model learned with both generative and discriminative losses. The generative loss improves the generalization ability of the model, while the discriminative loss guarantees both local and global grouping consistency. Extensive experiments demonstrate that the proposed grouper significantly outperforms the state-of-the-art competitors. In addition, we show that our grouper is useful for a number of sketch analysis tasks, including sketch semantic segmentation, synthesis, and fine-grained sketch-based image retrieval.

Feng Liu, Tao Xiang, Timothy M Hospedales, Wankou Yang, Changyin Sun (2020) Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool, In: IEEE Transactions on Pattern Analysis and Machine Intelligence 42(2), pp. 460-474, IEEE

In recent years, visual question answering (VQA) has become topical. The premise of VQA's significance as a benchmark in AI is that both the image and textual question need to be well understood and mutually grounded in order to infer the correct answer. However, current VQA models perhaps 'understand' less than initially hoped, and instead master the easier task of exploiting cues given away in the question and biases in the answer distribution [1]. In this paper we propose the inverse problem of VQA (iVQA). The iVQA task is to generate a question that corresponds to a given image and answer pair. We propose a variational iVQA model that can generate diverse, grammatically correct and content correlated questions that match the given answer. Based on this model, we show that iVQA is an interesting benchmark for visuo-linguistic understanding, and a more challenging alternative to VQA because an iVQA model needs to understand the image better to be successful. As a second contribution, we show how to use iVQA in a novel reinforcement learning framework to diagnose any existing VQA model by way of exposing its belief set: the set of question-answer pairs that the VQA model would predict true for a given image. This provides a completely new window into what VQA models 'believe' about images. We show that existing VQA models have more erroneous beliefs than previously thought, revealing their intrinsic weaknesses. Suggestions are then made on how to address these weaknesses going forward.

Zhong Ji, Biying Cui, Huihui Li, Yu-Gang Jiang, Tao Xiang, Timothy Hospedales, Yanwei Fu (2020) Deep Ranking for Image Zero-Shot Multi-Label Classification, In: IEEE Transactions on Image Processing 29, pp. 6549-6560, IEEE

During the past decade, both multi-label learning and zero-shot learning have attracted huge research attention, and significant progress has been made. Multi-label learning algorithms aim to predict multiple labels given one instance, while most existing zero-shot learning approaches aim to predict a single testing label for each unseen class by transferring knowledge from auxiliary seen classes to target unseen classes. However, relatively little effort has been devoted to predicting multiple labels in the zero-shot setting, which is nevertheless quite a challenging task. In this work, we investigate and formalize a flexible framework consisting of two components, i.e., visual-semantic embedding and zero-shot multi-label prediction. First, we present a deep regression model to project the visual features into the semantic space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors and makes label prediction possible. Then, we formulate the label prediction problem as a pairwise one and employ Ranking SVM to seek the unique multi-label correlations in the embedding space. Furthermore, we provide a transductive multi-label zero-shot prediction approach that exploits the testing data manifold structure. We demonstrate the effectiveness of the proposed approach on three popular multi-label datasets with state-of-the-art performance obtained on both conventional and generalized ZSL settings.

Jiechao Guan, Zhiwu Lu, Tao Xiang, Aoxue Li, An Zhao, Ji-Rong Wen (2020) Zero and Few Shot Learning with Semantic Feature Synthesis and Competitive Learning, In: IEEE Transactions on Pattern Analysis and Machine Intelligence PP, IEEE

Zero-shot learning (ZSL) is made possible by learning a projection function between a feature space and a semantic space (e.g., an attribute space). Key to ZSL is thus to learn a projection that is robust against the often large domain gap between the seen and unseen class domains. In this work, this is achieved by unseen class data synthesis and robust projection function learning. Specifically, a novel semantic data synthesis strategy is proposed, by which semantic class prototypes (e.g., attribute vectors) are used to simply perturb seen class data for generating unseen class ones. As in any data synthesis/hallucination approach, there are ambiguities and uncertainties about how well the synthesised data can capture the targeted unseen class data distribution. To cope with this, the second contribution of this work is a novel projection learning model termed competitive bidirectional projection learning (BPL) designed to best utilise the ambiguous synthesised data. As a third contribution, we show that the proposed ZSL model can be easily extended to few-shot learning (FSL) by again exploiting semantic (class prototype guided) feature synthesis and competitive BPL. Extensive experiments show that our model achieves state-of-the-art results on both problems.

Peng Xu, Kun Liu, TAO XIANG, Timothy M. Hospedales, Zhanyu Ma, Jun Guo, Yi-Zhe Song (2020)Fine-Grained Instance-Level Sketch-Based Video Retrieval, In: IEEE Transactions on Circuits and Systems for Video Technologypp. 1-1 IEEE

Existing sketch-analysis work studies sketches depicting static objects or scenes. In this work, we propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR), where a sketch sequence is used as a query to retrieve a specific target video instance. Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level. We contribute the first FG-SBVR dataset with rich annotations. We then introduce a novel multi-stream multi-modality deep network to perform FG-SBVR under both strong and weakly supervised settings. The key component of the network is a relation module, designed to prevent model overfitting given scarce training data. We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.

Aoxue Li, Zhiwu Lu, Jiechao Guan, Tao Xiang, Liwei Wang, Ji-Rong Wen (2020)Transferrable Feature and Projection Learning with Class Hierarchy for Zero-Shot Learning, In: International Journal of Computer Vision128(12)pp. 2810-2827

Zero-shot learning (ZSL) aims to transfer knowledge from seen classes to unseen ones so that the latter can be recognised without any training samples. This is made possible by learning a projection function between a feature space and a semantic space (e.g. attribute space). Considering the seen and unseen classes as two domains, a big domain gap often exists which challenges ZSL. In this work, we propose a novel inductive ZSL model that leverages superclasses as the bridge between seen and unseen classes to narrow the domain gap. Specifically, we first build a class hierarchy of multiple superclass layers and a single class layer, where the superclasses are automatically generated by data-driven clustering over the semantic representations of all seen and unseen class names. We then exploit the superclasses from the class hierarchy to tackle the domain gap challenge in two aspects: deep feature learning and projection function learning. First, to narrow the domain gap in the feature space, we define a recurrent neural network over superclasses and then plug it into a convolutional neural network for enforcing the superclass hierarchy. Second, to further learn a transferrable projection function for ZSL, a novel projection function learning method is proposed by exploiting the superclasses to align the two domains. Importantly, our transferrable feature and projection learning methods can be easily extended to a closely related task—few-shot learning (FSL). Extensive experiments show that the proposed model outperforms the state-of-the-art alternatives in both ZSL and FSL tasks.
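
The projection-based inference that this family of ZSL methods relies on can be sketched as follows. It only illustrates the generic idea of projecting a feature into the semantic space and picking the nearest unseen-class prototype. The superclass hierarchy and the learning of the projection are not modelled, and the linear projection, cosine similarity and toy values are assumptions made for the example.

```python
import math


def project(feature, weights):
    """Apply a linear projection; `weights` holds one row per semantic dimension."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def classify_zero_shot(feature, weights, prototypes):
    """Return the unseen class whose semantic prototype is closest to the projected feature."""
    embedded = project(feature, weights)
    return max(prototypes, key=lambda name: cosine(embedded, prototypes[name]))


# Hypothetical projection (3-D feature to 2-D semantic space) and prototypes.
toy_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_prototypes = {"zebra": [1.0, 0.0], "whale": [0.0, 1.0]}
toy_feature = [0.9, 0.1, 5.0]
```

The domain gap discussed in the abstract shows up here as a projection fitted on seen classes that places unseen-class features far from their true prototypes, which is why the quality of `weights` matters more than the nearest-prototype rule itself.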

Jifei Song, Yongxin Yang, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales (2019)Generalizable Person Re-identification by Domain-Invariant Mapping Network, In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019)pp. 719-728 Institute of Electrical and Electronics Engineers (IEEE)

We aim to learn a domain generalizable person re-identification (ReID) model. When such a model is trained on a set of source domains (ReID datasets collected from different camera networks), it can be directly applied to any new unseen dataset for effective ReID without any model updating. Despite its practical value in real-world deployments, generalizable ReID has seldom been studied. In this work, a novel deep ReID model termed Domain-Invariant Mapping Network (DIMN) is proposed. DIMN is designed to learn a mapping between a person image and its identity classifier, i.e., it produces a classifier using a single shot. To make the model domain-invariant, we follow a meta-learning pipeline and sample a subset of source domain training tasks during each training episode. However, the model is significantly different from conventional meta-learning methods in that: (1) no model updating is required for the target domain, (2) different training tasks share a memory bank for maintaining both scalability and discrimination ability, and (3) it can be used to match an arbitrary number of identities in a target domain. Extensive experiments on a newly proposed large-scale ReID domain generalization benchmark show that our DIMN significantly outperforms alternative domain generalization or meta-learning methods.

Ayan Kumar Bhunia, Ayan Das, Umar Riaz Muhammad, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yulia Gryaditskaya, Yi-Zhe Song (2020)Pixelor: A Competitive Sketching AI Agent. So you think you can sketch?, In: ACM Transactions on Graphics39(6) Association for Computing Machinery (ACM)

We present the first competitive drawing agent, Pixelor, that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is the winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly as or faster than a human competitor. The key to victory is for the agent to learn the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator. Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors’ strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance. The code and the dataset are available at http://sketchx.ai/pixelor.

Peng Xu, Yongye Huang, Tongtong Yuan, Tao Xiang, Timothy M Hospedales, Yi-Zhe Song, Liang Wang (2020)On Learning Semantic Representations for Large-Scale Abstract Sketches, In: IEEE Transactions on Circuits and Systems for Video Technologypp. 1-1 IEEE

In this paper, we focus on learning semantic representations for large-scale highly abstract sketches that were produced by a practical sketch-based application rather than the excessively well-drawn sketches obtained by crowd-sourcing. We propose a dual-branch CNN-RNN network architecture to represent sketches, which simultaneously encodes both the static and temporal patterns of sketch strokes. Based on this architecture, we further explore learning the sketch-oriented semantic representations in two practical settings, i.e., hashing retrieval and zero-shot recognition on million-scale highly abstract sketches produced by practical online interactions. Specifically, we use our dual-branch architecture as a universal representation framework to design two sketch-specific deep models: (i) We propose a deep hashing model for sketch retrieval, where a novel hashing loss is specifically designed to further accommodate both the abstract and messy traits of sketches. (ii) We propose a deep embedding model for sketch zero-shot recognition, via collecting a large-scale edge-map dataset and proposing to extract a set of semantic vectors from edge-maps as the semantic knowledge for sketch zero-shot domain alignment. Both deep models are evaluated by comprehensive experiments on million-scale abstract sketches produced by a global online game QuickDraw and outperform state-of-the-art competitors.
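
For readers unfamiliar with hashing retrieval, the lookup step can be sketched as below. This shows only the generic retrieval side, ranking database items by Hamming distance between binary codes. The deep network that would produce the codes and the paper's hashing loss are not reproduced, and the 4-bit codes and item names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


def retrieve(query_code, database, top_k=3):
    """Return the names of the top_k database items closest to the query code."""
    ranked = sorted(database.items(), key=lambda item: hamming(query_code, item[1]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 4-bit codes; real systems would use much longer codes.
toy_database = {"cat_1": 0b1010, "cat_2": 0b1011, "bus_1": 0b0101}
```

Comparing codes with XOR and a bit count is what makes this kind of retrieval cheap enough to be attractive at the million-sketch scale the abstract describes.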

Aneeshan Sain, Ayan Kumar Bhunia, Yongxin Yang, Tao Xiang, Yi-Zhe Song (2020)Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval, In: Proceedings of The 31st British Machine Vision Virtual Conference (BMVC 2020)pp. 1-14 British Machine Vision Association

Sketch as an image search query is an ideal alternative to text in capturing fine-grained visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have demonstrated the importance of tackling the unique traits of sketches as opposed to photos, e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a further trait of sketches that has been overlooked to date, that is, they are hierarchical in terms of the levels of detail – a person typically sketches up to various extents of detail to depict an object. This hierarchical structure is often visually distinct. In this paper, we design a novel network that is capable of cultivating sketch-specific hierarchies and exploiting them to match sketch with photo at corresponding hierarchical levels. In particular, features from a sketch and a photo are enriched using cross-modal co-attention, coupled with hierarchical node fusion at every level to form a better embedding space to conduct retrieval. Experiments on common benchmarks show that our method outperforms state-of-the-art methods by a significant margin.

Zhiyuan Shi, Yongxin Yang, Timothy M Hospedales, Tao Xiang (2014)Weakly supervised learning of objects, attributes and their associations, In: Lecture Notes in Computer Science8690 L(PART 2)pp. 472-487 Springer Verlag

When humans describe images they tend to use combinations of nouns and adjectives, corresponding to objects and their associated attributes respectively. To generate such a description automatically, one needs to model objects, attributes and their associations. Conventional methods require strong annotation of object and attribute locations, making them less scalable. In this paper, we model object-attribute associations from weakly labelled images, such as those widely available on media sharing sites (e.g. Flickr), where only image-level labels (either objects or attributes) are given, without their locations and associations. This is achieved by introducing a novel weakly supervised non-parametric Bayesian model. Once learned, given a new image, our model can describe the image, including objects, attributes and their associations, as well as their locations and segmentation. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model performs on par with strongly supervised models on tasks such as image description and retrieval based on object-attribute associations. © 2014 Springer International Publishing.

Qian Yu, Jifei Song, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales (2020)Fine-Grained Instance-Level Sketch-Based Image Retrieval, In: International Journal of Computer Vision Springer

The problem of fine-grained sketch-based image retrieval (FG-SBIR) is defined and investigated in this paper. In FG-SBIR, free-hand human sketch images are used as queries to retrieve photo images containing the same object instances. It is thus a cross-domain (sketch to photo) instance-level retrieval task. It is an extremely challenging problem because (i) visual comparisons and matching need to be executed under a large domain gap, i.e., from black and white line drawing sketches to colour photos; (ii) it requires capturing the fine-grained (dis)similarities of sketches and photo images while free-hand sketches drawn by different people present different levels of deformation and expressive interpretation; and (iii) annotated cross-domain fine-grained SBIR datasets are scarce, challenging many state-of-the-art machine learning techniques, particularly those based on deep learning. In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based object instance retrieval application. Specifically, a new large-scale FG-SBIR database is introduced which is carefully designed to reflect real-world application scenarios. A deep cross-domain matching model is then formulated to address the intrinsic drawing style variability and large domain gap issues, and to capture instance-level discriminative features. It distinguishes itself by a carefully designed attention module. Extensive experiments on the new dataset demonstrate the effectiveness of the proposed model and validate the need for a rigorous definition of the FG-SBIR problem and for collecting suitable datasets.

K. Li, K. Pang, Yi-Zhe Song, T.M. Hospedales, T. Xiang, H. Zhang (2017)Synergistic Instance-Level Subspace Alignment for Fine-Grained Sketch-Based Image Retrieval, In: IEEE Transactions on Image Processing26(12)pp. 5908-5921 Institute of Electrical and Electronics Engineers Inc.

We study the problem of fine-grained sketch-based image retrieval. By performing instance-level (rather than category-level) retrieval, it embodies a timely and practical application, particularly with the ubiquitous availability of touchscreens. Three factors contribute to the challenging nature of the problem: 1) free-hand sketches are inherently abstract and iconic, making visual comparisons with photos difficult; 2) sketches and photos are in two different visual domains, i.e., black and white lines versus color pixels; and 3) fine-grained distinctions are especially challenging when executed across domain and abstraction-level. To address these challenges, we propose to bridge the image-sketch gap both at the high level via parts and attributes, as well as at the low level via introducing a new domain alignment method. More specifically, first, we contribute a data set with 304 photos and 912 sketches, where each sketch and image is annotated with its semantic parts and associated part-level attributes. With the help of this data set, second, we investigate how strongly supervised deformable part-based models can be learned that subsequently enable automatic detection of part-level attributes, and provide pose-aligned sketch-image comparisons. To reduce the sketch-image gap when comparing low-level features, third, we also propose a novel method for instance-level domain-alignment that exploits both subspace and instance-level cues to better align the domains. Finally, fourth, these are combined in a matching framework integrating aligned low-level features, mid-level geometric structure, and high-level semantic attributes. Extensive experiments conducted on our new data set demonstrate effectiveness of the proposed method.

Q. Yu, F. Liu, Yi-Zhe Song, T. Xiang, T.M. Hospedales, C.C. Loy (2017)Sketch me that shoe, In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016)2016-Dpp. 799-807 IEEE Computer Society

We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images. This is an extremely challenging task because (i) visual comparisons not only need to be fine-grained but also executed cross-domain, (ii) free-hand (finger) sketches are highly abstract, making fine-grained matching harder, and most importantly (iii) annotated cross-domain sketch-photo datasets required for training are scarce, challenging many state-of-the-art machine learning techniques. In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based image retrieval application. We introduce a new database of 1,432 sketch-photo pairs from two categories with 32,000 fine-grained triplet ranking annotations. We then develop a deep triplet-ranking model for instance-level SBIR with a novel data augmentation and staged pre-training strategy to alleviate the issue of insufficient fine-grained training data. Extensive experiments are carried out to contribute a variety of insights into the challenges of data sufficiency and over-fitting avoidance when training deep networks for fine-grained cross-domain ranking tasks.
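
The triplet ranking objective referred to above can be sketched in a few lines. This is the standard margin-based triplet loss applied to embeddings that are assumed to be precomputed, not the paper's network or training schedule. The Euclidean distance, the margin value and the toy embeddings are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_ranking_loss(sketch, pos_photo, neg_photo, margin=0.3):
    """Hinge loss on a (sketch, matching photo, non-matching photo) triplet.

    The loss is zero once the matching photo is closer to the sketch
    than the non-matching photo by at least `margin`.
    """
    return max(0.0, margin + euclidean(sketch, pos_photo) - euclidean(sketch, neg_photo))


# Hypothetical 2-D embeddings of one sketch and two candidate photos.
toy_sketch = [0.0, 0.0]
toy_match = [0.0, 1.0]      # distance 1 from the sketch
toy_non_match = [3.0, 4.0]  # distance 5 from the sketch
```

Each triplet annotation of the kind described in the abstract supplies one such (sketch, closer photo, farther photo) constraint, so a ranking loss can make use of relative judgements without needing absolute similarity scores.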

K. Li, K. Pang, J. Song, Yi-Zhe Song, T. Xiang, T.M. Hospedales, H. Zhang (2018)Universal sketch perceptual grouping, In: Lecture notes in computer science: Proceedings of the European Conference on Computer Vision (ECCV 2018)11212pp. 593-609 Springer

In this work we aim to develop a universal sketch grouper. That is, a grouper that can be applied to sketches of any category in any domain to group constituent strokes/segments into semantically meaningful object parts. The first obstacle to this goal is the lack of large-scale datasets with grouping annotation. To overcome this, we contribute the largest sketch perceptual grouping (SPG) dataset to date, consisting of 20,000 unique sketches evenly distributed over 25 object categories. Furthermore, we propose a novel deep universal perceptual grouping model. The model is learned with both generative and discriminative losses. The generative losses improve the generalisation ability of the model to unseen object categories and datasets. The discriminative losses include a local grouping loss and a novel global grouping loss to enforce global grouping consistency. We show that the proposed model significantly outperforms the state-of-the-art groupers. Further, we show that our grouper is useful for a number of sketch analysis tasks including sketch synthesis and fine-grained sketch-based image retrieval (FG-SBIR). © Springer Nature Switzerland AG 2018.

C. Zou, Q. Yu, R. Du, H. Mo, Yi-Zhe Song, T. Xiang, C. Gao, B. Chen, H. Zhang (2018)Sketchyscene: Richly-annotated scene sketches, In: Lecture notes in computer science: Proceedings of the European Conference on Computer Vision (ECCV 2018)11219pp. 438-454 Springer Verlag

We contribute the first large-scale dataset of scene sketches, SketchyScene, with the goal of advancing research on sketch understanding at both the object and scene level. The dataset is created through a novel and carefully designed crowdsourcing pipeline, enabling users to efficiently generate large quantities of realistic and diverse scene sketches. SketchyScene contains more than 29,000 scene-level sketches, 7,000+ pairs of scene templates and photos, and 11,000+ object sketches. All objects in the scene sketches have ground-truth semantic and instance masks. The dataset is also highly scalable and extensible, easily allowing augmenting and/or changing scene composition. We demonstrate the potential impact of SketchyScene by training new computational models for semantic segmentation of scene sketches and showing how the new dataset enables several applications including image retrieval, sketch colorization, editing, and captioning, etc. The dataset and code can be found at https://github.com/SketchyScene/SketchyScene.

J. Song, Q. Yu, Yi-Zhe Song, T. Xiang, T.M. Hospedales (2018)Deep Spatial-Semantic Attention for Fine-Grained Sketch-Based Image Retrieval, In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV)pp. 5552-5561 Institute of Electrical and Electronics Engineers (IEEE)

Human sketches are unique in being able to capture both the spatial topology of a visual object and its subtle appearance details. Fine-grained sketch-based image retrieval (FG-SBIR) leverages such fine-grained characteristics of sketches to conduct instance-level retrieval of photos. Nevertheless, human sketches are often highly abstract and iconic, resulting in severe misalignments with candidate photos which in turn make subtle visual detail matching difficult. Existing FG-SBIR approaches focus only on coarse holistic matching via deep cross-domain representation learning, and do not explicitly account for fine-grained details and their spatial context. In this paper, a novel deep FG-SBIR model is proposed which differs significantly from the existing models in that: (1) it is spatially aware, achieved by introducing an attention module that is sensitive to the spatial position of visual details; (2) it combines coarse and fine semantic information via a shortcut connection fusion block; and (3) it models feature correlation and is robust to misalignments between the extracted features across the two domains by introducing a novel higher-order learnable energy function (HOLEF) based loss. Extensive experiments show that the proposed deep spatial-semantic attention model significantly outperforms the state-of-the-art.

Y. Qi, J. Guo, Y. Li, H. Zhang, T. Xiang, Yi-Zhe Song, Z.-H. Tan (2014)Perceptual grouping via untangling Gestalt principles, In: Proceedings of the 2013 Visual Communications and Image Processing (VCIP 2013)pp. 1-6 Institute of Electrical and Electronics Engineers (IEEE)

Gestalt principles, a set of conjoining rules derived from human visual studies, have been known to play an important role in computer vision. Many applications such as image segmentation, contour grouping and scene understanding often rely on such rules to work. However, the problem of Gestalt confliction, i.e., the relative importance of each rule compared with another, remains unsolved. In this paper, we investigate the problem of perceptual grouping by quantifying the confliction among three commonly used rules: similarity, continuity and proximity. More specifically, we propose to quantify the importance of Gestalt rules by solving a learning to rank problem, and formulate a multi-label graph-cuts algorithm to group image primitives while taking into account the learned Gestalt confliction. Our experiment results confirm the existence of Gestalt confliction in perceptual grouping and demonstrate an improved performance when such a confliction is accounted for via the proposed grouping algorithm. Finally, a novel cross domain image classification method is proposed by exploiting perceptual grouping as representation.

Y. Qi, J. Guo, Yi-Zhe Song, T. Xiang, H. Zhang, Z.-H. Tan (2015)Im2Sketch: Sketch generation by unconflicted perceptual grouping, In: Neurocomputing165pp. 338-349 Elsevier

Effectively solving the problem of sketch generation, which aims to produce human-drawing-like sketches from real photographs, opens the door for many vision applications such as sketch-based image retrieval and non-photorealistic rendering. In this paper, we approach automatic sketch generation from a human visual perception perspective. Instead of gathering insights from photographs, for the first time, we extract information from a large pool of human sketches. In particular, we study how multiple Gestalt rules can be encapsulated into a unified perceptual grouping framework for sketch generation. We further show that by solving the problem of Gestalt confliction, i.e., encoding the relative importance of each rule, sketches more similar to human-made ones can be generated. For that, we release a manually labeled sketch dataset of 96 object categories and 7680 sketches. A novel evaluation framework is proposed to quantify human likeness of machine-generated sketches by examining how well they can be classified using models trained from human data. Finally, we demonstrate the superiority of our sketches under the practical application of sketch-based image retrieval. © 2015 Elsevier B.V.

Qian Yu, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales (2016)Sketch-a-Net: A Deep Neural Network that Beats Humans, In: International Journal of Computer Vision122(3)pp. 411-425 Springer New York LLC

We propose a deep learning approach to free-hand sketch recognition that achieves state-of-the-art performance, significantly surpassing that of humans. Our superior performance is a result of modelling and exploiting the unique characteristics of free-hand sketches, i.e., consisting of an ordered set of strokes but lacking visual cues such as colour and texture, being highly iconic and abstract, and exhibiting extremely large appearance variations due to different levels of abstraction and deformation. Specifically, our deep neural network, termed Sketch-a-Net, has the following novel components: (i) We propose a network architecture designed for sketch rather than natural photo statistics. (ii) Two novel data augmentation strategies are developed which exploit the unique sketch-domain properties to modify and synthesise sketch training data at multiple abstraction levels. Based on this idea we are able to both significantly increase the volume and diversity of sketches for training, and address the challenge of varying levels of sketching detail commonplace in free-hand sketches. (iii) We explore different network ensemble fusion strategies, including a re-purposed joint Bayesian scheme, to further improve recognition performance. We show that state-of-the-art deep networks specifically engineered for photos of natural objects fail to perform well on sketch recognition, regardless of whether they are trained using photos or sketches. Furthermore, through visualising the learned filters, we offer useful insights into where the superior performance of our network comes from. © 2016, Springer Science+Business Media New York.
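
One way to picture synthesising training sketches at multiple abstraction levels from an ordered stroke set is sketched below. It is an illustration only and should not be read as the paper's augmentation procedure: it assumes that later strokes tend to carry finer detail and simply truncates the stroke sequence. The function name, the keep fractions and the toy strokes are hypothetical.

```python
import math


def abstraction_levels(strokes, keep_fractions=(1.0, 0.75, 0.5)):
    """Return progressively more abstract copies of a sketch.

    `strokes` is an ordered list of strokes (each stroke can be any object,
    e.g. a list of points). For each fraction, only the earliest strokes are
    kept, under the assumption that they capture the coarse shape.
    """
    versions = []
    for fraction in keep_fractions:
        count = max(1, math.ceil(len(strokes) * fraction))
        versions.append(strokes[:count])
    return versions


# Hypothetical sketch with four strokes, named rather than drawn.
toy_strokes = ["outline", "legs", "eyes", "whiskers"]
```

Each truncated copy keeps the label of the original sketch, so a single drawing yields several training examples at different levels of detail.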

Y. Qi, J. Guo, Y. Li, H. Zhang, T. Xiang, Yi-Zhe Song (2014)Sketching by perceptual grouping, In: Proceedings of the 2013 IEEE International Conference on Image Processing (ICIP 2013)pp. 270-274 Institute of Electrical and Electronics Engineers (IEEE)

Sketching has been used to render the visual world since prehistoric times, and has become ubiquitous with the increasing availability of touchscreens on portable devices. However, how to automatically map images to sketches, a problem that has profound implications for applications such as sketch-based image retrieval, still remains open. In this paper, we propose a novel method that draws a sketch automatically from a single natural image. Sketch extraction is posed within a unified contour grouping framework, where perceptual grouping is first used to form contour segment groups, followed by a group-based contour simplification method that generates the final sketches. In our experiments, for the first time we pose sketch evaluation as a sketch-based object recognition problem, and the results validate the effectiveness of our system over state-of-the-art alternatives.

Q. Yu, Y. Yang, F. Liu, Yi-Zhe Song, T. Xiang, T.M. Hospedales (2017)Sketch-a-Net: A Deep Neural Network that Beats Humans, In: International Journal of Computer Vision122(3)pp. 411-425 Springer

K. Pang, D. Li, J. Song, Yi-Zhe Song, T. Xiang, T.M. Hospedales (2018)Deep factorised inverse-sketching, In: Proceedings of the 15th European Conference on Computer Vision (ECCV 2018)11219pp. 37-54 Springer Verlag

Modelling human free-hand sketches has become topical recently, driven by practical applications such as fine-grained sketch based image retrieval (FG-SBIR). Sketches are clearly related to photo edge-maps, but a human free-hand sketch of a photo is not simply a clean rendering of that photo’s edge map. Instead there is a fundamental process of abstraction and iconic rendering, where overall geometry is warped and salient details are selectively included. In this paper we study this sketching process and attempt to invert it. We model this inversion by translating iconic free-hand sketches to contours that resemble more geometrically realistic projections of object boundaries, and separately factorise out the salient added details. This factorised re-representation makes it easier to match a free-hand sketch to a photo instance of an object. Specifically, we propose a novel unsupervised image style transfer model based on enforcing a cyclic embedding consistency constraint. A deep FG-SBIR model is then formulated to accommodate complementary discriminative detail from each factorised sketch for better matching with the corresponding photo. Our method is evaluated both qualitatively and quantitatively to demonstrate its superiority over a number of state-of-the-art alternatives for style transfer and FG-SBIR.

P. Xu, Q. Yin, Y. Huang, Yi-Zhe Song, Z. Ma, L. Wang, T. Xiang, W.B. Kleijn, J. Guo (2018)Cross-modal subspace learning for fine-grained sketch-based image retrieval, In: Neurocomputing278pp. 75-86 Elsevier

Sketch-based image retrieval (SBIR) is challenging due to the inherent domain gap between sketch and photo. Compared with pixel-perfect depictions in photos, sketches are highly abstract, iconic renderings of the real world. Therefore, matching sketch and photo directly using low-level visual cues is insufficient, since a common low-level subspace that traverses semantically across the two modalities is non-trivial to establish. Most existing SBIR studies do not directly tackle this cross-modal problem. This naturally motivates us to explore the effectiveness of cross-modal retrieval methods in SBIR, which have been applied successfully in image-text matching. In this paper, we introduce and compare a series of state-of-the-art cross-modal subspace learning methods and benchmark them on two recently released fine-grained SBIR datasets. Through thorough examination of the experimental results, we demonstrate that subspace learning can effectively model the sketch-photo domain gap. In addition we draw a few key insights to drive future research. © 2017 Elsevier B.V.

Y. Qi, Yi-Zhe Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, J. Guo (2015) Making better use of edges via perceptual grouping, In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015) pp. 1856-1865 IEEE Computer Society

We propose a perceptual grouping framework that organizes image edges into meaningful structures and demonstrate its usefulness on various computer vision tasks. Our grouper formulates edge grouping as a graph partition problem, where a learning to rank method is developed to encode probabilities of candidate edge pairs. In particular, RankSVM is employed for the first time to combine multiple Gestalt principles as cues for edge grouping. Afterwards, an edge grouping based object proposal measure is introduced that yields proposals comparable to state-of-the-art alternatives. We further show how human-like sketches can be generated from edge groupings and consequently used to deliver state-of-the-art sketch-based image retrieval performance. Last but not least, we tackle the problem of freehand human sketch segmentation by utilizing the proposed grouper to cluster strokes into semantic object parts.

J. Song, Yi-Zhe Song, T. Xiang, T. Hospedales, X. Ruan (2016) Deep Multi-task attribute-driven ranking for fine-grained sketch-based image retrieval, In: Proceedings of the 27th British Machine Vision Conference (BMVC) pp. 132.1-132.11 BMVA Press

Fine-grained sketch-based image retrieval (SBIR) aims to go beyond conventional SBIR to perform instance-level cross-domain retrieval: finding the specific photo that matches an input sketch. Existing methods focus on designing/learning good features for cross-domain matching and/or learning cross-domain matching functions. However, they neglect the semantic aspect of retrieval, i.e., what meaningful object properties does a user try to encode in her/his sketch? We propose a fine-grained SBIR model that exploits semantic attributes and deep feature learning in a complementary way. Specifically, we perform multi-task deep learning with three objectives: retrieval by fine-grained ranking on a learned representation, attribute prediction, and attribute-level ranking. Simultaneously predicting semantic attributes and using such predictions in the ranking procedure helps retrieval results to be more semantically relevant. Importantly, the introduction of semantic attribute learning in the model allows for the elimination of the otherwise prohibitive cost of human annotations required for training a fine-grained deep ranking model. Experimental results demonstrate that our method outperforms the state-of-the-art on challenging fine-grained SBIR benchmarks while requiring less annotation.

J. Song, K. Pang, Yi-Zhe Song, T. Xiang, T.M. Hospedales (2019) Learning to Sketch with Shortcut Cycle Consistency, In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 801-810 IEEE Computer Society

To see is to sketch - free-hand sketching naturally builds ties between human and machine vision. In this paper, we present a novel approach for translating an object photo to a sketch, mimicking the human sketching process. This is an extremely challenging task because the photo and sketch domains differ significantly. Furthermore, human sketches exhibit various levels of sophistication and abstraction even when depicting the same object instance in a reference photo. This means that even if photo-sketch pairs are available, they only provide a weak supervision signal to learn a translation model. Compared with existing supervised approaches that solve the problem of D(E(photo)) → sketch, where E(·) and D(·) denote encoder and decoder respectively, we take advantage of the inverse problem (e.g., D(E(sketch)) → photo), and combine it with the unsupervised learning tasks of within-domain reconstruction, all within a multi-task learning framework. Compared with existing unsupervised approaches based on cycle consistency (i.e., D(E(D(E(photo)))) → photo), we introduce a shortcut consistency enforced at the encoder bottleneck (e.g., D(E(photo)) → photo) to exploit the additional self-supervision. Both qualitative and quantitative results show that the proposed model is superior to a number of state-of-the-art alternatives. We also show that the synthetic sketches can be used to train a better fine-grained sketch-based image retrieval (FG-SBIR) model, effectively alleviating the problem of sketch data scarcity.
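As a rough illustration of how the mappings named in this abstract differ, the following minimal sketch computes each one as a reconstruction error on toy vectors. It is not the authors' implementation: the domain-specific encoder and decoder arguments, the mean squared error, and the toy `double`/`halve` functions are all assumptions made for illustration, whereas the abstract writes a generic E(·) and D(·).

```python
from typing import Callable, Dict, List

Vec = List[float]
Fn = Callable[[Vec], Vec]


def mse(a: Vec, b: Vec) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def consistency_losses(photo: Vec, sketch: Vec,
                       enc_photo: Fn, enc_sketch: Fn,
                       dec_photo: Fn, dec_sketch: Fn) -> Dict[str, float]:
    """Evaluate the mappings listed in the abstract (names are hypothetical)."""
    z_photo = enc_photo(photo)     # bottleneck code of the photo
    z_sketch = enc_sketch(sketch)  # bottleneck code of the sketch
    return {
        # Supervised translation: D(E(photo)) -> sketch
        "photo_to_sketch": mse(dec_sketch(z_photo), sketch),
        # Inverse problem: D(E(sketch)) -> photo
        "sketch_to_photo": mse(dec_photo(z_sketch), photo),
        # Shortcut consistency at the bottleneck: D(E(photo)) -> photo
        "shortcut_photo": mse(dec_photo(z_photo), photo),
        "shortcut_sketch": mse(dec_sketch(z_sketch), sketch),
        # Full cycle consistency: D(E(D(E(photo)))) -> photo
        "full_cycle_photo": mse(
            dec_photo(enc_sketch(dec_sketch(z_photo))), photo),
    }


# Toy stand-ins (assumption): encoders double each value, decoders halve it.
def double(v: Vec) -> Vec:
    return [2.0 * x for x in v]


def halve(v: Vec) -> Vec:
    return [0.5 * x for x in v]


losses = consistency_losses([1.0, 2.0], [0.0, 0.0],
                            double, double, halve, halve)
```

In a multi-task setting these terms would typically be summed with weights; the weighting is not stated in the abstract and is left out here.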

Kaiyue Pang, Ke Li, Yongxin Yang, Honggang Zhang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song (2019) Generalising Fine-Grained Sketch-Based Image Retrieval, In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) pp. 677-686 Institute of Electrical and Electronics Engineers (IEEE)

Fine-grained sketch-based image retrieval (FG-SBIR) addresses matching a specific photo instance using a free-hand sketch as the query modality. Existing models aim to learn an embedding space in which sketch and photo can be directly compared. While successful, they require instance-level pairing within each coarse-grained category as annotated training data. Since the learned embedding space is domain-specific, these models do not generalise well across categories. This limits the practical applicability of FG-SBIR. In this paper, we identify cross-category generalisation for FG-SBIR as a domain generalisation problem, and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic via embedding the novel sketch in the manifold and updating the representation and retrieval function accordingly. Experiments on the two largest FG-SBIR datasets, Sketchy and QMUL-Shoe-V2, demonstrate the efficacy of our approach in enabling cross-category generalisation of FG-SBIR.

C. Hu, D. Li, Yi-Zhe Song, T. Xiang, T.M. Hospedales (2019) Sketch-a-Classifier: Sketch-Based Photo Classifier Generation, In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 9136-9144 IEEE Computer Society

Contemporary deep learning techniques have made image recognition a reasonably reliable technology. However, training effective photo classifiers typically takes numerous examples, which limits image recognition's scalability and applicability to scenarios where images may not be available. This has motivated investigation into zero-shot learning, which addresses the issue via knowledge transfer from other modalities such as text. In this paper we investigate an alternative approach of synthesizing image classifiers: almost directly from a user's imagination, via freehand sketch. This approach doesn't require the category to be nameable or describable via attributes as per zero-shot learning. We achieve this via training a model regression network to map from free-hand sketch space to the space of photo classifiers. It turns out that this mapping can be learned in a category-agnostic way, allowing photo classifiers for new categories to be synthesized by a user with no need for annotated training photos. We also demonstrate that this modality of classifier generation can be used to enhance the granularity of an existing photo classifier, or as a complement to name-based zero-shot learning.

Y. Qi, W.-S. Zheng, T. Xiang, Yi-Zhe Song, H. Zhang, J. Guo (2014) One-shot learning of sketch categories with co-regularized sparse coding, In: Lecture Notes in Computer Science - proceedings of the 10th International Symposium on Visual Computing (ISVC 2014) Part II 8888 pp. 74-84 Springer Verlag

Categorizing free-hand human sketches has profound implications in applications such as human computer interaction and image retrieval. The task is non-trivial due to the iconic nature of sketches, signified by large variances in both appearance and structure when compared with photographs. Prior works often utilize off-the-shelf low-level features and assume the availability of a large training set, rendering them sensitive towards abstraction and less scalable to new categories. To overcome this limitation, we propose a transfer learning framework which enables one-shot learning of sketch categories. The framework is based on a novel co-regularized sparse coding model which exploits common/shareable parts among human sketches of seen categories and transfers them to unseen categories. We contribute a new dataset consisting of 7,760 human segmented sketches from 97 object categories. Extensive experiments reveal that the proposed method can classify unseen sketch categories given just one training sample with a 33.04% accuracy, offering a two-fold improvement over baselines.

Xuelin Qian, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue (2019) Leader-based Multi-Scale Attention Deep Architecture for Person Re-identification, In: IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1-1 IEEE

Person re-identification (re-id) aims to match people across non-overlapping camera views in a public space. This is a challenging problem because the people captured in surveillance videos often wear similar clothing. Consequently, the differences in their appearance are typically subtle and only detectable at particular locations and scales. In this paper, we propose a deep re-id network (MuDeep) that is composed of two novel types of layers – a multi-scale deep learning layer, and a leader-based attention learning layer. Specifically, the former learns deep discriminative feature representations at different scales, while the latter utilizes the information from multiple scales to lead and determine the optimal weightings for each scale. The importance of different spatial locations for extracting discriminative features is learned explicitly via our leader-based attention learning layer. Extensive experiments are carried out to demonstrate that the proposed MuDeep outperforms the state-of-the-art on a number of benchmarks and has a better generalization ability under a domain generalization setting.

U.R. Muhammad, Y. Yang, Yi-Zhe Song, T. Xiang, T.M. Hospedales (2019) Learning Deep Sketch Abstraction, In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 8014-8023 IEEE Computer Society

Human free-hand sketches have been studied in various contexts including sketch recognition, synthesis and fine-grained sketch-based image retrieval (FG-SBIR). A fundamental challenge for sketch analysis is to deal with drastically different human drawing styles, particularly in terms of abstraction level. In this work, we propose the first stroke-level sketch abstraction model based on the insight of sketch abstraction as a process of trading off between the recognizability of a sketch and the number of strokes used to draw it. Concretely, we train a model for abstract sketch generation through reinforcement learning of a stroke removal policy that learns to predict which strokes can be safely removed without affecting recognizability. We show that our abstraction model can be used for various sketch analysis tasks including: (1) modeling stroke saliency and understanding the decision of sketch recognition models, (2) synthesizing sketches of variable abstraction for a given category, or reference object instance in a photo, and (3) training a FG-SBIR model with photos only, bypassing the expensive photo-sketch pair collection step.
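The trade-off described in this abstract, between recognizability and the number of strokes kept, can be illustrated with a small greedy procedure. This is only a sketch under stated assumptions: the paper learns a stroke removal policy with reinforcement learning, whereas the code below swaps in a plain greedy search, and the `recognisability` scorer, the `min_score` threshold, and the toy stroke weights are hypothetical.

```python
from typing import Callable, List, Tuple

Stroke = Tuple[str, int]  # (label, toy saliency weight); hypothetical format


def abstract_sketch(strokes: List[Stroke],
                    recognisability: Callable[[List[Stroke]], int],
                    min_score: int) -> List[Stroke]:
    """Greedily drop strokes while the sketch stays recognisable.

    Greedy stand-in for a learned removal policy: at each step, remove the
    stroke whose absence leaves the highest recognisability score, and stop
    once any further removal would fall below min_score.
    """
    kept = list(strokes)
    while len(kept) > 1:
        # Score the sketch with each single stroke left out.
        scores = [recognisability(kept[:i] + kept[i + 1:])
                  for i in range(len(kept))]
        best = max(range(len(kept)), key=lambda i: scores[i])
        if scores[best] < min_score:
            break  # every possible removal hurts recognisability too much
        del kept[best]
    return kept


def total_saliency(strokes: List[Stroke]) -> int:
    """Toy recogniser (assumption): score is the sum of stroke weights."""
    return sum(weight for _, weight in strokes)


drawing = [("outline", 60), ("wheel", 30), ("shading", 5), ("scribble", 5)]
abstracted = abstract_sketch(drawing, total_saliency, min_score=85)
```

With these toy weights the two low-saliency strokes are removed and the outline and wheel remain, since dropping either of those would push the score below the threshold.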

P. Xu, Y. Huang, T. Yuan, K. Pang, Yi-Zhe Song, T. Xiang, T.M. Hospedales, Z. Ma, J. Guo (2018) SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval, In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 8090-8098 IEEE Computer Society

We propose a deep hashing framework for sketch retrieval that, for the first time, works on a multi-million scale human sketch dataset. Leveraging this large dataset, we explore a few sketch-specific traits that were otherwise under-studied in prior literature. Instead of following the conventional sketch recognition task, we introduce the novel problem of sketch hashing retrieval which is not only more challenging, but also offers a better testbed for large-scale sketch analysis, since: (i) more fine-grained sketch feature learning is required to accommodate the large variations in style and abstraction, and (ii) a compact binary code needs to be learned at the same time to enable efficient retrieval. Key to our network design is the embedding of unique characteristics of human sketch, where (i) a two-branch CNN-RNN architecture is adapted to explore the temporal ordering of strokes, and (ii) a novel hashing loss is specifically designed to accommodate both the temporal and abstract traits of sketches. By working with a 3.8M sketch dataset, we show that state-of-the-art hashing models specifically engineered for static images fail to perform well on temporal sketch data. Our network, on the other hand, not only offers the best retrieval performance on various code sizes, but also yields the best generalization performance under a zero-shot setting and when re-purposed for sketch recognition. Such superior performance effectively demonstrates the benefit of our sketch-specific design. © 2018 IEEE.
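To make the idea of retrieval with compact binary codes concrete, the following minimal sketch binarises real-valued features by sign and ranks a gallery by Hamming distance. It is a generic illustration, not the paper's method: the learned CNN-RNN features and the hashing loss are not reproduced, and the function names and toy feature vectors are assumptions.

```python
from typing import List, Sequence


def binarise(features: Sequence[float]) -> int:
    """Pack the sign bits of a feature vector into an integer hash code."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query_code: int, gallery_codes: Sequence[int],
             top_k: int = 3) -> List[int]:
    """Return gallery indices ordered by increasing Hamming distance."""
    order = sorted(range(len(gallery_codes)),
                   key=lambda i: hamming(query_code, gallery_codes[i]))
    return order[:top_k]


# Toy 3-bit example; the feature values are made up for illustration.
gallery = [binarise([1.0, 1.0, 1.0]),
           binarise([0.2, -0.3, 0.9]),
           binarise([-1.0, -1.0, -1.0])]
query = binarise([0.5, -1.0, 2.0])
ranking = retrieve(query, gallery, top_k=2)
```

Comparing integer codes with XOR and a bit count is what makes this kind of lookup cheap at large scale, which is the efficiency argument the abstract makes for learning compact binary codes.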