When humans describe images they tend to use combinations of nouns and adjectives, corresponding to objects and their associated attributes respectively. To generate such a description automatically, one needs to model objects, attributes and their associations. Conventional methods require strong annotation of object and attribute locations, making them less scalable. In this paper, we model object-attribute associations from weakly labelled images, such as those widely available on media sharing sites (e.g. Flickr), where only image-level labels (either objects or attributes) are given, without their locations and associations. This is achieved by introducing a novel weakly supervised non-parametric Bayesian model. Once learned, given a new image, our model can describe the image, including objects, attributes and their associations, as well as their locations and segmentation. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model performs on par with strongly supervised models on tasks such as image description and retrieval based on object-attribute associations. © 2014 Springer International Publishing.
Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate semantic representations in the form of attributes and more recently, semantic word vectors. However, they have thus far been constrained to the single-label case, in contrast to the growing popularity and importance of more realistic multi-label data. In this paper, for the first time, we investigate and formalise a general framework for multi-label zero-shot learning, addressing the unique challenge therein: how to exploit multi-label correlation at test time with no training data for those classes? In particular, we propose (1) a multi-output deep regression model to project an image into a semantic word space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors; (2) a novel zero-shot learning algorithm for multi-label data that exploits the unique compositionality property of semantic word vector representations; and (3) a transductive learning strategy to enable the regression model learned from seen classes to generalise well to unseen classes. Our zero-shot learning experiments on a number of standard multi-label datasets demonstrate that our method outperforms a variety of baselines. © 2014. The copyright of this document resides with its authors.
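The compositionality idea in the abstract above can be illustrated with a toy sketch. It assumes that an image has already been projected into the word-vector space, that a set of labels is represented by the sum of its word vectors, and that candidate label sets are scored by cosine similarity. The exhaustive subset search, the function names and the `max_labels` parameter are illustrative assumptions, not the authors' algorithm.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compose(vectors):
    """Represent a label set by the element-wise sum of its word vectors (assumption)."""
    return [sum(components) for components in zip(*vectors)]


def predict_label_set(image_vec, word_vecs, max_labels=2):
    """Return the label subset whose composed vector is closest to the image projection.

    Exhaustive search over subsets of size 1..max_labels; hypothetical and only
    practical for small label vocabularies.
    """
    best_subset, best_score = (), -math.inf
    for k in range(1, max_labels + 1):
        for subset in combinations(sorted(word_vecs), k):
            prototype = compose([word_vecs[label] for label in subset])
            score = cosine(image_vec, prototype)
            if score > best_score:
                best_subset, best_score = subset, score
    return set(best_subset)
```

No training images of the unseen labels are needed here: only their word vectors enter the search, which is what makes the setting zero-shot.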
We propose a deep learning approach to free-hand sketch recognition that achieves state-of-the-art performance, significantly surpassing that of humans. Our superior performance is a result of modelling and exploiting the unique characteristics of free-hand sketches, i.e., consisting of an ordered set of strokes but lacking visual cues such as colour and texture, being highly iconic and abstract, and exhibiting extremely large appearance variations due to different levels of abstraction and deformation. Specifically, our deep neural network, termed Sketch-a-Net, has the following novel components: (i) We propose a network architecture designed for sketch rather than natural photo statistics. (ii) Two novel data augmentation strategies are developed which exploit the unique sketch-domain properties to modify and synthesise sketch training data at multiple abstraction levels. Based on this idea we are able to both significantly increase the volume and diversity of sketches for training, and address the challenge of varying levels of sketching detail commonplace in free-hand sketches. (iii) We explore different network ensemble fusion strategies, including a re-purposed joint Bayesian scheme, to further improve recognition performance. We show that state-of-the-art deep networks specifically engineered for photos of natural objects fail to perform well on sketch recognition, regardless of whether they are trained using photos or sketches. Furthermore, through visualising the learned filters, we offer useful insights into where the superior performance of our network comes from. © 2016, Springer Science+Business Media New York.
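One way to picture augmentation "at multiple abstraction levels" is to exploit the stroke ordering mentioned in the abstract above. The sketch below assumes, purely for illustration, that earlier strokes carry the coarse outline and later strokes add detail, so truncating the stroke list yields more abstract variants. The abstract does not state the actual augmentation rule; the function name and the `keep_fractions` parameter are hypothetical.

```python
import math


def abstraction_levels(strokes, keep_fractions=(0.5, 0.75, 1.0)):
    """Return one truncated copy of the sketch per fraction.

    Each variant keeps only the earliest strokes, on the assumption that
    later strokes mostly add fine detail.
    """
    variants = []
    for fraction in keep_fractions:
        # Always keep at least one stroke so a variant is never empty
        # for a non-empty sketch.
        n_keep = max(1, math.ceil(fraction * len(strokes)))
        variants.append(list(strokes[:n_keep]))
    return variants
```

Each variant could then be rendered to an image and added to the training set alongside the original sketch.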
Pang Kaiyue, Li Ke, Yang Yongxin, Zhang Honggang, Hospedales Timothy M., Xiang Tao, Song Yi-Zhe (2019) Generalising Fine-Grained Sketch-Based Image Retrieval, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), pp. 677-686
Institute of Electrical and Electronics Engineers (IEEE)
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of matching a specific photo instance using a free-hand sketch as the query modality. Existing models aim to learn an embedding space in which sketch and photo can be directly compared. While successful, they require instance-level pairing within each coarse-grained category as annotated training data. Since the learned embedding space is domain-specific, these models do not generalise well across categories. This limits the practical applicability of FG-SBIR. In this paper, we identify cross-category generalisation for FG-SBIR as a domain generalisation problem, and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic via embedding the novel sketch in the manifold and updating the representation and retrieval function accordingly. Experiments on the two largest FG-SBIR datasets, Sketchy and QMUL-Shoe-V2, demonstrate the efficacy of our approach in enabling cross-category generalisation of FG-SBIR.
We aim to learn a domain generalizable person reidentification (ReID) model. When such a model is trained on a set of source domains (ReID datasets collected from different camera networks), it can be directly applied to any new unseen dataset for effective ReID without any model updating. Despite its practical value in real-world deployments, generalizable ReID has seldom been studied. In this work, a novel deep ReID model termed Domain-Invariant Mapping Network (DIMN) is proposed. DIMN is designed to learn a mapping between a person image and its identity classifier, i.e., it produces a classifier using a single shot. To make the model domain-invariant, we follow a meta-learning pipeline and sample a subset of source domain training tasks during each training episode. However, the model is significantly different from conventional meta-learning methods in that: (1) no model updating is required for the target domain, (2) different training tasks share a memory bank for maintaining both scalability and discrimination ability, and (3) it can be used to match an arbitrary number of identities in a target domain. Extensive experiments on a newly proposed large-scale ReID domain generalization benchmark show that our DIMN significantly outperforms alternative domain generalization or meta-learning methods.
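The "image to classifier" mapping described in the abstract above can be pictured with a toy sketch. It assumes image features are plain vectors and that the learned mapping is a single linear map, which is a simplification chosen for illustration and not the DIMN architecture. Each gallery image is turned into classifier weights in one shot, and a probe is scored against every generated classifier by a dot product. All names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_classifier(mapping, gallery_feature):
    """Generate identity-classifier weights from a single gallery image feature."""
    return matvec(mapping, gallery_feature)


def match(mapping, probe_feature, gallery):
    """Score a probe against the classifier generated for each gallery identity.

    `gallery` maps identity -> feature vector of one image of that identity.
    Returns the best-scoring identity and the full score dictionary.
    """
    scores = {}
    for identity, feature in gallery.items():
        weights = make_classifier(mapping, feature)
        scores[identity] = sum(w * x for w, x in zip(weights, probe_feature))
    best_identity = max(scores, key=scores.get)
    return best_identity, scores
```

Because the classifiers are produced by a forward pass over gallery images, nothing is updated for the target domain, and the gallery can hold any number of identities.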
Person re-identification (re-id) aims to match people across non-overlapping camera views in a public space. This is a challenging problem because the people captured in surveillance videos often wear similar clothing. Consequently, the differences in their appearance are typically subtle and only detectable at particular locations and scales. In this paper, we propose a deep re-id network (MuDeep) that is composed of two novel types of layers: a multi-scale deep learning layer, and a leader-based attention learning layer. Specifically, the former learns deep discriminative feature representations at different scales, while the latter utilizes the information from multiple scales to lead and determine the optimal weightings for each scale. The importance of different spatial locations for extracting discriminative features is learned explicitly via our leader-based attention learning layer. Extensive experiments are carried out to demonstrate that the proposed MuDeep outperforms the state-of-the-art on a number of benchmarks and has a better generalization ability under a domain generalization setting.
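The idea of determining weightings for each scale can be sketched in a few lines. The example below assumes one feature vector and one scalar relevance score per scale, normalises the scores with a softmax, and fuses the features as a weighted sum. This is a generic attention-style fusion chosen for illustration; how MuDeep actually computes its weightings is not specified in the abstract above, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(scale_features, scale_scores):
    """Fuse per-scale feature vectors using softmax-normalised scale weights.

    Returns the fused feature vector and the weights that were applied.
    """
    weights = softmax(scale_scores)
    dim = len(scale_features[0])
    fused = [
        sum(w * feature[i] for w, feature in zip(weights, scale_features))
        for i in range(dim)
    ]
    return fused, weights
```

With equal scores the fusion reduces to a plain average of the scales; a higher score for one scale shifts the fused feature towards that scale.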
Existing sketch-analysis work studies sketches depicting static objects or scenes. In this work, we propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR), where a sketch sequence is used as a query to retrieve a specific target video instance. Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level. We contribute the first FG-SBVR dataset with rich annotations. We then introduce a novel multi-stream multi-modality deep network to perform FG-SBVR under both strongly and weakly supervised settings. The key component of the network is a relation module, designed to prevent model overfitting given scarce training data. We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
Sketch as an image search query is an ideal alternative to text in capturing fine-grained visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have demonstrated the importance of tackling the unique traits of sketches as opposed to photos, e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a further trait of sketches that has been overlooked to date: they are hierarchical in terms of levels of detail, in that a person typically sketches to varying extents of detail to depict an object. This hierarchical structure is often visually distinct. We design a novel network that is capable of cultivating sketch-specific hierarchies and exploiting them to match a sketch with a photo at corresponding hierarchical levels. In particular, features from a sketch and a photo are enriched using cross-modal co-attention, coupled with hierarchical node fusion at every level, to form a better embedding space in which to conduct retrieval. Experiments on common benchmarks show that our method outperforms the state of the art by a significant margin.
The problem of fine-grained sketch-based image retrieval (FG-SBIR) is defined and investigated in this paper. In FG-SBIR, free-hand human sketch images are used as queries to retrieve photo images containing the same object instances. It is thus a cross-domain (sketch to photo) instance-level retrieval task. It is an extremely challenging problem because (i) visual comparisons and matching need to be executed under a large domain gap, i.e., from black and white line drawing sketches to colour photos; (ii) it requires capturing the fine-grained (dis)similarities of sketches and photo images while free-hand sketches drawn by different people present different levels of deformation and expressive interpretation; and (iii) annotated cross-domain fine-grained SBIR datasets are scarce, challenging many state-of-the-art machine learning techniques, particularly those based on deep learning. In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based object instance retrieval application. Specifically, a new large-scale FG-SBIR database is introduced which is carefully designed to reflect real-world application scenarios. A deep cross-domain matching model is then formulated to address intrinsic drawing style variability and the large domain gap, and to capture instance-level discriminative features. It distinguishes itself by a carefully designed attention module. Extensive experiments on the new dataset demonstrate the effectiveness of the proposed model and validate the need for a rigorous definition of the FG-SBIR problem and for collecting suitable datasets.
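The retrieval task defined in the abstract above can be made concrete with a small sketch. It assumes that sketches and photos have already been embedded into a shared vector space by some model, and it ranks photos by Euclidean distance to the query sketch. The distance choice, the function names and the hit-at-k check are illustrative assumptions, not the paper's matching model or evaluation protocol.

```python
import math


def rank_photos(sketch_embedding, photo_embeddings):
    """Return photo ids sorted from nearest to farthest from the query sketch.

    `photo_embeddings` maps photo id -> embedding vector in the shared space.
    """
    return sorted(
        photo_embeddings,
        key=lambda photo_id: math.dist(sketch_embedding, photo_embeddings[photo_id]),
    )


def hit_at_k(ranking, true_photo_id, k):
    """True if the photo of the sketched instance appears in the top-k results."""
    return true_photo_id in ranking[:k]
```

Instance-level retrieval means there is exactly one correct photo per query sketch, so a query counts as solved at rank k only if that specific photo is among the first k results.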
We present the first competitive drawing agent Pixelor that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is the winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly as, or faster than, a human competitor. The key to the agent's victory is learning the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator. Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors' strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance. The code and the dataset are available at http://sketchx.ai/pixelor
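The first training step, finding a stroke order that maximizes early recognizability, can be illustrated with a simple greedy search. Note that this is a stand-in: the abstract above describes a search based on neural sorting, whereas the sketch below just picks, at each step, the remaining stroke that most increases the score of the partial sketch. The `recognizability` callable is a hypothetical scoring function (for example, a classifier's confidence in the true class) supplied by the caller.

```python
def greedy_stroke_order(strokes, recognizability):
    """Reorder strokes so that each prefix is as recognizable as possible.

    `recognizability` takes a list of strokes (a partial sketch) and returns
    a score; higher means the partial sketch is easier to recognize.
    Greedy and therefore not guaranteed to find the globally best order.
    """
    remaining = list(strokes)
    ordered = []
    while remaining:
        # Choose the stroke whose addition gives the best partial sketch.
        best = max(remaining, key=lambda stroke: recognizability(ordered + [stroke]))
        ordered.append(best)
        remaining.remove(best)
    return ordered
```

The greedy search needs a number of scorer calls that is quadratic in the number of strokes, compared with the factorial number of possible orderings; the reordered sketches could then serve as supervision targets for a sequence generator.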