My publications


Shi Zhiyuan, Yang Yongxin, Hospedales Timothy M, Xiang Tao (2014) Weakly supervised learning of objects, attributes and their associations, Lecture Notes in Computer Science 8690 L (PART 2) pp. 472-487 Springer Verlag
When humans describe images they tend to use combinations of nouns and adjectives, corresponding to objects and their associated attributes respectively. To generate such a description automatically, one needs to model objects, attributes and their associations. Conventional methods require strong annotation of object and attribute locations, making them less scalable. In this paper, we model object-attribute associations from weakly labelled images, such as those widely available on media sharing sites (e.g. Flickr), where only image-level labels (either objects or attributes) are given, without their locations and associations. This is achieved by introducing a novel weakly supervised non-parametric Bayesian model. Once learned, given a new image, our model can describe the image, including objects, attributes and their associations, as well as their locations and segmentation. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model performs on par with strongly supervised models on tasks such as image description and retrieval based on object-attribute associations. © 2014 Springer International Publishing.
Fu Yanwei, Yang Yongxin, Hospedales Timothy, Xiang Tao, Gong Shaogang (2014) Transductive multi-label zero-shot learning, Proceedings of the British Machine Vision Conference 2014 BMVA Press
Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate semantic representations in the form of attributes and more recently, semantic word vectors. However, they have thus far been constrained to the single-label case, in contrast to the growing popularity and importance of more realistic multi-label data. In this paper, for the first time, we investigate and formalise a general framework for multi-label zero-shot learning, addressing the unique challenge therein: how to exploit multi-label correlation at test time with no training data for those classes? In particular, we propose (1) a multi-output deep regression model to project an image into a semantic word space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors; (2) a novel zero-shot learning algorithm for multi-label data that exploits the unique compositionality property of semantic word vector representations; and (3) a transductive learning strategy to enable the regression model learned from seen classes to generalise well to unseen classes. Our zero-shot learning experiments on a number of standard multi-label datasets demonstrate that our method outperforms a variety of baselines. © 2014. The copyright of this document resides with its authors.
Hu Guosheng, Yang Yongxin, Yi Dong, Kittler Josef, Christmas William, Li Stan, Hospedales Timothy (2015) When Face Recognition Meets with Deep Learning: an Evaluation of Convolutional Neural Networks for Face Recognition, Computer Vision Workshop (ICCVW), 2015 IEEE International Conference on pp. 384-392
Deep learning, in particular the Convolutional Neural Network (CNN), has achieved promising results in face recognition recently. However, it remains an open question why CNNs work well and how to design a "good" architecture. Existing works tend to focus on reporting CNN architectures that work well for face recognition rather than investigating the reason. In this work, we conduct an extensive evaluation of CNN-based face recognition systems (CNN-FRS) on a common ground to make our work easily reproducible. Specifically, we use the public database LFW (Labeled Faces in the Wild) to train CNNs, unlike most existing CNNs, which are trained on private databases. We propose three CNN architectures which are the first reported architectures trained using LFW data. This paper quantitatively compares the architectures of CNNs and evaluates the effect of different implementation choices. We identify several useful properties of CNN-FRS. For instance, the dimensionality of the learned features can be significantly reduced without adverse effect on face recognition accuracy. In addition, a traditional metric learning method exploiting CNN-learned features is evaluated. Experiments show that two crucial factors in good CNN-FRS performance are the fusion of multiple CNNs and metric learning. To make our work reproducible, source code and models will be made publicly available.
Yu Qian, Yang Yongxin, Liu Feng, Song Yi-Zhe, Xiang Tao, Hospedales Timothy M. (2016) Sketch-a-Net: A Deep Neural Network that Beats Humans, International Journal of Computer Vision 122 (3) pp. 411-425 Springer New York LLC
We propose a deep learning approach to free-hand sketch recognition that achieves state-of-the-art performance, significantly surpassing that of humans. Our superior performance is a result of modelling and exploiting the unique characteristics of free-hand sketches, i.e., consisting of an ordered set of strokes but lacking visual cues such as colour and texture, being highly iconic and abstract, and exhibiting extremely large appearance variations due to different levels of abstraction and deformation. Specifically, our deep neural network, termed Sketch-a-Net, has the following novel components: (i) We propose a network architecture designed for sketch rather than natural photo statistics. (ii) Two novel data augmentation strategies are developed which exploit the unique sketch-domain properties to modify and synthesise sketch training data at multiple abstraction levels. Based on this idea we are able to both significantly increase the volume and diversity of sketches for training, and address the challenge of varying levels of sketching detail commonplace in free-hand sketches. (iii) We explore different network ensemble fusion strategies, including a re-purposed joint Bayesian scheme, to further improve recognition performance. We show that state-of-the-art deep networks specifically engineered for photos of natural objects fail to perform well on sketch recognition, regardless of whether they are trained using photos or sketches. Furthermore, through visualising the learned filters, we offer useful insights into where the superior performance of our network comes from. © 2016, Springer Science+Business Media New York.
Song Mingying, Karatutlu Ali, Ali Isma, Ersoy Osman, Zhou Yun, Yang Yongxin, Zhang Yuanpeng, Little William R., Wheeler Ann P., Sapelkin Andrei V. (2017) Spectroscopic super-resolution fluorescence cell imaging using ultra-small Ge quantum dots, Optics Express 25 (4) pp. 4240-4253 Optical Society of America
We demonstrate a spectroscopic imaging based super-resolution approach by separating the overlapping diffraction spots into several detectors during a single scanning period and taking advantage of the size-dependent emission wavelength in nanoparticles. This approach has been tested using off-the-shelf quantum dots (Invitrogen Qdot) and in-house novel ultra-small (~3 nm) Ge QDs. Furthermore, we developed a method-specific Gaussian fitting and maximum likelihood estimation based on a Matlab algorithm for fast QD localisation. This methodology results in a three-fold improvement in the number of localised QDs compared to non-spectroscopic images. With the addition of advanced ultra-small Ge probes, the number can be improved even further, giving at least 1.5 times improvement when compared to Qdots. Using a standard scanning confocal microscope we achieved a data acquisition rate of 200 ms per image frame. This is an improvement on single molecule localisation super-resolution microscopy where repeated image capture limits the imaging speed, and the size of fluorescence probes limits the possible theoretical localisation resolution. We show that our spectral deconvolution approach has the potential to deliver data acquisition rates on the ms scale, thus providing super-resolution in live systems. © 2017, OSA - The Optical Society. All rights reserved.
Pang Kaiyue, Li Ke, Yang Yongxin, Zhang Honggang, Hospedales Timothy M., Xiang Tao, Song Yi-Zhe (2019) Generalising Fine-Grained Sketch-Based Image Retrieval, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) pp. 677-686 Institute of Electrical and Electronics Engineers (IEEE)
Fine-grained sketch-based image retrieval (FG-SBIR) addresses matching a specific photo instance using a free-hand sketch as a query modality. Existing models aim to learn an embedding space in which sketch and photo can be directly compared. While successful, they require instance-level pairing within each coarse-grained category as annotated training data. Since the learned embedding space is domain-specific, these models do not generalise well across categories. This limits the practical applicability of FG-SBIR. In this paper, we identify cross-category generalisation for FG-SBIR as a domain generalisation problem, and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic via embedding the novel sketch in the manifold and updating the representation and retrieval function accordingly. Experiments on the two largest FG-SBIR datasets, Sketchy and QMUL-Shoe-V2, demonstrate the efficacy of our approach in enabling cross-category generalisation of FG-SBIR.
Song Jifei, Yang Yongxin, Song Yi-Zhe, Xiang Tao, Hospedales Timothy M. (2019) Generalizable Person Re-identification by Domain-Invariant Mapping Network, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) pp. 719-728 Institute of Electrical and Electronics Engineers (IEEE)
We aim to learn a domain generalizable person re-identification (ReID) model. When such a model is trained on a set of source domains (ReID datasets collected from different camera networks), it can be directly applied to any new unseen dataset for effective ReID without any model updating. Despite its practical value in real-world deployments, generalizable ReID has seldom been studied. In this work, a novel deep ReID model termed Domain-Invariant Mapping Network (DIMN) is proposed. DIMN is designed to learn a mapping between a person image and its identity classifier, i.e., it produces a classifier using a single shot. To make the model domain-invariant, we follow a meta-learning pipeline and sample a subset of source domain training tasks during each training episode. However, the model is significantly different from conventional meta-learning methods in that: (1) no model updating is required for the target domain, (2) different training tasks share a memory bank for maintaining both scalability and discrimination ability, and (3) it can be used to match an arbitrary number of identities in a target domain. Extensive experiments on a newly proposed large-scale ReID domain generalization benchmark show that our DIMN significantly outperforms alternative domain generalization or meta-learning methods.
Zhang Jiang, Du Jun, Yang Yongxin, Song Yi-Zhe, Wei Si, Dai Lirong (2020) A Tree-Structured Decoder for Image-to-Markup Generation, Journal of Machine Learning Research Microtome Publishing
Recent encoder-decoder approaches typically employ string decoders to convert images into serialized strings for image-to-markup. However, for tree-structured representational markup, string representations can hardly cope with the structural complexity. In this work, we first show via a set of toy problems that string decoders struggle to decode tree structures, especially as structural complexity increases. We then propose a tree-structured decoder that specifically aims at generating a tree-structured markup. Our decoder works sequentially, where at each step a child node and its parent node are simultaneously generated to form a sub-tree. This sub-tree is subsequently used to construct the final tree structure in a recurrent manner. The key to the success of our tree decoder is twofold: (i) it strictly respects the parent-child relationship of trees, and (ii) it explicitly outputs trees as opposed to a linear string. Evaluated on both math formula recognition and chemical formula recognition, the proposed tree decoder is shown to greatly outperform strong string decoder baselines.
Sain Aneeshan, Bhunia Ayan Kumar, Yang Yongxin, Xiang Tao, Song Yi-Zhe (2020) Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval, Proceedings of The 31st British Machine Vision Virtual Conference (BMVC 2020) pp. 1-14 British Machine Vision Association
Sketch as an image search query is an ideal alternative to text in capturing fine-grained visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have demonstrated the importance of tackling the unique traits of sketches as opposed to photos, e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a further trait of sketches that has been overlooked to date, that is, they are hierarchical in terms of the levels of detail: a person typically sketches up to various extents of detail to depict an object. This hierarchical structure is often visually distinct. We design a novel network that is capable of cultivating sketch-specific hierarchies and exploiting them to match sketch with photo at corresponding hierarchical levels. In particular, features from a sketch and a photo are enriched using cross-modal co-attention, coupled with hierarchical node fusion at every level to form a better embedding space to conduct retrieval. Experiments on common benchmarks show our method to outperform state-of-the-art methods by a significant margin.
Bhunia Ayan Kumar, Das Ayan, Muhammad Umar Riaz, Yang Yongxin, Hospedales Timothy M., Xiang Tao, Gryaditskaya Yulia, Song Yi-Zhe (2020) Pixelor: A Competitive Sketching AI Agent. So you think you can beat me?, ACM Transactions on Graphics 39 (6) Association for Computing Machinery (ACM)
We present the first competitive drawing agent Pixelor that exhibits human-level performance at a Pictionary-like sketching game, where the participant whose sketch is recognized first is the winner. Our AI agent can autonomously sketch a given visual concept, and achieve a recognizable rendition as quickly or faster than a human competitor. The key to victory for the agent's goal is to learn the optimal stroke sequencing strategies that generate the most recognizable and distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the stroke order that maximizes early recognizability of human training sketches. Second, this order is used to supervise the training of a sequence-to-sequence stroke generator. Our key technical contributions are a tractable search of the exponential space of orderings using neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an optimal-transport loss to accommodate the multi-modal nature of the optimal stroke distribution. Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game, under both AI and human judging of early recognition. To analyze the impact of human competitors' strategies, we conducted a further human study with participants being given unlimited thinking time and training in early recognizability by feedback from an AI judge. The study shows that humans do gradually improve their strategies with training, but overall Pixelor still matches human performance. The code and the dataset are available at