Dr Yi-Zhe Song


Reader in Computer Vision and Machine Learning
BSc (Bath), MSc (Cantab), PhD (Bath), FHEA, SMIEEE

Publications

Yu Qian, Yang Yongxin, Liu Feng, Song Yi-Zhe, Xiang Tao, Hospedales Timothy M. (2016) Sketch-a-Net: A Deep Neural Network that Beats Humans, International Journal of Computer Vision 122 (3) pp. 411-425 Springer New York LLC
We propose a deep learning approach to free-hand sketch recognition that achieves state-of-the-art performance, significantly surpassing that of humans. Our superior performance is a result of modelling and exploiting the unique characteristics of free-hand sketches, i.e., consisting of an ordered set of strokes but lacking visual cues such as colour and texture, being highly iconic and abstract, and exhibiting extremely large appearance variations due to different levels of abstraction and deformation. Specifically, our deep neural network, termed Sketch-a-Net, has the following novel components: (i) we propose a network architecture designed for sketch rather than natural photo statistics. (ii) Two novel data augmentation strategies are developed which exploit the unique sketch-domain properties to modify and synthesise sketch training data at multiple abstraction levels. Based on this idea we are able to both significantly increase the volume and diversity of sketches for training, and address the challenge of varying levels of sketching detail commonplace in free-hand sketches. (iii) We explore different network ensemble fusion strategies, including a re-purposed joint Bayesian scheme, to further improve recognition performance. We show that state-of-the-art deep networks specifically engineered for photos of natural objects fail to perform well on sketch recognition, regardless of whether they are trained using photos or sketches. Furthermore, through visualising the learned filters, we offer useful insights into where the superior performance of our network comes from. © 2016, Springer Science+Business Media New York.
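For readers who want a concrete picture of what a sketch-specific architecture can look like, below is a minimal PyTorch sketch in the spirit of the network described above: single-channel input, a large first-layer kernel suited to sparse stroke statistics, and no colour/texture-oriented components. The layer sizes and input resolution are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class SketchCNN(nn.Module):
    """Illustrative sketch-recognition CNN: grey-scale input and a large
    first-layer receptive field for sparse, binary stroke images."""
    def __init__(self, num_classes=250):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=15, stride=3), nn.ReLU(),   # large kernel for sparse strokes
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.Linear(256 * 7 * 7, 512), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(512, num_classes),
        )

    def forward(self, x):                       # x: (batch, 1, 225, 225) rasterised sketches
        return self.classifier(self.features(x))

logits = SketchCNN()(torch.randn(2, 1, 225, 225))   # -> (2, 250)
```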
Song Jifei, Yang Yongxin, Song Yi-Zhe, Xiang Tao, Hospedales Timothy M. (2019) Generalizable Person Re-identification by Domain-Invariant Mapping Network, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) pp. 719-728 Institute of Electrical and Electronics Engineers (IEEE)
We aim to learn a domain-generalizable person re-identification (ReID) model. When such a model is trained on a set of source domains (ReID datasets collected from different camera networks), it can be directly applied to any new unseen dataset for effective ReID without any model updating. Despite its practical value in real-world deployments, generalizable ReID has seldom been studied. In this work, a novel deep ReID model termed Domain-Invariant Mapping Network (DIMN) is proposed. DIMN is designed to learn a mapping between a person image and its identity classifier, i.e., it produces a classifier using a single shot. To make the model domain-invariant, we follow a meta-learning pipeline and sample a subset of source domain training tasks during each training episode. However, the model is significantly different from conventional meta-learning methods in that: (1) no model updating is required for the target domain, (2) different training tasks share a memory bank for maintaining both scalability and discrimination ability, and (3) it can be used to match an arbitrary number of identities in a target domain. Extensive experiments on a newly proposed large-scale ReID domain generalization benchmark show that our DIMN significantly outperforms alternative domain generalization or meta-learning methods.
Pang Kaiyue, Li Ke, Yang Yongxin, Zhang Honggang, Hospedales Timothy M., Xiang Tao, Song Yi-Zhe (2019) Generalising Fine-Grained Sketch-Based Image Retrieval, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) pp. 677-686 Institute of Electrical and Electronics Engineers (IEEE)
Fine-grained sketch-based image retrieval (FG-SBIR) addresses matching a specific photo instance using a free-hand sketch as the query modality. Existing models aim to learn an embedding space in which sketch and photo can be directly compared. While successful, they require instance-level pairing within each coarse-grained category as annotated training data. Since the learned embedding space is domain-specific, these models do not generalise well across categories. This limits the practical applicability of FG-SBIR. In this paper, we identify cross-category generalisation for FG-SBIR as a domain generalisation problem, and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic via embedding the novel sketch in the manifold and updating the representation and retrieval function accordingly. Experiments on the two largest FG-SBIR datasets, Sketchy and QMUL-Shoe-V2, demonstrate the efficacy of our approach in enabling cross-category generalisation of FG-SBIR.
Dey Sounak, Riba Pau, Dutta Anjan, Llados Josep, Song Yi-Zhe (2019) Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) pp. 2179-2188 Institute of Electrical and Electronics Engineers (IEEE)
In this paper, we investigate the problem of zero-shot sketch-based image retrieval (ZS-SBIR), where human sketches are used as queries to conduct retrieval of photos from unseen categories. We importantly advance prior arts by proposing a novel ZS-SBIR scenario that represents a firm step forward in its practical application. The new setting uniquely recognizes two important yet often neglected challenges of practical ZS-SBIR, (i) the large domain gap between amateur sketch and photo, and (ii) the necessity for moving towards large-scale retrieval. We first contribute to the community a novel ZS-SBIR dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000 photos spanning across 110 categories. Highly abstract amateur human sketches are purposefully sourced to maximize the domain gap, instead of ones included in existing datasets that can often be semi-photorealistic. We then formulate a ZS-SBIR framework to jointly model sketches and photos into a common embedding space. A novel strategy to mine the mutual information among domains is specifically engineered to alleviate the domain gap. External semantic knowledge is further embedded to aid semantic transfer. We show that, rather surprisingly, retrieval performance that significantly outperforms the state of the art on existing datasets can already be achieved using a reduced version of our model. We further demonstrate the superior performance of our full model by comparing with a number of alternatives on the newly proposed dataset. The new dataset, plus all training and testing code of our model, will be publicly released to facilitate future research.
Ma Z., Lai Y., Kleijn W.B., Song Yi-Zhe, Wang L., Guo J. (2019) Variational Bayesian Learning for Dirichlet Process Mixture of Inverted Dirichlet Distributions in Non-Gaussian Image Feature Modeling, IEEE Transactions on Neural Networks and Learning Systems 30 (2) pp. 449-463 Institute of Electrical and Electronics Engineers Inc.
In this paper, we develop a novel variational Bayesian learning method for the Dirichlet process (DP) mixture of the inverted Dirichlet distributions, which has been shown to be very flexible for modeling vectors with positive elements. The recently proposed extended variational inference (EVI) framework is adopted to derive an analytically tractable solution. The convergence of the proposed algorithm is theoretically guaranteed by introducing a single lower-bound approximation to the original objective function in the EVI framework. In principle, the proposed model can be viewed as an infinite inverted Dirichlet mixture model that allows the automatic determination of the number of mixture components from data. Therefore, the problem of predetermining the optimal number of mixing components has been overcome. Moreover, the problems of overfitting and underfitting are avoided by the Bayesian estimation approach. Compared with several recently proposed DP-related methods and conventionally applied methods, the good performance and effectiveness of the proposed method have been demonstrated with both synthesized data and real data evaluations.
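As a pointer for readers unfamiliar with the building blocks, the standard inverted Dirichlet density and the stick-breaking construction of the Dirichlet process that the abstract refers to are given below in common notation; the paper's own parameterisation may differ.

```latex
% Inverted Dirichlet density over a positive vector x = (x_1,...,x_K),
% with parameters alpha_1,...,alpha_{K+1} > 0:
\[
  p(\mathbf{x}\mid\boldsymbol{\alpha}) =
  \frac{\Gamma\!\bigl(\sum_{k=1}^{K+1}\alpha_k\bigr)}{\prod_{k=1}^{K+1}\Gamma(\alpha_k)}
  \prod_{k=1}^{K} x_k^{\alpha_k-1}
  \Bigl(1+\sum_{k=1}^{K} x_k\Bigr)^{-\sum_{k=1}^{K+1}\alpha_k},
  \qquad x_k > 0 .
\]
% Stick-breaking view of the DP mixture weights, which is what lets the effective
% number of components be inferred from data rather than fixed in advance:
\[
  v_k \sim \mathrm{Beta}(1,\lambda), \qquad
  \pi_k = v_k \prod_{j<k} (1-v_j), \qquad k = 1, 2, \dots
\]
```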
Muhammad U.R., Yang Y., Song Yi-Zhe, Xiang T., Hospedales T.M. (2019) Learning Deep Sketch Abstraction, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 8014-8023 IEEE Computer Society
Human free-hand sketches have been studied in various contexts including sketch recognition, synthesis and fine-grained sketch-based image retrieval (FG-SBIR). A fundamental challenge for sketch analysis is to deal with drastically different human drawing styles, particularly in terms of abstraction level. In this work, we propose the first stroke-level sketch abstraction model based on the insight of sketch abstraction as a process of trading off between the recognizability of a sketch and the number of strokes used to draw it. Concretely, we train a model for abstract sketch generation through reinforcement learning of a stroke removal policy that learns to predict which strokes can be safely removed without affecting recognizability. We show that our abstraction model can be used for various sketch analysis tasks including: (1) modeling stroke saliency and understanding the decision of sketch recognition models, (2) synthesizing sketches of variable abstraction for a given category, or reference object instance in a photo, and (3) training a FG-SBIR model with photos only, bypassing the expensive photo-sketch pair collection step.
Xu P., Huang Y., Yuan T., Pang K., Song Yi-Zhe, Xiang T., Hospedales T.M., Ma Z., Guo J. (2018) SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 8090-8098 IEEE Computer Society
We propose a deep hashing framework for sketch retrieval that, for the first time, works on a multi-million scale human sketch dataset. Leveraging this large dataset, we explore a few sketch-specific traits that were otherwise under-studied in prior literature. Instead of following the conventional sketch recognition task, we introduce the novel problem of sketch hashing retrieval which is not only more challenging, but also offers a better testbed for large-scale sketch analysis, since: (i) more fine-grained sketch feature learning is required to accommodate the large variations in style and abstraction, and (ii) a compact binary code needs to be learned at the same time to enable efficient retrieval. Key to our network design is the embedding of unique characteristics of human sketch, where (i) a two-branch CNN-RNN architecture is adapted to explore the temporal ordering of strokes, and (ii) a novel hashing loss is specifically designed to accommodate both the temporal and abstract traits of sketches. By working with a 3.8M sketch dataset, we show that state-of-the-art hashing models specifically engineered for static images fail to perform well on temporal sketch data. Our network on the other hand not only offers the best retrieval performance on various code sizes, but also yields the best generalization performance under a zero-shot setting and when re-purposed for sketch recognition. Such superior performances effectively demonstrate the benefit of our sketch-specific design. © 2018 IEEE.
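A toy version of the two-branch design described above is sketched below: a small CNN over the rasterised sketch and an RNN over the point sequence, fused into a relaxed binary code. All sizes, the (dx, dy, pen-state) stroke encoding and the tanh relaxation are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class SketchHasher(nn.Module):
    """Toy two-branch sketch hasher: CNN on the raster image, RNN on the stroke sequence."""
    def __init__(self, code_bits=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())              # static appearance feature (64-d)
        self.rnn = nn.GRU(input_size=3, hidden_size=64, batch_first=True)  # (dx, dy, pen state) per point
        self.hash = nn.Linear(64 + 64, code_bits)

    def forward(self, raster, strokes):
        f_static = self.cnn(raster)                              # (B, 64)
        _, h = self.rnn(strokes)                                 # h: (1, B, 64), temporal stroke feature
        relaxed = torch.tanh(self.hash(torch.cat([f_static, h[-1]], dim=1)))
        return torch.sign(relaxed), relaxed                      # binary code at test time, relaxed code for training

model = SketchHasher()
binary, relaxed = model(torch.randn(2, 1, 128, 128), torch.randn(2, 50, 3))
```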
Li K., Pang K., Song Yi-Zhe, Xiang T., Hospedales T.M., Zhang H. (2019) Toward Deep Universal Sketch Perceptual Grouper, IEEE Transactions on Image Processing 28 (7) pp. 3219-3231 Institute of Electrical and Electronics Engineers (IEEE)
Human free-hand sketches provide useful data for studying human perceptual grouping, where the grouping principles such as the Gestalt laws of grouping are naturally in play during both the perception and sketching stages. In this paper, we make the first attempt to develop a universal sketch perceptual grouper. That is, a grouper that can be applied to sketches of any category created with any drawing style and ability, to group constituent strokes/segments into semantically meaningful object parts. The first obstacle to achieving this goal is the lack of large-scale datasets with grouping annotation. To overcome this, we contribute the largest sketch perceptual grouping dataset to date, consisting of 20,000 unique sketches evenly distributed over 25 object categories. Furthermore, we propose a novel deep perceptual grouping model learned with both generative and discriminative losses. The generative loss improves the generalization ability of the model, while the discriminative loss guarantees both local and global grouping consistency. Extensive experiments demonstrate that the proposed grouper significantly outperforms the state-of-the-art competitors. In addition, we show that our grouper is useful for a number of sketch analysis tasks, including sketch semantic segmentation, synthesis, and fine-grained sketch-based image retrieval. © 1992-2012 IEEE.
Zhong Y., Zhang H., Guo J., Song Yi-Zhe (2018) Directional Element HOG for Sketch Recognition, Proceedings of the 2018 6th International Conference on Network Infrastructure and Digital Content (IC-NIDC 2018) pp. 50-54 Institute of Electrical and Electronics Engineers Inc.
We propose a novel Directional Element Histogram of Oriented Gradient (DE-HOG) feature for the human free-hand sketch recognition task that achieves superior performance to the traditional HOG feature, originally designed for photographic objects. This is a result of modeling the unique characteristics of free-hand sketches, i.e., consisting only of a set of strokes, omitting visual information such as color and brightness, and being highly iconic and abstract. Specifically, we encode sketching strokes as a form of regularized directional vectors from the skeleton of a sketch, whilst still leveraging the HOG feature to meet the local deformation-invariant demands. Such a representation combines the best of two features by encoding necessary and discriminative stroke-level information, but can still robustly deal with various levels of sketching variations. Extensive experiments conducted on two large benchmark sketch recognition datasets demonstrate the performance of our proposed method.
Li D., Yang Y., Song Yi-Zhe, Hospedales T.M. (2018) Learning to generalize: Meta-learning for domain generalization, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018) pp. 3490-3497 AAAI press
Domain shift refers to the well-known problem that a model trained in one source domain performs poorly when applied to a target domain with different statistics. Domain Generalization (DG) techniques attempt to alleviate this issue by producing models which by design generalize well to novel testing domains. We propose a novel meta-learning method for domain generalization. Rather than designing a specific model that is robust to domain shift as in most previous DG work, we propose a model-agnostic training procedure for DG. Our algorithm simulates train/test domain shift during training by synthesizing virtual testing domains within each mini-batch. The meta-optimization objective requires that steps to improve training domain performance should also improve testing domain performance. This meta-learning procedure trains models with good generalization ability to novel domains. We evaluate our method and achieve state-of-the-art results on a recent cross-domain image classification benchmark, as well as demonstrating its potential on two classic reinforcement learning tasks.
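The meta-optimization idea can be made concrete with a toy example: in each iteration one source domain plays the role of a virtual test domain, and the loss on it is evaluated after a virtual gradient step taken on the remaining domains. The linear model, synthetic data and hyper-parameters below are illustrative assumptions, not the paper's experimental setup.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy data: three source domains of shifted 2-D points with binary labels.
torch.manual_seed(0)
domains = [(torch.randn(32, 2) + i, torch.randint(0, 2, (32,))) for i in range(3)]

W = torch.zeros(2, 2, requires_grad=True)        # weights of a tiny linear classifier
b = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([W, b], lr=0.1)
alpha, beta = 0.1, 1.0                            # inner-step size and meta-test weight

def loss_on(domain, W, b):
    x, y = domain
    return F.cross_entropy(x @ W + b, y)

for it in range(100):
    # Simulate domain shift: hold one source domain out as the virtual test domain.
    held_out = it % len(domains)
    meta_test = domains[held_out]
    meta_train = [d for j, d in enumerate(domains) if j != held_out]

    train_loss = sum(loss_on(d, W, b) for d in meta_train) / len(meta_train)

    # Virtual gradient step on the meta-train loss, kept in the graph for second-order terms.
    gW, gb = torch.autograd.grad(train_loss, [W, b], create_graph=True)
    test_loss = loss_on(meta_test, W - alpha * gW, b - alpha * gb)

    # Meta-objective: steps that help the training domains must also help the virtual test domain.
    opt.zero_grad()
    (train_loss + beta * test_loss).backward()
    opt.step()
```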
Hu C., Li D., Song Yi-Zhe, Xiang T., Hospedales T.M. (2019) Sketch-a-Classifier: Sketch-Based Photo Classifier Generation, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 9136-9144 IEEE Computer Society
Contemporary deep learning techniques have made image recognition a reasonably reliable technology. However, training effective photo classifiers typically takes numerous examples, which limits image recognition's scalability and applicability to scenarios where images may not be available. This has motivated investigation into zero-shot learning, which addresses the issue via knowledge transfer from other modalities such as text. In this paper we investigate an alternative approach of synthesizing image classifiers: almost directly from a user's imagination, via free-hand sketch. This approach does not require the category to be nameable or describable via attributes as per zero-shot learning. We achieve this via training a model regression network to map from free-hand sketch space to the space of photo classifiers. It turns out that this mapping can be learned in a category-agnostic way, allowing photo classifiers for new categories to be synthesized by a user with no need for annotated training photos. We also demonstrate that this modality of classifier generation can be used to enhance the granularity of an existing photo classifier, or as a complement to name-based zero-shot learning.
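The core mechanism, a regression network that maps a sketch representation to the weights of a photo classifier, can be sketched as follows. The feature dimensions, the single-hidden-layer regressor and the binary-classifier form are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical feature sizes: 512-d sketch features, 2048-d photo features.
regressor = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 2049))

def synthesize_classifier(sketch_feat):
    """Map a sketch feature vector to the weight vector and bias of a binary photo classifier."""
    wb = regressor(sketch_feat)
    return wb[:2048], wb[2048]

def score_photo(photo_feat, w, b):
    # Probability that the photo depicts the category the user sketched.
    return torch.sigmoid(photo_feat @ w + b)

w, b = synthesize_classifier(torch.randn(512))   # e.g. the mean feature of a few sketches
prob = score_photo(torch.randn(2048), w, b)
```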
Song J., Pang K., Song Yi-Zhe, Xiang T., Hospedales T.M. (2019) Learning to Sketch with Shortcut Cycle Consistency, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 801-810 IEEE Computer Society
To see is to sketch - free-hand sketching naturally builds ties between human and machine vision. In this paper, we present a novel approach for translating an object photo to a sketch, mimicking the human sketching process. This is an extremely challenging task because the photo and sketch domains differ significantly. Furthermore, human sketches exhibit various levels of sophistication and abstraction even when depicting the same object instance in a reference photo. This means that even if photo-sketch pairs are available, they only provide weak supervision signal to learn a translation model. Compared with existing supervised approaches that solve the problem of D(E(photo)) → sketch, where E(·) and D(·) denote encoder and decoder respectively, we take advantage of the inverse problem (e.g., D(E(sketch)) → photo), and combine with the unsupervised learning tasks of within-domain reconstruction, all within a multi-task learning framework. Compared with existing unsupervised approaches based on cycle consistency (i.e., D(E(D(E(photo)))) → photo), we introduce a shortcut consistency enforced at the encoder bottleneck (e.g., D(E(photo)) → photo) to exploit the additional self-supervision. Both qualitative and quantitative results show that the proposed model is superior to a number of state-of-the-art alternatives. We also show that the synthetic sketches can be used to train a better fine-grained sketch-based image retrieval (FG-SBIR) model, effectively alleviating the problem of sketch data scarcity.
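Using the abstract's E(·)/D(·) notation, the multi-task loss composition can be illustrated with toy encoder/decoder stand-ins; the L1 criterion, the equal weighting and the linear modules below are assumptions, and the real model uses convolutional/recurrent photo and sketch branches.

```python
import torch
import torch.nn as nn

# Toy stand-ins: E_* encode into a shared bottleneck, D_* decode back to each domain.
E_p, E_s = nn.Linear(64, 16), nn.Linear(64, 16)      # photo / sketch encoders
D_p, D_s = nn.Linear(16, 64), nn.Linear(16, 64)      # photo / sketch decoders
l1 = nn.L1Loss()

photo, sketch = torch.randn(8, 64), torch.randn(8, 64)   # one paired toy example

# Supervised translation in both directions: D(E(photo)) -> sketch and D(E(sketch)) -> photo.
translation = l1(D_s(E_p(photo)), sketch) + l1(D_p(E_s(sketch)), photo)
# Shortcut consistency at the bottleneck: D(E(photo)) -> photo and D(E(sketch)) -> sketch,
# replacing the long cycle D(E(D(E(photo)))) -> photo with a shorter self-supervised path.
shortcut = l1(D_p(E_p(photo)), photo) + l1(D_s(E_s(sketch)), sketch)

total = translation + shortcut
total.backward()
```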
Xu P., Yin Q., Huang Y., Song Yi-Zhe, Ma Z., Wang L., Xiang T., Kleijn W.B., Guo J. (2018) Cross-modal subspace learning for fine-grained sketch-based image retrieval, Neurocomputing 278 pp. 75-86 Elsevier
Sketch-based image retrieval (SBIR) is challenging due to the inherent domain-gap between sketch and photo. Compared with the pixel-perfect depictions of photos, sketches are iconic and highly abstract renderings of the real world. Therefore, matching sketch and photo directly using low-level visual cues is insufficient, since a common low-level subspace that traverses semantically across the two modalities is non-trivial to establish. Most existing SBIR studies do not directly tackle this cross-modal problem. This naturally motivates us to explore the effectiveness of cross-modal retrieval methods in SBIR, which have been applied successfully to image-text matching. In this paper, we introduce and compare a series of state-of-the-art cross-modal subspace learning methods and benchmark them on two recently released fine-grained SBIR datasets. Through thorough examination of the experimental results, we have demonstrated that the subspace learning can effectively model the sketch-photo domain-gap. In addition, we draw a few key insights to drive future research. © 2017 Elsevier B.V.
Pang K., Li D., Song J., Song Yi-Zhe, Xiang T., Hospedales T.M. (2018) Deep factorised inverse-sketching, Proceedings of the 15th European Conference on Computer Vision (ECCV 2018) 11219 pp. 37-54 Springer Verlag
Modelling human free-hand sketches has become topical recently, driven by practical applications such as fine-grained sketch based image retrieval (FG-SBIR). Sketches are clearly related to photo edge-maps, but a human free-hand sketch of a photo is not simply a clean rendering of that photo's edge map. Instead there is a fundamental process of abstraction and iconic rendering, where overall geometry is warped and salient details are selectively included. In this paper we study this sketching process and attempt to invert it. We model this inversion by translating iconic free-hand sketches to contours that resemble more geometrically realistic projections of object boundaries, and separately factorise out the salient added details. This factorised re-representation makes it easier to match a free-hand sketch to a photo instance of an object. Specifically, we propose a novel unsupervised image style transfer model based on enforcing a cyclic embedding consistency constraint. A deep FG-SBIR model is then formulated to accommodate complementary discriminative detail from each factorised sketch for better matching with the corresponding photo. Our method is evaluated both qualitatively and quantitatively to demonstrate its superiority over a number of state-of-the-art alternatives for style transfer and FG-SBIR.
Ma Z., Chien J.-T., Tan Z.-H., Song Yi-Zhe, Taghia J., Xiao M. (2018) Recent advances in machine learning for non-Gaussian data processing, Neurocomputing 278 pp. 1-3 Elsevier

With the widespread explosion of sensing and computing, an increasing number of industrial applications and an ever-growing amount of academic research generate massive multi-modal data from multiple sources. The Gaussian distribution is the probability distribution ubiquitously used in statistics, signal processing, and pattern recognition. However, in reality data are neither always Gaussian nor can they safely be assumed to be Gaussian distributed. In many real-life applications the distribution of the data is, for example, bounded or asymmetric, and is therefore not Gaussian. It has been found in recent studies that explicitly utilizing the non-Gaussian characteristics of data (e.g., data with bounded support, data with semi-bounded support, and data with L1/L2-norm constraint) can significantly improve the performance of practical systems. Hence, it is of particular importance and interest to make thorough studies of the non-Gaussian data and the corresponding non-Gaussian statistical models (e.g., beta distribution for bounded support data, gamma distribution for semi-bounded support data, and Dirichlet/vMF distribution for data with L1/L2-norm constraint).

In order to analyze and understand such non-Gaussian distributed data, the development of related learning theories, statistical models, and efficient algorithms becomes crucial. The scope of this special issue of Elsevier's journal Neurocomputing is to provide theoretical foundations and ground-breaking models and algorithms to address this challenge.

Pang L., Wang Y., Song Yi-Zhe, Huang T., Tian Y. (2018) Cross-domain adversarial feature learning for sketch re-identification, Proceedings of the 26th ACM international conference on Multimedia (MM 18) pp. 609-617 Association for Computing Machinery (ACM)
Under person re-identification (Re-ID), a query photo of the target person is often required for retrieval. However, one is not always guaranteed to have such a photo readily available under a practical forensic setting. In this paper, we define the problem of Sketch Re-ID, which, instead of using a photo as input, initiates the query process using a professional sketch of the target person. This is akin to the traditional problem of forensic facial sketch recognition, yet with the major difference that our sketches are whole-body rather than just the face. This problem is challenging because sketches and photos are in two distinct domains. Specifically, a sketch is the abstract description of a person. Besides, person appearance in photos varies due to camera viewpoint, human pose and occlusion. We address the Sketch Re-ID problem by proposing a cross-domain adversarial feature learning approach to jointly learn the identity features and domain-invariant features. We employ adversarial feature learning to filter low-level interfering features and retain high-level semantic information. We also contribute to the community the first Sketch Re-ID dataset with 200 persons, where each person has one sketch and two photos from different cameras associated. Extensive experiments have been performed on the proposed dataset and other common sketch datasets including CUFSF and QMUL-Shoe. Results show that the proposed method outperforms the state of the art.
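One common way to realise adversarial domain-invariant feature learning of this kind is a gradient-reversal layer placed in front of a sketch-vs-photo domain classifier, as sketched below. This is a generic illustration under assumed feature sizes and may differ from the paper's exact adversarial formulation.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda on the way back,
    so the feature extractor is trained to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

feat_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # shared feature extractor (toy)
id_head = nn.Linear(64, 200)                               # identity classifier (200 persons)
dom_head = nn.Linear(64, 2)                                # sketch-vs-photo domain classifier

x = torch.randn(8, 128)                                    # stand-in for sketch/photo features
person = torch.randint(0, 200, (8,))
domain = torch.randint(0, 2, (8,))

f = feat_net(x)
loss = nn.functional.cross_entropy(id_head(f), person) \
     + nn.functional.cross_entropy(dom_head(GradReverse.apply(f, 1.0)), domain)
loss.backward()
```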
Xu P., Li K., Ma Z., Song Yi-Zhe, Wang L., Guo J. (2017) Cross-modal subspace learning for sketch-based image retrieval: A comparative study, Proceedings of the 5th IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2016) pp. 500-504 Institute of Electrical and Electronics Engineers (IEEE)
Sketch-based image retrieval (SBIR) has become a prominent research topic in recent years due to the proliferation of touch screens. The problem is, however, very challenging in that photos and sketches are inherently modeled in different modalities. Photos are accurate (colored and textured) depictions of the real-world, whereas sketches are highly abstract (black and white) renderings often drawn from human memory. This naturally motivates us to study the effectiveness of various cross-modal retrieval methods in SBIR. However, to the best of our knowledge, all established cross-modal algorithms are designed to traverse the more conventional cross-modal gap of image and text, making their general applicability to SBIR unclear. In this paper, we design a series of experiments to clearly illustrate circumstances under which cross-modal methods can be best utilized to solve the SBIR problem. More specifically, we choose six state-of-the-art cross-modal subspace learning approaches that were shown to work well on image-text matching and conduct extensive experiments on a recently released SBIR dataset. Finally, we present detailed comparative analysis of the experimental results and offer insights to benefit future research.
Li K., Pang K., Song Yi-Zhe, Hospedales T.M., Xiang T., Zhang H. (2017) Synergistic Instance-Level Subspace Alignment for Fine-Grained Sketch-Based Image Retrieval, IEEE Transactions on Image Processing 26 (12) pp. 5908-5921 Institute of Electrical and Electronics Engineers Inc.
We study the problem of fine-grained sketch-based image retrieval. By performing instance-level (rather than category-level) retrieval, it embodies a timely and practical application, particularly with the ubiquitous availability of touchscreens. Three factors contribute to the challenging nature of the problem: 1) free-hand sketches are inherently abstract and iconic, making visual comparisons with photos difficult; 2) sketches and photos are in two different visual domains, i.e., black and white lines versus color pixels; and 3) fine-grained distinctions are especially challenging when executed across domain and abstraction-level. To address these challenges, we propose to bridge the image-sketch gap both at the high level via parts and attributes, as well as at the low level via introducing a new domain alignment method. More specifically, first, we contribute a data set with 304 photos and 912 sketches, where each sketch and image is annotated with its semantic parts and associated part-level attributes. With the help of this data set, second, we investigate how strongly supervised deformable part-based models can be learned that subsequently enable automatic detection of part-level attributes, and provide pose-aligned sketch-image comparisons. To reduce the sketch-image gap when comparing low-level features, third, we also propose a novel method for instance-level domain-alignment that exploits both subspace and instance-level cues to better align the domains. Finally, fourth, these are combined in a matching framework integrating aligned low-level features, mid-level geometric structure, and high-level semantic attributes. Extensive experiments conducted on our new data set demonstrate effectiveness of the proposed method.
Song J., Yu Q., Song Yi-Zhe, Xiang T., Hospedales T.M. (2018) Deep Spatial-Semantic Attention for Fine-Grained Sketch-Based Image Retrieval, Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV) pp. 5552-5561 Institute of Electrical and Electronics Engineers (IEEE)
Human sketches are unique in being able to capture both the spatial topology of a visual object, as well as its subtle appearance details. Fine-grained sketch-based image retrieval (FG-SBIR) importantly leverages on such fine-grained characteristics of sketches to conduct instance-level retrieval of photos. Nevertheless, human sketches are often highly abstract and iconic, resulting in severe misalignments with candidate photos which in turn make subtle visual detail matching difficult. Existing FG-SBIR approaches focus only on coarse holistic matching via deep cross-domain representation learning, yet ignore explicitly accounting for fine-grained details and their spatial context. In this paper, a novel deep FG-SBIR model is proposed which differs significantly from the existing models in that: (1) it is spatially aware, achieved by introducing an attention module that is sensitive to the spatial position of visual details; (2) it combines coarse and fine semantic information via a shortcut connection fusion block; and (3) it models feature correlation and is robust to misalignments between the extracted features across the two domains by introducing a novel higher-order learnable energy function (HOLEF) based loss. Extensive experiments show that the proposed deep spatial-semantic attention model significantly outperforms the state-of-the-art.
Ma Z., Ling H., Song Y.Z., Hospedales T., Jia W., Peng Y., Han A. (2018) IEEE Access Special Section Editorial: Recent Advantages of Computer Vision, IEEE Access 6 pp. 31481-31485 Institute of Electrical and Electronics Engineers (IEEE)
Computer vision is an interdisciplinary field that deals with how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do. As a scientific discipline, computer vision is about the theory behind artificial systems that extract information from images. With the emergence of big vision data and the development of artificial intelligence, it is important to investigate new theories, methods, and applications in computer vision.
Ouyang S., Hospedales T., Song Yi-Zhe, Li X., Loy C.C., Wang X. (2016) A survey on heterogeneous face recognition: Sketch, infra-red, 3D and low-resolution, Image and Vision Computing 56 pp. 28-48 Elsevier
Heterogeneous face recognition (HFR) refers to matching face imagery across different domains. It has received much interest from the research community as a result of its profound implications in law enforcement. A wide variety of new invariant features, cross-modality matching models and heterogeneous datasets have been established in recent years. This survey provides a comprehensive review of established techniques and recent developments in HFR. Moreover, we offer a detailed account of datasets and benchmarks commonly used for evaluation. We finish by assessing the state of the field and discussing promising directions for future research.
Li Y., Song Yi-Zhe, Hospedales T.M., Gong S. (2017) Free-Hand Sketch Synthesis with Deformable Stroke Models, International Journal of Computer Vision 122 pp. 169-190 Springer US
We present a generative model which can automatically summarize the stroke composition of free-hand sketches of a given category. When our model is fit to a collection of sketches with similar poses, it discovers and learns the structure and appearance of a set of coherent parts, with each part represented by a group of strokes. It represents both consistent (topology) as well as diverse aspects (structure and appearance variations) of each sketch category. Key to the success of our model are important insights learned from a comprehensive study performed on human stroke data. By fitting this model to images, we are able to synthesize visually similar and pleasant free-hand sketches. © 2016, The Author(s).
Qi Y., Guo J., Song Yi-Zhe, Xiang T., Zhang H., Tan Z.-H. (2015) Im2Sketch: Sketch generation by unconflicted perceptual grouping, Neurocomputing 165 pp. 338-349 Elsevier
Effectively solving the problem of sketch generation, which aims to produce human-drawing-like sketches from real photographs, opens the door for many vision applications such as sketch-based image retrieval and non-photorealistic rendering. In this paper, we approach automatic sketch generation from a human visual perception perspective. Instead of gathering insights from photographs, for the first time, we extract information from a large pool of human sketches. In particular, we study how multiple Gestalt rules can be encapsulated into a unified perceptual grouping framework for sketch generation. We further show that by solving the problem of Gestalt confliction, i.e., encoding the relative importance of each rule, sketches more similar to human-made ones can be generated. For that, we release a manually labeled sketch dataset of 96 object categories and 7680 sketches. A novel evaluation framework is proposed to quantify human likeness of machine-generated sketches by examining how well they can be classified using models trained from human data. Finally, we demonstrate the superiority of our sketches under the practical application of sketch-based image retrieval. © 2015 Elsevier B.V.
Zhao K., Zhang H., Ma Z., Song Yi-Zhe, Guo J. (2015) Multi-label learning with prior knowledge for facial expression analysis, Neurocomputing 157 pp. 280-289 Elsevier
Facial expression is one of the most expressive ways to display human emotions. Facial expression analysis (FEA) has been broadly studied in the past decades. In our daily life, few of the facial expressions are exactly one of the predefined affective states but are blends of several basic expressions. Even though the concept of 'blended emotions' was proposed years ago, most researchers have not yet dealt with FEA as a multiple-output problem. In this paper, a multi-label learning algorithm for FEA is proposed to solve this problem. Firstly, to depict facial expressions more effectively, we model FEA as a multi-label problem, which depicts all facial expressions with multiple continuous values and labels of predefined affective states. Secondly, in order to model FEA jointly with multiple outputs, multi-label Group Lasso regularized maximum margin classifier (GLMM) and Group Lasso regularized regression (GLR) algorithms are proposed which can analyze all facial expressions at one time instead of modeling as a binary learning problem. Thirdly, to improve the effectiveness of our proposed model used in video sequences, GLR is further extended to be a Total Variation and Group Lasso based regression model (GLTV) which adds a prior term (Total Variation term) in the original model. The JAFFE dataset and the extended Cohn-Kanade (CK+) dataset have been used to verify the superior performance of our approaches with commonly used criteria in the multi-label classification and regression realms.
Li Y., Hospedales T.M., Song Yi-Zhe, Gong S. (2015) Free-hand sketch recognition by multi-kernel feature learning, Computer Vision and Image Understanding 137 pp. 1-11 Elsevier
Free-hand sketch recognition has become increasingly popular due to the recent expansion of portable touchscreen devices. However, the problem is non-trivial due to the complexity of internal structures that leads to intra-class variations, coupled with the sparsity in visual cues that results in inter-class ambiguities. In order to address the structural complexity, a novel structured representation for sketches is proposed to capture the holistic structure of a sketch. Moreover, to overcome the visual cue sparsity problem and therefore achieve state-of-the-art recognition performance, we propose a Multiple Kernel Learning (MKL) framework for sketch recognition, fusing several features common to sketches. We evaluate the performance of all the proposed techniques on the most diverse sketch dataset to date (Mathias et al., 2012), and offer detailed and systematic analyses of the performance of different features and representations, including a breakdown by sketch-super-category. Finally, we investigate the use of attributes as a high-level feature for sketches and show how this complements low-level features for improving recognition performance under the MKL framework, and consequently explore novel applications such as attribute-based retrieval.
Song J., Song Yi-Zhe, Xiang T., Hospedales T., Ruan X. (2016) Deep Multi-task attribute-driven ranking for fine-grained sketch-based image retrieval, Proceedings of the 27th British Machine Vision Conference (BMVC) pp. 132.1-132.11 BMVA Press
Fine-grained sketch-based image retrieval (SBIR) aims to go beyond conventional SBIR to perform instance-level cross-domain retrieval: finding the specific photo that matches an input sketch. Existing methods focus on designing/learning good features for cross-domain matching and/or learning cross-domain matching functions. However, they neglect the semantic aspect of retrieval, i.e., what meaningful object properties does a user try to encode in her/his sketch? We propose a fine-grained SBIR model that exploits semantic attributes and deep feature learning in a complementary way. Specifically, we perform multi-task deep learning with three objectives, including: retrieval by fine-grained ranking on a learned representation, attribute prediction, and attribute-level ranking. Simultaneously predicting semantic attributes and using such predictions in the ranking procedure help retrieval results to be more semantically relevant. Importantly, the introduction of semantic attribute learning in the model allows for the elimination of the otherwise prohibitive cost of human annotations required for training a fine-grained deep ranking model. Experimental results demonstrate that our method outperforms the state-of-the-art on challenging fine-grained SBIR benchmarks while requiring less annotation.
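In spirit, the three objectives combine as a weighted sum of a triplet ranking loss on the shared embedding, an attribute-prediction loss and an attribute-level ranking loss. The tensors standing in for network outputs, the 15-attribute vocabulary and the loss weights below are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

# Stand-ins for network outputs on (sketch, matching photo, non-matching photo) triplets.
emb_s, emb_pos, emb_neg = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
attr_s = torch.sigmoid(torch.randn(8, 15))
attr_pos = torch.sigmoid(torch.randn(8, 15))
attr_neg = torch.sigmoid(torch.randn(8, 15))
attr_gt = torch.randint(0, 2, (8, 15)).float()        # ground-truth attributes of the sketched object

rank_loss = F.triplet_margin_loss(emb_s, emb_pos, emb_neg, margin=0.3)      # fine-grained ranking
attr_loss = F.binary_cross_entropy(attr_pos, attr_gt)                       # attribute prediction
attr_rank = F.triplet_margin_loss(attr_s, attr_pos, attr_neg, margin=0.1)   # attribute-level ranking

total = rank_loss + 0.5 * attr_loss + 0.5 * attr_rank
```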
Li K., Pang K., Song J., Song Yi-Zhe, Xiang T., Hospedales T.M., Zhang H. (2018) Universal sketch perceptual grouping, Lecture notes in computer science: Proceedings of the European Conference on Computer Vision (ECCV 2018) 11212 pp. 593-609 Springer
In this work we aim to develop a universal sketch grouper. That is, a grouper that can be applied to sketches of any category in any domain to group constituent strokes/segments into semantically meaningful object parts. The first obstacle to this goal is the lack of large-scale datasets with grouping annotation. To overcome this, we contribute the largest sketch perceptual grouping (SPG) dataset to date, consisting of 20,000 unique sketches evenly distributed over 25 object categories. Furthermore, we propose a novel deep universal perceptual grouping model. The model is learned with both generative and discriminative losses. The generative losses improve the generalisation ability of the model to unseen object categories and datasets. The discriminative losses include a local grouping loss and a novel global grouping loss to enforce global grouping consistency. We show that the proposed model significantly outperforms the state-of-the-art groupers. Further, we show that our grouper is useful for a number of sketch analysis tasks including sketch synthesis and fine-grained sketch-based image retrieval (FG-SBIR). © Springer Nature Switzerland AG 2018.
Song Yi-Zhe, Xiao B., Hall P., Wang L. (2010) In search of perceptually salient groupings, IEEE Transactions on Image Processing 20 (4) pp. 935-947 Institute of Electrical and Electronics Engineers (IEEE)
Finding meaningful groupings of image primitives has been a long-standing problem in computer vision. This paper studies how salient groupings can be produced using established theories in the field of visual perception alone. The major contribution is a novel definition of the Gestalt principle of Prägnanz, based upon Koffka's definition that image descriptions should be both stable and simple. Our method is global in the sense that it operates over all primitives in an image at once. It works regardless of the type of image primitives and is generally independent of image properties such as intensity, color, and texture. A novel experiment is designed to quantitatively evaluate the groupings output by our method, which takes human disagreement into account and is generic to outputs of any grouper. We also demonstrate the value of our method in an image segmentation application and quantitatively show that segmentations deliver promising results when benchmarked using the Berkeley Segmentation Dataset (BSDS).
Song Yi-Zhe, Pickup D., Li C., Rosin P., Hall P. (2013) Abstract art by shape classification, IEEE Transactions on Visualization and Computer Graphics 19 (8) pp. 1252-1263
This paper shows that classifying shapes is a tool useful in nonphotorealistic rendering (NPR) from photographs. Our classifier inputs regions from an image segmentation hierarchy and outputs the "best" fitting simple shape such as a circle, square, or triangle. Other approaches to NPR have recognized the benefits of segmentation, but none have classified the shape of segments. By doing so, we can create artwork of a more abstract nature, emulating the style of modern artists such as Matisse and other artists who favored shape simplification in their artwork. The classifier chooses the shape that "best" represents the region. Since the classifier is trained by a user, the "best shape" has a subjective quality that can over-ride measurements such as minimum error and more importantly captures user preferences. Once trained, the system is fully automatic, although simple user interaction is also possible to allow for differences in individual tastes. A gallery of results shows how this classifier contributes to NPR from images by producing abstract artwork.
Hall P.M., Collomosse J.P., Song Yi-Zhe, Shen P., Li C. (2007) RTcams: A new perspective on nonphotorealistic rendering from photographs, IEEE Transactions on Visualization and Computer Graphics 13 (5) pp. 966-979 Institute of Electrical and Electronics Engineers (IEEE)
We introduce a simple but versatile camera model that we call the Rational Tensor Camera (RTcam). RTcams are well principled mathematically and provably subsume several important contemporary camera models in both computer graphics and vision; their generality is one contribution. They can be used alone or compounded to produce more complicated visual effects. In this paper, we apply RTcams to generate synthetic artwork with novel perspective effects from real photographs. Existing Nonphotorealistic Rendering from Photographs (NPRP) is constrained to the projection inherent in the source photograph, which is most often linear. RTcams lift this restriction and so contribute to NPRP via multiperspective projection. This paper describes RTcams, compares them to contemporary alternatives, and discusses how to control them in practice. Illustrative examples are provided throughout.
Song Yi-Zhe, Bowen C.R., Kim H.A., Nassehi A., Padget J., Gathercole N., Dent A. (2014) Non-invasive damage detection in beams using marker extraction and wavelets, Mechanical Systems and Signal Processing 49 (1-2) pp. 13-23 Elsevier
For structural health monitoring applications there is a need for simple and contact-less methods of Non-Destructive Evaluation (NDE). A number of damage detection techniques have been developed, such as frequency shift, generalised fractal dimension and wavelet transforms with the aim to identify, locate and determine the severity of damage in a material or structure. These techniques are often tailored for factors such as (i) type of material, (ii) damage pattern (crack, delamination), and (iii) the nature of any input signals (space and time). This paper describes and evaluates a wavelet-based damage detection framework that locates damage on cantilevered beams via NDE using computer vision technologies. The novelty of the approach is the use of computer vision algorithms for the contact-less acquisition of modal shapes. Using the proposed method, the modal shapes of cantilever beams are reconstructed by extracting markers using sub-pixel Hough Transforms from images captured using conventional slow motion cameras. The extracted modal shapes are then used as an input for wavelet transform damage detection, exploiting both discrete and continuous variants. The experimental results are verified and compared against finite element analysis. The methodology enables a non-invasive damage detection system that avoids the need for expensive equipment or the attachment of sensors to the structure. Two types of damage are investigated in our experiments: (i) defects induced by removing material to reduce the stiffness of a steel beam and (ii) delaminations in a (0/90/0/90/0)s composite laminate. Results show successful detection of notch depths of 5%, 28% and 50% for the steel beam and of 30 mm delaminations in central and outer layers for the composite laminate.
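To give a flavour of the wavelet stage, the snippet below runs a continuous wavelet transform over a synthetic mode shape containing a small local slope discontinuity and reports where the fine-scale coefficients peak. The synthetic shape, the wavelet choice and the simple peak-picking are illustrative assumptions; in the paper the mode shapes come from markers extracted from slow-motion video.

```python
import numpy as np
import pywt

# Synthetic first mode shape of a cantilever sampled at 400 positions along the span,
# with a small slope discontinuity at 60% span standing in for a local stiffness defect.
x = np.linspace(0.0, 1.0, 400)
mode = 1.0 - np.cos(np.pi * x / 2.0)            # smooth, undamaged baseline shape
kink = int(0.6 * x.size)
mode[kink:] += 0.05 * (x[kink:] - x[kink])      # slope change where the stiffness drops

coeffs, _ = pywt.cwt(mode, scales=np.arange(1, 17), wavelet='gaus2')
fine = np.abs(coeffs[:4]).max(axis=0)           # fine scales react sharply to the discontinuity
valid = slice(40, -40)                          # ignore boundary effects of the transform
print("suspected damage near x =", x[valid][fine[valid].argmax()])   # ~0.6
```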
Li D., Yang Y., Song Yi-Zhe, Hospedales T.M. (2017) Deeper, Broader and Artier Domain Generalization, Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017) pp. 5543-5551 Institute of Electrical and Electronics Engineers Inc.
The problem of domain generalization is to learn from multiple training domains, and extract a domain-agnostic model that can then be applied to an unseen domain. Domain generalization (DG) has a clear motivation in contexts where there are target domains with distinct characteristics, yet sparse data for training. For example recognition in sketch images, which are distinctly more abstract and rarer than photos. Nevertheless, DG methods have primarily been evaluated on photo-only benchmarks focusing on alleviating the dataset bias where both problems of domain distinctiveness and data sparsity can be minimal. We argue that these benchmarks are overly straightforward, and show that simple deep learning baselines perform surprisingly well on them. In this paper, we make two main contributions: Firstly, we build upon the favorable domain shift-robust properties of deep learning methods, and develop a low-rank parameterized CNN model for end-to-end DG learning. Secondly, we develop a DG benchmark dataset covering photo, sketch, cartoon and painting domains. This is both more practically relevant, and harder (bigger domain shift) than existing benchmarks. The results show that our method outperforms existing DG alternatives, and our dataset provides a more significant DG challenge to drive future research.
Song Yi-Zhe, Bowen C.R., Kim A.H., Nassehi A., Padget J., Gathercole N. (2014) Virtual visual sensors and their application in structural health monitoring, Structural Health Monitoring 13 (3) pp. 251-264 SAGE Publications Ltd
Wireless sensor networks are being increasingly accepted as an effective tool for structural health monitoring. The ability to deploy a wireless array of sensors efficiently and effectively is a key factor in structural health monitoring. Sensor installation and management can be difficult in practice for a variety of reasons: a hostile environment, high labour costs and bandwidth limitations. We present and evaluate a proof-of-concept application of virtual visual sensors to the well-known engineering problem of the cantilever beam, as a convenient physical sensor substitute for certain problems and environments. We demonstrate the effectiveness of virtual visual sensors as a means to achieve non-destructive evaluation. Major benefits of virtual visual sensors are their non-invasive nature, ease of installation and cost-effectiveness. The novelty of virtual visual sensors lies in the combination of marker extraction with visual tracking realised by modern computer vision algorithms. We demonstrate that by deploying a collection of virtual visual sensors on an oscillating structure, its modal shapes and frequencies can be readily extracted from a sequence of video images. Subsequently, we perform damage detection and localisation by means of a wavelet-based analysis. The contributions of this article are as follows: (1) use of a sub-pixel accuracy marker extraction algorithm to construct virtual sensors in the spatial domain, (2) embedding dynamic marker linking within a tracking-by-correspondence paradigm that offers benefits in computational efficiency and registration accuracy over traditional tracking-by-searching systems and (3) validation of virtual visual sensors in the context of a structural health monitoring application.
Song Yi-Zhe, Li C., Wang L., Hall P., Shen P. (2012) Robust visual tracking using structural region hierarchy and graph matching, Neurocomputing 89 pp. 12-20 Elsevier
Visual tracking aims to match objects of interest in consecutive video frames. This paper proposes a novel and robust algorithm to address the problem of object tracking. To this end, we investigate the fusion of state-of-the-art image segmentation hierarchies and graph matching. More specifically, (i) we represent the object to be tracked using a hierarchy of regions, each of which is described with a combined feature set of SIFT descriptors and color histograms; (ii) we formulate the tracking process as a graph matching problem, which is solved by minimizing an energy function incorporating appearance and geometry contexts; and (iii) more importantly, an effective graph updating mechanism is proposed to adapt to the object changes over time for ensuring the tracking robustness. Experiments are carried out on several challenging sequences and results show that our method performs well in terms of object tracking, even in the presence of variations of scale and illumination, moving camera, occlusion, and background clutter.
Li K., Pang K., Song Yi-Zhe, Hospedales T., Zhang H., Hu Y. (2016) Fine-grained sketch-based image retrieval: The role of part-aware attributes, Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016) pp. 579-587 Institute of Electrical and Electronics Engineers (IEEE)
We study the problem of fine-grained sketch-based image retrieval. By performing instance-level (rather than category-level) retrieval, it embodies a timely and practical application, particularly with the ubiquitous availability of touchscreens. Three factors contribute to the challenging nature of the problem: (i) free-hand sketches are inherently abstract and iconic, making visual comparisons with photos more difficult, (ii) sketches and photos are in two different visual domains, i.e. black and white lines vs. color pixels, and (iii) fine-grained distinctions are especially challenging when executed across domain and abstraction-level. To address this, we propose to detect visual attributes at part-level, in order to build a new representation that not only captures fine-grained characteristics but also traverses across visual domains. More specifically, (i) we propose a dataset with 304 photos and 912 sketches, where each sketch and photo is annotated with its semantic parts and associated part-level attributes, and with the help of this dataset, we investigate (ii) how strongly-supervised deformable part-based models can be learned that subsequently enable automatic detection of part-level attributes, and (iii) a novel matching framework that synergistically integrates low-level features, mid-level geometric structure and high-level semantic attributes to boost retrieval performance. Extensive experiments conducted on our new dataset demonstrate value of the proposed method.
Zhang H., Zhao K., Song Yi-Zhe, Guo J. (2013) Text extraction from natural scene image: A survey, Neurocomputing 122 pp. 310-323 Elsevier
With the increasing popularity of portable camera devices and embedded visual processing, text extraction from natural scene images has become a key problem that is deemed to change our everyday lives via novel applications such as augmented reality. Algorithms for text extraction from natural scene images are generally composed of the following three stages: (i) detection and localization, (ii) text enhancement and segmentation and (iii) optical character recognition (OCR). The problem is challenging in nature due to variations in the font size and color, text alignment, illumination change and reflections. This paper aims to classify and assess the latest algorithms. More specifically, we draw attention to studies on the first two steps in the extraction process, since OCR is a well-studied area where powerful algorithms already exist. This paper also offers researchers a link to a public image database for the assessment of text extraction algorithms on natural scene images.
Li C., Deussen O., Song Yi-Zhe, Willis P., Hall P. (2011) Modeling and generating moving trees from video, ACM Transactions on Graphics 30 (6) 127 pp. 1-11 Association for Computing Machinery (ACM)
We present a probabilistic approach for the automatic production of tree models with convincing 3D appearance and motion. The only input is a video of a moving tree that provides us an initial dynamic tree model, which is used to generate new individual trees of the same type. Our approach combines global and local constraints to construct a dynamic 3D tree model from a 2D skeleton. Our modeling takes into account factors such as the shape of branches, the overall shape of the tree, and physically plausible motion. Furthermore, we provide a generative model that creates multiple trees in 3D, given a single example model. This means that users no longer have to make each tree individually, or specify rules to make new trees. Results with different species are presented and compared to both reference input data and state of the art alternatives.
Zou C., Yu Q., Du R., Mo H., Song Yi-Zhe, Xiang T., Gao C., Chen B., Zhang H. (2018) Sketchyscene: Richly-annotated scene sketches, Lecture notes in computer science: Proceedings of the European Conference on Computer Vision (ECCV 2018) 11219 pp. 438-454 Springer Verlag
We contribute the first large-scale dataset of scene sketches, SketchyScene, with the goal of advancing research on sketch understanding at both the object and scene level. The dataset is created through a novel and carefully designed crowdsourcing pipeline, enabling users to efficiently generate large quantities of realistic and diverse scene sketches. SketchyScene contains more than 29,000 scene-level sketches, 7,000+ pairs of scene templates and photos, and 11,000+ object sketches. All objects in the scene sketches have ground-truth semantic and instance masks. The dataset is also highly scalable and extensible, easily allowing scene compositions to be augmented and/or changed. We demonstrate the potential impact of SketchyScene by training new computational models for semantic segmentation of scene sketches and showing how the new dataset enables several applications, including image retrieval, sketch colorization, editing, and captioning. The dataset and code can be found at https://github.com/SketchyScene/SketchyScene.
Qi Y., Guo J., Li Y., Zhang H., Xiang T., Song Yi-Zhe, Tan Z.-H. (2014) Perceptual grouping via untangling Gestalt principles, Proceedings of the 2013 Visual Communications and Image Processing (VCIP 2013) pp. 1-6 Institute of Electrical and Electronics Engineers (IEEE)
Gestalt principles, a set of conjoining rules derived from human visual studies, have been known to play an important role in computer vision. Many applications such as image segmentation, contour grouping and scene understanding often rely on such rules to work. However, the problem of Gestalt confliction, i.e., the relative importance of each rule compared with another, remains unsolved. In this paper, we investigate the problem of perceptual grouping by quantifying the confliction among three commonly used rules: similarity, continuity and proximity. More specifically, we propose to quantify the importance of Gestalt rules by solving a learning to rank problem, and formulate a multi-label graph-cuts algorithm to group image primitives while taking into account the learned Gestalt confliction. Our experimental results confirm the existence of Gestalt confliction in perceptual grouping and demonstrate an improved performance when such a confliction is accounted for via the proposed grouping algorithm. Finally, a novel cross-domain image classification method is proposed by exploiting perceptual grouping as a representation.
Song Yi-Zhe, Hall P.M. (2008) Stable image descriptions using Gestalt principles, Proceedings of the 4th International Symposium on Visual Computing (ISVC 2008) Part I 5358 (PART 1) pp. 318-327 Springer Verlag
This paper addresses the problem of grouping image primitives; its principal contribution is an explicit definition of the Gestalt principle of Pragnanz, which organizes primitives into descriptions of images that are both simple and stable. Our definition of Pragnanz assumes just two things: that a vector of free variables controls some general grouping algorithm, and a scalar function measures the information in a grouping. Stable descriptions exist where the gradient of the function is zero, and these can be ordered by information content (simplicity) to create a "grouping" or "Gestalt" scale description. We provide a simple measure for information in a grouping based on its structure alone, leaving our grouper free to exploit other Gestalt principles as we see fit. We demonstrate the value of our definition of Pragnanz on several real-world images.
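In symbols (our notation, not necessarily the paper's): if \theta denotes the grouper's vector of free variables and I(\theta) the scalar information measure of the resulting grouping, then stable descriptions are the critical points

\nabla_{\theta} I(\theta) = 0,

and ordering these critical points by I(\theta) (simplicity) yields the "grouping" or "Gestalt" scale description referred to above.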
Ouyang S., Hospedales T.M., Song Yi-Zhe, Li X. (2017) ForgetMeNot: Memory-aware forensic facial sketch matching, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016) 2016-D pp. 5571-5579 IEEE Computer Society
We investigate whether it is possible to improve the performance of automated facial forensic sketch matching by learning from examples of facial forgetting over time. Forensic facial sketch recognition is a key capability for law enforcement, but remains an unsolved problem. It is extremely challenging because there are three distinct contributors to the domain gap between forensic sketches and photos: the well-studied sketch-photo modality gap, and the less studied gaps due to (i) the forgetting process of the eye-witness and (ii) their inability to elucidate their memory. In this paper, we address the memory problem head on by introducing a database of 400 forensic sketches created at different time-delays. Based on this database we build a model to reverse the forgetting process. Surprisingly, we show that it is possible to systematically 'un-forget' facial details. Moreover, it is possible to apply this model to dramatically improve forensic sketch recognition in practice: we achieve state-of-the-art results when matching 195 benchmark forensic sketches against corresponding photos and a 10,030 mugshot database.
Song Yi-Zhe, Arbelaez P., Hall P., Li C., Balikai A. (2010) Finding semantic structures in image hierarchies using Laplacian graph energy, Lecture Notes in Computer Science - proceedings of the 11th European Conference on Computer Vision (ECCV 2010) part IV 6314 (PART 4) pp. 694-707 Springer Verlag
Many segmentation algorithms describe images in terms of a hierarchy of regions. Although such hierarchies can produce state-of-the-art segmentations and have many applications, they often contain more data than is required for an efficient description. This paper shows that Laplacian graph energy is a generic measure that can be used to identify semantic structures within hierarchies, independently of the algorithm that produces them. Quantitative experimental validation using hierarchies from two state-of-the-art algorithms shows that we can reduce the number of levels and regions in a hierarchy by an order of magnitude with little or no loss in performance when compared against human-produced ground truth. We provide a tracking application that illustrates the value of reduced hierarchies.
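For orientation, a commonly used definition of Laplacian graph energy (given here as background; the paper applies such a measure to the graphs induced by the hierarchy): for a graph G with n vertices, m edges and Laplacian eigenvalues \mu_1, \dots, \mu_n,

LE(G) = \sum_{i=1}^{n} \left| \mu_i - \frac{2m}{n} \right|,

where 2m/n is the average vertex degree.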
Li Y., Song Yi-Zhe, Gong S. (2013) Sketch recognition by ensemble matching of structured features, Proceedings of the British Machine Vision Conference 2013 (BMVC 2013) pp. 35.1-35.11 British Machine Vision Association (BMVA)
Sketch recognition aims to automatically classify human hand sketches of objects into known categories. This has increasingly become a desirable capability due to recent advances in human computer interaction on portable devices. The problem is nontrivial because of the sparse and abstract nature of hand drawings as compared to photographic images of objects, compounded by a highly variable degree of detail in human sketches. To this end, we present a method for the representation and matching of sketches by exploiting not only local features but also global structures of sketches, through a star graph based ensemble matching strategy. Different local feature representations were evaluated using the star graph model to demonstrate the effectiveness of the ensemble matching of structured features. We further show that by encapsulating holistic structure matching and learned bag-of-features models into a single framework, notable recognition performance improvement over the state-of-the-art can be observed. Extensive comparative experiments were carried out using the largest sketch dataset currently available, released by Eitz et al. [15], with over 20,000 sketches of 250 object categories generated by AMT (Amazon Mechanical Turk) crowd-sourcing.
Zhao K., Zhang H., Dong M., Guo J., Qi Y., Song Yi-Zhe (2014) A multi-label classification approach for Facial Expression Recognition, Proceedings of the 2013 Visual Communications and Image Processing (VCIP 2013) Institute of Electrical and Electronics Engineers (IEEE)
Facial Expression Recognition (FER) techniques have already been adopted in numerous multimedia systems. Much previous research assumes that each facial picture should be linked to only one of the predefined affective labels. Nevertheless, in practical applications, few expressions correspond exactly to one of the predefined affective states. Therefore, to depict facial expressions more accurately, this paper proposes a multi-label classification approach for FER in which each facial expression may be labeled with one or multiple affective states. Meanwhile, by modeling the relationship between labels via a Group Lasso regularization term, a maximum-margin multi-label classifier is presented, and the convex optimization formulation guarantees a globally optimal solution. To evaluate the performance of our classifier, the JAFFE dataset is extended into a multi-label facial expression dataset by thresholding the continuous labels provided in the original dataset; the labeling results show that multiple labels yield a far more accurate description of facial expression. At the same time, the classification results verify the superior performance of our algorithm.
Qi Y., Zhang H., Song Yi-Zhe, Tan Z. (2015) A patch-based sparse representation for sketch recognition, Proceedings of the 2014 4th IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2014) pp. 343-346 Institute of Electrical and Electronics Engineers (IEEE)
Categorizing free-hand human sketches has profound implications in applications such as human computer interaction and image retrieval. The task is non-trivial due to the iconic nature of sketches, signified by large variances in both appearance and structure when compared with photographs. One of the most fundamental problems is how to effectively describe a sketch image. Many existing descriptors, such as histogram of oriented gradients (HOG) and shape context (SC), have achieved great success. Moreover, some works have attempted to design features specifically engineered for sketches, such as the symmetric-aware flip invariant sketch histogram (SYM-FISH). We present a novel patch-based sparse representation (PSR) for describing sketch images and evaluate it within a sketch recognition framework. Extensive experiments on a large-scale human-drawn sketch dataset demonstrate the effectiveness of the proposed method.
Song Yi-Zhe, Bowen C., Kim H.A., Nassehi A., Padget J., Gathercore N., Dent A. (2011) Non-invasive damage detection in composite beams using marker extraction and wavelets, Proceedings of the SPIE Smart Structures and Materials and Nondestructive Evaluation and Health Monitoring 2011 7983 SPIE
Simple and contactless methods for determining the health of metallic and composite structures are necessary to allow non-invasive Non-Destructive Evaluation (NDE) of damaged structures. Many recognized damage detection techniques, such as frequency shift, generalized fractal dimension and wavelet transform, have been described with the aim of identifying and locating damage and determining its severity. These techniques are often tailored for factors such as (i) type of material, (ii) damage patterns (crack, impact damage, delamination), and (iii) nature of input signals (space and time). In this paper, a wavelet-based damage detection framework that locates damage on cantilevered composite beams via NDE using computer vision technologies is presented. Two types of damage have been investigated in this research: (i) defects induced by removing material to reduce stiffness in a metallic beam and (ii) manufactured delaminations in a composite laminate. The novelty in the proposed approach is the use of bespoke computer vision algorithms for the contactless acquisition of modal shapes, a task that is commonly regarded as a barrier to practical damage detection. Using the proposed method, it is demonstrated that modal shapes of cantilever beams can be readily reconstructed by extracting markers using the Hough transform from images captured using conventional slow-motion cameras. This avoids the need for expensive equipment such as laser Doppler vibrometers. The extracted modal shapes are then used as input to a wavelet-transform damage detection, exploiting both discrete and continuous variants. The experimental results are verified using finite element models (FEM).
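As a rough illustration of the continuous-wavelet step, the following is a minimal sketch (not the authors' code): the mode shape and the "damage" are synthetic placeholders for the marker-extracted data described above, and PyWavelets is assumed for the transform.

```python
# Minimal sketch: locating a local anomaly in a 1-D mode shape with a
# continuous wavelet transform (CWT). Mode shape and damage are synthetic.
import numpy as np
import pywt

x = np.linspace(0.0, 1.0, 256)            # normalised position along the beam
mode_shape = np.sin(np.pi * x / 2.0)      # idealised cantilever-like deflection
mode_shape[150:156] += 0.01               # small local perturbation standing in for damage

scales = np.arange(1, 33)
coeffs, _ = pywt.cwt(mode_shape, scales, "mexh")   # Mexican-hat CWT

# Damage shows up as a localised ridge in |coeffs| at fine scales; ignore the
# boundary region where wavelet edge effects dominate.
interior = slice(8, -8)
damage_index = np.abs(coeffs[:8, interior]).sum(axis=0)
print("suspected damage location (normalised):", x[interior][np.argmax(damage_index)])
```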
Zhao K., Cao C., Zhang H., Song Yi-Zhe (2013) A dataset for scene classification based on camera metadata, Proceedings of the 2012 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2012) pp. 443-447 Institute of Electrical and Electronics Engineers (IEEE)
In this paper, we introduce a new dataset for scene classification based on camera metadata. It covers the most commonly studied scene types in recent research and consists of 12 scene categories, each containing 500 to 2,000 images. Most images are high resolution, e.g. 2000×2000 pixels. The images are original, i.e. each retains its camera metadata (EXIF). The variety of image types, the metadata cues attached to each photo, and the strict definitions separating scenes make this dataset a very challenging testbed for photo classification. We supply the scene photos together with scene labels and a methodology for extracting the EXIF information, and we have so far applied the dataset to semantic scene classification.
Shen P., Zhang L., Wang Z., Shao W., Hu Z., Zhang X., Song Yi-Zhe (2013) A novel interpolation algorithm for nonlinear omni-catadioptric images, Proceedings of the 2012 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2012) pp. 508-511 Institute of Electrical and Electronics Engineers (IEEE)
This paper proposes a novel omni-image interpolation technique and a new method to evaluate its performance. Omni-images are taken by non-linear catadioptric cameras and offer important scientific and engineering benefits, but often at the expense of reduced visual accuracy. Interpolation algorithms that improve the resolution of omni-images are an effective way to compensate for this lack of visual content. The main contribution of this paper is an interpolation algorithm that not only enhances visual quality but also maintains camera parameters. Camera properties of the interpolated images are preserved by exploiting the epipolar geometry constraint of non-linear images. In our experiments, the proposed method performed well on four sets of omni-images compared with the standard bilinear and bicubic algorithms.
Xu P., Yin Q., Qi Y., Song Yi-Zhe, Ma Z., Wang L., Guo J. (2016) Instance-level coupled subspace learning for fine-grained sketch-based image retrieval, Lecture Notes in Computer Science - proceedings of the Computer Vision ECCV 2016 Workshops (ECCV 2016) Part I 9913 pp. 19-34 Springer Verlag
Fine-grained sketch-based image retrieval (FG-SBIR) is a newly emerged topic in computer vision. The problem is challenging because, in addition to bridging the sketch-photo domain gap, it also asks for instance-level discrimination within object categories. Most prior approaches focused on feature engineering and fine-grained ranking, yet neglected an important and central problem: how to establish a fine-grained cross-domain feature space to conduct retrieval. In this paper, for the first time we formulate a cross-domain framework specifically designed for the task of FG-SBIR that simultaneously conducts instance-level retrieval and attribute prediction. Unlike conventional photo-text cross-domain frameworks that perform transfer on category-level data, our joint multi-view space uniquely learns from the instance-level pair-wise annotations of sketch and photo. More specifically, we propose a joint view selection and attribute subspace learning algorithm to learn domain projection matrices for photo and sketch, respectively. It follows that visual attributes can be extracted from such matrices through projection to build a coupled semantic space in which to conduct retrieval. Experimental results on two recently released fine-grained photo-sketch datasets show that the proposed method performs at a level close to that of deep models, while removing the need for extensive manual annotations.
Qi Y., Guo J., Li Y., Zhang H., Xiang T., Song Yi-Zhe (2014) Sketching by perceptual grouping, Proceedings of the 2013 IEEE International Conference on Image Processing (ICIP 2013) pp. 270-274 Institute of Electrical and Electronics Engineers (IEEE)
Sketches have been used to render the visual world since prehistoric times, and have become ubiquitous with the increasing availability of touchscreens on portable devices. However, how to automatically map images to sketches, a problem with profound implications for applications such as sketch-based image retrieval, remains open. In this paper, we propose a novel method that draws a sketch automatically from a single natural image. Sketch extraction is posed within a unified contour grouping framework, where perceptual grouping is first used to form contour segment groups, followed by a group-based contour simplification method that generates the final sketches. In our experiments, for the first time we pose sketch evaluation as a sketch-based object recognition problem, and the results validate the effectiveness of our system over state-of-the-art alternatives.
Li W., Song Yi-Zhe, Cavallaro A. (2015) Refining graph matching using inherent structure information, Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME 2015) 2015-A pp. 187-192 IEEE Computer Society
We present a graph matching refinement framework that improves the performance of a given graph matching algorithm. Our method synergistically uses the inherent structure information embedded globally in the active association graph, and locally on each individual graph. The combination of such information reveals how consistent each candidate match is with its global and local contexts. In doing so, the proposed method removes most false matches and improves precision. The validation on standard benchmark datasets demonstrates the effectiveness of our method.
Ouyang S., Hospedales T., Song Yi-Zhe, Li X. (2015) Cross-Modal face matching: Beyond viewed sketches, Lecture Notes in Computer Science - proceedings of the 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014 - Revised Selected Papers, Part II 9004 pp. 210-225 Springer Verlag
Matching face images across different modalities is a challenging open problem for various reasons, notably feature heterogeneity and, particularly in the case of sketch recognition, abstraction, exaggeration and distortion. Existing studies have attempted to address this task by engineering invariant features, or learning a common subspace between the modalities. In this paper, we take a different approach and explore learning a mid-level representation within each domain that allows faces in each modality to be compared in a domain invariant way. In particular, we investigate sketch-photo face matching and go beyond the well-studied viewed sketches to tackle forensic sketches and caricatures where representations are often symbolic. We approach this by learning a facial attribute model independently in each domain that represents faces in terms of semantic properties. This representation is thus more invariant to heterogeneity and distortions, and robust to mis-alignment. Our intermediate-level attribute representation is then integrated synergistically with the original low-level features using CCA. Our framework shows impressive results on cross-modal matching tasks using forensic sketches, and even more challenging caricature sketches. Furthermore, we create a new dataset with approximately 59,000 attribute annotations for evaluation and to facilitate future research.
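To make the CCA fusion step concrete, here is a minimal sketch (illustrative only): the feature matrices are random placeholders for the attribute-plus-low-level descriptors, and scikit-learn's CCA stands in for whichever CCA implementation the authors used.

```python
# Minimal sketch: project sketch and photo descriptors into a correlated
# common space with CCA, then match a probe sketch against a photo gallery
# by cosine similarity. All features are random placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
sketch_feats = rng.normal(size=(200, 128))   # one row per training sketch
photo_feats = rng.normal(size=(200, 128))    # corresponding photos (paired rows)

cca = CCA(n_components=32)
cca.fit(sketch_feats, photo_feats)

# Project a query sketch and the photo gallery into the shared space.
q_s, gallery_p = cca.transform(sketch_feats[:1], photo_feats)
q_s /= np.linalg.norm(q_s)
gallery_p /= np.linalg.norm(gallery_p, axis=1, keepdims=True)
ranking = np.argsort(-(gallery_p @ q_s.ravel()))
print("top-5 gallery matches:", ranking[:5])
```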
Qi Y., Song Yi-Zhe, Zhang H., Liu J. (2017) Sketch-based image retrieval via Siamese convolutional neural network, Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP 2016) 2016-A pp. 2460-2464 IEEE Computer Society
Sketch-based image retrieval (SBIR) is a challenging task due to the ambiguity inherent in sketches when compared with photos. In this paper, we propose a novel convolutional neural network based on the Siamese network for SBIR. The main idea is to pull output feature vectors closer for input sketch-image pairs that are labeled as similar, and push them apart if irrelevant. This is achieved by jointly tuning two convolutional neural networks that are linked by a single loss function. Experimental results on Flickr15K demonstrate that the proposed method offers better performance when compared with several state-of-the-art approaches. © 2016 IEEE.
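A minimal PyTorch sketch of the Siamese idea described above (not the published architecture; the toy encoder, input sizes and margin are our own choices):

```python
# Minimal sketch: two weight-shared encoders and a contrastive loss that pulls
# matching sketch-photo pairs together and pushes mismatched pairs apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=1)

def contrastive_loss(f_sketch, f_photo, same, margin=0.5):
    d = F.pairwise_distance(f_sketch, f_photo)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

encoder = Encoder()                        # shared weights for both branches
sketch = torch.randn(8, 1, 64, 64)         # toy batch of sketches
photo = torch.randn(8, 1, 64, 64)          # toy batch of photo edge-maps
same = torch.randint(0, 2, (8,)).float()   # 1 = matching pair, 0 = non-matching
loss = contrastive_loss(encoder(sketch), encoder(photo), same)
loss.backward()
```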
Qi Y., Song Yi-Zhe, Xiang T., Zhang H., Hospedales T., Li Y., Guo J. (2015) Making better use of edges via perceptual grouping, Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015) pp. 1856-1865 IEEE Computer Society
We propose a perceptual grouping framework that organizes image edges into meaningful structures and demonstrate its usefulness on various computer vision tasks. Our grouper formulates edge grouping as a graph partition problem, where a learning to rank method is developed to encode probabilities of candidate edge pairs. In particular, RankSVM is employed for the first time to combine multiple Gestalt principles as cues for edge grouping. Afterwards, an edge grouping based object proposal measure is introduced that yields proposals comparable to state-of-the-art alternatives. We further show how human-like sketches can be generated from edge groupings and consequently used to deliver state-of-the-art sketch-based image retrieval performance. Last but not least, we tackle the problem of freehand human sketch segmentation by utilizing the proposed grouper to cluster strokes into semantic object parts.
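As an illustration of the RankSVM component, the following sketch learns a linear weighting of three Gestalt cues from pairwise comparisons (a simplified stand-in; the actual cue definitions and training pairs come from the paper, and the cue values below are random placeholders):

```python
# Minimal RankSVM-style sketch: learn a linear weighting of Gestalt cues
# (e.g. proximity, continuity, similarity) so that edge pairs that should be
# grouped score higher than pairs that should not.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
cues_pos = rng.uniform(0.4, 1.0, size=(300, 3))   # cues for pairs that belong together
cues_neg = rng.uniform(0.0, 0.6, size=(300, 3))   # cues for pairs that do not

# Pairwise (ranking) transform: classify the sign of cue differences.
diffs = np.vstack([cues_pos - cues_neg, cues_neg - cues_pos])
labels = np.hstack([np.ones(300), -np.ones(300)])

ranker = LinearSVC(C=1.0).fit(diffs, labels)
w = ranker.coef_.ravel()
print("learned cue weights:", w)

# Grouping score for a new candidate edge pair:
print("grouping score:", w @ np.array([0.8, 0.7, 0.9]))
```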
Yu Q., Liu F., Song Yi-Zhe, Xiang T., Hospedales T.M., Loy C.C. (2017) Sketch me that shoe, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016) 2016-D pp. 799-807 IEEE Computer Society
We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images. This is an extremely challenging task because (i) visual comparisons not only need to be fine-grained but also executed cross-domain, (ii) free-hand (finger) sketches are highly abstract, making fine-grained matching harder, and most importantly (iii) annotated cross-domain sketch-photo datasets required for training are scarce, challenging many state-of-the-art machine learning techniques. In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based image retrieval application. We introduce a new database of 1,432 sketch-photo pairs from two categories with 32,000 fine-grained triplet ranking annotations. We then develop a deep triplet-ranking model for instance-level SBIR with a novel data augmentation and staged pre-training strategy to alleviate the issue of insufficient fine-grained training data. Extensive experiments are carried out to contribute a variety of insights into the challenges of data sufficiency and over-fitting avoidance when training deep networks for fine-grained cross-domain ranking tasks.
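A minimal sketch of the triplet-ranking objective (the embeddings below are random placeholders for the outputs of the deep branches, and the margin value is our own choice):

```python
# Minimal sketch: given an anchor sketch, a matching photo and a non-matching
# photo, the triplet loss enforces that the match is closer than the
# non-match by a margin.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)
anchor_sketch = torch.randn(16, 128, requires_grad=True)  # embedded query sketches
positive_photo = torch.randn(16, 128)                     # embedded true matches
negative_photo = torch.randn(16, 128)                     # embedded distractors

loss = triplet(anchor_sketch, positive_photo, negative_photo)
loss.backward()
print(float(loss))
```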
Xu Y.-J., Xue C., Li G.-F., Song Yi-Zhe (2015) Discovering social relationship between city regions using human mobility, Proceedings of the 11th EAI International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness pp. 104-109 Institute of Electrical and Electronics Engineers (IEEE)
The development of a city gradually fosters different functional regions, and between these regions there exists different social information due to human activities. In this paper, a Region Activation Entropy Model (RAEM) is proposed to discover the social relations hidden between the regions. Specifically, we segment a city into coherent regions according to the base station (BS) positions and detect the stay and passing regions in trajectories of mobile phone users. We regard one user's trajectory as a short document and take the stay regions in the trajectory as words, so that we can use Natural Language Processing (NLP) methods to discover the relations between regions. Furthermore, the Region Activation Force (RAF) is defined to measure the intensity of the relationship between regions. By measuring the Region Activation Entropy (RAE) based on RAF, we find an 88% potential predictability in regional mobility. The results generated by RAEM can benefit a variety of applications, including city planning, choosing locations for businesses, and predicting human movement. We evaluated our method using a one-month-long record collected by mobile phone carriers. We believe our findings offer a new perspective on human mobility research.
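As a generic illustration of the entropy computation (the exact RAF and RAE definitions are those of the paper; the activation-force matrix below is a random placeholder):

```python
# Illustrative sketch: per-region entropy over normalised region-to-region
# activation strengths; lower entropy means a region's relations are more
# predictable.
import numpy as np

rng = np.random.default_rng(0)
raf = rng.random((5, 5))                 # toy region-to-region activation forces
np.fill_diagonal(raf, 0.0)

p = raf / raf.sum(axis=1, keepdims=True)          # per-region distributions
logp = np.log2(np.where(p > 0, p, 1.0))           # log(1) = 0 handles zero entries
rae = -(p * logp).sum(axis=1)                      # Shannon entropy in bits
print("region activation entropy per region:", np.round(rae, 3))
```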
Keitler P., Pankratz F., Schwerdtfeger B., Pustka D., Rodiger W., Klinker G., Rauch C., Chathoth A., Collomosse J., Song Yi-Zhe (2010) Mobile augmented reality based 3D snapshots, Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2009) pp. 199-200 Institute of Electrical and Electronics Engineers (IEEE)
We describe a mobile augmented reality application that is based on 3D snapshotting using multiple photographs. Optical square markers provide the anchor for reconstructed virtual objects in the scene. A novel approach based on pixel flow substantially improves tracking performance. This dual tracking approach also allows for a new single-button user interface metaphor for moving virtual objects in the scene. The development of the AR viewer was accompanied by user studies confirming the chosen approach.
Song Yi-Zhe (2013) Simple art as abstractions of photographs, Proceedings of the 13th Symposium on Computational Aesthetics (CAE 13) pp. 77-85 Association for Computing Machinery (ACM)
This paper shows that it is possible to semi-automatically process photographs into Simple Art. Simple Art is a term that we use to refer to a group of artistic styles such as child art, cave art, and the work of fine artists exemplified by Joan Miró. None of these styles has been previously studied by the NPR community. Our contribution is to provide a process that makes them accessible. We describe a method that automatically constructs a hierarchical model of an input photograph and asks a user to identify objects inside it. Each object is a sub-tree, which can be rendered under user control. The method is demonstrated using emulations of Simple Art. We include an assessment of our results against a set of norms recommended by a cultural historian. We conclude that producing Simple Art raises important technical questions, especially surrounding the interplay between computational modelling and human abstractions.
Qi Y., Zheng W.-S., Xiang T., Song Yi-Zhe, Zhang H., Guo J. (2014) One-shot learning of sketch categories with co-regularized sparse coding, Lecture Notes in Computer Science - proceedings of the 10th International Symposium on Visual Computing (ISVC 2014) Part II 8888 pp. 74-84 Springer Verlag
Categorizing free-hand human sketches has profound implications in applications such as human computer interaction and image retrieval. The task is non-trivial due to the iconic nature of sketches, signified by large variances in both appearance and structure when compared with photographs. Prior works often utilize off-the-shelf low-level features and assume the availability of a large training set, rendering them sensitive towards abstraction and less scalable to new categories. To overcome this limitation, we propose a transfer learning framework which enables one-shot learning of sketch categories. The framework is based on a novel co-regularized sparse coding model which exploits common/shareable parts among human sketches of seen categories and transfers them to unseen categories. We contribute a new dataset consisting of 7,760 human segmented sketches from 97 object categories. Extensive experiments reveal that the proposed method can classify unseen sketch categories given just one training sample with a 33.04% accuracy, offering a two-fold improvement over baselines.
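A much-simplified sketch of this pipeline (the paper's co-regularised, cross-category model is richer; here a single dictionary learned with scikit-learn stands in, and all features are random placeholders):

```python
# Simplified sketch: learn a part-like dictionary on seen-category sketch
# features, sparse-code the single exemplar of an unseen category, and compare
# new sketches to it in code space.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
seen_feats = rng.normal(size=(200, 64))       # features from seen categories

dico = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50,
                          transform_algorithm="lasso_lars", random_state=0)
dico.fit(seen_feats)

one_shot_code = dico.transform(rng.normal(size=(1, 64)))    # unseen-class exemplar
test_code = dico.transform(rng.normal(size=(1, 64)))        # query sketch
print("distance to the one-shot exemplar:",
      round(float(np.linalg.norm(one_shot_code - test_code)), 3))
```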
Xiao B., Song Yi-Zhe, Hall P.M. (2007) Learning object classes from structure, Proceedings of the 18th British Machine Vision Conference (BMVC 2007) pp. 27.1-27.10 BMVA Press
The problem of identifying the class of an object from its visual appearance has received significant attention recently. Most of the work to date is premised on photometric measures, often building codebooks made from interest regions. All of it has been tested only on photographs, so far as we know. Our approach differs in two significant ways. First, we do not build a codebook of interest regions but instead make use of a hierarchical description of an image based on a watershed transform. Root nodes in the hierarchy are putative objects to be classified. Second, we classify these putative objects using a vector of fixed length that represents the structure of the hierarchy below the node. This allows us to classify not just photographs, but also paintings and drawings of visual objects.
Balikai A., Rosin P., Song Yi-Zhe, Hall P. (2008) Shapes fit for purpose, Proceedings of the 19th British Machine Vision Conference (BMVC 2008) pp. 45.1-45.10 BMVA Press
This paper is about shape fitting to regions that segment an image, and some applications that rely on the abstraction this offers. The novelty lies in three areas: (1) we fit a shape drawn from a selection of shape families, not just one class of shape, using a supervised classifier; (2) we use results from the classifier to match photographs and artwork of particular objects using a few qualitative shapes, which overcomes the significant differences between photographs and paintings; (3) we further use the shape classifier to process photographs into abstract synthetic art which, so far as we know, is novel too. Thus we use our shape classifier in both discriminative (matching) and generative (image synthesis) tasks. We conclude the level of abstraction offered by our shape classifier is novel and useful.
Song Yi-Zhe, Town C.P. (2005) Visual recognition of man-made materials and structures in an office environment, Proceedings of the 2nd International Conference on Video, Vision and Graphics (VVG 2005), Edinburgh, UK pp. 159-166 Eurographics: European Association for Computer Graphics
This paper demonstrates a new approach towards object recognition founded on the development of Neural Network classifiers and Bayesian Networks. The mapping from segmented image region descriptors to semantically meaningful class membership terms is achieved using Neural Networks. Bayesian Networks are then employed to probabilistically detect objects within an image by means of relating region class labels and their surrounding environments. Furthermore, it makes use of an intermediate level of image representation and demonstrates how object recognition can be achieved in this way.
Song Yi-Zhe (2019) Goal-Driven Sequential Data Abstraction, Proceedings of the International Conference on Computer Vision (ICCV 2019) Institute of Electrical and Electronics Engineers (IEEE)
Automatic data abstraction is an important capability for both benchmarking machine intelligence and supporting summarization applications. In the former, one asks whether a machine can 'understand' enough about the meaning of input data to produce a meaningful but more compact abstraction. In the latter, this capability is exploited for saving space or human time by summarizing the essence of input data. In this paper we study a general reinforcement learning based framework for learning to abstract sequential data in a goal-driven way. The ability to define different abstraction goals uniquely allows different aspects of the input data to be preserved according to the ultimate purpose of the abstraction. Our reinforcement learning objective does not require human-defined examples of ideal abstraction. Importantly, our model processes the input sequence holistically without being constrained by the original input order. Our framework is also domain agnostic: we demonstrate applications to sketch, video and text data and achieve promising results in all domains.
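A toy REINFORCE sketch of the goal-driven idea (not the paper's model): a policy scores sequence elements, samples a compact subset, and is rewarded for how well the subset preserves a goal quantity, here the full sequence's mean standing in for a task-specific goal.

```python
# Toy sketch of goal-driven abstraction via REINFORCE (illustrative only).
import torch

torch.manual_seed(0)
seq = torch.randn(50)                        # toy input sequence
scores = torch.zeros(50, requires_grad=True) # per-element keep scores (the "policy")
opt = torch.optim.Adam([scores], lr=0.1)

for step in range(200):
    probs = torch.sigmoid(scores)
    keep = torch.bernoulli(probs.detach())   # sampled abstraction (kept elements)
    logp = (keep * torch.log(probs + 1e-8) +
            (1 - keep) * torch.log(1 - probs + 1e-8)).sum()
    kept = seq[keep.bool()]
    fidelity = -(kept.mean() - seq.mean()).abs() if kept.numel() else torch.tensor(-1.0)
    reward = fidelity - 0.02 * keep.sum()    # goal fidelity minus a length penalty
    opt.zero_grad()
    (-reward.detach() * logp).backward()     # REINFORCE gradient estimate
    opt.step()

print("fraction of elements kept:", float(torch.sigmoid(scores).mean()))
```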
Song Yi-Zhe (2019) Episodic Training for Domain Generalization, Proceedings of the International Conference on Computer Vision (ICCV 2019) Institute of Electrical and Electronics Engineers (IEEE)
Domain generalization (DG) is the challenging and topical problem of learning models that generalize to novel testing domains with different statistics than a set of known training domains. The simple approach of aggregating data from all source domains and training a single deep neural network end-to-end on all the data provides a surprisingly strong baseline that surpasses many prior published methods. In this paper we build on this strong baseline by designing an episodic training procedure that trains a single deep network in a way that exposes it to the domain shift that characterises a novel domain at runtime. Specifically, we decompose a deep network into feature extractor and classifier components, and then train each component by simulating it interacting with a partner who is badly tuned for the current domain. This makes both components more robust, ultimately leading to our networks producing state-of-the-art performance on three DG benchmarks. Furthermore, we consider the pervasive workflow of using an ImageNet trained CNN as a fixed feature extractor for downstream recognition tasks. Using the Visual Decathlon benchmark, we demonstrate that our episodic-DG training improves the performance of such a general purpose feature extractor by explicitly training a feature for robustness to novel problems. This shows that DG training can benefit standard practice in computer vision.
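A compact toy sketch of the episodic pairing (not the authors' code; in the full method the domain-specific partners are themselves trained on their own domains, which is omitted here, and all data are random placeholders):

```python
# Toy sketch: the shared feature extractor is updated through a classifier
# trained for a *different* domain, so its features must remain usable under
# a mismatched partner.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_domains, n_classes, dim = 3, 4, 16
shared_feat = nn.Linear(dim, 8)                                     # shared feature extractor
domain_clfs = [nn.Linear(8, n_classes) for _ in range(n_domains)]   # per-domain classifiers
opt = torch.optim.SGD(shared_feat.parameters(), lr=0.01)
ce = nn.CrossEntropyLoss()

for step in range(100):
    d = step % n_domains                                 # current training domain
    other = (d + 1) % n_domains                          # mismatched partner classifier
    x = torch.randn(32, dim)                             # toy batch from domain d
    y = torch.randint(0, n_classes, (32,))
    logits = domain_clfs[other](shared_feat(x))          # episodic pairing
    loss = ce(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()                                           # only the shared extractor updates
```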