Xiatian Zhu is a senior lecturer at the Surrey Institute of People-Centred AI and the Centre for Vision, Speech and Signal Processing (CVSSP), Faculty of Engineering and Physical Sciences, University of Surrey. He is interested in fundamental and scalable machine learning, including unsupervised learning, zero/few-shot learning, transfer learning, multi-modality learning, and embedded AI. He won the 2016 Sullivan Doctoral Thesis Prize, awarded annually to the best doctoral thesis in computer vision in the UK, and has published over 70 peer-reviewed papers with over 7,000 citations (h-index 39).
Areas of specialism
Xiatian Zhu has extensive research experience in visual surveillance, aiming to benefit society and human wellbeing through enabling AI technologies. He has worked on two EU Security Programme projects on street-scene behaviour recognition and airborne multi-sensor object detection and tracking in video.
As a research scientist at the Samsung AI Centre, Cambridge, he contributed significantly to several "human-centric AI" projects that enrich people's lives by developing compute-efficient, supervision-efficient and data-efficient AI technologies.
Person images captured by public surveillance cameras often have low resolutions (LRs), along with uncontrolled pose variations, background clutter and occlusion. These issues cause a resolution mismatch problem when LR probe images are matched against high-resolution (HR) gallery images (typically available during collection), harming person re-identification (re-id) performance. While a number of methods based on the joint learning of super-resolution and person re-id have been introduced, they ignore the discriminant identity information specifically encoded in LR person images, leading to suboptimal performance. In this work, we propose a novel joint bilateral-resolution identity modeling method that concurrently performs HR-specific identity feature learning with super-resolution, LR-specific identity feature learning, and person re-id optimization. We also introduce an adaptive ensemble algorithm for handling different low resolutions. Extensive evaluations validate the advantages of our method over related state-of-the-art re-id and super-resolution methods on cross-resolution re-id benchmarks. An important discovery is that leveraging LR-specific identity information enables a simple cascade of super-resolution and person re-id learning to achieve state-of-the-art performance, without elaborate model design or bells and whistles, which has not been investigated before.
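The bilateral-resolution idea above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the upscaler, the projection-based "feature extractors" and all shapes are hypothetical stand-ins, chosen only to show the cascade of super-resolution plus an LR-specific identity branch fused for matching.

```python
import numpy as np

rng = np.random.default_rng(0)

def upscale(img, factor=2):
    # Naive nearest-neighbour stand-in for a learned super-resolution model.
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def extract_feature(img, proj):
    # Stand-in feature extractor: flatten, project, L2-normalise.
    v = proj @ img.ravel()
    return v / np.linalg.norm(v)

# Hypothetical shapes: an 8x8 LR probe and three 16x16 HR gallery images.
lr_probe = rng.random((8, 8))
hr_gallery = rng.random((3, 16, 16))

proj = rng.random((32, 16 * 16))     # HR-branch projection (16x16 inputs)
lr_proj = rng.random((32, 8 * 8))    # LR-specific branch projection

# Cascade: super-resolve the probe, then fuse HR-specific and LR-specific
# identity features into one bilateral-resolution representation.
probe_feat = np.concatenate([
    extract_feature(upscale(lr_probe), proj),   # HR-specific identity feature
    extract_feature(lr_probe, lr_proj),         # LR-specific identity feature
])
probe_feat /= np.linalg.norm(probe_feat)

def gallery_feature(img):
    # Gallery images are HR; their LR branch uses a downsampled view.
    f = np.concatenate([
        extract_feature(img, proj),
        extract_feature(img[::2, ::2], lr_proj),
    ])
    return f / np.linalg.norm(f)

scores = np.array([probe_feat @ gallery_feature(g) for g in hr_gallery])
ranking = np.argsort(-scores)  # best match first
```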
Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand, unsupervised re-id methods require no identity labels, but they usually suffer from markedly inferior performance. To overcome these fundamental limitations, we propose a novel person re-identification paradigm based on an idea of independent per-camera identity annotation. This eliminates the most time-consuming and tedious inter-camera identity labelling process, significantly reducing the amount of human annotation effort. Consequently, it gives rise to a more scalable and more feasible setting, which we call Intra-Camera Supervised (ICS) person re-id, for which we formulate a Multi-tAsk mulTi-labEl (MATE) deep learning method. Specifically, MATE is designed for self-discovering the cross-camera identity correspondence in a per-camera multi-task inference framework. Extensive experiments demonstrate the cost-effectiveness superiority of our method over the alternative approaches on three large person re-id datasets. For example, MATE yields 88.7% rank-1 score on Market-1501 in the proposed ICS person re-id setting, significantly outperforming unsupervised learning models and closely approaching conventional fully supervised learning competitors.
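A toy sketch of the per-camera multi-task idea, assuming a hypothetical setup where each camera has its own classifier head over independently-assigned within-camera identities: cross-camera correspondence is then self-discovered by matching classifier weights across cameras by cosine similarity. This is only illustrative, not the MATE model itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a shared embedding of dimension `dim`, plus one
# classifier per camera trained only on that camera's identity labels.
n_cams, ids_per_cam, dim = 2, 4, 16

# Stand-in per-camera classifier weights (one row per within-camera identity).
cam_heads = [rng.normal(size=(ids_per_cam, dim)) for _ in range(n_cams)]

def normalise(m):
    # Row-wise L2 normalisation so dot products become cosine similarities.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cross-camera identity association: match each identity class of camera 0
# to its most similar class of camera 1 via classifier-weight similarity.
w0, w1 = normalise(cam_heads[0]), normalise(cam_heads[1])
sim = w0 @ w1.T                # (ids_per_cam, ids_per_cam) similarity matrix
match = sim.argmax(axis=1)     # camera-1 class matched to each camera-0 class
```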
Most existing person re-identification (re-id) methods assume supervised model training on a separate large set of training samples from the target domain. While performing well in the training domain, such trained models seldom generalise to a new target domain without further labelled training data from that domain. To solve this scalability limitation, we develop a novel Hierarchical Unsupervised Domain Adaptation (HUDA) method. It can transfer labelled information of an existing dataset (a source domain) to an unlabelled target domain for unsupervised person re-id. Specifically, HUDA is designed to jointly model global distribution alignment and local instance alignment in a two-level hierarchy for discovering transferable source knowledge in unsupervised domain adaptation. Crucially, this approach aims to overcome the under-constrained learning problem of existing unsupervised domain adaptation methods. Extensive evaluations show the superiority of HUDA for unsupervised cross-domain person re-id over a wide variety of state-of-the-art methods on four re-id benchmarks: Market-1501, DukeMTMC, MSMT17 and CUHK03.
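The two-level alignment can be sketched minimally in NumPy, assuming random stand-in features for both domains: a global term penalising the gap between domain-level feature means, and a local term pulling each target instance towards its nearest source instance. Both terms and the weighting are illustrative, not HUDA's actual objective.

```python
import numpy as np

rng = np.random.default_rng(8)

source = rng.normal(0.0, 1.0, size=(50, 8))   # labelled source-domain features
target = rng.normal(0.5, 1.0, size=(50, 8))   # unlabelled target-domain features

# Global distribution alignment (illustrative): gap between domain-level
# feature means, a simple maximum-mean-discrepancy-style term.
global_gap = float(np.linalg.norm(source.mean(axis=0) - target.mean(axis=0)))

# Local instance alignment (illustrative): each target instance is measured
# against its nearest source instance.
dists = np.linalg.norm(target[:, None, :] - source[None, :, :], axis=2)
local_gap = float(dists.min(axis=1).mean())

# A two-level hierarchical objective combining both alignment terms.
hier_loss = global_gap + 0.1 * local_gap
```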
This work introduces a novel data augmentation method for few-shot website fingerprinting (WF) attack where only a handful of training samples per website are available for deep learning model optimization. Moving beyond earlier WF methods relying on manually-engineered feature representations, more advanced deep learning alternatives demonstrate that learning feature representations automatically from training data is superior. Nonetheless, this advantage rests on the unrealistic assumption that many training samples per website are available, without which it disappears. To address this, we introduce a model-agnostic, efficient, and harmonious data augmentation (HDA) method that can improve deep WF attacking methods significantly. HDA involves both intra-sample and inter-sample data transformations that can be used in a harmonious manner to expand a tiny training dataset to an arbitrarily large collection, therefore effectively and explicitly addressing the intrinsic data scarcity problem. We conducted extensive experiments to validate our HDA for boosting state-of-the-art deep learning WF attack models in both closed-world and open-world attacking scenarios, in both the absence and presence of strong defenses. For instance, in the more challenging and realistic evaluation scenario with WTF-PAD-based defense, our HDA method surpasses the previous state-of-the-art results by nearly 3% in classification accuracy in the 20-shot learning case.
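The intra-sample/inter-sample split can be illustrated with a small sketch on toy traffic traces. The specific transforms here (a random circular shift within a trace, and a mixup-style blend of two same-class traces) are hypothetical examples of the two families, not HDA's actual transformation set.

```python
import numpy as np

rng = np.random.default_rng(2)

def intra_aug(trace, max_shift=5):
    # Intra-sample transform (illustrative): random circular shift of one trace.
    return np.roll(trace, rng.integers(-max_shift, max_shift + 1))

def inter_aug(trace_a, trace_b, alpha=0.5):
    # Inter-sample transform (illustrative): mixup-style blend of two traces
    # drawn from the same website class.
    lam = rng.beta(alpha, alpha)
    return lam * trace_a + (1 - lam) * trace_b

# Hypothetical few-shot setting: 2 traces of +/-1 packet directions per
# website, expanded by composing both transforms harmoniously.
shots = rng.choice([-1, 1], size=(2, 100)).astype(float)
augmented = [inter_aug(intra_aug(shots[0]), intra_aug(shots[1]))
             for _ in range(10)]
```

Because each call draws fresh random shifts and mixing weights, the tiny two-shot set can be expanded to an arbitrarily large augmented collection.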
Modelling long-range contextual relationships is critical for pixel-wise prediction tasks such as semantic segmentation. However, convolutional neural networks (CNNs) are inherently limited in modelling such dependencies due to the naive structure of their building modules (e.g., the local convolution kernel). While recent global aggregation methods are beneficial for modelling long-range structural information, they can oversmooth and introduce noise in regions containing fine details (e.g., boundaries and small objects), which matter greatly in semantic segmentation. To alleviate this problem, we propose to exploit the local context so that the aggregated long-range relationships are distributed more accurately within local regions. In particular, we design a novel local distribution module which adaptively models, for each pixel, the affinity map between the global and local relationships. Integrated with existing global aggregation modules, our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks, giving rise to the GALD networks. Despite its simplicity and versatility, our approach sets a new state of the art on major semantic segmentation benchmarks including Cityscapes, ADE20K, Pascal Context, CamVid and COCO-Stuff. Code and trained models are released at https://github.com/lxtGH/GALD-DGCNet to foster further research.
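As a rough intuition for a global-aggregation-plus-local-distribution block, here is a NumPy sketch in which a per-pixel affinity gates how strongly an aggregated global feature is injected back at each pixel. The mean-pooling "global module", the sigmoid affinity and all shapes are simplified assumptions, not the GALD architecture.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical shapes: C channels over an HxW feature map, flattened spatially.
C, H, W = 4, 6, 6
local = rng.random((C, H * W))                   # per-pixel local features

# Global aggregation (stand-in for any global module): mean over all pixels.
global_feat = local.mean(axis=1, keepdims=True)  # (C, 1)

# Local distribution: a per-pixel affinity between the local feature and the
# global feature gates how much global context is distributed at that pixel,
# protecting fine-detail regions from being oversmoothed uniformly.
affinity = 1.0 / (1.0 + np.exp(-(local * global_feat).sum(axis=0)))  # (H*W,)
out = local + affinity * global_feat             # gated residual, broadcast
```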
Existing person search methods typically focus on improving person detection accuracy. This ignores model inference efficiency, which, however, is critically important for real-world applications. In this work, we address this limitation by investigating the scalability problem of person search, involving both model accuracy and inference efficiency simultaneously. Specifically, we formulate a Hierarchical Distillation Learning (HDL) approach. With HDL, we aim to comprehensively distil the knowledge of a strong teacher model into a lightweight student model with weaker learning capacity. To facilitate the HDL process, we design a simple and powerful teacher model for joint learning of person detection and person re-identification matching in unconstrained scene images. Extensive experiments show the modelling advantages and cost-effectiveness superiority of HDL over the state-of-the-art person search methods on three large person search benchmarks: CUHK-SYSU, PRW, and DukeMTMC-PS.
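A generic teacher-to-student distillation objective can be sketched as follows; the temperature, the feature-matching term and their weighting are common choices in the distillation literature, shown here as assumptions rather than HDL's exact hierarchical losses.

```python
import numpy as np

def softmax(x, t=1.0):
    # Temperature-scaled softmax; higher t softens the distribution.
    z = np.exp((x - x.max()) / t)
    return z / z.sum()

def distill_loss(teacher_logits, student_logits, teacher_feat, student_feat,
                 temperature=4.0, w_feat=0.5):
    # Logit-level distillation: KL divergence between softened distributions.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = float(np.sum(p * (np.log(p) - np.log(q))))
    # Feature-level distillation: mean squared error between embeddings.
    mse = float(np.mean((teacher_feat - student_feat) ** 2))
    return kl + w_feat * mse

rng = np.random.default_rng(3)
loss = distill_loss(rng.normal(size=5), rng.normal(size=5),
                    rng.normal(size=8), rng.normal(size=8))
```

Combining losses at several levels (logits, intermediate features) is what makes such a scheme hierarchical: the lightweight student is supervised by the teacher's behaviour at each level rather than by the final output alone.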
Existing logo detection methods mostly rely on supervised learning with a large quantity of labelled training data in limited classes. This restricts their scalability to a large number of logo classes under a limited labelling budget. In this work, we consider a more scalable open logo detection problem where only a fraction of logo classes are fully labelled whilst the remaining classes are only annotated with a clean icon image (i.e. 1-shot icon supervised). To generalise and transfer knowledge of fully supervised logo classes to the 1-shot icon supervised classes, we propose a Multi-Perspective Cross-Class (MPCC) domain adaptation method. Following a data augmentation principle, MPCC conducts feature distribution alignment from two perspectives. Specifically, we align the feature distribution between synthetic logo images of 1-shot icon supervised classes and genuine logo images of fully supervised classes, and that between logo images and non-logo images, concurrently. This allows for mitigating the domain shift problem between model training and testing on 1-shot icon supervised logo classes, simultaneously reducing the model overfitting towards fully labelled logo classes. Extensive comparative experiments show the advantage of MPCC over existing state-of-the-art competitors on the challenging QMUL-OpenLogo benchmark (Su et al., 2018).
Contemporary person re-identification (re-id) methods mostly compute a feature representation of each person image in the query set and the gallery set independently. This strategy fails to consider the ranking context information of each probe image in the query set, represented implicitly by the whole gallery set. Some recent re-ranking re-id methods therefore take a post-processing strategy to exploit such contextual information for improving re-id matching performance. However, post-processing is independent of model training, without jointly optimising the re-id feature and the ranking context information for better compatibility. In this work, for the first time, we show that the appearance feature and the ranking context information can be jointly optimised for learning more discriminative representations and achieving superior matching accuracy. Specifically, we propose to learn a hybrid ranking representation for person re-id with a two-stream architecture: (1) in the external stream, we use the ranking list of each probe image to learn plausible visual variations among the top ranks from the gallery as the external ranking information; (2) in the internal stream, we employ the part-based fine-grained feature as the internal ranking information, which mitigates the harm of incorrect matches in the ranking list. Assembling these two streams generates a hybrid ranking representation for person matching. Extensive experiments demonstrate the superiority of our method over the state-of-the-art methods on four large-scale re-id benchmarks (Market-1501, DukeMTMC-ReID, CUHK03 and MSMT17), under both supervised and unsupervised settings.
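A minimal sketch of the two-stream assembly, under simplifying assumptions: the external stream is approximated by averaging the probe's top-k gallery neighbours, and the probe's own whole-image feature stands in for the part-based internal stream. In the actual method both streams are learned jointly end to end; this only illustrates how ranking context and appearance are fused into one representation.

```python
import numpy as np

rng = np.random.default_rng(7)

def normalise(v):
    # L2-normalise so dot products act as cosine similarities.
    return v / np.linalg.norm(v)

# Hypothetical features: one probe and a small gallery of ten images.
probe = normalise(rng.random(16))
gallery = np.stack([normalise(rng.random(16)) for _ in range(10)])

# External stream (illustrative): aggregate the probe's top-k gallery
# neighbours, capturing ranking context from the gallery as a whole.
k = 3
topk = np.argsort(-(gallery @ probe))[:k]
external = normalise(gallery[topk].mean(axis=0))

# Internal stream: the probe's own feature (a stand-in for the part-based
# fine-grained representation that guards against bad ranks).
hybrid = normalise(np.concatenate([probe, external]))
```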
Website fingerprinting (WF) attack aims to identify which website a user is visiting from traffic data patterns. Whilst existing methods assume many training samples, we investigate a more realistic and scalable few-shot WF attack with only a few labeled training samples per website. To solve this problem, we introduce a novel Meta-Bias Learning (MBL) method for few-shot WF learning. Taking a meta-learning strategy, MBL simulates the target tasks during training and optimizes for rapid adaptation to them. Moreover, a new model parameter factorization idea is introduced to facilitate meta-training with superior task adaptation. Extensive experiments show that our MBL significantly outperforms existing hand-crafted feature and deep learning based alternatives in both closed-world and open-world attack scenarios, in both the absence and presence of defenses.
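One way to picture parameter factorization for fast adaptation, sketched here on a toy linear model: the weights are meta-learned and shared across tasks, while only a small per-task bias is updated in the inner loop. The factorization into "shared weights + adaptable bias" and the gradient steps below are an illustrative simplification, not MBL's formulation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical factorization: shared weights w are meta-learned across tasks;
# a scalar per-task bias b is all that adapts in the inner loop.
dim, lr_inner = 3, 0.1
w = rng.normal(size=dim)

def inner_adapt(w, x, y, steps=10):
    # Inner-loop adaptation: gradient descent on the bias only.
    b = 0.0
    for _ in range(steps):
        grad = 2 * np.mean((x @ w + b) - y)   # d(MSE)/db
        b -= lr_inner * grad
    return b

# Simulated task: same weights, but targets shifted by a task-specific offset.
x = rng.normal(size=(20, dim))
y = x @ w + 1.5                                # true task bias is 1.5
b = inner_adapt(w, x, y)                       # b converges towards 1.5
```

Because only the bias is adapted, each new task (website) needs very few samples and gradient steps, which is the appeal in the few-shot setting.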
Human pose estimation has achieved significant progress on images with high imaging resolution. However, low-resolution imagery brings nontrivial challenges which are still under-studied. To fill this gap, we start by investigating existing methods, revealing that the dominant heatmap-based methods suffer more severe performance degradation at low resolutions, and that offset learning is an effective remedy. Building on this observation, in this work we propose a novel Confidence-Aware Learning (CAL) method which further addresses two fundamental limitations of existing offset learning methods: inconsistency between training and testing, and decoupled heatmap and offset learning. Specifically, CAL selectively weighs the learning of the heatmap and the offset with respect to the ground truth and the most confident prediction, whilst capturing the statistical importance of the model output in a mini-batch learning manner. Extensive experiments conducted on the COCO benchmark show that our method significantly outperforms the state-of-the-art methods for low-resolution human pose estimation.
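To make the coupling of heatmap and offset learning concrete, here is a toy NumPy loss in which the offset error at each location is weighted by the model's own heatmap confidence there, so the two predictions are no longer optimised in isolation. The specific weighting scheme is an illustrative assumption, not CAL's actual confidence-aware formulation.

```python
import numpy as np

def cal_style_loss(heatmap, gt_heatmap, offset, gt_offset):
    # Heatmap term: plain MSE against the ground-truth heatmap.
    l_heat = float(np.mean((heatmap - gt_heatmap) ** 2))
    # Confidence-weighted offset term (illustrative): each location's offset
    # error is scaled by the predicted heatmap confidence at that location,
    # coupling offset learning to where the model is most confident.
    conf = heatmap / (heatmap.sum() + 1e-8)
    l_off = float(np.sum(conf * np.abs(offset - gt_offset)))
    return l_heat + l_off

rng = np.random.default_rng(5)
h = rng.random((8, 8))
loss = cal_style_loss(h, rng.random((8, 8)),
                      rng.random((8, 8)), rng.random((8, 8)))
```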