Areas of specialism

Machine learning; Computer Vision; Multimodal learning


Research interests


Lihua Zhou, Siying Xiao, Mao Ye, Xiatian Zhu, Shuaifeng Li (2023)Adaptive Mutual Learning for Unsupervised Domain Adaptation, In: IEEE Transactions on Circuits and Systems for Video Technology Institute of Electrical and Electronics Engineers (IEEE)

Unsupervised domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. The semi-supervised method based on the mean-teacher framework is one of the mainstream approaches. By enforcing consistency constraints, it is hoped that the teacher network will distill useful source domain knowledge to the student network. However, in practice negative transfer often emerges because the performance of the teacher network is not guaranteed to be always better than that of the student network. To address this limitation, a novel Adaptive Mutual Learning (AML) strategy is proposed in this paper. Specifically, given a target sample, the network with the worse prediction will be optimized by pushing its prediction close to the better prediction. This is in the spirit of traditional knowledge distillation. On the other hand, the network with the better prediction is further refined by requiring its prediction to stay away from the worse prediction. This can be regarded conceptually as reverse knowledge distillation. In this way, the two networks learn from each other according to their respective performance. At the inference phase, the averaged output of these two networks can be taken as the final prediction. Experimental results demonstrate that our AML achieves competitive results.
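The pull/push idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how the "better" prediction is identified or which divergence is used, so the choices below are assumptions made purely for illustration: lower entropy stands in for "better", KL divergence measures the gap between predictions, and the function names are hypothetical. It is not the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mutual_learning_terms(pred_a, pred_b):
    """Return (better, pull_loss, push_loss) for one target sample.

    ASSUMPTION: the lower-entropy prediction is treated as the better one.
    pull_loss would be minimised by the worse network (move towards the better
    prediction); push_loss would be minimised by the better network (move away
    from the worse prediction).
    """
    if entropy(pred_a) <= entropy(pred_b):
        better, p_better, p_worse = "a", pred_a, pred_b
    else:
        better, p_better, p_worse = "b", pred_b, pred_a
    divergence = kl_divergence(p_better, p_worse)
    return better, divergence, -divergence


def averaged_prediction(pred_a, pred_b):
    """Inference-time output: element-wise mean of the two predictions."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]


# Toy two-class predictions from two networks for the same target sample.
pred_a = [0.9, 0.1]
pred_b = [0.6, 0.4]
better, pull_loss, push_loss = mutual_learning_terms(pred_a, pred_b)
final = averaged_prediction(pred_a, pred_b)
```

In a real training loop the two loss terms would be applied to different networks with gradients stopped on the other side; that bookkeeping is omitted here.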

Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang (2023)Post-Processing Temporal Action Detection

Existing Temporal Action Detection (TAD) methods typically take a pre-processing step in converting an input varying-length video into a fixed-length snippet representation sequence, before temporal boundary estimation and action classification. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. In essence, this is due to a temporal quantization error introduced during the resolution downsampling and recovery. This could negatively impact the TAD performance, but is largely ignored by existing methods. To address this problem, in this work we introduce a novel model-agnostic post-processing method without model redesign and retraining. Specifically, we model the start and end points of action instances with a Gaussian distribution for enabling temporal boundary inference at a sub-snippet level. We further introduce an efficient Taylor-expansion based approximation, dubbed as Gaussian Approximated Post-processing (GAP). Extensive experiments demonstrate that our GAP can consistently improve a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2%∼0.7% in average mAP) and THUMOS (+0.2%∼0.5% in average mAP) benchmarks. Such performance gains are already significant and highly comparable to those achieved by novel model designs. Also, GAP can be integrated with model training for further performance gain. Importantly, GAP enables lower temporal resolutions for more efficient inference, facilitating low-resource applications. The code will be available in https://github.com/sauradip/GAP
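One plausible reading of the Gaussian plus Taylor-expansion idea is sketched below. It assumes that per-snippet boundary scores near the peak follow a Gaussian shape, so that a second-order Taylor expansion of the log-scores at the discrete peak gives a closed-form sub-snippet location. The central finite-difference scheme and the function name `refine_boundary` are assumptions for illustration and are not taken from the paper or its repository.

```python
import math


def refine_boundary(scores, eps=1e-10):
    """Estimate the peak location of `scores` at sub-snippet precision.

    ASSUMPTION: scores near the peak are roughly Gaussian, so their log is
    roughly quadratic and one Newton step from the discrete peak recovers
    the Gaussian mean: mu = x - f'(x) / f''(x), with f = log(score).
    """
    peak = max(range(len(scores)), key=lambda i: scores[i])
    if peak == 0 or peak == len(scores) - 1:
        return float(peak)  # no neighbours on both sides to differentiate
    log_prev = math.log(scores[peak - 1] + eps)
    log_curr = math.log(scores[peak] + eps)
    log_next = math.log(scores[peak + 1] + eps)
    first = (log_next - log_prev) / 2.0            # central first derivative
    second = log_next - 2.0 * log_curr + log_prev  # central second derivative
    if second >= 0:
        return float(peak)  # not a concave peak; keep the discrete estimate
    return peak - first / second


# Synthetic start-point scores sampled from a Gaussian centred at 3.3 snippets.
true_centre, sigma = 3.3, 1.5
example_scores = [math.exp(-((i - true_centre) ** 2) / (2 * sigma ** 2)) for i in range(8)]
refined = refine_boundary(example_scores)
```

On this synthetic input the discrete argmax is snippet 3, while the refined estimate recovers the fractional centre, which is the kind of quantization error the abstract describes.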

Shuaifeng Li, Mao Ye, Xiatian Zhu, Lihua Zhou, Lin Xiong (2022)Source-Free Object Detection by Learning to Overlook Domain Style, In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)pp. 8014-8023 Institute of Electrical and Electronics Engineers (IEEE)

Source-free object detection (SFOD) needs to adapt a detector pre-trained on a labeled source domain to a target domain, with only unlabeled training data from the target domain. Existing SFOD methods typically adopt the pseudo labeling paradigm with model adaptation alternating between predicting pseudo labels and fine-tuning the model. This approach suffers from both unsatisfactory accuracy of pseudo labels due to the presence of domain shift and limited use of target domain training data. In this work, we present a novel Learning to Overlook Domain Style (LODS) method with such limitations solved in a principled manner. Our idea is to reduce the domain shift effect by forcing the model to overlook the target domain style, such that model adaptation is simplified and becomes easier to carry out. To that end, we enhance the style of each target domain image and leverage the style degree difference between the original image and the enhanced image as a self-supervised signal for model adaptation. By treating the enhanced image as an auxiliary view, we exploit a student-teacher architecture for learning to overlook the style degree difference against the original image, also characterized with a novel style enhancement algorithm and graph alignment constraint. Extensive experiments demonstrate that our LODS yields new state-of-the-art performance on four benchmarks.

Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang (2023)FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. They differ drastically in each individual input/output format and dataset size. It has been common to design a task-specific model and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and inability to exploit inter-task relatedness. To address such issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL applies a single model for multiple heterogeneous fashion tasks, therefore being much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL

Guile Wu, XIATIAN ZHU, Shaogang Gong (2022)Learning hybrid ranking representation for person re-identification, In: Pattern recognition [e-journal]121108239 Elsevier

Contemporary person re-identification (re-id) methods mostly compute independently a feature representation of each person image in the query set and the gallery set. This strategy fails to consider any ranking context information of each probe image in the query set represented implicitly by the whole gallery set. Some recent re-ranking re-id methods therefore propose to take a post-processing strategy to exploit such contextual information for improving re-id matching performance. However, post-processing is independent of model training without jointly optimising the re-id feature and the ranking context information for better compatibility. In this work, for the first time, we show that the appearance feature and the ranking context information can be jointly optimised for learning more discriminative representations and achieving superior matching accuracy. Specifically, we propose to learn a hybrid ranking representation for person re-id with a two-stream architecture: (1) In the external stream, we use the ranking list of each probe image to learn plausible visual variations among the top ranks from the gallery as the external ranking information; (2) In the internal stream, we employ the part-based fine-grained feature as the internal ranking information, which mitigates the harm of incorrect matches in the ranking list. Assembling these two streams generates a hybrid ranking representation for person matching. Extensive experiments demonstrate the superiority of our method over the state-of-the-art methods on four large-scale re-id benchmarks (Market-1501, DukeMTMC-ReID, CUHK03 and MSMT17), under both supervised and unsupervised settings.

Wei Li, Shaogang Gong, XIATIAN ZHU (2021)Hierarchical distillation learning for scalable person search, In: Pattern recognition [e-journal]114107862 Elsevier

Existing person search methods typically focus on improving person detection accuracy. This ignores the model inference efficiency, which however is fundamentally significant for real-world applications. In this work, we address this limitation by investigating the scalability problem of person search involving both model accuracy and inference efficiency simultaneously. Specifically, we formulate a Hierarchical Distillation Learning (HDL) approach. With HDL, we aim to comprehensively distil the knowledge of a strong teacher model with strong learning capability to a lightweight student model with weak learning capability. To facilitate the HDL process, we design a simple and powerful teacher model for joint learning of person detection and person re-identification matching in unconstrained scene images. Extensive experiments show the modelling advantages and cost-effectiveness superiority of HDL over the state-of-the-art person search methods on three large person search benchmarks: CUHK-SYSU, PRW, and DukeMTMC-PS.

Xu Lan, XIATIAN ZHU, Shaogang Gong (2022)Unsupervised cross-domain person re-identification by instance and distribution alignment, In: Pattern recognition [e-journal]124108514 Elsevier

Most existing person re-identification (re-id) methods assume supervised model training on a separate large set of training samples from the target domain. While performing well in the training domain, such trained models are seldom generalisable to a new independent unsupervised target domain without further labelled training data from the target domain. To solve this scalability limitation, we develop a novel Hierarchical Unsupervised Domain Adaptation (HUDA) method. It can transfer labelled information of an existing dataset (a source domain) to an unlabelled target domain for unsupervised person re-id. Specifically, HUDA is designed to model jointly global distribution alignment and local instance alignment in a two-level hierarchy for discovering transferable source knowledge in unsupervised domain adaptation. Crucially, this approach aims to overcome the under-constrained learning problem of existing unsupervised domain adaptation methods. Extensive evaluations show the superiority of HUDA for unsupervised cross-domain person re-id over a wide variety of state-of-the-art methods on four re-id benchmarks: Market-1501, DukeMTMC, MSMT17 and CUHK03.

Chen Wang, Feng Zhang, XIATIAN ZHU, Shuzhi Sam Ge (2022)Low-resolution human pose estimation, In: Pattern Recognition126108579

Human pose estimation has achieved significant progress on images with high imaging resolution. However, low-resolution imagery data bring nontrivial challenges which are still under-studied. To fill this gap, we start with investigating existing methods and reveal that the most dominant heatmap-based methods would suffer more severe model performance degradation from low-resolution, and offset learning is an effective strategy. Established on this observation, in this work we propose a novel Confidence-Aware Learning (CAL) method which further addresses two fundamental limitations of existing offset learning methods: inconsistent training and testing, decoupled heatmap and offset learning. Specifically, CAL selectively weighs the learning of heatmap and offset with respect to ground-truth and most confident prediction, whilst capturing the statistical importance of model output in mini-batch learning manner. Extensive experiments conducted on the COCO benchmark show that our method outperforms significantly the state-of-the-art methods for low-resolution human pose estimation.

Mantun Chen, Yongjun Wang, Zhiquan Qin, XIATIAN ZHU (2021)Few-Shot Website Fingerprinting Attack with Data Augmentation, In: Security and communication networks2840289 Hindawi

This work introduces a novel data augmentation method for few-shot website fingerprinting (WF) attack where only a handful of training samples per website are available for deep learning model optimization. Moving beyond earlier WF methods relying on manually-engineered feature representations, more advanced deep learning alternatives demonstrate that learning feature representations automatically from training data is superior. Nonetheless, this advantage rests on the unrealistic assumption that many training samples exist per website; without them, the advantage disappears. To address this, we introduce a model-agnostic, efficient, and harmonious data augmentation (HDA) method that can improve deep WF attacking methods significantly. HDA involves both intrasample and intersample data transformations that can be used in a harmonious manner to expand a tiny training dataset to an arbitrarily large collection, therefore effectively and explicitly addressing the intrinsic data scarcity problem. We conducted extensive experiments to validate our HDA for boosting state-of-the-art deep learning WF attack models in both closed-world and open-world attacking scenarios, in the absence and presence of strong defense. For instance, in the more challenging and realistic evaluation scenario with WTF-PAD-based defense, our HDA method surpasses the previous state-of-the-art results by nearly 3% in classification accuracy in the 20-shot learning case.

Wei-Shi Zheng, Jincheng Hong, Jiening Jiao, Ancong Wu, XIATIAN ZHU, Shaogang Gong, Jiayin Qin, Jian-Huang Lai (2021)Joint Bilateral-Resolution Identity Modeling for Cross-Resolution Person Re-Identification, In: International Journal of Computer Vision130pp. 136-156 Springer

Person images captured by public surveillance cameras often have low resolutions (LRs), along with uncontrolled pose variations, background clutter and occlusion. These issues cause the resolution mismatch problem when matched with high-resolution (HR) gallery images (typically available during collection), harming the person re-identification (re-id) performance. While a number of methods have been introduced based on the joint learning of super-resolution and person re-id, they ignore specific discriminant identity information encoded in LR person images, leading to ineffective model performance. In this work, we propose a novel joint bilateral-resolution identity modeling method that concurrently performs HR-specific identity feature learning with super-resolution, LR-specific identity feature learning, and person re-id optimization. We also introduce an adaptive ensemble algorithm for handling different low resolutions. Extensive evaluations validate the advantages of our method over related state-of-the-art re-id and super-resolution methods on cross-resolution re-id benchmarks. An important discovery is that leveraging LR-specific identity information enables a simple cascade of super-resolution and person re-id learning to achieve state-of-the-art performance, without elaborate model design nor bells and whistles, which has not been investigated before.

Mantun Chen, Yongjun Wang, Xiatian Zhu (2022)Few-shot Website Fingerprinting attack with Meta-Bias Learning, In: Pattern Recognition130108739 Elsevier

Website fingerprinting (WF) attack aims to identify which website a user is visiting from the traffic data patterns. Whilst existing methods assume many training samples, we investigate a more realistic and scalable few-shot WF attack with only a few labeled training samples per website. To solve this problem, we introduce a novel Meta-Bias Learning (MBL) method for few-shot WF learning. Taking the meta-learning strategy, MBL simulates and optimizes the target tasks. Moreover, a new model parameter factorization idea is introduced for facilitating meta-training with superior task adaptation. Extensive experiments show that our MBL significantly outperforms existing hand-crafted feature and deep learning based alternatives in both closed-world and open-world attack scenarios, in the absence and presence of defense.

Xiangping Zhu, XIATIAN ZHU, Minxian Li, Pietro Morerio, Vittorio Murino, Shaogang Gong (2021)Intra-Camera Supervised Person Re-Identification, In: International journal of computer vision129pp. 1580-1595 Springer

Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand unsupervised re-id methods do not need identity label information, but they usually suffer from much inferior and insufficient model performance. To overcome these fundamental limitations, we propose a novel person re-identification paradigm based on an idea of independent per-camera identity annotation. This eliminates the most time-consuming and tedious inter-camera identity labelling process, significantly reducing the amount of human annotation efforts. Consequently, it gives rise to a more scalable and more feasible setting, which we call Intra-Camera Supervised (ICS) person re-id, for which we formulate a Multi-tAsk mulTi-labEl (MATE) deep learning method. Specifically, MATE is designed for self-discovering the cross-camera identity correspondence in a per-camera multi-task inference framework. Extensive experiments demonstrate the cost-effectiveness superiority of our method over the alternative approaches on three large person re-id datasets. For example, MATE yields 88.7% rank-1 score on Market-1501 in the proposed ICS person re-id setting, significantly outperforming unsupervised learning models and closely approaching conventional fully supervised learning competitors.

Yanbei Chen, Massimiliano Mancini, XIATIAN ZHU, Zeynep Akata (2022)Semi-Supervised and Unsupervised Deep Visual Learning: A Survey, In: IEEE Transactions on Pattern Analysis and Machine Intelligence IEEE

State-of-the-art deep learning models are often trained with a large amount of costly labeled training data. However, requiring exhaustive manual annotations may degrade the model’s generalizability in the limited-label regime. Semi-supervised learning and unsupervised learning offer promising paradigms to learn from an abundance of unlabeled visual data. Recent progress in these paradigms has indicated the strong benefits of leveraging unlabeled data to improve model generalization and provide better model initialization. In this survey, we review the recent advanced deep learning algorithms on semi-supervised learning (SSL) and unsupervised learning (UL) for visual recognition from a unified perspective. To offer a holistic understanding of the state-of-the-art in these areas, we propose a unified taxonomy. We categorize existing representative SSL and UL with comprehensive and insightful analysis to highlight their design rationales in different learning scenarios and applications in different computer vision tasks. Lastly, we discuss the emerging trends and open challenges in SSL and UL to shed light on future critical research directions.

Hang Su, Shaogang Gong, XIATIAN ZHU (2021)Multi-perspective cross-class domain adaptation for open logo detection, In: Computer vision and image understanding204103156 Elsevier

Existing logo detection methods mostly rely on supervised learning with a large quantity of labelled training data in limited classes. This restricts their scalability to a large number of logo classes subject to limited labelling budget. In this work, we consider a more scalable open logo detection problem where only a fraction of logo classes are fully labelled whilst the remaining classes are only annotated with a clean icon image (e.g. 1-shot icon supervised). To generalise and transfer knowledge of fully supervised logo classes to other 1-shot icon supervised classes, we propose a Multi-Perspective Cross-Class (MPCC) domain adaptation method. In a data augmentation principle, MPCC conducts feature distribution alignment in two perspectives. Specifically, we align the feature distribution between synthetic logo images of 1-shot icon supervised classes and genuine logo images of fully supervised classes, and that between logo images and non-logo images, concurrently. This allows for mitigating the domain shift problem between model training and testing on 1-shot icon supervised logo classes, simultaneously reducing the model overfitting towards fully labelled logo classes. Extensive comparative experiments show the advantage of MPCC over existing state-of-the-art competitors on the challenging QMUL-OpenLogo benchmark (Su et al., 2018).

Xiangtai Li, Li Zhang, Guangliang Cheng, Kuiyuan Yang, Yunhai Tong, XIATIAN ZHU, TAO XIANG (2021)Global Aggregation Then Local Distribution for Scene Parsing, In: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society30pp. 6829-6842 IEEE

Modelling long-range contextual relationships is critical for pixel-wise prediction tasks such as semantic segmentation. However, convolutional neural networks (CNNs) are inherently limited in modelling such dependencies due to the naive structure of their building modules (e.g., local convolution kernel). While recent global aggregation methods are beneficial for long-range structure information modelling, they would oversmooth and bring noise to the regions that contain fine details (e.g., boundaries and small objects), which matter greatly in the semantic segmentation task. To alleviate this problem, we propose to explore the local context so that the aggregated long-range relationship is distributed more accurately in local regions. In particular, we design a novel local distribution module which models the affinity map between global and local relationship for each pixel adaptively. Integrating existing global aggregation modules, we show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks, giving rise to the GALD networks. Despite its simplicity and versatility, our approach allows us to build new state of the art on major semantic segmentation benchmarks including Cityscapes, ADE20K, Pascal Context, Camvid and COCO-stuff. Code and trained models are released at https://github.com/lxtGH/GALD-DGCNet to foster further research.

Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, Li Zhang (2021)Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021pp. 6881-6890 Institute of Electrical and Electronics Engineers (IEEE)

Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
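The step of encoding an image as a sequence of patches can be illustrated with a small sketch. It only shows how a 2D grid is cut into non-overlapping patches and flattened into a token sequence; the learned linear projection, position embeddings, transformer layers and decoder are omitted. The function name and the single-channel nested-list image format are assumptions for illustration, not the SETR code.

```python
def image_to_patch_sequence(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are emitted in raster order (left to right, top to bottom), so an
    H x W image yields (H / patch_size) * (W / patch_size) tokens, each of
    length patch_size * patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            sequence.append(patch)
    return sequence


# A 4x4 single-channel toy image whose pixel value equals its raster index.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patch_sequence(image, patch_size=2)
```

Because the sequence length depends only on the image size and patch size, every transformer layer operates on the same number of tokens, which matches the abstract's point that no resolution reduction happens inside the encoder.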

Mengmeng Xu, Juan-Manuel Perez-Rua, Victor Escorcia, Brais Martinez, XIATIAN ZHU, Li Zhang, Bernard Ghanem, TAO XIANG (2021)Boundary-Sensitive Pre-Training for Temporal Localization in Videos

Mantun Chen, Yongjun Wang, Zhiquan Qin, XIATIAN ZHU Few-shot website fingerprinting attack

Mengmeng Xu, Juan-Manuel Perez-Rua, XIATIAN ZHU, Bernard Ghanem, Brais Martinez Low-Fidelity Video Encoder Optimization for Temporal Action Localization

Jiachen Lu, Jinghan Yao, Junge Zhang, XIATIAN ZHU, Hang Xu, Chunjing Xu, TAO XIANG, Li Zhang (2021)Soft: Softmax-free transformer with linear complexity