11am - 12 noon

Friday 12 May 2023

Person re-identification using vision and language

PhD Viva Open Presentation by Ammarah Farooq.


University of Surrey

This event has passed


Cross-modal person re-identification (Re-ID) is a crucial component of modern video surveillance systems and security infrastructure. The task of matching people across multiple non-overlapping camera views involves numerous computer vision challenges, such as changes in illumination, occlusions, pose variations, and even the absence of a visual query. In this thesis, we develop person Re-ID methods based on images of people and their textual descriptions. The key challenge is to align cross-modality feature representations according to fine-grained appearance attributes while ignoring background noise.

The first contribution proposes to jointly model the multi-modal latent space, where corresponding visual and textual representations are pushed closer together. However, the performance of such late-fusion models depends on the quality of the feature extraction backbone for each modality. To overcome this issue, the second contribution proposes a unified cross-modal feature learning backbone that implicitly aligns the shared semantic concepts from the start of the learning network. Unified feature learning effectively uses textual data as a super-annotation signal for visual representation learning and automatically rejects irrelevant information.
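
As a rough illustration of what "pushing corresponding visual and textual representations closer" can mean in a joint latent space, here is a minimal NumPy sketch. The abstract does not state which loss the thesis uses; the symmetric contrastive (InfoNCE-style) loss below is a generic stand-in, and the function names, the `temperature` value, and the embedding shapes are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Illustrative loss (assumed, not the thesis's stated objective).

    Row i of image_emb and row i of text_emb are taken to describe the same
    person. The loss is low when each image is most similar to its own
    description and vice versa.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # pairwise similarities
    idx = np.arange(len(img))                   # matching pairs lie on the diagonal
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text retrieval
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image retrieval
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(8, 16))                          # hypothetical image features
    aligned_text = image_emb + 0.01 * rng.normal(size=(8, 16))    # matching descriptions
    shuffled_text = np.roll(aligned_text, 1, axis=0)              # mismatched descriptions
    print("aligned loss: ", symmetric_contrastive_loss(image_emb, aligned_text))
    print("shuffled loss:", symmetric_contrastive_loss(image_emb, shuffled_text))
```

In a late-fusion setup of this kind, the embeddings would come from separate image and text backbones, which is why the quality of each backbone limits the final matching performance.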

With the emergence of vision transformers (ViTs), the approach of splitting a 2D image into a 1D sequence of tokens and learning long-range interactions solely via a self-attention mechanism has further strengthened the case for a unified backbone model. In the final contribution, we propose a vision transformer architecture designed for effective intra-modal and cross-modal communication based on special tokens. The purpose of these tokens is twofold. First, they condense the image information into a small set of tokens. Second, they are responsible for interacting across spatial windows of an image as well as across modalities.
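
The tokenization step mentioned above, splitting a 2D image into a 1D sequence of tokens, can be sketched in a few lines of NumPy. This shows only the standard ViT-style patch flattening, not the special-token design proposed in the thesis; the function name `patchify`, the 224x224 input size, and the patch size of 16 are assumptions chosen for the example.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Separate the patch grid axes from the within-patch pixel axes.
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # Bring the two grid axes to the front: (grid_h, grid_w, ph, pw, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into one token vector.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


if __name__ == "__main__":
    image = np.zeros((224, 224, 3), dtype=np.float32)  # hypothetical input image
    tokens = patchify(image, patch_size=16)
    print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

In a full transformer, each flattened patch would then be linearly projected to the model dimension and given a positional embedding before self-attention is applied.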

The proposed approach of multi-modal unified feature learning has the potential to address the limitations of traditional single-modality person Re-ID methods and has important practical implications for real-world video surveillance systems.
