
Sergio Sánchez Santiesteban
About
My research project
Multimodal image matching and interpretation
This research project will focus on multimodal matching of visual content across views and different cameras, with application to person reidentification for forensic analysis of surveillance videos. Image matching is a fundamental problem in machine perception. It is the basis of visual object recognition, where unknown image content is compared with pre-stored visual object models.
In forensic video analytics, the person reidentification goal is often defined by a free-form verbal description of the suspect. The human expert then mentally maps the description to visual content and looks for a match in the analysed footage. To automate this process, it is essential to enable cross-modal matching between visual and verbal content. The aim of the project will be to advance the current state of the art in language phrase matching using deep neural networks.
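To make the cross-modal matching idea concrete, the sketch below shows a minimal dual-encoder matcher of the kind commonly used for text-to-image person reidentification: image and text features are projected into a shared space and ranked by cosine similarity. The feature dimensions, projection layers and random inputs are illustrative assumptions, not the project's actual architecture.

# Minimal sketch (illustrative only): score how well a free-form textual
# description matches a gallery of person images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderMatcher(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=256):
        super().__init__()
        # Project pre-extracted image and text features into a shared space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, image_feats, text_feats):
        # L2-normalise so the dot product is a cosine similarity.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Similarity matrix: rows are images, columns are descriptions.
        return img @ txt.t()

# Toy usage: rank 4 gallery images against one verbal description.
matcher = DualEncoderMatcher()
gallery = torch.randn(4, 2048)      # stand-in for CNN image features
description = torch.randn(1, 768)   # stand-in for a text encoder output
scores = matcher(gallery, description)
ranking = scores.squeeze(1).argsort(descending=True)
print(ranking)  # indices of gallery images, best match first

In a real pipeline the two projections would be trained with a contrastive objective on paired images and descriptions, so that matching pairs score higher than mismatched ones.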
Publications
Multimodal foundation models, pre-trained on large-scale data, effectively capture vast amounts of factual and commonsense knowledge. However, these models store all their knowledge within their parameters, requiring increasingly larger models and training data to capture more knowledge. To address this limitation and achieve a more scalable and modular integration of knowledge, we propose a novel knowledge graph-augmented multimodal model. This approach enables a base multimodal model to access pertinent information from an external knowledge graph. Our methodology leverages existing general-domain knowledge to facilitate vision-language pre-training using paired images and text descriptions. We conduct comprehensive evaluations demonstrating that our model outperforms state-of-the-art models and yields comparable results to much larger models trained on more extensive datasets. Notably, our model reached a CIDEr score of 145 on MS COCO Captions using only 2.9 million samples, outperforming a 1.4B-parameter model by 1.7% despite having 11 times fewer parameters.
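The retrieve-and-fuse idea behind the knowledge-graph augmentation can be illustrated with a toy sketch: look up facts in an external graph that mention entities from the paired caption and attach them to the text stream before encoding. The miniature graph, the keyword-overlap retriever and the concatenation-based fusion below are hypothetical stand-ins, not the paper's exact mechanism.

# Hypothetical sketch of knowledge retrieval for a caption.
from typing import List, Tuple

KnowledgeTriple = Tuple[str, str, str]  # (subject, relation, object)

KNOWLEDGE_GRAPH: List[KnowledgeTriple] = [
    ("umbrella", "used_for", "rain protection"),
    ("skateboard", "is_a", "sports equipment"),
    ("kitchen", "contains", "oven"),
]

def retrieve_triples(caption: str, graph: List[KnowledgeTriple],
                     top_k: int = 2) -> List[KnowledgeTriple]:
    """Return triples whose subject appears in the caption."""
    tokens = set(caption.lower().split())
    hits = [t for t in graph if t[0] in tokens]
    return hits[:top_k]

def augment_caption(caption: str, graph: List[KnowledgeTriple]) -> str:
    """Concatenate retrieved facts onto the caption before text encoding."""
    facts = [" ".join(t) for t in retrieve_triples(caption, graph)]
    if not facts:
        return caption
    return caption + " [KNOWLEDGE] " + " ; ".join(facts)

print(augment_caption("a man holding an umbrella in the rain", KNOWLEDGE_GRAPH))
# -> "a man holding an umbrella in the rain [KNOWLEDGE] umbrella used_for rain protection"

Keeping the knowledge in an external graph rather than in the model's parameters is what makes the integration modular: the graph can be updated or swapped without retraining the base multimodal model.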