
Sergio Sánchez Santiesteban
About
My research project
Multimodal image matching and interpretation
This research project will focus on multimodal matching of visual content across views and different cameras, with application to person reidentification for forensic analysis of surveillance videos. Image matching is a fundamental problem in machine perception. It is the basis of visual object recognition, where unknown image content is compared with pre-stored visual object models.
In forensic video analytics, the person reidentification goal is often defined by a free-form verbal description of the suspect. The human expert then mentally maps the description to visual content and looks for a match in the analysed footage. To automate this process, it is essential to enable cross-modal matching between visual and verbal content. The aim of the project will be to advance the current state of the art in language phrase matching using deep neural networks.
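To make the cross-modal matching idea concrete, the sketch below shows a minimal dual-encoder matcher of the kind commonly used for text-to-image person reidentification: image and text features are projected into a shared space and ranked by cosine similarity. The feature dimensions, projection layers and random inputs are illustrative assumptions, not the project's actual architecture.

# Minimal sketch (illustrative only): score how well a free-form textual
# description matches a gallery of person images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderMatcher(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=256):
        super().__init__()
        # Project pre-extracted image and text features into a shared space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, image_feats, text_feats):
        # L2-normalise so the dot product is a cosine similarity.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Similarity matrix: rows are images, columns are descriptions.
        return img @ txt.t()

# Toy usage: rank 4 gallery images against one verbal description.
matcher = DualEncoderMatcher()
gallery = torch.randn(4, 2048)      # stand-in for CNN image features
description = torch.randn(1, 768)   # stand-in for a text encoder output
scores = matcher(gallery, description)
ranking = scores.squeeze(1).argsort(descending=True)
print(ranking)  # indices of gallery images, best match first

In a real pipeline the two projections would be trained with a contrastive objective on paired images and descriptions, so that matching pairs score higher than mismatched ones.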
Publications
Multimodal foundation models, pre-trained on large-scale data, effectively capture vast amounts of factual and commonsense knowledge. However, these models store all their knowledge within their parameters, requiring increasingly larger models and training data to capture more knowledge. To address this limitation and achieve a more scalable and modular integration of knowledge, we propose a novel knowledge graph-augmented multimodal model. This approach enables a base multimodal model to access pertinent information from an external knowledge graph. Our methodology leverages existing general-domain knowledge to facilitate vision-language pre-training using paired images and text descriptions. We conduct comprehensive evaluations demonstrating that our model outperforms state-of-the-art models and yields comparable results to much larger models trained on more extensive datasets. Notably, our model reached a CIDEr score of 145 on MS COCO Captions using only 2.9 million samples, outperforming a 1.4B-parameter model by 1.7% despite having 11 times fewer parameters.
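The retrieve-and-fuse idea behind the knowledge-graph augmentation can be illustrated with a toy sketch: look up facts in an external graph that mention entities from the paired caption and attach them to the text stream before encoding. The miniature graph, the keyword-overlap retriever and the concatenation-based fusion below are hypothetical stand-ins, not the paper's exact mechanism.

# Hypothetical sketch of knowledge retrieval for a caption.
from typing import List, Tuple

KnowledgeTriple = Tuple[str, str, str]  # (subject, relation, object)

KNOWLEDGE_GRAPH: List[KnowledgeTriple] = [
    ("umbrella", "used_for", "rain protection"),
    ("skateboard", "is_a", "sports equipment"),
    ("kitchen", "contains", "oven"),
]

def retrieve_triples(caption: str, graph: List[KnowledgeTriple],
                     top_k: int = 2) -> List[KnowledgeTriple]:
    """Return triples whose subject appears in the caption."""
    tokens = set(caption.lower().split())
    hits = [t for t in graph if t[0] in tokens]
    return hits[:top_k]

def augment_caption(caption: str, graph: List[KnowledgeTriple]) -> str:
    """Concatenate retrieved facts onto the caption before text encoding."""
    facts = [" ".join(t) for t in retrieve_triples(caption, graph)]
    if not facts:
        return caption
    return caption + " [KNOWLEDGE] " + " ; ".join(facts)

print(augment_caption("a man holding an umbrella in the rain", KNOWLEDGE_GRAPH))
# -> "a man holding an umbrella in the rain [KNOWLEDGE] umbrella used_for rain protection"

Keeping the knowledge in an external graph rather than in the model's parameters is what makes the integration modular: the graph can be updated or swapped without retraining the base multimodal model.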