Asmar Nadeem

Research Fellow in Multimodal AI

asmar.nadeem@surrey.ac.uk

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP), School of Computer Science and Electronic Engineering.

About

My research project

Audio-visual object-based dynamic scene representation from monocular video

This project will investigate the transformation of monocular audio and visual video into a spatially localised object-based audio-visual representation.
Self-supervised and weakly supervised deep learning will be investigated for the transformation of general scenes into semantically labelled and localised objects.
Learning on in-the-wild and BBC archive datasets will be investigated to support the generalisation to complex scenes. Specific use-cases such as sports and programme recommendation will also be investigated for evaluation in constrained contexts. The approach will be evaluated on both live and legacy content.

Supervisors

Armin Mustafa

Adrian Hilton

News

28 JAN 2025

Surrey welcomes Director General of Department for Education's Skills Group on campus visit

Publications

Asmar Nadeem, Faegheh Sardari, Robert Dawes, Syed Sameed Husain, Adrian Hilton, Armin Mustafa NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative Arxiv

DOI: 10.48550/arxiv.2406.06499

Existing video captioning benchmarks and models lack causal-temporal narrative, that is, sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN significantly outperforms state-of-the-art models in articulating the causal and temporal aspects of video content: 17.88 and 17.44 CIDEr on the MSVD-CTN and MSRVTT-CTN datasets, respectively. Cross-dataset evaluations further showcase CEN's strong generalization capabilities. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning. For project details, visit https://narrativebridge.github.io/.

M. Awan, A. Nadeem, J. R. Santana, P. Sotres, T. Bousselin, M. Costalonga, T. Elsaleh (2025) Towards Efficient Structured Description Generation for Data Marketplace Offerings, In: IEEE Conference on Standards for Communications and Networking (Online) pp. 1-7 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/CSCN67557.2025.11230736

Data marketplaces are expected to become a prominent theme in the digital economy: data assets are being generated today more than ever, and they have the potential to transform how value can be unlocked. Automating the generation of metadata descriptions will therefore become a necessity for data asset discoverability, especially as data volumes explode and manual documentation becomes unsustainable. Semi-autonomous means are required to bring down the barrier to entry for data providers. As an extension of the Data Space concept, marketplaces advertise their assets or products through "offerings" that enable the discoverability of an asset or bundle of assets. Data providers publish these offerings to a catalogue, which data consumers can in turn query. Recent advances in large language models (LLMs) and constrained decoding techniques enable schema-compliant, semi-automated metadata generation, reducing manual overhead and improving discoverability. We propose a schema-aware, edge-optimized LLM pipeline for generating structured descriptions for data asset offerings in the SEDIMARK marketplace, with evaluation on realistic information models.

Davide Berghi, Craig Cieciura, Farshad Einabadi, Maxine Glancy, Oliver Charles Camilleri, Philip Anthony Foster, Asmar Nadeem, Faegheh Sardari, Jinzheng Zhao, Marco Volino, Armin Mustafa, Philip J B Jackson, Adrian Hilton (2024) ForecasterFlexOBM: A Multi-View Audio-Visual Dataset for Flexible Object-Based Media Production, In: ForecasterFlexOBM: A multi-view audio-visual dataset for flexible object-based media production IEEE

DOI: 10.1109/ICME57554.2024.10687655

Leveraging machine learning techniques, in the context of object-based media production, could enable provision of personalized media experiences to diverse audiences. To fine-tune and evaluate techniques for personalization applications, as well as more broadly, datasets which bridge the gap between research and production are needed. We introduce and publicly release such a dataset, themed around a UK weather forecast and shot against a blue-screen background, of three professional actors/presenters – one male and one female (English) and one female (British Sign Language). Scenes include both production and research-oriented examples, with a range of dialogue, motions, and actions. Capture techniques consisted of a synchronized 4K resolution 16-camera array, production-typical microphones plus professional audio mix, a 16-channel microphone array with collocated Grasshopper3 camera, and a photogrammetry array. We demonstrate applications relevant to virtual production and creation of personalized media including neural radiance fields, shadow casting, action/event detection, speaker source tracking and video captioning.

Davide Berghi, Craig Cieciura, Farshad Einabadi, Maxine Glancy, Oliver Charles Camilleri, Philip Anthony Foster, Asmar Nadeem, Faegheh Sardari, Jinzheng Zhao, Marco Volino, Armin Mustafa, Philip J B Jackson, Adrian Douglas Mark Hilton ForecasterFlexOBM: A multi-view audio-visual dataset for flexible object-based media production, In: ForecasterFlexOBM: A Multi-View Audio-Visual Dataset for Flexible Object-Based Media Production University of Surrey

DOI: 10.15126/surreydata.900912

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa (2024) CAD - Contextual Multi-modal Alignment for Dynamic AVQA, In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/WACV57701.2024.00709

In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audiovisual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, intermodal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.
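For readers unfamiliar with the term, the sketch below illustrates generic scaled dot-product cross-attention between two modalities. It is not the CAD network: the function names (`softmax`, `cross_attention`) and the toy 2-D features are assumptions made purely for illustration, and a real model would use learned projections over high-dimensional features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key/value rows."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to each key, scaled by sqrt of the feature size.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value rows.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


# Hypothetical toy features: 2 audio frames and 3 visual frames, 2-D each.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Audio queries attend over visual features, and vice versa.
audio_from_visual = cross_attention(audio, visual, visual)
visual_from_audio = cross_attention(visual, audio, audio)
```

In this toy setup each output row is a weighted average of the other modality's frames, so a visual frame that matches both audio frames equally receives an even mix of the two.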

Mahrukh Awan, Asmar Nadeem, Muhammad Junaid Awan, Armin Mustafa, Syed Sameed Husain Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
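The "nearly 80%" figure can be checked directly from the two parameter counts quoted in the abstract. The short calculation below is only a sanity check on those reported numbers; the helper name `relative_reduction` is an assumption for illustration.

```python
def relative_reduction(baseline, compact):
    """Fraction by which `compact` is smaller than `baseline`."""
    return (baseline - compact) / baseline


# Figures as quoted in the abstract (parameters in millions, F1 in percent).
baseline_params_m = 341   # Fully-Connected Late Fusion
compact_params_m = 72     # Attend-Fusion
baseline_f1 = 75.96
compact_f1 = 75.64

size_reduction = relative_reduction(baseline_params_m, compact_params_m)
f1_gap = baseline_f1 - compact_f1

print(f"Model size reduction: {size_reduction:.1%}")  # 78.9%
print(f"F1 gap: {f1_gap:.2f} percentage points")      # 0.32
```

The reduction works out to 269/341, about 78.9%, for an F1 difference of 0.32 percentage points.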

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa (2023) SEM-POS: Grammatically and Semantically Correct Video Captioning, In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) pp. 2606-2616 IEEE

DOI: 10.1109/CVPRW59228.2023.00259

Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa CAD -- Contextual Multi-modal Alignment for Dynamic AVQA

DOI: 10.48550/arxiv.2310.16754

In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa SEM-POS: Grammatically and Semantically Correct Video Captioning

DOI: 10.48550/arxiv.2303.14829

Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.