Published: 15 July 2024

Advancements in AI for video and audio parsing

AI4ME researchers have created new AI tools to improve video analysis, audio-visual question answering and captioning for better media and accessibility.

New AI tool improves audio-visual video parsing

15 July 2024

Researchers at AI4ME have developed a new AI tool that can more accurately identify and categorise events in videos based on both audio and visual information.

The tool, called CoLeaF, is designed for weakly supervised audio-visual video parsing (AVVP), meaning it learns to identify events from video-level labels alone, without detailed annotations of when each event occurs or whether it is heard, seen or both. Existing AVVP methods often struggle to distinguish between audible-only, visible-only and audible-visible events, especially when the audio and visual information don't perfectly align.

CoLeaF addresses this issue by learning to combine cross-modal information (audio and visual) only when it's relevant. This helps the tool avoid introducing irrelevant information that can hinder performance. Additionally, CoLeaF models complex class relationships to improve accuracy without increasing computational costs.
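In rough terms, the approach resembles a learned "relevance gate" on cross-modal attention. The sketch below is illustrative only: it is not the CoLeaF code, and all names, dimensions and weights are assumptions made for the example.

    # A minimal sketch of relevance-gated cross-modal fusion (not the
    # CoLeaF implementation; dimensions and module names are assumptions).
    # Each modality attends to the other, and a learned gate decides, per
    # time segment, how much cross-modal evidence to mix in, so irrelevant
    # audio (or video) can be suppressed.
    import torch
    import torch.nn as nn

    class GatedCrossModalFusion(nn.Module):
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Scores, per time step, how relevant the other modality is.
            self.gate = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(),
                nn.Linear(dim, 1), nn.Sigmoid())

        def forward(self, query_feats, other_feats):
            # query_feats, other_feats: (batch, time, dim) segment features
            cross, _ = self.cross_attn(query_feats, other_feats, other_feats)
            g = self.gate(torch.cat([query_feats, cross], dim=-1))  # (B, T, 1)
            # Mix in cross-modal evidence only as much as the gate allows.
            return query_feats + g * cross

    audio = torch.randn(2, 10, 512)  # e.g. ten one-second audio segments
    video = torch.randn(2, 10, 512)  # the ten matching video segments
    fusion = GatedCrossModalFusion()
    audio_enhanced = fusion(audio, video)  # (2, 10, 512)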

To evaluate CoLeaF's performance, the researchers conducted extensive experiments on the LLP and UnAV-100 datasets, where CoLeaF significantly outperformed existing methods, achieving average F-score improvements of 1.9% and 2.4% respectively.
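For readers unfamiliar with the metric: the F-score is the harmonic mean of precision (how many detected events are real) and recall (how many real events are detected), so a model must do well on both to score highly. The numbers below are hypothetical and simply show the arithmetic.

    # Hypothetical precision and recall values, purely illustrative.
    precision, recall = 0.80, 0.70
    f_score = 2 * precision * recall / (precision + recall)
    print(round(f_score, 3))  # 0.747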

This new AI tool has the potential to improve a variety of applications, such as video analysis, content creation, and accessibility. By more accurately understanding the content of videos, CoLeaF can help developers create more personalised and engaging experiences for users.

The researchers were: Faegheh Sardari, Armin Mustafa, Philip JB Jackson, Adrian Hilton. Proceedings of the European Conference on Computer Vision (ECCV), 2024.

Access the CoLeaF paper

A breakthrough in audio-visual question answering

10 January 2024

A team of AI4ME researchers has developed a new AI model, CAD, that significantly outperforms existing methods in answering questions based on audio and visual information.

By aligning audio and visual data on spatial, temporal, and semantic levels, CAD achieves a remarkable 9.4% improvement on the MUSIC-AVQA dataset.
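One common way to encourage this kind of alignment, shown in the simplified sketch below, is a contrastive loss that pulls together audio and video features from the same moment and pushes apart mismatched pairs. This illustrates the general idea only; it is not the CAD model, and the feature sizes are assumptions.

    # InfoNCE-style audio-visual alignment loss (an illustrative sketch,
    # not the CAD implementation). Row i of each input comes from the
    # same time segment of the same clip.
    import torch
    import torch.nn.functional as F

    def alignment_loss(audio, video, temperature=0.07):
        a = F.normalize(audio, dim=-1)
        v = F.normalize(video, dim=-1)
        logits = a @ v.t() / temperature    # pairwise similarities
        targets = torch.arange(a.size(0))   # matching pairs on the diagonal
        # Symmetric cross-entropy: align audio-to-video and video-to-audio.
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    loss = alignment_loss(torch.randn(16, 256), torch.randn(16, 256))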

This breakthrough could have far-reaching implications for applications such as video captioning, content search, and accessibility for the visually impaired.

The researchers were: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024.

Access the CAD paper

AI tool improves video captioning accuracy

22 June 2023

Researchers on AI4ME, a research programme at the Centre for Vision, Speech and Signal Processing (CVSSP), have developed a new AI tool that can generate more accurate and grammatically correct video captions.

The tool, called SEM-POS, uses a novel global-local fusion network to combine visual and linguistic features. This approach helps the tool better align the visual information with the language description, resulting in more accurate and coherent captions.

By using different parts-of-speech (POS) components for supervision, SEM-POS can generate captions that are more grammatically correct and capture key information from the video. Extensive testing on benchmark datasets has shown that SEM-POS significantly outperforms existing methods in terms of caption accuracy.
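The grammar gain can be pictured as an auxiliary prediction task: alongside each caption word, the model also predicts that word's part of speech, and both predictions contribute to the training loss. The sketch below illustrates this general recipe only; it is not the SEM-POS implementation, and the vocabulary size, tag set and loss weight are assumptions.

    # Parts-of-speech supervision for captioning, as an illustrative sketch
    # (not the SEM-POS code). A shared decoder state feeds two heads:
    # next-word prediction and an auxiliary POS-tag prediction that nudges
    # captions towards grammatical well-formedness. Sizes are assumptions.
    import torch
    import torch.nn as nn

    VOCAB, POS_TAGS, DIM = 10000, 17, 512  # 17 ~ the Universal POS tag set

    word_head = nn.Linear(DIM, VOCAB)      # predicts the next caption word
    pos_head = nn.Linear(DIM, POS_TAGS)    # predicts its part of speech

    hidden = torch.randn(4, 20, DIM)             # decoder states: (B, T, dim)
    words = torch.randint(0, VOCAB, (4, 20))     # ground-truth tokens
    pos = torch.randint(0, POS_TAGS, (4, 20))    # their POS tags (precomputed)

    ce = nn.CrossEntropyLoss()
    loss = (ce(word_head(hidden).flatten(0, 1), words.flatten())
            + 0.5 * ce(pos_head(hidden).flatten(0, 1), pos.flatten()))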

This new AI tool has the potential to improve the accessibility and usability of video content for people with hearing impairments, as well as for those who simply prefer to read captions. It could also be used to enhance search engine optimisation and improve the discoverability of online videos.

The researchers were: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 2606-2616.

Access the SEM-POS paper
