Sadegh Rahmani
Academic and research departments
Music and Media, School of Computer Science and Electronic Engineering
About
My research project
Human inspired action recognition using top-down and bottom-up methods
My research focuses on multimodal learning, leveraging diverse data sources, from established datasets to insights derived from human-participant studies conducted in collaboration with psychology researchers at Newcastle University. The aim of my research is to advance action recognition and action understanding in videos.
Supervisors
Research
Research interests
I am interested in multimodal learning, as it approximates how humans learn. I enjoy working with vision-based modalities, particularly video, due to the challenges they present. I am also keen to explore how these visual inputs can be effectively integrated with large (and small) language models.
Publications
Detecting actions in videos, particularly within cluttered scenes, poses significant challenges due to the limitations of 2D frame analysis from a camera perspective. Unlike human vision, which benefits from 3D understanding, these systems find it difficult to recognise actions in such environments. This research introduces a novel approach integrating 3D features and depth maps alongside RGB features to enhance action recognition accuracy. Our method processes estimated depth maps through a branch separate from the RGB feature encoder and fuses the features to build a comprehensive understanding of the scene and the actions within it. Using the Side4Video framework and VideoMamba, which employ CLIP and VisionMamba for spatial feature extraction, our approach outperformed our implementation of the Side4Video network on the Something-Something V2 dataset.
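The sketch below illustrates the two-branch, late-fusion pattern described in this abstract: an RGB branch and an estimated-depth branch whose features are concatenated before classification. The encoder modules, feature dimensions, and fusion choice are assumptions for illustration; the actual Side4Video/VideoMamba integration is more involved.

```python
import torch
import torch.nn as nn

class TwoStreamActionClassifier(nn.Module):
    """Illustrative RGB + estimated-depth fusion for clip-level action recognition.

    Hypothetical module: not the Side4Video/VideoMamba implementation, only the
    two-branch late-fusion idea described in the abstract.
    """
    def __init__(self, rgb_encoder: nn.Module, depth_encoder: nn.Module,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.rgb_encoder = rgb_encoder      # assumed to map (B, T, 3, H, W) -> (B, feat_dim)
        self.depth_encoder = depth_encoder  # assumed to map (B, T, 1, H, W) -> (B, feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb_clip: torch.Tensor, depth_clip: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_encoder(rgb_clip)        # appearance features
        f_depth = self.depth_encoder(depth_clip)  # geometry / scene-layout features
        fused = torch.cat([f_rgb, f_depth], dim=-1)  # simple concatenation fusion
        return self.classifier(fused)
```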
Despite its popularity and the substantial betting it attracts, horse racing has seen limited research in machine learning. Some studies have tackled related challenges, such as adapting multi-object tracking to the unique geometry of horse tracks [3] and tracking jockey caps during complex manoeuvres [2]. Our research aims to create a helmet detector framework as a preliminary step towards re-identification using a limited dataset. Specifically, we detected jockeys’ helmets throughout a 205-second race captured by six disjoint outdoor cameras, addressing challenges such as occlusion and varying illumination. Occlusion is a significant challenge in horse racing, often more pronounced than in other sports: jockeys race in close groups, causing substantial overlap between jockeys and horses in the camera’s view, which complicates detection and segmentation. Additionally, motion blur, especially in the race’s final stretch, and the multi-camera broadcast capturing various angles (front, back, and sides) further complicate detection and, consequently, re-identification (Re-ID). To address these issues, we focus on helmet identification rather than detecting all horses or jockeys. We believe helmets, with their simple shapes and consistent appearance even when rotated, offer a more reliable detection target that makes the Re-ID downstream task more achievable.
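As a rough illustration of a single-class (helmet) detection setup, the sketch below fine-tunes a standard torchvision Faster R-CNN for one foreground class. This is a generic detector recipe under assumed data conventions, not the framework developed in the paper.

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Illustrative single-class ("helmet") detector; not the paper's framework.
NUM_CLASSES = 2  # background + helmet

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

def train_step(model, images, targets, optimizer):
    """One training step.

    images: list of CHW float tensors; targets: list of dicts with "boxes" (N, 4)
    and "labels" (N,), assumed to come from a custom helmet-annotation dataset.
    """
    model.train()
    loss_dict = model(images, targets)  # detection losses (classification + box regression)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```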
Gait is one of the most frequently used forms of human movement in daily activities. Most previous work focuses on exploring the dynamic factors of gait. In contrast, we adapt an image prediction task to anticipate the next frame of the gait cycle. In this work, we present a novel framework for human gait plantar pressure prediction using a Spatio-temporal Transformer. We train the model to predict the next plantar pressure image in an image series while also learning frame feature encoders that predict the features of subsequent frames in the sequence. We propose two new loss components that account for temporality as well as small values in the image. Our model achieves superior results over several competitive baselines on the CAD WALK database.
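The two extra loss components are not spelled out in this abstract; the sketch below shows one plausible form, assuming a temporal-consistency penalty between consecutive predicted frames and an extra weight on low-intensity (small-value) pixels. Function name, thresholds, and weights are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def plantar_pressure_loss(pred, target, prev_pred=None,
                          small_thresh=0.1, small_weight=2.0, temporal_weight=0.5):
    """Illustrative composite loss for next-frame plantar pressure prediction.

    pred, target: (B, 1, H, W) predicted and ground-truth pressure maps in [0, 1].
    prev_pred:    prediction for the previous frame, used for the temporal term.
    """
    # Base reconstruction term.
    recon = F.mse_loss(pred, target)

    # Emphasise low-pressure regions, which plain MSE tends to under-weight.
    small_mask = (target < small_thresh).float()
    small_term = (small_mask * (pred - target) ** 2).mean()

    # Penalise large jumps between consecutive predicted frames.
    temporal_term = F.mse_loss(pred, prev_pred) if prev_pred is not None else pred.new_zeros(())

    return recon + small_weight * small_term + temporal_weight * temporal_term
```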
Humans reliably surpass the performance of the most advanced AI models in action recognition, especially in real-world scenarios with low resolution, occlusion, and visual clutter. These models are somewhat similar to humans in using architectures that allow hierarchical feature extraction. However, they prioritise different features, leading to notable differences in their recognition. This study investigates these differences by introducing Epic ReduAct, a dataset derived from Epic-Kitchens-100. It consists of Easy and Hard ego-centric videos across various action classes. Critically, our dataset incorporates the concepts of Minimal Recognisable Configuration (MIRC) and sub-MIRC, derived by progressively reducing the spatial content of the action videos across multiple stages. This enables a controlled evaluation of recognition difficulty for humans and AI models, and exposes fundamental differences between human and AI recognition processes. While humans, unlike AI models, demonstrate proficiency in recognising hard videos, they experience a sharp decline in recognition ability as visual information is reduced, ultimately reaching a threshold beyond which recognition is no longer possible. In contrast, AI models exhibit greater resilience, with recognition confidence decreasing gradually or, in some cases, even increasing at later reduction stages. These findings suggest that the limitations observed in human recognition do not directly translate to AI models, highlighting the distinct nature of their processing mechanisms.
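To make the idea of progressive spatial reduction concrete, the sketch below generates a sequence of increasingly reduced views of a frame by taking shrinking centre crops and resizing them back to the original resolution. This is only one plausible reduction scheme, chosen here for illustration; the actual construction of MIRCs and sub-MIRCs in Epic ReduAct may differ.

```python
import torch
import torch.nn.functional as F

def spatial_reduction_stages(frame: torch.Tensor, num_stages: int = 5, min_frac: float = 0.2):
    """Illustrative progressive spatial reduction of a single frame.

    frame: (C, H, W) tensor. Returns num_stages frames, each a centre crop
    covering a shrinking fraction of the original extent, resized back to (H, W).
    One plausible reduction scheme, not the Epic ReduAct protocol.
    """
    c, h, w = frame.shape
    stages = []
    for i in range(num_stages):
        frac = 1.0 - (1.0 - min_frac) * i / max(num_stages - 1, 1)  # 1.0 -> min_frac
        ch, cw = max(int(h * frac), 1), max(int(w * frac), 1)
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = frame[:, top:top + ch, left:left + cw]
        resized = F.interpolate(crop.unsqueeze(0), size=(h, w),
                                mode="bilinear", align_corners=False).squeeze(0)
        stages.append(resized)
    return stages
```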
Efficient regular-frequent pattern mining from sensor-generated data has become a challenge. The large volume of data leads to prolonged runtimes, delaying vital predictions and decision-making that need an immediate response. Big-data platforms and parallel algorithms are therefore an appropriate solution. Additionally, an incremental technique is more suitable than static methods for mining patterns from big data streams. This study presents an incremental parallel approach and a compact tree structure for extracting regular-frequent patterns from wireless sensor network data. Furthermore, fewer database scans are performed in order to reduce the mining runtime. The study was performed on the Intel 5-day and 10-day datasets with 6-, 4-, and 2-node clusters. The findings show that runtime improved in all three cluster configurations, by 14%, 18%, and 34% for the 5-day dataset and by 22%, 55%, and 85% for the 10-day dataset, respectively.
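For readers unfamiliar with regular-frequent patterns, the helper below sketches the standard criterion under assumed thresholds: a pattern is frequent if its support reaches min_sup, and regular if its largest period (the gap between consecutive occurrences, including the gaps before the first and after the last occurrence) does not exceed max_reg. This illustrates the criterion only, not the incremental parallel tree structure proposed in the paper.

```python
def is_regular_frequent(occurrence_tids, total_transactions, min_sup, max_reg):
    """Check the regular-frequent criterion for one pattern.

    occurrence_tids: sorted transaction ids (1..total_transactions) in which the
    pattern occurs. Thresholds min_sup and max_reg are illustrative placeholders.
    """
    support = len(occurrence_tids)
    if support < min_sup:
        return False  # not frequent enough
    # Periods include the gap before the first and after the last occurrence.
    periods = [occurrence_tids[0]]
    periods += [b - a for a, b in zip(occurrence_tids, occurrence_tids[1:])]
    periods.append(total_transactions - occurrence_tids[-1])
    return max(periods) <= max_reg
```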