Multi-View Multiple People Detection, Labelling and Tracking
Multi-view multiple people detection, labelling and tracking has attracted considerable interest in recent years and has been employed in many application domains such as video surveillance, human pose estimation and 3D human modelling. However, correctly detecting, labelling and tracking multiple people from different viewpoints is a challenging task, especially in complex scenes, due to occlusions, visual ambiguities, and variations in appearance and illumination. Recent deep learning approaches to tasks such as object detection and semantic segmentation have proved very successful at improving performance across a wide range of computer vision tasks. By contrast, little work has applied deep learning to multi-view multiple people detection, labelling and tracking, owing to the lack of multi-view datasets. This thesis therefore introduces novel deep learning algorithms to advance the state of the art in multi-view detection, labelling and tracking on both images and videos, whilst also establishing new synthetic datasets to address the lack of multi-view data in this field.
Firstly, we propose a multi-view framework for joint object detection and labelling based on pairs of images. As the lack of multi-view data prevents training an end-to-end network for the task, we leverage some of the benefits of deep learning by introducing a novel single-view convolutional neural network (CNN) with a series of multi-view constraints. The proposed framework extends the single-view Mask R-CNN approach to multiple views to achieve multi-view object detection and labelling without the need for additional training, thus overcoming the lack of multi-view data. Dedicated components are embedded into the framework to match objects across views by enforcing epipolar constraints, appearance feature similarity and class coherence. By jointly processing detection and labelling in a unified network, the multi-view extension enables the proposed framework to detect objects which would otherwise be mis-detected in a single view, and achieves coherent object labelling across views.
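To illustrate how epipolar constraints and class coherence can restrict cross-view matches, the following is a minimal sketch and not the framework's actual implementation. It assumes a known fundamental matrix F between the two views, represents each detection by a class label and a single image point, and uses a greedy assignment; the function names, the `max_dist` threshold and the greedy strategy are all illustrative assumptions. Appearance feature similarity, which the framework also uses, is omitted for brevity.

```python
import math

def epipolar_line(F, p):
    """Epipolar line l = F @ [x, y, 1] in the second view, as (a, b, c)."""
    x, y = p
    return tuple(row[0] * x + row[1] * y + row[2] for row in F)

def epipolar_distance(F, p1, p2):
    """Distance (in pixels) from p2 to the epipolar line induced by p1."""
    a, b, c = epipolar_line(F, p1)
    norm = math.hypot(a, b)
    if norm == 0.0:
        return float("inf")  # degenerate line, e.g. p1 lies at the epipole
    return abs(a * p2[0] + b * p2[1] + c) / norm

def match_across_views(F, dets1, dets2, max_dist=5.0):
    """Greedily match detections given as (class_label, (x, y)) tuples.

    A pair is a candidate only if the classes agree (class coherence) and
    the point in view 2 lies within max_dist of the epipolar line of the
    point in view 1 (epipolar constraint). max_dist is a hypothetical
    threshold chosen for this sketch.
    """
    candidates = []
    for i, (cls1, p1) in enumerate(dets1):
        for j, (cls2, p2) in enumerate(dets2):
            if cls1 != cls2:
                continue
            d = epipolar_distance(F, p1, p2)
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()  # smallest epipolar distance first
    used1, used2, matches = set(), set(), []
    for _, i, j in candidates:
        if i in used1 or j in used2:
            continue
        used1.add(i)
        used2.add(j)
        matches.append((i, j))
    return sorted(matches)

# Fundamental matrix of an idealised rectified stereo pair (an assumption
# made only for this example): epipolar lines are horizontal, so matching
# points share the same y coordinate.
F_RECTIFIED = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
view1 = [("person", (100, 50)), ("person", (200, 120)), ("dog", (300, 50))]
view2 = [("person", (80, 121)), ("person", (60, 52)), ("person", (10, 300))]
matches = match_across_views(F_RECTIFIED, view1, view2)
```

In this toy setup, each person in the first view is paired with the person in the second view lying closest to its epipolar line, while the dog has no class-coherent candidate and remains unmatched.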
Next, to fundamentally address the shortage of multi-view datasets and apply deep neural networks directly to multi-view detection and labelling, we generate a large-scale synthetic dataset named MV3DHumans for the multi-view multiple people detection and labelling task by combining 3D human models, panoramic backgrounds, human poses, and appearance detail rendering. The proposed MV3DHumans dataset consists of various scenes with different degrees of crowdedness, providing sufficient data for training deep neural networks. We then develop a novel end-to-end multi-view framework named Multi-View Labelling network (MVL-net), using a CNN to learn multi-view consistency from the generated synthetic dataset. A novel labelling branch with a matching net is introduced in the MVL-net to predict matching confidence scores for pairs of people from two views, thus solving the problem of label prediction for an unknown number of people. In addition, a calibration component based on epipolar geometry is integrated into the proposed framework to further improve performance.
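The following minimal sketch shows why pairwise matching confidence scores suit an unknown number of people: labels are derived from the scores rather than predicted from a fixed set of classes. It is not the MVL-net labelling branch itself. The score matrix is assumed to be given (for example, produced by a matching network), and the function name, the 0.5 threshold and the greedy assignment are illustrative assumptions.

```python
def assign_labels(scores, n_b, threshold=0.5):
    """Turn pairwise matching confidences into consistent identity labels.

    scores[i][j] is the confidence that person i in view A and person j in
    view B are the same individual; n_b is the number of people in view B.
    Pairs are accepted greedily from the highest confidence down, and
    anyone left unmatched receives a fresh label, so the total number of
    identities never has to be fixed in advance.
    """
    n_a = len(scores)
    pairs = sorted(
        ((scores[i][j], i, j)
         for i in range(n_a) for j in range(n_b)
         if scores[i][j] >= threshold),
        reverse=True,
    )
    labels_a = [None] * n_a
    labels_b = [None] * n_b
    next_label = 0
    for _, i, j in pairs:
        if labels_a[i] is None and labels_b[j] is None:
            labels_a[i] = labels_b[j] = next_label
            next_label += 1
    # People visible in only one view still get their own unique label.
    for labels in (labels_a, labels_b):
        for k, label in enumerate(labels):
            if label is None:
                labels[k] = next_label
                next_label += 1
    return labels_a, labels_b

# Two people in view A, three in view B; the scores are made-up values.
example_scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
]
labels_a, labels_b = assign_labels(example_scores, n_b=3)
```

Here the two confident pairs share labels across views, and the third person in view B, who matches nobody above the threshold, is given a new label.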
Finally, we extend our work from multi-view labelling on images to the more challenging tracking problem on multi-view videos. A synthetic multi-view video dataset named MV3DHumansVideo is established by extending data generation from images to videos, consisting of hundreds of short multi-view video sequences for training and several long sequences for testing. Based on the newly proposed multi-view video dataset, we propose a novel spatiotemporal correlation (STC) network, jointly achieving detection, temporal feature learning and spatial feature learning on multi-view videos. The proposed STC network combines detection and feature learning in a unified network and seamlessly integrates temporal and spatial context information to enhance the temporal and spatial features learned for multi-view video tracking. An efficient multi-view tracking framework is then developed based on the learned features to obtain consistent people labels for multi-view videos. We demonstrate that this method can learn effective features for multi-view tracking and obtain consistent people labels for entire multi-view videos.
Attend the Event
This is a free event open to everyone