Complete scene reconstruction from single-view RGBD is a challenging task, requiring estimation of scene regions occluded from the captured depth surface. We propose that scene-centric analysis of human motion within an indoor scene can reveal fully occluded objects and provide functional cues to enhance scene understanding tasks. Captured skeletal joint positions of humans, utilised as naturally exploring active sensors, are projected into a human-scene motion representation. Inherent body occupancy is leveraged to carve a volumetric scene occupancy map initialised from captured depth, revealing a more complete voxel representation of the scene. To obtain a structured box-model representation of the scene, we introduce novel terms to an object detection optimisation that overcome depth occlusions whilst still deriving from the same depth data. The method is evaluated on challenging indoor scenes containing multiple occluding objects such as tables and chairs. Evaluation shows that human-centric scene analysis can effectively enhance state-of-the-art scene understanding approaches, resulting in a more complete representation than single-view depth alone.
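The carving step described above can be illustrated with a minimal sketch: a voxel map is initialised from captured depth, and voxels within a body radius of any observed skeletal joint are marked free, since space a person occupies cannot be solid. The grid resolution, body radius, and function names below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

UNKNOWN, OCCUPIED, FREE = 0, 1, 2

def init_occupancy(depth_points, grid_shape, voxel_size):
    """Initialise a voxel map: voxels hit by captured depth points
    become OCCUPIED; everything else starts UNKNOWN."""
    grid = np.full(grid_shape, UNKNOWN, dtype=np.uint8)
    idx = np.floor(depth_points / voxel_size).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < grid_shape), axis=1)]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = OCCUPIED
    return grid

def carve_with_skeleton(grid, joints, voxel_size, body_radius=0.15):
    """Carve voxels around observed skeletal joints: space the body
    occupies must be free, exposing occluded scene regions."""
    r = int(np.ceil(body_radius / voxel_size))
    for j in joints:                       # j = (x, y, z) in metres
        c = np.floor(j / voxel_size).astype(int)
        lo = np.maximum(c - r, 0)
        hi = np.minimum(c + r + 1, grid.shape)
        grid[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = FREE
    return grid
```

Accumulating carves over a long capture progressively resolves UNKNOWN voxels behind occluders into FREE space, which the later object detection stage can exploit.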
In this paper, we propose an approach to indoor scene understanding from observation of people in single-view spherical video. As input, our approach takes a centrally located spherical video capture of an indoor scene, estimating the 3D localisation of human actions performed throughout the long-term capture. The central contribution of this work is a deep convolutional encoder-decoder network, trained on a synthetic dataset, that reconstructs regions of affordance from captured human activity. The predicted affordance segmentation is then integrated into 3D space to compose a reconstruction of the complete 3D scene. The mapping learnt between human activity and affordance segmentation demonstrates that omnidirectional observation of human activity can be applied to scene understanding tasks such as 3D reconstruction. We show that our approach, using only observation of people, performs well against previous approaches, allowing reconstruction of occluded regions and labelling of scene affordances.
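The encoder-decoder idea can be sketched as follows: the network takes a top-down activity map and emits per-cell affordance logits at the same resolution. The layer widths, class count, and class names here are hypothetical placeholders, not the architecture or training data used in the work.

```python
import torch
import torch.nn as nn

class AffordanceNet(nn.Module):
    """Minimal convolutional encoder-decoder sketch for top-down
    affordance segmentation (illustrative layer sizes only)."""
    def __init__(self, in_channels=1, n_classes=4):
        super().__init__()
        # Encoder: downsample the activity map to a compact latent grid.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to input resolution, one logit per
        # assumed class (e.g. background/walkable/sittable/interactable).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, n_classes, 4, stride=2, padding=1),
        )

    def forward(self, activity_map):
        return self.decoder(self.encoder(activity_map))
```

A per-cell cross-entropy loss against synthetic affordance labels would train such a network; the symmetric stride-2 down/up path keeps output and input grids aligned.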
Visual scene understanding studies the task of representing a captured scene in a manner that emulates human-like understanding of that space. Since indoor scenes are designed for human use and are utilised every day, attaining this understanding is crucial for applications such as robotic mapping and navigation, smart home and security systems, and home healthcare and assisted living. However, although we as humans utilise such spaces in our day-to-day lives, analysis of human activity is not commonly applied towards enhancing indoor scene-level understanding. As such, the work presented in this thesis investigates the benefits of including human activity information in indoor scene understanding challenges, aiming to demonstrate its potential contributions, applications, and versatility.
The first contribution of this thesis utilises human activity to reveal scene regions occluded behind objects and clutter. Human poses recognised from a static sensor are projected into a top-down scene representation recording belief of human activity over time. This representation is applied to carve a volumetric scene map, initialised on captured depth, to expose the occupancy of hidden scene regions. An object detection approach exploits the revealed occluded scene occupancy to localise self-, partially-, and, significantly, fully-occluded objects. The second contribution extends the top-down activity representation to predict the functionality of major scene surfaces from human activity recognised in 360-degree video. A convolutional network is trained on simulated human activity to segment walkable, sittable, and interactable surfaces from the top-down perspective. This prediction is applied to construct a complete 3D approximation of the scene, with results showing that scene structure and surface functionality can be predicted well from human activity alone. Finally, this thesis investigates an association between the top-down functionality prediction and the captured visual scene. A new dataset capturing long-term human activity is introduced to train a model on combined activity and visual scene information. The model is trained to segment functional scene surfaces from the capture sensor perspective, with evaluation establishing that the introduction of human activity information can improve functional surface segmentation performance.
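The top-down representation recording belief of human activity over time can be sketched as a simple floor-plane accumulator: 3D joint detections are projected onto the ground plane and incremented into a 2D grid, with older observations decayed. The cell size, decay factor, and function name are illustrative assumptions rather than the thesis formulation.

```python
import numpy as np

def update_belief_map(belief, joints_xyz, cell_size=0.05, decay=0.999):
    """Project skeletal joints (x, y, z) onto floor-plane cells (x, y)
    and increment the belief that each cell supports human activity."""
    belief *= decay                              # older observations fade
    cells = np.floor(joints_xyz[:, :2] / cell_size).astype(int)
    h, w = belief.shape
    ok = (cells[:, 0] >= 0) & (cells[:, 0] < h) & \
         (cells[:, 1] >= 0) & (cells[:, 1] < w)
    for x, y in cells[ok]:
        belief[x, y] += 1.0                      # accumulate evidence
    return belief
```

Calling this per frame over a long capture yields a heat-map-like grid that can feed both the carving stage and the top-down functionality prediction.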
Overall, the work presented in this thesis demonstrates that analysis of human activity can be applied to enhance indoor scene understanding across various challenges, sensors, and representations. Assorted datasets are introduced alongside the major contributions to motivate further investigation into the application of human activity analysis.