11am - 12 noon

Thursday 18 January 2024

Structured representations for scene understanding

PhD Open Viva presentation by Avishkar Saha. All welcome!

Free

21BA02
CVSSP/AI Institute
University of Surrey
Guildford
Surrey
GU2 7XH

This event has passed

Speakers

Avishkar Saha

Abstract

This is a thesis of two halves. In the first, we address autonomous 3D reconstruction, the process by which an agent constructs its own representations of a scene. In the second, we take a more task-agnostic approach and develop a range of structure-learning methods for graph neural networks.

Bird's-eye-view (BEV) estimation is a challenging task in autonomous 3D reconstruction, involving the generation of overhead maps from monocular images. While recent end-to-end deep learning approaches have replaced what was traditionally a multi-stage process, they are not constrained in ways that reflect the structure of the world. This limits their accuracy in mapping both the static and dynamic elements of a scene. The first contribution of this thesis is a neural network that exploits this image-to-world structure by formulating map generation from an image as a set of sequence-to-sequence translations. Our physically grounded formulation brings substantial improvements over state-of-the-art methods.

The first contribution improves BEV-estimation accuracy by leveraging the image-to-world structure, with more pronounced gains for larger textured classes like roads and pavements. However, smaller dynamic objects still face challenges, with low recall and localization errors, particularly at greater distances from the camera. This is partly due to the model's reliance on common visual cues, such as shadows, which become scarcer as objects move farther away. Our second contribution tackles these limitations in BEV-estimation by introducing graphs to reason about objects through their relations. By reasoning about an object within the context of other objects, we guide the network towards more pertinent cues than before. This graph-based reasoning results in a 50% relative improvement for dynamic objects compared to our prior model.

Our second contribution imposes a k-nearest-neighbor graph structure, on the assumption that closer objects provide more depth cues. This assumption may not always hold, and in some cases domain knowledge for graph construction might be entirely unavailable. In response, our third contribution introduces an end-to-end differentiable graph generator that dynamically builds graph topologies. This module integrates seamlessly with graph convolutional networks (GCNs), learning edge structures optimized for the downstream task. Extensive evaluation demonstrates accuracy improvements in BEV-estimation, trajectory prediction, point cloud classification, and node classification.
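As a rough illustration of the fixed k-nearest-neighbor graph structure mentioned above, the sketch below links each object to its k closest objects by Euclidean distance. It is not the thesis implementation: the function name `knn_edges` and the object coordinates are hypothetical, chosen only to show the idea.

```python
import math

def knn_edges(points, k):
    """Return directed edges (i, j) linking each point to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        edges.extend((i, j) for _, j in others[:k])
    return edges

# Hypothetical object centres on the ground plane, in metres; not thesis data.
objects = [(0.0, 5.0), (1.0, 6.0), (10.0, 30.0), (11.0, 31.0)]
edges = knn_edges(objects, k=1)
print(edges)
```

With k fixed in advance, the edge set depends only on distance, which is the assumption a learned graph generator would relax.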

While generating graph structures based on local neighborhoods is suitable for many tasks, it assumes relevance based on spatial proximity. However, this assumption does not always hold in 3D point clouds. Point clouds follow topological manifolds, and locally constructed edge structures can create shortcuts between disconnected regions, making GCNs sensitive to perturbations. This can be particularly problematic in point cloud processing, as deformations and sensor noise are common. Our fourth contribution introduces a method that learns topologically stable representations across multiple scales. By utilizing tools from persistent homology, we reduce overfitting in the embedding space. This topology stabilization can integrate with graph convolution operations, improving predictive performance in a range of point cloud labeling tasks.
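The shortcut problem described above can be seen in a toy setting. The sketch below assumes two hypothetical surface sheets that lie close together in space but are not connected, and counts nearest-neighbor edges that jump from one sheet to the other. The geometry and names are illustrative only and are not taken from the thesis.

```python
import math

# Two hypothetical sheets sampled 1.0 apart along x, separated by 0.4 in y.
sheet_a = [(float(x), 0.0) for x in range(4)]
sheet_b = [(float(x), 0.4) for x in range(4)]
points = sheet_a + sheet_b
sheet_of = [0] * len(sheet_a) + [1] * len(sheet_b)

def nearest(i):
    """Index of the point closest to point i in Euclidean distance."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

# A shortcut is a nearest-neighbour edge whose endpoints lie on different sheets.
shortcuts = [
    (i, nearest(i))
    for i in range(len(points))
    if sheet_of[i] != sheet_of[nearest(i)]
]
print(len(shortcuts), "of", len(points), "nearest-neighbour edges cross sheets")
```

Because the gap between the sheets is smaller than the sampling interval within each sheet, every nearest-neighbor edge here crosses between regions that are not actually connected.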

The preceding models are all fundamentally associative: given sufficient data, they can make accurate predictions for in-distribution scenarios similar to those seen before. However, as they are not causal, they do not model the data-generating process directly and often struggle when queried on scenarios outside the training distribution. Our final contribution focuses on modeling the causal structure of object-based physical systems to enhance forecasting. We infer time-varying causal relationships among objects and leverage them for trajectory prediction, enabling counterfactual predictions and generalization to diverse scenarios beyond the training distribution.

Featured Academics

Prof Richard Bowden

Professor of Computer Vision and Machine Learning
