
Dr Jaime Spencer Martin
Centre for Vision, Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering

About
Biography
I obtained my BEng in Electronic Engineering from the University of Surrey in 2017. I am currently in the final stages of obtaining my PhD from the University of Surrey, supervised by Dr Simon Hadfield and Prof Richard Bowden. My PhD research has focused on dense feature learning for vehicle automation: using CNNs to learn generic dense feature representations that can be applied to a wide variety of computer vision tasks, such as correspondence estimation, SLAM, semantic segmentation, depth estimation and visual localization. It has also addressed robustness to seasonal appearance changes (day vs. night, summer vs. winter) and efficient multi-task learning. During my PhD I carried out an internship at Tesla as part of the Autopilot group.
My role is Research Fellow in Computer Vision and Deep Learning in CVSSP, working on the project "ROSSINI", funded by the Engineering and Physical Sciences Research Council (EPSRC). ROSSINI will develop a new machine vision system for 3D reconstruction that is more flexible and robust than previous methods. The aims are to (i) produce 3D representations that look correct to humans even if they are not strictly geometrically correct; (ii) do so for all types of scene; and (iii) express the uncertainty inherent in each reconstruction. Importantly, the inclusion of human perceptual data should reduce the overall quantity of training data required, while mitigating the risk of over-reliance on a specific dataset.
Research

Research projects
ROSSINI: Reconstructing 3D Structure from Single Images: A Perceptual Reconstruction Approach

Consumers enjoy the immersive experience of 3D content in cinema, TV and virtual reality (VR), but it is expensive to produce. Filming a 3D movie requires two cameras to simulate the two eyes of the viewer. A common but expensive alternative is to film a single view, then use video artists to create the left and right eyes' views in post-production. What if a computer could automatically produce a 3D model (and binocular images) from 2D content, 'lifting images into 3D'? This is the overarching aim of this project. Lifting into 3D has multiple uses, such as route planning for robots and obstacle avoidance for autonomous vehicles, alongside applications in VR and cinema.
Estimating 3D structure from a 2D image is difficult because in principle, the image could have been created from an infinite number of 3D scenes. Identifying which of these possible worlds is correct is very hard, yet humans interpret 2D images as 3D scenes all the time. We do this every time we look at a photograph, watch TV or gaze into the distance, where binocular depth cues are weak. Although we make some errors in judging distances, our ability to quickly understand the layout of any scene enables us to navigate through and interact with any environment.
ROSSINI will develop a new machine vision system for 3D reconstruction that is more flexible and robust than previous methods. Focussing on static images, we will identify key structural features that are important to humans. We will combine neural networks with computer vision methods to form human-like descriptions of scenes and 3D scene models. Our aims are to (i) produce 3D representations that look correct to humans even if they are not strictly geometrically correct; (ii) do so for all types of scene; and (iii) express the uncertainty inherent in each reconstruction. To this end we will collect data on human interpretation of images and incorporate this information into our network. Our novel training method will learn from humans and existing ground truth datasets, with the training algorithm selecting the most useful human tasks (e.g. judging depth within a particular image) to maximise learning. Importantly, the inclusion of human perceptual data should reduce the overall quantity of training data required, while mitigating the risk of over-reliance on a specific dataset. Moreover, when fully trained, our system will produce 3D reconstructions alongside information about the reliability of the depth estimates.
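The abstract above describes producing depth estimates alongside a measure of their reliability. One common way to obtain such per-pixel uncertainty (a general technique, not necessarily the one ROSSINI uses) is to train a network to predict both a depth map and a log-scale map under a heteroscedastic Laplacian likelihood, so that the learned scale acts as an uncertainty estimate. A minimal NumPy sketch of that loss, with all names hypothetical:

```python
import numpy as np

def laplacian_nll(depth_pred, log_b, depth_gt):
    """Per-pixel Laplacian negative log-likelihood (up to a constant).

    A network that outputs both a depth map and a log-scale map log_b can
    be trained with this loss; the learned scale b = exp(log_b) serves as
    a per-pixel reliability estimate for the predicted depth.
    """
    b = np.exp(log_b)
    return np.mean(np.abs(depth_pred - depth_gt) / b + log_b)

# Toy example: a confidently wrong prediction (small b) is penalised more
# than the same wrong prediction flagged as uncertain (large b).
gt = np.full((4, 4), 2.0)
pred = np.full((4, 4), 3.0)  # 1 m error at every pixel
loss_confident = laplacian_nll(pred, np.full((4, 4), -1.0), gt)  # b ≈ 0.37
loss_uncertain = laplacian_nll(pred, np.full((4, 4), 1.0), gt)   # b ≈ 2.72
```

Because the log_b term penalises inflated uncertainty, the network cannot simply declare everything unreliable; it must trade accuracy against honesty about its own errors.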
Indicators of esteem
PGR Student of the Year 2021 (Faculty of Engineering and Physical Sciences)
Teaching
- Demonstrator for EEE1035 (Programming in C) (2017/18, 2018/19, 2019/20)
- Demonstrator for EEE3032 (Computer Vision and Pattern Recognition) (2017/18)