Real-time complex scene segmentation and prediction in urban environments

The project will investigate spatiotemporal semantic segmentation and the prediction of the actions of other road users in the context of SAE level 4-5 autonomous vehicles. Specifically, it will focus on inner-city and urban driving, developing computer vision tools for segmentation and for reasoning about scene content and motion, with integration into ROS and testing on our autonomous testbed.

Start date
Application deadline
Funding information

This is a 4-year industrial CASE studentship with an increased stipend of £18-19.5K per annum (tax free), plus additional funds for travel, equipment and consumables. It also covers Home/EU tuition fees for the duration of study.

Funding source
EPSRC and Jaguar Land Rover
Supervised by


Semantic image segmentation is the process of associating each pixel of an image with a class label (such as flower, person, road or sky). Its purpose is to distil the complexity of a high-resolution image containing millions of pixels into a lower-level representation from which image understanding can be achieved. Historically, single-image segmentation was performed with probabilistic models, whereas the current state of the art uses deep learning. Many recent datasets for automotive vision contain hand-labelled semantic segmentations in which images are labelled in terms of road, road markings, buildings, vehicles, pedestrians, cyclists and street furniture. The typical approach is to train a network to reproduce these labels on unseen images. However, very little work attempts to build the dynamics of the scene into the segmentation process. For example, the motion of a walking person is easily identifiable to the human eye even when the person occupies only a few pixels. At this resolution, even a human would have trouble picking out the person in a static image, but through context and recognition of the pattern of moving pixels in video we can deduce that a person is present.
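As a rough illustration of the idea described above, the following minimal Python sketch labels each pixel with its highest-scoring class, and then shows how pooling evidence over several video frames can change the label of an ambiguous pixel. Everything in it is an assumption made for illustration: the class list, the function names `label_frame` and `average_scores`, and the toy scores are hypothetical, and simple averaging over a fixed camera view is only a naive stand-in for temporal modelling, not the method this project will develop.

```python
from typing import List

# Hypothetical label set, chosen only for this illustration.
CLASSES = ["road", "pedestrian", "sky"]

# Per-pixel class scores for one frame, indexed as [row][column][class].
Scores = List[List[List[float]]]


def label_frame(scores: Scores) -> List[List[int]]:
    """Assign each pixel the index of its highest-scoring class."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


def average_scores(frames: List[Scores]) -> Scores:
    """Average per-pixel class scores over a short window of frames.

    Assumes a static camera with no motion compensation, so the same
    [row][column] position refers to the same scene point in every frame.
    """
    n = len(frames)
    rows, cols, k = len(frames[0]), len(frames[0][0]), len(frames[0][0][0])
    return [
        [
            [sum(f[r][c][i] for f in frames) / n for i in range(k)]
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Toy 1x2 "video": the second pixel is ambiguous in the first frame,
# but the later frames favour the pedestrian class.
frames = [
    [[[0.9, 0.05, 0.05], [0.50, 0.40, 0.10]]],
    [[[0.9, 0.05, 0.05], [0.30, 0.60, 0.10]]],
    [[[0.9, 0.05, 0.05], [0.35, 0.55, 0.10]]],
]

single = label_frame(frames[0])
pooled = label_frame(average_scores(frames))
print([[CLASSES[i] for i in row] for row in single])
print([[CLASSES[i] for i in row] for row in pooled])
```

In this toy example the second pixel is labelled as road from the first frame alone, but as pedestrian once the scores from all three frames are pooled. A real system would need to account for camera and object motion rather than comparing fixed pixel positions.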

This project will focus on real-time semantic segmentation of complex urban scenes, where the temporal evolution of the scene is used to increase segmentation accuracy. The dynamics of the scene can be used not only to improve segmentation but also to predict how the scene will evolve and what other road users will do next. Importantly, this would allow an AI vision system to process an incoming video stream in real time, break the scene into its constituent components and answer questions such as “what do we expect the scene to look like in five seconds?”.

The project will investigate the use of spatiotemporal semantic segmentation and the predicted actions of other road users in the context of SAE level 4-5 autonomous vehicles. Specifically, it will focus on inner-city and urban driving, developing computer vision tools for segmentation and for reasoning about scene content and motion, with integration into ROS and testing on our autonomous testbed.

The PhD is based in the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey but will involve close collaboration with, and internship opportunities at, Jaguar Land Rover in Warwick. CVSSP is an internationally recognised leader in audio-visual machine perception research. With a diverse community of more than 150 researchers, we are one of the largest audio and vision research groups in the UK. You will join around 50 other postgraduate research students working across a broad range of areas in vision and deep learning.

Related links
Centre for Vision, Speech and Signal Processing (CVSSP)

Eligibility criteria

  • A first class or 2:1 honours degree (or equivalent overseas qualification) in an appropriate discipline (e.g. engineering, computer science, signal processing, applied mathematics or physics)
  • You should be able to demonstrate excellent mathematical, analytical and programming skills
  • Previous experience in computer vision, machine/deep learning, or augmented reality would be advantageous
  • IELTS 6.5 or above (or equivalent) with no sub-test of less than 6.

How to apply

In the first instance, contact Prof Richard Bowden, indicating your areas of interest and including your CV with qualification details (including copies of transcripts and certificates). You will then need to apply to our Vision, Speech and Signal Processing PhD, mentioning this studentship in your application.

Vision, Speech and Signal Processing PhD

Contact details

Richard Bowden
22 BA 00
Telephone: +44 (0)1483 689838

The Centre for Vision, Speech and Signal Processing (CVSSP)

Our Centre is ranked number one for Computer Vision in the UK and is one of the largest in Europe, with over 150 researchers and a grant portfolio in excess of £23 million. CVSSP brings together a unique combination of cutting-edge sound and vision expertise, with an outstanding track record of innovative research leading to technology transfer and exploitation in biometrics, creative industries (film, TV, games, VR, AR), mobile communication, healthcare, robotics, autonomous vehicles and consumer electronics.

Postgraduate research with CVSSP

CVSSP promotes an exciting, supportive and inclusive research environment. Secondments with industry and international groups are encouraged during the PhD to broaden experience and to keep research relevant to real-world problems. Our PhD researchers contribute to advancing the state of the art through publication and by participating in leading international forums.

The Postgraduate Research Experience Survey (PRES) 2018 ranked the University of Surrey 7th in the UK out of 63 and 4th for retention and completion.


Studentships at Surrey

We have a wide range of studentship opportunities available.