Audio-visual object-based dynamic scene representation from monocular video

This research will investigate the transformation of monocular audio-visual video into a spatially localised, object-based audio-visual representation.

Start date

1 October 2021


Standard project duration is 4 years.

Application deadline

Funding source

EPSRC iCASE studentship sponsored by the BBC

Funding information

  • Full UK/EU tuition fee covered
  • Stipend of £18,609 p.a., increasing annually (including an enhanced stipend of £3,000 p.a.)
  • Personal computer provided by the department
  • Conference attendance budget of £2,000 p.a.
  • Equipment/consumables budget of £1,000 p.a.
  • Funding duration of 4 years

EPSRC studentships: up to 30% of places may be offered to overseas (OS) students with a fee waiver.



Self-supervised and weakly supervised deep learning will be investigated for the transformation of general scenes into semantically labelled and localised objects. This will build on recent advances in deep-learning based monocular reconstruction of general dynamic scenes and objects with known semantic labels, such as people. Multi-modal information sources including audio and text subtitles will be employed to support weakly supervised learning for semantic labelling and object-based reconstruction.

The goal of this research is to generalise to unconstrained video sequences of complex real-world scenes with multiple interacting people. Research will investigate approaches for the transfer of multi-modal or additional information to support the object-based scene reconstruction and evaluate the relative importance of different information sources.

The approach should be able to achieve plausible reconstruction of unknown or unmodelled object classes, together with complete reconstruction for modelled object classes. Learning on in-the-wild and BBC archive datasets will be investigated to support the generalisation to complex scenes. Specific use-cases such as sports and programme recommendation will also be investigated for evaluation in constrained contexts. The approach will be evaluated on both live and legacy content.

Related links

  • Centre for Vision, Speech and Signal Processing (CVSSP)
  • CVSSP postgraduate research study

Eligibility criteria

All applicants should have (or expect to obtain) a first-class degree in a numerate discipline (mathematics, science or engineering), or an MSc with Distinction (or a 70% average), and a strong interest in pursuing research in this field. Additional experience relevant to the research area is also advantageous, especially a demonstrated capability or interest in convergence research spanning the physical, engineering and biological sciences.

If English is not your first language, you are required to have an IELTS score of 6.5 or above (or equivalent), with no sub-test score below 6.0.

This studentship is open to UK students only.

How to apply

Applications must be submitted via the Vision, Speech and Signal Processing PhD programme page. 

Shortlisted applicants will be contacted directly to arrange a suitable interview time. For further information about our research portfolio and how to apply, visit our dedicated PhD study page.

For enquiries, contact Nan Bennett, indicating your areas of interest and including your CV with qualification details (copies of transcripts and certificates).



CVSSP is a leading UK research centre in audio-visual signal processing, computer vision and machine learning, ranked 1st in the UK and 3rd in Europe for computer vision. Our Centre is one of the largest in Europe, with over 170 researchers and a grant portfolio in excess of £27 million, bringing together a unique combination of cutting-edge sound and vision expertise. Our aim is to advance the state of the art in multimedia signal processing and computer vision, with a focus on image, video and audio applications. Our Centre has a strong track record of innovative research leading to technology transfer and exploitation in biometrics, the creative industries (film, TV, games, VR), communication, healthcare, robotics and consumer electronics.

CVSSP is a destination of choice for postgraduate talent and is part of the Department of Electrical and Electronic Engineering, which is ranked second in the Guardian newspaper league table 2020. The University of Surrey was ranked 7th in the UK in the 2020 Advance HE Postgraduate Research Experience Survey (PRES).

We acknowledge, understand and embrace diversity.


Studentships at Surrey

We have a wide range of studentship opportunities available.