Dr Simon Hadfield
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering.
My research focuses on using cutting-edge visual sensing technologies such as event cameras to solve machine learning/autonomy and 3D computer vision tasks.
I am a lecturer in robot vision and autonomous systems within the University of Surrey’s Centre for Vision, Speech and Signal Processing (CVSSP). My long-term aim is to develop the first UK centre of excellence in perception and AI algorithms for asynchronous visual sensing.
Having studied for an MEng in Electronic and Computer Engineering at Surrey (finishing as the top student in my graduating year), I went on to do an EPSRC-funded PhD in computer vision at the University. Supervised by Professor Richard Bowden, this was entitled ‘The estimation and use of 3D information, for natural human action recognition’. I remained at Surrey as a Research Fellow and became a lecturer in February 2016.
Areas of specialism
3D computer vision;
scene flow estimation;
University roles and responsibilities
- Manager of Dissertation allocation and assessment system for undergraduate and postgraduate taught programmes in the Department of Electrical and Electronic Engineering.
- Undergraduate Personal Tutor
- Health and Safety group – Representative for CVSSP labs (BA)
Affiliations and memberships
My research is in the field of computer vision and machine learning, with a particular emphasis on the effective exploitation of novel visual sensors.
I focus on taking computer vision techniques ‘out of the lab’ and making them practically applicable to the real world. For example, I’ve proposed a new paradigm for efficient dynamic reconstruction with a computational overhead several orders of magnitude lower than previous techniques, which is an important step towards using these techniques in real-time robotics applications. This research was published in both the top journal and top international conference in the field, generating more than 100 citations to date.
Ultimately, my research in robotic perception and automation could impact a range of areas of modern life. In the manufacturing industry, automation techniques are urgently needed to reduce costs and enable manufacturers to adapt to changing environments, while autonomous vehicles require new types of visual sensor and perception algorithms to push safety to human levels and beyond. Medical robotics needs increasing levels of intelligence and automation in order to proceed beyond the capabilities of the human doctors operating them. And in the space sector, spaceborne robotics – which allow in-orbit assembly and servicing – are necessitating new approaches to visual perception.
Indicators of esteem
2018 – Winner of the Early Career Teacher of the Year (Faculty of Engineering & Physical Sciences)
Reviewer for more than 10 high impact international journals and conferences
Two NVidia GPU grants
2017 – Finalist for Supervisor of the Year (Faculty of Engineering & Physical Sciences) and the Tony Jeans Inspirational Teaching award
2016 – Second place in the international academic challenge on continuous gesture recognition (ChaLearn)
DTI MEng prize, for best all round performance of the entire graduating year, awarded by Department of Trade and Industry
I currently teach two modules to undergraduates in the Department of Electrical and Electronic Engineering:
- C++ and Object-Oriented Design: Year 2 (EEE2047)
- Robotics: Year 3 (EEE3043)
In my teaching, I focus heavily on practical hands-on experience interleaved with the traditional taught material, to ensure that students have the chance to practise and receive feedback on the techniques they are learning.
Every year I also supervise around four undergraduate projects in the areas of deep learning, robotics and computer vision. Historically, the projects I have helped supervise have often won or been nominated for external and internal prizes including the BAE prize, national ‘hackaday’ prizes and Surrey’s Department of Electrical and Electronic Engineering prize.
I have received a number of awards and nominations for my supervision and teaching. These include winning the 2018 Early Career Teacher of the Year and being nominated for the 2017 Tony Jeans Inspirational Teaching award.
Postgraduate research supervision
Oct 2018 - Yusuf Duman - NIMROD: Analogue visual sensors
Oct 2018 - Suzanna Lucarotti - Self-reconfigurable modular robotic manipulators for space
Oct 2018 - Lucy Elaine Jackson - Small Space Robots for In-orbit Operations
Oct 2017 - Jaime Spencer Martin - Driver behaviour modelling for assistance & automation
Jul 2017 - Celyn Walters - Automation of marine vessels
Jul 2017 - Stephanie Stoll - Sign Language Production using Neural Machine Translation and Generative Adversarial Networks
Oct 2016 - Rebecca Allday - Machine Learning for Robotic Grasping
Oct 2016 - Peter Blacker - LIDAR Sensor Fusion for Planetary Rovers
Jul 2016 - N. Cihan Camgöz - Neural Sign Language Translation: Continuous Sign Language Recognition from a Machine Translation Perspective (PGR student of the year - VC awards 2018)
Oct 2013 - Filippos Koidis - Development of an eating topography protocol and an investigation of its effects on body composition, appetite and mindfulness
Completed postgraduate research projects I have supervised
Jul 2018 - Matthew Marter - Learning to Recognise Visual Content from Textual Annotation
Sep 2017 - Oscar Mendez - Collaborative Strategies for Autonomous Localisation, 3D Reconstruction and Pathplanning (Sullivan Thesis Prize finalist, top UK thesis in computer vision)
Jul 2016 - Karel Lebeda - 2D and 3D Tracking and Modelling (Sullivan Thesis Prize winner, top UK thesis in computer vision)
non-rigid structure from motion. The problem is reformulated, incorporating
ideas from visual object tracking, to provide a more general
and unified technique, with feedback between the reconstruction and
point-tracking algorithms. The resulting algorithm overcomes the limitations
of many conventional techniques, such as the need for a reference
image/template or precomputed trajectories. The technique can also be
applied in traditionally challenging scenarios, such as modelling objects
with strong self-occlusions or from an extreme range of viewpoints. The
proposed algorithm needs no offline pre-learning and does not assume
the modelled object stays rigid at the beginning of the video sequence.
Our experiments show that in traditional scenarios, the proposed method
can achieve better accuracy than the current state of the art while using
less supervision. Additionally we perform reconstructions in challenging
new scenarios where state-of-the-art approaches break down and where
our method improves performance by up to an order of magnitude.
Neural Networks for large scale user-independent continuous
gesture recognition. We have trained an end-to-end deep network
for continuous gesture recognition (jointly learning both the
feature representation and the classifier). The network performs
three-dimensional (i.e. space-time) convolutions to extract features
related to both the appearance and motion from volumes of
color frames. Space-time invariance of the extracted features is
encoded via pooling layers. The earlier stages of the network are
partially initialized using the work of Tran et al. before being
adapted to the task of gesture recognition. An earlier version
of the proposed method, which was trained for 11,250 iterations,
was submitted to the ChaLearn 2016 Continuous Gesture Recognition
Challenge and ranked 2nd with a Mean Jaccard Index Score
of 0.269235. When the proposed method was further trained for
28,750 iterations, it achieved state-of-the-art performance on the
same dataset, yielding a 0.314779 Mean Jaccard Index Score.
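The Mean Jaccard Index quoted above can be illustrated with a minimal sketch. It assumes a simplified frame-level protocol: each sequence is a list of per-frame labels, `None` marks frames with no gesture, the score for a sequence is averaged over the gesture classes present in the ground truth, and the final score is the mean over sequences. The function names are illustrative, and the challenge's official evaluation script may differ in detail.

```python
def jaccard_for_class(truth, pred, label):
    """Intersection over union of the frames assigned to one gesture label."""
    truth_frames = {i for i, y in enumerate(truth) if y == label}
    pred_frames = {i for i, y in enumerate(pred) if y == label}
    union = truth_frames | pred_frames
    return len(truth_frames & pred_frames) / len(union) if union else 0.0


def sequence_jaccard(truth, pred):
    """Average the per-class Jaccard index over the classes in the ground truth."""
    labels = {y for y in truth if y is not None}
    if not labels:
        return 0.0
    return sum(jaccard_for_class(truth, pred, label) for label in labels) / len(labels)


def mean_jaccard(truths, preds):
    """Mean of the per-sequence scores (assumes at least one sequence)."""
    scores = [sequence_jaccard(t, p) for t, p in zip(truths, preds)]
    return sum(scores) / len(scores)


# Toy example: gesture 1 is only half recovered, gesture 2 is recovered exactly.
truth = [1, 1, 1, 1, None, 2, 2, 2]
pred = [1, 1, None, None, None, 2, 2, 2]
score = mean_jaccard([truth], [pred])
```

In the toy sequence, gesture 1 scores 0.5 and gesture 2 scores 1.0, so the sequence score is their average.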
has been widely addressed in the literature. While many
approaches exist to perform reconstruction, few of them
take an active role in deciding where the next observations
should come from. Furthermore, the problem of travelling
from the camera’s current position to the next, known as
pathplanning, usually focuses on minimising path length.
This approach is ill-suited for reconstruction applications,
where learning about the environment is more valuable than
speed of traversal.
We present a novel Scenic Route Planner that selects
paths which maximise information gain, both in terms of
total map coverage and reconstruction accuracy. We also
introduce a new type of collaborative behaviour into the
planning stage called opportunistic collaboration, which
allows sensors to switch between acting as independent
Structure from Motion (SfM) agents or as a variable baseline
We show that Scenic Planning enables similar performance
to state-of-the-art batch approaches using less than
0.00027% of the possible stereo pairs (3% of the views).
Comparisons against length-based pathplanning approaches
show that our approach produces more complete and more
accurate maps with fewer frames. Finally, we demonstrate
the Scenic Pathplanner’s ability to generalise to live scenarios
by mounting cameras on autonomous ground-based
sensor platforms and exploring an environment.
simultaneous alignment and recognition problems (referred
to as ‘Sequence-to-sequence’ learning). We decompose the
problem into a series of specialised expert systems referred
to as SubUNets. The spatio-temporal relationships between
these SubUNets are then modelled to solve the task, while
remaining trainable end-to-end.
The approach mimics human learning and educational
techniques, and has a number of significant advantages. SubUNets
allow us to inject domain-specific expert knowledge
into the system regarding suitable intermediate representations.
They also allow us to implicitly perform transfer
learning between different interrelated tasks, which also allows
us to exploit a wider range of more varied data sources.
In our experiments we demonstrate that each of these properties
serves to significantly improve the performance of the
overarching recognition system, by better constraining the
The proposed techniques are demonstrated in the challenging
domain of sign language recognition. We demonstrate
state-of-the-art performance on hand-shape recognition (outperforming
previous techniques by more than 30%). Furthermore,
we are able to obtain comparable sign recognition
rates to previous research, without the need for an alignment
step to segment out the signs for recognition.
probabilistic forced alignment approach for training spatiotemporal
deep neural networks using weak border level
The proposed method jointly learns to localize and recognize
isolated instances in continuous streams. This is done
by drawing training volumes from a prior distribution of
likely regions and training a discriminative 3D-CNN from
this data. The classifier is then used to calculate the posterior
distribution by scoring the training examples and using this
as the prior for the next sampling stage.
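The sample, train, rescore loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: candidate regions are `(start, end)` frame windows, `toy_score` is a hypothetical stand-in for the trained 3D-CNN's confidence, and the posterior is assumed to be proportional to prior times score. The training step itself is left as a comment.

```python
import random


def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def draw_training_windows(candidates, prior, k, rng):
    """Sample k candidate windows in proportion to the current prior."""
    return rng.choices(candidates, weights=prior, k=k)


def posterior_from_scores(candidates, prior, score_fn):
    """Assumed update rule: posterior is proportional to prior * classifier score."""
    return normalise([p * score_fn(c) for c, p in zip(candidates, prior)])


def toy_score(window):
    """Hypothetical classifier: favours windows overlapping frames 4 to 12."""
    start, end = window
    overlap = max(0, min(end, 12) - max(start, 4))
    return 0.1 + overlap / 8.0


candidates = [(0, 8), (4, 12), (8, 16)]
prior = normalise([1.0, 1.0, 1.0])
rng = random.Random(0)

for _ in range(3):
    batch = draw_training_windows(candidates, prior, k=4, rng=rng)
    # A real system would train the classifier on `batch` at this point.
    prior = posterior_from_scores(candidates, prior, toy_score)
```

After a few rounds the probability mass concentrates on the window that the stand-in classifier scores highest, which is the behaviour the alignment procedure relies on.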
We apply the proposed approach to the challenging task
of large-scale user-independent continuous gesture recognition.
We evaluate the performance on the popular ChaLearn
2016 Continuous Gesture Recognition (ConGD) dataset. Our
method surpasses state-of-the-art results by obtaining 0.3646
and 0.3744 Mean Jaccard Index Score on the validation and
test sets of ConGD, respectively. Furthermore, we participated
in the ChaLearn 2017 Continuous Gesture Recognition
Challenge and were ranked 3rd. It should be noted that our
method is learner independent, so it can be easily combined
with other approaches.
In the real world, reconstructing a 3D scene requires nuanced understanding of the environment. Additionally, it is not enough to simply ‘understand’ the world; autonomous agents must be capable of actively acquiring this understanding. Achieving all of this using simple monocular sensors is extremely challenging. Agents must be able to understand what areas of the world are navigable, how egomotion affects reconstruction and how other agents may be leveraged to provide an advantage. All of this must be considered in addition to the traditional 3D reconstruction issues of correspondence estimation, triangulation and data association.
Simultaneous Localisation and Mapping (SLAM) solutions are not particularly well suited to autonomous multi-agent reconstruction. They typically require the sensors to be in constant communication, do not scale well with the number of agents (or map size) and require expensive optimisations. Instead, this thesis attempts to develop more pro-active techniques from the ground up.
First, an autonomous agent must have the ability to actively select what it is going to reconstruct. Known as view-selection, or Next-Best View (NBV), this has recently become an active topic in autonomous robotics and will form the first contribution of this thesis. Second, once a view is selected, an autonomous agent must be able to plan a trajectory to arrive at that view. This problem, known as path-planning, can be considered a core topic in the robotics field and will form the second contribution of this thesis. Finally, the 3D reconstruction must be anchored to a globally consistent map that co-relates to the real world. This will be addressed as a floorplan localisation problem, an emerging field for the vision community, and will be the third contribution of this thesis.
To give autonomous agents the ability to actively select what data to process, this thesis discusses the NBV problem in the context of Multi-View Stereo (MVS). The proposed approach has the ability to massively reduce the amount of computing resources required for any given 3D reconstruction. More importantly, it autonomously selects the views that improve the reconstruction the most. All of this is done exclusively on the sensor pose; the images are not used for view-selection and only loaded into memory once they have been selected for reconstruction. Experimental evaluation shows that NBV applied to this problem can achieve results comparable to state-of-the-art using as little as 3.8% of the views.
To provide the ability to execute an autonomous 3D reconstruction, this thesis proposes a novel computer-vision based goal-estimation and path-planning approach. The method proposed in the previous chapter is extended into a continuous pose-space. The resulting view then becomes the goal of a Scenic Pathplanner that plans a trajectory between the current robot pose and the NBV. This is done using an NBV-based pose-space that biases the paths towards areas of high information gain. Experimental evaluation shows that Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 3% of the views, which corresponds to 2.7 × 10⁻⁴% of the possible stereo pairs (using a naive interpretation of plausible stereo pairs). Comparison against length-based path-planning approaches shows that the Scenic Pathplanner produces more complete and more accurate maps with fewer frames. Finally, the ability of the Scenic Pathplanner to generalise to live sc
generalized reprojection error (which is known as the gold standard for multiview geometry). The HAR is an over-parameterization
scheme where the approximation is applied simultaneously in multiple parameter spaces. A joint minimization scheme ‘HAR-Descent’
can then solve the PnP problem efficiently, while remaining robust to approximation errors and local minima.
The technique is evaluated extensively, including numerous synthetic benchmark protocols and the real-world data evaluations used in
previous works. The proposed technique was found to have runtime complexity comparable to the fastest O(n) techniques, and up to
10 times faster than current state of the art minimization approaches. In addition, the accuracy exceeds that of all 9 previous
techniques tested, providing definitive state of the art performance on the benchmarks, across all 90 of the experiments in the paper
and supplementary material.
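For context, the reprojection error that PnP solvers aim to minimise can be written down in a few lines. The sketch below is the standard pinhole-camera reprojection error, not the HAR approximation or the HAR-Descent scheme from the paper; the function names and the simplified intrinsics (a single focal length `f` and principal point `cx`, `cy`) are assumptions made for illustration.

```python
def project(point, R, t, f, cx, cy):
    """Project a 3D world point into the image using pose (R, t) and a pinhole model."""
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (f * cam[0] / cam[2] + cx, f * cam[1] / cam[2] + cy)


def reprojection_error(points3d, points2d, R, t, f, cx, cy):
    """Sum of squared pixel distances between observed and projected points."""
    total = 0.0
    for point, (u, v) in zip(points3d, points2d):
        pu, pv = project(point, R, t, f, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


# Toy example: identity pose, two points, the second observed one pixel off.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_t = [0.0, 0.0, 0.0]
points3d = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
points2d = [(0.0, 0.0), (51.0, 0.0)]
error = reprojection_error(points3d, points2d, identity, zero_t, f=100.0, cx=0.0, cy=0.0)
```

A PnP solver searches over `R` and `t` for the pose that makes this quantity as small as possible for the given 2D to 3D correspondences.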
the combination of SLR and sign language assessment is novel. The goal of an ongoing three-year project in Switzerland is to pioneer
an assessment system for lexical signs of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) that relies on
SLR. The assessment system aims to give adult L2 learners of DSGS feedback on the correctness of the manual parameters (handshape,
hand position, location, and movement) of isolated signs they produce. In its initial version, the system will include automatic feedback
for a subset of a DSGS vocabulary production test consisting of 100 lexical items. To provide the SLR component of the assessment
system with sufficient training samples, a large-scale dataset containing videotaped repeated productions of the 100 items of the
vocabulary test with associated transcriptions and annotations was created, consisting of data from 11 adult L1 signers and 19 adult L2
learners of DSGS. This paper introduces the dataset, which will be made available to the research community.
a floorplan? It is probably safe to say that we do not explicitly
measure depths to every visible surface and try to match them
against different pose estimates in the floorplan. And yet, this
is exactly how most robotic scan-matching algorithms operate.
Similarly, we do not extrude the 2D geometry present in the
floorplan into 3D and try to align it to the real-world. And yet,
this is how most vision-based approaches localise.
Humans do the exact opposite. Instead of depth, we use
high level semantic cues. Instead of extruding the floorplan up
into the third dimension, we collapse the 3D world into a 2D
representation. Evidence of this is that many of the floorplans
we use in everyday life are not accurate, opting instead for high
levels of discriminative landmarks.
In this work, we use this insight to present a global
localisation approach that relies solely on the semantic labels
present in the floorplan and extracted from RGB images. While
our approach is able to use range measurements if available,
we demonstrate that they are unnecessary as we can achieve
results comparable to state-of-the-art without them.
research field for the last two decades. However, most
research to date has considered SLR as a naive gesture
recognition problem. SLR seeks to recognize a sequence of
continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar.
We formalize SLT in the framework of Neural Machine
Translation (NMT) for both end-to-end and pretrained
settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language.
To evaluate the performance of Neural SLT, we collected
the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language
translations and gloss level annotations for German Sign
Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary
of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.
In this paper, the 3D motion field of a scene (known as the Scene Flow, in contrast to Optical Flow, which is its projection onto the image plane) is estimated using a recently proposed algorithm, inspired by particle filtering. Unlike previous techniques, this scene flow algorithm does not introduce blurring across discontinuities, making it far more suitable for object segmentation and tracking. Additionally, the algorithm operates several orders of magnitude faster than previous scene flow estimation systems, enabling the use of Scene Flow in real-time and near real-time applications.
A novel approach to trajectory estimation is then introduced, based on clustering the estimated scene flow field in both space and velocity dimensions. This allows estimation of object motions in the true 3D scene, rather than the traditional approach of estimating 2D image plane motions. By working in the scene space rather than the image plane, the constant velocity assumption, commonly used in the prediction stage of trackers, is far more valid, and the resulting motion estimate is richer, providing information on out of plane motions. To evaluate the performance of the system, 3D trajectories are estimated on a multi-view sign-language dataset, and compared to a traditional high accuracy 2D system, with excellent results.
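Clustering a scene flow field jointly in space and velocity can be sketched as below. The abstract does not name the clustering algorithm, so this example swaps in a simple threshold-based connected-components grouping over 6D samples `(x, y, z, vx, vy, vz)`; the `threshold` and `velocity_weight` parameters are hypothetical.

```python
import math


def distance6d(a, b, velocity_weight=1.0):
    """Distance between two flow samples in joint position and velocity space."""
    pos = sum((a[i] - b[i]) ** 2 for i in range(3))
    vel = sum((a[i] - b[i]) ** 2 for i in range(3, 6))
    return math.sqrt(pos + velocity_weight * vel)


def cluster_flow(samples, threshold, velocity_weight=1.0):
    """Label samples so that chains of close neighbours share a cluster id."""
    labels = [-1] * len(samples)
    current = 0
    for seed in range(len(samples)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(len(samples)):
                close = distance6d(samples[i], samples[j], velocity_weight) <= threshold
                if labels[j] == -1 and close:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three samples that are close in space; the third moves in the opposite direction.
samples = [
    (0.00, 0.0, 0.0, 1.0, 0.0, 0.0),
    (0.10, 0.0, 0.0, 1.0, 0.0, 0.0),
    (0.05, 0.0, 0.0, -1.0, 0.0, 0.0),
]
labels = cluster_flow(samples, threshold=0.5)
```

The third sample lands in its own cluster even though it lies between the other two in space, because the velocity dimensions separate it. This is the property that makes joint space and velocity clustering useful for pulling apart nearby objects with different motions.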
This paper presents a novel formulation of scene flow estimation as a collection of moving points in 3D space, modelled using a particle filter that supports multiple hypotheses and does not oversmooth the motion field.
In addition, this paper is the first to address scene flow estimation, while making use of modern depth sensors and monocular appearance images, rather than traditional multi-viewpoint rigs.
The algorithm is applied to an existing scene flow dataset, where it achieves comparable results to approaches utilising multiple views, while taking a fraction of the time.
the-art Neural Machine Translation (NMT) and Image Generation techniques. Our
system is capable of producing sign videos from spoken language sentences. Contrary to
current approaches that are dependent on heavily annotated data, our approach requires
minimal gloss and skeletal level annotations for training. We achieve this by breaking
down the task into dedicated sub-processes. We first translate spoken language sentences
into sign gloss sequences using an encoder-decoder network. We then find a data driven
mapping between glosses and skeletal sequences. We use the resulting pose information
to condition a generative model that produces sign language video sequences. We
evaluate our approach on the recently released PHOENIX14T Sign Language Translation
dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of
16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities
of our approach by sharing qualitative results of generated sign sequences given their
same area to become familiar with it? To begin with, we might first
build a mental model of our surroundings. Upon revisiting this area, we
can use this model to extrapolate to new unseen locations and imagine
Based on this, we propose an approach where an agent is capable of
modelling new environments after a single visitation. To this end, we
introduce ‘Deep Imagination’, a combination of classical Visual-based
Monte Carlo Localisation and deep learning. By making use of a feature
embedded 3D map, the system can ‘imagine’ the view from any
novel location. These ‘imagined’ views are contrasted with the current
observation in order to estimate the agent’s current location. In order to
build the embedded map, we train a deep Siamese Fully Convolutional
U-Net to perform dense feature extraction. By training these features to
be generic, no additional training or fine tuning is required to adapt to
Our results demonstrate the generality and transfer capability of our
learnt dense features by training and evaluating on multiple datasets.
Additionally, we include several visualizations of the feature representations
and resulting 3D maps, as well as their application to localisation.
How do computers and intelligent agents view the world
around them? Feature extraction and representation constitutes
one of the basic building blocks towards answering this
question. Traditionally, this has been done with carefully
engineered hand-crafted techniques such as HOG, SIFT or
ORB. However, there is no ‘one size fits all’ approach that
satisfies all requirements.
In recent years, the rising popularity of deep learning
has resulted in a myriad of end-to-end solutions to many
computer vision problems. These approaches, while successful,
tend to lack scalability and can’t easily exploit information
learned by other systems.
Instead, we propose SAND features, a dedicated deep
learning solution to feature extraction capable of providing
hierarchical context information. This is achieved by
employing sparse relative labels indicating relationships of
similarity/dissimilarity between image locations. The nature
of these labels results in an almost infinite set of dissimilar
examples to choose from. We demonstrate how the
selection of negative examples during training can be used
to modify the feature space and vary its properties.
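The idea of shaping a feature space through the choice of negative examples can be sketched as follows. The abstract does not state the loss function, so this example assumes a margin-based contrastive loss on feature vectors, and it assumes that negatives are drawn from a configurable distance band around the true correspondence. All names and parameters here are hypothetical.

```python
import math
import random


def contrastive_loss(feat_a, feat_b, similar, margin=0.5):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = math.dist(feat_a, feat_b)
    return d ** 2 if similar else max(0.0, margin - d) ** 2


def sample_negative(correct, min_dist, max_dist, width, height, rng, max_tries=1000):
    """Rejection-sample a pixel whose distance from the true match lies in a band."""
    for _ in range(max_tries):
        candidate = (rng.randrange(width), rng.randrange(height))
        if min_dist <= math.dist(candidate, correct) <= max_dist:
            return candidate
    raise RuntimeError("no negative found in the requested distance band")


rng = random.Random(0)
# A narrow band near the match yields hard, local negatives.
local_negative = sample_negative((50, 50), 5, 20, 100, 100, rng)
# A wide band far from the match yields easier, global negatives.
global_negative = sample_negative((50, 50), 30, 200, 100, 100, rng)
```

Under these assumptions, training with negatives close to the true match encourages features that discriminate locally, while distant negatives encourage globally distinctive features.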
To demonstrate the generality of this approach, we apply
the proposed features to a multitude of tasks, each requiring
different properties. This includes disparity estimation,
semantic segmentation, self-localisation and SLAM. In all
cases, we show how incorporating SAND features results in
better or comparable results to the baseline, whilst requiring
little to no additional training. Code can be found at: