Simon Hadfield

Dr Simon Hadfield

Lecturer in Robot Vision and Autonomous Systems
PhD, FHEA, AUS, MEng, Graduate Certificate in Teaching & Learning
+44 (0)1483 689856
11 BA 00


Areas of specialism

Machine learning; Artificial Intelligence; Deep Learning; 3D computer vision; Event cameras; Robot vision; SLAM; target tracking; scene flow estimation; stereo reconstruction; robotic grasping

University roles and responsibilities

  • Manager of Dissertation allocation and assessment system for undergraduate and postgraduate taught programmes in the Department of Electrical and Electronic Engineering.
  • Undergraduate Personal Tutor
  • Health and Safety group – Representative for CVSSP labs (BA)

My qualifications

Graduate Certificate in Teaching and Learning
University of Surrey
EPSRC funded PhD in Computer Vision
University of Surrey
MEng (Distinction) in Electronic and Computer Engineering (Top student in the graduating year, average mark 75.1%)
University of Surrey

Affiliations and memberships

Member of the British Machine Vision Association (BMVA)
Member of the Institute of Electrical and Electronics Engineers (IEEE)
Member of the Institution of Engineering and Technology (IET)


Research interests

Research projects

Indicators of esteem

  • 2018 – Winner of the Early Career Teacher of the Year (Faculty of Engineering & Physical Sciences)

  • Reviewer for more than 10 high impact international journals and conferences

  • Two NVIDIA GPU grants

  • 2017 – Finalist for Supervisor of the Year (Faculty of Engineering & Physical Sciences) and the Tony Jeans Inspirational Teaching award

  • 2016 – Second place in the international academic challenge on continuous gesture recognition (ChaLearn)

  • DTI MEng prize for the best all-round performance in the graduating year, awarded by the Department of Trade and Industry


Postgraduate research supervision

Completed postgraduate research projects I have supervised

My teaching

My publications


The aim of this thesis is to develop estimation and encoding techniques for 3D information, applicable to a range of vision tasks. Particular emphasis is given to the task of natural action recognition. This "in the wild" recognition favours algorithms with broad generalisation capabilities, as no constraints are placed on either the actor or the setting. This leads to huge intra-class variability, including changes in lighting, actor appearance, viewpoint and action style. Algorithms which perform well under these circumstances are generally well suited to real-world deployment, in applications such as surveillance, video indexing and assisted living. The issues of generalisation may be mitigated to a significant extent by utilising 3D information, which provides invariance to most appearance-based variation. In addition, 3D information can remove projective distortions and the effect of camera orientation, and provides cues for occlusion. The exploitation of these properties has become feasible in recent years, due both to the emergence of affordable 3D sensors, such as the Microsoft Kinect, and to the ongoing growth of 3D broadcast footage (including 3D TV channels and 3D Blu-ray). To evaluate the impact of this 3D information, and to provide a benchmark to aid future development, a large multi-view action dataset is compiled, covering 14 different action classes and comprising over an hour of high-definition video. This data is obtained from 3D broadcast footage, which provides a broader range of variations than could feasibly be produced during staged capture in the lab. A large number of existing action recognition techniques are then implemented, and extensions formulated, to allow the encoding of 3D structural information. This leads to significantly improved generalisation over standard spatiotemporal techniques. As an additional information source, the estimation of 3D motion fields is also developed.
Motion estimation in 3D is also referred to as "scene flow", to contrast with its image-plane counterpart, "optical flow". However, previous work on scene flow estimation has been unsuitable for real applications, due to the computational complexity of the approaches proposed in the literature. The previous state-of-the-art techniques generally require several hours to estimate the motion of a single frame, rendering their use with datasets of reasonable size intractable. This, in turn, has led to the field of scene flow estimation
Marter M, Hadfield S, Bowden R (2014) Friendly faces: Weakly supervised character identification, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8912 pp. 121-132
© Springer International Publishing Switzerland 2015. This paper demonstrates a novel method for automatically discovering and recognising characters in video without any labelled examples or user intervention. Instead, weak supervision is obtained via a rough script-to-subtitle alignment. The technique uses pose-invariant features, extracted from detected faces and clustered to form groups of co-occurring characters. Results show that with 9 characters, 29% of the closest exemplars are correctly identified, increasing to 50% as additional exemplars are considered.
Lebeda K, Hadfield S, Bowden R (2015) Dense Rigid Reconstruction from Unstructured Discontinuous Video, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW) pp. 814-822 IEEE
Kristan M, Matas J, Leonardis A, Felsberg M, Cehovin L, Fernandez G, Vojir T, Hager G, Nebehay G, Pflugfelder R, Gupta A, Bibi A, Lukezic A, Garcia-Martin A, Petrosino A, Saffari A, Montero A, Varfolomieiev A, Baskurt A, Zhao B, Ghanem B, Martinez B, Lee B, Han B, Wang C, Garcia C, Zhang C, Schmid C, Tao D, Kim D, Huang D, Prokhorov D, Du D, Yeung D, Ribeiro E, Khan F, Porikli F, Bunyak F, Zhu G, Seetharaman G, Kieritz H, Yau H, Li H, Qi H, Bischof H, Possegger H, Lee H, Nam H, Bogun I, Jeong J, Cho J, Lee J, Zhu J, Shi J, Li J, Jia J, Feng J, Gao J, Choi J, Kim J, Lang J, Martinez J, Choi J, Xing J, Xue K, Palaniappan K, Lebeda K, Alahari K, Gao K, Yun K, Wong K, Luo L, Ma L, Ke L, Wen L, Bertinetto L, Pootschi M, Maresca M, Danelljan M, Wen M, Zhang M, Arens M, Valstar M, Tang M, Chang M, Khan M, Fan N, Wang N, Miksik O, Torr P, Wang Q, Martin-Nieto R, Pelapur R, Bowden R, Laganière R, Moujtahid S, Hare S, Hadfield SJ, Lyu S, Li S, Zhu S, Becker S, Duffner S, Hicks S, Golodetz S, Choi S, Wu T, Mauthner T, Pridmore T, Hu W, Hübner W, Wang X, Li X, Shi X, Zhao X, Mei X, Shizeng Y, Hua Y, Li Y, Lu Y, Li Y, Chen Z, Huang Z, Chen Z, Zhang Z, He Z, Hong Z (2015) The Visual Object Tracking VOT2015 challenge results, ICCV workshop on Visual Object Tracking Challenge pp. 564-586
The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT 2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as in VOT2014 with full annotation of targets by rotated bounding boxes and per-frame attribute, (ii) extensions of the VOT2014 evaluation methodology by introduction of a new performance measure. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.
Lebeda K, Hadfield S, Bowden R (2015) Exploring Causal Relationships in Visual Object Tracking, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) pp. 3065-3073 IEEE
Hadfield S, Bowden R (2015) Exploiting high level scene cues in stereo reconstruction, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) pp. 783-791 IEEE
Hadfield S, Lebeda K, Bowden R (2014) Natural action recognition using invariant 3D motion encoding, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8690 LNCS (PART 2) pp. 758-771
We investigate the recognition of actions "in the wild" using 3D motion information. The lack of control over (and knowledge of) the camera configuration, exacerbates this already challenging task, by introducing systematic projective inconsistencies between 3D motion fields, hugely increasing intra-class variance. By introducing a robust, sequence based, stereo calibration technique, we reduce these inconsistencies from fully projective to a simple similarity transform. We then introduce motion encoding techniques which provide the necessary scale invariance, along with additional invariances to changes in camera viewpoint. On the recent Hollywood 3D natural action recognition dataset, we show improvements of 40% over previous state-of-the-art techniques based on implicit motion encoding. We also demonstrate that our robust sequence calibration simplifies the task of recognising actions, leading to recognition rates 2.5 times those for the same technique without calibration. In addition, the sequence calibrations are made available. © 2014 Springer International Publishing.
Hadfield S, Bowden R (2013) Hollywood 3D: Recognizing actions in 3D natural scenes, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition pp. 3398-3405
Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content and in commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, which extend to the 3D data. Our evaluation compares all four feature descriptors, using seven different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset, including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community. © 2013 IEEE.
Kristan M, Pflugfelder R, Leonardis A, Matas J, Cehovin L, Nebehay G, Vojir T, Fernandez G, Lukezic A, Dimitriev A, Petrosino A, Saffari A, Li B, Han B, Heng C, Garcia C, Pangersic D, Haeger G, Khan FS, Oven F, Possegger H, Bischof H, Nam H, Zhu J, Li J, Choi JY, Choi J-W, Henriques JF, van de Weijer J, Batista J, Lebeda K, Oefjaell K, Yi KM, Qin L, Wen L, Maresca ME, Danelljan M, Felsberg M, Cheng M-M, Torr P, Huang Q, Bowden R, Hare S, Lim SY, Hong S, Liao S, Hadfield S, Li SZ, Duffner S, Golodetz S, Mauthner T, Vineet V, Lin W, Li Y, Qi Y, Lei Z, Niu Z (2014) The Visual Object Tracking VOT2014 Challenge Results, COMPUTER VISION - ECCV 2014 WORKSHOPS, PT II 8926 pp. 191-217 SPRINGER-VERLAG BERLIN
Hadfield SJ, Lebeda K, Bowden R (2016) Stereo reconstruction using top-down cues, Computer Vision and Image Understanding 157 pp. 206-222 Elsevier
We present a framework which allows standard stereo reconstruction to be unified with a wide range of classic top-down cues from urban scene understanding. The resulting algorithm is analogous to the human visual system, where conflicting interpretations of the scene due to ambiguous data can be resolved based on a higher-level understanding of urban environments. The cues which are reformulated within the framework include: recognising common arrangements of surface normals and semantic edges (e.g. concave, convex and occlusion boundaries), recognising connected or coplanar structures such as walls, and recognising collinear edges (which are common on repetitive structures such as windows). Recognition of these common configurations has only recently become feasible, thanks to the emergence of large-scale reconstruction datasets. To demonstrate the importance and generality of scene understanding during stereo reconstruction, the proposed approach is integrated with 3 different state-of-the-art techniques for bottom-up stereo reconstruction. The use of high-level cues is shown to improve performance by up to 15% on the Middlebury 2014 and KITTI datasets. We further evaluate the technique using the recently proposed HCI stereo metrics, finding significant improvements in the quality of depth discontinuities, planar surfaces and thin structures.
Hadfield S, Bowden R (2014) Scene flow estimation using intelligent cost functions, BMVC 2014 - Proceedings of the British Machine Vision Conference 2014
© 2014. The copyright of this document resides with its authors. Motion estimation algorithms are typically based upon the assumption of brightness constancy, or related assumptions such as gradient constancy. This manuscript evaluates several common cost functions from the motion estimation literature which embody these assumptions. We demonstrate that such assumptions break down for real-world data, and that the functions are therefore unsuitable. We propose a simple solution which significantly increases the discriminatory ability of the metric, by learning a nonlinear relationship using techniques from machine learning. Furthermore, we demonstrate how context and a nonlinear combination of metrics can provide additional gains, demonstrating a 44% improvement in the performance of a state-of-the-art scene flow estimation technique. In addition, smaller gains of 20% are demonstrated in optical flow estimation tasks.
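The brightness-constancy assumption this paper critiques can be sketched in a few lines of numpy. This is a minimal illustration of the baseline photometric cost only, not the paper's learned nonlinear metric:

```python
import numpy as np

def brightness_constancy_cost(I1, I2, u, v):
    """Mean absolute photometric residual |I1(x, y) - I2(x + u, y + v)|:
    the classic brightness-constancy cost whose assumptions the paper
    evaluates (and shows break down on real-world data)."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs2 = np.clip(xs + u, 0, w - 1)   # clamp lookups at the image border
    ys2 = np.clip(ys + v, 0, h - 1)
    return float(np.abs(I1 - I2[ys2, xs2]).mean())

# A scene translated by exactly one pixel: the cost is minimised at the
# true motion (u, v) = (1, 0), as the constancy assumption predicts.
rng = np.random.default_rng(0)
I1 = rng.random((32, 32))
I2 = np.roll(I1, 1, axis=1)
costs = {u: brightness_constancy_cost(I1, I2, u, 0) for u in (-1, 0, 1)}
assert min(costs, key=costs.get) == 1
```

On real footage, lighting change and sensor noise break this residual's link to true motion, which is what motivates replacing it with a learned cost.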
Lebeda K, Hadfield SJ, Bowden R (2016) Direct-from-Video: Unsupervised NRSfM, Proceedings of the ECCV workshop on Recovering 6D Object Pose Estimation
In this work we describe a novel approach to online dense non-rigid structure from motion. The problem is reformulated, incorporating ideas from visual object tracking, to provide a more general and unified technique, with feedback between the reconstruction and point-tracking algorithms. The resulting algorithm overcomes the limitations of many conventional techniques, such as the need for a reference image/template or precomputed trajectories. The technique can also be applied in traditionally challenging scenarios, such as modelling objects with strong self-occlusions or from an extreme range of viewpoints. The proposed algorithm needs no offline pre-learning and does not assume the modelled object stays rigid at the beginning of the video sequence. Our experiments show that in traditional scenarios, the proposed method can achieve better accuracy than the current state of the art while using less supervision. Additionally we perform reconstructions in challenging new scenarios where state-of-the-art approaches break down and where our method improves performance by up to an order of magnitude.
Lebeda K, Hadfield S, Bowden R (2014) 2D or Not 2D: Bridging the Gap Between Tracking and Structure from Motion, COMPUTER VISION - ACCV 2014, PT IV 9006 pp. 642-658 SPRINGER-VERLAG BERLIN
Camgoz NC, Hadfield SJ, Koller O, Bowden R (2016) Using Convolutional 3D Neural Networks for User-Independent Continuous Gesture Recognition, Proceedings IEEE International Conference of Pattern Recognition (ICPR), ChaLearn Workshop IEEE
In this paper, we propose using 3D Convolutional Neural Networks for large-scale user-independent continuous gesture recognition. We have trained an end-to-end deep network for continuous gesture recognition (jointly learning both the feature representation and the classifier). The network performs three-dimensional (i.e. space-time) convolutions to extract features related to both the appearance and motion from volumes of color frames. Space-time invariance of the extracted features is encoded via pooling layers. The earlier stages of the network are partially initialized using the work of Tran et al. before being adapted to the task of gesture recognition. An earlier version of the proposed method, which was trained for 11,250 iterations, was submitted to the ChaLearn 2016 Continuous Gesture Recognition Challenge and ranked 2nd with a Mean Jaccard Index Score of 0.269235. When the proposed method was further trained for 28,750 iterations, it achieved state-of-the-art performance on the same dataset, yielding a 0.314779 Mean Jaccard Index Score.
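The space-time convolution at the heart of such a network can be sketched naively in numpy. This is a single-channel, unlearned illustration of the operation, not the trained network described above:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive space-time ("3D") convolution: slide a (kt, kh, kw) kernel
    over a (T, H, W) video volume, so each output value couples
    appearance (spatial extent) and motion (temporal extent)."""
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(volume[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

# A temporal-difference kernel responds only to change between frames:
# a static video produces zero features everywhere.
temporal_edge = np.zeros((2, 3, 3))
temporal_edge[0], temporal_edge[1] = 1.0, -1.0
static = np.ones((8, 16, 16))
assert np.allclose(conv3d_valid(static, temporal_edge), 0.0)
```

In the actual network these kernels are learned, stacked over many channels, and interleaved with pooling layers that provide the space-time invariance mentioned in the abstract.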
Hadfield SJ, Bowden R (2013) Scene Particles: Unregularized Particle Based Scene Flow Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3) pp. 564-576 IEEE Computer Society
In this paper, an algorithm is presented for estimating scene flow, which is a richer, 3D analogue of optical flow. The approach operates orders of magnitude faster than alternative techniques, and is well suited to further performance gains through parallelized implementation. The algorithm employs multiple hypotheses to deal with motion ambiguities, rather than the traditional smoothness constraints, removing oversmoothing errors and providing significant performance improvements on benchmark data over the previous state of the art. The approach is flexible, and capable of operating with any combination of appearance and/or depth sensors, in any setup, simultaneously estimating the structure and motion if necessary. Additionally, the algorithm propagates information over time to resolve ambiguities, rather than performing an isolated estimation at each frame as in contemporary approaches. Approaches to smoothing the motion field without sacrificing the benefits of multiple hypotheses are explored, and a probabilistic approach to occlusion estimation is demonstrated, leading to 10% and 15% improved performance respectively. Finally, a data-driven tracking approach is described, and used to estimate the 3D trajectories of hands during sign language, without the need to model complex appearance variations at each viewpoint.
Lebeda K, Hadfield SJ, Bowden R, et al. (2016) The Thermal Infrared Visual Object Tracking VOT-TIR2016 Challenge Results,
The Thermal Infrared Visual Object Tracking challenge 2016, VOT-TIR2016, aims at comparing short-term single-object visual trackers that work on thermal infrared (TIR) sequences and do not apply pre-learned models of object appearance. VOT-TIR2016 is the second benchmark on short-term tracking in TIR sequences. Results of 24 trackers are presented. For each participating tracker, a short description is provided in the appendix. The VOT-TIR2016 challenge is similar to the 2015 challenge, the main difference is the introduction of new, more difficult sequences into the dataset. Furthermore, VOT-TIR2016 evaluation adopted the improvements regarding overlap calculation in VOT2016. Compared to VOT-TIR2015, a significant general improvement of results has been observed, which partly compensate for the more difficult sequences. The dataset, the evaluation kit, as well as the results are publicly available at the challenge website.
Holt B Pose_Holt11, University of Surrey
Hadfield SJ, Bowden R, Lebeda K (2016) The Visual Object Tracking VOT2016 Challenge Results, Lecture Notes in Computer Science 9914 pp. 777-823 Springer
The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in recent years. The number of tested state-of-the-art trackers makes VOT2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground-truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).
Lebeda K, Hadfield S, Matas J, Bowden R (2013) Long-term tracking through failure cases, Proceedings of the IEEE International Conference on Computer Vision pp. 153-160
Long term tracking of an object, given only a single instance in an initial frame, remains an open problem. We propose a visual tracking algorithm, robust to many of the difficulties which often occur in real-world scenes. Correspondences of edge-based features are used, to overcome the reliance on the texture of the tracked object and improve invariance to lighting. Furthermore we address long-term stability, enabling the tracker to recover from drift and to provide redetection following object disappearance or occlusion. The two-module principle is similar to the successful state-of-the-art long-term TLD tracker, however our approach extends to cases of low-textured objects. Besides reporting our results on the VOT Challenge dataset, we perform two additional experiments. Firstly, results on short-term sequences show the performance of tracking challenging objects which represent failure cases for competing state-of-the-art approaches. Secondly, long sequences are tracked, including one of almost 30000 frames which to our knowledge is the longest tracking sequence reported to date. This tests the re-detection and drift resistance properties of the tracker. All the results are comparable to the state-of-the-art on sequences with textured objects and superior on non-textured objects. The new annotated sequences are made publicly available. © 2013 IEEE.
Hadfield S (2013) Hollywood 3D, University of Surrey
Lebeda K, Hadfield S, Matas J, Bowden R (2015) Texture-Independent Long-Term Tracking Using Virtual Corners, IEEE TRANSACTIONS ON IMAGE PROCESSING 25 (1) pp. 359-371 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Hadfield Simon, Bowden Richard (2012) Generalised Pose Estimation Using Depth, In: Kutulakos KN (eds.), Trends and Topics in Computer Vision ECCV 2010 Trends and Topics in Computer Vision. ECCV 2010. Lecture Notes in Computer Science 6553 (6553) pp. 312-325 Springer
Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information extracted from a calibrated stereo sequence is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.
Hadfield Simon J., Bowden Richard (2010) Generalised Pose Estimation Using Depth, In proceedings, European Conference on Computer Vision (Workshops)
Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information extracted from a calibrated stereo sequence, is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.
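The Local Binary Pattern histograms named in these abstracts can be sketched in numpy. This is a minimal 8-neighbour illustration of the encoding step only; the randomised decision forest and the papers' exact variant are omitted:

```python
import numpy as np

def lbp_histogram(img):
    """Normalised histogram of 8-neighbour Local Binary Pattern codes:
    each interior pixel gets one bit per neighbour (1 if the neighbour
    is at least as bright as the centre), giving a 0-255 texture code."""
    c = img[1:-1, 1:-1]                      # centre pixels
    codes = np.zeros(c.shape, dtype=np.uint8)
    # Eight neighbours, clockwise from the top-left; each sets one bit.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(np.uint8) << np.uint8(bit)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()                 # normalise for patch-size invariance

rng = np.random.default_rng(0)
patch = rng.integers(0, 255, (32, 32)).astype(np.float64)
h = lbp_histogram(patch)
# LBP compares only relative brightness, so a global shift leaves it unchanged.
assert np.allclose(h, lbp_histogram(patch + 40.0))
```

This brightness invariance is what makes such histograms a robust input for a classifier such as a randomised decision forest.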
López-Benítez M, Drysdale T, Hadfield S, Maricar M (2017) Prototype for Multidisciplinary Research in the context of the Internet of Things, Journal of Network and Computer Applications 78 pp. 146-161 Elsevier
The Internet of Things (IoT) poses important challenges requiring multidisciplinary solutions that take into account the potential mutual effects and interactions among the different dimensions of future IoT systems. A suitable platform is required for an accurate and realistic evaluation of such solutions. This paper presents a prototype developed in the context of the EPSRC/eFutures-funded project "Internet of Surprise: Self-Organising Data". The prototype has been designed to effectively enable the joint evaluation and optimisation of multidisciplinary aspects of IoT systems, including aspects related to hardware design, communications and data processing. This paper provides a comprehensive description, discussing design and implementation details that may be helpful to other researchers and engineers in the development of similar tools. Examples illustrating the prototype's potential and capabilities are presented as well. The developed prototype is a versatile tool that can be used for proof-of-concept, validation and cross-layer optimisation of multidisciplinary solutions for future IoT deployments.
Hadfield Simon, Lebeda K, Bowden Richard (2016) Hollywood 3D: What are the best 3D features for Action Recognition?, International Journal of Computer Vision 121 (1) pp. 95-110 Springer Verlag
Action recognition "in the wild" is extremely challenging, particularly when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent growth of 3D data in broadcast content and commercial depth sensors makes it possible to overcome this. However, there is little work examining the best way to exploit this new modality. In this paper we introduce the Hollywood 3D benchmark, which is the first dataset containing "in the wild" action footage including 3D data. This dataset consists of 650 stereo video clips across 14 action classes, taken from Hollywood movies. We provide stereo calibrations and depth reconstructions for each clip. We also provide an action recognition pipeline, and propose a number of specialised depth-aware techniques including five interest point detectors and three feature descriptors. Extensive tests allow evaluation of different appearance and depth encoding schemes. Our novel techniques exploiting this depth allow us to reach performance levels more than triple those of the best baseline algorithm using only appearance information. The benchmark data, code and calibrations are all made available to the community.
Lebeda K, Hadfield SJ, Bowden R (2017) TMAGIC: A Model-free 3D Tracker, IEEE Transactions on Image Processing 26 (9) pp. 4378-4388 IEEE
Significant effort has been devoted within the visual tracking community to rapid learning of object properties on the fly. However, state-of-the-art approaches still often fail in cases such as rapid out-of-plane rotation, when the appearance changes suddenly. One of the major contributions of this work is a radical rethinking of the traditional wisdom of modelling 3D motion as appearance change during tracking. Instead, 3D motion is modelled as 3D motion. This intuitive but previously unexplored approach provides new possibilities in visual tracking research. Firstly, 3D tracking is more general, as large out-of-plane motion is often fatal for 2D trackers, but helps 3D trackers to build better models. Secondly, the tracker's internal model of the object can be used in many different applications and it could even become the main motivation, with tracking supporting reconstruction rather than vice versa. This effectively bridges the gap between visual tracking and Structure from Motion. A new benchmark dataset of sequences with extreme out-of-plane rotation is presented and an online leader-board offered to stimulate new research in the relatively underdeveloped area of 3D tracking. The proposed method, provided as a baseline, is capable of successfully tracking these sequences, all of which pose a considerable challenge to 2D trackers (error reduced by 46%).
Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2016) Next-best stereo: extending next best view optimisation for collaborative sensors, Proceedings of BMVC 2016
Most 3D reconstruction approaches passively optimise over all data, exhaustively matching pairs, rather than actively selecting data to process. This is costly both in terms of time and computer resources, and quickly becomes intractable for large datasets. This work proposes an approach to intelligently filter large amounts of data for 3D reconstructions of unknown scenes using monocular cameras. Our contributions are twofold: First, we present a novel approach to efficiently optimise the Next-Best View (NBV) in terms of accuracy and coverage using partial scene geometry. Second, we extend this to intelligently selecting stereo pairs by jointly optimising the baseline and vergence to find the NBV's best stereo pair to perform reconstruction. Both contributions are extremely efficient, taking 0.8ms and 0.3ms per pose, respectively. Experimental evaluation shows that the proposed method allows efficient selection of stereo pairs for reconstruction, such that a dense model can be obtained with only a small number of images. Once a complete model has been obtained, the remaining computational budget is used to intelligently refine areas of uncertainty, achieving results comparable to state-of-the-art batch approaches on the Middlebury dataset, using as little as 3.8% of the views.
Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2017) Taking the Scenic Route to 3D: Optimising Reconstruction from Moving Cameras, ICCV 2017 Proceedings IEEE
Reconstruction of 3D environments is a problem that has been widely addressed in the literature. While many approaches exist to perform reconstruction, few of them take an active role in deciding where the next observations should come from. Furthermore, the problem of travelling from the camera's current position to the next, known as path planning, usually focuses on minimising path length. This approach is ill-suited for reconstruction applications, where learning about the environment is more valuable than speed of traversal.
We present a novel Scenic Route Planner that selects paths which maximise information gain, both in terms of total map coverage and reconstruction accuracy. We also introduce a new type of collaborative behaviour into the planning stage called opportunistic collaboration, which allows sensors to switch between acting as independent Structure from Motion (SfM) agents or as a variable-baseline stereo pair. We show that Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 0.00027% of the possible stereo pairs (3% of the views). Comparison against length-based path-planning approaches shows that our approach produces more complete and more accurate maps with fewer frames. Finally, we demonstrate the Scenic Pathplanner's ability to generalise to live scenarios by mounting cameras on autonomous ground-based sensor platforms and exploring an environment.
Allday R, Hadfield S, Bowden R (2017) From Vision to Grasping: Adapting Visual Networks, TAROS-2017 Conference Proceedings. Lecture Notes in Computer Science 10454 pp. 484-494 Springer
Grasping is one of the oldest problems in robotics and is still considered challenging, especially when grasping unknown objects with unknown 3D shape. We focus on exploiting recent advances in computer vision recognition systems. Object classification problems tend to have much larger datasets to train from and have far fewer practical constraints around the size of the model and speed to train. In this paper we will investigate how to adapt Convolutional Neural Networks (CNNs), traditionally used for image classification, for planar robotic grasping. We consider the differences in the problems and how a network can be adjusted to account for this. Positional information is far more important to robotics than generic image classification tasks, where max pooling layers are used to improve translation invariance. By using a more appropriate network structure we are able to obtain improved accuracy while simultaneously improving run times and reducing memory consumption by reducing model size by up to 69%.
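The role of max pooling discussed above can be made concrete with a toy resolution calculation. This is not code from the paper; it simply computes the output grid size of a stack of convolution/pooling layers, showing how each pooling stage coarsens the grid on which a grasp position could be localised.

```python
# Toy calculation of output spatial resolution after a stack of conv/pool
# layers, illustrating why max pooling hurts tasks where position matters:
# each 2x2 pool halves the grid available for localising a grasp point.

def out_size(size, layers):
    """layers: list of (kernel, stride, padding) tuples for square inputs."""
    for k, s, p in layers:
        size = (size + 2 * p - k) // s + 1
    return size

conv = (3, 1, 1)   # 3x3 conv, stride 1, 'same' padding: preserves size
pool = (2, 2, 0)   # 2x2 max pool, stride 2: halves size

with_pool = out_size(224, [conv, pool, conv, pool, conv, pool])
without_pool = out_size(224, [conv, conv, conv])
print(with_pool, without_pool)  # 28 224
```

A 224-pixel input keeps full positional resolution without pooling, but is reduced to a 28-cell grid after three pooling stages, a 64x loss of positional precision per axis pair.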
Camgöz N, Hadfield SJ, Koller O, Bowden R (2017) SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition, ICCV 2017 Proceedings IEEE
We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as "Sequence-to-sequence" learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while remaining trainable end-to-end.
The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also allow us to implicitly perform transfer learning between different interrelated tasks, which lets us exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties serves to significantly improve the performance of the overarching recognition system, by better constraining the learning problem.
The proposed techniques are demonstrated in the challenging domain of sign language recognition. We demonstrate state-of-the-art performance on hand-shape recognition (outperforming previous techniques by more than 30%). Furthermore, we are able to obtain comparable sign recognition rates to previous research, without the need for an alignment step to segment out the signs for recognition.
Camgöz N, Hadfield S, Bowden R (2017) Particle Filter based Probabilistic Forced Alignment for Continuous Gesture Recognition, IEEE International Conference on Computer Vision Workshops (ICCVW) 2017 pp. 3079-3085 IEEE
In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatiotemporal deep neural networks using weak border level annotations. The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples and using this as the prior for the next sampling stage.
We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index scores on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner independent; it can easily be combined with other approaches.
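The sample-score-resample loop described above can be illustrated schematically. The snippet below is a much-simplified stand-in: the "classifier" is a hand-written scoring function peaking at a notional gesture boundary, and the particles are candidate temporal windows; all names and values are invented for the example.

```python
import random

# Schematic particle-filter forced alignment: candidate temporal windows
# ("particles") are drawn from a prior, scored by a classifier (here a
# stand-in function), and resampled so high-scoring alignments dominate
# the prior used at the next training stage.

def score(window):
    # Stand-in for the 3D-CNN confidence; peaks when the window is
    # centred on the true (hypothetical) gesture boundary at t = 50.
    centre = (window[0] + window[1]) / 2
    return 1.0 / (1.0 + abs(centre - 50))

def resample(particles, rng):
    weights = [score(p) for p in particles]
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(0)
particles = [(s, s + 20) for s in range(0, 80, 5)]   # candidate (start, end)
for _ in range(5):                                    # a few filter iterations
    particles = resample(particles, rng)
    particles = [(s + rng.choice([-2, 0, 2]), e + rng.choice([-2, 0, 2]))
                 for s, e in particles]               # diffusion step

centres = sorted((s + e) / 2 for s, e in particles)
print(centres[len(centres) // 2])  # median centre drifts towards t = 50
```

After a few iterations the particle population concentrates around the high-scoring alignment, which is the mechanism that lets training proceed from weak border-level annotations.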
Autonomous 3D reconstruction, the process whereby an agent can produce its own representation of the world, is an extremely challenging area in both vision and robotics. However, 3D reconstructions have the ability to grant robots the understanding of the world necessary for collaboration and high-level goal execution. Therefore, this thesis aims to explore methods that will enable modern robotic systems to autonomously and collaboratively achieve an understanding of the world.

In the real world, reconstructing a 3D scene requires nuanced understanding of the environment. Additionally, it is not enough to simply "understand" the world; autonomous agents must be capable of actively acquiring this understanding. Achieving all of this using simple monocular sensors is extremely challenging. Agents must be able to understand what areas of the world are navigable, how egomotion affects reconstruction and how other agents may be leveraged to provide an advantage. All of this must be considered in addition to the traditional 3D reconstruction issues of correspondence estimation, triangulation and data association.

Simultaneous Localisation and Mapping (SLAM) solutions are not particularly well suited to autonomous multi-agent reconstruction. They typically require the sensors to be in constant communication, do not scale well with the number of agents (or map size) and require expensive optimisations. Instead, this thesis attempts to develop more pro-active techniques from the ground up.

First, an autonomous agent must have the ability to actively select what it is going to reconstruct. Known as view-selection, or Next-Best View (NBV), this has recently become an active topic in autonomous robotics and will form the first contribution of this thesis. Second, once a view is selected, an autonomous agent must be able to plan a trajectory to arrive at that view. This problem, known as path-planning, can be considered a core topic in the robotics field and will form the second contribution of this thesis. Finally, the 3D reconstruction must be anchored to a globally consistent map that co-relates to the real world. This will be addressed as a floorplan localisation problem, an emerging field for the vision community, and will be the third contribution of this thesis.

To give autonomous agents the ability to actively select what data to process, this thesis discusses the NBV problem in the context of Multi-View Stereo (MVS). The proposed approach has the ability to massively reduce the amount of computing resources required for any given 3D reconstruction. More importantly, it autonomously selects the views that improve the reconstruction the most. All of this is done using only the sensor poses; the images are not used for view-selection and are only loaded into memory once they have been selected for reconstruction. Experimental evaluation shows that NBV applied to this problem can achieve results comparable to state-of-the-art using as little as 3.8% of the views.

To provide the ability to execute an autonomous 3D reconstruction, this thesis proposes a novel computer-vision based goal-estimation and path-planning approach. The method proposed in the previous chapter is extended into a continuous pose-space. The resulting view then becomes the goal of a Scenic Pathplanner that plans a trajectory between the current robot pose and the NBV. This is done using an NBV-based pose-space that biases the paths towards areas of high information gain. Experimental evaluation shows that the Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 3% of the views, which corresponds to 2.7 × 10^-4 % of the possible stereo pairs (using a naive interpretation of plausible stereo pairs). Comparison against length-based path-planning approaches shows that the Scenic Pathplanner produces more complete and more accurate maps with fewer frames. Finally, the ability of the Scenic Pathplanner to generalise to live scenarios is demonstrated by mounting cameras on autonomous ground-based sensor platforms and exploring an environment.

Hadfield Simon, Lebeda Karel, Bowden Richard (2018) HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation, IEEE transactions on Pattern Analysis and Machine Intelligence Institute of Electrical and Electronics Engineers (IEEE)
This paper proposes a Hybrid Approximate Representation (HAR) based on unifying several efficient approximations of the generalized reprojection error (which is known as the gold standard for multiview geometry). The HAR is an over-parameterization scheme where the approximation is applied simultaneously in multiple parameter spaces. A joint minimization scheme "HAR-Descent" can then solve the PnP problem efficiently, while remaining robust to approximation errors and local minima. The technique is evaluated extensively, including numerous synthetic benchmark protocols and the real-world data evaluations used in previous works. The proposed technique was found to have runtime complexity comparable to the fastest O(n) techniques, and up to 10 times faster than current state-of-the-art minimization approaches. In addition, the accuracy exceeds that of all 9 previous techniques tested, providing definitive state-of-the-art performance on the benchmarks, across all 90 of the experiments in the paper and supplementary material.
Ebling S, Camgöz N, Boyes Braem P, Tissi K, Sidler-Miserez S, Stoll S, Hadfield S, Haug T, Bowden R, Tornay S, Razaviz M, Magimai-Doss M (2018) SMILE Swiss German Sign Language Dataset, Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC) 2018 The European Language Resources Association (ELRA)
Sign language recognition (SLR) involves identifying the form and meaning of isolated signs or sequences of signs. To our knowledge, the combination of SLR and sign language assessment is novel. The goal of an ongoing three-year project in Switzerland is to pioneer an assessment system for lexical signs of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) that relies on SLR. The assessment system aims to give adult L2 learners of DSGS feedback on the correctness of the manual parameters (handshape, hand position, location, and movement) of isolated signs they produce. In its initial version, the system will include automatic feedback for a subset of a DSGS vocabulary production test consisting of 100 lexical items. To provide the SLR component of the assessment system with sufficient training samples, a large-scale dataset containing videotaped repeated productions of the 100 items of the vocabulary test with associated transcriptions and annotations was created, consisting of data from 11 adult L1 signers and 19 adult L2 learners of DSGS. This paper introduces the dataset, which will be made available to the research community.
Mendez Maldonado Oscar, Hadfield Simon, Pugeault Nicolas, Bowden Richard (2018) SeDAR - Semantic Detection and Ranging: Humans can localise without LiDAR, can robots?, Proceedings of the 2018 IEEE International Conference on Robotics and Automation, May 21-25, 2018, Brisbane, Australia IEEE
How does a person work out their location using a floorplan? It is probably safe to say that we do not explicitly measure depths to every visible surface and try to match them against different pose estimates in the floorplan. And yet, this is exactly how most robotic scan-matching algorithms operate. Similarly, we do not extrude the 2D geometry present in the floorplan into 3D and try to align it to the real world. And yet, this is how most vision-based approaches localise.

Humans do the exact opposite. Instead of depth, we use high-level semantic cues. Instead of extruding the floorplan up into the third dimension, we collapse the 3D world into a 2D representation. Evidence of this is that many of the floorplans we use in everyday life are not accurate, opting instead for high levels of discriminative landmarks.

In this work, we use this insight to present a global localisation approach that relies solely on the semantic labels present in the floorplan and extracted from RGB images. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary as we can achieve results comparable to state-of-the-art without them.
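The depth-free localisation idea above can be illustrated with a minimal sketch. This is not the paper's implementation: it simply scores each candidate pose by how well the semantic classes observed across the field of view agree with the classes the floorplan would predict at that pose; the poses and label sequences are invented for the example.

```python
# Minimal illustration of semantic pose scoring: the labels a camera
# observes are compared with the labels a floorplan predicts at each
# candidate pose, with no depth measurements involved.

def pose_likelihood(observed, predicted):
    """Fraction of view bearings whose semantic class matches the floorplan."""
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)

# Semantic classes seen across the field of view, left to right.
observed = ["door", "wall", "window", "wall", "door"]

# Hypothetical floorplan predictions at two candidate (x, y) poses.
floorplan_at = {
    (1.0, 2.0): ["door", "wall", "window", "wall", "door"],   # true pose
    (4.0, 2.0): ["wall", "wall", "door", "wall", "wall"],
}
best = max(floorplan_at, key=lambda pose: pose_likelihood(observed, floorplan_at[pose]))
print(best)  # (1.0, 2.0)
```

In a full system this likelihood would weight particles in a Monte Carlo localisation filter; the sketch only shows why a distribution of semantic labels can disambiguate poses without any range sensing.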
Camgöz Necati Cihan, Hadfield Simon, Koller O, Ney H, Bowden Richard (2018) Neural Sign Language Translation, Proceedings CVPR 2018 pp. 7784-7793 IEEE
Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar.
We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.
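The BLEU-4 scores quoted above can be made concrete with a simplified sentence-level implementation. Real evaluations use corpus-level BLEU with proper smoothing (e.g. via sacrebleu or NLTK); the version below is only a sketch of the metric's structure (modified n-gram precisions plus a brevity penalty), and the example sentences are invented.

```python
import math
from collections import Counter

# Simplified sentence-level BLEU-4: geometric mean of 1- to 4-gram
# precisions, scaled by a brevity penalty for short candidates.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)           # crude smoothing
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "the weather will be sunny in the south".split()
print(round(bleu4(ref, ref), 2))                      # identical: 1.0
print(round(bleu4("sunny weather".split(), ref), 2))  # short fragment: near 0
```

This also shows why scores in the 9-19 BLEU-4 range, as reported above, still represent substantial partial overlap on a hard translation task, since the metric collapses rapidly as higher-order n-gram matches disappear.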
Hadfield Simon J., Bowden Richard (2012) Go With The Flow: Hand Trajectories in 3D via Clustered Scene Flow, In Proceedings, International Conference on Image Analysis and Recognition LNCS 7 pp. 285-295
Tracking hands and estimating their trajectories is useful in a number of tasks, including sign language recognition and human computer interaction. Hands are extremely difficult objects to track, their deformability, frequent self occlusions and motion blur cause appearance variations too great for most standard object trackers to deal with robustly.

In this paper, the 3D motion field of a scene (known as the Scene Flow, in contrast to Optical Flow, which is its projection onto the image plane) is estimated using a recently proposed algorithm, inspired by particle filtering. Unlike previous techniques, this scene flow algorithm does not introduce blurring across discontinuities, making it far more suitable for object segmentation and tracking. Additionally, the algorithm operates several orders of magnitude faster than previous scene flow estimation systems, enabling the use of Scene Flow in real-time and near real-time applications.

A novel approach to trajectory estimation is then introduced, based on clustering the estimated scene flow field in both space and velocity dimensions. This allows estimation of object motions in the true 3D scene, rather than the traditional approach of estimating 2D image plane motions. By working in the scene space rather than the image plane, the constant velocity assumption, commonly used in the prediction stage of trackers, is far more valid, and the resulting motion estimate is richer, providing information on out of plane motions. To evaluate the performance of the system, 3D trajectories are estimated on a multi-view sign-language dataset, and compared to a traditional high accuracy 2D system, with excellent results.
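The joint space-and-velocity clustering described above can be sketched with a toy single-link grouping. This is an invented miniature, not the paper's algorithm: points are grouped only when they are close in both position and velocity, so a moving hand separates from the static background even when the two are spatially adjacent.

```python
# Toy clustering of a scene flow field: each point is (x, y, z, vx, vy, vz),
# and two points join a cluster only if BOTH their spatial and velocity
# distances fall below thresholds.

def close(a, b, pos_t=1.0, vel_t=0.5):
    dp = sum((a[i] - b[i]) ** 2 for i in range(3)) ** 0.5       # spatial
    dv = sum((a[i] - b[i]) ** 2 for i in range(3, 6)) ** 0.5    # velocity
    return dp < pos_t and dv < vel_t

def cluster(points):
    labels = [-1] * len(points)
    nxt = 0
    for i in range(len(points)):
        if labels[i] == -1:
            labels[i], nxt = nxt, nxt + 1
            stack = [i]
            while stack:                      # flood-fill single-link groups
                j = stack.pop()
                for k, q in enumerate(points):
                    if labels[k] == -1 and close(points[j], q):
                        labels[k] = labels[j]
                        stack.append(k)
    return labels

# Two static background points and two moving points, spatially intermixed.
scene_flow = [
    (0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
    (0.2, 0.1, 2.0, 0.0, 0.0, 0.0),
    (0.3, 0.1, 2.0, 1.0, 0.2, 0.0),   # near the background, but moving
    (0.4, 0.2, 2.0, 1.0, 0.3, 0.0),
]
print(cluster(scene_flow))  # [0, 0, 1, 1]
```

The velocity dimensions are what split the third and fourth points from the background; clustering on position alone would merge all four.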

Hadfield Simon J., Bowden Richard (2011) Kinecting the dots: Particle Based Scene Flow From Depth Sensors, In Proceedings, International Conference on Computer Vision (ICCV) pp. 2290-2295
The motion field of a scene can be used for object segmentation and to provide features for classification tasks like action recognition. Scene flow is the full 3D motion field of the scene, and is more difficult to estimate than its 2D counterpart, optical flow. Current approaches use a smoothness cost for regularisation, which tends to over-smooth at object boundaries.

This paper presents a novel formulation for scene flow estimation, a collection of moving points in 3D space, modelled using a particle filter that supports multiple hypotheses and does not oversmooth the motion field.

In addition, this paper is the first to address scene flow estimation, while making use of modern depth sensors and monocular appearance images, rather than traditional multi-viewpoint rigs.

The algorithm is applied to an existing scene flow dataset, where it achieves comparable results to approaches utilising multiple views, while taking a fraction of the time.

Stoll Stephanie, Camgöz Necati Cihan, Hadfield Simon, Bowden Richard (2018) Sign Language Production using Neural Machine Translation and Generative Adversarial Networks, Proceedings of the 29th British Machine Vision Conference (BMVC 2018) British Machine Vision Association
We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.
Spencer Jaime, Mendez Maldonado Oscar, Bowden Richard, Hadfield Simon (2018) Localisation via Deep Imagination: learn the features not the map, Proceedings of ECCV 2018 - European Conference on Computer Vision Springer Nature
How many times does a human have to drive through the same area to become familiar with it? To begin with, we might first build a mental model of our surroundings. Upon revisiting this area, we can use this model to extrapolate to new unseen locations and imagine their appearance.

Based on this, we propose an approach where an agent is capable of modelling new environments after a single visitation. To this end, we introduce "Deep Imagination", a combination of classical Visual-based Monte Carlo Localisation and deep learning. By making use of a feature embedded 3D map, the system can "imagine" the view from any novel location. These "imagined" views are contrasted with the current observation in order to estimate the agent's current location. In order to build the embedded map, we train a deep Siamese Fully Convolutional U-Net to perform dense feature extraction. By training these features to be generic, no additional training or fine tuning is required to adapt to new environments.

Our results demonstrate the generality and transfer capability of our learnt dense features by training and evaluating on multiple datasets. Additionally, we include several visualizations of the feature representations and resulting 3D maps, as well as their application to localisation.
Lebeda Karel, Hadfield Simon J., Bowden Richard 3DCars, IEEE
Spencer Jaime, Bowden Richard, Hadfield Simon (2019) Scale-Adaptive Neural Dense Features: Learning via Hierarchical Context Aggregation, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) Institute of Electrical and Electronics Engineers (IEEE)

How do computers and intelligent agents view the world around them? Feature extraction and representation constitutes one of the basic building blocks towards answering this question. Traditionally, this has been done with carefully engineered hand-crafted techniques such as HOG, SIFT or ORB. However, there is no "one size fits all" approach that satisfies all requirements.

In recent years, the rising popularity of deep learning has resulted in a myriad of end-to-end solutions to many computer vision problems. These approaches, while successful, tend to lack scalability and cannot easily exploit information learned by other systems.

Instead, we propose SAND features, a dedicated deep learning solution to feature extraction capable of providing hierarchical context information. This is achieved by employing sparse relative labels indicating relationships of similarity/dissimilarity between image locations. The nature of these labels results in an almost infinite set of dissimilar examples to choose from. We demonstrate how the selection of negative examples during training can be used to modify the feature space and vary its properties.

To demonstrate the generality of this approach, we apply the proposed features to a multitude of tasks, each requiring different properties. This includes disparity estimation, semantic segmentation, self-localisation and SLAM. In all cases, we show how incorporating SAND features results in better or comparable results to the baseline, whilst requiring little to no additional training. Code can be found at:

Blacker P., Bridges C. P., Hadfield S. (2019) Rapid Prototyping of Deep Learning Models on Radiation Hardened CPUs, Proceedings of the 13th NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2019) Institute of Electrical and Electronics Engineers (IEEE)

Interest is increasing in the use of neural networks and deep-learning for on-board processing tasks in the space industry [1]. However, development has lagged behind terrestrial applications for several reasons: space qualified computers have significantly less processing power than their terrestrial equivalents; reliability requirements are more stringent than for the majority of applications deep-learning is being used for; and the long requirements, design and qualification cycles in much of the space industry slow the adoption of recent developments.

GPUs are the first hardware choice for implementing neural networks on terrestrial computers; however, no radiation hardened equivalent parts are currently available. Field Programmable Gate Array devices are capable of efficiently implementing neural networks and radiation hardened parts are available; however, the process to deploy and validate an inference network is non-trivial, and robust tools that automate the process are not available.

We present an open source tool chain that can automatically deploy a trained inference network from the TensorFlow framework directly to the LEON 3, and an industrial case study of the design process used to train and optimise a deep-learning model for this processor. This does not directly change the three challenges described above; however, it greatly accelerates prototyping and analysis of neural network solutions, allowing these options to be more easily considered than is currently possible.

Future improvements to the tools are identified, along with a summary of some of the obstacles to using neural networks and potential solutions to these in the future.

Walters Celyn, Mendez Oscar, Hadfield Simon, Bowden Richard (2019) A Robust Extrinsic Calibration Framework for Vehicles with Unscaled Sensors, Towards a Robotic Society IEEE

Accurate extrinsic sensor calibration is essential for both autonomous vehicles and robots. Traditionally this is an involved process requiring calibration targets, known fiducial markers and is generally performed in a lab. Moreover, even a small change in the sensor layout requires recalibration. With the anticipated arrival of consumer autonomous vehicles, there is demand for a system which can do this automatically, after deployment and without specialist human expertise.

To solve these limitations, we propose a flexible framework which can estimate extrinsic parameters without an explicit calibration stage, even for sensors with unknown scale. Our first contribution builds upon standard hand-eye calibration by jointly recovering scale. Our second contribution is that our system is made robust to imperfect and degenerate sensor data, by collecting independent sets of poses and automatically selecting those which are most ideal.

We show that our approach's robustness is essential for the target scenario. Unlike previous approaches, ours runs in real time and constantly estimates the extrinsic transform. For both an ideal experimental setup and a real use case, comparison against these approaches shows that we outperform the state-of-the-art. Furthermore, we demonstrate that the recovered scale may be applied to the full trajectory, circumventing the need for scale estimation via sensor fusion.
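The scale-recovery idea mentioned above can be sketched in a heavily simplified form. This is not the paper's joint hand-eye formulation: given paired relative motions from a metric sensor (e.g. wheel odometry) and an up-to-scale monocular system, the missing scale can be estimated as the least-squares ratio of paired translation magnitudes; the motion values below are invented.

```python
# Hedged sketch of monocular scale recovery: solve ||t_metric|| ~= s * ||t_mono||
# in the least-squares sense over a set of paired relative translations.

def norm(v):
    return sum(x * x for x in v) ** 0.5

def estimate_scale(metric_motions, monocular_motions):
    """Closed-form least-squares s minimising sum (||a|| - s * ||b||)^2."""
    num = sum(norm(a) * norm(b) for a, b in zip(metric_motions, monocular_motions))
    den = sum(norm(b) ** 2 for b in monocular_motions)
    return num / den

metric = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 0.5)]
mono   = [(0.5, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.25)]  # half scale
print(estimate_scale(metric, mono))  # 2.0
```

Collecting many such pose pairs, and discarding degenerate ones (e.g. near-zero translations), is what makes this kind of estimate robust in practice, in the spirit of the pose-set selection described above.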

Allday Rebecca, Hadfield Simon, Bowden Richard (2019) Auto-Perceptive Reinforcement Learning (APRiL), Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019) Institute of Electrical and Electronics Engineers (IEEE)
The relationship between the feedback given in Reinforcement Learning (RL) and visual data input is often extremely complex. Given this, expecting a single system trained end-to-end to learn both how to perceive and interact with its environment is unrealistic for complex domains. In this paper we propose Auto-Perceptive Reinforcement Learning (APRiL), separating the perception and the control elements of the task. This method uses an auto-perceptive network to encode a feature space. The feature space may explicitly encode available knowledge from the semantically understood state space, but the network is also free to encode unanticipated auxiliary data. By decoupling visual perception from the RL process, APRiL can make use of techniques shown to improve the performance and efficiency of RL training, which are often difficult to apply directly with a visual input. We present results showing that APRiL is effective in tasks where the semantically understood state space is known. We also demonstrate that allowing the feature space to learn auxiliary information lets it use the visual perception system to improve performance by approximately 30%. We also show that maintaining some level of semantics in the encoded state, which can then make use of state-of-the-art RL techniques, saves around 75% of the time that would be used to collect simulation examples.
Mendez Oscar, Hadfield Simon, Pugeault Nicolas, Bowden Richard (2019) SeDAR: Reading floorplans like a human, International Journal of Computer Vision Springer Verlag

The use of human-level semantic information to aid robotic tasks has recently become an important area for both Computer Vision and Robotics. This has been enabled by advances in Deep Learning that allow consistent and robust semantic understanding. Leveraging this semantic vision of the world has allowed human-level understanding to naturally emerge from many different approaches. In particular, the use of semantic information to aid in localisation and reconstruction has been at the forefront of both fields.

Like robots, humans also require the ability to localise within a structure. To aid this, humans have designed high-level semantic maps of our structures called floorplans. We are extremely good at localising in them, even with limited access to the depth information used by robots. This is because we focus on the distribution of semantic elements, rather than geometric ones. Evidence of this is that humans are normally able to localise in a floorplan that has not been scaled properly. In order to grant this ability to robots, it is necessary to use localisation approaches that leverage the same semantic information humans use.

In this paper, we present a novel method for semantically enabled global localisation. Our approach relies on the semantic labels present in the floorplan. Deep Learning is leveraged to extract semantic labels from RGB images, which are compared to the floorplan for localisation. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary as we can achieve results comparable to state-of-the-art without them.

Stoll Stephanie, Camgöz Necati Cihan, Hadfield Simon, Bowden Richard (2020) Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks., International Journal of Computer Vision Springer
We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks, and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings qualitatively and quantitatively using broadcast quality assessment metrics.
Jackson Lucy, Saaj Chakravarthini M., Seddaoui Asma, Whiting Calem, Eckersley Steve, Hadfield Simon (2020) Downsizing an Orbital Space Robot: A Dynamic System Based Evaluation, Advances in Space Research Elsevier
Small space robots have the potential to revolutionise space exploration by facilitating the on-orbit assembly of infrastructure, in shorter time scales, at reduced costs. Their commercial appeal will be further improved if such a system is also capable of performing on-orbit servicing missions, in line with the current drive to limit space debris and prolong the lifetime of satellites already in orbit. Whilst there have been a limited number of successful demonstrations of technologies capable of these on-orbit operations, the systems remain large and bespoke. The recent surge in small satellite technologies is changing the economics of space and, in the near future, downsizing a space robot might become a viable option with a host of benefits. This industry wide shift means some of the technologies for use with a downsized space robot, such as power and communication subsystems, now exist. However, there are still dynamic and control issues that need to be overcome before a downsized space robot can be capable of undertaking useful missions. This paper first outlines these issues, before analyzing the effect of downsizing a system on its operational capability. It thereby presents the smallest controllable system such that the benefits of a small space robot can be achieved with current technologies. The sizing of the base spacecraft and manipulator are addressed here. The design presented consists of a 3 link, 6 degrees of freedom robotic manipulator mounted on a 12U form factor satellite. The feasibility of this 12U space robot was evaluated in simulation and the in-depth results presented here support the hypothesis that a small space robot is a viable solution for in-orbit operations.
Keywords: Small Satellite; Space Robot; In-orbit Assembly and Servicing; In-orbit operations; Free-Flying; Free-Floating.
Camgöz Necati Cihan, Koller Oscar, Hadfield Simon, Bowden Richard (2020) Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss-level tokenization in order to work. We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We evaluate the recognition and translation performance of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign-video-to-spoken-language and gloss-to-spoken-language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
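The CTC term that binds recognition into the joint architecture can be illustrated with the standard forward algorithm in log space. The sketch below is a generic numpy illustration of CTC negative log-likelihood, not the paper's implementation; the `blank=0` index convention is an assumption, and in the full model this term would be weighted against the translation cross-entropy.

```python
import numpy as np

NEG = -1e30  # stand-in for log(0)

def _logsumexp(vals):
    m = max(vals)
    if m <= NEG / 2:
        return NEG
    return m + np.log(sum(np.exp(v - m) for v in vals))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Forward-algorithm CTC loss. log_probs: (T, V) per-frame log-probs;
    target: gloss index sequence without blanks."""
    ext = [blank]
    for c in target:
        ext += [c, blank]  # interleave blanks: _ g1 _ g2 _ ...
    S, T = len(ext), log_probs.shape[0]
    alpha = [NEG] * S
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            cands = [alpha[s]]               # stay in the same state
            if s >= 1:
                cands.append(alpha[s - 1])   # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])   # skip the blank between distinct glosses
            lse = _logsumexp(cands)
            if lse > NEG / 2:
                new[s] = lse + log_probs[t, ext[s]]
        alpha = new
    tails = [alpha[-1]] + ([alpha[-2]] if S > 1 else [])
    return -_logsumexp(tails)
```

Because CTC marginalises over all monotonic alignments, no ground-truth timing information is needed, which is exactly the property the joint recognition/translation training relies on.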
Spencer Jaime, Bowden Richard, Hadfield Simon (2020) Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance, IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
"Like night and day" is a commonly used expression to imply that two things are completely different. Unfortunately, this tends to be the case for current visual feature representations of the same scene across varying seasons or times of day. The aim of this paper is to provide a dense feature representation that can be used to perform localization, sparse matching or image retrieval, regardless of the current seasonal or temporal appearance. Recently, there have been several proposed methodologies for deep learning dense feature representations. These methods make use of ground-truth pixel-wise correspondences between pairs of images and focus on the spatial properties of the features. As such, they don't address temporal or seasonal variation. Furthermore, obtaining the required pixel-wise correspondence data to train in cross-seasonal environments is highly complex in most scenarios. We propose Deja-Vu, a weakly supervised approach to learning season-invariant features that does not require pixel-wise ground-truth data. The proposed system only requires coarse labels indicating whether two images correspond to the same location or not. From these labels, the network is trained to produce "similar" dense feature maps for corresponding locations despite environmental changes. Code will be made available at: jspenmar/DejaVu_Features
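Coarse same-place/different-place labels of the kind described above are commonly trained against with a contrastive objective. The following is a generic hinge-style sketch under that assumption; the `margin` value and the per-pixel distance over whole feature maps are illustrative choices, not the paper's exact loss.

```python
import numpy as np

def contrastive_loss(f1, f2, same_place, margin=1.0):
    """f1, f2: (H, W, C) dense feature maps from two images.
    same_place: coarse label saying whether they show the same location."""
    d = np.linalg.norm(f1 - f2, axis=-1)  # per-pixel feature distance
    if same_place:
        return float(np.mean(d ** 2))               # pull matching maps together
    return float(np.mean(np.maximum(0.0, margin - d) ** 2))  # push others apart
```

The key point is that only the image-level label enters the loss, so no pixel-wise cross-seasonal correspondences are ever required.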
Spencer Jaime, Bowden Richard, Hadfield Simon (2020) DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning, IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
In the current monocular depth research, the dominant approach is to employ unsupervised training on large datasets, driven by warped photometric consistency. Such approaches lack robustness and are unable to generalize to challenging domains such as nighttime scenes or adverse weather conditions where assumptions about photometric consistency break down. We propose DeFeat-Net (Depth & Feature network), an approach to simultaneously learn a cross-domain dense feature representation, alongside a robust depth-estimation framework based on warped feature consistency. The resulting feature representation is learned in an unsupervised manner with no explicit ground-truth correspondences required. We show that within a single domain, our technique is comparable to both the current state of the art in monocular depth estimation and supervised feature representation learning. However, by simultaneously learning features, depth and motion, our technique is able to generalize to challenging domains, allowing DeFeat-Net to outperform the current state-of-the-art with around 10% reduction in all error measures on more challenging sequences such as nighttime driving.
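The warped consistency that drives this kind of unsupervised training can be sketched for the simplified 1-D (stereo-style) case: resample the source image at positions given by a per-pixel disparity and penalise the difference from the target. This toy numpy example is illustrative only; the full method warps via depth and camera motion in 3D, and compares learned features rather than raw intensities.

```python
import numpy as np

def warp_consistency_loss(target, source, disparity):
    """target, source: (H, W) images; disparity: (H, W) horizontal shift in
    pixels. Bilinearly resamples source along x and returns the mean L1 error."""
    H, W = target.shape
    xs = np.arange(W)[None, :] - disparity          # sample locations in source
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    frac = np.clip(xs - x0, 0.0, 1.0)               # sub-pixel interpolation weight
    rows = np.arange(H)[:, None]
    warped = (1 - frac) * source[rows, x0] + frac * source[rows, x0 + 1]
    return float(np.mean(np.abs(target - warped)))
```

Replacing the raw pixel values here with dense learned feature maps is what lets the consistency signal survive appearance changes (e.g. nighttime), which is the motivation for learning features and depth jointly.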