Dr Oscar Mendez Maldonado

Research Fellow in Robot Vision
BEng, PhD

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP).


Areas of specialism

Computer Vision; Robotics; Deep Learning; Artificial Intelligence; Localisation; Autonomous Vehicles

University roles and responsibilities

  • Lecturer for EEE1035 (Programming in C)
  • Lecturer for EEE3043 (Robotics)

My qualifications

Doctor of Philosophy
University of Surrey
Bachelor of Engineering
University of Surrey

Affiliations and memberships

Fellow of the Higher Education Academy


Research interests

Research projects

Indicators of esteem

  • Sullivan Thesis Prize Winner (2018)

Courses I teach on


Postgraduate taught

My publications


Mendez Maldonado Oscar, Hadfield Simon, Pugeault Nicolas, Bowden Richard (2018) SeDAR ? Semantic Detection and Ranging: Humans can localise without LiDAR, can robots?,Proceedings of the 2018 IEEE International Conference on Robotics and Automation, May 21-25, 2018, Brisbane, Australia IEEE
How does a person work out their location using
a floorplan? It is probably safe to say that we do not explicitly
measure depths to every visible surface and try to match them
against different pose estimates in the floorplan. And yet, this
is exactly how most robotic scan-matching algorithms operate.
Similarly, we do not extrude the 2D geometry present in the
floorplan into 3D and try to align it to the real-world. And yet,
this is how most vision-based approaches localise.
Humans do the exact opposite. Instead of depth, we use
high level semantic cues. Instead of extruding the floorplan up
into the third dimension, we collapse the 3D world into a 2D
representation. Evidence of this is that many of the floorplans
we use in everyday life are not accurate, opting instead for high
levels of discriminative landmarks.
In this work, we use this insight to present a global
localisation approach that relies solely on the semantic labels
present in the floorplan and extracted from RGB images. While
our approach is able to use range measurements if available,
we demonstrate that they are unnecessary as we can achieve
results comparable to state-of-the-art without them.
Autonomous 3D reconstruction, the process whereby an agent can produce its own representation of the world, is an extremely challenging area in both vision and robotics. However, 3D reconstructions have the ability to grant robots the understanding of the world necessary for collaboration and high-level goal execution. Therefore, this thesis aims to explore methods that will enable modern robotic systems to autonomously and collaboratively achieve an understanding of the world.

In the real world, reconstructing a 3D scene requires nuanced understanding of the environment. Additionally, it is not enough to simply ?understand? the world, autonomous agents must be capable of actively acquiring this understanding. Achieving all of this using simple monocular sensors is extremely challenging. Agents must be able to understand what areas of the world are navigable, how egomotion affects reconstruction and how other agents may be leveraged to provide an advantage. All of this must be considered in addition to the traditional 3D reconstruction issues of correspondence estimation, triangulation and data association.

Simultaneous Localisation and Mapping (SLAM) solutions are not particularly well suited to autonomous multi-agent reconstruction. They typically require the sensors to be in constant communication, do not scale well with the number of agents (or map size) and require expensive optimisations. Instead, this thesis attempts to develop more pro-active techniques from the ground up.

First, an autonomous agent must have the ability to actively select what it is going to reconstruct. Known as view-selection, or Next-Best View (NBV), this has recently become an active topic in autonomous robotics and will form the first contribution of this thesis. Second, once a view is selected, an autonomous agent must be able to plan a trajectory to arrive at that view. This problem, known as path-planning, can be considered a core topic in the robotics field and will form the second contribution of this thesis. Finally, the 3D reconstruction must be anchored to a globally consistent map that co-relates to the real world. This will be addressed as a floorplan localisation problem, an emerging field for the vision community, and will be the third contribution of this thesis.

To give autonomous agents the ability to actively select what data to process, this thesis discusses the NBV problem in the context of Multi-View Stereo (MVS). The proposed approach has the ability to massively reduce the amount of computing resources required for any given 3D reconstruction. More importantly, it autonomously selects the views that improve the reconstruction the most. All of this is done exclusively on the sensor pose; the images are not used for view-selection and only loaded into memory once they have been selected for reconstruction. Experimental evaluation shows that NBV applied to this problem can achieve results comparable to state-of-the-art using as little as 3.8% of the views.

To provide the ability to execute an autonomous 3D reconstruction, this thesis proposes a novel computer-vision based goal-estimation and path-planning approach. The method proposed in the previous chapter is extended into a continuous pose-space. The resulting view then becomes the goal of a Scenic Pathplanner that plans a trajectory between the current robot pose and the NBV. This is done using an NBV-based pose-space that biases the paths towards areas of high information gain. Experimental evaluation shows that the Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 3% of the views, whichcorresponds to 2.7 × 10 ?4 % of the possible stereo pairs (using a naive interpretation of plausible stereo pairs). Comparison against length-based path-planning approaches show that the Scenic Pathplanner produces more complete and more accurate maps with fewer frames. Finally, the ability of the Scenic Pathplanner to generalise to live sc

Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2017) Taking the Scenic Route to 3D: Optimising Reconstruction from Moving Cameras,ICCV 2017 Proceedings IEEE
Reconstruction of 3D environments is a problem that
has been widely addressed in the literature. While many
approaches exist to perform reconstruction, few of them
take an active role in deciding where the next observations
should come from. Furthermore, the problem of travelling
from the camera?s current position to the next, known as
pathplanning, usually focuses on minimising path length.
This approach is ill-suited for reconstruction applications,
where learning about the environment is more valuable than
speed of traversal.
We present a novel Scenic Route Planner that selects
paths which maximise information gain, both in terms of
total map coverage and reconstruction accuracy. We also
introduce a new type of collaborative behaviour into the
planning stage called opportunistic collaboration, which
allows sensors to switch between acting as independent
Structure from Motion (SfM) agents or as a variable baseline
stereo pair.
We show that Scenic Planning enables similar performance
to state-of-the-art batch approaches using less than
0.00027% of the possible stereo pairs (3% of the views).
Comparison against length-based pathplanning approaches
show that our approach produces more complete and more
accurate maps with fewer frames. Finally, we demonstrate
the Scenic Pathplanner?s ability to generalise to live scenarios
by mounting cameras on autonomous ground-based
sensor platforms and exploring an environment.
Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2016) Next-best stereo: extending next best view optimisation for collaborative sensors,Proceedings of BMVC 2016
Most 3D reconstruction approaches passively optimise over all data, exhaustively matching pairs, rather than actively selecting data to process. This is costly both in terms of time and computer resources, and quickly becomes intractable for large datasets. This work proposes an approach to intelligently filter large amounts of data for 3D reconstructions of unknown scenes using monocular cameras. Our contributions are twofold: First, we present a novel approach to efficiently optimise the Next-Best View ( NBV ) in terms of accuracy and coverage using partial scene geometry. Second, we extend this to intelligently selecting stereo pairs by jointly optimising the baseline and vergence to find the NBV ?s best stereo pair to perform reconstruction. Both contributions are extremely efficient, taking 0.8ms and 0.3ms per pose, respectively. Experimental evaluation shows that the proposed method allows efficient selection of stereo pairs for reconstruction, such that a dense model can be obtained with only a small number of images. Once a complete model has been obtained, the remaining computational budget is used to intelligently refine areas of uncertainty, achieving results comparable to state-of-the-art batch approaches on the Middlebury dataset, using as little as 3.8% of the views.
Spencer Jaime, Mendez Maldonado Oscar, Bowden Richard, Hadfield Simon (2018) Localisation via Deep Imagination: learn the features not the map,Proceedings of ECCV 2018 - European Conference on Computer Vision Springer Nature
How many times does a human have to drive through the
same area to become familiar with it? To begin with, we might first
build a mental model of our surroundings. Upon revisiting this area, we
can use this model to extrapolate to new unseen locations and imagine
their appearance.
Based on this, we propose an approach where an agent is capable of
modelling new environments after a single visitation. To this end, we
introduce ?Deep Imagination?, a combination of classical Visual-based
Monte Carlo Localisation and deep learning. By making use of a feature
embedded 3D map, the system can ?imagine? the view from any
novel location. These ?imagined? views are contrasted with the current
observation in order to estimate the agent?s current location. In order to
build the embedded map, we train a deep Siamese Fully Convolutional
U-Net to perform dense feature extraction. By training these features to
be generic, no additional training or fine tuning is required to adapt to
new environments.
Our results demonstrate the generality and transfer capability of our
learnt dense features by training and evaluating on multiple datasets.
Additionally, we include several visualizations of the feature representations
and resulting 3D maps, as well as their application to localisation.
Walters Celyn, Mendez Oscar, Hadfield Simon, Bowden Richard (2019) A Robust Extrinsic Calibration Framework for Vehicles with Unscaled Sensors,Towards a Robotic Society IEEE

Accurate extrinsic sensor calibration is essential for both autonomous vehicles and robots. Traditionally this is an involved process requiring calibration targets, known fiducial markers and is generally performed in a lab. Moreover, even a small change in the sensor layout requires recalibration. With the anticipated arrival of consumer autonomous vehicles, there is demand for a system which can do this automatically, after deployment and without specialist human expertise.

To solve these limitations, we propose a flexible framework which can estimate extrinsic parameters without an explicit calibration stage, even for sensors with unknown scale. Our first contribution builds upon standard hand-eye calibration by jointly recovering scale. Our second contribution is that our system is made robust to imperfect and degenerate sensor data, by collecting independent sets of poses and automatically selecting those which are most ideal.

We show that our approach?s robustness is essential for the target scenario. Unlike previous approaches, ours runs in real time and constantly estimates the extrinsic transform. For both an ideal experimental setup and a real use case, comparison against these approaches shows that we outperform the state-of-the-art. Furthermore, we demonstrate that the recovered scale may be applied to the full trajectory, circumventing the need for scale estimation via sensor fusion.

Mendez Oscar, Hadfield Simon, Pugeault Nicolas, Bowden Richard (2019) SeDAR: Reading floorplans like a human,International Journal of Computer Vision Springer Verlag

The use of human-level semantic information to
aid robotic tasks has recently become an important area for
both Computer Vision and Robotics. This has been enabled
by advances in Deep Learning that allow consistent and robust
semantic understanding. Leveraging this semantic vision
of the world has allowed human-level understanding to
naturally emerge from many different
approaches. Particularly, the use of semantic information to
aid in localisation and reconstruction has been at the forefront
of both fields.

Like robots, humans also require the ability to localise
within a structure. To aid this, humans have designed highlevel
semantic maps of our structures called floorplans. We
are extremely good at localising in them, even with limited
access to the depth information used by robots. This is because
we focus on the distribution of semantic elements,
rather than geometric ones. Evidence of this is that humans
are normally able to localise in a floorplan that has not been
scaled properly. In order to grant this ability to robots, it is
necessary to use localisation approaches that leverage the
same semantic information humans use.

In this paper, we present a novel method for semantically
enabled global localisation. Our approach relies on
the semantic labels present in the floorplan. Deep Learning
is leveraged to extract semantic labels from RGB images,
which are compared to the floorplan for localisation. While
our approach is able to use range measurements if available,
we demonstrate that they are unnecessary as we can achieve
results comparable to state-of-the-art without them.

Additional publications