Dr Charles Malleson

Research Fellow in Computer Vision, CVSSP, Leverhulme Early Career Fellow
PhD, MSc, BEng (Hons)
+44 (0)1483 686046
05 BB 00

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP).

My publications


Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John Total Capture, University of Surrey
Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John (2017) Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors, Proceedings of 28th British Machine Vision Conference pp. 1-13
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull (PVH) data derived from the MVV frames. We incorporate this model within a dual-stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online.
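As a rough illustration of the volumetric input described above, a probabilistic visual hull (PVH) can be computed as a per-voxel product of per-view silhouette probabilities. The `project` callback, array shapes and thresholds below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def pvh_occupancy(voxels, silhouette_probs, project):
    """Probabilistic visual hull: per-voxel occupancy as the product of
    per-view silhouette probabilities at each voxel's projection.

    voxels           : (N, 3) array of voxel centres in world space
    silhouette_probs : list of (H, W) arrays, P(pixel is foreground) per view
    project          : project(view_idx, points) -> (N, 2) pixel coordinates
    """
    occupancy = np.ones(len(voxels))
    for v, probs in enumerate(silhouette_probs):
        h, w = probs.shape
        px = np.round(project(v, voxels)).astype(int)
        inside = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        p = np.zeros(len(voxels))
        p[inside] = probs[px[inside, 1], px[inside, 0]]
        occupancy *= p  # a voxel must be foreground in every view
    return occupancy
```

In the paper this discretised volume is the input to the 3D convolutional network; here it is only the occupancy computation that is sketched.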
Malleson Charles, Volino Marco, Gilbert Andrew, Trumble Matthew, Collomosse John, Hilton Adrian (2017) Real-time Full-Body Motion Capture from Video and IMUs, 3DV 2017 Proceedings CPS
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered, including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor environments.
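Very loosely, an optimization-based framework of this kind minimises a weighted sum of the constraint types named above: 2D keypoint reprojection, IMU orientation agreement and a pose prior. All function names, weights and data layouts in this sketch are hypothetical, not the paper's formulation:

```python
import numpy as np

def pose_cost(joint_pos_3d, joint_orient, detections_2d, project,
              imu_orient, imu_joints, prior_mean,
              w_2d=1.0, w_imu=1.0, w_prior=0.1):
    """Toy per-frame cost combining three constraint types.

    joint_pos_3d  : (J, 3) candidate joint positions
    joint_orient  : dict joint index -> orientation vector of the solved pose
    detections_2d : list of (J, 2) keypoint detections, one per camera
    project       : project(cam_idx, points) -> (J, 2) pixel coordinates
    imu_orient    : measured orientation per IMU, imu_joints maps them to joints
    prior_mean    : (J, 3) mean pose of the prior model
    """
    # video term: reprojection error of candidate joints against detections
    e2d = sum(np.sum((project(c, joint_pos_3d) - det) ** 2)
              for c, det in enumerate(detections_2d))
    # IMU term: deviation between solved and measured segment orientations
    eimu = sum(np.sum((joint_orient[j] - q) ** 2)
               for j, q in zip(imu_joints, imu_orient))
    # prior term: deviation from the mean pose
    eprior = np.sum((joint_pos_3d - prior_mean) ** 2)
    return w_2d * e2d + w_imu * eimu + w_prior * eprior
```

A real solver would minimise such a cost over skeletal pose parameters at frame rate; this scalar function only illustrates how the three data sources enter one objective.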
Malleson Charles, Guillemaut Jean-Yves, Hilton Adrian (2018) Hybrid modelling of non-rigid scenes from RGBD cameras, IEEE Transactions on Circuits and Systems for Video Technology IEEE
Recent advances in sensor technology have introduced low-cost RGB video plus depth sensors, such as the Kinect, which enable simultaneous acquisition of colour and depth images at video rates. This paper introduces a framework for representation of general dynamic scenes from video plus depth acquisition. A hybrid representation is proposed which combines the advantages of prior surfel graph surface segmentation and modelling work with the higher-resolution surface reconstruction capability of volumetric fusion techniques. The contributions are (1) extension of a prior piecewise surfel graph modelling approach for improved accuracy and completeness, (2) combination of this surfel graph modelling with TSDF surface fusion to generate dense geometry, and (3) proposal of means for validation of the reconstructed 4D scene model against the input data and efficient storage of any unmodelled regions via residual depth maps. The approach allows arbitrary dynamic scenes to be efficiently represented with temporally consistent structure and enhanced levels of detail and completeness where possible, but gracefully falls back to raw measurements where no structure can be inferred. The representation is shown to facilitate creative manipulation of real scene data which would previously require more complex capture setups or manual processing.
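The residual depth map idea in contribution (3) admits a simple sketch: render the fused model into the input view and keep only the raw depth samples the model fails to explain within a tolerance. The tolerance value and array layout here are assumptions for illustration:

```python
import numpy as np

def residual_depth_map(raw_depth, model_depth, tolerance=0.01):
    """Keep only raw depth samples (in metres) that the fused model fails
    to explain. Pixels within `tolerance` of the rendered model depth are
    dropped (zeroed); the rest survive as a sparse residual map."""
    return np.where(np.abs(raw_depth - model_depth) > tolerance,
                    raw_depth, 0.0)
```

Stored alongside the model, such residuals let the representation fall back to raw measurements exactly where no structure was inferred.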
Gilbert Andrew, Trumble Matthew, Malleson Charles, Hilton Adrian, Collomosse John (2018) Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation, International Journal of Computer Vision Springer Verlag
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset [26], the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online.
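In skeleton form, the final fusion stage of such a dual-stream network concatenates the visual (PVH) and IMU stream embeddings and passes them through one fully connected layer. The weights `W`, `b` stand in for a trained layer and the shapes are purely illustrative:

```python
import numpy as np

def fuse_streams(pvh_embedding, imu_embedding, W, b):
    """Concatenate the learnt visual pose embedding with the IMU
    (forward-kinematics) embedding, then apply a fully connected layer
    to regress the fused pose output."""
    x = np.concatenate([pvh_embedding, imu_embedding])
    return W @ x + b
```

In the paper each stream also carries an LSTM before this point; only the fusion-by-concatenation step is sketched here.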
Malleson Charles, Bazin Jean-Charles, Wang Oliver, Bradley Derek, Beeler Thabo, Hilton Adrian, Sorkine-Hornung Alexander (2016) FaceDirector: Continuous Control of Facial Performance in Video, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015) pp. 3979-3987 Institute of Electrical and Electronics Engineers (IEEE)
We present a method to continuously blend between multiple facial performances of an actor, which can contain different facial expressions or emotional states. As an example, given sad and angry video takes of a scene, our method empowers the movie director to specify arbitrary weighted combinations and smooth transitions between the two takes in post-production. Our contributions include (1) a robust nonlinear audio-visual synchronization technique that exploits complementary properties of audio and visual cues to automatically determine robust, dense spatiotemporal correspondences between takes, and (2) a seamless facial blending approach that provides the director full control to interpolate timing, facial expression, and local appearance, in order to generate novel performances after filming. In contrast to most previous works, our approach operates entirely in image space, avoiding the need of 3D facial reconstruction. We demonstrate that our method can synthesize visually believable performances with applications in emotion transition, performance correction, and timing control.
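Dense temporal correspondence between takes is the kind of problem classically handled by dynamic time warping. The sketch below is the standard DTW recurrence on feature sequences (e.g. audio descriptors) and is a generic stand-in, not the paper's robust nonlinear audio-visual method:

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic time warping between two feature sequences a (n, d) and
    b (m, d). Returns the accumulated-cost matrix; backtracking through it
    yields a dense, monotonic temporal correspondence between the takes."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # allow a match, an insertion or a deletion at each step
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]
```

Two identical takes align with zero accumulated cost; takes with different timing yield a warping path that maps frames of one onto the other.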
Malleson Charles, Guillemaut Jean-Yves, Hilton Adrian (2019) 3D Reconstruction from RGB-D Data, In: Rosin Paul L., Lai Yu-Kun, Shao Ling, Liu Yonghuai (eds.), RGB-D Image Analysis and Processing pp. 87-115 Springer Nature Switzerland AG
A key task in computer vision is that of generating virtual 3D models of real-world scenes by reconstructing the shape, appearance and, in the case of dynamic scenes, motion of the scene from visual sensors. Recently, low-cost video plus depth (RGB-D) sensors have become widely available and have been applied to 3D reconstruction of both static and dynamic scenes. RGB-D sensors contain an active depth sensor, which provides a stream of depth maps alongside standard colour video. The low cost and ease of use of RGB-D devices as well as their video rate capture of images along with depth make them well suited to 3D reconstruction. Use of active depth capture overcomes some of the limitations of passive monocular or multiple-view video-based approaches since reliable, metrically accurate estimates of the scene depth at each pixel can be obtained from a single view, even in scenes that lack distinctive texture. There are two key components to 3D reconstruction from RGB-D data: (1) spatial alignment of the surface over time, and (2) fusion of noisy, partial surface measurements into a more complete, consistent 3D model. In the case of static scenes, the sensor is typically moved around the scene and its pose is estimated over time. For dynamic scenes, there may be multiple rigid, articulated, or non-rigidly deforming surfaces to be tracked over time. The fusion component consists of integration of the aligned surface measurements, typically using an intermediate representation, such as the volumetric truncated signed distance field (TSDF). In this chapter, we discuss key recent approaches to 3D reconstruction from depth or RGB-D input, with an emphasis on real-time reconstruction of static and dynamic scenes.
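The TSDF fusion component mentioned above follows a well-known update rule: truncate each new signed-distance observation and fold it into a running weighted average per voxel. The parameter values below (truncation band, weight cap) are illustrative:

```python
import numpy as np

def integrate_tsdf(tsdf, weights, new_sdf, new_weight=1.0,
                   trunc=0.05, max_weight=100.0):
    """One TSDF integration step over an array of voxels.

    tsdf, weights : current per-voxel truncated signed distance and weight
    new_sdf       : signed distance observed for each voxel this frame
    """
    d = np.clip(new_sdf, -trunc, trunc)      # truncate observations
    w_sum = weights + new_weight
    tsdf_out = (tsdf * weights + d * new_weight) / w_sum
    w_out = np.minimum(w_sum, max_weight)    # cap weight for responsiveness
    return tsdf_out, w_out
```

Repeated integration of aligned depth frames converges each voxel toward the average observed distance, and the zero level set of the field gives the fused surface.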
Malleson Charles, Collomosse John, Hilton Adrian (2019) Real-Time Multi-person Motion Capture from Multi-view Video and IMUs, International Journal of Computer Vision Springer
A real-time motion capture system is presented which uses input from multiple standard video cameras and inertial measurement units (IMUs). The system is able to track multiple people simultaneously and requires no optical markers, specialized infra-red cameras or foreground/background segmentation, making it applicable to general indoor and outdoor scenarios with dynamic backgrounds and lighting. To overcome limitations of prior video or IMU-only approaches, we propose to use flexible combinations of multiple-view, calibrated video and IMU input along with a pose prior in an online optimization-based framework, which allows the full 6-DoF motion to be recovered including axial rotation of limbs and drift-free global position. A method for sorting and assigning raw input 2D keypoint detections into corresponding subjects is presented which facilitates multi-person tracking and rejection of any bystanders in the scene. The approach is evaluated on data from several indoor and outdoor capture environments with one or more subjects and the trade-off between input sparsity and tracking performance is discussed. State-of-the-art pose estimation performance is obtained on the Total Capture (multi-view video and IMU) and Human 3.6M (multi-view video) datasets. Finally, a live demonstrator for the approach is presented showing real-time capture, solving and character animation using a light-weight, commodity hardware setup.
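The keypoint sorting-and-assignment step can be caricatured as nearest-subject matching with a rejection threshold for bystanders. The paper's actual method is more involved; the threshold and data layout below are invented for illustration:

```python
import numpy as np

def assign_detections(detections, predicted, max_dist=50.0):
    """Assign each raw 2D keypoint detection to the nearest tracked
    subject's predicted 2D position; detections far from every subject
    are rejected as bystanders (label -1).

    detections : (N, 2) detected keypoint positions in one camera view
    predicted  : (S, 2) predicted positions of the S tracked subjects
    """
    labels = []
    for det in detections:
        dists = np.linalg.norm(predicted - det, axis=1)
        j = int(np.argmin(dists))
        labels.append(j if dists[j] <= max_dist else -1)
    return labels
```

Grouping detections per subject this way lets the downstream solver consume only the keypoints belonging to each tracked person.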
Typical colour digital cameras have a single sensor with a colour filter array (CFA), each pixel capturing a single channel (red, green or blue). A full RGB colour output image is generated by demosaicing (DM), i.e. interpolating to infer the two unobserved channels for each pixel. The DM approach used can have a significant effect on the quality of the output image, particularly in the presence of common imaging artifacts such as chromatic aberration (CA). Small differences in the focal length for each channel (lateral CA) and the inability of the lens to bring all three channels simultaneously into focus (longitudinal CA) can cause objectionable colour fringing artifacts in edge regions. These artifacts can be particularly severe when using low-cost lenses. We propose to use a set of simple neural networks to learn to jointly perform DM and CA correction, producing high quality colour images even in the presence of severe CA and image noise. The proposed neural network-based joint DM and CA correction produces a significant improvement in image quality metrics (PSNR and SSIM) compared to the baseline edge-directed linear interpolation approach, preserving image detail and reducing objectionable false colour and comb artifacts. The approach can be applied in the production of high quality images and video from machine vision cameras with low cost lenses, thus extending the viability of such hardware to visual media production.
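For context, the kind of baseline linear-interpolation demosaicing the learned networks are compared against can be sketched for the green channel of an RGGB Bayer mosaic. This is plain bilinear averaging, simpler than the edge-directed baseline in the paper, and the RGGB layout is assumed:

```python
import numpy as np

def bilinear_demosaic_green(mosaic):
    """Recover a full-resolution green channel from an RGGB Bayer mosaic
    by bilinear interpolation. In RGGB, green is observed at pixels where
    (row + col) is odd; missing green pixels take the average of their
    observed 4-neighbours."""
    h, w = mosaic.shape
    yy, xx = np.mgrid[0:h, 0:w]
    g_mask = (yy + xx) % 2 == 1
    green = np.where(g_mask, mosaic, 0.0)
    # sum observed-green values and counts over each 4-neighbourhood
    padded = np.pad(green, 1)
    counts = np.pad(g_mask.astype(float), 1)
    neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
             padded[1:-1, :-2] + padded[1:-1, 2:])
    nc = (counts[:-2, 1:-1] + counts[2:, 1:-1] +
          counts[1:-1, :-2] + counts[1:-1, 2:])
    green[~g_mask] = neigh[~g_mask] / np.maximum(nc[~g_mask], 1.0)
    return green
```

The red and blue channels are interpolated analogously from their sparser sample grids; it is exactly this interpolation step, together with CA correction, that the proposed networks learn jointly.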