Dr Charles Malleson

Research Fellow in Computer Vision, CVSSP
PhD, MSc, BEng (Hons)

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP).

My publications


Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John Total Capture, University of Surrey
Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John (2017) Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors, Proceedings of 28th British Machine Vision Conference pp. 1-13
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement
unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional
neural network is used to learn a pose embedding from volumetric probabilistic
visual hull data (PVH) derived from the MVV frames. We incorporate this model within
a dual stream network integrating pose embeddings derived from MVV and a forward
kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within
both streams prior to their fusion. Hybrid pose inference using these two complementary
data sources is shown to resolve ambiguities within each sensor modality, yielding improved
accuracy over prior methods. A further contribution of this work is a new hybrid
MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth
derived from a commercial motion capture system. The dataset is available online at
Malleson Charles, Volino Marco, Gilbert Andrew, Trumble Matthew, Collomosse John, Hilton Adrian (2017) Real-time Full-Body Motion Capture from Video and IMUs, 3DV 2017 Proceedings CPS
A real-time full-body motion capture system is presented
which uses input from a sparse set of inertial measurement
units (IMUs) along with images from two or more standard
video cameras and requires no optical markers or specialized
infra-red cameras. A real-time optimization-based
framework is proposed which incorporates constraints from
the IMUs, cameras and a prior pose model. The combination
of video and IMU data allows the full 6-DOF motion to
be recovered including axial rotation of limbs and drift-free
global position. The approach was tested using both indoor
and outdoor captured data. The results demonstrate the effectiveness
of the approach for tracking a wide range of human
motion in real time in unconstrained indoor/outdoor
Malleson Charles, Guillemaut Jean-Yves, Hilton Adrian (2018) Hybrid modelling of non-rigid scenes from RGBD cameras, IEEE Transactions on Circuits and Systems for Video Technology IEEE
Recent advances in sensor technology have introduced
low-cost RGB video plus depth sensors, such as the
Kinect, which enable simultaneous acquisition of colour and
depth images at video rates. This paper introduces a framework
for representation of general dynamic scenes from video plus
depth acquisition. A hybrid representation is proposed which
combines the advantages of prior surfel graph surface segmentation
and modelling work with the higher-resolution surface
reconstruction capability of volumetric fusion techniques. The
contributions are (1) extension of a prior piecewise surfel graph
modelling approach for improved accuracy and completeness, (2)
combination of this surfel graph modelling with TSDF surface
fusion to generate dense geometry, and (3) proposal of means for
validation of the reconstructed 4D scene model against the input
data and efficient storage of any unmodelled regions via residual
depth maps. The approach allows arbitrary dynamic scenes to be
efficiently represented with temporally consistent structure and
enhanced levels of detail and completeness where possible, but
gracefully falls back to raw measurements where no structure
can be inferred. The representation is shown to facilitate creative
manipulation of real scene data which would previously require
more complex capture setups or manual processing.
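As a rough illustration of the TSDF (truncated signed distance function) fusion step mentioned in the abstract above, the following is a minimal sketch of the standard running-average update, reduced to a single column of voxels along one camera ray. It is not the authors' implementation. The function names, the unit per-observation weight, the truncation distance and the 1-D layout are all assumptions made for illustration.

```python
def fuse_observation(tsdf, weights, voxel_depths, observed_depth,
                     trunc=0.05, max_weight=64.0):
    """Fuse one depth measurement into voxels lying along a single camera ray.

    tsdf, weights and voxel_depths are equal-length lists; tsdf and weights
    are updated in place. All names and defaults are hypothetical.
    """
    for i, z in enumerate(voxel_depths):
        sdf = observed_depth - z           # positive in front of the surface
        if sdf < -trunc:
            continue                       # far behind the surface: unobserved
        d = min(1.0, sdf / trunc)          # truncate and normalise to [-1, 1]
        w_old = weights[i]
        # Weighted running average, with a weight of 1 per new observation.
        tsdf[i] = (tsdf[i] * w_old + d) / (w_old + 1.0)
        weights[i] = min(w_old + 1.0, max_weight)
    return tsdf, weights


def surface_depth(tsdf, weights, voxel_depths):
    """Locate the zero crossing of the fused TSDF by linear interpolation."""
    for i in range(len(tsdf) - 1):
        observed = weights[i] > 0 and weights[i + 1] > 0
        if observed and tsdf[i] > 0 >= tsdf[i + 1]:
            t = tsdf[i] / (tsdf[i] - tsdf[i + 1])
            return voxel_depths[i] + t * (voxel_depths[i + 1] - voxel_depths[i])
    return None                            # no surface found along this ray
```

Averaging several noisy depth measurements in this way is what lets volumetric fusion recover a smoother surface than any single depth frame provides. A full system would apply the same update over a 3-D voxel grid using the camera pose of each frame.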
Gilbert Andrew, Trumble Matthew, Malleson Charles, Hilton Adrian, Collomosse John (2018) Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation, International Journal of Computer Vision Springer Verlag
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video
(MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex
hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural
network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from
the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream
is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model
(LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints;
the two streams are then fused in a final fully connected layer. The two complementary
data sources allow ambiguities to be resolved within each sensor modality, yielding improved
accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art
performance reported on the popular Human 3.6M dataset [26], the newly released TotalCapture
dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the
new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate
3D skeletal joint ground truth derived from a commercial motion capture system. The dataset
is available online at
Malleson Charles, Bazin Jean-Charles, Wang Oliver, Bradley Derek, Beeler Thabo, Hilton Adrian, Sorkine-Hornung Alexander (2016) FaceDirector: Continuous Control of Facial Performance in Video, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015) pp. 3979-3987 Institute of Electrical and Electronics Engineers (IEEE)
We present a method to continuously blend between multiple facial performances of an actor, which can contain different facial expressions or emotional states. As an example, given sad and angry video takes of a scene, our method empowers the movie director to specify arbitrary weighted combinations and smooth transitions between the two takes in post-production. Our contributions include (1) a robust nonlinear audio-visual synchronization technique that exploits complementary properties of audio and visual cues to automatically determine robust, dense spatiotemporal correspondences between takes, and (2) a seamless facial blending approach that provides the director full control to interpolate timing, facial expression, and local appearance, in order to generate novel performances after filming. In contrast to most previous works, our approach operates entirely in image space, avoiding the need for 3D facial reconstruction. We demonstrate that our method can synthesize visually believable performances with applications in emotion transition, performance correction, and timing control.
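To give a feel for the two stages described in this abstract (nonlinear temporal synchronization, then weighted blending), the sketch below uses classic dynamic time warping (DTW) as a stand-in for the paper's audio-visual synchronization, whose details are not given here. Each take is reduced to a list of per-frame scalar features, which is a strong simplification: the published method works on images with dense spatiotemporal correspondences. All names are hypothetical.

```python
def dtw_path(a, b):
    """Align two 1-D feature sequences with dynamic time warping.

    Returns a list of (i, j) index pairs matching frames of a to frames of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both takes
                                 cost[i - 1][j],      # advance take a only
                                 cost[i][j - 1])      # advance take b only
    # Backtrack from the end along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def blend_takes(a, b, path, w):
    """Blend aligned frames; w=0 gives take a, w=1 gives take b."""
    return [(1.0 - w) * a[i] + w * b[j] for i, j in path]
```

Here the weight `w` plays the role of the director's control: it could be varied per frame to produce a smooth transition from one take to the other.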