Dr Charles Malleson

Research Fellow in Computer Vision, CVSSP, Leverhulme Early Career Fellow
PhD, MSc, BEng (Hons)

My publications


Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John Total Capture, University of Surrey
Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John (2017) Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors, Proceedings of 28th British Machine Vision Conference, pp. 1-13
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
Malleson Charles, Volino Marco, Gilbert Andrew, Trumble Matthew, Collomosse John, Hilton Adrian (2017) Real-time Full-Body Motion Capture from Video and IMUs, 3DV 2017 Proceedings, CPS
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.
Malleson Charles, Guillemaut Jean-Yves, Hilton Adrian (2018) Hybrid modelling of non-rigid scenes from RGBD cameras, IEEE Transactions on Circuits and Systems for Video Technology, IEEE
Recent advances in sensor technology have introduced low-cost RGB video plus depth sensors, such as the Kinect, which enable simultaneous acquisition of colour and depth images at video rates. This paper introduces a framework for representation of general dynamic scenes from video plus depth acquisition. A hybrid representation is proposed which combines the advantages of prior surfel graph surface segmentation and modelling work with the higher-resolution surface reconstruction capability of volumetric fusion techniques. The contributions are (1) extension of a prior piecewise surfel graph modelling approach for improved accuracy and completeness, (2) combination of this surfel graph modelling with TSDF surface fusion to generate dense geometry, and (3) proposal of means for validation of the reconstructed 4D scene model against the input data and efficient storage of any unmodelled regions via residual depth maps. The approach allows arbitrary dynamic scenes to be efficiently represented with temporally consistent structure and enhanced levels of detail and completeness where possible, but gracefully falls back to raw measurements where no structure can be inferred. The representation is shown to facilitate creative manipulation of real scene data which would previously require more complex capture setups or manual processing.
Gilbert Andrew, Trumble Matthew, Malleson Charles, Hilton Adrian, Collomosse John (2018) Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation, International Journal of Computer Vision, Springer Verlag
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints. The two streams are then fused in a final fully connected layer. The two complementary data sources allow ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset [26], the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
Malleson Charles, Bazin Jean-Charles, Wang Oliver, Bradley Derek, Beeler Thabo, Hilton Adrian, Sorkine-Hornung Alexander (2016) FaceDirector: Continuous Control of Facial Performance in Video, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), pp. 3979-3987, Institute of Electrical and Electronics Engineers (IEEE)
We present a method to continuously blend between multiple facial performances of an actor, which can contain different facial expressions or emotional states. As an example, given sad and angry video takes of a scene, our method empowers the movie director to specify arbitrary weighted combinations and smooth transitions between the two takes in post-production. Our contributions include (1) a robust nonlinear audio-visual synchronization technique that exploits complementary properties of audio and visual cues to automatically determine robust, dense spatiotemporal correspondences between takes, and (2) a seamless facial blending approach that provides the director full control to interpolate timing, facial expression, and local appearance, in order to generate novel performances after filming. In contrast to most previous works, our approach operates entirely in image space, avoiding the need for 3D facial reconstruction. We demonstrate that our method can synthesize visually believable performances with applications in emotion transition, performance correction, and timing control.
Malleson Charles, Guillemaut Jean-Yves, Hilton Adrian (2019) 3D Reconstruction from RGB-D Data, In: Rosin Paul L., Lai Yu-Kun, Shao Ling, Liu Yonghuai (eds.), RGB-D Image Analysis and Processing, pp. 87-115, Springer Nature Switzerland AG
A key task in computer vision is that of generating virtual 3D models of real-world scenes by reconstructing the shape, appearance and, in the case of dynamic scenes, motion of the scene from visual sensors. Recently, low-cost video plus depth (RGB-D) sensors have become widely available and have been applied to 3D reconstruction of both static and dynamic scenes. RGB-D sensors contain an active depth sensor, which provides a stream of depth maps alongside standard colour video. The low cost and ease of use of RGB-D devices as well as their video rate capture of images along with depth make them well suited to 3D reconstruction. Use of active depth capture overcomes some of the limitations of passive monocular or multiple-view video-based approaches since reliable, metrically accurate estimates of the scene depth at each pixel can be obtained from a single view, even in scenes that lack distinctive texture. There are two key components to 3D reconstruction from RGB-D data: (1) spatial alignment of the surface over time and, (2) fusion of noisy, partial surface measurements into a more complete, consistent 3D model. In the case of static scenes, the sensor is typically moved around the scene and its pose is estimated over time. For dynamic scenes, there may be multiple rigid, articulated, or non-rigidly deforming surfaces to be tracked over time. The fusion component consists of integration of the aligned surface measurements, typically using an intermediate representation, such as the volumetric truncated signed distance field (TSDF). In this chapter, we discuss key recent approaches to 3D reconstruction from depth or RGB-D input, with an emphasis on real-time reconstruction of static scenes.
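The fusion component described in the abstract above can be illustrated with a minimal sketch. It is not the chapter's implementation: it assumes a single camera ray, a 1D row of voxels, unit weight per measurement and a simple running weighted average, and the names `integrate_depth`, `voxel_depths` and `trunc` are hypothetical.

```python
def integrate_depth(tsdf, weights, voxel_depths, measured_depth, trunc=0.1):
    """Fuse one depth measurement into a 1D TSDF along a single camera ray.

    tsdf, weights: per-voxel lists, updated in place.
    voxel_depths:  depth of each voxel centre along the ray (assumed layout).
    trunc:         truncation distance, in the same units as the depths.
    """
    for i, z in enumerate(voxel_depths):
        sdf = measured_depth - z      # positive in front of the surface
        if sdf < -trunc:
            continue                  # far behind the surface: not observed
        d = min(1.0, sdf / trunc)     # truncated, normalised to [-1, 1]
        w = weights[i]
        tsdf[i] = (w * tsdf[i] + d) / (w + 1.0)  # running weighted average
        weights[i] = w + 1.0
    return tsdf, weights


# Two noisy measurements of a surface near depth 1.0.
voxel_depths = [0.8, 0.9, 1.0, 1.1, 1.2]
tsdf = [0.0] * len(voxel_depths)
weights = [0.0] * len(voxel_depths)
for depth in (0.98, 1.02):
    integrate_depth(tsdf, weights, voxel_depths, depth)
```

After both measurements, the voxel at depth 1.0 holds a value close to zero, which marks the fused surface, while the voxel at depth 1.2 keeps zero weight because it lies behind the truncation band in both observations.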
Malleson Charles, Collomosse John, Hilton Adrian (2019) Real-Time Multi-person Motion Capture from Multi-view Video and IMUs, International Journal of Computer Vision, Springer
A real-time motion capture system is presented which uses input from multiple standard video cameras and inertial measurement units (IMUs). The system is able to track multiple people simultaneously and requires no optical markers, specialized infra-red cameras or foreground/background segmentation, making it applicable to general indoor and outdoor scenarios with dynamic backgrounds and lighting. To overcome limitations of prior video or IMU-only approaches, we propose to use flexible combinations of multiple-view, calibrated video and IMU input along with a pose prior in an online optimization-based framework, which allows the full 6-DoF motion to be recovered including axial rotation of limbs and drift-free global position. A method for sorting and assigning raw input 2D keypoint detections into corresponding subjects is presented which facilitates multi-person tracking and rejection of any bystanders in the scene. The approach is evaluated on data from several indoor and outdoor capture environments with one or more subjects and the trade-off between input sparsity and tracking performance is discussed. State-of-the-art pose estimation performance is obtained on the Total Capture (multi-view video and IMU) and Human 3.6M (multi-view video) datasets. Finally, a live demonstrator for the approach is presented showing real-time capture, solving and character animation using a light-weight, commodity hardware setup.
Typical colour digital cameras have a single sensor with a colour filter array (CFA), each pixel capturing a single channel (red, green or blue). A full RGB colour output image is generated by demosaicing (DM), i.e. interpolating to infer the two unobserved channels for each pixel. The DM approach used can have a significant effect on the quality of the output image, particularly in the presence of common imaging artifacts such as chromatic aberration (CA). Small differences in the focal length for each channel (lateral CA) and the inability of the lens to bring all three channels simultaneously into focus (longitudinal CA) can cause objectionable colour fringing artifacts in edge regions. These artifacts can be particularly severe when using low-cost lenses. We propose to use a set of simple neural networks to learn to jointly perform DM and CA correction, producing high quality colour images subject to severe CA as well as image noise. The proposed neural network-based joint DM and CA correction produces a significant improvement in image quality metrics (PSNR and SSIM) compared to the baseline edge-directed linear interpolation approach, preserving image detail and reducing objectionable false colour and comb artifacts. The approach can be applied in the production of high quality images and video from machine vision cameras with low-cost lenses, thus extending the viability of such hardware to visual media production.
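To make the demosaicing step concrete, the sketch below fills in the green channel of a raw CFA image by plain neighbour averaging. It is a simpler stand-in, neither the edge-directed baseline nor the neural networks mentioned above. The RGGB layout, the list-of-lists image format and the function names are assumptions for illustration, and the image is assumed to be at least 2x2 pixels.

```python
def bayer_channel(row, col):
    """Channel observed at (row, col) in an RGGB Bayer pattern (assumed layout)."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def interpolate_green(raw):
    """Estimate green at every pixel of a raw CFA image (list of rows).

    Green pixels keep their observed value. At red and blue pixels the
    in-bounds 4-neighbours are all green in an RGGB layout, so their
    mean is used as the estimate.
    """
    h, w = len(raw), len(raw[0])
    green = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if bayer_channel(r, c) == "G":
                green[r][c] = float(raw[r][c])
                continue
            neighbours = [
                raw[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < h and 0 <= cc < w
            ]
            green[r][c] = sum(neighbours) / len(neighbours)
    return green


# Small synthetic raw image; values at red/blue sites are arbitrary (10).
raw = [
    [10, 20, 10, 40],
    [60, 10, 80, 10],
    [10, 100, 10, 120],
    [140, 10, 160, 10],
]
green = interpolate_green(raw)
```

Averaging of this kind ignores edges and any misalignment between channels, which is where the false colour and fringing artifacts described in the abstract tend to appear.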