Dr Marco Volino


Areas of specialism

Computer Vision; Computer Graphics; Computer Animation; Volumetric Video; Light Fields; Virtual Reality; Augmented Reality

Research projects

My publications


Malleson Charles, Volino Marco, Gilbert Andrew, Trumble Matthew, Collomosse John, Hilton Adrian (2017) Real-time Full-Body Motion Capture from Video and IMUs, 3DV 2017 Proceedings CPS
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.
Gilbert Andrew, Volino Marco, Collomosse John, Hilton Adrian (2018) Volumetric performance capture from minimal camera viewpoints, Computer Vision – ECCV 2018. Lecture Notes in Computer Science 11215 pp. 591-607 Springer Science+Business Media
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
Mustafa Armin, Volino Marco, Guillemaut Jean-Yves, Hilton Adrian (2018) 4D Temporally Coherent Light-field Video, 3DV 2017 Proceedings IEEE
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multiview dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.
Casas D, Volino Marco, Collomosse JP, Hilton A (2014) 4D Video Textures for Interactive Character Appearance, Computer Graphics Forum: the international journal of the Eurographics Association
4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.
Volino Marco, Casas D, Collomosse JP, Hilton A (2014) Optimal Representation of Multi-View Video, Proceedings of BMVC 2014 - British Machine Vision Conference BMVC
Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling result in a compact representation with minimal loss of information, allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.
Volino Marco, Huang Peng, Hilton Adrian (2015) Online interactive 4D character animation, Proceedings of the 20th International Conference on 3D Web Technology (Web3D '15) pp. 289-295
This paper presents a framework for creating realistic virtual characters that can be delivered via the Internet and interactively controlled in a WebGL-enabled web browser. Four-dimensional performance capture is used to capture realistic human motion and appearance. The captured data is processed into efficient and compact representations for geometry and texture. Motions are analysed against a high-level, user-defined motion graph and suitable inter- and intra-motion transitions are identified. This processed data is stored on a webserver and downloaded by a client application when required. A JavaScript-based character animation engine is used to manage the state of the character, which responds to user input and sends required frames to a WebGL-based renderer for display. Through the efficient geometry, texture and motion graph representations, a game character capable of performing a range of motions can be represented in 40-50 MB of data. This highlights the potential use of four-dimensional performance capture for creating web-based content. Datasets are made available for further research and an online demo is provided.
Regateiro Joao, Volino Marco, Hilton Adrian (2018) Hybrid Skeleton Driven Surface Registration for Temporally Consistent Volumetric Video, Proceedings of 2018 International Conference on 3D Vision (3DV) pp. 514-522 Institute of Electrical and Electronics Engineers (IEEE)
This paper presents a hybrid skeleton-driven surface registration (HSDSR) approach to generate temporally consistent meshes from multiple view video of human subjects. 2D pose detections from multiple view video are used to estimate 3D skeletal pose on a per-frame basis. The 3D pose is embedded into a 3D surface reconstruction allowing any frame to be reposed into the shape from any other frame in the captured sequence. Skeletal motion transfer is performed by selecting a reference frame from the surface reconstruction data and reposing it to match the pose estimation of other frames in a sequence. This allows an initial coarse alignment to be performed prior to refinement by a patch-based non-rigid mesh deformation. The proposed approach overcomes limitations of previous work by reposing a reference mesh to match the pose of a target mesh reconstruction, providing a closer starting point for further non-rigid mesh deformation. It is shown that the proposed approach is able to achieve comparable results to existing model-based and model-free approaches. Finally, it is demonstrated that this framework provides an intuitive way for artists and animators to edit volumetric video.
Berghi Davide, Stenzel Hanne, Volino Marco, Hilton Adrian, Jackson Philip (2020) Audio-Visual Spatial Alignment Requirements of Central and Peripheral Object Events, IEEE VR 2020 IEEE
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.
Regateiro João P. C. (2020) Learning to animate volumetric video.
4D human performance capture aims to create volumetric representations of observed human subjects performing arbitrary motions with the ability to replay and render dynamic scenes with the realism of the recorded video. This representation has the potential to enable highly realistic content production for immersive virtual and augmented reality experiences. Human models are typically rendered using detailed, explicit 3D models, which consist of meshes and textures, and animated using tailored motion models to simulate human behaviour and activity. However, designing a realistic 3D human model is still a costly and laborious process. Hence, this work investigates techniques to learn models of human body shape and appearance, aiming to facilitate the generation of highly realistic human animation, and demonstrate its potential contributions, applications, and versatility. The first contribution of this work is a skeleton driven surface registration approach to generate temporally consistent meshes from multi-view video of human subjects. 2D pose detections from multi-view video are used to estimate 3D skeletal pose on a per-frame basis, which allows a reference frame to match the pose estimation of other frames in a sequence. This allows an initial coarse alignment followed by a patch-based non-rigid mesh deformation to generate temporally consistent mesh sequences. The second contribution presents techniques to represent human-like shape using a compressed learnt model from 4D volumetric performance capture data. Sequences of 4D dynamic geometry representing a human are encoded with a generative network into a compact space representation, whilst maintaining the original properties, such as surface non-rigid deformations. This compact representation enables synthesis, interpolation and generation of 3D shapes. 
The third contribution is the Deep4D generative network, which is capable of compact representation of 4D volumetric video sequences from skeletal motion of people with two orders of magnitude compression. A variational encoder-decoder is employed to learn an encoded latent space that maps from 3D skeletal pose to 4D shape and appearance. This enables high-quality 4D volumetric video synthesis to be driven by skeletal animation. Finally, this thesis introduces the Deep4D motion graph to implicitly combine multiple captured motions in a unified representation for character animation from volumetric video, allowing novel character movements to be generated with dynamic shape and appearance detail. Deep4D motion graphs allow character animation to be driven by skeletal motion sequences, providing a compact encoded representation capable of high-quality synthesis of the 4D volumetric video with two orders of magnitude compression.
Mustafa Armin, Volino Marco, Kim Hansung, Guillemaut Jean-Yves, Hilton Adrian (2020) Temporally coherent general dynamic scene reconstruction, International Journal of Computer Vision Springer
Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing a shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction, and its use in applications such as free-view rendering and virtual reality.