Dr Marco Volino


Lecturer in Computer Vision and Graphics

About

Areas of specialism

Computer Vision; Computer Graphics; Computer Animation; Volumetric Video; Light Fields; Virtual Reality; Augmented Reality

Publications

Marco Volino, Adrian Hilton (2013) Layered view-dependent texture maps, In: Proceedings of the 10th European Conference on Visual Media Production, pp. 1-8, ACM

Video-based free-viewpoint rendering from multiple view video capture has achieved video-realistic performance replay. Existing free-viewpoint rendering approaches require storage, streaming and re-sampling of multiple videos, which requires high bandwidth and computational resources limiting applications to local replay on high-performance computers. This paper introduces a layered texture representation for efficient storage and view-dependent rendering from multiple view video capture whilst maintaining the video-realism. Layered textures re-sample the captured video according to the surface visibility. Prioritisation of layers according to surface visibility allows the N-best views for all surface elements to be pre-computed significantly reducing both storage and rendering cost. Typically 3 texture map layers are required for free-viewpoint rendering with an equivalent visual quality to the multiple view video giving a significant reduction in storage cost. Quantitative evaluation demonstrates that the layered representation achieves a 90% reduction in storage cost and 50% reduction in rendering cost without loss of visual quality compared to storing only the foreground of the original multiple view video. This reduces the storage and transmission cost for free-viewpoint video rendering from eight cameras to be similar to the requirements for a single video. Streaming the layered representation enables, for the first time, demonstration of free-viewpoint video rendering on mobile devices and web platforms.
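
The N-best view pre-computation described in this abstract can be illustrated with a minimal sketch. It assumes visibility is scored by the cosine of the angle between a surface element's normal and the direction to each camera, and it ignores occlusion; this scoring rule is an illustrative assumption rather than the paper's criterion. The function name `best_views`, the camera layout and the default of three layers are hypothetical.

```python
import math

def normalise(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def best_views(centre, normal, cameras, n_layers=3):
    """Return indices of the n_layers cameras that see a surface element best.

    Assumption: visibility score = cosine between the surface normal and the
    direction from the element to the camera. Cameras with a non-positive
    score (back-facing) are discarded. Occlusion is ignored in this sketch.
    """
    n = normalise(normal)
    scored = []
    for idx, cam in enumerate(cameras):
        to_cam = normalise(tuple(c - p for c, p in zip(cam, centre)))
        score = sum(a * b for a, b in zip(n, to_cam))
        if score > 0.0:
            scored.append((score, idx))
    # Highest score first; camera index breaks exact ties.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:n_layers]]

# Eight hypothetical cameras on a circle of radius 3 m around the origin.
cameras = [
    (3 * math.cos(k * math.pi / 4), 3 * math.sin(k * math.pi / 4), 0.0)
    for k in range(8)
]
layers = best_views(centre=(0.0, 0.0, 0.0), normal=(1.0, 0.0, 0.0), cameras=cameras)
```

For an element facing along the x-axis, the camery directly in front is ranked first and its two neighbours fill the remaining layers, so only three of the eight views need to be stored for that element.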

Davide Berghi, Hanne Stenzel, Marco Volino, Adrian Hilton, Philip J B Jackson (2020) Audio-Visual Spatial Alignment Requirements of Central and Peripheral Object Events, In: 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pp. 666-667, IEEE

Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.

Marco Volino, Armin Mustafa, Jean-Yves Guillemaut, Adrian Hilton (2020) Light Field Video for Immersive Content Production, In: Real VR – Immersive Digital Reality, pp. 33-64, Springer International Publishing

Light field video for content production is gaining both research and commercial interest as it has the potential to push the level of immersion for augmented and virtual reality to a close-to-reality experience. Light fields densely sample the viewing space of an object or scene using hundreds or even thousands of images with small displacements in between. However, a lack of standardised formats for compression, storage and transmission, along with the lack of tools to enable editing of light field data, currently make it impractical for use in real-world content production. In this chapter we address two fundamental problems with light field data, namely representation and compression. Firstly, we propose a method to obtain a 4D temporally coherent representation from the input light field video. This is an essential problem to solve that will enable efficient compression and editing. Secondly, we present a method for compression of light field data based on the eigen texture method that provides a compact representation and enables efficient view-dependent rendering at interactive frame rates. These approaches achieve an order of magnitude compression and temporally consistent representation that are important steps towards practical toolsets for light field video content production.

Lewis Bridgeman, Marco Volino, Jean-Yves Guillemaut, Adrian Hilton (2019) Multi-Person 3D Pose Estimation and Tracking in Sports, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2487-2496, IEEE

We present an approach to multi-person 3D pose estimation and tracking from multi-view video. Following independent 2D pose detection in each view, we: (1) correct errors in the output of the pose detector; (2) apply a fast greedy algorithm for associating 2D pose detections between camera views; and (3) use the associated poses to generate and track 3D skeletons. Previous methods for estimating skeletons of multiple people suffer long processing times or rely on appearance cues, reducing their applicability to sports. Our approach to associating poses between views works by seeking the best correspondences first in a greedy fashion, while reasoning about the cyclic nature of correspondences to constrain the search. The associated poses can be used to generate 3D skeletons, which we produce via robust triangulation. Our method can track 3D skeletons in the presence of missing detections, substantial occlusions, and large calibration error. We believe ours is the first method for full-body 3D pose estimation and tracking of multiple players in highly dynamic sports scenes. The proposed method achieves a significant improvement in speed over state-of-the-art methods.
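
The greedy association step (2) can be sketched for the simplest case of two camera views. This is a minimal illustration under stated assumptions, not the paper's algorithm: it omits the reasoning about cyclic correspondences across more than two views, and the cost matrix, the `max_cost` threshold and the function name `greedy_associate` are hypothetical.

```python
def greedy_associate(costs, max_cost=1.0):
    """Greedily match detections in view A to detections in view B.

    costs[i][j] is a hypothetical dissimilarity between detection i in
    view A and detection j in view B (for example a mean epipolar distance).
    Pairs are accepted in order of increasing cost, each detection is used
    at most once, and pairs costing more than max_cost are rejected.
    """
    candidates = sorted(
        (cost, i, j)
        for i, row in enumerate(costs)
        for j, cost in enumerate(row)
        if cost <= max_cost
    )
    used_a, used_b, matches = set(), set(), []
    for cost, i, j in candidates:
        if i not in used_a and j not in used_b:
            matches.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return matches

# Three detections per view; the values are made up for illustration.
costs = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.6],
    [0.9, 0.8, 1.5],
]
matches = greedy_associate(costs)
```

Taking the best correspondences first avoids solving a global assignment problem, which is where the speed advantage of a greedy scheme comes from. The price is that a detection can be left unmatched, as happens to detection 2 of view A in this example.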

Armin Mustafa, Marco Volino, Hansung Kim, Jean-Yves Guillemaut, Adrian Hilton (2020) Temporally coherent general dynamic scene reconstruction, In: International Journal of Computer Vision, Springer

Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing a shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction, and its use in applications such as free-view rendering and virtual reality.

Craig Cieciura, Marco Volino, Philip J B Jackson (2023) SurrRoom 1.0 Dataset: Spatial Room Capture with Controlled Acoustic and Optical Measurements, https://cvssp.org/data/SurrRoom1_0/

Room acoustics, and perception thereof, is an important consideration in research, engineering, architecture, creative expression, and many other areas of human activity, particularly indoors. Typical room datasets either contain disparate measurements of diverse spaces, e.g., the OpenAir dataset (Murphy and Shelley, 2010), or rich sets of measurements within a few rooms, e.g., CD4M (Stewart and Sandler, 2010). Development of techniques, such as the RSAO, with 6 degrees of freedom (6DOF) movement in media applications requires testing distance-related effects. Hence there is a need for consistency across room measurements in terms of source-room-receiver configuration, such as in (Lokki et al., 2011) but particularly for typical rooms. Those available containing ARIRs, BRIRs and with a consistent measurement procedure tend to be limited in range of rooms measured (Bacila and Lee, 2019). We designed an RIR dataset covering a range of rooms, with typical reverberation times, 1O-ARIRs and BRIRs, and regular measurement procedure including source-receiver distances from 1 m to 3 m in 0.5 m intervals. We measured seven rooms including one room with variable acoustics in two configurations – eight total sets. The rooms within this dataset have mid-range RT60s from 0.24 s to 1.00 s and volumes from 50 m³ to 1600 m³. The accompanying paper includes capture methods, various descriptive metrics and a description of applications across diverse domains. The dataset consists of .WAV files for IRs, .PLY files for LiDAR scans and image files for accompanying 3D pictures. IRs are also available in the SOFA format. The dataset is licensed under CC BY-NC 4.0 and can be accessed from https://cvssp.org/data/SurrRoom1_0/.

F. Schweiger, G. Thomas, A. Sheikh, W. Paier, M. Kettern, P. Eisert, J.-S Franco, M. Volino, P. Huang, J. Collomosse, A. Hilton, V. Jantet, P. Smyth (2015) RE@CT: A new production pipeline for interactive 3D content, In: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1-4, IEEE

The RE@CT project set out to revolutionise the production of realistic 3D characters for game-like applications and interactive video productions, and significantly reduce costs by developing an automated process to extract and represent animated characters from actor performance captured in a multi-camera studio. The key innovation is the development of methods for analysis and representation of 3D video to allow reuse for real-time interactive animation. This enables efficient authoring of interactive characters with video quality appearance and motion.

Joao Regateiro, Adrian Hilton, Marco Volino (2019) Dynamic Surface Animation using Generative Networks, In: 2019 International Conference on 3D Vision (3DV 2019), pp. 376-385, IEEE

This paper presents techniques to animate realistic human-like motion using a compressed learnt model from 4D volumetric performance capture data. Sequences of 4D dynamic geometry representing a human performing an arbitrary motion are encoded through a generative network into a compact space representation, whilst maintaining the original properties, such as surface dynamics. An animation framework is proposed which computes an optimal motion graph using the novel capabilities of compression and generative synthesis properties of the network. This approach significantly reduces the memory space requirements, improves quality of animation, and facilitates the interpolation between motions. The framework optimises the number of transitions in the graph with respect to the shape and motion of the dynamic content. This generates a compact graph structure with low edge connectivity, and maintains realism when transitioning between motions. Finally, it demonstrates that generative networks facilitate the computation of novel poses, and provides a compact motion graph representation of captured dynamic shape enabling real-time interactive animation and interpolation of novel poses to smoothly transition between motions.

Charles Malleson, Marco Volino, Andrew Gilbert, Matthew Trumble, John Collomosse, Adrian Hilton (2017) Real-time Full-Body Motion Capture from Video and IMUs, In: Proceedings 2017 International Conference on 3D Vision (3DV), pp. 449-457, IEEE

A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.

Marco Volino, Armin Mustafa, Jean-Yves Guillemaut, Adrian Hilton (2019) Light Field Compression using Eigen Textures, In: 2019 International Conference on 3D Vision (3DV 2019), pp. 482-490, IEEE

Light fields are becoming an increasingly popular method of digital content production for visual effects and virtual/augmented reality as they capture a view dependent representation enabling photo realistic rendering over a range of viewpoints. Light field video is generally captured using arrays of cameras resulting in tens to hundreds of images of a scene at each time instance. An open problem is how to efficiently represent the data preserving the view-dependent detail of the surface in such a way that is compact to store and efficient to render. In this paper we show that constructing an Eigen texture basis representation from the light field using an approximate 3D surface reconstruction as a geometric proxy provides a compact representation that maintains view-dependent realism. We demonstrate that the proposed method is able to reduce storage requirements by > 95% while maintaining the visual quality of the captured data. An efficient view-dependent rendering technique is also proposed which is performed in eigen space allowing smooth continuous viewpoint interpolation through the light field.

Dan Casas, Marco Volino, John Collomosse, Adrian Hilton (2014) 4D video textures for interactive character appearance, In: Computer Graphics Forum 33(2), pp. 371-380, Wiley

4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.

Davide Berghi, Hanne Stenzel, Marco Volino, Adrian Hilton, Philip Jackson (2020) Audio-Visual Spatial Alignment Requirements of Central and Peripheral Object Events, In: IEEE VR 2020, IEEE

Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.

As audio-visual systems increasingly bring immersive and interactive capabilities into our work and leisure activities, so the need for naturalistic test material grows. New volumetric datasets have captured high-quality 3D video, but accompanying audio is often neglected, making it hard to test an integrated bimodal experience. Designed to cover diverse sound types and features, the presented volumetric dataset was constructed from audio and video studio recordings of scenes to yield forty short action sequences. Potential uses in technical and scientific tests are discussed.

Davide Berghi, Hanne Stenzel, Marco Volino, Philip J. B Jackson, Adrian Douglas Mark Hilton (2020) Audio-Visual Spatial Alignment Requirements of Central and Peripheral Object Events, In: IEEE VR 2020

Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.

Davide Berghi, Marco Volino, Philip J B Jackson (2022) Dataset Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research, In: Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research, University of Surrey

3D audio-visual production aims to deliver immersive and interactive experiences to the consumer. Yet, faithfully reproducing real-world 3D scenes remains a challenging task. This is partly due to the lack of available datasets enabling audio-visual research in this direction. In most of the existing multi-view datasets, the accompanying audio is neglected. Similarly, datasets for spatial audio research primarily offer unimodal content, and when visual data is included, the quality is far from meeting the standard production needs. We present “Tragic Talkers”, an audio-visual dataset consisting of excerpts from the “Romeo and Juliet” drama captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-people conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays. Additionally, we provide voice activity labels, 2D face bounding boxes for each camera view, 2D pose detection keypoints, 3D tracking data of the mouth of the actors, and dialogue transcriptions. We believe the community will benefit from this dataset as it can assist multidisciplinary research. Possible uses of the dataset are discussed. The scenes were captured at the Centre for Vision, Speech & Signal Processing (CVSSP) of the University of Surrey (UK) with the aid of two twin Audio-Visual Array (AVA) Rigs. Each AVA Rig is a custom device consisting of a 16-element microphone array and 11 cameras fixed on a flat perspex sheet. For more information, please refer to the paper (see below) or contact the authors.

Davide Berghi, Marco Volino, Philip J B Jackson (2022) Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research, In: Dataset Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research

3D audiovisual production aims to deliver immersive and interactive experiences to the consumer. Yet, faithfully reproducing real-world 3D scenes remains a challenging task. This is partly due to the lack of available datasets enabling audiovisual research in this direction. In most of the existing multi-view datasets, the accompanying audio is neglected. Similarly, datasets for spatial audio research primarily offer unimodal content, and when visual data is included, the quality is far from meeting the standard production needs. We present "Tragic Talkers", an audiovisual dataset consisting of excerpts from the "Romeo and Juliet" drama captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-people conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays. Additionally, we provide voice activity labels, 2D face bounding boxes for each camera view, 2D pose detection keypoints, 3D tracking data of the mouth of the actors, and dialogue transcriptions. We believe the community will benefit from this dataset as it can assist multidisciplinary research. Possible uses of the dataset are discussed. This is the author's version of the work. This paper is published under a Creative Commons Attribution (CC-BY) license. The definitive version was published in CVMP '22, https://doi.org/10.1145/3565516.3565522.

Joao Regateiro, Marco Volino, Adrian Hilton (2018) Hybrid Skeleton Driven Surface Registration for Temporally Consistent Volumetric Video, In: Proceedings of 2018 International Conference on 3D Vision (3DV), pp. 514-522, Institute of Electrical and Electronics Engineers (IEEE)

This paper presents a hybrid skeleton-driven surface registration (HSDSR) approach to generate temporally consistent meshes from multiple view video of human subjects. 2D pose detections from multiple view video are used to estimate 3D skeletal pose on a per-frame basis. The 3D pose is embedded into a 3D surface reconstruction allowing any frame to be reposed into the shape from any other frame in the captured sequence. Skeletal motion transfer is performed by selecting a reference frame from the surface reconstruction data and reposing it to match the pose estimation of other frames in a sequence. This allows an initial coarse alignment to be performed prior to refinement by a patch-based non-rigid mesh deformation. The proposed approach overcomes limitations of previous work by reposing a reference mesh to match the pose of a target mesh reconstruction, providing a closer starting point for further non-rigid mesh deformation. It is shown that the proposed approach is able to achieve comparable results to existing model-based and model-free approaches. Finally, it is demonstrated that this framework provides an intuitive way for artists and animators to edit volumetric video.

D Casas, Marco Volino, JP Collomosse, A Hilton (2014) 4D Video Textures for Interactive Character Appearance, In: Computer Graphics Forum: the international journal of the Eurographics Association

4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.

Mohd Azri Mohd Izhar, Marco Volino, Adrian Hilton, Philip Jackson (2020) Tracking Sound Sources for Object-based Spatial Audio in 3D Audio-visual Production, In: Proceedings of the FA2020 Conference, pp. 2051-2058

In immersive and interactive audio-visual content, there is very significant scope for spatial misalignment between the two main modalities. So, in productions that have both 3D video and spatial audio, the positioning of sound sources relative to the visual display requires careful attention. This may be achieved in the form of object-based audio, moreover allowing the producer to maintain control over individual elements within the mix. Yet each object's metadata is needed to define its position over time. In the present study, audio-visual studio recordings were made of short scenes representing three genres: drama, sport and music. Foreground video was captured by a light-field camera array, which incorporated a microphone array, alongside more conventional sound recording by spot microphones and a first-order ambisonic room microphone. In the music scenes, a direct feed from the guitar pickup was also recorded. Video data was analysed to form a 3D reconstruction of the scenes, and human figure detection was applied to the 2D frames of the central camera. Visual estimates of the sound source positions were used to provide ground truth. Position metadata were encoded within audio definition model (ADM) format audio files, suitable for standard object-based rendering. The steered response power of the acoustical signals at the microphone array was used, with the phase transform (SRP-PHAT), to determine the dominant source position(s) at any time, and given as input to a Sequential Monte Carlo Probability Hypothesis Density (SMC-PHD) tracker. The output of this was evaluated in relation to the ground truth. Results indicate a hierarchy of accuracy in azimuth, elevation and range, in accordance with human spatial auditory perception. Azimuth errors were within the tolerance bounds reported by studies of the Ventriloquism Effect, giving an initial promising indication that such an approach may open the door to object-based production for live events.

Charles Malleson, Marco Volino, Andrew Gilbert, Matthew Trumble, John Collomosse, Adrian Hilton (2017) Real-time Full-Body Motion Capture from Video and IMUs, In: 3DV 2017 Proceedings, CPS

A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.

J Imber, M Volino, J-Y Guillemaut, S Fenney, A Hilton (2013) Free-viewpoint video rendering for mobile devices, In: P Eisert, A Gagalowicz (eds.), MIRAGE, pp. 11:1-11:1

Marco Volino, D Casas, JP Collomosse, A Hilton (2014) Optimal Representation of Multi-View Video, In: Proceedings of BMVC 2014 - British Machine Vision Conference, BMVC

Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling results in a compact representation with minimal loss of information allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.

Marco Volino, Peng Huang, Adrian Hilton (2018) Online interactive 4D character animation, In: Proceedings of the 20th International Conference on 3D Web Technology - Web3D '15, pp. 289-295

This paper presents a framework for creating realistic virtual characters that can be delivered via the Internet and interactively controlled in a WebGL-enabled web browser. Four-dimensional performance capture is used to capture realistic human motion and appearance. The captured data is processed into efficient and compact representations for geometry and texture. Motions are analysed against a high-level, user-defined motion graph and suitable inter- and intra-motion transitions are identified. This processed data is stored on a web server and downloaded by a client application when required. A JavaScript-based character animation engine is used to manage the state of the character, which responds to user input and sends required frames to a WebGL-based renderer for display. Through the efficient geometry, texture and motion graph representations, a game character capable of performing a range of motions can be represented in 40-50 MB of data. This highlights the potential use of four-dimensional performance capture for creating web-based content. Datasets are made available for further research and an online demo is provided.
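
The role of a user-defined motion graph in managing character state can be sketched as follows. This is a minimal illustration only: the paper's engine is JavaScript-based, whereas this sketch uses Python, and the motion names, the graph `MOTION_GRAPH` and the function `transition_path` are hypothetical. It assumes that a requested motion is reached through the shortest chain of permitted transitions.

```python
from collections import deque

# Hypothetical motion graph: keys are motions, values are allowed transitions.
MOTION_GRAPH = {
    "idle": ["walk", "jump"],
    "walk": ["idle", "run"],
    "run": ["walk", "jump"],
    "jump": ["idle"],
}

def transition_path(graph, current, target):
    """Return the shortest list of motions from current to target.

    Uses breadth-first search over the allowed transitions and returns
    None when the target motion cannot be reached.
    """
    queue = deque([[current]])
    visited = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# A user asks an idle character to run; the engine must pass through "walk".
path = transition_path(MOTION_GRAPH, "idle", "run")
```

An engine built this way would play each motion in the returned path in turn, requesting only the frames for those motions from the server.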

Armin Mustafa, Marco Volino, Jean-Yves Guillemaut, Adrian Hilton (2018) 4D Temporally Coherent Light-field Video, In: 3DV 2017 Proceedings, IEEE

Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multiview dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.

Andrew Gilbert, Marco Volino, John Collomosse, Adrian Hilton (2018) Volumetric performance capture from minimal camera viewpoints, In: V Ferrari, M Hebert, C Sminchisescu, Y Weiss (eds.), Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science 11215, pp. 591-607, Springer Science+Business Media

We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.