A real-time full-body motion capture system is presented
which uses input from a sparse set of inertial measurement
units (IMUs) along with images from two or more standard
video cameras and requires no optical markers or specialized
infra-red cameras. A real-time optimization-based
framework is proposed which incorporates constraints from
the IMUs, cameras and a prior pose model. The combination
of video and IMU data allows the full 6-DOF motion to
be recovered including axial rotation of limbs and drift-free
global position. The approach was tested using both indoor
and outdoor captured data. The results demonstrate the effectiveness
of the approach for tracking a wide range of human
motion in real time in unconstrained indoor/outdoor environments.
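To make the fusion concrete, the sketch below shows one way IMU, camera and prior constraints could be combined in a single least-squares objective for one limb segment. It is an illustrative toy under invented cameras, observations and weights, not the paper's real-time solver.

```python
# Minimal sketch (not the paper's implementation) of fusing IMU and
# camera constraints in one least-squares objective for one limb.
# Cameras, bone, observations and term weights are invented stand-ins.
import numpy as np
from scipy.optimize import least_squares

# Two hypothetical 3x4 pinhole projection matrices.
P1 = np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])
P2 = np.hstack([np.eye(3), np.array([[1.0], [0.0], [5.0]])])

shoulder = np.array([0.0, 0.0, 0.0])      # fixed parent joint
imu_dir = np.array([0.0, -1.0, 0.0])      # bone direction from IMU orientation
bone_len = 0.3                            # known limb length (metres)
prior = shoulder + bone_len * imu_dir     # prior pose model (rest estimate)

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Synthetic 2D observations of the elbow in both views.
gt = shoulder + bone_len * np.array([0.1, -0.99, 0.05])
obs = [project(P1, gt), project(P2, gt)]

def residuals(X):
    r = []
    for P, o in zip((P1, P2), obs):       # camera reprojection terms
        r.extend(project(P, X) - o)
    d = (X - shoulder) / bone_len
    r.extend(2.0 * (d - imu_dir))         # IMU bone-direction term
    r.extend(0.1 * (X - prior))           # weak pose-prior term
    return np.asarray(r)

sol = least_squares(residuals, prior)     # a real-time system would use
print("estimated elbow:", sol.x)          # an incremental solver instead
```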
We present a convolutional autoencoder that enables high
fidelity volumetric reconstructions of human performance to be captured
from multi-view video comprising only a small set of camera views. Our
method yields similar end-to-end reconstruction error to that of a
probabilistic visual hull computed using significantly more (double or more)
viewpoints. We use a deep prior implicitly learned by the autoencoder
trained over a dataset of view-ablated multi-view video footage of a wide
range of subjects and actions. This opens up the possibility of high-end
volumetric performance capture in on-set and prosumer scenarios where
time or cost prohibit a high witness camera count.
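As an illustration of the deep-prior idea, the following is a minimal 3D convolutional autoencoder in the spirit described above; the layer sizes, 32^3 occupancy volumes and training loss are assumptions, not the paper's architecture or data.

```python
# Illustrative sketch: a coarse few-view probabilistic visual hull goes
# in, a refined occupancy volume comes out. Sizes are arbitrary.
import torch
import torch.nn as nn

class VolumeAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),    # 32 -> 16
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(), # 16 -> 32
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training would pair view-ablated hulls with many-view reference hulls,
# so the network learns the missing-view prior implicitly.
model = VolumeAutoencoder()
coarse = torch.rand(1, 1, 32, 32, 32)       # stand-in for a sparse-view hull
refined = model(coarse)
loss = nn.functional.binary_cross_entropy(refined, (coarse > 0.5).float())
print(refined.shape, float(loss))
```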
Light-field video has recently been used in virtual and
augmented reality applications to increase realism and immersion.
However, existing light-field methods are generally
limited to static scenes due to the requirement to acquire
a dense scene representation. The large amount of
data and the absence of methods to infer temporal coherence
pose major challenges in storage, compression and
editing compared to conventional video. In this paper, we
propose the first method to extract a spatio-temporally coherent
light-field video representation. A novel method to
obtain Epipolar Plane Images (EPIs) from a sparse light-field
camera array is proposed. EPIs are used to constrain
scene flow estimation to obtain 4D temporally coherent representations
of dynamic light-fields. Temporal coherence is
achieved on a variety of light-field datasets. Evaluation of
the proposed light-field scene flow against existing multi-view
dense correspondence approaches demonstrates a significant
improvement in accuracy of temporal coherence.
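For readers unfamiliar with EPIs, the snippet below shows the basic construction from a horizontal camera array: stacking one scanline across views, so scene points trace lines whose slope encodes disparity. With a sparse array, as in the paper, usable EPIs additionally require view interpolation, which this toy omits.

```python
# Toy EPI extraction from a stack of rectified views (random stand-ins).
import numpy as np

n_views, H, W = 8, 240, 320
lightfield = np.random.rand(n_views, H, W)   # rectified horizontal array

def epi(lf, row):
    """EPI for one scanline: shape (n_views, W). Line slopes encode
    disparity, which is what constrains the scene flow estimation."""
    return lf[:, row, :]

e = epi(lightfield, row=120)
print(e.shape)   # (8, 320)
```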
4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.
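A hedged sketch of the parametric motion space idea: two time-aligned captured sequences are combined by weighted blending of tracked vertex positions under interactive control. The arrays are stand-ins, and the paper's view-dependent texture blending and temporal alignment are omitted.

```python
# Illustrative parametric blend of two time-aligned vertex-track
# sequences (e.g. walk and run); data is random placeholder.
import numpy as np

frames, n_verts = 30, 1000
walk = np.random.rand(frames, n_verts, 3)   # vertex tracks, motion A
run = np.random.rand(frames, n_verts, 3)    # vertex tracks, motion B

def blend(w):
    """Interactive control: w in [0, 1] moves between the captures."""
    return (1.0 - w) * walk + w * run

jog = blend(0.5)   # an intermediate motion synthesised from the pair
print(jog.shape)
```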
Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling results in a compact representation with minimal loss of information allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.
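The sketch below illustrates only the alignment step: 2D optical flow between two views' contributions warps one onto the other before blending, which is what suppresses ghosting. The paper computes this flow on the surface; plain Farneback flow over hypothetical texture-space images stands in here.

```python
# Illustrative flow-based alignment of one view onto another before
# blending; images are synthetic stand-ins for texture-space renderings.
import cv2
import numpy as np

ref = np.random.randint(0, 255, (256, 256), np.uint8)  # view A, texture space
src = np.roll(ref, 2, axis=1)                          # view B, misaligned

flow = cv2.calcOpticalFlowFarneback(ref, src, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
h, w = ref.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x + flow[..., 0]).astype(np.float32)
map_y = (grid_y + flow[..., 1]).astype(np.float32)
aligned = cv2.remap(src, map_x, map_y, cv2.INTER_LINEAR)   # warp B onto A
blended = ((ref.astype(np.float32) + aligned) / 2).astype(np.uint8)
```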
This paper presents a framework for creating realistic virtual characters
that can be delivered via the Internet and interactively controlled
in a WebGL-enabled web browser. Four-dimensional performance
capture is used to acquire realistic human motion and appearance.
The captured data is processed into efficient and compact
representations for geometry and texture. Motions are analysed
against a high-level, user-defined motion graph and suitable
inter- and intra-motion transitions are identified. This processed
data is stored on a webserver and downloaded by a client application,
which manages the state of the character in response to user
input and sends the required frames to a WebGL-based renderer for
display. Through the efficient geometry, texture and motion graph
representations, a game character capable of performing a range of
motions can be represented in 40-50 MB of data. This highlights
the potential use of four-dimensional performance capture for creating
web-based content. Datasets are made available for further
research and an online demo is provided.
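To illustrate the client-side logic, here is a minimal motion-graph state machine, written in Python for brevity (the actual client runs in a WebGL browser environment); the nodes, transitions and inputs are invented for the example.

```python
# Toy motion-graph state machine: advance the current clip each frame
# and switch motion when user input matches a valid transition.
class MotionGraph:
    def __init__(self, transitions):
        # transitions: {motion: {user_input: (next_motion, start_frame)}}
        self.transitions = transitions
        self.motion, self.frame = "idle", 0

    def update(self, user_input, clip_length):
        rules = self.transitions.get(self.motion, {})
        if user_input in rules:
            self.motion, self.frame = rules[user_input]
        else:
            self.frame = (self.frame + 1) % clip_length
        return self.motion, self.frame   # frame request sent to the renderer

graph = MotionGraph({"idle": {"walk": ("walk", 0)},
                     "walk": {"stop": ("idle", 0)}})
print(graph.update("walk", clip_length=60))
```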
This paper presents a hybrid skeleton-driven surface registration
(HSDSR) approach to generate temporally consistent
meshes from multiple view video of human subjects.
2D pose detections from multiple view video are used to
estimate 3D skeletal pose on a per-frame basis. The 3D
pose is embedded into a 3D surface reconstruction allowing
any frame to be reposed into the shape from any other
frame in the captured sequence. Skeletal motion transfer
is performed by selecting a reference frame from the surface
reconstruction data and reposing it to match the pose
estimation of other frames in a sequence. This allows an
initial coarse alignment to be performed prior to refinement
by a patch-based non-rigid mesh deformation. The
proposed approach overcomes limitations of previous work
by reposing a reference mesh to match the pose of a target
mesh reconstruction, providing a closer starting point
for further non-rigid mesh deformation. It is shown that the
proposed approach is able to achieve comparable results to
existing model-based and model-free approaches. Finally,
it is demonstrated that this framework provides an intuitive
way for artists and animators to edit volumetric video.
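As a sketch of the reposing step, linear blend skinning below moves a reference mesh under per-bone transforms, giving the coarse alignment that the patch-based non-rigid refinement then improves. The skinning weights and transforms are toy data; the paper's skeletal embedding is not reproduced.

```python
# Toy linear blend skinning: each vertex follows a weighted mix of
# per-bone rigid transforms (stand-ins for the reference-to-target pose).
import numpy as np

verts = np.random.rand(500, 3)                  # reference-frame mesh
weights = np.random.dirichlet(np.ones(2), 500)  # skinning weights, 2 bones

def bone_transform(angle, pivot):
    """4x4 rotation about z through a pivot point."""
    c, s = np.cos(angle), np.sin(angle)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:3, 3] = pivot - T[:3, :3] @ pivot
    return T

bones = [bone_transform(0.0, np.zeros(3)),
         bone_transform(0.4, np.array([0.5, 0.5, 0.0]))]

vh = np.hstack([verts, np.ones((len(verts), 1))])   # homogeneous coords
skinned = np.einsum('vb,bij,vj->vi', weights, np.array(bones), vh)[:, :3]
print(skinned.shape)   # coarse repose, input to the non-rigid refinement
```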
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial
resolution. This study investigates the perceived coherence of audio-visual
object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate
a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.
4D human performance capture aims to create volumetric representations of observed human subjects performing arbitrary motions with the ability to replay and render dynamic scenes with the realism of the recorded video. This representation has the potential to enable highly realistic content production for immersive virtual and augmented reality experiences. Human models are typically rendered using detailed, explicit 3D models, which consist of meshes and textures, and animated using tailored motion models to simulate human behaviour and activity. However, designing a realistic 3D human model is still a costly and laborious process. Hence, this work investigates techniques to learn models of human body shape and appearance, aiming to facilitate the generation of highly realistic human animation, and demonstrate its potential contributions, applications, and versatility.
The first contribution of this work is a skeleton-driven surface registration approach to generate temporally consistent meshes from multi-view video of human subjects. 2D pose detections from multi-view video are used to estimate 3D skeletal pose on a per-frame basis, allowing a reference frame to be reposed to match the pose of the other frames in a sequence. This gives an initial coarse alignment, followed by a patch-based non-rigid mesh deformation, to generate temporally consistent mesh sequences.
The second contribution presents techniques to represent human-like shape using a compressed learnt model from 4D volumetric performance capture data. Sequences of 4D dynamic geometry representing a human are encoded with a generative network into a compact space representation, whilst maintaining the original properties, such as surface non-rigid deformations. This compact representation enables synthesis, interpolation and generation of 3D shapes.
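A minimal sketch of what such a compact learnt space enables: encode two captured frames, interpolate the latent codes, and decode in-between shapes. The encoder and decoder here are untrained stand-ins with arbitrary sizes, not the thesis's network.

```python
# Toy latent-space interpolation between two encoded shape frames.
import torch
import torch.nn as nn

n_verts, latent = 1000, 64
encode = nn.Sequential(nn.Linear(n_verts * 3, 256), nn.ReLU(),
                       nn.Linear(256, latent))
decode = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                       nn.Linear(256, n_verts * 3))

frame_a = torch.rand(1, n_verts * 3)    # flattened vertex positions
frame_b = torch.rand(1, n_verts * 3)
za, zb = encode(frame_a), encode(frame_b)

for t in (0.0, 0.5, 1.0):               # interpolate in the latent space
    shape = decode((1 - t) * za + t * zb).view(n_verts, 3)
print(shape.shape)
```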
The third contribution is Deep4D, a generative network capable of compactly representing 4D volumetric video sequences from the skeletal motion of people, with two orders of magnitude compression. A variational encoder-decoder is employed to learn a latent space that maps from 3D skeletal pose to 4D shape and appearance. This enables high-quality 4D volumetric video synthesis to be driven by skeletal animation.
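A hedged sketch of the pose-to-shape mapping: a variational-style network maps skeletal joint parameters to a latent code via the reparameterisation trick and decodes vertex positions (appearance synthesis omitted). All sizes and names are assumptions, not the thesis's design.

```python
# Toy pose-conditioned variational decoder: skeletal pose in,
# per-vertex positions out. Dimensions are illustrative.
import torch
import torch.nn as nn

n_joints, n_verts, latent = 20, 1000, 32

class PoseToShape(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_stats = nn.Linear(n_joints * 3, 2 * latent)   # mu, logvar
        self.decode = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                    nn.Linear(256, n_verts * 3))

    def forward(self, pose):
        mu, logvar = self.to_stats(pose).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam.
        return self.decode(z).view(-1, n_verts, 3)

model = PoseToShape()
pose = torch.rand(1, n_joints * 3)      # skeletal pose drives the shape
print(model(pose).shape)                # (1, 1000, 3)
```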
Finally, this thesis introduces the Deep4D motion graph, which implicitly combines multiple captured motions in a unified representation for character animation from volumetric video, allowing novel character movements to be generated with dynamic shape and appearance detail. Deep4D motion graphs allow character animation to be driven by skeletal motion sequences, providing a compact encoded representation capable of high-quality synthesis of 4D volumetric video with two orders of magnitude compression.
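The toy below shows the flavour of combining motions in such a representation: each captured motion becomes a per-frame latent-code sequence, and a transition crossfades codes before decoding rather than cutting between clips. The codes are random placeholders and the decode step is assumed.

```python
# Toy latent-level transition between two motion sequences.
import numpy as np

walk = np.random.rand(60, 32)    # latent codes per frame, motion A
wave = np.random.rand(40, 32)    # latent codes per frame, motion B

def transition(a, b, overlap=10):
    """Concatenate two latent sequences with a linear crossfade."""
    w = np.linspace(0, 1, overlap)[:, None]
    mix = (1 - w) * a[-overlap:] + w * b[:overlap]
    return np.concatenate([a[:-overlap], mix, b[overlap:]])

codes = transition(walk, wave)   # decode(codes) would yield the 4D video
print(codes.shape)               # (90, 32)
```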
Existing techniques for dynamic scene reconstruction from multiple
wide-baseline cameras primarily focus on reconstruction in controlled
environments, with fixed calibrated cameras and strong prior
constraints. This paper introduces a general approach to obtain a 4D
representation of complex dynamic scenes from multi-view wide-baseline
static or moving cameras without prior knowledge of the scene
structure, appearance, or illumination. Contributions of the work are:
an automatic method for initial coarse reconstruction to initialize
joint estimation; sparse-to-dense temporal correspondence integrated
with joint multi-view segmentation and reconstruction to introduce
temporal coherence; and a general robust approach for joint
segmentation refinement and dense reconstruction of dynamic scenes by
introducing a shape constraint. Comparison with state-of-the-art
approaches on a variety of complex indoor and outdoor scenes
demonstrates improved accuracy in both multi-view segmentation and
dense reconstruction. This paper demonstrates unsupervised
reconstruction of complete temporally coherent 4D scene models with
improved non-rigid object segmentation and shape reconstruction, and
its application to free-viewpoint rendering.
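As an illustration of the sparse-to-dense correspondence idea alone, the sketch below tracks sparse features between frames and densifies them by interpolation to seed a dense flow field; the paper's coupling with joint segmentation and reconstruction is not shown, and the images are synthetic.

```python
# Toy sparse-to-dense temporal correspondence: sparse LK tracks are
# interpolated into a dense flow field to initialise dense matching.
import cv2
import numpy as np
from scipy.interpolate import griddata

prev = np.random.randint(0, 255, (240, 320), np.uint8)
curr = np.roll(prev, 3, axis=1)                     # frame t+1, shifted

pts = cv2.goodFeaturesToTrack(prev, 200, 0.01, 8)   # sparse features
nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None)
ok = status.ravel() == 1
p0, p1 = pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)

h, w = prev.shape
gy, gx = np.mgrid[0:h, 0:w]
dx = griddata(p0, (p1 - p0)[:, 0], (gx, gy), method='linear', fill_value=0)
dy = griddata(p0, (p1 - p0)[:, 1], (gx, gy), method='linear', fill_value=0)
dense = np.dstack([dx, dy])
print(dense.shape)   # (240, 320, 2): dense flow seeded from sparse tracks
```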