4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.
Tena JR, Hamouz M, Hilton A, Illingworth J (2006) A Validation Method for Dense Non-rigid 3D Face Registration, IEEE Conf. on Advanced Video and Signal-based Surveillance
Guillemaut J-Y, Kilner J, Starck J, Hilton A (2007) Dynamic Feathering: Minimising Blending Artefacts in View Dependent Rendering, IET European Conference on Visual Media Production pp. 1-8
Ahmed A, Hilton A, Mokhtarian F (2003) Cyclification of Animation for Human Motion Synthesis, Eurographics Short Paper
Budd C, Huang P, Hilton A (2011) Hierarchical shape matching for temporally consistent 3D video, Proceedings of International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission pp. 172-179
In this paper we present a novel approach for temporal alignment of reconstructed mesh sequences with non-rigid surfaces to obtain a consistent representation. We propose a hierarchical scheme for non-sequential matching of frames across the sequence using shape similarity. This gives a tree structure which represents the optimal path for alignment of each frame in the sequence to minimize the change in shape. Non-rigid alignment is performed by recursively traversing the tree to align all frames. Non-sequential alignment reduces problems of drift or tracking failure which occur in previous sequential frame-to-frame techniques. Comparative evaluation on challenging 3D video sequences demonstrates that the proposed approach produces a temporally coherent representation with reduced error in shape and correspondence.
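As a hedged illustration of the non-sequential matching scheme, the sketch below builds a shape-dissimilarity graph over all frames, extracts its minimum spanning tree with SciPy and returns a breadth-first alignment order, so each frame is registered to the most similar already-aligned frame. The `shape_descriptor` function is a hypothetical stand-in for a real 3D shape descriptor, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def shape_descriptor(mesh):
    # Hypothetical stand-in: histogram of vertex radii about the centroid.
    v = mesh - mesh.mean(axis=0)
    hist, _ = np.histogram(np.linalg.norm(v, axis=1), bins=32, density=True)
    return hist

def alignment_order(meshes, root=0):
    d = np.array([shape_descriptor(m) for m in meshes])
    dissim = np.linalg.norm(d[:, None, :] - d[None, :, :], axis=2) + 1e-9
    np.fill_diagonal(dissim, 0)                  # zero = "no edge" for scipy
    tree = minimum_spanning_tree(dissim)         # minimises total shape change
    order, parents = breadth_first_order(tree, root, directed=False)
    # Each frame is aligned to its parent; the root is the reference frame.
    return [(f, parents[f]) for f in order if parents[f] >= 0]

frames = [np.random.rand(500, 3) for _ in range(10)]   # toy mesh sequence
for frame, parent in alignment_order(frames):
    pass  # non-rigid registration of `frame` onto `parent` would go here
```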
Hilton A, Roberts JB, Hadded O (1991) Autocorrelation Based Analysis of Ensemble Averaged LDA Engine Data for Bias-Free Turbulence Estimates: A Unified Approach, Journal of the Society of Automotive Engineering SAE910479 pp. 1-21
Starck J, Hilton A (2003) View-dependent Rendering with Multiple View Stereo Optimisation, CVPR
A framework for construction of detailed animated models of an actor's shape and appearance from multiple view images is presented. Multiple views of an actor are captured in a studio with controlled illumination and background. An initial low-resolution approximation of the person's shape is reconstructed by deformation of a generic humanoid model to fit the visual hull using shape constrained optimisation to preserve the surface parameterisation for animation. Stereo reconstruction with multiple view constraints is then used to reconstruct the detailed surface shape. High-resolution shape detail from stereo is represented in a structured format for animation by displacement mapping from the low-resolution model surface. A novel integration algorithm using displacement maps is introduced to combine overlapping stereo surface measurements from multiple views into a single displacement map representation of the high-resolution surface detail. Results of 3-D actor modelling in a 14 camera studio demonstrate improved representation of detailed surface shape such as creases in clothing compared to previous model fitting approaches. Actor models can be animated and rendered from arbitrary views under different illumination to produce free-viewpoint video sequences. The proposed framework enables rapid transformation of captured multiple view images into a structured representation suitable for realistic animation.
This paper introduces a novel 4D shape descriptor to match temporal surface sequences. A quantitative evaluation based on the receiver-operator characteristic (ROC) curve is presented to compare the performance of conventional 3D shape descriptors with and without a time filter. Feature-based 3D shape descriptors including shape distribution (Osada et al., 2002), spin image (Johnson et al., 1999), shape histogram (Ankerst et al., 1999) and spherical harmonics (Kazhdan et al., 2003) are considered. Evaluation shows that filtered descriptors outperform unfiltered descriptors, and the best performing volume-sampling shape-histogram descriptor is extended to define a new 4D "shape-flow" descriptor. Shape-flow matching demonstrates improved performance in the context of matching time-varying sequences, which is motivated by the requirement to connect similar sequences for animation production. Both simulated and real 3D human surface motion sequences are used for evaluation.
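The ROC-based protocol described above can be sketched in a few lines: descriptor distances for ground-truth matching and non-matching pairs are swept over a threshold to trace the curve. The Gaussian toy data is purely illustrative.

```python
import numpy as np

def roc_curve(dist_match, dist_nonmatch, steps=200):
    ts = np.linspace(0, max(dist_match.max(), dist_nonmatch.max()), steps)
    tpr = np.array([(dist_match <= t).mean() for t in ts])     # true positives
    fpr = np.array([(dist_nonmatch <= t).mean() for t in ts])  # false positives
    return fpr, tpr

def auc(fpr, tpr):
    return np.trapz(tpr, fpr)          # area under the ROC curve

rng = np.random.default_rng(0)
dist_match = rng.normal(0.3, 0.10, 1000).clip(min=0)     # same-shape pairs
dist_nonmatch = rng.normal(0.7, 0.15, 1000).clip(min=0)  # different shapes
fpr, tpr = roc_curve(dist_match, dist_nonmatch)
print(f"AUC = {auc(fpr, tpr):.3f}")    # ~0.98 for this toy separation
```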
In this paper a new technique is introduced for automatically building recognisable, moving 3D models of individual people. A set of multiview colour images of a person is captured from the front, sides and back by one or more cameras. Model-based reconstruction of shape from silhouettes is used to transform a standard 3D generic humanoid model to approximate a person's shape and anatomical structure. Realistic appearance is achieved by colour texture mapping from the multiview images. The results show the reconstruction of a realistic 3D facsimile of the person suitable for animation in a virtual world. The system is inexpensive and is reliable for large variations in shape, size and clothing. This is the first approach to achieve realistic model capture for clothed people and automatic reconstruction of animated models. A commercial system based on this approach has recently been used to capture thousands of models of the general public.
Hilton A, Goncalves J (1995) 3D Scene Representation Using a Deformable Surface, IEEE Workshop on Physics Based Modelling pp. 24-30 IEEE
In this paper a new technique is introduced for automatically building recognisable, moving 3D models of individual people. A set of multi-view colour images of a person is captured from the front, side and back using one or more cameras. Model-based reconstruction of shape from silhouettes is used to transform a standard 3D generic humanoid model to approximate the person's shape and anatomical structure. Realistic appearance is achieved by colour texture mapping from the multi-view images. Results demonstrate the reconstruction of a realistic 3D facsimile of the person suitable for animation in a virtual world. The system is low-cost and is reliable for large variations in shape, size and clothing. This is the first approach to achieve realistic model capture for clothed people and automatic reconstruction of animated models. A commercial system based on this approach has recently been used to capture thousands of models of the general public.
Motion capture (mocap) is widely used in a large number of industrial applications. Our work offers a new way of representing mocap facial dynamics in a high-resolution 3D morphable model expression space. A data-driven approach to modelling facial dynamics is presented. We propose a way to combine high-quality static face scans with dynamic 3D mocap data, which has lower spatial resolution, in order to study the dynamics of facial expressions.
In this paper we present a method to relight captured 3D video sequences of non-rigid, dynamic scenes, such as clothing of real actors, reconstructed from multiple view video. A view-dependent approach is introduced to refine an initial coarse surface reconstruction using shape-from-shading to estimate detailed surface normals. The prior surface approximation is used to constrain the simultaneous estimation of surface normals and scene illumination, under the assumption of Lambertian surface reflectance. This approach enables detailed surface normals of a moving non-rigid object to be estimated from a single image frame. Refined normal estimates from multiple views are integrated into a single surface normal map. This approach allows highly non-rigid surfaces, such as creases in clothing, to be relit whilst preserving the detailed dynamics observed in video.
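The core constraint here is Lambertian shading, I = albedo * (n . l). A minimal per-pixel sketch, assuming a known directional light and albedo, refines a coarse prior normal to satisfy the observed intensity while staying close to the prior; the paper itself estimates normals and illumination jointly over whole images, so this is an illustration only.

```python
import numpy as np

def refine_normal(I, albedo, light, n_prior, lam=0.1, iters=200, step=0.05):
    """Refine a unit normal to explain intensity I under Lambertian shading."""
    n = n_prior / np.linalg.norm(n_prior)
    for _ in range(iters):
        shading = albedo * max(n @ light, 0.0)
        # Gradient of (I - shading)^2 + lam * |n - n_prior|^2 w.r.t. n
        g = -2 * (I - shading) * albedo * light + 2 * lam * (n - n_prior)
        n = n - step * g
        n /= np.linalg.norm(n)          # project back onto the unit sphere
    return n

light = np.array([0.0, 0.0, 1.0])       # known directional light (assumed)
n_prior = np.array([0.3, 0.0, 0.95])
n_prior /= np.linalg.norm(n_prior)       # coarse normal from prior surface
print(refine_normal(I=0.8, albedo=1.0, light=light, n_prior=n_prior))
```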
Edge JD, Hilton A (2007) Facial Animation with Motion Capture based on Surface Blending, International Conference on Computer Graphics Theory and Applications
Action matching, where a recorded sequence is matched against, and synchronised with, a suitable proxy from a library of animations, is a technique for generating a synthetic representation of a recorded human activity. This proxy can then be used to represent the action in a virtual environment or as a prior on further processing of the sequence. In this paper we present a novel technique for performing action matching in outdoor sports environments. Outdoor sports broadcasts are typically multi-camera environments and as such reconstruction techniques can be applied to the footage to generate a 3D model of the scene. However due to poor calibration and matting this reconstruction is of a very low quality. Our technique matches the 3D reconstruction sequence against a predefined library of actions to select an appropriate high quality synthetic representation. A hierarchical Markov model combined with 3D summarisation of the data allows a large number of different actions to be matched successfully to the sequence in a rate-invariant manner without prior segmentation of the sequence into discrete units. The technique is applied to data captured at rugby and soccer games.
Kim H, Hilton A (2009) Environment Modelling using Spherical Stereo Imaging, IEEE Symposium on 3D Imaging (3DIM)
Collins G, Hilton A (2001) Models for Character Animation, Software Focus, Wiley 2(2) pp. 44-51
GORDON COLLINS and ADRIAN HILTON present a review of methods for the construction and deformation of character models. They consider both state-of-the-art research and common practice. In particular they review applications, data capture methods, manual model construction, polygonal, parametric and implicit surface representations, basic geometric deformations, free-form deformations, subdivision surfaces, displacement map schemes and physical deformation. Copyright © 2001 John Wiley & Sons, Ltd.
Moeslund T, Hilton A, Kruger V (2006) A Survey of Advances in Vision-Based Human Motion Capture and Analysis, Computer Vision and Image Understanding 104(2-3) pp. 90-127
Saminathan A, Stoddart AJ, Hilton A, Illingworth J (1997) Progress in arbitrary topology deformable surfaces, BMVC pp. 1-6 BMVA
Kittler J, Hilton A, Hamouz M, Illingworth J (2006) 3D Assisted Face Recognition: A Survey of 3D Imaging, Modelling and Recognition Approaches, CVPR pp. 114-122
Hilton A, Illingworth J, Li Y, Mitchelson J (2001) Real-Time Human Motion Estimation for Studio Production, BMVA Workshop on Understanding Human Behaviour
Collins G, Hilton A (2002) Mesh Decimation for Displacement Mapping, Eurographics - Short Paper
Collins G, Hilton A (2005) A Rigid Transform Basis for Animation Compression and Level of Detail, IMA Conference on Vision, Video and Graphics pp. 21-28 Eurographics Association
We present a scheme for achieving level of detail and compression for animation sequences with known constant connectivity. We suggest compression is useful to automatically create low levels of detail in animations which may be more compressed than the original animation parameters and for high levels of detail where the original animation is expensive to compute. Our scheme is based on spatial segmentation of a base mesh into rigidly transforming segments and then temporal aggregation of these transformations. The result will approximate the given animation within a user specified tolerance which can be adjusted to give the required level of detail. A spatio-temporal smoothing algorithm is used on decoding to give acceptable animations. We show that the rigid transformation basis will span the space of all animations. We also show that the algorithm will converge to the specified tolerance. The algorithm is applied to several examples of synthetic animation and rate distortion curves are given which show that in some cases, the scheme outperforms current compressors.
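The per-segment building block is a least-squares rigid fit. A minimal sketch using the classic Kabsch solution is shown below; the spatial segmentation and temporal aggregation described above would operate on top of it.

```python
import numpy as np

def fit_rigid(rest, frame):
    """Least-squares R, t such that frame ~ rest @ R.T + t (Kabsch)."""
    c0, c1 = rest.mean(axis=0), frame.mean(axis=0)
    H = (rest - c0).T @ (frame - c1)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c1 - R @ c0

rng = np.random.default_rng(1)
rest = rng.normal(size=(100, 3))              # segment vertices at rest
theta = 0.3                                   # known test rotation about z
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
frame = rest @ R_true.T + np.array([1.0, 2.0, 3.0])
R, t = fit_rigid(rest, frame)
print(np.allclose(R, R_true), np.allclose(t, [1.0, 2.0, 3.0]))   # True True
```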
This paper introduces a model-based approach to capturing a person's shape, appearance and movement. A 3D animated model of a clothed person's whole-body shape and appearance is automatically constructed from a set of orthogonal view colour images. The reconstructed model of a person is then used together with the least-squares inverse-kinematics framework of Bregler and Malik (1998) to capture simple 3D movements from a video image sequence
Kim H, Sarim M, Takai T, Guillemaut J-Y, Hilton A (2010) Dynamic 3D Scene Reconstruction in Outdoor Environments, In Proc. IEEE Symp. on 3D Data Processing and Visualization IEEE
A number of systems have been developed for dynamic 3D reconstruction from multiple view videos over the past decade. In this paper we present a system for multiple view reconstruction of dynamic outdoor scenes, transferring studio technology to uncontrolled environments. A synchronised portable multiple camera system is composed of off-the-shelf HD cameras for dynamic scene capture. For foreground extraction, we propose a multi-view trimap propagation method which is robust against dynamic changes in appearance between views and over time. This allows us to apply state-of-the-art natural image matting algorithms to multi-view sequences with minimal interaction. Optimal 3D surfaces of the foreground models are reconstructed by integrating multi-view shape cues and features. For background modelling, we use a line scan camera with a fish-eye lens to capture a full environment at high resolution. The environment model is reconstructed from a spherical stereo image pair with sub-pixel correspondence. Finally the foreground and background models are merged into a common 3D world coordinate system and the composite model is rendered from arbitrary viewpoints. We show that the proposed system generates high quality scene images with dynamic virtual camera actions.
Imre E, Hilton A (2014) Order Statistics of RANSAC and Their Practical Application, International Journal of Computer Vision
For statistical analysis purposes, RANSAC is usually treated as a Bernoulli process: each hypothesis is a Bernoulli trial with the outcome outlier-free/contaminated; a run is a sequence of such trials. However, this model only covers the special case where all outlier-free hypotheses are equally good, e.g. generated from noise-free data. In this paper, we explore a more general model which obviates the noise-free data assumption: we consider RANSAC a random process returning the best hypothesis, h*, among a number of hypotheses drawn from a finite set Θ. We employ the rank of h* within Θ for the statistical characterisation of the output, present a closed-form expression for its exact probability mass function, and demonstrate that the beta-distribution is a good approximation thereof. This characterisation leads to two novel termination criteria, which indicate the number of iterations to come arbitrarily close to the global minimum in Θ with a specified probability. We also establish the conditions defining when a RANSAC process is statistically equivalent to a cascade of shorter RANSAC processes. These conditions justify a RANSAC scheme with dedicated stages to handle the outliers and the noise separately. We demonstrate the validity of the developed theory via Monte-Carlo simulations and real data experiments on a number of common geometry estimation problems. We conclude that a two-stage RANSAC process offers similar performance guarantees at a much lower cost than the equivalent one-stage process, and that a cascaded set-up has a better performance than LO-RANSAC, without the added complexity of a nested RANSAC implementation. © 2014 Springer Science+Business Media New York.
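For context, the sketch below shows RANSAC under the conventional Bernoulli model that the paper generalises, with the standard adaptive termination N >= log(1 - p) / log(1 - w^s) on a toy line-fitting problem (w is the estimated inlier ratio, s the sample size); the paper's order-statistics criteria replace this rule.

```python
import numpy as np

def ransac_line(pts, thresh=0.02, p=0.99, max_iter=10000, rng=None):
    rng = rng or np.random.default_rng()
    best = np.zeros(len(pts), bool)
    i, n_iter = 0, max_iter
    while i < n_iter:
        a, b = pts[rng.choice(len(pts), 2, replace=False)]
        v = b - a
        # Point-to-line distances via the 2D cross product.
        d = np.abs(v[0] * (pts[:, 1] - a[1]) -
                   v[1] * (pts[:, 0] - a[0])) / np.linalg.norm(v)
        inliers = d < thresh
        if inliers.sum() > best.sum():
            best = inliers
            w = min(inliers.mean(), 0.999)       # estimated inlier ratio
            n_iter = min(max_iter,
                         int(np.ceil(np.log(1 - p) / np.log(1 - w ** 2))))
        i += 1
    return np.polyfit(*pts[best].T, 1), best

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200)
pts = np.c_[x, 2 * x + 0.5 + rng.normal(0, 0.005, 200)]
pts[:80] = rng.uniform(0, 2, (80, 2))            # 40% gross outliers
print(ransac_line(pts, rng=rng)[0])              # ~ [2.0, 0.5]
```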
Ahmed A, Mokhtarian F, Hilton A (2001) Parametric Motion Blending through Wavelet Analysis, Eurographics 2001 - Short Paper pp. 347-353
This paper presents a framework for performance-based animation and retargeting of high-resolution face models from motion capture. A novel method is introduced for learning a mapping between sparse 3D motion capture markers and dense high-resolution 3D scans of face shape and appearance. A high-resolution facial expression space is learnt from a set of 3D face scans as a person specific morphable model. Sparse 3D face points sampled at the motion capture marker positions are used to build a corresponding low-resolution expression space to represent the facial dynamics from motion capture. Radial basis function interpolation is used to automatically map the low-resolution motion capture of facial dynamics to the high-resolution facial expression space. This produces a high-resolution facial animation with the detailed shape and appearance of real facial dynamics. Retargeting is introduced to transfer facial expressions to a novel subject captured from a single photograph or 3D scan. The subject-specific high-resolution expression space is mapped to the novel subject based on anatomical differences in face shape. Results of facial animation and retargeting demonstrate realistic animation of expressions from motion capture.
Starck J, Maki A, Nobuhara S, Hilton A, Matsuyama T (2009) The Multiple-Camera 3-D Production Studio, IEEE Transactions on Circuits and Systems for Video Technology 19(6) pp. 856-869 IEEE
Stoddart AJ, Lemke S, Hilton A, Renn T (1998) Estimating pose uncertainty for surface registration, Image and Vision Computing 16(2) pp. 111-120 Elsevier Science BV
Hilton A, Illingworth J, Windeatt T (1995) Statistics of Surface Curvature Estimates, Pattern Recognition 28(8) pp. 1201-1221 Pergamon-Elsevier Science Ltd
Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling results in a compact representation with minimal loss of information allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.
Stoddart AJ, Lemke S, Hilton A, Renn T (1996) Uncertainty estimation for surface registration, BMVC pp. 1-6 BMVA Press
We present a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis viseme primitives are identified from the corpus using automatic speech recognition. A novel nonrigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation together with a novel optical flow optimisation using both shape and colour, enables accurate and efficient nonrigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence using the nonrigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis has been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.
Many practical applications require an accurate knowledge of the extrinsic calibration (i.e., pose) of a moving camera. The existing SLAM and structure-from-motion solutions are not robust to scenes with large dynamic objects, and do not fully utilize the available information in the presence of static cameras, a common practical scenario. In this paper, we propose an algorithm that addresses both of these issues for a hybrid static-moving camera setup. The algorithm uses the static cameras to build a sparse 3D model of the scene, with respect to which the pose of the moving camera is estimated at each time instant. The performance of the algorithm is studied through extensive experiments that cover a wide range of applications, and is shown to be satisfactory.
Conventional stereoscopic video content production requires the use of dedicated stereo camera rigs, which is both costly and lacks video editing flexibility. In this paper, we propose a novel approach which only requires a small number of standard cameras sparsely located around a scene to automatically convert the monocular inputs into stereoscopic streams. The approach combines a probabilistic spatio-temporal segmentation framework with a state-of-the-art multi-view graph-cut reconstruction algorithm, thus providing full control of the stereoscopic settings at render time. Results with studio sequences of complex human motion demonstrate the suitability of the method for high quality stereoscopic content generation with minimum user interaction.
A new surface-based approach to implicit surface polygonisation is introduced. This is applied to the reconstruction of 3D surface models of complex objects from multiple range images. Geometric fusion of multiple range images into an implicit surface representation was presented in previous work. This paper introduces an efficient algorithm to reconstruct a triangulated model of a manifold implicit surface. A local 3D constraint is derived which defines the Delaunay surface triangulation of a set of points on a manifold surface in 3D space. The 'marching triangles' algorithm uses the local 3D constraint to reconstruct a Delaunay triangulation of an arbitrary topology manifold surface. Computational and representational costs are both a factor of 3-5 lower than previous volumetric approaches such as marching cubes.
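A minimal sketch of the local 3D constraint: a candidate triangle is accepted only if no other surface point lies inside its smallest circumsphere, the sphere through the three vertices centred in the triangle's plane. The circumcentre formula is standard; this is an illustration, not the paper's code.

```python
import numpy as np

def delaunay_ok(a, b, c, points, eps=1e-9):
    """True if no point in `points` lies inside the triangle's circumsphere."""
    ab, ac = b - a, c - a
    n = np.cross(ab, ac)
    centre = a + (np.dot(ab, ab) * np.cross(ac, n) +
                  np.dot(ac, ac) * np.cross(n, ab)) / (2 * np.dot(n, n))
    r = np.linalg.norm(centre - a)           # circumsphere radius
    d = np.linalg.norm(points - centre, axis=1)
    return bool((d > r - eps).all())

a, b, c = np.array([0., 0, 0]), np.array([1., 0, 0]), np.array([0., 1, 0])
print(delaunay_ok(a, b, c, np.array([[2.0, 2.0, 0.0]])))   # True: outside
print(delaunay_ok(a, b, c, np.array([[0.4, 0.4, 0.0]])))   # False: inside
```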
This paper introduces a novel method of surface refinement for free-viewpoint video of dynamic scenes. Unlike previous approaches, the method presented here uses both visual hull and silhouette contours to constrain refinement of view-dependent depth maps from wide baseline views. A technique for extracting silhouette contours as rims in 3D from the view-dependent visual hull (VDVH) is presented. A new method for improving correspondence is introduced, where refinement of the VDVH is posed as a global problem in projective ray space. Artefacts of global optimisations are reduced by incorporating rims as constraints. Real time rendering of virtual views in a free-viewpoint video system is achieved using an image+depth representation for each real view. Results illustrate the high quality of rendered views achieved through this refinement technique.
Hilton A, Gentils T, Beresford D (1998) Popup-People: Capturing 3D Articulated Models of Individual People, IEE Colloquium on Computer Vision for Virtual Human Modelling pp. 1-6 IEE
This paper presents a system for simultaneous capture of video sequences of face shape and colour appearance. Shape capture uses a projected infra-red structured light pattern together with stereo reconstruction to simultaneously acquire full resolution shape and colour image sequences at video rate. Displacement mapping techniques are introduced to represent dynamic face surface shape as a displacement video. This unifies the representation of face shape and colour. The displacement video representation enables efficient registration, integration and spatio-temporal analysis of captured face data. Results demonstrate that the system achieves video-rate (25Hz) acquisition of dynamic 3D colour faces at PAL resolution with an RMS accuracy of 0.2mm and a visual quality comparable to the captured video.
Multiple view 3D video reconstruction of actor performance captures a level-of-detail for body and clothing movement which is time-consuming to produce using existing animation tools. In this paper we present a framework for concatenative synthesis from multiple 3D video sequences according to user constraints on movement, position and timing. Multiple 3D video sequences of an actor performing different movements are automatically constructed into a surface motion graph which represents the possible transitions with similar shape and motion between sequences without unnatural movement artifacts. Shape similarity over an adaptive temporal window is used to identify transitions between 3D video sequences. Novel 3D video sequences are synthesized by finding the optimal path in the surface motion graph between user specified key-frames for control of movement, location and timing. The optimal path which satisfies the user constraints whilst minimizing the total transition cost between 3D video sequences is found using integer linear programming. Results demonstrate that this framework allows flexible production of novel 3D video sequences which preserve the detailed dynamics of the captured movement for an actress with loose clothing and long hair without visible artifacts.
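A simplified sketch of path search on a surface motion graph is given below: nodes are captured 3D video clips and weighted edges are transition costs from shape and motion similarity. The paper solves for the optimal path with integer linear programming under movement, location and timing constraints; plain Dijkstra between two user-specified nodes is used here purely for illustration, and the graph contents are hypothetical.

```python
import heapq

def shortest_path(graph, start, goal):
    """graph: {node: [(neighbour, transition_cost), ...]} -> (cost, path)."""
    queue, seen = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# Hypothetical motion graph of captured clips with transition costs.
graph = {"idle": [("walk", 0.2)], "walk": [("run", 0.4), ("turn", 0.3)],
         "turn": [("idle", 0.3)], "run": [("walk", 0.4)]}
print(shortest_path(graph, "idle", "run"))   # (0.6, ['idle', 'walk', 'run'])
```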
In computer vision, matting is the process of accurate foreground estimation in images and videos. In this paper we present a novel patch-based approach to video matting relying on non-parametric statistics to represent image variations in appearance. This overcomes the limitation of parametric algorithms which rely only on strong colour correlation between nearby pixels. Initially we construct a clean background by utilising the foreground object's movement across the background. For a given frame, a trimap is constructed using the background and the previous frame's trimap. A patch-based approach is used to estimate the foreground colour for every unknown pixel and finally the alpha matte is extracted. Quantitative evaluation shows that the technique performs better, in terms of accuracy and required user interaction, than current state-of-the-art parametric approaches.
Ahmed A, Hilton A, Mokhtarian F (2004) Enriching Animation Databases, Eurographics Short Paper
In this paper we consider the problem of aligning multiple non-rigid surface mesh sequences into a single temporally consistent representation of the shape and motion. A global alignment graph structure is introduced which uses shape similarity to identify frames for inter-sequence registration. Graph optimisation is performed to minimise the total non-rigid deformation required to register the input sequences into a common structure. The resulting global alignment ensures that all input sequences are resampled with a common mesh structure which preserves the shape and temporal correspondence. Results demonstrate temporally consistent representation of several public databases of mesh sequences for multiple people performing a variety of motions with loose clothing and hair.
Hilton A, Fua P, Ronfard R (2006) Vision-based Understanding of a Person's Shape, Appearance, Movement and Behaviour, Computer Vision and Image Understanding - Special Issue on Modelling People 104(2-3) pp. 87-90
Kim H, Hilton A (2013) 3D Scene Reconstruction from Multiple Spherical Stereo Pairs, International Journal of Computer Vision pp. 1-23
We propose a 3D environment modelling method using multiple pairs of high-resolution spherical images. Spherical images of a scene are captured using a rotating line scan camera. Reconstruction is based on stereo image pairs with a vertical displacement between camera views. A 3D mesh model for each pair of spherical images is reconstructed by stereo matching. For accurate surface reconstruction, we propose a PDE-based disparity estimation method which produces continuous depth fields with sharp depth discontinuities even in occluded and highly textured regions. A full environment model is constructed by fusion of partial reconstruction from spherical stereo pairs at multiple widely spaced locations. To avoid camera calibration steps for all camera locations, we calculate 3D rigid transforms between capture points using feature matching and register all meshes into a unified coordinate system. Finally a complete 3D model of the environment is generated by selecting the most reliable observations among overlapped surface measurements considering surface visibility, orientation and distance from the camera. We analyse the characteristics and behaviour of errors for spherical stereo imaging. Performance of the proposed algorithm is evaluated against ground-truth from the Middlebury stereo test bed and LIDAR scans. Results are also compared with conventional structure-from-motion algorithms. The final composite model is rendered from a wide range of viewpoints with high quality textures. © 2013 Springer Science+Business Media New York.
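The final observation-selection step can be sketched as a per-surface-point reliability score over the candidate capture points; the weighting of orientation against distance below is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def reliability(point, normal, cam_pos, w_orient=1.0, w_dist=0.5):
    """Score how reliably a capture position observes a surface point."""
    view = cam_pos - point
    dist = np.linalg.norm(view)
    cos_angle = max(np.dot(view / dist, normal), 0.0)    # frontal surfaces win
    return w_orient * cos_angle + w_dist / (1.0 + dist)  # nearer cameras win

point, normal = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
cams = {"near_frontal": np.array([0.0, 0.0, 2.0]),
        "far_oblique": np.array([8.0, 0.0, 1.0])}
best = max(cams, key=lambda c: reliability(point, normal, cams[c]))
print(best)   # near_frontal
```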
We address the problem of reliable real-time 3D-tracking of multiple objects which are observed in multiple wide-baseline camera views. Establishing the spatio-temporal correspondence is a problem with combinatorial complexity in the number of objects and views. In addition, vision-based tracking suffers from the ambiguities introduced by occlusion, clutter and irregular 3D motion. We present a discrete relaxation algorithm for reducing the intrinsic combinatorial complexity by pruning the decision tree based on unreliable prior information from independent 2D-tracking for each view. The algorithm improves the reliability of spatio-temporal correspondence by simultaneous optimisation over multiple views in the case where 2D-tracking in one or more views is ambiguous. Application to the 3D reconstruction of human movement, based on tracking of skin-coloured regions in three views, demonstrates considerable improvement in reliability and performance. The results demonstrate that the optimisation over multiple views gives correct 3D reconstruction and object labelling in the presence of incorrect 2D-tracking whilst maintaining real-time performance.
We consider the problem of geometric integration and representation of multiple views of non-rigidly deforming 3D surface geometry captured at video rate. Instead of treating each frame as a separate mesh we present a representation which takes into consideration temporal and spatial coherence in the data where possible. We first segment gross base transformations using correspondence based on a closest point metric and represent these motions as piecewise rigid transformations. The remaining residual is encoded as displacement maps at each frame, giving a displacement video. At both these stages occlusions and missing data are interpolated to give a representation which is continuous in space and time. We demonstrate the integration of multiple views for four different non-rigidly deforming scenes: hand, face, cloth and a composite scene. The approach achieves the integration of multiple-view data at different times into one representation which can be processed and edited.
Kittler J, Hilton A, Hamouz M, Illingworth J (2005) 3D Assisted Face Recognition: A Survey of 3D Imaging, Modelling and Recognition Approaches, IEEE Workshop on Advanced 3D Imaging for Safety and Security, A3DISS 2005 (Proceedings of the CVPR 2005 (DVD-ROM))
Hilton A, Roberts JB, Hadded O (1992) Comparative Evaluation of Techniques for Estimating Turbulent Flow Parameters from In-Cylinder LDA Engine Data, Fifth International Symposium on Applications of Laser Anemometry to Fluid Mechanics, Lisbon, Portugalpp. 130-138
Budd C, Hilton A (2009) Skeleton Driven Volumetric Deformation, ACM Symposium on Computer Animation
In this paper we present the first Facial Action Coding System (FACS) valid model to be based on dynamic 3D scans of human faces for use in graphics and psychological research. The model consists of FACS Action Unit (AU) based parameters and has been independently validated by FACS experts. Using this model, we explore the perceptual differences between linear facial motions, represented by a linear blend shape approach, and real facial motions that have been synthesized through the 3D facial model. Through numerical measures and visualizations, we show that this latter type of motion is geometrically nonlinear in terms of its vertices. In experiments, we explore the perceptual benefits of nonlinear motion for different AUs. Our results are insightful for designers of animation systems both in the entertainment industry and in scientific research. They reveal a significant overall benefit to using captured nonlinear geometric vertex motion over linear blend shape motion. However, our findings suggest that not all motions need to be animated nonlinearly. The advantage may depend on the type of facial action being produced and the phase of the movement.
The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as the SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles, as they need to be propagated over the entire image in every frame to detect new speaker presence in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (V-SMC-PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named AVMS-SMC-PHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD based on the AV16.3, AMI and CLEAR datasets.
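The mean-shift refinement can be illustrated in isolation: an estimate is moved iteratively to the kernel-weighted mean of nearby samples, climbing the density gradient until it settles on a mode. This toy version omits the PHD filter and audio DOA machinery entirely.

```python
import numpy as np

def mean_shift(start, samples, bandwidth=1.0, iters=50, tol=1e-5):
    """Climb a kernel density estimate from `start` to the nearest mode."""
    x = np.asarray(start, float)
    for _ in range(iters):
        w = np.exp(-np.sum((samples - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        x_new = (w[:, None] * samples).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(3)
samples = np.vstack([rng.normal([0, 0], 0.3, (200, 2)),     # two modes
                     rng.normal([5, 5], 0.3, (200, 2))])
print(mean_shift([4.0, 4.0], samples))   # converges near the (5, 5) mode
```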
Sun W, Hilton A, Smith R (2000) Building Animated Models from 3D Scanned Data, Fifth Industrial Congress on 3D Digitizing pp. 1-8
Tena JR, Hamouz M, Hilton A, Illingworth J (2006) A Validated Method for Dense Non-rigid 3D Face Registration, IEEE Int. Conf. on Advanced Video and Signal-based Surveillance (AVSS'06) pp. 81-90
This paper presents a general approach based on the shape similarity tree for non-sequential alignment across databases of multiple unstructured mesh sequences from non-rigid surface capture. The optimal shape similarity tree for non-rigid alignment is defined as the minimum spanning tree in shape similarity space. Non-sequential alignment based on the shape similarity tree minimises the total non-rigid deformation required to register all frames in a database into a consistent mesh structure with surfaces in correspondence. This allows alignment across multiple sequences of different motions, reduces drift in sequential alignment and is robust to rapid non-rigid motion. Evaluation is performed on three benchmark databases of 3D mesh sequences with a variety of complex human and cloth motion. Comparison with sequential alignment demonstrates reduced errors due to drift and improved robustness to large non-rigid deformation, together with global alignment across multiple sequences which is not possible with previous sequential approaches. © 2012 The Author(s).
Roberts JB, Hilton ADM (2001) A direct transform method for the analysis of laser Doppler anemometry engine data, Proceedings of the Institution of Mechanical Engineers Part D: Journal of Automobile Engineering 215(D6) pp. 725-738 Professional Engineering Publishing Ltd
Image-based modelling allows the reconstruction of highly realistic digital models from real-world objects. This paper presents a model-based approach to recover animated models of people from multiple-view video images. Two contributions are made: first, a multiple-resolution model-based framework is introduced that combines multiple visual cues in reconstruction; second, a novel mesh parameterisation is presented to preserve the vertex parameterisation in the model for animation. A prior humanoid surface model is first decomposed into multiple levels of detail and represented as a hierarchical deformable model for image fitting. A novel mesh parameterisation is presented that allows propagation of deformation in the model hierarchy and regularisation of surface deformation to preserve vertex parameterisation and animation structure. The hierarchical model is then used to fuse multiple shape cues from silhouette, stereo and sparse feature data in a coarse-to-fine strategy to recover a model that reproduces the appearance in the images. The framework is compared to physics-based deformable surface fitting at a single resolution, demonstrating improved reconstruction accuracy against ground-truth data with reduced model distortion. Results demonstrate realistic modelling of real people with accurate shape and appearance while preserving model structure for use in animation.
Price M, Chandaria J, Grau O, Thomas GA, Chatting D, Thorne J, Milnthorpe G, Woodward P, Bull L, Ong E-J, Hilton A, Mitchelson J, Starck J (2002) Real-Time Production and Delivery of 3D Media, International Broadcasting Convention, Conference Proceedings
The Prometheus project has investigated new ways of creating, distributing and displaying 3D television. 3D content is created by extension of the principles of a virtual studio to include realistic 3D representation of actors. Several techniques for this have been developed:
- Texture-mapping of live video onto rough 3D geometry.
- Fully-animated 3D avatars:
  - Photo-realistic body model generated from several still images of a person from different viewpoints.
  - Addition of a detailed head model taken from two close-up images of the head.
  - Tracking of face and body movements of a live performer using several cameras, to derive animation data which can be applied to the face.
- Simulation of virtual clothing which can be applied to the animated avatars.
MPEG-4 is used to distribute the content in its original 3D form. The 3D scene may be rendered in a form suitable for display on a 'glasses-free' 3D display, based on the principle of Integral Imaging. By assembling these elements in an end-to-end chain, the project has shown how a future 3D TV system could be realised. Furthermore, the tools developed will also improve the production methods available for conventional virtual studios, by focusing on sensor-free and markerless motion capture technology, methods for the rapid creation of photo-realistic virtual humans, and real-time rendering.
Shen X, Palmer P, McLauchlan P, Hilton A (2000) Error Propagation from Camera Motion to Epipolar Constraint, British Machine Vision Conference pp. 546-555
Sarim M, Hilton A, Guillemaut JY (2011) Temporal trimap propagation for video matting using inferential statistics, Proceedings - International Conference on Image Processing, ICIP pp. 1745-1748
This paper introduces a statistical inference framework to temporally propagate trimap labels from sparsely defined key frames to estimate trimaps for the entire video sequence. A trimap is a fundamental requirement for digital image and video matting approaches. Statistical inference is coupled with Bayesian statistics to allow robust trimap labelling in the presence of shadows, illumination variation and overlap between the foreground and background appearance. Results demonstrate that the trimaps are sufficiently accurate to allow high quality video matting using existing natural image matting algorithms. Quantitative evaluation against ground-truth demonstrates that the approach achieves accurate matte estimation with less user interaction than state-of-the-art techniques. © 2011 IEEE.
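A minimal sketch of the Bayesian labelling idea, assuming Gaussian colour likelihoods purely for illustration: each pixel takes the maximum-posterior label given the prior propagated from the previous frame, and low-confidence pixels are left unknown for the matting stage.

```python
import numpy as np

def label_pixel(colour, fg_mean, bg_mean, prior_fg, sigma=0.1, margin=0.2):
    """Posterior foreground probability -> trimap label for one pixel."""
    lik_fg = np.exp(-np.sum((colour - fg_mean) ** 2) / (2 * sigma ** 2))
    lik_bg = np.exp(-np.sum((colour - bg_mean) ** 2) / (2 * sigma ** 2))
    post_fg = lik_fg * prior_fg / (lik_fg * prior_fg + lik_bg * (1 - prior_fg))
    if post_fg > 0.5 + margin:
        return "foreground"
    if post_fg < 0.5 - margin:
        return "background"
    return "unknown"          # resolved later by natural image matting

fg, bg = np.array([0.8, 0.2, 0.2]), np.array([0.1, 0.6, 0.1])
print(label_pixel(np.array([0.75, 0.25, 0.2]), fg, bg, prior_fg=0.7))
```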
We propose a region-based method to extract foreground regions from colour video sequences. The foreground region is decided by voting with scores from background subtraction over the sub-regions obtained by graph-based segmentation. Experiments show that the proposed algorithm improves on conventional approaches, especially in strong shadow regions.
Turkmani A, Hilton A (2006) Appearance-Based Inner-Lip Detection, IET European Conference on Visual Media Production p. 176
Budd C, Hilton A (2009) Skeleton Driven Volumetric Laplacian Deformation, European Conference on Visual Media Production
Kim H, Hilton A (2009) Graph-based Foreground Extraction in Extended Colour Space, Int.Conf.Image Processing (ICIP)
Ahmed A, Hilton A, Mokhtarian F (2004) Intuitive Parametric Synthesis of Human Animation Sequences, IEEE Computer Animation and Social Agents
Wang T, McLauchlan P, Palmer P, Hilton A (2001) Calibration for an Integrated Measurement System of Camera and Laser and its Application, 5th World Multiconference on Systemics, Cybernetics and Informatics (Awarded Best Paper), Orlando, Florida, USA
Edge J, Hilton A, Jackson P (2008) Parameterisation of Speech Lip Movements, Proceedings of International Conference on Auditory-visual Speech Processing
In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.
Hilton A (1992) Algorithms for Estimating Turbulent Flow Parameters from In-Cylinder Laser Doppler Anemometer Data, Doctor of Philosophy (D.Phil.) Thesis, University of Sussex, UK
This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration, allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities, resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction.
Smith R, Hilton A, Sun W (2000) Seamless VRML Humans, Fifth Industrial Congress on 3D Digitizing pp. 1-8
Hilton A, Fua P (2001) Modeling people toward vision-based understanding of a person's shape, appearance, and movement, Computer Vision and Image Understanding 81(3) pp. 227-230
This paper presents a layered animation framework which uses displacement maps for efficient representation and animation of highly detailed surfaces. The model consists of three layers: a skeleton; a low-resolution control model; and a displacement map image. The novel aspects of this approach are an automatic closed-form solution for displacement map generation and animation of the layered displacement map model. This approach provides an efficient representation of complex geometry which allows realistic deformable animation with multiple levels-of-detail. The representation enables compression, efficient transmission and level-of-detail control for animated models.
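The displacement-map representation itself is compact: detail is stored as a scalar offset of each high-resolution sample along the normal of the low-resolution control surface, so the detailed point can be rebuilt from the base point, its normal and the displacement alone. A toy sketch follows, in which the mapping of samples to control-surface locations is assumed given:

```python
import numpy as np

def encode_displacement(p_detail, p_base, n_base):
    """Signed offset of the detailed point along the base surface normal."""
    return np.dot(p_detail - p_base, n_base)

def decode_displacement(p_base, n_base, d):
    """Rebuild the detailed point from base point, unit normal and offset."""
    return p_base + d * n_base

p_base = np.array([0.0, 0.0, 0.0])             # point on control surface
n_base = np.array([0.0, 0.0, 1.0])             # its unit normal
p_detail = np.array([0.0, 0.0, 0.03])          # e.g. a crease from stereo
d = encode_displacement(p_detail, p_base, n_base)
print(decode_displacement(p_base, n_base, d))  # recovers the detailed point
```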
This paper presents a novel system for the 3D capture of facial performance using standard video and lighting equipment. The mesh of an actor's face is tracked non-sequentially throughout a performance using multi-view image sequences. The minimum spanning tree calculated in expression dissimilarity space defines the traversal of the sequences optimal with respect to error accumulation. A robust patch-based frame-to-frame surface alignment combined with the optimal traversal significantly reduces drift compared to previous sequential techniques. Multi-path temporal fusion resolves inconsistencies between different alignment paths and yields a final mesh sequence which is temporally consistent. The surface tracking framework is coupled with photometric stereo using colour lights which captures metrically correct skin geometry. High-detail UV normal maps corrected for shadow and bias artefacts augment the temporally consistent mesh sequence. Evaluation on challenging performances by several actors demonstrates the acquisition of subtle skin dynamics and minimal drift over long sequences. A quantitative comparison to a state-of-the-art system shows similar quality of temporal alignment. © 2012 IEEE.
Hamouz M, Tena JR, Kittler J, Hilton A, Illingworth J (2006) 3D Assisted Face Recognition: A Survey, Book Chapter
We present a new approach to reflectance estimation for dynamic scenes. Non-parametric image statistics are used to transfer reflectance properties from a static example set to a dynamic image sequence. The approach allows reflectance estimation for surface materials with inhomogeneous appearance, such as those which commonly occur with patterned or textured clothing. Material reflectance properties are initially estimated from static images of the subject under multiple directional illuminations using photometric stereo. The estimated reflectance together with the corresponding image under uniform ambient illumination form a prior set of reference material observations. Material reflectance properties are then estimated for video sequences of a moving person captured under uniform ambient illumination by matching the observed local image statistics to the reference observations. Results demonstrate that the transfer of reflectance properties enables estimation of the dynamic surface normals and subsequent relighting. This approach overcomes limitations of previous work on material transfer and relighting of dynamic scenes, which was limited to surfaces with regions of homogeneous reflectance. We evaluate for relighting 3D model sequences reconstructed from multiple view video. Comparison to previous model relighting demonstrates improved reproduction of detailed texture and shape dynamics.
Starck J, Collins G, Smith R, Hilton A, Illingworth J (2003) Animated Statues, Journal of Machine Vision Applications 14(4) pp. 248-259
In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system, and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.
Sarim M, Hilton A, Guillemaut J-Y, Kim H, Takai T (2010) Wide-Baseline Multi-View Video Segmentation For 3D Reconstruction, Proceedings of the 1st international workshop on 3D video processing pp. 13-18 ACM
Obtaining a foreground silhouette across multiple views is one of the fundamental steps in 3D reconstruction. In this paper we present a novel video segmentation approach to obtain a foreground silhouette for scenes captured by a wide-baseline camera rig, given sparse manual interaction in a single view. The algorithm is based on trimap propagation, a framework used in video matting. Bayesian inference coupled with camera calibration information is used to spatio-temporally propagate high confidence trimap labels across the multi-view video to obtain coarse silhouettes, which are later refined using a matting algorithm. Recent techniques have been developed for foreground segmentation, based on image matting, in multiple views, but they are limited to narrow baselines with low foreground variation. The proposed wide-baseline silhouette propagation is robust to inter-view foreground appearance changes, shadows and similarity in foreground/background appearance. The approach has demonstrated good performance in silhouette estimation for views up to a 180 degree baseline (opposing views). The segmentation technique has been fully integrated in a multi-view reconstruction pipeline. The results obtained demonstrate the suitability of the technique for multi-view reconstruction with wide-baseline camera set-ups and natural backgrounds.
This paper addresses the synthesis of virtual views of people from multiple view image sequences. We consider the target area of the multiple camera '3D Virtual Studio' with the ultimate goal of capturing video-realistic dynamic human appearance. A mesh based reconstruction framework is introduced to initialise and optimise the shape of a dynamic scene for view-dependent rendering, making use of silhouette and stereo data as complementary shape cues. The technique addresses two key problems: (1) robust shape reconstruction; and (2) accurate image correspondence for view-dependent rendering in the presence of camera calibration error. We present results against ground truth data in synthetic test cases and for captured sequences of people in a studio. The framework demonstrates a higher resolution in rendering compared to shape from silhouette and multiple view stereo.
Hilton A, Starck J, Collins G (2002) From 3D Shape Capture to Animated Models, IEEE Conference on 3D Data Processing, Visualisation and Transmission
Hilton A, Illingworth J, Windeatt T (1994) Surface Curvature Estimation, 12th IAPR International Conference on Pattern Recognition pp. 37-41 IEEE
Hilton A (2003) Computer Vision for Human Modelling and Analysis, Journal of Machine Vision Applications 14(4) pp. 206-209
Doshi A, Hilton A, Starck J (2008) An Empirical Study of Non-rigid Surface Feature Matching, European Conference on Visual Media Production
Hilton A, Godin G, Shu C, Masuda T (2011) Special issue on 3D imaging and modelling, Computer Vision and Image Understanding 115(5) pp. 559-560 Academic Press Inc Elsevier Science
Hilton A, Illingworth J (1997) Multi-Resolution Geometric Fusion, International Conference on Recent Advances in 3D Digital Imaging and Modeling pp. 181-188
We propose a human performance capture system employing convolutional neural networks (CNN) to estimate human pose from a volumetric representation of a performer derived from multiple view-point video (MVV). We compare direct CNN pose regression to the performance of an affine invariant pose descriptor learned by a CNN through a classification task. A non-linear manifold embedding is learned between the descriptor and articulated pose spaces, enabling regression of pose from the source MVV. The results are evaluated against ground truth pose data captured using a Vicon marker-based system and demonstrate good generalisation over a range of human poses, providing a system that requires no special suit to be worn by the performer.
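A hedged sketch of direct pose regression from a voxel volume with a small 3D CNN is shown below in PyTorch; the architecture, input size and joint count are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class VoxelPoseNet(nn.Module):
    """Toy 3D CNN regressing joint positions from a 32^3 occupancy volume."""
    def __init__(self, n_joints=15):
        super().__init__()
        self.n_joints = n_joints
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 8  -> 4
        )
        self.head = nn.Linear(64 * 4 * 4 * 4, n_joints * 3)  # 3D joint coords

    def forward(self, vox):
        x = self.features(vox).flatten(1)
        return self.head(x).view(-1, self.n_joints, 3)

net = VoxelPoseNet()
vox = torch.rand(2, 1, 32, 32, 32)     # batch of volumes derived from MVV
print(net(vox).shape)                  # torch.Size([2, 15, 3])
```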
Grau O, Hilton A, Kilner J, Miller G, Sargeant T, Starck J (2007) A Free-Viewpoint Video System for Visualisation of Sports Scenes, SMPTE Motion Imaging Journal 116(5-6) pp. 213-219
Imre E, Guillemaut JY, Hilton A (2012) Through-the-lens multi-camera synchronisation and frame-drop detection for 3D reconstruction, Proceedings - 2nd Joint 3DIM/3DPVT Conference: 3D Imaging, Modeling, Processing, Visualization and Transmission, 3DIMPVT 2012, pp. 395-402
Synchronisation is an essential requirement for multi-view 3D reconstruction of dynamic scenes. However, the use of HD cameras and large set-ups puts considerable stress on hardware and causes frame drops, which are usually detected by manually verifying very large amounts of data. This paper improves an earlier synchronisation algorithm and extends it with frame-drop detection capability. In order to spot frame-drop events, the algorithm fits a broken line to the frame index correspondences for each camera pair, and then fuses the pairwise drop hypotheses into a consistent, absolute frame-drop estimate. The success and practical utility of the improved pipeline are demonstrated through a number of experiments, including 3D reconstruction and free-viewpoint video rendering tasks.
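To make the broken-line idea concrete, the sketch below shows one way a single-pair frame-drop hypothesis could be computed. It is an illustrative reconstruction under simplifying assumptions (a single drop event, noise-free integer indices, a hypothetical detect_frame_drop helper), not the paper's implementation, which additionally fuses such hypotheses across all camera pairs.

```python
import numpy as np

def detect_frame_drop(src_idx, dst_idx):
    """Fit a 'broken line' of unit slope to frame-index correspondences
    (src_idx[k], dst_idx[k]) for one camera pair, returning the breakpoint
    and drop count that minimise the squared residual.

    Illustrative sketch only: assumes one drop event and noise-free indices.
    """
    src = np.asarray(src_idx, dtype=float)
    dst = np.asarray(dst_idx, dtype=float)
    best = (np.inf, None, 0)                      # (residual, breakpoint, drop)
    for b in range(1, len(src) - 1):              # candidate breakpoint
        o1 = np.round(np.mean(dst[:b] - src[:b]))   # index offset before the drop
        o2 = np.round(np.mean(dst[b:] - src[b:]))   # index offset after the drop
        r = np.sum((dst[:b] - src[:b] - o1) ** 2) + \
            np.sum((dst[b:] - src[b:] - o2) ** 2)
        if r < best[0]:
            best = (r, b, int(o2 - o1))
    return best[1], best[2]                       # breakpoint index, frames dropped

# Example: the second camera drops 2 frames after the 4th correspondence.
src = [0, 1, 2, 3, 4, 5, 6, 7]
dst = [0, 1, 2, 3, 6, 7, 8, 9]
print(detect_frame_drop(src, dst))                # -> (4, 2)
```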
Digital content production traditionally requires highly skilled artists and animators to first manually craft shape and appearance models and then instill the models with a believable performance. Motion capture technology is now increasingly used to record the articulated motion of a real human performance to increase the visual realism in animation. Motion capture is limited to recording only the skeletal motion of the human body and requires the use of specialist suits and markers to track articulated motion. In this paper we present surface capture, a fully automated system to capture shape and appearance as well as motion from multiple video cameras as a basis to create highly realistic animated content from an actor's performance in full wardrobe. We address wide-baseline scene reconstruction to provide 360 degree appearance from just 8 camera views and introduce an efficient scene representation for level of detail control in streaming and rendering. Finally we demonstrate interactive animation control in a computer games scenario using a captured library of human animation, achieving a frame rate of 300 fps on consumer level graphics hardware.
Hilton A, Stoddart AJ, Illingworth J, Windeatt T (1994) Automatic inspection of loaded PCBs using 3D range data, SPIE Machine Vision Applications in Industrial Inspection II, International Symposium on Electronic Imaging: Science and Technology, San Jose, CA, Volume 2183 pp. 226-237, SPIE
Hilton A, Roberts JB, Hadded O (1991) Autocorrelation Based Analysis of LDA Engine Data for Bias-Free Turbulence Estimates, Society of Automotive Engineers International Congress pp. 22-30
Huang P, Hilton A, Starck J (2010) Shape Similarity for 3D Video Sequences of People, International Journal of Computer Vision89(2-3)pp. 362-381 Springer
This paper presents a performance evaluation of shape similarity metrics for 3D video sequences of people with unknown temporal correspondence. Performance of similarity measures is compared by evaluating Receiver Operating Characteristics for classification against ground-truth for a comprehensive database of synthetic 3D video sequences comprising animations of fourteen people performing twenty-eight motions. Static shape similarity metrics (shape distribution, spin image, shape histogram and spherical harmonics) are evaluated using optimal parameter settings for each approach. Shape histograms with volume sampling are found to consistently give the best performance for different people and motions. Static shape similarity is extended over time to eliminate the temporal ambiguity. Time-filtering of the static shape similarity together with two novel shape-flow descriptors is evaluated against temporal ground-truth. This evaluation demonstrates that shape-flow with a multi-frame alignment of motion sequences achieves the best performance, is stable for different people and motions, and overcomes the ambiguity in static shape similarity. Time-filtering of the static shape histogram similarity measure with a fixed window size achieves marginally lower performance for linear motions with the same computational cost as static shape descriptors. Performance of the temporal shape descriptors is validated for real 3D video sequences of nine actors performing a variety of movements. Time-filtered shape histograms are shown to reliably identify frames from 3D video sequences with similar shape and motion for people with loose clothing and complex motion.
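As a rough illustration of the volume-sampled shape histogram and the fixed-window time-filtering baseline described above, the sketch below uses hypothetical helper names and a simple L1 histogram similarity; the paper's exact descriptors and parameter settings differ.

```python
import numpy as np

def volume_histogram(points, bins=8, extent=1.0):
    """Coarse volume-sampled shape histogram: occupancy counts of 3D points
    in a regular grid, normalised to sum to one (illustrative)."""
    h, _ = np.histogramdd(points, bins=bins, range=[(-extent, extent)] * 3)
    return h.ravel() / max(h.sum(), 1)

def static_similarity(seq_a, seq_b):
    """Frame-to-frame similarity matrix between two 3D video sequences,
    each given as a list of (N, 3) point arrays."""
    ha = [volume_histogram(p) for p in seq_a]
    hb = [volume_histogram(p) for p in seq_b]
    return np.array([[1.0 - 0.5 * np.abs(a - b).sum() for b in hb] for a in ha])

def time_filtered(sim, w=3):
    """Average static similarity along diagonally aligned frames within a
    fixed temporal window, the simple time-filtering baseline above."""
    n, m = sim.shape
    out = np.zeros_like(sim)
    for i in range(n):
        for j in range(m):
            vals = [sim[i + t, j + t] for t in range(-w, w + 1)
                    if 0 <= i + t < n and 0 <= j + t < m]
            out[i, j] = np.mean(vals)
    return out
```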
Ahmed A, Hilton A, Mokhtarian F (2002) Adaptive Compression of Human Animation Data, Eurographics - Short Paper
Kittler J, Hamouz M, Tena JR, Hilton A, Illingworth J, Ruiz M (2005) 3D Assisted 2D Face Recognition: Methodology, Lecture Notes in Computer Science 3773 (Proc. of CIARP'05) pp. 1055-1065
Moeslund TB, Hilton A, Krüger V, Sigal L (2011) Visual Analysis of Humans: Looking at People, Springer-Verlag New York Inc
Hilton A, Stoddart AJ, Illingworth J, Windeatt T (1996) Building 3D Graphical Models of Complex Objects, Eurographics UK Conference pp. 193-203, EGUK
We present a novel method to relight video sequences given known surface shape and illumination. The method preserves fine visual details. It requires single view video frames, approximate 3D shape and standard studio illumination only, making it applicable in studio production. The technique is demonstrated for relighting video sequences of faces.
Starck J, Hilton A (2003) Towards a 3D Virtual Studio for Human Appearance Capture, IMA International Conference on Vision, Video and Graphics, Bath, pp. 17-24
Hilton A, Kalkavouras M, Collins G (2004) MELIES: 3D Studio Production of Animated Actor Models, IEE European Conference on Visual Media Production pp. 283-288
Blat J, Evans A, Kim H, Imre H, Polok L, Ila V, Nikolaidis N, Zamcik P, Tefas A, Smrz P, Hilton A, Pitas I (2015) Big Data Analysis for Media Production, Proceedings of the IEEE 104(11) pp. 2085-2113
A typical high-end film production generates several terabytes of data per day, either as footage from multiple cameras or as background information regarding the set (laser scans, spherical captures, etc.). This paper presents solutions to improve the integration, and the understanding of the quality, of the multiple data sources, which are used both to support creative decisions on-set (or near it) and to enhance the post-production process. The main contributions covered in this paper are: a public multisource production dataset made available for research purposes, monitoring and quality assurance of multicamera set-ups, multisource registration, anthropocentric visual analysis for semantic content annotation, acceleration of 3D reconstruction, and integrated 2D-3D web visualization tools. Furthermore, this paper presents a toolset for analysis and visualisation of multi-modal media production datasets which enables on-set data quality verification and management, thus significantly reducing the risk and time required in production. Some of the basic techniques used for acceleration, clustering and visualization could be applied to much larger classes of big data problems.
Gkalelis N, Kim H, Hilton A, Nikolaidis N, Pitas I (2009) The i3DPost multi-view and 3D human action/interaction database,
Huang P, Hilton A, Starck J (2008) Automatic 3D video summarization: Key frame extraction from self-similarity, 4th International Symposium on 3D Data Processing, Visualization and Transmission, 3DPVT 2008 - Proceedings, pp. 71-78
In this paper we present an automatic key frame selection method to summarise 3D video sequences. Key-frame selection is based on optimisation for the set of frames which give the best representation of the sequence according to a rate-distortion trade-off. Distortion of the summarization from the original sequence is based on measurement of self-similarity using volume histograms. The method evaluates the globally optimal set of key-frames to represent the entire sequence without requiring pre-segmentation of the sequence into shots or temporal correspondence. Results demonstrate that for 3D video sequences of people wearing a variety of clothing the summarization automatically selects a set of key-frames which represent the dynamics. Comparative evaluation of rate-distortion characteristics with previous 3D video summarization demonstrates improved performance.
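The rate-distortion trade-off can be illustrated with a simple greedy stand-in for the paper's globally optimal key-frame search; here `sim` is assumed to be the NxN self-similarity matrix computed from volume histograms, and all names are illustrative.

```python
import numpy as np

def select_key_frames(sim, k):
    """Greedy rate-distortion key-frame selection (an illustrative stand-in
    for the paper's globally optimal search). Summary distortion is the
    total dissimilarity of each frame to its most similar key-frame."""
    n = sim.shape[0]
    keys = []
    for _ in range(k):                          # rate: number of key-frames
        best_f, best_d = None, np.inf
        for f in range(n):
            if f in keys:
                continue
            cand = keys + [f]
            d = np.sum(1.0 - sim[:, cand].max(axis=1))   # summary distortion
            if d < best_d:
                best_f, best_d = f, d
        keys.append(best_f)
    return sorted(keys), best_d                 # key-frame indices, distortion
```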
Manessis A, Hilton A (2005) Scene Modelling from Sparse 3D Data, Journal of Image and Vision Computing 23(10) pp. 900-920
We propose a plane-based urban scene reconstruction method using spherical stereo image pairs. We assume that the urban scene consists of axis-aligned approximately planar structures (Manhattan world). Captured spherical stereo images are converted into six central-point perspective images by cubic projection and facade alignment. Facade alignment automatically identifies the principal planes direction in the scene allowing the cubic projection to preserve the plane structure. Depth information is recovered by stereo matching between images and independent 3D rectangular planes are constructed by plane fitting aligned with the principal axes. Finally planar regions are refined by expanding, detecting intersections and cropping based on visibility. The reconstructed model efficiently represents the structure of the scene and texture mapping allows natural walk-through rendering.
Hamouz M, Tena JR, Kittler J, Hilton A, Illingworth J (2006) Algorithms for 3D-assisted face recognition, 2006 IEEE 14TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS, VOLS 1 AND 2pp. 826-829 IEEE
Surface motion capture (SurfCap) enables 3D reconstruction of human performance with detailed cloth and hair deformation. However, there is a lack of tools that allow flexible editing of SurfCap sequences. In this paper, we present a Laplacian editing technique that constrains the mesh deformation to plausible surface shapes learnt from a set of examples. A part-based representation of the mesh enables learning of surface deformation locally in the space of Laplacian coordinates, avoiding correlations between body parts while preserving surface details. This extends the range of animation with natural surface deformation beyond the whole-body poses present in the SurfCap data. We illustrate successful use of our tool on three different characters.
Kim H, Hilton A (2014) Hybrid 3D Feature Description and Matching for Multi-Modal Data Registration, pp. 3493-3497
We propose a robust 3D feature description and registration method for 3D models reconstructed from various sensor devices. General 3D feature detectors and descriptors show low distinctiveness and repeatability for matching between different data modalities due to differences in noise and errors in geometry. The proposed method considers not only local 3D points but also neighbouring 3D keypoints to improve keypoint matching. The proposed method is tested on various multi-modal datasets including LIDAR scans, multiple photos, spherical images and RGBD videos to evaluate the performance against existing methods.
This paper addresses the problem of reconstructing an integrated 3D model from multiple 2.5D range images. A novel integration algorithm is presented based on a continuous implicit surface representation. This is the first reconstruction algorithm to use operations in 3D space only. The algorithm is guaranteed to reconstruct the correct topology of surface features larger than the range image sampling resolution. Reconstruction of triangulated models from multi-image data sets is demonstrated for complex objects. Performance characterization of existing range image integration algorithms is addressed in the second part of this paper. This comparison defines the relative computational complexity and geometric limitations of existing integration algorithms.
Stoddart AJ, Hilton A (1996) Registration of multiple point sets, ICPR pp. 1-4, Vienna
Mitchelson J, Hilton A (2003) Hierarchical Tracking of Human Motion for Animation, Model-based Imaging, Rendering, image Analysis and Graphical Special Effects, Paris
In this paper we introduce a video-based representation for free viewpoint visualization and motion control of 3D character models created from multiple view video sequences of real people. Previous approaches to video-based rendering provide no control of scene dynamics to manipulate, retarget, and create new 3D content from captured scenes. Here we contribute a new approach, combining image based reconstruction and video-based animation to allow controlled animation of people from captured multiple view video sequences. We represent a character as a motion graph of free viewpoint video motions for animation control. We introduce the use of geometry videos to represent reconstructed scenes of people for free viewpoint video rendering. We describe a novel spherical matching algorithm to derive global surface to surface correspondence in spherical geometry images for motion blending and the construction of seamless transitions between motion sequences. Finally, we demonstrate interactive video-based character animation with real-time rendering and free viewpoint visualization. This approach synthesizes highly realistic character animations with dynamic surface shape and appearance captured from multiple view video of people.
McLauchlan P, Shen X, Palmer P, Manessis A, Hilton A (2000) Surface-Based Structure-from-Motion using Feature Groupings, IEEE International Asian Conference on Computer Vision pp. 1-10
Molina L, Hilton A (2001) Learning models for synthesis of human motion, BMVA Workshop on Probabilistic Methods in Computer Vision
Miller Graham, Hilton Adrian (2007) Safe Hulls, IET European Conference on Visual Media Production pp. 1-8
The visual hull is widely used as a proxy for novel view synthesis in computer vision. This paper introduces the safe hull, the first visual hull reconstruction technique to produce a surface containing only foreground parts. A theoretical basis underlies this novel approach which, unlike any previous work, can also identify phantom volumes attached to real objects. Using an image-based method, the visual hull is constructed with respect to each real view and used to identify safe zones in the original silhouettes. The safe zones define volumes known to only contain surface corresponding to a real object. The zones are used in a second reconstruction step to produce a surface without phantom volumes. Results demonstrate the effectiveness of this method for improving surface shape and scene realism, and its advantages over heuristic techniques.
Sarim M, Guillemaut JY, Kim H, Hilton A (2009) Wide-baseline Image Matting, European Conference on Visual Media Production (CVMP)
This paper addresses the problem of human action matching in outdoor sports broadcast environments, by analysing 3D data from a recorded human activity and retrieving the most appropriate proxy action from a motion capture library. Typically pose recognition is carried out using images from a single camera, however this approach is sensitive to occlusions and restricted fields of view, both of which are common in the outdoor sports environment. This paper presents a novel technique for the automatic matching of human activities which operates on the 3D data available in a multi-camera broadcast environment. Shape is retrieved using multi-camera techniques to generate a 3D representation of the scene. Use of 3D data renders the system camera-pose-invariant and allows it to work while cameras are moving and zooming. By comparing the reconstructions to an appropriate 3D library, action matching can be achieved in the presence of significant calibration and matting errors which cause traditional pose detection schemes to fail. An appropriate feature descriptor and distance metric are presented as well as a technique to use these features for key-pose detection and action matching. The technique is then applied to real footage captured at an outdoor sporting event.
Imre E, Hilton A (2014) Covariance estimation for minimal geometry solvers via scaled unscented transformation, Computer Vision and Image Understanding
In film production, many post-production tasks require the availability of accurate camera calibration information. This paper presents an algorithm for through-the-lens calibration of a moving camera for a common scenario in film production and broadcasting: The camera views a dynamic scene, which is also viewed by a set of static cameras with known calibration. The proposed method involves the construction of a sparse scene model from the static cameras, with respect to which the moving camera is registered, by applying the appropriate perspective-n-point (PnP) solver. In addition to the general motion case, the algorithm can handle the nodal cameras with unknown focal length via a novel P2P algorithm. The approach can identify a subset of static cameras that are more likely to generate a high number of scene-image correspondences, and can robustly deal with dynamic scenes. Our target applications include dense 3D reconstruction, stereoscopic 3D rendering and 3D scene augmentation, through which the success of the algorithm is demonstrated experimentally.
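A minimal sketch of the registration step, assuming 3D-2D correspondences between the sparse scene model and the moving-camera frame are already available; it uses OpenCV's general-purpose robust PnP solver as a stand-in, whereas the paper selects among dedicated PnP solvers and derives a novel P2P solver for nodal cameras with unknown focal length.

```python
import numpy as np
import cv2

def register_moving_camera(scene_pts, image_pts, K, dist=None):
    """Register one frame of the moving camera against the sparse scene
    model built from the static rig, via a robust PnP solve (illustrative)."""
    dist = np.zeros(5) if dist is None else dist
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        scene_pts.astype(np.float32),   # Nx3 points from the static-camera model
        image_pts.astype(np.float32),   # Nx2 matched projections in the frame
        K.astype(np.float32), dist, reprojectionError=2.0)
    if not ok:
        raise RuntimeError("PnP registration failed")
    R, _ = cv2.Rodrigues(rvec)          # rotation matrix from rotation vector
    return R, tvec, inliers             # camera pose and inlier correspondences
```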
We describe a novel framework for segmenting a time- and view-coherent foreground matte sequence from synchronised multiple view video. We construct a Markov Random Field (MRF) comprising links between superpixels corresponded across views, and links between superpixels and their constituent pixels. Texture, colour and disparity cues are incorporated to model foreground appearance. We solve using a multi-resolution iterative approach enabling an eight view high definition (HD) frame to be processed in less than a minute. Furthermore we incorporate a temporal diffusion process introducing a prior on the MRF using information propagated from previous frames, and a facility for optional user correction. The result is a set of temporally coherent mattes that are solved for simultaneously across views for each frame, exploiting similarities across views and time.
Starck J, Kilner J, Hilton A (2009) A Free-viewpoint Video Renderer, Journal of Graphics Tools
Mitchelson J, Hilton A (2002) Wand-based Calibration of Multiple Cameras, British Machine Vision Association workshop on Multiple Views
Mustafa A, Kim H, Imre HE, Hilton A (2014) Initial Disparity Estimation Using Sparse Matching for Wide-Baseline Dense,
Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John (2017) Total Capture,
University of Surrey
This paper presents a novel volumetric reconstruction technique that combines shape-from-silhouette with stereo photo-consistency in a global optimisation that enforces feature constraints across multiple views. Human shape reconstruction is considered where extended regions of uniform appearance, complex self-occlusions and sparse feature cues represent a challenging problem for conventional reconstruction techniques. A unified approach is introduced to first reconstruct the occluding contours and left-right consistent edge contours in a scene and then incorporate these contour constraints in a global surface optimisation using graph-cuts. The proposed technique maximises photo-consistency on the surface, while satisfying silhouette constraints to provide shape in the presence of uniform surface appearance and edge feature constraints to align key image features across views.
We propose a framework for 2D/3D multi-modal data registration and evaluate 3D feature descriptors for registration of 3D datasets from different sources. 3D datasets of outdoor environments can be acquired using a variety of active and passive sensor technologies including laser scanning and video cameras. Registration of these datasets into a common coordinate frame is required for subsequent modelling and visualisation. 2D images are converted into 3D structure by stereo or multi-view reconstruction techniques and registered to a unified 3D domain with other datasets in a 3D world. Multi-modal datasets have different density, noise, and types of errors in geometry. This paper provides a performance benchmark for existing 3D feature descriptors across multi-modal datasets. Performance is evaluated for the registration of datasets obtained from high-resolution laser scanning with reconstructions obtained from images and video. This analysis highlights the limitations of existing 3D feature detectors and descriptors which need to be addressed for robust multi-modal data registration. We analyse and discuss the performance of existing methods in registering various types of datasets then identify future directions required to achieve robust multi-modal 3D data registration.
This paper presents an investigation of the visual variation of the bilabial plosive consonant /p/ in three coarticulation contexts. The aim is to provide detailed ensemble analysis to assist coarticulation modelling in visual speech synthesis. The underlying dynamics of labeled visual speech units, represented as lip shape, from symmetric VCV utterances, is investigated. Variation in lip dynamics is quantitatively and qualitatively analyzed. This analysis shows that there are statistically significant differences in both the lip shape and trajectory during coarticulation.
Roberts JB, Hilton A (2001) A Direct Transform Method for the Analysis of LDA Engine Data, I.Mech.E. Journal of Automotive Engineering 251D pp. 725-738
Conventional view-dependent texture mapping techniques produce composite images by blending subsets of input images, weighted according to their relative influence at the rendering viewpoint, over regions where the views overlap. Geometric or camera calibration errors often result in a loss of detail due to blurring or double exposure artefacts, which tends to be exacerbated by the number of blending views considered. We propose a novel view-dependent rendering technique which optimises the blend region dynamically at rendering time, and reduces the adverse effects of camera calibration or geometric errors otherwise observed. The technique has been successfully integrated in a rendering pipeline which operates at interactive frame rates. Improvements over state-of-the-art view-dependent texture mapping techniques are illustrated on a synthetic scene as well as real imagery of a large scale outdoor scene where large camera calibration and geometric errors are present.
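For context, a minimal fixed-subset weighting scheme of the kind such methods improve upon might look as follows; camera and view directions are unit vectors and all parameter names are illustrative. The contribution above is to optimise the blend region dynamically at render time rather than relying on such fixed weights.

```python
import numpy as np

def blend_weights(view_dir, cam_dirs, k=3, sharpness=8.0):
    """Per-camera blend weights for view-dependent texturing: cameras whose
    viewing directions are angularly closest to the rendering viewpoint get
    the highest weight (a simple fixed-subset baseline, illustrative only)."""
    cam_dirs = cam_dirs / np.linalg.norm(cam_dirs, axis=1, keepdims=True)
    v = view_dir / np.linalg.norm(view_dir)
    cos = cam_dirs @ v                      # angular proximity per camera
    top = np.argsort(-cos)[:k]              # restrict blending to k best views
    w = np.zeros(len(cam_dirs))
    w[top] = np.maximum(cos[top], 0.0) ** sharpness
    return w / max(w.sum(), 1e-9)           # normalised blend weights
```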
The ability to predict the acoustics of a room without acoustical measurements is a useful capability. The motivation here stems from spatial audio reproduction, where knowledge of the acoustics of a space could allow for more accurate reproduction of a captured environment, or for reproduction room compensation techniques to be applied. A cuboid-based room geometry estimation method using a spherical camera is proposed, assuming a room and objects inside can be represented as cuboids aligned to the main axes of the coordinate system. The estimated geometry is used to produce frequency-dependent acoustic predictions based on geometrical room modelling techniques. Results are compared to measurements through calculated reverberant spatial audio object parameters used for reverberation reproduction customized to the given loudspeaker set up.
This thesis addresses the problem of reconstructing complex real-world dynamic scenes without prior knowledge of the scene structure, dynamic objects or background. Previous approaches to 3D reconstruction of dynamic scenes either require a controlled studio set-up with chroma-key backgrounds or prior knowledge such as static background appearance or segmentation of the dynamic objects. This thesis presents a new approach which enables general dynamic scene reconstruction. This is achieved by initializing the reconstruction with sparse wide-baseline feature matches between views, which avoids the requirement for prior knowledge of the background appearance or assumptions that the background is static. To achieve sparse reconstruction of dynamic objects a novel segmentation-based feature detector (SFD) is introduced. SFD is shown to give an order of magnitude increase in the number and reliability of features detected. A coarse-to-fine approach is introduced for reconstruction of dense 3D models of dynamic scenes. This uses joint segmentation and shape refinement to achieve robust reconstruction of dynamic objects such as people. The approach is evaluated across a wide range of indoor and outdoor scenes.

The second major contribution of this research is to introduce temporal coherence into the reconstruction process. The dynamic scene is segmented into objects based on the initial sparse 3D feature reconstruction of the scene. Dense reconstruction is then performed for each object. For dynamic objects the reconstruction is propagated over time to provide a prior for the reconstruction at successive frames in the sequence. This is combined with the introduction of a geodesic star convexity constraint in the segmentation refinement to improve the segmentation of complex objects. Evaluation on general dynamic scenes demonstrates significant improvement in both segmentation and reconstruction, with temporal coherence reducing the ambiguity in the reconstruction of complex shape.

The final significant contribution of this research is the introduction of a complete framework for 4D temporally coherent shape reconstruction from one or more camera views. The 4D match tree is introduced as an intermediate representation for robust alignment of partial surface reconstructions across a complete sequence. SFD is used to achieve wide-timeframe matching of partial surface reconstructions between any pair of frames in the sequence. This allows the evaluation of a frame-to-frame shape similarity metric. A 4D match tree is then reconstructed as the minimum spanning tree which represents the shortest path in shape similarity space for alignment across all frames in the sequence. The 4D match tree is applied to achieve robust 4D shape reconstruction of complex dynamic scenes. This is the first approach to demonstrate 4D reconstruction of general real-world dynamic scenes with non-rigid shape from video.
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering has been recently exploited for audio-visual (AV) based tracking of multiple speakers, where audio data are used to inform the particle distribution and propagation in the visual SMC-PHD filter. However, the performance of the AV-SMC-PHD filter can be affected by the mismatch between the proposal and the posterior distribution. In this paper, we present a new method to improve the particle distribution where audio information (i.e. DOA angles derived from microphone array measurements) is used to detect new born particles and visual information (i.e. histograms) is used to modify the particles with particle flow (PF). Using particle flow has the benefit of migrating particles smoothly from the prior to the posterior distribution. We compare the proposed algorithm with the baseline AV-SMC-PHD algorithm using experiments on the AV16.3 dataset with multi-speaker sequences.
We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from a MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system
This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces observed from single or multiple moving cameras of complex scenes. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD) which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables the 4D Match Tree to be constructed which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single and multiple view images sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.
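A minimal sketch of the tree-construction step, assuming a pairwise frame dissimilarity matrix (a NumPy array) has already been estimated from sparse SFD matches; SciPy's minimum spanning tree is used as a stand-in, and root selection and traversal are simplified relative to the method above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def match_tree(dissimilarity):
    """Build a match tree as the minimum spanning tree of the pairwise frame
    dissimilarity matrix, minimising the observed change in non-rigid shape
    along tree edges (illustrative sketch)."""
    mst = minimum_spanning_tree(csr_matrix(dissimilarity))
    edges = np.transpose(mst.nonzero())                 # (frame, frame) pairs
    root = int(np.argmin(dissimilarity.sum(axis=1)))    # most central frame
    return root, edges    # alignment then traverses the tree from the root
```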
In this paper we propose a pipeline for estimating 3D room layout with object and material attribute prediction using a spherical stereo image pair. We assume that the room and objects can be represented as cuboids aligned to the main axes of the room coordinate system (Manhattan world). A spherical stereo alignment algorithm is proposed to align two spherical images to the global world coordinate system. Depth information of the scene is estimated by stereo matching between images. Cubic projection images of the spherical RGB and estimated depth are used for object and material attribute detection. A single Convolutional Neural Network is designed to assign object and attribute labels to geometrical elements built from the spherical image. Finally simplified room layout is reconstructed by cuboid fitting. The reconstructed cuboid-based model shows the structure of the scene with object information and material attributes.
This engineering brief reports on the production of 3 object-based audio drama scenes, commissioned as part of the S3A project. 3D reproduction and an object-based workflow were considered and implemented from the initial script commissioning through to the final mix of the scenes. The scenes are being made available as Broadcast Wave Format files containing all objects as separate tracks and all metadata necessary to render the scenes as an XML chunk in the header conforming to the Audio Definition Model specification (Recommendation ITU-R BS.2076 ). It is hoped that these scenes will find use in perceptual experiments and in the testing of 3D audio systems. The scenes are available via the following link: http://dx.doi.org/10.17866/rd.salford.3043921.
Recent developments in video and sensing technology lead to large amounts of digital media data. Current media production relies on both video from the principal camera and a wide variety of heterogeneous sources of supporting data (photos, LiDAR point clouds, witness video cameras, HDRI and depth imagery). Registration of visual data acquired from various 2D and 3D sensing modalities is challenging because current matching and registration methods are not appropriate due to differences in formats and noise types of multi-modal data. A combined 2D/3D visualisation of this registered data allows an integrated overview of the entire dataset. For such a visualisation a web-based context presents several advantages. In this paper we propose a unified framework for registration and visualisation of this type of visual media data. A new feature description and matching method is proposed, adaptively considering local geometry, semi-global geometry and colour information in the scene for more robust registration. The resulting registered 2D/3D multi-modal visual data is too large to be downloaded and viewed directly via the web browser while maintaining an acceptable user experience. Thus, we employ hierarchical techniques for compression and restructuring to enable efficient transmission and visualisation over the web, leading to interactive visualisation as registered point clouds, 2D images, and videos in the browser, improving on the current state-of-the-art techniques for web-based visualisation of big media data. This is the first unified 3D web-based visualisation of multi-modal visual media production datasets. The proposed pipeline is tested on big multi-modal datasets typical of film and broadcast production which are made publicly available. The proposed feature description method shows two times higher precision of feature matching and more stable registration performance than existing 3D feature descriptors.
In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.
Complete scene reconstruction from single view RGBD is a challenging task, requiring estimation of scene regions occluded from the captured depth surface. We propose that scene-centric analysis of human motion within an indoor scene can reveal fully occluded objects and provide functional cues to enhance scene understanding tasks. Captured skeletal joint positions of humans, utilised as naturally exploring active sensors, are projected into a human-scene motion representation. Inherent body occupancy is leveraged to carve a volumetric scene occupancy map initialised from captured depth, revealing a more complete voxel representation of the scene. To obtain a structured box model representation of the scene, we introduce unique terms to an object detection optimisation that overcome depth occlusions whilst deriving from the same depth data. The method is evaluated on challenging indoor scenes with multiple occluding objects such as tables and chairs. Evaluation shows that human-centric scene analysis can be applied to effectively enhance state-of-the-art scene understanding approaches, resulting in a more complete representation than single view depth alone.
Existing work on animation synthesis can be roughly split into two approaches, those that combine segments of motion capture data, and those that perform inverse kinematics. In this paper, we present a method for performing animation synthesis of an articulated object (e.g. human body and a dog) from a minimal set of body joint positions, following the approach of inverse kinematics. We tackle this problem from a learning perspective. Firstly, we address the need for knowledge on the physical constraints of the articulated body, so as to avoid the generation of physically impossible poses. A common solution is to heuristically specify the kinematic constraints for the skeleton model. In this paper however, the physical constraints of the articulated body are represented using a hierarchical cluster model learnt from a motion capture database. Additionally, we shall show that the learnt model automatically captures the correlation between different joints through the simultaneous modelling of their angles. We then show how this model can be utilised to perform inverse kinematics in a simple and efficient manner. Crucially, we describe how IK is carried out from a minimal set of end-effector positions. Following this, we show how this "learnt inverse kinematics" framework can be used to perform animation syntheses of different types of articulated structures. To this end, the results presented include the retargeting of a flat surface walking animation to various uneven terrains to demonstrate the synthesis of a full human body motion from the positions of only the hands, feet and torso. Additionally, we show how the same method can be applied to the animation synthesis of a dog using only its feet and torso positions.
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at
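A schematic of the dual-stream fusion described above, written as a PyTorch sketch; the layer sizes, input dimensions and single fusion layer are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class DualStreamPose(nn.Module):
    """Schematic dual-stream fusion: an MVV (volumetric) pose embedding
    stream and an IMU forward-kinematics stream, each with a temporal LSTM,
    fused in a final fully connected layer (dimensions illustrative)."""
    def __init__(self, vol_dim=1024, imu_dim=78, hidden=256, joints=21):
        super().__init__()
        self.joints = joints
        self.vol_lstm = nn.LSTM(vol_dim, hidden, batch_first=True)
        self.imu_lstm = nn.LSTM(imu_dim, hidden, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, joints * 3)

    def forward(self, vol_seq, imu_seq):
        # vol_seq: (B, T, vol_dim) embeddings from a 3D CNN over the PVH
        # imu_seq: (B, T, imu_dim) forward-kinematic joint solve from IMUs
        v, _ = self.vol_lstm(vol_seq)
        i, _ = self.imu_lstm(imu_seq)
        fused = torch.cat([v[:, -1], i[:, -1]], dim=-1)  # last time step
        return self.fuse(fused).view(-1, self.joints, 3) # 3D joint positions
```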
In this paper we propose a cuboid-based air-tight indoor room geometry estimation method using a combination of audio-visual sensors. Existing vision-based 3D reconstruction methods are not applicable for scenes with transparent or reflective objects such as windows and mirrors. In this work we fuse multi-modal sensory information to overcome the limitations of purely visual reconstruction for reconstruction of complex scenes including transparent and mirror surfaces. A full scene is captured by 360° cameras and acoustic room impulse responses (RIRs) recorded by a loudspeaker and compact microphone array. Depth information of the scene is recovered by stereo matching from the captured images and estimation of major acoustic reflector locations from the sound. The coordinate systems for audio-visual sensors are aligned into a unified reference frame and plane elements are reconstructed from audio-visual data. Finally cuboid proxies are fitted to the planes to generate a complete room model. Experimental results show that the proposed system generates complete representations of the room structures regardless of transparent windows, featureless walls and shiny surfaces.
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multi-view dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor environments.
In an object-based spatial audio system, the positions of the audio objects (e.g. speakers/talkers or voices) presented in the sound scene are required as important metadata attributes for object acquisition and reproduction. Binaural microphones are often used as a physical device to mimic human hearing and to monitor and analyse the scene, including localisation and tracking of multiple speakers. The binaural audio tracker, however, is usually prone to the errors caused by room reverberation and background noise. To address this limitation, we present a multimodal tracking method by fusing the binaural audio with depth information (from a depth sensor, e.g., Kinect). More specifically, the PHD filtering framework is first applied to the depth stream, and a novel clutter intensity model is proposed to improve the robustness of the PHD filter when an object is occluded either by other objects or due to the limited field of view of the depth sensor. To compensate for mis-detections in the depth stream, a novel gap filling technique is presented to map audio azimuths obtained from the binaural audio tracker to 3D positions, using speaker-dependent spatial constraints learned from the depth stream. With our proposed method, both the errors in the binaural tracker and the mis-detections in the depth tracker can be significantly reduced. Real-room recordings are used to show the improved performance of the proposed method in removing outliers and reducing mis-detections.
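The gap-filling step can be sketched as a simple geometric mapping from a binaural azimuth to a 3D position under speaker-dependent spatial constraints; the dictionary keys and fixed range/height prior below are illustrative assumptions, whereas the method above learns these constraints from the depth stream.

```python
import numpy as np

def azimuth_to_3d(azimuth, spatial_prior):
    """Map a binaural azimuth estimate (radians) to a 3D position using a
    speaker-dependent spatial prior (illustrative names and fixed values;
    learned online from the depth stream in the full method)."""
    r = spatial_prior["range"]          # expected distance from the sensor
    h = spatial_prior["height"]         # expected height of the speaker's head
    x = r * np.sin(azimuth)             # lateral position from azimuth angle
    z = r * np.cos(azimuth)             # depth along the sensor axis
    return np.array([x, h, z])

# Example: a speaker 30 degrees to the right, about 2 m away, 1.6 m tall.
print(azimuth_to_3d(np.deg2rad(30), {"range": 2.0, "height": 1.6}))
```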
Coleman Philip, Franck A, Francombe Jon, Liu Qingju, de Campos Teofilo, Hughes R, Menzies D, Simon Galvez M, Tang Y, Woodcock J, Jackson Philip, Melchior F, Pike C, Fazi F, Cox T, Hilton Adrian (2018) An Audio-Visual System for Object-Based Audio: From Recording to Listening, IEEE Transactions on Multimedia 20(8) pp. 1919-1931
Object-based audio is an emerging representation for audio content, where content is represented in a reproduction-format-agnostic way and thus produced once for consumption on many different kinds of devices. This affords new opportunities for immersive, personalized, and interactive listening experiences. This article introduces an end-to-end object-based spatial audio pipeline, from sound recording to listening. A high-level system architecture is proposed, which includes novel audio-visual interfaces to support object-based capture and listener-tracked rendering, and incorporates a proposed component for objectification, i.e., recording content directly into an object-based form. Text-based and extensible metadata enable communication between the system components. An open architecture for object rendering is also proposed.
The system's capabilities are evaluated in two parts. First, listener-tracked reproduction of metadata automatically estimated from two moving talkers is evaluated using an objective binaural localization model. Second, object-based scene capture with audio extracted using blind source separation (to remix between two talkers) and beamforming (to remix a recording of a jazz group) is evaluated with perceptually-motivated objective and subjective experiments. These experiments demonstrate that the novel components of the system add capabilities beyond the state of the art. Finally, we discuss challenges and future perspectives for object-based audio workflows.
Coleman Philip, Franck Andreas, Francombe Jon, Liu Qingju, de Campos Teofilo, Hughes Richard, Menzies Dylan, Simón Gálvez Marcos, Tang Yan, Woodcock James, Melchior Frank, Pike Chris, Fazi Filippo, Cox Trevor, Hilton Adrian, Jackson Philip (2018) S3A Audio-Visual System for Object-Based Audio,
University of Surrey
We propose a generalisation of the local feature matching framework, where keypoints are replaced by k-keygraphs, i.e., isomorphic directed attributed graphs of cardinality k whose vertices are keypoints. Keygraphs have structural and topological properties which are discriminative and efficient to compute, based on graph edge length and orientation as well as vertex scale and orientation. Keypoint matching is performed based on descriptor similarity. Next, 2-keygraphs are calculated; as a result, the number of incorrect keypoint matches was reduced by 75% (while the correct keypoint matches were preserved). Then, 3-keygraphs are calculated, followed by 4-keygraphs; this yielded a significant reduction of 99% in the number of remaining incorrect keypoint matches. The stage that finds 2-keygraphs has a computational cost equal to a small fraction of the cost of the keypoint matching stage, while the stages that find 3-keygraphs or 4-keygraphs have a negligible cost. In the final stage, RANSAC finds object poses represented as affine transformations mapping images. Our experiments concern large-scale object instance recognition subject to occlusion, background clutter and appearance changes. By using 4-keygraphs, RANSAC needed 1% of the iterations in comparison with 2-keygraphs or simple keypoints. As a result, using 4-keygraphs provided better efficiency and allowed a larger number of initial keypoint matches to be established, which increased performance.
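A simplified sketch of the 2-keygraph consistency test, assuming keypoint coordinates and tentative descriptor matches are given as NumPy arrays and (index, index) tuples; the thresholds and the exact structural/topological properties used are illustrative, and the full method extends the same idea to cardinalities 3 and 4.

```python
import numpy as np

def filter_2keygraphs(kp_a, kp_b, matches, len_tol=0.25, ang_tol=0.35):
    """Keep keypoint matches supported by at least one consistent 2-keygraph:
    the edge joining two matched keypoints should have compatible relative
    length and orientation in both images (illustrative simplification)."""
    good = set()
    m = [tuple(p) for p in matches]              # [(idx_a, idx_b), ...]
    for p in range(len(m)):
        for q in range(p + 1, len(m)):
            ea = kp_a[m[q][0]] - kp_a[m[p][0]]   # edge in image A
            eb = kp_b[m[q][1]] - kp_b[m[p][1]]   # corresponding edge in B
            la, lb = np.linalg.norm(ea), np.linalg.norm(eb)
            if la < 1e-6 or lb < 1e-6:
                continue
            ratio_ok = abs(np.log(lb / la)) < len_tol      # edge length ratio
            ang = np.arccos(np.clip(ea @ eb / (la * lb), -1.0, 1.0))
            if ratio_ok and ang < ang_tol:                 # edge orientation
                good.update({m[p], m[q]})
    return sorted(good)
```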
Remaggi Luca, Kim Hansung, Jackson Philip, Fazi Filippo Maria, Hilton Adrian (2018) Acoustic reflector localization and classification,Proceedings of ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Institute of Electrical and Electronics Engineers (IEEE)
The process of understanding acoustic properties of environments is important for several applications, such as spatial audio, augmented reality and source separation. In this paper, multichannel room impulse responses are recorded and transformed into their direction of arrival (DOA)-time domain, by employing a superdirective beamformer. This domain can be represented as a 2D image. Hence, a novel image processing method is proposed to analyze the DOA-time domain, and estimate the reflection times of arrival and DOAs. The main acoustically reflective objects are then localized. Recent studies in acoustic reflector localization usually assume the room to be free from furniture. Here, by analyzing the scattered reflections, an algorithm is also proposed to binary classify reflectors into room boundaries and interior furniture. Experiments were conducted in four rooms. The classification algorithm showed high quality performance, also improving the localization accuracy, for non-static listener scenarios.
The sequential Monte Carlo probability hypothesis density (SMC-PHD) filter has been shown to be promising for audio-visual multi-speaker tracking. Recently, the zero diffusion particle flow (ZPF) has been used to mitigate the weight degeneracy problem in the SMC-PHD filter. However, this leads to a substantial increase in the computational cost due to the migration of particles from prior to posterior distribution with a partial differential equation. This paper proposes an alternative method based on the non-zero diffusion particle flow (NPF) to adjust the particle states by fitting the particle distribution with the posterior probability density using the non-zero diffusion. This property allows efficient computation of the migration of particles. Results from the AV16.3 dataset demonstrate that we can significantly mitigate the weight degeneracy problem with a smaller computational cost as compared with the ZPF based SMC-PHD filter.
Francombe Jon, Woodcock James, Hughes Richard J., Mason Russell, Franck Andreas, Pike Chris, Brookes Tim, Davies William J., Jackson Philip J.B., Cox Trevor J., Fazi Filippo M., Hilton Adrian (2018) Qualitative evaluation of media device orchestration for immersive spatial audio reproduction,Journal of the Audio Engineering Society66(6)pp. 414-429
Audio Engineering Society
The challenge of installing and setting up dedicated spatial audio systems can make it difficult to deliver immersive listening experiences to the general public. However, the proliferation of smart mobile devices and the rise of the Internet of Things mean that there are increasing numbers of connected devices capable of producing audio in the home. "Media device orchestration" (MDO) is the concept of utilizing an ad hoc set of devices to deliver or augment a media experience. In this paper, the concept is evaluated by implementing MDO for augmented spatial audio reproduction using object-based audio with semantic metadata. A thematic analysis of positive and negative listener comments about the system revealed three main categories of response: perceptual, technical, and content-dependent aspects. MDO performed particularly well in terms of immersion/envelopment, but the quality of listening experience was partly dependent on loudspeaker quality and listener position. Suggestions for further development based on these categories are given.
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4X, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.
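A schematic PyTorch sketch of a symmetric convolutional autoencoder with the dual loss described above; layer sizes, latent dimensionality and loss weighting are illustrative assumptions, not the published network.

```python
import torch
import torch.nn as nn

class DualLossAE(nn.Module):
    """Schematic symmetric 3D convolutional autoencoder with a dual loss:
    the latent code is regressed to skeletal joint positions while the
    decoder reconstructs volumetric body shape (dimensions illustrative)."""
    def __init__(self, joints=21):
        super().__init__()
        self.joints = joints
        self.enc = nn.Sequential(                       # 32^3 -> 8^3 volume
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU())
        self.pose = nn.Linear(32 * 8 * 8 * 8, joints * 3)   # skeletal branch
        self.dec = nn.Sequential(                           # shape branch
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1))

    def forward(self, vol):                 # vol: (B, 1, 32, 32, 32) occupancy
        z = self.enc(vol)
        pose = self.pose(z.flatten(1)).view(-1, self.joints, 3)
        recon = self.dec(z)
        return pose, recon

# Dual loss: joint-position regression plus volumetric reconstruction, e.g.
# loss = mse(pose, gt_pose) + lam * mse(recon, gt_volume)
```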
Recent advances in sensor technology have introduced low-cost RGB video plus depth sensors, such as the Kinect, which enable simultaneous acquisition of colour and depth images at video rates. This paper introduces a framework for representation of general dynamic scenes from video plus depth acquisition. A hybrid representation is proposed which combines the advantages of prior surfel graph surface segmentation and modelling work with the higher-resolution surface reconstruction capability of volumetric fusion techniques. The contributions are (1) extension of a prior piecewise surfel graph modelling approach for improved accuracy and completeness, (2) combination of this surfel graph modelling with TSDF surface fusion to generate dense geometry, and (3) proposal of means for validation of the reconstructed 4D scene model against the input data and efficient storage of any unmodelled regions via residual depth maps. The approach allows arbitrary dynamic scenes to be efficiently represented with temporally consistent structure and enhanced levels of detail and completeness where possible, but gracefully falls back to raw measurements where no structure can be inferred. The representation is shown to facilitate creative manipulation of real scene data which would previously require more complex capture setups or manual processing.
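Contribution (2) builds on standard weighted-average TSDF fusion, which can be sketched per update step as follows; the array names and truncation value are illustrative.

```python
import numpy as np

def tsdf_update(tsdf, weight, depth_sdf, new_w=1.0, trunc=0.05):
    """One weighted-average TSDF fusion step: per-voxel truncated signed
    distances from a new depth frame are blended into the running volume
    (standard formulation, used here to densify the surfel graph model)."""
    d = np.clip(depth_sdf, -trunc, trunc)        # truncate signed distances
    valid = np.abs(depth_sdf) < trunc            # voxels seen by this frame
    tsdf[valid] = (tsdf[valid] * weight[valid] + d[valid] * new_w) / \
                  (weight[valid] + new_w)        # running weighted average
    weight[valid] += new_w
    return tsdf, weight
```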
In applications such as virtual and augmented reality, a plausible and coherent audio-visual reproduction can be achieved by deeply understanding the reference scene acoustics. This requires knowledge of the scene geometry and related materials. In this paper, we present an audio-visual approach for acoustic scene understanding. We propose a novel material recognition algorithm that exploits information carried by acoustic signals. The acoustic absorption coefficients are selected as features. The training dataset was constructed by combining information available in the literature, and additional labeled data that we recorded in a small room having short reverberation time (RT60). Classic machine learning methods are used to validate the model, by employing data recorded in five rooms, having different sizes and RT60s. The estimated materials are utilized to label room boundaries, reconstructed by a vision-based method. Results show 89% and 80% agreement between the estimated and reference room volumes and materials, respectively.
Woodcock James, Francombe Jon, Franck Andreas, Coleman Philip, Hughes Richard, Kim Hansung, Liu Qingju, Menzies Dylan, Simón Gálvez Marcos F, Tang Yan, Brookes Tim, Davies William J, Fazenda Bruno M, Mason Russell, Cox Trevor J, Fazi Filippo Maria, Jackson Philip, Pike Chris, Hilton Adrian (2018) A Framework for Intelligent Metadata Adaptation in Object-Based Audio, AES E-Library pp. P11-3
Audio Engineering Society
Object-based audio can be used to customize, personalize, and optimize audio reproduction depending on the specific listening scenario. To investigate and exploit the benefits of object-based audio, a framework for intelligent metadata adaptation was developed. The framework uses detailed semantic metadata that describes the audio objects, the loudspeakers, and the room. It features an extensible software tool for real-time metadata adaptation that can incorporate knowledge derived from perceptual tests and/or feedback from perceptual meters to drive adaptation and facilitate optimal rendering. One use case for the system is demonstrated through a rule-set (derived from perceptual tests with experienced mix engineers) for automatic adaptation of object levels and positions when rendering 3D content to two- and five-channel systems.
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset, the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online.
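The two-stream fusion can be sketched in PyTorch as follows; all layer sizes, channel counts and the toy forward pass are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class TwoStreamPose(nn.Module):
    """Sketch of the two-stream idea: a 3D CNN embeds the volumetric PVH,
    an LSTM models the temporal structure of the IMU-solved joints, and a
    final fully connected layer fuses both streams into 3D joint positions."""
    def __init__(self, n_joints=21, grid=32, imu_dim=63, hidden=256):
        super().__init__()
        self.n_joints = n_joints
        self.vol = nn.Sequential(                                  # PVH stream
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * (grid // 4) ** 3, hidden))
        self.imu = nn.LSTM(imu_dim, hidden, batch_first=True)      # IMU stream
        self.fuse = nn.Linear(2 * hidden, n_joints * 3)            # fusion layer

    def forward(self, pvh, imu_seq):
        # pvh: (B, 1, G, G, G) occupancy; imu_seq: (B, T, imu_dim) solved joints
        v = self.vol(pvh)
        h, _ = self.imu(imu_seq)
        joints = self.fuse(torch.cat([v, h[:, -1]], dim=1))
        return joints.view(-1, self.n_joints, 3)

model = TwoStreamPose()
pose = model(torch.zeros(2, 1, 32, 32, 32), torch.zeros(2, 10, 63))  # (2, 21, 3)
```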
A common problem in wide-baseline matching is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST, A-KAZE and MSER. In this paper we introduce a novel segmentation based feature detector (SFD) that produces an increased number of accurate features for wide-baseline matching. A multi-scale SFD is proposed using bilateral image decomposition to produce a large number of scale-invariant features for wide-baseline reconstruction. All input images are over-segmented into regions using any existing segmentation technique like Watershed, Mean-shift, and SLIC. Feature points are then detected at the intersection of the boundaries of three or more regions. The detected feature points are local maxima of the image function. The key advantage of feature detection based on segmentation is that it does not require global threshold setting and can therefore detect features throughout the image. A comprehensive evaluation demonstrates that SFD gives an increased number of features which are accurately localised and matched between wide-baseline camera views; the number of features for a given matching error increases by a factor of 3-5 compared to SIFT; feature detection and matching performance is maintained with increasing baseline between views; multi-scale SFD improves matching performance at varying scales. Application of SFD to sparse multi-view wide-baseline reconstruction demonstrates a factor of ten increase in the number of reconstructed points with improved scene coverage compared to SIFT/MSER/A-KAZE. Evaluation against ground-truth shows that SFD produces an increased number of wide-baseline matches with reduced error.
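The junction-detection step of SFD can be sketched directly: over-segment the image, then keep pixels where three or more region labels meet. The sketch below uses SLIC as the segmenter and omits the paper's refinement of features to local maxima of the image function.

```python
import numpy as np
from skimage.segmentation import slic

def sfd_keypoints(image, n_segments=2000):
    """Sketch of segmentation-based feature detection: over-segment the image
    (any method works; SLIC here), then keep pixels where three or more
    region labels meet in a 2x2 window."""
    labels = slic(image, n_segments=n_segments, compactness=10)
    a, b = labels[:-1, :-1], labels[:-1, 1:]
    c, d = labels[1:, :-1], labels[1:, 1:]
    quad = np.sort(np.stack([a, b, c, d], axis=-1), axis=-1)
    n_regions = (quad[..., 1:] != quad[..., :-1]).sum(axis=-1) + 1
    ys, xs = np.nonzero(n_regions >= 3)          # junctions of >= 3 regions
    return np.stack([xs, ys], axis=1)            # candidate (x, y) features
```

Because the test is purely local on the label image, it needs no global threshold, which is why features are detected throughout the image rather than only in highly textured areas.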
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
This paper presents a framework for creating realistic virtual characters
that can be delivered via the Internet and interactively controlled
in a WebGL enabled web-browser. Four-dimensional performance
capture is used to capture realistic human motion and appearance.
The captured data is processed into efficient and compact
representations for geometry and texture. Motions are analysed
against a high-level, user-defined motion graph and suitable
inter- and intra-motion transitions are identified. This processed data is stored on a webserver and downloaded by a client application, which manages the state of the character in response to user input and sends the required frames to a WebGL-based renderer for display. Through the efficient geometry, texture and motion graph
representations, a game character capable of performing a range of
motions can be represented in 40-50 MB of data. This highlights
the potential use of four-dimensional performance capture for creating
web-based content. Datasets are made available for further
research and an online demo is provided.
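A minimal sketch of the client-side motion-graph playback described above; the motion names, frame counts and transition frames are invented for illustration.

```python
class MotionGraphPlayer:
    """Minimal motion-graph playback: each motion is a frame count, and
    transitions map (from, to) to the frames at which a switch is allowed."""
    def __init__(self, motions, transitions, start='idle'):
        self.motions = motions            # e.g. {'idle': 120, 'walk': 90}
        self.transitions = transitions    # e.g. {('idle', 'walk'): {30, 60, 119}}
        self.state, self.frame = start, 0

    def step(self, requested=None):
        """Advance one frame; switch motion at the next permitted transition."""
        if requested and requested != self.state and \
           self.frame in self.transitions.get((self.state, requested), set()):
            self.state, self.frame = requested, 0
        else:
            self.frame = (self.frame + 1) % self.motions[self.state]
        return self.state, self.frame     # frame to request from the renderer

player = MotionGraphPlayer({'idle': 120, 'walk': 90},
                           {('idle', 'walk'): {30, 60, 119}})
```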
This paper presents a hybrid skeleton-driven surface registration
(HSDSR) approach to generate temporally consistent
meshes from multiple view video of human subjects.
2D pose detections from multiple view video are used to
estimate 3D skeletal pose on a per-frame basis. The 3D
pose is embedded into a 3D surface reconstruction allowing
any frame to be reposed into the shape from any other
frame in the captured sequence. Skeletal motion transfer
is performed by selecting a reference frame from the surface
reconstruction data and reposing it to match the pose
estimation of other frames in a sequence. This allows an
initial coarse alignment to be performed prior to refinement
by a patch-based non-rigid mesh deformation. The
proposed approach overcomes limitations of previous work
by reposing a reference mesh to match the pose of a target
mesh reconstruction, providing a closer starting point
for further non-rigid mesh deformation. It is shown that the
proposed approach is able to achieve comparable results to
existing model-based and model-free approaches. Finally,
it is demonstrated that this framework provides an intuitive
way for artists and animators to edit volumetric video.
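The coarse reposing step can be pictured as standard linear blend skinning from the reference pose to the target pose, as in this illustrative numpy sketch; the paper's skeletal motion transfer is more involved than this.

```python
import numpy as np

def repose_reference(vertices, weights, rest_bones, target_bones):
    """Move reference-mesh vertices to a target skeletal pose with linear
    blend skinning, giving the coarse alignment that is then refined by
    patch-based non-rigid deformation.
    vertices: (V, 3); weights: (V, B); rest/target_bones: (B, 4, 4)."""
    vh = np.c_[vertices, np.ones(len(vertices))]               # homogeneous
    # Per-bone transform mapping rest-pose points into the target pose.
    delta = np.einsum('bij,bjk->bik', target_bones, np.linalg.inv(rest_bones))
    skinned = np.einsum('vb,bij,vj->vi', weights, delta, vh)   # blended per vertex
    return skinned[:, :3]
```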
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information, which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audio-visual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.
In this paper, we propose an approach to indoor scene understanding from observation of people in single view spherical video. As input, our approach takes a centrally located spherical video capture of an indoor scene, estimating the 3D localisation of human actions performed throughout the long term capture. The central contribution of this work is a deep convolutional encoder-decoder network trained on a synthetic dataset to reconstruct regions of affordance from captured human activity. The predicted affordance segmentation is then applied to compose a reconstruction of the complete 3D scene, integrating the affordance segmentation into 3D space. The mapping learnt between human activity and affordance segmentation demonstrates that omnidirectional observation of human activity can be applied to scene understanding tasks such as 3D reconstruction. We show that our approach using only observation of people performs well against previous approaches, allowing reconstruction of occluded regions and labelling of scene affordances.
Shape information has been recognised as playing a role in intrinsic image estimation since its inception. However, it is only in recent years that hints of the importance of
geometry have been found in decomposing surface appearance into albedo and shading estimates. This thesis establishes the central importance of shape in intrinsic surface
property estimation for static and dynamic scenes, and introduces methods for the use of approximate shape in a wide range of related problems to provide high-level constraints on shading.
A key contribution is intrinsic texture estimation. This is a generalisation of intrinsic image estimation, in which appearance is processed as a function of surface position
rather than pixel position. This approach has numerous advantages, in that the shape can be used to resolve occlusion, inter-reflection and attached shading as a natural part of the method. Unlike previous bidirectional texture function estimation approaches, high-quality albedo and shading textures are produced without prior knowledge of
materials or lighting.
Many of the concepts in intrinsic texture estimation can be extended to single-viewpoint capture for which depth information is available. Depth information greatly reduces the ambiguity of the shading estimation problem, allowing online intrinsic video to be developed for the first time. The availability of a lighting function also allows high-level temporal constraints on shading to be applied over video sequences, which previously required per-pixel correspondence between frames to be established. A number of applications of intrinsic video are investigated, including augmented reality, video stylisation and relighting, all of which run at interactive framerates. The albedo distribution of the input video is preserved, even in the case of natural scenes with complex appearance, and a globally-consistent shading estimate is obtained which remains robust over dynamic sequences.
Finally, an integrated framework bridging the gaps between intrinsic image, video and texture estimation is presented for the first time. Approximate scene geometry provides a convenient means of achieving this, and is used in establishing pixel constraints between adjacent cameras, reconstructing scene lighting, and removing cast shadows and inter-reflections. This introduces a unified geometry-based approach to intrinsic image estimation and related fields, which achieves high-quality results for complex natural scenes for a wide range of capture modalities.
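The constraint underpinning all of the above is the Lambertian decomposition I = A · S. A toy sketch using approximate geometry follows; the single directional light and ambient term are simplifying assumptions, not the thesis method, which adds the regularisation and lighting estimation needed for real scenes.

```python
import numpy as np

def intrinsic_decompose(image, normals, light_dir, ambient=0.2):
    """Split an image into albedo and shading given approximate geometry:
    S = ambient + max(0, n . l), then A = I / S.
    image: (H, W, 3); normals: (H, W, 3) unit normals; light_dir: (3,)."""
    shading = ambient + np.clip((normals * light_dir).sum(axis=-1), 0.0, None)
    albedo = image / np.maximum(shading[..., None], 1e-3)
    return albedo, shading
```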
This thesis addresses the problem of modeling human shape in three dimensions. Specifically, this thesis is focused on modeling body shape variation across multiple individuals, pose induced shape deformations and garment deformations that are influenced both by body shape and pose. A methodology for constructing data driven models of human body and garment deformation is provided. Additionally, an application for online fashion retailing is presented.
Firstly, a quantitative and qualitative evaluation of surface representations used in recent statistical models of human shape and pose is introduced. It is shown that the Euclidean representation generates a more compact human shape model compared to other representations. A small number of model parameters indicates better convergence in a human body estimation framework. In contrast, a high number of model parameters increases the risk of the optimization getting trapped in a local optimum. Based on these insights, a system for human body shape estimation and classification for on-line fashion applications is presented. Given a single image of a subject and the subject's height and weight, the proposed framework is able to estimate the 3D human body shape using a learnt statistical model. Results demonstrate that a single image holds sufficient information for accurate shape classification. This technology has been exploited as part of a collaborative project with fashion designers to develop a mobile app to classify body shape for clothing recommendation in online fashion retail.
Next, Shape and Pose Space Deformation (SPSD) is presented, a technique for modeling subject-specific pose-induced deformations. By exploiting examples of different people in multiple poses, plausible animations of novel subjects can be synthesized by interpolating and extrapolating in a joint shape and pose parameter space. The results show that greater detail is achieved by incorporating subject-specific pose deformations as opposed to a subject-independent pose model. Finally, SPSD is extended to a three-layered data-driven model of human shape, pose and garment deformation. Each layer represents the deformation of a template mesh and can be controlled independently and intuitively. The garment deformation layer is trained on sequences of dressed actors and relies on a novel technique for human shape and posture estimation under clothing.
Liu Yang, Wang Wenwu, Chambers Jonathon, Kilic Volkan, Hilton Adrian (2017) Particle flow SMC-PHD filter for audio-visual multi-speaker tracking, In: Tichavský P, Babaie-Zadeh M, Michel O, Thirion-Moreau N (eds.), Latent Variable Analysis and Signal Separation (LVA/ICA 2017), Proceedings of the 13th International Conference on Latent Variable Analysis and Signal Separation, Grenoble, France, February 21-23, 2017, 10169, pp. 344-353
Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering has recently been exploited for audio-visual (AV) based tracking of multiple speakers, where audio data are used to inform the particle distribution and propagation in the visual SMC-PHD filter. However, the performance of the AV-SMC-PHD filter can be affected by the mismatch between the proposal and the posterior distribution. In this paper, we present a new method to improve the particle distribution, where audio information (i.e. DOA angles derived from microphone array measurements) is used to detect newborn particles and visual information (i.e. histograms) is used to modify the particles with particle flow (PF). Particle flow has the benefit of migrating particles smoothly from the prior to the posterior distribution. We compare the proposed algorithm with the baseline AV-SMC-PHD algorithm using experiments on the AV16.3 dataset with multi-speaker sequences.
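For reference, the exact Daum-Huang flow for a linear-Gaussian measurement model z = Hx + v with prior N(x̄, P) integrates dx/dλ = A(λ)x + b(λ) over λ ∈ [0, 1]. A standalone numpy sketch of that flow follows; the paper embeds particle flow inside the AV-SMC-PHD filter, which is not shown here.

```python
import numpy as np

def daum_huang_flow(particles, xbar, P, H, R, z, n_steps=20):
    """Migrate particles from the prior towards the posterior by Euler
    integration of the exact Daum-Huang flow (linear-Gaussian case):
      A = -0.5 P H^T (lam H P H^T + R)^-1 H
      b = (I + 2 lam A)((I + lam A) P H^T R^-1 z + A xbar)"""
    I = np.eye(P.shape[0])
    lam_grid = np.linspace(0.0, 1.0, n_steps + 1)
    for l0, l1 in zip(lam_grid[:-1], lam_grid[1:]):
        lam, dlam = 0.5 * (l0 + l1), l1 - l0
        S = lam * H @ P @ H.T + R
        A = -0.5 * P @ H.T @ np.linalg.solve(S, H)
        b = (I + 2 * lam * A) @ ((I + lam * A) @ P @ H.T @ np.linalg.solve(R, z)
                                 + A @ xbar)
        particles = particles + dlam * (particles @ A.T + b)    # Euler step
    return particles
```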
Liu Qingju, de Campos T, Wang Wenwu, Jackson Philip, Hilton Adrian (2016) Person tracking using audio and depth cues, International Conference on Computer Vision (ICCV) Workshop on 3D Reconstruction and Understanding with Video and Sound, pp. 709-717
In this paper, a novel probabilistic Bayesian tracking scheme is proposed and applied to bimodal measurements consisting of tracking results from the depth sensor and audio recordings collected using binaural microphones. We use random finite sets to cope with varying number of tracking targets. A measurement-driven birth process is integrated to quickly localize any emerging person. A new bimodal fusion method that prioritizes the most confident modality is employed. The approach was tested on real room recordings and experimental results show that the proposed combination of audio and depth outperforms individual modalities, particularly when there are multiple people talking simultaneously and when occlusions are frequent.
The work on 3D human pose estimation has seen a significant amount of progress in recent years, particularly due to the widespread availability of commodity depth sensors. However, most pose estimation methods follow a tracking-as-detection approach which does not explicitly handle occlusions, thus introducing outliers and identity association issues when multiple targets are involved. To address these issues, we propose a new method based on Probability Hypothesis Density (PHD) filter. In this method, the PHD filter with a novel clutter intensity model is used to remove outliers in the 3D head detection results, followed by an identity association scheme with occlusion detection for the targets. Experimental results show that our proposed method greatly mitigates the outliers, and correctly associates identities to individual detections with low computational cost.
Over the past decade, markerless performance capture, through multiple synchronised cameras, has emerged as an alternative to traditional motion capture techniques, allowing the simultaneous acquisition of shape, motion and appearance. This technology is capable of capturing the subtle details of human motion, e.g. clothing, skin and hair dynamics, which cannot be achieved through current marker based capture techniques. Markerless performance capture has the potential to revolutionise digital content creation in many creative industries, but must overcome several hurdles before it can be seen as a practical mainstream technology. One limitation of the technology is the enormous size of the generated data. This thesis addresses issues surrounding compact appearance representation of virtual characters generated through markerless performance capture, optimisation of the underlying 3D geometry and delivery of interactive content over the internet.
Current approaches to multiple camera texture representation effectively reduce the storage requirements by discarding huge amounts of view dependent and dynamic appearance information. This information is important for reproducing the realism of the captured multiple view video. The first contribution of this thesis introduces a novel multiple layer texture representation (MLTR) for multiple view video. The MLTR preserves dynamic, view dependent appearance information by resampling the captured frames into a hierarchical set of texture maps ordered by surface visibility. The MLTR also enables computationally efficient view-dependent rendering by pre-computing visibility testing and reduces projective texturing to a simple texture lookup. The representation is quantitatively evaluated and shown to reduce the storage cost by > 90% without a significant effect on visual quality.
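One way to picture the pre-computed visibility ordering is as a stack of texture layers plus per-texel capture directions, so that view-dependent rendering reduces to a lookup and a blend. The numpy sketch below is a loose illustration under that assumed layout, not the MLTR data structure itself.

```python
import numpy as np

def render_mltr(layer_tex, layer_dirs, view_dir, uv):
    """Sketch of view-dependent rendering from visibility-ordered texture
    layers: per-texel colours from the k most-visible cameras are looked up
    and blended by similarity between view and capture directions.
    layer_tex: (k, H, W, 3); layer_dirs: (k, H, W, 3) unit capture directions;
    view_dir: (3,) unit view direction; uv: (..., 2) integer texel coords."""
    texels = layer_tex[:, uv[..., 1], uv[..., 0]]            # (k, ..., 3)
    dirs = layer_dirs[:, uv[..., 1], uv[..., 0]]             # (k, ..., 3)
    w = np.clip((dirs * view_dir).sum(axis=-1), 0.0, None)   # cosine weights
    w = w / np.maximum(w.sum(axis=0), 1e-8)
    return (w[..., None] * texels).sum(axis=0)               # blended colour
```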
The second contribution outlines the ideal properties for the optimal representation of 4D video and takes steps in achieving this goal. Using the MLTR, spatial and temporal consistency is enforced using a Markov random field framework, allowing video compression algorithms to make further storage reductions through increased spatial and temporal redundancies. An optical flow-based multiple camera alignment method is also introduced to reduce visual artefacts, such as blurring and ghosting, that are caused by approximate
geometry and camera calibration errors. This results in clearer and sharper textures with a lower storage footprint.
In order to facilitate high quality free-viewpoint rendering, two shape optimisation methods are proposed. The first combines the strengths of the visual hull, multiple view stereo and temporally consistent geometry to match visually important features using a non-rigid iterative closest point method. The second is based on a bundle adjustment formulation which jointly refines shape and calibration. While these methods achieve the objective of enhancing the geometry and/or camera calibration parameters, further research is required to improve the resulting shape.
Finally, it is shown how the methods developed in this thesis could be used to deliver interactive 4D video to consumers via a WebGL enabled internet browser, e.g. Firefox or Chrome. Existing methods for parametric motion graphs are adapted and combined with an efficient WebGL renderer to allow interactive 4D character delivery over the Internet. This demonstrates for the first time that 4D video has the potential to provide interactive content via the internet, which opens this technology up to the widest possible audience.
We describe a non-parametric algorithm for multiple-viewpoint video inpainting. Uniquely, our algorithm addresses the domain of wide baseline multiple-viewpoint video (MVV) with no temporal look-ahead in near real time speed. A Dictionary of Patches (DoP) is built using multi-resolution texture patches reprojected from geometric proxies available in the alternate views. We dynamically update the DoP over time, and a Markov Random Field optimisation over depth and appearance is used to resolve and align a selection of multiple candidates for a given patch, this ensures the inpainting of large regions in a plausible manner conserving both spatial and temporal coherence. We demonstrate the removal of large objects (e.g. people) on challenging indoor and outdoor MVV exhibiting cluttered, dynamic backgrounds and moving cameras.
Virtual Reality (VR) systems have been intensely explored, with several research communities investigating the
different modalities involved. Regarding the audio modality, one of the main issues is the generation of sound that
is perceptually coherent with the visual reproduction. Here, we propose a pipeline for creating plausible interactive
reverb using visual information: first, we characterize real environment acoustics given a pair of spherical cameras;
then, we reproduce reverberant spatial sound, by using the estimated acoustics, within a VR scene. The evaluation
is made by extracting the room impulse responses (RIRs) of four virtually rendered rooms. Results show agreement,
in terms of objective metrics, between the synthesized acoustics and the ones calculated from RIRs recorded within
the respective real rooms.
Learning visual representations plays an important role in computer vision and machine learning applications. It facilitates a model to understand and perform high-level tasks intelligently. A common approach to learning visual representations is a supervised one, which requires a huge amount of human annotation to train the model. This paper presents a self-supervised approach which learns visual representations from input images without human annotations. We learn the correct arrangement of object proposals to represent an image using a convolutional neural network (CNN) without any manual annotations. We hypothesize that the network trained for solving this problem requires the embedding of semantic visual representations. Unlike existing approaches that use uniformly sampled patches, we relate object proposals that contain prominent objects and object parts. More specifically, we discover the representation that considers overlap, inclusion, and exclusion relationships of proposals as well as their relative position. This allows focusing on potential objects and parts rather than on clutter. We demonstrate that our model outperforms existing self-supervised learning methods and can be used as a generic feature extractor by applying it to object detection, classification, action recognition, image retrieval, and semantic matching tasks.
Recent progress in Virtual Reality (VR) and Augmented Reality (AR) allows us to experience various VR/AR applications in our daily life. In order to maximise the immersiveness of the user in VR/AR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this paper, we propose a simple and efficient system to estimate room acoustics for plausible reproduction of spatial audio using 360° cameras for VR/AR applications. A pair of 360° images is used for room geometry and acoustic property estimation. A simplified 3D geometric model of the scene is estimated by depth estimation from captured images and semantic labelling using a convolutional neural network (CNN). The real environment acoustics are characterised by frequency-dependent acoustic predictions of the scene. Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties in the scene. The reconstructed scenes are rendered with synthesised spatial audio as VR/AR content. The results of estimated room geometry and simulated spatial audio are evaluated against the actual measurements and audio calculated from ground-truth Room Impulse Responses (RIRs) recorded in the rooms.
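The estimated geometry and materials can be tied together through Sabine's formula, RT60 = 0.161 V / Σᵢ Sᵢαᵢ, which predicts a reverberation time from room volume V, surface areas Sᵢ and absorption coefficients αᵢ. A minimal sketch with invented room dimensions and coefficients:

```python
def rt60_sabine(volume_m3, surfaces):
    """Sabine reverberation time from room volume and (area_m2, absorption)
    pairs for one frequency band; repeat per band for a frequency-dependent
    characterisation."""
    absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / absorption

# Illustrative 5 x 4 x 3 m room at a single octave band.
walls = 2 * (5 * 3 + 4 * 3)
floor_, ceiling = 5 * 4, 5 * 4
print(rt60_sabine(5 * 4 * 3, [(walls, 0.05), (floor_, 0.3), (ceiling, 0.1)]))  # ~0.9 s
```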
Malleson Charles, Bazin Jean-Charles, Wang Oliver, Bradley Derek, Beeler Thabo, Hilton Adrian, Sorkine-Hornung Alexander (2016) FaceDirector: Continuous Control of Facial Performance in Video, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), pp. 3979-3987
Institute of Electrical and Electronics Engineers (IEEE)
We present a method to continuously blend between multiple facial performances of an actor, which can contain different facial expressions or emotional states. As an example, given sad and angry video takes of a scene, our method empowers the movie director to specify arbitrary weighted combinations and smooth transitions between the two takes in post-production. Our contributions include (1) a robust nonlinear audio-visual synchronization technique that exploits complementary properties of audio and visual cues to automatically determine robust, dense spatiotemporal correspondences between takes, and (2) a seamless facial blending approach that provides the director full control to interpolate timing, facial expression, and local appearance, in order to generate novel performances after filming. In contrast to most previous works, our approach operates entirely in image space, avoiding the need of 3D facial reconstruction. We demonstrate that our method can synthesize visually believable performances with applications in emotion transition, performance correction, and timing control.
In order to maximise the immersion in VR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this work, we propose a pipeline to create plausible interactive audio from a pair of 360 degree cameras.
We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in
complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (approximately 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.
The rise of autonomous machines in our day-to-day lives has led to an increasing demand for machine perception of the real world to be more robust, accurate and human-like. Research in visual scene understanding over the past two decades has focused on machine perception in controlled environments, such as indoor scenes with static, rigid objects. There is a gap in the literature for machine perception in general complex scenes (outdoor, with multiple interacting people). The proposed research addresses the limitations of existing methods by proposing an unsupervised framework to simultaneously model, semantically segment and estimate motion for general dynamic scenes captured from multiple view videos with a network of static or moving cameras. In this talk I will explain the proposed joint framework to understand general dynamic scenes for machine perception; give a comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences; and demonstrate applications such as virtual, augmented and mixed reality (VR/AR/MR) and broadcast production (free-viewpoint video, FVV).
Malleson Charles, Guillemaut Jean-Yves, Hilton Adrian (2019) 3D Reconstruction from RGB-D Data, In: Rosin Paul L., Lai Yu-Kun, Shao Ling, Liu Yonghuai (eds.), RGB-D Image Analysis and Processing, pp. 87-115
Springer Nature Switzerland AG
A key task in computer vision is that of generating virtual 3D models
of real-world scenes by reconstructing the shape, appearance and, in the case of
dynamic scenes, motion of the scene from visual sensors. Recently, low-cost video
plus depth (RGB-D) sensors have become widely available and have been applied
to 3D reconstruction of both static and dynamic scenes. RGB-D sensors contain an
active depth sensor, which provides a stream of depth maps alongside standard colour
video. The low cost and ease of use of RGB-D devices as well as their video rate
capture of images along with depth make them well suited to 3D reconstruction. Use
of active depth capture overcomes some of the limitations of passive monocular or
multiple-view video-based approaches since reliable, metrically accurate estimates
of the scene depth at each pixel can be obtained from a single view, even in scenes
that lack distinctive texture. There are two key components to 3D reconstruction
from RGB-D data: (1) spatial alignment of the surface over time and (2) fusion
of noisy, partial surface measurements into a more complete, consistent 3D model.
In the case of static scenes, the sensor is typically moved around the scene and
its pose is estimated over time. For dynamic scenes, there may be multiple rigid,
articulated, or non-rigidly deforming surfaces to be tracked over time. The fusion
component consists of integration of the aligned surface measurements, typically
using an intermediate representation, such as the volumetric truncated signed distance
field (TSDF). In this chapter, we discuss key recent approaches to 3D reconstruction from depth or RGB-D input, with an emphasis on real-time reconstruction of static and dynamic scenes.
A real-time motion capture system is presented which uses input from multiple standard video cameras and inertial measurement units (IMUs). The system is able to track multiple people simultaneously and requires no optical markers, specialized infra-red cameras or foreground/background segmentation, making it applicable to general indoor and outdoor scenarios with dynamic backgrounds and lighting. To overcome limitations of prior video or IMU-only approaches, we propose to use flexible combinations of multiple-view, calibrated video and IMU input along with a pose prior in an online optimization-based framework, which allows the full 6-DoF motion to be recovered including axial rotation of limbs and drift-free global position. A method for sorting and assigning raw input 2D keypoint detections into corresponding subjects is presented which facilitates multi-person tracking and rejection of any bystanders in the scene. The approach is evaluated on data from several indoor and outdoor capture environments with one or more subjects and the trade-off between input sparsity and tracking performance is discussed. State-of-the-art pose estimation performance is obtained on the Total Capture (multi-view video and IMU) and Human 3.6M (multi-view video) datasets. Finally, a live demonstrator for the approach is presented showing real-time capture, solving and character animation using a light-weight, commodity hardware setup.
Visual scene understanding studies the task of representing a captured scene in a manner emulating human-like understanding of that space. Considering indoor scenes are designed for human use and are utilised everyday, attaining this understanding is crucial for applications such as robotic mapping and navigation, smart home and security systems, and home healthcare and assisted living. However, although we as humans utilise such spaces in our day-to-day lives, analysis of human activity is not commonly applied towards enhancing indoor scene-level understanding. As such, the work presented in this thesis investigates the benefits of including human activity information in indoor scene understanding challenges, aiming to demonstrate its potential contributions, applications, and versatility.
The first contribution of this thesis utilises human activity to reveal scene regions occluded behind objects and clutter. Human poses recognised from a static sensor are projected into a top-down scene representation recording belief of human activity over time. This representation is applied to carve a volumetric scene map, initialised on captured depth, to expose the occupancy of hidden scene regions. An object detection approach exploits the revealed occluded scene occupancy to localise self-, partially-, and, significantly, fully-occluded objects. The second contribution extends the top-down activity representation to predict the functionality of major scene surfaces from human activity recognised in 360 degree video. A convolutional network is trained on simulated human activity to segment walkable, sittable, and interactable surfaces from the top-down perspective. This prediction is applied to construct a complete scene 3D approximation, with results showing scene structure and surface functionality are predicted well from human activity alone. Finally, this thesis investigates an association between the top-down functionality prediction and the captured visual scene. A new dataset capturing long-term human activity is introduced to train a model on combined activity and visual scene information. The model is trained to segment functional scene surfaces from the capture sensor perspective, with evaluation establishing that the introduction of human activity information can improve functional surface segmentation performance.
Overall, the work presented in this thesis demonstrates that analysis of human activity can be applied to enhance indoor scene understanding across various challenges, sensors, and representations. Assorted datasets are introduced alongside the major contributions to motivate further investigation into its application.
Typical colour digital cameras have a single sensor with a colour filter array (CFA), each pixel capturing a single channel (red, green or blue). A full RGB colour output image is generated by demosaicing (DM), i.e. interpolating to infer the two unobserved channels for each pixel. The DM approach used can have a significant effect on the quality of the output image, particularly in the presence of common imaging artifacts such as chromatic aberration (CA). Small differences in the focal length for each channel (lateral CA) and the inability of the lens to bring all three channels simultaneously into focus (longitudinal CA) can cause objectionable colour fringing artifacts in edge regions. These artifacts can be particularly severe when using low-cost lenses. We propose to use a set of simple neural networks to learn to jointly perform DM and CA correction, producing high quality colour images subject to severe CA as well as image noise. The proposed neural network-based joint DM and CA correction produces a significant improvement in image quality metrics (PSNR and SSIM) compared to the baseline edge-directed linear interpolation approach, preserving image detail and reducing objectionable false colour and comb artifacts. The approach can be applied in the production of high quality images and video from machine vision cameras with low cost lenses, thus extending the viability of such hardware to visual media production.
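A joint DM and CA correction network can be sketched as a small CNN mapping the raw mosaic to RGB; the packing of the RGGB pattern into a 4-channel half-resolution tensor and all layer sizes are assumptions for illustration, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class JointDemosaicCA(nn.Module):
    """Sketch of a small CNN for joint demosaicing and chromatic-aberration
    correction: the RGGB mosaic is packed into a 4-channel half-resolution
    image, filtered, then upsampled back to a full-resolution RGB output."""
    def __init__(self, width=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 3 * 4, 3, padding=1),
            nn.PixelShuffle(2))              # back to full resolution, 3 channels

    def forward(self, cfa):                  # cfa: (B, 1, H, W) raw mosaic
        b, _, h, w = cfa.shape
        packed = cfa.view(b, 1, h // 2, 2, w // 2, 2)          # split 2x2 tiles
        packed = packed.permute(0, 1, 3, 5, 2, 4).reshape(b, 4, h // 2, w // 2)
        return self.body(packed)             # (B, 3, H, W) RGB estimate

rgb = JointDemosaicCA()(torch.zeros(1, 1, 64, 64))             # (1, 3, 64, 64)
```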
Person tracking is an often studied facet of computer vision, with applications in security, automated driving and entertainment. However, despite the advantages they offer, few current solutions work for 360° cameras, due to projection distortion.
This paper presents a simple yet robust method for 3D tracking of multiple people in a scene from a pair of 360° cameras. By using 2D pose information, rather than potentially unreliable 3D position or repeated colour information, we create a tracker that is both appearance-independent and capable of operating at narrow baselines. Our results demonstrate state-of-the-art performance on 360° scenes, as well as the capability to handle vertical-axis rotation.
4D human performance capture aims to create volumetric representations of observed human subjects performing arbitrary motions with the ability to replay and render dynamic scenes with the realism of the recorded video. This representation has the potential to enable highly realistic content production for immersive virtual and augmented reality experiences. Human models are typically rendered using detailed, explicit 3D models, which consist of meshes and textures, and animated using tailored motion models to simulate human behaviour and activity. However, designing a realistic 3D human model is still a costly and laborious process. Hence, this work investigates techniques to learn models of human body shape and appearance, aiming to facilitate the generation of highly realistic human animation, and demonstrate its potential contributions, applications, and versatility.
The first contribution of this work is a skeleton driven surface registration approach to generate temporally consistent meshes from multi-view video of human subjects. 2D pose detections from multi-view video are used to estimate 3D skeletal pose on a per-frame basis, which allows a reference frame to match the pose estimation of other frames in a sequence. This allows an initial coarse alignment followed by a patch-based non-rigid mesh deformation to generate temporally consistent mesh sequences.
The second contribution presents techniques to represent human-like shape using a compressed learnt model from 4D volumetric performance capture data. Sequences of 4D dynamic geometry representing a human are encoded with a generative network into a compact space representation, whilst maintaining the original properties, such as surface non-rigid deformations. This compact representation enables synthesis, interpolation and generation of 3D shapes.
The third contribution is the Deep4D generative network, which is capable of compact representation of 4D volumetric video sequences from the skeletal motion of people with two orders of magnitude compression. A variational encoder-decoder is employed to learn an encoded latent space that maps from 3D skeletal pose to 4D shape and appearance. This enables high-quality 4D volumetric video synthesis to be driven by skeletal animation.
Finally, this thesis introduces Deep4D motion graph to implicitly combine multiple captured motions in a unified representation for character animation from volumetric video, allowing novel character movements to be generated with dynamic shape and appearance detail. Deep4D motion graphs allow character animation to be driven by skeletal motion sequences providing a compact encoded representation capable of high-quality synthesis of the 4D volumetric video with two orders of magnitude compression.
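The variational encoder-decoder at the heart of these contributions can be sketched as follows; the MLP layers, sizes and conditioning scheme are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

class Deep4DSketch(nn.Module):
    """Illustrative variational encoder-decoder in the spirit of Deep4D:
    meshes are encoded into a compact latent space and decoded conditioned
    on skeletal pose, so playback can be driven by pose alone."""
    def __init__(self, n_verts=5000, pose_dim=63, latent=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_verts * 3, 512), nn.ReLU(),
                                 nn.Linear(512, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent + pose_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_verts * 3))

    def forward(self, verts, pose):
        # verts: (B, V, 3) mesh; pose: (B, pose_dim) skeletal parameters
        mu, logvar = self.enc(verts.flatten(1)).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterise
        recon = self.dec(torch.cat([z, pose], dim=1)).view(verts.shape)
        return recon, mu, logvar
```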
We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and we show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.
Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing a shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction, and its application to free-view rendering and