Professor Adrian Hilton

Contact Me

E-mail:
Phone: 01483 68 3956

Find me on campus
Room: 24 BA 00

Publications

Highlights

  • Imre E, Hilton A. (2014) 'Covariance estimation for minimal geometry solvers via scaled unscented transformation'. Computer Vision and Image Understanding,
  • Imre E, Hilton A. (2014) 'Order Statistics of RANSAC and Their Practical Application'. International Journal of Computer Vision,
    [ Status: Accepted ]
  • Kim H, Hilton A. (2013) '3D Scene Reconstruction from Multiple Spherical Stereo Pairs'. International Journal of Computer Vision, 104 (1), pp. 94-116.

    Abstract

    We propose a 3D environment modelling method using multiple pairs of high-resolution spherical images. Spherical images of a scene are captured using a rotating line scan camera. Reconstruction is based on stereo image pairs with a vertical displacement between camera views. A 3D mesh model for each pair of spherical images is reconstructed by stereo matching. For accurate surface reconstruction, we propose a PDE-based disparity estimation method which produces continuous depth fields with sharp depth discontinuities even in occluded and highly textured regions. A full environment model is constructed by fusion of partial reconstruction from spherical stereo pairs at multiple widely spaced locations. To avoid camera calibration steps for all camera locations, we calculate 3D rigid transforms between capture points using feature matching and register all meshes into a unified coordinate system. Finally a complete 3D model of the environment is generated by selecting the most reliable observations among overlapped surface measurements considering surface visibility, orientation and distance from the camera. We analyse the characteristics and behaviour of errors for spherical stereo imaging. Performance of the proposed algorithm is evaluated against ground-truth from the Middlebury stereo test bed and LIDAR scans. Results are also compared with conventional structure-from-motion algorithms. The final composite model is rendered from a wide range of viewpoints with high quality textures.

  • Budd C, Huang P, Klaudiny M, Hilton A. (2013) 'Global non-rigid alignment of surface sequences'. International Journal of Computer Vision, 102 (1-3), pp. 256-270.

    Abstract

    This paper presents a general approach based on the shape similarity tree for non-sequential alignment across databases of multiple unstructured mesh sequences from non-rigid surface capture. The optimal shape similarity tree for non-rigid alignment is defined as the minimum spanning tree in shape similarity space. Non-sequential alignment based on the shape similarity tree minimises the total non-rigid deformation required to register all frames in a database into a consistent mesh structure with surfaces in correspondence. This allows alignment across multiple sequences of different motions, reduces drift in sequential alignment and is robust to rapid non-rigid motion. Evaluation is performed on three benchmark databases of 3D mesh sequences with a variety of complex human and cloth motion. Comparison with sequential alignment demonstrates reduced errors due to drift and improved robustness to large non-rigid deformation, together with global alignment across multiple sequences which is not possible with previous sequential approaches. © 2012 The Author(s).
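
    The minimum spanning tree construction at the heart of this paper can be sketched in a few lines (an illustrative Prim's algorithm; the dissimilarity values below are hypothetical, not data from the paper):

```python
def shape_similarity_tree(d):
    """Minimum spanning tree (Prim's algorithm) over a symmetric
    dissimilarity matrix d.  Returns the tree edges along which frames
    would be aligned non-sequentially."""
    n = len(d)
    in_tree = {0}  # grow the tree from frame 0
    edges = []
    while len(in_tree) < n:
        # pick the cheapest edge leaving the current tree
        u, v = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: d[e[0]][e[1]],
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges

# Hypothetical pairwise shape dissimilarities between 4 frames
d = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
print(shape_similarity_tree(d))  # [(0, 1), (1, 2), (2, 3)]
```

    Registering each frame to its tree parent then bounds the total non-rigid deformation by the sum of the cheapest available edges, which is the intuition behind the reduced drift reported above.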

  • Moeslund TB, Hilton A, Krüger V, Sigal L. (2011) Visual Analysis of Humans: Looking at People. Springer-Verlag New York Inc
  • Guillemaut J-Y, Hilton A. (2011) 'Joint Multi-Layer Segmentation and Reconstruction for Free-Viewpoint Video Applications'. Springer International Journal of Computer Vision, 93 (1), pp. 73-100.
  • Huang P, Hilton A, Starck J. (2010) 'Shape Similarity for 3D Video Sequences of People'. Springer International Journal of Computer Vision, 89 (2-3), pp. 362-381.

Journal articles

  • Kim H, Evans A, Blat J, Hilton ADM. (2016) 'Multi-modal Visual Data Registration and Web-based Visualisation'. IEEE Transactions on Circuits and Systems for Video Technology,

    Abstract

    Recent developments in video and sensing technology generate large amounts of digital media data. Current media production relies on video from the principal camera together with a wide variety of heterogeneous sources of supporting data (photos, LiDAR point clouds, witness video cameras, HDRI and depth imagery). Registration of visual data acquired from various 2D and 3D sensing modalities is challenging because current matching and registration methods are not appropriate due to differences in the formats and noise types of multi-modal data. A combined 2D/3D visualisation of this registered data allows an integrated overview of the entire dataset. For such a visualisation a web-based context presents several advantages. In this paper we propose a unified framework for registration and visualisation of this type of visual media data. A new feature description and matching method is proposed, adaptively considering local geometry, semi-global geometry and colour information in the scene for more robust registration. The resulting registered 2D/3D multi-modal visual data is too large to be downloaded and viewed directly via the web browser while maintaining an acceptable user experience. Thus, we employ hierarchical techniques for compression and restructuring to enable efficient transmission and visualisation over the web, leading to interactive visualisation as registered point clouds, 2D images, and videos in the browser, improving on the current state-of-the-art techniques for web-based visualisation of big media data. This is the first unified 3D web-based visualisation of multi-modal visual media production datasets. The proposed pipeline is tested on big multi-modal datasets typical of film and broadcast production, which are made publicly available. The proposed feature description method shows two times higher precision of feature matching and more stable registration performance than existing 3D feature descriptors.

  • Kilic V, Barnard M, Wang W, Hilton A, Kittler J. (2016) 'Mean-Shift and Sparse Sampling Based SMC-PHD Filtering for Audio Informed Visual Speaker Tracking'. IEEE Transactions on Multimedia,
    [ Status: Accepted ]

    Abstract

    The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as the SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles as they need to be propagated over the entire image in every frame to detect the presence of a new speaker in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (V-SMC-PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named AVMS-SMC-PHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD on the AV16.3, AMI and CLEAR datasets.
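
    The mean-shift step described here, repeatedly moving an estimate to the weighted mean of nearby samples until it settles on a density peak, can be sketched in one dimension (illustrative only; the sample values and bandwidth are hypothetical, and this is not the paper's multi-speaker implementation):

```python
import math

def mean_shift(samples, x, bandwidth=0.5, tol=1e-6, max_iter=100):
    """Climb the kernel density gradient: repeatedly replace x with the
    Gaussian-weighted mean of the samples until the shift is negligible."""
    for _ in range(max_iter):
        w = [math.exp(-((s - x) ** 2) / (2 * bandwidth ** 2)) for s in samples]
        x_new = sum(wi * si for wi, si in zip(w, samples)) / sum(w)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

# Hypothetical observations clustered around 5.0, plus one outlier at 9.0
obs = [4.8, 4.9, 5.0, 5.1, 5.2, 9.0]
mode = mean_shift(obs, x=4.5)  # converges near 5.0; the outlier has
                               # negligible Gaussian weight
```

    The sparse-sampling variant in the paper bounds the cost of this iteration by evaluating the kernel on a subset of samples rather than all of them.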

  • Blat J, Evans A, Kim H, Imre H, Polok L, Ila V, Nikolaidis N, Zamcik P, Tefas A, Smrz P, Hilton A, Pitas I. (2015) 'Big Data Analysis for Media Production'. Proceedings of the IEEE, 104 (11), pp. 2085-2113.

    Abstract

    A typical high-end film production generates several terabytes of data per day, either as footage from multiple cameras or as background information regarding the set (laser scans, spherical captures, etc). This paper presents solutions to improve the integration, and the understanding of the quality, of the multiple data sources, which are used both to support creative decisions on-set (or near it) and enhance the postproduction process. The main contributions covered in this paper are: a public multisource production dataset made available for research purposes, monitoring and quality assurance of multicamera set-ups, multisource registration, anthropocentric visual analysis for semantic content annotation, acceleration of 3D reconstruction, and integrated 2D-3D web visualization tools. Furthermore, this paper presents a toolset for analysis and visualisation of multi-modal media production datasets which enables onset data quality verification and management, thus significantly reducing the risk and time required in production. Some of the basic techniques used for acceleration, clustering and visualization could be applied to much larger classes of big data problems.

  • Imre E, Hilton A. (2014) 'Covariance estimation for minimal geometry solvers via scaled unscented transformation'. Computer Vision and Image Understanding,
  • Imre E, Hilton A. (2014) 'Order Statistics of RANSAC and Their Practical Application'. International Journal of Computer Vision,
    [ Status: Accepted ]
  • Casas D, Volino M, Collomosse JP, Hilton A. (2014) '4D Video Textures for Interactive Character Appearance'. Computer Graphics Forum (Proceedings of Eurographics 2014),
  • Kim H, Hilton A. (2013) '3D Scene Reconstruction from Multiple Spherical Stereo Pairs'. International Journal of Computer Vision, 104 (1), pp. 94-116.

    Abstract

    We propose a 3D environment modelling method using multiple pairs of high-resolution spherical images. Spherical images of a scene are captured using a rotating line scan camera. Reconstruction is based on stereo image pairs with a vertical displacement between camera views. A 3D mesh model for each pair of spherical images is reconstructed by stereo matching. For accurate surface reconstruction, we propose a PDE-based disparity estimation method which produces continuous depth fields with sharp depth discontinuities even in occluded and highly textured regions. A full environment model is constructed by fusion of partial reconstruction from spherical stereo pairs at multiple widely spaced locations. To avoid camera calibration steps for all camera locations, we calculate 3D rigid transforms between capture points using feature matching and register all meshes into a unified coordinate system. Finally a complete 3D model of the environment is generated by selecting the most reliable observations among overlapped surface measurements considering surface visibility, orientation and distance from the camera. We analyse the characteristics and behaviour of errors for spherical stereo imaging. Performance of the proposed algorithm is evaluated against ground-truth from the Middlebury stereo test bed and LIDAR scans. Results are also compared with conventional structure-from-motion algorithms. The final composite model is rendered from a wide range of viewpoints with high quality textures.

  • Budd C, Huang P, Klaudiny M, Hilton A. (2013) 'Global non-rigid alignment of surface sequences'. International Journal of Computer Vision, 102 (1-3), pp. 256-270.

    Abstract

    This paper presents a general approach based on the shape similarity tree for non-sequential alignment across databases of multiple unstructured mesh sequences from non-rigid surface capture. The optimal shape similarity tree for non-rigid alignment is defined as the minimum spanning tree in shape similarity space. Non-sequential alignment based on the shape similarity tree minimises the total non-rigid deformation required to register all frames in a database into a consistent mesh structure with surfaces in correspondence. This allows alignment across multiple sequences of different motions, reduces drift in sequential alignment and is robust to rapid non-rigid motion. Evaluation is performed on three benchmark databases of 3D mesh sequences with a variety of complex human and cloth motion. Comparison with sequential alignment demonstrates reduced errors due to drift and improved robustness to large non-rigid deformation, together with global alignment across multiple sequences which is not possible with previous sequential approaches. © 2012 The Author(s).

  • Tejera M, Hilton A. (2013) 'Learning part-based models for animation from surface motion capture'. Proceedings - 2013 International Conference on 3D Vision, 3DV 2013, , pp. 159-166.

    Abstract

    Surface motion capture (Surf Cap) enables 3D reconstruction of human performance with detailed cloth and hair deformation. However, there is a lack of tools that allow flexible editing of Surf Cap sequences. In this paper, we present a Laplacian editing technique that constrains the mesh deformation to plausible surface shapes learnt from a set of examples. A part-based representation of the mesh enables learning of surface deformation locally in the space of Laplacian coordinates, avoiding correlations between body parts while preserving surface details. This extends the range of animation with natural surface deformation beyond the whole-body poses present in the Surf Cap data. We illustrate successful use of our tool on three different characters. © 2013 IEEE.
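
    The Laplacian coordinates this editing operates on can be illustrated with the simple uniform-weight ("umbrella") form: each vertex's differential coordinate is its offset from the mean of its one-ring neighbours (a minimal sketch under that assumption, not the paper's implementation; the toy mesh below is hypothetical):

```python
def laplacian_coordinates(verts, neighbours):
    """Uniform-weight Laplacian: delta_i = v_i - mean of v_i's one-ring.
    verts maps vertex id -> (x, y, z); neighbours maps vertex id -> list
    of adjacent vertex ids."""
    deltas = {}
    for i, v in verts.items():
        ring = [verts[j] for j in neighbours[i]]
        mean = tuple(sum(c) / len(ring) for c in zip(*ring))
        deltas[i] = tuple(vc - mc for vc, mc in zip(v, mean))
    return deltas

# Toy example: vertex 0 sits one unit above the centroid of its ring,
# so its Laplacian coordinate is (0, 0, 1)
verts = {0: (0.0, 0.0, 1.0), 1: (1.0, 0.0, 0.0), 2: (-1.0, 0.0, 0.0),
         3: (0.0, 1.0, 0.0), 4: (0.0, -1.0, 0.0)}
neighbours = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
```

    Editing in this space means constraining new vertex positions so their differential coordinates stay close to examples, which is what preserves local surface detail during deformation.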

  • Guillemaut J-Y, Hilton A. (2011) 'Joint Multi-Layer Segmentation and Reconstruction for Free-Viewpoint Video Applications'. Springer International Journal of Computer Vision, 93 (1), pp. 73-100.
  • Hilton A, Godin G, Shu C, Masuda T. (2011) 'Special issue on 3D imaging and modelling'. Elsevier Computer Vision and Image Understanding, 115 (5), pp. 559-560.
  • Huang P, Hilton A, Starck J. (2010) 'Shape Similarity for 3D Video Sequences of People'. Springer International Journal of Computer Vision, 89 (2-3), pp. 362-381.
  • Edge JD, Hilton A, Jackson PJB. (2009) 'Model-based synthesis of visual speech movements from 3D video'. Hindawi Publishing Corporation EURASIP Journal on Audio, Speech, and Music Processing, 2009, Article number 597267.

    Abstract

    In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system, and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

  • Starck J, Maki A, Nobuhara S, Hilton A, Matsuyama T. (2009) 'The Multiple-Camera 3-D Production Studio'. IEEE Transactions on Circuits and Systems for Video Technology, 19 (6), pp. 856-869.
  • Starck J, Kilner J, Hilton A. (2009) 'A Free-viewpoint Video Renderer'. Journal of Graphics Tools,
  • Starck J, Hilton A. (2008) 'Model-based human shape reconstruction from multiple views'. Elsevier Computer Vision and Image Understanding, 111 (2), pp. 179-194.

    Abstract

    Image-based modelling allows the reconstruction of highly realistic digital models from real-world objects. This paper presents a model-based approach to recover animated models of people from multiple-view video images. Two contributions are made: first, a multiple-resolution model-based framework is introduced that combines multiple visual cues in reconstruction; second, a novel mesh parameterisation is presented to preserve the vertex parameterisation in the model for animation. A prior humanoid surface model is first decomposed into multiple levels of detail and represented as a hierarchical deformable model for image fitting. A novel mesh parameterisation is presented that allows propagation of deformation in the model hierarchy and regularisation of surface deformation to preserve vertex parameterisation and animation structure. The hierarchical model is then used to fuse multiple shape cues from silhouette, stereo and sparse feature data in a coarse-to-fine strategy to recover a model that reproduces the appearance in the images. The framework is compared to physics-based deformable surface fitting at a single resolution, demonstrating improved reconstruction accuracy against ground-truth data with reduced model distortion. Results demonstrate realistic modelling of real people with accurate shape and appearance while preserving model structure for use in animation.

  • Starck J, Hilton A. (2007) 'Surface Capture for Performance-Based Animation'. IEEE IEEE Computer Graphics and Applications, 27 (3), pp. 21-31.
  • Grau O, Hilton A, Kilner J, Miller G, Sargeant T, Starck J. (2007) 'A Free-Viewpoint Video System for Visualisation of Sports Scenes'. SMPTE Motion Imaging Journal, 116 (5-6), pp. 213-219.
  • Hilton A, Fua P, Ronfard R. (2006) 'Vision-based Understanding of a Person's Shape, Appearance, Movement and Behaviour'. Computer Vision and Image Understanding - Special Issue on Modelling People, 104 (2-3), pp. 87-90.
  • Grau O, Hilton A, Kilner J, Miller G, Sargeant T, Starck J. (2006) 'A Free-Viewpoint Video System for Visualisation of Sports Scenes'. International Broadcast Convention, September
  • Ong E, Hilton ADM. (2006) 'Learnt Inverse Kinematics for Animation Synthesis'. Elsevier Graphical Models, 68 (5-6), pp. 472-483.

    Abstract

    Existing work on animation synthesis can be roughly split into two approaches: those that combine segments of motion capture data, and those that perform inverse kinematics. In this paper, we present a method for performing animation synthesis of an articulated object (e.g. a human body or a dog) from a minimal set of body joint positions, following the approach of inverse kinematics. We tackle this problem from a learning perspective. Firstly, we address the need for knowledge of the physical constraints of the articulated body, so as to avoid the generation of physically impossible poses. A common solution is to heuristically specify the kinematic constraints for the skeleton model. In this paper, however, the physical constraints of the articulated body are represented using a hierarchical cluster model learnt from a motion capture database. Additionally, we show that the learnt model automatically captures the correlation between different joints through the simultaneous modelling of their angles. We then show how this model can be utilised to perform inverse kinematics in a simple and efficient manner. Crucially, we describe how IK is carried out from a minimal set of end-effector positions. Following this, we show how this "learnt inverse kinematics" framework can be used to perform animation synthesis of different types of articulated structures. To this end, the results presented include the retargeting of a flat-surface walking animation to various uneven terrains, demonstrating the synthesis of full human body motion from the positions of only the hands, feet and torso. Additionally, we show how the same method can be applied to the animation synthesis of a dog using only its feet and torso positions.

  • Csakany P, Hilton A. (2006) 'Relighting of Dynamic Video'. Academy Publisher Journal of Multimedia, 1 (3), pp. 23-30.
  • Kilner JJ, Starck JR, Hilton A. (2006) 'A Comparative Study of Free Viewpoint Video Techniques for Sports Events'. IET European Conference on Visual Media Production, pp. 87-96.
  • Moeslund T, Hilton A, Kruger V. (2006) 'A Survey of Advances in Vision-Based Human Motion Capture and Analysis'. Computer Vision and Image Understanding, 104 (2-3), pp. 90-127.
  • Starck J, Hilton A. (2005) 'Virtual View Synthesis of People from Multiple View Video'. Elsevier Graphical Models, 67 (6), pp. 600-620.

    Abstract

    This paper addresses the synthesis of virtual views of people from multiple view image sequences. We consider the target area of the multiple camera “3D Virtual Studio” with the ultimate goal of capturing video-realistic dynamic human appearance. A mesh based reconstruction framework is introduced to initialise and optimise the shape of a dynamic scene for view-dependent rendering, making use of silhouette and stereo data as complementary shape cues. The technique addresses two key problems: (1) robust shape reconstruction; and (2) accurate image correspondence for view dependent rendering in the presence of camera calibration error. We present results against ground truth data in synthetic test cases and for captured sequences of people in a studio. The framework demonstrates a higher resolution in rendering compared to shape from silhouette and multiple view stereo.

  • Manessis A, Hilton A. (2005) 'Scene Modelling from Sparse 3D Data'. Journal of Image and Vision Computing, 23 (10), pp. 900-920.
  • Hilton A, Kalkavouras K, Collins G. (2005) '3D Studio Production of Animated Actor Models'. IEE Proceedings - Vision, Image and Signal Processing, 152 (4), pp. 481-490.

    Abstract

    A framework for construction of detailed animated models of an actor's shape and appearance from multiple view images is presented. Multiple views of an actor are captured in a studio with controlled illumination and background. An initial low-resolution approximation of the person's shape is reconstructed by deformation of a generic humanoid model to fit the visual hull using shape constrained optimisation to preserve the surface parameterisation for animation. Stereo reconstruction with multiple view constraints is then used to reconstruct the detailed surface shape. High-resolution shape detail from stereo is represented in a structured format for animation by displacement mapping from the low-resolution model surface. A novel integration algorithm using displacement maps is introduced to combine overlapping stereo surface measurements from multiple views into a single displacement map representation of the high-resolution surface detail. Results of 3-D actor modelling in a 14 camera studio demonstrate improved representation of detailed surface shape such as creases in clothing compared to previous model fitting approaches. Actor models can be animated and rendered from arbitrary views under different illumination to produce free-viewpoint video sequences. The proposed framework enables rapid transformation of captured multiple view images into a structured representation suitable for realistic animation.

  • Hilton A. (2003) 'Computer Vision for Human Modelling and Analysis'. Journal of Machine Vision Applications, 14 (4), pp. 206-209.
  • Starck J, Collins G, Smith R, Hilton A, Illingworth J. (2003) 'Animated Statues'. Springer Journal of Machine Vision Applications, 14 (4), pp. 248-259.
  • Li Y, Hilton A, Illingworth J. (2002) 'A relaxation algorithm for real-time multiple view 3D-tracking'. Elsevier Image and Vision Computing, 20 (12), pp. 841-859.
  • Hilton A, Fua P. (2001) 'Modeling people toward vision-based understanding of a person's shape, appearance, and movement'. Computer Vision and Image Understanding, 81 (3), pp. 227-230.
  • Roberts JB, Hilton A. (2001) 'A Direct Transform Method for the Analysis of LDA Engine Data'. I.Mech.E. Journal of Automotive Engineering, 215 (D6), pp. 725-738.
  • Sun W, Hilton A, Smith R, Illingworth J. (2001) 'Layered Animation of Captured Data'. Springer Visual Computer: International Journal of Computer Graphics, 17 (8), pp. 457-474.
  • Collins G, Hilton A. (2001) 'Models for Character Animation'. Software Focus, Wiley, 2 (2), pp. 44-51.
  • Hilton A, Beresford D, Gentils T, Smith R, Sun W, Illingworth J. (2000) 'Whole-body modelling of people from multi-view images to populate virtual worlds'. SPRINGER Visual Computer: International Journal of Computer Graphics, 16 (7), pp. 411-436.

    Abstract

    In this paper a new technique is introduced for automatically building recognisable, moving 3D models of individual people. A set of multiview colour images of a person is captured from the front, sides and back by one or more cameras. Model-based reconstruction of shape from silhouettes is used to transform a standard 3D generic humanoid model to approximate a person’s shape and anatomical structure. Realistic appearance is achieved by colour texture mapping from the multiview images. The results show the reconstruction of a realistic 3D facsimile of the person suitable for animation in a virtual world. The system is inexpensive and is reliable for large variations in shape, size and clothing. This is the first approach to achieve realistic model capture for clothed people and automatic reconstruction of animated models. A commercial system based on this approach has recently been used to capture thousands of models of the general public.

  • Hilton A, Illingworth J. (2000) 'Geometric Fusion for a Hand-held 3D Sensor'. Springer Machine Vision and Applications, 12 (1), pp. 44-51.
  • Hilton A, Stoddart AJ, Illingworth J, Windeatt T. (1998) 'Implicit Surface based Geometric Fusion'. Computer Vision and Image Understanding, 69 (3), pp. 273-291.
  • Illingworth J, Hilton A. (1998) 'Looking to Build a Model World: Automatic Construction of Static Object Models using Computer Vision'. IEE Journal Electronics and Communications Engineering, 10 (3), pp. 103-113.
  • Stoddart AJ, Lemke S, Hilton A, Renn T. (1998) 'Estimating pose uncertainty for surface registration'. Elsevier Image and Vision Computing, 16 (2), pp. 111-120.
  • Hilton A, Illingworth J, Windeatt T. (1995) 'Statistics of Surface Curvature Estimates'. Pattern Recognition, 28 (8), pp. 1201-1221.
  • Hilton A, Roberts JB, Hadded O. (1991) 'Autocorrelation Based Analysis of Ensemble Averaged LDA Engine Data for Bias-Free Turbulence Estimates: A Unified Approach'. SAE Technical Paper 910479, pp. 1-21.

Conference papers

  • Volino M, Casas D, Collomosse JP, Hilton A. 'Optimal Representation of Multi-View Video'. Nottingham, UK: British Machine Vision Conference (BMVC)

    Abstract

    Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling results in a compact representation with minimal loss of information allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.

  • Trumble M, Gilbert A, Malleson CD, Hilton ADM, Collomosse JP. (2017) 'Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors'. Proceedings of the 28th British Machine Vision Conference (BMVC), pp. 1-13.

    Abstract

    We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.

  • Mustafa A , Hilton ADM. (2017) 'Semantically Coherent Co-segmentation and Reconstruction of Dynamic Scenes'. CVPR 2017 Proceedings, Honolulu, Hawaii: IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), 2017
    [ Status: Accepted ]

    Abstract

    In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.

  • Kim H , Hughes R, Remaggi L , Jackson PJB, Hilton ADM, Cox T, Shirley B. (2017) 'Acoustic Room Modelling using a Spherical Camera for Reverberant Spatial Audio Objects'. Proceedings of the Audio Engineering Society, Berlin, Germany: 142nd Convention of the Audio Engineering Society 142
    [ Status: Accepted ]

    Abstract

    The ability to predict the acoustics of a room without acoustical measurements is a useful capability. The motivation here stems from spatial audio reproduction, where knowledge of the acoustics of a space could allow for more accurate reproduction of a captured environment, or for reproduction room compensation techniques to be applied. A cuboid-based room geometry estimation method using a spherical camera is proposed, assuming a room and objects inside can be represented as cuboids aligned to the main axes of the coordinate system. The estimated geometry is used to produce frequency-dependent acoustic predictions based on geometrical room modelling techniques. Results are compared to measurements through calculated reverberant spatial audio object parameters used for reverberation reproduction customized to the given loudspeaker set up.

  • Liu Y, Wang W , Chambers J, Kilic V, Hilton ADM. (2017) 'Particle flow SMC-PHD filter for audio-visual multi-speaker tracking'. Latent Variable Analysis and Signal Separation, Grenoble, France: 13th International Conference on Latent Variable Analysis and Signal Separation, pp. 344-353.

    Abstract

    Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering has been recently exploited for audio-visual (AV) based tracking of multiple speakers, where audio data are used to inform the particle distribution and propagation in the visual SMC-PHD filter. However, the performance of the AV-SMC-PHD filter can be affected by the mismatch between the proposal and the posterior distribution. In this paper, we present a new method to improve the particle distribution where audio information (i.e. DOA angles derived from microphone array measurements) is used to detect new born particles and visual information (i.e. histograms) is used to modify the particles with particle flow (PF). Using particle flow has the benefit of migrating particles smoothly from the prior to the posterior distribution. We compare the proposed algorithm with the baseline AV-SMC-PHD algorithm using experiments on the AV16.3 dataset with multi-speaker sequences.
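
A generic building block of sequential Monte Carlo filters such as the one above is the resampling step, which replaces weighted particles with an equally weighted set. The following is an illustrative sketch of standard systematic resampling only, not the authors' particle-flow method; the particle labels and weights are made up for the example.

```python
import random

# Illustrative sketch: systematic resampling, a standard step in
# sequential Monte Carlo filters. High-weight particles are duplicated,
# low-weight ones dropped, using a single random offset.

def systematic_resample(particles, weights, rng=random.random):
    n = len(particles)
    total = sum(weights)
    step = total / n
    u = rng() * step            # one random offset shared by all n slots
    out, cum, i = [], weights[0], 0
    for k in range(n):
        target = u + k * step
        while cum < target:     # advance to the particle covering target
            i += 1
            cum += weights[i]
        out.append(particles[i])
    return out                  # n equally weighted particles

# toy example: particle "p1" carries most of the weight
parts = ["p0", "p1", "p2", "p3"]
w = [0.1, 0.6, 0.2, 0.1]
print(systematic_resample(parts, w, rng=lambda: 0.5))
```

With the fixed offset shown, the dominant particle is duplicated three times, which is the intended behaviour: the resampled set concentrates on high-posterior regions.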

  • Fowler SE, Kim H , Hilton ADM. (2017) 'Towards Complete Scene Reconstruction from Single-View Depth and Human Motion'. Proceedings of the 28th British Machine Vision Conference (BMVC 2017), London, UK: 28th British Machine Vision Conference (BMVC 2017)
    [ Status: Accepted ]

    Abstract

    Complete scene reconstruction from single view RGBD is a challenging task, requiring estimation of scene regions occluded from the captured depth surface. We propose that scene-centric analysis of human motion within an indoor scene can reveal fully occluded objects and provide functional cues to enhance scene understanding tasks. Captured skeletal joint positions of humans, utilised as naturally exploring active sensors, are projected into a human-scene motion representation. Inherent body occupancy is leveraged to carve a volumetric scene occupancy map initialised from captured depth, revealing a more complete voxel representation of the scene. To obtain a structured box model representation of the scene, we introduce unique terms to an object detection optimisation that overcome depth occlusions whilst deriving from the same depth data. The method is evaluated on challenging indoor scenes with multiple occluding objects such as tables and chairs. Evaluation shows that human-centric scene analysis can be applied to effectively enhance state-of-the-art scene understanding approaches, resulting in a more complete representation than single view depth alone.
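
The carving idea described above can be illustrated with a toy sketch: voxels that a tracked body passes through cannot contain static objects, so they are removed from a depth-initialised occupancy map. This is not the authors' implementation; the grid layout, radius parameter, and joint track are invented for the example.

```python
# Illustrative sketch: carving a voxel occupancy map with observed human
# joint positions. Space a person's body occupies is marked free, since
# no static scene object can be there.

def carve_occupancy(grid, joint_tracks, radius=1):
    """grid: dict {(x, y, z): True if possibly occupied}.
    joint_tracks: iterable of (x, y, z) joint positions over time."""
    for jx, jy, jz in joint_tracks:
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                for dz in range(-radius, radius + 1):
                    v = (jx + dx, jy + dy, jz + dz)
                    if v in grid:
                        grid[v] = False  # body occupancy => carve voxel
    return grid

# a row of 5 "possibly occupied" voxels; a person stands at (2, 0, 0)
grid = {(x, 0, 0): True for x in range(5)}
carve_occupancy(grid, [(2, 0, 0)], radius=1)
print([v for v, occ in sorted(grid.items()) if occ])  # -> [(0, 0, 0), (4, 0, 0)]
```

Only the two voxels outside the body's reach remain candidate object locations, which is the mechanism the abstract exploits to reveal occluded structure.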

  • Kim H , de Campos T, Hilton ADM. (2016) 'Room Layout Estimation with Object and Material Attributes Information using a Spherical Camera'. Fourth International Conference on 3D Vision (3DV), Standford University, California, USA: International Conference on 3D Vision

    Abstract

    In this paper we propose a pipeline for estimating 3D room layout with object and material attribute prediction using a spherical stereo image pair. We assume that the room and objects can be represented as cuboids aligned to the main axes of the room coordinate system (Manhattan world). A spherical stereo alignment algorithm is proposed to align two spherical images to the global world coordinate system. Depth information of the scene is estimated by stereo matching between images. Cubic projection images of the spherical RGB and estimated depth are used for object and material attribute detection. A single Convolutional Neural Network is designed to assign object and attribute labels to geometrical elements built from the spherical image. Finally a simplified room layout is reconstructed by cuboid fitting. The reconstructed cuboid-based model shows the structure of the scene with object information and material attributes.

  • Trumble M, Gilbert A , Hilton ADM, Collomosse JP. (2016) 'Learning Markerless Human Pose Estimation from Multiple Viewpoint Video'. Computer Vision – ECCV 2016 Workshops. Lecture Notes in Computer Science, Amsterdam, Netherlands: VARVAI 2016 in conjunction with ECCV 2016 9915, pp. 871-878.

    Abstract

    We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from a MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.

  • Mustafa A , Kim H , Hilton ADM. (2016) '4D Match Trees for Non-rigid Surface Alignment'. Computer Vision – ECCV 2016 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, Amsterdam, The Netherlands: ECCV'16 The 14th European Conference on Computer Vision 9905 (1), pp. 213-229.

    Abstract

    This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces observed from single or multiple moving cameras of complex scenes. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD) which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables the 4D Match Tree to be constructed which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single and multiple view images sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.

  • Woodcock J, Pike C, Melchior F, Coleman P , Franck A, Hilton ADM. (2016) 'Presenting the S3A Object-Based Audio Drama dataset'. Audio Engineering Society AES E-library, Paris, France: 140th AES Convention

    Abstract

    This engineering brief reports on the production of 3 object-based audio drama scenes, commissioned as part of the S3A project. 3D reproduction and an object-based workflow were considered and implemented from the initial script commissioning through to the final mix of the scenes. The scenes are being made available as Broadcast Wave Format files containing all objects as separate tracks and all metadata necessary to render the scenes as an XML chunk in the header conforming to the Audio Definition Model specification (Recommendation ITU-R BS.2076 [1]). It is hoped that these scenes will find use in perceptual experiments and in the testing of 3D audio systems. The scenes are available via the following link: http://dx.doi.org/10.17866/rd.salford.3043921.

  • Trumble M, Gilbert A, Hilton A, Collomosse J. (2016) 'Deep Convolutional Networks for Marker-less Human Pose Estimation from Multiple Views'. Proceedings of CVMP 2016. The 13th European Conference on Visual Media Production, London, UK: CVMP 2016. The 13th European Conference on Visual Media Production
    [ Status: Accepted ]
  • Mustafa A, Kim H, Guillemaut J-Y, Hilton ADM. (2016) 'Temporally coherent 4D reconstruction of complex dynamic scenes'. IEEE Las Vegas, Nevada: CVPR 2016

    Abstract

    This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved nonrigid object segmentation and shape reconstruction.

  • Mustafa A , Kim H , Guillemaut J , Hilton ADM. (2015) 'General Dynamic Scene Reconstruction from Multiple View Video'. IEEE 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 900-908.

    Abstract

    This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.

  • Mustafa A, Kim H, Imre HE, Hilton A. (2014) 'Initial Disparity Estimation Using Sparse Matching for Wide-Baseline Dense'. 11th European Conference on Visual Media Production
  • Kim H, Hilton A. (2014) 'Hybrid 3D Feature Description and Matching for Multi-modal Data Registration'. Paris, France: IEEE International Conference on Image Processing, pp. 3493-3497.
  • Wang T, Collomosse JP, Hilton A. (2014) 'Wide Baseline Multi-View Video Matting using a Hybrid Markov Random Field'. IEEE Stockholm: International Conference on Pattern Recognition (ICPR)

    Abstract

    We describe a novel framework for segmenting a time- and view-coherent foreground matte sequence from synchronised multiple view video. We construct a Markov Random Field (MRF) comprising links between superpixels corresponded across views, and links between superpixels and their constituent pixels. Texture, colour and disparity cues are incorporated to model foreground appearance. We solve using a multi-resolution iterative approach enabling an eight view high definition (HD) frame to be processed in less than a minute. Furthermore we incorporate a temporal diffusion process introducing a prior on the MRF using information propagated from previous frames, and a facility for optional user correction. The result is a set of temporally coherent mattes that are solved for simultaneously across views for each frame, exploiting similarities across views and time.

  • Kim H, Hilton A. (2013) 'Evaluation of 3D Feature Descriptors for Multi-modal Data Registration'. Seattle, USA: 3DV

    Abstract

    We propose a framework for 2D/3D multi-modal data registration and evaluate 3D feature descriptors for registration of 3D datasets from different sources. 3D datasets of outdoor environments can be acquired using a variety of active and passive sensor technologies including laser scanning and video cameras. Registration of these datasets into a common coordinate frame is required for subsequent modelling and visualisation. 2D images are converted into 3D structure by stereo or multi-view reconstruction techniques and registered to a unified 3D domain with other datasets in a 3D world. Multi-modal datasets have different density, noise, and types of errors in geometry. This paper provides a performance benchmark for existing 3D feature descriptors across multi-modal datasets. Performance is evaluated for the registration of datasets obtained from high-resolution laser scanning with reconstructions obtained from images and video. This analysis highlights the limitations of existing 3D feature detectors and descriptors which need to be addressed for robust multi-modal data registration. We analyse and discuss the performance of existing methods in registering various types of datasets then identify future directions required to achieve robust multi-modal 3D data registration.

  • Kim H, Hilton A. (2013) 'Planar Urban Scene Reconstruction from Spherical Images using Facade Alignment'. Seoul, South Korea: 11th IEEE IVMSP Workshop

    Abstract

    We propose a plane-based urban scene reconstruction method using spherical stereo image pairs. We assume that the urban scene consists of axis-aligned approximately planar structures (Manhattan world). Captured spherical stereo images are converted into six central-point perspective images by cubic projection and facade alignment. Facade alignment automatically identifies the principal planes direction in the scene allowing the cubic projection to preserve the plane structure. Depth information is recovered by stereo matching between images and independent 3D rectangular planes are constructed by plane fitting aligned with the principal axes. Finally planar regions are refined by expanding, detecting intersections and cropping based on visibility. The reconstructed model efficiently represents the structure of the scene and texture mapping allows natural walk-through rendering.

  • Imre E, Guillemaut JY, Hilton A. (2012) 'Through-the-lens multi-camera synchronisation and frame-drop detection for 3D reconstruction'. Proceedings - 2nd Joint 3DIM/3DPVT Conference: 3D Imaging, Modeling, Processing, Visualization and Transmission, 3DIMPVT 2012, pp. 395-402.
  • Klaudiny M, Hilton A. (2012) 'High-detail 3D capture and non-sequential alignment of facial performance'. IEEE Proceedings - 2nd Joint 3DIM/3DPVT Conference: 3D Imaging, Modeling, Processing, Visualization and Transmission, 3DIMPVT 2012, Zurich: 3DIMPVT, pp. 17-24.

    Abstract

    This paper presents a novel system for the 3D capture of facial performance using standard video and lighting equipment. The mesh of an actor's face is tracked non-sequentially throughout a performance using multi-view image sequences. The minimum spanning tree calculated in expression dissimilarity space defines the traversal of the sequences optimal with respect to error accumulation. A robust patch-based frame-to-frame surface alignment combined with the optimal traversal significantly reduces drift compared to previous sequential techniques. Multi-path temporal fusion resolves inconsistencies between different alignment paths and yields a final mesh sequence which is temporally consistent. The surface tracking framework is coupled with photometric stereo using colour lights which captures metrically correct skin geometry. High-detail UV normal maps corrected for shadow and bias artefacts augment the temporally consistent mesh sequence. Evaluation on challenging performances by several actors demonstrates the acquisition of subtle skin dynamics and minimal drift over long sequences. A quantitative comparison to a state-of-the-art system shows similar quality of temporal alignment. © 2012 IEEE.
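
The non-sequential traversal described above rests on a minimum spanning tree computed in expression-dissimilarity space, so that each frame is aligned to its most similar already-tracked frame. A minimal sketch of that idea (Prim's algorithm over a toy dissimilarity matrix, not the authors' implementation):

```python
# Illustrative sketch: deriving a non-sequential alignment order from a
# minimum spanning tree over pairwise frame dissimilarities. Each frame
# is attached to the most similar frame already in the tree, which
# limits error accumulation compared with strict sequential tracking.

def mst_traversal(dissim, root=0):
    """Prim's algorithm. Returns (frame, aligned_to) edges in the order
    frames are reached from the root."""
    n = len(dissim)
    in_tree = {root}
    edges = []
    while len(in_tree) < n:
        # cheapest edge from the current tree to an unvisited frame
        parent, child = min(
            ((p, c) for p in in_tree for c in range(n) if c not in in_tree),
            key=lambda pc: dissim[pc[0]][pc[1]],
        )
        in_tree.add(child)
        edges.append((child, parent))
    return edges

# toy 4-frame dissimilarity matrix (symmetric, zero diagonal)
D = [[0, 5, 1, 4],
     [5, 0, 2, 6],
     [1, 2, 0, 3],
     [4, 6, 3, 0]]
print(mst_traversal(D))  # -> [(2, 0), (1, 2), (3, 2)]
```

Note how frames 1 and 3 are both aligned to frame 2 rather than to their sequential neighbours, because frame 2 is closer in dissimilarity space.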

  • Imre E, Guillemaut J-Y, Hilton A. (2011) 'Calibration of nodal and free-moving cameras in dynamic scenes for post-production'. IEEE Proceedings - 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, 3DIMPVT 2011, Hangzhou: International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), pp. 260-267.

    Abstract

    In film production, many post-production tasks require the availability of accurate camera calibration information. This paper presents an algorithm for through-the-lens calibration of a moving camera for a common scenario in film production and broadcasting: The camera views a dynamic scene, which is also viewed by a set of static cameras with known calibration. The proposed method involves the construction of a sparse scene model from the static cameras, with respect to which the moving camera is registered, by applying the appropriate perspective-n-point (PnP) solver. In addition to the general motion case, the algorithm can handle the nodal cameras with unknown focal length via a novel P2P algorithm. The approach can identify a subset of static cameras that are more likely to generate a high number of scene-image correspondences, and can robustly deal with dynamic scenes. Our target applications include dense 3D reconstruction, stereoscopic 3D rendering and 3D scene augmentation, through which the success of the algorithm is demonstrated experimentally.

  • Budd C, Huang P, Hilton A. (2011) 'Hierarchical shape matching for temporally consistent 3D video'. Proceedings of International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, Hangzhou, China: 3DIMPVT 2011, pp. 172-179.
  • Sarim M, Hilton A, Guillemaut JY. (2011) 'Temporal trimap propagation for video matting using inferential statistics'. Proceedings - International Conference on Image Processing, ICIP, pp. 1745-1748.
  • Huang P, Budd C, Hilton A. (2011) 'Global temporal registration of multiple non-rigid surface sequences'. IEEE Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Providence, RI: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3473-3480.

    Abstract

    In this paper we consider the problem of aligning multiple non-rigid surface mesh sequences into a single temporally consistent representation of the shape and motion. A global alignment graph structure is introduced which uses shape similarity to identify frames for inter-sequence registration. Graph optimisation is performed to minimise the total non-rigid deformation required to register the input sequences into a common structure. The resulting global alignment ensures that all input sequences are resampled with a common mesh structure which preserves the shape and temporal correspondence. Results demonstrate temporally consistent representation of several public databases of mesh sequences for multiple people performing a variety of motions with loose clothing and hair.

  • Guillemaut J, Sarim M, Hilton A. (2010) 'Stereoscopic Content Production of Complex Dynamic Scenes Using a Wide-Baseline Monoscopic Camera Set-Up'. Hong Kong: Proc. International Conference on Image Processing (ICIP 2010), Special Session on Image Processing for Stereo Digital Cinema Production, pp. 9-12.

    Abstract

    Conventional stereoscopic video content production requires use of dedicated stereo camera rigs which is both costly and lacking video editing flexibility. In this paper, we propose a novel approach which only requires a small number of standard cameras sparsely located around a scene to automatically convert the monocular inputs into stereoscopic streams. The approach combines a probabilistic spatio-temporal segmentation framework with a state-of-the-art multi-view graph-cut reconstruction algorithm, thus providing full control of the stereoscopic settings at render time. Results with studio sequences of complex human motion demonstrate the suitability of the method for high quality stereoscopic content generation with minimum user interaction.

  • Imre HE, Guillemaut J-Y, Hilton ADM. (2010) 'Moving Camera Registration for Multiple Camera Setups in Dynamic Scenes'. Proceedings of the 21st British Machine Vision Conference, Aberystwyth, UK: BMVC 2010

    Abstract

    Many practical applications require an accurate knowledge of the extrinsic calibration (i.e., pose) of a moving camera. The existing SLAM and structure-from-motion solutions are not robust to scenes with large dynamic objects, and do not fully utilize the available information in the presence of static cameras, a common practical scenario. In this paper, we propose an algorithm that addresses both of these issues for a hybrid static-moving camera setup. The algorithm uses the static cameras to build a sparse 3D model of the scene, with respect to which the pose of the moving camera is estimated at each time instant. The performance of the algorithm is studied through extensive experiments that cover a wide range of applications, and is shown to be satisfactory.

  • Sarim M, Hilton A, Guillemaut J-Y, Kim H, Takai T. (2010) 'Wide-Baseline Multi-View Video Segmentation For 3D Reconstruction'. ACM Proceedings of the 1st international workshop on 3D video processing, Firenze, Italy: 3DVP 2010 Workshop: MM '10 ACM Multimedia Conference, pp. 13-18.
  • Kim H, Sarim M, Takai T, Guillemaut J-Y, Hilton A. (2010) 'Dynamic 3D Scene Reconstruction in Outdoor Environments'. IEEE In Proc. IEEE Symp. on 3D Data Processing and Visualization, France: 3DPVT
  • Cosker D, Krumhuber E, Hilton A. (2010) 'Perception of Linear and Nonlinear Motion Properties using a FACS Validated 3D Facial Model'. ACM In Proc. of ACM Symposium on Applied Perception in Graphics and Visualisation (APGV), Los Angeles: ACM 7th Symposium on Applied Perception in Graphics and Visualization, pp. 101-108.

    Abstract

    In this paper we present the first Facial Action Coding System (FACS) valid model to be based on dynamic 3D scans of human faces for use in graphics and psychological research. The model consists of FACS Action Unit (AU) based parameters and has been independently validated by FACS experts. Using this model, we explore the perceptual differences between linear facial motions – represented by a linear blend shape approach – and real facial motions that have been synthesized through the 3D facial model. Through numerical measures and visualizations, we show that this latter type of motion is geometrically nonlinear in terms of its vertices. In experiments, we explore the perceptual benefits of nonlinear motion for different AUs. Our results are insightful for designers of animation systems both in the entertainment industry and in scientific research. They reveal a significant overall benefit to using captured nonlinear geometric vertex motion over linear blend shape motion. However, our findings suggest that not all motions need to be animated nonlinearly. The advantage may depend on the type of facial action being produced and the phase of the movement.

  • Huang P, Hilton A, Starck J. (2009) 'Human Motion Synthesis from 3D Video'. IEEE Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA: CVPR 2009, pp. 1478-1485.

    Abstract

    Multiple view 3D video reconstruction of actor performance captures a level-of-detail for body and clothing movement which is time-consuming to produce using existing animation tools. In this paper we present a framework for concatenative synthesis from multiple 3D video sequences according to user constraints on movement, position and timing. Multiple 3D video sequences of an actor performing different movements are automatically constructed into a surface motion graph which represents the possible transitions with similar shape and motion between sequences without unnatural movement artifacts. Shape similarity over an adaptive temporal window is used to identify transitions between 3D video sequences. Novel 3D video sequences are synthesized by finding the optimal path in the surface motion graph between user specified key-frames for control of movement, location and timing. The optimal path which satisfies the user constraints whilst minimizing the total transition cost between 3D video sequences is found using integer linear programming. Results demonstrate that this framework allows flexible production of novel 3D video sequences which preserve the detailed dynamics of the captured movement for an actress with loose clothing and long hair without visible artifacts.
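
At the core of the surface motion graph described above is a minimum-cost path search between user-specified key-frames. The paper solves a constrained version (movement, position and timing) via integer linear programming; the sketch below shows only the unconstrained transition-cost minimisation, with made-up clip names and costs.

```python
import heapq

# Illustrative sketch: cheapest path between two nodes of a motion
# graph, where edge weights are transition costs between 3D video
# sequences. Plain Dijkstra, not the paper's ILP formulation.

def cheapest_path(graph, start, goal):
    """graph: dict {node: [(neighbour, transition_cost), ...]}."""
    frontier = [(0.0, start, [start])]
    best = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in best and best[node] <= cost:
            continue                      # already reached more cheaply
        best[node] = cost
        for nxt, c in graph.get(node, []):
            heapq.heappush(frontier, (cost + c, nxt, path + [nxt]))
    return None

# toy motion graph: captured clips as nodes, transition costs as weights
G = {
    "walk": [("turn", 1.0), ("run", 3.0)],
    "turn": [("run", 1.0)],
    "run":  [("jump", 2.0)],
}
print(cheapest_path(G, "walk", "jump"))  # walk -> turn -> run -> jump
```

The detour through "turn" wins (total cost 4.0) over the direct walk-to-run transition (total cost 5.0), mirroring how the graph avoids unnatural high-cost transitions.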

  • Guillemaut J-Y, Kilner J, Hilton A. (2009) 'Robust Graph-Cut Scene Segmentation and Reconstruction for Free-Viewpoint Video of Complex Dynamic Scenes'. IEEE 12th International Conference on Computer Vision (ICCV 2009), pp. 809-816.
  • Budd C, Hilton A. (2009) 'Skeleton Driven Volumetric Deformation'. ACM Symposium on Computer Animation,
  • Kilner JJ, Guillemaut J-Y, Hilton A. (2009) '3D Action Matching with Key-Pose Detection'. IEEE IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), Kyoto, Japan: ICCV 2009, pp. 1-8.

    Abstract

    This paper addresses the problem of human action matching in outdoor sports broadcast environments, by analysing 3D data from a recorded human activity and retrieving the most appropriate proxy action from a motion capture library. Typically pose recognition is carried out using images from a single camera, however this approach is sensitive to occlusions and restricted fields of view, both of which are common in the outdoor sports environment. This paper presents a novel technique for the automatic matching of human activities which operates on the 3D data available in a multi-camera broadcast environment. Shape is retrieved using multi-camera techniques to generate a 3D representation of the scene. Use of 3D data renders the system camera-pose-invariant and allows it to work while cameras are moving and zooming. By comparing the reconstructions to an appropriate 3D library, action matching can be achieved in the presence of significant calibration and matting errors which cause traditional pose detection schemes to fail. An appropriate feature descriptor and distance metric are presented as well as a technique to use these features for key-pose detection and action matching. The technique is then applied to real footage captured at an outdoor sporting event.

  • Huang P, Hilton A. (2009) 'Surface Motion Graphs for Animation from 3D Video'. ACM ACM SIGGRAPH (Talk), New Orleans: ACM SIGGRAPH 2009
  • Sarim M, Guillemaut JY, Kim H, Hilton A. (2009) 'Wide-baseline Image Matting'. European Conference on Visual Media Production(CVMP),
  • Gkalelis N, Kim H, Hilton A, Nikolaidis N, Pitas I. (2009) 'The i3DPost multi-view and 3D human action/interaction'. London: CVMP (Conference for Visual Media Production)
  • Kim H, Hilton A. (2009) 'Environment Modelling using Spherical Stereo Imaging'. IEEE Symposium on 3D Imaging (3DIM),
  • Sarim M, Hilton A, Guillemaut J. (2009) 'Non-parametric patch based video matting'. British Machine Vision Association London, UK: Proc. British Machine Vision Conference (BMVC 2009)

    Abstract

    In computer vision, matting is the process of accurate foreground estimation in images and videos. In this paper we present a novel patch-based approach to video matting relying on non-parametric statistics to represent image variations in appearance. This overcomes the limitation of parametric algorithms which rely only on strong colour correlation between nearby pixels. Initially we construct a clean background by utilising the foreground object’s movement across the background. For a given frame, a trimap is constructed using the background and the last frame’s trimap. A patch-based approach is used to estimate the foreground colour for every unknown pixel and finally the alpha matte is extracted. Quantitative evaluation shows that the technique performs better, in terms of accuracy and required user interaction, than the current state-of-the-art parametric approaches.
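
The final step described above, extracting the alpha matte once foreground and background colours are known, follows from the standard compositing equation C = aF + (1-a)B. A minimal single-pixel sketch of that closed-form solve (the colour values are invented; this is not the paper's non-parametric estimator, only the generic matting relation):

```python
# Illustrative sketch: recovering alpha for one pixel from the matting
# compositing equation C = a*F + (1-a)*B, as a least-squares projection
# of (C - B) onto (F - B). Colours are RGB tuples in [0, 1].

def alpha_from_colours(C, F, B):
    num = sum((c - b) * (f - b) for c, f, b in zip(C, F, B))
    den = sum((f - b) ** 2 for f, b in zip(F, B))
    a = num / den if den else 0.0
    return min(1.0, max(0.0, a))  # clamp to the valid matte range

F = (1.0, 0.0, 0.0)   # estimated foreground colour (red)
B = (0.0, 0.0, 1.0)   # estimated background colour (blue)
C = (0.5, 0.0, 0.5)   # observed composite: an even mix
print(alpha_from_colours(C, F, B))  # -> 0.5
```

An observed pixel halfway between the two colours yields alpha 0.5, i.e. a semi-transparent foreground boundary.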

  • Budd C, Hilton A. (2009) 'Skeleton Driven Volumetric Laplacian Deformation'. European Conference on Visual Media Production,
  • Kilner JJ, Guillemaut J-Y, Hilton A. (2009) 'Summarised Hierarchical Markov Models for Speed Invariant Action Matching'. ICCV Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences, pp. 1065-1072.

    Abstract

    Action matching, where a recorded sequence is matched against, and synchronised with, a suitable proxy from a library of animations, is a technique for generating a synthetic representation of a recorded human activity. This proxy can then be used to represent the action in a virtual environment or as a prior on further processing of the sequence. In this paper we present a novel technique for performing action matching in outdoor sports environments. Outdoor sports broadcasts are typically multi-camera environments, so reconstruction techniques can be applied to the footage to generate a 3D model of the scene. However, due to poor calibration and matting, this reconstruction is of very low quality. Our technique matches the 3D reconstruction sequence against a predefined library of actions to select an appropriate high-quality synthetic representation. A hierarchical Markov model combined with 3D summarisation of the data allows a large number of different actions to be matched successfully to the sequence in a rate-invariant manner without prior segmentation of the sequence into discrete units. The technique is applied to data captured at rugby and soccer games.

  • Kim H, Hilton A. (2009) 'Graph-based Foreground Extraction in Extended Colour Space'. Int. Conf. Image Processing (ICIP),
  • Edge J, Hilton A, Jackson P. (2008) 'Parameterisation of Speech Lip Movements'. Proceedings of International Conference on Auditory-visual Speech Processing, Tangalooma, Australia: AVSP
  • Edge J, Hilton A. (2008) 'Parameterising Visual Speech Movements'. ACM SIGGRAPH/Eurographics Symposium on Computer Animation,
  • Huang P, Hilton A, Starck J. (2008) 'Automatic 3D Video Summarization: Key Frame Extraction from Self-Similarity'. Proceedings of 3DPVT'08 - the Fourth International Symposium on 3D Data Processing, Visualization and Transmission, Georgia Institute of Technology, Atlanta, GA, USA: Fourth International Symposium on 3D Data Processing, Visualization and Transmission, pp. 1-8.

    Abstract

    In this paper we present an automatic key frame selection method to summarise 3D video sequences. Key-frame selection is based on optimisation for the set of frames which give the best representation of the sequence according to a rate-distortion trade-off. Distortion of the summarization from the original sequence is based on measurement of self-similarity using volume histograms. The method evaluates the globally optimal set of key-frames to represent the entire sequence without requiring pre-segmentation of the sequence into shots or temporal correspondence. Results demonstrate that for 3D video sequences of people wearing a variety of clothing the summarization automatically selects a set of key-frames which represent the dynamics. Comparative evaluation of rate-distortion characteristics with previous 3D video summarization demonstrates improved performance.
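
The rate-distortion trade-off described above can be illustrated with a simple greedy selection over a precomputed frame dissimilarity matrix. This is a sketch only: the paper computes a globally optimal key-frame set from volume-histogram self-similarity, whereas `greedy_keyframes` below is a hypothetical k-medoids-style approximation:

```python
import numpy as np

def greedy_keyframes(D, k):
    """Greedily pick k key-frames minimising total distortion, where
    distortion is each frame's dissimilarity to its nearest key-frame.

    D: (n, n) symmetric dissimilarity matrix between frames.
    Returns sorted key-frame indices. (A greedy stand-in for the
    paper's globally optimal rate-distortion selection.)
    """
    n = D.shape[0]
    keys = [int(np.argmin(D.sum(axis=1)))]      # start from the medoid frame
    for _ in range(k - 1):
        cost_to_keys = D[:, keys].min(axis=1)   # current per-frame distortion
        best, best_cost = None, np.inf
        for c in range(n):
            if c in keys:
                continue
            cost = np.minimum(cost_to_keys, D[:, c]).sum()
            if cost < best_cost:
                best, best_cost = c, cost
        keys.append(best)
    return sorted(keys)

# Usage: frames forming two clusters should yield one key-frame per cluster
pos = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])   # 1-D stand-in for frame features
D = np.abs(pos[:, None] - pos[None, :])
keys = greedy_keyframes(D, 2)
```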

  • Huang P, Hilton A, Starck J. (2008) 'Automatic 3D video summarization: Key frame extraction from self-similarity'. 4th International Symposium on 3D Data Processing, Visualization and Transmission, 3DPVT 2008 - Proceedings, , pp. 71-78.
  • Stroia-Williams P, Hilton A. (2008) 'Example-based Reflectance Estimation for Capturing Relightable Models of People'. IEEE European Conference on Visual Media Production, pp. 1-10.

    Abstract

    We present a new approach to reflectance estimation for dynamic scenes. Non-parametric image statistics are used to transfer reflectance properties from a static example set to a dynamic image sequence. The approach allows reflectance estimation for surface materials with inhomogeneous appearance, such as those which commonly occur with patterned or textured clothing. Material reflectance properties are initially estimated from static images of the subject under multiple directional illuminations using photometric stereo. The estimated reflectance together with the corresponding image under uniform ambient illumination form a prior set of reference material observations. Material reflectance properties are then estimated for video sequences of a moving person captured under uniform ambient illumination by matching the observed local image statistics to the reference observations. Results demonstrate that the transfer of reflectance properties enables estimation of the dynamic surface normals and subsequent relighting. This approach overcomes limitations of previous work on material transfer and relighting of dynamic scenes which was limited to surfaces with regions of homogeneous reflectance. We evaluate for relighting 3D model sequences reconstructed from multiple view video. Comparison to previous model relighting demonstrates improved reproduction of detailed texture and shape dynamics.

  • Doshi A, Hilton A, Starck J. (2008) 'An Empirical Study of Non-rigid Surface Feature Matching'. European Conference on Visual Media Production,
  • Kim H, Hilton A. (2008) 'Region-based Foreground Extraction'. Curran Associates European Conference on Visual Media Production, London: European Conference on Visual Media Production

    Abstract

    We propose a region-based method to extract foreground regions from colour video sequences. The foreground region is decided by voting: scores from background subtraction are accumulated over the sub-regions produced by graph-based segmentation. Experiments show that the proposed algorithm improves on conventional approaches, especially in strong shadow regions.
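
The voting scheme can be sketched in a few lines. `region_vote` below is an illustrative stand-in, assuming a precomputed integer segmentation label image and a binary background-subtraction mask; the paper's actual scoring may differ:

```python
import numpy as np

def region_vote(labels, bg_mask, thresh=0.5):
    """Label whole segmented regions as foreground by voting.

    labels:  integer label image from a graph-based segmentation.
    bg_mask: boolean per-pixel foreground mask from background subtraction.
    A region is foreground if the fraction of its pixels flagged by
    background subtraction exceeds `thresh`.
    """
    out = np.zeros_like(bg_mask, dtype=bool)
    for r in np.unique(labels):
        sel = labels == r
        if bg_mask[sel].mean() > thresh:
            out[sel] = True
    return out

# Usage: region 0 is fully flagged, region 1 only half-flagged (e.g. shadow)
labels = np.array([[0, 0, 1, 1]])
bg = np.array([[True, True, False, True]])
fg = region_vote(labels, bg)
```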

  • Starck J, Kilner J, Hilton A. (2008) 'Objective Quality Assessment in Free-viewpoint Video Production'. IEEE Conference on 3DTV, pp. 1-8.
  • Guillemaut J-Y, Kilner J, Starck J, Hilton A. (2007) 'Dynamic Feathering: Minimising Blending Artefacts in View Dependent Rendering'. IET European Conference on Visual Media Production, pp. 1-8.
  • Huang P, Starck J, Hilton A. (2007) 'A Study of Shape Similarity for Temporal Surface Sequences of People'. IEEE Int. Conf. on 3D Imaging and Modeling, Sixth International Conference on 3-D Digital Imaging and Modeling, 2007. 3DIM '07., pp. 408-418.
  • Miller G, Hilton A. (2007) 'Safe Hulls'. IET European Conference on Visual Media Production, pp. 1-8.

    Abstract

    The visual hull is widely used as a proxy for novel view synthesis in computer vision. This paper introduces the safe hull, the first visual hull reconstruction technique to produce a surface containing only foreground parts. A theoretical basis underlies this novel approach which, unlike any previous work, can also identify phantom volumes attached to real objects. Using an image-based method, the visual hull is constructed with respect to each real view and used to identify safe zones in the original silhouettes. The safe zones define volumes known to only contain surface corresponding to a real object. The zones are used in a second reconstruction step to produce a surface without phantom volumes. Results demonstrate the effectiveness of this method for improving surface shape and scene realism, and its advantages over heuristic techniques.

  • Csakany P, Vajda F, Hilton A. (2007) 'Recovering Refined Surface Normals for Relighting Clothing in Dynamic Scenes'. IET European Conference on Visual Media Production, pp. 1-8.

    Abstract

    In this paper we present a method to relight captured 3D video sequences of non-rigid, dynamic scenes, such as clothing of real actors, reconstructed from multiple view video. A view-dependent approach is introduced to refine an initial coarse surface reconstruction using shape-from-shading to estimate detailed surface normals. The prior surface approximation is used to constrain the simultaneous estimation of surface normals and scene illumination, under the assumption of Lambertian surface reflectance. This approach enables detailed surface normals of a moving non-rigid object to be estimated from a single image frame. Refined normal estimates from multiple views are integrated into a single surface normal map. This approach allows highly non-rigid surfaces, such as creases in clothing, to be relit whilst preserving the detailed dynamics observed in video.

  • Turkmani A, Hilton A, Jackson PJB, Edge J. (2007) 'Visual analysis of lip coarticulation in VCV utterances'. Curran Associates 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium: INTERSPEECH 2007, pp. 1406-1409.

    Abstract

    This paper presents an investigation of the visual variation of the bilabial plosive consonant /p/ in three coarticulation contexts. The aim is to provide detailed ensemble analysis to assist coarticulation modelling in visual speech synthesis. The underlying dynamics of labeled visual speech units, represented as lip shape, from symmetric VCV utterances are investigated. Variation in lip dynamics is quantitatively and qualitatively analyzed. This analysis shows that there are statistically significant differences in both the lip shape and trajectory during coarticulation.

  • Hilton A, Starck J. (2007) 'Animation of People from Surface Motion Capture'. IEEE Computer Graphics and Applications, New York: Workshop on 3D Cinematography, 27 (3), pp. 21-31.

    Abstract

    Digital content production traditionally requires highly skilled artists and animators to first manually craft shape and appearance models and then instill the models with a believable performance. Motion capture technology is now increasingly used to record the articulated motion of a real human performance to increase the visual realism in animation. Motion capture is limited to recording only the skeletal motion of the human body and requires the use of specialist suits and markers to track articulated motion. In this paper we present surface capture, a fully automated system to capture shape and appearance as well as motion from multiple video cameras as a basis to create highly realistic animated content from an actor’s performance in full wardrobe. We address wide-baseline scene reconstruction to provide 360 degree appearance from just 8 camera views and introduce an efficient scene representation for level of detail control in streaming and rendering. Finally we demonstrate interactive animation control in a computer games scenario using a captured library of human animation, achieving a frame rate of 300fps on consumer level graphics hardware.

  • Huang P, Starck J, Hilton A. (2007) 'Temporal 3D Shape Matching'. IET European Conference on Visual Media Production, London, UK: 4th European Conference on Visual Media Production, 2007. IETCVMP., pp. 1-8.

    Abstract

    This paper introduces a novel 4D shape descriptor to match temporal surface sequences. A quantitative evaluation based on the receiver-operator characteristic (ROC) curve is presented to compare the performance of conventional 3D shape descriptors with and without a time filter. Feature-based 3D shape descriptors including shape distribution (Osada et al., 2002), spin image (Johnson et al., 1999), shape histogram (Ankerst et al., 1999) and spherical harmonics (Kazhdan et al., 2003) are considered. Evaluation shows that filtered descriptors outperform unfiltered descriptors, and the best-performing volume-sampling shape-histogram descriptor is extended to define a new 4D "shape-flow" descriptor. Shape-flow matching demonstrates improved performance in the context of matching time-varying sequences, motivated by the requirement to connect similar sequences for animation production. Both simulated and real 3D human surface motion sequences are used for evaluation.
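
One of the baseline descriptors evaluated, the shape distribution of Osada et al., is simple to reproduce: a histogram of distances between randomly sampled surface point pairs, which is invariant to rigid motion. `d2_descriptor` below is an illustrative point-sampling version, not the paper's volume-sampling shape-histogram variant:

```python
import numpy as np

def d2_descriptor(points, bins=32, pairs=50000, rmax=None, seed=0):
    """D2 shape distribution: normalised histogram of Euclidean distances
    between randomly sampled point pairs (Osada et al.-style descriptor).

    points: (n, 3) array of surface samples.
    rmax:   upper histogram range; defaults to the largest sampled distance.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(points), pairs)
    j = rng.integers(0, len(points), pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    if rmax is None:
        rmax = d.max()
    hist, _ = np.histogram(d, bins=bins, range=(0.0, rmax))
    return hist / pairs

# Usage: the descriptor of a rotated copy of a shape matches the original,
# since pairwise distances are preserved under rigid motion.
rng = np.random.default_rng(0)
sphere = rng.normal(size=(2000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
```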

  • Grau O, Thomas GA, Hilton A, Kilner J, Starck J. (2007) 'A Robust Free-viewpoint Video System for Sport Scenes'. IEEE 3DTV Conference, 3DTV Conference, 2007
  • Edge JD, Hilton A. (2007) 'Facial Animation with Motion Capture based on Surface Blending'. International Conference on Computer Graphics Theory and Applications,
  • Nadtoka N, Tena JR, Hilton A, Edge J. (2007) 'High-resolution Animation of Facial Dynamics'. IET European Conference on Visual Media Production, London: 4th European Conference on Visual Media Production, 2007. IETCVMP., pp. 1-8.

    Abstract

    This paper presents a framework for performance-based animation and retargeting of high-resolution face models from motion capture. A novel method is introduced for learning a mapping between sparse 3D motion capture markers and dense high-resolution 3D scans of face shape and appearance. A high-resolution facial expression space is learnt from a set of 3D face scans as a person-specific morphable model. Sparse 3D face points sampled at the motion capture marker positions are used to build a corresponding low-resolution expression space to represent the facial dynamics from motion capture. Radial basis function interpolation is used to automatically map the low-resolution motion capture of facial dynamics to the high-resolution facial expression space. This produces a high-resolution facial animation with the detailed shape and appearance of real facial dynamics. Retargeting is introduced to transfer facial expressions to a novel subject captured from a single photograph or 3D scan. The subject-specific high-resolution expression space is mapped to the novel subject based on anatomical differences in face shape. Results of facial animation and retargeting demonstrate realistic animation of expressions from motion capture.
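
The radial basis function mapping from the low-resolution marker space to the high-resolution expression space can be sketched with a Gaussian kernel. This is an assumption for illustration: the abstract does not specify the kernel, and `rbf_fit`/`rbf_eval` are hypothetical helpers:

```python
import numpy as np

def rbf_fit(X, Y, sigma=1.0):
    """Fit a Gaussian RBF interpolant mapping rows of X (low-resolution
    marker parameters) to rows of Y (high-resolution expression parameters).
    Solves for weights so the interpolant passes exactly through the data."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * sigma ** 2))     # (n, n) kernel matrix
    W = np.linalg.solve(Phi, Y)              # interpolation weights
    return X, W, sigma

def rbf_eval(model, Xq):
    """Evaluate the fitted interpolant at query points Xq."""
    X, W, sigma = model
    d2 = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)) @ W
```

Because the weights solve the kernel system exactly, the mapping reproduces the training expressions at the marker configurations used to build it, and interpolates smoothly in between.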

  • Starck J, Hilton A. (2007) 'Correspondence labelling for wide-timeframe free-form surface matching'. IEEE Int. Conf. on Computer Vision, IEEE 11th International Conference on Computer Vision, 2007. ICCV 2007., pp. 1-8.
  • Guillemaut JY, Hilton A, Starck J, Kilner JJ, Grau O. (2007) 'A Bayesian Framework for Simultaneous Reconstruction and Matting'. IEEE Int. Conf. on 3D Imaging and Modeling, Sixth International Conference on 3-D Digital Imaging and Modeling, 2007. 3DIM '07., pp. 167-176.
  • Kilner JJ, Starck J, Hilton A, Guillemaut JY, Grau O. (2007) 'Dual Mode Deformable Models for Free-Viewpoint Video of Outdoor Sports Events'. IEEE Int. Conf. on 3D Imaging and Modeling, Sixth International Conference on 3-D Digital Imaging and Modeling, 2007. 3DIM '07., pp. 177-184.
  • Guillemaut J-Y, Kilner J, Starck J, Hilton ADM. (2007) 'Dynamic feathering: Minimising blending artefacts in view-dependent rendering'. IET Conference Publications, London, UK: IET 4th European Conference on Visual Media Production (CVMP 2007) 534 (534 CP)

    Abstract

    Conventional view-dependent texture mapping techniques produce composite images by blending subsets of input images, weighted according to their relative influence at the rendering viewpoint, over regions where the views overlap. Geometric or camera calibration errors often result in a loss of detail due to blurring or double-exposure artefacts, which tends to be exacerbated by the number of blending views considered. We propose a novel view-dependent rendering technique which optimises the blend region dynamically at rendering time, and reduces the adverse effects of camera calibration or geometric errors otherwise observed. The technique has been successfully integrated in a rendering pipeline which operates at interactive frame rates. Improvements over state-of-the-art view-dependent texture mapping techniques are illustrated on a synthetic scene as well as real imagery of a large-scale outdoor scene where large camera calibration and geometric errors are present.
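
The conventional view-dependent blend that the paper improves upon can be sketched as angle-based camera weighting. `blend_weights` below is a simplified, hypothetical baseline showing only the static weighting, not the dynamic feathering itself:

```python
import numpy as np

def blend_weights(render_dir, cam_dirs, p=4.0):
    """View-dependent blending weights for a set of cameras.

    Each camera's influence falls off with the angle between its viewing
    direction and the rendering direction (a common cosine-power weighting;
    a simplified stand-in for the paper's dynamically feathered blend).
    render_dir: (3,) rendering view direction.
    cam_dirs:   (k, 3) camera view directions.
    Returns weights summing to 1 (or all zeros if no camera faces the view).
    """
    render_dir = render_dir / np.linalg.norm(render_dir)
    cams = cam_dirs / np.linalg.norm(cam_dirs, axis=1, keepdims=True)
    w = np.clip(cams @ render_dir, 0.0, None) ** p   # back-facing cameras get 0
    s = w.sum()
    return w / s if s > 0 else w

# Usage: rendering along camera 0's axis gives camera 0 the dominant weight
cams = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
w = blend_weights(np.array([0.0, 0.0, 1.0]), cams)
```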

  • Starck J, Hilton A. (2006) 'Free-viewpoint Video for Interactive Character Animation'. COE Conference, Japan,
  • Miller G, Hilton A. (2006) 'Exact view-dependent visual hulls'. IEEE Computer Society 18th International Conference on Pattern Recognition, Vol 1, Proceedings, Hong Kong, China: 18th International Conference on Pattern Recognition (ICPR 2006), pp. 107-110.
  • Huang P, Hilton A. (2006) 'Football Player Tracking for Video Annotation'. IET European Conference on Visual Media Production, 3rd European Conference on Visual Media Production, 2006. CVMP 2006., pp. 175-175.
  • Williams P, Hilton A. (2006) '3D Reconstruction Using Spherical Images'. IET European Conference on Visual Media Production, 3rd European Conference on Visual Media Production, 2006. CVMP 2006., pp. 179-179.
  • Miller G, Starck JR, Hilton A. (2006) 'Projective Surface Refinement for Free-Viewpoint Video'. IET European Conference on Visual Media Production, pp. 153-162.

    Abstract

    This paper introduces a novel method of surface refinement for free-viewpoint video of dynamic scenes. Unlike previous approaches, the method presented here uses both visual hull and silhouette contours to constrain refinement of view-dependent depth maps from wide baseline views. A technique for extracting silhouette contours as rims in 3D from the view-dependent visual hull (VDVH) is presented. A new method for improving correspondence is introduced, where refinement of the VDVH is posed as a global problem in projective ray space. Artefacts of global optimisations are reduced by incorporating rims as constraints. Real-time rendering of virtual views in a free-viewpoint video system is achieved using an image+depth representation for each real view. Results illustrate the high quality of rendered views achieved through this refinement technique.

  • Kittler J, Hilton A, Hamouz M, Illingworth J. (2006) '3D Assisted Face Recognition: A Survey of 3D Imaging, Modelling and Recognition Approaches'. CVPR, pp. 114-122.
  • Csakany P, Hilton A. (2006) 'Relighting of Facial Video'. IEEE ICPR, 18th International Conference on Pattern Recognition, 2006. ICPR 2006., pp. 203-206.

    Abstract

    We present a novel method to relight video sequences given known surface shape and illumination. The method preserves fine visual details. It requires single-view video frames, approximate 3D shape and standard studio illumination only, making it applicable in studio production. The technique is demonstrated for relighting video sequences of faces.

  • Hamouz M, Tena JR, Kittler J, Hilton A, Illingworth J. (2006) '3D Assisted Face Recognition: A Survey'. Book Chapter,
  • Csakany P, Vajda F, Hilton A. (2006) 'Model Refinement by Iterative Normal-From-Shading'. IET European Conference on Visual Media Production, 3rd European Conference on Visual Media Production, 2006. CVMP 2006., pp. 181-181.
  • Turkmani A, Hilton A. (2006) 'Appearance-Based Inner-Lip Detection'. IET European Conference on Visual Media Production, pp. 176-176.
  • Edge J, Hilton A. (2006) 'Visual Speech Synthesis from 3D Video'. IET European Conference on Visual Media Production, London: 3rd European Conference on Visual Media Production, 2006., pp. 174-174.

    Abstract

    In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.

  • Hamouz M, Tena JR, Kittler J, Hilton A, Illingworth J. (2006) 'Algorithms for 3D-assisted face recognition'. IEEE 2006 IEEE 14th Signal Processing and Communications Applications, Vols 1 and 2, Antalya, Turkey: IEEE 14th Signal Processing and Communications Applications, pp. 826-829.
  • Starck J, Miller G, Hilton A. (2006) 'Volumetric stereo with silhouette and feature constraints'. British Machine Vision Association British Machine Vision Conference, Edinburgh: BMVC 2006, pp. 1189-1198.

    Abstract

    This paper presents a novel volumetric reconstruction technique that combines shape-from-silhouette with stereo photo-consistency in a global optimisation that enforces feature constraints across multiple views. Human shape reconstruction is considered where extended regions of uniform appearance, complex self-occlusions and sparse feature cues represent a challenging problem for conventional reconstruction techniques. A unified approach is introduced to first reconstruct the occluding contours and left-right consistent edge contours in a scene and then incorporate these contour constraints in a global surface optimisation using graph-cuts. The proposed technique maximises photo-consistency on the surface, while satisfying silhouette constraints to provide shape in the presence of uniform surface appearance and edge feature constraints to align key image features across views.

  • Nadtoka N, Hilton A, Tena J, Edge J, Jackson PJB. (2006) 'Representing Dynamics of Facial Expression'. IET European Conference on Visual Media Production, IET 3rd European Conference on Visual Media Production, pp. 183-183.

    Abstract

    Motion capture (mocap) is widely used in a large number of industrial applications. Our work offers a new way of representing the mocap facial dynamics in a high resolution 3D morphable model expression space. A data-driven approach to modelling of facial dynamics is presented. We propose a way to combine high quality static face scans with dynamic 3D mocap data which has lower spatial resolution in order to study the dynamics of facial expressions.

  • Tena JR, Hamouz M, Hilton A, Illingworth J. (2006) 'A Validation Method for Dense Non-rigid 3D Face Registration'. IEEE Conf. on Advanced Video and Signal-based Surveillance,
  • Tena JR, Hamouz M, Hilton A, Illingworth J. (2006) 'A Validated Method for Dense Non-rigid 3D Face Registration'. IEEE Int. Conf. on Advanced Video and Signal based Surveillance (AVSS’06), pp. 81-90.
  • Csakany P, Hilton A. (2006) 'Relighting of Facial Images'. IEEE Int. Conf. on Face and Gesture Recognition, 7th International Conference on Automatic Face and Gesture Recognition, 2006. FGR 2006., pp. 55-60.
  • Kittler J, Hamouz M, Tena JR, Hilton A, Illingworth J, Ruiz M. (2005) '3D Assisted 2D Face Recognition: Methodology'. Lecture Notes in Computer Science 3773 (Proc. of CIARP'05), pp. 1055-1065.
  • Starck J, Hilton A. (2005) 'Spherical Matching for Temporal Correspondence of Non-Rigid Surfaces'. IEEE Int. Conf. on Computer Vision, Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005. 2, pp. 1387-1394.
  • Kittler J, Hilton A, Hamouz M, Illingworth J. (2005) '3D Assisted Face Recognition: A Survey of 3D Imaging, Modelling and Recognition Approaches'. IEEE Workshop on Advanced 3D Imaging for Safety and Security, A3DISS 2005 (Proceedings of the CVPR 2005 (DVD-ROM)),
  • Collins G, Hilton A. (2005) 'A Rigid Transform Basis for Animation Compression and Level of Detail'. Eurographics Association IMA Conference on Vision, Video and Graphics, Edinburgh: Second IMA Conference on Vision, Video and Graphics, pp. 21-28.
  • Starck J, Miller G, Hilton A. (2005) 'Video-Based Character Animation'. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Los Angeles: 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 49-58.

    Abstract

    In this paper we introduce a video-based representation for free-viewpoint visualization and motion control of 3D character models created from multiple view video sequences of real people. Previous approaches to video-based rendering provide no control of scene dynamics to manipulate, retarget, and create new 3D content from captured scenes. Here we contribute a new approach, combining image-based reconstruction and video-based animation to allow controlled animation of people from captured multiple view video sequences. We represent a character as a motion graph of free-viewpoint video motions for animation control. We introduce the use of geometry videos to represent reconstructed scenes of people for free-viewpoint video rendering. We describe a novel spherical matching algorithm to derive global surface-to-surface correspondence in spherical geometry images for motion blending and the construction of seamless transitions between motion sequences. Finally, we demonstrate interactive video-based character animation with real-time rendering and free-viewpoint visualization. This approach synthesizes highly realistic character animations with dynamic surface shape and appearance captured from multiple view video of people.

  • Collins G, Hilton A. (2005) 'Spatio-Temporal Fusion of Multiple View Video Rate 3D Surfaces'. IEEE Fifth International Conference on 3-D Digital Imaging and Modeling (3DIM '05), Ottawa, Ontario, Canada: Proceedings of the Fifth International Conference on 3-D Digital Imaging and Modeling, pp. 142-149.

    Abstract

    We consider the problem of geometric integration and representation of multiple views of non-rigidly deforming 3D surface geometry captured at video rate. Instead of treating each frame as a separate mesh, we present a representation which takes into consideration temporal and spatial coherence in the data where possible. We first segment gross base transformations using correspondence based on a closest-point metric and represent these motions as piecewise rigid transformations. The remaining residual is encoded as displacement maps at each frame, giving a displacement video. At both these stages occlusions and missing data are interpolated to give a representation which is continuous in space and time. We demonstrate the integration of multiple views for four different non-rigidly deforming scenes: hand, face, cloth and a composite scene. The approach achieves the integration of multiple-view data at different times into one representation which can be processed and edited.
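
The first stage, fitting a gross rigid transformation before encoding the residual as a displacement map, can be sketched with the standard least-squares (Kabsch) alignment. `rigid_fit` is an illustrative helper; the paper's closest-point correspondence step is omitted and correspondences are assumed given:

```python
import numpy as np

def rigid_fit(P, Q):
    """Least-squares rigid transform (R, t) aligning points P to Q (Kabsch).

    P, Q: (n, 3) corresponding points. Returns R, t with Q ~= P @ R.T + t.
    Used here to factor a deforming mesh frame into a gross rigid motion;
    the remainder (Q - (P @ R.T + t)) would be the residual displacement.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Subtracting the fitted rigid motion from each frame leaves a small residual field, which is what the displacement-video representation stores.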

  • Miller G, Hilton A, Starck J. (2005) 'Interactive Free-viewpoint Video'. IEE European Conf. on Visual Media Production, The 2nd IEE European Conference on Visual Media Production, 2005. CVMP 2005., pp. 50-59.
  • Hilton A, Kalkavouras M, Collins G. (2004) 'MELIES: 3D Studio Production of Animated Actor Models'. IEE European Conference on Visual Media Production, pp. 283-288.
  • Hilton A, Starck J. (2004) 'Multiple View Reconstruction of People'. IEEE Conference on 3D Data Processing, Visualisation and Transmission, 2nd International Symposium on 3D Data Processing, Visualization and Transmission, 2004. 3DPVT 2004., pp. 357-364.
  • Ahmed A, Hilton A, Mokhtarian F. (2004) 'Intuitive Parametric Synthesis of Human Animation Sequences'. IEEE Computer Animation and Social Agents,
  • Ahmed A, Hilton A, Mokhtarian F. (2004) 'Enriching Animation Databases'. Eurographics Short Paper,
  • Ypsilos IA, Hilton A, Rowe S. (2004) 'Video-rate Capture of Dynamic Face Shape and Appearance'. IEEE Face and Gesture Recognition, Proceedings Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004.

    Abstract

    This paper presents a system for simultaneous capture of video sequences of face shape and colour appearance. Shape capture uses a projected infra-red structured light pattern together with stereo reconstruction to simultaneously acquire full resolution shape and colour image sequences at video rate. Displacement mapping techniques are introduced to represent dynamic face surface shape as a displacement video. This unifies the representation of face shape and colour. The displacement video representation enables efficient registration, integration and spatiotemporal analysis of captured face data. Results demonstrate that the system achieves video-rate (25Hz) acquisition of dynamic 3D colour faces at PAL resolution with an rms accuracy of 0.2mm and a visual quality comparable to the captured video.

  • Ypsilos IA, Hilton A, Turkmani A, Jackson PJB. (2004) 'Speech Driven Face Synthesis from 3D Video'. IEEE Symposium on 3D Data Processing, Visualisation and Transmission, Thessaloniki, Greece: 2nd International Symposium on 3D Data Processing, Visualization and Transmission, pp. 58-65.

    Abstract

    We present a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis, viseme primitives are identified from the corpus using automatic speech recognition. A novel nonrigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation together with a novel optical flow optimisation using both shape and colour enables accurate and efficient nonrigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence using the nonrigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis have been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.

  • Mitchelson J, Hilton A. (2003) 'Hierarchical Tracking of Multiple People'. British Machine Vision Conference,
  • Starck J, Hilton A. (2003) 'Model-based Multiple View Reconstruction of People'. IEEE International Conference on Computer Vision, Ninth IEEE International Conference on Computer Vision, 2003., pp. 915-922.
  • Mitchelson J, Hilton A. (2003) 'Hierarchical Tracking of Human Motion for Animation'. Model-based Imaging, Rendering, image Analysis and Graphical Special Effects, Paris,
  • Starck J, Hilton A. (2003) 'View-dependent Rendering with Multiple View Stereo Optimisation'. CVPR,
  • Starck J, Hilton A. (2003) 'Towards a 3D Virtual Studio for Human Appearance Capture'. IMA International Conference on Vision, Video and Graphics, Bath, pp. 17-24.
  • Ahmed A, Hilton A, Mokhtarian F. (2003) 'Cyclification of Animation for Human Motion Synthesis'. Eurographics Short Paper,
  • Hilton A, Starck J, Collins G, Kalkavouras M. (2002) '3D Shape Capture for Archiving and Animation'. AIVA 2002 Workshop,
  • Ahmed A, Hilton A, Mokhtarian F. (2002) 'Adaptive Compression of Human Animation Data'. Eurographics - Short Paper,
  • Price M, Chandaria J, Grau O, Thomas GA, Chatting D, Thorne J, Milnthorpe G, Woodward P, Bull L, Ong E-J, Hilton A, Mitchelson J, Starck J. (2002) 'Real-Time Production and Delivery of 3D Media'. International Broadcasting Convention, Conference Proceedings, Amsterdam, Netherlands: International Broadcasting Conference 2002

    Abstract

    The Prometheus project has investigated new ways of creating, distributing and displaying 3D television. The tools developed will also help today’s virtual studio production. 3D content is created by extension of the principles of a virtual studio to include realistic 3D representation of actors. Several techniques for this have been developed:

    • Texture-mapping of live video onto rough 3D actor models.
    • Fully-animated 3D avatars:
      • Photo-realistic body model generated from several still images of a person from different viewpoints.
      • Addition of a detailed head model taken from two close-up images of the head.
      • Tracking of face and body movements of a live performer using several cameras, to derive animation data which can be applied to the face and body.
    • Simulation of virtual clothing which can be applied to the animated avatars.

    MPEG-4 is used to distribute the content in its original 3D form. The 3D scene may be rendered in a form suitable for display on a ‘glasses-free’ 3D display, based on the principle of Integral Imaging. By assembling these elements in an end-to-end chain, the project has shown how a future 3D TV system could be realised. Furthermore, the tools developed will also improve the production methods available for conventional virtual studios, by focusing on sensor-free and markerless motion capture technology, methods for the rapid creation of photo-realistic virtual humans, and real-time clothing simulation.

  • Collins G, Hilton A. (2002) 'Mesh Decimation for Displacement Mapping'. Eurographics - Short Paper,
  • Hilton A, Starck J, Collins G. (2002) 'From 3D Shape Capture to Animated Models'. IEEE Conference on 3D Data Processing, Visualisation and Transmission,
  • Mitchelson J, Hilton A. (2002) 'Wand-based Calibration of Multiple Cameras'. British Machine Vision Association workshop on Multiple Views,
  • Starck J, Hilton A. (2002) 'Reconstruction of animated models from images using constrained deformable surfaces'. Springer Lecture Notes in Computer Science, 10th Conf. on Discrete Geometry for Computer Imagery 2301, pp. 382-391.
  • Ahmed A, Mokhtarian F, Hilton A. (2001) 'Parametric Motion Blending through Wavelet Analysis'. Eurographics 2001 - Short Paper, pp. 347-353.
  • Starck J, Hilton A, Illingworth J. (2001) 'Human Shape Estimation in a Multi-Camera Studio'. BMVC,
  • Li Y, Hilton A, Illingworth J. (2001) 'Towards Reliable Real-Time Multiview Tracking'. IEEE International Workshop on Multiple Object Tracking, , pp. 43-50.

    Abstract

    We address the problem of reliable real-time 3D tracking of multiple objects which are observed in multiple wide-baseline camera views. Establishing the spatio-temporal correspondence is a problem with combinatorial complexity in the number of objects and views. In addition, vision-based tracking suffers from the ambiguities introduced by occlusion, clutter and irregular 3D motion. We present a discrete relaxation algorithm for reducing the intrinsic combinatorial complexity by pruning the decision tree based on unreliable prior information from independent 2D tracking for each view. The algorithm improves the reliability of spatio-temporal correspondence by simultaneous optimisation over multiple views in the case where 2D tracking in one or more views is ambiguous. Application to the 3D reconstruction of human movement, based on tracking of skin-coloured regions in three views, demonstrates considerable improvement in reliability and performance. The results demonstrate that the optimisation over multiple views gives correct 3D reconstruction and object labelling in the presence of incorrect 2D tracking whilst maintaining real-time performance.

  • Wang T, McLauchlan P, Palmer P, Hilton A. (2001) 'Calibration for an Integrated Measurement System of Camera and Laser and its Application'. 5th World Multiconference on Systemics, Cybernetics and Informatics (Awarded Best Paper), Orlando, Florida, USA,
  • Hilton A, Illingworth J, Li Y, Mitchelson J. (2001) 'Real-Time Human Motion Estimation for Studio Production'. BMVA Workshop on Understanding Human Behaviour,
  • Molina L, Hilton A. (2001) 'Learning models for synthesis of human motion'. BMVA Workshop on Probabilistic Methods in Computer Vision,
  • Tanco LM, Hilton A. (2000) 'Realistic synthesis of novel human movements from a database of motion capture examples'. IEEE Workshop on Human Motion, 2000, pp. 137-142.
  • Shen X, Palmer P, McLauchlan P, Hilton A. (2000) 'Error Propagation from Camera Motion to Epipolar Constraint'. British Machine Vision Conference, pp. 546-555.
  • Manessis A, Hilton A, McLauchlan P, Palmer P. (2000) 'A Statistical Geometric Framework for Reconstruction of Scene Models'. British Machine Vision Conference, pp. 222-231.
  • Smith R, Sun W, Hilton A, Illingworth J. (2000) 'Layered Animation using Displacement Maps'. IEEE International Conference on Computer Animation (Computer Animation 2000), Philadelphia, USA, pp. 146-151.

    Abstract

    This paper presents a layered animation framework which uses displacement maps for efficient representation and animation of highly detailed surfaces. The model consists of three layers: a skeleton; low-resolution control model; and a displacement map image. The novel aspects of this approach are an automatic closed-form solution for displacement map generation and animation of the layered displacement map model. This approach provides an efficient representation of complex geometry which allows realistic deformable animation with multiple levels-of-detail. The representation enables compression, efficient transmission and level-of-detail control for animated models.

  • Smith R, Hilton A, Sun W. (2000) 'Seamless VRML Humans'. Fifth Industrial Congress on 3D Digitizing, pp. 1-8.
  • Sun W, Hilton A, Smith R. (2000) 'Building Animated Models from 3D Scanned Data'. Fifth Industrial Congress on 3D Digitizing, pp. 1-8.
  • Manessis A, Hilton A, McLauchlan P, Palmer P. (2000) 'Reconstruction of Scene Models from Sparse 3D Structure'. IEEE Conference on Computer Vision and Pattern Recognition, 2000, 2, pp. 666-671.
  • McLauchlan P, Shen X, Palmer P, Manessis A, Hilton A. (2000) 'Surface-Based Structure-from-Motion using Feature Groupings'. IEEE Asian Conference on Computer Vision, pp. 1-10.
  • Hilton A. (1999) 'Towards Model-based Capture of a Person's Shape, Appearance and Motion'. IEEE International Workshop on Modelling People, pp. 37-44.

    Abstract

    This paper introduces a model-based approach to capturing a person's shape, appearance and movement. A 3D animated model of a clothed person's whole-body shape and appearance is automatically constructed from a set of orthogonal view colour images. The reconstructed model of a person is then used together with the least-squares inverse-kinematics framework of Bregler and Malik (1998) to capture simple 3D movements from a video image sequence.

  • Sun W, Hilton A, Smith R, Illingworth J. (1999) 'Layered Animation Models from Captured Data'. Eurographics Workshop on Computer Animation, 1999, pp. 145-154.
  • Hilton A, Beresford D, Gentils T, Smith R, Sun W. (1999) 'Virtual People: Capturing human models to populate virtual worlds'. IEEE International Conference on Computer Animation (Computer Animation 1999), Geneva, Switzerland, pp. 174-185.

    Abstract

    In this paper a new technique is introduced for automatically building recognisable moving 3D models of individual people. A set of multi-view colour images of a person are captured from the front, side and back using one or more cameras. Model-based reconstruction of shape from silhouettes is used to transform a standard 3D generic humanoid model to approximate the person's shape and anatomical structure. Realistic appearance is achieved by colour texture mapping from the multi-view images. Results demonstrate the reconstruction of a realistic 3D facsimile of the person suitable for animation in a virtual world. The system is low-cost and is reliable for large variations in shape, size and clothing. This is the first approach to achieve realistic model capture for clothed people and automatic reconstruction of animated models. A commercial system based on this approach has recently been used to capture thousands of models of the general public.

  • Beresford D, Hilton A, Smith R, Sun W. (1999) 'Building 3D Human Models from Captured Images'. Eurographics UK Chapter 17th Annual Conference, Cambridge, April 13-15, 1999, pp. 22-30.
  • Hilton A, Gentils T, Beresford D. (1998) 'Popup-People: Capturing 3D Articulated Models of Individual People'. IEE Colloquium on Computer Vision for Virtual Human Modelling, pp. 1-6.
  • Hilton A, Illingworth J. (1997) 'Multi-Resolution Geometric Fusion'. IEEE International Conference on Recent Advances in 3D Digital Imaging and Modeling, pp. 181-188.
  • Saminathan A, Stoddart AJ, Hilton A, Illingworth J. (1997) 'Progress in arbitrary topology deformable surfaces'. BMVC, pp. 1-6.
  • Stoddart AJ, Lemke S, Hilton A, Renn T. (1996) 'Uncertainty estimation for surface registration'. BMVA Press BMVC, pp. 1-6.
  • Hilton A, Stoddart AJ, Illingworth J, Windeatt T. (1996) 'Reconstruction of 3D Delaunay Surface Models of Complex Objects'. IEEE International Conference on Systems, Man and Cybernetics, 1996, pp. 2445-2450.
  • Hilton A, Stoddart AJ, Illingworth J, Windeatt T. (1996) 'Implicit Surface based Geometric Fusion'. Leeds 16th Annual Statistics Workshop, pp. 1-8.
  • Hilton A, Stoddart AJ, Illingworth J, Windeatt T. (1996) 'Reliable Surface Reconstruction from Multiple Range Images'. Springer 4th European Conference on Computer Vision, 1064, pp. 117-126.

    Abstract

    This paper addresses the problem of reconstructing an integrated 3D model from multiple 2.5D range images. A novel integration algorithm is presented based on a continuous implicit surface representation. This is the first reconstruction algorithm to use operations in 3D space only. The algorithm is guaranteed to reconstruct the correct topology of surface features larger than the range image sampling resolution. Reconstruction of triangulated models from multi-image data sets is demonstrated for complex objects. Performance characterization of existing range image integration algorithms is addressed in the second part of this paper. This comparison defines the relative computational complexity and geometric limitations of existing integration algorithms.

  • Hilton A, Stoddart AJ, Illingworth J, Windeatt T. (1996) 'Marching Triangles: Range Image Fusion for Complex Object Modelling'. IEEE International Conference on Image Processing (ICIP), Lausanne, Switzerland, pp. 381-384.

    Abstract

    A new surface-based approach to implicit surface polygonisation is introduced. This is applied to the reconstruction of 3D surface models of complex objects from multiple range images. Geometric fusion of multiple range images into an implicit surface representation was presented in previous work. This paper introduces an efficient algorithm to reconstruct a triangulated model of a manifold implicit surface. A local 3D constraint is derived which defines the Delaunay surface triangulation of a set of points on a manifold surface in 3D space. The 'marching triangles' algorithm uses the local 3D constraint to reconstruct a Delaunay triangulation of an arbitrary topology manifold surface. Computational and representational costs are both a factor of 3-5 lower than previous volumetric approaches such as marching cubes.

  • Stoddart AJ, Hilton A. (1996) 'Registration of multiple point sets'. ICPR, Vienna, pp. 1-4.
  • Hilton A, Stoddart AJ, Illingworth J, Windeatt T. (1996) 'Building 3D Graphical Models of Complex Objects'. Eurographics UK Conference, pp. 193-203.
  • Hilton A, Goncalves J. (1995) '3D Scene Representation Using a Deformable Surface'. IEEE Workshop on Physics Based Modelling, pp. 24-30.
  • Hilton A, Stoddart AJ, Illingworth J, Windeatt T. (1994) 'Automatic inspection of loaded PCBs using 3D range data'. SPIE Machine Vision Application in Industrial Inspection II, International Symposium on Electronic Imaging: Science and Technology, San Jose, CA, Volume 2183, pp. 226-237.
  • Stoddart AJ, Hilton A, Illingworth J. (1994) 'Slime: A new deformable surface'. BMVA Press BMVC, pp. 285-293.
  • Hilton A, Illingworth J, Windeatt T. (1994) 'Surface Curvature Estimation'. IEEE 12th IAPR International Conference on Pattern Recognition, pp. 37-41.
  • Hilton A, Roberts JB, Hadded O. (1992) 'Comparative Evaluation of Techniques for Estimating Turbulent Flow Parameters from In-Cylinder LDA Engine Data'. Fifth International Symposium on Applications of Laser Anemometry to Fluid Mechanics, Lisbon, Portugal, pp. 130-138.
  • Hilton A, Roberts JB, Hadded O. (1991) 'Autocorrelation Based Analysis of LDA Engine Data for Bias-Free Turbulence Estimates'. Society of Automotive Engineers International Congress, pp. 22-30.

Books

  • Moeslund TB, Hilton A, Krüger V, Sigal L. (2011) Visual Analysis of Humans: Looking at People. Springer-Verlag New York Inc
  • Starck J. (2003) Human Modelling from Multiple Views. PhD Thesis, University of Surrey
  • Hilton A. (1992) Algorithms for Estimating Turbulent Flow Parameters from In-Cylinder Laser Doppler Anemometer Data. Doctor of Philosophy (D.Phil.) Thesis, University of Sussex, UK

Theses and dissertations

  • Mustafa A. (2017) General 4D dynamic scene reconstruction from multiple view video.
    [ Status: Approved ]

    Abstract

    This thesis addresses the problem of reconstructing complex real-world dynamic scenes without prior knowledge of the scene structure, dynamic objects or background. Previous approaches to 3D reconstruction of dynamic scenes either require a controlled studio set-up with chroma-key backgrounds or prior knowledge such as static background appearance or segmentation of the dynamic objects. This thesis presents a new approach which enables general dynamic scene reconstruction. This is achieved by initializing the reconstruction with sparse wide-baseline feature matches between views, which avoids the requirement for prior knowledge of the background appearance or assumptions that the background is static. To achieve sparse reconstruction of dynamic objects a novel segmentation-based feature detector, SFD, is introduced. SFD is shown to give an order of magnitude increase in the number and reliability of features detected. A coarse-to-fine approach is introduced for reconstruction of dense 3D models of dynamic scenes. This uses joint segmentation and shape refinement to achieve robust reconstruction of dynamic objects such as people. The approach is evaluated across a wide range of indoor and outdoor scenes. The second major contribution of this research is to introduce temporal coherence into the reconstruction process. The dynamic scene is segmented into objects based on the initial sparse 3D feature reconstruction of the scene. Dense reconstruction is then performed for each object. For dynamic objects the reconstruction is propagated over time to provide a prior for the reconstruction at successive frames in the sequence. This is combined with the introduction of a geodesic star convexity constraint in the segmentation refinement to improve the segmentation of complex objects. Evaluation on general dynamic scenes demonstrates significant improvement in both segmentation and reconstruction, with temporal coherence reducing the ambiguity in the reconstruction of complex shape. The final significant contribution of this research is the introduction of a complete framework for 4D temporally coherent shape reconstruction from one or more camera views. The 4D match tree is introduced as an intermediate representation for robust alignment of partial surface reconstructions across a complete sequence. SFD is used to achieve wide-timeframe matching of partial surface reconstructions between any pair of frames in the sequence. This allows the evaluation of a frame-to-fr
