
Dr Jean-Yves Guillemaut
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences.
About
Biography
I am Senior Lecturer in 3D Computer Vision and Postgraduate Research Director within the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey.
My research centres on scene modelling from multi-view video input, with a focus on complex dynamic scenes. In particular, I’m interested in how to generalise modelling to scenes with complex surface reflectance properties (e.g. glossy materials), and how to extend modelling to uncontrolled outdoor environments. I work in close collaboration with industry to research applications for the creative sector. Examples of applications of my research have included free-viewpoint video, AR, VR and immersive content production.
I joined the University of Surrey in 2001 as a PhD student before becoming a Research Fellow in 2005, a Lecturer in 2012 and Senior Lecturer in 2018. I am a Fellow of the Higher Education Academy, a Member of the Institute of Electrical and Electronics Engineers, and a Member of the Institution of Engineering and Technology.
University roles and responsibilities
- Senior Lecturer in 3D Computer Vision
- CVSSP Postgraduate Research Director
- Department Prizes Officer
- Professional Training Tutor
- MSc Personal Tutor
Research
Research interests
My research interests include computer vision, 3D reconstruction, computational photography, VR, AR, free-viewpoint video, AI and deep learning.
My research centres on scene modelling from multi-view video input, with a focus on complex dynamic scenes. In particular I am looking at how to generalise modelling to scenes with complex surface reflectance properties (e.g. glossy materials), and how to extend modelling to uncontrolled outdoor environments. I have undertaken many research projects in collaboration with industry to develop applications for the creative sector. Examples of applications of my research include free-viewpoint video, AR, VR and immersive content production.
To date, my area of research has resulted in the introduction of robust methods for outdoor scene modelling from a small number of cameras and novel algorithms for the reconstruction of scenes with arbitrary unknown reflectance properties. Applications have been investigated as part of a stream of collaborative research projects with major companies from the creative industry sector (BBC, Foundry, Double Negative, Framestore, etc). This has the potential to open new applications in other disciplines requiring accurate scene modelling and understanding (e.g. in robotics, healthcare, cultural heritage).
In the future I would like to explore how new acquisition techniques such as lightfield imaging, and deep learning, can be leveraged to remove current constraints, enabling us to model live action scenes ‘in the wild’ from a small number of cameras.
Research projects
EPSRC Platform Grant, May 2017–April 2022, Co-Investigator
InnovateUK, July 2016–December 2017, Co-Investigator
EU Horizon 2020, January 2016–December 2017, Co-Investigator
EPSRC First Grant, June 2015–May 2017, Principal Investigator
Multi-View Computational Photography for Dynamic Scene Modelling in the Wild
Royal Society Research Grant, March 2016–March 2017, Principal Investigator
Hand tracking and Pose Estimation
InnovateUK/HEFCE ICURe Innovation-to-Commercialisation Project, April 2016–July 2016, Principal Investigator
InnovateUK, October 2014–April 2016, Co-Investigator
Imagineer Systems: Multiview Planar Tracking
TSB, October 2012–September 2014, Co-Investigator
EU FP7, November 2012–October 2015, Co-Investigator
EU FP7, November 2011–October 2014, Co-Investigator
TSB, November 2011–April 2013, Co-Investigator
TSB/EPSRC, January 2009–July 2011, Researcher
TSB/EPSRC, May 2006–April 2009, Researcher
VAMPIRE – Visual Active Memory and Interactive REtrieval
EU FP5, May 2002–July 2005, Researcher
Research collaborations
Industry: BBC, Foundry, Double Negative, Framestore, Figment Productions, Brainstorm, Never.no, Signum Bildtechnik, IRT, RTVE, Setanta Sports, BlueSky, TVR, Bayerischer Rundfunk, Hallingdolen, Numerion, Change of Paradigm, Lightworks, Imagineer Systems, FilmLight, iMinds, ARRI, 3DLized, Barcelona Media, Technicolor, Fraunhofer HHI, Snell, Hawk-Eye Innovations, US FDA, St Thomas Hospital, Imagination Technologies, InSync Technology.
Academia: NTNU, Universitat Pompeu Fabra, Aristotle University of Thessaloniki, Brno University of Technology, Universität des Saarlandes
Indicators of esteem
Best Poster Award at European Conference on Visual Media Production (CVMP 2016)
Best Student Paper Award at Int. Conference on Computer Vision Theory and Applications (VISAPP 2014)
University of Surrey Faculty of Engineering and Physical Sciences Researcher of the Year Award (2012)
Honorable Mention for the Best Paper Award at ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games 2012
Best Poster Prize at EPSRC/BMVA Summer School on Computer Vision 2002
Supervision
Postgraduate research supervision
Current PhD students I supervise as principal supervisor:
- Matthew Bailey, Photorealistic digitisation and rendering of scenes with complex materials,
- Yue Zhang, Multi-view video scene segmentation and reconstruction with deep neural networks,
- Gianmarco Addari, Unconstrained full-3D modelling of scenes with complex surface reflectance.
Completed postgraduate research projects I have supervised
Graduated PhD students I supervised as principal supervisor:
- Chathura Galkandage, Perception inspired stereoscopic image and video quality assessment (PhD awarded in 2017),
- Michaela Spiteri, Imaging biomarkers in paediatric brain resection (PhD awarded in 2017),
- Mark Brown, A saliency based framework for multi-modal (PhD awarded in 2016),
- Nadejda Roubtsova, Accurate 3D reconstruction of dynamic scenes with complex reflectance properties (PhD awarded in 2016).
Teaching
I currently lead and teach the following three modules in the Department of Electrical and Electronic Engineering:
- Computer Algorithms & Architecture (EEE2048),
- Laboratories, Design & Professional Studies III (EEE2036),
- Laboratories, Design & Professional Studies IV (EEE2037).
In Computer Algorithms & Architecture, I teach algorithms and data structures in C. The course focuses on how to analyse the complexity of algorithms and how to improve their performance and efficiency.
In the Laboratories, Design & Professional Studies module series, I teach professional studies, with an emphasis on entrepreneurship. As part of these modules, I lead the Enterprise Project which is a year-long project where students work in groups on creating innovative business concepts and pitching them at a Dragon’s Den.
I also supervise undergraduate and MSc projects in the areas of Computer Vision, Computer Graphics and Computational Photography (typically 6 projects per academic year).
Publications
This landing page contains the datasets presented in the paper "Finite Aperture Stereo". The datasets are intended for defocus-based 3D reconstruction and analysis. Each download link contains images of a static scene, captured from multiple viewpoints and with different focus settings. The captured objects exhibit a range of reflectance properties and are physically small in scale. Calibration images are also available. A CC BY-NC licence is in effect. Use of this data must be for non-commercial research purposes. Acknowledgement must be given to the original authors by referencing the dataset DOI, the dataset web address, and the aforementioned publication. Re-distribution of this data is prohibited. Before downloading, you must agree with these conditions as presented on the dataset webpage.
While the accuracy of multi-view stereo (MVS) has continued to advance, its performance reconstructing challenging scenes from images with a limited depth of field is generally poor. Typical implementations assume a pinhole camera model, and therefore treat defocused regions as a source of outliers. In this paper, we address these limitations by instead modelling the camera as a thick lens. Doing so allows us to exploit the complementary nature of stereo and defocus information, and overcome constraints imposed by traditional MVS methods. Using our novel reconstruction framework, we recover complete 3D models of complex macro-scale scenes. Our approach demonstrates robustness to view-dependent materials, and outperforms state-of-the-art MVS and depth from defocus across a range of real and synthetic datasets.
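To illustrate why defocus carries depth information, the sketch below computes the blur-circle diameter under the simpler thin-lens approximation. This is an assumption made for brevity: the paper itself models a thick lens, and the function and parameter names here are hypothetical rather than taken from the authors' code.

```python
def blur_circle_diameter(aperture_diameter, focal_length, focus_distance, object_distance):
    """Blur-circle diameter on the sensor for an ideal thin lens.

    All lengths share one unit (e.g. millimetres). Distances are measured
    from the lens to the focus plane and to the object respectively.
    """
    if focus_distance <= focal_length:
        raise ValueError("focus_distance must exceed focal_length")
    if object_distance <= 0:
        raise ValueError("object_distance must be positive")
    # Lateral magnification of the in-focus plane.
    magnification = focal_length / (focus_distance - focal_length)
    # Blur grows with the relative depth offset from the focus plane.
    relative_offset = abs(object_distance - focus_distance) / object_distance
    return aperture_diameter * magnification * relative_offset
```

A point on the focus plane yields zero blur, and the diameter grows as the point moves away from that plane, which is the cue a defocus-aware reconstruction can exploit alongside stereo.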
Many 3D reconstruction techniques are based on the assumption of prior knowledge of the object's surface reflectance, which severely restricts the scope of scenes that can be reconstructed. In contrast, Helmholtz Stereopsis (HS) employs Helmholtz Reciprocity to compute the scene geometry regardless of its Bidirectional Reflectance Distribution Function (BRDF). Despite this advantage, most HS implementations to date have been limited to 2.5D reconstruction, with the few extensions to full 3D being generally limited to a local refinement due to the nature of the optimisers they rely on. In this paper, we propose a novel approach to full 3D HS based on Markov Random Field (MRF) optimisation. After defining a solution space that contains the surface of the object, the energy function to be minimised is computed based on the HS quality measure and a normal consistency term computed across neighbouring surface points. This new method offers several key advantages with respect to previous work: the optimisation is performed globally instead of locally; a more discriminative energy function is used, allowing for better and faster convergence; a novel visibility handling approach to take advantage of Helmholtz reciprocity is proposed; and surface integration is performed implicitly as part of the optimisation process, thereby avoiding the need for an additional step. The approach is evaluated on both synthetic and real scenes, with an analysis of the sensitivity to input noise performed in the synthetic case. Accurate results are obtained on both types of scenes. Further, experimental results indicate that the proposed approach significantly outperforms previous work in terms of geometric and normal accuracy.
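The reflectance independence of Helmholtz Stereopsis can be sketched with the standard reciprocity constraint: for a reciprocal pair with centres o_l and o_r, the vector w = i_l v_l / |o_l - p|^2 - i_r v_r / |o_r - p|^2 is orthogonal to the surface normal at p, whatever the BRDF. The code below is a minimal illustration at a single point of known position using two reciprocal pairs. It is not the MRF optimisation proposed in the paper, and all function names are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalise(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def constraint_vector(p, o_l, o_r, i_l, i_r):
    """Vector w with w . n = 0 at surface point p.

    i_l is the intensity seen from o_l with the light at o_r;
    i_r is the intensity seen from o_r with the light at o_l.
    """
    d_l, d_r = sub(o_l, p), sub(o_r, p)
    v_l, v_r = normalise(d_l), normalise(d_r)
    s_l, s_r = i_l / dot(d_l, d_l), i_r / dot(d_r, d_r)
    return tuple(s_l * a - s_r * b for a, b in zip(v_l, v_r))

def normal_from_two_pairs(p, pair_a, pair_b):
    """Normal at p from two reciprocal pairs, each given as (o_l, o_r, i_l, i_r)."""
    w_a = constraint_vector(p, *pair_a)
    w_b = constraint_vector(p, *pair_b)
    n = normalise(cross(w_a, w_b))  # orthogonal to both constraint vectors
    # Resolve the sign ambiguity by orienting the normal towards the first camera.
    if dot(n, sub(pair_a[0], p)) < 0:
        n = tuple(-x for x in n)
    return n

def simulate_intensity(p, n, camera, light, brdf):
    """Synthetic intensity for a point light; brdf must be symmetric in its arguments."""
    d = sub(light, p)
    v_light, v_camera = normalise(d), normalise(sub(camera, p))
    return brdf(v_light, v_camera) * max(dot(n, v_light), 0.0) / dot(d, d)
```

Because the BRDF cancels out of the constraint, any reciprocal (symmetric) reflectance function passed to `simulate_intensity` should lead to the same recovered normal. In practice more than two pairs are used and the depth is unknown, which is where the global optimisation described above comes in.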
Multi-view stereo remains a popular choice when recovering 3D geometry, despite performance varying dramatically according to the scene content. Moreover, typical pinhole camera assumptions fail in the presence of the shallow depth of field inherent to macro-scale scenes, limiting application to larger scenes with diffuse reflectance. However, the presence of defocus blur can itself be considered a useful reconstruction cue, particularly in the presence of view-dependent materials. With this in mind, we explore the complementary nature of stereo and defocus cues in the context of multi-view 3D reconstruction, and propose a complete pipeline for scene modelling from a finite aperture camera that encompasses image formation, camera calibration and reconstruction stages. As part of our evaluation, an ablation study reveals how each cue contributes to the higher performance observed over a range of complex materials and geometries. Though of lesser concern with large apertures, the effects of image noise are also considered. By introducing pre-trained deep feature extraction into our cost function, we show a step improvement over per-pixel comparisons, and verify the cross-domain applicability of networks using largely in-focus training data applied to defocused images. Finally, we compare to a number of modern multi-view stereo methods, and demonstrate how the use of both cues leads to a significant increase in performance across several synthetic and real datasets.
Conventional view-dependent texture mapping techniques produce composite images by blending subsets of input images, weighted according to their relative influence at the rendering viewpoint, over regions where the views overlap. Geometric or camera calibration errors often result in a loss of detail due to blurring or double exposure artefacts, which tends to be exacerbated by the number of blending views considered. We propose a novel view-dependent rendering technique which optimises the blend region dynamically at rendering time, and reduces the adverse effects of camera calibration or geometric errors otherwise observed. The technique has been successfully integrated in a rendering pipeline which operates at interactive frame rates. Improvements over state-of-the-art view-dependent texture mapping techniques are illustrated on a synthetic scene as well as real imagery of a large scale outdoor scene where large camera calibration and geometric errors are present.
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
Action matching, where a recorded sequence is matched against, and synchronised with, a suitable proxy from a library of animations, is a technique for generating a synthetic representation of a recorded human activity. This proxy can then be used to represent the action in a virtual environment or as a prior on further processing of the sequence. In this paper we present a novel technique for performing action matching in outdoor sports environments. Outdoor sports broadcasts are typically multi-camera environments and as such reconstruction techniques can be applied to the footage to generate a 3D model of the scene. However, due to poor calibration and matting, this reconstruction is of a very low quality. Our technique matches the 3D reconstruction sequence against a predefined library of actions to select an appropriate high quality synthetic representation. A hierarchical Markov model combined with 3D summarisation of the data allows a large number of different actions to be matched successfully to the sequence in a rate-invariant manner without prior segmentation of the sequence into discrete units. The technique is applied to data captured at rugby and soccer games.
Post-operative cerebellar mutism syndrome (POPCMS) in children is a post-surgical complication which occurs following the resection of tumors within the brain stem and cerebellum. High resolution brain magnetic resonance (MR) images acquired at multiple time points across a patient’s treatment allow the quantification of localized changes caused by the progression of this syndrome. However, MR images are not necessarily acquired at regular intervals throughout treatment and are often not volumetric. This restricts the analysis to 2D space and causes difficulty in intra- and inter-subject comparison. To address these challenges, we have developed an automated image processing and analysis pipeline. Multi-slice 2D MR image slices are interpolated in space and time to produce a 4D volumetric MR image dataset providing a longitudinal representation of the cerebellum and brain stem at specific time points across treatment. The deformations within the brain over time are represented using a novel metric known as the Jacobian of deformations determinant. This metric, together with the changing grey-level intensity of areas within the brain over time, is analyzed using machine learning techniques in order to identify biomarkers that correspond with the development of POPCMS following tumor resection. This study makes use of a fully automated approach which is not hypothesis-driven. As a result, we were able to automatically detect six potential biomarkers that are related to the development of POPCMS following tumor resection in the posterior fossa.
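As a rough illustration of a Jacobian-determinant measure of local deformation, the sketch below evaluates det(I + grad u) for a 2D displacement field u using central differences. The 2D setting, the finite-difference scheme and the function name are assumptions made for brevity; the pipeline described above works on 4D data and its exact formulation may differ.

```python
def jacobian_determinant_2d(ux, uy, spacing=1.0):
    """Jacobian determinant of the mapping x -> x + u(x) on a regular grid.

    ux[r][c] and uy[r][c] hold the displacement along x (columns) and
    y (rows). Central differences are used, so only interior grid points
    are returned: the result has shape (rows - 2) x (cols - 2).
    Values above 1 indicate local expansion, values below 1 local shrinkage.
    """
    rows, cols = len(ux), len(ux[0])
    step = 2.0 * spacing
    result = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            dux_dx = (ux[r][c + 1] - ux[r][c - 1]) / step
            dux_dy = (ux[r + 1][c] - ux[r - 1][c]) / step
            duy_dx = (uy[r][c + 1] - uy[r][c - 1]) / step
            duy_dy = (uy[r + 1][c] - uy[r - 1][c]) / step
            # Determinant of the 2x2 matrix I + grad(u).
            row.append((1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx)
        result.append(row)
    return result
```

For a uniform scaling of the grid by a factor s the determinant equals s squared everywhere, and a zero displacement field gives 1, which makes the quantity a convenient per-location summary of tissue expansion or shrinkage between time points.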
© 2015 The Authors. Here we present a novel, information-theoretic salient line segment detector. Existing line detectors typically only use the image gradient to search for potential lines. Consequently, many lines are found, particularly in repetitive scenes. In contrast, our approach detects lines that define regions of significant divergence between pixel intensity or colour statistics. This results in a novel detector that naturally avoids the repetitive parts of a scene while detecting the strong, discriminative lines present. We furthermore use our approach as a saliency filter on existing line detectors to more efficiently detect salient line segments. The approach is highly generalisable, depending only on image statistics rather than image gradient; and this is demonstrated by an extension to depth imagery. Our work is evaluated against a number of other line detectors and a quantitative evaluation demonstrates a significant improvement over existing line detectors for a range of image transformations.
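One way to picture a statistics-based line saliency score is to compare the intensity histograms sampled on either side of a candidate line. The sketch below uses the Jensen-Shannon divergence for this comparison. The choice of divergence, the binning and the function names are assumptions for illustration only; the abstract does not specify the measure actually used by the detector.

```python
import math

def histogram(values, bins=8, lo=0.0, hi=256.0):
    """Normalised histogram of intensity samples in the range [lo, hi)."""
    if not values:
        raise ValueError("at least one sample is required")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def line_saliency(left_samples, right_samples, bins=8):
    """Divergence between the intensity statistics on the two sides of a line."""
    return js_divergence(histogram(left_samples, bins), histogram(right_samples, bins))
```

A line separating two regions with similar statistics, as in a repetitive texture, scores close to 0, while a line separating clearly different regions scores close to 1.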
We present a method for reconstructing the geometry and appearance of indoor scenes containing dynamic human subjects using a single (optionally moving) RGBD sensor. We introduce a framework for building a representation of the articulated scene geometry as a set of piecewise rigid parts which are tracked and accumulated over time using moving voxel grids containing a signed distance representation. Data association of noisy depth measurements with body parts is achieved by online training of a prior shape model for the specific subject. A novel frame-to-frame model registration is introduced which combines iterative closest-point with additional correspondences from optical flow and prior pose constraints from noisy skeletal tracking data. We quantitatively evaluate the reconstruction and tracking performance of the approach using a synthetic animated scene. We demonstrate that the approach is capable of reconstructing mid-resolution surface models of people from low-resolution noisy data acquired from a consumer RGBD camera. © 2013 IEEE.
In this paper we present a novel approach to estimate the alpha mattes of a foreground object captured by a wide-baseline circular camera rig, given a single key-frame trimap. Bayesian inference coupled with camera calibration information is used to propagate high confidence trimap labels across the views. Recent techniques have been developed to estimate an alpha matte of an image using multiple views, but they are limited to narrow baseline views with low foreground variation. The proposed wide-baseline trimap propagation is robust to inter-view foreground appearance changes, shadows and similarity in foreground/background appearance for cameras with opposing views, enabling high quality alpha matte extraction using any state-of-the-art image matting algorithm.
Vitrectomy and pneumatic retinopexy are common surgical procedures used to treat retinal detachment. To reattach the retina, gases are used to inflate the vitreous space allowing the retina to attach by surface tension and buoyancy forces that are superior to the location of the bubble. These procedures require the injection of either a pure tamponade gas, such as C3F8 or SF6, or mixtures of these gases with air. The location of the retinal detachment, the anatomical spread of the retinal defect, and the length of time the defect has persisted, will determine the suggested volume and duration of the gas bubble to allow reattachment. After inflation, the gases are slowly absorbed by the blood allowing the vitreous to be refilled by aqueous. We have developed a model of the mass transfer dynamics of tamponade gases during pneumatic retinopexy or pars plana vitrectomy procedures. The model predicts the expansion and persistence of intraocular gases (C3F8, SF6), oxygen, nitrogen, and carbon dioxide, as well as the intraocular pressure. The model was validated using published literature in rabbits and humans. In addition to correlating the mass transfer dynamics by surface area, permeability, and partial pressure driving forces, the mass transfer dynamics are affected by the percentage of the tamponade gases. Rates were also correlated with the physical properties of the tamponade and blood gases. The model gave accurate predictions in humans.
This paper critiques the opportunities afforded by immersive experience technology to create stimulating, innovative living environments for long-term residents of care homes for the elderly. We identify the ways in which virtual mobility can facilitate reconnection with recreational environments. Specifically, the project examines the potential of two assistive and immersive experiences: virtual reality (VR) and multisensory stimulation environments (MSSE). Findings identify three main areas of knowledge contribution. First, the introduction of VR and MSSE facilitated participants' re-engagement with, and sharing of, past experiences as they recalled past family holidays, day trips or everyday practices. Secondly, the combination of the hardware of the VR and MSSE technology with the physical objects of the sensory trays created alternative, multisensual ways of engaging with the experiences presented to participants. Lastly, the clear preference for the MSSE experience over the VR experience highlighted the importance of social interaction and exchange for participants.
Helmholtz Stereopsis is a 3D reconstruction method uniquely independent of surface reflectance. Yet, its sub-optimal maximum likelihood formulation with drift-prone normal integration limits performance. Via three contributions this paper presents a complete novel pipeline for Helmholtz Stereopsis. Firstly, we propose a Bayesian formulation replacing the maximum likelihood problem by a maximum a posteriori one. Secondly, a tailored prior enforcing consistency between depth and normal estimates via a novel metric related to optimal surface integrability is proposed. Thirdly, explicit surface integration is eliminated by taking advantage of the accuracy of prior and high resolution of the coarse-to-fine approach. The pipeline is validated quantitatively and qualitatively against alternative formulations, reaching sub-millimetre accuracy and coping with complex geometry and reflectance.
We present a novel approach to 2D-3D registration from points or lines without correspondences. While there exist established solutions in the case where correspondences are known, there are many situations where it is not possible to reliably extract such correspondences across modalities, thus requiring the use of a correspondence-free registration algorithm. Existing correspondence-free methods rely on local search strategies and consequently have no guarantee of finding the optimal solution. In contrast, we present the first globally optimal approach to 2D-3D registration without correspondences, achieved by a Branch-and-Bound algorithm. Furthermore, a deterministic annealing procedure is proposed to speed up the nested branch-and-bound algorithm used. The theoretical and practical advantages this brings are demonstrated on a range of synthetic and real data where it is observed that the proposed approach is significantly more robust to high proportions of outliers compared to existing approaches.
Helmholtz Stereopsis is a powerful technique for reconstruction of scenes with arbitrary reflectance properties. However, previous formulations have been limited to static objects due to the requirement to sequentially capture reciprocal image pairs (i.e. two images with the camera and light source positions mutually interchanged). In this paper, we propose Colour Helmholtz Stereopsis - a novel framework for Helmholtz Stereopsis based on wavelength multiplexing. To address the new set of challenges introduced by multispectral data acquisition, the proposed Colour Helmholtz Stereopsis pipeline uniquely combines a tailored photometric calibration for multiple camera/light source pairs, a novel procedure for spatio-temporal surface chromaticity calibration and a state-of-the-art Bayesian formulation necessary for accurate reconstruction from a minimal number of reciprocal pairs. In this framework, reflectance is spatially unconstrained both in terms of its chromaticity and the directional component dependent on the illumination incidence and viewing angles. The proposed approach for the first time enables modelling of dynamic scenes with arbitrary unknown and spatially varying reflectance using a practical acquisition set-up consisting of a small number of cameras and light sources. Experimental results demonstrate the accuracy and flexibility of the technique on a variety of static and dynamic scenes with arbitrary unknown BRDF and chromaticity ranging from uniform to arbitrary and spatially varying.
This paper addresses the problem of human action matching in outdoor sports broadcast environments, by analysing 3D data from a recorded human activity and retrieving the most appropriate proxy action from a motion capture library. Typically, pose recognition is carried out using images from a single camera; however, this approach is sensitive to occlusions and restricted fields of view, both of which are common in the outdoor sports environment. This paper presents a novel technique for the automatic matching of human activities which operates on the 3D data available in a multi-camera broadcast environment. Shape is retrieved using multi-camera techniques to generate a 3D representation of the scene. Use of 3D data renders the system camera-pose-invariant and allows it to work while cameras are moving and zooming. By comparing the reconstructions to an appropriate 3D library, action matching can be achieved in the presence of significant calibration and matting errors which cause traditional pose detection schemes to fail. An appropriate feature descriptor and distance metric are presented as well as a technique to use these features for key-pose detection and action matching. The technique is then applied to real footage captured at an outdoor sporting event. ©2009 IEEE.
Accurate 3D modelling of real world objects is essential in many applications such as digital film production and cultural heritage preservation. However, current modelling techniques rely on assumptions to constrain the problem, effectively limiting the categories of scenes that can be reconstructed. A common assumption is that the scene’s surface reflectance is Lambertian or known a priori. These constraints rarely hold true in practice and result in inaccurate reconstructions. Helmholtz Stereopsis (HS) addresses this limitation by introducing a reflectance agnostic modelling constraint, but prior work in this area has been predominantly limited to 2.5D reconstruction, providing only a partial model of the scene. In contrast, this paper introduces the first Markov Random Field (MRF) optimisation framework for full 3D HS. First, an initial reconstruction is obtained by performing 2.5D MRF optimisation with visibility constraints from multiple viewpoints and fusing the different outputs. Then, a refined 3D model is obtained through volumetric MRF optimisation using a tailored Iterative Conditional Modes (ICM) algorithm. The proposed approach is evaluated with both synthetic and real data. Results show that the proposed full 3D optimisation significantly increases both geometric and normal accuracy, being able to achieve sub-millimetre precision. Furthermore, the approach is shown to be robust to occlusions and noise.
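The Iterative Conditional Modes (ICM) step mentioned above can be illustrated with a generic textbook version on a 1D chain of nodes. The quadratic data term, the absolute-difference smoothness term and all names below are assumptions chosen for brevity; the paper applies a tailored ICM to a volumetric MRF with a Helmholtz Stereopsis energy, which this sketch does not reproduce.

```python
def local_energy(i, label, current, observations, smoothness):
    """Energy contribution of assigning `label` to node i, neighbours held fixed."""
    unary = (label - observations[i]) ** 2
    pairwise = sum(abs(label - current[j])
                   for j in (i - 1, i + 1) if 0 <= j < len(current))
    return unary + smoothness * pairwise

def total_energy(current, observations, smoothness):
    """Full MRF energy of a labelling on the chain."""
    unary = sum((l - o) ** 2 for l, o in zip(current, observations))
    pairwise = sum(abs(a - b) for a, b in zip(current, current[1:]))
    return unary + smoothness * pairwise

def icm(observations, labels, smoothness, max_sweeps=20):
    """Greedy coordinate-wise minimisation; converges to a local minimum."""
    # Initialise each node with the label closest to its observation.
    current = [min(labels, key=lambda l: abs(l - o)) for o in observations]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(current)):
            best = min(labels, key=lambda l: local_energy(
                i, l, current, observations, smoothness))
            # Only accept strict improvements so the energy never increases.
            if (local_energy(i, best, current, observations, smoothness)
                    < local_energy(i, current[i], current, observations, smoothness)):
                current[i] = best
                changed = True
        if not changed:
            break
    return current
```

Since each update only accepts a strictly lower local energy, the total energy cannot increase between sweeps. ICM is a local optimiser, so the result depends on the initial labelling, which is consistent with the paper's use of a fused 2.5D reconstruction as the starting point for the 3D refinement.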
Many practical applications require an accurate knowledge of the extrinsic calibration (i.e., pose) of a moving camera. The existing SLAM and structure-from-motion solutions are not robust to scenes with large dynamic objects, and do not fully utilize the available information in the presence of static cameras, a common practical scenario. In this paper, we propose an algorithm that addresses both of these issues for a hybrid static-moving camera setup. The algorithm uses the static cameras to build a sparse 3D model of the scene, with respect to which the pose of the moving camera is estimated at each time instant. The performance of the algorithm is studied through extensive experiments that cover a wide range of applications, and is shown to be satisfactory.
This paper addresses the problem of human action matching in outdoor sports broadcast environments, by analysing 3D data from a recorded human activity and retrieving the most appropriate proxy action from a motion capture library. Typically pose recognition is carried out using images from a single camera, however this approach is sensitive to occlusions and restricted fields of view, both of which are common in the outdoor sports environment. This paper presents a novel technique for the automatic matching of human activities which operates on the 3D data available in a multi-camera broadcast environment. Shape is retrieved using multi-camera techniques to generate a 3D representation of the scene. Use of 3D data renders the system camera-pose-invariant and allows it to work while cameras are moving and zooming. By comparing the reconstructions to an appropriate 3D library, action matching can be achieved in the presence of significant calibration and matting errors which cause traditional pose detection schemes to fail. An appropriate feature descriptor and distance metric are presented as well as a technique to use these features for key-pose detection and action matching. The technique is then applied to real footage captured at an outdoor sporting event
A key task in computer vision is that of generating virtual 3D models of real-world scenes by reconstructing the shape, appearance and, in the case of dynamic scenes, motion of the scene from visual sensors. Recently, low-cost video plus depth (RGB-D) sensors have become widely available and have been applied to 3D reconstruction of both static and dynamic scenes. RGB-D sensors contain an active depth sensor, which provides a stream of depth maps alongside standard colour video. The low cost and ease of use of RGB-D devices as well as their video rate capture of images along with depth make them well suited to 3D reconstruction. Use of active depth capture overcomes some of the limitations of passive monocular or multiple-view video-based approaches since reliable, metrically accurate estimates of the scene depth at each pixel can be obtained from a single view, even in scenes that lack distinctive texture. There are two key components to 3D reconstruction from RGB-D data: (1) spatial alignment of the surface over time and, (2) fusion of noisy, partial surface measurements into a more complete, consistent 3D model. In the case of static scenes, the sensor is typically moved around the scene and its pose is estimated over time. For dynamic scenes, there may be multiple rigid, articulated, or non-rigidly deforming surfaces to be tracked over time. The fusion component consists of integration of the aligned surface measurements, typically using an intermediate representation, such as the volumetric truncated signed distance field (TSDF). In this chapter, we discuss key recent approaches to 3D reconstruction from depth or RGB-D input, with an emphasis on real-time reconstruction of static scenes.
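The fusion step described above can be sketched for the voxels along a single camera ray. This is a simplified illustration of a TSDF running-average update, not the method of any particular system covered in the chapter; the function name `tsdf_update`, the unit weight per observation and the truncation distance `trunc` are assumptions.

```python
def tsdf_update(tsdf, weight, voxel_depths, measured_depth, trunc=0.05):
    """Fuse one depth sample into the voxels along a single camera ray.

    Returns new (tsdf, weight) lists; the inputs are left unmodified.
    """
    new_tsdf, new_weight = [], []
    for value, w, z in zip(tsdf, weight, voxel_depths):
        sdf = measured_depth - z              # positive in front of the surface
        if sdf < -trunc:                      # far behind the surface: unobserved
            new_tsdf.append(value)
            new_weight.append(w)
            continue
        d = max(-1.0, min(1.0, sdf / trunc))  # truncate and normalise to [-1, 1]
        new_tsdf.append((value * w + d) / (w + 1.0))  # running weighted average
        new_weight.append(w + 1.0)
    return new_tsdf, new_weight
```

Averaging repeated observations in this way suppresses sensor noise, and the surface can then be extracted as the zero crossing of the fused field.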
Precis. A mathematical model is described of the physical properties of intraocular gases providing a guide to the correct gas concentrations to achieve 100% fill of the vitreous cavity postoperatively. A table for the instruction of surgeons is provided and the effects of different axial lengths examined. ABSTRACT Purpose – To determine the concentrations of different gas tamponades in air to achieve 100% fill of the vitreous cavity postoperatively and to examine the influence of eye volume on these concentrations. Methods – A mathematical model of the mass transfer dynamics of tamponade and blood gases (O2, N2, CO2) when injected into the eye was used. Mass transfer surface areas were calculated from published anatomical data. The model has been calibrated from published volumetric decay and composition results for three gases: sulphur hexafluoride (SF6), hexafluoroethane (C2F6) and perfluoropropane (C3F8). The concentrations of these gases (in air) required to achieve 100% fill of the vitreous cavity postoperatively without an intra-ocular pressure rise were determined. The concentrations were calculated for three volumes of the vitreous cavity to test if ocular size influenced the results. Results – A table of gas concentrations was produced. In a simulation of pars plana vitrectomy operations in which an 80% to 85% fill of the vitreous cavity with gas was achieved at surgery, the concentrations of the three gases in air to achieve 100% fill postoperatively were 10-13% for C3F8, 12-15% for C2F6 and 19-25% for SF6. These were similar to the so-called "non-expansive" concentrations used in the clinical setting. The calculations were repeated for three different sizes of eye. Aiming for an 80% fill at surgery and 100% postoperatively, an eye with a 4ml vitreous cavity required 24% SF6, 15% C2F6 or 13% C3F8; 7.2ml required 25% SF6, 15% C2F6 or 13% C3F8; and 10ml required 25% SF6, 16% C2F6 or 13% C3F8.
When using 100% gas (for example, employed in pneumatic retinopexy), in order to achieve 100% fill postoperatively, the minimum vitreous cavity fill at surgery was 43% for SF6, 29% for C2F6 and 25% for C3F8 and was only minimally changed by variation in the size of the eye. Conclusions – A table has been produced which could be used for surgical innovation in gas usage in the vitreous cavity. It provides concentrations for different percentage fills, which will achieve a moment post-operatively with a full fill of the cavity without a pressure rise. Variation in axial length and size of the eye does not appear to alter the values in the table significantly. Those using pneumatic retinopexy need to increase the volume of gas injected with increased size of the eye in order to match the percentage fill of the vitreous cavity recommended for a given tamponade agent.
We propose a multi-view framework for joint object detection and labelling based on pairs of images. The proposed framework extends the single-view Mask R-CNN approach to multiple views without need for additional training. Dedicated components are embedded into the framework to match objects across views by enforcing epipolar constraints, appearance feature similarity and class coherence. The multi-view extension enables the proposed framework to detect objects which would otherwise be mis-detected in a classical Mask R-CNN approach, and achieves coherent object labelling across views. By avoiding the need for additional training, the approach effectively overcomes the current shortage of multi-view datasets. The proposed framework achieves high quality results on a range of complex scenes, being able to output class, bounding box, mask and an additional label enforcing coherence across views. In the evaluation, we show qualitative and quantitative results on several challenging outdoor multi-view datasets and perform a comprehensive comparison to verify the advantages of the proposed method.
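As an illustration of the epipolar constraint mentioned above, the sketch below measures how far a candidate match in the second view lies from the epipolar line of a point in the first view. It is a generic two-view geometry check, not the paper's matching component; the function name, the use of detection centres as points, and the example fundamental matrix `F_RECTIFIED` (a rectified pair with purely horizontal camera translation) are assumptions.

```python
import math

# Hypothetical fundamental matrix of a rectified stereo pair:
# epipolar lines are horizontal, so matches must share the same image row.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]


def epipolar_distance(F, p1, p2):
    """Pixel distance from p2 (view 2) to the epipolar line of p1 (view 1).

    Undefined when the line is degenerate (a = b = 0).
    """
    x1 = (p1[0], p1[1], 1.0)
    # Epipolar line l = F @ x1, written as a*x + b*y + c = 0 in view 2.
    a, b, c = (sum(F[r][i] * x1[i] for i in range(3)) for r in range(3))
    return abs(a * p2[0] + b * p2[1] + c) / math.hypot(a, b)
```

A candidate pair could then be accepted only if this distance falls below a pixel threshold, before appearance similarity and class agreement are compared.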
This paper presents an approach to estimate the intrinsic texture properties (albedo, shading, normal) of scenes from multiple view acquisition under unknown illumination conditions. We introduce the concept of intrinsic textures, which are pixel-resolution surface textures representing the intrinsic appearance parameters of a scene. Unlike previous video relighting methods, the approach does not assume regions of uniform albedo, which makes it applicable to richly textured scenes. We show that intrinsic image methods can be used to refine an initial, low-frequency shading estimate based on a global lighting reconstruction from an original texture and coarse scene geometry in order to resolve the inherent global ambiguity in shading. The method is applied to relighting of free-viewpoint rendering from multiple view video capture. This demonstrates relighting with reproduction of fine surface detail. Quantitative evaluation on synthetic models with textured appearance shows accurate estimation of intrinsic surface reflectance properties. © 2014 Springer International Publishing.
We present a family of methods for 2D–3D registration spanning both deterministic and non-deterministic branch-and-bound approaches. Critically, the methods exhibit invariance to the underlying scene primitives, enabling e.g. points and lines to be treated on an equivalent basis, potentially enabling a broader range of problems to be tackled while maximising available scene information, all scene primitives being simultaneously considered. Being a branch-and-bound based approach, the method furthermore enjoys intrinsic guarantees of global optimality; while branch-and-bound approaches have been employed in a number of computer vision contexts, the proposed method represents the first time that this strategy has been applied to the 2D–3D correspondence-free registration problem from points and lines. Within the proposed procedure, deterministic and probabilistic procedures serve to speed up the nested branch-and-bound search while maintaining optimality. Experimental evaluation with synthetic and real data indicates that the proposed approach significantly increases both accuracy and robustness compared to the state of the art.
Reconstruction approaches based on monocular defocus analysis such as Depth from Defocus (DFD) often utilise the thin lens camera model. Despite this widespread adoption, there are inherent limitations associated with it. Coupled with invalid parameterisation commonplace in literature, the overly simplified image formation it describes leads to inaccurate defocus modelling, especially in macro-scale scenes. As a result, DFD reconstructions based around this model are not geometrically consistent, and are typically restricted to single-view applications. Subsequently, the handful of existing approaches which attempt to include additional viewpoints have had only limited success. In this work, we address these issues by instead utilising a thick lens camera model, and propose a novel calibration procedure to accurately parameterise it. The effectiveness of our model and calibration is demonstrated with a novel DFD reconstruction framework. We achieve highly detailed, geometrically accurate and complete 3D models of real-world scenes from multi-view focal stacks. To our knowledge, this is the first time DFD has been successfully applied to complete scene modelling in this way.
Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction and its application to various applications such as free-view rendering and virtual reality.
This paper presents a method to estimate alpha mattes for video sequences of the same foreground scene from wide-baseline views given sparse key-frame trimaps in a single view. A statistical inference framework is introduced for spatio-temporal propagation of high-confidence trimap labels between video sequences without a requirement for correspondence or camera calibration and motion estimation. Multiple view trimap propagation integrates appearance information between views and over time to achieve robust labelling in the presence of shadows, changes in appearance with viewpoint and overlap between foreground and background appearance. Results demonstrate that trimaps are sufficiently accurate to allow high-quality video matting using existing single view natural image matting algorithms. Quantitative evaluation against ground-truth demonstrates that the approach achieves accurate matte estimation for camera views separated by up to 180°, with the same amount of manual interaction required for conventional single view video matting.
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multiview dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.
3DTV production of live sports events presents a challenging problem involving conflicting requirements of maintaining broadcast stereo picture quality with practical problems in developing robust systems for cost effective deployment. In this paper we propose an alternative approach to stereo production in sports events using the conventional monocular broadcast cameras for 3D reconstruction of the event and subsequent stereo rendering. This approach has the potential advantage over stereo camera rigs of recovering full scene depth, allowing inter-ocular distance and convergence to be adapted according to the requirements of the target display and enabling stereo coverage from both existing and ‘virtual’ camera positions without additional cameras. A prototype system is presented with results of sports TV production trials for rendering of stereo and free-viewpoint video sequences of soccer and rugby.
Conventional stereoscopic video content production requires dedicated stereo camera rigs, which are costly and offer little flexibility for video editing. In this paper, we propose a novel approach which only requires a small number of standard cameras sparsely located around a scene to automatically convert the monocular inputs into stereoscopic streams. The approach combines a probabilistic spatio-temporal segmentation framework with a state-of-the-art multi-view graph-cut reconstruction algorithm, thus providing full control of the stereoscopic settings at render time. Results with studio sequences of complex human motion demonstrate the suitability of the method for high quality stereoscopic content generation with minimum user interaction.
This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved nonrigid object segmentation and shape reconstruction.
Here we present a novel, histogram-based salient point feature detector that may naturally be applied to both images and 3D data. Existing point feature detectors are often modality specific, with 2D and 3D feature detectors typically constructed in separate ways. As such, their applicability in a 2D-3D context is very limited, particularly where the 3D data is obtained by a LiDAR scanner. By contrast, our histogram-based approach is highly generalisable and as such, may be meaningfully applied between 2D and 3D data. Using the generalised approach, we propose salient point detectors for images, and both untextured and textured 3D data. The approach naturally allows for the detection of salient 3D points based jointly on both the geometry and texture of the scene, allowing for broader applicability. The repeatability of the feature detectors is evaluated using a range of datasets including image and LiDAR input from indoor and outdoor scenes. Experimental results demonstrate a significant improvement in terms of 2D-2D and 2D-3D repeatability compared to existing multi-modal feature detectors.
In computer vision, matting is the process of accurate foreground estimation in images and videos. In this paper we present a novel patch-based approach to video matting relying on non-parametric statistics to represent image variations in appearance. This overcomes the limitation of parametric algorithms which only rely on strong colour correlation between the nearby pixels. Initially we construct a clean background by utilising the foreground object’s movement across the background. For a given frame, a trimap is constructed using the background and the last frame’s trimap. A patch-based approach is used to estimate the foreground colour for every unknown pixel and finally the alpha matte is extracted. Quantitative evaluation shows that the technique performs better, in terms of the accuracy and the required user interaction, than the current state-of-the-art parametric approaches.
Recent advances in sensor technology have introduced low-cost RGB video plus depth sensors, such as the Kinect, which enable simultaneous acquisition of colour and depth images at video rates. This paper introduces a framework for representation of general dynamic scenes from video plus depth acquisition. A hybrid representation is proposed which combines the advantages of prior surfel graph surface segmentation and modelling work with the higher-resolution surface reconstruction capability of volumetric fusion techniques. The contributions are (1) extension of a prior piecewise surfel graph modelling approach for improved accuracy and completeness, (2) combination of this surfel graph modelling with TSDF surface fusion to generate dense geometry, and (3) proposal of means for validation of the reconstructed 4D scene model against the input data and efficient storage of any unmodelled regions via residual depth maps. The approach allows arbitrary dynamic scenes to be efficiently represented with temporally consistent structure and enhanced levels of detail and completeness where possible, but gracefully falls back to raw measurements where no structure can be inferred. The representation is shown to facilitate creative manipulation of real scene data which would previously require more complex capture setups or manual processing.
Stereoscopic video quality assessment has become a major research topic in recent years. Existing stereoscopic video quality metrics are predominantly based on stereoscopic image quality metrics extended to the time domain via for example temporal pooling. These approaches do not explicitly consider the motion sensitivity of the Human Visual System (HVS). To address this limitation, this paper introduces a novel HVS model inspired by physiological findings characterising the motion sensitive response of complex cells in the primary visual cortex (V1 area). The proposed HVS model generalises previous HVS models, which characterised the behaviour of simple and complex cells but ignored motion sensitivity, by estimating optical flow to measure scene velocity at different scales and orientations. The local motion characteristics (direction and amplitude) are used to modulate the output of complex cells. The model is applied to develop a new type of full-reference stereoscopic video quality metrics which uniquely combine non-motion sensitive and motion sensitive energy terms to mimic the response of the HVS. A tailored two-stage multi-variate stepwise regression algorithm is introduced to determine the optimal contribution of each energy term. The two proposed stereoscopic video quality metrics are evaluated on three stereoscopic video datasets. Results indicate that they achieve average correlations with subjective scores of 0.9257 (PLCC), 0.9338 and 0.9120 (SRCC), 0.8622 and 0.8306 (KRCC), and outperform previous stereoscopic video quality metrics including other recent HVS-based metrics.
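The three correlation measures quoted above (Pearson linear, Spearman rank-order and Kendall rank-order correlation) are standard ways of comparing objective metric outputs with subjective scores. The sketch below shows how they can be computed; it is a plain reference implementation for illustration, assuming no tied values, and the function names and sample data are hypothetical rather than taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks 1..n in ascending order; assumes no ties for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall rank-order correlation (KRCC, tau-a); assumes no ties."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            # +1 for a concordant pair, -1 for a discordant pair.
            score += 1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
    return score / (n * (n - 1) / 2)


# Hypothetical objective metric outputs and subjective scores.
objective = [1.0, 2.0, 3.0, 4.0, 5.0]
subjective = [2.0, 1.0, 4.0, 3.0, 5.0]
```

PLCC measures linear agreement, while SRCC and KRCC depend only on the ordering of the scores, which is why quality metrics are commonly reported with all three.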
Stereoscopic imaging is becoming increasingly popular. However, to ensure the best quality of experience, there is a need to develop more robust and accurate objective metrics for stereoscopic content quality assessment. Existing stereoscopic image and video metrics are either extensions of conventional 2D metrics (with added depth or disparity information) or are based on relatively simple perceptual models. Consequently, they tend to lack the accuracy and robustness required for stereoscopic content quality assessment. This paper introduces full-reference stereoscopic image and video quality metrics based on a Human Visual System (HVS) model incorporating important physiological findings on binocular vision. The proposed approach is based on the following three contributions. First, it introduces a novel HVS model extending previous models to include the phenomena of binocular suppression and recurrent excitation. Second, an image quality metric based on the novel HVS model is proposed. Finally, an optimised temporal pooling strategy is introduced to extend the metric to the video domain. Both image and video quality metrics are obtained via a training procedure to establish a relationship between subjective scores and objective measures of the HVS model. The metrics are evaluated using publicly available stereoscopic image/video databases as well as a new stereoscopic video database. An extensive experimental evaluation demonstrates the robustness of the proposed quality metrics. This indicates a considerable improvement with respect to the state-of-the-art with average correlations with subjective scores of 0.86 for the proposed stereoscopic image metric and 0.89 and 0.91 for the proposed stereoscopic video metrics.
We present a new algorithm for segmenting video frames into temporally stable colored regions, applying our technique to create artistic stylizations (e.g. cartoons and paintings) from real video sequences. Our approach is based on a multilabel graph cut applied to successive frames, in which the color data term and label priors are incrementally updated and propagated over time. We demonstrate coherent segmentation and stylization over a variety of home videos.
Helmholtz Stereopsis (HS) has recently been explored as a promising technique for capturing shape of objects with unknown reflectance. So far, it has been widely applied to objects of smooth geometry and piecewise uniform Bidirectional Reflectance Distribution Function (BRDF). Moreover, for nonconvex surfaces the inter-reflection effects have been completely neglected. We extend the method to surfaces which exhibit strong texture, nontrivial geometry and are possibly nonconvex. The problem associated with these surface features is that Helmholtz reciprocity is apparently violated when point-based measurements are used independently to establish the matching constraint as in the standard HS implementation. We argue that the problem is avoided by computing radiance measurements on image regions corresponding exactly to projections of the same surface point neighbourhood with appropriate scale. The experimental results demonstrate the success of the novel method proposed on real objects.
© Springer International Publishing Switzerland 2014. Human pose estimation from monocular video streams is a challenging problem. Much of the work on this problem has focused on developing inference algorithms and probabilistic prior models based on learned measurements. Such algorithms face challenges in generalisation beyond the learned dataset. We propose an interactive model-based generative approach for estimating the human pose from uncalibrated monocular video in unconstrained sports TV footage. Belief propagation over a spatio-temporal graph of candidate body part hypotheses is used to estimate a temporally consistent pose between user-defined keyframe constraints. Experimental results show that the proposed generative pose estimation framework is capable of estimating pose even in very challenging unconstrained scenarios.