Professor Adrian Hilton FREng FIAPR FIET
About
Biography
In January 2012 I became Director of the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey. CVSSP is one of the largest UK research groups in Audio-Visual Machine Perception with 170 researchers and a grant portfolio in excess of £31M. CVSSP research spans audio and video processing, computer vision, machine learning, spatial audio, 3D/4D video, medical image analysis and multimedia communication systems with strong industry collaboration. The centre has an outstanding track-record of pioneering research leading to successful technology transfer with UK industry.
The goal of my research is to develop seeing machines with the visual sense to understand and model dynamic real-world scenes. For example, measuring from video the biomechanics of an Olympic athlete performing a world-record high-jump.
My research is pioneering the next generation of 4D computer vision, capable of sensing both 3D shape and motion, to enable seeing machines that can understand and model dynamic scenes. Recent research introduced 4D vision for analysis in sports, for instance, measurement of football players from live TV cameras. This technology has been used by the BBC in sports commentary to visualise the action from novel directions, such as the referee's or goalkeeper's view.
I have successfully commercialised technologies for 3D and 4D shape capture exploited in entertainment, manufacture & health, receiving two EU IST Innovation Prizes, a Manufacturing Industry Achievement Award, and a Royal Society Industry Fellowship with Framestore on Digital Doubles for Film (2008-11). I am a Royal Society Wolfson Research Merit Award holder in 4D Vision (2013-18) and am currently PI on the S3A Programme Grant in Future Spatial Audio at Home, combining audio and vision expertise, and the InnovateUK project ALIVE, led by The Foundry, developing tools for 360 video reconstruction and editing.
I lead the Visual Media Research Lab (V-Lab) in CVSSP, which conducts research in video analysis, computer vision and graphics for next generation communication and entertainment applications.
My research combines the fields of computer vision, machine learning, graphics and animation to investigate new methods for reconstruction, modelling and understanding of the real world from images and video. Applications include: sports analysis (soccer, rugby, athletics), 3D TV and film production, visual effects, character animation for games, digital doubles for film and facial animation for visual communication.
Current research is focused on video-based measurement in sports, multiple camera systems in film and TV production, and 3D video for highly realistic animation of people and faces. Research is conducted in collaboration with UK companies in the creative industries. From 2008-12 I was supported by a Royal Society Industry Fellowship to conduct research with leading visual-effects company Framestore investigating 4D technologies for Digital Doubles in film production.
Please contact me if you are interested in current PhD and post-doctoral research opportunities.
University roles and responsibilities
- Director, Centre for Vision, Speech and Signal Processing (CVSSP)
- Distinguished Professor of Computer Vision
- Head of the Visual Media Research Lab (V-Lab)
Research
Research interests
I lead AI@Surrey, a cross-University initiative bringing together an interdisciplinary network of over 300 academic and research staff at Surrey active in AI-related research, from fundamental theory to practical applications that benefit society. AI@Surrey spans almost every area of our research, from medical imaging to improve cancer detection and IoT sensor networks to support healthy ageing, to law and ethics for the regulation of autonomous decision-making systems. The University leads on fundamental AI theory - pioneering new machine learning algorithms and understanding of black-box AI systems - through to applications in healthcare, security, transport, education, entertainment, communication and scientific discovery.
My current research is focused on Audio-Visual Machine Perception to enable machines to understand and interact with dynamic real-world scenes. Machine Perception is the bridge between Artificial Intelligence and Sensing (seeing, hearing and other senses). This research brings together expertise in audio, vision, machine learning and AI to pioneer future intelligent sensing technologies in collaboration with UK industry.
Machine perception of people, and the everyday environments that we live and work in, is a key enabling technology for future intelligent machines. This will impact and benefit many aspects of our lives, from robot assistants able to work safely alongside people, personalised healthcare at home, improved safety and security, safe autonomous vehicles and automated animal welfare monitoring, through to improved social communication and new forms of immersive audio-visual entertainment. The central research challenge for all of these technologies is to enable machines to understand complex real-world dynamic scenes. Reliable Machine Perception of complex scenes requires the fusion of multiple sensing modalities with complementary information, such as seeing and hearing.
Towards this goal I lead research in the area of 4D Vision to understand and model dynamic scenes such as people. Over the past decade we have introduced world-leading technology for 4D dynamic scene capture and reconstruction from multiple view video. This technology underpins future machine perception and is also enabling creation of video-realistic content for film, TV and Virtual Reality.
Our research has pioneered several new technologies including:
- 4D Performance Capture and Animation: Volumetric capture of actor performance enabling video-realistic replay with real-time control of movement. This technology bridges the gap between real and computer-generated imagery of people. Current research is enabling the use of this technology in Virtual and Augmented Reality to create cinematic experiences.
- Real-time full-body motion capture outdoors: Our research has introduced methods for production quality full-body motion capture in unconstrained outdoor environments (such as a film set) from video + IMU. This technology is enabling content production for film and TV.
- Free-viewpoint video for sports TV production: Research in collaboration with the BBC has introduced technology to produce virtual camera views and 3D stereo of sports events from the broadcast TV cameras. This has been used in soccer and rugby to produce views such as the referee's viewpoint or a virtual view along the offside line.
- Video-rate 4D capture of facial shape and motion: The first video system to simultaneously capture facial shape and appearance, used to produce highly realistic animated face models of real people for games and communication. This technology is now used in medicine for analysis of facial shape, in security for face recognition and in the creative industries to create highly realistic digital doubles of actors.
- AvatarMe: A 3D photo-booth to capture realistic 3D animated models of people for games. This technology was used to produce animated models of over 300,000 members of the general public.
This research has received several awards for outstanding academic publications and technical innovation. It has enabled a number of award-winning commercial systems for industrial design, computer animation, broadcast production, games and visual communication.
Current research is pursuing video-realistic reproduction and behavioural modelling of dynamic scenes such as moving people, sports and other natural phenomena. Ultimately the challenge is to capture representations which produce the appearance and interactive behaviour of real people.
If you are interested in pursuing related PhD or post-doctoral research in the areas of vision, graphics or animation, please contact me about current vacancies.
Research projects
S3A: Future Spatial Audio at Home - EPSRC Programme Grant (2013-2019)
Audio-Visual Media Research Platform - EPSRC Platform Grant (2017-2022) supporting research in audio-visual machine perception
InnovateUK - The Foundry and Figment Productions
InnovateUK - Numerion and The Imaginarium
EU H2020 - BrainStorm
Research collaborations
Collaborators on this research include companies from the film, broadcast, games, communication and consumer electronics industries:
- BBC
- BT
- Sony
- Philips
- Mitsubishi
- Framestore
- DNeg
- The Foundry
- Filmlight
- Hensons
- 3D Scanners
- Avatar-Me
- BUF
- QD
- Rebellion
- IKinema
Publications
This landing page contains the datasets presented in the paper "Finite Aperture Stereo". The datasets are intended for defocus-based 3D reconstruction and analysis. Each download link contains images of a static scene, captured from multiple viewpoints and with different focus settings. The captured objects exhibit a range of reflectance properties and are physically small in scale. Calibration images are also available. A CC BY-NC licence is in effect. Use of this data must be for non-commercial research purposes. Acknowledgement must be given to the original authors by referencing the dataset DOI, the dataset web address, and the aforementioned publication. Re-distribution of this data is prohibited. Before downloading, you must agree with these conditions as presented on the dataset webpage.
Leveraging machine learning techniques, in the context of object-based media production, could enable provision of personalized media experiences to diverse audiences. To fine-tune and evaluate techniques for personalization applications, as well as more broadly, datasets which bridge the gap between research and production are needed. We introduce and publicly release such a dataset, themed around a UK weather forecast and shot against a blue-screen background, of three professional actors/presenters – one male and one female (English) and one female (British Sign Language). Scenes include both production and research-oriented examples, with a range of dialogue, motions, and actions. Capture techniques consisted of a synchronized 4K resolution 16-camera array, production-typical microphones plus professional audio mix, a 16-channel microphone array with collocated Grasshopper3 camera, and a photogrammetry array. We demonstrate applications relevant to virtual production and creation of personalized media including neural radiance fields, shadow casting, action/event detection, speaker source tracking and video captioning.
To overcome the shortage of real-world multi-view multiple people datasets, we introduce a new synthetic multi-view multiple people labelling dataset named Multi-View 3D Humans (MV3DHumans). This is a large-scale synthetic image dataset generated for multi-view multiple people detection, labelling and segmentation tasks. The MV3DHumans dataset contains 1200 scenes, with 4, 6, 8 or 10 people in each scene. Each scene is captured by 16 cameras with overlapping fields of view. The MV3DHumans dataset provides RGB images at a resolution of 640 × 480. Ground truth annotations, including bounding boxes, instance masks and multi-view correspondences, as well as camera calibrations, are provided with the dataset.
3DVHshadow contains images of diverse synthetic humans generated to evaluate the performance of cast hard shadow algorithms for humans. Each dataset entry includes (a) a rendering of the subject from the camera view point, (b) its binary segmentation mask, and (c) its binary cast shadow mask on a planar surface -- in total 3 images. The respective rendering metadata such as point light source position, camera pose, camera calibration, etc. is also provided alongside the images. Please refer to the corresponding publication for details of the dataset generation.
Data accompanying the paper "Evaluation of Spatial Audio Reproduction Methods (Part 2): Analysis of Listener Preference".
This is the dataset used for the accompanying paper "Automatic text clustering for audio attribute elicitation experiment responses".
About 1.7 million new cases of breast cancer were estimated by the World Health Organization (WHO) in 2012, accounting for 23 percent of all female cancers. In the UK, 33 percent of women aged 50 and above were diagnosed in the same year, positioning the UK as the 6th highest in breast cancer amongst the European countries. The national screening programme in the UK has focused on early detection and timely intervention to improve prognosis and extend the life span of patients. To this end, the National Health Service Breast Screening Programme (NHSBSP) employs 2D planar mammography because it is considered the gold-standard technique for early breast cancer detection worldwide. Breast tomosynthesis has shown great promise as an alternative method for removing the intrinsic overlying clutter seen in conventional 2D imaging. However, preliminary work in breast CT has provided a number of compelling advantages that motivate the work featured in this thesis. These include removing the need to mechanically compress the breast, which is a source of screening non-attendance, and providing unique cross-sectional images that remove almost all of the overlying clutter seen in 2D. This renders lesions more visible and hence aids early detection of malignancy. However, work in breast CT to date has focused on using scaled-down versions of standard clinical CT systems. By contrast, this thesis proposes a photon-counting approach. The work of this thesis focuses on investigating photon-counting detector technology and comparing it to conventional CT in terms of contrast visualisation. Results from simulation work developed in this thesis demonstrate the ability of photon-counting detector technology to utilise data across a polychromatic beam, where contrast is seen to decrease with increasing photon energy, in comparison with the conventional approach of a standard clinical CT system.
This paper proposes to learn self-shadowing on full-body, clothed human postures from monocular colour image input, by supervising a deep neural model. The proposed approach implicitly learns the articulated body shape in order to generate self-shadow maps without seeking to reconstruct explicitly or estimate parametric 3D body geometry. Furthermore, it is generalisable to different people without per-subject pre-training, and has fast inference timings. The proposed neural model is trained on self-shadow maps rendered from 3D scans of real people for various light directions. Inference of shadow maps for a given illumination is performed from only 2D image input. Quantitative and qualitative experiments demonstrate comparable results to the state of the art whilst being monocular and achieving a considerably faster inference time. We provide ablations of our methodology and further show how the inferred self-shadow maps can benefit monocular full-body human relighting.
Object co-segmentation is a challenging task which aims to segment common objects in multiple images at the same time. Generally, common information about the same object needs to be found to solve this problem. In various scenarios, common objects in different images share only the same semantic information. In this paper, we propose a deep object co-segmentation method based on channel and spatial attention, which combines the attention mechanism with a deep neural network to enhance the common semantic information. A Siamese encoder-decoder structure is used for this task. First, the encoder network is employed to extract low-level and high-level features of image pairs. Second, we introduce an improved attention mechanism in the channel and spatial domains to enhance the multi-level semantic features of common objects. Then, the decoder module accepts the enhanced feature maps and generates the masks of both images. Finally, we evaluate our approach on datasets commonly used for the co-segmentation task, and the experimental results show that our approach achieves competitive performance.
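As an illustration of the channel and spatial attention idea described above, the sketch below shows a generic CBAM-style attention block in PyTorch; the module structure and reduction ratio are assumptions for illustration, not the exact architecture from the paper.

```python
# Illustrative sketch of channel + spatial attention (CBAM-style); module layout
# and the reduction ratio are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite per-channel weights.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise mean and max maps.
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(stats))

feat = torch.randn(2, 64, 32, 32)
enhanced = ChannelSpatialAttention(64)(feat)  # same shape, attention-weighted
```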
This paper addresses the problem of reconstructing an integrated 3D model from multiple 2.5D range images. A novel integration algorithm is presented based on a continuous implicit surface representation. This is the first reconstruction algorithm to use operations in 3D space only. The algorithm is guaranteed to reconstruct the correct topology of surface features larger than the range image sampling resolution. Reconstruction of triangulated models from multi-image data sets is demonstrated for complex objects. Performance characterization of existing range image integration algorithms is addressed in the second part of this paper. This comparison defines the relative computational complexity and geometric limitations of existing integration algorithms.
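To make the implicit-surface idea concrete, here is a minimal sketch of fusing 3D point samples (a stand-in for range image data) into a truncated signed-distance volume and extracting a triangulated mesh. This illustrates the general approach only; the toy sphere data, grid resolution and truncation distance are assumptions, not the paper's continuous representation or integration algorithm.

```python
# Minimal sketch: point samples -> truncated signed-distance field -> mesh.
import numpy as np
from scipy.spatial import cKDTree
from skimage import measure

res, trunc = 64, 0.08                       # voxel resolution, truncation distance
grid = np.linspace(-0.5, 0.5, res)
X, Y, Z = np.meshgrid(grid, grid, grid, indexing="ij")
voxels = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

# Toy "range data": points sampled on a sphere of radius 0.3 (placeholder input).
points = np.random.randn(2000, 3)
points = 0.3 * points / np.linalg.norm(points, axis=1, keepdims=True)

# Distance magnitude from the nearest sample, sign from a simple inside/outside
# test that is valid for this toy sphere only.
dist, _ = cKDTree(points).query(voxels)
sign = np.sign(np.linalg.norm(voxels, axis=1) - 0.3)
field = np.clip(sign * dist, -trunc, trunc).reshape(res, res, res)

# Extract the zero level-set as a triangulated mesh (marching cubes).
verts, faces, normals, _ = measure.marching_cubes(field, level=0.0)
print(verts.shape, faces.shape)
```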
In this study, a novel sleep pose identification method is proposed for classifying 12 different sleep postures using a two-step deep learning process. Transfer learning is used as an initial stage to retrain a well-known CNN (VGG-19) to categorise the data into four main pose classes: supine, left, right, and prone. According to the decision made by VGG-19, subsets of the image data are then passed to one of four dedicated sub-class CNNs. As a result, the pose estimation label is refined from one of four sleep pose labels to one of 12 sleep pose labels. Ten participants contributed recordings of infrared (IR) images of 12 pre-defined sleep positions. Participants were covered by a blanket to occlude the original pose and present a more realistic sleep situation. Finally, we compared our results with (1) a traditional CNN trained from scratch and (2) a VGG-19 network retrained in one stage. The average accuracy increased to 85.6%, compared with 74.5% and 78.1% for (1) and (2) respectively.
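The coarse-to-fine routing described above can be sketched as follows. The placeholder networks stand in for the retrained VGG-19 and sub-class CNNs, and the assumption of three sub-poses per coarse class (4 × 3 = 12) is for illustration only.

```python
# Sketch of two-step coarse-to-fine pose inference: a coarse classifier picks one
# of four main pose classes, then a dedicated sub-class network refines the label.
import torch
import torch.nn as nn

def make_classifier(num_classes):
    # Placeholder CNN standing in for the retrained VGG-19 / sub-class networks.
    return nn.Sequential(
        nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, num_classes))

coarse_net = make_classifier(4)                    # supine / left / right / prone
sub_nets = [make_classifier(3) for _ in range(4)]  # assumed 3 sub-poses per class

def predict_pose(ir_image):
    with torch.no_grad():
        coarse = coarse_net(ir_image).argmax(dim=1).item()
        fine = sub_nets[coarse](ir_image).argmax(dim=1).item()
    return coarse * 3 + fine                       # index in 0..11

print(predict_pose(torch.randn(1, 1, 128, 128)))
```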
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
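The relative positional encoding referred to above can be illustrated with a small temporal self-attention layer that adds a learned bias per relative frame offset. Layer sizes and the bias parameterisation are assumptions for illustration, not the PAT implementation.

```python
# Illustrative temporal self-attention with a learned relative positional bias.
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, dim, max_len=256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        # One learned bias per relative temporal offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                      # x: (batch, time, dim)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale            # (b, t, t)
        offsets = torch.arange(t)[:, None] - torch.arange(t)[None, :]
        attn = attn + self.rel_bias[offsets + self.max_len - 1]  # relative bias
        return torch.softmax(attn, dim=-1) @ v

out = RelPosSelfAttention(dim=64)(torch.randn(2, 100, 64))       # (2, 100, 64)
```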
This work introduces a new approach to material editing and to mutual illumination correction in edited or composited scenes. Lambertian reflectance values from a static reference set are transferred across a dynamic sequence using non-parametric statistics. Both original and edited reflectance values are transferred from the reference set. Dividing out the original reflectance from the dynamic sequence gives an illumination-shading map. This illumination-shading map is corrected for the changes in mutual illumination caused by the material edit and composite before being combined with the edited reflectance values, giving the material-edited and mutual-illumination-corrected result. The correction is made by splitting the illumination-shading map into direct and mutual illumination components and replacing the mutual illumination component. Results show that, given reflectance values for a surface, mutual illumination can be detected and modified, giving a greater degree of photorealism to composites and material edits.
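A minimal numpy sketch of the reflectance-division idea described above: dividing the original reflectance out of the image gives an illumination-shading map, which is then recombined with the edited reflectance. The arrays are placeholders and the mutual-illumination correction step is omitted here.

```python
# Shading map = image / original reflectance; edit = shading * edited reflectance.
import numpy as np

image = np.random.rand(480, 640, 3)              # observed frame of the sequence
reflectance_orig = np.random.rand(480, 640, 3)   # transferred original reflectance
reflectance_edit = np.random.rand(480, 640, 3)   # transferred edited reflectance

eps = 1e-6
shading = image / (reflectance_orig + eps)       # illumination-shading map
edited_frame = np.clip(shading * reflectance_edit, 0.0, 1.0)
```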
In this paper a new multi-view/3D human action/interaction database is presented. The database has been created using a convergent eight-camera setup to produce high-definition multi-view videos, where each video depicts one of eight persons performing one of twelve different human motions. Various types of motions have been recorded, i.e., scenes where one person performs a specific movement, scenes where a person executes different movements in succession and scenes where two persons interact with each other. Moreover, the subjects have different body sizes and clothing and are of different sexes, nationalities, etc. The multi-view videos have been further processed to produce a 3D mesh at each frame describing the respective 3D human body surface. To increase the applicability of the database, for each person a multi-view video depicting the person performing sequentially the six basic facial expressions separated by the neutral expression has also been recorded. The database is freely available for research purposes.
We present a novel method for modelling dynamic texture appearance from a minimal set of cameras. Previous methods to capture the dynamic appearance of a human from multi-view video have relied on large, expensive camera setups, and typically store texture on a frame-by-frame basis. We fit a parameterised human body model to multi-view video from minimal cameras (as few as 3), and combine the partial texture observations from multiple viewpoints and frames in a learned framework to generate full-body textures with dynamic details given an input pose. Key to our method are our multi-band loss functions, which apply separate blending functions to the high and low spatial frequencies to reduce texture artefacts. We evaluate our method on a range of multi-view datasets, and show that our model is able to accurately produce full-body dynamic textures, even with only partial camera coverage. We demonstrate that our method outperforms other texture generation methods on minimal camera setups.
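The multi-band loss mentioned above can be sketched as a split of the texture into low and high spatial-frequency bands (Gaussian blur and residual), each penalised separately. The kernel size, sigma and band weights below are assumptions, not the values used in the paper.

```python
# Sketch of a multi-band texture loss: separate penalties on low/high frequencies.
import torch
import torch.nn.functional as F

def gaussian_blur(x, k=9, sigma=2.0):
    coords = torch.arange(k, dtype=torch.float32) - k // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = (g[:, None] @ g[None, :]).expand(x.shape[1], 1, k, k).contiguous()
    return F.conv2d(x, kernel, padding=k // 2, groups=x.shape[1])

def multi_band_loss(pred, target, w_low=1.0, w_high=2.0):
    pred_low, target_low = gaussian_blur(pred), gaussian_blur(target)
    low = F.l1_loss(pred_low, target_low)                    # low-frequency band
    high = F.l1_loss(pred - pred_low, target - target_low)   # high-frequency detail
    return w_low * low + w_high * high

loss = multi_band_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```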
Semantic scene completion is the task of predicting a complete 3D representation of volumetric occupancy with corresponding semantic labels for a scene from a single point of view. In this paper, we present EdgeNet, a new end-to-end neural network architecture that fuses information from depth and RGB, explicitly representing RGB edges in 3D space. Previous works on this task used either depth-only or depth with colour by projecting 2D semantic labels generated by a 2D segmentation network into the 3D volume, requiring a two step training process. Our EdgeNet representation encodes colour information in 3D space using edge detection and flipped truncated signed distance, which improves semantic completion scores especially in hard to detect classes. We achieved state-of-the-art scores on both synthetic and real datasets with a simpler and a more computationally efficient training pipeline than competing approaches.
Correct labelling of multiple people from different viewpoints in complex scenes is a challenging task due to occlusions, visual ambiguities, as well as variations in appearance and illumination. In recent years, deep learning approaches have proved very successful at improving the performance of a wide range of recognition and labelling tasks such as person re-identification and video tracking. However, to date, applications to multi-view tasks have proved more challenging due to the lack of suitably labelled multi-view datasets, which are difficult to collect and annotate. The contributions of this paper are two-fold. First, a synthetic dataset is generated by combining 3D human models and panoramas along with human poses and appearance detail rendering to overcome the shortage of real dataset for multi-view labelling. Second, a novel framework named Multi-View Labelling network (MVL-net) is introduced to leverage the new dataset and unify the multi-view multiple people detection, segmentation and labelling tasks in complex scenes. To the best of our knowledge, this is the first work using deep learning to train a multi-view labelling network. Experiments conducted on both synthetic and real datasets demonstrate that the proposed method outperforms the existing state-of-the-art approaches.
Leveraging machine learning techniques, in the context of object-based media production, could enable provision of personalized media experiences to diverse audiences. To fine-tune and evaluate techniques for personalization applications, as well as more broadly, datasets which bridge the gap between research and production are needed. We introduce and publicly release such a dataset, themed around a UK weather forecast and shot against a blue-screen background, of three professional actors/presenters – one male and one female (English) and one female (British Sign Language). Scenes include both production and research-oriented examples, with a range of dialogue, motions, and actions. Capture techniques consisted of a synchronized 4K resolution 16-camera array, production-typical microphones plus professional audio mix, a 16-channel microphone array with collocated Grasshopper3 camera, and a photogrammetry array. We demonstrate applications relevant to virtual production and creation of personalized media including neural radiance fields, shadow casting, action/event detection, speaker source tracking and video captioning.
We present a review of current methods for 3D face modeling, 3D to 3D and 3D to 2D registration, 3D based recognition, and 3D assisted 2D based recognition. The emphasis is on 3D registration, which plays a crucial role in the recognition chain. An evaluation study of a mainstream state-of-the-art 3D face registration algorithm is carried out and the results are discussed.
This paper presents a 3D human pose estimation system that uses a stereo pair of 360° sensors to capture the complete scene from a single location. The approach combines the advantages of omnidirectional capture, the accuracy of multiple view 3D pose estimation and the portability of monocular acquisition. Joint monocular belief maps for joint locations are estimated from 360° images and are used to fit a 3D skeleton to each frame. Temporal data association and smoothing is performed to produce accurate 3D pose estimates throughout the sequence. We evaluate our system on the Panoptic Studio dataset, as well as real 360° video for tracking multiple people, demonstrating an average Mean Per Joint Position Error of 12.47cm with 30cm baseline cameras. We also demonstrate improved capabilities over perspective and 360° multi-view systems when presented with limited camera views of the subject.
A common problem in the 4D reconstruction of people from multi-view video is the quality of the captured dynamic texture appearance, which depends on both the camera resolution and capture volume. Typically the requirement to frame cameras to capture the volume of a dynamic performance (>50m³) results in the person occupying only a small proportion (<10%) of the field of view. Even with ultra-high-definition 4k video acquisition this results in sampling the person at less than standard-definition 0.5k video resolution, resulting in low-quality rendering. In this paper we propose a solution to this problem through super-resolution appearance transfer from a static high-resolution appearance capture rig using digital stills cameras (>8k) to capture the person in a small volume (<8m³). A pipeline is proposed for super-resolution appearance transfer from high-resolution static capture to dynamic video performance capture to produce super-resolution dynamic textures. This addresses two key problems: colour mapping between different camera systems; and dynamic texture map super-resolution using a learnt model. Comparative evaluation demonstrates a significant qualitative and quantitative improvement in rendering the 4D performance capture with super-resolution dynamic texture appearance. The proposed approach reproduces the high-resolution detail of the static capture whilst maintaining the appearance dynamics of the captured video.
We propose a multi-view framework for joint object detection and labelling based on pairs of images. The proposed framework extends the single-view Mask R-CNN approach to multiple views without need for additional training. Dedicated components are embedded into the framework to match objects across views by enforcing epipolar constraints, appearance feature similarity and class coherence. The multi-view extension enables the proposed framework to detect objects which would otherwise be mis-detected in a classical Mask R-CNN approach, and achieves coherent object labelling across views. By avoiding the need for additional training, the approach effectively overcomes the current shortage of multi-view datasets. The proposed framework achieves high quality results on a range of complex scenes, being able to output class, bounding box, mask and an additional label enforcing coherence across views. In the evaluation, we show qualitative and quantitative results on several challenging outdoor multi-view datasets and perform a comprehensive comparison to verify the advantages of the proposed method.
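The epipolar constraint used for cross-view matching can be illustrated with a simple point-to-epipolar-line distance test: a detection centre in one view should lie close to the epipolar line of its corresponding detection in the other view. The fundamental matrix, detection centres and pixel threshold below are placeholders for illustration.

```python
# Sketch of an epipolar-constraint check for associating detections across views.
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance from point x2 (view 2) to the epipolar line of x1 (view 1)."""
    x1h, x2h = np.append(x1, 1.0), np.append(x2, 1.0)
    line = F @ x1h                        # epipolar line a*x + b*y + c = 0 in view 2
    return abs(line @ x2h) / np.hypot(line[0], line[1])

F = np.random.rand(3, 3)                  # placeholder fundamental matrix
det_view1 = np.array([320.0, 240.0])      # bounding-box centre in view 1
det_view2 = np.array([300.0, 255.0])      # candidate bounding-box centre in view 2

if epipolar_distance(F, det_view1, det_view2) < 5.0:   # pixel threshold (assumed)
    print("candidate pair satisfies the epipolar constraint")
```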
We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and we show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.
We propose a novel framework to reconstruct super-resolution human shape from a single low-resolution input image. The approach overcomes limitations of existing approaches that reconstruct 3D human shape from a single image, which require high-resolution images together with auxiliary data such as surface normal or a parametric model to reconstruct high-detail shape. The proposed framework represents the reconstructed shape with a high-detail implicit function. Analogous to the objective of 2D image super-resolution, the approach learns the mapping from a low-resolution shape to its high-resolution counterpart and it is applied to reconstruct 3D shape detail from low-resolution images. The approach is trained end-to-end employing a novel loss function which estimates the information lost between a low and high-resolution representation of the same 3D surface shape. Evaluation for single image reconstruction of clothed people demonstrates that our method achieves high-detail surface reconstruction from low-resolution images without auxiliary data. Extensive experiments show that the proposed approach can estimate super-resolution human geometries with a significantly higher level of detail than that obtained with previous approaches when applied to low-resolution images. https://marcopesavento.github.io/SuRS/.
This paper proposes a novel Attention-based Multi-Reference Super-resolution network (AMRSR) that, given a low-resolution image, learns to adaptively transfer the most similar texture from multiple reference images to the super-resolution output whilst maintaining spatial coherence. The use of multiple reference images together with attention-based sampling is demonstrated to achieve significantly improved performance over state-of-the-art reference super-resolution approaches on multiple benchmark datasets. Reference super-resolution approaches have recently been proposed to overcome the ill-posed problem of image super-resolution by providing additional information from a high-resolution reference image. Multi-reference super-resolution extends this approach by providing a more diverse pool of image features to overcome the inherent information deficit whilst maintaining memory efficiency. A novel hierarchical attention-based sampling approach is introduced to learn the similarity between low-resolution image features and multiple reference images based on a perceptual loss. Ablation demonstrates the contribution of both multi-reference and hierarchical attention-based sampling to overall performance. Perceptual and quantitative ground-truth evaluation demonstrates significant improvement in performance even when the reference images deviate significantly from the target image.
Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audiovisual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates. Monaural audio may successfully detect the presence of speech activity but fails in localizing the speaker due to the lack of spatial cues. Our solution extends the audio front-end using a microphone array. We train an audio convolutional neural network (CNN) in combination with beamforming techniques to regress the speaker's horizontal position directly in the video frames. We propose to generate weak labels using a pre-trained active speaker detector on pre-extracted face tracks. Our pipeline embraces the "student-teacher" paradigm, where a trained "teacher" network is used to produce pseudo-labels visually. The "student" network is an audio network trained to generate the same results. At inference, the student network can independently localize the speaker in the visual frames directly from the audio input. Experimental results on newly collected data prove that our approach significantly outperforms a variety of other baselines as well as the teacher network itself. It results in an excellent speech activity detector too.
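The student-teacher setup described above can be sketched as a training loop in which a pretrained visual teacher produces pseudo-labels for speaker position and an audio "student" network is trained to regress the same positions from multichannel audio. The networks, data shapes and label function below are placeholders, not the models from the paper.

```python
# Schematic student-teacher training: visual pseudo-labels supervise an audio net.
import torch
import torch.nn as nn

audio_student = nn.Sequential(nn.Conv1d(16, 32, 5, padding=2), nn.ReLU(),
                              nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
optimiser = torch.optim.Adam(audio_student.parameters(), lr=1e-4)

def teacher_pseudo_label(face_tracks):
    # Placeholder: a pretrained visual active-speaker detector would return the
    # normalised horizontal position of the active speaker in each clip.
    return torch.rand(face_tracks.shape[0], 1)

for _ in range(10):                                   # toy training loop
    mic_array_audio = torch.randn(8, 16, 4800)        # 16-channel audio clips
    face_tracks = torch.randn(8, 3, 5, 112, 112)      # corresponding video crops
    with torch.no_grad():
        target = teacher_pseudo_label(face_tracks)    # weak labels from the teacher
    pred = audio_student(mic_array_audio)             # student regresses position
    loss = nn.functional.mse_loss(pred, target)
    optimiser.zero_grad(); loss.backward(); optimiser.step()
```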
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.
Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.
Virtual Reality (VR) systems have been intensely explored, with several research communities investigating the different modalities involved. Regarding the audio modality, one of the main issues is the generation of sound that is perceptually coherent with the visual reproduction. Here, we propose a pipeline for creating plausible interactive reverb using visual information: first, we characterize real environment acoustics given a pair of spherical cameras; then, we reproduce reverberant spatial sound, by using the estimated acoustics, within a VR scene. The evaluation is made by extracting the room impulse responses (RIRs) of four virtually rendered rooms. Results show agreement, in terms of objective metrics, between the synthesized acoustics and the ones calculated from RIRs recorded within the respective real rooms.
Recent progress in Virtual Reality (VR) and Augmented Reality (AR) allows us to experience various VR/AR applications in our daily life. In order to maximise the immersiveness of the user in VR/AR environments, plausible spatial audio reproduction synchronised with visual information is essential. In this paper, we propose a simple and efficient system to estimate room acoustics for plausible reproduction of spatial audio using 360° cameras for VR/AR applications. A pair of 360° images is used for room geometry and acoustic property estimation. A simplified 3D geometric model of the scene is estimated by depth estimation from captured images and semantic labelling using a convolutional neural network (CNN). The real environment acoustics are characterised by frequency-dependent acoustic predictions of the scene. Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties in the scene. The reconstructed scenes are rendered with synthesised spatial audio as VR/AR content. The results of estimated room geometry and simulated spatial audio are evaluated against actual measurements and audio calculated from ground-truth Room Impulse Responses (RIRs) recorded in the rooms. Details about the data underlying this work, along with the terms for data access, are available from: http://dx.doi.org/10.15126/surreydata.00812228
Background Sleep disturbances are both risk factors for and symptoms of dementia. Current methods for assessing sleep disturbances are largely based on either polysomnography (PSG) which is costly and inconvenient, or self‐ or care‐giver reports which are prone to measurement error. Low‐cost methods to monitor sleep disturbances longitudinally and at scale can be useful for assessing symptom development. Here, we develop deep learning models that use multimodal variables (accelerometers and temperature) recorded by the AX3 to accurately identify sleep and wake epochs and derive sleep parameters. Method Eighteen men and women (65‐80y) participated in a sleep laboratory‐based study in which multiple devices for sleep monitoring were evaluated. PSGs were recorded over a 10‐h period and scored according to established criteria per 30 sec epochs. Tri‐axial accelerometers and temperature signals were captured with an Axivity AX3, at 100Hz and 1Hz, respectively, throughout a 19‐h period, including 10‐h concurrent PSG recording and 9‐h of wakefulness. We developed and evaluated a supervised deep learning algorithm to detect sleep and wake epochs and determine sleep parameters from the multimodal AX3 raw data. We validated our results with gold standard PSG measurements and compared our algorithm to the Biobank accelerometer analysis toolbox. Single modality (accelerometer or temperature) and multimodality (both signals) approaches were evaluated using the 3‐fold cross‐validation. Result The proposed deep learning model outperformed baseline models such as the Biobank accelerometer analysis toolbox and conventional machine learning classifiers (Random Forest and Support Vector Machine) by up to 25%. Using multimodal data improved sleep and wake classification performance (up to 18% higher) compared with the single modality. In terms of the sleep parameters, our approach boosted the accuracy of estimations by 11% on average compared to the Biobank accelerometer analysis toolbox. Conclusion In older adults without dementia, combining multimodal data from AX3 with deep learning methods allows satisfactory quantification of sleep and wakefulness. This approach holds promise for monitoring sleep behaviour and deriving accurate sleep parameters objectively and longitudinally from a low‐cost wearable sensor. A limitation of our current study is that the participants were healthy older adults: future work will focus on people living with dementia.
In this study, a novel hybrid tensor factorisation and deep learning approach has been proposed and implemented for sleep pose identification and classification of twelve different sleep postures. We have applied tensor factorisation to infrared (IR) images of 10 subjects to extract group-level data patterns, undertake dimensionality reduction and reduce occlusion for IR images. Pre-trained VGG-19 neural network has been used to predict the sleep poses under the blanket. Finally, we compared our results with those without the factorisation stage and with CNN network. Our new pose detection method outperformed the methods solely based on VGG-19 and 4-layer CNN network. The average accuracy for 10 volunteers increased from 78.1% and 75.4% to 86.0%.
In order to maximise the immersion in VR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this work, we propose a pipeline to create plausible interactive audio from a pair of 360 degree cameras. Details about the data underlying this work, along with the terms for data access, are available from: http://dx.doi.org/10.15126/surreydata.00812228.
This paper introduces Deep4D, a compact generative representation of shape and appearance from captured 4D volumetric video sequences of people. 4D volumetric video achieves highly realistic reproduction, replay and free-viewpoint rendering of actor performance from multiple view video acquisition systems. A deep generative network is trained on 4D video sequences of an actor performing multiple motions to learn a generative model of the dynamic shape and appearance. We demonstrate that the proposed generative model can provide a compact encoded representation capable of high-quality synthesis of 4D volumetric video with two orders of magnitude compression. A variational encoder-decoder network is employed to learn an encoded latent space that maps from 3D skeletal pose to 4D shape and appearance. This enables high-quality 4D volumetric video synthesis to be driven by skeletal motion, including skeletal motion capture data. This encoded latent space supports the representation of multiple sequences with dynamic interpolation to transition between motions. Therefore we introduce Deep4D motion graphs, a direct application of the proposed generative representation. Deep4D motion graphs allow real-time interactive character animation whilst preserving the plausible realism of movement and appearance from the captured volumetric video. Deep4D motion graphs implicitly combine multiple captured motions from a unified representation for character animation from volumetric video, allowing novel character movements to be generated with dynamic shape and appearance detail.
Accurate camera synchronisation is indispensable for many video processing tasks, such as surveillance and 3D modelling. Video-based synchronisation facilitates the design and setup of networks with moving cameras or devices without an external synchronisation capability, such as low-cost web cameras, or Kinects. In this paper, we present an algorithm which can work with such heterogeneous networks. The algorithm first finds the corresponding frame indices between each camera pair, by the help of image feature correspondences and epipolar geometry. Then, for each pair, a relative frame rate and offset are computed by fitting a 2D line to the index correspondences. These pairwise relations define a graph, in which each spanning cycle comprises an absolute synchronisation hypothesis. The optimal solution is found by an exhaustive search over the spanning cycles. The algorithm is experimentally demonstrated to yield highly accurate estimates in a number of scenarios involving static and moving cameras, and Kinect.
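The pairwise step described above, fitting a 2D line to corresponding frame indices to recover a relative frame rate (slope) and offset (intercept), can be sketched in a few lines. The index correspondences below are synthetic placeholders.

```python
# Sketch: least-squares line fit of frame-index correspondences between two cameras.
import numpy as np

idx_cam_a = np.arange(0, 200)                                     # indices in camera A
idx_cam_b = 1.25 * idx_cam_a + 17 + np.random.randn(200) * 0.5    # noisy matches in B

rate, offset = np.polyfit(idx_cam_a, idx_cam_b, deg=1)            # 2D line fit
print(f"relative frame rate ~ {rate:.3f}, offset ~ {offset:.1f} frames")
```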
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
In the context of Audio Visual Question Answering (AVQA) tasks, the audio-visual modalities can be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned on the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on the Temporal level in a self-supervised setting; and iii) introducing a cross-attention mechanism to balance audio and visual information on the Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.
We present a novel method to learn temporally consistent 3D reconstruction of clothed people from a monocular video. Recent methods for 3D human reconstruction from monocular video using volumetric, implicit or parametric human shape models, produce per frame reconstructions giving temporally inconsistent output and limited performance when applied to video. In this paper, we introduce an approach to learn temporally consistent features for textured reconstruction of clothed 3D human sequences from monocular video by proposing two advances: a novel temporal consistency loss function; and hybrid representation learning for implicit 3D reconstruction from 2D images and coarse 3D geometry. The proposed advances improve the temporal consistency and accuracy of both the 3D reconstruction and texture prediction from a monocular video. Comprehensive comparative performance evaluation on images of people demonstrates that the proposed method significantly outperforms the state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, quality and temporal consistency.
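A temporal consistency term of the general kind mentioned above can be sketched as a penalty encouraging features (or reconstructions) predicted for consecutive frames to agree. This is a generic formulation with placeholder tensors and an assumed weight, not the paper's exact loss.

```python
# Minimal sketch of a temporal consistency penalty between consecutive frames.
import torch
import torch.nn.functional as F

feat_prev = torch.randn(1, 256)                       # features from frame t-1
feat_t = torch.randn(1, 256, requires_grad=True)      # features predicted for frame t

data_loss = torch.tensor(0.0)                         # placeholder reconstruction term
temporal_loss = F.mse_loss(feat_t, feat_prev)         # penalise frame-to-frame change
total = data_loss + 0.1 * temporal_loss               # the 0.1 weight is an assumption
total.backward()
```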
We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair; or they require manual intervention to resolve occlusions or interactions. Our method addresses both limitations by introducing the first end-to-end learning approach to perform model-free implicit reconstruction for realistic 3D capture of multiple clothed people in arbitrary poses (with occlusions) from a single image. Our network simultaneously estimates the 3D geometry of each person and their 6DOF spatial locations, to obtain a coherent multi-human reconstruction. In addition, we introduce a new synthetic dataset that depicts images with a varying number of inter-occluded humans and a variety of clothing and hair styles. We demonstrate robust, high-resolution reconstructions on images of multiple humans with complex occlusions, loose clothing and a large variety of poses and scenes. Our quantitative evaluation on both synthetic and real world datasets demonstrates state-of-the-art performance with significant improvements in the accuracy and completeness of the reconstructions over competing approaches.
Apparatus and methods are disclosed for performing object-based audio rendering on a plurality of audio objects which define a sound scene, each audio object comprising at least one audio signal and associated metadata. The apparatus comprises: a plurality of renderers each capable of rendering one or more of the audio objects to output rendered audio data; and object adapting means for adapting one or more of the plurality of audio objects for a current reproduction scenario, the object adapting means being configured to send the adapted one or more audio objects to one or more of the plurality of renderers.
Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.
A common problem in the 4D reconstruction of people from multi-view video is the quality of the captured dynamic texture appearance, which depends on both the camera resolution and the capture volume. Typically the requirement to frame cameras to capture the volume of a dynamic performance (>50 m³) results in the person occupying only a small proportion (<10%) of the field of view. Even with ultra-high-definition 4k video acquisition this results in sampling the person at less than standard-definition (0.5k) video resolution, resulting in low-quality rendering. In this paper we propose a solution to this problem through super-resolution appearance transfer from a static high-resolution appearance capture rig using digital stills cameras (>8k) to capture the person in a small volume (<8 m³). A pipeline is proposed for super-resolution appearance transfer from high-resolution static capture to dynamic video performance capture to produce super-resolution dynamic textures. This addresses two key problems: colour mapping between different camera systems; and dynamic texture map super-resolution using a learnt model. Comparative evaluation demonstrates a significant qualitative and quantitative improvement in rendering the 4D performance capture with super-resolution dynamic texture appearance. The proposed approach reproduces the high-resolution detail of the static capture whilst maintaining the appearance dynamics of the captured video.
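As a hedged illustration of the colour mapping problem between two camera systems, the sketch below matches per-channel colour statistics of a dynamic-capture texture to a high-resolution static reference. This global statistics transfer is an assumed simplification for illustration, not the pipeline's actual colour mapping; the function name and shapes are hypothetical.

```python
import numpy as np

def match_colour_statistics(src, ref):
    """Map the colour statistics of `src` (dynamic-capture texture) onto those
    of `ref` (high-resolution static capture), per channel.

    Both inputs are float arrays of shape (H, W, 3) in [0, 1].
    """
    out = np.empty_like(src)
    for c in range(3):
        s_mu, s_sigma = src[..., c].mean(), src[..., c].std() + 1e-8
        r_mu, r_sigma = ref[..., c].mean(), ref[..., c].std()
        # shift and scale source channel to reference mean/std
        out[..., c] = (src[..., c] - s_mu) / s_sigma * r_sigma + r_mu
    return np.clip(out, 0.0, 1.0)
```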
Proceedings of ICA 2019 and EAA Euroregio: 23rd International Congress on Acoustics, integrating the 4th EAA Euroregio 2019, 9-13 September 2019, Aachen, Germany. Proceedings editors: Martin Ochmann, Michael Vorländer, Janina Fels. Published in Aachen, 2019.
This paper proposes a novel Attention-based Multi-Reference Super-resolution network (AMRSR) that, given a low-resolution image, learns to adaptively transfer the most similar texture from multiple reference images to the super-resolution output whilst maintaining spatial coherence. The use of multiple reference images together with attention-based sampling is demonstrated to achieve significantly improved performance over state-of-the-art reference super-resolution approaches on multiple benchmark datasets. Reference super-resolution approaches have recently been proposed to overcome the ill-posed problem of image super-resolution by providing additional information from a high-resolution reference image. Multi-reference super-resolution extends this approach by providing a more diverse pool of image features to overcome the inherent information deficit whilst maintaining memory efficiency. A novel hierarchical attention-based sampling approach is introduced to learn the similarity between low-resolution image features and multiple reference images based on a perceptual loss. Ablation demonstrates the contribution of both multi-reference and hierarchical attention-based sampling to overall performance. Perceptual and quantitative ground-truth evaluation demonstrates significant improvement in performance even when the reference images deviate significantly from the target image. The project website can be found at https://marcopesavento.github.io/AMRSR/
This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces observed from single or multiple moving cameras of complex scenes. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD) which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables the 4D Match Tree to be constructed which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single and multiple view images sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.
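A 4D Match Tree can be thought of as a spanning tree over frames that minimises the accumulated observed shape change. The sketch below, under the assumption that a pairwise inter-frame dissimilarity matrix is already available (random values stand in for the SFD-based similarity), builds such a tree and a traversal order with SciPy; it illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

# Illustrative symmetric pairwise dissimilarity between frames (zero diagonal);
# in the paper this is derived from sparse SFD feature matches.
n_frames = 6
rng = np.random.default_rng(0)
d = rng.random((n_frames, n_frames))
dissim = np.triu(d, 1) + np.triu(d, 1).T

# A minimum spanning tree over the dissimilarities plays the role of the
# match tree; traversal from a root frame gives the alignment order.
tree = minimum_spanning_tree(csr_matrix(dissim))
order, parents = breadth_first_order(tree, i_start=0, directed=False)
print("traversal order:", order)
```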
Scene relighting and estimating illumination of a real scene for insertion of virtual objects in a mixed-reality scenario are well-studied challenges in the computer vision and graphics fields. Classical inverse rendering approaches aim to decompose a scene into its orthogonal constituting elements, namely scene geometry, illumination and surface materials, which can later be used for augmented reality or to render new images under novel lighting or viewpoints. Recently, the application of deep neural computing to illumination estimation, relighting and inverse rendering has shown promising results. This contribution aims to bring together in a coherent manner current advances in this conjunction. We examine in detail the attributes of the proposed approaches, presented in three categories: scene illumination estimation, relighting with reflectance-aware scene-specific representations and finally relighting as image-to-image transformations. Each category is concluded with a discussion on the main characteristics of the current methods and possible future trends. We also provide an overview of current publicly available datasets for neural lighting applications.
This contribution introduces a two-step, novel neural rendering framework to learn the transformation from a 2D human silhouette mask to the corresponding cast shadows on background scene geometries. In the first step, the proposed neural renderer learns a binary shadow texture (canonical shadow) from the 2D foreground subject, for each point light source, independent of the background scene geometry. Next, the generated binary shadows are texture-mapped to transparent virtual shadow map planes which are seamlessly used in a traditional rendering pipeline to project hard or soft shadows for arbitrary scenes and light sources of different sizes. The neural renderer is trained with shadow images rendered from a fast, scalable, synthetic data generation framework. We introduce the 3D Virtual Human Shadow (3DVHshadow) dataset as a public benchmark for training and evaluation of human shadow generation. Evaluation on the 3DVHshadow test set and real 2D silhouette images of people demonstrates the proposed framework achieves comparable performance to traditional geometry-based renderers without any requirement for knowledge or computationally intensive, explicit estimation of the 3D human shape. We also show the benefit of learning intermediate canonical shadow textures, compared to learning to generate shadows directly in camera image space. Further experiments are provided to evaluate the effect of having multiple light sources in the scene, model performance with regard to the relative camera-light 2D angular distance, potential aliasing artefacts related to output image resolution, and effect of light sources' dimensions on shadow softness.
Presents the introductory welcome message from the conference proceedings.
We propose a plane-based urban scene reconstruction method using spherical stereo image pairs. We assume that the urban scene consists of axis-aligned, approximately planar structures (Manhattan world). Captured spherical stereo images are converted into six central-point perspective images by cubic projection and facade alignment. Facade alignment automatically identifies the principal plane directions in the scene, allowing the cubic projection to preserve the plane structure. Depth information is recovered by stereo matching between images and independent 3D rectangular planes are constructed by plane fitting aligned with the principal axes. Finally, planar regions are refined by expanding, detecting intersections and cropping based on visibility. The reconstructed model efficiently represents the structure of the scene and texture mapping allows natural walk-through rendering.
Presents the introductory welcome message from the conference proceedings.
Research on 3D video processing has gained a tremendous amount of momentum due to advances in video communications, broadcasting and entertainment technology (e.g., animation blockbusters like Avatar and Up). There is an increasing need for reliable technologies capable of visualizing 3-D content from viewpoints decided by the user; the 2010 football World Cup in South Africa has made very evident the need to replay crucial football footage from new viewpoints to decide whether the ball has or has not crossed the goal line. Remote videoconferencing prototypes are introducing a sense of presence into large- and small-scale (PC-based) systems alike by manipulating single and multiple video sequences to improve eye contact and place participants in convincing virtual spaces. All this, and more, is pushing the introduction of 3D services and the development of high-quality 3D displays to be available in a future which is drawing nearer and nearer.
Video-based free-viewpoint rendering from multiple view video capture has achieved video-realistic performance replay. Existing free-viewpoint rendering approaches require storage, streaming and re-sampling of multiple videos, which requires high bandwidth and computational resources limiting applications to local replay on high-performance computers. This paper introduces a layered texture representation for efficient storage and view-dependent rendering from multiple view video capture whilst maintaining the video-realism. Layered textures re-sample the captured video according to the surface visibility. Prioritisation of layers according to surface visibility allows the N-best views for all surface elements to be pre-computed significantly reducing both storage and rendering cost. Typically 3 texture map layers are required for free-viewpoint rendering with an equivalent visual quality to the multiple view video giving a significant reduction in storage cost. Quantitative evaluation demonstrates that the layered representation achieves a 90% reduction in storage cost and 50% reduction in rendering cost without loss of visual quality compared to storing only the foreground of the original multiple view video. This reduces the storage and transmission cost for free-viewpoint video rendering from eight cameras to be similar to the requirements for a single video. Streaming the layered representation enables, for the first time, demonstration of free-viewpoint video rendering on mobile devices and web platforms.
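The layer prioritisation described above can be sketched as selecting, for each surface element, the N camera views with the highest visibility score. The code below is a minimal illustration with made-up scores and names, not the paper's layered texture representation itself.

```python
import numpy as np

def n_best_views(visibility, n=3):
    """Select the indices of the N most visible camera views per surface element.

    visibility: array of shape (num_elements, num_cameras) of visibility scores
    (illustrative values; the paper derives these from surface visibility).
    Returns (num_elements, n) camera indices defining the texture layers.
    """
    order = np.argsort(-visibility, axis=1)   # best view first
    return order[:, :n]

# Example: 1000 surface elements, 8 cameras -> 3 texture layers
layers = n_best_views(np.random.rand(1000, 8), n=3)
```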
A shape constrained Laplacian mesh deformation approach is introduced for interactive editing of mesh sequences. This allows low-level constraints, such as foot or hand contact, to be imposed while preserving the natural dynamics of the captured surface. The approach also allows artistic manipulation of motion style to achieve effects such as squash-and-stretch. Interactive editing of key-frames is followed by automatic temporal propagation over a window of frames. User edits are seamlessly integrated into the captured mesh sequence. Three spatio-temporal interpolation methods are evaluated. Results on a variety of real and synthetic sequences demonstrate that the approach enables flexible manipulation of captured 3D video sequences.
In this paper we introduce a process for the synthesis of visual speech from captured 3D video. Animation from real speech is performed by path optimisation over a graph structure containing dynamic phonetic units.
The surface normals of a 3D human model, generated from multiple-viewpoint capture, are refined using an iterative variant of the shape-from-shading technique to recover fine details of clothing, enabling relighting of the model according to a new virtual environment. The method requires the images to consist of a number of uniformly coloured surface patches.
This paper presents a method to combine unreliable player tracks from multiple cameras to obtain a unique track for each player. A recursive graph optimisation algorithm is introduced to evaluate the best association between player tracks.
A process is presented for the creation of immersive high-dynamic range 3D models from a pair of spherical images. The concept of applying a spherical Delauney triangulation to 3D model creation is introduced and a complete texturing solution is given.
The visual and auditory modalities are the most important stimuli for humans. In order to maximise the sense of immersion in VR environments, a plausible spatial audio reproduction synchronised with visual information is essential. However, measuring the acoustic properties of an environment using audio equipment is a complicated process. In this chapter, we introduce a simple and efficient system to estimate room acoustics for plausible spatial audio rendering using 360° cameras for real scene reproduction in VR. A simplified 3D semantic model of the scene is estimated from captured images using computer vision algorithms and a convolutional neural network (CNN). Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties of the scene. The reconstructed scenes are rendered with synthesised spatial audio.
We welcome you to this virtual edition of the International Conference on 3D Vision (3DV 2020). The conference was originally to be held at Kyushu University in Fukuoka, Japan, but like most other technical meetings this year, it was changed to an online format due to the worldwide spread of COVID-19.
Light field video for content production is gaining both research and commercial interest as it has the potential to push the level of immersion for augmented and virtual reality to a close-to-reality experience. Light fields densely sample the viewing space of an object or scene using hundreds or even thousands of images with small displacements in between. However, a lack of standardised formats for compression, storage and transmission, along with the lack of tools to enable editing of light field data, currently make it impractical for use in real-world content production. In this chapter we address two fundamental problems with light field data, namely representation and compression. Firstly we propose a method to obtain a 4D temporally coherent representation from the input light field video. This is an essential problem to solve that will enable efficient compression and editing. Secondly, we present a method for compression of light field data based on the eigen texture method that provides a compact representation and enables efficient view-dependent rendering at interactive frame rates. These approaches achieve an order of magnitude compression and a temporally consistent representation that are important steps towards practical toolsets for light field video content production.
We propose an environment modelling method using high-resolution spherical stereo colour imaging. We capture indoor or outdoor scenes with line scanning by a rotating spherical camera and recover depth information from a stereo image pair using correspondence matching and spherical cross-slits stereo geometry. The existing single spherical imaging technique is extended to stereo geometry and a hierarchical PDE-based sub-pixel disparity estimation method for large images is proposed. The estimated floating-point disparity fields are used to generate an accurate and smooth depth field. Finally, the 3D environments are reconstructed using triangular meshes from the depth field. Through experiments, we evaluate the accuracy of reconstruction against ground truth and analyse the behaviour of errors for spherical stereo imaging.
The Facial Action Coding System (FACS) [Ekman et al. 2002] has become a popular reference for creating fully controllable facial models that allow the manipulation of single actions or so-called Action Units (AUs). For example, realistic 3D models based on FACS have been used for investigating the perceptual effects of moving faces, and for character expression mapping in recent movies. However, since none of the facial actions (AUs) in these models are validated by FACS experts it is unclear how valid the model would be in situations where the accurate production of an AU is essential [Krumhuber and Tamarit 2010]. Moreover, previous work has employed motion capture data representing only sparse 3D facial positions which does not include dense surface deformation detail.
We present a new approach to reflectance estimation for dynamic scenes. Non-parametric image statistics are used to transfer reflectance properties from a static example set to a dynamic image sequence. The approach allows reflectance estimation for surface materials with inhomogeneous appearance, such as those which commonly occur with patterned or textured clothing. Material reflectance properties are initially estimated from static images of the subject under multiple directional illuminations using photometric stereo. The estimated reflectance together with the corresponding image under uniform ambient illumination form a prior set of reference material observations. Material reflectance properties are then estimated for video sequences of a moving person captured under uniform ambient illumination by matching the observed local image statistics to the reference observations. Results demonstrate that the transfer of reflectance properties enables estimation of the dynamic surface normals and subsequent relighting. This approach overcomes limitations of previous work on material transfer and relighting of dynamic scenes which was limited to surfaces with regions of homogeneous reflectance. We evaluate for relighting 3D model sequences reconstructed from multiple view video. Comparison to previous model relighting demonstrates improved reproduction of detailed texture and shape dynamics.
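The static reference set above is built with classical Lambertian photometric stereo. Assuming known directional light directions, a standard least-squares formulation of that step is sketched below; the array shapes and function name are illustrative and this is not the paper's full estimation pipeline.

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Estimate per-pixel surface normals and albedo under a Lambertian model.

    intensities: (K, H, W) images captured under K directional lights
    light_dirs:  (K, 3) unit light directions
    """
    K, H, W = intensities.shape
    I = intensities.reshape(K, -1)                        # (K, H*W)
    # Solve L @ B ~= I in the least-squares sense, where B = albedo * normal
    B, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)    # (3, H*W)
    albedo = np.linalg.norm(B, axis=0)
    normals = B / (albedo + 1e-8)
    return normals.T.reshape(H, W, 3), albedo.reshape(H, W)
```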
Registering 3D point sets subject to rigid body motion is a common problem in computer vision. The optimal transformation is usually specified to be the minimum of a weighted least squares cost. The case of two point sets has been solved by several authors using analytic methods such as SVD. In this paper we present a numerical method for solving the problem when there are more than two point sets. Although of general applicability, the new method is particularly aimed at the multi-view surface registration problem. To date almost all authors have registered only two point sets at a time. This approach discards information and we show in quantitative terms the errors caused.
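For reference, the analytic two-set case mentioned above (weighted least-squares rigid alignment via SVD) can be written in a few lines. The sketch below shows that standard base case, not the paper's multi-set numerical method; the function name is illustrative.

```python
import numpy as np

def rigid_align_two_sets(P, Q, weights=None):
    """Weighted least-squares rigid alignment of two corresponding point sets.

    P, Q: (N, 3) corresponding points; returns R, t with R @ P_i + t ~= Q_i.
    """
    w = np.ones(len(P)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    p_bar, q_bar = w @ P, w @ Q                          # weighted centroids
    H = (P - p_bar).T @ np.diag(w) @ (Q - q_bar)         # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # correct for a possible reflection so R is a proper rotation
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = q_bar - R @ p_bar
    return R, t
```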
We present a novel method to improve the accuracy of the 3D reconstruction of clothed human shape from a single image. Recent work has introduced volumetric, implicit and model-based shape learning frameworks for reconstruction of objects and people from one or more images. However, the accuracy and completeness for reconstruction of clothed people is limited due to the large variation in shape resulting from clothing, hair, body size, pose and camera viewpoint. This paper introduces two advances to overcome this limitation: firstly, a new synthetic dataset of realistic clothed people, 3DVH; and secondly, a novel multiple-view loss function for training of monocular volumetric shape estimation, which is demonstrated to significantly improve generalisation and reconstruction accuracy. The 3DVH dataset of realistic clothed 3D human models rendered with diverse natural backgrounds is demonstrated to allow transfer to reconstruction from real images of people. Comprehensive comparative performance evaluation on both synthetic and real images of people demonstrates that the proposed method significantly outperforms the previous state-of-the-art learning-based single image 3D human shape estimation approaches, achieving significant improvements in reconstruction accuracy, completeness and quality. An ablation study shows that this is due to both the proposed multiple-view training and the new 3DVH dataset. The code and the dataset can be found at the project website: https://akincaliskan3d.github.io/MV3DH/.
This paper presents a framework for performance-based animation and retargeting of high-resolution face models from motion capture. A novel method is introduced for learning a mapping between sparse 3D motion capture markers and dense high-resolution 3D scans of face shape and appearance. A high-resolution facial expression space is learnt from a set of 3D face scans as a person-specific morphable model. Sparse 3D face points sampled at the motion capture marker positions are used to build a corresponding low-resolution expression space to represent the facial dynamics from motion capture. Radial basis function interpolation is used to automatically map the low-resolution motion capture of facial dynamics to the high-resolution facial expression space. This produces a high-resolution facial animation with the detailed shape and appearance of real facial dynamics. Retargeting is introduced to transfer facial expressions to a novel subject captured from a single photograph or 3D scan. The subject-specific high-resolution expression space is mapped to the novel subject based on anatomical differences in face shape. Results for facial animation and retargeting demonstrate realistic animation of expressions from motion capture.
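The sparse-to-dense mapping step above relies on radial basis function interpolation. A minimal sketch with SciPy's RBFInterpolator is given below; the marker and coefficient dimensions and the random training data are purely illustrative assumptions, not those used in the paper.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Hypothetical data: 40 training expressions, each with 30 sparse marker
# coordinates (10 markers x 3), mapped to 90-D morphable-model coefficients.
rng = np.random.default_rng(1)
sparse_markers = rng.normal(size=(40, 30))
dense_coeffs = rng.normal(size=(40, 90))

# Learn the sparse-to-dense mapping, then drive the high-resolution
# expression space from new motion-capture frames.
rbf = RBFInterpolator(sparse_markers, dense_coeffs, kernel="thin_plate_spline")
new_frames = rng.normal(size=(5, 30))     # 5 captured marker frames
predicted = rbf(new_frames)               # (5, 90) expression coefficients
```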
Presents the welcome message from the conference proceedings.
This paper presents an accurate inner-lip extraction technique designed to work on long sequences of mouth images during speech. We present a novel appearance-based searching technique for the detection of the inner-lip.
This paper introduces a novel 4D shape descriptor to match temporal surface sequences. A quantitative evaluation based on the receiver-operator characteristic (ROC) curve is presented to compare the performance of conventional 3D shape descriptors with and without using a time filter. Feature-based 3D shape descriptors including shape distribution (Osada et al., 2002), spin image (Johnson et al., 1999), shape histogram (Ankerst et al., 1999) and spherical harmonics (Kazhdan et al., 2003) are considered. Evaluation shows that filtered descriptors outperform unfiltered descriptors and the best-performing volume-sampling shape-histogram descriptor is extended to define a new 4D "shape-flow" descriptor. Shape-flow matching demonstrates improved performance in the context of matching time-varying sequences, which is motivated by the requirement to connect similar sequences for animation production. Both simulated and real 3D human surface motion sequences are used for evaluation.
We propose a region-based method to extract foreground regions from colour video sequences. The foreground region is determined by voting, in which background subtraction scores are accumulated over the sub-regions produced by graph-based segmentation. Experiments show that the proposed algorithm improves on conventional approaches, especially in strong shadow regions.
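A minimal version of this voting scheme, assuming a per-pixel background-subtraction score map and an integer region label map are already available, is sketched below; the threshold and names are illustrative, not the paper's parameters.

```python
import numpy as np

def foreground_by_segment_voting(bg_score, labels, threshold=0.5):
    """Vote per-pixel background-subtraction scores into segmentation regions
    and mark a whole region as foreground when its mean score is high.

    bg_score: (H, W) float map, higher = more likely foreground
    labels:   (H, W) non-negative integer region labels from segmentation
    """
    flat_labels, flat_scores = labels.ravel(), bg_score.ravel()
    sums = np.bincount(flat_labels, weights=flat_scores)   # score sum per region
    counts = np.bincount(flat_labels)                      # pixel count per region
    region_mean = sums / np.maximum(counts, 1)
    return region_mean[labels] > threshold                 # (H, W) foreground mask
```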
In this paper we present a method to relight captured 3D video sequences of non-rigid, dynamic scenes, such as clothing of real actors, reconstructed from multiple view video. A view-dependent approach is introduced to refine an initial coarse surface reconstruction using shape-from-shading to estimate detailed surface normals. The prior surface approximation is used to constrain the simultaneous estimation of surface normals and scene illumination, under the assumption of Lambertian surface reflectance. This approach enables detailed surface normals of a moving non-rigid object to be estimated from a single image frame. Refined normal estimates from multiple views are integrated into a single surface normal map. This approach allows highly non-rigid surfaces, such as creases in clothing, to be relit whilst preserving the detailed dynamics observed in video.
We propose a 3D modelling method from multiple pairs of spherical stereo images. A static environment is captured as a vertical stereo pair with a rotating line scan camera at multiple locations and depth fields are extracted for each pair using spherical stereo geometry. We propose a new PDE-based stereo matching method which handles occlusion and over-segmentation problems in highly textured regions. In order to avoid cumbersome camera calibration steps, we extract a 3D rigid transform using feature matching between views and fuse all models into one complete mesh. A reliable surface selection algorithm for overlapped surfaces is proposed for merging multiple meshes in order to keep surface details while removing outliers. The performance of the proposed algorithms is evaluated against ground truth from LIDAR scans.
We present an approach to multi-person 3D pose estimation and tracking from multi-view video. Following independent 2D pose detection in each view, we: (1) correct errors in the output of the pose detector; (2) apply a fast greedy algorithm for associating 2D pose detections between camera views; and (3) use the associated poses to generate and track 3D skeletons. Previous methods for estimating skeletons of multiple people suffer long processing times or rely on appearance cues, reducing their applicability to sports. Our approach to associating poses between views works by seeking the best correspondences first in a greedy fashion, while reasoning about the cyclic nature of correspondences to constrain the search. The associated poses can be used to generate 3D skeletons, which we produce via robust triangulation. Our method can track 3D skeletons in the presence of missing detections, substantial occlusions, and large calibration error. We believe ours is the first method for full-body 3D pose estimation and tracking of multiple players in highly dynamic sports scenes. The proposed method achieves a significant improvement in speed over state-of-the-art methods.
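The final lifting of associated 2D detections to 3D skeletons uses robust triangulation. As a hedged illustration, the sketch below shows standard linear (DLT) triangulation of a single joint from several calibrated views; this is a textbook formulation rather than the paper's exact robust variant, and the names are illustrative.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one joint from several calibrated views.

    projections: list of 3x4 camera projection matrices
    points_2d:   list of (x, y) detections of the same joint in each view
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        # each view contributes two linear constraints on the homogeneous point
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                         # null-space solution
    return X[:3] / X[3]                # inhomogeneous 3D point
```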
The visual hull is widely used as a proxy for novel view synthesis in computer vision. This paper introduces the safe hull, the first visual hull reconstruction technique to produce a surface containing only foreground parts. A theoretical basis underlies this novel approach which, unlike any previous work, can also identify phantom volumes attached to real objects. Using an image-based method, the visual hull is constructed with respect to each real view and used to identify safe zones in the original silhouettes. The safe zones define volumes known to only contain surface corresponding to a real object. The zones are used in a second reconstruction step to produce a surface without phantom volumes. Results demonstrate the effectiveness of this method for improving surface shape and scene realism, and its advantages over heuristic techniques.
Recent progress in Virtual Reality (VR) and Augmented Reality (AR) allows us to experience various VR/AR applications in our daily life. In order to maximise the immersion of the user in VR/AR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this paper, we propose a simple and efficient system to estimate room acoustics for plausible reproduction of spatial audio using 360° cameras for VR/AR applications. A pair of 360° images is used for room geometry and acoustic property estimation. A simplified 3D geometric model of the scene is estimated by depth estimation from captured images and semantic labelling using a convolutional neural network (CNN). The real environment acoustics are characterised by frequency-dependent acoustic predictions of the scene. Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties of the scene. The reconstructed scenes are rendered with synthesised spatial audio as VR/AR content. The results of estimated room geometry and simulated spatial audio are evaluated against actual measurements and audio calculated from ground-truth Room Impulse Responses (RIRs) recorded in the rooms.
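One common way to turn estimated room geometry and per-surface absorption into frequency-dependent acoustic parameters is Sabine's reverberation formula. The sketch below is an assumed simplification for illustration, not necessarily the acoustic model used in the paper, and all numeric values are made up.

```python
import numpy as np

def sabine_rt60(volume_m3, surface_areas_m2, absorption_coeffs):
    """Per-band reverberation time via Sabine's formula RT60 = 0.161 V / sum(S_i a_i).

    surface_areas_m2:  (num_surfaces,) areas from the reconstructed 3D model
    absorption_coeffs: (num_surfaces, num_bands) coefficients assigned from
                       semantic labels (illustrative values).
    """
    S = np.asarray(surface_areas_m2, float)[:, None]
    A = (S * np.asarray(absorption_coeffs, float)).sum(axis=0)  # absorption per band
    return 0.161 * volume_m3 / np.maximum(A, 1e-6)

# Example: a 60 m^3 room, six surfaces, two frequency bands
rt60 = sabine_rt60(60.0, [20, 20, 12, 12, 24, 24], [[0.1, 0.2]] * 6)
```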
This paper presents an approach to estimate the intrinsic texture properties (albedo, shading, normal) of scenes from multiple view acquisition under unknown illumination conditions. We introduce the concept of intrinsic textures, which are pixel-resolution surface textures representing the intrinsic appearance parameters of a scene. Unlike previous video relighting methods, the approach does not assume regions of uniform albedo, which makes it applicable to richly textured scenes. We show that intrinsic image methods can be used to refine an initial, low-frequency shading estimate based on a global lighting reconstruction from an original texture and coarse scene geometry in order to resolve the inherent global ambiguity in shading. The method is applied to relighting of free-viewpoint rendering from multiple view video capture. This demonstrates relighting with reproduction of fine surface detail. Quantitative evaluation on synthetic models with textured appearance shows accurate estimation of intrinsic surface reflectance properties.
This paper presents an empirical study of affine invariant feature detectors to perform matching on video sequences of people with non-rigid surface deformation. Recent advances in feature detection and wide baseline matching have focused on static scenes. Video frames of human movement capture highly non-rigid deformation such as loose hair, cloth creases, skin stretching and free flowing clothing. This study evaluates the performance of three widely used feature detectors for sparse temporal correspondence on single view and multiple view video sequences. Quantitative evaluation is performed of both the number of features detected and their temporal matching, with and without ground-truth correspondences. Recall-accuracy analysis of feature matching is reported for temporal correspondence on single view and multiple view sequences of people with variation in clothing and movement. This analysis identifies that existing feature detection and matching algorithms are unreliable for fast movement with common clothing. For patterned clothing, techniques such as SIFT produce reliable correspondence.
We present a framework for efficient reconstruction of dense scene structure from video. Sequential structure-from-motion recovers camera information from video, providing only sparse 3D points. We build a dense 3D point cloud by performing full-frame tracking and depth estimation across sequences. First, we present a novel algorithm for sequential frame selection to extract a set of key frames with sufficient parallax for accurate depth reconstruction. Second, we introduce a technique for efficient reconstruction using dense tracking with geometrically correct optimisation of depth and orientation. Key frame selection is also performed in optimisation to provide accurate depth reconstruction for different scene elements. We test our work on benchmark footage and scenes containing local non-rigid motion, foreground clutter and occlusions to show comparable performance to state of the art techniques. We also show a substantial increase in speed on real world footage compared to existing methods, when they succeed, and successful reconstructions when they fail.
This paper introduces a novel method of surface refinement for free-viewpoint video of dynamic scenes. Unlike previous approaches, the method presented here uses both visual hull and silhouette contours to constrain refinement of view-dependent depth maps from wide baseline views. A technique for extracting silhouette contours as rims in 3D from the view-dependent visual hull (VDVH) is presented. A new method for improving correspondence is introduced, where refinement of the VDVH is posed as a global problem in projective ray space. Artefacts of global optimisations are reduced by incorporating rims as constraints. Real time rendering of virtual views in a free-viewpoint video system is achieved using an image+depth representation for each real view. Results illustrate the high quality of rendered views achieved through this refinement technique.
Gordon Collins and Adrian Hilton present a review of methods for the construction and deformation of character models. They consider both state-of-the-art research and common practice. In particular they review applications, data capture methods, manual model construction, polygonal, parametric and implicit surface representations, basic geometric deformations, free-form deformations, subdivision surfaces, displacement map schemes and physical deformation.
A key task in computer vision is that of generating virtual 3D models of real-world scenes by reconstructing the shape, appearance and, in the case of dynamic scenes, motion of the scene from visual sensors. Recently, low-cost video plus depth (RGB-D) sensors have become widely available and have been applied to 3D reconstruction of both static and dynamic scenes. RGB-D sensors contain an active depth sensor, which provides a stream of depth maps alongside standard colour video. The low cost and ease of use of RGB-D devices as well as their video rate capture of images along with depth make them well suited to 3D reconstruction. Use of active depth capture overcomes some of the limitations of passive monocular or multiple-view video-based approaches since reliable, metrically accurate estimates of the scene depth at each pixel can be obtained from a single view, even in scenes that lack distinctive texture. There are two key components to 3D reconstruction from RGB-D data: (1) spatial alignment of the surface over time and, (2) fusion of noisy, partial surface measurements into a more complete, consistent 3D model. In the case of static scenes, the sensor is typically moved around the scene and its pose is estimated over time. For dynamic scenes, there may be multiple rigid, articulated, or non-rigidly deforming surfaces to be tracked over time. The fusion component consists of integration of the aligned surface measurements, typically using an intermediate representation, such as the volumetric truncated signed distance field (TSDF). In this chapter, we discuss key recent approaches to 3D reconstruction from depth or RGB-D input, with an emphasis on real-time reconstruction of static scenes.
Multiple view 3D video reconstruction of actor performance captures a level-of-detail for body and clothing movement which is time-consuming to produce using existing animation tools. In this paper we present a framework for concatenative synthesis from multiple 3D video sequences according to user constraints on movement, position and timing. Multiple 3D video sequences of an actor performing different movements are automatically constructed into a surface motion graph which represents the possible transitions with similar shape and motion between sequences without unnatural movement artifacts. Shape similarity over an adaptive temporal window is used to identify transitions between 3D video sequences. Novel 3D video sequences are synthesized by finding the optimal path in the surface motion graph between user specified key-frames for control of movement, location and timing. The optimal path which satisfies the user constraints whilst minimizing the total transition cost between 3D video sequences is found using integer linear programming. Results demonstrate that this framework allows flexible production of novel 3D video sequences which preserve the detailed dynamics of the captured movement for an actress with loose clothing and long hair without visible artifacts.
We present a method for Semantic Scene Completion (SSC) of complete indoor scenes from a single 360 degrees RGB image and corresponding depth map using a Deep Convolution Neural Network that takes advantage of existing datasets of synthetic and real RGB-D images for training. Recent works on SSC only perform occupancy prediction of small regions of the room covered by the field-of-view of the sensor in use, which implies the need of multiple images to cover the whole scene, being an inappropriate method for dynamic scenes. Our approach uses only a single 360 degrees image with its corresponding depth map to infer the occupancy and semantic labels of the whole room. Using one single image is important to allow predictions with no previous knowledge of the scene and enable extension to dynamic scene applications. We evaluated our method on two 360 degrees image datasets: a high-quality 360 degrees RGB-D dataset gathered with a Matterport sensor and low-quality 360 degrees RGB-D images generated with a pair of commercial 360 degrees cameras and stereo matching. The experiments showed that the proposed pipeline performs SSC not only with Matterport cameras but also with more affordable 360 degrees cameras, which adds a great number of potential applications, including immersive spatial audio reproduction, augmented reality, assistive computing and robotics.
This paper addresses the synthesis of virtual views of people from multiple view image sequences. We consider the target area of the multiple camera “3D Virtual Studio” with the ultimate goal of capturing video-realistic dynamic human appearance. A mesh based reconstruction framework is introduced to initialise and optimise the shape of a dynamic scene for view-dependent rendering, making use of silhouette and stereo data as complementary shape cues. The technique addresses two key problems: (1) robust shape reconstruction; and (2) accurate image correspondence for view dependent rendering in the presence of camera calibration error. We present results against ground truth data in synthetic test cases and for captured sequences of people in a studio. The framework demonstrates a higher resolution in rendering compared to shape from silhouette and multiple view stereo.
We present a novel hybrid representation for character animation from 4D Performance Capture (4DPC) data which combines skeletal control with surface motion graphs. 4DPC data are temporally aligned 3D mesh sequence reconstructions of the dynamic surface shape and associated appearance from multiple view video. The hybrid representation supports the production of novel surface sequences which satisfy constraints from user specified key-frames or a target skeletal motion. Motion graph path optimisation concatenates fragments of 4DPC data to satisfy the constraints whilst maintaining plausible surface motion at transitions between sequences. Spacetime editing of the mesh sequence using a learnt part-based Laplacian surface deformation model is performed to match the target skeletal motion and transition between sequences. The approach is quantitatively evaluated for three 4DPC datasets with a variety of clothing styles. Results for key-frame animation demonstrate production of novel sequences which satisfy constraints on timing and position of less than 1% of the sequence duration and path length. Evaluation of motion capture driven animation over a corpus of 130 sequences shows that the synthesised motion accurately matches the target skeletal motion. The combination of skeletal control with the surface motion graph extends the range and style of motion which can be produced whilst maintaining the natural dynamics of shape and appearance from the captured performance.
Image-based modelling allows the reconstruction of highly realistic digital models from real-world objects. This paper presents a model-based approach to recover animated models of people from multiple-view video images. Two contributions are made: first, a multiple-resolution model-based framework is introduced that combines multiple visual cues in reconstruction; second, a novel mesh parameterisation is presented to preserve the vertex parameterisation in the model for animation. A prior humanoid surface model is first decomposed into multiple levels of detail and represented as a hierarchical deformable model for image fitting. A novel mesh parameterisation is presented that allows propagation of deformation in the model hierarchy and regularisation of surface deformation to preserve vertex parameterisation and animation structure. The hierarchical model is then used to fuse multiple shape cues from silhouette, stereo and sparse feature data in a coarse-to-fine strategy to recover a model that reproduces the appearance in the images. The framework is compared to physics-based deformable surface fitting at a single resolution, demonstrating an improved reconstruction accuracy against ground-truth data with a reduced model distortion. Results demonstrate realistic modelling of real people with accurate shape and appearance while preserving model structure for use in animation.
Recent advances in sensor technology have introduced low-cost RGB video plus depth sensors, such as the Kinect, which enable simultaneous acquisition of colour and depth images at video rates. This paper introduces a framework for representation of general dynamic scenes from video plus depth acquisition. A hybrid representation is proposed which combines the advantages of prior surfel graph surface segmentation and modelling work with the higher-resolution surface reconstruction capability of volumetric fusion techniques. The contributions are (1) extension of a prior piecewise surfel graph modelling approach for improved accuracy and completeness, (2) combination of this surfel graph modelling with TSDF surface fusion to generate dense geometry, and (3) proposal of means for validation of the reconstructed 4D scene model against the input data and efficient storage of any unmodelled regions via residual depth maps. The approach allows arbitrary dynamic scenes to be efficiently represented with temporally consistent structure and enhanced levels of detail and completeness where possible, but gracefully falls back to raw measurements where no structure can be inferred. The representation is shown to facilitate creative manipulation of real scene data which would previously require more complex capture setups or manual processing.
This paper presents a performance evaluation of shape similarity metrics for 3D video sequences of people with unknown temporal correspondence. Performance of similarity measures is compared by evaluating Receiver Operator Characteristics for classification against ground-truth for a comprehensive database of synthetic 3D video sequences comprising animations of fourteen people performing twenty-eight motions. Static shape similarity metrics shape distribution, spin image, shape histogram and spherical harmonics are evaluated using optimal parameter settings for each approach. Shape histograms with volume sampling are found to consistently give the best performance for different people and motions. Static shape similarity is extended over time to eliminate the temporal ambiguity. Time-filtering of the static shape similarity together with two novel shape-flow descriptors are evaluated against temporal ground-truth. This evaluation demonstrates that shape-flow with a multi-frame alignment of motion sequences achieves the best performance, is stable for different people and motions, and overcomes the ambiguity in static shape similarity. Time-filtering of the static shape histogram similarity measure with a fixed window size achieves marginally lower performance for linear motions with the same computational cost as static shape descriptors. Performance of the temporal shape descriptors is validated for real 3D video sequences of nine actors performing a variety of movements. Time-filtered shape histograms are shown to reliably identify frames from 3D video sequences with similar shape and motion for people with loose clothing and complex motion.
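A hedged sketch of the time-filtering idea is given below: a static frame-to-frame similarity matrix is averaged along the temporal diagonal over a small window, so that matches are supported by neighbouring frames. The window size, matrix contents and function name are illustrative assumptions, not the paper's exact filter.

```python
import numpy as np

def time_filtered_similarity(sim, half_window=2):
    """Average a static shape similarity matrix along the temporal diagonal.

    sim: (T1, T2) static similarity between all frame pairs of two sequences.
    Returns a matrix of the same shape where each entry also reflects the
    similarity of neighbouring frames moving forward/backward in time together.
    """
    T1, T2 = sim.shape
    out = np.zeros_like(sim)
    for i in range(T1):
        for j in range(T2):
            # valid diagonal offsets around (i, j)
            ks = [k for k in range(-half_window, half_window + 1)
                  if 0 <= i + k < T1 and 0 <= j + k < T2]
            out[i, j] = np.mean([sim[i + k, j + k] for k in ks])
    return out
```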
A common problem in wide-baseline matching is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST, A-KAZE and MSER. In this paper we introduce a novel segmentation based feature detector (SFD) that produces an increased number of accurate features for wide-baseline matching. A multi-scale SFD is proposed using bilateral image decomposition to produce a large number of scale-invariant features for wide-baseline reconstruction. All input images are over-segmented into regions using any existing segmentation technique like Watershed, Mean-shift, and SLIC. Feature points are then detected at the intersection of the boundaries of three or more regions. The detected feature points are local maxima of the image function. The key advantage of feature detection based on segmentation is that it does not require global threshold setting and can therefore detect features throughout the image. A comprehensive evaluation demonstrates that SFD gives an increased number of features which are accurately localised and matched between wide-baseline camera views; the number of features for a given matching error increases by a factor of 3-5 compared to SIFT; feature detection and matching performance is maintained with increasing baseline between views; multi-scale SFD improves matching performance at varying scales. Application of SFD to sparse multi-view wide-baseline reconstruction demonstrates a factor of ten increase in the number of reconstructed points with improved scene coverage compared to SIFT/MSER/A-KAZE. Evaluation against ground-truth shows that SFD produces an increased number of wide-baseline matches with reduced error.
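The SFD detection rule above (feature points where three or more regions meet) can be approximated with a small sketch that flags pixels whose 2x2 neighbourhood spans at least three distinct segmentation labels. This is an assumed simplification for illustration, not the published detector.

```python
import numpy as np

def sfd_candidates(labels):
    """Detect candidate feature points at junctions of an over-segmentation.

    labels: (H, W) integer region labels from any segmentation (watershed,
    mean-shift, SLIC, ...). Returns an (N, 2) array of (row, col) candidates
    where a 2x2 block contains three or more distinct labels.
    """
    a = labels[:-1, :-1]
    b = labels[:-1, 1:]
    c = labels[1:, :-1]
    d = labels[1:, 1:]
    # count distinct labels within each 2x2 block
    distinct = np.ones(a.shape, dtype=int)
    distinct += (b != a)
    distinct += (c != a) & (c != b)
    distinct += (d != a) & (d != b) & (d != c)
    ys, xs = np.nonzero(distinct >= 3)
    return np.stack([ys, xs], axis=1)
```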
We propose a block-based scene reconstruction method using multiple stereo pairs of spherical images. We assume that the urban scene consists of axis-aligned planar structures (Manhattan world). Captured spherical stereo images are converted into six central-point perspective images by cubic projection and façade alignment. Depth information is recovered by stereo matching between images. Semantic regions are segmented based on colour, edge and normal information. Independent 3D rectangular planes are constructed by fitting planes aligned with the principal axes of the segmented 3D points. Finally cuboid-based scene structure is recovered from multiple viewpoints by merging and refining planes based on connectivity and visibility. The reconstructed model efficiently shows the structure of the scene with a small amount of data.
This paper proposes a clustered exemplar-based model for performing viewpoint invariant tracking of the 3D motion of a human subject from a single camera. Each exemplar is associated with multiple view visual information of a person and the corresponding 3D skeletal pose. The visual information takes the form of contours obtained from different viewpoints around the subject. The inclusion of multi-view information is important for two reasons: viewpoint invariance; and generalisation to novel motions. Visual tracking of human motion is performed using a particle filter coupled to the dynamics of human movement represented by the exemplar-based model. Dynamics are modelled by clustering 3D skeletal motions with similar movement and encoding the flow both within and between clusters. Results of single view tracking demonstrate that the exemplar-based models incorporating dynamics generalise to viewpoint invariant tracking of novel movements.
Learning visual representations plays an important role in computer vision and machine learning applications. It enables a model to understand and perform high-level tasks intelligently. A common approach to learning visual representations is supervised learning, which requires a huge amount of human annotation to train the model. This paper presents a self-supervised approach which learns visual representations from input images without human annotations. We learn the correct arrangement of object proposals to represent an image using a convolutional neural network (CNN) without any manual annotations. We hypothesize that the network trained for solving this problem requires the embedding of semantic visual representations. Unlike existing approaches that use uniformly sampled patches, we relate object proposals that contain prominent objects and object parts. More specifically, we discover the representation that considers overlap, inclusion, and exclusion relationships of proposals as well as their relative position. This allows focusing on potential objects and parts rather than on clutter. We demonstrate that our model outperforms existing self-supervised learning methods and can be used as a generic feature extractor by applying it to object detection, classification, action recognition, image retrieval, and semantic matching tasks.
Existing systems for 3D reconstruction from multiple view video use controlled indoor environments with uniform illumination and backgrounds to allow accurate segmentation of dynamic foreground objects. In this paper we present a portable system for 3D reconstruction of dynamic outdoor scenes which require relatively large capture volumes with complex backgrounds and non-uniform illumination. This is motivated by the demand for 3D reconstruction of natural outdoor scenes to support film and broadcast production. Limitations of existing multiple view 3D reconstruction techniques for use in outdoor scenes are identified. Outdoor 3D scene reconstruction is performed in three stages: (1) 3D background scene modelling using spherical stereo image capture; (2) multiple view segmentation of dynamic foreground objects by simultaneous video matting across multiple views; and (3) robust 3D foreground reconstruction and multiple view segmentation refinement in the presence of segmentation and calibration errors. Evaluation is performed on several outdoor productions with complex dynamic scenes including people and animals. Results demonstrate that the proposed approach overcomes limitations of previous indoor multiple view reconstruction approaches enabling high-quality free-viewpoint rendering and 3D reference models for production.
A real-time motion capture system is presented which uses input from multiple standard video cameras and inertial measurement units (IMUs). The system is able to track multiple people simultaneously and requires no optical markers, specialized infra-red cameras or foreground/background segmentation, making it applicable to general indoor and outdoor scenarios with dynamic backgrounds and lighting. To overcome limitations of prior video or IMU-only approaches, we propose to use flexible combinations of multiple-view, calibrated video and IMU input along with a pose prior in an online optimization-based framework, which allows the full 6-DoF motion to be recovered including axial rotation of limbs and drift-free global position. A method for sorting and assigning raw input 2D keypoint detections into corresponding subjects is presented which facilitates multi-person tracking and rejection of any bystanders in the scene. The approach is evaluated on data from several indoor and outdoor capture environments with one or more subjects and the trade-off between input sparsity and tracking performance is discussed. State-of-the-art pose estimation performance is obtained on the Total Capture (multi-view video and IMU) and Human 3.6M (multi-view video) datasets. Finally, a live demonstrator for the approach is presented showing real-time capture, solving and character animation using a light-weight, commodity hardware setup.
3DTV production of live sports events presents a challenging problem involving conflicting requirements of maintaining broadcast stereo picture quality with practical problems in developing robust systems for cost effective deployment. In this paper we propose an alternative approach to stereo production in sports events using the conventional monocular broadcast cameras for 3D reconstruction of the event and subsequent stereo rendering. This approach has the potential advantage over stereo camera rigs of recovering full scene depth, allowing inter-ocular distance and convergence to be adapted according to the requirements of the target display and enabling stereo coverage from both existing and ‘virtual’ camera positions without additional cameras. A prototype system is presented with results of sports TV production trials for rendering of stereo and free-viewpoint video sequences of soccer and rugby.
Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing a shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction, and its application to free-view rendering and virtual reality.
In an object-based spatial audio system, positions of the audio objects (e.g. speakers/talkers or voices) presented in the sound scene are required as important metadata attributes for object acquisition and reproduction. Binaural microphones are often used as a physical device to mimic human hearing and to monitor and analyse the scene, including localisation and tracking of multiple speakers. The binaural audio tracker, however, is usually prone to errors caused by room reverberation and background noise. To address this limitation, we present a multimodal tracking method by fusing the binaural audio with depth information (from a depth sensor, e.g. Kinect). More specifically, the PHD filtering framework is first applied to the depth stream, and a novel clutter intensity model is proposed to improve the robustness of the PHD filter when an object is occluded either by other objects or due to the limited field of view of the depth sensor. To compensate for mis-detections in the depth stream, a novel gap filling technique is presented to map audio azimuths obtained from the binaural audio tracker to 3D positions, using speaker-dependent spatial constraints learned from the depth stream. With our proposed method, both the errors in the binaural tracker and the mis-detections in the depth tracker can be significantly reduced. Real-room recordings are used to show the improved performance of the proposed method in removing outliers and reducing mis-detections.
Recent developments in video and sensing technology lead to large amounts of digital media data. Current media production relies on video from the principal camera together with a wide variety of heterogeneous sources of supporting data (photos, LiDAR point clouds, witness video cameras, HDRI and depth imagery). Registration of visual data acquired from various 2D and 3D sensing modalities is challenging because current matching and registration methods are not appropriate due to differences in formats and noise types of multi-modal data. A combined 2D/3D visualisation of this registered data allows an integrated overview of the entire dataset. For such a visualisation a web-based context presents several advantages. In this paper we propose a unified framework for registration and visualisation of this type of visual media data. A new feature description and matching method is proposed, adaptively considering local geometry, semi-global geometry and colour information in the scene for more robust registration. The resulting registered 2D/3D multi-modal visual data is too large to be downloaded and viewed directly via the web browser while maintaining an acceptable user experience. Thus, we employ hierarchical techniques for compression and restructuring to enable efficient transmission and visualisation over the web, leading to interactive visualisation as registered point clouds, 2D images, and videos in the browser, improving on the current state-of-the-art techniques for web-based visualisation of big media data. This is the first unified 3D web-based visualisation of multi-modal visual media production datasets. The proposed pipeline is tested on big multi-modal datasets typical of film and broadcast production, which are made publicly available. The proposed feature description method shows twice the feature-matching precision and more stable registration performance than existing 3D feature descriptors.
Existing work on animation synthesis can be roughly split into two approaches: those that combine segments of motion capture data, and those that perform inverse kinematics. In this paper, we present a method for performing animation synthesis of an articulated object (e.g. a human body or a dog) from a minimal set of body joint positions, following the approach of inverse kinematics. We tackle this problem from a learning perspective. Firstly, we address the need for knowledge of the physical constraints of the articulated body, so as to avoid the generation of physically impossible poses. A common solution is to heuristically specify the kinematic constraints for the skeleton model. In this paper, however, the physical constraints of the articulated body are represented using a hierarchical cluster model learnt from a motion capture database. Additionally, we show that the learnt model automatically captures the correlation between different joints through the simultaneous modelling of their angles. We then show how this model can be utilised to perform inverse kinematics in a simple and efficient manner. Crucially, we describe how IK is carried out from a minimal set of end-effector positions. Following this, we show how this "learnt inverse kinematics" framework can be used to perform animation synthesis of different types of articulated structures. To this end, the results presented include the retargeting of a flat-surface walking animation to various uneven terrains to demonstrate the synthesis of a full human body motion from the positions of only the hands, feet and torso. Additionally, we show how the same method can be applied to the animation synthesis of a dog using only its feet and torso positions.
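As an illustration of the "learnt inverse kinematics" idea, the following is a minimal sketch, assuming a toy two-link planar arm and a synthetic joint-angle database in place of motion capture data; it retrieves the nearest valid cluster-centre pose for a target end-effector position rather than reproducing the paper's hierarchical cluster model.

```python
# Minimal sketch: constraining inverse kinematics with a clustered pose prior.
# Toy 2-link planar arm; the "mocap database" is random joint angles (hypothetical data).
import numpy as np
from sklearn.cluster import KMeans

def forward_kinematics(angles, link_lengths=(1.0, 0.8)):
    """End-effector position of a planar 2-link arm for joint angles (a1, a2)."""
    a1, a2 = angles
    l1, l2 = link_lengths
    return np.array([l1 * np.cos(a1) + l2 * np.cos(a1 + a2),
                     l1 * np.sin(a1) + l2 * np.sin(a1 + a2)])

# Stand-in for a motion-capture database of valid poses (joint-angle vectors).
rng = np.random.default_rng(0)
database = rng.uniform(low=[-np.pi, 0.0], high=[np.pi, np.pi], size=(2000, 2))

# Learn a coarse model of the pose space: cluster centres act as the prior.
model = KMeans(n_clusters=50, n_init=10, random_state=0).fit(database)

def learnt_ik(target):
    """Return the cluster-centre pose whose end effector is closest to the target."""
    centres = model.cluster_centers_
    errors = [np.linalg.norm(forward_kinematics(c) - target) for c in centres]
    return centres[int(np.argmin(errors))]

pose = learnt_ik(np.array([1.2, 0.9]))
print("pose:", pose, "end effector:", forward_kinematics(pose))
```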
In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system, and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.
We describe a non-parametric algorithm for multiple-viewpoint video inpainting. Uniquely, our algorithm addresses the domain of wide-baseline multiple-viewpoint video (MVV) with no temporal look-ahead at near real-time speed. A Dictionary of Patches (DoP) is built using multi-resolution texture patches reprojected from geometric proxies available in the alternate views. We dynamically update the DoP over time, and a Markov Random Field optimisation over depth and appearance is used to resolve and align a selection of multiple candidates for a given patch; this ensures the inpainting of large regions in a plausible manner, conserving both spatial and temporal coherence. We demonstrate the removal of large objects (e.g. people) on challenging indoor and outdoor MVV exhibiting cluttered, dynamic backgrounds and moving cameras.
We propose a generalisation of the local feature matching framework, where keypoints are replaced by k-keygraphs, i.e., isomorphic directed attributed graphs of cardinality k whose vertices are keypoints. Keygraphs have structural and topological properties which are discriminative and efficient to compute, based on graph edge length and orientation as well as vertex scale and orientation. Keypoint matching is performed based on descriptor similarity. Next, 2-keygraphs are calculated; as a result, the number of incorrect keypoint matches was reduced by 75% (while the correct keypoint matches were preserved). Then, 3-keygraphs are calculated, followed by 4-keygraphs; this yielded a significant reduction of 99% in the number of remaining incorrect keypoint matches. The stage that finds 2-keygraphs has a computational cost equal to a small fraction of the cost of the keypoint matching stage, while the stages that find 3-keygraphs or 4-keygraphs have a negligible cost. In the final stage, RANSAC finds object poses represented as affine transformations mapping images. Our experiments concern large-scale object instance recognition subject to occlusion, background clutter and appearance changes. By using 4-keygraphs, RANSAC needed 1% of the iterations required with 2-keygraphs or simple keypoints. As a result, using 4-keygraphs provided better efficiency and allowed a larger number of initial keypoint matches to be established, which increased performance.
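The 2-keygraph filtering stage can be pictured with the short sketch below; the geometric consistency test (edge length ratio and rotation against the dominant similarity transform) and the synthetic matches are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of the 2-keygraph idea: a matched keypoint is kept only if
# it participates in enough keypoint pairs whose joining edge transforms
# consistently (similar relative length and rotation) between the two images.
import numpy as np

def keygraph_filter(pts_a, pts_b, scale_tol=0.2, angle_tol=np.deg2rad(15), min_support=2):
    """Keep matches supported by 2-keygraphs consistent with the dominant
    relative scale and rotation between the two images."""
    n = len(pts_a)
    pairs, log_ratio, rotation = [], [], []
    for i in range(n):
        for j in range(i + 1, n):
            ea, eb = pts_a[j] - pts_a[i], pts_b[j] - pts_b[i]
            la, lb = np.linalg.norm(ea), np.linalg.norm(eb)
            if la < 1e-6 or lb < 1e-6:
                continue
            pairs.append((i, j))
            log_ratio.append(np.log(lb / la))
            rotation.append(np.angle(np.exp(1j * (np.arctan2(eb[1], eb[0]) -
                                                  np.arctan2(ea[1], ea[0])))))
    log_ratio, rotation = np.array(log_ratio), np.array(rotation)
    r0, t0 = np.median(log_ratio), np.median(rotation)   # dominant similarity transform
    support = np.zeros(n, dtype=int)
    for (i, j), r, t in zip(pairs, log_ratio, rotation):
        if abs(r - r0) < scale_tol and abs(np.angle(np.exp(1j * (t - t0)))) < angle_tol:
            support[i] += 1
            support[j] += 1
    return support >= min_support

# Synthetic example: matches 0-3 follow one similarity transform, match 4 is an outlier.
a = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [2, 2]], float)
b = 2.0 * a + np.array([5.0, 3.0])
b[4] = [0.0, 9.0]                       # corrupt one correspondence
print(keygraph_filter(a, b))            # expect [True, True, True, True, False]
```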
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed with state-of-the-art performance reported on the popular Human 3.6M dataset [26], the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
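A minimal sketch of the two-stream late-fusion idea is given below, assuming illustrative input sizes (a 32³ PVH volume, a 39-dimensional kinematic vector per IMU frame, 21 joints) and a deliberately small network; it is not the published architecture.

```python
# Sketch of fusing a volumetric visual stream with an IMU/kinematic stream
# for 3D pose regression; all sizes are assumed for illustration.
import torch
import torch.nn as nn

class TwoStreamPose(nn.Module):
    def __init__(self, grid=32, imu_dim=39, joints=21, hidden=128):
        super().__init__()
        # Visual stream: 3D CNN over a probabilistic visual hull volume.
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Flatten(),
            nn.Linear(16 * (grid // 4) ** 3, hidden), nn.ReLU(),
        )
        # Inertial stream: LSTM over per-frame forward-kinematics solves.
        self.lstm = nn.LSTM(input_size=imu_dim, hidden_size=hidden, batch_first=True)
        # Late fusion of the two embeddings into 3D joint positions.
        self.fuse = nn.Linear(2 * hidden, joints * 3)

    def forward(self, volume, imu_seq):
        v = self.cnn(volume)                 # volume: (B, 1, G, G, G) -> (B, hidden)
        _, (h, _) = self.lstm(imu_seq)       # imu_seq: (B, T, imu_dim)
        x = torch.cat([v, h[-1]], dim=1)     # concatenate the two stream embeddings
        return self.fuse(x).view(x.shape[0], -1, 3)

model = TwoStreamPose()
pose = model(torch.randn(2, 1, 32, 32, 32), torch.randn(2, 10, 39))
print(pose.shape)                            # torch.Size([2, 21, 3])
```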
This paper presents a general approach based on the shape similarity tree for non-sequential alignment across databases of multiple unstructured mesh sequences from non-rigid surface capture. The optimal shape similarity tree for non-rigid alignment is defined as the minimum spanning tree in shape similarity space. Non-sequential alignment based on the shape similarity tree minimises the total non-rigid deformation required to register all frames in a database into a consistent mesh structure with surfaces in correspondence. This allows alignment across multiple sequences of different motions, reduces drift in sequential alignment and is robust to rapid non-rigid motion. Evaluation is performed on three benchmark databases of 3D mesh sequences with a variety of complex human and cloth motion. Comparison with sequential alignment demonstrates reduced errors due to drift and improved robustness to large non-rigid deformation, together with global alignment across multiple sequences which is not possible with previous sequential approaches.
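The shape similarity tree itself reduces to a minimum spanning tree computation over a frame-to-frame shape dissimilarity matrix; the sketch below uses a random symmetric matrix as a stand-in for the paper's shape similarity measure.

```python
# Sketch of the core construction: the shape similarity tree is the minimum
# spanning tree of the pairwise shape dissimilarity matrix between frames.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(1)
n_frames = 6
d = rng.random((n_frames, n_frames))
dissimilarity = np.triu(d, 1) + np.triu(d, 1).T    # symmetric, zero diagonal

mst = minimum_spanning_tree(dissimilarity)          # sparse matrix of tree edges
edges = np.transpose(mst.nonzero())
print("alignment paths follow these tree edges:", edges.tolist())
```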
Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering has been recently exploited for audio-visual (AV) based tracking of multiple speakers, where audio data are used to inform the particle distribution and propagation in the visual SMC-PHD filter. However, the performance of the AV-SMC-PHD filter can be affected by the mismatch between the proposal and the posterior distribution. In this paper, we present a new method to improve the particle distribution where audio information (i.e. DOA angles derived from microphone array measurements) is used to detect new born particles and visual information (i.e. histograms) is used to modify the particles with particle flow (PF). Using particle flow has the benefit of migrating particles smoothly from the prior to the posterior distribution. We compare the proposed algorithm with the baseline AV-SMC-PHD algorithm using experiments on the AV16.3 dataset with multi-speaker sequences.
Simultaneous semantically coherent object-based long-term 4D scene flow estimation, co-segmentation and reconstruction is proposed exploiting the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. In this paper we propose a framework for spatially and temporally coherent semantic 4D scene flow of general dynamic scenes from multiple view videos captured with a network of static or moving cameras. Semantic coherence results in improved 4D scene flow estimation, segmentation and reconstruction for complex dynamic scenes. Semantic tracklets are introduced to robustly initialize the scene flow in the joint estimation and enforce temporal coherence in 4D flow, semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of long-term flow, appearance and shape priors that are exploited in semantically coherent 4D scene flow estimation, co-segmentation and reconstruction. Comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in 4D scene flow, segmentation, temporally coherent semantic labelling, and reconstruction of dynamic scenes.
Object-based audio is an emerging representation for audio content, where content is represented in a reproduction-format-agnostic way and thus produced once for consumption on many different kinds of devices. This affords new opportunities for immersive, personalized, and interactive listening experiences. This article introduces an end-to-end object-based spatial audio pipeline, from sound recording to listening. A high-level system architecture is proposed, which includes novel audiovisual interfaces to support object-based capture and listener-tracked rendering, and incorporates a proposed component for objectification, i.e., recording content directly into an object-based form. Text-based and extensible metadata enable communication between the system components. An open architecture for object rendering is also proposed. The system’s capabilities are evaluated in two parts. First, listener-tracked reproduction of metadata automatically estimated from two moving talkers is evaluated using an objective binaural localization model. Second, object-based scene capture with audio extracted using blind source separation (to remix between two talkers) and beamforming (to remix a recording of a jazz group), is evaluated with perceptually-motivated objective and subjective experiments. These experiments demonstrate that the novel components of the system add capabilities beyond the state of the art. Finally, we discuss challenges and future perspectives for object-based audio workflows.
A typical high-end film production generates several terabytes of data per day, either as footage from multiple cameras or as background information regarding the set (laser scans, spherical captures, etc.). This paper presents solutions to improve the integration, and the understanding of the quality, of the multiple data sources, which are used both to support creative decisions on-set (or near it) and enhance the postproduction process. The main contributions covered in this paper are: a public multisource production dataset made available for research purposes, monitoring and quality assurance of multicamera set-ups, multisource registration, anthropocentric visual analysis for semantic content annotation, acceleration of 3D reconstruction, and integrated 2D-3D web visualization tools. Furthermore, this paper presents a toolset for analysis and visualisation of multi-modal media production datasets which enables on-set data quality verification and management, thus significantly reducing the risk and time required in production. Some of the basic techniques used for acceleration, clustering and visualization could be applied to much larger classes of big data problems.
The challenge of installing and setting up dedicated spatial audio systems can make it difficult to deliver immersive listening experiences to the general public. However, the proliferation of smart mobile devices and the rise of the Internet of Things mean that there are increasing numbers of connected devices capable of producing audio in the home. "Media device orchestration" (MDO) is the concept of utilizing an ad hoc set of devices to deliver or augment a media experience. In this paper, the concept is evaluated by implementing MDO for augmented spatial audio reproduction using object-based audio with semantic metadata. A thematic analysis of positive and negative listener comments about the system revealed three main categories of response: perceptual, technical, and content-dependent aspects. MDO performed particularly well in terms of immersion/envelopment, but the quality of listening experience was partly dependent on loudspeaker quality and listener position. Suggestions for further development based on these categories are given.
Multiple-camera systems are currently widely used in research and development as a means of capturing and synthesizing realistic 3-D video content. Studio systems for 3-D production of human performance are reviewed from the literature, and the practical experience gained in developing prototype studios is reported across two research laboratories. System design should consider the studio backdrop for foreground matting, lighting for ambient illumination, camera acquisition hardware, the camera configuration for scene capture, and accurate geometric and photometric camera calibration. A ground-truth evaluation is performed to quantify the effect of different constraints on the multiple-camera system in terms of geometric accuracy and the requirement for high-quality view synthesis. As changing camera height has only a limited influence on surface visibility, multiple-camera sets or an active vision system may be required for wide area capture, and accurate reconstruction requires a camera baseline of 25 degrees, and the achievable accuracy is 5-10 mm at current camera resolutions. Accuracy is inherently limited, and view-dependent rendering is required for view synthesis with sub-pixel accuracy where display resolutions match camera resolutions. The two prototype studios are contrasted and state-of-the-art techniques for 3-D content production demonstrated.
This paper presents techniques to animate realistic human-like motion using a compressed learnt model from 4D volumetric performance capture data. Sequences of 4D dynamic geometry representing a human performing an arbitrary motion are encoded through a generative network into a compact space representation, whilst maintaining the original properties, such as surface dynamics. An animation framework is proposed which computes an optimal motion graph using the novel capabilities of compression and generative synthesis properties of the network. This approach significantly reduces the memory space requirements, improves quality of animation, and facilitates the interpolation between motions. The framework optimises the number of transitions in the graph with respect to the shape and motion of the dynamic content. This generates a compact graph structure with low edge connectivity, and maintains realism when transitioning between motions. Finally, it demonstrates that generative networks facilitate the computation of novel poses, and provides a compact motion graph representation of captured dynamic shape enabling real-time interactive animation and interpolation of novel poses to smoothly transition between motions.
In this paper we present a framework for generating arbitrary human models and animating them realistically given a few intuitive parameters. Shape and pose space deformation (SPSD) is introduced as a technique for modeling subject specific pose induced deformations from whole-body registered 3D scans. By exploiting examples of different people in multiple poses we are able to realistically animate a novel subject by interpolating and extrapolating in a joint shape and pose parameter space. Our results show that we can produce plausible animations of new people and that greater detail is achieved by incorporating subject specific pose deformations. We demonstrate the application of SPSD to produce subject specific animation sequences driven by RGB-Z performance capture.
This paper addresses the problem of optimal alignment of non-rigid surfaces from multi-view video observations to obtain a temporally consistent representation. Conventional non-rigid surface tracking performs frame-to-frame alignment which is subject to the accumulation of errors resulting in drift over time. Recently, non-sequential tracking approaches have been introduced which re-order the input data based on a dissimilarity measure. One or more input sequences are represented in a tree with reducing alignment path length. This limits drift and increases robustness to large non-rigid deformations. However, jumps may occur in the aligned mesh sequence where tree branches meet due to independent error accumulation. Optimisation of the tree for non-sequential tracking is proposed to minimise the errors in temporal consistency due to both the drift and jumps. A novel cluster tree enforces sequential tracking in local segments of the sequence while allowing global non-sequential traversal among these segments. This provides a mechanism to create a tree structure which reduces the number of jumps between branches and limits the length of branches. Comprehensive evaluation is performed on a variety of challenging non-rigid surfaces including faces, cloth and people. This demonstrates that the proposed cluster tree achieves better temporal consistency than the previous sequential and non-sequential tracking approaches. Quantitative ground-truth comparison on a synthetic facial performance shows reduced error with the cluster tree.
This paper approaches the problem of compressing temporally consistent 3D video mesh sequences with the aim of reducing the storage cost. We present an evaluation of compression techniques which apply Principal Component Analysis to the representation of the mesh in different domain spaces, and demonstrate the applicability of mesh deformation algorithms for compression purposes. A novel layered mesh representation is introduced for compression of 3D video sequences with an underlying articulated motion, such as a person with loose clothing. Comparative evaluation on captured mesh sequences of people demonstrates that this representation achieves a significant improvement in compression compared to previous techniques. Results show a compression ratio of 8-15 for an RMS error of less than 5 mm.
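A rough sketch of PCA-based compression of a temporally consistent mesh sequence is shown below, using synthetic low-rank data; it reports a compression ratio and RMS error in the same spirit as the evaluation, but it is not the layered representation described above.

```python
# Sketch of PCA compression of a temporally consistent mesh sequence:
# each frame's vertices are flattened into a row, a low-rank basis is kept,
# and the compression ratio / RMS error are reported. Synthetic data only.
import numpy as np

frames, verts, k = 200, 5000, 16          # k = number of basis vectors kept
rng = np.random.default_rng(0)
# low-rank synthetic "animation" plus a little noise
motion = rng.standard_normal((frames, 8)) @ rng.standard_normal((8, verts * 3))
X = motion + 0.001 * rng.standard_normal((frames, verts * 3))

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
coeffs = U[:, :k] * S[:k]                 # per-frame coefficients
basis = Vt[:k]                            # PCA basis
X_hat = mean + coeffs @ basis             # reconstructed sequence

rms = np.sqrt(np.mean((X - X_hat) ** 2))
ratio = X.size / (coeffs.size + basis.size + mean.size)
print(f"compression ratio ~{ratio:.1f}, RMS error {rms:.4f}")
```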
Surface motion capture (SurfCap) of actor performance from multiple view video provides reconstruction of the natural nonrigid deformation of skin and clothing. This paper introduces techniques for interactive animation control of SurfCap sequences which allow the flexibility in editing and interactive manipulation associated with existing tools for animation from skeletal motion capture (MoCap). Laplacian mesh editing is extended using a basis model learned from SurfCap sequences to constrain the surface shape to reproduce natural deformation. Three novel approaches for animation control of SurfCap sequences, which exploit the constrained Laplacian mesh editing, are introduced: 1) space-time editing for interactive sequence manipulation; 2) skeleton-driven animation to achieve natural nonrigid surface deformation; and 3) hybrid combination of skeletal MoCap-driven animation and SurfCap sequences to extend the range of movement. These approaches are combined with high-level parametric control of SurfCap sequences in a hybrid surface and skeleton-driven animation control framework to achieve natural surface deformation with an extended range of movement by exploiting existing MoCap archives. Evaluation of each approach and the integrated animation framework is presented on real SurfCap sequences for actors performing multiple motions with a variety of clothing styles. Results demonstrate that these techniques enable flexible control for interactive animation with the natural nonrigid surface dynamics of the captured performance and provide a powerful tool to extend current SurfCap databases by incorporating new motions from MoCap sequences.
This survey reviews advances in human motion capture and analysis from 2000 to 2006, following a previous survey of papers up to 2000 [T.B. Moeslund, E. Granum, A survey of computer vision-based human motion capture, Computer Vision and Image Understanding, 81(3) (2001) 231–268.]. Human motion capture continues to be an increasingly active research area in computer vision with over 350 publications over this period. A number of significant research advances are identified together with novel methodologies for automatic initialization, tracking, pose estimation, and movement recognition. Recent research has addressed reliable tracking and pose estimation in natural scenes. Progress has also been made towards automatic understanding of human actions and behavior. This survey reviews recent trends in video-based human capture and analysis, as well as discussing open problems for future research to achieve automatic visual analysis of human movement.
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.
We introduce the concept of 4D model flow for the precomputed alignment of dynamic surface appearance across 4D video sequences of different motions reconstructed from multi-view video. Precomputed 4D model flow allows the efficient parametrization of surface appearance from the captured videos, which enables efficient real-time rendering of interpolated 4D video sequences whilst accurately reproducing visual dynamics, even when using a coarse underlying geometry. We estimate the 4D model flow using an image-based approach that is guided by available geometry proxies. We propose a novel representation in surface texture space for efficient storage and online parametric interpolation of dynamic appearance. Our 4D model flow overcomes previous requirements for computationally expensive online optical flow computation for data-driven alignment of dynamic surface appearance by precomputing the appearance alignment. This leads to an efficient rendering technique that enables the online interpolation between 4D videos in real time, from arbitrary viewpoints and with visual quality comparable to the state of the art.
Pose estimation in the context of human motion analysis is the process of approximating the body configuration in each frame of a motion sequence. We propose a novel pose estimation method based on fitting a skeletal model to tree structures built from skeletonised visual hulls reconstructed from multi-view video. The pose is estimated independently in each frame, hence the method can recover from errors in previous frames, which overcomes some problems of tracking. Publicly available datasets were used to evaluate the method. On real data the method performs at a framerate of approximately 14 fps. Using synthetic data the positions of the joints were determined with a mean error of approximately 6 cm.
Multiple view 3D video reconstruction of actor performance captures a level-of-detail for body and clothing movement which is time-consuming to produce using existing animation tools. In this paper we present a framework for concatenative synthesis from multiple 3D video sequences according to user constraints on movement, position and timing. Multiple 3D video sequences of an actor performing different movements are automatically constructed into a surface motion graph which represents the possible transitions with similar shape and motion between sequences without unnatural movement artifacts. Shape similarity over an adaptive temporal window is used to identify transitions between 3D video sequences. Novel 3D video sequences are synthesized by finding the optimal path in the surface motion graph between user specified key-frames for control of movement, location and timing. The optimal path which satisfies the user constraints whilst minimizing the total transition cost between 3D video sequences is found using integer linear programming. Results demonstrate that this framework allows flexible production of novel 3D video sequences which preserve the detailed dynamics of the captured movement for an actress with loose clothing and long hair without visible artifacts.
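The graph-traversal idea behind the surface motion graph can be illustrated as follows; the paper solves for the optimal path with integer linear programming under movement, position and timing constraints, whereas this sketch uses a plain weighted shortest-path query over hypothetical motion segments.

```python
# Simplified illustration of traversing a surface motion graph: nodes are 3D
# video segments, weighted edges are transition costs derived from shape and
# motion similarity. The segment names and costs below are made up.
import networkx as nx

g = nx.DiGraph()
g.add_weighted_edges_from([
    ("walk", "jog", 0.4), ("jog", "run", 0.3), ("walk", "stand", 0.2),
    ("stand", "jump", 0.6), ("jog", "jump", 0.9), ("run", "jump", 0.5),
])

path = nx.shortest_path(g, source="walk", target="jump", weight="weight")
cost = nx.shortest_path_length(g, source="walk", target="jump", weight="weight")
print(path, cost)   # ['walk', 'stand', 'jump'] with total transition cost 0.8
```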
This paper introduces a statistical inference framework to temporally propagate trimap labels from sparsely defined key frames to estimate trimaps for the entire video sequence. A trimap is a fundamental requirement for digital image and video matting approaches. Statistical inference is coupled with Bayesian statistics to allow robust trimap labelling in the presence of shadows, illumination variation and overlap between the foreground and background appearance. Results demonstrate that the trimaps are sufficiently accurate to allow high quality video matting using existing natural image matting algorithms. Quantitative evaluation against ground truth demonstrates that the approach achieves accurate matte estimation with less user interaction than state-of-the-art techniques.
Covariance is a well-established characterisation of the output uncertainty for estimators dealing with noisy data. It is conventionally estimated via first-order forward propagation (FOP) of input covariance. However, since FOP employs a local linear approximation of the estimator, its reliability is compromised in the case of nonlinear transformations. An alternative method, the scaled unscented transformation (SUT), is known to cope with such cases better. However, despite the nonlinear nature of many vision problems, its adoption remains limited. This paper investigates the application of SUT to common minimal geometry solvers, a class of algorithms at the core of many applications ranging from image stitching to film production and robot navigation. The contributions include an experimental comparison of SUT against FOP on synthetic and real data, and practical suggestions for adapting the original SUT to the geometry solvers. The experiments demonstrate the superiority of SUT over FOP as a covariance estimator across a range of scene types and noise levels, on synthetic and real data.
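For concreteness, the following sketch propagates a Gaussian through a nonlinear map with both FOP (numerical Jacobian) and SUT; the polar-to-Cartesian function and the SUT parameters are illustrative choices, not those used in the paper's geometry solvers.

```python
# Comparing first-order propagation (FOP) with the scaled unscented
# transformation (SUT) for pushing a Gaussian through a nonlinear map.
import numpy as np

def f(x):                       # nonlinear map: (range, bearing) -> (x, y)
    r, th = x
    return np.array([r * np.cos(th), r * np.sin(th)])

mu = np.array([10.0, np.pi / 4])
P = np.diag([0.5 ** 2, (10 * np.pi / 180) ** 2])

# --- first-order propagation: J P J^T with a numerical Jacobian
eps = 1e-6
J = np.column_stack([(f(mu + eps * e) - f(mu - eps * e)) / (2 * eps) for e in np.eye(2)])
P_fop = J @ P @ J.T

# --- scaled unscented transformation
n, alpha, beta, kappa = 2, 1e-1, 2.0, 0.0
lam = alpha ** 2 * (n + kappa) - n
S = np.linalg.cholesky((n + lam) * P)                 # matrix square root
sigmas = [mu] + [mu + S[:, i] for i in range(n)] + [mu - S[:, i] for i in range(n)]
Wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam))); Wm[0] = lam / (n + lam)
Wc = Wm.copy(); Wc[0] += 1 - alpha ** 2 + beta

Y = np.array([f(s) for s in sigmas])
mean_sut = Wm @ Y
P_sut = sum(w * np.outer(y - mean_sut, y - mean_sut) for w, y in zip(Wc, Y))

print("FOP covariance:\n", P_fop, "\nSUT covariance:\n", P_sut)
```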
We propose a framework for 2D/3D multi-modal data registration and evaluate 3D feature descriptors for registration of 3D datasets from different sources. 3D datasets of outdoor environments can be acquired using a variety of active and passive sensor technologies. Registration of these datasets into a common coordinate frame is required for subsequent modelling and visualisation. 2D images are converted into 3D structure by stereo or multi-view reconstruction techniques and registered to a unified 3D domain with other datasets in a 3D world. Multi-modal datasets have different density, noise, and types of errors in geometry. This paper provides a performance benchmark for existing 3D feature descriptors across multi-modal datasets. This analysis highlights the limitations of existing 3D feature detectors and descriptors which need to be addressed for robust multi-modal data registration. We analyse and discuss the performance of existing methods in registering various types of datasets and then identify future directions required to achieve robust multi-modal data registration.
Sparse 3D measurements of real scenes are readily estimated from N-view image sequences using structure-from-motion techniques. In this paper, we present a geometric theory for reconstruction of surface models from sparse 3D data captured from N camera views. Based on this theory, we introduce a general N-view algorithm for reconstruction of 3D models of arbitrary scenes from sparse data. This algorithm reconstructs a surface model which converges to an approximation of the real scene surfaces and is consistent with the feature visibility in all N-views. To achieve efficient reconstruction independent of the number of views a recursive reconstruction algorithm is developed which integrates the feature visibility independently for each view. This approach is shown to converge to an approximation of the real scene structure and have a computational cost which is linear in the number of views. It is assumed that structure-from-motion estimates of 3D feature locations are consistent with the multiple view visual geometry and do not contain outliers. Uncertainty in 3D feature estimates is incorporated in the feature visibility to achieve reliable reconstruction in the presence of noise inherent in estimates of 3D scene structure from real image sequences. Results are presented for reconstruction of both real and synthetic scenes together with an evaluation of the reconstruction performance in the presence of noise. The algorithm presented in this paper provides a reliable and computationally efficient approach to model reconstruction from sparse 3D scene data.
Light fields are becoming an increasingly popular method of digital content production for visual effects and virtual/augmented reality as they capture a view dependent representation enabling photo realistic rendering over a range of viewpoints. Light field video is generally captured using arrays of cameras resulting in tens to hundreds of images of a scene at each time instance. An open problem is how to efficiently represent the data preserving the view-dependent detail of the surface in such a way that is compact to store and efficient to render. In this paper we show that constructing an Eigen texture basis representation from the light field using an approximate 3D surface reconstruction as a geometric proxy provides a compact representation that maintains view-dependent realism. We demonstrate that the proposed method is able to reduce storage requirements by > 95% while maintaining the visual quality of the captured data. An efficient view-dependent rendering technique is also proposed which is performed in eigen space allowing smooth continuous viewpoint interpolation through the light field.
4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.
This paper presents a novel PDE-based method for floating-point disparity estimation which produces smooth disparity fields with sharp object boundaries for surface reconstruction. In order to avoid the over-segmentation problem of image-driven structure tensor and the blurred boundary problem of field-driven tensor, we propose a new anisotropic diffusivity function controlled by image and disparity gradients. We also embed a bi-directional disparity matching term to control the data term in occluded regions. We evaluate the proposed method on data sets from the Middlebury benchmarking site and real data sets with ground-truth models scanned by a LIDAR sensor.
This paper presents the first dynamic 3D FACS data set for facial expression research, containing 10 subjects performing between 19 and 97 different AUs both individually and in combination. In total the corpus contains 519 AU sequences. The peak expression frame of each sequence has been manually FACS coded by certified FACS experts. This provides a ground truth for 3D FACS based AU recognition systems. In order to use this data, we describe the first framework for building dynamic 3D morphable models. This includes a novel Active Appearance Model (AAM) based 3D facial registration and mesh correspondence scheme. The approach overcomes limitations in existing methods that require facial markers or are prone to optical flow drift. We provide the first quantitative assessment of such 3D facial mesh registration techniques and show how our proposed method provides more reliable correspondence.
Natural image matting is an extremely challenging image processing problem due to its ill-posed nature. It often requires skilled user interaction to aid definition of foreground and background regions. Current algorithms use these pre-defined regions to build local foreground and background colour models. In this paper we propose a novel approach which uses non-parametric statistics to model image appearance variations. This technique overcomes the limitations of previous parametric approaches which are purely colour-based and thereby unable to model natural image structure. The proposed technique consists of three successive stages: (i) background colour estimation, (ii) foreground colour estimation, (iii) alpha estimation. Colour estimation uses patch-based matching techniques to efficiently recover the optimum colour by comparison against patches from the known regions. Quantitative evaluation against ground truth demonstrates that the technique produces better results and successfully recovers fine details such as hair where many other algorithms fail.
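Once foreground and background colours have been estimated for an unknown pixel, the alpha value follows from the standard compositing equation; the sketch below shows only that closed-form step on made-up colours (the patch-based colour search itself is not reproduced here).

```python
# Given estimated foreground colour F and background colour B for a pixel with
# observed colour I, alpha follows from the compositing equation I = aF + (1-a)B
# by projecting I - B onto F - B. Colours below are illustrative.
import numpy as np

def estimate_alpha(I, F, B, eps=1e-6):
    d = F - B
    a = np.dot(I - B, d) / (np.dot(d, d) + eps)
    return float(np.clip(a, 0.0, 1.0))

I = np.array([0.55, 0.40, 0.30])   # observed pixel colour
F = np.array([0.90, 0.60, 0.40])   # estimated foreground colour (e.g. hair)
B = np.array([0.20, 0.20, 0.20])   # estimated background colour
print(estimate_alpha(I, F, B))     # ~0.5 for this example
```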
This paper presents a quantitative and qualitative analysis of surface representations used in recent statistical models of human shape and pose. Our analysis and comparison framework is twofold. Firstly, we qualitatively examine generated shapes and poses by interpolating points in the shape and pose variation spaces. Secondly, we evaluate the performance of the statistical human models in the context of human shape and pose reconstruction from silhouette. The analysis demonstrates that body shape variation can be controlled with a lower dimensional model using a PCA basis in the Euclidean space. In addition, the Euclidean representation is shown to give more accurate shape estimates than other surface representations in the absence of pose variation. Furthermore, the analysis indicates that shape and pose parametrizations based on translation and rotation invariant representations are not robust for reconstruction from silhouette without pose initialization.
In computer vision, matting is the process of accurate foreground estimation in images and videos. In this paper we present a novel patch-based approach to video matting relying on non-parametric statistics to represent image variations in appearance. This overcomes the limitation of parametric algorithms which rely only on strong colour correlation between nearby pixels. Initially we construct a clean background by utilising the foreground object’s movement across the background. For a given frame, a trimap is constructed using the background and the last frame’s trimap. A patch-based approach is used to estimate the foreground colour for every unknown pixel and finally the alpha matte is extracted. Quantitative evaluation shows that the technique performs better, in terms of accuracy and required user interaction, than the current state-of-the-art parametric approaches.
For statistical analysis purposes, RANSAC is usually treated as a Bernoulli process: each hypothesis is a Bernoulli trial with the outcome outlier-free/contaminated, and a run is a sequence of such trials. However, this model only covers the special case where all outlier-free hypotheses are equally good, e.g. generated from noise-free data. In this paper, we explore a more general model which obviates the noise-free data assumption: we consider RANSAC a random process returning the best hypothesis among a number of hypotheses drawn from a finite set. We employ the rank of the returned hypothesis within this set for the statistical characterisation of the output, present a closed-form expression for its exact probability mass function, and demonstrate that it is closely approximated by a standard parametric distribution. This characterisation leads to two novel termination criteria, which indicate the number of iterations needed to come arbitrarily close to the global minimum with a specified probability. We also establish the conditions defining when a RANSAC process is statistically equivalent to a cascade of shorter RANSAC processes. These conditions justify a RANSAC scheme with dedicated stages to handle the outliers and the noise separately. We demonstrate the validity of the developed theory via Monte-Carlo simulations and real data experiments on a number of common geometry estimation problems. We conclude that a two-stage RANSAC process offers similar performance guarantees at a much lower cost than the equivalent one-stage process, and that a cascaded set-up has better performance than LO-RANSAC, without the added complexity of a nested RANSAC implementation.
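For context, the classical Bernoulli-model stopping rule that this work generalises can be computed as follows; the confidence level, inlier ratio and minimal sample size are illustrative values.

```python
# The classical Bernoulli-model stopping rule: the number of RANSAC iterations
# needed to draw at least one outlier-free minimal sample with confidence p,
# given inlier ratio w and minimal sample size s.
import math

def ransac_iterations(p=0.99, w=0.5, s=4):
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))

# e.g. homography estimation (s = 4) with 50% inliers
print(ransac_iterations(p=0.99, w=0.5, s=4))   # 72 iterations
```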
Synchronisation is an essential requirement for multi-view 3D reconstruction of dynamic scenes. However, the use of HD cameras and large set-ups puts considerable stress on hardware and causes frame drops, which are usually detected by manually verifying very large amounts of data. This paper improves [9], and extends it with frame-drop detection capability. In order to spot frame-drop events, the algorithm fits a broken line to the frame index correspondences for each camera pair, and then fuses the pairwise drop hypotheses into a consistent, absolute frame-drop estimate. The success and the practical utility of the improved pipeline are demonstrated through a number of experiments, including 3D reconstruction and free-viewpoint video rendering tasks.
Sleep quality is an important determinant of human health and wellbeing. Novel technologies that can quantify sleep quality at scale are required to enable the diagnosis and epidemiology of poor sleep. One important indicator of sleep quality is body posture. In this paper, we present the design and implementation of a non-contact sleep monitoring system that analyses body posture and movement. Supervised machine learning strategies applied to non-contact vision-based infrared camera data, using a transfer learning approach, successfully quantified sleep poses of participants covered by a blanket. This represents the first occasion that such a machine learning approach has been used to successfully detect four predefined poses and the empty-bed state during 8-10 hour overnight sleep episodes representing a realistic domestic sleep situation. The methodology was evaluated against manually scored sleep poses and poses estimated using clinical polysomnography measurement technology. In a cohort of 12 healthy participants, we find that a ResNet-152 pre-trained network achieved the best performance compared with a standard de novo CNN and other pre-trained networks. The performance of our approach was better than other video-based methods for sleep pose estimation and higher than the clinical standard for pose estimation using a polysomnography position sensor. It can be concluded that infrared video capture coupled with deep learning AI can be successfully used to quantify sleep poses as well as the transitions between poses in realistic nocturnal conditions, and that this non-contact approach provides superior pose estimation compared to currently accepted clinical methods.
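A hedged sketch of the transfer-learning set-up is given below: a pre-trained ResNet-152 backbone with its final layer replaced for five classes (four poses plus the empty-bed state). The optimiser, learning rate, input size and dummy data are assumptions for illustration only, not the paper's training configuration.

```python
# Sketch: fine-tuning a pre-trained ResNet-152 head for sleep-pose classification.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5                         # four poses + empty bed, as in the abstract
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
for p in model.parameters():            # freeze the pre-trained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new classification head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# one illustrative training step on a dummy batch of infrared frames
frames = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(frames), labels)
loss.backward()
optimizer.step()
print(float(loss))
```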
In this paper we address the problem of reliable real-time 3D-tracking of multiple objects which are observed in multiple wide-baseline camera views. Establishing the spatio-temporal correspondence is a problem with combinatorial complexity in the number of objects and views. In addition, vision-based tracking suffers from the ambiguities introduced by occlusion, clutter and irregular 3D motion. In this paper we present a discrete relaxation algorithm for reducing the intrinsic combinatorial complexity by pruning the decision tree based on unreliable prior information from independent 2D-tracking for each view. The algorithm improves the reliability of spatio-temporal correspondence by simultaneous optimisation over multiple views in the case where 2D-tracking in one or more views is ambiguous. Application to the 3D reconstruction of human movement, based on tracking of skin-coloured regions in three views, demonstrates considerable improvement in reliability and performance. Results demonstrate that the optimisation over multiple views gives correct 3D reconstruction and object labelling in the presence of incorrect 2D-tracking whilst maintaining real-time performance.
Motion capture (mocap) is widely used in a large number of industrial applications. Our work offers a new way of representing the mocap facial dynamics in a high resolution 3D morphable model expression space. A data-driven approach to modelling of facial dynamics is presented. We propose a way to combine high quality static face scans with dynamic 3D mocap data which has lower spatial resolution in order to study the dynamics of facial expressions.
We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈ 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.
Person tracking is an often studied facet of computer vision, with applications in security, automated driving and entertainment. However, despite the advantages they offer, few current solutions work for 360° cameras, due to projection distortion. This paper presents a simple yet robust method for 3D tracking of multiple people in a scene from a pair of 360° cameras. By using 2D pose information, rather than potentially unreliable 3D position or repeated colour information, we create a tracker that is both appearance independent as well as capable of operating at narrow baseline. Our results demonstrate state of the art performance on 360° scenes, as well as the capability to handle vertical axis rotation.
Recent progress in Virtual Reality (VR) and Augmented Reality (AR) allows us to experience various VR/AR applications in our daily life. In order to maximise the immersiveness of the user in VR/AR environments, plausible spatial audio reproduction synchronised with visual information is essential. In this paper, we propose a simple and efficient system to estimate room acoustics for plausible reproduction of spatial audio using 360° cameras for VR/AR applications. A pair of 360° images is used for room geometry and acoustic property estimation. A simplified 3D geometric model of the scene is estimated by depth estimation from captured images and semantic labelling using a convolutional neural network (CNN). The real environment acoustics are characterised by frequency-dependent acoustic predictions of the scene. Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties in the scene. The reconstructed scenes are rendered with synthesised spatial audio as VR/AR content. The results of estimated room geometry and simulated spatial audio are evaluated against actual measurements and audio calculated from ground-truth Room Impulse Responses (RIRs) recorded in the rooms.
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.
We introduce the first approach to solve the challenging problem of automatic 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈40%) improvement in semantic segmentation, reconstruction and scene flow accuracy. In addition to the evaluation on several indoor and outdoor scenes, the proposed joint 4D scene understanding framework is applied to challenging outdoor sports scenes in the wild captured with manually operated wide-baseline broadcast cameras.
While the accuracy of multi-view stereo (MVS) has continued to advance, its performance reconstructing challenging scenes from images with a limited depth of field is generally poor. Typical implementations assume a pinhole camera model, and therefore treat defocused regions as a source of outliers. In this paper, we address these limitations by instead modelling the camera as a thick lens. Doing so allows us to exploit the complementary nature of stereo and defocus information, and overcome constraints imposed by traditional MVS methods. Using our novel reconstruction framework, we recover complete 3D models of complex macro-scale scenes. Our approach demonstrates robustness to view-dependent materials, and outperforms state-of-the-art MVS and depth from defocus across a range of real and synthetic datasets.
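The defocus cue exploited here can be illustrated with the simpler thin-lens circle-of-confusion model below; the focal length, f-number and focus distance are assumed values, and the paper's thick-lens model is not reproduced.

```python
# Thin-lens circle-of-confusion diameter as a function of scene depth,
# c = A * f * |s - s_f| / (s * (s_f - f)), with aperture diameter A = f / N.
# All optical parameters below are assumed, macro-style values.
def circle_of_confusion(depth_m, focus_m=0.25, focal_mm=100.0, f_number=2.8):
    f = focal_mm / 1000.0
    aperture = f / f_number
    return aperture * f * abs(depth_m - focus_m) / (depth_m * (focus_m - f))

for depth in (0.23, 0.24, 0.25, 0.26, 0.27):
    print(f"depth {depth:.2f} m -> blur diameter {circle_of_confusion(depth) * 1e3:.3f} mm")
```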
In this paper we propose a pipeline for estimating acoustic 3D room structure with geometry and attribute prediction using spherical 360° cameras. Instead of setting up microphone arrays with loudspeakers to measure acoustic parameters for specific rooms, a simple and practical single-shot capture of the scene using a stereo pair of 360° cameras can be used to simulate those acoustic parameters. We assume that the room and objects can be represented as cuboids aligned to the main axes of the room coordinate system (Manhattan world). The scene is captured as a stereo pair using off-the-shelf consumer spherical 360° cameras. A cuboid-based 3D room geometry model is estimated by correspondence matching between captured images and semantic labelling using a convolutional neural network (SegNet). The estimated geometry is used to produce frequency-dependent acoustic predictions of the scene. This is, to our knowledge, the first attempt in the literature to use visual geometry estimation and object classification algorithms to predict acoustic properties. Results are compared to measurements through calculated reverberant spatial audio object parameters used for reverberation reproduction customised to the given loudspeaker set-up.
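As an illustration of frequency-dependent acoustic prediction from an estimated cuboid room, the sketch below applies Sabine's reverberation formula with generic, textbook-style absorption coefficients; the room dimensions, materials, bands and values are assumptions, not the paper's simulation.

```python
# Frequency-dependent reverberation-time prediction from an estimated cuboid
# room using Sabine's formula: T60 = 0.161 * V / sum(S_i * a_i).
room = (6.0, 4.0, 3.0)                   # assumed room dimensions in metres (w, d, h)
w, d, h = room
volume = w * d * h
surfaces = {                              # area and per-band absorption (500 Hz, 1 kHz, 2 kHz)
    "floor (carpet)": (w * d, (0.30, 0.35, 0.40)),
    "ceiling (plaster)": (w * d, (0.05, 0.04, 0.04)),
    "walls (brick)": (2 * (w + d) * h, (0.03, 0.04, 0.05)),
}

for band, name in enumerate(("500 Hz", "1 kHz", "2 kHz")):
    absorption = sum(area * coeffs[band] for area, coeffs in surfaces.values())
    t60 = 0.161 * volume / absorption
    print(f"{name}: predicted T60 = {t60:.2f} s")
```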
As personalised immersive display systems have been intensely explored in virtual reality (VR), plausible 3D audio corresponding to the visual content is required to provide more realistic experiences to users. It is well known that spatial audio synchronised with visual information improves a sense of immersion but limited research progress has been achieved in immersive audio-visual content production and reproduction. In this paper, we propose an end-to-end pipeline to simultaneously reconstruct 3D geometry and acoustic properties of the environment from a pair of omnidirectional panoramic images. A semantic scene reconstruction and completion method using a deep convolutional neural network is proposed to estimate the complete semantic scene geometry in order to adapt spatial audio reproduction to the scene. Experiments provide objective and subjective evaluations of the proposed pipeline for plausible audio-visual VR reproduction of real scenes.
Multi-view stereo remains a popular choice when recovering 3D geometry, despite performance varying dramatically according to the scene content. Moreover, typical pinhole camera assumptions fail in the presence of the shallow depth of field inherent to macro-scale scenes, limiting application to larger scenes with diffuse reflectance. However, the presence of defocus blur can itself be considered a useful reconstruction cue, particularly in the presence of view-dependent materials. With this in mind, we explore the complementary nature of stereo and defocus cues in the context of multi-view 3D reconstruction, and propose a complete pipeline for scene modelling from a finite aperture camera that encompasses image formation, camera calibration and reconstruction stages. As part of our evaluation, an ablation study reveals how each cue contributes to the higher performance observed over a range of complex materials and geometries. Though of lesser concern with large apertures, the effects of image noise are also considered. By introducing pre-trained deep feature extraction into our cost function, we show a step improvement over per-pixel comparisons, as well as verifying the cross-domain applicability of networks trained on largely in-focus data when applied to defocused images. Finally, we compare to a number of modern multi-view stereo methods, and demonstrate how the use of both cues leads to a significant increase in performance across several synthetic and real datasets.
We present a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis viseme primitives are identified from the corpus using automatic speech recognition. A novel nonrigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation together with a novel optical flow optimisation using both shape and colour, enables accurate and efficient nonrigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence using the nonrigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis has been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.
In this paper, a novel probabilistic Bayesian tracking scheme is proposed and applied to bimodal measurements consisting of tracking results from the depth sensor and audio recordings collected using binaural microphones. We use random finite sets to cope with varying number of tracking targets. A measurement-driven birth process is integrated to quickly localize any emerging person. A new bimodal fusion method that prioritizes the most confident modality is employed. The approach was tested on real room recordings and experimental results show that the proposed combination of audio and depth outperforms individual modalities, particularly when there are multiple people talking simultaneously and when occlusions are frequent.
The process of understanding acoustic properties of environments is important for several applications, such as spatial audio, augmented reality and source separation. In this paper, multichannel room impulse responses are recorded and transformed into their direction of arrival (DOA)-time domain, by employing a superdirective beamformer. This domain can be represented as a 2D image. Hence, a novel image processing method is proposed to analyze the DOA-time domain, and estimate the reflection times of arrival and DOAs. The main acoustically reflective objects are then localized. Recent studies in acoustic reflector localization usually assume the room to be free from furniture. Here, by analyzing the scattered reflections, an algorithm is also proposed to binary classify reflectors into room boundaries and interior furniture. Experiments were conducted in four rooms. The classification algorithm showed high quality performance, also improving the localization accuracy, for non-static listener scenarios.
We present a novel method to relight video sequences given known surface shape and illumination. The method preserves fine visual details. It requires single view video frames, approximate 3D shape and standard studio illumination only, making it applicable in studio production. The technique is demonstrated for relighting video sequences of faces.
This paper introduces a novel 4D shape descriptor to match temporal surface sequences. A quantitative evaluation based on the receiver-operator characteristic (ROC) curve is presented to compare the performance of conventional 3D shape descriptors with and without using a time filter. Feature-based 3D shape descriptors including shape distribution (Osada et al., 2002), spin image (Johnson et al., 1999), shape histogram (Ankerst et al., 1999) and spherical harmonics (Kazhdan et al., 2003) are considered. Evaluation shows that filtered descriptors outperform unfiltered descriptors and the best performing volume-sampling shape-histogram descriptor is extended to define a new 4D "shape-flow" descriptor. Shape-flow matching demonstrates improved performance in the context of matching time-varying sequences, which is motivated by the requirement to connect similar sequences for animation production. Both simulated and real 3D human surface motion sequences are used for evaluation.
We propose a region-based method to extract foreground regions from colour video sequences. The foreground region is decided by voting, in which scores from background subtraction are assigned to sub-regions produced by graph-based segmentation. Experiments show that the proposed algorithm improves on conventional approaches, especially in strong shadow regions.
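A minimal sketch of region-level voting of this kind is given below, assuming per-pixel scores from some background-subtraction step and using Felzenszwalb graph-based segmentation from scikit-image; the threshold and parameter values are illustrative only.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def region_voting_foreground(frame, bg_score, scale=100, threshold=0.5):
    """Assign each graph-based segment to foreground by averaging
    per-pixel background-subtraction scores (a simplified sketch).

    frame:    (H, W, 3) colour image
    bg_score: (H, W) per-pixel foreground likelihood from background subtraction
    """
    segments = felzenszwalb(frame, scale=scale)       # graph-based over-segmentation
    mask = np.zeros(bg_score.shape, dtype=bool)
    for label in np.unique(segments):
        region = segments == label
        if bg_score[region].mean() > threshold:       # region-level vote
            mask[region] = True
    return mask
```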
In this paper we consider the problem of aligning multiple non-rigid surface mesh sequences into a single temporally consistent representation of the shape and motion. A global alignment graph structure is introduced which uses shape similarity to identify frames for inter-sequence registration. Graph optimisation is performed to minimise the total non-rigid deformation required to register the input sequences into a common structure. The resulting global alignment ensures that all input sequences are resampled with a common mesh structure which preserves the shape and temporal correspondence. Results demonstrate temporally consistent representation of several public databases of mesh sequences for multiple people performing a variety of motions with loose clothing and hair.
Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering has been recently exploited for audio-visual (AV) based tracking of multiple speakers, where audio data are used to inform the particle distribution and propagation in the visual SMC-PHD filter. However, the performance of the AV-SMC-PHD filter can be affected by the mismatch between the proposal and the posterior distribution. In this paper, we present a new method to improve the particle distribution where audio information (i.e. DOA angles derived from microphone array measurements) is used to detect new born particles and visual information (i.e. histograms) is used to modify the particles with particle flow (PF). Using particle flow has the benefit of migrating particles smoothly from the prior to the posterior distribution. We compare the proposed algorithm with the baseline AV-SMC-PHD algorithm using experiments on the AV16.3 dataset with multi-speaker sequences.
The work on 3D human pose estimation has seen a significant amount of progress in recent years, particularly due to the widespread availability of commodity depth sensors. However, most pose estimation methods follow a tracking-as-detection approach which does not explicitly handle occlusions, thus introducing outliers and identity association issues when multiple targets are involved. To address these issues, we propose a new method based on Probability Hypothesis Density (PHD) filter. In this method, the PHD filter with a novel clutter intensity model is used to remove outliers in the 3D head detection results, followed by an identity association scheme with occlusion detection for the targets. Experimental results show that our proposed method greatly mitigates the outliers, and correctly associates identities to individual detections with low computational cost.
This paper introduces a model-based approach to capturing a person's shape, appearance and movement. A 3D animated model of a clothed person's whole-body shape and appearance is automatically constructed from a set of orthogonal view colour images. The reconstructed model of a person is then used together with the least-squares inverse-kinematics framework of Bregler and Malik (1998) to capture simple 3D movements from a video image sequence.
In this paper, we propose an approach to indoor scene understanding from observation of people in single view spherical video. As input, our approach takes a centrally located spherical video capture of an indoor scene, estimating the 3D localisation of human actions performed throughout the long term capture. The central contribution of this work is a deep convolutional encoder-decoder network trained on a synthetic dataset to reconstruct regions of affordance from captured human activity. The predicted affordance segmentation is then applied to compose a reconstruction of the complete 3D scene, integrating the affordance segmentation into 3D space. The mapping learnt between human activity and affordance segmentation demonstrates that omnidirectional observation of human activity can be applied to scene understanding tasks such as 3D reconstruction. We show that our approach using only observation of people performs well against previous approaches, allowing reconstruction of occluded regions and labelling of scene affordances.
In this paper we present the first Facial Action Coding System (FACS) valid model to be based on dynamic 3D scans of human faces for use in graphics and psychological research. The model consists of FACS Action Unit (AU) based parameters and has been independently validated by FACS experts. Using this model, we explore the perceptual differences between linear facial motions – represented by a linear blend shape approach – and real facial motions that have been synthesized through the 3D facial model. Through numerical measures and visualizations, we show that this latter type of motion is geometrically nonlinear in terms of its vertices. In experiments, we explore the perceptual benefits of nonlinear motion for different AUs. Our results are insightful for designers of animation systems both in the entertainment industry and in scientific research. They reveal a significant overall benefit to using captured nonlinear geometric vertex motion over linear blend shape motion. However, our findings suggest that not all motions need to be animated nonlinearly. The advantage may depend on the type of facial action being produced and the phase of the movement.
Modern digital film production uses large quantities of data from videos, digital photographs, LIDAR scans, spherical photography and many other sources to create the final film frames. The processing and management of this massive amount of heterogeneous data consumes enormous resources. We propose an integrated pipeline for 2D/3D data registration for film production. We present the prototype application Jigsaw, which allows users to efficiently manage and process various data from digital photographs to 3D point clouds. A key requirement in the use of multi-modal 2D/3D data for content production is the registration into a common coordinate frame. 3D geometric information is reconstructed from 2D data and registered to the reference 3D models using 3D feature matching. We provide a public multi-modal database captured with a wide variety of devices in different environments to assist further research. An order of magnitude gain in efficiency is achieved with the proposed approach.
We propose a framework for 2D/3D multi-modal data registration and evaluate 3D feature descriptors for registration of 3D datasets from different sources. 3D datasets of outdoor environments can be acquired using a variety of active and passive sensor technologies including laser scanning and video cameras. Registration of these datasets into a common coordinate frame is required for subsequent modelling and visualisation. 2D images are converted into 3D structure by stereo or multi-view reconstruction techniques and registered to a unified 3D domain with other datasets in a 3D world. Multi-modal datasets have different density, noise, and types of errors in geometry. This paper provides a performance benchmark for existing 3D feature descriptors across multi-modal datasets. Performance is evaluated for the registration of datasets obtained from high-resolution laser scanning with reconstructions obtained from images and video. This analysis highlights the limitations of existing 3D feature detectors and descriptors which need to be addressed for robust multi-modal data registration. We analyse and discuss the performance of existing methods in registering various types of datasets then identify future directions required to achieve robust multi-modal 3D data registration.
Digital content production traditionally requires highly skilled artists and animators to first manually craft shape and appearance models and then instill the models with a believable performance. Motion capture technology is now increasingly used to record the articulated motion of a real human performance to increase the visual realism in animation. Motion capture is limited to recording only the skeletal motion of the human body and requires the use of specialist suits and markers to track articulated motion. In this paper we present surface capture, a fully automated system to capture shape and appearance as well as motion from multiple video cameras as a basis to create highly realistic animated content from an actor’s performance in full wardrobe. We address wide-baseline scene reconstruction to provide 360 degree appearance from just 8 camera views and introduce an efficient scene representation for level of detail control in streaming and rendering. Finally we demonstrate interactive animation control in a computer games scenario using a captured library of human animation, achieving a frame rate of 300fps on consumer level graphics hardware.
This paper presents a system for simultaneous capture of video sequences of face shape and colour appearance. Shape capture uses a projected infra-red structured light pattern together with stereo reconstruction to simultaneously acquire full resolution shape and colour image sequences at video rate. Displacement mapping techniques are introduced to represent dynamic face surface shape as a displacement video. This unifies the representation of face shape and colour. The displacement video representation enables efficient registration, integration and spatiotemporal analysis of captured face data. Results demonstrate that the system achieves video-rate (25Hz) acquisition of dynamic 3D colour faces at PAL resolution with an rms accuracy of 0.2mm and a visual quality comparable to the captured video.
This paper presents a framework for performance-based animation and retargeting of high-resolution face models from motion capture. A novel method is introduced for learning a mapping between sparse 3D motion capture markers and dense high-resolution 3D scans of face shape and appearance. A high-resolution facial expression space is learnt from a set of 3D face scans as a person-specific morphable model. Sparse 3D face points sampled at the motion capture marker positions are used to build a corresponding low-resolution expression space to represent the facial dynamics from motion capture. Radial basis function interpolation is used to automatically map the low-resolution motion capture of facial dynamics to the high-resolution facial expression space. This produces a high-resolution facial animation with the detailed shape and appearance of real facial dynamics. Retargeting is introduced to transfer facial expressions to a novel subject captured from a single photograph or 3D scan. The subject-specific high-resolution expression space is mapped to the novel subject based on anatomical differences in face shape. Results of facial animation and retargeting demonstrate realistic animation of expressions from motion capture.
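The sketch below illustrates the general idea of mapping a low-resolution marker space to a high-resolution expression space with radial basis functions, using scipy's RBFInterpolator; the data shapes, kernel choice and random placeholder data are assumptions, not the paper's setup.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Hypothetical training data: for each of N example expressions we have the
# sparse marker positions (flattened) and the corresponding coefficients of a
# person-specific high-resolution morphable expression model.
N, n_markers, n_coeffs = 50, 40, 30
marker_train = np.random.rand(N, n_markers * 3)   # low-resolution expression space
coeff_train = np.random.rand(N, n_coeffs)         # high-resolution expression space

# Learn a radial basis function mapping between the two spaces.
rbf = RBFInterpolator(marker_train, coeff_train, kernel='linear')

# Drive the high-resolution model from a new motion-capture frame.
coeffs = rbf(np.random.rand(1, n_markers * 3))
```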
We consider the problem of geometric integration and representation of multiple views of non-rigidly deforming 3D surface geometry captured at video rate. Instead of treating each frame as a separate mesh we present a representation which takes into consideration temporal and spatial coherence in the data where possible. We first segment gross base transformations using correspondence based on a closest point metric and represent these motions as piecewise rigid transformations. The remaining residual is encoded as displacement maps at each frame giving a displacement video. At both these stages occlusions and missing data are interpolated to give a representation which is continuous in space and time. We demonstrate the integration of multiple views for four different non-rigidly deforming scenes: hand, face, cloth and a composite scene. The approach achieves the integration of multiple-view data at different times into one representation which can be processed and edited.
This paper introduces a novel method of surface refinement for free-viewpoint video of dynamic scenes. Unlike previous approaches, the method presented here uses both visual hull and silhouette contours to constrain refinement of view-dependent depth maps from wide baseline views. A technique for extracting silhouette contours as rims in 3D from the view-dependent visual hull (VDVH) is presented. A new method for improving correspondence is introduced, where refinement of the VDVH is posed as a global problem in projective ray space. Artefacts of global optimisations are reduced by incorporating rims as constraints. Real time rendering of virtual views in a free-viewpoint video system is achieved using an image+depth representation for each real view. Results illustrate the high quality of rendered views achieved through this refinement technique.
In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.
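A simplified way to build such a dynamics-aware parameterisation is to augment each lip-shape sample with its temporal derivative before projecting to a low-dimensional space; the sketch below (array shapes and component count are assumptions) does this with PCA via an SVD.

```python
import numpy as np

def dynamic_lip_parameters(lip_frames, n_components=10):
    """Build a low-dimensional space over lip shape and its temporal
    derivative, so each sample keeps its relationship to how it is changing
    (a simplified sketch of the dynamic parameterisation described above).

    lip_frames: (T, P, 3) tracked 3D lip vertices over T frames.
    """
    T = lip_frames.shape[0]
    pos = lip_frames.reshape(T, -1)
    vel = np.gradient(pos, axis=0)                        # temporal derivative
    feats = np.hstack([pos, vel])
    feats -= feats.mean(axis=0)
    _, _, Vt = np.linalg.svd(feats, full_matrices=False)  # PCA via SVD
    return feats @ Vt[:n_components].T                    # trajectory in the manifold
```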
In this paper we present an automatic key-frame selection method to summarise 3D video sequences. Key-frame selection is based on optimisation for the set of frames which give the best representation of the sequence according to a rate-distortion trade-off. Distortion of the summarisation from the original sequence is based on measurement of self-similarity using volume histograms. The method evaluates the globally optimal set of key-frames to represent the entire sequence without requiring pre-segmentation of the sequence into shots or temporal correspondence. Results demonstrate that for 3D video sequences of people wearing a variety of clothing the summarisation automatically selects a set of key-frames which represent the dynamics. Comparative evaluation of rate-distortion characteristics with previous 3D video summarisation demonstrates improved performance.
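For illustration, the sketch below selects key-frames greedily from per-frame volume histograms, using a chi-squared self-similarity distance as the distortion measure; the paper's globally optimal rate-distortion formulation is not reproduced here.

```python
import numpy as np

def summarise(volume_histograms, n_keyframes):
    """Greedy key-frame selection minimising distortion measured as the
    distance from every frame to its nearest key-frame (a simplified sketch).

    volume_histograms: (T, B) one volume histogram per frame.
    """
    H = np.asarray(volume_histograms, dtype=float)
    # Pairwise chi-squared distances between frame histograms (self-similarity).
    diff = H[:, None, :] - H[None, :, :]
    summ = H[:, None, :] + H[None, :, :] + 1e-8
    dist = 0.5 * np.sum(diff ** 2 / summ, axis=2)

    keyframes = [int(np.argmin(dist.sum(axis=1)))]        # start from the medoid
    while len(keyframes) < n_keyframes:
        nearest = dist[:, keyframes].min(axis=1)          # distortion per frame
        keyframes.append(int(np.argmax(nearest)))         # cover the worst frame
    return sorted(keyframes)
```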
The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as the SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles as they need to be propagated over the entire image in every frame to detect the new speaker presence in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (V-SMC-PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named AVMS-SMC-PHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD based on the AV16.3, AMI and CLEAR datasets.
In computer vision, matting is the process of accurate foreground estimation in images and videos. In this paper we present a novel patch-based approach to video matting relying on non-parametric statistics to represent image variations in appearance. This overcomes the limitation of parametric algorithms which rely solely on strong colour correlation between nearby pixels. Initially we construct a clean background by utilising the foreground object’s movement across the background. For a given frame, a trimap is constructed using the background and the last frame’s trimap. A patch-based approach is used to estimate the foreground colour for every unknown pixel and finally the alpha matte is extracted. Quantitative evaluation shows that the technique performs better, in terms of the accuracy and the required user interaction, than the current state-of-the-art parametric approaches.
Obtaining a foreground silhouette across multiple views is one of the fundamental steps in 3D reconstruction. In this paper we present a novel video segmentation approach, to obtain a foreground silhouette, for scenes captured by a wide-baseline camera rig given a sparse manual interaction in a single view. The algorithm is based on trimap propagation, a framework used in video matting. Bayesian inference coupled with camera calibration information are used to spatio-temporally propagate high confidence trimap labels across the multi-view video to obtain coarse silhouettes which are later refined using a matting algorithm. Recent techniques have been developed for foreground segmentation, based on image matting, in multiple views but they are limited to narrow baseline with low foreground variation. The proposed wide-baseline silhouette propagation is robust to inter-view foreground appearance changes, shadows and similarity in foreground/background appearance. The approach has demonstrated good performance in silhouette estimation for views up to 180 degree baseline (opposing views). The segmentation technique has been fully integrated in a multi-view reconstruction pipeline. The results obtained demonstrate the suitability of the technique for multi-view reconstruction with wide-baseline camera set-ups and natural background.
Action matching, where a recorded sequence is matched against, and synchronised with, a suitable proxy from a library of animations, is a technique for generating a synthetic representation of a recorded human activity. This proxy can then be used to represent the action in a virtual environment or as a prior on further processing of the sequence. In this paper we present a novel technique for performing action matching in outdoor sports environments. Outdoor sports broadcasts are typically multi-camera environments and as such reconstruction techniques can be applied to the footage to generate a 3D model of the scene. However due to poor calibration and matting this reconstruction is of a very low quality. Our technique matches the 3D reconstruction sequence against a predefined library of actions to select an appropriate high quality synthetic representation. A hierarchical Markov model combined with 3D summarisation of the data allows a large number of different actions to be matched successfully to the sequence in a rate-invariant manner without prior segmentation of the sequence into discrete units. The technique is applied to data captured at rugby and soccer games. ©2009 IEEE.
Many practical applications require an accurate knowledge of the extrinsic calibration (i.e., pose) of a moving camera. The existing SLAM and structure-from-motion solutions are not robust to scenes with large dynamic objects, and do not fully utilize the available information in the presence of static cameras, a common practical scenario. In this paper, we propose an algorithm that addresses both of these issues for a hybrid static-moving camera setup. The algorithm uses the static cameras to build a sparse 3D model of the scene, with respect to which the pose of the moving camera is estimated at each time instant. The performance of the algorithm is studied through extensive experiments that cover a wide range of applications, and is shown to be satisfactory.
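In the spirit of registering the moving camera against a sparse model built from the static cameras, the sketch below uses OpenCV's robust PnP solver; the function name, threshold and input conventions are illustrative assumptions rather than the paper's algorithm.

```python
import cv2
import numpy as np

def register_moving_camera(points_3d, points_2d, K, dist=None):
    """Estimate the moving camera's pose against a sparse 3D model built from
    the static cameras (illustrative use of a robust PnP solver; variable
    names and thresholds are assumptions, not the paper's implementation).

    points_3d: (N, 3) model points matched to the moving-camera image
    points_2d: (N, 2) corresponding image observations
    K:         (3, 3) intrinsic matrix of the moving camera
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32), points_2d.astype(np.float32),
        K, dist, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)            # rotation matrix of the camera pose
    return R, tvec, inliers
```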
A 4D parametric motion graph representation is presented for interactive animation from actor performance capture in a multiple camera studio. The representation is based on a 4D model database of temporally aligned mesh sequence reconstructions for multiple motions. High-level movement controls such as speed and direction are achieved by blending multiple mesh sequences of related motions. A real-time mesh sequence blending approach is introduced which combines the realistic deformation of previous non-linear solutions with efficient online computation. Transitions between different parametric motion spaces are evaluated in real-time based on surface shape and motion similarity. 4D parametric motion graphs allow real-time interactive character animation while preserving the natural dynamics of the captured performance. © 2012 ACM.
Conventional view-dependent texture mapping techniques produce composite images by blending subsets of input images, weighted according to their relative influence at the rendering viewpoint, over regions where the views overlap. Geometric or camera calibration errors often result in a loss of detail due to blurring or double exposure artefacts which tends to be exacerbated by the number of blending views considered. We propose a novel view-dependent rendering technique which optimises the blend region dynamically at rendering time, and reduces the adverse effects of camera calibration or geometric errors otherwise observed. The technique has been successfully integrated in a rendering pipeline which operates at interactive frame rates. Improvements over state-of-the-art view-dependent texture mapping techniques are illustrated on a synthetic scene as well as real imagery of a large scale outdoor scene where large camera calibration and geometric errors are present.
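A minimal example of conventional view-dependent blending weights, computed from the angular proximity between the virtual viewpoint and the input cameras, is sketched below; the dynamic blend-region optimisation proposed in the paper is not reproduced, and the parameter values are assumptions.

```python
import numpy as np

def view_blend_weights(render_dir, camera_dirs, k=3, sharpness=8.0):
    """Weight the k input cameras closest in viewing direction to the virtual
    viewpoint (a minimal sketch of view-dependent blending).

    render_dir:  (3,) unit viewing direction of the virtual camera
    camera_dirs: (N, 3) unit viewing directions of the input cameras
    """
    cos_sim = camera_dirs @ render_dir                    # angular proximity
    nearest = np.argsort(-cos_sim)[:k]                    # k most aligned views
    w = np.zeros(len(camera_dirs))
    w[nearest] = np.maximum(cos_sim[nearest], 0.0) ** sharpness
    return w / (w.sum() + 1e-8)
```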
This paper presents a method to estimate alpha mattes for video sequences of the same foreground scene from wide-baseline views given sparse key-frame trimaps in a single view. A statistical inference framework is introduced for spatio-temporal propagation of high-confidence trimap labels between video sequences without a requirement for correspondence or camera calibration and motion estimation. Multiple view trimap propagation integrates appearance information between views and over time to achieve robust labelling in the presence of shadows, changes in appearance with view point and overlap between foreground and background appearance. Results demonstrate that trimaps are sufficiently accurate to allow high-quality video matting using existing single view natural image matting algorithms. Quantitative evaluation against ground-truth demonstrates that the approach achieves accurate matte estimation for camera views separated by up to 180°, with the same amount of manual interaction required for conventional single view video matting.
This paper presents a hybrid skeleton-driven surface registration (HSDSR) approach to generate temporally consistent meshes from multiple view video of human subjects. 2D pose detections from multiple view video are used to estimate 3D skeletal pose on a per-frame basis. The 3D pose is embedded into a 3D surface reconstruction allowing any frame to be reposed into the shape from any other frame in the captured sequence. Skeletal motion transfer is performed by selecting a reference frame from the surface reconstruction data and reposing it to match the pose estimation of other frames in a sequence. This allows an initial coarse alignment to be performed prior to refinement by a patch-based non-rigid mesh deformation. The proposed approach overcomes limitations of previous work by reposing a reference mesh to match the pose of a target mesh reconstruction, providing a closer starting point for further non-rigid mesh deformation. It is shown that the proposed approach is able to achieve comparable results to existing model-based and model-free approaches. Finally, it is demonstrated that this framework provides an intuitive way for artists and animators to edit volumetric video.
This paper addresses the problem of human action matching in outdoor sports broadcast environments, by analysing 3D data from a recorded human activity and retrieving the most appropriate proxy action from a motion capture library. Typically pose recognition is carried out using images from a single camera, however this approach is sensitive to occlusions and restricted fields of view, both of which are common in the outdoor sports environment. This paper presents a novel technique for the automatic matching of human activities which operates on the 3D data available in a multi-camera broadcast environment. Shape is retrieved using multi-camera techniques to generate a 3D representation of the scene. Use of 3D data renders the system camera-pose-invariant and allows it to work while cameras are moving and zooming. By comparing the reconstructions to an appropriate 3D library, action matching can be achieved in the presence of significant calibration and matting errors which cause traditional pose detection schemes to fail. An appropriate feature descriptor and distance metric are presented as well as a technique to use these features for key-pose detection and action matching. The technique is then applied to real footage captured at an outdoor sporting event. ©2009 IEEE.
In film production, many post-production tasks require the availability of accurate camera calibration information. This paper presents an algorithm for through-the-lens calibration of a moving camera for a common scenario in film production and broadcasting: The camera views a dynamic scene, which is also viewed by a set of static cameras with known calibration. The proposed method involves the construction of a sparse scene model from the static cameras, with respect to which the moving camera is registered, by applying the appropriate perspective-n-point (PnP) solver. In addition to the general motion case, the algorithm can handle the nodal cameras with unknown focal length via a novel P2P algorithm. The approach can identify a subset of static cameras that are more likely to generate a high number of scene-image correspondences, and can robustly deal with dynamic scenes. Our target applications include dense 3D reconstruction, stereoscopic 3D rendering and 3D scene augmentation, through which the success of the algorithm is demonstrated experimentally.
We address the problem of reliable real-time 3D-tracking of multiple objects which are observed in multiple wide-baseline camera views. Establishing the spatio-temporal correspondence is a problem with combinatorial complexity in the number of objects and views. In addition, vision-based tracking suffers from the ambiguities introduced by occlusion, clutter and irregular 3D motion. We present a discrete relaxation algorithm for reducing the intrinsic combinatorial complexity by pruning the decision tree based on unreliable prior information from independent 2D-tracking for each view. The algorithm improves the reliability of spatio-temporal correspondence by simultaneous optimisation over multiple views in the case where 2D-tracking in one or more views is ambiguous. Application to the 3D reconstruction of human movement, based on tracking of skin-coloured regions in three views, demonstrates considerable improvement in reliability and performance. The results demonstrate that the optimisation over multiple views gives correct 3D reconstruction and object labelling in the presence of incorrect 2D-tracking whilst maintaining real-time performance.
In this paper we introduce a video-based representation for free viewpoint visualization and motion control of 3D character models created from multiple view video sequences of real people. Previous approaches to video-based rendering provide no control of scene dynamics to manipulate, retarget, and create new 3D content from captured scenes. Here we contribute a new approach, combining image based reconstruction and video-based animation to allow controlled animation of people from captured multiple view video sequences. We represent a character as a motion graph of free viewpoint video motions for animation control. We introduce the use of geometry videos to represent reconstructed scenes of people for free viewpoint video rendering. We describe a novel spherical matching algorithm to derive global surface to surface correspondence in spherical geometry images for motion blending and the construction of seamless transitions between motion sequences. Finally, we demonstrate interactive video-based character animation with real-time rendering and free viewpoint visualization. This approach synthesizes highly realistic character animations with dynamic surface shape and appearance captured from multiple view video of people.
We present a method for reconstructing the geometry and appearance of indoor scenes containing dynamic human subjects using a single (optionally moving) RGBD sensor. We introduce a framework for building a representation of the articulated scene geometry as a set of piecewise rigid parts which are tracked and accumulated over time using moving voxel grids containing a signed distance representation. Data association of noisy depth measurements with body parts is achieved by online training of a prior shape model for the specific subject. A novel frame-to-frame model registration is introduced which combines iterative closest-point with additional correspondences from optical flow and prior pose constraints from noisy skeletal tracking data. We quantitatively evaluate the reconstruction and tracking performance of the approach using a synthetic animated scene. We demonstrate that the approach is capable of reconstructing mid-resolution surface models of people from low-resolution noisy data acquired from a consumer RGBD camera. © 2013 IEEE.
4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.
We propose a human performance capture system employing convolutional neural networks (CNN) to estimate human pose from a volumetric representation of a performer derived from multiple view-point video (MVV). We compare direct CNN pose regression to the performance of an affine invariant pose descriptor learned by a CNN through a classification task. A non-linear manifold embedding is learned between the descriptor and articulated pose spaces, enabling regression of pose from the source MVV. The results are evaluated against ground truth pose data captured using a Vicon marker-based system and demonstrate good generalisation over a range of human poses, providing a system that requires no special suit to be worn by the performer.
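The toy PyTorch model below indicates what a volumetric (PVH) pose regressor of this general kind can look like; the layer sizes, grid resolution and joint count are assumptions and it is not the network used in the paper.

```python
import torch
import torch.nn as nn

class PVHPoseNet(nn.Module):
    """Toy 3D CNN regressing joint positions from a volumetric probabilistic
    visual hull (PVH) grid. Layer sizes are illustrative assumptions only."""
    def __init__(self, grid=32, n_joints=21):
        super().__init__()
        self.n_joints = n_joints
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.regress = nn.Linear(64 * (grid // 8) ** 3, n_joints * 3)

    def forward(self, pvh):                 # pvh: (B, 1, D, H, W) occupancy grid
        x = self.features(pvh).flatten(1)
        return self.regress(x).view(-1, self.n_joints, 3)
```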
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4X, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.
We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from a MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.
In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.
The ability to predict the acoustics of a room without acoustical measurements is a useful capability. The motivation here stems from spatial audio reproduction, where knowledge of the acoustics of a space could allow for more accurate reproduction of a captured environment, or for reproduction room compensation techniques to be applied. A cuboid-based room geometry estimation method using a spherical camera is proposed, assuming a room and objects inside can be represented as cuboids aligned to the main axes of the coordinate system. The estimated geometry is used to produce frequency-dependent acoustic predictions based on geometrical room modelling techniques. Results are compared to measurements through calculated reverberant spatial audio object parameters used for reverberation reproduction customized to the given loudspeaker setup.
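A first step of such geometrical acoustic prediction can be sketched with the image-source principle for a cuboid room: mirroring the source in each wall gives the six first-order reflection arrival times. The helper below is a simplified illustration (absorption, higher-order reflections and interior cuboids are ignored).

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def first_order_reflections(room_dims, src, mic):
    """Predict arrival times of the six first-order wall reflections for a
    cuboid (shoebox) room using the image-source principle (a simplified
    sketch of geometrical acoustic prediction).

    room_dims: (Lx, Ly, Lz); src and mic: 3D positions inside the room.
    """
    src, mic = np.asarray(src, float), np.asarray(mic, float)
    times = []
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):
            image = src.copy()
            image[axis] = 2.0 * wall - src[axis]          # mirror source in the wall
            times.append(np.linalg.norm(image - mic) / SPEED_OF_SOUND)
    return sorted(times)
```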
We propose a robust 3D feature description and registration method for 3D models reconstructed from various sensor devices. Existing 3D feature detectors and descriptors generally show low distinctiveness and repeatability for matching between different data modalities due to differences in noise and errors in geometry. The proposed method considers not only local 3D points but also neighbouring 3D keypoints to improve keypoint matching. The proposed method is tested on various multi-modal datasets including LIDAR scans, multiple photos, spherical images and RGBD videos to evaluate the performance against existing methods.
This paper presents a novel volumetric reconstruction technique that combines shape-from-silhouette with stereo photo-consistency in a global optimisation that enforces feature constraints across multiple views. Human shape reconstruction is considered where extended regions of uniform appearance, complex self-occlusions and sparse feature cues represent a challenging problem for conventional reconstruction techniques. A unified approach is introduced to first reconstruct the occluding contours and left-right consistent edge contours in a scene and then incorporate these contour constraints in a global surface optimisation using graph-cuts. The proposed technique maximises photo-consistency on the surface, while satisfying silhouette constraints to provide shape in the presence of uniform surface appearance and edge feature constraints to align key image features across views.
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
The sequential Monte Carlo probability hypothesis density (SMC-PHD) filter has been shown to be promising for audio-visual multi-speaker tracking. Recently, the zero diffusion particle flow (ZPF) has been used to mitigate the weight degeneracy problem in the SMC-PHD filter. However, this leads to a substantial increase in the computational cost due to the migration of particles from prior to posterior distribution with a partial differential equation. This paper proposes an alternative method based on the non-zero diffusion particle flow (NPF) to adjust the particle states by fitting the particle distribution with the posterior probability density using the nonzero diffusion. This property allows efficient computation of the migration of particles. Results from the AV16.3 dataset demonstrate that we can significantly mitigate the weight degeneracy problem with a smaller computational cost as compared with the ZPF based SMC-PHD filter.
Virtual Reality (VR) systems have been intensely explored, with several research communities investigating the different modalities involved. Regarding the audio modality, one of the main issues is the generation of sound that is perceptually coherent with the visual reproduction. Here, we propose a pipeline for creating plausible interactive reverb using visual information: first, we characterize real environment acoustics given a pair of spherical cameras; then, we reproduce reverberant spatial sound, by using the estimated acoustics, within a VR scene. The evaluation is made by extracting the room impulse responses (RIRs) of four virtually rendered rooms. Results show agreement, in terms of objective metrics, between the synthesized acoustics and the ones calculated from RIRs recorded within the respective real rooms.
We describe a novel framework for segmenting a time- and view-coherent foreground matte sequence from synchronised multiple view video. We construct a Markov Random Field (MRF) comprising links between superpixels corresponded across views, and links between superpixels and their constituent pixels. Texture, colour and disparity cues are incorporated to model foreground appearance. We solve using a multi-resolution iterative approach enabling an eight view high definition (HD) frame to be processed in less than a minute. Furthermore we incorporate a temporal diffusion process introducing a prior on the MRF using information propagated from previous frames, and a facility for optional user correction. The result is a set of temporally coherent mattes that are solved for simultaneously across views for each frame, exploiting similarities across views and time.
Typical colour digital cameras have a single sensor with a colour filter array (CFA), each pixel capturing a single channel (red, green or blue). A full RGB colour output image is generated by demosaicing (DM), i.e. interpolating to infer the two unobserved channels for each pixel. The DM approach used can have a significant effect on the quality of the output image, particularly in the presence of common imaging artifacts such as chromatic aberration (CA). Small differences in the focal length for each channel (lateral CA) and the inability of the lens to bring all three channels simultaneously into focus (longitudinal CA) can cause objectionable colour fringing artifacts in edge regions. These artifacts can be particularly severe when using low-cost lenses. We propose to use a set of simple neural networks to learn to jointly perform DM and CA correction, producing high quality colour images subject to severe CA as well as image noise. The proposed neural network-based joint DM and CA correction produces a significant improvement in image quality metrics (PSNR and SSIM) compared to the baseline edge-directed linear interpolation approach, preserving image detail and reducing objectionable false colour and comb artifacts. The approach can be applied in the production of high quality images and video from machine vision cameras with low cost lenses, thus extending the viability of such hardware to visual media production.
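As a rough sketch of the kind of small network involved, the PyTorch model below maps a single-channel CFA input directly to RGB, so demosaicing and CA correction are learned jointly; the architecture and channel counts are assumptions, not the trained networks reported above.

```python
import torch
import torch.nn as nn

class JointDMCANet(nn.Module):
    """Small convolutional network mapping a mosaiced (CFA) image to a full
    RGB image, learning demosaicing and chromatic-aberration correction
    jointly. Architecture and channel counts are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(),    # raw single-channel CFA input
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),               # predict full RGB output
        )

    def forward(self, cfa):                               # cfa: (B, 1, H, W)
        return self.net(cfa)

# Training would minimise e.g. an L1 loss against ground-truth RGB frames:
# loss = torch.nn.functional.l1_loss(model(cfa_batch), rgb_batch)
```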
In immersive and interactive audio-visual content, there is very significant scope for spatial misalignment between the two main modalities. So, in productions that have both 3D video and spatial audio, the positioning of sound sources relative to the visual display requires careful attention. This may be achieved in the form of object-based audio, moreover allowing the producer to maintain control over individual elements within the mix. Yet each object's metadata is needed to define its position over time. In the present study, audio-visual studio recordings were made of short scenes representing three genres: drama, sport and music. Foreground video was captured by a light-field camera array, which incorporated a microphone array, alongside more conventional sound recording by spot microphones and a first-order ambisonic room microphone. In the music scenes, a direct feed from the guitar pickup was also recorded. Video data was analysed to form a 3D reconstruction of the scenes, and human figure detection was applied to the 2D frames of the central camera. Visual estimates of the sound source positions were used to provide ground truth. Position metadata were encoded within audio definition model (ADM) format audio files, suitable for standard object-based rendering. The steered response power of the acoustical signals at the microphone array were used, with the phase transform (SRP-PHAT), to determine the dominant source position(s) at any time, and given as input to a Sequential Monte Carlo Probability Hypothesis Density (SMC-PHD) tracker. The output of this was evaluated in relation to the ground truth. Results indicate a hierarchy of accuracy in azimuth, elevation and range, in accordance with human spatial auditory perception. Azimuth errors were within the tolerance bounds reported by studies of the Ventriloquism Effect, giving an initial promising indication that such an approach may open the door to object-based production for live events.
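The pairwise building block of SRP-PHAT localisation is the generalised cross-correlation with phase transform (GCC-PHAT); a minimal implementation is sketched below (windowing, interpolation and the summation over candidate directions used by SRP-PHAT are omitted).

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Generalised cross-correlation with phase transform (PHAT) between two
    microphone signals. Returns the estimated time difference of arrival in
    seconds (a minimal sketch of the SRP-PHAT building block)."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)       # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / float(fs)
    return tau
```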
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.
Complete scene reconstruction from single view RGBD is a challenging task, requiring estimation of scene regions occluded from the captured depth surface. We propose that scene-centric analysis of human motion within an indoor scene can reveal fully occluded objects and provide functional cues to enhance scene understanding tasks. Captured skeletal joint positions of humans, utilised as naturally exploring active sensors, are projected into a human-scene motion representation. Inherent body occupancy is leveraged to carve a volumetric scene occupancy map initialised from captured depth, revealing a more complete voxel representation of the scene. To obtain a structured box model representation of the scene, we introduce unique terms to an object detection optimisation that overcome depth occlusions whilst deriving from the same depth data. The method is evaluated on challenging indoor scenes with multiple occluding objects such as tables and chairs. Evaluation shows that human-centric scene analysis can be applied to effectively enhance state-of-the-art scene understanding approaches, resulting in a more complete representation than single view depth alone.
A common problem in wide-baseline stereo is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST and MSER. In this paper we introduce a novel segmentation based feature detector SFD that produces an increased number of ‘good’ features for accurate wide-baseline reconstruction. Each image is segmented into regions by over-segmentation and feature points are detected at the intersection of the boundaries for three or more regions. Segmentation-based feature detection locates features at local maxima giving a relatively large number of feature points which are consistently detected across wide-baseline views and accurately localised. A comprehensive comparative performance evaluation with previous feature detection approaches demonstrates that: SFD produces a large number of features with increased scene coverage; detected features are consistent across wide-baseline views for images of a variety of indoor and outdoor scenes; and the number of wide-baseline matches is increased by an order of magnitude compared to alternative detector-descriptor combinations. Sparse scene reconstruction from multiple wide-baseline stereo views using the SFD feature detector demonstrates at least a factor six increase in the number of reconstructed points with reduced error distribution compared to SIFT when evaluated against ground-truth and similar computational cost to SURF/FAST.
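The sketch below conveys the junction idea behind a segmentation-based detector: over-segment the image and keep pixels where three or more region labels meet. It uses SLIC superpixels from scikit-image as a stand-in, so it is only an approximation of the SFD detector described above.

```python
import numpy as np
from skimage.segmentation import slic

def junction_features(image, n_segments=800):
    """Detect candidate feature points where three or more over-segmented
    regions meet (a simplified stand-in for the SFD detector; the paper's
    segmentation method may differ).

    image: (H, W, 3) colour image.
    """
    labels = slic(image, n_segments=n_segments, compactness=10)
    H, W = labels.shape
    keypoints = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            window = labels[y - 1:y + 2, x - 1:x + 2]
            if len(np.unique(window)) >= 3:               # junction of >= 3 regions
                keypoints.append((x, y))
    return np.array(keypoints)
```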
We present a method to continuously blend between multiple facial performances of an actor, which can contain different facial expressions or emotional states. As an example, given sad and angry video takes of a scene, our method empowers the movie director to specify arbitrary weighted combinations and smooth transitions between the two takes in post-production. Our contributions include (1) a robust nonlinear audio-visual synchronization technique that exploits complementary properties of audio and visual cues to automatically determine robust, dense spatiotemporal correspondences between takes, and (2) a seamless facial blending approach that provides the director full control to interpolate timing, facial expression, and local appearance, in order to generate novel performances after filming. In contrast to most previous works, our approach operates entirely in image space, avoiding the need of 3D facial reconstruction. We demonstrate that our method can synthesize visually believable performances with applications in emotion transition, performance correction, and timing control.
In this paper we propose a pipeline for estimating 3D room layout with object and material attribute prediction using a spherical stereo image pair. We assume that the room and objects can be represented as cuboids aligned to the main axes of the room coordinate (Manhattan world). A spherical stereo alignment algorithm is proposed to align two spherical images to the global world coordinate system. Depth information of the scene is estimated by stereo matching between images. Cubic projection images of the spherical RGB and estimated depth are used for object and material attribute detection. A single Convolutional Neural Network is designed to assign object and attribute labels to geometrical elements built from the spherical image. Finally simplified room layout is reconstructed by cuboid fitting. The reconstructed cuboid-based model shows the structure of the scene with object information and material attributes.
Motion capture (mocap) is widely used in a large number of industrial applications. Our work offers a new way of representing the mocap facial dynamics in a high resolution 3D morphable model expression space. A data-driven approach to modelling of facial dynamics is presented. We propose a way to combine high quality static face scans with dynamic 3D mocap data which has lower spatial resolution in order to study the dynamics of facial expressions.
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
This engineering brief reports on the production of 3 object-based audio drama scenes, commissioned as part of the S3A project. 3D reproduction and an object-based workflow were considered and implemented from the initial script commissioning through to the final mix of the scenes. The scenes are being made available as Broadcast Wave Format files containing all objects as separate tracks and all metadata necessary to render the scenes as an XML chunk in the header conforming to the Audio Definition Model specification (Recommendation ITU-R BS.2076 [1]). It is hoped that these scenes will find use in perceptual experiments and in the testing of 3D audio systems. The scenes are available via the following link: http://dx.doi.org/10.17866/rd.salford.3043921.
In this paper we present a method to relight captured 3D video sequences of non-rigid, dynamic scenes, such as clothing of real actors, reconstructed from multiple view video. A view-dependent approach is introduced to refine an initial coarse surface reconstruction using shape-from-shading to estimate detailed surface normals. The prior surface approximation is used to constrain the simultaneous estimation of surface normals and scene illumination, under the assumption of Lambertian surface reflectance. This approach enables detailed surface normals of a moving non-rigid object to be estimated from a single image frame. Refined normal estimates from multiple views are integrated into a single surface normal map. This approach allows highly non-rigid surfaces, such as creases in clothing, to be relit whilst preserving the detailed dynamics observed in video.
In this paper we propose a cuboid-based air-tight indoor room geometry estimation method using a combination of audio and visual sensors. Existing vision-based 3D reconstruction methods are not applicable to scenes with transparent or reflective objects such as windows and mirrors. In this work we fuse multi-modal sensory information to overcome the limitations of purely visual reconstruction for complex scenes including transparent and mirror surfaces. A full scene is captured by 360 cameras, and acoustic room impulse responses (RIRs) are recorded using a loudspeaker and a compact microphone array. Depth information of the scene is recovered by stereo matching from the captured images, and the locations of major acoustic reflectors are estimated from the sound. The coordinate systems of the audio-visual sensors are aligned into a unified reference frame and plane elements are reconstructed from the audio-visual data. Finally, cuboid proxies are fitted to the planes to generate a complete room model. Experimental results show that the proposed system generates complete representations of the room structure regardless of transparent windows, featureless walls and shiny surfaces.
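To illustrate how early reflections in a measured RIR translate into reflector distances, the simplified sketch below assumes the loudspeaker and the microphone array centre are approximately co-located; the peak-picking thresholds and function name are arbitrary, not the system's actual estimator.

    import numpy as np
    from scipy.signal import find_peaks

    def reflector_distances(rir, fs, speed_of_sound=343.0):
        # Locate the direct sound and early reflection peaks in the RIR and
        # convert arrival-time differences to distances: a co-located source
        # and receiver see a round trip of 2d, so d = c * delay / 2.
        envelope = np.abs(rir)
        peaks, _ = find_peaks(envelope, height=0.1 * envelope.max(),
                              distance=int(0.002 * fs))
        direct = peaks[0]                               # direct-sound arrival
        delays = (peaks[1:] - direct) / fs              # seconds after direct sound
        return speed_of_sound * delays / 2.0            # metres to each reflector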
Deformable surface fitting methods have been widely used to establish dense correspondence across different 3D objects of the same class. Dense correspondence is a critical step in constructing morphable face models for face recognition. In this paper a mainstream method for constructing dense correspondences is evaluated on 912 3D face scans from the Face Recognition Grand Challenge (FRGC) v1 database. A number of modifications to the standard deformable surface approach are introduced to overcome limitations identified in the evaluation, including multi-resolution fitting, an adaptive correspondence search range and enforcement of symmetry constraints. The modified deformable surface approach is validated on the 912 FRGC 3D face scans and is shown to overcome limitations of the standard approach which resulted in gross fitting errors. The modified approach halves the rms fitting error, with 98% of points within 0.5mm of their true position compared to 67% for the standard approach.
Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling results in a compact representation with minimal loss of information allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.
A number of systems have been developed for dynamic 3D reconstruction from multiple view videos over the past decade. In this paper we present a system for multiple view reconstruction of dynamic outdoor scenes, transferring studio technology to uncontrolled environments. A synchronised portable multiple camera system is composed of off-the-shelf HD cameras for dynamic scene capture. For foreground extraction, we propose a multi-view trimap propagation method which is robust against dynamic changes in appearance between views and over time. This allows us to apply state-of-the-art natural image matting algorithms to multi-view sequences with minimal interaction. Optimal 3D surfaces of the foreground models are reconstructed by integrating multi-view shape cues and features. For background modelling, we use a line-scan camera with a fish-eye lens to capture the full environment at high resolution. The environment model is reconstructed from a spherical stereo image pair with sub-pixel correspondence. Finally, the foreground and background models are merged into a common 3D world coordinate system and the composite model is rendered from arbitrary viewpoints. We show that the proposed system generates high-quality scene images with dynamic virtual camera actions.
This paper presents a method to estimate alpha mattes for video sequences of the same foreground scene from wide-baseline views given sparse key-frame trimaps in a single view. A statistical inference framework is introduced for spatio-temporal propagation of high-confidence trimap labels between video sequences without a requirement for correspondence or camera calibration and motion estimation. Multiple view trimap propagation integrates appearance information between views and over time to achieve robust labelling in the presence of shadows, changes in appearance with view point and overlap between foreground and background appearance. Results demonstrate that trimaps are sufficiently accurate to allow high-quality video matting using existing single view natural image matting algorithms. Quantitative evaluation against ground-truth demonstrates that the approach achieves accurate matte estimation for camera views separated by up to 180°, with the same amount of manual interaction required for conventional single view video matting.
In this paper we present a novel approach for temporal alignment of reconstructed mesh sequences with non-rigid surfaces to obtain a consistent representation. We propose a hierarchical scheme for non-sequential matching of frames across the sequence using shape similarity. This gives a tree structure which represents the optimal path for alignment of each frame in the sequence to minimize the change in shape. Non-rigid alignment is performed by recursively traversing the tree to align all frames. Non-sequential alignment reduces problems of drift or tracking failure which occur in previous sequential frame-to-frame techniques. Comparative evaluation on challenging 3D video sequences demonstrates that the proposed approach produces a temporally coherent representation with reduced error in shape and correspondence.
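The non-sequential alignment order can be pictured as a traversal of a minimum spanning tree built over pairwise shape dissimilarities between frames; the SciPy sketch below is a minimal illustration with a hypothetical choice of root, not the paper's own scheme.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

    def alignment_tree(dissimilarity):
        # dissimilarity: (N, N) symmetric shape-dissimilarity matrix between frames.
        # The minimum spanning tree links each frame to a similar neighbour with
        # minimal total change in shape; a breadth-first traversal from the root
        # gives the order in which frames are aligned.
        mst = minimum_spanning_tree(dissimilarity).toarray()
        mst = mst + mst.T                                   # make edges undirected
        root = int(np.argmin(dissimilarity.sum(axis=1)))    # most "central" frame
        order, parents = breadth_first_order(mst, i_start=root, directed=False)
        return root, order, parents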
This paper presents a framework for creating realistic virtual characters that can be delivered via the Internet and interactively controlled in a WebGL enabled web-browser. Four-dimensional performance capture is used to capture realistic human motion and appearance. The captured data is processed into efficient and compact representations for geometry and texture. Motions are analysed against a high-level, user-defined motion graph and suitable inter- and intra-motion transitions are identified. This processed data is stored on a webserver and downloaded by a client application when required. A Javascript-based character animation engine is used to manage the state of the character which responds to user input and sends required frames to a WebGL-based renderer for display. Through the efficient geometry, texture and motion graph representations, a game character capable of performing a range of motions can be represented in 40-50 MB of data. This highlights the potential use of four-dimensional performance capture for creating web-based content. Datasets are made available for further research and an online demo is provided.
This paper presents a framework for construction of animated models from captured surface shape of real objects. Algorithms are introduced to transform the captured surface shape into a layered model. The layered model comprises an articulation structure, generic control model and a displacement map to represent the high-resolution surface detail. Novel methods are presented for automatic control model generation, shape constrained fitting and displacement mapping of the captured data. Results are demonstrated for surface shape captured using both multiple view images and active surface measurement. The framework enables rapid transformation of captured data into a structured representation suitable for realistic animation.
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multi-view dense correspondence approaches demonstrates a significant improvement in the accuracy of temporal coherence.
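For readers unfamiliar with EPIs: stacking the same scanline from a rectified, regularly spaced row of views produces an image in which each scene point traces a straight line whose slope encodes its disparity. A minimal sketch of that construction (assuming rectified views; names are illustrative):

    import numpy as np

    def epipolar_plane_image(views, row):
        # views: list of V rectified images (H, W, 3) from one horizontal row of
        # the camera array; row: index of the scanline to slice.
        # The result is a (V, W, 3) EPI.
        return np.stack([img[row] for img in views], axis=0)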
We present a scheme for achieving level of detail and compression for animation sequences with known constant connectivity. Compression is useful both for automatically creating low levels of detail, which may be more compact than the original animation parameters, and for high levels of detail where the original animation is expensive to compute. Our scheme is based on spatial segmentation of a base mesh into rigidly transforming segments followed by temporal aggregation of these transformations. The result approximates the given animation within a user-specified tolerance, which can be adjusted to give the required level of detail. A spatio-temporal smoothing algorithm is used on decoding to give acceptable animations. We show that the rigid transformation basis spans the space of all animations and that the algorithm converges to the specified tolerance. The algorithm is applied to several examples of synthetic animation, and rate-distortion curves show that in some cases the scheme outperforms current compressors.
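The per-segment rigid transforms at the heart of such a scheme can be fitted in closed form. Below is a small sketch of the standard least-squares (Kabsch) fit of one segment in one frame, with hypothetical names; it illustrates the building block rather than the published compressor.

    import numpy as np

    def fit_rigid_transform(base_pts, frame_pts):
        # Least-squares rigid transform mapping a segment of the base mesh
        # (base_pts, shape (K, 3)) onto its vertices in a later frame
        # (frame_pts, shape (K, 3)). Per-segment, per-frame transforms can then
        # replace the raw vertex trajectories.
        mu_b, mu_f = base_pts.mean(0), frame_pts.mean(0)
        H = (base_pts - mu_b).T @ (frame_pts - mu_f)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T                 # rotation (reflection-corrected)
        t = mu_f - R @ mu_b                # translation
        return R, t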
Object-based audio can be used to customize, personalize, and optimize audio reproduction depending on the specific listening scenario. To investigate and exploit the benefits of object-based audio, a framework for intelligent metadata adaptation was developed. The framework uses detailed semantic metadata that describes the audio objects, the loudspeakers, and the room. It features an extensible software tool for real-time metadata adaptation that can incorporate knowledge derived from perceptual tests and/or feedback from perceptual meters to drive adaptation and facilitate optimal rendering. One use case for the system is demonstrated through a rule-set (derived from perceptual tests with experienced mix engineers) for automatic adaptation of object levels and positions when rendering 3D content to two- and five-channel systems.
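To make the idea of rule-based metadata adaptation concrete, the sketch below applies two entirely hypothetical rules when targeting a two-channel system; the real rule-set is derived from the perceptual tests described above and is not reproduced here.

    def adapt_objects_for_stereo(objects):
        # objects: list of dicts with 'azimuth' (deg), 'elevation' (deg), 'gain_db'.
        # Hypothetical rules: fold elevated objects into the horizontal plane and
        # pull rear objects to the sides with a small level reduction.
        adapted = []
        for obj in objects:
            o = dict(obj)
            o['elevation'] = 0.0
            if abs(o['azimuth']) > 90.0:            # behind the listener
                o['azimuth'] = 90.0 if o['azimuth'] > 0 else -90.0
                o['gain_db'] = o.get('gain_db', 0.0) - 3.0
            adapted.append(o)
        return adapted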
The visual hull is widely used as a proxy for novel view synthesis in computer vision. This paper introduces the safe hull, the first visual hull reconstruction technique to produce a surface containing only foreground parts. A theoretical basis underlies this novel approach which, unlike any previous work, can also identify phantom volumes attached to real objects. Using an image-based method, the visual hull is constructed with respect to each real view and used to identify safe zones in the original silhouettes. The safe zones define volumes known to only contain surface corresponding to a real object. The zones are used in a second reconstruction step to produce a surface without phantom volumes. Results demonstrate the effectiveness of this method for improving surface shape and scene realism, and its advantages over heuristic techniques.
In order to maximise the immersion in VR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this work, we propose a pipeline to create plausible interactive audio from a pair of 360 degree cameras.
Conventional stereoscopic video content production requires the use of dedicated stereo camera rigs, which is both costly and lacks flexibility in video editing. In this paper, we propose a novel approach which only requires a small number of standard cameras sparsely located around a scene to automatically convert the monocular inputs into stereoscopic streams. The approach combines a probabilistic spatio-temporal segmentation framework with a state-of-the-art multi-view graph-cut reconstruction algorithm, thus providing full control of the stereoscopic settings at render time. Results with studio sequences of complex human motion demonstrate the suitability of the method for high-quality stereoscopic content generation with minimal user interaction.
In applications such as virtual and augmented reality, a plausible and coherent audio-visual reproduction can be achieved by deeply understanding the acoustics of the reference scene. This requires knowledge of the scene geometry and the related materials. In this paper, we present an audio-visual approach to acoustic scene understanding. We propose a novel material recognition algorithm that exploits information carried by acoustic signals, using acoustic absorption coefficients as features. The training dataset was constructed by combining information available in the literature with additional labeled data that we recorded in a small room with a short reverberation time (RT60). Classic machine learning methods are used to validate the model, employing data recorded in five rooms of different sizes and RT60s. The estimated materials are used to label room boundaries reconstructed by a vision-based method. Results show 89% and 80% agreement between the estimated and reference room volumes and materials, respectively.
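As a rough illustration of classifying materials from absorption-coefficient features, the sketch below trains an off-the-shelf classifier on octave-band absorption vectors; the coefficient values, labels and choice of classifier are placeholders, not the data or method used in the paper.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Each boundary is described by estimated absorption coefficients in octave
    # bands (125 Hz to 4 kHz here); a classifier maps the vector to a material.
    X_train = np.array([[0.01, 0.01, 0.02, 0.02, 0.02, 0.03],   # glass-like (placeholder)
                        [0.10, 0.25, 0.55, 0.70, 0.70, 0.65]])  # carpet-like (placeholder)
    y_train = ['glass', 'carpet']

    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    estimated_material = clf.predict([[0.02, 0.02, 0.03, 0.03, 0.02, 0.02]])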
This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved nonrigid object segmentation and shape reconstruction.
We present a new approach to reflectance estimation for dynamic scenes. Non-parametric image statistics are used to transfer reflectance properties from a static example set to a dynamic image sequence. The approach allows reflectance estimation for surface materials with inhomogeneous appearance, such as those which commonly occur with patterned or textured clothing. Material reflectance properties are initially estimated from static images of the subject under multiple directional illuminations using photometric stereo. The estimated reflectance together with the corresponding image under uniform ambient illumination form a prior set of reference material observations. Material reflectance properties are then estimated for video sequences of a moving person captured under uniform ambient illumination by matching the observed local image statistics to the reference observations. Results demonstrate that the transfer of reflectance properties enables estimation of the dynamic surface normals and subsequent relighting. This approach overcomes limitations of previous work on material transfer and relighting of dynamic scenes which was limited to surfaces with regions of homogeneous reflectance. We evaluate for relighting 3D model sequences reconstructed from multiple view video. Comparison to previous model relighting demonstrates improved reproduction of detailed texture and shape dynamics.
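For the photometric-stereo step, the classical Lambertian formulation solves a small linear system per pixel: with image intensity I = albedo * max(0, n·l) under known light directions l, the albedo-scaled normal follows by least squares. The sketch below shows that standard computation (names and the grayscale assumption are illustrative, not the paper's code).

    import numpy as np

    def photometric_stereo(images, light_dirs):
        # images: (N, H, W) grayscale captures under N known directional lights.
        # light_dirs: (N, 3) unit lighting directions.
        N, H, W = images.shape
        I = images.reshape(N, -1)                            # (N, H*W)
        G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)   # (3, H*W) albedo-scaled normals
        albedo = np.linalg.norm(G, axis=0)
        normals = (G / (albedo + 1e-8)).T.reshape(H, W, 3)
        return normals, albedo.reshape(H, W)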
This paper presents a novel system for the 3D capture of facial performance using standard video and lighting equipment. The mesh of an actor's face is tracked non-sequentially throughout a performance using multi-view image sequences. The minimum spanning tree calculated in expression dissimilarity space defines the traversal of the sequences optimal with respect to error accumulation. A robust patch-based frame-to-frame surface alignment combined with the optimal traversal significantly reduces drift compared to previous sequential techniques. Multi-path temporal fusion resolves inconsistencies between different alignment paths and yields a final mesh sequence which is temporally consistent. The surface tracking framework is coupled with photometric stereo using colour lights which captures metrically correct skin geometry. High-detail UV normal maps corrected for shadow and bias artefacts augment the temporally consistent mesh sequence. Evaluation on challenging performances by several actors demonstrates the acquisition of subtle skin dynamics and minimal drift over long sequences. A quantitative comparison to a state-of-the-art system shows similar quality of temporal alignment.
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
The rise of autonomous machines in our day-to-day lives has led to an increasing demand for machine perception of the real world to be more robust, accurate and human-like. Research in visual scene understanding over the past two decades has focused on machine perception in controlled environments, such as indoor scenes with static, rigid objects. There is a gap in the literature for machine perception in general complex scenes (outdoor, with multiple interacting people). The proposed research addresses the limitations of existing methods by proposing an unsupervised framework to simultaneously model, semantically segment and estimate motion for general dynamic scenes captured from multiple view videos with a network of static or moving cameras. In this talk I will explain the proposed joint framework for understanding general dynamic scenes for machine perception; give a comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences; and demonstrate applications such as virtual, augmented and mixed reality (VR/AR/MR) and broadcast production (free-viewpoint video, FVV).