We propose a novel video inpainting algorithm that simultaneously hallucinates missing appearance and motion (optical flow) information, building upon the recent 'Deep Image Prior' (DIP) that exploits convolutional network architectures to enforce plausible texture in static images. In extending DIP to video we make two important contributions. First, we show that coherent video inpainting is possible without a priori training. We take a generative approach to inpainting based on internal (within-video) learning without reliance upon an external corpus of visual data to train a one-size-fits-all model for the large space of general videos. Second, we show that such a framework can jointly generate both appearance and flow, whilst exploiting these complementary modalities to ensure mutual consistency. We show that leveraging appearance statistics specific to each video achieves visually plausible results whilst handling the challenging problem of long-term consistency.
We present a novel method for generating robust adversarial image examples building upon the recent ‘deep image prior’ (DIP) that exploits convolutional network architectures to enforce plausible texture in image synthesis. Adversarial images are commonly generated by perturbing images to introduce high frequency noise that induces image misclassification, but that is fragile to subsequent digital manipulation of the image. We show that using DIP to reconstruct an image under adversarial constraint induces perturbations that are more robust to affine deformation, whilst remaining visually imperceptible. Furthermore we show that our DIP approach can also be adapted to produce local adversarial patches (‘adversarial stickers’). We demonstrate robust adversarial examples over a broad gamut of images and object classes drawn from the ImageNet dataset.
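The conventional high-frequency perturbation attack that this work contrasts against can be sketched with a fast-gradient-sign step; the toy linear classifier, the function names and the epsilon value below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def fgsm_perturb(x, w, b, true_label, eps=0.1):
    """Fast-gradient-sign perturbation of input x against a toy linear
    softmax classifier (weights w: [C, D], bias b: [C]).
    Returns x + eps * sign(dL/dx), which increases the cross-entropy loss."""
    logits = w @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Gradient of cross-entropy w.r.t. logits is (softmax - one_hot).
    grad_logits = p.copy()
    grad_logits[true_label] -= 1.0
    grad_x = w.T @ grad_logits          # chain rule back to the input
    return x + eps * np.sign(grad_x)
```

For a linear model the loss is convex in the input, so stepping along the gradient sign is guaranteed not to decrease it; such perturbations are exactly the fragile high-frequency noise the DIP reconstruction is designed to avoid.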
The deluge of visual content on the Internet - from user-generated content to commercial image collections - motivates intuitive new methods for searching digital image content: how can we find certain images in a database of millions? Sketch-based image retrieval (SBIR) is an emerging research topic in which a free-hand drawing can be used to visually query photographic images. SBIR is aligned to emerging trends for visual content consumption on mobile touch-screen based devices, for which gestural interactions such as sketch are a natural alternative to textual input. This thesis presents several contributions to the literature of SBIR. First, we propose a cross-domain learning framework that maps both sketches and images into a joint embedding space invariant to depictive style, while preserving semantics. The resulting embedding enables direct comparison and search between sketches and images and is based upon a multi-branch convolutional neural network (CNN) trained using unique parameter sharing and training schemes. The deeply learned embedding is shown to yield state-of-the-art retrieval performance on several SBIR benchmarks. Second, under two separate works we propose to disambiguate sketched queries by combining sketched shape with a secondary modality: SBIR with colour and with aesthetic context. The former enables querying with coloured line-art sketches. Colour and shape features are extracted locally using a modified version of gradient field orientation histogram (GF-HoG) before being globally pooled using dictionary learning. Various colour-shape fusion strategies are explored, coupled with an efficient indexing scheme for fast retrieval performance. The latter supports querying using a sketched shape accompanied by one or several images serving as an aesthetic constraint governing the visual style of search results.
We propose to model structure and style separately, disentangling one modality from the other; then learn structure-style fusion using a hierarchical triplet network. This method enables further studies beyond SBIR such as style blending, style analogy and retrieval with alternative-modal queries. Third, we explore mid-grain SBIR -- a novel field requiring retrieved images to match both category and key visual characteristics of the sketch without demanding fine-grain, instance-level matching of a specific object instance. We study a semi-supervised approach that requires mainly class-labelled sketches and images plus a small number of instance-labelled sketch-image pairs. This approach involves aligning sketch and image embeddings before pooling them into clusters from which mid-grain similarity may be measured. Our learned model demonstrates not only intra-category discrimination (mid-grain) but also improved inter-category discrimination (coarse-grain) on a newly created MidGrain65c dataset.
We present ‘Screen codes’ - a space- and time-efficient, aesthetically compelling method for transferring data from a display to a camera-equipped mobile device. Screen codes encode data as a grid of luminosity fluctuations within an arbitrary image, displayed on the video screen and decoded on a mobile device. These ‘twinkling’ images are a form of ‘visual hyperlink’, by which users can move dynamically generated content to and from their mobile devices. They help bridge the ‘content divide’ between mobile and fixed computing.
We present ARCHANGEL; a decentralised platform for ensuring the long-term integrity of digital documents stored within public archives. Document integrity is fundamental to public trust in archives. Yet currently that trust is built upon institutional reputation --- trust at face value in a centralised authority, like a national government archive or university. ARCHANGEL proposes a shift to a technological underscoring of that trust, using distributed ledger technology (DLT) to cryptographically guarantee the provenance, immutability and so the integrity of archived documents. We describe the ARCHANGEL architecture, and report on a prototype of that architecture built over the Ethereum infrastructure. We report early evaluation and feedback on ARCHANGEL from stakeholders in the research data archives space.
Falling hardware costs have prompted an explosion in casual video capture by domestic users. Yet, this video is infrequently accessed post-capture and often lies dormant on users’ PCs. We present a system to breathe life into home video repositories, drawing upon artistic stylization to create a “Digital Ambient Display” that automatically selects, stylizes and transitions between videos in a semantically meaningful sequence. We present a novel algorithm based on multi-label graph cut for segmenting video into temporally coherent region maps. These maps are used to both stylize video into cartoons and paintings, and measure visual similarity between frames for smooth sequence transitions. We demonstrate coherent segmentation and stylization over a variety of home videos.
We present an image retrieval system driven by free-hand sketched queries depicting shape. We introduce Gradient Field HoG (GF-HOG) as a depiction invariant image descriptor, encapsulating local spatial structure in the sketch and facilitating efficient codebook based retrieval. We show improved retrieval accuracy over 3 leading descriptors (Self Similarity, SIFT, HoG) across two datasets (Flickr160, ETHZ extended objects), and explain how GF-HOG can be combined with RANSAC to localize sketched objects within relevant images. We also demonstrate a prototype sketch driven photo montage application based on our system.
We present a novel video retrieval system that accepts annotated free-hand sketches as queries. Existing sketch based video retrieval (SBVR) systems enable the appearance and movements of objects to be searched naturally through pictorial representations. Whilst visually expressive, such systems present an imprecise vehicle for conveying the semantics (e.g. object types) within a scene. Our contribution is to fuse the semantic richness of text with the expressivity of sketch, to create a hybrid `semantic sketch' based video retrieval system. Trajectory extraction and clustering are applied to pre-process each clip into a video object representation that we augment with object classification and colour information. The result is a system capable of searching videos based on the desired colour, motion path, and semantic labels of the objects present. We evaluate the performance of our system over the TSF dataset of broadcast sports footage.
4D human performance capture aims to create volumetric representations of observed human subjects performing arbitrary motions with the ability to replay and render dynamic scenes with the realism of the recorded video. This representation has the potential to enable highly realistic content production for immersive virtual and augmented reality experiences. Human models are typically rendered using detailed, explicit 3D models, which consist of meshes and textures, and animated using tailored motion models to simulate human behaviour and activity. However, designing a realistic 3D human model is still a costly and laborious process. Hence, this work investigates techniques to learn models of human body shape and appearance, aiming to facilitate the generation of highly realistic human animation, and demonstrate its potential contributions, applications, and versatility. The first contribution of this work is a skeleton driven surface registration approach to generate temporally consistent meshes from multi-view video of human subjects. 2D pose detections from multi-view video are used to estimate 3D skeletal pose on a per-frame basis, which allows a reference frame to match the pose estimation of other frames in a sequence. This allows an initial coarse alignment followed by a patch-based non-rigid mesh deformation to generate temporally consistent mesh sequences. The second contribution presents techniques to represent human-like shape using a compressed learnt model from 4D volumetric performance capture data. Sequences of 4D dynamic geometry representing a human are encoded with a generative network into a compact space representation, whilst maintaining the original properties, such as surface non-rigid deformations. This compact representation enables synthesis, interpolation and generation of 3D shapes. 
The third contribution is Deep4D, a generative network capable of compact representation of 4D volumetric video sequences from skeletal motion of people, with two orders of magnitude compression. A variational encoder-decoder is employed to learn an encoded latent space that maps from 3D skeletal pose to 4D shape and appearance. This enables high-quality 4D volumetric video synthesis to be driven by skeletal animation. Finally, this thesis introduces the Deep4D motion graph to implicitly combine multiple captured motions in a unified representation for character animation from volumetric video, allowing novel character movements to be generated with dynamic shape and appearance detail. Deep4D motion graphs allow character animation to be driven by skeletal motion sequences, providing a compact encoded representation capable of high-quality synthesis of the 4D volumetric video with two orders of magnitude compression.
Performance capture is used extensively within the creative industries to efficiently produce high quality, realistic character animation in movies and video games. Existing commercial systems for performance capture are limited to working within constrained environments, requiring wearable visual markers or suits, and frequently specialised imaging devices (e.g. infra-red cameras) both of which limit deployment scenarios (e.g. indoor capture). This thesis explores novel methods to relax these constraints, applying machine learning techniques to estimate human pose using regular video cameras and without the requirement of visible markers on the performer. This unlocks the potential for co-production of principal footage and performance capture data, leading to production efficiencies. For example, using an array of static witness cameras deployed on-set, performance capture data for a video games character accompanying a major movie franchise might be captured at the same time the movie is shot. The need to call the actor for a second day of shooting in a specialised motion capture (mo-cap) facility is avoided, saving time and money, since performance capture was possible without corrupting the principal movie footage with markers or constraining set design. Furthermore, if such performance capture data is available in real-time, the director may immediately pre-visualize the look and feel of the final character animation enabling tighter capture iteration and improved creative direction. This further enhances the potential for production efficiencies. The core technical contributions of this thesis are novel software algorithms that leverage machine learning to fuse data from multiple sensors – synchronised video cameras, and in some cases, inertial measurement units (IMUs) – in order to robustly estimate human body pose over time, doing so at real-time or near real-time rates.
Firstly, a hardware-accelerated capture solution is developed for acquiring coarse volumetric occupancy data from multiple viewpoint video footage, in the form of a probabilistic visual hull (PVH). Using CUDA-based GPU acceleration the PVH may be estimated in real-time, and subsequently used to train machine learning algorithms to infer human skeletal pose from PVH data. Initially a variety of machine learning approaches for skeletal joint pose estimation are explored, contrasting classical and deep inference methods. By quantizing volumetric data into a two-dimensional (2D) spherical histogram representation it is shown that convolutional neural network (CNN) architectures used traditionally for object recognition may be re-purposed for skeletal joint estimation given a suitable training methodology and data augmentation strategy. The generalization of such architectures to a fully volumetric (3D) CNN is explored, achieving state of the art performance at human pose estimation using a volumetric auto-encoder (hour-glass) architecture that emulates networks traditionally used for de-noising and super-resolution (up-scaling) of 2D data. A framework is developed that is capable of simultaneously estimating human pose from volumetric data, whilst also up-scaling that volumetric data to enable fine-grain estimation of surface detail given a deeply learned prior from previous performance. The method is shown to generalise well even when that prior is learned across different subjects, performing different movements even in different studio camera configurations. Performance can be further improved using a learned temporal model of data, and through the fusion of complementary sensor modalities – video and IMUs – to enhance the accuracy of human pose estimation inferred from a volumetric CNN. Although IMUs have been applied in the performance capture domain for many years, they are prone to drift limiting their use to short capture sequences.
The novel fusion of IMU with video data enables improved global localization and so reduced error over time whilst simultaneously mitigating the issues of limb inter-occlusion that can frustrate video-only approaches.
Trumble Matthew, Gilbert Andrew, Malleson Charles, Hilton Adrian, Collomosse John (2020) Total Capture
University of Surrey
Ren Reede, Collomosse John, Jose Joemon (2011) A BOVW Based Query Generative Model, In: Proc. International Conference on MultiMedia Modelling (MMM), Lecture Notes in Computer Science (LNCS) 6523, pp. 118-128
Bag-of-visual words (BOVW) is a local feature based framework for content-based image and video retrieval. Its performance relies on the discriminative power of the visual vocabulary, i.e. the cluster set on local features. However, the optimisation of the visual vocabulary is of a high complexity in a large collection. This paper aims to relax such a dependence by adapting the query generative model to BOVW based retrieval. Local features are directly projected onto latent content topics to create effective visual queries; visual word distributions are learnt around local features to estimate the contribution of a visual word to a query topic; the relevance is justified by considering concept distributions on visual words as well as on local features. Extensive experiments are carried out on the TRECVid 2009 collection. The notable improvement in retrieval performance shows that this probabilistic framework alleviates the problem of visual ambiguity and can accommodate a visual vocabulary with relatively low discriminative power.
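The BOVW quantisation step underlying this framework (hard assignment of local features to the nearest vocabulary centre, pooled into a histogram) can be sketched as follows; the function name and array shapes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def bovw_histogram(features, vocabulary):
    """Quantise local features (N x D) against a visual vocabulary of
    K cluster centres (K x D) and return an L1-normalised
    bag-of-visual-words histogram of length K."""
    # Squared Euclidean distance from every feature to every centre.
    d = ((features[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d.argmin(axis=1)                          # hard assignment
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The query generative model described above replaces the discrete assignment with learnt visual word distributions around each feature, relaxing the dependence on a highly discriminative vocabulary.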
Gilbert Andrew, Volino Marco, Collomosse John, Hilton Adrian (2018) Volumetric performance capture from minimal camera viewpoints, In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds.), Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science 11215, pp. 591-607
Springer Science+Business Media
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
We present a novel hybrid representation for character animation from 4D Performance Capture (4DPC) data which combines skeletal control with surface motion graphs. 4DPC data are temporally aligned 3D mesh sequence reconstructions of the dynamic surface shape and associated appearance from multiple view video. The hybrid representation supports the production of novel surface sequences which satisfy constraints from user specified key-frames or a target skeletal motion. Motion graph path optimisation concatenates fragments of 4DPC data to satisfy the constraints whilst maintaining plausible surface motion at transitions between sequences. Spacetime editing of the mesh sequence using a learnt part-based Laplacian surface deformation model is performed to match the target skeletal motion and transition between sequences. The approach is quantitatively evaluated for three 4DPC datasets with a variety of clothing styles. Results for key-frame animation demonstrate production of novel sequences which satisfy constraints on timing and position of less than 1% of the sequence duration and path length. Evaluation of motion capture driven animation over a corpus of 130 sequences shows that the synthesised motion accurately matches the target skeletal motion. The combination of skeletal control with the surface motion graph extends the range and style of motion which can be produced whilst maintaining the natural dynamics of shape and appearance from the captured performance.
We present a novel steganographic technique for concealing information within an image. Uniquely we explore the practicality of hiding, and recovering, data within the pattern of brush strokes generated by a non-photorealistic rendering (NPR) algorithm. We incorporate a linear binary coding (LBC) error correction scheme over a raw data channel established through the local statistics of NPR stroke orientations. This enables us to deliver a robust channel for conveying short (e.g. 30 character) strings covertly within a painterly rendering. We evaluate over a variety of painterly renderings, parameter settings and message lengths.
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4X, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.
Content-aware image completion or in-painting is a fundamental tool for the correction of defects or removal of objects in images. We propose a non-parametric in-painting algorithm that enforces both structural and aesthetic (style) consistency within the resulting image. Our contributions are two-fold: 1) we explicitly disentangle image structure and style during patch search and selection to ensure a visually consistent look and feel within the target image. 2) we perform adaptive stylization of patches to conform the aesthetics of selected patches to the target image, so harmonizing the integration of selected patches into the final composition. We show that explicit consideration of visual style during in-painting delivers excellent qualitative and quantitative results across the varied image styles and content, over the Places2 scene photographic dataset and a challenging new in-painting dataset of artwork derived from BAM!
We present an efficient representation for sketch based image retrieval (SBIR) derived from a triplet loss convolutional neural network (CNN). We treat SBIR as a cross-domain modelling problem, in which a depiction invariant embedding of sketch and photo data is learned by regression over a siamese CNN architecture with half-shared weights and modified triplet loss function. Uniquely, we demonstrate the ability of our learned image descriptor to generalise beyond the categories of object present in our training data, forming a basis for general cross-category SBIR. We explore appropriate strategies for training, and for deriving a compact image descriptor from the learned representation suitable for indexing data on resource constrained e.g. mobile devices. We show the learned descriptors to outperform state of the art SBIR on the de facto standard Flickr15k dataset using a significantly more compact (56 bits per image, i.e. ≈105KB total) search index than previous methods.
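A minimal sketch of the margin-based triplet loss underpinning such cross-domain embeddings (the margin value and function names are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: pull the anchor (e.g. a sketch embedding)
    towards the positive (a matching photo embedding) and push it away from
    the negative (a non-matching photo) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)    # hinge on the distance gap
```

The loss is zero once the positive is closer than the negative by the margin, so training only reshapes the embedding where sketch-photo pairs are still confusable.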
We describe a novel framework for segmenting a time- and view-coherent foreground matte sequence from synchronised multiple view video. We construct a Markov Random Field (MRF) comprising links between superpixels corresponded across views, and links between superpixels and their constituent pixels. Texture, colour and disparity cues are incorporated to model foreground appearance. We solve using a multi-resolution iterative approach enabling an eight view high definition (HD) frame to be processed in less than a minute. Furthermore we incorporate a temporal diffusion process introducing a prior on the MRF using information propagated from previous frames, and a facility for optional user correction. The result is a set of temporally coherent mattes that are solved for simultaneously across views for each frame, exploiting similarities across views and time.
LiveSketch is a novel algorithm for searching large image collections using hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch search by creating visual suggestions that augment the query as it is drawn, making query specification an iterative rather than one-shot process that helps disambiguate users' search intent. Our technical contributions are: a triplet convnet architecture that incorporates an RNN based variational autoencoder to search for images using vector (stroke-based) queries; real-time clustering to identify likely search intents (and so, targets within the search embedding); and the use of backpropagation from those targets to perturb the input stroke sequence, so suggesting alterations to the query in order to guide the search. We show improvements in accuracy and time-to-task over contemporary baselines using a 67M image corpus.
This guide's cutting-edge coverage explains the full spectrum of NPR techniques used in photography, TV and film.
We describe a novel algorithm for visually identifying the font used in a scanned printed document. Our algorithm requires no pre-recognition of characters in the string (i.e. optical character recognition). Gradient orientation features are collected local to the character boundaries, and quantized into a hierarchical Bag of Visual Words representation. Following stop-word analysis, classification via logistic regression (LR) of the codebooked features yields per-character probabilities which are combined across the string to decide the posterior for each font. We achieve 93.4% accuracy over a 1000 font database of scanned printed text comprising Latin characters.
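The step combining per-character probabilities into a string-level posterior can be sketched as below, assuming characters are treated as conditionally independent (the function name and clipping constant are illustrative, not the paper's code):

```python
import numpy as np

def string_font_posterior(char_probs):
    """Combine per-character font probabilities (N_chars x N_fonts),
    e.g. from a logistic regression classifier, into one posterior over
    fonts for the whole string by summing log-probabilities."""
    log_p = np.log(np.clip(char_probs, 1e-12, 1.0)).sum(axis=0)
    p = np.exp(log_p - log_p.max())   # subtract max for numerical stability
    return p / p.sum()                # renormalise to a distribution
```

Summing in log space avoids underflow for long strings, and renormalising yields a comparable posterior regardless of string length.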
We present a novel algorithm for the semantic labeling of photographs shared via social media. Such imagery is diverse, exhibiting high intra-class variation that demands large training data volumes to learn representative classifiers. Unfortunately image annotation at scale is noisy resulting in errors in the training corpus that confound classifier accuracy. We show how evolutionary algorithms may be applied to select a ’purified’ subset of the training corpus to optimize classifier performance. We demonstrate our approach over a variety of image descriptors (including deeply learned features) and support vector machines.
Benard P, Thollot J, Collomosse JP (2013) Temporally Coherent Video Stylization, In: Collomosse J, Rosin P (eds.), Image and video based artistic stylization, 42, pp. 257-284
We present an algorithm for visually searching image collections using free-hand sketched queries. Prior sketch based image retrieval (SBIR) algorithms adopt either a category-level or fine-grain (instance-level) definition of cross-domain similarity—returning images that match the sketched object class (category-level SBIR), or a specific instance of that object (fine-grain SBIR). In this paper we take the middle-ground; proposing an SBIR algorithm that returns images sharing both the object category and key visual characteristics of the sketched query without assuming photo-approximate sketches from the user. We describe a deeply learned cross-domain embedding in which ‘mid-grain’ sketch-image similarity may be measured, reporting on the efficacy of unsupervised and semi-supervised manifold alignment techniques to encourage better intra-category (mid-grain) discrimination within that embedding. We propose a new mid-grain sketch-image dataset (MidGrain65c) and demonstrate not only mid-grain discrimination, but also improved category-level discrimination using our approach.
We propose a novel measure of visual similarity for image retrieval that incorporates both structural and aesthetic (style) constraints. Our algorithm accepts a query as sketched shape, and a set of one or more contextual images specifying the desired visual aesthetic. A triplet network is used to learn a feature embedding capable of measuring style similarity independent of structure, delivering significant gains over previous networks for style discrimination. We incorporate this model within a hierarchical triplet network to unify and learn a joint space from two discriminatively trained streams for style and structure. We demonstrate that this space enables, for the first time, style-constrained sketch search over a diverse domain of digital artwork comprising graphics, paintings and drawings. We also briefly explore alternative query modalities.
We describe a non-parametric algorithm for multiple-viewpoint video inpainting. Uniquely, our algorithm addresses the domain of wide baseline multiple-viewpoint video (MVV) with no temporal look-ahead, at near real-time speed. A Dictionary of Patches (DoP) is built using multi-resolution texture patches reprojected from geometric proxies available in the alternate views. We dynamically update the DoP over time, and a Markov Random Field optimisation over depth and appearance is used to resolve and align a selection of multiple candidates for a given patch; this ensures the inpainting of large regions in a plausible manner, conserving both spatial and temporal coherence. We demonstrate the removal of large objects (e.g. people) on challenging indoor and outdoor MVV exhibiting cluttered, dynamic backgrounds and moving cameras.
Vanderhaeghe D, Collomosse JP (2013) Stroke based painterly rendering, In: Collomosse J, Rosin P (eds.), Image and video based artistic stylization, 42, pp. 3-22
Deep Learning methods are currently the state-of-the-art in many Computer Vision and Image Processing problems, in particular image classification. After years of intensive investigation, a few models matured and became important tools, including Convolutional Neural Networks (CNNs), Siamese and Triplet Networks, Auto-Encoders (AEs) and Generative Adversarial Networks (GANs). The field is fast-paced and there is a lot of terminology to catch up on for those who want to venture into Deep Learning waters. This paper aims to introduce the most fundamental concepts of Deep Learning for Computer Vision, in particular CNNs, AEs and GANs, including architectures, inner workings and optimization. We offer an updated description of the theoretical and practical knowledge of working with those models. After that, we describe Siamese and Triplet Networks, not often covered in tutorial papers, as well as review the literature on recent and exciting topics such as visual stylization, pixel-wise prediction and video processing. Finally, we discuss the limitations of Deep Learning for Computer Vision.
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed with state of the art performance reported on the popular Human 3.6M dataset, the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
Sketchformer is a novel transformer-based representation for encoding free-hand sketches input in a vector form, i.e. as a sequence of strokes. Sketchformer effectively addresses multiple tasks: sketch classification, sketch based image retrieval (SBIR), and the reconstruction and interpolation of sketches. We report several variants exploring continuous and tokenized input representations, and contrast their performance. Our learned embedding, driven by a dictionary learning tokenization scheme, yields state of the art performance in classification and image retrieval tasks, when compared against baseline representations driven by LSTM sequence to sequence architectures: SketchRNN and derivatives. We show that sketch reconstruction and interpolation are improved significantly by the Sketchformer embedding for complex sketches with longer stroke sequences.
We present a scalable system for sketch-based image retrieval (SBIR), extending the state-of-the-art Gradient Field HoG (GF-HoG) retrieval framework through two technical contributions. First, we extend GF-HoG to enable color-shape retrieval and comprehensively evaluate several early- and late-fusion approaches for integrating the modality of color, considering both the accuracy and speed of sketch retrieval. Second, we propose an efficient inverse-index representation for GF-HoG that delivers scalable search with interactive query times over millions of images. A mobile app demo accompanies this paper (Android).
Bui Tu, Collomosse John, Bell Mark, Green Alex, Sheridan John, Higgins Jez, Das Arindra, Keller Jared, Thereaux Olivier, Brown Alan (2019) ARCHANGEL: Tamper-Proofing Video Archives Using Temporal Content Hashes on the Blockchain, In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) Workshops
Institute of Electrical and Electronics Engineers (IEEE)
We present ARCHANGEL, a novel distributed ledger based system for assuring the long-term integrity of digital video archives. First, we describe a novel deep network architecture for computing compact temporal content hashes (TCHs) from audio-visual streams with durations of minutes or hours. Our TCHs are sensitive to accidental or malicious content modification (tampering) but invariant to the codec used to encode the video. This is necessary due to the curatorial requirement for archives to format shift video over time to ensure future accessibility. Second, we describe how the TCHs (and the models used to derive them) are secured via a proof-of-authority blockchain distributed across multiple independent archives. We report on the efficacy of ARCHANGEL within the context of a trial deployment in which the national government archives of the United Kingdom, Estonia and Norway participated.
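The ledger side of the design can be illustrated with a toy hash chain: per-segment content hashes are folded into chained block hashes, so tampering with any segment invalidates every later entry. This is a minimal sketch only; the real system derives codec-invariant TCHs from a deep network, whereas plain SHA-256 over raw bytes here is a stand-in and is not codec invariant.

```python
import hashlib

def chain_hashes(segment_hashes):
    """Fold per-segment content hashes into a chained ledger of block hashes.

    Each block hash commits to the previous block, so modifying one
    segment changes that block and all subsequent blocks.
    """
    blocks, prev = [], b"genesis"
    for h in segment_hashes:
        block = hashlib.sha256(prev + h).hexdigest()
        blocks.append(block)
        prev = bytes.fromhex(block)
    return blocks

# Hypothetical video segments; in practice these would be TCH outputs.
segments = [b"segment-0", b"segment-1", b"segment-2"]
ledger = chain_hashes(hashlib.sha256(s).digest() for s in segments)
tampered = chain_hashes(
    hashlib.sha256(s).digest()
    for s in [b"segment-0", b"segment-X", b"segment-2"])

print(ledger[0] == tampered[0], ledger[2] == tampered[2])  # True False
```

Note how the final block differs even though the final segment is untouched: the chaining propagates the mid-stream modification forward, which is what makes retrospective tampering detectable.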
We present a new incremental learning framework for real-time object recognition in video streams. ImageNet is used to bootstrap a set of one-vs-all incrementally trainable SVMs which are updated by user annotation events during streaming. We adopt an inductive transfer learning (ITL) approach to warp the video feature space to the ImageNet feature space, so enabling the incremental updates. Uniquely, the transformation used for the ITL warp is also learned incrementally using the same update events. We demonstrate a semi-automated video logging (SAVL) system using our incrementally learned ITL approach and show this to outperform an existing SAVL system that uses non-incremental transfer learning.
Bui Tu, Cooper Daniel, Collomosse John, Bell Mark, Green Alex, Sheridan John, Higgins Jez, Das Arindra, Keller Jared Robert, Thereaux Olivier, Tamper-proofing Video with Hierarchical Attention Autoencoder Hashing on Blockchain, In: IEEE Transactions on Multimedia
Institute of Electrical and Electronics Engineers
We present ARCHANGEL, a novel distributed ledger based system for assuring the long-term integrity of digital video archives. First, we introduce a novel deep network architecture using a hierarchical attention autoencoder (HAAE) to compute temporal content hashes (TCHs) from audio-visual streams lasting minutes or hours. Our TCHs are sensitive to accidental or malicious content modification (tampering). The focus of our self-supervised HAAE is to guard against content modification such as frame truncation or corruption, but to ensure invariance against format shift (i.e. codec change). This is necessary due to the curatorial requirement for archives to format shift video over time to ensure future accessibility. Second, we describe how the TCHs (and the models used to derive them) are secured via a proof-of-authority blockchain distributed across multiple independent archives. We report on the efficacy of ARCHANGEL within the context of a trial deployment in which the national government archives of the United Kingdom, United States of America, Estonia, Australia and Norway participated.
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.
Keitler P, Pankratz F, Schwerdtfeger B, Pustka D, Rodiger W, Klinker G, Rauch C, Chathoth A, Collomosse J, Song Y (2009) Mobile Augmented Reality based 3D Snapshots, In: Proc. GI-Workshop on VR/AR
We propose and evaluate several deep network architectures for measuring the similarity between sketches and photographs, within the context of the sketch based image retrieval (SBIR) task. We study the ability of our networks to generalize across diverse object categories from limited training data, and explore in detail strategies for weight sharing, pre-processing, data augmentation and dimensionality reduction. In addition to a detailed comparative study of network configurations, we contribute by describing a hybrid multi-stage training network that exploits both contrastive and triplet networks to exceed state of the art performance on several SBIR benchmarks by a significant margin.
We present 'Screen codes' - a space- and time-efficient, aesthetically compelling method for transferring data from a display to a camera-equipped mobile device. Screen codes encode data as a grid of luminosity fluctuations within an arbitrary image, displayed on the video screen and decoded on a mobile device. These 'twinkling' images are a form of 'visual hyperlink', by which users can move dynamically generated content to and from their mobile devices. They help bridge the 'content divide' between mobile and fixed computing.
Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling results in a compact representation with minimal loss of information allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
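The late-fusion step described above can be sketched in miniature: an embedding from the volumetric (PVH) stream and an embedding from the IMU kinematic solve are concatenated and passed through a final fully connected layer that regresses joint positions. Dimensions, weights and the joint count below are illustrative assumptions, not values from the paper, and the temporal (LSTM) stages are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_pose_streams(pvh_embed, imu_embed, n_joints=21):
    """Concatenate the two stream embeddings and regress 3D joint positions
    with a single (randomly initialised, untrained) fully connected layer."""
    d = pvh_embed.shape[-1] + imu_embed.shape[-1]
    w = rng.standard_normal((d, n_joints * 3)) * 0.01  # fusion layer weights
    b = np.zeros(n_joints * 3)
    fused = np.concatenate([pvh_embed, imu_embed], axis=-1)
    return (fused @ w + b).reshape(-1, n_joints, 3)

pvh = rng.standard_normal((1, 256))  # hypothetical MVV/PVH embedding
imu = rng.standard_normal((1, 64))   # hypothetical IMU kinematic embedding
joints = fuse_pose_streams(pvh, imu)
print(joints.shape)  # (1, 21, 3)
```

The point of fusing at the embedding level, rather than averaging two independent pose estimates, is that the shared layer can learn when each modality is reliable, resolving ambiguities such as visual occlusion or IMU drift.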
This Pictorial documents the process of designing a device as an intervention within a field study of new parents. The device was deployed in participating parents’ homes to invite reflection on their everyday experiences of portraying self and others through social media in their transition to parenthood. The design creates a dynamic representation of each participant’s Facebook photo collection, extracting and amalgamating ‘faces’ from it to create an alternative portrait of an online self. We document the rationale behind our design, explaining how its features were inspired and developed, and how they function to address research questions about human experience.
We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from a MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.
In this paper, we propose an object segmentation algorithm driven by minimal user interactions. Compared to previous user-guided systems, our system can cut out the desired object in a given image with only a single finger touch minimizing user effort. The proposed model harnesses both edge and region based local information in an adaptive manner as well as geometric cues implied by the user-input to achieve fast and robust segmentation in a level set framework. We demonstrate the advantages of our method in terms of computational efficiency and accuracy comparing qualitatively and quantitatively with graph cut based techniques.
We present a novel Content Based Video Retrieval (CBVR) system, driven by free-hand sketch queries depicting both objects and their movement (via dynamic cues; streak-lines and arrows). Our main contribution is a probabilistic model of video clips (based on Linear Dynamical Systems), leading to an algorithm for matching descriptions of sketched objects to video. We demonstrate our model fitting to clips under static and moving camera conditions, exhibiting linear and oscillatory motion. We evaluate retrieval on two real video data sets, and on a video data set exhibiting controlled variation in shape, color, motion and clutter.
Keitler P., Pankratz F., Schwerdtfeger B., Pustka D., Rodiger W., Klinker G., Rauch C., Chathoth A., Collomosse J., Song Yi-Zhe (2010) Mobile augmented reality based 3D snapshots, In: Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2009) pp. 199-200
Institute of Electrical and Electronics Engineers (IEEE)
We describe a mobile augmented reality application that is based on 3D snapshotting using multiple photographs. Optical square markers provide the anchor for reconstructed virtual objects in the scene. A novel approach based on pixel flow highly improves tracking performance. This dual tracking approach also allows for a new single-button user interface metaphor for moving virtual objects in the scene. The development of the AR viewer was accompanied by user studies confirming the chosen approach.
We describe a semi-automatic video logging system, capable of annotating frames with semantic metadata describing the objects present. The system learns by visual examples provided interactively by the logging operator, which are learned incrementally to provide increased automation over time. Transfer learning is initially used to bootstrap the system using relevant visual examples from ImageNet. We adapt the hard-assignment Bag of Words strategy for object recognition to our interactive use context, showing transfer learning to significantly reduce the degree of interaction required.
We present a new algorithm for segmenting video frames into temporally stable colored regions, applying our technique to create artistic stylizations (e.g. cartoons and paintings) from real video sequences. Our approach is based on a multilabel graph cut applied to successive frames, in which the color data term and label priors are incrementally updated and propagated over time. We demonstrate coherent segmentation and stylization over a variety of home videos.
Falls in the home are a major source of injury for the elderly. The affordability of commodity video cameras is prompting the development of ambient intelligent environments to monitor the occurrence of falls in the home. This paper describes an automated fall detection system, capable of tracking movement and detecting falls in real-time. In particular we explore the application of the Bag of Features paradigm, frequently applied to general activity recognition in Computer Vision, to the domestic fall detection problem. We show that fall detection is feasible using such a framework, evaluating our approach in both controlled test scenarios and domestic scenarios exhibiting uncontrolled fall direction and visually cluttered environments.
Three dimensional (3D) displays typically rely on stereo disparity, requiring specialized hardware to be worn or embedded in the display. We present a novel 3D graphics display system for volumetric scene visualization using only standard 2D display hardware and a pair of calibrated web cameras. Our computer vision-based system requires no worn or other special hardware. Rather than producing the depth illusion through disparity, we deliver a full volumetric 3D visualization - enabling users to interactively explore 3D scenes by varying their viewing position and angle according to the tracked 3D position of their face and eyes. We incorporate a novel wand-based calibration that allows the cameras to be placed at arbitrary positions and orientations relative to the display. The resulting system operates at real-time speeds (~25 fps) with low latency (120-225 ms) delivering a compelling natural user interface and immersive experience for 3D viewing. In addition to objective evaluation of display stability and responsiveness, we report on user trials comparing users' timings on a spatial orientation task.
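The head-coupled ("fish-tank") perspective idea behind such a display can be sketched as an off-axis frustum computation: given the tracked 3D eye position relative to the screen centre, the view frustum is skewed so the rendered scene shifts naturally as the viewer moves. This is a minimal geometric sketch under assumed screen dimensions and near-plane distance, not the paper's calibration or tracking pipeline.

```python
def off_axis_frustum(eye, screen_w=0.5, screen_h=0.3, near=0.1):
    """Return (left, right, bottom, top) of the near-plane frustum for an
    eye at (x, y, z) metres from the screen centre, z > 0 towards viewer.

    The screen's physical edges are projected onto the near plane as seen
    from the eye, yielding the asymmetric frustum bounds.
    """
    ex, ey, ez = eye
    scale = near / ez
    left = (-screen_w / 2 - ex) * scale
    right = (screen_w / 2 - ex) * scale
    bottom = (-screen_h / 2 - ey) * scale
    top = (screen_h / 2 - ey) * scale
    return left, right, bottom, top

# A centred eye yields a symmetric frustum; moving the head right skews
# the frustum left, so the scene appears to rotate the other way.
print(off_axis_frustum((0.0, 0.0, 0.6)))
l, r, b, t = off_axis_frustum((0.1, 0.0, 0.6))
print(l + r < 0)  # True
```

These four bounds are exactly what an off-axis projection (e.g. OpenGL's `glFrustum`) consumes, which is why no special display hardware is needed: only the eye position must be tracked.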
We present a robust algorithm for temporally coherent video segmentation. Our approach is driven by multi-label graph cut applied to successive frames, fusing information from the current frame with an appearance model and labeling priors propagated forward from past frames. We propagate using a novel motion diffusion model, producing a per-pixel motion distribution that mitigates against cumulative estimation errors inherent in systems adopting "hard" decisions on pixel motion at each frame. Further, we encourage spatial coherence by imposing label consistency constraints within image regions (super-pixels) obtained via a bank of unsupervised frame segmentations, such as mean-shift. We demonstrate quantitative improvements in accuracy over state-of-the-art methods on a variety of sequences exhibiting clutter and agile motion, adopting the Berkeley methodology for our comparative evaluation.
This paper surveys the field of non-photorealistic rendering (NPR), focusing on techniques for transforming 2D input (images and video) into artistically stylized renderings. We first present a taxonomy of the 2D NPR algorithms developed over the past two decades, structured according to the design characteristics and behavior of each technique. We then describe a chronology of development from the semi-automatic paint systems of the early nineties, through to the automated painterly rendering systems of the late nineties driven by image gradient analysis. Two complementary trends in the NPR literature are then addressed, with reference to our taxonomy. First, the fusion of higher level computer vision and NPR, illustrating the trends toward scene analysis to drive artistic abstraction and diversity of style. Second, the evolution of local processing approaches toward edge-aware filtering for real-time stylization of images and video. The survey then concludes with a discussion of open challenges for 2D NPR identified in recent NPR symposia, including topics such as user and aesthetic evaluation.
This paper presents a sketch-based image retrieval system using a bag-of-region representation of images. Regions drawn from the nodes of a hierarchical region tree span various scales of detail. They have appealing properties for object-level inference, such as naturally encoding the shape and scale of objects and specifying the domains over which to compute features without contamination by clutter from outside the region. The proposed approach builds a shape descriptor on the salient shape among the clutter and thus yields significant performance improvements over previous results from three leading descriptors in a Bag-of-Words framework for sketch-based image retrieval. The matched region also facilitates the localization of the sketched object within the retrieved image.
We introduce a simple but versatile camera model that we call the Rational Tensor Camera (RTcam). RTcams are well principled mathematically and provably subsume several important contemporary camera models in both computer graphics and vision; their generality is one contribution. They can be used alone or compounded to produce more complicated visual effects. In this paper, we apply RTcams to generate synthetic artwork with novel perspective effects from real photographs. Existing Non-photorealistic Rendering from Photographs (NPRP) is constrained to the projection inherent in the source photograph, which is most often linear. RTcams lift this restriction and so contribute to NPRP via multiperspective projection. This paper describes RTcams, compares them to contemporary alternatives, and discusses how to control them in practice. Illustrative examples are provided throughout.
We describe a novel fully automatic algorithm for identifying salient objects in video based on their motion. Spatially coherent clusters of optical flow vectors are sampled to generate estimates of affine motion parameters local to super-pixels identified within each frame. These estimates, combined with spatial data, form coherent point distributions in a 5D solution space corresponding to objects or parts thereof. These distributions are temporally denoised using a particle filtering approach, and clustered to estimate the position and motion parameters of salient moving objects in the clip. We demonstrate localization of salient object/s in a variety of clips exhibiting moving and cluttered backgrounds.
A real-time motion capture system is presented which uses input from multiple standard video cameras and inertial measurement units (IMUs). The system is able to track multiple people simultaneously and requires no optical markers, specialized infra-red cameras or foreground/background segmentation, making it applicable to general indoor and outdoor scenarios with dynamic backgrounds and lighting. To overcome limitations of prior video or IMU-only approaches, we propose to use flexible combinations of multiple-view, calibrated video and IMU input along with a pose prior in an online optimization-based framework, which allows the full 6-DoF motion to be recovered including axial rotation of limbs and drift-free global position. A method for sorting and assigning raw input 2D keypoint detections into corresponding subjects is presented which facilitates multi-person tracking and rejection of any bystanders in the scene. The approach is evaluated on data from several indoor and outdoor capture environments with one or more subjects and the trade-off between input sparsity and tracking performance is discussed. State-of-the-art pose estimation performance is obtained on the Total Capture (multi-view video and IMU) and Human 3.6M (multi-view video) datasets. Finally, a live demonstrator for the approach is presented showing real-time capture, solving and character animation using a light-weight, commodity hardware setup.
We describe a system for matching human posture (pose) across a large cross-media archive of dance footage spanning nearly 100 years, comprising digitized photographs and videos of rehearsals and performances. This footage presents unique challenges due to its age, quality and diversity. We propose a forest-like pose representation combining visual structure (self-similarity) descriptors over multiple scales, without explicitly detecting limb positions which would be infeasible for our data. We explore two complementary multi-scale representations, applying passage retrieval and latent Dirichlet allocation (LDA) techniques inspired by the text retrieval domain, to the problem of pose matching. The result is a robust system capable of quickly searching large cross-media collections for similarity to a visually specified query pose. We evaluate over a cross-section of the UK National Research Centre for Dance's (UK-NRCD), and the Siobhan Davies Replay's (SDR) digital dance archives, using visual queries supplied by dance professionals. We demonstrate significant performance improvements over two baselines: classical single and multi-scale Bag of Visual Words (BoVW) and spatial pyramid kernel (SPK) matching.
This article reports a user-experience study in which a group of 18 older adults used a location-based mobile multimedia service in the setting of a rural nature reserve. The prototype system offered a variety of means of obtaining rich multimedia content from oak waymarker posts using a mobile phone. Free text questionnaires and focus groups were employed to investigate participants' experiences with the system and their attitudes to the use of mobile and pervasive systems in general. The users' experiences with the system were positive with respect to the design of the system in the context of the surrounding natural environment. However, the authors found some significant barriers to their adoption of mobile and pervasive systems as replacements for traditional information sources.
Collomosse JP, Hall PM, Rothlauf F, Branke J, Cagnoni S, Corne DW, Drechsler R, Jin Y, Machado P, Marchiori E, Romero J, Smith GD, Squillero G (2005) Genetic paint: A search for salient paintings, In: APPLICATIONS OF EVOLUTIONARY COMPUTING, PROCEEDINGS 3449 pp. 437-447
We propose a human performance capture system employing convolutional neural networks (CNN) to estimate human pose from a volumetric representation of a performer derived from multiple view-point video (MVV). We compare direct CNN pose regression to the performance of an affine invariant pose descriptor learned by a CNN through a classification task. A non-linear manifold embedding is learned between the descriptor and articulated pose spaces, enabling regression of pose from the source MVV. The results are evaluated against ground truth pose data captured using a Vicon marker-based system and demonstrate good generalisation over a range of human poses, providing a system that requires no special suit to be worn by the performer.
We describe a novel system for synthesising video choreography using sketched visual storyboards comprising human poses (stick men) and action labels. First, we describe an algorithm for searching archival dance footage using sketched pose. We match using an implicit representation of pose parsed from a mix of challenging low and high fidelity footage. In a training pre-process we learn a mapping between a set of exemplar sketches and corresponding pose representations parsed from the video, which are generalized at query-time to enable retrieval over previously unseen frames, and over additional unseen videos. Second, we describe how a storyboard of sketched poses, interspersed with labels indicating connecting actions, may be used to drive the synthesis of novel video choreography from the archival footage. We demonstrate both our retrieval and synthesis algorithms over both low fidelity PAL footage from the UK Digital Dance Archives (DDA) repository of contemporary dance, circa 1970, and over higher-definition studio captured footage.
We present an image retrieval system for the interactive search of photo collections using free-hand sketches depicting shape. We describe Gradient Field HOG (GF-HOG); an adapted form of the HOG descriptor suitable for Sketch Based Image Retrieval (SBIR). We incorporate GF-HOG into a Bag of Visual Words (BoVW) retrieval framework, and demonstrate how this combination may be harnessed both for robust SBIR, and for localizing sketched objects within an image. We evaluate over a large Flickr sourced dataset comprising 33 shape categories, using queries from 10 non-expert sketchers. We compare GF-HOG against state-of-the-art descriptors with common distance measures and language models for image retrieval, and explore how affine deformation of the sketch impacts search performance. GF-HOG is shown to consistently outperform retrieval versus SIFT, multi-resolution HOG, Self Similarity, Shape Context and Structure Tensor. Further, we incorporate semantic keywords into our GF-HOG system to enable the use of annotated sketches for image search. A novel graph-based measure of semantic similarity is proposed and two applications explored: semantic sketch based image retrieval and a semantic photo montage.
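The Bag of Visual Words stage that GF-HOG feeds can be sketched in a few lines: local descriptors are quantized against a codebook and pooled into a frequency histogram, which is the comparable "document" used for ranking. The random codebook and descriptors below are illustrative; in practice the codebook is learned (e.g. via k-means) over descriptors from the image corpus.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and return a
    normalised histogram of codeword frequencies."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
codebook = rng.standard_normal((64, 32))      # 64 words, 32-D descriptors
sketch_desc = rng.standard_normal((200, 32))  # descriptors from a sketch
photo_desc = rng.standard_normal((500, 32))   # descriptors from a photo

h_sketch = bovw_histogram(sketch_desc, codebook)
h_photo = bovw_histogram(photo_desc, codebook)

# Histogram intersection as one simple similarity measure for ranking.
print(np.minimum(h_sketch, h_photo).sum())
```

Because both sketch and photo collapse to fixed-length histograms over the same codebook, cross-domain comparison reduces to a vector similarity, which is also what makes inverted-index acceleration straightforward.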
The falling cost of digital cameras and camcorders has encouraged the creation of massive collections of personal digital media. However, once captured, this media is infrequently accessed and often lies dormant on users' PCs. We present a system to breathe life into home digital media collections, drawing upon artistic stylization to create a “Digital Ambient Display” that automatically selects, stylizes and transitions between digital contents in a semantically meaningful sequence. We present a novel algorithm based on multi-label graph cut for segmenting video into temporally coherent region maps. These maps are used to both stylize video into cartoons and paintings, and measure visual similarity between frames for smooth sequence transitions. The system automatically structures the media collection into a hierarchical representation based on visual content and semantics. Graph optimization is applied to adaptively sequence content for display in a coarse-to-fine manner, driven by user attention level (detected in real-time by a webcam). Our system is deployed on embedded hardware in the form of a compact digital photo frame. We demonstrate coherent segmentation and stylization over a variety of home videos and photos. We evaluate our media sequencing algorithm via a small-scale user study, indicating that our adaptive display conveys a more compelling media consumption experience than simple linear “slide-shows”.
We present the "empathic painting" - an interactive painterly rendering whose appearance adapts in real time to reflect the perceived emotional state of the viewer. The empathic painting is an experiment into the feasibility of using high level control parameters (namely, emotional state) to replace the plethora of low-level constraints users must typically set to affect the output of artistic rendering algorithms. We describe a suite of Computer Vision algorithms capable of recognising users' facial expressions through the detection of facial action units derived from the FACS scheme. Action units are mapped to vectors within a continuous 2D space representing emotional state, from which we in turn derive a continuous mapping to the style parameters of a simple but fast segmentation-based painterly rendering algorithm. The result is a digital canvas capable of smoothly varying its painterly style at approximately 4 frames per second, providing a novel user interactive experience using only commodity hardware.
The contribution of this paper is a novel non-photorealistic rendering (NPR) technique, capable of producing an artificial 'hand-painted' effect on 2D images, such as photographs. Our method requires no user interaction, and makes use of image salience and gradient information to determine the implicit ordering and attributes of individual brush strokes. The benefits of our technique are complete automation, and mitigation against the loss of image detail during painting. Strokes from lower salience regions of the image do not encroach upon higher salience regions; this can occur with some existing painting methods. We describe our algorithm in detail, and illustrate its application with a gallery of images.
We present TouchCut; a robust and efficient algorithm for segmenting image and video sequences with minimal user interaction. Our algorithm requires only a single finger touch to identify the object of interest in the image or first frame of video. Our approach is based on a level set framework, with an appearance model fusing edge, region texture and geometric information sampled local to the touched point. We first present our image segmentation solution, then extend this framework to progressive (per-frame) video segmentation, encouraging temporal coherence by incorporating motion estimation and a shape prior learned from previous frames. This new approach to visual object cut-out provides a practical solution for image and video segmentation on compact touch screen devices, facilitating spatially localized media manipulation. We describe such a case study, enabling users to selectively stylize video objects to create a hand-painted effect. We demonstrate the advantages of TouchCut by quantitatively comparing against the state of the art both in terms of accuracy, and run-time performance.
We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and we show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.
Yang Yifan, Cooper Daniel, Collomosse John, Dragan Catalin, Manulis Mark, Briggs Jo, Steane Jamie, Manohar Arthi, Moncur Wendy, Jones Helen (2020) TAPESTRY: A De-centralized Service for Trusted Interaction Online, In: IEEE Transactions on Services Computing pp. 1-1
Institute of Electrical and Electronics Engineers (IEEE)
We present a novel de-centralised service for proving the provenance of online digital identity, exposed as an assistive tool to help non-expert users make better decisions about whom to trust online. Our service harnesses the digital personhood (DP); the longitudinal and multi-modal signals created through users' lifelong digital interactions, as a basis for evidencing the provenance of identity. We describe how users may exchange trust evidence derived from their DP, in a granular and privacy-preserving manner, with other users in order to demonstrate coherence and longevity in their behaviour online. This is enabled through a novel secure infrastructure combining hybrid on- and off-chain storage combined with deep learning for DP analytics and visualization. We show how our tools enable users to make more effective decisions on whether to trust unknown third parties online, and also to spot behavioural deviations in their own social media footprints indicative of account hijacking.
We present a fast technique for retrieving video clips using free-hand sketched queries. Visual keypoints within each video are detected and tracked to form short trajectories, which are clustered to form a set of space-time tokens summarising video content. A Viterbi process matches a space-time graph of tokens to a description of colour and motion extracted from the query sketch. Inaccuracies in the sketched query are ameliorated by computing path cost using a Levenshtein (edit) distance. We evaluate over datasets of sports footage.
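The Levenshtein (edit) distance used to soften sketch inaccuracy can be sketched as a standard dynamic program over two token sequences; this toy version (unit insert/delete/substitute costs, which is an assumption, as the paper's actual cost model is not specified here) shows the core recurrence:

```python
def levenshtein(query_tokens, video_tokens):
    """Edit distance between two token sequences, with unit cost for
    insertion, deletion, and substitution (classic dynamic program)."""
    m, n = len(query_tokens), len(video_tokens)
    # dp[i][j] = cost of aligning the first i query tokens to the
    # first j video tokens
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if query_tokens[i - 1] == video_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete query token
                           dp[i][j - 1] + 1,      # insert video token
                           dp[i - 1][j - 1] + sub)  # match / substitute
    return dp[m][n]
```

Using such a cost in the Viterbi path search means a sketched query with a spurious or missing token still matches the intended video sequence, at a small additional path cost.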
We present a novel algorithm for stylizing photographs into portrait paintings comprised of curved brush strokes. Rather than drawing upon a prescribed set of heuristics to place strokes, our system learns a flexible model of artistic style by analyzing training data from a human artist. Given a training pair (a source image and a painting of that image), a non-parametric model of style is learned by observing the geometry and tone of brush strokes local to image features. A Markov Random Field (MRF) enforces spatial coherence of style parameters. Style models local to facial features are learned using a semantic segmentation of the input face image, driven by a combination of an Active Shape Model and Graph-cut. We evaluate style transfer between a variety of training and test images, demonstrating a wide gamut of learned brush and shading styles.
We introduce a trainable system that simultaneously filters and classifies low-level features into types specified by the user. The system operates over full colour images, and outputs a vector at each pixel indicating the probability that the pixel belongs to each feature type. We explain how common features such as edge, corner, and ridge can all be detected within a single framework, and how we combine these detectors using simple probability theory. We show its efficacy, using stereo-matching as an example.
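Combining several per-pixel detectors "using simple probability theory" can be illustrated with a naive-Bayes-style fusion: assuming the detectors are independent and classes have a uniform prior (both assumptions for this sketch, and the function name is hypothetical), probability maps are multiplied and renormalised per pixel:

```python
import numpy as np

def fuse_detectors(prob_maps):
    """Fuse per-pixel class-probability maps of shape (H, W, K) from
    independent detectors: multiply element-wise, then renormalise each
    pixel's K class probabilities to sum to one."""
    fused = np.prod(np.stack(prob_maps), axis=0)
    return fused / fused.sum(axis=-1, keepdims=True)
```

The output retains the interface the abstract describes: a vector at each pixel giving the probability that the pixel belongs to each feature type (edge, corner, ridge, ...).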
4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.
Humans have an innate ability to communicate visually; the earliest forms of communication were cave drawings, and children can communicate visual descriptions of scenes through drawings well before they can write. Drawings and sketches offer an intuitive and efficient means for communicating visual concepts. Today, society faces a deluge of digital visual content driven by a surge in the generation of video on social media and the online availability of video archives. Mobile devices are emerging as the dominant platform for consuming this content, with Cisco predicting that by 2018 over 80% of mobile traffic will be video. Sketch offers a familiar and expressive modality for interacting with video on the touch-screens commonly present on such devices. This thesis contributes several new algorithms for searching and manipulating video using free-hand sketches. We propose the Visual Narrative (VN): a storyboarded sequence of one or more actions in the form of sketch that collectively describe an event. We show that VNs can be used both to efficiently search video repositories and to synthesise video clips. First, we describe a sketch-based video retrieval (SBVR) system that fuses multiple modalities (shape, colour, semantics, and motion) in order to find relevant video clips. An efficient multi-modal video descriptor is proposed enabling the search of hundreds of videos in milliseconds. This contrasts with prior SBVR systems, which lack an efficient index representation and take minutes or hours to search similar datasets. This contribution not only makes SBVR practical at interactive speeds, but also enables user-refinement of results through relevance feedback to resolve sketch ambiguity, including the relative priority of the different VN modalities. Second, we present the first algorithm for sketch-based pose retrieval. A pictographic representation (stick-men) is used to specify a desired human pose within the VN, and similar poses are found within a video dataset. 
We use archival dance performance footage from the UK National Resource Centre for Dance (UK-NRCD), containing diverse examples of human pose. We investigate appropriate descriptors for sketch and video, and propose a novel manifold learning technique for mapping between the two descriptor spaces and so performing sketched pose retrieval. We show that domain adaptation can be applied to boost the performance of this system through a novel piece-wise feature-space warping technique. Third, we present a graph representation for VNs comprising multiple actions. We focus on the extension of our pose retrieval system to a sequence of poses interspersed with actions (e.g. jump, twirl). We show that our graph representation can be used for multiple applications: 1) to retrieve sequences of video comprising multiple actions; 2) to navigate in pictorial form, the retrieved video sequences; 3) to synthesise new video sequences by retrieving and concatenating video fragments from archival footage.
This thesis addresses the problem of modeling human shape in three dimensions. Specifically, this thesis is focused on modeling body shape variation across multiple individuals, pose-induced shape deformations, and garment deformations that are influenced both by body shape and pose. A methodology for constructing data-driven models of human body and garment deformation is provided. Additionally, an application for online fashion retailing is presented. Firstly, a quantitative and qualitative evaluation of surface representations used in recent statistical models of human shape and pose is introduced. It is shown that the Euclidean representation generates a more compact human shape model compared to other representations. A small number of model parameters indicates better convergence in a human body estimation framework. In contrast, a high number of model parameters increases the risk of the optimization getting trapped in a local optimum. Based on these insights a system for human body shape estimation and classification for online fashion applications is presented. Given a single image of a subject and the subject's height and weight, the proposed framework is able to estimate the 3D human body shape using a learnt statistical model. Results demonstrate that a single image holds sufficient information for accurate shape classification. This technology has been exploited as part of a collaborative project with fashion designers to develop a mobile app to classify body shape for clothing recommendation in online fashion retail. Next, Shape and Pose Space Deformation (SPSD) is presented, a technique for modeling subject-specific pose-induced deformations. By exploiting examples of different people in multiple poses, plausible animations of novel subjects can be synthesized by interpolating and extrapolating in a joint shape and pose parameter space. 
The results show that greater detail is achieved by incorporating subject-specific pose deformations as opposed to a subject-independent pose model. Finally, SPSD is extended to a three-layer data-driven model of human shape, pose and garment deformation. Each layer represents the deformation of a template mesh and can be controlled independently and intuitively. The garment deformation layer is trained on sequences of dressed actors and relies on a novel technique for human shape and posture estimation under clothing.
This thesis addresses the problem of reconstructing complex real-world dynamic scenes without prior knowledge of the scene structure, dynamic objects or background. Previous approaches to 3D reconstruction of dynamic scenes either require a controlled studio set-up with chroma-key backgrounds or prior knowledge such as static background appearance or segmentation of the dynamic objects. This thesis presents a new approach which enables general dynamic scene reconstruction. This is achieved by initializing the reconstruction with sparse wide-baseline feature matches between views, which avoids the requirement for prior knowledge of the background appearance or assumptions that the background is static. To achieve sparse reconstruction of dynamic objects a novel segmentation-based feature detector, SFD, is introduced. SFD is shown to give an order of magnitude increase in the number and reliability of features detected. A coarse-to-fine approach is introduced for reconstruction of dense 3D models of dynamic scenes. This uses joint segmentation and shape refinement to achieve robust reconstruction of dynamic objects such as people. The approach is evaluated across a wide range of indoor and outdoor scenes. The second major contribution of this research is to introduce temporal coherence into the reconstruction process. The dynamic scene is segmented into objects based on the initial sparse 3D feature reconstruction of the scene. Dense reconstruction is then performed for each object. For dynamic objects the reconstruction is propagated over time to provide a prior for the reconstruction at successive frames in the sequence. This is combined with the introduction of a geodesic star convexity constraint in the segmentation refinement to improve the segmentation of complex objects. 
Evaluation on general dynamic scenes demonstrates significant improvement in both segmentation and reconstruction, with temporal coherence reducing the ambiguity in the reconstruction of complex shape. The final significant contribution of this research is the introduction of a complete framework for 4D temporally coherent shape reconstruction from one or more camera views. The 4D match tree is introduced as an intermediate representation for robust alignment of partial surface reconstructions across a complete sequence. SFD is used to achieve wide-timeframe matching of partial surface reconstructions between any pair of frames in the sequence. This allows the evaluation of a frame-to-frame shape similarity metric. A 4D match tree is then reconstructed as the minimum spanning tree which represents the shortest path in shape similarity space for alignment across all frames in the sequence. The 4D match tree is applied to achieve robust 4D shape reconstruction of complex dynamic scenes. This is the first approach to demonstrate 4D reconstruction of general real-world dynamic scenes with non-rigid shape from video.
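Construction of a minimum spanning tree over a frame-to-frame shape-dissimilarity matrix, as the 4D match tree described above, can be sketched with Prim's algorithm; this toy version (the function name, the choice of frame 0 as root, and the dense matrix input are assumptions for illustration) returns the tree edges along which alignment would be propagated:

```python
import heapq

def match_tree(dist):
    """Minimum spanning tree over n frames given an n x n pairwise
    shape-dissimilarity matrix (Prim's algorithm, rooted at frame 0).
    Returns a list of (parent_frame, child_frame) edges."""
    n = len(dist)
    in_tree = [False] * n
    in_tree[0] = True
    # candidate edges from the tree to frames not yet included
    heap = [(dist[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap:
        d, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue  # stale edge; frame j was added via a cheaper path
        in_tree[j] = True
        edges.append((i, j))
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(heap, (dist[j][k], j, k))
    return edges
```

Aligning partial reconstructions along MST edges rather than consecutive frames keeps each pairwise alignment between the most similar shapes available, which is the intuition behind the "shortest path in shape similarity space".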
Understanding human behaviour in an automatic but also non-intrusive manner constitutes an important and emerging area for various fields. This requires collaboration of information technology with humanitarian sciences in order to transfer existing knowledge of human behaviour into self-acting tools to eliminate the human error. This work strives to shed some light in the area of Mobile Social Signal Processing by trying to understand if today's mobile devices, given their advanced sensing and computational capabilities, are able to extract various aspects of human behaviour. Although social interactions are one of the core aspects of human behaviour, current tools do not provide an accurate, reliable and real-time solution for social interaction detection, which constitutes a significant barrier in automatic human behaviour understanding. Towards filling this gap and enabling human behaviour understanding through mobile devices, several contributions were made. Firstly, an interpersonal distance estimation technique is developed based upon a non-intrusive opportunistic mechanism that solely relies on sensors and communication capabilities of off-the-shelf smartphones. Secondly, based on users' interpersonal distance and relative orientation, a pervasive and opportunistic social interaction detection system based on off-the-shelf smartphones is presented. Leveraging information provided by psychology, analytical and error models are proposed to estimate the probability of people having social interactions. Then, to showcase the ability of mobile devices to infer human behaviour, a trust relationship quantification mechanism is developed based on users' behavioural traits and psychological models. Finally, a prediction and compensation mechanism for the device displacement error that leverages human locomotion patterns to refine the device orientation is introduced. 
The above contributions were evaluated through experimentation and hard data collected from real-world environments to prove their accuracy and reliability, as well as to show the applicability of the proposed approaches in daily situations. This work showed that mobile devices are able to accurately detect social interactions and further infer social and trust relationships among people, despite the noise induced in real-world situations. Close collaboration between informatics and social sciences is imperative to overcome the significant barrier in the development of human behaviour understanding. This work could constitute a fundamental building block, as the computational power and battery autonomy of mobile devices increases, for the development of novel techniques towards understanding human behaviour, by including multiple behavioural traits and enabling the creation of socially-aware information systems.