We present 'Screen codes' - a space- and time-efficient, aesthetically compelling method for transferring data from a display to a camera-equipped mobile device. Screen codes encode data as a grid of luminosity fluctuations within an arbitrary image, displayed on the video screen and decoded on a mobile device. These 'twinkling' images are a form of 'visual hyperlink', by which users can move dynamically generated content to and from their mobile devices. They help bridge the 'content divide' between mobile and fixed computing.
We describe a system for matching human posture (pose) across a large cross-media archive of dance footage spanning nearly 100 years, comprising digitized photographs and videos of rehearsals and performances. This footage presents unique challenges due to its age, quality and diversity. We propose a forest-like pose representation combining visual structure (self-similarity) descriptors over multiple scales, without explicitly detecting limb positions, which would be infeasible for our data. We explore two complementary multi-scale representations, applying passage retrieval and latent Dirichlet allocation (LDA) techniques inspired by the text retrieval domain to the problem of pose matching. The result is a robust system capable of quickly searching large cross-media collections for similarity to a visually specified query pose. We evaluate over a cross-section of the UK National Research Centre for Dance's (UK-NRCD) and the Siobhan Davies Replay's (SDR) digital dance archives, using visual queries supplied by dance professionals. We demonstrate significant performance improvements over two baselines: classical single and multi-scale Bag of Visual Words (BoVW) and spatial pyramid kernel (SPK) matching.
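The multi-scale bag-of-visual-words matching at the heart of such a system can be sketched in a few lines of numpy. This is an illustrative simplification, not the paper's implementation: the self-similarity descriptor extraction, passage retrieval and LDA stages are omitted, and the hard-assignment quantization and function names are our assumptions.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Assign each descriptor to its nearest codeword (hard assignment)."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def bovw_histogram(words, k):
    """L1-normalised bag-of-visual-words histogram over k codewords."""
    h = np.bincount(words, minlength=k).astype(float)
    return h / max(h.sum(), 1e-9)

def multiscale_match(query_hists, archive_hists):
    """Score = mean histogram intersection across scales (higher = more similar)."""
    return np.mean([np.minimum(q, a).sum() for q, a in zip(query_hists, archive_hists)])
```

A pose matches itself with score 1.0 under this measure; ranking an archive then amounts to sorting clips by their intersection score against the query.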
Falls in the home are a major source of injury for the elderly. The affordability of commodity video cameras is prompting the development of ambient intelligent environments to monitor the occurrence of falls in the home. This paper describes an automated fall detection system, capable of tracking movement and detecting falls in real-time. In particular we explore the application of the Bag of Features paradigm, frequently applied to general activity recognition in Computer Vision, to the domestic fall detection problem. We show that fall detection is feasible using such a framework, evaluating our approach in both controlled test scenarios and domestic scenarios exhibiting uncontrolled fall direction and visually cluttered environments.
We present a novel Content Based Video Retrieval (CBVR) system, driven by free-hand sketch queries depicting both objects and their movement (via dynamic cues: streak-lines and arrows). Our main contribution is a probabilistic model of video clips (based on Linear Dynamical Systems), leading to an algorithm for matching descriptions of sketched objects to video. We demonstrate our model fitting to clips under static and moving camera conditions, exhibiting linear and oscillatory motion. We evaluate retrieval on two real video data sets, and on a video data set exhibiting controlled variation in shape, color, motion and clutter.
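A toy illustration of the core idea behind a Linear Dynamical Systems fit: estimate the transition matrix of the tracked state and score a clip by its one-step prediction error. This is not the paper's probabilistic model; the plain least-squares fit and the names below are illustrative assumptions.

```python
import numpy as np

def fit_lds(X):
    """Least-squares fit of the transition matrix A in x[t+1] = A @ x[t].

    X: (T, d) array of observed state vectors over T time steps."""
    X0, X1 = X[:-1], X[1:]
    M, *_ = np.linalg.lstsq(X0, X1, rcond=None)  # solves X0 @ M = X1
    return M.T  # so that x[t+1] is approximately A @ x[t]

def prediction_error(X, A):
    """Mean one-step prediction error; a low error suggests the clip fits the model."""
    return np.mean(np.linalg.norm(X[1:] - X[:-1] @ A.T, axis=1))
```

Oscillatory motion (e.g. a 2D rotation) is recovered exactly by this fit when the observations are noise-free, and the prediction error then provides a natural matching score between sketched and observed dynamics.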
Wang T, Han B, Collomosse JP TouchCut: Single-touch object segmentation driven by level set methods, IEEE
In this paper, we propose an object segmentation algorithm driven by minimal user interaction. Compared to previous user-guided systems, our system can cut out the desired object in a given image with only a single finger touch, minimizing user effort. The proposed model harnesses both edge and region based local information in an adaptive manner, as well as geometric cues implied by the user input, to achieve fast and robust segmentation in a level set framework. We demonstrate the advantages of our method in terms of computational efficiency and accuracy, comparing qualitatively and quantitatively with graph cut based techniques.
We propose a human performance capture system employing convolutional neural networks (CNN) to estimate human pose from a volumetric representation of a performer derived from multiple view-point video (MVV). We compare direct CNN pose regression to the performance of an affine invariant pose descriptor learned by a CNN through a classification task. A non-linear manifold embedding is learned between the descriptor and articulated pose spaces, enabling regression of pose from the source MVV. The results are evaluated against ground truth pose data captured using a Vicon marker-based system and demonstrate good generalisation over a range of human poses, providing a system that requires no special suit to be worn by the performer.
Wang T, Collomosse JP, Hunter A, Greig D Learnable Stroke Models for Example-based Portrait Painting, BMVA
We present a novel algorithm for stylizing photographs into portrait paintings comprised of curved brush strokes. Rather than drawing upon a prescribed set of heuristics to place strokes, our system learns a flexible model of artistic style by analyzing training data from a human artist. Given a training pair (a source image and a painting of that image), a non-parametric model of style is learned by observing the geometry and tone of brush strokes local to image features. A Markov Random Field (MRF) enforces spatial coherence of style parameters. Style models local to facial features are learned using a semantic segmentation of the input face image, driven by a combination of an Active Shape Model and Graph-cut. We evaluate style transfer between a variety of training and test images, demonstrating a wide gamut of learned brush and shading styles.
Kim J, Collomosse JP (2013) Semi-automated Video Logging by Incremental and Transfer Learning, IEEE
We describe a semi-automatic video logging system, capable of annotating frames with semantic metadata describing the objects present. The system learns by visual examples provided interactively by the logging operator, which are learned incrementally to provide increased automation over time. Transfer learning is initially used to bootstrap the system using relevant visual examples from ImageNet. We adapt the hard-assignment Bag of Words strategy for object recognition to our interactive use context, showing transfer learning to significantly reduce the degree of interaction required.
Collomosse J (2007) Evolutionary search for the artistic rendering of photographs, In: The Art of Artificial Evolution: A Handbook 2 pp. 39-62 Springer-Verlag
We present a novel steganographic technique for concealing information within an image. Uniquely we explore the practicality of hiding, and recovering, data within the pattern of brush strokes generated by a non-photorealistic rendering (NPR) algorithm. We incorporate a linear binary coding (LBC) error correction scheme over a raw data channel established through the local statistics of NPR stroke orientations. This enables us to deliver a robust channel for conveying short (e.g. 30 character) strings covertly within a painterly rendering. We evaluate over a variety of painterly renderings, parameter settings and message lengths.
Collomosse J, Hall P (2004) A Mid-level Description of Video, with Application to Non-photorealistic Animation, Proceedings 15th British Machine Vision Conference (BMVC) 1 pp. 7-16
Hu R, Collomosse J (2013) A performance evaluation of gradient field HOG descriptor for sketch based image retrieval, Computer Vision and Image Understanding pp. 790-806 Elsevier
We present an image retrieval system for the interactive search of photo collections using free-hand sketches depicting shape. We describe Gradient Field HOG (GF-HOG); an adapted form of the HOG descriptor suitable for Sketch Based Image Retrieval (SBIR). We incorporate GF-HOG into a Bag of Visual Words (BoVW) retrieval framework, and demonstrate how this combination may be harnessed both for robust SBIR, and for localizing sketched objects within an image. We evaluate over a large Flickr sourced dataset comprising 33 shape categories, using queries from 10 non-expert sketchers. We compare GF-HOG against state-of-the-art descriptors with common distance measures and language models for image retrieval, and explore how affine deformation of the sketch impacts search performance. GF-HOG is shown to consistently outperform SIFT, multi-resolution HOG, Self Similarity, Shape Context and Structure Tensor for retrieval. Further, we incorporate semantic keywords into our GF-HOG system to enable the use of annotated sketches for image search. A novel graph-based measure of semantic similarity is proposed and two applications explored: semantic sketch based image retrieval and a semantic photo montage.
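The BoVW retrieval stage can be illustrated with a minimal tf-idf weighted cosine ranking over codeword histograms. The GF-HOG descriptor computation itself is not shown, and the tf-idf/cosine scheme here is a common choice for such frameworks rather than necessarily the paper's exact model:

```python
import numpy as np

def tfidf_rank(query_hist, doc_hists):
    """Rank images by cosine similarity of tf-idf weighted BoVW histograms.

    query_hist: codeword counts for the sketch; doc_hists: one row per image."""
    D = np.asarray(doc_hists, dtype=float)
    df = (D > 0).sum(axis=0)                     # document frequency per codeword
    idf = np.log((1 + len(D)) / (1 + df)) + 1.0  # smoothed inverse document frequency
    Dw = D * idf
    qw = np.asarray(query_hist, dtype=float) * idf
    sims = Dw @ qw / (np.linalg.norm(Dw, axis=1) * np.linalg.norm(qw) + 1e-12)
    return np.argsort(-sims)                     # best-matching image indices first
```

In practice the document histograms would be indexed once offline, with only the query-side weighting and dot products computed at search time.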
This paper investigates the feasibility of evolutionary search techniques as a mechanism for interactively exploring the design space of 2D painterly renderings. Although a growing body of painterly rendering literature exists, the large number of low-level configurable parameters that feature in contemporary algorithms can be counter-intuitive for non-expert users to set. In this paper we first describe a multi-resolution painting algorithm capable of transforming photographs into paintings at interactive speeds. We then present a supervised evolutionary search process in which the user scores paintings on their aesthetics to guide the specification of their desired painterly rendering. Using our system, non-expert users are able to produce their desired aesthetic in approximately 20 mouse clicks - around half an order of magnitude faster than manual specification of individual rendering parameters by trial and error. © Springer-Verlag Berlin Heidelberg 2006.
Collomosse JP, Kindberg T (2009) Method of Generating a Sequence of Display Frames For Display on a Display Device, 12/179857
We present a novel video retrieval system that accepts annotated free-hand sketches as queries. Existing sketch based video retrieval (SBVR) systems enable the appearance and movements of objects to be searched naturally through pictorial representations. Whilst visually expressive, such systems present an imprecise vehicle for conveying the semantics (e.g. object types) within a scene. Our contribution is to fuse the semantic richness of text with the expressivity of sketch, to create a hybrid `semantic sketch' based video retrieval system. Trajectory extraction and clustering are applied to pre-process each clip into a video object representation that we augment with object classification and colour information. The result is a system capable of searching videos based on the desired colour, motion path, and semantic labels of the objects present. We evaluate the performance of our system over the TSF dataset of broadcast sports footage.
4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.
Collomosse J, Hall P (2005) Motion analysis in video: dolls, dynamic cues and Modern Art, Proceedings of 2nd Intl. Conf on Vision Video and Graphics (VVG) pp. 109-116 Eurographics
We present TouchCut; a robust and efficient algorithm for segmenting image and video sequences with minimal user interaction. Our algorithm requires only a single finger touch to identify the object of interest in the image or first frame of video. Our approach is based on a level set framework, with an appearance model fusing edge, region texture and geometric information sampled local to the touched point. We first present our image segmentation solution, then extend this framework to progressive (per-frame) video segmentation, encouraging temporal coherence by incorporating motion estimation and a shape prior learned from previous frames. This new approach to visual object cut-out provides a practical solution for image and video segmentation on compact touch screen devices, facilitating spatially localized media manipulation. We describe such a case study, enabling users to selectively stylize video objects to create a hand-painted effect. We demonstrate the advantages of TouchCut by quantitatively comparing against the state of the art both in terms of accuracy, and run-time performance.
Collomosse JP, Kindberg T (2009) Content Encoder and Decoder and Methods of Encoding and Decoding Content,
Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling results in a compact representation with minimal loss of information allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.
We describe a novel framework for segmenting a time- and view-coherent foreground matte sequence from synchronised multiple view video. We construct a Markov Random Field (MRF) comprising links between superpixels corresponded across views, and links between superpixels and their constituent pixels. Texture, colour and disparity cues are incorporated to model foreground appearance. We solve using a multi-resolution iterative approach enabling an eight view high definition (HD) frame to be processed in less than a minute. Furthermore we incorporate a temporal diffusion process introducing a prior on the MRF using information propagated from previous frames, and a facility for optional user correction. The result is a set of temporally coherent mattes that are solved for simultaneously across views for each frame, exploiting similarities across views and time.
Collomosse JP, Hall PM (2005) Genetic paint: A search for salient paintings, in: Rothlauf F, Branke J, Cagnoni S, Corne DW, Drechsler R, Jin Y, Machado P, Marchiori E, Romero J, Smith GD, Squillero G (eds.), Applications of Evolutionary Computing, Proceedings 3449 pp. 437-447
Vanderhaeghe D, Collomosse JP (2013) Stroke based painterly rendering, In: Collomosse J, Rosin P (eds.), Image and video based artistic stylization 42 pp. 3-22 Springer-Verlag
James S, Fonseca M, Collomosse JP ReEnact: Sketch based Choreographic Design from Archival Dance Footage, ACM
Collomosse JP, Rowntree D, Hall PM (2003) Cartoon-style Rendering of Motion from Video, Proceedings Video, Vision and Graphics (VVG) pp. 117-124 Eurographics
Collomosse JP, Ren R (2011) A Bag of Visual Words based Query Generative Model,
Collomosse JP, Rowntree D, Hall PM (2005) Rendering cartoon-style motion cues in post-production video, GRAPHICAL MODELS 67 (6) pp. 549-564 ACADEMIC PRESS INC ELSEVIER SCIENCE
Benard P, Thollot J, Collomosse JP (2013) Temporally Coherent Video Stylization, In: Collomosse J, Rosin P (eds.), Image and video based artistic stylization 42 pp. 257-284 Springer-Verlag
McNeill G, Collomosse JP (2007) Reverse storyboarding for video retrieval, IET Conference Publications (534 CP)
We report early experiments towards a content based video retrieval (CBVR) system that harnesses storyboards as an efficient and intuitive input mechanism for querying video databases. Storyboards encapsulate high level semantics describing both objects in the scene (spatially) and their movement (dynamics). Dynamics are depicted using a rich vocabulary of motion cues borrowed from animation: streak and ghosting lines, deformations, as well as conventional indicators such as arrowheads. Fusing spatial and dynamic cues for CBVR promises advantages not only in terms of usability, but also in constraining the search to yield improved precision and performance.
We present a new algorithm for segmenting video frames into temporally stable colored regions, applying our technique to create artistic stylizations (e.g. cartoons and paintings) from real video sequences. Our approach is based on a multilabel graph cut applied to successive frames, in which the color data term and label priors are incrementally updated and propagated over time. We demonstrate coherent segmentation and stylization over a variety of home videos.
This Pictorial documents the process of designing a device as an intervention within a field study of new parents. The device was deployed in participating parents' homes to invite reflection on their everyday experiences of portraying self and others through social media in their transition to parenthood.
The design creates a dynamic representation of each participant's Facebook photo collection, extracting and amalgamating 'faces' from it to create an alternative portrait of an online self. We document the rationale behind our design, explaining how its features were inspired and developed, and how they function to address research questions about human experience.
Hu R, James S, Wang T, Collomosse JP (2013) Markov Random Fields for Sketch based Video Retrieval, pp. 279-286 ACM
We describe a new system for searching video databases using free-hand sketched queries. Our query sketches depict both object appearance and motion, and are annotated with keywords that indicate the semantic category of each object. We parse space-time volumes from video to form a graph representation, which we match to sketches under a Markov Random Field (MRF) optimization. The MRF energy function is used to rank videos for relevance and contains unary, pairwise and higher-order potentials that reflect the colour, shape, motion and type of sketched objects. We evaluate performance over a dataset of 500 sports footage clips.
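The ranking-by-energy idea can be sketched with unary and pairwise terms only (the paper also uses higher-order potentials). The cost tables and names below are hypothetical placeholders, not the paper's actual descriptors:

```python
import numpy as np

def mrf_energy(assignment, unary, pairwise, edges):
    """MRF energy of matching sketched objects to a video's object tracks.

    assignment[i]            = index of the track matched to sketched object i.
    unary[i][t]              = appearance/motion cost of putting object i on track t.
    pairwise[(i, j)][ti, tj] = cost of the relative layout of the two tracks."""
    e = sum(unary[i][t] for i, t in enumerate(assignment))
    e += sum(pairwise[(i, j)][assignment[i], assignment[j]] for i, j in edges)
    return e

def rank_videos(assignments, unaries, pairwises, edges):
    """Lower MRF energy means a better match; return video indices best-first."""
    energies = [mrf_energy(a, u, p, edges)
                for a, u, p in zip(assignments, unaries, pairwises)]
    return sorted(range(len(energies)), key=lambda v: energies[v])
```

Here each video's best assignment is assumed to be found already (e.g. by exhaustive or approximate MRF inference); the videos are then sorted by their minimal energy.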
Rosin PL, Collomosse J (2012) Image and Video-Based Artistic Stylisation, Springer
This guide's cutting-edge coverage explains the full spectrum of NPR techniques used in photography, TV and film.
The contribution of this paper is a novel non-photorealistic rendering (NPR) technique, capable of producing an artificial 'hand-painted' effect on 2D images, such as photographs. Our method requires no user interaction, and makes use of image salience and gradient information to determine the implicit ordering and attributes of individual brush strokes. The benefits of our technique are complete automation, and mitigation against the loss of image detail during painting. Strokes from lower salience regions of the image do not encroach upon higher salience regions; this can occur with some existing painting methods. We describe our algorithm in detail, and illustrate its application with a gallery of images.
This paper presents a sketch-based image retrieval system using a bag-of-regions representation of images. Regions, drawn from the nodes of a hierarchical region tree, span various scales of detail. They have appealing properties for object level inference, such as naturally encoding the shape and scale of objects, and providing well-defined domains over which to compute features unaffected by clutter from outside the region. The proposed approach builds a shape descriptor on the salient shape among clutter, and thus yields significant performance improvements over previous results from three leading descriptors in a Bag-of-Words framework for sketch based image retrieval. The matched region also facilitates localization of the sketched object within the retrieved image.
We present the "empathic painting" - an interactive painterly rendering whose appearance adapts in real time to reflect the perceived emotional state of the viewer. The empathic painting is an experiment into the feasibility of using high level control parameters (namely, emotional state) to replace the plethora of low-level constraints users must typically set to affect the output of artistic rendering algorithms. We describe a suite of Computer Vision algorithms capable of recognising users' facial expressions through the detection of facial action units derived from the FACS scheme. Action units are mapped to vectors within a continuous 2D space representing emotional state, from which we in turn derive a continuous mapping to the style parameters of a simple but fast segmentation-based painterly rendering algorithm. The result is a digital canvas capable of smoothly varying its painterly style at approximately 4 frames per second, providing a novel user interactive experience using only commodity hardware.
We present a robust algorithm for temporally coherent video segmentation. Our approach is driven by multi-label graph cut applied to successive frames, fusing information from the current frame with an appearance model and labeling priors propagated forward from past frames. We propagate using a novel motion diffusion model, producing a per-pixel motion distribution that mitigates against cumulative estimation errors inherent in systems adopting 'hard' decisions on pixel motion at each frame. Further, we encourage spatial coherence by imposing label consistency constraints within image regions (super-pixels) obtained via a bank of unsupervised frame segmentations, such as mean-shift. We demonstrate quantitative improvements in accuracy over state-of-the-art methods on a variety of sequences exhibiting clutter and agile motion, adopting the Berkeley methodology for our comparative evaluation.
Hall PM, Owen MJ, Collomosse JP (2004) A Trainable Low-level Feature Detector, Proceedings Intl. Conference on Pattern Recognition (ICPR) 1 pp. 708-711 IEEE
We introduce a trainable system that simultaneously filters and classifies low-level features into types specified by the user. The system operates over full colour images, and outputs a vector at each pixel indicating the probability that the pixel belongs to each feature type. We explain how common features such as edge, corner, and ridge can all be detected within a single framework, and how we combine these detectors using simple probability theory. We show its efficacy, using stereo-matching as an example.
Gray C, James S, Collomosse J, Asente P (2014) A Particle Filtering Approach to Salient Video Object Localization, pp. 194-198 IEEE
We describe a novel fully automatic algorithm for identifying salient objects in video based on their motion. Spatially coherent clusters of optical flow vectors are sampled to generate estimates of affine motion parameters local to super-pixels identified within each frame. These estimates, combined with spatial data, form coherent point distributions in a 5D solution space corresponding to objects or parts thereof. These distributions are temporally denoised using a particle filtering approach, and clustered to estimate the position and motion parameters of salient moving objects in the clip. We demonstrate localization of salient object/s in a variety of clips exhibiting moving and cluttered backgrounds.
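The temporal denoising step can be illustrated with a minimal bootstrap particle filter. The paper filters distributions in a 5D affine-plus-spatial space; this sketch tracks a single 1D parameter, and the noise models and names are our assumptions:

```python
import numpy as np

def particle_filter(observations, n=500, motion_std=0.3, obs_std=1.0, seed=0):
    """Bootstrap particle filter: temporally denoise a noisy 1D parameter track.

    Returns the per-frame posterior mean estimate of the parameter."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(observations[0], obs_std, n)
    estimates = []
    for z in observations:
        particles += rng.normal(0.0, motion_std, n)          # diffuse (motion model)
        w = np.exp(-0.5 * ((particles - z) / obs_std) ** 2)  # weight by observation
        w = w + 1e-12                                        # guard against all-zero weights
        w /= w.sum()
        idx = rng.choice(n, n, p=w)                          # resample by weight
        particles = particles[idx]
        estimates.append(particles.mean())
    return np.array(estimates)
```

The filtered track is smoother than the raw per-frame estimates, which is the property exploited before clustering the distributions into object hypotheses.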
We present an image retrieval system driven by free-hand sketched queries depicting shape. We introduce Gradient Field HoG (GF-HOG) as a depiction invariant image descriptor, encapsulating local spatial structure in the sketch and facilitating efficient codebook based retrieval. We show improved retrieval accuracy over 3 leading descriptors (Self Similarity, SIFT, HoG) across two datasets (Flickr160, ETHZ extended objects), and explain how GF-HOG can be combined with RANSAC to localize sketched objects within relevant images. We also demonstrate a prototype sketch driven photo montage application based on our system.
We present "screen codes" - a space- and time-efficient, aesthetically compelling method for transferring data from a display (e.g. a VDU or projected public display) to a camera equipped mobile device. Screen codes encode data as a grid of luminosity fluctuations within an arbitrary image, displayed on the video screen. These fluctuations, manifested as a "twinkling" within the image, are observed by the mobile device over time and decoded to reconstruct the data. Observation is passive; there is no back-channel from the camera to the display. Novel spatial and temporal coding strategies are employed, tailored to channel noise conditions. The display may be observed from any angle or orientation.
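The basic luminosity-difference channel can be sketched in numpy. The real system spreads data over many frames with spatial and temporal coding plus error correction; this two-frame version, with its grid size and delta chosen arbitrarily, only shows the underlying idea:

```python
import numpy as np

def encode_frames(image, bits, grid=(4, 4), delta=8):
    """Encode one bit per grid cell as a luminosity fluctuation across a frame pair.

    image: 2D uint8 luminance array whose dimensions divide by the grid. A 1-bit
    cell is brightened then darkened; a 0-bit cell does the opposite."""
    gy, gx = grid
    h, w = image.shape
    sign = (np.asarray(bits, dtype=int).reshape(gy, gx) * 2 - 1) * delta
    pattern = np.kron(sign, np.ones((h // gy, w // gx), dtype=int))  # per-pixel offsets
    base = image.astype(int)
    fa = np.clip(base + pattern, 0, 255).astype(np.uint8)
    fb = np.clip(base - pattern, 0, 255).astype(np.uint8)
    return fa, fb

def decode_frames(fa, fb, grid=(4, 4)):
    """Recover bits from the sign of the mean per-cell luminosity difference."""
    gy, gx = grid
    diff = fa.astype(int) - fb.astype(int)
    h, w = diff.shape
    cells = diff.reshape(gy, h // gy, gx, w // gx).mean(axis=(1, 3))
    return (cells > 0).astype(int).ravel()
```

Because decoding uses only the sign of the per-cell difference, the carrier image itself cancels out, which is what lets an arbitrary image host the code while merely appearing to "twinkle".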
Collomosse JP, Rowntree D, Hall PM (2003) Video analysis for Cartoon-style Special Effects, Proceedings 14th British Machine Vision Conference (BMVC) 2 pp. 749-758-749-758
Three dimensional (3D) displays typically rely on stereo disparity, requiring specialized hardware to be worn or embedded in the display. We present a novel 3D graphics display system for volumetric scene visualization using only standard 2D display hardware and a pair of calibrated web cameras. Our computer vision-based system requires no worn or other special hardware. Rather than producing the depth illusion through disparity, we deliver a full volumetric 3D visualization - enabling users to interactively explore 3D scenes by varying their viewing position and angle according to the tracked 3D position of their face and eyes. We incorporate a novel wand-based calibration that allows the cameras to be placed at arbitrary positions and orientations relative to the display. The resulting system operates at real-time speeds (~25 fps) with low latency (120-225 ms) delivering a compelling natural user interface and immersive experience for 3D viewing. In addition to objective evaluation of display stability and responsiveness, we report on user trials comparing users' timings on a spatial orientation task.
Song YZ, Hall PM, Rosin PL, Collomosse JP (2008) Arty Shapes, Proceedings of Computational Aesthetics pp. 65-73
Kim J, Gray C, Asente P, Collomosse JP (2015) Comprehensible Video Thumbnails, Computer Graphics Forum (Eurographics 2015) 34 (2) pp. 167-177 John Wiley & Sons Ltd.
We present the Comprehensible Video Thumbnail; an automatically generated visual précis that summarizes salient objects and their dynamics within a video clip. Salient moving objects are detected within clips using a novel stochastic sampling technique that identifies, clusters and then tracks regions exhibiting affine motion coherence within the clip. Tracks are analyzed to determine salient instants at which motion and/or appearance changes significantly, and the resulting objects arranged in a stylized composition optimized to reduce visual clutter and enhance understanding of scene content through classification and depiction of motion type and trajectory. The result is an object-level visual gist of the clip, obtained with full automation and depicting content and motion with greater descriptive power than prior approaches. We demonstrate these benefits through a user study in which the comprehension of our video thumbnails is compared to the state of the art over a wide variety of sports footage.
Collomosse JP, Kindberg T (2009) Encoder and Decoder and Methods of Encoding and Decoding Sequence Information,
Kim J, Collomosse JP (2014) Incremental Transfer Learning for Object Classification in Streaming Video, 2014 IEEE International Conference on Image Processing (ICIP) pp. 2729-2733 IEEE
We present a new incremental learning framework for realtime object recognition in video streams. ImageNet is used to bootstrap a set of one-vs-all incrementally trainable SVMs which are updated by user annotation events during streaming. We adopt an inductive transfer learning (ITL) approach to warp the video feature space to the ImageNet feature space, so enabling the incremental updates. Uniquely, the transformation used for the ITL warp is also learned incrementally using the same update events. We demonstrate a semi-automated video logging (SAVL) system using our incrementally learned ITL approach and show this to outperform existing SAVL which uses non-incremental transfer learning.
Collomosse JP, Hall PM (2006) Video motion analysis for the synthesis of dynamic cues and Futurist art, GRAPHICAL MODELS 68 (5-6) pp. 402-414 ACADEMIC PRESS INC ELSEVIER SCIENCE
Collomosse JP, Hall PM (2005) Video Paintbox: The fine art of video painting, COMPUTERS & GRAPHICS-UK 29 (6) pp. 862-870 PERGAMON-ELSEVIER SCIENCE LTD
We present a novel hybrid representation for character animation from 4D Performance Capture (4DPC) data which combines skeletal control with surface motion graphs. 4DPC data are temporally aligned 3D mesh sequence reconstructions of the dynamic surface shape and associated appearance from multiple view video. The hybrid representation supports the production of novel surface sequences which satisfy constraints from user specified key-frames or a target skeletal motion. Motion graph path optimisation concatenates fragments of 4DPC data to satisfy the constraints whilst maintaining plausible surface motion at transitions between sequences. Spacetime editing of the mesh sequence using a learnt part-based Laplacian surface deformation model is performed to match the target skeletal motion and transition between sequences. The approach is quantitatively evaluated for three 4DPC datasets with a variety of clothing styles. Results for key-frame animation demonstrate production of novel sequences which satisfy constraints on timing and position of less than 1% of the sequence duration and path length. Evaluation of motion capture driven animation over a corpus of 130 sequences shows that the synthesised motion accurately matches the target skeletal motion. The combination of skeletal control with the surface motion graph extends the range and style of motion which can be produced whilst maintaining the natural dynamics of shape and appearance from the captured performance.
Collomosse JP, Rowntree D, Hall PM (2005) Stroke surfaces: Temporally coherent artistic animations from video, IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 11 (5) pp. 540-549 IEEE COMPUTER SOC
Keitler P, Pankratz F, Schwerdtfeger B, Pustka D, Rödiger W, Klinker G, Rauch C, Chathoth A, Collomosse JP, Song Y-Z (2009) Mobile augmented reality based 3D snapshots, ISMAR pp. 199-200 IEEE Computer Society
Falling hardware costs have prompted an explosion in casual video capture by domestic users. Yet, this video is infrequently accessed post-capture and often lies dormant on users' PCs. We present a system to breathe life into home video repositories, drawing upon artistic stylization to create a 'Digital Ambient Display' that automatically selects, stylizes and transitions between videos in a semantically meaningful sequence. We present a novel algorithm based on multi-label graph cut for segmenting video into temporally coherent region maps. These maps are used to both stylize video into cartoons and paintings, and measure visual similarity between frames for smooth sequence transitions. We demonstrate coherent segmentation and stylization over a variety of home videos.
This paper surveys the field of non-photorealistic rendering (NPR), focusing on techniques for transforming 2D input (images and video) into artistically stylized renderings. We first present a taxonomy of the 2D NPR algorithms developed over the past two decades, structured according to the design characteristics and behavior of each technique. We then describe a chronology of development from the semi-automatic paint systems of the early nineties, through to the automated painterly rendering systems of the late nineties driven by image gradient analysis. Two complementary trends in the NPR literature are then addressed, with reference to our taxonomy. First, the fusion of higher level computer vision and NPR, illustrating the trends toward scene analysis to drive artistic abstraction and diversity of style. Second, the evolution of local processing approaches toward edge-aware filtering for real-time stylization of images and video. The survey then concludes with a discussion of open challenges for 2D NPR identified in recent NPR symposia, including topics such as user and aesthetic evaluation.
Trumble M, Gilbert A, Malleson C, Hilton A, Collomosse J, Total Capture, University of Surrey
We present a novel algorithm for the semantic labeling of photographs shared via social media. Such imagery is diverse, exhibiting high intra-class variation that demands large training data volumes to learn representative classifiers. Unfortunately image annotation at scale is noisy, resulting in errors in the training corpus that confound classifier accuracy. We show how evolutionary algorithms may be applied to select a 'purified' subset of the training corpus to optimize classifier performance. We demonstrate our approach over a variety of image descriptors (including deeply learned features) and support vector machines.
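As a sketch of the idea, a minimal elitist genetic algorithm can evolve a binary selection mask over the training corpus; the fitness callback, population size and genetic operators below are illustrative assumptions, not the paper's actual configuration.

```python
import random

def evolve_subset(n_items, fitness, pop_size=20, gens=30, mutation=0.05, seed=0):
    """Evolve a binary mask over the training corpus that maximises `fitness`.

    `fitness` scores a subset (tuple of selected indices), e.g. the validation
    accuracy of a classifier trained on that subset (hypothetical callback).
    """
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_items)] for _ in range(pop_size)]

    def score(mask):
        subset = tuple(i for i, keep in enumerate(mask) if keep)
        return fitness(subset)

    for _ in range(gens):
        pop.sort(key=score, reverse=True)
        elite = pop[: pop_size // 2]                    # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_items)             # one-point crossover
            child = a[:cut] + b[cut:]
            child = [not g if rng.random() < mutation else g for g in child]
            children.append(child)
        pop = elite + children

    best = max(pop, key=score)
    return [i for i, keep in enumerate(best) if keep]
```

In practice the fitness would be the held-out accuracy of a classifier retrained on each candidate subset, which dominates the cost of the search.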
We present an efficient representation for sketch based image retrieval (SBIR) derived from a triplet loss convolutional neural network (CNN). We treat SBIR as a cross-domain modelling problem, in which a depiction invariant embedding of sketch and photo data is learned by regression over a siamese CNN architecture with half-shared weights and a modified triplet loss function. Uniquely, we demonstrate the ability of our learned image descriptor to generalise beyond the categories of object present in our training data, forming a basis for general cross-category SBIR. We explore appropriate strategies for training, and for deriving a compact image descriptor from the learned representation suitable for indexing data on resource constrained (e.g. mobile) devices. We show the learned descriptors to outperform the state of the art in SBIR on the de facto standard Flickr15k dataset using a significantly more compact (56 bits per image, i.e. approximately 105KB total) search index than previous methods.
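The triplet objective at the heart of this family of approaches follows the standard margin form; this minimal scalar sketch (the margin value and toy embeddings are illustrative, not the paper's configuration) shows what is being minimised: the sketch (anchor) is pulled toward its matching photo and pushed away from a non-matching one.

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: pull the sketch (anchor) embedding toward
    its matching photo (positive) and away from a non-match (negative)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, sqdist(anchor, positive) - sqdist(anchor, negative) + margin)
```

When the positive already sits well inside the margin the loss is zero and the triplet contributes no gradient, which is why hard-triplet mining matters in practice.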
We describe a novel system for synthesising video choreography using sketched visual storyboards comprising human poses (stick men) and action labels. First, we describe an algorithm for searching archival dance footage using sketched pose. We match using an implicit representation of pose parsed from a mix of challenging low and high fidelity footage. In a training pre-process we learn a mapping between a set of exemplar sketches and corresponding pose representations parsed from the video, which are generalized at query-time to enable retrieval over previously unseen frames, and over additional unseen videos. Second, we describe how a storyboard of sketched poses, interspersed with labels indicating connecting actions, may be used to drive the synthesis of novel video choreography from the archival footage. We demonstrate our retrieval and synthesis algorithms over both low fidelity PAL footage from the UK Digital Dance Archives (DDA) repository of contemporary dance, circa 1970, and over higher-definition studio captured footage.
We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from a MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.
We present a fast technique for retrieving video clips using free-hand sketched queries. Visual keypoints within each video are detected and tracked to form short trajectories, which are clustered to form a set of spacetime tokens summarising video content. A Viterbi process matches a space-time graph of tokens to a description of colour and motion extracted from the query sketch. Inaccuracies in the sketched query are ameliorated by computing path cost using a Levenshtein (edit) distance. We evaluate over datasets of sports footage.
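The edit-distance path cost mentioned above is the classic Levenshtein recurrence; a compact dynamic-programming sketch over generic token sequences (the space-time tokenisation itself is not shown) illustrates how sketching inaccuracies map to insertions, deletions and substitutions.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences, computed row by row.
    Used in spirit to tolerate inaccuracies when matching a sketched
    token path against a space-time token graph."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (x != y))) # substitution / match
        prev = cur
    return prev[-1]
```

The same recurrence generalises to weighted costs, which is how a Viterbi path score can absorb token-level mismatches rather than rejecting the whole path.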
This article reports a user-experience study in which a group of 18 older adults used a location-based mobile multimedia service in the setting of a rural nature reserve. The prototype system offered a variety of means of obtaining rich multimedia content from oak waymarker posts using a mobile phone. Free text questionnaires and focus groups were employed to investigate participants' experiences with the system and their attitudes to the use of mobile and pervasive systems in general. The users' experiences with the system were positive with respect to the design of the system in the context of the surrounding natural environment. However, the authors found some significant barriers to their adoption of mobile and pervasive systems as replacements for traditional information sources.
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull (PVH) data derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/
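The forward kinematic solve of the IMU stream can be illustrated with a toy 2D chain; a real solve operates on a full 3D skeleton with calibrated sensor orientations, so the bone lengths and planar angles below are purely illustrative.

```python
import math

def forward_kinematics(bone_lengths, joint_angles):
    """Toy 2D forward-kinematic solve: accumulate joint rotations along a
    chain to recover joint positions from bone lengths and relative angles."""
    x = y = theta = 0.0
    positions = [(x, y)]                    # root at the origin
    for length, angle in zip(bone_lengths, joint_angles):
        theta += angle                      # rotations accumulate down the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        positions.append((x, y))
    return positions
```

A two-bone chain with a 90-degree elbow bend places the end effector one unit along each axis, which is the kind of joint-position output the pose stream is fused against.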
The falling cost of digital cameras and camcorders has encouraged the creation of massive collections of personal digital media. However, once captured, this media is infrequently accessed and often lies dormant on users' PCs. We present a system to breathe life into home digital media collections, drawing upon artistic stylization to create a 'Digital Ambient Display' that automatically selects, stylizes and transitions between digital contents in a semantically meaningful sequence. We present a novel algorithm based on multi-label graph cut for segmenting video into temporally coherent region maps. These maps are used to both stylize video into cartoons and paintings, and measure visual similarity between frames for smooth sequence transitions. The system automatically structures the media collection into a hierarchical representation based on visual content and semantics. Graph optimization is applied to adaptively sequence content for display in a coarse-to-fine manner, driven by user attention level (detected in real-time by a webcam). Our system is deployed on embedded hardware in the form of a compact digital photo frame. We demonstrate coherent segmentation and stylization over a variety of home videos and photos. We evaluate our media sequencing algorithm via a small-scale user study, indicating that our adaptive display conveys a more compelling media consumption experience than simple linear 'slide-shows'.
We propose and evaluate several deep network architectures for measuring the similarity between sketches and photographs, within the context of the sketch based image retrieval (SBIR) task. We study the ability of our networks to generalize across diverse object categories from limited training data, and explore in detail strategies for weight sharing, pre-processing, data augmentation and dimensionality reduction. In addition to a detailed comparative study of network configurations, we contribute by describing a hybrid multi-stage training network that exploits both contrastive and triplet networks to exceed state of the art performance on several SBIR benchmarks by a significant margin.
Content-aware image completion or in-painting is a fundamental tool for the correction of defects or removal of objects in images. We propose a non-parametric in-painting algorithm that enforces both structural and aesthetic (style) consistency within the resulting image. Our contributions are two-fold: 1) we explicitly disentangle image structure and style during patch search and selection to ensure a visually consistent look and feel within the target image; 2) we perform adaptive stylization of patches to conform the aesthetics of selected patches to the target image, so harmonizing the integration of selected patches into the final composition. We show that explicit consideration of visual style during in-painting delivers excellent qualitative and quantitative results across varied image styles and content, over the Places2 scene photographic dataset and a challenging new in-painting dataset of artwork derived from BAM!
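The structure/style disentanglement during patch selection can be sketched as a two-term weighted search; the feature extractors, the weighting and the candidate layout below are illustrative assumptions rather than the paper's actual descriptors.

```python
def select_patch(target_struct, target_style, candidates, w_style=0.5):
    """Choose the source patch whose structure best matches the hole context
    while its style best matches the target image. Each candidate is a
    (structure_feature, style_feature) pair; names and weight are toy values."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(
        range(len(candidates)),
        key=lambda i: sqdist(candidates[i][0], target_struct)
        + w_style * sqdist(candidates[i][1], target_style),
    )
```

Keeping the two distances separate is what allows a structurally perfect but stylistically jarring patch to lose to one that fits both criteria moderately well.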
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras, and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered, including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor environments.
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4X, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.
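The dual loss can be sketched as a weighted sum of a volumetric reconstruction term and a joint-position term; the mean-squared-error form and the weighting here are assumptions for illustration, not the paper's exact formulation.

```python
def dual_loss(recon_pred, recon_true, joints_pred, joints_true, w_joints=1.0):
    """Toy dual loss: reconstruction MSE (body shape) plus weighted MSE on
    skeletal joint positions, both flattened to 1D for simplicity."""
    def mse(pred, true):
        return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    return mse(recon_pred, recon_true) + w_joints * mse(joints_pred, joints_true)
```

Training a single latent code against both terms is what forces the embedding to carry pose information while remaining decodable into volumetric shape.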
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed with state of the art performance reported on the popular Human 3.6M dataset, the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/
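The final fusion step, concatenating the two stream embeddings and applying a fully connected layer, reduces to a matrix-vector product; the toy weights and tiny embedding sizes below are illustrative stand-ins for the learned parameters.

```python
def fuse_streams(vision_embed, imu_embed, weights, bias):
    """Late fusion sketch: concatenate the MVV pose embedding with the IMU
    forward-kinematics embedding and apply one fully connected layer."""
    x = list(vision_embed) + list(imu_embed)          # concatenation
    return [sum(w * xi for w, xi in zip(row, x)) + b  # linear layer, no activation
            for row, b in zip(weights, bias)]
```

Because the layer sees both modalities at once, it can learn to down-weight whichever stream is unreliable for a given joint, which is the mechanism behind the ambiguity resolution described above.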
We propose a novel measure of visual similarity for image retrieval that incorporates both structural and aesthetic (style) constraints. Our algorithm accepts a query as a sketched shape, and a set of one or more contextual images specifying the desired visual aesthetic. A triplet network is used to learn a feature embedding capable of measuring style similarity independent of structure, delivering significant gains over previous networks for style discrimination. We incorporate this model within a hierarchical triplet network to unify and learn a joint space from two discriminatively trained streams for style and structure. We demonstrate that this space enables, for the first time, style-constrained sketch search over a diverse domain of digital artwork comprising graphics, paintings and drawings. We also briefly explore alternative query modalities.
We present a scalable system for sketch-based image retrieval (SBIR), extending the state of the art Gradient Field HoG (GF-HoG) retrieval framework through two technical contributions. First, we extend GF-HoG to enable color-shape retrieval and comprehensively evaluate several early- and late-fusion approaches for integrating the modality of color, considering both the accuracy and speed of sketch retrieval. Second, we propose an efficient inverse-index representation for GF-HoG that delivers scalable search with interactive query times over millions of images. A mobile app demo accompanies this paper (Android).
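The inverse-index idea can be sketched with quantised visual words mapped to image identifiers; the data layout and the simple voting-based ranking below are simplified assumptions, not the paper's exact scoring scheme.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each quantised visual word to the set of images containing it,
    so a sketch query only touches the postings for its own words."""
    index = defaultdict(set)
    for image_id, words in docs.items():
        for w in words:
            index[w].add(image_id)
    return index

def query(index, sketch_words):
    """Rank images by how many of the sketch's visual words they share."""
    votes = defaultdict(int)
    for w in sketch_words:
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes, key=votes.get, reverse=True)
```

The cost of a query scales with the postings touched rather than the collection size, which is what makes interactive times over millions of images plausible.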
Deep Learning methods are currently the state of the art in many Computer Vision and Image Processing problems, in particular image classification. After years of intensive investigation, a few models have matured and become important tools, including Convolutional Neural Networks (CNNs), Siamese and Triplet Networks, Auto-Encoders (AEs) and Generative Adversarial Networks (GANs). The field is fast-paced and there is a lot of terminology to catch up on for those who want to venture into Deep Learning waters. This paper aims to introduce the most fundamental concepts of Deep Learning for Computer Vision, in particular CNNs, AEs and GANs, including their architectures, inner workings and optimization. We offer an updated description of the theoretical and practical knowledge of working with those models. After that, we describe Siamese and Triplet Networks, not often covered in tutorial papers, as well as review the literature on recent and exciting topics such as visual stylization, pixel-wise prediction and video processing. Finally, we discuss the limitations of Deep Learning for Computer Vision.
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
We describe a non-parametric algorithm for multiple-viewpoint video inpainting. Uniquely, our algorithm addresses the domain of wide baseline multiple-viewpoint video (MVV) with no temporal look-ahead, at near real-time speed. A Dictionary of Patches (DoP) is built using multi-resolution texture patches reprojected from geometric proxies available in the alternate views. We dynamically update the DoP over time, and a Markov Random Field optimisation over depth and appearance is used to resolve and align a selection of multiple candidates for a given patch; this ensures the inpainting of large regions in a plausible manner, conserving both spatial and temporal coherence. We demonstrate the removal of large objects (e.g. people) on challenging indoor and outdoor MVV exhibiting cluttered, dynamic backgrounds and moving cameras.
Performance capture is used extensively within the creative industries to efficiently produce high quality, realistic character animation in movies and video games. Existing commercial systems for performance capture are limited to working within constrained environments, requiring wearable visual markers or suits, and frequently specialised imaging devices (e.g. infra-red cameras), both of which limit deployment scenarios (e.g. indoor capture). This thesis explores novel methods to relax these constraints, applying machine learning techniques to estimate human pose using regular video cameras and without the requirement of visible markers on the performer. This unlocks the potential for co-production of principal footage and performance capture data, leading to production efficiencies. For example, using an array of static witness cameras deployed on-set, performance capture data for a video games character accompanying a major movie franchise might be captured at the same time the movie is shot. The need to call the actor for a second day of shooting in a specialised motion capture (mo-cap) facility is avoided, saving time and money, since performance capture was possible without corrupting the principal movie footage with markers or constraining set design. Furthermore, if such performance capture data is available in real-time, the director may immediately pre-visualize the look and feel of the final character animation, enabling tighter capture iteration and improved creative direction. This further enhances the potential for production efficiencies.
The core technical contributions of this thesis are novel software algorithms that leverage machine learning to fuse data from multiple sensors, namely synchronised video cameras and, in some cases, inertial measurement units (IMUs), in order to robustly estimate human body pose over time, doing so at real-time or near real-time rates.
Firstly, a hardware-accelerated capture solution is developed for acquiring coarse volumetric occupancy data from multiple viewpoint video footage, in the form of a probabilistic visual hull (PVH). Using CUDA-based GPU acceleration the PVH may be estimated in real-time, and subsequently used to train machine learning algorithms to infer human skeletal pose from PVH data.
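A probabilistic visual hull can be sketched as a per-voxel product of per-view foreground probabilities; the toy pinhole projection below stands in for the full calibrated extrinsics and the CUDA implementation described above, and all camera parameters are illustrative.

```python
def compute_pvh(voxels, cameras, fg_maps):
    """Coarse probabilistic visual hull (PVH): a voxel's occupancy is the
    product over views of the foreground probability at its projection.
    `cameras` holds toy (focal, cx, cy) pinhole parameters with the camera
    at the origin looking down +z, so voxels are assumed to have z > 0."""
    pvh = []
    for (x, y, z) in voxels:
        p = 1.0
        for (f, cx, cy), fg in zip(cameras, fg_maps):
            u = int(round(f * x / z + cx))   # perspective projection
            v = int(round(f * y / z + cy))
            if 0 <= v < len(fg) and 0 <= u < len(fg[0]):
                p *= fg[v][u]                # per-pixel foreground probability
            else:
                p = 0.0                      # outside every silhouette: empty
                break
        pvh.append(p)
    return pvh
```

Because the per-view probabilities multiply, one confident background observation is enough to carve a voxel away, which mirrors the silhouette-intersection behaviour of a classical visual hull.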
Initially a variety of machine learning approaches for skeletal joint pose estimation are explored, contrasting classical and deep inference methods. By quantizing volumetric data into a two-dimensional (2D) spherical histogram representation, it is shown that convolutional neural network (CNN) architectures used traditionally for object recognition may be re-purposed for skeletal joint estimation given a suitable training methodology and data augmentation strategy.
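The spherical histogram quantization can be sketched as binning point directions, taken from the volume centroid, by azimuth and elevation; the bin counts and the centroid-at-origin assumption below are illustrative simplifications.

```python
import math

def spherical_histogram(points, n_az=8, n_el=4):
    """Quantise point directions (relative to the origin, standing in for
    the volume centroid) into a 2D azimuth-elevation histogram, yielding an
    image-like input that an off-the-shelf 2D CNN can consume."""
    hist = [[0] * n_az for _ in range(n_el)]
    for (x, y, z) in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0:
            continue                                       # skip the centre point
        az = (math.atan2(y, x) + math.pi) / (2 * math.pi)  # normalised to [0, 1)
        el = (math.asin(z / r) + math.pi / 2) / math.pi    # normalised to [0, 1]
        hist[min(int(el * n_el), n_el - 1)][min(int(az * n_az), n_az - 1)] += 1
    return hist
```

Flattening the 3D occupancy to a fixed 2D grid is what lets recognition-style CNNs be reused unchanged, at the cost of discarding radial detail.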
The generalization of such architectures to a fully volumetric (3D) CNN is explored, achieving state of the art performance at human pose estimation using a volumetric auto-encoder (hour-glass) architecture that emulates networks traditionally used for de-noising and super-resolution (up-scaling) of 2D data. A framework is developed that is capable of simultaneously estimating human pose from volumetric data, whilst also up-scaling that volumetric data to enable fine-grain estimation of surface detail given a deeply learned prior from previous performance. The method is shown to generalise well even when that prior is learned from different subjects performing different movements in different studio camera configurations.
Performance can be further improved using a learned temporal model of the data, and through the fusion of complementary sensor modalities, video and IMUs, to enhance the accuracy of human pose estimation inferred from a volumetric CNN. Although IMUs have been applied in the performance capture domain for many years, they are prone to drift, limiting their use to short capture sequences. The novel fusion of IMU with video data enables improved global localization and so reduced drift.
The deluge of visual content on the Internet - from user-generated content to commercial image collections - motivates intuitive new methods for searching digital image content: how can we find certain images in a database of millions?
Sketch-based image retrieval (SBIR) is an emerging research topic in which a free-hand drawing can be used to visually query photographic images. SBIR is aligned to emerging trends for visual content consumption on mobile touch-screen based devices, for which gestural interactions such as sketch are a natural alternative to textual input.
This thesis presents several contributions to the literature of SBIR. First, we propose a cross-domain learning framework that maps both sketches and images into a joint embedding space invariant to depictive style, while preserving semantics. The resulting embedding enables direct comparison and search between sketches and images and is based upon a multi-branch convolutional neural network (CNN) trained using unique parameter sharing and training schemes. The deeply learned embedding is shown to yield state-of-the-art retrieval performance on several SBIR benchmarks.
Second, under two separate works we propose to disambiguate sketched queries by combining sketched shape with a secondary modality: SBIR with colour and with aesthetic context. The former enables querying with coloured line-art sketches. Colour and shape features are extracted locally using a modified version of the gradient field orientation histogram (GF-HoG) before being globally pooled using dictionary learning. Various colour-shape fusion strategies are explored, coupled with an efficient indexing scheme for fast retrieval performance. The latter supports querying using a sketched shape accompanied by one or several images serving as an aesthetic constraint governing the visual style of search results. We propose to model structure and style separately, disentangling one modality from the other, and then learn structure-style fusion using a hierarchical triplet network. This method enables further studies beyond SBIR such as style blending, style analogy and retrieval with alternative-modal queries.
Third, we explore mid-grain SBIR -- a novel field requiring retrieved images to match both the category and key visual characteristics of the sketch, without demanding fine-grain, instance-level matching of a specific object instance. We study a semi-supervised approach that requires mainly class-labelled sketches and images plus a small number of instance-labelled sketch-image pairs. This approach involves aligning sketch and image embeddings before pooling them into clusters from which mid-grain similarity may be measured. Our learned model demonstrates not only intra-category discrimination (mid-grain) but also improved inter-category discrimination (coarse-grain) on a newly created MidGrain65c dataset.