We present a generic, efficient and iterative algorithm for interactively clustering classes of images and videos. The approach moves away from the use of large hand-labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of "seed" examples. Two efficient data mining tools originally developed for text analysis, min-Hash and APriori, are used and extended to achieve both speed and scalability on large image and video datasets. Inspired by the Bag-of-Words (BoW) architecture, the idea of an image signature is introduced as a simple descriptor on which nearest neighbour classification can be performed. The image signature is then dynamically expanded to identify common features amongst samples of the same class. The iterative approach uses APriori to identify common and distinctive elements of a small set of labelled true and false positive signatures. These elements are then accentuated in the signature to increase similarity between examples and "pull" positive classes together. By repeating this process, the accuracy of similarity increases dramatically despite only a few training examples; only 10% of the labelled ground truth is needed compared to other approaches. It is tested on two image datasets, including the Caltech101 dataset, and on three state-of-the-art action recognition datasets. On the YouTube video dataset the accuracy increases from 72% to 97% using only 44 labelled examples from a dataset of over 1200 videos. The approach is both scalable and efficient, with an iteration on the full YouTube dataset taking around 1 minute on a standard desktop machine.
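To make the role of min-Hash concrete, the following minimal Python sketch (not the authors' code; the visual-word symbols, hash scheme and number of hash functions are illustrative assumptions) estimates the Jaccard similarity between two image signatures treated as sets of symbols:

import random

def minhash_signature(symbols, num_hashes=50, seed=0):
    """Return the vector of minimum hash values, one per hash function."""
    rng = random.Random(seed)
    # One (a, b) pair per universal hash function h(x) = (a*x + b) % prime.
    prime = 2_147_483_647
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % prime for s in symbols) for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing min-hashes estimates Jaccard set similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Two toy signatures sharing half their visual words.
img1 = {"vw3", "vw17", "vw29", "vw54"}
img2 = {"vw3", "vw17", "vw80", "vw95"}
print(estimated_jaccard(minhash_signature(img1), minhash_signature(img2)))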
"Actions in the wild" is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies or the Internet. State-of-the-art recognition rates in this domain are orders of magnitude lower than in more contrived settings, one of the primary reasons being the huge variability within each action class. We propose to tackle recognition in the wild by automatically breaking complex action categories into multiple modes/groups, and training a separate classifier for each mode. This is achieved using RANSAC, which identifies and separates the modes while rejecting outliers. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. Our results demonstrate the validity of the approach, and for classes which exhibit multi-modality, we achieve in excess of double the performance over approaches that assume single modality.
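As a rough illustration of the idea, the Python sketch below discovers modes in a toy one-dimensional feature space by RANSAC-style sampling and reweights examples between rounds; the thresholds, weighting factors and data are assumptions for illustration, not the published procedure:

import numpy as np

def find_modes(x, n_modes=2, iters=200, thresh=1.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.ones(len(x))           # every training example starts equal
    modes = []
    for _ in range(n_modes):
        best_center, best_score = None, -1.0
        for _ in range(iters):
            center = x[rng.integers(len(x))]       # hypothesise a mode
            inliers = np.abs(x - center) < thresh
            score = weights[inliers].sum()         # weighted consensus
            if score > best_score:
                best_center, best_score = center, score
        modes.append(best_center)
        inliers = np.abs(x - best_center) < thresh
        weights[inliers] *= 0.1    # down-weight explained examples so the
                                   # next round focuses on the remaining ones
        weights[~inliers] *= 2.0   # re-weight outliers for later inclusion
    return modes

rng_data = np.random.default_rng(42)
samples = np.concatenate([rng_data.normal(0, 0.5, 50), rng_data.normal(5, 0.5, 50)])
print(find_modes(samples))         # two centres, near 0 and near 5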
Merino L, Gilbert A, Bowden R, Illingworth J, Capitán J, Ollero A (2012) Data fusion in ubiquitous networked robot systems for urban services, Annales des Telecommunications/Annals of Telecommunications pp. 1-21
There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within the buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, by being more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimations on the positions of people from cameras on board robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to provide an increase in performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time on a large outdoor environment, including 22 nonoverlapping cameras, WSN, and several robots. © 2012 Institut Mines-Télécom and Springer-Verlag.
Villegas M, Muller H, Seco de Herrera A, Schaer R, Bromuri S, Gilbert A, Piras L, Wang J, Yan F, Ramisa A, Dellandrea E, Gaizauskas R, Mikolajczyk K, Puigcerver J, Toselli A, Sánchez J, Vidal E (2016) General Overview of ImageCLEF at the CLEF 2016 Labs, Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science 9822 pp. 267-285
This paper presents an overview of the ImageCLEF 2016 evaluation campaign, an event that was organized as part of the CLEF (Conference and Labs of the Evaluation Forum) labs 2016. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to collections of images in various usage scenarios and domains. In 2016, the 14th edition of ImageCLEF, three main tasks were proposed: 1) identification, multi-label classification and separation of compound figures from biomedical literature; 2) automatic annotation of general web images; and 3) retrieval from collections of scanned handwritten documents. The handwritten retrieval task was the only completely novel task this year, although the other two tasks introduced several modifications to keep the proposed tasks challenging.
Since 2010, ImageCLEF has run a scalable image annotation task, to promote research into the annotation of images using noisy web page data. It aims to develop techniques to allow computers to describe images reliably, localise the different concepts depicted and generate descriptions of the scenes. The primary goal of the challenge is to encourage creative ideas of using web page data to improve image annotation. Three subtasks and two pilot teaser tasks were available to participants; all tasks use a single mixed-modality data source of 510,123 web page items for both training and test. The dataset included raw images, textual features obtained from the web pages on which the images appeared, as well as extracted visual features; it was formed by extracting items from the Web via queries to popular image search engines. For the main subtasks, the development and test sets were both taken from the "training set". For the teaser tasks, 200,000 web page items were reserved for testing, and a separate development set was provided. The 251 concepts were chosen to be visual objects that are localizable and that are useful for generating textual descriptions of the visual content of images, and were mined from the texts of our extensive database of image-webpage pairs. This year seven groups participated in the task, submitting over 50 runs across all subtasks, and all participants also provided working notes papers. In general, the groups' performance is impressive across the tasks, and there are interesting insights into these very relevant challenges.
This paper proposes a generic approach combining a bottom-up (low-level) visual detector with a top-down (high-level) fuzzy first-order logic (FOL) reasoning framework in order to detect pedestrians from a moving vehicle. Detections from the low-level visual corner based detector are fed into the logical reasoning framework as logical facts. A set of FOL clauses utilising fuzzy predicates with piecewise linear continuous membership functions associates a fuzzy confidence (a degree-of-truth) to each detector input. Detections associated with lower confidence functions are deemed as false positives and blanked out, thus adding top-down constraints based on global logical consistency of detections. We employ a state-of-the-art visual detector on a challenging pedestrian detection dataset, and demonstrate an increase in detection performance when used in a framework that combines bottom-up detections with (fuzzy FOL-based) top-down constraints. © 2012 ICPR Org Committee.
Villegas M, Muller H, Gilbert A, Piras L, Wang J, Mikolajczyk K, de Herrera AGS, Bromuri S, Amin MA, Mohammed MK, Acar B, Uskudarli S, Marvasti NB, Aldana JF, Roldan Garcia MDM (2015) General Overview of ImageCLEF at the CLEF 2015 Labs, Experimental IR Meets Multilinguality, Multimodality, and Interaction 9283 pp. 444-461, Springer-Verlag Berlin
The field of Action Recognition has seen a large increase in activity in recent years. Much of the progress has been through incorporating ideas from single-frame object recognition and adapting them for temporal-based action recognition. Inspired by the success of interest points in the 2D spatial domain, their 3D (space-time) counterparts typically form the basic components used to describe actions, and in action recognition the features used are often engineered to fire sparsely. This is to ensure that the problem is tractable; however, this can sacrifice recognition accuracy as it cannot be assumed that the optimum features in terms of class discrimination are obtained from this approach. In contrast, we propose to initially use an overcomplete set of simple 2D corners in both space and time. These are grouped spatially and temporally using a hierarchical process, with an increasing search area. At each stage of the hierarchy, the most distinctive and descriptive features are learned efficiently through data mining. This allows large amounts of data to be searched for frequently reoccurring patterns of features. At each level of the hierarchy, the mined compound features become more complex, discriminative, and sparse. This results in fast, accurate recognition with real-time performance on high-resolution video. As the compound features are constructed and selected based upon their ability to discriminate, their speed and accuracy increase at each level of the hierarchy. The approach is tested on four state-of-the-art data sets: the popular KTH data set, to provide a comparison with other state-of-the-art approaches; the Multi-KTH data set, to illustrate performance at simultaneous multiaction classification despite no explicit localization information being provided during training; and finally the recent Hollywood and Hollywood2 data sets, which provide challenging complex actions taken from commercial movie sequences. For all four data sets, the proposed hierarchical approach outperforms all other methods reported thus far in the literature and can achieve real-time operation.
In this paper, we propose a computational model for social interaction between three people in a conversation, and demonstrate results using human video motion synthesis. We utilised semi-supervised computer vision techniques to label social signals between the people, such as laughing, head nods and gaze direction. Data mining is used to deduce frequently occurring patterns of social signals between a speaker and a listener in both interested and not interested social scenarios, and the mined confidence values are used as conditional probabilities to animate social responses. The human video motion synthesis is done using an appearance model to learn a multivariate probability distribution, combined with a transition matrix to derive the likelihood of motion given a pose configuration. Our system uses social labels to more accurately define motion transitions and build a texture motion graph. Traditional motion synthesis algorithms are best suited to large human movements like walking and running, where motion variations are large and prominent. Our method focuses on generating more subtle human movement like head nods. The user can then control who speaks and the interest level of the individual listeners, resulting in social interactive conversational agents.
We present an approach to automatically expand the annotation of images using the internet as an additional information source. The novelty of the work is in the expansion of image tags by automatically introducing new unseen complex linguistic labels which are collected unsupervised from associated webpages. Taking a small subset of existing image tags, a web-based search retrieves additional textual information. Both a textual bag-of-words model and a visual bag-of-words model are combined and symbolised for data mining. Association rule mining is then used to identify rules which relate words to visual contents. Unseen images that fit these rules are re-tagged. This approach allows a large number of additional annotations to be added to unseen images, on average 12.8 new tags per image, with an 87.2% true positive rate. Results are shown on two datasets, including a new 2800-image annotation dataset of landmarks; the results include pictures of buildings being tagged with the architect, the year of construction and even events that have taken place there. This widens the impact of tag annotations and their use in retrieval. This dataset is made available along with tags and the 1970 webpages and additional images which form the information corpus. In addition, results for a common state-of-the-art dataset, MIRFlickr25000, are presented for comparison of the learning framework against previous works. © 2013 Springer-Verlag.
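The following toy Python sketch illustrates the flavour of association rule mining over mixed textual and visual words; the transactions, thresholds and symbol names are invented for illustration and are not the mined rules from the paper:

from itertools import combinations
from collections import Counter

# Each "transaction" mixes textual tags (txt:) with visual words (vis:).
transactions = [
    {"txt:architect", "txt:gaudi", "vis:w12", "vis:w40"},
    {"txt:architect", "vis:w12", "vis:w40"},
    {"txt:bridge", "vis:w7"},
    {"txt:architect", "vis:w12"},
]

min_support, min_conf = 2, 0.6
pair_counts = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))
item_counts = Counter(i for t in transactions for i in t)

for pair, supp in pair_counts.items():
    if supp < min_support:
        continue
    a, b = tuple(pair)
    for lhs, rhs in ((a, b), (b, a)):
        conf = supp / item_counts[lhs]
        # Keep rules that relate a visual word to a textual tag (or vice versa).
        if conf >= min_conf and lhs.split(":")[0] != rhs.split(":")[0]:
            print(f"{lhs} -> {rhs}  (support={supp}, confidence={conf:.2f})")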
We propose a human performance capture system employing convolutional neural networks (CNN) to estimate human pose from a volumetric representation of a performer derived from multiple view-point video (MVV). We compare direct CNN pose regression to the performance of an affine invariant pose descriptor learned by a CNN through a classification task. A non-linear manifold embedding is learned between the descriptor and articulated pose spaces, enabling regression of pose from the source MVV. The results are evaluated against ground truth pose data captured using a Vicon marker-based system and demonstrate good generalisation over a range of human poses, providing a system that requires no special suit to be worn by the performer.
Oshin O, Gilbert A, Bowden R (2014) Capturing relative motion and finding modes for action recognition in the wild, Computer Vision and Image Understanding
"Actions in the wild" is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies  or Internet databases . This paper presents an approach to the categorisation of such activity in video, which is based solely on the relative distribution of spatio-temporal interest points. Presenting the Relative Motion Descriptor, we show that the distribution of interest points alone (without explicitly encoding their neighbourhoods) effectively describes actions. Furthermore, given the huge variability of examples within action classes in natural settings, we propose to further improve recognition by automatically detecting outliers, and breaking complex action categories into multiple modes. This is achieved using a variant of Random Sampling Consensus (RANSAC), which identifies and separates the modes. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. We demonstrate state-of-the-art performance on five human action datasets. © 2014 Elsevier Inc. All rights reserved.
Gilbert A, Bowden R (2015) Geometric Mining: Scaling Geometric Hashing to Large Datasets, 3rd Workshop on Web-scale Vision and Social Media (VSM), at ICCV 2015
Often within the field of tracking people, only fixed cameras are used. This can mean that when the illumination of the image changes or object occlusion occurs, the tracking can fail. We propose an approach that uses three simultaneous separate sensors. The fixed surveillance cameras track objects of interest across cameras by incrementally learning relationships between regions of the image. Cameras and laser rangefinder sensors onboard robots also provide an estimate of the person's position. Moreover, the signal strength of mobile devices carried by the person can be used to estimate their position. The estimates from all these sources are then combined using data fusion to provide an increase in performance. We present results of the fixed-camera-based tracking operating in real time on a large outdoor environment of over 20 non-overlapping cameras. Moreover, the tracking algorithms for robots and wireless nodes are described. A decentralized data fusion algorithm for combining all this information is presented. ©2009 IEEE.
Gilbert A, Bowden R (2011) Push and pull: Iterative grouping of media, BMVC 2011 - Proceedings of the British Machine Vision Conference 2011
We present an approach to iteratively cluster images and video in an efficient and intuitive manner. While many techniques use the traditional approach of time-consuming ground-truthing of large amounts of data [10, 16, 20, 23], this is increasingly infeasible as dataset size and complexity increase. Furthermore, it is not applicable to the home user, who wants to intuitively group his/her own media without labelling the content. Instead we propose a solution that allows the user to select media that semantically belongs to the same class and use machine learning to "pull" this and other related content together. We introduce an "image signature" descriptor and use min-Hash and greedy clustering to efficiently present the user with clusters of the dataset using multi-dimensional scaling. The image signatures of the dataset are then adjusted by APriori data mining, identifying the common elements between a small subset of image signatures. This is able to both pull together true positive clusters and push apart false positive examples. The approach is tested on real videos harvested from the web using the state-of-the-art YouTube dataset. The accuracy of the correct group label increases from 60.4% to 81.7% using 15 iterations of pulling and pushing the media around, while the process takes only 1 minute to compute the pairwise similarities of the image signatures and visualise the whole YouTube dataset. © 2011. The copyright of this document resides with its authors.
This paper introduces a novel approach to social behaviour recognition governed by the exchange of non-verbal cues between people. We conduct experiments to try to deduce distinct rules that dictate the social dynamics of people in a conversation, and utilise semi-supervised computer vision techniques to extract their social signals such as laughing and nodding. Data mining is used to deduce frequently occurring patterns of social trends between a speaker and listener in both interested and not interested social scenarios. The confidence values from rules are utilised to build a Social Dynamic Model (SDM), that can then be used for classification and visualisation. By visualising the rules generated in the SDM, we can analyse distinct social trends between an interested and not interested listener in a conversation. Results show that these distinctions can be applied generally and used to accurately predict conversational interest.
This paper presents a generic method for recognising and localising human actions in video based solely on the distribution of interest points. The use of local interest points has shown promising results in both object and action recognition. While previous methods classify actions based on the appearance and/or motion of these points, we hypothesise that the distribution of interest points alone contains the majority of the discriminatory information. Motivated by its recent success in rapidly detecting 2D interest points, the semi-naive Bayesian classification method of Randomised Ferns is employed. Given a set of interest points within the boundaries of an action, the generic classifier learns the spatial and temporal distributions of those interest points. This is done efficiently by comparing sums of responses of interest points detected within randomly positioned spatio-temporal blocks within the action boundaries. We present results on the largest and most popular human action dataset using a number of interest point detectors, and demonstrate that the distribution of interest points alone can perform as well as approaches that rely upon the appearance of the interest points. ©2009 IEEE.
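For intuition, the Python sketch below shows the kind of binary test a randomised fern performs over a toy spatio-temporal response volume: each bit compares summed responses inside two randomly positioned blocks, and the test positions are drawn once and reused for every sample. All sizes and the toy volume are assumptions, not the paper's configuration:

import numpy as np

rng = np.random.default_rng(1)

def make_tests(shape, n_tests=8, size=6):
    """Draw the fern's block-pair positions once; reuse them for every sample."""
    def block():
        corner = [rng.integers(0, s - size) for s in shape]
        return tuple(slice(c, c + size) for c in corner)
    return [(block(), block()) for _ in range(n_tests)]

def fern_code(vol, tests):
    """Each binary test compares summed responses in two blocks; the bits
    together index one of 2**len(tests) histogram bins."""
    code = 0
    for i, (a, b) in enumerate(tests):
        code |= int(vol[a].sum() > vol[b].sum()) << i
    return code

shape = (32, 32, 16)                 # toy x, y, t interest-point response map
tests = make_tests(shape)
volume = rng.random(shape)           # stand-in detector responses
print(fern_code(volume, tests))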
In this chapter, we present a generic classifier for detecting spatio-temporal interest points within video, the premise being that, given an interest point detector, we can learn a classifier that duplicates its functionality and which is both accurate and computationally efficient. This means that interest point detection can be achieved independent of the complexity of the original interest point formulation. We extend the naive Bayesian classifier of Randomised Ferns to the spatio-temporal domain and learn classifiers that duplicate the functionality of common spatio-temporal interest point detectors. Results demonstrate accurate reproduction of results with a classifier that can be applied exhaustively to video at frame-rate, without optimisation, in a scanning window approach. © 2010, IGI Global.
Krejov P, Gilbert A, Bowden R (2015) Combining Discriminative and Model Based Approaches for Hand Pose Estimation, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 2, IEEE
This paper presents a solution to the problem of tracking people within crowded scenes. The aim is to maintain individual object identity through a crowded scene which contains complex interactions and heavy occlusions of people. Our approach uses the strengths of two separate methods: a global object detector and a localised frame-by-frame tracker. A temporal relationship model of torso detections, built during low-activity periods, is used to further disambiguate during periods of high activity. A single camera with no calibration and no environmental information is used. Results are compared to a standard tracking method and groundtruth. Two video sequences containing interactions, overlaps and occlusions between people are used to demonstrate our approach. The results show that our technique performs better than a standard tracking method and can cope with challenging occlusions and crowds.
Gilbert A, Bowden R (2015) Data mining for action recognition, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9007 pp. 290-303
© Springer International Publishing Switzerland 2015. In recent years, dense trajectories have been shown to be an efficient representation for action recognition and have achieved state-of-the-art results on a variety of increasingly difficult datasets. However, while the features have greatly improved the recognition scores, the training process and machine learning used have in general not deviated from the object recognition based SVM approach. This is despite the increase in quantity and complexity of the features used. This paper improves the performance of action recognition through two data mining techniques, APriori association rule mining and Contrast Set Mining. These techniques are ideally suited to action recognition and in particular dense trajectory features, as they can utilise the large amounts of data to identify far shorter discriminative subsets of features called rules. Experimental results on one of the most challenging datasets, Hollywood2, outperform the current state of the art.
This paper presents an approach to the categorisation of spatio-temporal activity in video, which is based solely on the relative distribution of feature points. Introducing a Relative Motion Descriptor for actions in video, we show that the spatio-temporal distribution of features alone (without explicit appearance information) effectively describes actions, and demonstrate performance consistent with the state of the art. Furthermore, we propose that for actions where noisy examples exist, it is not optimal to group all action examples as a single class. Therefore, rather than engineering features that attempt to generalise over noisy examples, our method follows a different approach: We make use of Random Sampling Consensus (RANSAC) to automatically discover and reject outlier examples within classes. We evaluate the Relative Motion Descriptor and outlier rejection approaches on four action datasets, and show that outlier rejection using RANSAC provides a consistent and notable increase in performance, and demonstrate superior performance to more complex multiple-feature based approaches.
Sanfeliu A, Andrade-Cetto J, Barbosa M, Bowden R, Capitan J, Corominas A, Gilbert A, Illingworth J, Merino L, Mirats JM, Moreno P, Ollero A, Sequeira J, Spaan MTJ (2010) Decentralized Sensor Fusion for Ubiquitous Networking Robotics in Urban Areas, Sensors 10 (3) pp. 2274-2314, MDPI
In this article we explain the architecture for the environment and sensors that has been built for the European project URUS (Ubiquitous Networking Robotics in Urban Sites), a project whose objective is to develop an adaptable network robot architecture for cooperation between network robots and human beings and/or the environment in urban areas. The project goal is to deploy a team of robots in an urban area to give a set of services to a user community. This paper addresses the sensor architecture devised for URUS and the type of robots and sensors used, including environment sensors and sensors onboard the robots. Furthermore, we also explain how sensor fusion takes place to achieve urban outdoor execution of robotic services. Finally some results of the project related to the sensor network are highlighted.
Trumble M, Gilbert A, Malleson C, Hilton A, Collomosse J, Total Capture, University of Surrey
Within the field of image and video recognition, the traditional approach is a dataset split into fixed training and test partitions. However, the labelling of the training set is time-consuming, especially as datasets grow in size and complexity. Furthermore, this approach is not applicable to the home user, who wants to intuitively group their media without tirelessly labelling the content. Consequently, we propose a solution similar in nature to an active learning paradigm, where a small subset of media is labelled as semantically belonging to the same class, and machine learning is then used to pull this and other related content together in the feature space. Our interactive approach is able to iteratively cluster classes of images and video. We reformulate it in an online learning framework and demonstrate competitive performance to batch learning approaches using only a fraction of the labelled data. Our approach is based around the concept of an image signature which, unlike a standard bag of words model, can express co-occurrence statistics as well as symbol frequency. We efficiently compute metric distances between signatures despite their inherent high dimensionality and provide discriminative feature selection, to allow common and distinctive elements to be identified from a small set of user labelled examples. These elements are then accentuated in the image signature to increase similarity between examples and pull correct classes together. By repeating this process in an online learning framework, the accuracy of similarity increases dramatically despite labelling only a few training examples. To demonstrate that the approach is agnostic to media type and features used, we evaluate on three image datasets (15 scene, Caltech101 and FG-NET), a mixed text and image dataset (ImageTag), a dataset used in active learning (Iris) and on three action recognition datasets (UCF11, KTH and Hollywood2). On the UCF11 video dataset, the accuracy is 86.7% despite using only 90 labelled examples from a dataset of over 1200 videos, instead of the standard 1122 training videos. The approach is both scalable and efficient, with a single iteration over the full UCF11 dataset of around 1200 videos taking approximately 1 minute on a standard desktop machine.
This paper presents an approach to hand pose estimation that combines discriminative and model-based methods to leverage the advantages of both. Randomised Decision Forests are trained using real data to provide fast coarse segmentation of the hand. The segmentation then forms the basis of constraints applied in model fitting, using an efficient projected Gauss-Seidel solver, which enforces temporal continuity and kinematic limitations. However, when fitting a generic model to multiple users with varying hand shape, there are likely to be residual errors between the model and their hand. Also, local minima can lead to failures in tracking that are difficult to recover from. Therefore, we introduce an error regression stage that learns to correct these instances of optimisation failure. The approach provides improved accuracy over current state-of-the-art methods, through the inclusion of temporal cohesion and by learning to correct from failure cases. Using discriminative learning, our approach performs guided optimisation, greatly reducing model fitting complexity and radically improving efficiency. This allows tracking to be performed at over 40 frames per second using a single CPU thread.
We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from a MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces, enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.
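As a schematic of the regression stage only, the Python sketch below fits a Gaussian Process from a learned pose descriptor to joint positions using scikit-learn; the descriptor dimensionality, joint count and synthetic training data are assumptions, and the CNN that produces the descriptor is omitted:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
descriptors = rng.random((200, 64))           # stand-in CNN pose descriptors
poses = rng.random((200, 21 * 3))             # 21 joints x 3D positions

# Learn the mapping from descriptor space to articulated pose space.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(descriptors, poses)
predicted_pose = gp.predict(descriptors[:1])  # regress pose for a new frame
print(predicted_pose.shape)                   # (1, 63)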
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual-stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/
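A schematic PyTorch sketch of such a dual-stream design follows: one LSTM stream consumes pose embeddings from the volumetric CNN, the other consumes IMU-derived joint estimates, and a final fully connected layer fuses them. The layer sizes and input dimensionalities are illustrative assumptions, not the published architecture:

import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    def __init__(self, pvh_dim=512, imu_dim=39, hidden=256, n_joints=21):
        super().__init__()
        self.pvh_lstm = nn.LSTM(pvh_dim, hidden, batch_first=True)
        self.imu_lstm = nn.LSTM(imu_dim, hidden, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, n_joints * 3)  # fused pose output

    def forward(self, pvh_seq, imu_seq):
        # pvh_seq: (batch, time, pvh_dim) pose embeddings from the 3D CNN
        # imu_seq: (batch, time, imu_dim) forward-kinematic IMU joint estimates
        p, _ = self.pvh_lstm(pvh_seq)
        i, _ = self.imu_lstm(imu_seq)
        # Fuse the final time step of each temporal stream.
        return self.fuse(torch.cat([p[:, -1], i[:, -1]], dim=-1))

model = DualStreamFusion()
pose = model(torch.randn(2, 10, 512), torch.randn(2, 10, 39))
print(pose.shape)  # torch.Size([2, 63])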
Villegas M, Muller H, Gilbert A, Piras L, Wang J, Mikolajczyk K, Seco de Herrera A, Bromuri S, Ashraful Amin M, Kazi Mohammed M, Acar B, Uskudarli S, Marvasti N, Aldana J, Roldán García M (2015) General Overview of ImageCLEF at the CLEF 2015 Labs, Experimental IR Meets Multilinguality, Multimodality, and Interaction. Lecture Notes in Computer Science, vol 9283 pp. 444-461
This paper presents an overview of the ImageCLEF 2015 evaluation campaign, an event that was organized as part of the CLEF labs 2015. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to databases of images in various usage scenarios and domains. In 2015, the 13th edition of ImageCLEF, four main tasks were proposed: 1) automatic concept annotation, localization and sentence description generation for general images; 2) identification, multi-label classification and separation of compound figures from biomedical literature; 3) clustering of x-rays from all over the body; and 4) prediction of missing radiological annotations in reports of liver CT images. The x-ray task was the only fully novel task this year, although the other three tasks introduced modifications to keep up the relevancy of the proposed challenges. The participation was considerably positive in this edition of the lab, receiving almost twice the number of submitted working notes papers as compared to previous years.
Content-aware image completion or in-painting is a fundamental tool for the correction of defects or removal of objects in images. We propose a non-parametric in-painting algorithm that enforces both structural and aesthetic (style) consistency within the resulting image. Our contributions are two-fold: 1) we explicitly disentangle image structure and style during patch search and selection to ensure a visually consistent look and feel within the target image; 2) we perform adaptive stylization of patches to conform the aesthetics of selected patches to the target image, so harmonizing the integration of selected patches into the final composition. We show that explicit consideration of visual style during in-painting delivers excellent qualitative and quantitative results across varied image styles and content, over the Places2 scene photographic dataset and a challenging new in-painting dataset of artwork derived from BAM!
This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method to create the spatio-temporal links between cameras, and thus model the posterior probability distribution of these links. This can then be used with an appearance model of the object to track across cameras. It requires no calibration or batch preprocessing and becomes more accurate over time as evidence is accumulated.
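To illustrate the incremental flavour of this link learning, the Python sketch below accumulates exit/entry co-occurrences between camera regions into delay histograms that approximate the posterior over inter-camera transit times; the region names, time window and bin width are invented for illustration and are not the published model:

from collections import defaultdict

BIN = 1.0                                  # histogram bin width in seconds
links = defaultdict(lambda: defaultdict(int))

def observe(exit_region, exit_t, entry_region, entry_t, max_delay=30.0):
    """Accumulate one piece of evidence linking two camera regions."""
    delay = entry_t - exit_t
    if 0 < delay <= max_delay:
        links[(exit_region, entry_region)][int(delay // BIN)] += 1

def link_posterior(exit_region, entry_region):
    """Normalised delay histogram, approximating P(delay | exit, entry)."""
    hist = links[(exit_region, entry_region)]
    total = sum(hist.values()) or 1
    return {b * BIN: c / total for b, c in sorted(hist.items())}

# Evidence accumulates over time with no batch preprocessing required.
observe("cam1/door", 0.0, "cam2/gate", 4.2)
observe("cam1/door", 10.0, "cam2/gate", 14.5)
print(link_posterior("cam1/door", "cam2/gate"))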
The ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task was the fourth edition of a challenge aimed at developing more scalable image annotation systems. In particular, this year the three subtasks available to participants had the goal of developing techniques to allow computers to reliably describe images, localize the different concepts depicted in the images and generate a description of the scene. All three tasks use a single mixed-modality data source of 500,000 web page items which included raw images, textual features obtained from the web pages on which the images appeared, as well as various visual features extracted from the images themselves. Unlike previous years, the test set was also the training set, and in this edition of the task hand-labelled data has been allowed. The images were obtained from the Web by querying popular image search engines. The development and subtask 1 and 2 test sets were both taken from the "training set" and had 1,979 and 3,070 samples, and the subtask 3 track had 500 and 450 samples. The 251 concepts this year were chosen to be visual objects that are localizable and that are useful for generating textual descriptions of the visual content of images, and were mined from the texts of our large database of image-webpage pairs. This year 14 groups participated in the task, submitting a total of 122 runs across the 3 subtasks, and 11 of the participants also submitted working notes papers. This result is very positive; compared to the 11 participants and 58 submitted runs of last year, it shows that interest in this topic is still very high.
Within the field of action recognition, features and descriptors are often engineered to be sparse and invariant to transformation. While sparsity makes the problem tractable, it is not necessarily optimal in terms of class separability and classification. This paper proposes a novel approach that uses very dense corner features that are spatially and temporally grouped in a hierarchical process to produce an overcomplete compound feature set. Frequently reoccurring patterns of features are then found through data mining, designed for use with large data sets. The novel use of the hierarchical classifier allows real-time operation, while the approach is demonstrated to handle camera motion, scale, human appearance variations, occlusions and background clutter. The classification performance outperforms other state-of-the-art action recognition algorithms on three datasets: KTH, multi-KTH, and Hollywood. Multiple action localisation is performed, though no groundtruth localisation data is required, using only weak supervision of class labels for each training sequence. The Hollywood dataset contains complex realistic actions from movies; the approach outperforms the published accuracy on this dataset and also achieves real-time performance. ©2009 IEEE.
In this paper, we aim to tackle the problem of recognising temporal sequences in the context of a multi-class problem. In the past, the representation of sequential patterns was used for modelling discriminative temporal patterns for different classes. Here, we have improved on this by using the more general representation of episodes, of which sequential patterns are a special case. We then propose a novel tree structure called a MultI-Class Episode Tree (MICE-Tree) that allows one to simultaneously model a set of different episodes in an efficient manner whilst providing labels for them. A set of MICE-Trees are then combined together into a MICE-Forest that is learnt in a Boosting framework. The result is a strong classifier that utilises episodes for performing classification of temporal sequences. We also provide experimental evidence showing that the MICE-Trees allow for a more compact and efficient model compared to sequential patterns. Additionally, we demonstrate the accuracy and robustness of the proposed method in the presence of different levels of noise and class labels.
The use of sparse invariant features to recognise classes of actions or objects has become common in the literature. However, features are often "engineered" to be both sparse and invariant to transformation, and it is assumed that they provide the greatest discriminative information. To tackle activity recognition, we propose learning compound features that are assembled from simple 2D corners in both space and time. Each corner is encoded in relation to its neighbours, and from an overcomplete set (in excess of 1 million possible features), compound features are extracted using data mining. The final classifier, consisting of sets of compound features, can then be applied to recognise and localise an activity in real-time while providing superior performance to other state-of-the-art approaches (including those based upon sparse feature detectors). Furthermore, the approach requires only weak supervision in the form of class labels for each training sequence. No ground truth position or temporal alignment is required during training.
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras, and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered, including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor environments.
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4X, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.
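The sketch below gives a schematic PyTorch rendering of the dual-loss idea: a symmetric 3D convolutional encoder-decoder whose latent code is also regressed to joint positions, so one network serves reconstruction and pose estimation together. All layer sizes, the grid resolution and the joint count are illustrative assumptions, not the published network:

import torch
import torch.nn as nn

class DualLossAutoencoder(nn.Module):
    def __init__(self, latent=128, n_joints=21):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 8 ** 3, latent))
        self.dec = nn.Sequential(
            nn.Linear(latent, 32 * 8 ** 3), nn.Unflatten(1, (32, 8, 8, 8)),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())
        self.pose = nn.Linear(latent, n_joints * 3)   # joint-position head

    def forward(self, vol):
        z = self.enc(vol)                 # shared latent embedding
        return self.dec(z), self.pose(z)  # volume reconstruction + pose

model = DualLossAutoencoder()
recon, joints = model(torch.randn(1, 1, 32, 32, 32))
# Training would combine a reconstruction loss on recon with a pose loss on joints.
print(recon.shape, joints.shape)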
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset, the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/
We describe a non-parametric algorithm for multiple-viewpoint video inpainting. Uniquely, our algorithm addresses the domain of wide baseline multiple-viewpoint video (MVV) with no temporal look-ahead at near real-time speed. A Dictionary of Patches (DoP) is built using multi-resolution texture patches reprojected from geometric proxies available in the alternate views. We dynamically update the DoP over time, and a Markov Random Field optimisation over depth and appearance is used to resolve and align a selection of multiple candidates for a given patch; this ensures the inpainting of large regions in a plausible manner, conserving both spatial and temporal coherence. We demonstrate the removal of large objects (e.g. people) on challenging indoor and outdoor MVV exhibiting cluttered, dynamic backgrounds and moving cameras.
Automatic image annotation is the task of automatically assigning some form of semantic label to images, such as words, phrases or sentences describing the objects, attributes, actions, and scenes depicted in the image. In this chapter, we present an overview of the various automatic image annotation tasks that were organized in conjunction with the ImageCLEF track at CLEF between 2009-2016. Throughout the eight years, the image annotation tasks have evolved from annotating Flickr photos by learning from clean data to annotating web images by learning from large-scale noisy web data. The tasks are divided into three distinct phases, and this chapter will provide a discussion for each of these phases. We will also compare and contrast other related benchmarking challenges, and provide some insights into the future of automatic image annotation.
Performance capture is used extensively within the creative industries to efficiently produce high quality, realistic character animation in movies and video games. Existing commercial systems for performance capture are limited to working within constrained environments, requiring wearable visual markers or suits, and frequently specialised imaging devices (e.g. infra-red cameras), both of which limit deployment scenarios (e.g. indoor capture). This thesis explores novel methods to relax these constraints, applying machine learning techniques to estimate human pose using regular video cameras and without the requirement of visible markers on the performer. This unlocks the potential for co-production of principal footage and performance capture data, leading to production efficiencies. For example, using an array of static witness cameras deployed on-set, performance capture data for a video games character accompanying a major movie franchise might be captured at the same time the movie is shot. The need to call the actor for a second day of shooting in a specialised motion capture (mo-cap) facility is avoided, saving time and money, since performance capture was possible without corrupting the principal movie footage with markers or constraining set design. Furthermore, if such performance capture data is available in real-time, the director may immediately pre-visualize the look and feel of the final character animation enabling tighter capture iteration and improved creative direction. This further enhances the potential for production efficiencies.
The core technical contributions of this thesis are novel software algorithms that leverage machine learning to fuse data from multiple sensors, synchronised video cameras and, in some cases, inertial measurement units (IMUs), in order to robustly estimate human body pose over time, doing so at real-time or near real-time rates.
Firstly, a hardware-accelerated capture solution is developed for acquiring coarse volumetric occupancy data from multiple viewpoint video footage, in the form of a probabilistic visual hull (PVH). Using CUDA-based GPU acceleration, the PVH may be estimated in real time, and subsequently used to train machine learning algorithms to infer human skeletal pose from PVH data.
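As a compact illustration of the PVH idea (not the thesis' CUDA implementation), the numpy sketch below computes per-voxel occupancy as the product over cameras of the foreground probability at each voxel's projection; the toy orthographic cameras, grid resolution and probability maps are assumptions:

import numpy as np

def pvh(fg_probs, projections, grid):
    """fg_probs: list of HxW per-camera foreground probability maps.
    projections: list of functions mapping Nx3 voxel centres to Nx2 pixels.
    grid: Nx3 array of voxel centres. Returns per-voxel occupancy in [0, 1]."""
    occupancy = np.ones(len(grid))
    for prob_map, project in zip(fg_probs, projections):
        px = np.clip(project(grid).astype(int), 0, np.array(prob_map.shape)[::-1] - 1)
        occupancy *= prob_map[px[:, 1], px[:, 0]]   # rows indexed by y, cols by x
    return occupancy

# Toy example: two orthographic cameras viewing an 8x8x8 voxel grid.
grid = np.stack(np.meshgrid(*[np.arange(8)] * 3), -1).reshape(-1, 3).astype(float)
maps = [np.random.rand(8, 8), np.random.rand(8, 8)]
projs = [lambda v: v[:, :2], lambda v: v[:, 1:]]
print(pvh(maps, projs, grid).shape)   # (512,)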
Initially, a variety of machine learning approaches for skeletal joint pose estimation are explored, contrasting classical and deep inference methods. By quantizing volumetric data into a two-dimensional (2D) spherical histogram representation, it is shown that convolutional neural network (CNN) architectures used traditionally for object recognition may be re-purposed for skeletal joint estimation given a suitable training methodology and data augmentation strategy.
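A small numpy sketch of that quantisation step follows: occupied voxels are binned by azimuth and elevation about the volume centre, yielding a 2D spherical histogram that a standard image CNN can consume. The bin counts, occupancy threshold and toy volume are assumptions:

import numpy as np

def spherical_histogram(volume, bins=(32, 16)):
    centre = (np.array(volume.shape) - 1) / 2.0
    idx = np.argwhere(volume > 0.5) - centre          # occupied voxel offsets
    azimuth = np.arctan2(idx[:, 1], idx[:, 0])        # [-pi, pi]
    radius = np.linalg.norm(idx, axis=1) + 1e-9
    elevation = np.arcsin(idx[:, 2] / radius)         # [-pi/2, pi/2]
    hist, _, _ = np.histogram2d(azimuth, elevation, bins=bins,
                                range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]])
    return hist                                       # 2D "image" of the volume

volume = (np.random.rand(32, 32, 32) > 0.95).astype(float)
print(spherical_histogram(volume).shape)              # (32, 16)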
The generalization of such architectures to a fully volumetric (3D) CNN is explored, achieving state-of-the-art performance at human pose estimation using a volumetric auto-encoder (hour-glass) architecture that emulates networks traditionally used for de-noising and super-resolution (up-scaling) of 2D data. A framework is developed that is capable of simultaneously estimating human pose from volumetric data, whilst also up-scaling that volumetric data to enable fine-grain estimation of surface detail given a deeply learned prior from previous performances. The method is shown to generalise well even when that prior is learned across different subjects performing different movements, even in different studio camera configurations.
Performance can be further improved using a learned temporal model of the data, and through the fusion of complementary sensor modalities, video and IMUs, to enhance the accuracy of human pose estimation inferred from a volumetric CNN. Although IMUs have been applied in the performance capture domain for many years, they are prone to drift, limiting their use to short capture sequences. The novel fusion of IMU with video data enables improved global localization and so reduced drift.
We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and is shown to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.