I have a number of funded PhD studentships (for UK and exceptional Overseas students) to start in September 2022, around activity recognition and 3D human pose understanding and rendering. Please contact me for more details.
I have a passion for enabling and exploring how to make intelligent machines or computers that are able to be creative or understand their world.
Recently published works include
- ALADIN: All Layer Adaptive Instance Normalization for Fine-grained Style Similarity Dan Ruta, Saeid Motiian, Baldo Faieta, Zhe Lin, Hailin Jin, Alex Filipkowski, Andrew Gilbert, John Collomosse. In Proc ICCV'21, 2021
- Neural architecture search for deep image prior Kary Ho, Andrew Gilbert, Hailin Jin, John Collomosse, In Proc Computers & Graphics 2021
- Rethinking Genre Classification with Fine Grained semantic Clustering, Edward Fish, Jon Weinbren, Andrew Gilbert, In Proc International Conference on Image Processing (ICIP'21), 2021
- Semantic Estimation of 3D Body Shape and Pose using Minimal Cameras A Gilbert, M Trumble, A Hilton, J Collomosse, British Machine Vision Conference (BMVC'20) 2020
- Fusing visual and inertial sensors with semantics for 3d human pose estimation A Gilbert, M Trumble, C Malleson, A Hilton, J Collomosse International Journal of Computer Vision 127 (4), 381-397, 2019
- Inpainting of Wide-baseline Multiple Viewpoint Video, A Gilbert, M Trumble, A Hilton, J Collomosse, IEEE transactions on visualization and computer graphics, 2018
My research ranges from intelligent creative arts, such as fine-grained style search, movie trailer genre understanding and 4D performance capture, through to enabling machines to perceive and understand their surroundings: real-time, accurate 3D human pose estimation for large-scale markerless motion capture both indoors and outdoors, complex realistic activity recognition, and early work on tracking people across vast surveillance networks.
I am interested in employing state of the art technologies using minimal specialist hardware by applying the latest modelling and learning techniques. I see this as a key enabler for future AR and VR experiences, allowing users to create, develop and enjoy immersive technologies through all the devices they own.
I have published over 40 articles in the leading international vision conferences and journals, providing state-of-the-art advancements in this field. Most recently I've worked on the InnovateUK projects REFRAME and TotalCapture, leading advances in video-rate actor performance capture for use in film and games production. I'm also a member of the British Machine Vision Association (BMVA) Executive Committee and coordinate the national BMVA technical meetings, bringing together key experts from industry and academia to discuss and identify solutions to current problems in specialist areas of the computer vision and machine learning field.
There exist many options for securing PhD funding for outstanding candidates that can always be explored. I am always looking for dedicated and ambitious individuals to join my team. Please first send an email to discuss, specifying your interest and availability to join. Include a description of achievements that you are particularly proud of, and write a short paragraph on your ideal research project at Surrey.
Please drop me an email, if interested in any of the areas I am researching.
Current PhD Students
Soon Yau - Text to Storyboard generation - 2021
Kar Balan - Decentralized virtual content understanding for Blockchain - 2021
Dan Ruta - Exploring and understanding fine grained style - 2019
Ed Fish - Film Trailer Genre Understanding and exploration - 2019
Violeta Menendez Gonzalez - Novel viewpoint and stereo inpainting - 2019
Tony Orme - Temporal Prediction of IP packets in network switches - 2019
Rob Penhaligon - 2019
Previous PhD Students
Mat Trumble - 2015
Phil Krejov - 2012
Segun Oshin - 2008
Areas of specialism
University roles and responsibilities
- Senior Lecturer of Film and Video Production
- Undergraduate final year project supervision
- Undergraduate placement year supervisor
- Postgraduate (PhD) Supervisor
- Head of Departmental Athena Swan Self Assessment Team
Affiliations and memberships
Recent published works
My research interests span both machine learning and computer vision, and throughout my career I have been very active in state-of-the-art data mining with large datasets, through the use of hierarchical image feature mining applied to action recognition. This has progressed towards the organization of big data, large-scale learning, and active learning with minimal user-led training, as well as mixed-modality mining that uses web pages to provide large-scale automatic context for images. In collaboration with other groups, I have also investigated the fusion of different data sources. During my Ph.D. I researched and published methods for incremental learning of spatiotemporal tracking models for large amounts of surveillance data, for the guidance of autonomous systems.
Visual media production is a large and growing industry and, as a result, presents a number of research challenges. Multiple viewpoint video is becoming commonplace in film and music video production, where shots are captured using multiple, wide-spaced, synchronized cameras for later editing. Within this, I have two research questions:
- Is it possible to remove an actor in a motion-capture suit so that they can later be replaced with an animated virtual character? This is illustrated by the Chemical Brothers music video Wide Open, and by Alicia Vikander in Ex Machina.
- Can I perform real-time human pose estimation of actors while freeing them from having to wear a cumbersome and environment-limiting motion capture suit, therefore allowing the pose estimation to occur in free environments and in costume? For current solutions to this, check out the recent state-of-the-art performance on the Human3.6M dataset published at BMVC 2017 and 3DV 2017.
Volumetric performance capture from minimal camera viewpoints
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count. For more details and the paper click here
Postgraduate research supervision
Current PhD Students
I currently supervise a couple of PhD students
- Mr Ed Fish (2019-) (Exploring Bias in Machine Learning through film)
- Miss Kary Ho (2017-) (Image Inpainting)
Completed postgraduate research projects I have supervised
Completed PhD Students
I have supervised several students to completion; what they are up to now can be found below.
I am available for tutorials throughout the week. Please email me to arrange a convenient time to meet, and include in your email an outline of the topics for discussion.
I teach on Film and Video Production Technology, an undergraduate Broadcast Engineering programme. It is an exciting course that combines broadcast engineering, artistic storytelling and professional film making. The students study the fundamentals of signals, computing and broadcast technologies, video and audio engineering, wireless links and video over IP. On the practical side, they study camera skills, cinematography for cinema, film sound post-production, TV production, titles and graphics. Examples of the students’ work can be found here.
Volumetric performance capture from minimal camera viewpoints - Andrew Gilbert, Marco Volino, John Collomosse and Adrian Hilton, In 2018 European Conference on Computer Vision (ECCV'18)
Deep Autoencoder for Combined Human Pose Estimation and Body Model Upscaling - Matthew Trumble, Andrew Gilbert, Adrian Hilton and John Collomosse, In 2018 European Conference on Computer Vision (ECCV'18)
Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation - Andrew Gilbert, Matthew Trumble, Charles Malleson, Adrian Hilton, John Collomosse (2018), In International Journal of Computer Vision
We present an approach to iteratively cluster images and video in an efficient and intuitive manner. While many techniques use the traditional approach of time-consuming groundtruthing of large amounts of data [10, 16, 20, 23], this is increasingly infeasible as dataset size and complexity increase. Furthermore, it is not applicable to the home user, who wants to intuitively group his/her own media without labelling the content. Instead we propose a solution that allows the user to select media that semantically belongs to the same class and use machine learning to "pull" this and other related content together. We introduce an "image signature" descriptor and use min-Hash and greedy clustering to efficiently present the user with clusters of the dataset using multi-dimensional scaling. The image signatures of the dataset are then adjusted by APriori data mining identifying the common elements between a small subset of image signatures. This is able to both pull together true positive clusters and push apart false positive examples. The approach is tested on real videos harvested from the web using the state-of-the-art YouTube dataset. The accuracy of correct group label increases from 60.4% to 81.7% using 15 iterations of pulling and pushing the media around, while the process takes only 1 minute to compute the pairwise similarities of the image signatures and visualise the whole YouTube dataset. © 2011. The copyright of this document resides with its authors.
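The min-Hash step mentioned above can be illustrated with a minimal sketch, assuming each image signature is a non-empty set of integer visual-word IDs. The function names, the salted SHA-256 hash family and the number of hash functions are illustrative assumptions, not the paper's implementation.

```python
import hashlib


def _hash(salt: int, item: int) -> int:
    # Deterministic salted hash of one visual-word ID (hypothetical hash choice).
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(signature: set, num_hashes: int = 128) -> list:
    # Keep the minimum hash value per salted hash function.
    return [min(_hash(salt, item) for item in signature)
            for salt in range(num_hashes)]


def estimated_similarity(sketch_a: list, sketch_b: list) -> float:
    # The fraction of matching minima approximates the Jaccard similarity.
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def jaccard(a: set, b: set) -> float:
    # Exact set overlap, used here only to check the estimate.
    return len(a & b) / len(a | b)
```

Comparing fixed-length sketches rather than full signatures is what makes the pairwise similarity computation cheap enough to repeat at every iteration.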
A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.
There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within the buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, by being more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimations on the positions of people from cameras on board robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to provide an increase in performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time on a large outdoor environment, including 22 nonoverlapping cameras, WSN, and several robots. © Institut Mines-Télécom and Springer-Verlag 2012.
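The idea of combining position estimates from several sources can be sketched as follows. This is a simplified illustration, assuming independent one-dimensional Gaussian estimates fused in information (inverse-variance) form; the function name and the 1D setting are assumptions, and the paper's decentralized algorithm is not reproduced here.

```python
def fuse_estimates(estimates):
    """Fuse independent 1D Gaussian estimates given as (mean, variance) pairs."""
    information = 0.0       # running sum of inverse variances
    information_mean = 0.0  # running sum of means weighted by inverse variance
    for mean, variance in estimates:
        if variance <= 0:
            raise ValueError("variance must be positive")
        information += 1.0 / variance
        information_mean += mean / variance
    fused_variance = 1.0 / information
    return information_mean * fused_variance, fused_variance
```

For example, `fuse_estimates([(10.0, 4.0), (12.0, 4.0)])` gives a fused mean of 11.0 with variance 2.0. Because each source contributes an additive term, the contributions can be summed in any order, which is one reason information-form fusion suits sensors that report at different times.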
- Representation and method for evolutionary neural architecture search of encoder-decoder architectures for Deep Image Prior.
- Leveraging a state-of-the-art perceptual metric to guide the optimization.
- State-of-the-art DIP results for inpainting, denoising and up-scaling, beating the hand-optimized DIP architectures proposed.
- Demonstrated the content-style dependency of DIP architectures.
We present a neural architecture search (NAS) technique to enhance image denoising, inpainting, and super-resolution tasks under the recently proposed Deep Image Prior (DIP). We show that evolutionary search can automatically optimize the encoder-decoder (E-D) structure and meta-parameters of the DIP network, which serves as a content-specific prior to regularize these single image restoration tasks. Our binary representation encodes the design space for an asymmetric E-D network that typically converges to yield a content-specific DIP within 10-20 generations using a population size of 500. The optimized architectures consistently improve upon the visual quality of classical DIP for a diverse range of photographic and artistic content.
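A generic evolutionary search over a binary representation can be sketched as below. This is a toy illustration only: the fitness function is a caller-supplied placeholder standing in for the perceptual metric, and the genome length, population size, tournament selection, one-point crossover and mutation rate are all assumed values rather than the settings used in the paper.

```python
import random


def evolve(fitness, genome_length=16, population_size=20, generations=15,
           mutation_rate=0.05, seed=0):
    """Evolve bit-string genomes; returns (best_genome, best_fitness)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_population = [ranked[0][:]]  # elitism: keep the best genome
        while len(next_population) < population_size:
            # Tournament selection: best of three random genomes, twice.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, genome_length)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation applied independently to each gene.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    return best, fitness(best)
```

Calling `evolve(sum)` maximises the number of set bits, a stand-in objective. In an architecture search, each bit string would instead be decoded into a network and scored by training and evaluating it, which is by far the most expensive step.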
In this chapter, we present a generic classifier for detecting spatio-temporal interest points within video, the premise being that, given an interest point detector, we can learn a classifier that duplicates its functionality and which is both accurate and computationally efficient. This means that interest point detection can be achieved independent of the complexity of the original interest point formulation. We extend the naive Bayesian classifier of Randomised Ferns to the spatio-temporal domain and learn classifiers that duplicate the functionality of common spatio-temporal interest point detectors. Results demonstrate accurate reproduction of results with a classifier that can be applied exhaustively to video at frame-rate, without optimisation, in a scanning window approach. © 2010, IGI Global.
This paper presents an approach to the categorisation of spatio-temporal activity in video, which is based solely on the relative distribution of feature points. Introducing a Relative Motion Descriptor for actions in video, we show that the spatio-temporal distribution of features alone (without explicit appearance information) effectively describes actions, and demonstrate performance consistent with state-of-the-art. Furthermore, we propose that for actions where noisy examples exist, it is not optimal to group all action examples as a single class. Therefore, rather than engineering features that attempt to generalise over noisy examples, our method follows a different approach: We make use of Random Sampling Consensus (RANSAC) to automatically discover and reject outlier examples within classes. We evaluate the Relative Motion Descriptor and outlier rejection approaches on four action datasets, and show that outlier rejection using RANSAC provides a consistent and notable increase in performance, and demonstrate superior performance to more complex multiple-feature based approaches.
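The outlier rejection idea can be sketched with a RANSAC-style consensus search over the examples of one class. The centroid model, the Euclidean distance threshold and all parameter values below are simplifying assumptions for illustration; they are not the model used with the Relative Motion Descriptor.

```python
import math
import random


def ransac_inliers(examples, sample_size=3, threshold=1.0, iterations=100, seed=0):
    """Return indices of the largest consensus set among feature vectors."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        sample = rng.sample(examples, sample_size)
        # Hypothesised class model: the centroid of the random sample.
        centroid = [sum(values) / sample_size for values in zip(*sample)]
        # Consensus set: every example close enough to the hypothesis.
        inliers = [i for i, example in enumerate(examples)
                   if math.dist(example, centroid) <= threshold]
        if len(inliers) > len(best):
            best = inliers
    return best
```

Examples whose indices are not returned are treated as outliers and can be excluded before training the class model.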
We propose a human performance capture system employing convolutional neural networks (CNN) to estimate human pose from a volumetric representation of a performer derived from multiple view-point video (MVV). We compare direct CNN pose regression to the performance of an affine invariant pose descriptor learned by a CNN through a classification task. A non-linear manifold embedding is learned between the descriptor and articulated pose spaces, enabling regression of pose from the source MVV. The results are evaluated against ground truth pose data captured using a Vicon marker-based system and demonstrate good generalisation over a range of human poses, providing a system that requires no special suit to be worn by the performer.
This paper introduces a novel approach to social behaviour recognition governed by the exchange of non-verbal cues between people. We conduct experiments to try and deduce distinct rules that dictate the social dynamics of people in a conversation, and utilise semi-supervised computer vision techniques to extract their social signals such as laughing and nodding. Data mining is used to deduce frequently occurring patterns of social trends between a speaker and listener in both interested and not interested social scenarios. The confidence values from rules are utilised to build a Social Dynamic Model (SDM), that can then be used for classification and visualisation. By visualising the rules generated in the SDM, we can analyse distinct social trends between an interested and not interested listener in a conversation. Results show that these distinctions can be applied generally and used to accurately predict conversational interest.
"Actions in the wild" is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies  or Internet databases . This paper presents an approach to the categorisation of such activity in video, which is based solely on the relative distribution of spatio-temporal interest points. Presenting the Relative Motion Descriptor, we show that the distribution of interest points alone (without explicitly encoding their neighbourhoods) effectively describes actions. Furthermore, given the huge variability of examples within action classes in natural settings, we propose to further improve recognition by automatically detecting outliers, and breaking complex action categories into multiple modes. This is achieved using a variant of Random Sampling Consensus (RANSAC), which identifies and separates the modes. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. We demonstrate state-of-the-art performance on five human action datasets. © 2014 Elsevier Inc. All rights reserved.
In this paper, we aim to tackle the problem of recognising temporal sequences in the context of a multi-class problem. In the past, the representation of sequential patterns was used for modelling discriminative temporal patterns for different classes. Here, we have improved on this by using the more general representation of episodes, of which sequential patterns are a special case. We then propose a novel tree structure called a MultI-Class Episode Tree (MICE-Tree) that allows one to simultaneously model a set of different episodes in an efficient manner whilst providing labels for them. A set of MICE-Trees are then combined together into a MICE-Forest that is learnt in a Boosting framework. The result is a strong classifier that utilises episodes for performing classification of temporal sequences. We also provide experimental evidence showing that the MICE-Trees allow for a more compact and efficient model compared to sequential patterns. Additionally, we demonstrate the accuracy and robustness of the proposed method in the presence of different levels of noise and class labels.
The field of Action Recognition has seen a large increase in activity in recent years. Much of the progress has been through incorporating ideas from single-frame object recognition and adapting them for temporal-based action recognition. Inspired by the success of interest points in the 2D spatial domain, their 3D (space-time) counterparts typically form the basic components used to describe actions, and in action recognition the features used are often engineered to fire sparsely. This is to ensure that the problem is tractable; however, this can sacrifice recognition accuracy as it cannot be assumed that the optimum features in terms of class discrimination are obtained from this approach. In contrast, we propose to initially use an overcomplete set of simple 2D corners in both space and time. These are grouped spatially and temporally using a hierarchical process, with an increasing search area. At each stage of the hierarchy, the most distinctive and descriptive features are learned efficiently through data mining. This allows large amounts of data to be searched for frequently reoccurring patterns of features. At each level of the hierarchy, the mined compound features become more complex, discriminative, and sparse. This results in fast, accurate recognition with real-time performance on high-resolution video. As the compound features are constructed and selected based upon their ability to discriminate, their speed and accuracy increase at each level of the hierarchy. The approach is tested on four state-of-the-art data sets, the popular KTH data set to provide a comparison with other state-of-the-art approaches, the Multi-KTH data set to illustrate performance at simultaneous multiaction classification, despite no explicit localization information provided during training. Finally, the recent Hollywood and Hollywood2 data sets provide challenging complex actions taken from commercial movie sequences. 
For all four data sets, the proposed hierarchical approach outperforms all other methods reported thus far in the literature and can achieve real-time operation.
This paper presents an overview of the ImageCLEF 2016 evaluation campaign, an event that was organized as part of the CLEF (Conference and Labs of the Evaluation Forum) labs 2016. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to collections of images in various usage scenarios and domains. In 2016, the 14th edition of ImageCLEF, three main tasks were proposed: 1) identification, multi-label classification and separation of compound figures from biomedical literature; 2) automatic annotation of general web images; and 3) retrieval from collections of scanned handwritten documents. The handwritten retrieval task was the only completely novel task this year, although the other two tasks introduced several modifications to keep the proposed tasks challenging.
This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method to create the spatio-temporal links between cameras, and thus model the posterior probability distribution of these links. This can then be used with an appearance model of the object to track across cameras. It requires no calibration or batch preprocessing and becomes more accurate over time as evidence is accumulated.
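Incremental learning of links between cameras can be sketched as a set of transit-time histograms that are updated one observation at a time. The class name, bin width, number of bins and uniform prior counts are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


class CameraLinkModel:
    """Incrementally accumulated transit-time histograms between camera pairs."""

    def __init__(self, bin_seconds=5.0, num_bins=12):
        self.bin_seconds = bin_seconds
        self.num_bins = num_bins
        # Every bin starts at 1.0, i.e. a uniform prior over transit times.
        self.counts = defaultdict(lambda: [1.0] * num_bins)

    def _bin(self, delay):
        # Delays beyond the last bin are collected in the final bin.
        return min(int(delay // self.bin_seconds), self.num_bins - 1)

    def observe(self, exit_cam, entry_cam, delay):
        # One exit -> entry observation updates the model; no batch step.
        self.counts[(exit_cam, entry_cam)][self._bin(delay)] += 1.0

    def probability(self, exit_cam, entry_cam, delay):
        # Normalised count of the matching bin for this camera pair.
        histogram = self.counts[(exit_cam, entry_cam)]
        return histogram[self._bin(delay)] / sum(histogram)
```

As more transitions are observed, the histogram for a genuinely linked camera pair becomes peaked around the typical transit time, so the estimate sharpens as evidence accumulates. Such a probability could then be combined with an appearance score when matching objects across cameras.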
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
Content-aware image completion or in-painting is a fundamental tool for the correction of defects or removal of objects in images. We propose a non-parametric in-painting algorithm that enforces both structural and aesthetic (style) consistency within the resulting image. Our contributions are two-fold: 1) we explicitly disentangle image structure and style during patch search and selection to ensure a visually consistent look and feel within the target image. 2) we perform adaptive stylization of patches to conform the aesthetics of selected patches to the target image, so harmonizing the integration of selected patches into the final composition. We show that explicit consideration of visual style during in-painting delivers excellent qualitative and quantitative results across the varied image styles and content, over the Places2 scene photographic dataset and a challenging new in-painting dataset of artwork derived from BAM!
It is known that relative feature location is important in representing objects, but assumptions that make learning tractable often simplify how structure is encoded, e.g. spatial pooling or star models. For example, techniques such as spatial pyramid matching (SPM), in conjunction with machine learning techniques, perform well. However, there are limitations to such spatial encoding schemes which discard important information about the layout of features. In contrast, we propose to use the object itself to choose the basis of the features in an object centric approach. In doing so we return to the early work of geometric hashing but demonstrate how such approaches can be scaled-up to modern day object detection challenges in terms of both the number of examples and their variability. We apply a two stage process; initially filtering background features to localise the objects and then hashing the remaining pairwise features in an affine invariant model. During learning, we identify class-wise key feature predictors. We validate our detection and classification of objects on the PASCAL VOC’07 and ’11 and CarDb datasets and compare with state of the art detectors and classifiers. Importantly we demonstrate how structure in features can be efficiently identified and how its inclusion can increase performance. This feature centric learning technique allows us to localise objects even without object annotation during training and the resultant segmentation provides accurate state of the art object localization, without the need for annotations.
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data and a temporal model (LSTM) exploits the rich spatial and temporal long range dependencies among the solved joints, the two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed with state of the art performance reported on the popular Human 3.6M dataset, the newly released TotalCapture dataset and a challenging set of outdoor videos TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising of multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
This paper presents an overview of the ImageCLEF 2015 evaluation campaign, an event that was organized as part of the CLEF labs 2015. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to databases of images in various usage scenarios and domains. In 2015, the 13th edition of ImageCLEF, four main tasks were proposed: 1) automatic concept annotation, localization and sentence description generation for general images; 2) identification, multi-label classification and separation of compound figures from biomedical literature; 3) clustering of x-rays from all over the body; and 4) prediction of missing radiological annotations in reports of liver CT images. The x-ray task was the only fully novel task this year, although the other three tasks introduced modifications to keep up relevancy of the proposed challenges. The participation was considerably positive in this edition of the lab, receiving almost twice the number of submitted working notes papers as compared to previous years.
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4X, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.
We present a generic, efficient and iterative algorithm for interactively clustering classes of images and videos. The approach moves away from the use of large hand labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of “seed” examples. Two efficient data mining tools originally developed for text analysis, min-Hash and APriori, are used and extended to achieve both speed and scalability on large image and video datasets. Inspired by the Bag-of-Words (BoW) architecture, the idea of an image signature is introduced as a simple descriptor on which nearest neighbour classification can be performed. The image signature is then dynamically expanded to identify common features amongst samples of the same class. The iterative approach uses APriori to identify common and distinctive elements of a small set of labelled true and false positive signatures. These elements are then accentuated in the signature to increase similarity between examples and “pull” positive classes together. By repeating this process, the accuracy of similarity increases dramatically despite only a few training examples: only 10% of the labelled groundtruth is needed compared to other approaches. It is tested on two image datasets including the caltech101 dataset and on three state-of-the-art action recognition datasets. On the YouTube video dataset the accuracy increases from 72% to 97% using only 44 labelled examples from a dataset of over 1200 videos. The approach is both scalable and efficient, with an iteration on the full YouTube dataset taking around 1 minute on a standard desktop machine.
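The use of APriori to find elements shared by a small set of signatures can be sketched as follows, assuming each signature is a set of integer element IDs and that "common" means appearing in at least `min_support` signatures. The function name and the support threshold are illustrative assumptions; the paper's extensions to APriori are not shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions)

    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while current:
        counted = {c: support(c) for c in current}
        survivors = {c: n for c, n in counted.items() if n >= min_support}
        frequent.update(survivors)
        size += 1
        # Join surviving itemsets into candidates one item larger.
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == size}
        # Apriori property: every subset of a frequent itemset is frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in survivors
                          for s in combinations(c, size - 1))}
    return frequent
```

The frequent itemsets returned for a handful of positive signatures are the shared elements that could then be accentuated to pull those examples together.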
We describe a non-parametric algorithm for multiple-viewpoint video inpainting. Uniquely, our algorithm addresses the domain of wide baseline multiple-viewpoint video (MVV) with no temporal look-ahead in near real time speed. A Dictionary of Patches (DoP) is built using multi-resolution texture patches reprojected from geometric proxies available in the alternate views. We dynamically update the DoP over time, and a Markov Random Field optimisation over depth and appearance is used to resolve and align a selection of multiple candidates for a given patch; this ensures the inpainting of large regions in a plausible manner, conserving both spatial and temporal coherence. We demonstrate the removal of large objects (e.g. people) on challenging indoor and outdoor MVV exhibiting cluttered, dynamic backgrounds and moving cameras.
This paper presents an approach to hand pose estimation that combines discriminative and model-based methods to leverage the advantages of both. Randomised Decision Forests are trained using real data to provide fast coarse segmentation of the hand. The segmentation then forms the basis of constraints applied in model fitting, using an efficient projected Gauss-Seidel solver, which enforces temporal continuity and kinematic limitations. However, when fitting a generic model to multiple users with varying hand shape, there are likely to be residual errors between the model and their hand. Also, local minima can lead to failures in tracking that are difficult to recover from. Therefore, we introduce an error regression stage that learns to correct these instances of optimisation failure. The approach provides improved accuracy over current state-of-the-art methods, through the inclusion of temporal cohesion and by learning to correct from failure cases. Using discriminative learning, our approach performs guided optimisation, greatly reducing model fitting complexity and radically improving efficiency. This allows tracking to be performed at over 40 frames per second using a single CPU thread.
We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
Intelligent visual surveillance is an important application area for computer vision. In situations where networks of hundreds of cameras are used to cover a wide area, the obvious limitation becomes the users’ ability to manage such vast amounts of information. For this reason, automated tools that can generalise about activities or track objects are important to the operator. Key to the users’ requirements is the ability to track objects across (spatially separated) camera scenes. However, extensive geometric knowledge about the site and camera position is typically required. Such an explicit mapping from camera to world is infeasible for large installations as it requires that the operator know which camera to switch to when an object disappears. To further compound the problem the installation costs of CCTV systems outweigh those of the hardware. This means that geometric constraints or any form of calibration (such as that which might be used with epipolar constraints) is simply not realistic for a real world installation. The algorithms cannot afford to dictate to the installer. This work attempts to address this problem and outlines a method to allow objects to be related and tracked across cameras without any explicit calibration, be it geometric or colour.
Often within the field of tracking people, only fixed cameras are used. This can mean that when the illumination of the image changes or object occlusion occurs, the tracking can fail. We propose an approach that uses three simultaneous separate sensors. The fixed surveillance cameras track objects of interest across cameras by incrementally learning relationships between regions on the image. Cameras and laser rangefinder sensors onboard robots also provide an estimate of the person's position. Moreover, the signal strength of mobile devices carried by the person can be used to estimate their position. The estimates from all these sources are then combined using data fusion to provide an increase in performance. We present results of the fixed camera based tracking operating in real time on a large outdoor environment of over 20 non-overlapping cameras. Moreover, the tracking algorithms for robots and wireless nodes are described. A decentralized data fusion algorithm for combining all this information is presented.
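To give a feel for why combining several noisy position estimates helps, the sketch below fuses independent one-dimensional estimates by inverse-variance weighting. This is a deliberately simple stand-in and not the decentralized fusion algorithm described in the paper; the function name `fuse_estimates`, the sensor readings and their variances are all made up for illustration.

```python
def fuse_estimates(estimates):
    """Fuse independent (position, variance) pairs by inverse-variance weighting.

    Returns the fused position and its variance. The fused variance is never
    larger than the smallest input variance, which is the benefit of fusion.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    if any(variance <= 0 for _, variance in estimates):
        raise ValueError("variances must be positive")
    weights = [1.0 / variance for _, variance in estimates]
    total_weight = sum(weights)
    fused_position = sum(
        weight * position for weight, (position, _) in zip(weights, estimates)
    ) / total_weight
    return fused_position, 1.0 / total_weight


# Hypothetical readings in metres along a corridor: (position, variance).
camera = (10.0, 4.0)    # fixed camera, moderate uncertainty
laser = (11.0, 1.0)     # robot laser rangefinder, most precise
wireless = (14.0, 16.0) # signal strength, least precise
position, variance = fuse_estimates([camera, laser, wireless])
```

In this example the fused position stays close to the laser reading because it carries the largest weight, while the least precise wireless estimate only nudges the result.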
We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and we show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.
This paper presents a solution to the problem of tracking people within crowded scenes. The aim is to maintain individual object identity through a crowded scene which contains complex interactions and heavy occlusions of people. Our approach uses the strengths of two separate methods: a global object detector and a localised frame by frame tracker. A temporal relationship model of torso detections, built during low activity periods, is used to further disambiguate during periods of high activity. A single camera with no calibration and no environmental information is used. Results are compared to a standard tracking method and groundtruth. Two video sequences containing interactions, overlaps and occlusions between people are used to demonstrate our approach. The results show that our technique performs better than a standard tracking method and can cope with challenging occlusions and crowd interactions.
In this paper, we propose a computational model for social interaction between three people in a conversation, and demonstrate results using human video motion synthesis. We utilised semi-supervised computer vision techniques to label social signals between the people, like laughing, head nod and gaze direction. Data mining is used to deduce frequently occurring patterns of social signals between a speaker and a listener in both interested and not interested social scenarios, and the mined confidence values are used as conditional probabilities to animate social responses. The human video motion synthesis is done using an appearance model to learn a multivariate probability distribution, combined with a transition matrix to derive the likelihood of motion given a pose configuration. Our system uses social labels to more accurately define motion transitions and build a texture motion graph. Traditional motion synthesis algorithms are best suited to large human movements like walking and running, where motion variations are large and prominent. Our method focuses on generating more subtle human movement like head nods. The user can then control who speaks and the interest level of the individual listeners resulting in social interactive conversational agents.
This paper proposes a generic approach combining a bottom-up (low-level) visual detector with a top-down (high-level) fuzzy first-order logic (FOL) reasoning framework in order to detect pedestrians from a moving vehicle. Detections from the low-level visual corner based detector are fed into the logical reasoning framework as logical facts. A set of FOL clauses utilising fuzzy predicates with piecewise linear continuous membership functions associates a fuzzy confidence (a degree-of-truth) to each detector input. Detections associated with lower confidence functions are deemed as false positives and blanked out, thus adding top-down constraints based on global logical consistency of detections. We employ a state of the art visual detector on a challenging pedestrian detection dataset, and demonstrate an increase in detection performance when used in a framework that combines bottom-up detections with (fuzzy FOL-based) top-down constraints. © 2012 ICPR Org Committee.
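The idea of attaching a degree-of-truth to each detection can be sketched with a piecewise linear membership function. The example below is only a minimal illustration under stated assumptions: the predicates (`height_px`, `score`), their breakpoints, the use of `min` as the fuzzy conjunction and the 0.5 cut-off are all hypothetical and are not taken from the paper.

```python
def piecewise_linear(x, low, high):
    """Membership that is 0 at or below `low`, 1 at or above `high`, linear in between."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def fuzzy_and(*degrees):
    """Fuzzy conjunction using the minimum of the degrees of truth."""
    return min(degrees)


def filter_detections(detections, threshold=0.5):
    """Keep detections whose combined degree-of-truth reaches the threshold."""
    kept = []
    for detection in detections:
        # Hypothetical fuzzy predicates: "tall enough" and "detector is confident".
        tall_enough = piecewise_linear(detection["height_px"], 40.0, 80.0)
        confident = piecewise_linear(detection["score"], 0.2, 0.6)
        degree = fuzzy_and(tall_enough, confident)
        if degree >= threshold:
            kept.append((detection, degree))
    return kept


detections = [
    {"height_px": 80.0, "score": 0.6},  # satisfies both predicates fully
    {"height_px": 45.0, "score": 0.9},  # too small, low degree-of-truth
]
kept = filter_detections(detections)
```

Because the membership functions are continuous, a detection that narrowly misses one criterion is down-weighted smoothly rather than rejected by a hard rule, and only the final thresholding step is binary.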
Within the field of image and video recognition, the traditional approach is a dataset split into fixed training and test partitions. However, the labelling of the training set is time-consuming, especially as datasets grow in size and complexity. Furthermore, this approach is not applicable to the home user, who wants to intuitively group their media without tirelessly labelling the content. Consequently, we propose a solution similar in nature to an active learning paradigm, where a small subset of media is labelled as semantically belonging to the same class, and machine learning is then used to pull this and other related content together in the feature space. Our interactive approach is able to iteratively cluster classes of images and video. We reformulate it in an online learning framework and demonstrate competitive performance to batch learning approaches using only a fraction of the labelled data. Our approach is based around the concept of an image signature which, unlike a standard bag of words model, can express co-occurrence statistics as well as symbol frequency. We efficiently compute metric distances between signatures despite their inherent high dimensionality and provide discriminative feature selection, to allow common and distinctive elements to be identified from a small set of user labelled examples. These elements are then accentuated in the image signature to increase similarity between examples and pull correct classes together. By repeating this process in an online learning framework, the accuracy of similarity increases dramatically despite labelling only a few training examples. To demonstrate that the approach is agnostic to media type and features used, we evaluate on three image datasets (15 scene, Caltech101 and FG-NET), a mixed text and image dataset (ImageTag), a dataset used in active learning (Iris) and on three action recognition datasets (UCF11, KTH and Hollywood2).
On the UCF11 video dataset, the accuracy is 86.7% despite using only 90 labelled examples from a dataset of over 1200 videos, instead of the standard 1122 training videos. The approach is both scalable and efficient, with a single iteration over the full UCF11 dataset of around 1200 videos taking approximately 1 minute on a standard desktop machine.
Since 2010, ImageCLEF has run a scalable image annotation task, to promote research into the annotation of images using noisy web page data. It aims to develop techniques to allow computers to describe images reliably, localise different concepts depicted and generate descriptions of the scenes. The primary goal of the challenge is to encourage creative ideas of using web page data to improve image annotation. Three subtasks and two pilot teaser tasks were available to participants; all tasks use a single mixed modality data source of 510,123 web page items for both training and test. The dataset included raw images, textual features obtained from the web pages on which the images appeared, as well as extracted visual features. The dataset was formed by querying popular image search engines on the Web. For the main subtasks, the development and test sets were both taken from the “training set”. For the teaser tasks, 200,000 web page items were reserved for testing, and a separate development set was provided. The 251 concepts were chosen to be visual objects that are localizable and that are useful for generating textual descriptions of the visual content of images and were mined from the texts of our extensive database of image-webpage pairs. This year seven groups participated in the task, submitting over 50 runs across all subtasks, and all participants also provided working notes papers. In general, the groups' performance is impressive across the tasks, and there are interesting insights into these very relevant challenges.
© Springer International Publishing Switzerland 2015. In recent years, dense trajectories have been shown to be an efficient representation for action recognition and have achieved state-of-the-art results on a variety of increasingly difficult datasets. However, while the features have greatly improved the recognition scores, the training process and machine learning used have not in general deviated from the object recognition based SVM approach. This is despite the increase in quantity and complexity of the features used. This paper improves the performance of action recognition through two data mining techniques, APriori association rule mining and Contrast Set Mining. These techniques are ideally suited to action recognition, and in particular to dense trajectory features, as they can utilise the large amounts of data to identify far shorter discriminative subsets of features called rules. Experimental results on one of the most challenging datasets, Hollywood2, outperform the current state-of-the-art.
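The core of APriori association rule mining can be sketched in a few lines: frequent itemsets are grown level by level, and any candidate with an infrequent subset is pruned before it is counted. The code below is a minimal textbook-style sketch, not the implementation used in the paper; the function names, the toy transactions (quantised feature symbols plus a class label) and the minimum support count are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for itemsets seen in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while current:
        # Count each candidate and keep only those meeting the support threshold.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent itemsets of this size into candidates one item larger.
        size += 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune candidates that contain any infrequent subset (the APriori property).
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def rule_confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from mined support counts."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    return frequent[antecedent | consequent] / frequent[antecedent]


# Toy transactions: hypothetical feature symbols "f1", "f2" and a class label "run".
transactions = [
    {"f1", "f2", "run"},
    {"f1", "f2"},
    {"f1", "run"},
    {"f2", "run"},
    {"f1", "f2", "run"},
]
frequent = apriori(transactions, min_count=3)
confidence = rule_confidence(frequent, {"f1"}, {"run"})
```

A rule such as `{"f1"} -> {"run"}` with high confidence plays the role of a short discriminative feature subset: it fires on a handful of symbols rather than on the full feature vector.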
Automatic image annotation is the task of automatically assigning some form of semantic label to images, such as words, phrases or sentences describing the objects, attributes, actions, and scenes depicted in the image. In this chapter, we present an overview of the various automatic image annotation tasks that were organized in conjunction with the ImageCLEF track at CLEF between 2009–2016. Throughout the eight years, the image annotation tasks have evolved from annotating Flickr photos by learning from clean data to annotating web images by learning from large-scale noisy web data. The tasks are divided into three distinct phases, and this chapter will provide a discussion for each of these phases. We will also compare and contrast other related benchmarking challenges, and provide some insights into the future of automatic image annotation.
We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from an MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces, enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.
The use of sparse invariant features to recognise classes of actions or objects has become common in the literature. However, features are often “engineered” to be both sparse and invariant to transformation and it is assumed that they provide the greatest discriminative information. To tackle activity recognition, we propose learning compound features that are assembled from simple 2D corners in both space and time. Each corner is encoded in relation to its neighbours and from an over complete set (in excess of 1 million possible features), compound features are extracted using data mining. The final classifier, consisting of sets of compound features, can then be applied to recognise and localise an activity in real-time while providing superior performance to other state-of-the-art approaches (including those based upon sparse feature detectors). Furthermore, the approach requires only weak supervision in the form of class labels for each training sequence. No ground truth position or temporal alignment is required during training.
We present an approach to automatically expand the annotation of images using the internet as an additional information source. The novelty of the work is in the expansion of image tags by automatically introducing new unseen complex linguistic labels which are collected unsupervised from associated webpages. Taking a small subset of existing image tags, a web based search retrieves additional textual information. Both a textual bag of words model and a visual bag of words model are combined and symbolised for data mining. Association rule mining is then used to identify rules which relate words to visual contents. Unseen images that fit these rules are re-tagged. This approach allows a large number of additional annotations to be added to unseen images, on average 12.8 new tags per image, with an 87.2% true positive rate. Results are shown on two datasets, including a new 2800 image annotation dataset of landmarks. The results include pictures of buildings being tagged with the architect, the year of construction and even events that have taken place there. This widens the tag annotation impact and their use in retrieval. This dataset is made available along with tags and the 1970 webpages and additional images which form the information corpus. In addition, results for a common state-of-the-art dataset MIRFlickr25000 are presented for comparison of the learning framework against previous works. © 2013 Springer-Verlag.
Within the field of action recognition, features and descriptors are often engineered to be sparse and invariant to transformation. While sparsity makes the problem tractable, it is not necessarily optimal in terms of class separability and classification. This paper proposes a novel approach that uses very dense corner features that are spatially and temporally grouped in a hierarchical process to produce an overcomplete compound feature set. Frequently reoccurring patterns of features are then found through data mining, designed for use with large data sets. The novel use of the hierarchical classifier allows real time operation while the approach is demonstrated to handle camera motion, scale, human appearance variations, occlusions and background clutter. The classification performance exceeds that of other state-of-the-art action recognition algorithms on three datasets: KTH, multi-KTH, and Hollywood. Multiple action localisation is performed, though no groundtruth localisation data is required, using only weak supervision of class labels for each training sequence. The Hollywood dataset contains complex realistic actions from movies; the approach outperforms the published accuracy on this dataset and also achieves real time performance. ©2009 IEEE.
It is known that relative feature location is important in representing objects, but assumptions that make learning tractable often simplify how structure is encoded, e.g. spatial pooling or star models. For example, techniques such as spatial pyramid matching (SPM), in conjunction with machine learning techniques, perform well. However, there are limitations to such spatial encoding schemes which discard important information about the layout of features. In contrast, we propose to use the object itself to choose the basis of the features in an object centric approach. In doing so we return to the early work of geometric hashing but demonstrate how such approaches can be scaled-up to modern day object detection challenges in terms of both the number of examples and their variability. We apply a two stage process, initially filtering background features to localise the objects and then hashing the remaining pairwise features in an affine invariant model. During learning, we identify class-wise key feature predictors. We validate our detection and classification of objects on the PASCAL VOC'07 and '11 and CarDb datasets and compare with state of the art detectors and classifiers. Importantly we demonstrate how structure in features can be efficiently identified and how its inclusion can increase performance. This feature centric learning technique allows us to localise objects even without object annotation during training and the resultant segmentation provides accurate state of the art object localization, without the need for annotations.
The ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task was the fourth edition of a challenge aimed at developing more scalable image annotation systems. In particular, this year the three subtasks available to participants focused on developing techniques to allow computers to reliably describe images, localize the different concepts depicted in the images and generate a description of the scene. All three tasks use a single mixed modality data source of 500,000 web page items which included raw images, textual features obtained from the web pages on which the images appeared, as well as various visual features extracted from the images themselves. Unlike previous years the test set was also the training set and in this edition of the task hand-labelled data has been allowed. The images were obtained from the Web by querying popular image search engines. The development and subtasks 1 and 2 test sets were both taken from the “training set” and had 1,979 and 3,070 samples, and the subtask 3 track had 500 and 450 samples. The 251 concepts this year were chosen to be visual objects that are localizable and that are useful for generating textual descriptions of visual content of images and were mined from the texts of our large database of image-webpage pairs. This year 14 groups participated in the task, submitting a total of 122 runs across the 3 subtasks and 11 of the participants also submitted working notes papers. This result is very positive: compared to the 11 participants and 58 submitted runs of last year, it shows that interest in this topic is still very high.
This paper presents a generic method for recognising and localising human actions in video based solely on the distribution of interest points. The use of local interest points has shown promising results in both object and action recognition. While previous methods classify actions based on the appearance and/or motion of these points, we hypothesise that the distribution of interest points alone contains the majority of the discriminatory information. Motivated by its recent success in rapidly detecting 2D interest points, the semi-naive Bayesian classification method of Randomised Ferns is employed. Given a set of interest points within the boundaries of an action, the generic classifier learns the spatial and temporal distributions of those interest points. This is done efficiently by comparing sums of responses of interest points detected within randomly positioned spatio-temporal blocks within the action boundaries. We present results on the largest and most popular human action dataset using a number of interest point detectors, and demonstrate that the distribution of interest points alone can perform as well as approaches that rely upon the appearance of the interest points.
We present ALADIN (All Layer AdaIN), a novel architecture for searching images based on the similarity of their artistic style. Representation learning is critical to visual search, where distance in the learned search embedding reflects image similarity. Learning an embedding that discriminates fine-grained variations in style is hard, due to the difficulty of defining and labelling style. ALADIN takes a weakly supervised approach to learning a representation for fine-grained style similarity of digital artworks, leveraging BAM-FG, a novel large-scale dataset of user generated content groupings gathered from the web. ALADIN sets a new state of the art accuracy for style-based visual search over both coarse labelled style data (BAM) and BAM-FG, a new 2.62 million image dataset of 310,000 fine-grained style groupings also contributed by this work.
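For readers unfamiliar with the AdaIN in the name, the sketch below shows the standard adaptive instance normalisation operation: each content channel is normalised to zero mean and unit standard deviation, then rescaled to the per-channel mean and standard deviation of a style feature map. It illustrates the general operation only and says nothing about how ALADIN arranges it across layers; the function name `adain`, the list-of-channels layout and the `eps` value are assumptions for the example.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Apply adaptive instance normalisation channel by channel.

    `content` and `style` are lists of channels, each channel a flat list of
    activations. The output keeps the content layout but takes on the style
    channel's mean and standard deviation.
    """
    if len(content) != len(style):
        raise ValueError("content and style need the same number of channels")
    output = []
    for content_channel, style_channel in zip(content, style):
        content_mean = fmean(content_channel)
        content_std = pstdev(content_channel)
        style_mean = fmean(style_channel)
        style_std = pstdev(style_channel)
        output.append([
            style_std * (value - content_mean) / (content_std + eps) + style_mean
            for value in content_channel
        ])
    return output


# Toy single-channel feature maps with very different statistics.
content_features = [[1.0, 2.0, 3.0]]
style_features = [[10.0, 20.0, 30.0]]
stylised = adain(content_features, style_features)
```

The per-channel means and standard deviations are the only style information that passes through this operation, which is why such statistics are a natural candidate for a compact style descriptor.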