Professor Richard Bowden


Professor of Computer Vision and Machine Learning
BSc, MSc, PhD, SMIEEE, FHEA, FIAPR
+44 (0)1483 689838
22 BA 00

Biography

Areas of specialism

Sign and gesture recognition; Deep learning; Cognitive Robotics; Activity and action recognition; Lip-reading; Machine Perception; Facial feature tracking; Autonomous Vehicles; Computer Vision; AI

University roles and responsibilities

  • Associate Dean (Doctoral College)
  • Professor of Computer Vision and Machine Learning

My qualifications

1993
BSc degree in computer science
University of London
1995
MSc degree with distinction
University of Leeds
1999
PhD degree in computer vision
Brunel University

Previous roles

2015 - 2016
Postgraduate Research Director for Faculty
University of Surrey
2013 - 2014
Royal Society Leverhulme Trust Senior Research Fellowship
Royal Society
2012
General chair
BMVC2012
2012
Track chair
ICPR2012
2008 - 2011
Reader
University of Surrey
2010
General Chair
Sign, Gesture, Activity 2010
2003 - 2009
Senior Tutor for Professional Training
University of Surrey
2006 - 2008
Senior Lecturer
University of Surrey
2001 - 2006
Lecturer in Multimedia Signal Processing
University of Surrey
2001 - 2004
Visiting Research Fellow working with Profs Zisserman and Brady
University of Oxford
1998 - 2001
Lecturer in Image Processing
Brunel University
1997
General Chair
VRSig97

Affiliations and memberships

IEEE Transactions on Pattern Analysis and Machine Intelligence
Associate Editor
Image and Vision Computing journal
Associate Editor
British Machine Vision Association (BMVA) Executive Committee
Previous member
British Machine Vision Association (BMVA) Executive Committee
Previous Company Director
Higher Education Academy
Fellow
IEEE
Senior Member
IAPR
Fellow

Research

Research interests

Research projects

Indicators of esteem

  • Awarded Fellow of the International Association for Pattern Recognition in 2016

  • Member of Royal Society’s International Exchanges Committee 2016

  • Royal Society Leverhulme Trust Senior Research Fellowship

  • Sullivan thesis prize in 2000

  • Executive Committee member and theme leader for EPSRC ViiHM Network 2015

  • TIGA Games award for Makaton Learning Environment with Gamelab UK 2013

  • Appointed Associate Editor for IEEE Transactions on Pattern Analysis and Machine Intelligence 2013

  • Best Paper Award at VISAPP2012

  • Advisory Board for Springer Advances in Computer Vision and Pattern Recognition

  • General Chair BMVC2012

  • Outstanding Reviewer Award ICCV 2011

  • Best Paper Award at IbPRIA2011

  • Main Track Chair (Computer & Robot Vision) ICPR2012, Japan

  • Appointed Associate Editor for the Image and Vision Computing journal

My teaching

Supervision

Postgraduate research supervision


My publications

Publications

In this paper, we address the problem of tracking an unknown object in 3D space. Online 2D tracking often fails for strong out-of-plane rotation, which results in considerable changes in appearance beyond those that can be represented by online update strategies. However, by modelling and learning the 3D structure of the object explicitly, such effects are mitigated. To address this, a novel approach is presented, combining techniques from the fields of visual tracking, structure from motion (SfM) and simultaneous localisation and mapping (SLAM). This algorithm is referred to as TMAGIC (Tracking, Modelling And Gaussian-process Inference Combined). At every frame, point and line features are tracked in the image plane and are used, together with their 3D correspondences, to estimate the camera pose. These features are also used to model the 3D shape of the object as a Gaussian process. Tracking determines the trajectories of the object in both the image plane and 3D space, but the approach also provides the 3D object shape. The approach is validated on several video sequences used in the tracking literature, comparing favourably to state-of-the-art trackers for simple scenes (error reduced by 22%) with clear advantages in the case of strong out-of-plane rotation, where 2D approaches fail (error reduction of 58%).
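To illustrate the shape-modelling component above, here is a minimal, self-contained sketch (not the authors' code) of representing an object surface as a Gaussian process: sparse tracked 3D points act as training data and the GP interpolates a smooth radial surface between them. All function names and parameter values are hypothetical.

```python
# Toy Gaussian-process surface model: sparse feature points, expressed as
# (azimuth, elevation, radius), are interpolated into a smooth surface.
import numpy as np

def rbf_kernel(A, B, length_scale=0.4, variance=1.0):
    """Squared-exponential covariance between two sets of surface coordinates."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_surface_fit(coords, radii, noise=1e-3):
    """Fit a GP mapping surface parameterisation (azimuth/elevation) -> radius."""
    K = rbf_kernel(coords, coords) + noise * np.eye(len(coords))
    alpha = np.linalg.solve(K, radii)  # precomputed regression weights
    return coords, alpha

def gp_surface_predict(model, query):
    """Predict the radial distance of the surface at unseen directions."""
    train_coords, alpha = model
    return rbf_kernel(query, train_coords) @ alpha

# Sparse feature points tracked on the object (all values illustrative).
obs = np.array([[0.0, 0.0, 1.0], [0.5, 0.1, 1.1], [1.0, -0.2, 0.9]])
model = gp_surface_fit(obs[:, :2], obs[:, 2])
print(gp_surface_predict(model, np.array([[0.25, 0.0]])))  # interpolated point
```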
Cooper H, Bowden R (2009) Sign Language Recognition: Working with Limited Corpora, UNIVERSAL ACCESS IN HUMAN-COMPUTER INTERACTION: APPLICATIONS AND SERVICES, PT III 5616 pp. 472-481 SPRINGER-VERLAG BERLIN
Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2016) Next-best stereo: extending next best view optimisation for collaborative sensors, Proceedings of BMVC 2016
Most 3D reconstruction approaches passively optimise over all data, exhaustively matching pairs, rather than actively selecting data to process. This is costly both in terms of time and computer resources, and quickly becomes intractable for large datasets. This work proposes an approach to intelligently filter large amounts of data for 3D reconstructions of unknown scenes using monocular cameras. Our contributions are twofold: First, we present a novel approach to efficiently optimise the Next-Best View (NBV) in terms of accuracy and coverage using partial scene geometry. Second, we extend this to intelligently selecting stereo pairs by jointly optimising the baseline and vergence to find the NBV's best stereo pair to perform reconstruction. Both contributions are extremely efficient, taking 0.8ms and 0.3ms per pose, respectively. Experimental evaluation shows that the proposed method allows efficient selection of stereo pairs for reconstruction, such that a dense model can be obtained with only a small number of images. Once a complete model has been obtained, the remaining computational budget is used to intelligently refine areas of uncertainty, achieving results comparable to state-of-the-art batch approaches on the Middlebury dataset, using as little as 3.8% of the views.
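As a rough illustration of the selection loop described above, the following toy sketch scores candidate camera poses by how much uncertain geometry they would observe and jointly penalises poor baselines when pairing them. The cost terms are illustrative stand-ins, not the paper's actual objective.

```python
# Hypothetical next-best-stereo selection: coverage of uncertain points,
# traded off against deviation from an ideal stereo baseline.
import numpy as np

def coverage_score(pose, points, uncertainty, fov_cos=0.7):
    """Reward viewing high-uncertainty points inside the camera frustum."""
    cam_pos, cam_dir = pose
    rays = points - cam_pos
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    visible = rays @ cam_dir > fov_cos
    return uncertainty[visible].sum()

def next_best_stereo(candidates, points, uncertainty, ideal_baseline=0.3):
    """Jointly score pose pairs: coverage of both views minus a baseline penalty."""
    best, best_score = None, -np.inf
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            baseline = np.linalg.norm(a[0] - b[0])
            score = (coverage_score(a, points, uncertainty)
                     + coverage_score(b, points, uncertainty)
                     - 5.0 * (baseline - ideal_baseline) ** 2)
            if score > best_score:
                best, best_score = (a, b), score
    return best

pts = np.random.rand(200, 3)   # partial scene geometry (stand-in)
unc = np.random.rand(200)      # per-point reconstruction uncertainty (stand-in)
cands = [(np.array([x, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))
         for x in np.linspace(-1, 1, 5)]
pair = next_best_stereo(cands, pts, unc)
```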
Efthimiou E, Fotinea S-E, Vogler C, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Segouat J (2009) Sign language recognition, generation, and modelling: A research effort with applications in deaf communication, Lecture Notes in Computer Science: Proceedings of 5th International Conference of Universal Access in Human-Computer Interaction. Addressing Diversity, Part 1 5614 pp. 21-30 Springer
Sign language and Web 2.0 applications are currently incompatible, because of the lack of anonymisation and easy editing of online sign language contributions. This paper describes Dicta-Sign, a project aimed at developing the technologies required for making sign language-based Web contributions possible, by providing an integrated framework for sign language recognition, animation, and language modelling. It targets four different European sign languages: Greek, British, German, and French. Expected outcomes are three showcase applications for a search-by-example sign language dictionary, a sign language-to-sign language translator, and a sign language-based Wiki.
Koller O, Ney H, Bowden R (2016) Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled, Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition
This work presents a new approach to learning a frame-based classifier on weakly labelled sequence data by embedding a CNN within an iterative EM algorithm. This allows the CNN to be trained on a vast number of example images when only loose sequence-level information is available for the source videos. Although we demonstrate this in the context of hand shape recognition, the approach has wider application to any video recognition task where frame-level labelling is not available. The iterative EM algorithm leverages the discriminative ability of the CNN to iteratively refine the frame-level annotation and subsequent training of the CNN. By embedding the classifier within an EM framework the CNN can easily be trained on 1 million hand images. We demonstrate that the final classifier generalises over both individuals and data sets. The algorithm is evaluated on over 3000 manually labelled hand shape images of 60 different classes which will be released to the community. Furthermore, we demonstrate its use in continuous sign language recognition on two publicly available large sign language data sets, where it outperforms the current state-of-the-art by a large margin. To our knowledge no previous work has explored expectation maximization without Gaussian mixture models to exploit weak sequence labels for sign language recognition.
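A toy sketch of the EM-style loop described above (not the authors' code): a frame classifier is retrained on its own refined labels, with the E-step restricted to the classes each weak sequence label permits. A logistic regression stands in for the CNN so the sketch runs anywhere; the paper additionally enforces temporal ordering, which is omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for the CNN

def em_train(sequences, allowed_classes, n_iters=5):
    """sequences: list of (n_frames, n_feats) arrays.
    allowed_classes: per-sequence list of class ids the weak label permits."""
    # Crude initial alignment: split each sequence evenly among its classes.
    frame_labels = [np.array(allowed)[(np.arange(len(seq)) * len(allowed)) // len(seq)]
                    for seq, allowed in zip(sequences, allowed_classes)]
    clf = LogisticRegression(max_iter=500)
    for _ in range(n_iters):
        clf.fit(np.vstack(sequences), np.concatenate(frame_labels))   # M-step
        for i, seq in enumerate(sequences):                           # E-step
            probs = clf.predict_proba(seq)
            cols = [np.flatnonzero(clf.classes_ == c)[0] for c in allowed_classes[i]]
            picks = probs[:, cols].argmax(1)   # best permitted class per frame
            frame_labels[i] = np.array(allowed_classes[i])[picks]
    return clf

rng = np.random.default_rng(0)
# Two toy "videos", each showing two hand shapes in succession.
seq1 = np.vstack([rng.normal(0, .5, (30, 8)), rng.normal(3, .5, (30, 8))])
seq2 = np.vstack([rng.normal(6, .5, (30, 8)), rng.normal(9, .5, (30, 8))])
clf = em_train([seq1, seq2], [[0, 1], [2, 3]])
```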
KaewTrakulPong P, Bowden R (2003) A real time adaptive visual surveillance system for tracking low-resolution colour targets in dynamically changing scenes, IMAGE AND VISION COMPUTING 21 (10) pp. 913-929 ELSEVIER SCIENCE BV
Oshin O, Gilbert A, Illingworth J, Bowden R (2008) Spatio-Temporal Feature Recognition using Randomised Ferns, The 1st International Workshop on Machine Learning for Vision-based Motion Analysis (MVLMA'08)
Efthimiou E, Fotinea SE, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Lefebvre-Albaret F (2012) The dicta-sign Wiki: Enabling web communication for the deaf, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7383 LNCS (PART 2) pp. 205-212
The paper provides a report on the user-centred showcase prototypes of the DICTA-SIGN project (http://www.dictasign.eu/), an FP7-ICT project which ended in January 2012. DICTA-SIGN researched ways to enable communication between Deaf individuals through the development of human-computer interfaces (HCI) for Deaf users, by means of Sign Language. Emphasis is placed on the Sign-Wiki prototype that demonstrates the potential of sign languages to participate in contemporary Web 2.0 applications where user contributions are editable by an entire community and sign language users can benefit from collaborative editing facilities. © 2012 Springer-Verlag.
Pugeault N, Bowden R (2011) Driving me Around the Bend: Learning to Drive from Visual Gist, 2011 IEEE International Conference on Computer Vision pp. 1022-1029 IEEE
This article proposes an approach to learning steering and road following behaviour from a human driver using holistic visual features. We use a random forest (RF) to regress a mapping between these features and the driver's actions, and propose an alternative to classical random forest regression based on the Medoid (RF-Medoid), that reduces the underestimation of extreme control values. We compare prediction performance using different holistic visual descriptors: GIST, Channel-GIST (C-GIST) and Pyramidal-HOG (P-HOG). The proposed methods are evaluated on two different datasets: predicting human behaviour on countryside roads and also for autonomous control of a robot on an indoor track. We show that 1) C-GIST leads to the best predictions on both sequences, and 2) RF-Medoid leads to a better estimation of extreme values, where a classical RF tends to under-steer. We use around 10% of the data for training and show excellent generalization over a dataset of thousands of images. Importantly, we do not engineer the solution but instead use machine learning to automatically identify the relationship between visual features and behaviour, providing an efficient, generic solution to autonomous control.
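The RF-Medoid idea lends itself to a compact sketch: rather than averaging the per-tree regression outputs (which pulls extreme steering values toward the centre), return the medoid, an actual tree prediction minimising total distance to the others. This is a toy illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_medoid_predict(forest, X):
    # Per-tree predictions: shape (n_trees, n_samples).
    per_tree = np.stack([t.predict(X) for t in forest.estimators_])
    out = np.empty(X.shape[0])
    for j in range(X.shape[0]):
        v = per_tree[:, j]
        dist = np.abs(v[:, None] - v[None, :]).sum(1)  # total L1 distance
        out[j] = v[dist.argmin()]                       # medoid prediction
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))   # stand-in for holistic visual features
y = X[:, 0] ** 3                # control signal with extreme values
forest = RandomForestRegressor(n_estimators=50).fit(X, y)
print(rf_medoid_predict(forest, X[:5]), forest.predict(X[:5]))
```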
Krejov P, Gilbert A, Bowden R (2015) Combining Discriminative and Model Based Approaches for Hand Pose Estimation, 2015 11TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), VOL. 2 IEEE
Gilbert A, Illingworth J, Bowden R (2008) Scale Invariant Action Recognition Using Compound Features Mined from Dense Spatio-temporal Corners, Lecture Notes in Computer Science: Proceedings of 10th European Conference on Computer Vision (Part 1) 5302 pp. 222-233 Springer
The use of sparse invariant features to recognise classes of actions or objects has become common in the literature. However, features are often 'engineered' to be both sparse and invariant to transformation and it is assumed that they provide the greatest discriminative information. To tackle activity recognition, we propose learning compound features that are assembled from simple 2D corners in both space and time. Each corner is encoded in relation to its neighbours and from an over-complete set (in excess of 1 million possible features), compound features are extracted using data mining. The final classifier, consisting of sets of compound features, can then be applied to recognise and localise an activity in real-time while providing superior performance to other state-of-the-art approaches (including those based upon sparse feature detectors). Furthermore, the approach requires only weak supervision in the form of class labels for each training sequence. No ground truth position or temporal alignment is required during training.
Bowden R (2004) Progress in sign and gesture recognition, ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS 3179 pp. 13-13 SPRINGER-VERLAG BERLIN
Hadfield SJ, Bowden R, Lebeda K (2016) The Visual Object Tracking VOT2016 Challenge Results, Lecture Notes in Computer Science 9914 pp. 777-823 Springer
The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in recent years. The number of tested state-of-the-art trackers makes VOT2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).
Bowden R, Sarhadi M (2000) Building Temporal Models for Gesture Recognition, Proceedings of BMVC 2000 - The Eleventh British Machine Vision Conference BMVA (British Machine Vision Association)
This work presents a piecewise linear approximation to non-linear Point Distribution Models for modelling the human hand. The work utilises the natural segmentation of shape space, inherent to the technique, to apply temporal constraints which can be used with CONDENSATION to support multiple hypotheses and quantum leaps through shape space. This paper presents a novel method by which the one-state transitions of the English Language are projected into shape space for tracking and model prediction using an HMM-like approach.
Okwechime D, Ong E-J, Bowden R (2011) MIMiC: Multimodal Interactive Motion Controller, IEEE Transactions on Multimedia 13 (2) pp. 255-265 IEEE
We introduce a new algorithm for real-time interactive motion control and demonstrate its application to motion captured data, prerecorded videos, and HCI. Firstly, a data set of frames is projected into a lower-dimensional space. An appearance model is learnt using a multivariate probability distribution. A novel approach to determining transition points is presented based on k-medoids, whereby appropriate points of intersection in the motion trajectory are derived as cluster centers. These points are used to segment the data into smaller subsequences. A transition matrix combined with a kernel density estimation is used to determine suitable transitions between the subsequences to develop novel motion. To facilitate real-time interactive control, conditional probabilities are used to derive motion given user commands. The user commands can come from any modality including auditory, touch, and gesture. The system is also extended to HCI using audio signals of speech in a conversation to trigger nonverbal responses from a synthetic listener in real-time. We demonstrate the flexibility of the model by presenting results ranging from data sets composed of vectorized images, 2-D, and 3-D point representations. Results show real-time interaction and plausible motion generation between different types of movement.
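One component above, deriving transition points as k-medoids cluster centres so that every centre is a real frame of the trajectory (unlike k-means centroids), can be sketched as follows. This is a plain alternating k-medoids on a toy trajectory; all names are illustrative.

```python
import numpy as np

def k_medoids(frames, k, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise distances between frames in the reduced space.
    D = np.linalg.norm(frames[:, None] - frames[None, :], axis=-1)
    medoids = rng.choice(len(frames), k, replace=False)
    for _ in range(n_iters):
        assign = D[:, medoids].argmin(1)      # nearest medoid per frame
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if len(members):
                # The member minimising total distance within its cluster
                # becomes the new medoid (always an actual frame).
                new[c] = members[D[np.ix_(members, members)].sum(0).argmin()]
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return medoids                            # candidate transition frames

traj = np.cumsum(np.random.randn(300, 3), axis=0)  # frames in a reduced pose space
print(k_medoids(traj, k=5))
```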
Shaukat A, Gilbert A, Windridge D, Bowden R (2012) Meeting in the Middle: A top-down and bottom-up approach to detect pedestrians, Proceedings - International Conference on Pattern Recognition pp. 874-877
This paper proposes a generic approach combining a bottom-up (low-level) visual detector with a top-down (high-level) fuzzy first-order logic (FOL) reasoning framework in order to detect pedestrians from a moving vehicle. Detections from the low-level visual corner based detector are fed into the logical reasoning framework as logical facts. A set of FOL clauses utilising fuzzy predicates with piecewise linear continuous membership functions associates a fuzzy confidence (a degree-of-truth) to each detector input. Detections associated with lower confidence functions are deemed as false positives and blanked out, thus adding top-down constraints based on global logical consistency of detections. We employ a state of the art visual detector on a challenging pedestrian detection dataset, and demonstrate an increase in detection performance when used in a framework that combines bottom-up detections with (fuzzy FOL-based) top-down constraints. © 2012 ICPR Org Committee.
Oshin O, Gilbert A, Bowden R (2011) Capturing the relative distribution of features for action recognition, 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops pp. 111-116 IEEE
This paper presents an approach to the categorisation of spatio-temporal activity in video, which is based solely on the relative distribution of feature points. Introducing a Relative Motion Descriptor for actions in video, we show that the spatio-temporal distribution of features alone (without explicit appearance information) effectively describes actions, and demonstrate performance consistent with state-of-the-art. Furthermore, we propose that for actions where noisy examples exist, it is not optimal to group all action examples as a single class. Therefore, rather than engineering features that attempt to generalise over noisy examples, our method follows a different approach: We make use of Random Sampling Consensus (RANSAC) to automatically discover and reject outlier examples within classes. We evaluate the Relative Motion Descriptor and outlier rejection approaches on four action datasets, and show that outlier rejection using RANSAC provides a consistent and notable increase in performance, and demonstrate superior performance to more complex multiple-feature based approaches.
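The RANSAC-based outlier rejection can be pictured with a small sketch: repeatedly fit a trivial class model (here just a centroid) to a random subset of a class's descriptors and keep the largest consensus set. The model and thresholds are toy stand-ins for the paper's actual formulation.

```python
import numpy as np

def ransac_inliers(descriptors, n_rounds=200, sample=5, thresh=2.0, seed=0):
    rng = np.random.default_rng(seed)
    best = np.zeros(len(descriptors), dtype=bool)
    for _ in range(n_rounds):
        idx = rng.choice(len(descriptors), sample, replace=False)
        centre = descriptors[idx].mean(0)   # trivial class model
        inliers = np.linalg.norm(descriptors - centre, axis=1) < thresh
        if inliers.sum() > best.sum():      # keep the largest consensus set
            best = inliers
    return best

clean = np.random.randn(80, 16) * 0.3
noise = np.random.randn(20, 16) * 3.0 + 5.0   # mislabelled / noisy examples
mask = ransac_inliers(np.vstack([clean, noise]))
print(mask.sum(), "examples kept for training")
```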
Cooper H, Bowden R (2009) Learning Signs from Subtitles: A Weakly Supervised Approach to Sign Language Recognition, CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4 pp. 2560-2566 IEEE
Ong E-J, Ellis L, Bowden R (2009) Problem solving through imitation, IMAGE AND VISION COMPUTING 27 (11) pp. 1715-1728 ELSEVIER SCIENCE BV
Bowden R (2000) Learning Statistical Models of Human Motion, Proceedings of CVPR 2000 - IEEE Workshop on Human Modeling, Analysis and Synthesis IEEE
Non-linear statistical models of deformation provide methods to learn a priori shape and deformation for an object or class of objects by example. This paper extends these models of deformation to motion by augmenting the discrete representation of piecewise non-linear principal component analysis of shape with a Markov chain which represents the temporal dynamics of the model. In this manner, mean trajectories can be learnt and reproduced for either the simulation of movement or for object tracking. This paper demonstrates the use of these techniques in learning human motion from capture data.
Ong EJ, Lan Y, Theobald BJ, Harvey R, Bowden R (2009) Robust Facial Feature Tracking using Selected Multi-Resolution Linear Predictors, pp. 1483-1490
Sheerman-Chase T, Ong E-J, Pugeault N, Bowden R (2013) Improving Recognition and Identification of Facial Areas Involved in Non-verbal Communication by Feature Selection, Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on
Meaningful Non-Verbal Communication (NVC) signals can be recognised by facial deformations based on video tracking. However, the geometric features previously used contain a significant amount of redundant or irrelevant information. A feature selection method is described for selecting a subset of features that improves performance and allows for the identification and visualisation of facial areas involved in NVC. The feature selection is based on a sequential backward elimination of features to find an effective subset of components. This results in a significant improvement in recognition performance, as well as providing evidence that brow lowering is involved in questioning sentences. The improvement in performance is a step towards a more practical automatic system and the facial areas identified provide some insight into human behaviour.
Ong E, Bowden R (2011) Learning Sequential Patterns for Lipreading, Proceedings of the 22nd British Machine Vision Conference pp. 55.1-55.10 BMVA Press
This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search of optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set's boosting weight distribution. This temporal search method is then integrated into the boosting framework resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state-of-the-art recognition performance, using only a small set of sequential patterns.
Gupta A, Bowden R (2012) Fuzzy encoding for image classification using Gustafson-Kessel algorithm, IEEE PES Innovative Smart Grid Technologies Conference Europe pp. 3137-3140
This paper presents a novel adaptation of fuzzy clustering and feature encoding for image classification. Visual word ambiguity has recently been successfully modeled by kernel codebooks to provide improvement in classification performance over the standard 'Bag-of-Features' (BoF) approach, which uses hard partitioning and crisp logic for assignment of features to visual words. Motivated by this progress we utilize fuzzy logic to model the ambiguity and combine it with clustering to discover fuzzy visual words. The feature descriptors of an image are encoded using the learned fuzzy membership function associated with each word. The codebook built using this fuzzy encoding technique is demonstrated to provide superior performance over BoF. We use the Gustafson-Kessel algorithm which is an improvement over Fuzzy C-Means clustering and can adapt to local distributions. We evaluate our approach on several popular datasets and demonstrate that it consistently provides superior performance to the BoF approach. © 2012 IEEE.
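The encoding step can be sketched compactly. For brevity, plain fuzzy c-means memberships stand in for the Gustafson-Kessel metric (which additionally adapts a covariance matrix per cluster); the soft-histogram encoding is the same in either case.

```python
import numpy as np

def fuzzy_memberships(x, centres, m=2.0):
    """Soft assignment of descriptors to visual words (fuzzifier m > 1)."""
    d = np.linalg.norm(x[:, None] - centres[None, :], axis=-1) + 1e-12
    # Standard fuzzy c-means membership: u_ik = 1 / sum_j (d_ik/d_ij)^(2/(m-1)).
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(-1)

def encode_image(descriptors, centres):
    """Image representation: accumulated fuzzy memberships, not hard counts."""
    u = fuzzy_memberships(descriptors, centres)
    hist = u.sum(0)
    return hist / hist.sum()

words = np.random.rand(32, 64)   # learned fuzzy visual words (stand-in)
desc = np.random.rand(500, 64)   # local descriptors of one image (stand-in)
print(encode_image(desc, words).shape)  # (32,) soft bag-of-words vector
```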
Bowden R, Cox SJ, Harvey RW, Lan Y, Ong EJ, Owen G, Theobald BJ (2012) Is automated conversion of video to text a reality?, Proceedings of SPIE - The International Society for Optical Engineering 8546
A recent trend in law enforcement has been the use of forensic lip-readers. Criminal activities are often recorded on CCTV or other video gathering systems. Knowledge of what suspects are saying enriches the evidence gathered but lip-readers, by their own admission, are fallible so, based on long-term studies of automated lip-reading, we are investigating the possibilities and limitations of applying this technique under realistic conditions. We have adopted a step-by-step approach and are developing a capability when prior video information is available for the suspect of interest. We use the terminology video-to-text (V2T) for this technique by analogy with speech-to-text (S2T) which also has applications in security and law-enforcement. © 2012 SPIE.
Windridge D, Bowden R, Kittler J (2004) A General Strategy for Hidden Markov Chain Parameterisation in Composite Feature-Spaces, SSPR/SPR pp. 1069-1077
Hadfield S, Lebeda K, Bowden R (2014) Natural action recognition using invariant 3D motion encoding, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8690 LNCS (PART 2) pp. 758-771
We investigate the recognition of actions "in the wild" using 3D motion information. The lack of control over (and knowledge of) the camera configuration, exacerbates this already challenging task, by introducing systematic projective inconsistencies between 3D motion fields, hugely increasing intra-class variance. By introducing a robust, sequence based, stereo calibration technique, we reduce these inconsistencies from fully projective to a simple similarity transform. We then introduce motion encoding techniques which provide the necessary scale invariance, along with additional invariances to changes in camera viewpoint. On the recent Hollywood 3D natural action recognition dataset, we show improvements of 40% over previous state-of-the-art techniques based on implicit motion encoding. We also demonstrate that our robust sequence calibration simplifies the task of recognising actions, leading to recognition rates 2.5 times those for the same technique without calibration. In addition, the sequence calibrations are made available. © 2014 Springer International Publishing.
Gupta A, Bowden R (2012) Unity in diversity: Discovering topics from words: Information theoretic co-clustering for visual categorization, VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications 1 pp. 628-633
This paper presents a novel approach to learning a codebook for visual categorization, that resolves the key issue of intra-category appearance variation found in complex real world datasets. The codebook of visual-topics (semantically equivalent descriptors) is made by grouping visual-words (syntactically equivalent descriptors) that are scattered in feature space. We analyze the joint distribution of images and visual-words using information theoretic co-clustering to discover visual-topics. Our approach is compared with the standard 'Bag-of-Words' approach. The statistically significant performance improvement in all the datasets utilized (Pascal VOC 2006; VOC 2007; VOC 2010; Scene-15) establishes the efficacy of our approach.
Gilbert A, Bowden R (2006) Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity, Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 2 3952 pp. 125-136 Springer
This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method, to model both the colour variations and posterior probability distributions of spatio-temporal links between cameras. These operate in parallel and are then used with an appearance model of the object to track across spatially separated cameras. The approach requires no pre-calibration or batch preprocessing, is completely unsupervised, and becomes more accurate over time as evidence is accumulated.
Gilbert A, Illingworth J, Bowden R, Capitan J, Merino L (2009) Accurate fusion of robot, camera and wireless sensors for surveillance applications, IEEE 12th International Conference on Computer Vision Workshops pp. 1290-1297 IEEE
Often, within the field of tracking people, only fixed cameras are used. This can mean that when the illumination of the image changes or object occlusion occurs, the tracking can fail. We propose an approach that uses three simultaneous separate sensors. The fixed surveillance cameras track objects of interest across cameras through incrementally learning relationships between regions on the image. Cameras and laser rangefinder sensors onboard robots also provide an estimate of the person's position. Moreover, the signal strength of mobile devices carried by the person can be used to estimate his position. The estimates from all these sources are then combined using data fusion to provide an increase in performance. We present results of the fixed camera based tracking operating in real time on a large outdoor environment of over 20 non-overlapping cameras. Moreover, the tracking algorithms for robots and wireless nodes are described. A decentralized data fusion algorithm for combining all this information is presented.
Bowden R, Mitchel TA, Sarhadi M (1997) Real-time Dynamic Deformable Meshes for Volumetric Segmentation and Visualisation, BMVC97 Electronic Proceedings of the Eighth British Machine Vision Conference 1 pp. 310-319
This paper presents a surface segmentation method which uses a simulated inflating balloon model to segment surface structure from volumetric data using a triangular mesh. The model employs simulated surface tension and an inflationary force to grow from within an object and find its boundary. Mechanisms are described which allow both evenly spaced and minimal polygonal count surfaces to be generated. The work is based on inflating balloon models by Terzopoulos [8]. Simplifications are made to the model, and an approach proposed which provides a technique robust to noise regardless of the feature detection scheme used. The proposed technique uses no explicit attraction to data features, and as such is less dependent on the initialisation of the model and parameters. The model grows under its own forces, and is never anchored to boundaries, but instead constrained to remain inside the desired object. Results are presented which demonstrate the technique's ability and speed at the segmentation of a complex, concave object with narrow features.
Marter M, Hadfield S, Bowden R (2014) Friendly faces: Weakly supervised character identification, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8912 pp. 121-132
© Springer International Publishing Switzerland 2015. This paper demonstrates a novel method for automatically discovering and recognising characters in video without any labelled examples or user intervention. Instead weak supervision is obtained via a rough script-to-subtitle alignment. The technique uses pose invariant features, extracted from detected faces and clustered to form groups of co-occurring characters. Results show that with 9 characters, 29% of the closest exemplars are correctly identified, increasing to 50% as additional exemplars are considered.
Elliott R, Cooper HM, Ong EJ, Glauert J, Bowden R, Lefebvre-Albaret F (2011) Search-By-Example in Multilingual Sign Language Databases,
We describe a prototype Search-by-Example or look-up tool for signs, based on a newly developed 1000-concept sign lexicon for four national sign languages (GSL, DGS, LSF, BSL), which includes a spoken language gloss, a HamNoSys description, and a video for each sign. The look-up tool combines an interactive sign recognition system, supported by Kinect technology, with a real-time sign synthesis system, using a virtual human signer, to present results to the user. The user performs a sign to the system and is presented with animations of signs recognised as similar. The user also has the option to view any of these signs performed in the other three sign languages. We describe the supporting technology and architecture for this system, and present some preliminary evaluation results.
Cooper H, Bowden R (2010) Sign Language Recognition using Linguistically Derived Sub-Units, Proceedings of 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies pp. 57-61 European Language Resources Association (ELRA)
This work proposes to learn linguistically-derived sub-unit classifiers for sign language. The responses of these classifiers can be combined by Markov models, producing efficient sign-level recognition. Tracking is used to create vectors of hand positions per frame as inputs for sub-unit classifiers learnt using AdaBoost. Grid-like classifiers are built around specific elements of the tracking vector to model the placement of the hands. Comparative classifiers encode the positional relationship between the hands. Finally, binary-pattern classifiers are applied over the tracking vectors of multiple frames to describe the motion of the hands. Results for the sub-unit classifiers in isolation are presented, reaching averages over 90%. Using a simple Markov model to combine the sub-unit classifiers allows sign-level classification giving an average of 63%, over a 164 sign lexicon, with no grammatical constraints.
Hadfield S, Bowden R (2014) Scene Flow Estimation using Intelligent Cost Functions, Proceedings of the British Conference on Machine Vision (BMVC) BMVA
Motion estimation algorithms are typically based upon the assumption of brightness constancy or related assumptions such as gradient constancy. This manuscript evaluates several common cost functions from the motion estimation literature, which embody these assumptions. We demonstrate that such assumptions break for real world data, and the functions are therefore unsuitable. We propose a simple solution, which significantly increases the discriminatory ability of the metric, by learning a nonlinear relationship using techniques from machine learning. Furthermore, we demonstrate how context and a nonlinear combination of metrics can provide additional gains, yielding a 44% improvement in the performance of a state-of-the-art scene flow estimation technique. In addition, smaller gains of 20% are demonstrated in optical flow estimation tasks.
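In miniature, the central idea is to learn a scoring function for correspondences from examples of true and false matches, instead of assuming brightness constancy holds. The features, synthetic data and model choice below are illustrative only, not the paper's formulation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def pair_features(patch_a, patch_b):
    """Cues a hand-coded cost would use: intensity and gradient differences."""
    return np.array([
        np.abs(patch_a - patch_b).mean(),   # brightness constancy residual
        np.abs(np.gradient(patch_a)[0] - np.gradient(patch_b)[0]).mean(),
        patch_a.std() + patch_b.std(),      # local texture context
    ])

rng = np.random.default_rng(1)
true_pairs = [(p, p + rng.normal(0, 5, p.shape)) for p in rng.uniform(0, 255, (300, 7, 7))]
false_pairs = [(p, rng.uniform(0, 255, p.shape)) for p in rng.uniform(0, 255, (300, 7, 7))]
X = np.array([pair_features(a, b) for a, b in true_pairs + false_pairs])
y = np.array([1] * len(true_pairs) + [0] * len(false_pairs))
cost_model = GradientBoostingClassifier().fit(X, y)   # learned nonlinear cost

def learned_cost(patch_a, patch_b):
    """Low cost for pairs the model believes are genuine correspondences."""
    return 1.0 - cost_model.predict_proba(pair_features(patch_a, patch_b)[None])[0, 1]
```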
Bowden R, KaewTraKulPong P (2005) Towards automated wide area visual surveillance: tracking objects between spatially-separated, uncalibrated views, IEE PROCEEDINGS-VISION IMAGE AND SIGNAL PROCESSING 152 (2) pp. 213-223 IEE-INST ELEC ENG
Ong EJ, Bowden R (2011) Robust Facial Feature Tracking Using Shape-Constrained Multi-Resolution Selected Linear Predictors., IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (9) pp. 1844-1859 IEEE Computer Society
This paper proposes a learnt data-driven approach for accurate, real-time tracking of facial features using only intensity information, a non-trivial task since the face is a highly deformable object with large textural variations and motion in certain regions. The framework proposed here largely avoids the need for a priori design of feature trackers by automatically identifying the optimal visual support required for tracking a single facial feature point. This is essentially equivalent to automatically determining the visual context required for tracking. Tracking is achieved via biased linear predictors which provide a fast and effective method for mapping pixel-intensities into tracked feature position displacements. Multiple linear predictors are grouped into a rigid flock to increase robustness. To further improve tracking accuracy, a novel probabilistic selection method is used to identify relevant visual areas for tracking a feature point. These selected flocks are then combined into a hierarchical multi-resolution LP model. Finally, we also exploit a simple shape constraint for correcting the occasional tracking failure of a minority of feature points. Experimental results also show that this method performs more robustly and accurately than AAMs, on example sequences that range from SD quality to YouTube quality.
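A single biased linear predictor of the kind described can be sketched in a few lines: learn a least-squares map from intensity differences at a set of support pixels to the feature point's displacement, with training pairs synthesised by perturbing the point in a reference image. Names and values here are hypothetical.

```python
import numpy as np

def train_linear_predictor(image, point, support_offsets, n_train=400, max_disp=5, seed=0):
    rng = np.random.default_rng(seed)
    ref = image[point[1] + support_offsets[:, 1], point[0] + support_offsets[:, 0]]
    H, Y = [], []
    for _ in range(n_train):
        d = rng.integers(-max_disp, max_disp + 1, 2)   # synthetic displacement
        p = point + d
        samp = image[p[1] + support_offsets[:, 1], p[0] + support_offsets[:, 0]]
        H.append(np.append(samp - ref, 1.0))           # intensity diffs + bias term
        Y.append(-d)                                   # predictor must undo the shift
    P, *_ = np.linalg.lstsq(np.array(H), np.array(Y, float), rcond=None)
    return ref, P

def predict_displacement(image, point, support_offsets, ref, P):
    samp = image[point[1] + support_offsets[:, 1], point[0] + support_offsets[:, 0]]
    return np.append(samp - ref, 1.0) @ P              # estimated (dx, dy) correction

img = np.random.rand(100, 100)                         # stand-in reference image
offsets = np.random.default_rng(1).integers(-8, 9, (60, 2))
ref, P = train_linear_predictor(img, np.array([50, 50]), offsets)
print(predict_displacement(img, np.array([52, 49]), offsets, ref, P))
```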
Holt B, Ong EJ, Bowden R (2013) Accurate static pose estimation combining direct regression and geodesic extrema, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
Human pose estimation in static images has received significant attention recently but the problem remains challenging. Using data acquired from a consumer depth sensor, our method combines a direct regression approach for the estimation of rigid body parts with the extraction of geodesic extrema to find extremities. We show how these approaches are complementary and present a novel approach to combine the results, resulting in an improvement over the state-of-the-art. We report and compare our results on a new dataset of aligned RGB-D pose sequences which we release as a benchmark for further evaluation. © 2013 IEEE.
Dowson N, Kadir T, Bowden R (2008) Estimating the joint statistics of images using Nonparametric Windows with application to registration using Mutual Information, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 30 (10) pp. 1841-1857 IEEE COMPUTER SOC
Recently, Nonparametric (NP) Windows has been proposed to estimate the statistics of real 1D and 2D signals. NP Windows is accurate because it is equivalent to sampling images at a high (infinite) resolution for an assumed interpolation model. This paper extends the proposed approach to consider joint distributions of image pairs. Second, Green's Theorem is used to simplify the previous NP Windows algorithm. Finally, a resolution-aware NP Windows algorithm is proposed to improve robustness to relative scaling between an image pair. Comparative testing of 2D image registration was performed using translation only and affine transformations. Although it is more expensive than other methods, NP Windows frequently demonstrated superior performance for bias (distance between ground truth and global maximum) and frequency of convergence. Unlike other methods, the number of samples and the number of bins have little effect on NP Windows and the prior selection of a kernel is not required.
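For context, below is a minimal mutual-information objective built on a plain joint histogram; the paper's contribution, NP Windows, is a more accurate estimator of exactly this joint distribution and would replace the histogram estimator here.

```python
import numpy as np

def mutual_information(im_a, im_b, bins=32):
    """MI from a joint intensity histogram of two equally sized images."""
    joint, _, _ = np.histogram2d(im_a.ravel(), im_b.ravel(), bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return (p[nz] * np.log(p[nz] / (px @ py)[nz])).sum()

# Registration reduces to maximising MI over transform parameters, e.g. shifts:
a = np.random.rand(64, 64)
shifts = [(np.roll(a, s, axis=1), s) for s in range(-3, 4)]
best = max(shifts, key=lambda t: mutual_information(a, t[0]))[1]
print("best shift:", best)   # 0 recovers the aligned position
```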
Moore S, Ong EJ, Bowden R (2010) Facial Expression Recognition using Spatiotemporal Boosted Discriminatory Classifiers, 6111/2010 pp. 405-414
Okwechime D, Ong E-J, Gilbert A, Bowden R (2011) Visualisation and prediction of conversation interest through mined social signals, 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops pp. 951-956 IEEE
This paper introduces a novel approach to social behaviour recognition governed by the exchange of non-verbal cues between people. We conduct experiments to try and deduce distinct rules that dictate the social dynamics of people in a conversation, and utilise semi-supervised computer vision techniques to extract their social signals such as laughing and nodding. Data mining is used to deduce frequently occurring patterns of social trends between a speaker and listener in both interested and not interested social scenarios. The confidence values from rules are utilised to build a Social Dynamic Model (SDM), that can then be used for classification and visualisation. By visualising the rules generated in the SDM, we can analyse distinct social trends between an interested and not interested listener in a conversation. Results show that these distinctions can be applied generally and used to accurately predict conversational interest.
Gilbert A, Bowden R (2011) iGroup: Weakly supervised image and video grouping, 2011 IEEE International Conference on Computer Vision pp. 2166-2173 IEEE
We present a generic, efficient and iterative algorithm for interactively clustering classes of images and videos. The approach moves away from the use of large hand labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of 'seed' examples. Two efficient data mining tools originally developed for text analysis; min-Hash and APriori are used and extended to achieve both speed and scalability on large image and video datasets. Inspired by the Bag-of-Words (BoW) architecture, the idea of an image signature is introduced as a simple descriptor on which nearest neighbour classification can be performed. The image signature is then dynamically expanded to identify common features amongst samples of the same class. The iterative approach uses APriori to identify common and distinctive elements of a small set of labelled true and false positive signatures. These elements are then accentuated in the signature to increase similarity between examples and 'pull' positive classes together. By repeating this process, the accuracy of similarity increases dramatically despite only a few training examples, only 10% of the labelled groundtruth is needed, compared to other approaches. It is tested on two image datasets including the caltech101 [9] dataset and on three state-of-the-art action recognition datasets. On the YouTube [18] video dataset the accuracy increases from 72% to 97% using only 44 labelled examples from a dataset of over 1200 videos. The approach is both scalable and efficient, with an iteration on the full YouTube dataset taking around 1 minute on a standard desktop machine.
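The min-Hash machinery behind the image-signature grouping can be illustrated briefly: each image is reduced to a set of visual-word ids, and the fraction of agreeing min-hashes estimates Jaccard similarity between images cheaply. This is a hypothetical sketch, not the paper's code.

```python
import numpy as np

def minhash_signature(word_ids, n_hashes=64, prime=2_147_483_647, seed=0):
    """One min-hash per random linear hash function over the word-id set."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, prime, n_hashes)
    b = rng.integers(0, prime, n_hashes)
    w = np.array(sorted(word_ids))
    return ((a[:, None] * w[None, :] + b[:, None]) % prime).min(1)

def estimated_jaccard(sig_a, sig_b):
    return (sig_a == sig_b).mean()   # collisions approximate set overlap

img1 = {3, 17, 42, 99, 256}          # visual words present in each image
img2 = {3, 17, 42, 100, 300}
print(estimated_jaccard(minhash_signature(img1), minhash_signature(img2)))
```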
Ong E, Pugeault N, Gilbert A, Bowden R (2016) Learning multi-class discriminative patterns using episode-trees, 7th International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2016) International Academy, Research, and Industry Association (IARIA)
In this paper, we aim to tackle the problem of recognising temporal sequences in the context of a multi-class problem. In the past, the representation of sequential patterns was used for modelling discriminative temporal patterns for different classes. Here, we have improved on this by using the more general representation of episodes, of which sequential patterns are a special case. We then propose a novel tree structure called a MultI-Class Episode Tree (MICE-Tree) that allows one to simultaneously model a set of different episodes in an efficient manner whilst providing labels for them. A set of MICE-Trees are then combined together into a MICE-Forest that is learnt in a Boosting framework. The result is a strong classifier that utilises episodes for performing classification of temporal sequences. We also provide experimental evidence showing that the MICE-Trees allow for a more compact and efficient model compared to sequential patterns. Additionally, we demonstrate the accuracy and robustness of the proposed method in the presence of different levels of noise and class labels.
Hadfield SJ, Lebeda K, Bowden R (2014) The Visual Object Tracking VOT2014 challenge results,
The Visual Object Tracking challenge 2014, VOT2014, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 38 trackers are presented. The number of tested trackers makes VOT 2014 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2014 challenge that go beyond its VOT2013 predecessor are introduced: (i) a new VOT2014 dataset with full annotation of targets by rotated bounding boxes and per-frame attribute, (ii) extensions of the VOT2013 evaluation methodology, (iii) a new unit for tracking speed assessment less dependent on the hardware and (iv) the VOT2014 evaluation toolkit that significantly speeds up execution of experiments. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.
Lebeda K, Hadfield S, Matas J, Bowden R (2015) Texture-Independent Long-Term Tracking Using Virtual Corners, IEEE TRANSACTIONS ON IMAGE PROCESSING 25 (1) pp. 359-371 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Cooper H, Bowden R (2007) Large lexicon detection of sign language, HUMAN-COMPUTER INTERACTION, PROCEEDINGS 4796 pp. 88-97 SPRINGER-VERLAG BERLIN
Moore S, Bowden R (2011) Local binary patterns for multi-view facial expression recognition, Computer Vision and Image Understanding 115 (4) pp. 541-558 Elsevier
Research into facial expression recognition has predominantly been applied to face images at frontal view only. Some attempts have been made to produce pose-invariant facial expression classifiers. However, most of these attempts have only considered yaw variations of up to 45°, where all of the face is visible. Little work has been carried out to investigate the intrinsic potential of different poses for facial expression recognition. This is largely due to the databases available, which typically capture frontal view face images only. Recent databases, BU3DFE and Multi-PIE, allow empirical investigation of facial expression recognition for different viewing angles. A sequential two-stage approach is taken for pose classification and view-dependent facial expression classification to investigate the effects of yaw variations from frontal to profile views. Local binary patterns (LBPs) and variations of LBPs as texture descriptors are investigated. Such features allow investigation of the influence of orientation and multi-resolution analysis for multi-view facial expression recognition. The influence of pose on different facial expressions is investigated. Other factors are investigated including resolution and construction of global and local feature vectors. An appearance-based approach is adopted by dividing images into sub-blocks coarsely aligned over the face. Feature vectors contain concatenated feature histograms built from each sub-block. Multi-class support vector machines are adopted to learn pose and pose-dependent facial expression classifiers.
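For reference, the basic descriptor underlying this study, the 8-neighbour local binary pattern with per-block histograms, can be sketched as follows. The grid size and input are illustrative.

```python
import numpy as np

def lbp_codes(gray):
    """Threshold each pixel's 8 neighbours against it and pack into a byte."""
    c = gray[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        neighbour = gray[1 + dy:gray.shape[0] - 1 + dy, 1 + dx:gray.shape[1] - 1 + dx]
        code |= ((neighbour >= c).astype(np.uint8) << bit)
    return code

def block_histograms(gray, grid=4):
    """Concatenate per-block LBP histograms, coarsely aligned over the face."""
    codes = lbp_codes(gray)
    bh, bw = codes.shape[0] // grid, codes.shape[1] // grid
    hists = [np.bincount(codes[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].ravel(),
                         minlength=256) for i in range(grid) for j in range(grid)]
    return np.concatenate(hists).astype(float)

face = np.random.rand(64, 64)        # stand-in for an aligned face crop
print(block_histograms(face).shape)  # (4*4*256,) feature vector
```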
Bowden R, Gilbert A, KaewTraKulPong P (2006) Tracking Objects Across Uncalibrated Arbitrary Topology Camera Networks, In: Velastin S, Remagnino P (eds.), Intelligent Distributed Video Surveillance Systems 6 pp. 157-182 Institution of Engineering and Technology
Intelligent visual surveillance is an important application area for computer vision. In situations where networks of hundreds of cameras are used to cover a wide area, the obvious limitation becomes the users' ability to manage such vast amounts of information. For this reason, automated tools that can generalise about activities or track objects are important to the operator. Key to the users' requirements is the ability to track objects across (spatially separated) camera scenes. However, extensive geometric knowledge about the site and camera position is typically required. Such an explicit mapping from camera to world is infeasible for large installations as it requires that the operator know which camera to switch to when an object disappears. To further compound the problem, the installation costs of CCTV systems outweigh those of the hardware. This means that geometric constraints or any form of calibration (such as that which might be used with epipolar constraints) is simply not realistic for a real world installation. The algorithms cannot afford to dictate to the installer. This work attempts to address this problem and outlines a method to allow objects to be related and tracked across cameras without any explicit calibration, be it geometric or colour.
Ong E-J, Bowden R (2006) Learning Distance for Arbitrary Visual Features, Proceedings of the British Machine Vision Conference 2 pp. 749-758 BMVA
This paper presents a method for learning distance functions of arbitrary feature representations that is based on the concept of wormholes. We introduce wormholes and describe how they provide a method for warping the topology of visual representation spaces such that a meaningful distance between examples is available. Additionally, we show how a more general distance function can be learnt through the combination of many wormholes via an inter-wormhole network. We then demonstrate the application of the distance learning method on a variety of problems including nonlinear synthetic data, face illumination detection and the retrieval of images containing natural landscapes and man-made objects (e.g. cities).
Gilbert A, Bowden R (2008) Incremental, scalable tracking of objects inter camera, COMPUTER VISION AND IMAGE UNDERSTANDING 111 (1) pp. 43-58 ACADEMIC PRESS INC ELSEVIER SCIENCE
Cooper HM, Holt B, Bowden R (2011) Sign Language Recognition, In: Moeslund TB, Hilton A, Krüger V, Sigal L (eds.), Visual Analysis of Humans: Looking at People pp. 539-562 Springer Verlag
This chapter covers the key aspects of sign-language recognition (SLR), starting with a brief introduction to the motivations and requirements, followed by a précis of sign linguistics and their impact on the field. The types of data available and the relative merits are explored allowing examination of the features which can be extracted. Classifying the manual aspects of sign (similar to gestures) is then discussed from a tracking and non-tracking viewpoint before summarising some of the approaches to the non-manual aspects of sign languages. Methods for combining the sign classification results into full SLR are given showing the progression towards speech recognition techniques and the further adaptations required for the sign specific case. Finally the current frontiers are discussed and the recent research presented. This covers the task of continuous sign recognition, the work towards true signer independence, how to effectively combine the different modalities of sign, making use of the current linguistic research and adapting to larger, more noisy data sets.
Dowson N, Bowden R (2008) Mutual information for Lucas-Kanade tracking (MILK): An inverse compositional formulation, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 30 (1) pp. 180-185 IEEE COMPUTER SOC
Lebeda K, Hadfield S, Bowden R (2015) Exploring Causal Relationships in Visual Object Tracking, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) pp. 3065-3073 IEEE
Oshin O, Gilbert A, Bowden R (2011) There Is More Than One Way to Get Out of a Car: Automatic Mode Finding for Action Recognition in the Wild, Lecture Notes in Computer Science: Pattern Recognition and Image Analysis 6669 pp. 41-48 Springer Berlin / Heidelberg
'Actions in the wild' is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies [10] or the Internet [9]. State-of-the-art performance in this domain is orders of magnitude lower than in more contrived settings. One of the primary reasons is the huge variability within each action class. We propose to tackle recognition in the wild by automatically breaking complex action categories into multiple modes/groups, and training a separate classifier for each mode. This is achieved using RANSAC which identifies and separates the modes while rejecting outliers. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. Our results demonstrate the validity of the approach, and for classes which exhibit multi-modality, we achieve in excess of double the performance over approaches that assume single modality.
KaewTraKulPong P, Bowden R (2004) Probabilistic Learning of Salient Patterns across Spatially Separated Uncalibrated Views, Proceedings of IDSS04 - Intelligent Distributed Surveillance Systems, Feb 2004 pp. 36-40 Institution of Electrical Engineers
We present a solution to the problem of tracking intermittent targets that can overcome long-term occlusions as well as movement between camera views. Unlike other approaches, our system does not require topological knowledge of the site or labelled training patterns during the learning period. The approach uses the statistical consistency of data obtained automatically over an extended period of time rather than explicit geometric calibration to automatically learn the salient reappearance periods for objects. This allows us to predict where objects may reappear and within how long. We demonstrate how these salient reappearance periods can be used with a model of physical appearance to track objects between spatially separate regions in single and separated views.
Okwechime D, Ong E, Gilbert A, Bowden R (2011) Social interactive human video synthesis, Lecture Notes in Computer Science: Computer Vision - ACCV 2010 6492 (PART 1) pp. 256-270 Springer
In this paper, we propose a computational model for social interaction between three people in a conversation, and demonstrate results using human video motion synthesis. We utilised semi-supervised computer vision techniques to label social signals between the people, like laughing, head nod and gaze direction. Data mining is used to deduce frequently occurring patterns of social signals between a speaker and a listener in both interested and not interested social scenarios, and the mined confidence values are used as conditional probabilities to animate social responses. The human video motion synthesis is done using an appearance model to learn a multivariate probability distribution, combined with a transition matrix to derive the likelihood of motion given a pose configuration. Our system uses social labels to more accurately define motion transitions and build a texture motion graph. Traditional motion synthesis algorithms are best suited to large human movements like walking and running, where motion variations are large and prominent. Our method focuses on generating more subtle human movement like head nods. The user can then control who speaks and the interest level of the individual listeners resulting in social interactive conversational agents.
Lebeda K, Hadfield S, Matas J, Bowden R (2013) Long-Term Tracking Through Failure Cases, Proceeedings, IEEE workshop on visual object tracking challenge at ICCV pp. 153-160 IEEE
Long-term tracking of an object, given only a single instance in an initial frame, remains an open problem. We propose a visual tracking algorithm, robust to many of the difficulties which often occur in real-world scenes. Correspondences of edge-based features are used, to overcome the reliance on the texture of the tracked object and improve invariance to lighting. Furthermore we address long-term stability, enabling the tracker to recover from drift and to provide redetection following object disappearance or occlusion. The two-module principle is similar to the successful state-of-the-art long-term TLD tracker; however, our approach extends to cases of low-textured objects. Besides reporting our results on the VOT Challenge dataset, we perform two additional experiments. Firstly, results on short-term sequences show the performance of tracking challenging objects which represent failure cases for competing state-of-the-art approaches. Secondly, long sequences are tracked, including one of almost 30000 frames which to our knowledge is the longest tracking sequence reported to date. This tests the re-detection and drift resistance properties of the tracker. All the results are comparable to the state-of-the-art on sequences with textured objects and superior on non-textured objects. The new annotated sequences are made publicly available.
Escalera S, Gonzàlez J, Baró X, Reyes M, Guyon I, Athitsos V, Escalante H, Sigal L, Argyros A, Sminchisescu C, Bowden R, Sclaroff S (2013) ChaLearn multi-modal gesture recognition 2013: Grand challenge and workshop summary, ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction pp. 365-370
We organized a Grand Challenge and Workshop on Multi-Modal Gesture Recognition. The MMGR Grand Challenge focused on the recognition of continuous natural gestures from multi-modal data (including RGB, Depth, user mask, Skeletal model, and audio). We made available a large labeled video database of 13,858 gestures from a lexicon of 20 Italian gesture categories recorded with a Kinect camera. More than 54 teams participated in the challenge and a final error rate of 12% was achieved by the winner of the competition. Winners of the competition published their work in the workshop of the Challenge. The MMGR Workshop was held at ICMI conference 2013, Sydney. A total of 9 relevant papers with basis on multi-modal gesture recognition were accepted for presentation. This includes multi-modal descriptors, multi-class learning strategies for segmentation and classification in temporal data, as well as relevant applications in the field, including multi-modal Social Signal Processing and multi-modal Human Computer Interfaces. Five relevant invited speakers participated in the workshop: Profs. Leonid Sigal from Disney Research, Antonis Argyros from FORTH, Institute of Computer Science, Cristian Sminchisescu from Lund University, Richard Bowden from University of Surrey, and Stan Sclaroff from Boston University. They summarized their research in the field and discussed past, current, and future challenges in Multi-Modal Gesture Recognition. © 2013 ACM.
Koller O, Ney H, Bowden R (2015) Deep Learning of Mouth Shapes for Sign Language, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW) pp. 477-483 IEEE
Windridge D, Bowden R (2005) Hidden Markov chain estimation and parameterisation via ICA-based feature-selection, Pattern Analysis and Applications 8 (1-2) pp. 115-124
Ong EJ, Bowden R (2011) Learning sequential patterns for lipreading, BMVC 2011 - Proceedings of the British Machine Vision Conference 2011
This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search of optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set's boosting weight distribution. This temporal search method is then integrated into the boosting framework, resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state-of-the-art recognition performance, using only a small set of sequential patterns. © 2011. The copyright of this document resides with its authors.
Gilbert A, Bowden R (2011) Push and pull: Iterative grouping of media, BMVC 2011 - Proceedings of the British Machine Vision Conference 2011
We present an approach to iteratively cluster images and video in an efficient and intuitive manner. Many techniques use the traditional approach of time-consuming ground-truthing of large amounts of data [10, 16, 20, 23], but this is increasingly infeasible as dataset size and complexity increase. Furthermore it is not applicable to the home user, who wants to intuitively group his/her own media without labelling the content. Instead we propose a solution that allows the user to select media that semantically belongs to the same class and use machine learning to "pull" this and other related content together. We introduce an "image signature" descriptor and use min-Hash and greedy clustering to efficiently present the user with clusters of the dataset using multi-dimensional scaling. The image signatures of the dataset are then adjusted by APriori data mining, identifying the common elements between a small subset of image signatures. This is able to both pull together true positive clusters and push apart false positive examples. The approach is tested on real videos harvested from the web using the state-of-the-art YouTube dataset [18]. The accuracy of correct group labels increases from 60.4% to 81.7% over 15 iterations of pulling and pushing the media around, while the process takes only one minute to compute the pairwise similarities of the image signatures and visualise the whole YouTube dataset. © 2011. The copyright of this document resides with its authors.
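For readers unfamiliar with min-Hash, the following minimal sketch estimates the Jaccard similarity between two set-valued image signatures; the signature contents and hash count are illustrative assumptions, not the paper's configuration.

```python
import random

random.seed(0)
NUM_HASHES = 50  # number of min-hash functions (illustrative)

# One random salt per hash function stands in for a random permutation.
salts = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def minhash(signature):
    """signature: set of visual-word ids forming an 'image signature'."""
    return [min(hash((s, w)) for w in signature) for s in salts]

def similarity(sig_a, sig_b):
    """Fraction of agreeing min-hashes estimates the Jaccard similarity."""
    ha, hb = minhash(sig_a), minhash(sig_b)
    return sum(a == b for a, b in zip(ha, hb)) / NUM_HASHES

a = {1, 5, 9, 42, 77}
b = {1, 5, 9, 42, 100}
print(similarity(a, b))  # close to |a ∩ b| / |a ∪ b| = 4/6
```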
Bowden R, Heap T, Hart C (1996) Virtual Datagloves: Interacting with Virtual Environments Through Computer Vision, Proceedings of the Third UK Virtual Reality Special Interest Group Conference; Leicester, 3rd July 1996
This paper outlines a system design and implementation of a 3D input device for graphical applications. It is shown how computer vision can be used to track a user's movements within the image frame, allowing interaction with 3D worlds and objects. Point Distribution Models (PDMs) have been shown to be successful at tracking deformable objects. This system demonstrates how these 'smart snakes' can be used in real time with real-world applications, demonstrating how computer vision can provide a low-cost, intuitive interface that has few hardware constraints. The compact mathematical model behind the PDM allows simple static gesture recognition to be performed, providing the means to communicate with an application. It is shown how movement of both the hand and face can be used to drive 3D engines. The system is based upon Open Inventor and designed for use with Silicon Graphics Indy Workstations, but allowances have been made to facilitate the inclusion of the tracker within third-party applications. The reader is also provided with an insight into the next generation of HCI and Multimedia.
Bowden R, Collomosse J, Mikolajczyk K (2014) Guest Editorial: Tracking, Detection and Segmentation, International Journal of Computer Vision
Sheerman-Chase T, Ong E-J, Bowden R (2013) Non-linear predictors for facial feature tracking across pose and expression, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
This paper proposes a non-linear predictor for estimating the displacement of tracked feature points on faces that exhibit significant variations across pose and expression. Existing methods such as linear predictors, ASMs or AAMs are limited to a narrow range in pose. In order to track across a large pose range, separate pose-specific models are required that are then coupled via a pose-estimator. In our approach, we neither require a set of pose-specific models nor a pose-estimator. Using just a single tracking model, we are able to robustly and accurately track across a wide range of expressions and poses. This is achieved by gradient boosting of regression trees for predicting the displacement vectors of tracked points. Additionally, we propose a novel algorithm for simultaneously configuring this hierarchical set of trackers for optimal tracking results. Experiments were carried out on sequences of naturalistic conversation and sequences with large pose and expression changes. The results show that the proposed method is superior to state-of-the-art methods in being able to robustly track a set of facial points whilst gracefully recovering from tracking failures. © 2013 IEEE.
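A minimal sketch of displacement prediction by gradient-boosted regression trees, using scikit-learn on synthetic data; the feature dimensionality, data and tree settings are illustrative assumptions, and the paper's simultaneous configuration of the tracker hierarchy is not reproduced.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy stand-in: support-pixel intensity differences (X) mapped to the
# displacement of a tracked facial point (one regressor per axis).
X = rng.normal(size=(500, 64))                  # 64 difference features
true_w = rng.normal(size=64)
dx = X @ true_w + 0.1 * rng.normal(size=500)    # synthetic x-displacements

model_dx = GradientBoostingRegressor(n_estimators=100, max_depth=3)
model_dx.fit(X, dx)
print(model_dx.predict(X[:1]))  # predicted displacement for one feature vector
```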
Ong E, Bowden R (2011) Learning Temporal Signatures for Lip Reading,
Bowden R (2014) Seeing and understanding people, COMPUTATIONAL VISION AND MEDICAL IMAGE PROCESSING IV pp. 9-15 CRC PRESS-TAYLOR & FRANCIS GROUP
Bowden R (2015) The evolution of Computer Vision, PERCEPTION 44 pp. 360-361 SAGE PUBLICATIONS LTD
Koller O, Bowden R, Ney H (2016) Automatic Alignment of HamNoSys Subunits for Continuous Sign Language Recognition, LREC 2016 Proceedings pp. 121-128
This work presents our recent advances in the field of automatic processing of sign language corpora targeting continuous sign language recognition. We demonstrate how generic annotations at the articulator level, such as HamNoSys, can be exploited to learn subunit classifiers. Specifically, we explore cross-language subunits of the hand orientation modality, which are trained on isolated signs of publicly available lexicon data sets for Swiss German and Danish Sign Language and are applied to continuous sign language recognition of the challenging RWTH-PHOENIX-Weather corpus featuring German Sign Language. We observe a significant reduction in word error rate using this method.
Bowden R (2003) Probabilistic models in computer vision, IMAGE AND VISION COMPUTING 21 (10) pp. 841-841 ELSEVIER SCIENCE BV
Micilotta AS, Ong EJ, Bowden R (2005) Detection and tracking of humans by probabilistic body part assembly, BMVC 2005 - Proceedings of the British Machine Vision Conference 2005
This paper presents a probabilistic framework of assembling detected human body parts into a full 2D human configuration. The face, torso, legs and hands are detected in cluttered scenes using boosted body part detectors trained by AdaBoost. Body configurations are assembled from the detected parts using RANSAC, and a coarse heuristic is applied to eliminate obvious outliers. An a priori mixture model of upper-body configurations is used to provide a pose likelihood for each configuration. A joint-likelihood model is then determined by combining the pose, part detector and corresponding skin model likelihoods. The assembly with the highest likelihood is selected by RANSAC, and the elbow positions are inferred. This paper also illustrates the combination of skin colour likelihood and detection likelihood to further reduce false hand and face detections.
Gilbert A, Bowden R (2015) Data mining for action recognition, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9007 pp. 290-303
© Springer International Publishing Switzerland 2015. In recent years, dense trajectories have been shown to be an efficient representation for action recognition and have achieved state-of-the-art results on a variety of increasingly difficult datasets. However, while the features have greatly improved the recognition scores, the training process and machine learning used haven't in general deviated from the object-recognition-based SVM approach. This is despite the increase in quantity and complexity of the features used. This paper improves the performance of action recognition through two data mining techniques, APriori association rule mining and Contrast Set Mining. These techniques are ideally suited to action recognition and in particular to dense trajectory features, as they can utilise the large amounts of data to identify far shorter discriminative subsets of features called rules. Experimental results on one of the most challenging datasets, Hollywood2, outperform the current state-of-the-art.
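A toy APriori-style miner over symbolised feature sets, sketching how short discriminative rules can be extracted from transactions; the thresholds, transaction data and two-item limit are illustrative assumptions, not the paper's mining setup.

```python
from itertools import combinations
from collections import Counter

def mine_rules(transactions, min_support=0.3, min_confidence=0.7, max_len=2):
    """Tiny APriori-style rule miner over sets of symbolised features."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for k in range(1, max_len + 1):
            counts.update(combinations(sorted(t), k))
    # Keep itemsets whose support clears the threshold.
    freq = {s: c / n for s, c in counts.items() if c / n >= min_support}
    rules = []
    for itemset, supp in freq.items():
        if len(itemset) < 2:
            continue
        for i in range(len(itemset)):
            lhs = itemset[:i] + itemset[i + 1:]
            conf = supp / freq.get(lhs, float("inf"))  # rule confidence
            if conf >= min_confidence:
                rules.append((lhs, itemset[i], conf))
    return rules

# Illustrative symbolised feature sets, one per video clip:
clips = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
print(mine_rules(clips))  # e.g. [(('a',), 'b', 1.0), (('b',), 'a', 1.0)]
```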
Bowden R, Ellis L, Kittler J, Shevchenko M, Windridge D (2005) Unsupervised symbol grounding and cognitive bootstrapping in cognitive vision, Proc. 13th Int. Conference on Image Analysis and Processing pp. 27-36
Holt B, Bowden R (2013) Efficient Estimation of Human Upper Body Pose in Static Depth Images, Communications in Computer and Information Science 359 CCIS pp. 399-410
Automatic estimation of human pose has long been a goal of computer vision, to which a solution would have a wide range of applications. In this paper we formulate the pose estimation task within a regression and Hough voting framework to predict 2D joint locations from depth data captured by a consumer depth camera. In our approach the offset from each pixel to the location of each joint is predicted directly using random regression forests. The predictions are accumulated in Hough images which are treated as likelihood distributions where maxima correspond to joint location hypotheses. Our approach is evaluated on a publicly available dataset with good results. © Springer-Verlag Berlin Heidelberg 2013.
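The following sketch illustrates the regression-plus-Hough-voting idea on synthetic data: a random regression forest predicts pixel-to-joint offsets and the predicted joint positions are accumulated in a Hough image, whose maximum gives the joint hypothesis. The features and forest size are illustrative stand-ins, not the paper's depth features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
H = W = 64

# Toy data: every pixel should vote for a joint located at (32, 32).
pix = rng.integers(0, 64, size=(1000, 2))
# Stand-in features: a noisy encoding of pixel position plus distractor
# dimensions (the paper uses depth comparison features instead).
feats = np.hstack([pix + rng.normal(scale=0.5, size=pix.shape),
                   rng.normal(size=(1000, 6))])
offsets = np.array([32, 32]) - pix        # regression target: pixel-to-joint offset

forest = RandomForestRegressor(n_estimators=20).fit(feats, offsets)

# Accumulate predicted joint positions in a Hough image; maxima are hypotheses.
hough = np.zeros((H, W))
votes = pix + forest.predict(feats)
for y, x in np.round(votes).astype(int):
    if 0 <= y < H and 0 <= x < W:
        hough[y, x] += 1
print(np.unravel_index(hough.argmax(), hough.shape))  # close to (32, 32)
```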
Ellis L, Felsberg M, Bowden R (2011) Affordance mining: Forming perception through action, Lecture Notes in Computer Science: 10th Asian Conference on Computer Vision, Revised Selected Papers Part IV 6495 pp. 525-538 Springer
This work employs data mining algorithms to discover visual entities that are strongly associated to autonomously discovered modes of action in an embodied agent. Mappings are learnt from these perceptual entities onto the agent's action space. In general, low dimensional action spaces are better suited to unsupervised learning than high dimensional percept spaces, allowing for structure to be discovered in the action space and used to organise the perceptual space. Local feature configurations that are strongly associated to a particular 'type' of action (and not all other action types) are considered likely to be relevant in eliciting that action type. By learning mappings from these relevant features onto the action space, the system is able to respond in real time to novel visual stimuli. The proposed approach is demonstrated on an autonomous navigation task, and the system is shown to identify the relevant visual entities to the task and to generate appropriate responses.
Ong E-J, Bowden R, IEEE (2008) Robust Lip-Tracking using Rigid Flocks of Selected Linear Predictors, 2008 8TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2008), VOLS 1 AND 2 pp. 247-254
Efthimiou E, Fotinea S-E, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Lefebvre-Albaret F (2012) Sign Language technologies and resources of the Dicta-Sign project, Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon. Satellite Workshop to the eighth International Conference on Language Resources and Evaluation (LREC-2012) pp. 37-44 Institute for German Sign Language and Communication of the Deaf
Here we present the outcomes of the Dicta-Sign FP7-ICT project. Dicta-Sign researched ways to enable communication between Deaf individuals through the development of human-computer interfaces (HCI) for Deaf users, by means of Sign Language. It researched and developed recognition and synthesis engines for sign languages (SLs) that have brought sign recognition and generation technologies significantly closer to authentic signing. In this context, Dicta-Sign has developed several technologies demonstrated via a sign language aware Web 2.0, combining work from the fields of sign language recognition, sign language animation via avatars and the development of sign language resources and language models, with the goal of allowing Deaf users to make, edit, and review avatar-based sign language contributions online, similar to the way people nowadays make text-based contributions on the Web.
Hadfield S, Bowden R (2011) Kinecting the dots: Particle Based Scene Flow From Depth Sensors, Proceedings, IEEE International Conference on Computer Vision (ICCV) pp. 2290-2295 IEEE
The motion field of a scene can be used for object segmentation and to provide features for classification tasks like action recognition. Scene flow is the full 3D motion field of the scene, and is more difficult to estimate than its 2D counterpart, optical flow. Current approaches use a smoothness cost for regularisation, which tends to over-smooth at object boundaries. This paper presents a novel formulation for scene flow estimation, a collection of moving points in 3D space, modelled using a particle filter that supports multiple hypotheses and does not over-smooth the motion field. In addition, this paper is the first to address scene flow estimation while making use of modern depth sensors and monocular appearance images, rather than traditional multi-viewpoint rigs. The algorithm is applied to an existing scene flow dataset, where it achieves comparable results to approaches utilising multiple views, while taking a fraction of the time.
Oshin O, Gilbert A, Illingworth J, Bowden R (2009) Learning to Recognise Spatio-Temporal Interest Points, In: Wang L, Cheng L, Zhao G (eds.), Machine Learning for Human Motion Analysis (2) 2 pp. 14-30 Igi Publishing
Machine Learning for Human Motion Analysis: Theory and Practice highlights the development of robust and effective vision-based motion understanding systems.
Gilbert A, Bowden R (2007) Multi person tracking within crowded scenes, Human Motion - Understanding, Modeling, Capture and Animation, Proceedings 4814 pp. 166-179 Springer
This paper presents a solution to the problem of tracking people within crowded scenes. The aim is to maintain individual object identity through a crowded scene which contains complex interactions and heavy occlusions of people. Our approach uses the strengths of two separate methods: a global object detector and a localised frame-by-frame tracker. A temporal relationship model of torso detections, built during low-activity periods, is used to further disambiguate during periods of high activity. A single camera with no calibration and no environmental information is used. Results are compared to a standard tracking method and ground truth. Two video sequences containing interactions, overlaps and occlusions between people are used to demonstrate our approach. The results show that our technique performs better than a standard tracking method and can cope with challenging occlusions and crowd interactions.
Lewin M, Bowden R, Sarhadi M (2000) Automotive Prototyping using Augmented Reality, Proceedings of the 7th VRSIG Conference, Strathclyde University, Sept 2000
Ong E-J, Micilotta AS, Bowden R, Hilton A (2005) Viewpoint invariant exemplar-based 3D human tracking, COMPUTER VISION AND IMAGE UNDERSTANDING 104 (2-3) pp. 178-189 ACADEMIC PRESS INC ELSEVIER SCIENCE
Pugeault N, Bowden R (2015) How Much of Driving Is Preattentive?, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY 64 (12) pp. 5424-5438 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Lebeda K, Matas J, Bowden R (2013) Tracking the untrackable: How to track when your object is featureless, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7729 LNCS (PART 2) pp. 347-359
We propose a novel approach to tracking objects by low-level line correspondences. In our implementation we show that this approach is usable even when tracking objects that lack texture, exploiting situations where feature-based trackers fail due to the aperture problem. Furthermore, we suggest an approach to failure detection and recovery to maintain long-term stability. This is achieved by remembering configurations which lead to good pose estimations and using them later for tracking corrections. We carried out experiments on several sequences of different types. The proposed tracker proves itself competitive or superior to state-of-the-art trackers in both standard and low-textured scenes. © 2013 Springer-Verlag.
Bowden R, Sarhadi M (2002) A non-linear model of shape and motion for tracking finger spelt American sign language, IMAGE AND VISION COMPUTING 20 (9-10) pp. 597-607 ELSEVIER SCIENCE BV
Gilbert A, Bowden R (2013) A picture is worth a thousand tags: Automatic web based image tag expansion, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7725 L (PART 2) pp. 447-460 Springer
We present an approach to automatically expand the annotation of images using the internet as an additional information source. The novelty of the work is in the expansion of image tags by automatically introducing new unseen complex linguistic labels which are collected unsupervised from associated webpages. Taking a small subset of existing image tags, a web based search retrieves additional textual information. Both a textual bag of words model and a visual bag of words model are combined and symbolised for data mining. Association rule mining is then used to identify rules which relate words to visual contents. Unseen images that fit these rules are re-tagged. This approach allows a large number of additional annotations to be added to unseen images, on average 12.8 new tags per image, with an 87.2% true positive rate. Results are shown on two datasets including a new 2800 image annotation dataset of landmarks, the results include pictures of buildings being tagged with the architect, the year of construction and even events that have taken place there. This widens the tag annotation impact and their use in retrieval. This dataset is made available along with tags and the 1970 webpages and additional images which form the information corpus. In addition, results for a common state-of-the-art dataset MIRFlickr25000 are presented for comparison of the learning framework against previous works. © 2013 Springer-Verlag.
Lewin M, Bowden R, Sarhadi M (2000) Applying Augmented Reality to Virtual Product Prototyping, pp. 59-68
Dowson N, Bowden R (2004) Metric mixtures for mutual information (MI³) tracking, PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 2 pp. 752-756 IEEE COMPUTER SOC
Ong EJ, Bowden R (2005) Learning multi-kernel distance functions using relative comparisons, PATTERN RECOGNITION 38 (12) pp. 2653-2657 ELSEVIER SCI LTD
Gupta A, Bowden R (2011) Evaluating dimensionality reduction techniques for visual category recognition using Rényi entropy, European Signal Processing Conference pp. 913-917 IEEE
Visual category recognition is a difficult task of significant interest to the machine learning and vision community. One of the principal hurdles is the high-dimensional feature space. This paper evaluates several linear and non-linear dimensionality reduction techniques. A novel evaluation metric, the Rényi entropy of the inter-vector Euclidean distance distribution, is introduced. This information-theoretic measure judges the techniques on their preservation of structure in a lower-dimensional sub-space. The popular Caltech-101 dataset is utilized in the experiments. The results indicate that the techniques which preserve local neighborhood structure performed best amongst the techniques evaluated in this paper. © 2011 EURASIP.
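A minimal sketch of the evaluation metric described above, computing the Rényi entropy of the binned inter-vector Euclidean distance distribution; the binning strategy and the choice of α are assumptions, not the paper's exact estimator.

```python
import numpy as np
from scipy.spatial.distance import pdist

def renyi_entropy(samples, alpha=2.0, bins=50):
    """Rényi entropy H_a = log(sum p^a) / (1 - a) of the pairwise distance
    distribution, estimated from a histogram of all Euclidean distances."""
    d = pdist(samples)                        # all pairwise Euclidean distances
    p, _ = np.histogram(d, bins=bins)
    p = p / p.sum()
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

X = np.random.default_rng(0).normal(size=(200, 10))
print(renyi_entropy(X))
```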
Cooper H, Bowden R (2007) Sign Language Recognition Using Boosted Volumetric Features, Proceedings of the IAPR Conference on Machine Vision Applications pp. 359-362 MVA Organisation
This paper proposes a method for sign language recognition that bypasses the need for tracking by classifying the motion directly. The method uses the natural extension of Haar-like features into the temporal domain, computed efficiently using an integral volume. These volumetric features are assembled into spatio-temporal classifiers using boosting. Results are presented for a fast feature extraction method and two different types of boosting. These configurations have been tested on a data set consisting of both seen and unseen signers performing five signs, producing competitive results.
Ellis L, Matas J, Bowden R (2008) Online Learning and Partitioning of Linear Displacement Predictors for Tracking, Proceedings of the British Machine Vision Conference pp. 33-42 The British Machine Vision Association (BMVA)
A novel approach to learning and tracking arbitrary image features is presented. Tracking is tackled by learning the mapping from image intensity differences to displacements. Linear regression is used, resulting in low computational cost. An appearance model of the target is built on the fly by clustering sub-sampled image templates. The medoidshift algorithm is used to cluster the templates, thus identifying various modes or aspects of the target appearance; each mode is associated to the most suitable set of linear predictors, allowing piecewise linear regression from image intensity differences to warp updates. Despite no hard-coding or offline learning, excellent results are shown on three publicly available video sequences and comparisons are made with related approaches.
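The core of a linear predictor can be sketched in a few lines: synthetic disturbances of the support-pixel positions yield paired displacements and intensity differences, and the predictor is the least-squares mapping between them. The synthetic image, support set and parameters below are illustrative assumptions.

```python
import numpy as np

def learn_linear_predictor(template, support, disturbances, image_at):
    """Learn P mapping intensity differences to displacements by least squares.

    template:     intensities at the support pixels in the reference frame
    support:      (k, 2) support-pixel coordinates
    disturbances: (n, 2) synthetic training displacements
    image_at:     function returning intensities at given coordinates
    """
    D = disturbances.T                                 # 2 x n displacements
    I = np.stack([image_at(support + d) - template     # k x n intensity
                  for d in disturbances], axis=1)      # differences
    return D @ np.linalg.pinv(I)                       # P: 2 x k

rng = np.random.default_rng(0)
img = rng.normal(size=(100, 100))                      # synthetic image

def image_at(coords):
    c = np.clip(np.round(coords).astype(int), 0, 99)
    return img[c[:, 0], c[:, 1]]

support = rng.integers(20, 80, size=(30, 2))
template = image_at(support)
disturbances = rng.uniform(-3, 3, size=(100, 2))
P = learn_linear_predictor(template, support, disturbances, image_at)

# At run time a displacement estimate is one matrix-vector product:
#   delta = P @ (current_intensities - template)
print(P.shape)  # (2, 30)
```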
Cooper HM, Pugeault N, Bowden R (2011) Reading the Signs: A Video Based Sign Dictionary, 2011 International Conference on Computer Vision: 2nd IEEE Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams (ARTEMIS 2011) pp. 914-919 IEEE
This article presents a dictionary for Sign Language using visual sign recognition based on linguistic subcomponents. We demonstrate a system where the user makes a query, receiving in response a ranked selection of similar results. The approach uses concepts from linguistics to provide sign sub-unit features and classifiers based on motion, sign-location and handshape. These sub-units are combined using Markov Models for sign level recognition. Results are shown for a video dataset of 984 isolated signs performed by a native signer. Recognition rates reach 71.4% for the first candidate and 85.9% for retrieval within the top 10 ranked signs.
Ellis L, Pugeault N, Ofjall K, Hedborg J, Bowden R, Felsberg M (2013) Autonomous navigation and sign detector learning, 2013 IEEE Workshop on Robot Vision, WORV 2013 pp. 144-151
This paper presents an autonomous robotic system that incorporates novel Computer Vision, Machine Learning and Data Mining algorithms in order to learn to navigate and discover important visual entities. This is achieved within a Learning from Demonstration (LfD) framework, where policies are derived from example state-to-action mappings. For autonomous navigation, a mapping is learnt from holistic image features (GIST) onto control parameters using Random Forest regression. Additionally, visual entities (road signs, e.g. a STOP sign) that are strongly associated to autonomously discovered modes of action (e.g. stopping behaviour) are discovered through a novel Percept-Action Mining methodology. The resulting sign detector is learnt without any supervision (no image labelling or bounding box annotations are used). The complete system is demonstrated on a fully autonomous robotic platform, featuring a single camera mounted on a standard remote control car. The robot carries a PC laptop that performs all the processing on board and in real-time. © 2013 IEEE.
Bowden R, Mitchell TA, Sarhadi M (1998) Reconstructing 3D Pose and Motion from a Single Camera View, Proceedings of BMVC 1998 2 BMVA (British Machine Vision Association)
This paper presents a model based approach to human body tracking in which the 2D silhouette of a moving human and the corresponding 3D skeletal structure are encapsulated within a non-linear Point Distribution Model. This statistical model allows a direct mapping to be achieved between the external boundary of a human and the anatomical position. It is shown how this information, along with the position of landmark features such as the hands and head, can be used to reconstruct information about the pose and structure of the human body from a monoscopic view of a scene.
Koller O, Ney H, Bowden R (2013) May the force be with you: Force-aligned signwriting for automatic subunit annotation of corpora, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
We propose a method to generate linguistically meaningful subunits in a fully automated fashion for sign language corpora. The ability to automate the process of subunit annotation has profound effects on the data available for training sign language recognition systems. The approach is based on the idea that subunits are shared among different signs. With sufficient data and knowledge of possible signing variants, accurate automatic subunit sequences are produced, matching the specific characteristics of given sign language data. Specifically we demonstrate how an iterative forced alignment algorithm can be used to transfer the knowledge of a user-edited open sign language dictionary to the task of annotating a challenging, large vocabulary, multi-signer corpus recorded from public TV. Existing approaches focus on labour intensive manual subunit annotations or on data-driven approaches. Our method yields an average precision and recall of 15% under the maximum achievable accuracy with little user intervention beyond providing a simple word gloss. © 2013 IEEE.
Hadfield S, Bowden R (2013) Hollywood 3D: Recognizing Actions in 3D Natural Scenes, Proceeedings, IEEE conference on Computer Vision and Pattern Recognition (CVPR) pp. 3398-3405 IEEE
Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood 3D dataset. We make the dataset, including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
Dowson NDH, Bowden R, Kadir T (2006) Image template matching using mutual information and NP-Windows, 18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 2, PROCEEDINGS pp. 1186-1191 IEEE COMPUTER SOC
A non-parametric (NP) sampling method is introduced for obtaining the joint distribution of a pair of images. The method is based on NP windowing and is equivalent to sampling the images at infinite resolution. Unlike existing methods, arbitrary selection of kernels is not required and the spatial structure of images is used. NP windowing is applied to a registration application where the mutual information (MI) between a reference image and a warped template is maximised with respect to the warp parameters. In comparisons against the current state-of-the-art MI registration methods, NP windowing yielded excellent results with lower bias and improved convergence rates.
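For reference, a standard-sampling MI estimate from a joint histogram looks as follows; note this is the conventional estimator the paper compares against, not the NP-windows estimator it contributes. Bin count and data are illustrative.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """MI between two equally-sized image patches via a joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = joint / joint.sum()                     # joint distribution
    px = p.sum(axis=1, keepdims=True)           # marginals
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
ref = rng.random((64, 64))
print(mutual_information(ref, ref))                   # high: identical patches
print(mutual_information(ref, rng.random((64, 64))))  # near zero: independent
```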
Pugeault N, Bowden R (2010) Learning pre-attentive driving behaviour from holistic visual features, ECCV 2010, Part VI, LNCS 6316 pp. 154-167
Bowden R, Kaewtrakulpong P, Lewin M (2002) Jeremiah: The face of computer vision, ACM International Conference Proceeding Series 22 pp. 124-128
This paper presents a humanoid computer interface (Jeremiah) that is capable of extracting moving objects from a video stream and responding by directing the gaze of an animated head toward it. It further responds through change of expression reflecting the emotional state of the system as a response to stimuli. As such, the system exhibits similar behavior to a child. The system was originally designed as a robust visual tracking system capable of performing accurately and consistently within a real world visual surveillance arena. As such, it provides a system capable of operating reliably in any environment both indoor and outdoor. Originally designed as a public interface to promote computer vision and the public understanding of science (exhibited in British Science Museum), Jeremiah provides the first step to a new form of intuitive computer interface. Copyright © ACM 2002.
Ellis L, Dowson N, Matas J, Bowden R (2011) Linear Regression and Adaptive Appearance Models for Fast Simultaneous Modelling and Tracking, INTERNATIONAL JOURNAL OF COMPUTER VISION 95 (2) pp. 154-179 SPRINGER
Sheerman-Chase T, Ong E-J, Bowden R (2009) Feature selection of facial displays for detection of non verbal communication in natural conversation, 2009 IEEE 12th International Conference on Computer Vision Workshops pp. 1985-1992 IEEE
Recognition of human communication has previously focused on deliberately acted emotions or in structured or artificial social contexts. This makes the result hard to apply to realistic social situations. This paper describes the recording of spontaneous human communication in a specific and common social situation: conversation between two people. The clips are then annotated by multiple observers to reduce individual variations in interpretation of social signals. Temporal and static features are generated from tracking using heuristic and algorithmic methods. Optimal features for classifying examples of spontaneous communication signals are then extracted by AdaBoost. The performance of the boosted classifier is comparable to human performance for some communication signals, even on this challenging and realistic data set.
Okwechime D, Bowden R (2008) A generative model for motion synthesis and blending using probability density estimation, ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS 5098 pp. 218-227 SPRINGER-VERLAG BERLIN
Lan Y, Harvey R, Theobald B, Ong EJ, Bowden R (2009) Comparing Visual Features for Lipreading, International Conference on Auditory-Visual Speech Processing 2009 pp. 102-106 ICSA
For automatic lipreading, there are many competing methods for feature extraction. Often, because of the complexity of the task these methods are tested on only quite restricted datasets, such as the letters of the alphabet or digits, and from only a few speakers. In this paper we compare some of the leading methods for lip feature extraction and compare them on the GRID dataset which uses a constrained vocabulary over, in this case, 15 speakers. Previously the GRID data has had restricted attention because of the requirements to track the face and lips accurately. We overcome this via the use of a novel linear predictor (LP) tracker which we use to control an Active Appearance Model (AAM).
By ignoring shape and/or appearance parameters from the AAM we can quantify the effect of appearance and/or shape when lip-reading. We find that shape alone is a useful cue for lipreading (which is consistent with human experiments). However, the incremental effect of shape on appearance appears not to be significant, which implies that the inner appearance of the mouth contains more information than the shape.
Bowden R, Zisserman A, Kadir T, Brady M (2003) Vision based Interpretation of Natural Sign Languages, Proceedings of the 3rd international conference on Computer vision systems Springer-Verlag
This manuscript outlines our current demonstration system for translating visual Sign to written text. The system is based around a broad description of scene activity that naturally generalizes, reducing training requirements and allowing the knowledge base to be explicitly stated. This allows the same system to be used for different sign languages, requiring only a change of the knowledge base.
Pugeault N, Bowden R (2010) Learning driving behaviour using holistic image descriptors, 4th International Conference on Cognitive Systems, CogSys 2010
Dowson N, Bowden R (2006) A unifying framework for mutual information methods for use in non-linear optimisation, Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 1 3951 pp. 365-378 Springer
Many variants of MI exist in the literature. These vary primarily in how the joint histogram is populated. This paper places the four main variants of MI: Standard sampling, Partial Volume Estimation (PVE), In-Parzen Windowing and Post-Parzen Windowing into a single mathematical framework. Jacobians and Hessians are derived in each case. A particular contribution is that the non-linearities implicit to standard sampling and post-Parzen windowing are explicitly dealt with. These non-linearities are a barrier to their use in optimisation. Side-by-side comparison of the MI variants is made using eight diverse data-sets, considering computational expense and convergence. In the experiments, PVE was generally the best performer, although standard sampling often performed nearly as well (if a higher sample rate was used). The widely used sum of squared differences metric performed as well as MI unless large occlusions and non-linear intensity relationships occurred. The binaries and scripts used for testing are available online.
Koller O, Ney H, Bowden R (2014) Read my lips: Continuous signer independent weakly supervised viseme recognition, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8689 LNCS (PART 1) pp. 281-296
This work presents a framework to recognise signer-independent mouthings in continuous sign language, with no manual annotations needed. Mouthings represent lip-movements that correspond to pronunciations of words or parts of them during signing. Research on sign language recognition has focused extensively on the hands as features. But sign language is multi-modal and a full understanding, particularly with respect to its lexical variety, language idioms and grammatical structures, is not possible without further exploring the remaining information channels. To our knowledge no previous work has explored dedicated viseme recognition in the context of sign language recognition. The approach is trained on over 180,000 unlabelled frames and reaches 47.1% precision on the frame level. Generalisation across individuals and the influence of context-dependent visemes are analysed. © 2014 Springer International Publishing.
Ong EJ, Bowden R (2006) Learning wormholes for sparsely labelled clustering, 18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS pp. 916-919 IEEE COMPUTER SOC
Distance functions are an important component in many learning applications. However, the correct function is context dependent, so it is advantageous to learn a distance function using available training data. A limitation of many existing distance functions is the requirement that data lie in a space of constant dimensionality, and they cannot be used directly on symbolic data. To address these problems, this paper introduces an alternative learnable distance function based on multi-kernel distance bases, or "wormholes", that bring spaces belonging to similar examples that were originally far away close together. This work only assumes the availability of a set of data in the form of relative comparisons, avoiding the need for labelled or quantitative information. To learn the distance function, two algorithms are proposed: 1) building a set of basic wormhole bases using a Boosting-inspired algorithm; 2) merging different distance bases together for better generalisation. The learning algorithms are then shown to successfully extract suitable distance functions in various clustering problems, ranging from synthetic 2D data to symbolic representations of unlabelled images.
Dowson NDH, Bowden R (2006) N-tier Simultaneous Modelling and Tracking for Arbitrary Warps, Proceedings of the British Machine Vision Conference 2 pp. 569-578 BMVA
This paper presents an approach to object tracking which, given a single example of a target, learns a hierarchical constellation model of appearance and structure on the fly. The model becomes more robust over time as evidence of the variability of the object is acquired and added to the model. Tracking is performed in an optimised Lucas-Kanade type framework, using Mutual Information as a similarity metric. Several novelties are presented: an improved template update strategy using Bayes' theorem, a multi-tier model topology, and a semi-automatic testing method. A critical comparison with other methods is made using exhaustive testing. In all, 11 challenging test sequences were used, with a mean length of 568 frames.
Gilbert A, Illingworth J, Bowden R (2010) Action Recognition Using Mined Hierarchical Compound Features, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 33 (5) pp. 883-897 IEEE COMPUTER SOC
Camgoz NC, Hadfield SJ, Koller O, Bowden R (2016) Using Convolutional 3D Neural Networks for User-Independent Continuous Gesture Recognition, Proceedings IEEE International Conference of Pattern Recognition (ICPR), ChaLearn Workshop IEEE
In this paper, we propose using 3D Convolutional Neural Networks for large scale user-independent continuous gesture recognition. We have trained an end-to-end deep network for continuous gesture recognition (jointly learning both the feature representation and the classifier). The network performs three-dimensional (i.e. space-time) convolutions to extract features related to both the appearance and motion from volumes of color frames. Space-time invariance of the extracted features is encoded via pooling layers. The earlier stages of the network are partially initialized using the work of Tran et al. before being adapted to the task of gesture recognition. An earlier version of the proposed method, which was trained for 11,250 iterations, was submitted to the ChaLearn 2016 Continuous Gesture Recognition Challenge and ranked 2nd with a Mean Jaccard Index Score of 0.269235. When the proposed method was further trained for 28,750 iterations, it achieved state-of-the-art performance on the same dataset, yielding a 0.314779 Mean Jaccard Index Score.
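A minimal PyTorch sketch of a network in this style, with space-time (3D) convolutions and pooling; the layer sizes and class count are illustrative, not the submitted architecture.

```python
import torch
import torch.nn as nn

class Gesture3DCNN(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # space-time convolution
            nn.ReLU(),
            nn.MaxPool3d(2),                              # pooling encodes invariance
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
        )
        self.classifier = nn.Linear(64 * 4 * 14 * 14, num_classes)

    def forward(self, clip):            # clip: (batch, channels, frames, H, W)
        x = self.features(clip)
        return self.classifier(x.flatten(1))

logits = Gesture3DCNN()(torch.randn(1, 3, 16, 56, 56))
print(logits.shape)  # torch.Size([1, 20])
```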
Oshin O, Gilbert A, Illingworth J, Bowden R (2009) Action recognition using Randomised Ferns, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009 pp. 530-537 IEEE
This paper presents a generic method for recognising and localising human actions in video based solely on the distribution of interest points. The use of local interest points has shown promising results in both object and action recognition. While previous methods classify actions based on the appearance and/or motion of these points, we hypothesise that the distribution of interest points alone contains the majority of the discriminatory information. Motivated by its recent success in rapidly detecting 2D interest points, the semi-naive Bayesian classification method of Randomised Ferns is employed. Given a set of interest points within the boundaries of an action, the generic classifier learns the spatial and temporal distributions of those interest points. This is done efficiently by comparing sums of responses of interest points detected within randomly positioned spatio-temporal blocks within the action boundaries. We present results on the largest and most popular human action dataset using a number of interest point detectors, and demonstrate that the distribution of interest points alone can perform as well as approaches that rely upon the appearance of the interest points.
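A compact sketch of semi-naive Bayesian classification with randomised ferns on synthetic feature responses; the comparison features and class-conditional data are illustrative stand-ins for the paper's spatio-temporal block responses.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FERNS, FERN_SIZE, N_CLASSES, DIM = 10, 8, 3, 100

# Each fern is a fixed set of random pairwise comparisons between feature
# responses; the resulting bits index one of 2^FERN_SIZE leaves.
pairs = rng.integers(0, DIM, size=(N_FERNS, FERN_SIZE, 2))

def fern_indices(x):
    bits = x[pairs[..., 0]] > x[pairs[..., 1]]       # (N_FERNS, FERN_SIZE)
    return bits @ (1 << np.arange(FERN_SIZE))        # one leaf id per fern

def train(X, y):
    counts = np.ones((N_FERNS, 2 ** FERN_SIZE, N_CLASSES))   # +1 prior
    for x, c in zip(X, y):
        counts[np.arange(N_FERNS), fern_indices(x), c] += 1
    return counts / counts.sum(axis=2, keepdims=True)

def classify(x, leaf_probs):
    p = leaf_probs[np.arange(N_FERNS), fern_indices(x)]      # per-fern posteriors
    return int(np.argmax(np.log(p).sum(axis=0)))             # semi-naive Bayes product

# Toy class-conditional data: one mean vector per class plus noise.
y = rng.integers(0, N_CLASSES, 500)
means = rng.normal(size=(N_CLASSES, DIM))
X = means[y] + 0.3 * rng.normal(size=(500, DIM))
leaf_probs = train(X, y)
print(classify(X[0], leaf_probs), y[0])
```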
Bowden R, KaewTraKulPong P (2001) Adaptive Visual System for Tracking Low Resolution Colour Targets, Proceedings of the 12th British Machine Vision Conference (BMVC2001) pp. 243-252
This paper addresses the problem of using appearance and motion models in classifying and tracking objects when detailed information of the object's appearance is not available. The approach relies upon motion, shape cues and colour information to help in associating objects temporally within a video stream. Unlike previous applications of colour in object tracking, where relatively large-size targets are tracked, our method is designed to track small colour targets. Our approach uses a robust background model based around Expectation Maximisation to segment moving objects with very low false detection rates. The system also incorporates a shadow detection algorithm which helps alleviate standard environmental problems associated with such approaches. A colour transformation derived from anthropological studies to model colour distributions of low-resolution targets is used along with a probabilistic method of combining colour and motion information. This provides a robust visual tracking system which is capable of performing accurately and consistently within a real world visual surveillance arena. This paper shows the system successfully tracking multiple people moving independently and the ability of the approach to recover lost tracks due to occlusions and background clutter.
Mitchell TA, Bowden R (1999) Automated visual inspection of dry carbon-fibre reinforced composite preforms, Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering 213 (6) pp. 377-386
A vision system is described which performs real-time inspection of dry carbon-fibre preforms during lay-up, the first stage in resin transfer moulding (RTM). The position of ply edges on the preform is determined in a two-stage process. Firstly, an optimized texture analysis method is used to estimate the approximate ply edge position. Secondly, boundary refinement is carried out using the texture estimate as a guiding template. Each potential edge point is evaluated using a merit function of edge magnitude, orientation and distance from the texture boundary estimate. The parameters of the merit function must be obtained by training on sample images. Once trained, the system has been shown to be accurate to better than ±1 pixel when used in conjunction with boundary models. Processing time is less than 1 s per image using commercially available convolution hardware. The system has been demonstrated in a prototype automated lay-up cell and used in a large number of manufacturing trials.
Ong EJ, Cooper H, Pugeault N, Bowden R (2012) Sign Language Recognition using Sequential Pattern Trees, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on pp. 2200-2207
Moore S, Bowden R (2007) Automatic facial expression recognition using boosted discriminatory classifiers, Lecture Notes in Computer Science: Analysis and Modelling of Faces and Gestures 4778 pp. 71-83 Springer
Over the last two decades automatic facial expression recognition has become an active research area. Facial expressions are an important channel of non-verbal communication, and can provide cues to emotions and intentions. This paper introduces a novel method for facial expression recognition, by assembling contour fragments as discriminatory classifiers and boosting them to form a strong accurate classifier. Detection is fast as features are evaluated using an efficient lookup to a chamfer image, which weights the response of the feature. An ensemble classification technique is presented using a voting scheme based on classifiers' responses. The results of this research are a 6-class classifier (6 basic expressions of anger, joy, sadness, surprise, disgust and fear) which demonstrates competitive results, achieving rates as high as 96% for some expressions. As classifiers are extremely fast to compute, the approach operates at well above frame rate. We also demonstrate how a dedicated classifier can be constructed to give optimal automatic parameter selection of the detector, allowing real time operation on unconstrained video.
Micilotta A, Ong E, Bowden R (2005) Real-time Upper Body 3D Reconstruction from a Single Uncalibrated Camera, The European Association for Computer Graphics 26th Annual Conference, EUROGRAPHICS 2005 pp. 41-44
This paper outlines a method of estimating the 3D pose of the upper human body from a single uncalibrated camera. The objective application lies in 3D Human Computer Interaction, where hand depth information offers extended functionality when interacting with a 3D virtual environment, but it is equally suitable to animation and motion capture. A database of 3D body configurations is built from a variety of human movements using motion capture data. A hierarchical structure consisting of three subsidiary databases, namely the frontal-view Hand Position (top-level), Silhouette and Edge Map Databases, is pre-extracted from the 3D body configuration database. Using this hierarchy, subsets of the subsidiary databases are then matched to the subject in real-time. The examples of the subsidiary databases that yield the highest matching score are used to extract the corresponding 3D configuration from the motion capture data, thereby estimating the upper body 3D pose.
Koller O, Zargaran O, Ney H, Bowden R (2016) Deep Sign: Hybrid CNN-HMM for Continuous Sign Language Recognition, Proceedings of the British Machine Vision Conference 2016 BMVA Press
This paper introduces the end-to-end embedding of a CNN into an HMM, while interpreting the outputs of the CNN in a Bayesian fashion. The hybrid CNN-HMM combines the strong discriminative abilities of CNNs with the sequence modelling capabilities of HMMs. Most current approaches in the field of gesture and sign language recognition disregard the necessity of dealing with sequence data both for training and evaluation. With our presented end-to-end embedding we are able to improve over the state-of-the-art on three challenging benchmark continuous sign language recognition tasks by between 15% and 38% relative and up to 13.3% absolute.
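The Bayesian reinterpretation at the heart of hybrid NN-HMM systems can be shown in a few lines: softmax posteriors are divided by state priors to obtain scaled likelihoods for the HMM. The numbers below are illustrative, and the paper's full training pipeline is of course not reproduced.

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, state_priors):
    """Reinterpret CNN softmax outputs p(s|x) as HMM emission scores.
    By Bayes' rule p(x|s) is proportional to p(s|x) / p(s); the constant
    p(x) cancels during Viterbi decoding."""
    return posteriors / state_priors

# Illustrative values: 3 frames, 4 HMM states.
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.6, 0.1, 0.1],
                 [0.1, 0.2, 0.6, 0.1]])
priors = np.array([0.4, 0.3, 0.2, 0.1])  # relative state frequencies in training
print(posteriors_to_scaled_likelihoods(post, priors))
```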
Kristan M, Matas J, Leonardis A, Felsberg M, Cehovin L, Fernandez G, Vojir T, Hager G, Nebehay G, Pflugfelder R, Gupta A, Bibi A, Lukezic A, Garcia-Martin A, Petrosino A, Saffari A, Montero A, Varfolomieiev A, Baskurt A, Zhao B, Ghanem B, Martinez B, Lee B, Han B, Wang C, Garcia C, Zhang C, Schmid C, Tao D, Kim D, Huang D, Prokhorov D, Du D, Yeung D, Ribeiro E, Khan F, Porikli F, Bunyak F, Zhu G, Seetharaman G, Kieritz H, Yau H, Li H, Qi H, Bischof H, Possegger H, Lee H, Nam H, Bogun I, Jeong J, Cho J, Lee J, Zhu J, Shi J, Li J, Jia J, Feng J, Gao J, Choi J, Kim J, Lang J, Martinez J, Choi J, Xing J, Xue K, Palaniappan K, Lebeda K, Alahari K, Gao K, Yun K, Wong K, Luo L, Ma L, Ke L, Wen L, Bertinetto L, Pootschi M, Maresca M, Danelljan M, Wen M, Zhang M, Arens M, Valstar M, Tang M, Chang M, Khan M, Fan N, Wang N, Miksik O, Torr P, Wang Q, Martin-Nieto R, Pelapur R, Bowden R, Laganière R, Moujtahid S, Hare S, Hadfield SJ, Lyu S, Li S, Zhu S, Becker S, Duffner S, Hicks S, Golodetz S, Choi S, Wu T, Mauthner T, Pridmore T, Hu W, Hübner W, Wang X, Li X, Shi X, Zhao X, Mei X, Shizeng Y, Hua Y, Li Y, Lu Y, Li Y, Chen Z, Huang Z, Chen Z, Zhang Z, He Z, Hong Z (2015) The Visual Object Tracking VOT2015 challenge results, ICCV workshop on Visual Object Tracking Challenge pp. 564-586
The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as in VOT2014, with full annotation of targets by rotated bounding boxes and per-frame attributes; (ii) extensions of the VOT2014 evaluation methodology by the introduction of a new performance measure. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.
Gilbert A, Bowden R (2015) Geometric Mining: Scaling Geometric Hashing to Large Datasets, 3rd Workshop on Web-scale Vision and Social Media (VSM), at ICCV 2015
Bowden R, Mitchell TA, Sarhadi M (2000) Non-linear Statistical Models for the 3D Reconstruction of Human Pose and Motion from Monocular Image Sequences, Image and Vision Computing 18 (9) pp. 729-737 Elsevier
This paper presents a model based approach to human body tracking in which the 2D silhouette of a moving human and the corresponding 3D skeletal structure are encapsulated within a non-linear point distribution model. This statistical model allows a direct mapping to be achieved between the external boundary of a human and the anatomical position. It is shown how this information, along with the position of landmark features such as the hands and head can be used to reconstruct information about the pose and structure of the human body from a monocular view of a scene.
Ellis L, Bowden R (2005) A generalised exemplar approach to modelling perception action couplings, Proceedings of the Tenth IEEE International Conference on Computer Vision Workshops pp. 1874-1874 IEEE
We present a framework for autonomous behaviour in vision based artificial cognitive systems by imitation through coupled percept-action (stimulus and response) exemplars. Attributed Relational Graphs (ARGs) are used as a symbolic representation of scene information (percepts). A measure of similarity between ARGs is implemented with the use of a graph isomorphism algorithm and is used to hierarchically group the percepts. By hierarchically grouping percept exemplars into progressively more general models coupled to progressively more general Gaussian action models, we attempt to model the percept space and create a direct mapping to associated actions. The system is built on a simulated shape sorter puzzle that represents a robust vision system. Spatio-temporal hypothesis exploration is performed efficiently in a Bayesian framework using a particle filter to propagate game play over time.
Bowden R (1998) Non-linear Point Distribution Models, The University of Edinburgh
Ellis L, Dowson N, Matas J, Bowden R (2007) Linear predictors for fast simultaneous modeling and tracking, 2007 IEEE 11TH INTERNATIONAL CONFERENCE ON COMPUTER VISION, VOLS 1-6 pp. 2792-2799 IEEE
Sheerman-Chase T, Ong E-J, Bowden R (2011) Cultural factors in the regression of non-verbal communication perception, 2011 IEEE International Conference on Computer Vision pp. 1242-1249 IEEE
Recognition of non-verbal communication (NVC) is important for understanding human communication and designing user centric user interfaces. Cultural differences affect the expression and perception of NVC but no previous automatic system considers these cultural differences. Annotation data for the LILiR TwoTalk corpus, containing dyadic (two person) conversations, was gathered using Internet crowdsourcing, with a significant quantity collected from India, Kenya and the United Kingdom (UK). Many studies have investigated cultural differences based on human observations but this has not been addressed in the context of automatic emotion or NVC recognition. Perhaps not surprisingly, testing an automatic system on data that is not culturally representative of the training data is seen to result in low performance. We address this problem by training and testing our system on a specific culture to enable better modeling of the cultural differences in NVC perception. The system uses linear predictor tracking, with features generated based on distances between pairs of trackers. The annotations indicated the strength of the NVC which enables the use of v-SVR to perform the regression.
Holt B, Ong E-J, Cooper H, Bowden R (2011) Putting the pieces together: Connected Poselets for human pose estimation, 2011 IEEE International Conference on Computer Vision pp. 1196-1201 IEEE
We propose a novel hybrid approach to static pose estimation called Connected Poselets. This representation combines the best aspects of part-based and example-based estimation. Poselets are first extracted from the training data; our method then applies a modified Random Decision Forest to identify poselet activations. By combining keypoint predictions from poselet activations within a graphical model, we can infer the marginal distribution over each keypoint without any kinematic constraints. Our approach is demonstrated on a new publicly available dataset with promising results.
Lebeda K, Hadfield S, Bowden R (2015) Dense Rigid Reconstruction from Unstructured Discontinuous Video, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW) pp. 814-822 IEEE
Koller O, Ney H, Bowden R (2014) Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora, LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA
Micilotta AS, Ong EJ, Bowden R (2006) Real-time upper body detection and 3D pose estimation in monoscopic images, Lecture Notes in Computer Science: Proceedings of 9th European Conference on Computer Vision, Part III 3953 pp. 139-150 Springer
This paper presents a novel solution to the difficult task of both detecting and estimating the 3D pose of humans in monoscopic images. The approach consists of two parts. Firstly the location of a human is identified by a probabilistic assembly of detected body parts. Detectors for the face, torso and hands are learnt using AdaBoost. A pose likelihood is then obtained using an a priori mixture model on body configuration, and possible configurations are assembled from available evidence using RANSAC. Once a human has been detected, the location is used to initialise a matching algorithm which matches the silhouette and edge map of a subject with a 3D model. This is done efficiently using chamfer matching, integral images and pose estimation from the initial detection stage. We demonstrate the application of the approach to large, cluttered natural images and at near framerate operation (16 fps) on lower resolution video streams.
Mitchell TA, Bowden R, Sarhadi M (2010) Efficient Texture Analysis for Industrial Inspection, International Journal of Production Research 38 (4) pp. 967-984 Taylor & Francis
This paper describes a convolution-based approach to the analysis of images containing few texture classes. Segmentation of foreground and background textures, or detection of boundaries between similarly textured objects, is demonstrated. The application to industrial inspection applications is demonstrated. Near frame-rate performance on low-cost hardware is possible, since only convolution with small kernels is used. A new algorithm to optimize convolution kernels for the required texture analysis task is presented. A key feature of the paper is the industrial readiness of the techniques described.
KaewTraKulPong P, Bowden R (2001) An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, Proceedings of 2nd European Workshop on Advanced Video Based Surveillance Systems, AVBS01. Sept 2001. Kluwer Academic Publishers
Real-time segmentation of moving regions in image sequences is a fundamental step in many vision systems including automated visual surveillance, human-machine interface, and very low-bandwidth telecommunications. A typical method is background subtraction. Many background models have been introduced to deal with different problems. One of the successful solutions to these problems is to use a multi-colour background model per pixel proposed by Grimson et al. [1, 2, 3]. However, the method suffers from slow learning at the beginning, especially in busy environments. In addition, it cannot distinguish between moving shadows and moving objects. This paper presents a method which improves this adaptive background mixture model. By reinvestigating the update equations, we utilise different equations at different phases. This allows our system to learn faster and more accurately, as well as adapt effectively to changing environments. A shadow detection scheme is also introduced in this paper. It is based on a computational colour space that makes use of our background model. A comparison has been made between the two algorithms. The results show the speed of learning and the accuracy of the model using our update algorithm over the Grimson et al. tracker. When incorporated with the shadow detection, our method results in far better segmentation than that of Grimson et al.
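For readers who want to experiment, OpenCV ships adaptive-mixture background subtractors in this family: cv2.bgsegm.createBackgroundSubtractorMOG (opencv-contrib) is modelled on this paper, while the sketch below uses the more widely available MOG2 variant (Zivkovic's later formulation) with its shadow detector; the video filename is a placeholder:

```python
# Sketch: adaptive mixture-model background subtraction with shadow detection.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

cap = cv2.VideoCapture("surveillance.avi")   # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)           # 255 = foreground, 127 = shadow
    # Keep only confident foreground, discarding pixels flagged as shadow.
    moving = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]
cap.release()
```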
Ong EJ, Koller O, Pugeault N, Bowden R (2014) Sign spotting using hierarchical sequential patterns with temporal intervals, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition pp. 1931-1938
© 2014 IEEE. This paper tackles the problem of spotting a set of signs occurring in videos containing sequences of signs. To achieve this, we propose to model the spatio-temporal signatures of a sign using an extension of sequential patterns that contain temporal intervals, called Sequential Interval Patterns (SIPs). We then propose a novel multi-class classifier that organises different sequential interval patterns in a hierarchical tree structure called a Hierarchical SIP Tree (HSP-Tree). This allows one to exploit any subsequence sharing that exists between different SIPs of different classes. Multiple trees are then combined together into a forest of HSP-Trees, resulting in a strong classifier that can be used to spot signs. We then show how the HSP-Forest can be used to spot sequences of signs that occur in an input video. We have evaluated the method on both concatenated sequences of isolated signs and continuous sign sequences. We also show that the proposed method is superior in robustness and accuracy to a state of the art sign recogniser when applied to spotting a sequence of signs.
Gilbert A, Bowden R (2005) Incremental modelling of the posterior distribution of objects for inter and intra camera tracking, BMVC 2005 - Proceedings of the British Machine Vision Conference 2005 BMVA
This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method to create the spatio-temporal links between cameras, and thus model the posterior probability distribution of these links. This can then be used with an appearance model of the object to track across cameras. It requires no calibration or batch preprocessing and becomes more accurate over time as evidence is accumulated.
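A minimal sketch of the incremental idea (illustrative only, not the paper's exact formulation): each observed hand-over between two cameras updates a delay histogram, which approximates the posterior over the spatio-temporal link as evidence accumulates. The bin width and camera indices are assumptions:

```python
# Sketch: incrementally learned spatio-temporal links between cameras.
from collections import defaultdict

BIN = 0.5  # seconds per histogram bin (illustrative)
link_counts = defaultdict(lambda: defaultdict(int))  # (i, j) -> {delay_bin: count}

def observe_transition(cam_from, cam_to, delay_seconds):
    """Record one object leaving cam_from and reappearing at cam_to."""
    link_counts[(cam_from, cam_to)][int(delay_seconds / BIN)] += 1

def link_posterior(cam_from, cam_to, delay_seconds):
    """Approximate posterior of this transit time, from accumulated evidence."""
    hist = link_counts[(cam_from, cam_to)]
    total = sum(hist.values())
    return hist[int(delay_seconds / BIN)] / total if total else 0.0

observe_transition(0, 1, 4.2)   # toy evidence
observe_transition(0, 1, 4.6)
print(link_posterior(0, 1, 4.4))
```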
Oshin O, Gilbert A, Illingworth J, Bowden R (2009) Learning to recognise spatio-temporal interest points, pp. 14-30
In this chapter, we present a generic classifier for detecting spatio-temporal interest points within video, the premise being that, given an interest point detector, we can learn a classifier that duplicates its functionality and which is both accurate and computationally efficient. This means that interest point detection can be achieved independent of the complexity of the original interest point formulation. We extend the naive Bayesian classifier of Randomised Ferns to the spatio-temporal domain and learn classifiers that duplicate the functionality of common spatio-temporal interest point detectors. Results demonstrate accurate reproduction of results with a classifier that can be applied exhaustively to video at frame-rate, without optimisation, in a scanning window approach. © 2010, IGI Global.
Bowden R, Mitchell TA, Sarhadi M (1997) Cluster Based non-linear Principle Component Analysis, Electronics Letters 33 (22) pp. 1858-1859 The Institution of Engineering and Technology
In the field of computer vision, principal component analysis (PCA) is often used to provide statistical models of shape, deformation or appearance. This simple statistical model provides a constrained, compact approach to model-based vision. However, as larger problems are considered, high dimensionality and nonlinearity make linear PCA an unsuitable and unreliable approach. A nonlinear PCA (NLPCA) technique is proposed which uses cluster analysis and dimensional reduction to provide a fast, robust solution. Simulation results on both 2D contour models and greyscale images are presented.
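A minimal sketch of the cluster-based approach, assuming scikit-learn: the data is partitioned by cluster analysis and a local linear PCA is fitted per cluster, giving a piecewise-linear approximation to the nonlinear manifold. Cluster and component counts are illustrative:

```python
# Sketch: piecewise-linear (cluster-based) nonlinear PCA.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_pca(X, n_clusters=4, n_components=2):
    """Cluster the data, then fit a local linear PCA model per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    models = {k: PCA(n_components=n_components).fit(X[labels == k])
              for k in range(n_clusters)}
    return labels, models

X = np.random.rand(500, 10)      # stand-in for shape/appearance vectors
labels, models = cluster_pca(X)
```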
Krejov P, Gilbert A, Bowden R (2015) Combining Discriminative and Model Based Approaches for Hand Pose Estimation, 2015 11TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), VOL. 5 IEEE
Pugeault N, Bowden R (2011) Spelling It Out: Real-Time ASL Fingerspelling Recognition, 2011 IEEE International Conference on Computer Vision Workshops pp. 1114-1119 IEEE
This article presents an interactive hand shape recognition user interface for American Sign Language (ASL) finger-spelling. The system makes use of a Microsoft Kinect device to collect appearance and depth images, and of the OpenNI+NITE framework for hand detection and tracking. Hand shapes corresponding to letters of the alphabet are characterized using appearance and depth images and classified using random forests. We compare classification using appearance and depth images, show that a combination of both leads to the best results, and validate on a dataset of four different users. The hand shape detection works in real time and is integrated in an interactive user interface that allows the signer to select between ambiguous detections, and is integrated with an English dictionary for efficient writing.
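A minimal sketch of the classification stage, assuming scikit-learn: per-frame appearance and depth descriptors are concatenated and a random forest predicts the letter. The descriptors here are random stand-ins; Kinect capture and hand segmentation are outside the sketch:

```python
# Sketch: random-forest letter classification on combined appearance + depth.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
appearance = rng.random((240, 64))    # stand-in appearance descriptors
depth = rng.random((240, 64))         # stand-in depth descriptors
letters = rng.integers(0, 24, 240)    # 24 static ASL letters (J and Z involve motion)

X = np.hstack([appearance, depth])    # combine both modalities
clf = RandomForestClassifier(n_estimators=100).fit(X, letters)
pred = clf.predict(X[:5])
```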
Kristan M, Pflugfelder R, Leonardis A, Matas J, Porikli F, Čehovin L, Nebehay G, Fernandez G, Vojíř T, Gatt A, Khajenezhad A, Salahledin A, Soltani-Farani A, Zarezade A, Petrosino A, Milton A, Bozorgtabar B, Li B, Chan CS, Heng C, Ward D, Kearney D, Monekosso D, Karaimer HC, Rabiee HR, Zhu J, Gao J, Xiao J, Zhang J, Xing J, Huang K, Lebeda K, Cao L, Maresca ME, Lim MK, ELHelw M, Felsberg M, Remagnino P, Bowden R, Goecke R, Stolkin R, Lim SYY, Maher S, Poullot S, Wong S, Satoh S, Chen W, Hu W, Zhang X, Li Y, Niu Z (2013) The visual object tracking VOT2013 challenge results, Proceedings of the IEEE International Conference on Computer Vision pp. 98-111
Visual tracking has attracted significant attention in the last few decades. The recent surge in the number of publications on tracking-related problems has made it almost impossible to follow the developments in the field. One of the reasons is that there is a lack of commonly accepted annotated data-sets and standardized evaluation protocols that would allow objective comparison of different tracking methods. To address this issue, the Visual Object Tracking (VOT) workshop was organized in conjunction with ICCV2013. Researchers from academia as well as industry were invited to participate in the first VOT2013 challenge, which aimed at single-object visual trackers that do not apply pre-learned models of object appearance (model-free). Presented here is the VOT2013 benchmark dataset for evaluation of single-object visual trackers as well as the results obtained by the trackers competing in the challenge. In contrast to related attempts in tracker benchmarking, the dataset is labeled per-frame by visual attributes that indicate occlusion, illumination change, motion change, size change and camera motion, offering a more systematic comparison of the trackers. Furthermore, we have designed an automated system for performing and evaluating the experiments. We present the evaluation protocol of the VOT2013 challenge and the results of a comparison of 27 trackers on the benchmark dataset. The dataset, the evaluation tools and the tracker rankings are publicly available from the challenge website (http://votchallenge.net). © 2013 IEEE.
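A minimal sketch of the two VOT-style measures (accuracy as per-frame bounding-box overlap, robustness as the number of failures); the exact challenge protocol also re-initialises trackers after a failure, which is omitted here:

```python
# Sketch: VOT-style accuracy (mean IoU) and robustness (failure count).
def iou(a, b):
    """Overlap of two axis-aligned boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def accuracy_robustness(predicted, ground_truth):
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)   # lost-track frames
    valid = [o for o in overlaps if o > 0.0]
    return (sum(valid) / len(valid) if valid else 0.0), failures
```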
Bowden R, Windridge D, Kadir T, Zisserman A, Brady M (2004) A Linguistic Feature Vector for the Visual Interpretation of Sign Language, European Conference on Computer Vision pp. 390-401
Hadfield SJ, Lebeda K, Bowden R (2016) Stereo reconstruction using top-down cues, Computer Vision and Image Understanding 157 pp. 206-222 Elsevier
We present a framework which allows standard stereo reconstruction to be unified with a wide range of classic top-down cues from urban scene understanding. The resulting algorithm is analogous to the human visual system where conflicting interpretations of the scene due to ambiguous data can be resolved based on a higher level understanding of urban environments. The cues which are reformulated within the framework include: recognising common arrangements of surface normals and semantic edges (e.g. concave, convex and occlusion boundaries), recognising connected or coplanar structures such as walls, and recognising collinear edges (which are common on repetitive structures such as windows). Recognition of these common configurations has only recently become feasible, thanks to the emergence of large-scale reconstruction datasets. To demonstrate the importance and generality of scene understanding during stereo-reconstruction, the proposed approach is integrated with 3 different state-of-the-art techniques for bottom-up stereo reconstruction. The use of high-level cues is shown to improve performance by up to 15 % on the Middlebury 2014 and KITTI datasets. We further evaluate the technique using the recently proposed HCI stereo metrics, finding significant improvements in the quality of depth discontinuities, planar surfaces and thin structures.
Dowson NDH, Bowden R (2005) Simultaneous modeling and tracking (SMAT) of feature sets, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol 2, Proceedings pp. 99-105 IEEE COMPUTER SOC
KaewTraKulPong P, Bowden R (2002) An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, In: Remagnino P, Jones GA, Paragios N, Regazzoni CS (eds.), Video-Based Surveillance Systems 11 Springer US
Real-time segmentation of moving regions in image sequences is a fundamental step in many vision systems including automated visual surveillance, human-machine interface, and very low-bandwidth telecommunications. A typical method is background subtraction. Many background models have been introduced to deal with different problems. One of the successful solutions to these problems is to use a multi-colour background model per pixel proposed by Grimson et al. [1, 2, 3]. However, the method suffers from slow learning at the beginning, especially in busy environments. In addition, it cannot distinguish between moving shadows and moving objects. This paper presents a method which improves this adaptive background mixture model. By reinvestigating the update equations, we utilise different equations at different phases. This allows our system to learn faster and more accurately, as well as adapt effectively to changing environments. A shadow detection scheme is also introduced in this paper. It is based on a computational colour space that makes use of our background model. A comparison has been made between the two algorithms. The results show the speed of learning and the accuracy of the model using our update algorithm over the Grimson et al. tracker. When incorporated with the shadow detection, our method results in far better segmentation than that of Grimson et al.
Krejov P, Gilbert A, Bowden R (2014) A Multitouchless Interface Expanding User Interaction, IEEE COMPUTER GRAPHICS AND APPLICATIONS 34 (3) pp. 40-48 IEEE COMPUTER SOC
Oshin O, Gilbert A, Bowden R (2014) Capturing relative motion and finding modes for action recognition in the wild, Computer Vision and Image Understanding
"Actions in the wild" is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies [1] or Internet databases [2]. This paper presents an approach to the categorisation of such activity in video, which is based solely on the relative distribution of spatio-temporal interest points. Presenting the Relative Motion Descriptor, we show that the distribution of interest points alone (without explicitly encoding their neighbourhoods) effectively describes actions. Furthermore, given the huge variability of examples within action classes in natural settings, we propose to further improve recognition by automatically detecting outliers, and breaking complex action categories into multiple modes. This is achieved using a variant of Random Sampling Consensus (RANSAC), which identifies and separates the modes. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. We demonstrate state-of-the-art performance on five human action datasets. © 2014 Elsevier Inc. All rights reserved.
Hadfield S, Bowden R (2015) Exploiting high level scene cues in stereo reconstruction, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) pp. 783-791 IEEE
Ellis L, Bowden R (2007) Learning Responses to Visual Stimuli: A Generic Approach, Proceedings of the 5th International Conference on Computer Vision Systems Applied Computer Science Group, Bielefeld University, Germany
A general framework for learning to respond appropriately to visual stimuli is presented. By hierarchically clustering percept-action exemplars in the action space, contextually important features and relationships in the perceptual input space are identified and associated with response models of varying generality. Searching the hierarchy for a set of best matching percept models yields a set of action models with likelihoods. By posing the problem as one of cost surface optimisation in a probabilistic framework, a particle filter inspired forward exploration algorithm is employed to select actions from multiple hypotheses that move the system toward a goal state and to escape from local minima. The system is quantitatively and qualitatively evaluated in both a simulated shape-sorter puzzle and a real-world autonomous navigation domain.
Windridge D, Bowden R (2004) Induced Decision Fusion in Automated Sign Language Interpretation: Using ICA to Isolate the Underlying Components of Sign, Multiple Classifier Systems pp. 303-313
Cooper HM (2010) Sign Language Recognition: Generalising to More Complex Corpora
The aim of this thesis is to find new approaches to Sign Language Recognition (SLR) which are suited to working with the limited corpora currently available. Data available for SLR is of limited quality: low resolution and frame rates make the task of recognition even more complex. The content is rarely natural, concentrating on isolated signs and filmed under laboratory conditions. In addition, the amount of accurately labelled data is minimal. To this end, several contributions are made: tracking the hands is eschewed in favour of detection-based techniques more robust to noise, which are investigated both for whole signs and for linguistically-motivated sign sub-units, to make best use of limited data sets. Finally, an algorithm is proposed to learn signs from the inset signers on TV, with the aid of the accompanying subtitles, thus increasing the corpus of data available.
Tracking fast moving hands under laboratory conditions is a complex task; move this to real-world data and the challenge is even greater. When using tracked data as a base for SLR, the errors in the tracking are compounded at the classification stage. Proposed instead is a novel sign detection method, which views space-time as a 3D volume and the sign within it as an object to be located. Features are combined into strong classifiers using a novel boosting implementation designed to create optimal classifiers over sparse datasets. Using boosted volumetric features on a robust frame-differenced input, average classification rates reach 71% on seen signers and 66% on a mixture of seen and unseen signers, with individual sign classification rates as high as 95%.

Using a classifier-per-sign approach to SLR means that data sets need to contain numerous examples of the signs to be learnt. Instead, this thesis proposes learnt classifiers to detect the common sub-units of sign. The responses of these classifiers can then be combined for recognition at the sign level. This approach requires fewer examples per sign to be learnt, since the sub-unit detectors are trained on data from multiple signs. It is also faster at detection time, since there are fewer classifiers to consult, the number of these being limited by the linguistics of sign and not the number of signs being detected. For this method, appearance-based boosted classifiers are introduced to distinguish the sub-units of sign. Results show that when combined with temporal models, these novel sub-unit classifiers can outperform similar classifiers
Sanfeliu A, Andrade-Cetto J, Barbosa M, Bowden R, Capitan J, Corominas A, Gilbert A, Illingworth J, Merino L, Mirats JM, Moreno P, Ollero A, Sequeira J, Spaan MTJ (2010) Decentralized Sensor Fusion for Ubiquitous Networking Robotics in Urban Areas, Sensors 10 (3) pp. 2274-2314 MOLECULAR DIVERSITY PRESERVATION INTERNATIONAL-MDPI
In this article we explain the architecture for the environment and sensors that has been built for the European project URUS (Ubiquitous Networking Robotics in Urban Sites), a project whose objective is to develop an adaptable network robot architecture for cooperation between network robots and human beings and/or the environment in urban areas. The project goal is to deploy a team of robots in an urban area to give a set of services to a user community. This paper addresses the sensor architecture devised for URUS and the type of robots and sensors used, including environment sensors and sensors onboard the robots. Furthermore, we also explain how sensor fusion takes place to achieve urban outdoor execution of robotic services. Finally some results of the project related to the sensor network are highlighted.
Merino L, Gilbert A, Bowden R, Illingworth J, Capitán J, Ollero A (2012) Data fusion in ubiquitous networked robot systems for urban services, Annales des Telecommunications/Annals of Telecommunications pp. 1-21
There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within the buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, by being more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimations on the positions of people from cameras on board robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to provide an increase in performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time on a large outdoor environment, including 22 non-overlapping cameras, WSN, and several robots. © 2012 Institut Mines-Télécom and Springer-Verlag.
Bowden R, Heap AJ, Hogg DC (1997) Real Time Hand Tracking and Gesture Recognition as a 3D Input Device for Graphical Applications, Proceedings of Gesture Workshop '96 pp. 117-129 Springer-Verlag
This paper outlines a system design and implementation of a 3D input device for graphical applications which uses real-time hand tracking and gesture recognition to provide the user with an intuitive interface for tomorrow's applications. Point Distribution Models (PDMs) have been shown to be successful at tracking deformable objects. This system demonstrates how these 'smart snakes' can be used in real time on a real-world problem. The system is based upon Open Inventor and designed for use with Silicon Graphics Indy workstations, but provisions have been made for the move to other platforms and applications. We demonstrate how PDMs provide the ideal feature vector for model classification. It is shown how computer vision can provide a low-cost, intuitive interface that has few hardware constraints. We also give the reader an insight into the next generation of HCI and multimedia, providing a 3D scene viewer and VRML browser based upon the hand tracker. Further allowances have been made to facilitate the inclusion of the hand tracker within third-party Inventor applications. All source code, libraries and applications can be downloaded for free from the above web addresses. This paper demonstrates how computer vision and computer graphics can work together, providing an interdisciplinary approach to problem solving.
Sheerman-Chase T, Ong E-J, Bowden R (2009) Online learning of robust facial feature trackers, 2009 IEEE 12th International Conference on Computer Vision Workshops pp. 1386-1392 IEEE
This paper presents a head pose and facial feature estimation technique that works over a wide range of pose variations without a priori knowledge of the appearance of the face. Using simple LK trackers, head pose is estimated by Levenberg-Marquardt (LM) pose estimation with the feature tracking acting as constraints. Factored sampling and RANSAC are employed both to provide a robust pose estimate and to identify tracker drift by constraining outliers in the estimation process. The system provides both a head pose estimate and the position of facial features, and is capable of tracking over a wide range of head poses.
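A minimal sketch of the two ingredients being combined, assuming OpenCV: pyramidal Lucas-Kanade point tracking followed by RANSAC pose estimation from 2D-3D correspondences, whose inlier mask flags drifted trackers as outliers. The 3D model points and intrinsics K are assumed inputs:

```python
# Sketch: LK feature tracking constrained by RANSAC pose estimation.
import cv2
import numpy as np

def track_and_estimate(prev_gray, gray, pts2d, pts3d, K):
    """pts2d: float32 array of shape (N, 1, 2); pts3d: float32 (N, 3)."""
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts2d, None)
    ok = status.ravel() == 1                     # successfully tracked points
    # Robust pose fit; the inlier mask identifies trackers consistent
    # with the estimated pose (outliers are likely drifted trackers).
    success, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d[ok], new_pts[ok], K, None)
    return rvec, tvec, inliers
```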
Okwechime D, Ong E-J, Bowden R (2009) Real-time motion control using pose space probability density estimation, 2009 IEEE 12th International Conference on Computer Vision Workshops pp. 2056-2063 IEEE
We introduce a new algorithm for real-time interactive motion control and demonstrate its application to motion captured data, pre-recorded videos and HCI. Firstly, a data set of frames is projected into a lower dimensional space. An appearance model is learnt using a multivariate probability distribution. A novel approach to determining transition points is presented based on k-medoids, whereby appropriate points of intersection in the motion trajectory are derived as cluster centres. These points are used to segment the data into smaller subsequences. A transition matrix combined with a kernel density estimation is used to determine suitable transitions between the subsequences to develop novel motion. To facilitate real-time interactive control, conditional probabilities are used to derive motion given user commands. The user commands can come from any modality including auditory, touch and gesture. The system is also extended to HCI using audio signals of speech in a conversation to trigger non-verbal responses from a synthetic listener in real-time. We demonstrate the flexibility of the model by presenting results ranging from data sets composed of vectorised images to 2D and 3D point representations. Results show real-time interaction and plausible motion generation between different types of movement.
Krejov P, Bowden R (2013) Multi-touchless: Real-time fingertip detection and tracking using geodesic maxima, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013 pp. 1-7 IEEE
Since the advent of multitouch screens, users have been able to interact using fingertip gestures in a two-dimensional plane. With the development of depth cameras, such as the Kinect, attempts have been made to reproduce the detection of gestures for three-dimensional interaction. Many of these use contour analysis to find the fingertips; however, the success of such approaches is limited due to sensor noise and rapid movements. This paper discusses an approach to identify fingertips during rapid movement at varying depths, allowing multitouch without contact with a screen. To achieve this, we use a weighted graph that is built using the depth information of the hand to determine the geodesic maxima of the surface. Fingertips are then selected from these maxima using a simplified model of the hand, and correspondence is found over successive frames. Our experiments show real-time performance for multiple users, providing tracking at 30fps for up to 4 hands, and we compare our results with state-of-the-art methods, providing accuracy an order of magnitude better than existing approaches. © 2013 IEEE.
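A minimal sketch of the geodesic-distance idea, assuming SciPy: depth pixels become graph nodes, neighbours are linked with weights that grow with depth difference, and shortest-path distances from a source (e.g. the palm centre) are computed; distant local maxima of the resulting map are fingertip candidates. The graph construction details are illustrative:

```python
# Sketch: geodesic distances over a depth-based graph of the hand surface.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_distances(depth, mask, source):
    """depth: 2D array; mask: boolean hand mask; source: (row, col) seed."""
    h, w = depth.shape
    idx = lambda r, c: r * w + c
    g = lil_matrix((h * w, h * w))
    for r in range(h):
        for c in range(w):
            if not mask[r, c]:
                continue
            for dr, dc in ((0, 1), (1, 0)):   # 4-neighbour links
                rr, cc = r + dr, c + dc
                if rr < h and cc < w and mask[rr, cc]:
                    # Edge weight grows with depth difference between pixels.
                    g[idx(r, c), idx(rr, cc)] = 1.0 + abs(
                        float(depth[r, c]) - float(depth[rr, cc]))
    dist = dijkstra(g.tocsr(), directed=False, indices=idx(*source))
    return dist.reshape(h, w)   # local maxima of this map ~ fingertip candidates

depth = np.zeros((8, 8)); mask = np.ones((8, 8), bool)
d = geodesic_distances(depth, mask, (4, 4))
```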
Hadfield SJ, Bowden R (2013) Scene Particles: Unregularized Particle Based Scene Flow Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3) pp. 564-576 IEEE Computer Society
In this paper, an algorithm is presented for estimating scene flow, which is a richer, 3D analogue of optical flow. The approach operates orders of magnitude faster than alternative techniques, and is well suited to further performance gains through parallelized implementation. The algorithm employs multiple hypotheses to deal with motion ambiguities, rather than the traditional smoothness constraints, removing oversmoothing errors and providing significant performance improvements on benchmark data, over the previous state of the art. The approach is flexible, and capable of operating with any combination of appearance and/or depth sensors, in any setup, simultaneously estimating the structure and motion if necessary. Additionally, the algorithm propagates information over time to resolve ambiguities, rather than performing an isolated estimation at each frame, as in contemporary approaches. Approaches to smoothing the motion field without sacrificing the benefits of multiple hypotheses are explored, and a probabilistic approach to occlusion estimation is demonstrated, leading to 10% and 15% improved performance respectively. Finally, a data driven tracking approach is described, and used to estimate the 3D trajectories of hands during sign language, without the need to model complex appearance variations at each viewpoint.
Hadfield S, Bowden R (2012) Go with the flow: Hand trajectories in 3D via clustered scene flow, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7324 LNCS (PART 1) pp. 285-295
Tracking hands and estimating their trajectories is useful in a number of tasks, including sign language recognition and human computer interaction. Hands are extremely difficult objects to track: their deformability, frequent self occlusions and motion blur cause appearance variations too great for most standard object trackers to deal with robustly. In this paper, the 3D motion field of a scene (known as the scene flow, in contrast to optical flow, which is its projection onto the image plane) is estimated using a recently proposed algorithm, inspired by particle filtering. Unlike previous techniques, this scene flow algorithm does not introduce blurring across discontinuities, making it far more suitable for object segmentation and tracking. Additionally the algorithm operates several orders of magnitude faster than previous scene flow estimation systems, enabling the use of scene flow in real-time, and near real-time applications. A novel approach to trajectory estimation is then introduced, based on clustering the estimated scene flow field in both space and velocity dimensions. This allows estimation of object motions in the true 3D scene, rather than the traditional approach of estimating 2D image plane motions. By working in the scene space rather than the image plane, the constant velocity assumption, commonly used in the prediction stage of trackers, is far more valid, and the resulting motion estimate is richer, providing information on out of plane motions. To evaluate the performance of the system, 3D trajectories are estimated on a multi-view sign-language dataset, and compared to a traditional high accuracy 2D system, with excellent results. © 2012 Springer-Verlag.
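A minimal sketch of the clustering step, assuming scikit-learn: each scene-flow estimate is treated as a 6D point (position plus velocity), and clustering jointly over space and velocity groups points that move together; a cluster's 3D centre tracked over time gives a trajectory. Mean shift stands in for the clustering here, and the data is random:

```python
# Sketch: grouping scene-flow estimates jointly in space and velocity.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(1)
positions = rng.random((300, 3))            # stand-in 3D points
velocities = rng.random((300, 3)) * 0.1     # stand-in 3D motion vectors
flow6d = np.hstack([positions, velocities]) # 6D space-velocity points

labels = MeanShift(bandwidth=0.3).fit_predict(flow6d)
hand_centre = flow6d[labels == 0].mean(axis=0)[:3]  # one cluster's 3D centre
```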
Moore S, Bowden R (2009) The Effects of Pose On Facial Expression Recognition, Proceedings of the British Machine Vision Conference pp. 1-11 BMVA Press
Research into facial expression recognition has predominantly been based upon near frontal view data. However, a recent 3D facial expression database (the BU-3DFE database) has allowed empirical investigation of facial expression recognition across pose. In this paper, we investigate the effects of pose from frontal to profile view on facial expression recognition. Experiments are carried out on 100 subjects with 5 yaw angles over 6 prototypical expressions. Expressions have 4 levels of intensity from subtle to exaggerated. We evaluate features such as local binary patterns (LBPs) as well as various extensions of LBPs. In addition, a novel approach to facial expression recognition is proposed using local Gabor binary patterns (LGBPs). Multi-class support vector machines (SVMs) are used for classification. We investigate the effects of image resolution and pose on facial expression classification using a variety of different features.
Ong EJ, Bowden R (2004) A boosted classifier tree for hand shape detection, SIXTH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, PROCEEDINGS pp. 889-894 IEEE COMPUTER SOC
Holt B, Bowden R (2012) Static pose estimation from depth images using random regression forests and Hough voting, VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications 1 pp. 557-564
Robust and fast algorithms for estimating the pose of a human given an image would have a far reaching impact on many fields in and outside of computer vision. We address the problem using depth data that can be captured inexpensively using consumer depth cameras such as the Kinect sensor. To achieve robustness and speed on a small training dataset, we formulate the pose estimation task within a regression and Hough voting framework. Our approach uses random regression forests to predict joint locations from each pixel and accumulates these predictions with Hough voting. The Hough accumulator images are treated as likelihood distributions where maxima correspond to joint location hypotheses. We demonstrate our approach and compare to the state-of-the-art on a publicly available dataset.
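A minimal sketch of the Hough-voting stage: each pixel casts a vote at its predicted joint location (the offsets below are stand-ins for regression-forest outputs), the accumulator is smoothed into a likelihood surface, and its maximum is taken as the joint hypothesis:

```python
# Sketch: accumulating per-pixel joint predictions with Hough voting.
import numpy as np
from scipy.ndimage import gaussian_filter

def hough_joint(pixels, offsets, shape):
    acc = np.zeros(shape)
    for (r, c), (dr, dc) in zip(pixels, offsets):
        rr, cc = int(r + dr), int(c + dc)
        if 0 <= rr < shape[0] and 0 <= cc < shape[1]:
            acc[rr, cc] += 1.0            # one vote per pixel prediction
    acc = gaussian_filter(acc, sigma=2)   # treat as a likelihood surface
    return np.unravel_index(np.argmax(acc), shape)

pixels = [(10, 10), (12, 14), (30, 40)]
offsets = [(5, 6), (3, 2), (-15, -24)]    # hypothetical forest predictions
print(hough_joint(pixels, offsets, (64, 64)))
```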
Kadir T, Bowden R, Ong EJ, Zisserman A (2004) Minimal Training, Large Lexicon, Unconstrained Sign Language Recognition, BMVC 2004 Electronic Proceedings pp. 939-948 The British Machine Vision Association and Society for Pattern Recognition
This paper presents a flexible monocular system capable of recognising sign lexicons far greater in number than previous approaches. The power of the system is due to four key elements: (i) head and hand detection based upon boosting, which removes the need for temperamental colour segmentation; (ii) a body-centred description of activity which overcomes issues with camera placement, calibration and user; (iii) a two-stage classification in which stage I generates a high-level linguistic description of activity which naturally generalises and hence reduces training; (iv) a stage II classifier bank which does not require HMMs, further reducing training requirements. The outcome is a system capable of running in real time and generating extremely high recognition rates for large lexicons with as little as a single training instance per sign. We demonstrate classification rates as high as 92% for a lexicon of 164 words with extremely low training requirements, outperforming previous approaches where thousands of training examples are required.
Bowden R, Cox S, Harvey R, Lan Y, Ong E-J, Owen G, Theobald B-J (2013) Recent developments in automated lip-reading, OPTICS AND PHOTONICS FOR COUNTERTERRORISM, CRIME FIGHTING AND DEFENCE IX; AND OPTICAL MATERIALS AND BIOMATERIALS IN SECURITY AND DEFENCE SYSTEMS TECHNOLOGY X 8901 SPIE-INT SOC OPTICAL ENGINEERING
Micilotta A, Bowden R (2004) View-based Location and Tracking of Body Parts for Visual Interaction, BMVC 2004 Electronic Proceedings pp. 849-858 The British Machine Vision Association and Society for Pattern Recognition
This paper presents a real-time approach to locate and track the upper torso of the human body. Our main interest is not in 3D biometric accuracy, but rather a sufficient discriminatory representation for visual interaction. The algorithm employs background suppression and a general approximation to body shape, applied within a particle filter framework, making use of integral images to maintain real-time performance. Furthermore, we present a novel method to disambiguate the hands of the subject and to predict the likely position of elbows. The final system is demonstrated segmenting multiple subjects from a cluttered scene at above real-time operation.
Efthimiou E, Fotinea SE, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Goudenove F (2010) DICTA-SIGN: Sign Language Recognition, Generation and Modelling with application in Deaf Communication, pp. 80-84
Gilbert A, Bowden R (2017) Image and Video Mining through Online Learning, Computer Vision and Image Understanding 158 pp. 72-84 Elsevier
Within the field of image and video recognition, the traditional approach is a dataset split into fixed training and test partitions. However, the labelling of the training set is time-consuming, especially as datasets grow in size and complexity. Furthermore, this approach is not applicable to the home user, who wants to intuitively group their media without tirelessly labelling the content. Consequently, we propose a solution similar in nature to an active learning paradigm, where a small subset of media is labelled as semantically belonging to the same class, and machine learning is then used to pull this and other related content together in the feature space. Our interactive approach is able to iteratively cluster classes of images and video. We reformulate it in an online learning framework and demonstrate competitive performance to batch learning approaches using only a fraction of the labelled data. Our approach is based around the concept of an image signature which, unlike a standard bag of words model, can express co-occurrence statistics as well as symbol frequency. We efficiently compute metric distances between signatures despite their inherent high dimensionality and provide discriminative feature selection, to allow common and distinctive elements to be identified from a small set of user labelled examples. These elements are then accentuated in the image signature to increase similarity between examples and pull correct classes together. By repeating this process in an online learning framework, the accuracy of similarity increases dramatically despite labelling only a few training examples. To demonstrate that the approach is agnostic to media type and features used, we evaluate on three image datasets (15 scene, Caltech101 and FG-NET), a mixed text and image dataset (ImageTag), a dataset used in active learning (Iris) and on three action recognition datasets (UCF11, KTH and Hollywood2). On the UCF11 video dataset, the accuracy is 86.7% despite using only 90 labelled examples from a dataset of over 1200 videos, instead of the standard 1122 training videos. The approach is both scalable and efficient, with a single iteration over the full UCF11 dataset of around 1200 videos taking approximately 1 minute on a standard desktop machine.
Hadfield S, Bowden R (2012) Generalised Pose Estimation Using Depth, In: Kutulakos K (eds.), Trends and Topics in Computer Vision ECCV 2010 Trends and Topics in Computer Vision. ECCV 2010. Lecture Notes in Computer Science 6553 (6553) pp. 312-325 Springer
Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information extracted from a calibrated stereo sequence is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.
Hadfield S, Bowden R (2010) Generalised Pose Estimation Using Depth, In proceedings, European Conference on Computer Vision (Workshops)
Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information extracted from a calibrated stereo sequence, is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.
Krejov P, Gilbert A, Bowden R (2016) Guided Optimisation through Classification and Regression for Hand Pose Estimation, Computer Vision and Image Understanding 155 pp. 124-138 Elsevier
This paper presents an approach to hand pose estimation that combines discriminative and model-based methods to leverage the advantages of both. Randomised Decision Forests are trained using real data to provide fast coarse segmentation of the hand. The segmentation then forms the basis of constraints applied in model fitting, using an efficient projected Gauss-Seidel solver, which enforces temporal continuity and kinematic limitations. However, when fitting a generic model to multiple users with varying hand shape, there is likely to be residual errors between the model and their hand. Also, local minima can lead to failures in tracking that are difficult to recover from. Therefore, we introduce an error regression stage that learns to correct these instances of optimisation failure. The approach provides improved accuracy over the current state of the art methods, through the inclusion of temporal cohesion and by learning to correct from failure cases. Using discriminative learning, our approach performs guided optimisation, greatly reducing model fitting complexity and radically improves efficiency. This allows tracking to be performed at over 40 frames per second using a single CPU thread.
Hadfield Simon, Lebeda K, Bowden Richard (2016) Hollywood 3D: What are the best 3D features for Action Recognition?, International Journal of Computer Vision 121 (1) pp. 95-110 Springer Verlag
Action recognition 'in the wild' is extremely challenging, particularly when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent growth of 3D data in broadcast content and commercial depth sensors makes it possible to overcome this. However, there is little work examining the best way to exploit this new modality. In this paper we introduce the Hollywood 3D benchmark, which is the first dataset containing 'in the wild' action footage including 3D data. This dataset consists of 650 stereo video clips across 14 action classes, taken from Hollywood movies. We provide stereo calibrations and depth reconstructions for each clip. We also provide an action recognition pipeline, and propose a number of specialised depth-aware techniques including five interest point detectors and three feature descriptors. Extensive tests allow evaluation of different appearance and depth encoding schemes. Our novel techniques exploiting this depth allow us to reach performance levels more than triple those of the best baseline algorithm using only appearance information. The benchmark data, code and calibrations are all made available to the community.
Lebeda K, Hadfield SJ, Bowden R (2017) TMAGIC: A Model-free 3D Tracker, IEEE Transactions on Image Processing 26 (9) pp. 4378-4388 IEEE
Significant effort has been devoted within the visual tracking community to rapid learning of object properties on the fly. However, state-of-the-art approaches still often fail in cases such as rapid out-of-plane rotation, when the appearance changes suddenly. One of the major contributions of this work is a radical rethinking of the traditional wisdom of modelling 3D motion as appearance change during tracking. Instead, 3D motion is modelled as 3D motion. This intuitive but previously unexplored approach provides new possibilities in visual tracking research. Firstly, 3D tracking is more general, as large out-of-plane motion is often fatal for 2D trackers, but helps 3D trackers to build better models. Secondly, the tracker's internal model of the object can be used in many different applications and it could even become the main motivation, with tracking supporting reconstruction rather than vice versa. This effectively bridges the gap between visual tracking and Structure from Motion. A new benchmark dataset of sequences with extreme out-of-plane rotation is presented and an online leader-board offered to stimulate new research in the relatively underdeveloped area of 3D tracking. The proposed method, provided as a baseline, is capable of successfully tracking these sequences, all of which pose a considerable challenge to 2D trackers (error reduced by 46 %).
Allday R, Hadfield S, Bowden R (2017) From Vision to Grasping: Adapting Visual Networks, TAROS-2017 Conference Proceedings. Lecture Notes in Computer Science 10454 pp. 484-494 Springer
Grasping is one of the oldest problems in robotics and is still considered challenging, especially when grasping unknown objects with unknown 3D shape. We focus on exploiting recent advances in computer vision recognition systems. Object classification problems tend to have much larger datasets to train from and have far fewer practical constraints around the size of the model and speed to train. In this paper we will investigate how to adapt Convolutional Neural Networks (CNNs), traditionally used for image classification, for planar robotic grasping. We consider the differences in the problems and how a network can be adjusted to account for this. Positional information is far more important to robotics than generic image classification tasks, where max pooling layers are used to improve translation invariance. By using a more appropriate network structure we are able to obtain improved accuracy while simultaneously improving run times and reducing memory consumption by reducing model size by up to 69%.
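A minimal sketch of the design point, assuming PyTorch: strided convolutions replace max-pooling so that positional information survives to the regression head. The layer sizes and the five-value grasp output (e.g. x, y, angle, width, score) are illustrative, not the paper's architecture:

```python
# Sketch: a pooling-free CNN that preserves positional information for grasping.
import torch
import torch.nn as nn

class GraspNet(nn.Module):
    def __init__(self, n_outputs=5):        # e.g. x, y, angle, width, score
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),  # stride, not pool
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, n_outputs),
        )

    def forward(self, x):
        return self.head(self.features(x))

net = GraspNet()
out = net(torch.randn(1, 3, 64, 64))   # 64x64 input -> 32 -> 16 -> 8 spatially
```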
Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2017) Taking the Scenic Route to 3D: Optimising Reconstruction from Moving Cameras, ICCV 2017 Proceedings IEEE
Reconstruction of 3D environments is a problem that has been widely addressed in the literature. While many approaches exist to perform reconstruction, few of them take an active role in deciding where the next observations should come from. Furthermore, the problem of travelling from the camera's current position to the next, known as path-planning, usually focuses on minimising path length. This approach is ill-suited for reconstruction applications, where learning about the environment is more valuable than speed of traversal.

We present a novel Scenic Route Planner that selects paths which maximise information gain, both in terms of total map coverage and reconstruction accuracy. We also introduce a new type of collaborative behaviour into the planning stage called opportunistic collaboration, which allows sensors to switch between acting as independent Structure from Motion (SfM) agents or as a variable baseline stereo pair.

We show that Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 0.00027% of the possible stereo pairs (3% of the views). Comparison against length-based path-planning approaches shows that our approach produces more complete and more accurate maps with fewer frames. Finally, we demonstrate the Scenic Pathplanner's ability to generalise to live scenarios by mounting cameras on autonomous ground-based sensor platforms and exploring an environment.
Camgöz N, Hadfield SJ, Koller O, Bowden R (2017) SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition, ICCV 2017 Proceedings IEEE
We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as 'sequence-to-sequence' learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while remaining trainable end-to-end.

The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also allow us to implicitly perform transfer learning between different interrelated tasks, which also allows us to exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties serves to significantly improve the performance of the overarching recognition system, by better constraining the learning problem.

The proposed techniques are demonstrated in the challenging domain of sign language recognition. We demonstrate state-of-the-art performance on hand-shape recognition (outperforming previous techniques by more than 30%). Furthermore, we are able to obtain comparable sign recognition rates to previous research, without the need for an alignment step to segment out the signs for recognition.
Cooper H, Ong E, Pugeault N, Bowden R (2017) Sign Language Recognition Using Sub-units, In: Escalera S, Guyon I, Athitsos V (eds.), Gesture Recognition pp. 89-118 Springer International Publishing
This chapter discusses sign language recognition using linguistic sub-units. It presents three types of sub-units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data. These sub-units are then combined using a sign level classifier; here, two options are presented. The first uses Markov Models to encode the temporal changes between sub-units. The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information. This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains to 76%.
Camgöz N, Hadfield S, Bowden R (2017) Particle Filter based Probabilistic Forced Alignment for Continuous Gesture Recognition, IEEE International Conference on Computer Vision Workshops (ICCVW) 2017 pp. 3079-3085 IEEE
In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatio-temporal deep neural networks using weak border-level annotations.

The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples and using this as the prior for the next sampling stage.

We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index scores on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner independent; it can easily be combined with other approaches.
Autonomous 3D reconstruction, the process whereby an agent can produce its own representation of the world, is an extremely challenging area in both vision and robotics. However, 3D reconstructions have the ability to grant robots the understanding of the world necessary for collaboration and high-level goal execution. Therefore, this thesis aims to explore methods that will enable modern robotic systems to autonomously and collaboratively achieve an understanding of the world.

In the real world, reconstructing a 3D scene requires nuanced understanding of the environment. Additionally, it is not enough to simply 'understand' the world; autonomous agents must be capable of actively acquiring this understanding. Achieving all of this using simple monocular sensors is extremely challenging. Agents must be able to understand what areas of the world are navigable, how egomotion affects reconstruction and how other agents may be leveraged to provide an advantage. All of this must be considered in addition to the traditional 3D reconstruction issues of correspondence estimation, triangulation and data association.

Simultaneous Localisation and Mapping (SLAM) solutions are not particularly well suited to autonomous multi-agent reconstruction. They typically require the sensors to be in constant communication, do not scale well with the number of agents (or map size) and require expensive optimisations. Instead, this thesis attempts to develop more pro-active techniques from the ground up.

First, an autonomous agent must have the ability to actively select what it is going to reconstruct. Known as view-selection, or Next-Best View (NBV), this has recently become an active topic in autonomous robotics and will form the first contribution of this thesis. Second, once a view is selected, an autonomous agent must be able to plan a trajectory to arrive at that view. This problem, known as path-planning, can be considered a core topic in the robotics field and will form the second contribution of this thesis. Finally, the 3D reconstruction must be anchored to a globally consistent map that co-relates to the real world. This will be addressed as a floorplan localisation problem, an emerging field for the vision community, and will be the third contribution of this thesis.

To give autonomous agents the ability to actively select what data to process, this thesis discusses the NBV problem in the context of Multi-View Stereo (MVS). The proposed approach has the ability to massively reduce the amount of computing resources required for any given 3D reconstruction. More importantly, it autonomously selects the views that improve the reconstruction the most. All of this is done exclusively on the sensor pose; the images are not used for view-selection and only loaded into memory once they have been selected for reconstruction. Experimental evaluation shows that NBV applied to this problem can achieve results comparable to state-of-the-art using as little as 3.8% of the views.

To provide the ability to execute an autonomous 3D reconstruction, this thesis proposes a novel computer-vision based goal-estimation and path-planning approach. The method proposed in the previous chapter is extended into a continuous pose-space. The resulting view then becomes the goal of a Scenic Pathplanner that plans a trajectory between the current robot pose and the NBV. This is done using an NBV-based pose-space that biases the paths towards areas of high information gain. Experimental evaluation shows that the Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 3% of the views, which corresponds to 2.7 × 10⁻⁴% of the possible stereo pairs (using a naive interpretation of plausible stereo pairs). Comparison against length-based path-planning approaches shows that the Scenic Pathplanner produces more complete and more accurate maps with fewer frames. Finally, the ability of the Scenic Pathplanner to generalise to live scenarios is demonstrated by mounting cameras on autonomous ground-based sensor platforms and exploring an environment.

Hadfield Simon, Lebeda Karel, Bowden Richard (2018) HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence Institute of Electrical and Electronics Engineers (IEEE)
This paper proposes a Hybrid Approximate Representation (HAR) based on unifying several efficient approximations of the generalized reprojection error (which is known as the gold standard for multiview geometry). The HAR is an over-parameterization scheme where the approximation is applied simultaneously in multiple parameter spaces. A joint minimization scheme, 'HAR-Descent', can then solve the PnP problem efficiently, while remaining robust to approximation errors and local minima.

The technique is evaluated extensively, including numerous synthetic benchmark protocols and the real-world data evaluations used in previous works. The proposed technique was found to have runtime complexity comparable to the fastest O(n) techniques, and up to 10 times faster than current state of the art minimization approaches. In addition, the accuracy exceeds that of all 9 previous techniques tested, providing definitive state of the art performance on the benchmarks, across all 90 of the experiments in the paper and supplementary material.
Ebling S, Camgöz N, Boyes Braem P, Tissi K, Sidler-Miserez S, Stoll S, Hadfield S, Haug T, Bowden R, Tornay S, Razavi M, Magimai-Doss M (2018) SMILE Swiss German Sign Language Dataset, Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC) 2018 The European Language Resources Association (ELRA)
Sign language recognition (SLR) involves identifying the form and meaning of isolated signs or sequences of signs. To our knowledge, the combination of SLR and sign language assessment is novel. The goal of an ongoing three-year project in Switzerland is to pioneer an assessment system for lexical signs of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) that relies on SLR. The assessment system aims to give adult L2 learners of DSGS feedback on the correctness of the manual parameters (handshape, hand position, location, and movement) of isolated signs they produce. In its initial version, the system will include automatic feedback for a subset of a DSGS vocabulary production test consisting of 100 lexical items. To provide the SLR component of the assessment system with sufficient training samples, a large-scale dataset containing videotaped repeated productions of the 100 items of the vocabulary test, with associated transcriptions and annotations, was created, consisting of data from 11 adult L1 signers and 19 adult L2 learners of DSGS. This paper introduces the dataset, which will be made available to the research community.
Mendez Maldonado Oscar, Hadfield Simon, Pugeault Nicolas, Bowden Richard (2018) SeDAR - Semantic Detection and Ranging: Humans can localise without LiDAR, can robots?, Proceedings of the 2018 IEEE International Conference on Robotics and Automation, May 21-25, 2018, Brisbane, Australia IEEE
How does a person work out their location using a floorplan? It is probably safe to say that we do not explicitly measure depths to every visible surface and try to match them against different pose estimates in the floorplan. And yet, this is exactly how most robotic scan-matching algorithms operate. Similarly, we do not extrude the 2D geometry present in the floorplan into 3D and try to align it to the real world. And yet, this is how most vision-based approaches localise.

Humans do the exact opposite. Instead of depth, we use high-level semantic cues. Instead of extruding the floorplan up into the third dimension, we collapse the 3D world into a 2D representation. Evidence of this is that many of the floorplans we use in everyday life are not accurate, opting instead for high levels of discriminative landmarks.

In this work, we use this insight to present a global localisation approach that relies solely on the semantic labels present in the floorplan and extracted from RGB images. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary, as we can achieve results comparable to state-of-the-art without them.
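
The general idea of weighting localisation hypotheses by semantic agreement (rather than raw depth) can be sketched as a Monte Carlo localisation step. The sensor model, the `floorplan_label` lookup, and the hit/miss probabilities below are hypothetical stand-ins, not SeDAR's actual implementation:

```python
# Semantic MCL sketch: weight pose particles by how often the semantic
# label observed along each bearing matches the label the floorplan
# predicts from that pose, then resample.
import numpy as np

def weight_particles(particles, observed_labels, bearings, floorplan_label,
                     p_hit=0.8, p_miss=0.2):
    """particles: (N, 3) array of [x, y, theta] pose hypotheses."""
    weights = np.ones(len(particles))
    for i, (x, y, theta) in enumerate(particles):
        for label, bearing in zip(observed_labels, bearings):
            expected = floorplan_label(x, y, theta + bearing)  # e.g. 'door'
            weights[i] *= p_hit if expected == label else p_miss
    return weights / weights.sum()

def resample(particles, weights, rng=np.random.default_rng()):
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]
```
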
Camgöz Necati Cihan, Hadfield Simon, Koller O, Ney H, Bowden Richard (2018) Neural Sign Language Translation, CVPR 2018 Proceedings pp. 7784-7793 IEEE
Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar.

We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss-level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.
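
The encoder-decoder formulation at the heart of neural SLT can be sketched generically: video frame features in, spoken-language tokens out. The PyTorch module below is illustrative only; the layer types, dimensions, and names are assumptions, not the paper's architecture:

```python
# Minimal sequence-to-sequence sketch: a recurrent encoder summarises
# the sign video's frame features, and a teacher-forced recurrent
# decoder emits spoken-language word logits.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, feat_dim=1024, hid=512, vocab=3000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.decoder = nn.GRU(hid, hid, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, frames, tokens):
        # frames: (B, T, feat_dim) CNN features; tokens: (B, L) word ids.
        _, h = self.encoder(frames)           # summarise the sign video
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)              # (B, L, vocab) logits
```
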
Hadfield S, Bowden R (2012) Go With The Flow: Hand Trajectories in 3D via Clustered Scene Flow, In Proceedings, International Conference on Image Analysis and Recognition LNCS 7 pp. 285-295
Tracking hands and estimating their trajectories is useful in a number of tasks, including sign language recognition and human-computer interaction. Hands are extremely difficult objects to track: their deformability, frequent self-occlusions and motion blur cause appearance variations too great for most standard object trackers to deal with robustly.

In this paper, the 3D motion field of a scene (known as the Scene Flow, in contrast to Optical Flow, which is its projection onto the image plane) is estimated using a recently proposed algorithm inspired by particle filtering. Unlike previous techniques, this scene flow algorithm does not introduce blurring across discontinuities, making it far more suitable for object segmentation and tracking. Additionally, the algorithm operates several orders of magnitude faster than previous scene flow estimation systems, enabling the use of Scene Flow in real-time and near real-time applications.

A novel approach to trajectory estimation is then introduced, based on clustering the estimated scene flow field in both space and velocity dimensions. This allows estimation of object motions in the true 3D scene, rather than the traditional approach of estimating 2D image plane motions. By working in the scene space rather than the image plane, the constant-velocity assumption commonly used in the prediction stage of trackers is far more valid, and the resulting motion estimate is richer, providing information on out-of-plane motions. To evaluate the performance of the system, 3D trajectories are estimated on a multi-view sign-language dataset and compared to a traditional high-accuracy 2D system, with excellent results.
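
Clustering a flow field jointly in position and velocity is straightforward to sketch. The 6D feature, the velocity weighting, and the DBSCAN parameters below are illustrative choices, not the paper's exact pipeline:

```python
# Trajectory extraction by clustering a scene-flow field in space and
# velocity: points that are close AND moving similarly end up in the
# same cluster, whose centroid can be tracked over frames.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_flow(points_xyz, velocities, vel_weight=2.0):
    """points_xyz, velocities: (N, 3) arrays from a scene-flow estimate."""
    feats = np.hstack([points_xyz, vel_weight * velocities])   # (N, 6)
    labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(feats)
    centroids = {}
    for k in set(labels) - {-1}:                  # -1 is DBSCAN noise
        centroids[k] = points_xyz[labels == k].mean(axis=0)
    return labels, centroids
```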

Hadfield S, Bowden R (2011) Kinecting the dots: Particle Based Scene Flow From Depth Sensors, In Proceedings, International Conference on Computer Vision (ICCV) pp. 2290-2295
The motion field of a scene can be used for object segmentation and to provide features for classification tasks like action recognition. Scene flow is the full 3D motion field of the scene, and is more difficult to estimate than its 2D counterpart, optical flow. Current approaches use a smoothness cost for regularisation, which tends to over-smooth at object boundaries.

This paper presents a novel formulation for scene flow estimation: the scene is treated as a collection of moving points in 3D space, modelled using a particle filter that supports multiple hypotheses and does not over-smooth the motion field.

In addition, this paper is the first to address scene flow estimation while making use of modern depth sensors and monocular appearance images, rather than traditional multi-viewpoint rigs.

The algorithm is applied to an existing scene flow dataset, where it achieves comparable results to approaches utilising multiple views, while taking a fraction of the time.
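
A miniature version of particle-based motion estimation can convey why it avoids over-smoothing: each point carries its own set of velocity hypotheses, so no global smoothness term couples neighbouring points. Everything below, including the `consistency` observation model, is a hypothetical stand-in:

```python
# Per-point particle filtering of 3D velocities: propagate hypotheses
# with noise, weight them by observation consistency, resample, and
# read off the weighted-mean flow.
import numpy as np

rng = np.random.default_rng(0)

def step(points, particles, consistency, noise=0.01):
    """points: (N, 3); particles: (N, P, 3) velocity hypotheses."""
    particles = particles + rng.normal(0.0, noise, particles.shape)
    w = consistency(points[:, None, :] + particles)    # (N, P) weights
    w = w / w.sum(axis=1, keepdims=True)
    # Resample hypotheses independently for each point.
    idx = np.array([rng.choice(w.shape[1], w.shape[1], p=w[i])
                    for i in range(len(points))])
    particles = np.take_along_axis(particles, idx[..., None], axis=1)
    flow = particles.mean(axis=1)
    return points + flow, particles, flow
```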

Lebeda K, Matas J, Bowden R (2013) Tracking the Untrackable: How to Track When Your Object Is Featureless, Lecture Notes in Computer Science 7729 pp. 343-355 Springer
We propose a novel approach to tracking objects by low-level line correspondences. In our implementation, we show that this approach is usable even when tracking objects that lack texture, exploiting situations where feature-based trackers fail due to the aperture problem. Furthermore, we suggest an approach to failure detection and recovery to maintain long-term stability. This is achieved by remembering configurations which lead to good pose estimations and using them later for tracking corrections.

We carried out experiments on several sequences of different types. The proposed tracker proves competitive or superior to state-of-the-art trackers in both standard and low-textured scenes.
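
To make the aperture-problem point concrete, here is an illustrative least-squares estimator (not the paper's) that fits a 2D similarity transform from line correspondences. Each correspondence only penalises point-to-line distance, so sliding along a line is unconstrained; multiple lines with different orientations jointly pin the pose down:

```python
# Fit a 2D similarity transform [a -b tx; b a ty] that moves sample
# points on the old lines onto the (infinite) new lines. The residual
# n . (T p) - d is linear in (a, b, tx, ty), so ordinary least squares
# suffices.
import numpy as np

def fit_similarity(old_pts, new_normals, new_offsets):
    """old_pts: (M, 2) points on old lines; each new line: n . x = d."""
    A, b = [], []
    for (px, py), (nx, ny), d in zip(old_pts, new_normals, new_offsets):
        A.append([nx * px + ny * py, ny * px - nx * py, nx, ny])
        b.append(d)
    sol, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return sol  # (a, b, tx, ty)
```
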

Merino L, Gilbert A, Capitán J, Bowden R, Illingworth J, Ollero A (2012) Data fusion in ubiquitous networked robot systems for urban services, Annales des Telecommunications/Annals of Telecommunications 67 (7-8) pp. 355-375 Springer
There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, by being more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimations of the positions of people from cameras on board robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to provide an increase in performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time on a large outdoor environment, including 22 non-overlapping cameras, a WSN, and several robots.
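
A standard building block for this kind of decentralised fusion is covariance intersection, which combines two estimates of the same quantity without knowing their cross-correlation. The sketch below names that general technique; it is not necessarily the exact fusion rule used in the paper:

```python
# Covariance intersection: fuse two Gaussian estimates (x1, P1) and
# (x2, P2) of the same state (e.g. a person's position seen by a
# robot and by a camera network) with unknown cross-correlation.
import numpy as np

def covariance_intersection(x1, P1, x2, P2, w=0.5):
    """w in (0, 1) trades trust between the two sources."""
    P1i, P2i = np.linalg.inv(P1), np.linalg.inv(P2)
    P = np.linalg.inv(w * P1i + (1 - w) * P2i)
    x = P @ (w * P1i @ x1 + (1 - w) * P2i @ x2)
    return x, P
```
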
Gilbert A, Illingworth J, Bowden R (2009) Fast realistic multi-action recognition using mined dense spatio-temporal features, Proceedings of the 12th IEEE International Conference on Computer Vision pp. 925-931 IEEE
Within the field of action recognition, features and descriptors are often engineered to be sparse and invariant to transformation. While sparsity makes the problem tractable, it is not necessarily optimal in terms of class separability and classification. This paper proposes a novel approach that uses very dense corner features that are spatially and temporally grouped in a hierarchical process to produce an overcomplete compound feature set. Frequently reoccurring patterns of features are then found through data mining, designed for use with large data sets. The novel use of the hierarchical classifier allows real-time operation, while the approach is demonstrated to handle camera motion, scale, human appearance variations, occlusions and background clutter. The classification performance outperforms other state-of-the-art action recognition algorithms on three datasets: KTH, multi-KTH, and Hollywood. Multiple-action localisation is performed without requiring ground-truth localisation data, using only weak supervision of class labels for each training sequence. The Hollywood dataset contains complex realistic actions from movies; the approach outperforms the published accuracy on this dataset and also achieves real-time performance.
Lebeda K, Hadfield S, Matas J, Bowden R (2013) Long-Term Tracking Through Failure Cases, ICCV workshop on Visual Object Tracking Challenge pp. 153-160
Long-term tracking of an object, given only a single instance in an initial frame, remains an open problem. We propose a visual tracking algorithm robust to many of the difficulties which often occur in real-world scenes. Correspondences of edge-based features are used to overcome the reliance on the texture of the tracked object and improve invariance to lighting. Furthermore, we address long-term stability, enabling the tracker to recover from drift and to provide redetection following object disappearance or occlusion. The two-module principle is similar to the successful state-of-the-art long-term TLD tracker; however, our approach extends to cases of low-textured objects. Besides reporting our results on the VOT Challenge dataset, we perform two additional experiments. Firstly, results on short-term sequences show the performance of tracking challenging objects which represent failure cases for competing state-of-the-art approaches. Secondly, long sequences are tracked, including one of almost 30,000 frames, which to our knowledge is the longest tracking sequence reported to date. This tests the re-detection and drift resistance properties of the tracker. All the results are comparable to the state-of-the-art on sequences with textured objects and superior on non-textured objects. The new annotated sequences are made publicly available.
Stoll Stephanie, Camgöz Necati Cihan, Hadfield Simon, Bowden Richard (2018) Sign Language Production using Neural Machine Translation and Generative Adversarial Networks, Proceedings of the 29th British Machine Vision Conference (BMVC 2018) British Machine Vision Association
We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data-driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.
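
The final stage, conditioning a generative model on pose, can be sketched schematically. The generator skeleton below is a hedged illustration in PyTorch; the layer sizes, noise dimension, and output resolution are placeholders, not the paper's model:

```python
# Pose-conditioned generator sketch: concatenate a skeletal pose
# vector with noise, then upsample to an image-sized sign frame.
import torch
import torch.nn as nn

class PoseConditionedGenerator(nn.Module):
    def __init__(self, pose_dim=50, noise_dim=64, img_ch=3):
        super().__init__()
        self.fc = nn.Linear(pose_dim + noise_dim, 128 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, pose, noise):
        x = self.fc(torch.cat([pose, noise], dim=1)).view(-1, 128, 8, 8)
        return self.up(x)  # (B, img_ch, 64, 64) frame conditioned on pose
```
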
Koller Oscar, Zargaran Sepehr, Ney Hermann, Bowden Richard (2018) Deep Sign: Enabling Robust Statistical Continuous Sign Language Recognition via Hybrid CNN-HMMs, International Journal of Computer Vision 126 (12) pp. 1311-1325 Springer
This manuscript introduces the end-to-end embedding of a CNN into an HMM, while interpreting the outputs of the CNN in a Bayesian framework. The hybrid CNN-HMM combines the strong discriminative abilities of CNNs with the sequence modelling capabilities of HMMs. Most current approaches in the field of gesture and sign language recognition disregard the necessity of dealing with sequence data both for training and evaluation. With our presented end-to-end embedding we are able to improve over the state-of-the-art on three challenging benchmark continuous sign language recognition tasks by between 15% and 38% relative reduction in word error rate and up to 20% absolute. We analyse the effect of the CNN structure, network pretraining and number of hidden states. We compare the hybrid modelling to a tandem approach and evaluate the gain of model combination.
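
The standard hybrid NN-HMM trick referenced here is to reinterpret CNN softmax posteriors p(state | frame) as scaled likelihoods p(frame | state) ∝ p(state | frame) / p(state) and decode with the HMM. The sketch below shows that rescaling plus Viterbi decoding; the priors and transition matrix are placeholders, and this is the generic recipe rather than the paper's full training pipeline:

```python
# Hybrid decoding: Bayes-rescale CNN posteriors into emission scores,
# then run Viterbi over the HMM state graph.
import numpy as np

def hybrid_loglik(cnn_posteriors, state_priors):
    """Scaled log-likelihoods: log p(s|x) - log p(s)."""
    return np.log(cnn_posteriors) - np.log(state_priors)

def viterbi(scaled_loglik, log_trans, log_init):
    """scaled_loglik: (T, S); log_trans: (S, S); log_init: (S,)."""
    T, S = scaled_loglik.shape
    delta = log_init + scaled_loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans        # cand[i, j]: from i to j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + scaled_loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                            # most likely state sequence
```
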
Spencer Jaime, Mendez Maldonado Oscar, Bowden Richard, Hadfield Simon (2018) Localisation via Deep Imagination: learn the features not the map, Proceedings of ECCV 2018 - European Conference on Computer Vision Springer Nature
How many times does a human have to drive through the same area to become familiar with it? To begin with, we might first build a mental model of our surroundings. Upon revisiting this area, we can use this model to extrapolate to new unseen locations and imagine their appearance.

Based on this, we propose an approach where an agent is capable of modelling new environments after a single visitation. To this end, we introduce "Deep Imagination", a combination of classical Visual-based Monte Carlo Localisation and deep learning. By making use of a feature-embedded 3D map, the system can "imagine" the view from any novel location. These "imagined" views are contrasted with the current observation in order to estimate the agent's current location. In order to build the embedded map, we train a deep Siamese Fully Convolutional U-Net to perform dense feature extraction. By training these features to be generic, no additional training or fine-tuning is required to adapt to new environments.

Our results demonstrate the generality and transfer capability of our learnt dense features by training and evaluating on multiple datasets. Additionally, we include several visualizations of the feature representations and resulting 3D maps, as well as their application to localisation.
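
The matching step, projecting a feature-embedded 3D map into a candidate pose to "imagine" that view and scoring it against the live observation, can be sketched as follows. The pinhole projection and feature-distance score are simplified stand-ins for the paper's system:

```python
# Score a candidate pose by projecting map points (each carrying a
# learnt feature vector) into the camera and comparing against the
# observed dense feature image.
import numpy as np

def imagine_and_score(map_pts, map_feats, obs_feat_img, K, R, t):
    """map_pts: (N, 3); map_feats: (N, D); obs_feat_img: (H, W, D)."""
    cam = (R @ map_pts.T + t[:, None]).T          # world -> camera coords
    vis = cam[:, 2] > 0                           # keep points in front
    uv = (K @ cam[vis].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)     # pinhole projection
    H, W, _ = obs_feat_img.shape
    score, count = 0.0, 0
    for (u, v), f in zip(uv, map_feats[vis]):
        if 0 <= v < H and 0 <= u < W:
            score -= np.linalg.norm(obs_feat_img[v, u] - f)
            count += 1
    return score / max(count, 1)                  # higher = better pose
```
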
Lebeda Karel, Hadfield Simon J., Bowden Richard 3DCars, IEEE