Professor Richard Bowden


Professor of Computer Vision and Machine Learning
BSc, MSc, PhD, SMIEEE, FHEA, FIAPR
+44 (0)1483 689838
22 BA 00

Biography

Areas of specialism

Sign and gesture recognition; Deep learning; Cognitive Robotics; Activity and action recognition; Lip-reading; Machine Perception; Facial feature tracking; Autonomous Vehicles; Computer Vision; AI

University roles and responsibilities

  • Professor of Computer Vision and Machine Learning

    My qualifications

    1993
    BSc degree in computer science
    University of London
    1995
    MSc degree with distinction
    University of Leeds
    1999
    PhD degree in computer vision
    Brunel University

    Previous roles

    2015 - 2016
    Postgraduate Research Director for Faculty
    University of Surrey
    2016 - 2018
    Associate Dean, Doctoral College
    University of Surrey
    2013 - 2014
    Royal Society Leverhulme Trust Senior Research Fellowship
    Royal Society
    2012
    General chair
    BMVC2012
    2012
    Track chair
    ICPR2012
    2008 - 2011
    Reader
    University of Surrey
    2010
    General Chair
    Sign, Gesture, Activity 2010
    2003 - 2009
    Senior Tutor for Professional Training
    University of Surrey
    2006 - 2008
    Senior Lecturer
    University of Surrey
    2001 - 2006
    Lecturer in Multimedia Signal Processing
    University of Surrey
    2001 - 2004
    Visiting Research Fellow working with Profs Zisserman and Brady
    University of Oxford
    1998 - 2001
    Lecturer in Image Processing
    Brunel University
    1997
    General Chair
    VRSig97

    Affiliations and memberships

    IEEE Transactions on Pattern Analysis and Machine Intelligence
    Associate Editor
    Image and Vision Computing journal
    Associate Editor
    British Machine Vision Association (BMVA) Executive Committee
    Previous member
    British Machine Vision Association (BMVA) Executive Committee
    Previous Company Director
    Higher Education Academy
    Fellow
    IEEE
    Senior Member
    IAPR
    Fellow

    Research

    Research interests

    Research projects

    Indicators of esteem

    • Awarded Fellow of the International Association of Pattern Recognition in 2016

    • Member of Royal Society’s International Exchanges Committee 2016

    • Royal Society Leverhulme Trust Senior Research Fellowship

    • Sullivan thesis prize in 2000

    • Executive Committee member and theme leader for EPSRC ViiHM Network 2015

    • TIGA Games award for Makaton Learning Environment with Gamelab UK 2013

    • Appointed Associate Editor for IEEE Pattern Analysis and Machine Intelligence 2013

    • Best Paper Award at VISAPP2012

    • Advisory Board for Springer Advances in Computer Vision and Pattern Recognition

    • General Chair BMVC2012

    • Outstanding Reviewer Award ICCV 2011

    • Best Paper Award at IbPRIA2011

    • Main Track Chair (Computer & Robot Vision) ICPR2012, Japan

    • Appointed Associate Editor International Journal of Image & Vision Computing

    Supervision

    Postgraduate research supervision


    My teaching

    Courses I teach on

    Undergraduate

    My publications

    Publications

    In this paper, we address the problem of tracking an unknown object in 3D space. Online 2D tracking often fails for strong out-of-plane rotation which results in considerable changes in appearance beyond those that can be represented by online update strategies. However, by modelling and learning the 3D structure of the object explicitly, such effects are mitigated. To address this, a novel approach is presented, combining techniques from the fields of visual tracking, structure from motion (SfM) and simultaneous localisation and mapping (SLAM). This algorithm is referred to as TMAGIC (Tracking, Modelling And Gaussian-process Inference Combined). At every frame, point and line features are tracked in the image plane and are used, together with their 3D correspondences, to estimate the camera pose. These features are also used to model the 3D shape of the object as a Gaussian process. Tracking determines the trajectories of the object in both the image plane and 3D space, but the approach also provides the 3D object shape. The approach is validated on several video-sequences used in the tracking literature, comparing favourably to state-of-the-art trackers for simple scenes (error reduced by 22%) with clear advantages in the case of strong out-of-plane rotation, where 2D approaches fail (error reduction of 58%).
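    The TMAGIC abstract above mentions modelling the object's 3D shape as a Gaussian process. As a loose, self-contained illustration of that one ingredient (not the authors' implementation), the sketch below fits a zero-mean GP with a squared-exponential kernel to sparse synthetic 3D feature points and queries a dense grid to obtain a surface estimate with uncertainty; all data and hyper-parameters are invented.

```python
# Illustrative sketch only: minimal Gaussian-process regression over sparse 3D
# feature points, loosely in the spirit of the shape model described above.
import numpy as np

def rbf_kernel(A, B, length_scale=0.5, variance=1.0):
    """Squared-exponential kernel between two sets of 2D inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_fit_predict(X, y, Xq, noise=1e-3):
    """Posterior mean and variance of a zero-mean GP at query points Xq."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Kq = rbf_kernel(Xq, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Kq @ alpha
    v = np.linalg.solve(L, Kq.T)
    var = rbf_kernel(Xq, Xq).diagonal() - (v ** 2).sum(0)
    return mean, var

# Sparse "tracked" 3D points: (x, y) locations with an observed depth z.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 2))
z = np.sin(2 * X[:, 0]) * np.cos(2 * X[:, 1]) + 0.05 * rng.normal(size=40)

# Query a dense grid to obtain a continuous surface estimate with uncertainty.
g = np.linspace(-1, 1, 20)
Xq = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)
mean, var = gp_fit_predict(X, z, Xq)
print(mean.shape, var.min(), var.max())
```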
    Cooper H, Bowden R (2009) Sign Language Recognition: Working with Limited Corpora, UNIVERSAL ACCESS IN HUMAN-COMPUTER INTERACTION: APPLICATIONS AND SERVICES, PT III, 5616, pp. 472-481, SPRINGER-VERLAG BERLIN
    Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2016) Next-best stereo: extending next best view optimisation for collaborative sensors,Proceedings of BMVC 2016
    Most 3D reconstruction approaches passively optimise over all data, exhaustively matching pairs, rather than actively selecting data to process. This is costly both in terms of time and computer resources, and quickly becomes intractable for large datasets. This work proposes an approach to intelligently filter large amounts of data for 3D reconstructions of unknown scenes using monocular cameras. Our contributions are twofold: First, we present a novel approach to efficiently optimise the Next-Best View (NBV) in terms of accuracy and coverage using partial scene geometry. Second, we extend this to intelligently selecting stereo pairs by jointly optimising the baseline and vergence to find the NBV's best stereo pair to perform reconstruction. Both contributions are extremely efficient, taking 0.8ms and 0.3ms per pose, respectively. Experimental evaluation shows that the proposed method allows efficient selection of stereo pairs for reconstruction, such that a dense model can be obtained with only a small number of images. Once a complete model has been obtained, the remaining computational budget is used to intelligently refine areas of uncertainty, achieving results comparable to state-of-the-art batch approaches on the Middlebury dataset, using as little as 3.8% of the views.
    Efthimiou E, Fotinea S-E, Vogler C, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Segouat J (2009) Sign language recognition, generation, and modelling: A research effort with applications in deaf communication, Lecture Notes in Computer Science: Proceedings of 5th International Conference of Universal Access in Human-Computer Interaction. Addressing Diversity, Part 1, 5614, pp. 21-30, Springer
    Sign language and Web 2.0 applications are currently incompatible, because of the lack of anonymisation and easy editing of online sign language contributions. This paper describes Dicta-Sign, a project aimed at developing the technologies required for making sign language-based Web contributions possible, by providing an integrated framework for sign language recognition, animation, and language modelling. It targets four different European sign languages: Greek, British, German, and French. Expected outcomes are three showcase applications for a search-by-example sign language dictionary, a sign language-to-sign language translator, and a sign language-based Wiki.
    Koller O, Ney H, Bowden R (2016) Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition
    This work presents a new approach to learning a framebased classifier on weakly labelled sequence data by embedding a CNN within an iterative EM algorithm. This allows the CNN to be trained on a vast number of example images when only loose sequence level information is available for the source videos. Although we demonstrate this in the context of hand shape recognition, the approach has wider application to any video recognition task where frame level labelling is not available. The iterative EM algorithm leverages the discriminative ability of the CNN to iteratively refine the frame level annotation and subsequent training of the CNN. By embedding the classifier within an EM framework the CNN can easily be trained on 1 million hand images. We demonstrate that the final classifier generalises over both individuals and data sets. The algorithm is evaluated on over 3000 manually labelled hand shape images of 60 different classes which will be released to the community. Furthermore, we demonstrate its use in continuous sign language recognition on two publicly available large sign language data sets, where it outperforms the current state-of-the-art by a large margin. To our knowledge no previous work has explored expectation maximization without Gaussian mixture models to exploit weak sequence labels for sign language recognition.
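    As an illustration of the EM-style training loop described above (a toy stand-in only: a logistic-regression classifier replaces the CNN and synthetic 2D features replace hand images), the sketch below alternates between retraining the classifier on the current frame labels and re-estimating those labels under the weak sequence-level supervision.

```python
# Toy sketch of an EM-style loop over weakly labelled sequences. All data,
# thresholds and the choice of classifier are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_seq, frames_per_seq, n_classes = 30, 20, 3

# Synthetic data: each sequence carries one weak class label, but only ~60% of
# its frames actually belong to that class (the rest are transition/noise frames).
centres = rng.normal(scale=3.0, size=(n_classes, 2))
sequences, seq_labels = [], []
for s in range(n_seq):
    c = s % n_classes
    true = np.where(rng.random(frames_per_seq) < 0.6, c,
                    rng.integers(n_classes, size=frames_per_seq))
    sequences.append(centres[true] + rng.normal(size=(frames_per_seq, 2)))
    seq_labels.append(c)

X = np.concatenate(sequences)
frame_labels = np.repeat(seq_labels, frames_per_seq)   # initialise frames with weak labels

for it in range(5):
    clf = LogisticRegression(max_iter=500).fit(X, frame_labels)   # M-step: retrain classifier
    probs = clf.predict_proba(X).reshape(n_seq, frames_per_seq, -1)
    new_labels = []
    for s, weak in enumerate(seq_labels):                          # E-step: refine frame labels
        pred = clf.classes_[probs[s].argmax(axis=1)]
        conf = probs[s].max(axis=1)
        pred[conf < 0.5] = weak     # uncertain frames fall back to the weak sequence label
        new_labels.append(pred)
    frame_labels = np.concatenate(new_labels)

print("frames per refined class:", np.bincount(frame_labels))
```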
    KaewTrakulPong P, Bowden R (2003) A real time adaptive visual surveillance system for tracking low-resolution colour targets in dynamically changing scenes, IMAGE AND VISION COMPUTING, 21(10), pp. 913-929, ELSEVIER SCIENCE BV
    Oshin O, Gilbert A, Illingworth J, Bowden R (2008) Spatio-Temporal Feature Recognition using Randomised Ferns,The 1st International Workshop on Machine Learning for Vision-based Motion Analysis (MVLMA'08)
    Efthimiou E, Fotinea SE, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Lefebvre-Albaret F (2012) The dicta-sign Wiki: Enabling web communication for the deaf, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7383 LNCS (Part 2), pp. 205-212
    The paper provides a report on the user-centred showcase prototypes of the DICTA-SIGN project (http://www.dictasign.eu/), an FP7-ICT project which ended in January 2012. DICTA-SIGN researched ways to enable communication between Deaf individuals through the development of human-computer interfaces (HCI) for Deaf users, by means of Sign Language. Emphasis is placed on the Sign-Wiki prototype that demonstrates the potential of sign languages to participate in contemporary Web 2.0 applications where user contributions are editable by an entire community and sign language users can benefit from collaborative editing facilities. © 2012 Springer-Verlag.
    Pugeault N, Bowden R (2011) Driving me Around the Bend: Learning to Drive from Visual Gist, 2011 IEEE International Conference on Computer Vision, pp. 1022-1029, IEEE
    This article proposes an approach to learning steering and road following behaviour from a human driver using holistic visual features. We use a random forest (RF) to regress a mapping between these features and the driver's actions, and propose an alternative to classical random forest regression based on the Medoid (RF-Medoid), that reduces the underestimation of extreme control values. We compare prediction performance using different holistic visual descriptors: GIST, Channel-GIST (C-GIST) and Pyramidal-HOG (P-HOG). The proposed methods are evaluated on two different datasets: predicting human behaviour on countryside roads and also for autonomous control of a robot on an indoor track. We show that 1) C-GIST leads to the best predictions on both sequences, and 2) RF-Medoid leads to a better estimation of extreme values, where a classical RF tends to under-steer. We use around 10% of the data for training and show excellent generalization over a dataset of thousands of images. Importantly, we do not engineer the solution but instead use machine learning to automatically identify the relationship between visual features and behaviour, providing an efficient, generic solution to autonomous control.
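    The RF-Medoid idea above can be illustrated with a small sketch: a standard random-forest regressor is trained, but prediction returns the medoid of the training targets pooled from the query's leaves rather than their mean, which is one way to avoid averaging away extreme steering values. This is an illustrative reading, not the authors' code; features, targets and parameters are synthetic.

```python
# Hedged sketch of "medoid instead of mean" prediction for random-forest regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(500, 8))                    # stand-in holistic "gist" features
y = np.tanh(3 * X[:, 0]) + 0.1 * rng.normal(size=500)    # stand-in steering signal

forest = RandomForestRegressor(n_estimators=25, min_samples_leaf=5, random_state=0).fit(X, y)
train_leaves = forest.apply(X)                            # (n_train, n_trees) leaf ids

def predict_medoid(forest, train_leaves, y_train, Xq):
    query_leaves = forest.apply(Xq)
    out = np.empty(len(Xq))
    for i, leaves in enumerate(query_leaves):
        # Pool training targets that share a leaf with the query in any tree.
        mask = (train_leaves == leaves).any(axis=1)
        targets = y_train[mask]
        # Medoid: the pooled target minimising total absolute distance to the rest.
        dists = np.abs(targets[:, None] - targets[None, :]).sum(axis=1)
        out[i] = targets[dists.argmin()]
    return out

Xq = rng.uniform(-1, 1, size=(5, 8))
print("mean-based  :", forest.predict(Xq))
print("medoid-based:", predict_medoid(forest, train_leaves, y, Xq))
```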
    Krejov P, Gilbert A, Bowden R (2015) Combining Discriminative and Model Based Approaches for Hand Pose Estimation, 2015 11TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), VOL. 2 IEEE
    Gilbert A, Illingworth J, Bowden R (2008) Scale Invariant Action Recognition Using Compound Features Mined from Dense Spatio-temporal Corners, Lecture Notes in Computer Science: Proceedings of 10th European Conference on Computer Vision (Part 1), 5302, pp. 222-233, Springer
    The use of sparse invariant features to recognise classes of actions or objects has become common in the literature. However, features are often 'engineered' to be both sparse and invariant to transformation and it is assumed that they provide the greatest discriminative information. To tackle activity recognition, we propose learning compound features that are assembled from simple 2D corners in both space and time. Each corner is encoded in relation to its neighbours and from an over-complete set (in excess of 1 million possible features), compound features are extracted using data mining. The final classifier, consisting of sets of compound features, can then be applied to recognise and localise an activity in real-time while providing superior performance to other state-of-the-art approaches (including those based upon sparse feature detectors). Furthermore, the approach requires only weak supervision in the form of class labels for each training sequence. No ground truth position or temporal alignment is required during training.
    Bowden R (2004) Progress in sign and gesture recognition, ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS, 3179, pp. 13-13, SPRINGER-VERLAG BERLIN
    Hadfield SJ, Bowden R, Lebeda K (2016) The Visual Object Tracking VOT2016 Challenge Results, Lecture Notes in Computer Science, 9914, pp. 777-823, Springer
    The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers having been published at major computer vision conferences and journals in recent years. The number of tested state-of-the-art trackers makes VOT2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).
    Bowden R, Sarhadi M (2000) Building Temporal Models for Gesture Recognition, Proceedings of BMVC 2000 - The Eleventh British Machine Vision Conference BMVA (British Machine Vision Association)
    This work presents a piecewise linear approximation to non-linear Point Distribution Models for modelling the human hand. The work utilises the natural segmentation of shape space, inherent to the technique, to apply temporal constraints which can be used with CONDENSATION to support multiple hypotheses and quantum leaps through shape space. This paper presents a novel method by which the one-state transitions of the English Language are projected into shape space for tracking and model prediction using a HMM like approach.
    Okwechime D, Ong E-J, Bowden R (2011) MIMiC: Multimodal Interactive Motion Controller, IEEE Transactions on Multimedia, 13(2), pp. 255-265, IEEE
    We introduce a new algorithm for real-time interactive motion control and demonstrate its application to motion captured data, prerecorded videos, and HCI. Firstly, a data set of frames are projected into a lower dimensional space. An appearance model is learnt using a multivariate probability distribution. A novel approach to determining transition points is presented based on k-medoids, whereby appropriate points of intersection in the motion trajectory are derived as cluster centers. These points are used to segment the data into smaller subsequences. A transition matrix combined with a kernel density estimation is used to determine suitable transitions between the subsequences to develop novel motion. To facilitate real-time interactive control, conditional probabilities are used to derive motion given user commands. The user commands can come from any modality including auditory, touch, and gesture. The system is also extended to HCI using audio signals of speech in a conversation to trigger nonverbal responses from a synthetic listener in real-time. We demonstrate the flexibility of the model by presenting results ranging from data sets composed of vectorized images, 2-D, and 3-D point representations. Results show real-time interaction and plausible motion generation between different types of movement.
    Shaukat A, Gilbert A, Windridge D, Bowden R (2012) Meeting in the Middle: A top-down and bottom-up approach to detect pedestrians, Proceedings - International Conference on Pattern Recognition, pp. 874-877
    This paper proposes a generic approach combining a bottom-up (low-level) visual detector with a top-down (high-level) fuzzy first-order logic (FOL) reasoning framework in order to detect pedestrians from a moving vehicle. Detections from the low-level visual corner based detector are fed into the logical reasoning framework as logical facts. A set of FOL clauses utilising fuzzy predicates with piecewise linear continuous membership functions associates a fuzzy confidence (a degree-of-truth) to each detector input. Detections associated with lower confidence functions are deemed as false positives and blanked out, thus adding top-down constraints based on global logical consistency of detections. We employ a state of the art visual detector on a challenging pedestrian detection dataset, and demonstrate an increase in detection performance when used in a framework that combines bottom-up detections with (fuzzy FOL-based) top-down constraints. © 2012 ICPR Org Committee.
    Oshin O, Gilbert A, Bowden R (2011) Capturing the relative distribution of features for action recognition, 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, pp. 111-116, IEEE
    This paper presents an approach to the categorisation of spatio-temporal activity in video, which is based solely on the relative distribution of feature points. Introducing a Relative Motion Descriptor for actions in video, we show that the spatio-temporal distribution of features alone (without explicit appearance information) effectively describes actions, and demonstrate performance consistent with state-of-the-art. Furthermore, we propose that for actions where noisy examples exist, it is not optimal to group all action examples as a single class. Therefore, rather than engineering features that attempt to generalise over noisy examples, our method follows a different approach: We make use of Random Sampling Consensus (RANSAC) to automatically discover and reject outlier examples within classes. We evaluate the Relative Motion Descriptor and outlier rejection approaches on four action datasets, and show that outlier rejection using RANSAC provides a consistent and notable increase in performance, and demonstrate superior performance to more complex multiple-feature based approaches.
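    The outlier-rejection step described above can be sketched as a RANSAC-style consensus loop over the training examples of a single action class: repeatedly fit a trivial model (here just a centroid with a distance threshold) to a random sample and keep the largest consensus set. The data, model and threshold below are invented for illustration; the original work operates on Relative Motion Descriptors.

```python
# Illustrative sketch (not the authors' code) of RANSAC-style rejection of
# outlier training examples within one action class.
import numpy as np

rng = np.random.default_rng(3)
inliers = rng.normal(loc=0.0, scale=0.5, size=(80, 16))    # coherent examples
outliers = rng.normal(loc=3.0, scale=2.0, size=(20, 16))   # mislabelled / noisy examples
X = np.concatenate([inliers, outliers])

def ransac_inliers(X, n_iters=200, sample_size=10, threshold=3.0, rng=rng):
    best_mask, best_count = None, -1
    for _ in range(n_iters):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        centroid = X[idx].mean(axis=0)            # fit a trivial model to the random sample
        dists = np.linalg.norm(X - centroid, axis=1)
        mask = dists < threshold                  # consensus set for this hypothesis
        if mask.sum() > best_count:
            best_mask, best_count = mask, mask.sum()
    return best_mask

mask = ransac_inliers(X)
print(f"kept {mask.sum()} of {len(X)} examples as the dominant mode")
```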
    Cooper H, Bowden R (2009) Learning Signs from Subtitles: A Weakly Supervised Approach to Sign Language Recognition, CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, pp. 2560-2566, IEEE
    Ong E-J, Ellis L, Bowden R (2009) Problem solving through imitation, IMAGE AND VISION COMPUTING, 27(11), pp. 1715-1728, ELSEVIER SCIENCE BV
    Bowden R (2000) Learning Statistical Models of Human Motion,Proceedings of CVPR 2000 - IEEE Workshop on Human Modeling, Analysis and Synthesis IEEE
    Non-linear statistical models of deformation provide methods to learn a priori shape and deformation for an object or class of objects by example. This paper extends these models of deformation to motion by augmenting the discrete representation of piecewise non-linear principal component analysis of shape with a Markov chain which represents the temporal dynamics of the model. In this manner, mean trajectories can be learnt and reproduced for either the simulation of movement or for object tracking. This paper demonstrates the use of these techniques in learning human motion from capture data.
    Ong EJ, Lan Y, Theobald BJ, Harvey R, Bowden R (2009) Robust Facial Feature Tracking using Selected Multi-Resolution Linear Predictors, pp. 1483-1490
    Sheerman-Chase T, Ong E-J, Pugeault N, Bowden R (2013) Improving Recognition and Identification of Facial Areas Involved in Non-verbal Communication by Feature Selection,Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on
    Meaningful Non-Verbal Communication (NVC) signals can be recognised by facial deformations based on video tracking. However, the geometric features previously used contain a significant amount of redundant or irrelevant information. A feature selection method is described for selecting a subset of features that improves performance and allows for the identification and visualisation of facial areas involved in NVC. The feature selection is based on a sequential backward elimination of features to find an effective subset of components. This results in a significant improvement in recognition performance, as well as providing evidence that brow lowering is involved in questioning sentences. The improvement in performance is a step towards a more practical automatic system and the facial areas identified provide some insight into human behaviour.
    Ong E, Bowden R (2011) Learning Sequential Patterns for Lipreading, Proceedings of the 22nd British Machine Vision Conference, pp. 55.1-55.10, BMVA Press
    This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search of optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set's boosting weight distribution. This temporal search method is then integrated into the boosting framework resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state-of-the-art recognition performance, using only a small set of sequential patterns.
    Gupta A, Bowden R (2012) Fuzzy encoding for image classification using Gustafson-Kessel algorithm, IEEE PES Innovative Smart Grid Technologies Conference Europe, pp. 3137-3140
    This paper presents a novel adaptation of fuzzy clustering and feature encoding for image classification. Visual word ambiguity has recently been successfully modeled by kernel codebooks to provide improvement in classification performance over the standard 'Bag-of-Features' (BoF) approach, which uses hard partitioning and crisp logic for assignment of features to visual words. Motivated by this progress we utilize fuzzy logic to model the ambiguity and combine it with clustering to discover fuzzy visual words. The feature descriptors of an image are encoded using the learned fuzzy membership function associated with each word. The codebook built using this fuzzy encoding technique is demonstrated to provide superior performance over BoF. We use the Gustafson-Kessel algorithm which is an improvement over Fuzzy C-Means clustering and can adapt to local distributions. We evaluate our approach on several popular datasets and demonstrate that it consistently provides superior performance to the BoF approach. © 2012 IEEE.
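    To make the contrast with hard Bag-of-Features assignment concrete, the sketch below encodes an image's descriptors both ways: a one-hot histogram over nearest visual words versus a soft histogram built from fuzzy memberships. For brevity the codebook is built with plain k-means and memberships use the standard fuzzy-c-means formula with Euclidean distances; the Gustafson-Kessel variant used in the paper would additionally adapt a covariance per cluster. All data are synthetic.

```python
# Minimal sketch contrasting hard BoF assignment with fuzzy (soft) encoding.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
descriptors = rng.normal(size=(2000, 32))     # stand-in local descriptors for codebook training
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(descriptors).cluster_centers_

def encode(image_descs, codebook, fuzzifier=2.0):
    d = np.linalg.norm(image_descs[:, None, :] - codebook[None, :, :], axis=2) + 1e-12
    # Hard assignment: one-hot histogram over the nearest visual word.
    hard = np.bincount(d.argmin(axis=1), minlength=len(codebook)).astype(float)
    # Fuzzy assignment: each descriptor spreads membership over all words.
    u = d ** (-2.0 / (fuzzifier - 1.0))
    u /= u.sum(axis=1, keepdims=True)
    soft = u.sum(axis=0)
    return hard / hard.sum(), soft / soft.sum()

image_descs = rng.normal(size=(300, 32))      # descriptors of one "image"
hard_hist, soft_hist = encode(image_descs, codebook)
print("hard:", np.round(hard_hist, 3))
print("soft:", np.round(soft_hist, 3))
```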
    Bowden R, Cox SJ, Harvey RW, Lan Y, Ong EJ, Owen G, Theobald BJ (2012) Is automated conversion of video to text a reality?, Proceedings of SPIE - The International Society for Optical Engineering, 8546
    A recent trend in law enforcement has been the use of Forensic lip-readers. Criminal activities are often recorded on CCTV or other video gathering systems. Knowledge of what suspects are saying enriches the evidence gathered but lip-readers, by their own admission, are fallible so, based on long term studies of automated lip-reading, we are investigating the possibilities and limitations of applying this technique under realistic conditions. We have adopted a step-by-step approach and are developing a capability when prior video information is available for the suspect of interest. We use the terminology video-to-text (V2T) for this technique by analogy with speech-to-text (S2T) which also has applications in security and law-enforcement. © 2012 SPIE.
    Windridge D, Bowden R, Kittler J (2004) A General Strategy for Hidden Markov Chain Parameterisation in Composite Feature-Spaces, SSPR/SPR, pp. 1069-1077
    Hadfield S, Lebeda K, Bowden R (2014) Natural action recognition using invariant 3D motion encoding, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8690 LNCS (Part 2), pp. 758-771
    We investigate the recognition of actions "in the wild" using 3D motion information. The lack of control over (and knowledge of) the camera configuration, exacerbates this already challenging task, by introducing systematic projective inconsistencies between 3D motion fields, hugely increasing intra-class variance. By introducing a robust, sequence based, stereo calibration technique, we reduce these inconsistencies from fully projective to a simple similarity transform. We then introduce motion encoding techniques which provide the necessary scale invariance, along with additional invariances to changes in camera viewpoint. On the recent Hollywood 3D natural action recognition dataset, we show improvements of 40% over previous state-of-the-art techniques based on implicit motion encoding. We also demonstrate that our robust sequence calibration simplifies the task of recognising actions, leading to recognition rates 2.5 times those for the same technique without calibration. In addition, the sequence calibrations are made available. © 2014 Springer International Publishing.
    Gupta A, Bowden R (2012) Unity in diversity: Discovering topics from words: Information theoretic co-clustering for visual categorization, VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications, 1, pp. 628-633
    This paper presents a novel approach to learning a codebook for visual categorization, that resolves the key issue of intra-category appearance variation found in complex real world datasets. The codebook of visual-topics (semantically equivalent descriptors) is made by grouping visual-words (syntactically equivalent descriptors) that are scattered in feature space. We analyze the joint distribution of images and visual-words using information theoretic co-clustering to discover visual-topics. Our approach is compared with the standard 'Bag-of-Words' approach. The statistically significant performance improvement in all the datasets utilized (Pascal VOC 2006; VOC 2007; VOC 2010; Scene-15) establishes the efficacy of our approach.
    Gilbert A, Bowden R (2006) Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity, Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 2, 3952, pp. 125-136, Springer
    This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method, to model both the colour variations and posterior probability distributions of spatio-temporal links between cameras. These operate in parallel and are then used with an appearance model of the object to track across spatially separated cameras. The approach requires no pre-calibration or batch preprocessing, is completely unsupervised, and becomes more accurate over time as evidence is accumulated.
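    A loose illustration of the incremental link-learning idea: for every pair of (exit region, entry region) the system accumulates a histogram of reappearance delays, and temporally linked regions reveal themselves as a sharp peak rising above the noise floor. Region ids, delays and traffic statistics below are invented; the paper additionally learns inter-camera colour calibration, which is omitted here.

```python
# Sketch of incrementally accumulating evidence for spatio-temporal links
# between camera regions; all numbers are synthetic.
import numpy as np

n_regions, max_delay = 6, 30
delay_hist = np.zeros((n_regions, n_regions, max_delay))

def observe(exit_region, entry_region, delay):
    """Incrementally update the evidence for a link between two regions."""
    if delay < max_delay:
        delay_hist[exit_region, entry_region, delay] += 1

rng = np.random.default_rng(8)
# Simulated traffic: region 2 genuinely feeds region 5 after ~8 frames,
# everything else is unstructured noise.
for _ in range(2000):
    if rng.random() < 0.3:
        observe(2, 5, int(np.clip(rng.normal(8, 1.5), 0, max_delay - 1)))
    else:
        observe(rng.integers(n_regions), rng.integers(n_regions), rng.integers(max_delay))

# A link is "discovered" when one delay bin rises well above the mean bin count.
scores = delay_hist.max(axis=2) / (delay_hist.mean(axis=2) + 1e-9)
i, j = np.unravel_index(scores.argmax(), scores.shape)
print(f"strongest link: region {i} -> region {j}, peak delay {delay_hist[i, j].argmax()} frames")
```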
    Gilbert A, Illingworth J, Bowden R, Capitan J, Merino L (2009) Accurate fusion of robot, camera and wireless sensors for surveillance applications, IEEE 12th International Conference on Computer Vision Workshops, pp. 1290-1297, IEEE
    Often, within the field of tracking people, only fixed cameras are used. This can mean that when the illumination of the image changes or object occlusion occurs, the tracking can fail. We propose an approach that uses three simultaneous separate sensors. The fixed surveillance cameras track objects of interest across cameras through incrementally learning relationships between regions on the image. Cameras and laser rangefinder sensors onboard robots also provide an estimate of the person's position. Moreover, the signal strength of mobile devices carried by the person can be used to estimate his position. The estimates from all these sources are then combined using data fusion to provide an increase in performance. We present results of the fixed camera based tracking operating in real time on a large outdoor environment of over 20 non-overlapping cameras. Moreover, the tracking algorithms for robots and wireless nodes are described. A decentralized data fusion algorithm for combining all this information is presented.
    Bowden R, Mitchel TA, Sarhadi M (1997) Real-time Dynamic Deformable Meshes for Volumetric Segmentation and Visualisation, BMVC97 Electronic Proceedings of the Eighth British Machine Vision Conference, 1, pp. 310-319
    This paper presents a surface segmentation method which uses a simulated inflating balloon model to segment surface structure from volumetric data using a triangular mesh. The model employs simulated surface tension and an inflationary force to grow from within an object and find its boundary. Mechanisms are described which allow both evenly spaced and minimal polygonal count surfaces to be generated. The work is based on inflating balloon models by Terzopoulos [8]. Simplifications are made to the model, and an approach proposed which provides a technique robust to noise regardless of the feature detection scheme used. The proposed technique uses no explicit attraction to data features, and as such is less dependent on the initialisation of the model and parameters. The model grows under its own forces, and is never anchored to boundaries, but instead constrained to remain inside the desired object. Results are presented which demonstrate the technique's ability and speed at the segmentation of a complex, concave object with narrow features.
    Marter M, Hadfield S, Bowden R (2014) Friendly faces: Weakly supervised character identification, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8912, pp. 121-132
    © Springer International Publishing Switzerland 2015. This paper demonstrates a novel method for automatically discovering and recognising characters in video without any labelled examples or user intervention. Instead weak supervision is obtained via a rough script-to-subtitle alignment. The technique uses pose invariant features, extracted from detected faces and clustered to form groups of co-occurring characters. Results show that with 9 characters, 29% of the closest exemplars are correctly identified, increasing to 50% as additional exemplars are considered.
    Elliott R, Cooper HM, Ong EJ, Glauert J, Bowden R, Lefebvre-Albaret F (2011) Search-By-Example in Multilingual Sign Language Databases,
    We describe a prototype Search-by-Example or look-up tool for signs, based on a newly developed 1000-concept sign lexicon for four national sign languages (GSL, DGS, LSF, BSL), which includes a spoken language gloss, a HamNoSys description, and a video for each sign. The look-up tool combines an interactive sign recognition system, supported by Kinect technology, with a real-time sign synthesis system, using a virtual human signer, to present results to the user. The user performs a sign to the system and is presented with animations of signs recognised as similar. The user also has the option to view any of these signs performed in the other three sign languages. We describe the supporting technology and architecture for this system, and present some preliminary evaluation results.
    Cooper H, Bowden R (2010) Sign Language Recognition using Linguistically Derived Sub-Units, Proceedings of 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, pp. 57-61, European Language Resources Association (ELRA)
    This work proposes to learn linguistically-derived sub-unit classifiers for sign language. The responses of these classifiers can be combined by Markov models, producing efficient sign-level recognition. Tracking is used to create vectors of hand positions per frame as inputs for sub-unit classifiers learnt using AdaBoost. Grid-like classifiers are built around specific elements of the tracking vector to model the placement of the hands. Comparative classifiers encode the positional relationship between the hands. Finally, binary-pattern classifiers are applied over the tracking vectors of multiple frames to describe the motion of the hands. Results for the sub-unit classifiers in isolation are presented, reaching averages over 90%. Using a simple Markov model to combine the sub-unit classifiers allows sign level classification giving an average of 63%, over a 164 sign lexicon, with no grammatical constraints.
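    The second stage described above (combining per-frame sub-unit classifier responses with a simple Markov model per sign) can be sketched with a forward recursion: each candidate sign has its own initial and transition distributions over sub-units, classifier posteriors act as emission scores, and the sign with the highest sequence log-likelihood wins. The sub-unit classifiers are replaced by random posteriors here; everything is invented for illustration.

```python
# Sketch of sign-level recognition by combining per-frame sub-unit scores with
# a Markov model per sign; data and models are synthetic.
import numpy as np

rng = np.random.default_rng(5)
n_subunits, n_frames, n_signs = 6, 40, 4

# Per-frame posteriors over sub-units, e.g. from boosted sub-unit classifiers.
frame_probs = rng.dirichlet(np.ones(n_subunits), size=n_frames)

# One Markov model per sign: initial distribution + transition matrix over sub-units.
signs = []
for _ in range(n_signs):
    pi = rng.dirichlet(np.ones(n_subunits))
    A = rng.dirichlet(np.ones(n_subunits), size=n_subunits)
    signs.append((pi, A))

def log_likelihood(frame_probs, pi, A):
    """Forward recursion treating classifier posteriors as emission scores."""
    alpha = np.log(pi) + np.log(frame_probs[0])
    for t in range(1, len(frame_probs)):
        alpha = np.logaddexp.reduce(alpha[:, None] + np.log(A), axis=0) + np.log(frame_probs[t])
    return np.logaddexp.reduce(alpha)

scores = [log_likelihood(frame_probs, pi, A) for pi, A in signs]
print("recognised sign:", int(np.argmax(scores)))
```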
    Hadfield S, Bowden R (2014) Scene Flow Estimation using Intelligent Cost Functions,Proceedings of the British Conference on Machine Vision (BMVC) BMVA
    Motion estimation algorithms are typically based upon the assumption of brightness constancy or related assumptions such as gradient constancy. This manuscript evaluates several common cost functions from the motion estimation literature, which embody these assumptions. We demonstrate that such assumptions break for real world data, and the functions are therefore unsuitable. We propose a simple solution, which significantly increases the discriminatory ability of the metric, by learning a nonlinear relationship using techniques from machine learning. Furthermore, we demonstrate how context and a nonlinear combination of metrics can provide additional gains, demonstrating a 44% improvement in the performance of a state-of-the-art scene flow estimation technique. In addition, smaller gains of 20% are demonstrated in optical flow estimation tasks.
    Bowden R, KaewTraKulPong P (2005) Towards automated wide area visual surveillance: tracking objects between spatially-separated, uncalibrated views, IEE PROCEEDINGS-VISION IMAGE AND SIGNAL PROCESSING, 152(2), pp. 213-223, IEE-INST ELEC ENG
    Ong EJ, Bowden R (2011) Robust Facial Feature Tracking Using Shape-Constrained Multi-Resolution Selected Linear Predictors, IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), pp. 1844-1859, IEEE Computer Society
    This paper proposes a learnt data-driven approach for accurate, real-time tracking of facial features using only intensity information, a non-trivial task since the face is a highly deformable object with large textural variations and motion in certain regions. The framework proposed here largely avoids the need for a priori design of feature trackers by automatically identifying the optimal visual support required for tracking a single facial feature point. This is essentially equivalent to automatically determining the visual context required for tracking. Tracking is achieved via biased linear predictors which provide a fast and effective method for mapping pixel-intensities into tracked feature position displacements. Multiple linear predictors are grouped into a rigid flock to increase robustness. To further improve tracking accuracy, a novel probabilistic selection method is used to identify relevant visual areas for tracking a feature point. These selected flocks are then combined into a hierarchical multi-resolution LP model. Finally, we also exploit a simple shape constraint for correcting the occasional tracking failure of a minority of feature points. Experimental results also show that this method performs more robustly and accurately than AAMs, on example sequences that range from SD quality to YouTube quality.
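    The core of the linear-predictor tracking described above is a matrix learnt by least squares that maps a vector of sampled intensity differences to a feature-point displacement. The sketch below shows that single ingredient on a smooth synthetic "image" (no flocks, selection or shape constraint); support points, displacements and the intensity function are all invented.

```python
# Minimal, self-contained sketch of a learnt linear predictor for feature tracking.
import numpy as np

rng = np.random.default_rng(6)

def image(p):
    """Smooth synthetic intensity function standing in for a real image."""
    x, y = p
    return np.sin(0.3 * x) + np.cos(0.2 * y) + 0.05 * np.sin(0.7 * x + 0.5 * y)

feature = np.array([10.0, 15.0])
support = feature + rng.uniform(-5, 5, size=(40, 2))        # support-pixel locations
ref = np.array([image(s) for s in support])                 # reference intensities

# Training: random known displacements -> intensity differences at the support.
displacements = rng.uniform(-3, 3, size=(400, 2))
diffs = np.array([[image(s + d) for s in support] for d in displacements]) - ref

# Least-squares linear predictor: displacement ~= intensity_difference @ P.
P, *_ = np.linalg.lstsq(diffs, displacements, rcond=None)

# Tracking step: observe intensities after an unknown shift and predict it.
true_shift = np.array([1.5, -2.0])
obs = np.array([image(s + true_shift) for s in support]) - ref
print("predicted shift:", obs @ P, "true shift:", true_shift)
```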
    Holt B, Ong EJ, Bowden R (2013) Accurate static pose estimation combining direct regression and geodesic extrema, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
    Human pose estimation in static images has received significant attention recently but the problem remains challenging. Using data acquired from a consumer depth sensor, our method combines a direct regression approach for the estimation of rigid body parts with the extraction of geodesic extrema to find extremities. We show how these approaches are complementary and present a novel approach to combine the results, resulting in an improvement over the state-of-the-art. We report and compare our results on a new dataset of aligned RGB-D pose sequences which we release as a benchmark for further evaluation. © 2013 IEEE.
    Dowson N, Kadir T, Bowden R (2008) Estimating the joint statistics of images using Nonparametric Windows with application to registration using Mutual Information, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 30(10), pp. 1841-1857, IEEE COMPUTER SOC
    Recently, Nonparametric (NP) Windows has been proposed to estimate the statistics of real 1D and 2D signals. NP Windows is accurate because it is equivalent to sampling images at a high (infinite) resolution for an assumed interpolation model. This paper extends the proposed approach to consider joint distributions of image pairs. Second, Green's Theorem is used to simplify the previous NP Windows algorithm. Finally, a resolution-aware NP Windows algorithm is proposed to improve robustness to relative scaling between an image pair. Comparative testing of 2D image registration was performed using translation only and affine transformations. Although it is more expensive than other methods, NP Windows frequently demonstrated superior performance for bias (distance between ground truth and global maximum) and frequency of convergence. Unlike other methods, the number of samples and the number of bins have little effect on NP Windows and the prior selection of a kernel is not required.
    Moore S, Ong EJ, Bowden R (2010) Facial Expression Recognition using Spatiotemporal Boosted Discriminatory Classifiers, 6111/2010, pp. 405-414
    Okwechime D, Ong E-J, Gilbert A, Bowden R (2011) Visualisation and prediction of conversation interest through mined social signals, 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, pp. 951-956, IEEE
    This paper introduces a novel approach to social behaviour recognition governed by the exchange of non-verbal cues between people. We conduct experiments to try and deduce distinct rules that dictate the social dynamics of people in a conversation, and utilise semi-supervised computer vision techniques to extract their social signals such as laughing and nodding. Data mining is used to deduce frequently occurring patterns of social trends between a speaker and listener in both interested and not interested social scenarios. The confidence values from rules are utilised to build a Social Dynamic Model (SDM), that can then be used for classification and visualisation. By visualising the rules generated in the SDM, we can analyse distinct social trends between an interested and not interested listener in a conversation. Results show that these distinctions can be applied generally and used to accurately predict conversational interest.
    Gilbert A, Bowden R (2011) iGroup: Weakly supervised image and video grouping, 2011 IEEE International Conference on Computer Vision, pp. 2166-2173, IEEE
    We present a generic, efficient and iterative algorithm for interactively clustering classes of images and videos. The approach moves away from the use of large hand labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of 'seed' examples. Two efficient data mining tools originally developed for text analysis, min-Hash and APriori, are used and extended to achieve both speed and scalability on large image and video datasets. Inspired by the Bag-of-Words (BoW) architecture, the idea of an image signature is introduced as a simple descriptor on which nearest neighbour classification can be performed. The image signature is then dynamically expanded to identify common features amongst samples of the same class. The iterative approach uses APriori to identify common and distinctive elements of a small set of labelled true and false positive signatures. These elements are then accentuated in the signature to increase similarity between examples and 'pull' positive classes together. By repeating this process, the accuracy of similarity increases dramatically despite only a few training examples; only 10% of the labelled groundtruth is needed, compared to other approaches. It is tested on two image datasets including the caltech101 [9] dataset and on three state-of-the-art action recognition datasets. On the YouTube [18] video dataset the accuracy increases from 72% to 97% using only 44 labelled examples from a dataset of over 1200 videos. The approach is both scalable and efficient, with an iteration on the full YouTube dataset taking around 1 minute on a standard desktop machine.
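    The min-Hash component referred to above can be sketched on its own: each image is reduced to a set of visual-word ids, a short min-hash signature is computed per image, and the fraction of agreeing signature positions estimates Jaccard set similarity, which makes near-duplicate grouping cheap. The word sets are synthetic and the APriori signature-expansion stage is omitted.

```python
# Sketch of min-hash signatures over sets of visual-word ids; data are synthetic.
import numpy as np

rng = np.random.default_rng(7)
vocab_size, n_hashes = 1000, 64

# Random hash functions h(x) = (a*x + b) mod p, one per signature position.
p = 10007
a = rng.integers(1, p, size=n_hashes)
b = rng.integers(0, p, size=n_hashes)

def minhash(word_set):
    words = np.fromiter(word_set, dtype=np.int64)
    return ((a[:, None] * words[None, :] + b[:, None]) % p).min(axis=1)

def estimated_jaccard(sig1, sig2):
    return float((sig1 == sig2).mean())

# Two "similar" images sharing most visual words, and one unrelated image.
base = set(rng.choice(vocab_size, size=80, replace=False).tolist())
img_a = base | set(rng.choice(vocab_size, size=10).tolist())
img_b = base | set(rng.choice(vocab_size, size=10).tolist())
img_c = set(rng.choice(vocab_size, size=90, replace=False).tolist())

sa, sb, sc = minhash(img_a), minhash(img_b), minhash(img_c)
print("a~b:", estimated_jaccard(sa, sb), " a~c:", estimated_jaccard(sa, sc))
```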
    Ong E, Pugeault N, Gilbert A, Bowden R (2016) Learning multi-class discriminative patterns using episode-trees,7th International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2016) International Academy, Research, and Industry Association (IARIA)
    In this paper, we aim to tackle the problem of recognising temporal sequences in the context of a multi-class problem. In the past, the representation of sequential patterns was used for modelling discriminative temporal patterns for different classes. Here, we have improved on this by using the more general representation of episodes, of which sequential patterns are a special case. We then propose a novel tree structure called a MultI-Class Episode Tree (MICE-Tree) that allows one to simultaneously model a set of different episodes in an efficient manner whilst providing labels for them. A set of MICE-Trees are then combined together into a MICE-Forest that is learnt in a Boosting framework. The result is a strong classifier that utilises episodes for performing classification of temporal sequences. We also provide experimental evidence showing that the MICE-Trees allow for a more compact and efficient model compared to sequential patterns. Additionally, we demonstrate the accuracy and robustness of the proposed method in the presence of different levels of noise and class labels.
    Hadfield SJ, Lebeda K, Bowden R (2014) The Visual Object Tracking VOT2014 challenge results,
    The Visual Object Tracking challenge 2014, VOT2014, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 38 trackers are presented. The number of tested trackers makes VOT 2014 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2014 challenge that go beyond its VOT2013 predecessor are introduced: (i) a new VOT2014 dataset with full annotation of targets by rotated bounding boxes and per-frame attribute, (ii) extensions of the VOT2013 evaluation methodology, (iii) a new unit for tracking speed assessment less dependent on the hardware and (iv) the VOT2014 evaluation toolkit that significantly speeds up execution of experiments. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.
    Lebeda K, Hadfield S, Matas J, Bowden R (2015) Texture-Independent Long-Term Tracking Using Virtual Corners, IEEE TRANSACTIONS ON IMAGE PROCESSING, 25(1), pp. 359-371, IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
    Cooper H, Bowden R (2007) Large lexicon detection of sign language, HUMAN-COMPUTER INTERACTION, PROCEEDINGS, 4796, pp. 88-97, SPRINGER-VERLAG BERLIN
    Moore S, Bowden R (2011) Local binary patterns for multi-view facial expression recognition, Computer Vision and Image Understanding, 115(4), pp. 541-558, Elsevier
    Research into facial expression recognition has predominantly been applied to face images at frontal view only. Some attempts have been made to produce pose invariant facial expression classifiers. However, most of these attempts have only considered yaw variations of up to 45°, where all of the face is visible. Little work has been carried out to investigate the intrinsic potential of different poses for facial expression recognition. This is largely due to the databases available, which typically capture frontal view face images only. Recent databases, BU3DFE and Multi-PIE, allow empirical investigation of facial expression recognition for different viewing angles. A sequential two-stage approach is taken for pose classification and view-dependent facial expression classification to investigate the effects of yaw variations from frontal to profile views. Local binary patterns (LBPs) and variations of LBPs as texture descriptors are investigated. Such features allow investigation of the influence of orientation and multi-resolution analysis for multi-view facial expression recognition. The influence of pose on different facial expressions is investigated. Other factors are investigated including resolution and construction of global and local feature vectors. An appearance based approach is adopted by dividing images into sub-blocks coarsely aligned over the face. Feature vectors contain concatenated feature histograms built from each sub-block. Multi-class support vector machines are adopted to learn pose and pose dependent facial expression classifiers.
    Bowden R, Gilbert A, KaewTraKulPong P (2006) Tracking Objects Across Uncalibrated Arbitrary Topology Camera Networks, In: Velastin S, Remagnino P (eds.), Intelligent Distributed Video Surveillance Systems, 6, pp. 157-182, Institution of Engineering and Technology
    Intelligent visual surveillance is an important application area for computer vision. In situations where networks of hundreds of cameras are used to cover a wide area, the obvious limitation becomes the users' ability to manage such vast amounts of information. For this reason, automated tools that can generalise about activities or track objects are important to the operator. Key to the users' requirements is the ability to track objects across (spatially separated) camera scenes. However, extensive geometric knowledge about the site and camera position is typically required. Such an explicit mapping from camera to world is infeasible for large installations as it requires that the operator know which camera to switch to when an object disappears. To further compound the problem, the installation costs of CCTV systems outweigh those of the hardware. This means that geometric constraints or any form of calibration (such as that which might be used with epipolar constraints) is simply not realistic for a real world installation. The algorithms cannot afford to dictate to the installer. This work attempts to address this problem and outlines a method to allow objects to be related and tracked across cameras without any explicit calibration, be it geometric or colour.
    Ong E-J, Bowden R (2006) Learning Distance for Arbitrary Visual Features, Proceedings of the British Machine Vision Conference, 2, pp. 749-758, BMVA
    This paper presents a method for learning distance functions of arbitrary feature representations that is based on the concept of wormholes. We introduce wormholes and describe how they provide a method for warping the topology of visual representation spaces such that a meaningful distance between examples is available. Additionally, we show how a more general distance function can be learnt through the combination of many wormholes via an inter-wormhole network. We then demonstrate the application of the distance learning method on a variety of problems including nonlinear synthetic data, face illumination detection and the retrieval of images containing natural landscapes and man-made objects (e.g. cities).
    Gilbert A, Bowden R (2008) Incremental, scalable tracking of objects inter camera, COMPUTER VISION AND IMAGE UNDERSTANDING, 111(1), pp. 43-58, ACADEMIC PRESS INC ELSEVIER SCIENCE
    Cooper HM, Holt B, Bowden R (2011) Sign Language Recognition, In: Moeslund TB, Hilton A, Krüger V, Sigal L (eds.), Visual Analysis of Humans: Looking at People, pp. 539-562, Springer Verlag
    This chapter covers the key aspects of sign-language recognition (SLR), starting with a brief introduction to the motivations and requirements, followed by a précis of sign linguistics and their impact on the field. The types of data available and the relative merits are explored, allowing examination of the features which can be extracted. Classifying the manual aspects of sign (similar to gestures) is then discussed from a tracking and non-tracking viewpoint before summarising some of the approaches to the non-manual aspects of sign languages. Methods for combining the sign classification results into full SLR are given, showing the progression towards speech recognition techniques and the further adaptations required for the sign specific case. Finally, the current frontiers are discussed and the recent research presented. This covers the task of continuous sign recognition, the work towards true signer independence, how to effectively combine the different modalities of sign, making use of the current linguistic research and adapting to larger, more noisy data sets.
    Dowson N, Bowden R (2008) Mutual information for Lucas-Kanade tracking (MILK): An inverse compositional formulation, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 30(1), pp. 180-185, IEEE COMPUTER SOC
    Lebeda K, Hadfield S, Bowden R (2015) Exploring Causal Relationships in Visual Object Tracking, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), pp. 3065-3073, IEEE
    Oshin O, Gilbert A, Bowden R (2011) There Is More Than One Way to Get Out of a Car: Automatic Mode Finding for Action Recognition in the Wild, Lecture Notes in Computer Science: Pattern Recognition and Image Analysis, 6669, pp. 41-48, Springer Berlin / Heidelberg
    'Actions in the wild' is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies [10] or the Internet [9]. State-of-the-art recognition rates in this domain are orders of magnitude lower than in more contrived settings. One of the primary reasons is the huge variability within each action class. We propose to tackle recognition in the wild by automatically breaking complex action categories into multiple modes/groups, and training a separate classifier for each mode. This is achieved using RANSAC which identifies and separates the modes while rejecting outliers. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. Our results demonstrate the validity of the approach, and for classes which exhibit multi-modality, we achieve in excess of double the performance over approaches that assume single modality.
    KaewTraKulPong P, Bowden R (2004) Probabilistic Learning of Salient Patterns across Spatially Separated Uncalibrated Views, Proceedings of IDSS04 - Intelligent Distributed Surveillance Systems, Feb 2004, pp. 36-40, Institution of Electrical Engineers
    We present a solution to the problem of tracking intermittent targets that can overcome long-term occlusions as well as movement between camera views. Unlike other approaches, our system does not require topological knowledge of the site or labelled training patterns during the learning period. The approach uses the statistical consistency of data obtained automatically over an extended period of time rather than explicit geometric calibration to automatically learn the salient reappearance periods for objects. This allows us to predict where objects may reappear and within how long. We demonstrate how these salient reappearance periods can be used with a model of physical appearance to track objects between spatially separate regions in single and separated views.
    Okwechime D, Ong E, Gilbert A, Bowden R (2011) Social interactive human video synthesis, Lecture Notes in Computer Science: Computer Vision - ACCV 2010, 6492 (Part 1), pp. 256-270, Springer
    In this paper, we propose a computational model for social interaction between three people in a conversation, and demonstrate results using human video motion synthesis. We utilised semi-supervised computer vision techniques to label social signals between the people, like laughing, head nod and gaze direction. Data mining is used to deduce frequently occurring patterns of social signals between a speaker and a listener in both interested and not interested social scenarios, and the mined confidence values are used as conditional probabilities to animate social responses. The human video motion synthesis is done using an appearance model to learn a multivariate probability distribution, combined with a transition matrix to derive the likelihood of motion given a pose configuration. Our system uses social labels to more accurately define motion transitions and build a texture motion graph. Traditional motion synthesis algorithms are best suited to large human movements like walking and running, where motion variations are large and prominent. Our method focuses on generating more subtle human movement like head nods. The user can then control who speaks and the interest level of the individual listeners resulting in social interactive conversational agents.
    Lebeda K, Hadfield S, Matas J, Bowden R (2013) Long-Term Tracking Through Failure Cases, Proceedings, IEEE Workshop on Visual Object Tracking Challenge at ICCV, pp. 153-160, IEEE
    Long term tracking of an object, given only a single instance in an initial frame, remains an open problem. We propose a visual tracking algorithm, robust to many of the difficulties which often occur in real-world scenes. Correspondences of edge-based features are used, to overcome the reliance on the texture of the tracked object and improve invariance to lighting. Furthermore we address long-term stability, enabling the tracker to recover from drift and to provide redetection following object disappearance or occlusion. The two-module principle is similar to the successful state-of-the-art long-term TLD tracker, however our approach extends to cases of low-textured objects. Besides reporting our results on the VOT Challenge dataset, we perform two additional experiments. Firstly, results on short-term sequences show the performance of tracking challenging objects which represent failure cases for competing state-of-the-art approaches. Secondly, long sequences are tracked, including one of almost 30000 frames which to our knowledge is the longest tracking sequence reported to date. This tests the re-detection and drift resistance properties of the tracker. All the results are comparable to the state-of-the-art on sequences with textured objects and superior on non-textured objects. The new annotated sequences are made publicly available.
    Escalera S, Gonzàlez J, Baró X, Reyes M, Guyon I, Athitsos V, Escalante H, Sigal L, Argyros A, Sminchisescu C, Bowden R, Sclaroff S (2013) ChaLearn multi-modal gesture recognition 2013: Grand challenge and workshop summary, ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction, pp. 365-370
    We organized a Grand Challenge and Workshop on Multi-Modal Gesture Recognition. The MMGR Grand Challenge focused on the recognition of continuous natural gestures from multi-modal data (including RGB, depth, user mask, skeletal model, and audio). We made available a large labeled video database of 13,858 gestures from a lexicon of 20 Italian gesture categories recorded with a Kinect camera. More than 54 teams participated in the challenge and a final error rate of 12% was achieved by the winner of the competition. Winners of the competition published their work in the workshop of the Challenge. The MMGR Workshop was held at the ICMI conference 2013, Sydney. A total of 9 relevant papers on multi-modal gesture recognition were accepted for presentation. This includes multi-modal descriptors, multi-class learning strategies for segmentation and classification in temporal data, as well as relevant applications in the field, including multi-modal Social Signal Processing and multi-modal Human Computer Interfaces. Five relevant invited speakers participated in the workshop: Profs. Leonid Sigal from Disney Research, Antonis Argyros from FORTH, Institute of Computer Science, Cristian Sminchisescu from Lund University, Richard Bowden from University of Surrey, and Stan Sclaroff from Boston University. They summarized their research in the field and discussed past, current, and future challenges in Multi-Modal Gesture Recognition. © 2013 ACM.
    Koller O, Ney H, Bowden R (2015) Deep Learning of Mouth Shapes for Sign Language, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW)pp. 477-483 IEEE
    Windridge D, Bowden R (2005) Hidden Markov chain estimation and parameterisation via ICA-based feature-selection, Pattern Analysis and Applications 8(1-2) pp. 115-124
    Ong EJ, Bowden R (2011) Learning sequential patterns for lipreading,BMVC 2011 - Proceedings of the British Machine Vision Conference 2011
    This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search of optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set's boosting weight distribution. This temporal search method is then integrated into the boosting framework resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state-of-the-art recognition performance, using only a small set of sequential patterns. © 2011. The copyright of this document resides with its authors.
    Gilbert A, Bowden R (2011) Push and pull: Iterative grouping of media, BMVC 2011 - Proceedings of the British Machine Vision Conference 2011
    We present an approach to iteratively cluster images and video in an efficient and intuitive manner. While many techniques use the traditional approach of time-consuming ground-truthing of large amounts of data [10, 16, 20, 23], this is increasingly infeasible as dataset size and complexity increase. Furthermore it is not applicable to the home user, who wants to intuitively group his/her own media without labelling the content. Instead we propose a solution that allows the user to select media that semantically belongs to the same class and use machine learning to "pull" this and other related content together. We introduce an "image signature" descriptor and use min-Hash and greedy clustering to efficiently present the user with clusters of the dataset using multi-dimensional scaling. The image signatures of the dataset are then adjusted by APriori data mining, identifying the common elements between a small subset of image signatures. This is able to both pull together true positive clusters and push apart false positive examples. The approach is tested on real videos harvested from the web using the state-of-the-art YouTube dataset [18]. The accuracy of correct group labels increases from 60.4% to 81.7% using 15 iterations of pulling and pushing the media around, while the process takes only 1 minute to compute the pairwise similarities of the image signatures and visualise the whole YouTube dataset. © 2011. The copyright of this document resides with its authors.
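    As a rough illustration of the min-Hash step mentioned above (this is not the authors' code; the visual-word sets and parameters are made up), the Jaccard overlap between two image signatures can be approximated cheaply by comparing per-hash minima:

        # Illustrative min-Hash comparison of two image signatures (sets of visual-word ids).
        import random

        def make_minhash_funcs(num_hashes, seed=0):
            rng = random.Random(seed)
            p = 4294967311  # a prime larger than any visual-word id we expect
            return [(rng.randrange(1, p), rng.randrange(0, p), p) for _ in range(num_hashes)]

        def minhash_signature(word_ids, hash_funcs):
            # Keep, per hash function, the minimum hashed visual-word id.
            return [min(((a * w + b) % p) for w in word_ids) for (a, b, p) in hash_funcs]

        def estimated_jaccard(sig1, sig2):
            # The fraction of agreeing minima approximates the Jaccard overlap of the sets.
            return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

        funcs = make_minhash_funcs(num_hashes=64)
        img_a = {3, 17, 256, 900, 4021}   # hypothetical visual-word sets for two images
        img_b = {3, 17, 256, 901, 7777}
        print(estimated_jaccard(minhash_signature(img_a, funcs), minhash_signature(img_b, funcs)))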
    Bowden R, Heap T, Hart C (1996) Virtual Datagloves: Interacting with Virtual Environments Through Computer Vision,Proceedings of the Third UK Virtual Reality Special Interest Group Conference; Leicester, 3rd July 1996
    This paper outlines a system design and implementation of a 3D input device for graphical applications. It is shown how computer vision can be used to track a user's movements within the image frame, allowing interaction with 3D worlds and objects. Point Distribution Models (PDMs) have been shown to be successful at tracking deformable objects. This system demonstrates how these 'smart snakes' can be used in real time with real world applications, demonstrating how computer vision can provide a low cost, intuitive interface that has few hardware constraints. The compact mathematical model behind the PDM allows simple static gesture recognition to be performed, providing the means to communicate with an application. It is shown how movement of both the hand and face can be used to drive 3D engines. The system is based upon Open Inventor and designed for use with Silicon Graphics Indy Workstations, but allowances have been made to facilitate the inclusion of the tracker within third party applications. The reader is also provided with an insight into the next generation of HCI and Multimedia. Access to this work can be gained through the above web address.
    Bowden R, Collomosse J, Mikolajczyk K (2014) Guest Editorial: Tracking, Detection and Segmentation, International Journal of Computer Vision
    Sheerman-Chase T, Ong E-J, Bowden R (2013) Non-linear predictors for facial feature tracking across pose and expression, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
    This paper proposes a non-linear predictor for estimating the displacement of tracked feature points on faces that exhibit significant variations across pose and expression. Existing methods such as linear predictors, ASMs or AAMs are limited to a narrow range in pose. In order to track across a large pose range, separate pose-specific models are required that are then coupled via a pose-estimator. In our approach, we neither require a set of pose-specific models nor a pose-estimator. Using just a single tracking model, we are able to robustly and accurately track across a wide range of expressions and poses. This is achieved by gradient boosting of regression trees for predicting the displacement vectors of tracked points. Additionally, we propose a novel algorithm for simultaneously configuring this hierarchical set of trackers for optimal tracking results. Experiments were carried out on sequences of naturalistic conversation and sequences with large pose and expression changes. The results show that the proposed method is superior to state of the art methods, in being able to robustly track a set of facial points whilst gracefully recovering from tracking failures. © 2013 IEEE.
    Ong E, Bowden R (2011) Learning Temporal Signatures for Lip Reading,
    Bowden R (2014) Seeing and understanding people, COMPUTATIONAL VISION AND MEDICAL IMAGE PROCESSING IVpp. 9-15 CRC PRESS-TAYLOR & FRANCIS GROUP
    Bowden R (2015) The evolution of Computer Vision,PERCEPTION44pp. 360-361 SAGE PUBLICATIONS LTD
    Koller O, Bowden R, Ney H (2016) Automatic Alignment of HamNoSys Subunits for Continuous Sign Language Recognition, LREC 2016 Proceedingspp. 121-128
    This work presents our recent advances in the field of automatic processing of sign language corpora targeting continuous sign language recognition. We demonstrate how generic annotations at the articulator level, such as HamNoSys, can be exploited to learn subunit classifiers. Specifically, we explore cross-language-subunits of the hand orientation modality, which are trained on isolated signs of publicly available lexicon data sets for Swiss German and Danish Sign Language and are applied to continuous sign language recognition of the challenging RWTH-PHOENIX-Weather corpus featuring German Sign Language. We observe a significant reduction in word error rate using this method.
    Bowden R (2003) Probabilistic models in computer vision, IMAGE AND VISION COMPUTING21(10)pp. 841-841 ELSEVIER SCIENCE BV
    Micilotta AS, Ong EJ, Bowden R (2005) Detection and tracking of humans by probabilistic body part assembly, BMVC 2005 - Proceedings of the British Machine Vision Conference 2005
    This paper presents a probabilistic framework of assembling detected human body parts into a full 2D human configuration. The face, torso, legs and hands are detected in cluttered scenes using boosted body part detectors trained by AdaBoost. Body configurations are assembled from the detected parts using RANSAC, and a coarse heuristic is applied to eliminate obvious outliers. An a priori mixture model of upper-body configurations is used to provide a pose likelihood for each configuration. A joint-likelihood model is then determined by combining the pose, part detector and corresponding skin model likelihoods. The assembly with the highest likelihood is selected by RANSAC, and the elbow positions are inferred. This paper also illustrates the combination of skin colour likelihood and detection likelihood to further reduce false hand and face detections.
    Gilbert A, Bowden R (2015) Data mining for action recognition, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)9007pp. 290-303
    © Springer International Publishing Switzerland 2015. In recent years, dense trajectories have been shown to be an efficient representation for action recognition and have achieved state-of-the-art results on a variety of increasingly difficult datasets. However, while the features have greatly improved the recognition scores, the training process and machine learning used hasn't in general deviated from the object recognition based SVM approach. This is despite the increase in quantity and complexity of the features used. This paper improves the performance of action recognition through two data mining techniques, APriori association rule mining and Contrast Set Mining. These techniques are ideally suited to action recognition and in particular, dense trajectory features, as they can utilise the large amounts of data to identify far shorter discriminative subsets of features called rules. Experimental results on one of the most challenging datasets, Hollywood2, outperform the current state-of-the-art.
    Bowden R, Ellis L, Kittler J, Shevchenko M, Windridge D (2005) Unsupervised symbol grounding and cognitive bootstrapping in cognitive vision, Proc. 13th Int. Conference on Image Analysis and Processing, pp. 27-36
    Holt B, Bowden R (2013) Efficient Estimation of Human Upper Body Pose in Static Depth Images, Communications in Computer and Information Science359 CCISpp. 399-410
    Automatic estimation of human pose has long been a goal of computer vision, to which a solution would have a wide range of applications. In this paper we formulate the pose estimation task within a regression and Hough voting framework to predict 2D joint locations from depth data captured by a consumer depth camera. In our approach the offset from each pixel to the location of each joint is predicted directly using random regression forests. The predictions are accumulated in Hough images which are treated as likelihood distributions where maxima correspond to joint location hypotheses. Our approach is evaluated on a publicly available dataset with good results. © Springer-Verlag Berlin Heidelberg 2013.
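    The following sketch illustrates the general regression-plus-Hough-voting recipe described above, using scikit-learn's RandomForestRegressor and synthetic per-pixel features as stand-ins; it is not the published system, and all names and sizes are illustrative:

        # Sketch: per-pixel offset regression with a random forest, accumulated into a
        # Hough image whose maximum gives a joint-location hypothesis.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)

        # Hypothetical training data: a depth-derived feature vector per pixel and the
        # 2D offset (dx, dy) from that pixel to the joint it should vote for.
        X_train = rng.normal(size=(500, 8))
        y_train = rng.normal(scale=5.0, size=(500, 2))
        forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)

        def vote_for_joint(pixel_coords, pixel_features, image_shape):
            # Each pixel casts one vote at its predicted joint location.
            hough = np.zeros(image_shape)
            for (x, y), (dx, dy) in zip(pixel_coords, forest.predict(pixel_features)):
                u, v = int(round(x + dx)), int(round(y + dy))
                if 0 <= v < image_shape[0] and 0 <= u < image_shape[1]:
                    hough[v, u] += 1.0
            return np.unravel_index(np.argmax(hough), hough.shape)

        coords = rng.integers(0, 64, size=(200, 2))
        feats = rng.normal(size=(200, 8))
        print(vote_for_joint(coords, feats, (64, 64)))   # (row, col) of the strongest vote peak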
    Ellis L, Felsberg M, Bowden R (2011) Affordance mining: Forming perception through action,Lecture Notes in Computer Science: 10th Asian Conference on Computer Vision, Revised Selected Papers Part IV6495pp. 525-538 Springer
    This work employs data mining algorithms to discover visual entities that are strongly associated to autonomously discovered modes of action, in an embodied agent. Mappings are learnt from these perceptual entities onto the agent's action space. In general, low dimensional action spaces are better suited to unsupervised learning than high dimensional percept spaces, allowing for structure to be discovered in the action space, and used to organise the perceptual space. Local feature configurations that are strongly associated to a particular 'type' of action (and not all other action types) are considered likely to be relevant in eliciting that action type. By learning mappings from these relevant features onto the action space, the system is able to respond in real time to novel visual stimuli. The proposed approach is demonstrated on an autonomous navigation task, and the system is shown to identify the relevant visual entities to the task and to generate appropriate responses.
    Ong E-J, Bowden R, IEEE (2008) Robust Lip-Tracking using Rigid Flocks of Selected Linear Predictors, 2008 8TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2008), VOLS 1 AND 2pp. 247-254
    Efthimiou E, Fotinea S-E, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Lefebvre-Albaret F (2012) Sign Language technologies and resources of the Dicta-Sign project,Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon. Satellite Workshop to the eighth International Conference on Language Resources and Evaluation (LREC-2012)pp. 37-44 Institute for German Sign Language and Communication of the Deaf
    Here we present the outcomes of Dicta-Sign FP7-ICT project. Dicta-Sign researched ways to enable communication between Deaf individuals through the development of human-computer interfaces (HCI) for Deaf users, by means of Sign Language. It has researched and developed recognition and synthesis engines for sign languages (SLs) that have brought sign recognition and generation technologies significantly closer to authentic signing. In this context, Dicta-Sign has developed several technologies demonstrated via a sign language aware Web 2.0, combining work from the fields of sign language recognition, sign language animation via avatars and sign language resources and language models development, with the goal of allowing Deaf users to make, edit, and review avatar-based sign language contributions online, similar to the way people nowadays make text-based contributions on the Web.
    Oshin O, Gilbert A, Illingworth J, Bowden R (2009) Learning to Recognise Spatio-Temporal Interest Points, In: Wang L, Cheng L, Zhao G (eds.), Machine Learning for Human Motion Analysis (2), pp. 14-30, IGI Publishing
    Machine Learning for Human Motion Analysis: Theory and Practice highlights the development of robust and effective vision-based motion understanding systems.
    Gilbert A, Bowden R (2007) Multi person tracking within crowded scenes,Human Motion - Understanding, Modeling, Capture and Animation, Proceedings4814pp. 166-179 Springer
    This paper presents a solution to the problem of tracking people within crowded scenes. The aim is to maintain individual object identity through a crowded scene which contains complex interactions and heavy occlusions of people. Our approach uses the strengths of two separate methods; a global object detector and a localised frame by frame tracker. A temporal relationship model of torso detections built during low activity periods is used to further disambiguate during periods of high activity. A single camera with no calibration and no environmental information is used. Results are compared to a standard tracking method and groundtruth. Two video sequences containing interactions, overlaps and occlusions between people are used to demonstrate our approach. The results show that our technique performs better than a standard tracking method and can cope with challenging occlusions and crowd interactions.
    Lewin M, Bowden R, Sarhadi M (2000) Automotive Prototyping using Augmented Reality,Proceedings of the 7th VRSIG Conference, Strathclyde University, Sept 2000
    Ong E-J, Micilotta AS, Bowden R, Hilton A (2005) Viewpoint invariant exemplar-based 3D human tracking, COMPUTER VISION AND IMAGE UNDERSTANDING104(2-3)pp. 178-189 ACADEMIC PRESS INC ELSEVIER SCIENCE
    Pugeault N, Bowden R (2015) How Much of Driving Is Preattentive?,IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY64(12)pp. 5424-5438 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
    Lebeda K, Matas J, Bowden R (2013) Tracking the untrackable: How to track when your object is featureless, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)7729 LNCS(PART 2)pp. 347-359
    We propose a novel approach to tracking objects by low-level line correspondences. In our implementation we show that this approach is usable even when tracking objects with a lack of texture, exploiting situations where feature-based trackers fail due to the aperture problem. Furthermore, we suggest an approach to failure detection and recovery to maintain long-term stability. This is achieved by remembering configurations which lead to good pose estimations and using them later for tracking corrections. We carried out experiments on several sequences of different types. The proposed tracker proves itself as competitive or superior to state-of-the-art trackers in both standard and low-textured scenes. © 2013 Springer-Verlag.
    Bowden R, Sarhadi M (2002) A non-linear model of shape and motion for tracking finger spelt American sign language, IMAGE AND VISION COMPUTING20(9-10)pp. 597-607 ELSEVIER SCIENCE BV
    Gilbert A, Bowden R (2013) A picture is worth a thousand tags: Automatic web based image tag expansion,Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)7725 L(PART 2)pp. 447-460 Springer
    We present an approach to automatically expand the annotation of images using the internet as an additional information source. The novelty of the work is in the expansion of image tags by automatically introducing new unseen complex linguistic labels which are collected unsupervised from associated webpages. Taking a small subset of existing image tags, a web based search retrieves additional textual information. Both a textual bag of words model and a visual bag of words model are combined and symbolised for data mining. Association rule mining is then used to identify rules which relate words to visual contents. Unseen images that fit these rules are re-tagged. This approach allows a large number of additional annotations to be added to unseen images, on average 12.8 new tags per image, with an 87.2% true positive rate. Results are shown on two datasets including a new 2800 image annotation dataset of landmarks, the results include pictures of buildings being tagged with the architect, the year of construction and even events that have taken place there. This widens the tag annotation impact and their use in retrieval. This dataset is made available along with tags and the 1970 webpages and additional images which form the information corpus. In addition, results for a common state-of-the-art dataset MIRFlickr25000 are presented for comparison of the learning framework against previous works. © 2013 Springer-Verlag.
    Lewin M, Bowden R, Sarhadi M (2000) Applying Augmented Reality to Virtual Product Prototyping,pp. 59-68
    Dowson N, Bowden R (2004) Metric mixtures for mutual information (M³I) tracking, PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 2, pp. 752-756 IEEE COMPUTER SOC
    Ong EJ, Bowden R (2005) Learning multi-kernel distance functions using relative comparisons, PATTERN RECOGNITION38(12)pp. 2653-2657 ELSEVIER SCI LTD
    Gupta A, Bowden R (2011) Evaluating dimensionality reduction techniques for visual category recognition using rényi entropy, European Signal Processing Conferencepp. 913-917 IEEE
    Visual category recognition is a difficult task of significant interest to the machine learning and vision community. One of the principal hurdles is the high dimensional feature space. This paper evaluates several linear and non-linear dimensionality reduction techniques. A novel evaluation metric, the Rényi entropy of the inter-vector Euclidean distance distribution, is introduced. This information theoretic measure judges the techniques on their preservation of structure in a lower-dimensional sub-space. The popular dataset Caltech-101 is utilized in the experiments. The results indicate that the techniques which preserve local neighborhood structure performed best amongst the techniques evaluated in this paper. © 2011 EURASIP.
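    A minimal sketch of this style of evaluation metric, assuming synthetic descriptors and arbitrary choices for the entropy order and histogram bin count (not the paper's exact protocol):

        # Rényi entropy of the pairwise Euclidean distance distribution, before and
        # after a (stand-in) dimensionality reduction.
        import numpy as np
        from scipy.spatial.distance import pdist

        def renyi_entropy_of_distances(X, alpha=2.0, bins=64):
            d = pdist(X)                                 # all inter-vector Euclidean distances
            p, _ = np.histogram(d, bins=bins)
            p = p[p > 0] / p.sum()                       # empirical distance distribution
            return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

        X_high = np.random.default_rng(0).normal(size=(300, 128))   # e.g. raw descriptors
        X_low = X_high[:, :10]                                       # stand-in for a reduction
        print(renyi_entropy_of_distances(X_high), renyi_entropy_of_distances(X_low))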
    Cooper H, Bowden R (2007) Sign Language Recognition Using Boosted Volumetric Features,Proceedings of the IAPR Conference on Machine Vision Applicationspp. 359-362 MVA Organisation
    This paper proposes a method for sign language recognition that bypasses the need for tracking by classifying the motion directly. The method uses the natural extension of Haar-like features into the temporal domain, computed efficiently using an integral volume. These volumetric features are assembled into spatio-temporal classifiers using boosting. Results are presented for a fast feature extraction method and 2 different types of boosting. These configurations have been tested on a data set consisting of both seen and unseen signers performing 5 signs, producing competitive results.
    Ellis L, Matas J, Bowden R (2008) Online Learning and Partitioning of Linear Displacement Predictors for Tracking,Proceedings of the British Machine Vision Conferencepp. 33-42 The British Machine Vision Association (BMVA)
    A novel approach to learning and tracking arbitrary image features is presented. Tracking is tackled by learning the mapping from image intensity differences to displacements. Linear regression is used, resulting in low computational cost. An appearance model of the target is built on-the-fly by clustering sub-sampled image templates. The medoidshift algorithm is used to cluster the templates, thus identifying various modes or aspects of the target appearance; each mode is associated to the most suitable set of linear predictors, allowing piecewise linear regression from image intensity differences to warp updates. Despite no hard-coding or offline learning, excellent results are shown on three publicly available video sequences and comparisons with related approaches made.
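    The core idea of a learnt linear displacement predictor can be sketched as a least-squares problem; the data below is synthetic and the support size is an arbitrary choice, so this is only an illustration of the mapping from intensity differences to warp updates:

        # Learn a linear map P from intensity-difference vectors to displacement updates
        # by ordinary least squares, then apply it to a new observation.
        import numpy as np

        rng = np.random.default_rng(1)
        n_support, n_train = 50, 400

        # Synthetic training set: each row of D holds the intensity differences observed
        # after displacing the template by a known amount T[i] = (dx, dy).
        T = rng.uniform(-4, 4, size=(n_train, 2))
        A_true = rng.normal(size=(n_support, 2))              # unknown imaging relation
        D = T @ A_true.T + 0.05 * rng.normal(size=(n_train, n_support))

        P, *_ = np.linalg.lstsq(D, T, rcond=None)             # displacement ~ differences @ P

        d_new = rng.normal(size=(1, n_support))               # differences seen at run time
        print(d_new @ P)                                       # predicted (dx, dy) update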
    Cooper HM, Pugeault N, Bowden R (2011) Reading the Signs: A Video Based Sign Dictionary,2011 International Conference on Computer Vision: 2nd IEEE Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams (ARTEMIS 2011)pp. 914-919 IEEE
    This article presents a dictionary for Sign Language using visual sign recognition based on linguistic subcomponents. We demonstrate a system where the user makes a query, receiving in response a ranked selection of similar results. The approach uses concepts from linguistics to provide sign sub-unit features and classifiers based on motion, sign-location and handshape. These sub-units are combined using Markov Models for sign level recognition. Results are shown for a video dataset of 984 isolated signs performed by a native signer. Recognition rates reach 71.4% for the first candidate and 85.9% for retrieval within the top 10 ranked signs.
    Ellis L, Pugeault N, Ofjall K, Hedborg J, Bowden R, Felsberg M (2013) Autonomous navigation and sign detector learning,2013 IEEE Workshop on Robot Vision, WORV 2013pp. 144-151
    This paper presents an autonomous robotic system that incorporates novel Computer Vision, Machine Learning and Data Mining algorithms in order to learn to navigate and discover important visual entities. This is achieved within a Learning from Demonstration (LfD) framework, where policies are derived from example state-to-action mappings. For autonomous navigation, a mapping is learnt from holistic image features (GIST) onto control parameters using Random Forest regression. Additionally, visual entities (road signs e.g. STOP sign) that are strongly associated to autonomously discovered modes of action (e.g. stopping behaviour) are discovered through a novel Percept-Action Mining methodology. The resulting sign detector is learnt without any supervision (no image labeling or bounding box annotations are used). The complete system is demonstrated on a fully autonomous robotic platform, featuring a single camera mounted on a standard remote control car. The robot carries a PC laptop, that performs all the processing on board and in real-time. © 2013 IEEE.
    Bowden R, Mitchell TA, Sarhadi M (1998) Reconstructing 3D Pose and Motion from a Single Camera View,Proceedings of BMVC 19982 BMVA (British Machine Vision Association)
    This paper presents a model based approach to human body tracking in which the 2D silhouette of a moving human and the corresponding 3D skeletal structure are encapsulated within a non-linear Point Distribution Model. This statistical model allows a direct mapping to be achieved between the external boundary of a human and the anatomical position. It is shown how this information, along with the position of landmark features such as the hands and head can be used to reconstruct information about the pose and structure of the human body from a monoscopic view of a scene.
    Koller O, Ney H, Bowden R (2013) May the force be with you: Force-aligned signwriting for automatic subunit annotation of corpora, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
    We propose a method to generate linguistically meaningful subunits in a fully automated fashion for sign language corpora. The ability to automate the process of subunit annotation has profound effects on the data available for training sign language recognition systems. The approach is based on the idea that subunits are shared among different signs. With sufficient data and knowledge of possible signing variants, accurate automatic subunit sequences are produced, matching the specific characteristics of given sign language data. Specifically we demonstrate how an iterative forced alignment algorithm can be used to transfer the knowledge of a user-edited open sign language dictionary to the task of annotating a challenging, large vocabulary, multi-signer corpus recorded from public TV. Existing approaches focus on labour intensive manual subunit annotations or on data-driven approaches. Our method yields an average precision and recall within 15% of the maximum achievable accuracy, with little user intervention beyond providing a simple word gloss. © 2013 IEEE.
    Hadfield S, Bowden R (2013) Hollywood 3D: Recognizing Actions in 3D Natural Scenes,Proceeedings, IEEE conference on Computer Vision and Pattern Recognition (CVPR)pp. 3398-3405 IEEE
    Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood 3D dataset. We make the dataset, including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
    Dowson NDH, Bowden R, Kadir T (2006) Image template matching using mutual information and NP-Windows,18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 2, PROCEEDINGSpp. 1186-1191 IEEE COMPUTER SOC
    A non-parametric (NP) sampling method is introduced for obtaining the joint distribution of a pair of images. This method is based on NP windowing and is equivalent to sampling the images at infinite resolution. Unlike existing methods, arbitrary selection of kernels is not required and the spatial structure of images is used. NP windowing is applied to a registration application where the mutual information (MI) between a reference image and a warped template is maximised with respect to the warp parameters. In comparisons against the current state-of-the-art MI registration methods, NP windowing yielded excellent results with lower bias and improved convergence rates.
    Pugeault N, Bowden R (2010) Learning pre-attentive driving behaviour from holistic visual features, ECCV 2010, Part VI, LNCS 6316pp. 154-167
    Bowden R, Kaewtrakulpong P, Lewin M (2002) Jeremiah: The face of computer vision, ACM International Conference Proceeding Series22pp. 124-128
    This paper presents a humanoid computer interface (Jeremiah) that is capable of extracting moving objects from a video stream and responding by directing the gaze of an animated head toward it. It further responds through change of expression reflecting the emotional state of the system as a response to stimuli. As such, the system exhibits similar behavior to a child. The system was originally designed as a robust visual tracking system capable of performing accurately and consistently within a real world visual surveillance arena. As such, it provides a system capable of operating reliably in any environment both indoor and outdoor. Originally designed as a public interface to promote computer vision and the public understanding of science (exhibited in British Science Museum), Jeremiah provides the first step to a new form of intuitive computer interface. Copyright © ACM 2002.
    Ellis L, Dowson N, Matas J, Bowden R (2011) Linear Regression and Adaptive Appearance Models for Fast Simultaneous Modelling and Tracking,INTERNATIONAL JOURNAL OF COMPUTER VISION95(2)pp. 154-179 SPRINGER
    Sheerman-Chase T, Ong E-J, Bowden R (2009) Feature selection of facial displays for detection of non verbal communication in natural conversation,2009 IEEE 12th International Conference on Computer Vision Workshopspp. 1985-1992 IEEE
    Recognition of human communication has previously focused on deliberately acted emotions or in structured or artificial social contexts. This makes the result hard to apply to realistic social situations. This paper describes the recording of spontaneous human communication in a specific and common social situation: conversation between two people. The clips are then annotated by multiple observers to reduce individual variations in interpretation of social signals. Temporal and static features are generated from tracking using heuristic and algorithmic methods. Optimal features for classifying examples of spontaneous communication signals are then extracted by AdaBoost. The performance of the boosted classifier is comparable to human performance for some communication signals, even on this challenging and realistic data set.
    Okwechime D, Bowden R (2008) A generative model for motion synthesis and blending using probability density estimation, ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS5098pp. 218-227 SPRINGER-VERLAG BERLIN
    Lan Y, Harvey R, Theobald B, Ong EJ, Bowden R (2009) Comparing Visual Features for Lipreading,International Conference on Auditory-Visual Speech Processing 2009pp. 102-106 ICSA
    For automatic lipreading, there are many competing methods for feature extraction. Often, because of the complexity of the task these methods are tested on only quite restricted datasets, such as the letters of the alphabet or digits, and from only a few speakers. In this paper we compare some of the leading methods for lip feature extraction and compare them on the GRID dataset which uses a constrained vocabulary over, in this case, 15 speakers. Previously the GRID data has had restricted attention because of the requirements to track the face and lips accurately. We overcome this via the use of a novel linear predictor (LP) tracker which we use to control an Active Appearance Model (AAM). By ignoring shape and/or appearance parameters from the AAM we can quantify the effect of appearance and/or shape when lip-reading. We find that shape alone is a useful cue for lipreading (which is consistent with human experiments). However, the incremental effect of shape on appearance appears to be not significant which implies that the inner appearance of the mouth contains more information than the shape.
    Bowden R, Zisserman A, Kadir T, Brady M (2003) Vision based Interpretation of Natural Sign Languages,Proceedings of the 3rd international conference on Computer vision systems Springer-Verlag
    This manuscript outlines our current demonstration system for translating visual Sign to written text. The system is based around a broad description of scene activity that naturally generalizes, reducing training requirements and allowing the knowledge base to be explicitly stated. This allows the same system to be used for different sign languages requiring only a change of the knowledge base.
    Pugeault N, Bowden R (2010) Learning driving behaviour using holistic image descriptors, 4th International Conference on Cognitive Systems, CogSys 2010
    Dowson N, Bowden R (2006) A unifying framework for mutual information methods for use in non-linear optimisation,Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 13951pp. 365-378 Springer
    Many variants of MI exist in the literature. These vary primarily in how the joint histogram is populated. This paper places the four main variants of MI: Standard sampling, Partial Volume Estimation (PVE), In-Parzen Windowing and Post-Parzen Windowing into a single mathematical framework. Jacobians and Hessians are derived in each case. A particular contribution is that the non-linearities implicit to standard sampling and post-Parzen windowing are explicitly dealt with. These non-linearities are a barrier to their use in optimisation. Side-by-side comparison of the MI variants is made using eight diverse data-sets, considering computational expense and convergence. In the experiments, PVE was generally the best performer, although standard sampling often performed nearly as well (if a higher sample rate was used). The widely used sum of squared differences metric performed as well as MI unless large occlusions and non-linear intensity relationships occurred. The binaries and scripts used for testing are available online.
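    For reference, the "standard sampling" variant amounts to building a joint intensity histogram and evaluating MI from it; a minimal sketch (bin count and test images are arbitrary, and this is not the paper's implementation) is:

        # Mutual information from a joint intensity histogram (the "standard sampling" variant).
        import numpy as np

        def mutual_information(img_a, img_b, bins=32):
            joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
            pxy = joint / joint.sum()
            px, py = pxy.sum(axis=1), pxy.sum(axis=0)
            nz = pxy > 0                                                    # avoid log(0)
            return np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))

        rng = np.random.default_rng(0)
        reference = rng.random((64, 64))
        template = np.clip(reference + 0.1 * rng.random((64, 64)), 0.0, 1.0)  # roughly aligned
        print(mutual_information(reference, template))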
    Koller O, Ney H, Bowden R (2014) Read my lips: Continuous signer independent weakly supervised viseme recognition, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)8689 LNCS(PART 1)pp. 281-296
    This work presents a framework to recognise signer independent mouthings in continuous sign language, with no manual annotations needed. Mouthings represent lip-movements that correspond to pronunciations of words or parts of them during signing. Research on sign language recognition has focused extensively on the hands as features. But sign language is multi-modal and a full understanding, particularly with respect to its lexical variety, language idioms and grammatical structures, is not possible without further exploring the remaining information channels. To our knowledge no previous work has explored dedicated viseme recognition in the context of sign language recognition. The approach is trained on over 180,000 unlabelled frames and reaches 47.1% precision on the frame level. Generalisation across individuals and the influence of context-dependent visemes are analysed. © 2014 Springer International Publishing.
    Ong EJ, Bowden R (2006) Learning wormholes for sparsely labelled clustering,18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGSpp. 916-919 IEEE COMPUTER SOC
    Distance functions are an important component in many learning applications. However, the correct function is context dependent, therefore it is advantageous to learn a distance function using available training data. A limitation of many existing distance functions is the requirement for data to exist in a space of constant dimensionality; they also cannot be used directly on symbolic data. To address these problems, this paper introduces an alternative learnable distance function, based on multi-kernel distance bases or "wormholes" that connect spaces belonging to similar examples that were originally far apart. This work only assumes the availability of a set of data in the form of relative comparisons, avoiding the need for labelled or quantitative information. To learn the distance function, two algorithms are proposed: 1) building a set of basic wormhole bases using a Boosting-inspired algorithm; 2) merging different distance bases together for better generalisation. The learning algorithms are then shown to successfully extract suitable distance functions in various clustering problems, ranging from synthetic 2D data to symbolic representations of unlabelled images.
    Dowson NDH, Bowden R (2006) N-tier Simultaneous Modelling and Tracking for Arbitrary Warps,Proceedings of the British Machine Vision Conference2pp. 569-578 BMVA
    This paper presents an approach to object tracking which, given a single example of a target, learns a hierarchical constellation model of appearance and structure on the fly. The model becomes more robust over time as evidence of the variability of the object is acquired and added to the model. Tracking is performed in an optimised Lucas-Kanade type framework, using Mutual Information as a similarity metric. Several novelties are presented: an improved template update strategy using Bayes theorem, a multi-tier model topology, and a semi-automatic testing method. A critical comparison with other methods is made using exhaustive testing. In all, 11 challenging test sequences were used with a mean length of 568 frames.
    Gilbert A, Illingworth J, Bowden R (2010) Action Recognition Using Mined Hierarchical Compound Features,IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE33(5)pp. 883-897 IEEE COMPUTER SOC
    Camgoz NC, Hadfield SJ, Koller O, Bowden R (2016) Using Convolutional 3D Neural Networks for User-Independent Continuous Gesture Recognition, Proceedings IEEE International Conference of Pattern Recognition (ICPR), ChaLearn Workshop IEEE
    In this paper, we propose using 3D Convolutional Neural Networks for large scale user-independent continuous gesture recognition. We have trained an end-to-end deep network for continuous gesture recognition (jointly learning both the feature representation and the classifier). The network performs three-dimensional (i.e. space-time) convolutions to extract features related to both the appearance and motion from volumes of color frames. Space-time invariance of the extracted features is encoded via pooling layers. The earlier stages of the network are partially initialized using the work of Tran et al. before being adapted to the task of gesture recognition. An earlier version of the proposed method, which was trained for 11,250 iterations, was submitted to ChaLearn 2016 Continuous Gesture Recognition Challenge and ranked 2nd with the Mean Jaccard Index Score of 0.269235. When the proposed method was further trained for 28,750 iterations, it achieved state-of-the-art performance on the same dataset, yielding a 0.314779 Mean Jaccard Index Score.
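    A toy PyTorch sketch of a space-time convolutional network in this spirit (not the submitted architecture; layer sizes and the number of classes are placeholders):

        # Small 3D-convolutional classifier over stacks of colour frames.
        import torch
        import torch.nn as nn

        class TinyGestureC3D(nn.Module):
            def __init__(self, num_classes=20):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv3d(3, 16, kernel_size=3, padding=1),   # space-time convolution
                    nn.ReLU(),
                    nn.MaxPool3d((1, 2, 2)),                      # pool spatially only
                    nn.Conv3d(16, 32, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.MaxPool3d(2),                              # pool over time and space
                    nn.AdaptiveAvgPool3d(1),
                )
                self.classifier = nn.Linear(32, num_classes)

            def forward(self, clip):                              # clip: (B, 3, T, H, W)
                return self.classifier(self.features(clip).flatten(1))

        model = TinyGestureC3D()
        dummy_clip = torch.randn(2, 3, 16, 112, 112)              # two 16-frame colour clips
        print(model(dummy_clip).shape)                            # -> torch.Size([2, 20])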
    Oshin O, Gilbert A, Illingworth J, Bowden R (2009) Action recognition using Randomised Ferns,2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009pp. 530-537 IEEE
    This paper presents a generic method for recognising and localising human actions in video based solely on the distribution of interest points. The use of local interest points has shown promising results in both object and action recognition. While previous methods classify actions based on the appearance and/or motion of these points, we hypothesise that the distribution of interest points alone contains the majority of the discriminatory information. Motivated by its recent success in rapidly detecting 2D interest points, the semi-naive Bayesian classification method of Randomised Ferns is employed. Given a set of interest points within the boundaries of an action, the generic classifier learns the spatial and temporal distributions of those interest points. This is done efficiently by comparing sums of responses of interest points detected within randomly positioned spatio-temporal blocks within the action boundaries. We present results on the largest and most popular human action dataset using a number of interest point detectors, and demonstrate that the distribution of interest points alone can perform as well as approaches that rely upon the appearance of the interest points.
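    A toy sketch of the Randomised Ferns idea, with simple feature-value comparisons standing in for the sums of interest-point responses used in the paper; all sizes and data are illustrative:

        # Each fern is a group of random binary tests; their joint outcome indexes a
        # class-likelihood table, and ferns are combined naively (semi-naive Bayes).
        import numpy as np

        rng = np.random.default_rng(0)

        class RandomFerns:
            def __init__(self, n_ferns=10, fern_size=6, n_features=32, n_classes=3):
                self.tests = rng.integers(0, n_features, size=(n_ferns, fern_size, 2))
                self.counts = np.ones((n_ferns, 2 ** fern_size, n_classes))  # Laplace prior

            def _leaf(self, x):
                # Pack the binary test outcomes of each fern into an integer index.
                bits = x[self.tests[..., 0]] > x[self.tests[..., 1]]
                return (bits * (2 ** np.arange(bits.shape[1]))).sum(axis=1)

            def fit(self, X, y):
                for x, c in zip(X, y):
                    self.counts[np.arange(len(self.tests)), self._leaf(x), c] += 1
                return self

            def predict(self, x):
                probs = self.counts / self.counts.sum(axis=2, keepdims=True)
                log_post = np.log(probs[np.arange(len(self.tests)), self._leaf(x)]).sum(axis=0)
                return int(np.argmax(log_post))

        X = rng.normal(size=(200, 32))
        y = rng.integers(0, 3, size=200)
        print(RandomFerns().fit(X, y).predict(X[0]))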
    Bowden R, KaewTraKulPong P (2001) Adaptive Visual System for Tracking Low Resolution Colour Targets,Proceedings of the 12th British Machine Vision Conference (BMVC2001)pp. 243-252
    This paper addresses the problem of using appearance and motion models in classifying and tracking objects when detailed information of the object's appearance is not available. The approach relies upon motion, shape cues and colour information to help in associating objects temporally within a video stream. Unlike previous applications of colour in object tracking, where relatively large-size targets are tracked, our method is designed to track small colour targets. Our approach uses a robust background model based around Expectation Maximisation to segment moving objects with very low false detection rates. The system also incorporates a shadow detection algorithm which helps alleviate standard environmental problems associated with such approaches. A colour transformation derived from anthropological studies to model colour distributions of low-resolution targets is used along with a probabilistic method of combining colour and motion information. This provides a robust visual tracking system which is capable of performing accurately and consistently within a real world visual surveillance arena. This paper shows the system successfully tracking multiple people moving independently and the ability of the approach to recover lost tracks due to occlusions and background clutter.
    Mitchell TA, Bowden R (1999) Automated visual inspection of dry carbon-fibre reinforced composite preforms, Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering213(6)pp. 377-386
    A vision system is described which performs real-time inspection of dry carbon-fibre preforms during lay-up, the first stage in resin transfer moulding (RTM). The position of ply edges on the preform is determined in a two-stage process. Firstly, an optimized texture analysis method is used to estimate the approximate ply edge position. Secondly, boundary refinement is carried out using the texture estimate as a guiding template. Each potential edge point is evaluated using a merit function of edge magnitude, orientation and distance from the texture boundary estimate. The parameters of the merit function must be obtained by training on sample images. Once trained, the system has been shown to be accurate to better than ±1 pixel when used in conjunction with boundary models. Processing time is less than 1 s per image using commercially available convolution hardware. The system has been demonstrated in a prototype automated lay-up cell and used in a large number of manufacturing trials.
    Ong EJ, Cooper H, Pugeault N, Bowden R (2012) Sign Language Recognition using Sequential Pattern Trees, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference onpp. 2200-2207
    Moore S, Bowden R (2007) Automatic facial expression recognition using boosted discriminatory classifiers, Lecture Notes in Computer Science: Analysis and Modelling of Faces and Gestures4778pp. 71-83 Springer
    Over the last two decades automatic facial expression recognition has become an active research area. Facial expressions are an important channel of non-verbal communication, and can provide cues to emotions and intentions. This paper introduces a novel method for facial expression recognition, by assembling contour fragments as discriminatory classifiers and boosting them to form a strong accurate classifier. Detection is fast as features are evaluated using an efficient lookup to a chamfer image, which weights the response of the feature. An ensemble classification technique is presented using a voting scheme based on classifiers' responses. The result of this research is a 6-class classifier (6 basic expressions of anger, joy, sadness, surprise, disgust and fear) which demonstrates competitive results, achieving rates as high as 96% for some expressions. As classifiers are extremely fast to compute, the approach operates at well above frame rate. We also demonstrate how a dedicated classifier can be constructed to give optimal automatic parameter selection of the detector, allowing real time operation on unconstrained video.
    Micilotta A, Ong E, Bowden R (2005) Real-time Upper Body 3D Reconstruction from a Single Uncalibrated Camera,The European Association for Computer Graphics 26th Annual Conference, EUROGRAPHICS 2005pp. 41-44
    This paper outlines a method of estimating the 3D pose of the upper human body from a single uncalibrated camera. The objective application lies in 3D Human Computer Interaction where hand depth information offers extended functionality when interacting with a 3D virtual environment, but it is equally suitable to animation and motion capture. A database of 3D body configurations is built from a variety of human movements using motion capture data. A hierarchical structure consisting of three subsidiary databases, namely the frontal-view Hand Position (top-level), Silhouette and Edge Map Databases, are pre-extracted from the 3D body configuration database. Using this hierarchy, subsets of the subsidiary databases are then matched to the subject in real-time. The examples of the subsidiary databases that yield the highest matching score are used to extract the corresponding 3D configuration from the motion capture data, thereby estimating the upper body 3D pose.
    Koller O, Zargaran O, Ney H, Bowden R (2016) Deep Sign: Hybrid CNN-HMM for Continuous Sign Language Recognition, Proceedings of the British Machine Vision Conference 2016 BMVA Press
    This paper introduces the end-to-end embedding of a CNN into a HMM, while interpreting the outputs of the CNN in a Bayesian fashion. The hybrid CNN-HMM combines the strong discriminative abilities of CNNs with the sequence modelling capabilities of HMMs. Most current approaches in the field of gesture and sign language recognition disregard the necessity of dealing with sequence data both for training and evaluation. With our presented end-to-end embedding we are able to improve over the state-of-the-art on three challenging benchmark continuous sign language recognition tasks by between 15% and 38% relative and up to 13.3% absolute.
    Kristan M, Matas J, Leonardis A, Felsberg M, Cehovin L, Fernandez G, Voj1r T, Hager G, Nebehay G, Pflugfelder R, Gupta A, Bibi A, Lukezic A, Garcia-Martin A, Petrosino A, Saffari A, Montero A, Varfolomieiev A, Baskurt A, Zhao B, Ghanem B, Martinez B, Lee B, Han B, Wang C, Garcia C, Zhang C, Schmid C, Tao D, Kim D, Huang D, Prokhorov D, Du D, Yeung D, Ribeiro E, Khan F, Porikli F, Bunyak F, Zhu G, Seetharaman G, Kieritz H, Yau H, Li H, Qi H, Bischof H, Possegger H, Lee H, Nam H, Bogun I, Jeong J, Cho J, Lee J, Zhu J, Shi J, Li J, Jia J, Feng J, Gao J, Choi J, Kim J, Lang J, Martinez J, Choi J, Xing J, Xue K, Palaniappan K, Lebeda K, Alahari K, Gao K, Yun K, Wong K, Luo L, Ma L, Ke L, Wen L, Bertinetto L, Pootschi M, Maresca M, Danelljan M, Wen M, Zhang M, Arens M, Valstar M, Tang M, Chang M, Khan M, Fan N, Wang N, Miksik O, Torr P, Wang Q, Martin-Nieto R, Pelapur R, Bowden R, Laganière R, Moujtahid S, Hare S, Hadfield SJ, Lyu S, Li S, Zhu S, Becker S, Duffner S, Hicks S, Golodetz S, Choi S, Wu T, Mauthner T, Pridmore T, Hu W, Hübner W, Wang X, Li X, Shi X, Zhao X, Mei X, Shizeng Y, Hua Y, Li Y, Lu Y, Li Y, Chen Z, Huang Z, Chen Z, Zhang Z, He Z, Hong Z (2015) The Visual Object Tracking VOT2015 challenge results,ICCV workshop on Visual Object Tracking Challengepp. 564-586
    The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT 2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as in VOT2014 with full annotation of targets by rotated bounding boxes and per-frame attribute, (ii) extensions of the VOT2014 evaluation methodology by introduction of a new performance measure. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.
    Gilbert A, Bowden R (2015) Geometric Mining: Scaling Geometric Hashing to Large Datasets, 3rd Workshop on Web-scale Vision and Social Media (VSM), at ICCV 2015
    Bowden R, Mitchell TA, Sarhadi M (2000) Non-linear Statistical Models for the 3D Reconstruction of Human Pose and Motion from Monocular Image Sequences, Image and Vision Computing18(9)pp. 729-737 Elsevier
    This paper presents a model based approach to human body tracking in which the 2D silhouette of a moving human and the corresponding 3D skeletal structure are encapsulated within a non-linear point distribution model. This statistical model allows a direct mapping to be achieved between the external boundary of a human and the anatomical position. It is shown how this information, along with the position of landmark features such as the hands and head can be used to reconstruct information about the pose and structure of the human body from a monocular view of a scene.
    Ellis L, Bowden R (2005) A generalised exemplar approach to modelling perception action couplings,Proceedings of the Tenth IEEE International Conference on Computer Vision Workshopspp. 1874-1874 IEEE
    We present a framework for autonomous behaviour in vision based artificial cognitive systems by imitation through coupled percept-action (stimulus and response) exemplars. Attributed Relational Graphs (ARGs) are used as a symbolic representation of scene information (percepts). A measure of similarity between ARGs is implemented with the use of a graph isomorphism algorithm and is used to hierarchically group the percepts. By hierarchically grouping percept exemplars into progressively more general models coupled to progressively more general Gaussian action models, we attempt to model the percept space and create a direct mapping to associated actions. The system is built on a simulated shape sorter puzzle that represents a robust vision system. Spatio-temporal hypothesis exploration is performed efficiently in a Bayesian framework using a particle filter to propagate game play over time.
    Bowden R (1998) Non-linear Point Distribution Models, The University of Edinburgh
    Ellis L, Dowson N, Matas J, Bowden R (2007) Linear predictors for fast simultaneous modeling and tracking, 2007 IEEE 11TH INTERNATIONAL CONFERENCE ON COMPUTER VISION, VOLS 1-6pp. 2792-2799 IEEE
    Sheerman-Chase T, Ong E-J, Bowden R (2011) Cultural factors in the regression of non-verbal communication perception,2011 IEEE International Conference on Computer Visionpp. 1242-1249 IEEE
    Recognition of non-verbal communication (NVC) is important for understanding human communication and designing user centric user interfaces. Cultural differences affect the expression and perception of NVC but no previous automatic system considers these cultural differences. Annotation data for the LILiR TwoTalk corpus, containing dyadic (two person) conversations, was gathered using Internet crowdsourcing, with a significant quantity collected from India, Kenya and the United Kingdom (UK). Many studies have investigated cultural differences based on human observations but this has not been addressed in the context of automatic emotion or NVC recognition. Perhaps not surprisingly, testing an automatic system on data that is not culturally representative of the training data is seen to result in low performance. We address this problem by training and testing our system on a specific culture to enable better modeling of the cultural differences in NVC perception. The system uses linear predictor tracking, with features generated based on distances between pairs of trackers. The annotations indicate the strength of the NVC, which enables the use of ν-SVR to perform the regression.
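    A minimal sketch of the regression step, assuming tracker-pair distances as features and using scikit-learn's NuSVR as a stand-in ν-SVR implementation; the data below is synthetic:

        # Regress annotated NVC strength from pairwise tracker distances with nu-SVR.
        import numpy as np
        from sklearn.svm import NuSVR
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        pair_distances = rng.random((300, 45))           # e.g. distances between tracked point pairs
        nvc_strength = pair_distances[:, :3].mean(axis=1) + 0.05 * rng.normal(size=300)

        X_tr, X_te, y_tr, y_te = train_test_split(pair_distances, nvc_strength, random_state=0)
        model = NuSVR(nu=0.5, C=1.0, kernel="rbf").fit(X_tr, y_tr)
        print(model.score(X_te, y_te))                   # R^2 against held-out annotations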
    Holt B, Ong E-J, Cooper H, Bowden R (2011) Putting the pieces together: Connected Poselets for human pose estimation,2011 IEEE International Conference on Computer Visionpp. 1196-1201 IEEE
    We propose a novel hybrid approach to static pose estimation called Connected Poselets. This representation combines the best aspects of part-based and example-based estimation. Our method first detects poselets extracted from the training data, then applies a modified Random Decision Forest to identify poselet activations. By combining keypoint predictions from poselet activations within a graphical model, we can infer the marginal distribution over each keypoint without any kinematic constraints. Our approach is demonstrated on a new publicly available dataset with promising results.
    Lebeda K, Hadfield S, Bowden R (2015) Dense Rigid Reconstruction from Unstructured Discontinuous Video, 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW)pp. 814-822 IEEE
    Koller O, Ney H, Bowden R (2014) Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora, LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA
    Micilotta AS, Ong EJ, Bowden R (2006) Real-time upper body detection and 3D pose estimation in monoscopic images,Lecture Notes in Computer Science: Proceedings of 9th European Conference on Computer Vision, Part III3953pp. 139-150 Springer
    This paper presents a novel solution to the difficult task of both detecting and estimating the 3D pose of humans in monoscopic images. The approach consists of two parts. Firstly the location of a human is identified by a probabilistic assembly of detected body parts. Detectors for the face, torso and hands are learnt using AdaBoost. A pose likelihood is then obtained using an a priori mixture model on body configuration and possible configurations assembled from available evidence using RANSAC. Once a human has been detected, the location is used to initialise a matching algorithm which matches the silhouette and edge map of a subject with a 3D model. This is done efficiently using chamfer matching, integral images and pose estimation from the initial detection stage. We demonstrate the application of the approach to large, cluttered natural images and at near framerate operation (16fps) on lower resolution video streams.
    Mitchell TA, Bowden R, Sarhadi M (2010) Efficient Texture Analysis for Industrial Inspection, International Journal of Production Research38(4)pp. 967-984 Taylor & Francis
    This paper describes a convolution-based approach to the analysis of images containing few texture classes. Segmentation of foreground and background textures, or detection of boundaries between similarly textured objects, is demonstrated. The application to industrial inspection applications is demonstrated. Near frame-rate performance on low-cost hardware is possible, since only convolution with small kernels is used. A new algorithm to optimize convolution kernels for the required texture analysis task is presented. A key feature of the paper is the industrial readiness of the techniques described.
    KaewTraKulPong P, Bowden R (2001) An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, Proceedings of 2nd European Workshop on Advanced Video Based Surveillance Systems, AVBS01. Sept 2001. Kluwer Academic Publishers
    Real-time segmentation of moving regions in image sequences is a fundamental step in many vision systems including automated visual surveillance, human-machine interfaces, and very low-bandwidth telecommunications. A typical method is background subtraction. Many background models have been introduced to deal with different problems. One of the successful solutions to these problems is to use a multi-colour background model per pixel proposed by Grimson et al [1, 2, 3]. However, the method suffers from slow learning at the beginning, especially in busy environments. In addition, it cannot distinguish between moving shadows and moving objects. This paper presents a method which improves this adaptive background mixture model. By reinvestigating the update equations, we utilise different equations at different phases. This allows our system to learn faster and more accurately, as well as adapt effectively to changing environments. A shadow detection scheme is also introduced in this paper. It is based on a computational colour space that makes use of our background model. A comparison has been made between the two algorithms. The results show the speed of learning and the accuracy of the model using our update algorithm over the Grimson et al tracker. When incorporated with the shadow detection, our method results in far better segmentation than that of Grimson et al.
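    For illustration only (not the original implementation), OpenCV's Gaussian-mixture background subtractor with shadow detection enabled follows the same per-pixel mixture-model idea; the video filename here is hypothetical:

        # Per-pixel Gaussian-mixture background subtraction with shadow detection.
        import cv2

        cap = cv2.VideoCapture("surveillance.avi")       # hypothetical input video
        subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=True)

        while True:
            ok, frame = cap.read()
            if not ok:
                break
            mask = subtractor.apply(frame)               # 255 = foreground, 127 = shadow, 0 = background
            foreground = (mask == 255).astype("uint8") * 255
            cv2.imshow("foreground (shadows suppressed)", foreground)
            if cv2.waitKey(1) == 27:                     # Esc to quit
                break

        cap.release()
        cv2.destroyAllWindows()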
Ong EJ, Koller O, Pugeault N, Bowden R (2014) Sign spotting using hierarchical sequential patterns with temporal intervals, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1931-1938
© 2014 IEEE. This paper tackles the problem of spotting a set of signs occurring in videos with sequences of signs. To achieve this, we propose to model the spatio-temporal signatures of a sign using an extension of sequential patterns that contain temporal intervals, called Sequential Interval Patterns (SIPs). We then propose a novel multi-class classifier that organises different sequential interval patterns in a hierarchical tree structure called a Hierarchical SIP Tree (HSP-Tree). This allows one to exploit any subsequence sharing that exists between different SIPs of different classes. Multiple trees are then combined together into a forest of HSP-Trees, resulting in a strong classifier that can be used to spot signs. We then show how the HSP-Forest can be used to spot sequences of signs that occur in an input video. We have evaluated the method on both concatenated sequences of isolated signs and continuous sign sequences. We also show that the proposed method is superior in robustness and accuracy to a state-of-the-art sign recogniser when applied to spotting a sequence of signs.
Gilbert A, Bowden R (2005) Incremental modelling of the posterior distribution of objects for inter and intra camera tracking, BMVC 2005 - Proceedings of the British Machine Vision Conference 2005, BMVA
    This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method to create the spatio-temporal links between cameras, and thus model the posterior probability distribution of these links. This can then be used with an appearance model of the object to track across cameras. It requires no calibration or batch preprocessing and becomes more accurate over time as evidence is accumulated.
Oshin OT, Gilbert A, Illingworth J, Bowden R (2009) Learning to recognise spatio-temporal interest points, pp. 14-30
    In this chapter, we present a generic classifier for detecting spatio-temporal interest points within video, the premise being that, given an interest point detector, we can learn a classifier that duplicates its functionality and which is both accurate and computationally efficient. This means that interest point detection can be achieved independent of the complexity of the original interest point formulation. We extend the naive Bayesian classifier of Randomised Ferns to the spatio-temporal domain and learn classifiers that duplicate the functionality of common spatio-temporal interest point detectors. Results demonstrate accurate reproduction of results with a classifier that can be applied exhaustively to video at frame-rate, without optimisation, in a scanning window approach. © 2010, IGI Global.
Bowden R, Mitchell TA, Sarhadi M (1997) Cluster Based non-linear Principle Component Analysis, Electronics Letters, 33(22), pp. 1858-1859, The Institution of Engineering and Technology
In the field of computer vision, principal component analysis (PCA) is often used to provide statistical models of shape, deformation or appearance. This simple statistical model provides a constrained, compact approach to model-based vision. However, as larger problems are considered, high dimensionality and nonlinearity make linear PCA an unsuitable and unreliable approach. A nonlinear PCA (NLPCA) technique is proposed which uses cluster analysis and dimensional reduction to provide a fast, robust solution. Simulation results on both 2D contour models and greyscale images are presented.
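The cluster-plus-local-linear-model idea can be illustrated with off-the-shelf tools: k-means partitions the data and an ordinary PCA is fitted within each cluster, so the union of local linear patches approximates the nonlinear manifold. The cluster count, component count and toy data below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of cluster-based (piecewise-linear) PCA: k-means partitions the data,
# then a local PCA is fitted per cluster. Cluster/component counts are arbitrary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def fit_cluster_pca(X, n_clusters=5, n_components=3):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    local_models = []
    for k in range(n_clusters):
        Xk = X[km.labels_ == k]
        # Clamp the component count so small clusters do not break the fit.
        nc = min(n_components, max(1, Xk.shape[0] - 1), Xk.shape[1])
        local_models.append(PCA(n_components=nc).fit(Xk))
    return km, local_models

def encode(x, km, local_models):
    """Encode a sample with the PCA of its nearest cluster."""
    k = int(km.predict(x[None, :])[0])
    return k, local_models[k].transform(x[None, :])[0]

# Toy usage on a curved 2D point set (a stand-in for shape/appearance vectors).
t = np.linspace(0, np.pi, 500)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * np.random.randn(500, 2)
km, models = fit_cluster_pca(X, n_clusters=4, n_components=1)
print(encode(X[0], km, models))
```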
    Krejov P, Gilbert A, Bowden R (2015) Combining Discriminative and Model Based Approaches for Hand Pose Estimation, 2015 11TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), VOL. 5 IEEE
Pugeault N, Bowden R (2011) Spelling It Out: Real-Time ASL Fingerspelling Recognition, 2011 IEEE International Conference on Computer Vision Workshops, pp. 1114-1119, IEEE
This article presents an interactive hand shape recognition user interface for American Sign Language (ASL) fingerspelling. The system makes use of a Microsoft Kinect device to collect appearance and depth images, and of the OpenNI+NITE framework for hand detection and tracking. Hand shapes corresponding to letters of the alphabet are characterised using appearance and depth images and classified using random forests. We compare classification using appearance and depth images, show that a combination of both leads to the best results, and validate on a dataset of four different users. Hand shape detection works in real time and is integrated in an interactive user interface that allows the signer to select between ambiguous detections and is combined with an English dictionary for efficient writing.
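A minimal sketch of classifying hand shapes with a random forest over combined appearance and depth cues, in the spirit of the system described above. The feature extraction (resized raw patches) is a deliberately simple stand-in for the descriptors used in the paper, and the array shapes and data are hypothetical.

```python
# Sketch: finger-spelling hand-shape classification from appearance + depth
# patches using a random forest. The raw-patch features are illustrative only.
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier

def hand_features(appearance_patch, depth_patch, size=(32, 32)):
    a = cv2.resize(appearance_patch, size).astype(np.float32).ravel() / 255.0
    d = cv2.resize(depth_patch, size).astype(np.float32).ravel()
    d = (d - d.mean()) / (d.std() + 1e-6)   # make depth invariant to distance
    return np.concatenate([a, d])           # joint appearance + depth vector

def train(appearance_patches, depth_patches, labels):
    # appearance_patches/depth_patches: lists of cropped hand images (hypothetical data)
    X = np.stack([hand_features(a, d) for a, d in zip(appearance_patches, depth_patches)])
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)

def predict_letter(clf, appearance_patch, depth_patch):
    return clf.predict(hand_features(appearance_patch, depth_patch)[None, :])[0]
```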
Kristan M, Pflugfelder R, Leonardis A, Matas J, Porikli F, Čehovin L, Nebehay G, Fernandez G, Vojíř T, Gatt A, Khajenezhad A, Salahledin A, Soltani-Farani A, Zarezade A, Petrosino A, Milton A, Bozorgtabar B, Li B, Chan CS, Heng C, Ward D, Kearney D, Monekosso D, Karaimer HC, Rabiee HR, Zhu J, Gao J, Xiao J, Zhang J, Xing J, Huang K, Lebeda K, Cao L, Maresca ME, Lim MK, ELHelw M, Felsberg M, Remagnino P, Bowden R, Goecke R, Stolkin R, Lim SYY, Maher S, Poullot S, Wong S, Satoh S, Chen W, Hu W, Zhang X, Li Y, Niu Z (2013) The visual object tracking VOT2013 challenge results, Proceedings of the IEEE International Conference on Computer Vision, pp. 98-111
Visual tracking has attracted significant attention in the last few decades. The recent surge in the number of publications on tracking-related problems has made it almost impossible to follow the developments in the field. One of the reasons is that there is a lack of commonly accepted annotated data-sets and standardized evaluation protocols that would allow objective comparison of different tracking methods. To address this issue, the Visual Object Tracking (VOT) workshop was organized in conjunction with ICCV2013. Researchers from academia as well as industry were invited to participate in the first VOT2013 challenge, which aimed at single-object visual trackers that do not apply pre-learned models of object appearance (model-free). Presented here is the VOT2013 benchmark dataset for evaluation of single-object visual trackers as well as the results obtained by the trackers competing in the challenge. In contrast to related attempts in tracker benchmarking, the dataset is labeled per-frame by visual attributes that indicate occlusion, illumination change, motion change, size change and camera motion, offering a more systematic comparison of the trackers. Furthermore, we have designed an automated system for performing and evaluating the experiments. We present the evaluation protocol of the VOT2013 challenge and the results of a comparison of 27 trackers on the benchmark dataset. The dataset, the evaluation tools and the tracker rankings are publicly available from the challenge website (http://votchallenge.net). © 2013 IEEE.
Bowden R, Windridge D, Kadir T, Zisserman A, Brady M (2004) A Linguistic Feature Vector for the Visual Interpretation of Sign Language, European Conference on Computer Vision, pp. 390-401
Hadfield SJ, Lebeda K, Bowden R (2016) Stereo reconstruction using top-down cues, Computer Vision and Image Understanding, 157, pp. 206-222, Elsevier
    We present a framework which allows standard stereo reconstruction to be unified with a wide range of classic top-down cues from urban scene understanding. The resulting algorithm is analogous to the human visual system where conflicting interpretations of the scene due to ambiguous data can be resolved based on a higher level understanding of urban environments. The cues which are reformulated within the framework include: recognising common arrangements of surface normals and semantic edges (e.g. concave, convex and occlusion boundaries), recognising connected or coplanar structures such as walls, and recognising collinear edges (which are common on repetitive structures such as windows). Recognition of these common configurations has only recently become feasible, thanks to the emergence of large-scale reconstruction datasets. To demonstrate the importance and generality of scene understanding during stereo-reconstruction, the proposed approach is integrated with 3 different state-of-the-art techniques for bottom-up stereo reconstruction. The use of high-level cues is shown to improve performance by up to 15 % on the Middlebury 2014 and KITTI datasets. We further evaluate the technique using the recently proposed HCI stereo metrics, finding significant improvements in the quality of depth discontinuities, planar surfaces and thin structures.
Dowson NDH, Bowden R (2005) Simultaneous modeling and tracking (SMAT) of feature sets, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol 2, Proceedings, pp. 99-105, IEEE Computer Society
KaewTraKulPong P, Bowden R (2002) An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, In: Remagnino P, Jones GA, Paragios N, Regazzoni CS (eds.), Video-Based Surveillance Systems, 11, Springer US
Real-time segmentation of moving regions in image sequences is a fundamental step in many vision systems including automated visual surveillance, human-machine interfaces, and very low-bandwidth telecommunications. A typical method is background subtraction. Many background models have been introduced to deal with different problems. One of the successful solutions to these problems is to use a multi-colour background model per pixel proposed by Grimson et al [1, 2, 3]. However, the method suffers from slow learning at the beginning, especially in busy environments. In addition, it cannot distinguish between moving shadows and moving objects. This paper presents a method which improves this adaptive background mixture model. By reinvestigating the update equations, we utilise different equations at different phases. This allows our system to learn faster and more accurately, as well as adapt effectively to changing environments. A shadow detection scheme is also introduced in this paper. It is based on a computational colour space that makes use of our background model. A comparison has been made between the two algorithms. The results show the improved speed of learning and accuracy of the model when using our update algorithm, compared with the tracker of Grimson et al. When combined with shadow detection, our method results in far better segmentation than that of Grimson et al.
Krejov P, Gilbert A, Bowden R (2014) A Multitouchless Interface Expanding User Interaction, IEEE Computer Graphics and Applications, 34(3), pp. 40-48, IEEE Computer Society
    Oshin O, Gilbert A, Bowden R (2014) Capturing relative motion and finding modes for action recognition in the wild, Computer Vision and Image Understanding
    "Actions in the wild" is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies [1] or Internet databases [2]. This paper presents an approach to the categorisation of such activity in video, which is based solely on the relative distribution of spatio-temporal interest points. Presenting the Relative Motion Descriptor, we show that the distribution of interest points alone (without explicitly encoding their neighbourhoods) effectively describes actions. Furthermore, given the huge variability of examples within action classes in natural settings, we propose to further improve recognition by automatically detecting outliers, and breaking complex action categories into multiple modes. This is achieved using a variant of Random Sampling Consensus (RANSAC), which identifies and separates the modes. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. We demonstrate state-of-the-art performance on five human action datasets. © 2014 Elsevier Inc. All rights reserved.
Hadfield S, Bowden R (2015) Exploiting high level scene cues in stereo reconstruction, 2015 IEEE International Conference on Computer Vision (ICCV), pp. 783-791, IEEE
Ellis L, Bowden R (2007) Learning Responses to Visual Stimuli: A Generic Approach, Proceedings of the 5th International Conference on Computer Vision Systems, Applied Computer Science Group, Bielefeld University, Germany
    A general framework for learning to respond appropriately to visual stimulus is presented. By hierarchically clustering percept-action exemplars in the action space, contextually important features and relationships in the perceptual input space are identified and associated with response models of varying generality. Searching the hierarchy for a set of best matching percept models yields a set of action models with likelihoods. By posing the problem as one of cost surface optimisation in a probabilistic framework, a particle filter inspired forward exploration algorithm is employed to select actions from multiple hypotheses that move the system toward a goal state and to escape from local minima. The system is quantitatively and qualitatively evaluated in both a simulated shape sorter puzzle and a real-world autonomous navigation domain.
Cooper HM (2010) Sign Language Recognition: Generalising to More Complex Corpora.
The aim of this thesis is to find new approaches to Sign Language Recognition (SLR) which are suited to working with the limited corpora currently available. Data available for SLR is of limited quality; low resolution and frame rates make the task of recognition even more complex. The content is rarely natural, concentrating on isolated signs and filmed under laboratory conditions. In addition, the amount of accurately labelled data is minimal. To this end, several contributions are made: tracking the hands is eschewed in favour of detection-based techniques more robust to noise; these are investigated both for whole signs and for linguistically-motivated sign sub-units, to make best use of limited data sets. Finally, an algorithm is proposed to learn signs from the inset signers on TV, with the aid of the accompanying subtitles, thus increasing the corpus of data available. Tracking fast moving hands under laboratory conditions is a complex task; moving to real world data makes the challenge even greater. When using tracked data as a base for SLR, the errors in the tracking are compounded at the classification stage. Proposed instead is a novel sign detection method, which views space-time as a 3D volume and the sign within it as an object to be located. Features are combined into strong classifiers using a novel boosting implementation designed to create optimal classifiers over sparse datasets. Using boosted volumetric features, on a robust frame differenced input, average classification rates reach 71% on seen signers and 66% on a mixture of seen and unseen signers, with individual sign classification rates reaching 95%. Using a classifier-per-sign approach to SLR means that data sets need to contain numerous examples of the signs to be learnt. Instead, this thesis proposes learnt classifiers to detect the common sub-units of sign. The responses of these classifiers can then be combined for recognition at the sign level. This approach requires fewer examples per sign to be learnt, since the sub-unit detectors are trained on data from multiple signs. It is also faster at detection time since there are fewer classifiers to consult, the number of these being limited by the linguistics of sign and not the number of signs being detected. For this method, appearance based boosted classifiers are introduced to distinguish the sub-units of sign. Results show that, when combined with temporal models, these novel sub-unit classifiers can outperform similar classifiers.
Sanfeliu A, Andrade-Cetto J, Barbosa M, Bowden R, Capitan J, Corominas A, Gilbert A, Illingworth J, Merino L, Mirats JM, Moreno P, Ollero A, Sequeira J, Spaan MTJ (2010) Decentralized Sensor Fusion for Ubiquitous Networking Robotics in Urban Areas, Sensors, 10(3), pp. 2274-2314, MDPI
    In this article we explain the architecture for the environment and sensors that has been built for the European project URUS (Ubiquitous Networking Robotics in Urban Sites), a project whose objective is to develop an adaptable network robot architecture for cooperation between network robots and human beings and/or the environment in urban areas. The project goal is to deploy a team of robots in an urban area to give a set of services to a user community. This paper addresses the sensor architecture devised for URUS and the type of robots and sensors used, including environment sensors and sensors onboard the robots. Furthermore, we also explain how sensor fusion takes place to achieve urban outdoor execution of robotic services. Finally some results of the project related to the sensor network are highlighted.
Merino L, Gilbert A, Bowden R, Illingworth J, Capitán J, Ollero A (2012) Data fusion in ubiquitous networked robot systems for urban services, Annales des Télécommunications/Annals of Telecommunications, pp. 1-21
There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within the buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, by being more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimations of the positions of people from cameras on board robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to provide an increase in performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time on a large outdoor environment, including 22 non-overlapping cameras, a WSN, and several robots. © 2012 Institut Mines-Télécom and Springer-Verlag.
Bowden R, Heap AJ, Hogg DC (1997) Real Time Hand Tracking and Gesture Recognition as a 3D Input Device for Graphical Applications, Proceedings of Gesture Workshop '96, pp. 117-129, Springer-Verlag
This paper outlines the design and implementation of a 3D input device for graphical applications which uses real-time hand tracking and gesture recognition to provide the user with an intuitive interface for tomorrow's applications. Point Distribution Models (PDMs) have been shown to be successful at tracking deformable objects. This system demonstrates how these 'smart snakes' can be used in real time with a real world problem. The system is based upon Open Inventor and designed for use with Silicon Graphics Indy Workstations, but provisions have been made for the move to other platforms and applications. We demonstrate how PDMs provide the ideal feature vector for model classification. It is shown how computer vision can provide a low cost, intuitive interface that has few hardware constraints. We also give the reader an insight into the next generation of HCI and Multimedia, providing a 3D scene viewer and VRML browser based upon the hand tracker. Further allowances have been made to facilitate the inclusion of the hand tracker within third party Inventor applications. All source code, libraries and applications can be downloaded for free from the above web addresses. This paper demonstrates how computer vision and computer graphics can work together, providing an interdisciplinary approach to problem solving.
Sheerman-Chase T, Ong E-J, Bowden R (2009) Online learning of robust facial feature trackers, 2009 IEEE 12th International Conference on Computer Vision Workshops, pp. 1386-1392, IEEE
    This paper presents a head pose and facial feature estimation technique that works over a wide range of pose variations without a priori knowledge of the appearance of the face. Using simple LK trackers, head pose is estimated by Levenberg-Marquardt (LM) pose estimation using the feature tracking as constraints. Factored sampling and RANSAC are employed to both provide a robust pose estimate and identify tracker drift by constraining outliers in the estimation process. The system provides both a head pose estimate and the position of facial features and is capable of tracking over a wide range of head poses.
Okwechime D, Ong E-J, Bowden R (2009) Real-time motion control using pose space probability density estimation, 2009 IEEE 12th International Conference on Computer Vision Workshops, pp. 2056-2063, IEEE
We introduce a new algorithm for real-time interactive motion control and demonstrate its application to motion captured data, pre-recorded videos and HCI. Firstly, a data set of frames is projected into a lower dimensional space. An appearance model is learnt using a multivariate probability distribution. A novel approach to determining transition points is presented based on k-medoids, whereby appropriate points of intersection in the motion trajectory are derived as cluster centres. These points are used to segment the data into smaller subsequences. A transition matrix combined with a kernel density estimation is used to determine suitable transitions between the subsequences to develop novel motion. To facilitate real-time interactive control, conditional probabilities are used to derive motion given user commands. The user commands can come from any modality including auditory, touch and gesture. The system is also extended to HCI using audio signals of speech in a conversation to trigger non-verbal responses from a synthetic listener in real time. We demonstrate the flexibility of the model by presenting results ranging from data sets composed of vectorised images, 2D and 3D point representations. Results show real-time interaction and plausible motion generation between different types of movement.
Krejov P, Bowden R (2013) Multi-touchless: Real-time fingertip detection and tracking using geodesic maxima, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013, pp. 1-7, IEEE
Since the advent of multitouch screens users have been able to interact using fingertip gestures in a two dimensional plane. With the development of depth cameras, such as the Kinect, attempts have been made to reproduce the detection of gestures for three dimensional interaction. Many of these use contour analysis to find the fingertips, however the success of such approaches is limited due to sensor noise and rapid movements. This paper discusses an approach to identify fingertips during rapid movement at varying depths, allowing multitouch without contact with a screen. To achieve this, we use a weighted graph that is built using the depth information of the hand to determine the geodesic maxima of the surface. Fingertips are then selected from these maxima using a simplified model of the hand and correspondence found over successive frames. Our experiments show real-time performance for multiple users, providing tracking at 30 fps for up to 4 hands, and we compare our results with state-of-the-art methods, providing accuracy an order of magnitude better than existing approaches. © 2013 IEEE.
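The following sketch illustrates the core idea of finding geodesic extrema on a segmented hand depth map: a weighted graph is built over hand pixels, and points that are geodesically farthest from the hand centroid are taken as fingertip candidates. The hand mask is assumed to be given, the connectivity and thresholds are illustrative, and the hand-model filtering and temporal correspondence stages from the paper are omitted.

```python
# Sketch: geodesic extrema on a segmented hand depth image via Dijkstra on a
# 4-connected pixel graph. Hand segmentation and the simplified hand model
# used to pick exactly five fingertips are out of scope here.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_extrema(depth, mask, n_points=5, depth_jump=20.0):
    depth = depth.astype(np.float32)
    h, w = depth.shape
    ys, xs = np.nonzero(mask)                       # hand pixels
    idx = -np.ones((h, w), dtype=np.int64)
    idx[ys, xs] = np.arange(len(ys))                # pixel -> graph node id
    rows, cols, wts = [], [], []
    for dy, dx in ((0, 1), (1, 0)):                 # right and down neighbours
        a = idx[ys, xs]
        ny, nx = np.minimum(ys + dy, h - 1), np.minimum(xs + dx, w - 1)
        inside = (ys + dy < h) & (xs + dx < w)
        b = np.where(inside, idx[ny, nx], -1)
        dz = np.abs(depth[ys, xs] - depth[ny, nx])
        valid = inside & (b >= 0) & (dz < depth_jump)   # no edges across depth jumps
        rows.append(a[valid]); cols.append(b[valid])
        wts.append(np.hypot(1.0, dz[valid]))            # edge weight ~ surface distance
    n = len(ys)
    g = coo_matrix((np.concatenate(wts),
                    (np.concatenate(rows), np.concatenate(cols))), shape=(n, n))
    # Seed at the hand centroid, then repeatedly take the farthest geodesic point.
    cy, cx = ys.mean(), xs.mean()
    seeds = [int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))]
    extrema = []
    for _ in range(n_points):
        dist = dijkstra(g, directed=False, indices=seeds).min(axis=0)
        far = int(np.argmax(np.where(np.isfinite(dist), dist, -1.0)))
        extrema.append((int(ys[far]), int(xs[far])))
        seeds.append(far)
    return extrema
```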
Hadfield SJ, Bowden R (2013) Scene Particles: Unregularized Particle Based Scene Flow Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3), pp. 564-576, IEEE Computer Society
In this paper, an algorithm is presented for estimating scene flow, which is a richer, 3D analogue of optical flow. The approach operates orders of magnitude faster than alternative techniques, and is well suited to further performance gains through parallelised implementation. The algorithm employs multiple hypotheses to deal with motion ambiguities, rather than the traditional smoothness constraints, removing oversmoothing errors and providing significant performance improvements on benchmark data over the previous state of the art. The approach is flexible, and capable of operating with any combination of appearance and/or depth sensors, in any setup, simultaneously estimating the structure and motion if necessary. Additionally, the algorithm propagates information over time to resolve ambiguities, rather than performing an isolated estimation at each frame, as in contemporary approaches. Approaches to smoothing the motion field without sacrificing the benefits of multiple hypotheses are explored, and a probabilistic approach to occlusion estimation is demonstrated, leading to 10% and 15% improved performance respectively. Finally, a data driven tracking approach is described, and used to estimate the 3D trajectories of hands during sign language, without the need to model complex appearance variations at each viewpoint.
Moore S, Bowden R (2009) The Effects of Pose On Facial Expression Recognition, Proceedings of the British Machine Vision Conference, pp. 1-11, BMVA Press
Research into facial expression recognition has predominantly been based upon near frontal view data. However, a recent 3D facial expression database (BU-3DFE database) has allowed empirical investigation of facial expression recognition across pose. In this paper, we investigate the effects of pose from frontal to profile view on facial expression recognition. Experiments are carried out on 100 subjects with 5 yaw angles over 6 prototypical expressions. Expressions have 4 levels of intensity from subtle to exaggerated. We evaluate features such as local binary patterns (LBPs) as well as various extensions of LBPs. In addition, a novel approach to facial expression recognition is proposed using local Gabor binary patterns (LGBPs). Multi-class support vector machines (SVMs) are used for classification. We investigate the effects of image resolution and pose on facial expression classification using a variety of different features.
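A minimal sketch of the LBP-plus-SVM pipeline underlying this line of work: uniform LBP histograms are computed over a grid of face cells, concatenated, and classified with a multi-class SVM. The grid size, LBP parameters and kernel below are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
# Sketch: uniform LBP grid histograms + multi-class SVM for expression recognition.
# Parameters and the input face crops are illustrative, not the paper's setup.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_grid_histogram(gray_face, grid=(7, 7), P=8, R=1):
    lbp = local_binary_pattern(gray_face, P, R, method="uniform")
    n_bins = P + 2                                # uniform patterns + one "non-uniform" bin
    h, w = lbp.shape
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            cell = lbp[gy * h // grid[0]:(gy + 1) * h // grid[0],
                       gx * w // grid[1]:(gx + 1) * w // grid[1]]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)                  # one descriptor per face

def train_expression_svm(face_crops, labels):
    # face_crops: aligned grayscale faces; labels: expression classes (hypothetical data)
    X = np.stack([lbp_grid_histogram(f) for f in face_crops])
    return SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, labels)
```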
Ong EJ, Bowden R (2004) A boosted classifier tree for hand shape detection, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Proceedings, pp. 889-894, IEEE Computer Society
Holt B, Bowden R (2012) Static pose estimation from depth images using random regression forests and Hough voting, VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications, 1, pp. 557-564
Robust and fast algorithms for estimating the pose of a human given an image would have a far reaching impact on many fields in and outside of computer vision. We address the problem using depth data that can be captured inexpensively using consumer depth cameras such as the Kinect sensor. To achieve robustness and speed on a small training dataset, we formulate the pose estimation task within a regression and Hough voting framework. Our approach uses random regression forests to predict joint locations from each pixel and accumulate these predictions with Hough voting. The Hough accumulator images are treated as likelihood distributions where maxima correspond to joint location hypotheses. We demonstrate our approach and compare to the state-of-the-art on a publicly available dataset.
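The Hough-voting stage of this style of approach can be sketched in a few lines: each foreground pixel casts a vote for a joint location using a regressed offset, and the accumulator maximum is taken as the joint hypothesis. The per-pixel regressor (a random regression forest in the paper) is abstracted behind a callback here, and the weighting option is a placeholder.

```python
# Sketch of per-pixel offset regression accumulated with Hough voting.
# `predict_offset` stands in for the trained random regression forest.
import numpy as np

def hough_vote_joint(depth, mask, predict_offset, vote_weight=None):
    """depth: HxW depth image; mask: HxW foreground mask;
    predict_offset(y, x, depth) -> (dy, dx) regressed offset towards the joint."""
    h, w = depth.shape
    accumulator = np.zeros((h, w), dtype=np.float64)
    for y, x in zip(*np.nonzero(mask)):
        dy, dx = predict_offset(y, x, depth)
        vy, vx = int(round(y + dy)), int(round(x + dx))
        if 0 <= vy < h and 0 <= vx < w:
            accumulator[vy, vx] += 1.0 if vote_weight is None else vote_weight(y, x)
    # The accumulator is treated as a likelihood; its maximum is the joint hypothesis.
    joint = np.unravel_index(np.argmax(accumulator), accumulator.shape)
    return joint, accumulator
```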
Kadir T, Bowden R, Ong EJ, Zisserman A (2004) Minimal Training, Large Lexicon, Unconstrained Sign Language Recognition, BMVC 2004 Electronic Proceedings, pp. 939-948, The British Machine Vision Association and Society for Pattern Recognition
This paper presents a flexible monocular system capable of recognising sign lexicons far greater in number than previous approaches. The power of the system is due to four key elements: (i) head and hand detection based upon boosting, which removes the need for temperamental colour segmentation; (ii) a body centred description of activity which overcomes issues with camera placement, calibration and user; (iii) a two stage classification in which stage I generates a high level linguistic description of activity which naturally generalises and hence reduces training; (iv) a stage II classifier bank which does not require HMMs, further reducing training requirements. The outcome is a system capable of running in real time and generating extremely high recognition rates for large lexicons with as little as a single training instance per sign. We demonstrate classification rates as high as 92% for a lexicon of 164 words with extremely low training requirements, outperforming previous approaches where thousands of training examples are required.
Bowden R, Cox S, Harvey R, Lan Y, Ong E-J, Owen G, Theobald B-J (2013) Recent developments in automated lip-reading, Optics and Photonics for Counterterrorism, Crime Fighting and Defence IX; and Optical Materials and Biomaterials in Security and Defence Systems Technology X, 8901, SPIE
Micilotta A, Bowden R (2004) View-based Location and Tracking of Body Parts for Visual Interaction, BMVC 2004 Electronic Proceedings, pp. 849-858, The British Machine Vision Association and Society for Pattern Recognition
This paper presents a real time approach to locate and track the upper torso of the human body. Our main interest is not in 3D biometric accuracy, but rather a sufficient discriminatory representation for visual interaction. The algorithm employs background suppression and a general approximation to body shape, applied within a particle filter framework, making use of integral images to maintain real-time performance. Furthermore, we present a novel method to disambiguate the hands of the subject and to predict the likely position of elbows. The final system is demonstrated segmenting multiple subjects from a cluttered scene at above real-time rates.
    Efthimiou E, Fotinea SE, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Goudenove F (2010) DICTA-SIGN: Sign Language Recognition, Generation and Modelling with application in Deaf Communication, pp. 80-84
Gilbert A, Bowden R (2017) Image and Video Mining through Online Learning, Computer Vision and Image Understanding, 158, pp. 72-84, Elsevier
Within the field of image and video recognition, the traditional approach is a dataset split into fixed training and test partitions. However, the labelling of the training set is time-consuming, especially as datasets grow in size and complexity. Furthermore, this approach is not applicable to the home user, who wants to intuitively group their media without tirelessly labelling the content. Consequently, we propose a solution similar in nature to an active learning paradigm, where a small subset of media is labelled as semantically belonging to the same class, and machine learning is then used to pull this and other related content together in the feature space. Our interactive approach is able to iteratively cluster classes of images and video. We reformulate it in an online learning framework and demonstrate competitive performance to batch learning approaches using only a fraction of the labelled data. Our approach is based around the concept of an image signature which, unlike a standard bag of words model, can express co-occurrence statistics as well as symbol frequency. We efficiently compute metric distances between signatures despite their inherent high dimensionality and provide discriminative feature selection, to allow common and distinctive elements to be identified from a small set of user labelled examples. These elements are then accentuated in the image signature to increase similarity between examples and pull correct classes together. By repeating this process in an online learning framework, the accuracy of similarity increases dramatically despite labelling only a few training examples. To demonstrate that the approach is agnostic to media type and features used, we evaluate on three image datasets (15 scene, Caltech101 and FG-NET), a mixed text and image dataset (ImageTag), a dataset used in active learning (Iris) and on three action recognition datasets (UCF11, KTH and Hollywood2). On the UCF11 video dataset, the accuracy is 86.7% despite using only 90 labelled examples from a dataset of over 1200 videos, instead of the standard 1122 training videos. The approach is both scalable and efficient, with a single iteration over the full UCF11 dataset of around 1200 videos taking approximately 1 minute on a standard desktop machine.
Hadfield Simon, Bowden Richard (2012) Generalised Pose Estimation Using Depth, In: Kutulakos KN (ed.), Trends and Topics in Computer Vision. ECCV 2010. Lecture Notes in Computer Science, 6553, pp. 312-325, Springer
Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information, extracted from a calibrated stereo sequence, is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.
Hadfield Simon J., Bowden Richard (2010) Generalised Pose Estimation Using Depth, In Proceedings, European Conference on Computer Vision (Workshops)
    Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information extracted from a calibrated stereo sequence, is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.
Krejov P, Gilbert A, Bowden R (2016) Guided Optimisation through Classification and Regression for Hand Pose Estimation, Computer Vision and Image Understanding, 155, pp. 124-138, Elsevier
    This paper presents an approach to hand pose estimation that combines discriminative and model-based methods to leverage the advantages of both. Randomised Decision Forests are trained using real data to provide fast coarse segmentation of the hand. The segmentation then forms the basis of constraints applied in model fitting, using an efficient projected Gauss-Seidel solver, which enforces temporal continuity and kinematic limitations. However, when fitting a generic model to multiple users with varying hand shape, there is likely to be residual errors between the model and their hand. Also, local minima can lead to failures in tracking that are difficult to recover from. Therefore, we introduce an error regression stage that learns to correct these instances of optimisation failure. The approach provides improved accuracy over the current state of the art methods, through the inclusion of temporal cohesion and by learning to correct from failure cases. Using discriminative learning, our approach performs guided optimisation, greatly reducing model fitting complexity and radically improves efficiency. This allows tracking to be performed at over 40 frames per second using a single CPU thread.
Hadfield Simon, Lebeda K, Bowden Richard (2016) Hollywood 3D: What are the best 3D features for Action Recognition?, International Journal of Computer Vision, 121(1), pp. 95-110, Springer Verlag
Action recognition 'in the wild' is extremely challenging, particularly when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent growth of 3D data in broadcast content and commercial depth sensors makes it possible to overcome this. However, there is little work examining the best way to exploit this new modality. In this paper we introduce the Hollywood 3D benchmark, which is the first dataset containing 'in the wild' action footage including 3D data. This dataset consists of 650 stereo video clips across 14 action classes, taken from Hollywood movies. We provide stereo calibrations and depth reconstructions for each clip. We also provide an action recognition pipeline, and propose a number of specialised depth-aware techniques including five interest point detectors and three feature descriptors. Extensive tests allow evaluation of different appearance and depth encoding schemes. Our novel techniques exploiting this depth allow us to reach performance levels more than triple those of the best baseline algorithm using only appearance information. The benchmark data, code and calibrations are all made available to the community.
Lebeda K, Hadfield SJ, Bowden R (2017) TMAGIC: A Model-free 3D Tracker, IEEE Transactions on Image Processing, 26(9), pp. 4378-4388, IEEE
Significant effort has been devoted within the visual tracking community to rapid learning of object properties on the fly. However, state-of-the-art approaches still often fail in cases such as rapid out-of-plane rotation, when the appearance changes suddenly. One of the major contributions of this work is a radical rethinking of the traditional wisdom of modelling 3D motion as appearance change during tracking. Instead, 3D motion is modelled as 3D motion. This intuitive but previously unexplored approach provides new possibilities in visual tracking research. Firstly, 3D tracking is more general, as large out-of-plane motion is often fatal for 2D trackers, but helps 3D trackers to build better models. Secondly, the tracker's internal model of the object can be used in many different applications and it could even become the main motivation, with tracking supporting reconstruction rather than vice versa. This effectively bridges the gap between visual tracking and Structure from Motion. A new benchmark dataset of sequences with extreme out-of-plane rotation is presented and an online leader-board offered to stimulate new research in the relatively underdeveloped area of 3D tracking. The proposed method, provided as a baseline, is capable of successfully tracking these sequences, all of which pose a considerable challenge to 2D trackers (error reduced by 46 %).
Allday R, Hadfield S, Bowden R (2017) From Vision to Grasping: Adapting Visual Networks, TAROS-2017 Conference Proceedings, Lecture Notes in Computer Science, 10454, pp. 484-494, Springer
    Grasping is one of the oldest problems in robotics and is still considered challenging, especially when grasping unknown objects with unknown 3D shape. We focus on exploiting recent advances in computer vision recognition systems. Object classification problems tend to have much larger datasets to train from and have far fewer practical constraints around the size of the model and speed to train. In this paper we will investigate how to adapt Convolutional Neural Networks (CNNs), traditionally used for image classification, for planar robotic grasping. We consider the differences in the problems and how a network can be adjusted to account for this. Positional information is far more important to robotics than generic image classification tasks, where max pooling layers are used to improve translation invariance. By using a more appropriate network structure we are able to obtain improved accuracy while simultaneously improving run times and reducing memory consumption by reducing model size by up to 69%.
Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R (2017) Taking the Scenic Route to 3D: Optimising Reconstruction from Moving Cameras, ICCV 2017 Proceedings, IEEE
Reconstruction of 3D environments is a problem that has been widely addressed in the literature. While many approaches exist to perform reconstruction, few of them take an active role in deciding where the next observations should come from. Furthermore, the problem of travelling from the camera's current position to the next, known as path-planning, usually focuses on minimising path length. This approach is ill-suited for reconstruction applications, where learning about the environment is more valuable than speed of traversal. We present a novel Scenic Route Planner that selects paths which maximise information gain, both in terms of total map coverage and reconstruction accuracy. We also introduce a new type of collaborative behaviour into the planning stage called opportunistic collaboration, which allows sensors to switch between acting as independent Structure from Motion (SfM) agents or as a variable baseline stereo pair. We show that Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 0.00027% of the possible stereo pairs (3% of the views). Comparison against length-based path-planning approaches shows that our approach produces more complete and more accurate maps with fewer frames. Finally, we demonstrate the Scenic Pathplanner's ability to generalise to live scenarios by mounting cameras on autonomous ground-based sensor platforms and exploring an environment.
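The intuition behind planning for information gain rather than path length can be illustrated with a greedy view-selection loop: among candidate views, repeatedly pick the one whose predicted visible map cells add the most new coverage. This is a simplified stand-in for the Scenic Route Planner, and the visibility model and candidate set are placeholders supplied by the caller.

```python
# Sketch: greedy information-gain view selection (coverage-based), illustrating
# the idea of preferring informative views over short paths. Not the paper's
# exact formulation; `visible_cells` is an assumed visibility model.
import numpy as np

def greedy_view_sequence(candidate_views, visible_cells, n_views):
    """candidate_views: list of poses; visible_cells(view) -> set of map-cell ids."""
    covered, chosen = set(), []
    remaining = list(candidate_views)
    for _ in range(n_views):
        gains = [len(visible_cells(v) - covered) for v in remaining]
        best = int(np.argmax(gains))
        if gains[best] == 0:                 # nothing new left to observe
            break
        chosen.append(remaining[best])
        covered |= visible_cells(remaining[best])
        remaining.pop(best)
    return chosen, covered
```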
Camgöz N, Hadfield SJ, Koller O, Bowden R (2017) SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition, ICCV 2017 Proceedings, IEEE
We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as 'Sequence-to-sequence' learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while remaining trainable end-to-end. The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also allow us to implicitly perform transfer learning between different interrelated tasks, which also allows us to exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties serves to significantly improve the performance of the overarching recognition system, by better constraining the learning problem. The proposed techniques are demonstrated in the challenging domain of sign language recognition. We demonstrate state-of-the-art performance on hand-shape recognition (outperforming previous techniques by more than 30%). Furthermore, we are able to obtain comparable sign recognition rates to previous research, without the need for an alignment step to segment out the signs for recognition.
Cooper Helen, Ong Eng-Jon, Pugeault Nicolas, Bowden Richard (2017) Sign Language Recognition Using Sub-units, In: Escalera Sergio, Guyon Isabelle, Athitsos Vassilis (eds.), Gesture Recognition, pp. 89-118, Springer International Publishing
    This chapter discusses sign language recognition using linguistic sub-units. It presents three types of sub-units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data. These sub-units are then combined using a sign level classifier; here, two options are presented. The first uses Markov Models to encode the temporal changes between sub-units. The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information. This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains to 76%.
Camgöz N, Hadfield S, Bowden R (2017) Particle Filter based Probabilistic Forced Alignment for Continuous Gesture Recognition, IEEE International Conference on Computer Vision Workshops (ICCVW) 2017, pp. 3079-3085, IEEE
In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatio-temporal deep neural networks using weak border level annotations. The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples, and using this as the prior for the next sampling stage. We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index Score on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner independent and can easily be combined with other approaches.
Autonomous 3D reconstruction, the process whereby an agent can produce its own representation of the world, is an extremely challenging area in both vision and robotics. However, 3D reconstructions have the ability to grant robots the understanding of the world necessary for collaboration and high-level goal execution. Therefore, this thesis aims to explore methods that will enable modern robotic systems to autonomously and collaboratively achieve an understanding of the world. In the real world, reconstructing a 3D scene requires nuanced understanding of the environment. Additionally, it is not enough to simply 'understand' the world; autonomous agents must be capable of actively acquiring this understanding. Achieving all of this using simple monocular sensors is extremely challenging. Agents must be able to understand what areas of the world are navigable, how egomotion affects reconstruction and how other agents may be leveraged to provide an advantage. All of this must be considered in addition to the traditional 3D reconstruction issues of correspondence estimation, triangulation and data association. Simultaneous Localisation and Mapping (SLAM) solutions are not particularly well suited to autonomous multi-agent reconstruction. They typically require the sensors to be in constant communication, do not scale well with the number of agents (or map size) and require expensive optimisations. Instead, this thesis attempts to develop more pro-active techniques from the ground up. First, an autonomous agent must have the ability to actively select what it is going to reconstruct. Known as view-selection, or Next-Best View (NBV), this has recently become an active topic in autonomous robotics and will form the first contribution of this thesis. Second, once a view is selected, an autonomous agent must be able to plan a trajectory to arrive at that view. This problem, known as path-planning, can be considered a core topic in the robotics field and will form the second contribution of this thesis. Finally, the 3D reconstruction must be anchored to a globally consistent map that co-relates to the real world. This will be addressed as a floorplan localisation problem, an emerging field for the vision community, and will be the third contribution of this thesis. To give autonomous agents the ability to actively select what data to process, this thesis discusses the NBV problem in the context of Multi-View Stereo (MVS). The proposed approach has the ability to massively reduce the amount of computing resources required for any given 3D reconstruction. More importantly, it autonomously selects the views that improve the reconstruction the most. All of this is done exclusively on the sensor pose; the images are not used for view-selection and only loaded into memory once they have been selected for reconstruction. Experimental evaluation shows that NBV applied to this problem can achieve results comparable to state-of-the-art using as little as 3.8% of the views. To provide the ability to execute an autonomous 3D reconstruction, this thesis proposes a novel computer-vision based goal-estimation and path-planning approach. The method proposed in the previous chapter is extended into a continuous pose-space. The resulting view then becomes the goal of a Scenic Pathplanner that plans a trajectory between the current robot pose and the NBV. This is done using an NBV-based pose-space that biases the paths towards areas of high information gain. Experimental evaluation shows that the Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 3% of the views, which corresponds to 2.7 × 10⁻⁴ % of the possible stereo pairs (using a naive interpretation of plausible stereo pairs). Comparison against length-based path-planning approaches shows that the Scenic Pathplanner produces more complete and more accurate maps with fewer frames. Finally, the ability of the Scenic Pathplanner to generalise to live scenarios is demonstrated.
Hadfield Simon, Lebeda Karel, Bowden Richard (2018) HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers (IEEE)
This paper proposes a Hybrid Approximate Representation (HAR) based on unifying several efficient approximations of the generalized reprojection error (which is known as the gold standard for multiview geometry). The HAR is an over-parameterization scheme where the approximation is applied simultaneously in multiple parameter spaces. A joint minimization scheme, 'HAR-Descent', can then solve the PnP problem efficiently, while remaining robust to approximation errors and local minima. The technique is evaluated extensively, including numerous synthetic benchmark protocols and the real-world data evaluations used in previous works. The proposed technique was found to have runtime complexity comparable to the fastest O(n) techniques, and up to 10 times faster than current state of the art minimization approaches. In addition, the accuracy exceeds that of all 9 previous techniques tested, providing definitive state of the art performance on the benchmarks, across all 90 of the experiments in the paper and supplementary material.
    Ebling S, Camgöz N, Boyes Braem P, Tissi K, Sidler-Miserez S, Stoll S, Hadfield S, Haug T, Bowden R, Tornay S, Razaviz M, Magimai-Doss M (2018) SMILE Swiss German Sign Language Dataset,Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC) 2018 The European Language Resources Association (ELRA)
Sign language recognition (SLR) involves identifying the form and meaning of isolated signs or sequences of signs. To our knowledge, the combination of SLR and sign language assessment is novel. The goal of an ongoing three-year project in Switzerland is to pioneer an assessment system for lexical signs of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) that relies on SLR. The assessment system aims to give adult L2 learners of DSGS feedback on the correctness of the manual parameters (handshape, hand position, location, and movement) of isolated signs they produce. In its initial version, the system will include automatic feedback for a subset of a DSGS vocabulary production test consisting of 100 lexical items. To provide the SLR component of the assessment system with sufficient training samples, a large-scale dataset containing videotaped repeated productions of the 100 items of the vocabulary test with associated transcriptions and annotations was created, consisting of data from 11 adult L1 signers and 19 adult L2 learners of DSGS. This paper introduces the dataset, which will be made available to the research community.
Mendez Maldonado Oscar, Hadfield Simon, Pugeault Nicolas, Bowden Richard (2018) SeDAR - Semantic Detection and Ranging: Humans can localise without LiDAR, can robots?, Proceedings of the 2018 IEEE International Conference on Robotics and Automation, May 21-25, 2018, Brisbane, Australia, IEEE
    How does a person work out their location using a floorplan? It is probably safe to say that we do not explicitly measure depths to every visible surface and try to match them against different pose estimates in the floorplan. And yet, this is exactly how most robotic scan-matching algorithms operate. Similarly, we do not extrude the 2D geometry present in the floorplan into 3D and try to align it to the real-world. And yet, this is how most vision-based approaches localise. Humans do the exact opposite. Instead of depth, we use high level semantic cues. Instead of extruding the floorplan up into the third dimension, we collapse the 3D world into a 2D representation. Evidence of this is that many of the floorplans we use in everyday life are not accurate, opting instead for high levels of discriminative landmarks. In this work, we use this insight to present a global localisation approach that relies solely on the semantic labels present in the floorplan and extracted from RGB images. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary as we can achieve results comparable to state-of-the-art without them.
Camgöz Necati Cihan, Hadfield Simon, Koller O, Ney H, Bowden Richard (2018) Neural Sign Language Translation, Proceedings CVPR 2018, pp. 7784-7793, IEEE
Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar. We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.
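The sequence-to-sequence idea behind neural sign language translation can be sketched with a bare encoder-decoder: an encoder consumes per-frame visual features and a decoder emits spoken-language tokens. The sketch below is a minimal GRU encoder-decoder without attention; the feature dimensions, vocabulary size and random data are placeholders and this is not the architecture evaluated in the paper.

```python
# Minimal seq2seq sketch for video-to-text translation (PyTorch). Dimensions,
# vocabulary and the frame-feature extractor are illustrative assumptions.
import torch
import torch.nn as nn

class Sign2Text(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, vocab_size=3000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, target_tokens):
        # frame_feats: (B, T_frames, feat_dim); target_tokens: (B, T_words)
        _, h = self.encoder(frame_feats)             # summarise the sign video
        dec_in = self.embed(target_tokens[:, :-1])   # teacher forcing
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)                     # logits for the next word

# Toy forward/backward pass with random data (shapes only, for illustration).
model = Sign2Text()
feats = torch.randn(2, 100, 1024)                    # 2 videos, 100 frames each
words = torch.randint(0, 3000, (2, 12))              # 12-token target sentences
logits = model(feats, words)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3000), words[:, 1:].reshape(-1))
loss.backward()
```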
Hadfield Simon J., Bowden Richard (2012) Go With The Flow: Hand Trajectories in 3D via Clustered Scene Flow, In Proceedings, International Conference on Image Analysis and Recognition, LNCS 7, pp. 285-295
    Tracking hands and estimating their trajectories is useful in a number of tasks, including sign language recognition and human computer interaction. Hands are extremely difficult objects to track: their deformability, frequent self-occlusions and motion blur cause appearance variations too great for most standard object trackers to deal with robustly. In this paper, the 3D motion field of a scene (known as the Scene Flow, in contrast to Optical Flow, which is its projection onto the image plane) is estimated using a recently proposed algorithm, inspired by particle filtering. Unlike previous techniques, this scene flow algorithm does not introduce blurring across discontinuities, making it far more suitable for object segmentation and tracking. Additionally the algorithm operates several orders of magnitude faster than previous scene flow estimation systems, enabling the use of Scene Flow in real-time, and near real-time applications. A novel approach to trajectory estimation is then introduced, based on clustering the estimated scene flow field in both space and velocity dimensions. This allows estimation of object motions in the true 3D scene, rather than the traditional approach of estimating 2D image plane motions. By working in the scene space rather than the image plane, the constant velocity assumption, commonly used in the prediction stage of trackers, is far more valid, and the resulting motion estimate is richer, providing information on out of plane motions. To evaluate the performance of the system, 3D trajectories are estimated on a multi-view sign-language dataset, and compared to a traditional high accuracy 2D system, with excellent results.
    Hadfield Simon J., Bowden Richard (2011) Kinecting the dots: Particle Based Scene Flow From Depth Sensors, In Proceedings, International Conference on Computer Vision (ICCV), pp. 2290-2295
    The motion field of a scene can be used for object segmentation and to provide features for classification tasks like action recognition. Scene flow is the full 3D motion field of the scene, and is more difficult to estimate than its 2D counterpart, optical flow. Current approaches use a smoothness cost for regularisation, which tends to over-smooth at object boundaries. This paper presents a novel formulation of scene flow estimation as a collection of moving points in 3D space, modelled using a particle filter that supports multiple hypotheses and does not over-smooth the motion field. In addition, this paper is the first to address scene flow estimation while making use of modern depth sensors and monocular appearance images, rather than traditional multi-viewpoint rigs. The algorithm is applied to an existing scene flow dataset, where it achieves comparable results to approaches utilising multiple views, while taking a fraction of the time.
    Lebeda Karel, Matas Jiri, Bowden Richard (2013) Tracking the Untrackable: How to Track When Your Object Is Featureless, Lecture Notes in Computer Science, 7729, pp. 343-355, Springer
    We propose a novel approach to tracking objects by low-level line correspondences. In our implementation we show that this approach is usable even when tracking objects that lack texture, exploiting situations where feature-based trackers fail due to the aperture problem. Furthermore, we suggest an approach to failure detection and recovery to maintain long-term stability. This is achieved by remembering configurations which lead to good pose estimates and using them later for tracking corrections. We carried out experiments on several sequences of different types. The proposed tracker proves competitive with, or superior to, state-of-the-art trackers in both standard and low-textured scenes.
    Merino L, Gilbert A, Capitán J, Bowden R, Illingworth J, Ollero A (2012) Data fusion in ubiquitous networked robot systems for urban services, Annales des Telecommunications/Annals of Telecommunications, 67(7-8), pp. 355-375, Springer
    There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within the buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, by being more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimations on the positions of people from cameras on board robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to provide an increase in performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time on a large outdoor environment, including 22 non-overlapping cameras, a WSN, and several robots. © Institut Mines-Télécom and Springer-Verlag 2012.
    Gilbert Andrew, Illingworth John, Bowden Richard (2009) Fast realistic multi-action recognition using mined dense spatio-temporal features, Proceedings of the 12th IEEE International Conference on Computer Vision, pp. 925-931, IEEE
    Within the field of action recognition, features and descriptors are often engineered to be sparse and invariant to transformation. While sparsity makes the problem tractable, it is not necessarily optimal in terms of class separability and classification. This paper proposes a novel approach that uses very dense corner features that are spatially and temporally grouped in a hierarchical process to produce an overcomplete compound feature set. Frequently reoccurring patterns of features are then found through data mining, designed for use with large data sets. The novel use of the hierarchical classifier allows real time operation while the approach is demonstrated to handle camera motion, scale, human appearance variations, occlusions and background clutter. The classification performance outperforms other state-of-the-art action recognition algorithms on three datasets: KTH, multi-KTH, and Hollywood. Multiple action localisation is performed, though no ground-truth localisation data is required, using only weak supervision in the form of class labels for each training sequence. The Hollywood dataset contains complex realistic actions from movies; the approach outperforms the published accuracy on this dataset and also achieves real time performance. ©2009 IEEE.
    Stoll Stephanie, Camgöz Necati Cihan, Hadfield Simon, Bowden Richard (2018) Sign Language Production using Neural Machine Translation and Generative Adversarial Networks,Proceedings of the 29th British Machine Vision Conference (BMVC 2018) British Machine Vision Association
    We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.
    Koller Oscar, Zargaran Sepehr, Ney Hermann, Bowden Richard (2018) Deep Sign: Enabling Robust Statistical Continuous Sign Language Recognition via Hybrid CNN-HMMs, International Journal of Computer Vision, 126(12), pp. 1311-1325, Springer
    This manuscript introduces the end-to-end embedding of a CNN into an HMM, while interpreting the outputs of the CNN in a Bayesian framework. The hybrid CNN-HMM combines the strong discriminative abilities of CNNs with the sequence modelling capabilities of HMMs. Most current approaches in the field of gesture and sign language recognition disregard the necessity of dealing with sequence data both for training and evaluation. With our presented end-to-end embedding we are able to improve over the state-of-the-art on three challenging benchmark continuous sign language recognition tasks by between 15% and 38% relative reduction in word error rate and up to 20% absolute. We analyse the effect of the CNN structure, network pretraining and number of hidden states. We compare the hybrid modelling to a tandem approach and evaluate the gain of model combination.
    Spencer Jaime, Mendez Maldonado Oscar, Bowden Richard, Hadfield Simon (2018) Localisation via Deep Imagination: learn the features not the map,Proceedings of ECCV 2018 - European Conference on Computer Vision Springer Nature
    How many times does a human have to drive through the same area to become familiar with it? To begin with, we might first build a mental model of our surroundings. Upon revisiting this area, we can use this model to extrapolate to new unseen locations and imagine their appearance. Based on this, we propose an approach where an agent is capable of modelling new environments after a single visitation. To this end, we introduce 'Deep Imagination', a combination of classical Visual-based Monte Carlo Localisation and deep learning. By making use of a feature-embedded 3D map, the system can 'imagine' the view from any novel location. These 'imagined' views are contrasted with the current observation in order to estimate the agent's current location. In order to build the embedded map, we train a deep Siamese Fully Convolutional U-Net to perform dense feature extraction. By training these features to be generic, no additional training or fine tuning is required to adapt to new environments. Our results demonstrate the generality and transfer capability of our learnt dense features by training and evaluating on multiple datasets. Additionally, we include several visualizations of the feature representations and resulting 3D maps, as well as their application to localisation.
    Lebeda Karel, Hadfield Simon J., Bowden Richard 3DCars, IEEE
    Spencer Jaime, Bowden Richard, Hadfield Simon (2019) Scale-Adaptive Neural Dense Features: Learning via Hierarchical Context Aggregation,Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) Institute of Electrical and Electronics Engineers (IEEE)

    How do computers and intelligent agents view the world around them? Feature extraction and representation constitute one of the basic building blocks towards answering this question. Traditionally, this has been done with carefully engineered hand-crafted techniques such as HOG, SIFT or ORB. However, there is no 'one size fits all' approach that satisfies all requirements.

    In recent years, the rising popularity of deep learning has resulted in a myriad of end-to-end solutions to many computer vision problems. These approaches, while successful, tend to lack scalability and can't easily exploit information learned by other systems.

    Instead, we propose SAND features, a dedicated deep learning solution to feature extraction capable of providing hierarchical context information. This is achieved by employing sparse relative labels indicating relationships of similarity/dissimilarity between image locations. The nature of these labels results in an almost infinite set of dissimilar examples to choose from. We demonstrate how the selection of negative examples during training can be used to modify the feature space and vary its properties.

    To demonstrate the generality of this approach, we apply the proposed features to a multitude of tasks, each requiring different properties. This includes disparity estimation, semantic segmentation, self-localisation and SLAM. In all cases, we show how incorporating SAND features results in better or comparable results to the baseline, whilst requiring little to no additional training. Code can be found at: https://github.com/jspenmar/SAND_features
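    A simplified sketch of the training signal described above (illustrative PyTorch, not the released implementation at the repository linked above): corresponding pixel embeddings are pulled together while randomly drawn negatives are pushed beyond a margin, and the negative sampling strategy is what shapes the properties of the resulting feature space.

        # Illustrative pixel-wise contrastive loss from sparse relative labels.
        import torch
        import torch.nn.functional as F

        def relative_label_loss(feat_a, feat_b, pos_a, pos_b, margin=1.0, n_neg=64):
            # feat_a, feat_b: (C, H, W) dense feature maps of two images.
            # pos_a, pos_b: (N, 2) long tensors of corresponding (x, y) pixels.
            fa = feat_a[:, pos_a[:, 1], pos_a[:, 0]].t()   # (N, C) anchors
            fb = feat_b[:, pos_b[:, 1], pos_b[:, 0]].t()   # (N, C) positives
            pos_loss = F.pairwise_distance(fa, fb).mean()

            # Negatives are drawn at random; sampling them near or far from the
            # anchor trades local discrimination against global context.
            C, H, W = feat_b.shape
            neg_idx = torch.randint(0, H * W, (n_neg,))
            negs = feat_b.reshape(C, -1)[:, neg_idx].t()   # (n_neg, C)
            neg_d = torch.cdist(fa, negs)                  # (N, n_neg)
            neg_loss = F.relu(margin - neg_d).mean()
            return pos_loss + neg_loss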

    Tornay Sandrine, Razavi Marzieh, Camgöz Necati Cihan, Bowden Richard, Magimai.-Doss Mathew (2019) HMM-based Approaches to Model Multichannel Information in Sign Language Inspired from Articulatory Features-based Speech Processing, Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), pp. 2817-2821, Institute of Electrical and Electronics Engineers (IEEE)
    Sign language conveys information through multiple channels, such as hand shape, hand movement, and mouthing. Modeling this multichannel information is a highly challenging problem. In this paper, we elucidate the link between spoken language and sign language in terms of production phenomenon and perception phenomenon. Through this link we show that hidden Markov model-based approaches developed to model "articulatory" features for spoken language processing can be exploited to model the multichannel information inherent in sign language for sign language processing.
    Koller Oscar, Camgöz Necati Cihan, Ney Hermann, Bowden Richard (2019) Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1, Institute of Electrical and Electronics Engineers (IEEE)
    In this work we present a new approach to the field of weakly supervised learning in the video domain. Our method is relevant to sequence learning problems which can be split up into sub-problems that occur in parallel. Here, we experiment with sign language data. The approach exploits sequence constraints within each independent stream and combines them by explicitly imposing synchronisation points to make use of parallelism that all sub-problems share. We do this with multi-stream HMMs while adding intermediate synchronisation constraints among the streams. We embed powerful CNN-LSTM models in each HMM stream following the hybrid approach. This allows the discovery of attributes which on their own lack sufficient discriminative power to be identified. We apply the approach to the domain of sign language recognition exploiting the sequential parallelism to learn sign language, mouth shape and hand shape classifiers. We evaluate the classifiers on three publicly available benchmark data sets featuring challenging real-life sign language with over 1000 classes, full sentence based lip-reading and articulated hand shape recognition on a fine-grained hand shape taxonomy featuring over 60 different hand shapes. We clearly outperform the state-of-the-art on all data sets and observe significantly faster convergence using the parallel alignment approach.
    Kuutti Sampo, Bowden Richard, Joshi Harita, de Temple Robert, Fallah Saber (2019) End-to-end Reinforcement Learning for Autonomous Longitudinal Control Using Advantage Actor Critic with Temporal Context,2019 IEEE Intelligent Transportation Systems Conference IEEE
    Reinforcement learning has been used widely for autonomous longitudinal control algorithms. However, many existing algorithms suffer from sample inefficiency in reinforcement learning as well as the jerky driving behaviour of the learned systems. In this paper, we propose a reinforcement learning algorithm and a training framework to address these two disadvantages of previous algorithms proposed in this field. The proposed system uses an Advantage Actor Critic (A2C) learning system with recurrent layers to introduce temporal context within the network. This allows the learned system to evaluate continuous control actions based on previous states and actions in addition to current states. Moreover, slow training caused by the algorithm's sample inefficiency is addressed by utilising another neural network to approximate the vehicle dynamics. Using a neural network as a proxy for the simulator brings significant benefits to training: it reduces the need for the reinforcement learner to query the simulator (a major bottleneck), and because both the reinforcement learning network and the proxy network can be deployed on the same GPU, learning speed is considerably improved. Simulation results from testing in IPG CarMaker show the effectiveness of our recurrent A2C algorithm, compared to an A2C without recurrent layers.
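    The following is a hedged sketch of an advantage actor-critic network with a recurrent layer for temporal context (the observation contents, layer sizes and Gaussian policy head are assumptions for illustration, not the exact architecture or hyper-parameters from the paper):

        # Illustrative recurrent A2C network (PyTorch).
        import torch
        import torch.nn as nn

        class RecurrentA2C(nn.Module):
            def __init__(self, obs_dim=4, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
                self.actor = nn.Linear(hidden, 2)   # mean and log-std of the action
                self.critic = nn.Linear(hidden, 1)  # state-value estimate

            def forward(self, obs_seq, state=None):
                # obs_seq: (B, T, obs_dim), e.g. host speed, relative speed,
                # headway and previous action over the last T time steps.
                h, state = self.lstm(obs_seq, state)
                last = h[:, -1]
                mean, log_std = self.actor(last).chunk(2, dim=-1)
                value = self.critic(last)
                return mean, log_std.exp(), value, state

    The advantage A = R - value then weights the policy-gradient term, while the critic is regressed towards the observed return R.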
    Walters Celyn, Mendez Oscar, Hadfield Simon, Bowden Richard (2019) A Robust Extrinsic Calibration Framework for Vehicles with Unscaled Sensors,Towards a Robotic Society IEEE

    Accurate extrinsic sensor calibration is essential for both autonomous vehicles and robots. Traditionally this is an involved process requiring calibration targets, known fiducial markers and is generally performed in a lab. Moreover, even a small change in the sensor layout requires recalibration. With the anticipated arrival of consumer autonomous vehicles, there is demand for a system which can do this automatically, after deployment and without specialist human expertise.

    To solve these limitations, we propose a flexible framework which can estimate extrinsic parameters without an explicit calibration stage, even for sensors with unknown scale. Our first contribution builds upon standard hand-eye calibration by jointly recovering scale. Our second contribution is that our system is made robust to imperfect and degenerate sensor data, by collecting independent sets of poses and automatically selecting those which are most ideal.

    We show that our approach's robustness is essential for the target scenario. Unlike previous approaches, ours runs in real time and constantly estimates the extrinsic transform. For both an ideal experimental setup and a real use case, comparison against these approaches shows that we outperform the state-of-the-art. Furthermore, we demonstrate that the recovered scale may be applied to the full trajectory, circumventing the need for scale estimation via sensor fusion.

    Rochette Guillaume, Russell Chris, Bowden Richard (2019) Weakly-Supervised 3D Pose Estimation from a Single Image using Multi-View Consistency,Proceedings of the 30th British Machine Vision Conference (BMVC 2019) BMVC
    We present a novel data-driven regularizer for weakly-supervised learning of 3D human pose estimation that eliminates the drift problem that affects existing approaches. We do this by moving the stereo reconstruction problem into the loss of the network itself. This avoids the need to reconstruct 3D data prior to training and unlike previous semi-supervised approaches, avoids the need for a warm-up period of supervised training. The conceptual and implementational simplicity of our approach is fundamental to its appeal. Not only is it straightforward to augment many weakly-supervised approaches with our additional re-projection based loss, but it is obvious how it shapes reconstructions and prevents drift. As such we believe it will be a valuable tool for any researcher working in weakly-supervised 3D reconstruction. Evaluating on Panoptic, the largest multi-camera and markerless dataset available, we obtain an accuracy that is essentially indistinguishable from a strongly-supervised approach making full use of 3D groundtruth in training.
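    A hedged sketch of how multi-view consistency can be moved into the loss itself (illustrative PyTorch; the extrinsics R_12, t_12 are assumed to come from the multi-camera calibration, and the exact loss formulation in the paper may differ): 3D poses predicted independently from two views of the same person should agree once mapped into a common camera frame, and their projection should match 2D keypoints from an off-the-shelf detector.

        # Illustrative multi-view consistency and reprojection terms.
        import torch

        def multiview_consistency_loss(pose3d_cam1, pose3d_cam2, R_12, t_12):
            # pose3d_cam1/2: (B, J, 3) joints predicted from each view, in that
            # view's camera coordinates. R_12 (3, 3), t_12 (3,) map camera-1
            # coordinates into camera-2 coordinates.
            pose1_in_cam2 = pose3d_cam1 @ R_12.t() + t_12
            return (pose1_in_cam2 - pose3d_cam2).norm(dim=-1).mean()

        def reprojection_loss(pose3d, keypoints2d, focal, principal):
            # Pinhole projection of predicted 3D joints, compared against 2D
            # keypoints obtained from a detector (the weak supervision).
            proj = pose3d[..., :2] / pose3d[..., 2:].clamp(min=1e-6)
            proj = proj * focal + principal
            return (proj - keypoints2d).norm(dim=-1).mean()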
    Allday Rebecca, Hadfield Simon, Bowden Richard (2019) Auto-Perceptive Reinforcement Learning (APRiL),Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019) Institute of Electrical and Electronics Engineers (IEEE)
    The relationship between the feedback given in Reinforcement Learning (RL) and visual data input is often extremely complex. Given this, expecting a single system trained end-to-end to learn both how to perceive and interact with its environment is unrealistic for complex domains. In this paper we propose Auto-Perceptive Reinforcement Learning (APRiL), separating the perception and the control elements of the task. This method uses an auto-perceptive network to encode a feature space. The feature space may explicitly encode available knowledge from the semantically understood state space but the network is also free to encode unanticipated auxiliary data. By decoupling visual perception from the RL process, APRiL can make use of techniques shown to improve performance and efficiency of RL training, which are often difficult to apply directly with a visual input. We present results showing that APRiL is effective in tasks where the semantically understood state space is known. We also demonstrate that allowing the feature space to encode auxiliary information enables the visual perception system to improve performance by approximately 30%. We also show that maintaining some level of semantics in the encoded state, which can then make use of state-of-the-art RL techniques, saves around 75% of the time that would be used to collect simulation examples.
    Kuutti Sampo, Bowden Richard, Joshi Harita, de Temple Robert, Fallah Saber (2019) Safe Deep Neural Network-driven Autonomous Vehicles Using Software Safety Cages,Proceedings of the 20th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2019) Springer International Publishing
    Deep learning is a promising class of techniques for controlling an autonomous vehicle. However, functional safety validation is seen as a critical issue for these systems due to the lack of transparency in deep neural networks and the safety-critical nature of autonomous vehicles. The black box nature of deep neural networks limits the effectiveness of traditional verification and validation methods. In this paper, we propose two software safety cages, which aim to limit the control action of the neural network to a safe operational envelope. The safety cages impose limits on the control action during critical scenarios, which if breached, change the control action to a more conservative value. This has the benefit that the behaviour of the safety cages is interpretable, and therefore traditional functional safety validation techniques can be applied. The work here presents a deep neural network trained for longitudinal vehicle control, with safety cages designed to prevent forward collisions. Simulated testing in critical scenarios shows the effectiveness of the safety cages in preventing forward collisions whilst under normal highway driving unnecessary interruptions are eliminated, and the deep learning control policy is able to perform unhindered. Interventions by the safety cages are also used to re-train the network, resulting in a more robust control policy.
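    A minimal sketch of the safety-cage idea (the thresholds below are illustrative stand-ins, not the validated rule set from the paper): the network's longitudinal command is overridden by a conservative braking value whenever an interpretable rule is breached, and each intervention is flagged for later re-training.

        # Illustrative safety cage wrapping a learned longitudinal controller.
        def safety_cage(nn_action, time_headway, time_to_collision,
                        min_headway=0.5, min_ttc=2.0, max_brake=-1.0):
            # nn_action: longitudinal command in [-1, 1] from the neural network
            # (negative values brake). Returns the possibly-overridden command
            # and a flag recording that the cage intervened.
            if time_to_collision < min_ttc or time_headway < min_headway:
                return max_brake, True      # hard, rule-based intervention
            return nn_action, False         # network acts unhindered

    Because the cage is a small set of interpretable rules, it can be examined with traditional functional-safety techniques even though the underlying controller is a black-box network.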
    Mendez Oscar, Hadfield Simon, Pugeault Nicolas, Bowden Richard (2019) SeDAR: Reading floorplans like a human,International Journal of Computer Vision Springer Verlag

    The use of human-level semantic information to aid robotic tasks has recently become an important area for both Computer Vision and Robotics. This has been enabled by advances in Deep Learning that allow consistent and robust semantic understanding. Leveraging this semantic vision of the world has allowed human-level understanding to naturally emerge from many different approaches. Particularly, the use of semantic information to aid in localisation and reconstruction has been at the forefront of both fields.

    Like robots, humans also require the ability to localise within a structure. To aid this, humans have designed high-level semantic maps of our structures called floorplans. We are extremely good at localising in them, even with limited access to the depth information used by robots. This is because we focus on the distribution of semantic elements, rather than geometric ones. Evidence of this is that humans are normally able to localise in a floorplan that has not been scaled properly. In order to grant this ability to robots, it is necessary to use localisation approaches that leverage the same semantic information humans use.

    In this paper, we present a novel method for semantically enabled global localisation. Our approach relies on the semantic labels present in the floorplan. Deep Learning is leveraged to extract semantic labels from RGB images, which are compared to the floorplan for localisation. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary as we can achieve results comparable to state-of-the-art without them.

    Cormier Kearsy, Fox Neil, Woll Bencie, Zisserman Andrew, Camgöz Necati Cihan, Bowden Richard (2019) ExTOL: Automatic recognition of British Sign Language using the BSL Corpus,Proceedings of 6th Workshop on Sign Language Translation and Avatar Technology (SLTAT) 2019 Universitat Hamburg

    Although there have been some recent advances in sign language recognition, part of the problem is that most computer scientists in this research area do not have the required in-depth knowledge of sign language, and often have no connection with the Deaf community or sign linguists. For example, one project described as translation into sign language aimed to take subtitles and turn them into fingerspelling. This is one of many reasons why much of this technology, including sign-language gloves, simply doesn't address the challenges.

    However there are benefits to achieving automatic sign language recognition. The process of annotating and analysing sign language data on video is extremely labour-intensive. Sign language recognition technology could help speed this up. Until recently we have lacked large signed video datasets that have been precisely and consistently transcribed and translated – these are needed to train computers for automation. But sign language corpora – large datasets like the British Sign Language Corpus (Schembri et al., 2014) – bring new possibilities for this technology.

    Here we describe the project 'ExTOL: End to End Translation of British Sign Language', which has the aim of building the world's first British Sign Language to English translation system and the first practically functional machine translation system for any sign language. Annotation work previously done on the BSL Corpus is providing essential data to be used by computer vision tools to assist with automatic recognition. To achieve this the computer must be able to recognise not only the shape, motion and location of the hands but also non-manual features – including facial expression, mouth movements, and body posture of the signer. It must also understand how all of this activity in connected signing can be translated into written/spoken language. The technology for recognising hand, face and body positions and movements is improving all the time, which will allow significant progress in speeding up automatic recognition and identification of these elements (e.g. recognising specific facial expressions or mouth movements or head movements). Full translation from BSL to English is of course more complex but the automatic recognition of basic position and movements will help pave the way towards automatic translation. In addition, a secondary aim of the project is to create automatic annotation tools to be integrated into the software annotation package ELAN. We will additionally make the software tools available as independent packages, thus potentially allowing their inclusion into other annotation software such as iLex.

    In this poster we report on some initial progress on the ExTOL project. This includes (1) automatic recognition of English mouthings, which is being trained using 600+ hours of audiovisual spoken English from British television and TED videos, and developed by testing on English mouthing annotations from the BSL Corpus. It also includes (2) translation of IDGloss to Free Translation, where the aim is to produce English-like sentences given sign glosses in BSL word order. We report baseline results on a subset of the BSL Corpus, which contains 10,000+ sequences and over 5,000 unique tokens, using the state-of-the-art attention based Neural Machine Translation approaches (Camgoz et al., 2018; Vaswani et al., 2017). Although it is clear that free translation (i.e. full English translation) cannot be achieved via ID glosses alone, this baseline translation task will help contribute to the overall BSL to English translation process - at least at the level of manual signs.

    Kuutti Sampo, Bowden Richard, Jin Yaochu, Barber Phil, Fallah Saber (2020) A Survey of Deep Learning Applications to Autonomous Vehicle Control,IEEE Transactions on Intelligent Transportation Systems IEEE
    Designing a controller for autonomous vehicles capable of providing adequate performance in all driving scenarios is challenging due to the highly complex environment and inability to test the system in the wide variety of scenarios which it may encounter after deployment. However, deep learning methods have shown great promise in not only providing excellent performance for complex and non-linear control problems, but also in generalising previously learned rules to new scenarios. For these reasons, the use of deep learning for vehicle control is becoming increasingly popular. Although important advancements have been achieved in this field, these works have not been fully summarised. This paper surveys a wide range of research works reported in the literature which aim to control a vehicle through deep learning methods. Although there exists overlap between control and perception, the focus of this paper is on vehicle control, rather than the wider perception problem which includes tasks such as semantic segmentation and object detection. The paper identifies the strengths and limitations of available deep learning methods through comparative analysis and discusses the research challenges in terms of computation, architecture selection, goal specification, generalisation, verification and validation, as well as safety. Overall, this survey brings timely and topical information to a rapidly evolving field relevant to intelligent transportation systems.
    Stoll Stephanie, Camgöz Necati Cihan, Hadfield Simon, Bowden Richard (2020) Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks.,International Journal of Computer Vision Springer
    We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks, and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings qualitatively and quantitatively using broadcast quality assessment metrics.
    Vowels Matthew, Camgöz Necati Cihan, Bowden Richard (2020) Gated Variational AutoEncoders: Incorporating Weak Supervision to Encourage Disentanglement,15th IEEE International Conference on Automatic Face and Gesture Recognition
    Variational AutoEncoders (VAEs) provide a means to generate representational latent embeddings. Previous research has highlighted the benefits of achieving representations that are disentangled, particularly for downstream tasks. However, there is some debate about how to encourage disentanglement with VAEs, and evidence indicates that existing implementations do not achieve disentanglement consistently. How well a VAE's latent space has been disentangled is often evaluated against our subjective expectations of which attributes should be disentangled for a given problem. Therefore, by definition, we already have domain knowledge of what should be achieved and yet we use unsupervised approaches to achieve it. We propose a weakly supervised approach that incorporates any available domain knowledge into the training process to form a Gated-VAE. The process involves partitioning the representational embedding and gating backpropagation. All partitions are utilised on the forward pass but gradients are backpropagated through different partitions according to selected image/target pairings. The approach can be used to modify existing VAE models such as beta-VAE, InfoVAE and DIP-VAE-II. Experiments demonstrate that using gated backpropagation, latent factors are represented in their intended partition. The approach is applied to images of faces for the purpose of disentangling head-pose from facial expression. Quantitative metrics show that using Gated-VAE improves average disentanglement, completeness and informativeness, as compared with un-gated implementations. Qualitative assessment of latent traversals demonstrates its disentanglement of head-pose from expression, even when only weak/noisy supervision is available.
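    A simplified sketch of the gating mechanism (an illustration of gated backpropagation, not the authors' code): both latent partitions contribute to the forward pass, but gradients are blocked from the partition that is not assigned to the current image/target pairing.

        # Illustrative gated backpropagation over a partitioned latent code.
        import torch

        def gate_latent(z, pairing):
            # z: (B, D) latent sample, split into two partitions.
            # pairing: 'pose' or 'expression', deciding which partition may
            # learn from the current reconstruction target.
            z_pose, z_expr = z.chunk(2, dim=-1)
            if pairing == 'pose':
                z_expr = z_expr.detach()   # block gradients into the expression part
            else:
                z_pose = z_pose.detach()   # block gradients into the pose part
            return torch.cat([z_pose, z_expr], dim=-1)

    During training, image/target pairs that share head-pose but differ in expression would be routed through the 'pose' gate and vice versa, encouraging each partition to capture only its intended factor.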
    Vowels Matthew, Camgöz Necati Cihan, Bowden Richard (2020) Nested VAE: Isolating Common Factors via Weak Supervision, 15th IEEE International Conference on Automatic Face and Gesture Recognition
    Fair and unbiased machine learning is an important and active field of research, as decision processes are increasingly driven by models that learn from data. Unfortunately, any biases present in the data may be learned by the model, thereby inappropriately transferring that bias into the decision making process. We identify the connection between the task of bias reduction and that of isolating factors common between domains whilst encouraging domain specific invariance. To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory for scenarios whereby data may be naturally paired across domains and no additional supervision is required. The result is the Nested Variational AutoEncoder (NestedVAE). Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image from the latent representation of its paired image. In so doing, the nested VAE isolates the common latent factors/causes and becomes invariant to unwanted factors that are not shared between paired images. We also propose a new metric to provide a balanced method of evaluating consistency and classifier performance across domains, which we refer to as the Adjusted Parity metric. An evaluation of NestedVAE on both domain and attribute invariance, change detection, and learning common factors for the prediction of biological sex demonstrates that NestedVAE significantly outperforms alternative methods.
    Camgöz Necati Cihan, Koller Oscar, Hadfield Simon, Bowden Richard (2020) Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation,IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
    Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss level tokenization in order to work. We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solving two co-dependent sequence-to-sequence learning problems and leads to significant performance gains. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
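    A hedged sketch of the joint objective (illustrative PyTorch with assumed tensor shapes): a CTC loss over gloss labels trains the recognition path through the encoder, while a cross-entropy loss over spoken-language words trains the autoregressive translation decoder, and the two are summed into a single training signal.

        # Illustrative joint recognition + translation loss.
        import torch.nn.functional as F

        def joint_slt_loss(encoder_logits, gloss_targets, input_lens, gloss_lens,
                           decoder_logits, word_targets, ctc_weight=1.0):
            # encoder_logits: (T, B, n_glosses) frame-level gloss scores.
            # decoder_logits: (B, L, n_words) autoregressive word scores.
            recognition = F.ctc_loss(encoder_logits.log_softmax(-1), gloss_targets,
                                     input_lens, gloss_lens, blank=0)
            translation = F.cross_entropy(decoder_logits.transpose(1, 2),
                                          word_targets, ignore_index=0)
            return ctc_weight * recognition + translation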
    Spencer Jaime, Bowden Richard, Hadfield Simon (2020) Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance, IEEE
    'Like night and day' is a commonly used expression to imply that two things are completely different. Unfortunately, this tends to be the case for current visual feature representations of the same scene across varying seasons or times of day. The aim of this paper is to provide a dense feature representation that can be used to perform localization, sparse matching or image retrieval, regardless of the current seasonal or temporal appearance. Recently, there have been several proposed methodologies for deep learning dense feature representations. These methods make use of ground truth pixel-wise correspondences between pairs of images and focus on the spatial properties of the features. As such, they don't address temporal or seasonal variation. Furthermore, obtaining the required pixel-wise correspondence data to train in cross-seasonal environments is highly complex in most scenarios. We propose Deja-Vu, a weakly supervised approach to learning season invariant features that does not require pixel-wise ground truth data. The proposed system only requires coarse labels indicating if two images correspond to the same location or not. From these labels, the network is trained to produce 'similar' dense feature maps for corresponding locations despite environmental changes. Code will be made available at: https://github.com/jspenmar/DejaVu_Features
    Spencer Jaime, Bowden Richard, Hadfield Simon (2020) DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning, IEEE
    In the current monocular depth research, the dominant approach is to employ unsupervised training on large datasets, driven by warped photometric consistency. Such approaches lack robustness and are unable to generalize to challenging domains such as nighttime scenes or adverse weather conditions where assumptions about photometric consistency break down. We propose DeFeat-Net (Depth & Feature network), an approach to simultaneously learn a cross-domain dense feature representation, alongside a robust depth-estimation framework based on warped feature consistency. The resulting feature representation is learned in an unsupervised manner with no explicit ground-truth correspondences required. We show that within a single domain, our technique is comparable to both the current state of the art in monocular depth estimation and supervised feature representation learning. However, by simultaneously learning features, depth and motion, our technique is able to generalize to challenging domains, allowing DeFeat-Net to outperform the current state-of-the-art with around 10% reduction in all error measures on more challenging sequences such as nighttime driving.
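    A short sketch of warped feature consistency (illustrative only; the sampling grid is assumed to be computed elsewhere by projecting the predicted depth through the predicted camera motion): features from the source frame are warped into the target frame and compared to the target features, replacing the photometric term that breaks down at night or in adverse weather.

        # Illustrative warped feature consistency term.
        import torch.nn.functional as F

        def feature_consistency_loss(feat_src, feat_tgt, grid):
            # feat_src, feat_tgt: (B, C, H, W) dense features of two frames.
            # grid: (B, H, W, 2) sampling grid in [-1, 1], obtained by projecting
            # the predicted depth through the predicted camera motion.
            warped = F.grid_sample(feat_src, grid, align_corners=True)
            return (warped - feat_tgt).abs().mean()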
    Saunders Ben, Camgöz Necati Cihan, Bowden Richard (2020) Progressive Transformers for End-to-End Sign Language Production,2020 European Conference on Computer Vision (ECCV)
    The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video at a level comparable to a human translator. If this was achievable, then it would revolutionise Deaf hearing communications. Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences. In this paper, we propose Progressive Transformers, the first SLP model to translate from discrete spoken language sentences to continuous 3D sign pose sequences in an end-to-end manner. A novel counter decoding technique is introduced that enables continuous sequence generation at training and inference. We present two model configurations, an end-to-end network that produces sign direct from text and a stacked network that utilises a gloss intermediary. We also provide several data augmentation processes to overcome the problem of drift and drastically improve the performance of SLP models. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future research. Code available at https://github.com/BenSaunders27/ProgressiveTransformersSLP.
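    A simplified sketch of counter decoding (illustrative only; decoder_step is an assumed callable standing in for the transformer decoder): each generated pose frame carries a counter in [0, 1] marking progress through the sequence, so generation terminates when the model itself predicts that the sign sequence is complete.

        # Illustrative counter-driven decoding loop for continuous pose output.
        import torch

        def decode_with_counter(decoder_step, text_memory, pose_dim=150,
                                max_frames=300, threshold=0.98):
            # decoder_step(prev_frames, memory) is assumed to return the next
            # (pose_dim + 1)-dimensional vector: pose joints plus counter value.
            frames = [torch.zeros(pose_dim + 1)]          # start token
            for _ in range(max_frames):
                nxt = decoder_step(torch.stack(frames), text_memory)
                frames.append(nxt)
                counter = torch.sigmoid(nxt[-1])          # progress in [0, 1]
                if counter > threshold:                   # end of sequence
                    break
            return torch.stack(frames[1:])[:, :pose_dim]  # drop counter channel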
    Saunders Ben, Camgöz Necati Cihan, Bowden Richard (2020) Adversarial Training for Multi-Channel Sign Language Production,The 31st British Machine Vision Virtual Conference British Machine Vision Association
    Sign Languages are rich multi-channel languages, requiring articulation of both manual (hands) and non-manual (face and body) features in a precise, intricate manner. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody this full sign morphology to be truly understandable by the Deaf community. Previous work has mainly focused on manual feature production, with an under-articulated output caused by regression to the mean. In this paper, we propose an Adversarial Multi-Channel approach to SLP. We frame sign production as a minimax game between a transformer-based Generator and a conditional Discriminator. Our adversarial discriminator evaluates the realism of sign production conditioned on the source text, pushing the generator towards a realistic and articulate output. Additionally, we fully encapsulate sign articulators with the inclusion of non-manual features, producing facial features and mouthing patterns. We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, and report state-of-the-art SLP back-translation performance for manual production. We set new benchmarks for the production of multi-channel sign to underpin future research into realistic SLP.
    Stoll Stephanie, Hadfield Simon, Bowden Richard (2020) SignSynth: Data-Driven Sign Language Video Generation,Eighth International Workshop on Assistive Computer Vision and Robotics
    We present SignSynth, a fully automatic and holistic approach to generating sign language video. Traditionally, Sign Language Production (SLP) relies on animating 3D avatars using expensively annotated data, but so far this approach has not been able to simultaneously provide a realistic and scalable solution. We introduce a gloss2pose network architecture that is capable of generating human pose sequences conditioned on glosses. Combined with a generative adversarial pose2video network, we are able to produce natural-looking, high-definition sign language video. For sign pose sequence generation, we outperform the SotA by a factor of 18, with a Mean Square Error of 1.0673 in pixels. For video generation we report superior results on three broadcast quality assessment metrics. To evaluate our full gloss-to-video pipeline we introduce two novel error metrics, to assess the perceptual quality and sign representativeness of generated videos. We present promising results, significantly outperforming the SotA in both metrics. Finally we evaluate our approach qualitatively by analysing example sequences.
    Camgoz Necati Cihan, Koller Oscar, Hadfield Simon, Bowden Richard (2020) Multi-channel Transformers for Multi-articulatory Sign Language Translation,Proceedings of the 16th European Conference on Computer Vision (ECCV 2020) Part XI Springer International Publishing
    Sign languages use multiple asynchronous information channels (articulators), not just the hands but also the face and body, which computational approaches often ignore. In this paper we tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture. The proposed architecture allows both the inter and intra contextual relationships between different sign articulators to be modelled within the transformer network itself, while also maintaining channel specific information. We evaluate our approach on the RWTH-PHOENIX-Weather-2014T dataset and report competitive translation performance. Importantly, we overcome the reliance on gloss annotations which underpin other state-of-the-art approaches, thereby removing the need for expensive curated datasets.