Professor Richard Bowden

Departmental Duties

Senior Tutor for Professional Training

Contact Me

E-mail:
Phone: 01483 68 9838

Find me on campus
Room: 22 BA 00

Publications

Highlights

  • Pugeault N, Bowden R. (2015) 'How Much of Driving Is Preattentive?'. IEEE Transactions on Vehicular Technology, 64 (12), pp. 5424-5438.
  • Lebeda K, Hadfield S, Matas J, Bowden R. (2015) 'Texture-Independent Long-Term Tracking Using Virtual Corners'. IEEE Transactions on Image Processing, 25 (1), pp. 359-371.
  • Krejov P, Gilbert A, Bowden R. (2014) 'A Multitouchless Interface Expanding User Interaction'. IEEE Computer Graphics and Applications, 34 (3), pp. 40-48.
  • Oshin O, Gilbert A, Bowden R. (2014) 'Capturing relative motion and finding modes for action recognition in the wild'. Computer Vision and Image Understanding.
  • Gilbert A, Illingworth J, Bowden R. (2010) 'Action Recognition Using Mined Hierarchical Compound Features'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (5), pp. 883-897.
  • Dowson N, Kadir T, Bowden R. (2008) 'Estimating the joint statistics of images using Nonparametric Windows with application to registration using Mutual Information'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (10), pp. 1841-1857.

    Abstract

    Recently, Nonparametric (NP) Windows has been proposed to estimate the statistics of real 1D and 2D signals. NP Windows is accurate because it is equivalent to sampling images at a high (infinite) resolution for an assumed interpolation model. This paper extends the proposed approach to consider joint distributions of image pairs. Second, Green's Theorem is used to simplify the previous NP Windows algorithm. Finally, a resolution-aware NP Windows algorithm is proposed to improve robustness to relative scaling between an image pair. Comparative testing of 2D image registration was performed using translation only and affine transformations. Although it is more expensive than other methods, NP Windows frequently demonstrated superior performance for bias (distance between ground truth and global maximum) and frequency of convergence. Unlike other methods, the number of samples and the number of bins have little effect on NP Windows and the prior selection of a kernel is not required.

  • Dowson N, Bowden R. (2008) 'Mutual information for Lucas-Kanade tracking (MILK): An inverse compositional formulation'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (1), pp. 180-185.

Journal articles

  • Oshin OT, Gilbert A, Illingworth J, Bowden R. 'Learning to recognise spatio-temporal interest points', pp. 14-30.
  • Hadfield S, Lebeda K, Bowden R. (2018) 'HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation'. IEEE Transactions on Pattern Analysis and Machine Intelligence.
    [ Status: Accepted ]

    Abstract

    This paper proposes a Hybrid Approximate Representation (HAR) based on unifying several efficient approximations of the generalized reprojection error (which is known as the gold standard for multiview geometry). The HAR is an over-parameterization scheme where the approximation is applied simultaneously in multiple parameter spaces. A joint minimization scheme “HAR-Descent” can then solve the PnP problem efficiently, while remaining robust to approximation errors and local minima. The technique is evaluated extensively, including numerous synthetic benchmark protocols and the real-world data evaluations used in previous works. The proposed technique was found to have runtime complexity comparable to the fastest O(n) techniques, and up to 10 times faster than current state of the art minimization approaches. In addition, the accuracy exceeds that of all 9 previous techniques tested, providing definitive state of the art performance on the benchmarks, across all 90 of the experiments in the paper and supplementary material.

  • Lebeda K, Hadfield SJ, Bowden R. (2017) 'TMAGIC: A Model-free 3D Tracker'. IEEE Transactions on Image Processing, 26 (9), pp. 4378-4388.

    Abstract

    Significant effort has been devoted within the visual tracking community to rapid learning of object properties on the fly. However, state-of-the-art approaches still often fail in cases such as rapid out-of-plane rotation, when the appearance changes suddenly. One of the major contributions of this work is a radical rethinking of the traditional wisdom of modelling 3D motion as appearance change during tracking. Instead, 3D motion is modelled as 3D motion. This intuitive but previously unexplored approach provides new possibilities in visual tracking research. Firstly, 3D tracking is more general, as large out-of-plane motion is often fatal for 2D trackers, but helps 3D trackers to build better models. Secondly, the tracker’s internal model of the object can be used in many different applications and it could even become the main motivation, with tracking supporting reconstruction rather than vice versa. This effectively bridges the gap between visual tracking and Structure from Motion. A new benchmark dataset of sequences with extreme out-of-plane rotation is presented and an online leader-board offered to stimulate new research in the relatively underdeveloped area of 3D tracking. The proposed method, provided as a baseline, is capable of successfully tracking these sequences, all of which pose a considerable challenge to 2D trackers (error reduced by 46%).

  • Gilbert A, Bowden R. (2017) 'Image and Video Mining through Online Learning'. Computer Vision and Image Understanding, 158, pp. 72-84.

    Abstract

    Within the field of image and video recognition, the traditional approach is a dataset split into fixed training and test partitions. However, the labelling of the training set is time-consuming, especially as datasets grow in size and complexity. Furthermore, this approach is not applicable to the home user, who wants to intuitively group their media without tirelessly labelling the content. Consequently, we propose a solution similar in nature to an active learning paradigm, where a small subset of media is labelled as semantically belonging to the same class, and machine learning is then used to pull this and other related content together in the feature space. Our interactive approach is able to iteratively cluster classes of images and video. We reformulate it in an online learning framework and demonstrate competitive performance to batch learning approaches using only a fraction of the labelled data. Our approach is based around the concept of an image signature which, unlike a standard bag of words model, can express co-occurrence statistics as well as symbol frequency. We efficiently compute metric distances between signatures despite their inherent high dimensionality and provide discriminative feature selection, to allow common and distinctive elements to be identified from a small set of user labelled examples. These elements are then accentuated in the image signature to increase similarity between examples and pull correct classes together. By repeating this process in an online learning framework, the accuracy of similarity increases dramatically despite labelling only a few training examples. To demonstrate that the approach is agnostic to media type and features used, we evaluate on three image datasets (15 scene, Caltech101 and FG-NET), a mixed text and image dataset (ImageTag), a dataset used in active learning (Iris) and on three action recognition datasets (UCF11, KTH and Hollywood2). On the UCF11 video dataset, the accuracy is 86.7% despite using only 90 labelled examples from a dataset of over 1200 videos, instead of the standard 1122 training videos. The approach is both scalable and efficient, with a single iteration over the full UCF11 dataset of around 1200 videos taking approximately 1 minute on a standard desktop machine.

  • Krejov P, Gilbert A, Bowden R. (2016) 'Guided Optimisation through Classification and Regression for Hand Pose Estimation'. Computer Vision and Image Understanding, 155, pp. 124-138.

    Abstract

    This paper presents an approach to hand pose estimation that combines discriminative and model-based methods to leverage the advantages of both. Randomised Decision Forests are trained using real data to provide fast coarse segmentation of the hand. The segmentation then forms the basis of constraints applied in model fitting, using an efficient projected Gauss-Seidel solver, which enforces temporal continuity and kinematic limitations. However, when fitting a generic model to multiple users with varying hand shape, there is likely to be residual errors between the model and their hand. Also, local minima can lead to failures in tracking that are difficult to recover from. Therefore, we introduce an error regression stage that learns to correct these instances of optimisation failure. The approach provides improved accuracy over the current state of the art methods, through the inclusion of temporal cohesion and by learning to correct from failure cases. Using discriminative learning, our approach performs guided optimisation, greatly reducing model fitting complexity and radically improving efficiency. This allows tracking to be performed at over 40 frames per second using a single CPU thread.
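
The projected Gauss-Seidel idea mentioned in the abstract is easy to sketch: after each coordinate's Gauss-Seidel update, the value is clamped back onto its feasible interval, which is how hard constraints such as kinematic joint limits can be enforced. The toy system and bounds below are invented for illustration and are not taken from the paper.

```python
# Hedged sketch of projected Gauss-Seidel for Ax = b with box
# constraints lo <= x <= hi. The system here is a made-up toy example.

def projected_gauss_seidel(A, b, lo, hi, iters=100):
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # Standard Gauss-Seidel update, using the freshest values of x.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
            # Projection step: clamp onto the feasible box (analogous to
            # enforcing a kinematic limit on a joint parameter).
            x[i] = min(max(x[i], lo[i]), hi[i])
    return x

# Diagonally dominant toy system; the unconstrained solution is (1, 1),
# but the upper bound 0.5 on the second variable is active.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [5.0, 4.0]
print(projected_gauss_seidel(A, b, lo=[0.0, 0.0], hi=[2.0, 0.5]))
```

With the second coordinate clamped at 0.5, the first settles at (5 - 0.5) / 4 = 1.125, illustrating how the remaining free variables re-solve around the active constraint.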

  • Hadfield SJ, Lebeda K, Bowden R. (2016) 'Stereo reconstruction using top-down cues'. Computer Vision and Image Understanding, 157, pp. 206-222.

    Abstract

    We present a framework which allows standard stereo reconstruction to be unified with a wide range of classic top-down cues from urban scene understanding. The resulting algorithm is analogous to the human visual system where conflicting interpretations of the scene due to ambiguous data can be resolved based on a higher level understanding of urban environments. The cues which are reformulated within the framework include: recognising common arrangements of surface normals and semantic edges (e.g. concave, convex and occlusion boundaries), recognising connected or coplanar structures such as walls, and recognising collinear edges (which are common on repetitive structures such as windows). Recognition of these common configurations has only recently become feasible, thanks to the emergence of large-scale reconstruction datasets. To demonstrate the importance and generality of scene understanding during stereo reconstruction, the proposed approach is integrated with 3 different state-of-the-art techniques for bottom-up stereo reconstruction. The use of high-level cues is shown to improve performance by up to 15% on the Middlebury 2014 and KITTI datasets. We further evaluate the technique using the recently proposed HCI stereo metrics, finding significant improvements in the quality of depth discontinuities, planar surfaces and thin structures.

  • Hadfield SJ, Lebeda K, Bowden R. (2016) 'Hollywood 3D: What are the best 3D features for Action Recognition?'. International Journal of Computer Vision, 121 (1), pp. 95-110.

    Abstract

    Action recognition “in the wild” is extremely challenging, particularly when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent growth of 3D data in broadcast content and commercial depth sensors, makes it possible to overcome this. However, there is little work examining the best way to exploit this new modality. In this paper we introduce the Hollywood 3D benchmark, which is the first dataset containing “in the wild” action footage including 3D data. This dataset consists of 650 stereo video clips across 14 action classes, taken from Hollywood movies. We provide stereo calibrations and depth reconstructions for each clip. We also provide an action recognition pipeline, and propose a number of specialised depth-aware techniques including five interest point detectors and three feature descriptors. Extensive tests allow evaluation of different appearance and depth encoding schemes. Our novel techniques exploiting this depth allow us to reach performance levels more than triple those of the best baseline algorithm using only appearance information. The benchmark data, code and calibrations are all made available to the community.

  • Gilbert A, Bowden R. (2015) 'Geometric Mining: Scaling Geometric Hashing to Large Datasets'. 3rd Workshop on Web-scale Vision and Social Media (VSM), at ICCV 2015.
  • Pugeault N, Bowden R. (2015) 'How Much of Driving Is Preattentive?'. IEEE Transactions on Vehicular Technology, 64 (12), pp. 5424-5438.
  • Lebeda K, Hadfield S, Matas J, Bowden R. (2015) 'Texture-Independent Long-Term Tracking Using Virtual Corners'. IEEE Transactions on Image Processing, 25 (1), pp. 359-371.
  • Bowden R. (2015) 'The evolution of Computer Vision'. Perception, 44, pp. 360-361.
  • Krejov P, Gilbert A, Bowden R. (2015) 'Combining Discriminative and Model Based Approaches for Hand Pose Estimation'. 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 2, Ljubljana, Slovenia.
  • Gilbert A, Bowden R. (2015) 'Data mining for action recognition'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9007, pp. 290-303.
  • Bowden R, Collomosse J, Mikolajczyk K. (2014) 'Guest Editorial: Tracking, Detection and Segmentation'. International Journal of Computer Vision.
    [ Status: Accepted ]
  • Krejov P, Gilbert A, Bowden R. (2014) 'A Multitouchless Interface Expanding User Interaction'. IEEE Computer Graphics and Applications, 34 (3), pp. 40-48.
  • Koller O, Ney H, Bowden R. (2014) 'Read my lips: Continuous signer independent weakly supervised viseme recognition'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8689 LNCS (PART 1), pp. 281-296.
  • Oshin O, Gilbert A, Bowden R. (2014) 'Capturing relative motion and finding modes for action recognition in the wild'. Computer Vision and Image Understanding.
  • Hadfield SJ, Bowden R. (2013) 'Scene Particles: Unregularized Particle Based Scene Flow Estimation'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36 (3), pp. 564-576.

    Abstract

    In this paper, an algorithm is presented for estimating scene flow, which is a richer, 3D analogue of optical flow. The approach operates orders of magnitude faster than alternative techniques, and is well suited to further performance gains through parallelized implementation. The algorithm employs multiple hypotheses to deal with motion ambiguities, rather than the traditional smoothness constraints, removing oversmoothing errors and providing significant performance improvements on benchmark data, over the previous state of the art. The approach is flexible, and capable of operating with any combination of appearance and/or depth sensors, in any setup, simultaneously estimating the structure and motion if necessary. Additionally, the algorithm propagates information over time to resolve ambiguities, rather than performing an isolated estimation at each frame, as in contemporary approaches. Approaches to smoothing the motion field without sacrificing the benefits of multiple hypotheses are explored, and a probabilistic approach to occlusion estimation is demonstrated, leading to 10% and 15% improved performance respectively. Finally, a data driven tracking approach is described, and used to estimate the 3D trajectories of hands during sign language, without the need to model complex appearance variations at each viewpoint.

  • Krejov P, Bowden R. (2013) 'Multi-touchless: Real-time fingertip detection and tracking using geodesic maxima'. 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013, pp. 1-7.

    Abstract

    Since the advent of multitouch screens, users have been able to interact using fingertip gestures in a two dimensional plane. With the development of depth cameras, such as the Kinect, attempts have been made to reproduce the detection of gestures for three dimensional interaction. Many of these use contour analysis to find the fingertips; however, the success of such approaches is limited due to sensor noise and rapid movements. This paper discusses an approach to identify fingertips during rapid movement at varying depths allowing multitouch without contact with a screen. To achieve this, we use a weighted graph that is built using the depth information of the hand to determine the geodesic maxima of the surface. Fingertips are then selected from these maxima using a simplified model of the hand and correspondence found over successive frames. Our experiments show real-time performance for multiple users, providing tracking at 30fps for up to 4 hands, and we compare our results with state-of-the-art methods, providing accuracy an order of magnitude better than existing approaches. © 2013 IEEE.
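
The geodesic-maximum idea above can be sketched with a shortest-path computation: compute geodesic distances over a weighted pixel graph and take the farthest vertices as fingertip candidates. The tiny graph below stands in for a chain of hand pixels and is purely illustrative (in practice edge weights would come from depth differences and the source would be the hand centre).

```python
# Hedged sketch, not the paper's implementation: Dijkstra geodesic
# distances on a weighted graph, with the fingertip candidate taken as
# the vertex of maximal geodesic distance from the source.
import heapq

def geodesic_distances(adj, source):
    """adj: {node: [(neighbour, edge_weight), ...]}"""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Three-node path graph standing in for a strip of hand pixels.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0)]}
dist = geodesic_distances(adj, source=0)
tip = max(dist, key=dist.get)  # farthest vertex = geodesic maximum
print(tip, dist[tip])  # 2 2.0
```

Because geodesic distance follows the surface rather than straight-line depth, an extended finger remains a maximum even when it points towards the camera.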

  • Kristan M, Pflugfelder R, Leonardis A, Matas J, Porikli F, Čehovin L, Nebehay G, Fernandez G, Vojíř T, Gatt A, Khajenezhad A, Salahledin A, Soltani-Farani A, Zarezade A, Petrosino A, Milton A, Bozorgtabar B, Li B, Chan CS, Heng C, Ward D, Kearney D, Monekosso D, Karaimer HC, Rabiee HR, Zhu J, Gao J, Xiao J, Zhang J, Xing J, Huang K, Lebeda K, Cao L, Maresca ME, Lim MK, ELHelw M, Felsberg M, Remagnino P, Bowden R, Goecke R, Stolkin R, Lim SYY, Maher S, Poullot S, Wong S, Satoh S, Chen W, Hu W, Zhang X, Li Y, Niu Z. (2013) 'The visual object tracking VOT2013 challenge results'. Proceedings of the IEEE International Conference on Computer Vision, pp. 98-111.
  • Escalera S, Gonzàlez J, Baró X, Reyes M, Guyon I, Athitsos V, Escalante H, Sigal L, Argyros A, Sminchisescu C, Bowden R, Sclaroff S. (2013) 'ChaLearn multi-modal gesture recognition 2013: Grand challenge and workshop summary'. ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction, pp. 365-370.
  • Sheerman-Chase T, Ong E-J, Bowden R. (2013) 'Non-linear predictors for facial feature tracking across pose and expression'. 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013.
  • Koller O, Ney H, Bowden R. (2013) 'May the force be with you: Force-aligned signwriting for automatic subunit annotation of corpora'. 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013.
  • Gilbert A, Bowden R. (2013) 'A picture is worth a thousand tags: Automatic web based image tag expansion'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7725 LNCS (PART 2), pp. 447-460.

    Abstract

    We present an approach to automatically expand the annotation of images using the internet as an additional information source. The novelty of the work is in the expansion of image tags by automatically introducing new unseen complex linguistic labels which are collected unsupervised from associated webpages. Taking a small subset of existing image tags, a web based search retrieves additional textual information. Both a textual bag of words model and a visual bag of words model are combined and symbolised for data mining. Association rule mining is then used to identify rules which relate words to visual contents. Unseen images that fit these rules are re-tagged. This approach allows a large number of additional annotations to be added to unseen images, on average 12.8 new tags per image, with an 87.2% true positive rate. Results are shown on two datasets, including a new 2800 image annotation dataset of landmarks; the results include pictures of buildings being tagged with the architect, the year of construction and even events that have taken place there. This widens the tag annotation impact and their use in retrieval. This dataset is made available along with tags and the 1970 webpages and additional images which form the information corpus. In addition, results for a common state-of-the-art dataset MIRFlickr25000 are presented for comparison of the learning framework against previous works. © 2013 Springer-Verlag.
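
The support/confidence filter at the heart of association rule mining can be shown on a toy example. The transactions below, mixing a textual tag with invented visual-word symbols ("v17", "v3"), are entirely made up; they are not the paper's data or its mining algorithm, just the rule-scoring step.

```python
# Hedged toy example of association rule mining: a rule a -> b is kept
# when its support and confidence both clear a threshold.

def mine_rules(transactions, min_support=0.4, min_confidence=0.8):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    rules = []
    for a in items:
        for b in items:
            if a == b:
                continue
            n_a = sum(1 for t in transactions if a in t)
            n_ab = sum(1 for t in transactions if a in t and b in t)
            support = n_ab / n                      # how often a, b co-occur
            confidence = n_ab / n_a if n_a else 0.0  # P(b | a)
            if support >= min_support and confidence >= min_confidence:
                rules.append((a, b, confidence))
    return rules

transactions = [
    {"cathedral", "v17"},        # textual tag + visual word
    {"cathedral", "v17", "v3"},
    {"bridge", "v3"},
]
print(sorted(mine_rules(transactions)))
# [('cathedral', 'v17', 1.0), ('v17', 'cathedral', 1.0)]
```

Only the tag/visual-word pair that co-occurs reliably survives the thresholds; an image matching the surviving rule's antecedent would then be re-tagged with its consequent.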

  • Holt B, Bowden R. (2013) 'Efficient Estimation of Human Upper Body Pose in Static Depth Images'. Communications in Computer and Information Science, 359 CCIS, pp. 399-410.
  • Lebeda K, Matas J, Bowden R. (2013) 'Tracking the untrackable: How to track when your object is featureless'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7729 LNCS (PART 2), pp. 347-359.
  • Holt B, Ong EJ, Bowden R. (2013) 'Accurate static pose estimation combining direct regression and geodesic extrema'. 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013,
  • Efthimiou E, Fotinea SE, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Lefebvre-Albaret F. (2012) 'The dicta-sign Wiki: Enabling web communication for the deaf'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7383 LNCS (PART 2), pp. 205-212.
  • Bowden R, Cox SJ, Harvey RW, Lan Y, Ong EJ, Owen G, Theobald BJ. (2012) 'Is automated conversion of video to text a reality?'. Proceedings of SPIE - The International Society for Optical Engineering, 8546.
  • Gupta A, Bowden R. (2012) 'Fuzzy encoding for image classification using Gustafson-Kessel algorithm'. IEEE PES Innovative Smart Grid Technologies Conference Europe, pp. 3137-3140.
  • Merino L, Gilbert A, Bowden R, Illingworth J, Capitán J, Ollero A. (2012) 'Data fusion in ubiquitous networked robot systems for urban services'. Annales des Telecommunications/Annals of Telecommunications, pp. 1-21.
  • Ong EJ, Bowden R. (2011) 'Learning sequential patterns for lipreading'. BMVC 2011 - Proceedings of the British Machine Vision Conference 2011.

    Abstract

    This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search of optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set's boosting weight distribution. This temporal search method is then integrated into the boosting framework resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state of the art recognition performance, using only a small set of sequential patterns. © 2011. The copyright of this document resides with its authors.

  • Ong EJ, Bowden R. (2011) 'Robust Facial Feature Tracking Using Shape-Constrained Multi-Resolution Selected Linear Predictors'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (9), pp. 1844-1859.

    Abstract

    This paper proposes a learnt data-driven approach for accurate, real-time tracking of facial features using only intensity information, a non-trivial task since the face is a highly deformable object with large textural variations and motion in certain regions. The framework proposed here largely avoids the need for a priori design of feature trackers by automatically identifying the optimal visual support required for tracking a single facial feature point. This is essentially equivalent to automatically determining the visual context required for tracking. Tracking is achieved via biased linear predictors which provide a fast and effective method for mapping pixel-intensities into tracked feature position displacements. Multiple linear predictors are grouped into a rigid flock to increase robustness. To further improve tracking accuracy, a novel probabilistic selection method is used to identify relevant visual areas for tracking a feature point. These selected flocks are then combined into a hierarchical multi-resolution LP model. Finally, we also exploit a simple shape constraint for correcting the occasional tracking failure of a minority of feature points. Experimental results also show that this method performs more robustly and accurately than AAMs, on example sequences that range from SD quality to YouTube quality.
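
The core of a linear predictor, mapping pixel-intensity differences directly to a feature displacement with one matrix product, can be illustrated with a least-squares fit. Everything below (the Jacobian, the synthetic training set) is invented for illustration; the paper's biased predictors, flocks and multi-resolution selection are not reproduced here.

```python
# Hedged sketch: learn a matrix P so that displacement d is predicted as
# intensity_difference @ P. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Pretend image model: a 2D displacement d induces intensity differences
# at k support pixels via a (made-up) local Jacobian J.
k = 12
J = rng.normal(size=(k, 2))
train_d = rng.normal(size=(200, 2))   # training displacements
train_i = train_d @ J.T               # intensity differences they induce

# Fit the predictor by least squares over the training pairs.
P, *_ = np.linalg.lstsq(train_i, train_d, rcond=None)

# At run time a single matrix product maps an observed intensity
# difference to a displacement -- no iterative optimisation needed.
d_true = np.array([0.5, -0.25])
d_pred = (J @ d_true) @ P
print(d_pred)  # should approximately recover d_true
```

This regression-at-training, multiply-at-runtime split is what makes such predictors fast enough for real-time tracking.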

  • Okwechime D, Ong E-J, Bowden R. (2011) 'MIMiC: Multimodal Interactive Motion Controller'. IEEE Transactions on Multimedia, 13 (2), pp. 255-265.

    Abstract

    We introduce a new algorithm for real-time interactive motion control and demonstrate its application to motion captured data, prerecorded videos, and HCI. Firstly, a data set of frames is projected into a lower dimensional space. An appearance model is learnt using a multivariate probability distribution. A novel approach to determining transition points is presented based on k-medoids, whereby appropriate points of intersection in the motion trajectory are derived as cluster centers. These points are used to segment the data into smaller subsequences. A transition matrix combined with a kernel density estimation is used to determine suitable transitions between the subsequences to develop novel motion. To facilitate real-time interactive control, conditional probabilities are used to derive motion given user commands. The user commands can come from any modality including auditory, touch, and gesture. The system is also extended to HCI using audio signals of speech in a conversation to trigger nonverbal responses from a synthetic listener in real-time. We demonstrate the flexibility of the model by presenting results ranging from data sets composed of vectorized images, 2-D, and 3-D point representations. Results show real-time interaction and plausible motion generation between different types of movement.
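
The reason k-medoids (rather than k-means) suits transition-point selection is that each cluster centre must be an actual data point, so it can serve directly as a frame where subsequences join. A minimal 1-D sketch, on invented data that is not from the paper:

```python
# Hedged sketch of k-medoids: centres are restricted to real data points.
# Points here are synthetic stand-ins for projected motion frames.

def k_medoids(points, k, iters=20):
    medoids = points[:k]  # naive initialisation, for illustration only
    for _ in range(iters):
        # Assignment step: each point joins its nearest medoid's cluster.
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: abs(p - m))
            clusters[nearest].append(p)
        # Update step: re-pick each medoid as the cluster member that
        # minimises total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(abs(c - p) for p in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    return sorted(medoids)

frames = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]  # two dense regions of a trajectory
print(k_medoids(frames, k=2))  # medoids are genuine frames: [0.1, 5.1]
```

Each returned medoid is one of the input values, so in the motion-control setting it corresponds to a real frame at which subsequences can be stitched together.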

  • Gilbert A, Bowden R. (2011) 'Push and pull: Iterative grouping of media'. BMVC 2011 - Proceedings of the British Machine Vision Conference 2011.
  • Gupta A, Bowden R. (2011) 'Evaluating dimensionality reduction techniques for visual category recognition using Rényi entropy'. European Signal Processing Conference, pp. 913-917.
  • Ellis L, Dowson N, Matas J, Bowden R. (2011) 'Linear Regression and Adaptive Appearance Models for Fast Simultaneous Modelling and Tracking'. International Journal of Computer Vision, 95 (2), pp. 154-179.
  • Moore S, Bowden R. (2011) 'Local binary patterns for multi-view facial expression recognition'. Computer Vision and Image Understanding, 115 (4), pp. 541-558.
  • Mitchell TA, Bowden R, Sarhadi M. (2010) 'Efficient Texture Analysis for Industrial Inspection'. International Journal of Production Research, 38 (4), pp. 967-984.
  • Sanfeliu A, Andrade-Cetto J, Barbosa M, Bowden R, Capitan J, Corominas A, Gilbert A, Illingworth J, Merino L, Mirats JM, Moreno P, Ollero A, Sequeira J, Spaan MTJ. (2010) 'Decentralized Sensor Fusion for Ubiquitous Networking Robotics in Urban Areas'. Sensors, 10 (3), pp. 2274-2314.
  • Pugeault N, Bowden R. (2010) 'Learning driving behaviour using holistic image descriptors'. 4th International Conference on Cognitive Systems, CogSys 2010.
  • Gilbert A, Illingworth J, Bowden R. (2010) 'Action Recognition Using Mined Hierarchical Compound Features'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (5), pp. 883-897.
  • Ong E-J, Ellis L, Bowden R. (2009) 'Problem solving through imitation'. Image and Vision Computing, 27 (11), pp. 1715-1728.
  • Dowson N, Kadir T, Bowden R. (2008) 'Estimating the joint statistics of images using Nonparametric Windows with application to registration using Mutual Information'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (10), pp. 1841-1857.

    Abstract

    Recently, Nonparametric (NP) Windows has been proposed to estimate the statistics of real 1D and 2D signals. NP Windows is accurate because it is equivalent to sampling images at a high (infinite) resolution for an assumed interpolation model. This paper extends the proposed approach to consider joint distributions of image pairs. Second, Green's Theorem is used to simplify the previous NP Windows algorithm. Finally, a resolution-aware NP Windows algorithm is proposed to improve robustness to relative scaling between an image pair. Comparative testing of 2D image registration was performed using translation only and affine transformations. Although it is more expensive than other methods, NP Windows frequently demonstrated superior performance for bias (distance between ground truth and global maximum) and frequency of convergence. Unlike other methods, the number of samples and the number of bins have little effect on NP Windows and the prior selection of a kernel is not required.
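
The quantity being maximised in the registration experiments above is mutual information between an image pair. A minimal illustration of that quantity, computed from a plain binned joint histogram rather than NP Windows (the intensities below are invented):

```python
# Hedged sketch: mutual information from a joint intensity histogram of
# two equally sized images, given as flat lists of binned intensities.
from collections import Counter
from math import log2

def mutual_information(img_a, img_b):
    n = len(img_a)
    joint = Counter(zip(img_a, img_b))  # joint histogram of intensity pairs
    pa = Counter(img_a)                 # marginal histograms
    pb = Counter(img_b)
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )

# Perfectly (inversely) co-varying intensities: MI equals the full
# 1 bit of entropy, even though the correlation is negative.
a = [0, 0, 1, 1]
b = [1, 1, 0, 0]
print(mutual_information(a, b))  # 1.0
```

In registration, this score is evaluated over candidate transformations and peaks when the two images are aligned; NP Windows improves on the naive histogram used here by removing its sensitivity to the number of samples and bins.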

  • Gilbert A, Bowden R. (2008) 'Incremental, scalable tracking of objects inter camera'. Computer Vision and Image Understanding, 111 (1), pp. 43-58.
  • Dowson N, Bowden R. (2008) 'Mutual information for Lucas-Kanade tracking (MILK): An inverse compositional formulation'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (1), pp. 180-185.
  • Ong EJ, Bowden R. (2005) 'Learning multi-kernel distance functions using relative comparisons'. Pattern Recognition, 38 (12), pp. 2653-2657.
  • Micilotta AS, Ong EJ, Bowden R. (2005) 'Detection and tracking of humans by probabilistic body part assembly'. BMVC 2005 - Proceedings of the British Machine Vision Conference 2005.
  • Windridge D, Bowden R. (2005) 'Hidden Markov chain estimation and parameterisation via ICA-based feature-selection'. Pattern Analysis and Applications, 8 (1-2), pp. 115-124.
  • Gilbert A, Bowden R. (2005) 'Incremental modelling of the posterior distribution of objects for inter and intra camera tracking'. BMVC 2005 - Proceedings of the British Machine Vision Conference 2005.
  • Ong E-J, Micilotta AS, Bowden R, Hilton A. (2005) 'Viewpoint invariant exemplar-based 3D human tracking'. Computer Vision and Image Understanding, 104 (2-3), pp. 178-189.
  • Bowden R. (2003) 'Probabilistic models in computer vision'. Image and Vision Computing, 21 (10), pp. 841-841.
  • Bowden R, Kaewtrakulpong P, Lewin M. (2002) 'Jeremiah: The face of computer vision'. ACM International Conference Proceeding Series, 22, pp. 124-128.
  • Bowden R, Mitchell TA, Sarhadi M. (2000) 'Non-linear Statistical Models for the 3D Reconstruction of Human Pose and Motion from Monocular Image Sequences'. Image and Vision Computing, 18 (9), pp. 729-737.
  • Mitchell TA, Bowden R. (1999) 'Automated visual inspection of dry carbon-fibre reinforced composite preforms'. Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering, 213 (6), pp. 377-386.
  • Bowden R, Mitchell TA, Sarhadi M. (1997) 'Cluster Based non-linear Principle Component Analysis'. Electronics Letters, 33 (22), pp. 1858-1859.

Conference papers

  • Ebling S, Camgöz N, Boyes Braem P, Tissi K, Sidler-Miserez S, Stoll S, Hadfield S, Haug T, Bowden R, Tornay S, Razaviz M, Magimai-Doss M. (2018) 'SMILE Swiss German Sign Language Dataset'. The European Language Resources Association (ELRA) Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC) 2018, Miyazaki, Japan: 11th edition of the Language Resources and Evaluation Conference
    [ Status: Accepted ]

    Abstract

    Sign language recognition (SLR) involves identifying the form and meaning of isolated signs or sequences of signs. To our knowledge, the combination of SLR and sign language assessment is novel. The goal of an ongoing three-year project in Switzerland is to pioneer an assessment system for lexical signs of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) that relies on SLR. The assessment system aims to give adult L2 learners of DSGS feedback on the correctness of the manual parameters (handshape, hand position, location, and movement) of isolated signs they produce. In its initial version, the system will include automatic feedback for a subset of a DSGS vocabulary production test consisting of 100 lexical items. To provide the SLR component of the assessment system with sufficient training samples, a large-scale dataset containing videotaped repeated productions of the 100 items of the vocabulary test with associated transcriptions and annotations was created, consisting of data from 11 adult L1 signers and 19 adult L2 learners of DSGS. This paper introduces the dataset, which will be made available to the research community.

  • Mendez Maldonado O, Hadfield S, Pugeault N, Bowden R. (2017) 'Taking the Scenic Route to 3D: Optimising Reconstruction from Moving Cameras'. IEEE ICCV 2017 Proceedings, Venice, Italy: IEEE International Conference on Computer Vision 2017

    Abstract

    Reconstruction of 3D environments is a problem that has been widely addressed in the literature. While many approaches exist to perform reconstruction, few of them take an active role in deciding where the next observations should come from. Furthermore, the problem of travelling from the camera’s current position to the next, known as path-planning, usually focuses on minimising path length. This approach is ill-suited for reconstruction applications, where learning about the environment is more valuable than speed of traversal. We present a novel Scenic Route Planner that selects paths which maximise information gain, both in terms of total map coverage and reconstruction accuracy. We also introduce a new type of collaborative behaviour into the planning stage called opportunistic collaboration, which allows sensors to switch between acting as independent Structure from Motion (SfM) agents or as a variable baseline stereo pair. We show that Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 0.00027% of the possible stereo pairs (3% of the views). Comparison against length-based path-planning approaches shows that our approach produces more complete and more accurate maps with fewer frames. Finally, we demonstrate the Scenic Pathplanner’s ability to generalise to live scenarios by mounting cameras on autonomous ground-based sensor platforms and exploring an environment.

  • Camgöz N, Hadfield S, Bowden R. (2017) 'Particle Filter based Probabilistic Forced Alignment for Continuous Gesture Recognition'. IEEE IEEE International Conference on Computer Vision Workshops (ICCVW) 2017, Venice, Italy: IEEE International Conference on Computer Vision Workshops (ICCVW) 2017, pp. 3079-3085.

    Abstract

    In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatiotemporal deep neural networks using weak border level annotations. The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples and using this as the prior for the next sampling stage. We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index Score on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner independent; it can easily be combined with other approaches.

  • Camgöz N, Hadfield SJ, Koller O, Bowden R. (2017) 'SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition'. IEEE ICCV 2017 Proceedings, Venice, Italy: International Conference on Computer Vision
    [ Status: Accepted ]

    Abstract

    We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as “Sequence-to-sequence” learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while remaining trainable end-to-end. The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also allow us to implicitly perform transfer learning between different interrelated tasks, which also allows us to exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties serves to significantly improve the performance of the overarching recognition system, by better constraining the learning problem. The proposed techniques are demonstrated in the challenging domain of sign language recognition. We demonstrate state-of-the-art performance on hand-shape recognition (outperforming previous techniques by more than 30%). Furthermore, we are able to obtain comparable sign recognition rates to previous research, without the need for an alignment step to segment out the signs for recognition.

  • Allday R, Hadfield S, Bowden R. (2017) 'From Vision to Grasping: Adapting Visual Networks'. Springer TAROS-2017 Conference Proceedings. Lecture Notes in Computer Science, Guildford, UK: TAROS-2017 10454, pp. 484-494.

    Abstract

    Grasping is one of the oldest problems in robotics and is still considered challenging, especially when grasping unknown objects with unknown 3D shape. We focus on exploiting recent advances in computer vision recognition systems. Object classification problems tend to have much larger datasets to train from and have far fewer practical constraints around the size of the model and speed to train. In this paper we will investigate how to adapt Convolutional Neural Networks (CNNs), traditionally used for image classification, for planar robotic grasping. We consider the differences in the problems and how a network can be adjusted to account for this. Positional information is far more important to robotics than generic image classification tasks, where max pooling layers are used to improve translation invariance. By using a more appropriate network structure we are able to obtain improved accuracy while simultaneously improving run times and reducing memory consumption by reducing model size by up to 69%.

  • Camgoz NC, Hadfield SJ, Koller O, Bowden R. (2016) 'Using Convolutional 3D Neural Networks for User-Independent Continuous Gesture Recognition'. IEEE Proceedings IEEE International Conference of Pattern Recognition (ICPR), ChaLearn Workshop, Cancun, Mexico: 23rd International Conference on Pattern Recognition (ICPR), ChaLearn Workshop
    [ Status: Accepted ]
  • Hadfield SJ, Bowden R, Lebeda K. (2016) 'The Visual Object Tracking VOT2016 Challenge Results'. Springer Lecture Notes in Computer Science, Amsterdam, Netherlands: European Conference on Computer Vision (ECCV) 2016 workshops 9914, pp. 777-823.
  • Koller O, Zargaran O, Ney H, Bowden R. (2016) 'Deep Sign: Hybrid CNN-HMM for Continuous Sign Language Recognition'. BMVA Press Proceedings of the British Machine Vision Conference 2016, York: The British Machine Vision Conference (BMVC) 2016
  • Mendez Maldonado OA, Hadfield SJ, Pugeault N, Bowden R. (2016) 'Next-best stereo: extending next best view optimisation for collaborative sensors'. Proceedings of BMVC 2016, York, UK: British Machine Vision Conference
    [ Status: Accepted ]

    Abstract

    Most 3D reconstruction approaches passively optimise over all data, exhaustively matching pairs, rather than actively selecting data to process. This is costly both in terms of time and computer resources, and quickly becomes intractable for large datasets. This work proposes an approach to intelligently filter large amounts of data for 3D reconstructions of unknown scenes using monocular cameras. Our contributions are twofold: First, we present a novel approach to efficiently optimise the Next-Best View (NBV) in terms of accuracy and coverage using partial scene geometry. Second, we extend this to intelligently selecting stereo pairs by jointly optimising the baseline and vergence to find the NBV’s best stereo pair to perform reconstruction. Both contributions are extremely efficient, taking 0.8ms and 0.3ms per pose, respectively. Experimental evaluation shows that the proposed method allows efficient selection of stereo pairs for reconstruction, such that a dense model can be obtained with only a small number of images. Once a complete model has been obtained, the remaining computational budget is used to intelligently refine areas of uncertainty, achieving results comparable to state-of-the-art batch approaches on the Middlebury dataset, using as little as 3.8% of the views.

  • Koller O, Bowden R, Ney H. (2016) 'Automatic Alignment of HamNoSys Subunits for Continuous Sign Language Recognition'. LREC 2016 Proceedings, Portorož (Slovenia): LREC 2016: 10th edition of the Language Resources and Evaluation Conference, pp. 121-128.
  • Bowden R. (2016) 'Learning multi-class discriminative patterns using episode-trees'. International Academy, Research, and Industry Association (IARIA) 7th International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2016), Rome, Italy: Cloud Computing 2016 - The Seventh International Conference on Cloud Computing, GRIDs, and Virtualization

    Abstract

    In this paper, we aim to tackle the problem of recognising temporal sequences in the context of a multi-class problem. In the past, the representation of sequential patterns was used for modelling discriminative temporal patterns for different classes. Here, we have improved on this by using the more general representation of episodes, of which sequential patterns are a special case. We then propose a novel tree structure called a MultI-Class Episode Tree (MICE-Tree) that allows one to simultaneously model a set of different episodes in an efficient manner whilst providing labels for them. A set of MICE-Trees are then combined together into a MICE-Forest that is learnt in a Boosting framework. The result is a strong classifier that utilises episodes for performing classification of temporal sequences. We also provide experimental evidence showing that the MICE-Trees allow for a more compact and efficient model compared to sequential patterns. Additionally, we demonstrate the accuracy and robustness of the proposed method in the presence of different levels of noise and class labels.

  • Koller O, Ney H, Bowden R. (2016) 'Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled'. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA: CVPR 2016: IEEE Conference on Computer Vision and Pattern Recognition
  • Koller O, Ney H, Bowden R. (2015) 'Deep Learning of Mouth Shapes for Sign Language'. IEEE 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW), Santiago, CHILE: IEEE International Conference on Computer Vision Workshops, pp. 477-483.
  • Hadfield S, Bowden R. (2015) 'Exploiting high level scene cues in stereo reconstruction'. IEEE 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), Santiago, CHILE: IEEE International Conference on Computer Vision, pp. 783-791.
  • Lebeda K, Hadfield S, Bowden R. (2015) 'Dense Rigid Reconstruction from Unstructured Discontinuous Video'. IEEE 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW), Santiago, CHILE: IEEE International Conference on Computer Vision Workshops, pp. 814-822.
  • Lebeda K, Hadfield S, Bowden R. (2015) 'Exploring Causal Relationships in Visual Object Tracking'. IEEE 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), Santiago, CHILE: IEEE International Conference on Computer Vision, pp. 3065-3073.
  • Kristan M, Matas J, Leonardis A, Felsberg M, Cehovin L, Fernandez G, Vojır T, Hager G, Nebehay G, Pflugfelder R, Gupta A, Bibi A, Lukezic A, Garcia-Martin A, Petrosino A, Saffari A, Montero A, Varfolomieiev A, Baskurt A, Zhao B, Ghanem B, Martinez B, Lee B, Han B, Wang C, Garcia C, Zhang C, Schmid C, Tao D, Kim D, Huang D, Prokhorov D, Du D, Yeung D, Ribeiro E, Khan F, Porikli F, Bunyak F, Zhu G, Seetharaman G, Kieritz H, Yau H, Li H, Qi H, Bischof H, Possegger H, Lee H, Nam H, Bogun I, Jeong J, Cho J, Lee J, Zhu J, Shi J, Li J, Jia J, Feng J, Gao J, Choi J, Kim J, Lang J, Martinez J, Choi J, Xing J, Xue K, Palaniappan K, Lebeda K, Alahari K, Gao K, Yun K, Wong K, Luo L, Ma L, Ke L, Wen L, Bertinetto L, Pootschi M, Maresca M, Danelljan M, Wen M, Zhang M, Arens M, Valstar M, Tang M, Chang M, Khan M, Fan N, Wang N, Miksik O, Torr P, Wang Q, Martin-Nieto R, Pelapur R, Bowden R, Laganière R, Moujtahid S, Hare S, Hadfield SJ, Lyu S, Li S, Zhu S, Becker S, Duffner S, Hicks S, Golodetz S, Choi S, Wu T, Mauthner T, Pridmore T, Hu W, Hübner W, Wang X, Li X, Shi X, Zhao X, Mei X, Shizeng Y, Hua Y, Li Y, Lu Y, Li Y, Chen Z, Huang Z, Chen Z, Zhang Z, He Z, Hong Z. (2015) 'The Visual Object Tracking VOT2015 challenge results'. ICCV workshop on Visual Object Tracking Challenge, Santiago, Chile: 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp. 564-586.

    Abstract

    The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as in VOT2014 with full annotation of targets by rotated bounding boxes and per-frame attributes, (ii) extensions of the VOT2014 evaluation methodology by introduction of a new performance measure. The dataset, the evaluation kit, as well as the results are publicly available at the challenge website.

  • Krejov P, Gilbert A, Bowden R. (2015) 'Combining Discriminative and Model Based Approaches for Hand Pose Estimation'. IEEE 2015 11TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), Ljubljana, SLOVENIA: IEEE 11th International Conference and Workshops on Automatic Face and Gesture Recognition (FG)
  • Lebeda K, Hadfield SJ, Bowden R. (2014) '2D Or Not 2D: Bridging the Gap Between Tracking and Structure from Motion'. NUS, Singapore: Asian Conference on Computer Vision (ACCV)

    Abstract

    In this paper, we address the problem of tracking an unknown object in 3D space. Online 2D tracking often fails for strong out-of-plane rotation which results in considerable changes in appearance beyond those that can be represented by online update strategies. However, by modelling and learning the 3D structure of the object explicitly, such effects are mitigated. To address this, a novel approach is presented, combining techniques from the fields of visual tracking, structure from motion (SfM) and simultaneous localisation and mapping (SLAM). This algorithm is referred to as TMAGIC (Tracking, Modelling And Gaussian-process Inference Combined). At every frame, point and line features are tracked in the image plane and are used, together with their 3D correspondences, to estimate the camera pose. These features are also used to model the 3D shape of the object as a Gaussian process. Tracking determines the trajectories of the object in both the image plane and 3D space, but the approach also provides the 3D object shape. The approach is validated on several video-sequences used in the tracking literature, comparing favourably to state-of-the-art trackers for simple scenes (error reduced by 22%) with clear advantages in the case of strong out-of-plane rotation, where 2D approaches fail (error reduction of 58%).

  • Hadfield S, Lebeda K, Bowden R. (2014) 'Natural action recognition using invariant 3D motion encoding'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8690 LNCS (PART 2), pp. 758-771.
  • Hadfield SJ, Lebeda K, Bowden R. (2014) 'The Visual Object Tracking VOT2014 challenge results'. Zurich, Switzerland: European Conference on Computer Vision (ECCV) Visual Object Tracking Challenge Workshop
  • Marter M, Hadfield S, Bowden R. (2014) 'Friendly faces: Weakly supervised character identification'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8912, pp. 121-132.

    Abstract

    This paper demonstrates a novel method for automatically discovering and recognising characters in video without any labelled examples or user intervention. Instead, weak supervision is obtained via a rough script-to-subtitle alignment. The technique uses pose-invariant features, extracted from detected faces and clustered to form groups of co-occurring characters. Results show that with 9 characters, 29% of the closest exemplars are correctly identified, increasing to 50% as additional exemplars are considered.

  • Ong EJ, Koller O, Pugeault N, Bowden R. (2014) 'Sign spotting using hierarchical sequential patterns with temporal intervals'. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1931-1938.
  • Hadfield S, Bowden R. (2014) 'Scene Flow Estimation using Intelligent Cost Functions'. Nottingham, UK: BMVA Proceedings of the British Machine Vision Conference (BMVC),

    Abstract

    Motion estimation algorithms are typically based upon the assumption of brightness constancy or related assumptions such as gradient constancy. This manuscript evaluates several common cost functions from the motion estimation literature, which embody these assumptions. We demonstrate that such assumptions break for real world data, and the functions are therefore unsuitable. We propose a simple solution, which significantly increases the discriminatory ability of the metric by learning a nonlinear relationship using techniques from machine learning. Furthermore, we demonstrate how context and a nonlinear combination of metrics can provide additional gains, demonstrating a 44% improvement in the performance of a state-of-the-art scene flow estimation technique. In addition, smaller gains of 20% are demonstrated in optical flow estimation tasks.

  • Koller O, Ney H, Bowden R. (2014) 'Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora'. EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, Reykjavik, ICELAND: 9th International Conference on Language Resources and Evaluation (LREC)
  • Bowden R. (2014) 'Seeing and understanding people'. CRC PRESS-TAYLOR & FRANCIS GROUP COMPUTATIONAL VISION AND MEDICAL IMAGE PROCESSING IV, Funchal, PORTUGAL: 4th Eccomas Thematic Conference on Computational Vision and Medical Image Processing (VipIMAGE), pp. 9-15.
  • Lebeda K, Hadfield S, Matas J, Bowden R. (2013) 'Long-Term Tracking Through Failure Cases'. Sydney, Australia: IEEE Proceedings, IEEE workshop on visual object tracking challenge at ICCV, pp. 153-160.
  • Hadfield S, Bowden R. (2013) 'Hollywood 3D: Recognizing Actions in 3D Natural Scenes'. Portland, Oregon: IEEE Proceedings, IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 3398-3405.

    Abstract

    Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood 3D dataset. We make the dataset, including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.

  • Sheerman-Chase T, Ong E-J, Pugeault N, Bowden R. (2013) 'Improving Recognition and Identification of Facial Areas Involved in Non-verbal Communication by Feature Selection'. Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, Shanghai, China: 10th IEEE Conference on Automatic Face and Gesture Recognition (FG)

    Abstract

    Meaningful Non-Verbal Communication (NVC) signals can be recognised by facial deformations based on video tracking. However, the geometric features previously used contain a significant amount of redundant or irrelevant information. A feature selection method is described for selecting a subset of features that improves performance and allows for the identification and visualisation of facial areas involved in NVC. The feature selection is based on a sequential backward elimination of features to find an effective subset of components. This results in a significant improvement in recognition performance, as well as providing evidence that brow lowering is involved in questioning sentences. The improvement in performance is a step towards a more practical automatic system and the facial areas identified provide some insight into human behaviour.

  • Ellis L, Pugeault N, Ofjall K, Hedborg J, Bowden R, Felsberg M. (2013) 'Autonomous navigation and sign detector learning'. 2013 IEEE Workshop on Robot Vision, WORV 2013, pp. 144-151.

    Abstract

    This paper presents an autonomous robotic system that incorporates novel Computer Vision, Machine Learning and Data Mining algorithms in order to learn to navigate and discover important visual entities. This is achieved within a Learning from Demonstration (LfD) framework, where policies are derived from example state-to-action mappings. For autonomous navigation, a mapping is learnt from holistic image features (GIST) onto control parameters using Random Forest regression. Additionally, visual entities (road signs, e.g. STOP signs) that are strongly associated to autonomously discovered modes of action (e.g. stopping behaviour) are discovered through a novel Percept-Action Mining methodology. The resulting sign detector is learnt without any supervision (no image labeling or bounding box annotations are used). The complete system is demonstrated on a fully autonomous robotic platform, featuring a single camera mounted on a standard remote control car. The robot carries a PC laptop that performs all the processing on board and in real-time.

  • Bowden R, Cox S, Harvey R, Lan Y, Ong E-J, Owen G, Theobald B-J. (2013) 'Recent developments in automated lip-reading'. SPIE-INT SOC OPTICAL ENGINEERING OPTICS AND PHOTONICS FOR COUNTERTERRORISM, CRIME FIGHTING AND DEFENCE IX; AND OPTICAL MATERIALS AND BIOMATERIALS IN SECURITY AND DEFENCE SYSTEMS TECHNOLOGY X, Dresden, GERMANY: Conference on Optics and Photonics for Counterterrorism, Crime Fighting and Defence IX; and Optical Materials and Biomaterials in Security and Defence Systems Technology X 8901
  • Efthimiou E, Fotinea S-E, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Lefebvre-Albaret F. (2012) 'Sign Language technologies and resources of the Dicta-Sign project'. Institute for German Sign Language and Communication of the Deaf Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon. Satellite Workshop to the eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey: 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon. Language Resources and Evaluation Conference (LREC), pp. 37-44.

    Abstract

    Here we present the outcomes of the Dicta-Sign FP7-ICT project. Dicta-Sign researched ways to enable communication between Deaf individuals through the development of human-computer interfaces (HCI) for Deaf users, by means of Sign Language. It has researched and developed recognition and synthesis engines for sign languages (SLs) that have brought sign recognition and generation technologies significantly closer to authentic signing. In this context, Dicta-Sign has developed several technologies demonstrated via a sign language aware Web 2.0, combining work from the fields of sign language recognition, sign language animation via avatars, and sign language resources and language model development, with the goal of allowing Deaf users to make, edit, and review avatar-based sign language contributions online, similar to the way people nowadays make text-based contributions on the Web.

  • Gupta A, Bowden R. (2012) 'Unity in diversity: Discovering topics from words: Information theoretic co-clustering for visual categorization'. VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications, Rome: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 1, pp. 628-633.
  • Ong EJ, Cooper H, Pugeault N, Bowden R. (2012) 'Sign Language Recognition using Sequential Pattern Trees'. Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2200-2207.
  • Holt B, Bowden R. (2012) 'Static pose estimation from depth images using random regression forests and Hough voting'. VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications, Rome: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 1, pp. 557-564.
  • Hadfield S, Bowden R. (2012) 'Go with the flow: Hand trajectories in 3D via clustered scene flow'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7324 LNCS (PART 1), pp. 285-295.
  • Shaukat A, Gilbert A, Windridge D, Bowden R. (2012) 'Meeting in the Middle: A top-down and bottom-up approach to detect pedestrians'. Proceedings - International Conference on Pattern Recognition, pp. 874-877.

    Abstract

    This paper proposes a generic approach combining a bottom-up (low-level) visual detector with a top-down (high-level) fuzzy first-order logic (FOL) reasoning framework in order to detect pedestrians from a moving vehicle. Detections from the low-level visual corner based detector are fed into the logical reasoning framework as logical facts. A set of FOL clauses utilising fuzzy predicates with piecewise linear continuous membership functions associates a fuzzy confidence (a degree-of-truth) to each detector input. Detections associated with lower confidence functions are deemed as false positives and blanked out, thus adding top-down constraints based on global logical consistency of detections. We employ a state of the art visual detector on a challenging pedestrian detection dataset, and demonstrate an increase in detection performance when used in a framework that combines bottom-up detections with (fuzzy FOL-based) top-down constraints.
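    The fuzzy predicates above rest on piecewise-linear continuous membership functions. The paper's actual predicates and parameters are not reproduced here, so the following is a minimal sketch with hypothetical parameters (a "pedestrian-sized" height predicate and an aspect-ratio predicate), using the standard minimum operator for fuzzy conjunction:

    ```python
    def trapezoid(x, a, b, c, d):
        """Piecewise-linear trapezoidal membership: 0 outside (a, d),
        rising linearly on [a, b], 1 on [b, c], falling on [c, d]."""
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)

    # Hypothetical degree-of-truth for one detection: conjunction (min)
    # of two fuzzy predicates; thresholding this fused confidence would
    # blank out detections deemed false positives.
    conf = min(trapezoid(60, 40, 80, 200, 260),       # bounding-box height (px)
               trapezoid(0.45, 0.2, 0.35, 0.6, 0.8))  # width/height aspect ratio
    ```

    The height of 60 px falls on the rising edge of its trapezoid (degree 0.5) while the aspect ratio sits on the plateau (degree 1.0), so the fused confidence is 0.5.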

  • Pugeault N, Bowden R. (2011) 'Driving me Around the Bend: Learning to Drive from Visual Gist'. IEEE 2011 IEEE International Conference on Computer Vision, Barcelona, Spain: ICCV 2011: 1st IEEE Workshop on Challenges and Opportunities in Robotic Perception, pp. 1022-1029.

    Abstract

    This article proposes an approach to learning steering and road following behaviour from a human driver using holistic visual features. We use a random forest (RF) to regress a mapping between these features and the driver's actions, and propose an alternative to classical random forest regression based on the Medoid (RF-Medoid), that reduces the underestimation of extreme control values. We compare prediction performance using different holistic visual descriptors: GIST, Channel-GIST (C-GIST) and Pyramidal-HOG (P-HOG). The proposed methods are evaluated on two different datasets: predicting human behaviour on countryside roads and also for autonomous control of a robot on an indoor track. We show that 1) C-GIST leads to the best predictions on both sequences, and 2) RF-Medoid leads to a better estimation of extreme values, where a classical RF tends to under-steer. We use around 10% of the data for training and show excellent generalization over a dataset of thousands of images. Importantly, we do not engineer the solution but instead use machine learning to automatically identify the relationship between visual features and behaviour, providing an efficient, generic solution to autonomous control.
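    The RF-Medoid idea replaces the mean over per-tree predictions with their medoid, so the output is always one of the actual tree predictions rather than an average pulled toward extreme values. A minimal sketch (not the paper's implementation; the per-tree predictions below are made up for illustration):

    ```python
    import numpy as np

    def medoid(values):
        # The medoid is the member of the set with the smallest total
        # distance to all other members (1D absolute distance here).
        values = np.asarray(values, dtype=float)
        dists = np.abs(values[:, None] - values[None, :]).sum(axis=1)
        return float(values[np.argmin(dists)])

    # Hypothetical per-tree steering outputs for one frame: one tree
    # over-steers badly, dragging the classical RF mean off-target.
    preds = [0.10, 0.11, 0.12, 0.13, 0.90]
    rf_mean = float(np.mean(preds))   # pulled toward the outlier
    rf_medoid = medoid(preds)         # stays on a real tree prediction
    ```

    For 1D predictions the medoid coincides with the median, which is why it underestimates extreme control values less than the mean does.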

  • Pugeault N, Bowden R. (2011) 'Spelling It Out: Real–Time ASL Fingerspelling Recognition'. IEEE 2011 IEEE International Conference on Computer Vision Workshops, Barcelona, Spain: ICCV 2011: 1st IEEE Workshop on Consumer Depth Cameras for Computer Vision, pp. 1114-1119.

    Abstract

    This article presents an interactive hand shape recognition user interface for American Sign Language (ASL) finger-spelling. The system makes use of a Microsoft Kinect device to collect appearance and depth images, and of the OpenNI+NITE framework for hand detection and tracking. Hand-shapes corresponding to letters of the alphabet are characterized using appearance and depth images and classified using random forests. We compare classification using appearance and depth images, show that a combination of both leads to the best results, and validate on a dataset of four different users. This hand shape detection works in real-time and is integrated in an interactive user interface allowing the signer to select between ambiguous detections, and integrated with an English dictionary for efficient writing.

  • Sheerman-Chase T, Ong E-J, Bowden R. (2011) 'Cultural factors in the regression of non-verbal communication perception'. IEEE 2011 IEEE International Conference on Computer Vision, Barcelona, Spain: ICCV 2011, pp. 1242-1249.

    Abstract

    Recognition of non-verbal communication (NVC) is important for understanding human communication and designing user-centric interfaces. Cultural differences affect the expression and perception of NVC, but no previous automatic system considers these cultural differences. Annotation data for the LILiR TwoTalk corpus, containing dyadic (two person) conversations, was gathered using Internet crowdsourcing, with a significant quantity collected from India, Kenya and the United Kingdom (UK). Many studies have investigated cultural differences based on human observations, but this has not been addressed in the context of automatic emotion or NVC recognition. Perhaps not surprisingly, testing an automatic system on data that is not culturally representative of the training data is seen to result in low performance. We address this problem by training and testing our system on a specific culture to enable better modeling of the cultural differences in NVC perception. The system uses linear predictor tracking, with features generated from distances between pairs of trackers. The annotations indicate the strength of the NVC, which enables the use of ν-SVR to perform the regression.
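
    The regression step can be sketched with scikit-learn's `NuSVR`, the usual implementation of ν-SVR. The features and scores below are synthetic stand-ins for the pairwise tracker distances and crowdsourced NVC-strength annotations; nothing here is taken from the paper's data:

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
# Synthetic stand-ins: 6 pairwise tracker distances per frame, and an
# annotator "NVC strength" score that depends on two of them.
X = rng.uniform(0.0, 1.0, size=(200, 6))
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(0.0, 0.05, 200)

model = NuSVR(nu=0.5, C=1.0, kernel="rbf")
model.fit(X[:150], y[:150])              # train on 150 frames
print(model.score(X[150:], y[150:]))     # R^2 on 50 held-out frames
```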

  • Gilbert A, Bowden R. (2011) 'iGroup: Weakly supervised image and video grouping'. IEEE 2011 IEEE International Conference on Computer Vision, Barcelona, Spain: ICCV 2011, pp. 2166-2173.

    Abstract

    We present a generic, efficient and iterative algorithm for interactively clustering classes of images and videos. The approach moves away from the use of large hand-labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of “seed” examples. Two efficient data mining tools originally developed for text analysis, min-Hash and APriori, are used and extended to achieve both speed and scalability on large image and video datasets. Inspired by the Bag-of-Words (BoW) architecture, the idea of an image signature is introduced as a simple descriptor on which nearest neighbour classification can be performed. The image signature is then dynamically expanded to identify common features amongst samples of the same class. The iterative approach uses APriori to identify common and distinctive elements of a small set of labelled true and false positive signatures. These elements are then accentuated in the signature to increase similarity between examples and “pull” positive classes together. By repeating this process, the accuracy of similarity increases dramatically despite only a few training examples: only 10% of the labelled ground truth is needed, compared to other approaches. It is tested on two image datasets including the caltech101 [9] dataset and on three state-of-the-art action recognition datasets. On the YouTube [18] video dataset the accuracy increases from 72% to 97% using only 44 labelled examples from a dataset of over 1200 videos. The approach is both scalable and efficient, with an iteration on the full YouTube dataset taking around 1 minute on a standard desktop machine.
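
    The min-Hash component is easy to sketch in isolation. Below is a generic min-Hash estimator of Jaccard similarity between two sets (integers standing in for BoW signature elements); the function names and parameters are illustrative, not from the paper:

```python
import random

def minhash_signature(item_set, num_hashes=50, seed=1):
    """Summarise a set by its minimum value under several random
    hash functions h(x) = (a*x + b) mod p."""
    rnd = random.Random(seed)
    p = 2_147_483_647
    params = [(rnd.randrange(1, p), rnd.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in item_set) for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing entries estimates the true Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 80))
b = set(range(20, 100))                       # true Jaccard = 60/100 = 0.6
sa, sb = minhash_signature(a), minhash_signature(b)
print(estimated_jaccard(sa, sb))              # close to 0.6
```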

  • Hadfield S, Bowden R. (2011) 'Kinecting the dots: Particle Based Scene Flow From Depth Sensors'. IEEE 2011 IEEE International Conference on Computer Vision, Barcelona, Spain: ICCV 2011, pp. 2290-2295.

    Abstract

    The motion field of a scene can be used for object segmentation and to provide features for classification tasks like action recognition. Scene flow is the full 3D motion field of the scene, and is more difficult to estimate than its 2D counterpart, optical flow. Current approaches use a smoothness cost for regularisation, which tends to over-smooth at object boundaries. This paper presents a novel formulation for scene flow estimation as a collection of moving points in 3D space, modelled using a particle filter that supports multiple hypotheses and does not over-smooth the motion field. In addition, this paper is the first to address scene flow estimation while making use of modern depth sensors and monocular appearance images, rather than traditional multi-viewpoint rigs. The algorithm is applied to an existing scene flow dataset, where it achieves comparable results to approaches utilising multiple views, while taking a fraction of the time.
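
    The multiple-hypothesis idea behind the particle-based formulation can be illustrated with a minimal 1D particle filter. A real scene-flow implementation maintains particles over 3D position and motion and weights them against appearance and depth; this toy merely tracks a single moving point from noisy observations, and all names and noise levels are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, observation, obs_noise=0.5):
    """One predict/weight/resample cycle for (position, velocity) particles.
    Several hypotheses survive resampling, so the estimate is not smoothed
    toward a single mode the way a regularised flow field would be."""
    particles[:, 0] += particles[:, 1]                    # predict
    particles += rng.normal(0.0, 0.1, particles.shape)    # process noise
    w = np.exp(-0.5 * ((particles[:, 0] - observation) / obs_noise) ** 2)
    w /= w.sum()                                          # observation weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]                                 # resample

particles = rng.normal(0.0, 1.0, (500, 2))
true_pos = 0.0
for _ in range(30):
    true_pos += 1.0                                 # point moves 1 unit/frame
    observation = true_pos + rng.normal(0.0, 0.5)   # noisy sensor reading
    particles = particle_filter_step(particles, observation)

print(particles[:, 0].mean())  # tracks the true position, ~30
```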

  • Holt B, Ong E-J, Cooper H, Bowden R. (2011) 'Putting the pieces together: Connected Poselets for human pose estimation'. IEEE 2011 IEEE International Conference on Computer Vision, Barcelona, Spain: ICCV Workshops 2011, pp. 1196-1201.

    Abstract

    We propose a novel hybrid approach to static pose estimation called Connected Poselets. This representation combines the best aspects of part-based and example-based estimation. Poselets are first extracted from the training data; our method then applies a modified Random Decision Forest to identify poselet activations. By combining keypoint predictions from poselet activations within a graphical model, we can infer the marginal distribution over each keypoint without any kinematic constraints. Our approach is demonstrated on a new publicly available dataset with promising results.

  • Cooper HM, Pugeault N, Bowden R. (2011) 'Reading the Signs: A Video Based Sign Dictionary'. IEEE 2011 International Conference on Computer Vision: 2nd IEEE Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams (ARTEMIS 2011), Barcelona, Spain: ICCV 2011, pp. 914-919.

    Abstract

    This article presents a dictionary for Sign Language using visual sign recognition based on linguistic subcomponents. We demonstrate a system where the user makes a query, receiving in response a ranked selection of similar results. The approach uses concepts from linguistics to provide sign sub-unit features and classifiers based on motion, sign-location and handshape. These sub-units are combined using Markov Models for sign level recognition. Results are shown for a video dataset of 984 isolated signs performed by a native signer. Recognition rates reach 71.4% for the first candidate and 85.9% for retrieval within the top 10 ranked signs.

  • Elliott R, Cooper HM, Ong EJ, Glauert J, Bowden R, Lefebvre-Albaret F. (2011) 'Search-By-Example in Multilingual Sign Language Databases'. Dundee, UK: 2nd International Workshop on Sign Language Translation and Avatar Technology (SLTAT)

    Abstract

    We describe a prototype Search-by-Example or look-up tool for signs, based on a newly developed 1000-concept sign lexicon for four national sign languages (GSL, DGS, LSF, BSL), which includes a spoken language gloss, a HamNoSys description, and a video for each sign. The look-up tool combines an interactive sign recognition system, supported by Kinect technology, with a real-time sign synthesis system, using a virtual human signer, to present results to the user. The user performs a sign to the system and is presented with animations of signs recognised as similar. The user also has the option to view any of these signs performed in the other three sign languages. We describe the supporting technology and architecture for this system, and present some preliminary evaluation results.

  • Ong E, Bowden R. (2011) 'Learning Sequential Patterns for Lipreading'. BMVA Press Proceedings of the 22nd British Machine Vision Conference, Dundee, UK: BMVC 2011, pp. 55.1-55.10.

    Abstract

    This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search for optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set’s boosting weight distribution. This temporal search method is then integrated into the boosting framework, resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state-of-the-art recognition performance, using only a small set of sequential patterns.

  • Oshin O, Gilbert A, Bowden R. (2011) 'Capturing the relative distribution of features for action recognition'. IEEE 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, Santa Barbara, USA: 2011 IEEE FG, pp. 111-116.

    Abstract

    This paper presents an approach to the categorisation of spatio-temporal activity in video which is based solely on the relative distribution of feature points. Introducing a Relative Motion Descriptor for actions in video, we show that the spatio-temporal distribution of features alone (without explicit appearance information) effectively describes actions, and demonstrate performance consistent with the state of the art. Furthermore, we propose that for actions where noisy examples exist, it is not optimal to group all action examples as a single class. Therefore, rather than engineering features that attempt to generalise over noisy examples, our method follows a different approach: we make use of Random Sampling Consensus (RANSAC) to automatically discover and reject outlier examples within classes. We evaluate the Relative Motion Descriptor and outlier rejection approaches on four action datasets, show that outlier rejection using RANSAC provides a consistent and notable increase in performance, and demonstrate superior performance to more complex multiple-feature-based approaches.
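
    The RANSAC-style rejection of outlier training examples can be sketched as follows. The 'model' fitted here is simply a class centroid over invented descriptor vectors, a deliberate simplification for brevity rather than the paper's descriptor:

```python
import numpy as np

rng = np.random.default_rng(0)

def ransac_inliers(descriptors, n_iters=200, sample_size=5, threshold=2.0):
    """Fit a class centroid to random subsets and keep the largest
    consensus set; examples outside it are treated as outliers."""
    best = np.zeros(len(descriptors), dtype=bool)
    for _ in range(n_iters):
        pick = rng.choice(len(descriptors), sample_size, replace=False)
        centroid = descriptors[pick].mean(axis=0)
        inliers = np.linalg.norm(descriptors - centroid, axis=1) < threshold
        if inliers.sum() > best.sum():
            best = inliers
    return best

# 40 coherent examples of one action class plus 5 gross outliers.
clean = rng.normal(0.0, 0.5, (40, 8))
outliers = rng.normal(6.0, 0.5, (5, 8))
mask = ransac_inliers(np.vstack([clean, outliers]))
print(mask.sum(), mask[40:].any())  # most clean examples kept, outliers rejected
```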

  • Okwechime D, Ong E-J, Gilbert A, Bowden R. (2011) 'Visualisation and prediction of conversation interest through mined social signals'. IEEE 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, Santa Barbara, USA: FG 2011: IEEE International Conference on Automatic Face & Gesture Recognition and Workshops, pp. 951-956.

    Abstract

    This paper introduces a novel approach to social behaviour recognition governed by the exchange of non-verbal cues between people. We conduct experiments to try and deduce distinct rules that dictate the social dynamics of people in a conversation, and utilise semi-supervised computer vision techniques to extract their social signals such as laughing and nodding. Data mining is used to deduce frequently occurring patterns of social trends between a speaker and listener in both interested and not interested social scenarios. The confidence values from rules are utilised to build a Social Dynamic Model (SDM), that can then be used for classification and visualisation. By visualising the rules generated in the SDM, we can analyse distinct social trends between an interested and not interested listener in a conversation. Results show that these distinctions can be applied generally and used to accurately predict conversational interest.

  • Ellis L, Felsberg M, Bowden R. (2011) 'Affordance mining: Forming perception through action'. Springer Lecture Notes in Computer Science: 10th Asian Conference on Computer Vision, Revised Selected Papers Part IV, Queenstown, New Zealand: ACCV 2010 6495, pp. 525-538.

    Abstract

    This work employs data mining algorithms to discover visual entities that are strongly associated with autonomously discovered modes of action in an embodied agent. Mappings are learnt from these perceptual entities onto the agent's action space. In general, low dimensional action spaces are better suited to unsupervised learning than high dimensional percept spaces, allowing structure to be discovered in the action space and used to organise the perceptual space. Local feature configurations that are strongly associated with a particular ‘type’ of action (and not all other action types) are considered likely to be relevant in eliciting that action type. By learning mappings from these relevant features onto the action space, the system is able to respond in real time to novel visual stimuli. The proposed approach is demonstrated on an autonomous navigation task, and the system is shown to identify the visual entities relevant to the task and to generate appropriate responses.

  • Oshin O, Gilbert A, Bowden R. (2011) 'There Is More Than One Way to Get Out of a Car: Automatic Mode Finding for Action Recognition in the Wild'. Springer Berlin / Heidelberg Lecture Notes in Computer Science: Pattern Recognition and Image Analysis, Las Palmas de Gran Canaria, Spain: 5th IbPRIA 2011 6669, pp. 41-48.

    Abstract

    “Actions in the wild” is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies [10] or the Internet [9]. State-of-the-art recognition rates in this domain are orders of magnitude lower than in more contrived settings, one of the primary reasons being the huge variability within each action class. We propose to tackle recognition in the wild by automatically breaking complex action categories into multiple modes/groups and training a separate classifier for each mode. This is achieved using RANSAC, which identifies and separates the modes while rejecting outliers. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. Our results demonstrate the validity of the approach, and for classes which exhibit multi-modality, we achieve in excess of double the performance over approaches that assume single modality.

  • Okwechime D, Ong E-J, Gilbert A, Bowden R. (2011) 'Social interactive human video synthesis'. Springer Lecture Notes in Computer Science: Computer Vision – ACCV 2010, Queenstown, New Zealand: 10th Asian Conference on Computer Vision 6492 (PART 1), pp. 256-270.
  • Ong E, Bowden R. (2011) 'Learning Temporal Signatures for Lip Reading'. ARTEMIS, ICCV 2011
  • Hadfield SJ, Bowden R . (2010) 'Generalised Pose Estimation Using Depth'. In proceedings, European Conference on Computer Vision (Workshops), Heraklion, Crete: Workshop on Sign Gesture and Activity

    Abstract

    Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information, extracted from a calibrated stereo sequence, is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.
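
    The Local Binary Pattern histograms fed to the decision forest can be sketched directly. This is the basic 8-neighbour LBP operator without uniform-pattern or multi-scale refinements, applied to an invented toy image:

```python
import numpy as np

def lbp_histogram(image):
    """Basic 8-neighbour LBP: threshold each pixel's ring of neighbours
    against the centre pixel and histogram the resulting 256 codes."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),   # clockwise ring,
               (1, 1), (1, 0), (1, -1), (0, -1)]     # starting top-left
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= centre).astype(np.uint8) << np.uint8(bit)
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()

flat = np.full((8, 8), 10)   # a flat patch: every neighbour ties the centre
hist = lbp_histogram(flat)
print(hist[255])             # 1.0 -- every code is 11111111
```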

  • Moore S, Ong EJ, Bowden R. (2010) 'Facial Expression Recognition using Spatiotemporal Boosted Discriminatory Classifiers'. Portugal: International Conference on Image Analysis and Recognition 6111/2010, pp. 405-414.
  • Cooper H, Bowden R. (2010) 'Sign Language Recognition using Linguistically Derived Sub-Units'. Valetta, Malta : European Language Resources Association (ELRA) Proceedings of 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, Valetta, Malta: LREC 2010, pp. 57-61.

    Abstract

    This work proposes to learn linguistically-derived sub-unit classifiers for sign language. The responses of these classifiers can be combined by Markov models, producing efficient sign-level recognition. Tracking is used to create vectors of hand positions per frame as inputs for sub-unit classifiers learnt using AdaBoost. Grid-like classifiers are built around specific elements of the tracking vector to model the placement of the hands. Comparative classifiers encode the positional relationship between the hands. Finally, binary-pattern classifiers are applied over the tracking vectors of multiple frames to describe the motion of the hands. Results for the sub-unit classifiers in isolation are presented, reaching averages over 90%. Using a simple Markov model to combine the sub-unit classifiers allows sign-level classification giving an average of 63%, over a 164-sign lexicon, with no grammatical constraints.

  • Efthimiou E, Fotinea SE, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Goudenove F. (2010) 'DICTA-SIGN: Sign Language Recognition, Generation and Modelling with application in Deaf Communication'. Malta: 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, LREC2010, pp. 80-84.
  • Pugeault N, Bowden R. (2010) 'Learning pre-attentive driving behaviour from holistic visual features'. ECCV 2010, Part VI, LNCS 6316, , pp. 154-167.
  • Ong EJ, Lan Y, Theobald BJ, Harvey R, Bowden R. (2009) 'Robust Facial Feature Tracking using Selected Multi-Resolution Linear Predictors,'. Kyoto, Japan: Int. Conference Computer Vision ICCV09, pp. 1483-1490.
  • Okwechime D, Ong E-J, Bowden R. (2009) 'Real-time motion control using pose space probability density estimation'. IEEE 2009 IEEE 12th International Conference on Computer Vision Workshops, Kyoto, Japan: ICCV 2009, pp. 2056-2063.
  • Sheerman-Chase T, Ong E-J, Bowden R. (2009) 'Feature selection of facial displays for detection of non verbal communication in natural conversation'. IEEE 2009 IEEE 12th International Conference on Computer Vision Workshops, Kyoto, Japan: ICCV 2009, pp. 1985-1992.

    Abstract

    Recognition of human communication has previously focused on deliberately acted emotions or on structured or artificial social contexts, making the results hard to apply to realistic social situations. This paper describes the recording of spontaneous human communication in a specific and common social situation: conversation between two people. The clips are then annotated by multiple observers to reduce individual variations in interpretation of social signals. Temporal and static features are generated from tracking using heuristic and algorithmic methods. Optimal features for classifying examples of spontaneous communication signals are then extracted by AdaBoost. The performance of the boosted classifier is comparable to human performance for some communication signals, even on this challenging and realistic data set.

  • Gilbert A, Illingworth J, Bowden R, Capitan J, Merino L. (2009) 'Accurate fusion of robot, camera and wireless sensors for surveillance applications'. IEEE IEEE 12th International Conference on Computer Vision Workshops, Kyoto, Japan: ICCV 2009, pp. 1290-1297.

    Abstract

    Within the field of tracking people, often only fixed cameras are used. This can mean that when the illumination of the image changes or object occlusion occurs, tracking can fail. We propose an approach that uses three simultaneous separate sensors. The fixed surveillance cameras track objects of interest across cameras by incrementally learning relationships between regions of the image. Cameras and laser rangefinder sensors onboard robots also provide an estimate of the person's position. Moreover, the signal strength of mobile devices carried by the person can be used to estimate their position. The estimates from all these sources are then combined using data fusion to provide an increase in performance. We present results of the fixed-camera-based tracking operating in real time on a large outdoor environment of over 20 non-overlapping cameras. Moreover, the tracking algorithms for robots and wireless nodes are described. A decentralized data fusion algorithm for combining all this information is presented.
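
    A minimal sketch of the fusion step, assuming independent Gaussian position estimates combined by inverse-variance weighting; this is a simplification of the decentralized filter described above, and the sensor values are invented:

```python
import numpy as np

def fuse(estimates, variances):
    """Inverse-variance fusion of independent position estimates:
    the information-filter form of combining Gaussian measurements."""
    w = 1.0 / np.asarray(variances, dtype=float)
    fused = float((w * np.asarray(estimates, dtype=float)).sum() / w.sum())
    return fused, 1.0 / float(w.sum())

# Hypothetical 1D positions from a fixed camera, a robot's laser
# rangefinder and a wireless signal-strength estimate.
est, var = fuse([10.2, 9.8, 12.0], [0.5, 0.2, 4.0])
print(est, var)  # pulled toward the confident laser estimate; variance shrinks
```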

  • Oshin O, Gilbert A, Illingworth J, Bowden R. (2009) 'Action recognition using Randomised Ferns'. IEEE 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009, Kyoto, Japan: ICCV 2009, pp. 530-537.

    Abstract

    This paper presents a generic method for recognising and localising human actions in video based solely on the distribution of interest points. The use of local interest points has shown promising results in both object and action recognition. While previous methods classify actions based on the appearance and/or motion of these points, we hypothesise that the distribution of interest points alone contains the majority of the discriminatory information. Motivated by its recent success in rapidly detecting 2D interest points, the semi-naive Bayesian classification method of Randomised Ferns is employed. Given a set of interest points within the boundaries of an action, the generic classifier learns the spatial and temporal distributions of those interest points. This is done efficiently by comparing sums of responses of interest points detected within randomly positioned spatio-temporal blocks within the action boundaries. We present results on the largest and most popular human action dataset using a number of interest point detectors, and demonstrate that the distribution of interest points alone can perform as well as approaches that rely upon the appearance of the interest points.
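
    A semi-naive Bayesian fern classifier can be sketched generically. The class below is illustrative only: its binary features are random pairwise comparisons of a descriptor vector, standing in for the paper's comparisons of interest-point response sums within random spatio-temporal blocks, and the two 'action' classes are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomisedFerns:
    """Semi-naive Bayes: binary features are grouped into small 'ferns',
    modelled jointly within a fern and independently across ferns."""
    def __init__(self, n_ferns=20, fern_size=4, dim=16, n_classes=2):
        # each feature compares two random entries of the descriptor
        self.pairs = rng.integers(0, dim, size=(n_ferns, fern_size, 2))
        self.counts = np.ones((n_ferns, n_classes, 2 ** fern_size))  # Laplace

    def _codes(self, x):
        bits = x[self.pairs[..., 0]] > x[self.pairs[..., 1]]
        return (bits * 2 ** np.arange(bits.shape[1])).sum(axis=1)

    def fit(self, X, y):
        for x, c in zip(X, y):
            self.counts[np.arange(len(self.pairs)), c, self._codes(x)] += 1
        self.probs = self.counts / self.counts.sum(axis=2, keepdims=True)

    def predict(self, x):
        post = np.log(self.probs[np.arange(len(self.pairs)), :, self._codes(x)])
        return int(np.argmax(post.sum(axis=0)))

# Two synthetic 'action' descriptors distinguished by which half is large.
X0 = rng.normal(0, 1, (100, 16)); X0[:, :8] += 2.0
X1 = rng.normal(0, 1, (100, 16)); X1[:, 8:] += 2.0
X, y = np.vstack([X0, X1]), np.array([0] * 100 + [1] * 100)

ferns = RandomisedFerns()
ferns.fit(X, y)
acc = np.mean(np.array([ferns.predict(x) for x in X]) == y)
print(acc)  # high training accuracy from comparisons alone
```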

  • Sheerman-Chase T, Ong E-J, Bowden R. (2009) 'Online learning of robust facial feature trackers'. IEEE 2009 IEEE 12th International Conference on Computer Vision Workshops, Kyoto, Japan: ICCV 2009, pp. 1386-1392.

    Abstract

    This paper presents a head pose and facial feature estimation technique that works over a wide range of pose variations without a priori knowledge of the appearance of the face. Using simple LK trackers, head pose is estimated by Levenberg-Marquardt (LM) pose estimation, with the tracked features acting as constraints. Factored sampling and RANSAC are employed to both provide a robust pose estimate and identify tracker drift by constraining outliers in the estimation process. The system provides both a head pose estimate and the position of facial features and is capable of tracking over a wide range of head poses.

  • Lan Y, Harvey R, Theobald B, Ong EJ, Bowden R. (2009) 'Comparing Visual Features for Lipreading'. ICSA International Conference on Auditory-Visual Speech Processing 2009, Norwich, UK: AVSP 2009, pp. 102-106.

    Abstract

    For automatic lipreading, there are many competing methods for feature extraction. Often, because of the complexity of the task these methods are tested on only quite restricted datasets, such as the letters of the alphabet or digits, and from only a few speakers. In this paper we compare some of the leading methods for lip feature extraction and compare them on the GRID dataset which uses a constrained vocabulary over, in this case, 15 speakers. Previously the GRID data has had restricted attention because of the requirements to track the face and lips accurately. We overcome this via the use of a novel linear predictor (LP) tracker which we use to control an Active Appearance Model (AAM). By ignoring shape and/or appearance parameters from the AAM we can quantify the effect of appearance and/or shape when lip-reading. We find that shape alone is a useful cue for lipreading (which is consistent with human experiments). However, the incremental effect of shape on appearance appears to be not significant which implies that the inner appearance of the mouth contains more information than the shape.

  • Moore S, Bowden R. (2009) 'The Effects of Pose On Facial Expression Recognition'. BMVA Press Proceedings of the British Machine Vision Conference, London, UK: BMVC 2009, pp. 1-11.

    Abstract

    Research into facial expression recognition has predominantly been based upon near-frontal view data. However, a recent 3D facial expression database (the BU-3DFE database) has allowed empirical investigation of facial expression recognition across pose. In this paper, we investigate the effects of pose from frontal to profile view on facial expression recognition. Experiments are carried out on 100 subjects with 5 yaw angles over 6 prototypical expressions. Expressions have 4 levels of intensity from subtle to exaggerated. We evaluate features such as local binary patterns (LBPs) as well as various extensions of LBPs. In addition, a novel approach to facial expression recognition is proposed using local Gabor binary patterns (LGBPs). Multi-class support vector machines (SVMs) are used for classification. We investigate the effects of image resolution and pose on facial expression classification using a variety of different features.

  • Cooper H, Bowden R. (2009) 'Sign Language Recognition: Working with Limited Corpora'. SPRINGER-VERLAG BERLIN UNIVERSAL ACCESS IN HUMAN-COMPUTER INTERACTION: APPLICATIONS AND SERVICES, PT III, San Diego, CA: 5th International Conference on Universal Access in Human-Computer Interaction held at the HCI International 2009 5616, pp. 472-481.
  • Cooper H, Bowden R. (2009) 'Learning Signs from Subtitles: A Weakly Supervised Approach to Sign Language Recognition'. IEEE CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, Miami Beach, FL: IEEE-Computer-Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 2560-2566.
  • Efthimiou E, Fotinea S-E, Vogler C, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Segouat J. (2009) 'Sign language recognition, generation, and modelling: A research effort with applications in deaf communication'. Springer Lecture Notes in Computer Science: Proceedings of 5th International Conference of Universal Access in Human-Computer Interaction. Addressing Diversity, Part 1, San Diego, USA: UAHCI 2009, Held as part of HCI International 2009 5614, pp. 21-30.
  • Oshin O, Gilbert A, Illingworth J, Bowden R. (2008) 'Spatio-Temporal Feature Recognition using Randomised Ferns'. The 1st International Workshop on Machine Learning for Vision-based Motion Analysis (MVLMA'08), Marseille, France: International Workshop on Machine Learning for Vision Based Motion Analysis, ECCV08
  • Ellis L, Matas J, Bowden R. (2008) 'Online Learning and Partitioning of Linear Displacement Predictors for Tracking'. The British Machine Vision Association (BMVA) Proceedings of the British Machine Vision Conference, Leeds, UK: BMVC 2008, pp. 33-42.

    Abstract

    A novel approach to learning and tracking arbitrary image features is presented. Tracking is tackled by learning the mapping from image intensity differences to displacements. Linear regression is used, resulting in low computational cost. An appearance model of the target is built on-the-fly by clustering sub-sampled image templates. The medoidshift algorithm is used to cluster the templates, thus identifying various modes or aspects of the target appearance; each mode is associated with the most suitable set of linear predictors, allowing piecewise linear regression from image intensity differences to warp updates. Despite no hard-coding or offline learning, excellent results are shown on three publicly available video sequences, and comparisons with related approaches are made.
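
    The core of a linear predictor, learning a mapping from image intensity differences to displacements, can be sketched in 1D. The ridge-regression training and the synthetic 'scene' below are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def learn_linear_predictor(scene, centre, support, max_disp=3, ridge=1e-3):
    """Learn p such that displacement ~= p @ (template - observed_patch),
    from training pairs synthesised at known displacements."""
    template = scene[centre + support]
    D = np.array([template - scene[centre + d + support]
                  for d in range(-max_disp, max_disp + 1)])
    t = np.arange(-max_disp, max_disp + 1, dtype=float)
    # ridge-regularised least squares for the predictor vector p
    p = np.linalg.solve(D.T @ D + ridge * np.eye(D.shape[1]), D.T @ t)
    return template, p

x = np.arange(200)
scene = np.sin(0.3 * x) + 0.3 * np.sin(1.7 * x)   # a synthetic 1D 'image'
support = np.arange(-10, 11)                      # 21-pixel support region
template, p = learn_linear_predictor(scene, centre=100, support=support)

patch = scene[100 + 2 + support]                  # the target moved 2 pixels
print(p @ (template - patch))                     # ~2, recovered in one product
```

At runtime each update is a single dot product, which is what makes flocks of these predictors cheap enough for real-time tracking.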

  • Ong E-J, Bowden R. (2008) 'Robust Lip-Tracking using Rigid Flocks of Selected Linear Predictors'. 2008 8TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2008), VOLS 1 AND 2, pp. 247-254.
  • Gilbert A, Illingworth J, Bowden R. (2008) 'Scale Invariant Action Recognition Using Compound Features Mined from Dense Spatio-temporal Corners'. Springer Lecture Notes in Computer Science: Proceedings of 10th European Conference on Computer Vision (Part 1), Marseille, France: ECCV 2008 5302, pp. 222-233.
  • Okwechime D, Bowden R. (2008) 'A generative model for motion synthesis and blending using probability density estimation'. SPRINGER-VERLAG BERLIN ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS, Port d Andratx, SPAIN: 5th International Conference on Articulated Motion and Deformable Objects 5098, pp. 218-227.
  • Cooper H, Bowden R. (2007) 'Sign Language Recognition Using Boosted Volumetric Features'. MVA Organisation Proceedings of the IAPR Conference on Machine Vision Applications, Tokyo, Japan: IAPR MVA 2007, pp. 359-362.

    Abstract

    This paper proposes a method for sign language recognition that bypasses the need for tracking by classifying the motion directly. The method uses the natural extension of Haar-like features into the temporal domain, computed efficiently using an integral volume. These volumetric features are assembled into spatio-temporal classifiers using boosting. Results are presented for a fast feature extraction method and two different types of boosting. These configurations have been tested on a data set consisting of both seen and unseen signers performing 5 signs, producing competitive results.
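
    The integral volume that makes the temporal Haar-like features cheap can be sketched directly: after one cumulative-sum pass, any spatio-temporal box sum costs eight lookups. The toy video below is random data, purely for illustration:

```python
import numpy as np

def integral_volume(video):
    """Cumulative sums over t, y and x, padded with a zero border so
    that any box sum needs only eight lookups."""
    iv = video.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def box_sum(iv, t0, t1, y0, y1, x0, x1):
    """Sum of video[t0:t1, y0:y1, x0:x1] by 3D inclusion-exclusion."""
    return (iv[t1, y1, x1] - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0] - iv[t0, y0, x0])

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(10, 12, 12))   # 10 frames of 12x12 'pixels'
iv = integral_volume(video)

# A volumetric Haar-like feature: difference of two temporally adjacent boxes.
feature = box_sum(iv, 0, 5, 2, 8, 2, 8) - box_sum(iv, 5, 10, 2, 8, 2, 8)
print(feature == video[:5, 2:8, 2:8].sum() - video[5:, 2:8, 2:8].sum())  # True
```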

  • Ellis L, Bowden R. (2007) 'Learning Responses to Visual Stimuli: A Generic Approach'. Applied Computer Science Group, Bielefeld University, Germany Proceedings of the 5th International Conference on Computer Vision Systems, Bielefeld, Germany: ICVS 2007

    Abstract

    A general framework for learning to respond appropriately to visual stimulus is presented. By hierarchically clustering percept-action exemplars in the action space, contextually important features and relationships in the perceptual input space are identified and associated with response models of varying generality. Searching the hierarchy for a set of best matching percept models yields a set of action models with likelihoods. By posing the problem as one of cost surface optimisation in a probabilistic framework, a particle filter inspired forward exploration algorithm is employed to select actions from multiple hypotheses that move the system toward a goal state and to escape from local minima. The system is quantitatively and qualitatively evaluated in both a simulated shape sorter puzzle and a real-world autonomous navigation domain.

  • Cooper H, Bowden R. (2007) 'Large lexicon detection of sign language'. SPRINGER-VERLAG BERLIN HUMAN-COMPUTER INTERACTION, PROCEEDINGS, Rio de Janeiro, BRAZIL: IEEE International Workshop on Human - Computer Interaction 4796, pp. 88-97.
  • Moore S, Bowden R. (2007) 'Automatic facial expression recognition using boosted discriminatory classifiers'. Springer Lecture Notes in Computer Science: Analysis and Modelling of Faces and Gestures, Rio de Janeiro, Brazil: Third International Workshop on AMFG'07 4778, pp. 71-83.
  • Gilbert A, Bowden R. (2007) 'Multi person tracking within crowded scenes'. SPRINGER-VERLAG BERLIN Human Motion - Understanding, Modeling, Capture and Animation, Proceedings, Rio de Janeiro, BRAZIL: 2nd Workshop on Human Motion Understanding, Modeling, Capture and Animation 4814, pp. 166-179.
  • Ellis L, Dowson N, Matas J, Bowden R. (2007) 'Linear predictors for fast simultaneous modeling and tracking'. IEEE 2007 IEEE 11TH INTERNATIONAL CONFERENCE ON COMPUTER VISION, VOLS 1-6, Rio de Janeiro, BRAZIL: 11th IEEE International Conference on Computer Vision, pp. 2792-2799.
  • Dowson NDH, Bowden R. (2006) 'N-tier Simultaneous Modelling and Tracking for Arbitrary Warps'. BMVA Proceedings of the British Machine Vision Conference, Edinburgh, UK: BMVC 2006 2, pp. 569-578.

    Abstract

    This paper presents an approach to object tracking which, given a single example of a target, learns a hierarchical constellation model of appearance and structure on the fly. The model becomes more robust over time as evidence of the variability of the object is acquired and added to the model. Tracking is performed in an optimised Lucas-Kanade type framework, using Mutual Information as a similarity metric. Several novelties are presented: an improved template update strategy using Bayes' theorem, a multi-tier model topology, and a semi-automatic testing method. A critical comparison with other methods is made using exhaustive testing. In all, 11 challenging test sequences were used, with a mean length of 568 frames.
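
    The Mutual Information similarity metric used in the Lucas-Kanade framework can be sketched from a joint histogram. This is a plain binned estimator with invented names and toy images, not a tuned registration metric:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """MI from a joint histogram of two equally-sized images:
    I(A;B) = sum p(a,b) * log( p(a,b) / (p(a) p(b)) )."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                         # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noise = rng.random((64, 64))

print(mutual_information(img, img))    # high: identical images
print(mutual_information(img, noise))  # near zero: independent images
```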

  • Ong E-J, Bowden R. (2006) 'Learning Distance for Arbitrary Visual Features'. BMVA Proceedings of the British Machine Vision Conference, Edinburgh, UK: BMVC 2006 2, pp. 749-758.

    Abstract

    This paper presents a method for learning distance functions of arbitrary feature representations that is based on the concept of wormholes. We introduce wormholes and describe how they provide a method for warping the topology of visual representation spaces such that a meaningful distance between examples is available. Additionally, we show how a more general distance function can be learnt through the combination of many wormholes via an inter-wormhole network. We then demonstrate the application of the distance learning method on a variety of problems including nonlinear synthetic data, face illumination detection and the retrieval of images containing natural landscapes and man-made objects (e.g. cities).

  • Micilotta AS, Ong EJ, Bowden R. (2006) 'Real-time upper body detection and 3D pose estimation in monoscopic images'. Springer Lecture Notes in Computer Science: Proceedings of 9th European Conference on Computer Vision, Part III, Graz, Austria: ECCV 2006 3953, pp. 139-150.

    Abstract

    This paper presents a novel solution to the difficult task of both detecting and estimating the 3D pose of humans in monoscopic images. The approach consists of two parts. Firstly the location of a human is identified by a probabilistic assembly of detected body parts. Detectors for the face, torso and hands are learnt using AdaBoost. A pose likelihood is then obtained using an a priori mixture model on body configuration and possible configurations assembled from available evidence using RANSAC. Once a human has been detected, the location is used to initialise a matching algorithm which matches the silhouette and edge map of a subject with a 3D model. This is done efficiently using chamfer matching, integral images and pose estimation from the initial detection stage. We demonstrate the application of the approach to large, cluttered natural images and at near framerate operation (16fps) on lower resolution video streams.

  • Gilbert A, Bowden R. (2006) 'Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity'. Springer Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 2, Graz, Austria: ECCV 2006 3952, pp. 125-136.

    Abstract

    This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method, to model both the colour variations and posterior probability distributions of spatio-temporal links between cameras. These operate in parallel and are then used with an appearance model of the object to track across spatially separated cameras. The approach requires no pre-calibration or batch preprocessing, is completely unsupervised, and becomes more accurate over time as evidence is accumulated.

  • Dowson N, Bowden R. (2006) 'A unifying framework for mutual information methods for use in non-linear optimisation'. Springer Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 1, Graz, Austria: ECCV 2006 3951, pp. 365-378.

    Abstract

    Many variants of MI exist in the literature. These vary primarily in how the joint histogram is populated. This paper places the four main variants of MI: Standard sampling, Partial Volume Estimation (PVE), In-Parzen Windowing and Post-Parzen Windowing into a single mathematical framework. Jacobians and Hessians are derived in each case. A particular contribution is that the non-linearities implicit to standard sampling and post-Parzen windowing are explicitly dealt with. These non-linearities are a barrier to their use in optimisation. Side-by-side comparison of the MI variants is made using eight diverse data-sets, considering computational expense and convergence. In the experiments, PVE was generally the best performer, although standard sampling often performed nearly as well (if a higher sample rate was used). The widely used sum of squared differences metric performed as well as MI unless large occlusions and non-linear intensity relationships occurred. The binaries and scripts used for testing are available online.

  • Ong EJ, Bowden R. (2006) 'Learning wormholes for sparsely labelled clustering'. IEEE COMPUTER SOC 18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, Hong Kong, PEOPLES R CHINA: 18th International Conference on Pattern Recognition (ICPR 2006), pp. 916-919.

    Abstract

    Distance functions are an important component in many learning applications. However, the correct function is context dependent, so it is advantageous to learn a distance function from available training data. Many existing distance functions require data to exist in a space of constant dimensionality and cannot be directly applied to symbolic data. To address these problems, this paper introduces an alternative learnable distance function, based on multi-kernel distance bases or "wormholes", that connect spaces belonging to similar examples which were originally far apart. This work assumes only the availability of a set of data in the form of relative comparisons, avoiding the need for labelled or quantitative information. To learn the distance function, two algorithms are proposed: 1) building a set of basic wormhole bases using a Boosting-inspired algorithm; 2) merging different distance bases together for better generalisation. The learning algorithms are then shown to successfully extract suitable distance functions in various clustering problems, ranging from synthetic 2D data to symbolic representations of unlabelled images.

  • Dowson NDH, Bowden R, Kadir T. (2006) 'Image template matching using mutual information and NP-Windows'. IEEE COMPUTER SOC 18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 2, PROCEEDINGS, Hong Kong, PEOPLES R CHINA: 18th International Conference on Pattern Recognition (ICPR 2006), pp. 1186-1191.

    Abstract

    A non-parametric (NP) sampling method is introduced for obtaining the joint distribution of a pair of images. This method is based on NP windowing and is equivalent to sampling the images at infinite resolution. Unlike existing methods, arbitrary selection of kernels is not required and the spatial structure of images is used. NP windowing is applied to a registration application where the mutual information (MI) between a reference image and a warped template is maximised with respect to the warp parameters. In comparisons against the current state of the art MI registration methods, NP windowing yielded excellent results with lower bias and improved convergence rates.

  • Ellis L, Bowden R. (2005) 'A generalised exemplar approach to modelling perception action couplings'. IEEE Proceedings of the Tenth IEEE International Conference on Computer Vision Workshops, Beijing, China: Tenth IEEE International Conference on Computer Vision Workshops (ICCVW'05), pp. 1874-1874.

    Abstract

    We present a framework for autonomous behaviour in vision based artificial cognitive systems by imitation through coupled percept-action (stimulus and response) exemplars. Attributed Relational Graphs (ARGs) are used as a symbolic representation of scene information (percepts). A measure of similarity between ARGs is implemented with the use of a graph isomorphism algorithm and is used to hierarchically group the percepts. By hierarchically grouping percept exemplars into progressively more general models coupled to progressively more general Gaussian action models, we attempt to model the percept space and create a direct mapping to associated actions. The system is built on a simulated shape sorter puzzle that represents a robust vision system. Spatio-temporal hypothesis exploration is performed efficiently in a Bayesian framework using a particle filter to propagate game play over time.

  • Micilotta A, Ong E, Bowden R. (2005) 'Real-time Upper Body 3D Reconstruction from a Single Uncalibrated Camera'. The European Association for Computer Graphics 26th Annual Conference, EUROGRAPHICS 2005, Trinity College Dublin, Ireland: EUROGRAPHICS 2005: The European Association for Computer Graphics 26th Annual Conference, pp. 41-44.

    Abstract

    This paper outlines a method of estimating the 3D pose of the upper human body from a single uncalibrated camera. The objective application lies in 3D Human Computer Interaction where hand depth information offers extended functionality when interacting with a 3D virtual environment, but it is equally suitable to animation and motion capture. A database of 3D body configurations is built from a variety of human movements using motion capture data. A hierarchical structure consisting of three subsidiary databases, namely the frontal-view Hand Position (top-level), Silhouette and Edge Map Databases, are pre-extracted from the 3D body configuration database. Using this hierarchy, subsets of the subsidiary databases are then matched to the subject in real-time. The examples of the subsidiary databases that yield the highest matching score are used to extract the corresponding 3D configuration from the motion capture data, thereby estimating the upper body 3D pose.

  • Bowden R, Ellis L, Kittler J, Shevchenko M, Windridge D. (2005) 'Unsupervised symbol grounding and cognitive bootstrapping in cognitive vision'. Proc. 13th Int. Conference on Image Analysis and Processing, , pp. 27-36.
  • Bowden R, KaewTraKulPong P. (2005) 'Towards automated wide area visual surveillance: tracking objects between spatially-separated, uncalibrated views'. IEE-INST ELEC ENG IEE PROCEEDINGS-VISION IMAGE AND SIGNAL PROCESSING, London, ENGLAND: IEEE International Symposium on Imaging for Crime Detection and Prevention 152 (2), pp. 213-223.
  • Dowson NDH, Bowden R. (2005) 'Simultaneous modeling and tracking (SMAT) of feature sets'. IEEE COMPUTER SOC 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol 2, Proceedings, San Diego, CA: Conference on Computer Vision and Pattern Recognition, pp. 99-105.
  • Kadir T, Bowden R, Ong EJ, Zisserman A. (2004) 'Minimal Training, Large Lexicon, Unconstrained Sign Language Recognition'. The British Machine Vision Association and Society for Pattern Recognition BMVC 2004 Electronic Proceedings, Kingston University, London: BMVC 2004 - British Machine Vision Conference 7-9th Sept 2004, pp. 939-948.

    Abstract

    This paper presents a flexible monocular system capable of recognising sign lexicons far greater in number than previous approaches. The power of the system is due to four key elements: (i) Head and hand detection based upon boosting which removes the need for temperamental colour segmentation; (ii) A body centred description of activity which overcomes issues with camera placement, calibration and user; (iii) A two stage classification in which stage I generates a high level linguistic description of activity which naturally generalises and hence reduces training; (iv) A stage II classifier bank which does not require HMMs, further reducing training requirements. The outcome is a system capable of running in real-time, and generating extremely high recognition rates for large lexicons with as little as a single training instance per sign. We demonstrate classification rates as high as 92% for a lexicon of 164 words with extremely low training requirements, outperforming previous approaches where thousands of training examples are required.

  • Micilotta A, Bowden R. (2004) 'View-based Location and Tracking of Body Parts for Visual Interaction'. The British Machine Vision Association and Society for Pattern Recognition BMVC 2004 Electronic Proceedings, Kingston University, London: BMVC 2004 - British Machine Vision Conference 7-9th Sept 2004, pp. 849-858.

    Abstract

    This paper presents a real time approach to locate and track the upper torso of the human body. Our main interest is not in 3D biometric accuracy, but rather a sufficient discriminatory representation for visual interaction. The algorithm employs background suppression and a general approximation to body shape, applied within a particle filter framework, making use of integral images to maintain real-time performance. Furthermore, we present a novel method to disambiguate the hands of the subject and to predict the likely position of elbows. The final system is demonstrated segmenting multiple subjects from a cluttered scene at above real time operation.

  • KaewTraKulPong P, Bowden R. (2004) 'Probabilistic Learning of Salient Patterns across Spatially Separated Uncalibrated Views'. Institution of Electrical Engineers Proceedings of IDSS04 - Intelligent Distributed Surveillance Systems, Feb 2004, London, UK: IDSS04 - Intelligent Distributed Surveillance Systems, Feb 2004, pp. 36-40.
  • Bowden R. (2004) 'Progress in sign and gesture recognition'. SPRINGER-VERLAG BERLIN ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS, Palma de Mallorca, SPAIN: 3rd International Workshop on Articulated Motion and Deformable Objects 3179, pp. 13-13.
  • Windridge D, Bowden R, Kittler J. (2004) 'A General Strategy for Hidden Markov Chain Parameterisation in Composite Feature-Spaces'. SSPR/SPR, , pp. 1069-1077.
  • Bowden R, Windridge D, Kadir T, Zisserman A, Brady M. (2004) 'A Linguistic Feature Vector for the Visual Interpretation of Sign Language'. European Conference on Computer Vision, , pp. 390-401.
  • Ong EJ, Bowden R. (2004) 'A boosted classifier tree for hand shape detection'. IEEE COMPUTER SOC SIXTH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, PROCEEDINGS, Seoul, SOUTH KOREA: 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 889-894.
  • Dowson N, Bowden R. (2004) 'Metric mixtures for mutual information ((MI)-I-3) tracking'. IEEE COMPUTER SOC PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 2, British Machine Vis Assoc, Cambridge, ENGLAND: 17th International Conference on Pattern Recognition (ICPR), pp. 752-756.
  • Windridge D, Bowden R. (2004) 'Induced Decision Fusion in Automated Sign Language Interpretation: Using ICA to Isolate the Underlying Components of Sign'. Multiple Classifier Systems, , pp. 303-313.
  • KaewTrakulPong P, Bowden R. (2003) 'A real time adaptive visual surveillance system for tracking low-resolution colour targets in dynamically changing scenes'. ELSEVIER SCIENCE BV IMAGE AND VISION COMPUTING, ENGLAND: Symposium on Probabilistic Models in Computer Vision 21 (10), pp. 913-929.
  • Bowden R, Zisserman A, Kadir T, Brady M. (2003) 'Vision based Interpretation of Natural Sign Languages'. Springer-Verlag Proceedings of the 3rd international conference on Computer vision systems, Graz, Austria: 3rd International Conference on Computer Vision Systems - ICVS 2003

    Abstract

    This manuscript outlines our current demonstration system for translating visual Sign to written text. The system is based around a broad description of scene activity that naturally generalizes, reducing training requirements and allowing the knowledge base to be explicitly stated. This allows the same system to be used for different sign languages requiring only a change of the knowledge base.

  • Bowden R, Sarhadi M. (2002) 'A non-linear model of shape and motion for tracking finger spelt American sign language'. ELSEVIER SCIENCE BV IMAGE AND VISION COMPUTING, UNIV BRISTOL, BRISTOL, ENGLAND: 11th British Machine Vision Conference (BMVC2000) 20 (9-10), pp. 597-607.
  • Bowden R, KaewTraKulPong P. (2001) 'Adaptive Visual System for Tracking Low Resolution Colour Targets'. Proceedings of the 12th British Machine Vision Conference (BMVC2001), Manchester, UK: 12th British Machine Vision Conference (BMVC2001), pp. 243-252.

    Abstract

    This paper addresses the problem of using appearance and motion models in classifying and tracking objects when detailed information of the object’s appearance is not available. The approach relies upon motion, shape cues and colour information to help in associating objects temporally within a video stream. Unlike previous applications of colour in object tracking, where relatively large-size targets are tracked, our method is designed to track small colour targets. Our approach uses a robust background model based around Expectation Maximisation to segment moving objects with very low false detection rates. The system also incorporates a shadow detection algorithm which helps alleviate standard environmental problems associated with such approaches. A colour transformation derived from anthropological studies to model colour distributions of low-resolution targets is used along with a probabilistic method of combining colour and motion information. This provides a robust visual tracking system which is capable of performing accurately and consistently within a real world visual surveillance arena. This paper shows the system successfully tracking multiple people moving independently and the ability of the approach to recover lost tracks due to occlusions and background clutter.

  • KaewTraKulPong P, Bowden R. (2001) 'An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection'. Kluwer Academic Publishers Proceedings of 2nd European Workshop on Advanced Video Based Surveillance Systems, AVBS01. Sept 2001., Kingston, UK: AVBS01: 2nd European Workshop on Advanced Video Based Surveillance Systems
  • Bowden R, Sarhadi M. (2000) 'Building Temporal Models for Gesture Recognition'. BMVA (British Machine Vision Association) Proceedings of BMVC 2000 - The Eleventh British Machine Vision Conference, Bristol, UK: BMVC 2000 - The Eleventh British Machine Vision Conference
  • Lewin M, Bowden R, Sarhadi M. (2000) 'Automotive Prototyping using Augmented Reality'. Proceedings of the 7th VRSIG Conference, Strathclyde University, Sept 2000, Strathclyde University, Scotland: 7th VRSIG Conference
  • Bowden R. (2000) 'Learning Statistical Models of Human Motion'. IEEE Proceedings of CVPR 2000 - IEEE Workshop on Human Modeling, Analysis and Synthesis, Hilton Head, South Carolina, U.S.A.: CVPR 2000 - IEEE Workshop on Human Modeling, Analysis and Synthesis

    Abstract

    Non-linear statistical models of deformation provide methods to learn a priori shape and deformation for an object or class of objects by example. This paper extends these models of deformation to that of motion by augmenting the discrete representation of piecewise nonlinear principal component analysis of shape with a Markov chain which represents the temporal dynamics of the model. In this manner, mean trajectories can be learnt and reproduced for either the simulation of movement or for object tracking. This paper demonstrates the use of these techniques in learning human motion from capture data.

  • Lewin M, Bowden R, Sarhadi M. (2000) 'Applying Augmented Reality to Virtual Product Prototyping'. Brest, France: First French-British International Workshop on Virtual Reality, pp. 59-68.
  • Bowden R, Mitchell TA, Sarhadi M. (1998) 'Reconstructing 3D Pose and Motion from a Single Camera View'. BMVA (British Machine Vision Association) Proceedings of BMVC 1998, University of Southampton, UK: BMVC 1998 2

    Abstract

    This paper presents a model based approach to human body tracking in which the 2D silhouette of a moving human and the corresponding 3D skeletal structure are encapsulated within a non-linear Point Distribution Model. This statistical model allows a direct mapping to be achieved between the external boundary of a human and the anatomical position. It is shown how this information, along with the position of landmark features such as the hands and head can be used to reconstruct information about the pose and structure of the human body from a monoscopic view of a scene.

  • Bowden R, Mitchel TA, Sarhadi M. (1997) 'Real-time Dynamic Deformable Meshes for Volumetric Segmentation and Visualisation'. BMVC97 Electronic Proceedings of the Eighth British Machine Vision Conference, University of Essex, Colchester, UK: BMVC97 - the Eighth British Machine Vision Conference 1, pp. 310-319.

    Abstract

    This paper presents a surface segmentation method which uses a simulated inflating balloon model to segment surface structure from volumetric data using a triangular mesh. The model employs simulated surface tension and an inflationary force to grow from within an object and find its boundary. Mechanisms are described which allow both evenly spaced and minimal polygonal count surfaces to be generated. The work is based on inflating balloon models by Terzopoulos [8]. Simplifications are made to the model, and an approach proposed which provides a technique robust to noise regardless of the feature detection scheme used. The proposed technique uses no explicit attraction to data features, and as such is less dependent on the initialisation of the model and parameters. The model grows under its own forces, and is never anchored to boundaries, but instead constrained to remain inside the desired object. Results are presented which demonstrate the technique’s ability and speed at the segmentation of a complex, concave object with narrow features.

  • Bowden R, Heap AJ, Hogg DC. (1997) 'Real Time Hand Tracking and Gesture Recognition as a 3D Input Device for Graphical Applications'. Springer-Verlag Proceedings of Gesture Workshop ’96, University of York, UK: Gesture Workshop ’96, pp. 117-129.

    Abstract

    This paper outlines a system design and implementation of a 3D input device for graphical applications which uses real time hand tracking and gesture recognition to provide the user with an intuitive interface for tomorrow’s applications. Point Distribution Models (PDMs) have been shown to be successful at tracking deformable objects. This system demonstrates how these ‘smart snakes’ can be used in real time with a real world problem. The system is based upon Open Inventor and designed for use with Silicon Graphics Indy Workstations, but provisions have been made for the move to other platforms and applications. We demonstrate how PDMs provide the ideal feature vector for model classification. It is shown how computer vision can provide a low cost, intuitive interface that has few hardware constraints. We also give the reader an insight into the next generation of HCI and Multimedia, providing a 3D scene viewer and VRML browser based upon the handtracker. Further allowances have been made to facilitate the inclusion of the handtracker within third party Inventor applications. All source code, libraries and applications can be downloaded for free from the above web addresses. This paper demonstrates how computer vision and computer graphics can work together providing an interdisciplinary approach to problem solving.

  • Bowden R, Heap T, Hart C. (1996) 'Virtual Datagloves: Interacting with Virtual Environments Through Computer Vision'. Proceedings of the Third UK Virtual Reality Special Interest Group Conference; Leicester, 3rd July 1996, DeMontfort University, Leicester, UK: 3rd UK VR-Sig Conference

    Abstract

    This paper outlines a system design and implementation of a 3D input device for graphical applications. It is shown how computer vision can be used to track a user’s movements within the image frame allowing interaction with 3D worlds and objects. Point Distribution Models (PDMs) have been shown to be successful at tracking deformable objects. This system demonstrates how these ‘smart snakes’ can be used in real time with real world applications, demonstrating how computer vision can provide a low cost, intuitive interface that has few hardware constraints. The compact mathematical model behind the PDM allows simple static gesture recognition to be performed, providing the means to communicate with an application. It is shown how movement of both the hand and face can be used to drive 3D engines. The system is based upon Open Inventor and designed for use with Silicon Graphics Indy Workstations, but allowances have been made to facilitate the inclusion of the tracker within third party applications. The reader is also provided with an insight into the next generation of HCI and Multimedia. Access to this work can be gained through the above web address.

Books

  • Bowden R. (1997) VRSIG'97, Proceedings of the 4th UK Virtual Reality Special Interest Group Conference. Bristol, UK : UK-VRSIG

Book chapters

  • Cooper HM, Ong E, Pugeault N, Bowden R. (2017) 'Sign Language Recognition Using Sub-units'. in Escalera S, Guyon I, Athitsos V (eds.) Gesture Recognition Springer International Publishing , pp. 89-118.

    Abstract

    This chapter discusses sign language recognition using linguistic sub-units. It presents three types of sub-units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data. These sub-units are then combined using a sign level classifier; here, two options are presented. The first uses Markov Models to encode the temporal changes between sub-units. The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information. This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains to 76%.

  • Hadfield SJ, Bowden R. (2012) 'Generalised Pose Estimation Using Depth'. in Kutulakos K (ed.) Trends and Topics in Computer Vision ECCV 2010 Springer Trends and Topics in Computer Vision. ECCV 2010. Lecture Notes in Computer Science, 6553 (6553), pp. 312-325.

    Abstract

    Estimating the pose of an object, be it articulated, deformable or rigid, is an important task, with applications ranging from Human-Computer Interaction to environmental understanding. The idea of a general pose estimation framework, capable of being rapidly retrained to suit a variety of tasks, is appealing. In this paper a solution is proposed requiring only a set of labelled training images in order to be applied to many pose estimation tasks. This is achieved by treating pose estimation as a classification problem, with particle filtering used to provide non-discretised estimates. Depth information, extracted from a calibrated stereo sequence, is used for background suppression and object scale estimation. The appearance and shape channels are then transformed to Local Binary Pattern histograms, and pose classification is performed via a randomised decision forest. To demonstrate flexibility, the approach is applied to two different situations, articulated hand pose and rigid head orientation, achieving 97% and 84% accurate estimation rates, respectively.

  • Cooper HM, Holt B, Bowden R. (2011) 'Sign Language Recognition'. in Moeslund TB, Hilton A, Krüger V, Sigal L (eds.) Visual Analysis of Humans: Looking at People Springer Verlag , pp. 539-562.

    Abstract

    This chapter covers the key aspects of sign-language recognition (SLR), starting with a brief introduction to the motivations and requirements, followed by a précis of sign linguistics and their impact on the field. The types of data available and their relative merits are explored, allowing examination of the features which can be extracted. Classifying the manual aspects of sign (similar to gestures) is then discussed from a tracking and non-tracking viewpoint before summarising some of the approaches to the non-manual aspects of sign languages. Methods for combining the sign classification results into full SLR are given, showing the progression towards speech recognition techniques and the further adaptations required for the sign specific case. Finally, the current frontiers are discussed and the recent research presented. This covers the task of continuous sign recognition, the work towards true signer independence, how to effectively combine the different modalities of sign, making use of the current linguistic research and adapting to larger, more noisy data sets.

  • Oshin O, Gilbert A, Illingworth J, Bowden R. (2009) 'Learning to Recognise Spatio-Temporal Interest Points'. in Wang L, Cheng L, Zhao G (eds.) Machine Learning for Human Motion Analysis Igi Publishing Article number 2 , pp. 14-30.
  • Bowden R, Gilbert A, KaewTraKulPong P. (2006) 'Tracking Objects Across Uncalibrated Arbitrary Topology Camera Networks'. in Velastin S, Remagnino P (eds.) Intelligent Distributed Video Surveillance Systems Stevenage, United Kingdom : Institution of Engineering and Technology Article number 6 , pp. 157-182.

    Abstract

    Intelligent visual surveillance is an important application area for computer vision. In situations where networks of hundreds of cameras are used to cover a wide area, the obvious limitation becomes the users’ ability to manage such vast amounts of information. For this reason, automated tools that can generalise about activities or track objects are important to the operator. Key to the users’ requirements is the ability to track objects across (spatially separated) camera scenes. However, extensive geometric knowledge about the site and camera position is typically required. Such an explicit mapping from camera to world is infeasible for large installations as it requires that the operator know which camera to switch to when an object disappears. To further compound the problem the installation costs of CCTV systems outweigh those of the hardware. This means that geometric constraints or any form of calibration (such as that which might be used with epipolar constraints) is simply not realistic for a real world installation. The algorithms cannot afford to dictate to the installer. This work attempts to address this problem and outlines a method to allow objects to be related and tracked across cameras without any explicit calibration, be it geometric or colour.

  • KaewTraKulPong P, Bowden R. (2002) 'An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection'. in Remagnino P, Jones GA, Paragios N, Regazzoni CS (eds.) Video-Based Surveillance Systems New York : Springer US Article number 11

    Abstract

    Real-time segmentation of moving regions in image sequences is a fundamental step in many vision systems including automated visual surveillance, human-machine interface, and very low-bandwidth telecommunications. A typical method is background subtraction. Many background models have been introduced to deal with different problems. One of the successful solutions to these problems is to use a multi-colour background model per pixel proposed by Grimson et al [1, 2, 3]. However, the method suffers from slow learning at the beginning, especially in busy environments. In addition, it cannot distinguish between moving shadows and moving objects. This paper presents a method which improves this adaptive background mixture model. By reinvestigating the update equations, we utilise different equations at different phases. This allows our system to learn faster and more accurately, and to adapt effectively to changing environments. A shadow detection scheme is also introduced in this paper. It is based on a computational colour space that makes use of our background model. A comparison has been made between the two algorithms. The results show the speed of learning and the accuracy of the model using our update algorithm over Grimson et al.’s tracker. When incorporated with shadow detection, our method results in far better segmentation than that of Grimson et al.

Theses and dissertations

  • Cooper HM. (2010) Sign Language Recognition: Generalising to More Complex Corpora. University Of Surrey
