Professor Richard Bowden

Professor

Qualifications: BSc, MSc, PhD, SMIEEE, FHEA

Email:
Phone (work): 01483 68 9838
Room no: 37 AB 05

Biography

Richard Bowden received a BSc in Computer Science from the University of London in 1993, an MSc from the University of Leeds in 1995 and a PhD in Computer Vision from Brunel University in 1999. He is currently a Professor at the University of Surrey, where he leads the Cognitive Vision Group within CVSSP. His research centres on the use of computer vision to locate, track and understand humans. His work on tracking and artificial life received worldwide media coverage and was exhibited at the British Science Museum and the Minnesota Science Museum. He has won a number of awards, including paper prizes for his work on sign language recognition (undertaken as a visiting Research Fellow at the University of Oxford) and the Sullivan Doctoral Thesis Prize in 2000 for the best UK PhD thesis in vision. He served for seven years on the executive committee of the British Machine Vision Association (BMVA), including as a company director. He is a London Technology Network Business Fellow, a member of the BMVA, a Fellow of the Higher Education Academy and a Senior Member of the Institute of Electrical and Electronics Engineers.

Further details can be found on my personal web page.

A full list of publications, conference presentations, and patents can be found here.

Publications

Highlights

  • Gilbert A, Illingworth J, Bowden R. (2010) 'Action Recognition Using Mined Hierarchical Compound Features'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (5), pp. 883-897.
  • Dowson N, Kadir T, Bowden R. (2008) 'Estimating the joint statistics of images using Nonparametric Windows with application to registration using Mutual Information'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (10), pp. 1841-1857.

    Abstract

    Recently, Nonparametric (NP) Windows has been proposed to estimate the statistics of real 1D and 2D signals. NP Windows is accurate because it is equivalent to sampling images at a high (infinite) resolution for an assumed interpolation model. This paper extends the proposed approach to consider joint distributions of image pairs. Second, Green's Theorem is used to simplify the previous NP Windows algorithm. Finally, a resolution-aware NP Windows algorithm is proposed to improve robustness to relative scaling between an image pair. Comparative testing of 2D image registration was performed using translation only and affine transformations. Although it is more expensive than other methods, NP Windows frequently demonstrated superior performance for bias (distance between ground truth and global maximum) and frequency of convergence. Unlike other methods, the number of samples and the number of bins have little effect on NP Windows and the prior selection of a kernel is not required.
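
    Example sketch

    The registration objective above is mutual information. For orientation, a minimal sketch of the standard-sampling baseline the paper compares against (a plain joint histogram, not the NP Windows estimator; the function name and bin count are illustrative choices) might look like this in Python:

        import numpy as np

        def mutual_information(img_a, img_b, bins=32):
            """MI of two same-sized images via a standard-sampling joint histogram.

            This binned estimate depends on the sample and bin counts; the point
            of NP Windows is to avoid that by integrating over an assumed
            interpolation model instead.
            """
            joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
            p_ab = joint / joint.sum()               # joint distribution
            p_a = p_ab.sum(axis=1, keepdims=True)    # marginal of image A
            p_b = p_ab.sum(axis=0, keepdims=True)    # marginal of image B
            nz = p_ab > 0                            # avoid log(0)
            return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))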

  • Dowson N, Bowden R. (2008) 'Mutual information for Lucas-Kanade tracking (MILK): An inverse compositional formulation'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (1), pp. 180-185.

Journal articles

  • Merino L, Gilbert A, Bowden R, Illingworth J, Capitán J, Ollero A. (2012) 'Data fusion in ubiquitous networked robot systems for urban services'. Annales des Telecommunications/Annals of Telecommunications, 67 (7-8), pp. 355-375.
  • Cooper H, Ong E-J, Pugeault N, Bowden R. (2012) 'Sign language recognition using sub-units'. Journal of Machine Learning Research, 13, pp. 2205-2231.
  • Efthimiou E, Fotinea S-E, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Lefebvre-Albaret F. (2012) 'The dicta-sign Wiki: Enabling web communication for the deaf'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7383 LNCS (PART 2), pp. 205-212.
  • Hadfield S, Bowden R. (2012) 'Go with the flow: Hand trajectories in 3D via clustered scene flow'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7324 LNCS (PART 1), pp. 285-295.
  • Ong EJ, Bowden R. (2011) 'Robust Facial Feature Tracking Using Shape-Constrained Multi-Resolution Selected Linear Predictors'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (9), pp. 1844-1859.

    Abstract

    This paper proposes a learnt data-driven approach for accurate, real-time tracking of facial features using only intensity information, a non-trivial task since the face is a highly deformable object with large textural variations and motion in certain regions. The framework proposed here largely avoids the need for a priori design of feature trackers by automatically identifying the optimal visual support required for tracking a single facial feature point. This is essentially equivalent to automatically determining the visual context required for tracking. Tracking is achieved via biased linear predictors, which provide a fast and effective method for mapping pixel intensities into tracked feature position displacements. Multiple linear predictors are grouped into a rigid flock to increase robustness. To further improve tracking accuracy, a novel probabilistic selection method is used to identify relevant visual areas for tracking a feature point. These selected flocks are then combined into a hierarchical multi-resolution LP model. Finally, we also exploit a simple shape constraint for correcting the occasional tracking failure of a minority of feature points. Experimental results show that this method performs more robustly and accurately than AAMs on example sequences that range from SD quality to YouTube quality.
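
    Example sketch

    At its core, the linear-predictor machinery above is a regression from sparse intensity differences to a 2-D displacement. A minimal sketch of training one such predictor by least squares on synthetic perturbations (names and parameters are illustrative; the paper's probabilistic support selection, flocking and shape constraint are omitted):

        import numpy as np

        def train_linear_predictor(image, centre, support, n_train=200, max_disp=5, seed=0):
            """Learn P so that displacement ~= P @ (template - shifted intensities).

            image:   2-D grayscale array; centre: (row, col) of the feature point;
            support: (k, 2) integer pixel offsets around the centre. Offsets and
            training displacements are assumed to stay inside the image.
            """
            rng = np.random.default_rng(seed)
            template = image[centre[0] + support[:, 0], centre[1] + support[:, 1]]
            D = rng.integers(-max_disp, max_disp + 1, size=(n_train, 2))  # known shifts
            X = np.empty((n_train, len(support)))
            for i, d in enumerate(D):            # intensity differences per shift
                r = centre[0] + d[0] + support[:, 0]
                c = centre[1] + d[1] + support[:, 1]
                X[i] = template - image[r, c]
            P, *_ = np.linalg.lstsq(X, D, rcond=None)    # least-squares regression
            return P.T                           # (2, k): maps differences to displacement

        # At runtime each feature point costs one matrix-vector product per frame:
        # d_hat = P @ (template - current support-pixel intensities).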

  • Ellis L, Dowson N, Matas J, Bowden R. (2011) 'Linear Regression and Adaptive Appearance Models for Fast Simultaneous Modelling and Tracking'. International Journal of Computer Vision, 95 (2), pp. 154-179.
  • Gupta A, Bowden R. (2011) 'Evaluating dimensionality reduction techniques for visual category recognition using Rényi entropy'. European Signal Processing Conference, pp. 913-917.
  • Moore S, Bowden R. (2011) 'Local binary patterns for multi-view facial expression recognition'. Computer Vision and Image Understanding, 115 (4), pp. 541-558.
  • Okwechime D, Ong E-J, Bowden R. (2011) 'MIMiC: Multimodal Interactive Motion Controller'. IEEE Transactions on Multimedia, 13 (2), pp. 255-265.

    Abstract

    We introduce a new algorithm for real-time interactive motion control and demonstrate its application to motion captured data, prerecorded videos, and HCI. Firstly, a data set of frames is projected into a lower-dimensional space. An appearance model is learnt using a multivariate probability distribution. A novel approach to determining transition points is presented based on k-medoids, whereby appropriate points of intersection in the motion trajectory are derived as cluster centers. These points are used to segment the data into smaller subsequences. A transition matrix combined with a kernel density estimation is used to determine suitable transitions between the subsequences to develop novel motion. To facilitate real-time interactive control, conditional probabilities are used to derive motion given user commands. The user commands can come from any modality including auditory, touch, and gesture. The system is also extended to HCI using audio signals of speech in a conversation to trigger nonverbal responses from a synthetic listener in real time. We demonstrate the flexibility of the model by presenting results ranging from data sets composed of vectorized images, 2-D, and 3-D point representations. Results show real-time interaction and plausible motion generation between different types of movement.

  • Sanfeliu A, Andrade-Cetto J, Barbosa M, Bowden R, Capitan J, Corominas A, Gilbert A, Illingworth J, Merino L, Mirats JM, Moreno P, Ollero A, Sequeira J, Spaan MTJ. (2010) 'Decentralized Sensor Fusion for Ubiquitous Networking Robotics in Urban Areas'. Sensors, 10 (3), pp. 2274-2314.
  • Gilbert A, Illingworth J, Bowden R. (2010) 'Action Recognition Using Mined Hierarchical Compound Features'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (5), pp. 883-897.
  • Ong E-J, Ellis L, Bowden R. (2009) 'Problem solving through imitation'. Image and Vision Computing, 27 (11), pp. 1715-1728.
  • Dowson N, Kadir T, Bowden R. (2008) 'Estimating the joint statistics of images using Nonparametric Windows with application to registration using Mutual Information'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (10), pp. 1841-1857.

    Abstract

    Recently, Nonparametric (NP) Windows has been proposed to estimate the statistics of real 1D and 2D signals. NP Windows is accurate because it is equivalent to sampling images at a high (infinite) resolution for an assumed interpolation model. This paper extends the proposed approach to consider joint distributions of image pairs. Second, Green's Theorem is used to simplify the previous NP Windows algorithm. Finally, a resolution-aware NP Windows algorithm is proposed to improve robustness to relative scaling between an image pair. Comparative testing of 2D image registration was performed using translation only and affine transformations. Although it is more expensive than other methods, NP Windows frequently demonstrated superior performance for bias (distance between ground truth and global maximum) and frequency of convergence. Unlike other methods, the number of samples and the number of bins have little effect on NP Windows and the prior selection of a kernel is not required.

  • Gilbert A, Bowden R. (2008) 'Incremental, scalable tracking of objects inter camera'. Computer Vision and Image Understanding, 111 (1), pp. 43-58.
  • Dowson N, Bowden R. (2008) 'Mutual information for Lucas-Kanade tracking (MILK): An inverse compositional formulation'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (1), pp. 180-185.
  • Ong E-J, Micilotta AS, Bowden R, Hilton A. (2005) 'Viewpoint invariant exemplar-based 3D human tracking'. Computer Vision and Image Understanding, 104 (2-3), pp. 178-189.

Conference papers

  • Gupta A, Bowden R. (2012) 'Unity in diversity: Discovering topics from words: Information theoretic co-clustering for visual categorization'. VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications, Rome: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 1, pp. 628-633.
  • Holt B, Bowden R. (2012) 'Static pose estimation from depth images using random regression forests and Hough voting'. VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications, Rome: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 1, pp. 557-564.
  • Sheerman-Chase T, Ong E-J, Bowden R. (2011) 'Cultural factors in the regression of non-verbal communication perception'. 2011 IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, pp. 1242-1249.

    Abstract

    Recognition of non-verbal communication (NVC) is important for understanding human communication and designing user-centric user interfaces. Cultural differences affect the expression and perception of NVC, but no previous automatic system considers these cultural differences. Annotation data for the LILiR TwoTalk corpus, containing dyadic (two person) conversations, was gathered using Internet crowdsourcing, with a significant quantity collected from India, Kenya and the United Kingdom (UK). Many studies have investigated cultural differences based on human observations, but this has not been addressed in the context of automatic emotion or NVC recognition. Perhaps not surprisingly, testing an automatic system on data that is not culturally representative of the training data is seen to result in low performance. We address this problem by training and testing our system on a specific culture to enable better modeling of the cultural differences in NVC perception. The system uses linear predictor tracking, with features generated based on distances between pairs of trackers. The annotations indicated the strength of the NVC, which enables the use of ν-SVR to perform the regression.

  • Gilbert A, Bowden R. (2011) 'iGroup: Weakly supervised image and video grouping'. 2011 IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, pp. 2166-2173.

    Abstract

    We present a generic, efficient and iterative algorithm for interactively clustering classes of images and videos. The approach moves away from the use of large hand labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of “seed” examples. Two efficient data mining tools originally developed for text analysis, min-Hash and APriori, are used and extended to achieve both speed and scalability on large image and video datasets. Inspired by the Bag-of-Words (BoW) architecture, the idea of an image signature is introduced as a simple descriptor on which nearest neighbour classification can be performed. The image signature is then dynamically expanded to identify common features amongst samples of the same class. The iterative approach uses APriori to identify common and distinctive elements of a small set of labelled true and false positive signatures. These elements are then accentuated in the signature to increase similarity between examples and “pull” positive classes together. By repeating this process, the accuracy of similarity increases dramatically despite only a few training examples; only 10% of the labelled groundtruth is needed, compared to other approaches. It is tested on two image datasets including the caltech101 [9] dataset and on three state-of-the-art action recognition datasets. On the YouTube [18] video dataset the accuracy increases from 72% to 97% using only 44 labelled examples from a dataset of over 1200 videos. The approach is both scalable and efficient, with an iteration on the full YouTube dataset taking around 1 minute on a standard desktop machine.
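
    Example sketch

    The min-Hash half of the pipeline above rests on a textbook property: two min-hash signatures agree, position by position, with probability equal to the Jaccard overlap of the underlying sets, which is what makes comparing image signatures cheap. A sketch of that property (the hash scheme and sizes here are illustrative, not the paper's):

        import numpy as np

        def minhash_signature(word_ids, n_hashes=64, seed=1):
            """Min-hash signature of a set of visual-word ids."""
            rng = np.random.default_rng(seed)
            prime = 2_147_483_647                    # large prime for universal hashing
            a = rng.integers(1, prime, size=n_hashes)
            b = rng.integers(0, prime, size=n_hashes)
            ids = np.array(sorted(set(word_ids)))
            # h_i(x) = (a_i * x + b_i) mod prime; keep the minimum under each hash
            return ((a[:, None] * ids[None, :] + b[:, None]) % prime).min(axis=1)

        def estimated_jaccard(sig_a, sig_b):
            """Fraction of agreeing positions approximates the Jaccard similarity."""
            return float(np.mean(sig_a == sig_b))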

  • Ong E, Bowden R. (2011) 'Learning Temporal Signatures for Lip Reading'. ARTEMIS, ICCV 2011
  • Hadfield S, Bowden R. (2011) 'Kinecting the dots: Particle based scene flow from depth sensors'. 2011 IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, pp. 2290-2295.

    Abstract

    The motion field of a scene can be used for object segmentation and to provide features for classification tasks like action recognition. Scene flow is the full 3D motion field of the scene, and is more difficult to estimate than its 2D counterpart, optical flow. Current approaches use a smoothness cost for regularisation, which tends to oversmooth at object boundaries. This paper presents a novel formulation for scene flow estimation, a collection of moving points in 3D space, modelled using a particle filter that supports multiple hypotheses and does not oversmooth the motion field. In addition, this paper is the first to address scene flow estimation, while making use of modern depth sensors and monocular appearance images, rather than traditional multi-viewpoint rigs. The algorithm is applied to an existing scene flow dataset, where it achieves comparable results to approaches utilising multiple views, while taking a fraction of the time.

  • Pugeault N, Bowden R. (2011) 'Driving me Around the Bend: Learning to Drive from Visual Gist'. 1st IEEE Workshop on Challenges and Opportunities in Robotic Perception, ICCV 2011, Barcelona, Spain, pp. 1022-1029.

    Abstract

    This article proposes an approach to learning steering and road following behaviour from a human driver using holistic visual features. We use a random forest (RF) to regress a mapping between these features and the driver's actions, and propose an alternative to classical random forest regression based on the Medoid (RF-Medoid), that reduces the underestimation of extreme control values. We compare prediction performance using different holistic visual descriptors: GIST, Channel-GIST (C-GIST) and Pyramidal-HOG (P-HOG). The proposed methods are evaluated on two different datasets: predicting human behaviour on countryside roads and also for autonomous control of a robot on an indoor track. We show that 1) C-GIST leads to the best predictions on both sequences, and 2) RF-Medoid leads to a better estimation of extreme values, where a classical RF tends to under-steer. We use around 10% of the data for training and show excellent generalization over a dataset of thousands of images. Importantly, we do not engineer the solution but instead use machine learning to automatically identify the relationship between visual features and behaviour, providing an efficient, generic solution to autonomous control.
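
    Example sketch

    The RF-Medoid idea above can be imitated post hoc on any trained forest: instead of averaging the per-tree outputs (which pulls extreme steering values toward the middle), return the tree prediction with the smallest total distance to the others. A sketch assuming scikit-learn is available (an illustration of the aggregation rule, not the authors' implementation):

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def rf_medoid_predict(forest, X):
            """Aggregate per-tree regressions by medoid rather than mean."""
            per_tree = np.stack([t.predict(X) for t in forest.estimators_])  # (T, n)
            preds = np.empty(per_tree.shape[1])
            for j in range(per_tree.shape[1]):
                vals = per_tree[:, j]
                cost = np.abs(vals[:, None] - vals[None, :]).sum(axis=1)
                preds[j] = vals[np.argmin(cost)]   # medoid over the T tree outputs
            return preds

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            F = rng.normal(size=(200, 16))         # stand-in holistic features
            angle = F[:, 0] + 0.1 * rng.normal(size=200)   # stand-in steering signal
            forest = RandomForestRegressor(n_estimators=32, random_state=0).fit(F, angle)
            print(rf_medoid_predict(forest, F[:5]))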

  • Holt B, Ong E-J, Cooper H, Bowden R. (2011) 'Putting the pieces together: Connected Poselets for human pose estimation'. 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops 2011), Barcelona, Spain, pp. 1196-1201.

    Abstract

    We propose a novel hybrid approach to static pose estimation called Connected Poselets. This representation combines the best aspects of part-based and example-based estimation. Our method first detects poselets extracted from the training data, then applies a modified Random Decision Forest to identify poselet activations. By combining keypoint predictions from poselet activations within a graphical model, we can infer the marginal distribution over each keypoint without any kinematic constraints. Our approach is demonstrated on a new publicly available dataset with promising results.

  • Pugeault N, Bowden R. (2011) 'Spelling It Out: Real-Time ASL Fingerspelling Recognition'. 1st IEEE Workshop on Consumer Depth Cameras for Computer Vision, ICCV 2011, Barcelona, Spain, pp. 1114-1119.

    Abstract

    This article presents an interactive hand shape recognition user interface for American Sign Language (ASL) finger-spelling. The system makes use of a Microsoft Kinect device to collect appearance and depth images, and of the OpenNI+NITE framework for hand detection and tracking. Hand-shapes corresponding to letters of the alphabet are characterized using appearance and depth images and classified using random forests. We compare classification using appearance and depth images, show that a combination of both leads to the best results, and validate on a dataset of four different users. The hand shape detection works in real time and is integrated into an interactive user interface that allows the signer to select between ambiguous detections, combined with an English dictionary for efficient writing.

  • Cooper HM, Pugeault N, Bowden R. (2011) 'Reading the Signs: A Video Based Sign Dictionary'. 2nd IEEE Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams (ARTEMIS 2011), ICCV 2011, Barcelona, Spain, pp. 914-919.

    Abstract

    This article presents a dictionary for Sign Language using visual sign recognition based on linguistic subcomponents. We demonstrate a system where the user makes a query, receiving in response a ranked selection of similar results. The approach uses concepts from linguistics to provide sign sub-unit features and classifiers based on motion, sign-location and handshape. These sub-units are combined using Markov Models for sign level recognition. Results are shown for a video dataset of 984 isolated signs performed by a native signer. Recognition rates reach 71.4% for the first candidate and 85.9% for retrieval within the top 10 ranked signs.

  • Elliott R, Cooper HM, Ong EJ, Glauert J, Bowden R, Lefebvre-Albaret F. (2011) 'Search-By-Example in Multilingual Sign Language Databases'. Dundee, UK: 2nd International Workshop on Sign Language Translation and Avatar Technology (SLTAT)

    Abstract

    We describe a prototype Search-by-Example or look-up tool for signs, based on a newly developed 1000-concept sign lexicon for four national sign languages (GSL, DGS, LSF, BSL), which includes a spoken language gloss, a HamNoSys description, and a video for each sign. The look-up tool combines an interactive sign recognition system, supported by Kinect technology, with a real-time sign synthesis system, using a virtual human signer, to present results to the user. The user performs a sign to the system and is presented with animations of signs recognised as similar. The user also has the option to view any of these signs performed in the other three sign languages. We describe the supporting technology and architecture for this system, and present some preliminary evaluation results.

  • Ong E, Bowden R. (2011) 'Learning Sequential Patterns for Lipreading'. BMVA Press Proceedings of the 22nd British Machine Vision Conference, Dundee, UK: BMVC 2011, pp. 55.1-55.10.

    Abstract

    This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search of optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set's boosting weight distribution. This temporal search method is then integrated into the boosting framework resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state-of-the-art recognition performance, using only a small set of sequential patterns.

  • Okwechime D, Ong E-J, Gilbert A, Bowden R. (2011) 'Visualisation and prediction of conversation interest through mined social signals'. 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG 2011), Santa Barbara, USA, pp. 951-956.

    Abstract

    This paper introduces a novel approach to social behaviour recognition governed by the exchange of non-verbal cues between people. We conduct experiments to try and deduce distinct rules that dictate the social dynamics of people in a conversation, and utilise semi-supervised computer vision techniques to extract their social signals such as laughing and nodding. Data mining is used to deduce frequently occurring patterns of social trends between a speaker and listener in both interested and not interested social scenarios. The confidence values from rules are utilised to build a Social Dynamic Model (SDM), that can then be used for classification and visualisation. By visualising the rules generated in the SDM, we can analyse distinct social trends between an interested and not interested listener in a conversation. Results show that these distinctions can be applied generally and used to accurately predict conversational interest.

  • Oshin O, Gilbert A, Bowden R. (2011) 'Capturing the relative distribution of features for action recognition'. 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG 2011), Santa Barbara, USA, pp. 111-116.

    Abstract

    This paper presents an approach to the categorisation of spatio-temporal activity in video, which is based solely on the relative distribution of feature points. Introducing a Relative Motion Descriptor for actions in video, we show that the spatio-temporal distribution of features alone (without explicit appearance information) effectively describes actions, and demonstrate performance consistent with state-of-the-art. Furthermore, we propose that for actions where noisy examples exist, it is not optimal to group all action examples as a single class. Therefore, rather than engineering features that attempt to generalise over noisy examples, our method follows a different approach: We make use of Random Sampling Consensus (RANSAC) to automatically discover and reject outlier examples within classes. We evaluate the Relative Motion Descriptor and outlier rejection approaches on four action datasets, and show that outlier rejection using RANSAC provides a consistent and notable increase in performance, and demonstrate superior performance to more complex multiple-feature based approaches.

  • Ellis L, Felsberg M, Bowden R. (2011) 'Affordance mining: Forming perception through action'. Springer Lecture Notes in Computer Science: 10th Asian Conference on Computer Vision, Revised Selected Papers Part IV, Queenstown, New Zealand: ACCV 2010 6495, pp. 525-538.

    Abstract

    This work employs data mining algorithms to discover visual entities that are strongly associated to autonomously discovered modes of action, in an embodied agent. Mappings are learnt from these perceptual entities, onto the agents action space. In general, low dimensional action spaces are better suited to unsupervised learning than high dimensional percept spaces, allowing for structure to be discovered in the action space, and used to organise the perceptual space. Local feature configurations that are strongly associated to a particular ‘type’ of action (and not all other action types) are considered likely to be relevant in eliciting that action type. By learning mappings from these relevant features onto the action space, the system is able to respond in real time to novel visual stimuli. The proposed approach is demonstrated on an autonomous navigation task, and the system is shown to identify the relevant visual entities to the task and to generate appropriate responses.

  • Okwechime D, Ong E-J, Gilbert A, Bowden R. (2011) 'Social interactive human video synthesis'. Springer Lecture Notes in Computer Science: Computer Vision – ACCV 2010, Queenstown, New Zealand: 10th Asian Conference on Computer Vision 6492 (PART 1), pp. 256-270.
  • Oshin O, Gilbert A, Bowden R. (2011) 'There Is More Than One Way to Get Out of a Car: Automatic Mode Finding for Action Recognition in the Wild'. Springer Berlin / Heidelberg Lecture Notes in Computer Science: Pattern Recognition and Image Analysis, Las Palmas de Gran Canaria, Spain: 5th IbPRIA 2011 6669, pp. 41-48.

    Abstract

    “Actions in the wild” is the term given to examples of human motion that are performed in natural settings, such as those harvested from movies [10] or the Internet [9]. State-of-the-art recognition rates in this domain are orders of magnitude lower than in more contrived settings, one of the primary reasons being the huge variability within each action class. We propose to tackle recognition in the wild by automatically breaking complex action categories into multiple modes/groups, and training a separate classifier for each mode. This is achieved using RANSAC, which identifies and separates the modes while rejecting outliers. We employ a novel reweighting scheme within the RANSAC procedure to iteratively reweight training examples, ensuring their inclusion in the final classification model. Our results demonstrate the validity of the approach, and for classes which exhibit multi-modality, we achieve in excess of double the performance over approaches that assume single modality.
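
    Example sketch

    A greedy RANSAC-style mode-finding loop of the general kind described above might look as follows, with a centroid as the deliberately simple fitted model; the thresholds, names and the omission of the paper's reweighting scheme are all simplifications:

        import numpy as np

        def ransac_modes(X, n_iters=500, sample_size=5, inlier_thresh=1.0, min_mode=10, seed=0):
            """Split one action class (rows of X) into modes plus leftover outliers."""
            rng = np.random.default_rng(seed)
            remaining = np.arange(len(X))
            modes = []
            while len(remaining) >= min_mode:
                best = None
                for _ in range(n_iters):
                    pick = rng.choice(remaining, size=sample_size, replace=False)
                    centre = X[pick].mean(axis=0)             # hypothesised mode centre
                    dist = np.linalg.norm(X[remaining] - centre, axis=1)
                    inliers = remaining[dist < inlier_thresh]  # consensus set
                    if best is None or len(inliers) > len(best):
                        best = inliers
                if len(best) < min_mode:
                    break                                      # rest treated as outliers
                modes.append(best)
                remaining = np.setdiff1d(remaining, best)
            return modes

        # A separate classifier is then trained on each returned mode.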

  • Moore S, Ong EJ, Bowden R. (2010) 'Facial Expression Recognition using Spatiotemporal Boosted Discriminatory Classifiers'. Portugal: International Conference on Image Analysis and Recognition 6111/2010, pp. 405-414.
  • Cooper H, Bowden R. (2010) 'Sign Language Recognition using Linguistically Derived Sub-Units'. European Language Resources Association (ELRA) Proceedings of the 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, Valletta, Malta: LREC 2010, pp. 57-61.

    Abstract

    This work proposes to learn linguistically-derived sub-unit classifiers for sign language. The responses of these classifiers can be combined by Markov models, producing efficient sign-level recognition. Tracking is used to create vectors of hand positions per frame as inputs for sub-unit classifiers learnt using AdaBoost. Grid-like classifiers are built around specific elements of the tracking vector to model the placement of the hands. Comparative classifiers encode the positional relationship between the hands. Finally, binary-pattern classifiers are applied over the tracking vectors of multiple frames to describe the motion of the hands. Results for the sub-unit classifiers in isolation are presented, reaching averages over 90%. Using a simple Markov model to combine the sub-unit classifiers allows sign level classification giving an average of 63%, over a 164 sign lexicon, with no grammatical constraints.
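
    Example sketch

    The combination step above amounts to scoring a frame-wise sequence of sub-unit decisions against one Markov chain per sign and keeping the best-scoring sign. A minimal sketch of that scoring, hardened to the strongest-firing sub-unit per frame (names and the log-space formulation are illustrative):

        import numpy as np

        def sign_log_likelihood(subunit_seq, start_p, trans_p):
            """Log-likelihood of a per-frame sub-unit index sequence under one sign.

            start_p: (n_subunits,) initial probabilities for this sign;
            trans_p: (n_subunits, n_subunits) transition probabilities.
            """
            eps = 1e-12                              # guard against log(0)
            ll = np.log(start_p[subunit_seq[0]] + eps)
            for prev, cur in zip(subunit_seq[:-1], subunit_seq[1:]):
                ll += np.log(trans_p[prev, cur] + eps)
            return ll

        # Classification: evaluate the sequence against every sign's chain and
        # return the arg-max, here over the 164-sign lexicon.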

  • Efthimiou E, Fotinea SE, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Goudenove F. (2010) 'DICTA-SIGN: Sign Language Recognition, Generation and Modelling with application in Deaf Communication'. Malta: 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, LREC2010, pp. 80-84.
  • Pugeault N, Bowden R. (2010) 'Learning pre-attentive driving behaviour from holistic visual features'. ECCV 2010, Part VI, LNCS 6316, pp. 154-167.
  • Ong EJ, Lan Y, Theobald BJ, Harvey R, Bowden R. (2009) 'Robust Facial Feature Tracking using Selected Multi-Resolution Linear Predictors'. International Conference on Computer Vision (ICCV 2009), Kyoto, Japan, pp. 1483-1490.
  • Okwechime D, Ong E-J, Bowden R. (2009) 'Real-time motion control using pose space probability density estimation'. 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops 2009), Kyoto, Japan, pp. 2056-2063.
  • Sheerman-Chase T, Ong E-J, Bowden R. (2009) 'Online learning of robust facial feature trackers'. 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops 2009), Kyoto, Japan, pp. 1386-1392.

    Abstract

    This paper presents a head pose and facial feature estimation technique that works over a wide range of pose variations without a priori knowledge of the appearance of the face. Using simple LK trackers, head pose is estimated by Levenberg-Marquardt (LM) pose estimation using the feature tracking as constraints. Factored sampling and RANSAC are employed to both provide a robust pose estimate and identify tracker drift by constraining outliers in the estimation process. The system provides both a head pose estimate and the position of facial features and is capable of tracking over a wide range of head poses.

  • Gilbert A, Illingworth J, Bowden R, Capitan J, Merino L. (2009) 'Accurate fusion of robot, camera and wireless sensors for surveillance applications'. 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops 2009), Kyoto, Japan, pp. 1290-1297.

    Abstract

    Within the field of tracking people, often only fixed cameras are used. This can mean that when the illumination of the image changes or object occlusion occurs, the tracking can fail. We propose an approach that uses three simultaneous separate sensors. The fixed surveillance cameras track objects of interest across cameras by incrementally learning relationships between regions of the image. Cameras and laser rangefinder sensors onboard robots also provide an estimate of the person's position. Moreover, the signal strength of mobile devices carried by the person can be used to estimate his position. The estimates from all these sources are then combined using data fusion to provide an increase in performance. We present results of the fixed-camera-based tracking operating in real time on a large outdoor environment of over 20 non-overlapping cameras. Moreover, the tracking algorithms for robots and wireless nodes are described. A decentralized data fusion algorithm for combining all this information is presented.

  • Sheerman-Chase T, Ong E-J, Bowden R. (2009) 'Feature selection of facial displays for detection of non verbal communication in natural conversation'. 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops 2009), Kyoto, Japan, pp. 1985-1992.

    Abstract

    Recognition of human communication has previously focused on deliberately acted emotions or in structured or artificial social contexts. This makes the result hard to apply to realistic social situations. This paper describes the recording of spontaneous human communication in a specific and common social situation: conversation between two people. The clips are then annotated by multiple observers to reduce individual variations in interpretation of social signals. Temporal and static features are generated from tracking using heuristic and algorithmic methods. Optimal features for classifying examples of spontaneous communication signals are then extracted by AdaBoost. The performance of the boosted classifier is comparable to human performance for some communication signals, even on this challenging and realistic data set.

  • Oshin O, Gilbert A, Illingworth J, Bowden R. (2009) 'Action recognition using Randomised Ferns'. 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops 2009), Kyoto, Japan, pp. 530-537.

    Abstract

    This paper presents a generic method for recognising and localising human actions in video based solely on the distribution of interest points. The use of local interest points has shown promising results in both object and action recognition. While previous methods classify actions based on the appearance and/or motion of these points, we hypothesise that the distribution of interest points alone contains the majority of the discriminatory information. Motivated by its recent success in rapidly detecting 2D interest points, the semi-naive Bayesian classification method of Randomised Ferns is employed. Given a set of interest points within the boundaries of an action, the generic classifier learns the spatial and temporal distributions of those interest points. This is done efficiently by comparing sums of responses of interest points detected within randomly positioned spatio-temporal blocks within the action boundaries. We present results on the largest and most popular human action dataset using a number of interest point detectors, and demonstrate that the distribution of interest points alone can perform as well as approaches that rely upon the appearance of the interest points.
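
    Example sketch

    Randomised Ferns have a compact generic form: each fern groups a few binary tests into a code, and codes are treated as independent between ferns given the class (the "semi-naive" assumption). A sketch in which plain pairwise feature comparisons stand in for the paper's comparisons of summed responses in random spatio-temporal blocks:

        import numpy as np

        class RandomisedFerns:
            """Generic semi-naive Bayesian ferns over binary feature tests."""

            def __init__(self, n_ferns=20, n_tests=8, seed=0):
                self.M, self.S = n_ferns, n_tests
                self.rng = np.random.default_rng(seed)

            def _codes(self, X):
                bits = (X[:, self.idx_a] > X[:, self.idx_b]).astype(np.int64)
                return (bits << np.arange(self.S)).sum(axis=2)   # (n, M) fern codes

            def fit(self, X, y):
                y = np.asarray(y)
                self.classes = np.unique(y)
                self.idx_a = self.rng.integers(0, X.shape[1], (self.M, self.S))
                self.idx_b = self.rng.integers(0, X.shape[1], (self.M, self.S))
                codes, K = self._codes(X), 2 ** self.S
                self.logp = np.zeros((len(self.classes), self.M, K))
                for ci, c in enumerate(self.classes):
                    for m in range(self.M):
                        counts = np.bincount(codes[y == c, m], minlength=K)
                        # Laplace-smoothed log P(code | class) per fern
                        self.logp[ci, m] = np.log((counts + 1) / (counts.sum() + K))
                return self

            def predict(self, X):
                codes, m_idx = self._codes(X), np.arange(self.M)
                # sum log P(code_m | class) across ferns, then take the best class
                scores = np.array([self.logp[:, m_idx, row].sum(axis=1) for row in codes])
                return self.classes[scores.argmax(axis=1)]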

  • Lan Y, Harvey R, Theobald B, Ong EJ, Bowden R. (2009) 'Comparing Visual Features for Lipreading'. International Conference on Auditory-Visual Speech Processing (AVSP 2009), Norwich, UK: ISCA, pp. 102-106.

    Abstract

    For automatic lipreading, there are many competing methods for feature extraction. Often, because of the complexity of the task these methods are tested on only quite restricted datasets, such as the letters of the alphabet or digits, and from only a few speakers. In this paper we compare some of the leading methods for lip feature extraction and compare them on the GRID dataset which uses a constrained vocabulary over, in this case, 15 speakers. Previously the GRID data has had restricted attention because of the requirements to track the face and lips accurately. We overcome this via the use of a novel linear predictor (LP) tracker which we use to control an Active Appearance Model (AAM). By ignoring shape and/or appearance parameters from the AAM we can quantify the effect of appearance and/or shape when lip-reading. We find that shape alone is a useful cue for lipreading (which is consistent with human experiments). However, the incremental effect of shape on appearance appears to be not significant which implies that the inner appearance of the mouth contains more information than the shape.

  • Moore S, Bowden R. (2009) 'The Effects of Pose On Facial Expression Recognition'. BMVA Press Proceedings of the British Machine Vision Conference, London, UK: BMVC 2009, pp. 1-11.

    Abstract

    Research into facial expression recognition has predominantly been based upon near frontal view data. However, a recent 3D facial expression database (BU-3DFE database) has allowed empirical investigation of facial expression recognition across pose. In this paper, we investigate the effects of pose from frontal to profile view on facial expression recognition. Experiments are carried out on 100 subjects with 5 yaw angles over 6 prototypical expressions. Expressions have 4 levels of intensity from subtle to exaggerated. We evaluate features such as local binary patterns (LBPs) as well as various extensions of LBPs. In addition, a novel approach to facial expression recognition is proposed using local Gabor binary patterns (LGBPs). Multi-class support vector machines (SVMs) are used for classification. We investigate the effects of image resolution and pose on facial expression classification using a variety of different features.
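
    Example sketch

    For reference, the base LBP operator evaluated above encodes each pixel by thresholding its 8 neighbours against the centre value and histogramming the resulting byte codes. A minimal sketch of the vanilla 3x3 operator only (none of the evaluated extensions):

        import numpy as np

        def lbp_histogram(gray, bins=256):
            """Normalised histogram of basic 3x3 local binary pattern codes."""
            c = gray[1:-1, 1:-1]                    # centre pixels (borders skipped)
            offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                       (1, 1), (1, 0), (1, -1), (0, -1)]   # clockwise neighbours
            code = np.zeros_like(c, dtype=np.uint8)
            for bit, (dr, dc) in enumerate(offsets):
                nb = gray[1 + dr:gray.shape[0] - 1 + dr,
                          1 + dc:gray.shape[1] - 1 + dc]
                code |= (nb >= c).astype(np.uint8) << bit   # one bit per neighbour
            hist = np.bincount(code.ravel(), minlength=bins)
            return hist / hist.sum()                # the descriptor fed to the SVM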

  • Cooper H, Bowden R. (2009) 'Sign Language Recognition: Working with Limited Corpora'. in Stephanidis C (ed.) Universal Access in Human-Computer Interaction: Applications and Services, Part III, San Diego, CA: 5th International Conference on Universal Access in Human-Computer Interaction, held as part of HCI International 2009, LNCS 5616, pp. 472-481.
  • Cooper H, Bowden R. (2009) 'Learning Signs from Subtitles: A Weakly Supervised Approach to Sign Language Recognition'. 2009 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops 2009), Miami Beach, FL, pp. 2560-2566.
  • Efthimiou E, Fotinea S-E, Vogler C, Hanke T, Glauert J, Bowden R, Braffort A, Collet C, Maragos P, Segouat J. (2009) 'Sign language recognition, generation, and modelling: A research effort with applications in deaf communication'. Springer Lecture Notes in Computer Science: Proceedings of 5th International Conference of Universal Access in Human-Computer Interaction. Addressing Diversity, Part 1, San Diego, USA: UAHCI 2009, Held as part of HCI International 2009 5614, pp. 21-30.
  • Gilbert A, Illingworth J, Bowden R. (2009) 'Fast realistic multi-action recognition using mined dense spatio-temporal features'. Proceedings of the IEEE International Conference on Computer Vision, pp. 925-931.
  • Oshin O, Gilbert A, Illingworth J, Bowden R. (2008) 'Spatio-Temporal Feature Recognition using Randomised Ferns'. 1st International Workshop on Machine Learning for Vision-based Motion Analysis (MVLMA'08), ECCV 2008, Marseille, France.
  • Ellis L, Matas J, Bowden R. (2008) 'Online Learning and Partitioning of Linear Displacement Predictors for Tracking'. The British Machine Vision Association (BMVA) Proceedings of the British Machine Vision Conference, Leeds, UK: BMVC 2008, pp. 33-42.

    Abstract

    A novel approach to learning and tracking arbitrary image features is presented. Tracking is tackled by learning the mapping from image intensity differences to displacements. Linear regression is used, resulting in low computational cost. An appearance model of the target is built on-the-fly by clustering sub-sampled image templates. The medoidshift algorithm is used to cluster the templates thus identifying various modes or aspects of the target appearance, each mode is associated to the most suitable set of linear predictors allowing piecewise linear regression from image intensity differences to warp updates. Despite no hard-coding or offline learning, excellent results are shown on three publicly available video sequences and comparisons with related approaches made.

  • Okwechime D, Bowden R. (2008) 'A generative model for motion synthesis and blending using probability density estimation'. Articulated Motion and Deformable Objects: Proceedings of the 5th International Conference, Port d'Andratx, Spain: Springer, LNCS 5098, pp. 218-227.
  • Gilbert A, Illingworth J, Bowden R. (2008) 'Scale Invariant Action Recognition Using Compound Features Mined from Dense Spatio-temporal Corners'. Springer Lecture Notes in Computer Science: Proceedings of 10th European Conference on Computer Vision (Part 1), Marseille, France: ECCV 2008 5302, pp. 222-233.
  • Ong E-J, Bowden R. (2008) 'Robust Lip-Tracking using Rigid Flocks of Selected Linear Predictors'. 8th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2008), pp. 247-254.
  • Cooper H, Bowden R. (2007) 'Sign Language Recognition Using Boosted Volumetric Features'. MVA Organisation Proceedings of the IAPR Conference on Machine Vision Applications, Tokyo, Japan: IAPR MVA 2007, pp. 359-362.

    Abstract

    This paper proposes a method for sign language recognition that bypasses the need for tracking by classifying the motion directly. The method uses the natural extension of Haar-like features into the temporal domain, computed efficiently using an integral volume. These volumetric features are assembled into spatio-temporal classifiers using boosting. Results are presented for a fast feature extraction method and 2 different types of boosting. These configurations have been tested on a data set consisting of both seen and unseen signers performing 5 signs, producing competitive results.
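
    Example sketch

    The integral volume mentioned above is the 3-D analogue of the integral image: after one cumulative-sum pass, any box sum, and hence any Haar-like volumetric feature built from differences of box sums, costs a constant eight lookups. A sketch (function names are illustrative):

        import numpy as np

        def integral_volume(video):
            """Zero-padded table: iv[t, y, x] = video[:t, :y, :x].sum()."""
            iv = np.zeros(tuple(s + 1 for s in video.shape))
            iv[1:, 1:, 1:] = video.cumsum(0).cumsum(1).cumsum(2)
            return iv

        def box_sum(iv, t0, t1, y0, y1, x0, x1):
            """Sum of video[t0:t1, y0:y1, x0:x1] via 3-D inclusion-exclusion."""
            return (iv[t1, y1, x1]
                    - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
                    + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0]
                    - iv[t0, y0, x0])

        # A Haar-like volumetric feature is then a signed combination of box
        # sums, e.g. box_sum(iv, ...) - box_sum(iv, ...) for adjacent boxes.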

  • Ellis L, Bowden R. (2007) 'Learning Responses to Visual Stimuli: A Generic Approach'. Applied Computer Science Group, Bielefeld University, Germany Proceedings of the 5th International Conference on Computer Vision Systems, Bielefeld, Germany: ICVS 2007

    Abstract

    A general framework for learning to respond appropriately to visual stimulus is presented. By hierarchically clustering percept-action exemplars in the action space, contextually important features and relationships in the perceptual input space are identified and associated with response models of varying generality. Searching the hierarchy for a set of best matching percept models yields a set of action models with likelihoods. By posing the problem as one of cost surface optimisation in a probabilistic framework, a particle filter inspired forward exploration algorithm is employed to select actions from multiple hypotheses that move the system toward a goal state and to escape from local minima. The system is quantitatively and qualitatively evaluated in both a simulated shape sorter puzzle and a real-world autonomous navigation domain.

  • Ellis L, Dowson N, Matas J, Bowden R. (2007) 'Linear predictors for fast simultaneous modeling and tracking'. 2007 IEEE 11th International Conference on Computer Vision (ICCV 2007), Rio de Janeiro, Brazil, pp. 2792-2799.
  • Gilbert A, Bowden R. (2007) 'Multi person tracking within crowded scenes'. Human Motion - Understanding, Modeling, Capture and Animation: 2nd Workshop, Rio de Janeiro, Brazil: Springer, LNCS 4814, pp. 166-179.
  • Moore S, Bowden R. (2007) 'Automatic facial expression recognition using boosted discriminatory classifiers'. Springer Lecture Notes in Computer Science: Analysis and Modelling of Faces and Gestures, Rio de Janeiro, Brazil: Third International Workshop on AMFG'07 4778, pp. 71-83.
  • Cooper H, Bowden R. (2007) 'Large lexicon detection of sign language'. in Lew M, Sebe N, Huang TS, Bakker EM (eds.) Human-Computer Interaction: IEEE International Workshop, Rio de Janeiro, Brazil: Springer, LNCS 4796, pp. 88-97.
  • Ong E-J, Bowden R. (2006) 'Learning Distance for Arbitrary Visual Features'. BMVA Proceedings of the British Machine Vision Conference, Edinburgh, UK: BMVC 2006 2, pp. 749-758.

    Abstract

    This paper presents a method for learning distance functions of arbitrary feature representations that is based on the concept of wormholes. We introduce wormholes and describe how they provide a method for warping the topology of visual representation spaces such that a meaningful distance between examples is available. Additionally, we show how a more general distance function can be learnt through the combination of many wormholes via an inter-wormhole network. We then demonstrate the application of the distance learning method on a variety of problems including nonlinear synthetic data, face illumination detection and the retrieval of images containing natural landscapes and man-made objects (e.g. cities).

  • Dowson NDH, Bowden R. (2006) 'N-tier Simultaneous Modelling and Tracking for Arbitrary Warps'. BMVA Proceedings of the British Machine Vision Conference, Edinburgh, UK: BMVC 2006 2, pp. 569-578.

    Abstract

    This paper presents an approach to object tracking which, given a single example of a target, learns a hierarchical constellation model of appearance and structure on the fly. The model becomes more robust over time as evidence of the variability of the object is acquired and added to the model. Tracking is performed in an optimised Lucas-Kanade type framework, using Mutual Information as a similarity metric. Several novelties are presented: an improved template update strategy using Bayes theorem, a multi-tier model topology, and a semi-automatic testing method. A critical comparison with other methods is made using exhaustive testing. In all, 11 challenging test sequences were used with a mean length of 568 frames.

  • Gilbert A, Bowden R. (2006) 'Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity'. Springer Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 2, Graz, Austria: ECCV 2006 3952, pp. 125-136.

    Abstract

    This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches this technique uses an incremental learning method, to model both the colour variations and posterior probability distributions of spatio-temporal links between cameras. These operate in parallel and are then used with an appearance model of the object to track across spatially separated cameras. The approach requires no pre-calibration or batch preprocessing, is completely unsupervised, and becomes more accurate over time as evidence is accumulated.

  • Micilotta AS, Ong EJ, Bowden R. (2006) 'Real-time upper body detection and 3D pose estimation in monoscopic images'. Springer Lecture Notes in Computer Science: Proceedings of 9th European Conference on Computer Vision, Part III, Graz, Austria: ECCV 2006 3953, pp. 139-150.

    Abstract

    This paper presents a novel solution to the difficult task of both detecting and estimating the 3D pose of humans in monoscopic images. The approach consists of two parts. Firstly the location of a human is identified by a probabilistic assembly of detected body parts. Detectors for the face, torso and hands are learnt using AdaBoost. A pose likelihood is then obtained using an a priori mixture model on body configuration, with possible configurations assembled from available evidence using RANSAC. Once a human has been detected, the location is used to initialise a matching algorithm which matches the silhouette and edge map of a subject with a 3D model. This is done efficiently using chamfer matching, integral images and pose estimation from the initial detection stage. We demonstrate the application of the approach to large, cluttered natural images and at near framerate operation (16fps) on lower resolution video streams.
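
    Example sketch

    The chamfer-matching step above is cheap because a single distance transform of the scene's edge map lets every candidate model silhouette be scored by table lookups. A sketch assuming SciPy is available (a generic chamfer cost, not the paper's exact formulation):

        import numpy as np
        from scipy.ndimage import distance_transform_edt

        def chamfer_score(image_edges, model_points):
            """Mean distance from model edge points to the nearest image edge.

            image_edges:  boolean edge map of the scene;
            model_points: (k, 2) integer (row, col) pixels of one candidate pose's
                          projected silhouette/edges. Lower scores are better.
            """
            # distance, at every pixel, to the closest edge pixel
            dt = distance_transform_edt(~image_edges)
            return float(dt[model_points[:, 0], model_points[:, 1]].mean())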

  • Dowson N, Bowden R. (2006) 'A unifying framework for mutual information methods for use in non-linear optimisation'. Springer Lecture Notes in Computer Science: 9th European Conference on Computer Vision, Proceedings Part 1, Graz, Austria: ECCV 2006 3951, pp. 365-378.

    Abstract

    Many variants of mutual information (MI) exist in the literature. These vary primarily in how the joint histogram is populated. This paper places the four main variants of MI: Standard sampling, Partial Volume Estimation (PVE), In-Parzen Windowing and Post-Parzen Windowing into a single mathematical framework. Jacobians and Hessians are derived in each case. A particular contribution is that the non-linearities implicit to standard sampling and post-Parzen windowing are explicitly dealt with. These non-linearities are a barrier to their use in optimisation. Side-by-side comparison of the MI variants is made using eight diverse data-sets, considering computational expense and convergence. In the experiments, PVE was generally the best performer, although standard sampling often performed nearly as well (if a higher sample rate was used). The widely used sum of squared differences metric performed as well as MI unless large occlusions and non-linear intensity relationships occurred. The binaries and scripts used for testing are available online.

  • Dowson NDH, Bowden R, Kadir T. (2006) 'Image template matching using mutual information and NP-Windows'. 18th International Conference on Pattern Recognition (ICPR 2006), Vol. 2, Hong Kong, China, pp. 1186-1191.

    Abstract

    A non-parametric (NP) sampling method is introduced for obtaining the joint distribution of a pair of images. This method is based on NP windowing and is equivalent to sampling the images at infinite resolution. Unlike existing methods, arbitrary selection of kernels is not required and the spatial structure of images is used. NP windowing is applied to a registration application where the mutual information (MI) between a reference image and a warped template is maximised with respect to the warp parameters. In comparisons against the current state-of-the-art MI registration methods, NP windowing yielded excellent results with lower bias and improved convergence rates.

  • Ong EJ, Bowden R. (2006) 'Learning wormholes for sparsely labelled clustering'. 18th International Conference on Pattern Recognition (ICPR 2006), Vol. 1, Hong Kong, China, pp. 916-919.

    Abstract

    Distance functions are an important component in many learning applications. However, the correct function is context dependent, so it is advantageous to learn a distance function using available training data. A limitation of many existing distance functions is the requirement that data exist in a space of constant dimensionality; they also cannot be used directly on symbolic data. To address these problems, this paper introduces an alternative learnable distance function based on multi-kernel distance bases, or "wormholes", that connect spaces belonging to similar examples that were originally far apart, bringing them close together. This work only assumes the availability of a set of data in the form of relative comparisons, avoiding the need for labelled or quantitative information. To learn the distance function, two algorithms are proposed: 1) building a set of basic wormhole bases using a Boosting-inspired algorithm; 2) merging different distance bases together for better generalisation. The learning algorithms are then shown to successfully extract suitable distance functions in various clustering problems, ranging from synthetic 2D data to symbolic representations of unlabelled images.

Book chapters

  • Cooper HM, Holt B, Bowden R. (2011) 'Sign Language Recognition'. in Moeslund TB, Hilton A, Krüger V, Sigal L (eds.) Visual Analysis of Humans: Looking at People. Springer Verlag, pp. 539-562.

    Abstract

    This chapter covers the key aspects of sign-language recognition (SLR), starting with a brief introduction to the motivations and requirements, followed by a précis of sign linguistics and their impact on the field. The types of data available and their relative merits are explored, allowing examination of the features which can be extracted. Classifying the manual aspects of sign (similar to gestures) is then discussed from a tracking and non-tracking viewpoint before summarising some of the approaches to the non-manual aspects of sign languages. Methods for combining the sign classification results into full SLR are given, showing the progression towards speech recognition techniques and the further adaptations required for the sign-specific case. Finally the current frontiers are discussed and the recent research presented. This covers the task of continuous sign recognition, the work towards true signer independence, how to effectively combine the different modalities of sign, making use of current linguistic research, and adapting to larger, more noisy data sets.

  • Oshin O, Gilbert A, Illingworth J, Bowden R. (2009) 'Machine Learning for Human Motion Analysis'. in Wang L, Cheng L, Zhao G (eds.) Machine Learning for Human Motion Analysis. IGI Publishing, Article 2, pp. 14-30.

Theses and dissertations

  • Cooper HM. (2010) Sign Language Recognition: Generalising to More Complex Corpora. University of Surrey.

Departmental Duties

Senior Tutor for Professional Training
