I have worked on various theoretical aspects of Pattern Recognition, Image Analysis and Computer Vision, and on many applications including System Identification, Automatic Inspection, ECG diagnosis, Mammographic Image Interpretation, Remote Sensing, Robotics, Speech Recognition, Character Recognition and Document Processing, Image Coding, Biometrics, Image and Video Database Retrieval, Surveillance. Contributions to statistical pattern recognition include k-nearest neighbour methods of pattern classification, feature selection, contextual classification, probabilistic relaxation and most recently to multiple expert fusion. In computer vision my major contributions include robust statistical methods for shape analysis and detection, motion estimation and segmentation, and image segmentation by thresholding and edge detection.
I have co-authored a book with the title `Pattern Recognition: a statistical approach' published by Prentice-Hall and published more than 500 papers.
Served as a member of the Editorial Board of IEEE Transactions on Pattern Analysis and Machine Intelligence during 1982-85. Currently serves on the Editorial Boards of Pattern Recognition Letters, Pattern Recognition and Artificial Intelligence, Pattern Analysis and Applications.
Served on the Governing Board of the International Association for Pattern Recognition (IAPR) as one of the two British representatives during the period 1982-2005. President of the IAPR during 1994-1996. Currently a member of the KS Fu Prize Committee of IAPR.
Received Best Paper awards from the Pattern Recognition Society, the British Machine Vision Association and IEE.
Received ``Honorary Medal'' from the Electrotechnical Faculty of the Czech Technical University in Prague in September 1995 for contributions to the field of pattern recognition and computer vision.
Elected Fellow of the International Association for Pattern Recognition in 1998
Elected Fellow of Institution of Electrical Engineers in 1999
Received Honorary Doctorate from the Lappeenranta University of Technology, Finland, for contributions to Pattern Recognition and Computer Vision in 1999
Elected Fellow of the Royal Academy of Engineering, 2000
Received Institution of Electrical Engineers Achievemments Medal 2002 for outstanding contributions to Visual Information Engineering
Elected BMVA Distinguished Fellow 2002
Received, from the Czech Academy of Sciences, the 2003 Bernard Bolzano Honorary Medal for Merit in the Mathematical Sciences
Awarded the title Distinguished Professor of the University of Surrey in 2004
Appointed as Series Editor for Springer Lecture Notes in Computer Science 2004
Awarded the KS Fu Prize 2006, by the International Association for Pattern Recognition, for outstanding contributions to Pattern Recognition (the prize awarded biennially)
Received Honorary Doctorate from the Czech Technical University in Prague in 2007, on the occasion of the 300th anniversary of its foundation.
Awarded the IET Faraday Medal 2008.
Find me on campus Room: 25 BA 00
3D face reconstruction of shape and skin texture from a single 2D image can be performed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach. However, performing this reconstruction (fitting) efficiently and accurately in a general imaging scenario is a challenge. Such a scenario would involve a perspective camera to describe the geometric projection from 3D to 2D, and the Phong model to characterise illumination. Under these imaging assumptions the reconstruction problem is nonlinear and, consequently, computationally very demanding. In this work, we present an efficient stepwise 3DMM-to-2D image-fitting procedure, which sequentially optimises the pose, shape, light direction, light strength and skin texture parameters in separate steps. By linearising each step of the fitting process we derive closed-form solutions for the recovery of the respective parameters, leading to efficient fitting. The proposed optimisation process involves all the pixels of the input image, rather than randomly selected subsets, which enhances the accuracy of the fitting. It is referred to as Efficient Stepwise Optimisation (ESO). The proposed fitting strategy is evaluated using reconstruction error as a performance measure. In addition, we demonstrate its merits in the context of a 3D-assisted 2D face recognition system which detects landmarks automatically and extracts both holistic and local features using a 3DMM. This contrasts with most other methods which only report results that use manual face landmarking to initialise the fitting. Our method is tested on the public CMU-PIE and Multi-PIE face databases, as well as one internal database. The experimental results show that the face reconstruction using ESO is significantly faster, and its accuracy is at least as good as that achieved by the existing 3DMM fitting algorithms. A face recognition system integrating ESO to provide a pose and illumination invariant solution compares favourably with other state-of-the-art methods. In particular, it outperforms deep learning methods when tested on the Multi-PIE database.
We present a fully automatic approach to real-time 3D face reconstruction from monocular in-the-wild videos. With the use of a cascaded-regressor based face tracking and a 3D Morphable Face Model shape fitting, we obtain a semi-dense 3D face shape. We further use the texture information from multiple frames to build a holistic 3D face representation from the video footage. Our system is able to capture facial expressions and does not require any person-specific training. We demonstrate the robustness of our approach on the challenging 300 Videos in the Wild (300-VW) dataset. Our real-time fitting framework is available as an open source library at http://4dface.org.
The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles as they need to be propagated over the entire image in every frame to detect the new speaker presence in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (VSMC- PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named as AVMS-SMCPHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD based on the AV16.3, AMI and CLEAR datasets.
© 1999-2012 IEEE.The problem of tracking multiple moving speakers in indoor environments has received much attention. Earlier techniques were based purely on a single modality, e.g., vision. Recently, the fusion of multi-modal information has been shown to be instrumental in improving tracking performance, as well as robustness in the case of challenging situations like occlusions (by the limited field of view of cameras or by other speakers). However, data fusion algorithms often suffer from noise corrupting the sensor measurements which cause non-negligible detection errors. Here, a novel approach to combining audio and visual data is proposed. We employ the direction of arrival angles of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. This approach is further improved by solving a typical problem associated with the PF, whose efficiency and accuracy usually depend on the number of particles and noise variance used in state estimation and particle propagation. Both parameters are specified beforehand and kept fixed in the regular PF implementation which makes the tracker unstable in practice. To address these problems, we design an algorithm which adapts both the number of particles and noise variance based on tracking error and the area occupied by the particles in the image. Experiments on the AV16.3 dataset show the advantage of our proposed methods over the baseline PF method and an existing adaptive PF algorithm for tracking occluded speakers with a significantly reduced number of particles.
We propose four variants of a novel hierarchical hidden Markov models strategy for rule induction in the context of automated sports video annotation including a multilevel Chinese takeaway process (MLCTP) based on the Chinese restaurant process and a novel Cartesian product label-based hierarchical bottom-up clustering (CLHBC) method that employs prior information contained within label structures. Our results show significant improvement by comparison against the flat Markov model: optimal performance is obtained using a hybrid method, which combines the MLCTP generated hierarchical topological structures with CLHBC generated event labels. We also show that the methods proposed are generalizable to other rule-based environments including human driving behavior and human actions.
We address the problem of anomaly detection in machine perception. The concept of domain anomaly is introduced as distinct from the conventional notion of anomaly used in the literature. We propose a unified framework for anomaly detection which exposes the multifaceted nature of anomalies and suggest effective mechanisms for identifying and distinguishing each facet as instruments for domain anomaly detection. The framework draws on the Bayesian probabilistic reasoning apparatus which clearly defines concepts such as outlier, noise, distribution drift, novelty detection (object, object primitive), rare events, and unexpected events. Based on these concepts we provide a taxonomy of domain anomaly events. One of the mechanisms helping to pinpoint the nature of anomaly is based on detecting incongruence between contextual and noncontextual sensor(y) data interpretation. The proposed methodology has wide applicability. It underpins in a unified way the anomaly detection applications found in the literature. To illustrate some of its distinguishing features, in here the domain anomaly detection methodology is applied to the problem of anomaly detection for a video annotation system.
Lip region deformation during speech contains biometric information and is termed visual speech. This biometric information can be interpreted as being genetic or behavioral depending on whether static or dynamic features are extracted. In this paper, we use a texture descriptor called local ordinal contrast pattern (LOCP) with a dynamic texture representation called three orthogonal planes to represent both the appearance and dynamics features observed in visual speech. This feature representation, when used in standard speaker verification engines, is shown to improve the performance of the lip-biometric trait compared to the state-of-the-art. The best baseline state-of-the-art performance was a half total error rate (HTER) of 13.35% for the XM2VTS database. We obtained HTER of less than 1%. The resilience of the LOCP texture descriptor to random image noise is also investigated. Finally, the effect of the amount of video information on speaker verification performance suggests that with the proposed approach, speaker identity can be verified with a much shorter biometric trait record than the length normally required for voice-based biometrics. In summary, the performance obtained is remarkable and suggests that there is enough discriminative information in the mouth-region to enable its use as a primary biometric trait. © 2006 IEEE.
Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general `p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances inMKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of `p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that `p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, `p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines.
The use of non-negative matrix factorisation (NMF) on 2D face images has been shown to result in sparse feature vectors that encode for local patches on the face, and thus provides a statistically justified approach to learning parts from wholes. However successful on 2D images, the method has so far not been extended to 3D images. The main reason for this is that 3D space is a continuum and so it is not apparent how to represent 3D coordinates in a non-negative fashion. This work compares different non-negative representations for spatial coordinates, and demonstrates that not all non-negative representations are suitable. We analyse the representational properties that make NMF a successful method to learn sparse 3D facial features. Using our proposed representation, the factorisation results in sparse and interpretable facial features.
3D Morphable Face Models are a powerful tool in computer vision. They consist of a PCA model of face shape and colour information and allow to reconstruct a 3D face from a single 2D image. 3D Morphable Face Models are used for 3D head pose estimation, face analysis, face recognition, and, more recently, facial landmark detection and tracking. However, they are not as widely used as 2D methods - the process of building and using a 3D model is much more involved. In this paper, we present the Surrey Face Model, a multi-resolution 3D Morphable Model that we make available to the public for non-commercial purposes. The model contains different mesh resolution levels and landmark point annotations as well as metadata for texture remapping. Accompanying the model is a lightweight open-source C++ library designed with simplicity and ease of integration as its foremost goals. In addition to basic functionality, it contains pose estimation and face frontalisation algorithms. With the tools presented in this paper, we aim to close two gaps. First, by offering different model resolution levels and fast fitting functionality, we enable the use of a 3D Morphable Model in time-critical applications like tracking. Second, the software library makes it easy for the community to adopt the 3D Morphable Face Model in their research, and it offers a public place for collaboration.
In this paper, we propose a novel fitting method that uses local image features to fit a 3D Morphable Face Model to 2D images. To overcome the obstacle of optimising a cost function that contains a non-differentiable feature extraction operator, we use a learning-based cascaded regression method that learns the gradient direction from data. The method allows to simultaneously solve for shape and pose parameters. Our method is thoroughly evaluated on Morphable Model generated data and first results on real data are presented. Compared to traditional fitting methods, which use simple raw features like pixel colour or edge maps, local features have been shown to be much more robust against variations in imaging conditions. Our approach is unique in that we are the first to use local features to fit a 3D Morphable Model. Because of the speed of our method, it is applicable for real-time applications. Our cascaded regression framework is available as an open source library at github.com/patrikhuber/superviseddescent.
© 2014 IEEE.Large pose and illumination variations are very challenging for face recognition. The 3D Morphable Model (3DMM) approach is one of the effective methods for pose and illumination invariant face recognition. However, it is very difficult for the 3DMM to recover the illumination of the 2D input image because the ratio of the albedo and illumination contributions in a pixel intensity is ambiguous. Unlike the traditional idea of separating the albedo and illumination contributions using a 3DMM, we propose a novel Albedo Based 3D Morphable Model (AB3DMM), which removes the illumination component from the images using illumination normalisation in a preprocessing step. A comparative study of different illumination normalisation methods for this step is conducted on PIE and Multi-PIE databases. The results show that overall performance of our method outperforms state-of-the-art methods.
In practical applications of pattern recognition and computer vision, the performance of many approaches can be improved by using multiple models. In this paper, we develop a common theoretical framework for multiple model fusion at the feature level using multilinear subspace analysis (also known as tensor algebra). One disadvantage of the multilinear approach is that it is hard to obtain enough training observations for tensor decomposition algorithms. To overcome this difficulty, we adopted the M2SA algorithm to reconstruct the missing entries of the incomplete training tensor. Furthermore, we apply the proposed framework to the problem of face image analysis using Active Appearance Model (AAM) to validate its performance. Evaluations of AAM using the proposed framework are conducted on Multi-PIE face database with promising results. © Springer-Verlag 2013.
3D face reconstruction from a single 2D image can be performed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach. However, the reconstruction is an ill-posed problem. The recovery of the illumination characteristics of the 2D input image is particularly difficult because the proportion of the albedo and shading contributions in a pixel intensity is ambiguous. In this paper we propose the use of a facial symmetry constraint, which helps to identify the relative contributions of albedo and shading. The facial symmetry constraint is incorporated in a multi-feature optimisation framework, which realises the fitting process. By virtue of this constraint better illumination parameters can be recovered, and as a result the estimated 3D face shape and surface texture are more accurate. The proposed method is validated on the PIE face database. The experimental results show that the introduction of facial symmetry constraint improves the performance of both, face reconstruction and face recognition.
Particle filtering has emerged as a useful tool for tracking problems. However, the efficiency and accuracy of the filter usually depend on the number of particles and noise variance used in the estimation and propagation functions for re-allocating these particles at each iteration. Both of these parameters are specified beforehand and are kept fixed in the regular implementation of the filter which makes the tracker unstable in practice. In this paper we are interested in the design of a particle filtering algorithm which is able to adapt the number of particles and noise variance. The new filter, which is based on audio-visual (AV) tracking, uses information from the tracking errors to modify the number of particles and noise variance used. Its performance is compared with a previously proposed audio-visual particle filtering algorithm with a fixed number of particles and an existing adaptive particle filtering algorithm, using the AV 16.3 dataset with single and multi-speaker sequences. Our proposed approach demonstrates good tracking performance with a significantly reduced number of particles. © 2013 EURASIP.
Performing facial recognition between Near Infrared (NIR) and visible-light (VIS) images has been established as a common method of countering illumination variation problems in face recognition. In this paper we present a new database to enable the evaluation of cross-spectral face recognition. A series of preprocessing algorithms, followed by Local Binary Pattern Histogram (LBPH) representation and combinations with Linear Discriminant Analysis (LDA) are used for recognition. These experiments are conducted on both NIR→VIS and the less common VIS→NIR protocols, with permutations of uni-modal training sets. 12 individual baseline algorithms are presented. In addition, the best performing fusion approaches involving a subset of 12 algorithms are also described. © 2011 IEEE.
Over the last few years, several approaches have been proposed for information fusion including different variants of classifier level fusion (ensemble methods), stacking and multiple kernel learning (MKL). MKL has become a preferred choice for information fusion in object recognition. However, in the case of highly discriminative and complementary feature channels, it does not significantly improve upon its trivial baseline which averages the kernels. Alternative ways are stacking and classifier level fusion (CLF) which rely on a two phase approach. There is a significant amount of work on linear programming formulations of ensemble methods particularly in the case of binary classification. In this paper we propose a multiclass extension of binary ν-LPBoost, which learns the contribution of each class in each feature channel. The existing approaches of classifier fusion promote sparse features combinations, due to regularization based on ℓ1-norm, and lead to a selection of a subset of feature channels, which is not good in the case of informative channels. Therefore, we generalize existing classifier fusion formulations to arbitrary ℓp-norm for binary and multiclass problems which results in more effective use of complementary information. We also extended stacking for both binary and multiclass datasets. We present an extensive evaluation of the fusion methods on four datasets involving kernels that are all informative and achieve state-of-the-art results on all of them. © 2011 Springer-Verlag.
We describe a novel framework to detect ball hits in a tennis game by combining audio and visual information. Ball hit detection is a key step in understanding a game such as tennis, but single-mode approaches are not very successful: audio detection suffers from interfering noise and acoustic mismatch, video detection is made difficult by the small size of the ball and the complex background of the surrounding environment. Our goal in this paper is to improve detection performance by focusing on high-level information (rather than low-level features), including the detected audio events, the ball’s trajectory, and inter-event timing information. Visual information supplies coarse detection of the ball-hits events. This information is used as a constraint for audio detection. In addition, useful gains in detection performance can be obtained by using and inter-ballhit timing information, which aids prediction of the next ball hit. This method seems to be very effective in reducing the interference present in low-level features. After applying this method to a women’s doubles tennis game, we obtained improvements in the F-score of about 30% (absolute) for audio detection and about 10% for video detection.
Multiple Kernel Learning (MKL) has become a preferred choice for information fusion in image recognition problem. Aim of MKL is to learn optimal combination of kernels formed from different features, thus, to learn importance of different feature spaces for classification. Augmented Kernel Matrix (AKM) has recently been proposed to accommodate for the fact that a single training example may have different importance in different feature spaces, in contrast to MKL that assigns same weight to all examples in one feature space. However, AKM approach is limited to small datasets due to its memory requirements. We propose a novel two stage technique to make AKM applicable to large data problems. In first stage various kernels are combined into different groups automatically using kernel alignment. Next, most influential training examples are identified within each group and used to construct an AKM of significantly reduced size. This reduced size AKM leads to same results as the original AKM. We demonstrate that proposed two stage approach is memory efficient and leads to better performance than original AKM and is robust to noise. Results are compared with other state-of-the art MKL techniques, and show improvement on challenging object recognition benchmarks. © 2011 Springer-Verlag.
Bags-of-visual-Words (BoW) and Spatio-Temporal Shapes (STS) are two very popular approaches for action recognition from video. The former (BoW) is an un-structured global representation of videos which is built using a large set of local features. The latter (STS) uses a single feature located on a region of interest (where the actor is) in the video. Despite the popularity of these methods, no comparison between them has been done. Also, given that BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. This paper compares these two approaches using four different datasets with varied degree of space-time specificity of the actions and varied relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. Further to BoW and STS, we also evaluated novel variations of BoW constrained in time or space. We observe that the STS approach leads to better results in all datasets whose background is of little relevance to action classification. © 2010 IEEE.
This paper addresses issues of analysis of DAPI-stained microscopy images of cell samples, particularly classification of objects as single nuclei, nuclei clusters or nonnuclear material. First, segmentation is significantly improved compared to Otsu’s method by choosing a more appropriate threshold, using a cost-function that explicitly relates to the quality of resulting boundary, rather than image histogram. This method applies ideas from active contour models to threshold-based segmentation, combining the local image sensitivity of the former with the simplicity and lower computational complexity of the latter. Secondly, we evaluate some novel measurements that are useful in classification of resulting shapes. Particularly, analysis of central distance profiles provides a method for improved detection of notches in nuclei clusters. Error rates are reduced to less than half compared to those of the base system, which used Fourier shape descriptors alone.
We apply domain adaptation to the problem of recognizing common actions between differing court-game sport videos (in particular tennis and badminton games). Actions are characterized in terms of HOG3D features extracted at the bounding box of each detected player, and thus have large intrinsic dimensionality. The techniques evaluated here for domain adaptation are based on estimating linear transformations to adapt the source domain features in order to maximize the similarity between posterior PDFs for each class in the source domain and the expected posterior PDF for each class in the target domain. As such, the problem scales linearly with feature dimensionality, making the video-environment domain adaptation problem tractable on reasonable time scales and resilient to over-fitting. We thus demonstrate that significant performance improvement can be achieved by applying domain adaptation in this context.
We consider the problem of learning a linear combination of pre-specified kernel matrices in the Fisher discriminant analysis setting. Existing methods for such a task impose an l1 norm regularisation on the kernel weights, which produces sparse solution but may lead to loss of information. In this paper, we propose to use l2 norm regularisation instead. The resulting learning problem is formulated as a semi-infinite program and can be solved efficiently. Through experiments on both synthetic data and a very challenging object recognition benchmark, the relative advantages of the proposed method and its l1 counterpart are demonstrated, and insights are gained as to how the choice of regularisation norm should be made. © 2009 IEEE.
It is well recognised that data association is critically important for object tracking. However, in the presence of successive misdetections, a large number of false candidates and an unknown number of abrupt model switchings that happen unpredictably, the data association problem can be very difficult. We tackle these difficulties by using a layered data association scheme. At the object level, trajectories are "grown" from sets of object candidates that have high probabilities of containing only true positives; by this means the otherwise combinatorial complexity is significantly reduced. Dijkstra 's shortest path algorithm is then used to perform data association at the trajectory level. The algorithm is applied to low-quality tennis video sequences to track a tennis ball. Experiments show that the algorithm is robust to abrupt model switchings, and performs well in heavily cluttered environments. © 2006 IEEE.
Page Owner: ees1jk
Page Created: Thursday 5 August 2010 15:44:10 by lb0014
Last Modified: Monday 21 March 2016 14:00:35 by jg0036
Expiry Date: Thursday 18 August 2011 10:40:18
Assembly date: Thu Apr 27 11:00:09 BST 2017
Content ID: 32831