We apply domain adaptation to the problem of recognizing common actions between differing court-game sport videos (in particular tennis and badminton games). Actions are characterized in terms of HOG3D features extracted at the bounding box of each detected player, and thus have large intrinsic dimensionality. The techniques evaluated here for domain adaptation are based on estimating linear transformations to adapt the source domain features in order to maximize the similarity between posterior PDFs for each class in the source domain and the expected posterior PDF for each class in the target domain. As such, the problem scales linearly with feature dimensionality, making the video-environment domain adaptation problem tractable on reasonable time scales and resilient to over-fitting. We thus demonstrate that significant performance improvement can be achieved by applying domain adaptation in this context.
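As a rough illustration of why a linear adaptation of this kind scales linearly with feature dimensionality, the sketch below shifts and scales each source feature dimension so that its mean and standard deviation match those of the target domain. This is a simplified stand-in and not the method evaluated in the paper; the function name `align_source_to_target` and the toy data are assumptions made for illustration.

```python
from statistics import mean, pstdev

def align_source_to_target(source, target, eps=1e-12):
    """Per-dimension affine map so source statistics match target statistics.

    source, target: lists of equal-length feature vectors (lists of floats).
    Each dimension is handled independently, so the cost grows linearly
    with the number of dimensions.
    """
    dims = len(source[0])
    adapted = [list(row) for row in source]
    for j in range(dims):
        s_col = [row[j] for row in source]
        t_col = [row[j] for row in target]
        s_mu, t_mu = mean(s_col), mean(t_col)
        s_sd, t_sd = pstdev(s_col), pstdev(t_col)
        # Leave the scale untouched when the source dimension is constant.
        scale = t_sd / s_sd if s_sd > eps else 1.0
        for i, row in enumerate(source):
            adapted[i][j] = (row[j] - s_mu) * scale + t_mu
    return adapted

# Toy data (hypothetical): two-dimensional features from each domain.
source = [[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]]
target = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
adapted = align_source_to_target(source, target)
```

In a class-conditional variant, the same alignment could be applied separately per class using predicted target labels, which is closer in spirit to the abstract but is again only an assumption here.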
Zou X, Kittler J, Messer K (2006) Ambient illumination variation removal by active near-IR imaging, ADVANCES IN BIOMETRICS, PROCEEDINGS 3832 pp. 19-25 SPRINGER-VERLAG BERLIN
Messer K, Kittler J, Short J, Heusch G, Cardinaux F, Marcel S, Rodriguez Y, Shan S, Su Y, Gao W, Chen X (2006) Performance characterisation of face recognition algorithms and their sensitivity to severe illumination changes, ADVANCES IN BIOMETRICS, PROCEEDINGS 3832 pp. 1-11 SPRINGER-VERLAG BERLIN
Roh MC, Christmas B, Kittler J, Lee SW (2006) Robust player gesture spotting and recognition in low-resolution sports video, COMPUTER VISION - ECCV 2006, PT 4, PROCEEDINGS 3954 pp. 347-358 SPRINGER-VERLAG BERLIN
FarajiDavar N, deCampos TE, Kittler J (2014) Transductive Transfer Machine, Proceedings of the Asian Conference on Computer Vision (ACCV)
Poh N, Merati A, Kittler J (2009) Adaptive Client-Impostor Centric Score Normalization: A Case Study in Fingerprint Verification, 2009 IEEE 3RD INTERNATIONAL CONFERENCE ON BIOMETRICS: THEORY, APPLICATIONS AND SYSTEMS pp. 245-251 IEEE
Ortega-Garcia J, Fierrez J, Alonso-Fernandez F, Galbally J, Freire MR, Gonzalez-Rodriguez J, Garcia-Mateo C, Alba-Castro J-L, Gonzalez-Agulla E, Otero-Muras E, Garcia-Salicetti S, Allano L, Ly-Van B, Dorizzi B, Kittler J, Bourlai T, Poh N, Deravi F, Ng MWR, Fairhurst M, Hennebert J, Humm A, Tistarelli M, Brodo L, Richiardi J, Drygajlo A, Ganster H, Sukno FM, Pavani S-K, Frangi A, Akarun L, Savran A (2010) The Multiscenario Multienvironment BioSecure Multimodal Database (BMDB), IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 32 (6) pp. 1097-1111 IEEE COMPUTER SOC
Feng ZH, Huber P, Kittler J, Christmas W, Wu XJ (2015) Random cascaded-regression copse for robust facial landmark detection, IEEE Signal Processing Letters 22 (1) pp. 76-80
In this letter, we present a random cascaded-regression copse (R-CR-C) for robust facial landmark detection. Its key innovations include a new parallel cascade structure design, and an adaptive scheme for scale-invariant shape update and local feature extraction. Evaluation on two challenging benchmarks shows the superiority of the proposed algorithm to state-of-the-art methods. © 1994-2012 IEEE.
In practical applications of pattern recognition and computer vision, the performance of many approaches can be improved by using multiple models. In this paper, we develop a common theoretical framework for multiple model fusion at the feature level using multilinear subspace analysis (also known as tensor algebra). One disadvantage of the multilinear approach is that it is hard to obtain enough training observations for tensor decomposition algorithms. To overcome this difficulty, we adopted the M2SA algorithm to reconstruct the missing entries of the incomplete training tensor. Furthermore, we apply the proposed framework to the problem of face image analysis using Active Appearance Model (AAM) to validate its performance. Evaluations of AAM using the proposed framework are conducted on Multi-PIE face database with promising results. © Springer-Verlag 2013.
Ogata T, Christmas W, Kittler J, Ishikawa S (2006) Improving human activity detection by combining multi-dimensional motion descriptors with boosting, Proceedings - International Conference on Pattern Recognition 1 pp. 295-298
A new, combined human activity detection method is proposed. Our method is based on Efros et al.'s motion descriptors and Ke et al.'s event detectors. Since both methods use optical flow, it is easy to combine them. However, the computational cost of the training increases considerably because of the increased number of weak classifiers. We reduce this computational cost by extending Ke et al.'s weak classifiers to incorporate multi-dimensional features. The proposed method is applied to off-air tennis video data, and its performance is evaluated by comparison with the original two methods. Experimental results show that the performance of the proposed method is a good compromise in terms of detection rate and of computation time of testing and training. © 2006 IEEE.
FarajiDavar N, deCampos TE, Kittler J (2014) Adaptive Transductive Transfer Machine, Proceedings of the British Machine Vision Conference (BMVC)
Snell V, Christmas W, Kittler J (2012) Texture and shape in fluorescence pattern identification for auto-immune disease diagnosis, Proceedings - International Conference on Pattern Recognition pp. 3750-3753 IEEE
Automation of HEp-2 cell pattern classification would drastically improve the accuracy and throughput of diagnostic services for many auto-immune diseases, but it has proven difficult to reach a sufficient level of precision. Correct diagnosis relies on a subtle assessment of texture type in microscopic images of indirect immunofluorescence (IIF), which so far has eluded reliable replication through automated measurements. We introduce a combination of spectral analysis and multi-scale digital filtering to extract the most discriminative variables from the cell images. We also apply multistage classification techniques to make optimal use of the limited labelled data set. Overall error rate of 1.6% is achieved in recognition of 6 different cell patterns, which drops to 0.5% if only positive samples are considered. © 2012 ICPR Org Committee.
Zor C, Windeatt T, Kittler J (2013) ECOC pruning using accuracy, diversity and hamming distance information, 2013 21st Signal Processing and Communications Applications Conference, SIU 2013
Existing ensemble pruning algorithms in the literature have mainly been defined for unweighted or weighted voting ensembles, and their extensions to the Error Correcting Output Coding (ECOC) framework have not been successful. This paper presents a novel pruning algorithm for ECOC, which uses a new accuracy measure together with diversity and Hamming distance information. The results show that the novel method outperforms existing state-of-the-art methods. © 2013 IEEE.
We consider the problem of learning a linear combination of pre-specified kernel matrices in the Fisher discriminant analysis setting. Existing methods for such a task impose an l1 norm regularisation on the kernel weights, which produces sparse solutions but may lead to loss of information. In this paper, we propose to use l2 norm regularisation instead. The resulting learning problem is formulated as a semi-infinite program and can be solved efficiently. Through experiments on both synthetic data and a very challenging object recognition benchmark, the relative advantages of the proposed method and its l1 counterpart are demonstrated, and insights are gained as to how the choice of regularisation norm should be made. © 2009 IEEE.
Chan CH, Goswami B, Kittler J, Christmas W (2010) Local Ordinal Contrast Pattern Histograms for Spatiotemporal, Lip-Based Speaker Authentication, IEEE Transactions on Information Forensics and Security 7 pp. 602-612
Lip region deformation during speech contains biometric information and is termed visual speech. This biometric information can be interpreted as being genetic or behavioral depending on whether static or dynamic features are extracted. In this paper, we use a texture descriptor called local ordinal contrast pattern (LOCP) with a dynamic texture representation called three orthogonal planes to represent both the appearance and dynamics features observed in visual speech. This feature representation, when used in standard speaker verification engines, is shown to improve the performance of the lip-biometric trait compared to the state-of-the-art. The best baseline state-of-the-art performance was a half total error rate (HTER) of 13.35% for the XM2VTS database. We obtained HTER of less than 1%. The resilience of the LOCP texture descriptor to random image noise is also investigated. Finally, the effect of the amount of video information on speaker verification performance suggests that with the proposed approach, speaker identity can be verified with a much shorter biometric trait record than the length normally required for voice-based biometrics. In summary, the performance obtained is remarkable and suggests that there is enough discriminative information in the mouth-region to enable its use as a primary biometric trait.
Kim T-K, Kittler J, Cipolla R (2007) Discriminative learning and recognition of image set classes using canonical correlations, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 29 (6) pp. 1005-1018 IEEE COMPUTER SOC
It is well recognised that data association is critically important for object tracking. However, in the presence of successive misdetections, a large number of false candidates and an unknown number of abrupt model switchings that happen unpredictably, the data association problem can be very difficult. We tackle these difficulties by using a layered data association scheme. At the object level, trajectories are "grown" from sets of object candidates that have high probabilities of containing only true positives; by this means the otherwise combinatorial complexity is significantly reduced. Dijkstra's shortest path algorithm is then used to perform data association at the trajectory level. The algorithm is applied to low-quality tennis video sequences to track a tennis ball. Experiments show that the algorithm is robust to abrupt model switchings, and performs well in heavily cluttered environments. © 2006 IEEE.
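The trajectory-level step can be pictured as a shortest-path search over a graph whose nodes are candidate trajectories. The following is a minimal sketch of Dijkstra's algorithm on such a graph; the graph layout, node names and edge costs are hypothetical and are not taken from the paper.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative edge costs."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for nxt, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]

# Hypothetical trajectory graph: nodes are candidate trajectories, and edge
# costs penalise implausible links (e.g. large gaps in time or position).
tracklets = {
    "start": {"A": 1.0, "B": 4.0},
    "A": {"B": 1.5, "C": 5.0},
    "B": {"C": 1.0},
    "C": {"end": 0.5},
}
path, cost = shortest_path(tracklets, "start", "end")
```

Here the cheapest chain links all three candidate trajectories in sequence rather than skipping any, because the direct links carry higher costs.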
Faraji Davar N, deCampos TE, Kittler J, Yan F (2011) Transductive Transfer Learning for Action Recognition in Tennis Games, 3rd International Workshop on Video Event Categorization, Tagging and Retrieval for Real-World Applications (VECTaR), in conjunction with ICCV
We present a fully automatic approach to real-time 3D face reconstruction from monocular in-the-wild videos. With the use of a cascaded-regressor based face tracking and a 3D Morphable Face Model shape fitting, we obtain a semi-dense 3D face shape. We further use the texture information from multiple frames to build a holistic 3D face representation from the video footage. Our system is able to capture facial expressions and does not require any person-specific training. We demonstrate the robustness of our approach on the challenging 300 Videos in the Wild (300-VW) dataset. Our real-time fitting framework is available as an open source library at http://4dface.org
Poh N, Kittler J, Bourlai T (2007) Improving biometric device interoperability by likelihood ratio-based quality dependent score normalization, 2007 FIRST IEEE INTERNATIONAL CONFERENCE ON BIOMETRICS: THEORY, APPLICATIONS AND SYSTEMS pp. 325-329 IEEE
Goswami B, Christmas W, Kittler J (2009) Robust Statistical Estimation Applied to Automatic Lip Segmentation, PATTERN RECOGNITION AND IMAGE ANALYSIS, PROCEEDINGS 5524 pp. 200-207 SPRINGER-VERLAG BERLIN
We propose four variants of a novel hierarchical hidden Markov models strategy for rule induction in the context of automated sports video annotation including a multilevel Chinese takeaway process (MLCTP) based on the Chinese restaurant process and a novel Cartesian product label-based hierarchical bottom-up clustering (CLHBC) method that employs prior information contained within label structures. Our results show significant improvement by comparison against the flat Markov model: optimal performance is obtained using a hybrid method, which combines the MLCTP generated hierarchical topological structures with CLHBC generated event labels. We also show that the methods proposed are generalizable to other rule-based environments including human driving behavior and human actions.
Poh N, Wong R, Kittler J, Roli F (2009) Challenges and Research Directions for Adaptive Biometric Recognition Systems, ADVANCES IN BIOMETRICS 5558 pp. 753-764 SPRINGER-VERLAG BERLIN
Hamouz M, Tena JR, Kittler J, Hilton A, Illingworth J (2006) Algorithms for 3D-assisted face recognition, 2006 IEEE 14TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS, VOLS 1 AND 2 pp. 826-829 IEEE
Liu Q, Wang W, Jackson PJB, Barnard M, Kittler J, Chambers J (2013) Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking, IEEE Transactions on Signal Processing 61 (22) 99 pp. 5520-5535
In existing audio-visual blind source separation (AV-BSS) algorithms, the AV coherence is usually established through statistical modelling, using e.g. Gaussian mixture models (GMMs). These methods often operate in a low-dimensional feature space, rendering an effective global representation of the data. The local information, which is important in capturing the temporal structure of the data, however, has not been explicitly exploited. In this paper, we propose a new method for capturing such local information, based on audio-visual dictionary learning (AVDL). We address several challenges associated with AVDL, including cross-modality differences in size, dimension and sampling rate, as well as the issues of scalability and computational complexity. Following a commonly employed bootstrap coding-learning process, we have developed a new AVDL algorithm which features a bimodality balanced and scalable matching criterion, a size and dimension adaptive dictionary, a fast search index for efficient coding, and cross-modality diverse sparsity. We also show how the proposed AVDL can be incorporated into a BSS algorithm. As an example, we consider binaural mixtures, mimicking aspects of human binaural hearing, and derive a new noise-robust AV-BSS algorithm by combining the proposed AVDL algorithm with Mandel's BSS method, which is a state-of-the-art audio-domain method using time-frequency masking. We have systematically evaluated the proposed AVDL and AV-BSS algorithms, and show their advantages over the corresponding baseline methods, using both synthetic data and visual speech data from the multimodal LILiR Twotalk corpus.
Poh N, Chan CH, Kittler J, Marcel S, Mc Cool C, Argones Rua E, Alba Castro JL, Villegas M, Paredes R, Struc V, Pavesic N, Salah AA, Fang H, Costen N (2010) An Evaluation of Video-to-Video Face Verification, IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 5 (4) pp. 781-801 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Roh M-C, Christmas B, Kittler J, Lee S-W (2008) Gesture spotting for low-resolution sports video annotation, PATTERN RECOGNITION 41 (3) pp. 1124-1137
Kilic V, Barnard M, Wang W, Hilton A, Kittler J (2015) AUDIO INFORMED VISUAL SPEAKER TRACKING WITH SMC-PHD FILTER, 2015 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME) IEEE
Arturo Olvera-Lopez J, Ariel Carrasco-Ochoa J, Francisco Martinez-Trinidad J, Kittler J (2010) A review of instance selection methods, ARTIFICIAL INTELLIGENCE REVIEW 34 (2) pp. 133-143 SPRINGER
Barnard M, Wang W, Kittler J, Naqvi SM, Chambers J (2013) Audio-visual face detection for tracking in a meeting room environment, Proceedings of the 16th International Conference on Information Fusion, FUSION 2013 pp. 1222-1227
A key task in many applications such as tracking or face recognition is the detection and localisation of a subject's face in an image. This can still prove to be a challenging task particularly in low resolution or noisy images. Here we propose a robust method for face detection using both audio and visual information. We construct a dictionary learning based face detector using a set of distinctive and robust image features. We then train a support vector machine classifier using sparse image representations produced by this dictionary to classify face versus background. This is combined with the azimuth angle of the speaker produced by an audio localisation system to constrain the search space for the subject's face. This increases the efficiency of the detection and localisation process by limiting the search area. However, more importantly, the audio information allows us to know a priori the number of subjects in the image. This greatly reduces the possibility of false positive face detections. We demonstrate the advantage of this proposed approach over traditional face detection methods on the challenging AV16.3 dataset. © 2013 ISIF ( Intl Society of Information Fusi.
Yan F, Christmas W, Kittler J (2008) Layered data association using graph-theoretic formulation with applications to tennis ball tracking in monocular sequences, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 30 (10) pp. 1814-1830 IEEE COMPUTER SOC
Large pose and illumination variations are very challenging for face recognition. The 3D Morphable Model (3DMM) approach is one of the effective methods for pose and illumination invariant face recognition. However, it is very difficult for the 3DMM to recover the illumination of the 2D input image because the ratio of the albedo and illumination contributions in a pixel intensity is ambiguous. Unlike the traditional idea of separating the albedo and illumination contributions using a 3DMM, we propose a novel Albedo Based 3D Morphable Model (AB3DMM), which removes the illumination component from the images using illumination normalisation in a preprocessing step. A comparative study of different illumination normalisation methods for this step is conducted on PIE and Multi-PIE databases. The results show that, overall, our method outperforms state-of-the-art methods. © 2014 IEEE.
Poh N, Bourlai T, Kittler J, Allano L, Alonso-Fernandez F, Ambekar O, Baker J, Dorizzi B, Fatukasi O, Fierrez J, Ganster H, Ortega-Garcia J, Maurer D, Salah AA, Scheidat T, Vielhauer C (2009) Benchmarking Quality-Dependent and Cost-Sensitive Score-Level Multimodal Biometric Fusion Algorithms, IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 4 (4) pp. 849-866 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Feng ZH, Kittler J, Christmas W, Wu XJ, Pfeiffer S (2012) Automatic face annotation by multilinear AAM with Missing Values, Proceedings - International Conference on Pattern Recognition pp. 2586-2589
It has been shown that multilinear subspace analysis is a powerful tool to overcome difficulties posed by viewpoint, illumination and expression variations in the Active Appearance Model (AAM). However, the Higher Order Singular Value Decomposition (HOSVD) in multilinear analysis requires training samples to build the training tensor, which must include face images under all different variations. It is hard to obtain such a complete training tensor in practical applications. In this paper, we propose a multilinear AAM which can be generated from an incomplete training tensor using Multilinear Subspace Analysis with Missing Values (M2SA). Also, the 2D appearance is used for training the appearance tensor directly to reduce the memory requirements. Experimental results on the Multi-PIE face database show the efficiency of the proposed method. © 2012 ICPR Org Committee.
Messer K, Christmas WJ, Jaser E, Kittler J, Levienaise-Obadia B, Koubaroulis D (2005) A unified approach to the generation of semantic cues for sports video annotation, SIGNAL PROCESSING 85 (2) pp. 357-383 ELSEVIER SCIENCE BV
3D face reconstruction of shape and skin texture from a single 2D image can be performed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach. However, performing this reconstruction (fitting) efficiently and accurately in a general imaging scenario is a challenge. Such a scenario would involve a perspective camera to describe the geometric projection from 3D to 2D, and the Phong model to characterise illumination. Under these imaging assumptions the reconstruction problem is nonlinear and, consequently, computationally very demanding. In this work, we present an efficient stepwise 3DMM-to-2D image-fitting procedure, which sequentially optimises the pose, shape, light direction, light strength and skin texture parameters in separate steps. By linearising each step of the fitting process we derive closed-form solutions for the recovery of the respective parameters, leading to efficient fitting. The proposed optimisation process involves all the pixels of the input image, rather than randomly selected subsets, which enhances the accuracy of the fitting. It is referred to as Efficient Stepwise Optimisation (ESO). The proposed fitting strategy is evaluated using reconstruction error as a performance measure. In addition, we demonstrate its merits in the context of a 3D-assisted 2D face recognition system which detects landmarks automatically and extracts both holistic and local features using a 3DMM. This contrasts with most other methods which only report results that use manual face landmarking to initialise the fitting. Our method is tested on the public CMU-PIE and Multi-PIE face databases, as well as one internal database. The experimental results show that the face reconstruction using ESO is significantly faster, and its accuracy is at least as good as that achieved by the existing 3DMM fitting algorithms. 
A face recognition system integrating ESO to provide a pose and illumination invariant solution compares favourably with other state-of-the-art methods. In particular, it outperforms deep learning methods when tested on the Multi-PIE database.
Veta M, Diest PJV, Willems SM, Wang H, Madabhushi A, Cruz-Roa A, Gonzalez F, Larsen ABL, Vestergaard JS, Dahl AB, Cireşan DC, Schmidhuber J, Giusti A, Gambardella LM, Tek FB, Walter T, Wang C-W, Kondo S, Matuszewski BJ, Precioso F, Snell V, Kittler J, Campos TED, Khan AM, Rajpoot NM, Arkoumani E, Lacle MM, Viergever MA, Pluim JPW (2014) Assessment of algorithms for mitosis detection in breast cancer
The proliferative activity of breast tumors, which is routinely estimated by counting of mitotic figures in hematoxylin and eosin stained histology sections, is considered to be one of the most important prognostic markers. However, mitosis counting is laborious, subjective and may suffer from low inter-observer agreement. With the wider acceptance of whole slide images in pathology labs, automatic image analysis has been proposed as a potential solution for these issues. In this paper, the results from the Assessment of Mitosis Detection Algorithms 2013 (AMIDA13) challenge are described. The challenge was based on a data set consisting of 12 training and 11 testing subjects, with more than one thousand annotated mitotic figures by multiple observers. Short descriptions and results from the evaluation of eleven methods are presented. The top performing method has an error rate that is comparable to the inter-observer agreement among pathologists.
In this paper, we propose a novel fitting method that uses local image features to fit a 3D Morphable Face Model to 2D images. To overcome the obstacle of optimising a cost function that contains a non-differentiable feature extraction operator, we use a learning-based cascaded regression method that learns the gradient direction from data. The method allows shape and pose parameters to be solved for simultaneously. Our method is thoroughly evaluated on Morphable Model generated data and first results on real data are presented. Compared to traditional fitting methods, which use simple raw features like pixel colour or edge maps, local features have been shown to be much more robust against variations in imaging conditions. Our approach is unique in that we are the first to use local features to fit a 3D Morphable Model. Because of the speed of our method, it is applicable for real-time applications. Our cascaded regression framework is available as an open source library at github.com/patrikhuber/superviseddescent.
Yan F, Kittler J, Windridge D, Christmas W, Mikolajczyk K, Cox S, Huang Q (2014) Automatic annotation of tennis games: An integration of audio, vision, and learning, Image and Vision Computing 32 (11) pp. 896-903
Fully automatic annotation of tennis games using broadcast video is a task with great potential but enormous challenges. In this paper we describe our approach to this task, which integrates computer vision, machine listening, and machine learning. At the low-level processing stage, we improve upon our previously proposed state-of-the-art tennis ball tracking algorithm and employ audio signal processing techniques to detect key events and construct features for classifying the events. At the high-level analysis stage, we model event classification as a sequence labelling problem, and investigate four machine learning techniques using simulated event sequences. Finally, we evaluate our proposed approach on three real world tennis games, and discuss the interplay between audio, vision and learning. To the best of our knowledge, our system is the only one that can annotate tennis games at such a detailed level. © 2014 Elsevier B.V.
Campos TED, Khan A, Yan F, Faraji Davar N, Windridge D, Kittler J, Christmas W (2013) A framework for automatic sports video annotation with anomaly detection and transfer learning, Proceedings of Machine Learning and Cognitive Science
Tena JR, Smith RS, Hamouz A, Kittler J, Hilton A, Illingworth J (2007) 2D face pose normalisation using a 3D morphable model, 2007 IEEE CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE pp. 51-56 IEEE
Koppen WP, Chan CH, Christmas WJ, Kittler J (2012) An intrinsic coordinate system for 3D face registration, Proceedings - International Conference on Pattern Recognition pp. 2740-2743
We present a method to estimate, based on the horizontal symmetry, an intrinsic coordinate system of faces scanned in 3D. We show that this coordinate system provides an excellent basis for subsequent landmark positioning and model-based refinement such as Active Shape Models, outperforming other (explicit) landmark localisation methods including the commonly-used ICP+ASM approach. © 2012 ICPR Org Committee.
Zou X, Kittler J, Hamouz M, Tena JR (2008) Robust albedo estimation from face image under unknown illumination, BIOMETRIC TECHNOLOGY FOR HUMAN IDENTIFICATION V 6944 SPIE-INT SOC OPTICAL ENGINEERING
Poh N, Heusch G, Kittler J (2007) On combination of face authentication experts by a mixture of quality dependent fusion classifiers, Multiple Classifier Systems, Proceedings 4472 pp. 344-356 SPRINGER-VERLAG BERLIN
Kostin A, Kittler J, Christmas WJ (2005) Object recognition by symmetrised graph matching using relaxation labelling with an inhibitory mechanism, Pattern Recognition Letters 26 (3) pp. 381-393
Object recognition using graph-matching techniques can be viewed as a two-stage process: extracting suitable object primitives from an image and corresponding models, and matching graphs constructed from these two sets of object primitives. In this paper we concentrate mainly on the latter issue of graph matching, for which we derive a technique based on probabilistic relaxation graph labelling. The new method was evaluated on two standard data sets, SOIL47 and COIL100, in both of which objects must be recognised from a variety of different views. The results indicated that our method is comparable with the best of other current object recognition techniques. The potential of the method was also demonstrated on challenging examples of object recognition in cluttered scenes.
Kilic V, Barnard M, Wang W, Kittler J (2013) Audio constrained particle filter based visual tracking, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 3627-3631
We present a robust and efficient audio-visual (AV) approach to speaker tracking in a room environment. A challenging problem with visual tracking is to deal with occlusions (caused by the limited field of view of cameras or by other speakers). Another challenge is associated with the particle filtering (PF) algorithm, commonly used for visual tracking, which requires a large number of particles to ensure the distribution is well modelled. In this paper, we propose a new method of fusing audio into the PF based visual tracking. We use the direction of arrival angles (DOAs) of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. Experiments on AV16.3 datasets show the advantage of our proposed method over the baseline PF method for tracking occluded speakers with a significantly reduced number of particles. © 2013 IEEE.
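One way to picture the audio weighting in the measurement step is to scale each particle's visual likelihood by how well its bearing agrees with the audio DOA. The sketch below assumes that particles and the microphone array share a 2D coordinate frame with the array at the origin, and uses a Gaussian angular penalty; the function name `doa_weighted`, the `sigma` value and the toy particles are all hypothetical choices rather than the paper's formulation.

```python
import math

def doa_weighted(particles, visual_likelihoods, doa, sigma=0.2):
    """Reweight particles by agreement between their bearing and the audio DOA.

    particles: (x, y) positions relative to the microphone array (assumed at origin).
    doa: direction of arrival in radians; sigma: assumed angular spread in radians.
    """
    weights = []
    for (x, y), lik in zip(particles, visual_likelihoods):
        bearing = math.atan2(y, x)
        # Wrap the angular difference into [-pi, pi].
        diff = math.atan2(math.sin(bearing - doa), math.cos(bearing - doa))
        weights.append(lik * math.exp(-0.5 * (diff / sigma) ** 2))
    total = sum(weights)
    if total == 0.0:
        # Fall back to uniform weights if every particle is ruled out.
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in weights]

# Three particles with equal visual likelihood; the audio source lies at 45 degrees.
particles = [(1.0, 1.0), (1.0, 0.0), (-1.0, 1.0)]
weights = doa_weighted(particles, [1.0, 1.0, 1.0], doa=math.pi / 4)
```

With equal visual likelihoods, the particle lying along the DOA receives almost all of the normalised weight, which is how audio evidence can keep the filter focused with fewer particles.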
Chan C-H, Kittler J, Tahir MA (2010) Kernel Fusion of Multiple Histogram Descriptors for Robust Face Recognition, STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION 6218 pp. 718-727 SPRINGER-VERLAG BERLIN
Poh N, Kittler J (2007) Predicting biometric authentication system performance across different application conditions: A bootstrap enhanced parametric approach, Advances in Biometrics, Proceedings 4642 pp. 625-635 SPRINGER-VERLAG BERLIN
Ilonen J, Kamarainen J-K, Paalanen P, Hamouz M, Kittler J, Kalviainen H (2008) Image feature localization by multiple hypothesis testing of Gabor features, IEEE TRANSACTIONS ON IMAGE PROCESSING 17 (3) pp. 311-325 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Coppi D, De Campos T, Yan F, Kittler J, Cucchiara R (2014) On detection of novel categories and subcategories of images using incongruence, ICMR 2014 - Proceedings of the ACM International Conference on Multimedia Retrieval 2014 pp. 337-344
Novelty detection is a crucial task in the development of autonomous vision systems. It aims at detecting if samples do not conform with the learnt models. In this paper, we consider the problem of detecting novelty in object recognition problems in which the set of object classes are grouped to form a semantic hierarchy. We follow the idea that, within a semantic hierarchy, novel samples can be defined as samples whose categorization at a specific level contrasts with the categorization at a more general level. This measure indicates if a sample is novel and, in that case, if it is likely to belong to a novel broad category or to a novel sub-category. We present an evaluation of this approach on two hierarchical subsets of the Caltech256 objects dataset and on the SUN scenes dataset, with different classification schemes. We obtain an improvement over Weinshall et al. and show that it is possible to bypass their normalisation heuristic. We demonstrate that this approach achieves good novelty detection rates as far as the conceptual taxonomy is congruent with the visual hierarchy, but tends to fail if this assumption is not satisfied. Copyright 2014 ACM.
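The incongruence idea can be sketched as a comparison between the decisions made at a general level and at a specific level of the hierarchy. In the sketch below, the shared acceptance threshold, the function name `incongruence_label` and the exact mapping from outcomes to labels are assumptions made for illustration; the paper's actual measure may differ.

```python
def incongruence_label(general_score, specific_scores, accept=0.5):
    """Label a sample as known or novel from two levels of a semantic hierarchy.

    general_score: confidence of the general (parent) classifier, in [0, 1].
    specific_scores: non-empty list of confidences of the specific (child)
        classifiers, in [0, 1].
    accept: assumed acceptance threshold shared by both levels.
    """
    general_ok = general_score >= accept
    specific_ok = max(specific_scores) >= accept
    if general_ok and specific_ok:
        return "known"
    if general_ok and not specific_ok:
        # The general level accepts but no specific class does: the levels disagree.
        return "novel sub-category"
    if not general_ok and not specific_ok:
        # Neither level accepts the sample (assumed to indicate a new broad category).
        return "novel broad category"
    return "inconsistent"  # a specific class accepts but the general level rejects

labels = [
    incongruence_label(0.9, [0.8, 0.1]),
    incongruence_label(0.9, [0.2, 0.1]),
    incongruence_label(0.1, [0.2, 0.1]),
]
```

The final "inconsistent" case corresponds to a visual hierarchy that disagrees with the conceptual taxonomy, the situation in which the abstract reports that the approach tends to fail.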
Barnard M, Koniusz P, Wang W, Kittler J, Naqvi SM, Chambers J (2014) Robust multi-speaker tracking via dictionary learning and identity modeling, IEEE Transactions on Multimedia 16 (3) pp. 864-880
We investigate the problem of visual tracking of multiple human speakers in an office environment. In particular, we propose novel solutions to the following challenges: (1) robust and computationally efficient modeling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialization of the trackers, or re-initialization when the trackers have lost lock caused by e.g. the limited camera views. First, we develop new algorithms for appearance modeling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries are used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function is then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) is proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model is proposed to track multiple speakers whilst dealing with occlusions. This model is updated online using Maximum a Posteriori (MAP) adaptation, where we control the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialization of the visual trackers, we exploit audio information, the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provides, a priori, the number of speakers and constrains the search space for the speakers' faces. The proposed system is tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects. © 2014 IEEE.
Particle filtering has emerged as a useful tool for tracking problems. However, the efficiency and accuracy of the filter usually depend on the number of particles and noise variance used in the estimation and propagation functions for re-allocating these particles at each iteration. Both of these parameters are specified beforehand and are kept fixed in the regular implementation of the filter, which makes the tracker unstable in practice. In this paper we are interested in the design of a particle filtering algorithm which is able to adapt the number of particles and noise variance. The new filter, which is based on audio-visual (AV) tracking, uses information from the tracking errors to modify the number of particles and noise variance used. Its performance is compared with a previously proposed audio-visual particle filtering algorithm with a fixed number of particles and an existing adaptive particle filtering algorithm, using the AV16.3 dataset with single and multi-speaker sequences. Our proposed approach demonstrates good tracking performance with a significantly reduced number of particles. © 2013 EURASIP.
Almajai I, Yan F, de Campos T, Khan A, Christmas W, Windridge D, Kittler J (2012) Anomaly Detection and Knowledge Transfer in Automatic Sports Video Annotation, Proceedings of DIRAC Workshop on Detection and Identification of Rare Audiovisual Cues 384 pp. 109-117 Springer
A key question in machine perception is how to adaptively build upon existing capabilities so as to permit novel functionalities. Implicit in this are the notions of anomaly detection and learning transfer. A perceptual system must firstly determine at what point the existing learned model ceases to apply, and secondly, what aspects of the existing model can be brought to bear on the newly-defined learning domain. Anomalies must thus be distinguished from mere outliers, i.e. cases in which the learned model has failed to produce a clear response; it is also necessary to distinguish novel (but meaningful) input from misclassification error within the existing models. We thus apply a methodology of anomaly detection based on comparing the outputs of strong and weak classifiers to the problem of detecting the rule-incongruence involved in the transition from singles to doubles tennis videos. We then demonstrate how the detected anomalies can be used to transfer learning from one (initially known) rule-governed structure to another. Our ultimate aim, building on existing annotation technology, is to construct an adaptive system for court-based sport video annotation.
Goswami B, Chan CH, Kittler J, Christmas B (2011) Speaker Authentication using Video-based Lip information, ICASSP
Poh N, Kittler J (2008) On Using Error Bounds to Optimize Cost-Sensitive Multimodal Biometric Authentication, IEEE Proceedings of 19th International Conference on Pattern Recognition pp. 684-687 IEEE
While using more biometric traits in multimodal biometric fusion can effectively increase the system robustness, the cost associated with adding additional systems is often not considered. In this paper, we propose an algorithm that can efficiently bound the biometric system error. This helps not only to speed up the search for the optimal system configuration by an order of magnitude but also, unexpectedly, to enhance the robustness to population mismatch. This suggests that bounding the error of a biometric system from above can possibly be better than directly estimating it from the data. The latter strategy can be susceptible to spurious biometric samples and the particular choice of users. The efficiency of the proposal is achieved thanks to the use of the Chernoff bound in estimating the authentication error. Unfortunately, such a bound assumes that the match scores are normally distributed, which is not necessarily the correct distribution model. We propose to transform simultaneously the class conditional match scores (genuine user or impostor scores) into ones that conform more closely to normal distributions using a modified criterion of the Box-Cox transform.
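The normalising step named in this abstract can be illustrated with the standard one-parameter Box-Cox transform. This is only a minimal sketch: the abstract refers to a modified criterion that it does not spell out, so the selection rule used here (pick the candidate exponent whose transformed scores have the smallest absolute skewness) is an assumption made for illustration, and the names `box_cox`, `skewness` and `pick_lambda` are hypothetical.

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def skewness(values):
    """Population skewness; returns 0.0 when the variance is zero."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:
        return 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pick_lambda(scores, candidates):
    """Assumed criterion (not the paper's): minimise absolute skewness."""
    return min(
        candidates,
        key=lambda lam: abs(skewness([box_cox(s, lam) for s in scores])),
    )
```

For log-symmetric scores such as `[math.exp(z) for z in (-2, -1, 0, 1, 2)]`, this rule selects the logarithmic case `lam = 0`, since the log of those scores is symmetric about zero.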
Poh N, Kittler J (2008) Incorporating model-specific score distribution in speaker verification systems, IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 16 (3) pp. 594-606 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
The use of non-negative matrix factorisation (NMF) on 2D face images has been shown to result in sparse feature vectors that encode for local patches on the face, and thus provides a statistically justified approach to learning parts from wholes. However successful on 2D images, the method has so far not been extended to 3D images. The main reason for this is that 3D space is a continuum and so it is not apparent how to represent 3D coordinates in a non-negative fashion. This work compares different non-negative representations for spatial coordinates, and demonstrates that not all non-negative representations are suitable. We analyse the representational properties that make NMF a successful method to learn sparse 3D facial features. Using our proposed representation, the factorisation results in sparse and interpretable facial features.
Shekar BH, Bharathi RK, Kittler J, Vizilter YV, Mestestskiy L (2015) Grid Structured Morphological Pattern Spectrum for Off-line Signature Verification, 2015 INTERNATIONAL CONFERENCE ON BIOMETRICS (ICB) pp. 430-435 IEEE
Zou X, Kittler J, Messer K (2007) Motion compensation for face recognition based on active differential imaging, Advances in Biometrics, Proceedings 4642 pp. 39-48 SPRINGER-VERLAG BERLIN
Poh N, Kittler J (2008) A METHODOLOGY FOR SEPARATING SHEEP FROM GOATS FOR CONTROLLED ENROLLMENT AND MULTIMODAL FUSION, 2008 BIOMETRICS SYMPOSIUM (BSYM) pp. 17-22 IEEE
Multiple Kernel Learning (MKL) has become a preferred choice for information fusion in image recognition problems. The aim of MKL is to learn an optimal combination of kernels formed from different features and thus to learn the importance of different feature spaces for classification. The Augmented Kernel Matrix (AKM) has recently been proposed to accommodate the fact that a single training example may have different importance in different feature spaces, in contrast to MKL, which assigns the same weight to all examples in one feature space. However, the AKM approach is limited to small datasets due to its memory requirements. We propose a novel two-stage technique to make AKM applicable to large data problems. In the first stage, various kernels are combined into different groups automatically using kernel alignment. Next, the most influential training examples are identified within each group and used to construct an AKM of significantly reduced size. This reduced-size AKM leads to the same results as the original AKM. We demonstrate that the proposed two-stage approach is memory efficient, leads to better performance than the original AKM, and is robust to noise. Results are compared with other state-of-the-art MKL techniques and show improvement on challenging object recognition benchmarks. © 2011 Springer-Verlag.
Poh N, Kittler J (2011) A Unified Framework for Multimodal Biometric Fusion Incorporating Quality Measures., IEEE Trans Pattern Anal Mach Intell
This paper proposes a unified framework for quality-based fusion of multimodal biometrics. Quality-dependent fusion algorithms aim to dynamically combine several classifier (biometric expert) outputs as a function of automatically derived (biometric) sample quality. Quality measures used for this purpose quantify the degree of conformance of biometric samples to some predefined criteria known to influence the system performance. Designing a fusion classifier to take quality into consideration is difficult because quality measures cannot be used to distinguish genuine users from impostors, i.e., they are non-discriminative; yet, still useful for classification. We propose a general Bayesian framework that can utilize the quality information effectively. We show that this framework encompasses several recently proposed quality-based fusion algorithms in the literature -- Nandakumar et al., 2006; Poh et al., 2007; Kryszczuk and Drygajlo, 2007; Kittler et al., 2007; Alonso-Fernandez, 2008; Maurer and Baker, 2007; Poh et al., 2010. Furthermore, thanks to the systematic study concluded herein, we also develop two alternative formulations of the problem, leading to more efficient implementation (with fewer parameters) and achieving performance comparable to, or better than the state of the art. Last but not least, the framework also improves the understanding of the role of quality in multiple classifier combination.
Prabhakar S, Kittler J, Maltoni D, O'Gorman L, Tan T (2007) Introduction to the special issue on biometrics: Progress and directions, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 29 (4) pp. 513-516 IEEE COMPUTER SOC
Windridge D, Kittler J (2010) Perception-Action Learning as an Epistemologically-Consistent Model for Self-Updating Cognitive Representation, BRAIN INSPIRED COGNITIVE SYSTEMS 2008 657 pp. 95-134 SPRINGER-VERLAG BERLIN
As well as having the ability to formulate models of the world capable of experimental falsification, it is evident that human cognitive capability embraces some degree of representational plasticity, having the scope (at least in infancy) to modify the primitives in terms of which the world is delineated. We hence employ the term 'cognitive bootstrapping' to refer to the autonomous updating of an embodied agent's perceptual framework in response to the perceived requirements of the environment in such a way as to retain the ability to refine the environment model in a consistent fashion across perceptual changes. We will thus argue that the concept of cognitive bootstrapping is epistemically ill-founded unless there exists an a priori percept/motor interrelation capable of maintaining an empirical distinction between the various possibilities of perceptual categorization and the inherent uncertainties of environment modeling. As an instantiation of this idea, we shall specify a very general, logically-inductive model of perception-action learning capable of compact re-parameterization of the percept space. In consequence of the a priori percept/action coupling, the novel perceptual state transitions so generated always exist in bijective correlation with a set of novel action states, giving rise to the required empirical validation criterion for perceptual inferences. Environmental description is correspondingly accomplished in terms of progressively higher-level affordance conjectures which are likewise validated by exploratory action. Application of this mechanism within simulated perception-action environments indicates that, as well as significantly reducing the size and specificity of the a priori perceptual parameter-space, the method can significantly reduce the number of iterations required for accurate convergence of the world-model. It does so by virtue of the active learning characteristics implicit in the notion of cognitive bootstrapping.
Zou X, Kittler J, Messer K (2007) Illumination invariant face recognition: A survey, pp. 113-120 IEEE
The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles as they need to be propagated over the entire image in every frame to detect the new speaker presence in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (V-SMC-PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named as AVMS-SMC-PHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD based on the AV16.3, AMI and CLEAR datasets.
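The gradient-climbing step mentioned in this abstract can be illustrated with textbook mean-shift on one-dimensional samples. This is a generic sketch only: it is not the modified variant or the sparse sampling scheme of the paper, and the Gaussian kernel, the bandwidth and the function name `mean_shift_mode` are assumptions chosen for illustration.

```python
import math


def mean_shift_mode(samples, start, bandwidth=1.0, tol=1e-9, max_iter=100):
    """Climb towards a local density peak of 1D samples (Gaussian kernel)."""
    x = start
    for _ in range(max_iter):
        # Kernel weight of every sample relative to the current estimate.
        weights = [
            math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for s in samples
        ]
        total = sum(weights)
        if total == 0.0:
            break  # no sample within numerical reach of the kernel
        # The mean-shift update moves x to the kernel-weighted sample mean.
        new_x = sum(w * s for w, s in zip(weights, samples)) / total
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x
```

Started near one of two well-separated clusters, the iteration settles on the peak of the nearer cluster, which is the behaviour a tracker would rely on when refining a particle-based position estimate.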
Chan CH, Kittler J (2012) BLUR KERNEL ESTIMATION TO IMPROVE RECOGNITION OF BLURRED FACES, 2012 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP 2012) pp. 1989-1992 IEEE
Kim TK, Kittler J, Cipolla R (2006) Learning discriminative canonical correlations for object recognition with image sets, COMPUTER VISION - ECCV 2006, PT 3, PROCEEDINGS 3953 pp. 251-262 SPRINGER-VERLAG BERLIN
Shevchenko M, Windridge D, Kittler J (2009) A linear-complexity reparameterisation strategy for the hierarchical bootstrapping of capabilities within perception-action architectures, IMAGE AND VISION COMPUTING 27 (11) pp. 1702-1714 ELSEVIER SCIENCE BV
Shaukat A, Windridge D, Hollnagel E, Macchi L, Kittler J (2010) Induction of the Human Perception-Action Hierarchy Employed in Junction-Navigation Scenarios,
We address the problem of anomaly detection in machine perception. The concept of domain anomaly is introduced as distinct from the conventional notion of anomaly used in the literature. We propose a unified framework for anomaly detection which exposes the multifaceted nature of anomalies and suggests effective mechanisms for identifying and distinguishing each facet as instruments for domain anomaly detection. The framework draws on the Bayesian probabilistic reasoning apparatus which clearly defines concepts such as outlier, noise, distribution drift, novelty detection (object, object primitive), rare events, and unexpected events. Based on these concepts we provide a taxonomy of domain anomaly events. One of the mechanisms helping to pinpoint the nature of anomaly is based on detecting incongruence between contextual and noncontextual sensor(y) data interpretation. The proposed methodology has wide applicability. It underpins in a unified way the anomaly detection applications found in the literature. To illustrate some of its distinguishing features, the domain anomaly detection methodology is applied here to the problem of anomaly detection for a video annotation system.
Almajai I, Yan F, de Campos TE, Khan A, Christmas W, Windridge D, Kittler J (2010) Anomaly Detection and Knowledge Transfer in Automatic Sports Video Annotation, Proceedings of DIRAC Workshop, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2010)
McIntyre AH, Hancock PJB, Kittler J, Langton SRH (2013) Improving Discrimination and Face Matching with Caricature, APPLIED COGNITIVE PSYCHOLOGY 27 (6) pp. 725-734 WILEY-BLACKWELL
Yan F, Christmas W, Kittler J (2014) Ball tracking for tennis video annotation, 71 pp. 25-45
Tennis game annotation using broadcast video is a task with a wide range of applications. In particular, ball trajectories carry rich semantic information for the annotation. However, tracking a ball in broadcast tennis video is extremely challenging. In this chapter, we explicitly address the challenges, and propose a layered data association algorithm for tracking multiple tennis balls fully automatically. The effectiveness of the proposed algorithm is demonstrated on two data sets with more than 100 sequences from real-world tennis videos, where other data association methods perform poorly or fail completely. © Springer International Publishing Switzerland 2014.
Performing facial recognition between Near Infrared (NIR) and visible-light (VIS) images has been established as a common method of countering illumination variation problems in face recognition. In this paper we present a new database to enable the evaluation of cross-spectral face recognition. A series of preprocessing algorithms, followed by Local Binary Pattern Histogram (LBPH) representation and combinations with Linear Discriminant Analysis (LDA) are used for recognition. These experiments are conducted on both NIRVIS and the less common VISNIR protocols, with permutations of uni-modal training sets. 12 individual baseline algorithms are presented. In addition, the best performing fusion approaches involving a subset of 12 algorithms are also described. © 2011 IEEE.
Christmas WJ, Kostin A, Yan F, Kolonias I, Kittler J (2005) A system for the automatic annotation of tennis matches,
In this paper we describe a system for the automatic annotation of tennis matches. The goal is to provide annotation at all levels, from shot detection to a complete breakdown of the scoring within the match. At present the system will automatically analyse a tennis video to the extent that it can identify the outcome of individual video shots, with reasonable accuracy. We briefly describe the overall system architecture, and describe in more detail the key components: the ball tracking and the high-level reasoning.
Barnard M, Wang W, Kittler J, Naqvi SM, Chambers JA A Dictionary Learning Approach to Tracking, International Conference on Acoustics, Speech and Signal Processing
Tahir M, Yan F, Koniusz P, Awais M, Barnard M, Mikolajczyk K, Kittler J (2012) A Robust and Scalable Visual Category and Action Recognition System using Kernel Discriminant Analysis with Spectral Regression, IEEE Transactions on Multimedia
Poh N, Kittler J, Alkoot F (2012) A discriminative parametric approach to video-based score-level fusion for biometric authentication, Proceedings - International Conference on Pattern Recognition pp. 2335-2338
Video-based biometric systems are becoming feasible thanks to advancement in both algorithms and computation platforms. Such systems have many advantages: improved robustness to spoof attack, performance gain thanks to variance reduction, and increased data quality/resolution, among others. We investigate a discriminative video-based score-level fusion mechanism, which enables an existing biometric system to further harness the riches of temporally sampled biometric data using a set of distribution descriptors. Our approach shows that higher order moments of the video scores contain discriminative information. To the best of our knowledge, this is the first time that higher order moments are reported to be effective in the score-level fusion literature. Experimental results based on face and speech unimodal systems, as well as multimodal fusion, show that our proposal can improve the performance over that of the standard fixed rule fusion strategies by as much as 50%. © 2012 ICPR Org Committee.
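The idea of summarising a video's per-frame match scores by distribution descriptors can be sketched as follows. The abstract does not list the descriptors actually used, so the choice of mean, variance, skewness and kurtosis here is an assumption, as is the hypothetical function name `score_descriptors`; in a full system the resulting vector would be passed to a discriminative classifier, which is not shown.

```python
def score_descriptors(frame_scores):
    """Return (mean, variance, skewness, kurtosis) of per-frame scores.

    Population moments are used. For a constant sequence the higher-order
    descriptors are reported as 0.0, a convention assumed for this sketch.
    """
    n = len(frame_scores)
    if n == 0:
        raise ValueError("at least one frame score is required")
    mean = sum(frame_scores) / n

    def central_moment(k):
        return sum((s - mean) ** k for s in frame_scores) / n

    variance = central_moment(2)
    if variance == 0.0:
        return (mean, 0.0, 0.0, 0.0)
    skew = central_moment(3) / variance ** 1.5
    kurt = central_moment(4) / variance ** 2
    return (mean, variance, skew, kurt)
```

A fixed-rule fusion such as averaging would keep only the first entry of this tuple; the remaining entries are the kind of higher-order information the abstract reports as discriminative.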
Sanchez UR, Kittler J (2006) Fusion of talking face biometric modalities for personal identity verification, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, Vols 1-13 pp. 5931-5934 IEEE
Kittler J, Windridge D, Goswami D (2008) Subsurface Scattering Deconvolution for Improved NIR-Visible Facial Image Correlation, 2008 8TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2008), VOLS 1 AND 2 pp. 889-894 IEEE
Poh N, Kittler J (2008) A Family of Methods for Quality-based Multimodal Biometric Fusion using Generative Classifiers, 2008 10TH INTERNATIONAL CONFERENCE ON CONTROL AUTOMATION ROBOTICS & VISION: ICARV 2008, VOLS 1-4 pp. 1162-1167 IEEE
Chan C-H, Kittler J, Messer K (2007) Multi-scale local binary pattern histograms for face recognition, Advances in Biometrics, Proceedings 4642 pp. 809-818 SPRINGER-VERLAG BERLIN
Bourlai T, Kittler J, Messer K (2006) Database size effects on performance on a smart card face verification system, Proceedings of the Seventh International Conference on Automatic Face and Gesture Recognition - Proceedings of the Seventh International Conference pp. 61-66 IEEE COMPUTER SOC
Sadeghi MT, Kittler J (2006) Confidence based gating of multiple face authentication experts, STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, PROCEEDINGS 4109 pp. 667-676 SPRINGER-VERLAG BERLIN
Zou X, Kittler J, Messer K (2006) Accurate face localisation for faces under active near-IR illumination, Proceedings of the Seventh International Conference on Automatic Face and Gesture Recognition - Proceedings of the Seventh International Conference pp. 369-374 IEEE COMPUTER SOC
Rua EA, Kittler J, Castro JLA, Jimenez DG (2006) Information fusion for local Gabor features based frontal face verification, ADVANCES IN BIOMETRICS, PROCEEDINGS 3832 pp. 173-181 SPRINGER-VERLAG BERLIN
Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓp norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓp MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓp MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓp MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines.
This paper addresses issues of analysis of DAPI-stained microscopy images of cell samples, particularly classification of objects as single nuclei, nuclei clusters or non-nuclear material. First, segmentation is significantly improved compared to Otsu's method by choosing a more appropriate threshold, using a cost-function that explicitly relates to the quality of resulting boundary, rather than image histogram. This method applies ideas from active contour models to threshold-based segmentation, combining the local image sensitivity of the former with the simplicity and lower computational complexity of the latter. Secondly, we evaluate some novel measurements that are useful in classification of resulting shapes. Particularly, analysis of central distance profiles provides a method for improved detection of notches in nuclei clusters. Error rates are reduced to less than half compared to those of the base system, which used Fourier shape descriptors alone.
Barnard M, Wang W, Kittler J (2013) Audio head pose estimation using the direct to reverberant speech ratio, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 8056-8060
Head pose is an important cue in many applications such as speech recognition and face recognition. Most approaches to head pose estimation to date have used visual information to model and recognise a subject's head in different configurations. These approaches have a number of limitations, such as inability to cope with occlusions, changes in the appearance of the head, and low resolution images. We present here a novel method for determining coarse head pose orientation purely from audio information, exploiting the direct to reverberant speech energy ratio (DRR) within a highly reverberant meeting room environment. Our hypothesis is that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This hypothesis is confirmed by experiments conducted on the publicly available AV16.3 database. © 2013 IEEE.
Poschmann P, Huber P, Raetsch M, Kittler J, Boehme H-J (2014) Fusion of tracking techniques to enhance adaptive real-time tracking of arbitrary objects, 6TH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN COMPUTER INTERACTION, IHCI 2014 39 pp. 162-165 ELSEVIER SCIENCE BV
Kittler J, Hamouz M, Tena JR, Hilton A, Illingworth J, Ruiz M (2005) 3D assisted 2D face recognition: Methodology, PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS 3773 pp. 1055-1065 SPRINGER-VERLAG BERLIN
Olvera-Lopez JA, Martinez-Trinidad JF, Carrasco-Ochoa JA, Kittler J (2009) Prototype selection based on sequential search, INTELLIGENT DATA ANALYSIS 13 (4) pp. 599-631 IOS PRESS
Fatukasi O, Kittler J, Poh N (2007) Quality controlled multimodal fusion of biometric experts, PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS 4756 pp. 881-890 SPRINGER-VERLAG BERLIN
Granai L, Tena JR, Hamouz M, Kittler J (2009) Influence of compression on 3D face recognition, PATTERN RECOGNITION LETTERS 30 (8) pp. 745-750 ELSEVIER SCIENCE BV
Chatzilari E, Nikolopoulos S, Kompatsiaris Y, Kittler J (2016) SALIC: Social Active Learning for Image Classification, IEEE TRANSACTIONS ON MULTIMEDIA 18 (8) pp. 1488-1503 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Kittler J, Poh N, Merati A (2011) Cohort based approach to multiexpert class verification, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6713 LNCS pp. 319-329
We address the problem of cohort based normalisation in multiexpert class verification. We show that there is a relationship between decision templates and cohort based normalisation methods. Thanks to this relationship, some of the recent features of cohort score normalisation techniques can be adopted by decision templates, with the benefit of noise reduction and the ability to compensate for any distribution drift. © 2011 Springer-Verlag.
Most existing cognitive architectures integrate computer vision and symbolic reasoning. However, there is still a gap between low-level scene representations (signals) and abstract symbols. Manually attaching, i.e. grounding, the symbols on the physical context makes it impossible to expand system capabilities by learning new concepts. This paper presents a visual bootstrapping approach for the unsupervised symbol grounding. The method is based on a recursive clustering of a perceptual category domain controlled by goal acquisition from the visual environment. The novelty of the method consists in division of goals into the classes of parameter goal, invariant goal and context goal. The proposed system exhibits incremental learning in such a manner as to allow effective transferable representation of high-level concepts.
Snell V, Christmas W, Kittler J (2012) Texture and shape in fluorescence pattern identification for auto-immune disease diagnosis, Proceedings - International Conference on Pattern Recognition pp. 3750-3753
Automation of HEp-2 cell pattern classification would drastically improve the accuracy and throughput of diagnostic services for many auto-immune diseases, but it has proven difficult to reach a sufficient level of precision. Correct diagnosis relies on a subtle assessment of texture type in microscopic images of indirect immunofluorescence (IIF), which so far has eluded reliable replication through automated measurements. We introduce a combination of spectral analysis and multi-scale digital filtering to extract the most discriminative variables from the cell images. We also apply multistage classification techniques to make optimal use of the limited labelled data set. Overall error rate of 1.6% is achieved in recognition of 6 different cell patterns, which drops to 0.5% if only positive samples are considered. © 2012 ICPR Org Committee.
Tahir MA, Kittler J, Yan F (2012) Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognition 45 (10) pp. 3738-3750 Elsevier
In this paper, a novel inverse random undersampling (IRUS) method is proposed for the class imbalance problem. The main idea is to severely under sample the majority class thus creating a large number of distinct training sets. For each training set we then find a decision boundary which separates the minority class from the majority class. By combining the multiple designs through fusion, we construct a composite boundary between the majority class and the minority class. The proposed methodology is applied on 22 UCI data sets and experimental results indicate a significant increase in performance when compared with many existing class-imbalance learning methods. We also present promising results for multi-label classification, a challenging research problem in many modern applications such as music, text and image categorization.
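The procedure outlined in this abstract (many small random subsets of the majority class, one decision boundary per subset, fusion of the resulting designs) can be sketched as below. The base classifier and the fusion rule are not given in the abstract, so a nearest-centroid rule and a simple vote fraction are swapped in here as assumptions, and the names `train_irus` and `minority_score` are hypothetical.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_irus(minority, majority, n_sets, subset_size, seed=0):
    """Build one nearest-centroid model per small random majority subset."""
    rng = random.Random(seed)
    minority_centre = centroid(minority)
    models = []
    for _ in range(n_sets):
        # Severe undersampling: only a handful of majority samples per set.
        subset = rng.sample(majority, subset_size)
        models.append((minority_centre, centroid(subset)))
    return models


def minority_score(models, x):
    """Fusion by voting: fraction of models that assign x to the minority."""
    votes = sum(
        1 for min_c, maj_c in models if sq_dist(x, min_c) < sq_dist(x, maj_c)
    )
    return votes / len(models)
```

Choosing `subset_size` smaller than the minority class is one reading of "severely under sample"; the abstract does not state the ratio, so that parameter is left to the caller.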
We describe a novel framework to detect ball hits in a tennis game by combining audio and visual information. Ball hit detection is a key step in understanding a game such as tennis, but single-mode approaches are not very successful: audio detection suffers from interfering noise and acoustic mismatch, while video detection is made difficult by the small size of the ball and the complex background of the surrounding environment. Our goal in this paper is to improve detection performance by focusing on high-level information (rather than low-level features), including the detected audio events, the ball's trajectory, and inter-event timing information. Visual information supplies coarse detection of the ball-hit events. This information is used as a constraint for audio detection. In addition, useful gains in detection performance can be obtained by using inter-ball-hit timing information, which aids prediction of the next ball hit. This method seems to be very effective in reducing the interference present in low-level features. After applying this method to a women's doubles tennis game, we obtained improvements in the F-score of about 30% (absolute) for audio detection and about 10% for video detection.
The problem of tracking multiple moving speakers in indoor environments has received much attention. Earlier techniques were based purely on a single modality, e.g., vision. Recently, the fusion of multi-modal information has been shown to be instrumental in improving tracking performance, as well as robustness in the case of challenging situations like occlusions (by the limited field of view of cameras or by other speakers). However, data fusion algorithms often suffer from noise corrupting the sensor measurements which cause non-negligible detection errors. Here, a novel approach to combining audio and visual data is proposed. We employ the direction of arrival angles of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. This approach is further improved by solving a typical problem associated with the PF, whose efficiency and accuracy usually depend on the number of particles and noise variance used in state estimation and particle propagation. Both parameters are specified beforehand and kept fixed in the regular PF implementation, which makes the tracker unstable in practice. To address these problems, we design an algorithm which adapts both the number of particles and noise variance based on tracking error and the area occupied by the particles in the image. Experiments on the AV16.3 dataset show the advantage of our proposed methods over the baseline PF method and an existing adaptive PF algorithm for tracking occluded speakers with a significantly reduced number of particles. © 1999-2012 IEEE.
Poh N, Ross A, Li W, Kittler J (2013) Corrigendum to "A user-specific and selective multimodal biometric fusion strategy by ranking subjects" [Pattern Recognition 46 (2013) 3341-3357] (DOI:10.1016/j.patcog.2013.03.018), Pattern Recognition
Poh N, Kittler J, Smith R, Tena JR (2007) A method for estimating authentication performance over time, with applications to face biometrics, PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS 4756 pp. 360-369 SPRINGER-VERLAG BERLIN
Kim T-K, Stenger B, Kittler J, Cipolla R (2011) Incremental Linear Discriminant Analysis Using Sufficient Spanning Sets and Its Applications, INTERNATIONAL JOURNAL OF COMPUTER VISION 91 (2) pp. 216-232 SPRINGER
Gonzalez-Jimenez D, Argones-Rua E, Alba-Castro JL, Kittler J (2007) Evaluation of point localisation and similarity fusion methods for Gabor jet-based face verification, IET COMPUTER VISION 1 (3-4) pp. 101-112 INST ENGINEERING TECHNOLOGY-IET
Sidiropoulos P, Mezaris V, Kompatsiaris IY, Kittler J (2012) Differential Edit Distance: A Metric for Scene Segmentation Evaluation, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 22 (6) pp. 904-914 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Zou X, Wang W, Kittler J (2008) Non-negative Matrix Factorization for Face Illumination Analysis, Proc. ICA Research Network International Workshop pp. 52-55
Kittler J, Poh N, Fatukasi O, Messer K, Kryszczuk K, Richiardi J, Drygajlo A (2007) Quality dependent fusion of intramodal and multimodal biometric experts, BIOMETRIC TECHNOLOGY FOR HUMAN IDENTIFICATION IV 6539 ARTN 653903 SPIE-INT SOC OPTICAL ENGINEERING
Fitch AJ, Kadyrov A, Christmas WJ, Kittler J (2005) Fast robust correlation, IEEE TRANSACTIONS ON IMAGE PROCESSING 14 (8) pp. 1063-1073 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Mortazavian P, Kittler J, Christmas W (2009) 3D-assisted Facial Texture Super-Resolution, Proceedings of BMVC 2009
Tahir MA, Kittler J, Bouridane A (2012) Multilabel classification using heterogeneous ensemble of multi-label classifiers, PATTERN RECOGNITION LETTERS 33 (5) pp. 513-523
Arashloo SR, Kittler J, Christmas WJ (2010) Facial feature localization using graph matching with higher order statistical shape priors and global optimization, IEEE 4th International Conference on Biometrics: Theory, Applications and Systems, BTAS 2010
This paper presents a graphical model for deformable face matching and landmark localization under an unknown non-rigid warp. The proposed model learns and combines statistics of both appearance and shape variations of facial images (learnt purely from a set of frontal training images) in a complex objective function in an unsupervised manner. Local and global shape variations are included in the objective function as binary and higher order clique potentials. The proposed approach exploits the sparseness of facial features to reduce the complexity of inference over the probabilistic model. Besides presenting a method for face feature localization, the paper proposes a framework for incorporation of statistical shape priors as higher order cliques into MRFs. The problem of optimizing the objective function is performed using the dual decomposition approach in which the higher order subproblems based on point distribution models are formulated as instances of convex quadratic programs. The evaluation of the approach for feature localization is performed both on the frontal and rotated images of the XM2VTS dataset images as well as images collected from Google. The method shows high robustness to partial occlusion, pose changes etc. The method is then applied as an initialization step for a more costly matching method and is shown to be instrumental in improving performance and reducing runtime. ©2010 Crown.
Rambaruth R, Christmas W, Kittler J (2000) Representation of regions for accurate motion compensation, European Signal Processing Conference 2015-March (March)
© 2000 EUSIPCO. We propose a novel motion compensation technique for the precise reconstruction of regions over several frames within a region-based coding scheme. This is achieved by using a more accurate internal representation of arbitrarily shaped regions than the standard grid structure, thus avoiding repeated approximations for a region at each frame.
Bourlai T, Kittler J, Messer K (2007) Smart-card-based face verification system: Empirical optimization of system meta-parameters, 41ST ANNUAL IEEE INTERNATIONAL CARNAHAN CONFERENCE ON SECURITY TECHNOLOGY, PROCEEDINGS pp. 85-92 IEEE
Chan CH, Goswami B, Kittler J, Christmas W (2011) Kernel-based Speaker Verification Using Spatiotemporal Lip Information, Proceedings of MVA 2011 - IAPR Conference on Machine Vision Applications pp. 422-425 MVA Organization
Roh MC, Christmas B, Kittler J, Lee SW (2006) Gesture spotting in low-quality video with features based on curvature scale space, FGR 2006: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition 2006 pp. 375-380
Spotting a player's gestures and actions in sports video is a key task in the automatic high-level analysis of video material. In many sports views, the camera covers a large part of the sports arena, so the player's region is small and exhibits large motion. This makes determining the player's gestures and actions a challenging task. To overcome these problems, we propose a method based on curvature scale space templates of the player's silhouette. The use of curvature scale space makes the method robust to noise, and our method is robust to significant shape corruption of a part of the player's silhouette. We also propose a new recognition method which is robust to noisy posture sequences and needs only a small amount of training data, which is an essential characteristic for many practical applications. © 2006 IEEE.
Chatzilari E, Nikolopoulos S, Kompatsiaris Y, Kittler J (2014) HOW MANY MORE IMAGES DO WE NEED? PERFORMANCE PREDICTION OF BOOTSTRAPPING FOR IMAGE CLASSIFICATION, 2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP) pp. 4256-4260 IEEE
Merati A, Poh N, Kittler J (2012) User-specific cohort selection and score normalization for biometric systems, IEEE Transactions on Information Forensics and Security 7 (4) pp. 1270-1277
An increasing body of evidence suggests that cohort-based score normalization can improve the performance of biometric authentication. This approach relies on the use of N cohort biometric templates, which can be computationally expensive. We contribute to the advancement of cohort score normalization in two ways. First, we show both theoretically and empirically that the most similar and the most dissimilar cohort templates to a target user contain discriminative information. We then investigate the extraction of this information using polynomial regression. Extensive evaluation on the face and fingerprint modalities in the Biosecure DS2 dataset indicates that the proposed method outperforms the state-of-the-art cohort score normalization methods, while reducing the computation cost by as much as half. © 2012 IEEE.
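As a rough illustration of the ideas named in the abstract above, the following Python sketch shows a conventional cohort-based normalisation (in the style of T-norm) and a straight-line least-squares fit to the rank-ordered cohort scores. It is not the authors' implementation: the function names `t_norm` and `fit_line_to_sorted_cohort` are hypothetical, and the degree-1 fit is only an assumed, simplest case of the polynomial regression the abstract mentions.

```python
from statistics import mean, pstdev


def t_norm(raw_score, cohort_scores):
    """Normalise a raw match score by the mean and spread of cohort scores.

    This is the familiar T-norm style baseline, shown here only for context.
    """
    mu = mean(cohort_scores)
    sigma = pstdev(cohort_scores)
    if sigma == 0:
        raise ValueError("cohort scores have zero spread")
    return (raw_score - mu) / sigma


def fit_line_to_sorted_cohort(cohort_scores):
    """Fit a least-squares line to cohort scores sorted from most to least similar.

    Returns (slope, intercept). Treating these coefficients as a compact
    summary of the cohort is an assumption made for illustration.
    """
    ys = sorted(cohort_scores, reverse=True)
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two cohort scores")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar
```

In this sketch the sorted cohort profile is reduced to two numbers; a higher-degree polynomial would be needed to capture curvature at the most similar and most dissimilar ends of the ranking.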
Shekar BH, Rajesh DS, Kittler J (2015) Affine Normalized Stockwell Transform based Face Recognition, 2015 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, INFORMATICS, COMMUNICATION AND ENERGY SYSTEMS (SPICES) IEEE
Arashloo SR, Kittler J, Christmas WJ (2011) Pose-invariant face recognition by matching on multi-resolution MRFs linked by supercoupling transform, COMPUTER VISION AND IMAGE UNDERSTANDING 115 (7) pp. 1073-1083 ACADEMIC PRESS INC ELSEVIER SCIENCE
Yan F, Kittler J, Mikolajczyk K, Windridge D (2012) Automatic Annotation of Court Games with Structured Output Learning,
Sanchez UR, Kittler J (2006) Fusion of talking face biometric modalities for personal identity verification, 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol V, Proceedings pp. 1073-1076 IEEE
3D Morphable Face Models are a powerful tool in computer vision. They consist of a PCA model of face shape and colour information and allow a 3D face to be reconstructed from a single 2D image. 3D Morphable Face Models are used for 3D head pose estimation, face analysis, face recognition, and, more recently, facial landmark detection and tracking. However, they are not as widely used as 2D methods, because the process of building and using a 3D model is much more involved. In this paper, we present the Surrey Face Model, a multi-resolution 3D Morphable Model that we make available to the public for non-commercial purposes. The model contains different mesh resolution levels and landmark point annotations as well as metadata for texture remapping. Accompanying the model is a lightweight open-source C++ library designed with simplicity and ease of integration as its foremost goals. In addition to basic functionality, it contains pose estimation and face frontalisation algorithms. With the tools presented in this paper, we aim to close two gaps. First, by offering different model resolution levels and fast fitting functionality, we enable the use of a 3D Morphable Model in time-critical applications like tracking. Second, the software library makes it easy for the community to adopt the 3D Morphable Face Model in their research, and it offers a public place for collaboration.
FarajiDavar N, de Campos TE, Windridge D, Kittler J, Christmas W (2011) Domain Adaptation in the Context of Sport Video Action Recognition,
We apply domain adaptation to the problem of recognizing common actions between differing court-game sport videos (in particular tennis and badminton games). Actions are characterized in terms of HOG3D features extracted at the bounding box of each detected player, and thus have large intrinsic dimensionality. The techniques evaluated here for domain adaptation are based on estimating linear transformations to adapt the source domain features in order to maximize the similarity between posterior PDFs for each class in the source domain and the expected posterior PDF for each class in the target domain. As such, the problem scales linearly with feature dimensionality, making the video-environment domain adaptation problem tractable on reasonable time scales and resilient to over-fitting. We thus demonstrate that significant performance improvement can be achieved by applying domain adaptation in this context.
Loog M, Wu X-J, Lu J-P, Yang J-Y, Wang S-T, Kittler J (2008) A note on an extreme case of the generalized optimal discriminant transformation, NEUROCOMPUTING 72 (1-3) pp. 664-665 ELSEVIER SCIENCE BV
Tahir MA, Yan F, Barnard M, Awais M, Mikolajczyk K, Kittler J (2010) The University of Surrey visual concept detection system at ImageCLEF@ICPR: Working notes, Lecture Notes in Computer Science: Recognising Patterns in Signals, Speech, Images and Videos 6388 pp. 162-170 Springer
Visual concept detection is one of the most important tasks in image and video indexing. This paper describes our system in the ImageCLEF@ICPR Visual Concept Detection Task which ranked first for large-scale visual concept detection tasks in terms of Equal Error Rate (EER) and Area under Curve (AUC) and ranked third in terms of hierarchical measure. The presented approach involves state-of-the-art local descriptor computation, vector quantisation via clustering, structured scene or object representation via localised histograms of vector codes, similarity measure for kernel construction and classifier learning. The main novelty is the classifier-level and kernel-level fusion using Kernel Discriminant Analysis with RBF/Power Chi-Squared kernels obtained from various image descriptors. For 32 out of 53 individual concepts, we obtain the best performance of all 12 submissions to this task.
Mendez-Vázquez H, Kittler J, Chan CH, García-Reyes E (2013) Photometric normalization for face recognition using local discrete cosine transform, International Journal of Pattern Recognition and Artificial Intelligence 27 (3)
Variation in illumination is one of the major limiting factors of face recognition system performance. The effect of changes in the incident light on face images is analyzed, as well as its influence on the low frequency components of the image. Starting from this analysis, a new photometric normalization method for illumination invariant face recognition is presented. Low-frequency Discrete Cosine Transform coefficients in the logarithmic domain are used in a local way to reconstruct a slowly varying component of the face image which is caused by illumination. After smoothing, this component is subtracted from the original logarithmic image to compensate for illumination variations. Compared to other preprocessing algorithms, our method achieved a very good performance with a total error rate very similar to that produced by the best performing state-of-the-art algorithm. An in-depth analysis of the two preprocessing methods revealed notable differences in their behavior, which is exploited in a multiple classifier fusion framework to achieve further performance improvement. The superiority of the proposal is demonstrated in both face verification and identification experiments. © 2013 World Scientific Publishing Company.
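The core step described in the abstract above (estimate a slowly varying component from low-frequency DCT coefficients of the log image, then subtract it) can be sketched in Python for a single square block. This is a simplified illustration under stated assumptions, not the published method: the function names and the `keep` parameter are hypothetical, the smoothing step mentioned in the abstract is omitted, and the block is assumed to contain strictly positive intensities so that the logarithm is defined.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def normalise_block(block, keep=1):
    """Subtract a low-frequency illumination estimate from the log of a block.

    block: square list of rows with strictly positive intensities.
    keep:  number of lowest frequencies retained per axis (assumed parameter).
    """
    n = len(block)
    log_img = [[math.log(v) for v in row] for row in block]
    c = dct_matrix(n)
    ct = transpose(c)
    coeffs = matmul(matmul(c, log_img), ct)           # forward 2D DCT
    low = [[coeffs[u][v] if u < keep and v < keep else 0.0
            for v in range(n)] for u in range(n)]     # keep low frequencies only
    illumination = matmul(matmul(ct, low), c)         # inverse 2D DCT
    return [[log_img[r][q] - illumination[r][q] for q in range(n)]
            for r in range(n)]
```

With `keep=1` only the mean log intensity is removed, so multiplying every pixel of a block by the same gain leaves the output unchanged; larger values of `keep` remove smooth gradients as well.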
Almajai I, Yan F, de Campos TE, Khan A, Christmas W, Windridge D, Kittler J (2012) Anomaly Detection and Knowledge Transfer in Automatic Sports Video Annotation, In: Weinshall D, Anemüller J, van Gool L (eds.), Detection and Identification of Rare Audiovisual Cues 384 pp. 109-117 Springer
A key question in machine perception is how to adaptively build upon existing capabilities so as to permit novel functionalities. Implicit in this are the notions of anomaly detection and learning transfer. A perceptual system must firstly determine at what point the existing learned model ceases to apply, and secondly, what aspects of the existing model can be brought to bear on the newly-defined learning domain. Anomalies must thus be distinguished from mere outliers, i.e. cases in which the learned model has failed to produce a clear response; it is also necessary to distinguish novel (but meaningful) input from misclassification error within the existing models. We thus apply a methodology of anomaly detection based on comparing the outputs of strong and weak classifiers to the problem of detecting the rule-incongruence involved in the transition from singles to doubles tennis videos. We then demonstrate how the detected anomalies can be used to transfer learning from one (initially known) rule-governed structure to another. Our ultimate aim, building on existing annotation technology, is to construct an adaptive system for court-based sport video annotation.
Goswami B, Chan C, Kittler J, Christmas W (2011) SPEAKER AUTHENTICATION USING VIDEO-BASED LIP INFORMATION, 2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING pp. 1908-1911 IEEE
Guillemaut J-Y, Kittler J, Sadeghi MT, Christmas WJ (2006) General pose face recognition using frontal face model, PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS 4225 pp. 79-88 SPRINGER-VERLAG BERLIN
Kim T-K, Kittler J (2006) Design and fusion of pose-invariant face-identification experts, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 16 (9) pp. 1096-1106 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Tahir MA, Kittler J, Mikolajczyk K, Yan F (2010) Improving Multilabel Classification Performance by Using Ensemble of Multi-label Classifiers, MULTIPLE CLASSIFIER SYSTEMS, PROCEEDINGS 5997 pp. 11-21 SPRINGER-VERLAG BERLIN
Mortazavian P, Kittler J, Christmas WJ (2012) 3D Morphable Model Fitting For Low-Resolution Facial Images,
This paper proposes a new algorithm for fitting a 3D morphable face model on low-resolution (LR) facial images. We analyse the criterion commonly used by the main fitting algorithms and by comparing with an image formation model, show that this criterion is only valid if the resolution of the input image is high. We then derive an imaging model to describe the process of LR image formation given the 3D model. Finally, we use this imaging model to improve the fitting criterion. Experimental results show that our algorithm significantly improves fitting results on LR images and yields similar parameters to those that would have been obtained if the input image had a higher resolution. We also show that our algorithm can be used for face recognition in low-resolutions where the conventional fitting algorithms fail.
Poh N, Chan CH, Kittler J, Marcel S, Mc Cool C, Argones Rua E, Alba Castro JL, Villegas M, Paredes R, Struc V, Pavesic N, Salah AA, Fang H, Costen N (2009) Face Video Competition, ADVANCES IN BIOMETRICS 5558 pp. 715-724 SPRINGER-VERLAG BERLIN
Hamouz M, Tena JR, Kittler J, Hilton A, Illingworth J (2007) 3D assisted face recognition: A survey, 3D IMAGING FOR SAFETY AND SECURITY 35 pp. 3-23 SPRINGER
Goswami B, Christmas WJ, Kittler J (2006) Statistical estimators for use in automatic lip segmentation, IET Conference Publications (516 CP) pp. 79-86
The past decade has seen a considerable increase in interest in the field of facial feature extraction. The primary reason for this is the variety of uses, in particular of the mouth region, in communicating important information about an individual which can in turn be used in a wide array of applications. The shape and dynamics of the mouth region convey the content of a communicated message, useful in applications involving speech processing as well as man-machine user interfaces. The mouth region can also be used as a parameter in a biometric verification system. Extraction of the mouth region from a face often uses lip contour processing to achieve these objectives. Thus, solving the problem of reliably segmenting the lip region given a talking face image is critical. This paper compares the use of statistical estimators, both robust and non-robust, when applied to the problem of automatic lip region segmentation. It then compares the results of the two systems with a state-of-the-art method for lip segmentation.
Kim T-K, Kittler J, Cipolla R (2010) On-line Learning of Mutually Orthogonal Subspaces for Face Recognition by Image Sets, IEEE TRANSACTIONS ON IMAGE PROCESSING 19 (4) pp. 1067-1074 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Short J, Kittler J, Messer K (2006) Photometric normalisation for component-based face verification, Proceedings of the Seventh International Conference on Automatic Face and Gesture Recognition - Proceedings of the Seventh International Conference pp. 114-119 IEEE COMPUTER SOC
Smith RS, Kittler J, Hamouz M, Illingworth J (2006) Face recognition using angular LDA and SVM ensembles, 18th International Conference on Pattern Recognition, Vol 3, Proceedings pp. 1008-1012 IEEE COMPUTER SOC
Poh N, Kittler J (2007) On the use of log-likelihood ratio based model-specific score normalisation in biometric authentication, Advances in Biometrics, Proceedings 4642 pp. 614-624 SPRINGER-VERLAG BERLIN
Llano EG, Kittler J, Messer K, Vazquez HM (2006) A comparative study of face representations in the frequency domain, PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS 4225 pp. 99-108 SPRINGER-VERLAG BERLIN
Beveridge JR, Zhang H, Draper BA, Flynn PJ, Feng Z, Huber P, Kittler J, Huang Z, Li S, Li Y, Kan M, Wang R, Shan S, Chen X, Li H, Hua G, Struc V, Krizaj J, Ding C, Tao D, Phillips PJ (2015) Report on the FG 2015 Video Person Recognition Evaluation, 2015 11TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), VOL. 2 IEEE
Wu X-J, Lu J-P, Yang J-Y, Wang S-T, Kittler J (2007) An extreme case of the generalized optimal discriminant transformation and its application to face recognition, NEUROCOMPUTING 70 (4-6) pp. 828-834 ELSEVIER SCIENCE BV
Poh N, Kittler J, Bourlai T (2010) Quality-Based Score Normalization With Device Qualitative Information for Multimodal Biometric Fusion, IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART A-SYSTEMS AND HUMANS 40 (3) pp. 539-554 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Arashloo SR, Kittler J (2009) Pose-Invariant Face Matching Using MRF Energy Minimization Framework, ENERGY MINIMIZATION METHODS IN COMPUTER VISION AND PATTERN RECOGNITION, PROCEEDINGS 5681 pp. 56-69 SPRINGER-VERLAG BERLIN
Shaukat A, Windridge D, Hollnagel E, Macchi L, Kittler J (2010) Adaptive, Perception-Action-based Cognitive Modelling of Human Driving Behaviour using Control, Gaze and Signal inputs,
A perception-action framework for cognition represents the world in terms of an embodied agent's ability to bring about changes within that environment. This amounts to an affordance-based modelling of the environment. Recent psychological research suggests that a hierarchical perception-action model, known as the Extended Control Model (ECOM), is employed by humans within a vehicle driving context. We thus seek to use machine learning techniques to identify ECOM states (i.e. hierarchical driver intentions) using the modalities of eye-gaze, signalling and driver control input with respect to external visual features. Our approach consists in building a deductive logical model based on a priori highway-code and ECOM rules, which is then to be applied to non-contextual stochastic classifications of feature inputs from a test-car's camera and detectors so as to determine the currently active ECOM state. Since feature inputs are both noisy and sparse, the goal of the logic system is to adaptively impose top-down consistency and completeness on the input. The cognitively-motivated combination of stochastic bottom-up and logical top-down representational induction means that the machine learning problem is one of symbol tethering in Sloman's sense.
Kittler J, Huber P, Feng Z, Hu G, Christmas W (2016) 3D Morphable Face Models and Their Applications, Lecture Notes in Computer Science (LNCS) vol.9756: 9th International Conference, AMDO 2016, Palma de Mallorca, Spain, July 13-15, 2016, Proceedings 9756 pp. 185-206 Springer
3D Morphable Face Models (3DMM) have been used in face recognition for some time now. They can be applied in their own right as a basis for 3D face recognition and analysis involving 3D face data. However their prevalent use over the last decade has been as a versatile tool in 2D face recognition to normalise pose, illumination and expression of 2D face images. A 3DMM has the generative capacity to augment the training and test databases for various 2D face processing related tasks. It can be used to expand the gallery set for pose-invariant face matching. For any 2D face image it can furnish complementary information, in terms of its 3D face shape and texture. It can also aid multiple frame fusion by providing the means of registering a set of 2D images. A key enabling technology for this versatility is 3D face model to 2D face image fitting. In this paper recent developments in 3D face modelling and model fitting will be overviewed, and their merits in the context of diverse applications illustrated on several examples, including pose and illumination invariant face recognition, and 3D face reconstruction from video.
Chan CH, Kittler J (2010) SPARSE REPRESENTATION OF (MULTISCALE) HISTOGRAMS FOR FACE RECOGNITION ROBUST TO REGISTRATION AND ILLUMINATION PROBLEMS, 2010 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING pp. 2441-2444 IEEE
Khan A, Christmas W, Kittler J (2007) Lip contour segmentation using kernel methods and level sets, ADVANCES IN VISUAL COMPUTING, PROCEEDINGS, PT 2 4842 pp. 86-95 SPRINGER-VERLAG BERLIN
Kittler J, Ghaderi R, Windeatt T, Matas J (2001) Face identification and verification via ECOC, AUDIO- AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS 2091 pp. 1-13 SPRINGER-VERLAG BERLIN
Poh N, Kittler J (2009) A Biometric Menagerie Index for Characterising Template/Model-Specific Variation, ADVANCES IN BIOMETRICS 5558 pp. 816-827 SPRINGER-VERLAG BERLIN
Bags-of-visual-Words (BoW) and Spatio-Temporal Shapes (STS) are two very popular approaches for action recognition from video. The former (BoW) is an un-structured global representation of videos which is built using a large set of local features. The latter (STS) uses a single feature located on a region of interest (where the actor is) in the video. Despite the popularity of these methods, no comparison between them has been done. Also, given that BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. This paper compares these two approaches using four different datasets with varied degree of space-time specificity of the actions and varied relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. Further to BoW and STS, we also evaluated novel variations of BoW constrained in time or space. We observe that the STS approach leads to better results in all datasets whose background is of little relevance to action classification. © 2010 IEEE.
Ogata T, Christmas W, Kittler J, Tan JK, Ishikawa S (2007) Human activity detection by combining motion descriptors with boosting, Journal of Information Processing Society of Japan 48 3 pp. 1166-1175
A new, combined human activity detection method is proposed. Our method is based on Efros et al.'s motion descriptors and Ke et al.'s event detectors. Since both methods use optical flow, it is easy to combine them. However, the computational cost of the training increases considerably because of the increased number of weak classifiers. We reduce this computational cost by extending Ke et al.'s weak classifiers to incorporate multi-dimensional features. We also introduce a Look Up Table for further high-speed computation. The proposed method is applied to off-air tennis video data, and its performance is evaluated by comparison with the original two methods. Experimental results show that the performance of the proposed method is a good compromise in terms of detection rate and computation time of testing and training.
Hu G, Chan CH, Kittler J, Christmas W (2012) Resolution-Aware 3D Morphable Model, Proceedings of the British Machine Vision Conference pp. 109.1-109.10 BMVA Press
Rahimzadeh Arashloo S, Kittler J (2010) Energy Normalization for Pose-Invariant Face Recognition Based on MRF Model Image Matching, IEEE Trans Pattern Anal Mach Intell
A pose-invariant face recognition system based on an image matching method formulated on MRFs is presented. The method uses the energy of the established match between a pair of images as a measure of goodness-of-match. The method can tolerate moderate global spatial transformations between the gallery and the test images and alleviates the need for geometric preprocessing of facial images by encapsulating a registration step as part of the system. It requires no training on non-frontal face images. A number of innovations, such as a dynamic block size and block shape adaptation, as well as label pruning and error prewhitening measures have been introduced to increase the effectiveness of the approach. The experimental evaluation of the method is performed on two publicly available databases. First, the method is tested on the rotation shots of the XM2VTS data set in a verification scenario. Next, the evaluation is conducted in an identification scenario on the CMU-PIE database. The method compares favorably with the existing 2D or 3D generative model based methods on both databases in both identification and verification scenarios.
Snell V, Christmas W, Kittler J (2013) HEp-2 fluorescence pattern classification, Pattern Recognition
Automation of HEp-2 cell pattern classification would drastically improve the accuracy and throughput of diagnostic services for many auto-immune diseases, but it has proven difficult to reach a sufficient level of precision. Correct diagnosis relies on a subtle assessment of texture type in microscopic images of indirect immunofluorescence (IIF), which has, so far, eluded reliable replication through automated measurements. Following the recent HEp-2 Cells Classification contest held at ICPR 2012, we extend the scope of research in this field to develop a method of feature comparison that goes beyond the analysis of individual cells and majority-vote decisions to consider the full distribution of cell parameters within a patient sample. We demonstrate that this richer analysis is better able to predict the results of majority vote decisions than the cell-level performance analysed in all previous works. © 2013.
Mortazavian P, Kittler J, Christmas W (2009) A 3-D assisted generative model for facial texture super-resolution, 2009 IEEE 3RD INTERNATIONAL CONFERENCE ON BIOMETRICS: THEORY, APPLICATIONS AND SYSTEMS pp. 452-458 IEEE
Kolonias I, Kittler J, Christmas WJ, Yan F (2007) Improving the accuracy of automatic tennis video annotation by high level grammar, 14TH INTERNATIONAL CONFERENCE ON IMAGE ANALYSIS AND PROCESSING WORKSHOPS, PROCEEDINGS pp. 154-159 IEEE COMPUTER SOC
Bourlai T, Kittler J, Messer K (2010) On design and optimization of face verification systems that are smart-card based, MACHINE VISION AND APPLICATIONS 21 (5) pp. 695-711 SPRINGER
Shekar BH, Pilar B, Kittler J (2015) An unification of Inner Distance Shape Context and Local Binary Pattern for Shape Representation and Classification, PERCEPTION AND MACHINE INTELLIGENCE, 2015 pp. 46-55 ASSOC COMPUTING MACHINERY
Kamarainen J-K, Hamouz M, Kittler J, Paalanen P, Ilonen J, Drobchenko A (2007) Object localisation using generative probability model for spatial constellation and local image features, 2007 IEEE 11TH INTERNATIONAL CONFERENCE ON COMPUTER VISION, VOLS 1-6 pp. 2784-2791 IEEE
Bourlai T, Kittler J, Messer K (2009) Designing a smart-card-based face verification system: empirical investigation, MACHINE VISION AND APPLICATIONS 20 (4) pp. 225-242 SPRINGER
Fatukasi O, Kittler J, Poh N (2008) Estimation of Missing Values in Multimodal Biometric Fusion, 2008 IEEE SECOND INTERNATIONAL CONFERENCE ON BIOMETRICS: THEORY, APPLICATIONS AND SYSTEMS (BTAS) pp. 117-122 IEEE
3D face reconstruction from a single 2D image can be performed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach. However, the reconstruction is an ill-posed problem. The recovery of the illumination characteristics of the 2D input image is particularly difficult because the proportion of the albedo and shading contributions in a pixel intensity is ambiguous. In this paper we propose the use of a facial symmetry constraint, which helps to identify the relative contributions of albedo and shading. The facial symmetry constraint is incorporated in a multi-feature optimisation framework, which realises the fitting process. By virtue of this constraint better illumination parameters can be recovered, and as a result the estimated 3D face shape and surface texture are more accurate. The proposed method is validated on the PIE face database. The experimental results show that the introduction of the facial symmetry constraint improves the performance of both face reconstruction and face recognition.
Chan CH, Tahir MA, Kittler J, Pietikäinen M (2013) Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors, IEEE Trans Pattern Anal Mach Intell 35 (5) pp. 1164-1177
Face recognition subject to uncontrolled illumination and blur is challenging. Interestingly, image degradation caused by blurring, often present in real-world imagery, has mostly been overlooked by the face recognition community. Such degradation corrupts face information and affects image alignment, which together negatively impact recognition accuracy. We propose a number of countermeasures designed to achieve system robustness to blurring. First, we propose a novel blur-robust face image descriptor based on Local Phase Quantization (LPQ) and extend it to a multiscale framework (MLPQ) to increase its effectiveness. To maximize the insensitivity to misalignment, the MLPQ descriptor is computed regionally by adopting a component-based framework. Second, the regional features are combined using kernel fusion. Third, the proposed MLPQ representation is combined with the Multiscale Local Binary Pattern (MLBP) descriptor using kernel fusion to increase insensitivity to illumination. Kernel Discriminant Analysis (KDA) of the combined features extracts discriminative information for face recognition. Last, two geometric normalizations are used to generate and combine multiple scores from different face image scales to further enhance the accuracy. The proposed approach has been comprehensively evaluated using the combined Yale and Extended Yale database B (degraded by artificially induced linear motion blur) as well as the FERET, FRGC 2.0, and LFW databases. The combined system is comparable to state-of-the-art approaches using similar system configurations. The reported work provides a new insight into the merits of various face representation and fusion methods, as well as their role in dealing with variable lighting and blur degradation.
We present a new Cascaded Shape Regression (CSR) architecture, namely Dynamic Attention-Controlled CSR (DAC-CSR), for robust facial landmark detection on unconstrained faces. Our DAC-CSR divides facial landmark detection into three cascaded sub-tasks: face bounding box refinement, general CSR and attention-controlled CSR. The first two stages refine initial face bounding boxes and output intermediate facial landmarks. Then, an online dynamic model selection method is used to choose appropriate domain-specific CSRs for further landmark refinement. The key innovation of our DAC-CSR is the fault-tolerant mechanism, using fuzzy set sample weighting, for attention-controlled domain-specific model training. Moreover, we advocate data augmentation with a simple but effective 2D profile face generator, and context-aware feature extraction for better facial feature representation. Experimental results obtained on challenging datasets demonstrate the merits of our DAC-CSR over the state-of-the-art methods.
This paper proposes a methodology for the automatic detection
of anomalous shipping tracks traced by ferries. The approach
comprises a set of models as a basis for outlier detection:
a Gaussian process (GP) model regresses displacement
information collected over time, and a Markov chain based
detector makes use of the direction (heading) information. GP
regression is performed together with Median Absolute Deviation
to account for contaminated training data. The methodology
utilizes the coordinates of a given ferry recorded on a
second by second basis via Automatic Identification System.
Its effectiveness is demonstrated on a dataset collected in the
We present a framework for robust face detection and
landmark localisation of faces in the wild, which has been
evaluated as part of "the 2nd Facial Landmark Localisation
Competition". The framework has four stages: face
detection, bounding box aggregation, pose estimation and
landmark localisation. To achieve a high detection rate,
we use two publicly available CNN-based face detectors
and two proprietary detectors. We aggregate the detected
face bounding boxes of each input image to reduce false
positives and improve face detection accuracy. A cascaded
shape regressor, trained using faces with a variety of pose
variations, is then employed for pose estimation and image
pre-processing. Last, we train the final cascaded shape
regressor for fine-grained landmark localisation, using a
large number of training samples with limited pose variations.
The experimental results obtained on the 300W
and Menpo benchmarks demonstrate the superiority of our
framework over state-of-the-art methods.
Existing ensemble pruning algorithms in the literature have mainly been defined for unweighted or weighted voting ensembles, and their extensions to the Error Correcting Output Coding (ECOC) framework have not been successful. This paper presents a novel pruning algorithm for ECOC that uses a new accuracy measure together with diversity and Hamming distance information. The results show that the novel method outperforms existing state-of-the-art methods.
In this paper we formulate multiple kernel learning (MKL) as a distance metric learning (DML) problem. More specifically, we learn a linear combination of a set of base kernels by optimising two objective functions that are commonly used in distance metric learning. We first propose a global version of such an MKL via DML scheme, then a localised version. We argue that the localised version not only yields better performance than the global version, but also fits naturally into the framework of example based retrieval and relevance feedback. Finally, the usefulness of the proposed schemes is verified through experiments on two image retrieval datasets.
Awais M, Yan F, Mikolajczyk K, Kittler JV (2011) Novel fusion methods for pattern recognition, Lecture Notes in Computer Science: Proceedings of Machine Learning and Knowledge Discovery in Databases (Part 1) 6911 (PART 1) pp. 140-155
Over the last few years, several approaches have been proposed for information fusion including different variants of classifier level fusion (ensemble methods), stacking and multiple kernel learning (MKL). MKL has become a preferred choice for information fusion in object recognition. However, in the case of highly discriminative and complementary feature channels, it does not significantly improve upon its trivial baseline which averages the kernels. Alternative ways are stacking and classifier level fusion (CLF) which rely on a two phase approach. There is a significant amount of work on linear programming formulations of ensemble methods particularly in the case of binary classification. In this paper we propose a multiclass extension of binary ν-LPBoost, which learns the contribution of each class in each feature channel. The existing approaches of classifier fusion promote sparse feature combinations, due to regularization based on the ℓ1-norm, and lead to a selection of a subset of feature channels, which is not good in the case of informative channels. Therefore, we generalize existing classifier fusion formulations to an arbitrary ℓp-norm for binary and multiclass problems, which results in more effective use of complementary information. We also extend stacking for both binary and multiclass datasets. We present an extensive evaluation of the fusion methods on four datasets involving kernels that are all informative and achieve state-of-the-art results on all of them.
The 3D Morphable Model (3DMM) is currently receiving considerable attention for
human face analysis. Most existing work focuses on fitting a 3DMM to high resolution
images. However, in many applications, fitting a 3DMM to low-resolution images
is also important. In this paper, we propose a Resolution-Aware 3DMM (RA-3DMM),
which consists of 3 different resolution 3DMMs: High-Resolution 3DMM
(HR-3DMM), Medium-Resolution 3DMM (MR-3DMM) and Low-Resolution 3DMM
(LR-3DMM). RA-3DMM can automatically select the best model to fit the input images
of different resolutions. The multi-resolution model was evaluated in experiments
conducted on PIE and XM2VTS databases. The experimental results verified that HR-3DMM
achieves the best performance for input images of high resolution, and MR-3DMM
and LR-3DMM worked best for medium and low resolution input images, respectively.
A model selection strategy incorporated in the RA-3DMM is proposed based
on these results. The RA-3DMM model has been applied to pose correction of face images
ranging from high to low resolution. The face verification results obtained with
the pose-corrected images show considerable performance improvement over the result
without pose correction in all resolutions.
Automation of HEp-2 cell pattern classification would drastically improve the accuracy and throughput of diagnostic services for many auto-immune diseases, but it has proven difficult to reach a sufficient level of precision. Correct diagnosis relies on a subtle assessment of texture type in microscopic images of indirect immunofluorescence (IIF), which so far has eluded reliable replication through automated measurements. We introduce a combination of spectral analysis and multiscale digital filtering to extract the most discriminative variables from the cell images. We also apply multistage classification techniques to make optimal use of the limited labelled data set. Overall error rate of 1.6% is achieved in recognition of 6 different cell patterns, which drops to 0.5% if only positive samples are considered.
Reconstructing 3D face shape from a single 2D photograph as well as from video is an inherently ill-posed problem with many ambiguities. One way to solve some of the ambiguities is using a 3D face model to aid the task. 3D Morphable Face Models (3DMMs) are amongst the state of the art methods for 3D face reconstruction, or so called 3D model fitting. However, current existing methods have severe limitations, and most of them have not been trialled on in-the-wild data. Current analysis-by-synthesis methods form complex non-linear optimisation processes, and optimisers often get stuck in local optima. Further, most existing methods are slow, requiring in the order of minutes to process one photograph.
This thesis presents an algorithm to reconstruct 3D face shape from a single image as well as from sets of images or video frames in real-time. We introduce a solution for linear fitting of a PCA shape identity model and expression blendshapes to 2D facial landmarks. To improve the accuracy of the shape, a fast face contour fitting algorithm is introduced. These different components of the algorithm are run in iteration, resulting in a fast, linear shape-to-landmarks fitting algorithm. The algorithm is specifically designed to fit landmarks obtained from in-the-wild images, tackling conditions that occur in such images, like facial expressions and the mismatch of 2D-3D contour correspondences. It achieves the shape reconstruction accuracy of much more complex, nonlinear state-of-the-art methods, while being multiple orders of magnitude faster.
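The linear shape-to-landmarks step can be illustrated with a minimal sketch. It assumes a known affine camera (`A`, `t`), fits only PCA identity coefficients, and leaves out the expression blendshapes, the contour fitting and the iteration described in the thesis. All names, shapes and the regulariser `lam` are hypothetical choices for illustration, not the interface of the thesis software.

```python
import numpy as np

def fit_shape_to_landmarks(landmarks_2d, mean, basis, A, t, lam=1e-3):
    """Regularised linear least-squares fit of PCA shape coefficients.

    landmarks_2d: (N, 2) observed 2D landmarks
    mean:         (N, 3) mean 3D positions of the landmark vertices
    basis:        (N, 3, K) PCA shape basis restricted to those vertices
    A, t:         (2, 3) affine camera matrix and (2,) translation (assumed known)
    """
    N, _, K = basis.shape
    # Project every basis vector through the camera: (N, 2, K) -> (2N, K).
    projected_basis = np.einsum('ij,njk->nik', A, basis).reshape(2 * N, K)
    # Residual between the observations and the projected mean shape.
    residual = (landmarks_2d - (mean @ A.T + t)).reshape(2 * N)
    # Closed-form ridge solution of the normal equations.
    lhs = projected_basis.T @ projected_basis + lam * np.eye(K)
    return np.linalg.solve(lhs, projected_basis.T @ residual)
```

Because the camera is held fixed, the problem is linear in the coefficients and is solved in closed form, which is what makes this kind of fitting fast compared with non-linear analysis-by-synthesis optimisation.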
Second, we address the problem of fitting to sets of multiple images of the same person, as well as monocular video sequences. We extend the proposed shape-to-landmarks fitting to multiple frames by using the knowledge that all images are from the same identity. To recover facial texture, the approach uses texture from the original images, instead of employing the often-used PCA albedo model of a 3DMM. We employ an algorithm that merges texture from multiple frames in real-time based on a weighting of each triangle of the reconstructed shape mesh.
Last, we make the proposed real-time 3D morphable face model fitting algorithm available as open-source software. In contrast to the ubiquitously available 2D-based face models and code, there is a general lack of software for 3D morphable face model fitting, which hinders widespread adoption. The library thus constitutes a significant contribution to the community.
The state of classifier incongruence in decision making systems incorporating
multiple classifiers is often an indicator of anomaly caused by an unexpected
observation or an unusual situation. Its assessment is important as one of the
key mechanisms for domain anomaly detection. In this paper, we investigate
the sensitivity of Delta divergence, a novel measure of classifier incongruence, to
estimation errors. Statistical properties of Delta divergence are analysed both
theoretically and experimentally. The results of the analysis provide guidelines
on the selection of threshold for classifier incongruence detection based on this
Crouch D, Winney B, Koppen Willem, Christmas William, Hutnik K, Day T, Meena D, Boumertit A, Hysi P, Nessa A, Spector T, Kittler Josef, Bodmer W (2018) The genetics of the human face: Identification of large-effect single gene variants, Proceedings of the National Academy of Sciences 115 (4) pp. E676-E685
National Academy of Sciences
To discover specific variants with relatively large effects on the
human face, we have devised an approach to identifying facial
features with high heritability. This is based on using twin data to
estimate the additive genetic value of each point on a face, as
provided by a 3D camera system. In addition, we have used the
ethnic difference between East Asian and European faces as a
further source of face genetic variation. We use principal components
(PCs) analysis to provide a fine definition of the surface
features of human faces around the eyes and of the profile, and
chose upper and lower 10% extremes of the most heritable PCs for
looking for genetic associations. Using this strategy for the
analysis of 3D images of 1,832 unique volunteers from the well-characterized
People of the British Isles study and 1,567 unique
twin images from the TwinsUK cohort, together with genetic data
for 500,000 SNPs, we have identified three specific genetic variants
with notable effects on facial profiles and eyes.
This paper investigates the evaluation of dense
3D face reconstruction from a single 2D image in the wild.
To this end, we organise a competition that provides a new
benchmark dataset that contains 2000 2D facial images of
135 subjects as well as their 3D ground truth face scans. In
contrast to previous competitions or challenges, the aim of this
new benchmark dataset is to evaluate the accuracy of a 3D
dense face reconstruction algorithm using real, accurate and
high-resolution 3D ground truth face scans. In addition to the
dataset, we provide a standard protocol as well as a Python
script for the evaluation. Last, we report the results obtained
by three state-of-the-art 3D face reconstruction systems on the
new benchmark dataset. The competition is organised along
with the 2018 13th IEEE Conference on Automatic Face &
In recent years, facial landmark detection (also known as face alignment or facial landmark localisation) has become a very active area, due to its importance to a variety of image and video-based face analysis systems, such as face recognition, emotion analysis, human-computer interaction and 3D face reconstruction. This article looks at the challenges and latest technology advances in facial landmarks.
We present a new loss function, namely Wing loss, for robust
facial landmark localisation with Convolutional Neural
Networks (CNNs). We first compare and analyse different
loss functions including L2, L1 and smooth L1. The
analysis of these loss functions suggests that, for the training
of a CNN-based localisation model, more attention
should be paid to small and medium range errors. To this
end, we design a piece-wise loss function. The new loss
amplifies the impact of errors from the interval (-w, w) by
switching from L1 loss to a modified logarithm function.
To address the problem of under-representation of samples
with large out-of-plane head rotations in the training
set, we propose a simple but effective boosting strategy, referred
to as pose-based data balancing. In particular, we
deal with the data imbalance problem by duplicating the
minority training samples and perturbing them by injecting
random image rotation, bounding box translation and
other data augmentation approaches. Last, the proposed
approach is extended to create a two-stage framework for
robust facial landmark localisation. The experimental results
obtained on AFLW and 300W demonstrate the merits
of the Wing loss function, and prove the superiority of the
proposed method over the state-of-the-art approaches.
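As a rough illustration of the piece-wise loss described above, the following sketch implements one common formulation of Wing loss in NumPy. The abstract only states that the loss switches between L1 and a modified logarithm on the interval (-w, w). The curvature parameter `epsilon`, the offset `C` that makes the two pieces meet at |x| = w, and the default values are assumptions here and should be checked against the paper.

```python
import numpy as np

def wing_loss(x, w=10.0, epsilon=2.0):
    """Element-wise Wing loss of the regression error x (assumed formulation).

    |x| <  w : w * ln(1 + |x| / epsilon)   (logarithmic part, steep near zero)
    |x| >= w : |x| - C                     (L1 part, shifted for continuity)
    """
    abs_x = np.abs(np.asarray(x, dtype=float))
    # Offset chosen so that both branches take the same value at |x| = w.
    C = w - w * np.log(1.0 + w / epsilon)
    return np.where(abs_x < w, w * np.log(1.0 + abs_x / epsilon), abs_x - C)
```

With these defaults the logarithmic branch returns a larger value than plain L1 for small errors, which is the amplification of small and medium range errors that the abstract refers to.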
In pattern recognition, disagreement between two
classifiers regarding the predicted class membership of an observation
can be indicative of an anomaly and its nuance. As
in general classifiers base their decisions on class a posteriori
probabilities, the most natural approach to detecting classifier
incongruence is to use divergence. However, existing divergences
are not particularly suitable to gauge classifier incongruence. In
this paper, we postulate the properties that a divergence measure
should satisfy and propose a novel divergence measure, referred
to as Delta divergence. In contrast to existing measures, it focuses
on the dominant (most probable) hypotheses and thus reduces the
effect of the probability mass distributed over the non-dominant
hypotheses (clutter). The proposed measure satisfies other important
properties such as symmetry, and independence of classifier
confidence. The relationship of the proposed divergence to some
baseline measures, and its superiority, is shown experimentally.
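To illustrate why a standard divergence can be a poor gauge of incongruence, the sketch below evaluates the Kullback-Leibler divergence on two pairs of posterior vectors. This is not the Delta divergence proposed in the paper, and the probability values are made up for illustration. In the first pair both classifiers agree on the dominant class and differ only in how the remaining mass is spread over the clutter. In the second pair they disagree on the decision.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two posterior vectors."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical posteriors: same dominant class (index 0), different clutter.
p_agree = [0.60, 0.38, 0.01, 0.01]
q_agree = [0.60, 0.01, 0.01, 0.38]

# Hypothetical posteriors: different dominant class, almost identical clutter.
p_disagree = [0.51, 0.47, 0.01, 0.01]
q_disagree = [0.47, 0.51, 0.01, 0.01]

kl_agree = kl_divergence(p_agree, q_agree)
kl_disagree = kl_divergence(p_disagree, q_disagree)
print(kl_agree, kl_disagree)
```

In this toy case the divergence is larger for the pair that agrees than for the pair that disagrees, which is the kind of behaviour a measure focused on the dominant hypotheses is meant to avoid.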
This guest editorial introduces the twenty two papers accepted for this Special Issue on Articulated Motion and Deformable Objects (AMDO). They are grouped into four main categories within the field of AMDO: human motion analysis (action/gesture), human pose estimation, deformable shape segmentation, and face analysis. For each of the four topics, a survey of the recent developments in the field is presented. The accepted papers are briefly introduced in the context of this survey. They contribute novel methods, algorithms with improved performance as measured on benchmarking datasets, as well as two new datasets for hand action detection and human posture analysis. The special issue should be of high relevance to the reader interested in AMDO recognition and promote future research directions in the field.
The paper presents a dictionary integration algorithm
using 3D morphable face models (3DMM) for pose-invariant
collaborative-representation-based face classification.
To this end, we first fit a 3DMM to the 2D face images of
a dictionary to reconstruct the 3D shape and texture of each
image. The 3D faces are used to render a number of virtual
2D face images with arbitrary pose variations to augment the
training data, by merging the original and rendered virtual
samples to create an extended dictionary. Second, to reduce
the information redundancy of the extended dictionary and
improve the sparsity of reconstruction coefficient vectors using
collaborative-representation-based classification (CRC), we
exploit an on-line class elimination scheme to optimise the
extended dictionary by identifying the training samples of the
most representative classes for a given query. The final goal is
to perform pose-invariant face classification using the proposed
dictionary integration method and the on-line pruning strategy
under the CRC framework. Experimental results obtained for
a set of well-known face datasets demonstrate the merits of the
proposed method, especially its robustness to pose variations.
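For readers unfamiliar with collaborative-representation-based classification, a minimal sketch of the basic decision rule follows. It codes the query over the whole dictionary with ridge regression and assigns the class whose atoms give the smallest reconstruction residual. The 3DMM-based dictionary extension and the on-line class elimination from the paper are not shown, the plain residual is a simplification, and the function name and the value of `lam` are hypothetical.

```python
import numpy as np

def crc_classify(D, labels, y, lam=0.01):
    """Classify query y with a ridge-regularised collaborative representation.

    D:      (d, n) dictionary, one training sample per column
    labels: (n,) class label of each column
    y:      (d,) query sample
    """
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    # Code the query over all atoms jointly (closed-form ridge regression).
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
    classes = np.unique(labels)
    # Reconstruction residual using only the atoms of each class.
    residuals = [np.linalg.norm(y - D[:, labels == c] @ alpha[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```

A usage example with a hypothetical two-class dictionary: `crc_classify([[1, 0.9, 0, 0.1], [0, 0.1, 1, 0.9], [0, 0, 0, 0]], [0, 0, 1, 1], [0.1, 1.0, 0.0])` selects the class whose atoms lie closest to the query.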
The problem of re-identification of people in a crowd commonly
arises in real application scenarios, yet it has received less attention
than it deserves. To facilitate research focusing on this problem, we
have embarked on constructing a new person re-identification dataset
with many instances of crowded indoor and outdoor scenes. This paper proposes a two-stage robust method for pedestrian detection in a
complex crowded background to provide bounding box annotations. The
first stage is to generate pedestrian proposals using Faster R-CNN and
locate each pedestrian using Non-maximum Suppression (NMS). Candidates in dense proposal regions are merged to identify crowd patches.
We then apply a bottom-up human pose estimation method to detect
individual pedestrians in the crowd patches. The locations of all subjects are obtained from the bounding boxes produced by the two stages. The
identity of the detected subjects throughout each video is then automatically annotated using multiple features and spatial-temporal clues. The
experimental results on a crowded pedestrians dataset demonstrate the
effectiveness and efficiency of the proposed method.
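The Non-maximum Suppression step mentioned in the first stage can be sketched as the standard greedy procedure below. This is a generic NumPy version under the assumption of `[x1, y1, x2, y2]` boxes with positive area. The IoU threshold and the function name are illustrative, not values taken from the paper.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection of the current best box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[iou <= iou_threshold]
    return keep
```

In dense crowds this greedy rule tends to suppress true detections of heavily overlapping people, which is consistent with the paper's choice to handle crowd patches separately with pose estimation.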
Error Correcting Output Coding (ECOC) is a multi-class
classification technique in which multiple binary classifiers
are trained according to a preset code matrix such that each one
learns a separate dichotomy of the classes. While ECOC is one of
the best solutions for multi-class problems, one issue which makes
it suboptimal is that the training of the base classifiers is done
independently of the generation of the code matrix.
In this paper, we propose to modify a given ECOC matrix
to improve its performance by reducing this decoupling. The
proposed algorithm uses beam search to iteratively modify the
original matrix, using validation accuracy as a guide. It does not
involve further training of the classifiers and can be applied to
any ECOC matrix.
We evaluate the accuracy of the proposed algorithm (BeamECOC)
using 10-fold cross-validation experiments on 6 UCI
datasets, using random code matrices of different sizes, and base
classifiers of different strengths. Compared to the random ECOC
approach, BeamECOC increases the average cross-validation
of the experimental settings involving all
datasets, and gives better results than the state-of-the-art in
of the scenarios. By employing BeamECOC, it is also possible to
reduce the number of columns of a random matrix down to
and still obtain comparable or even better results at times.
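As background for the code matrix discussed above, the sketch below shows plain ECOC decoding by Hamming distance. The 4-class, 6-column matrix is a hypothetical example built for illustration, and the beam search that modifies the matrix in BeamECOC is not shown.

```python
import numpy as np

# Hypothetical code matrix: one row (codeword) per class, one column per
# binary classifier. Entries are the +1/-1 targets of each dichotomy.
CODE_MATRIX = np.array([
    [+1, +1, +1, +1, +1, +1],
    [+1, +1, +1, -1, -1, -1],
    [+1, -1, -1, +1, -1, -1],
    [-1, +1, -1, -1, +1, -1],
])

def ecoc_decode(code_matrix, predictions):
    """Return the class whose codeword is nearest in Hamming distance."""
    code_matrix = np.asarray(code_matrix)
    predictions = np.asarray(predictions)
    distances = np.sum(code_matrix != predictions, axis=1)
    return int(np.argmin(distances)), distances

# Outputs of the six binary classifiers for a class-2 sample, with the
# first classifier making an error.
predicted_class, distances = ecoc_decode(CODE_MATRIX, [-1, -1, -1, +1, -1, -1])
print(predicted_class, distances)
```

Any two codewords in this matrix differ in at least three positions, so a single wrong base classifier still leaves the correct class nearest. That is the error-correcting property that both the code matrix design and the base classifier accuracy jointly determine.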
In recent years, discriminative correlation filter
(DCF) based algorithms have significantly advanced the state of the art in visual object tracking. The key to the success of DCF is an efficient discriminative regression model trained
with powerful multi-cue features, including both hand-crafted and deep neural network features. However, the tracking performance is hindered by their inability to respond adequately to abrupt target appearance variations. This issue is posed by the limited representation capability of fixed image features. In this work, we set out to rectify this shortcoming by proposing a complementary representation of a visual content. Specifically, we propose the use of a collaborative representation between
successive frames to extract the dynamic appearance information from a target with rapid appearance changes, which results in suppressing the undesirable impact of the background. The resulting collaborative representation coefficients are combined
with the original feature maps using a spatially regularised DCF framework for performance boosting. The experimental results on several benchmarking datasets demonstrate the effectiveness and robustness of the proposed method, as compared with a
number of state-of-the-art tracking algorithms.
Fitting 3D Morphable Face Models (3DMM) to a 2D face image allows the separation of face shape from skin texture, as well as correction for face expression. However, the recovered 3D face representation is not readily amenable to processing by convolutional neural networks (CNN). We propose a conformal mapping from a 3D mesh to a 2D image, which makes these machine learning tools applicable to 3D face data. Experiments with a CNN based face recognition system designed using the proposed representation have been carried out to validate the advocated approach. The results obtained on standard benchmarking data sets show its promise.
The abrupt expansion of Internet use over the last decade has led to an uncontrollable amount of media stored on the Web. Image, video and news information has flooded the pool of data that is at our disposal and advanced data mining techniques need to be developed in order to take full advantage of them. The focus of this thesis is mainly on developing robust video analysis technologies concerned with detecting and recognizing activities in video. The work aims at developing a compact activity descriptor with low computational cost, which will be robust enough to discriminate easily among diverse activity classes.
Additionally, we introduce a motion compensation algorithm which alleviates any issues introduced by a moving camera and is used to create motion binary masks, referred to as compensated Activity Areas (cAA), where dense interest points are sampled. Motion and appearance descriptors invariant to scale and illumination changes are then computed around them and a thorough evaluation of their merit is carried out. The notion of Motion Boundaries Activity Areas (MBAA) is then introduced. The concept differs from cAA in terms of the area it focuses on (i.e. human boundaries), further reducing the computational cost of the activity descriptor. A novel algorithm that computes human trajectories, referred to as 'optimal trajectories', with variable temporal scale is introduced. It is based on the Statistical Sequential Change Detection (SSCD) algorithm, which allows dynamic segmentation of trajectories based on their motion pattern and facilitates their classification with better accuracy. Finally, we introduce an activity detection algorithm, which segments long duration
videos in an accurate but computationally efficient manner. We advocate the Statistical Sequential Boundary Detection (SSBD) method as a means of analysing motion patterns and report improvement over the state of the art.
Huang Zengxi, Feng Zhenhua, Kittler Josef, Liu Yiguang (2018) Improve the Spoofing Resistance of Multimodal Verification with Representation-Based Measures, In: Lai Jian-Huang, Liu Cheng-Lin, Chen Xilin, Zhou Jie, Tan Tieniu, Zheng Nanning, Zha Hongbin (eds.), Pattern Recognition and Computer Vision. PRCV 2018. 11258 pp. 388-399
Recently, the security of multimodal verification has become a growing concern since many fusion systems have been known to be easily deceived by partial spoof attacks, i.e. only a subset of modalities is spoofed. In this paper, we verify such a vulnerability and propose to use two representation-based metrics to close this gap. Firstly, we use the collaborative representation fidelity with non-target subjects to measure the affinity of a query sample to the claimed client. We further consider sparse coding as a competing comparison among the client and the non-target subjects, and hence explore two sparsity-based measures for recognition. Last, we select the representation-based measure, and assemble its score and the affinity score of each modality to train a support vector machine classifier. Our experimental results on a chimeric multimodal database with face and ear traits demonstrate that in both regular verification and partial spoof attacks, the proposed method significant
Effective data augmentation is crucial for facial landmark localisation with Convolutional Neural Networks (CNNs). In this letter, we investigate different data augmentation techniques that can be used to generate sufficient data for training CNN-based facial landmark localisation systems. To the best of our knowledge, this is the first study that provides a systematic analysis of different data augmentation techniques in the area. In addition, an online Hard Augmented Example Mining (HAEM) strategy is advocated for further performance boosting. We examine the effectiveness of those techniques using a regression-based CNN architecture. The experimental results obtained on the AFLW and COFW datasets demonstrate the importance of data augmentation and the effectiveness of HAEM. The performance achieved using these techniques is superior to the state-of-the-art algorithms.
In this letter, we formulate sparse subspace clustering as a smoothed ℓp (0 < p < 1) minimization problem (SSC-SLp) and present a unified formulation for different practical clustering problems by introducing a new pseudo norm. Generally, the use of ℓp (0 < p < 1) norm approximating the ℓ0 one can lead to a more effective approximation than the ℓp norm, while the ℓp-regularization also causes the objective function to be non-convex and non-smooth. Besides, better adapting to the property of data representing real problems, the objective function is usually constrained by multiple factors (such as spatial distribution of data and errors). In view of this, we propose a computationally efficient method for solving the multi-constrained non-smooth ℓp minimization problem, which smooths the ℓp norm and minimizes the objective function by alternately updating a block (or a variable) and its weight. In addition, the convergence of the proposed algorithm is theoretically proven. Extensive experimental results on real datasets demonstrate the effectiveness of the proposed method.
Bober Miroslaw, Kittler Josef (1994) Robust motion analysis, 1994 IEEE computer society conference on computer vision and pattern recognition, proceedings pp. 947-952
IEEE, Computer Soc Press
With efficient appearance learning models, Discriminative Correlation Filter (DCF) has been proven to be very successful in recent video object tracking benchmarks and competitions. However, the existing DCF paradigm suffers from two major issues, i.e., spatial boundary effect and temporal filter degradation. To mitigate these challenges, we propose a new DCF-based tracking method. The key innovations of the proposed method include adaptive spatial feature selection and temporal consistent constraints, with which the new tracker enables joint spatial-temporal filter learning in a lower dimensional discriminative manifold. More specifically, we apply structured spatial sparsity constraints to multi-channel filters. Consequently, the process of learning spatial filters can be approximated by the lasso regularisation. To encourage temporal consistency, the filter model is restricted to lie around its historical value and updated locally to preserve the global structure in the manifold. Last, a unified optimisation framework is proposed to jointly select temporal consistency preserving spatial features and learn discriminative filters with the augmented Lagrangian method. Qualitative and quantitative evaluations have been conducted on a number of well-known benchmarking datasets such as OTB2013, OTB50, OTB100, Temple-Colour, UAV123 and VOT2018. The experimental results demonstrate the superiority of the proposed method over the state-of-the-art approaches.
Appearance variations result in many difficulties in face image analysis. To deal with this challenge, we present
a Unified Tensor-based Active Appearance Model (UT-AAM) for jointly modelling the geometry and texture
information of 2D faces. For each type of face information, namely shape and texture, we construct a unified
tensor model capturing all relevant appearance variations. This contrasts with the variation-specific models
of the classical tensor AAM. To achieve the unification across pose variations, a strategy for dealing with
self-occluded faces is proposed to obtain consistent shape and texture representations of pose-varied faces.
In addition, our UT-AAM is capable of constructing the model from an incomplete training dataset, using
tensor completion methods. Last, we use an effective cascaded-regression-based method for UT-AAM fitting.
With these advancements, the utility of UT-AAM in practice is considerably enhanced. As an example, we
demonstrate the improvements in training facial landmark detectors through the use of UT-AAM to synthesise
a large number of virtual samples. Experimental results obtained on a number of well-known face datasets
demonstrate the merits of the proposed approach.
Fatemifar Soroush, Arashloo Shervin Rahimzadeh, Awais Muhammad, Kittler Josef (2019) Spoofing Attack Detection by Anomaly Detection, Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019) pp. 8464-8468
Institute of Electrical and Electronics Engineers (IEEE)
Spoofing attacks on biometric systems can seriously compromise their practical utility. In this paper we focus on face spoofing detection. The majority of papers on spoofing attack detection formulate the problem as a two or multiclass learning task, attempting to separate normal accesses from samples of different types of spoofing attacks. In this paper we adopt the anomaly detection approach proposed in , where the detector is trained on genuine accesses only using one-class classifiers and investigate the merit of subject specific solutions. We show experimentally that subject specific models are superior to the commonly used client independent method. We also demonstrate that the proposed approach is more robust than multiclass formulations to unseen attacks.
We propose a new Group Feature Selection method for Discriminative Correlation Filters (GFS-DCF) based visual object tracking. The key innovation of the proposed method is to perform group feature selection across both channel and spatial dimensions, thus to pinpoint the structural relevance of multi-channel features to the filtering system. In contrast to the widely used spatial regularisation or feature selection methods, to the best of our knowledge, this is the first time that channel selection has been advocated for DCF-based tracking. We demonstrate that our GFS-DCF method is able to significantly improve the performance of a DCF tracker equipped with deep neural network features. In addition, our GFS-DCF enables joint feature selection and filter learning, achieving enhanced discrimination and interpretability of the learned filters.
To further improve the performance, we adaptively integrate historical information by constraining filters to be smooth across temporal frames, using an efficient low-rank approximation. By design, specific temporal-spatial-channel configurations are dynamically learned in the tracking process, highlighting the relevant features, and alleviating the performance degrading impact of less discriminative representations and reducing information redundancy. The experimental results obtained on OTB2013, OTB2015, VOT2017, VOT2018 and TrackingNet demonstrate the merits of our GFS-DCF and its superiority over the state-of-the-art trackers. The code is publicly available at https://github.com/XU-TIANYANG/GFS-DCF.
Discriminative correlation filter (DCF) has achieved advanced performance in visual object tracking with remarkable efficiency guaranteed by its implementation in the frequency domain. However, the effect of the structural relationship of DCF and object features has not been adequately explored in the context of the filter design. To remedy this deficiency, this paper proposes a Low-rank and Sparse DCF (LSDCF) that improves the relevance of features used by discriminative filters. To be more specific, we extend the classical DCF paradigm from ridge regression to lasso regression, and constrain the estimate to be of low-rank across frames, thus identifying and retaining the informative filters distributed on a low-dimensional manifold. To this end, specific temporal-spatial-channel configurations are adaptively learned to achieve enhanced discrimination and interpretability. In addition, we analyse the complementary characteristics between hand-crafted features and deep features, and propose a coarse-to-fine heuristic tracking strategy to further improve the performance of our LSDCF. Last, the augmented Lagrange multiplier optimisation method is used to achieve efficient optimisation. The experimental results obtained on a number of well-known benchmarking datasets, including OTB2013, OTB50, OTB100, TC128, UAV123, VOT2016 and VOT2018, demonstrate the effectiveness and robustness of the proposed method, delivering outstanding performance compared to the state-of-the-art trackers.
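For orientation, the classical ridge-regression DCF that the abstract above takes as its starting point has a closed-form solution in the frequency domain. The following is a minimal single-channel sketch of that baseline only, not of LSDCF itself. The function names, the Gaussian label and the regularisation weight `lam` are illustrative assumptions.

```python
import numpy as np

def gaussian_label(shape, centre, sigma=2.0):
    """Desired response: a Gaussian peak at `centre` (hypothetical helper)."""
    rows, cols = np.indices(shape)
    d2 = (rows - centre[0]) ** 2 + (cols - centre[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_dcf(x, y, lam=1e-2):
    """Per-frequency ridge regression: H = conj(X) * Y / (|X|^2 + lam)."""
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def respond(H, z):
    """Filter response on a search patch z; its peak marks the target location."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))

# Usage: train on one patch, then locate a circularly shifted copy of it.
rng = np.random.default_rng(0)
patch = rng.normal(size=(16, 16))
H = train_dcf(patch, gaussian_label(patch.shape, (5, 7)))
shifted = np.roll(patch, (2, 3), axis=(0, 1))
peak = np.unravel_index(np.argmax(respond(H, shifted)), patch.shape)
```

Because every frequency bin is solved independently, training costs only a few FFTs, which is the efficiency the abstract attributes to the frequency-domain implementation.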
Visual semantic information comprises two important parts: the meaning of each visual semantic unit and the coherent visual semantic relation conveyed by these visual semantic units. Essentially, the former one is a visual perception task while the latter one corresponds to visual context reasoning. Remarkable advances in visual perception have been achieved due to the success of deep learning. In contrast, visual semantic information pursuit, a visual scene semantic interpretation task combining visual perception and visual context reasoning, is still in its early stage. It is the core task of many different computer vision applications, such as object detection, visual semantic segmentation, visual relationship detection or scene graph generation. Since it helps to enhance the accuracy and the consistency of the resulting interpretation, visual context reasoning is often incorporated with visual perception in current deep end-to-end visual semantic information pursuit methods. Surprisingly, a comprehensive review for this exciting area is still lacking. In this survey, we present a unified theoretical paradigm for all these methods, followed by an overview of the major developments and the future trends in each potential direction. The common benchmark datasets, the evaluation metrics and the comparisons of the corresponding methods are also introduced.
In what way is information processing influenced by the rules underlying a dynamic scene? In two studies we consider this question by examining the relationship between attention allocation in a dynamic visual scene (i.e., a singles tennis match) and the absence/presence of rule application (i.e., a point allocation task). During training participants observed short clips of a tennis match, and for each they indicated the order of the items (e.g., players, ball, court lines, umpire, and crowd) from most to least attended. Participants performed a similar task in the test phase, but were also presented with a specific goal, which was to indicate which of the two players won the point. In the second experiment, the effects of goal-directed vs non-goal-directed observation were compared based on behavioural measures (self-reported ranks and point allocation) and eye-tracking data. Critical differences were revealed between observers regarding their attention allocation for items related to the specific goal (e.g., court lines). Overall, by varying the levels of goal specificity, observers showed different sensitivity to rule-based items in a dynamic visual scene according to the allocation of attention.
3D assisted 2D face recognition involves the process of reconstructing 3D faces from 2D images and solving
the problem of face recognition in 3D. To facilitate the use of deep neural networks, a 3D face, normally
represented as a 3D mesh of vertices and its corresponding surface texture, is remapped to image-like square
isomaps by a conformal mapping. Based on previous work, we assume that face recognition benefits more
from texture. In this work, we focus on the surface texture and its discriminatory information content for recognition
purposes. Our approach is to prepare a 3D mesh, the corresponding surface texture and the original 2D
image as triple input for the recognition network, to show that 3D data is useful for face recognition. Texture
enhancement methods to control the texture fusion process are introduced and we adapt data augmentation
methods. Our results show that texture-map-based face recognition can not only compete with state-of-the-art
systems under the same preconditions but also outperform standard 2D methods from recent years.
Sparse-representation-based classification (SRC) has been widely studied and developed for various practical signal classification applications. However, the performance of a SRC-based method is degraded when both the training and test data are corrupted. To counteract this problem, we propose an approach that learns representation with block-diagonal structure (RBDS) for robust image recognition. To be more specific, we first introduce a regularization term that captures the block-diagonal structure of the target representation matrix of the training data. The resulting problem is then solved by an optimizer. Last, based on the learned representation, a simple yet effective linear classifier is used for the classification task. The experimental results obtained on several benchmarking datasets demonstrate the efficacy of the proposed RBDS method. The source code of our proposed RBDS is accessible at https://github.com/yinhefeng/RBDS
Efficient and robust facial landmark localisation is crucial for the deployment of real-time face analysis systems. This paper presents a new loss function, namely Rectified Wing (RWing) loss, for regression-based facial landmark localisation with Convolutional Neural Networks (CNNs). We first systematically analyse different loss functions, including L2, L1 and smooth L1. The analysis suggests that the training of a network should pay more attention to small-medium errors. Motivated by this finding, we design a piece-wise loss that amplifies the impact of the samples with small-medium errors. In addition, we rectify the loss function for very small errors to mitigate the impact of inaccuracy of manual annotation. The use of our RWing loss boosts the performance significantly for regression-based CNNs in facial landmarking, especially for lightweight network architectures. To address the problem of under-representation of samples with large pose variations, we propose a simple but effective boosting strategy, referred to as pose-based data balancing. In particular, we deal with the data imbalance problem by duplicating the minority training samples and perturbing them by injecting random image rotation, bounding box translation and other data augmentation strategies. Last, the proposed approach is extended to create a coarse-to-fine framework for robust and efficient landmark localisation. Moreover, the proposed coarse-to-fine framework is able to deal with the small sample size problem effectively. The experimental results obtained on several well-known benchmarking datasets demonstrate the merits of our RWing loss and prove the superiority of the proposed method over the state-of-the-art approaches.
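The piece-wise behaviour described above (zero loss for very small errors, amplified gradients for small-medium errors, L1-like growth for large errors) can be sketched as follows. This is one piece-wise form consistent with that description. The abstract gives neither the formula nor its parameters, so the thresholds `r` and `w`, the curvature `eps` and the function name are assumptions.

```python
import numpy as np

def rwing_loss(x, r=0.5, w=10.0, eps=2.0):
    """Illustrative rectified piece-wise loss on residuals x.

    |x| < r       : 0 (rectified region, tolerant to annotation noise)
    r <= |x| < w  : w * ln(1 + (|x| - r) / eps) (steep for small-medium errors)
    |x| >= w      : |x| - c (L1-like for large errors)
    All parameter values here are assumed, not taken from the paper.
    """
    a = np.abs(np.asarray(x, dtype=float))
    # Offset chosen so the logarithmic and linear branches meet at |x| = w.
    c = w - w * np.log1p((w - r) / eps)
    log_part = w * np.log1p(np.maximum(a - r, 0.0) / eps)
    return np.where(a < r, 0.0, np.where(a < w, log_part, a - c))

# Usage: per-coordinate losses for a batch of landmark residuals (in pixels).
residuals = np.array([0.1, 1.0, 4.0, 25.0])
losses = rwing_loss(residuals)
```

With these assumed parameters the slope just above `r` is `w / eps`, which is larger than the unit slope of L1. That is how small-medium errors receive more weight during training.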
One-class spoofing detection approaches have been an effective alternative to two-class learners in face presentation attack detection, particularly in unseen attack scenarios. We propose an ensemble-based anomaly detection approach applicable to one-class classifiers. A new score normalisation method is proposed to normalise the output of individual outlier detectors before fusion. To comply with the accuracy and diversity objectives for the component classifiers, three different strategies are utilised to build a pool of anomaly experts. To boost the performance, we also make use of client-specific information both in the design of individual experts and in setting a distinct threshold for each client. We carry out extensive experiments on three face anti-spoofing datasets and show that the proposed ensemble approaches are comparable or superior to the techniques based on the two-class formulation or class-independent settings.
A cell-free massive multiple-input multiple-output (MIMO) uplink is considered, where quantize-and-forward (QF) refers to the case where both the channel estimates and the received signals are quantized at the access points (APs) and forwarded to a central processing unit (CPU), whereas in combine-quantize-and-forward (CQF), the APs send the quantized version of the combined signal to the CPU. To solve the non-convex sum rate maximization problem, a heuristic sub-optimal scheme is exploited to convert the power allocation problem into a standard geometric programme (GP). We exploit the knowledge of the channel statistics to design the power elements. Employing large-scale fading (LSF) with a deep convolutional neural network (DCNN) enables us to determine a mapping from the LSF coefficients to the optimal power obtained by solving the sum rate maximization problem using the quantized channel. Four possible power control schemes are studied, which we refer to as i) small-scale fading (SSF)-based QF; ii) LSF-based CQF; iii) LSF use-and-then-forget (UatF)-based QF; and iv) LSF deep learning (DL)-based QF, according to where channel estimation is performed and exploited and how the optimization problem is solved. Numerical results show that for the same fronthaul rate, the throughput significantly increases thanks to the mapping obtained using DCNN.
Modern face recognition systems extract face representations using deep neural networks (DNNs) and give excellent identification and verification results when tested on high resolution (HR) images. However, the performance of such an algorithm degrades significantly for low resolution (LR) images. A straightforward solution could be to train a DNN using high and low resolution face images simultaneously. This approach yields a definite improvement at lower resolutions but suffers a performance degradation for high resolution images. To overcome this shortcoming, we propose to train a network using both HR and LR images under the guidance of a fixed network, pretrained on HR face images. The guidance is provided by minimising the KL-divergence between the output Softmax probabilities of the pretrained (i.e., Teacher) and trainable (i.e., Student) network as well as by sharing the Softmax weights between the two networks. The resulting solution is tested on down-sampled images from FaceScrub and MegaFace datasets and shows a consistent performance improvement across various resolutions. We also tested our proposed solution on standard LR benchmarks such as TinyFace and SCFace. Our algorithm consistently outperforms the state-of-the-art methods on these datasets, confirming the effectiveness and merits of the proposed method.
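The teacher-student guidance described above can be illustrated with a small NumPy sketch of the KL term computed on Softmax outputs that share one weight matrix. The direction KL(teacher || student), the batch averaging, the feature dimensions and all function names are assumptions made for illustration. The abstract does not specify them.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def shared_softmax_logits(features, shared_weights):
    """Logits from one (n_classes, dim) weight matrix used by both networks."""
    return features @ shared_weights.T

def kl_guidance(teacher_logits, student_logits, eps=1e-12):
    """Batch mean of KL(teacher || student); the direction is an assumption."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

# Usage with random stand-ins for HR (teacher) and LR (student) embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))            # shared Softmax weights (5 identities)
hr_features = rng.normal(size=(4, 8))  # would come from the fixed teacher
lr_features = rng.normal(size=(4, 8))  # would come from the trainable student
loss = kl_guidance(shared_softmax_logits(hr_features, W),
                   shared_softmax_logits(lr_features, W))
```

In an actual training loop this term would be minimised with respect to the student parameters only, since the teacher is described as fixed.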
The kernel null-space technique is known to be an effective one-class classification (OCC) technique. Nevertheless, the applicability of this method is limited due to its susceptibility to possible training data corruption and the inability to rank training observations according to their conformity with the model. This article addresses these shortcomings by regularizing the solution of the null-space kernel Fisher methodology in the context of its regression-based formulation. In this respect, first, the effect of the Tikhonov regularization in the Hilbert space is analyzed, where the one-class learning problem in the presence of contamination in the training set is posed as a sensitivity analysis problem. Next, the effect of the sparsity of the solution is studied. For both alternative regularization schemes, iterative algorithms are proposed which recursively update label confidences. Through extensive experiments, the proposed methodology is found to enhance robustness against contamination in the training set compared with the baseline kernel null-space method, as well as other existing approaches in the OCC paradigm, while providing the functionality to rank training samples effectively.
Strict "0-1" block-diagonal structure has been widely used for learning structured representation in face recognition problems. However, it is unreasonable to assume that the within-class representations are the same. To circumvent this problem, in this paper, we propose a slack block-diagonal (SBD) structure for representation where the target structure matrix is dynamically updated, yet its block-diagonal nature is preserved. Furthermore, in order to depict the noise in face images more precisely, we propose a robust dictionary learning algorithm based on a mixed-noise model by utilizing the above SBD structure (SBD2L). SBD2L considers that there exist two forms of noise in data, which are drawn from Laplacian and Gaussian distributions, respectively. Moreover, SBD2L introduces a low-rank constraint on the representation matrix to enhance the dictionary's robustness to noise. Extensive experiments on four benchmark databases show that the proposed SBD2L can achieve better classification results than several state-of-the-art dictionary learning methods.
Recently, correlation filters have been successfully applied to visual tracking, but the boundary effect severely restrains their tracking performance. In this paper, to overcome this problem, we propose a correlation tracking framework with an implicitly extended search region (TESR) that does not introduce background noise. The proposed tracking method is a two-stage detection framework. To implicitly extend the search region of the correlation tracker, we first add four further search centers to the original search center, which is given by the target location in the previous frame, so that TESR generates five potential object locations in total, one for each search center. Then, an SVM classifier is used to determine the correct target position. We also apply the salient object detection score to regularize the output of the SVM classifier to improve its performance. The experimental results demonstrate that TESR exhibits superior performance in comparison with the state-of-the-art trackers.
The importance of image set recognition based on videos captured in the wild is steadily increasing. However, the contents of these collected videos are often complicated, and how to efficiently perform set modeling and feature extraction is a big challenge in the computer vision (CV) community. Recently proposed image set classification methods have made considerable advances by modeling the original image set with a covariance matrix, a linear subspace, or a Gaussian distribution, whose distinctive geometries correspond to three types of Riemannian manifolds. However, most of these methods adopt a single geometric model to describe each set, which may lose information useful for classification. To tackle this, we propose a novel algorithm to model each image set from a multi-geometric perspective. Specifically, the covariance matrix, linear subspace, and Gaussian distribution are applied for set representation simultaneously. In order to fuse these multiple heterogeneous features, Riemannian kernel functions are first utilized to map them into high-dimensional Hilbert spaces. Then, a multi-kernel metric learning framework is devised to embed the learned hybrid kernels into a lower-dimensional common subspace for classification. We conduct experiments on four widely used datasets. Extensive experimental results justify its superiority over the state-of-the-art.
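To make the multi-geometric set representation more concrete, the sketch below builds two of the three descriptors named above (covariance matrix and linear subspace) for an image set and evaluates one commonly used kernel for each. The log-Euclidean and projection kernels are standard choices on the SPD and Grassmann manifolds, but whether the paper uses exactly these is an assumption. The Gaussian descriptor and the metric learning stage are omitted, and all function names are hypothetical.

```python
import numpy as np

def covariance_descriptor(X, ridge=1e-6):
    """X: (n_images, d) features of one set -> (d, d) SPD matrix."""
    C = np.cov(X, rowvar=False)
    return C + ridge * np.eye(C.shape[0])  # ridge keeps C strictly positive definite

def subspace_descriptor(X, q=3):
    """Orthonormal (d, q) basis of the leading q-dimensional subspace of the set."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:q].T

def spd_log(C):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def log_euclidean_kernel(C1, C2):
    """Linear kernel in the log domain: trace(log(C1) log(C2))."""
    return float(np.trace(spd_log(C1) @ spd_log(C2)))

def projection_kernel(U1, U2):
    """Grassmann projection kernel: squared Frobenius norm of U1^T U2."""
    return float(np.linalg.norm(U1.T @ U2, "fro") ** 2)

# Usage: two image sets with 6-dimensional features (random stand-ins).
rng = np.random.default_rng(0)
set_a, set_b = rng.normal(size=(20, 6)), rng.normal(size=(25, 6))
k_spd = log_euclidean_kernel(covariance_descriptor(set_a), covariance_descriptor(set_b))
k_sub = projection_kernel(subspace_descriptor(set_a), subspace_descriptor(set_b))
```

Each kernel yields one Gram matrix over the image sets. A multi-kernel method then has to decide how these heterogeneous Gram matrices are combined, which is the role the abstract assigns to the metric learning framework.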
In the domain of video-based image set classification, a considerable advance has been made by modeling each video sequence as a linear subspace, which typically resides on a Grassmann manifold. Due to the large intra-class variations, how to establish appropriate set models to encode these variations of set data and how to effectively measure the dissimilarity between any two image sets are two open challenges. To seek a possible way to tackle these issues, this paper presents a graph embedding multi-kernel metric learning (GEMKML) algorithm for image set classification. The proposed GEMKML implements set modeling, feature extraction, and classification in two steps. First, the proposed framework constructs a novel cascaded feature learning architecture on the Grassmann manifold in order to produce more effective Grassmann manifold-valued feature representations. To make better use of these learned features, a graph embedding multi-kernel metric learning scheme is then devised to map them into a lower-dimensional Euclidean space, where the inter-class distances are maximized and the intra-class distances are minimized. We evaluate the proposed GEMKML on four different video-based image set classification tasks using widely adopted datasets. The extensive classification results confirm its superiority over the state-of-the-art methods.