Dr Chi Ho Chan
Academic and research departmentsDepartment of Electrical and Electronic Engineering, Centre for Vision, Speech and Signal Processing (CVSSP).
Face detection and recognition benchmarks have shifted toward more difficult environments. The challenge presented in this paper addresses the next step in the direction of automatic detection and identification of people from outdoor surveillance cameras. While face detection has shown remarkable success in images collected from the web, surveillance cameras include more diverse occlusions, poses, weather conditions and image blur. Although face verification or closed-set face identification have surpassed human capabilities on some datasets, open-set identification is much more complex as it needs to reject both unknown identities and false accepts from the face detector. We show that unconstrained face detection can approach high detection rates albeit with moderate false accept rates. By contrast, open-set face recognition is currently weak and requires much more attention.
Modern face recognition systems extract face representations using deep neural networks (DNNs) and give excellent identification and verification results, when tested on high resolution (HR) images. However, the performance of such an algorithm degrades significantly for low resolution (LR) images. A straight forward solution could be to train a DNN, using simultaneously, high and low resolution face images. This approach yields a definite improvement at lower resolutions but suffers a performance degradation for high resolution images. To overcome this shortcoming, we propose to train a network using both HR and LR images under the guidance of a fixed network, pretrained on HR face images. The guidance is provided by minimising the KL-divergence between the output Softmax probabilities of the pretrained (i.e., Teacher) and trainable (i.e., Student) network as well as by sharing the Softmax weights between the two networks. The resulting solution is tested on down-sampled images from FaceScrub and MegaFace datasets and shows a consistent performance improvement across various resolutions. We also tested our proposed solution on standard LR benchmarks such as TinyFace and SCFace. Our algorithm consistently outperforms the state-of-the-art methods on these datasets, confirming the effectiveness and merits of the proposed method.
Face recognition (FR) using deep convolutional neural networks (DCNNs) has seen remarkable success in recent years. One key ingredient of DCNN-based FR is the design of a loss function that ensures discrimination between various identities. The state-of-the-art (SOTA) solutions utilise normalised Softmax loss with additive and/or multiplicative margins. Despite being popular and effective, these losses are justified only intuitively with little theoretical explanations. In this work, we show that under the LogSumExp (LSE) approximation, the SOTA Softmax losses become equivalent to a proxy-triplet loss that focuses on nearest-neighbour negative proxies only. This motivates us to propose a variant of the proxy-triplet loss, entitled Nearest Proxies Triplet (NPT) loss, which unlike SOTA solutions, converges for a wider range of hyper-parameters and offers flexibility in proxy selection and thus outperforms SOTA techniques. We generalise many SOTA losses into a single framework and give theoretical justifications for the assertion that minimising the proposed loss ensures a minimum separability between all identities. We also show that the proposed loss has an implicit mechanism of hard-sample mining. We conduct extensive experiments using various DCNN architectures on a number of FR benchmarks to demonstrate the efficacy of the proposed scheme over SOTA methods.
Performing facial recognition between Near Infrared (NIR) and visible-light (VIS) images has been established as a common method of countering illumination variation problems in face recognition. In this paper we present a new database to enable the evaluation of cross-spectral face recognition. A series of preprocessing algorithms, followed by Local Binary Pattern Histogram (LBPH) representation and combinations with Linear Discriminant Analysis (LDA) are used for recognition. These experiments are conducted on both NIR→VIS and the less common VIS→NIR protocols, with permutations of uni-modal training sets. 12 individual baseline algorithms are presented. In addition, the best performing fusion approaches involving a subset of 12 algorithms are also described. © 2011 IEEE.
Large pose and illumination variations are very challenging for face recognition. The 3D Morphable Model (3DMM) approach is one of the effective methods for pose and illumination invariant face recognition. However, it is very difficult for the 3DMM to recover the illumination of the 2D input image because the ratio of the albedo and illumination contributions in a pixel intensity is ambiguous. Unlike the traditional idea of separating the albedo and illumination contributions using a 3DMM, we propose a novel Albedo Based 3D Morphable Model (AB3DMM), which removes the illumination component from the images using illumination normalisation in a preprocessing step. A comparative study of different illumination normalisation methods for this step is conducted on PIE and Multi-PIE databases. The results show that overall performance of our method outperforms state-of-the-art methods.
We present a method to estimate, based on the horizontal symmetry, an intrinsic coordinate system of faces scanned in 3D. We show that this coordinate system provides an excellent basis for subsequent landmark positioning and model-based refinement such as Active Shape Models, outperforming other -explicit- landmark localisation methods including the commonly-used ICP+ASM approach. © 2012 ICPR Org Committee.
The 3D Morphable Model (3DMM) is currently receiving considerable attention for human face analysis. Most existing work focuses on fitting a 3DMM to high resolution images. However, in many applications, fitting a 3DMM to low-resolution images is also important. In this paper, we propose a Resolution-Aware 3DMM (RA- 3DMM), which consists of 3 different resolution 3DMMs: High-Resolution 3DMM (HR- 3DMM), Medium-Resolution 3DMM (MR-3DMM) and Low-Resolution 3DMM (LR-3DMM). RA-3DMM can automatically select the best model to fit the input images of different resolutions. The multi-resolution model was evaluated in experiments conducted on PIE and XM2VTS databases. The experimental results verified that HR- 3DMM achieves the best performance for input image of high resolution, and MR- 3DMM and LR-3DMM worked best for medium and low resolution input images, respectively. A model selection strategy incorporated in the RA-3DMM is proposed based on these results. The RA-3DMM model has been applied to pose correction of face images ranging from high to low resolution. The face verification results obtained with the pose-corrected images show considerable performance improvement over the result without pose correction in all resolutions
We present an algorithm that models the rate of change of biometric performance over time on a subject-dependent basis. It is called “homomorphic users grouping algorithm” or HUGA. Although the model is based on very simplistic assumptions that are inherent in linear regression, it has been applied successfully to estimate the performance of talking face and speech identity verification modalities, as well as their fusion, over a period of more than 600 days. Our experiments carried out on the MOBIO database show that subjects exhibit very different performance trends. While the performance of some users degrades over time, which is consistent with the literature, we also found that for a similar proportion of users, their performance actually improves with use. The latter finding has never been reported in the literature. Hence, our findings suggest that the problem of biometric performance degradation may be not as serious as previously thought, and so far, the community has ignored the possibility of improved biometric performance over time. The findings also suggest that adaptive biometric systems, that is, systems that attempt to update biometric templates, should be subject-dependent.
Face recognition subject to uncontrolled illumination and blur is challenging. Interestingly, image degradation caused by blurring, often present in real-world imagery, has mostly been overlooked by the face recognition community. Such degradation corrupts face information and affects image alignment, which together negatively impact recognition accuracy. We propose a number of countermeasures designed to achieve system robustness to blurring. First, we propose a novel blur-robust face image descriptor based on Local Phase Quantization (LPQ) and extend it to a multiscale framework (MLPQ) to increase its effectiveness. To maximize the insensitivity to misalignment, the MLPQ descriptor is computed regionally by adopting a component-based framework. Second, the regional features are combined using kernel fusion. Third, the proposed MLPQ representation is combined with the Multiscale Local Binary Pattern (MLBP) descriptor using kernel fusion to increase insensitivity to illumination. Kernel Discriminant Analysis (KDA) of the combined features extracts discriminative information for face recognition. Last, two geometric normalizations are used to generate and combine multiple scores from different face image scales to further enhance the accuracy. The proposed approach has been comprehensively evaluated using the combined Yale and Extended Yale database B (degraded by artificially induced linear motion blur) as well as the FERET, FRGC 2.0, and LFW databases. The combined system is comparable to state-of-the-art approaches using similar system configurations. The reported work provides a new insight into the merits of various face representation and fusion methods, as well as their role in dealing with variable lighting and blur degradation.
Lip region deformation during speech contains biometric information and is termed visual speech. This biometric information can be interpreted as being genetic or behavioral depending on whether static or dynamic features are extracted. In this paper, we use a texture descriptor called local ordinal contrast pattern (LOCP) with a dynamic texture representation called three orthogonal planes to represent both the appearance and dynamics features observed in visual speech. This feature representation, when used in standard speaker verification engines, is shown to improve the performance of the lip-biometric trait compared to the state-of-the-art. The best baseline state-of-the-art performance was a half total error rate (HTER) of 13.35% for the XM2VTS database. We obtained HTER of less than 1%. The resilience of the LOCP texture descriptor to random image noise is also investigated. Finally, the effect of the amount of video information on speaker verification performance suggests that with the proposed approach, speaker identity can be verified with a much shorter biometric trait record than the length normally required for voice-based biometrics. In summary, the performance obtained is remarkable and suggests that there is enough discriminative information in the mouth-region to enable its use as a primary biometric trait.
The lip-region can be interpreted as either a genetic or behavioural biometric trait depending on whether static or dynamic information is used. In this paper, we use a texture descriptor called Local Ordinal Contrast Pattern (LOCP) in conjunction with a novel spatiotemporal sampling method called Windowed Three Orthogonal Planes (WTOP) to represent both appearance and dynamics features observed in visual speech. This representation, with standard speaker verification engines, is shown to improve the performance of the lipbiometric trait compared to the state-of-the-art. The improvement obtained suggests that there is enough discriminative information in the mouth-region to enable its use as a primary biometric as opposed to a "soft" biometric trait.
3D face reconstruction of shape and skin texture from a single 2D image can be performed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach. However, performing this reconstruction (fitting) efficiently and accurately in a general imaging scenario is a challenge. Such a scenario would involve a perspective camera to describe the geometric projection from 3D to 2D, and the Phong model to characterise illumination. Under these imaging assumptions the reconstruction problem is nonlinear and, consequently, computationally very demanding. In this work, we present an efficient stepwise 3DMM-to-2D image-fitting procedure, which sequentially optimises the pose, shape, light direction, light strength and skin texture parameters in separate steps. By linearising each step of the fitting process we derive closed-form solutions for the recovery of the respective parameters, leading to efficient fitting. The proposed optimisation process involves all the pixels of the input image, rather than randomly selected subsets, which enhances the accuracy of the fitting. It is referred to as Efficient Stepwise Optimisation (ESO). The proposed fitting strategy is evaluated using reconstruction error as a performance measure. In addition, we demonstrate its merits in the context of a 3D-assisted 2D face recognition system which detects landmarks automatically and extracts both holistic and local features using a 3DMM. This contrasts with most other methods which only report results that use manual face landmarking to initialise the fitting. Our method is tested on the public CMU-PIE and Multi-PIE face databases, as well as one internal database. The experimental results show that the face reconstruction using ESO is significantly faster, and its accuracy is at least as good as that achieved by the existing 3DMM fitting algorithms. A face recognition system integrating ESO to provide a pose and illumination invariant solution compares favourably with other state-of-the-art methods. In particular, it outperforms deep learning methods when tested on the Multi-PIE database.
Assessing biometric performance is challenging because an experimental outcome depends on the choice of demographics, and the chosen application scenario of an experiment. If one can quantify biometric samples into good, bad, and ugly categories for one application, the proportion of these categories is likely to be different for another application. As a result, a typical performance curve of a biometric experiment cannot generalise to another different application setting, even though the same system is used. We propose an algorithm that is capable of generalising a biometric performance curve in terms of Detection Error Trade-off (DET) or equivalently Receiver’s Operating Characteristics (ROC), by allowing the user (system operator, policy-maker, biometric researcher) to explicitly set the proportion of data differently. This offers the possibility for the user to simulate different operating conditions that can better match the setting of a target application. We demonstrated the utility of the algorithm in three scenarios, namely, estimating the system performance under varying quality; spoof and zero-effort attacks; and cross-device matching. Based on the results of 1300 use-case experiments, we found that the quality of prediction on unseen (test) data, measured in terms of coverage, is typically between 60% and 80%, which is significantly better than random, that is, 50%.