Dr Mark Barnard



Publications

de Campos TE, Liu Q, Barnard M, S3A speaker tracking with Kinect2, University of Surrey
de Campos T, Barnard M, Mikolajczyk K, Kittler J, Yan F, Christmas W, Windridge D (2011) An evaluation of bags-of-words and spatio-temporal shapes for action recognition, 2011 IEEE Workshop on Applications of Computer Vision, WACV 2011 pp. 344-351
Bags-of-visual-Words (BoW) and Spatio-Temporal Shapes (STS) are two very popular approaches for action recognition from video. The former (BoW) is an un-structured global representation of videos which is built using a large set of local features. The latter (STS) uses a single feature located on a region of interest (where the actor is) in the video. Despite the popularity of these methods, no comparison between them has been done. Also, given that BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. This paper compares these two approaches using four different datasets with varied degree of space-time specificity of the actions and varied relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. Further to BoW and STS, we also evaluated novel variations of BoW constrained in time or space. We observe that the STS approach leads to better results in all datasets whose background is of little relevance to action classification. © 2010 IEEE.
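As a rough illustration of the BoW side of this comparison, the sketch below builds a visual-word codebook with k-means and pools a video's local descriptors into a single normalised histogram. It is a generic BoW encoder (using scikit-learn) rather than the feature extraction or classifier used in the paper; the descriptor dimensionality and codebook size are arbitrary stand-ins.

import numpy as np
from sklearn.cluster import KMeans

# Codebook learnt offline from pooled local descriptors (random stand-ins here).
train_descriptors = np.random.randn(10000, 72)
codebook = KMeans(n_clusters=256, n_init=4, random_state=0).fit(train_descriptors)

def bow_histogram(video_descriptors):
    """Hard-assign each descriptor to its nearest visual word and return
    the L1-normalised word histogram for the whole video."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# One histogram per video is then fed to the classifier.
video_hist = bow_histogram(np.random.randn(300, 72))
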
Barnard M, Wang W, Kittler J (2013) Audio head pose estimation using the direct to reverberant speech ratio, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 8056-8060
Head pose is an important cue in many applications such as speech recognition and face recognition. Most approaches to head pose estimation to date have used visual information to model and recognise a subject's head in different configurations. These approaches have a number of limitations, such as an inability to cope with occlusions, changes in the appearance of the head, and low resolution images. We present here a novel method for determining coarse head pose orientation purely from audio information, exploiting the direct to reverberant speech energy ratio (DRR) within a highly reverberant meeting room environment. Our hypothesis is that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This hypothesis is confirmed by experiments conducted on the publicly available AV16.3 database. © 2013 IEEE.
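The hypothesis above rests on the direct-to-reverberant ratio. Below is a minimal sketch of one common way to estimate DRR from a measured room impulse response; treating a short window around the strongest peak as the direct path, and the window length itself, are illustrative choices rather than the paper's exact procedure.

import numpy as np

def direct_to_reverberant_ratio(rir, fs, direct_window_ms=2.5):
    """Estimate the DRR (in dB) from a room impulse response, assuming the
    direct path lies in a short window around the strongest peak and
    everything after that window is reverberant energy."""
    peak = np.argmax(np.abs(rir))
    half_win = int(direct_window_ms * 1e-3 * fs)
    end = peak + half_win
    direct_energy = np.sum(rir[max(peak - half_win, 0):end] ** 2)
    reverb_energy = np.sum(rir[end:] ** 2)
    return 10.0 * np.log10(direct_energy / (reverb_energy + 1e-12))

# Toy RIR: a strong direct spike followed by an exponentially decaying tail.
fs = 16000
rir = np.zeros(fs)
rir[100] = 1.0
tail = np.arange(fs - 120)
rir[120:] = 0.05 * np.random.randn(fs - 120) * np.exp(-tail / 4000)
print(f"DRR: {direct_to_reverberant_ratio(rir, fs):.1f} dB")
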
Barnard M, Matilainen M, Heikkila J (2008) Body part segmentation of noisy human silhouette images, 2008 IEEE International Conference on Multimedia and Expo, Vols 1-4 pp. 1189-1192 IEEE
Chen M, Wang W, Barnard M, Chambers J (2017) Wideband DoA Estimation Based on Joint Optimisation of Array and Spatial Sparsity, Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO) pp. 2106-2110 Institute of Electrical and Electronics Engineers (IEEE)
We study the problem of wideband direction of arrival (DoA) estimation by joint optimisation of array and spatial sparsity. A two-step iterative process is proposed. In the first step, the wideband signal is reshaped and used as the input to derive the weight coefficients using a sparse array optimisation method. The weights are then used to scale the observed signal model, for which a compressive sensing based spatial sparsity optimisation method is used for DoA estimation. Simulations are provided to demonstrate the performance of the proposed method for both stationary and moving sources.
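The spatial-sparsity step can be pictured with a toy narrowband, single-snapshot example: steering vectors over a dense angle grid form an overcomplete dictionary, and an L1-regularised fit recovers a sparse spectrum whose peaks indicate the DoAs. The sketch below (scikit-learn's Lasso on stacked real/imaginary parts, a uniform linear array with made-up geometry) shows only this idea, not the paper's joint array and spatial sparsity optimisation or the wideband reshaping step.

import numpy as np
from sklearn.linear_model import Lasso

c, f, M, d = 343.0, 2000.0, 8, 0.04          # speed of sound, frequency (Hz), sensors, spacing (m)
angles = np.deg2rad(np.arange(-90, 91))       # candidate DoA grid
A = np.exp(-2j * np.pi * f * d * np.arange(M)[:, None] * np.sin(angles) / c)

# Simulated single snapshot with sources near -20 and 35 degrees plus noise.
true_idx = [np.argmin(np.abs(np.rad2deg(angles) - a)) for a in (-20.0, 35.0)]
x = A[:, true_idx] @ np.array([1.0, 0.8]) + 0.05 * (np.random.randn(M) + 1j * np.random.randn(M))

# Stack real and imaginary parts so a real-valued Lasso can be applied.
A_ri = np.vstack([A.real, A.imag])
x_ri = np.concatenate([x.real, x.imag])
s = Lasso(alpha=0.01, positive=True, max_iter=5000).fit(A_ri, x_ri).coef_
print("Top grid angles (deg):", np.rad2deg(angles[np.argsort(s)[-3:]]))
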
Barnard M, Heikkila J (2008) On Bin Configuration of Shape Context Descriptors in Human Silhouette Classification, Advanced Concepts for Intelligent Vision Systems, Proceedings 5259 pp. 850-859 Springer-Verlag Berlin
Grima S, Barnard M, Wang W (2011) Robust Multi-Camera Audio-Visual Tracking
Barnard M, Odobez J-M (2005) Sports event recognition using layered HMMs, IEEE International Conference on Multimedia and Expo, ICME 2005 pp. 1150-1153
The recognition of events in video data is a subject of much current interest. In this paper, we address several issues related to this topic. The first one is overfitting when very large feature spaces are used and relatively small amounts of training data are available. The second is the use of a framework that can recognise events at different time scales, as standard Hidden Markov Models (HMMs) do not model well long-term temporal dependencies in the data. In this paper we propose a method combining Layered HMMs and an unsupervised low-level clustering of the features to address these issues. Experiments conducted on the recognition task of different events in 7 rugby games demonstrate the potential of our approach with respect to standard HMM techniques coupled with a feature size reduction technique. While the current focus of this work is on events in sports videos, we believe the techniques shown here are general enough to be applied to other sources of data. © 2005 IEEE.
Barnard M, Odobez JM (2004) Robust playfield segmentation using MAP adaptation, Proceedings of the 17th International Conference on Pattern Recognition, Vol 3 pp. 610-613
A vital task in sports video annotation is to detect and segment areas of the playfield. This is an important first step in player or ball tracking and in detecting the location of the play on the playfield. In this paper we present a technique using statistical models, namely Gaussian mixture models (GMMs) with Maximum a Posteriori (MAP) adaptation. This involves first creating a generic model of the playfield colour and then using unsupervised MAP adaptation to adapt this model to the colour of the playfield in each game. This technique provides a robust and accurate segmentation of the playfield. To demonstrate the robustness of the method we tested it on a number of different sports that have grass playfields: rugby, soccer and field hockey.
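A minimal sketch of the generic-model-plus-MAP-adaptation idea is given below, using scikit-learn's GaussianMixture and a standard relevance-MAP update of the component means. The relevance factor, the means-only update and the synthetic colour data are assumptions made for illustration; the paper's exact adaptation equations may differ.

import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(gmm, pixels, relevance=16.0):
    """Relevance-MAP update of GMM means:
    new_mean_k = alpha_k * xbar_k + (1 - alpha_k) * old_mean_k,
    with alpha_k = n_k / (n_k + relevance) and n_k the soft counts
    of the new data under the generic model."""
    resp = gmm.predict_proba(pixels)              # (N, K) posteriors
    n_k = resp.sum(axis=0) + 1e-10
    xbar = (resp.T @ pixels) / n_k[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * xbar + (1.0 - alpha) * gmm.means_

# Generic playfield-colour model trained offline (synthetic green-ish pixels here).
generic_pixels = 0.3 * np.random.rand(5000, 3) + np.array([0.2, 0.5, 0.2])
gmm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0).fit(generic_pixels)

# Unsupervised adaptation to the colour statistics of a new game.
game_pixels = 0.3 * np.random.rand(2000, 3) + np.array([0.15, 0.55, 0.25])
gmm.means_ = map_adapt_means(gmm, game_pixels)
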
Barnard M, Wang W, Kittler J, Naqvi SM, Chambers J (2013) Audio-visual face detection for tracking in a meeting room environment, Proceedings of the 16th International Conference on Information Fusion, FUSION 2013 pp. 1222-1227
A key task in many applications such as tracking or face recognition is the detection and localisation of a subject's face in an image. This can still prove to be a challenging task, particularly in low resolution or noisy images. Here we propose a robust method for face detection using both audio and visual information. We construct a dictionary learning based face detector using a set of distinctive and robust image features. We then train a support vector machine classifier using sparse image representations produced by this dictionary to classify face versus background. This is combined with the azimuth angle of the speaker produced by an audio localisation system to constrain the search space for the subject's face. This increases the efficiency of the detection and localisation process by limiting the search area. However, more importantly, the audio information allows us to know a priori the number of subjects in the image. This greatly reduces the possibility of false positive face detections. We demonstrate the advantage of this proposed approach over traditional face detection methods on the challenging AV16.3 dataset. © 2013 ISIF (International Society of Information Fusion).
Kilic V, Barnard M, Wang W, Hilton A, Kittler J (2015) Audio informed visual speaker tracking with SMC-PHD filter, 2015 IEEE International Conference on Multimedia & Expo (ICME) IEEE
Barnard M, Odobez JM, Bengio S (2003) Multi-modal audio-visual event recognition for football analysis, 2003 IEEE XIII Workshop on Neural Networks for Signal Processing (NNSP'03) pp. 469-478
Hannuksela J, Barnard M, Sangi P, Heikkila J (2011) Camera based motion recognition for mobile interaction, ISRN Signal Processing 425621
Multiple built-in cameras and the small size of mobile phones are underexploited assets for creating novel applications that are ideal for pocket-size devices, but may not make much sense with laptops. In this paper we present two vision-based methods for the control of mobile user interfaces based on motion tracking and recognition. In the first case the motion is extracted by estimating the movement of the device held in the user's hand. In the second it is produced from tracking the motion of the user's finger in front of the device. In both alternatives, sequences of motion are classified using Hidden Markov Models. The results of the classification are filtered using a likelihood ratio and the velocity entropy to reject possibly incorrect sequences. Our hypothesis here is that incorrect measurements are characterised by a higher entropy value for their velocity histogram, denoting more random movements by the user. We also show that using the same filtering criteria we can control unsupervised Maximum A Posteriori adaptation. Experiments conducted on a recognition task involving simple control gestures for mobile phones clearly demonstrate the potential of our approaches and may provide ingredients for new user interface designs.
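The likelihood-ratio and velocity-entropy filtering described above can be sketched as follows; the entropy is taken over a histogram of per-frame speeds, and the thresholds are illustrative values, not those used in the paper.

import numpy as np
from scipy.stats import entropy

def velocity_entropy(motion, n_bins=16):
    """Entropy of the speed histogram of a motion sequence, where motion is
    a (T, 2) array of per-frame displacement vectors; higher values suggest
    more random movement."""
    speeds = np.linalg.norm(motion, axis=1)
    hist, _ = np.histogram(speeds, bins=n_bins)
    hist = hist.astype(float) + 1e-12
    return entropy(hist / hist.sum())

def accept_sequence(motion, log_lik_best, log_lik_second,
                    ratio_thresh=1.0, entropy_thresh=2.0):
    """Keep a recognised gesture only if the HMM log-likelihood ratio is high
    enough and the velocity entropy is low enough (illustrative thresholds)."""
    return (log_lik_best - log_lik_second) > ratio_thresh and \
           velocity_entropy(motion) < entropy_thresh
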
Barnard M, Wang W, Kittler J, Naqvi SM, Chambers JA, A Dictionary Learning Approach to Tracking, International Conference on Acoustics, Speech and Signal Processing
Zhao G, Barnard M, Pietikainen M (2009) Lipreading With Local Spatiotemporal Descriptors, IEEE Transactions on Multimedia 11 (7) pp. 1254-1265 IEEE
Tahir MA, Yan F, Barnard M, Awais M, Mikolajczyk K, Kittler J (2010) The University of Surrey visual concept detection system at ImageCLEF@ICPR: Working notes, Lecture Notes in Computer Science: Recognising Patterns in Signals, Speech, Images and Videos 6388 pp. 162-170 Springer
Visual concept detection is one of the most important tasks in image and video indexing. This paper describes our system in the ImageCLEF@ICPR Visual Concept Detection Task which ranked first for large-scale visual concept detection tasks in terms of Equal Error Rate (EER) and Area under Curve (AUC) and ranked third in terms of hierarchical measure. The presented approach involves state-of-the-art local descriptor computation, vector quantisation via clustering, structured scene or object representation via localised histograms of vector codes, similarity measure for kernel construction and classifier learning. The main novelty is the classifier-level and kernel-level fusion using Kernel Discriminant Analysis with RBF/Power Chi-Squared kernels obtained from various image descriptors. For 32 out of 53 individual concepts, we obtain the best performance of all 12 submissions to this task.
Matilainen M, Barnard M, Silven O (2009) Unusual Activity Recognition in Noisy Environments, Advanced Concepts for Intelligent Vision Systems, Proceedings 5807 pp. 389-399 Springer-Verlag Berlin
Gatica-Perez D, McCowan I, Barnard M, Bengio S, Bourlard H (2003) On automatic annotation of meeting databases, IEEE International Conference on Image Processing 3 pp. 629-632
In this paper, we discuss meetings as an application domain for multimedia content analysis. Meeting databases are a rich data source suitable for a variety of audio, visual and multi-modal tasks, including speech recognition, people and action recognition, and information retrieval. We specifically focus on the task of semantic annotation of audio-visual (AV) events, where annotation consists of assigning labels (event names) to the data. In order to develop an automatic annotation system in a principled manner, it is essential to have a well-defined task, a standard corpus and an objective performance measure. In this work we address each of these issues to automatically annotate events based on participant interactions.
Kilic V, Barnard M, Wang W, Hilton A, Kittler J (2016) Mean-Shift and Sparse Sampling Based SMC-PHD Filtering for Audio Informed Visual Speaker Tracking, IEEE Transactions on Multimedia
The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as the SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles as they need to be propagated over the entire image in every frame to detect the new speaker presence in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (V-SMC-PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named AVMS-SMC-PHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD based on the AV16.3, AMI and CLEAR datasets.
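The mean-shift refinement amounts to iteratively climbing a density (for example, an image of particle weights) to a local peak. A generic sketch of that step is below; the square window, radius and stopping rule are assumptions, and the sparse sampling used in the paper is not shown.

import numpy as np

def mean_shift_peak(weight_map, start, radius=15, n_iter=20):
    """Climb a 2D weight map (e.g. an image of particle weights) to a local
    peak by iterating weighted means inside a square window."""
    y, x = float(start[0]), float(start[1])
    h, w = weight_map.shape
    for _ in range(n_iter):
        y0, y1 = max(0, int(y) - radius), min(h, int(y) + radius + 1)
        x0, x1 = max(0, int(x) - radius), min(w, int(x) + radius + 1)
        win = weight_map[y0:y1, x0:x1]
        if win.sum() <= 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        y_new, x_new = (ys * win).sum() / win.sum(), (xs * win).sum() / win.sum()
        converged = abs(y_new - y) < 0.5 and abs(x_new - x) < 0.5
        y, x = y_new, x_new
        if converged:
            break
    return y, x
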
Mohsen Naqvi S, Wang W, Khan MS, Barnard M, Chambers JA (2012) Multimodal (audio-visual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking, IET Signal Processing 6 (5) pp. 466-477
A novel multimodal source separation approach is proposed for physically moving and stationary sources which exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. The challenge of separating moving sources, including higher reverberation time (RT) even for physically stationary sources, is that the mixing filters are time varying; as such the unmixing filters should also be time varying but these are difficult to determine from only audio measurements. Therefore in the proposed approach, the visual modality is used to facilitate the separation for both stationary and moving sources. The movement of the sources is detected by a three-dimensional tracker based on a Markov Chain Monte Carlo particle filter. The audio separation is performed by a robust least squares frequency invariant data-independent beamformer. The uncertainties in source localisation and direction of arrival information obtained from the 3D video-based tracker are controlled by using a convex optimisation approach in the beamformer design. In the final stage, the separated audio sources are further enhanced by applying a binary time-frequency masking technique in the cepstral domain. Experimental results show that using the visual modality, the proposed algorithm can not only achieve better performance than conventional frequency-domain source separation algorithms, but also provide acceptable separation performance for moving sources. © 2012 The Institution of Engineering and Technology.
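The final time-frequency masking stage can be illustrated in its simplest form: keep the STFT bins where the target estimate dominates the interferer estimate. The sketch below works directly in the STFT domain (the paper applies the mask in the cepstral domain) and uses synthetic tones as stand-ins for the beamformer outputs.

import numpy as np
from scipy.signal import stft, istft

def binary_mask_enhance(target, interferer, fs, nperseg=512):
    """Keep only the STFT bins where the target estimate dominates the
    interferer estimate, then resynthesise the target."""
    _, _, S_t = stft(target, fs=fs, nperseg=nperseg)
    _, _, S_i = stft(interferer, fs=fs, nperseg=nperseg)
    mask = (np.abs(S_t) > np.abs(S_i)).astype(float)
    _, enhanced = istft(S_t * mask, fs=fs, nperseg=nperseg)
    return enhanced

# Synthetic tones as stand-ins for the two beamformer outputs.
fs = 16000
t = np.arange(fs) / fs
s1, s2 = np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 1000 * t)
enhanced = binary_mask_enhance(s1 + 0.3 * s2, s2 + 0.3 * s1, fs)
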
McCowan I, Gatica-Perez D, Bengio S, Lathoud G, Barnard M, Zhang D (2005) Automatic analysis of multimodal group actions in meetings, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (3) pp. 305-317
This paper investigates the recognition of group actions in meetings. A framework is employed in which group actions result from the interactions of the individual participants. The group actions are modeled using different HMM-based approaches, where the observations are provided by a set of audiovisual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modeling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.
Liu Q, Wang W, Jackson PJB, Barnard M (2012) Reverberant Speech Separation Based on Audio-visual Dictionary Learning and Binaural Cues, Proc. of IEEE Statistical Signal Processing Workshop (SSP) pp. 664-667 IEEE
Probabilistic models of binaural cues, such as the interaural phase difference (IPD) and the interaural level difference (ILD), can be used to obtain the audio mask in the time-frequency (TF) domain, for source separation of binaural mixtures. Those models are, however, often degraded by acoustic noise. In contrast, the video stream contains relevant information about the synchronous audio stream that is not affected by acoustic noise. In this paper, we present a novel method for modeling the audio-visual (AV) coherence based on dictionary learning. A visual mask is constructed from the video signal based on the learnt AV dictionary, and incorporated with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation. We tested our algorithm on the XM2VTS database, and observed considerable performance improvement for noise corrupted signals.
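The binaural cues themselves are straightforward to compute per time-frequency bin; a minimal sketch is given below. The probabilistic modelling of these cues and the audio-visual dictionary are not shown, and the STFT parameters are arbitrary.

import numpy as np
from scipy.signal import stft

def binaural_cues(left, right, fs, nperseg=1024):
    """Per-bin interaural phase difference (radians) and interaural level
    difference (dB) from a binaural signal pair."""
    _, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    ipd = np.angle(L * np.conj(R))
    ild = 20.0 * np.log10((np.abs(L) + 1e-12) / (np.abs(R) + 1e-12))
    return ipd, ild
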
Kilic V, Zhong X, Barnard M, Wang W, Kittler J (2014) Audio-Visual Tracking of a Variable Number of Speakers with a Random Finite Set Approach, 2014 17th International Conference on Information Fusion (FUSION) IEEE
© 2014 International Society of Information Fusion. Speaker tracking in smart environments has attracted an increasing amount of attention in the past few years. Our recent studies show that fusing audio and visual modalities can provide improved robustness and accuracy in some challenging tracking scenarios such as occlusions (by the limited field of view of cameras or by other speakers), as compared with the tracking system based on individual modalities. In these previous works, however, the number of speakers is assumed to be known and remains fixed over the tracking process. In this paper, we focus on a more realistic and complex scenario where the number of speakers is unknown and variable with time. We extend the random finite set (RFS) theory for multi-modal data and devise a particle filter algorithm under the RFS framework for audiovisual (AV) tracking. The experiments on the AV16.3 dataset show the capability of our proposed algorithm for tracking both the number of speakers and the positions of the speakers in challenging scenarios such as occlusions.
Liu Q, Wang W, Jackson PJB, Barnard M, Kittler J, Chambers J (2013) Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking, IEEE Transactions on Signal Processing 61 (22) pp. 5520-5535
In existing audio-visual blind source separation (AV-BSS) algorithms, the AV coherence is usually established through statistical modelling, using e.g. Gaussian mixture models (GMMs). These methods often operate in a low-dimensional feature space, rendering an effective global representation of the data. The local information, which is important in capturing the temporal structure of the data, however, has not been explicitly exploited. In this paper, we propose a new method for capturing such local information, based on audio-visual dictionary learning (AVDL). We address several challenges associated with AVDL, including cross-modality differences in size, dimension and sampling rate, as well as the issues of scalability and computational complexity. Following a commonly employed bootstrap coding-learning process, we have developed a new AVDL algorithm which features a bimodality-balanced and scalable matching criterion, a size- and dimension-adaptive dictionary, a fast search index for efficient coding, and cross-modality diverse sparsity. We also show how the proposed AVDL can be incorporated into a BSS algorithm. As an example, we consider binaural mixtures, mimicking aspects of human binaural hearing, and derive a new noise-robust AV-BSS algorithm by combining the proposed AVDL algorithm with Mandel's BSS method, which is a state-of-the-art audio-domain method using time-frequency masking. We have systematically evaluated the proposed AVDL and AV-BSS algorithms, and show their advantages over the corresponding baseline methods, using both synthetic data and visual speech data from the multimodal LILiR Twotalk corpus.
Hannuksela J, Barnard M, Sangi P, Heikkila J (2008) Adaptive motion-based gesture recognition interface for mobile phones, Computer Vision Systems, Proceedings 5008 pp. 271-280 Springer-Verlag Berlin
Kilic V, Barnard M, Wang W, Kittler J (2013) Adaptive particle filtering approach to audio-visual tracking, European Signal Processing Conference
Particle filtering has emerged as a useful tool for tracking problems. However, the efficiency and accuracy of the filter usually depend on the number of particles and noise variance used in the estimation and propagation functions for re-allocating these particles at each iteration. Both of these parameters are specified beforehand and are kept fixed in the regular implementation of the filter which makes the tracker unstable in practice. In this paper we are interested in the design of a particle filtering algorithm which is able to adapt the number of particles and noise variance. The new filter, which is based on audio-visual (AV) tracking, uses information from the tracking errors to modify the number of particles and noise variance used. Its performance is compared with a previously proposed audio-visual particle filtering algorithm with a fixed number of particles and an existing adaptive particle filtering algorithm, using the AV 16.3 dataset with single and multi-speaker sequences. Our proposed approach demonstrates good tracking performance with a significantly reduced number of particles. © 2013 EURASIP.
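The paper's adaptation of the particle count and noise variance is driven by the tracking error; the exact update is not reproduced here, so the sketch below uses a simple illustrative controller (both quantities scale with a smoothed error signal) just to show the mechanism.

import numpy as np

class AdaptiveParticleBudget:
    """Illustrative controller: more particles and wider propagation noise
    when the recent tracking error is high, fewer and narrower when it is
    low. The limits, error scale and smoothing factor are assumptions."""

    def __init__(self, n_min=50, n_max=1000, sigma_min=2.0, sigma_max=20.0):
        self.n_min, self.n_max = n_min, n_max
        self.sigma_min, self.sigma_max = sigma_min, sigma_max
        self.smoothed_error = 0.0

    def update(self, error, err_scale=30.0, alpha=0.7):
        self.smoothed_error = alpha * self.smoothed_error + (1 - alpha) * error
        w = float(np.clip(self.smoothed_error / err_scale, 0.0, 1.0))
        n_particles = int(self.n_min + w * (self.n_max - self.n_min))
        sigma = self.sigma_min + w * (self.sigma_max - self.sigma_min)
        return n_particles, sigma
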
Head pose is an important cue in many applications such as speech recognition and face recognition. Most approaches to head pose estimation to date have focussed on the use of visual information of a subject's head. These visual approaches have a number of limitations, such as an inability to cope with occlusions, changes in the appearance of the head, and low resolution images. We present here a novel method for determining coarse head pose orientation purely from audio information, exploiting the direct to reverberant speech energy ratio (DRR) within a reverberant room environment. Our hypothesis is that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This method has the advantage of actually exploiting the reverberations within a room rather than trying to suppress them. This also has the practical advantage that most enclosed living spaces, such as meeting rooms or offices, are highly reverberant environments. In order to test this hypothesis we also present a new data set featuring 56 subjects recorded in three different rooms, with different acoustic properties, adopting 8 different head poses in 4 different room positions, captured with a 16 element microphone array. As far as the authors are aware this data set is unique and will make a significant contribution to further work in the area of audio head pose estimation. Using this data set we demonstrate that our proposed method of using the DRR for audio head pose estimation provides a significant improvement over previous methods.
Kiliç V, Barnard M, Wang W, Kittler J (2015) Audio assisted robust visual tracking with adaptive particle filtering, IEEE Transactions on Multimedia 17 (2) pp. 186-200
© 1999-2012 IEEE. The problem of tracking multiple moving speakers in indoor environments has received much attention. Earlier techniques were based purely on a single modality, e.g., vision. Recently, the fusion of multi-modal information has been shown to be instrumental in improving tracking performance, as well as robustness in the case of challenging situations like occlusions (by the limited field of view of cameras or by other speakers). However, data fusion algorithms often suffer from noise corrupting the sensor measurements which cause non-negligible detection errors. Here, a novel approach to combining audio and visual data is proposed. We employ the direction of arrival angles of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. This approach is further improved by solving a typical problem associated with the PF, whose efficiency and accuracy usually depend on the number of particles and noise variance used in state estimation and particle propagation. Both parameters are specified beforehand and kept fixed in the regular PF implementation which makes the tracker unstable in practice. To address these problems, we design an algorithm which adapts both the number of particles and noise variance based on tracking error and the area occupied by the particles in the image. Experiments on the AV16.3 dataset show the advantage of our proposed methods over the baseline PF method and an existing adaptive PF algorithm for tracking occluded speakers with a significantly reduced number of particles.
Tahir M, Yan F, Koniusz P, Awais M, Barnard M, Mikolajczyk K, Kittler J (2012) A Robust and Scalable Visual Category and Action Recognition System using Kernel Discriminant Analysis with Spectral Regression, IEEE Transactions on Multimedia
Hu R, Barnard M, Collomosse J (2010) Gradient Field Descriptor for Sketch based Retrieval and Localization, Proceedings of Intl. Conf. on Image Proc. (ICIP) pp. 1025-1028 IEEE
We present an image retrieval system driven by free-hand sketched queries depicting shape. We introduce Gradient Field HoG (GF-HOG) as a depiction invariant image descriptor, encapsulating local spatial structure in the sketch and facilitating efficient codebook based retrieval. We show improved retrieval accuracy over 3 leading descriptors (Self Similarity, SIFT, HoG) across two datasets (Flickr160, ETHZ extended objects), and explain how GF-HOG can be combined with RANSAC to localize sketched objects within relevant images. We also demonstrate a prototype sketch driven photo montage application based on our system.
Kilic V, Barnard M, Wang W, Kittler J (2013) Audio constrained particle filter based visual tracking, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings pp. 3627-3631
We present a robust and efficient audio-visual (AV) approach to speaker tracking in a room environment. A challenging problem with visual tracking is to deal with occlusions (caused by the limited field of view of cameras or by other speakers). Another challenge is associated with the particle filtering (PF) algorithm, commonly used for visual tracking, which requires a large number of particles to ensure the distribution is well modelled. In this paper, we propose a new method of fusing audio into the PF based visual tracking. We use the direction of arrival angles (DOAs) of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. Experiments on AV16.3 datasets show the advantage of our proposed method over the baseline PF method for tracking occluded speakers with a significantly reduced number of particles. © 2013 IEEE.
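One simple way to picture the audio-constrained propagation step is to pull each particle's horizontal coordinate toward the image column implied by the DoA before adding the usual Gaussian noise. The sketch below does exactly that; the pull factor and the mapping from DoA angle to an image column are assumptions, not the paper's formulation.

import numpy as np

def propagate_particles(particles, doa_column, pull=0.5, sigma=10.0):
    """Propagate (N, 2) image-plane particles, pulling the horizontal
    coordinate toward the image column implied by the audio DoA before
    adding Gaussian noise (an illustrative reshaping of the propagation
    density, not the paper's exact formulation)."""
    out = particles.astype(float).copy()
    out[:, 0] += pull * (doa_column - out[:, 0])
    out += np.random.randn(*out.shape) * sigma
    return out
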
Barnard M, Koniusz P, Wang W, Kittler J, Naqvi SM, Chambers J (2014) Robust multi-speaker tracking via dictionary learning and identity modeling, IEEE Transactions on Multimedia 16 (3) pp. 864-880
We investigate the problem of visual tracking of multiple human speakers in an office environment. In particular, we propose novel solutions to the following challenges: (1) robust and computationally efficient modeling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialization of the trackers, or re-initialization when the trackers have lost lock caused by e.g. the limited camera views. First, we develop new algorithms for appearance modeling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries are used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function is then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) is proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model is proposed to track multiple speakers whilst dealing with occlusions. This model is updated online using Maximum a Posteriori (MAP) adaptation, where we control the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialization of the visual trackers, we exploit audio information, the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provides, a priori, the number of speakers and constrains the search space for the speaker's faces. The proposed system is tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects. © 2014 IEEE.
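The appearance model described above (dictionary coding coefficients pooled into histograms and scored by an SVM) can be sketched with scikit-learn's dictionary learning tools. The sketch below uses plain sparse coding in place of the paper's LcSA soft assignment, random vectors in place of real patch features, and the SVM probability output as the particle-filter likelihood.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, SparseCoder
from sklearn.svm import SVC

# Offline: learn a dictionary from patch feature vectors (random stand-ins here).
face_patches = np.random.randn(500, 64)
bg_patches = np.random.randn(500, 64) + 1.0
dico = MiniBatchDictionaryLearning(n_components=128, alpha=1.0, random_state=0)
dico.fit(np.vstack([face_patches, bg_patches]))
coder = SparseCoder(dictionary=dico.components_,
                    transform_algorithm='lasso_lars', transform_alpha=0.5)

def encode(patches):
    """Histogram of absolute coding coefficients pooled over a set of patches."""
    codes = coder.transform(patches)
    return np.abs(codes).sum(axis=0) / max(len(patches), 1)

# Train the face-vs-background classifier on pooled histograms.
X = np.vstack([encode(face_patches[i:i + 10]) for i in range(0, 500, 10)] +
              [encode(bg_patches[i:i + 10]) for i in range(0, 500, 10)])
y = np.array([1] * 50 + [0] * 50)
clf = SVC(probability=True).fit(X, y)    # predict_proba then acts as the PF likelihood
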