Dr Eng-Jon Ong


Research Fellow
+44 (0)1483 689842
02 BB 00

Centre for Vision, Speech and Signal Processing (CVSSP).

Publications

Ong E-J, Ellis L, Bowden R (2009) Problem solving through imitation, IMAGE AND VISION COMPUTING 27 (11) pp. 1715-1728 ELSEVIER SCIENCE BV
Okwechime D, Ong E-J, Bowden R (2011) MIMiC: Multimodal Interactive Motion Controller, IEEE TRANSACTIONS ON MULTIMEDIA 13 (2) pp. 255-265 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Sheerman-Chase T, Ong EJ, Bowden R (2011) Cultural factors in the regression of non-verbal communication perception, Proceedings of the IEEE International Conference on Computer Vision pp. 1242-1249
Recognition of non-verbal communication (NVC) is important for understanding human communication and designing user-centric user interfaces. Cultural differences affect the expression and perception of NVC, but no previous automatic system considers these cultural differences. Annotation data for the LILiR TwoTalk corpus, containing dyadic (two person) conversations, was gathered using Internet crowdsourcing, with a significant quantity collected from India, Kenya and the United Kingdom (UK). Many studies have investigated cultural differences based on human observations but this has not been addressed in the context of automatic emotion or NVC recognition. Perhaps not surprisingly, testing an automatic system on data that is not culturally representative of the training data is seen to result in low performance. We address this problem by training and testing our system on a specific culture to enable better modeling of the cultural differences in NVC perception. The system uses linear predictor tracking, with features generated based on distances between pairs of trackers. The annotations indicated the strength of the NVC, which enables the use of ν-SVR to perform the regression. © 2011 IEEE.
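To make the regression stage above concrete, here is a minimal sketch, assuming ν-SVR (scikit-learn's NuSVR) over distances between pairs of tracked points; the feature layout, hyper-parameters and stand-in data are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import NuSVR

def pair_distance_features(tracked):
    """tracked: (n_frames, n_trackers, 2) positions -> one distance per tracker pair."""
    n_frames, n_trackers, _ = tracked.shape
    feats = [np.linalg.norm(tracked[:, i] - tracked[:, j], axis=1)
             for i in range(n_trackers) for j in range(i + 1, n_trackers)]
    return np.stack(feats, axis=1)                # (n_frames, n_pairs)

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 10, 2))           # stand-in tracker output
X = pair_distance_features(points)
y = rng.uniform(0, 1, size=200)                  # stand-in crowd-annotated NVC strengths

model = NuSVR(nu=0.5, C=1.0, kernel="rbf").fit(X, y)
print(model.predict(X[:5]))                      # predicted NVC strength per frame
```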
Okwechime D, Ong E, Gilbert A, Bowden R (2011) Social interactive human video synthesis, Lecture Notes in Computer Science: Computer Vision - ACCV 2010 6492 (PART 1) pp. 256-270 Springer
In this paper, we propose a computational model for social interaction between three people in a conversation, and demonstrate results using human video motion synthesis. We utilised semi-supervised computer vision techniques to label social signals between the people, like laughing, head nod and gaze direction. Data mining is used to deduce frequently occurring patterns of social signals between a speaker and a listener in both interested and not interested social scenarios, and the mined confidence values are used as conditional probabilities to animate social responses. The human video motion synthesis is done using an appearance model to learn a multivariate probability distribution, combined with a transition matrix to derive the likelihood of motion given a pose configuration. Our system uses social labels to more accurately define motion transitions and build a texture motion graph. Traditional motion synthesis algorithms are best suited to large human movements like walking and running, where motion variations are large and prominent. Our method focuses on generating more subtle human movement like head nods. The user can then control who speaks and the interest level of the individual listeners resulting in social interactive conversational agents.
Ong EJ, Cooper H, Pugeault N, Bowden R (2012) Sign Language Recognition using Sequential Pattern Trees, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on pp. 2200-2207
Ong E, Oliver G, Cosker D, Hancock P, Eisert P, McKinnel J (2012) Applications of Face Recognition and Modeling in Media Production, IEEE Transactions on Multimedia
Bowden R, Cox SJ, Harvey RW, Lan Y, Ong EJ, Owen G, Theobald BJ (2012) Is automated conversion of video to text a reality?, Proceedings of SPIE - The International Society for Optical Engineering 8546
A recent trend in law enforcement has been the use of Forensic lip-readers. Criminal activities are often recorded on CCTV or other video gathering systems. Knowledge of what suspects are saying enriches the evidence gathered but lip-readers, by their own admission, are fallible so, based on long term studies of automated lip-reading, we are investigating the possibilities and limitations of applying this technique under realistic conditions. We have adopted a step-by-step approach and are developing a capability when prior video information is available for the suspect of interest. We use the terminology video-to-text (V2T) for this technique by analogy with speech-to-text (S2T) which also has applications in security and law-enforcement. © 2012 SPIE.
Okwechime D, Ong EJ, Bowden R (2009) Real-time motion control using pose space probability density estimation, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009 pp. 2056-2063
The ability to control the movements of an object or person in a video sequence has applications in the movie and animation industries, and in HCI. In this paper, we introduce a new algorithm for real-time motion control and demonstrate its application to pre-recorded video clips and HCI. Firstly, a dataset of video frames is projected into a lower-dimensional space. A k-medoid clustering algorithm with a distance metric is used to determine groups of similar frames which operate as cut points, segmenting the data into smaller subsequences. A multivariate probability distribution is learnt and probability density estimation is used to determine transitions between the subsequences to develop novel motion. To facilitate real-time control, conditional probabilities are used to derive motion given user commands. The motion controller is extended to HCI using speech Mel-Frequency Cepstral Coefficients (MFCCs) to trigger movement from an input speech signal. We demonstrate the flexibility of the model by presenting results on datasets composed of both vectorised images and 2D point representations. Results show plausible motion generation and lifelike blends between different types of movement. ©2009 IEEE.
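A minimal sketch of two of the ingredients above, assuming a naive k-medoids implementation to find cut points and a Gaussian kernel density estimate over pose space to score candidate transitions; cluster count, distances and the stand-in data are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import gaussian_kde

def k_medoids(X, k, n_iter=50, seed=0):
    """Naive k-medoids over Euclidean distances (illustrative, not optimised)."""
    rng = np.random.default_rng(seed)
    D = cdist(X, X)
    medoids = rng.choice(len(X), k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = []
        for c in range(k):
            mask = labels == c
            if not mask.any():
                new.append(medoids[c])        # keep old medoid if a cluster empties
                continue
            idx = np.where(mask)[0]
            new.append(idx[np.argmin(D[np.ix_(mask, mask)].sum(axis=1))])
        new = np.array(new)
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)

# Frames projected into a low-dimensional pose space (stand-in data).
poses = np.random.default_rng(1).normal(size=(500, 3))
medoids, labels = k_medoids(poses, k=8)

# Multivariate density over pose space; transitions can favour cut points
# that sit in high-density (i.e. plausible) regions.
kde = gaussian_kde(poses.T)
print(kde(poses[medoids].T))                  # relative plausibility of each cut point
```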
Sheerman-Chase T, Ong E-J, Pugeault N, Bowden R (2013) Improving Recognition and Identification of Facial Areas Involved in Non-verbal Communication by Feature Selection, Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on
Meaningful Non-Verbal Communication (NVC) signals can be recognised by facial deformations based on video tracking. However, the geometric features previously used contain a significant amount of redundant or irrelevant information. A feature selection method is described for selecting a subset of features that improves performance and allows for the identification and visualisation of facial areas involved in NVC. The feature selection is based on a sequential backward elimination of features to find an effective subset of components. This results in a significant improvement in recognition performance, as well as providing evidence that brow lowering is involved in questioning sentences. The improvement in performance is a step towards a more practical automatic system and the facial areas identified provide some insight into human behaviour.
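A minimal sketch of sequential backward elimination as described above; the classifier (logistic regression), cross-validated scoring and stand-in data are assumptions rather than the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_eliminate(X, y, min_features=5):
    """Drop the least useful feature while the cross-validated score improves."""
    keep = list(range(X.shape[1]))
    best = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    while len(keep) > min_features:
        trials = []
        for f in keep:
            subset = [g for g in keep if g != f]
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, subset], y, cv=3).mean()
            trials.append((score, f))
        score, f = max(trials)                 # best score after one removal
        if score < best:
            break                              # every removal now hurts
        best, keep = score, [g for g in keep if g != f]
    return keep, best

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20)); y = (X[:, 3] - X[:, 11] > 0).astype(int)
subset, score = backward_eliminate(X, y)
print(len(subset), round(score, 3))
```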
Micilotta AS, Ong EJ, Bowden R (2006) Real-time upper body detection and 3D pose estimation in monoscopic images, COMPUTER VISION - ECCV 2006, PT 3, PROCEEDINGS 3953 pp. 139-150 SPRINGER-VERLAG BERLIN
Moore S, Ong EJ, Bowden R (2010) Facial Expression Recognition using Spatiotemporal Boosted Discriminatory Classifiers, 6111/2010 pp. 405-414
Ong EJ, Bowden R (2011) Robust Facial Feature Tracking Using Shape-Constrained Multi-Resolution Selected Linear Predictors, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (9) pp. 1844-1859 IEEE Computer Society
This paper proposes a learnt data-driven approach for accurate, real-time tracking of facial features using only intensity information, a non-trivial task since the face is a highly deformable object with large textural variations and motion in certain regions. The framework proposed here largely avoids the need for a priori design of feature trackers by automatically identifying the optimal visual support required for tracking a single facial feature point. This is essentially equivalent to automatically determining the visual context required for tracking. Tracking is achieved via biased linear predictors which provide a fast and effective method for mapping pixel-intensities into tracked feature position displacements. Multiple linear predictors are grouped into a rigid flock to increase robustness. To further improve tracking accuracy, a novel probabilistic selection method is used to identify relevant visual areas for tracking a feature point. These selected flocks are then combined into a hierarchical multi-resolution LP model. Finally, we also exploit a simple shape constraint for correcting the occasional tracking failure of a minority of feature points. Experimental results also show that this method performs more robustly and accurately than AAMs, on example sequences that range from SD quality to YouTube quality.
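Below is a rough sketch of how one biased linear predictor can be trained: synthetic displacements of a fixed support-pixel set are regressed onto correcting displacement vectors via least squares. Support size, sampling ranges and helper names are illustrative assumptions.

```python
import numpy as np

def sample_intensities(image, pts):
    """Nearest-pixel intensity lookup; pts are (x, y) coordinates."""
    pts = np.clip(np.round(pts).astype(int), 0, np.array(image.shape)[::-1] - 1)
    return image[pts[:, 1], pts[:, 0]].astype(float)

def train_linear_predictor(image, centre, support_offsets, n_samples=500, max_disp=5, seed=0):
    """Learn a biased linear map from intensity differences to displacements."""
    rng = np.random.default_rng(seed)
    ref = sample_intensities(image, centre + support_offsets)
    diffs, disps = [], []
    for _ in range(n_samples):
        t = rng.uniform(-max_disp, max_disp, size=2)           # synthetic offset
        obs = sample_intensities(image, centre + t + support_offsets)
        diffs.append(obs - ref)                                # intensity difference
        disps.append(-t)                                       # correcting displacement
    A = np.column_stack([np.array(diffs), np.ones(n_samples)])  # bias column
    P, *_ = np.linalg.lstsq(A, np.array(disps), rcond=None)
    return P  # at run time: disp = np.append(obs - ref, 1.0) @ P

img = np.random.default_rng(1).random((240, 320))
offsets = np.random.default_rng(2).integers(-10, 11, size=(30, 2))
P = train_linear_predictor(img, np.array([160.0, 120.0]), offsets)
print(P.shape)  # (31, 2): one weight per support pixel plus bias, per axis
```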
Ong E-J, Bowden R (2006) Learning wormholes for sparsely labelled clustering, 18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS pp. 916-919 IEEE COMPUTER SOC
Elliott R, Cooper HM, Ong EJ, Glauert J, Bowden R, Lefebvre-Albaret F (2011) Search-By-Example in Multilingual Sign Language Databases
We describe a prototype Search-by-Example or look-up tool for signs, based on a newly developed 1000-concept sign lexicon for four national sign languages (GSL, DGS, LSF, BSL), which includes a spoken language gloss, a HamNoSys description, and a video for each sign. The look-up tool combines an interactive sign recognition system, supported by Kinect technology, with a real-time sign synthesis system, using a virtual human signer, to present results to the user. The user performs a sign to the system and is presented with animations of signs recognised as similar. The user also has the option to view any of these signs performed in the other three sign languages. We describe the supporting technology and architecture for this system, and present some preliminary evaluation results.
Ong E, Bober M (2016) Improved Hamming Distance Search using Variable Length Substrings, 2016 IEEE Conference on Computer Vision and Pattern Recognition pp. 2000-2008
This paper addresses the problem of ultra-large-scale search in Hamming spaces. There has been considerable research on generating compact binary codes in vision, for example for visual search tasks. However, the issue of efficient searching through huge sets of binary codes remains largely unsolved. To this end, we propose a novel, unsupervised approach to thresholded search in Hamming space, supporting long codes (e.g. 512-bits) with a wide range of Hamming distance radii. Our method is capable of working efficiently with billions of codes, delivering one to three orders of magnitude acceleration compared to prior art. This is achieved by relaxing the equal-size constraint in the Multi-Index Hashing approach, leading to multiple hash-tables with variable-length hash-keys. Based on a theoretical analysis of the retrieval probabilities of multiple hash-tables, we propose a novel search algorithm for obtaining a suitable set of hash-key lengths. The resulting retrieval mechanism is shown empirically to improve the efficiency over the state-of-the-art across a range of datasets, bit-depths and retrieval thresholds.
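The core indexing idea can be sketched as follows: split each binary code into substrings of possibly unequal length, bucket database codes by exact substring match, and verify the full Hamming distance only on the candidates. With more substrings than the radius r, the pigeonhole principle guarantees no misses. The class names and split lengths below are assumptions, and the paper's probabilistic choice of key lengths is not reproduced.

```python
from collections import defaultdict

def split(code, lengths):
    """Cut a bit string into substrings of the given (variable) lengths."""
    parts, pos = [], 0
    for n in lengths:
        parts.append(code[pos:pos + n]); pos += n
    return parts

class SubstringIndex:
    def __init__(self, lengths):
        self.lengths = lengths
        self.tables = [defaultdict(list) for _ in lengths]
        self.codes = []

    def add(self, code):                          # code: bit string, e.g. '0110...'
        idx = len(self.codes); self.codes.append(code)
        for table, part in zip(self.tables, split(code, self.lengths)):
            table[part].append(idx)

    def query(self, code, r):
        """All stored codes within Hamming distance r (requires r < len(lengths))."""
        cands = set()
        for table, part in zip(self.tables, split(code, self.lengths)):
            cands.update(table.get(part, ()))
        return [self.codes[i] for i in cands
                if sum(a != b for a, b in zip(code, self.codes[i])) <= r]

index = SubstringIndex(lengths=[6, 5, 5])         # 16-bit codes, uneven splits
index.add('0110101001011100'); index.add('0110101001011111')
print(index.query('0110101001011101', r=2))       # both codes are within radius 2
```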
Ong E, Bowden R (2011) Learning Temporal Signatures for Lip Reading
Ong E-J, Bowden R (2006) Learning Distance for Arbitrary Visual Features, Proceedings of the British Machine Vision Conference 2 pp. 749-758 BMVA
This paper presents a method for learning distance functions of arbitrary feature representations that is based on the concept of wormholes. We introduce wormholes and describe how they provide a method for warping the topology of visual representation spaces such that a meaningful distance between examples is available. Additionally, we show how a more general distance function can be learnt through the combination of many wormholes via an inter-wormhole network. We then demonstrate the application of the distance learning method on a variety of problems including nonlinear synthetic data, face illumination detection and the retrieval of images containing natural landscapes and man-made objects (e.g. cities).
Lodi S, Phillips A, Fidler S, Hawkins D, Gilson R, McLean K, Fisher M, Post F, Johnson AM, Walker-Nthenda L, Dunn D, Porter K, Kennedy N, Pritchard J, Andrady U, Rajda N, Donnelly C, McKernan S, Drake S, Gilleran G, White D, Ross J, Harding J, Faville R, Sweeney J, Flegg P, Toomer S, Wilding H, Woodward R, Dean G, Richardson C, Perry N, Gompels M, Jennings L, Bansaal D, Browing M, Connolly L, Stanley B, Estreich S, Magdy A, O'Mahony C, Fraser P, Jebakumar SPR, David L, Mette R, Summerfield H, Evans M, White C, Robertson R, Lean C, Morris S, Winter A, Faulkner S, Goorney B, Howard L, Fairley I, Stemp C, Short L, Gomez M, Young F, Roberts M, Green S, Sivakumar K, Minton J, Siminoni A, Calderwood J, Greenhough D, Minton J, DeSouza C, Muthern L, Orkin C, Murphy S, Truvedi M, McLean K, Hawkins D, Higgs C, Moyes A, Antonucci S, McCormack S, Lynn W, Bevan M, Fox J, Teague A, Anderson J, Mguni S, Campbell L, Mazhude C, Russell H, Gilson R, Carrick G, Ainsworth J, Waters A, Byrne P, Johnson M, Fidler S, Kuldanek K, Mullaney S, Lawlor V (2013) Role of HIV Infection Duration and CD4 Cell Level at Initiation of Combination Anti-Retroviral Therapy on Risk of Failure, PLoS ONE 8 (9)
Background: The development of HIV drug resistance and subsequent virological failure are often cited as potential disadvantages of early cART initiation. However, their long-term probability is not known, and neither is the role of duration of infection at the time of initiation. Methods: Patients enrolled in the UK Register of HIV seroconverters were followed up from cART initiation to last HIV-RNA measurement. Through survival analysis we examined predictors of virologic failure (≥2 HIV-RNA ≥400 c/ml while on cART) including CD4 count and HIV duration at initiation. We also estimated the cumulative probabilities of failure and drug resistance (from the available HIV nucleotide sequences) for early initiators (cART within 12 months of seroconversion). Results: Of 1075 starting cART at a median (IQR) CD4 count of 272 (190, 370) cells/mm3 and HIV duration of 3 (1, 6) years, virological failure occurred in 163 (15%). Higher CD4 count at initiation, but not HIV infection duration at cART initiation, was independently associated with lower risk of failure (p=0.033 and 0.592 respectively). Among 230 patients initiating cART early, 97 (42%) discontinued it after a median of 7 months; cumulative probabilities of resistance and failure by 8 years were 7% (95% CI 4, 11) and 19% (13, 25), respectively. Conclusion: Although the rate of discontinuation of early cART in our cohort was high, the long-term rate of virological failure was low. Our data do not support early cART initiation being associated with increased risk of failure and drug resistance. © 2013 Lodi et al.
Bowden R, Cox S, Harvey R, Lan Y, Ong E-J, Owen G, Theobald B-J (2013) Recent developments in automated lip-reading, OPTICS AND PHOTONICS FOR COUNTERTERRORISM, CRIME FIGHTING AND DEFENCE IX; AND OPTICAL MATERIALS AND BIOMATERIALS IN SECURITY AND DEFENCE SYSTEMS TECHNOLOGY X 8901 SPIE-INT SOC OPTICAL ENGINEERING
Sheerman-Chase T, Ong EJ, Bowden R (2009) Feature selection of facial displays for detection of non verbal communication in natural conversation, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009 pp. 1985-1992
Recognition of human communication has previously focused on deliberately acted emotions or in structured or artificial social contexts. This makes the result hard to apply to realistic social situations. This paper describes the recording of spontaneous human communication in a specific and common social situation: conversation between two people. The clips are then annotated by multiple observers to reduce individual variations in interpretation of social signals. Temporal and static features are generated from tracking using heuristic and algorithmic methods. Optimal features for classifying examples of spontaneous communication signals are then extracted by AdaBoost. The performance of the boosted classifier is comparable to human performance for some communication signals, even on this challenging and realistic data set. ©2009 IEEE.
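As an illustration of boosted feature selection, the sketch below runs AdaBoost over depth-1 decision stumps and reads off which feature each selected stump split on; the data shapes and labels are stand-ins, not the paper's features.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))                    # static/temporal clip features
y = (X[:, 7] + 0.5 * X[:, 19] > 0).astype(int)    # stand-in NVC labels

# Each boosting round picks the stump (and hence the single feature)
# that best reduces weighted error, performing implicit feature selection.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50).fit(X, y)
selected = sorted({int(t.tree_.feature[0]) for t in clf.estimators_
                   if t.tree_.feature[0] >= 0})
print("features picked by the boosted stumps:", selected)
```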
Sheerman-Chase T, Ong EJ, Bowden R (2009) Online learning of robust facial feature trackers, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009 pp. 1386-1392
This paper presents a head pose and facial feature estimation technique that works over a wide range of pose variations without a priori knowledge of the appearance of the face. Using simple LK trackers, head pose is estimated by Levenberg-Marquardt (LM) pose estimation using the feature tracking as constraints. Factored sampling and RANSAC are employed to both provide a robust pose estimate and identify tracker drift by constraining outliers in the estimation process. The system provides both a head pose estimate and the position of facial features and is capable of tracking over a wide range of head poses. ©2009 IEEE.
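A hedged sketch of the tracking-plus-pose loop using standard OpenCV building blocks (pyramidal LK and RANSAC PnP); the 3D facial-point model and camera matrix are placeholders, and this is not the paper's exact factored-sampling formulation.

```python
import cv2
import numpy as np

# Placeholder 3D facial feature model and pinhole camera (assumptions).
model_3d = np.random.default_rng(0).normal(size=(20, 3)).astype(np.float32)
K = np.array([[700, 0, 320], [0, 700, 240], [0, 0, 1]], dtype=np.float32)

def track_and_pose(prev_gray, gray, prev_pts):
    """prev_pts: (N, 1, 2) float32 feature positions from the previous frame."""
    pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    ok = status.ravel() == 1
    # RANSAC PnP gives a robust head-pose estimate; trackers flagged as
    # outliers can be treated as drifted and re-initialised from the pose.
    _, rvec, tvec, inliers = cv2.solvePnPRansac(model_3d[ok], pts[ok], K, None)
    return pts, rvec, tvec, inliers
```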
Ong E-J, Micilotta AS, Bowden R, Hilton A (2005) Viewpoint invariant exemplar-based 3D human tracking, COMPUTER VISION AND IMAGE UNDERSTANDING 104 (2-3) pp. 178-189 ACADEMIC PRESS INC ELSEVIER SCIENCE
Holt B, Ong EJ, Bowden R (2013) Accurate static pose estimation combining direct regression and geodesic extrema, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
Human pose estimation in static images has received significant attention recently but the problem remains challenging. Using data acquired from a consumer depth sensor, our method combines a direct regression approach for the estimation of rigid body parts with the extraction of geodesic extrema to find extremities. We show how these approaches are complementary and present a novel approach to combine the results, resulting in an improvement over the state-of-the-art. We report and compare our results on a new dataset of aligned RGB-D pose sequences which we release as a benchmark for further evaluation. © 2013 IEEE.
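One plausible reading of the geodesic-extrema step is sketched below: build a depth-continuity graph over foreground pixels and repeatedly take the point geodesically furthest from the sources found so far. The grid construction, edge weights and continuity threshold are assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_extrema(depth, mask, centre_idx, k=5, max_step=30.0):
    """centre_idx indexes the masked-pixel numbering (e.g. node nearest the centroid)."""
    h, w = depth.shape
    ids = -np.ones((h, w), dtype=int)
    ys, xs = np.nonzero(mask)
    ids[ys, xs] = np.arange(len(ys))
    g = lil_matrix((len(ys), len(ys)))
    for y, x in zip(ys, xs):
        for dy, dx in ((0, 1), (1, 0)):           # 4-connected grid edges
            ny, nx = y + dy, x + dx
            if ny < h and nx < w and ids[ny, nx] >= 0:
                step = abs(float(depth[y, x]) - float(depth[ny, nx]))
                if step < max_step:               # depth-continuous edge only
                    g[ids[y, x], ids[ny, nx]] = 1.0 + step
    extrema, sources = [], [centre_idx]
    for _ in range(k):
        d = dijkstra(g.tocsr(), directed=False, indices=sources).min(axis=0)
        far = int(np.argmax(np.where(np.isfinite(d), d, -1)))
        extrema.append((ys[far], xs[far]))
        sources.append(far)                       # next extremum avoids this one
    return extrema                                # candidate hand/foot/head pixels

depth = np.full((60, 40), 1000.0)                 # flat stand-in depth map
mask = np.zeros((60, 40), bool); mask[10:50, 5:35] = True
print(geodesic_extrema(depth, mask, centre_idx=600, k=3))
```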
Ong EJ, Hilton A (2006) Learnt inverse kinematics for animation synthesis, Graphical Models 68 (5-6) pp. 472-483
Existing work on animation synthesis can be roughly split into two approaches, those that combine segments of motion-capture data, and those that perform inverse kinematics. In this paper, we present a method for performing animation synthesis of an articulated object (e.g. human body and a dog) from a minimal set of body joint positions, following the approach of inverse kinematics. We tackle this problem from a learning perspective. Firstly, we address the need for knowledge of the physical constraints of the articulated body, so as to avoid the generation of physically impossible poses. A common solution is to heuristically specify the kinematic constraints for the skeleton model. In this paper however, the physical constraints of the articulated body are represented using a hierarchical cluster model learnt from a motion capture database. Additionally, we shall show that the learnt model automatically captures the correlation between different joints through simultaneous modelling of their angles. We then show how this model can be utilised to perform inverse kinematics in a simple and efficient manner. Crucially, we describe how IK is carried out from a minimal set of end-effector positions. Following this, we show how this "learnt inverse kinematics" framework can be used to perform animation syntheses of different types of articulated structures. To this end, the results presented include the retargeting of a flat surface walking animation to various uneven terrains to demonstrate the synthesis of a full human body motion from the positions of only the hands, feet and torso. Additionally, we show how the same method can be applied to the animation synthesis of a dog using only its feet and torso positions. © 2006 Elsevier Inc. All rights reserved.
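A toy look-up version of learnt IK, in which a flat k-means over end-effector positions stands in for the paper's hierarchical cluster model: each cluster stores a mean full-body pose, and solving IK reduces to predicting the nearest cluster for the given end-effectors.

```python
import numpy as np
from sklearn.cluster import KMeans

class LearntIK:
    def __init__(self, poses, end_effectors, n_clusters=64):
        # poses: (n, d_pose) joint-angle vectors; end_effectors: (n, d_ee) positions.
        self.km = KMeans(n_clusters=n_clusters, n_init=10).fit(end_effectors)
        self.mean_pose = np.array([poses[self.km.labels_ == c].mean(axis=0)
                                   for c in range(n_clusters)])

    def solve(self, ee):
        """Return a full-body pose consistent with the given end-effector positions."""
        c = self.km.predict(ee.reshape(1, -1))[0]
        return self.mean_pose[c]

rng = np.random.default_rng(0)
poses = rng.normal(size=(2000, 30))                   # stand-in joint angles
ee = poses[:, :6] + 0.1 * rng.normal(size=(2000, 6))  # stand-in hands/feet/torso
ik = LearntIK(poses, ee)
print(ik.solve(ee[0]).shape)                          # (30,)
```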
Holt B, Ong EJ, Cooper H, Bowden R (2011) Putting the pieces together: Connected Poselets for human pose estimation, Proceedings of the IEEE International Conference on Computer Vision pp. 1196-1201
We propose a novel hybrid approach to static pose estimation called Connected Poselets. This representation combines the best aspects of part-based and example-based estimation. Our method first detects poselets extracted from the training data, then applies a modified Random Decision Forest to identify poselet activations. By combining keypoint predictions from poselet activations within a graphical model, we can infer the marginal distribution over each keypoint without any kinematic constraints. Our approach is demonstrated on a new publicly available dataset with promising results. © 2011 IEEE.
Ong E, Bowden R (2011) Learning Sequential Patterns for Lipreading, Proceedings of the 22nd British Machine Vision Conference pp. 55.1-55.10 BMVA Press
This paper proposes a novel machine learning algorithm (SP-Boosting) to tackle the problem of lipreading by building visual sequence classifiers based on sequential patterns. We show that an exhaustive search of optimal sequential patterns is not possible due to the immense search space, and tackle this with a novel, efficient tree-search method with a set of pruning criteria. Crucially, the pruning strategies preserve our ability to locate the optimal sequential pattern. Additionally, the tree-based search method accounts for the training set's boosting weight distribution. This temporal search method is then integrated into the boosting framework, resulting in the SP-Boosting algorithm. We also propose a novel constrained set of strong classifiers that further improves recognition accuracy. The resulting learnt classifiers are applied to lipreading by performing multi-class recognition on the OuluVS database. Experimental results show that our method achieves state-of-the-art recognition performance using only a small set of sequential patterns.
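The weak classifier at the heart of SP-Boosting tests whether a sequential pattern occurs in a clip's feature sequence (items in order, gaps allowed); a minimal containment check might look like the following, with the itemset encoding an illustrative assumption.

```python
def contains_pattern(sequence, pattern):
    """sequence: list of per-frame itemsets; pattern: list of itemsets.
    True if the pattern's itemsets occur in order, each inside some later frame."""
    frames = iter(sequence)                       # shared iterator enforces ordering
    return all(any(p <= frame for frame in frames) for p in pattern)

clip = [{'mouth_open'}, {'mouth_open', 'teeth'}, {'mouth_closed'}]
print(contains_pattern(clip, [{'mouth_open'}, {'mouth_closed'}]))  # True
print(contains_pattern(clip, [{'mouth_closed'}, {'teeth'}]))       # False
```

In a boosting round, each candidate pattern would be scored by its weighted classification error over such containment tests, with the tree search pruning patterns that cannot beat the current best.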
Sheerman-Chase T, Ong EJ, Bowden R (2013) Non-linear predictors for facial feature tracking across pose and expression, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013
This paper proposes a non-linear predictor for estimating the displacement of tracked feature points on faces that exhibit significant variations across pose and expression. Existing methods such as linear predictors, ASMs or AAMs are limited to a narrow range in pose. In order to track across a large pose range, separate pose-specific models are required that are then coupled via a pose-estimator. In our approach, we neither require a set of pose-specific models nor a pose-estimator. Using just a single tracking model, we are able to robustly and accurately track across a wide range of expressions and poses. This is achieved by gradient boosting of regression trees for predicting the displacement vectors of tracked points. Additionally, we propose a novel algorithm for simultaneously configuring this hierarchical set of trackers for optimal tracking results. Experiments were carried out on sequences of naturalistic conversation and sequences with large pose and expression changes. The results show that the proposed method is superior to state-of-the-art methods, in being able to robustly track a set of facial points whilst gracefully recovering from tracking failures. © 2013 IEEE.
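A small sketch of the regression step, using scikit-learn's gradient-boosted regression trees to map patch features to a 2D displacement (one single-output regressor per axis); the feature representation and stand-in data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                  # sampled patch features
true_disp = X[:, :2] * 3.0                       # stand-in displacement targets

# sklearn's GBM is single-output, so fit one boosted ensemble per axis.
models = [GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, true_disp[:, a])
          for a in range(2)]
pred = np.column_stack([m.predict(X[:5]) for m in models])
print(pred)                                      # predicted (dx, dy) per sample
```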
Okwechime D, Ong E-J, Gilbert A, Bowden R (2011) Visualisation and prediction of conversation interest through mined social signals, 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops pp. 951-956 IEEE
This paper introduces a novel approach to social behaviour recognition governed by the exchange of non-verbal cues between people. We conduct experiments to try and deduce distinct rules that dictate the social dynamics of people in a conversation, and utilise semi-supervised computer vision techniques to extract their social signals such as laughing and nodding. Data mining is used to deduce frequently occurring patterns of social trends between a speaker and listener in both interested and not interested social scenarios. The confidence values from rules are utilised to build a Social Dynamic Model (SDM), that can then be used for classification and visualisation. By visualising the rules generated in the SDM, we can analyse distinct social trends between an interested and not interested listener in a conversation. Results show that these distinctions can be applied generally and used to accurately predict conversational interest.
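The mined rule confidences can be read as conditional probabilities, confidence(A -> B) = support(A and B) / support(A); a toy computation, with made-up signal names:

```python
from collections import Counter

frames = [  # per-frame social signals for (speaker, listener), invented data
    ({'speaking'}, {'nod'}), ({'speaking'}, {'nod', 'smile'}),
    ({'speaking'}, set()), ({'laugh'}, {'laugh'}),
]
antecedent = Counter(); joint = Counter()
for speaker, listener in frames:
    for a in speaker:
        antecedent[a] += 1
        for b in listener:
            joint[(a, b)] += 1

confidence = {(a, b): c / antecedent[a] for (a, b), c in joint.items()}
print(confidence)  # e.g. P(listener nods | speaker speaking) = 2/3
```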
Cooper Helen, Ong Eng-Jon, Pugeault Nicolas, Bowden Richard (2017) Sign Language Recognition Using Sub-units, In: Escalera Sergio, Guyon Isabelle, Athitsos Vassilis (eds.), Gesture Recognition pp. 89-118 Springer International Publishing
This chapter discusses sign language recognition using linguistic sub-units. It presents three types of sub-units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data. These sub-units are then combined using a sign level classifier; here, two options are presented. The first uses Markov Models to encode the temporal changes between sub-units. The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information. This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains to 76%.
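A toy version of the Markov-model option only: fit one smoothed transition matrix over sub-unit labels per sign and classify by log-likelihood. The vocabulary, smoothing constant and sequences are invented for illustration.

```python
import numpy as np

def fit_transitions(sequences, n_labels, alpha=1.0):
    """Row-normalised transition matrix over sub-unit labels, Laplace-smoothed."""
    T = np.full((n_labels, n_labels), alpha)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

def log_likelihood(seq, T):
    return sum(np.log(T[a, b]) for a, b in zip(seq, seq[1:]))

signs = {  # toy sub-unit label sequences per sign class
    'hello': [[0, 1, 2, 2], [0, 1, 1, 2]],
    'thanks': [[2, 1, 0], [2, 2, 1, 0]],
}
models = {name: fit_transitions(seqs, n_labels=3) for name, seqs in signs.items()}
query = [0, 1, 2]
print(max(models, key=lambda name: log_likelihood(query, models[name])))  # 'hello'
```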
Ong E, Pugeault N, Gilbert A, Bowden R (2016) Learning multi-class discriminative patterns using episode-trees, 7th International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2016)
In this paper, we aim to tackle the problem of recognising temporal sequences in the context of a multi-class problem. In the past, the representation of sequential patterns was used for modelling discriminative temporal patterns for different classes. Here, we have improved on this by using the more general representation of episodes, of which sequential patterns are a special case. We then propose a novel tree structure called a MultI-Class Episode Tree (MICE-Tree) that allows one to simultaneously model a set of different episodes in an efficient manner whilst providing labels for them. A set of MICE-Trees are then combined together into a MICE-Forest that is learnt in a Boosting framework. The result is a strong classifier that utilises episodes for performing classification of temporal sequences. We also provide experimental evidence showing that the MICE-Trees allow for a more compact and efficient model compared to sequential patterns. Additionally, we demonstrate the accuracy and robustness of the proposed method in the presence of different levels of noise and class labels.
Ong Eng-Jon, Husain Sameed, Bober-Irizar Mikel, Bober Miroslaw (2018) Deep Architectures and Ensembles for Semantic Video Classification, IEEE Transactions on Circuits and Systems for Video Technology Institute of Electrical and Electronics Engineers (IEEE)
This work addresses the problem of accurate semantic labelling of short videos. To this end, we evaluate a multitude of different deep nets, ranging from traditional recurrent neural networks (LSTM, GRU) and temporal-agnostic networks (FV, VLAD, BoW) to fully connected neural networks with mid-stage AV fusion, among others. Additionally, we propose a residual architecture-based DNN for video classification, with state-of-the-art classification performance at significantly reduced complexity. Furthermore, we propose four new approaches to diversity-driven multi-net ensembling, one based on a fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by ensembling diverse nets and we investigate factors contributing to high diversity. Based on the extensive YouTube8M dataset, we provide an in-depth evaluation and analysis of their behaviour. We show that the performance of the ensemble is state-of-the-art, achieving the highest accuracy on the YouTube8M Kaggle test data. The ensemble of classifiers was also evaluated on the HMDB51 and UCF101 datasets, and the results show that the resulting method achieves comparable accuracy to state-of-the-art methods using similar input features.
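The correlation-based ensembling can be sketched as a greedy selection that repeatedly adds the net least correlated with the current ensemble; seeding with the best single net and using Pearson correlation on flattened score matrices are assumptions here, not the paper's exact measure.

```python
import numpy as np

def greedy_diverse_ensemble(scores, accuracy, k):
    """scores: name -> (n_examples, n_classes) outputs; accuracy: name -> float."""
    chosen = [max(accuracy, key=accuracy.get)]     # seed with the best single net
    while len(chosen) < k:
        ens = np.mean([scores[n] for n in chosen], axis=0).ravel()
        rest = [n for n in scores if n not in chosen]
        # Lowest correlation with the current ensemble = most diversity added.
        chosen.append(min(rest, key=lambda n: np.corrcoef(ens, scores[n].ravel())[0, 1]))
    return chosen

rng = np.random.default_rng(0)
scores = {f'net{i}': rng.random((100, 10)) for i in range(6)}
accuracy = {f'net{i}': float(rng.random()) for i in range(6)}
print(greedy_diverse_ensemble(scores, accuracy, k=3))
```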