A method of representing an image or sequence of images using a depth map comprises transforming an n-bit depth map representation into an m-bit depth map representation, where m
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving the peak values in CSS space for the object outline and applying a non-linear transformation to said peak values to arrive at a representation of the outline.
A method of representing a data distribution derived from an object or image by processing signals corresponding to the object or image comprising deriving an approximate representation of the data distribution and analysing the errors of the data elements when expressed in terms of the approximate representation.
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving a plurality of sets of co-ordinate values representing the shape of the object and quantising the co-ordinate values to derive a coded representation of the shape, and further comprises quantising a first co-ordinate value over a first quantisation range and quantising a smaller co-ordinate value over a smaller range.
A method of representing at least one image comprises deriving at least one descriptor based on color information and color interrelation information for at least one region of the image, the descriptor having at least one descriptor element, derived using values of pixels in said region, wherein at least one descriptor element for a region is derived using a non-wavelet transform. The representations may be used for image comparisons.
M PETROU, M BOBER, J KITTLER (1994) MULTIRESOLUTION MOTION SEGMENTATION, In: PROCEEDINGS OF THE 12TH IAPR INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION - CONFERENCE A: COMPUTER VISION & IMAGE PROCESSING pp. 379-383
A method of representing a group of data items comprises, for each of a plurality of data items in the group, determining the similarity between said data item and each of a plurality of other data items in the group, assigning a rank to each pair on the basis of similarity, wherein the ranked similarity values for each of said plurality of data items are associated to reflect the overall relative similarities of data items in the group.
The complete theory for Fisher and dual discriminant analysis is presented as the background of the novel algorithms. LDA is found as a composition of the projection onto the singular subspace for within-class normalised data with the projection onto the singular subspace for between-class normalised data. The dual LDA consists of those projections applied in reverse order. The experiments show that using a suitable composition of dual LDA transformations gives at least as good results as recent state-of-the-art solutions.
A method of representing an object appearing in a still or video image for use in searching, wherein the object appears in the image with a first two-dimensional outline, by processing signals corresponding to the image, comprises deriving a view descriptor of the first outline of the object and deriving at least one additional view descriptor of the outline of the object in a different view, and associating the two or more view descriptors to form an object descriptor.
A Hough transform based method of estimating N parameters $a = (a_1, \ldots, a_N)$ of motion of a region Y in a first image to a following image, the first and following images represented, in a first spatial resolution, by intensities at pixels having coordinates in a coordinate system, the method including: determining the total support $H(Y, a)$ as a sum of the values of an error function for the intensities at pixels in the region Y; determining the motion parameters a that give the total support a minimum value; the determining being made in steps of an iterative process moving along a series of parameter estimates $a_1, a_2, \ldots$ by calculating partial derivatives $dH_i = \partial H(Y, a_n) / \partial a_{n,i}$ of the total support for a parameter estimate $a_n$ with respect to each of the parameters $a_i$ and evaluating the calculated partial derivatives for taking a new $a_{n+1}$; and wherein, in the evaluating of the partial derivatives, the partial derivatives $dH_i$ are first scaled by multiplying by scaling factors dependent on the spatial extension of the region to produce scaled partial derivatives $dHN_i$.
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving a plurality of numerical values associated with features appearing on the outline of an object starting from an arbitrary point on the outline and applying a predetermined ordering to said values to arrive at a representative of the outline.
An entity is subjected to an interrogating signal, and the reflection from the entity is repeatedly sampled to obtain a first set of values each dependent on the intensity of the reflected signal. A logarithmic transformation is applied to the sample values to obtain a second set of values. A set of descriptor values is derived, the set comprising at least a first descriptor value (L) representing the difference between the mean and the median of the second set of values, and a second descriptor value (D) representing the mean of the absolute value of the deviation between each second set value and an average of the second set of values.
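The two descriptor values described above can be illustrated with a minimal sketch. The function name `reflection_descriptors` is hypothetical, the natural logarithm is assumed for the logarithmic transformation, and the arithmetic mean is assumed for the "average" used in D; the abstract does not fix either choice.

```python
import math
import statistics


def reflection_descriptors(samples):
    """Return (L, D) for a sequence of positive intensity samples.

    Assumptions (not stated in the abstract): natural logarithm, and the
    arithmetic mean as the "average" used for the deviation in D.
    """
    # Second set of values: logarithm of each intensity sample.
    logs = [math.log(s) for s in samples]
    mean = statistics.fmean(logs)
    median = statistics.median(logs)
    # L: difference between the mean and the median of the log values.
    L = mean - median
    # D: mean absolute deviation of the log values from their mean.
    D = statistics.fmean([abs(v - mean) for v in logs])
    return L, D
```

For a symmetric set of log values, L is zero and D measures the spread alone; a skewed distribution of reflected intensities pushes L away from zero.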
This paper addresses a problem of robust, accurate and fast object detection in complex environments, such as cluttered backgrounds and low-quality images. To overcome the problems with existing methods, we propose a new object detection approach, called Statistical Template Matching. It is based on generalized description of the object by a set of template regions and statistical testing of object/non-object hypotheses. A similarity measure between the image and a template is derived from the Fisher criterion. We show how to apply our method to face and facial feature detection tasks, and demonstrate its performance in some difficult cases, such as moderate variation of scale factor of the object, local image warping and distortions caused by image compression. The method is very fast; its speed is independent of the template size and depends only on the template complexity.
A method of detecting an object in an image comprises comparing a template with a region of an image and determining a similarity measure, wherein the similarity measure is determined using a statistical measure. The template comprises a number of regions corresponding to parts of the object and their spatial relations. The variance of the pixels within the total template is set in relation to the variances of the pixels in all individual regions, to provide a similarity measure.
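One way to read the variance relation above is as a Fisher-style ratio of total scatter to within-region scatter. The sketch below makes that assumption explicit; the function name `template_similarity`, the use of a ratio, and the `eps` guard are illustrative choices, not details taken from the abstract.

```python
from statistics import pvariance


def template_similarity(regions, eps=1e-12):
    """Fisher-style score for pixels grouped by template region.

    regions: list of lists of pixel values, one list per template region.
    Assumed form: (N * var_total) / sum_i(n_i * var_i). A high score means
    the regions are internally uniform but differ from each other.
    """
    all_pixels = [p for region in regions for p in region]
    # Scatter of all pixels covered by the template.
    total_scatter = len(all_pixels) * pvariance(all_pixels)
    # Sum of the scatters inside each individual region.
    within_scatter = sum(len(r) * pvariance(r) for r in regions)
    # eps avoids division by zero for perfectly uniform regions.
    return total_scatter / (within_scatter + eps)
```

When the image patch does not follow the template layout, the regions mix bright and dark pixels, the within-region scatter approaches the total scatter, and the score falls towards 1.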
A method of deriving a representation of a video sequence comprises deriving metadata expressing at least one temporal characteristic of a frame or group of frames, and one or both of metadata expressing at least one content-based characteristic of a frame or group of frames and relational metadata expressing relationships between at least one content-based characteristic of a frame or group of frames and at least one other frame or group of frames, and associating said metadata and/or relational metadata with the respective frame or group of frames.
Method and apparatus for motion vector field encoding. Abstract: A method and apparatus for representing motion in a sequence of digitized images derives a dense motion vector field and vector quantizes the motion vector field.
Visual search and image retrieval underpin numerous applications; however, the task is still challenging, predominantly due to the variability of object appearance and the ever increasing size of the databases, often exceeding billions of images. Prior art methods rely on aggregation of local scale-invariant descriptors, such as SIFT, via mechanisms including Bag of Visual Words (BoW), Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors (FV). However, their performance is still short of what is required. This paper presents a novel method for deriving a compact and distinctive representation of image content called Robust Visual Descriptor with Whitening (RVD-W). It significantly advances the state of the art and delivers world-class performance. In our approach local descriptors are rank-assigned to multiple clusters. Residual vectors are then computed in each cluster, normalized using a direction-preserving normalization function and aggregated based on the neighborhood rank. Importantly, the residual vectors are de-correlated and whitened in each cluster before aggregation, leading to a balanced energy distribution in each dimension and significantly improved performance. We also propose a new post-PCA normalization approach which improves separability between the matching and non-matching global descriptors. This new normalization benefits not only our RVD-W descriptor but also improves existing approaches based on FV and VLAD aggregation. Furthermore, we show that the aggregation framework developed using hand-crafted SIFT features also performs exceptionally well with Convolutional Neural Network (CNN) based features. The RVD-W pipeline outperforms state-of-the-art global descriptors on both the Holidays and Oxford datasets.
On the large-scale datasets, Holidays1M and Oxford1M, the SIFT-based RVD-W representation obtains a mAP of 45.1% and 35.1%, while the CNN-based RVD-W achieves a mAP of 63.5% and 44.8%, all yielding superior performance to the state-of-the-art.
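The rank-based aggregation step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `rank_aggregate`, the weights `(4, 2, 1)`, Euclidean ranking and L2 normalisation (one possible direction-preserving choice) are all hypothetical, and the de-correlation, whitening and post-PCA normalisation stages are omitted.

```python
import math


def rank_aggregate(descriptors, centres, rank_weights=(4, 2, 1)):
    """Aggregate local descriptors into one global vector.

    Each descriptor is assigned to its len(rank_weights) nearest cluster
    centres; the unit-length residual is added to each of those clusters,
    weighted by the rank of the cluster (nearest first).
    The weights are illustrative only, not values from the paper.
    """
    dim = len(centres[0])
    aggregated = [[0.0] * dim for _ in centres]
    for d in descriptors:
        # Rank the clusters by Euclidean distance to this descriptor.
        order = sorted(range(len(centres)),
                       key=lambda c: math.dist(d, centres[c]))
        for rank, c in enumerate(order[:len(rank_weights)]):
            residual = [a - b for a, b in zip(d, centres[c])]
            norm = math.sqrt(sum(v * v for v in residual))
            if norm == 0.0:
                continue  # descriptor coincides with the centre
            for i, v in enumerate(residual):
                # Normalising keeps the direction of the residual.
                aggregated[c][i] += rank_weights[rank] * v / norm
    # Concatenate the per-cluster sums into a single global descriptor.
    return [v for cluster in aggregated for v in cluster]
```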
This work addresses the problem of accurate semantic labelling of short videos. To this end, a multitude of different deep nets is evaluated, ranging from traditional recurrent neural networks (LSTM, GRU), temporal agnostic networks (FV, VLAD, BoW), fully connected neural networks mid-stage AV fusion and others. Additionally, we also propose a residual architecture-based DNN for video classification, with state-of-the-art classification performance at significantly reduced complexity. Furthermore, we propose four new approaches to diversity-driven multi-net ensembling, one based on fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by ensembling diverse nets and we investigate factors contributing to high diversity. Based on the extensive YouTube8M dataset, we provide an in-depth evaluation and analysis of their behaviour. We show that the performance of the ensemble is state-of-the-art, achieving the highest accuracy on the YouTube8M Kaggle test data. The performance of the ensemble of classifiers was also evaluated on the HMDB51 and UCF101 datasets, and we show that the resulting method achieves comparable accuracy with state-of-the-art methods using similar input features.
The incorporation of intelligent video processing algorithms into digital surveillance systems has been examined in this work. In particular, the use of the latest standard in multi-media feature extraction and matching is discussed. The use of such technology makes a system very different to current surveillance systems which store text-based meta-data. In our system, descriptions based upon shape and colour are extracted in real-time from two sequences of video recorded from a real-life scenario. The stored database of descriptions can then be searched using a query description constructed by the operator; this query is then compared with every description stored for the video sequence. We show examples of the fast and accurate search made possible with this latest technology for multimedia content description applied to a video surveillance database.
M Bober (2001) MPEG-7 visual shape descriptors, In: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 11(6) pp. 716-719 IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Linear Discriminant Analysis (LDA) is a popular feature extraction technique that aims at creating a feature set of enhanced discriminatory power. The authors introduced a novel approach, Dual LDA (DLDA), and proposed an efficient SVD-based implementation. This paper focuses on the feature space reduction aspect of DLDA, achieved through proper choice of the parameters controlling the DLDA algorithm. The comparative experiments, conducted on a collection of five facial databases comprising in total more than 10000 photos, show that DLDA outperforms by a great margin the methods reducing the feature space by means of feature subset selection. © 2005 IEEE.
The use of a robust, low-level motion estimator based on a Robust Hough Transform (RHT) in a range of tasks, such as optical flow estimation, and motion estimation for video coding and retrieval from video sequences was discussed. RHT derived not only pixels displacements, but also provided direct motion segmentation and other motion-related clues. The RHT algorithm employed an affine region-to-region transformation model and was invariant to illumination changes, in addition to being statistically robust. It was found that RHT did not base the correspondence analyses on any specific type of feature, but used textured regions in the image as non-localized features.
A method for deriving an image identifier comprises deriving a scale-space representation of an image, and processing the scale-space representation to detect a plurality of feature points having values that are maxima or minima. A representation is derived for a scale-dependent image region associated with one or more of the detected plurality of feature points. In an embodiment, the size of the image region is dependent on the scale associated with the corresponding feature point. An image identifier is derived using the representations derived for the scale-dependent image regions. The image identifiers may be used in a method for comparing images.
A method of representing an image comprises deriving at least one 1-dimensional representation of the image by projecting the image onto an axis, wherein the projection involves summing values of selected pixels in a respective line of the image perpendicular to said axis, characterised in that the number of selected pixels is less than the number of pixels in the line.
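A minimal sketch of such a subsampled projection follows, assuming the image is a list of rows and that "selected pixels" means every `step`-th pixel of each column. Both the function name `subsampled_projection` and the regular sampling pattern are assumptions, since the abstract does not say how the pixels are selected.

```python
def subsampled_projection(image, step=2):
    """Project a 2-D image (list of equal-length rows) onto the horizontal axis.

    For each column (a line perpendicular to the axis) only every
    `step`-th pixel is summed, so fewer pixels than the full column
    contribute to each projection value.
    """
    if step < 2:
        raise ValueError("step must be at least 2 to select fewer pixels")
    width = len(image[0])
    selected_rows = image[::step]  # rows that contribute to the sums
    return [sum(row[x] for row in selected_rows) for x in range(width)]
```

With `step=2` the projection costs roughly half the additions of a full column sum, at the price of a noisier 1-dimensional signal.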
A method of identifying or tracking a line in an image comprises determining a start point that belongs to a line, and identifying a plurality of possible end points belonging to the line using a search window, and calculating values for a plurality of paths connecting the start point and the end points to determine an optimum end point and path, characterised in that the search window is non-rectangular.
Fisher linear discriminant analysis (FLDA) based on variance ratio is compared with scatter linear discriminant analysis (SLDA) based on determinant ratio. It is shown that each optimal FLDA data model is an optimal SLDA data model, but not vice versa. The novel algorithm 2SS4LDA (two singular subspaces for LDA) is presented, using two singular value decompositions applied directly to the normalized multiclass input data matrix and the normalized class means data matrix. It is controlled by two singular subspace dimension parameters, q and r, respectively. In face recognition experiments on the union of the MPEG-7, Altkom, and Feret facial databases, 2SS4LDA reaches about 94% person identification rate and about 0.21 average normalized mean retrieval rank. The best face recognition performance measures are achieved for those combinations of q, r values for which the variance ratio is also close to its maximum. No such correlation is observed for the SLDA separation measure. © Springer-Verlag Berlin Heidelberg 2003.
This paper presents an approach for generating class-specific image segmentation. We introduce two novel features that use the quantized data of the Discrete Cosine Transform (DCT) in a Semantic Texton Forest (STF) based framework, combining colour and texture information for the purpose of semantic segmentation. The combination of multiple features in a segmentation system is not a straightforward process. The proposed system is designed to exploit complementary features in a computationally efficient manner. Our DCT based features describe complex textures represented in the frequency domain and not just simple textures obtained using differences between intensity of pixels as in the classic STF approach. Unlike existing methods (e.g., filter banks), only a limited amount of resources is required. The proposed method has been tested on two popular databases: CamVid and MSRC-v2. Comparison with recent state-of-the-art methods shows improvement in terms of semantic segmentation accuracy.
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving a curvature scale space (CSS) representation of the object outline by smoothing the object outline, deriving at least one additional parameter reflecting the shape or mass distribution of a smoothed version of the original curve, and associating the CSS representation and the additional parameter as a shape descriptor of the object.
Transmission of compressed video over error prone channels such as mobile networks is a challenging issue. Maintaining an acceptable quality of service in such an environment demands additional post-processing tools to limit the impact of uncorrected transmission errors. Significant visual degradation of a video stream occurs when the motion vector component is corrupted. In this paper, an effective and computationally efficient method for the recovery of lost motion vectors (MVs) is proposed. The novel idea selects a neighbouring block MV that has the minimum distance from an estimated MV. Simulation results are presented, including comparison with existing methods. Our method trails the performance of the best existing method by approximately 0.1-0.5 dB. However, it has a significant advantage in that it is 50% computationally simpler. This makes our method ideal for use in mobile handsets and other applications with limited processing power.
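The selection rule in this abstract (pick the neighbouring MV closest to an estimate) can be sketched as below. How the estimated MV is obtained is not stated here, so the component-wise median of the neighbours is used purely as an assumption, as are the function name `recover_motion_vector` and the squared Euclidean distance.

```python
from statistics import median


def recover_motion_vector(neighbour_mvs):
    """Pick a replacement for a lost MV from the neighbouring blocks' MVs.

    neighbour_mvs: list of (dx, dy) tuples from correctly received blocks.
    The estimate is the component-wise median (an assumption); the result
    is the actual neighbour MV closest to that estimate.
    """
    est_x = median(mv[0] for mv in neighbour_mvs)
    est_y = median(mv[1] for mv in neighbour_mvs)

    def squared_distance(mv):
        return (mv[0] - est_x) ** 2 + (mv[1] - est_y) ** 2

    # Returning an existing neighbour MV, rather than the estimate itself,
    # guarantees the recovered vector is one that was actually transmitted.
    return min(neighbour_mvs, key=squared_distance)
```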
MPEG-7 is the first international standard which contains a number of key techniques from Computer Vision and Image Processing. The Curvature Scale Space technique was selected as a contour shape descriptor for MPEG-7 after substantial and comprehensive testing, which demonstrated the superior performance of the CSS-based descriptor. Curvature Scale Space Representation: Theory, Applications, and MPEG-7 Standardization is based on key publications on the CSS technique, as well as its multiple applications and generalizations. The goal was to ensure that the reader will have access to the most fundamental results concerning the CSS method in one volume. These results have been categorized into a number of chapters to reflect their focus as well as content. The book also includes a chapter on the development of the CSS technique within MPEG standardization, including details of the MPEG-7 testing and evaluation processes which led to the selection of the CSS shape descriptor for the standard. The book can be used as a supplementary textbook by any university or institution offering courses in computer and information science.
A method and apparatus for deriving a representation of an image is described. The method involves processing signals corresponding to the image. A two-dimensional function of the image, such as a Trace transform (T (d, θ)), of the image using at least one functional T, is derived and processed using a mask function (β) to derive an intermediate representation of the image, corresponding to a one-dimensional function. In one embodiment, the mask function defines pairs of image bands of the Trace transform in the Trace domain. The representation of the image may be derived by applying existing techniques to the derived one-dimensional function.
MZ Bober, F Preteux, W-Y Kim (2002) Shape Descriptors, In: Introduction to MPEG-7, John Wiley & Sons Inc
Introduction to MPEG-7 takes a systematic approach to the standard and provides a unique overview of the principles and concepts behind audio-visual indexing, ...
© Springer-Verlag Berlin Heidelberg 1997. Phase correlation techniques have been used in image registration to estimate image displacements. These techniques have also been used to estimate optical flow by applying them locally. In this work a different phase correlation-based method is proposed to deal with a deformation/translation motion model, instead of the pure translations that the basic phase correlation technique can estimate. Some experimental results are also presented to show the accuracy of the motion parameters estimated and the use of phase correlation to estimate optical flow.
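For orientation, the basic phase correlation technique mentioned above (pure translation only, not the deformation/translation extension proposed in the paper) can be sketched in one dimension. The naive O(N^2) DFT and the function names `dft` and `phase_correlation_shift` are illustrative choices that keep the example free of external libraries.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform of a sequence (O(N^2))."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def phase_correlation_shift(f, g, eps=1e-12):
    """Estimate the circular shift d such that g[n] == f[(n - d) % N]."""
    F, G = dft(f), dft(g)
    # Cross-power spectrum; only its phase carries the displacement.
    cross = [b * a.conjugate() for a, b in zip(F, G)]
    normalised = [c / abs(c) if abs(c) > eps else 0j for c in cross]
    # The inverse transform of a pure linear phase is a peak at the shift.
    corr = dft(normalised, inverse=True)
    return max(range(len(corr)), key=lambda n: corr[n].real)
```

Normalising away the magnitude is what makes the peak sharp and largely insensitive to uniform changes in brightness or contrast.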
This paper addresses the problem of very large-scale image retrieval, focusing on improving its accuracy and robustness. We target enhanced robustness of search to factors such as variations in illumination, object appearance and scale, partial occlusions, and cluttered backgrounds, which is particularly important when search is performed across very large datasets with significant variability. We propose a novel CNN-based global descriptor, called REMAP, which learns and aggregates a hierarchy of deep features from multiple CNN layers, and is trained end-to-end with a triplet loss. REMAP explicitly learns discriminative features which are mutually-supportive and complementary at various semantic levels of visual abstraction. These dense local features are max-pooled spatially at each layer, within multi-scale overlapping regions, before aggregation into a single image-level descriptor. To identify the semantically useful regions and layers for retrieval, we propose to measure the information gain of each region and layer using KL-divergence. Our system effectively learns during training how useful various regions and layers are and weights them accordingly. We show that such relative entropy-guided aggregation outperforms classical CNN-based aggregation controlled by SGD. The entire framework is trained in an end-to-end fashion, outperforming the latest state-of-the-art results. On image retrieval datasets Holidays, Oxford and MPEG, the REMAP descriptor achieves mAP of 95.5%, 91.5% and 80.1% respectively, outperforming any results published to date. REMAP also formed the core of the winning submission to the Google Landmark Retrieval Challenge on Kaggle.
This paper addresses the problem of ultra-large-scale search in Hamming spaces. There has been considerable research on generating compact binary codes in vision, for example for visual search tasks. However, the issue of efficient searching through huge sets of binary codes remains largely unsolved. To this end, we propose a novel, unsupervised approach to thresholded search in Hamming space, supporting long codes (e.g. 512 bits) with a wide range of Hamming distance radii. Our method is capable of working efficiently with billions of codes, delivering between one and three orders of magnitude acceleration compared to prior art. This is achieved by relaxing the equal-size constraint in the Multi-Index Hashing approach, leading to multiple hash-tables with variable length hash-keys. Based on the theoretical analysis of the retrieval probabilities of multiple hash-tables, we propose a novel search algorithm for obtaining a suitable set of hash-key lengths. The resulting retrieval mechanism is shown empirically to improve the efficiency over the state-of-the-art, across a range of datasets, bit-depths and retrieval thresholds.
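The idea of several hash tables keyed on substrings of unequal length can be illustrated with a small sketch. It relies only on the pigeonhole argument: if a code is split into at least r + 1 disjoint substrings, any code within Hamming distance r of the query matches the query exactly on at least one substring. The function names and the hand-picked key lengths are assumptions; the paper's algorithm for choosing the key lengths is not reproduced here.

```python
def build_tables(codes, key_lengths):
    """Build one hash table per substring of the binary codes.

    codes: list of non-negative ints holding the binary codes.
    key_lengths: bit lengths of consecutive, disjoint substrings; they
    may differ from each other and are assumed to cover the whole code.
    """
    tables = []
    offset = 0
    for length in key_lengths:
        mask = (1 << length) - 1
        table = {}
        for idx, code in enumerate(codes):
            key = (code >> offset) & mask
            table.setdefault(key, []).append(idx)
        tables.append((offset, mask, table))
        offset += length
    return tables


def search(query, codes, tables, radius):
    """Return indices of all codes within `radius` bits of `query`."""
    if len(tables) < radius + 1:
        raise ValueError("exact-key lookup needs at least radius + 1 tables")
    candidates = set()
    for offset, mask, table in tables:
        # Exact lookup of the query's substring in each table.
        candidates.update(table.get((query >> offset) & mask, []))
    # Verify candidates with the full Hamming distance.
    return sorted(i for i in candidates
                  if bin(codes[i] ^ query).count("1") <= radius)
```

Shorter keys produce larger buckets and more verification work, while longer keys produce sparser buckets, which is why the choice of key lengths affects search cost.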
This paper is concerned with the design of a compact, binary and scalable image representation that is easy to compute, fast to match and delivers beyond state-of-the-art performance in visual recognition of objects, buildings and scenes. A novel descriptor is proposed which combines rank-based multi-assignment with a robust aggregation framework and cluster/bit selection mechanisms for size scalability. Extensive performance evaluation is presented, including experiments within the state-of-the-art pipeline developed by the MPEG group standardising Compact Descriptors for Visual Search (CDVS).
A one-dimensional representation of an image is obtained using a mapping function defining a closed scanning curve. The function is decomposed into component signals which represent different parts of the bandwidth of the representation using bi-directional filters to achieve zero group delay.
A method of representing a 2-dimensional image comprises deriving at least one 1-dimensional representation of the image by projecting the image onto at least one axis, and applying a Fourier transform to said 1-dimensional representation. The representation can be used for estimation of dominant motion between images.
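As a rough illustration of why such a representation suits dominant motion estimation, the sketch below projects an image onto both axes and takes the Fourier transform of each projection; a translation along one axis then appears as a linear phase change in that projection's spectrum. The function names and the naive DFT are assumptions made to keep the example self-contained.

```python
import cmath


def project(image, axis):
    """Sum pixel values along lines perpendicular to the chosen axis.

    axis="x" gives one value per column; axis="y" gives one value per row.
    """
    if axis == "x":
        return [sum(column) for column in zip(*image)]
    return [sum(row) for row in image]


def dft(signal):
    """Naive discrete Fourier transform (O(N^2))."""
    n = len(signal)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(signal))
            for k in range(n)]


def projection_spectra(image):
    """Return the Fourier transforms of the x and y projections."""
    return dft(project(image, "x")), dft(project(image, "y"))
```

Working on two 1-dimensional signals instead of the full 2-dimensional image reduces the transform cost considerably, which is the practical appeal of the projection step.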
A method and apparatus for processing a first sequence of images and a second sequence of images to compare the first and second sequences is disclosed. Each of a plurality of the images in the first sequence and each of a plurality of the images in the second sequence is processed by (i) processing the image data for each of a plurality of pixel neighbourhoods in the image to generate at least one respective descriptor element for each of the pixel neighbourhoods, each descriptor element comprising one or more bits; and (ii) forming a plurality of words from the descriptor elements of the image such that each word comprises a unique combination of descriptor element bits. The words for the second sequence are generated from the same respective combinations of descriptor element bits as the words for the first sequence. Processing is performed to compare the first and second sequences by comparing the words generated for the plurality of images in the first sequences with the words...
M Bober (2001) MPEG-7: Evolution or revolution?, In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2124 pp. 1-? © Springer-Verlag Berlin Heidelberg 2001. The ISO MPEG-7 Standard, also known as a Multimedia Content Description Interface, will be soon finalized. After several years of intensive work on technology development, implementation and testing by almost all major players in the digital multimedia arena, the results of this international project will be assessed by the most cruel and demanding judge: the market. Will it meet all the high expectations of the developers and, above all, future users? Will it result in a revolution, evolution or will it just simply pass unnoticed? In this invited lecture, I will review the components of the MPEG-7 Standard in the context of some novel applications. I will go beyond the classical image/video retrieval scenarios, and look into a more generic image/object recognition framework relying on the MPEG-7 technology. Such a framework is applicable to a wide range of new applications. The benefits of using standardized technology, over other state-of-the-art techniques from computer vision, image processing, and database retrieval, will be investigated. Demonstrations of the generic object recognition system will be presented, followed by some other examples of emerging applications made possible by the Standard. In conclusion, I will assess the potential impact of this new standard on emerging services, products and future technology developments.
Multiple cues play a crucial role in image interpretation. A vision system that combines shape, colour, motion, prior scene knowledge and object motion behaviour is described. We show that the use of interpretation strategies which depend on the image data, temporal context and visual goals significantly simplifies the complexity of the image interpretation problem and makes it computationally feasible.
© Springer-Verlag Berlin Heidelberg 1997. This paper is concerned with an efficient estimation and segmentation of 2-D motion from image sequences, with the focus on traffic monitoring applications. In order to reduce the computational load and facilitate real-time implementation, the proposed approach makes use of simplifying assumptions that the camera is stationary and that the projection of vehicle motion on the image plane can be approximated by translation. We show that a good performance can be achieved even under such apparently restrictive assumptions. To further reduce processing time, we perform gray-level based segmentation that extracts regions of uniform intensity. Subsequently, we estimate motion for the regions. Regions moving with coherent motion are allowed to merge. The use of 2D motion analysis and the pre-segmentation stage significantly reduces the computational load, and the region-based estimator gives robustness to noise and changes of illumination.
Humans have an innate ability to communicate visually; the earliest forms of communication were cave drawings, and children can communicate visual descriptions of scenes through drawings well before they can write. Drawings and sketches offer an intuitive and efficient means for communicating visual concepts. Today, society faces a deluge of digital visual content driven by a surge in the generation of video on social media and the online availability of video archives. Mobile devices are emerging as the dominant platform for consuming this content, with Cisco predicting that by 2018 over 80% of mobile traffic will be video. Sketch offers a familiar and expressive modality for interacting with video on the touch-screens commonly present on such devices. This thesis contributes several new algorithms for searching and manipulating video using free-hand sketches. We propose the Visual Narrative (VN): a storyboarded sequence of one or more actions in the form of sketch that collectively describe an event. We show that VNs can be used both to search video repositories efficiently and to synthesise video clips. First, we describe a sketch based video retrieval (SBVR) system that fuses multiple modalities (shape, colour, semantics, and motion) in order to find relevant video clips. An efficient multi-modal video descriptor is proposed, enabling the search of hundreds of videos in milliseconds. This contrasts with prior SBVR that lacks an efficient index representation and takes minutes or hours to search similar datasets. This contribution not only makes SBVR practical at interactive speeds, but also enables user-refinement of results through relevance feedback to resolve sketch ambiguity, including the relative priority of the different VN modalities. Second, we present the first algorithm for sketch based pose retrieval. A pictographic representation (stick-men) is used to specify a desired human pose within the VN, and similar poses are found within a video dataset.
We use archival dance performance footage from the UK National Resource Centre for Dance (UK-NRCD), containing diverse examples of human pose. We investigate appropriate descriptors for sketch and video, and propose a novel manifold learning technique for mapping between the two descriptor spaces and so performing sketched pose retrieval. We show that domain adaptation can be applied to boost the performance of this system through a novel piece-wise feature-space warping technique. Third, we present a graph representation for VNs comprising multiple actions. We focus on the extension of our pose retrieval system to a sequence of poses interspersed with actions (e.g. jump, twirl). We show that our graph representation can be used for multiple applications: 1) to retrieve sequences of video comprising multiple actions; 2) to navigate in pictorial form, the retrieved video sequences; 3) to synthesise new video sequences by retrieving and concatenating video fragments from archival footage.
Visual search and recognition underpin numerous applications including management of multimedia content, mobile commerce, surveillance, navigation, robotics and many others. However, the task is still challenging, predominantly due to the variability of object appearance and the ever increasing size of the databases, often exceeding billions of images. The objective of this thesis is to develop a robust, compact and discriminative image representation suitable for tasks of visual search. This thesis contributes to four research areas. First we propose a novel method, named Robust Visual Descriptor (RVD), for deriving a compact and robust representation of image content which significantly advances the state of the art and delivers world-class performance. In our approach, the local descriptors are assigned to multiple cluster centres with rank weights, leading to a stable and reliable global image representation. Residual vectors are then computed in each cluster, normalized using a direction preserving normalization and aggregated based on the neighbourhood rank information. We then propose two extensions to the core RVD descriptor. The first one consists of de-correlating weighted residual vectors by applying cluster level PCA before aggregation. In the second extension, the weighted residual vectors are whitened in each cluster before aggregation, leading to a balanced energy distribution in each dimension and improved performance. Compressing floating point global signatures to binary codes improves storage requirements and matching speed for large scale image retrieval tasks. Our third contribution is to derive a compact and robust binary image signature from the core RVD representation. In addition, we propose a novel binary descriptor matching algorithm, PCAE with Weighted Hamming distance (PCAE+WH), to minimize the quantization loss associated with converting floating point vectors to discrete binary codes.
In the context of industry work on Compact Descriptors for Visual Search (CDVS) and its standardization in MPEG (ISO), we propose a scalable RVD representation. The bitrate scalability is achieved by employing novel Cluster Selection and Bit Selection mechanisms which support interoperable binary RVD representations. Moreover, we propose a very efficient and effective score function based on weighted Hamming distance to compute the similarity between two binary representations. Our fourth contribution is to develop an image classification system based on the RVD representation. We introduce an effective method to incorporate second-order statistics in the original RVD framework.
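The rank-weighted residual aggregation described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name rvd_sketch, the rank weights (4, 2, 1) and the use of L2 normalization as the direction-preserving normalization are all assumptions made for illustration, and the PCA, whitening and binarisation stages are omitted.

```python
import numpy as np

def rvd_sketch(descriptors, centres, rank_weights=(4.0, 2.0, 1.0)):
    """Aggregate local descriptors (n, d) into one global vector of length k*d.

    Hypothetical illustration: weights and normalization are assumptions.
    """
    k, d = centres.shape
    aggregated = np.zeros((k, d))
    # Squared Euclidean distance from every descriptor to every centre.
    dists = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the nearest centres for each descriptor, closest first.
    nearest = np.argsort(dists, axis=1)[:, :len(rank_weights)]
    for x, ranked in zip(descriptors, nearest):
        for rank, c in enumerate(ranked):
            residual = x - centres[c]
            norm = np.linalg.norm(residual)
            if norm > 0:
                # Unit-length residual keeps direction, discards magnitude;
                # closer centres (lower rank index) receive a larger weight.
                aggregated[c] += rank_weights[rank] * residual / norm
    vector = aggregated.ravel()
    total = np.linalg.norm(vector)
    return vector / total if total > 0 else vector
```

Assigning each descriptor to several centres, rather than only the closest, is what makes the aggregate less sensitive to small shifts near cluster boundaries.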
The method of Dynamic Mode Decomposition (DMD) was introduced originally in the area of Computational Fluid Dynamics (CFD) for extracting coherent structures from spatio-temporal complex fluid flow data. DMD takes in time series data and computes a set of modes, each of which is associated with a complex eigenvalue. DMD analysis is closely associated with spectral analysis of the Koopman operator, which provides a linear but infinite-dimensional representation of nonlinear dynamical systems. Therefore, by using DMD, a nonlinear system can be described by a superposition of modes whose dynamics are governed by the eigenvalues. The key advantage of DMD is its data-driven nature, which does not rely on any prior assumptions except the inherent dynamics which are observed over time. Its capability for extracting relevant modes from complex fluid flows has seen significant application across multiple fields, including computer vision, robotics and neuroscience. This thesis, in order to expand DMD to other applications, advances the original formulation so that it can be used to solve novel problems in the fields of signal processing and computer vision. In signal processing, this thesis introduces the method of using DMD for decomposing a univariate time series into a number of interpretable elements with different subspaces, such as noise, trends and harmonics. In addition, univariate time series forecasting is shown using DMD. The computer vision part of this thesis focuses on innovative applications pertaining to the areas of medical imaging, biometrics and background modelling. In the area of medical imaging, a novel DMD framework is proposed that introduces windowed and reconstruction variants of DMD for quantifying kidney function in Dynamic Contrast Enhanced Magnetic Resonance Imaging (DCE-MRI) sequences, through movement correction and functional segmentation of the kidneys.
The biometrics portion of this thesis introduces a DMD based classification pipeline for counter spoofing 2D facial videos and static finger vein images. The finger vein counter spoofing makes use of a novel atemporal variant of DMD that captures micro-level artefacts that can differentiate the quality and light reflection properties between a live and a spoofed finger vein image, while the DMD on 2D facial image sequences distinguishes attack specific cues from a live face by capturing complex dynamics of head movements, eye-blinking and lip-movements in a data driven manner. Finally, this thesis proposes a new technique using DMD to obtain a background model of a visual scene in the colour domain. These aspects form the major contributions of this thesis. The results from this thesis present DMD as a promising approach for applications requiring feature extraction including: (i) trends and noise from signals, (ii) micro-level texture descriptor from images, and (iii) coherent structures from image sequences/videos, as well as applications that require suppression of movements from dynamical spatio-temporal image sequences.
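The basic decomposition that this abstract builds on, in which snapshots are turned into modes with associated complex eigenvalues, can be sketched with the standard SVD-based ("exact") DMD formulation. This is a generic textbook sketch, not the windowed, reconstruction or atemporal variants proposed in the thesis; the function name dmd and the rank parameter are illustrative.

```python
import numpy as np

def dmd(snapshots, rank):
    """Exact DMD of a snapshot matrix of shape (n_features, n_times).

    Returns the complex eigenvalues and the corresponding DMD modes.
    """
    # Pair each snapshot with its successor in time.
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    # Truncated SVD of the earlier snapshots.
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Y V S^{-1}: dividing by s scales column j by 1 / s[j].
    YVSinv = (Y @ Vh.conj().T) / s
    # Low-rank approximation of the linear operator mapping X to Y.
    A_tilde = U.conj().T @ YVSinv
    eigvals, W = np.linalg.eig(A_tilde)
    # Each column is one mode; its eigenvalue sets growth/decay and frequency.
    modes = YVSinv @ W
    return eigvals, modes
```

The magnitude of each eigenvalue gives the per-step growth or decay of its mode, and the argument gives its oscillation frequency, which is what allows slowly varying content such as a static background to be separated from faster dynamics.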
High resolution brain magnetic resonance (MR) images acquired at multiple time points across the treatment of a patient allow the quantification of localised changes brought about by disease progression. The aim of this thesis is to address the challenge of performing automatic longitudinal analysis of magnetic resonance imaging (MRI) in paediatric brain tumours. The first contribution in this thesis is the validation of a semi-automated segmentation technique. This technique was applied to intra-operative MR images acquired during the surgical resection of hypothalamic tumours in children, in order to assess the volume of tumour resected at different stages of the surgical procedure. The second contribution in this thesis is the quantification of a rare condition known as hypertrophic olivary degeneration (HOD) in lobes within the brain known as inferior olivary nuclei (ION) in relation to the development of posterior fossa syndrome (PFS) following tumour resection in the hind brain. The change in grey-level intensity over time in the left ION has been identified as a suitable biomarker that correlates with the occurrence of posterior fossa syndrome following tumour resection surgery. This study demonstrates the application of machine learning techniques to T2 brain MR images. The third contribution presents a novel approach to longitudinal brain MR analysis, focusing on the cerebellum and brain stem. This contribution presents a technique developed to interpolate multi-slice 2D MR image slices of the brain stem and cerebellum, both to infill gaps between slices and longitudinally over time, that is, in four-dimensional space. This study also investigates the application of machine learning techniques directly to the MR images. Another novel method developed in this study is the use of the Jacobian of deformations in the brain over time as an imaging feature.
Unlike the previous contribution chapter, the third contribution is not hypothesis-driven, and automatically detects six potential biomarkers that are related to the development of PFS following tumour resection in the posterior fossa. The limited number of patients considered in each study posed a major challenge. This has prompted the use of multiple validation techniques in order to provide accurate results despite the small dataset. These techniques are presented in the second and third contribution chapters.
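As a rough illustration of how a Jacobian of deformations can serve as an imaging feature, the sketch below computes the Jacobian determinant of a 2D displacement field with finite differences. It assumes a dense per-pixel displacement field is already available (for example from image registration between two time points); the function name and the restriction to 2D are simplifications for illustration and do not reproduce the method in the thesis.

```python
import numpy as np

def jacobian_determinant_2d(disp_y, disp_x):
    """Jacobian determinant of the mapping (y, x) -> (y + disp_y, x + disp_x).

    disp_y and disp_x are 2D arrays of displacements in pixels.
    Values above 1 indicate local expansion, below 1 local shrinkage.
    """
    # np.gradient returns derivatives along axis 0 (y) and then axis 1 (x).
    dyy, dyx = np.gradient(disp_y)
    dxy, dxx = np.gradient(disp_x)
    # Determinant of the identity plus the displacement gradient.
    return (1.0 + dyy) * (1.0 + dxx) - dyx * dxy
```

A map of these values, computed between successive scans, summarises where tissue has locally grown or shrunk and can be fed to a classifier alongside intensity features.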
The abrupt expansion of Internet use over the last decade has led to an uncontrollable amount of media stored on the Web. Image, video and news information has flooded the pool of data that is at our disposal, and advanced data mining techniques need to be developed in order to take full advantage of it. The focus of this thesis is mainly on developing robust video analysis technologies concerned with detecting and recognizing activities in video. The work aims at developing a compact activity descriptor with low computational cost, which will be robust enough to discriminate easily among diverse activity classes. Additionally, we introduce a motion compensation algorithm which alleviates issues introduced by a moving camera and is used to create binary motion masks, referred to as compensated Activity Areas (cAA), where dense interest points are sampled. Motion and appearance descriptors invariant to scale and illumination changes are then computed around them and a thorough evaluation of their merit is carried out. The notion of Motion Boundaries Activity Areas (MBAA) is then introduced. The concept differs from cAA in terms of the area it focuses on (i.e. human boundaries), reducing the computational cost of the activity descriptor even further. A novel algorithm that computes human trajectories, referred to as 'optimal trajectories', with variable temporal scale is introduced. It is based on the Statistical Sequential Change Detection (SSCD) algorithm, which allows dynamic segmentation of trajectories based on their motion pattern and facilitates their classification with better accuracy. Finally, we introduce an activity detection algorithm, which segments long duration videos in an accurate but computationally efficient manner. We advocate the Statistical Sequential Boundary Detection (SSBD) method as a means of analysing motion patterns and report improvement over the state of the art.
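The abstract does not specify how SSCD or SSBD are formulated. As a stand-in, the sketch below uses the classical CUSUM test, a well-known sequential change detector, to show how a change in a one-dimensional motion signal can be flagged online. The Gaussian model, the known before and after means, and the threshold are all assumptions made for illustration.

```python
import numpy as np

def cusum_change_point(x, mean_before, mean_after, sigma, threshold):
    """Return the first index at which the CUSUM statistic exceeds threshold.

    Returns None if no change is detected. Assumes Gaussian samples with
    known standard deviation and known means before and after the change.
    """
    x = np.asarray(x, dtype=float)
    # Per-sample log-likelihood ratio of the "after" model against "before".
    llr = (mean_after - mean_before) / sigma**2 * (x - (mean_before + mean_after) / 2.0)
    statistic = 0.0
    for i, step in enumerate(llr):
        # Accumulate evidence for a change, resetting when it drops below zero.
        statistic = max(0.0, statistic + step)
        if statistic > threshold:
            return i
    return None
```

Applied to a per-frame motion measurement along a trajectory, each detection would mark a candidate boundary at which the trajectory or the video could be segmented.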