Miroslaw Bober


Professor of Video Processing, BSc, MSc, PhD, MIEEE

Miroslaw Bober joined Surrey in 2011 as Professor of Video Processing. He leads the Visual Media Analysis team within the Centre for Vision, Speech and Signal Processing (CVSSP).

Prior to his appointment at Surrey, Prof Bober was General Manager of the Mitsubishi Electric R&D Centre Europe (MERCE-UK) and Head of Research for its Visual & Sensing Division, leading this European corporate R&D centre for 15 years. His technical achievements have been recognized by numerous awards, including the Presidential Award for strengthening the TV business in Japan through the innovative "Visual Navigation" content access technology (2010) and the prestigious Mitsubishi Best Invention Award for his Image Signature technology (2008; one winner selected globally).

Prof Bober received BSc and MSc degrees (with distinction) in electrical engineering from the AGH University of Science and Technology (Krakow, Poland) in 1990, an MSc in Machine Intelligence (with distinction) from the University of Surrey in 1991, and a PhD in computer vision from the University of Surrey in 1995.

Miroslaw has published over 60 peer-reviewed papers and is the named inventor on over 80 unique patent applications. He has held over 30 research and industrial grants with a total value exceeding £16M. He is a member of the British Standards Institution (BSI) committee IST/37, responsible for UK contributions to MPEG and JPEG, and represents the UK in the area of image and video analysis and associated metadata. Prof Bober chairs the MPEG technical work on Compact Descriptors for Visual Search (CDVS, standard ISO/IEC FDIS 15938-13) and Compact Descriptors for Video Analysis (CDVA, work in progress).

Research interests

My research focuses on novel techniques in signal processing, computer vision and machine learning, and their applications in industry, healthcare, big data and security. I have a particular interest in image and video analysis and retrieval (visual search, object recognition, and the analysis of motion, shape and texture). The broad research objective is to develop unique methods and technology solutions for visual content understanding that dramatically improve on the existing state of the art, leading to new applications. My algorithms for shape analysis, image/video fingerprinting and visual search are considered world-leading; they were selected for ISO International Standards within MPEG and are used by, among others, the Metropolitan Police.


Teaching

  • EEE3034 - Media Casting (Module Coordinator)
  • EEE3029 - Multimedia Systems and Component Technology
  • EEEM001 - Image and Video Compression
  • EEE3035 - Engineering Professional Studies

Departmental duties

  • Programme Director for MSc in Multimedia Signal Processing and Communications
  • Industrial Tutor for undergraduate industrial placement year
  • Personal tutor for undergraduate students (L1, L2, L3, L4)
  • MSc-level tutor
  • Member of Faculty Research Degrees Committee (FRDC)
  • Member of the Departmental Industrial Advisory Board (IAB)


Collaborations

I have extensive collaboration links with universities and research institutions in Europe (UK, Switzerland, Germany, Poland, France, Spain), the US, Japan and China. I have also worked with the following companies: the BBC (UK), Bang and Olufsen (DE), CEDEO (IT), Casio (JP), Ericsson (SE), Huawei (DE), Mitsubishi Electric (JP), RAI television (IT), Renesas Electronics (JP), Telecom Italia (IT), and Visual Atoms (UK).

Current Projects & Research Funding

I am the project coordinator and PI for the BRIDGET FP7 project [€5.28M], in which my team is responsible for developing ultra-large-scale visual search and media analysis algorithms for the broadcast industry. The project aims to open new dimensions for multimedia content creation and consumption by bridging the gap between broadcast and the Internet. Project partners include RAI television, Huawei, Telecom Italia and others.

CODAM is my latest project (PI), funded by the TSB Creative Media call [£1.05M]. My team is working with the BBC and Visual Atoms to develop an advanced video asset management system with unique visual fingerprinting and visual search capabilities. It will aid content creation and deployment by enabling visual content tracking, identification and search across multiple devices and platforms, and across diverse digital media ecosystems and markets.

Where is the original version of a low-quality clip? Which video clip has been used most often in BBC programmes: a stock shot of a red double-decker bus, or an excerpt from a royal wedding? Is there other footage in the archive that shows the same event from a fresh viewpoint? The CODAM system will answer these questions, track the origins of video clips across multi-platform productions, and search for related material. It will take the form of a modular software system that can identify individual video clips in edited programmes and perform object or scene recognition to find similar footage in an archive, without relying on manually entered and often incomplete metadata.

My publications


The demand for maritime security and safety applications has increased in recent years. In this scenario, Synthetic Aperture Radar (SAR) sensors are among the most effective means, thanks to their ability to acquire images independently of daylight and weather conditions. In the SAR ship-detection field, many algorithms have been presented in the literature; however, none of them has considered the interaction of the electromagnetic wave between the target and the surrounding sea. This thesis explores the electromagnetic interaction arising between the ship and the sea. First, a novel model is derived to evaluate the Radar Cross Section (RCS) backscattered from a canonical ship. The RCS is modelled according to the Kirchhoff Approximation (KA) within the Geometric Optics (GO) solution. The probability density function of the double-reflection contribution is derived for all polarizations, and the new model is validated on SAR images, showing a good match between the theoretical values and those measured on real SAR images. Next, a novel ship detector based on the Generalized Likelihood Ratio Test (GLRT), in which both the sea and ship electromagnetic models are considered, is proposed. The GLRT is compared to the CFAR algorithm through Monte Carlo simulations in terms of Receiver Operating Characteristic (ROC) curves and computational load at different bands (S, C and X). Performance is also compared through simulations with different orbital and scene parameters. The GLRT is then applied to datasets acquired from different sensors operating at different bands: the Target-to-Clutter Ratio (TCR) is computed and detection outcomes are compared with AIS data. Results show that the GLRT yields better ROC curves and greatly improves the TCR, but is computationally slower than the CFAR algorithm.
Finally, a new approach for ship detection and ambiguity removal in Low Pulse Repetition Frequency (LPRF) SAR imagery is proposed. The method exploits the range migration pattern and is evaluated on a downsampled SAR image. The algorithm is able to reject SAR azimuth ambiguities and can be adapted for the Maritime Mode of the upcoming NovaSAR-S sensor.
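The CFAR baseline used in the comparison above can be illustrated with a minimal one-dimensional cell-averaging (CA-CFAR) detector. This is a generic textbook sketch, not the thesis implementation; the window sizes, false-alarm rate and simulated clutter are arbitrary illustrative choices:

```python
import numpy as np

def ca_cfar_1d(signal, num_train=8, num_guard=2, rate_fa=1e-3):
    """Cell-averaging CFAR: estimate local clutter power from training
    cells around each cell under test (CUT) and threshold adaptively."""
    n = len(signal)
    num_cells = 2 * num_train
    # standard CA-CFAR scaling factor for a given false-alarm rate
    alpha = num_cells * (rate_fa ** (-1.0 / num_cells) - 1)
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for i in range(half, n - half):
        # training cells on both sides, excluding guard cells and the CUT
        left = signal[i - half : i - num_guard]
        right = signal[i + num_guard + 1 : i + half + 1]
        noise = np.mean(np.concatenate([left, right]))
        detections[i] = signal[i] > alpha * noise
    return detections

rng = np.random.default_rng(0)
power = rng.exponential(1.0, 200)   # exponentially distributed clutter power
power[100] = 60.0                   # one bright target embedded in clutter
hits = ca_cfar_1d(power)
print(bool(hits[100]))
```

Because the threshold adapts to the local clutter estimate, the detector keeps a roughly constant false-alarm rate as clutter power varies, which is exactly the property the GLRT detector above is benchmarked against.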
L Cieplinski, M Bober (1997) Scalable image coding using Gaussian pyramid vector quantisation with resolution-independent block size, In: 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vols I-V, pp. 2949-2952, IEEE Computer Society Press
M Bober, W Price, J Atkinson (2000) The contour shape descriptor for MPEG-7 and its applications, In: IEEE International Conference on Consumer Electronics 2000, Digest of Technical Papers, pp. 286-287, IEEE
Miroslaw Bober, Josef Kittler (1994) Robust motion analysis, In: 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings, pp. 947-952, IEEE Computer Society Press
MZ Bober, S Paschalakis (2010) MPEG image and video signature, pp. 81-95
MPEG-7, formally called ISO/IEC 15938 Multimedia Content Description Interface, is an international standard aimed at providing an interoperable solution to the description of various types of multimedia content, irrespective of their representation format. It is quite different from standards such as MPEG-1, MPEG-2 and MPEG-4, which aim to represent the content itself: MPEG-7 represents information about the content, known as metadata, so as to allow users to search, identify, navigate and browse audio-visual content more effectively. The MPEG-7 standard provides not only elementary visual and audio descriptors, but also multimedia description schemes, which combine the elementary audio and visual descriptors, a description definition language (DDL), and a binary compression scheme for the efficient compression and transportation of MPEG-7 metadata. In addition, reference software and conformance testing information are also part of the MPEG-7 standard and provide valuable tools for the development of standard-compliant systems.
K Zhang, M Bober, J Kittler (1995) Motion based image segmentation for video coding, In: International Conference on Image Processing, Proceedings, Vols I-III, pp. C476-C479, IEEE Computer Society Press
K Zhang, M Bober, J Kittler (1996) A hybrid codec for very low bit rate video coding, In: International Conference on Image Processing, Proceedings, Vol I, pp. 641-644, IEEE
Daqi Liu, Miroslaw Bober, Josef Kittler (2019)Visual Semantic Information Pursuit: A Survey, In: IEEE Transactions on Pattern Analysis and Machine Intelligence Institute of Electrical and Electronics Engineers (IEEE)
Visual semantic information comprises two important parts: the meaning of each visual semantic unit and the coherent visual semantic relation conveyed by these visual semantic units. Essentially, the former one is a visual perception task while the latter one corresponds to visual context reasoning. Remarkable advances in visual perception have been achieved due to the success of deep learning. In contrast, visual semantic information pursuit, a visual scene semantic interpretation task combining visual perception and visual context reasoning, is still in its early stage. It is the core task of many different computer vision applications, such as object detection, visual semantic segmentation, visual relationship detection or scene graph generation. Since it helps to enhance the accuracy and the consistency of the resulting interpretation, visual context reasoning is often incorporated with visual perception in current deep end-to-end visual semantic information pursuit methods. Surprisingly, a comprehensive review for this exciting area is still lacking. In this survey, we present a unified theoretical paradigm for all these methods, followed by an overview of the major developments and the future trends in each potential direction. The common benchmark datasets, the evaluation metrics and the comparisons of the corresponding methods are also introduced.
S Paschalakis, K Iwamoto, N Sprljan, R Oami, T Nomura, A Yamada, M Bober (2012) The MPEG-7 Video Signature Tools for Content Identification, In: IEEE Transactions on Circuits and Systems for Video Technology 22(7), pp. 1050-1063, IEEE
This paper presents the core technologies of the Video Signature Tools recently standardized by ISO/IEC MPEG as an amendment to the MPEG-7 Standard (ISO/IEC 15938). The Video Signature is a high-performance content fingerprint which is suitable for desktop-scale to web-scale deployment and provides high levels of robustness to common video editing operations and high temporal localization accuracy at extremely low false alarm rates, achieving a detection rate in the order of 96% at a false alarm rate in the order of five false matches per million comparisons. The applications of the Video Signature are numerous and include rights management and monetization, distribution management, usage monitoring, metadata association, and corporate or personal database management. In this paper we review the prior work in the field, we explain the standardization process and status, and we provide the details and evaluation results of the Video Signature Tools.
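The fingerprinting principle behind content identification can be illustrated with a toy difference-hash frame signature compared by Hamming distance. To be clear, this sketch is not the standardized MPEG-7 Video Signature; it only shows the general idea that compact binary signatures survive mild re-encoding noise while remaining discriminative between unrelated frames:

```python
import numpy as np

def frame_signature(frame, grid=8):
    """Toy binary frame signature: compare mean intensities of
    horizontally adjacent blocks on a coarse grid (difference-hash
    style). Illustrative only, not the MPEG-7 descriptor."""
    h, w = frame.shape
    bh, bw = h // grid, w // grid
    blocks = frame[:bh * grid, :bw * grid].reshape(grid, bh, grid, bw)
    means = blocks.mean(axis=(1, 3))                 # grid x grid block means
    return (means[:, 1:] > means[:, :-1]).ravel()    # 56 bits for grid=8

def hamming(a, b):
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
clip = rng.random((64, 64))                                    # "original" frame
noisy = np.clip(clip + rng.normal(0, 0.02, clip.shape), 0, 1)  # mild degradation
other = rng.random((64, 64))                                   # unrelated frame
print(hamming(frame_signature(clip), frame_signature(noisy)))  # typically small
print(hamming(frame_signature(clip), frame_signature(other)))  # typically near half the bits
```

A matching system thresholds the Hamming distance: signatures closer than the threshold are declared the same content, which is how duplicate detection can run at very low false-alarm rates.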
Santosh Tirunagari, Norman Poh, Miroslaw Bober, David Windridge (2015) Windowed DMD as a microtexture descriptor for finger vein counter-spoofing in biometrics, In: WIFS, pp. 1-6
Recent studies have shown that it is possible to attack a finger vein (FV) based biometric system using printed materials. In this study, we propose a novel method to detect spoofing of static finger vein images using Windowed Dynamic Mode Decomposition (W-DMD). This is an atemporal variant of the recently proposed Dynamic Mode Decomposition for image sequences. The proposed method achieves better results when compared to established methods such as local binary patterns (LBP), discrete wavelet transforms (DWT), histograms of gradients (HoG), and filter methods such as range filters, standard deviation filters (STD) and entropy filters, when using an SVM with a minimum intersection kernel. The overall pipeline, which consists of W-DMD and SVM, proves to be efficient and convenient to use, given the absence of additional parameter tuning requirements. The effectiveness of our methodology is demonstrated using the publicly available FV-Spoofing-Attack database. Our test results show that W-DMD can successfully detect printed finger vein images because they contain micro-level artefacts that differ not only in quality but also in light reflection properties compared to valid/live finger vein images.
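The DMD machinery underlying this work can be sketched in a few lines. The code below is a minimal exact DMD on a snapshot matrix (fit a linear operator A with X2 ≈ A·X1 via a rank-r SVD); the paper's W-DMD applies the same core step to windows of a single image rather than a temporal sequence, and the synthetic data here is purely illustrative:

```python
import numpy as np

def dmd_modes(X, r=5):
    """Minimal exact DMD: columns of X are successive snapshots.
    Returns DMD eigenvalues and modes from a rank-r reduced operator."""
    X1, X2 = X[:, :-1], X[:, 1:]
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r, :]
    Atilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)  # reduced operator
    eigvals, W = np.linalg.eig(Atilde)
    modes = X2 @ Vh.conj().T @ np.diag(1.0 / s) @ W            # exact DMD modes
    return eigvals, modes

# synthetic snapshots: one spatial pattern decaying by factor 0.9 per step
rng = np.random.default_rng(0)
space = rng.random(30)
X = np.stack([space * (0.9 ** t) for t in range(12)], axis=1)
eigvals, modes = dmd_modes(X, r=1)
print(round(float(abs(eigvals[0])), 3))  # prints 0.9, the recovered decay rate
```

The recovered eigenvalue magnitudes separate slowly varying structure from rapidly changing detail, which is what makes DMD-derived features usable as texture descriptors.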
M Bober, K Asai, A Divakaran (2001) A MPEG-4/7 based internet video and still image browsing system, In: Multimedia Systems and Applications III, 4209, pp. 33-38
A Messina, FM Burgos, M Preda, S Lepsoy, M Bober, D Bertola, S Paschalakis (2015) Making second screen sustainable in media production: The BRIDGET approach, In: TVX 2015 - Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, pp. 155-160
This paper presents work in progress of the European Commission FP7 project BRIDGET "BRIDging the Gap for Enhanced broadcasT". The project is developing innovative technology and the underlying architecture for efficient production of second screen applications for broadcasters and media companies. The project advancements include novel front-end authoring tools as well as back-end enabling technologies such as visual search, media structure analysis and 3D A/V reconstruction to support new editorial workflows.
S Paschalakis, M Bober (2003) A low cost FPGA system for high speed face detection and tracking, In: Proceedings - 2003 IEEE International Conference on Field-Programmable Technology, FPT 2003, pp. 214-221
We present an FPGA face detection and tracking system for audiovisual communications, with a particular focus on mobile videoconferencing. The advantages of deploying such a technology in a mobile handset are many, including face stabilisation, reduced bitrate, and higher quality video on practical display sizes. Most face detection methods, however, assume at least modest general-purpose processing capabilities, making them inappropriate for real-time applications, especially on power-limited devices, as well as for modest custom hardware implementations. We present a method which achieves very high detection and tracking performance and, at the same time, entails significantly reduced computational complexity, allowing real-time implementations on custom hardware or simple microprocessors. We then propose an FPGA implementation which entails very low logic and memory costs and achieves extremely high processing rates at very low clock speeds.
David M. Frohlich, Emily Corrigan-Kavanagh, Mirek Bober, Haiyue Yuan, Radu Sporea, Brice Le Borgne, Caroline Scarles, George Revill, Jan Van Duppen, Alan W. Brown, Megan Beynon (2019)The Cornwall a-book: An Augmented Travel Guide Using Next Generation Paper, In: The Journal of Electronic Publishing22(1) Michigan Publishing
Electronic publishing usually presents readers with book or e-book options for reading on paper or screen. In this paper, we introduce a third method of reading on paper-and-screen through the use of an augmented book ('a-book') with printed hotlinks that can be viewed on a nearby smartphone or other device. Two experimental versions of an augmented guide to Cornwall are shown, using either optically recognised pages or embedded electronics making the book sensitive to light and touch. We refer to these as second generation (2G) and third generation (3G) paper respectively. A common architectural framework, authoring workflow and interaction model is used for both technologies, enabling the creation of two future generations of augmented books with interactive features and content. In the travel domain we use these features creatively to illustrate the printed book with local multimedia and updatable web media, to point to the printed pages from the digital content, and to record personal and web media into the book.
Syed Sameed Husain, Miroslaw Bober (2016)On Aggregation of local binary descriptors, In: ICME MMC 2016 Proceedings
This paper addresses the problem of aggregating local binary descriptors for large scale image retrieval in mobile scenarios. Binary descriptors are becoming increasingly popular, especially in mobile applications, as they deliver high matching speed, have a small memory footprint and are fast to extract. However, little research has been done on how to efficiently aggregate binary descriptors. Direct application of methods developed for conventional descriptors, such as SIFT, results in unsatisfactory performance. In this paper we introduce and evaluate several algorithms to compress high-dimensional binary local descriptors, for efficient retrieval in large databases. In addition, we propose a robust global image representation; Binary Robust Visual Descriptor (B-RVD), with rank-based multi-assignment of local descriptors and direction-based aggregation, achieved by the use of L1-norm on residual vectors. The performance of the B-RVD is further improved by balancing the variances of residual vector directions in order to maximize the discriminatory power of the aggregated vectors. Standard datasets and measures have been used for evaluation showing significant improvement of around 4% mean Average Precision as compared to the state-of-the-art.
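The residual-aggregation idea at the heart of this line of work can be sketched with a VLAD-style aggregator: assign each local descriptor to its nearest centroid, accumulate direction-normalised residuals, and L2-normalise the result. This is only the basic mechanism; B-RVD additionally uses rank-based multi-assignment and variance balancing, which are not shown here, and all names and sizes below are illustrative:

```python
import numpy as np

def aggregate_residuals(descriptors, centroids):
    """VLAD-style aggregation sketch: nearest-centroid assignment,
    residuals normalised by their L1 norm (direction only), then a
    global L2 normalisation of the concatenated vector."""
    k, d = centroids.shape
    agg = np.zeros((k, d))
    for x in descriptors:
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))  # nearest centroid
        r = x - centroids[j]
        norm = np.abs(r).sum()                                # L1 norm of residual
        if norm > 0:
            agg[j] += r / norm                                # keep direction only
    v = agg.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                    # global normalisation

rng = np.random.default_rng(0)
desc = rng.integers(0, 2, size=(100, 32)).astype(float)  # binarised local descriptors
cents = rng.random((4, 32))                              # toy codebook, k=4
v = aggregate_residuals(desc, cents)
print(v.shape)  # one fixed-length global image vector: (128,)
```

The fixed-length output is what enables large-scale retrieval: images are compared by a single dot product regardless of how many local descriptors each contains.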
A Sibiryakov, M Bober (2006) Real-time multi-frame analysis of dominant translation, In: 18th International Conference on Pattern Recognition, Vol 1, Proceedings, pp. 55-58
S Madeo, MZ Bober (2016)Fast, Compact and Discriminative: Evaluation of Binary Descriptors for Mobile Applications, In: IEEE Transactions on Multimedia
Local feature descriptors underpin many diverse applications, supporting object recognition, image registration, database search, 3D reconstruction and more. The recent phenomenal growth in mobile devices and mobile computing in general has created demand for descriptors that are not only discriminative, but also compact in size and fast to extract and match. In response, a large number of binary descriptors have been proposed, each claiming to overcome some limitations of the predecessors. This paper provides a comprehensive evaluation of several promising binary designs. We show that existing evaluation methodologies are not sufficient to fully characterize descriptors’ performance and propose a new evaluation protocol and a challenging dataset. In contrast to the previous reviews, we investigate the effects of the matching criteria, operating points and compaction methods, showing that they all have a major impact on the systems’ design and performance. Finally, we provide descriptor extraction times for both general-purpose systems and mobile devices, in order to better understand the real complexity of the extraction task. The objective is to provide a comprehensive reference and a guide that will help in selection and design of the future descriptors.
K Zhang, M Bober, J Kittler (1996) Video coding using affine motion compensated prediction, In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, Conference Proceedings, Vols 1-6, pp. 1978-1981
M Bober, J Kittler (1996) Combining the Hough transform and multiresolution MRFs for the robust motion estimation, In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1035, pp. 91-100
The paper presents a novel approach to the robust analysis of complex motion. It employs a low-level robust motion estimator, conceptually based on the Hough Transform, and uses Multiresolution Markov Random Fields for the global interpretation of the local, low-level estimates. Motion segmentation is performed in the front-end estimator, in parallel with the motion parameter estimation process. This significantly improves the accuracy of estimates, particularly in the vicinity of motion boundaries, facilitates the detection of such boundaries, and allows the use of larger regions, thus improving robustness. The measurements extracted from the sequence in the front-end estimator include displacement, the spatial derivatives of the displacement, confidence measures, and the location of motion boundaries. The measurements are then combined within the MRF framework, employing the supercoupling approach for fast convergence. The excellent performance, in terms of estimate accuracy, boundary detection and robustness, is demonstrated on synthetic and real-world sequences.
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving the peak values in CSS space for the object outline and applying a non-linear transformation to said peak values to arrive at a representation of the outline.
MZ Bober, R Zaharia, Occupant monitoring apparatus
A seat occupant monitoring apparatus which includes a sensor reactive to a force distribution applied to the seat by the occupant, means for making a plurality of measurements of the force distribution and means for monitoring the occupant based on the measurements. The measurements are used to classify the occupant and on a basis of the classification, parameters of the occupant's environment such as a seat orientation, the rate of deployment of an air bag or control of a seat belt pre-tensioner are altered. A neural network or other learning based technique is used for the classification.
A method of analysing an image comprises performing a Hough transform on points in an image space to an n-dimensional Hough space, selecting points in the Hough space representing features in the image space, and analysing m of the n variables for the selected points, where m is less than n, for information about the features in the image space.
R O'Callaghan, M Bober (2005) MPEG-7 visual-temporal clustering for digital image collections, In: Information Retrieval Technology, Proceedings, 3689, pp. 339-350, Springer-Verlag Berlin
A method of detecting lines in an image comprises using one or more masks for detecting lines in one or more directions out of horizontal, vertical, left diagonal and right diagonal, and further comprises one or more additional masks for detecting lines in one or more additional directions.
P Brasnett, M Bober (2008) Fast and Robust Image Identification, In: 19th International Conference on Pattern Recognition, Vols 1-6, pp. 871-875
K Kucharski, W Skarbek, M Bober (2005) Feature space reduction for face recognition with dual linear discriminant analysis, In: Computer Analysis of Images and Patterns, Proceedings, 3691, pp. 587-595, Springer-Verlag Berlin
Patent number: 7236629; filing date: 8 Apr 2003; issue date: 26 Jun 2007; application number: 10/408,316. A method of detecting a region having predetermined colour characteristics in an image comprises transforming colour values of pixels in the image from a first colour space to a second colour space, using the colour values in the second colour space to determine probability values expressing a match between pixels and the predetermined colour characteristics, where the probability values range over a multiplicity of values, using said probability values to identify pixels at least approximating to said predetermined colour characteristics, grouping pixels which at least approximate to said predetermined colour characteristics, and extracting information about each group, wherein pixels are weighted according to the respective multiplicity of probability values, and the weightings are used when grouping the pixels and/or when extracting information about a group.
MZ Bober, A Sibiryakov, Robust image registration
A method of estimating a transformation between a pair of images, comprises estimating local transformations for a plurality of regions of the images to derive a set of estimated transformations, and selecting a subset of said estimated local transformations as estimated global transformations for the image.
A method of representing an object appearing in a still or video image for use in searching, wherein the object appears in the image with a first two-dimensional outline, by processing signals corresponding to the image, comprises deriving a view descriptor of the first outline of the object and deriving at least one additional view descriptor of the outline of the object in a different view, and associating the two or more view descriptors to form an object descriptor.
RS Smith, M Bober, T Windeatt (2011) A comparison of random forest with ECOC-based classifiers, In: Lecture Notes in Computer Science: Multiple Classifier Systems, 6713, pp. 207-216
We compare experimentally the performance of three approaches to ensemble-based classification on general multi-class datasets. These are the methods of random forest, error-correcting output codes (ECOC) and ECOC enhanced by the use of bootstrapping and class-separability weighting (ECOC-BW). These experiments suggest that ECOC-BW yields better generalisation performance than either random forest or unmodified ECOC. A bias-variance analysis indicates that ECOC benefits from reduced bias, when compared to random forest, and that ECOC-BW benefits additionally from reduced variance. One disadvantage of ECOC-based algorithms, however, when compared with random forest, is that they impose a greater computational demand leading to longer training times.
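Both ensemble families compared here are available off the shelf in scikit-learn, so the comparison can be reproduced in spirit with a few lines. The sketch below uses plain `OutputCodeClassifier` (not the paper's bootstrapped, class-separability-weighted ECOC-BW variant), and the dataset, base learner and hyperparameters are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# ECOC: each class receives a binary codeword; one base learner is trained
# per code bit, and prediction picks the class whose codeword is nearest
# to the vector of bit outputs.
ecoc = OutputCodeClassifier(DecisionTreeClassifier(random_state=0),
                            code_size=4.0, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

for name, clf in [("ECOC", ecoc), ("Random forest", rf)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```

The `code_size` parameter controls the codeword length relative to the number of classes; longer codes add error-correcting redundancy at the cost of training more base learners, the computational trade-off the abstract notes.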
Santosh Tirunagari, Norman Poh, Kevin Wells, Miroslaw Bober, I Gorden, David Windridge (2017) Movement correction in DCE-MRI through windowed and reconstruction dynamic mode decomposition, In: Machine Vision and Applications 28(3-4), pp. 393-407, Springer Verlag
Dynamic contrast-enhanced magnetic resonance renography (DCE-MRR) images of the kidneys contain unwanted complex organ motion due to respiration. This gives rise to motion artefacts that hinder the clinical assessment of kidney function. However, due to the rapid change in contrast agent within the DCE-MR image sequence, commonly used intensity-based image registration techniques are likely to fail. While semi-automated approaches involving human experts are a possible alternative, they pose significant drawbacks, including inter-observer variability and the bottleneck introduced through manual inspection of the multiplicity of images produced during a DCE-MRR study. To address this issue, we present a novel automated, registration-free movement correction approach based on windowed and reconstruction variants of dynamic mode decomposition (WR-DMD). Our proposed method is validated on ten different healthy volunteers' kidney DCE-MRI data sets. The results, using block-matching evaluation on the image sequence produced by WR-DMD, show the elimination of 99% of mean motion magnitude when compared to the original data sets, thereby demonstrating the viability of automatic movement correction using WR-DMD.
A method of representing an image or sequence of images using a depth map comprises transforming an n-bit depth map representation into an m-bit depth map representation, where m is less than n.
A method of representing a data distribution derived from an object or image by processing signals corresponding to the object or image comprising deriving an approximate representation of the data distribution and analysing the errors of the data elements when expressed in terms of the approximate representation.
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving a plurality of sets of co-ordinate values representing the shape of the object and quantising the co-ordinate values to derive a coded representation of the shape, and further comprises quantising a first co-ordinate value over a first quantisation range and quantising a smaller co-ordinate value over a smaller range.
A method of representing at least one image comprises deriving at least one descriptor based on color information and color interrelation information for at least one region of the image, the descriptor having at least one descriptor element, derived using values of pixels in said region, wherein at least one descriptor element for a region is derived using a non-wavelet transform. The representations may be used for image comparisons.
C Santamaria, M Bober, W Szajnowski, N Aso (2004) Analysis of remotely-sensed imagery using the level-crossing statistics texture descriptor, In: Image and Signal Processing for Remote Sensing X, 5573, pp. 115-125
A Sibiryakov, M Bober (2007) Graph-based multiple panorama extraction from unordered image sets, In: Computational Imaging V, 6498, art. no. 649809
A method of representing a group of data items comprises, for each of a plurality of data items in the group, determining the similarity between said data item and each of a plurality of other data items in the group, assigning a rank to each pair on the basis of similarity, wherein the ranked similarity values for each of said plurality of data items are associated to reflect the overall relative similarities of data items in the group.
W Skarbek, K Kucharski, M Bober (2004) Dual LDA for face recognition, In: Fundamenta Informaticae 61(3-4), pp. 303-334
The complete theory of Fisher and dual discriminant analysis is presented as background to the novel algorithms. LDA is obtained as the composition of a projection onto the singular subspace of the within-class normalised data with a projection onto the singular subspace of the between-class normalised data. The dual LDA consists of the same projections applied in reverse order. The experiments show that a suitable composition of dual LDA transformations gives at least as good results as recent state-of-the-art solutions.
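The two-projection construction described above can be sketched in NumPy. This is an illustrative standard-LDA reading of the composition (whiten the within-class scatter, then project onto the leading between-class directions in the whitened space; the dual variant simply reverses the two steps). All variable names and the synthetic data are mine, not the paper's:

```python
import numpy as np

def lda_projection(X, y, r=2, eps=1e-8):
    """LDA as two composed projections: within-class whitening (W1)
    followed by projection onto the leading between-class singular
    directions in whitened space (W2)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    # within-class and between-class scatter matrices
    Sw = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in classes)
    Sb = sum((y == c).sum() * np.outer(X[y == c].mean(0) - mu,
                                       X[y == c].mean(0) - mu) for c in classes)
    ew, Vw = np.linalg.eigh(Sw)
    W1 = Vw @ np.diag(1.0 / np.sqrt(ew + eps))     # whiten within-class scatter
    eb, Vb = np.linalg.eigh(W1.T @ Sb @ W1)
    W2 = Vb[:, ::-1][:, :r]                        # top-r between-class directions
    return W1 @ W2                                 # columns span the discriminant subspace

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (50, 4)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)
W = lda_projection(X, y, r=2)
print(W.shape)  # a 4-dimensional feature space reduced to 2 discriminant axes
```

Swapping the order of the two projections (between-class first, then within-class) gives the dual construction the abstract refers to.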
C Santamaria, M Bober, W Szajnowski (2004) Texture analysis using level-crossing statistics, In: Proceedings of the 17th International Conference on Pattern Recognition, Vol 2, pp. 712-715
A Hough transform based method of estimating N parameters a = (a_1, ..., a_N) of motion of a region Y in a first image to a following image, the first and following images represented, in a first spatial resolution, by intensities at pixels having coordinates in a coordinate system, the method including: determining the total support H(Y, a) as a sum of the values of an error function for the intensities at pixels in the region Y; determining the motion parameters a that give the total support a minimum value; the determining being made in steps of an iterative process moving along a series of parameter estimates a_1, a_2, ..., by calculating partial derivatives dH_i = ∂H(Y, a_n)/∂a_(n,i) of the total support for a parameter estimate a_n with respect to each of the parameters a_i and evaluating the calculated partial derivatives for taking a new a_(n+1); and wherein, in the evaluating of the partial derivatives, the partial derivatives dH_i are first scaled by multiplying by scaling factors dependent on the spatial extension of the region to produce scaled partial derivatives dHN_i.
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving a plurality of numerical values associated with features appearing on the outline of an object starting from an arbitrary point on the outline and applying a predetermined ordering to said values to arrive at a representation of the outline.
An entity is subjected to an interrogating signal, and the reflection from the entity is repeatedly sampled to obtain a first set of values each dependent on the intensity of the reflected signal. A logarithmic transformation is applied to the sample values to obtain a second set of values. A set of descriptor values is derived, the set comprising at least a first descriptor value (L) representing the difference between the mean and the median of the second set of values, and a second descriptor value (D) representing the mean of the absolute value of the deviation between each second set value and an average of the second set of values.
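The two descriptor values described above can be sketched in a few lines; the names L and D follow the abstract, while the small offset eps guarding against log(0) is our own addition:

```python
import math
from statistics import mean, median

def reflection_descriptors(samples, eps=1e-12):
    """Sketch of the descriptor pair: apply a log transform to the sampled
    reflection intensities, then derive L (mean minus median) and D (mean
    absolute deviation about the mean)."""
    logs = [math.log(s + eps) for s in samples]      # logarithmic transformation
    L = mean(logs) - median(logs)                    # mean-median difference
    m = mean(logs)
    D = mean(abs(v - m) for v in logs)               # mean absolute deviation
    return L, D
```

For a sample set whose log values are symmetric, L is close to zero, so L acts as a skewness indicator while D captures spread.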
A Sibiryakov, M Bober (2005) A method of statistical template matching and its application to face and facial feature detection, In: WSEAS Transactions on Information Science and Applications 2(9), pp. 1285-1293
This paper addresses a problem of robust, accurate and fast object detection in complex environments, such as cluttered backgrounds and low-quality images. To overcome the problems with existing methods, we propose a new object detection approach, called Statistical Template Matching. It is based on generalized description of the object by a set of template regions and statistical testing of object/non-object hypotheses. A similarity measure between the image and a template is derived from the Fisher criterion. We show how to apply our method to face and facial feature detection tasks, and demonstrate its performance in some difficult cases, such as moderate variation of scale factor of the object, local image warping and distortions caused by image compression. The method is very fast; its speed is independent of the template size and depends only on the template complexity.
A method of detecting an object in an image comprises comparing a template with a region of an image and determining a similarity measure, wherein the similarity measure is determined using a statistical measure. The template comprises a number of regions corresponding to parts of the object and their spatial relations. The variance of the pixels within the total template is set in relation to the variances of the pixels in all individual regions, to provide a similarity measure.
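One way to realise such a variance-based similarity (a hedged sketch of the idea, not the patented formula) is to compare the variance of all pixels under the template with the pooled variance within its individual regions:

```python
from statistics import pvariance

def stm_similarity(pixels_by_region):
    """Fisher-style similarity sketch: total variance of all pixels under
    the template, relative to the pooled within-region variance. A large
    ratio means the region means differ strongly (object-like structure)."""
    all_px = [p for region in pixels_by_region for p in region]
    total_var = pvariance(all_px)
    n = len(all_px)
    within = sum(len(r) * pvariance(r) for r in pixels_by_region) / n
    return total_var / (within + 1e-12)   # guard against zero within-variance
```

A template whose regions are internally uniform but mutually different scores high; a flat image patch scores zero.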
A method of deriving a representation of a video sequence comprises deriving metadata expressing at least one temporal characteristic of a frame or group of frames, and one or both of metadata expressing at least one content-based characteristic of a frame or group of frames and relational metadata expressing relationships between at least one content-based characteristic of a frame or group of frames and at least one other frame or group of frames, and associating said metadata and/or relational metadata with the respective frame or group of frames.
Method and apparatus for motion vector field encoding. A method and apparatus for representing motion in a sequence of digitized images derives a dense motion vector field and vector quantizes the motion vector field.
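As a toy illustration of the vector-quantization step (the codebook and field values below are invented for the example):

```python
def quantize_motion_field(field, codebook):
    """Sketch: map each motion vector in the dense field to the index of its
    nearest codeword under squared Euclidean distance."""
    def nearest(v):
        return min(range(len(codebook)),
                   key=lambda i: (v[0] - codebook[i][0]) ** 2
                               + (v[1] - codebook[i][1]) ** 2)
    return [[nearest(v) for v in row] for row in field]
```

The encoder then transmits only the codeword indices (plus the codebook), rather than the dense field itself.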
A Sibiryakov, M Bober (2007) Low-complexity motion analysis for mobile video devices, In: ICCE 2007 Digest of Technical Papers, International Conference on Consumer Electronics, pp. 407-408
A Sibiryakov, M Bober (2006) Image registration using RST-clustering and its application in remote sensing, art. no. 63650G, In: Image and Signal Processing for Remote Sensing XII 6365
S Husain, Miroslaw Bober (2016) Improving large-scale image retrieval through robust aggregation of local descriptors, In: IEEE Transactions on Pattern Analysis and Machine Intelligence
Visual search and image retrieval underpin numerous applications, however the task is still challenging predominantly due to the variability of object appearance and ever increasing size of the databases, often exceeding billions of images. Prior art methods rely on aggregation of local scale-invariant descriptors, such as SIFT, via mechanisms including Bag of Visual Words (BoW), Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vectors (FV). However, their performance is still short of what is required. This paper presents a novel method for deriving a compact and distinctive representation of image content called Robust Visual Descriptor with Whitening (RVD-W). It significantly advances the state of the art and delivers world-class performance. In our approach local descriptors are rank-assigned to multiple clusters. Residual vectors are then computed in each cluster, normalized using a direction-preserving normalization function and aggregated based on the neighborhood rank. Importantly, the residual vectors are de-correlated and whitened in each cluster before aggregation, leading to a balanced energy distribution in each dimension and significantly improved performance. We also propose a new post-PCA normalization approach which improves separability between the matching and non-matching global descriptors. This new normalization benefits not only our RVD-W descriptor but also improves existing approaches based on FV and VLAD aggregation. Furthermore, we show that the aggregation framework developed using hand-crafted SIFT features also performs exceptionally well with Convolutional Neural Network (CNN) based features. The RVD-W pipeline outperforms state-of-the-art global descriptors on both the Holidays and Oxford datasets. 
On the large-scale datasets Holidays1M and Oxford1M, the SIFT-based RVD-W representation obtains a mAP of 45.1% and 35.1%, while the CNN-based RVD-W achieves a mAP of 63.5% and 44.8%, all yielding superior performance to the state of the art.
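The rank-based residual aggregation at the heart of RVD-W can be sketched as follows. This is a deliberate simplification: the per-cluster whitening and post-PCA normalisation described above are omitted, and the rank weights and final L2 normalisation are our own illustrative choices:

```python
import numpy as np

def rank_aggregate(local_descs, centroids, k=3):
    """Sketch of rank-based multi-assignment with direction-preserving
    residual aggregation: each local descriptor contributes a unit-length
    residual to its k nearest clusters, weighted by neighbourhood rank."""
    agg = np.zeros_like(centroids)
    for x in local_descs:
        dists = np.linalg.norm(centroids - x, axis=1)
        for rank, c in enumerate(np.argsort(dists)[:k]):  # k nearest clusters
            r = x - centroids[c]                          # residual vector
            norm = np.linalg.norm(r)
            if norm > 0:
                agg[c] += (r / norm) / (rank + 1)         # rank-weighted unit residual
    v = agg.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                # global image descriptor
```

Normalising each residual to unit length preserves its direction while discarding magnitude, which is what makes the aggregate robust to outlier descriptors.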
Eng-Jon Ong, Sameed Husain, Mikel Bober-Irizar, Miroslaw Bober (2018) Deep Architectures and Ensembles for Semantic Video Classification, In: IEEE Transactions on Circuits and Systems for Video Technology, Institute of Electrical and Electronics Engineers (IEEE)
This work addresses the problem of accurate semantic labelling of short videos. To this end, we evaluate a multitude of different deep nets, ranging from traditional recurrent neural networks (LSTM, GRU) and temporally agnostic networks (FV, VLAD, BoW) to fully connected neural networks with mid-stage AV fusion, among others. Additionally, we propose a residual architecture-based DNN for video classification, with state-of-the-art classification performance at significantly reduced complexity. Furthermore, we propose four new approaches to diversity-driven multi-net ensembling, one based on a fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by ensembling diverse nets, and we investigate the factors contributing to high diversity. Based on the extensive YouTube8M dataset, we provide an in-depth evaluation and analysis of their behaviour. We show that the performance of the ensemble is state-of-the-art, achieving the highest accuracy on the YouTube8M Kaggle test data. The ensemble of classifiers was also evaluated on the HMDB51 and UCF101 datasets, and the resulting method achieves accuracy comparable with state-of-the-art methods using similar input features.
WP Berriss, WG Price, MZ Bober (2003) The use of MPEG-7 for intelligent analysis and retrieval in video surveillance, In: IEE Colloquium (Digest) 3-1006, pp. 41-45
The incorporation of intelligent video processing algorithms into digital surveillance systems has been examined in this work. In particular, the use of the latest standard in multi-media feature extraction and matching is discussed. The use of such technology makes a system very different to current surveillance systems which store text-based meta-data. In our system, descriptions based upon shape and colour are extracted in real-time from two sequences of video recorded from a real-life scenario. The stored database of descriptions can then be searched using a query description constructed by the operator; this query is then compared with every description stored for the video sequence. We show examples of the fast and accurate search made possible with this latest technology for multimedia content description applied to a video surveillance database.
WP Berriss, WG Price, MZ Bober (2002) VISS™: A video intelligent surveillance system, In: Multimedia Systems and Applications V 4861, pp. 13-21
N Georgis, J Kittler, M Bober (2000) Accurate recovery of dense depth map for 3D motion based coding, In: European Transactions on Telecommunications 11(2), pp. 219-232
K Kucharski, W Skarbek, M Bober (2005) Dual LDA - An effective feature space reduction method for face recognition, In: IEEE International Conference on Advanced Video and Signal Based Surveillance - Proceedings of AVSS 2005, pp. 336-341
Linear Discriminant Analysis (LDA) is a popular feature extraction technique that aims at creating a feature set of enhanced discriminatory power. The authors introduced a novel approach, Dual LDA (DLDA), and proposed an efficient SVD-based implementation. This paper focuses on the feature space reduction aspect of DLDA, achieved through a proper choice of the parameters controlling the DLDA algorithm. Comparative experiments conducted on a collection of five facial databases, consisting in total of more than 10000 photos, show that DLDA outperforms by a great margin the methods that reduce the feature space by means of feature subset selection. © 2005 IEEE.
M Bober, N Georgis, J Kittler (1998) On accurate and robust estimation of fundamental matrix, In: Computer Vision and Image Understanding 72(1), pp. 39-53, Academic Press
K Zhang, M Bober, J Kittler (1997) Image sequence coding using multiple-level segmentation and affine motion estimation, In: IEEE Journal on Selected Areas in Communications 15(9), pp. 1704-1713, IEEE
G Cordara, M Bober, Y Reznik (2013) Special issue on visual search and augmented reality, In: Signal Processing: Image Communication 28(4), pp. 309-310, Elsevier
M Bober (1999) Motion analysis for video coding and retrieval, In: IEE Colloquium (Digest) (103), pp. 51-56
The use of a robust, low-level motion estimator based on a Robust Hough Transform (RHT) in a range of tasks, such as optical flow estimation and motion estimation for video coding and retrieval from video sequences, was discussed. RHT derived not only pixel displacements, but also provided direct motion segmentation and other motion-related clues. The RHT algorithm employed an affine region-to-region transformation model and was invariant to illumination changes, in addition to being statistically robust. It was found that RHT did not base the correspondence analysis on any specific type of feature, but used textured regions in the image as non-localized features.
A method for deriving an image identifier comprises deriving a scale-space representation of an image, and processing the scale-space representation to detect a plurality of feature points having values that are maxima or minima. A representation is derived for a scale-dependent image region associated with one or more of the detected plurality of feature points. In an embodiment, the size of the image region is dependent on the scale associated with the corresponding feature point. An image identifier is derived using the representations derived for the scale-dependent image regions. The image identifiers may be used in a method for comparing images.
A method of representing an image comprises deriving at least one 1-dimensional representation of the image by projecting the image onto an axis, wherein the projection involves summing values of selected pixels in a respective line of the image perpendicular to said axis, characterised in that the number of selected pixels is less than the number of pixels in the line.
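A minimal sketch of such a sparse projection, summing only every second pixel of each row onto the vertical axis (the axis and subsampling step are our illustrative choices):

```python
def sparse_projection(image, step=2):
    """Project a 2-D image (list of rows) onto the vertical axis by summing
    only every `step`-th pixel of each row, so the number of selected pixels
    is less than the number of pixels in the line."""
    return [sum(row[::step]) for row in image]
```

Subsampling the pixels within each line reduces the cost of computing the 1-dimensional representation at the expense of some accuracy.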
A method of identifying or tracking a line in an image comprises determining a start point that belongs to a line, and identifying a plurality of possible end points belonging to the line using a search window, and calculating values for a plurality of paths connecting the start point and the end points to determine an optimum end point and path, characterised in that the search window is non-rectangular.
S Paschalakis, K Wnukowicz, M Bober (2011) Low-Cost Hierarchical Video Segmentation for Consumer Electronics Applications, In: IEEE International Conference on Consumer Electronics (ICCE 2011), pp. 79-80
M Bober, K Kucharski, W Skarbek (2003) Face recognition by Fisher and scatter linear discriminant analysis, In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2756, pp. 638-645
Fisher linear discriminant analysis (FLDA) based on the variance ratio is compared with scatter linear discriminant analysis (SLDA) based on the determinant ratio. It is shown that every optimal FLDA data model is an optimal SLDA data model, but not conversely. The novel algorithm 2SS4LDA (two singular subspaces for LDA) is presented, using two singular value decompositions applied directly to the normalized multiclass input data matrix and the normalized class means data matrix. It is controlled by two singular subspace dimension parameters, q and r, respectively. In face recognition experiments on the union of the MPEG-7, Altkom, and Feret facial databases, 2SS4LDA reaches about 94% person identification rate and about 0.21 average normalized mean retrieval rank. The best face recognition performance measures are achieved for those combinations of q and r values for which the variance ratio is also close to its maximum. No such correlation is observed for the SLDA separation measure. © Springer-Verlag Berlin Heidelberg 2003.
D Ravi, M Bober, GM Farinella, M Guarnera, S Battiato (2016) Semantic segmentation of images exploiting DCT based features and random forest, In: Pattern Recognition 52, pp. 260-273
This paper presents an approach for generating class-specific image segmentation. We introduce two novel features that use the quantized data of the Discrete Cosine Transform (DCT) in a Semantic Texton Forest (STF) based framework, combining colour and texture information for semantic segmentation. The combination of multiple features in a segmentation system is not a straightforward process. The proposed system is designed to exploit complementary features in a computationally efficient manner. Our DCT based features describe complex textures represented in the frequency domain, and not just simple textures obtained from differences between pixel intensities as in the classic STF approach. Unlike existing methods (e.g., filter banks), only a limited amount of resources is required. The proposed method has been tested on two popular databases: CamVid and MSRC-v2. Comparison with recent state-of-the-art methods shows improvement in terms of semantic segmentation accuracy.
J Badenas, M Bober, F Pla (2001) Segmenting traffic scenes from grey level and motion information, In: Pattern Analysis and Applications 4(1), pp. 28-38, Springer-Verlag
M Bober, M Petrou, J Kittler (1998) Nonlinear motion estimation using the supercoupling approach, In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5), pp. 550-555, IEEE Computer Society
S Paschalakis, M Bober (2004) Real-time face detection and tracking for mobile videoconferencing, In: Real-Time Imaging 10(2), pp. 81-94, Academic Press/Elsevier
A method of representing an object appearing in a still or video image, by processing signals corresponding to the image, comprises deriving a curvature scale space (CSS) representation of the object outline by smoothing the object outline, deriving at least one additional parameter reflecting the shape or mass distribution of a smoothed version of the original curve, and associating the CSS representation and the additional parameter as a shape descriptor of the object.
S Ghanbari, L Cieplinski, MZ Bober (2003) Recovery of lost motion vectors for error concealment in video coding, In: Picture Coding Symposium, pp. 239-242
Transmission of compressed video over error prone channels such as mobile networks is a challenging issue. Maintaining an acceptable quality of service in such an environment demands additional post-processing tools to limit the impact of uncorrected transmission errors. Significant visual degradation of a video stream occurs when the motion vector component is corrupted. In this paper, an effective and computationally efficient method for the recovery of lost motion vectors (MVs) is proposed. The novel idea is to select the neighbouring block MV that has the minimum distance from an estimated MV. Simulation results are presented, including comparison with existing methods. Our method follows the performance of the best existing method to within approximately 0.1-0.5 dB, yet it has a significant advantage in that it is 50% computationally simpler. This makes our method ideal for use in mobile handsets and other applications with limited processing power.
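The recovery idea can be sketched as follows. The abstract does not specify how the estimated MV is obtained, so using the component-wise mean of the neighbours is our illustrative assumption:

```python
def recover_mv(neighbour_mvs):
    """Sketch of lost-MV recovery: estimate the missing motion vector
    (here, as the mean of the neighbouring block MVs), then return the
    neighbour MV with minimum squared distance from that estimate."""
    n = len(neighbour_mvs)
    est = (sum(v[0] for v in neighbour_mvs) / n,
           sum(v[1] for v in neighbour_mvs) / n)
    return min(neighbour_mvs,
               key=lambda v: (v[0] - est[0]) ** 2 + (v[1] - est[1]) ** 2)
```

Returning an actual neighbour MV, rather than the estimate itself, keeps the recovered vector consistent with the surrounding motion field at very low cost.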
MPEG-7 is the first international standard which contains a number of key techniques from Computer Vision and Image Processing. The Curvature Scale Space technique was selected as a contour shape descriptor for MPEG-7 after substantial and comprehensive testing, which demonstrated the superior performance of the CSS-based descriptor. Curvature Scale Space Representation: Theory, Applications, and MPEG-7 Standardization is based on key publications on the CSS technique, as well as its multiple applications and generalizations. The goal was to ensure that the reader will have access to the most fundamental results concerning the CSS method in one volume. These results have been categorized into a number of chapters to reflect their focus as well as content. The book also includes a chapter on the development of the CSS technique within MPEG standardization, including details of the MPEG-7 testing and evaluation processes which led to the selection of the CSS shape descriptor for the standard. The book can be used as a supplementary textbook by any university or institution offering courses in computer and information science.
A method and apparatus for deriving a representation of an image is described. The method involves processing signals corresponding to the image. A two-dimensional function of the image, such as a Trace transform (T (d, θ)), of the image using at least one functional T, is derived and processed using a mask function (β) to derive an intermediate representation of the image, corresponding to a one-dimensional function. In one embodiment, the mask function defines pairs of image bands of the Trace transform in the Trace domain. The representation of the image may be derived by applying existing techniques to the derived one-dimensional function.
MZ Bober, F Preteux, W-Y Kim (2002) Shape Descriptors, In: Introduction to MPEG-7, John Wiley & Sons Inc
Introduction to MPEG-7 takes a systematic approach to the standard and provides a unique overview of the principles and concepts behind audio-visual indexing, ...
F Pla, M Bober (1997) Estimating translation/deformation motion through phase correlation, In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 1310, pp. 653-660
© Springer-Verlag Berlin Heidelberg 1997. Phase correlation techniques have been used in image registration to estimate image displacements. They have also been used to estimate optical flow by applying them locally. In this work a different phase correlation-based method is proposed to deal with a deformation/translation motion model, instead of the pure translations that the basic phase correlation technique can estimate. Experimental results are also presented to show the accuracy of the estimated motion parameters and the use of phase correlation to estimate optical flow.
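The basic pure-translation phase correlation that this paper generalises can be sketched with NumPy: the peak of the inverse FFT of the normalised cross-power spectrum gives the (circular) shift between the two images.

```python
import numpy as np

def phase_correlate(im1, im2, eps=1e-9):
    """Basic phase correlation for pure translation: normalise the
    cross-power spectrum to keep only phase, then locate the correlation
    peak, whose position is the shift (modulo the image size)."""
    F1, F2 = np.fft.fft2(im1), np.fft.fft2(im2)
    R = np.conj(F1) * F2
    R = R / (np.abs(R) + eps)             # whiten: phase-only spectrum
    corr = np.fft.ifft2(R).real
    return np.unravel_index(int(np.argmax(corr)), corr.shape)  # (dy, dx)

# usage: a circular shift is recovered exactly
rng = np.random.default_rng(0)
img = rng.random((64, 64))
shifted = np.roll(img, shift=(5, 3), axis=(0, 1))
```

Because only the phase is retained, the method is largely insensitive to global illumination changes, but the basic form handles translation only, which is the limitation the paper addresses.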
S Paschalakis, P Lee, M Bober (2003) An FPGA system for the high speed extraction, normalization and classification of moment descriptors, In: Field-Programmable Logic and Applications, Proceedings 2778, pp. 543-552, Springer-Verlag Berlin
Syed Sameed Husain, Miroslaw Bober (2019) REMAP: Multi-layer entropy-guided pooling of dense CNN features for image retrieval, In: IEEE Transactions on Image Processing, Institute of Electrical and Electronics Engineers (IEEE)
This paper addresses the problem of very large-scale image retrieval, focusing on improving its accuracy and robustness. We target enhanced robustness of search to factors such as variations in illumination, object appearance and scale, partial occlusions, and cluttered backgrounds, particularly important when search is performed across very large datasets with significant variability. We propose a novel CNN-based global descriptor, called REMAP, which learns and aggregates a hierarchy of deep features from multiple CNN layers, and is trained end-to-end with a triplet loss. REMAP explicitly learns discriminative features which are mutually-supportive and complementary at various semantic levels of visual abstraction. These dense local features are max-pooled spatially at each layer, within multi-scale overlapping regions, before aggregation into a single image-level descriptor. To identify the semantically useful regions and layers for retrieval, we propose to measure the information gain of each region and layer using KL-divergence. Our system effectively learns during training how useful various regions and layers are and weights them accordingly. We show that such relative entropy-guided aggregation outperforms classical CNN-based aggregation controlled by SGD. The entire framework is trained in an end-to-end fashion, outperforming the latest state-of-the-art results. On image retrieval datasets Holidays, Oxford and MPEG, the REMAP descriptor achieves mAP of 95.5%, 91.5% and 80.1% respectively, outperforming any results published to date. REMAP also formed the core of the winning submission to the Google Landmark Retrieval Challenge on Kaggle.
E Ong, M Bober (2016) Improved Hamming Distance Search using Variable Length Substrings, In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2000-2008
This paper addresses the problem of ultra-large-scale search in Hamming spaces. There has been considerable research on generating compact binary codes in vision, for example for visual search tasks. However the issue of efficient searching through huge sets of binary codes remains largely unsolved. To this end, we propose a novel, unsupervised approach to thresholded search in Hamming space, supporting long codes (e.g. 512 bits) with a wide range of Hamming distance radii. Our method is capable of working efficiently with billions of codes, delivering between one and three orders of magnitude acceleration, as compared to prior art. This is achieved by relaxing the equal-size constraint in the Multi-Index Hashing approach, leading to multiple hash-tables with variable length hash-keys. Based on the theoretical analysis of the retrieval probabilities of multiple hash-tables we propose a novel search algorithm for obtaining a suitable set of hash-key lengths. The resulting retrieval mechanism is shown empirically to improve the efficiency over the state-of-the-art, across a range of datasets, bit-depths and retrieval thresholds.
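The pigeonhole principle behind multi-index hashing with variable-length substrings can be sketched as follows; the substring lengths below are hand-picked, whereas the paper derives a suitable set from a theoretical analysis of retrieval probabilities:

```python
from collections import defaultdict

def build_tables(codes, lengths):
    """Index equal-length bit-string codes in one hash-table per substring.
    `lengths` are the variable substring lengths, summing to the code length."""
    tables, start = [], 0
    for ln in lengths:
        t = defaultdict(list)
        for i, c in enumerate(codes):
            t[c[start:start + ln]].append(i)
        tables.append((start, ln, t))
        start += ln
    return tables

def search(query, codes, tables, radius):
    """Pigeonhole guarantee: if Hamming(query, c) <= radius < number of
    substrings, at least one substring matches exactly, so every code within
    the radius appears among the candidates; verify candidates exactly."""
    cands = set()
    for start, ln, t in tables:
        cands.update(t.get(query[start:start + ln], []))
    return [i for i in cands
            if sum(a != b for a, b in zip(query, codes[i])) <= radius]
```

Allowing the substrings to have different lengths is what lets the scheme balance bucket occupancies when the bits are not uniformly distributed.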
Syed Husain, Miroslaw Bober (2014) Robust and scalable aggregation of local features for ultra large-scale retrieval, In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 2799-2803
This paper is concerned with the design of a compact, binary and scalable image representation that is easy to compute, fast to match and delivers beyond state-of-the-art performance in visual recognition of objects, buildings and scenes. A novel descriptor is proposed which combines rank-based multi-assignment with a robust aggregation framework and cluster/bit selection mechanisms for size scalability. An extensive performance evaluation is presented, including experiments within the state-of-the-art pipeline developed by the MPEG group standardising Compact Descriptors for Visual Search (CDVS).
MZ Bober, W Szajnowski Image Analysis and Representation
A one-dimensional representation of an image is obtained using a mapping function defining a closed scanning curve. The function is decomposed into component signals which represent different parts of the bandwidth of the representation using bi-directional filters to achieve zero group delay.
MZ Bober, A Sibiryakov Dominant motion analysis
A method of representing a 2-dimensional image comprises deriving at least one 1-dimensional representation of the image by projecting the image onto at least one axis, and applying a Fourier transform to said 1-dimensional representation. The representation can be used for estimation of dominant motion between images.
D Windridge, M Bober (2014) A Kernel-Based Framework for Medical Big-Data Analytics, In: Interactive Knowledge Discovery and Data Mining in Biomedical Informatics 8401, pp. 197-208, Springer Berlin Heidelberg
MZ Bober, S Paschalakis, P Brasnett Video Identification
A method and apparatus for processing a first sequence of images and a second sequence of images to compare the first and second sequences is disclosed. Each of a plurality of the images in the first sequence and each of a plurality of the images in the second sequence is processed by (i) processing the image data for each of a plurality of pixel neighbourhoods in the image to generate at least one respective descriptor element for each of the pixel neighbourhoods, each descriptor element comprising one or more bits; and (ii) forming a plurality of words from the descriptor elements of the image such that each word comprises a unique combination of descriptor element bits. The words for the second sequence are generated from the same respective combinations of descriptor element bits as the words for the first sequence. Processing is performed to compare the first and second sequences by comparing the words generated for the plurality of images in the first sequences with the words...
M Bober (2001) MPEG-7: Evolution or revolution?, In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2124, pp. 1-?
© Springer-Verlag Berlin Heidelberg 2001. The ISO MPEG-7 Standard, also known as the Multimedia Content Description Interface, will soon be finalized. After several years of intensive work on technology development, implementation and testing by almost all major players in the digital multimedia arena, the results of this international project will be assessed by the cruellest and most demanding judge: the market. Will it meet the high expectations of the developers and, above all, of future users? Will it result in a revolution, an evolution, or will it simply pass unnoticed? In this invited lecture, I will review the components of the MPEG-7 Standard in the context of some novel applications. I will go beyond the classical image/video retrieval scenarios and look into a more generic image/object recognition framework relying on MPEG-7 technology. Such a framework is applicable to a wide range of new applications. The benefits of using standardized technology over other state-of-the-art techniques from computer vision, image processing and database retrieval will be investigated. Demonstrations of the generic object recognition system will be presented, followed by some other examples of emerging applications made possible by the Standard. In conclusion, I will assess the potential impact of this new standard on emerging services, products and future technology developments.
M Bober, J Kittler (1996) Video coding for mobile communications - MPEG4 perspective, In: IEE Colloquium (Digest) (248)
Miroslaw Bober, Josef Kittler (1994) Estimation of complex multimodal motion - an approach based on robust statistics and Hough transform, In: Image and Vision Computing 12(10), pp. 661-668, Elsevier
J Kittler, J Matas, M Bober, L Nguyen (1995) Image interpretation: exploiting multiple cues, In: IEE Conference Publication (410), pp. 1-5
Multiple cues play a crucial role in image interpretation. A vision system that combines shape, colour, motion, prior scene knowledge and object motion behaviour is described. We show that the use of interpretation strategies which depend on the image data, temporal context and visual goals significantly simplifies the complexity of the image interpretation problem and makes it computationally feasible.
J Badenas, M Bober, F Pla (1997) Motion and intensity-based segmentation and its application to traffic monitoring, In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 1310, pp. 502-509
© Springer-Verlag Berlin Heidelberg 1997. This paper is concerned with efficient estimation and segmentation of 2-D motion from image sequences, with a focus on traffic monitoring applications. In order to reduce the computational load and facilitate real-time implementation, the proposed approach makes use of the simplifying assumptions that the camera is stationary and that the projection of vehicle motion onto the image plane can be approximated by translation. We show that good performance can be achieved even under such apparently restrictive assumptions. To further reduce processing time, we perform grey-level based segmentation that extracts regions of uniform intensity. Subsequently, we estimate motion for the regions, and regions moving with coherent motion are allowed to merge. The use of 2D motion analysis and the pre-segmentation stage significantly reduces the computational load, and the region-based estimator gives robustness to noise and changes of illumination.
Humans have an innate ability to communicate visually; the earliest forms of communication were cave drawings, and children can communicate visual descriptions of scenes through drawings well before they can write. Drawings and sketches offer an intuitive and efficient means for communicating visual concepts. Today, society faces a deluge of digital visual content driven by a surge in the generation of video on social media and the online availability of video archives. Mobile devices are emerging as the dominant platform for consuming this content, with Cisco predicting that by 2018 over 80% of mobile traffic will be video. Sketch offers a familiar and expressive modality for interacting with video on the touch-screens commonly present on such devices. This thesis contributes several new algorithms for searching and manipulating video using free-hand sketches. We propose the Visual Narrative (VN); a storyboarded sequence of one or more actions in the form of sketch that collectively describe an event. We show that VNs can be used both to efficiently search video repositories and to synthesise video clips. First, we describe a sketch based video retrieval (SBVR) system that fuses multiple modalities (shape, colour, semantics, and motion) in order to find relevant video clips. An efficient multi-modal video descriptor is proposed enabling the search of hundreds of videos in milliseconds. This contrasts with prior SBVR systems, which lack an efficient index representation and take minutes or hours to search similar datasets. This contribution not only makes SBVR practical at interactive speeds, but also enables user-refinement of results through relevance feedback to resolve sketch ambiguity, including the relative priority of the different VN modalities. Second, we present the first algorithm for sketch based pose retrieval. A pictographic representation (stick-men) is used to specify a desired human pose within the VN, and similar poses found within a video dataset.
We use archival dance performance footage from the UK National Resource Centre for Dance (UK-NRCD), containing diverse examples of human pose. We investigate appropriate descriptors for sketch and video, and propose a novel manifold-learning technique for mapping between the two descriptor spaces and thereby performing sketched pose retrieval. We show that domain adaptation can be applied to boost the performance of this system through a novel piece-wise feature-space warping technique. Third, we present a graph representation for VNs comprising multiple actions. We focus on the extension of our pose retrieval system to a sequence of poses interspersed with actions (e.g. jump, twirl). We show that our graph representation can be used for multiple applications: 1) to retrieve sequences of video comprising multiple actions; 2) to navigate, in pictorial form, the retrieved video sequences; and 3) to synthesise new video sequences by retrieving and concatenating video fragments from archival footage.
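The modality-fusion idea behind the multi-modal descriptor can be sketched as follows. The histogram inputs, dimensions and weights are hypothetical stand-ins; the thesis's actual descriptor and index structure are more sophisticated:

```python
import numpy as np

def fuse_descriptor(shape_h, colour_h, motion_h, weights=(1.0, 1.0, 1.0)):
    """Fuse per-modality histograms into one searchable video descriptor.

    Each modality histogram is L2-normalised and weighted before
    concatenation; the weights stand in for the user-adjustable modality
    priorities mentioned above (values here are illustrative).
    """
    parts = []
    for h, w in zip((shape_h, colour_h, motion_h), weights):
        h = np.asarray(h, dtype=float)
        n = np.linalg.norm(h)
        parts.append(w * h / n if n > 0 else h)
    v = np.concatenate(parts)
    return v / (np.linalg.norm(v) + 1e-12)

def search(query, database):
    """Rank database descriptors by cosine similarity to the query."""
    sims = database @ query  # rows are unit-norm descriptors
    return np.argsort(-sims)

rng = np.random.default_rng(0)
db = np.stack([fuse_descriptor(rng.random(8), rng.random(8), rng.random(4))
               for _ in range(5)])
order = search(db[3], db)
print(order[0])  # 3  (a clip is its own best match)
```

Because descriptors are fixed-length unit vectors, ranking reduces to one matrix-vector product, which is what makes millisecond-scale search over many videos feasible.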
Visual search and recognition underpin numerous applications, including management of multimedia content, mobile commerce, surveillance, navigation and robotics. However, the task remains challenging, predominantly due to the variability of object appearance and the ever-increasing size of image databases, which often exceed billions of images. The objective of this thesis is to develop a robust, compact and discriminative image representation suitable for visual search tasks. The thesis contributes to four research areas. First, we propose a novel method, named the Robust Visual Descriptor (RVD), for deriving a compact and robust representation of image content which significantly advances the state of the art and delivers world-class performance. In our approach, the local descriptors are assigned to multiple cluster centres with rank weights, leading to a stable and reliable global image representation. Residual vectors are then computed in each cluster, normalized using a direction-preserving normalization and aggregated based on the neighbourhood rank information. We then propose two extensions to the core RVD descriptor. The first de-correlates the weighted residual vectors by applying cluster-level PCA before aggregation. In the second, the weighted residual vectors are whitened in each cluster before aggregation, leading to a balanced energy distribution across dimensions and improved performance. Compressing floating-point global signatures to binary codes improves storage requirements and matching speed for large-scale image retrieval tasks. Our third contribution is to derive a compact and robust binary image signature from the core RVD representation. In addition, we propose a novel binary-descriptor matching algorithm, PCAE with Weighted Hamming distance (PCAE+WH), to minimize the quantization loss associated with converting floating-point vectors to discrete binary codes.
In the context of industry work on Compact Descriptors for Visual Search (CDVS) and its standardization in MPEG (ISO), we propose a scalable RVD representation. Bitrate scalability is achieved by employing novel Cluster Selection and Bit Selection mechanisms which support interoperable binary RVD representations. Moreover, we propose a very efficient and effective score function, based on weighted Hamming distance, to compute the similarity between two binary representations. Our fourth contribution is an image classification system based on the RVD representation, for which we introduce an effective method of incorporating second-order statistics into the original RVD framework.
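The rank-weighted residual aggregation at the heart of RVD can be sketched roughly as follows. The number of nearest centres, the weight schedule and the normalisation details are illustrative assumptions, not the exact RVD formulation:

```python
import numpy as np

def rvd_aggregate(local_desc, centres, n_nearest=3):
    """Sketch of rank-weighted residual aggregation in the spirit of RVD.

    Each local descriptor is soft-assigned to its `n_nearest` cluster
    centres with weights decreasing by rank; residuals to each centre
    are accumulated, and each per-cluster vector is L2-normalised
    (preserving its direction) before concatenation into one global
    signature.
    """
    k, d = centres.shape
    agg = np.zeros((k, d))
    rank_w = 1.0 / np.arange(1, n_nearest + 1)  # assumed rank weights
    for x in local_desc:
        dists = np.linalg.norm(centres - x, axis=1)
        order = np.argsort(dists)[:n_nearest]   # nearest centres by rank
        for w, c in zip(rank_w, order):
            agg[c] += w * (x - centres[c])      # weighted residual
    norms = np.linalg.norm(agg, axis=1, keepdims=True)
    agg = np.where(norms > 0, agg / norms, agg)  # direction-preserving
    g = agg.ravel()
    return g / (np.linalg.norm(g) + 1e-12)

rng = np.random.default_rng(0)
desc = rng.normal(size=(100, 8))     # 100 local descriptors, 8-D
centres = rng.normal(size=(16, 8))   # 16 cluster centres
g = rvd_aggregate(desc, centres)
print(g.shape)  # (128,)
```

The cluster-level PCA and whitening extensions described above would be applied to the per-cluster residual vectors before the final normalisation and concatenation.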
The method of Dynamic Mode Decomposition (DMD) was introduced originally in the area of Computational Fluid Dynamics (CFD) for extracting coherent structures from spatio-temporal complex fluid flow data. DMD takes in time-series data and computes a set of modes, each of which is associated with a complex eigenvalue. DMD analysis is closely related to spectral analysis of the Koopman operator, which provides a linear but infinite-dimensional representation of nonlinear dynamical systems. By using DMD, a nonlinear system can therefore be described by a superposition of modes whose dynamics are governed by the eigenvalues. The key advantage of DMD is its data-driven nature, which relies on no prior assumptions other than the dynamics observed over time. Its capability for extracting relevant modes from complex fluid flows has led to significant application across multiple fields, including computer vision, robotics and neuroscience. To expand DMD to other applications, this thesis advances the original formulation so that it can be used to solve novel problems in signal processing and computer vision. In signal processing, the thesis introduces a DMD-based method for decomposing a univariate time series into a number of interpretable elements with different subspaces, such as noise, trends and harmonics. In addition, univariate time-series forecasting using DMD is demonstrated. The computer vision part of the thesis focuses on innovative applications in medical imaging, biometrics and background modelling. In medical imaging, a novel DMD framework is proposed that introduces windowed and reconstruction variants of DMD for quantifying kidney function in Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) sequences, through movement correction and functional segmentation of the kidneys.
The biometrics portion of the thesis introduces a DMD-based classification pipeline for counter-spoofing of 2D facial videos and static finger-vein images. The finger-vein counter-spoofing makes use of a novel atemporal variant of DMD that captures micro-level artefacts differentiating the quality and light-reflection properties of live and spoofed finger-vein images, while DMD on 2D facial image sequences distinguishes attack-specific cues from a live face by capturing the complex dynamics of head movements, eye blinking and lip movements in a data-driven manner. Finally, the thesis proposes a new DMD-based technique for obtaining a background model of a visual scene in the colour domain. These aspects form the major contributions of this thesis. The results of the thesis present DMD as a promising approach for applications requiring feature extraction, including (i) trends and noise from signals, (ii) micro-level texture descriptors from images, and (iii) coherent structures from image sequences/videos, as well as for applications that require the suppression of movement in dynamic spatio-temporal image sequences.
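The standard SVD-based DMD computation that these variants build on can be sketched as follows (a minimal implementation of the textbook algorithm, not the thesis's windowed or atemporal variants):

```python
import numpy as np

def dmd(X, r):
    """Exact DMD: snapshot matrix X (space x time) -> modes, eigenvalues.

    Splits X into shifted snapshot matrices X1, X2, projects the linear
    propagator onto a rank-r SVD subspace of X1, and lifts its
    eigenvectors back to full space as DMD modes.
    """
    X1, X2 = X[:, :-1], X[:, 1:]
    U, S, Vh = np.linalg.svd(X1, full_matrices=False)
    U, S, Vh = U[:, :r], S[:r], Vh[:r]              # rank-r truncation
    Atilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / S)
    eigvals, W = np.linalg.eig(Atilde)              # per-mode dynamics
    modes = X2 @ Vh.conj().T @ np.diag(1.0 / S) @ W
    return modes, eigvals

# Synthetic damped oscillation over 5 'pixels': DMD should recover a
# conjugate eigenvalue pair with modulus exp(-0.05 * dt) (slow decay).
t = np.linspace(0, 4 * np.pi, 200)
u1, u2 = np.arange(1, 6, dtype=float), np.ones(5)
X = (np.outer(u1, np.exp(-0.05 * t) * np.sin(t))
     + np.outer(u2, np.exp(-0.05 * t) * np.cos(t)))
modes, eigvals = dmd(X, r=2)
print(np.round(np.abs(eigvals), 3))  # [0.997 0.997]
```

Eigenvalues inside the unit circle correspond to decaying structures and eigenvalues near one to persistent ones, which is what makes the spectrum useful for separating trends, harmonics and noise, or backgrounds from moving foregrounds.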
High-resolution brain magnetic resonance (MR) images acquired at multiple time points across the treatment of a patient allow the quantification of localised changes brought about by disease progression. The aim of this thesis is to address the challenge of performing automatic longitudinal analysis of magnetic resonance imaging (MRI) in paediatric brain tumours. The first contribution is the validation of a semi-automated segmentation technique, applied to intra-operative MR images acquired during the surgical resection of hypothalamic tumours in children in order to assess the volume of tumour resected at different stages of the surgical procedure. The second contribution is the quantification of a rare condition known as hypertrophic olivary degeneration (HOD) in structures within the brainstem known as the inferior olivary nuclei (ION), in relation to the development of posterior fossa syndrome (PFS) following tumour resection in the hindbrain. The change in grey-level intensity over time in the left ION has been identified as a suitable biomarker that correlates with the occurrence of posterior fossa syndrome following tumour resection surgery. This study demonstrates the application of machine learning techniques to T2 brain MR images. The third contribution presents a novel approach to longitudinal brain MR analysis, focusing on the cerebellum and brain stem. It presents a technique developed to interpolate multi-slice 2D MR image slices of the brain stem and cerebellum, both to infill gaps between slices and longitudinally over time, that is, in four-dimensional space. This study also investigates the application of machine learning techniques directly to the MR images. Another novel method developed in this study is the use of the Jacobian of deformations in the brain over time as an imaging feature.
Unlike the previous contribution chapter, the third contribution is not hypothesis-driven, and automatically detects six potential biomarkers that are related to the development of PFS following tumour resection in the posterior fossa. The limited number of patients considered in each study posed a major challenge. This has prompted the use of multiple validation techniques in order to provide accurate results despite the small dataset. These techniques are presented in the second and third contribution chapters.
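The Jacobian-of-deformations feature can be illustrated in 2-D with a minimal sketch on a synthetic uniform expansion (the thesis works with registered 3-D/4-D data; the function name is illustrative):

```python
import numpy as np

def jacobian_determinant(def_y, def_x):
    """Per-pixel Jacobian determinant of a 2-D deformation field.

    def_y, def_x hold the deformed coordinates of each pixel; a
    determinant above 1 indicates local tissue expansion and below 1
    local contraction, which is what makes it usable as an imaging
    feature for longitudinal change.
    """
    dyy, dyx = np.gradient(def_y)  # d(def_y)/dy, d(def_y)/dx
    dxy, dxx = np.gradient(def_x)  # d(def_x)/dy, d(def_x)/dx
    return dyy * dxx - dyx * dxy

# Uniform 10% expansion in both axes: determinant ~1.21 everywhere.
yy, xx = np.mgrid[0:32, 0:32].astype(float)
J = jacobian_determinant(1.1 * yy, 1.1 * xx)
print(np.round(J.mean(), 2))  # 1.21
```

In a longitudinal setting, one would compute such maps from the deformation fields registering each follow-up scan to the baseline and feed them to the machine learning stage as features.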
The abrupt expansion of Internet use over the last decade has led to an uncontrollable amount of media stored on the Web. Image, video and news information has flooded the pool of data at our disposal, and advanced data-mining techniques need to be developed in order to take full advantage of it. The focus of this thesis is mainly on developing robust video analysis technologies for detecting and recognizing activities in video. The work aims to develop a compact activity descriptor with low computational cost, which is robust enough to discriminate easily among diverse activity classes. Additionally, we introduce a motion-compensation algorithm which alleviates the issues introduced by a moving camera and is used to create motion binary masks, referred to as compensated Activity Areas (cAA), in which dense interest points are sampled. Motion and appearance descriptors invariant to scale and illumination changes are then computed around them, and a thorough evaluation of their merit is carried out. The notion of Motion Boundaries Activity Areas (MBAA) is then introduced. The concept differs from cAA in the area it focuses on (i.e. human boundaries), further reducing the computational cost of the activity descriptor. A novel algorithm that computes human trajectories with variable temporal scale, referred to as 'optimal trajectories', is introduced. It is based on Statistical Sequential Change Detection (SSCD), which allows dynamic segmentation of trajectories based on their motion pattern and facilitates their classification with better accuracy. Finally, we introduce an activity detection algorithm which segments long-duration videos in an accurate but computationally efficient manner. We advocate the Statistical Sequential Boundary Detection (SSBD) method as a means of analysing motion patterns and report improvements over the state of the art.
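The sequential change detection underlying SSCD-style trajectory segmentation can be illustrated with a minimal one-sided CUSUM detector (a generic stand-in assuming Gaussian statistics with known means; not the thesis's exact formulation):

```python
import numpy as np

def cusum_detect(samples, mu0, mu1, threshold=10.0):
    """Minimal one-sided CUSUM change detector on a 1-D statistic.

    Accumulates the unit-variance Gaussian log-likelihood ratio of
    post-change mean `mu1` against pre-change mean `mu0`, resetting at
    zero, and returns the first index where the statistic exceeds
    `threshold` (or None if no change is detected).
    """
    s = 0.0
    for i, x in enumerate(samples):
        llr = (mu1 - mu0) * (x - 0.5 * (mu0 + mu1))
        s = max(0.0, s + llr)
        if s > threshold:
            return i
    return None

# A motion statistic that jumps from mean 0 to mean 2 at index 100:
stream = np.concatenate([np.zeros(100), np.full(100, 2.0)])
print(cusum_detect(stream, mu0=0.0, mu1=2.0))  # 105
```

Applied to a per-frame motion statistic along a trajectory, each detection marks a segment boundary, which is how a sequential test yields the variable temporal scale described above.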