Dr Armin Mustafa
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), Faculty of Engineering and Physical Sciences.
I am currently a Royal Academy of Engineering Research Fellow in the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, working on 4D Vision for perceptive machines. I completed my PhD on general dynamic scene reconstruction from multi-view videos at the University of Surrey in 2016, supervised by Prof. Adrian Hilton, after which I became a Research Fellow at CVSSP. Before that, I worked in Computer Vision at Samsung Research Institute, Bangalore, India, for 3 years (2010 - 2013). In 2010 I received my M.Tech. degree in Computer Vision from the Indian Institute of Technology (IIT), Kanpur, India, supervised by Prof. K.S. Venkatesh.
Areas of specialism
2018 - Research Fellowship, The Royal Academy of Engineering, UK.
2017 - Young Researcher award, CVPR.
2016 - Doctoral Consortium grant, CVPR.
2015 - BMVA travel grant for ICCV.
2014 - Set-Squared Research to Innovator grant, Global #1 University incubator, UK.
2013 - Overseas Research Scholarship, FEPS, The University of Surrey, UK.
2010 - Cadence Silver Medal, Indian Institute of Technology, Kanpur, India.
The emergence of machines that interact with their environment has led to an increasing demand for automatic visual understanding of real-world scenes. My research aims to better understand complex scenes so that machines can efficiently model and interpret the real world for a range of socially beneficial applications, including autonomous systems, augmented reality and healthcare.
We introduce the first approach to solve the challenging problem of automatic 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈40%) improvement in semantic segmentation, reconstruction and scene flow accuracy. In addition to the evaluation on several indoor and outdoor scenes, the proposed joint 4D scene understanding framework is applied to challenging outdoor sports scenes in the wild captured with manually operated wide-baseline broadcast cameras.
We present SILT, a Self-supervised Implicit Lighting Transfer method. Unlike previous research on scene relighting, we do not seek to apply arbitrary new lighting configurations to a given scene. Instead, we wish to transfer the lighting style from a database of other scenes, to provide a uniform lighting style regardless of the input. The solution operates as a two-branch network that first aims to map input images of any arbitrary lighting style to a unified domain, with extra guidance achieved through implicit image decomposition. We then remap this unified input domain using a discriminator that is presented with the generated outputs and the style reference, i.e. images of the desired illumination conditions. Our method is shown to outperform supervised relighting solutions across two different datasets without requiring lighting supervision.
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces in complex scenes observed from single or multiple moving cameras. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD), which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables construction of the 4D Match Tree, which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match Tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single and multiple view image sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.
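The tree construction described in this abstract can be illustrated with a small sketch. It assumes that the tree is a minimum spanning tree over a pairwise frame dissimilarity matrix, which is one way to minimise the total observed change along tree edges; the abstract does not state the exact construction. The function names `build_match_tree` and `traversal_order` and the dissimilarity values are hypothetical and not taken from the paper.

```python
import heapq
from collections import deque


def build_match_tree(dissimilarity, root=0):
    """Prim's algorithm: return {frame: parent_frame} for a minimum spanning tree.

    Assumption: dissimilarity[i][j] is a symmetric, non-negative score of the
    shape change between frames i and j (hypothetical values in this sketch).
    """
    n = len(dissimilarity)
    parent = {root: None}
    heap = [(dissimilarity[root][j], root, j) for j in range(n) if j != root]
    heapq.heapify(heap)
    while heap and len(parent) < n:
        _, i, j = heapq.heappop(heap)
        if j in parent:
            continue  # frame j is already connected to the tree
        parent[j] = i
        for k in range(n):
            if k not in parent:
                heapq.heappush(heap, (dissimilarity[j][k], j, k))
    return parent


def traversal_order(parent, root=0):
    """Breadth-first list of (source, target) frame pairs to align, root first."""
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        for child in sorted(children.get(node, [])):
            order.append((node, child))
            queue.append(child)
    return order


# Hypothetical scores for four frames: frame 3 resembles frame 0 more than frame 2.
D = [
    [0, 1, 4, 2],
    [1, 0, 2, 5],
    [4, 2, 0, 6],
    [2, 5, 6, 0],
]
tree = build_match_tree(D)
pairs = traversal_order(tree)
print(pairs)
```

In this toy example frame 3 is aligned directly to frame 0 rather than through frames 1 and 2, which is how a tree over frame similarity can avoid the drift that accumulates under purely sequential alignment.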
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multi-view dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.
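For readers unfamiliar with EPIs, the basic definition can be sketched as follows. The sketch assumes rectified views from a regularly spaced horizontal camera array, so that a scene point stays on the same image row in every view. It does not reproduce the paper's method for sparse arrays; `extract_epi` and the toy data are hypothetical.

```python
def extract_epi(views, row):
    """Stack the same scanline from every view: epi[s][u] = views[s][row][u]."""
    return [list(view[row]) for view in views]


# Toy data: three 2x6 views in which one bright point shifts by one pixel
# per camera, as a point at constant depth would in rectified views.
views = []
for s in range(3):
    image = [[0] * 6 for _ in range(2)]
    image[1][1 + s] = 9
    views.append(image)

epi = extract_epi(views, row=1)

# The point traces a straight line in the EPI; its slope is the disparity
# in pixels per camera step, which is related to scene depth.
columns = [line.index(max(line)) for line in epi]
disparity = (columns[-1] - columns[0]) / (len(columns) - 1)
print(epi, disparity)
```

Because each scene point forms a line in the EPI, correspondences across views are restricted to such lines. This is the general kind of constraint the abstract refers to when it says EPIs constrain scene flow estimation.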
We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈ 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.
Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing a shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction, and its use in applications such as free-view rendering and virtual reality.
The rise of autonomous machines in our day-to-day lives has led to an increasing demand for machine perception of the real world to be more robust, accurate and human-like. Research in visual scene understanding over the past two decades has focused on machine perception in controlled environments, such as indoor scenes with static and rigid objects. There is a gap in the literature for machine perception in general complex scenes (outdoor with multiple interacting people). The proposed research addresses the limitations of existing methods by proposing an unsupervised framework to simultaneously model, semantically segment and estimate motion for general dynamic scenes captured from multiple view videos with a network of static or moving cameras. In this talk I will explain the proposed joint framework to understand general dynamic scenes for machine perception; give a comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences; and demonstrate applications such as virtual, augmented and mixed reality (VR/AR/MR) and broadcast production (free-viewpoint video, FVV).
A common problem in wide-baseline matching is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST, A-KAZE and MSER. In this paper we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of accurate features for wide-baseline matching. A multi-scale SFD is proposed using bilateral image decomposition to produce a large number of scale-invariant features for wide-baseline reconstruction. All input images are over-segmented into regions using any existing segmentation technique, such as Watershed, Mean-shift or SLIC. Feature points are then detected at the intersection of the boundaries of three or more regions. The detected feature points are local maxima of the image function. The key advantage of feature detection based on segmentation is that it does not require global threshold setting and can therefore detect features throughout the image. A comprehensive evaluation demonstrates that SFD gives an increased number of features which are accurately localised and matched between wide-baseline camera views; the number of features for a given matching error increases by a factor of 3-5 compared to SIFT; feature detection and matching performance is maintained with increasing baseline between views; and multi-scale SFD improves matching performance at varying scales. Application of SFD to sparse multi-view wide-baseline reconstruction demonstrates a factor of ten increase in the number of reconstructed points with improved scene coverage compared to SIFT/MSER/A-KAZE. Evaluation against ground-truth shows that SFD produces an increased number of wide-baseline matches with reduced error.
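The detection rule stated above, features where the boundaries of three or more regions intersect, can be sketched on a label map produced by any over-segmentation. The 2x2 window test, the function name `junction_points` and the toy label map are assumptions made for illustration. The sketch omits the multi-scale decomposition and sub-pixel localisation that a full detector would need.

```python
def junction_points(labels, min_regions=3):
    """Return (row, col) positions whose 2x2 neighbourhood spans >= min_regions labels.

    Assumption: labels is a 2D list of integer region ids from an
    over-segmentation (e.g. Watershed, Mean-shift or SLIC).
    """
    rows, cols = len(labels), len(labels[0])
    points = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            # Distinct region ids in the 2x2 window whose top-left pixel is (r, c).
            window = {
                labels[r][c], labels[r][c + 1],
                labels[r + 1][c], labels[r + 1][c + 1],
            }
            if len(window) >= min_regions:
                points.append((r, c))
    return points


# Toy label map with three regions meeting near the centre.
labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 1, 1],
]
features = junction_points(labels)
print(features)
```

No global threshold appears anywhere in the rule, which matches the advantage claimed in the abstract: wherever the segmentation produces a three-way junction, a feature is reported, regardless of local contrast.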
Simultaneous semantically coherent object-based long-term 4D scene flow estimation, co-segmentation and reconstruction is proposed exploiting the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. In this paper we propose a framework for spatially and temporally coherent semantic 4D scene flow of general dynamic scenes from multiple view videos captured with a network of static or moving cameras. Semantic coherence results in improved 4D scene flow estimation, segmentation and reconstruction for complex dynamic scenes. Semantic tracklets are introduced to robustly initialize the scene flow in the joint estimation and enforce temporal coherence in 4D flow, semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of long-term flow, appearance and shape priors that are exploited in semantically coherent 4D scene flow estimation, co-segmentation and reconstruction. Comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in 4D scene flow, segmentation, temporally coherent semantic labelling, and reconstruction of dynamic scenes.
A common problem in wide-baseline stereo is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST and MSER. In this paper we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of ‘good’ features for accurate wide-baseline reconstruction. Each image is segmented into regions by over-segmentation and feature points are detected at the intersection of the boundaries of three or more regions. Segmentation-based feature detection locates features at local maxima, giving a relatively large number of feature points which are consistently detected across wide-baseline views and accurately localised. A comprehensive comparative performance evaluation with previous feature detection approaches demonstrates that: SFD produces a large number of features with increased scene coverage; detected features are consistent across wide-baseline views for images of a variety of indoor and outdoor scenes; and the number of wide-baseline matches is increased by an order of magnitude compared to alternative detector-descriptor combinations. Sparse scene reconstruction from multiple wide-baseline stereo views using the SFD feature detector demonstrates at least a factor of six increase in the number of reconstructed points with reduced error distribution compared to SIFT when evaluated against ground-truth, at a similar computational cost to SURF/FAST.
This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge of scene structure or camera calibration is required, allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities, resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction.
In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.