Dr Armin Mustafa

Senior Research Fellow in Computer Vision
Royal Academy of Engineering Research Fellow


Areas of specialism

Computer Vision; Scene Understanding; 3D/4D Vision; Virtual Reality; Light Fields; Machine Learning; Human Computer Interaction; Augmented Reality


My publications


Mustafa A, Kim H, Guillemaut J-Y, Hilton ADM (2016) Temporally coherent 4D reconstruction of complex dynamic scenes, IEEE
This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration, allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities, resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction.
Mustafa A, Kim H, Hilton ADM (2016) 4D Match Trees for Non-rigid Surface Alignment, Computer Vision - ECCV 2016 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I 9905 (1) pp. 213-229
This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces observed from single or multiple moving cameras of complex scenes. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD), which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables construction of the 4D Match Tree, which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match Tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single and multiple view image sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.
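As a rough illustration of the tree construction described in this abstract, the sketch below assumes that "minimising the observed change in non-rigid shape" can be read as building a minimum spanning tree over a pairwise frame dissimilarity matrix. That reading, the function name `match_tree`, and the toy matrix `D` are assumptions made for illustration; the abstract does not specify the algorithm or the dissimilarity measure.

```python
import heapq


def match_tree(dissimilarity, root=0):
    """Build a spanning tree with Prim's algorithm.

    `dissimilarity[i][j]` is a hypothetical shape-change score between
    frames i and j (lower means more similar). Returns (parent, child)
    edges in the order they are added to the tree.
    """
    n = len(dissimilarity)
    visited = [False] * n
    visited[root] = True
    # Candidate edges leaving the tree, ordered by dissimilarity.
    heap = [(dissimilarity[root][j], root, j) for j in range(n) if j != root]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        _, parent, child = heapq.heappop(heap)
        if visited[child]:
            continue  # frame already attached through a cheaper edge
        visited[child] = True
        edges.append((parent, child))
        for j in range(n):
            if not visited[j]:
                heapq.heappush(heap, (dissimilarity[child][j], child, j))
    return edges


# Toy dissimilarity matrix for four frames (illustrative values only).
D = [
    [0, 1, 4, 9],
    [1, 0, 2, 8],
    [4, 2, 0, 3],
    [9, 8, 3, 0],
]
edges = match_tree(D)
print(edges)
```

Because each child is attached to a frame that is already in the tree, the edge list doubles as a traversal order: dense correspondence could be propagated along each edge from parent to child, so alignment always proceeds between the most similar available pair of frames.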
Mustafa A, Kim H, Imre H, Hilton A (2015) Segmentation based features for wide-baseline multi-view reconstruction, pp. 282-290
A common problem in wide-baseline stereo is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST and MSER. In this paper we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of "good" features for accurate wide-baseline reconstruction. Each image is segmented into regions by over-segmentation and feature points are detected at the intersection of the boundaries for three or more regions. Segmentation-based feature detection locates features at local maxima, giving a relatively large number of feature points which are consistently detected across wide-baseline views and accurately localised. A comprehensive comparative performance evaluation with previous feature detection approaches demonstrates that: SFD produces a large number of features with increased scene coverage; detected features are consistent across wide-baseline views for images of a variety of indoor and outdoor scenes; and the number of wide-baseline matches is increased by an order of magnitude compared to alternative detector-descriptor combinations. Sparse scene reconstruction from multiple wide-baseline stereo views using the SFD feature detector demonstrates at least a six-fold increase in the number of reconstructed points, with reduced error distribution compared to SIFT when evaluated against ground-truth and a computational cost similar to SURF/FAST.
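The junction step described in this abstract (feature points where the boundaries of three or more regions meet) can be sketched on a label image as follows. This is a minimal sketch, not the published detector: it assumes the over-segmentation is already available as an integer label array, tests 2x2 pixel windows for distinct labels, and omits any refinement or localisation step. The function name `region_junctions` and the toy `labels` array are hypothetical.

```python
import numpy as np


def region_junctions(labels, min_regions=3):
    """Return (row, col) of the top-left pixel of every 2x2 window
    that contains at least `min_regions` distinct region labels."""
    # Gather the four pixels of each 2x2 window: shape (H-1, W-1, 4).
    windows = np.stack(
        [labels[:-1, :-1], labels[:-1, 1:], labels[1:, :-1], labels[1:, 1:]],
        axis=-1,
    )
    # Count distinct labels per window: sort, then count value changes.
    ordered = np.sort(windows, axis=-1)
    distinct = 1 + (np.diff(ordered, axis=-1) != 0).sum(axis=-1)
    rows, cols = np.nonzero(distinct >= min_regions)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy over-segmentation with three regions meeting at one point.
labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
])
junctions = region_junctions(labels)
print(junctions)
```

In the toy example only the window whose top-left pixel is at row 1, column 1 touches all three regions, so it is the single junction reported. No global threshold is involved; the number of features depends only on the segmentation.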
Mustafa Armin, Kim Hansung, Guillemaut Jean-Yves, Hilton Adrian (2015) General Dynamic Scene Reconstruction from Multiple View Video, 2015 IEEE International Conference on Computer Vision (ICCV) pp. 900-908 IEEE
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
Mustafa A, Hilton A (2017) Semantically Coherent Co-segmentation and Reconstruction of Dynamic Scenes, CVPR 2017 Proceedings pp. 5583-5592 IEEE
In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.
Mustafa Armin, Kim Hansung, Hilton Adrian (2018) MSFD: Multi-scale segmentation based feature detection for wide-baseline scene reconstruction, IEEE Transactions on Image Processing 28 (3) pp. 1118-1132 Institute of Electrical and Electronics Engineers (IEEE)
A common problem in wide-baseline matching is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST, A-KAZE and MSER. In this paper we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of accurate features for wide-baseline matching. A multi-scale SFD is proposed using bilateral image decomposition to produce a large number of scale-invariant features for wide-baseline reconstruction. All input images are over-segmented into regions using any existing segmentation technique, such as Watershed, Mean-shift or SLIC. Feature points are then detected at the intersection of the boundaries of three or more regions. The detected feature points are local maxima of the image function. The key advantage of feature detection based on segmentation is that it does not require global threshold setting and can therefore detect features throughout the image. A comprehensive evaluation demonstrates that SFD gives an increased number of features which are accurately localised and matched between wide-baseline camera views; the number of features for a given matching error increases by a factor of 3-5 compared to SIFT; feature detection and matching performance is maintained with increasing baseline between views; and multi-scale SFD improves matching performance at varying scales. Application of SFD to sparse multi-view wide-baseline reconstruction demonstrates a factor of ten increase in the number of reconstructed points with improved scene coverage compared to SIFT/MSER/A-KAZE. Evaluation against ground-truth shows that SFD produces an increased number of wide-baseline matches with reduced error.
Mustafa Armin, Hilton Adrian (2019) Understanding real-world scenes for human-like machine perception, Proceedings of the Machine Intelligence 21 (MI21-HLC) workshop Imperial College Press
The rise of autonomous machines in our day-to-day lives has led to an increasing demand for machine perception of the real world to be more robust, accurate and human-like. The research in visual scene understanding over the past two decades has focused on machine perception in controlled environments such as indoor scenes with static and rigid objects. There is a gap in the literature for machine perception in general complex scenes (outdoor with multiple interacting people). The proposed research addresses the limitations of existing methods by proposing an unsupervised framework to simultaneously model, semantically segment and estimate motion for general dynamic scenes captured from multiple view videos with a network of static or moving cameras. In this talk I will explain the proposed joint framework to understand general dynamic scenes for machine perception; give a comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences; and demonstrate applications such as virtual, augmented and mixed reality (VR/AR/MR) and broadcast production (free-viewpoint video, FVV).