Professor John Collomosse

Professor of Computer Vision and AI | Director DECaDE: UKRI/EPSRC Centre for the Decentralised Digital Economy

PhD, CEng, FIET

+441483686035

j.collomosse@surrey.ac.uk

23 BA 00

Academic and research departments

Centre for Vision, Speech and Signal Processing (CVSSP), Surrey Institute for People-Centred Artificial Intelligence (PAI), School of Computer Science and Electronic Engineering.

About

Biography

John Collomosse is a Professor of Computer Vision and AI at the University of Surrey, where he is the founder and director of DECaDE, the UKRI Research Centre for the Decentralised Digital Economy, and a member of the Centre for Vision, Speech and Signal Processing (CVSSP). He is a Fellow of the IET, a Chartered Engineer and a Senior Member of IEEE. From 2018 to 2024 he was on the UKRI/EPSRC advisory team for Information & Communication Technologies (ICT).

He is concurrently a Principal Scientist and distinguished inventor at Adobe Research, where he manages the cross-modal representation learning (XRL) research group. He leads research for Adobe’s Content Authenticity Initiative (CAI) and has been a core technical advisor to the initiative since his involvement in its inception in 2019. Now with 3500+ members, CAI leads a cross-industry standards group (C2PA; Coalition for Content Provenance and Authenticity), where John chairs the cross-industry task forces on watermarking and on distributed ledgers (Blockchain), and previously also chaired a task force on fingerprinting.

John’s research intersects Artificial Intelligence (AI) and Distributed Ledger Technology (DLT), with a focus on media provenance to fight misinformation and online harms, and on improving data integrity and attribution for responsible AI. John’s content fingerprinting research is used to protect millions of images daily across Adobe’s platforms such as Photoshop, Lightroom and the Firefly generative AI tools. John has also pioneered several visual search technologies, such as style, sketch and pose based search. These technologies have been shipped in products such as Behance.net. Notably, he led the ARCHANGEL project, which pioneered the use of AI and Blockchain to tamper-proof National Archives around the world and was called out as a highlight of the 10-year UK Science Council (EPSRC) Digital Economy research programme. John has presented to various government bodies on provenance, authenticity and AI opt-out, including the European Commission, UK House of Lords and APPG Blockchain.

John’s early research developed some of the first computer vision based technologies to create the kind of non-photorealistic (artistic) filtering effects now commonly found in products like Photoshop, and was featured by the BBC and New Scientist, among others. John has also spent time in industry R&D elsewhere, including at Vodafone Munich and IBM Research Hursley.

News

05 AUG 2024

Professor John Collomosse elected for prestigious fellowship

19 JUN 2021

CVSSP academics showcase 11 papers at leading computer vision conference

03 NOV 2020

Surrey Business School receives funding for new research centre to develop a fairer digital economy

28 OCT 2020

New AI and Blockchain Centre to help usher in a new era for the digital economy

29 MAY 2019

ARCHANGEL: Securing our National Archives with AI and blockchain

05 FEB 2019

University of Surrey kicks off €5million Europe-wide testbed for Blockchain

Publications

John Collomosse, Andy Parsons (2024) To Authenticity, and Beyond! Building Safe and Fair Generative AI Upon the Three Pillars of Provenance, In: IEEE Computer Graphics and Applications 44(3) pp. 82-90

DOI: 10.1109/MCG.2024.3380168

Provenance facts, such as who made an image and how, can provide valuable context for users to make trust decisions about visual content. Against a backdrop of inexorable progress in generative AI for computer graphics, over two billion people will vote in public elections this year. Emerging standards and provenance enhancing tools promise to play an important role in fighting fake news and the spread of misinformation. In this article, we contrast three provenance enhancing technologies—metadata, fingerprinting, and watermarking—and discuss how we can build upon the complementary strengths of these three pillars to provide robust trust signals to support stories told by real and generative images. Beyond authenticity, we describe how provenance can also underpin new models for value creation in the age of generative AI. In doing so, we address other risks arising with generative AI such as ensuring training consent, and the proper attribution of credit to creatives who contribute their work to train generative models. We show that provenance may be combined with distributed ledger technology to develop novel solutions for recognizing and rewarding creative endeavor in the age of generative AI.

Vishal Asnani, John Collomosse, Tu Van Bui, Xiaoming Liu, Shruti Agarwal (2025) ProMark: Proactive Diffusion Watermarking for Causal Attribution

Generative AI (GenAI) is transforming creative workflows through the capability to synthesize and manipulate images via high-level prompts. Yet creatives are not well supported to receive recognition or reward for the use of their content in GenAI training. To this end, we propose ProMark, a causal attribution technique to attribute a synthetically generated image to its training data concepts like objects, motifs, templates, artists, or styles. The concept information is proactively embedded into the input training images using imperceptible watermarks, and the diffusion models (unconditional or conditional) are trained to retain the corresponding watermarks in generated images. We show that we can embed as many as 2^16 unique watermarks into the training data, and each training image can contain more than one watermark. ProMark can maintain image quality whilst outperforming correlation-based attribution. Finally, several qualitative examples are presented, providing the confidence that the presence of the watermark conveys a causative relationship between training data and synthetic images.

Pengxiang Li, Lu Yin, John Collomosse, Shiwei Liu Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

DOI: 10.48550/arxiv.2412.13795

Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network--both shallow and deep layers--to contribute effectively to training. Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn better compared to those using Pre-LN or Post-LN during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
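The layer assignment described in this abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see their repository for that): the `post_ln_fraction` parameter, its default value, and the toy `sublayer` function standing in for attention or MLP blocks are all assumptions made for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x, scale):
    """Hypothetical stand-in for an attention or MLP sublayer."""
    return [scale * math.tanh(v) for v in x]

def post_ln_block(x, scale):
    # Post-LN: normalise after the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x, scale))])

def pre_ln_block(x, scale):
    # Pre-LN: normalise the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x), scale))]

def mix_ln_schedule(num_layers, post_ln_fraction=0.25):
    """Earlier layers use Post-LN, deeper layers use Pre-LN.

    The fraction of Post-LN layers is an assumed hyperparameter.
    """
    n_post = int(num_layers * post_ln_fraction)
    return ["post" if i < n_post else "pre" for i in range(num_layers)]

def forward(x, num_layers, post_ln_fraction=0.25, scale=0.5):
    """Run a vector through a stack of blocks following the schedule."""
    for kind in mix_ln_schedule(num_layers, post_ln_fraction):
        block = post_ln_block if kind == "post" else pre_ln_block
        x = block(x, scale)
    return x
```

For example, `mix_ln_schedule(8, 0.25)` assigns Post-LN to the first two layers and Pre-LN to the remaining six.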

John Philip Collomosse, Timothy Paul James Gerard Kindberg (2015) Method of generating a sequence of display frames for display on a display device

A method of generating a sequence of display frames for display on a display device, wherein the sequence of display frames are derived from a data string which is encoded to include error correction in order to enable recreation of the data string at a receiving device, includes dividing the data string to be encoded into a plurality of source segments; encoding the plurality of source segments to generate a plurality of codewords, each codeword comprising a plurality of codeword bits; and positioning codeword bits in the sequence of frames.

Kar Balan, Andrew Gilbert, John Collomosse (2025) Content ARCs: Decentralized Content Rights in the Age of Generative AI, In: International Conference on AI and the Digital Economy (CADE 2025) IEEE

The rise of Generative AI (GenAI) has sparked significant debate over balancing the interests of creative rightsholders and AI developers. As GenAI models are trained on vast datasets that often include copyrighted material, questions around fair compensation and proper attribution have become increasingly urgent. To address these challenges, this paper proposes a framework called Content ARCs (Authenticity, Rights, Compensation). By combining open standards for provenance and dynamic licensing with data attribution, and decentralized technologies, Content ARCs create a mechanism for managing rights and compensating creators for using their work in AI training. We characterize several nascent works in the AI data licensing space within Content ARCs and identify where challenges remain to fully implement the end-to-end framework.

Dan Ruta, Andrew Gilbert, Saeid Motiian, Baldo Faieta, Zhe Lin, John Collomosse (2023) HyperNST: Hyper-Networks for Neural Style Transfer, In: Computer Vision – ECCV 2022 Workshops. Proceedings, Part I Springer

DOI: 10.1007/978-3-031-25056-9_14

We present HyperNST, a neural style transfer (NST) technique for the artistic stylization of images, based on Hyper-networks and the StyleGAN2 architecture. Our contribution is a novel method for inducing style transfer parameterized by a metric space, pre-trained for style-based visual search (SBVS). We show for the first time that such a space may be used to drive NST, enabling the application and interpolation of styles from an SBVS system. The technical contribution is a hyper-network that predicts weight updates to a StyleGAN2 pre-trained over a diverse gamut of artistic content (portraits), tailoring the style parameterization on a per-region basis using a semantic map of the facial regions. We show HyperNST to exceed the state of the art in content preservation for our stylized content while retaining good style transfer performance.

Dan Ruta, Andrew Gilbert, Pranav Aggarwal, Naveen Marri, Ajinkya Kale, Jo Briggs, Chris Speed, Hailin Jin, Baldo Faieta, Alex Filipkowski, Zhe Lin, John Collomosse (2022) StyleBabel: artistic style tagging and captioning, In: Computer Vision – ECCV 2022 17th European Conference Tel Aviv, Israel, October 23–27, 2022 Proceedings, Part VIII 13668 Springer

DOI: 10.1007/978-3-031-20074-8_13

We present StyleBabel, a unique open access dataset of natural language captions and free-form tags describing the artistic style of over 135K digital artworks, collected via a novel participatory method from experts studying at specialist art and design schools. StyleBabel was collected via an iterative method, inspired by ‘Grounded Theory’: a qualitative approach that enables annotation while co-evolving a shared language for fine-grained artistic style attribute description. We demonstrate several downstream tasks for StyleBabel, adapting the recent ALADIN architecture for fine-grained style similarity, to train cross-modal embeddings for: 1) free-form tag generation; 2) natural language description of artistic style; 3) fine-grained text search of style. To do so, we extend ALADIN with recent advances in Visual Transformer (ViT) and cross-modal representation learning, achieving a state of the art accuracy in fine-grained style retrieval.

Kar Balan, Andrew Gilbert, John Philip Collomosse (2024) PDFed: Privacy-Preserving and Decentralized Asynchronous Federated Learning for Diffusion Models, In: Proceedings of 21st ACM SIGGRAPH Conference on Visual Media Production 8 pp. 1-9 Association for Computing Machinery (ACM)

DOI: 10.1145/3697294.3697306

We present PDFed, a decentralized, aggregator-free, and asynchronous federated learning protocol for training image diffusion models using a public blockchain. In general, diffusion models are prone to memorization of training data, raising privacy and ethical concerns (e.g., regurgitation of private training data in generated images). Federated learning (FL) offers a partial solution via collaborative model training across distributed nodes that safeguard local data privacy. PDFed proposes a novel sample-based score that measures the novelty and quality of generated samples, incorporating these into a blockchain-based federated learning protocol that we show reduces private data memorization in the collaboratively trained model. In addition, PDFed enables asynchronous collaboration among participants with varying hardware capabilities, facilitating broader participation. The protocol records the provenance of AI models, improving transparency and auditability, while also considering automated incentive and reward mechanisms for participants. PDFed aims to empower artists and creators by protecting the privacy of creative works and enabling decentralized, peer-to-peer collaboration. The protocol positively impacts the creative economy by opening up novel revenue streams and fostering innovative ways for artists to benefit from their contributions to the AI space.

Yash Mahesh Kulthe, Andrew Gilbert, John Philip Collomosse (2025) MultiNeRF: Multiple Watermark Embedding for Neural Radiance Fields, In: Pre-print

We present MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks within images rendered by a single Neural Radiance Field (NeRF) model, whilst maintaining high visual quality. Our approach extends the TensoRF NeRF model by incorporating a dedicated watermark grid alongside the existing geometry and appearance grids. This extension ensures higher watermark capacity without entangling watermark signals with scene content. We propose a FiLM-based conditional modulation mechanism that dynamically activates watermarks based on input identifiers, allowing multiple independent watermarks to be embedded and extracted without requiring model retraining. MultiNeRF is validated on the NeRF-Synthetic and LLFF datasets, with statistically significant improvements in robust capacity without compromising rendering quality. By generalizing single-watermark NeRF methods into a flexible multi-watermarking framework, MultiNeRF provides a scalable solution for 3D content attribution.

Dan Sebastian Ruta, Andrew Gilbert, John Philip Collomosse, Eli Shechtman, Nick Kolkin (2024) NeAT: Neural Artistic Tracing for Beautiful Style Transfer

Style transfer is the task of reproducing the semantic contents of a source image in the artistic style of a second target image. In this paper, we present NeAT, a new state-of-the-art feed-forward style transfer method. We re-formulate feed-forward style transfer as image editing, rather than image generation, resulting in a model which improves over the state-of-the-art in both preserving the source content and matching the target style. One component of our model’s success is identifying and fixing "style halos", a commonly occurring artefact across many style transfer techniques. In addition to training and testing on standard datasets, we introduce the BBST-4M dataset, a new, large scale, high resolution dataset of 4M images. As a component of curating this data, we present a novel model able to classify if an image is stylistic. We use BBST-4M to improve and measure the generalization of NeAT across a huge variety of styles. Not only does NeAT offer state-of-the-art quality and generalization, it is designed and trained for fast inference at high resolution.

Dan Sebastian Ruta, Gemma Canet Tarres, Alexander Black, Andrew Gilbert, John Philip Collomosse (2024) Self-supervised disentangled representation learning of artistic style through Neural Style Transfer

We present a new method for learning a fine-grained representation of visual style. Representation learning aims to discover individual salient features of a domain in a compact and descriptive form that strongly identifies the unique characteristics of that domain. Prior visual style representation works attempt to disentangle style (i.e. appearance) from content (i.e. semantics) yet a complete separation has yet to be achieved. We present a technique to learn a representation of visual style more strongly disentangled from the semantic content depicted in an image. We use Neural Style Transfer (NST) to measure and drive the learning signal and achieve state-of-the-art representation learning on explicitly disentangled metrics. We show that strongly addressing the disentanglement of style and content leads to large gains in style-specific metrics, encoding far less semantic information and achieving state-of-the-art accuracy in downstream style matching (retrieval) and zero-shot style tagging tasks.

Dan Sebastian Ruta, Andrew Gilbert, Gemma Canet Tarres, Eli Shechtman, Nick Kolkin, John Philip Collomosse (2024) Diff-NST: Diffusion interleaving for deformable neural style transfer

Neural Style Transfer (NST) is the field of study applying neural techniques to modify the artistic appearance of a content image to match the style of a reference style image. Traditionally, NST methods have focused on texture-based image edits, affecting mostly low level information and keeping most image structures the same. However, style-based deformation of the content is desirable for some styles, especially in cases where the style is abstract or the primary concept of the style is in its deformed rendition of some content. With the recent introduction of diffusion models, such as Stable Diffusion, we can access far more powerful image generation techniques, enabling new possibilities. In our work, we propose using this new class of models to perform style transfer while enabling deformable style transfer, an elusive capability in previous models. We show how leveraging the priors of these models can expose new artistic controls at inference time, and we document our findings in exploring this new direction for the field of style transfer.

Zhe Lin, Zhifei Zhang, He Zhang, Andrew Gilbert, John Philip Collomosse, Soo Ye Kim (2025) Multitwine: Multi-Object Compositing with Text and Layout Control, In: Pre-print

We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like 'taking a selfie', our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.

E Oneill, John Collomosse, T Jay, K Yousef, M Reiser, S Jones (2010) Older User Experience: An Evaluation with a Location-based Mobile Multimedia Service, IEEE

This article reports a user-experience study in which a group of 18 older adults used a location-based mobile multimedia service in the setting of a rural nature reserve. The prototype system offered a variety of means of obtaining rich multimedia content from oak waymarker posts using a mobile phone. Free text questionnaires and focus groups were employed to investigate participants' experiences with the system and their attitudes to the use of mobile and pervasive systems in general. The users' experiences with the system were positive with respect to the design of the system in the context of the surrounding natural environment. However, the authors found some significant barriers to their adoption of mobile and pervasive systems as replacements for traditional information sources.

Burkhard Schafer, Glenn Charles Parry, John Philip Collomosse, Steven Alfred Schneider, Chris Speed, Christopher Elsden (2022) DECaDE Contribution to the Law Commission Call for Evidence on The Changing Law on Ownership in Digital Assets

This response to the Law Commission Call for Evidence on The Changing Law on Ownership in Digital Assets is written on behalf of DECaDE, the UKRI-funded Centre for the Decentralised Digital Economy. DECaDE is a multi-disciplinary collaboration between the Universities of Surrey, Edinburgh and the Digital Catapult. https://decade.ac.uk/ The response was coordinated by Professor Burkhard Schafer, University of Edinburgh.

Glenn Parry, John Collomosse (2021) Perspectives on “Good” in Blockchain for Good, In: Frontiers in Blockchain 3 609136 Frontiers Media S.A

DOI: 10.3389/fbloc.2020.609136

Blockchain projects have been developed to extend the reach of distributed ledger technology (DLT) beyond cryptocurrency to achieve “good” in the world. Such projects may make a claim for moral, ethical, and responsible intent, but many researchers have not critically examined what good means in context. The concept of good has been debated for centuries and whilst we will not conclude the argument, we should engage in the discourse. We propose the idea that exploration across micro, meso, and macro levels of value creating ecosystems is needed. The implications, both practical and theoretical, of the use of blockchain for good require analysis. As the ambition for blockchain innovations to transform society for the better becomes practical reality, understanding of such change will come from transdisciplinary researchers able to bridge knowledge of social and technical systems.

Dan Ruta, Saeid Motiian, Baldo Faieta, Zhe Lin, Hailin Jin, Alex Filipkowski, Andrew Gilbert, John Collomosse (2021) ALADIN: All Layer Adaptive Instance Normalization for Fine-grained Style Similarity, In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) IEEE

DOI: 10.1109/ICCV48922.2021.01171

We present ALADIN (All Layer AdaIN); a novel architecture for searching images based on the similarity of their artistic style. Representation learning is critical to visual search, where distance in the learned search embedding reflects image similarity. Learning an embedding that discriminates fine-grained variations in style is hard, due to the difficulty of defining and labelling style. ALADIN takes a weakly supervised approach to learning a representation for fine-grained style similarity of digital artworks, leveraging BAM-FG, a novel large-scale dataset of user generated content groupings gathered from the web. ALADIN sets a new state of the art accuracy for style-based visual search over both coarse labelled style data (BAM) and BAM-FG; a new 2.62 million image dataset of 310,000 fine-grained style groupings also contributed by this work.

Matthew Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, John Collomosse (2020) Data for 'Total Capture', University of Surrey

DOI: 10.15126/surreydata.00841993

T Wang, A Mansfield, R Hu, JP Collomosse (2009) An Evolutionary Approach to Automatic Video Editing, In: 2009 Conference for Visual Media Production: CVMP 2009 pp. 127-134

DOI: 10.1109/CVMP.2009.8

Frances Liddell, Ella Tallyn, Evan Morgan, Kar Balan, Martin Disley, Theodore Koterwas, Billy Dixon, Caterina Moruzzi, John Philip Collomosse, Chris Elsden (2024) ORAgen: Exploring the Design of Attribution through Media Tokenisation, In: Designing Interactive Systems Conference pp. 229-233 ACM

DOI: 10.1145/3656156.3663693

In this work-in-progress, we present ORAgen, as ‘unfinished software’, materialised through a demonstrative web application that enables participants to engage with a novel approach to media tokenisation – the ORA framework. By presenting ORAgen in ‘think-aloud’ interviews with 17 professionals working in the creative and cultural industries, we explore potential values of media tokenisation in relation to existing challenges they face related to ownership, rights, and attribution. From our initial findings, we reflect specifically on the challenges of attribution and ongoing control of creative media, and examine how media tokenisation, and underpinning distributed ledger technologies can enable new approaches to designing attribution.

R Hu, M Barnard, J Collomosse (2010) Gradient Field Descriptor for Sketch based Retrieval and Localization, In: Proceedings of Intl. Conf. on Image Proc. (ICIP) pp. 1025-1028

DOI: 10.1109/ICIP.2010.5649331

We present an image retrieval system driven by free-hand sketched queries depicting shape. We introduce Gradient Field HoG (GF-HOG) as a depiction invariant image descriptor, encapsulating local spatial structure in the sketch and facilitating efficient codebook based retrieval. We show improved retrieval accuracy over 3 leading descriptors (Self Similarity, SIFT, HoG) across two datasets (Flickr160, ETHZ extended objects), and explain how GF-HOG can be combined with RANSAC to localize sketched objects within relevant images. We also demonstrate a prototype sketch driven photo montage application based on our system.

R Hu, John Collomosse (2010) Motion-sketch based Video Retrieval using a Trellis Levenshtein Distance, In: Intl. Conf. on Pattern Recognition (ICPR)

We present a fast technique for retrieving video clips using free-hand sketched queries. Visual keypoints within each video are detected and tracked to form short trajectories, which are clustered to form a set of space-time tokens summarising video content. A Viterbi process matches a space-time graph of tokens to a description of colour and motion extracted from the query sketch. Inaccuracies in the sketched query are ameliorated by computing path cost using a Levenshtein (edit) distance. We evaluate over datasets of sports footage.
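As background to the edit distance mentioned in this abstract, the following is a minimal sketch of the standard Levenshtein distance between two token sequences. It is the textbook dynamic programme, not the trellis variant used in the paper, and the motion token names in the usage note are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of insertions, deletions and
    substitutions needed to turn sequence `a` into sequence `b`."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j items of `b`.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Because the function works on any sequences, a sketched motion path such as `["left", "left", "up"]` can be compared with a clip's token sequence `["left", "up"]`, which differ by a single deletion.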

R Hu, T Wang, J Collomosse (2011) A Bag-of-Regions approach to Sketch-based Image Retrieval, In: International Conference on Image Processing (ICIP) pp. 3661-3664

DOI: 10.1109/ICIP.2011.6116513

This paper presents a sketch-based image retrieval system using a bag-of-region representation of images. Regions from the nodes of a hierarchical region tree range in various scales of details. They have appealing properties for object level inference such as the naturally encoded shape and scale information of objects and the specified domains on which to compute features without being affected by clutter from outside the region. The proposed approach builds shape descriptor on the salient shape among the clutters and thus yields significant performance improvements over the previous results on three leading descriptors in Bag-of-Words framework for sketch based image retrieval. Matched region also facilitates the localization of sketched object within the retrieved image.

R Hu, S James, J Collomosse (2012) Annotated Free-hand Sketches for Video Retrieval using Object Semantics and Motion, In: Lecture Notes in Computer Science Proc. Intl. Conf. on Multimedia Modelling 7131 pp. 473-484

DOI: 10.1007/978-3-642-27355-1_44

We present a novel video retrieval system that accepts annotated free-hand sketches as queries. Existing sketch based video retrieval (SBVR) systems enable the appearance and movements of objects to be searched naturally through pictorial representations. Whilst visually expressive, such systems present an imprecise vehicle for conveying the semantics (e.g. object types) within a scene. Our contribution is to fuse the semantic richness of text with the expressivity of sketch, to create a hybrid 'semantic sketch' based video retrieval system. Trajectory extraction and clustering are applied to pre-process each clip into a video object representation that we augment with object classification and colour information. The result is a system capable of searching videos based on the desired colour, motion path, and semantic labels of the objects present. We evaluate the performance of our system over the TSF dataset of broadcast sports footage.

T Wang, John Collomosse, R Hu, D Slatter, P Cheatle, D Greig (2011) Stylized Ambient Displays of Digital Media Collections, In: Computers and Graphics 35(1) pp. 54-66 Elsevier

DOI: 10.1016/j.cag.2010.11.004

The falling cost of digital cameras and camcorders has encouraged the creation of massive collections of personal digital media. However, once captured, this media is infrequently accessed and often lies dormant on users' PCs. We present a system to breathe life into home digital media collections, drawing upon artistic stylization to create a “Digital Ambient Display” that automatically selects, stylizes and transitions between digital contents in a semantically meaningful sequence. We present a novel algorithm based on multi-label graph cut for segmenting video into temporally coherent region maps. These maps are used to both stylize video into cartoons and paintings, and measure visual similarity between frames for smooth sequence transitions. The system automatically structures the media collection into a hierarchical representation based on visual content and semantics. Graph optimization is applied to adaptively sequence content for display in a coarse-to-fine manner, driven by user attention level (detected in real-time by a webcam). Our system is deployed on embedded hardware in the form of a compact digital photo frame. We demonstrate coherent segmentation and stylization over a variety of home videos and photos. We evaluate our media sequencing algorithm via a small-scale user study, indicating that our adaptive display conveys a more compelling media consumption experience than simple linear “slide-shows”.

Gemma Canet Tarres, Zhe Lin, Zhifei Zhang, Jianming Zhang, Yizhi Song, Dan Sebastian Ruta, Andrew Gilbert, John Philip Collomosse, Soo Ye Kim (2024) Thinking Outside the BBox: Unconstrained Generative Object Compositing

Compositing an object into an image involves multiple non-trivial sub-tasks such as object placement and scaling, color/lighting harmonization, viewpoint/geometry adjustment, and shadow/reflection generation. Recent generative image compositing methods leverage diffusion models to handle multiple sub-tasks at once. However, existing models face limitations due to their reliance on masking the original object during training, which constrains their generation to the input mask. Furthermore, obtaining an accurate input mask specifying the location and scale of the object in a new image can be highly challenging. To overcome such limitations, we define a novel problem of unconstrained generative object compositing, i.e., the generation is not bounded by the mask, and train a diffusion-based model on a synthesized paired dataset. Our first-of-its-kind model is able to generate object effects such as shadows and reflections that go beyond the mask, enhancing image realism. Additionally, if an empty mask is provided, our model automatically places the object in diverse natural locations and scales, accelerating the compositing workflow. Our model outperforms existing object placement and compositing models in various quality metrics and user studies.

Tu Bui, Shruti Agarwal, Ning Yu, John Collomosse (2023) RoSteALS: Robust Steganography using Autoencoder Latent Space, In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) pp. 933-942 IEEE

DOI: 10.1109/CVPRW59228.2023.00100

Data hiding such as steganography and invisible watermarking has important applications in copyright protection, privacy-preserved communication and content provenance. Existing works often fall short in preserving image quality or in robustness against perturbations, or are too complex to train. We propose RoSteALS, a practical steganography technique leveraging frozen pretrained autoencoders to free the payload embedding from learning the distribution of cover images. RoSteALS has a lightweight secret encoder of just 300k parameters, is easy to train, has perfect secret recovery performance and comparable image quality on three benchmarks. Additionally, RoSteALS can be adapted for novel cover-less steganography applications in which the cover image can be sampled from noise or conditioned on text prompts via a denoising diffusion process. Our model and code are available at https://github.com/TuBui/RoSteALS.

Evren Imre, Adrian Hilton (2012) Through-the-Lens Synchronisation for Heterogeneous Camera Networks, In: R Bowden, J Collomosse, K Mikolajczyk (eds.), PROCEEDINGS OF THE BRITISH MACHINE VISION CONFERENCE 2012 BMVA Press

DOI: 10.5244/C.26.103

Accurate camera synchronisation is indispensable for many video processing tasks, such as surveillance and 3D modelling. Video-based synchronisation facilitates the design and setup of networks with moving cameras or devices without an external synchronisation capability, such as low-cost web cameras, or Kinects. In this paper, we present an algorithm which can work with such heterogeneous networks. The algorithm first finds the corresponding frame indices between each camera pair, with the help of image feature correspondences and epipolar geometry. Then, for each pair, a relative frame rate and offset are computed by fitting a 2D line to the index correspondences. These pairwise relations define a graph, in which each spanning cycle comprises an absolute synchronisation hypothesis. The optimal solution is found by an exhaustive search over the spanning cycles. The algorithm is experimentally demonstrated to yield highly accurate estimates in a number of scenarios involving static and moving cameras, and Kinect.
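
The pairwise step described above, fitting a 2D line to frame index correspondences, can be illustrated with a minimal sketch. It assumes the correspondences are already available as (frame in camera A, frame in camera B) pairs and uses ordinary least squares; the function name `fit_frame_mapping` is hypothetical, and the paper may well use a more robust estimator than the plain fit shown here.

```python
from typing import List, Tuple


def fit_frame_mapping(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit j = rate * i + offset to (i, j) frame index correspondences.

    `rate` approximates the relative frame rate of camera B with respect
    to camera A, and `offset` the temporal offset in frames of camera B.
    Plain least squares is an assumption made for illustration only.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two correspondences")
    mean_i = sum(i for i, _ in pairs) / n
    mean_j = sum(j for _, j in pairs) / n
    var_i = sum((i - mean_i) ** 2 for i, _ in pairs)
    if var_i == 0:
        raise ValueError("all correspondences share the same source frame")
    cov_ij = sum((i - mean_i) * (j - mean_j) for i, j in pairs)
    rate = cov_ij / var_i
    offset = mean_j - rate * mean_i
    return rate, offset
```

For example, correspondences generated as `(i, 2 * i + 5)` yield a rate close to 2 and an offset close to 5, i.e. camera B runs at twice the frame rate and starts five frames later in its own index.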

J Kim, JP Collomosse (2013) Semi-automated Video Logging by Incremental and Transfer Learning

We describe a semi-automatic video logging system, capable of annotating frames with semantic metadata describing the objects present. The system learns by visual examples provided interactively by the logging operator, which are learned incrementally to provide increased automation over time. Transfer learning is initially used to bootstrap the system using relevant visual examples from ImageNet. We adapt the hard-assignment Bag of Words strategy for object recognition to our interactive use context, showing transfer learning to significantly reduce the degree of interaction required.

Tu Bui, Leonardo Ribeiro, Moacir Ponti, John Collomosse (2019) Deep Manifold Alignment for Mid-Grain Sketch Based Image Retrieval, In: Computer Vision – ACCV 2018 vol. 11363 pp. 314-329 Springer Verlag

DOI: 10.1007/978-3-030-20893-6_20

We present an algorithm for visually searching image collections using free-hand sketched queries. Prior sketch based image retrieval (SBIR) algorithms adopt either a category-level or fine-grain (instance-level) definition of cross-domain similarity—returning images that match the sketched object class (category-level SBIR), or a specific instance of that object (fine-grain SBIR). In this paper we take the middle-ground; proposing an SBIR algorithm that returns images sharing both the object category and key visual characteristics of the sketched query without assuming photo-approximate sketches from the user. We describe a deeply learned cross-domain embedding in which ‘mid-grain’ sketch-image similarity may be measured, reporting on the efficacy of unsupervised and semi-supervised manifold alignment techniques to encourage better intra-category (mid-grain) discrimination within that embedding. We propose a new mid-grain sketch-image dataset (MidGrain65c) and demonstrate not only mid-grain discrimination, but also improved category-level discrimination using our approach.

JP Collomosse, G McNeill, L Watts (2008) Free-hand Sketch Grouping for Video Retrieval, In: 19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6 pp. 884-887

P Benard, J Thollot, JP Collomosse (2013) Temporally Coherent Video Stylization, In: J Collomosse, P Rosin (eds.), Image and video based artistic stylization vol. 42 pp. 257-284 Springer-Verlag

P.M. Hall, J.P. Collomosse, Yi-Zhe Song, P. Shen, C. Li (2007) RTcams: A new perspective on nonphotorealistic rendering from photographs, In: IEEE Transactions on Visualization and Computer Graphics 13(5) pp. 966-979 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TVCG.2007.1047

We introduce a simple but versatile camera model that we call the Rational Tensor Camera (RTcam). RTcams are well principled mathematically and provably subsume several important contemporary camera models in both computer graphics and vision; their generality is one contribution. They can be used alone or compounded to produce more complicated visual effects. In this paper, we apply RTcams to generate synthetic artwork with novel perspective effects from real photographs. Existing Nonphotorealistic Rendering from Photographs (NPRP) is constrained to the projection inherent in the source photograph, which is most often linear. RTcams lift this restriction and so contribute to NPRP via multiperspective projection. This paper describes RTcams, compares them to contemporary alternatives, and discusses how to control them in practice. Illustrative examples are provided throughout.

Leo Sampaio Ferraz Ribeiro, Tu Bui, John Collomosse, Moacir Ponti (2023) Scene designer: compositional sketch-based image retrieval with contrastive learning and an auxiliary synthesis task, In: Multimedia tools and applications 82(24) pp. 38117-38139 Springer Nature

DOI: 10.1007/s11042-022-14282-0

Scene Designer is a novel method for Compositional Sketch-based Image Retrieval (CSBIR) that combines semantic layout synthesis with its main task both to boost performance and enable new creative workflows. While most studies on sketch focus on single-object retrieval, we look to multi-object scenes instead for increased query specificity and flexibility. Our training protocol improves contrastive learning by synthesising harder negative samples and introduces a layout synthesis task that further improves the semantic scene representations. We show that our object-oriented graph neural network (GNN) more than doubles the current SoTA recall@1 on the SketchyCOCO CSBIR benchmark under our novel contrastive learning setting and combined search and synthesis tasks. Furthermore, we introduce the first large-scale sketched scene dataset and benchmark in QuickDrawCOCO.

Eric Nguyen, Tu Bui, Viswanathan Swaminathan, John Collomosse (2021) OSCAR-Net: Object-centric Scene Graph Attention for Image Attribution, In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 14479-14488 IEEE

DOI: 10.1109/ICCV48922.2021.01423

Images tell powerful stories but cannot always be trusted. Matching images back to trusted sources (attribution) enables users to make a more informed judgment of the images they encounter online. We propose a robust image hashing algorithm to perform such matching. Our hash is sensitive to manipulation of subtle, salient visual details that can substantially change the story told by an image. Yet the hash is invariant to benign transformations (changes in quality, codecs, sizes, shapes, etc.) experienced by images during online redistribution. Our key contribution is OSCAR-Net (Object-centric Scene Graph Attention for Image Attribution Network); a robust image hashing model inspired by recent successes of Transformers in the visual domain. OSCAR-Net constructs a scene graph representation that attends to fine-grained changes of every object's visual appearance and their spatial relationships. The network is trained via contrastive learning on a dataset of original and manipulated images yielding a state of the art image hash for content fingerprinting that scales to millions of images.

J. Collomosse, T. Bui, A. Brown, J. Sheridan, A. Green, M. Bell, J. Fawcett, J. Higgins, O. Thereaux (2020) ARCHANGEL: Trusted Archives of Digital Public Documents, In: Proceedings of the ACM Symposium on Document Engineering 2018 - DocEng '18 pp. 1-4

DOI: 10.1145/3209280.3229120

We present ARCHANGEL; a decentralised platform for ensuring the long-term integrity of digital documents stored within public archives. Document integrity is fundamental to public trust in archives. Yet currently that trust is built upon institutional reputation: trust at face value in a centralised authority, like a national government archive or University. ARCHANGEL proposes a shift to a technological underscoring of that trust, using distributed ledger technology (DLT) to cryptographically guarantee the provenance, immutability and so the integrity of archived documents. We describe the ARCHANGEL architecture, and report on a prototype of that architecture built over the Ethereum infrastructure. We report early evaluation and feedback of ARCHANGEL from stakeholders in the research data archives space.
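
The general idea of anchoring document integrity in an append-only ledger can be sketched as follows. This is a toy, in-memory hash chain written for illustration only: it is not the ARCHANGEL architecture and does not use Ethereum. The names `ToyLedger`, `Block`, `register` and `verify`, and the choice of SHA-256 over raw bytes, are assumptions of this sketch.

```python
import hashlib
from dataclasses import dataclass
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Block:
    doc_id: str
    doc_hash: str    # hash of the archived document's bytes
    prev_hash: str   # hash of the preceding block, chaining the records
    block_hash: str  # hash over this block's fields


class ToyLedger:
    """Append-only list of hash-chained records (illustrative only)."""

    def __init__(self) -> None:
        self.blocks: List[Block] = []

    @staticmethod
    def _digest(doc_id: str, doc_hash: str, prev_hash: str) -> str:
        return sha256_hex(f"{doc_id}|{doc_hash}|{prev_hash}".encode())

    def register(self, doc_id: str, content: bytes) -> Block:
        prev = self.blocks[-1].block_hash if self.blocks else GENESIS
        doc_hash = sha256_hex(content)
        block = Block(doc_id, doc_hash, prev, self._digest(doc_id, doc_hash, prev))
        self.blocks.append(block)
        return block

    def chain_is_valid(self) -> bool:
        # Recompute every block hash; any edited record breaks the chain.
        prev = GENESIS
        for b in self.blocks:
            if b.prev_hash != prev:
                return False
            if b.block_hash != self._digest(b.doc_id, b.doc_hash, prev):
                return False
            prev = b.block_hash
        return True

    def verify(self, doc_id: str, content: bytes) -> bool:
        # A document is intact if its current hash matches a registered one.
        h = sha256_hex(content)
        return any(b.doc_id == doc_id and b.doc_hash == h for b in self.blocks)
```

Registering a document and later calling `verify` with altered bytes returns `False`, while `chain_is_valid` detects edits made to the stored records themselves. In a real distributed ledger the chain is replicated and agreed across many parties, which this single-process sketch does not model.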

Dan Ruta, Andrew Gilbert, John Collomosse, Eli Shechtman, Nicholas Kolkin NeAT: Neural Artistic Tracing for Beautiful Style Transfer

DOI: 10.48550/arxiv.2304.05139

Style transfer is the task of reproducing the semantic contents of a source image in the artistic style of a second target image. In this paper, we present NeAT, a new state-of-the-art feed-forward style transfer method. We re-formulate feed-forward style transfer as image editing, rather than image generation, resulting in a model which improves over the state-of-the-art in both preserving the source content and matching the target style. An important component of our model's success is identifying and fixing "style halos", a commonly occurring artefact across many style transfer techniques. In addition to training and testing on standard datasets, we introduce the BBST-4M dataset, a new, large scale, high resolution dataset of 4M images. As a component of curating this data, we present a novel model able to classify if an image is stylistic. We use BBST-4M to improve and measure the generalization of NeAT across a huge variety of styles. Not only does NeAT offer state-of-the-art quality and generalization, it is designed and trained for fast inference at high resolution.

JE Kyprianidis, JP Collomosse, T Isenberg, T Wang (2012) State of the Art: A Taxonomy of Artistic Stylization Techniques for Images and Video, In: IEEE Transactions on Visualization and Computer Graphics IEEE

DOI: 10.1109/TVCG.2012.160

This paper surveys the field of non-photorealistic rendering (NPR), focusing on techniques for transforming 2D input (images and video) into artistically stylized renderings. We first present a taxonomy of the 2D NPR algorithms developed over the past two decades, structured according to the design characteristics and behavior of each technique. We then describe a chronology of development from the semi-automatic paint systems of the early nineties, through to the automated painterly rendering systems of the late nineties driven by image gradient analysis. Two complementary trends in the NPR literature are then addressed, with reference to our taxonomy. First, the fusion of higher level computer vision and NPR, illustrating the trends toward scene analysis to drive artistic abstraction and diversity of style. Second, the evolution of local processing approaches toward edge-aware filtering for real-time stylization of images and video. The survey then concludes with a discussion of open challenges for 2D NPR identified in recent NPR symposia, including topics such as user and aesthetic evaluation.

Amir Fard Bahreini, John Collomosse, Marc-David L. Seidel, Maral Sotoudehnia, Carson C. Woo (2021) Distributing and Democratizing Institutional Power Through Decentralization, In: Building Decentralized Trust pp. 95-109 Springer International Publishing

DOI: 10.1007/978-3-030-54414-0_5

Discussions of decentralization with respect to blockchain have tended to focus on the architecture of decentralization, its influence, and potential technical hurdles. Furthermore, decentralization is often conflated with distribution in these discussions. The authors argue, instead, for a relational definition of decentralization that considers blockchain within a data-social-technical framework. By considering conceptual ambiguities and the influence of decentralization on each of the data, technical, and social layers of blockchain, the authors analyze the disruptive power of decentralization. Shifting to a decentralized organization of activity necessarily requires a transition that threatens existing power structures and dynamics, requiring actors to consider how such changes in power might interplay with, and be mitigated by changes in, trust dynamics. These institutional barriers to adoption must be understood as part of the design of the technology to help realize its ultimate societal success. As the process of decentralization evolves, further research will be needed to expose and mitigate associated challenges to data preservation and security, and to consider how societal resistance and related pitfalls might impact and restrict adoption of the democratization process.

John Collomosse, P Huang, A Hilton, M Tejera (2015) Hybrid Skeletal-Surface Motion Graphs for Character Animation from 4D Performance Capture, In: ACM Transactions on Graphics 34(2)

DOI: 10.1145/2699643

We present a novel hybrid representation for character animation from 4D Performance Capture (4DPC) data which combines skeletal control with surface motion graphs. 4DPC data are temporally aligned 3D mesh sequence reconstructions of the dynamic surface shape and associated appearance from multiple view video. The hybrid representation supports the production of novel surface sequences which satisfy constraints from user specified key-frames or a target skeletal motion. Motion graph path optimisation concatenates fragments of 4DPC data to satisfy the constraints whilst maintaining plausible surface motion at transitions between sequences. Spacetime editing of the mesh sequence using a learnt part-based Laplacian surface deformation model is performed to match the target skeletal motion and transition between sequences. The approach is quantitatively evaluated for three 4DPC datasets with a variety of clothing styles. Results for key-frame animation demonstrate production of novel sequences which satisfy constraints on timing and position of less than 1% of the sequence duration and path length. Evaluation of motion capture driven animation over a corpus of 130 sequences shows that the synthesised motion accurately matches the target skeletal motion. The combination of skeletal control with the surface motion graph extends the range and style of motion which can be produced whilst maintaining the natural dynamics of shape and appearance from the captured performance.

G McNeill, J Collomosse (2007) Reverse Storyboarding for Video Retrieval, In: Proceedings of 3rd European Conf. on Visual Media Production (CVMP)

JP Collomosse, D Rowntree, PM Hall (2003) Video analysis for Cartoon-style Special Effects, In: Proceedings 14th British Machine Vision Conference (BMVC) vol. 2 pp. 749-758

T Wang, JP Collomosse (2012) Probabilistic Motion Diffusion of Labeling Priors for Coherent Video Segmentation, In: IEEE Transactions on Multimedia 14(2) pp. 389-400 IEEE

DOI: 10.1109/TMM.2011.2177078

We present a robust algorithm for temporally coherent video segmentation. Our approach is driven by multi-label graph cut applied to successive frames, fusing information from the current frame with an appearance model and labeling priors propagated forward from past frames. We propagate using a novel motion diffusion model, producing a per-pixel motion distribution that mitigates cumulative estimation errors inherent in systems adopting “hard” decisions on pixel motion at each frame. Further, we encourage spatial coherence by imposing label consistency constraints within image regions (super-pixels) obtained via a bank of unsupervised frame segmentations, such as mean-shift. We demonstrate quantitative improvements in accuracy over state-of-the-art methods on a variety of sequences exhibiting clutter and agile motion, adopting the Berkeley methodology for our comparative evaluation.

Kar Balan, Alexander Black, Simon Jenni, Andy Parsons, Andrew Gilbert, John Philip Collomosse (2023) DECORAIT - DECentralized Opt-in/out Registry for AI Training

Figure 1: DECORAIT enables creatives to register consent (or not) for Generative AI training using their content, as well as to receive recognition and reward for that use. Provenance is traced via visual matching, and consent and ownership registered using a distributed ledger (blockchain). Here, a synthetic image is generated via the Dreambooth [32] method using prompt "a photo of [Subject]" and concept images (left). The red cross indicates images whose creatives have opted out of AI training via DECORAIT, which when taken into account leads to a significant visual change (right). DECORAIT also determines credit apportionment across the opted-in images and pays a proportionate reward to creators via crypto-currency micropayment.

We present DECORAIT; a decentralized registry through which content creators may assert their right to opt in or out of AI training as well as receive reward for their contributions. Generative AI (GenAI) enables images to be synthesized using AI models trained on vast amounts of data scraped from public sources. Model and content creators who may wish to share their work openly without sanctioning its use for training are thus presented with a data governance challenge. Further, establishing the provenance of GenAI training data is important to creatives to ensure fair recognition and reward for such use. We report a prototype of DECORAIT, which explores hierarchical clustering and a combination of on/off-chain storage to create a scalable decentralized registry to trace the provenance of GenAI training data in order to determine training consent and reward creatives who contribute that data. DECORAIT combines distributed ledger technology (DLT) with visual fingerprinting, leveraging the emerging C2PA (Coalition for Content Provenance and Authenticity) standard to create a secure, open registry through which creatives may express consent and data ownership for GenAI.

Cusuh Ham, Gemma Canet Tarres, Tu Bui, James Hays, Zhe Lin, John Collomosse (2022) CoGS: Controllable Generation and Search from Sketch and Style, In: S Avidan, G Brostow, M Cisse, G M Farinella, T Hassner (eds.), COMPUTER VISION - ECCV 2022, PT XVI vol. 13676 pp. 632-650 Springer Nature

DOI: 10.1007/978-3-031-19787-1_36

We present CoGS, a novel method for the style-conditioned, sketch-driven synthesis of images. CoGS enables exploration of diverse appearance possibilities for a given sketched object, enabling decoupled control over the structure and the appearance of the output. Coarse-grained control over object structure and appearance is enabled via an input sketch and an exemplar "style" conditioning image to a transformer-based sketch and style encoder to generate a discrete codebook representation. We map the codebook representation into a metric space, enabling fine-grained control over selection and interpolation between multiple synthesis options before generating the image via a vector quantized GAN (VQGAN) decoder. Our framework thereby unifies search and synthesis tasks, in that a sketch and style pair may be used to run an initial synthesis which may be refined via combination with similar results in a search corpus to produce an image more closely matching the user's intent. We show that our model, trained on the 125 object classes of our newly created Pseudosketches dataset, is capable of producing a diverse gamut of semantic content and appearance styles.

Dan Ruta, Gemma Canet Tarrés, Andrew Gilbert, Eli Shechtman, Nicholas Kolkin, John Collomosse DIFF-NST: Diffusion Interleaving For deFormable Neural Style Transfer

DOI: 10.48550/arxiv.2307.04157

Andrew Gilbert, Matthew Trumble, Charles Malleson, Adrian Hilton, John Collomosse (2018) Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation, In: International Journal of Computer Vision Springer Verlag

DOI: 10.1007/s11263-018-1118-y

We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long range dependencies among the solved joints. The two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed with state of the art performance reported on the popular Human 3.6M dataset [26], the newly released TotalCapture dataset and a challenging set of outdoor videos TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.

Dan Ruta, Gemma Canet Tarres, Alexander Black, Andrew Gilbert, John Collomosse ALADIN-NST: Self-supervised disentangled representation learning of artistic style through Neural Style Transfer

DOI: 10.48550/arxiv.2304.05755

Representation learning aims to discover individual salient features of a domain in a compact and descriptive form that strongly identifies the unique characteristics of a given sample respective to its domain. Existing works in visual style representation literature have tried to disentangle style from content during training explicitly. A complete separation between these has yet to be fully achieved. Our paper aims to learn a representation of visual artistic style more strongly disentangled from the semantic content depicted in an image. We use Neural Style Transfer (NST) to measure and drive the learning signal and achieve state-of-the-art representation learning on explicitly disentangled metrics. We show that strongly addressing the disentanglement of style and content leads to large gains in style-specific metrics, encoding far less semantic information and achieving state-of-the-art accuracy in downstream multimodal applications.

Alexander Black, Tu Bui, Simon Jenni, Vishy Swaminathan, John Collomosse VPN: Video Provenance Network for Robust Content Attribution, In: Proceedings - CVMP 2021: 18th ACM SIGGRAPH European Conference on Visual Media Production

DOI: 10.1145/3485441.3485650

We present VPN - a content attribution method for recovering provenance information from videos shared online. Platforms, and users, often transform video into different quality, codecs, sizes, shapes, etc. or slightly edit its content such as adding text or emoji, as they are redistributed online. We learn a robust search embedding for matching such video, invariant to these transformations, using full-length or truncated video queries. Once matched against a trusted database of video clips, associated information on the provenance of the clip is presented to the user. We use an inverted index to match temporal chunks of video using late-fusion to combine both visual and audio features. In both cases, features are extracted via a deep neural network trained using contrastive learning on a dataset of original and augmented video clips. We demonstrate high accuracy recall over a corpus of 100,000 videos.
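
The retrieval step described above, an inverted index over temporal chunks with late fusion of visual and audio evidence, can be sketched roughly as below. The sketch assumes each chunk descriptor has already been quantized to an integer codeword, scores videos by counting shared codewords, and fuses the two modalities with a fixed weight. `ChunkIndex`, `late_fusion` and the weight `w_visual` are hypothetical names, and the actual VPN features, quantization and scoring are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class ChunkIndex:
    """Inverted index mapping a chunk codeword to the videos containing it."""

    def __init__(self) -> None:
        self.postings: Dict[int, Set[str]] = defaultdict(set)

    def add(self, video_id: str, codewords: Iterable[int]) -> None:
        for code in codewords:
            self.postings[code].add(video_id)

    def votes(self, query_codewords: Iterable[int]) -> Dict[str, float]:
        # Each query chunk votes for every video sharing its codeword,
        # so truncated queries still accumulate evidence for their source.
        scores: Dict[str, float] = defaultdict(float)
        for code in query_codewords:
            for video_id in self.postings.get(code, ()):
                scores[video_id] += 1.0
        return dict(scores)


def late_fusion(
    visual: Dict[str, float], audio: Dict[str, float], w_visual: float = 0.5
) -> List[Tuple[str, float]]:
    """Combine per-modality scores after retrieval and rank the candidates."""
    fused = {
        vid: w_visual * visual.get(vid, 0.0) + (1.0 - w_visual) * audio.get(vid, 0.0)
        for vid in set(visual) | set(audio)
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

A query is matched by calling `votes` on a visual index and an audio index separately and passing both results to `late_fusion`; the top-ranked video id would then be used to look up provenance information in the trusted database.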

Leo Sampaio Ferraz Ribeiro, Tu Bui, John Collomosse, Moacir Ponti (2021) Scene Designer: a Unified Model for Scene Search and Synthesis from Sketch, In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) pp. 2424-2433 IEEE

DOI: 10.1109/ICCVW54120.2021.00275

Scene Designer is a novel method for searching and generating images using free-hand sketches of scene compositions; i.e. drawings that describe both the appearance and relative positions of objects. Our core contribution is a single unified model to learn both a cross-modal search embedding for matching sketched compositions to images, and an object embedding for layout synthesis. We show that a graph neural network (GNN) followed by Transformer under our novel contrastive learning setting is required to allow learning correlations between object type, appearance and arrangement, driving a mask generation module that synthesizes coherent scene layouts, whilst also delivering state of the art sketch based visual search of scenes.

Trisha Mittal, Ritwik Sinha, Viswanathan Swaminathan, John Collomosse, Dinesh Manocha (2023) Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis, In: 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW) pp. 643-652 IEEE

DOI: 10.1109/WACVW58289.2023.00071

As tools for content editing mature, and artificial intelligence (AI) based algorithms for synthesizing media grow, the presence of manipulated content across online media is increasing. This phenomenon causes the spread of misinformation, creating a greater need to distinguish between "real" and "manipulated" content. To this end, we present VIDEOSHAM, a dataset consisting of 826 videos (413 real and 413 manipulated). Many of the existing deepfake datasets focus exclusively on two types of facial manipulations: swapping with a different subject's face or altering the existing face. VIDEOSHAM, on the other hand, contains more diverse, context-rich, and human-centric, high-resolution videos manipulated using a combination of 6 different spatial and temporal attacks. Our analysis shows that state-of-the-art manipulation detection algorithms only work for a few specific attacks and do not scale well on VIDEOSHAM. We performed a user study on Amazon Mechanical Turk with 1200 participants to understand if they can differentiate between the real and manipulated videos in VIDEOSHAM. Finally, we dig deeper into the strengths and weaknesses of performances by humans and SOTA-algorithms to identify gaps that need to be filled with better AI algorithms. We present the dataset here(1).

E Oneill, John Collomosse, T Jay, K Yousef, M Reiser, S Jones (2010) Older User Experience: An Evaluation with a Location-based Mobile Multimedia Service, In: IEEE Vehicular Technology Magazine 5(1) pp. 31-38 IEEE

DOI: 10.1109/MVT.2009.935543

J Collomosse, P Hall (2005) Motion analysis in video: dolls, dynamic cues and Modern Art, In: Proceedings of 2nd Intl. Conf on Vision Video and Graphics (VVG) pp. 109-116

John Philip Collomosse, Thomas Gittings, Steve Schneider (2019) Robust Synthesis of Adversarial Visual Examples Using a Deep Image Prior

We present a novel method for generating robust adversarial image examples building upon the recent ‘deep image prior’ (DIP) that exploits convolutional network architectures to enforce plausible texture in image synthesis. Adversarial images are commonly generated by perturbing images to introduce high frequency noise that induces image misclassification, but that is fragile to subsequent digital manipulation of the image. We show that using DIP to reconstruct an image under adversarial constraint induces perturbations that are more robust to affine deformation, whilst remaining visually imperceptible. Furthermore we show that our DIP approach can also be adapted to produce local adversarial patches (‘adversarial stickers’). We demonstrate robust adversarial examples over a broad gamut of images and object classes drawn from the ImageNet dataset.

D Trujillo-Pisanty, A Durrant, S Martindale, S James, JP Collomosse (2014) Admixed Portrait: Reflections on Being Online as a New Parent, In: Proceedings of the 2014 conference on Designing interactive systems pp. 503-512

DOI: 10.1145/2598510.2602962

This Pictorial documents the process of designing a device as an intervention within a field study of new parents. The device was deployed in participating parents’ homes to invite reflection on their everyday experiences of portraying self and others through social media in their transition to parenthood. The design creates a dynamic representation of each participant’s Facebook photo collection, extracting and amalgamating ‘faces’ from it to create an alternative portrait of an online self. We document the rationale behind our design, explaining how its features were inspired and developed, and how they function to address research questions about human experience.

Charles Malleson, Marco Volino, Andrew Gilbert, Matthew Trumble, John Collomosse, Adrian Hilton (2017) Real-time Full-Body Motion Capture from Video and IMUs, In: PROCEEDINGS 2017 INTERNATIONAL CONFERENCE ON 3D VISION (3DV) pp. 449-457 IEEE

DOI: 10.1109/3DV.2017.00058

A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.

Dan Casas, Christian Richardt, John Collomosse, Christian Theobalt, Adrian Hilton (2015) 4D Model Flow: Precomputed Appearance Alignment for Real-time 4D Video Interpolation, In: Computer graphics forum 34(7) pp. 173-182 Wiley

DOI: 10.1111/cgf.12756

We introduce the concept of 4D model flow for the precomputed alignment of dynamic surface appearance across 4D video sequences of different motions reconstructed from multi-view video. Precomputed 4D model flow allows the efficient parametrization of surface appearance from the captured videos, which enables efficient real-time rendering of interpolated 4D video sequences whilst accurately reproducing visual dynamics, even when using a coarse underlying geometry. We estimate the 4D model flow using an image-based approach that is guided by available geometry proxies. We propose a novel representation in surface texture space for efficient storage and online parametric interpolation of dynamic appearance. Our 4D model flow overcomes previous requirements for computationally expensive online optical flow computation for data-driven alignment of dynamic surface appearance by precomputing the appearance alignment. This leads to an efficient rendering technique that enables the online interpolation between 4D videos in real time, from arbitrary viewpoints and with visual quality comparable to the state of the art.

Andrew Gilbert, Matt Trumble, Adrian Hilton, John Collomosse (2018)Inpainting of Wide-baseline Multiple Viewpoint Video, In: IEEE Transactions on Visualization and Computer Graphics Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TVCG.2018.2889297

We describe a non-parametric algorithm for multiple-viewpoint video inpainting. Uniquely, our algorithm addresses the domain of wide baseline multiple-viewpoint video (MVV) with no temporal look-ahead at near real-time speed. A Dictionary of Patches (DoP) is built using multi-resolution texture patches reprojected from geometric proxies available in the alternate views. We dynamically update the DoP over time, and a Markov Random Field optimisation over depth and appearance is used to resolve and align a selection of multiple candidates for a given patch; this ensures the inpainting of large regions in a plausible manner, conserving both spatial and temporal coherence. We demonstrate the removal of large objects (e.g. people) on challenging indoor and outdoor MVV exhibiting cluttered, dynamic backgrounds and moving cameras.

Moacir Antonelli Ponti, Leonardo Sampaio Ferraz Ribeiro, Tiago Santana Nazare, Tu Bui, John Collomosse (2018)Everything You Wanted to Know about Deep Learning for Computer Vision but Were Afraid to Ask, In: Proceedings of Sibgrapi 2017pp. 17-41 IEEE

DOI: 10.1109/SIBGRAPI-T.2017.12

Deep Learning methods are currently the state-of-the-art in many Computer Vision and Image Processing problems, in particular image classification. After years of intensive investigation, a few models matured and became important tools, including Convolutional Neural Networks (CNNs), Siamese and Triplet Networks, Auto-Encoders (AEs) and Generative Adversarial Networks (GANs). The field is fast-paced and there is a lot of terminology to catch up on for those who want to venture into Deep Learning waters. This paper aims to introduce the most fundamental concepts of Deep Learning for Computer Vision, in particular CNNs, AEs and GANs, including architectures, inner workings and optimization. We offer an updated description of the theoretical and practical knowledge of working with those models. After that, we describe Siamese and Triplet Networks, not often covered in tutorial papers, as well as review the literature on recent and exciting topics such as visual stylization, pixel-wise prediction and video processing. Finally, we discuss the limitations of Deep Learning for Computer Vision.

Tu Bui, John Collomosse (2016)Scalable sketch-based image retrieval using color gradient features, In: Proceedings of the IEEE International Conference on Computer Vision Workshops IEEE

DOI: 10.1109/ICCVW.2015.133

We present a scalable system for sketch-based image retrieval (SBIR), extending the state of the art Gradient Field HoG (GF-HoG) retrieval framework through two technical contributions. First, we extend GF-HoG to enable color-shape retrieval and comprehensively evaluate several early- and late-fusion approaches for integrating the modality of color, considering both the accuracy and speed of sketch retrieval. Second, we propose an efficient inverse-index representation for GF-HoG that delivers scalable search with interactive query times over millions of images. A mobile app demo accompanies this paper (Android).
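The inverse-index idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes each image has already been reduced to a set of quantised descriptor ("visual word") ids; the function names, the toy database and the simple vote-count scoring are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def build_inverted_index(image_words):
    """Map each visual word id to the set of images that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words, top_k=3):
    """Rank images by how many distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        # Only images listed under a query word are ever visited.
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Sort by descending votes, then by image id for a stable order.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical toy database: image id -> quantised descriptor (word) ids.
database = {
    "img_a": [1, 4, 7],
    "img_b": [2, 4, 9],
    "img_c": [1, 4, 9],
}
index = build_inverted_index(database)
results = search(index, [1, 4, 9])
```

Because a query touches only the posting lists of its own words, lookup cost depends on the length of those lists rather than on the size of the whole collection, which is the general reason such indexes scale well.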

JP Collomosse, D Rowntree, PM Hall (2005)Rendering cartoon-style motion cues in post-production video, In: GRAPHICAL MODELS67(6)pp. 549-564

DOI: 10.1016/j.gmod.2004.12.002

Kary Ho, Andrew Gilbert, Hailin Jin, John Collomosse (2021)Neural Architecture Search for Deep Image Prior, In: Computers & graphics98pp. 188-196 Elsevier Ltd

DOI: 10.1016/j.cag.2021.05.013

•Representation and method for evolutionary neural architecture search of encoder-decoder architectures for Deep Image Prior.
•Leveraging a state-of-the-art perceptual metric to guide the optimization.
•State of the art DIP results for inpainting, denoising, up-scaling, beating the hand-optimized DIP architectures proposed.
•Demonstrated the content-style dependency of DIP architectures.

We present a neural architecture search (NAS) technique to enhance image denoising, inpainting, and super-resolution tasks under the recently proposed Deep Image Prior (DIP). We show that evolutionary search can automatically optimize the encoder-decoder (E-D) structure and meta-parameters of the DIP network, which serves as a content-specific prior to regularize these single image restoration tasks. Our binary representation encodes the design space for an asymmetric E-D network that typically converges to yield a content-specific DIP within 10-20 generations using a population size of 500. The optimized architectures consistently improve upon the visual quality of classical DIP for a diverse range of photographic and artistic content.

JP Collomosse, D Rowntree, PM Hall (2003)Cartoon-style Rendering of Motion from Video, In: Proceedings Video, Vision and Graphics (VVG)pp. 117-124

Andrew Gilbert, John Collomosse, H Jin, B Price (2018)Disentangling Structure and Aesthetics for Style-aware Image Completion, In: CVPR Proceedings

Content-aware image completion or in-painting is a fundamental tool for the correction of defects or removal of objects in images. We propose a non-parametric in-painting algorithm that enforces both structural and aesthetic (style) consistency within the resulting image. Our contributions are two-fold: 1) we explicitly disentangle image structure and style during patch search and selection to ensure a visually consistent look and feel within the target image. 2) we perform adaptive stylization of patches to conform the aesthetics of selected patches to the target image, so harmonizing the integration of selected patches into the final composition. We show that explicit consideration of visual style during in-painting delivers excellent qualitative and quantitative results across the varied image styles and content, over the Places2 scene photographic dataset and a challenging new in-painting dataset of artwork derived from BAM!

JP Collomosse, PM Hall (2006)Video motion analysis for the synthesis of dynamic cues and Futurist art, In: GRAPHICAL MODELS68(5-6)pp. 402-414 ACADEMIC PRESS INC ELSEVIER SCIENCE

DOI: 10.1016/j.gmod.2006.05.003

Dipu Manandhar, Dan Ruta, John Collomosse (2020)Learning Structural Similarity of User Interface Layouts Using Graph Networks, In: Computer Vision – ECCV 2020pp. 730-746 Springer International Publishing

DOI: 10.1007/978-3-030-58542-6_44

We propose a novel representation learning technique for measuring the similarity of user interface designs. A triplet network is used to learn a search embedding for layout similarity, with a hybrid encoder-decoder backbone comprising a graph convolutional network (GCN) and convolutional decoder (CNN). The properties of interface components and their spatial relationships are encoded via a graph which also models the containment (nesting) relationships of interface components. We supervise the training of a dual reconstruction and pair-wise loss using an auxiliary measure of layout similarity based on intersection over union (IoU) distance. The resulting embedding is shown to exceed state of the art performance for visual search of user interface layouts over the public Rico dataset, and an auto-annotated dataset of interface layouts collected from the web. We release the code and dataset (https://github.com/dips4717/gcn-cnn).
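For readers unfamiliar with intersection over union, the quantity underlying the auxiliary similarity measure can be sketched for a single pair of axis-aligned boxes. This is only the basic box-level definition, assuming boxes given as (x0, y0, x1, y1); how the paper aggregates it over whole layouts is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    # Corners of the overlapping rectangle (may be empty).
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_distance(box_a, box_b):
    """A distance in [0, 1]: 0 for identical boxes, 1 for disjoint ones."""
    return 1.0 - iou(box_a, box_b)
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving an IoU of 1/7.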

Simon Jenni, Alexander Black, John Collomosse Audio-Visual Contrastive Learning with Temporal Self-Supervision, In: arXiv.org

DOI: 10.48550/arxiv.2302.07702

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images that capture the static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between videos and their temporally corresponding audio clips. We verify our model design in extensive ablation experiments and evaluate the video and audio representations in transfer experiments to action recognition and retrieval on UCF101 and HMDB51, audio classification on ESC50, and robust video fingerprinting on VGG-Sound, with state-of-the-art results.

Louise V. Coutts, David Plans, Alan W. Brown, John Collomosse (2020)Deep learning with wearable based heart rate variability for prediction of mental and general health, In: Journal of biomedical informatics112103610pp. 103610-103610 Elsevier Inc

DOI: 10.1016/j.jbi.2020.103610

The ubiquity and commoditisation of wearable biosensors (fitness bands) has led to a deluge of personal healthcare data, but with limited analytics typically fed back to the user. The feasibility of feeding back more complex, seemingly unrelated measures to users was investigated, by assessing whether increased levels of stress, anxiety and depression (factors known to affect cardiac function) and general health measures could be accurately predicted using heart rate variability (HRV) data from wrist wearables alone. Levels of stress, anxiety, depression and general health were evaluated from subjective questionnaires completed on a weekly or twice-weekly basis by 652 participants. These scores were then converted into binary levels (either above or below a set threshold) for each health measure and used as tags to train Deep Neural Networks (LSTMs) to classify each health measure using HRV data alone. Three data input types were investigated: time domain, frequency domain and typical HRV measures. For mental health measures, classification accuracies of up to 83% and 73% were achieved, with five and two minute HRV data streams respectively, showing improved predictive capability and potential future wearable use for tracking stress and well-being.

•A novel study further exploiting data collected from wearable biosensors.
•Heart rate variability (HRV) was recorded using wearables in 652 participants.
•Long Short Term Memory networks link HRV to mental & general health.

JP Collomosse, PM Hall (2006)Salience-adaptive painterly rendering using genetic search, In: INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS15(4)pp. 551-575 WORLD SCIENTIFIC PUBL CO PTE LTD

DOI: 10.1142/S0218213006002813

JP Collomosse, PM Hall (2005)Video Paintbox: The fine art of video painting, In: COMPUTERS & GRAPHICS-UK29(6)pp. 862-870 PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/j.cag.2005.09.003

F. Schweiger, G. Thomas, A. Sheikh, W. Paier, M. Kettern, P. Eisert, J.-S Franco, M. Volino, P. Huang, J. Collomosse, A. Hilton, V. Jantet, P. Smyth (2015)RE@CT: A new production pipeline for interactive 3D content, In: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)pp. 1-4 IEEE

DOI: 10.1109/ICMEW.2015.7169830

The RE@CT project set out to revolutionise the production of realistic 3D characters for game-like applications and interactive video productions, and significantly reduce costs by developing an automated process to extract and represent animated characters from actor performance captured in a multi-camera studio. The key innovation is the development of methods for analysis and representation of 3D video to allow reuse for real-time interactive animation. This enables efficient authoring of interactive characters with video quality appearance and motion.

John Collomosse, Tu Bui, M Wilber, C Fang, H Jin (2017)Sketching with Style: Visual Search with Sketches and Aesthetic Context, In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017 IEEE

DOI: 10.1109/ICCV.2017.290

We propose a novel measure of visual similarity for image retrieval that incorporates both structural and aesthetic (style) constraints. Our algorithm accepts a query as sketched shape, and a set of one or more contextual images specifying the desired visual aesthetic. A triplet network is used to learn a feature embedding capable of measuring style similarity independent of structure, delivering significant gains over previous networks for style discrimination. We incorporate this model within a hierarchical triplet network to unify and learn a joint space from two discriminatively trained streams for style and structure. We demonstrate that this space enables, for the first time, style-constrained sketch search over a diverse domain of digital artwork comprising graphics, paintings and drawings. We also briefly explore alternative query modalities.

Charles Malleson, Marco Volino, Andrew Gilbert, Matthew Trumble, John Collomosse, Adrian Hilton (2017)Real-time Full-Body Motion Capture from Video and IMUs, In: 3DV 2017 Proceedings CPS

Leo Sampaio Ferraz Ribeiro, Tu Bui, John Collomosse, Moacir Ponti (2020)Sketchformer: Transformer-Based Representation for Sketched Structure, In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)pp. 14141-14150

DOI: 10.1109/CVPR42600.2020.01416

Sketchformer is a novel transformer-based representation for encoding free-hand sketches input in a vector form, i.e. as a sequence of strokes. Sketchformer effectively addresses multiple tasks: sketch classification, sketch based image retrieval (SBIR), and the reconstruction and interpolation of sketches. We report several variants exploring continuous and tokenized input representations, and contrast their performance. Our learned embedding, driven by a dictionary learning tokenization scheme, yields state of the art performance in classification and image retrieval tasks, when compared against baseline representations driven by LSTM sequence to sequence architectures: SketchRNN and derivatives. We show that sketch reconstruction and interpolation are improved significantly by the Sketchformer embedding for complex sketches with longer stroke sequences.

JP Collomosse, PM Hall (2003)Cubist style rendering from photographs, In: IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS9(4)pp. 443-453 IEEE COMPUTER SOC

DOI: 10.1109/TVCG.2003.1260739

Tu Bui, Daniel Cooper, John Collomosse, Mark Bell, Alex Green, John Sheridan, Jez Higgins, Arindra Das, Jared Robert Keller, Olivier Thereaux (2020)Tamper-Proofing Video with Hierarchical Attention Autoencoder Hashing on Blockchain, In: IEEE Transactions on Multimedia Institute of Electrical and Electronics Engineers

DOI: 10.1109/TMM.2020.2967640

We present ARCHANGEL; a novel distributed ledger based system for assuring the long-term integrity of digital video archives. First, we introduce a novel deep network architecture using a hierarchical attention autoencoder (HAAE) to compute temporal content hashes (TCHs) from minutes or hour-long audio-visual streams. Our TCHs are sensitive to accidental or malicious content modification (tampering). The focus of our self-supervised HAAE is to guard against content modification such as frame truncation or corruption but ensure invariance against format shift (i.e. codec change). This is necessary due to the curatorial requirement for archives to format shift video over time to ensure future accessibility. Second, we describe how the TCHs (and the models used to derive them) are secured via a proof-of-authority blockchain distributed across multiple independent archives. We report on the efficacy of ARCHANGEL within the context of a trial deployment in which the national government archives of the United Kingdom, United States of America, Estonia, Australia and Norway participated.

Alexander Black, Tu Bui, Long Mai, Hailin Jin, John Philip Collomosse (2021)Compositional Sketch Search

We present an algorithm for searching image collections using free-hand sketches that describe the appearance and relative positions of multiple objects. Sketch based image retrieval (SBIR) methods predominantly match queries containing a single, dominant object invariant to its position within an image. Our work exploits drawings as a concise and intuitive representation for specifying entire scene compositions. We train a convolutional neural network (CNN) to encode masked visual features from sketched objects, pooling these into a spatial descriptor encoding the spatial relationships and appearances of objects in the composition. Training the CNN backbone as a Siamese network under triplet loss yields a metric search embedding for measuring compositional similarity which may be efficiently leveraged for visual search by applying product quantization.
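The triplet loss referred to in this abstract can be written down in a few lines. The sketch below is the standard margin-based formulation over plain Python vectors, with a hypothetical margin value; the embeddings themselves would come from a trained network, which is not modelled here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings for illustration only.
anchor = (0.0, 0.0)
near = (0.0, 1.0)   # distance 1 from the anchor
far = (3.0, 4.0)    # distance 5 from the anchor

satisfied = triplet_loss(anchor, near, far)   # positive already closer
violated = triplet_loss(anchor, far, near)    # roles swapped
```

Minimising this loss over many triplets pulls matching items together and pushes non-matching items apart, which is what makes distances in the resulting embedding usable for ranking search results.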

Alexander Black, Tu Bui, Hailin Jin, Vishy Swaminathan, John Philip Collomosse (2021)Deep Image Comparator: Learning to Visualize Editorial Change

DOI: 10.1109/CVPRW53098.2021.00108

We present a novel architecture for comparing a pair of images to identify image regions that have been subjected to editorial manipulation. We first describe a robust near-duplicate search, for matching a potentially manipulated image circulating online to an image within a trusted database of originals. We then describe a novel architecture for comparing that image pair, to localize regions that have been manipulated to differ from the retrieved original. The localization ignores discrepancies due to benign image transformations that commonly occur during online redistribution. These include artifacts due to noise and recompression degradation, as well as out-of-place transformations due to image padding, warping, and changes in size and shape. Robustness towards out-of-place transformations is achieved via the end-to-end training of a differentiable warping module within the comparator architecture. We demonstrate effective retrieval and comparison of benign transformed and manipulated images, over a dataset of millions of photographs.

John Collomosse, G McNeill, Y Qian (2009)Storyboard sketches for content based video retrieval, In: Proceedings of Intl. Conf. Computer Vision (ICCV)pp. 245-252

DOI: 10.1109/ICCV.2009.5459258

We present a novel Content Based Video Retrieval (CBVR) system, driven by free-hand sketch queries depicting both objects and their movement (via dynamic cues; streak-lines and arrows). Our main contribution is a probabilistic model of video clips (based on Linear Dynamical Systems), leading to an algorithm for matching descriptions of sketched objects to video. We demonstrate our model fitting to clips under static and moving camera conditions, exhibiting linear and oscillatory motion. We evaluate retrieval on two real video data sets, and on a video data set exhibiting controlled variation in shape, color, motion and clutter.

Tu Bui, Leonardo Ribeiro, Moacir Ponti, John Collomosse (2018)Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression, In: Computers & Graphics71pp. 77-87 Elsevier

DOI: 10.1016/j.cag.2017.12.006

We propose and evaluate several deep network architectures for measuring the similarity between sketches and photographs, within the context of the sketch based image retrieval (SBIR) task. We study the ability of our networks to generalize across diverse object categories from limited training data, and explore in detail strategies for weight sharing, pre-processing, data augmentation and dimensionality reduction. In addition to a detailed comparative study of network configurations, we contribute by describing a hybrid multi-stage training network that exploits both contrastive and triplet networks to exceed state of the art performance on several SBIR benchmarks by a significant margin.

J Collomosse, T Kindberg (2007)Screen Codes: Efficient Data Transfer from Video Displays to Mobile Devices, In: Proceedings of 3rd European Conf. on Visual Media Production (CVMP) IEEE

We present ‘Screen codes’ - a space- and time-efficient, aesthetically compelling method for transferring data from a display to a camera-equipped mobile device. Screen codes encode data as a grid of luminosity fluctuations within an arbitrary image, displayed on the video screen and decoded on a mobile device. These ‘twinkling’ images are a form of ‘visual hyperlink’, by which users can move dynamically generated content to and from their mobile devices. They help bridge the ‘content divide’ between mobile and fixed computing.

Tu Bui, Ning Yu, John Collomosse (2022)RepMix: Representation Mixing for Robust Attribution of Synthesized Images, In: S Avidan, G Brostow, M Cisse, G M Farinella, T Hassner (eds.), COMPUTER VISION - ECCV 2022, PT XIV13674pp. 146-163 Springer Nature

DOI: 10.1007/978-3-031-19781-9_9

Rapid advances in Generative Adversarial Networks (GANs) raise new challenges for image attribution; detecting whether an image is synthetic and, if so, determining which GAN architecture created it. Uniquely, we present a solution to this task capable of 1) matching images invariant to their semantic content; 2) robust to benign transformations (changes in quality, resolution, shape, etc.) commonly encountered as images are re-shared online. In order to formalize our research, a challenging benchmark, Attribution88, is collected for robust and practical image attribution. We then propose RepMix, our GAN fingerprinting technique based on representation mixing and a novel loss. We validate its capability of tracing the provenance of GAN-generated images invariant to the semantic content of the image and also robust to perturbations. We show our approach improves significantly on existing GAN fingerprinting works on both semantic generalization and robustness. Data and code are available at https://github.com/TuBui/image attribution.

Burkhard Schafer, John Philip Collomosse, Glenn Charles Parry, Christopher Elsden (2023)DECaDE Contribution for DCMS Call for Evidence on NFTs

This submission is made on behalf of DECaDE, the UKRI Centre for the Decentralised Digital Economy, and Creative Informatics, the Research and Development accelerator for the creative industries.

PM Hall, JP Collomosse, Y-Z Song, P Shen, C Li (2007)RTcams: A new perspective on nonphotorealistic rendering from photographs, In: IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS13(5)pp. 966-979 IEEE COMPUTER SOC

DOI: 10.1109/TVCG.2007.1047

Reede Ren, John Collomosse, Joemon Jose (2011)A BOVW Based Query Generative Model, In: Proc. International Conference on MultiMedia Modelling (MMM), Lecture Notes in Computer Science (LNCS)6523pp. 118-128

Bag-of-visual words (BOVW) is a local feature based framework for content-based image and video retrieval. Its performance relies on the discriminative power of visual vocabulary, i.e. the cluster set on local features. However, the optimisation of visual vocabulary is of high complexity in a large collection. This paper aims to relax such a dependence by adapting the query generative model to BOVW based retrieval. Local features are directly projected onto latent content topics to create effective visual queries; visual word distributions are learnt around local features to estimate the contribution of a visual word to a query topic; the relevance is justified by considering concept distributions on visual words as well as on local features. Extensive experiments are carried out on the TRECVid 2009 collection. The notable improvement in retrieval performance shows that this probabilistic framework alleviates the problem of visual ambiguity and is able to afford visual vocabulary with relatively low discriminative power.

Kar Balan, Alex Black, Simon Jenni, Andrew Gilbert, Andy Parsons, John Collomosse DECORAIT -- DECentralized Opt-in/out Registry for AI Training

DOI: 10.48550/arxiv.2309.14400

We present DECORAIT; a decentralized registry through which content creators may assert their right to opt in or out of AI training as well as receive reward for their contributions. Generative AI (GenAI) enables images to be synthesized using AI models trained on vast amounts of data scraped from public sources. Model and content creators who may wish to share their work openly without sanctioning its use for training are thus presented with a data governance challenge. Further, establishing the provenance of GenAI training data is important to creatives to ensure fair recognition and reward for such use. We report a prototype of DECORAIT, which explores hierarchical clustering and a combination of on/off-chain storage to create a scalable decentralized registry to trace the provenance of GenAI training data in order to determine training consent and reward creatives who contribute that data. DECORAIT combines distributed ledger technology (DLT) with visual fingerprinting, leveraging the emerging C2PA (Coalition for Content Provenance and Authenticity) standard to create a secure, open registry through which creatives may express consent and data ownership for GenAI.

J Collomosse, T Kindberg (2008)Screen Codes: Visual Hyperlinks for Displays, In: M Spasojevic, MD Corner (eds.), Proceedings of the 9th Workshop on Mobile Computing Systems and Applications

DOI: 10.1145/1411759.1411782

We present 'Screen codes' - a space- and time-efficient, aesthetically compelling method for transferring data from a display to a camera-equipped mobile device. Screen codes encode data as a grid of luminosity fluctuations within an arbitrary image, displayed on the video screen and decoded on a mobile device. These 'twinkling' images are a form of 'visual hyperlink', by which users can move dynamically generated content to and from their mobile devices. They help bridge the 'content divide' between mobile and fixed computing.

Haotian Zhang, Long Mai, Hailin Jin, Zhaowen Wang, Ning Xu, John Collomosse (2019)An Internal Learning Approach to Video Inpainting, In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV)pp. 2720-2729 IEEE

DOI: 10.1109/ICCV.2019.00281

We propose a novel video inpainting algorithm that simultaneously hallucinates missing appearance and motion (optical flow) information, building upon the recent 'Deep Image Prior' (DIP) that exploits convolutional network architectures to enforce plausible texture in static images. In extending DIP to video we make two important contributions. First, we show that coherent video inpainting is possible without a priori training. We take a generative approach to inpainting based on internal (within-video) learning without reliance upon an external corpus of visual data to train a one-size-fits-all model for the large space of general videos. Second, we show that such a framework can jointly generate both appearance and flow, whilst exploiting these complementary modalities to ensure mutual consistency. We show that leveraging appearance statistics specific to each video achieves visually plausible results whilst handling the challenging problem of long-term consistency.

Tong He, John Collomosse, Hailin Jin, Stefano Soatto (2020)Geo-PIFu: Geometry and Pixel Aligned Implicit Functions for Single-view Human Reconstruction, In: arXiv.org Cornell University Library, arXiv.org

DOI: 10.48550/arxiv.2006.08072

We propose Geo-PIFu, a method to recover a 3D mesh from a monocular color image of a clothed person. Our method is based on a deep implicit function-based representation to learn latent voxel features using a structure-aware 3D U-Net, to constrain the model in two ways: first, to resolve feature ambiguities in query point encoding, second, to serve as a coarse human shape proxy to regularize the high-resolution mesh and encourage global shape regularity. We show that, by both encoding query points and constraining global shape using latent voxel features, the reconstruction we obtain for clothed human meshes exhibits less shape distortion and improved surface details compared to competing methods. We evaluate Geo-PIFu on a recent human mesh public dataset that is $10 \times$ larger than the private commercial dataset used in PIFu and previous derivative work. On average, we exceed the state of the art by $42.7\%$ reduction in Chamfer and Point-to-Surface Distances, and $19.4\%$ reduction in normal estimation errors.
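As background to the Chamfer distance reported above, here is a minimal brute-force sketch of one common symmetric variant (mean nearest-neighbour Euclidean distance in both directions). Definitions differ between papers (squared versus unsquared distances, sums versus means), so this is an illustrative assumption and not necessarily the exact metric used in the evaluation.

```python
import math


def _mean_nearest(src, dst):
    """Average distance from each point in `src` to its nearest point in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets."""
    return _mean_nearest(points_a, points_b) + _mean_nearest(points_b, points_a)


# Hypothetical toy point sets sampled from two surfaces.
surface_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
surface_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(surface_a, surface_b)
```

The brute-force search is quadratic in the number of points; practical evaluations over dense meshes would normally use a spatial index such as a k-d tree for the nearest-neighbour queries.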

Tong He, John Collomosse, Hailin Jin, Stefano Soatto (2021)DeepVoxels++: Enhancing the Fidelity of Novel View Synthesis from 3D Voxel Embeddings, In: Computer Vision – ACCV 2020pp. 244-260 Springer International Publishing

DOI: 10.1007/978-3-030-69525-5_15

We present a novel view synthesis method based upon latent voxel embeddings of an object, which encode both shape and appearance information and are learned without explicit 3D occupancy supervision. Our method uses an encoder-decoder architecture to learn such deep volumetric representations from a set of images taken at multiple viewpoints. Compared with DeepVoxels, our DeepVoxels++ applies a series of enhancements: a) a patch-based image feature extraction and neural rendering scheme that learns local shape and texture patterns, and enables neural rendering at high resolution; b) learned view-dependent feature transformation kernels to explicitly model perspective transformations induced by viewpoint changes; c) a recurrent-concurrent aggregation technique to alleviate single-view update bias of the voxel embeddings recurrent learning process. Combined with d) a simple yet effective implementation trick of frustum representation sufficient sampling, we achieve improved visual quality over the prior deep voxel-based methods (33% SSIM error reduction and 22% PSNR improvement) on 360° novel-view synthesis benchmarks.

Leo Sampaio Ferraz Ribeiro, Tu Bui, John Philip Collomosse, Moacir Ponti (2021)Scene Designer: a Unified Model for Scene Search and Synthesis from Sketch

T Wang, J Guillemaut, J Collomosse (2010)Multi-label Propagation for Coherent Video Segmentation and Artistic Stylization, In: Proceedings of Intl. Conf. on Image Proc. (ICIP)pp. 3005-3008

DOI: 10.1109/ICIP.2010.5649291

We present a new algorithm for segmenting video frames into temporally stable colored regions, applying our technique to create artistic stylizations (e.g. cartoons and paintings) from real video sequences. Our approach is based on a multilabel graph cut applied to successive frames, in which the color data term and label priors are incrementally updated and propagated over time. We demonstrate coherent segmentation and stylization over a variety of home videos.

Charles Malleson, John Collomosse, Adrian Hilton (2019)Real-Time Multi-person Motion Capture from Multi-view Video and IMUs., In: International Journal of Computer Vision Springer

DOI: 10.1007/s11263-019-01270-5

A real-time motion capture system is presented which uses input from multiple standard video cameras and inertial measurement units (IMUs). The system is able to track multiple people simultaneously and requires no optical markers, specialized infra-red cameras or foreground/background segmentation, making it applicable to general indoor and outdoor scenarios with dynamic backgrounds and lighting. To overcome limitations of prior video or IMU-only approaches, we propose to use flexible combinations of multiple-view, calibrated video and IMU input along with a pose prior in an online optimization-based framework, which allows the full 6-DoF motion to be recovered including axial rotation of limbs and drift-free global position. A method for sorting and assigning raw input 2D keypoint detections into corresponding subjects is presented which facilitates multi-person tracking and rejection of any bystanders in the scene. The approach is evaluated on data from several indoor and outdoor capture environments with one or more subjects and the trade-off between input sparsity and tracking performance is discussed. State-of-the-art pose estimation performance is obtained on the Total Capture (multi-view video and IMU) and Human 3.6M (multi-view video) datasets. Finally, a live demonstrator for the approach is presented showing real-time capture, solving and character animation using a light-weight, commodity hardware setup.

Dan Casas, Marco Volino, John Collomosse, Adrian Hilton (2014)4D video textures for interactive character appearance, In: Computer graphics forum33(2)pp. 371-380 Wiley

DOI: 10.1111/cgf.12296

4D Video Textures (4DVT) introduce a novel representation for rendering video-realistic interactive character animation from a database of 4D actor performance captured in a multiple camera studio. 4D performance capture reconstructs dynamic shape and appearance over time but is limited to free-viewpoint video replay of the same motion. Interactive animation from 4D performance capture has so far been limited to surface shape only. 4DVT is the final piece in the puzzle enabling video-realistic interactive animation through two contributions: a layered view-dependent texture map representation which supports efficient storage, transmission and rendering from multiple view video capture; and a rendering approach that combines multiple 4DVT sequences in a parametric motion space, maintaining video quality rendering of dynamic surface appearance whilst allowing high-level interactive control of character motion and viewpoint. 4DVT is demonstrated for multiple characters and evaluated both quantitatively and through a user-study which confirms that the visual quality of captured video is maintained. The 4DVT representation achieves >90% reduction in size and halves the rendering cost.

Thomas Gittings, Steve Schneider, John Collomosse (2019)Robust Synthesis of Adversarial Visual Examples Using a Deep Image Prior

We present a novel method for generating robust adversarial image examples building upon the recent 'deep image prior' (DIP) that exploits convolutional network architectures to enforce plausible texture in image synthesis. Adversarial images are commonly generated by perturbing images to introduce high frequency noise that induces image misclassification, but that is fragile to subsequent digital manipulation of the image. We show that using DIP to reconstruct an image under adversarial constraint induces perturbations that are more robust to affine deformation, whilst remaining visually imperceptible. Furthermore, we show that our DIP approach can also be adapted to produce local adversarial patches ('adversarial stickers'). We demonstrate robust adversarial examples over a broad gamut of images and object classes drawn from the ImageNet dataset.

T Wang, J Collomosse, D Slatter, P Cheatle, D Greig (2010)Video Stylization for Digital Ambient Displays of Home Movies, In: Proceedings ACM 4th Intl. Symposium on Non-photorealistic Animation and Rendering (NPAR)pp. 137-146

DOI: 10.1145/1809939.1809955

Falling hardware costs have prompted an explosion in casual video capture by domestic users. Yet, this video is infrequently accessed post-capture and often lies dormant on users’ PCs. We present a system to breathe life into home video repositories, drawing upon artistic stylization to create a “Digital Ambient Display” that automatically selects, stylizes and transitions between videos in a semantically meaningful sequence. We present a novel algorithm based on multi-label graph cut for segmenting video into temporally coherent region maps. These maps are used to both stylize video into cartoons and paintings, and measure visual similarity between frames for smooth sequence transitions. We demonstrate coherent segmentation and stylization over a variety of home videos.

C Malleson, J Collomosse (2013)Virtual Volumetric Graphics on Commodity Displays using 3D Viewer Tracking, In: International Journal of Computer Vision (IJCV)101(3)pp. 519-532 Springer

DOI: 10.1007/s11263-012-0533-8

Three dimensional (3D) displays typically rely on stereo disparity, requiring specialized hardware to be worn or embedded in the display. We present a novel 3D graphics display system for volumetric scene visualization using only standard 2D display hardware and a pair of calibrated web cameras. Our computer vision-based system requires no worn or other special hardware. Rather than producing the depth illusion through disparity, we deliver a full volumetric 3D visualization - enabling users to interactively explore 3D scenes by varying their viewing position and angle according to the tracked 3D position of their face and eyes. We incorporate a novel wand-based calibration that allows the cameras to be placed at arbitrary positions and orientations relative to the display. The resulting system operates at real-time speeds (~25 fps) with low latency (120-225 ms) delivering a compelling natural user interface and immersive experience for 3D viewing. In addition to objective evaluation of display stability and responsiveness, we report on user trials comparing users' timings on a spatial orientation task.

JP Collomosse, T Kindberg (2017)Content Encoder and Decoder and Methods of Encoding and Decoding Content

A content encoder for encoding content in a source image for display on a display device, the content encoder comprising: inputs for receiving data representing content to be encoded in the source image; a processor arranged to encode content as a time varying two-dimensional pattern of luminosity modulations within the source image to form an encoded image; outputs arranged to output the encoded image to the display device.

JP Collomosse, T Kindberg (2017)Method of Generating a Sequence of Display Frames For Display on a Display Device

DOI: US20090028448

Matthew Trumble, Andrew Gilbert, Adrian Hilton, John Collomosse (2018)Deep Autoencoder for Combined Human Pose Estimation and Body Model Upscaling, In: Proceedings of ECCV 2018: European Conference on Computer Vision Springer Science+Business Media

We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4X, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.

JP Collomosse, D Rowntree, PM Hall (2005)Stroke surfaces: Temporally coherent artistic animations from video, In: IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS11(5)pp. 540-549 IEEE COMPUTER SOC

DOI: 10.1109/TVCG.2005.85

S James, John Collomosse (2017)Evolutionary Data Purification for Social Media Classification, In: ICPR 2016 Proceedingspp. 2676-2681

DOI: 10.1109/ICPR.2016.7900039

We present a novel algorithm for the semantic labeling of photographs shared via social media. Such imagery is diverse, exhibiting high intra-class variation that demands large training data volumes to learn representative classifiers. Unfortunately, image annotation at scale is noisy, resulting in errors in the training corpus that confound classifier accuracy. We show how evolutionary algorithms may be applied to select a ‘purified’ subset of the training corpus to optimize classifier performance. We demonstrate our approach over a variety of image descriptors (including deeply learned features) and support vector machines.

Andrew Gilbert, Marco Volino, John Collomosse, Adrian Hilton (2018)Volumetric performance capture from minimal camera viewpoints, In: V Ferrari, M Hebert, C Sminchisescu, Y Weiss (eds.), Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science11215pp. 591-607 Springer Science+Business Media

DOI: 10.1007/978-3-030-01252-6_35

We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.

Kar Balan, Shruti Agarwal, Simon Jenni, Andy Parsons, Andrew Gilbert, John Philip Collomosse (2023)EKILA: Synthetic Media Provenance and Attribution for Generative Art, In: Proceedings of the 2023 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2023) Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/CVPRW59228.2023.00098

We present EKILA; a decentralized framework that enables creatives to receive recognition and reward for their contributions to generative AI (GenAI). EKILA proposes a robust visual attribution technique and combines this with an emerging content provenance standard (C2PA) to address the problem of synthetic image provenance – determining the generative model and training data responsible for an AI-generated image. Furthermore, EKILA extends the non-fungible token (NFT) ecosystem to introduce a tokenized representation for rights, enabling a triangular relationship between the asset’s Ownership, Rights, and Attribution (ORA). Leveraging the ORA relationship enables creators to express agency over training consent and, through our attribution model, to receive apportioned credit, including royalty payments for the use of their assets in GenAI.

T. Gittings, S. Schneider, J. Collomosse (2021)Vax-a-Net: Training-Time Defence Against Adversarial Patch Attacks, In: Computer Vision – ACCV 2020pp. 235-251 Springer International Publishing

DOI: 10.1007/978-3-030-69538-5_15

We present Vax-a-Net; a technique for immunizing convolutional neural networks (CNNs) against adversarial patch attacks (APAs). APAs insert visually overt, local regions (patches) into an image to induce misclassification. We introduce a conditional Generative Adversarial Network (GAN) architecture that simultaneously learns to synthesise patches for use in APAs, whilst exploiting those attacks to adapt a pre-trained target CNN to reduce its susceptibility to them. This approach enables resilience against APAs to be conferred to pre-trained models, which would be impractical with conventional adversarial training due to the slow convergence of APA methods. We demonstrate transferability of this protection to defend against existing APAs, and show its efficacy across several contemporary CNN architectures.

Dipu Manandhar, Hailin Jin, John Collomosse (2021)Magic Layouts: Structural Prior for Component Detection in User Interface Designs, In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)pp. 15804-15813 IEEE

DOI: 10.1109/CVPR46437.2021.01555

We present Magic Layouts; a method for parsing screen-shots or hand-drawn sketches of user interface (UI) layouts. Our core contribution is to extend existing detectors to exploit a learned structural prior for UI designs, enabling robust detection of UI components: buttons, text boxes and similar. Specifically, we learn a prior over mobile UI layouts, encoding common spatial co-occurrence relationships between different UI components. Conditioning region proposals using this prior leads to performance gains on UI layout parsing for both hand-drawn UIs and app screen-shots, which we demonstrate within the context of an interactive application for rapidly acquiring digital prototypes of user experience (UX) designs.

Yue Bai, Dipu Manandhar, Zhaowen Wang, John Collomosse, Yun Fu (2023)Layout Representation Learning with Spatial and Structural Hierarchies, In: Proceedings of the ... AAAI Conference on Artificial Intelligence37(1)pp. 206-214

DOI: 10.1609/aaai.v37i1.25092

We present a novel hierarchical modeling method for layout representation learning, the core of design documents (e.g., user interface, poster, template). Existing works on layout representation often ignore element hierarchies, which are an important facet of layouts, and mainly rely on the spatial bounding boxes for feature extraction. This paper proposes a Spatial-Structural Hierarchical Auto-Encoder (SSH-AE) that learns hierarchical representation by treating a hierarchically annotated layout as a tree. On the one hand, we model SSH-AE from both spatial (semantic views) and structural (organization and relationships) perspectives, which are two complementary aspects to represent a layout. On the other hand, the semantic/geometric properties are associated at multiple resolutions/granularities, naturally handling complex layouts. Our learned representations are used for effective layout search from both spatial and structural similarity perspectives. We also introduce the tree-edit distance (TED) as an evaluation metric to construct a comprehensive evaluation protocol for layout similarity assessment, which benefits a systematic and customized layout search. We further present a new dataset of POSTER layouts which we believe will be useful for future layout research. We show that our proposed SSH-AE outperforms the existing methods, achieving state-of-the-art performance on two benchmark datasets. Code is available at github.com/yueb17/SSH-AE.

Eric Nguyen, Tu Bui, Viswanathan Swaminathan, John Philip Collomosse (2021)OSCAR-Net: Object-centric Scene Graph Attention for Image Attribution

Images tell powerful stories but cannot always be trusted. Matching images back to trusted sources (attribution) enables users to make a more informed judgment of the images they encounter online. We propose a robust image hashing algorithm to perform such matching. Our hash is sensitive to manipulation of subtle, salient visual details that can substantially change the story told by an image. Yet the hash is invariant to benign transformations (changes in quality, codecs, sizes, shapes, etc.) experienced by images during online redistribution. Our key contribution is OSCAR-Net (Object-centric Scene Graph Attention for Image Attribution Network); a robust image hashing model inspired by recent successes of Transformers in the visual domain. OSCAR-Net constructs a scene graph representation that attends to fine-grained changes of every object’s visual appearance and their spatial relationships. The network is trained via contrastive learning on a dataset of original and manipulated images, yielding a state of the art image hash for content fingerprinting that scales to millions of images.

D Vanderhaeghe, JP Collomosse (2013)Stroke based painterly rendering, In: J Collomosse, P Rosin (eds.), Image and video based artistic stylization42pp. 3-22 Springer-Verlag

R Ren, J Collomosse (2012)Visual Sentences for Pose Retrieval over Low-resolution Cross-media Dance Collections, In: IEEE Transactions on Multimedia14(6)pp. 1652-1661 IEEE

DOI: 10.1109/TMM.2012.2199971

We describe a system for matching human posture (pose) across a large cross-media archive of dance footage spanning nearly 100 years, comprising digitized photographs and videos of rehearsals and performances. This footage presents unique challenges due to its age, quality and diversity. We propose a forest-like pose representation combining visual structure (self-similarity) descriptors over multiple scales, without explicitly detecting limb positions which would be infeasible for our data. We explore two complementary multi-scale representations, applying passage retrieval and latent Dirichlet allocation (LDA) techniques inspired by the text retrieval domain, to the problem of pose matching. The result is a robust system capable of quickly searching large cross-media collections for similarity to a visually specified query pose. We evaluate over a cross-section of the UK National Research Centre for Dance’s (UK-NRCD), and the Siobhan Davies Replay’s (SDR) digital dance archives, using visual queries supplied by dance professionals. We demonstrate significant performance improvements over two baselines: classical single and multi-scale Bag of Visual Words (BoVW) and spatial pyramid kernel (SPK) matching.

P Keitler, F Pankratz, B Schwerdtfeger, D Pustka, W Rodiger, G Klinker, C Rauch, A Chathoth, J Collomosse, Y Song (2009)Mobile Augmented Reality based 3D Snapshots, In: Proc. GI-Workshop on VR/AR

PL Rosin, J Collomosse (2012)Image and Video-Based Artistic Stylisation Springer

This guide's cutting-edge coverage explains the full spectrum of NPR techniques used in photography, TV and film.

Maksym Andriushchenko, Xiaoyang Rebecca Li, Geoffrey Oxholm, Thomas Gittings, Tu Bui, Nicolas Flammarion, John Collomosse (2022)ARIA: Adversarially Robust Image Attribution for Content Provenance, In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)2022-pp. 33-43 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/CVPRW56347.2022.00013

Image attribution - matching an image back to a trusted source - is an emerging tool in the fight against online misinformation. Deep visual fingerprinting models have recently been explored for this purpose. However, they are not robust to tiny input perturbations known as adversarial examples. First we illustrate how to generate valid adversarial images that can easily cause incorrect image attribution. Then we describe an approach to prevent imperceptible adversarial attacks on deep visual fingerprinting models, via robust contrastive learning. The proposed training procedure leverages training on ℓ∞-bounded adversarial examples; it is conceptually simple and incurs only a small computational overhead. The resulting models are substantially more robust, are accurate even on unperturbed images, and perform well even over a database with millions of images. In particular, we achieve 91.6% standard and 85.1% adversarial recall under ℓ∞-bounded perturbations on manipulated images compared to 80.1% and 0.0% from prior work. We also show that robustness generalizes to other types of imperceptible perturbations unseen during training. Finally, we show how to train an adversarially robust image comparator model for detecting editorial changes in matched images. Project page: https://max-andr.github.io/robust_image_attribution.
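As a rough illustration of what an ℓ∞-bounded perturbation means, the sketch below clips every component of a candidate perturbation to the interval [-ε, ε] before adding it to an image. The function names, the pixel range [0, 1] and the budget ε = 8/255 are assumptions chosen for illustration; none of them is stated in the abstract, and this is not the paper's attack or training procedure.

```python
def project_linf(perturbation, epsilon):
    """Clip each component to [-epsilon, epsilon] (projection onto the l-inf ball)."""
    return [max(-epsilon, min(epsilon, d)) for d in perturbation]

def apply_perturbation(image, perturbation, epsilon):
    """Add an l-inf bounded perturbation and keep pixels inside [0, 1]."""
    delta = project_linf(perturbation, epsilon)
    return [min(1.0, max(0.0, x + d)) for x, d in zip(image, delta)]

# Hypothetical flattened image with pixel values in [0, 1].
image = [0.2, 0.5, 0.99]
raw_perturbation = [0.3, -0.01, 0.05]
epsilon = 8 / 255  # assumed budget, for illustration only

adversarial = apply_perturbation(image, raw_perturbation, epsilon)
```

No pixel of `adversarial` differs from the original by more than `epsilon`, which is what keeps such perturbations visually imperceptible.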

J Collomosse, P Hall (2004)A Mid-level Description of Video, with Application to Non-photorealistic Animation, In: Proceedings 15th British Machine Vision Conference (BMVC)1pp. 7-16

Tu Bui, J Collomosse (2015)Font finder: Visual recognition of typeface in printed documents, In: 2015 IEEE International Conference on Image Processing (ICIP) IEEE

DOI: 10.1109/ICIP.2015.7351541

We describe a novel algorithm for visually identifying the font used in a scanned printed document. Our algorithm requires no pre-recognition of characters in the string (i.e. optical character recognition). Gradient orientation features are collected local to the character boundaries, and quantized into a hierarchical Bag of Visual Words representation. Following stop-word analysis, classification via logistic regression (LR) of the codebooked features yields per-character probabilities which are combined across the string to decide the posterior for each font. We achieve 93.4% accuracy over a 1000 font database of scanned printed text comprising Latin characters.
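The final step, combining per-character probabilities across a string, can be illustrated with a minimal sketch. It assumes characters are conditionally independent given the font and that the prior over fonts is uniform, so the posterior is the normalized product of the per-character probabilities. That combination rule and the name `combine_char_probs` are illustrative assumptions; the abstract does not specify how the probabilities are combined.

```python
import math

def combine_char_probs(char_probs, eps=1e-12):
    """Combine per-character font probabilities into one posterior per font.

    char_probs: list of rows, one per character; each row holds one
    probability per font. Independence between characters and a uniform
    font prior are assumed (illustrative, not the paper's stated rule).
    """
    num_fonts = len(char_probs[0])
    # Sum log-probabilities rather than multiplying, to avoid underflow.
    log_post = [sum(math.log(row[f] + eps) for row in char_probs)
                for f in range(num_fonts)]
    peak = max(log_post)
    unnormalized = [math.exp(v - peak) for v in log_post]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical classifier outputs for a three-character string and three fonts.
probs = [[0.6, 0.3, 0.1],
         [0.5, 0.4, 0.1],
         [0.7, 0.2, 0.1]]
posterior = combine_char_probs(probs)
best_font = max(range(len(posterior)), key=posterior.__getitem__)
```

Working in the log domain keeps the product stable for long strings, where multiplying many small probabilities directly would underflow.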

J Kim, JP Collomosse (2014)Incremental Transfer Learning for Object Classification in Streaming Video, In: 2014 IEEE International Conference on Image Processing (ICIP)pp. 2729-2733

DOI: 10.1109/ICIP.2014.7025552

We present a new incremental learning framework for realtime object recognition in video streams. ImageNet is used to bootstrap a set of one-vs-all incrementally trainable SVMs which are updated by user annotation events during streaming. We adopt an inductive transfer learning (ITL) approach to warp the video feature space to the ImageNet feature space, so enabling the incremental updates. Uniquely, the transformation used for the ITL warp is also learned incrementally using the same update events. We demonstrate a semiautomated video logging (SAVL) system using our incrementally learned ITL approach and show this to outperform existing SAVL which uses non-incremental transfer learning.

Tu Bui, John Collomosse, Mark Bell, Alex Green, John Sheridan, Jez Higgins, Arindra Das, Jared Keller, Olivier Thereaux, Alan Brown (2019)ARCHANGEL: Tamper-Proofing Video Archives Using Temporal Content Hashes on the Blockchain, In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) Workshops Institute of Electrical and Electronics Engineers (IEEE)

We present ARCHANGEL; a novel distributed ledger based system for assuring the long-term integrity of digital video archives. First, we describe a novel deep network architecture for computing compact temporal content hashes (TCHs) from audio-visual streams with durations of minutes or hours. Our TCHs are sensitive to accidental or malicious content modification (tampering) but invariant to the codec used to encode the video. This is necessary due to the curatorial requirement for archives to format shift video over time to ensure future accessibility. Second, we describe how the TCHs (and the models used to derive them) are secured via a proof-of-authority blockchain distributed across multiple independent archives. We report on the efficacy of ARCHANGEL within the context of a trial deployment in which the national government archives of the United Kingdom, Estonia and Norway participated.

John Collomosse, Tu Bui, Hailin Jin (2019)LiveSketch: Query Perturbations for Guided Sketch-based Visual Search, In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019)pp. 2879-2887 Institute of Electrical and Electronics Engineers (IEEE)

LiveSketch is a novel algorithm for searching large image collections using hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch search by creating visual suggestions that augment the query as it is drawn, making query specification an iterative rather than one-shot process that helps disambiguate users' search intent. Our technical contributions are: a triplet convnet architecture that incorporates an RNN based variational autoencoder to search for images using vector (stroke-based) queries; real-time clustering to identify likely search intents (and so, targets within the search embedding); and the use of backpropagation from those targets to perturb the input stroke sequence, so suggesting alterations to the query in order to guide the search. We show improvements in accuracy and time-to-task over contemporary baselines using a 67M image corpus.

JP Collomosse, T Kindberg (2017)Encoder and Decoder and Methods of Encoding and Decoding Sequence Information

A method of encoding sequence information into a sequence of display frames for display on a display device, the method comprising the steps of: generating the sequence of display frames; inserting monitor flags within each display frame, each monitor flag being capable of moving between a first state and a second state; setting the state of monitor flags within each display frame to a predetermined configuration; encoding sequence information in the sequence of display frames by varying the predetermined configuration throughout the sequence of display frames such that neighbouring display frames in the sequence have different predetermined configurations.

JP Collomosse, PM Hall, F Rothlauf, J Branke, S Cagnoni, DW Corne, R Drechsler, Y Jin, P Machado, E Marchiori, J Romero, GD Smith, G Squillero (2005)Genetic paint: A search for salient paintings, In: APPLICATIONS OF EVOLUTIONARY COMPUTING, PROCEEDINGS3449pp. 437-447

DOI: 10.1007/978-3-540-32003-6_44

Matthew Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, John Collomosse (2017)Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors, In: Proceedings of 28th British Machine Vision Conferencepp. 1-13

We present an algorithm for fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data to accurately estimate 3D human pose. A 3-D convolutional neural network is used to learn a pose embedding from volumetric probabilistic visual hull data (PVH) derived from the MVV frames. We incorporate this model within a dual stream network integrating pose embeddings derived from MVV and a forward kinematic solve of the IMU data. A temporal model (LSTM) is incorporated within both streams prior to their fusion. Hybrid pose inference using these two complementary data sources is shown to resolve ambiguities within each sensor modality, yielding improved accuracy over prior methods. A further contribution of this work is a new hybrid MVV dataset (TotalCapture) comprising video, IMU and a skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.

Tu Bui, L Ribeiro, M Ponti, John Collomosse (2017)Compact Descriptors for Sketch-based Image Retrieval using a Triplet loss Convolutional Neural Network, In: Computer Vision and Image Understanding164pp. 27-37 Elsevier

DOI: 10.1016/j.cviu.2017.06.007

We present an efficient representation for sketch based image retrieval (SBIR) derived from a triplet loss convolutional neural network (CNN). We treat SBIR as a cross-domain modelling problem, in which a depiction invariant embedding of sketch and photo data is learned by regression over a siamese CNN architecture with half-shared weights and modified triplet loss function. Uniquely, we demonstrate the ability of our learned image descriptor to generalise beyond the categories of object present in our training data, forming a basis for general cross-category SBIR. We explore appropriate strategies for training, and for deriving a compact image descriptor from the learned representation suitable for indexing data on resource constrained (e.g. mobile) devices. We show the learned descriptors to outperform state of the art SBIR on the de facto standard Flickr15k dataset using a significantly more compact (56 bits per image, i.e. ≈ 105KB total) search index than previous methods.
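The quoted index size can be checked with simple arithmetic. The sketch below assumes a corpus of about 15,000 images (suggested by the dataset name, not stated in the abstract), one fixed-length code per image with no per-entry overhead, and 1 KB = 1000 bytes.

```python
def index_size_bytes(num_images, bits_per_image):
    """Size of a flat index holding one fixed-length binary code per image."""
    total_bits = num_images * bits_per_image
    return (total_bits + 7) // 8  # round up to whole bytes

# Assumed corpus size of 15,000 images; 56 bits per image is from the abstract.
size_bytes = index_size_bytes(15_000, 56)
size_kb = size_bytes / 1000
```

Under these assumptions the index occupies 105,000 bytes, which matches the ≈ 105KB figure above.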

E Syngelakis, J Collomosse (2012)A Bag of Features Approach to Ambient Fall Detection for Domestic Elder-care, In: 1st Intl. Symposium on Ambient Technologies (AMBIENT)

Falls in the home are a major source of injury for the elderly. The affordability of commodity video cameras is prompting the development of ambient intelligent environments to monitor the occurrence of falls in the home. This paper describes an automated fall detection system, capable of tracking movement and detecting falls in real-time. In particular we explore the application of the Bag of Features paradigm, frequently applied to general activity recognition in Computer Vision, to the domestic fall detection problem. We show that fall detection is feasible using such a framework, evaluating our approach in both controlled test scenarios and domestic scenarios exhibiting uncontrolled fall direction and visually cluttered environments.

Alexander Black, Simon Jenni, Tu Bui, Md. Mehrab Tanjim, Stefano Petrangeli, Ritwik Sinha, Viswanathan Swaminathan, John Collomosse VADER: Video Alignment Differencing and Retrieval, In: arXiv.org

DOI: 10.48550/arxiv.2303.13193

We propose VADER, a spatio-temporal matching, alignment, and change summarization method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over adaptively chunked video content. A transformer-based alignment module then refines the temporal localization of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between aligned content, invariant to any changes due to any residual temporal misalignments or artifacts arising from non-editorial changes of the content. Robustly matching video to a trusted source enables conclusions to be drawn on video provenance, enabling informed trust decisions on content encountered.

Dipu Manandhar, Hailin Jin, John Philip Collomosse (2021)Magic Layouts: Structural Prior for Component Detection in User Interface Designs

Matthew Trumble, Andrew Gilbert, Adrian Hilton, John Collomosse (2016)Deep Convolutional Networks for Marker-less Human Pose Estimation from Multiple Views, In: Proceedings of CVMP 2016. The 13th European Conference on Visual Media Production

DOI: 10.1145/2998559.2998565

We propose a human performance capture system employing convolutional neural networks (CNN) to estimate human pose from a volumetric representation of a performer derived from multiple view-point video (MVV). We compare direct CNN pose regression to the performance of an affine invariant pose descriptor learned by a CNN through a classification task. A non-linear manifold embedding is learned between the descriptor and articulated pose spaces, enabling regression of pose from the source MVV. The results are evaluated against ground truth pose data captured using a Vicon marker-based system and demonstrate good generalisation over a range of human poses, providing a system that requires no special suit to be worn by the performer.

T Wang, B Han, JP Collomosse (2014)Touchcut: Single-touch object segmentation driven by level set methods

In this paper, we propose an object segmentation algorithm driven by minimal user interactions. Compared to previous user-guided systems, our system can cut out the desired object in a given image with only a single finger touch minimizing user effort. The proposed model harnesses both edge and region based local information in an adaptive manner as well as geometric cues implied by the user-input to achieve fast and robust segmentation in a level set framework. We demonstrate the advantages of our method in terms of computational efficiency and accuracy comparing qualitatively and quantitatively with graph cut based techniques.

Matthew Trumble, Andrew Gilbert, Adrian Hilton, John Collomosse (2016)Learning Markerless Human Pose Estimation from Multiple Viewpoint Video, In: Computer Vision – ECCV 2016 Workshops. Lecture Notes in Computer Science9915pp. 871-878

DOI: 10.1007/978-3-319-49409-8_70

We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from an MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces, enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.

C Lord, John Collomosse (2014)StegoNPR: Information Hiding in Painterly Renderings

We present a novel steganographic technique for concealing information within an image. Uniquely we explore the practicality of hiding, and recovering, data within the pattern of brush strokes generated by a non-photorealistic rendering (NPR) algorithm. We incorporate a linear binary coding (LBC) error correction scheme over a raw data channel established through the local statistics of NPR stroke orientations. This enables us to deliver a robust channel for conveying short (e.g. 30 character) strings covertly within a painterly rendering. We evaluate over a variety of painterly renderings, parameter settings and message lengths.

Alexander Black, Tu Bui, Long Mai, Hailin Jin, John Collomosse (2021)Compositional Sketch Search, In: 2021 IEEE International Conference on Image Processing (ICIP)2021-pp. 2668-2672 IEEE

DOI: 10.1109/ICIP42928.2021.9506609

We present an algorithm for searching image collections using free-hand sketches that describe the appearance and relative positions of multiple objects. Sketch based image retrieval (SBIR) methods predominantly match queries containing a single, dominant object invariant to its position within an image. Our work exploits drawings as a concise and intuitive representation for specifying entire scene compositions. We train a convolutional neural network (CNN) to encode masked visual features from sketched objects, pooling these into a spatial descriptor encoding the spatial relationships and appearances of objects in the composition. Training the CNN backbone as a Siamese network under triplet loss yields a metric search embedding for measuring compositional similarity which may be efficiently leveraged for visual search by applying product quantization.
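As a rough illustration of the triplet loss mentioned above, the sketch below computes the standard margin-based form over plain Python vectors. The Euclidean distance, the margin value of 0.2, and the toy 2D embeddings are assumptions made for the example; the abstract does not state which distance or margin the authors used, and no network is trained here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (margin value is an assumption).

    The loss is zero once the negative is further from the anchor
    than the positive by at least the margin.
    """
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)

# Toy 2D embeddings (hypothetical): a sketch, a matching image,
# and two non-matching images at different distances.
sketch = [0.0, 0.0]
matching_image = [0.1, 0.0]
far_image = [1.0, 0.0]
near_image = [0.2, 0.0]

satisfied = triplet_loss(sketch, matching_image, far_image)
violated = triplet_loss(sketch, matching_image, near_image)
```

Here the first triplet already satisfies the margin and contributes no loss, while the second contributes a positive loss that would push the non-matching image away from the sketch during training.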

T Wang, JP Collomosse, A Hilton (2014)Wide Baseline Multi-View Video Matting using a Hybrid Markov Random Field

We describe a novel framework for segmenting a time- and view-coherent foreground matte sequence from synchronised multiple view video. We construct a Markov Random Field (MRF) comprising links between superpixels corresponded across views, and links between superpixels and their constituent pixels. Texture, colour and disparity cues are incorporated to model foreground appearance. We solve using a multi-resolution iterative approach enabling an eight view high definition (HD) frame to be processed in less than a minute. Furthermore we incorporate a temporal diffusion process introducing a prior on the MRF using information propagated from previous frames, and a facility for optional user correction. The result is a set of temporally coherent mattes that are solved for simultaneously across views for each frame, exploiting similarities across views and time.

Yifan Yang, Daniel Cooper, John Collomosse, Catalin Dragan, Mark Manulis, Jo Briggs, Jamie Steane, Arthi Manohar, Wendy Moncur, Helen Jones (2020)TAPESTRY: A De-centralized Service for Trusted Interaction Online, In: IEEE Transactions on Services Computingpp. 1-1 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/TSC.2020.2993081

We present a novel de-centralised service for proving the provenance of online digital identity, exposed as an assistive tool to help non-expert users make better decisions about whom to trust online. Our service harnesses the digital personhood (DP), the longitudinal and multi-modal signals created through users' lifelong digital interactions, as a basis for evidencing the provenance of identity. We describe how users may exchange trust evidence derived from their DP, in a granular and privacy-preserving manner, with other users in order to demonstrate coherence and longevity in their behaviour online. This is enabled through a novel secure infrastructure combining hybrid on- and off-chain storage with deep learning for DP analytics and visualization. We show how our tools enable users to make more effective decisions on whether to trust unknown third parties online, and also to spot behavioural deviations in their own social media footprints indicative of account hijacking.

Andrew Gilbert, Adrian Hilton, John Collomosse (2020)Semantic Estimation of 3D Body Shape and Pose using Minimal Cameras, In: 31st British Machine Vision Conference

DOI: 10.48550/arXiv.1908.03030

We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and we show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.

T Wang, B Han, John Collomosse (2014)TouchCut: Fast Image and Video Segmentation using Single-Touch Interaction, In: Computer Vision and Image Understanding120pp. 14-30 Elsevier

DOI: 10.1016/j.cviu.2013.10.013

We present TouchCut, a robust and efficient algorithm for segmenting image and video sequences with minimal user interaction. Our algorithm requires only a single finger touch to identify the object of interest in the image or first frame of video. Our approach is based on a level set framework, with an appearance model fusing edge, region texture and geometric information sampled local to the touched point. We first present our image segmentation solution, then extend this framework to progressive (per-frame) video segmentation, encouraging temporal coherence by incorporating motion estimation and a shape prior learned from previous frames. This new approach to visual object cut-out provides a practical solution for image and video segmentation on compact touch screen devices, facilitating spatially localized media manipulation. We describe such a case study, enabling users to selectively stylize video objects to create a hand-painted effect. We demonstrate the advantages of TouchCut by quantitatively comparing against the state of the art in terms of both accuracy and run-time performance.

C Gray, S James, J Collomosse, P Asente (2014)A Particle Filtering Approach to Salient Video Object Localizationpp. 194-198

DOI: 10.1109/ICIP.2014.7025038

We describe a novel, fully automatic algorithm for identifying salient objects in video based on their motion. Spatially coherent clusters of optical flow vectors are sampled to generate estimates of affine motion parameters local to super-pixels identified within each frame. These estimates, combined with spatial data, form coherent point distributions in a 5D solution space corresponding to objects or parts thereof. These distributions are temporally denoised using a particle filtering approach, and clustered to estimate the position and motion parameters of salient moving objects in the clip. We demonstrate localization of salient objects in a variety of clips exhibiting moving and cluttered backgrounds.
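To give a feel for temporal denoising with a particle filter, the sketch below runs a basic bootstrap filter (predict, weight, resample) over a noisy 1D position track. It is a deliberately simplified stand-in: the paper operates on point distributions in a 5D space, whereas this example uses a single dimension, a known constant velocity, Gaussian noise models, and synthetic data, all of which are assumptions made for illustration.

```python
import math
import random

def particle_filter_1d(observations, n_particles=500, velocity=1.0,
                       motion_std=0.3, obs_std=0.5, seed=0):
    """Bootstrap particle filter over a 1D position (illustrative only).

    All parameter values are assumptions chosen for the toy example.
    Returns the filtered position estimate after each observation.
    """
    rng = random.Random(seed)
    # Start with particles spread uniformly over a plausible range.
    particles = [rng.uniform(-5.0, 5.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: move each particle by the motion model plus noise.
        particles = [p + velocity + rng.gauss(0.0, motion_std)
                     for p in particles]
        # Weight: Gaussian likelihood of the observation per particle.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2)
                   for p in particles]
        # Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # Estimate: mean of the resampled particle set.
        estimates.append(sum(particles) / n_particles)
    return estimates

# Synthetic track (hypothetical data): an object moving one unit per
# frame, observed with Gaussian noise.
data_rng = random.Random(1)
truth = [float(t) for t in range(1, 21)]
observed = [x + data_rng.gauss(0.0, 0.5) for x in truth]
smoothed = particle_filter_1d(observed)
```

The filtered estimates combine each noisy observation with the motion model's prediction, so an isolated outlier in a single frame has less influence than it would on the raw track.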

JP Collomosse, F Rothlauf (2006)Supervised genetic search for parameter selection in painterly rendering, In: APPLICATIONS OF EVOLUTIONARY COMPUTING, PROCEEDINGS3907pp. 599-610

R Bowden, J Collomosse, K Mikolajczyk (2014)Guest Editorial: Tracking, Detection and Segmentation, In: INTERNATIONAL JOURNAL OF COMPUTER VISION110(1)pp. 1-1 SPRINGER

DOI: 10.1007/s11263-014-0753-1

P. Keitler, F. Pankratz, B. Schwerdtfeger, D. Pustka, W. Rodiger, G. Klinker, C. Rauch, A. Chathoth, J. Collomosse, Yi-Zhe Song (2010)Mobile augmented reality based 3D snapshots, In: Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2009)pp. 199-200 Institute of Electrical and Electronics Engineers (IEEE)

DOI: 10.1109/ISMAR.2009.5336462

We describe a mobile augmented reality application that is based on 3D snapshotting using multiple photographs. Optical square markers provide the anchor for reconstructed virtual objects in the scene. A novel approach based on pixel flow greatly improves tracking performance. This dual tracking approach also allows for a new single-button user interface metaphor for moving virtual objects in the scene. The development of the AR viewer was accompanied by user studies confirming the chosen approach.

M Shugrina, M Betke, J Collomosse (2006)Empathic Painting: Interactive stylization using observed emotional state, In: Proceedings ACM 4th Intl. Symposium on Non-photorealistic Animation and Rendering (NPAR)pp. 87-96

DOI: 10.1145/1124728.1124744

We present the "empathic painting" - an interactive painterly rendering whose appearance adapts in real time to reflect the perceived emotional state of the viewer. The empathic painting is an experiment into the feasibility of using high level control parameters (namely, emotional state) to replace the plethora of low-level constraints users must typically set to affect the output of artistic rendering algorithms. We describe a suite of Computer Vision algorithms capable of recognising users' facial expressions through the detection of facial action units derived from the FACS scheme. Action units are mapped to vectors within a continuous 2D space representing emotional state, from which we in turn derive a continuous mapping to the style parameters of a simple but fast segmentation-based painterly rendering algorithm. The result is a digital canvas capable of smoothly varying its painterly style at approximately 4 frames per second, providing a novel user interactive experience using only commodity hardware.

R Hu, J Collomosse (2013)A performance evaluation of gradient field HOG descriptor for sketch based image retrieval, In: Computer Vision and Image Understandingpp. 790-806 Elsevier

DOI: 10.1016/j.cviu.2013.02.005

We present an image retrieval system for the interactive search of photo collections using free-hand sketches depicting shape. We describe Gradient Field HOG (GF-HOG), an adapted form of the HOG descriptor suitable for Sketch Based Image Retrieval (SBIR). We incorporate GF-HOG into a Bag of Visual Words (BoVW) retrieval framework, and demonstrate how this combination may be harnessed both for robust SBIR, and for localizing sketched objects within an image. We evaluate over a large Flickr sourced dataset comprising 33 shape categories, using queries from 10 non-expert sketchers. We compare GF-HOG against state-of-the-art descriptors with common distance measures and language models for image retrieval, and explore how affine deformation of the sketch impacts search performance. GF-HOG is shown to consistently outperform SIFT, multi-resolution HOG, Self Similarity, Shape Context and Structure Tensor in retrieval. Further, we incorporate semantic keywords into our GF-HOG system to enable the use of annotated sketches for image search. A novel graph-based measure of semantic similarity is proposed and two applications explored: semantic sketch based image retrieval and a semantic photo montage.

Marco Volino, D Casas, JP Collomosse, A Hilton (2014)Optimal Representation of Multi-View Video, In: Proceedings of BMVC 2014 - British Machine Vision Conference BMVC

Multi-view video acquisition is widely used for reconstruction and free-viewpoint rendering of dynamic scenes by directly resampling from the captured images. This paper addresses the problem of optimally resampling and representing multi-view video to obtain a compact representation without loss of the view-dependent dynamic surface appearance. Spatio-temporal optimisation of the multi-view resampling is introduced to extract a coherent multi-layer texture map video. This resampling is combined with a surface-based optical flow alignment between views to correct for errors in geometric reconstruction and camera calibration which result in blurring and ghosting artefacts. The multi-view alignment and optimised resampling result in a compact representation with minimal loss of information, allowing high-quality free-viewpoint rendering. Evaluation is performed on multi-view datasets for dynamic sequences of cloth, faces and people. The representation achieves >90% compression without significant loss of visual quality.

D Casas, Marco Volino, JP Collomosse, A Hilton (2014)4D Video Textures for Interactive Character Appearance, In: Computer Graphics Forum: the international journal of the Eurographics Association

JP Collomosse, PM Hall (2002)Painterly rendering using image salience, In: 20TH EUROGRAPHICS UK CONFERENCE, PROCEEDINGSpp. 122-128

DOI: 10.1109/EGUK.2002.1011281

The contribution of this paper is a novel non-photorealistic rendering (NPR) technique, capable of producing an artificial 'hand-painted' effect on 2D images, such as photographs. Our method requires no user interaction, and makes use of image salience and gradient information to determine the implicit ordering and attributes of individual brush strokes. The benefits of our technique are complete automation, and mitigation against the loss of image detail during painting. Strokes from lower salience regions of the image do not encroach upon higher salience regions; this can occur with some existing painting methods. We describe our algorithm in detail, and illustrate its application with a gallery of images.

T Wang, JP Collomosse, A Hunter, D Greig (2014)Learnable Stroke Models for Example-based Portrait Painting

We present a novel algorithm for stylizing photographs into portrait paintings comprised of curved brush strokes. Rather than drawing upon a prescribed set of heuristics to place strokes, our system learns a flexible model of artistic style by analyzing training data from a human artist. Given a training pair (a source image and a painting of that image), a non-parametric model of style is learned by observing the geometry and tone of brush strokes local to image features. A Markov Random Field (MRF) enforces spatial coherence of style parameters. Style models local to facial features are learned using a semantic segmentation of the input face image, driven by a combination of an Active Shape Model and Graph-cut. We evaluate style transfer between a variety of training and test images, demonstrating a wide gamut of learned brush and shading styles.

S James, M Fonseca, John Collomosse (2014)ReEnact: Sketch based Choreographic Design from Archival Dance Footage, In: ACM International Conference on Multimedia Retrieval (ICMR) Association for Computing Machinery

DOI: 10.1145/2578726.2578766

We describe a novel system for synthesising video choreography using sketched visual storyboards comprising human poses (stick men) and action labels. First, we describe an algorithm for searching archival dance footage using sketched pose. We match using an implicit representation of pose parsed from a mix of challenging low and high fidelity footage. In a training pre-process we learn a mapping between a set of exemplar sketches and corresponding pose representations parsed from the video, which are generalized at query-time to enable retrieval over previously unseen frames, and over additional unseen videos. Second, we describe how a storyboard of sketched poses, interspersed with labels indicating connecting actions, may be used to drive the synthesis of novel video choreography from the archival footage. We demonstrate both our retrieval and synthesis algorithms over both low fidelity PAL footage from the UK Digital Dance Archives (DDA) repository of contemporary dance, circa 1970, and over higher-definition studio captured footage.

PM Hall, MJ Owen, JP Collomosse (2004)A Trainable Low-level Feature Detector, In: Proceedings Intl. Conference on Pattern Recognition (ICPR)1pp. 708-711

DOI: 10.1109/ICPR.2004.1334279

We introduce a trainable system that simultaneously filters and classifies low-level features into types specified by the user. The system operates over full colour images, and outputs a vector at each pixel indicating the probability that the pixel belongs to each feature type. We explain how common features such as edge, corner, and ridge can all be detected within a single framework, and how we combine these detectors using simple probability theory. We show its efficacy, using stereo-matching as an example.