Sunday 19 June - Friday 24 June 2022, 12 noon - 11pm

CVSSP at CVPR 2022

New Orleans Ernest N. Morial Convention Center
New Orleans
Louisiana, USA

The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is the premier annual computer vision event comprising the main conference and several co-located workshops and short courses. With its high quality and low cost, it provides an exceptional value for students, academics and industry researchers. CVPR 2022 will be a hybrid conference, with both in-person and virtual attendance options. 

Again this year, CVSSP is presenting a variety of papers. See details below.

Style-Based Global Appearance Flow for Virtual Try-On

By Dr Sen He, Yi-Zhe Song, Tao Xiang.
Paper | Code | Video

Image-based virtual try-on aims to fit an in-shop garment into a clothed person image. To achieve this, a key step is garment warping, which spatially aligns the target garment with the corresponding body parts in the person image. Prior methods typically adopt a local appearance flow estimation model. They are thus intrinsically susceptible to difficult body poses/occlusions and large misalignments between person and garment images. To overcome this limitation, a novel global appearance flow estimation model is proposed in this work. For the first time, a StyleGAN-based architecture is adopted for appearance flow estimation. This enables us to take advantage of a global style vector to encode whole-image context and cope with the aforementioned challenges.
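
To make the mechanism concrete, here is a minimal PyTorch-style sketch of a style-modulated flow estimator. Everything here (module names, layer sizes, the toy encoder) is an illustrative assumption rather than the authors' architecture; it only shows the core idea of predicting a dense warping flow from one global style vector and using it to warp the garment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalFlowEstimator(nn.Module):
    """Toy sketch: a whole-image style vector modulates flow prediction."""
    def __init__(self, style_dim=256):
        super().__init__()
        # Encode the concatenated person + garment images into one
        # global style vector (the StyleGAN-inspired ingredient).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_style = nn.Linear(128, style_dim)
        self.feat = nn.Conv2d(6, 128, 3, padding=1)
        self.mod = nn.Linear(style_dim, 128)   # style -> channel modulation
        self.flow_head = nn.Conv2d(128, 2, 3, padding=1)

    def forward(self, person, garment):
        x = torch.cat([person, garment], dim=1)             # (B, 6, H, W)
        style = self.to_style(self.encoder(x).flatten(1))   # (B, style_dim)
        # Modulate local features with the global style vector.
        feat = self.feat(x) * self.mod(style)[:, :, None, None]
        flow = self.flow_head(feat)                         # (B, 2, H, W)
        # Warp the garment by sampling along the predicted flow.
        B, _, H, W = garment.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        grid = (torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
                + flow.permute(0, 2, 3, 1))
        return F.grid_sample(garment, grid, align_corners=True)
```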

Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches

By Ayan Bhunia, Viswanatha Reddy Gajjala, Subhadeep Koley, Rohit Kundu, Aneeshan Sain, Tao Xiang, Yi-Zhe Song.
Paper | Project page | @2ayanbhunia @subhadeepko @AneeshanSain

This paper focuses on Few-Shot Class-Incremental Learning (FSCIL), where sketches aid the model in learning novel classes. Gradient consensus, coupled with knowledge distillation and graph attention networks, ensures robust learning without forgetting old knowledge.
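
The gradient-consensus idea can be illustrated in a few lines of PyTorch. This is a hedged sketch in the spirit of the paper, not its released code, and the function name and learning rate are assumptions: when the gradient of the new-class loss conflicts with the gradient of the old-knowledge (distillation) loss, the conflicting component is projected out before the update, so learning novel classes does not overwrite old ones.

```python
import torch

def consensus_step(params, loss_new, loss_old, lr=1e-3):
    """One update keeping new-class and old-knowledge gradients in consensus."""
    g_new = torch.autograd.grad(loss_new, params, retain_graph=True)
    g_old = torch.autograd.grad(loss_old, params, retain_graph=True)
    with torch.no_grad():
        for p, gn, go in zip(params, g_new, g_old):
            dot = (gn * go).sum()
            if dot < 0:  # conflict: drop the component opposing old knowledge
                gn = gn - dot / (go.norm() ** 2 + 1e-12) * go
            p -= lr * (gn + go)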

Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval

By Ayan Bhunia, Subhadeep Koley, Abdullah Faiz Ur Rahman Khilji, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song.
Paper | Project page | @2ayanbhunia @subhadeepko @AneeshanSain @pinaki_nc

This paper shows that accuracy for sketch-based image retrieval falls in the presence of noisy strokes in sketch queries. Accordingly, it proposes a reinforcement learning-based stroke-subset selector that filters out noisy strokes for successful retrieval, easing the common concern of "I can't sketch".
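
A minimal sketch of how such a selector could look in PyTorch is below; the shapes, names and the REINFORCE baseline are assumptions for illustration. A policy scores each stroke, a keep/drop mask is sampled, and the policy is rewarded when the retained subset retrieves the paired photo more accurately:

```python
import torch
import torch.nn as nn

class StrokeSelector(nn.Module):
    """Scores each stroke embedding and samples a keep/drop decision."""
    def __init__(self, stroke_dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(stroke_dim, 64), nn.ReLU(),
                                   nn.Linear(64, 1))

    def forward(self, strokes):                    # strokes: (N, stroke_dim)
        keep_prob = torch.sigmoid(self.score(strokes)).squeeze(-1)
        dist = torch.distributions.Bernoulli(keep_prob)
        mask = dist.sample()                       # 1 = keep, 0 = drop (noisy)
        return mask, dist.log_prob(mask).sum()

def reinforce_loss(log_prob, reward, baseline):
    # reward: retrieval score of the kept subset; exceeding the baseline
    # (e.g. the full sketch's score) means the selection helped.
    return -(reward - baseline) * log_prob
```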

Sketch3T: Test-Time Training for Zero-Shot SBIR

By Aneeshan Sain, Ayan Kumar Bhunia, Vaishnav Potlapalli, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song.
Paper | Project page | @AneeshanSain @2ayanbhunia @pinaki_nc

Models trained in the existing zero-shot sketch-based image retrieval setup struggle to understand sketches drawn from the test-time distribution. This paper therefore introduces a test-time training paradigm in which a model adapts to the test distribution using just one sketch, via a cost-free self-supervised task at inference, without forgetting old knowledge thanks to a novel meta-learning based training framework.
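
The inference-time loop might look roughly like the following sketch (assumed interfaces throughout; the paper's actual self-supervised task is sketch-specific). A copy of the encoder takes a few gradient steps on a label-free loss computed from the single test sketch before it is used for retrieval:

```python
import copy
import torch

def test_time_adapt(encoder, self_sup_loss, test_sketch, steps=5, lr=1e-4):
    """Adapt a copy of the encoder to one test sketch; no labels required."""
    adapted = copy.deepcopy(encoder)        # deployed weights stay untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        loss = self_sup_loss(adapted, test_sketch)   # self-supervised signal
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted   # embed the test sketch with this adapted encoder
```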

Partially Does It: Towards Scene-Level FG-SBIR with Partial Input

By Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Viswanatha Reddy Gajjala, Aneeshan Sain, Tao Xiang , Yi-Zhe Song.
Paper | Project page

This work highlights an important aspect of understanding scene sketches: a scene sketch may not contain all the objects in its corresponding photo, so retrieval accuracy collapses as sketches become relatively emptier, or 'partial'. To address this issue, a set-based approach using Optimal Transport is proposed to model cross-modal region associativity for better retrieval.
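
To illustrate the set-matching ingredient, here is a hedged sketch of an entropy-regularised Optimal Transport (Sinkhorn) distance between region embeddings; the feature shapes and hyper-parameters are assumptions. Because the marginals are uniform over whichever regions are present, a partial sketch with few regions can still be matched gracefully against a full photo:

```python
import torch

def sinkhorn_distance(sketch_feats, photo_feats, eps=0.1, iters=50):
    """Soft-match sketch regions (m, d) to photo regions (n, d)."""
    cost = torch.cdist(sketch_feats, photo_feats)      # (m, n) pairwise costs
    K = torch.exp(-cost / eps)                         # Gibbs kernel
    u = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform marginals
    v = torch.full((cost.size(1),), 1.0 / cost.size(1))
    a, b = u.clone(), v.clone()
    for _ in range(iters):                             # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.t() @ a)
    plan = a[:, None] * K * b[None, :]                 # transport plan
    return (plan * cost).sum()                         # lower = better match
```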

Finding Badly Drawn Bunnies

By Kaiyue Pang, Lan Yang, Yi-Zhe Song, Honggang Zhang.
Paper | @yizhe_song

Everyone can sketch; the debate is on how well. We, for the first time, teach computers to tackle this "how well" problem, telling you just how badly or well drawn your bunny (or any other sketch) is. Our key discovery lies in exploiting the magnitude of a sketch feature as a quantitative quality metric. This is reassuring for many, as we no longer need to collect expensive quality annotations from humans to enable the said metric learning. We confirm consistent quality agreement between our proposed metric and human perception through a carefully designed human study. We also showcase the practical benefits in three sketch applications, thanks to this successful modelling of sketch quality.
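
The core measurement is simple enough to sketch in a few lines of PyTorch. The backbone here is a placeholder standing in for any sketch recognition encoder, not the paper's actual model: the L2 norm of the feature is read off directly as the quality score, so no human quality labels are needed at inference time.

```python
import torch
import torch.nn as nn

class SketchQuality(nn.Module):
    """Reads sketch quality off the magnitude of a recognition feature."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone          # any sketch encoder -> (B, d)

    def forward(self, sketch):
        feat = self.backbone(sketch)
        return feat.norm(p=2, dim=1)      # larger norm = better-drawn sketch
```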

Talk: 10:00AM-12:30PM (CST), 22 June 2022. 162 - Poster Session 2.1, New Orleans Ernest N. Morial Convention Center.

"The Pedestrian next to the Lamppost'' Adaptive Object Graphs for Better Instantaneous Mapping

“The Pedestrian next to the Lamppost” Adaptive Object Graphs for Better Instantaneous Mapping

By Avishkar Saha, Oscar Mendez-Maldonado, Richard Bowden, Chris Russell.
Paper | Video | @avi_saha_00 @OscarMendezM @c_russl @CogVis

Estimating a semantically segmented bird's-eye-view (BEV) map from a single image has become a popular technique for autonomous control and navigation. However, such models show an increase in localization error with distance from the camera. While some increase in error is entirely expected – localization is harder at distance – much of the drop in performance can be attributed to the cues used by current texture-based models: in particular, they make heavy use of object-ground intersections (such as shadows) [10], which become increasingly sparse and uncertain for distant objects. In this work, we address these shortcomings in BEV mapping by learning the spatial relationship between objects in a scene. We propose a graph neural network which predicts BEV objects from a monocular image by spatially reasoning about an object within the context of other objects. Our approach sets a new state of the art in BEV estimation from monocular images across three large-scale datasets, including a 50% relative improvement for objects on nuScenes.
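
A schematic of the object-graph reasoning is sketched below; module names, feature sizes and the dense adjacency are assumptions, and the real model also conditions on image features. Each object's feature is refined by messages from the other objects before its bird's-eye-view position is regressed:

```python
import torch
import torch.nn as nn

class ObjectGraphLayer(nn.Module):
    """One round of message passing between per-object features."""
    def __init__(self, dim=128):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, nodes, adj):        # nodes: (N, dim), adj: (N, N)
        N, d = nodes.shape
        pair = torch.cat([nodes[:, None].expand(N, N, d),
                          nodes[None, :].expand(N, N, d)], dim=-1)
        messages = (adj[..., None] * self.msg(pair)).sum(dim=1)
        return self.upd(messages, nodes)  # context-aware object features

class BEVObjectHead(nn.Module):
    """Stacks graph layers, then regresses each object's BEV location."""
    def __init__(self, dim=128, layers=3):
        super().__init__()
        self.layers = nn.ModuleList(ObjectGraphLayer(dim)
                                    for _ in range(layers))
        self.to_bev = nn.Linear(dim, 2)   # (x, z) on the ground plane

    def forward(self, nodes, adj):
        for layer in self.layers:
            nodes = layer(nodes, adj)
        return self.to_bev(nodes)
```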

Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production

By Ben Saunders, Richard Bowden, Cihan Camgoz.
Paper | @BenMSaunders @ncihancamgoz @CogVis

Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts. However, current deep-learning-based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences from constrained vocabularies, which limits their applicability. To be understandable and accepted by the deaf, an automatic SLP system must be able to generate co-articulated, photo-realistic signing sequences for large domains of discourse.