
Pulkit Gera
About
My research project
Foundation models for multi-modal video understanding
Leveraging foundation models with 3D priors and multimodal inputs to improve video understanding across recognition, reasoning, and prediction.
Supervisors
Research
Research interests
3D Computer Vision, Multi-modal models, Large Vision Models, Relighting, Face Avatars
Publications
Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. Progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset of multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illuminations, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations of state-of-the-art relighting and novel-view synthesis methods underscore both the dataset's value and the significant challenges still present in modeling complex human-centric appearance and lighting interactions. We believe HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.
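The core property that makes OLAT captures useful for relighting is linearity of light transport: an image under any target illumination can be approximated as a weighted sum of the per-light OLAT frames. The sketch below illustrates this idea only; the array shapes, light count, and function name are hypothetical and not taken from the HumanOLAT release.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Image-based relighting sketch: a relit image is a weighted sum of
    One-Light-at-a-Time (OLAT) captures, with per-light RGB weights taken
    from the target illumination (e.g. an environment map sampled at each
    light's direction)."""
    # olat_images: (num_lights, H, W, 3) HDR frames, one per light source
    # light_weights: (num_lights, 3) RGB intensity assigned to each light
    return np.einsum('nhwc,nc->hwc', olat_images, light_weights)

# Hypothetical usage with random data, for illustration only.
olat = np.random.rand(331, 64, 64, 3).astype(np.float32)     # 331 lights
weights = np.random.rand(331, 3).astype(np.float32)          # target lighting
relit = relight_from_olat(olat, weights)                      # (64, 64, 3) HDR image
```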
We present PanoHDR-NeRF, a neural representation of the full HDR radiance field of an indoor scene, and a pipeline to capture it casually, without elaborate setups or complex capture protocols. First, a user captures a low dynamic range (LDR) omnidirectional video of the scene by freely waving an off-the-shelf camera around the scene. Then, an LDR2HDR network uplifts the captured LDR frames to HDR, which are used to train a tailored NeRF++ model. The resulting PanoHDR-NeRF can render full HDR images from any location of the scene. Through experiments on a novel test dataset of real scenes with the ground truth HDR radiance captured at locations not seen during training, we show that PanoHDR-NeRF predicts plausible HDR radiance from any scene point. We also show that the predicted radiance can synthesize correct lighting effects, enabling the augmentation of indoor scenes with synthetic objects that are lit correctly. Datasets and code are available at https://lvsn.github.io/PanoHDR-NeRF/.
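The key stage of the pipeline described above is the LDR-to-HDR uplift, which turns casually captured LDR panorama frames into HDR supervision for the radiance field. The snippet below is a minimal stand-in for that stage, not the actual PanoHDR-NeRF network: the architecture, sizes, and class name are assumptions chosen only to show the idea of predicting radiance beyond the LDR range before training a NeRF++-style model.

```python
import torch
import torch.nn as nn

class LDR2HDRNet(nn.Module):
    """Hypothetical stand-in for an LDR-to-HDR uplifting network: a small CNN
    that predicts log-radiance from an LDR panorama frame, so saturated
    regions can be recovered with values above 1.0."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, ldr):
        # Predict in log space and exponentiate so the HDR output is positive
        # and can exceed the LDR range [0, 1].
        return torch.exp(self.net(ldr))

# Pipeline sketch: uplift each captured LDR frame; the resulting HDR frames,
# together with camera poses, would then supervise the radiance field.
ldr_frame = torch.rand(1, 3, 256, 512)   # one LDR panorama frame (illustrative size)
hdr_frame = LDR2HDRNet()(ldr_frame)      # HDR estimate of the same frame
```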
We present a neural rendering framework for simultaneous view synthesis and appearance editing of a scene with known environmental illumination captured using a mobile camera. Existing approaches either achieve view synthesis alone or view synthesis along with relighting, without control over the scene's appearance. Our approach explicitly disentangles the appearance and learns a lighting representation that is independent of it. Specifically, we jointly learn the scene appearance and a lighting-only representation of the scene. Such disentanglement allows our approach to generalize to arbitrary changes in appearance while performing view synthesis. We show results of editing the appearance of real scenes in interesting and non-trivial ways. The performance of our view synthesis approach is on par with state-of-the-art approaches on both real and synthetic data.
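The disentanglement idea can be illustrated with a radiance model that factors into an appearance term and a lighting-only term, so appearance can be swapped at render time without touching the lighting representation. This is a minimal sketch of that factorization under assumed shapes and module names; it is not the paper's actual network.

```python
import torch
import torch.nn as nn

class DisentangledRadiance(nn.Module):
    """Hypothetical sketch: radiance at a sample is the product of an
    appearance (albedo-like) term and an appearance-independent,
    view-dependent lighting term, enabling appearance edits that leave
    view synthesis and lighting untouched."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # Appearance branch: per-sample diffuse colour.
        self.appearance = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3), nn.Sigmoid())
        # Lighting branch: scalar shading from geometry features and view direction.
        self.lighting = nn.Sequential(
            nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

    def forward(self, point_feat, view_dir, appearance_override=None):
        albedo = self.appearance(point_feat)
        if appearance_override is not None:
            albedo = appearance_override       # edit appearance at render time
        shading = self.lighting(torch.cat([point_feat, view_dir], dim=-1))
        return albedo * shading                # outgoing radiance

# Illustrative usage with random per-sample features and view directions.
feats = torch.rand(1024, 64)
dirs = torch.rand(1024, 3)
rgb = DisentangledRadiance()(feats, dirs)      # (1024, 3) radiance values
```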