Open 3D video dataset for interdisciplinary, audio-visual research
Using the Centre for Vision, Speech and Signal Processing's (CVSSP) expertise and state-of-the-art facilities, Dr Philip Jackson and his colleagues have released carefully-curated recordings of realistic actions as 3D videos, to widen access to such 'volumetric' data and foster interdisciplinary research on audio-visual perception and immersive technologies.
Although we naturally integrate information from across our senses in the blink of an eye, research into human and artificial perception of events is fragmented across academic communities, research groups, disciplines and even faculties, from psychology and performing arts to science and engineering. Similarly, each sense or modality is commonly treated alone, rather than as part of an integrated experience, and publicly-available volumetric video datasets often focus on the visuals, offering low-quality audio, if any.
Meanwhile, tech giants such as Intel or Facebook, and major Hollywood studios, have the resources to capture volumetric data using arrays of microphones and literally hundreds of high-quality cameras, but little incentive to share their proprietary data. Not only do these multi-camera rigs require space, investment and technical skill to operate, but the video processing also needs specialist techniques and heavyweight compute. Our aim, therefore, was to create a dataset of sounding actions filmed in 3D and to make it available to diverse international research communities for investigating experiences of sight and sound together.
With CVSSP's studio facilities equipped for volumetric video capture, and sound recording expertise within our team, the critical conditions were in place for us to attempt the construction of this dataset. However, without direct funding, a key challenge was to form and maintain the collaboration needed to bring together all the necessary elements. This involved months of data preparation, processing and curation, coordinated via Zoom, long after the studio recordings themselves.
Video capture in the chroma-key studio used 16 ultra-high-definition (UHD) cameras. Sound was recorded with professional microphone techniques and edited in post-production. We recruited two actor volunteers from GSA. Having defined a broad set of actions that could be performed in the studio, we subsequently selected those that were suitable for processing, balanced across sound source types, and varied in sound character (transient, harmonic and continuous).
Visual processing extracted silhouettes, carved out a visual hull, refined the 3D shape frame-by-frame and mapped on the texture to yield the full appearance. Audio signals were de-noised and equalised, where necessary, before mixing.
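For readers unfamiliar with the term, the idea behind carving a visual hull from silhouettes can be illustrated with a minimal sketch. This is not the CVSSP pipeline: it is a generic point-based carving routine, and the function name `carve_visual_hull`, the two toy orthographic cameras and the 8 x 8 masks are all assumptions made purely for illustration. A 3D point is kept only if it projects inside the foreground silhouette of every camera.

```python
import numpy as np


def carve_visual_hull(silhouettes, projections, grid_points):
    """Return a boolean array marking the grid points inside every silhouette.

    silhouettes: list of (H, W) boolean masks, True = foreground.
    projections: list of (3, 4) camera projection matrices (assumed calibrated).
    grid_points: (N, 3) array of candidate 3D points.
    """
    n = len(grid_points)
    homogeneous = np.hstack([grid_points, np.ones((n, 1))])
    inside = np.ones(n, dtype=bool)
    for mask, P in zip(silhouettes, projections):
        uvw = homogeneous @ P.T                      # (N, 3) homogeneous pixels
        w = uvw[:, 2]
        in_front = w > 1e-9                          # ignore points behind the camera
        safe_w = np.where(in_front, w, 1.0)          # avoid dividing by zero
        u = np.floor(uvw[:, 0] / safe_w).astype(np.int64)  # pixel column
        v = np.floor(uvw[:, 1] / safe_w).astype(np.int64)  # pixel row
        height, width = mask.shape
        in_image = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        hit = np.zeros(n, dtype=bool)
        hit[in_image] = mask[v[in_image], u[in_image]]
        inside &= hit                                # carve away anything outside
    return inside


# Toy setup (illustrative assumption): two orthographic cameras, each seeing
# the same 4 x 4 foreground square in an 8 x 8 image.
square = np.zeros((8, 8), dtype=bool)
square[2:6, 2:6] = True
looking_along_z = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])  # u = x, v = y
looking_along_x = np.array([[0, 0, 1.0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])  # u = z, v = y

axis = np.arange(8) + 0.5                            # voxel centres
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
grid = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

occupied = carve_visual_hull(
    [square, square], [looking_along_z, looking_along_x], grid
)
print(int(occupied.sum()))  # 64: a 4 x 4 x 4 block survives the carving
```

With only two views the hull is a coarse box; adding views from more directions tightens it, which is one reason a 16-camera rig yields a usable starting shape for the later frame-by-frame refinement.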
The final result was a dataset of forty short clips, all exactly the same duration, with four repetitions of each action. Each clip combines 3D video, as a sequence of 3D meshes together with a texture atlas, with the accompanying audio. We also provide results at several stages of the visual processing to allow others to replicate and improve on our results using methods developed in CVSSP. Thus we hope to facilitate both scientific studies, employing our data to investigate perceptual or perhaps therapeutic applications of immersive technologies, and technical studies of system performance and reproduction quality.
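One way to picture a clip of this kind in code is sketched below. The class and field names (`MeshFrame`, `VolumetricClip`, `texture_atlas` and so on), the single atlas per clip, and the placeholder frame rate and sample rate are all hypothetical: they do not describe the released file layout, only the general shape of "a sequence of meshes, a texture atlas and audio".

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class MeshFrame:
    """One frame of 3D video (hypothetical layout)."""
    vertices: np.ndarray   # (V, 3) vertex positions
    faces: np.ndarray      # (F, 3) indices into vertices
    uvs: np.ndarray        # (V, 2) texture-atlas coordinates in [0, 1]


@dataclass
class VolumetricClip:
    """A 3D video clip with its accompanying audio (hypothetical layout)."""
    action_id: int
    repetition: int
    frames: List[MeshFrame]
    texture_atlas: np.ndarray  # (H, W, 3) image; one atlas per clip is an assumption
    audio: np.ndarray          # (samples,) mono signal, for simplicity
    frame_rate: float          # placeholder value, not taken from the dataset
    sample_rate: int           # placeholder value, not taken from the dataset

    def video_seconds(self) -> float:
        return len(self.frames) / self.frame_rate

    def audio_seconds(self) -> float:
        return self.audio.shape[0] / self.sample_rate


# Toy clip: a single triangle repeated for 30 frames, with one second of silence.
triangle = MeshFrame(
    vertices=np.zeros((3, 3)),
    faces=np.array([[0, 1, 2]]),
    uvs=np.zeros((3, 2)),
)
clip = VolumetricClip(
    action_id=0,
    repetition=0,
    frames=[triangle] * 30,
    texture_atlas=np.zeros((4, 4, 3), dtype=np.uint8),
    audio=np.zeros(48000),
    frame_rate=30.0,
    sample_rate=48000,
)

# Forty clips with four repetitions of each action implies ten distinct actions.
clip_index = [(action, rep) for action in range(40 // 4) for rep in range(4)]
```

Keeping video and audio durations as separate methods makes it easy to check that the two streams of a clip line up before using it in a perception experiment.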
Through our own research with these data, work with collaborators and dissemination, we are now looking to promote the use of this resource with 3D video, spatial audio and interactive six-degree-of-freedom environments, such as virtual or mixed reality (VR/XR).
- H. Stenzel, D. Berghi, M. Volino and P. J. B. Jackson, "Naturalistic audio-visual volumetric sequences dataset of sounding actions for six degree-of-freedom interaction," 2021 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 2021, pp. 637-638, doi: 10.1109/VRW52623.2021.00201.