- The paper introduces Dual Gaussian Splatting, a technique that uses dual-layered joint and skin Gaussians to model global motion and fine appearance details.
- It employs a sequential coarse-to-fine optimization strategy that refines volumetric representations and achieves up to 120-fold compression.
- Experimental results demonstrate superior PSNR, SSIM, and LPIPS metrics compared to state-of-the-art methods, enabling high-fidelity VR on low-end devices.
Dual Gaussian Splatting for Real-time Human-centric Volumetric Videos
In their paper "Robust Dual Gaussian Splatting for Immersive Human-centric Volumetric Videos," Jiang et al. present an advanced methodology for high-fidelity, real-time rendering and compression of volumetric videos. This paper addresses critical challenges in the domain of 3D and 4D content, specifically focusing on human performances within volumetric video. The novel approach, dubbed Dual Gaussian Splatting (DualGS), distinguishes itself through a compressed, high-quality spatio-temporal representation that enables immersive virtual reality (VR) experiences on low-end devices.
Introduction
The primary innovation in this paper is the DualGS representation, which uses a dual-layered system of Gaussians to independently model motion and appearance attributes. Traditional volumetric video production workflows depend heavily on mesh sequences and often require extensive manual intervention to stabilize these sequences, generating large asset sizes that inhibit broader adoption. DualGS eliminates these inefficiencies by representing motion through joint Gaussians and appearance through skin Gaussians.
Methodology
Dual-Gaussian Representation:
DualGS achieves efficient and accurate human performance tracking by initializing two distinct sets of Gaussians:
- Joint Gaussians: A compact set of Gaussians (~15,000) that captures global motion.
- Skin Gaussians: A larger set of Gaussians (~180,000) that represent visual details.
During initialization, joint Gaussians are first optimized to capture the performance’s global motion, with constraints applied to prevent overly skinny Gaussians and oversized structures. Each skin Gaussian is then anchored to multiple joint Gaussians through k-nearest neighbors (KNN), enabling spatial interpolation for motion representation while maintaining temporal coherence. This hierarchical structure substantially reduces motion redundancy and enhances tracking robustness.
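The anchoring and interpolation step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the inverse-distance weighting is a generic choice standing in for whatever weighting scheme the authors use.

```python
import numpy as np

def anchor_skin_to_joints(skin_pos, joint_pos, k=4):
    """For each skin Gaussian, find its k nearest joint Gaussians (KNN)
    and derive normalized interpolation weights.
    Inverse-distance weighting is an illustrative assumption."""
    # Pairwise distances: (num_skin, num_joint)
    d = np.linalg.norm(skin_pos[:, None, :] - joint_pos[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]            # indices of k nearest joints
    nn_d = np.take_along_axis(d, idx, axis=1)     # their distances
    w = 1.0 / (nn_d + 1e-8)                       # closer joints weigh more
    w /= w.sum(axis=1, keepdims=True)             # weights sum to 1 per skin Gaussian
    return idx, w

def interpolate_motion(joint_disp, idx, w):
    """Propagate per-joint displacements to skin Gaussians by
    weighted averaging over each skin Gaussian's anchors."""
    return (joint_disp[idx] * w[..., None]).sum(axis=1)
```

Because the anchors and weights are fixed after initialization, the same interpolation is reused every frame, which is what makes the skin layer's motion cheap to drive from the much smaller joint layer.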
Sequential Optimization:
The methodology employs a coarse-to-fine optimization strategy across frames, divided into:
- Coarse Alignment: Focuses solely on joint Gaussians’ motion using a locally rigid regularizer and velocity prediction for robust tracking.
- Fine-grained Optimization: Updates both joint and skin Gaussian attributes. Here, skin Gaussian positions and rotations are interpolated from joint Gaussians to balance rendering quality and temporal consistency. A temporal regularization term further mitigates abrupt changes in Gaussian attributes across frames.
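Two ingredients of this stage lend themselves to a short sketch: the velocity-based warm start used in coarse alignment, and a temporal smoothness penalty on Gaussian attributes. Both functions below are hypothetical simplifications (a constant-velocity extrapolation and a plain L2 penalty); the paper's actual regularizers may take a different form.

```python
import numpy as np

def velocity_predicted_init(pos_prev, pos_prev2):
    """Warm-start the current frame's joint Gaussian positions by
    extrapolating each Gaussian with its previous-frame velocity
    (constant-velocity assumption for illustration)."""
    return pos_prev + (pos_prev - pos_prev2)

def temporal_regularizer(attrs_t, attrs_prev, weight=0.1):
    """Penalize abrupt per-Gaussian attribute changes between adjacent
    frames; a generic L2 smoothness term standing in for the paper's
    temporal regularization."""
    return weight * np.mean((attrs_t - attrs_prev) ** 2)
```

A good initialization from velocity prediction keeps the coarse alignment from falling into poor local minima under fast motion, while the temporal term damps flicker in the fine-grained stage.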
Compression Strategy:
DualGS makes these high-fidelity 4D assets viable on low-end devices. The proposed compression framework achieves ratios of up to 120:1, encoding each frame in roughly 350 KB. Key elements of this strategy include:
- Residual Vector Quantization (RVQ): Applied to joint Gaussians’ motion.
- Codec Compression: Utilized for skin Gaussians’ opacity and scaling, which are arranged into 2D look-up tables (LUTs).
- Persistent Codebook Compression: Handles spherical harmonic (SH) color attributes, greatly reducing storage requirements by clustering SH components and encoding them as persistent indices with length encoding.
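The core idea of residual vector quantization is easy to show in isolation: each stage quantizes the residual left over by the previous stage, so coarse codebooks capture the bulk of the signal and later stages refine it. This is a generic RVQ sketch (greedy nearest-neighbor search per stage), not the paper's codec; codebook sizes and training are omitted.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization.
    x: (N, D) vectors; codebooks: list of (K, D) arrays, one per stage.
    Returns one (N,) index array per stage."""
    residual = x.astype(np.float64).copy()
    codes = []
    for cb in codebooks:
        # Nearest code in this stage's codebook for each residual vector
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]   # pass leftover error to the next stage
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vectors across stages."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))
```

Storing a few small per-stage indices per Gaussian instead of full-precision motion vectors is what drives the large reduction in per-frame motion data.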
Results and Evaluation
The DualGS framework is validated through rigorous qualitative and quantitative comparisons against state-of-the-art dynamic rendering methods such as HumanRF, NeuS2, Spacetime Gaussian, and HiFi4G. The results show that DualGS achieves superior rendering quality while maintaining minimal storage overhead, consistently delivering higher PSNR, SSIM, and VMAF scores and lower LPIPS values.
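For reference, PSNR, the most common of these metrics, is a simple function of mean squared error. The snippet below is a standard definition, not code from the paper:

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values
    in [0, max_val]; higher is better."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM and LPIPS are structural and learned perceptual metrics, respectively, and typically come from libraries such as scikit-image or the `lpips` package rather than a hand-rolled formula.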
A comprehensive analysis of ablation studies further demonstrates the efficacy of the DualGS representation. Components such as velocity prediction, joint Gaussians, and coarse-to-fine optimization contribute significantly to the accurate rendering of complex human performances.
Practical Implementation and Implications
The dual Gaussian-based compression strategy makes real-time, high-fidelity VR rendering feasible even on mobile devices like smartphones and standalone VR headsets. The implementation of a Unity plugin and a DualGS player ensures seamless integration into conventional 3D rendering pipelines, facilitating the immersive experience.
Conclusion
Jiang et al.’s work offers a notable advancement in volumetric video rendering and compression. By introducing a dual Gaussian layer representation, this research significantly enhances both the fidelity and efficiency of rendering human performances. Future developments may explore more dynamic optimization strategies to further improve temporal coherence and accommodate topological changes, as well as integrate multi-modal inputs to drive animations.
References
The paper's reference list includes seminal works on dynamic human modeling, neural human representation, and volumetric video compression, reflecting the breadth and depth of research in this field. The authors acknowledge contributions from neural radiance fields, dynamic Gaussian splatting, and adaptive mesh compression, all of which underpin the innovations presented in this paper.