OccFusion: Rendering Occluded Humans with Generative Diffusion Priors (2407.00316v1)

Published 29 Jun 2024 in cs.CV

Abstract: Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient and high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in the rendering of occluded humans.

Summary

  • The paper introduces OccFusion, a novel pipeline that combines 3D Gaussian splatting with diffusion priors to render occluded humans from monocular videos.
  • The method employs a three-stage process—Initialization for occupancy masks, Optimization using Score-Distillation Sampling, and Refinement with in-context inpainting—to achieve complete 3D reconstructions.
  • Experimental results on ZJU-MoCap and OcMotion datasets demonstrate state-of-the-art performance with improved PSNR, SSIM, and LPIPS metrics, enhancing applications in AR, sports analytics, and healthcare.

Overview of "OccFusion: Rendering Occluded Humans with Generative Diffusion Priors"

"OccFusion: Rendering Occluded Humans with Generative Diffusion Priors" by Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, and Ehsan Adeli presents a novel methodology designed to address the persistent challenge of rendering 3D humans from monocular videos, especially in scenarios involving occlusion. This problem is critical in numerous fields such as virtual and augmented reality, healthcare, and sports analytics. Traditional methods have largely ignored the issue of occlusion, assuming unobstructed views of the human subjects. This paper introduces "OccFusion", a method that leverages 3D Gaussian splatting in combination with pretrained 2D diffusion models to achieve high-fidelity human rendering despite occlusions.

Methodology

OccFusion operates via a structured, multi-stage pipeline consisting of three sequential stages: Initialization, Optimization, and Refinement.

  1. Initialization Stage:
    • This stage generates complete human occupancy masks from partial visibility masks. The authors acknowledge the weaknesses of direct inpainting via diffusion models in challenging poses, particularly when dealing with self-occlusions. To address this, a simplified representation of pose priors is introduced, improving the diffusion model's ability to generate feasible outputs. The results are binary human masks extracted from inpainted images, which offer greater cross-frame consistency.
  2. Optimization Stage:
    • In this stage, the 3D human Gaussians are optimized, with additional supervision provided by Score-Distillation Sampling (SDS). Using diffusion priors to enforce completeness of the body model is both novel and effective: the authors apply SDS in the canonical pose, regularizing the representation so that it remains consistent and complete from arbitrary viewing angles.
  3. Refinement Stage:
    • The final stage involves enhancing the rendering quality using in-context inpainting. Here, coarse renderings from the Optimization stage are utilized as contextual references for the diffusion model to generate high-fidelity appearances for occluded regions. This approach significantly refines both the appearance and the geometry of the human model, ensuring the final rendering is of superior quality.
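The three stages can be caricatured as a minimal pipeline skeleton. This is a hedged sketch, not the authors' implementation: the function names are hypothetical, the diffusion-based inpainting and SDS supervision are replaced by trivial stand-ins (mask union, a weighted residual toward a denoised target, neighborhood-mean fill), and images are plain NumPy arrays.

```python
import numpy as np

def initialize_masks(partial_mask: np.ndarray, pose_prior: np.ndarray) -> np.ndarray:
    """Stage 1 (Initialization): complete a partial visibility mask.

    Stand-in for diffusion-based inpainting: take the union of the
    observed mask and a pose-prior silhouette, then binarize — mimicking
    the binary human masks extracted from inpainted images."""
    completed = np.logical_or(partial_mask > 0.5, pose_prior > 0.5)
    return completed.astype(np.float32)

def sds_gradient(rendered: np.ndarray, denoised: np.ndarray,
                 weight: float = 1.0) -> np.ndarray:
    """SDS-style gradient w(t) * (eps_hat - eps), approximated here as a
    weighted residual between the current rendering and the diffusion
    model's denoised prediction."""
    return weight * (rendered - denoised)

def optimize_gaussians(gaussians: np.ndarray, sds_target: np.ndarray,
                       lr: float = 0.1, steps: int = 50) -> np.ndarray:
    """Stage 2 (Optimization): gradient descent on a toy 'rendering'
    (identity render) driven by the SDS-style residual."""
    g = gaussians.copy()
    for _ in range(steps):
        g -= lr * sds_gradient(g, sds_target)
    return g

def refine(render: np.ndarray, observed_mask: np.ndarray) -> np.ndarray:
    """Stage 3 (Refinement): in-context inpainting stand-in — keep well
    observed pixels, fill occluded ones from the observed-region mean."""
    observed = observed_mask > 0.5
    fill = render[observed].mean() if observed.any() else 0.0
    out = render.copy()
    out[~observed] = fill
    return out
```

The real pipeline replaces each stand-in with a pretrained 2D diffusion model (for inpainting and SDS) and a differentiable 3D Gaussian splatting renderer; the sketch only shows how the three stages hand data to one another.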

Results

The paper’s evaluation of OccFusion on the ZJU-MoCap and OcMotion datasets demonstrates its superiority over existing methods. The quantitative metrics, including Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), highlight the method's effectiveness. Notably, OccFusion achieves state-of-the-art performance with higher PSNR and SSIM values, and considerably lower LPIPS values, than competing methods. Qualitative evaluations further reveal that OccFusion provides sharp, artifact-free renderings, even under complex occlusions.
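For reference, the first two metrics are straightforward to compute; the sketch below is a standard textbook formulation, not the paper's evaluation code. The SSIM shown is the simplified global (single-window) variant; library implementations such as scikit-image use local sliding windows, and LPIPS additionally requires a pretrained network, so it is omitted here.

```python
import numpy as np

def psnr(img: np.ndarray, ref: np.ndarray, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB; higher is better."""
    mse = np.mean((img - ref) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(img: np.ndarray, ref: np.ndarray, max_val: float = 1.0) -> float:
    """Global (single-window) SSIM; in [-1, 1], higher is better.
    Stabilizing constants c1, c2 follow the standard SSIM formulation."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = img.mean(), ref.mean()
    var_x, var_y = img.var(), ref.var()
    cov = ((img - mu_x) * (ref - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

For example, a uniform error of 0.1 on a unit-range image gives an MSE of 0.01 and hence a PSNR of 20 dB, while identical images yield an SSIM of 1.0.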

Implications and Future Directions

The practical implications of OccFusion are profound. In healthcare, for example, accurate 3D reconstructions of occluded human figures can enhance remote patient monitoring and telemedicine. In sports analytics, it can provide precise replay analysis of athletes occluded by equipment or other players. Additionally, in augmented reality, it enables more realistic and immersive user experiences by accurately reconstructing environments with occluded humans.

Theoretically, this research illuminates the potential of combining explicit geometric techniques (like 3D Gaussian splatting) with rapid advancements in generative models (such as diffusion priors). The structured approach of isolating and addressing specific weaknesses at different stages of the pipeline is particularly noteworthy.

Future directions could involve exploring more sophisticated generative models tailored exclusively for human rendering tasks, potentially improving cross-frame consistency and overall rendering fidelity. Further integration with real-time processing systems could also be explored, enabling instant applications in dynamic environments.

In conclusion, "OccFusion" pushes the boundaries of monocular human rendering, presenting a significant step forward in addressing occlusions. The robust experimental evidence validates its place as a leading technique in the domain and points toward fertile ground for subsequent innovations.