
HuPrior3R: Incorporating Human Priors for Better 3D Dynamic Reconstruction from Monocular Videos (2512.06368v2)

Published 6 Dec 2025 in cs.CV

Abstract: Monocular dynamic video reconstruction faces significant challenges in dynamic human scenes due to geometric inconsistencies and resolution degradation issues. Existing methods lack 3D human structural understanding, producing geometrically inconsistent results with distorted limb proportions and unnatural human-object fusion, while memory-constrained downsampling causes human boundary drift toward background geometry. To address these limitations, we propose to incorporate hybrid geometric priors that combine SMPL human body models with monocular depth estimation. Our approach leverages structured human priors to maintain surface consistency while capturing fine-grained geometric details in human regions. We introduce HuPrior3R, featuring a hierarchical pipeline with refinement components that processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. The method integrates SMPL priors through a Feature Fusion Module to ensure geometrically plausible reconstruction while preserving fine-grained human boundaries. Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction.

Summary

  • The paper introduces a hybrid geometric prior that fuses SMPL mesh depth with monocular depth cues to correct anatomical artifacts and boundary degradation.
  • It employs a robust feature fusion module with cross-attention and adaptive gating to align and integrate depth information effectively.
  • The approach achieves notable performance improvements, evidenced by lower Abs Rel errors and improved camera-pose metrics, advancing the state of the art in dynamic human reconstruction.

Monocular 3D Dynamic Human Reconstruction via Structured Human Priors: The HuPrior3R Approach

Motivation and Problem Formulation

Monocular 3D dynamic reconstruction, particularly in scenes containing deformable, articulated humans, remains subject to geometric inconsistencies and loss of resolution at human boundaries. Existing methods based on feed-forward ViT pointmap regression pipelines (e.g., DUSt3R, Align3R, MonST3R, VGGT) achieve competitive results on general dynamic scenes but consistently produce anatomical artifacts—distorted limbs, disconnected body parts, foreground-background blending—and human boundary drift, primarily due to insufficient 3D human modeling and information loss from memory-constrained downsampling. The supervised monocular depth estimation path alone (e.g., MiDaS, DepthAnything) is limited by a lack of strong structural priors, manifesting as severe temporal flicker and degraded geometry on human surfaces.

HuPrior3R explicitly fuses structured human priors with monocular cues: a hybrid geometric prior combining SMPL mesh depth and monocular depth maps, implemented through a hierarchical pipeline with refinement, feature fusion, and cross-attention mechanisms for context-focused detail enhancement. This design rectifies the two core failure modes of monocular video-based 3D human-centric reconstruction: anatomical artifacts and resolution/boundary degradation.

Methodology

Hybrid Geometric Priors and Depth Alignment

HuPrior3R leverages per-frame SMPL mesh estimation (via a perspective-parameterized pipeline such as CameraHMR) to obtain robust human masks and SMPL depth maps under dynamic camera motion. These are aligned to the monocular depth predictions through robust RANSAC-based linear fitting, ensuring that both modalities are brought to a compatible metric scale within human masks:

  • Only masked foreground (human) pixels are used for robust scale/shift estimation.
  • Alignment is critical: direct feature fusion of unaligned monocular and SMPL depths leads to catastrophic performance collapse, with Abs Rel increasing by over two orders of magnitude.
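
The alignment step amounts to a masked, robust linear fit. A minimal NumPy sketch is below; the two-point hypothesis sampling, inlier threshold, and function name are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def ransac_scale_shift(smpl_depth, mono_depth, human_mask,
                       n_iters=200, inlier_thresh=0.05, seed=0):
    """Fit mono ~= s * smpl + t over human-masked pixels with RANSAC.

    smpl_depth, mono_depth: (H, W) float arrays; human_mask: (H, W) bool.
    Hypothetical helper sketching the alignment described in the paper.
    """
    rng = np.random.default_rng(seed)
    x = smpl_depth[human_mask].ravel()
    y = mono_depth[human_mask].ravel()
    best_s, best_t, best_count = 1.0, 0.0, -1
    for _ in range(n_iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if abs(x[i] - x[j]) < 1e-6:        # degenerate pair, skip
            continue
        s = (y[i] - y[j]) / (x[i] - x[j])  # two-point line hypothesis
        t = y[i] - s * x[i]
        inliers = np.abs(s * x + t - y) < inlier_thresh
        if inliers.sum() > best_count:
            best_count = int(inliers.sum())
            # Least-squares refit on the inlier set for stability.
            A = np.stack([x[inliers], np.ones(best_count)], axis=1)
            best_s, best_t = np.linalg.lstsq(A, y[inliers], rcond=None)[0]
    return best_s, best_t  # apply as: smpl_aligned = s * smpl_depth + t
```

Applying the fitted (s, t) to the SMPL depth before fusion is what prevents the two-orders-of-magnitude Abs Rel collapse reported in the ablations.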

Feature Fusion Module

After alignment, scene context, monocular-aligned depth, and SMPL depth are mapped into ViT feature tokens. A cross-attention module concatenates image and monocular-depth features to form the query and projects SMPL features to keys and values, followed by multi-head attention and an adaptive, learned gated fusion (a minimal sketch follows the list below). The gate modulates the per-location contribution of SMPL features, using local context to keep human priors from propagating into the background and to incorporate structural information only where it is beneficial.

This design enables the network to:

  • Enforce anatomical correctness within human regions, preventing disconnected or implausible geometry.
  • Learn to suppress SMPL priors in regions where they are irrelevant or conflict with the image/depth cues.
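
A minimal PyTorch sketch of such a gated cross-attention fusion follows; the token dimension, head count, and sigmoid gate parameterization are assumptions for illustration rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class GatedSMPLFusion(nn.Module):
    """Cross-attention fusion of SMPL-prior tokens into scene tokens.

    Query = concat(image tokens, aligned mono-depth tokens); key/value =
    SMPL depth tokens. A learned per-token gate modulates how much SMPL
    structure is injected, so the prior is suppressed off the human.
    Layer sizes and gate form are illustrative, not the paper's design.
    """
    def __init__(self, dim=768, n_heads=8):
        super().__init__()
        self.query_proj = nn.Linear(2 * dim, dim)   # image ++ mono-depth
        self.kv_proj = nn.Linear(dim, dim)          # SMPL depth tokens
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_tok, mono_tok, smpl_tok):
        # img_tok, mono_tok, smpl_tok: (B, N, dim) ViT token grids.
        q = self.query_proj(torch.cat([img_tok, mono_tok], dim=-1))
        kv = self.kv_proj(smpl_tok)
        attended, _ = self.attn(q, kv, kv)          # SMPL-conditioned update
        # Gate from local context: near 0 where the prior is irrelevant
        # or conflicts with image/depth cues.
        g = self.gate(torch.cat([q, attended], dim=-1))
        return q + g * attended                     # residual gated fusion
```

The residual form keeps the scene/depth pathway intact when the gate saturates near zero, which matches the reported behavior of suppressing the SMPL prior outside human regions.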

Hierarchical Human-Centric Refinement

When humans occupy small pixel regions, boundary drift and detail loss become significant due to ViT tokenization and global context mixing. To counter this, a refinement module is triggered:

  • Input crops are extracted around humans, and pointmaps are upsampled and processed via refinement decoders.
  • Cross-attention layers couple crop-specific features to global context, ensuring temporal and spatial consistency while super-resolving human geometry.
  • Final high-resolution pointmaps are integrated back into the global scene via spatial alignment.
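
The crop-refine-reintegrate loop can be sketched as below; the crop padding, working resolution, and the `refine_decoder` interface are assumed placeholders rather than the paper's components:

```python
import torch
import torch.nn.functional as F

def refine_human_regions(pointmap, image, human_boxes, refine_decoder,
                         crop_res=512, pad=0.15):
    """Crop around each human, super-resolve its pointmap, and paste the
    refined geometry back into the full-resolution scene map.

    pointmap: (3, H, W); image: (3, H, W); human_boxes: list of
    (x0, y0, x1, y1) ints. `refine_decoder` stands in for the paper's
    refinement decoder with global cross-attention; interface assumed.
    """
    _, H, W = pointmap.shape
    refined = pointmap.clone()
    for (x0, y0, x1, y1) in human_boxes:
        # Expand the box so boundary context survives ViT tokenization.
        dx, dy = int((x1 - x0) * pad), int((y1 - y0) * pad)
        x0, y0 = max(0, x0 - dx), max(0, y0 - dy)
        x1, y1 = min(W, x1 + dx), min(H, y1 + dy)
        crop_img = image[:, y0:y1, x0:x1].unsqueeze(0)
        crop_pts = pointmap[:, y0:y1, x0:x1].unsqueeze(0)
        # Upsample the crop to the refinement working resolution.
        crop_img = F.interpolate(crop_img, size=(crop_res, crop_res),
                                 mode='bilinear', align_corners=False)
        crop_pts = F.interpolate(crop_pts, size=(crop_res, crop_res),
                                 mode='bilinear', align_corners=False)
        # Decoder attends to global scene features for consistency.
        fine_pts = refine_decoder(crop_img, crop_pts)   # (1, 3, R, R)
        # Resample back and overwrite the human region in the scene.
        fine_pts = F.interpolate(fine_pts, size=(y1 - y0, x1 - x0),
                                 mode='bilinear', align_corners=False)
        refined[:, y0:y1, x0:x1] = fine_pts[0]
    return refined
```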

Ablation studies confirm the refinement stage is critical: it improves boundary localization and reconstruction fidelity, particularly for temporally coherent reconstructions of small or fast-moving human actors.
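
Putting the stages together, the overall forward pass reads roughly as follows. Every callable here is an injected placeholder standing in for the paper's components (monocular depth network, SMPL estimator such as CameraHMR, the gated fusion module, the pointmap head, and the refiner), and the small-human trigger threshold is an assumption:

```python
def huprior3r_forward(frames, mono_net, smpl_net, align, fuse,
                      pointmap_head, refiner, small_frac=0.05):
    """Schematic per-frame pass following the paper's described pipeline.

    frames: iterable of (3, H, W) image tensors. All callables are
    placeholders; this sketches data flow, not the released model.
    """
    scene = []
    for frame in frames:
        mono_depth = mono_net(frame)                 # monocular depth prior
        smpl_depth, mask = smpl_net(frame)           # SMPL render + human mask
        s, t = align(smpl_depth, mono_depth, mask)   # RANSAC scale/shift fit
        tokens = fuse(frame, mono_depth, s * smpl_depth + t)  # gated fusion
        points = pointmap_head(tokens)               # full-res scene geometry
        # Trigger human-centric refinement when the subject is small in
        # frame (threshold illustrative, not from the paper).
        if float(mask.float().mean()) < small_frac:
            points = refiner(points, frame, mask)
        scene.append(points)
    return scene
```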

Empirical Results

HuPrior3R is evaluated on both synthetic (GTA-IM, Bedlam) and real-world (BEHAVE, Bonn, TUM Dynamics) datasets. Absolute relative error (Abs Rel) and the δ < 1.25 inlier rate are reported for depth (both metrics are sketched after the results list below), and ATE, RTE, and RRE for camera pose. Key quantitative results:

  • On BEHAVE, HuPrior3R achieves Abs Rel = 0.033 and δ < 1.25 = 0.992 (outperforming Align3R and VGGT by a large margin). On TUM Dynamics, Abs Rel = 0.102 and δ < 1.25 = 0.907.
  • On GTA-IM, HuPrior3R achieves the second-best Abs Rel (0.112), the best among non-diffusion methods. On Bedlam, a synthetic environment with a harder domain shift, Geo4D performs better owing to its diffusion-based priors.
  • On camera pose evaluation, HuPrior3R obtains ATE = 0.015, RTE = 0.010, and RRE = 0.227 on GTA-IM.
  • Ablation studies show: without SMPL priors, Abs Rel increases to 0.042; naive (ungated) fusion yields inferior depth quality and introduces background artifacts. Omitting RANSAC-based depth alignment leads to Abs Rel = 5.12, clearly demonstrating the necessity of robust prior scale matching.
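
For reference, the two depth metrics quoted above are standard and can be computed as follows (a routine sketch, not the paper's evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt, mask=None):
    """Standard monocular depth metrics: Abs Rel and the δ<1.25 inlier rate.

    pred, gt: depth maps of equal shape; mask: optional validity mask.
    """
    if mask is None:
        mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)        # mean relative depth error
    ratio = np.maximum(p / g, g / p)            # per-pixel ratio error
    inlier_125 = np.mean(ratio < 1.25)          # fraction within 1.25x
    return abs_rel, inlier_125
```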

Qualitative analysis shows that the refinement module removes boundary drift and flicker, recovers fine geometric detail at human perimeters, and produces temporally consistent reconstructions even under dynamic motion and occlusion.

Theoretical and Practical Implications

The architecture rigorously demonstrates the necessity of structured domain priors—such as SMPL body models—for 3D human reconstruction in the monocular setting. Key findings:

  • Multi-modal geometric priors, when properly aligned and adaptively fused, outperform both pure monocular and naive dual-prior approaches.
  • Hierarchical refinement, conditioned on actor scale in the scene, is indispensable for preserving subject fidelity in pixel-constrained regimes.
  • Explicit, adaptive gating between different geometric modalities is necessary to avoid background contamination and cross-modal conflicts—this is confirmed by both quantitative and qualitative failure cases in the absence of feature gating.

Practically, these results indicate that holistic human-centric scene understanding, critical for AR/VR, robotics, and animation, will require the integration of interpretable parametric body priors within scalable feed-forward architectures. The pipeline and evaluation protocol, including robust pose alignment and temporal metrics, set a new bar for evaluating dynamic human-centric feed-forward 3D scene reconstruction.

Future Directions

Given the strong results attained by integrating SMPL priors, immediate future work includes:

  • Incorporation of more general non-rigid and scene-aware priors, extending beyond humans.
  • Leveraging temporal sequence modeling and memory-based architectures for improved longer-range temporal coherence.
  • Exploring large-scale, pretrained mono-to-3D or video diffusion models that can co-train on parameterized mesh priors, offsetting the limitations seen with both pure diffusion and pure feed-forward models.
  • Integration with multi-view data or event-based inputs for further robustness under motion and occlusion.
  • Extending refinement strategies to fully instance-agnostic object-centric scenes.

Conclusion

HuPrior3R presents a comprehensive monocular 3D dynamic human reconstruction method that fuses SMPL priors and monocular cues through hierarchical, context-aware fusion and refinement. By demonstrating the critical importance of properly aligned, gated geometric priors and local refinement modules, the work advances the state of the art in anatomically consistent, temporally stable, human-centric 3D reconstruction from monocular video. The architectural principles and empirical results argue strongly for the continued, explicit integration of domain priors and adaptive fusion mechanisms in learned geometric perception systems.

Reference: "HuPrior3R: Incorporating Human Priors for Better 3D Dynamic Reconstruction from Monocular Videos" (2512.06368)
