MonoFusion: Sparse-View 4D Scene Reconstruction
- MonoFusion is a sparse-view 4D dynamic scene reconstruction method that fuses independent monocular estimates into a unified model capturing geometry, appearance, and motion.
- It integrates a canonical 3D Gaussian splatting representation with low-dimensional motion factorization to overcome challenges like limited cross-view overlap and occlusions.
- The method leverages DUSt3R and MoGe priors to align static multi-view cues with monocular depth information, enhancing novel view synthesis and reconstruction fidelity.
MonoFusion is a method for sparse-view dynamic 3D scene reconstruction that fuses independent monocular reconstructions from a small set of static RGB cameras into a single time- and view-consistent 4D model. It is designed for capture rigs with only a handful of fixed cameras that see the entire scene but have little cross-view overlap, and it targets reconstruction of dynamic human behaviors in cluttered environments. The method combines a canonical 3D Gaussian Splatting representation, a static multi-view geometric reference from DUSt3R, monocular depth from MoGe, and a low-dimensional motion factorization with shared basis trajectories, with the stated goals of reconstructing geometry, appearance, and motion, enabling high-quality interpolation to held-out views, and synthesizing novel viewpoints far from any training camera (Wang et al., 31 Jul 2025).
1. Problem formulation and operating regime
MonoFusion assumes a small number of static, inward-facing RGB cameras, placed roughly equidistant around the subject with wide angular separation, recording synchronized videos. The intrinsics and extrinsics are known and fixed for all cameras, and the scene may contain a human acting in a cluttered environment with multiple moving parts. Under these conditions, the central challenge is not merely dynamic reconstruction, but dynamic reconstruction under very limited cross-view overlap, large baselines, severe occlusions, and strong reliance on calibration (Wang et al., 31 Jul 2025).
The method is positioned against two failure modes that arise in this regime. Standard multi-view SfM/MVS is described as failing or producing poor initializations when overlap is limited. Monocular priors, while strong, are inconsistent because of per-image affine depth ambiguity, so naive fusion leads to contradictions such as duplicated body parts. MonoFusion addresses this by aligning monocular estimates in both space and time and then optimizing a single shared 4D representation.
A common misconception is to treat sparse-view dynamic reconstruction as a straightforward reduction of dense multi-view capture. MonoFusion explicitly rejects that assumption. It states that dense multi-view reconstruction methods struggle to adapt to sparse-view setups because of limited overlap between viewpoints. This suggests that the method’s contribution lies less in replacing one rendering primitive with another than in constructing a cross-view alignment procedure that remains stable when classical cross-view correspondences are weak or absent.
2. Canonical scene representation and motion factorization
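MonoFusion uses an explicit canonical 3D Gaussian Splatting representation augmented with feature fields and low-dimensional motion bases. At the canonical time $t_0$, the scene is represented as a set of $N$ 3D Gaussians, each with parameters
$$g_i = (\mu_i, R_i, s_i, o_i, c_i, f_i), \qquad i = 1, \dots, N,$$
where $\mu_i$ is the center, $R_i$ the orientation, $s_i$ the scale, $o_i$ the opacity, $c_i$ the color, and $f_i \in \mathbb{R}^{32}$ a feature vector derived from DINOv2 features. Color $c_i$ and opacity $o_i$ are held fixed over time, while the pose $(\mu_i, R_i)$ is time-varying (Wang et al., 31 Jul 2025).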
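Motion is parameterized by a small set of $B$ rigid basis trajectories in $SE(3)$, with $B$ much smaller than the number of Gaussians. Each basis $b$ has per-time transforms $T_b(t) = [R_b(t) \mid \mathbf{t}_b(t)]$, and each Gaussian $i$ is attached to all bases with fixed weights $w_{i,b}$ that sum to 1 over $b$. Gaussian centers and orientations evolve through a linear-blend skinning–style combination of those bases:
$$\mu_i(t) = \sum_{b=1}^{B} w_{i,b}\,\big(R_b(t)\,\mu_i + \mathbf{t}_b(t)\big),$$
$$R_i(t) = \Big(\sum_{b=1}^{B} w_{i,b}\,R_b(t)\Big)\,R_i.$$
In practice, rotations are optimized as quaternions and the blend is implemented in a differentiable manner following standard LBS practice.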
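The blend of Gaussian centers described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `blend_centers`, the array layouts, and the toy values are assumptions made for illustration, and only centers (not orientations) are blended.

```python
import numpy as np

def blend_centers(mu, weights, R_basis, t_basis):
    """Move canonical centers with a weighted blend of rigid basis transforms.

    mu:       (N, 3) canonical Gaussian centers
    weights:  (N, B) fixed per-Gaussian blend weights, each row sums to 1
    R_basis:  (B, 3, 3) basis rotations at one time step
    t_basis:  (B, 3) basis translations at the same time step
    """
    # Apply every basis transform to every center: result has shape (N, B, 3).
    moved = np.einsum("bij,nj->nbi", R_basis, mu) + t_basis[None, :, :]
    # Blend the B candidate positions of each Gaussian with its weights.
    return np.einsum("nb,nbi->ni", weights, moved)

# Toy example (hypothetical values): basis 0 is the identity,
# basis 1 translates by +1 along x.
mu = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
R_basis = np.stack([np.eye(3), np.eye(3)])
t_basis = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

mu_t = blend_centers(mu, weights, R_basis, t_basis)
print(mu_t)
```

The first Gaussian follows only the identity basis and stays in place, while the second is attached equally to both bases and moves by half of the translation.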
Rendering is performed by tile-based 3D Gaussian rasterization with front-to-back alpha compositing. Along a camera ray $r$, if the per-pixel Gaussians are depth-sorted and have pre-multiplied opacities $\alpha_j$, the rendered color is
$$\hat{C}(r) = \sum_{j} c_j\,\alpha_j \prod_{k<j} (1 - \alpha_k),$$
with depth and feature maps computed analogously. The paper states that this is equivalent to a discretized volumetric rendering and that visibility and occlusions are handled by construction. Although MonoFusion is not a NeRF, its rendering is described as similar in spirit to volumetric rendering with transmittance and density.
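Front-to-back alpha compositing along a single ray can be sketched as follows. This is a generic illustration of the compositing rule, not the tile-based rasterizer used in practice; the function name `composite` and the sample values are hypothetical.

```python
import numpy as np

def composite(values, alphas):
    """Front-to-back alpha compositing along one ray.

    values: (K, C) per-Gaussian quantity (color, feature, or depth as C=1)
    alphas: (K,) effective opacities, sorted from nearest to farthest
    """
    # Transmittance before each sample: product of (1 - alpha) of all nearer samples.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    contrib = alphas * transmittance            # contribution weight of each sample
    return (contrib[:, None] * values).sum(axis=0), contrib

# Two samples on the ray: a red one in front of a green one, both half opaque.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
alphas = np.array([0.5, 0.5])
pixel, contrib = composite(colors, alphas)
print(pixel, contrib)
```

The same function applies to depth or feature maps by passing the per-Gaussian depths or features as `values`, which is what "computed analogously" amounts to in this sketch.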
3. Monocular priors and space–time depth alignment
MonoFusion’s fusion mechanism is built on two distinct feed-forward priors with complementary roles. DUSt3R provides a view-consistent static reference. For the images captured at a time $t$, or at a canonical time $t_0$, DUSt3R predicts per-image pointmaps $X_k$ and a global alignment. In MonoFusion, that global fit is constrained to the known intrinsics and extrinsics $(K_k, R_k, \mathbf{t}_k)$ of each camera $k$, yielding metric-scale, view-consistent 3D pointmaps and per-camera depth maps $D^{\mathrm{DUSt3R}}_{k,t}$.
The method characterizes DUSt3R as strong on static background and as providing a global reference frame that consistently ties all cameras together (Wang et al., 31 Jul 2025).
MoGe is used for accurate but relative monocular depth. For each frame-camera pair $(t, k)$, MoGe predicts a depth map $D^{\mathrm{MoGe}}_{k,t}$ that is defined only up to an affine transform. MonoFusion aligns this relative depth to the DUSt3R reference only on static background. Let $M^{\mathrm{bg}}_{k,t}$ be a background mask obtained using SAM 2 with light user prompting and temporal tracking. For each $(t, k)$, the method solves a scale-and-shift fit of the form
$$(a_{k,t}, b_{k,t}) = \arg\min_{a,\,b} \sum_{p \in M^{\mathrm{bg}}_{k,t}} \big(a\,D^{\mathrm{MoGe}}_{k,t}(p) + b - \bar{D}^{\mathrm{bg}}_{k}(p)\big)^2,$$
where $\bar{D}^{\mathrm{bg}}_{k}$ is a time-invariant background depth target for camera $k$, obtained by averaging DUSt3R background depths over time or selecting a reference time $t_0$. The rationale is that stationary cameras observe a static background, so background depths should be identical across time.
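A scale-and-shift fit restricted to background pixels can be sketched with ordinary least squares. The squared-error objective, the function name `fit_scale_shift`, and the synthetic data are assumptions for illustration; the paper's exact solver and robust weighting, if any, are not reproduced here.

```python
import numpy as np

def fit_scale_shift(rel_depth, target_depth, bg_mask):
    """Least-squares (a, b) such that a * rel_depth + b matches target_depth on background pixels."""
    x = rel_depth[bg_mask]
    y = target_depth[bg_mask]
    A = np.stack([x, np.ones_like(x)], axis=1)      # design matrix [depth, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

# Synthetic check (hypothetical values): the target equals 2 * rel + 0.5 on background.
rng = np.random.default_rng(0)
rel = rng.uniform(1.0, 5.0, size=(4, 6))
target = 2.0 * rel + 0.5
bg_mask = np.ones_like(rel, dtype=bool)
bg_mask[:, :2] = False          # pretend the first two columns are dynamic foreground
target[~bg_mask] = 0.0          # foreground reference depth is unreliable and ignored

a, b = fit_scale_shift(rel, target, bg_mask)
aligned = a * rel + b           # aligned depth for every pixel, foreground included
print(a, b)
```

The fit uses background pixels only, but the recovered scale and shift are applied to the whole frame, which is how the foreground inherits a consistent scale without being matched against the static reference.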
After alignment, the transformed depth $a_{k,t}\,D^{\mathrm{MoGe}}_{k,t} + b_{k,t}$ is unprojected into 3D using the known camera parameters. Static background points from all times are concatenated and averaged per pixel index to suppress occlusion artifacts and MoGe noise, while dynamic foreground remains frame-specific and is handled by motion optimization. The paper’s ablations identify this space–time alignment as crucial, reporting a PSNR improvement when naive monocular depth is replaced with DUSt3R+MoGe aligned depth.
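Unprojection and per-pixel background averaging can be sketched as below. The world-to-camera convention `x_cam = R @ x_world + t`, the helper names `unproject` and `average_background`, and the toy camera are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def unproject(depth, K, R, t):
    """Lift a depth map to world-space points, assuming x_cam = R @ x_world + t."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T              # camera-frame rays with z = 1
    cam = rays * depth.reshape(-1, 1)            # camera-frame 3D points
    world = (cam - t) @ R                        # row-vector form of R^T (x_cam - t)
    return world.reshape(H, W, 3)

def average_background(points, bg_masks):
    """Average per-pixel 3D points over the frames in which the pixel is background.

    points: (T, H, W, 3), bg_masks: (T, H, W) boolean
    """
    w = bg_masks[..., None].astype(float)
    count = w.sum(axis=0)
    mean = (points * w).sum(axis=0) / np.maximum(count, 1.0)
    return mean, count[..., 0] > 0               # averaged points and validity mask

# Toy camera (hypothetical): focal length 100 px, principal point (2, 1).
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 2.0)
pts = unproject(depth, K, np.eye(3), np.zeros(3))

frames = np.stack([pts, pts + 0.1])              # second frame carries a noise offset
masks = np.ones((2, 3, 5), dtype=bool)
masks[1, 0, 0] = False                           # pixel (0, 0) is occluded in frame 1
bg_points, valid = average_background(frames, masks)
```

Pixels that are background in several frames are averaged, while a pixel occluded in some frames falls back to the frames in which it is visible.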
This design also clarifies an important methodological distinction. DUSt3R is not used as a universal dynamic prior. The paper notes that DUSt3R tends to underfit humans and may pull people onto walls, which is why MonoFusion aligns MoGe to DUSt3R only on background. The foreground is then recovered through the shared 4D model, feature supervision, and motion constraints rather than through direct static multi-view matching.
4. Initialization and grouping-based motion discovery
Canonical geometry is initialized by unprojecting all aligned depth maps into the global frame. Background Gaussians are initialized densely by aggregating temporally averaged background points, while foreground Gaussians are initialized per frame and associated with the canonical set through the motion model (Wang et al., 31 Jul 2025).
A notable implementation choice is per-pixel multi-Gaussian initialization: five Gaussians per pixel rather than one. The paper states that this is used to capture details and reduce blurring. Each Gaussian’s 3D scale is initialized with a pixel-area heuristic, which ties the scale to the footprint of one pixel back-projected to the point's depth, roughly $d / f_x$ by $d / f_y$, where $d$ is depth and $f_x, f_y$ are focal lengths. This is reported to be much more stable than $k$-NN scale heuristics. Colors and opacities are initialized from the input image and fixed over time, which the paper reports empirically improves motion learning. For the depth alignment and background averaging stage, only DUSt3R points above a confidence threshold are used.
Motion grouping is initialized from DINOv2 features rather than from track velocities. Per-pixel DINOv2 features, with registers, are averaged over an image pyramid and reduced by PCA to 32 dimensions. Because each image pixel corresponds to a 3D point $x$, the 32-D feature is attached to that 3D point. K-means clustering in feature space produces $B$ cluster centers, and the per-Gaussian blend weights $w_{i,b}$ are initialized from distances to those centers and normalized to sum to 1. The intended effect is to group semantically similar parts, such as a left forearm, into one rigid unit.
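The grouping step can be sketched with a small k-means in feature space followed by a distance-based weight initialization. Several choices here are assumptions, not details from the paper: the farthest-point seeding (chosen to keep the sketch deterministic), the softmax over negative distances with a `temperature` parameter, and the synthetic 32-D features standing in for PCA-reduced DINOv2 features.

```python
import numpy as np

def farthest_point_init(feats, num_clusters):
    """Deterministic seeding: start at feats[0], then repeatedly add the farthest point."""
    centers = [feats[0]]
    for _ in range(num_clusters - 1):
        d = np.min([np.linalg.norm(feats - c, axis=1) for c in centers], axis=0)
        centers.append(feats[d.argmax()])
    return np.stack(centers).astype(float)

def kmeans(feats, num_clusters, iters=20):
    """Plain Lloyd iterations in feature space."""
    centers = farthest_point_init(feats, num_clusters)
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(num_clusters):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    return centers

def init_blend_weights(feats, centers, temperature=0.1):
    """Assumed mapping from distances to weights: softmax of negative distance."""
    d = np.linalg.norm(feats[:, None] - centers[None], axis=-1)
    logits = -d / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)          # each row sums to 1

# Synthetic stand-in for per-point features: two well-separated groups.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0.0, 0.05, (20, 32)),
                        rng.normal(1.0, 0.05, (20, 32))])
centers = kmeans(feats, num_clusters=2)
weights = init_blend_weights(feats, centers)
labels = weights.argmax(axis=1)
```

Points with similar features end up dominated by the same basis, which is the behavior the feature-based grouping is meant to produce before the weights and trajectories are refined during optimization.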
The basis trajectories are initialized to identity and optimized during training. The paper argues that feature-based grouping is more robust than velocity-based bases in sparse-view settings, because monocular depth flicker corrupts velocities. It further reports that fewer than approximately 20 bases lead to visible failures such as missing limbs and merged legs, whereas the paper's default number of bases works robustly.
5. Optimization objective and view consistency
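At each optimization step, MonoFusion samples a time $t$ and camera $k$, rasterizes RGB $\hat{I}$, feature map $\hat{F}$, silhouette or alpha $\hat{M}$, and depth $\hat{D}$, and compares them to the corresponding observations or priors $I$, $F$, $M$, and $D$. The reconstruction losses penalize the per-pixel discrepancy in each modality:
$$\mathcal{L}_{\mathrm{rgb}} = \|\hat{I} - I\|, \qquad \mathcal{L}_{\mathrm{feat}} = \|\hat{F} - F\|,$$
$$\mathcal{L}_{\mathrm{mask}} = \|\hat{M} - M\|,$$
$$\mathcal{L}_{\mathrm{depth}} = \|\hat{D} - D\|.$$
Here $\hat{F}$ is supervised by image-plane DINOv2 features, and $M$ is a foreground mask from SAM 2. The paper states that $D$ is the aligned, view- and time-consistent depth; on foreground it acts as a soft prior rather than a hard constraint (Wang et al., 31 Jul 2025).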
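To constrain dynamic motion, MonoFusion adds a local rigidity term over Gaussian centers, which can be written in the form
$$\mathcal{L}_{\mathrm{rigid}} = \sum_{i} \sum_{j \in \mathcal{N}(i)} \Big|\, \|\mu_i(t) - \mu_j(t)\| - \|\mu_i(t') - \mu_j(t')\| \,\Big|,$$
where $\mathcal{N}(i)$ denotes the neighbors of Gaussian $i$ and $t, t'$ are two sampled times. This preserves neighbor distances over time and discourages nonphysical shearing within a local rigid group while allowing different groups to move independently.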
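A distance-preservation penalty of this kind can be sketched as follows. The brute-force neighbor search, the mean absolute difference, and the helper names `knn` and `rigidity_loss` are assumptions for illustration; the paper's neighbor selection and weighting are not reproduced.

```python
import numpy as np

def knn(mu, k):
    """Indices of the k nearest neighbors of each center (brute force, self excluded)."""
    d = np.linalg.norm(mu[:, None] - mu[None], axis=-1)
    return np.argsort(d, axis=1)[:, 1:k + 1]

def rigidity_loss(mu_a, mu_b, neighbors):
    """Mean absolute change in neighbor distances between two time steps.

    mu_a, mu_b: (N, 3) Gaussian centers at two times
    neighbors:  (N, k) integer neighbor indices, fixed across time
    """
    d_a = np.linalg.norm(mu_a[:, None] - mu_a[neighbors], axis=-1)
    d_b = np.linalg.norm(mu_b[:, None] - mu_b[neighbors], axis=-1)
    return np.abs(d_a - d_b).mean()

rng = np.random.default_rng(0)
mu = rng.normal(size=(50, 3))
nbrs = knn(mu, k=4)

# A rigid motion (rotation about z plus translation) keeps all distances.
c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
loss_rigid = rigidity_loss(mu, mu @ R.T + np.array([0.3, -0.2, 1.0]), nbrs)

# A uniform stretch changes neighbor distances and is penalized.
loss_stretch = rigidity_loss(mu, 2.0 * mu, nbrs)
print(loss_rigid, loss_stretch)
```

Because the penalty is local, two separate groups of Gaussians can each move rigidly in different ways without incurring a cost, as long as the neighbor sets do not straddle the groups.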
Additional regularizers used in practice include a basis acceleration penalty computed in the Lie algebra of the basis transforms, a track smoothness term on the trajectories of Gaussian centers, a depth-gradient consistency term between rendered and aligned depth, an optional $z$-axis acceleration penalty, and a Gaussian scale variance regularizer.
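An acceleration penalty of the kind named above is commonly implemented as a second-order finite difference over time. The sketch below assumes the basis transforms have already been mapped to 6-D Lie-algebra (twist) coordinates and uses a mean absolute value; both the array layout and the norm are assumptions for illustration.

```python
import numpy as np

def acceleration_penalty(xi):
    """Second-order finite-difference penalty on basis trajectories.

    xi: (T, B, 6) twist coordinates of B basis transforms over T time steps
    """
    accel = xi[2:] - 2.0 * xi[1:-1] + xi[:-2]    # discrete acceleration, shape (T-2, B, 6)
    return np.abs(accel).mean()

t = np.arange(5.0)
v = np.ones((3, 6))                               # hypothetical constant twist direction

# Constant-velocity motion has zero discrete acceleration.
penalty_linear = acceleration_penalty(t[:, None, None] * v[None])

# Quadratic motion has constant discrete acceleration 2 * v.
penalty_quadratic = acceleration_penalty((t ** 2)[:, None, None] * v[None])
print(penalty_linear, penalty_quadratic)
```

Smooth, constant-velocity trajectories are left unpenalized, so the term discourages jitter from per-frame depth noise without biasing the bases toward being static.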
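The total objective is a weighted sum of these terms,
$$\mathcal{L} = \sum_{m} \lambda_m\,\mathcal{L}_m,$$
where $m$ ranges over the reconstruction losses, the rigidity term, and the additional regularizers, and $\lambda_m$ are scalar weights.
The implementation uses Adam and fixed weights across sequences; typical values are listed in the paper's appendix.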
The paper identifies two mechanisms for view consistency. First, DUSt3R supplies a single global reference frame and per-camera static background depth targets that are view-consistent. Second, the canonical Gaussian set and shared motion bases are optimized jointly against all cameras and times, forcing one 4D model to explain every observation. This suggests that MonoFusion’s consistency is not imposed by pairwise correspondence alone, but by coupling all views through a shared canonical scene and a shared motion basis.
6. Quantitative performance, runtime, and limitations
MonoFusion is evaluated on PanopticStudio and ExoRecon, a subset of Ego-Exo4D. On PanopticStudio, with 4 input views and 4 held-out views, the method reports PSNR, SSIM, LPIPS, and AbsRel on the full frame for held-out frames, outperforming MV-SOM and Dynamic 3DGS. On dynamic-only regions it reports PSNR, SSIM, LPIPS, and IoU. For novel-view extrapolation, it reports PSNR, SSIM, LPIPS, IoU, and AbsRel, again outperforming SOM, Dynamic 3DGS, and MV-SOM (Wang et al., 31 Jul 2025).
On ExoRecon, across 6 scenes, held-out frame performance is reported as PSNR, SSIM, LPIPS, and AbsRel on the full frame, and as PSNR, SSIM, LPIPS, and IoU on dynamic-only regions. Qualitatively, the method is described as avoiding the duplicate limbs and background bleeding common in per-view monocular fusions, and as yielding crisp dynamic details under extreme novel views.
Ablation studies identify several sensitivities. Space–time depth alignment is crucial. The feature-metric loss $\mathcal{L}_{\mathrm{feat}}$ improves motion segmentation and IoU, at the cost of a small PSNR drop for silhouettes. Freezing colors across time improves motion learning. Feature-based motion bases outperform velocity-based ones in the sparse-view setting. The number of bases matters: fewer than approximately 20 bases produces visible failures, whereas the paper's default number of bases is stable.
The reported runtime regime is sequence-level rather than online. Typical sequences are about 10 seconds at 30 fps, training takes about 30 minutes per sequence on a single NVIDIA A6000, and rendering runs at about 30 fps at the input resolution. The representation keeps separate sets of Gaussians for the dynamic foreground and the background. Memory is dominated by the background Gaussians, while scalability to longer videos is attributed to per-frame independent depth alignment and constant-size basis trajectories.
MonoFusion operates under explicit assumptions and has stated failure modes. Cameras are stationary, calibrated, and synchronized; no rolling shutter or timing offsets are modeled; bundle adjustment is not used, with the known camera parameters held fixed. The method relies on 2D foundation models, so failures in SAM 2 masks or MoGe depth can introduce artifacts, especially on thin structures or specular and cluttered backgrounds. Long occlusions can break mask tracking. If a body part is never observed from any view, reconstruction degrades. Calibration errors are not modeled. The paper also notes that better cross-view human priors, automatic dynamic mask discovery, optional camera refinement, more principled $SE(3)$ blending, and active camera placement are natural directions for future work.
These limitations clarify what MonoFusion is and is not. It is not a calibration-refining method, not a dense multi-view system adapted unchanged to sparse cameras, and not a per-frame static reconstructor with temporal post-processing. Rather, it is a sparse-view 4D reconstruction framework whose central mechanism is the alignment of monocular priors to a static global reference and their consolidation into a single canonical Gaussian field with shared motion bases.