
MonoFusion: Sparse-View 4D Scene Reconstruction

Updated 4 July 2026
  • MonoFusion is a sparse-view 4D dynamic scene reconstruction method that fuses independent monocular estimates into a unified model capturing geometry, appearance, and motion.
  • It integrates a canonical 3D Gaussian splatting representation with low-dimensional motion factorization to overcome challenges like limited cross-view overlap and occlusions.
  • The method leverages DUSt3R and MoGe priors to align static multi-view cues with monocular depth information, enhancing novel view synthesis and reconstruction fidelity.

MonoFusion is a method for sparse-view dynamic 3D scene reconstruction that fuses independent monocular reconstructions from a small set of static RGB cameras into a single time- and view-consistent 4D model. It is designed for capture rigs with only a handful of fixed cameras that see the entire scene but have little cross-view overlap, and it targets reconstruction of dynamic human behaviors in cluttered environments. The method combines a canonical 3D Gaussian Splatting representation, a static multi-view geometric reference from DUSt3R, monocular depth from MoGe, and a low-dimensional motion factorization with shared basis trajectories, with the stated goals of reconstructing geometry, appearance, and motion, enabling high-quality interpolation to held-out views, and synthesizing novel viewpoints far from any training camera (Wang et al., 31 Jul 2025).

1. Problem formulation and operating regime

MonoFusion assumes $K = 3\text{–}4$ static, inward-facing RGB cameras, roughly equidistant around the subject and approximately $90^\circ$ apart, recording synchronized videos of $T$ frames each. The intrinsics $K_k$ and extrinsics $P_k$ are known and fixed for all cameras, and the scene may contain a human acting in a cluttered environment with multiple moving parts. Under these conditions, the central challenge is not merely dynamic reconstruction, but dynamic reconstruction under very limited cross-view overlap, large baselines, severe occlusions, and strong reliance on calibration (Wang et al., 31 Jul 2025).

The method is positioned against two failure modes that arise in this regime. Standard multi-view SfM/MVS is described as failing or producing poor initializations when overlap is limited. Monocular priors, while strong, are inconsistent because of per-image affine depth ambiguity, so naive fusion leads to contradictions such as duplicated body parts. MonoFusion addresses this by aligning monocular estimates in both space and time and then optimizing a single shared 4D representation.

A common misconception is to treat sparse-view dynamic reconstruction as a straightforward reduction of dense multi-view capture. MonoFusion explicitly rejects that assumption. It states that dense multi-view reconstruction methods struggle to adapt to sparse-view setups because of limited overlap between viewpoints. This suggests that the method’s contribution lies less in replacing one rendering primitive with another than in constructing a cross-view alignment procedure that remains stable when classical cross-view correspondences are weak or absent.

2. Canonical scene representation and motion factorization

MonoFusion uses an explicit canonical 3D Gaussian Splatting representation augmented with feature fields and low-dimensional motion bases. At the canonical time $t_0$, the scene is represented as a set of $N$ 3D Gaussians with parameters

$$\Theta^{(i)} = \{x_0^{(i)} \in \mathbb{R}^3,\; R_0^{(i)} \in SO(3),\; s^{(i)} \in \mathbb{R}^3,\; \alpha^{(i)} \in \mathbb{R},\; c^{(i)} \in \mathbb{R}^3,\; f^{(i)} \in \mathbb{R}^{N_f}\}.$$

Here $N_f = 32$ is the dimension of the per-Gaussian feature vector $f$, which is derived from DINOv2 features. Color $c$ and opacity $\alpha$ are held fixed over time, while pose is time-varying (Wang et al., 31 Jul 2025).

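Motion is parameterized by a small set of $B$ rigid basis trajectories in $SE(3)$, with $B \ll N$. Each basis $b$ has per-time transforms $\mathbf{T}^{(b)}_{0 \to t} \in SE(3)$, and each Gaussian $i$ is attached to all bases with fixed weights $w^{(i,b)}$. Gaussian centers and orientations evolve through a linear-blend skinning–style combination of those bases: $\mathbf{T}^{(i)}_{0 \to t} = \sum_{b=1}^{B} w^{(i,b)}\, \mathbf{T}^{(b)}_{0 \to t}$, with $\mathbf{T}^{(i)}_{0 \to t} = [\,R^{(i)}_{0 \to t} \mid \tau^{(i)}_{0 \to t}\,]$, so that

$$x_t^{(i)} = R^{(i)}_{0 \to t}\, x_0^{(i)} + \tau^{(i)}_{0 \to t}, \qquad R_t^{(i)} = R^{(i)}_{0 \to t}\, R_0^{(i)}.$$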

In practice, rotations are optimized as quaternions and the blend is implemented in a differentiable manner following standard LBS practice.
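The blending of basis motions described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`rot_z`, `apply_rigid`, `blend_center`) are hypothetical, it uses plain Python lists, it assumes the blend weights sum to 1, and it only moves Gaussian centers (orientation blending with quaternions is omitted).

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis (used only to build example bases)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(R, t, x):
    """Return R @ x + t for a 3x3 rotation R, translation t, and point x."""
    return [sum(R[r][c] * x[c] for c in range(3)) + t[r] for r in range(3)]

def blend_center(x0, weights, bases):
    """Linear-blend-skinning-style update of one Gaussian center.

    x0      : canonical center (3 floats)
    weights : per-basis blend weights, assumed to sum to 1
    bases   : list of (R, t) rigid transforms at the queried time
    """
    out = [0.0, 0.0, 0.0]
    for w, (R, t) in zip(weights, bases):
        moved = apply_rigid(R, t, x0)
        out = [o + w * m for o, m in zip(out, moved)]
    return out

# Two example bases: one stays put, one translates by 2 along x.
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
shifted = (rot_z(0.0), [2.0, 0.0, 0.0])
center = blend_center([1.0, 0.0, 0.0], [0.5, 0.5], [identity, shifted])
```

With equal weights the center lands halfway between the two basis motions. Because every Gaussian shares the same few bases, the number of motion parameters grows with the number of bases and frames rather than with the number of Gaussians.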

Rendering is performed by tile-based 3D Gaussian rasterization with front-to-back alpha compositing. Along a camera ray $r$, if per-pixel Gaussians are depth-sorted and have pre-multiplied opacities $\alpha_i$, rendered color is

$$\hat{C}(r) = \sum_{i} c_i\, \alpha_i \prod_{j<i} (1 - \alpha_j),$$

with depth and feature maps computed analogously. The paper states that this is equivalent to a discretized volumetric rendering and that visibility and occlusions are handled by construction. Although MonoFusion is not a NeRF, its rendering is described as similar in spirit to volumetric rendering with transmittance and density.
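Front-to-back alpha compositing for a single pixel can be sketched as below. This is a generic illustration of the standard compositing rule, not the tile-based rasterizer itself; the function name `composite_front_to_back` and the list-based inputs are assumptions made for the example.

```python
def composite_front_to_back(colors, alphas):
    """Composite depth-sorted samples for one pixel.

    colors : list of RGB triples, sorted from nearest to farthest
    alphas : matching per-sample opacities in [0, 1]
    Returns the blended color and the transmittance left behind the last sample.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha
        out = [o + weight * ch for o, ch in zip(out, color)]
        transmittance *= (1.0 - alpha)
    return out, transmittance

# A half-opaque red sample in front of a half-opaque blue sample.
pixel, remaining = composite_front_to_back(
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.5, 0.5]
)
```

The nearer sample contributes with weight 0.5 and the farther one with weight 0.25, which is how occlusion is handled by construction. Replacing the color triples with per-sample depths or feature vectors yields the depth and feature maps in the same way.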

3. Monocular priors and space–time depth alignment

MonoFusion’s fusion mechanism is built on two distinct feed-forward priors with complementary roles. DUSt3R provides a view-consistent static reference. For images at time $t$, or at a canonical time $t_0$, DUSt3R predicts per-image pointmaps and a global alignment. In MonoFusion, that global fit is constrained to the known intrinsics and extrinsics $(K_k, P_k)$, yielding metric-scale, view-consistent 3D pointmaps and depth maps for every camera $k$.

The method characterizes DUSt3R as strong on static background and as providing a global reference frame that consistently ties all cameras together (Wang et al., 31 Jul 2025).

MoGe is used for accurate but relative monocular depth. For each frame-camera pair $(t, k)$, MoGe predicts a depth map $\tilde{D}_k^t$ that is defined only up to an affine transform. MonoFusion aligns this relative depth to the DUSt3R reference only on static background. Let $M_k^t$ be a background mask obtained using SAM 2 with light user prompting and temporal tracking. For each $(t, k)$, the method solves

$$(a_k^t, b_k^t) = \arg\min_{a,\, b} \sum_{p \in M_k^t} \left( a\, \tilde{D}_k^t(p) + b - \bar{D}_k(p) \right)^2,$$

where $\bar{D}_k$ is a time-invariant background depth target for camera $k$, obtained by averaging DUSt3R background depths over time or selecting a single reference frame. The rationale is that stationary cameras observe a static background, so background depths should be identical across time.
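A scale-and-shift fit of relative depth to a metric reference on background pixels can be sketched as follows. The sketch assumes an ordinary least-squares objective, flattened depth maps given as Python lists, and a hypothetical function name `fit_affine_depth`; the paper's exact solver and any robust weighting are not reproduced here.

```python
def fit_affine_depth(rel_depth, ref_depth, bg_mask):
    """Least-squares fit of a * rel_depth + b to ref_depth on masked pixels.

    rel_depth : relative (affine-ambiguous) depths, flattened
    ref_depth : metric reference depths for the same pixels
    bg_mask   : truthy where the pixel is static background
    """
    xs = [d for d, m in zip(rel_depth, bg_mask) if m]
    ys = [r for r, m in zip(ref_depth, bg_mask) if m]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two background pixels")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        raise ValueError("background depths are constant; scale is undetermined")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

# Three background pixels define the fit; the fourth (foreground) is excluded.
rel = [0.1, 0.2, 0.3, 0.9]
ref = [2.2, 2.4, 2.6, 9.9]
mask = [1, 1, 1, 0]
a, b = fit_affine_depth(rel, ref, mask)
aligned = [a * d + b for d in rel]
```

The foreground pixel does not influence the fit, yet it is mapped through the same scale and shift. This mirrors the idea of estimating the alignment on static background only and then applying it to the whole frame.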

After alignment, the MoGe depth, scaled and shifted by the fitted affine parameters, is unprojected into 3D using the known camera parameters. Static background points from all times are concatenated and averaged per pixel index to suppress occlusion artifacts and MoGe noise, while dynamic foreground remains frame-specific and is handled by motion optimization. The paper’s ablations identify this space–time alignment as crucial, reporting a PSNR improvement when naive monocular depth is replaced with DUSt3R+MoGe aligned depth.
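Unprojecting an aligned depth value into the shared world frame can be sketched with a pinhole model. The sketch assumes a world-to-camera extrinsic convention (`x_cam = R @ x_world + t`), no lens distortion, and hypothetical function names `unproject` and `project`; the source does not state which convention the paper uses.

```python
def unproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with metric depth to a world-space point.

    Assumes x_cam = R @ x_world + t, so x_world = R^T @ (x_cam - t).
    """
    cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    diff = [cam[i] - t[i] for i in range(3)]
    # Multiply by the transpose of R (the inverse of a rotation matrix).
    return [sum(R[r][c] * diff[r] for r in range(3)) for c in range(3)]

def project(world, fx, fy, cx, cy, R, t):
    """Project a world point back to pixel coordinates and depth."""
    cam = [sum(R[r][c] * world[c] for c in range(3)) + t[r] for r in range(3)]
    return fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy, cam[2]

# Example camera: identity rotation, shifted by one unit along the optical axis.
R_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = unproject(420.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0, R_id, [0.0, 0.0, 1.0])
```

Because all cameras share one world frame through their known extrinsics, points unprojected from different cameras can be concatenated directly, which is what makes the per-pixel background averaging meaningful.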

This design also clarifies an important methodological distinction. DUSt3R is not used as a universal dynamic prior. The paper notes that DUSt3R tends to underfit humans and may pull people onto walls, which is why MonoFusion aligns MoGe to DUSt3R only on background. The foreground is then recovered through the shared 4D model, feature supervision, and motion constraints rather than through direct static multi-view matching.

4. Initialization and grouping-based motion discovery

Canonical geometry is initialized by unprojecting all aligned depth maps into the global frame. Background Gaussians are initialized densely by aggregating temporally averaged background points, while foreground Gaussians are initialized per frame and associated with the canonical set through the motion model (Wang et al., 31 Jul 2025).

A notable implementation choice is per-pixel multi-Gaussian initialization: five Gaussians per pixel rather than one. The paper states that this is used to capture details and reduce blurring. Each Gaussian’s 3D scale is initialized with a pixel-area heuristic that depends on the pixel’s depth and the camera focal lengths. This is reported to be much more stable than $k$-NN scale heuristics. Colors and opacities are initialized from the input image and fixed over time, which the paper reports empirically improves motion learning. For the depth alignment and background averaging stage, only DUSt3R points above a confidence threshold are used.

Motion grouping is initialized from DINOv2 features rather than from track velocities. Per-pixel DINOv2 features, with registers, are averaged over an image pyramid and reduced by PCA to 32 dimensions. Because each image pixel corresponds to an unprojected 3D point, the 32-D feature is attached to that 3D point. K-means clustering in feature space produces one cluster center per motion basis, and the per-Gaussian blend weights are initialized from distances to those centers and normalized to sum to 1. The intended effect is to group semantically similar parts, such as a left forearm, into one rigid unit.
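The grouping step can be sketched as k-means in feature space followed by a distance-to-weight mapping. This is an illustrative sketch only: the softmax over negative distances, the `temperature` parameter, the deterministic initialization, and the function names are assumptions, since the source says only that weights are derived from distances to the centers and normalized to sum to 1.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Plain Lloyd iterations; centers start at the first k points."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def init_blend_weights(feature, centers, temperature=1.0):
    """Map distances to cluster centers to weights that sum to 1 (assumed softmax)."""
    logits = [-math.sqrt(sq_dist(feature, c)) / temperature for c in centers]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-D features forming two well-separated groups.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centers = kmeans(features, k=2)
weights = init_blend_weights(features[0], centers)
```

A point close to one center receives almost all of its weight from that basis, so points with similar features start out moving together. The weights remain soft, which leaves room for optimization to reassign points near part boundaries.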

The basis trajectories are initialized to identity and optimized during training. The paper argues that feature-based grouping is more robust than velocity-based bases in sparse-view settings, because monocular depth flicker corrupts velocities. It further reports that fewer than approximately 20 bases lead to visible failures such as missing limbs and merged legs, whereas the larger count used in the paper works robustly.

5. Optimization objective and view consistency

At each optimization step, MonoFusion samples a time $t$ and camera $k$, rasterizes RGB $\hat{I}$, feature map $\hat{F}$, silhouette or alpha $\hat{M}$, and depth $\hat{D}$, and compares them to the corresponding observations or priors $I$, $F$, $M$, and $D$. Each reconstruction loss penalizes the discrepancy between a rendered quantity and its target.

Here $\hat{F}$ is supervised by image-plane DINOv2 features, and $M$ is a foreground mask from SAM 2. The paper states that $D$ is the aligned, view- and time-consistent depth; on foreground it acts as a soft prior rather than a hard constraint (Wang et al., 31 Jul 2025).

To constrain dynamic motion, MonoFusion adds a local rigidity term over Gaussian centers. This preserves neighbor distances over time and discourages nonphysical shearing within a local rigid group while allowing different groups to move independently.
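A distance-preserving rigidity penalty of this kind can be sketched as follows. The sketch assumes neighbors are chosen as the k nearest centers in the canonical frame and that deviations are penalized quadratically; both choices, and the names `knn_indices` and `rigidity_loss`, are illustrative assumptions rather than the paper's definition.

```python
import math

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]

def rigidity_loss(canonical, current, k=2):
    """Mean squared change in neighbor distances between two time steps."""
    total, count = 0.0, 0
    for i in range(len(canonical)):
        for j in knn_indices(canonical, i, k):
            d_canonical = dist(canonical[i], canonical[j])
            d_current = dist(current[i], current[j])
            total += (d_current - d_canonical) ** 2
            count += 1
    return total / count if count else 0.0

canonical = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
translated = [[x + 3.0, y - 1.0, z + 2.0] for x, y, z in canonical]  # rigid motion
stretched = [[2.0 * x, y, z] for x, y, z in canonical]               # non-rigid motion
loss_rigid = rigidity_loss(canonical, translated)
loss_stretch = rigidity_loss(canonical, stretched)
```

A pure translation leaves every neighbor distance unchanged and incurs no penalty, while stretching along one axis is penalized. Restricting the sum to nearby neighbors is what lets distant parts, such as two arms, move independently.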

Additional regularizers used in practice include basis acceleration penalization in the Lie algebra, track smoothness, depth-gradient consistency, an optional $z$-axis acceleration penalty, and a Gaussian scale variance regularizer.

The total objective is a weighted sum of the reconstruction, rigidity, and regularization terms.

The implementation uses Adam and fixed weights across sequences; typical weight values are cited in the paper's appendix.

The paper identifies two mechanisms for view consistency. First, DUSt3R supplies a single global reference frame and per-camera static background depth targets that are view-consistent. Second, the canonical Gaussian set and shared motion bases are optimized jointly against all cameras and times, forcing one 4D model to explain every observation. This suggests that MonoFusion’s consistency is not imposed by pairwise correspondence alone, but by coupling all views through a shared canonical scene and a shared motion basis.

6. Quantitative performance, runtime, and limitations

MonoFusion is evaluated on PanopticStudio and ExoRecon, a subset of Ego-Exo4D. On PanopticStudio, with 4 input views and 4 held-out views, the method is evaluated on held-out frames by PSNR, SSIM, LPIPS, and AbsRel on the full frame, where it outperforms MV-SOM and Dynamic 3DGS. On dynamic-only regions it is evaluated by PSNR, SSIM, LPIPS, and IoU. For novel-view extrapolation, it is evaluated by PSNR, SSIM, LPIPS, IoU, and AbsRel, again outperforming SOM, Dynamic 3DGS, and MV-SOM (Wang et al., 31 Jul 2025).

On ExoRecon, across 6 scenes, held-out frame performance is reported as PSNR, SSIM, LPIPS, and AbsRel on the full frame, and as PSNR, SSIM, LPIPS, and IoU on dynamic-only regions. Qualitatively, the method is described as avoiding duplicate limbs and background bleeding common in per-view monocular fusions, and as yielding crisp dynamic details under extreme novel views.

Ablation studies identify several sensitivities. Space–time depth alignment is crucial. The feature-metric loss improves motion segmentation and IoU, at the cost of a small PSNR drop for silhouettes. Freezing colors across time improves motion learning. Feature-based motion bases outperform velocity-based ones in the sparse-view setting. The number of bases matters: fewer than approximately 20 bases produces visible failures, whereas the larger count used in the paper is stable.

The reported runtime regime is sequence-level rather than online. Typical sequences are about 10 seconds at 30 fps, training takes about 30 minutes per sequence on a single NVIDIA A6000, and rendering runs at about 30 fps at the input resolution. The representation uses separate sets of Gaussians for dynamic foreground and static background. Memory is dominated by the background Gaussians, while scalability to longer videos is attributed to depth alignment that is independent for each frame-camera pair and to constant-size basis trajectories.

MonoFusion operates under explicit assumptions and has stated failure modes. Cameras are stationary, calibrated, and synchronized; no rolling shutter or timing offsets are modeled; bundle adjustment is not used, with the camera parameters held fixed. The method relies on 2D foundation models, so failures in SAM 2 masks or MoGe depth can introduce artifacts, especially on thin structures or specular and cluttered backgrounds. Long occlusions can break mask tracking. If a body part is never observed from any view, reconstruction degrades. Calibration errors are not modeled. The paper also notes that better cross-view human priors, automatic dynamic mask discovery, optional camera refinement, more principled $SE(3)$ blending, and active camera placement are natural directions for future work.

These limitations clarify what MonoFusion is and is not. It is not a calibration-refining method, not a dense multi-view system adapted unchanged to sparse cameras, and not a per-frame static reconstructor with temporal post-processing. Rather, it is a sparse-view 4D reconstruction framework whose central mechanism is the alignment of monocular priors to a static global reference and their consolidation into a single canonical Gaussian field with shared motion bases.
