
GenMOJO: Advanced 4D Scene Reconstruction

Updated 30 June 2025
  • GenMOJO is a generative framework that reconstructs and synthesizes dynamic 4D scenes using object-centric generative priors and differentiable 3D Gaussian decomposition.
  • It jointly optimizes all scene components via photometric loss and object-conditioned diffusion priors to achieve realistic novel view synthesis and robust occlusion handling.
  • Empirical evaluations on challenging datasets show state-of-the-art performance in metrics like PSNR, CLIP, and LPIPS, alongside superior point trajectory tracking.

GenMOJO is a generative framework for reconstructing and synthesizing dynamic 4D (three-dimensional spatial plus time) scenes from monocular, multi-object videos, particularly designed to address the challenges posed by object interactions, severe occlusions, and scene complexity. GenMOJO integrates object-centric generative priors, differentiable 3D Gaussian scene decomposition, and joint optimization of all scene components to produce accurate, temporally consistent reconstructions and realistic novel view renderings in cluttered environments (arXiv:2506.12716).

1. Scene Representation and Decomposition

GenMOJO represents each foreground object and the background in a scene as sets of deformable 3D Gaussians. This parameterization assigns to each object a collection of Gaussians whose locations, scales, orientations, and per-Gaussian properties (e.g., color, opacity) are differentiable with respect to the model’s loss functions. The scene is decomposed as follows:

  • The input is a monocular RGB video, for which camera poses and per-frame object segmentations/tracking are assumed to be available.
  • Foreground-background separation is achieved by per-frame tracking, with a distinct set of Gaussians initialized for each object instance.
  • Each Gaussian instance is allowed to deform over time, parameterized by a neural deformation field that predicts its spatial transformation (position, orientation, scale) for each frame.

This explicit object decomposition, unlike previous dynamic NeRF-like methods, supports the integration of powerful object-level generative priors and facilitates independent hallucination of occluded or unobserved regions.
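A minimal PyTorch sketch may help make this parameterization concrete. All field names, tensor sizes, and the deformation MLP below are illustrative assumptions, not GenMOJO's actual implementation:

```python
import torch
import torch.nn as nn

class ObjectGaussians(nn.Module):
    """Illustrative sketch of one object's deformable 3D Gaussian set."""

    def __init__(self, num_gaussians: int):
        super().__init__()
        # Canonical (time-independent) Gaussian parameters, all optimizable.
        self.means = nn.Parameter(torch.randn(num_gaussians, 3))       # positions
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))  # per-axis scales
        self.quats = nn.Parameter(                                     # orientations
            torch.tensor([[1.0, 0.0, 0.0, 0.0]]).repeat(num_gaussians, 1))
        self.colors = nn.Parameter(torch.rand(num_gaussians, 3))       # RGB
        self.logit_opacity = nn.Parameter(torch.zeros(num_gaussians, 1))

        # A small MLP deformation field: (canonical position, time) -> position offset.
        # GenMOJO's deformation network also predicts orientation/scale changes;
        # only translation is shown here for brevity.
        self.deform = nn.Sequential(
            nn.Linear(3 + 1, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3),
        )

    def at_time(self, t: float) -> torch.Tensor:
        """Return deformed Gaussian means for normalized frame time t in [0, 1]."""
        t_col = torch.full((self.means.shape[0], 1), t)
        return self.means + self.deform(torch.cat([self.means, t_col], dim=-1))
```

One such module would be instantiated per tracked object instance, plus one for the background; the losses and splatting described below operate jointly on these parameters.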

2. Joint Optimization and Generative Diffusion Priors

The learning objective of GenMOJO jointly optimizes all object Gaussians and background parameters by combining two forms of supervision:

  • Photometric/Rendering Loss: For each frame, the rendered 2D image (produced by compositing all Gaussians via differentiable splatting and alpha blending) is compared against the observed image under the known camera pose. This loss directly enforces frame-level fidelity.
  • Object-centric Diffusion Prior (“Score Distillation Sampling”): Each object’s Gaussians are optimized with an object-conditioned diffusion model (e.g., Zero-1-to-3) as a prior, using Score Distillation Sampling (SDS). The SDS loss is given by

$$\nabla_\phi \mathcal{L}^t_\text{SDS} = \mathbb{E}_{t,\tau,\epsilon,p}\left[ w(\tau) \left( \epsilon_\theta(\hat{I}_t^p;\, \tau, I_1, p) - \epsilon \right) \frac{\partial \hat{I}_t^p}{\partial \phi} \right]$$

where $\tau$ is the diffusion timestep, $\epsilon_\theta$ is the noise predictor conditioned on the reference view $I_1$ and relative pose $p$, $\hat{I}_t^p$ is the rendered object image at frame $t$ under pose $p$, and $\phi$ denotes the Gaussian parameters. The diffusion model thereby provides gradient guidance for generating plausible appearances in both observed and unseen (novel) viewpoints.
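In practice, SDS gradients of this form are commonly realized via a detached surrogate loss. The sketch below follows that standard pattern; `diffusion_eps` stands in for an object-conditioned noise predictor such as Zero-1-to-3 (its signature and the linear noise schedule here are assumptions for illustration):

```python
import torch

def sds_surrogate_loss(rendered, diffusion_eps, cond_image, pose, num_steps=1000):
    """One SDS step. `rendered` is a (B, C, H, W) differentiable render of one
    object at a sampled pose. Returns a scalar whose gradient w.r.t. the
    Gaussian parameters matches the SDS gradient above."""
    b = rendered.shape[0]
    tau = torch.randint(1, num_steps, (b,))            # random diffusion timestep
    # Simple linear alpha-bar schedule; the real model defines its own.
    alpha_bar = (1.0 - tau.float() / num_steps).view(b, 1, 1, 1)
    eps = torch.randn_like(rendered)
    noisy = alpha_bar.sqrt() * rendered + (1 - alpha_bar).sqrt() * eps

    with torch.no_grad():                              # no grad through the prior
        eps_pred = diffusion_eps(noisy, tau, cond_image, pose)

    w = 1.0 - alpha_bar                                # one common choice of w(tau)
    # Detach trick: the gradient of this sum w.r.t. `rendered` is exactly
    # w * (eps_pred - eps), which then backpropagates through d(rendered)/d(phi).
    return (w * (eps_pred - eps).detach() * rendered).sum()
```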

  • Differentiable transformations between object-centric and frame-centric coordinates bridge the gap between the canonical spaces required by generative diffusion priors and the global scene’s camera space. For object $i$ in frame $t$, this includes affine warps parameterized by geometric estimates (e.g., bounding boxes, depth) and further refined during optimization:

$$\mathbf{p}^i_t = \mathcal{C}^r - (\mathcal{C}^r - \boldsymbol{\mu}^i_t)\cdot k_i, \qquad \mathbf{s}^i_t = \mathbf{s}^i_t \cdot k_i$$

Here, $k_i$ is a relative depth-based scale, $\mathcal{C}^r$ is the camera position, and $\boldsymbol{\mu}^i_t$ is the base Gaussian mean.
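A direct transcription of this warp as code (a hedged sketch; the argument names are illustrative):

```python
import torch

def object_to_frame(mu_obj, scales, camera_pos, k):
    """Push object-centric Gaussians along the camera ray by a relative
    depth-based factor k_i, rescaling their extents to match.

    mu_obj:     (N, 3) object-centric Gaussian means for object i at frame t
    scales:     (N, 3) Gaussian scales
    camera_pos: (3,)   camera position C^r
    k:          scalar relative-depth scale k_i (from bounding-box/depth estimates)
    """
    p = camera_pos - (camera_pos - mu_obj) * k   # p_t^i = C^r - (C^r - mu_t^i) * k_i
    s = scales * k                               # s_t^i = s_t^i * k_i
    return p, s
```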

These strategies allow GenMOJO to leverage learned object priors to fill in unseen regions, disambiguate occlusions, and produce more realistic geometry.

3. Joint Splatting and Occlusion Modeling

GenMOJO introduces a global “joint splatting” procedure in which all objects’ and background’s Gaussians are composited in a unified forward pass for each frame. The principal innovations are:

  • Occlusion-aware rendering: By jointly rendering all Gaussians, the model faithfully captures inter-object occlusions and correct visibility ordering. This contrasts with past work where object renderings are composited post-hoc, often leading to artifacts such as incorrect depth ordering and ghosting.
  • Supervision on joint renderings: RGB photometric losses, mask/segmentation losses, and trajectory-based losses are all computed with respect to the unified splatted image (containing all mutual occlusions and interactions), providing more physically consistent training signals.

This methodology directly addresses the failure cases observed in DreamScene4D and related baselines, where independent optimization of objects led to inconsistent scene structure and tracking drift.
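Conceptually, joint splatting amounts to pooling every object's (and the background's) projected Gaussians and compositing them in a single depth-ordered pass, so occlusion is resolved globally rather than by pasting per-object renders together afterwards. The simplified per-pixel sketch below illustrates the idea; a real splatter sorts per tile and rasterizes 2D Gaussian footprints:

```python
import torch

def joint_composite(colors, alphas, depths):
    """Depth-sorted front-to-back alpha compositing over ALL pooled Gaussians.

    colors: (N, H, W, 3) per-Gaussian projected color contributions
    alphas: (N, H, W)    per-Gaussian projected opacities in [0, 1]
    depths: (N,)         per-Gaussian camera-space depth
    """
    order = torch.argsort(depths)                 # nearest Gaussians first
    colors, alphas = colors[order], alphas[order]
    out = torch.zeros(colors.shape[1:3] + (3,))
    transmittance = torch.ones(colors.shape[1:3])
    for c, a in zip(colors, alphas):              # "over" compositing per layer
        out += (transmittance * a).unsqueeze(-1) * c
        transmittance *= (1.0 - a)
    return out
```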

4. Temporal Deformation and 4D Reconstruction

To account for dynamics, each object's Gaussian parameters are evolved over time using a learnable deformation network. At each time step:

  • The current deformation parameters predict the instantaneous transformation of each Gaussian.
  • The rendered scene reflects both the spatial and temporal evolution of all components, allowing for coherent 4D (3D + time) reconstructions.
  • Point trajectories (in both 2D and 3D) are generated by tracking the means of selected Gaussians over time under the learned deformations, producing both visualizations and quantitative benchmarks for tracking accuracy.

This design enables not only plausible scene generation but also accurate motion tracking, even in the presence of severe occlusions where prior methods struggle.
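Continuing the illustrative `ObjectGaussians` sketch from Section 1, 3D point trajectories can be read out by evaluating each object's deformation field at every frame time and recording the means of selected Gaussians (projection to 2D tracks with camera intrinsics is omitted):

```python
import torch

def gaussian_trajectories(objects, num_frames, track_idx):
    """objects:   list of ObjectGaussians (one per object), from the earlier sketch
    track_idx: indices of the Gaussians to follow within each object
    Returns a (num_frames, num_tracked_points, 3) tensor of 3D trajectories."""
    trajectories = []
    for t in range(num_frames):
        time = t / max(num_frames - 1, 1)         # normalized frame time in [0, 1]
        frame_pts = torch.cat(
            [obj.at_time(time)[track_idx] for obj in objects], dim=0)
        trajectories.append(frame_pts)
    return torch.stack(trajectories)
```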

5. Evaluation and Empirical Results

GenMOJO’s efficacy is substantiated using challenging benchmark datasets (DAVIS, MOSE-PTS), which feature multiple occluding objects and complex dynamics:

  • Novel view synthesis: GenMOJO achieves state-of-the-art results on PSNR, CLIP, and LPIPS metrics, as well as in perceptual user studies. On the MOSE dataset it reports PSNR 25.56 (vs. 22.98 for DreamScene4D), CLIP 85.41 (vs. 85.16), and LPIPS 0.168 (lower is better); a PSNR sketch follows this list.
  • Point trajectory tracking: GenMOJO establishes the lowest tracking errors among unsupervised/optimization-based methods and performs competitively or better than supervised trackers in occlusion-heavy scenes.
  • Ablations: Each architectural choice—joint splatting, object-centric SDS, and mask-based supervision—is critically important; removing any degrades both photometric and tracking performance.
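Of these metrics, PSNR is a direct pixel-space computation, sketched below; CLIP and LPIPS instead compare deep features and require their respective pretrained networks:

```python
import torch

def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio between a rendered frame and the ground-truth
    frame (higher is better), for images with pixel values in [0, max_val]."""
    mse = torch.mean((rendered - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```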

A summary of comparative benchmarks:

| Method          | CLIP (↑) | PSNR (↑) | LPIPS (↓) | User Pref. |
|-----------------|----------|----------|-----------|------------|
| Consistent4D    | 77.78    | 22.59    | 0.172     | –          |
| DreamGaussian4D | 81.96    | 17.82    | 0.195     | –          |
| DreamScene4D    | 85.16    | 22.98    | 0.169     | 36.8%      |
| GenMOJO         | 85.41    | 25.56    | 0.168     | 63.2%      |

6. Comparison with Prior Work

GenMOJO advances the field beyond several state-of-the-art systems:

  • Consistent4D: Fails in cluttered, multi-object scenarios and delivers weak geometry and appearance.
  • DreamGaussian4D: Lacks explicit object decomposition and object-level priors, resulting in deficient completion and tracking.
  • DreamScene4D: Uses object decomposition but with independent optimization and compositing, leading to occlusion artifacts and temporal inconsistency.

GenMOJO’s critical advances include:

  • Simultaneous, joint optimization of all scene objects for consistent modeling of mutual occlusions.
  • Exploitation of object-centric diffusion priors for plausible inference in unobserved views.
  • End-to-end differentiable alignment between object and scene coordinates, ensuring accurate and stable integration across modules.

7. Significance, Limitations, and Prospective Directions

GenMOJO provides the first unified framework for 4D scene generation that combines differentiable scene representation, object-centric generative modeling, and robust occlusion-aware learning. The principal significance lies in its ability to generate high-fidelity, temporally coherent novel views and accurate point tracks from monocular real-world video, advances that have been verified on complex multi-object datasets.

A plausible implication is that GenMOJO’s object-centric joint optimization with generative priors can be extended to broader classes of dynamic scenes, or potentially to interactive 3D/4D scene editing scenarios.

The framework presupposes accurate per-frame object segmentation/tracking and camera pose estimation as input; errors at this preprocessing step may propagate through to final reconstructions. Furthermore, as with generative diffusion models, computational cost can be substantial compared to classical rendering-only optimization.

GenMOJO’s approach currently relies on priors trained predominantly on single-object imagery (e.g., Zero-1-to-3); scaling this mechanism to jointly trained multi-object generative models may afford further gains in realism and generalization.


GenMOJO establishes a new state of the art for generative 4D scene modeling from monocular video by combining joint, object-level optimization, strong generative priors, and differentiable representations. The result is accurate, visually compelling reconstruction in challenging real-world scenarios. For additional details and demonstration results, see the GenMOJO project page: https://genmojo.github.io/.
