DIMO: Diverse 3D Motion Generation for Arbitrary Objects

Published 10 Nov 2025 in cs.CV | (2511.07409v1)

Abstract: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. During inference time with learned latent space, we can instantly sample diverse 3D motions in a single-forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a novel parametric pipeline that samples diverse 3D motions directly from a static image by leveraging large-scale video diffusion models.
It employs neural keypoint trajectory factorization and a unified latent space to encapsulate complex motion patterns with high fidelity.
The approach achieves up to 6x–8x improvements in diversity and efficiency over baselines, enabling rapid and photorealistic 4D content generation.

Diverse 3D Motion Generation for Arbitrary Objects: The DIMO Framework

Introduction and Motivation

The "DIMO: Diverse 3D Motion Generation for Arbitrary Objects" framework introduces a novel parametric generative pipeline for instant and diverse 3D motion synthesis from a single static image of any general object. Unlike prior methods that target category-specific articulated models (e.g., SMPL for humans) or require repeated per-motion optimization runs, DIMO distills object-specific motion priors from large-scale video diffusion models, canonically encodes them via neural keypoint trajectories, and jointly models diverse motion patterns in a shared low-dimensional latent space. This enables direct sampling of multiple plausible dynamic 3D instances in a single forward pass, supporting applications such as language-guided animation, motion interpolation in latent space, and fast test-motion reconstruction. The approach demonstrates significantly increased diversity, fidelity, efficiency, and extensibility compared to established 4D generation baselines.

Figure 1: The DIMO pipeline leverages video model priors, auto-generated motion prompts, latent keypoint factorization, and 3D Gaussian splatting for unified geometry and appearance modeling.

Prior Distillation and Data Generation

DIMO replaces traditional motion-capture or template-driven datasets with automated pipelines for motion prior distillation. Multimodal LLMs (Llama3, GPT-4o) generate structured "meta" prompts and a set of diverse text descriptions of potential actions, based on a reference static image. These prompts are then passed into state-of-the-art text-conditioned image-to-video models (CogVideoX) to generate a large set of candidate motion-rich video clips. A filter pipeline combines MLLM-based quality assessment and optical flow-based motion magnitude ranking to curate the set, ensuring coverage of variable, plausible dynamics for each object.

To enable robust spatiotemporal and geometric modeling, multi-view generative video models (SV3D, SV4D) are used to further provide novel viewpoints and stronger 3D priors, facilitating multi-motion, multi-view, multi-frame joint training.

Unified Latent Space and Neural Keypoint Trajectories

Central to DIMO is the construction of a shared low-dimensional latent space encoding the diversity of 3D motion patterns for each target object. This is achieved via explicit neural keypoint trajectory factorization. Each motion is represented by a sparse set of canonical 3D keypoints, each parameterized with position, control radius, and an RBF kernel influencing nearby points. Their time-varying poses are encoded as sequences of 6DoF SE(3) transformations (translation, quaternion rotation). The set of keypoints and their trajectories define the motion basis for all generated instances.

A graph structure connects keypoints based on temporal trajectory similarity (KNN in concatenated positions), allowing topology preservation. A latent-conditioned MLP-based motion decoder maps the motion-specific latent vector, keypoint position, and timestep to the appropriate transformation. Latent codes are optimized via the VAE reparameterization trick, enforcing smooth diversity and continuity through KL regularization.

Geometry and Appearance Modeling via 3D Gaussian Splatting

To couple motion dynamics with object-specific geometric details and appearance, canonical 3D Gaussian primitives are attached to keypoints and further distributed in proximity. Each Gaussian is parameterized by center, rotation, scale, opacity, and SH color. Deformation at each timestep is performed via Linear Blend Skinning (LBS), driven by the nearest keypoint poses and graph edges, resulting in temporally coherent and topologically faithful geometry.

Differentiable rasterization using splatting-based rendering allows efficient photometric supervision across multiple views and timestamps, supporting direct optimization of geometry, keypoints, and latent codes.

Optimization Strategies and Training Regimes

Training utilizes a two-stage coarse-to-fine schedule to disentangle motion factorization from fine geometry optimization. In the first stage, keypoints, latent codes, and decoder are jointly pre-trained to capture robust motion bases and latent embedding continuity, with periodic keypoint resampling and densification. The second stage initializes canonical Gaussians near each keypoint and refines parameters for photorealistic appearance and detailed motions, recycling keypoint trajectories for guidance via Chamfer distance regularization.

Multi-loss training combines RGB, mask, perceptual, depth smoothness, normal smoothness, ARAP rigidity, and KL regularization. Batch sampling across multiple motions, views, and frames provides strong constraints for learning shared structures. Render resolution is annealed for stable convergence.

Figure 2: Qualitative DIMO results illustrating instant generation of diverse and photorealistic 3D motions and 4D contents from the same object image under novel viewpoints and timestamps.

Comparative Performance and Experimental Results

Extensive benchmark comparisons against DG4D, 4DGen, Animate124, STAG4D, SV4D, Consistent4D, and 4D-fy show clear superiority of DIMO in terms of diversity, alignment, fidelity, and efficiency. In the 3D Motion Generation task (Figure 3), DIMO is able to synthesize a variety of high-quality plausible motions for each object, where baselines often fail to produce noticeable motion diversity or suffer identity drift due to repeated diffusion sampling.

Figure 3: DIMO generates high-fidelity, diverse 3D motions per object, outperforming baseline models in pronounced and realistic dynamic behavior.

For image-to-4D and video-to-4D tasks (Figure 4), quantitative and qualitative metrics (CLIP-I, CLIP-F, LPIPS, FVD, user studies) show DIMO surpasses others by large margins, with up to 6x-8x improvement in motion diversity and a drastic reduction in required training time (often achieving results in minutes rather than hours).

Figure 4: DIMO produces photorealistic, multi-motion 4D contents for both synthetic and in-the-wild objects, outperforming existing methods in visual and temporal coherence.

For text-to-4D tasks, DIMO's latent space is compatible with language embedding projections, enabling direct feed-forward generation of plausible, text-specified dynamics at interactive speeds. The framework achieves over 80% text-motion alignment and motion diversity in user studies, versus under 10-16% for prior approaches.

Applications: Latent Motion Interpolation and Language-Guided Generation

DIMO demonstrates continuous and meaningful latent space interpolation, allowing synthesis of novel coherent motion sequences between two anchors. Language-guided motion generation is supported via a learned mapping from text embeddings (BERT) into latent codes, yielding plausible animation driven by compact language prompts.

Figure 5: Linear interpolation in DIMO’s latent motion space produces smooth and novel transitions between distinct sampled motion sequences.

Test motion reconstruction is supported by fine-tuning the latent code for unknown videos, providing high-fidelity alignment and rapid convergence.

Ablation and Model Analysis

Ablation studies confirm the necessity of motion factorization, the latent space, coarse-to-fine pre-training, and multi-motion joint training. Removal of any component causes significant degradation in diversity, spatiotemporal consistency, and fidelity. Joint multi-motion training enhances model robustness and preserves object identity and appearance across dynamic behaviors.

Limitations and Future Directions

The reliance on state-of-the-art video diffusion models for prior distillation makes DIMO’s performance contingent on ongoing advances in generative models and multimodal prompting. Current two-stage separation of latent code learning and language mapping could be unified for more efficient end-to-end training. Extension to interactively editable motion content and fine-grained physical realism is a prospective research direction.

Conclusion

DIMO provides a scalable, generalizable, and efficient framework for diverse 3D motion synthesis for arbitrary objects, supporting instant sampling of rich dynamic behaviors from a single image. Its motion latent space, neural keypoint factorization, and joint optimization strategy offer clear advantages for geometric and appearance fidelity, diversity, and real-time applications. The work represents a substantive advancement towards universal data-driven parametric models akin to SMPL, but for any object class with minimal supervision, opening new directions for generative modeling, interactive animation, and content creation at scale.

Markdown