3DiMo: 3D-Aware Implicit Motion Control
- 3DiMo is a framework for 3D-aware implicit motion control that leverages a dual transformer-based motion encoder and a DiT-based latent diffusion video generator.
- The system uses cross-attention to inject compact, view-agnostic motion tokens, ensuring robust 3D consistency and flexible text-driven camera control.
- Empirical evaluations demonstrate 3DiMo’s superior perceptual scores and motion realism compared to state-of-the-art methods in dynamic video synthesis.
3DiMo is a framework for 3D-aware implicit motion control in view-adaptive human video generation. Rather than translating motion directly from 2D pose trajectories or explicit 3D parametric models, it advocates a view-agnostic, learnable representation of human motion. 3DiMo addresses the limitations of prior control-signal paradigms by pairing a fully implicit motion encoding with a pretrained large-scale video generative model: it optimizes for cross-view consistency and supports flexible, text-driven camera control by exploiting the latent 3D spatial priors of the video generator rather than relying on externally reconstructed constraints (Fang et al., 3 Feb 2026).
1. Architectural Structure and Model Components
3DiMo jointly trains two distinct modules:
- Pretrained Video Generator G: A DiT-based Latent Diffusion Model (LDM) backbone serves as the generative core, with a causal 3D-VAE compressing frames into latents. The backbone alternates self-attention and feed-forward layers across space-time, processing video, text, and reference-image latents together. A flow-based diffusion objective (v-prediction) is used for training.
- Implicit Motion Encoder: Two transformer-based tokenizers are trained, one for body motion and one for hand motion. Each encoder ingests a sequence of driving video frames, augmented for appearance and perspective, patchifies each frame, prepends learnable "motion" tokens, and processes the result via transformer blocks. Only the learnable motion tokens are retained at each encoder's output, yielding compact per-frame motion representations; the body and hand token streams are concatenated and injected into the video generator through cross-attention.
The architecture is depicted schematically in Figure 1 of (Fang et al., 3 Feb 2026), which aligns each driving frame with its representation pathway into the generative backbone.
2. Implicit View-Agnostic Motion Tokenization
The core motion encoding utilizes the following process:
- For each time step $t$, the driving frame $x_t$ is augmented via random perspective warps, color jitter, and scaling, yielding $\tilde{x}_t = \mathrm{Aug}(x_t)$.
- Patchify: $p_t = \mathrm{Patchify}(\tilde{x}_t) \in \mathbb{R}^{N_p \times d}$, a sequence of $N_p$ patch embeddings.
- Initialize $m^0 \in \mathbb{R}^{N_m \times d}$, a set of $N_m$ learnable latent tokens.
- Each transformer layer $\ell$ receives the concatenation $[m^{\ell-1}; p_t^{\ell-1}]$ and processes it via standard attention and feed-forward operations; only the motion slots $m^L$ are retained after the last layer $L$.
- For all frames, stack the outputs: $Z_{\mathrm{body}} = [m^L_1, \dots, m^L_T]$ and $Z_{\mathrm{hand}}$ analogously, with $Z = [Z_{\mathrm{body}}; Z_{\mathrm{hand}}]$.
By discarding spatial patch outputs and only retaining motion latents, the encoder is forced to distill motion in a view-agnostic, compact form, with the augmentations aiding invariance to viewpoint and appearance.
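The tokenization above can be sketched in a minimal numpy form. All shapes, layer counts, and names here are illustrative assumptions, not the paper's actual architecture; the point is the mechanism of prepending learnable tokens and discarding the patch outputs:

```python
# Minimal sketch of the implicit motion tokenizer (shapes/names are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patchify(frame, patch=8):
    # frame: (H, W, C) -> (num_patches, patch*patch*C)
    H, W, C = frame.shape
    gh, gw = H // patch, W // patch
    p = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def attn_layer(tokens, W_qkv, W_o):
    # single-head self-attention with residual (feed-forward omitted for brevity)
    q, k, v = np.split(tokens @ W_qkv, 3, axis=-1)
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + (a @ v) @ W_o

def encode_motion(frames, n_motion=4, d=32, n_layers=2, patch=8):
    in_dim = patch * patch * frames[0].shape[-1]
    W_in = rng.normal(0, 0.02, (in_dim, d))        # patch embedding
    motion0 = rng.normal(0, 0.02, (n_motion, d))   # learnable motion tokens
    Ws = [(rng.normal(0, 0.02, (d, 3 * d)), rng.normal(0, 0.02, (d, d)))
          for _ in range(n_layers)]
    out = []
    for f in frames:
        tokens = np.concatenate([motion0, patchify(f, patch) @ W_in], axis=0)
        for W_qkv, W_o in Ws:
            tokens = attn_layer(tokens, W_qkv, W_o)
        out.append(tokens[:n_motion])  # discard patch outputs, keep motion latents
    return np.stack(out)               # (T, n_motion, d)

frames = [rng.random((32, 32, 3)) for _ in range(5)]
Z = encode_motion(frames)
print(Z.shape)  # (5, 4, 32)
```

The key design choice is the information bottleneck: because only the prepended slots survive to the output, gradients force all motion-relevant content into those few tokens, while the augmentations make that content invariant to view and appearance.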
3. Semantic Motion Injection via Cross-Attention
At each DiT block within G, after full self-attention across video, text, and reference latents, a cross-attention mechanism is introduced in which only the video latents read from the motion tokens $Z$: $h_{\mathrm{vid}} \leftarrow h_{\mathrm{vid}} + \mathrm{CrossAttn}(Q = h_{\mathrm{vid}},\, K = Z,\, V = Z)$.
This decouples spatial alignment between the generator and the motion encoder, enabling selective, semantically controlled motion injection without rigid correspondence; text and reference tokens are excluded from this cross-attention, ensuring modular control.
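A minimal sketch of this selective injection follows; the function name and shapes are assumptions, but it demonstrates the stated property that text and reference latents pass through untouched while video latents receive a residual read of the motion tokens:

```python
# Sketch of motion injection: only video latents query the motion tokens;
# text/reference latents bypass the cross-attention entirely (names assumed).
import numpy as np

rng = np.random.default_rng(1)
d = 16

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def motion_cross_attn(video, text, ref, motion, Wq, Wk, Wv):
    # video: (Nv, d) queries; motion: (Nm, d) keys/values
    q = video @ Wq
    k, v = motion @ Wk, motion @ Wv
    video = video + softmax(q @ k.T / np.sqrt(d)) @ v   # residual update
    return video, text, ref                             # text/ref untouched

video, text, ref = rng.random((8, d)), rng.random((4, d)), rng.random((2, d))
motion = rng.random((6, d))
Wq, Wk, Wv = (rng.normal(0, 0.1, (d, d)) for _ in range(3))
v2, t2, r2 = motion_cross_attn(video, text, ref, motion, Wq, Wk, Wv)
print(np.allclose(t2, text), np.allclose(r2, ref))  # True True
```

Because the motion pathway touches only the video stream, text conditioning (including camera instructions) remains an independent control channel.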
4. Training Regime: View-Rich Supervision and Auxiliary Losses
A multi-stage curriculum and loss design enforce robust 3D awareness:
- View-Rich Data: Three data regimes are employed (Figure 2 in (Fang et al., 3 Feb 2026)):
- Single-view (internet-scale): for self-reconstruction.
- Multi-view synchronous captures: enabling cross-view motion supervision.
- Moving-camera trajectories: for camera-motion and cross-view consistency.
- Curriculum:
- Stage 1 (0–10k steps): Single-view data only; reconstruct to the same view with objective $\mathcal{L}_{\mathrm{same}}$.
- Stage 2 (10k–25k): Mix of single- and cross-view data, with combined objective $\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{same}} + (1-\lambda)\,\mathcal{L}_{\mathrm{cross}}$, where $\mathcal{L}_{\mathrm{cross}}$ penalizes reconstruction discrepancy on a different view and $\lambda$ decays from 1 to 0.5.
- Stage 3 (25k–30k): All cross-view, $\mathcal{L} = \mathcal{L}_{\mathrm{cross}}$.
- Core loss: The LDM is always trained with a v-prediction diffusion (flow-matching) loss of the standard form $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\epsilon}\big[\lVert v_\theta(z_t, t, c) - (\epsilon - z_0)\rVert_2^2\big]$, where $z_t$ interpolates between the clean latent $z_0$ and noise $\epsilon$, and $c$ denotes the conditioning (text, reference, motion tokens).
- Auxiliary Geometric Supervision: Early in training, a lightweight MLP head decodes the motion tokens into local pose parameters $\hat{\theta}$, supervised by pseudo-ground-truth labels $\theta^{*}$ from SMPL/MANO estimators. Only local body/hand pose (not root orientation) is supervised, to prevent view leakage. The geometric loss term $\mathcal{L}_{\mathrm{geo}} = \lVert \hat{\theta} - \theta^{*} \rVert_2^2$ is weighted by $\lambda_{\mathrm{geo}}$, linearly annealed from $0.1$ to $0$ by step 12k. The total loss is $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{geo}}\,\mathcal{L}_{\mathrm{geo}}$.
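The schedule and combined objective can be sketched as follows. The linear decay shapes, the rectified-flow velocity target, and all function names are assumptions layered on the stated step counts and weight ranges, not the paper's verified implementation:

```python
# Hedged sketch of the training schedule and combined loss. Linear decay
# shapes and the velocity target (eps - z0) are assumptions.
import numpy as np

def view_mix_lambda(step):
    # Stage 1 (<10k): same-view only; Stage 2 (10k-25k): decay 1.0 -> 0.5;
    # Stage 3 (>=25k): cross-view only.
    if step < 10_000:
        return 1.0
    if step < 25_000:
        return 1.0 - 0.5 * (step - 10_000) / 15_000
    return 0.0

def lambda_geo(step):
    # auxiliary geometric weight: linear anneal 0.1 -> 0 by step 12k
    return max(0.0, 0.1 * (1.0 - step / 12_000))

def diffusion_loss(z0, eps, v_pred):
    # v-prediction target under a rectified-flow interpolation: v* = eps - z0
    return float(np.mean((v_pred - (eps - z0)) ** 2))

def total_loss(z0, eps, v_pred, pose_pred, pose_gt, step):
    l_geo = float(np.mean((pose_pred - pose_gt) ** 2))
    return diffusion_loss(z0, eps, v_pred) + lambda_geo(step) * l_geo

rng = np.random.default_rng(0)
z0, eps = rng.random(64), rng.random(64)
pose_pred, pose_gt = rng.random(10), rng.random(10)
print(view_mix_lambda(17_500))                                    # 0.75
print(total_loss(z0, eps, eps - z0, pose_pred, pose_gt, 12_000))  # 0.0
```

By step 12k the geometric term has vanished entirely, so the pseudo-GT pose labels only bootstrap the motion representation rather than constrain it long-term.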
5. Text-Driven Camera Control and Inference Protocol
The DiT-based generator natively supports composite text prompts that specify both subject and camera action. At inference:
- A reference image (e.g., from the first driving frame) is supplied.
- Driving frames are encoded by the body and hand motion encoders (without augmentation) to produce the motion token sequence.
- A text prompt encodes both identity and camera motion (e.g., "camera circles around subject 360° at waist height").
- The generator G synthesizes a video from these inputs, with motion cross-attention ensuring motion realism while the text pathway modulates the camera viewpoint.
This protocol supports not only realistic motion transfer but also text-conditioned novel-view re-rendering as illustrated in qualitative examples from Figures 1 and 4.
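The protocol above can be condensed into a pseudocode-level sketch. Every interface here (the generator and encoder call signatures, argument names) is a hypothetical stand-in, since the text does not specify concrete APIs; only the data flow is taken from the protocol:

```python
# Pseudocode-level sketch of the inference protocol; all call signatures
# are hypothetical stand-ins illustrating the data flow only.
def animate(generator, body_encoder, hand_encoder, driving_frames, ref_image, prompt):
    z_body = body_encoder(driving_frames)   # encoded WITHOUT augmentation
    z_hand = hand_encoder(driving_frames)
    motion_tokens = z_body + z_hand         # concatenated body + hand token streams
    return generator(
        reference=ref_image,                # appearance/identity anchor
        text=prompt,                        # subject + camera action
        motion=motion_tokens,               # injected via cross-attention
    )

# Toy stubs to illustrate the flow:
gen = lambda reference, text, motion: {"ref": reference, "text": text, "motion": motion}
out = animate(gen,
              lambda f: [("body", x) for x in f],
              lambda f: [("hand", x) for x in f],
              driving_frames=[0, 1],
              ref_image="frame0",
              prompt="camera orbits subject")
print(out["motion"])  # [('body', 0), ('body', 1), ('hand', 0), ('hand', 1)]
```

Because motion and text enter through separate pathways, the same driving clip can be re-rendered under different camera instructions by changing only the prompt.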
6. Empirical Evaluation and Ablations
Quantitative comparisons of 3DiMo against AnimateAnyone, MimicMotion, MTVCrafter, and Uni3C (Table 1, (Fang et al., 3 Feb 2026)) on static-camera internet videos yield the following:
| Method | SSIM ↑ | PSNR ↑ | LPIPS ↓ | FID ↓ | FVD ↓ |
|---|---|---|---|---|---|
| AnimateAnyone | 0.7325 | 17.21 | 0.2754 | 68.72 | 862.5 |
| MimicMotion | 0.7051 | 16.83 | 0.3286 | 62.45 | 628.2 |
| MTVCrafter | 0.7489 | 18.03 | 0.2542 | 57.21 | 379.6 |
| Uni3C | 0.7185 | 17.53 | 0.2639 | 41.28 | 321.9 |
| 3DiMo | 0.7390 | 17.96 | 0.2206 | 36.92 | 297.4 |
3DiMo achieves the lowest LPIPS (perceptual distance), FID (visual fidelity), and FVD (video quality) scores, with marginal SSIM/PSNR differences attributed to viewpoint sensitivity.
A user study with 30 raters using 5-point Likert scales for cross-identity motion transfer shows 3DiMo outperforming all baselines in motion accuracy (4.28±0.08), naturalness (4.18±0.06), 3D plausibility (4.05±0.09), and overall rating (4.38±0.08).
Ablations confirm the importance of:
- Implicit view-agnostic motion encoder vs. SMPL token inputs.
- View-rich multi-stage supervision schedule.
- Cross-attention for motion token injection (superior to channel concatenation).
- Early-stage auxiliary geometric supervision for convergence.
- Dual-scale body/hand encoding for fine-grained motion transfer (Table 3, Fig. 5).
7. Context, Significance, and Prospective Implications
3DiMo establishes a learning regime that harnesses the intrinsic 3D priors of large pretrained video diffusion models, while bypassing the limitations of pose-centric control or fixed parametric models that are vulnerable to depth ambiguity and misalignment in dynamic contexts. The framework demonstrates qualitative improvements in depth understanding, nuanced motion transfer, and viewpoint adaptivity under text-driven camera control, advancing capabilities for human-centric video generation. A plausible implication is the emergence of more flexible, view-agnostic motion representations as standard protocol in future controllable video synthesis research. Furthermore, the scheduled reduction of auxiliary supervision illustrates how parametric guidance can initialize—but not dominate—the learning of implicit 3D structure, reinforcing the salience of learned spatiotemporal priors over fixed model constraints (Fang et al., 3 Feb 2026).