
3DiMo: 3D-Aware Implicit Motion Control

Updated 10 February 2026
  • 3DiMo is a framework for 3D-aware implicit motion control that leverages a dual transformer-based motion encoder and a DiT-based latent diffusion video generator.
  • The system uses cross-attention to inject compact, view-agnostic motion tokens, ensuring robust 3D consistency and flexible text-driven camera control.
  • Empirical evaluations demonstrate 3DiMo’s superior perceptual scores and motion realism compared to state-of-the-art methods in dynamic video synthesis.

3DiMo is a framework for 3D-aware implicit motion control in view-adaptive human video generation. It advocates a view-agnostic, learnable representation of human motion rather than a direct translation from 2D pose trajectories or explicit 3D parametric models. 3DiMo addresses the limitations of prior control-signal paradigms by integrating a fully implicit motion encoding with a pretrained large-scale video generative model, optimizing for cross-view consistency and supporting flexible, text-driven camera control by leveraging the latent 3D spatial priors of the video generator rather than relying on externally reconstructed constraints (Fang et al., 3 Feb 2026).

1. Architectural Structure and Model Components

3DiMo jointly trains two distinct modules:

  • Pretrained Video Generator G: A DiT-based Latent Diffusion Model (LDM) backbone serves as the generative core, with a causal 3D-VAE compressing frames into latents. The backbone alternates self-attention and feed-forward layers across space-time, processing video, text, and reference-image latents together. A flow-based diffusion objective (v-prediction) is used for training.
  • Implicit Motion Encoder: Two transformer-based tokenizers are trained, $E_b$ for body motion and $E_h$ for hand motion. Each encoder ingests a sequence of driving video frames $\{I_D^t\}_{t=0}^T$, augmented for appearance and perspective, patchifies each frame, prepends $K$ learnable "motion" tokens, and processes them via $L$ transformer blocks. Only the $K$ latent tokens per encoder are retained at the output, yielding compact motion representations $z_b, z_h \in \mathbb{R}^{K \times d}$; these are concatenated as $z \in \mathbb{R}^{2K \times d}$ and injected into the video generator $G$ through cross-attention.

The architecture is depicted schematically in Figure 1 of (Fang et al., 3 Feb 2026), aligning each input-driving frame with its representation pathway to the generative backbone.
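The interface between the two modules can be sketched at the level of tensor shapes. The sizes below ($K$, $d$) are illustrative placeholders, not the paper's actual configuration:

```python
import numpy as np

# Illustrative sizes only; the paper's K and d are not stated in this summary.
K, d = 16, 64          # motion tokens per encoder, token dimension
rng = np.random.default_rng(0)

# Stand-ins for the retained outputs of body encoder E_b and hand encoder E_h:
z_b = rng.standard_normal((K, d))   # body motion tokens
z_h = rng.standard_normal((K, d))   # hand motion tokens

# The two token sets are concatenated before injection into the generator G:
z = np.concatenate([z_b, z_h], axis=0)
print(z.shape)   # (32, 64), i.e. (2K, d)
```

The key property is that the generator only ever sees this compact $(2K, d)$ token matrix, never the driving pixels themselves.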

2. Implicit View-Agnostic Motion Tokenization

The core motion encoding utilizes the following process:

  • For each time $t$, frames are augmented via random perspective warps, color jitter, and scaling, denoted $\mathrm{Aug}(I_D^t)$.
  • Patchify: $X^t = \mathrm{patchify}(\mathrm{Aug}(I_D^t)) \in \mathbb{R}^{M \times d_v}$.
  • Initialize $L^0 \in \mathbb{R}^{K \times d}$, a set of learnable latent tokens.
  • Each transformer layer $\ell$ receives $T^\ell = \mathrm{concat}(L^{\ell-1}, X^t)$ and processes it via standard attention and feed-forward operations; only $L^L$ is retained after the last layer.
  • For all frames, stack outputs: $Z = E_b(\{\mathrm{Aug}(I_D^t)\}_{t=0}^T)$, $Z' = E_h(\{\mathrm{Aug}(I_D^t)\}_{t=0}^T)$, with $z = [Z; Z']$.

By discarding spatial patch outputs and only retaining motion latents, the encoder is forced to distill motion in a view-agnostic, compact form, with the augmentations aiding invariance to viewpoint and appearance.
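The tokenization steps above can be sketched with a toy single-head transformer in NumPy; the sizes, random weights, and single-head attention are simplifying assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(tokens, d, rng):
    # Single-head self-attention with a residual connection (toy stand-in
    # for a full transformer block with feed-forward sublayer).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, Kmat, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    A = softmax(Q @ Kmat.T / np.sqrt(d))
    return tokens + A @ V

def encode_frame(patches, latent_tokens, n_layers, rng):
    """One encoder step: prepend K latent tokens to the M patch tokens,
    run the layers, then keep ONLY the K latent positions -- the spatial
    patch outputs are discarded, forcing motion into the latents."""
    K, d = latent_tokens.shape
    tokens = np.concatenate([latent_tokens, patches], axis=0)  # (K+M, d)
    for _ in range(n_layers):
        tokens = attention_layer(tokens, d, rng)
    return tokens[:K]  # retained motion latents only

rng = np.random.default_rng(0)
K, d, M, n_layers = 4, 8, 16, 2                  # toy sizes
latents = rng.standard_normal((K, d)) * 0.02     # learnable L^0
patches = rng.standard_normal((M, d))            # patchify(Aug(I_D^t))
z_t = encode_frame(patches, latents, n_layers, rng)
print(z_t.shape)   # (4, 8)
```

Note that the output size is independent of $M$: however many patches a frame has, only $K$ motion tokens survive.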

3. Semantic Motion Injection via Cross-Attention

At each DiT block within $G$, after full self-attention across video, text, and reference latents, a cross-attention mechanism is introduced where only video latents can read from motion tokens:

$$
\begin{aligned}
Q_v &= W_q X_v \\
K_m &= W_k z, \quad V_m = W_v z \\
A &= \text{Softmax}\!\left(\frac{Q_v K_m^T}{\sqrt{d}}\right) \\
X_v' &= A V_m
\end{aligned}
$$

This decouples spatial alignment between the generator and the motion encoder, enabling selective, semantically controlled motion injection without rigid correspondence; text and reference tokens are excluded from this cross-attention, ensuring modular control.
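A minimal NumPy sketch of this cross-attention, with toy sizes and random weights (the real model uses learned projections inside each DiT block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
Nv, Nm, d = 6, 8, 16                 # video latents, motion tokens, dim (toy)
X_v = rng.standard_normal((Nv, d))   # video latents inside a DiT block
z = rng.standard_normal((Nm, d))     # concatenated motion tokens [z_b; z_h]

Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Q_v = X_v @ Wq                       # queries come ONLY from video latents
K_m, V_m = z @ Wk, z @ Wv            # keys/values come from motion tokens
A = softmax(Q_v @ K_m.T / np.sqrt(d))   # (Nv, Nm) attention map
X_v_new = A @ V_m                       # video latents read motion content
```

Because text and reference tokens never appear in `Q_v`, `K_m`, or `V_m`, the motion pathway stays isolated from the other conditioning signals.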

4. Training Regime: View-Rich Supervision and Auxiliary Losses

Multi-stage curriculum and loss design enforce robust 3D awareness:

  • View-Rich Data: Three data regimes are employed (Figure 2 in (Fang et al., 3 Feb 2026)):
    • Single-view (internet-scale): for self-reconstruction.
    • Multi-view synchronous captures: enabling cross-view motion supervision.
    • Moving-camera trajectories: for camera-motion and cross-view consistency.
  • Curriculum:

    • Stage 1 (0–10k steps): Single-view, reconstruct to the same view,

      $$L_\text{recon} = \mathbb{E}_{V_D}\left[\|G(z; I_R, T) - V_D\|_2^2\right]$$

    • Stage 2 (10k–25k steps): Mix of single- and cross-view, combined objective

      $$L_\text{stage2} = \alpha L_\text{recon} + (1-\alpha) L_\text{cross}$$

      where $L_\text{cross}$ penalizes discrepancy on a different view; $\alpha$ decays from 1 to 0.5.

    • Stage 3 (25k–30k steps): All cross-view, $\alpha = 0$.

  • Core loss: The LDM is always trained with a v-prediction diffusion loss:

$$L_\text{diff} = \mathbb{E}_{t, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, \text{text}, I_R, z)\|_2^2\right]$$

  • Auxiliary Geometric Supervision: Early in training, a lightweight MLP head $D_g$ decodes $z$ into pose parameters $(\hat{\theta}_b, \hat{\theta}_h)$, supervised by pseudo-GT labels from SMPL/MANO estimators. Only local body/hand pose (not root orientation) is supervised to prevent view leakage. Geometric loss term:

$$L_\text{geo} = \|\hat{\theta}_b - \theta_b^\text{GT}\|_2^2 + \|\hat{\theta}_h - \theta_h^\text{GT}\|_2^2$$

is weighted by $\lambda_\text{geo}(t)$, linearly annealed from $0.1$ to $0$ by step 12k. The total loss is:

$$L_\text{total} = L_\text{diff} + L_\text{recon/cross} + \lambda_\text{geo}(t)\,L_\text{geo}$$
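The schedule can be summarized in a few lines of Python; linear interpolation within stage 2 and the exact handling at the stage boundaries are assumptions consistent with the description above:

```python
def alpha_schedule(step):
    """Mixing weight between same-view and cross-view losses.
    Stage 1 (0-10k): alpha = 1 (pure same-view reconstruction).
    Stage 2 (10k-25k): alpha decays linearly 1 -> 0.5 (assumed linear).
    Stage 3 (25k-30k): alpha = 0 (all cross-view)."""
    if step < 10_000:
        return 1.0
    if step < 25_000:
        return 1.0 - 0.5 * (step - 10_000) / 15_000
    return 0.0

def lambda_geo(step):
    """Auxiliary geometric-loss weight, linearly annealed 0.1 -> 0 by 12k."""
    return max(0.0, 0.1 * (1.0 - step / 12_000))

def total_loss(l_diff, l_recon, l_cross, l_geo, step):
    # Combines the diffusion loss, the stage-weighted recon/cross terms,
    # and the annealed geometric supervision.
    a = alpha_schedule(step)
    return l_diff + a * l_recon + (1 - a) * l_cross + lambda_geo(step) * l_geo
```

By step 25k, `total_loss` depends only on the diffusion and cross-view terms, so the geometric head acts as an initializer rather than a persistent constraint.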

5. Text-Driven Camera Control and Inference Protocol

The DiT-based generator natively supports composite text prompts that specify both subject and camera action. At inference:

  • A reference image $I_R$ (e.g., from the first driving frame) is supplied.
  • Driving frames are encoded via $E_b, E_h$ (without augmentation) to produce $z$.
  • A text prompt $T$ encodes both identity and camera motion (e.g., "camera circles around subject 360° at waist height").
  • $G$ synthesizes a video from these inputs, with cross-attention ensuring motion realism while the text pathway modulates camera viewpoint.

This protocol supports not only realistic motion transfer but also text-conditioned novel-view re-rendering as illustrated in qualitative examples from Figures 1 and 4.
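The inference protocol above can be sketched as a thin wrapper; every callable below is a hypothetical stand-in for $E_b$, $E_h$, and $G$, not a released API:

```python
def run_3dimo_inference(driving_frames, reference_image, prompt,
                        encode_body, encode_hand, generate):
    """Hypothetical wrapper sketching the protocol: encode_body/encode_hand
    stand in for E_b/E_h and generate stands in for G."""
    z_b = encode_body(driving_frames)   # frames are NOT augmented at inference
    z_h = encode_hand(driving_frames)
    z = z_b + z_h                       # concatenate the two token lists
    # G reads z via cross-attention; the text prompt carries both subject
    # identity and camera action (e.g. "camera circles around subject").
    return generate(z, reference_image, prompt)

# Toy stand-ins to exercise the control flow:
frames = ["frame0", "frame1"]
out = run_3dimo_inference(
    frames, "ref.png", "camera circles around subject",
    encode_body=lambda f: [("body", len(f))],
    encode_hand=lambda f: [("hand", len(f))],
    generate=lambda z, ref, prompt: {"tokens": z, "ref": ref, "prompt": prompt},
)
```

The design point is that camera control enters only through `prompt`, while subject motion enters only through the tokens, so the same `z` can be re-rendered under different camera instructions.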

6. Empirical Evaluation and Ablations

Quantitative comparisons of 3DiMo against AnimateAnyone, MimicMotion, MTVCrafter, and Uni3C (Table 1, (Fang et al., 3 Feb 2026)) on static-camera internet videos yield the following:

Method          SSIM ↑   PSNR ↑   LPIPS ↓   FID ↓   FVD ↓
AnimateAnyone   0.7325   17.21    0.2754    68.72   862.5
MimicMotion     0.7051   16.83    0.3286    62.45   628.2
MTVCrafter      0.7489   18.03    0.2542    57.21   379.6
Uni3C           0.7185   17.53    0.2639    41.28   321.9
3DiMo           0.7390   17.96    0.2206    36.92   297.4

3DiMo achieves the lowest LPIPS (perceptual distance), FID (visual fidelity), and FVD (video quality) scores, with marginal SSIM/PSNR differences attributed to viewpoint sensitivity.

A user study with 30 raters using 5-point Likert scales for cross-identity motion transfer shows 3DiMo outperforming in motion accuracy (4.28±0.08), naturalness (4.18±0.06), 3D plausibility (4.05±0.09), and overall rating (4.38±0.08).

Ablations confirm the importance of:

  • Implicit view-agnostic motion encoder vs. SMPL token inputs.
  • View-rich multi-stage supervision schedule.
  • Cross-attention for motion token injection (superior to channel concatenation).
  • Early-stage auxiliary geometric supervision for convergence.
  • Dual-scale body/hand encoding for fine-grained motion transfer (Table 3, Fig. 5).

7. Context, Significance, and Prospective Implications

3DiMo establishes a learning regime that harnesses the intrinsic 3D priors of large pretrained video diffusion models, while bypassing the limitations of pose-centric control or fixed parametric models that are vulnerable to depth ambiguity and misalignment in dynamic contexts. The framework demonstrates qualitative improvements in depth understanding, nuanced motion transfer, and viewpoint adaptivity under text-driven camera control, advancing capabilities for human-centric video generation. A plausible implication is the emergence of more flexible, view-agnostic motion representations as standard protocol in future controllable video synthesis research. Furthermore, the scheduled reduction of auxiliary supervision illustrates how parametric guidance can initialize—but not dominate—the learning of implicit 3D structure, reinforcing the salience of learned spatiotemporal priors over fixed model constraints (Fang et al., 3 Feb 2026).
