- The paper introduces a dual-stream diffusion architecture that disentangles camera and object motion for causality-aware video generation.
- It employs novel dropout and hybrid supervision techniques to learn forward and inverse reasoning of active and passive motion components.
- Experimental results demonstrate superior motion controllability, realism, and physical plausibility over state-of-the-art methods.
Motion-Controlled Video Generation with Causal and Disentangled Control: An Expert Analysis of MoRight
"MoRight: Motion Control Done Right" (2604.07348) addresses a persistent dual deficiency in video generation: the entanglement of camera and object motion, and the lack of action-consequence (causality) reasoning. Whereas prior approaches constrain controllability and realism by representing motion as pixel displacement agnostic to causality and viewpoint, MoRight proposes a unified formulation based on (1) disentangled camera and object motion representation, and (2) explicit modeling of causal relationships between foreground actions and their physical consequences.
The essential premise is that user-driven physical interaction requires not just kinematic replay but reasoning about how agent actions propagate through complex environments, which is critical for embodied agents, interactive world models, and content creation. MoRight operationalizes this by enabling precise specification of dynamics under arbitrary viewpoints and by supporting both forward (“given action → infer consequences”) and inverse (“given consequences → infer plausible actions”) causal queries.
Architecture: Dual-Stream Disentanglement and Causality Encoding
MoRight uses a dual-stream latent video diffusion architecture, built on a diffusion-transformer backbone (Wan2.1-14B), to decouple motion control:
- Canonical stream: Synthesizes object-only motion under a static, canonical view. Here, user-drawn 2D trajectories—without need for future frames—specify fine or coarse object motion, serving as the anchor for interaction control.
- Target stream: Synthesizes frames under the user-specified camera trajectory (arbitrary viewpoint), but without direct object-motion input. Temporal cross-view self-attention in the transformer layers exchanges information between streams, transferring canonical object motion into the user’s target camera space.
Motion and camera conditions are injected into all attention blocks via learned projections; camera signals use depth-aware warping from a single image, while object trajectories use dense/sparse maps or strokes. This disentanglement resolves the inherent degeneracy of pixel trajectory-based controls under camera motion.
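To make the cross-stream exchange concrete, here is a minimal NumPy sketch of joint self-attention over the tokens of both streams. It is an illustration under stated assumptions, not the paper's implementation: the function name `cross_view_self_attention`, the single-head formulation, and the random projection matrices are all hypothetical, and the real model operates on spatio-temporal latent tokens inside a much larger transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_self_attention(canonical, target, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated tokens of both streams.

    canonical, target: (num_tokens, dim) token arrays, one per stream.
    w_q, w_k, w_v: (dim, dim) projection matrices (hypothetical, random here).
    Returns the updated (canonical, target) token arrays.
    """
    n_canonical = canonical.shape[0]
    # Concatenating along the token axis lets every target token attend to
    # every canonical token (and vice versa) in one attention call.
    tokens = np.concatenate([canonical, target], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = attn @ v
    return out[:n_canonical], out[n_canonical:]

rng = np.random.default_rng(0)
dim, n = 8, 4
canonical = rng.normal(size=(n, dim))
target = rng.normal(size=(n, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
canon_out, target_out = cross_view_self_attention(canonical, target, w_q, w_k, w_v)
```

Because the attention matrix spans both streams, changing the canonical tokens changes the target output, which is the mechanism by which object motion specified in the canonical view can reach the target camera view.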
For causality, MoRight decomposes object motion into active (agent-driven) and passive (environmental consequence) components, using a vision-language model (Qwen3-VL) and segmentation models to assign tracks to each component during data curation. During training, a “motion dropout” scheme randomly withholds one motion component from the input, so the model must infer (and hallucinate) a plausible missing counterpart from the available tracks. Withholding passive motion trains forward reasoning, and withholding active motion trains inverse reasoning.
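The dropout idea can be sketched as a small conditioning routine. This is a hedged illustration only: the function name `motion_dropout`, the drop probabilities, and the choice to zero out the withheld tracks are assumptions, since the summary does not state how the withheld component is encoded or how often each case is sampled.

```python
import numpy as np

def motion_dropout(active, passive, rng, p_drop_active=0.25, p_drop_passive=0.25):
    """Randomly withhold at most one motion component from the conditioning.

    active, passive: (num_tracks, num_frames, 2) trajectory arrays.
    The probabilities are hypothetical, not taken from the paper.
    Returns (cond_active, cond_passive, mode). The training target would
    still be the full video, so the model must fill in what was withheld.
    """
    u = rng.random()
    if u < p_drop_active:
        # Only consequences are given: the model must infer the action.
        return np.zeros_like(active), passive, "inverse"
    if u < p_drop_active + p_drop_passive:
        # Only the action is given: the model must infer the consequences.
        return active, np.zeros_like(passive), "forward"
    # Both components are given: plain motion-conditioned generation.
    return active, passive, "full"

rng = np.random.default_rng(0)
active = np.ones((3, 5, 2))
passive = np.ones((2, 5, 2))
cond_active, cond_passive, mode = motion_dropout(active, passive, rng)
```

The routine never withholds both components at once, which keeps every training sample anchored to at least one observed motion signal.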
Training Data Pipeline and Mixed-Supervision Protocol
MoRight’s training pipeline tackles the lack of paired, multi-view, and causally-annotated real-world videos by:
- Extracting camera pose, depth, and dense tracks using foundation models (ViPE, AllTracker, SAM2, Qwen3-VL).
- Canonicalizing 2D trajectories by backprojecting to 3D and reprojecting onto a reference frame.
- Decomposing tracks into active (agent) and passive (affected) clusters via mask-based membership.
- Generating paired dynamic/static videos using synthetic camera control when needed.
- Employing hybrid (real + synthetic, single-view + dual-view) supervision and extensive augmentation (track occlusion/dropout, granularity variation) to diversify learned motion distributions and improve robustness, especially under highly dynamic camera motion setups.
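Two of the steps above, canonicalizing tracks and assigning them to active or passive groups, can be sketched with standard pinhole-camera geometry. The code below is a minimal sketch under assumptions: the function names, the argument layout, the use of z-depth, and the first-frame mask lookup are hypothetical choices, not details confirmed by the paper.

```python
import numpy as np

def canonicalize_tracks(uv, depth, K, cam_to_world, ref_world_to_cam):
    """Backproject 2D tracks to 3D and reproject them into a reference view.

    uv: (T, N, 2) pixel coordinates (x, y) per frame and track.
    depth: (T, N) positive depth along the camera z-axis.
    K: (3, 3) pinhole intrinsics, assumed shared by all frames.
    cam_to_world: (T, 4, 4) per-frame camera poses.
    ref_world_to_cam: (4, 4) extrinsics of the reference (canonical) view.
    Returns (T, N, 2) pixel coordinates in the reference view.
    """
    T, N, _ = uv.shape
    ones = np.ones((T, N, 1))
    pix_h = np.concatenate([uv, ones], axis=-1)
    rays = pix_h @ np.linalg.inv(K).T              # K^-1 applied to each pixel
    pts_cam = rays * depth[..., None]              # 3D points in each camera frame
    pts_cam_h = np.concatenate([pts_cam, ones], axis=-1)
    pts_world = np.einsum("tij,tnj->tni", cam_to_world, pts_cam_h)
    pts_ref = pts_world @ ref_world_to_cam.T       # into the reference camera frame
    proj = pts_ref[..., :3] @ K.T
    return proj[..., :2] / proj[..., 2:3]          # perspective divide

def split_by_mask(uv_first, active_mask, passive_mask):
    """Label tracks by which mask contains their first-frame position.

    uv_first: (N, 2) first-frame pixel coordinates (x, y).
    active_mask, passive_mask: (H, W) boolean masks.
    Returns two (N,) boolean arrays: is_active, is_passive.
    """
    h, w = active_mask.shape
    x = np.clip(np.round(uv_first[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.round(uv_first[:, 1]).astype(int), 0, h - 1)
    return active_mask[y, x], passive_mask[y, x]

# A static world point seen by a camera that slides 0.5 units along x.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
poses = np.stack([np.eye(4), np.eye(4)])
poses[1, 0, 3] = 0.5
uv = np.array([[[60.0, 45.0]], [[35.0, 45.0]]])    # the point appears to move
depth = np.full((2, 1), 2.0)
canonical_uv = canonicalize_tracks(uv, depth, K, poses, np.eye(4))
```

In this toy case the apparent pixel motion comes entirely from the camera, so the canonicalized track stays at the same pixel in both frames. That is the sense in which canonicalization separates object motion from camera motion.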
Experimental Results
Evaluation across three challenging benchmarks—DynPose-100K, WISA, and a real-world Cooking benchmark—substantiates the following principal claims:
- Disentangled Controllability: MoRight matches or approaches the best end-point error (EPE) for motion accuracy and the best camera pose fidelity among strong baselines (ATI, WanMove, MP, Gen3C), despite using only first-frame 2D trajectories without privileged future or multi-frame labels.
- Causal Interaction Reasoning: In scenarios where only active motion is provided, MoRight produces semantically aligned consequences with higher physical commonsense (PC) scores and superior FID/FVD versus methods that require full action+outcome specification. This gap is most notable in the inverse reasoning setting, where the model successfully reconstructs plausible causal actions from sparse outcome cues.
- Human Preference and Realism: User studies show a majority preference for MoRight (controllability: 53.5%, motion realism: 54.6%, photorealism: 55.9%), with notable margins over competitors that use privileged 3D/cinematic cues.
- Ablation Outcomes: Removing the fixed canonical-view stream, causal dropout, or hybrid supervision sharply degrades camera-motion disentanglement, physical plausibility, and control precision, validating the necessity of the architectural design.
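For reference, end-point error is conventionally the mean Euclidean distance between predicted and ground-truth track positions. The sketch below follows that standard definition; the function name and the optional visibility mask are assumptions, and the paper's exact evaluation protocol (track extraction, normalization, masking) may differ.

```python
import numpy as np

def end_point_error(pred, gt, visible=None):
    """Mean Euclidean distance in pixels between two sets of tracks.

    pred, gt: (T, N, 2) pixel trajectories.
    visible: optional (T, N) boolean mask restricting which points count
             (an assumed convention, not confirmed by the source).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    if visible is None:
        return float(dist.mean())
    return float(dist[visible].mean())

gt = np.zeros((4, 3, 2))
pred = gt + np.array([3.0, 4.0])   # every point is off by a 3-4-5 triangle
epe = end_point_error(pred, gt)    # 5.0
```

Lower values indicate that generated motion follows the requested trajectories more closely.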
Limitations
MoRight’s causal reasoning can fail in ambiguous or underconstrained settings. Failure modes include merged-object artifacts, breakdowns under severe occlusion or sparse tracking, and occasional physical implausibility or hallucination in scenes with rapid camera egomotion. The method works best with smooth camera trajectories and interactions between well-segmented active agents and passive objects.
Implications and Future Directions
The MoRight formulation yields an architectural pathway for scalable interactive video models applicable to embodied AI, robot planning, and high-DOF content creation. Its success in training “cause-and-effect” aware generative models from single-view, weakly annotated data relaxes the need for full 3D or physics-engine supervision. Practically, MoRight’s GUI enables artists/engineers to draw actions and receive causality-consistent videos from arbitrary angles, with applications extending from AR/VR content editing to physically consistent world-model learning.
From a theoretical perspective, MoRight supports the view that diffusion-based video generation can internalize latent causal structure, provided that motion decomposition and dropout regularization bridge the gap between observable kinematics and unobservable physical causation. This opens avenues for:
- Joint learning with explicit physics simulation feedback loops or hybrid RL-imitation learning.
- Generalization of active/passive decomposition to multi-agent or continuous control scenarios.
- End-to-end training on vision-only datasets without dependence on foundation model signal for data annotation.
- Unification with language-guided interaction reasoning for multimodal, instruction-following generative agents.
Conclusion
MoRight establishes a rigorous method for interactive, causality-aware video generation with precise, disentangled control over camera and object dynamics. Its combination of dual-stream motion encoding and action-consequence modeling enables both forward and inverse reasoning about physical interactions in complex scenes. The results suggest that transformer-based diffusion models, given proper data curation and regularization, can go beyond straightforward kinematic animation and approach grounded causal world simulation. MoRight sets the stage for next-generation physically consistent, user-driven synthetic video systems.