Time-to-Move (TTM): Video & Travel Metrics
- Time-to-Move (TTM) is a dual-faceted term: in video generation it names a training-free, plug-and-play framework that injects user-specified motion cues into off-the-shelf diffusion models via a dual-clock denoising mechanism.
- The dual-clock denoising method applies strong and weak noise schedules to motion-constrained and flexible regions respectively, ensuring precise motion adherence and improved dynamic fidelity.
- TTM also names a max–min travel-time metric that aggregates worst-case transit times over temporal intervals, supporting spatial analytics and delay-aware planning with guaranteed metric properties.
Time-to-Move (TTM) refers to two distinct, rigorously formulated constructs in modern research: a training-free framework for controlled video generation in diffusion models (Singer et al., 9 Nov 2025), and a max–min aggregate travel-time metric for expressing distance in terms of temporal transport cost (Halpern, 2015). Both instantiations of TTM are motivated by the need for interpretable, fine-grained control—over motion in high-dimensional generative modeling, or over routes and distances in time-sensitive spatial networks.
1. TTM in Motion-Controlled Video Generation
Time-to-Move, for motion-controlled video synthesis, addresses a core deficiency in contemporary diffusion-based video generators: while models such as Stable Video Diffusion and CogVideoX produce visually striking outputs, they lack precise, user-specified motion targeting. Traditional conditioning through text prompts (e.g., “a car drives left”) produces only approximate alignment with motion intent and tends to introduce undesired global effects, such as camera pans or scene deformation. Meanwhile, approaches that introduce motion conditioning through auxiliary data (e.g., optical flow or trajectory branches) require expensive model-specific fine-tuning and risk compromising visual fidelity.
TTM introduces a training-free, plug-and-play solution that injects explicit motion cues into any off-the-shelf image-to-video (I2V) diffusion model via sampling-time modification, eliminating the need for retraining. The key innovation is to utilize crude reference animations—produced by simple manipulations such as cut-and-drag or depth-based reprojection—as coarse motion guides, and to enforce these guides during sampling with a region-dependent, dual-clock denoising mechanism.
2. Dual-Clock Denoising Mechanism
The dual-clock denoising scheme is the central algorithmic contribution of TTM in video generation. It enables region-dependent adherence to user-provided motion fields by modulating the denoising schedule. More formally, pixels specified in a binary mask M (denoting "motion-constrained" regions) are denoised with a "strong" schedule (timestep t_strong), retaining high fidelity to the warped reference animation Vᵂ. Pixels outside the mask ("flexible" regions) are denoised with a "weak" schedule (timestep t_weak), allowing the model to erase inconsistencies and synthesize plausible unconstrained dynamics.
The pseudocode for dual-clock sampling is as follows:
Algorithm TTM_Sample(I, Vᵂ, M)
1. Choose two noise timesteps 0 ≤ t_strong < t_weak ≤ T.
2. Compute x_T ← q_T(Vᵂ)                        // fully noise the warped reference
3. for t = T … t_weak + 1 do                    // all pixels denoise freely
       x_{t−1} ← DenoiseStep(x_t, t, I)
   end for
4. for t = t_weak … t_strong + 1 do             // dual clock: clamp masked pixels to Vᵂ
       pred ← DenoiseStep(x_t, t, I)
       x_{t−1} ← (1 − M) ⊙ pred + M ⊙ Noised(Vᵂ, t−1)
   end for
5. for t = t_strong … 1 do                      // unconstrained refinement of all pixels
       x_{t−1} ← DenoiseStep(x_t, t, I)
   end for
6. return x₀
At timesteps t_strong < t ≤ t_weak, the masked regions are "clamped" to the warped reference corrupted to the relevant noise level, while regions outside the mask undergo standard denoising. This approach allows strict motion adherence in the user-designated object or region, while the background naturally evolves according to the generative model.
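The dual-clock loop is straightforward to express at sampling time. Below is a minimal, illustrative Python sketch that mirrors the pseudocode above, assuming a generic image-conditioned reverse step `denoise_step(x, t, image)` and a forward-noising helper `add_noise(v, t)`; these names are placeholders for whatever the host diffusion pipeline provides, not the authors' released API.

```python
def ttm_sample(denoise_step, add_noise, image, v_warp, mask, T, t_strong, t_weak):
    """Dual-clock sampling sketch (illustrative; helper names are assumptions).

    denoise_step(x, t, image): one image-conditioned reverse-diffusion step, returns x at t-1.
    add_noise(v, t):           forward process, corrupts v to noise level t.
    image:  reference frame I;  v_warp: warped reference animation V^W;
    mask:   binary mask M (1 = motion-constrained, 0 = flexible).
    """
    assert 0 <= t_strong < t_weak <= T
    x = add_noise(v_warp, T)                      # x_T: fully noised warped reference

    for t in range(T, t_weak, -1):                # T ... t_weak+1: all pixels denoise freely
        x = denoise_step(x, t, image)

    for t in range(t_weak, t_strong, -1):         # t_weak ... t_strong+1: dual clock
        pred = denoise_step(x, t, image)
        x = (1 - mask) * pred + mask * add_noise(v_warp, t - 1)   # clamp masked pixels to V^W

    for t in range(t_strong, 0, -1):              # t_strong ... 1: unconstrained refinement
        x = denoise_step(x, t, image)

    return x                                      # x_0: the generated video
```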
3. Extraction and Use of Crude Reference Animations
TTM’s conditioning inputs consist of:
- A reference frame I,
- A warped reference animation Vᵂ,
- A binary mask video M.
Crude reference animations are constructed from either "cut-and-drag" operations—where the user selects and manually drags an object in the initial frame along a trajectory—or via monocular depth-based reprojection, where a depth map is inferred, points are reprojected along a camera path, and new pixels are inpainted as necessary. This extraction incurs no additional training or architectural overhead, and both Vᵂ and M are directly consumable by the diffusion sampler.
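To illustrate how lightweight this construction can be, the following NumPy sketch translates a user-selected object along a trajectory to produce a crude reference animation and its mask video. It assumes pure per-frame integer translation and performs no inpainting (the object simply remains visible at its source location), so it is a simplification of the cut-and-drag procedure described above.

```python
import numpy as np

def cut_and_drag(frame, obj_mask, trajectory):
    """Build a crude reference animation V^W and mask video M from one frame.

    frame      : (H, W, 3) reference image I
    obj_mask   : (H, W) boolean mask of the object to move
    trajectory : list of (dy, dx) integer offsets, one per output frame

    Returns (video, masks): the object pasted at each shifted location over the
    original frame, plus per-frame binary masks marking the motion-constrained
    region. Illustrative only: the source region is not removed or inpainted.
    """
    H, W = obj_mask.shape
    ys, xs = np.nonzero(obj_mask)
    obj_pixels = frame[ys, xs]

    video, masks = [], []
    for dy, dx in trajectory:
        ny, nx = ys + dy, xs + dx
        keep = (ny >= 0) & (ny < H) & (nx >= 0) & (nx < W)   # drop pixels dragged off-frame
        out = frame.copy()
        m = np.zeros((H, W), dtype=bool)
        out[ny[keep], nx[keep]] = obj_pixels[keep]           # paste object at shifted location
        m[ny[keep], nx[keep]] = True
        video.append(out)
        masks.append(m)
    return np.stack(video), np.stack(masks)
```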
4. Quantitative Benchmarks and Comparative Performance
TTM’s efficacy is substantiated on established motion control benchmarks, including MC-Bench (object motion) and DL3DV-10K (camera motion). Performance is assessed using a suite of metrics:
- CoTracker Distance (CTD): pixel-wise trajectory error between generated and intended point tracks; lower is better (see the sketch after this list).
- BG–Obj CTD: background-object separation, higher is better.
- VBench quality suite: dynamic degree, subject consistency, background consistency, motion smoothness, aesthetic quality, and image quality.
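For concreteness, CTD can be read as an average distance between point tracks extracted from the generated video (e.g., with CoTracker) and the intended trajectories taken from the crude reference animation. The sketch below assumes exactly that definition; the benchmark's precise normalization may differ.

```python
import numpy as np

def cotracker_distance(gen_tracks, target_tracks):
    """Assumed CTD definition: mean Euclidean error between corresponding point tracks.

    gen_tracks, target_tracks: arrays of shape (num_points, num_frames, 2) holding pixel
    coordinates of tracked points in the generated video and the intended trajectories,
    respectively. Lower is better.
    """
    diff = np.asarray(gen_tracks, dtype=float) - np.asarray(target_tracks, dtype=float)
    return float(np.linalg.norm(diff, axis=-1).mean())
```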
Sample results for the Stable Video Diffusion (SVD) backbone on MC-Bench (16 frames):
| Method | CTD↓ | BG–Obj CTD↑ | Dynamic↑ | Subject↑ | Background↑ | Smoothness↑ | ImgQual↑ |
|---|---|---|---|---|---|---|---|
| DragAnything | 10.65 | 50.9 | 0.981 | 0.956 | 0.942 | 0.983 | 0.554 |
| SG-I2V | 5.80 | 12.0 | 0.803 | 0.976 | 0.953 | 0.991 | 0.621 |
| MotionPro | 8.69 | 24.5 | 0.422 | 0.979 | 0.975 | 0.993 | 0.617 |
| TTM (Ours) | 7.97 | 35.3 | 0.427 | 0.979 | 0.967 | 0.993 | 0.617 |
On DL3DV-10K for camera motion, TTM reduces pixel MSE by 33%, FID by 15.5%, and Flow MSE by 21% compared to Go-With-the-Flow, while improving CLIP-based temporal consistency.
The dual-clock schedule proved superior to both single-clock (SDEdit-style) and RePaint-style alternatives, achieving a better trade-off between CTD, dynamic degree, and visual fidelity.
5. Assumptions, Limitations, and Open Problems
TTM’s object and camera motion control presupposes access to a reliable mask and that appearance anchoring in the initial frame suffices for identity preservation. Object-level control currently requires dense binary masks, though robustness to mask imperfections is reported. The TTM framework does not extend to articulated or occluded entities where single-mask cut-and-drag or depth-only cues are insufficient. Cohesion over long video horizons (50 frames) and handling of new object appearances remain open directions. The dual-clock mechanism requires tuning of two hyperparameters (t_strong, t_weak) to balance motion fidelity with natural scene dynamics.
6. TTM as a Max–Min Travel-Time Metric
Independently, TTM refers to a travel-time metric over a set of locations P, as formulated by Halpern (Halpern, 2015). The construction maps route- and time-dependent empirical data into a bona fide symmetric metric on P. For each pair of locations (u, v) and departure interval [t₀, t₁]:
- Compute, for each departure time t ∈ [t₀, t₁], the fastest admissible route: τ(u, v, t) = min_{r ∈ R(u,v)} T(r, t), where R(u, v) is the set of admissible routes from u to v and T(r, t) is the travel time of route r when departing at t.
- Aggregate the worst-case best time over all t ∈ [t₀, t₁]: d̃(u, v) = max_{t ∈ [t₀, t₁]} τ(u, v, t).
- Symmetrize: d(u, v) = max{d̃(u, v), d̃(v, u)}.
This metric satisfies nonnegativity, symmetry, and the triangle inequality. Computationally, a family of time-dependent shortest-path queries (one per considered departure time) is required for each ordered pair. The construction is robust to the temporal variability of routes but does not account for prospective congestion, cost, or risk. Regularization is needed at interval boundaries to guarantee that the maximum exists. The metric is suited to applications in cartography and delay-aware planning where the worst-case traversal time is critical.
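A small sketch of the max–min construction is given below. It assumes a black-box time-dependent shortest-path query `fastest_time(u, v, t)` and a sampled grid of departure times standing in for the continuous interval [t₀, t₁]; the names are illustrative rather than Halpern's notation.

```python
def ttm_metric(locations, departure_times, fastest_time):
    """Max-min travel-time metric (illustrative sketch).

    fastest_time(u, v, t): minimal travel time from u to v when departing at t
                           (a time-dependent shortest-path query).
    departure_times      : sampled departure times covering the interval [t0, t1].

    Returns a dict d with d[(u, v)] = max(worst(u, v), worst(v, u)), where
    worst(u, v) is the worst-case best travel time from u to v over the interval.
    """
    def worst(u, v):
        return max(fastest_time(u, v, t) for t in departure_times)

    d = {}
    for i, u in enumerate(locations):
        d[(u, u)] = 0.0
        for v in locations[i + 1:]:
            d[(u, v)] = d[(v, u)] = max(worst(u, v), worst(v, u))   # symmetrize
    return d
```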
7. Applications, Extensions, and Context within Related Work
For motion-controlled video generation, TTM advances the domain by providing zero-shot controllability and precise correspondence between user-intended motion and generated content, outmatching learned or retrained counterparts in specific metrics without added computational burden during sampling. For network science and transportation analytics, the TTM max–min metric enables meaningful embedding of geographical locations onto time-responsive maps, guaranteeing metric properties essential for optimization and planning.
Potential extensions for both domains include: hybridization with other conditioning or metric modalities (e.g., adding cost to the travel-time metric, incorporating articulated motion cues in video TTM), and adaptive time-horizon or scenario-based evaluation. In both implementations, TTM emphasizes rigorous enforcement of interpretability and control, aligning generated or inferred structure to precise, user- or data-driven specification.