Envisioning the Future, One Step at a Time

This presentation explores a breakthrough in visual forecasting that shifts from dense pixel-level prediction to sparse trajectory modeling. By introducing a step-wise autoregressive diffusion model for point trajectories, the authors achieve orders of magnitude higher sampling throughput than conventional video generation models while maintaining superior accuracy. The approach enables efficient exploration of thousands of plausible futures, demonstrated through applications ranging from open-world motion prediction to counterfactual planning in billiards. In doing so, it fundamentally rethinks how machines anticipate scene dynamics.
Script
Most visual AI tries to predict the future pixel by pixel, like painting every frame of a movie. But the authors discovered something remarkable: you can forecast scene dynamics orders of magnitude faster by tracking sparse points through time instead of rendering dense video. This isn't just faster; it's fundamentally more powerful for exploring what might happen next.
Traditional approaches pay what the authors call the "visual tax." Dense video generation models must render every pixel at every timestep, which means even billion-parameter systems can barely produce one future scenario per minute. When you need to explore thousands of possible outcomes for planning or decision-making, this computational bottleneck makes the problem intractable.
The key insight is to separate dynamics from appearance.
Instead of pixels, the model reasons about sparse point trajectories. Motion tokens combine local appearance with randomized identity vectors and Fourier-embedded motion history. A flow matching head predicts velocity increments over noisy observations, using a multi-scale cascade to handle everything from slow drift to sudden impacts. At each step, the model accumulates these increments autoregressively, building complete futures one moment at a time.
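The step-wise loop described above can be sketched in miniature. This is an illustrative Python sketch, not the authors' implementation: a constant-velocity extrapolation stands in for the learned velocity head, the names `sample_increment` and `rollout` are hypothetical, and motion tokens, identity vectors, Fourier embeddings, and the multi-scale cascade are omitted for brevity.

```python
import numpy as np

def sample_increment(target, rng, num_flow_steps=8):
    """Flow-matching-style sampler (rectified-flow form): start from Gaussian
    noise and Euler-integrate the velocity field dx/dt = (target - x) / (1 - t),
    which transports the sample onto `target` at t = 1. In the real model the
    velocity field is learned; here `target` plays that role directly."""
    x = rng.standard_normal(target.shape)
    for k in range(num_flow_steps):
        t = k / num_flow_steps
        x = x + (1.0 / num_flow_steps) * (target - x) / (1.0 - t)
    return x

def rollout(history, horizon, rng):
    """Step-wise autoregressive rollout: at each step, sample a velocity
    increment with the flow sampler and accumulate it onto the last position,
    building the future one moment at a time."""
    traj = list(history)
    for _ in range(horizon):
        # Stand-in for the learned prediction: constant-velocity extrapolation
        # of the most recent observed displacement.
        target = traj[-1] - traj[-2]
        traj.append(traj[-1] + sample_increment(target, rng))
    return np.stack(traj)
```

Because each step conditions only on the accumulated trajectory so far, sampling many distinct futures amounts to rerunning this cheap loop with fresh noise, which is what makes large-scale rollout enumeration affordable.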
Here's where sparse prediction becomes transformative. The model searches for a billiard shot by enumerating thousands of initial actions and simulating their consequences. The top left shows the goal: move the red ball to the target. By predicting trajectories for different cue ball strikes, shown top right, the model finds a plan that achieves the objective when executed, shown bottom. This exhaustive search is only possible because sparse rollouts are 2200 times faster than dense video generation, turning intractable planning into real-time reasoning.
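The planning-by-search loop can be sketched as follows. Everything here is a toy stand-in: `simulate` replaces the model's learned trajectory rollout with trivial straight-line physics, and the function names and search grids are hypothetical. The point is the structure: enumerate candidate strikes, score each predicted outcome against the goal, and keep the best.

```python
import numpy as np

def simulate(angle, speed, cue, red, hit_radius=0.05):
    """Toy stand-in for a learned trajectory rollout: the cue ball travels
    along `angle`; if its path passes within `hit_radius` of the red ball,
    the red ball is pushed `speed` units along the same direction."""
    d = np.array([np.cos(angle), np.sin(angle)])
    to_red = red - cue
    along = to_red @ d                       # distance travelled toward the red ball
    perp = np.linalg.norm(to_red - along * d)  # closest-approach distance
    if along > 0 and perp < hit_radius:
        return red + speed * d               # hit: red ball is displaced
    return red                               # miss: red ball stays put

def plan(cue, red, target, angles, speeds):
    """Exhaustive search over candidate strikes, feasible only because each
    rollout is cheap: score every (angle, speed) pair by how close it leaves
    the red ball to the target, and return the best action found."""
    best, best_cost = None, np.inf
    for a in angles:
        for s in speeds:
            cost = np.linalg.norm(simulate(a, s, cue, red) - target)
            if cost < best_cost:
                best, best_cost = (a, s), cost
    return best, best_cost
```

Swapping the toy `simulate` for thousands of sparse model rollouts gives the billiards search described above, with the same enumerate-score-select skeleton.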
On the new OWM benchmark of 95 diverse real-world scenes, this 665-million-parameter model matches or beats multi-billion-parameter video generators in accuracy. But the real story is throughput: 2200 future scenarios per minute compared to less than one for dense methods. When compute budgets are fixed, this efficiency gap translates directly into better predictions, because you can explore vastly more hypotheses.
By predicting futures one sparse step at a time instead of rendering them pixel by pixel, this work makes the impossible tractable: machines that can genuinely explore what might happen next. Visit EmergentMind.com to learn more and create your own research videos.