Long-horizon identity and dynamics stability in image-to-video generation

Establish image-to-video generation mechanisms that preserve agent and object identity and keep motion dynamics stable across long-horizon interactive tasks in VBVR-Bench. Such mechanisms should prevent duplication and flickering while keeping the generated path coherent and legal. The failures observed for VBVR-Wan2.2 on the Multiple Keys for One Door maze task (G-47) exemplify the problem.

Background

In the qualitative analysis of VBVR-Wan2.2, the authors of A Very Big Video Reasoning Suite examine out-of-domain tasks and highlight persistent control challenges over long temporal horizons. In maze-like interactive tasks such as G-47 (Multiple Keys for One Door), the model sometimes duplicates the agent or makes it flicker, which breaks identity tracking and undermines path coherence.
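
As an illustration only, the failure modes described above can be turned into a simple per-frame audit. The sketch below is not taken from the paper or from VBVR-Bench. It assumes that an upstream detector (not shown) has already mapped each frame to the list of grid cells where the agent appears, and that the wall cells of the maze are known. The function name `audit_agent_track`, the grid representation, and the rule that the agent moves at most one cell per frame are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row, col) on a hypothetical maze grid


def audit_agent_track(detections: List[List[Cell]], walls: Set[Cell]) -> Dict[str, List[int]]:
    """Return frame indices where the agent is missing, duplicated, or moves illegally.

    detections[t] lists every cell where the agent was detected in frame t.
    Assumption: a legal move changes position by at most one cell (Manhattan distance).
    """
    report: Dict[str, List[int]] = {"missing": [], "duplicated": [], "illegal_move": []}
    prev = None  # position in the previous frame, if exactly one agent was seen there
    for t, cells in enumerate(detections):
        if len(cells) == 0:
            report["missing"].append(t)      # flicker: the agent vanished
            prev = None                      # identity lost, do not compare across the gap
            continue
        if len(cells) > 1:
            report["duplicated"].append(t)   # duplication: more than one agent visible
            prev = None
            continue
        cur = cells[0]
        if cur in walls:
            report["illegal_move"].append(t)  # the agent is standing inside a wall
        elif prev is not None:
            step = abs(cur[0] - prev[0]) + abs(cur[1] - prev[1])
            if step > 1:
                report["illegal_move"].append(t)  # teleport between consecutive frames
        prev = cur
    return report


# Toy trace: frame 2 drops the agent, frame 3 shows two agents, frame 5 jumps two cells.
frames = [[(0, 0)], [(0, 1)], [], [(0, 1), (2, 2)], [(0, 2)], [(2, 2)]]
print(audit_agent_track(frames, walls={(1, 1)}))
# {'missing': [2], 'duplicated': [3], 'illegal_move': [5]}
```

Resetting the previous position after a missing or duplicated frame is a deliberately conservative choice: once identity is ambiguous, the sketch does not guess which detection continues the track, so it flags fewer moves rather than risk false alarms.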

These failures indicate that current image-to-video generation models, even when fine-tuned on large-scale reasoning data (VBVR-Dataset), lack reliable mechanisms to preserve object identity and stable dynamics across extended sequences. Solving this problem is important for verifiable video reasoning, where correctness depends on faithful, stepwise execution without artifacts that corrupt the reasoning trace.

References

However, it can still suffer from control failures such as agent duplication/flickering when traversing a coherent path, indicating that maintaining identity and stable dynamics over long horizons remains an open problem.

A Very Big Video Reasoning Suite (2602.20159 - Wang et al., 23 Feb 2026) in Section 6.3 (Qualitative Analysis), Limitations and failure modes (VBVR-Wan2.2)