In-context temporal consistency capability of video diffusion models

Determine whether current diffusion-based video generation models exhibit in-context learning capabilities for temporal consistency tasks comparable to the established in-context generation capabilities of text-to-image diffusion models.
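
For concreteness, temporal consistency is commonly quantified as feature similarity between consecutive frames. The sketch below is one minimal way such a probe could be scored; it is not taken from the paper, and encode_frame is a hypothetical placeholder for any image encoder (e.g., a CLIP vision tower).

```python
import torch
import torch.nn.functional as F

def temporal_consistency(frames: torch.Tensor, encode_frame) -> float:
    """Mean cosine similarity between embeddings of consecutive frames.

    frames: (T, C, H, W) tensor of generated video frames.
    encode_frame: hypothetical callable mapping a (C, H, W) frame to a
        (D,) feature vector, e.g. a CLIP image encoder.
    """
    feats = torch.stack([encode_frame(f) for f in frames])     # (T, D)
    sims = F.cosine_similarity(feats[:-1], feats[1:], dim=-1)  # (T - 1,)
    return sims.mean().item()

# Usage with a trivial stand-in encoder (per-channel average pooling).
# Random frames have near-identical channel means, so this prints ~1.0.
video = torch.rand(16, 3, 64, 64)        # 16 random frames
pooled = lambda f: f.mean(dim=(1, 2))    # (3,) feature per frame
print(temporal_consistency(video, pooled))
```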

Background

Prior work on in-context learning for diffusion models has focused primarily on images: text-to-image models have been shown to support in-context generation, and appearance-consistent video customization has leveraged temporal in-context cues to preserve identity and style.

The paper notes that it remains unresolved whether analogous in-context capabilities extend to temporal consistency in video generation. The authors conduct exploratory evaluations and hypothesize that video diffusion models may handle temporal consistency through spatial context, but they explicitly state that the broader question of inherent in-context capabilities for temporal tasks remains open.
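
The spatial-context hypothesis could be probed, for instance, by tiling reference frames and blank target slots into a single spatial canvas, mirroring grid-based in-context generation with text-to-image models. The following is an illustrative sketch under that assumption; the panel layout, shapes, and the build_spatial_context helper are hypothetical, not the paper's method.

```python
import torch

def build_spatial_context(ref_frames: torch.Tensor, num_targets: int) -> torch.Tensor:
    """Tile reference frames and blank target slots side by side along width.

    ref_frames: (N, C, H, W) frames serving as the in-context "demonstration".
    num_targets: number of blank panels a model would be asked to fill in,
        analogous to grid-based in-context generation with image models.
    """
    n, c, h, w = ref_frames.shape
    blanks = torch.zeros(num_targets, c, h, w, dtype=ref_frames.dtype)
    panels = torch.cat([ref_frames, blanks], dim=0)  # (N + num_targets, C, H, W)
    return torch.cat(list(panels), dim=-1)           # (C, H, W * (N + num_targets))

# Example: 3 reference frames plus 1 slot to be generated in context.
canvas = build_spatial_context(torch.rand(3, 3, 64, 64), num_targets=1)
print(canvas.shape)  # torch.Size([3, 64, 256])
```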

References

"However, it remains unclear whether current video diffusion models exhibit comparable in-context capabilities for temporal consistency tasks."

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer (2601.14250 - Zhang et al., 20 Jan 2026) in Section 4.2 (Task-aware Positional Bias)