
Peekaboo: Interactive Video Generation

Updated 5 March 2026
  • Peekaboo: Interactive Video Generation is a framework that enables real-time, user-driven video synthesis through fine-grained, multimodal control and iterative feedback.
  • It leverages advanced architectures such as latent diffusion transformers, autoregressive pipelines, and neural rendering to ensure temporal coherence and immediate responsiveness.
  • The system-level engineering integrates low-latency streaming, dynamic scheduling, and quantization techniques to balance high fidelity output with efficient interactive performance.

Peekaboo: Interactive Video Generation refers to a class of computational frameworks that enable dynamic, user-in-the-loop control and iterative editing of temporally coherent video through multimodal, semantic, and fine-grained interaction mechanisms. Unlike conventional video synthesis models that operate unidirectionally on fixed inputs, Peekaboo frameworks permit real-time instruction via text, images, sketches, trajectories, or discrete action signals, with live feedback and refinement at all stages of the generation process. The term encompasses both model architectural innovations and system-level engineering that collectively realize tightly coupled, high-fidelity, and responsive interactive video environments. Key exemplars include the "InteractiveVideo" system featuring synergistic multimodal instruction fusion (Zhang et al., 2024), diffusion-based autoregressive pipelines with causal control (Wang et al., 18 Dec 2025), self-supervised action-decomposed video generators (Menapace et al., 2021), and user-editable neural rendering environments (Menapace et al., 2022).

1. Conceptual Foundations and Taxonomy

Interactive video generation has its roots in playable video synthesis and generative modeling conditioned on user actions. It is formally cast as constructing a conditional generative process p_\theta(x_{1:T} \mid u_{1:T}), where each frame x_t depends on the history of generated frames and a sequence of structured user control signals u_{1:t}. Modern frameworks distinguish themselves by achieving:

  • Fine-grained, multi-modal control. Direct integration of a variety of instruction types at both spatial (where/what to edit) and temporal (when/how to move/change) scales.
  • On-the-fly, iterative refinement. Users can intervene at any pipeline stage, modifying prompts, object trajectories, region masks, or semantic details, with immediate regeneration and feedback.
  • Streaming, low-latency generation. Architectures are optimized for real-time responsiveness, supporting coherent infinite-horizon video with negligible drift and minimal "time to first frame".
  • Semantic and physical consistency. Systems enforce multi-frame coherence and logical dynamics, often leveraging architectural modules for memory, causality, and action grounding (Yu et al., 30 Apr 2025).

The field encompasses multiple modeling traditions, including latent diffusion transformers, autoregressive GANs, structured radiance fields, and action-bottlenecked RNNs, unified by a focus on continuous, bidirectional human-model interaction.
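The conditional process p_\theta(x_{1:T} \mid u_{1:T}) reduces to a simple autoregressive loop over frames and control signals. The sketch below is illustrative only: the step function and the frame/control types are placeholders, not any particular system's API.

```python
from typing import Callable, List

# Illustrative payload types: in a real system these would be latent
# tensors (frames) and structured edits (text, sketches, trajectories).
Frame = dict
Control = dict

def generate_interactive(
    step_fn: Callable[[List[Frame], List[Control]], Frame],
    controls: List[Control],
) -> List[Frame]:
    """Sketch of p_theta(x_{1:T} | u_{1:T}): each frame x_t is produced
    conditioned on all previously generated frames x_{<t} and the
    control history u_{1:t}."""
    frames: List[Frame] = []
    for t, _ in enumerate(controls):
        # x_t ~ p(x_t | x_{<t}, u_{1:t})
        x_t = step_fn(frames, controls[: t + 1])
        frames.append(x_t)
    return frames
```

The key property distinguishing the interactive setting is that `controls` need not be fixed up front: a user can alter future u_t entries between steps and the loop remains valid.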

2. Model Architectures and Multimodal Control

Architectural innovations enable Peekaboo-style interaction by simultaneously conditioning on diverse instruction signals and fusing user edits into both image and video generation backbones.

  • InteractiveVideo (Zhang et al., 2024) combines T2I and I2V latent diffusion pipelines linked via an editable intermediate canvas. Four control modalities are supported: image or sketch (x), content semantics (y), motion text prompt (y′), and trajectory mask (r). All conditioning signals are injected as residuals into the diffusion denoiser at inference, using the mechanism:

\hat\epsilon_t = \epsilon_t + \lambda (\epsilon'_t - \epsilon_t),

enabling instantaneous alignment with user modifications. No extra parameters are trained; all edits are absorbed as inference-time conditions.

  • AniX (Wang et al., 18 Dec 2025) introduces a Multi-Modal Diffusion Transformer (MMDiT) merging VAE-based video, LLM text/character encoding, and scene embedding. Fine-grained action, gesture, and camera controls are parsed via rule-based instruction decomposition, with all tokens fused via attention in the transformer stack for jointly grounded, responsive output.
  • Playable Environments (Menapace et al., 2022) and Playable Video Generation (Menapace et al., 2021) focus on action-space discovery, where a structured latent "environment state" or discrete action bottleneck mediates transitions under user control. External triggers select semantic actions at each step, which are mapped to high-level object or actor behaviors.
  • MotionStream (Shin et al., 3 Nov 2025) and similar pipelines employ sliding-window causal attention, attention sinks, and key–value cache rolling to enable continuous, chunkwise, multi-modal streaming generation with low latency and support for paint- or trajectory-based interaction.
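InteractiveVideo's inference-time residual injection is compact enough to sketch directly; the function name and the use of NumPy arrays below are illustrative stand-ins for the actual denoiser tensors.

```python
import numpy as np

def inject_user_residual(eps_base: np.ndarray,
                         eps_edit: np.ndarray,
                         lam: float) -> np.ndarray:
    """Inference-time residual conditioning (no weights are trained):
        eps_hat = eps + lambda * (eps' - eps)
    where eps is the base noise prediction, eps' the prediction under
    the user's edits, and lambda interpolates between them:
    lam = 0 ignores the edit, lam = 1 follows it fully."""
    return eps_base + lam * (eps_edit - eps_base)
```

Because the edit enters only as a residual on the predicted noise, the same pretrained denoiser serves both the edited and unedited paths.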

Compositional architectures thus permit user intervention not only in initial prompt specification but throughout the temporal and spatial evolution of the video output.
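The rolling key–value cache with attention sinks used by streaming pipelines such as MotionStream can be illustrated with a small container; the class and its fields are a hypothetical sketch, not the released implementation.

```python
from collections import deque

class RollingKVCache:
    """Sliding-window key/value cache with attention 'sink' tokens:
    the first few entries are pinned permanently (stabilizing attention),
    while the rest roll in a fixed-size window so memory stays bounded
    over an infinite stream."""

    def __init__(self, window: int, num_sinks: int):
        self.num_sinks = num_sinks
        self.sinks = []                      # pinned, never evicted
        self.window = deque(maxlen=window)   # oldest entry drops automatically

    def append(self, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.window.append(kv)

    def context(self):
        # Attention at each step sees sinks + the most recent window.
        return self.sinks + list(self.window)
```

With a bounded `context()`, per-step attention cost is constant regardless of how long the stream has been running.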

3. Training Paradigms and Optimization Objectives

Peekaboo frameworks adopt a variety of optimization strategies, balancing the need for high-fidelity output, semantic controllability, and interactive responsiveness.

  • Training-free inference wrappers. Systems such as InteractiveVideo do not modify the underlying T2I/I2V weights and thus introduce no new loss terms, leveraging only the pretrained models' original denoising (typically \ell_2 noise-prediction) objectives (Zhang et al., 2024).
  • Flow matching and distributionally matched training. AniX employs continuous-time flow matching, regressing the velocity of noisy latents toward ground-truth motion for all diffusion steps, eschewing adversarial or explicit motion losses (Wang et al., 18 Dec 2025).
  • Self-forcing and student-forcing. Streaming systems (e.g., MotionStream, FlowAct-R1) use self-forcing—the model is exposed during training to its own imperfect rollouts, matching inference conditions and preventing drift accumulation over long horizons. Distribution Matching Distillation (DMD) aligns the student's and teacher's score distributions under full rollout (Shin et al., 3 Nov 2025, Wang et al., 15 Jan 2026).
  • Adversarial post-training. Pipelines such as AAPT employ relativistic R3GAN objectives, regularized penalties, and per-frame discriminators to fine-tune pretrained diffusion models for improved frame quality and segmental consistency under strict causal constraints (Lin et al., 11 Jun 2025).
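The flow-matching objective admits a compact toy version. This NumPy sketch assumes a simple linear interpolation path x_t = (1−t)·x0 + t·x1, whose target velocity is constant; it is a minimal illustration, not AniX's exact parameterization.

```python
import numpy as np

def flow_matching_loss(model, x0: np.ndarray, x1: np.ndarray,
                       rng: np.random.Generator) -> float:
    """Continuous-time flow matching with a linear path:
        x_t = (1 - t) * x0 + t * x1,  target velocity v* = x1 - x0.
    The model regresses its predicted velocity toward v* at a random t."""
    t = rng.uniform()                  # t ~ U(0, 1)
    x_t = (1.0 - t) * x0 + t * x1     # point on the probability path
    v_target = x1 - x0                # ground-truth velocity
    v_pred = model(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```

Note there is no adversarial term: the objective is a pure regression on velocities, matching the "eschewing adversarial or explicit motion losses" framing above.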

Together, these training regimes yield models optimized for both iterative feedback and sustained temporal coherence, a key requirement for user-driven interactive generation.

4. System-Level Engineering and Real-Time Performance

Delivering real-time, interactive video generation at scale necessitates hardware-aware engineering and specialized scheduling.

  • Low-latency and streaming throughput. Systems such as StreamDiffusionV2 achieve time-to-first-frame (TTFF) < 0.5 s and 30–60 FPS steady-state throughput on multi-GPU setups using SLO-aware batching and block schedulers, rolling KV caches, and sink-token integration for memory continuity (Feng et al., 10 Nov 2025).
  • Chunkwise generation and pipeline parallelism. Streaming video models produce video in small, overlapping temporal chunks, decoupling denoising, VAE decode, and control-signal integration across asynchronous compute streams (e.g., FlowAct-R1's 0.5 s chunk buffer with short-/long-term memory) (Wang et al., 15 Jan 2026).
  • Quantization and kernel fusion. INT8/FP8 quantization on dominant compute paths (UNet) and fusion of memory-intensive attention kernels yield 1.5–2× throughput boosts with minimal drift in fidelity, supporting real-time avatar and world synthesis (Yu et al., 6 Jun 2025).
  • Dynamic scheduling and adaptive batch sizing. Real-time systems deploy online latency models and dynamic parallelization to balance per-frame deadlines, minimize jitter, and flexibly adjust the batch/chunk parameters to the available hardware (Feng et al., 10 Nov 2025).
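A minimal sketch of the symmetric per-tensor INT8 quantization pattern mentioned above; production deployments typically use per-channel scales, calibration, and fused integer kernels rather than this simplified round-trip.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map weights to [-127, 127]
    with a single scale, so matmuls can run on integer compute units."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; error is bounded by scale / 2."""
    return q.astype(np.float32) * scale
```

The fidelity "drift" discussed above corresponds to the rounding error of this map, at most half the quantization step per weight.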

These optimizations collectively ensure sub-250 ms round-trip latency and robust user feedback cycles for interactive video environments.
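The chunkwise generation idea above can be made concrete with a small scheduling helper; the function and its parameters are illustrative, not taken from any cited system.

```python
def chunk_schedule(total_frames: int, chunk: int, overlap: int):
    """Chunkwise generation schedule: emit [start, end) frame ranges of
    size `chunk` that overlap by `overlap` frames, so each chunk can be
    conditioned on the tail of the previous one for temporal continuity."""
    step = chunk - overlap
    starts = range(0, max(total_frames - overlap, 1), step)
    return [(s, min(s + chunk, total_frames)) for s in starts]
```

Each chunk's leading `overlap` frames duplicate the previous chunk's tail, which is what lets denoising, VAE decode, and control integration proceed on separate chunks in parallel without visible seams.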

5. Evaluation Protocols and Empirical Benchmarks

Quantitative evaluation of interactive video generation demands both objective alignment metrics and subjective user studies.

  • CLIP-based alignment. Metrics include mean cosine similarity between input prompts (image/text) and generated frames for both content and motion (e.g., CLIP Image/Text Alignment) (Zhang et al., 2024).
  • User-study preference and satisfaction rates. Studies assess fidelity, controllability, motion consistency, and subjective satisfaction via pairwise comparisons and attribute ranking (Zhang et al., 2024, Wang et al., 18 Dec 2025).
  • Temporal and identity consistency. Frechet Video Distance (FVD), perceptual similarity (LPIPS), and DINOv2/CLIP-based character/scene similarity quantify coherence and maintenance of scene or avatar identity (Wang et al., 15 Jan 2026, Wang et al., 18 Dec 2025).
  • Control success and action accuracy. Action controllability is measured by user or algorithmic success in achieving requested trajectories or behaviors (e.g., 100% object-control success on standard benchmarks for AniX) (Wang et al., 18 Dec 2025).
  • Latency, throughput, and scalability. Time-to-first-frame, FPS, and standard deviation of per-frame latency serve as system-level performance benchmarks (Feng et al., 10 Nov 2025, Yu et al., 6 Jun 2025).
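The CLIP-alignment metric reduces to an average cosine similarity in embedding space. The sketch below assumes the embeddings have already been produced by a CLIP image/text encoder; it shows only the scoring step.

```python
import numpy as np

def clip_alignment(prompt_emb: np.ndarray, frame_embs: np.ndarray) -> float:
    """Mean cosine similarity between one prompt embedding of shape (D,)
    and per-frame embeddings of shape (T, D) -- the pattern behind
    CLIP Image/Text Alignment scores."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(np.mean(f @ p))
```

Averaging over frames is what makes the score sensitive to temporal drift: late frames that wander away from the prompt pull the mean down.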

In side-by-side comparisons, interactive frameworks consistently outperform static generation baselines in both controllability and satisfaction:

Method              CLIP Img   CLIP Txt   User Img   User Txt   Sat. Rate
VideoComposer         225.3      62.85      0.180      0.110      43.5%
AnimateDiff           218.0      63.31      0.295      0.220      51.6%
PIA                   225.9      63.68      0.525      0.670      52.5%
InteractiveVideo      234.6      65.31      0.745      0.813      72.8%

6. Applications, Use Cases, and Limitations

Peekaboo-style systems have been validated in a spectrum of real-world scenarios:

  • Content creation and iterative editing. Users can start from photographs or sketches, add and manipulate objects (e.g., painting birds into a landscape and dragging trajectories), refine motion/appearance, and preview changes in near real-time (Zhang et al., 2024).
  • World simulation and controllable character animation. Given user-specified 3D scenes and multi-view character images, open-ended actions or locomotion instructions in natural language or trajectory form yield fully dynamic, coherent video clips (Wang et al., 18 Dec 2025, Menapace et al., 2022).
  • Avatar generation and real-time communication. Audio-driven, fine-grained facial expression and gesture control enables fluid two-way avatar video at up to 78 FPS, supporting live interactions (Yu et al., 6 Jun 2025, Wang et al., 15 Jan 2026).

Known limitations include:

  • Trade-off between latency and fidelity. Aggressive quantization or low-step denoising reduces latency but can sacrifice fine visual detail and consistency.
  • Drift and error accumulation. Despite self-forcing, very long sequences or infrequent corrective inputs can lead to semantic drift or defect accumulation, especially in open-ended or memory-constrained pipelines (Hong et al., 15 Dec 2025).
  • Action vocabulary and physical constraints. Models trained on narrow action sets or lacking explicit physical simulation capture only limited dynamics; further, complex occlusion scenarios remain challenging (Wang et al., 18 Dec 2025).

Potential future directions identified include hierarchical memory architectures, integration of causal LLM planning for open-ended evolution, tighter coupling to physics engines, and memory compression for truly infinite-horizon generation (Yu et al., 30 Apr 2025, Hong et al., 15 Dec 2025).

7. Theoretical and Practical Impact

Peekaboo: Interactive Video Generation marks a convergent paradigm in generative modeling, merging the strengths of diffusion, autoregressive, and neural rendering architectures with multimodal, semantically contextual user interaction. The field demonstrates that, via carefully engineered architectural components—synergistic instruction fusion, dynamic inference-time conditioning, self-forcing memory alignment, and system-level parallelization—one can achieve streaming, user-editable video synthesis with fine-grained control, high temporal coherence, and real-world usability. These systems underlie emerging applications in content creation, gaming, simulation, embodied AI, and telepresence. Ongoing research targets further reductions in latency, expansion of semantic control spaces, and principled integration with causal reasoning frameworks, positioning Peekaboo as a foundational paradigm for next-generation interactive media synthesis (Zhang et al., 2024, Wang et al., 18 Dec 2025, Yu et al., 30 Apr 2025).
