TV2TV: A Unified Framework for Interleaved Language and Video Generation (2512.05103v2)

Published 4 Dec 2025 in cs.LG, cs.AI, and cs.CV

Abstract: Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

Summary

  • The paper introduces TV2TV, a unified framework that interleaves text planning and video chunk generation to enhance semantic consistency and enable mid-sequence user control.
  • It employs a Mixture-of-Transformers with modality-specific towers and autoregressive switching between text and video generation to support dynamic reasoning and video synthesis.
  • Empirical results show significant improvements in visual quality and intervention correctness over existing T2V and plan-then-render baselines across gameplay and real sports domains.

TV2TV: A Unified Generative Framework for Interleaved Language and Video Generation

Motivation and Problem Setting

Standard video generation methods, despite advances in visual fidelity, exhibit systematic limitations when tasked with generating sequences requiring high-level semantic reasoning, temporally extended planning, or controllable action-based interventions. Prior works attempt either pure text-conditioned video generation or structured action-to-video simulation, and neither approach achieves both long-term semantic consistency and fine-grained, open-vocabulary user control. "TV2TV: A Unified Framework for Interleaved Language and Video Generation" (2512.05103) introduces a unified multimodal formulation in which video generation is decomposed into an interleaved process of text planning and video rendering, enabling the model to dynamically alternate between reasoning in language and synthesizing in pixels.

Figure 1: TV2TV interleaves language-based planning and video chunk generation, enabling user prompting and intervention at any stage for flexible video control.

Methodology

Architectural Design

TV2TV utilizes a Mixture-of-Transformers (MoT) backbone with dedicated transformer "towers" for the text and video modalities, architecturally derived from Transfusion. The text tower is initialized from a pretrained LLM (Llama), while video segments are tokenized using a causal 3D CNN VAE, producing a temporally compressed latent representation (1 token per 0.25s of video at 16 FPS). Input sequences interleave text segments and frame chunks in chronological order, with a hybrid attention mask that preserves causality across the sequence while allowing bidirectional attention within each video chunk. To provide clean conditioning context alongside denoising targets, both noisy and clean latent representations of each frame chunk are maintained in the sequence, enabling flow-matching training.

Figure 2: TV2TV builds upon a MoT architecture with modality-specific towers and a unified timeline for autoregressive interleaved generation.
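
To make the hybrid attention pattern concrete, the following is a minimal PyTorch sketch (not the authors' implementation): text tokens attend causally over the past, while tokens inside the same video chunk additionally attend to each other bidirectionally. The segment layout and helper name are illustrative assumptions.

```python
import torch

def hybrid_attention_mask(segments):
    """Build a (T, T) boolean attention mask for an interleaved sequence.

    `segments` is a chronological list of (modality, length) pairs, e.g.
    [("text", 12), ("video", 16), ("text", 5), ("video", 16)].
    True means "query position may attend to key position".
    """
    total = sum(length for _, length in segments)
    # Baseline: standard causal (lower-triangular) attention.
    mask = torch.tril(torch.ones(total, total)) > 0

    offset = 0
    for modality, length in segments:
        if modality == "video":
            # Tokens within the same video chunk attend bidirectionally;
            # attention to earlier segments remains causal.
            mask[offset:offset + length, offset:offset + length] = True
        offset += length
    return mask

# Example: a short text plan followed by one 6-token video chunk.
print(hybrid_attention_mask([("text", 4), ("video", 6)]).int())
```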

Inference and Controllability

During inference, the model autoregressively generates tokens, dynamically switching between text and video generation using special control tokens (BOF/EOF). This design enables the model to "think" (produce next-step textual plans) and then "act" (produce the corresponding video chunks) at will. At any step, users may insert or modify textual plan segments to steer ongoing video synthesis. The autoregressive formulation also permits sliding-window inference, enabling video generation beyond the trained context length. Figure 3

Figure 3: Sequence representations for TV2TV, T2V (no interleaving), and Think2V (plan-then-generate) highlight different planning strategies; TV2TV uniquely enables mid-sequence intervention.
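
The alternation between "thinking" and "acting" can be pictured with a schematic generation loop. The BOF/EOF token names follow the paper; `next_text_token` and `generate_video_chunk` are hypothetical stand-ins for next-token sampling and flow-matching denoising of one frame chunk, not a released API.

```python
def generate_interleaved(model, prompt_tokens, max_steps, interventions=None, window=None):
    """Schematic autoregressive loop that alternates text and video generation.

    `interventions` maps a step index to user-supplied text tokens that are
    spliced into the sequence to steer subsequent video synthesis.
    `window`, if set, keeps only the most recent tokens (sliding-window
    inference for rollouts longer than the trained context).
    """
    sequence = list(prompt_tokens)
    interventions = interventions or {}

    for step in range(max_steps):
        if step in interventions:
            # A user can inject or rewrite a textual plan at any point.
            sequence.extend(interventions[step])

        token = model.next_text_token(sequence)   # "think in words"
        sequence.append(token)

        if token == "BOF":
            # Switch modality: denoise the next frame chunk conditioned on
            # everything generated so far, then return to language modeling.
            sequence.extend(model.generate_video_chunk(sequence))  # "act in pixels"
            sequence.append("EOF")

        if window is not None and len(sequence) > window:
            sequence = sequence[-window:]

    return sequence
```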

Training and Datasets

Video Game Domain Experiments

For empirical validation, TV2TV and two baselines—T2V (no mid-sequence text) and Think2V (all text upfront, then video)—are trained on Counter-Strike: Global Offensive (CS:GO) gameplay data paired with temporally synchronized controller actions rendered as text. The dataset spans 95 hours of video, upsampled and tokenized into the model's input format.
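
As a rough illustration of how such data can be assembled into the interleaved format, here is a sketch that pairs each frame chunk with the controller actions falling inside its time span. The log format, chunk duration, and text rendering are assumptions for illustration; the paper's exact action vocabulary and formatting are not reproduced here.

```python
def interleave_actions_with_video(action_log, video_chunks, chunk_seconds):
    """Pair each tokenized frame chunk with a short action-text segment.

    `action_log` is assumed to be a list of (timestamp_seconds, action_string)
    pairs and `video_chunks` a list of tokenized frame chunks covering the
    clip in order. The result alternates ("text", ...) and ("video", ...)
    entries, matching the interleaved training layout.
    """
    sequence = []
    for i, chunk in enumerate(video_chunks):
        start, end = i * chunk_seconds, (i + 1) * chunk_seconds
        actions = [a for t, a in action_log if start <= t < end]
        text = "; ".join(actions) if actions else "no action"
        sequence.append(("text", text))
        sequence.append(("video", chunk))
    return sequence
```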

Real-World Sports Domain Scaling

To extend interleaved training to open-domain video, the authors construct a dense interleaved text/video corpus by augmenting 8K hours of real sports videos (mined from YT-Temporal-1B) with meta-captions and differential segment captions generated by VLMs (Qwen3-VL) at fine-grained intervals. The data pipeline performs motion-, face-, and quality-based filtering, automated scene segmentation, and hierarchical chunking prior to VLM captioning.

Figure 4: Interleaved data pipeline uses VLM-based captioning and content-aware segmentation for synthetic annotation of large-scale real-world video.
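
The pipeline in Figure 4 can be summarized with a schematic loop over raw videos. The callables below (filtering, scene segmentation, meta- and segment-level VLM captioning with Qwen3-VL) are placeholders for the stages described in the paper, not their actual interfaces.

```python
def build_interleaved_corpus(videos, passes_filters, segment_scenes,
                             meta_caption, caption_segment):
    """Schematic interleaved-data construction for real-world video.

    `passes_filters` stands in for the motion-, face-, and quality-based
    filters; `segment_scenes` for automated scene segmentation and
    hierarchical chunking; `meta_caption` and `caption_segment` for VLM
    captioning. Differential captions describe what changes in a segment
    relative to the preceding context.
    """
    corpus = []
    for video in videos:
        if not passes_filters(video):
            continue
        # Start each example with a clip-level meta-caption.
        example = [("text", meta_caption(video))]
        for segment in segment_scenes(video):
            caption = caption_segment(segment, context=example)
            example.append(("text", caption))
            example.append(("video", segment))
        corpus.append(example)
    return corpus
```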

Experimental Results

Human Evaluation: Video Quality and Controllability

Paired human studies on CS:GO show that TV2TV achieves clear advantages in both visual quality and controllability:

  • In pairwise comparisons, TV2TV is preferred for visual quality over T2V in 91% of cases and over Think2V in 48% (with 8% and 38% ties, respectively), for both short (6s) and long (64s) rollouts. Figure 5

    Figure 5: Human visual quality preference rates for TV2TV significantly outperform T2V and Think2V in both short and long-form settings.

  • On fine-grained controllability, TV2TV achieves intervention correctness 19 points higher than Think2V (0.782 vs. 0.591) when evaluated on user-directed mid-video action changes (jump, left-click, reload, backward), with robust gains for highly dynamic interventions. Figure 6

    Figure 6: TV2TV provides strong improvement in intervention correctness and visual quality during user-driven interventions compared to Think2V.

Out-of-Domain and Baseline Comparisons

When scaled to real sports data, TV2TV demonstrates:

  • Superior prompt alignment, real-world fidelity, and holistic preference over Cosmos2 variants; competitive results with MAGI-1; and a gap only in raw visual quality relative to WAN 2.2 5B, as judged blind by external annotators on the sports domain. Figure 7

    Figure 7: TV2TV surpasses Cosmos2 baselines and performs competitively with MAGI-1 and WAN 2.2 5B for sports generation alignment and quality.

  • Against controlled baselines (T2V and Think2V), TV2TV achieves an overall preference win rate 19 points higher than T2V and 12 points higher than Think2V, with improved alignment and comparable fidelity, even though the VLM-generated captions are less temporally dense and noisier than the CS:GO action text. Figure 8

    Figure 8: TV2TV provides improved holistic preference and alignment compared to T2V and Think2V in a controlled sports setup.

Discussion and Implications

TV2TV’s unified interleaved planning and synthesis paradigm leverages LLM reasoning to reduce entropy in video generation and exposes high-level control handles to the user, a capability previously reserved for pipeline designs with explicit, rigid action interfaces. Strong empirical results establish both the architectural efficiency of the MoT-autoregressive flow hybrid and the practical value of flexible, text-driven control, especially for tasks requiring mid-sequence intervention and robust long-horizon planning.

The quality of interleaved text annotations directly impacts downstream controllability and semantic consistency; hand-structured action text (as in CS:GO) produces larger gains vs. synthetic VLM captions (as in sports), suggesting that future research should focus on obtaining richer, more temporally dense action descriptions. Advances in VLMs for event detection and differential captioning will facilitate further improvements.

Practically, the framework unlocks accessible user steering for real-world video generation systems, blurring the boundary between reasoning and rendering and enabling applications in dynamic media authoring, interactive simulation, and vision-language agent planning. Theoretically, TV2TV strengthens the link between autoregressive LLMs and multimodal world models, providing evidence that explicit text-based reasoning at generation time confers sample-efficiency and controllability advantages over monolithic video diffusion or plan-then-render splits.

Conclusion

TV2TV is a unified multimodal generative model that integrates linguistic plan formulation and video synthesis through autoregressively interleaved text and video chunks. Evaluation demonstrates robust improvements in visual quality, prompt alignment, and fine-grained controllability over traditional T2V and non-interleaved baselines across both synthetic (CS:GO) and real (sports) domains. The architecture's flexibility makes it a credible candidate for future interactive video synthesis and agent control scenarios, with downstream success contingent on more granular and accurate interleaved annotation pipelines and advances in joint vision-language representation learning.
