Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Published 13 Apr 2026 in cs.CV | (2604.11244v2)

Abstract: Advances in Multimodal LLMs (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.


Authors (1)

Tencent Hunyuan Team

Summary

The paper introduces the Multi-Stream Scene Script (MTSS) that factorizes video content into Reference, Shot, Event, and Global streams for improved captioning.
It reports an average 25% reduction in total error rate on Video-SALMONN-2 and an average 67% gain on the Daily-Omni reasoning benchmark.
The framework enhances identity consistency, temporal alignment, and multi-shot controllability, paving the way for more precise audio-visual understanding and generative video modeling.

Deep Structured Audio-Visual Captioning with Multi-Stream Scene Script (MTSS)

Introduction and Motivation

Script-a-Video introduces the Multi-Stream Scene Script (MTSS), a structured paradigm for representing video semantics, proposed as a replacement for conventional monolithic captions. Standard approaches encode the entire temporal, visual, auditory, and identity content of a video into a single narrative, which entangles cross-modal dependencies and impedes local editability, reference consistency, and model learnability. MTSS instead factorizes a video into four specialized, relationally grounded streams: Reference, Shot, Event, and Global. This design targets three structural gaps in previous work: the lack of persistent entity references, the lack of explicit temporal alignment across modalities, and the lack of a modular, non-redundant script for fine-grained downstream control.

Figure 1: Overview of the Multi-Stream Scene Script (MTSS) representational design, depicting factorization into four complementary and interconnected streams for improved traceability and update efficiency.

MTSS Design: Factorization and Relational Grounding

MTSS is built on two essential principles: stream factorization and relational grounding.

Stream Factorization splits video content into:

Reference Stream: A persistent entity bank containing unique identifiers, semantic descriptions, and appearance anchors for principal actors, objects, and scenes. This stream eliminates redundancy by allowing identity mentions across the script to cite an explicit reference (e.g., PERSON_1) rather than regenerate semantic and perceptual details.
Shot Stream: A chronologically ordered set of visual segments, each precisely timestamped and described with both a narrative and explicit camera/action metadata. Each shot binds to the relevant reference entities and links temporally to concurrent auditory events.
Event Stream: A track of temporally grounded, relationally anchored audio events (e.g., dialogues, sound effects, or music) with explicit speaker binding and micro-timestamps for precise lip-sync and action-event correspondence.
Global Stream: Ambient, scene-setting information capturing the overarching context, style, and any persistent or non-localized audio cues.
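The four streams above can be pictured as a small set of record types. The following is a minimal sketch only: the class names, field names, and example values are hypothetical and are not taken from the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReferenceEntry:
    ref_id: str        # unique identifier, e.g. "PERSON_1"
    description: str   # semantic description
    appearance: str    # appearance anchor


@dataclass
class Shot:
    start: float       # seconds
    end: float
    narrative: str
    camera: str        # camera/action metadata
    ref_ids: List[str] = field(default_factory=list)  # entities cited by ID


@dataclass
class AudioEvent:
    start: float
    end: float
    kind: str          # "dialogue", "sfx", or "music"
    content: str
    speaker_id: Optional[str] = None  # speaker binding, if any


@dataclass
class SceneScript:
    references: Dict[str, ReferenceEntry]  # Reference Stream
    shots: List[Shot]                      # Shot Stream
    events: List[AudioEvent]               # Event Stream
    global_context: str                    # Global Stream


# Hypothetical two-shot example.
script = SceneScript(
    references={
        "PERSON_1": ReferenceEntry("PERSON_1", "a chef", "white apron, short dark hair"),
    },
    shots=[
        Shot(0.0, 3.5, "PERSON_1 chops vegetables.", "medium shot, static", ["PERSON_1"]),
        Shot(3.5, 7.0, "PERSON_1 plates the dish.", "close-up, slow push-in", ["PERSON_1"]),
    ],
    events=[
        AudioEvent(0.4, 3.2, "sfx", "knife on cutting board"),
        AudioEvent(4.0, 5.5, "dialogue", "Dinner is ready.", "PERSON_1"),
    ],
    global_context="Warm kitchen, soft ambient jazz throughout.",
)

# A local edit touches one reference entry; shots and events keep citing PERSON_1.
script.references["PERSON_1"].appearance = "blue apron, short dark hair"
```

Because shots and events cite `PERSON_1` by ID instead of repeating its description, the appearance change is a single-field update, which is the kind of local edit a monolithic paragraph would not support.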

Relational Grounding then reinstates semantic cohesion by explicitly mapping each stream’s elements to global entity IDs and precise temporal anchors, thus ensuring that persistent subjects, cross-modal events, and their interactions remain unambiguous and logically coherent across multi-shot content.
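One way to read relational grounding is as a set of referential-integrity checks over the script. The sketch below assumes a hypothetical dictionary layout (`references`, `shots`, `events`, `ref_ids`, `speaker_id`) and is not the paper's validation procedure; it only illustrates that identity links and temporal links can be checked mechanically.

```python
def check_grounding(script):
    """Return a list of grounding problems (empty if the script is consistent)."""
    problems = []
    known_ids = set(script["references"])
    shots = script["shots"]

    # Identity links: every entity cited by a shot must exist in the reference bank.
    for i, shot in enumerate(shots):
        for ref_id in shot["ref_ids"]:
            if ref_id not in known_ids:
                problems.append(f"shot {i}: unknown reference {ref_id}")

    for j, event in enumerate(script["events"]):
        # Identity links: a bound speaker must also be a known reference.
        speaker = event.get("speaker_id")
        if speaker is not None and speaker not in known_ids:
            problems.append(f"event {j}: unknown speaker {speaker}")
        # Temporal links: each audio event must overlap at least one shot.
        overlaps = any(
            event["start"] < s["end"] and event["end"] > s["start"] for s in shots
        )
        if not overlaps:
            problems.append(f"event {j}: no overlapping shot")
    return problems


# Hypothetical script with one dangling ID and one event outside every shot.
script = {
    "references": {"PERSON_1": "a chef in a white apron"},
    "shots": [
        {"start": 0.0, "end": 3.5, "ref_ids": ["PERSON_1"]},
        {"start": 3.5, "end": 7.0, "ref_ids": ["PERSON_2"]},
    ],
    "events": [
        {"start": 4.0, "end": 5.5, "speaker_id": "PERSON_1"},
        {"start": 8.0, "end": 9.0, "speaker_id": None},
    ],
}
problems = check_grounding(script)
print(problems)
# ['shot 1: unknown reference PERSON_2', 'event 1: no overlapping shot']
```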

Figure 2: Example of a Multi-Stream Scene Script (MTSS) showing relationally grounded streams with explicit references, timestamps, and contextual linkage.

Improvements on Video Understanding and Captioning

On multiple open and proprietary Audio-Visual MLLMs, MTSS consistently delivers reduced error rates and enhanced detail granularity in captioning and a notable boost in downstream reasoning:

On Video-SALMONN-2, MTSS reduces the total error rate by an average of 25% relative to monolithic captions. On UGC-VideoCap, the detail score for Qwen3-Omni rises from 62.80 to 71.54, and on the Daily-Omni reasoning benchmark the average performance gain is 67% (2604.11244).
MTSS narrows the performance gap between small and large MLLMs, evidencing improved learnability and compositional generalization.
Fine-tuning with MTSS brings open-source models to parity with commercial state-of-the-art systems and halves omission rates relative to monolithic baselines.

Attribution of the Gains

Ablation studies and baseline comparisons suggest that the improvements stem from the structured representation itself rather than from extra annotation alone: replacing only the prompt format, with no architecture changes, already yields measurable quantitative and qualitative gains.

MTSS for Advanced Generative Video Modeling

The authors systematically evaluate MTSS as a control interface for multi-shot, identity-consistent, and jointly audio-visual video generation using an adapted LTX-2 generative framework. Two architectural augmentations leverage the MTSS structure:

Shot-Aware Structured Attention partitions the diffusion model’s attention by explicit shot boundaries, with cross-attention between visual tokens and shot-specific embeddings, achieving sharp inter-shot context isolation and enhancing multi-shot controllability.
Identity Customization injects character and object identities via reference VAE features and learnable tokens, directly grounded by the Reference Stream.
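The shot-partitioned cross-attention described above can be illustrated with a boolean mask. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name, the per-token shot indices, and the `GLOBAL` marker for text tokens visible to every shot are all hypothetical.

```python
GLOBAL = -1  # hypothetical marker for text tokens shared by every shot


def shot_cross_attention_mask(visual_shot_ids, text_shot_ids):
    """Build mask[i][j]: True if visual token i may attend to text token j.

    A visual token attends only to text tokens of its own shot, plus any
    token marked GLOBAL, so context from other shots is isolated.
    """
    return [
        [t == v or t == GLOBAL for t in text_shot_ids]
        for v in visual_shot_ids
    ]


# Two visual tokens in shot 0 and three in shot 1.
visual = [0, 0, 1, 1, 1]
# One global text token, two tokens for shot 0, one for shot 1.
text = [GLOBAL, 0, 0, 1]

mask = shot_cross_attention_mask(visual, text)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In a real attention layer, a mask like this would typically be applied by setting disallowed logits to a large negative value before the softmax.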

Figure 3: The pipeline overview with dual-branch DiT architecture, highlighting MTSS-driven shot-awareness and identity customization.

Empirical Results

Quantitative benchmarks and human ratings confirm substantial gains:

Metric (human rating)	Monolithic Prompt	MTSS Prompt	MTSS Prompt + Model Adaptation
Multi-shot Consistency	1.22	1.77	2.41
Audio-Visual Alignment	1.18	1.85	2.26
Controllability	1.00	1.71	2.59

Pure representational replacement (MTSS prompts with an unmodified model) already accounts for roughly half of the total improvement, between about 45% and 62% depending on the metric, with model-level changes supplying the rest. Automated metrics (e.g., shot boundary deviation, Ref. ID similarity, WER, A-V sync) consistently favor the full MTSS pipeline.
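The relative gains implied by the table can be recomputed directly from its ratings. The short script below is only an arithmetic check on the numbers shown above; the variable and function names are illustrative.

```python
# Human ratings from the table: (monolithic, MTSS prompt, MTSS + model adaptation).
ratings = {
    "Multi-shot Consistency": (1.22, 1.77, 2.41),
    "Audio-Visual Alignment": (1.18, 1.85, 2.26),
    "Controllability": (1.00, 1.71, 2.59),
}


def summarize(mono, prompt, full):
    """Return (relative gain of prompt-only over monolithic in %,
    share of the total gain that prompt-only accounts for in %)."""
    rel_gain = (prompt - mono) / mono * 100
    share = (prompt - mono) / (full - mono) * 100
    return round(rel_gain), round(share)


summary = {task: summarize(*vals) for task, vals in ratings.items()}
for task, (rel, share) in summary.items():
    print(f"{task}: +{rel}% from prompts alone, {share}% of the total gain")
```

The prompt-only relative gains land within a point of the 45%, 56%, and 71% figures quoted in the abstract, and the prompt-only share of the total gain ranges from about 45% to 62% across the three metrics.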

Qualitative comparisons (Figures 4–7) illustrate:

Monolithic prompts lead to identity drift, weak shot boundaries, and flat/irrelevant audio.
MTSS-structured control, even without further adaptation, yields persistent identity, accurate shot transitions, and precise audio-visual correspondence.
The full MTSS pipeline achieves near-human-level alignment and narrative coherence in both single- and multi-shot scenarios.

Figure 4: Single-shot comparison: Only MTSS-structured specification maintains identity and continuous audio coverage, outperforming monolithic and weak baselines.

Figure 5: Single-shot comparison: Only MTSS-structured pipelines generate speech-like, temporally precise AV patterns.

Figure 6: Multi-shot comparison: MTSS and pipeline variants achieve accurate identity and event grounding across dynamic, cross-shot transitions.

Figure 7: Multi-shot comparison: Periodic audio-visual events and boundaries (e.g., footsteps) are only synchronized via MTSS streams and grounding.

Theoretical Implications and Future Prospects

The MTSS paradigm establishes a data-centric foundation for both discriminative and generative tasks:

It exposes video in a form resembling a relational database, which makes cross-modal temporal inference and reference retrieval tractable within MLLMs and substantially reduces the search/decoding space in generative models.
For generation, MTSS offers a universal and composable scripting interface well-suited to advanced, controller-rich Diffusion Transformer architectures.
Future directions include scaling structured annotation quality for open-source MLLMs, optimizing cross-stream attention and alignment, and comprehensive evaluation on long-form and out-of-domain multimodal content.
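The relational-database analogy can be made concrete with an in-memory SQLite schema. This is an illustrative sketch only: the table names, column names, and rows are hypothetical, and the paper does not describe storing MTSS in SQL. It shows how a cross-modal temporal question ("in which shots does PERSON_1 speak?") becomes an interval-overlap join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refs   (ref_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE shots  (shot_id INTEGER PRIMARY KEY, t_start REAL, t_end REAL,
                     narrative TEXT);
CREATE TABLE events (event_id INTEGER PRIMARY KEY, t_start REAL, t_end REAL,
                     kind TEXT, content TEXT,
                     speaker_id TEXT REFERENCES refs(ref_id));
""")

conn.execute("INSERT INTO refs VALUES (?, ?)", ("PERSON_1", "a chef in a white apron"))
conn.executemany(
    "INSERT INTO shots VALUES (?, ?, ?, ?)",
    [
        (1, 0.0, 3.5, "PERSON_1 chops vegetables."),
        (2, 3.5, 7.0, "PERSON_1 plates the dish."),
    ],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 0.4, 3.2, "sfx", "knife on cutting board", None),
        (2, 3.0, 5.5, "dialogue", "Dinner is ready.", "PERSON_1"),
    ],
)

# Join events to shots by time-interval overlap, filtered by speaker identity.
rows = conn.execute(
    """
    SELECT s.shot_id, e.content
    FROM events AS e
    JOIN shots AS s ON e.t_start < s.t_end AND e.t_end > s.t_start
    WHERE e.speaker_id = ?
    ORDER BY s.shot_id, e.t_start
    """,
    ("PERSON_1",),
).fetchall()
print(rows)  # [(1, 'Dinner is ready.'), (2, 'Dinner is ready.')]
```

The dialogue line spans the cut at 3.5 s, so it is returned for both shots; in a monolithic caption that relationship would have to be inferred from prose.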

Conclusion

Script-a-Video’s MTSS paradigm (2604.11244) offers a principled and scalable framework for structured audio-visual captioning and tightly controlled multi-shot video generation. By decomposing the representational load into relationally grounded streams, MTSS addresses the key limitations of monolithic captioning: ambiguity, redundancy, and intractable edit/update logic. The reported results show gains in descriptive fidelity, logical reasoning, identity consistency, multi-shot controllability, and audio-visual synchronization. MTSS is a plausible backbone for next-generation semantic interfaces in both video understanding and generative modeling.
