- The paper’s main contribution is the formulation of test-time scaling for video generation by recasting the task as a search for optimal generation trajectories.
- It introduces two heuristic algorithms—Random Linear Search and Tree-of-Frames (ToF) Search—that leverage verifier feedback to balance computational cost and quality.
- Experiments show significant improvements in video fidelity and prompt alignment across both diffusion-based and autoregressive models without retraining.
The paper "Video-T1: Test-Time Scaling for Video Generation" (2503.18942) investigates the application of Test-Time Scaling (TTS) techniques, previously explored in LLMs, to the domain of text-to-video generation. The primary objective is to determine the extent to which increasing computational expenditure during the inference phase can enhance the quality and prompt fidelity of videos generated by pre-trained models, particularly for complex prompts, without resorting to costly model retraining or expansion.
Video-T1 Framework: TTS as Trajectory Search
The core contribution is the formalization of TTS for video generation within a framework termed Video-T1. This framework recasts the problem as a search for optimal generation trajectories within the latent space, starting from initial Gaussian noise and navigating towards the target video distribution conditioned on the input text prompt. The framework comprises three essential components:
- Video Generator (G): A pre-trained text-to-video model. The paper demonstrates applicability across both diffusion-based models (e.g., OpenSora, CogVideoX) and autoregressive models (e.g., NOVA, Pyramid-Flow).
- Test Verifiers (V): One or more multimodal models capable of evaluating the quality, coherence, and text-alignment of generated video frames or sequences. These verifiers provide quantitative feedback (scores) used to guide the search. The paper utilizes models such as VisionReward, VideoScore, and VideoLLaMA3, and also proposes an ensemble approach ("Multi-Verifiers") to mitigate potential biases of individual verifiers and enhance robustness.
- Heuristic Search Algorithms (f): Algorithms that leverage the verifier feedback to explore the solution space and identify superior video generation trajectories. The paper introduces and evaluates two specific algorithms.
The overarching principle is that by strategically increasing computations at inference time (e.g., exploring multiple noise initializations or generation pathways), guided by verifier feedback, one can identify outputs that better satisfy the prompt constraints and quality criteria compared to a single, standard inference pass.
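The three components can be pictured as a small interface: a generator G, a verifier V, and a pluggable search algorithm f operating under a compute budget. The sketch below is illustrative only; the names and signatures are assumptions, not the paper's actual API, and real generators/verifiers return and consume model-specific tensors.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in type for a generated video; real systems use tensors.
Video = List[float]

@dataclass
class VideoT1:
    """Hypothetical wiring of the three Video-T1 components (G, V, f)."""
    generate: Callable[[str, int], Video]            # G: (prompt, noise seed) -> video
    score: Callable[[Video, str], float]             # V: (video, prompt) -> scalar score
    search: Callable[["VideoT1", str, int], Video]   # f: search algorithm

    def run(self, prompt: str, budget: int) -> Video:
        # Delegate to the chosen search algorithm under a compute budget.
        return self.search(self, prompt, budget)
```

Swapping the `search` field is all it takes to move between Linear Search and ToF-style strategies, which is the sense in which the framework is model- and algorithm-agnostic.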
Test-Time Scaling Algorithms
Two primary heuristic search algorithms are implemented and analyzed within the Video-T1 framework:
Random Linear Search
This approach represents a straightforward "Best-of-N" strategy.
1. Sample N distinct initial noise vectors z_0^(1), ..., z_0^(N).
2. Independently execute the full video generation process G for each noise vector to produce N complete video sequences v_1, ..., v_N.
3. Employ the test verifier(s) V to compute a quality/alignment score S(v_i) for each generated video v_i.
4. Select the video v* = argmax_i S(v_i) as the final output.
- Computational Cost: The computational complexity scales linearly with the number of candidates N and the cost of generating a single video (proportional to video length/denoising steps T). The cost is approximately O(T×N). While simple, this method becomes computationally demanding for large N or long videos, as it requires completing the generation for all N candidates.
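The four steps above reduce to a simple Best-of-N loop. The following is a minimal sketch with toy stand-ins for G and V (in practice these are the video model and the multimodal verifier; the function names here are illustrative):

```python
from typing import Callable, List

def random_linear_search(
    generate: Callable[[str, int], List[float]],  # G: one full video per noise seed
    score: Callable[[List[float]], float],        # V: scalar quality/alignment score
    prompt: str,
    n: int,
) -> List[float]:
    """Best-of-N: run N complete generations and keep the top-scoring video.

    Cost is O(T * N), since every candidate is generated to completion
    before the verifier ranks them.
    """
    best_video, best_score = None, float("-inf")
    for seed in range(n):
        video = generate(prompt, seed)   # independent noise initialization
        s = score(video)
        if s > best_score:
            best_video, best_score = video, s
    return best_video
```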
Tree-of-Frames (ToF) Search
ToF Search is proposed as a more computationally efficient alternative, inspired by similar structured search methods in other domains and adapted for video generation, particularly compatible with autoregressive frame generation or intermediate steps in diffusion models.
1. Tree Structure: Video generation is conceptualized as constructing a tree (or a forest of trees if starting from multiple initial points) where nodes represent intermediate states (e.g., partially denoised frames or generated frames in an autoregressive sequence). Edges represent steps in the generation process.
2. Adaptive Expansion: At each step or stage t (e.g., generating the next frame or a block of denoising steps), multiple (bt) potential continuations or branches are generated from promising parent nodes.
3. Verifier-Guided Pruning: The test verifier(s) V are used during the generation process to evaluate these intermediate branches. A heuristic score H, derived from the verifier output, quantifies the potential of each branch.
4. Selection: Only the top kt branches (where kt<bt) with the highest heuristic scores are retained and expanded further in subsequent steps. Unpromising branches are pruned early, avoiding the computational cost of completing their generation.
5. Image-Level Alignment: For diffusion models, this involves evaluating frame quality during the denoising process itself, allowing early termination within a single frame's generation if alignment is poor.
6. Hierarchical Prompting: To improve the granularity of verification, especially for long videos with evolving content, an LLM (e.g., GPT-4) is used offline to decompose the main text prompt into stage-specific sub-prompts. These sub-prompts guide the verifier's assessment at corresponding stages of the ToF search (e.g., verifying the initial scene setup, then intermediate actions, then the final state).
- Computational Cost: By pruning unpromising paths early, ToF significantly reduces the overall computation compared to linear search. While the exact cost depends on the branching factors (bt), pruning factors (kt), and tree depth (T), the potential cost can be much lower, roughly estimated as closer to O(N+T) under favorable conditions, where N relates to the total number of explored nodes rather than full videos.
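The pruning economics can be illustrated with a beam-style sketch (this is an illustration of the expand-score-prune pattern, not the paper's implementation; `extend` and `score` are hypothetical stand-ins for partial generation and verifier feedback):

```python
from typing import Callable, List, Tuple

def tof_search(
    extend: Callable[[List[int], int], List[int]],  # grow a partial trajectory one step
    score: Callable[[List[int]], float],            # verifier heuristic on a partial trajectory
    depth: int,   # number of generation stages (frames / blocks of denoising steps)
    b: int,       # branches expanded per retained node (b_t)
    k: int,       # nodes retained after pruning (k_t)
) -> Tuple[List[int], int]:
    """Expand, score, and prune at every stage, counting function evaluations.

    NFE grows as O(depth * k * b), versus O(depth * N) for Best-of-N,
    because pruned branches are never generated to completion.
    """
    beam = [[]]   # single root: the initial noise state
    nfe = 0
    for _ in range(depth):
        candidates = [extend(node, j) for node in beam for j in range(b)]
        nfe += len(candidates)
        candidates.sort(key=score, reverse=True)  # verifier-guided ranking
        beam = candidates[:k]                     # prune unpromising branches early
    return beam[0], nfe
```

With depth 3, b = 3, k = 2, only 15 partial extensions are evaluated, whereas Best-of-N would need 3 full trajectories per candidate to explore comparably many endpoints.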
Implementation Details and Considerations
- Verifier Integration: The verifiers (V) are crucial. They need to be efficient enough to be called potentially many times during the search. The choice of verifier(s) impacts the quality assessment; using an ensemble (Multi-Verifiers) can provide more balanced feedback. The output is typically a scalar score.
```python
# Pseudocode for Verifier Usage in ToF
def get_heuristic_score(video_segment, prompt, verifier):
    # Evaluate the quality/alignment of the current segment
    score = verifier.evaluate(video_segment, prompt)
    return score

# Inside ToF search loop at step t:
candidate_branches = generate_continuations(parent_node, b_t)
scores = []
for branch in candidate_branches:
    # Use stage-specific sub-prompt if available
    current_prompt = get_sub_prompt(t, hierarchical_prompts)
    score = get_heuristic_score(branch.get_segment(), current_prompt, verifier)
    scores.append((branch, score))

# Retain only the top k_t branches for further expansion
scores.sort(key=lambda x: x[1], reverse=True)
selected_branches = [branch for branch, score in scores[:k_t]]
# Continue expansion from selected_branches
```
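For the Multi-Verifiers ensemble, one simple aggregation is to min-max normalize each verifier's scores over the candidate set and average, so that no single verifier's scale or bias dominates the ranking. This is a hedged sketch; the paper's exact aggregation rule may differ:

```python
from typing import Callable, List, Sequence

def ensemble_score(
    verifiers: Sequence[Callable[[object, str], float]],  # each: (video, prompt) -> score
    candidates: Sequence[object],
    prompt: str,
) -> List[float]:
    """Average min-max-normalized scores across verifiers per candidate."""
    per_verifier = []
    for v in verifiers:
        raw = [v(c, prompt) for c in candidates]
        lo, hi = min(raw), max(raw)
        span = (hi - lo) or 1.0  # guard against all-equal scores
        per_verifier.append([(r - lo) / span for r in raw])
    # Mean across verifiers for each candidate
    n = len(verifiers)
    return [sum(col) / n for col in zip(*per_verifier)]
```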
- Hierarchical Prompting Setup: This requires a preliminary step using an LLM to break down the main prompt based on expected temporal stages.
```python
# Pseudocode for Hierarchical Prompt Generation
def generate_sub_prompts(main_prompt, num_stages, LLM):
    query = (
        f"Decompose the video described by '{main_prompt}' into {num_stages} "
        "distinct temporal stages. For each stage, provide a concise prompt "
        "describing the key visual elements or actions."
    )
    response = LLM.generate(query)
    # Example result: ["Initial scene setup", "Main action", "Concluding state"]
    sub_prompts = parse_LLM_response(response)
    return sub_prompts

# Hierarchical prompts are then used by the verifier at corresponding stages in ToF
```
- Trade-offs: The primary trade-off is between inference compute/latency and output quality. Increasing N in Linear Search or the branching/exploration factors in ToF increases computational cost but generally improves results up to a point. ToF offers a better balance by focusing compute on more promising avenues.
- Compatibility: The Video-T1 framework is designed to be model-agnostic, applicable to various underlying video generation architectures. The specific implementation details (e.g., how to define intermediate states or branches) may vary depending on whether the generator is diffusion-based or autoregressive.
Experimental Validation and Results
Extensive experiments were conducted using various open-source video generators and the VBench benchmark. Key findings include:
- Consistent Improvement: Both Linear Search and ToF Search consistently demonstrated significant improvements in video quality and text alignment compared to standard single-pass generation, as measured by metrics like VBench scores. The improvements were particularly notable for challenging prompts requiring complex dynamics or high fidelity.
- ToF Efficiency: The ToF search algorithm achieved comparable or sometimes superior results to Linear Search but with substantially lower computational requirements, measured in GFLOPs and Number of Function Evaluations (NFE). For instance, ToF might achieve similar quality to Linear Search with N=16 while incurring only a fraction of the computational cost.
- Model Scaling: Larger, more capable foundation video models tended to derive greater benefit from TTS compared to smaller models.
- Verifier Impact: The choice of verifiers influences the search outcome. The Multi-Verifier ensemble approach generally yielded the most robust improvements.
These results strongly support the hypothesis that allocating additional compute at test time, guided by explicit verification, is an effective strategy for enhancing video generation quality without modifying the base model.
Conclusion
The "Video-T1" paper introduces a principled framework for applying Test-Time Scaling (TTS) to video generation by formulating it as a search problem over generation trajectories. It proposes and validates two search algorithms, Random Linear Search and the more efficient Tree-of-Frames (ToF) search, demonstrating that increased inference-time computation, guided by test verifiers, can significantly improve the quality and prompt adherence of generated videos across various models and benchmarks. The ToF method, incorporating techniques like hierarchical prompting and early pruning, provides a practical approach to achieve these gains with substantially lower computational overhead compared to exhaustive search methods.