Step-Video-TI2V: Advancements in Text-Driven Image-to-Video Generation
Introduction: The paper introduces Step-Video-TI2V, a text-driven image-to-video generation model for AI-generated video content. The model, containing 30 billion parameters, can generate videos of up to 102 frames conditioned on both textual and image inputs. The paper concurrently presents Step-Video-TI2V-Eval, a benchmark dataset devised to evaluate text-driven image-to-video models, positioning Step-Video-TI2V against existing commercial and open-source solutions.
Contributions and Model Architecture: The authors articulate four major contributions:
- Step-Video-TI2V’s Model Scale: At 30 billion parameters, it is reported as the largest open-source image-to-video model released to date.
- Enhanced Motion Dynamics Control: The model offers users the capability to manipulate the motion dynamics, thus facilitating greater creative control over video output.
- Anime-Style Generation: Due to the composition of its training data, the model demonstrates superior performance in generating anime-style content, highlighting its specialization and strength in niche video styles.
- Establishing a New Benchmark: Step-Video-TI2V-Eval stands as a comprehensive benchmark dataset fostering further research and evaluation efforts in this sector.
The model is built upon the pre-trained Step-Video-T2V architecture, extended with two mechanisms: image conditioning and motion conditioning. These allow the input image to be incorporated as the first frame of the generated video while letting users steer the video's motion dynamics through user-defined parameters.
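The first-frame conditioning described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tensor layout, the zero-padding of the image latent across the temporal axis, and the binary frame mask are assumptions based on common practice in image-to-video diffusion models.

```python
import torch

def build_conditioned_latents(image_latent, video_latents):
    """Combine a first-frame image latent with noisy video latents.

    image_latent:  (B, C, 1, H, W) latent of the conditioning image
    video_latents: (B, C, F, H, W) noisy latents for F video frames

    Returns a tensor that concatenates, along the channel axis:
    the noisy latents, the (temporally zero-padded) image latent,
    and a binary mask marking which frames are conditioned.
    """
    B, C, F, H, W = video_latents.shape
    # Pad the single-frame image latent with zeros so it spans all F frames.
    padded = torch.cat(
        [image_latent, torch.zeros(B, C, F - 1, H, W)], dim=2
    )
    # Mask: 1 for the conditioned first frame, 0 for frames to be generated.
    mask = torch.zeros(B, 1, F, H, W)
    mask[:, :, 0] = 1.0
    # Channel-wise concatenation fed to the denoising network.
    return torch.cat([video_latents, padded, mask], dim=1)
```

Under this layout, the network sees 2C+1 input channels and can always locate the clean first-frame content via the mask channel.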
Technical Implementation: The technical description examines both the image and motion conditioning techniques. Image inputs are encoded into latent representations and concatenated with the video latent features, enabling the model to generate video sequences coherent with the input frame. Motion conditioning leverages optical-flow computations: flow magnitudes quantify how much a clip moves, turning the abstract notion of motion dynamics into a concrete conditioning signal that users can set at inference time.
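A minimal sketch of how an optical-flow-based motion signal might be derived, assuming per-frame-pair flow fields are already computed (e.g. by an off-the-shelf flow estimator); the function names and the bucket edges are illustrative assumptions, not values from the paper.

```python
import numpy as np

def motion_score(flows):
    """Average optical-flow magnitude over a clip.

    flows: list of (H, W, 2) arrays, one flow field per consecutive
    frame pair, holding (u, v) displacement components in pixels.
    Returns a scalar summarizing the clip's motion dynamics.
    """
    mags = [np.linalg.norm(f, axis=-1).mean() for f in flows]
    return float(np.mean(mags))

def motion_bucket(score, edges=(2.0, 5.0, 10.0)):
    """Map a continuous motion score to a discrete motion level.

    The bucket edges here are hypothetical; a real system would
    calibrate them on the training data's flow statistics.
    """
    return int(np.searchsorted(edges, score))
```

At training time each clip's score is bucketed and fed to the model as a condition; at inference the user picks a bucket directly to request calmer or more dynamic motion.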
Benchmarking and Results: The performance of Step-Video-TI2V is tested against alternatives, OSTopA and OSTopB (open-source) and CSTopC and CSTopD (commercial), on Step-Video-TI2V-Eval. The model achieves the highest overall score, with particular strength in camera-motion control and anime-style video generation. A secondary evaluation on VBench-I2V underscores the model's robustness across varied motion complexities, with strong results in visual fidelity and video-text alignment.
Implications and Future Work: This work has significant implications for the generation of customized and stylistically complex video content, with potential applications spanning entertainment, advertising, and content creation. The explicit control over motion elements and stylistic nuances could streamline workflows for creatives looking to leverage AI in video production. Future research may focus on enhancing real-world-style video generation, broadening the model's applicability, and refining its ability to interpret and accurately implement complex instructions.
In summary, Step-Video-TI2V and its associated benchmark represent a significant step forward in the development of highly scalable and functionally diverse text-driven image-to-video generation models, underlined by meticulous attention to both technical detail and application potential.