I2V-Bench: Image-to-Video Evaluation

Updated 4 July 2025
  • I2V-Bench is a specialized collection of benchmarks and datasets designed to evaluate image-to-video generation by measuring visual consistency, motion realism, and control.
  • It improves on traditional methods by using paired datasets, detailed metric evaluations like FC, FID, FVD, and region-wise annotations for nuanced motion control.
  • The framework drives progress in applications from surveillance to creative animation, supporting robust and reproducible assessments of advanced generative models.

I2V-Bench refers to a series of benchmarks and datasets specifically designed to rigorously evaluate the performance of image-to-video (I2V) generation models under a variety of qualitative and quantitative criteria. Developed in response to the limitations of prior evaluation schemes—often constructed for broader text-to-video (T2V) or detection tasks—I2V-Bench frameworks enable the systematic assessment of visual consistency, motion realism, controllability, and video fidelity in cutting-edge I2V research.

1. Origins and Motivation

I2V-Bench originated as a targeted solution to the inadequacies of general-purpose video generation benchmarks, which failed to measure essential aspects of I2V such as the preservation of subject identity, spatiotemporal coherence, and nuanced motion control. Conventional datasets often contained discontinuous or unpaired images (unsuitable for sequence modeling) and were not constructed for evaluating synthesized video consistency relative to a fixed reference frame or specific user intent. Benchmarks such as IRVI (the "I2V-Bench" introduced with I2V-GAN) and more general-purpose open-domain I2V-Bench variants (as in ConsistI2V) address these deficits through task-specific dataset design, detailed evaluation metrics, and human preference assessments.

2. Benchmark Design and Dataset Composition

I2V-Bench frameworks exhibit the following core characteristics:

  • Curated Dataset Structure: I2V-Bench datasets are organized to maximize practical coverage, typically including paired reference images and video sequences annotated for various types of motion or scene categories. For example, IRVI features 12 video clips (6 for training, 6 for testing) with 24,352 temporally aligned IR and VI frames encompassing traffic and monitoring scenes at a fixed spatial resolution. Datasets in newer I2V-Bench iterations extend to open-domain, user-annotated scenarios, with diverse prompts and user-defined trajectories or motion masks.
  • Paired and Unpaired Video Domains: Some versions focus on translation between distinct modalities (e.g., infrared to visible), while others target general image animation or scene-driven video generation.
  • Region-wise Annotations and User Control: Certain datasets (e.g., MC-Bench) provide fine-grained, manually brushed motion regions and trajectories to facilitate evaluation of controllability and disentanglement between object and camera motion. A schematic of such a benchmark sample is sketched after this list.
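
The composition described above can be summarized as a single data record. Below is a minimal sketch in Python, assuming hypothetical field names (no released I2V-Bench loader defines these identifiers); it shows the kind of paired, annotated sample these benchmarks organize:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class I2VBenchSample:
    """One benchmark record: a reference frame, the ground-truth clip,
    and optional control annotations. Field names are illustrative only."""
    reference_frame: np.ndarray                 # (H, W, 3) input/reference image
    video_frames: np.ndarray                    # (T, H, W, 3) temporally aligned clip
    prompt: Optional[str] = None                # text prompt (open-domain variants)
    motion_mask: Optional[np.ndarray] = None    # (H, W) brushed region (MC-Bench style)
    trajectory: Optional[np.ndarray] = None     # (K, 2) user-specified point track
    scene_category: Optional[str] = None        # e.g., "traffic" or "monitoring" (IRVI)
```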

3. Evaluation Criteria and Metrics

I2V-Bench assessment encompasses a multi-faceted set of metrics:

  • Visual Consistency: The primary focus is the preservation of subject, background, and style from the initial (reference) frame throughout the video. Frame Consistency (FC) metrics measure average similarity—using learned feature encoders such as CLIP or DINO—between each generated frame and the input frame:

$$\mathrm{FC} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{sim}\big(\mathrm{feat}(G_0),\, \mathrm{feat}(G_t)\big)$$

where $\mathrm{feat}(\cdot)$ is the feature extractor and $\mathrm{sim}$ is typically cosine similarity.
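
As a concrete illustration, here is a minimal sketch of the FC computation, assuming frame features have already been extracted with an encoder such as CLIP or DINO (the function name and array convention are assumptions, not any benchmark's released code):

```python
import numpy as np

def frame_consistency(features: np.ndarray) -> float:
    """Average cosine similarity between the reference frame's features
    (row 0) and each generated frame's features (rows 1..T).

    features: (T + 1, D) array; row 0 is feat(G_0), row t is feat(G_t).
    """
    ref = features[0] / np.linalg.norm(features[0])
    gen = features[1:]
    gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    # With unit-norm rows, the dot product is exactly cosine similarity.
    return float(np.mean(gen @ ref))
```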

  • Motion Quality and Diversity: Metrics such as average optical flow displacement, dynamic range, and learned measures such as CLIP-based motion scores or FlowScore quantify temporal coherence, plausibility, and motion amplitude.
  • Visual Quality: Fréchet Video Distance (FVD) and Fréchet Inception Distance (FID) measure realism and distributional similarity between generated and ground-truth samples (lower is better).
  • Controllability and Alignment: For benchmarks supporting explicit user control, metrics such as MD-Img and MD-Vid (mean distance between the user-specified trajectory and the resulting motion) are employed. Camera-control metrics, Rotation Error (RotErr) and Translation Error (TransErr), assess the effectiveness of camera-guided video generation. Sketches of the Fréchet and mean-distance computations follow this list.
  • Human Evaluation: Subjective tests include ranking or rating video samples for realism, fidelity, frame consistency, motion plausibility, and overall preference, typically across a pool of annotators.
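
To make the two most common quantitative scores concrete, the following is a minimal sketch of the Fréchet computation shared by FID and FVD, together with an MD-style trajectory distance. Function names and array conventions are illustrative assumptions rather than the released implementations of any cited benchmark; in practice the Gaussian statistics come from Inception activations (FID) or I3D activations (FVD):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Fréchet distance between two Gaussians (mu, sigma) fitted to feature
    activations; the core of both FID and FVD (only the features differ):
    d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)   # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                # drop numerical-noise imaginaries
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def mean_trajectory_distance(user_traj: np.ndarray, gen_traj: np.ndarray) -> float:
    """MD-style score: mean Euclidean distance between a user-specified
    trajectory and the one recovered from the generated video, both (K, 2)."""
    return float(np.mean(np.linalg.norm(user_traj - gen_traj, axis=1)))
```

Lower values are better for FID, FVD, and MD-style scores alike.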

4. Major Dataset and Metric Examples

Table: Exemplary I2V-Bench Datasets and Their Features

| Name | Main Task Type | Key Features |
|------|----------------|--------------|
| IRVI | IR-to-visible video translation | 12 paired video clips, traffic & monitoring, 24k+ frames, temporal alignment |
| MC-Bench | Fine-grained motion control | 1.1K user-annotated image-trajectory pairs, motion masks for object/camera |
| Open-domain I2V-Bench | Visual consistency in open domains | Multiple datasets, varied prompts and scenes, FC/FVD/human preference |

Evaluation Metrics Table (by category):

| Metric | Measured Property | Typical Use |
|--------|-------------------|-------------|
| FC | Visual consistency | ConsistI2V, generic I2V |
| FID/FVD | Visual/video quality | IRVI, ConsistI2V |
| FlowScore | Motion amplitude | I2V-Adapter |
| MD-Img/Vid | Trajectory alignment | MotionPro, MC-Bench |
| RotErr | Camera pose control | RealCam-I2V |

5. Representative Results and Findings

Evaluations across I2V-Bench have repeatedly demonstrated the following:

  • Leading methods (e.g., I2V-GAN, I2V-Adapter, Motion-I2V, ConsistI2V, MotionPro, RealCam-I2V) substantially outperform baselines in visual consistency and overall video quality as measured by FC, FID, FVD, and human preference rates.
  • Newer benchmarks capture nuanced challenges: for example, Dynamic-I2V’s DIVE metric exposes the overfitting of prior models to low-motion regimes and quantitatively rewards motion diversity, controllability, and fidelity in a balanced manner.
  • Methods equipped for region-wise or trajectory-guided evaluation (e.g., MotionPro on MC-Bench) yield lower MD-Img/MD-Vid values, denoting improved user alignment and disentanglement of motion types. RealCam-I2V achieves lower RotErr/TransErr than previous methods, validating enhanced camera control.

6. Impact on Research and Applications

The introduction of I2V-Bench frameworks has directly facilitated:

  • Benchmark-driven progress in visual consistency, frame stability, realistic motion, and fine-grained user controllability, serving as a proving ground for architectural advances (e.g., cross-frame attention, explicit motion modeling, fine-grained feature modulation).
  • Deployment in real-world applications such as surveillance (I2V-GAN, IRVI), creative content generation and personalized animation (I2V-Adapter, ConsistI2V), interactive camera control and scene-based looping (RealCam-I2V), and robust motion editing or video-to-video translation (Motion-I2V).
  • The community-wide adoption of open benchmarks, with datasets and code frequently released for reproducibility and third-party comparison.

7. Future Directions and Standardization

I2V-Bench continues to evolve by addressing remaining gaps such as:

  • Broader support for high-resolution, long-duration, or out-of-domain video input (RealCam-I2V).
  • More expressive control signals—combining textual, interactive, and metric-scale geometric guidance.
  • Expansion of motion diversity and realism metrics, as prioritized by the DIVE benchmark.
  • Continued emphasis on objective and subjective human-in-the-loop evaluation, critical for practical deployment readiness.

In summary, I2V-Bench—across its instantiations (e.g., IRVI, MC-Bench, open-domain frameworks, DIVE)—provides the methodological backbone for rigorous, multi-dimensional evaluation of image-to-video generation systems, ensuring measurable progress in fidelity, controllability, and application scope in the contemporary generative modeling landscape.