
OpenS2V-Eval Benchmark Overview

Updated 9 December 2025
  • OpenS2V-Eval is a benchmark that provides a fine-grained evaluation of subject consistency, naturalness, and text relevance in subject-to-video generation.
  • It integrates real and synthetic samples using tailored metrics like NexusScore, NaturalScore, and GmeScore to diagnose model performance.
  • The framework supports precise analysis across seven S2V categories, offering actionable insights on failure modes and overall video synthesis quality.

OpenS2V-Eval is a fine-grained benchmark specifically constructed for subject-to-video (S2V) generation research, addressing critical gaps left by prior video generation evaluation suites. Unlike benchmarks inherited from general text-to-video assessment, OpenS2V-Eval measures three core axes of S2V model output: subject consistency, subject naturalness, and text relevance. The framework integrates real and synthetic samples, tailored automatic metrics, and a holistic scoring method to enable precise separation of model strengths and weaknesses. It is a key component of the OpenS2V-Nexus infrastructure for video synthesis research (Yuan et al., 26 May 2025).

1. Motivation and Problem Statement

Existing video-generation benchmarks—such as VBench and ChronoMagic-Bench—are designed for text-to-video tasks and emphasize global image/video quality and motion. These frameworks neglect the defining properties of S2V: whether the same object or person is preserved across the video ("subject consistency") and whether appearances remain physically plausible ("subject naturalness"). S2V-specific coarse benchmarks (e.g., ConsisID-Bench, A2-Bench, VACE-Bench) are limited in domain coverage (often faces only) and reuse global scoring functions such as CLIP or DINO, which are susceptible to background sensitivity and fail to penalize "copy-paste" artefacts.

OpenS2V-Eval aims to directly measure these S2V-critical attributes. The benchmark evaluates:

  • Subject Consistency: Temporal fidelity of the reference subject’s appearance.
  • Subject Naturalness: Physical plausibility without typical synthesis artefacts.
  • Text Relevance: Alignment of generated content with input prompts.

This multi-dimensional protocol delivers a more rigorous evaluation of S2V generators by capturing nuanced failure modes that disrupt practical deployment.

2. Benchmark Structure and Data Composition

OpenS2V-Eval comprises 180 subject-text test cases partitioned among seven S2V categories of ascending complexity:

| Category Index | Description | Example Entities |
|---|---|---|
| 1 | single-face-to-video | faces |
| 2 | single-body-to-video | persons |
| 3 | single-entity-to-video | objects, animals |
| 4 | multi-face-to-video | groups of faces |
| 5 | multi-body-to-video | multiple persons |
| 6 | multi-entity-to-video | diverse objects/animals |
| 7 | human-entity-to-video | hybrid person + object |

Data sources and synthesis pipelines:

  • 80 real cases: Images and captions curated from open-license repositories (Pexels, MixKit, Pixabay) and prior S2V benchmarks, with quality control.
  • 100 synthetic cases: Stress-testing generalization via two strategies:
    • GPT-Frame Pairs: GPT-Image-1 re-renders the reference subject from raw video frames and extracted keywords to obtain multi-view representations.
    • Cross-Frame Pairs: Semantic clustering (GME model) of videos enables pairing of different segments for diverse subject-view combinations.

This ensures the benchmark evaluates both memorization and generalization, guarding against overfitting to training crops.
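The released test-case files are not reproduced here; as a minimal sketch, one subject-text case could be represented as below, with hypothetical field names that are illustrative rather than the benchmark's actual schema.

```python
# Hypothetical representation of one OpenS2V-Eval test case; the field
# names below are illustrative, not the benchmark's actual file schema.
from dataclasses import dataclass
from typing import List

@dataclass
class S2VTestCase:
    case_id: str                 # unique identifier
    category: str                # one of the seven S2V categories, e.g. "multi-entity-to-video"
    prompt: str                  # text prompt describing the desired video
    reference_images: List[str]  # paths/URLs to reference subject images
    entity_names: List[str]      # one entity name per reference image (used by NexusScore retrieval)
    is_synthetic: bool = False   # True for GPT-Frame / Cross-Frame pairs, False for real cases

example = S2VTestCase(
    case_id="case_0042",
    category="human-entity-to-video",
    prompt="A woman walks her golden retriever along a beach at sunset.",
    reference_images=["refs/woman.png", "refs/dog.png"],
    entity_names=["woman", "golden retriever"],
    is_synthetic=True,
)
```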

3. Evaluation Metrics and Scoring Protocol

OpenS2V-Eval introduces three targeted automatic metrics alongside standard scores for visual and motion quality:

3.1 NexusScore (Subject Consistency)

Measures reference subject fidelity across frames, discounting background. For each reference image $R_i$ and frame $I_t$:

  1. Detect and crop candidate subjects with $\mathcal{M}_{\text{detect}}$ (e.g., YOLO-World), yielding crops $C_{i,t}$ with detection confidence $c_{i,t}$.
  2. Use $\mathcal{M}_{\text{retrieve}}$ (the GME multimodal LLM) to verify that each crop matches the entity name, with match score $s_{i,t}$ above confidence threshold $\beta$.
  3. Compute image similarity only on valid detections above detection threshold $\alpha$:

$$S_{\mathrm{Nexus}} = \frac{1}{I\,T'} \sum_{i=1}^{I} \sum_{t=1}^{T'} \mathcal{M}_{\text{retrieve}}(C_{i,t}, R_i), \quad \text{where } c_{i,t} > \alpha,\ s_{i,t} > \beta,$$

where $T'$ is the number of valid detections.
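A minimal sketch of this aggregation, assuming the detector and retrieval model are wrapped behind generic callables (the official YOLO-World/GME interfaces may differ), is:

```python
# Sketch of the NexusScore aggregation. The callables below stand in for
# M_detect and M_retrieve; their signatures are assumptions for illustration.
from typing import Callable, List, Sequence, Tuple

import numpy as np

Detection = Tuple[np.ndarray, float]  # (cropped subject image C_{i,t}, detection confidence c_{i,t})

def nexus_score(
    reference_images: Sequence[np.ndarray],                   # R_i, one per subject
    entity_names: Sequence[str],                               # entity name per reference subject
    frames: Sequence[np.ndarray],                              # sampled video frames I_t
    detect: Callable[[np.ndarray, str], List[Detection]],      # M_detect: frame, name -> crops
    retrieve_match: Callable[[np.ndarray, str], float],        # M_retrieve: crop, name -> match score s_{i,t}
    retrieve_sim: Callable[[np.ndarray, np.ndarray], float],   # M_retrieve: crop, reference -> similarity
    alpha: float = 0.5,                                        # detection-confidence threshold
    beta: float = 0.5,                                         # entity-match threshold
) -> float:
    sims = []
    for ref, name in zip(reference_images, entity_names):
        for frame in frames:
            for crop, conf in detect(frame, name):
                # Keep only detections above alpha whose entity name is
                # verified by the retrieval model above beta.
                if conf > alpha and retrieve_match(crop, name) > beta:
                    sims.append(retrieve_sim(crop, ref))
    # Average over the I x T' valid (subject, frame) detections.
    return float(np.mean(sims)) if sims else 0.0
```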

3.2 NaturalScore (Subject Naturalness)

Assesses physical plausibility. Human raters use a 5-point Likert scale $C=\{1,\dots,5\}$; GPT-4o is prompted as a surrogate judge on frames $I_t$:

$$S_{\mathrm{Natural}} = \frac{1}{T} \sum_{t=1}^{T} s_t, \quad s_t \in \{1,\dots,5\}$$
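A minimal sketch of the per-frame aggregation, with the GPT-4o judging prompt abstracted behind a placeholder callable, is:

```python
# Sketch of the NaturalScore aggregation: a surrogate judge (GPT-4o in the
# paper) returns a 1-5 Likert rating per frame, and the ratings are averaged.
# `judge_naturalness` is a placeholder for the actual GPT-4o prompt/call.
from typing import Callable, Sequence

import numpy as np

def natural_score(
    frames: Sequence[np.ndarray],
    judge_naturalness: Callable[[np.ndarray], int],  # returns s_t in {1, ..., 5}
) -> float:
    ratings = [judge_naturalness(frame) for frame in frames]
    return float(np.mean(ratings))
```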

3.3 GmeScore (Text Relevance)

Quantifies text-prompt alignment. Uses GME (a Qwen2-VL-based embedder) to compute image-text similarity between each frame $I_t$ and the prompt $P$:

$$S_{\mathrm{Gme}} = \frac{1}{T} \sum_{t=1}^{T} r_t, \quad r_t = \mathcal{M}_{\mathrm{retrieve}}(I_t, P)$$
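A minimal sketch using assumed GME-style embedding functions and cosine similarity (the official scoring code may differ) is:

```python
# Sketch of the GmeScore computation: per-frame image-text similarity from a
# GME-style multimodal embedder, averaged over sampled frames. The embedding
# interface below is an assumption, not the official GME API.
from typing import Callable, Sequence

import numpy as np

def gme_score(
    frames: Sequence[np.ndarray],
    prompt: str,
    embed_image: Callable[[np.ndarray], np.ndarray],  # image embedding
    embed_text: Callable[[str], np.ndarray],          # text embedding
) -> float:
    text_vec = embed_text(prompt)
    text_vec = text_vec / np.linalg.norm(text_vec)
    sims = []
    for frame in frames:
        img_vec = embed_image(frame)
        img_vec = img_vec / np.linalg.norm(img_vec)
        sims.append(float(img_vec @ text_vec))        # cosine similarity r_t
    return float(np.mean(sims))
```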

Additional metrics include AestheticScore (improved-aesthetic-predictor), MotionScore (OpenCV Farneback optical flow), and FaceSim-Cur (CurricularFace on detected faces). All scores are normalized to $[0,1]$ and aggregated:

$$\text{Total\_Score} = 0.20\,S_{\mathrm{Nexus}} + 0.24\,S_{\mathrm{Natural}} + 0.12\,S_{\mathrm{Gme}} + 0.20\,S_{\mathrm{FaceSim}} + 0.12\,S_{\mathrm{Aesthetic}} + 0.12\,S_{\mathrm{Motion}}$$

Weights are adjusted for human-domain content.
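As a concrete illustration of the composite score, the sketch below applies the open-domain weights above to metrics normalized with assumed bounds; the bounds and the Farneback-based motion estimate are illustrative, not the benchmark's exact constants.

```python
# Sketch of the weighted Total_Score with the open-domain weights above.
# The per-metric normalization bounds are assumed empirical/theoretical
# ranges, not official constants; the motion estimate uses OpenCV's
# Farneback optical flow as mentioned in the text.
import cv2
import numpy as np

WEIGHTS = {
    "nexus": 0.20, "natural": 0.24, "gme": 0.12,
    "facesim": 0.20, "aesthetic": 0.12, "motion": 0.12,
}

def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw score into [0, 1] given assumed bounds."""
    return float(np.clip((value - lo) / (hi - lo), 0.0, 1.0))

def motion_magnitude(gray_frames: list) -> float:
    """Mean optical-flow magnitude across consecutive grayscale frames."""
    mags = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=2).mean()))
    return float(np.mean(mags)) if mags else 0.0

def total_score(normalized: dict) -> float:
    """Weighted composite over the six normalized metric scores."""
    return sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)

# Example: NaturalScore in [1, 5] and a raw motion magnitude mapped to [0, 1]
# (illustrative bounds) before weighting.
scores = {
    "nexus": 0.62, "natural": normalize(3.8, 1.0, 5.0), "gme": 0.66,
    "facesim": 0.55, "aesthetic": 0.48, "motion": normalize(2.1, 0.0, 10.0),
}
print(round(total_score(scores), 3))
```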

4. Model Evaluation Protocol

OpenS2V-Eval benchmarked 18 S2V models:

  • Closed-source: Vidu 2.0, Pika 2.1, Kling 1.6, Hailuo S2V-01
  • Open-source: VACE (Preview-1.3B, Wan2.1-1.3B, Wan2.1-14B), Phantom-1.3B, SkyReels-A2-14B, HunyuanCustom (single-subject), ConsisID, Concat-ID, FantasyID, EchoVideo, VideoMaker, ID-Animator

All models were assessed using their official inference scripts, with standardized seeds, durations (2–5 s), frame rates (8–30 fps), and resolutions (480–1080p).

Sampling: 32 uniformly sampled frames per video for most metrics; all frames for MotionScore.
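A minimal sketch of the uniform sampling step (assuming simple evenly spaced indices over the decoded frames) is:

```python
# Sketch of uniform frame sampling: pick 32 evenly spaced frame indices from
# a decoded video, falling back to all frames when the clip is shorter.
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 32) -> np.ndarray:
    if num_frames <= num_samples:
        return np.arange(num_frames)
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

# e.g. a 4 s clip at 24 fps has 96 frames -> 32 indices spread from 0 to 95
print(uniform_frame_indices(96))
```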

Aggregation: Empirically or theoretically bounded normalization, followed by weighted composite scoring, yields fine-grained ranking across S2V categories.

5. Results and Analysis

5.1 Quantitative Highlights

  • In open-domain (all seven categories), Kling 1.6 led overall (Total ≈54.5%) with top NexusScore and NaturalScore.
  • Pika 2.1 achieved highest text relevance via GmeScore (≈69%).
  • VACE 14B outperformed smaller open models, evidencing scale-driven improvements.
  • Phantom and SkyReels-A2 attained strong temporal consistency but were penalized on NaturalScore, revealing copy-paste-related artefacts.
  • For human-domain (faces & bodies), Hailuo scored highest via FaceSim and balanced naturalness (Total ≈60%).
  • EchoVideo led among open-source human-domain competitors; ConsisID and Concat-ID showed high face identity retention but low naturalness.
  • On single-domain tasks (face/body/object), VACE 14B maintained leadership; open-source models narrowed the gap for single-subject scenarios.

5.2 Qualitative Failure Modes

Observed failure patterns include:

  • Limited generalization to out-of-distribution subjects.
  • Copy-paste artefacts such as unnatural lighting/rigid pose.
  • Temporal identity drift for human subjects (face consistency loss).
  • Cloned or blurred initial frames.
  • Diminishing detail fidelity in later frames ("consistency fade").

6. Discussion and Future Directions

OpenS2V-Eval is the first benchmark specifically tailored to S2V, spanning from simple single-face scenarios to complex multi-entity video synthesis cases. The combination of real and synthetic evaluations stresses model generalization beyond memorizing training samples. Three S2V-specific metrics—each closely matching human preference judgments (>75% correlation)—provide diagnostic precision missing from prior coarse benchmarks. By integrating six evaluation axes, the framework delivers holistic model appraisal.

Metric alignment with human preferences remains at approximately 75%, indicating room to refine prompts, thresholds, or evaluator models. NaturalScore currently relies on proprietary GPT-4o, posing cost and transparency challenges; open-model surrogates might be developed. The test suite is monolingual (English), suggesting that future extension to multilingual prompts can expose further generalization limitations. Expanding to cover long-duration and highly dynamic video scenarios is also suggested.

OpenS2V-Eval establishes a robust, fine-grained infrastructure for evaluating subject-to-video generators, enabling detailed attribution of strengths and weaknesses beyond prior approaches (Yuan et al., 26 May 2025).
