OpenS2V-Eval Benchmark Overview
- OpenS2V-Eval is a benchmark that provides a fine-grained evaluation of subject consistency, naturalness, and text relevance in subject-to-video generation.
- It integrates real and synthetic samples using tailored metrics like NexusScore, NaturalScore, and GmeScore to diagnose model performance.
- The framework supports precise analysis across seven S2V categories, offering actionable insights on failure modes and overall video synthesis quality.
OpenS2V-Eval is a fine-grained benchmark specifically constructed for subject-to-video (S2V) generation research, addressing critical gaps left by prior video generation evaluation suites. Unlike benchmarks inherited from general text-to-video assessment, OpenS2V-Eval measures three core axes of S2V model output: subject consistency, subject naturalness, and text relevance. The framework integrates real and synthetic samples, tailored automatic metrics, and a holistic scoring method to enable precise separation of model strengths and weaknesses. It is a key component of the OpenS2V-Nexus infrastructure for video synthesis research (Yuan et al., 26 May 2025).
1. Motivation and Problem Statement
Existing video-generation benchmarks—such as VBench and ChronoMagic-Bench—are designed for text-to-video tasks and emphasize global image/video quality and motion. These frameworks neglect the defining properties of S2V: whether the same object or person is preserved across the video ("subject consistency") and whether appearances remain physically plausible ("subject naturalness"). S2V-specific coarse benchmarks (e.g., ConsisID-Bench, A2-Bench, VACE-Bench) are limited in domain coverage (often faces only) and reuse global scoring functions such as CLIP or DINO, which are sensitive to background content and fail to penalize "copy-paste" artefacts.
OpenS2V-Eval aims to directly measure these S2V-critical attributes. The benchmark evaluates:
- Subject Consistency: Temporal fidelity of the reference subject’s appearance.
- Subject Naturalness: Physical plausibility without typical synthesis artefacts.
- Text Relevance: Alignment of generated content with input prompts.
This multi-dimensional protocol delivers a more rigorous evaluation of S2V generators by capturing nuanced failure modes that disrupt practical deployment.
2. Benchmark Structure and Data Composition
OpenS2V-Eval comprises 180 subject-text test cases partitioned among seven S2V categories of ascending complexity:
| Category Index | Description | Example Entities |
|---|---|---|
| 1 | single-face-to-video | faces |
| 2 | single-body-to-video | persons |
| 3 | single-entity-to-video | object, animal |
| 4 | multi-face-to-video | groups of faces |
| 5 | multi-body-to-video | multiple persons |
| 6 | multi-entity-to-video | diverse objects/animals |
| 7 | human-entity-to-video | hybrid person+object |
Data sources and synthesis pipelines:
- 80 real cases: Images and captions curated from open-license repositories (Pexels, MixKit, Pixabay) and prior S2V benchmarks, with quality control.
- 100 synthetic cases: Stress-testing generalization via two strategies:
- GPT-Frame Pairs: GPT-Image-1 re-renders the reference subject from raw video frames and extracted keywords to obtain multi-view representations.
- Cross-Frame Pairs: Semantic clustering (GME model) of videos enables pairing of different segments for diverse subject-view combinations.
This ensures the benchmark evaluates both memorization and generalization, guarding against overfitting to training crops.
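To make the data layout concrete, the following is a minimal sketch of how one such test case could be represented in code; the schema, field names (`EvalCase`, `reference_images`, `synthesis_strategy`), and enum labels are illustrative assumptions rather than the benchmark's official format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class S2VCategory(Enum):
    """The seven S2V categories evaluated by OpenS2V-Eval."""
    SINGLE_FACE = 1
    SINGLE_BODY = 2
    SINGLE_ENTITY = 3
    MULTI_FACE = 4
    MULTI_BODY = 5
    MULTI_ENTITY = 6
    HUMAN_ENTITY = 7


@dataclass
class EvalCase:
    """One of the 180 subject-text test cases (illustrative schema, not the official format)."""
    case_id: str
    category: S2VCategory
    reference_images: List[str]   # paths to one or more subject reference images
    prompt: str                   # text prompt describing the target video
    is_synthetic: bool            # False for the 80 real cases, True for the 100 synthetic ones
    synthesis_strategy: str = ""  # e.g., "gpt_frame_pair" or "cross_frame_pair" for synthetic cases
```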
3. Evaluation Metrics and Scoring Protocol
OpenS2V-Eval introduces three targeted automatic metrics alongside standard scores for visual and motion quality:
3.1 NexusScore (Subject Consistency)
Measures reference subject fidelity across frames while discounting the background. For each reference image $I_{\text{ref}}$ and generated frame $f_t$:
- Detect and crop candidate subject regions with an open-vocabulary detector (e.g., YOLO-World).
- Use GME (a multimodal LLM-based retrieval model) to verify that each crop matches the subject's entity name with confidence above a threshold $\theta_{\text{conf}}$.
- Compute image similarity between the reference and only those crops whose detection confidence exceeds $\theta_{\text{det}}$:

$$\text{NexusScore} = \frac{1}{K} \sum_{k=1}^{K} \operatorname{sim}\!\left(I_{\text{ref}}, c_k\right),$$

where $c_k$ is the $k$-th valid cropped region and $K$ is the valid detection count.
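A minimal sketch of this scoring loop is shown below; `detect_subjects`, `gme_verify`, and `image_similarity` are hypothetical stand-ins for YOLO-World detection, GME verification, and image-embedding similarity, and the default thresholds are placeholders rather than the benchmark's values.

```python
from typing import Callable, List, Sequence


def nexus_score(
    ref_image,
    frames: Sequence,
    subject_name: str,
    detect_subjects: Callable,   # stand-in for YOLO-World: frame -> list of (crop, det_conf)
    gme_verify: Callable,        # stand-in for GME: (crop, subject_name) -> match confidence
    image_similarity: Callable,  # stand-in for image-embedding similarity: (ref, crop) -> float
    det_threshold: float = 0.5,
    conf_threshold: float = 0.5,
) -> float:
    """Average reference-to-crop similarity over verified detections (background ignored)."""
    sims: List[float] = []
    for frame in frames:
        for crop, det_conf in detect_subjects(frame):
            if det_conf < det_threshold:
                continue  # discard low-confidence detections
            if gme_verify(crop, subject_name) < conf_threshold:
                continue  # discard crops that do not match the subject's entity name
            sims.append(image_similarity(ref_image, crop))
    # K = len(sims) is the valid detection count; no valid detections yields 0.
    return sum(sims) / len(sims) if sims else 0.0
```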
3.2 NaturalScore (Subject Naturalness)
Assesses physical plausibility. Human raters score on a 5-point Likert scale $s \in \{1, \dots, 5\}$; GPT-4o is prompted as a surrogate judge over sampled frames $\{f_t\}_{t=1}^{T}$:

$$\text{NaturalScore} = \frac{1}{T} \sum_{t=1}^{T} s_t,$$

where $s_t$ is the judged rating for frame $f_t$ (subsequently normalized to $[0,1]$ along with the other metrics).
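A minimal sketch of the aggregation step, assuming a hypothetical `judge_frame` wrapper around the GPT-4o (or open-surrogate) prompt and a simple linear rescaling of the 1-5 ratings to $[0,1]$:

```python
from typing import Callable, Sequence


def natural_score(
    frames: Sequence,
    judge_frame: Callable[[object], int],  # hypothetical judge: frame -> Likert rating in {1,...,5}
) -> float:
    """Average 5-point Likert ratings over sampled frames, rescaled to [0, 1]."""
    ratings = [judge_frame(f) for f in frames]
    if not ratings:
        return 0.0
    mean_rating = sum(ratings) / len(ratings)  # raw NaturalScore on the 1-5 scale
    return (mean_rating - 1.0) / 4.0           # map [1, 5] -> [0, 1] for aggregation
```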
3.3 GmeScore (Text Relevance)
Quantifies text-prompt alignment. Uses GME (a Qwen2-VL-based embedding model) to compute image-text similarity between each frame $f_t$ and the prompt $p$:

$$\text{GmeScore} = \frac{1}{T} \sum_{t=1}^{T} \operatorname{sim}_{\text{GME}}\!\left(f_t, p\right).$$
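A minimal sketch of this computation, with `embed_image` and `embed_text` as hypothetical stand-ins for the GME encoders:

```python
import numpy as np
from typing import Callable, Sequence


def gme_score(
    frames: Sequence,
    prompt: str,
    embed_image: Callable[[object], np.ndarray],  # stand-in for the GME image encoder
    embed_text: Callable[[str], np.ndarray],      # stand-in for the GME text encoder
) -> float:
    """Mean cosine similarity between each frame embedding and the prompt embedding."""
    text_vec = embed_text(prompt)
    text_vec = text_vec / np.linalg.norm(text_vec)
    sims = []
    for frame in frames:
        img_vec = embed_image(frame)
        img_vec = img_vec / np.linalg.norm(img_vec)
        sims.append(float(img_vec @ text_vec))
    return float(np.mean(sims))
```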
Additional metrics include AestheticScore (improved-aesthetic-predictor), MotionScore (OpenCV Farneback optical flow), and FaceSim-Cur (CurricularFace similarity on detected faces). All scores are normalized to $[0,1]$ and aggregated into a weighted composite:

$$\text{TotalScore} = \sum_{i} w_i \, \hat{m}_i,$$

where $\hat{m}_i$ is the normalized value of metric $i$; the weights $w_i$ are adjusted for human-domain content.
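Of these auxiliary metrics, MotionScore is the most mechanical to reproduce. The sketch below uses OpenCV's Farneback optical flow as stated, but the specific flow parameters and the mean-magnitude aggregation are assumptions, not the benchmark's exact configuration.

```python
import cv2
import numpy as np


def motion_score(video_path: str) -> float:
    """Mean Farneback optical-flow magnitude across all consecutive frame pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return 0.0
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense Farneback flow; these parameter values are common defaults, not the paper's.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(float(np.linalg.norm(flow, axis=2).mean()))
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0
```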
4. Model Evaluation Protocol
OpenS2V-Eval benchmarked 18 S2V models:
- Closed-source: Vidu 2.0, Pika 2.1, Kling 1.6, Hailuo S2V-01
- Open-source: VACE (Preview-1.3B, Wan2.1-1.3B, Wan2.1-14B), Phantom-1.3B, SkyReels-A2-14B, HunyuanCustom (single-subject), ConsisID, Concat-ID, FantasyID, EchoVideo, VideoMaker, ID-Animator

All models were assessed using official inference scripts, with standardized seeds, durations (2–5 s), frame rates (8–30 fps), and resolutions (480–1080p).
Sampling: 32 uniformly sampled frames per video for most metrics; all frames for MotionScore.
Aggregation: Empirically or theoretically bounded normalization, followed by weighted composite scoring, yields fine-grained ranking across S2V categories.
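The sketch below illustrates the sampling and aggregation steps under these assumptions; the helper names and the weight values are placeholders, since the exact weights are not reproduced here.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_samples: int = 32) -> list:
    """Uniformly sample frame indices (32 frames per video for most metrics)."""
    if total_frames <= num_samples:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int).tolist()


def total_score(metrics: dict, weights: dict) -> float:
    """Weighted composite over metric values already normalized to [0, 1]."""
    return sum(weights[name] * metrics[name] for name in weights)


# Example usage with placeholder values:
idx = sample_frame_indices(total_frames=120)            # 32 indices spread over a 120-frame video
score = total_score(
    metrics={"nexus": 0.45, "natural": 0.62, "gme": 0.70},
    weights={"nexus": 0.4, "natural": 0.3, "gme": 0.3},  # illustrative weights only
)
```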
5. Results and Analysis
5.1 Quantitative Highlights
- In open-domain (all seven categories), Kling 1.6 led overall (Total ≈54.5%) with top NexusScore and NaturalScore.
- Pika 2.1 achieved the highest text relevance via GmeScore (≈69%).
- VACE 14B outperformed smaller open models, evidencing scale-driven improvements.
- Phantom and SkyReels-A2 attained strong temporal consistency but were penalized on NaturalScore, revealing copy-paste-related artefacts.
- For human-domain (faces & bodies), Hailuo scored highest via FaceSim and balanced naturalness (Total ≈60%).
- EchoVideo led among open-source human-domain competitors; ConsisID and Concat-ID showed high face identity retention but low naturalness.
- On single-domain tasks (face/body/object), VACE 14B maintained leadership; open-source models narrowed the gap for single-subject scenarios.
5.2 Qualitative Failure Modes
Observed failure patterns include:
- Limited generalization to out-of-distribution subjects.
- Copy-paste artefacts such as unnatural lighting or rigid poses.
- Temporal identity drift for human subjects (face consistency loss).
- Cloned or blurred initial frames.
- Diminishing detail fidelity in later frames ("consistency fade").
6. Discussion and Future Directions
OpenS2V-Eval is the first benchmark specifically tailored to S2V, spanning from simple single-face scenarios to complex multi-entity video synthesis cases. The combination of real and synthetic evaluations stresses model generalization beyond memorizing training samples. Three S2V-specific metrics—each closely matching human preference judgments (>75% correlation)—provide diagnostic precision missing from prior coarse benchmarks. By integrating six evaluation axes, the framework delivers holistic model appraisal.
Metric alignment with human preferences remains at approximately 75%, indicating room to refine prompts, thresholds, or evaluator models. NaturalScore currently relies on proprietary GPT-4o, posing cost and transparency challenges; open-model surrogates might be developed. The test suite is monolingual (English); extending it to multilingual prompts could expose further generalization limitations, as could coverage of long-duration and highly dynamic video scenarios.
OpenS2V-Eval establishes a robust, fine-grained infrastructure for evaluating subject-to-video generators, enabling detailed attribution of strengths and weaknesses beyond prior approaches (Yuan et al., 26 May 2025).