SVBench: Dual Benchmarks for Data & Video
- SVBench in Data Analytics is a benchmark that standardizes and compares Shapley value algorithms using modular pipelines and privacy-preserving techniques.
- SVBench for Streaming Video Understanding evaluates large vision-language models with a temporal multi-turn QA framework across overlapping video segments.
- The framework’s extensible design supports reproducibility and future research in both data valuation and streaming video comprehension tasks.
SVBench is a designation shared by multiple prominent, task-specific open benchmarks spanning the data analytics and video understanding domains. While their underlying purposes differ, the canonical instances provide rigorous testbeds: (1) for benchmarking Shapley value methodologies in data analytics (Lin et al., 2024), and (2) for assessing large vision-language models' (LVLMs) capacity for streaming video comprehension via temporally linked, multi-turn dialog (Yang et al., 15 Feb 2025). This article systematically delineates both benchmarks, addressing their design, implementation, and significance.
1. SVBench in Data Analytics: Modular Benchmarking for Shapley Value
SVBench, as introduced in "A Comprehensive Study of Shapley Value in Data Analytics" (Lin et al., 2024), is an extensible, open-source framework engineered to standardize, accelerate, and comparatively evaluate Shapley value (SV) algorithms across diverse data analytics tasks. The framework addresses four central challenges: computational efficiency, approximation error, privacy preservation, and interpretability.
Architecture and Component Pipeline
SVBench builds a six-stage modular pipeline encompassing:
- Data Ingestion: Abstracts raw data—tables, features, model checkpoints—as “players”.
- Configuration Loader: Parses user-provided YAML/JSON specifying players, utility functions, SV algorithm, sampling schema, optimization, and privacy recipes.
- Sampler: Supports random, stratified, antithetic, or custom coalition/permutation sampling, producing iterators over coalitions.
- Shapley Computation Engine: Encapsulates a range of SV algorithms (Monte Carlo, regression-based, multilinear extension, group testing, compressive sampling), integrating the utility calculator and incremental SV estimation.
- Convergence Checker: Employs criteria such as $\max_{i} \left| \hat{\phi}_i^{(t)} - \hat{\phi}_i^{(t-1)} \right| < \epsilon$, where $\hat{\phi}_i^{(t)}$ is the SV estimate for player $i$ after $t$ iterations.
- Output Aggregator & Privacy Module: Produces the final SVs, optionally applying privacy mechanisms (e.g., differential privacy, quantization, dimension reduction), and can visualize the outputs.
The implementation is fully scriptable and supports extension by user-registered algorithms, samplers, or privacy modules, verified for interface conformity at load time.
Shapley Value Algorithms and Approximations
Let $N$ denote the player set, $v: 2^N \to \mathbb{R}$ a utility function with $v(\emptyset) = 0$, and $S \subseteq N \setminus \{i\}$ a coalition of players. The Shapley value of player $i$ is defined by:

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right]$$
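The definition can be computed exactly by enumerating all coalitions, which is tractable only for small player sets. The sketch below applies the standard formula to a hypothetical three-player "glove game" utility chosen purely for illustration:

```python
from itertools import combinations
from math import factorial

def shapley_value(players, v, i):
    """Exact Shapley value of player i under utility v (O(2^n) coalitions)."""
    n = len(players)
    others = [p for p in players if p != i]
    phi = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            # Standard coalition weight |S|! (n - |S| - 1)! / n!
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

# Toy "glove game": a coalition is worth 1 iff it pairs player 1 with 2 or 3.
v = lambda S: 1.0 if 1 in S and (2 in S or 3 in S) else 0.0
print(shapley_value([1, 2, 3], v, 1))  # ≈ 0.6667 (player 1 is pivotal in most orders)
```

Players 2 and 3 are symmetric and each receive the remaining $1/6$, so the values satisfy the efficiency axiom $\sum_i \phi_i = v(N)$.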
Approximation strategies supported include Monte Carlo sampling, regression (KernelSHAP), multilinear extension, group testing, compressive sampling, and truncation.
Table: SV Computation Approaches
| Strategy | Complexity / Notes |
|---|---|
| Monte Carlo (MC) | Permutation/coalition sampling; error bound via Hoeffding's inequality |
| Regression (RE) | Weighted least squares over sampled coalitions (KernelSHAP) |
| Multilinear Ext. (MLE) | Integral-based; expectation over coalitions with inclusion probability $q \in [0,1]$ |
| Group Testing (GT) / Compressive Sampling (CP) | Structured coalition queries for reduced sample complexity |
| Truncation (TC) | Early stopping once partial-coalition utility nears the grand-coalition value |
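A minimal sketch of the Monte Carlo permutation-sampling strategy from the table, paired with a simple iteration-to-iteration stopping rule in the spirit of the pipeline's Convergence Checker; the incremental-mean update and the threshold are illustrative assumptions, not SVBench's exact implementation:

```python
import random

def mc_shapley(players, v, iters=2000, eps=1e-3, seed=0):
    """Monte Carlo SV estimation: average each player's marginal
    contribution over random permutations, stopping early when the
    estimates change by less than eps between iterations."""
    rng = random.Random(seed)
    est = {p: 0.0 for p in players}
    for t in range(1, iters + 1):
        prev = dict(est)
        perm = players[:]
        rng.shuffle(perm)
        coalition, prev_v = set(), v(set())
        for p in perm:
            coalition.add(p)
            cur_v = v(coalition)
            # Incremental running mean of marginal contributions.
            est[p] += (cur_v - prev_v - est[p]) / t
            prev_v = cur_v
        if t > 100 and max(abs(est[p] - prev[p]) for p in players) < eps:
            break  # convergence criterion met
    return est

# Same toy "glove game" utility as above.
v = lambda S: 1.0 if 1 in S and (2 in S or 3 in S) else 0.0
print(mc_shapley([1, 2, 3], v))
```

Because each permutation's marginal contributions sum to $v(N) - v(\emptyset)$, the estimates preserve efficiency exactly even before convergence.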
Quantitative Findings and Interpretability
SVBench measures output via metrics such as time cost, sample complexity, ranking variance, and privacy-attack resistance. Truncation reduces sample complexity by 18–72%, gradient-based optimization accelerates federated learning cases, and antithetic sampling stabilizes rankings. Privacy interventions (DP, quantization, dimension reduction) reduce attack efficacy but may perturb SV orderings.
Interpretability focuses on the relative magnitude of $\phi_i$ as a proxy for player $i$'s impact on the utility $v$, but SVBench highlights that aggressive approximation, especially with boundary-coalition pruning, can degrade interpretational reliability.
2. SVBench for Streaming Video Understanding: Temporal Multi-Turn Benchmark
SVBench, as constructed in "SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding" (Yang et al., 15 Feb 2025), constitutes the first large-scale benchmark with temporally linked, multi-turn QA chains tailored to stream-level evaluation of LVLMs. The focus is on assessing models’ ability to maintain, reference, and reason over prolonged temporal contexts typical in real-world streaming and surveillance applications.
Dataset Composition and Annotation Pipeline
The benchmark comprises 1,353 streaming videos from six major sources (YT-Temporal-1B, YouCook2, ActivityNet, MovieChat, Panda-70M, Ego4D), totaling 49,979 QA pairs. Each video averages 8.61 QA chains (of 4–5 turns each), mapped onto overlapping segments identified via PySceneDetect. QA pairs are generated through LVLM assistance, followed by manual editing for coherence and context. Temporal linkages across QA chains are algorithmically extracted (LLM-based relation extraction) and refined by human annotators to enforce cross-segment reasoning.
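The overlapping-segment construction can be illustrated with a toy helper that merges each detected scene with its successors; the `overlap` parameter and the merging rule are assumptions for illustration, since the benchmark's exact segmentation parameters are not reproduced here:

```python
def overlapping_segments(scene_bounds, overlap=1):
    """Merge each scene (e.g., a PySceneDetect boundary pair) with the
    next `overlap` scenes, yielding overlapping video segments."""
    segs = []
    for i in range(len(scene_bounds)):
        j = min(i + overlap, len(scene_bounds) - 1)
        segs.append((scene_bounds[i][0], scene_bounds[j][1]))
    return segs

# Toy scene boundaries in seconds: three detected scenes.
scenes = [(0, 10), (10, 25), (25, 40)]
print(overlapping_segments(scenes))  # [(0, 25), (10, 40), (25, 40)]
```

Adjacent segments share a scene, which is what lets QA chains on consecutive segments reference common content.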
Temporal Multi-Turn Task and Evaluation
Each video $V$ is divided into $n$ overlapping segments $s_1, \dots, s_n$, each associated with a chain of QA pairs $C_j$. Linkages $\ell_{jk}$ carry information between chains $C_j$ and $C_k$, with each linkage assigned a type

$$\tau(\ell_{jk}) \in \{\text{Action, Person, Object, Event, Environment, Quantity}\},$$

supporting distinct temporal reasoning tasks such as intention inference, counterfactual reasoning, and spatio-temporal speculation.
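A minimal data model for QA chains and typed linkages could look like the following sketch; the class and field names are illustrative, not the benchmark's actual annotation schema:

```python
from dataclasses import dataclass

# The six linkage types named above.
LINK_TYPES = {"Action", "Person", "Object", "Event", "Environment", "Quantity"}

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class QAChain:
    segment_id: int
    turns: list  # 4-5 QAPair turns per chain

@dataclass
class Linkage:
    src_chain: int
    dst_chain: int
    link_type: str

    def __post_init__(self):
        # Reject linkage types outside the benchmark's category set.
        if self.link_type not in LINK_TYPES:
            raise ValueError(f"unknown linkage type: {self.link_type}")

link = Linkage(src_chain=0, dst_chain=1, link_type="Person")
print(link.link_type)  # Person
```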
Models are evaluated under:
- Dialogue Evaluation: Cumulatively exposed to clip histories and preceding QA, answering in-turn.
- Streaming Evaluation: A simulated "jumping" procedure (with jump probability 0.8) tests persistence under partial observation and non-local temporal queries.
Metrics and Assessment Protocol
Evaluation comprises:
- Standard Metrics: BLEU-4, METEOR, ROUGE-L, CIDEr (answer-level).
- GPT4-Score: OpenAI GPT-4 rates answer accuracy, scaled to [0,100].
- Dialogue Rubric: Five expert-scored dimensions—Semantic Accuracy (SA), Contextual Coherence (CC), Logical Consistency (LC), Temporal Understanding (TU), Informational Completeness (IC)—with overall score

$$\text{Overall} = \frac{SA + CC + LC + TU + IC}{5}.$$
Aggregate results demonstrate that even GPT-4o, the strongest closed-source baseline, trails human-level temporal understanding. Open-source models (e.g., StreamingChat, InternVL2) exhibit significant gaps in referential tracking, temporal jumps, and counterfactuals.
3. Key Experimental Outcomes
Data Analytics SVBench
- Truncation (TC) reduced sample complexity by 18–72% across varied feature and data valuation tasks, e.g., from 6,000 to 1,400 queries.
- Gradient-approximate methods decreased per-query computational cost by approximately 90% in federated learning scenarios.
- MC with antithetic sampling plus truncation provided stable SV rankings even at loose convergence thresholds.
- Privacy protections reduced membership/feature inference attack accuracy but increased SV-ranking variance.
Streaming Video SVBench
- GPT-4o achieved overall scores of 66.29 (Dialogue), 58.17 (Streaming), compared to human-level ∼84 (Dialogue), ∼80 (Streaming).
- Open-source StreamingChat improved over fine-tuned InternVL2 by +9.4 (Dialogue) and +3.3 (Streaming).
- Weaknesses in counterfactual reasoning (CR) and spatio-temporal speculation (STS) persisted, with GPT-4o scoring ∼50% in these categories versus ∼68% for semantic accuracy.
- Models frequently failed referential continuity (e.g., losing track of "the red-jerseyed runner" across segments).
4. Modularity, Extensibility, and Reproducibility
Both SVBench instances are engineered for extensibility. In data analytics, users can register custom algorithms, samplers, and privacy modules; in video QA, full code, annotation pipelines, and model checkpoints are released under open license (https://yzy-bupt.github.io/SVBench). Streaming video SVBench provides resource scripts for GPU-based workflows (PyTorch, HuggingFace Transformers, PySceneDetect, Open-Sora), scalable to ≥32 GB VRAM environments.
5. Open Problems and Future Directions
SVBench highlights fundamental research gaps:
- In data analytics SV, robust privacy defenses and formal interpretability guarantees for approximate SVs remain unsolved. Open challenges include streaming settings, evolving games, and interdependent players.
- For streaming video, cross-segment temporal reasoning and robust counterfactual inference are unsolved—even for leading LVLMs. SVBench authors propose future integration of audio, speech transcripts, multi-view footage, adversarial/counterfactual dialogues, and complex task-oriented dialogue evaluation (Yang et al., 15 Feb 2025).
- Both benchmarks invite extension, providing a foundation for methodological innovation, reproducibility studies, and the systematic assessment needed for further advances.
6. Comparative Summary: SVBench Instances
| SVBench Instance | Task Domain | Principal Capabilities |
|---|---|---|
| Data Analytics SVBench (Lin et al., 2024) | Shapley value DA | Modular SV computation, privacy, interpretability, APIs |
| Streaming Video SVBench (Yang et al., 15 Feb 2025) | Video QA | Temporal, multi-turn, multi-clip streaming QA, LVLM eval |
Collectively, SVBench stands as an archetype for rigorous, extensible benchmarking in data-intensive subfields, enabling the systematic diagnosis of model and method limitations across technical dimensions.