
SVBench: Dual Benchmarks for Data & Video

Updated 1 May 2026
  • SVBench in Data Analytics is a benchmark that standardizes and compares Shapley value algorithms using modular pipelines and privacy-preserving techniques.
  • SVBench for Streaming Video Understanding evaluates large vision-language models with a temporal multi-turn QA framework across overlapping video segments.
  • The framework’s extensible design supports reproducibility and future research in both data valuation and streaming video comprehension tasks.

SVBench is a designation shared by two prominent, task-specific open benchmarks, one in data analytics and one in video understanding. While their underlying purposes differ, both canonical instances provide rigorous testbeds: (1) for benchmarking Shapley value methodologies in data analytics (Lin et al., 2024), and (2) for assessing large vision-language models' (LVLMs') capacity for streaming video comprehension via temporally linked, multi-turn dialogue (Yang et al., 15 Feb 2025). This article systematically delineates both benchmarks, addressing their design, implementation, and significance.

1. SVBench in Data Analytics: Modular Benchmarking for Shapley Value

SVBench, as introduced in "A Comprehensive Study of Shapley Value in Data Analytics" (Lin et al., 2024), is an extensible, open-source framework engineered to standardize, accelerate, and comparatively evaluate Shapley value (SV) algorithms across diverse data analytics tasks. The framework addresses four central challenges: computational efficiency, approximation error, privacy preservation, and interpretability.

Architecture and Component Pipeline

SVBench builds a six-stage modular pipeline encompassing:

  • Data Ingestion: Abstracts raw data—tables, features, model checkpoints—as “players”.
  • Configuration Loader: Parses user-provided YAML/JSON specifying players, utility functions, SV algorithm, sampling schema, optimization, and privacy recipes.
  • Sampler: Supports random, stratified, antithetic, or custom coalition/permutation sampling, producing iterators over coalitions.
  • Shapley Computation Engine: Encapsulates a range of SV algorithms (Monte Carlo, regression-based, multilinear extension, group testing, compressive sampling), integrating the utility calculator and incremental SV estimation.
  • Convergence Checker: Employs criteria such as

$$\frac{1}{5n}\sum_{m=1}^{5}\sum_{i=1}^{n} \left|\frac{\phi_i^{(e)} - \phi_i^{(e-mn)}}{\phi_i^{(e)}}\right| < \tau$$

where $\phi_i^{(e)}$ is the SV estimate for player $p_i$ after $e$ iterations and $\tau$ is a user-set threshold.

  • Output Aggregator & Privacy Module: Produces final SV, optionally applying privacy mechanisms (e.g., DP, quantization, dimension reduction), before optionally visualizing outputs.

The implementation is fully scriptable and supports extension by user-registered algorithms, samplers, or privacy modules, verified for interface conformity at load time.
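The convergence criterion above can be sketched in Python. This is a minimal illustration of the check itself; SVBench's actual checker interface and checkpointing scheme may differ:

```python
def converged(history, tau=0.05, window=5):
    """SVBench-style convergence check: the mean relative change of each
    player's SV estimate over the last `window` checkpoints (one checkpoint
    every n iterations) must fall below the threshold tau.
    `history` is a list of per-checkpoint SV vectors (lists of floats)."""
    if len(history) < window + 1:
        return False  # not enough checkpoints to evaluate the criterion
    current = history[-1]
    n = len(current)
    total = 0.0
    for m in range(1, window + 1):
        past = history[-1 - m]
        # accumulate per-player relative change against the m-th past checkpoint
        total += sum(abs((c - p) / c) for c, p in zip(current, past))
    return total / (window * n) < tau
```

A run whose estimates have stopped moving passes the check, while one whose estimates are still drifting does not.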

Shapley Value Algorithms and Approximations

Let $\mathcal{N} = \{p_1, \dots, p_n\}$ represent the player set, $U: \mathcal{P}(\mathcal{N}) \to \mathbb{R}$ a utility function, and $S$ a subset of players. The Shapley value for $p_i$ is defined by:

$$\phi_i = \sum_{S \subseteq \mathcal{N} \setminus \{p_i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ U(S \cup \{p_i\}) - U(S) \right]$$

Approximation strategies supported include Monte Carlo sampling, regression (KernelSHAP), multilinear extension, group testing, compressive sampling, and truncation.
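For small player sets, the exact definition can be evaluated directly by enumerating all coalitions, which makes the exponential cost that motivates these approximation strategies concrete. This is an illustrative sketch, not SVBench's API:

```python
from itertools import combinations
from math import factorial

def shapley_exact(players, utility):
    """Exact Shapley values by enumerating every coalition S ⊆ N \ {p}.
    Tractable only for small n (2^n coalitions); SVBench's approximation
    strategies exist precisely because this enumeration blows up."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        rest = [q for q in players if q != p]
        for k in range(n):
            # weight |S|!(n-|S|-1)!/n! for coalitions of size k
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(rest, k):
                phi[p] += weight * (utility(set(S) | {p}) - utility(set(S)))
    return phi
```

For an additive utility (each player contributes a fixed amount regardless of coalition), each player's Shapley value equals its own contribution, which gives a quick sanity check.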

Table: SV Computation Approaches

| Strategy | Complexity / Notes |
|---|---|
| Monte Carlo (MC) | $\mathcal{O}(N)$ samples per estimate; error bound via Hoeffding's inequality |
| Regression (RE) | Weighted least squares in $\mathbb{R}^n$ (KernelSHAP) |
| Multilinear Extension (MLE) | Integral-based; expectation over randomly sampled coalitions |
| Group Testing (GT) / Compressive (CP) | Structured coalition queries / sparse recovery to reduce sample counts |
| Truncation (TC) | Early stopping once marginal contributions become negligible |

Quantitative Findings and Interpretability

SVBench measures output via metrics such as time cost, sample complexity, ranking variance, and privacy-attack resistance. Truncation reduces sample complexity by 18–72%, gradient-based optimization accelerates federated learning cases, and antithetic sampling stabilizes rankings. Privacy interventions (DP, quantization, dimension reduction) reduce attack efficacy but may perturb SV orderings.

Interpretability focuses on the relative magnitude of $\phi_i$ as a proxy for player impact on the utility $U$, but SVBench highlights that aggressive approximation, especially with boundary-coalition pruning, can degrade interpretational reliability.

2. SVBench for Streaming Video Understanding: Temporal Multi-Turn Benchmark

SVBench, as constructed in "SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding" (Yang et al., 15 Feb 2025), constitutes the first large-scale benchmark with temporally linked, multi-turn QA chains tailored to stream-level evaluation of LVLMs. The focus is on assessing models’ ability to maintain, reference, and reason over prolonged temporal contexts typical in real-world streaming and surveillance applications.

Dataset Composition and Annotation Pipeline

The benchmark comprises 1,353 streaming videos from six major sources (YT-Temporal-1B, YouCook2, ActivityNet, MovieChat, Panda-70M, Ego4D), totaling 49,979 QA pairs. Each video averages 8.61 QA chains (of 4–5 turns each), mapped onto overlapping segments identified via PySceneDetect. QA pairs are generated through LVLM assistance, followed by manual editing for coherence and context. Temporal linkages across QA chains are algorithmically extracted (LLM-based relation extraction) and refined by human annotators to enforce cross-segment reasoning.
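The overlapping-segment layout can be sketched as follows. The segment length and overlap parameters here are purely illustrative; the benchmark derives its actual boundaries from scene detection with PySceneDetect rather than fixed windows:

```python
def overlapping_segments(duration, seg_len=30.0, overlap=10.0):
    """Split a video of `duration` seconds into fixed-length overlapping
    segments (start, end). Illustrative stand-in for SVBench's
    PySceneDetect-derived segmentation."""
    stride = seg_len - overlap
    segments, start = [], 0.0
    while start < duration:
        segments.append((start, min(start + seg_len, duration)))
        if start + seg_len >= duration:
            break  # final segment reaches the end of the video
        start += stride
    return segments
```

Consecutive segments share `overlap` seconds of footage, which is what allows QA chains anchored to different segments to be linked across segment boundaries.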

Temporal Multi-Turn Task and Evaluation

Each video is divided into overlapping segments, each associated with a chain of QA pairs. Temporal linkages carry information between chains, with each linkage typed by a relation drawn from {Action, Person, Object, Event, Environment, Quantity}, supporting distinct temporal reasoning tasks such as intention inference, counterfactual reasoning, and spatio-temporal speculation.

Models are evaluated under:

  • Dialogue Evaluation: Cumulatively exposed to clip histories and preceding QA, answering in-turn.
  • Streaming Evaluation: A simulated "jumping" procedure with probability 0.8 to test persistence under partial observation and non-local temporal queries.
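The "jumping" procedure might be simulated as below. This is an illustrative reading of the protocol description, not the benchmark's released evaluation code; the decision rule and jump targets are assumptions:

```python
import random

def simulate_stream(chains, jump_prob=0.8, seed=0):
    """Simulate a streaming 'jumping' traversal over QA chains: after each
    chain, jump forward to a random later chain with probability
    `jump_prob`, otherwise advance to the next one. Returns the visited
    chains; skipped chains model partial observation of the stream."""
    rng = random.Random(seed)
    order, i = [], 0
    while i < len(chains):
        order.append(chains[i])
        if i + 1 >= len(chains):
            break  # reached the final chain
        if rng.random() < jump_prob:
            i = rng.randrange(i + 1, len(chains))  # forward jump
        else:
            i += 1
    return order
```

Because jumps are always forward, the model must answer later chains without having seen the skipped context, which is exactly what stresses persistence under partial observation.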

Metrics and Assessment Protocol

Evaluation comprises:

  • Standard Metrics: BLEU-4, METEOR, ROUGE-L, CIDEr (answer-level).
  • GPT-4 Score: OpenAI GPT-4 rates answer accuracy, scaled to [0, 100].
  • Dialogue Rubric: Five expert-scored dimensions—Semantic Accuracy (SA), Contextual Coherence (CC), Logical Consistency (LC), Temporal Understanding (TU), Informational Completeness (IC)—aggregated into an overall score.
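The rubric aggregation can be sketched as follows. The paper's exact weighting is not reproduced here; an unweighted mean over the five dimensions is assumed for illustration:

```python
def overall_score(scores):
    """Aggregate the five rubric dimensions (SA, CC, LC, TU, IC) into one
    overall score. An unweighted mean is an assumption for illustration."""
    dims = ("SA", "CC", "LC", "TU", "IC")
    return sum(scores[d] for d in dims) / len(dims)
```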

Aggregate results demonstrate that even GPT-4o, the strongest closed-source baseline, trails human-level temporal understanding. Open-source models (e.g., StreamingChat, InternVL2) exhibit significant gaps in referential tracking, temporal jumps, and counterfactuals.

3. Key Experimental Outcomes

Data Analytics SVBench

  • Truncation (TC) reduced sample complexity by 18–72% across varied feature and data valuation tasks, e.g., from 6,000 to 1,400 queries.
  • Gradient-approximate methods decreased per-query computational cost by approximately 90% in federated learning scenarios.
  • MC with antithetic sampling plus truncation provided stable SV rankings even at loose convergence thresholds.
  • Privacy protections reduced membership/feature inference attack accuracy but increased SV-ranking variance.

Streaming Video SVBench

  • GPT-4o achieved overall scores of 66.29 (Dialogue), 58.17 (Streaming), compared to human-level ∼84 (Dialogue), ∼80 (Streaming).
  • Open-source StreamingChat improved over fine-tuned InternVL2 by +9.4 (Dialogue) and +3.3 (Streaming).
  • Weaknesses in counterfactual and spatio-temporal categories persisted: GPT-4o's counterfactual reasoning (CR) and spatio-temporal speculation (STS) scores were around 50%, versus roughly 68% for semantic accuracy.
  • Models frequently failed referential continuity (e.g., losing track of "the red-jerseyed runner" across segments).

4. Modularity, Extensibility, and Reproducibility

Both SVBench instances are engineered for extensibility. In data analytics, users can register custom algorithms, samplers, and privacy modules; in video QA, full code, annotation pipelines, and model checkpoints are released under open license (https://yzy-bupt.github.io/SVBench). Streaming video SVBench provides resource scripts for GPU-based workflows (PyTorch, HuggingFace Transformers, PySceneDetect, Open-Sora), scalable to ≥32 GB VRAM environments.

5. Open Problems and Future Directions

SVBench highlights fundamental research gaps:

  • In data analytics SV, robust privacy defenses and formal interpretability for approximate SVs remain unsolved. Open challenges include streaming data, evolving games, and interdependent players.
  • For streaming video, cross-segment temporal reasoning and robust counterfactual inference are unsolved—even for leading LVLMs. SVBench authors propose future integration of audio, speech transcripts, multi-view footage, adversarial/counterfactual dialogues, and complex task-oriented dialogue evaluation (Yang et al., 15 Feb 2025).
  • Both benchmarks invite extension, providing a foundation for methodological innovation, reproducibility studies, and the systematic assessment needed for further advances.

6. Comparative Summary: SVBench Instances

| SVBench Instance | Task Domain | Principal Capabilities |
|---|---|---|
| Data Analytics SVBench (Lin et al., 2024) | Shapley value data analytics | Modular SV computation, privacy, interpretability, APIs |
| Streaming Video SVBench (Yang et al., 15 Feb 2025) | Video QA | Temporal, multi-turn, multi-clip streaming QA, LVLM evaluation |

Collectively, SVBench stands as an archetype for rigorous, extensible benchmarking in data-intensive subfields, enabling the systematic diagnosis of model and method limitations across technical dimensions.
