
InterBench: Multi-Domain Interactive Benchmarking

Updated 2 December 2025
  • InterBench is a collection of benchmarks that rigorously assess interactivity, result correctness, and responsiveness in domains where traditional static benchmarks fall short.
  • It utilizes domain-tailored methods such as Markov interaction models, hand-crafted arithmetic expressions, and prompted visual interactions to generate realistic and reproducible workloads.
  • By clearly separating static metrics from dynamic behaviors, InterBench offers actionable insights for improving system performance, portability, and evaluation standards.

InterBench is the name of multiple domain-specific benchmarks, each aimed at rigorously evaluating system performance along axes of interactivity, result correctness, and responsiveness in scenarios where the traditional, static benchmarking paradigms are insufficient. Three major benchmarks bearing the InterBench name have emerged in distinct research areas: interactive data exploration in database systems (Eichmann et al., 2018), certified interval arithmetic libraries (Tang et al., 2021), and interaction fidelity in generative game world video models (Tang et al., 28 Nov 2025). Each InterBench instantiation is motivated by domain-specific gaps in existing evaluation methodology and implements a reproducible, extensible framework for quantitative and qualitative system assessment.

1. Benchmarking Scope, Motivation, and Target Domains

Across all disciplines, InterBench is designed to address the limitations of traditional benchmarks when applied to highly interactive, correctness-sensitive, or causality-driven workloads.

  • Interactive Data Exploration (IDE): Static OLAP benchmarks (e.g., TPC-H, TPC-DS) are not appropriate for evaluating database engines running IDE workloads, where queries are ad-hoc, incrementally constructed, and may leverage user "think-times" between actions (Eichmann et al., 2018).
  • Interval Arithmetic Libraries: No standard exists for verifying the containment or correctness of floating-point interval computations across hardware, compilers, or APIs. Variations in real-world implementations—especially for transcendental functions—necessitate an exact, portable means of validation (Tang et al., 2021).
  • Generative Game-World Models: Existing video-generation metrics focus on static or uniform content, not fine-grained execution of user-driven, temporally and causally coherent actions within rich, interactive simulated environments (Tang et al., 28 Nov 2025).

A key principle for all InterBench setups is the explicit separation between static outcome metrics and the dynamic, interactivity-driven phenomena characteristic of the target domain.

2. Design Principles and Test Suite Construction

Each instantiation of InterBench is constructed using domain-appropriate strategies for workload generation, data diversity, and world realism. The primary design dimensions are summarized below.

  • Data Exploration (Eichmann et al., 2018). Test generation: Markov interaction models over visualization workflows. Data/workload diversity: schemas, distributions, scale. Reproducibility and extensibility: plug-in datasets, parameter tuning, new metrics.
  • Interval Arithmetic (Tang et al., 2021). Test generation: hand-crafted and FPBench expressions. Data/workload diversity: arithmetic, transcendental, and mixed operations. Reproducibility and extensibility: full input/output corpus, cross-platform execution.
  • Game Interaction (Tang et al., 28 Nov 2025). Test generation: prompted interactions over curated image sets. Data/workload diversity: scene types, effects, actions. Reproducibility and extensibility: base/interaction prompts, extendable task types.

In data exploration, the benchmark (published as IDEBench) uses sequences of interaction primitives (create, filter, link, discard) sampled from real user logs and mapped to SQL aggregation templates, and it supports manipulation of data scale, schema normalization, and query selectivity.
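
As an illustration only, the following Python sketch shows the general shape of Markov-driven workload generation; the transition probabilities and SQL templates are hypothetical placeholders, not the ones shipped with IDEBench.

```python
import random

# Hypothetical transition probabilities between interaction primitives;
# IDEBench derives such probabilities from real user logs, so these numbers are illustrative.
TRANSITIONS = {
    "create":  {"filter": 0.5, "link": 0.3, "create": 0.1, "discard": 0.1},
    "filter":  {"filter": 0.4, "link": 0.3, "create": 0.2, "discard": 0.1},
    "link":    {"filter": 0.5, "create": 0.3, "discard": 0.2},
    "discard": {"create": 0.7, "filter": 0.3},
}

# Illustrative mapping from primitives to SQL aggregation templates.
SQL_TEMPLATES = {
    "create": "SELECT {dim}, COUNT(*) FROM {table} GROUP BY {dim}",
    "filter": "SELECT {dim}, COUNT(*) FROM {table} WHERE {pred} GROUP BY {dim}",
    "link":   "SELECT {dim}, COUNT(*) FROM {table} WHERE {linked_pred} GROUP BY {dim}",
}

def sample_workflow(length=8, start="create", seed=None):
    """Sample a sequence of interaction primitives from the Markov model."""
    rng = random.Random(seed)
    state, workflow = start, [start]
    for _ in range(length - 1):
        state = rng.choices(list(TRANSITIONS[state]),
                            weights=TRANSITIONS[state].values())[0]
        workflow.append(state)
    return workflow

if __name__ == "__main__":
    wf = sample_workflow(seed=42)
    print(wf)  # e.g. ['create', 'filter', 'filter', 'link', ...]
    print(SQL_TEMPLATES.get(wf[1], "-- discard issues no query"))
```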

In interval arithmetic, expressions span basic arithmetic, composite formulas, and real-world computational kernels, with all interval inputs instantiated as point intervals and ground-truth results generated by symbolic computation for full result containment verification.
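
A minimal sketch of the point-interval containment check, assuming sympy for the high-precision ground truth and mpmath's iv context as a stand-in for a library under test (the actual benchmark exercises native C/C++ interval libraries through their own APIs):

```python
import sympy as sp
from mpmath import iv  # interval arithmetic, used here as a stand-in library under test

def contains_true_result(expr_str, point_inputs, precision_dps=50):
    """Check that the output interval contains the symbolically computed true value.

    expr_str     -- expression over named variables, e.g. "sin(x) + x*y"
    point_inputs -- dict of variable name -> float; each becomes a point interval [v, v]
    """
    # Ground truth: evaluate the expression symbolically at high precision.
    symbols = {name: sp.Symbol(name) for name in point_inputs}
    expr = sp.sympify(expr_str, locals=symbols)
    true_val = sp.N(expr.subs({symbols[n]: sp.Float(v) for n, v in point_inputs.items()}),
                    precision_dps)

    # "Library under test": evaluate the same expression in interval arithmetic
    # with each input instantiated as a degenerate (point) interval.
    iv_env = {name: iv.mpf([v, v]) for name, v in point_inputs.items()}
    iv_env.update({"sin": iv.sin, "cos": iv.cos, "exp": iv.exp, "sqrt": iv.sqrt})
    out = eval(expr_str, {"__builtins__": {}}, iv_env)

    lo, hi = float(out.a), float(out.b)
    return lo <= float(true_val) <= hi, hi - lo  # containment flag, interval width

if __name__ == "__main__":
    ok, width = contains_true_result("sin(x) + x*y", {"x": 0.5, "y": 2.0})
    print(ok, width)
```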

In generative video, InterBench enumerates hundreds of tasks in environmental interaction, actor action, and entity appearance categories, using pairs of natural language scene/command prompts and evaluating model outputs for spatial, temporal, and physical action fidelity.
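
A minimal sketch of how such prompt-pair tasks could be represented; the field names, categories, and prompts below are illustrative assumptions, not the benchmark's released schema:

```python
from dataclasses import dataclass

@dataclass
class InteractionTask:
    category: str         # e.g. environmental interaction, actor action, entity appearance
    scene_prompt: str     # base prompt describing the starting scene/image
    command_prompt: str   # the user-driven interaction the model must execute
    expected_effect: str  # textual rubric consumed by the VLM-based scorer

# Illustrative tasks; the real benchmark enumerates hundreds over curated image sets.
TASKS = [
    InteractionTask("environmental interaction",
                    "A wooden cabin door in a snowy forest, first-person view.",
                    "Open the door.",
                    "The door swings open smoothly and ends fully open."),
    InteractionTask("actor action",
                    "A knight standing in a courtyard holding a sword.",
                    "The knight raises the sword above his head.",
                    "Only the knight's arm and sword move; the background stays unchanged."),
]

for task in TASKS:
    print(f"[{task.category}] {task.scene_prompt} -> {task.command_prompt}")
```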

3. Evaluation Protocols and Quantitative Metrics

Each InterBench defines metrics tuned to capture both traditional system correctness and nuanced properties of interaction:

Interactive Data Exploration (Eichmann et al., 2018):

  • Latency: $Lat_i = t_{end,i} - t_{start,i}$
  • Time Requirement Violations: indicator of whether $Lat_i > TR$, where $TR$ is the user-specified time requirement.
  • Completeness: $MissingBins = |\{\text{missing bins}\}| \,/\, |\{\text{bins in ground truth}\}|$
  • Error and Accuracy: $MeanRelativeError = \frac{1}{n} \sum_{i=1}^{n} \frac{|F_i - A_i|}{|A_i|}$
  • Distribution Shape: $CosineDistance = 1 - \frac{\sum_i F_i A_i}{\|F\|\,\|A\|}$
  • Confidence Intervals, Bias, Out-of-Margin Rate: All precisely tracked to expose trade-offs between rapid partial results and statistical error.
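
A minimal sketch of computing these metrics for one query, assuming the approximate and ground-truth results are available as dicts from bin key to aggregate value:

```python
import math

def ide_metrics(approx, truth, latency_s, time_requirement_s):
    """Compute IDE-style quality metrics for one query result.

    approx, truth -- dicts mapping bin key -> aggregate value (approximate vs. ground truth)
    """
    # Time requirement violation: did the (partial) result arrive too late?
    violation = latency_s > time_requirement_s

    # Completeness: fraction of ground-truth bins missing from the approximate result.
    missing_bins = len(set(truth) - set(approx)) / len(truth)

    # Mean relative error over bins present in both results.
    shared = [k for k in truth if k in approx and truth[k] != 0]
    mean_rel_err = (sum(abs(approx[k] - truth[k]) / abs(truth[k]) for k in shared) / len(shared)
                    if shared else float("nan"))

    # Distribution shape: cosine distance between binned vectors (missing bins count as 0).
    keys = sorted(truth)
    f = [approx.get(k, 0.0) for k in keys]
    a = [truth[k] for k in keys]
    dot = sum(x * y for x, y in zip(f, a))
    cos_dist = 1 - dot / (math.sqrt(sum(x * x for x in f)) * math.sqrt(sum(y * y for y in a)))

    return {"violation": violation, "missing_bins": missing_bins,
            "mean_relative_error": mean_rel_err, "cosine_distance": cos_dist}

if __name__ == "__main__":
    truth = {"A": 100.0, "B": 50.0, "C": 25.0}
    approx = {"A": 95.0, "B": 60.0}  # bin "C" still missing from the progressive result
    print(ide_metrics(approx, truth, latency_s=0.8, time_requirement_s=1.0))
```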

Interval Arithmetic (Tang et al., 2021):

  • Correctness: Per-expression containment of true result within output interval.
  • Interval Size: $\Delta = \overline{y} - \underline{y}$; detailed distribution statistics collected.
  • Performance: wall-clock time for $10^7$ operations per test per platform.
  • Cross-Platform Consistency: Flag differences in numerical outputs and interval widths.
  • Portability: Record build and run success for all library/platform/compiler combinations.
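
A minimal sketch of the performance and interval-width side of such a harness, assuming a Python-callable wrapper interval_op around whichever library is under test (the published benchmark times native builds, so absolute numbers from a sketch like this are not comparable):

```python
import statistics
import time

def benchmark_interval_op(interval_op, inputs, n_ops=10**7):
    """Time n_ops invocations of an interval operation and summarize result widths.

    interval_op -- callable taking a point value and returning (lower, upper)
    inputs      -- list of point inputs, cycled through during timing
    """
    start = time.perf_counter()
    for i in range(n_ops):
        interval_op(inputs[i % len(inputs)])
    elapsed = time.perf_counter() - start

    widths = [hi - lo for lo, hi in (interval_op(x) for x in inputs)]
    return {
        "wall_clock_s": elapsed,
        "ns_per_op": 1e9 * elapsed / n_ops,
        "width_mean": statistics.mean(widths),
        "width_max": max(widths),
    }

if __name__ == "__main__":
    import math

    # Toy stand-in "library": widen the point result slightly on each side.
    def toy_sin_interval(x):
        y = math.sin(x)
        eps = 1e-16
        return y - eps, y + eps

    print(benchmark_interval_op(toy_sin_interval,
                                [0.1 * k for k in range(1, 100)],
                                n_ops=10**5))
```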

Game Video Model Interaction (Tang et al., 28 Nov 2025):

  • Trigger Rate: $\text{TriggerRate} = \frac{1}{N}\sum_i \mathbf{1}[\text{action}_i \text{ occurs}]$
  • Alignment, Fluency, Scope, EndState, Physics: Each rated on a 0-1-3-5 ordinal scale by a VLM-based scorer.
  • Overall Interaction Score:

$$\text{Overall} = \frac{5 \times \text{Trigger} + \text{Align} + \text{Fluency} + \text{Scope} + \text{EndState} + \text{Physics}}{6}$$

  • Classical Video Metrics: Fréchet Video Distance (FVD), Dynamic Average (optical flow), Relative Pose Error (RPE) for camera motion.
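
A minimal sketch of the aggregation step, assuming each task's VLM ratings are collected into a dict; the field names here are illustrative:

```python
def overall_interaction_score(ratings):
    """Combine one task's ratings into the 0-5 overall interaction score.

    ratings -- dict with a binary 'trigger' (0 or 1) and the five ordinal
               dimensions rated on the 0/1/3/5 scale.
    """
    return (5 * ratings["trigger"]
            + ratings["align"] + ratings["fluency"] + ratings["scope"]
            + ratings["end_state"] + ratings["physics"]) / 6

def trigger_rate(all_ratings):
    """Fraction of tasks in which the commanded action occurred at all."""
    return sum(r["trigger"] for r in all_ratings) / len(all_ratings)

if __name__ == "__main__":
    tasks = [
        {"trigger": 1, "align": 5, "fluency": 3, "scope": 5, "end_state": 3, "physics": 5},
        {"trigger": 0, "align": 0, "fluency": 1, "scope": 0, "end_state": 0, "physics": 1},
    ]
    print(trigger_rate(tasks))                             # 0.5
    print([overall_interaction_score(t) for t in tasks])   # [~4.33, ~0.33]
```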

4. Supported Systems and Baseline Results

InterBench is applied to a broad range of systems, revealing critical behavioral distinctions:

  • Data Exploration (Eichmann et al., 2018): MonetDB, approXimateDB/XDB, IDEA, System X, and System Y are evaluated for latency, error, and response under diverse schema designs. MonetDB exhibits near-linear improvement in time violations as $TR$ increases; IDEA consistently meets sub-second targets with high-quality progressive outputs; hybrid/approximate techniques (approXimateDB) incur unpredictable latency or error for unsupported aggregate queries.
  • Interval Arithmetic (Tang et al., 2021): filib and filib++ in pred-succ/multiplicative modes pass all tests and are consistent across platforms; Boost.Interval fails transcendental and composite cases; BIAS/PROFIL exhibits significant portability and correctness limitations. filib++ multiplicative mode consistently outperforms others on composite expressions, while hardware rounding approaches may yield slightly tighter but less robust intervals.
  • Game World Modeling (Tang et al., 28 Nov 2025): Hunyuan-GameCraft-2 achieves the highest average interaction trigger and alignment scores across all three test categories, substantially surpassing both Wan2.2 A14B and HunyuanVideo baselines. For the prompt "open the door," successful models must exhibit a chain of causally correct visual changes: the door must open (Trigger), the effect must match the spatial region (Scope), proceed smoothly (Fluency), and finish with the door fully open (EndState) without physically implausible deformation (Physics).

5. Insights, Best Practices, and Recommendations

Empirical analysis from InterBench studies motivates several methodological and architectural recommendations:

  • Exploit User Think-Time: In database IDE, speculative computation during user think-time can significantly reduce missing results and latency violations, especially in linked workflow queries (Eichmann et al., 2018); a minimal sketch of this pattern follows this list.
  • Favor Software-Based Certification: For certified interval arithmetic, pure software interval algorithms—especially when avoiding hardware rounding dependencies—are recommended for maximal cross-platform reliability and containment (Tang et al., 2021).
  • Automatic, Fine-Grained Scoring: In generative world modeling, VLM-based ordinal assessment of multiple dimensions is demonstrated to be reproducible, scalable, and sensitive to artifacts that evade traditional global metrics (e.g., FVD) (Tang et al., 28 Nov 2025).
  • Portability and Extensibility: All InterBench variants stress portable, exportable dataset and workflow specifications, parameter tunability, and community contribution mechanisms, enabling longitudinal studies and robust cross-system comparison.
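
A minimal sketch of the speculative think-time pattern referenced above, with hypothetical run_query and predict_next callables standing in for the engine's executor and a next-interaction predictor:

```python
from concurrent.futures import ThreadPoolExecutor

class SpeculativeExecutor:
    """Run likely next queries during user think-time; reuse them on a hit."""

    def __init__(self, run_query, predict_next):
        self.run_query = run_query        # callable: SQL string -> result
        self.predict_next = predict_next  # callable: interaction history -> candidate SQL strings
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.pending = {}

    def on_user_idle(self, history):
        # User is thinking: speculatively start the most likely follow-up queries.
        for sql in self.predict_next(history):
            if sql not in self.pending:
                self.pending[sql] = self.pool.submit(self.run_query, sql)

    def on_user_query(self, sql):
        # If the speculation was correct, the result is already (partially) computed.
        fut = self.pending.pop(sql, None)
        if fut is not None:
            return fut.result()
        return self.run_query(sql)
```

Actual IDE engines implement this inside the execution engine, typically over progressive or partial results, rather than at the client, but the control flow is analogous.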

6. Limitations, Challenges, and Future Directions

Several limitations and open challenges are identified in the design and application of InterBench benchmarks:

  • Scalability of Interactive Benchmarks: As workloads and datasets scale, the need for dynamic workflow generation, non-static data formats (JSON, graphs), and richer schema normalization variants becomes more acute, particularly in data exploration.
  • Coverage of Mathematical Functions: Current interval arithmetic benchmarks lack comprehensive tests for $\tan$, $\log$, vectorized/SIMD routines, and multithreaded safety, limiting generalizability across computational science domains.
  • Diversity of Interaction Patterns: In interactive data exploration, many user-intent types (e.g., interactive model building, visual recommendations, and data cleaning) remain unbenchmarked.
  • Standardization Gaps: Interfaces and APIs for certified computations remain non-standard, complicating integration and fair evaluation.

Recommended future directions include expanding InterBench test suites to new function classes and higher-order interactions; enhancing cross-platform build/reproducibility guarantees; and converging on standard APIs for critical components, especially in interval arithmetic and interactive data systems.

7. References and Community Resources

All major InterBench benchmarks are associated with open-source implementations or documentation portals.

Community contributions of datasets, workflows, and adapters are encouraged in each context, supporting the ongoing extension and validity of InterBench for new classes of systems and interaction paradigms.
