MultiVis-Bench: Multi-Modal Viz Benchmark
- MultiVis-Bench is a benchmark that rigorously evaluates multi-modal, end-to-end data visualization systems by mapping diverse input scenarios to executable Altair code.
- It covers four distinct visualization generation scenarios (text-based, image-referenced, code-referenced, and iterative refinement), with expert-validated outputs throughout.
- The benchmark enables comprehensive assessment through dual-layer scoring, integrated logic rule frameworks, and robust reliability metrics for LLM-based agents.
MultiVis-Bench is a large-scale benchmark designed to rigorously evaluate multi-modal, end-to-end data visualization generation systems across a spectrum of realistic analytical scenarios, with a particular focus on cross-modal and iterative workflows. It provides 1,202 expert-validated cases covering four orthogonal visualization generation scenarios—including basic text-to-visualization, image-referenced, code-referenced, and iterative refinement tasks—aimed at supporting the development and assessment of reliable, logic rule-enhanced LLM frameworks for data visualization. Unlike prior efforts restricted to text-plus-table translation or intermediate representations, MultiVis-Bench enables direct evaluation of systems generating executable Python (Altair) code with built-in protocol mechanisms for comprehensive quality and reliability measurement (Lu et al., 26 Jan 2026).
1. Motivation and Scope
MultiVis-Bench was conceived in response to two central limitations of existing Text-to-Vis benchmarks: their coverage is restricted to single-shot, text-plus-table input and they generate non-executable intermediate representations (e.g., VQL, Vega-Lite), which do not reflect the complexity of real-world analytical workflows. In practice, analysts require multi-modal input channels (spanning reference imagery and code), iterative solution refinement, and outputs that are directly executable for seamless analysis and visualization. MultiVis-Bench explicitly addresses these gaps by formalizing a family of data-to-visualization mappings and providing a unified, expert-curated dataset for end-to-end system evaluation.
2. Multi-Scenario Structure and Data Organization
MultiVis-Bench is organized into four scenario classes, each corresponding to a distinct family of cross-modal visualization tasks. The formal mapping for each scenario is denoted $f_s: X_s \rightarrow Y$, where $X_s$ represents the tuple of input modalities and $Y$ is the Altair visualization code.
| Scenario | Cases | Input Modalities |
|---|---|---|
| A: Basic Generation (BG) | 306 | Natural language + database schema/data |
| B: Image-Referenced Generation (IRG) | 109 | Text + database + reference image |
| C: Code-Referenced Generation (CRG) | 233 | Text + database + reference code (Matplotlib/Altair) |
| D: Iterative Refinement (IR) | 554 | Text + database + previous Altair code |
The 1,202 cases are distributed to sample diverse analytical needs: 127 Altair chart templates are utilized on 141 SQLite databases (from Spider, filtered for schema and attribute diversity). Each example includes high-fidelity input-output pairs, supporting multi-modal evaluation with real database constraints and direct execution of generated Python code.
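As a concrete illustration, a benchmark case can be modeled as a record pairing multi-modal inputs with the expert-validated Altair code, executable against its SQLite database. The field names below are an assumption for illustration, not the released case schema:

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

# Hypothetical case record; field names are illustrative, not the released format.
@dataclass
class BenchCase:
    scenario: str                            # "BG", "IRG", "CRG", or "IR"
    instruction: str                         # natural-language request
    db_path: str                             # SQLite database from the Spider subset
    reference_image: Optional[str] = None    # IRG only
    reference_code: Optional[str] = None     # CRG only
    previous_code: Optional[str] = None      # IR only
    gold_code: str = ""                      # expert-validated Altair program

def load_table(case: BenchCase, query: str):
    """Run a SQL query against the case's database and return all rows."""
    with sqlite3.connect(case.db_path) as conn:
        return conn.execute(query).fetchall()
```

A generated Altair program would then be executed against the same database connection, so evaluation operates on real data rather than a mocked table.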
3. Construction and Quality Assurance Protocols
The assembly of MultiVis-Bench followed a hybrid human-in-the-loop pipeline:
- Example Generation: Draft samples are proposed by LLMs (Gemini-2.0-pro-exp), conditioned on schema, template, and scenario specifications.
- Human Validation: Each case undergoes an average of 2.5 expert review rounds for technical correctness, semantic faithfulness, and perceptual effectiveness.
- Automatic Checks: Python code is formatted via Black and executed on the associated SQLite database; any samples failing syntactic or logical execution, or failing checklist-based semantic/visual quality standards, are rejected or corrected manually.
Dataset releases do not prescribe explicit train/dev/test splits, but users are encouraged to apply stratified sampling (e.g., 70/15/15) by scenario for robust generalization benchmarking (Lu et al., 26 Jan 2026).
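The recommended scenario-stratified 70/15/15 split can be sketched as follows; the case representation (here, `(scenario, case_id)` pairs) is an assumption, and any record exposing a scenario label works the same way:

```python
import random
from collections import defaultdict

def stratified_split(cases, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split cases into train/dev/test, stratified by scenario label.

    `cases` is a list of (scenario, case_id) pairs -- an illustrative
    representation, not the benchmark's released format.
    """
    rng = random.Random(seed)
    by_scenario = defaultdict(list)
    for scenario, case in cases:
        by_scenario[scenario].append(case)
    train, dev, test = [], [], []
    for scenario, bucket in sorted(by_scenario.items()):
        rng.shuffle(bucket)                 # deterministic given the seed
        n_train = round(len(bucket) * ratios[0])
        n_dev = round(len(bucket) * ratios[1])
        train += bucket[:n_train]
        dev += bucket[n_train:n_train + n_dev]
        test += bucket[n_train + n_dev:]
    return train, dev, test
```

Stratifying per scenario keeps the four task families proportionally represented in every split, which matters because the scenarios differ sharply in size (109 vs. 554 cases).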
4. Evaluation Metrics and Scoring Functions
MultiVis-Bench adopts a dual-layer scoring strategy for visualization quality, supplemented by reliability metrics tailored to the unique failure modes of LLM-based agents.
- Low-Level Structural Score: $S_{\text{low}} = \sum_{i} w_i s_i$, covering chart type, data mapping, encoding, interaction, configuration, and transformation. Each dimension score $s_i$ is normalized to $[0, 1]$, and $\sum_i w_i = 1$.
- High-Level Perceptual Score: $S_{\text{high}}$, with a VLM-based judge quantifying type appropriateness, layout, text, data representation, style, and clarity.
- Combined Visualization Score: $S = \alpha S_{\text{low}} + (1 - \alpha) S_{\text{high}}$, with a fixed default $\alpha$.
Reliability Metrics:
- Task Completion Rate: $\text{Completion} = N_{\text{completed}} / N_{\text{total}}$, the fraction of cases for which the agent returns a final visualization.
- Code Execution Success: $\text{ExecSuccess} = N_{\text{exec}} / N_{\text{gen}}$, the fraction of generated programs that execute without error.
This framework ensures both fine-grained assessment of visualization correctness and perceptual quality, as well as robustness to LLM agent failure cases.
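The scoring arithmetic above can be sketched as a small set of helpers. The dimension names and the default weighting are illustrative assumptions; the benchmark's released weights should be consulted for exact values:

```python
def combined_score(low_scores, high_scores, alpha=0.5):
    """Combine structural and perceptual layers into one visualization score.

    low_scores / high_scores: dicts of per-dimension scores in [0, 1].
    alpha: weight on the structural layer (0.5 here is an assumed default).
    Dimensions are equally weighted within each layer for simplicity.
    """
    s_low = sum(low_scores.values()) / len(low_scores)
    s_high = sum(high_scores.values()) / len(high_scores)
    return alpha * s_low + (1 - alpha) * s_high

def completion_rate(n_completed, n_total):
    """Fraction of benchmark cases the agent finished end-to-end."""
    return n_completed / n_total

def exec_success_rate(n_executed_ok, n_generated):
    """Fraction of generated programs that ran without error."""
    return n_executed_ok / n_generated
```

Keeping the two layers separate until the final convex combination lets a system report structural and perceptual quality independently, which is useful when diagnosing whether failures come from wrong encodings or from poor visual design.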
5. Logic Rule Framework and Benchmarking Robustness
Although structurally static, MultiVis-Bench is intrinsically designed to facilitate evaluation under logic rule-enhanced agent frameworks (e.g., MultiVis-Agent). The evaluation protocol is underpinned by a four-layer logic rule system that regulates:
- Coordination (CR): scenario classification over the input modalities, with a fixed preference ordering among the four scenario types.
- Tool Execution (TE): parameter boundary constraints that reject out-of-range tool arguments before execution.
- Error Handling (EH): automatic classification of erroneous LLM outputs into error types.
- ReAct Control (RC): iteration termination conditions, including a hard cap on loop count.
These rules guarantee that benchmark runs avoid infinite loops, boundary violations, and error cascades, enabling repeatable, trustworthy system comparisons.
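The ReAct Control rule above can be illustrated by a termination guard of the following shape; the cap value and the `step` interface are assumptions, not MultiVis-Agent's actual API:

```python
def run_react_loop(step, max_iters=10):
    """Run a ReAct-style act/observe loop under a hard iteration cap.

    `step` is a callable taking the iteration index and returning
    (done, result). `max_iters` is the hard cap preventing infinite
    loops; the value 10 is an assumption for illustration.
    """
    for i in range(max_iters):
        done, result = step(i)
        if done:
            return result, i + 1    # terminated by the success condition
    return None, max_iters          # hard cap reached; signal non-completion
```

Returning an explicit non-completion signal (rather than raising) lets the evaluation harness count the case against the Task Completion Rate instead of crashing the run.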
6. Empirical Results and Comparative Analysis
Experimental outcomes with MultiVis-Agent on MultiVis-Bench demonstrate substantial performance improvements attributable to the logic rule-backed approach:
- Visualization Quality: Average of 74.18% across scenarios A–C, surpassing the strongest baseline by approximately 10 percentage points; in scenario B (IRG) MultiVis-Agent achieves 75.63% versus 62.79% (LLM Workflow) and 57.54% (Instructing LLM).
- Reliability Gains:
- Basic Generation: Completion 98.68% (+20.46 pp), ExecSuccess 95.71% (+32.53 pp)
- Image-Referenced Generation: Completion 99.58% (+25.10 pp), ExecSuccess 94.56% (+29.46 pp)
- Code-Referenced Generation: Completion 99.81% (+9.25 pp), ExecSuccess 96.32% (+15.20 pp)
- Iterative Refinement: Completion 100.00% (+7.97 pp), ExecSuccess 97.10% (+13.77 pp)
A plausible implication is that these logic rule constraints are necessary for both high-level output quality and operational stability in LLM-based, multi-modal visualization agents (Lu et al., 26 Jan 2026).
7. Significance and Future Directions
MultiVis-Bench represents a step-change in the benchmarking of automated analytical visualization agents by marrying scenario diversity, rigorous ground-truth curation, and executable output specification. It sets a new methodological standard for evaluating cross-modal, iterative, and agent-centric systems, especially those relying on LLMs combined with logic control. While MultiVis-Bench currently operates as a static dataset, its logic rule-aware design anticipates broader adoption of programmatic reliability mechanisms in autonomous data analytics frameworks. Future expansions might incorporate new modalities or domain-specific constraints but will likely retain the dual emphasis on end-to-end, executable evaluation and formalized system robustness (Lu et al., 26 Jan 2026).