MultiVis-Bench: Multi-Modal Viz Benchmark

Updated 28 January 2026
  • MultiVis-Bench is a benchmark that rigorously evaluates multi-modal, end-to-end data visualization systems by mapping diverse input scenarios to executable Altair code.
  • It covers four distinct visualization generation scenarios, including text, image, code references, and iterative refinement, ensuring high-quality, expert-validated outputs.
  • The benchmark enables comprehensive assessment through dual-layer scoring, integrated logic rule frameworks, and robust reliability metrics for LLM-based agents.

MultiVis-Bench is a large-scale benchmark designed to rigorously evaluate multi-modal, end-to-end data visualization generation systems across a spectrum of realistic analytical scenarios, with a particular focus on cross-modal and iterative workflows. It provides 1,202 expert-validated cases covering four orthogonal visualization generation scenarios—including basic text-to-visualization, image-referenced, code-referenced, and iterative refinement tasks—aimed at supporting the development and assessment of reliable, logic rule-enhanced LLM frameworks for data visualization. Unlike prior efforts restricted to text-plus-table translation or intermediate representations, MultiVis-Bench enables direct evaluation of systems generating executable Python (Altair) code with built-in protocol mechanisms for comprehensive quality and reliability measurement (Lu et al., 26 Jan 2026).

1. Motivation and Scope

MultiVis-Bench was conceived in response to two central limitations of existing Text-to-Vis benchmarks: their coverage is restricted to single-shot, text-plus-table input and they generate non-executable intermediate representations (e.g., VQL, Vega-Lite), which do not reflect the complexity of real-world analytical workflows. In practice, analysts require multi-modal input channels (spanning reference imagery and code), iterative solution refinement, and outputs that are directly executable for seamless analysis and visualization. MultiVis-Bench explicitly addresses these gaps by formalizing a family of data-to-visualization mappings and providing a unified, expert-curated dataset for end-to-end system evaluation.

2. Multi-Scenario Structure and Data Organization

MultiVis-Bench is organized into four scenario classes, each corresponding to a distinct family of cross-modal visualization tasks. The formal mapping for each scenario is denoted $f_S: \mathcal{X} \rightarrow V$, where $\mathcal{X}$ represents the tuple of input modalities and $V$ is the Altair visualization code.

| Scenario | Inputs $\mathcal{X}$ | Cases | Key Modalities |
|---|---|---|---|
| A—Basic Generation (BG) | $(Q, D)$ | 306 | Natural language + database schema/data |
| B—Image-Referenced Generation (IRG) | $(Q, D, I_{ref})$ | 109 | Text + database + reference image |
| C—Code-Referenced Generation (CRG) | $(Q, D, C_{ref})$ | 233 | Text + database + reference code (Matplotlib/Altair) |
| D—Iterative Refinement (IR) | $(Q, D, V_{old})$ | 554 | Text + database + previous Altair code |

The 1,202 cases are distributed to sample diverse analytical needs: 127 Altair chart templates are utilized on 141 SQLite databases (from Spider, filtered for schema and attribute diversity). Each example includes high-fidelity input-output pairs, supporting multi-modal evaluation with real database constraints and direct execution of generated Python code.
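A single case in this layout can be sketched as a simple record carrying the input tuple $\mathcal{X}$ and the gold Altair program; the field names below are illustrative, not the released schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchCase:
    """Hypothetical record for one MultiVis-Bench case (field names assumed)."""
    scenario: str                            # "A" (BG), "B" (IRG), "C" (CRG), "D" (IR)
    query: str                               # natural-language request Q
    database: str                            # path to the SQLite database D
    reference_image: Optional[str] = None    # I_ref, scenario B only
    reference_code: Optional[str] = None     # C_ref, scenario C only
    previous_code: Optional[str] = None      # V_old, scenario D only
    gold_code: str = ""                      # expert-validated Altair program V

    def inputs(self):
        """Assemble the input tuple X for the mapping f_S: X -> V."""
        extras = {
            "B": self.reference_image,
            "C": self.reference_code,
            "D": self.previous_code,
        }
        x = (self.query, self.database)
        return x if self.scenario == "A" else x + (extras[self.scenario],)

case = BenchCase(scenario="D", query="Switch to a log y-axis",
                 database="concert.sqlite", previous_code="alt.Chart(df)...")
assert len(case.inputs()) == 3
```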

3. Construction and Quality Assurance Protocols

The assembly of MultiVis-Bench followed a hybrid human-in-the-loop pipeline:

  • Example Generation: Draft samples are proposed by LLMs (Gemini-2.0-pro-exp), conditioned on schema, template, and scenario specifications.
  • Human Validation: Each case undergoes an average of 2.5 expert review rounds for technical correctness, semantic faithfulness, and perceptual effectiveness.
  • Automatic Checks: Python code is formatted via Black and executed on the associated SQLite database; any samples failing syntactic or logical execution, or failing checklist-based semantic/visual quality standards, are rejected or corrected manually.
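The automatic-check stage can be approximated with a minimal execution gate. This sketch covers only the syntax and execution checks (the released pipeline additionally formats with Black and applies checklist-based semantic/visual review), and exposing a `conn` handle to candidate code is an assumption:

```python
import sqlite3

def passes_auto_checks(code: str, db_path: str) -> bool:
    """Reject samples that fail syntactic or logical execution (sketch)."""
    try:
        compiled = compile(code, "<candidate>", "exec")   # syntax check
    except SyntaxError:
        return False
    conn = sqlite3.connect(db_path)
    try:
        exec(compiled, {"conn": conn})                    # execution check
        return True
    except Exception:
        return False
    finally:
        conn.close()

assert passes_auto_checks("rows = conn.execute('SELECT 1').fetchall()", ":memory:")
assert not passes_auto_checks("SELECT 1", ":memory:")     # not valid Python
```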

Dataset releases do not prescribe explicit train/dev/test splits, but users are encouraged to apply stratified sampling (e.g., 70/15/15) by scenario for robust generalization benchmarking (Lu et al., 26 Jan 2026).
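Since no official split is prescribed, a stratified 70/15/15 split by scenario might look like the following sketch (function name and ratios are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(cases, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split cases into train/dev/test, stratified by scenario (sketch)."""
    rng = random.Random(seed)
    by_scenario = defaultdict(list)
    for c in cases:
        by_scenario[c["scenario"]].append(c)
    splits = ([], [], [])
    for group in by_scenario.values():
        rng.shuffle(group)
        n = len(group)
        a = int(n * ratios[0])
        b = a + int(n * ratios[1])
        for part, chunk in zip(splits, (group[:a], group[a:b], group[b:])):
            part.extend(chunk)
    return splits

cases = [{"scenario": s, "id": i} for i, s in enumerate("ABCD" * 25)]
train, dev, test = stratified_split(cases)
assert len(train) + len(dev) + len(test) == 100
```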

4. Evaluation Metrics and Scoring Functions

MultiVis-Bench adopts a dual-layer scoring strategy for visualization quality, supplemented by reliability metrics tailored to the unique failure modes of LLM-based agents.

  • Low-Level Structural Score: $S_{low} = \sum_{i=1}^{6} w_i M_i$, covering chart type, data mapping, encoding, interaction, configuration, and transformation. Each $M_i \in [0, 1]$ is normalized, and $\sum_i w_i = 1$.
  • High-Level Perceptual Score: $S_{high} = \sum_{j=1}^{6} v_j P_j$, with $P_j$ (VLM-based) quantifying type appropriateness, layout, text, data representation, style, and clarity.
  • Combined Visualization Score: $S_{vis} = \alpha S_{low} + (1-\alpha) S_{high}$, with $\alpha = 0.5$ by default.
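The dual-layer combination follows directly from these formulas; the uniform weights in this sketch are illustrative, as the benchmark's actual $w_i$ and $v_j$ values are not given here:

```python
def combined_score(low_metrics, high_metrics, w, v, alpha=0.5):
    """S_vis = alpha * S_low + (1 - alpha) * S_high, with each metric in
    [0, 1] and each weight vector summing to 1 (weights illustrative)."""
    assert abs(sum(w) - 1) < 1e-9 and abs(sum(v) - 1) < 1e-9
    s_low = sum(wi * mi for wi, mi in zip(w, low_metrics))    # structural layer
    s_high = sum(vj * pj for vj, pj in zip(v, high_metrics))  # perceptual layer
    return alpha * s_low + (1 - alpha) * s_high

uniform = [1 / 6] * 6
score = combined_score([1, 1, 0.5, 1, 1, 1], [0.8] * 6, uniform, uniform)
assert abs(score - (0.5 * (5.5 / 6) + 0.5 * 0.8)) < 1e-9
```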

Reliability Metrics:

  • Task Completion Rate: $\#\text{tasks passed} \,/\, \#\text{tasks}$
  • Code Execution Success: $\#\text{snippets executed without error} \,/\, \#\text{snippets}$
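Both rates reduce to simple ratios over a run of benchmark cases; the `passed`/`executed` field names in this sketch are assumptions:

```python
def reliability(results):
    """Compute (completion rate, execution success rate) over a run.

    Each result is assumed to carry 'passed' (task completed) and
    'executed' (code ran without error) booleans; names illustrative.
    """
    n = len(results)
    completion = sum(r["passed"] for r in results) / n
    exec_success = sum(r["executed"] for r in results) / n
    return completion, exec_success

runs = [{"passed": True, "executed": True}] * 9 + [{"passed": False, "executed": True}]
assert reliability(runs) == (0.9, 1.0)
```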

This framework ensures both fine-grained assessment of visualization correctness and perceptual quality, as well as robustness to LLM agent failure cases.

5. Logic Rule Framework and Benchmarking Robustness

Although structurally static, MultiVis-Bench is intrinsically designed to facilitate evaluation under logic rule-enhanced agent frameworks (e.g., MultiVis-Agent). The evaluation protocol is underpinned by a four-layer logic rule system that regulates:

  • Coordination (CR): Scenario classification by $\mathcal{F}_T(I) = \arg\max_{t \in \{A,B,C,D\}} \pi(t, I)$, with preference ordering $D \succ C \succ B \succ A$.
  • Tool Execution (TE): Parameter boundary constraints $\phi(p, v)$.
  • Error Handling (EH): Automatic LLM output error classification $\epsilon(x)$.
  • ReAct Control (RC): Iteration termination $\tau(i, r)$, including a hard cap of $T_{max} = 10$.
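The coordination rule $\mathcal{F}_T$ can be sketched as a preference-ordered classifier over the input modalities that are present, so richer inputs dominate; the input-key names here are illustrative:

```python
# Preference ordering D > C > B > A: check for the scenario-defining
# modality in that order and fall back to basic generation.
PREFERENCE = ["D", "C", "B", "A"]
REQUIRES = {"D": "previous_code", "C": "reference_code",
            "B": "reference_image", "A": None}

def classify_scenario(inputs: dict) -> str:
    """Pick the scenario class from the available inputs (sketch)."""
    for t in PREFERENCE:
        key = REQUIRES[t]
        if key is None or inputs.get(key) is not None:
            return t
    raise ValueError("unclassifiable input")

# An input carrying previous code classifies as D even if an image is present.
assert classify_scenario({"query": "q", "database": "db",
                          "reference_image": "img.png",
                          "previous_code": "alt.Chart..."}) == "D"
```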

These rules ensure that benchmark runs avoid infinite loops, parameter boundary violations, and error cascades, enabling repeatable, trustworthy system comparisons.

6. Empirical Results and Comparative Analysis

Experimental outcomes with MultiVis-Agent on MultiVis-Bench demonstrate substantial performance improvements attributable to the logic rule-backed approach:

  • Visualization Quality: Average $S_{vis}$ of 74.18% across scenarios A–C, surpassing the strongest baseline by approximately 10 percentage points; in scenario B (IRG), MultiVis-Agent achieves 75.63% versus 62.79% (LLM Workflow) and 57.54% (Instructing LLM).
  • Reliability Gains:
    • Basic Generation: Completion 98.68% (+20.46 pp), ExecSuccess 95.71% (+32.53 pp)
    • Image-Referenced Generation: Completion 99.58% (+25.10 pp), ExecSuccess 94.56% (+29.46 pp)
    • Code-Referenced Generation: Completion 99.81% (+9.25 pp), ExecSuccess 96.32% (+15.20 pp)
    • Iterative Refinement: Completion 100.00% (+7.97 pp), ExecSuccess 97.10% (+13.77 pp)

A plausible implication is that these logic rule constraints are necessary for both high-level output quality and operational stability in LLM-based, multi-modal visualization agents (Lu et al., 26 Jan 2026).

7. Significance and Future Directions

MultiVis-Bench represents a step-change in the benchmarking of automated analytical visualization agents by marrying scenario diversity, rigorous ground-truth curation, and executable output specification. It sets a new methodological standard for evaluating cross-modal, iterative, and agent-centric systems, especially those relying on LLMs combined with logic control. While MultiVis-Bench currently operates as a static dataset, its logic rule-aware design anticipates broader adoption of programmatic reliability mechanisms in autonomous data analytics frameworks. Future expansions might incorporate new modalities or domain-specific constraints but will likely retain the dual emphasis on end-to-end, executable evaluation and formalized system robustness (Lu et al., 26 Jan 2026).
