MultiVis-Bench: Multi-Modal Viz Benchmark
- MultiVis-Bench is a benchmark that rigorously evaluates multi-modal, end-to-end data visualization systems by mapping diverse input scenarios to executable Altair code.
- It covers four distinct visualization generation scenarios (text-based, image-referenced, code-referenced, and iterative refinement), with expert-validated outputs throughout.
- The benchmark enables comprehensive assessment through dual-layer scoring, integrated logic rule frameworks, and robust reliability metrics for LLM-based agents.
MultiVis-Bench is a large-scale benchmark designed to rigorously evaluate multi-modal, end-to-end data visualization generation systems across a spectrum of realistic analytical scenarios, with a particular focus on cross-modal and iterative workflows. It provides 1,202 expert-validated cases covering four orthogonal visualization generation scenarios—including basic text-to-visualization, image-referenced, code-referenced, and iterative refinement tasks—aimed at supporting the development and assessment of reliable, logic rule-enhanced LLM frameworks for data visualization. Unlike prior efforts restricted to text-plus-table translation or intermediate representations, MultiVis-Bench enables direct evaluation of systems generating executable Python (Altair) code with built-in protocol mechanisms for comprehensive quality and reliability measurement (Lu et al., 26 Jan 2026).
1. Motivation and Scope
MultiVis-Bench was conceived in response to two central limitations of existing Text-to-Vis benchmarks: their coverage is restricted to single-shot, text-plus-table input and they generate non-executable intermediate representations (e.g., VQL, Vega-Lite), which do not reflect the complexity of real-world analytical workflows. In practice, analysts require multi-modal input channels (spanning reference imagery and code), iterative solution refinement, and outputs that are directly executable for seamless analysis and visualization. MultiVis-Bench explicitly addresses these gaps by formalizing a family of data-to-visualization mappings and providing a unified, expert-curated dataset for end-to-end system evaluation.
2. Multi-Scenario Structure and Data Organization
MultiVis-Bench is organized into four scenario classes, each corresponding to a distinct family of cross-modal visualization tasks. The formal mapping for each scenario is denoted $f_s: X_s \rightarrow Y$, where $X_s$ represents the tuple of input modalities and $Y$ is the Altair visualization code.
| Scenario | Cases | Input Modalities |
|---|---|---|
| A: Basic Generation (BG) | 306 | Natural language + database schema/data |
| B: Image-Referenced Generation (IRG) | 109 | Text + database + reference image |
| C: Code-Referenced Generation (CRG) | 233 | Text + database + reference code (Matplotlib/Altair) |
| D: Iterative Refinement (IR) | 554 | Text + database + previous Altair code |
The 1,202 cases are distributed to sample diverse analytical needs: 127 Altair chart templates are utilized on 141 SQLite databases (from Spider, filtered for schema and attribute diversity). Each example includes high-fidelity input-output pairs, supporting multi-modal evaluation with real database constraints and direct execution of generated Python code.
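As a concrete illustration, a benchmark case can be modeled as a record pairing multi-modal inputs with the expert-validated Altair code, executable against its SQLite database. The field names below are an assumption for illustration, not the released case schema:

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

# Hypothetical case record; field names are illustrative, not the released format.
@dataclass
class BenchCase:
    scenario: str                            # "BG", "IRG", "CRG", or "IR"
    instruction: str                         # natural-language request
    db_path: str                             # SQLite database from the Spider subset
    reference_image: Optional[str] = None    # IRG only
    reference_code: Optional[str] = None     # CRG only
    previous_code: Optional[str] = None      # IR only
    gold_code: str = ""                      # expert-validated Altair program

def load_table(case: BenchCase, query: str):
    """Run a SQL query against the case's database and return all rows."""
    with sqlite3.connect(case.db_path) as conn:
        return conn.execute(query).fetchall()
```

A generated Altair program would then be executed against the same database connection, so evaluation operates on real data rather than a mocked table.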
3. Construction and Quality Assurance Protocols
The assembly of MultiVis-Bench followed a hybrid human-in-the-loop pipeline:
- Example Generation: Draft samples are proposed by LLMs (Gemini-2.0-pro-exp), conditioned on schema, template, and scenario specifications.
- Human Validation: Each case undergoes an average of 2.5 expert review rounds for technical correctness, semantic faithfulness, and perceptual effectiveness.
- Automatic Checks: Python code is formatted via Black and executed on the associated SQLite database; any samples failing syntactic or logical execution, or failing checklist-based semantic/visual quality standards, are rejected or corrected manually.
Dataset releases do not prescribe explicit train/dev/test splits, but users are encouraged to apply stratified sampling (e.g., 70/15/15) by scenario for robust generalization benchmarking (Lu et al., 26 Jan 2026).
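The recommended scenario-stratified 70/15/15 split can be sketched as follows; the case representation (here, `(scenario, case_id)` pairs) is an assumption, and any record exposing a scenario label works the same way:

```python
import random
from collections import defaultdict

def stratified_split(cases, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split cases into train/dev/test, stratified by scenario label.

    `cases` is a list of (scenario, case_id) pairs -- an illustrative
    representation, not the benchmark's released format.
    """
    rng = random.Random(seed)
    by_scenario = defaultdict(list)
    for scenario, case in cases:
        by_scenario[scenario].append(case)
    train, dev, test = [], [], []
    for scenario, bucket in sorted(by_scenario.items()):
        rng.shuffle(bucket)                 # deterministic given the seed
        n_train = round(len(bucket) * ratios[0])
        n_dev = round(len(bucket) * ratios[1])
        train += bucket[:n_train]
        dev += bucket[n_train:n_train + n_dev]
        test += bucket[n_train + n_dev:]
    return train, dev, test
```

Stratifying per scenario keeps the four task families proportionally represented in every split, which matters because the scenarios differ sharply in size (109 vs. 554 cases).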
4. Evaluation Metrics and Scoring Functions
MultiVis-Bench adopts a dual-layer scoring strategy for visualization quality, supplemented by reliability metrics tailored to the unique failure modes of LLM-based agents.
- Low-Level Structural Score: $S_{\text{low}} = \sum_{i} w_i s_i$, covering chart type, data mapping, encoding, interaction, configuration, and transformation. Each dimension score $s_i$ is normalized to $[0, 1]$, and $\sum_i w_i = 1$.
- High-Level Perceptual Score: $S_{\text{high}}$, with a VLM-based judge quantifying type appropriateness, layout, text, data representation, style, and clarity.
- Combined Visualization Score: $S = \alpha S_{\text{low}} + (1 - \alpha) S_{\text{high}}$, with a fixed default $\alpha$.
Reliability Metrics:
- Task Completion Rate: $\text{Completion} = N_{\text{completed}} / N_{\text{total}}$, the fraction of cases for which the agent returns a final visualization.
- Code Execution Success: $\text{ExecSuccess} = N_{\text{exec}} / N_{\text{gen}}$, the fraction of generated programs that execute without error.
This framework ensures both fine-grained assessment of visualization correctness and perceptual quality, as well as robustness to LLM agent failure cases.
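The scoring arithmetic above can be sketched as a small set of helpers. The dimension names and the default weighting are illustrative assumptions; the benchmark's released weights should be consulted for exact values:

```python
def combined_score(low_scores, high_scores, alpha=0.5):
    """Combine structural and perceptual layers into one visualization score.

    low_scores / high_scores: dicts of per-dimension scores in [0, 1].
    alpha: weight on the structural layer (0.5 here is an assumed default).
    Dimensions are equally weighted within each layer for simplicity.
    """
    s_low = sum(low_scores.values()) / len(low_scores)
    s_high = sum(high_scores.values()) / len(high_scores)
    return alpha * s_low + (1 - alpha) * s_high

def completion_rate(n_completed, n_total):
    """Fraction of benchmark cases the agent finished end-to-end."""
    return n_completed / n_total

def exec_success_rate(n_executed_ok, n_generated):
    """Fraction of generated programs that ran without error."""
    return n_executed_ok / n_generated
```

Keeping the two layers separate until the final convex combination lets a system report structural and perceptual quality independently, which is useful when diagnosing whether failures come from wrong encodings or from poor visual design.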
5. Logic Rule Framework and Benchmarking Robustness
Although structurally static, MultiVis-Bench is intrinsically designed to facilitate evaluation under logic rule-enhanced agent frameworks (e.g., MultiVis-Agent). The evaluation protocol is underpinned by a four-layer logic rule system that regulates:
- Coordination (CR): scenario classification over the input modalities, with a fixed preference ordering among the four scenario types.
- Tool Execution (TE): parameter boundary constraints that reject out-of-range tool arguments before execution.
- Error Handling (EH): automatic classification of erroneous LLM outputs into error types.
- ReAct Control (RC): iteration termination conditions, including a hard cap on loop count.
These rules guarantee that benchmark runs avoid infinite loops, boundary violations, and error cascades, enabling repeatable, trustworthy system comparisons.
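The ReAct Control rule above can be illustrated by a termination guard of the following shape; the cap value and the `step` interface are assumptions, not MultiVis-Agent's actual API:

```python
def run_react_loop(step, max_iters=10):
    """Run a ReAct-style act/observe loop under a hard iteration cap.

    `step` is a callable taking the iteration index and returning
    (done, result). `max_iters` is the hard cap preventing infinite
    loops; the value 10 is an assumption for illustration.
    """
    for i in range(max_iters):
        done, result = step(i)
        if done:
            return result, i + 1    # terminated by the success condition
    return None, max_iters          # hard cap reached; signal non-completion
```

Returning an explicit non-completion signal (rather than raising) lets the evaluation harness count the case against the Task Completion Rate instead of crashing the run.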
6. Empirical Results and Comparative Analysis
Experimental outcomes with MultiVis-Agent on MultiVis-Bench demonstrate substantial performance improvements attributable to the logic rule-backed approach:
- Visualization Quality: Average of 74.18% across scenarios A–C, surpassing the strongest baseline by approximately 10 percentage points; in scenario B (IRG) MultiVis-Agent achieves 75.63% versus 62.79% (LLM Workflow) and 57.54% (Instructing LLM).
- Reliability Gains:
- Basic Generation: Completion 98.68% (+20.46 pp), ExecSuccess 95.71% (+32.53 pp)
- Image-Referenced Generation: Completion 99.58% (+25.10 pp), ExecSuccess 94.56% (+29.46 pp)
- Code-Referenced Generation: Completion 99.81% (+9.25 pp), ExecSuccess 96.32% (+15.20 pp)
- Iterative Refinement: Completion 100.00% (+7.97 pp), ExecSuccess 97.10% (+13.77 pp)
A plausible implication is that these logic rule constraints are necessary for both high-level output quality and operational stability in LLM-based, multi-modal visualization agents (Lu et al., 26 Jan 2026).
7. Significance and Future Directions
MultiVis-Bench represents a step-change in the benchmarking of automated analytical visualization agents by marrying scenario diversity, rigorous ground-truth curation, and executable output specification. It sets a new methodological standard for evaluating cross-modal, iterative, and agent-centric systems, especially those relying on LLMs combined with logic control. While MultiVis-Bench currently operates as a static dataset, its logic rule-aware design anticipates broader adoption of programmatic reliability mechanisms in autonomous data analytics frameworks. Future expansions might incorporate new modalities or domain-specific constraints but will likely retain the dual emphasis on end-to-end, executable evaluation and formalized system robustness (Lu et al., 26 Jan 2026).