MultiVis-Agent: Cross-Modal Visualization Framework
- MultiVis-Agent is a multi-agent, cross-modal visualization framework that unifies LLM-driven reasoning with deterministic logic to ensure robust analytics.
- Its modular design employs specialized agents—Database Query, Visualization Implementation, and Validation—to support iterative refinement and reference-guided synthesis.
- Empirical results on MultiVis-Bench show significant gains in reliability and execution success, underscoring its practical viability for real-world visual analytics.
MultiVis-Agent is a multi-agent, cross-modal visualization generation framework designed to deliver mathematically reliable and flexible visual analytics pipelines. It unifies LLM-driven reasoning and code synthesis with deterministic logic rule enforcement, aiming to support real-world, multi-modal visualization scenarios—including iterative refinement, reference-guided synthesis, and perceptual grounding—while systematically mitigating reliability hazards and failure modes common in LLM-centric toolchains (Lu et al., 26 Jan 2026).
1. System Architecture and Modular Agent Design
MultiVis-Agent adopts a centralized multi-agent architecture, comprising a Coordinator Agent and three specialized worker agents, enabling robust decomposition and orchestration of complex visualization tasks. The core agents and their tools are:
- Database Query Agent (DQ Agent): Responsible for schema exploration and SQL query generation via iterative Thought–Action–Observation (TAO) interactions. Equipped with tools such as list_tables, get_table, get_foreign_keys, find_fields, and execute_sql.
- Visualization Implementation Agent (Vis Agent): Produces executable Altair/Python visualization code and supports two interfaces—generate_visualization_code (initial synthesis) and modify_visualization_code (for iterative or reference-driven refinement)—by leveraging code examples and prior outputs.
- Validation and Evaluation Agent (VE Agent): Executes Altair or Matplotlib code, assesses technical correctness and perceptual effectiveness (using VLMs for chart scoring), and generates feedback for further refinement.
- Coordinator Agent: Manages the workflow by applying strict logic rules for task classification, tool selection, result validation, and error control, guaranteeing coordinated agent interactions and global state consistency.
This architectural paradigm supports seamless cross-modal fusion: textual database queries, reference images, pre-existing or reference code, and prior visualizations are integrally accessible for both one-shot and iterative execution contexts.
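The coordination pattern above can be illustrated with a minimal Python sketch. All class and method names here are illustrative stand-ins, not the framework's actual API; the only method names taken from the text are `generate_visualization_code`-style interfaces.

```python
# Minimal sketch of centralized multi-agent coordination (illustrative, not the real API).

class DatabaseQueryAgent:
    def run(self, task):
        # Iterative TAO-style schema exploration + SQL generation would happen here.
        return {"sql": "SELECT region, sales FROM orders",
                "rows": [("EU", 120), ("US", 95)]}

class VisualizationAgent:
    def generate_visualization_code(self, data):
        # Would emit executable Altair code for the queried data.
        return "alt.Chart(df).mark_bar().encode(x='region', y='sales')"

class ValidationAgent:
    def evaluate(self, code):
        # Would execute the code and score the chart (e.g., via a VLM).
        return {"executes": True, "score": 0.9, "feedback": ""}

class Coordinator:
    """Routes a task through DQ -> Vis -> VE and applies validation rules."""

    def __init__(self):
        self.dq = DatabaseQueryAgent()
        self.vis = VisualizationAgent()
        self.ve = ValidationAgent()

    def handle(self, task):
        data = self.dq.run(task)
        code = self.vis.generate_visualization_code(data)
        report = self.ve.evaluate(code)
        # Deterministic rule: accept only executable, sufficiently scored charts
        # (the 0.8 acceptance threshold is an assumed placeholder).
        status = "accepted" if report["executes"] and report["score"] >= 0.8 else "needs_refinement"
        return {"status": status, "code": code}
```

In this sketch the Coordinator owns all control flow, so worker agents remain stateless and interchangeable, mirroring the centralized orchestration described above.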
2. Four-Layer Logic Rule Framework and Formal Guarantees
At the system's core is a rigorous four-layer logic rule framework, comprising Coordination (CR), Tool Execution (TE), Error Handling (EH), and ReAct Control (RC) rules, which governs all agent operations:
- Coordination Rules (CR): Provide formal logic for task type detection (e.g., prioritize iterative refinement over basic generation), guarantee tool prerequisite satisfaction before calls, and deterministically map evaluation outcomes to system actions.
- Tool Execution Rules (TE): Enforce parameter safety (out-of-range values are clipped or flagged as invalid), standardized execution environments (atomic rollback, namespace normalization), and reference material validation.
- Error Handling Rules (EH): Systematic classification and bounded recovery from errors in tool invocations, malformed code, and image processing:
- Error recovery strategy mapping ensures all recoveries terminate within bounded steps.
- ReAct Control Rules (RC): Iteration regulation and response validation, with RC-Rule 1 imposing a strict upper bound on the number of TAO loop iterations, assuring guaranteed termination and immunity to infinite loops.
Key theorems derived from these rules include:
- Parameter Safety: All tool executions either conform to safe parameter boundaries or fail gracefully.
- Bounded Error Recovery: Recovery from any error completes within a bounded number of steps.
- Guaranteed Termination: No execution trace exceeds the fixed iteration bound.
- System Reliability: Satisfying all prior theorems yields global reliability (Lu et al., 26 Jan 2026).
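The termination and parameter-safety guarantees can be sketched as follows. The iteration cap value and clipping bounds are illustrative assumptions, not the paper's published constants.

```python
MAX_ITERS = 5  # RC-Rule 1 style hard cap on TAO iterations (value assumed)

def clip_param(value, lo, hi):
    """TE-style parameter safety: out-of-range values are clipped to bounds."""
    return max(lo, min(hi, value))

def react_loop(step_fn, max_iters=MAX_ITERS):
    """Thought-Action-Observation loop with guaranteed termination.

    step_fn(i) returns (done, observation); the loop exits either when the
    task reports completion or when the hard iteration cap is reached, so
    every execution trace has length <= max_iters.
    """
    trace = []
    for i in range(max_iters):
        done, obs = step_fn(i)
        trace.append(obs)
        if done:
            return trace, "completed"
    return trace, "cap_reached"  # bounded: cannot loop forever
```

Because the loop body contains no unbounded control flow, termination follows directly from the `range(max_iters)` bound, which is the essence of the Guaranteed Termination theorem above.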
3. Task Formalization and Scenario Coverage
MultiVis-Agent formally decomposes cross-modal visualization into four distinct scenarios, each mapped to a specialized agent workflow:
- Basic Generation (BG): a new visualization is synthesized from a textual query over a database.
- Image-Referenced Generation (IRG): synthesis is additionally conditioned on a reference chart image.
- Code-Referenced Generation (CRG): synthesis is guided by reference visualization code.
- Iterative Refinement (IR): an existing visualization and its code are modified according to follow-up instructions.
Letting $M$ denote the input modality set, MultiVis-Agent computes a scenario-specific mapping from $M$ to an agent workflow, enabling context-sensitive tool selection and refinement pathways. Task type detection is performed deterministically by maximizing a modality score over the candidate scenario set.
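Deterministic scenario detection by modality-score maximization might look like the following sketch. The scoring function and modality labels are assumptions for illustration, not the paper's actual definitions.

```python
# Required modality sets per scenario (labels assumed for illustration).
SCENARIOS = {
    "BG":  {"text", "database"},
    "IRG": {"text", "database", "image"},
    "CRG": {"text", "database", "code"},
    "IR":  {"text", "database", "existing_vis"},
}

def detect_scenario(present_modalities):
    """Pick the scenario whose required modality set best matches the input.

    Score = |required & present| - |required - present|, so scenarios with
    missing required modalities are penalized; ties break by declaration
    order, which places the simpler BG scenario first.
    """
    present = set(present_modalities)

    def score(required):
        return len(required & present) - len(required - present)

    return max(SCENARIOS, key=lambda s: score(SCENARIOS[s]))
```

Because the score is a pure function of the input modality set, the same inputs always yield the same scenario, matching the deterministic detection described above.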
4. MultiVis-Bench: Benchmark Design and Evaluation Metrics
MultiVis-Agent is evaluated using MultiVis-Bench, a comprehensive benchmark with 1,202 curated cases spanning the four scenario types:
| Scenario | #Examples | Modalities | Databases | Chart Types |
|---|---|---|---|---|
| BG | 306 | Text/Database | 141 | 127 |
| IRG | 109 | Text/DB/Image | | |
| CRG | 233 | Text/DB/Code | | |
| IR | 554 | Text/DB/Existing Vis. Code | | |
Evaluation uses a dual-layer metric:
- Structural Score: a weighted average over six dimensions: Chart Type, Data Mapping, Encoding, Interaction, Configuration, Transformation.
- Perceptual Score: a weighted average over six perceptual axes: Appropriateness, Layout, Text, Representation, Styling, Clarity.
- Combined Visualization Score: a weighted combination of the structural and perceptual scores.
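The dual-layer scoring can be sketched as nested weighted averages. The per-dimension weights and the blending coefficient `alpha` are illustrative placeholders, not the benchmark's published values.

```python
# Dimension names from the benchmark; weights below are assumed placeholders.
STRUCT_DIMS = ["chart_type", "data_mapping", "encoding",
               "interaction", "configuration", "transformation"]
PERCEPT_DIMS = ["appropriateness", "layout", "text",
                "representation", "styling", "clarity"]

def weighted_score(scores, weights):
    """Weighted average of per-dimension scores in [0, 1]."""
    assert set(scores) == set(weights)
    total = sum(weights.values())
    return sum(scores[d] * weights[d] for d in scores) / total

def overall_score(struct_scores, percept_scores, struct_w, percept_w, alpha=0.5):
    """Blend the structural and perceptual layers; alpha is a placeholder."""
    s_struct = weighted_score(struct_scores, struct_w)
    s_percept = weighted_score(percept_scores, percept_w)
    return alpha * s_struct + (1 - alpha) * s_percept
```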
5. Empirical Results and Reliability Impact
Extensive experiments validate the efficacy of MultiVis-Agent with logic-rule enhancement on MultiVis-Bench. Empirical highlights:
| Method | Overall Score (%) | Success Rate (%) | Exec. Rate (%) |
|---|---|---|---|
| Instructing LLM | 57–63 | 85–89 | 63–74 |
| LLM Workflow | 62–65 | 88–93 | 65–83 |
| MultiVis-Agent (no rules) | 37–71 | 78–92 | 63–83 |
| MultiVis-Agent (full) | 73–76 | 98–100 | 94–97 |
For Image-Referenced Generation (IRG), the full MultiVis-Agent clearly outperforms both the instructing-LLM and workflow baselines on overall score. Across scenarios, enabling the logic rules lifts code execution rates to 94–97% (from 63–83% without rules) and task success rates to 98–100% (from 78–92%). The integration of logic rules is therefore determinative for system reliability.
6. Error Handling, Safety, and System Robustness
Logic rules systematically transform LLM brittleness and error-prone behavior into provably constrained, recoverable operation:
- Parameter and Execution Safety: All tool calls are bounded, non-conforming values are clipped or flagged, and code is preprocessed for environmental safety and rollback capability.
- Systematic Error Recovery: Malformed SQL/code, failed visualizations, or invalid references are parsed, classified, and corrected within a bounded number of steps, per explicit error-handling rules.
- Infinite-Loop Immunity: Hard caps on iterative refinement and TAO loop length, enforced by RC-Rule 1, guarantee that the system cannot devolve into unbounded or catastrophic failure cycles.
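EH-style error classification with a bounded retry budget can be sketched as follows. The error categories, recovery-action names, and the retry bound are illustrative assumptions.

```python
MAX_RECOVERY_STEPS = 3  # bounded recovery budget (value assumed)

# Error class -> recovery strategy mapping (illustrative names).
RECOVERY_ACTIONS = {
    "malformed_sql": "regenerate_query",
    "code_error": "repair_code",
    "invalid_reference": "request_new_reference",
}

def recover(error_type, attempt_fn):
    """Classify the error, then retry its mapped recovery action at most
    MAX_RECOVERY_STEPS times; always terminates with an explicit status.

    attempt_fn(action, step) returns True when the recovery succeeded.
    """
    action = RECOVERY_ACTIONS.get(error_type, "abort")
    if action == "abort":
        return "failed", 0  # unknown error class: fail fast, no retries
    for step in range(1, MAX_RECOVERY_STEPS + 1):
        if attempt_fn(action, step):
            return "recovered", step
    return "failed", MAX_RECOVERY_STEPS  # budget exhausted, bounded by design
```

The explicit classification table plus the fixed retry bound is what turns open-ended LLM failure modes into the bounded, auditable recovery behavior described above.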
A plausible implication is that such a hybrid approach—combining flexible LLM reasoning with robust formal logic—can serve as a blueprint for production-grade, generalizable, and auditable multi-modal AI automation.
7. Position within the Broader Multi-Agent and Visual Intelligence Landscape
MultiVis-Agent distinguishes itself from other agentic and visual reasoning systems—including those in Orion (Reddy et al., 18 Nov 2025), OrchVis (Zhou, 28 Oct 2025), and recent hybrid, role-based pipelines (Wolter et al., 30 Aug 2025, Gyarmati et al., 6 Sep 2025)—by integrating a mathematically explicit logic rule system that controllably bounds agent behavior without negating the creative or cross-modal strengths of LLM workflows. While agent orchestration, workflow transparency, and error checking are present in other works, only MultiVis-Agent establishes formal reliability theorems (parameter safety, bounded recovery, guaranteed termination, system-level correctness) and achieves empirical dominance on a multi-modal visualization benchmark (Lu et al., 26 Jan 2026). This framework addresses both complexity and reliability, charting a significant advance in trustworthy, scenario-rich visual analytics automation.