ArtifactsBench: Interactive Artifact Benchmark

Updated 3 March 2026

ArtifactsBench is a comprehensive benchmark suite that evaluates the visual, interactive, and dynamic properties of generated artifacts across nine real-world domains.
It employs a fully automated, multimodal evaluation framework using programmatic rendering, temporal screenshot capture, and MLLM-guided checklists for fine-grained scoring.
ArtifactsBench’s open-source release and robust methodology enable reliable assessment of both open- and closed-source LLMs, driving advancements in user-centric AI research.

ArtifactsBench refers to a suite of open benchmarks designed to facilitate the evaluation of algorithms, models, or systems with respect to their handling, generation, or detection of artifacts—defined variably across domains as user-facing interactive content, manipulated objects, software engineering research artifacts, or visual distortions in media. The most prominent instantiation of "ArtifactsBench" is described in "ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation" (Zhang et al., 7 Jul 2025), which establishes the first large-scale, multimodal, automated benchmark for visual-interactive artifact generation. Additional related benchmarks—such as Artisan-Bench (Baek et al., 10 Feb 2026), ArtiBench (Wu et al., 25 Nov 2025), and BVI-Artefact (Feng et al., 2023)—extend the concept to software engineering reproducibility, robotic object manipulation, and no-reference video artifact detection, respectively.

1. Motivation: The Visual–Interactive Gap in Evaluation

Typical LLM code-generation benchmarks such as HumanEval or SWE-Bench test algorithmic correctness well but have a critical limitation: they do not assess the dynamic, multimodal properties that determine the user experience of generated artifacts. In real-world front-end contexts (web widgets, dashboards, simulations), user-perceived quality depends on layout fidelity, responsive state transitions, interactive integrity, and aesthetic coherence. Methods that rely on DOM-tree comparisons or pixel-level similarity either ignore these properties or approximate them only coarsely.

This discrepancy, termed the "visual–interactive gap," produces situations where unit-test-passing code yields inferior visual products, rendering prior benchmarks insufficient for guiding user-centric model development (Zhang et al., 7 Jul 2025).

2. Benchmark Construction and Task Suite

ArtifactsBench (Zhang et al., 7 Jul 2025) addresses the visual–interactive gap through a carefully stratified suite of 1,825 tasks spanning nine real-world domains: Game Development, SVG Generation, Web Applications, Simulations, Data Science Dashboards, Management Systems, Multimedia Editing, Quick Tools, and a miscellaneous Others category. The tasks are distributed across difficulty levels (30% Easy, 40% Medium, 30% Hard) and constructed through an eight-stage pipeline, grouped here into six steps:

Raw Extraction and Filtering: Collect artifacts from expert-curated showcases, open datasets (e.g., Svgen-500k, Instruct-SVG), tutorials, and result amplification via visual-to-query LLM pipelines. Filter for non-duplicates, visual content, and open licensing.
Manual and LLM Rewriting: Domain experts and LLMs (e.g., GPT-4o) co-refine prompts for clarity, completeness, and stylistic variation.
Classification and Difficulty Labeling: LLM heuristics, refined by human verification, assign domain and difficulty; underspecified or trivial tasks are culled.
Sample Annotation & Checklist Generation: Fine-grained per-task checklists are manually authored for ∼10% of tasks, then extended via LLM synthesis and quality controlled.
Model Generation & Task Validation: Multiple baseline LLMs generate artifacts; ambiguous or failed tasks are iteratively revised.
Final QA and Consolidation: Experts review all components for coherence, difficulty balance, and coverage.

This pipeline enforces both diversity and reproducibility, producing a suite capable of reliably assessing state-of-the-art LLMs in visually interactive code generation.

3. Automated Multimodal Evaluation Architecture

ArtifactsBench’s distinguishing methodological contribution is its fully automated, multimodal evaluation framework. The pipeline consists of:

Programmatic Rendering: The generated code is executed in a sandboxed browser (Playwright). Three screenshots are captured at fixed intervals, aligned to key dynamic states: pre-interaction, mid-animation, and post-event. This captures both static and temporal behavioral traces.
Multimodal LLM-as-Judge: The visual evidence, source code, task description, and itemized checklist are supplied to a Multimodal LLM (MLLM) referee for evaluation—open-source (Qwen2.5-VL-72B) for reproducibility, and proprietary (Gemini-2.5-pro-0506) as the reference.
Fine-Grained Checklist Scoring: Each task $j$ is annotated with a $D$-dimensional checklist $c_j=(c_{j,1},\dots,c_{j,D})$ covering precise criteria. For each model $m$, the MLLM produces subscores $s_{m,j,d}\in[0,10]$ and the overall artifact score:

$S_{m,j} = \sum_{d=1}^D w_d s_{m,j,d}$

where the weights $w_d$ are equal by default. This enables diagnosis of model performance along axes such as layout, logic, interactivity, and aesthetics.
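The weighted sum above can be illustrated with a minimal Python sketch. The function name `artifact_score` is hypothetical, and the reading of "equal by default" as $w_d = 1$ for every dimension is an assumption; the source does not state whether the weights are normalized.

```python
from typing import Optional, Sequence


def artifact_score(
    subscores: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Return S = sum_d w_d * s_d over the checklist dimensions of one task."""
    if not subscores:
        raise ValueError("at least one checklist subscore is required")
    if any(not 0.0 <= s <= 10.0 for s in subscores):
        raise ValueError("each subscore must lie in [0, 10]")
    if weights is None:
        # Assumption: equal default weights are taken to be w_d = 1.
        weights = [1.0] * len(subscores)
    if len(weights) != len(subscores):
        raise ValueError("weights and subscores must have the same length")
    return sum(w * s for w, s in zip(weights, subscores))


if __name__ == "__main__":
    # Four illustrative checklist dimensions (made-up subscores).
    print(artifact_score([8, 6, 9, 7]))           # 30.0
    print(artifact_score([10, 0], [0.5, 0.5]))    # 5.0
```

Passing explicit weights lets a study emphasize one axis (for example interactivity) without changing the per-dimension subscores the judge produces.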

4. Validation, Metrics, and Human Alignment

The benchmark is validated on two primary axes:

Pairwise Agreement $P$:

$P_{a,b} = \frac{1}{\binom{M}{2}} \sum_{i<j} 1 \left[ \operatorname{sign}(S_{a,i} - S_{a,j}) = \operatorname{sign}(S_{b,i} - S_{b,j}) \right]$

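Here $a$ and $b$ are the two judges being compared, $S_{a,i}$ is the score judge $a$ assigns to item $i$, and the sum runs over all $\binom{M}{2}$ pairs of the $M$ scored items. Assessed on 280 tasks across six models, the best MLLM judge (Gemini-2.5-pro) reaches $P_{\mathrm{human},\mathrm{MLLM}}\approx 90.95\%$, supporting the reliability of the automated judge.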
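As a rough illustration of the pairwise agreement formula, the sketch below counts the fraction of item pairs that two judges order the same way. The function names and the example scores are illustrative only, and ties are handled exactly as the sign comparison in the formula implies (a tie agrees only with a tie).

```python
from itertools import combinations
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_agreement(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of item pairs (i < j) that both judges order identically."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both judges must score the same items")
    pairs = list(combinations(range(len(scores_a)), 2))
    if not pairs:
        raise ValueError("at least two items are required")
    agreements = sum(
        1
        for i, j in pairs
        if sign(scores_a[i] - scores_a[j]) == sign(scores_b[i] - scores_b[j])
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    human = [9, 7, 5, 3]   # made-up human scores for four items
    judge = [8, 6, 7, 2]   # made-up automated-judge scores
    # The judges disagree only on the pair (items 1 and 2): 5 of 6 pairs agree.
    print(round(pairwise_agreement(human, judge), 3))  # 0.833
```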

Ranking Consistency (RC):

$\mathrm{RC} = 1 - \frac{1}{Z} \sum_m \lvert \operatorname{rank}_{ArtifactsBench}(m) - \operatorname{rank}_{WebDev}(m) \rvert$

Here $Z$ is a normalizing constant. ArtifactsBench achieves a markedly higher RC against the human-voted WebDev Arena leaderboard than earlier automated benchmarks such as WebBench.
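The ranking consistency formula can be sketched in a few lines of Python. The text does not define $Z$, so this sketch assumes it is the largest possible total rank displacement between two rankings of $n$ models, $\lfloor n^2/2 \rfloor$, which keeps RC in $[0, 1]$; the benchmark's actual normalization may differ. The model names are placeholders.

```python
from typing import Dict


def ranking_consistency(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """Return 1 - (sum of absolute rank differences) / Z for two rankings."""
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same models")
    n = len(rank_a)
    if n < 2:
        raise ValueError("at least two models are required")
    # Assumption: Z is the maximum total displacement, floor(n^2 / 2),
    # reached when one ranking is the exact reverse of the other.
    z = (n * n) // 2
    total = sum(abs(rank_a[m] - rank_b[m]) for m in rank_a)
    return 1.0 - total / z


if __name__ == "__main__":
    benchmark = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
    arena = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
    # Two adjacent models are swapped: total displacement 2, Z = 8.
    print(ranking_consistency(benchmark, arena))  # 0.75
```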

These metrics establish that ArtifactsBench is aligned with human perception for both pairwise and aggregated model assessment.

5. Experimental Results and Model Insights

ArtifactsBench enables the evaluation of both open- and closed-source LLMs at meaningful scale:

Over 30 models assessed: 24 open-source (Qwen2.5/3, Hunyuan-A13B, Gemma3, DeepSeek, Seed-Coder) and 10 closed-source (Gemini, GPT-4o/4.1, Claude 3.7/4, etc.).
Findings:
- Proprietary multimodal models (Gemini-2.5-pro, Claude 4 Sonnet) vastly outperform others (∼57/60 points).
- Generalist, instruction-tuned models (Qwen2.5-Instruct) surpass domain-specific variants (Qwen2.5-Coder, Qwen2.5-VL), emphasizing the benefit of integrated vision-language-code training.
- The hardest categories ("Intensive Interactive" tasks, Management Systems) remain unsolved (below 50 points for all models), highlighting key frontiers for future research.

Table: Summary Results for Selected Models (excerpted from Zhang et al., 7 Jul 2025)

Model	Mean Score (S̄ₘ)	Notes
Gemini-2.5-pro	~57/60	Proprietary MLLM; highest human agreement as judge
Claude 4 Sonnet	~57/60	Proprietary; strong vision-language performance
Qwen2.5-Instruct	Higher than domain-specific variants	Outperforms Qwen2.5-Coder and Qwen2.5-VL
All models, hardest categories	<50	Intensive Interactive tasks, Management Systems

6. Open-Source Release, Usage Modes, and Community Adoption

ArtifactsBench is released at https://artifactsbenchmark.github.io/. It is designed for rapid research integration and ongoing community benchmarking. The release includes:

The full 1,825-task suite, scripts for Playwright-based rendering and temporal capture, all MLLM scoring prompts, and baseline outputs.
A provided Docker container for reproducible artifact execution.
API scripts for both open-source (Qwen2.5-VL-72B) and reference (Gemini) MLLMs.
Adaptable checklists and rubrics for extension to new artifact classes.
Published $S_{m,j}$ vectors and model rankings as baselines for comparison.
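To give a sense of what Playwright-based temporal capture involves, here is a minimal sketch using Playwright's Python sync API. It is not the released script: the function names, the evenly spaced timing, and the 2000 ms default window are assumptions, and it takes screenshots on a timer only, without triggering the interactions that the benchmark's pre-interaction, mid-animation, and post-event states imply.

```python
from pathlib import Path
from typing import List


def capture_offsets_ms(duration_ms: int, shots: int = 3) -> List[int]:
    """Evenly spaced capture times from 0 to duration_ms, inclusive."""
    if shots < 2 or duration_ms <= 0:
        raise ValueError("need at least two shots and a positive duration")
    step = duration_ms / (shots - 1)
    return [round(k * step) for k in range(shots)]


def capture_screenshots(html: str, out_dir: str, duration_ms: int = 2000) -> List[Path]:
    """Render an HTML artifact headlessly and save timed screenshots."""
    # Imported lazily so the timing helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html)
        elapsed = 0
        for k, offset in enumerate(capture_offsets_ms(duration_ms)):
            # Wait only for the time remaining until the next capture point.
            page.wait_for_timeout(offset - elapsed)
            elapsed = offset
            path = out / f"shot_{k}.png"
            page.screenshot(path=str(path))
            paths.append(path)
        browser.close()
    return paths


if __name__ == "__main__":
    print(capture_offsets_ms(2000))  # [0, 1000, 2000]
```

Running `capture_screenshots` requires the `playwright` package and an installed Chromium build; the resulting images would then be passed to the MLLM judge together with the code, task description, and checklist.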

7. Contributions, Limitations, and Broader Scope

Key contributions of ArtifactsBench (Zhang et al., 7 Jul 2025):

Establishes the first large-scale benchmark focusing on dynamic, interactive visual artifacts.
Automates artifact evaluation with close alignment to human judgment through a novel combination of programmatic rendering, temporal sampling, and MLLM-guided checklists.
Provides fine-grained scoring and diagnostic capability, enabling identification of failure modes along user-relevant axes.

Limitations and open challenges:

Discrete screenshot sampling may miss long-horizon or deeply stateful interactions. Integrating DOM-level exploration or video-based assessment is a possible future improvement.
Only single-turn, non-agentic generation is tested; iterative or auto-debugging workflows remain out of scope.
Emergent artifact classes (e.g., 3D, VR) will require rubric and judge extension.

Extensions and Related Benchmarks:

The concept of an ArtifactsBench extends into related domains:

Artisan-Bench for automated artifact evaluation in software engineering, focusing on LLM agent–generated reproduction scripts and two-tier outcome/method judging (Baek et al., 10 Feb 2026).
ArtiBench for generalizable articulated object manipulation in robotics, built around multi-level generalization tasks and hierarchical VLM-based control (Wu et al., 25 Nov 2025).
BVI-Artefact as an "ArtifactsBench" for no-reference detection of common visual video artifacts in professional streaming; features balanced, multi-type multi-label sequences for detector benchmarking (Feng et al., 2023).

8. Best Practices and Recommendations for ArtifactsBench-Style Research

Employ multimodal evidence (code + visual outputs + behavioral traces) as the basis for automated judgment.
Use hierarchical or itemized checklists to enable fine-grained diagnostic scoring.
Validate automated metrics against large-scale human preference data and pairwise agreement.
Release artifact datasets, evaluation scripts, and baseline outputs openly for reproducibility and tracking progress.
Target future efforts at richer interaction models, compositional and agentic tasks, and fine-grained spatiotemporal evaluation.

ArtifactsBench marks a paradigm shift toward comprehensive, scalable, and human-aligned evaluation of user-perceived quality in generative systems, with broad implications for the development and benchmarking of user-centric AI models.


References

Zhang et al. (2025). ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation.

Baek et al. (2026). Artisan: Agentic Artifact Evaluation.

Wu et al. (2025). ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation.

Feng et al. (2023). BVI-Artefact: An Artefact Detection Benchmark Dataset for Streamed Videos.
