ArtifactsBench: Interactive Artifact Benchmark
- ArtifactsBench is a comprehensive benchmark suite that evaluates the visual, interactive, and dynamic properties of generated artifacts across nine real-world domains.
- It employs a fully automated, multimodal evaluation framework using programmatic rendering, temporal screenshot capture, and MLLM-guided checklists for fine-grained scoring.
- ArtifactsBench’s open-source release and robust methodology enable reliable assessment of both open- and closed-source LLMs, driving advancements in user-centric AI research.
ArtifactsBench refers to a suite of open benchmarks designed to facilitate the evaluation of algorithms, models, or systems with respect to their handling, generation, or detection of artifacts—defined variably across domains as user-facing interactive content, manipulated objects, software engineering research artifacts, or visual distortions in media. The most prominent instantiation of "ArtifactsBench" is described in "ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation" (Zhang et al., 7 Jul 2025), which establishes the first large-scale, multimodal, automated benchmark for visual-interactive artifact generation. Additional related benchmarks—such as Artisan-Bench (Baek et al., 10 Feb 2026), ArtiBench (Wu et al., 25 Nov 2025), and BVI-Artefact (Feng et al., 2023)—extend the concept to software engineering reproducibility, robotic object manipulation, and no-reference video artifact detection, respectively.
1. Motivation: The Visual–Interactive Gap in Evaluation
Typical LLM code-generation benchmarks, such as HumanEval or SWE-Bench, provide strong tests of algorithmic correctness but are critically limited: they do not assess the dynamic, multimodal properties that determine the user experience of generated artifacts. In real-world front-end contexts (web widgets, dashboards, simulations), user-perceived quality depends on layout fidelity, responsive state transitions, interactive integrity, and aesthetic coherence. Such properties are either ignored or only coarsely approximated by methods relying on DOM-tree comparisons or pixel-level similarity.
This discrepancy, termed the "visual–interactive gap," means that code passing its unit tests can still render a poor visual or interactive result, so prior benchmarks are insufficient for guiding user-centric model development (Zhang et al., 7 Jul 2025).
2. Benchmark Construction and Task Suite
ArtifactsBench (Zhang et al., 7 Jul 2025) addresses the visual–interactive gap through a carefully stratified suite of 1,825 tasks spanning nine real-world domains: Game Development, SVG Generation, Web Applications, Simulations, Data Science Dashboards, Management Systems, Multimedia Editing, Quick Tools, and miscellaneous Others. The tasks are distributed across difficulty levels (30% Easy, 40% Medium, 30% Hard) and constructed through an eight-stage pipeline, grouped here into six steps:
- Raw Extraction and Filtering: Collect artifacts from expert-curated showcases, open datasets (e.g., Svgen-500k, Instruct-SVG), tutorials, and result amplification via visual-to-query LLM pipelines. Filter for non-duplicates, visual content, and open licensing.
- Manual and LLM Rewriting: Domain experts and LLMs (e.g., GPT-4o) co-refine prompts for clarity, completeness, and stylistic variation.
- Classification and Difficulty Labeling: LLM heuristics, refined by human verification, assign domain and difficulty; underspecified or trivial tasks are culled.
- Sample Annotation & Checklist Generation: Fine-grained per-task checklists are manually authored for ∼10% of tasks, then extended via LLM synthesis and quality controlled.
- Model Generation & Task Validation: Multiple baseline LLMs generate artifacts; ambiguous or failed tasks are iteratively revised.
- Final QA and Consolidation: Experts review all components for coherence, difficulty balance, and coverage.
This pipeline enforces both diversity and reproducibility, producing a suite capable of reliably assessing state-of-the-art LLMs in visually interactive code generation.
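The filtering criteria named in the first pipeline step (non-duplicates, visual content, open licensing) can be illustrated with a minimal sketch. The `RawArtifact` record, its field names, and the `OPEN_LICENSES` allow-list are hypothetical and chosen only for illustration; they are not the benchmark's actual schema or tooling.

```python
from dataclasses import dataclass


# Hypothetical record layout; the field names are illustrative assumptions,
# not the benchmark's actual schema.
@dataclass(frozen=True)
class RawArtifact:
    query: str
    has_visual_output: bool
    license: str


# Assumed allow-list of open licenses, for illustration only.
OPEN_LICENSES = {"mit", "apache-2.0", "cc-by-4.0"}


def filter_candidates(items):
    """Keep the first copy of each query that is visual and openly licensed."""
    seen = set()
    kept = []
    for item in items:
        # Normalise case and whitespace so near-identical queries collide.
        key = " ".join(item.query.lower().split())
        if key in seen:
            continue
        if not item.has_visual_output or item.license.lower() not in OPEN_LICENSES:
            continue
        seen.add(key)
        kept.append(item)
    return kept
```

Exact-match deduplication on a normalised query is the simplest possible choice; a real pipeline would more plausibly use fuzzy or embedding-based matching.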
3. Automated Multimodal Evaluation Architecture
ArtifactsBench’s distinguishing methodological contribution is its fully automated, multimodal evaluation framework. The pipeline consists of:
- Programmatic Rendering: The generated code is executed in a sandboxed browser (Playwright). Three screenshots are captured at fixed intervals, aligned to key dynamic states: pre-interaction, mid-animation, and post-event. This captures both static and temporal behavioral traces.
- Multimodal LLM-as-Judge: The visual evidence, source code, task description, and itemized checklist are supplied to a Multimodal LLM (MLLM) referee for evaluation—open-source (Qwen2.5-VL-72B) for reproducibility, and proprietary (Gemini-2.5-pro-0506) as the reference.
- Fine-Grained Checklist Scoring: Each task is annotated with a $K$-dimensional checklist $C = \{c_1, \dots, c_K\}$, covering precise criteria. For model $m$ on task $j$, the MLLM produces subscores $s_{m,j,k}$ and the overall artifact score:
$$S_{m,j} = \sum_{k=1}^{K} w_k \, s_{m,j,k}$$
where $w_k$ are default-equal weights. This enables diagnosis of model performance along axes such as layout, logic, interactivity, and aesthetics.
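The checklist aggregation can be sketched as a weighted sum of per-item subscores. This is a minimal sketch: the function name `artifact_score` is hypothetical, and using unit weights as the "default-equal" case (rather than weights normalised to sum to one) is an assumption the text does not settle.

```python
def artifact_score(subscores, weights=None):
    """Weighted sum of checklist subscores; weights default to 1 for every item."""
    if not subscores:
        raise ValueError("checklist must contain at least one item")
    if weights is None:
        # Assumed reading of "default-equal weights": every item counts once.
        weights = [1.0] * len(subscores)
    if len(weights) != len(subscores):
        raise ValueError("need exactly one weight per checklist item")
    return sum(w * s for w, s in zip(weights, subscores))


# Illustrative subscores for four checklist items (made-up values).
print(artifact_score([8, 6, 7, 9]))                # 30.0
print(artifact_score([8, 6, 7, 9], [2, 1, 1, 0]))  # 29
```

Keeping the subscores separate until this final step is what allows per-axis diagnosis, since each item can also be reported on its own.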
4. Validation, Metrics, and Human Alignment
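The benchmark is validated on two primary axes:
- Pairwise Agreement:
The share of pairwise comparisons on which the automated judge and human experts prefer the same artifact. Assessed on 280 tasks across six models, the top MLLM judge (Gemini-2.5-pro) exceeds 90% agreement with human experts, supporting the reliability of the automated judge.
- Ranking Consistency (RC):
The agreement between the model ranking produced by the benchmark and a human-derived ranking. ArtifactsBench achieves 94.4% ranking consistency against the human-voted WebDev Arena leaderboard, surpassing earlier automated benchmarks (e.g., WebBench).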
These metrics establish that ArtifactsBench is aligned with human perception for both pairwise and aggregated model assessment.
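Both validation metrics can be sketched in a few lines. The definitions used here are assumptions, since the text does not give the exact formulas: pairwise agreement is taken as the fraction of comparisons where judge and human pick the same winner, and ranking consistency as the fraction of model pairs that two score tables order the same way. All function names and sample values are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(judge_prefs, human_prefs):
    """Fraction of comparisons where judge and human pick the same winner."""
    if len(judge_prefs) != len(human_prefs) or not judge_prefs:
        raise ValueError("need two non-empty preference lists of equal length")
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)


def ranking_consistency(auto_scores, human_scores):
    """Fraction of model pairs ordered the same way by both score tables."""
    models = sorted(set(auto_scores) & set(human_scores))
    pairs = list(combinations(models, 2))
    if not pairs:
        raise ValueError("need at least two models present in both tables")
    # A pair is concordant when both score differences share the same sign.
    concordant = sum(
        (auto_scores[a] - auto_scores[b]) * (human_scores[a] - human_scores[b]) > 0
        for a, b in pairs
    )
    return concordant / len(pairs)


# Made-up data: the judge matches the human on three of four comparisons.
print(pairwise_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.75
```

Under this pair-counting definition, ties in either table count as discordant; a rank-correlation coefficient would be an alternative design.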
5. Experimental Results and Model Insights
ArtifactsBench enables the evaluation of both open- and closed-source LLMs at meaningful scale:
- Over 30 models assessed: 24 open-source (Qwen2.5/3, Hunyuan-A13B, Gemma3, DeepSeek, Seed-Coder) and 10 closed-source (Gemini, GPT-4o/4.1, Claude 3.7/4, etc.).
- Findings:
  - Proprietary multimodal models (Gemini-2.5-pro, Claude 4 Sonnet) vastly outperform others (∼57/60 points).
  - Generalist, instruction-tuned models (Qwen2.5-Instruct) surpass domain-specific variants (Qwen2.5-Coder, Qwen2.5-VL), emphasizing the benefit of integrated vision-language-code training.
  - The hardest categories ("Intensive Interactive" tasks, Management Systems) remain unsolved (below 50 points for all models), highlighting key frontiers for future research.
Table: Summary Results for Selected Models (excerpted from Zhang et al., 7 Jul 2025)
| Model | Mean Score (S̄ₘ) | Notes |
|---|---|---|
| Gemini-2.5-pro | ~57/60 | Proprietary MLLM, top aligned |
| Claude 4 Sonnet | ~57/60 | Proprietary, strong V+L |
| Qwen2.5-Instruct | > domain-specific | Outperforms Qwen2.5-Coder/VL |
| All models (hardest categories) | <50 | Intensive Interactive, Management Systems |
6. Open-Source Release, Usage Modes, and Community Adoption
ArtifactsBench is released at https://artifactsbenchmark.github.io/. It is designed for rapid research integration and ongoing community benchmarking, consisting of:
- The full 1,825-task suite, scripts for Playwright-based rendering and temporal capture, all MLLM scoring prompts, and baseline outputs.
- A provided Docker container for reproducible artifact execution.
- API scripts for both open-source (Qwen2.5-VL-72B) and reference (Gemini) MLLMs.
- Adaptable checklists and rubrics for extension to new artifact classes.
- Published S_{m,j} vectors and model rankings as baselines for comparison.
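The Playwright-based rendering and temporal capture mentioned above might look roughly like the following sketch, which uses Playwright's Python sync API. It is not the released script: the function names, the one-second default interval, and the absence of any scripted user interaction are all assumptions made for illustration.

```python
def capture_offsets_ms(interval_ms, shots=3):
    """Return the delay before each screenshot, measured from page load."""
    if interval_ms <= 0 or shots <= 0:
        raise ValueError("interval_ms and shots must be positive")
    return [interval_ms * i for i in range(1, shots + 1)]


def capture_screenshots(html, interval_ms=1000, shots=3):
    """Render `html` headlessly and return one PNG (bytes) per capture point."""
    # Imported lazily so the timing helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    frames = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.set_content(html)
            elapsed = 0
            for offset in capture_offsets_ms(interval_ms, shots):
                # Wait out the remaining time so animations can advance.
                page.wait_for_timeout(offset - elapsed)
                frames.append(page.screenshot())
                elapsed = offset
        finally:
            browser.close()
    return frames
```

Calling `capture_screenshots("<h1>demo</h1>")` requires Playwright and a Chromium build to be installed. The offsets ignore the time the screenshot itself takes, so the real capture times drift slightly later than the nominal schedule.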
7. Contributions, Limitations, and Broader Scope
Key contributions of ArtifactsBench (Zhang et al., 7 Jul 2025):
- Establishes the first large-scale benchmark focusing on dynamic, interactive visual artifacts.
- Automates artifact evaluation at human-level alignment through a novel combination of programmatic rendering, temporal sampling, and MLLM-guided checklists.
- Provides fine-grained scoring and diagnostic capability, enabling identification of failure modes along user-relevant axes.
Limitations and open challenges:
- Discrete screenshot sampling may miss long-horizon or deeply stateful interactions. Integration of DOM-level exploration or video-based assessment is a prospective advance.
- Single-turn, non-agentic generation is tested; iterative or auto-debugging workflows remain out of scope.
- Emergent artifact classes (e.g., 3D, VR) will require rubric and judge extension.
Extensions and Related Benchmarks:
The concept of an ArtifactsBench extends into related domains:
- Artisan-Bench for automated artifact evaluation in software engineering, focusing on LLM agent–generated reproduction scripts and two-tier outcome/method judging (Baek et al., 10 Feb 2026).
- ArtiBench for generalizable articulated object manipulation in robotics, built around multi-level generalization tasks and hierarchical VLM-based control (Wu et al., 25 Nov 2025).
- BVI-Artefact as an "ArtifactsBench" for no-reference detection of common visual video artifacts in professional streaming; features balanced, multi-type multi-label sequences for detector benchmarking (Feng et al., 2023).
8. Best Practices and Recommendations for ArtifactsBench-Style Research
- Employ multimodal evidence (code + visual outputs + behavioral traces) as the basis for automated judgment.
- Use hierarchical or itemized checklists to enable fine-grained diagnostic scoring.
- Validate automated metrics against large-scale human preference data and pairwise agreement.
- Release artifact datasets, evaluation scripts, and baseline outputs openly for reproducibility and tracking progress.
- Target future efforts at richer interaction models, compositional and agentic tasks, and fine-grained spatiotemporal evaluation.
ArtifactsBench marks a paradigm shift toward comprehensive, scalable, and human-aligned evaluation of user-perceived quality in generative systems, with broad implications for the development and benchmarking of user-centric AI models.