ArtifactsBench: Interactive Artifact Benchmark

Updated 3 March 2026
  • ArtifactsBench is a comprehensive benchmark suite that evaluates the visual, interactive, and dynamic properties of generated artifacts across nine real-world domains.
  • It employs a fully automated, multimodal evaluation framework using programmatic rendering, temporal screenshot capture, and MLLM-guided checklists for fine-grained scoring.
  • ArtifactsBench’s open-source release and robust methodology enable reliable assessment of both open- and closed-source LLMs, driving advancements in user-centric AI research.

ArtifactsBench refers to a suite of open benchmarks designed to facilitate the evaluation of algorithms, models, or systems with respect to their handling, generation, or detection of artifacts—defined variably across domains as user-facing interactive content, manipulated objects, software engineering research artifacts, or visual distortions in media. The most prominent instantiation of "ArtifactsBench" is described in "ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation" (Zhang et al., 7 Jul 2025), which establishes the first large-scale, multimodal, automated benchmark for visual-interactive artifact generation. Additional related benchmarks—such as Artisan-Bench (Baek et al., 10 Feb 2026), ArtiBench (Wu et al., 25 Nov 2025), and BVI-Artefact (Feng et al., 2023)—extend the concept to software engineering reproducibility, robotic object manipulation, and no-reference video artifact detection, respectively.

1. Motivation: The Visual–Interactive Gap in Evaluation

Typical LLM code-generation benchmarks, such as HumanEval and SWE-Bench, test algorithmic correctness well but have a critical limitation: they do not assess the dynamic, multimodal properties that determine the quality of the user experience with generated artifacts. In real-world front-end contexts (web widgets, dashboards, simulations), user-perceived quality depends on layout fidelity, responsive state transitions, interactive integrity, and aesthetic coherence. Methods that rely on DOM-tree comparisons or pixel-level similarity either ignore these properties or approximate them only coarsely.

This discrepancy, termed the "visual–interactive gap," produces situations where unit-test-passing code yields inferior visual products, rendering prior benchmarks insufficient for guiding user-centric model development (Zhang et al., 7 Jul 2025).

2. Benchmark Construction and Task Suite

ArtifactsBench (Zhang et al., 7 Jul 2025) addresses the visual–interactive gap through a stratified suite of 1,825 tasks spanning nine real-world domains: Game Development, SVG Generation, Web Applications, Simulations, Data Science Dashboards, Management Systems, Multimedia Editing, Quick Tools, and a miscellaneous Others category. The tasks are distributed across difficulty levels (30% Easy, 40% Medium, 30% Hard) and constructed through an eight-stage pipeline, summarized here in six steps:

  1. Raw Extraction and Filtering: Collect artifacts from expert-curated showcases, open datasets (e.g., Svgen-500k, Instruct-SVG), tutorials, and result amplification via visual-to-query LLM pipelines. Filter for non-duplicates, visual content, and open licensing.
  2. Manual and LLM Rewriting: Domain experts and LLMs (e.g., GPT-4o) co-refine prompts for clarity, completeness, and stylistic variation.
  3. Classification and Difficulty Labeling: LLM heuristics, refined by human verification, assign domain and difficulty; underspecified or trivial tasks are culled.
  4. Sample Annotation & Checklist Generation: Fine-grained per-task checklists are manually authored for ∼10% of tasks, then extended via LLM synthesis and quality controlled.
  5. Model Generation & Task Validation: Multiple baseline LLMs generate artifacts; ambiguous or failed tasks are iteratively revised.
  6. Final QA and Consolidation: Experts review all components for coherence, difficulty balance, and coverage.

This pipeline enforces both diversity and reproducibility, producing a suite capable of reliably assessing state-of-the-art LLMs in visually interactive code generation.
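
The output of such a pipeline can be pictured as a collection of labeled task records. The following is a minimal sketch, not the released data format: the `Task` class, its field names, and the `difficulty_shares` helper are hypothetical, and only the domain names and difficulty labels come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field

# Domain and difficulty labels as listed in the text above.
DOMAINS = {
    "Game Development", "SVG Generation", "Web Applications", "Simulations",
    "Data Science Dashboards", "Management Systems", "Multimedia Editing",
    "Quick Tools", "Others",
}
DIFFICULTIES = ("Easy", "Medium", "Hard")


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative assumptions."""
    task_id: str
    prompt: str
    domain: str
    difficulty: str
    checklist: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject labels outside the nine domains / three difficulty levels.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")


def difficulty_shares(tasks: list[Task]) -> dict[str, float]:
    """Return the fraction of tasks at each difficulty level."""
    counts = Counter(t.difficulty for t in tasks)
    total = len(tasks)
    if total == 0:
        return {d: 0.0 for d in DIFFICULTIES}
    return {d: counts[d] / total for d in DIFFICULTIES}
```

A helper like `difficulty_shares` makes it easy to check that a curated subset stays close to the stated 30/40/30 split.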

3. Automated Multimodal Evaluation Architecture

ArtifactsBench’s distinguishing methodological contribution is its fully automated, multimodal evaluation framework. The pipeline consists of:

  • Programmatic Rendering: The generated code is executed in a sandboxed browser (Playwright). Three screenshots are captured at fixed intervals, aligned to key dynamic states: pre-interaction, mid-animation, and post-event. This captures both static and temporal behavioral traces.
  • Multimodal LLM-as-Judge: The visual evidence, source code, task description, and itemized checklist are supplied to a Multimodal LLM (MLLM) referee for evaluation—open-source (Qwen2.5-VL-72B) for reproducibility, and proprietary (Gemini-2.5-pro-0506) as the reference.
  • Fine-Grained Checklist Scoring: Each task $j$ is annotated with a $D$-dimensional checklist $c_j = (c_{j,1}, \ldots, c_{j,D})$ covering precise criteria. For model $m$, the MLLM produces subscores $s_{m,j,d} \in [0, 10]$ and the overall artifact score:

$$S_{m,j} = \sum_{d=1}^{D} w_d \, s_{m,j,d}$$

where $w_d$ are weights, equal by default. This enables diagnosis of model performance along axes such as layout, logic, interactivity, and aesthetics.
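
The weighted checklist aggregation can be sketched in a few lines. This is an illustrative sketch rather than the benchmark's scoring code: the function name `artifact_score` is hypothetical, and the choice of equal weights that sum to 1 (so the result stays in [0, 10]) is an assumption, since the text only says the weights are equal by default.

```python
from typing import Optional, Sequence


def artifact_score(
    subscores: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-checklist-item subscores, each in [0, 10].

    Assumption: when no weights are given, every item gets weight 1/D,
    so the aggregate is the mean of the subscores.
    """
    if not subscores:
        raise ValueError("at least one subscore is required")
    if any(not 0 <= s <= 10 for s in subscores):
        raise ValueError("subscores must lie in [0, 10]")
    if weights is None:
        weights = [1.0 / len(subscores)] * len(subscores)
    if len(weights) != len(subscores):
        raise ValueError("weights and subscores must have the same length")
    return sum(w * s for w, s in zip(weights, subscores))
```

Passing explicit `weights` lets a study emphasize one axis, for example interactivity, without changing the judge's per-item subscores.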

4. Validation, Metrics, and Human Alignment

Benchmarks are validated on two primary axes:

  • Pairwise Agreement $P$:

$$P_{a,b} = \frac{1}{\binom{M}{2}} \sum_{i<j} \mathbf{1}\left[ \operatorname{sign}(S_{a,i} - S_{a,j}) = \operatorname{sign}(S_{b,i} - S_{b,j}) \right]$$

Assessed on 280 tasks across six models, top MLLMs achieve $P_{\text{human},\text{MLLM}} \approx 90.95\%$ (Gemini-2.5-pro), supporting the reliability of the automated judge.

  • Ranking Consistency (RC):

$$\mathrm{RC} = 1 - \frac{1}{Z} \sum_m \left\lvert \operatorname{rank}_{\text{ArtifactsBench}}(m) - \operatorname{rank}_{\text{WebDev}}(m) \right\rvert$$

ArtifactsBench achieves a ranking consistency of about 94.4% against the human-voted WebDev Arena leaderboard, surpassing earlier automated benchmarks such as WebBench.

These metrics establish that ArtifactsBench is aligned with human perception for both pairwise and aggregated model assessment.
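
Both validation metrics are simple to compute once scores and ranks are available. The sketch below is illustrative only: the function names are hypothetical, and because the text does not define the normalizer Z, the code assumes Z is the largest possible sum of absolute rank differences for n models, which is floor(n²/2), so that RC falls in [0, 1].

```python
from itertools import combinations
from math import comb
from typing import Mapping, Sequence


def _sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_agreement(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of item pairs (i < j) that two raters order the same way."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equal-length score lists with >= 2 items")
    agree = sum(
        _sign(scores_a[i] - scores_a[j]) == _sign(scores_b[i] - scores_b[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / comb(n, 2)


def ranking_consistency(rank_a: Mapping[str, int], rank_b: Mapping[str, int]) -> float:
    """1 minus the normalized sum of absolute rank differences per model.

    Assumption: Z = floor(n^2 / 2), the maximum total displacement between
    two rankings of n models (reached when one ranking reverses the other).
    """
    if set(rank_a) != set(rank_b) or len(rank_a) < 2:
        raise ValueError("both rankings must cover the same >= 2 models")
    n = len(rank_a)
    z = (n * n) // 2
    total = sum(abs(rank_a[m] - rank_b[m]) for m in rank_a)
    return 1 - total / z
```

Under this choice of Z, identical rankings give RC = 1 and fully reversed rankings give RC = 0; a different normalizer would rescale the values without changing their ordering.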

5. Experimental Results and Model Insights

ArtifactsBench enables the evaluation of both open- and closed-source LLMs at meaningful scale:

  • Over 30 models assessed: 24 open-source (Qwen2.5/3, Hunyuan-A13B, Gemma3, DeepSeek, Seed-Coder) and 10 closed-source (Gemini, GPT-4o/4.1, Claude 3.7/4, etc.).
  • Findings:
    • Proprietary multimodal models (Gemini-2.5-pro, Claude 4 Sonnet) vastly outperform others (∼57/60 points).
    • Generalist, instruction-tuned models (Qwen2.5-Instruct) surpass domain-specific variants (Qwen2.5-Coder, Qwen2.5-VL), emphasizing the benefit of integrated vision-language-code training.
    • The hardest categories ("Intensive Interactive" tasks, Management Systems) remain unsolved (<50 points for all models), highlighting key frontiers for future research.

Table: Summary Results for Selected Models (excerpted from Zhang et al., 7 Jul 2025)

Model                   | Mean Score (S̄ₘ)   | Notes
Gemini-2.5-pro          | ~57/60            | Proprietary MLLM, top aligned
Claude 4 Sonnet         | ~57/60            | Proprietary, strong V+L
Qwen2.5-Instruct        | > domain-specific | Outperforms Qwen2.5-Coder/VL
All models on hardest   | <50               | Intensive Interactive, MgmtSys

6. Open-Source Release, Usage Modes, and Community Adoption

ArtifactsBench is released at https://artifactsbenchmark.github.io/. It is designed for rapid research integration and ongoing community benchmarking, consisting of:

  • The full 1,825-task suite, scripts for Playwright-based rendering and temporal capture, all MLLM scoring prompts, and baseline outputs.
  • A provided Docker container for reproducible artifact execution.
  • API scripts for both open-source (Qwen2.5-VL-72B) and reference (Gemini) MLLMs.
  • Adaptable checklists and rubrics for extension to new artifact classes.
  • Published S_{m,j} vectors and model rankings as baselines for comparison.
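
To give a sense of what Playwright-based rendering with temporal capture involves, here is a minimal sketch using Playwright's synchronous Python API. It is not the released script: the function names, the three-shot default, and the one-second interval are assumptions, and it only waits between captures rather than triggering the interactions a real harness would perform. Playwright is imported lazily so the timing helper works without it installed.

```python
from typing import List


def capture_offsets_ms(n_shots: int = 3, interval_ms: int = 1000) -> List[int]:
    """Times (ms after page load) at which screenshots are taken.

    Assumption: evenly spaced captures starting at load time.
    """
    if n_shots < 1 or interval_ms < 0:
        raise ValueError("n_shots must be >= 1 and interval_ms >= 0")
    return [i * interval_ms for i in range(n_shots)]


def capture_screenshots(html: str, n_shots: int = 3, interval_ms: int = 1000) -> List[bytes]:
    """Render HTML in headless Chromium and return PNG bytes per capture."""
    # Third-party dependency: `pip install playwright`, then `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    shots: List[bytes] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html)
        elapsed = 0
        for offset in capture_offsets_ms(n_shots, interval_ms):
            # Wait only for the time remaining until the next scheduled capture.
            page.wait_for_timeout(offset - elapsed)
            elapsed = offset
            shots.append(page.screenshot())
        browser.close()
    return shots
```

The returned byte strings can be written to disk or passed, together with the source code and checklist, to whichever MLLM judge is in use.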

7. Contributions, Limitations, and Broader Scope

Key contributions of ArtifactsBench (Zhang et al., 7 Jul 2025):

  • Establishes the first large-scale benchmark focusing on dynamic, interactive visual artifacts.
  • Automates artifact evaluation at human-level alignment through a novel combination of programmatic rendering, temporal sampling, and MLLM-guided checklists.
  • Provides fine-grained scoring and diagnostic capability, enabling identification of failure modes along user-relevant axes.

Limitations and open challenges:

  • Discrete screenshot sampling may miss long-horizon or deeply stateful interactions. Integration of DOM-level exploration or video-based assessment is a prospective advance.
  • Single-turn, non-agentic generation is tested; iterative or auto-debugging workflows remain out of scope.
  • Emergent artifact classes (e.g., 3D, VR) will require rubric and judge extension.

Extensions and Related Benchmarks:

The concept of an ArtifactsBench extends into related domains:

  • Artisan-Bench for automated artifact evaluation in software engineering, focusing on LLM agent–generated reproduction scripts and two-tier outcome/method judging (Baek et al., 10 Feb 2026).
  • ArtiBench for generalizable articulated object manipulation in robotics, built around multi-level generalization tasks and hierarchical VLM-based control (Wu et al., 25 Nov 2025).
  • BVI-Artefact as an "ArtifactsBench" for no-reference detection of common visual video artifacts in professional streaming; features balanced, multi-type multi-label sequences for detector benchmarking (Feng et al., 2023).

8. Best Practices and Recommendations for ArtifactsBench-Style Research

  • Employ multimodal evidence (code + visual outputs + behavioral traces) as the basis for automated judgment.
  • Use hierarchical or itemized checklists to enable fine-grained diagnostic scoring.
  • Validate automated metrics against large-scale human preference data and pairwise agreement.
  • Release artifact datasets, evaluation scripts, and baseline outputs openly for reproducibility and tracking progress.
  • Target future efforts at richer interaction models, compositional and agentic tasks, and fine-grained spatiotemporal evaluation.

ArtifactsBench moves evaluation of generative systems toward comprehensive, scalable, and human-aligned assessment of user-perceived quality, with implications for how user-centric AI models are developed and benchmarked.
