Design2Code Benchmark: UI-to-Code Evaluation
- Design2Code Benchmark is a rigorous evaluation paradigm for converting visual UI designs into executable front-end code with multimodal large language models.
- It standardizes testing through structured datasets, test-only annotations, and metrics like CLIP score, CW-SSIM, and TreeBLEU to assess visual and interactive fidelity.
- The benchmark supports advanced tasks such as hierarchical UI understanding, dynamic code repair, and chart synthesis to drive improvements in automated front-end engineering.
The Design2Code benchmark defines a rigorous evaluation paradigm for automated conversion of visual UI designs into front-end source code, serving as a cornerstone for research in multimodal LLMs (MLLMs) for front-end engineering. It encompasses test-only datasets, structured task formulations, and advanced evaluation metrics targeting semantic, structural, and visual fidelity. Major lines of research build on the initial Design2Code protocol, extending its scope to hierarchical UI understanding, dynamic code repair, chart synthesis, and visual-interactive artifact assessment.
1. Problem Definition and Benchmark Scope
Design2Code formalizes the mapping from a design specification—which may include screenshot images, wireframes, or design metadata—to executable code (e.g., HTML/CSS/JS or component frameworks) such that the rendered code replicates the visual state and interactions of the original design. The benchmark quantifies model performance via a composite objective combining a visual-fidelity term, which captures layout and appearance, with an interaction term, which encodes compliance for dynamic elements (Zhang et al., 7 Jul 2025).
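A minimal sketch of such a composite objective, assuming an illustrative linear blend (the weight `alpha` and the sub-score definitions are assumptions for exposition, not the published formulation):

```python
def combined_score(visual_score: float, interactive_score: float,
                   alpha: float = 0.5) -> float:
    """Blend visual/layout fidelity with interactive compliance.

    visual_score      -- e.g. a CLIP- or SSIM-style similarity in [0, 1]
    interactive_score -- e.g. fraction of interaction checks passed, in [0, 1]
    alpha             -- illustrative trade-off weight (assumption)
    """
    return alpha * visual_score + (1 - alpha) * interactive_score
```

In practice the two terms come from separate pipelines (rendered-image comparison vs. event-replay checks), and the weighting is benchmark-specific.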
Multiple instantiations of the benchmark exist: the classic Design2Code (484 webpages) (Si et al., 2024), mobile UI counterparts (Chen et al., 16 Jun 2025), and chart generation generalizations (Tang et al., 20 Oct 2025). Research extensions, such as ArtifactsBench, further expand the scope to cover dynamic behaviors, animation fidelity, and accessibility diagnostics (Zhang et al., 7 Jul 2025).
2. Dataset Construction and Annotation
The canonical Design2Code dataset consists of 484 real-world webpage-image–HTML pairs, manually curated from a large pool (C4 validation set) by (i) automatic and manual filtering for stand-alone, well-formatted pages and (ii) removal of external dependencies, scripts, and sensitive data. Statistics highlight its complexity: median code length 31,216 tokens, DOM tree depths 4–32 (median 13), and 84 HTML5 tag types (Si et al., 2024). Annotation is strictly test-only: the dataset is held out from model training and serves as an out-of-distribution challenge set (Liang et al., 2024).
Additional benchmarks adapt the protocol to mobile UIs (300 Figma-based instances in five domains, with rich component trees and design metadata) (Chen et al., 16 Jun 2025), or chart2code tasks (2,023 tasks, three levels from chart reproduction to data-driven chart synthesis) (Tang et al., 20 Oct 2025).
With the proliferation of new datasets (e.g., WebSight, FullFront), the coverage extends to interaction authoring, code refinement, and user-driven design conceptualization, introducing multi-phase pipelines for clean and copyright-compliant ground-truth code (Sun et al., 23 May 2025).
3. Task Formulation and Model Interfaces
The Design2Code task is cast as a multimodal generation problem:
- Input: a rendered webpage image, mobile UI mockup, or chart/figure screenshot.
- Output: a tokenized, serialized code sequence (HTML, JSX, Python plotting code) which, when rendered, faithfully reproduces the input design.
Variants stress:
- Hierarchical structure prediction (via DOM or component trees).
- Chain-of-thought grouping (mobile UIs: division, semantic extraction, grouping) (Chen et al., 16 Jun 2025).
- Divide-and-conquer code synthesis and vision-guided repair loops.
- End-to-end interactive artifact generation, including temporal behavior captures for animated/dynamic UIs (Zhang et al., 7 Jul 2025).
In advanced benchmarks, model interfaces are extended to accept natural language design intents, layer/metadata lists, and explicit interaction specifications, facilitating compound tasks such as multi-step chart editing or front-end workflow emulation (Tang et al., 20 Oct 2025, Sun et al., 23 May 2025).
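The extended interface can be pictured as a structured request carrying the image plus optional intents, metadata, and interaction specifications. The field names and prompt wording below are illustrative assumptions, not any benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Design2CodeRequest:
    """Illustrative input bundle for an extended Design2Code-style task."""
    screenshot_path: str                                       # rendered design image
    design_intent: str = ""                                    # natural-language intent
    layer_metadata: list = field(default_factory=list)         # component/layer tree
    interaction_spec: list = field(default_factory=list)       # e.g. "button opens modal"

def build_prompt(req: Design2CodeRequest) -> str:
    """Assemble a text prompt from the structured request (image attached separately)."""
    parts = ["Generate standalone HTML/CSS that reproduces the attached screenshot."]
    if req.design_intent:
        parts.append(f"Design intent: {req.design_intent}")
    if req.interaction_spec:
        parts.append("Required interactions: " + "; ".join(req.interaction_spec))
    return "\n".join(parts)
```

Compound tasks (multi-step chart editing, workflow emulation) would issue several such requests, threading the previous output back in as context.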
4. Evaluation Metrics
The evaluation framework spans multiple axes:
Visual and Structural Metrics
- CLIP Score: Cosine similarity between CLIP-ViT-B/32 embeddings of renders (Si et al., 2024, Liang et al., 2024).
- CW-SSIM & SSIM: Structural similarity indices (complex-wavelet or luminance-contrast-structure) (Liang et al., 2024, Chen et al., 16 Jun 2025).
- Block-Match: Size-weighted recall/precision of matched text block pairs based on bounding box overlap (Si et al., 2024).
- TreeBLEU, htmlBLEU: Tree- and token-level code structure overlaps.
- Low-Level Element Matching (LLEM): Averaged recall across existence, text, position, color for text blocks (category-wise) (Liang et al., 2024).
- Execution Rate, Exact Match (Chart2Code): Ratio of code outputs that execute/syntactically match ground truth (Tang et al., 20 Oct 2025).
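To make the flavor of these geometric metrics concrete, here is a sketch of a Block-Match-style size-weighted recall over text-block bounding boxes. The IoU threshold and area weighting are illustrative assumptions, not the exact published definition:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def block_match_recall(ref_blocks, gen_blocks, thresh=0.5):
    """Size-weighted recall: a reference block counts as matched if some
    generated block overlaps it with IoU >= thresh; blocks are weighted
    by their pixel area (illustrative assumption)."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    total = sum(area(r) for r in ref_blocks)
    matched = sum(area(r) for r in ref_blocks
                  if any(iou(r, g) >= thresh for g in gen_blocks))
    return matched / total if total else 0.0
```

The actual metric also matches text content and reports precision; this sketch covers only the spatial-overlap core.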
Functional and Interactive Metrics
- Checklist Scoring (ArtifactsBench): A multimodal LLM-as-judge assigns scores of 0–10 per axis, including functional correctness, robustness, animation, accessibility, etc.; the final score is a weighted sum over checklist items (Zhang et al., 7 Jul 2025).
- Dynamic behavior compliance: For each triggered interaction event, the post-event DOM and screenshot states are logged and compared (Zhang et al., 7 Jul 2025).
- User studies: Human expert ratings of readability, maintainability, and modification efficiency (Chen et al., 16 Jun 2025).
The evaluation pipeline incorporates human-in-the-loop pairwise ranking and direct assessment (Si et al., 2024), with metric validation via agreement rates (e.g., PairACC vs. expert ranking in ArtifactsBench).
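The checklist-style aggregation can be sketched as a weighted sum of per-axis judge scores. The axis names and weights below are illustrative assumptions, not ArtifactsBench's published rubric:

```python
# Illustrative axis weights (assumption, not the published rubric).
AXIS_WEIGHTS = {
    "functional_correctness": 0.4,
    "robustness": 0.2,
    "animation": 0.2,
    "accessibility": 0.2,
}

def checklist_score(judge_scores: dict) -> float:
    """Weighted sum of per-axis 0-10 judge scores, normalized by total weight
    so the result stays on the 0-10 scale; missing axes score 0."""
    total_w = sum(AXIS_WEIGHTS.values())
    return sum(AXIS_WEIGHTS[a] * judge_scores.get(a, 0.0)
               for a in AXIS_WEIGHTS) / total_w
```

In the real pipeline, `judge_scores` would be produced by an LMM referee inspecting renders and event logs, then validated against human rankings.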
5. Baseline Methods, Model Variants, and Performance Trends
Design2Code benchmarks analyze both prompting-based and fine-tuned multimodal models:
- Zero-shot prompting: e.g., Gemini, GPT-4o, LLaVA, CogAgent (direct and text-augmented), and WebSight VLMs.
- Fine-tuning: Structure-aware finetuning (e.g., WAFFLE, GCN graph-encoder finetuning (Vu et al., 25 Apr 2025)) and framework-specific targets (React Native in DesignCoder).
- Self-correction and repair: Vision-guided code mutation after initial rendering (Chen et al., 16 Jun 2025).
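The self-correction pattern can be sketched as a loop that renders the candidate code, scores it against the target, and asks the model to patch it until the visual score plateaus. Here `render`, `visual_score`, and `ask_model` are placeholder callables (assumptions), not any specific system's API:

```python
def repair_loop(target_img, code, render, visual_score, ask_model,
                max_rounds=3, min_gain=0.01):
    """Vision-guided repair: keep the best-scoring candidate so far and stop
    when a patch fails to improve the visual score by at least min_gain."""
    best, best_score = code, visual_score(render(code), target_img)
    for _ in range(max_rounds):
        candidate = ask_model(target_img, render(best), best)  # propose a patch
        score = visual_score(render(candidate), target_img)
        if score < best_score + min_gain:
            break                      # no meaningful improvement; stop early
        best, best_score = candidate, score
    return best, best_score
```

Real systems typically feed the model a side-by-side of the target and current render, so the patch is grounded in the observed visual discrepancy.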
Typical results indicate:
- CLIP scores for top models (GPT-4V, Gemini) fall in the 83–89 range; LLEM above 80% is reached only by top commercial models (Liang et al., 2024).
- Incorporation of hierarchical attention or graph structure yields significant (+3–5 pp) gains on block/position/color metrics (Liang et al., 2024, Vu et al., 25 Apr 2025).
- Mobile UI benchmarks find Divide-and-Conquer and grouping-chain models (DesignCoder) improve pixel-level (MSE, SSIM) and structural (TreeBLEU, TED) scores by 12–37% (Chen et al., 16 Jun 2025).
- Even leading LMMs achieve only moderate visual fidelity in chart-editing and long-table chart synthesis, with the best models reaching SSIM of only 0.06 (Tang et al., 20 Oct 2025).
A representative performance table for Design2Code (HTML/CSS generation, N=484) (Liang et al., 2024):
| Model (Prompt/FT) | CW-SSIM | CLIP | LLEM (%) |
|---|---|---|---|
| Gemini 1.5 Pro | 0.2652 | 87.76 | 87.17 |
| GPT-4o | 0.2776 | 89.03 | 83.67 |
| Moondream2+FT | 0.1348 | 46.63 | 40.71 |
| Moondream2+Waffle | 0.2142 | 79.62 | 67.83 |
| VLM-WebSight+FT | 0.2518 | 82.35 | 73.00 |
| VLM-WebSight+Waffle | 0.2815 | 85.98 | 77.81 |
6. Methodological Innovations and Extensions
Research in the Design2Code ecosystem introduces several architectural and evaluative advances:
- Structure-aware attention in decoding: Token-level attention masks based on parent/sibling/self DOM sets (Liang et al., 2024).
- Dynamic graph-based multimodal conditioning: GCN over component sets with edge types derived from spatial and semantic relationships (Vu et al., 25 Apr 2025).
- Divide-and-conquer code aggregation: Top-down code generation with bottom-up style aggregation for modularity (Chen et al., 16 Jun 2025).
- Vision-guided, checklist-based LLM evaluation: Rendering artifacts in headless browsers, with systematic scoring by both open- and closed-source LMM referees, validated by agreement rates with human preferences (Zhang et al., 7 Jul 2025).
- Progressive task complexity: Hierarchical task regimes from simple reproduction to multi-step editing and data-driven charting (Chart2Code) (Tang et al., 20 Oct 2025).
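The structure-aware attention idea above can be sketched as a boolean mask over DOM nodes in which each node attends to itself, its parent, and its siblings. Mapping nodes to token spans, and the exact attention sets, are simplifying assumptions here:

```python
def dom_attention_mask(parent):
    """parent[i] is the parent index of node i (-1 for the root).
    Returns an n x n boolean mask: mask[i][j] is True iff j is i's
    self, parent, or sibling (a node sharing the same parent)."""
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            mask[i][j] = (i == j
                          or parent[i] == j
                          or (parent[i] == parent[j] and parent[i] != -1))
    return mask
```

In a decoder this mask would be expanded from nodes to their token spans and combined with the usual causal mask, restricting attention to structurally related regions of the tree.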
These methodologies are operationalized in released toolkits (e.g., https://artifactsbenchmark.github.io/).
7. Impact, Insights, and Future Directions
Design2Code benchmarks have driven systematic measurement of progress in automated front-end engineering. Major findings include:
- State-of-the-art MLLMs show persistent deficits in fine-grained layout and color fidelity, and tend to hallucinate non-trivial nested structures.
- Incorporation of explicit hierarchical or graph-based structural priors confers significant fidelity gains, but does not close the gap to human expert benchmarks in perception, code generation, and interaction handling (Sun et al., 23 May 2025, Zhang et al., 7 Jul 2025).
- Recent evaluation paradigms such as ArtifactsBench and FullFront recommend the inclusion of dynamic, interactive, and accessibility diagnostics in future benchmarks, and point toward hybrid architectures leveraging both LMMs and task-specific vision modules (Zhang et al., 7 Jul 2025, Sun et al., 23 May 2025).
- User studies underscore the relevance of maintainability and modularization for industrial applicability (Chen et al., 16 Jun 2025).
Key recommended directions are:
- Inclusion of dynamic and interactive states (animations, gestures).
- Expansion to multi-page and navigation-flow tasks, including mobile/responsive layouts.
- Augmentation of metric suites with pixel-level IoU, behavioral tests, and explicit accessibility diagnostics.
- Further refinement of benchmark diversity, difficulty stratification, and alignment of automated metrics with human expert preferences.
Collectively, Design2Code and its successors constitute the reference suite for quantifying, analyzing, and advancing the state of multimodal code generation for front-end development (Si et al., 2024, Liang et al., 2024, Chen et al., 16 Jun 2025, Vu et al., 25 Apr 2025, Sun et al., 23 May 2025, Tang et al., 20 Oct 2025, Zhang et al., 7 Jul 2025).