MiLDEBench: Multi-Layer Design Editing Benchmark
- MiLDEBench is a large-scale, human-curated corpus that enables fine-grained, multi-layer design editing with detailed, gold-standard supervision.
- It features a robust human-in-the-loop annotation pipeline and the MiLDEEval protocol to measure instruction following, layout consistency, aesthetics, and text rendering.
- The benchmark drives research in inter-layer reasoning and multimodal alignment, fostering the development of agents for realistic, instruction-driven design editing.
MiLDEBench is a large-scale, human-in-the-loop corpus and benchmarking suite tailored for evaluating reasoning-based, multi-layer design document editing from natural language instructions. Built upon 20,000+ transparent-background design templates from the Crello corpus, MiLDEBench addresses the unique challenges of multi-layered artifacts in real-world design domains such as posters, flyers, and slides. MiLDEBench is released alongside a comprehensive evaluation protocol, MiLDEEval, covering instruction compliance, structural and visual fidelity, and text rendering. By providing gold-standard layer-wise supervision and an integrated benchmarking framework, MiLDEBench establishes an experimental foundation for the development of agents capable of fine-grained, layer-aware document editing (Lin et al., 8 Jan 2026).
1. Dataset Construction and Representation
MiLDEBench is constructed from transparent-background templates sourced from Crello, systematically spanning domains such as posters, cards, and slides. Each document is defined as an ordered set of RGBA layers (bottom to top) and composited via the standard alpha-blending ("over") operator, applied per pixel from the bottom layer upward. For premultiplied colors:

$$C_{\text{out}} = C_f + (1 - \alpha_f)\,C_b, \qquad \alpha_{\text{out}} = \alpha_f + (1 - \alpha_f)\,\alpha_b$$
Layer classification leverages a pretrained multimodal LLM (MLLM), InternVL3-38B, to assign each layer to one of three categories—decoration, text, or image—and merges non-overlapping, semantically aligned sublayers of the same type to yield layer-coherent documents with precise z-order preservation.
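The layered representation above can be sketched as a minimal NumPy routine that alpha-blends an ordered list of RGBA layers with the over operator. This is a generic illustration of the compositing described, not MiLDEBench's actual rendering code; layers are assumed to be float arrays in [0, 1].

```python
import numpy as np

def composite(layers):
    """Alpha-blend an ordered list of (H, W, 4) RGBA float layers, bottom to top.

    Uses the standard "over" operator in premultiplied-alpha form:
    each layer's color is premultiplied by its alpha, then accumulated
    over the running composite.
    """
    h, w, _ = layers[0].shape
    out_rgb = np.zeros((h, w, 3))
    out_a = np.zeros((h, w, 1))
    for layer in layers:  # bottom layer first, top layer last
        a = layer[..., 3:4]
        rgb = layer[..., :3] * a          # premultiply color by alpha
        out_rgb = rgb + out_rgb * (1.0 - a)
        out_a = a + out_a * (1.0 - a)
    return np.concatenate([out_rgb, out_a], axis=-1)
```

For an opaque bottom layer, the output alpha is 1 everywhere and the premultiplied result equals the straight-alpha result.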
The dataset statistics are as follows:
| Aspect | Train | Test |
|---|---|---|
| Number of design documents | 17.7 K | 1.9 K |
| Average number of layers per document | 4.45 | 4.44 |
| Average number of layers edited/doc | 1.66 | 1.66 |
| Average instruction length, doc-level (tokens) | 15.56 | 15.53 |
| Average instruction length, layer-wise (tokens) | 24.50 | 24.48 |
Document layer counts are distributed primarily in the 3–6 range, and document-level instructions generally contain 5–30 tokens.
2. Instruction Taxonomy and Complexity
MiLDEBench supports a diverse and complex instruction set, generated through a two-stream process:
- Persona-based generation: Candidate instructions are synthesized by sampling user personas—such as "historian" or "teacher"—then prompting InternVL3-38B to recast the document within the constraints of that persona (e.g., transforming a concert poster into a historical exhibition poster).
- Document-conditioned generation: Instructions are created by proposing semantically adjacent modifications directly grounded in document content (e.g., "summer camp" to "winter camp").
This combined pool undergoes automated ranking for clarity, realism, and specificity, followed by human validation. Final document-level instructions are decomposed by InternVL3-38B into atomic, layer-targeted steps, aligned to layer types (text or image) via a content-aware matcher. The instruction taxonomy includes:
- Text-only edits: e.g., headline, font, or color changes
- Image-only edits: e.g., replacing the background or main photograph
- Mixed text+image edits
Persona diversity and varied prompt vocabulary ensure wide coverage of realistic editing intents. On average, 1.66 layers are edited per document.
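The taxonomy above can be made concrete with an illustrative data schema for a document-level instruction and its layer-wise decomposition. The class and field names here are assumptions for exposition, not MiLDEBench's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class LayerEdit:
    """One atomic, layer-targeted instruction (illustrative schema)."""
    layer_index: int                      # index into the document's z-ordered layers
    layer_type: Literal["text", "image"]  # decoration layers are not edit targets here
    prompt: str                           # e.g. 'replace "summer camp" with "winter camp"'

@dataclass
class EditInstance:
    """A document-level instruction with its layer-wise decomposition."""
    document_id: str
    instruction: str                      # composite, document-level instruction
    edits: List[LayerEdit] = field(default_factory=list)

    @property
    def edit_kind(self) -> str:
        """Classify the instance into the taxonomy: text-only, image-only, or mixed."""
        kinds = {e.layer_type for e in self.edits}
        if kinds == {"text"}:
            return "text-only"
        if kinds == {"image"}:
            return "image-only"
        return "mixed"
```

A schema like this makes the three instruction categories a derived property of the edited layers rather than a separately annotated label.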
3. Human-In-The-Loop Annotation Pipeline
The annotation process is a multi-stage human-in-the-loop pipeline:
- Document-level instruction generation:
- Automatic candidate generation (persona and document-conditioned)
- Automated filtering for clarity and feasibility
- Human validation of realism and specificity
- Layer-wise decomposition and annotation:
- MLLM decomposes into atomic actions
- Content-aware matching associates edits with layer types and positions
- Rule-based validators check syntactic and semantic well-formedness
- Expert human reviewers (at least M.S./Ph.D. with UI/design expertise) finalize the layer-wise prompts and the sets of relevant layers
Quality control includes stringent rule-based checks (e.g., non-empty prompts; text edits must reference both source and target strings) and a consensus protocol. Inter-annotator agreement for human evaluations is measured with Cohen's κ.
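The rule-based checks above can be sketched as a small validator. The specific rules (non-empty prompt; text edits referencing source and target strings) come from the description, but the concrete heuristics—such as detecting quoted spans—are assumptions for illustration.

```python
import re

def validate_layer_prompt(layer_type: str, prompt: str) -> list:
    """Rule-based well-formedness checks for a layer-wise edit prompt.

    Returns a list of error strings (empty if the prompt passes).
    The quoted-span heuristic for source/target detection is an
    illustrative assumption, not MiLDEBench's actual validator.
    """
    errors = []
    if not prompt.strip():
        errors.append("empty prompt")
    elif layer_type == "text":
        # Expect both a source and a target string, e.g. two quoted spans:
        # replace "summer camp" with "winter camp"
        quoted = re.findall(r'"([^"]+)"', prompt)
        if len(quoted) < 2:
            errors.append("text edit must reference source and target strings")
    return errors
```

In a pipeline like the one described, prompts failing such checks would be routed back to the MLLM decomposition step or flagged for human review.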
4. MiLDEEval Evaluation Protocol
MiLDEEval is an evaluation suite quantifying layer-aware editing across four core dimensions and producing a unified aggregate score (MiLDEScore):
- Instruction Following (IF): VQA-style yes/no question answering per gold-edited layer, computed as correct "yes" answers over total layers edited.
- Layout Consistency (LC): Measures preservation of layer geometry and z-ordering via mask extraction, bipartite IoU matching, and features evaluating centroid displacement, shape IoU, and area similarity; unmatched masks are penalized by area-weighted factors, and the per-layer features are aggregated into a single LC score.
- Aesthetics (A): Automated scoring of the composited result using a frozen Aesthetic Predictor V2.5.
- Text Rendering (TR): Automated analysis using Adopd Doc2BBox and InternVL3-38B for region detection, OCR, and semantic comparison to the required text.
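The bipartite IoU matching step in the LC metric can be sketched with the Hungarian algorithm over a mask-IoU cost matrix. Maximizing total IoU is an assumption here—the exact cost design in MiLDEEval is not spelled out in this summary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_layers(pred_masks, gold_masks):
    """Bipartite matching of predicted to gold layer masks by maximum total IoU.

    Returns (pairs, unmatched_gold_indices). Pairs with zero IoU are
    discarded; in MiLDEEval, unmatched masks would then be penalized
    by area-weighted factors.
    """
    cost = np.array([[-iou(p, g) for g in gold_masks] for p in pred_masks])
    rows, cols = linear_sum_assignment(cost)  # minimizes cost = maximizes IoU
    pairs = [(r, c) for r, c in zip(rows, cols) if -cost[r, c] > 0]
    matched_gold = {c for _, c in pairs}
    unmatched = [i for i in range(len(gold_masks)) if i not in matched_gold]
    return pairs, unmatched
```

Per-pair geometry features (centroid displacement, shape IoU, area similarity) would then be computed over the matched pairs.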
The unified MiLDEScore combines the normalized metrics with a sigmoid-gated dependency on IF, so that layout consistency, aesthetics, and text rendering contribute fully only once instructions are followed; the gate's steepness and threshold parameters and the component weights take default values specified in the benchmark.
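The gating idea can be sketched as follows. The functional form (a sigmoid of IF scaling a weighted sum of the other metrics) follows the description, but the parameter values `k`, `tau`, and the weights are illustrative placeholders, not the benchmark's defaults.

```python
import math

def milde_score(IF, LC, A, TR, k=10.0, tau=0.5, weights=(1/3, 1/3, 1/3)):
    """Sigmoid-gated aggregate in the spirit of MiLDEScore.

    All inputs are assumed normalized to [0, 1]. The gate g(IF)
    suppresses the weighted sum of LC, A, and TR when instruction
    following is low; k (steepness), tau (threshold), and the weights
    are placeholder values for illustration.
    """
    gate = 1.0 / (1.0 + math.exp(-k * (IF - tau)))  # sigmoid gate on IF
    w_lc, w_a, w_tr = weights
    return gate * (w_lc * LC + w_a * A + w_tr * TR)
```

The gate encodes the stated dependency: a visually polished output that ignores the instruction scores near zero, while strong instruction following lets the remaining metrics drive the score.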
5. Dataset Usage and Benchmarking
MiLDEBench defines a single-round evaluation protocol in which models receive only the rendered document and the composite document-level instruction; ground-truth layer-wise inputs are concealed. The dataset is split into 17,700 training and 1,900 test documents, with no validation set released.
Special evaluation scenarios consider:
- Text-only edits
- Image-only edits
- Mixed type edits (text + image)
- Model performance as a function of the number of edited layers
- Scaling behavior with model size
Competing models are benchmarked on all MiLDEEval dimensions, with MiLDEAgent reported as establishing the first strong open-source baseline (Lin et al., 8 Jan 2026).
6. Challenges Addressed and Design Objectives
MiLDEBench is designed to overcome several technical hurdles unique to the multi-layer document editing domain:
- Layer Awareness: Determining which layers require modification (layer selection) and which should remain intact.
- Inter-Layer Reasoning: Managing dependencies among layers while preserving global document structure and layout fidelity.
- Fine-Grained Text Fidelity: Achieving high-quality OCR, accurate semantic text matching, and appropriate stylization.
- Intent Diversity: Enabling edits reflective of realistic, diverse user goals through persona- and domain-conditioned instruction generation.
Principal goals are:
- Providing gold supervision at both document and layer granularity (across IF, LC, A, and TR, together with gold-edited layers and layer-wise instructions)
- Establishing a rigorous, standardized experimental protocol mirroring real-world creative editing workflows
- Stimulating research in multimodal, reasoning-capable models for layer-aware, instruction-driven design editing as opposed to flat-canvas, undifferentiated image modification.
7. Significance and Prospects
MiLDEBench, in conjunction with MiLDEEval, constitutes the inaugural large-scale, human-curated benchmark for instruction-driven, multi-layer design document editing. It provides foundational infrastructure for training and evaluating agents capable of sophisticated multimodal reasoning over complex document hierarchies. A plausible implication is its adoption may drive the development of new architectures focused on compositional reasoning, spatial consistency, and multimodal alignment across structured design domains (Lin et al., 8 Jan 2026).