MiLDEBench: Multi-Layer Design Editing Benchmark
- MiLDEBench is a large-scale, human-curated corpus that enables fine-grained, multi-layer design editing with detailed, gold-standard supervision.
- It features a robust human-in-the-loop annotation pipeline and the MiLDEEval protocol to measure instruction following, layout consistency, aesthetics, and text rendering.
- The benchmark drives research in inter-layer reasoning and multimodal alignment, fostering the development of agents for realistic, instruction-driven design editing.
MiLDEBench is a large-scale, human-in-the-loop corpus and benchmarking suite tailored for evaluating reasoning-based, multi-layer design document editing from natural language instructions. Built upon 20,000+ transparent-background design templates from the Crello corpus, MiLDEBench addresses the unique challenges of multi-layered artifacts in real-world design domains such as posters, flyers, and slides. MiLDEBench is released alongside a comprehensive evaluation protocol, MiLDEEval, covering instruction compliance, structural and visual fidelity, and text rendering. By providing gold-standard layer-wise supervision and an integrated benchmarking framework, MiLDEBench establishes an experimental foundation for the development of agents capable of fine-grained, layer-aware document editing (Lin et al., 8 Jan 2026).
1. Dataset Construction and Representation
MiLDEBench is constructed from transparent-background templates sourced from Crello, systematically spanning domains such as posters, cards, and slides. Each document is defined as an ordered set of RGBA layers (bottom to top) and composited via the standard alpha-blending ("over") operator, applied per pixel from the bottom layer upward. For premultiplied colors:

$$C_{\text{out}} = C_f + (1 - \alpha_f)\,C_b, \qquad \alpha_{\text{out}} = \alpha_f + (1 - \alpha_f)\,\alpha_b$$
Layer classification leverages a pretrained multimodal LLM (MLLM), InternVL3-38B, to assign each layer to one of three categories—decoration, text, or image—and merges non-overlapping, semantically aligned sublayers of the same type to yield layer-coherent documents with precise z-order preservation.
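The layered representation above can be sketched as a minimal NumPy routine that alpha-blends an ordered list of RGBA layers with the over operator. This is a generic illustration of the compositing described, not MiLDEBench's actual rendering code; layers are assumed to be float arrays in [0, 1].

```python
import numpy as np

def composite(layers):
    """Alpha-blend an ordered list of (H, W, 4) RGBA float layers, bottom to top.

    Uses the standard "over" operator in premultiplied-alpha form:
    each layer's color is premultiplied by its alpha, then accumulated
    over the running composite.
    """
    h, w, _ = layers[0].shape
    out_rgb = np.zeros((h, w, 3))
    out_a = np.zeros((h, w, 1))
    for layer in layers:  # bottom layer first, top layer last
        a = layer[..., 3:4]
        rgb = layer[..., :3] * a          # premultiply color by alpha
        out_rgb = rgb + out_rgb * (1.0 - a)
        out_a = a + out_a * (1.0 - a)
    return np.concatenate([out_rgb, out_a], axis=-1)
```

For an opaque bottom layer, the output alpha is 1 everywhere and the premultiplied result equals the straight-alpha result.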
The dataset statistics are as follows:
| Aspect | Train | Test |
|---|---|---|
| Number of design documents | 17.7 K | 1.9 K |
| Average number of layers per document | 4.45 | 4.44 |
| Average number of layers edited/doc | 1.66 | 1.66 |
| Average instruction length, doc-level (tokens) | 15.56 | 15.53 |
| Average instruction length, layer-wise (tokens) | 24.50 | 24.48 |
Document layer counts are distributed primarily in the 3–6 range, and document-level instructions generally contain 5–30 tokens.
2. Instruction Taxonomy and Complexity
MiLDEBench supports a diverse and complex instruction set, generated through a two-stream process:
- Persona-based generation: Candidate instructions are synthesized by sampling user personas—such as "historian" or "teacher"—then prompting InternVL3-38B to recast the document within the constraints of that persona (e.g., transforming a concert poster into a historical exhibition poster).
- Document-conditioned generation: Instructions are created by proposing semantically adjacent modifications directly grounded in document content (e.g., "summer camp" to "winter camp").
This combined pool undergoes automated ranking for clarity, realism, and specificity, followed by human validation. Final document-level instructions are decomposed by InternVL3-38B into atomic, layer-targeted steps, aligned to layer types (text or image) via a content-aware matcher. The instruction taxonomy includes:
- Text-only edits: e.g., headline, font, or color changes
- Image-only edits: e.g., replacing the background or main photograph
- Mixed text+image edits
Persona diversity and varied prompt vocabulary ensure wide coverage of realistic editing intents. On average, 1.66 layers are edited per document.
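The taxonomy above can be made concrete with an illustrative data schema for a document-level instruction and its layer-wise decomposition. The class and field names here are assumptions for exposition, not MiLDEBench's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class LayerEdit:
    """One atomic, layer-targeted instruction (illustrative schema)."""
    layer_index: int                      # index into the document's z-ordered layers
    layer_type: Literal["text", "image"]  # decoration layers are not edit targets here
    prompt: str                           # e.g. 'replace "summer camp" with "winter camp"'

@dataclass
class EditInstance:
    """A document-level instruction with its layer-wise decomposition."""
    document_id: str
    instruction: str                      # composite, document-level instruction
    edits: List[LayerEdit] = field(default_factory=list)

    @property
    def edit_kind(self) -> str:
        """Classify the instance into the taxonomy: text-only, image-only, or mixed."""
        kinds = {e.layer_type for e in self.edits}
        if kinds == {"text"}:
            return "text-only"
        if kinds == {"image"}:
            return "image-only"
        return "mixed"
```

A schema like this makes the three instruction categories a derived property of the edited layers rather than a separately annotated label.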
3. Human-In-The-Loop Annotation Pipeline
The annotation process is a multi-stage human-in-the-loop pipeline:
- Document-level instruction generation:
- Automatic candidate generation (persona and document-conditioned)
- Automated filtering for clarity and feasibility
- Human validation of realism and specificity
- Layer-wise decomposition and annotation:
- MLLM decomposes into atomic actions
- Content-aware matching associates edits with layer types and positions
- Rule-based validators check syntactic and semantic well-formedness
- Expert human reviewers (at least M.S./Ph.D. with UI/design expertise) finalize the layer-wise prompts and the sets of relevant layers
Quality control includes stringent rule-based checks (e.g., non-empty prompts; text edits must reference both source and target strings) and a consensus protocol. Inter-annotator agreement for human evaluations is measured with Cohen's κ.
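The rule-based checks above can be sketched as a small validator. The specific rules (non-empty prompt; text edits referencing source and target strings) come from the description, but the concrete heuristics—such as detecting quoted spans—are assumptions for illustration.

```python
import re

def validate_layer_prompt(layer_type: str, prompt: str) -> list:
    """Rule-based well-formedness checks for a layer-wise edit prompt.

    Returns a list of error strings (empty if the prompt passes).
    The quoted-span heuristic for source/target detection is an
    illustrative assumption, not MiLDEBench's actual validator.
    """
    errors = []
    if not prompt.strip():
        errors.append("empty prompt")
    elif layer_type == "text":
        # Expect both a source and a target string, e.g. two quoted spans:
        # replace "summer camp" with "winter camp"
        quoted = re.findall(r'"([^"]+)"', prompt)
        if len(quoted) < 2:
            errors.append("text edit must reference source and target strings")
    return errors
```

In a pipeline like the one described, prompts failing such checks would be routed back to the MLLM decomposition step or flagged for human review.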
4. MiLDEEval Evaluation Protocol
MiLDEEval is an evaluation suite quantifying layer-aware editing across four core dimensions and producing a unified aggregate score (MiLDEScore):
- Instruction Following (IF): VQA-style yes/no question answering per gold-edited layer, computed as correct "yes" answers over total layers edited.
- Layout Consistency (LC): Measures preservation of layer geometry and z-ordering via mask extraction, bipartite IoU matching, and features evaluating centroid displacement, shape IoU, and area similarity; unmatched masks are penalized by area-weighted factors, and the per-layer features are aggregated into a single LC score.
- Aesthetics (A): Automated scoring of the composited result using a frozen Aesthetic Predictor V2.5.
- Text Rendering (TR): Automated analysis using Adopd Doc2BBox and InternVL3-38B for region detection, OCR, and semantic comparison to the required text.
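The bipartite IoU matching step in the LC metric can be sketched with the Hungarian algorithm over a mask-IoU cost matrix. Maximizing total IoU is an assumption here—the exact cost design in MiLDEEval is not spelled out in this summary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_layers(pred_masks, gold_masks):
    """Bipartite matching of predicted to gold layer masks by maximum total IoU.

    Returns (pairs, unmatched_gold_indices). Pairs with zero IoU are
    discarded; in MiLDEEval, unmatched masks would then be penalized
    by area-weighted factors.
    """
    cost = np.array([[-iou(p, g) for g in gold_masks] for p in pred_masks])
    rows, cols = linear_sum_assignment(cost)  # minimizes cost = maximizes IoU
    pairs = [(r, c) for r, c in zip(rows, cols) if -cost[r, c] > 0]
    matched_gold = {c for _, c in pairs}
    unmatched = [i for i in range(len(gold_masks)) if i not in matched_gold]
    return pairs, unmatched
```

Per-pair geometry features (centroid displacement, shape IoU, area similarity) would then be computed over the matched pairs.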
The unified MiLDEScore combines the normalized metrics with a sigmoid-gated dependency on IF, so that layout consistency, aesthetics, and text rendering contribute fully only once instructions are followed; the gate's steepness and threshold parameters and the component weights take default values specified in the benchmark.
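The gating idea can be sketched as follows. The functional form (a sigmoid of IF scaling a weighted sum of the other metrics) follows the description, but the parameter values `k`, `tau`, and the weights are illustrative placeholders, not the benchmark's defaults.

```python
import math

def milde_score(IF, LC, A, TR, k=10.0, tau=0.5, weights=(1/3, 1/3, 1/3)):
    """Sigmoid-gated aggregate in the spirit of MiLDEScore.

    All inputs are assumed normalized to [0, 1]. The gate g(IF)
    suppresses the weighted sum of LC, A, and TR when instruction
    following is low; k (steepness), tau (threshold), and the weights
    are placeholder values for illustration.
    """
    gate = 1.0 / (1.0 + math.exp(-k * (IF - tau)))  # sigmoid gate on IF
    w_lc, w_a, w_tr = weights
    return gate * (w_lc * LC + w_a * A + w_tr * TR)
```

The gate encodes the stated dependency: a visually polished output that ignores the instruction scores near zero, while strong instruction following lets the remaining metrics drive the score.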
5. Dataset Usage and Benchmarking
MiLDEBench defines a single-round evaluation protocol in which models receive only the rendered document and the composite document-level instruction; ground-truth layer-wise inputs are concealed. The dataset is split into 17,700 training and 1,900 test documents, with no validation set released.
Special evaluation scenarios consider:
- Text-only edits
- Image-only edits
- Mixed type edits (text + image)
- Model performance as a function of the number of edited layers
- Scaling behavior with model size
Competing models are benchmarked on all MiLDEEval dimensions, with MiLDEAgent reported as establishing the first strong open-source baseline (Lin et al., 8 Jan 2026).
6. Challenges Addressed and Design Objectives
MiLDEBench is designed to overcome several technical hurdles unique to the multi-layer document editing domain:
- Layer Awareness: Determining which layers require modification (layer selection) and which should remain intact.
- Inter-Layer Reasoning: Managing dependencies among layers while preserving global document structure and layout fidelity.
- Fine-Grained Text Fidelity: Achieving high-quality OCR, accurate semantic text matching, and appropriate stylization.
- Intent Diversity: Enabling edits reflective of realistic, diverse user goals through persona- and domain-conditioned instruction generation.
Principal goals are:
- Providing gold supervision at both document and layer granularity (across IF, LC, A, and TR, together with gold-edited layers and layer-wise instructions)
- Establishing a rigorous, standardized experimental protocol mirroring real-world creative editing workflows
- Stimulating research in multimodal, reasoning-capable models for layer-aware, instruction-driven design editing as opposed to flat-canvas, undifferentiated image modification.
7. Significance and Prospects
MiLDEBench, in conjunction with MiLDEEval, constitutes the inaugural large-scale, human-curated benchmark for instruction-driven, multi-layer design document editing. It provides foundational infrastructure for training and evaluating agents capable of sophisticated multimodal reasoning over complex document hierarchies. A plausible implication is its adoption may drive the development of new architectures focused on compositional reasoning, spatial consistency, and multimodal alignment across structured design domains (Lin et al., 8 Jan 2026).