MiLDEEval: Multi-Layer Editing Evaluation
- The paper introduces MiLDEEval, a bespoke protocol that provides detailed, multi-dimensional assessment for reasoning-intensive, multi-layer document editing systems.
- It systematically decomposes performance into instruction following, layout consistency, aesthetics, and text rendering, using tailored metrics for each dimension.
- The aggregated MiLDEScore, reinforced by gating and synergy, strongly aligns with human judgments and offers a robust benchmark for future multimodal editing research.
MiLDEEval is a bespoke evaluation protocol designed for the assessment of reasoning-intensive, multi-layer document editing systems. Developed alongside the MiLDEBench benchmark, MiLDEEval introduces a multidimensional, perceptually driven framework to benchmark and diagnose model performance in tasks where natural-language instructions guide fine-grained edits across complex, multi-layer design documents consisting of text, images, and decorative elements. Unlike previous approaches that evaluate only flat canvas image edits, MiLDEEval systematically decomposes editing success into four dimensions—Instruction Following, Layout Consistency, Aesthetics, and Text Rendering—and recombines these into a single MiLDEScore that exhibits strong alignment with human judgments (Lin et al., 8 Jan 2026).
1. Role and Motivation
MiLDEEval serves two primary functions within the MiLDEBench suite: diagnostic analysis and unified system ranking. The protocol provides per-dimension scores to reveal failure modes, such as models that preserve spatial arrangement but ignore instructions. In addition, these dimensions are aggregated into a summary MiLDEScore, enabling straightforward comparison and ranking of end-to-end editing agents. This dual capability allows researchers to distinguish between models that superficially satisfy visual or structural criteria and those that perform authentic, user-intended, layer-aware edits (Lin et al., 8 Jan 2026).
2. The Four Evaluation Dimensions
Each dimension is precisely defined with tailored metrics. This decomposition enables granular scrutiny and robust aggregation.
2.1 Instruction Following
This dimension quantifies whether the model accomplished the layer-specific content changes requested. For each layer flagged for editing in the gold annotation, InternVL3-38B generates a binary question ("Has the main image been changed to a museum scene?") and auto-judges the answer. With $a_i \in \{0, 1\}$ the judge's verdict for the $i$-th flagged layer, the raw score is the fraction of satisfied edits:

$$S_{\text{IF}} = \frac{1}{N_e} \sum_{i=1}^{N_e} a_i,$$

where $N_e$ is the number of layers requiring an edit (Lin et al., 8 Jan 2026).
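Assuming the per-layer judge verdicts are collected as a list of 0/1 values, the score above reduces to a simple mean. The function name and the handling of the no-edit edge case below are illustrative assumptions, not the paper's implementation:

```python
def instruction_following_score(judgments):
    """Average the VLM judge's binary verdicts over the layers flagged
    for editing (1 = requested change present, 0 = absent).

    `judgments` holds one 0/1 verdict per layer requiring an edit.
    When no layer requires editing, we return 1.0 (vacuously satisfied);
    how the paper handles this edge case is not specified in this
    summary, so that choice is an assumption.
    """
    if not judgments:
        return 1.0
    return sum(judgments) / len(judgments)
```

For instance, two of three requested layer edits judged successful yield a score of 2/3.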
2.2 Layout Consistency
This dimension measures the fidelity of the output’s spatial arrangement relative to the original. Layer masks are extracted via Adopd Doc2Mask for both the original and the edited document, then:
- Compute the pairwise IoU similarity matrix between original and edited layer masks.
- Solve a maximum-weight assignment with the Hungarian algorithm, retaining only matches whose IoU exceeds a fixed threshold.
- For each retained mask pair, compute three similarity terms: position (agreement of mask locations), shape (agreement of mask geometry), and area (agreement of mask sizes).
Penalties are added for unmatched layers: those that disappeared from the original and those newly created in the edit. The position, shape, area, and two penalty terms are combined with empirically assigned weights (Lin et al., 8 Jan 2026).
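The matching stage of the pipeline above can be sketched with SciPy's Hungarian solver. The IoU threshold of 0.5 is a placeholder, since the paper's retained-match criterion is not reproduced in this summary:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks of equal shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0


def match_layers(orig_masks, edit_masks, iou_threshold=0.5):
    """Hungarian matching of original to edited layer masks on IoU.

    Returns (matches, disappeared, created): matched index pairs whose
    IoU clears the threshold, unmatched original-layer indices, and
    unmatched edited-layer indices.
    """
    sim = np.array([[iou(a, b) for b in edit_masks] for a in orig_masks])
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize total IoU
    matches = [(i, j) for i, j in zip(rows, cols) if sim[i, j] >= iou_threshold]
    matched_o = {i for i, _ in matches}
    matched_e = {j for _, j in matches}
    disappeared = [i for i in range(len(orig_masks)) if i not in matched_o]
    created = [j for j in range(len(edit_masks)) if j not in matched_e]
    return matches, disappeared, created
```

The disappeared and created index lists feed directly into the unmatched-layer penalties described above.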
2.3 Aesthetics
The edited document is fed to a frozen Aesthetic Predictor V2.5, and the raw score is linearly rescaled to the common $[0, 1]$ range. This dimension judges whether the edit maintains or improves the document's visual appeal.
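A minimal sketch of the rescaling step, assuming the predictor emits raw scores in roughly $[1, 10]$ (the exact bounds used by MiLDEEval are an assumption here):

```python
def rescale_aesthetic(raw, lo=1.0, hi=10.0):
    """Linearly map a raw aesthetic-predictor score onto [0, 1],
    clipping to the assumed raw range [lo, hi]; the paper's exact
    bounds were not preserved in this summary."""
    x = (raw - lo) / (hi - lo)
    return max(0.0, min(1.0, x))
```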
2.4 Text Rendering
Text rendering ensures all required textual edits are correctly executed. The protocol applies Adopd Doc2BBox for detection and InternVL3-38B for OCR and edit judgment. Each text region is scored:
- 0 = incorrect
- 0.5 = partially correct
- 1 = correct
Final score:

$$S_{\text{TR}} = \frac{1}{N_t} \sum_{j=1}^{N_t} s_j,$$

with $N_t$ the total number of text regions needing edits and $s_j$ the per-region score (Lin et al., 8 Jan 2026).
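With the per-region scores in hand, the final text-rendering score is again a mean over the graded 0/0.5/1 verdicts; the helper name and the no-edit edge case below are illustrative:

```python
def text_rendering_score(region_scores):
    """Average per-region text-edit scores over the N_t regions that
    required edits, where each score is 0 (incorrect), 0.5 (partially
    correct), or 1 (correct)."""
    if any(s not in (0, 0.5, 1) for s in region_scores):
        raise ValueError("scores must be 0, 0.5, or 1")
    if not region_scores:
        return 1.0  # no text edits required; this edge case is an assumption
    return sum(region_scores) / len(region_scores)
```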
3. Aggregation: The MiLDEScore
To synthesize the four dimensions into an overall measure, MiLDEEval employs gated, synergistic aggregation. Raw dimension scores are first normalized to $[0, 1]$. A sigmoid gate on instruction following, $g = \sigma\!\left(k\,(S_{\text{IF}} - \tau)\right)$ with threshold $\tau$ and steepness $k$, suppresses the layout and aesthetics contributions when instructions are ignored. The final MiLDEScore is a weighted combination of the gated dimension scores plus a synergy term, with five empirically tuned weights. This configuration yields the highest Spearman correlation with human ratings, outperforming alternatives such as weighted-sum, geometric-mean, or harmonic-mean aggregation (Lin et al., 8 Jan 2026).
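The gated, synergistic aggregation described above can be sketched as follows. The gate parameters, the five weights, and the exact form of the synergy term are illustrative placeholders, not the paper's calibrated values:

```python
import math


def sigmoid_gate(s_if, tau=0.5, k=10.0):
    """Soft gate on instruction following: near 0 when s_if << tau,
    near 1 when s_if >> tau. tau and k are placeholder values."""
    return 1.0 / (1.0 + math.exp(-k * (s_if - tau)))


def milde_score(s_if, s_lc, s_aes, s_tr, w=(0.4, 0.2, 0.1, 0.2), w_syn=0.1):
    """Gated, synergistic aggregation sketch: layout and aesthetics are
    scaled by the gate so they cannot reward a model that ignored the
    instruction, and a synergy term rewards jointly high instruction
    following and layout consistency. Weights are illustrative."""
    g = sigmoid_gate(s_if)
    w_if, w_lc, w_aes, w_tr = w
    synergy = s_if * s_lc
    return (w_if * s_if + g * (w_lc * s_lc + w_aes * s_aes)
            + w_tr * s_tr + w_syn * synergy)
```

Setting `s_if` near zero drives the gate toward zero, so even a perfectly preserved layout contributes almost nothing, mirroring the gating behavior described above.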
4. Data and Annotation Protocol
The MiLDEBench dataset comprises 19.6 K multi-layer documents (avg. 4.45 layers/document), with 17.7 K for training and 1.9 K for testing. 50 K editing instructions are generated via persona-based and document-based pipelines with InternVL3-38B and further refined by human validation. Layer-wise decomposition and alignment to instructions are verified by multimodal model matching and expert annotation.
For MiLDEEval, all 1.9 K test samples are assessed. Human annotation for MiLDEScore validation uses 100 sampled test cases with two independent annotators per system (PhD/master’s students in multimodal research or professional designers), each rating every dimension (0–3 scale) and the overall outcome. Inter-annotator agreement statistics are reported for each of the four dimensions and for the overall rating (Lin et al., 8 Jan 2026).
5. Experimental Protocols and Exemplars
MiLDEEval discriminates sharply among model behaviors. Illustrative cases include:
- A diffusion model that returns the input unedited scores near-zero Instruction Following despite near-perfect Layout Consistency; the gate then suppresses the layout contribution, yielding a near-zero overall MiLDEScore.
- Partial text edits (e.g., “piano”→“harpsichord” but not “concert”) earn only partial per-region credit, lowering Text Rendering and therefore the MiLDEScore.
- Closed-source models that follow instructions with only slight layout drift combine high dimension scores into a high MiLDEScore via synergy and gating; open-source models that follow instructions but poorly preserve layout receive only moderate overall scores.
No hypothesis testing or p-values are computed; metric validation is limited to correlation and inter-annotator statistics (Lin et al., 8 Jan 2026).
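The correlation-based validation mentioned above can be illustrated with SciPy's rank correlation (synthetic scores, not the paper's data; the p-value is discarded, consistent with the note above):

```python
from scipy.stats import spearmanr

# Synthetic example: rank-correlate automatic MiLDEScores with mean
# human ratings (0-3 scale) over five hypothetical systems.
auto_scores = [0.82, 0.45, 0.67, 0.91, 0.30]
human_means = [2.6, 1.4, 2.1, 2.8, 0.9]
rho, _ = spearmanr(auto_scores, human_means)  # p-value discarded
print(round(rho, 3))  # identical rankings give rho = 1.0
```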
6. Conceptual Distinctions and Implications
MiLDEEval provides a uniquely fine-grained, human-aligned yardstick for multi-layer document editing, going beyond flat-image metrics or superficial “looks good” criteria and instead enforcing rigorous assessment of intent execution, layer structure, and aesthetic and typographic integrity. The gating and synergy mechanisms in MiLDEScore prevent vacuous success on irrelevant dimensions and improve interpretability. A plausible implication is that MiLDEEval’s holistic rigor suits it as a primary benchmark for future multimodal editing research, especially in settings where reasoning about layered structure is necessary for authentic document modification (Lin et al., 8 Jan 2026).