TextEditBench: Text-Centric Image Editing Benchmark
- TextEditBench is a benchmark for evaluating text-centric editing in images, focusing on semantic, geometric, and logical reasoning to ensure accurate text modifications.
- It categorizes editing tasks by atom operations and reasoning intensity, using both pixel-level metrics and multimodal semantic assessments to gauge performance.
- The benchmark integrates synthetic and real-world datasets with rigorous annotations, enabling detailed error analysis and evaluation of advanced text editing systems.
TextEditBench is a comprehensive evaluation benchmark specifically designed to assess and advance the reasoning capabilities of models performing text-centric editing within natural images. Unlike preceding image editing benchmarks, which predominantly target pixel-level manipulations or object-based modifications, TextEditBench directly addresses the intricate challenges of editing embedded text in photographs and designs—where each character is densely coupled with semantic, geometric, and contextual constraints. Its unique contribution is the explicit focus on reasoning-intensive scenarios, where successful editing requires more than visually plausible glyph rendering: it necessitates semantic coherence, physical plausibility, and cross-modal logical consistency (Gui et al., 18 Dec 2025).
1. Motivation and Benchmark Structure
TextEditBench is motivated by the observation that state-of-the-art diffusion and multimodal models exhibit strong capabilities in object and background editing but systematically fail on text manipulation: they often hallucinate glyphs, misspell words, violate geometric perspective, break the relationships among text elements, or ignore contextual dependencies. The benchmark explicitly pushes the field from simple text legibility to context- and reasoning-aware modification, setting a new standard for evaluation.
Edits are characterized along two orthogonal axes:
- Atom Operations: Six canonical edit types, capturing all primitive text manipulation operations required in natural images:
  - Text Delete
  - Text Insert
  - Text Change (replace an existing string)
  - Text Relocation (translate/rotate text)
  - Scaling (resize text instances)
  - Text Attribute change (alter font, weight, or color)
- Reasoning Intensity: Three tiers of reasoning demand:
  - Simple edits: Local, span-based string changes (e.g., "change ‘Sale’ → ‘Offer’")
  - Geometric edits: Require preservation of physical perspective, curvature, or complex surface deformation
  - Semantic and linguistic edits: Necessitate multi-step world knowledge (e.g., postpone a date logically, recalculate prices, perform time conversions)
This taxonomy establishes a rigorous foundation for the systematic evaluation of advanced text editing systems beyond traditional generative benchmarks.
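This two-axis labeling can be captured in a small schema. The sketch below is illustrative only: the enum and field names are assumptions, not the benchmark's released data format.

```python
from dataclasses import dataclass
from enum import Enum, auto

class AtomOperation(Enum):
    """The six primitive text-edit types in the taxonomy."""
    DELETE = auto()
    INSERT = auto()
    CHANGE = auto()
    RELOCATE = auto()
    SCALE = auto()
    ATTRIBUTE = auto()

class ReasoningIntensity(Enum):
    """Orthogonal axis: how much reasoning the edit demands."""
    SIMPLE = auto()      # local, span-based string change
    GEOMETRIC = auto()   # perspective / curvature / surface deformation
    SEMANTIC = auto()    # multi-step world knowledge or logic

@dataclass
class EditTask:
    image_path: str
    instruction: str                      # natural-language edit command
    operation: AtomOperation
    intensity: ReasoningIntensity
    knowledge_prompt: str | None = None   # explicit reasoning chain, if provided

# Example: a semantic edit that requires date arithmetic.
task = EditTask(
    image_path="calendar.png",
    instruction="Move the event 'April 10' forward by four days.",
    operation=AtomOperation.CHANGE,
    intensity=ReasoningIntensity.SEMANTIC,
    knowledge_prompt="April 10 + 4 days = April 14.",
)
```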
2. Dataset Composition and Annotation Protocol
TextEditBench contains 1,196 total instances, split evenly between:
- Manually synthesized designs (created in Canva), providing paired inputs and ground-truth targets for high-fidelity quantitative assessment
- Web-sourced natural images (from GIEBench, AnyEdit, and public repositories), with meticulously annotated input-only examples and human-drawn region masks
The benchmark covers 14 subject domains, reflecting the variety of text appearances in real-world images (signage, documents, infographics, packaging, etc.), and supports multilingual evaluation (English, Chinese, mixed scripts) and a spectrum of font styles (decorative, serif; planar and curved surfaces; variable lighting/backgrounds).
Annotation involves a multi-stage process:
- Human instruction drafting: edit tasks, difficulty characteristics, and explicit "Knowledge Prompts" where complex reasoning is implied.
- LLM refinement (GPT-5): prompt normalization and terminological consistency.
- Senior expert verification: validation of region masks, instruction clarity, and fidelity to the reasoning chain.
Every instance is tagged with ten interpretable difficulty attributes (e.g., num_text_regions, font_complexity, context_dependency, semantic_linkage), each taking values in {0,1,2}; their sum (range ∈ [0,20]) defines easy/medium/hard tiers.
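As a sketch, the tiering could be computed as below. Only the four attribute names quoted in the text are taken from the benchmark; the remaining six names and the bin boundaries are placeholders, since the full attribute list and exact cutoffs are not reproduced here.

```python
# Ten interpretable attributes, each scored in {0, 1, 2}; their sum (0-20)
# is binned into easy/medium/hard. Attribute names marked as placeholders
# and the bin edges are illustrative assumptions.
DIFFICULTY_ATTRIBUTES = [
    "num_text_regions", "font_complexity", "context_dependency", "semantic_linkage",
    # placeholders for the remaining six attributes:
    "attr_5", "attr_6", "attr_7", "attr_8", "attr_9", "attr_10",
]

def difficulty_tier(scores: dict[str, int]) -> str:
    """Sum the 0-2 attribute scores and map the total (0-20) to a tier."""
    assert set(scores) == set(DIFFICULTY_ATTRIBUTES)
    assert all(v in (0, 1, 2) for v in scores.values())
    total = sum(scores.values())
    if total <= 6:        # illustrative cutoff
        return "easy"
    if total <= 13:       # illustrative cutoff
        return "medium"
    return "hard"
```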
3. Evaluation Dimensions and Metrics
TextEditBench implements a dual-track evaluation protocol, integrating both pixel-level fidelity and multimodal LLM (MLLM) semantic assessment:
A. Pixel-Level Objective Metrics (restricted to non-edited regions, using region masks to penalize unintended background changes):
- Masked MSE: mean squared error between the edited output and the ground truth, averaged over non-edited pixels only
- Masked SSIM: structural similarity, computed on non-edited pixels
- Masked PSNR: peak signal-to-noise ratio, 10·log₁₀(MAX² / MSE_mask), computed over the same non-edited region
- Masked LPIPS: perceptual similarity, restricted to non-edited regions
- Spatial correction: SIFT–FLANN–affine pre-alignment accounts for slight crop/scale changes
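A minimal sketch of this pipeline, assuming grayscale uint8 arrays and a binary edit mask (nonzero = edited region); LPIPS would be handled analogously with the lpips package. This illustrates the described procedure and is not the benchmark's released evaluation code.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def sift_flann_affine_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Warp the edited output onto the ground truth with a partial-affine
    transform estimated from SIFT keypoints matched via FLANN, absorbing
    slight crop/scale differences before pixel metrics are computed."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(pred, None)
    kp2, des2 = sift.detectAndCompute(gt, None)
    matcher = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]  # Lowe's ratio test
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    A, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    h, w = gt.shape
    return cv2.warpAffine(pred, A, (w, h))

def masked_pixel_metrics(pred: np.ndarray, gt: np.ndarray,
                         edit_mask: np.ndarray) -> dict:
    """MSE / PSNR / SSIM restricted to the non-edited region (mask == 0),
    so that unintended background changes are penalized."""
    keep = edit_mask == 0
    diff = pred.astype(np.float64)[keep] - gt.astype(np.float64)[keep]
    mse = float(np.mean(diff ** 2))
    psnr = float(10 * np.log10(255.0 ** 2 / mse)) if mse > 0 else float("inf")
    # Full SSIM map, then averaged over non-edited pixels only.
    _, ssim_map = structural_similarity(pred, gt, full=True, data_range=255)
    return {"masked_mse": mse,
            "masked_psnr": psnr,
            "masked_ssim": float(ssim_map[keep].mean())}
```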
B. MLLM-Based Semantic Metrics (GPT-4o, 0–5 discrete score per dimension):
- Instruction Following (IF): compliance with textual command
- Text Accuracy (TA): exactness of character rendering
- Visual Consistency (VC): fidelity to local font, color, noise, and illumination
- Layout Preservation (LP): absence of distortion in non-target regions; valid spatial placement of text
- Semantic Expectation (SE): novel metric probing cross-modal, world-knowledge, and multi-step reasoning; for instances with dense semantic coupling, auxiliary Knowledge Prompts are included as explicit reasoning chains
The final "Overall" semantic score is the sum of the five components, Overall = IF + TA + VC + LP + SE, giving a range of [0, 25]. Human–MLLM agreement was validated via Pearson correlation on a 50-sample subset.
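The aggregation itself is a simple sum of the five per-dimension judge scores, as sketched below; how GPT-4o's rubric output is prompted for and parsed is abstracted away, since the exact judge template is not reproduced here.

```python
# IF = instruction following, TA = text accuracy, VC = visual consistency,
# LP = layout preservation, SE = semantic expectation (each scored 0-5).
DIMENSIONS = ("IF", "TA", "VC", "LP", "SE")

def overall_score(judge_scores: dict[str, int]) -> int:
    """Sum the five 0-5 rubric scores into the Overall score (0-25)."""
    assert set(judge_scores) == set(DIMENSIONS)
    assert all(0 <= s <= 5 for s in judge_scores.values())
    return sum(judge_scores[d] for d in DIMENSIONS)

print(overall_score({"IF": 4, "TA": 3, "VC": 5, "LP": 4, "SE": 2}))  # -> 18
```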
4. Reasoning-Intensive Editing and Failure Analysis
TextEditBench's primary innovation lies in systematically stress-testing models on editing scenarios requiring integrated semantic and geometric reasoning. Some canonical examples (the underlying arithmetic is spelled out in the sketch after this list):
- Date arithmetic: e.g., "Move calendar event 'April 10' forward by four days" (requires parsing, logic, rewriting, and geometric re-blending)
- Arithmetic updates: e.g., "Change 'Total: $45' for 1 person to 2 people" (requires extracting number, multiplying, updating text, and matching style)
- Physical transformations: e.g., "Rotate label from horizontal to vertical on a curved cup surface," demanding accurate warping and consistent lighting
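The explicit reasoning these edits require is shown below with standard-library arithmetic, using the numbers from the first two examples; a successful editor must perform the equivalent computation implicitly before rendering the new text.

```python
from datetime import date, timedelta

# Date arithmetic: "Move calendar event 'April 10' forward by four days."
event = date(2025, 4, 10)                    # the year is arbitrary for the example
moved = event + timedelta(days=4)
print(f"{moved.strftime('%B')} {moved.day}")  # -> April 14

# Arithmetic update: "Change 'Total: $45' for 1 person to 2 people."
per_person = 45
print(f"Total: ${per_person * 2}")            # -> Total: $90
```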
The evaluation explicitly distinguishes between:
- Physical plausibility: respecting existing image geometry, surface distortion, and illumination
- Semantic consistency: preserving or correcting all logic relations between textual, numerical, and visual elements (e.g., bar chart labels, pricing tables)
Error analysis reveals two main failure patterns:
- Correct semantic interpretation, but mislocalization—wrong region edited
- Correct region selection, but visual blending or logical integration fails (e.g., inconsistent font/shading, arithmetic errors)
5. Experimental Protocol and Baseline Performance
TextEditBench evaluated eleven contemporary editing systems, encompassing both open-source (e.g., Step1X-Edit, MagicBrush, Emu3.5, FLUX.1-Kontext-dev, Qwen-Image-Edit) and proprietary (Google NanoBanana, Seedream) models. Experiments were conducted on 8 × Tesla V100 with default hyperparameters, across two splits:
- Synthetic (Canva subset, with ground-truth for quantitative assessment)
- Real-world (unpaired input images)
Results, summarized below, distinguish pixel-fidelity limits from reasoning constraints:
| Model | Best SSIM (Synthetic) | Best LPIPS (Synthetic) | Overall Semantic (Synthetic) | Overall Semantic (Real) |
|---|---|---|---|---|
| FLUX.1-Kontext-dev | 0.906 | — | — | — |
| NanoBanana | 0.904 | 0.036 | 16.54 | 18.22 |
| Qwen-Image-Edit | — | 0.039 | 16.58 | 18.70 |
| Seedream | — | — | 14.90 | 18.54 |
| Mean (all models) | 0.850 | — | 10.82 | 13.01 |
Layout Preservation is the highest-performing semantic dimension across all models (mean ~3.12/5), indicating spatial localization is relatively mature. Semantic Expectation is the lowest (mean ~1.57/5 synthetic), confirming that reasoning-aware editing remains a critical open challenge.
In an ablation experiment, introducing explicit Chain-of-Thought ("think") prompting (Step1X-Edit-Think) reduced synthetic MSE from ~983 to 584 and improved GPT-IF from 1.40 to 2.28.
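A sketch of what such a "think" wrapper could look like in practice; the wording below is an illustrative assumption, not Step1X-Edit-Think's actual prompt template.

```python
def with_think_prompt(instruction: str) -> str:
    """Hypothetical chain-of-thought wrapper: ask the editing model to reason
    about the required text change before rendering it."""
    return (
        "Think step by step before editing: locate the target text region, "
        "derive the exact replacement string (including any date or price "
        "arithmetic), then apply the edit while matching the original font, "
        "perspective, and lighting.\n"
        f"Instruction: {instruction}"
    )

print(with_think_prompt("Change 'Total: $45' for 1 person to 2 people."))
```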
6. Conclusions, Challenges, and Outlook
TextEditBench exposes the persistent gap in model capabilities: while contemporary systems can reliably execute local word swaps or basic stylistic alterations, they consistently fail on operations requiring cross-element reasoning, logic, or nuanced physical integration. Current attention and decoding architectures are proficient at blending generated glyphs into arbitrary backgrounds but lack mechanisms for structured causal or semantic planning.
Identified open problems include:
- Spatial disentanglement: preventing "ghosting" or distortion when relocating or warping text
- Cross-element dependencies: ensuring updates to logical relationships (e.g., updating "per person" labels alongside total prices)
- Physical realism: preserving implicit properties like sub-pixel noise and surface gloss under generation constraints
Planned extensions for the benchmark encompass video insertion, animated layout edits, and the integration of differentiable layout engines or 3D scene analysis to enforce geometric constraints. Incorporating sociocultural knowledge graphs for factually grounded edits and human-in-the-loop fine-tuning with explicit reasoning chains are also identified as key future directions.
7. Position Relative to Related Benchmarks
TextEditBench distinguishes itself from generic text-to-image editing or inpainting benchmarks (e.g., EditBench (Wang et al., 2022), EBench-18K (Xu et al., 22 Jul 2025), IE-Bench (Sun et al., 17 Jan 2025)) by targeting the domain-specific intersection of text, vision, and structured reasoning—where editing success is predicated not only on local pixel accuracy but on the global logical and contextual integrity of the image. By introducing carefully annotated, reasoning-intensive scenarios and robust, multi-dimensional metrics, TextEditBench establishes a rigorous testing ground and reference for future multimodal editing research (Gui et al., 18 Dec 2025).