Benchmark Data for Text Removal
- Benchmark Data for Text Removal refers to the collection of datasets, annotation methods, and evaluation metrics used to train and evaluate systems that remove unwanted text from visual and linguistic content.
- It includes synthetic overlays, real-world scene images, document-centric corpora, and linguistic benchmarks to test diverse text removal scenarios.
- Evaluation protocols combine pixel-level accuracy with perceptual metrics to measure artifact removal and ensure high-quality inpainting outcomes.
Text removal, defined as the excision or inpainting of unwanted text from visual or linguistic content, is a core operation for document redaction, scene understanding, privacy protection, and digital media editing. Benchmark datasets, protocols, and metrics for text removal serve as the empirical foundation for advancing models in image inpainting, visual text localization, segmentation, and linguistic disfluency elimination. Datasets span pixel-level annotation of text in images (scene or document), textbox and stroke masks, paired “text-on/text-off” imagery, and syntactic tagging in transcript corpora. A rigorous benchmark must provide artifact-free ground truth, statistically representative content, and evaluation metrics that measure both completeness and perceptual fidelity. The following sections survey the benchmark landscape across domain-specialized corpora, annotation methodologies, mask modeling, evaluation metrics, and protocol innovations for robust text removal research.
1. Benchmark Dataset Foundations and Taxonomy
Visual text removal benchmarks can be broadly categorized into synthetic overlays, real-world scene datasets, document-centric corpora, and linguistic benchmarks addressing transcript-level disfluency.
- Synthetic Overlays: Oxford Synthetic Text Dataset (≈800,000 pairs), SCUT-Syn (8,800), and OTR (Overlay Text Removal; 74,716 train, 14,593 test) render text programmatically on photographs or segmented objects. Overlay text is typically axis-aligned, labeled at word/character level, and paired with unaltered clean backgrounds (Lee et al., 2022, Zdenek et al., 3 Oct 2025, Feng et al., 29 Feb 2024).
- Scene Text Datasets: SCUT-EnsText (real photos; 3,562 images), Flickr-ST (real-world scenes with 3,004 images, pixel-accurate masks and ground-truth clean backgrounds), and RW datasets constructed for robust evaluation (Lyu et al., 2023, Bian et al., 2020).
- Document Image Datasets: Benchmarks with dense OCR, as in (Nakada et al., 27 Nov 2025), feature high text density (500–800 characters/image), multi-column layouts, and manual “erase→refine” pipelines that generate artifact-free, text-free ground truth.
- Web Content Extraction: CleanEval (736 HTML pages) and ClueWeb12 provide block-level segmentations (main content vs. boilerplate) for sequence-labeling extraction (Vogels et al., 2018).
- Linguistic Disfluency Removal: DRES, built on Switchboard Treebank parses (1,155 conversations, 90,000 utterances), yields parallel disfluent/fluent text pairs and fine-grained syntactic tags (Teleki et al., 24 Sep 2025).
Key annotation modalities include pixel stroke masks, bounding boxes, character or word-level polygons, instance segmentation masks, and parse trees with categorical tags.
2. Mask Modeling for Text Removal Evaluation
Accurate modeling and generation of text masks are critical for both supervised training and robust evaluation. Two principal families of mask models are established (Nakada et al., 27 Nov 2025):
- OCR-Bounding Box Superellipse Masks: Given a set of OCR-detected boxes $\{b_i\}$ parametrized by chunk size $c$ ($c \in \{\text{character}, \text{word}, \text{paragraph}\}$), masks are the union of superellipses:
$$M_{\text{box}} = \bigcup_i \mathrm{SE}(b_i;\, c, s, r),$$
where $c$ encodes chunking, $s$ the scale expansion (1.0–1.5× the box), and $r$ the roundness ($r=0$: rectangle, $r=1$: ellipse).
- Pixel-Stroke Masks and Morphology: Binarized stroke masks $M_{\text{stroke}}$ are thresholded likelihood maps, morphologically dilated or eroded with kernel size $k$ and repetition count $n$:
$$M_{\text{stroke}} = \mathrm{morph}^{\,n}\big(\mathbb{1}[P(x,y) > \tau];\, k\big),$$
where $P(x,y)$ is the pixel-wise stroke likelihood, $\tau$ the binarization threshold, and $\mathrm{morph}$ denotes dilation or erosion with a $k \times k$ kernel applied $n$ times. A construction sketch for both families follows this list.
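A minimal sketch of both mask families is given below, assuming OCR boxes are available as (x, y, w, h) tuples; the mapping from the roundness parameter to the superellipse exponent is an illustrative assumption rather than the parametrization of the cited work.

```python
import numpy as np
import cv2

def superellipse_mask(shape, boxes, scale=1.3, roundness=0.0):
    """Union of superellipse regions over OCR boxes given as (x, y, w, h).

    roundness=0 approximates a rectangle, roundness=1 an ellipse; the
    exponent mapping below is an illustrative choice.
    """
    H, W = shape
    mask = np.zeros((H, W), dtype=np.uint8)
    ys, xs = np.mgrid[0:H, 0:W]
    p = 2.0 / max(roundness, 1e-3)  # large exponent -> near-rectangular profile
    for (x, y, w, h) in boxes:
        cx, cy = x + w / 2.0, y + h / 2.0
        a, b = scale * w / 2.0, scale * h / 2.0  # half-axes after scale expansion
        inside = (np.abs((xs - cx) / a) ** p + np.abs((ys - cy) / b) ** p) <= 1.0
        mask[inside] = 255
    return mask

def stroke_mask(likelihood, threshold=0.5, kernel_size=3, repeats=1, dilate=True):
    """Binarize a pixel-wise stroke likelihood map, then dilate or erode."""
    binary = (likelihood > threshold).astype(np.uint8) * 255
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    op = cv2.dilate if dilate else cv2.erode
    return op(binary, kernel, iterations=repeats)
```

With roundness near zero and a scale of 1.3–1.4, the first function reproduces the enlarged character-wise rectangles reported as empirically optimal below.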
Bayesian optimization over mask parameters targets a loss defined as the average Fréchet Inception Distance (FID) between inpainted outputs and ground truth. Exhaustive type-1/type-2 grid search (27/30 variants) demonstrates sensitivity to mask shape: character-wise rectangles scaled by ≈1.3–1.4× are empirically optimal (FID ≈ 32, roughly a 20% improvement over the baseline), while both minimal and oversized masks degrade quality.
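A sketch of this parameter search is given below, assuming scikit-optimize for the Bayesian optimization and two hypothetical helpers, render_inpainted and compute_fid, standing in for the inpainting model and an FID implementation (e.g., wrapping pytorch-fid); the actual pipeline of the cited work may differ.

```python
from skopt import gp_minimize
from skopt.space import Categorical, Real

# Hypothetical helpers assumed to exist in the surrounding pipeline:
#   render_inpainted(params) -> directory of inpainted images for the eval set
#   compute_fid(dir_a, dir_b) -> float (e.g., wrapping pytorch-fid)
from my_pipeline import render_inpainted, compute_fid  # placeholder module

GT_DIR = "ground_truth/"  # text-free reference images

search_space = [
    Categorical(["character", "word", "paragraph"], name="chunk"),
    Real(1.0, 1.5, name="scale"),      # box expansion factor
    Real(0.0, 1.0, name="roundness"),  # 0 = rectangle, 1 = ellipse
]

def objective(params):
    chunk, scale, roundness = params
    out_dir = render_inpainted({"chunk": chunk, "scale": scale, "roundness": roundness})
    return compute_fid(out_dir, GT_DIR)  # lower FID is better

result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print("best parameters:", result.x, "best FID:", result.fun)
```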
3. Dataset Annotation and Ground-Truth Generation Protocols
Protocols for text removal benchmarks aim to eliminate annotation artifacts, maximize semantic variety, and document mask provenance:
- Document-Image Ground Truth: Manual erasure of text via IOPaint, followed by Real-ESRGAN super-resolution refinement, iterated until no visually disturbing artifacts remain (Nakada et al., 27 Nov 2025).
- Synthetic Overlay: Programmatic placement of text on guaranteed “text-free” backgrounds (filtered with a scene-text detector; Liao et al., 2022), with randomization across fonts, colors, and locations, and semantic content generated by vision–language models (e.g., SMOL-VLM) with contextual prompts (Zdenek et al., 3 Oct 2025).
- Real-Scene Annotation: Human inpainting (Photoshop, Spot Healing) ensures paired “clean” images, verified for spot artifacts, bounding polygons, and pixel-accurate instance masks (Lyu et al., 2023).
- Pseudo-Stroke Mask Generation: Binary stroke-level masks derived by thresholding pixel-wise differences between paired text and text-free images ($|I_{\text{text}} - I_{\text{clean}}| > \tau$), used for standard supervision (Lee et al., 2022); a minimal sketch follows this list.
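A minimal sketch of the pseudo-stroke thresholding step mentioned above, assuming aligned uint8 RGB pairs and an illustrative threshold value:

```python
import numpy as np

def pseudo_stroke_mask(text_img, clean_img, threshold=25):
    """Binary stroke mask from the pixel-wise difference of a paired
    text-on / text-off image (uint8 arrays of identical shape)."""
    diff = np.abs(text_img.astype(np.int16) - clean_img.astype(np.int16))
    # Collapse per-channel differences to one map, then threshold.
    return (diff.max(axis=-1) > threshold).astype(np.uint8) * 255
```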
Automated verification via text detectors confirms the absence of residual/overwritten text. Comprehensive folder organization and released code facilitate reproducibility (e.g., https://github.com/naver/garnet, https://huggingface.co/datasets/cyberagent/OTR).
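The verification step can be scripted with any off-the-shelf detector; the sketch below uses EasyOCR purely as an illustrative stand-in for the detectors employed in the cited works, with hypothetical file paths and an assumed confidence threshold.

```python
import easyocr

# Flag any erased image in which a detector still finds text.
reader = easyocr.Reader(["en"], gpu=False)

def residual_text(image_path, min_confidence=0.3):
    detections = reader.readtext(image_path)  # [(bbox, text, confidence), ...]
    return [(text, conf) for (_, text, conf) in detections if conf >= min_confidence]

for path in ["erased/0001.png", "erased/0002.png"]:  # hypothetical paths
    leftovers = residual_text(path)
    if leftovers:
        print(f"{path}: residual text detected -> {leftovers}")
```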
4. Evaluation Metrics and Protocols
Benchmark evaluation employs both image similarity metrics and text-detection–based scoring, often restricted to inpainted regions corresponding to mask coverage:
- Image Reconstruction Metrics:
- Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{N}\sum_{p}\big(I_{\text{out}}(p) - I_{\text{gt}}(p)\big)^2$
- Peak Signal-to-Noise Ratio (PSNR): $\mathrm{PSNR} = 10 \log_{10}\!\big(\mathrm{MAX}^2 / \mathrm{MSE}\big)$
- Structural Similarity Index (SSIM): $\mathrm{SSIM}(x, y) = \dfrac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
- Multi-Scale SSIM (MSSIM) for robustness (Lee et al., 2022, Feng et al., 29 Feb 2024).
- Error Pixel Metrics: AGE (average absolute difference), pEPs (percentage error pixels), pCEPs (percentage connected clusters) (Zdenek et al., 3 Oct 2025).
- No-Reference IQA: QualiCLIP, LIQE, TOPIQ, HyperIQA for perceptual realism, often required to distinguish semantic artifacts from pixel-wise fidelity (Zdenek et al., 3 Oct 2025).
- Text Detection Metrics: CRAFT detector recall $R$, precision $P$, $F$-measure, and intersection-over-union (IoU) quantify completeness of removal (Bian et al., 2020, Lyu et al., 2023).
- CleanEval (web extraction): block-level Precision, Recall, F1, Accuracy (Vogels et al., 2018).
- DRES (speech): true/false positive/negative token counts and $F_1$-scores for fine-grained disfluency categories (Teleki et al., 24 Sep 2025).
- Benchmark Protocols: Evaluation must separate “easy” (background-only) and “hard” (object-overlay) splits, e.g., in OTR, and report all metrics both globally and restricted to mask regions (see the sketch after this list).
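A minimal sketch of mask-restricted scoring is shown below, assuming uint8 images, a binary mask, and scikit-image's SSIM implementation; computing PSNR from the mask-restricted MSE is one plausible way to operationalize the region-only protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity

def masked_metrics(output, target, mask, data_range=255.0):
    """MSE, PSNR, and SSIM restricted to the masked (inpainted) region.

    output, target: HxWx3 uint8 arrays; mask: HxW boolean array.
    """
    out = output.astype(np.float64)
    tgt = target.astype(np.float64)
    m = mask.astype(bool)

    mse = np.mean((out[m] - tgt[m]) ** 2)
    psnr = 10.0 * np.log10(data_range ** 2 / mse) if mse > 0 else float("inf")

    # Full SSIM map, averaged only over masked pixels.
    _, ssim_map = structural_similarity(
        output, target, channel_axis=-1, data_range=data_range, full=True
    )
    return {"mse": mse, "psnr": psnr, "ssim_mask": float(ssim_map[m].mean())}
```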
5. Comparative Analysis of Major Benchmarks
| Dataset | Modality | Size (Train/Test) | Ground Truth | Mask Level | Notable Annotations |
|---|---|---|---|---|---|
| Oxford Synthetic | Scene/Syn | ~800k / 30k | Clean backgrounds | Word/Stroke | Pixel masks, paired GT |
| SCUT-EnsText | Scene/Real | 2,749 / 813 | Inpainted images | Word | Binary masks |
| SCUT-Syn | Scene/Syn | 8,000 / 800 | Clean backgrounds | Word | Pixel masks |
| Flickr-ST | Scene/Real | 2,204 / 800 | Manual inpainting | Character/Word | PNG masks, XML polys |
| RW (Cascaded) | Scene/Real | 11,040 / 1,080 | Manual inpainting | Polygon/Stroke | Multi-lingual diversity |
| OTR | Scene/Syn | 74,716 / 14,593 | Original images | Word | JSON boxes, entropy |
| CleanEval | Web/HTML | 60 / 676 | Clean texts | DOM block | Algorithmic alignment |
| DRES | Linguistic | 72k / 9k / 9k utts (train/dev/test) | Fluent–disfluent pairs | Token/syntax | Treebank parse tags |
Flickr-ST is notable for providing character-level segmentation and multi-class XML annotation for generic scene text (Lyu et al., 2023). OTR delivers artifact-free overlay on complex backgrounds, with detailed entropy characterization and robust split organization (Zdenek et al., 3 Oct 2025). The RW dataset (Bian et al., 2020) combines ICDAR2017 frames, hand-captured and synthetic overlays, and multi-script annotations. CleanEval remains the standard for web-content block extraction (Vogels et al., 2018).
6. Empirical Findings and Practical Guidelines
- Mask Profile Sensitivity: Even small perturbations in mask shape (profile, scale, chunk granularity) can degrade inpainting quality (FID range 40–136 across variants); minimum-cover masks are suboptimal (Nakada et al., 27 Nov 2025).
- Character-wise rectangle masks enlarged 30–40% over OCR bounding boxes yield the best results (FID≈32).
- Dilation is preferred over erosion for pixel-stroke masks.
- Annotation Quality: Manual inpainting requires iterative Real-ESRGAN-based refinement to produce artifact-free backgrounds.
- Scene Complexity: Object-aware placement and high background entropy (up to 6.96 bits in OTR) are essential for realistic evaluation (Zdenek et al., 3 Oct 2025); a minimal entropy-computation sketch follows the recommendations below.
- Evaluation Metrics: PSNR/SSIM, while standard, may not capture perceptual realism—complementary NR-IQA is advised.
- Recommendations:
1. Mask at character level; scale up rectangles 1.3×–1.4×; avoid rounding.
2. Separate training/evaluation for synthetic and real splits; document font/language distributions.
3. Leverage instance segmentation (multi-category masks) where possible.
4. Isolate evaluation to the mask/region-of-interest, not only global metrics.
5. Release annotation scripts, folder structure, and preprocessing protocols for reproducibility.
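The background-entropy figures quoted above (e.g., up to 6.96 bits for OTR) are Shannon entropies of an intensity distribution; the sketch below computes one plausible variant over the grayscale histogram, and the cited dataset may use a different formulation (e.g., per channel or patch-wise).

```python
import numpy as np
import cv2

def image_entropy(image_path):
    """Shannon entropy (bits) of the grayscale intensity histogram,
    a simple proxy for background complexity."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```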
7. Future Directions and Benchmark Extensions
A plausible implication is that further benchmark advances should target:
- Artifact minimization: Fully synthetic overlays, automated verification by multiple detectors, and semantic ground-truth consistency.
- Annotation richness: Multi-level (token, word, character, polygon) synoptic labeling and XML-based instance associations.
- Generalization: Inclusion of diverse backgrounds, scripts, font styles, and domain-robust splits for out-of-domain evaluation.
- Metric expansion: Adoption of perceptual quality metrics alongside traditional pixel-wise and structural scores.
- Reproducibility: Provision of open-source code, annotation artifacts, and full preprocessing pipelines.
By consolidating these practices, text removal benchmarks facilitate both rigorous model comparison and detailed error analysis, accelerating research across document, scene, web, and spoken-language domains.