TEXTS-Diff: Text-Aware Diffusion Model
- The paper demonstrates that TEXTS-Diff achieves state-of-the-art performance, improving OCR accuracy by +4.4 pp over the strongest existing diffusion-based SR baseline on the Real-Texts test split.
- It employs a two-stage attention mechanism that fuses global semantic cues and precise text region features to accurately restore text and background details.
- The method integrates composite losses including fidelity, perceptual, and edge-aware text-region losses to ensure robust restoration across multilingual text scenarios.
The TEXTS-Aware Diffusion Model (TEXTS-Diff) is a diffusion-based framework for real-world text image super-resolution (SR), explicitly designed to restore both overall scene fidelity and text legibility in natural images subject to complex real-world degradations. Combining a two-stage attention mechanism for text guidance with a purpose-built real-world dataset, TEXTS-Diff demonstrates state-of-the-art reconstruction and generalization across challenging multilingual, text-rich scenarios (He et al., 24 Jan 2026).
1. Diffusion-Based Super-Resolution Framework
TEXTS-Diff operates within the denoising diffusion probabilistic model (DDPM) paradigm, targeting the ill-posed SR task where text regions often suffer substantial information loss during low-resolution image formation. The model adopts a one-step diffusion framework inspired by OSEDiff, in which the forward noising process corrupts a clean high-resolution (HR) image $x_0$ under a Gaussian schedule with cumulative coefficient $\bar{\alpha}_t$:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$$

and the reverse process, parameterized by network weights $\theta$, inverts this corruption in a single learned step:

$$\hat{x}_0 = G_\theta(x_T, c),$$

where $c$ denotes the conditioning signals described below. This simplification (i.e., setting $T = 1$) prioritizes computational efficiency and remains effective for text image SR when guided by strong semantic and region cues.
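The one-step forward/reverse relationship can be sketched numerically. The snippet below is a minimal numpy illustration, not the paper's implementation: it corrupts a stand-in latent with the DDPM forward process and shows that, given a perfect noise prediction, a single reverse step recovers the clean signal exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, alpha_bar, eps):
    """DDPM forward process: corrupt clean x0 with Gaussian noise
    under cumulative schedule coefficient alpha_bar."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def one_step_restore(xt, alpha_bar, eps_pred):
    """One-step reverse: recover x0 from xt given a noise prediction."""
    return (xt - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

x0 = rng.standard_normal((8, 8))    # stand-in for a clean HR latent
eps = rng.standard_normal((8, 8))   # true noise
alpha_bar = 0.3                     # schedule value at the final step

xt = forward_noise(x0, alpha_bar, eps)
# With a perfect noise prediction, the single reverse step is exact:
x0_hat = one_step_restore(xt, alpha_bar, eps)
print(np.allclose(x0_hat, x0))  # True
```

In practice the network's noise prediction is imperfect, which is why the strong semantic and region conditioning described next matters for text fidelity.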
2. Text-Aware Feature Guidance Mechanism
Central to the TEXTS-Diff approach is a cascaded feature guidance pipeline, leveraging both global semantic and precise text region information:
- Feature Extraction: Given a degraded low-resolution (LR) input $x_{LR}$, a feature extractor computes visual features $F$ with reduced spatial dimensions.
- Abstract Textual Perception: The single-token prompt “TEXTS” is embedded using a CLIP text encoder, yielding a global abstract representation $c_{\text{TEXTS}}$, which is fused into the visual features via cross-attention:

$$F' = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $Q$ is computed from $F$; $K$, $V$ from $c_{\text{TEXTS}}$.
- Concrete Text-Region Perception: A pretrained text detection model (DBNet++) supplies high-level region features $F_{\text{det}}$. These are attended to by $F'$ to produce $F''$, which encodes both abstract and local text information:

$$F'' = \mathrm{softmax}\!\left(\frac{Q'K'^\top}{\sqrt{d}}\right)V',$$

with $Q'$ from $F'$ and $K'$, $V'$ from $F_{\text{det}}$.
- Conditioning the Diffusion U-Net: The diffusion U-Net is conditioned on:
  - the latent $z_{LR}$ from a frozen VAE encoder,
  - a semantic prompt feature $c_p$ from the Prompt Extractor,
  - $F''$, injected at custom cross-attention blocks after each down block and in the U-Net middle block.

The VAE decoder $D$ then maps the denoised latent $\hat{z}_0$ to the super-resolved image:

$$\hat{x}_{SR} = D(\hat{z}_0).$$
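The two-stage guidance can be sketched with single-head scaled dot-product cross-attention. This is a simplified numpy illustration: learned projection matrices are omitted, and the residual (additive) fusion of attention output into the feature stream is an assumption, not a detail stated in the paper.

```python
import numpy as np

def cross_attention(queries_src, kv_src, d=16):
    """Single-head scaled dot-product cross-attention; learned Q/K/V
    projections are omitted for brevity (identity projections assumed)."""
    q, k, v = queries_src, kv_src, kv_src
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
F      = rng.standard_normal((64, 16))  # visual features (flattened H'W' x d)
c_text = rng.standard_normal((1, 16))   # CLIP embedding of the "TEXTS" prompt
F_det  = rng.standard_normal((5, 16))   # text-region features from the detector

# Stage 1: fuse the abstract text cue (residual fusion is an assumption here)
F_prime = F + cross_attention(F, c_text)
# Stage 2: fuse concrete region features into the text-aware stream
F_double = F_prime + cross_attention(F_prime, F_det)
print(F_double.shape)  # (64, 16)
```

The resulting $F''$-like tensor keeps the spatial layout of the visual features while carrying both global and region-level text information, matching the injection points described above.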
3. Loss Functions and Objective
TEXTS-Diff is optimized end-to-end (with a frozen VAE) using a composite loss that encourages both global quality and text region correctness:
- L₂ Fidelity Loss: $\mathcal{L}_{2} = \|\hat{x}_{SR} - x_{HR}\|_2^2$.
- LPIPS Perceptual Loss: $\mathcal{L}_{\text{LPIPS}} = \mathrm{LPIPS}(\hat{x}_{SR}, x_{HR})$.
- Edge-Aware Text-Region Loss: $\mathcal{L}_{\text{edge}} = \|M \odot (E(\hat{x}_{SR}) - E(x_{HR}))\|_1$, where $E(\cdot)$ denotes an edge-extraction operator and $M$ is a ground-truth text mask.
- OCR-Text Destylization Modeling (ODM) Loss: $\mathcal{L}_{\text{ODM}}$ aligns OCR-derived features of the restored text with those of the ground truth, providing stroke-level supervision.

The total loss is

$$\mathcal{L} = \mathcal{L}_{2} + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{edge}}\,\mathcal{L}_{\text{edge}} + \lambda_{\text{ODM}}\,\mathcal{L}_{\text{ODM}},$$

with scalar weights $\lambda_{\text{LPIPS}}, \lambda_{\text{edge}}, \lambda_{\text{ODM}}$ balancing the terms.
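A simplified sketch of the masked, edge-aware part of the objective is shown below. This is illustrative only: LPIPS and the ODM loss require pretrained networks and are omitted, and the finite-difference edge operator stands in for the paper's (unspecified) edge extractor.

```python
import numpy as np

def edge_map(x):
    """Finite-difference edge magnitude; a stand-in for the paper's
    edge-extraction operator, whose exact form is not given here."""
    gx = np.diff(x, axis=1, prepend=x[:, :1])
    gy = np.diff(x, axis=0, prepend=x[:1, :])
    return np.abs(gx) + np.abs(gy)

def composite_loss(sr, hr, mask, lam_edge=1.0):
    """L2 fidelity plus a text-mask-weighted edge loss (LPIPS/ODM omitted)."""
    l2 = np.mean((sr - hr) ** 2)
    l_edge = np.mean(mask * np.abs(edge_map(sr) - edge_map(hr)))
    return l2 + lam_edge * l_edge

rng = np.random.default_rng(0)
hr = rng.random((16, 16))
sr = hr + 0.05 * rng.standard_normal((16, 16))
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0  # ground-truth text region

print(composite_loss(hr, hr, mask))       # 0.0 for a perfect restoration
print(composite_loss(sr, hr, mask) > 0)   # True
```

Restricting the edge term to the mask $M$ concentrates the stroke-level penalty on text regions while leaving the background governed by the global fidelity and perceptual terms.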
4. Real-Texts Dataset Construction
To address the scarcity and domain gap of prior datasets, Real-Texts is introduced:
- Image Collection: 72,476 real-world photographs containing text are collected across diverse scenarios.
- Text Detection & Recognition: PP-OCRv5 performs detection and recognition; only images with valid OCR results are retained.
- Region Cropping: Detected text bounding boxes yield 74,035 patches.
- Quality Filtering: VisualQuality-R1 score filtering, followed by manual verification, yields 34,875 high-quality LR–HR pairs (33,875 train, 1,000 test).
- Degradation Pipeline: Real-ESRGAN synthetic degradations are applied to generate LR from HR.
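The Real-ESRGAN pipeline randomizes blur kernels, resampling modes, noise, and JPEG compression; the toy numpy sketch below captures only its basic blur → downsample → noise structure, for illustration rather than fidelity to the actual implementation.

```python
import numpy as np

def degrade(hr, scale=4, noise_sigma=0.05, rng=None):
    """Toy stand-in for a Real-ESRGAN-style degradation: 3x3 box blur
    (circular shifts, so borders wrap), naive downsampling, and additive
    Gaussian noise. The real pipeline randomizes kernels, resamplers,
    and adds JPEG compression artifacts."""
    rng = rng or np.random.default_rng(0)
    blurred = sum(np.roll(np.roll(hr, i, 0), j, 1)
                  for i in (-1, 0, 1) for j in (-1, 0, 1)) / 9.0
    lr = blurred[::scale, ::scale]                    # naive downsampling
    lr = lr + noise_sigma * rng.standard_normal(lr.shape)
    return np.clip(lr, 0.0, 1.0)

hr = np.random.default_rng(1).random((64, 64))
lr = degrade(hr)
print(lr.shape)  # (16, 16)
```

Applying such a pipeline to the curated HR crops yields the paired LR–HR training data without requiring co-registered real LR captures.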
Dataset Statistics
| Aspect | Value |
|---|---|
| Total pairs | 34,875 (33,875 train, 1,000 test) |
| Text lines | 136,136 (76K CN, 38K EN, 19K other) |
| Scene types | Indoor, street, posters, natural |
| Font styles | Printed, calligraphic, artistic |
This dataset spans varied HR resolutions, with text in Chinese, English, and mixed scripts (He et al., 24 Jan 2026).
5. Training Pipeline and Hyperparameters
- Backbone: Stable Diffusion 2.1 with frozen VAE encoder/decoder
- Fine-tuning: LoRA on U-Net weights only
- Optimizer: AdamW (weight decay 0.01)
- Batch size: 16; one-step reverse pass (T=1)
- Data Augmentation: Random horizontal flips only
- Additional Data: Real-CE crops (14,312); LSDIR images (20,000, no text annotation)
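The LoRA fine-tuning strategy above keeps the pretrained U-Net weights frozen and trains only a low-rank update per adapted layer. A minimal numpy sketch of the mechanism (rank and scaling values here are illustrative, not the paper's settings):

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha):
    """LoRA: the frozen weight W is adapted by a trainable low-rank
    update B @ A, scaled by alpha / r. Only A and B receive gradients."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 32, 4
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

# With B initialized to zero, the adapted layer starts identical to the
# pretrained one, so fine-tuning begins from the original model's behavior:
print(np.allclose(lora_effective_weight(W, A, B, alpha=8), W))  # True
```

This is why LoRA fine-tuning on the U-Net preserves the Stable Diffusion 2.1 prior while adding only a small number of trainable parameters.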
6. Evaluation, Ablation, and Insights
Quantitative Results on Real-Texts (Test Split)
| Method | OCR-A ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓ |
|---|---|---|---|---|---|---|
| Real-ESRGAN | 0.3764 | 22.91 | 0.7265 | 0.2794 | 0.1878 | 50.12 |
| StableSR (4) | 0.3511 | 23.69 | 0.7481 | 0.2396 | 0.1502 | 36.23 |
| OSEDiff (1) | 0.3049 | 22.91 | 0.7160 | 0.2291 | 0.1476 | 38.21 |
| TEXTS-Diff (1) | 0.3951 | 24.49 | 0.7499 | 0.1801 | 0.1174 | 27.70 |
TEXTS-Diff surpasses all baselines in OCR accuracy, perceptual similarity (LPIPS, DISTS), and FID, improving OCR accuracy by +9.0 pp over the one-step OSEDiff baseline and +4.4 pp over the closest multi-step method (StableSR).
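The paper's exact OCR-A definition is not reproduced in this summary; a common proxy for such metrics is edit-distance-based character accuracy between the OCR transcript of the SR output and the ground-truth text. A minimal pure-Python sketch (helper names are ours):

```python
def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(pred, gt):
    """Normalized character accuracy: 1 - edit_distance / len(gt),
    clipped at zero."""
    if not gt:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - levenshtein(pred, gt) / len(gt))

print(levenshtein("kitten", "sitting"))  # 3
print(char_accuracy("OPEN", "OPEN"))     # 1.0
```

Under such a metric, broken or hallucinated characters directly reduce the score, which is why stroke-level restoration matters beyond pixel fidelity.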
Qualitative and Ablation Insights
- TEXTS-Diff achieves clean text stroke recovery (even with challenging fonts), reduces hallucination and broken character artifacts, and maintains background textures with minimal over-smoothing.
- Ablation studies show that the abstract (“TEXTS” prompt) and concrete (text-detection guidance) modules are complementary: removing either causes an OCR-A drop of up to 1.9 pp.
- The two-stage cross-attention pattern mirrors global-to-local reading, supporting both detection (“text here”) and fine structuring (“what letters?”).
Loss Ablation Impact
The combination of ODM and edge-aware losses with conventional fidelity and perceptual objectives is instrumental in aligning stroke-level output with real OCR-derived guidance, instead of simply hallucinating plausible characters.
7. Comparative Context and Future Directions
Compared to prior diffusion-based text-aware models such as TextDiffuser (Chen et al., 2023), which utilize explicit layout planning via Transformer-based modules and character-level segmentation masks, TEXTS-Diff’s real-world super-resolution strategy employs dual-stage guidance—semantically from CLIP-based prompts and concretely from region proposals—without explicit synthetic layout planning. Both share the use of CLIP for text conditioning and demonstrate the necessity of paired, large-scale real data for legibility-oriented SR evaluation.
Limitations include loss of multi-scale adaptation granularity due to the one-step design and reliance on external (not fully end-to-end) text region proposals. Extending to more scripts (e.g., Arabic, Devanagari) and to vertical text, as well as integrating differentiable, learned text detection modules, represent plausible future research directions.
In summary, TEXTS-Diff represents an advance in real-world text image super-resolution, achieving superior recognition and perceptual quality by leveraging both abstract concept prompts and precise segmentation-driven feature modulation within a diffusion framework (He et al., 24 Jan 2026).