TEXTS-Diff: Text-Aware Diffusion Model
- The paper demonstrates that TEXTS-Diff achieves state-of-the-art performance, improving OCR accuracy by +4.4 pp over the strongest existing diffusion-based SR baseline on the Real-Texts test split.
- It employs a two-stage attention mechanism that fuses global semantic cues and precise text region features to accurately restore text and background details.
- The method integrates composite losses including fidelity, perceptual, and edge-aware text-region losses to ensure robust restoration across multilingual text scenarios.
The TEXTS-Aware Diffusion Model (TEXTS-Diff) is a diffusion-based framework for real-world text image super-resolution (SR), explicitly designed to restore both overall scene fidelity and text legibility in natural images subject to complex real-world degradations. Combining a two-stage attention mechanism for text guidance with a purpose-built real-world dataset, TEXTS-Diff demonstrates state-of-the-art reconstruction and generalization across challenging multilingual, text-rich scenarios (He et al., 24 Jan 2026).
1. Diffusion-Based Super-Resolution Framework
TEXTS-Diff operates within the denoising diffusion probabilistic model (DDPM) paradigm, targeting the ill-posed SR task where text regions often suffer substantial information loss during low-resolution image formation. The model adopts a one-step diffusion framework inspired by OSEDiff, in which the forward noising process corrupts a clean high-resolution (HR) image $x_0$ under a Gaussian schedule with cumulative coefficient $\bar{\alpha}_t$:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$$

and the reverse process, parameterized by network weights $\theta$, inverts this corruption in a single learned step:

$$\hat{x}_0 = G_\theta(x_T, c),$$

where $c$ denotes the conditioning signals described below. This simplification (i.e., setting $T = 1$) prioritizes computational efficiency and remains effective for text image SR when guided by strong semantic and region cues.
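The one-step forward/reverse relationship can be sketched numerically. The snippet below is a minimal numpy illustration, not the paper's implementation: it corrupts a stand-in latent with the DDPM forward process and shows that, given a perfect noise prediction, a single reverse step recovers the clean signal exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, alpha_bar, eps):
    """DDPM forward process: corrupt clean x0 with Gaussian noise
    under cumulative schedule coefficient alpha_bar."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def one_step_restore(xt, alpha_bar, eps_pred):
    """One-step reverse: recover x0 from xt given a noise prediction."""
    return (xt - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

x0 = rng.standard_normal((8, 8))    # stand-in for a clean HR latent
eps = rng.standard_normal((8, 8))   # true noise
alpha_bar = 0.3                     # schedule value at the final step

xt = forward_noise(x0, alpha_bar, eps)
# With a perfect noise prediction, the single reverse step is exact:
x0_hat = one_step_restore(xt, alpha_bar, eps)
print(np.allclose(x0_hat, x0))  # True
```

In practice the network's noise prediction is imperfect, which is why the strong semantic and region conditioning described next matters for text fidelity.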
2. Text-Aware Feature Guidance Mechanism
Central to the TEXTS-Diff approach is a cascaded feature guidance pipeline, leveraging both global semantic and precise text region information:
- Feature Extraction: Given a degraded low-resolution (LR) input $x_{LR}$, a feature extractor computes visual features $F$ with reduced spatial dimensions.
- Abstract Textual Perception: The single-token prompt “TEXTS” is embedded using a CLIP text encoder, yielding a global abstract representation $c_{\text{TEXTS}}$, which is fused into the visual features via cross-attention:

$$F' = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $Q$ is computed from $F$; $K$, $V$ from $c_{\text{TEXTS}}$.
- Concrete Text-Region Perception: A pretrained text detection model (DBNet++) supplies high-level region features $F_{\text{det}}$. These are attended to by $F'$ to produce $F''$, which encodes both abstract and local text information:

$$F'' = \mathrm{softmax}\!\left(\frac{Q'K'^\top}{\sqrt{d}}\right)V',$$

with $Q'$ from $F'$ and $K'$, $V'$ from $F_{\text{det}}$.
- Conditioning the Diffusion U-Net: The diffusion U-Net is conditioned on:
  - the latent $z_{LR}$ from a frozen VAE encoder,
  - a semantic prompt feature $c_p$ from the Prompt Extractor,
  - $F''$, injected at custom cross-attention blocks after each down block and in the U-Net middle block.

The VAE decoder $D$ then maps the denoised latent $\hat{z}_0$ to the super-resolved image:

$$\hat{x}_{SR} = D(\hat{z}_0).$$
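The two-stage guidance can be sketched with single-head scaled dot-product cross-attention. This is a simplified numpy illustration: learned projection matrices are omitted, and the residual (additive) fusion of attention output into the feature stream is an assumption, not a detail stated in the paper.

```python
import numpy as np

def cross_attention(queries_src, kv_src, d=16):
    """Single-head scaled dot-product cross-attention; learned Q/K/V
    projections are omitted for brevity (identity projections assumed)."""
    q, k, v = queries_src, kv_src, kv_src
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
F      = rng.standard_normal((64, 16))  # visual features (flattened H'W' x d)
c_text = rng.standard_normal((1, 16))   # CLIP embedding of the "TEXTS" prompt
F_det  = rng.standard_normal((5, 16))   # text-region features from the detector

# Stage 1: fuse the abstract text cue (residual fusion is an assumption here)
F_prime = F + cross_attention(F, c_text)
# Stage 2: fuse concrete region features into the text-aware stream
F_double = F_prime + cross_attention(F_prime, F_det)
print(F_double.shape)  # (64, 16)
```

The resulting $F''$-like tensor keeps the spatial layout of the visual features while carrying both global and region-level text information, matching the injection points described above.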
3. Loss Functions and Objective
TEXTS-Diff is optimized end-to-end (with a frozen VAE) using a composite loss that encourages both global quality and text region correctness:
- L₂ Fidelity Loss: $\mathcal{L}_{2} = \|\hat{x}_{SR} - x_{HR}\|_2^2$.
- LPIPS Perceptual Loss: $\mathcal{L}_{\text{LPIPS}} = \mathrm{LPIPS}(\hat{x}_{SR}, x_{HR})$.
- Edge-Aware Text-Region Loss: $\mathcal{L}_{\text{edge}} = \|M \odot (E(\hat{x}_{SR}) - E(x_{HR}))\|_1$, where $E(\cdot)$ denotes an edge-extraction operator and $M$ is a ground-truth text mask.
- OCR-Text Destylization Modeling (ODM) Loss: $\mathcal{L}_{\text{ODM}}$ aligns OCR-derived features of the restored text with those of the ground truth, providing stroke-level supervision.

The total loss is

$$\mathcal{L} = \mathcal{L}_{2} + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{edge}}\,\mathcal{L}_{\text{edge}} + \lambda_{\text{ODM}}\,\mathcal{L}_{\text{ODM}},$$

with scalar weights $\lambda_{\text{LPIPS}}, \lambda_{\text{edge}}, \lambda_{\text{ODM}}$ balancing the terms.
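A simplified sketch of the masked, edge-aware part of the objective is shown below. This is illustrative only: LPIPS and the ODM loss require pretrained networks and are omitted, and the finite-difference edge operator stands in for the paper's (unspecified) edge extractor.

```python
import numpy as np

def edge_map(x):
    """Finite-difference edge magnitude; a stand-in for the paper's
    edge-extraction operator, whose exact form is not given here."""
    gx = np.diff(x, axis=1, prepend=x[:, :1])
    gy = np.diff(x, axis=0, prepend=x[:1, :])
    return np.abs(gx) + np.abs(gy)

def composite_loss(sr, hr, mask, lam_edge=1.0):
    """L2 fidelity plus a text-mask-weighted edge loss (LPIPS/ODM omitted)."""
    l2 = np.mean((sr - hr) ** 2)
    l_edge = np.mean(mask * np.abs(edge_map(sr) - edge_map(hr)))
    return l2 + lam_edge * l_edge

rng = np.random.default_rng(0)
hr = rng.random((16, 16))
sr = hr + 0.05 * rng.standard_normal((16, 16))
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0  # ground-truth text region

print(composite_loss(hr, hr, mask))       # 0.0 for a perfect restoration
print(composite_loss(sr, hr, mask) > 0)   # True
```

Restricting the edge term to the mask $M$ concentrates the stroke-level penalty on text regions while leaving the background governed by the global fidelity and perceptual terms.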
4. Real-Texts Dataset Construction
To address the scarcity and domain gap of prior datasets, Real-Texts is introduced:
- Image Collection: 72,476 real-world photographs containing text are collected across diverse scenarios.
- Text Detection & Recognition: PP-OCRv5 performs detection and recognition; only images with valid OCR results are retained.
- Region Cropping: Detected text bounding boxes yield 74,035 patches.
- Quality Filtering: VisualQuality-R1 score filtering, followed by manual verification, yields 34,875 high-quality LR–HR pairs (33,875 train, 1,000 test).
- Degradation Pipeline: Real-ESRGAN synthetic degradations are applied to generate LR from HR.
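The Real-ESRGAN pipeline randomizes blur kernels, resampling modes, noise, and JPEG compression; the toy numpy sketch below captures only its basic blur → downsample → noise structure, for illustration rather than fidelity to the actual implementation.

```python
import numpy as np

def degrade(hr, scale=4, noise_sigma=0.05, rng=None):
    """Toy stand-in for a Real-ESRGAN-style degradation: 3x3 box blur
    (circular shifts, so borders wrap), naive downsampling, and additive
    Gaussian noise. The real pipeline randomizes kernels, resamplers,
    and adds JPEG compression artifacts."""
    rng = rng or np.random.default_rng(0)
    blurred = sum(np.roll(np.roll(hr, i, 0), j, 1)
                  for i in (-1, 0, 1) for j in (-1, 0, 1)) / 9.0
    lr = blurred[::scale, ::scale]                    # naive downsampling
    lr = lr + noise_sigma * rng.standard_normal(lr.shape)
    return np.clip(lr, 0.0, 1.0)

hr = np.random.default_rng(1).random((64, 64))
lr = degrade(hr)
print(lr.shape)  # (16, 16)
```

Applying such a pipeline to the curated HR crops yields the paired LR–HR training data without requiring co-registered real LR captures.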
Dataset Statistics
| Aspect | Value |
|---|---|
| Total pairs | 34,875 (33,875 train, 1,000 test) |
| Text lines | 136,136 (76K CN, 38K EN, 19K other) |
| Scene types | Indoor, street, posters, natural |
| Font styles | Printed, calligraphic, artistic |
This dataset spans varied HR resolutions, with text in Chinese, English, and mixed scripts (He et al., 24 Jan 2026).
5. Training Pipeline and Hyperparameters
- Backbone: Stable Diffusion 2.1 with frozen VAE encoder/decoder
- Fine-tuning: LoRA on U-Net weights only
- Optimizer: AdamW (weight decay 0.01)
- Batch size: 16; one-step reverse pass (T=1)
- Data Augmentation: Random horizontal flips only
- Additional Data: Real-CE crops (14,312); LSDIR images (20,000, no text annotation)
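The LoRA fine-tuning strategy above keeps the pretrained U-Net weights frozen and trains only a low-rank update per adapted layer. A minimal numpy sketch of the mechanism (rank and scaling values here are illustrative, not the paper's settings):

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha):
    """LoRA: the frozen weight W is adapted by a trainable low-rank
    update B @ A, scaled by alpha / r. Only A and B receive gradients."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 32, 4
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

# With B initialized to zero, the adapted layer starts identical to the
# pretrained one, so fine-tuning begins from the original model's behavior:
print(np.allclose(lora_effective_weight(W, A, B, alpha=8), W))  # True
```

This is why LoRA fine-tuning on the U-Net preserves the Stable Diffusion 2.1 prior while adding only a small number of trainable parameters.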
6. Evaluation, Ablation, and Insights
Quantitative Results on Real-Texts (Test Split)
| Method | OCR-A ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓ |
|---|---|---|---|---|---|---|
| Real-ESRGAN | 0.3764 | 22.91 | 0.7265 | 0.2794 | 0.1878 | 50.12 |
| StableSR (4) | 0.3511 | 23.69 | 0.7481 | 0.2396 | 0.1502 | 36.23 |
| OSEDiff (1) | 0.3049 | 22.91 | 0.7160 | 0.2291 | 0.1476 | 38.21 |
| TEXTS-Diff (1) | 0.3951 | 24.49 | 0.7499 | 0.1801 | 0.1174 | 27.70 |
TEXTS-Diff surpasses all baselines in OCR accuracy, perceptual similarity (LPIPS, DISTS), and FID, improving OCR accuracy by +9.0 pp over the one-step OSEDiff baseline and +4.4 pp over the closest multi-step method (StableSR).
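The paper's exact OCR-A definition is not reproduced in this summary; a common proxy for such metrics is edit-distance-based character accuracy between the OCR transcript of the SR output and the ground-truth text. A minimal pure-Python sketch (helper names are ours):

```python
def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(pred, gt):
    """Normalized character accuracy: 1 - edit_distance / len(gt),
    clipped at zero."""
    if not gt:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - levenshtein(pred, gt) / len(gt))

print(levenshtein("kitten", "sitting"))  # 3
print(char_accuracy("OPEN", "OPEN"))     # 1.0
```

Under such a metric, broken or hallucinated characters directly reduce the score, which is why stroke-level restoration matters beyond pixel fidelity.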
Qualitative and Ablation Insights
- TEXTS-Diff achieves clean text stroke recovery (even with challenging fonts), reduces hallucination and broken character artifacts, and maintains background textures with minimal over-smoothing.
- Ablation studies show that the abstract (“TEXTS” prompt) and concrete (text-detection guidance) modules are complementary: removing either causes an OCR-A drop of up to 1.9 pp.
- The two-stage cross-attention pattern mirrors global-to-local reading, supporting both detection (“text here”) and fine structuring (“what letters?”).
Loss Ablation Impact
The combination of ODM and edge-aware losses with conventional fidelity and perceptual objectives is instrumental in aligning stroke-level output with real OCR-derived guidance, instead of simply hallucinating plausible characters.
7. Comparative Context and Future Directions
Compared to prior diffusion-based text-aware models such as TextDiffuser (Chen et al., 2023), which utilize explicit layout planning via Transformer-based modules and character-level segmentation masks, TEXTS-Diff’s real-world super-resolution strategy employs dual-stage guidance—semantically from CLIP-based prompts and concretely from region proposals—without explicit synthetic layout planning. Both share the use of CLIP for text conditioning and demonstrate the necessity of paired, large-scale real data for legibility-oriented SR evaluation.
Limitations include loss of multi-scale adaptation granularity due to the one-step design and reliance on external (not fully end-to-end) text region proposals. Extending to more scripts (e.g., Arabic, Devanagari) and to vertical text, as well as integrating differentiable, learned text detection modules, represent plausible future research directions.
In summary, TEXTS-Diff represents an advance in real-world text image super-resolution, achieving superior recognition and perceptual quality by leveraging both abstract concept prompts and precise segmentation-driven feature modulation within a diffusion framework (He et al., 24 Jan 2026).