
TEXTS-Diff: Text-Aware Diffusion Model

Updated 31 January 2026
  • The paper demonstrates that TEXTS-Diff achieves state-of-the-art performance with a +4.8 pp improvement in OCR accuracy over existing diffusion models.
  • It employs a two-stage attention mechanism that fuses global semantic cues and precise text region features to accurately restore text and background details.
  • The method integrates composite losses including fidelity, perceptual, and edge-aware text-region losses to ensure robust restoration across multilingual text scenarios.

The TEXTS-Aware Diffusion Model (TEXTS-Diff) is a diffusion-based framework for real-world text image super-resolution (SR), explicitly designed to restore both overall scene fidelity and text legibility in natural images subject to complex real-world degradations. Combining a two-stage attention mechanism for text guidance with a purpose-built real-world dataset, TEXTS-Diff demonstrates state-of-the-art reconstruction and generalization across complex multilingual and mixed-text scenarios (He et al., 24 Jan 2026).

1. Diffusion-Based Super-Resolution Framework

TEXTS-Diff operates within the denoising diffusion probabilistic model (DDPM) paradigm, targeting the ill-posed SR task where text regions often suffer substantial information loss during low-resolution image formation. The model adopts a one-step diffusion framework inspired by OSEDiff, where the forward noising process corrupts a clean high-resolution (HR) image $x_0$ using a Gaussian schedule $\{\beta_t\}_{t=1}^T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),$$

and the reverse process, parameterized by $p_\theta(x_{t-1} \mid x_t)$, inverts this corruption in a single learned step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right).$$

This simplification (i.e., $T = 1$) prioritizes computational efficiency and is effective for text image SR when guided by strong semantic and region cues.
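The forward noising step above can be sketched directly from its definition. This is a minimal numpy illustration of one application of $q(x_t \mid x_{t-1})$ on a toy array, not the paper's implementation; the array shape and $\beta_t$ value are arbitrary.

```python
import numpy as np

def forward_noise(x_prev, beta_t, rng):
    """One step of the DDPM forward process:
    x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 8, 8))  # toy 3-channel "image"
x1 = forward_noise(x0, beta_t=0.02, rng=rng)
```

With $\beta_t = 0$ the step is the identity, which is a convenient sanity check for the schedule's boundary behavior.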

2. Text-Aware Feature Guidance Mechanism

Central to the TEXTS-Diff approach is a cascaded feature guidance pipeline, leveraging both global semantic and precise text region information:

  • Feature Extraction: Given a degraded low-resolution (LR) input $I_{lr}$, a feature extractor computes $F_{lr}$ with reduced spatial dimensions ($h = H/8$, $w = W/8$).
  • Abstract Textual Perception: The single-token prompt “TEXTS” is embedded using a CLIP text encoder, yielding a global abstract representation $F_{text}$, which is fused into the visual features via cross-attention:

$$F_{abs} = \operatorname{softmax}\left(\frac{Q_{lr} K_a^\top}{\sqrt{d}}\right) V_a$$

where $Q_{lr}$ is computed from $F_{lr}$, and $K_a$, $V_a$ from $F_{text}$.

  • Concrete Text-Region Perception: A pretrained text detection model (DBNet++) provides high-level region features $F_{tdm}$. These are attended to by $F_{abs}$ to produce $F_{TEXTS}$, which encodes both abstract and local text information:

$$F_{TEXTS} = \operatorname{softmax}\left(\frac{Q_{abs} K_c^\top}{\sqrt{d}}\right) V_c$$
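The two attention stages can be sketched with the scaled dot-product formula above. This is a toy numpy illustration of the dataflow only: the token counts and channel width are arbitrary, random arrays stand in for real features, and the learned query/key/value projections of the actual model are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with d the channel dimension."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
n, m, d = 16, 4, 32                     # toy token counts / channels
F_lr   = rng.standard_normal((n, d))    # visual features from the LR image
F_text = rng.standard_normal((m, d))    # CLIP embedding of the "TEXTS" prompt
F_tdm  = rng.standard_normal((m, d))    # text-region features from DBNet++

# Stage 1 (abstract): visual queries attend to the prompt embedding
F_abs = cross_attention(F_lr, F_text, F_text)
# Stage 2 (concrete): F_abs queries attend to the detected region features
F_TEXTS = cross_attention(F_abs, F_tdm, F_tdm)
```

Cascading the two stages is what gives the global-to-local behavior: the prompt supplies a coarse "this is text" signal, and the region features then sharpen it per location.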

  • Conditioning the Diffusion U-Net: The diffusion U-Net is conditioned on:
    1. the latent $F_v$ from a frozen VAE encoder,
    2. a semantic prompt feature $F_p$ (Prompt Extractor),
    3. $F_{TEXTS}$, injected at custom cross-attention blocks after each down block and in the U-Net middle block.

The decoder then outputs the super-resolved image:

$$I_{sr} = \text{Decoder}\left(\text{U-Net}(F_v;\ F_p,\ F_{TEXTS})\right)$$

3. Loss Functions and Objective

TEXTS-Diff is optimized end-to-end (with a frozen VAE) using a composite loss that encourages both global quality and text region correctness:

  • L₂ Fidelity Loss: $L_{diff} = \lVert I_{sr} - I_{hr} \rVert_2^2$
  • LPIPS Perceptual Loss: $L_{lpips} = \text{LPIPS}(I_{sr}, I_{hr})$
  • Edge-Aware Text-Region Loss: $L_{edge} = \lVert [\text{Canny}(I_{sr}) - \text{Canny}(I_{hr})] \circ M_{text} \rVert_2^2$, where $M_{text}$ is a ground-truth text mask.
  • OCR-Text Destylization Modeling (ODM) Loss: $L_{odm} = \lVert [\text{ODM}(I_{sr}) - \text{ODM}(I_{hr})] \circ M_{text} \rVert_2^2$

The total loss is:

$$L_{total} = \lambda_1 L_{diff} + \lambda_2 L_{lpips} + \lambda_3 L_{edge} + \lambda_4 L_{odm}$$

with $(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = (1, 2, 1, 10)$.
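The composite objective can be sketched as follows. This is a simplified numpy version under stated assumptions: the LPIPS and ODM terms come from pretrained networks and are passed here as precomputed scalars, and a gradient-magnitude map stands in for the Canny edge detector used in the paper.

```python
import numpy as np

def edge_map(img):
    # Gradient-magnitude proxy for the paper's Canny edge maps (assumption)
    gx = np.diff(img, axis=1, prepend=img[:, :1])
    gy = np.diff(img, axis=0, prepend=img[:1, :])
    return np.hypot(gx, gy)

def masked_l2(a, b, mask):
    # Squared error restricted to the text region via the mask M_text
    return float((((a - b) * mask) ** 2).mean())

def total_loss(I_sr, I_hr, mask, l_lpips, l_odm,
               lambdas=(1.0, 2.0, 1.0, 10.0)):
    """L_total = l1*L_diff + l2*L_lpips + l3*L_edge + l4*L_odm,
    with the paper's weights (1, 2, 1, 10) as defaults."""
    l1, l2, l3, l4 = lambdas
    l_diff = float(((I_sr - I_hr) ** 2).mean())
    l_edge = masked_l2(edge_map(I_sr), edge_map(I_hr), mask)
    return l1 * l_diff + l2 * l_lpips + l3 * l_edge + l4 * l_odm
```

The large $\lambda_4$ weight on the ODM term reflects the emphasis on stroke-level text correctness over background fidelity.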

4. Real-Texts Dataset Construction

To address the scarcity and domain gap of prior datasets, Real-Texts is introduced:

  • Image Collection: 72,476 real-world photographs containing text are collected across diverse scenarios.
  • Text Detection & Recognition: PPOCRv5 is used; only images with valid OCR are retained.
  • Region Cropping: Detected text bounding boxes yield 74,035 $512 \times 512$ patches.
  • Quality Filtering: VisualQuality-R1 scoring ($\geq 4.25$), followed by manual verification, yields 34,875 high-quality LR–HR pairs (33,875 train, 1,000 test).
  • Degradation Pipeline: Real-ESRGAN synthetic degradations are applied to generate LR from HR.
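The OCR-validity and quality-score filtering steps above amount to a simple predicate over candidate crops. The sketch below is purely illustrative: the record schema (`ocr_ok`, `score`) is hypothetical, and the manual-verification pass is not modeled.

```python
def filter_pairs(records, score_threshold=4.25):
    """Keep crops that pass OCR validation and meet the
    VisualQuality-R1 score threshold (4.25 in the paper)."""
    return [r for r in records
            if r["ocr_ok"] and r["score"] >= score_threshold]

# Hypothetical candidate crops
records = [
    {"id": 1, "ocr_ok": True,  "score": 4.6},
    {"id": 2, "ocr_ok": False, "score": 4.9},  # no valid OCR
    {"id": 3, "ocr_ok": True,  "score": 3.1},  # below threshold
]
kept = filter_pairs(records)  # only id 1 survives both filters
```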

Dataset Statistics

Aspect | Value
--- | ---
Total pairs | 34,875 (33,875 train, 1,000 test)
Text lines | 136,136 (76K Chinese, 38K English, 19K other)
Scene types | Indoor, street, posters, natural
Font styles | Printed, calligraphic, artistic

This dataset spans varied HR resolutions ($\sim 512^2$ to $\sim 2\text{K}^2$), with text in Chinese, English, and mixed scripts (He et al., 24 Jan 2026).

5. Training Pipeline and Hyperparameters

  • Backbone: Stable Diffusion 2.1 with frozen VAE encoder/decoder
  • Fine-tuning: LoRA on U-Net weights only
  • Optimizer: AdamW (weight decay 0.01, learning rate $10^{-4}$)
  • Batch size: 16; one-step reverse pass ($T = 1$)
  • Data Augmentation: Random horizontal flips only
  • Additional Data: Real-CE crops (14,312); LSDIR images (20,000, no text annotation)

6. Evaluation, Ablation, and Insights

Quantitative Results on Real-Texts (Test Split)

Method | OCR-A ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓
--- | --- | --- | --- | --- | --- | ---
Real-ESRGAN | 0.3764 | 22.91 | 0.7265 | 0.2794 | 0.1878 | 50.12
StableSR (4) | 0.3511 | 23.69 | 0.7481 | 0.2396 | 0.1502 | 36.23
OSEDiff (1) | 0.3049 | 22.91 | 0.7160 | 0.2291 | 0.1476 | 38.21
TEXTS-Diff (1) | 0.3951 | 24.49 | 0.7499 | 0.1801 | 0.1174 | 27.70

TEXTS-Diff surpasses all baselines across OCR accuracy, perceptual similarity (LPIPS, DISTS), and FID, improving OCR accuracy by +4.8 pp over OSEDiff and +3.6 pp over the closest multi-step method.

Qualitative and Ablation Insights

  • TEXTS-Diff achieves clean text stroke recovery (even with challenging fonts), reduces hallucination and broken character artifacts, and maintains background textures with minimal over-smoothing.
  • Ablation studies show that the abstract (“TEXTS” prompt) and concrete (TDM) modules are complementary: removing either causes up to a 1.9 pp drop in OCR-A.
  • The two-stage cross-attention pattern mirrors global-to-local reading, supporting both detection (“text here”) and fine structuring (“what letters?”).

Loss Ablation Impact

The combination of ODM and edge-aware losses with conventional fidelity and perceptual objectives is instrumental in aligning stroke-level output with real OCR-derived guidance, instead of simply hallucinating plausible characters.

7. Comparative Context and Future Directions

Compared to prior diffusion-based text-aware models such as TextDiffuser (Chen et al., 2023), which utilize explicit layout planning via Transformer-based modules and character-level segmentation masks, TEXTS-Diff’s real-world super-resolution strategy employs dual-stage guidance—semantically from CLIP-based prompts and concretely from region proposals—without explicit synthetic layout planning. Both share the use of CLIP for text conditioning and demonstrate the necessity of paired, large-scale real data for legibility-oriented SR evaluation.

Limitations include the loss of multi-scale adaptation granularity inherent in the one-step design, and reliance on external (not fully end-to-end) text region proposals. Extending the approach to more scripts (e.g., Arabic, Devanagari) and to vertical text, as well as integrating differentiable, learned text detection modules, are plausible future research directions.

In summary, TEXTS-Diff represents an advance in real-world text image super-resolution, achieving superior recognition and perceptual quality by leveraging both abstract concept prompts and precise segmentation-driven feature modulation within a diffusion framework (He et al., 24 Jan 2026).
