Text-Aware Image Restoration with Diffusion Models
The paper addresses a significant gap in image restoration: the accurate reconstruction of textual regions in images using diffusion models. Despite advances in generative modeling, text reconstruction remains underexplored, largely because of limitations in existing datasets and methodologies; as a result, restoration models are prone to text-image hallucination, producing plausible yet incorrect text-like patterns.
Core Contributions
- Introduction of TAIR: The authors propose a novel restoration task called Text-Aware Image Restoration (TAIR) that focuses on simultaneously enhancing both visual content and textual fidelity. This approach diverges from traditional image restoration methodologies that primarily improve general perceptual quality without specifically accounting for textual legibility.
- SA-Text Benchmark: The paper introduces a comprehensive benchmark, SA-Text, consisting of 100K high-quality scene images densely annotated with diverse and complex text instances. This dataset serves as a foundational resource for the TAIR task, enabling rigorous evaluation and further research in text-conditioned image restoration.
- Multi-task Diffusion Framework: A new model, TeReDiff, integrates diffusion U-Net features with a text-spotting module. This multi-task framework allows rich text representations to be used as prompts in subsequent denoising steps, effectively enhancing text recognition accuracy alongside visual restoration.
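At a high level, the multi-task setup can be pictured as a single training step that adds a text-spotting loss to the usual diffusion denoising loss, with the spotting head reading intermediate U-Net features. The sketch below is a minimal illustration of that idea under assumed interfaces (a `unet` that can return features, a `spotting_head`, and a weighting factor `lambda_spot`); it is not the authors' implementation.

```python
import torch.nn.functional as F

def multitask_step(unet, spotting_head, x_noisy, t, noise, text_targets,
                   lambda_spot=0.5):
    """One hypothetical training step combining denoising and text spotting.

    Assumes `unet(x, t, return_features=True)` yields a noise prediction plus
    intermediate features, and `spotting_head` maps those features to
    per-token text logits. Both interfaces are illustrative, not the paper's.
    """
    noise_pred, feats = unet(x_noisy, t, return_features=True)

    # Standard diffusion objective: predict the injected noise.
    loss_diffusion = F.mse_loss(noise_pred, noise)

    # Auxiliary text-spotting objective on the shared U-Net features.
    spot_logits = spotting_head(feats)                 # (B, T, num_classes)
    loss_spotting = F.cross_entropy(
        spot_logits.flatten(0, -2), text_targets.flatten()
    )

    # Joint objective encourages features that are useful for both tasks.
    return loss_diffusion + lambda_spot * loss_spotting
```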
Methodology and Technical Insights
The methodological approach centers on leveraging diffusion models and text-spotting modules to extract high-quality image features that accurately guide the restoration of text regions. Using a degradation removal module and a text-spotting transformer, the framework learns to detect and recover textual information embedded in complex visual environments.
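Inference can be pictured as an iterative denoising loop in which text read from the current estimate is fed back as the prompt for the next step. The following sketch assumes hypothetical components (`degradation_removal`, `unet`, `spotter`, `encode_prompt`, and a diffusers-style `scheduler`); it illustrates the feedback structure rather than the paper's actual code.

```python
import torch

@torch.no_grad()
def restore_with_text_feedback(lq_image, degradation_removal, unet, spotter,
                               encode_prompt, scheduler, num_steps=20):
    """Hypothetical TAIR-style inference loop with textual prompt feedback.

    At every step the text-spotting module reads the diffusion features; the
    recognized strings are re-encoded and used as the prompt for the next
    step, so textual evidence accumulates as the image is cleaned up.
    """
    cond = degradation_removal(lq_image)      # degradation-free guidance features
    x = torch.randn_like(lq_image)            # start the reverse process from noise
    prompt = encode_prompt([""])              # no text known yet

    for t in scheduler.timesteps[:num_steps]:
        noise_pred, feats = unet(x, t, cond=cond, prompt=prompt,
                                 return_features=True)
        x = scheduler.step(noise_pred, t, x).prev_sample

        # Read text from the current features and promote it to the next prompt.
        recognized = spotter(feats)           # e.g. ["EXIT", "MAIN", "ST"]
        prompt = encode_prompt([" ".join(recognized)])

    return x
```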
- Dataset Curation: The authors devised an automated pipeline for curating SA-Text, combining text detection and recognition with vision-language model (VLM) verification so that high-resolution images are paired with reliable text annotations. This addresses critical issues, such as low resolution and annotation inaccuracies, prevalent in existing datasets; a minimal sketch of such a filtering loop follows this list.
- Textual Prompting: The TeReDiff model employs recognized text as prompts during the denoising steps, significantly improving both text detection and recognition performance compared with other contemporaneous diffusion-based methods.
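A curation pipeline of this kind can be pictured as a filter over candidate images: detect text, recognize it, and keep only the instances that an independent vision-language model confirms. The function names below (`detect_text`, `recognize`, `vlm_confirms`) are placeholders for the actual detector, recognizer, and VLM; the resolution threshold is likewise an assumption.

```python
def curate(images, detect_text, recognize, vlm_confirms, min_side=512):
    """Hypothetical SA-Text-style curation loop.

    Keeps only sufficiently large images whose detected text instances
    survive both a recognition pass and a VLM verification step.
    """
    curated = []
    for image in images:                       # PIL-like images assumed
        if min(image.size) < min_side:         # drop low-resolution images
            continue
        annotations = []
        for box in detect_text(image):         # candidate text regions
            text = recognize(image, box)
            # Keep the instance only if the VLM agrees the crop reads `text`.
            if text and vlm_confirms(image, box, text):
                annotations.append({"polygon": box, "text": text})
        if annotations:                        # require at least one verified instance
            curated.append({"image": image, "annotations": annotations})
    return curated
```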
Results and Implications
Extensive experiments demonstrate that TeReDiff consistently surpasses existing state-of-the-art image restoration methods in both text recognition accuracy and image quality metrics across multiple degradation levels. The model is notably effective at suppressing hallucinated text patterns and preserving legible text under challenging conditions.
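Text fidelity in such comparisons is typically scored by running a fixed text spotter on the restored output and matching the recognized words against ground truth; the helper below shows one simple word-accuracy-style proxy, not the paper's exact evaluation protocol.

```python
def word_accuracy(predicted_words, gt_words):
    """Fraction of ground-truth words recovered exactly (case-insensitive).

    A simple proxy for text fidelity; benchmarks usually also report
    detection/recognition F1 and image-quality metrics such as PSNR or LPIPS.
    """
    remaining = [w.lower() for w in predicted_words]
    hits = 0
    for word in (w.lower() for w in gt_words):
        if word in remaining:
            hits += 1
            remaining.remove(word)
    return hits / max(len(gt_words), 1)
```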
These results point to practical applications such as document digitization, street-sign reading, and augmented-reality navigation. The introduced framework sets a new direction for image restoration research, emphasizing the integration of semantic textual cues into visual reconstruction.
Future Prospects
The paper suggests avenues for future research, such as further refining text-spotting capabilities within diffusion models, extending TAIR to more diverse datasets, and exploring advanced prompting techniques. With ongoing advances in AI and computer vision, the effective restoration of textual content alongside visual quality could substantially improve automated interpretation of complex imagery.
This work contributes a vital perspective to the ongoing evolution of image restoration, particularly the accurate and legible restoration of text, paving the way for better integration of textual content in visual reconstruction tasks.