Universal Image Restoration via Internalized Chain-of-Thought Reasoning

Published 16 Jun 2026 in cs.CV | (2606.17557v1)

Abstract: Image restoration seeks to recover high-quality images from degraded inputs but becomes highly ill-posed under complex, mixed degradations. While unified all-in-one models are common, their performance declines as degradation complexity increases. Recent works adopt Chain-of-Thought (CoT) reasoning for multi-round restoration using specialized modules. However, this approach faces two key limitations: (i) increased computational cost due to multi-step processing, and (ii) weak modeling of interactions between degradations during stepwise inference. We introduce CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Concretely, we view image restoration as a specialized subtask of image editing, which implies that a large-scale pre-trained editing model provides a more favorable optimization starting point. Building on this, we fine-tune the model for restoration and further encode structured CoT-style reasoning into the learning objective via a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To facilitate training and evaluation, we further present CoTIR-Bench, a large-scale benchmark comprising 5.2 million samples with CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and broad real composite degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at https://github.com/gy65896/CoTIR.


Authors (7)

Summary

The paper introduces CoTIR, a novel model that internalizes chain-of-thought reasoning to perform universal image restoration in a single, end-to-end pass.
The methodology employs a unified 'Thinking → Planning → Action' scheme with Lagrangian constraints and a gated-attention CoT Adapter for effective degradation disentanglement.
Empirical results on CoTIR-Bench demonstrate state-of-the-art perceptual performance and efficiency across diverse, complex degradations in real-world scenarios.

Universal Image Restoration via Internalized Chain-of-Thought Reasoning: An Expert Review

Problem Formulation and Motivations

Image restoration under complex, real-world conditions is challenged by the simultaneous presence of multiple, spatially-varying degradations (e.g., noise, haze, rain, blur, compression artifacts). Conventional all-in-one models struggle to effectively disentangle and jointly correct interacting corruptions, especially for unseen composite cases. Recent advances employing sequential Chain-of-Thought (CoT) reasoning via multi-stage or agentic pipelines attempt to decompose restoration into tractable sub-goals. However, their sequential nature imposes significant computational overhead and fails to capture coupled effects among degradations, leading to sub-optimal restoration, error accumulation, and brittle generalization.

CoTIR: Model Architecture and Methodology

CoTIR (Chain-of-Thought Image Restorer) proposes an internalized CoT paradigm, reframing universal restoration as a “Thinking → Planning → Action” process embedded within a single generative model. It builds on FLUX, a large-scale editing model pre-trained for broad image manipulation, and fine-tunes it for restoration. The core methodology is as follows:

Disentanglement (Thinking): Given a degraded input, the model infers structured latent representations corresponding to inherent scene description and explicit degradation patterns.
Strategic Recovery (Planning): It models the interactions between scene features and degradation information to formulate a restoration plan.
Restoration Execution (Action): The plan guides the end-to-end removal of composite artifacts in a single pass.

Critically, CoTIR encodes the entire CoT reasoning sequence as differentiable constraints in a Lagrangian formulation. Each reasoning component (scene, degradation, plan) is tied to a soft constraint weighted by a learnable Lagrange multiplier, so that the latent intermediate reasoning steps align with their respective ground truths. The constrained objective is optimized as a minimax problem with a dual optimizer, which adaptively shifts emphasis toward whichever components are currently violated during training.
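To make the minimax idea concrete, here is a minimal toy sketch of a primal-dual loop: a single scalar parameter stands in for the model, the Lagrangian is task loss plus the sum of multiplier times (component loss minus tolerance), and each multiplier is updated by projected gradient ascent. The quadratic losses, the targets, the tolerances, and the step sizes are all illustrative assumptions; this summary does not give the paper's exact objective or update rule.

```python
# Toy primal-dual loop: minimize over theta, maximize over multipliers >= 0.
# Every number and loss below is a hypothetical stand-in, not the paper's setup.

TARGETS = {"scene": 1.0, "degradation": 1.2, "plan": 0.8}       # assumed
TOLERANCES = {"scene": 0.25, "degradation": 1.0, "plan": 1.0}   # assumed eps_k


def task_loss_grad(theta):
    """Gradient of the stand-in restoration loss (theta - 2)^2."""
    return 2.0 * (theta - 2.0)


def violation(name, theta):
    """g_k = L_k(theta) - eps_k; a positive value means the constraint is violated."""
    return (theta - TARGETS[name]) ** 2 - TOLERANCES[name]


def violation_grad(name, theta):
    """Derivative of g_k with respect to theta."""
    return 2.0 * (theta - TARGETS[name])


def train(steps=2000, lr_primal=0.05, lr_dual=0.05):
    theta = 0.0
    multipliers = {name: 0.0 for name in TARGETS}
    for _ in range(steps):
        # Primal direction: gradient of the Lagrangian with multipliers held fixed.
        grad = task_loss_grad(theta) + sum(
            multipliers[k] * violation_grad(k, theta) for k in multipliers
        )
        # Dual step: raise multipliers of violated constraints, lower the rest,
        # and project back onto lambda_k >= 0.
        new_multipliers = {
            k: max(0.0, multipliers[k] + lr_dual * violation(k, theta))
            for k in multipliers
        }
        theta -= lr_primal * grad
        multipliers = new_multipliers
    return theta, multipliers


theta, multipliers = train()
```

In this toy problem only the scene constraint ends up active, so its multiplier settles at a positive value while the other two fall back to zero. That is the adaptive emphasis described above, shown in miniature.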

An architectural highlight is the CoT Adapter, a multi-block, gated-attention module that fuses cross-modal cues, enabling deeper integration of visual and textual (prompt-driven) semantics. Training follows a curriculum that transitions from precise restoration prompts to vague, underspecified ones, driving the model to learn sophisticated, context-aware reasoning rather than memorized restoration instructions.
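The precise-to-vague curriculum can be pictured as a schedule over which kind of prompt a training sample receives. The sketch below assumes a linear ramp in the probability of drawing a vague prompt; the ramp shape, the function names, and the example prompts are hypothetical, since the summary only states that training moves from precise to underspecified prompts.

```python
import random


def vague_prompt_probability(step, total_steps, start=0.0, end=1.0):
    """Linearly ramp the chance of a vague prompt from `start` to `end` (assumed schedule)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def pick_prompt(precise, vague, step, total_steps, rng):
    """Return the vague prompt with the scheduled probability, else the precise one."""
    p = vague_prompt_probability(step, total_steps)
    return vague if rng.random() < p else precise


rng = random.Random(0)
precise = "Remove the haze, then denoise the shadows."   # hypothetical prompt
vague = "Make this image look better."                    # hypothetical prompt
early = pick_prompt(precise, vague, step=0, total_steps=1000, rng=rng)
late = pick_prompt(precise, vague, step=1000, total_steps=1000, rng=rng)
```

With this schedule the first step always uses the precise prompt and the final step always uses the vague one. Steps in between mix the two.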

Large-Scale Benchmark and Data Generation

To support structured CoT supervision at scale, the authors introduce CoTIR-Bench. It aggregates over 60 heterogeneous datasets into 5.2 million pairs covering real and synthetic degradations. Each pair is annotated by a vision-language model (e.g., Qwen2.5-VL) with a three-stage CoT trace: (i) scene description, (ii) degradation identification, and (iii) restoration plan. Text-image consistency filtering is then applied to keep the intermediate supervision high-quality.
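One way to picture a benchmark record is as a degraded/clean pair carrying the three-stage trace, with a filtering pass that drops low-scoring samples. The sketch below is an assumed layout, not the released CoTIR-Bench schema: the class names, field names, file names, and threshold are hypothetical. The scorer is a text-only placeholder, and a real text-image consistency check would need an image-text model that the summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class CoTTrace:
    scene: str                       # stage (i): scene description
    degradations: Tuple[str, ...]    # stage (ii): identified degradations
    plan: str                        # stage (iii): restoration plan


@dataclass(frozen=True)
class Sample:
    degraded_path: str
    clean_path: str
    trace: CoTTrace


def filter_consistent(
    samples: List[Sample],
    score_fn: Callable[[Sample], float],
    threshold: float = 0.5,
) -> List[Sample]:
    """Keep samples whose consistency score reaches the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]


def plan_coverage(sample: Sample) -> float:
    """Placeholder scorer: fraction of listed degradations named in the plan text."""
    names = sample.trace.degradations
    if not names:
        return 0.0
    plan = sample.trace.plan.lower()
    return sum(name.lower() in plan for name in names) / len(names)


samples = [
    Sample("a_lq.png", "a_hq.png",
           CoTTrace("street at night", ("noise", "haze"),
                    "Remove haze, then suppress noise.")),
    Sample("b_lq.png", "b_hq.png",
           CoTTrace("forest trail", ("rain", "blur"), "Sharpen edges.")),
]
kept = filter_consistent(samples, plan_coverage, threshold=1.0)
```

Passing the scorer in as a callable keeps the filtering logic independent of whichever consistency model is actually used.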

Empirical Results

Quantitative Performance

Comprehensive experiments on CoTIR-Bench and heterogeneous real-world scenarios demonstrate that CoTIR establishes state-of-the-art perceptual performance (CLIP-IQA, Q-Align, LIQE, MACLIP, LPIPS) and remains competitive on traditional pixel-based metrics (PSNR, SSIM). Notably:

CoTIR (Flux.2-9B) achieves a CLIP-IQA+ score of 0.6319, outperforming leading diffusion and agentic baselines, with the lowest LPIPS (0.1143).
On unseen complex composite degradations, CoTIR surpasses both all-in-one and multi-stage CoT-pipeline frameworks, substantiating generalization and robustness to distribution shifts.
Despite slightly lower PSNR/SSIM in some cases, CoTIR produces more visually faithful and artifact-free restorations, particularly when residual degradations or hallucinations are problematic for baseline methods.
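For context on the PSNR trade-off noted above, PSNR is a purely pixel-wise measure: it is 10·log10(MAX² / MSE), so any deviation from the reference lowers the score, including plausible regenerated texture. The sketch below is the standard definition on flat lists of pixel values. The helper names and the tiny example are illustrative, and published evaluations typically use array libraries and a specific colour-channel convention.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(reference, estimate)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


reference = [100.0, 100.0, 100.0, 100.0]
estimate = [110.0, 90.0, 110.0, 90.0]   # every pixel off by 10 grey levels
score = psnr(reference, estimate)        # about 28.13 dB
```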

Efficiency

By avoiding multi-stage pipelines, CoTIR achieves a substantial inference speedup (e.g., 1.84 s per 512×512 image on Flux.2-4B, several times faster than diffusion-based and agentic approaches), and its memory footprint and latency scale with the chosen backbone size.

User Study & Real-World Validation

CoTIR achieves the highest perceptual scores in user studies and on real-scene benchmarks involving severe, previously unseen degradation combinations, which supports its claim to universality and suggests it is practical to deploy.

Ablation Studies

Gated-attention blocks in the CoT Adapter yield superior cross-modal fusion compared to cross-attention architectures.
Explicitly disentangled, split-head supervision of scene, degradation, and plan is critical, with learnable Lagrange multipliers providing adaptive, stable convergence and consistently improving both PSNR and LPIPS.
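The ablation above contrasts gated attention with plain cross-attention, but this summary does not spell out the block design. As a rough illustration of the general idea only, the sketch below lets each image token attend over text tokens and scales the attended context by a sigmoid gate before the residual add. The function names, the scalar per-token gate, and the omission of learned query/key/value projections are assumptions made for brevity, not the paper's architecture.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def gated_fusion(image_tokens, text_tokens, gate_weight, gate_bias):
    """Add gate-scaled text context to each image token (hypothetical block)."""
    fused = []
    for token in image_tokens:
        context = attend(token, text_tokens, text_tokens)
        # The gate depends on the image token itself; near 0 it suppresses the
        # text context, near 1 it passes the context through unchanged.
        gate = sigmoid(dot(gate_weight, token) + gate_bias)
        fused.append([t + gate * c for t, c in zip(token, context)])
    return fused


image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]
fused = gated_fusion(image_tokens, text_tokens, gate_weight=[0.0, 0.0], gate_bias=0.0)
```

Compared with ungated cross-attention, which always adds the full attended context, the gate gives the block a learned way to decide per token how much textual guidance to admit.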

Theoretical and Practical Implications

Theoretical Impact

CoTIR demonstrates that internalizing CoT reasoning as structural, learnable constraints offers a principled path to integrating intermediate reasoning in vision systems, moving beyond naive stepwise or black-box end-to-end schemes. The Lagrangian dual-optimizer is especially adept at balancing restoration fidelity and CoT-aligned semantic traceability.

The choice of large-scale editing pre-training (vs. restoration-only) is motivated by a generalization argument: models pre-trained for flexible instruction-following on diverse edit tasks are expected to generalize better to heterogeneous, out-of-domain restoration distributions, owing to a broader operator prior and stronger scene consistency.

Practical Applications

CoTIR’s unified, single-pass restoration is directly applicable to computational photography, video enhancement in security and autonomous driving, digitization of legacy imagery, and real-time restoration in mobile/embedded systems. Its support for fine-grained, multi-round, user-guided restoration through natural language instructions also enables interactive post-processing.

Future Directions

The proposed internalized CoT paradigm opens several avenues. Integrating more sophisticated scene parsers, context-aware LLMs, or even multi-modal large models could further enhance intermediate reasoning. There are strong opportunities for joint training with generative segmentation, detection, or open-vocabulary recognition objectives, moving toward truly generalist digital vision agents. Additionally, extending the framework to temporal (video) or 3D modalities would further enlarge its practical impact.

Conclusion

CoTIR reframes universal image restoration by encoding structured, multi-step reasoning within a single generative model equipped with adaptive Lagrangian constraints. It achieves robust, perceptually superior restoration and efficient inference across a broad spectrum of degradations. The underlying methodology lays a foundation for future research into interpretable, instruction-driven, and highly generalizable image reconstruction frameworks (2606.17557).
