
Document Image Rectification Bases on Self-Adaptive Multitask Fusion (2505.06038v1)

Published 9 May 2025 in cs.CV

Abstract: Deformed document image rectification is essential for real-world document understanding tasks, such as layout analysis and text recognition. However, current multi-task methods -- such as background removal, 3D coordinate prediction, and text line segmentation -- often overlook the complementary features between tasks and their interactions. To address this gap, we propose a self-adaptive learnable multi-task fusion rectification network named SalmRec. This network incorporates an inter-task feature aggregation module that adaptively improves the perception of geometric distortions, enhances feature complementarity, and reduces negative interference. We also introduce a gating mechanism to balance features both within global tasks and between local tasks effectively. Experimental results on two English benchmarks (DIR300 and DocUNet) and one Chinese benchmark (DocReal) demonstrate that our method significantly improves rectification performance. Ablation studies further highlight the positive impact of different tasks on dewarping and the effectiveness of our proposed module.

Summary

  • The paper introduces SalmRec, a novel self-adaptive multitask fusion network designed to rectify deformed document images.
  • SalmRec incorporates an inter-task feature aggregation module and a gating mechanism to improve feature complementarity and task performance.
  • Experiments demonstrate that SalmRec achieves state-of-the-art results on benchmark datasets, significantly enhancing rectification accuracy and downstream OCR performance.

Document Image Rectification Based on Self-Adaptive Multitask Fusion

The paper "Document Image Rectification Bases on Self-Adaptive Multitask Fusion" by Heng Li et al. presents an approach for rectifying deformed document images so that downstream tasks such as layout analysis and text recognition perform better. The authors propose SalmRec, a self-adaptive, learnable multi-task fusion network, and support it with benchmark experiments and ablation studies.

SalmRec addresses a limitation of existing multi-task methods, which often overlook the complementary features between tasks and the interactions among them. Its architecture adds two components: an inter-task feature aggregation module that improves the network's perception of geometric distortions while reducing negative interference, and a gating mechanism that balances features within global tasks and between local tasks.
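The interplay of aggregation and gating can be pictured with a toy example. The sketch below is not the authors' implementation: it assumes each task (background removal, 3D coordinate prediction, text line segmentation) yields a small feature vector, builds a leave-one-out context for each task by averaging the other tasks' features, and blends the two with a sigmoid gate. The function names, the averaging rule, and the fixed gate logits are all hypothetical; in a trained network the gate values would be learned.

```python
import math


def leave_one_out_context(features):
    """For each task, average the feature vectors of all *other* tasks.

    Hypothetical helper: the averaging rule is an assumption for illustration.
    """
    names = list(features)
    dim = len(features[names[0]])
    context = {}
    for held_out in names:
        others = [features[n] for n in names if n != held_out]
        context[held_out] = [
            sum(vec[i] for vec in others) / len(others) for i in range(dim)
        ]
    return context


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(features, gate_logits):
    """Blend each task's own feature with its leave-one-out context.

    A gate value near 1 keeps the task's own feature; near 0 it relies on
    the other tasks. Gate logits are fixed here but would be learned.
    """
    context = leave_one_out_context(features)
    fused = {}
    for name, own in features.items():
        g = sigmoid(gate_logits[name])
        fused[name] = [g * o + (1.0 - g) * c for o, c in zip(own, context[name])]
    return fused


# Toy two-dimensional features for the three tasks named in the abstract.
features = {
    "background": [1.0, 0.0],
    "coords_3d": [0.0, 1.0],
    "text_lines": [1.0, 1.0],
}
gate_logits = {"background": 0.0, "coords_3d": 0.0, "text_lines": 0.0}

fused = gated_fusion(features, gate_logits)
print(fused["background"])
```

With all gate logits at zero, every gate equals 0.5, so each fused vector is the midpoint between a task's own feature and the average of the other two. Raising a task's logit makes that task depend less on its neighbours, which is one simple way to limit negative interference.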

Key Components and Contributions

  1. Inter-Task Feature Aggregation Module: This module employs a leave-one-out combination to improve task correlation and comprehensively utilize input features, mitigating redundancy and enhancing task-specific performance.
  2. Gating Mechanism: Inspired by routing-based multi-task learning, this mechanism dynamically adjusts the importance of global and local features, optimizing the use of task-specific information for document rectification.
  3. Experimental Validations: SalmRec was evaluated on two English benchmarks (DIR300 and DocUNet) and one Chinese benchmark (DocReal). It is reported to achieve state-of-the-art results on all three, improving over existing approaches both in image-level rectification metrics (MS-SSIM, LD, AD) and in OCR metrics (ED, CER).
  4. Ablation Studies: The authors conducted detailed ablation experiments to quantify the contributions of each task and each component of their model, providing insightful evidence for their architectural choices.
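The two OCR metrics in the list above, ED and CER, are simple to compute once recognized text is available. The following is a minimal sketch assuming the common definitions: ED is the Levenshtein edit distance between reference and recognized text, and CER is that distance divided by the reference length. The OCR engine and any text normalization used in the paper's evaluation are not covered here, and the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ref_char
                    curr[j - 1] + 1,     # insert hyp_char
                    prev[j - 1] + cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(edit_distance("kitten", "sitting"))  # 3
print(character_error_rate("rectify", "rectfy"))
```

Lower is better for both metrics. Because CER is normalized by the reference length, it can exceed 1.0 when the recognized text is much longer than the reference.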

Implications and Future Work

SalmRec improves the rectification of documents deformed by handling and by environmental factors, which in turn benefits subsequent document understanding tasks. In practice, the ability to rectify images captured under adverse conditions makes the approach useful for mobile-captured imagery and other real-world document processing.

One of the future directions suggested by the authors involves the development of more lightweight models that maintain robustness while enhancing rectification performance, potentially for use in resource-constrained environments. Extending this framework to seamlessly integrate with models addressing other document-based tasks could further streamline document processing pipelines.

The paper makes a significant stride towards addressing the complexities inherent in document image rectification, offering a robust solution with real-world applicability and laying the groundwork for innovations in document processing technologies.