
DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction (2110.12942v2)

Published 25 Oct 2021 in cs.CV

Abstract: In this work, we propose a new framework, called Document Image Transformer (DocTr), to address the issue of geometry and illumination distortion of the document images. Specifically, DocTr consists of a geometric unwarping transformer and an illumination correction transformer. By setting a set of learned query embedding, the geometric unwarping transformer captures the global context of the document image by self-attention mechanism and decodes the pixel-wise displacement solution to correct the geometric distortion. After geometric unwarping, our illumination correction transformer further removes the shading artifacts to improve the visual quality and OCR accuracy. Extensive evaluations are conducted on several datasets, and superior results are reported against the state-of-the-art methods. Remarkably, our DocTr achieves 20.02% Character Error Rate (CER), a 15% absolute improvement over the state-of-the-art methods. Moreover, it also shows high efficiency on running time and parameter count. The results will be available at https://github.com/fh2019ustc/DocTr for further comparison.

Citations (51)

Summary

  • The paper presents DocTr, a transformer-based framework that rectifies document images by first segmenting the document from its background and then correcting geometric distortion with an encoder-decoder transformer.
  • Its Geometric Unwarping Transformer predicts a pixel-wise displacement map over the document content, reaching a Local Distortion (LD) of 8.38.
  • The Illumination Correction Transformer then removes shading artifacts, bringing the Character Error Rate (CER) down to 20.02%, compared with 68% on the distorted inputs.

Document Image Transformer (DocTr): Addressing Geometry and Illumination Distortion

The authors introduce the Document Image Transformer (DocTr), a framework for geometric unwarping and illumination correction of document images. It has two components, the Geometric Unwarping Transformer and the Illumination Correction Transformer, applied in that order. Both use the transformer architecture to process document images captured under uncontrolled geometric and lighting conditions.

Geometric Unwarping

Geometric unwarping rectifies document images distorted by folds, curves, and camera misalignment. A preprocessing step segments the document and removes the background, so the model attends only to document content. The core of the process is a transformer encoder-decoder that captures global context through self-attention and decodes a pixel-wise displacement map, which is then used to resample the distorted image into a flat one.
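The resampling step that follows the displacement prediction can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: it assumes the network output is expressed as a backward map (for every output pixel, the source coordinates to read from), and the names `identity_grid` and `unwarp` are hypothetical.

```python
import numpy as np

def identity_grid(h: int, w: int) -> np.ndarray:
    """Return an (h, w, 2) grid where grid[y, x] = (x, y)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs, ys], axis=-1).astype(float)

def unwarp(image: np.ndarray, backward_map: np.ndarray) -> np.ndarray:
    """Resample `image` through a dense backward map with bilinear interpolation.

    image:        (H, W) or (H, W, C) float array, the distorted input.
    backward_map: (H_out, W_out, 2) array; backward_map[y, x] = (src_x, src_y)
                  gives the source location that output pixel (x, y) reads from.
    """
    h, w = image.shape[:2]
    # Clamp source coordinates so every lookup stays inside the image.
    src_x = np.clip(backward_map[..., 0], 0, w - 1)
    src_y = np.clip(backward_map[..., 1], 0, h - 1)

    # Integer corners surrounding each source location.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as interpolation weights.
    wx = src_x - x0
    wy = src_y - y0
    if image.ndim == 3:
        wx = wx[..., None]
        wy = wy[..., None]

    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# Assumed convention: backward map = regular grid + predicted displacement.
distorted = np.arange(12, dtype=float).reshape(3, 4)
displacement = np.zeros((3, 4, 2))
displacement[..., 0] = 1.0  # every output pixel reads one pixel to its right
flat = unwarp(distorted, identity_grid(3, 4) + displacement)
```

A backward map is convenient because every output pixel receives exactly one interpolated value, so the result has no holes; a forward map would need extra scattering and hole filling.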

The paper reports a Local Distortion (LD) of 8.38 for the geometric stage, and a drop in Character Error Rate (CER) from 68% on the distorted images to 31% after unwarping. These results suggest that self-attention captures long-range spatial dependencies that convolutional neural networks, with their local receptive fields, can miss.
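For readers unfamiliar with the metric, CER is conventionally the Levenshtein (edit) distance between the OCR output and the reference text, divided by the number of reference characters. The sketch below follows that standard definition; the function names are illustrative, and the paper's exact evaluation script and OCR engine are not reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from ref
                curr[j - 1] + 1,         # insert h into ref
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# A typical OCR confusion: "m" read as "rn" costs one substitution and one insertion.
print(cer("document", "docurnent"))  # 0.25
```

Because insertions count as errors, CER can exceed 100% when the OCR output is much longer than the reference.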

Illumination Correction

After geometric correction, the illumination correction transformer removes shading artifacts to improve visual quality. It works patch by patch: overlapping patches are extracted from the unwarped image, each is corrected by a second encoder-decoder transformer, and the corrected patches are stitched back into the final image, with the overlap helping to hide seams at patch borders.
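The extract, correct, and stitch loop can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: overlapping regions are blended by simple averaging, the default `patch` and `stride` values are placeholders, and `correct_patch` stands in for the illumination network (any callable that returns an array of the same shape as its input).

```python
import numpy as np

def patch_starts(length: int, patch: int, stride: int) -> list:
    """Start offsets that cover [0, length), with the last patch flush to the edge."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts

def correct_in_patches(image, correct_patch, patch=128, stride=96):
    """Apply `correct_patch` to overlapping tiles and average the overlaps."""
    if stride > patch:
        raise ValueError("stride must not exceed patch size, or pixels are skipped")
    h, w = image.shape[:2]
    acc = np.zeros(image.shape, dtype=float)
    # Per-pixel count of contributing tiles; the trailing axis broadcasts over channels.
    weight = np.zeros((h, w) + (1,) * (image.ndim - 2), dtype=float)
    for y in patch_starts(h, patch, stride):
        for x in patch_starts(w, patch, stride):
            tile = image[y:y + patch, x:x + patch]
            acc[y:y + patch, x:x + patch] += correct_patch(tile)
            weight[y:y + patch, x:x + patch] += 1.0
    return acc / weight

# Stand-in "network": rescale each tile so its mean brightness becomes 1.0.
page = np.random.default_rng(0).uniform(0.2, 0.9, size=(300, 200))
result = correct_in_patches(page, lambda t: t / t.mean())
```

Averaging is the simplest blend; a weight that tapers toward tile borders is a common alternative when per-tile corrections disagree noticeably.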

Adding illumination correction lowers the CER further, to 20.02%. The gain in OCR metrics indicates that the corrected images are not only more readable to people but also easier for downstream text recognition and document analysis systems to process.

Comparative Analysis and Implications

In the reported experiments, DocTr outperforms the methods it is compared against on both geometric unwarping and illumination correction. This points to attention mechanisms as a useful tool for complex visual distortions, and to the value of global feature aggregation in low-level vision tasks.
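To make "global feature aggregation" concrete, the sketch below implements standard single-head scaled dot-product self-attention in NumPy. It is the generic mechanism, not DocTr's specific architecture; the token count, feature sizes, and random projection matrices are arbitrary placeholders.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (N, D) array, one row per image token (for example, a feature-map cell).
    w_q, w_k, w_v: (D, D_head) projection matrices.
    Returns the (N, D_head) outputs and the (N, N) attention weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
# Every row of `weights` is a distribution over all 6 tokens, so each output
# token mixes information from the whole input, not just a local neighborhood.
```

A convolution with a 3x3 kernel would need many stacked layers for distant regions of a page to influence each other; here a single layer already connects every pair of tokens.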

DocTr is also reported to be efficient in running time and parameter count relative to earlier methods, which matters for deploying document rectification in resource-constrained settings.

Future Directions

Future research could integrate text-based features, such as the layout of text lines, to further refine rectification. Training and evaluating on larger datasets with more diverse and complex distortions would also test how far the approach generalizes.

In conclusion, DocTr improves on prior document image rectification methods for both geometric and illumination distortions, and it adds to the evidence that attention-based models are effective for low-level vision tasks as well as the high-level recognition tasks where they were first adopted.