
Deep Unrestricted Document Image Rectification (2304.08796v2)

Published 18 Apr 2023 in cs.CV

Abstract: In recent years, tremendous efforts have been made on document image rectification, but existing advanced algorithms are limited to processing restricted document images, i.e., the input images must incorporate a complete document. Once the captured image merely involves a local text region, its rectification quality is degraded and unsatisfactory. Our previously proposed DocTr, a transformer-assisted network for document image rectification, also suffers from this limitation. In this work, we present DocTr++, a novel unified framework for document image rectification, without any restrictions on the input distorted images. Our major technical improvements can be concluded in three aspects. Firstly, we upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing. Secondly, we reformulate the pixel-wise mapping relationship between the unrestricted distorted document images and the distortion-free counterparts. The obtained data is used to train our DocTr++ for unrestricted document image rectification. Thirdly, we contribute a real-world test set and metrics applicable for evaluating the rectification quality. To our best knowledge, this is the first learning-based method for the rectification of unrestricted document images. Extensive experiments are conducted, and the results demonstrate the effectiveness and superiority of our method. We hope our DocTr++ will serve as a strong baseline for generic document image rectification, prompting the further advancement and application of learning-based algorithms. The source code and the proposed dataset are publicly available at https://github.com/fh2019ustc/DocTr-Plus.


Summary

  • The paper presents DocTr++, a hierarchical encoder-decoder model that significantly improves rectification for document images with incomplete boundaries.
  • It reformulates pixel-wise mapping and introduces new evaluation metrics and datasets, outperforming state-of-the-art methods on benchmarks.
  • Extensive experiments using MSSIM, LD, ED, and CER metrics confirm DocTr++'s robustness and real-world applicability for mobile-captured documents.

Deep Unrestricted Document Image Rectification

The paper "Deep Unrestricted Document Image Rectification" presents DocTr++, a framework that addresses a key limitation of existing document image rectification techniques. Building on the original DocTr, DocTr++ rectifies document images without requiring complete document boundaries, extending applicability to the unrestricted distorted images frequently encountered in real-world capture scenarios.

Key Contributions

DocTr++ introduces several noteworthy technical improvements:

  1. Hierarchical Encoder-Decoder Architecture: The revised model adopts a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing, upgrading the original DocTr architecture and improving distortion rectification.
  2. Reformulated Mapping Strategy: The pixel-wise mapping between unrestricted distorted document images and their distortion-free counterparts is reformulated. The resulting training data enables DocTr++ to handle the unrestricted rectification scenario effectively.
  3. Dataset and Metrics Contribution: A real-world test set and metrics applicable to unrestricted document images are introduced, providing benchmarks for assessing rectification quality in future research.

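The reformulated mapping above can be made concrete with a small sketch. For each pixel of the rectified output, a predicted backward mapping field gives the source location in the distorted input, and bilinear interpolation resamples the image. All names here are hypothetical; DocTr++ predicts such a field with a transformer network, but this only illustrates how a pixel-wise mapping is applied.

```python
# Illustrative sketch: rectification via a pixel-wise backward mapping.
# For every output pixel (r, c), mapping[r][c] = (y, x) is the location
# in the distorted input to sample from; bilinear interpolation blends
# the four surrounding source pixels.

def bilinear_sample(img, y, x):
    """Bilinearly sample a 2D grayscale image (list of lists) at (y, x)."""
    h, w = len(img), len(img[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def rectify(distorted, mapping):
    """Apply a backward mapping field to produce the rectified image."""
    return [[bilinear_sample(distorted, y, x) for (y, x) in row]
            for row in mapping]

# A trivial identity mapping leaves the image unchanged.
img = [[0.0, 1.0], [2.0, 3.0]]
identity = [[(r, c) for c in range(2)] for r in range(2)]
out = rectify(img, identity)
```

In practice the mapping field is predicted at low resolution and upsampled; the resampling step itself is what converts the learned correspondence into a flattened document.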
Experimental Evaluation

Extensive experiments validate the superiority of DocTr++. On the DocUNet Benchmark and the newly proposed dataset, DocTr++ consistently outperforms existing state-of-the-art methods in both quantitative metrics and qualitative comparisons. Key metrics include multi-scale structural similarity (MSSIM), local distortion (LD), edit distance (ED), and character error rate (CER), which together demonstrate the algorithm's robustness across varied document image types.
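The OCR-based metrics mentioned above follow standard definitions: ED is the Levenshtein distance between recognized and ground-truth text, and CER normalizes it by the reference length. A minimal sketch (the paper's exact evaluation scripts may differ in tokenization details):

```python
# Edit Distance (ED) via the standard Levenshtein dynamic program, and
# Character Error Rate (CER) as ED normalized by the reference length.

def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Lower ED and CER after rectification indicate that the dewarped image is easier for an OCR engine to read, which is the practical end goal.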

The newly introduced MSSIM-M and LD-M metrics address the challenge of evaluating images without complete boundaries, providing a more accurate assessment of image similarity and distortion correction in the unrestricted setting.
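The principle behind the "-M" variants is masking: the metric is accumulated only over pixels inside the valid document region, so absent boundaries do not penalize the score. The sketch below is hypothetical and uses a simple per-pixel error rather than full multi-scale SSIM or local distortion; it only illustrates the masking step.

```python
# Hypothetical illustration of masked evaluation: a per-pixel error is
# averaged only where the mask marks the valid document region.

def masked_mean_error(pred, target, mask):
    """Mean absolute error over pixels where mask is True."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0
```

The same masking idea carries over to MSSIM-M and LD-M: similarity windows or flow vectors outside the document region are simply excluded from the average.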

Theoretical and Practical Implications

The implications of DocTr++ are expansive:

  • Theoretical: The paper pushes the boundary of document image rectification as a field by addressing unrestricted inputs. It challenges existing baseline methods and proposes a robust framework that encompasses a broader range of scenarios.
  • Practical: DocTr++ offers a compelling solution for real-world applications, such as mobile-captured documents, which are often subject to distortions like partial document exposure or the absence of distinct document boundaries. The ability to rectify such images widens the potential for document digitization in diverse use cases, including archival, legal, and educational contexts.

Future Directions

The research opens several avenues for future exploration:

  • Integration with Downstream Applications: Future work might explore integration with OCR systems, enhancing the overall efficacy of automatic text recognition pipelines.
  • Adapting to Diverse Document Types: While current focus lies on unrestricted documents, exploring the framework’s adaptation to other forms of documents, such as historical manuscripts or multilingual documents, could enhance its utility.
  • Exploration of Geometric Constraints: Future research could aim to explicitly leverage geometric and textural attributes during the rectification process, potentially increasing accuracy in more complex document layouts.

In summary, the paper makes significant strides in document image rectification, offering a robust, scalable solution via DocTr++. With both theoretical foundations and pragmatic implementations, it lays crucial groundwork for future innovations in the field.