Deep Unrestricted Document Image Rectification (2304.08796v2)
Abstract: In recent years, tremendous efforts have been made on document image rectification, but existing advanced algorithms are limited to processing restricted document images, i.e., the input images must contain a complete document. When the captured image covers only a local text region, rectification quality degrades and becomes unsatisfactory. Our previously proposed DocTr, a transformer-assisted network for document image rectification, also suffers from this limitation. In this work, we present DocTr++, a novel unified framework for document image rectification that imposes no restrictions on the input distorted images. Our major technical improvements can be summarized in three aspects. First, we upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing. Second, we reformulate the pixel-wise mapping relationship between unrestricted distorted document images and their distortion-free counterparts; the resulting data is used to train DocTr++ for unrestricted document image rectification. Third, we contribute a real-world test set and metrics suitable for evaluating rectification quality. To the best of our knowledge, this is the first learning-based method for the rectification of unrestricted document images. Extensive experiments demonstrate the effectiveness and superiority of our method. We hope DocTr++ will serve as a strong baseline for generic document image rectification, promoting the further advancement and application of learning-based algorithms. The source code and the proposed dataset are publicly available at https://github.com/fh2019ustc/DocTr-Plus.
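The pixel-wise mapping relationship mentioned in the abstract can be illustrated with a minimal sketch: given a backward map that tells, for every pixel of the rectified output, where to sample in the distorted input, rectification reduces to a single warping step. This is a simplified illustration only; the function and variable names are hypothetical, the sampling is nearest-neighbour rather than bilinear, and in DocTr++ the backward map would be predicted by the network rather than supplied by hand.

```python
import numpy as np

def rectify(distorted: np.ndarray, backward_map: np.ndarray) -> np.ndarray:
    """Warp a distorted image with a per-pixel backward map.

    distorted:    (H_in, W_in, C) input image.
    backward_map: (H_out, W_out, 2) array where backward_map[y, x] = (src_y, src_x)
                  gives the location in `distorted` to sample for output pixel (y, x).
    """
    src = np.rint(backward_map).astype(int)  # nearest-neighbour sampling
    src_y = np.clip(src[..., 0], 0, distorted.shape[0] - 1)
    src_x = np.clip(src[..., 1], 0, distorted.shape[1] - 1)
    return distorted[src_y, src_x]

# Sanity check: the identity map leaves the image unchanged.
img = np.arange(2 * 3 * 1).reshape(2, 3, 1)
yy, xx = np.meshgrid(np.arange(2), np.arange(3), indexing="ij")
identity = np.stack([yy, xx], axis=-1)
assert np.array_equal(rectify(img, identity), img)
```

In the unrestricted setting, the key difficulty is that this map must remain well-defined even when the input shows only a partial document, which is why the paper reformulates the mapping rather than relying on full-document boundaries.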