A Hybrid Approach for Document Layout Analysis in Document Images (2404.17888v2)

Published 27 Apr 2024 in cs.CV

Abstract: Document layout analysis involves understanding the arrangement of elements within a document. This paper navigates the complexities of understanding various elements within document images, such as text, images, tables, and headings. The approach employs an advanced Transformer-based object detection network as an innovative graphical page object detector for identifying tables, figures, and displayed elements. We introduce a query encoding mechanism to provide high-quality object queries for contrastive learning, enhancing efficiency in the decoder phase. We also present a hybrid matching scheme that integrates the decoder's original one-to-one matching strategy with a one-to-many matching strategy during the training phase. This approach aims to improve the model's accuracy and versatility in detecting various graphical elements on a page. Our experiments on the PubLayNet, DocLayNet, and PubTables benchmarks show that our approach outperforms current state-of-the-art methods. It achieves an average precision of 97.3% on PubLayNet, 81.6% on DocLayNet, and 98.6% on PubTables, demonstrating its superior performance in layout analysis. These advancements not only enhance the conversion of document images into editable and accessible formats but also streamline information retrieval and data extraction processes.
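
The hybrid matching scheme described in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal illustration of the general idea, not the authors' implementation: a one-to-one (Hungarian) assignment gives each ground-truth object exactly one query, while an auxiliary one-to-many assignment, used only during training, matches each ground-truth object to its k lowest-cost queries to densify supervision. The function names, the value k=6, and the toy tensor shapes are all assumptions made for illustration.

import torch
from scipy.optimize import linear_sum_assignment

def one_to_one_match(cost):
    # Hungarian assignment: each ground-truth object is matched to exactly
    # one query; cost has shape (num_queries, num_gt).
    q_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
    return list(zip(q_idx.tolist(), gt_idx.tolist()))

def one_to_many_match(cost, k=6):
    # Auxiliary training-time assignment: each ground-truth object is
    # matched to its k lowest-cost queries, giving the decoder denser
    # positive supervision than the one-to-one branch alone.
    matches = []
    for gt in range(cost.shape[1]):
        topk = torch.topk(-cost[:, gt], k=min(k, cost.shape[0])).indices
        matches.extend((int(q), gt) for q in topk)
    return matches

# Toy usage: 300 object queries vs. 5 page objects (tables, figures, ...).
# During training, losses from both assignments are combined; at inference
# only the one-to-one branch is used, so no NMS post-processing is needed.
cost = torch.rand(300, 5)
pairs_o2o = one_to_one_match(cost)
pairs_o2m = one_to_many_match(cost, k=6)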
