Self-Supervised Pre-Training for Table Structure Recognition Transformer (2402.15578v1)

Published 23 Feb 2024 in cs.CV

Abstract: Table structure recognition (TSR) aims to convert tabular images into a machine-readable format. Although hybrid convolutional neural network (CNN)-transformer architectures are widely used in existing approaches, the linear projection transformer has outperformed the hybrid architecture in numerous vision tasks due to its simplicity and efficiency. However, existing research has demonstrated that directly replacing the CNN backbone with a linear projection leads to a marked performance drop. In this work, we resolve the issue by proposing a self-supervised pre-training (SSP) method for TSR transformers. We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model. We conducted reproducible ablation studies and open-sourced our code at https://github.com/poloclub/unitable to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain, as tables are a promising modality for representation learning.
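To make the architectural contrast concrete, the sketch below illustrates a ViT-style "linear projection" patch embedding, the encoder front-end the abstract contrasts with a hybrid CNN backbone. This is a minimal NumPy illustration under stated assumptions, not the paper's actual implementation; the function name, dimensions, and the random stand-in for trained weights are all hypothetical.

```python
import numpy as np

# Hypothetical sketch (not the paper's code): a ViT-style linear projection
# patch embedding. Each non-overlapping patch is flattened and mapped to
# d_model dimensions by a single linear layer -- no convolutional feature
# extractor, which is the simplicity the abstract refers to.
def linear_projection_embed(image, patch_size=16, d_model=64, seed=0):
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Rearrange (H, W, C) into (num_patches, patch_size * patch_size * C).
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_size * patch_size * c)
    )
    # Random matrix stands in for a trained projection weight.
    proj = rng.standard_normal((patches.shape[1], d_model))
    return patches @ proj  # (num_patches, d_model) token sequence

tokens = linear_projection_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 64): 14 x 14 patches, each embedded to 64 dims
```

In a hybrid CNN-transformer, the flatten-and-project step above would instead be preceded by several convolutional stages; the paper's finding is that self-supervised pre-training of the visual encoder lets the simpler projection close the resulting performance gap.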

