
UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining (2403.04822v2)

Published 7 Mar 2024 in cs.CV and cs.LG

Abstract: Tables convey factual and quantitative data with implicit conventions created by humans that are often challenging for machines to parse. Prior work on table recognition (TR) has mainly centered around complex task-specific combinations of available inputs and tools. We present UniTable, a training framework that unifies both the training paradigm and training objective of TR. Its training paradigm combines the simplicity of purely pixel-level inputs with the effectiveness and scalability empowered by self-supervised pretraining from diverse unannotated tabular images. Our framework unifies the training objectives of all three TR tasks - extracting table structure, cell content, and cell bounding box - into a unified task-agnostic training objective: language modeling. Extensive quantitative and qualitative analyses highlight UniTable's state-of-the-art (SOTA) performance on four of the largest TR datasets. UniTable's table parsing capability has surpassed both existing TR methods and general large vision-language models, e.g., GPT-4o, GPT-4-turbo with vision, and LLaVA. Our code is publicly available at https://github.com/poloclub/unitable, featuring a Jupyter Notebook that includes the complete inference pipeline, fine-tuned across multiple TR datasets, supporting all three TR tasks.


Summary

  • The paper introduces a unified framework for table structure recognition that integrates self-supervised pretraining to boost TSR performance across multiple tasks.
  • It reformulates TSR tasks as a language modeling problem, enabling effective extraction of table structure, cell content, and bounding box coordinates.
  • Extensive evaluations on four major datasets demonstrate state-of-the-art results, showcasing the method's scalability and practical impact on document processing.

UniTable: A Unified Framework for Table Recognition via Self-Supervised Pretraining

The paper "UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining" proposes a novel training methodology aimed at enhancing the performance of table structure recognition (TSR). TSR remains a challenging task because the implicit conventions of human-designed tables are often hard for machines to interpret. Previous methods typically relied on complex, task-specific combinations of inputs and tools, which limited the flexibility and scalability of models. UniTable addresses these limitations by introducing a framework that unifies the training approach for TSR tasks, combining pixel-level inputs with advances in self-supervised pretraining (SSP) to achieve state-of-the-art results.

Contributions and Methodology

The primary contribution of the UniTable framework is its unification of both the training paradigm and objective across three key TSR tasks: extracting table structure, cell content, and bounding boxes (bboxes). This is achieved through a task-agnostic training objective centered on language modeling. The framework operates under the assumption that high-quality SSP of visual encoders can significantly enhance the model's ability to parse and understand tables. This method relies on diverse unannotated tabular images to pretrain the model, which is then fine-tuned on supervised datasets.
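The unifying idea can be sketched in a few lines: the three task outputs (HTML structure tags, cell markers, and bbox coordinates) are serialized into one token sequence that a decoder predicts autoregressively, just like text. The token names and the coordinate-quantization scheme below are hypothetical stand-ins for illustration, not taken from the released code.

```python
# Illustrative sketch (not the authors' implementation): serializing
# table structure and quantized bbox coordinates into a single
# language-modeling target sequence.

def quantize_bbox(bbox, img_size, n_bins=448):
    """Map continuous bbox coordinates to discrete vocabulary tokens."""
    x1, y1, x2, y2 = bbox
    w, h = img_size
    return [
        f"<coord_{int(x1 / w * (n_bins - 1))}>",
        f"<coord_{int(y1 / h * (n_bins - 1))}>",
        f"<coord_{int(x2 / w * (n_bins - 1))}>",
        f"<coord_{int(y2 / h * (n_bins - 1))}>",
    ]

def build_target_sequence(structure_tags, cells, img_size):
    """Interleave HTML structure tokens with per-cell bbox tokens.

    structure_tags: e.g. ["<table>", "<tr>", "<td>", "</td>", ...]
    cells: one dict with a "bbox" entry per <td> tag, in reading order.
    """
    tokens, cell_iter = ["<bos>"], iter(cells)
    for tag in structure_tags:
        tokens.append(tag)
        if tag == "<td>":
            cell = next(cell_iter)
            tokens.extend(quantize_bbox(cell["bbox"], img_size))
    tokens.append("<eos>")
    return tokens
```

Because every task reduces to predicting such a token sequence, a single decoder with a single cross-entropy loss can serve all three tasks.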

  1. Training Paradigm: The authors integrate SSP into the training paradigm by pretraining the visual encoder on masked tabular images. This step lets the encoder learn table-specific visual features from diverse unannotated images before supervised fine-tuning, in which table structures are treated as token sequences to be predicted.
  2. Unified Training Objective: By casting TSR tasks as language modeling, the model predicts sequences of tokens corresponding to elements within tables, such as HTML tags and bbox coordinates. Moreover, the adoption of a linear-projection Transformer aligns TSR methodology with state-of-the-art architectures used across various domains.
  3. Implementation and Results: Extensive evaluations on four significant TSR datasets demonstrate the robustness and effectiveness of the UniTable framework. Notably, UniTable achieves new state-of-the-art performance metrics on several datasets, surpassing prior benchmarks without task-specific enhancement losses or reliance on external resources like PDFs.
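The linear-projection front end and the masked pretraining step above can be sketched as follows. This is a minimal ViT/MAE-style illustration assuming a patch size of 16 and a 40% mask ratio; the actual shapes, mask ratio, and reconstruction target in the paper may differ.

```python
# Sketch of linear-projection patch embedding plus random patch masking,
# as used in ViT/MAE-style self-supervised pretraining. Shapes and the
# mask ratio are illustrative assumptions, not the paper's settings.
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = img.shape
    patches = img.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * c)   # (N, patch*patch*C)

def embed_and_mask(img, proj, mask_ratio=0.4, seed=0):
    """Linearly project patches to embeddings, then mask a random subset.

    proj: (patch*patch*C, d_model) matrix -- the "linear projection"
    that replaces a CNN stem in a plain ViT encoder.
    Returns embeddings with masked rows zeroed and the boolean mask,
    so a decoder can be trained to reconstruct the masked patches.
    """
    tokens = patchify(img) @ proj                   # (N, d_model)
    rng = np.random.default_rng(seed)
    n_masked = int(mask_ratio * len(tokens))
    mask = np.zeros(len(tokens), dtype=bool)
    mask[rng.choice(len(tokens), n_masked, replace=False)] = True
    tokens[mask] = 0.0                              # stand-in for [MASK]
    return tokens, mask
```

Training the encoder to recover the masked patches from the visible ones is what allows it to exploit large pools of unannotated tabular images before any labeled fine-tuning.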

Implications and Future Work

The implications of this research are significant for both theoretical and practical applications in AI-driven document understanding. The unification of training paradigms opens the door for more generalized frameworks applicable across diverse domains beyond TSR, contributing to the development of more comprehensive vision-language models. Moreover, this work aligns with ongoing trends towards multimodal machine learning, where models must understand and generate complex data combinations seamlessly.

Future research directions could include exploring how UniTable's methodologies can be integrated into broader vision-language models to facilitate seamless interaction with various data types, such as images, text, and tabular data within a singular framework. Additionally, given the results achieved through SSP, further studies could investigate the scalability of similar frameworks when applied to other structured data recognition tasks or cross-modal learning applications.

Conclusion

The paper "UniTable" proposes a transformative approach to TSR by combining SSP with a unified framework for training table parsers. The paradigm shift from complex task-specific approaches to a generalized language-modeling objective demonstrates substantial improvements, as evidenced by state-of-the-art performance across multiple datasets. This work contributes a promising direction for ongoing research in machine learning, particularly in contexts requiring the integration and interpretation of structured and unstructured data. The demonstrated effectiveness of SSP highlights its potential to enhance visual language understanding paradigms, fostering advancements not only in academic research but also in industry applications where automated document processing is pivotal.