TableRAG: Million-Token Table Understanding with Language Models (2410.04739v3)
Abstract: Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to positional bias or context-length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
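The retrieval flow described in the abstract is straightforward to sketch. Below is a minimal illustration, not the authors' implementation: the LM-based query expansion is stubbed out with content-word extraction, a toy token-overlap scorer stands in for the embedding retriever used in the paper, and all function names (`expand_query`, `retrieve`, `build_prompt`) are hypothetical.

```python
# Minimal sketch of a TableRAG-style pipeline: expand the query, retrieve
# matching schema entries and cell values, then build a compact prompt.
# All names and the token-overlap scorer are illustrative assumptions.
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercase alphanumeric tokens, so 'total_price' matches 'total price'."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def expand_query(question: str) -> list[str]:
    """Stand-in for LM-based query expansion: the question itself plus
    its content words, each used as an additional retrieval query."""
    content = [w for w in re.findall(r"[a-z0-9']+", question.lower()) if len(w) > 3]
    return [question] + content

def score(query: str, text: str) -> int:
    """Toy token-overlap score standing in for an embedding retriever."""
    return sum((tokens(query) & tokens(text)).values())

def retrieve(queries: list[str], corpus: list[str], k: int) -> list[str]:
    """Return the k corpus entries best matching any expanded query."""
    ranked = sorted(corpus,
                    key=lambda c: max(score(q, c) for q in queries),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str,
                 columns: list[str],
                 cells: list[tuple[str, str]],
                 k: int = 3) -> str:
    """Encode only retrieved schema entries and cell values, so the
    prompt length depends on k rather than on the table size."""
    queries = expand_query(question)
    schema = retrieve(queries, columns, k)
    values = retrieve(queries, [f"{c}: {v}" for c, v in cells], k)
    return (f"Relevant columns: {', '.join(schema)}\n"
            f"Relevant cells: {'; '.join(values)}\n"
            f"Question: {question}")

if __name__ == "__main__":
    columns = ["order_id", "customer_name", "total_price", "ship_date"]
    cells = [("customer_name", "Alice"), ("total_price", "42.50"),
             ("ship_date", "2021-03-01"), ("order_id", "1001")]
    print(build_prompt("What is the total price of Alice's order?",
                       columns, cells))
```

The design point this sketch reflects is the one the abstract makes: because only retrieved schema entries and cells are encoded, the prompt length is bounded by the retrieval budget k rather than by the table size, which is what allows the approach to scale to million-token tables.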
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations, 2020.
- Binding language models in symbolic languages. In The Eleventh International Conference on Learning Representations, 2023.
- Understanding tables with intermediate pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 281–296, 2020.
- Large language models on tabular data – a survey. arXiv preprint arXiv:2402.17944, 2024.
- TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320–4333, 2020.
- Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Advances in Neural Information Processing Systems, 36, 2024.
- An inner table retriever for robust table question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9909–9926, 2023.
- Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
- Rethinking tabular data understanding with large language models. arXiv preprint arXiv:2312.16702, 2023.
- MultiTabQA: Generating tabular answers for multi-table question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6322–6334, 2023.
- Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, 2015.
- DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- DTS-SQL: Decomposed text-to-SQL with small large language models. arXiv preprint arXiv:2402.01117, 2024.
- Evaluating the text-to-SQL capabilities of large language models. arXiv preprint arXiv:2204.00498, 2022.
- The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
- TAP4LLM: Table provider on sampling, augmenting, and packing semi-structured data for large language model reasoning. arXiv preprint arXiv:2312.09039, 2023.
- SQL-PaLM: Improved large language model adaptation for text-to-SQL. arXiv preprint arXiv:2306.00739, 2023.
- Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- MAC-SQL: Multi-agent collaboration for text-to-SQL. arXiv preprint arXiv:2312.11242, 2023a.
- Query2doc: Query expansion with large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023b.
- Chain-of-table: Evolving tables in the reasoning chain for table understanding. In The Twelfth International Conference on Learning Representations, 2024.
- ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
- Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 174–184, 2023.
- TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, 2020.
- Natural language to code generation in interactive data science notebooks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 126–173, 2023.
- Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, 2018.
- TableLlama: Towards open large generalist models for tables. arXiv preprint arXiv:2311.09206, 2023a.
- ReAcTable: Enhancing ReAct for table question answering. arXiv preprint arXiv:2310.00815, 2023b.
- Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.