Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval (2404.09889v2)

Published 15 Apr 2024 in cs.IR, cs.AI, and cs.CL

Abstract: Retrieving relevant tables containing the necessary information to accurately answer a given question over tables is critical to open-domain question-answering (QA) systems. Previous methods assume the answer to such a question can be found either in a single table or multiple tables identified through question decomposition or rewriting. However, neither of these approaches is sufficient, as many questions require retrieving multiple tables and joining them through a join plan that cannot be discerned from the user query itself. If the join plan is not considered in the retrieval stage, the subsequent steps of reasoning and answering based on those retrieved tables are likely to be incorrect. To address this problem, we introduce a method that uncovers useful join relations for any query and database during table retrieval. We use a novel re-ranking method formulated as a mixed-integer program that considers not only table-query relevance but also table-table relevance that requires inferring join relationships. Our method outperforms the state-of-the-art approaches for table retrieval by up to 9.3% in F1 score and for end-to-end QA by up to 5.4% in accuracy.
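The paper does not spell out its formulation here, but the abstract's idea of re-ranking with a mixed-integer program that scores both table-query relevance and table-table (join) relevance can be sketched in a few lines. The following is a minimal, illustrative example using PuLP, not the authors' actual model: the score dictionaries (`query_rel`, `join_rel`), the table names, and the `budget` cap are all hypothetical stand-ins for whatever an upstream retriever and join-discovery step would produce.

```python
# Minimal sketch (not the paper's exact formulation) of join-aware re-ranking
# as a mixed-integer program, solved with PuLP's bundled CBC solver.
# Assumptions: `query_rel` holds query-table relevance scores and `join_rel`
# holds pairwise join-compatibility scores from some upstream scorer;
# `budget` caps how many tables are returned.

import pulp

# Toy scores (hypothetical values, for illustration only).
tables = ["orders", "customers", "products"]
query_rel = {"orders": 0.9, "customers": 0.7, "products": 0.2}
join_rel = {("orders", "customers"): 0.8, ("orders", "products"): 0.1}
budget = 2  # maximum number of tables to retrieve

prob = pulp.LpProblem("join_aware_rerank", pulp.LpMaximize)

# x[t] = 1 if table t is selected; y[(i, j)] = 1 if both i and j are selected.
x = {t: pulp.LpVariable(f"x_{t}", cat="Binary") for t in tables}
y = {p: pulp.LpVariable(f"y_{p[0]}_{p[1]}", cat="Binary") for p in join_rel}

# Objective: query-table relevance plus table-table (join) relevance.
prob += (
    pulp.lpSum(query_rel[t] * x[t] for t in tables)
    + pulp.lpSum(join_rel[p] * y[p] for p in join_rel)
)

# Linking constraints: a join pair only counts if both of its tables are selected.
for (i, j), pair_var in y.items():
    prob += pair_var <= x[i]
    prob += pair_var <= x[j]

# Return at most `budget` tables.
prob += pulp.lpSum(x.values()) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [t for t in tables if x[t].value() == 1]
print("Selected tables:", selected)  # here: ['orders', 'customers']
```

In this toy instance the solver prefers `orders` plus `customers` (joint score 2.4) over `orders` plus `products` (1.2), illustrating how join relevance can change the top-k set relative to ranking tables by query relevance alone.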

Authors (3)
  1. Peter Baile Chen
  2. Yi Zhang
  3. Dan Roth
