Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications (2405.01585v1)
Abstract: In recent times, LLMs have exhibited tremendous capabilities, especially in mathematics, code generation, and general-purpose reasoning. However, even state-of-the-art (SOTA) models struggle in specialized domains, particularly in applications that require parsing and analyzing large amounts of numeric or tabular data. In this paper, we introduce a new approach to solving domain-specific tabular data analysis tasks by presenting a unique RAG workflow that mitigates the scalability issues of existing tabular LLM solutions. Specifically, we present the Tabular Embedding Model (TEM), a novel approach to fine-tuning embedding models for tabular Retrieval-Augmented Generation (RAG) applications. Embedding models form a crucial component of the RAG workflow, yet even current SOTA embedding models are predominantly trained on textual datasets and thus underperform in scenarios involving complex tabular data. Our evaluation results show that this approach not only outperforms current SOTA embedding models in this domain but also does so with a notably smaller and more efficient model.
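As a rough illustration of the idea described in the abstract (not the paper's actual recipe), one way to fine-tune a compact embedding model for tabular retrieval is to train it on pairs of natural-language analysis questions and textual descriptions of the tables that answer them, using in-batch negatives. The base model name, table schemas, and hyperparameters in the sketch below are assumptions introduced only for illustration.

```python
# Hypothetical sketch: fine-tuning a small embedding model on
# (query, table-description) pairs for tabular RAG retrieval,
# using the sentence-transformers library.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Base model is an assumption; the paper's point is that a comparatively
# small fine-tuned model can outperform larger general-purpose embedders.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Each training pair maps an analysis question to the schema/description
# of the table that answers it (illustrative examples, not real data).
train_examples = [
    InputExample(texts=[
        "What was the average daily trading volume in Q3?",
        "Table: daily_market_data; columns: date, open, close, volume, ticker",
    ]),
    InputExample(texts=[
        "Which sector had the highest year-over-year revenue growth?",
        "Table: sector_financials; columns: sector, year, revenue, growth_pct",
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other table descriptions in the
# batch as negatives for each query, a standard contrastive setup.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)

model.save("tabular-embedding-model")
```

At retrieval time, the fine-tuned model would embed the user's question and rank table descriptions by cosine similarity, routing the query to the most relevant table before any downstream LLM analysis.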