Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data

Published 20 Feb 2024 in cs.CL (arXiv:2402.12869v2)

Abstract: Augmenting LLMs for Question Answering (QA) with domain-specific data has attracted wide attention. However, domain data often exists in a hybrid format, including text and semi-structured tables, posing challenges for the seamless integration of information. Table-to-Text Generation is a promising solution, as it transforms hybrid data into a uniformly text-formatted corpus. Although this technique has been widely studied by the NLP community, there is currently no comparative analysis of how corpora generated by different table-to-text methods affect the performance of QA systems. In this paper, we address this research gap in two steps. First, we innovatively integrate table-to-text generation into the framework of enhancing LLM-based QA systems with domain hybrid data. Then, we apply this framework to real-world industrial data to conduct extensive experiments on two types of QA systems (DSFT and RAG frameworks) with four representative methods: Markdown format, Template serialization, TPLM-based method, and LLM-based method. Based on the experimental results, we draw some empirical findings and explore the underlying reasons behind the success of some methods. We hope the findings of this work will provide a valuable reference for the academic and industrial communities in developing robust QA systems.

References (55)
  1. Hellama: Llama-based table to text generation by highlighting the important evidence.
  2. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR.
  3. Harrison Chase. 2022. Langchain.
  4. Open question answering over tables and text. In International Conference on Learning Representations.
  5. Logical Natural Language Generation from Open-Domain Tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7929–7942, Online. Association for Computational Linguistics.
  6. Hybridqa: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1026–1036.
  7. HiTab: A hierarchical table dataset for question answering and natural language generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1094–1110, Dublin, Ireland. Association for Computational Linguistics.
  8. QLoRA: Efficient Finetuning of Quantized LLMs. ArXiv:2305.14314 [cs].
  9. Measuring Causal Effects of Data Statistics on Language Model’s ‘Factual’ Predictions. ArXiv:2207.14251 [cs].
  10. Retrieval-augmented generation for large language models: A survey.
  11. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
  12. Acegpt, localizing large language models in arabic.
  13. Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4117–4129, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  14. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547.
  15. Evaluating open-domain question answering in the era of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada. Association for Computational Linguistics.
  16. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
  17. Hurdles to Progress in Long-form Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4940–4957, Online. Association for Computational Linguistics.
  18. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  19. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.
  20. Dual reader-parser on hybrid textual and tabular evidence for open domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4078–4088.
  21. Table-gpt: Table-tuned gpt for diverse table tasks. arXiv preprint arXiv:2310.09263.
  22. Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey. ArXiv:2305.18703 [cs].
  23. PLOG: Table-to-logic pretraining for logical table-to-text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5531–5546, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  24. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
  25. BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine. ArXiv:2308.09442 [cs].
  26. Few-shot Table-to-text Generation with Prefix-Controlled Generator. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6493–6504, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  27. Query rewriting in retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5303–5315, Singapore. Association for Computational Linguistics.
  28. FeTaQA: Free-form Table Question Answering. Transactions of the Association for Computational Linguistics, 10:35–49.
  29. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  30. Totto: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1173–1186.
  31. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  32. Improving language understanding by generative pre-training.
  33. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  34. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
  35. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  36. GPT4Table: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study. ArXiv:2305.13062 [cs] version: 3.
  37. MVP: Multi-task Supervised Pre-training for Natural Language Generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8758–8794, Toronto, Canada. Association for Computational Linguistics.
  38. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  39. Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).
  40. Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity. ArXiv:2310.07521 [cs].
  41. Is ChatGPT a good NLG evaluator? a preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11, Singapore. Association for Computational Linguistics.
  42. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
  43. Augmenting black-box llms with medical textbooks for clinical question answering.
  44. PMC-LLaMA: Towards Building Open-source Language Models for Medicine. ArXiv:2304.14454 [cs].
  45. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 602–631, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  46. Retrieval-augmented domain adaptation of language models. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pages 54–64, Toronto, Canada. Association for Computational Linguistics.
  47. Empower large language model to perform better on industrial domain-specific question answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 294–312, Singapore. Association for Computational Linguistics.
  48. Variational template machine for data-to-text generation. In International Conference on Learning Representations.
  49. Frequency Balanced Datasets Lead to Better Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7859–7872, Singapore. Association for Computational Linguistics.
  50. Retrieve Anything To Augment Large Language Models. ArXiv:2310.07554 [cs].
  51. OPT: Open Pre-trained Transformer Language Models. ArXiv:2205.01068 [cs].
  52. Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv preprint arXiv:2305.18703.
  53. Investigating table-to-text generation capabilities of large language models in real-world information seeking scenarios. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 160–175, Singapore. Association for Computational Linguistics.
  54. Reasoning over hybrid chain for table-and-text open domain question answering. In International Joint Conference on Artificial Intelligence (IJCAI), pages 4531–4537.
  55. Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3277–3287.

Summary

  • The paper presents an empirical comparison of four table-to-text methods, revealing up to 16% variation in QA performance.
  • It contrasts DSFT and RAG paradigms, showing that LLM-based and TPLM-based methods significantly boost domain-specific QA accuracy.
  • The study highlights practical trade-offs between resource costs, text diversity, and retrieval efficacy when integrating hybrid data.

Impact of Table-to-Text Generation Methods on LLM-based QA with Hybrid Domain Data

This paper systematically investigates the impact of table-to-text generation methods on augmenting LLM-based question answering (QA) systems with domain hybrid data, which often includes both unstructured text and semi-structured tables. The authors provide an empirical comparison of four distinct table-to-text conversion strategies within two major paradigms for leveraging external domain-specific knowledge in LLMs: Domain-Specific Fine-Tuning (DSFT) and Retrieval-Augmented Generation (RAG).

Motivations and Problem Formulation

Domain-adapted LLMs for QA rely heavily on access to high-quality in-domain corpora. In many real-world sectors, such as scientific/technical or medical domains, a typical document interleaves textual narrative with rich table content. Integrating these modalities is challenging: naive approaches like table flattening or independent modality-specific encoding undermine semantic linkage and incur loss of structure. Table-to-text generation—producing coherent natural language statements that faithfully describe tabular content—can unify hybrid corpora for LLM ingestion.

Despite substantial progress in table-to-text generation, little is known about how different methods for textualizing tables differentially impact downstream LLM-based QA over hybrid domain data. This study closes this gap by jointly evaluating major methods on a challenging industrial dataset.

Figure 1: Four representative table-to-text generation methods are illustrated, each producing a different (hybrid) domain corpus from the same base set of domain documents.

Methods for Table-to-Text Generation

The authors benchmark four representative strategies:

  1. Markdown format: Direct serialization of tables as Markdown text tables, requiring no model training (a minimal sketch of the two training-free methods follows this list).
  2. Template serialization: Hand-designed templates generate moderately varied descriptive text from table schema/value patterns; this requires human engineering but no model fine-tuning.
  3. TPLM-based method: Fine-tuning traditional pre-trained language models (TPLMs, e.g., BART, T5; here, MVP) on table-to-text corpora for domain adaptation. This requires significant computational resources but delivers expressive and diverse outputs.
  4. LLM-based method: Prompting state-of-the-art LLMs (e.g., ChatGPT) with in-context learning for one-shot table-to-text generation. This achieves the highest diversity and performance but can be resource-intensive and poses data-leakage risks when deployed via external APIs.
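
To make the two training-free methods concrete, here is a minimal sketch of Markdown and template serialization for a toy table. The table contents (a hypothetical switch specification), field names, and template wording are illustrative only, not the paper's ICT data.

```python
# Minimal sketch of the two training-free serializations (Markdown and template).
# The toy table, field names, and template wording are hypothetical illustrations.

def to_markdown(header, rows):
    """Serialize a table as a Markdown text table (no model needed)."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * len(header)) + " |"]
    lines += ["| " + " | ".join(map(str, row)) + " |" for row in rows]
    return "\n".join(lines)

def to_template_text(header, rows, template):
    """Serialize each row with a hand-written template (one sentence per row)."""
    return " ".join(template.format_map(dict(zip(header, row))) for row in rows)

header = ["Model", "Ports", "Throughput"]
rows = [["S5735-L24T", 24, "168 Gbit/s"], ["S5735-L48T", 48, "216 Gbit/s"]]

print(to_markdown(header, rows))
print(to_template_text(
    header, rows,
    "The {Model} switch provides {Ports} ports and a maximum throughput of {Throughput}."))
```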

Concise comparison:

Method        Resource    Speed      Text Diversity
Markdown      CPU         Fast       Low
Template      CPU         Fast       Moderate
TPLM-based    GPU         Moderate   High
LLM-based     GPU / API   Slow       Very high

QA System Architectures with Hybrid Corpora

For each table-to-text method, the authors build hybrid domain corpora (raw document text plus serialized tables). QA systems are instantiated under two paradigms:

  • DSFT (Domain-Specific Fine-Tuning): Base LLMs (OPT 1.3B-13B; Llama2-7B/13B) are incrementally pre-trained on the in-domain corpus, then further fine-tuned on instruction-style QA pairs.

Figure 2: Architecture of DSFT QA system with domain-specific corpora and QA-oriented instruction tuning.
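
For the instruction-tuning stage, QA pairs are cast into an instruction format; the paper's exact prompt template is not reproduced here, so the field names and wording below are a hypothetical illustration only.

```python
# Hypothetical instruction-style QA pair for the DSFT instruction-tuning stage.
# Field names and wording are illustrative; the paper's exact template is not shown.
qa_example = {
    "instruction": "Answer the question based on the ICT product documentation.",
    "input": "What is the maximum throughput of the S5735-L48T switch?",
    "output": "The S5735-L48T switch provides a maximum throughput of 216 Gbit/s.",
}
```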

  • RAG (Retrieval-Augmented Generation): The corpus is indexed for dense retrieval (DPR, BGE embeddings, FAISS); at query time, the top-k relevant chunks are passed alongside the question as context for generative QA (GPT-3.5-turbo, Llama2-chat 7B/13B/70B).
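
A minimal sketch of this retrieve-then-generate flow, assuming a BGE sentence-transformer encoder and a flat FAISS inner-product index; the model names, chunking, and prompt wording are assumptions rather than the paper's exact configuration.

```python
# Minimal RAG sketch: embed corpus chunks, index with FAISS, retrieve top-k,
# and assemble a prompt for the generator. Encoder choice, chunking, and prompt
# wording are assumptions, not the paper's exact configuration.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")       # BGE embedding model
chunks = ["...serialized table text...", "...document paragraph..."]  # hybrid corpus

# Normalize embeddings so inner product equals cosine similarity.
emb = encoder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

def build_prompt(question, k=5):
    q = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(q, k)
    context = "\n\n".join(chunks[i] for i in ids[0])
    # The prompt is then passed to the generator (GPT-3.5-turbo or Llama2-chat).
    return (f"Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```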

Experimental Setup

  • Data: ICT-DATA, a 6 GB corpus of technical ICT product documents (~18% tabular content, with domain-unique table schemas and values), and ICTQA, 9,000 real-world Q&A pairs with a 500-question test set whose required knowledge spans text and tables.
  • Evaluation: Both GPT-4-based automated scoring (0-5 scale, G-Eval protocol) and human expert annotation, focused on the semantic quality, precision, and helpfulness of long-form answers.
  • Fairness: All QA systems use the same model classes, training recipe (QLoRA for DSFT), hyperparameters, and data splits across all table-to-text methods.
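
A sketch of what the QLoRA recipe might look like with Hugging Face transformers, peft, and bitsandbytes; the base model, LoRA rank, and target modules shown are typical defaults, not the paper's reported hyperparameters.

```python
# Sketch of a QLoRA setup for the DSFT stage (transformers + peft + bitsandbytes).
# Hyperparameters below are typical defaults, not the paper's exact values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA adapters are trained
```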

Main Results

Clear differences in QA performance are observed among corpora enriched using different table-to-text strategies, with consistent trends across human and GPT-4 scoring.

  • RSD (Relative Score Difference): Up to ~9% (human) and ~16% (GPT-4) between best/worst methods per setting.
  • DSFT paradigm: TPLM-based and LLM-based textualization yield the highest QA performance across base models. LLM-based outputs are most robust to model scale and instruction recipe.
  • RAG paradigm: Somewhat surprisingly, the Markdown baseline (simple table serialization) achieves competitive results, especially with large-scale Llama2-chat models, rivaling the LLM-based method and outperforming others in certain settings.
  • GPT-3.5-turbo exhibits a higher tendency for abstention (e.g., "I don't know the answer.") than Llama2 in RAG.

    Figure 3: Human evaluation score distribution for DSFT QA systems (OPT-6.7B) demonstrates distinct performance profiles dependent on table-to-text method.

Figure 4: Pairwise win-rate comparisons between QA models using different table-to-text methods. Better methods win on >50% of test cases.
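
For reference, the pairwise win rate shown in Figure 4 can be computed from per-question scores as sketched below; the tie-handling convention (ties split evenly) is an assumption, not necessarily the paper's rule.

```python
# Pairwise win rate between two table-to-text methods from per-question scores.
# Tie handling (split evenly) is an assumption, not necessarily the paper's rule.
def win_rate(scores_a, scores_b):
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    return (wins + 0.5 * ties) / len(scores_a)

print(win_rate([4, 5, 3, 2], [3, 5, 2, 4]))  # 0.625 -> method A preferred overall
```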

Analysis of Observed Differences

In DSFT, methods yielding corpora with higher frequencies of salient domain terms and verbs (as measured relative to the gold ICTQA dataset) correlate strongly with better QA accuracy. LLM-based outputs, in particular, inject both the domain entity name and a diverse set of explanatory verbs in their paraphrases, thereby providing richer signals for parametric knowledge acquisition during LLM pre-training and fine-tuning. Template-based methods often use fewer verbs and more pronouns, leading to less effective factual grounding during model adaptation.
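
A sketch of the kind of term-frequency analysis described above, assuming salient terms are approximated by the most frequent content words in gold ICTQA answers; the tokenizer and threshold are simplifications, not the paper's exact procedure.

```python
# Sketch of the frequency analysis: how often does each generated corpus surface
# terms that appear in gold ICTQA answers? The tokenizer and the notion of a
# "salient term" (top-N words from gold answers) are simplifications.
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())

def salient_term_rate(corpus_text, gold_answers, top_n=200):
    gold_counts = Counter(t for ans in gold_answers for t in tokens(ans))
    salient = {t for t, _ in gold_counts.most_common(top_n)}
    corpus_toks = tokens(corpus_text)
    hits = sum(1 for t in corpus_toks if t in salient)
    return 1000.0 * hits / max(1, len(corpus_toks))  # salient-term hits per 1k tokens

# Comparing this rate across the four corpora gives a rough proxy for how well each
# table-to-text method injects gold-answer terminology into the DSFT training data.
```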

In RAG, retrieval utility is crucially affected by the semantic alignment of query and candidate chunks in representation space. t-SNE visualizations of chunk embeddings show that LLM-generated and Markdown-style table texts produce clusterings where the most relevant chunks (containing the answer) are semantically close to the query, thus maximizing retriever recall. TPLM and Template methods sometimes introduce misleading orthogonal or underspecified clusters, resulting in retrieval failure for factual context.

Figure 5: t-SNE visualization of retrieved chunk clusters for each method. Chunks for the LLM-based and Markdown methods are tightly aligned to the query embedding, marking improved retrieval efficacy in RAG.
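
A sketch of that embedding-space analysis, assuming scikit-learn's TSNE over chunk embeddings produced by the retriever encoder; the perplexity, colors, and answer-bearing mask are illustrative choices rather than the paper's settings.

```python
# Sketch of the t-SNE analysis: project the query and chunk embeddings to 2-D and
# inspect whether answer-bearing chunks cluster near the query.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(query_emb, chunk_embs, answer_mask, perplexity=30):
    X = np.vstack([query_emb, chunk_embs])
    xy = TSNE(n_components=2, perplexity=perplexity, init="pca",
              random_state=0).fit_transform(X)
    plt.scatter(xy[1:, 0], xy[1:, 1],
                c=["tab:green" if a else "tab:gray" for a in answer_mask],
                label="chunks (green = contains answer)")
    plt.scatter(xy[0, 0], xy[0, 1], c="tab:red", marker="*", s=200, label="query")
    plt.legend()
    plt.show()
```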

Method Selection and Practical Guidance

  • LLM-based table rewriting is, overall, the most reliable for both DSFT and RAG, especially for domains with complex entities and poorly standardized tables. Where resource or privacy constraints disallow its use, TPLM-based models offer robust DSFT performance, while Markdown remains viable for RAG—particularly when documents are short, tables are schema-regular, or rapid throughput is required.
  • Text length and storage: LLM-based and Markdown methods generate more concise textualizations, offering memory/resource advantages when deploying RAG systems with very large vector indexes.
  • Deployment: Building vector retrieval libraries for RAG at this scale (~280GB memory for >100M 1024-d vectors) is expensive, and the average text chunk length, determined by table-to-text strategy, directly impacts this resource cost.
  • Performance trade-offs: Despite its simplicity, Markdown serialization works well in RAG because structural cues (headers, columns) map to terms used in user queries, facilitating chunk retrieval in specialized domains.

    Figure 6: Top-15 frequent cell contents in table header rows of ICT-DATA, exemplifying domain-specific language critical to QA success.

Implications and Broader Context

  • Empirical findings confirm that corpora creation strategies—particularly for tables—have major, sometimes underappreciated, downstream effects on LLM-based QA. Choices made at the ETL/serialization step propagate to both model learning (DSFT) and retriever effectiveness (RAG).
  • Emergent best practice: When targeting a domain QA system for hybrid data, favor LLM-based or adaptive TPLM-based table-to-text for DSFT, and LLM-based or Markdown for RAG, taking into account memory and privacy trade-offs.
  • Model/Corpus alignment: This study corroborates observations from prior large-scale investigations [pmlr-v202-biderman23a, razeghi-etal-2022-impact, elazar_measuring_2023] that term frequency and phrase diversity in pre-training data tightly correlate with parametric knowledge and downstream factual recall in LLMs.
  • Table-to-text generation can and should be tuned for the target QA system's configuration, especially RAG, rather than simply optimizing for BLEU/ROUGE against reference summaries; semantic retrievability is paramount.
  • Automated evaluation with GPT-4 is a strong—but not perfect—surrogate for expert human annotation in long-form QA, though hallucination and explanation quality remain open challenges [liu-etal-2023-g, wang-etal-2023-chatgpt].

Conclusion

The analysis provides clear guidance for practitioners constructing domain QA systems leveraging LLMs: the choice of table-to-text pipeline is an upstream “bottleneck” for system performance. LLM-based table rewriting offers broad advantages in both DSFT and RAG settings; in resource-constrained or highly standardized table contexts, Markdown conversion remains surprisingly effective for retrieval. These empirical results, leveraging real-world ICT domain data and multi-scale QA systems, highlight the need for continued, methodologically grounded research at the intersection of document transformation, neural retrieval, and generative QA.

Figure 7: Workflow summary: from hybrid domain documents (text + tables), through table-to-text processing (four methods), to augmented QA response generation and evaluation.


Future Directions

  • Developing adaptive table-to-text models that optimize for downstream QA retrieval rather than surface similarity or fluency.
  • Investigating hybrid serialization (combining format-aware and neural-generated modalities) to balance retrieval and comprehension.
  • Exploring robust chunking and indexing schemes that are agnostic to method yet maintain high answer recall.
  • Applying analogous evaluations to new domains (e.g., legal, biomedical, financial) to generalize or specialize findings for other domain-adapted LLM QA systems.
