Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data

Published 20 Feb 2024 in cs.CL (arXiv:2402.12869v2)

Abstract: Augmenting LLMs for Question Answering (QA) with domain-specific data has attracted wide attention. However, domain data often exists in a hybrid format, including text and semi-structured tables, posing challenges for the seamless integration of information. Table-to-Text Generation is a promising solution, as it transforms hybrid data into a uniformly text-formatted corpus. Although this technique has been widely studied by the NLP community, there is currently no comparative analysis of how corpora generated by different table-to-text methods affect the performance of QA systems. In this paper, we address this research gap in two steps. First, we innovatively integrate table-to-text generation into the framework of enhancing LLM-based QA systems with domain hybrid data. Then, we apply this framework to real-world industrial data to conduct extensive experiments on two types of QA systems (DSFT and RAG frameworks) with four representative methods: Markdown format, Template serialization, TPLM-based method, and LLM-based method. Based on the experimental results, we draw some empirical findings and explore the underlying reasons behind the success of some methods. We hope the findings of this work will provide a valuable reference for the academic and industrial communities in developing robust QA systems.

References (55)
  1. Hellama: Llama-based table to text generation by highlighting the important evidence.
  2. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR.
  3. Harrison Chase. 2022. Langchain.
  4. Open question answering over tables and text. In International Conference on Learning Representations.
  5. Logical Natural Language Generation from Open-Domain Tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7929–7942, Online. Association for Computational Linguistics.
  6. Hybridqa: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1026–1036.
  7. HiTab: A hierarchical table dataset for question answering and natural language generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1094–1110, Dublin, Ireland. Association for Computational Linguistics.
  8. QLoRA: Efficient Finetuning of Quantized LLMs. ArXiv:2305.14314 [cs].
  9. Measuring Causal Effects of Data Statistics on Language Model’s ‘Factual’ Predictions. ArXiv:2207.14251 [cs].
  10. Retrieval-augmented generation for large language models: A survey.
  11. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
  12. Acegpt, localizing large language models in arabic.
  13. Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4117–4129, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  14. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547.
  15. Evaluating open-domain question answering in the era of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada. Association for Computational Linguistics.
  16. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
  17. Hurdles to Progress in Long-form Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4940–4957, Online. Association for Computational Linguistics.
  18. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  19. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.
  20. Dual reader-parser on hybrid textual and tabular evidence for open domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4078–4088.
  21. Table-gpt: Table-tuned gpt for diverse table tasks. arXiv preprint arXiv:2310.09263.
  22. Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey. ArXiv:2305.18703 [cs].
  23. PLOG: Table-to-logic pretraining for logical table-to-text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5531–5546, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  24. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
  25. BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine. ArXiv:2308.09442 [cs].
  26. Few-shot Table-to-text Generation with Prefix-Controlled Generator. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6493–6504, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  27. Query rewriting in retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5303–5315, Singapore. Association for Computational Linguistics.
  28. FeTaQA: Free-form Table Question Answering. Transactions of the Association for Computational Linguistics, 10:35–49.
  29. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  30. Totto: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1173–1186.
  31. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  32. Improving language understanding by generative pre-training.
  33. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  34. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
  35. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  36. GPT4Table: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study. ArXiv:2305.13062 [cs] version: 3.
  37. MVP: Multi-task Supervised Pre-training for Natural Language Generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8758–8794, Toronto, Canada. Association for Computational Linguistics.
  38. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  39. Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).
  40. Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity. ArXiv:2310.07521 [cs].
  41. Is ChatGPT a good NLG evaluator? a preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11, Singapore. Association for Computational Linguistics.
  42. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
  43. Augmenting black-box llms with medical textbooks for clinical question answering.
  44. PMC-LLaMA: Towards Building Open-source Language Models for Medicine. ArXiv:2304.14454 [cs].
  45. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 602–631, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  46. Retrieval-augmented domain adaptation of language models. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pages 54–64, Toronto, Canada. Association for Computational Linguistics.
  47. Empower large language model to perform better on industrial domain-specific question answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 294–312, Singapore. Association for Computational Linguistics.
  48. Variational template machine for data-to-text generation. In International Conference on Learning Representations.
  49. Frequency Balanced Datasets Lead to Better Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7859–7872, Singapore. Association for Computational Linguistics.
  50. Retrieve Anything To Augment Large Language Models. ArXiv:2310.07554 [cs].
  51. OPT: Open Pre-trained Transformer Language Models. ArXiv:2205.01068 [cs].
  52. Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv preprint arXiv:2305.18703.
  53. Investigating table-to-text generation capabilities of large language models in real-world information seeking scenarios. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 160–175, Singapore. Association for Computational Linguistics.
  54. Reasoning over hybrid chain for table-and-text open domain question answering. In International Joint Conference on Artificial Intelligence (IJCAI), pages 4531–4537.
  55. Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3277–3287.

Summary

  • The paper presents an empirical comparison of four table-to-text methods, revealing up to 16% variation in QA performance.
  • It contrasts DSFT and RAG paradigms, showing that LLM-based and TPLM-based methods significantly boost domain-specific QA accuracy.
  • The study highlights practical trade-offs between resource costs, text diversity, and retrieval efficacy when integrating hybrid data.

Impact of Table-to-Text Generation Methods on LLM-based QA with Hybrid Domain Data

This paper systematically investigates the impact of table-to-text generation methods on augmenting LLM-based question answering (QA) systems with domain hybrid data, which often includes both unstructured text and semi-structured tables. The authors provide an empirical comparison of four distinct table-to-text conversion strategies within two major paradigms for leveraging external domain-specific knowledge in LLMs: Domain-Specific Fine-Tuning (DSFT) and Retrieval-Augmented Generation (RAG).

Motivations and Problem Formulation

Domain-adapted LLMs for QA rely heavily on access to high-quality in-domain corpora. In many real-world sectors, such as scientific/technical or medical domains, a typical document interleaves textual narrative with rich table content. Integrating these modalities is challenging: naive approaches like table flattening or independent modality-specific encoding undermine semantic linkage and incur loss of structure. Table-to-text generation—producing coherent natural language statements that faithfully describe tabular content—can unify hybrid corpora for LLM ingestion.

Despite substantial progress in table-to-text generation, little is known about how different methods for textualizing tables differentially impact downstream LLM-based QA over hybrid domain data. This study closes this gap by jointly evaluating major methods on a challenging industrial dataset.

Figure 1: Four representative table-to-text generation methods are illustrated, each producing a different (hybrid) domain corpus from the same base set of domain documents.

Methods for Table-to-Text Generation

The authors benchmark four representative strategies:

  1. Markdown format: Direct serialization of tables as Markdown text tables, requiring no model training (a minimal sketch of the two training-free methods follows this list).
  2. Template serialization: Hand-designed templates generate moderately varied descriptive text from table schema/value patterns; this requires human engineering but no model fine-tuning.
  3. TPLM-based method: Fine-tuning traditional pre-trained language models (TPLMs, e.g., BART, T5; here, MVP) on table-to-text corpora for domain adaptation. This requires significant computational resources but delivers expressive and diverse outputs.
  4. LLM-based method: Prompting state-of-the-art LLMs (e.g., ChatGPT) with in-context learning for one-shot table-to-text generation. This achieves the highest diversity and performance but can be resource-intensive and poses data-leakage risks when deployed via external APIs.
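
To make the two training-free methods concrete, here is a minimal sketch of Markdown and template serialization for a toy table. The table contents (a hypothetical switch specification), field names, and template wording are illustrative only, not the paper's ICT data.

```python
# Minimal sketch of the two training-free serializations (Markdown and template).
# The toy table, field names, and template wording are hypothetical illustrations.

def to_markdown(header, rows):
    """Serialize a table as a Markdown text table (no model needed)."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * len(header)) + " |"]
    lines += ["| " + " | ".join(map(str, row)) + " |" for row in rows]
    return "\n".join(lines)

def to_template_text(header, rows, template):
    """Serialize each row with a hand-written template (one sentence per row)."""
    return " ".join(template.format_map(dict(zip(header, row))) for row in rows)

header = ["Model", "Ports", "Throughput"]
rows = [["S5735-L24T", 24, "168 Gbit/s"], ["S5735-L48T", 48, "216 Gbit/s"]]

print(to_markdown(header, rows))
print(to_template_text(
    header, rows,
    "The {Model} switch provides {Ports} ports and a maximum throughput of {Throughput}."))
```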

Concise comparison:

Method        Resource    Speed      Text Diversity
Markdown      CPU         Fast       Low
Template      CPU         Fast       Moderate
TPLM-based    GPU         Moderate   High
LLM-based     GPU / API   Slow       Very high

QA System Architectures with Hybrid Corpora

For each table-to-text method, the authors build hybrid domain corpora (raw document text plus serialized tables). QA systems are instantiated under two paradigms:

  • DSFT (Domain-Specific Fine-Tuning): Base LLMs (OPT 1.3B-13B; Llama2-7B/13B) are incrementally pre-trained on the in-domain corpus, then further fine-tuned on instruction-style QA pairs.

Figure 2: Architecture of DSFT QA system with domain-specific corpora and QA-oriented instruction tuning.
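
For the instruction-tuning stage, QA pairs are cast into an instruction format; the paper's exact prompt template is not reproduced here, so the field names and wording below are a hypothetical illustration only.

```python
# Hypothetical instruction-style QA pair for the DSFT instruction-tuning stage.
# Field names and wording are illustrative; the paper's exact template is not shown.
qa_example = {
    "instruction": "Answer the question based on the ICT product documentation.",
    "input": "What is the maximum throughput of the S5735-L48T switch?",
    "output": "The S5735-L48T switch provides a maximum throughput of 216 Gbit/s.",
}
```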

  • RAG (Retrieval-Augmented Generation): The corpus is indexed for dense retrieval (DPR, BGE embeddings, FAISS); at query time, the top-k relevant chunks are passed alongside the question as context for generative QA (GPT-3.5-turbo, Llama2-chat 7B/13B/70B).
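
A minimal sketch of this retrieve-then-generate flow, assuming a BGE sentence-transformer encoder and a flat FAISS inner-product index; the model names, chunking, and prompt wording are assumptions rather than the paper's exact configuration.

```python
# Minimal RAG sketch: embed corpus chunks, index with FAISS, retrieve top-k,
# and assemble a prompt for the generator. Encoder choice, chunking, and prompt
# wording are assumptions, not the paper's exact configuration.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")       # BGE embedding model
chunks = ["...serialized table text...", "...document paragraph..."]  # hybrid corpus

# Normalize embeddings so inner product equals cosine similarity.
emb = encoder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

def build_prompt(question, k=5):
    q = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(q, k)
    context = "\n\n".join(chunks[i] for i in ids[0])
    # The prompt is then passed to the generator (GPT-3.5-turbo or Llama2-chat).
    return (f"Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```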

Experimental Setup

  • Data: ICT-DATA, a 6 GB corpus of technical ICT product documents (~18% tabular content, with domain-unique table schemas and values), and ICTQA, 9,000 real-world Q&A pairs with a 500-question test set whose required knowledge spans text and tables.
  • Evaluation: Both GPT-4-based automated scoring (0-5 scale, G-Eval protocol) and human expert annotation, focused on the semantic quality, precision, and helpfulness of long-form answers.
  • Fairness: All QA systems use the same model classes, training recipe (QLoRA for DSFT), hyperparameters, and data splits across all table-to-text methods.
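
A sketch of what the QLoRA recipe might look like with Hugging Face transformers, peft, and bitsandbytes; the base model, LoRA rank, and target modules shown are typical defaults, not the paper's reported hyperparameters.

```python
# Sketch of a QLoRA setup for the DSFT stage (transformers + peft + bitsandbytes).
# Hyperparameters below are typical defaults, not the paper's exact values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA adapters are trained
```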

Main Results

Clear differences in QA performance are observed among corpora enriched using different table-to-text strategies, with consistent trends across human and GPT-4 scoring.

  • RSD (Relative Score Difference): Up to ~9% (human) and ~16% (GPT-4) between best/worst methods per setting.
  • DSFT paradigm: TPLM-based and LLM-based textualization yield the highest QA performance across base models. LLM-based outputs are most robust to model scale and instruction recipe.
  • RAG paradigm: Somewhat surprisingly, the Markdown baseline (simple table serialization) achieves competitive results, especially with large-scale Llama2-chat models, rivaling the LLM-based method and outperforming others in certain settings.
  • GPT-3.5-turbo exhibits a higher tendency for abstention (e.g., "I don't know the answer.") than Llama2 in RAG.

    Figure 3: Human evaluation score distribution for DSFT QA systems (OPT-6.7B) demonstrates distinct performance profiles dependent on table-to-text method.

Figure 4: Pairwise win-rate comparisons between QA models using different table-to-text methods. Better methods win on >50% of test cases.
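
For reference, the pairwise win rate shown in Figure 4 can be computed from per-question scores as sketched below; the tie-handling convention (ties split evenly) is an assumption, not necessarily the paper's rule.

```python
# Pairwise win rate between two table-to-text methods from per-question scores.
# Tie handling (split evenly) is an assumption, not necessarily the paper's rule.
def win_rate(scores_a, scores_b):
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    return (wins + 0.5 * ties) / len(scores_a)

print(win_rate([4, 5, 3, 2], [3, 5, 2, 4]))  # 0.625 -> method A preferred overall
```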

Analysis of Observed Differences

In DSFT, methods yielding corpora with higher frequencies of salient domain terms and verbs (as measured relative to the gold ICTQA dataset) correlate strongly with better QA accuracy. LLM-based outputs, in particular, inject both the domain entity name and a diverse set of explanatory verbs in their paraphrases, thereby providing richer signals for parametric knowledge acquisition during LLM pre-training and fine-tuning. Template-based methods often use fewer verbs and more pronouns, leading to less effective factual grounding during model adaptation.
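
A sketch of the kind of term-frequency analysis described above, assuming salient terms are approximated by the most frequent content words in gold ICTQA answers; the tokenizer and threshold are simplifications, not the paper's exact procedure.

```python
# Sketch of the frequency analysis: how often does each generated corpus surface
# terms that appear in gold ICTQA answers? The tokenizer and the notion of a
# "salient term" (top-N words from gold answers) are simplifications.
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())

def salient_term_rate(corpus_text, gold_answers, top_n=200):
    gold_counts = Counter(t for ans in gold_answers for t in tokens(ans))
    salient = {t for t, _ in gold_counts.most_common(top_n)}
    corpus_toks = tokens(corpus_text)
    hits = sum(1 for t in corpus_toks if t in salient)
    return 1000.0 * hits / max(1, len(corpus_toks))  # salient-term hits per 1k tokens

# Comparing this rate across the four corpora gives a rough proxy for how well each
# table-to-text method injects gold-answer terminology into the DSFT training data.
```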

In RAG, retrieval utility is crucially affected by the semantic alignment of query and candidate chunks in representation space. t-SNE visualizations of chunk embeddings show that LLM-generated and Markdown-style table texts produce clusterings where the most relevant chunks (containing the answer) are semantically close to the query, thus maximizing retriever recall. TPLM and Template methods sometimes introduce misleading orthogonal or underspecified clusters, resulting in retrieval failure for factual context.

Figure 5: t-SNE visualization of retrieved chunk clusters for each method. Chunks for the LLM-based and Markdown methods are tightly aligned to the query embedding, marking improved retrieval efficacy in RAG.
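
A sketch of that embedding-space analysis, assuming scikit-learn's TSNE over chunk embeddings produced by the retriever encoder; the perplexity, colors, and answer-bearing mask are illustrative choices rather than the paper's settings.

```python
# Sketch of the t-SNE analysis: project the query and chunk embeddings to 2-D and
# inspect whether answer-bearing chunks cluster near the query.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(query_emb, chunk_embs, answer_mask, perplexity=30):
    X = np.vstack([query_emb, chunk_embs])
    xy = TSNE(n_components=2, perplexity=perplexity, init="pca",
              random_state=0).fit_transform(X)
    plt.scatter(xy[1:, 0], xy[1:, 1],
                c=["tab:green" if a else "tab:gray" for a in answer_mask],
                label="chunks (green = contains answer)")
    plt.scatter(xy[0, 0], xy[0, 1], c="tab:red", marker="*", s=200, label="query")
    plt.legend()
    plt.show()
```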

Method Selection and Practical Guidance

  • LLM-based table rewriting is, overall, the most reliable for both DSFT and RAG, especially for domains with complex entities and poorly standardized tables. Where resource or privacy constraints disallow its use, TPLM-based models offer robust DSFT performance, while Markdown remains viable for RAG—particularly when documents are short, tables are schema-regular, or rapid throughput is required.
  • Text length and storage: LLM-based and Markdown methods generate more concise textualizations, offering memory/resource advantages when deploying RAG systems with very large vector indexes.
  • Deployment: Building vector retrieval libraries for RAG at this scale (~280GB memory for >100M 1024-d vectors) is expensive, and the average text chunk length, determined by table-to-text strategy, directly impacts this resource cost.
  • Performance trade-offs: Despite its simplicity, Markdown serialization works well in RAG because structural cues (headers, columns) map to terms used in user queries, facilitating chunk retrieval in specialized domains.

    Figure 6: Top-15 frequent cell contents in table header rows of ICT-DATA, exemplifying domain-specific language critical to QA success.

Implications and Broader Context

  • Empirical findings confirm that corpora creation strategies—particularly for tables—have major, sometimes underappreciated, downstream effects on LLM-based QA. Choices made at the ETL/serialization step propagate to both model learning (DSFT) and retriever effectiveness (RAG).
  • Emergent best practice: When targeting a domain QA system for hybrid data, favor LLM-based or adaptive TPLM-based table-to-text for DSFT, and LLM-based or Markdown for RAG, taking into account memory and privacy trade-offs.
  • Model/Corpus alignment: This study corroborates observations from prior large-scale investigations [pmlr-v202-biderman23a, razeghi-etal-2022-impact, elazar_measuring_2023] that term frequency and phrase diversity in pre-training data tightly correlate with parametric knowledge and downstream factual recall in LLMs.
  • Table-to-text generation can and should be tuned for the target QA system's configuration, especially RAG, rather than simply optimizing for BLEU/ROUGE against reference summaries; semantic retrievability is paramount.
  • Automated evaluation with GPT-4 is a strong—but not perfect—surrogate for expert human annotation in long-form QA, though hallucination and explanation quality remain open challenges [liu-etal-2023-g, wang-etal-2023-chatgpt].

Conclusion

The analysis provides clear guidance for practitioners constructing domain QA systems leveraging LLMs: the choice of table-to-text pipeline is an upstream “bottleneck” for system performance. LLM-based table rewriting offers broad advantages in both DSFT and RAG settings; in resource-constrained or highly standardized table contexts, Markdown conversion remains surprisingly effective for retrieval. These empirical results, leveraging real-world ICT domain data and multi-scale QA systems, highlight the need for continued, methodologically grounded research at the intersection of document transformation, neural retrieval, and generative QA.

Figure 7: Workflow summary: from hybrid domain documents (text + tables), through table-to-text processing (four methods), to augmented QA response generation and evaluation.


Future Directions

  • Developing adaptive table-to-text models that optimize for downstream QA retrieval rather than surface similarity or fluency.
  • Investigating hybrid serialization (combining format-aware and neural-generated modalities) to balance retrieval and comprehension.
  • Exploring robust chunking and indexing schemes that are agnostic to method yet maintain high answer recall.
  • Applying analogous evaluations to new domains (e.g., legal, biomedical, financial) to generalize or specialize findings for other domain-adapted LLM QA systems.
