Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation (2401.10186v3)

Published 18 Jan 2024 in cs.CL

Abstract: We analyze the behaviors of open LLMs on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design Quintd - a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.

Summary

The paper introduces Quintd-1, a new benchmark to evaluate semantic accuracy in data-to-text generation.
It compares three 7B-parameter open LLMs using a uniform prompting method across diverse domains.
Findings reveal that while models produce fluent language, 80%–91% of outputs contain semantic errors, emphasizing the need for improved evaluation methods.

Overview of LLMs and Data-to-Text Generation

Large language models (LLMs) are now applied across a wide range of NLP tasks. One of them is data-to-text (D2T) generation: producing coherent and relevant text from structured data. The task demands more than fluent language. The output must also stay semantically faithful to the input, which remains difficult for LLMs. This post summarizes an approach to evaluating LLMs on D2T that avoids standard benchmarks, whose data may have leaked into the models' training sets and would therefore give misleading results.

Quintd-1: A New Benchmark for D2T Evaluation

The authors built Quintd-1, a benchmark of structured data records from five domains: weather forecasts, product descriptions, sports summaries, health-related time series, and world fact descriptions. The records are collected from public APIs with the Quintd tool, so they are unlikely to appear in the models' training data. Inputs are provided in standard formats (JSON, CSV, and Markdown) that are well represented in the pretraining corpora of many LLMs, which lets the models be prompted zero-shot. Because the benchmark contains no human-written reference texts, outputs are judged with reference-free evaluation by human annotators and by a GPT-4-based metric.
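To make the role of the input formats concrete, here is a minimal sketch of how one record could be serialized as JSON, CSV, and Markdown before being placed in a prompt. The record, its field names, and the helper functions are hypothetical and are not taken from the Quintd code; only the three formats come from the paper.

```python
import csv
import io
import json

# Hypothetical weather record; field names and values are illustrative only.
records = [
    {"time": "2024-01-18 06:00", "temp_c": -2.5, "wind_kmh": 14},
    {"time": "2024-01-18 12:00", "temp_c": 1.0, "wind_kmh": 9},
]


def to_json(rows):
    """Serialize the rows as an indented JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(rows):
    """Serialize the rows as a Markdown table."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    for serialize in (to_json, to_csv, to_markdown):
        print(serialize(records))
        print()
```

All three strings carry the same information. Which format a given domain uses is a design choice of the benchmark and is not reproduced here.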

Methodology and Model Behavior

The paper evaluates three open 7B-parameter LLMs (Llama 2, Mistral, and Zephyr) on D2T tasks across the five domains. The setup is deliberately simple: the same prompt template is used for every task, to test whether the models can generate text from unseen data with minimal prompt engineering. The models produce fluent text, but roughly 80%–91% of the outputs contain at least one semantic error, which makes semantic accuracy the main weakness.
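The two ideas in this setup, a single prompt template and an output-level error rate, can be sketched as follows. The prompt wording, the function names, and the error labels are assumptions made for illustration; they are not the paper's actual prompt or annotation scheme.

```python
# Hypothetical template; the paper's real prompt wording is not reproduced here.
PROMPT_TEMPLATE = (
    "Based on the given data:\n"
    "{data}\n"
    "Write a short {output_type} in natural language. "
    "Mention only facts that are present in the data."
)


def build_prompt(data: str, output_type: str) -> str:
    """Fill the shared template with serialized data and a task description."""
    return PROMPT_TEMPLATE.format(data=data, output_type=output_type)


def share_with_errors(annotations):
    """Return the fraction of outputs that have at least one annotated error.

    `annotations` holds one list of error labels per generated output;
    an empty list means no semantic error was found in that output.
    """
    if not annotations:
        raise ValueError("no annotated outputs")
    flagged = sum(1 for errors in annotations if errors)
    return flagged / len(annotations)


if __name__ == "__main__":
    print(build_prompt('{"temp_c": -2.5}', "weather forecast"))
    # Five hypothetical outputs, four of which contain an error.
    example = [["incorrect"], [], ["misleading", "incorrect"], ["incorrect"], ["other"]]
    print(share_with_errors(example))
```

Counting an output as erroneous as soon as it has one flagged span is a strict measure: it says how often a text is fully trustworthy, not how many errors it contains.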

Moving Forward with D2T Generation

The findings lead to several recommendations. First, the focus should shift from linguistic fluency to semantic accuracy, in particular content selection and factual correctness. Second, efficiency deserves attention, especially for long data inputs. Finally, the research underscores the importance of reproducible and unbiased evaluation methods for future studies that use LLMs for D2T generation.

The paper supports work on better D2T systems by releasing its code, data, and model outputs along with detailed observations. It also raises open questions such as multilinguality and the real-world application of D2T systems. Getting LLMs to perform D2T tasks reliably remains an ongoing effort.
