- The paper introduces PARENT, a metric that aligns n-grams from generated and reference texts with the table using an entailment model, so that both the table and the reference inform the evaluation.
- It demonstrates a stronger correlation with human judgment than traditional metrics like BLEU, ROUGE, and METEOR across diverse settings.
- Extensive human evaluations confirm that PARENT remains reliable even when reference texts diverge from the table.
Evaluation of Divergent Reference Texts in Table-to-Text Generation
The paper "Handling Divergent Reference Texts when Evaluating Table-to-Text Generation" addresses the significant challenge of evaluating text generation systems where reference texts may diverge from semi-structured data sources, such as tables. Such divergence presents a unique problem for existing automatic evaluation metrics, which typically assume ideal or "gold-standard" reference texts that align accurately with the tabular data.
Proposed Metric: PARENT
The authors propose a metric dubbed PARENT (Precision And Recall of Entailed N-grams from the Table), designed to improve the evaluation of text generation models. PARENT incorporates both the reference text and the structured data into the assessment, addressing divergence through alignment: it computes n-gram precision and recall by aligning n-grams from the generated and reference texts with the original table, using an entailment model to estimate which n-grams are actually supported by the table's content.
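To make the idea concrete, the following is a minimal sketch of this kind of entailed n-gram scoring in Python. It uses the simple word-overlap entailment model described in the paper as the only entailment signal; the tokenization, the max-based precision weighting, the omission of table-side recall, and the geometric-mean averaging are simplifying assumptions of this sketch, not the exact PARENT definition.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_entailment(ngram, table_tokens):
    """Word-overlap entailment model: fraction of an n-gram's tokens that
    also appear among the table's attribute and value tokens."""
    return sum(tok in table_tokens for tok in ngram) / len(ngram)

def parent_style_score(prediction, reference, table_tokens, max_n=4):
    """Simplified PARENT-style F-score. Predicted n-grams earn precision
    credit if they occur in the reference or are entailed by the table;
    recall only asks for the table-entailed part of the reference."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        pred = Counter(ngrams(prediction, n))
        ref = Counter(ngrams(reference, n))
        if not pred or not ref:
            continue
        # Precision: each predicted n-gram is weighted by the larger of its
        # reference match (0 or 1) and its table-entailment probability.
        num = sum(c * max(float(g in ref), overlap_entailment(g, table_tokens))
                  for g, c in pred.items())
        precisions.append(num / sum(pred.values()))
        # Recall: divergent (non-entailed) reference n-grams carry little
        # weight, so a model is not punished for omitting them.
        den = sum(c * overlap_entailment(g, table_tokens) for g, c in ref.items())
        num = sum(min(c, pred[g]) * overlap_entailment(g, table_tokens)
                  for g, c in ref.items())
        recalls.append(num / den if den > 0 else 0.0)
    if not precisions or not recalls:
        return 0.0
    # Geometric mean over n-gram orders, then an F-score (BLEU-style averaging).
    p = exp(sum(log(max(x, 1e-9)) for x in precisions) / len(precisions))
    r = exp(sum(log(max(x, 1e-9)) for x in recalls) / len(recalls))
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Hypothetical example: the reference mentions "in germany", which the table
# does not support, so a prediction omitting it is not penalized on recall.
table_tokens = {"michael", "dahlen", "born", "1957", "engineer"}
reference = "michael dahlen born 1957 is an engineer in germany".split()
prediction = "michael dahlen is an engineer born in 1957".split()
print(round(parent_style_score(prediction, reference, table_tokens), 3))
```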
Comparison with Existing Metrics
Through extensive human evaluation, the paper demonstrates that PARENT correlates more strongly with human judgment than traditional metrics such as BLEU, ROUGE, and METEOR. This holds across settings, including comparisons between different system types as well as between variants of the same system that differ only in hyperparameters. The study also considers information extraction-based metrics, which perform comparably to PARENT but depend on a trained extraction system, making PARENT easier to apply in practice.
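As a rough illustration of the correlation analysis involved, the snippet below computes Pearson correlations between per-system human scores and per-system metric scores; the numbers are invented for illustration, `scipy` is assumed to be available, and the paper's actual analysis is considerably more thorough.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical system-level scores: one human score and one score per
# automatic metric for each of five generation systems being compared.
human = np.array([0.62, 0.48, 0.71, 0.55, 0.66])
metric_scores = {
    "BLEU":   np.array([0.31, 0.29, 0.35, 0.30, 0.33]),
    "PARENT": np.array([0.58, 0.44, 0.69, 0.51, 0.63]),
}

for name, scores in metric_scores.items():
    r, _ = pearsonr(scores, human)
    print(f"{name}: Pearson r = {r:.3f} against human judgments")
```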
Human Evaluation Methodology
The paper grounds its findings in human evaluations covering a diverse range of table-to-text models. Pairwise preference judgments were collected and converted into per-system scores using Thurstone's method, yielding a quantitative human ranking against which the automatic metrics could be validated.
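For readers unfamiliar with Thurstone scaling, the sketch below shows the standard Case V construction: each system's scale value is the average inverse-normal z-score of the proportions of judgments preferring it over the other systems. The three-system preference matrix is hypothetical, and the paper's exact variant of the method may differ in detail.

```python
import numpy as np
from scipy.stats import norm

def thurstone_case_v(preference_matrix):
    """Thurstone Case V scaling: convert a matrix of pairwise preference
    proportions (p[i, j] = fraction of judgments preferring system i over
    system j) into one scale value per system."""
    p = np.asarray(preference_matrix, dtype=float)
    # Clip away 0/1 proportions so the inverse normal CDF stays finite.
    p = np.clip(p, 0.01, 0.99)
    z = norm.ppf(p)                 # pairwise z-scores
    np.fill_diagonal(z, 0.0)        # a system is not compared with itself
    return z.mean(axis=1)           # average z-score = scale value

# Hypothetical preference proportions for three systems.
prefs = [[0.5, 0.7, 0.9],
         [0.3, 0.5, 0.6],
         [0.1, 0.4, 0.5]]
print(thurstone_case_v(prefs))
```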
Analysis of Divergence and Metric Sensitivity
In examining the impact of divergent references, the paper finds that PARENT maintains a high correlation with human assessments as the proportion of divergent references varies, suggesting the metric's robustness on real-world datasets, where divergence is common. An ablation study further underlines the importance of each component of PARENT, particularly the modeling of entailment probability.
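One way to probe this sensitivity, sketched below under assumed inputs (per-instance metric and human scores plus a boolean flag marking which references are divergent), is to subsample the evaluation set at different divergence proportions and recompute the metric-human correlation at each level. This mirrors the spirit of the analysis rather than reproducing the paper's exact procedure.

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_vs_divergence(metric_scores, human_scores, divergent_mask,
                              proportions=(0.0, 0.25, 0.5, 0.75, 1.0),
                              sample_size=200, seed=0):
    """For each target proportion of divergent references, draw a subsample
    with that mix and report the metric-human Pearson correlation on it."""
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    divergent_mask = np.asarray(divergent_mask, dtype=bool)
    rng = np.random.default_rng(seed)
    divergent = np.flatnonzero(divergent_mask)   # indices of divergent refs
    faithful = np.flatnonzero(~divergent_mask)   # indices of faithful refs
    results = {}
    for prop in proportions:
        n_div = int(round(prop * sample_size))
        n_fai = sample_size - n_div
        idx = np.concatenate([
            rng.choice(divergent, size=min(n_div, len(divergent)), replace=False),
            rng.choice(faithful, size=min(n_fai, len(faithful)), replace=False),
        ])
        r, _ = pearsonr(metric_scores[idx], human_scores[idx])
        results[prop] = r
    return results
```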
Assessment on WebNLG Dataset
To corroborate PARENT's general applicability, additional evaluations were performed on the WebNLG dataset, where references are less likely to diverge. Even with high-quality, human-authored references, PARENT performs as well as, and in some respects better than, established metrics.
Implications and Future Research
The introduction of PARENT has important implications for the evaluation of table-to-text generation, particularly where automatic dataset construction yields reference texts of varied quality. By aligning outputs with both the reference texts and the originating tables, the metric provides an evaluation framework that accounts for content fidelity as well as linguistic expression.
Future research could extend PARENT to tasks requiring more complex inference, for instance by drawing on large language models, or develop new entailment models that improve alignment in data-to-text tasks where paraphrasing goes well beyond lexical overlap.
In summary, the paper presents a practical solution to the pervasive challenge of evaluating table-to-text generation models in the presence of divergent references. By combining theoretical rigor with practical applicability, it sets a robust standard for future research and development in natural language generation systems.