Genie: Achieving Human Parity in Content-Grounded Datasets Generation (2401.14367v1)

Published 25 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose Genie, a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) Content Preparation, (b) Generation: creating task-specific examples from the content (e.g., question-answer pairs or summaries). (c) Filtering mechanism aiming to ensure the quality and faithfulness of the generated data. We showcase this methodology by generating three large-scale synthetic data, making wishes, for Long-Form Question-Answering (LFQA), summarization, and information extraction. In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data -- ELI5 and ASQA for LFQA and CNN-DailyMail for Summarization. We show that our models are on par with or outperforming models trained on human-generated data and consistently outperforming them in faithfulness. Finally, we applied our method to create LFQA data within the medical domain and compared a model trained on it with models trained on other domains.

Authors (8)
  1. Asaf Yehudai (16 papers)
  2. Boaz Carmeli (14 papers)
  3. Yosi Mass (8 papers)
  4. Ofir Arviv (11 papers)
  5. Nathaniel Mills (2 papers)
  6. Assaf Toledo (10 papers)
  7. Eyal Shnarch (15 papers)
  8. Leshem Choshen (78 papers)
Citations (21)

Summary

Introduction

Obtaining high-quality data for content-grounded NLP tasks remains a significant bottleneck. To address this scarcity, the authors propose Genie, a methodology that automates the creation of high-quality synthetic datasets across a variety of domains and tasks. The method consists of three stages: Content Preparation, Generation by an LLM, and a Filtering mechanism designed to ensure the fidelity and quality of the synthesized data.

Automating Dataset Curation

The first stage, Content Preparation, extracts relevant passages from raw documents via a rule-based approach. Although it is the least generalizable part of the method, it lays the groundwork for the generation phase that follows. In the Generation phase, an LLM produces task-specific synthetic examples from each passage; greedy decoding is used to elicit more grounded responses. The paper reports results with two LLMs for data creation: Falcon-40B and Llama-2-70B.
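The first two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the paragraph-splitting rule, and the prompt wording are all assumptions, and the actual LLM call (Falcon-40B or Llama-2-70B in the paper) is stubbed out as a callable.

```python
def prepare_content(document: str, min_words: int = 10) -> list[str]:
    """Rule-based content preparation (illustrative): split a raw document
    into paragraph-level passages and keep only sufficiently long ones."""
    passages = [p.strip() for p in document.split("\n\n")]
    return [p for p in passages if len(p.split()) >= min_words]


def build_lfqa_prompt(passage: str) -> str:
    """Wrap a passage in a task-specific instruction (LFQA shown here;
    the prompt text is a placeholder, not the paper's actual prompt)."""
    return (
        "Based only on the passage below, write a question and a "
        "long-form answer grounded in the passage.\n\n"
        f"Passage: {passage}\n"
    )


def generate_examples(document: str, llm) -> list[str]:
    """Generation stage: one synthetic example per prepared passage.
    `llm` stands in for a decoder-only model called with greedy decoding."""
    return [llm(build_lfqa_prompt(p)) for p in prepare_content(document)]
```

The same skeleton applies to the other tasks (summarization, information extraction) by swapping the prompt template.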

Ensuring Data Quality through Filtering

The Filtering phase is multifaceted, discarding examples that lack proper formatting, grounding, or overall quality. Format-based filtering checks for missing components in an example. Faithfulness is verified with an off-the-shelf textual-entailment metric backed by a T5-11B NLI model. Quality is assessed with a reward model trained on human-preference data, which serves as a proxy for how human-like a response is. Examples scoring below a set threshold are excluded from the final dataset.
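The three-part filter can be sketched as a simple conjunction of checks. This is an illustrative sketch, not the paper's code: the NLI and reward scorers are passed in as callables (the paper uses a T5-11B NLI model and a learned reward model), and the threshold values shown are assumptions.

```python
def format_ok(example: dict) -> bool:
    """Format filter: every required component must be present and non-empty."""
    return bool(example.get("question")) and bool(example.get("answer"))


def keep_example(example: dict,
                 nli_score,      # (passage, answer) -> entailment prob in [0, 1]
                 reward_score,   # (question, answer) -> quality score
                 nli_threshold: float = 0.5,
                 reward_threshold: float = 0.0) -> bool:
    """Keep an example only if it passes all three filters:
    format, faithfulness (entailment), and reward-model quality."""
    if not format_ok(example):
        return False
    if nli_score(example["passage"], example["answer"]) < nli_threshold:
        return False
    return reward_score(example["question"], example["answer"]) >= reward_threshold
```

In practice the two scorers would wrap model inference; the conjunction structure is the point here: an example survives only if every filter accepts it.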

Empirical Validation and Applications

Experiments validate the efficacy of the Genie-generated datasets, particularly for Long-Form Question Answering (LFQA) and summarization. Models trained on the synthetic data match or outperform those trained on human-generated data across several metrics, and are consistently more faithful to the source content. By releasing the synthetic data publicly as the 'wishes' datasets, the authors provide a valuable resource and demonstrate the utility of LLM-generated datasets for improving both performance and content faithfulness in grounded NLP models.