A synthetic data approach for domain generalization of NLI models (2402.12368v2)

Published 19 Feb 2024 in cs.CL

Abstract: Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We explore the opportunity for synthetic high-quality datasets to adapt NLI models for zero-shot use in downstream applications across new and unseen text domains. We demonstrate a new approach for generating NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data (685K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around 7% on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.

A Synthetic Data Approach for Domain Generalization of NLI Models

The paper "A synthetic data approach for domain generalization of NLI models" presents a comprehensive paper focused on enhancing the domain generalization capabilities of Natural Language Inference (NLI) models through a synthetic data generation method. The primary objective is to address the suboptimal performance of NLI models on out-of-distribution or novel domain data, a persistent challenge in leveraging these models for downstream applications such as fact-checking and source attribution.

Key Contributions

The authors propose a method for generating large-scale synthetic data that covers a diverse range of domains and premise lengths, thereby improving the generalization power of NLI models. Unlike prior approaches that rely on minor edits to a few premise tokens, the hypotheses here are generated creatively while maintaining high label accuracy.

The synthetic dataset comprises 685,000 examples spanning 40 distinct and realistic domains, including domains not traditionally covered by existing datasets. A T5-small model trained on this synthetic data showed an approximate 7% improvement in average performance on the TRUE benchmark, compared to using the best existing datasets. The improvements were not confined to smaller models, as gains were still significant for larger models like T5 XXL.

Methodology

The data generation method involves tuning LLMs to produce high-quality and creative (premise, hypothesis, label) triples across various domains. The process begins with generating domain names and then premises of varying lengths within those domains. Hypotheses and corresponding labels are subsequently generated based on these premises. The approach ensures a balanced distribution of domains, premise lengths, and labels, which contrasts with prior datasets that often suffer from stylistic or domain biases.
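To make this staged pipeline concrete, below is a minimal sketch in Python. The `generate` callable, the prompt wording, and the length buckets are illustrative assumptions for exposition; the paper tunes LLMs for each stage rather than relying on generic prompts like these.

```python
import random

LABELS = ["entailment", "neutral", "contradiction"]
LENGTHS = ["one sentence", "one paragraph", "several paragraphs"]

def make_example(generate, rng=random):
    """One pass of the staged pipeline. `generate` is any callable that
    maps a prompt string to generated text; it stands in for the tuned
    LLMs the paper uses at each stage."""
    # Stage 1: produce a domain name (e.g. "hotel reviews", "legal contracts").
    domain = generate("Name one realistic domain of written text.")

    # Stage 2: generate a premise of a randomly chosen length in that domain.
    length = rng.choice(LENGTHS)
    premise = generate(f"Write a {length} passage from the domain: {domain}")

    # Stage 3: pick the target label first, then generate a hypothesis for it;
    # sampling the label uniformly keeps the three classes balanced.
    label = rng.choice(LABELS)
    hypothesis = generate(
        f"Premise: {premise}\n"
        f"Write a hypothesis whose relation to the premise is: {label}"
    )
    return {"premise": premise, "hypothesis": hypothesis, "label": label}
```

Sampling the domain, premise length, and label independently at each step is what yields the balanced distribution described above.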

Evaluation and Results

The paper evaluates the synthetic dataset's impact by training NLI models from scratch on the synthetic data and comparing their performance against models trained on established datasets such as MNLI, ANLI, and WANLI. The data's efficacy is evident: models trained on it achieved state-of-the-art performance on the TRUE factual consistency benchmark, which covers 11 diverse tasks.
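To illustrate how such a model is applied at inference time, the sketch below scores a (premise, hypothesis) pair with a text-to-text model, in the spirit of the TRUE evaluation protocol. The `t5-small` checkpoint is an untrained stand-in and the input serialization is an assumed convention, not the paper's exact setup.

```python
# pip install transformers torch sentencepiece
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # stand-in checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def nli_predict(premise: str, hypothesis: str) -> str:
    """Serialize the pair into one text input, as is standard in
    text-to-text NLI formulations, and decode the predicted label."""
    inputs = tokenizer(
        f"premise: {premise} hypothesis: {hypothesis}",
        return_tensors="pt",
        truncation=True,
    )
    output_ids = model.generate(**inputs, max_new_tokens=5)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(nli_predict("The cat sat on the mat.", "An animal is on the mat."))
```

A checkpoint fine-tuned on the synthetic data would decode a label string (or a binary consistency decision for TRUE-style tasks); the off-the-shelf `t5-small` here only demonstrates the interface.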

Moreover, the paper explores the combination of synthetic data with existing datasets, revealing that while domain-specific data yields the highest in-domain test accuracy, augmenting these datasets with synthetic data leads to improved performance across unseen domains and distributions.
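A simple way to realize this augmentation is to concatenate the in-domain set with a sample of the synthetic set and shuffle, as in the sketch below; the mixing ratio is an illustrative knob, not a value reported by the paper.

```python
import random

def mix_datasets(in_domain, synthetic, synthetic_fraction=0.5, seed=0):
    """Augment in-domain NLI training data with domain-general synthetic
    examples. `synthetic_fraction` is the share of the final mix drawn
    from the synthetic set (an assumed parameter for illustration)."""
    assert 0.0 <= synthetic_fraction < 1.0
    rng = random.Random(seed)
    # Number of synthetic examples needed to hit the requested fraction.
    n_synth = int(len(in_domain) * synthetic_fraction / (1.0 - synthetic_fraction))
    mixed = list(in_domain) + rng.sample(list(synthetic), min(n_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed
```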

Implications and Future Directions

The research strongly suggests that synthetic data generation can be a powerful tool for mitigating the limitations of NLI models when faced with novel or diverse domain inputs. The benefits of such datasets for zero-shot learning scenarios are particularly noteworthy, showcasing potential applicability across tasks involving unseen text genres and lengths.

Future work could extend this approach to other NLP tasks or to languages beyond English, given the adaptability and scalability of the synthetic data generation process. Additionally, analyzing the impact of varying the scale of the synthetic data, in terms of quantity and diversity, would provide further insight into optimizing model performance across multiple domains.

In conclusion, this paper underscores the potential of synthetic data to substantially enhance the domain generalization of NLI models, paving the way for broader application and robustness of NLP systems in diverse settings.

References (36)
  1. Dyah Adila and Dongyeop Kang. 2022. Understanding out-of-distribution: A perspective of data dynamics. In I (Still) Can’t Believe It’s Not Better! Workshop at NeurIPS 2021, pages 1–8. PMLR.
  2. QAmeleon: Multilingual QA with only 5 examples. Transactions of the Association for Computational Linguistics, 11:1754.
  3. Don’t take the premise for granted: Mitigating artifacts in natural language inference. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, Volume 1: Long Papers, pages 877–891.
  4. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  5. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  7. Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics, 10:1066–1083.
  8. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  9. TrueTeacher: Learning factual consistency evaluation with large language models. arXiv preprint arXiv:2305.11171.
  10. PaLM 2 technical report.
  11. DialFact: A benchmark for fact-checking in dialogue. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3785–3801.
  12. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
  13. Generate, annotate, and learn: NLP with synthetic text. Transactions of the Association for Computational Linguistics, 10:826–842.
  14. TRUE: Re-evaluating factual consistency evaluation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3905–3920, Seattle, United States. Association for Computational Linguistics.
  15. Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7856–7870.
  16. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  17. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461.
  18. WANLI: Worker and AI collaboration for natural language inference dataset creation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6826–6847.
  19. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919.
  20. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 10–18. PMLR.
  21. Nikita Nangia and Samuel Bowman. 2019. Human vs. muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4566–4575.
  22. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
  23. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.
  24. Training question answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5811–5826, Online. Association for Computational Linguistics.
  25. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  26. Multi-component image translation for deep domain generalization. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 579–588.
  27. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840.
  28. Get your vitamin C! Robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 624–643, Online. Association for Computational Linguistics.
  29. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9, Brussels, Belgium. Association for Computational Linguistics.
  30. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30.
  31. STraTA: Self-training with task augmentation for better few-shot learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5715–5731, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  32. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.
  33. Generalizing to unseen domains: A survey on domain generalization. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4627–4635. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
  34. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.
  35. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics.
  36. Learning to generate novel domains for domain generalization. In Computer Vision – ECCV 2020, pages 561–578, Cham. Springer International Publishing.
Authors (4)
  1. Mohammad Javad Hosseini
  2. Andrey Petrov
  3. Alex Fabrikant
  4. Annie Louis