A synthetic data approach for domain generalization of NLI models (2402.12368v2)

Published 19 Feb 2024 in cs.CL

Abstract: Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We explore the opportunity for synthetic high-quality datasets to adapt NLI models for zero-shot use in downstream applications across new and unseen text domains. We demonstrate a new approach for generating NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data (685K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around 7% on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.

A Synthetic Data Approach for Domain Generalization of NLI Models

The paper "A synthetic data approach for domain generalization of NLI models" presents a comprehensive paper focused on enhancing the domain generalization capabilities of Natural Language Inference (NLI) models through a synthetic data generation method. The primary objective is to address the suboptimal performance of NLI models on out-of-distribution or novel domain data, a persistent challenge in leveraging these models for downstream applications such as fact-checking and source attribution.

Key Contributions

The authors propose a method for generating large-scale synthetic data that covers a diverse range of domains and premise lengths, thereby improving the generalization power of NLI models. Unlike prior approaches that rely on minor edits to a few premise tokens, the hypotheses here are generated creatively while maintaining high label accuracy.

The synthetic dataset comprises 685,000 examples spanning 40 distinct and realistic domains, including domains not traditionally covered by existing datasets. A T5-small model trained on this synthetic data showed an approximate 7% improvement in average performance on the TRUE benchmark, compared to using the best existing datasets. The improvements were not confined to smaller models, as gains were still significant for larger models like T5 XXL.

Methodology

The data generation method involves tuning LLMs to produce high-quality and creative (premise, hypothesis, label) triples across various domains. The process begins with generating domain names and then premises of varying lengths within those domains. Hypotheses and corresponding labels are subsequently generated based on these premises. The approach ensures a balanced distribution of domains, premise lengths, and labels, which contrasts with prior datasets that often suffer from stylistic or domain biases.
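To make this staged pipeline concrete, below is a minimal sketch in Python. The `generate` callable, the prompt wording, and the length buckets are illustrative assumptions for exposition; the paper tunes LLMs for each stage rather than relying on generic prompts like these.

```python
import random

LABELS = ["entailment", "neutral", "contradiction"]
LENGTHS = ["one sentence", "one paragraph", "several paragraphs"]

def make_example(generate, rng=random):
    """One pass of the staged pipeline. `generate` is any callable that
    maps a prompt string to generated text; it stands in for the tuned
    LLMs the paper uses at each stage."""
    # Stage 1: produce a domain name (e.g. "hotel reviews", "legal contracts").
    domain = generate("Name one realistic domain of written text.")

    # Stage 2: generate a premise of a randomly chosen length in that domain.
    length = rng.choice(LENGTHS)
    premise = generate(f"Write a {length} passage from the domain: {domain}")

    # Stage 3: pick the target label first, then generate a hypothesis for it;
    # sampling the label uniformly keeps the three classes balanced.
    label = rng.choice(LABELS)
    hypothesis = generate(
        f"Premise: {premise}\n"
        f"Write a hypothesis whose relation to the premise is: {label}"
    )
    return {"premise": premise, "hypothesis": hypothesis, "label": label}
```

Sampling the domain, premise length, and label independently at each step is what yields the balanced distribution described above.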

Evaluation and Results

The paper evaluates the synthetic dataset's impact by training NLI models from scratch on the synthetic data and comparing their performance against models trained on established datasets such as MNLI, ANLI, and WANLI. The data's efficacy is evident: models trained on it achieved state-of-the-art performance on the TRUE factual consistency benchmark, which covers 11 diverse tasks.
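To illustrate how such a model is applied at inference time, the sketch below scores a (premise, hypothesis) pair with a text-to-text model, in the spirit of the TRUE evaluation protocol. The `t5-small` checkpoint is an untrained stand-in and the input serialization is an assumed convention, not the paper's exact setup.

```python
# pip install transformers torch sentencepiece
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # stand-in checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def nli_predict(premise: str, hypothesis: str) -> str:
    """Serialize the pair into one text input, as is standard in
    text-to-text NLI formulations, and decode the predicted label."""
    inputs = tokenizer(
        f"premise: {premise} hypothesis: {hypothesis}",
        return_tensors="pt",
        truncation=True,
    )
    output_ids = model.generate(**inputs, max_new_tokens=5)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(nli_predict("The cat sat on the mat.", "An animal is on the mat."))
```

A checkpoint fine-tuned on the synthetic data would decode a label string (or a binary consistency decision for TRUE-style tasks); the off-the-shelf `t5-small` here only demonstrates the interface.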

Moreover, the paper explores the combination of synthetic data with existing datasets, revealing that while domain-specific data yields the highest in-domain test accuracy, augmenting these datasets with synthetic data leads to improved performance across unseen domains and distributions.
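A simple way to realize this augmentation is to concatenate the in-domain set with a sample of the synthetic set and shuffle, as in the sketch below; the mixing ratio is an illustrative knob, not a value reported by the paper.

```python
import random

def mix_datasets(in_domain, synthetic, synthetic_fraction=0.5, seed=0):
    """Augment in-domain NLI training data with domain-general synthetic
    examples. `synthetic_fraction` is the share of the final mix drawn
    from the synthetic set (an assumed parameter for illustration)."""
    assert 0.0 <= synthetic_fraction < 1.0
    rng = random.Random(seed)
    # Number of synthetic examples needed to hit the requested fraction.
    n_synth = int(len(in_domain) * synthetic_fraction / (1.0 - synthetic_fraction))
    mixed = list(in_domain) + rng.sample(list(synthetic), min(n_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed
```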

Implications and Future Directions

The research strongly suggests that synthetic data generation can be a powerful tool for mitigating the limitations of NLI models when faced with novel or diverse domain inputs. The benefits of such datasets for zero-shot learning scenarios are particularly noteworthy, showcasing potential applicability across tasks involving unseen text genres and lengths.

Future work could extend this approach to other NLP tasks or to languages beyond English, given the adaptability and scalability of the synthetic data generation process. Additionally, analyzing the impact of varying the scale of the synthetic data, in terms of quantity and diversity, would provide further insight into optimizing model performance across multiple domains.

In conclusion, this paper underscores the potential of synthetic data to substantially enhance the domain generalization of NLI models, paving the way for broader application and robustness of NLP systems in diverse settings.

References (36)
  1. Dyah Adila and Dongyeop Kang. 2022. Understanding out-of-distribution: A perspective of data dynamics. In I (Still) Can’t Believe It’s Not Better! Workshop at NeurIPS 2021, pages 1–8. PMLR.
  2. QAmeleon: Multilingual QA with only 5 examples. Transactions of the Association for Computational Linguistics, 11:1754.
  3. Don’t take the premise for granted: Mitigating artifacts in natural language inference. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, Volume 1: Long Papers, pages 877–891.
  4. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  5. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  7. Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics, 10:1066–1083.
  8. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  9. TrueTeacher: Learning factual consistency evaluation with large language models. arXiv preprint arXiv:2305.11171.
  10. PaLM 2 technical report.
  11. DialFact: A benchmark for fact-checking in dialogue. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3785–3801.
  12. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
  13. Generate, annotate, and learn: NLP with synthetic text. Transactions of the Association for Computational Linguistics, 10:826–842.
  14. TRUE: Re-evaluating factual consistency evaluation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3905–3920, Seattle, United States. Association for Computational Linguistics.
  15. Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7856–7870.
  16. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  17. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461.
  18. WANLI: Worker and AI collaboration for natural language inference dataset creation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6826–6847.
  19. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919.
  20. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 10–18. PMLR.
  21. Nikita Nangia and Samuel Bowman. 2019. Human vs. muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4566–4575.
  22. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
  23. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.
  24. Training question answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5811–5826, Online. Association for Computational Linguistics.
  25. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  26. Multi-component image translation for deep domain generalization. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 579–588.
  27. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840.
  28. Get your vitamin C! Robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 624–643, Online. Association for Computational Linguistics.
  29. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9, Brussels, Belgium. Association for Computational Linguistics.
  30. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30.
  31. STraTA: Self-training with task augmentation for better few-shot learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5715–5731, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  32. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.
  33. Generalizing to unseen domains: A survey on domain generalization. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4627–4635. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
  34. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.
  35. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics.
  36. Learning to generate novel domains for domain generalization. In Computer Vision – ECCV 2020, pages 561–578, Cham. Springer International Publishing.
Authors (4)
  1. Mohammad Javad Hosseini
  2. Andrey Petrov
  3. Alex Fabrikant
  4. Annie Louis