Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models (2310.13671v1)

Published 20 Oct 2023 in cs.CL and cs.AI

Abstract: Data synthesis is a promising way to train a small model with very little labeled data. One approach for data synthesis is to leverage the rich knowledge from LLMs to synthesize pseudo training examples for small models, making it possible to achieve both data and compute efficiency at the same time. However, a key challenge in data synthesis is that the synthesized dataset often suffers from a large distributional discrepancy from the real task data distribution. Thus, in this paper, we propose Synthesis Step by Step (S3), a data synthesis framework that shrinks this distribution gap by iteratively extrapolating the errors made by a small model trained on the synthesized dataset on a small real-world validation dataset using an LLM. Extensive experiments on multiple NLP tasks show that our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data, resulting in significant improvement compared to several baselines: 9.48% improvement compared to ZeroGen and 2.73% compared to GoldGen, and at most 15.17% improvement compared to the small model trained on human-annotated data.
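
The abstract describes S3 as an iterative loop: synthesize a seed dataset with an LLM, train the small model on it, collect the errors the small model makes on a small real-world validation set, and ask the LLM to extrapolate new pseudo examples from those errors. The sketch below illustrates that loop at a structural level only; the helper callables (seed synthesis, small-model training, error collection, LLM extrapolation) are hypothetical placeholders standing in for the paper's actual prompting and training pipeline, not the authors' code.

# Structural sketch of the Synthesis Step by Step (S3) loop as described in
# the abstract. The four callables are hypothetical placeholders supplied by
# the caller; they are not the authors' implementation.

def s3_loop(
    synthesize_seed,     # () -> list of pseudo examples generated by the LLM
    train_small_model,   # (dataset) -> trained small task model
    collect_errors,      # (model, validation_set) -> validation examples the
                         #   small model gets wrong
    extrapolate_errors,  # (errors) -> new LLM-generated pseudo examples that
                         #   resemble the misclassified cases
    validation_set,      # small real-world labeled validation set
    rounds=3,            # number of error-extrapolation iterations
):
    """Iteratively shrink the gap between the synthetic and real data distributions."""
    # Round 0: seed the synthetic dataset with LLM-generated examples.
    synthetic_data = list(synthesize_seed())
    model = train_small_model(synthetic_data)

    for _ in range(rounds):
        errors = collect_errors(model, validation_set)
        if not errors:
            break  # the small model already fits the validation set
        # Grow the synthetic dataset toward the regions the model misses.
        synthetic_data.extend(extrapolate_errors(errors))
        model = train_small_model(synthetic_data)

    return model, synthetic_data

Each round adds only examples targeted at the current model's failure modes, which is how, per the abstract, the framework narrows the distributional discrepancy between the synthesized dataset and the real task data.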

References (46)
  1. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.
  2. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.
  3. The fifth pascal recognizing textual entailment challenge. In TAC. Citeseer.
  4. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925–10934.
  5. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  6. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems, 33:22243–22255.
  7. Meta-learning via language model in-context tuning. arXiv preprint arXiv:2110.07814.
  8. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  9. Terrance DeVries and Graham W Taylor. 2017. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538.
  10. A survey for in-context learning. arXiv preprint arXiv:2301.00234.
  11. Jerome H Friedman. 2002. Stochastic gradient boosting. Computational statistics & data analysis, 38(4):367–378.
  12. Self-guided noise-free data generation for efficient zero-shot learning. In The Eleventh International Conference on Learning Representations.
  13. The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9.
  14. The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 7.
  15. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  16. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  17. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301.
  18. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.
  19. Data augmentation approaches in natural language processing: A survey. AI Open, 3:71–90.
  20. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804.
  21. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.
  22. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150.
  23. Generating training data with language models: Towards zero-shot language understanding. arXiv preprint arXiv:2202.04538.
  24. OpenAI. 2023. GPT-4 technical report.
  25. Training question answering models from synthetic data. arXiv preprint arXiv:2002.09599.
  26. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  27. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  28. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  29. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
  30. Cognitive approach to natural language processing. Elsevier.
  31. Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48.
  32. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073.
  33. Language models in the loop: Incorporating prompting into weak supervision. arXiv preprint arXiv:2205.02318.
  34. LLaMA: Open and efficient foundation language models.
  35. Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).
  36. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  37. Want to reduce labeling cost? gpt-3 can help. arXiv preprint arXiv:2108.13487.
  38. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788.
  39. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. arXiv preprint arXiv:2301.11916.
  40. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  41. BERT-of-theseus: Compressing BERT by progressive module replacing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7859–7869, Online. Association for Computational Linguistics.
  42. A survey on green deep learning.
  43. ProGen: Progressive zero-shot dataset generation via in-context feedback. arXiv preprint arXiv:2210.12329.
  44. ZeroGen: Efficient zero-shot learning via dataset generation. arXiv preprint arXiv:2202.07922.
  45. Modular transformers: Compressing transformers into modularized layers for flexible efficient inference. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10452–10465, Toronto, Canada. Association for Computational Linguistics.
  46. BERT loses patience: Fast and robust inference with early exit. In Advances in Neural Information Processing Systems, volume 33, pages 18330–18341. Curran Associates, Inc.
Citations (21)
