
Does Synthetic Data Make Large Language Models More Efficient? (2310.07830v1)

Published 11 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: NLP has undergone transformative changes with the advent of deep learning methodologies. One challenge persistently confronting researchers is the scarcity of the high-quality, annotated datasets that drive these models. This paper explores the nuances of synthetic data generation in NLP, focusing on template-based question generation. We assess its advantages, including its data augmentation potential and the structured variety it introduces, and weigh these benefits against inherent limitations, such as the risk of overfitting and the constraints imposed by pre-defined templates. Drawing on empirical evaluations, we demonstrate the impact of template-based synthetic data on the performance of modern transformer models. We conclude by emphasizing the delicate balance required between synthetic and real-world data and by outlining future trajectories for integrating synthetic data into model training pipelines. The findings aim to guide NLP practitioners in harnessing synthetic data's potential while ensuring optimal model performance across diverse applications.
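
As a concrete illustration of the template-based question generation the abstract describes, the sketch below fills slot-based question templates with structured records to produce synthetic question-answer pairs. This is a minimal Python sketch; the templates, records, and helper names are hypothetical illustrations, not taken from the paper.

    # Minimal sketch of template-based synthetic QA generation.
    # Templates, records, and field names are hypothetical, not from the paper.
    import random

    TEMPLATES = [
        "What is the capital of {country}?",
        "Which country has {capital} as its capital?",
    ]

    # For each template, the record field that holds the answer.
    ANSWER_FIELD = {
        TEMPLATES[0]: "capital",
        TEMPLATES[1]: "country",
    }

    RECORDS = [
        {"country": "France", "capital": "Paris"},
        {"country": "Japan", "capital": "Tokyo"},
    ]

    def generate_qa_pairs(records, templates, n=4, seed=0):
        """Fill slot-based templates with structured records,
        yielding synthetic (question, answer) training pairs."""
        rng = random.Random(seed)
        pairs = []
        for _ in range(n):
            record = rng.choice(records)
            template = rng.choice(templates)
            question = template.format(**record)
            answer = record[ANSWER_FIELD[template]]
            pairs.append((question, answer))
        return pairs

    for q, a in generate_qa_pairs(RECORDS, TEMPLATES):
        print(q, "->", a)

Pairs generated this way can augment a real QA corpus cheaply, but because every question follows one of a few fixed surface forms, models can overfit to the templates, which is the trade-off the paper examines.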

Authors (2)
  1. Sia Gholami
  2. Marwan Omar