CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation (2409.02098v1)

Published 3 Sep 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shot examples that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned LLMs augment the retrieved documents into custom-formatted task samples, which can then be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA, and commonsense QA, as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
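The pipeline described in the abstract has three stages: embed the user-written few-shot examples, retrieve similar human-written documents from a large corpus, and have an instruction-tuned LLM rewrite each retrieved document into a custom-formatted task sample. Below is a minimal sketch of that idea using a sentence-transformers retriever; the `all-MiniLM-L6-v2` model choice, the toy corpus, the prompt wording, and the `call_llm` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal CRAFT-style sketch (assumed details, not the paper's exact code):
# 1) embed the few-shot examples, 2) retrieve similar corpus documents,
# 3) ask an instruction-tuned LLM to turn each document into a task sample.
from sentence_transformers import SentenceTransformer, util


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM client."""
    return "Q: ...  A: ..."  # replace with a real model call


# User-written few-shot examples demonstrating the target task (biology QA here).
few_shots = [
    "Q: Which organelle produces most of a eukaryotic cell's ATP? A: The mitochondrion.",
    "Q: Which molecule stores genetic information in most organisms? A: DNA.",
]

# Toy stand-in for a large web-crawled corpus of human-written documents.
corpus = [
    "Mitochondria are membrane-bound organelles that generate chemical energy as ATP.",
    "Ribosomes read messenger RNA and assemble the corresponding protein chains.",
]

# Stages 1-2: similarity-based retrieval of documents close to the few-shots.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
shot_emb = encoder.encode(few_shots, convert_to_tensor=True, normalize_embeddings=True)
doc_emb = encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(shot_emb, doc_emb, top_k=1)  # use a larger top_k on a real corpus
retrieved_ids = {hit["corpus_id"] for per_shot in hits for hit in per_shot}

# Stage 3: augment each retrieved document into a custom-formatted task sample.
synthetic_dataset = []
for doc_id in sorted(retrieved_ids):
    prompt = (
        "Task examples:\n" + "\n".join(few_shots)
        + "\n\nUsing only the document below, write one new sample in the same format.\n"
        + "Document:\n" + corpus[doc_id]
    )
    synthetic_dataset.append(call_llm(prompt))

# `synthetic_dataset` would then be used to fine-tune a task-specific model.
print(len(synthetic_dataset), "synthetic samples generated")
```

In the paper, retrieval runs over large-scale public web-crawled corpora and the augmentation prompt enforces the task-specific output format; the sketch above keeps only the control flow.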
