CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation (2409.02098v1)
Abstract: Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets given a small number of user-written few-shot examples that demonstrate the task to be performed. Starting from these few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned LLMs augment the retrieved documents into custom-formatted task samples, which can then be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question answering (QA), medicine QA, and commonsense QA, as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs on the QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
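The abstract describes a three-step pipeline: embed the user's few-shot examples, retrieve similar human-written documents from a web-crawled corpus, and have an instruction-tuned LLM rewrite each retrieved document into a task sample. The sketch below is an illustrative reconstruction under assumptions, not the authors' released code: the retriever checkpoint (`all-MiniLM-L6-v2`), the `generate` callable standing in for any instruction-tuned LLM, and the prompt format are placeholders chosen for this sketch.

```python
# Minimal, hypothetical sketch of a CRAFT-style pipeline (not the authors' implementation).
from sentence_transformers import SentenceTransformer, util


def craft_dataset(few_shots, corpus, generate, top_k=100):
    """few_shots: list of (document, task_sample) pairs written by the user.
    corpus: list of raw human-written documents (e.g. from a web crawl).
    generate: callable that sends a prompt to an instruction-tuned LLM and returns text."""
    # Sentence-embedding retriever; the specific checkpoint is an assumption for this sketch.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1) Embed the few-shot documents (queries) and the corpus documents.
    query_emb = embedder.encode([doc for doc, _ in few_shots], convert_to_tensor=True)
    corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

    # 2) Similarity-based retrieval: keep the top-k most similar corpus documents per few-shot.
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)
    retrieved_ids = {h["corpus_id"] for per_query in hits for h in per_query}

    # 3) Augmentation: ask the instruction-tuned LLM to rewrite each retrieved document
    #    into a custom-formatted task sample, conditioned on the user's few-shot examples.
    demos = "\n\n".join(f"Document:\n{d}\n\nSample:\n{s}" for d, s in few_shots)
    samples = []
    for idx in retrieved_ids:
        prompt = f"{demos}\n\nDocument:\n{corpus[idx]}\n\nSample:\n"
        samples.append(generate(prompt))
    return samples  # synthetic task samples, ready for fine-tuning
```

The resulting samples would then be used as supervised fine-tuning data for a task-specific model, as described in the abstract.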