
DEFT: Data Efficient Fine-Tuning for Pre-Trained Language Models via Unsupervised Core-Set Selection (2310.16776v5)

Published 25 Oct 2023 in cs.CL and cs.AI

Abstract: Recent advances have led to the availability of many pre-trained language models (PLMs); however, an open question remains: how much data is truly needed to fine-tune PLMs for downstream tasks? In this work, we introduce DEFT-UCS, a data-efficient fine-tuning framework that leverages unsupervised core-set selection to identify a smaller, representative dataset, reducing the amount of data needed to fine-tune PLMs for downstream tasks. We examine the efficacy of DEFT-UCS in the context of text-editing LMs and compare it to the state-of-the-art text-editing model, CoEDIT. Our results demonstrate that DEFT-UCS models are just as accurate as CoEDIT across eight datasets spanning six editing tasks, while being fine-tuned on 70% less data.
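The abstract describes the approach only at a high level; the referenced building blocks (sentence embeddings, k-means coresets, silhouette-based cluster analysis) suggest a clustering-style selection. The sketch below is a minimal, hypothetical illustration of unsupervised core-set selection under those assumptions, not the paper's actual DEFT-UCS procedure: the encoder checkpoint, cluster count, and per-cluster sampling rule are all illustrative choices.

```python
# Hedged sketch of unsupervised core-set selection (assumed recipe:
# embed -> k-means -> keep points nearest each centroid). Not the
# paper's exact algorithm; all hyperparameters below are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def select_core_set(texts, n_clusters=10, frac_per_cluster=0.3, seed=0):
    """Return indices of a smaller, representative subset of `texts`."""
    # 1) Embed every candidate fine-tuning example (no labels needed).
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
    emb = model.encode(texts, convert_to_numpy=True)

    # 2) Cluster the embedding space; k could be chosen via silhouette score.
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(emb)

    # 3) From each cluster, keep the examples closest to the centroid
    #    (most representative); a variant could also mix in distant points.
    selected = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        keep = max(1, int(frac_per_cluster * len(idx)))
        selected.extend(idx[np.argsort(dists)[:keep]].tolist())
    return sorted(selected)


# Usage: fine-tune the PLM on [texts[i] for i in select_core_set(texts)]
# instead of the full training set.
```

A silhouette-score sweep over n_clusters would be a natural way to pick k, and the selected indices would then replace the full dataset in an otherwise unchanged fine-tuning loop.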

References (47)
  1. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679.
  2. Jean-Michel Attendu and Jean-Philippe Corbeil. 2023. NLU on data diets: Dynamic data subset selection for NLP classification tasks. arXiv preprint arXiv:2306.03208.
  3. Semantic redundancies in image-classification datasets: The 10% you don’t need. arXiv preprint arXiv:1901.11409.
  4. Help me write a poem: Instruction tuning as a vehicle for collaborative poetry writing. arXiv preprint arXiv:2210.13669.
  5. Skill-it! A data-driven skills framework for understanding and training language models. arXiv preprint arXiv:2307.14430.
  6. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  7. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  8. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
  9. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  10. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246.
  11. Understanding iterative revision from human-written text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3573–3590.
  12. EditEval: An instruction-based benchmark for text improvements. arXiv preprint arXiv:2209.13331.
  13. Joseph L Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement, 33(3):613–619.
  14. On the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12799–12807.
  15. Sariel Har-Peled and Akash Kushal. 2005. Smaller coresets for k-median and k-means clustering. In Proceedings of the twenty-first annual symposium on Computational geometry, pages 126–134.
  16. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  17. Data-efficient finetuning using cross-task nearest neighbors. arXiv preprint arXiv:2212.00196.
  18. RETRIEVE: Coreset selection for efficient and robust semi-supervised learning. Advances in Neural Information Processing Systems, 34:14488–14501.
  19. Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International conference on machine learning, pages 1885–1894. PMLR.
  20. Shiye Lei and Dacheng Tao. 2023. A comprehensive survey to dataset distillation. arXiv preprint arXiv:2301.05603.
  21. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  22. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  23. When less is more: Investigating data pruning for pretraining LLMs at scale. arXiv preprint arXiv:2309.04564.
  24. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pages 6950–6960. PMLR.
  25. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
  26. JFLEG: A fluency corpus and benchmark for grammatical error correction. arXiv preprint arXiv:1702.04066.
  27. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.
  28. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607.
  29. Automatically neutralizing subjective bias in text. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 480–489.
  30. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  31. CoEdIT: Text editing by task-specific instruction tuning. arXiv preprint arXiv:2305.09857.
  32. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
  33. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
  34. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  35. PEER: A collaborative language model. arXiv preprint arXiv:2208.11663.
  36. Ketan Rajshekhar Shahapure and Charles Nicholas. 2020. Cluster quality analysis using silhouette score. In 2020 IEEE 7th international conference on data science and advanced analytics (DSAA), pages 747–748. IEEE.
  37. RewriteLM: An instruction-tuned large language model for text rewriting. arXiv preprint arXiv:2305.15685.
  38. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536.
  39. Selective annotation makes language models better few-shot learners. arXiv preprint arXiv:2209.01975.
  40. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.
  41. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  42. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
  43. Dataset pruning: Reducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329.
  44. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
  45. Multi-task instruction tuning of LLaMA for specific scenarios: A preliminary study on writing assistance. arXiv preprint arXiv:2305.13225.
  46. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206.
  47. Dataset quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17205–17216.
