
Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values

Published 16 Jun 2023 in cs.CL (arXiv:2306.10165v1)

Abstract: Although Shapley values have been shown to be highly effective for identifying harmful training instances, dataset size and model complexity limit the ability to apply Shapley-based data valuation to fine-tuning large pre-trained language models (LLMs). To address this, we propose TS-DShapley, an algorithm that reduces the computational cost of Shapley-based data valuation through: 1) an efficient sampling-based method that aggregates Shapley values computed from subsets to value the entire training set, and 2) a value transfer method that leverages value information extracted from a simple classifier trained using representations from the target LLM. Our experiments applying TS-DShapley to select data for fine-tuning BERT-based LLMs on benchmark natural language understanding (NLU) datasets show that TS-DShapley outperforms existing data selection methods. Further, TS-DShapley can filter fine-tuning data to increase LLM performance compared to training with the full fine-tuning dataset.
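
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nearest-centroid classifier, the plain averaging across subsets, the function names (`utility`, `mc_shapley`, `subset_aggregated_values`), and the synthetic features standing in for LLM representations are all assumptions made for illustration. The sketch estimates Shapley values by Monte Carlo permutation sampling inside small random subsets, using a cheap classifier on fixed features, and then averages each point's value over the subsets it appeared in.

```python
import numpy as np

def utility(X_tr, y_tr, X_val, y_val, n_classes):
    """Validation accuracy of a nearest-centroid classifier.
    Assumption: this stands in for the 'simple classifier' trained on LLM representations."""
    if len(y_tr) == 0:
        return 1.0 / n_classes  # chance-level utility for the empty training set
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_val[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_val).mean())

def mc_shapley(X, y, X_val, y_val, n_classes, n_perms, rng):
    """Monte Carlo permutation estimate of Shapley values for the points in (X, y)."""
    n = len(y)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = utility(X[:0], y[:0], X_val, y_val, n_classes)
        for k in range(n):
            idx = perm[: k + 1]
            cur = utility(X[idx], y[idx], X_val, y_val, n_classes)
            values[perm[k]] += cur - prev  # marginal contribution of point perm[k]
            prev = cur
    return values / n_perms

def subset_aggregated_values(X, y, X_val, y_val, n_classes,
                             n_subsets=20, subset_size=12, n_perms=5, seed=0):
    """Value the full training set by averaging per-subset Shapley estimates.
    Assumption: simple averaging; the paper's aggregation rule may differ."""
    rng = np.random.default_rng(seed)
    total = np.zeros(len(y))
    counts = np.zeros(len(y))
    for _ in range(n_subsets):
        idx = rng.choice(len(y), size=subset_size, replace=False)
        total[idx] += mc_shapley(X[idx], y[idx], X_val, y_val, n_classes, n_perms, rng)
        counts[idx] += 1
    values = np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)
    return values, counts

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Hypothetical stand-in for LLM representations: two Gaussian clusters.
    X = np.vstack([rng.normal(-2, 1, (20, 8)), rng.normal(2, 1, (20, 8))])
    y = np.array([0] * 20 + [1] * 20)
    y[:4] = 1  # inject a few mislabeled (potentially harmful) instances
    X_val = np.vstack([rng.normal(-2, 1, (15, 8)), rng.normal(2, 1, (15, 8))])
    y_val = np.array([0] * 15 + [1] * 15)

    values, counts = subset_aggregated_values(X, y, X_val, y_val, n_classes=2)
    keep = np.argsort(values)[::-1][:30]  # indices of the 30 highest-valued points
    print("lowest-valued indices:", np.argsort(values)[:5])
    print("number of points kept for fine-tuning:", len(keep))
```

In this sketch the selected indices (`keep`) would then be used to fine-tune the target model, which is the "transfer" step: values are computed with the cheap classifier but applied to the expensive one. Points never drawn into any subset receive a value of zero here; a real pipeline would need enough subsets to cover the training set.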
