What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions (2405.13954v1)
Abstract: LLMs are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data point to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation for gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to 6,500x improvement in throughput and 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and a 1B-token dataset.
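To make the gradient projection idea concrete, below is a minimal sketch (not the LogIX API) of how a Kronecker-structured projection can compress per-sample gradients for influence computation. It relies on the fact that, for a linear layer, the per-sample weight gradient is the outer product of the output-side gradient and the input activation, so projecting each factor separately avoids ever materializing the full gradient. All names, the projection ranks, and the damping value are illustrative assumptions, and plain random projections stand in for whatever projection the paper actually uses.

```python
# Illustrative sketch of Kronecker-structured gradient projection for
# influence functions, written against plain PyTorch. Not the LogIX API.
import torch

torch.manual_seed(0)

d_in, d_out = 512, 512      # linear-layer dimensions
k_in, k_out = 16, 16        # projection ranks (k_in * k_out << d_in * d_out)

# Random low-rank projections applied to activations and output gradients.
P_in = torch.randn(k_in, d_in) / d_in ** 0.5
P_out = torch.randn(k_out, d_out) / d_out ** 0.5

def project_gradient(activation: torch.Tensor, output_grad: torch.Tensor) -> torch.Tensor:
    """Project the per-sample weight gradient (output_grad outer activation)
    without materializing the full (d_out x d_in) gradient."""
    a_proj = activation @ P_in.T        # (batch, k_in)
    g_proj = output_grad @ P_out.T      # (batch, k_out)
    # Per-sample outer product, flattened to (batch, k_out * k_in).
    return torch.einsum("bo,bi->boi", g_proj, a_proj).flatten(1)

# Toy per-sample activations / output gradients for training and query points.
train_act, train_grad = torch.randn(1024, d_in), torch.randn(1024, d_out)
query_act, query_grad = torch.randn(4, d_in), torch.randn(4, d_out)

G_train = project_gradient(train_act, train_grad)   # (1024, k_out * k_in)
G_query = project_gradient(query_act, query_grad)   # (4,    k_out * k_in)

# A damped covariance of projected gradients stands in for the Hessian in the
# influence-function formula  score = g_query^T H^{-1} g_train.
damping = 1e-3
H = G_train.T @ G_train / G_train.shape[0]
H += damping * torch.eye(H.shape[0])

scores = G_query @ torch.linalg.solve(H, G_train.T)  # (4, 1024) influence scores
print(scores.shape)
```

In this sketch, memory scales with the projected dimension k_in * k_out rather than d_in * d_out, which is the source of the efficiency gains the abstract describes.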