What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions (2405.13954v1)

Published 22 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation of gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to 6,500x improvement in throughput and 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and the 1B-token dataset.


Summary

  • The paper presents LoGra, a novel algorithm that reduces the space and time complexity of gradient projection for influence functions from O(nk) to O(√(nk)), enabling efficient data valuation for LLMs.
  • It provides a theoretical grounding of gradient projection as a spectral sparsification mechanism, ensuring critical gradient components are preserved.
  • The work also introduces LogIX, a practical software tool that integrates data valuation into LLM training workflows, demonstrating up to 6,500× throughput gains and reduced GPU memory usage.

An Insightful Overview of "What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions"

The paper "What is Your Data Worth to GPT?\LLM-Scale Data Valuation with Influence Functions" primarily addresses the growing need to credit data providers whose contributions are pivotal in training LLMs. The authors introduce several significant advancements to the existing methods of data valuation, focusing on scaling influence functions for LLMs.

Core Contributions

Efficient Gradient Projection with LoGra

A notable contribution of this paper is the LoGra algorithm. LoGra addresses the critical challenge of the compute and memory costs associated with traditional influence functions. By leveraging the inherent gradient structure in backpropagation, LoGra employs a low-rank gradient projection technique that reduces the space and time complexity of projection from O(nk) to O(√(nk)), where n is the model (gradient) dimension and k is the projection dimension. Furthermore, LoGra achieves this efficiency by directly computing projected gradients without materializing full gradients, significantly lowering GPU memory usage and boosting throughput.
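
To make the complexity argument concrete, here is a minimal PyTorch sketch of the underlying idea: for a linear layer, a Kronecker-factored projection of the per-example weight gradient can be formed from the projected activations and output gradients alone, so the full gradient never needs to be materialized. The dimensions, names, and random projections below are illustrative assumptions, not the paper's implementation.

```python
import torch

# Minimal sketch of LoGra-style low-rank gradient projection for a single
# linear layer y = W x (illustrative dimensions; not the paper's code).
d_in, d_out = 4096, 4096   # layer dimensions; the full gradient has n = d_in * d_out entries
k_in, k_out = 64, 64       # projection dimensions; the projected gradient has k = k_in * k_out entries

# Projection matrices (random here; the paper also discusses a PCA-based initialization).
P_in = torch.randn(k_in, d_in) / d_in ** 0.5
P_out = torch.randn(k_out, d_out) / d_out ** 0.5

def projected_per_example_grad(x, grad_y):
    """Return the projected per-example weight gradient.

    x:      (batch, d_in)  layer inputs from the forward pass
    grad_y: (batch, d_out) gradients w.r.t. the layer output from backprop

    The full per-example gradient is the outer product grad_y x^T with
    d_out * d_in entries; projecting each factor first costs only
    O(d_in * k_in + d_out * k_out) per example instead of O(nk).
    """
    x_proj = x @ P_in.T        # (batch, k_in)
    g_proj = grad_y @ P_out.T  # (batch, k_out)
    return torch.einsum("bo,bi->boi", g_proj, x_proj)  # (batch, k_out, k_in)

x = torch.randn(8, d_in)
grad_y = torch.randn(8, d_out)
print(projected_per_example_grad(x, grad_y).shape)  # torch.Size([8, 64, 64])
```

Projecting the two small factors costs O(d_in·k_in + d_out·k_out) per example, which is roughly O(√(nk)) when the dimensions are balanced, versus O(nk) for projecting a materialized full gradient.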

The efficiency of LoGra is demonstrated through rigorous empirical evaluation. When applied to Llama3-8B-Instruct and a 1B-token dataset, LoGra shows up to 6,500 times improvement in throughput and a fivefold reduction in GPU memory usage compared to EKFAC influence, the current state-of-the-art at this scale.

Theoretical Grounding and Gradient Sparsification

To promote trust in the data valuation process, the paper also provides a theoretical motivation for gradient projection in influence functions. By interpreting the damping term in influence functions as a spectral gradient sparsification mechanism, the authors justify the emphasis on larger gradient components, ensuring that the important parts of the gradients are preserved during projection. This insight also motivates a PCA-based initialization scheme for LoGra's projection matrices.
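
As a reference point for this argument, the damped influence function being analyzed can be written in the eigenbasis of the (approximate) Hessian. The restatement below is a schematic in my own notation, not the paper's theorem statement:

```latex
% Damped influence of a training point z on a test point z_test,
% with eigendecomposition H = \sum_i \lambda_i u_i u_i^\top
% (H is the Hessian or a Fisher/GGN approximation of it).
\mathcal{I}(z_{\mathrm{test}}, z)
  = -\nabla_\theta \ell(z_{\mathrm{test}})^\top (H + \lambda I)^{-1} \nabla_\theta \ell(z)
  = -\sum_i \frac{\bigl(u_i^\top \nabla_\theta \ell(z_{\mathrm{test}})\bigr)\bigl(u_i^\top \nabla_\theta \ell(z)\bigr)}{\lambda_i + \lambda}.
```

If H is approximated by the gradient covariance, the expected squared gradient component along u_i is roughly λ_i, so eigendirections with λ_i ≪ λ contribute little to the score while directions with λ_i ≫ λ dominate. In this sense damping behaves like a soft spectral filter over gradient components, which is the intuition behind initializing the projection with PCA on the gradients.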

Practical Implementation with LogIX

In addition to the algorithmic innovation, the paper introduces LogIX, a software package designed to ease the integration of data valuation into existing training workflows. LogIX uses PyTorch hooks to intercept gradient computations and compute the various statistics required for data valuation. Its compatibility with prevalent tools in the LLM ecosystem, such as DeepSpeed and HF Transformers, and its efficient handling of data I/O through memory-mapped files make it particularly useful for large-scale applications.
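
As an illustration of the general mechanism (plain PyTorch hooks rather than the LogIX API; the toy model, module names, and shapes below are assumptions for the example), forward hooks can capture each layer's inputs and full backward hooks its output gradients, which together suffice to reconstruct per-example gradient statistics:

```python
import torch
import torch.nn as nn

# Captured per-layer statistics: layer inputs ("x") and output gradients ("grad_y").
captured = {}

def make_forward_hook(name):
    def hook(module, inputs, output):
        captured.setdefault(name, {})["x"] = inputs[0].detach()
    return hook

def make_backward_hook(name):
    def hook(module, grad_input, grad_output):
        captured.setdefault(name, {})["grad_y"] = grad_output[0].detach()
    return hook

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_forward_hook(name))
        module.register_full_backward_hook(make_backward_hook(name))

# requires_grad=True so gradients also flow to the first layer's input,
# ensuring its full backward hook fires in this toy example.
x = torch.randn(8, 16, requires_grad=True)
loss = model(x).sum()
loss.backward()

# Per-example weight gradient of the first linear layer, reconstructed from
# the captured activations and output gradients: shape (batch, d_out, d_in).
acts, grads = captured["0"]["x"], captured["0"]["grad_y"]
per_example_grad = torch.einsum("bo,bi->boi", grads, acts)
print(per_example_grad.shape)  # torch.Size([8, 32, 16])
```

A system like LogIX would then project such statistics (e.g., with LoGra) and stream them to disk rather than keep full gradients in memory.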

Experimental Validation

The efficacy of LoGra and influence functions is validated through both quantitative and qualitative experiments.

Quantitative Evaluation: Using benchmarks such as FMNIST with an MLP, CIFAR-10 with ResNet-9, and WikiText with GPT2, the paper conducts counterfactual evaluations, including brittleness tests and linear datamodeling scores (LDS). These experiments show that LoGra is competitive both at identifying the most valuable data and in overall valuation accuracy.
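
For readers unfamiliar with LDS, the sketch below illustrates the general recipe from the datamodels/TRAK line of work using synthetic numbers; it is not the paper's evaluation code. The predicted effect of a random training subset is the sum of its members' attribution scores, and the LDS is the rank correlation between these predictions and the outputs actually measured after retraining on each subset.

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative linear datamodeling score (LDS) computation with synthetic
# stand-ins for attribution scores and retraining measurements.
rng = np.random.default_rng(0)
n_train, n_subsets = 1000, 100

# Attribution scores of every training example for one fixed test example.
scores = rng.normal(size=n_train)

# Random 50% training subsets; in a real evaluation a model is retrained on each.
subsets = rng.random((n_subsets, n_train)) < 0.5

# Predicted group effect: sum of attribution scores over each subset.
predicted = subsets.astype(float) @ scores

# Measured effect: the test-example output of each retrained model.
# Faked here as the prediction plus noise so the script runs end to end.
measured = predicted + rng.normal(scale=5.0, size=n_subsets)

lds, _ = spearmanr(predicted, measured)
print(f"LDS (Spearman rank correlation): {lds:.3f}")
```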

Qualitative Evaluation: When scaling to billion-parameter models and billion-token datasets, the paper assesses the qualitative similarity between LLM outputs and the most valuable data identified by LoGra. For models such as GPT2-XL and Llama3-8B-Instruct, the findings reveal notable congruence in semantics, style, and token overlap, bolstering the credibility of the data valuation process.

Practical and Theoretical Implications

The advancements presented in the paper hold substantial implications both in practice and theory. Practically, the ability to efficiently and accurately value data at the LLM scale opens avenues for transparent and fair data attribution, potentially addressing the legal and ethical concerns surrounding the use of uncredited data in LLM training. Theoretically, the insights into gradient projection and sparsification contribute to the broader understanding of influence functions and their applicability to large-scale neural network training.

Future Directions

While the proposed methods demonstrate significant improvements, the paper also acknowledges certain limitations. For instance, the challenges of dealing with outlier data points and the need for more extensive system optimizations (e.g., incorporating a high-performance vector database) are areas identified for future research. Moreover, exploring alternative gradient compression strategies could further enhance computational efficiency.

Conclusion

In conclusion, "What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions" presents a comprehensive approach to scaling data valuation to modern LLMs. The introduction of LoGra, grounded in theory and supported by a practical implementation in LogIX, marks a significant step forward. This research not only improves the scalability and accuracy of data valuation techniques but also underscores the importance of crediting data providers, paving the way for more transparent and equitable AI applications.