$\texttt{dattri}$: A Library for Efficient Data Attribution (2410.04555v1)

Published 6 Oct 2024 in cs.LG and cs.CY

Abstract: Data attribution methods aim to quantify the influence of individual training samples on the predictions of AI models. As training data plays an increasingly crucial role in the modern development of large-scale AI models, data attribution has found broad applications in improving AI performance and safety. However, despite a surge of new data attribution methods developed recently, no comprehensive library exists that facilitates the development, benchmarking, and deployment of different data attribution methods. In this work, we introduce $\texttt{dattri}$, an open-source data attribution library that addresses these needs. Specifically, $\texttt{dattri}$ highlights three novel design features. First, $\texttt{dattri}$ proposes a unified and easy-to-use API, allowing users to integrate different data attribution methods into their PyTorch-based machine learning pipeline with only a few lines of code changed. Second, $\texttt{dattri}$ modularizes low-level utility functions that are commonly used in data attribution methods, such as Hessian-vector products, inverse-Hessian-vector products, and random projection, making it easier for researchers to develop new data attribution methods. Third, $\texttt{dattri}$ provides a comprehensive benchmark framework with pre-trained models and ground-truth annotations for a variety of benchmark settings, including generative AI settings. We have implemented a variety of state-of-the-art efficient data attribution methods that can be applied to large-scale neural network models, and we will continuously update the library in the future. Using the developed $\texttt{dattri}$ library, we perform a comprehensive and fair benchmark analysis across a wide range of data attribution methods. The source code of $\texttt{dattri}$ is available at https://github.com/TRAIS-Lab/dattri.
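
The abstract names Hessian-vector products (HVP) and inverse-Hessian-vector products (iHVP) among the low-level utilities that $\texttt{dattri}$ modularizes. The sketch below is a minimal, generic illustration of how such primitives can be computed in plain PyTorch; it does not use $\texttt{dattri}$'s actual API, and the function names (hvp, ihvp_cg) and parameters (damping, iters, tol) are illustrative assumptions, with the iHVP approximated by conjugate gradient so the Hessian is never materialized.

```python
import torch


def hvp(loss_fn, params, vec):
    # Hessian-vector product H @ vec, where H is the Hessian of loss_fn at params.
    # Computed via double backpropagation, so H is never formed explicitly.
    grad = torch.autograd.grad(loss_fn(params), params, create_graph=True)[0]
    return torch.autograd.grad(grad @ vec, params)[0]


def ihvp_cg(loss_fn, params, vec, damping=1e-3, iters=100, tol=1e-6):
    # Approximate (H + damping * I)^{-1} @ vec with conjugate gradient.
    # Each iteration needs only one HVP call, keeping memory linear in the
    # number of parameters.
    x = torch.zeros_like(vec)
    r = vec.clone()              # residual b - A @ x, with x = 0 initially
    p = r.clone()
    rs_old = r @ r
    for _ in range(iters):
        Ap = hvp(loss_fn, params, p) + damping * p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


# Toy usage on a quadratic loss over a small parameter vector (hypothetical example).
theta = torch.randn(5, requires_grad=True)
data = torch.randn(20, 5)

def loss_fn(p):
    return ((data @ p) ** 2).mean()

v = torch.randn(5)
print(hvp(loss_fn, theta, v))      # H @ v
print(ihvp_cg(loss_fn, theta, v))  # roughly (H + damping * I)^{-1} @ v
```

Influence-function-style attributors typically combine an iHVP of this kind with per-sample training gradients, which is why factoring such primitives out as reusable modules, as the abstract describes, makes it easier to build and compare new attribution methods.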
