Fast Training Dataset Attribution via In-Context Learning (2408.11852v2)

Published 14 Aug 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data in the outputs of instruction-tuned LLMs. We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
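The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas under stated assumptions: the function names, the use of cosine similarity over output embeddings for approach (1), and the use of non-negative least squares as a simple stand-in for the matrix-factorization step in approach (2) are illustrative choices, not the authors' method.

```python
# Hedged sketch (NOT the paper's exact formulation).
# Approach (2): treat the observed next-token distribution p_obs (model output
# with retrieved documents in context) as a convex mixture of per-document
# distributions p_i (output with each candidate document alone), and read the
# recovered mixture weights as contribution scores.
import numpy as np
from scipy.optimize import nnls


def mixture_contribution_scores(p_obs: np.ndarray, per_doc_dists: np.ndarray) -> np.ndarray:
    """p_obs: (V,) observed token distribution; per_doc_dists: (k, V) per-document distributions."""
    # Solve min_w || per_doc_dists.T @ w - p_obs ||_2 subject to w >= 0,
    # then normalize so the weights sum to 1 and act as contribution scores.
    w, _ = nnls(per_doc_dists.T, p_obs)
    total = w.sum()
    return w / total if total > 0 else w


def context_similarity(emb_with: np.ndarray, emb_without: np.ndarray) -> float:
    """Approach (1), roughly: cosine similarity between output embeddings produced
    with and without a context document; lower similarity suggests the document
    shifted the output more, i.e. contributed more."""
    denom = np.linalg.norm(emb_with) * np.linalg.norm(emb_without) + 1e-12
    return float(emb_with @ emb_without) / denom


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, V = 3, 50                                   # 3 candidate documents, toy vocabulary of 50
    per_doc = rng.dirichlet(np.ones(V), size=k)    # synthetic per-document distributions
    true_w = np.array([0.6, 0.3, 0.1])
    p_obs = true_w @ per_doc                       # synthetic observed mixture
    print(mixture_contribution_scores(p_obs, per_doc))   # should be close to true_w
```

On the toy example the recovered weights approximately match the synthetic mixing proportions, which is the kind of behavior the contribution scores are meant to capture; the paper's actual factorization and its robustness analysis under retrieval noise may differ substantially.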
