
Unlearning Traces the Influential Training Data of Language Models

Published 26 Jan 2024 in cs.CL and cs.AI (arXiv:2401.15241v2)

Abstract: Identifying the training datasets that influence an LLM's outputs is essential for minimizing the generation of harmful content and enhancing its performance. Ideally, we can measure the influence of each dataset by removing it from training; however, it is prohibitively expensive to retrain a model multiple times. This paper presents UnTrac: unlearning traces the influence of a training dataset on the model's performance. UnTrac is extremely simple; each training dataset is unlearned by gradient ascent, and we evaluate how much the model's predictions change after unlearning. Furthermore, we propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets. UnTrac-Inv performs comparably to UnTrac while remaining efficient for massive training datasets. In the experiments, we examine whether our methods can assess the influence of pretraining datasets on generating toxic, biased, and untruthful content. Our methods estimate their influence much more accurately than existing methods while requiring neither excessive memory space nor multiple checkpoints.


Summary

  • The paper introduces UnTrac and UnTrac-Inv methods to unlearn influential datasets, efficiently tracking their impact on model predictions.
  • It estimates data influence via gradient-ascent unlearning, matching ground-truth influence more accurately than existing estimation methods on both synthetic and pretraining datasets.
  • Evaluation on the OPT-125M model demonstrates robustness in tracing the origins of harmful content, while highlighting the need for careful hyperparameter tuning, particularly for UnTrac-Inv.

Unlearning Traces the Influential Training Data of Language Models

Introduction

The paper "Unlearning Traces the Influential Training Data of Language Models" addresses the critical challenge of identifying which specific datasets contribute significantly to an LLM's capabilities and potential risks, such as generating biased or toxic content. Traditional methods for measuring dataset influence, such as leave-dataset-out, are computationally expensive because they require retraining the model multiple times. The paper introduces a novel method, UnTrac, which simplifies the process by unlearning each dataset via gradient ascent and measuring the resulting change in model predictions.

Methodology

UnTrac and UnTrac-Inv

The core mechanism of UnTrac involves unlearning a training dataset from the model via gradient ascent, then evaluating how much the model's predictions on a test dataset change after unlearning (Figure 1). This approach directly quantifies data influence without requiring multiple model retrainings. UnTrac-Inv inverts the procedure: it unlearns a test dataset and assesses the change in predictions on the training datasets, offering a scalable alternative when the number of training datasets is large.

Figure 1: Overview of leave-dataset-out vs. proposed methods, UnTrac and UnTrac-Inv.

The methods are grounded in the principle that a dataset's influence can be estimated by observing how the loss changes after that dataset is unlearned. UnTrac becomes computationally intensive when scaling to many training datasets, which motivates UnTrac-Inv as the more efficient variant.
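The principle above can be illustrated with a minimal sketch on a toy linear model: train on two datasets, unlearn each by gradient ascent, and score each dataset by the resulting increase in test loss. All names, shapes, and hyperparameters here are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def mse_grad(w, X, y):
    return 2.0 * X.T @ (X @ w - y) / len(y)

def untrac_score(w, X_train, y_train, X_test, y_test, lr=0.05, steps=20):
    """UnTrac-style score: unlearn one training dataset by gradient
    ASCENT, then report how much the test loss changes afterwards."""
    loss_before = mse_loss(w, X_test, y_test)
    w_un = w.copy()
    for _ in range(steps):
        w_un += lr * mse_grad(w_un, X_train, y_train)  # ascent, not descent
    return mse_loss(w_un, X_test, y_test) - loss_before

# Two training datasets: A shares the test distribution, B has random labels.
w_true = np.array([1.0, -2.0])
X_a = rng.normal(size=(200, 2)); y_a = X_a @ w_true
X_b = rng.normal(size=(200, 2)); y_b = rng.normal(size=200)
X_test = rng.normal(size=(100, 2)); y_test = X_test @ w_true

# Train one model on the union of A and B by gradient descent.
X_all = np.vstack([X_a, X_b]); y_all = np.concatenate([y_a, y_b])
w = np.zeros(2)
for _ in range(500):
    w -= 0.05 * mse_grad(w, X_all, y_all)

# Unlearning A should hurt the test loss far more than unlearning B.
score_a = untrac_score(w, X_a, y_a, X_test, y_test)
score_b = untrac_score(w, X_b, y_b, X_test, y_test)
print(score_a > score_b)  # dataset A is the more influential one
```

UnTrac-Inv would instead run the ascent on the test dataset and measure the loss change on each training dataset, so the expensive unlearning step is done once rather than once per training dataset.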

Experimental Evaluation

Synthetic Datasets

Using synthetic datasets, the paper evaluates UnTrac's ability to identify influential training tasks. Results confirm its robustness, accurately distinguishing tasks based on their alignment with test datasets despite variations in output format. This is illustrated through changes in influence estimates over unlearning steps, showing alignment with ground-truth influences (Figure 2).

Figure 2: The influence estimated by UnTrac (top) and UnTrac-Inv (bottom) on the synthetic datasets A (left) and B (right). The line denotes the average across four runs; the shaded area corresponds to the 95% confidence region.

Pretraining Datasets and Harmful Content

Further experiments examine whether the methods can trace the origins of harmful content to specific pretraining datasets. Using the OPT-125M model, UnTrac and UnTrac-Inv substantially outperform other influence-estimation methods, with estimated influence correlating consistently with the ground truth across dataset configurations and content types (Table 3).
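This kind of evaluation amounts to correlating an estimator's scores against ground-truth influence (e.g., from leave-dataset-out retraining). A minimal sketch with made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical per-dataset influences: ground truth from retraining
# vs. estimates from an unlearning-based method. Values are invented.
ground_truth = np.array([0.9, 0.1, 0.4, 0.7, 0.2])
estimated    = np.array([0.8, 0.2, 0.5, 0.6, 0.1])

# Pearson correlation coefficient, the metric used to compare methods.
r = np.corrcoef(ground_truth, estimated)[0, 1]
print(round(r, 3))  # prints 0.95
```

A correlation near 1 means the estimator ranks and scales dataset influences almost exactly as full retraining would, at a fraction of the cost.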

Sensitivity and Optimization Considerations

UnTrac is notably robust across optimizers such as RMSProp and Adam, performing best with higher learning rates and a suitable number of unlearning iterations. UnTrac-Inv, by contrast, is sensitive to batch size and learning rate, indicating a need for careful hyperparameter tuning (Figure 3).


Figure 3: Pearson correlation coefficient between the ground-truth influence and the influence estimated by UnTrac (left) and UnTrac-Inv (right) over unlearning epochs. The line denotes the average across four runs; the shaded area corresponds to the 95% confidence region.

Implications and Future Directions

The practical applicability of UnTrac and UnTrac-Inv extends to tracing dataset impacts in large models (e.g., GPT variants) efficiently. Future research could explore the methods across more diverse tasks, refining our understanding of emergent model capabilities and helping to mitigate biases and harmful behaviors.

Conclusion

UnTrac and UnTrac-Inv offer a pragmatic and scalable approach to identifying dataset influence on LLMs, outperforming existing methods in both accuracy and efficiency. They provide a viable framework for data attribution and unlearning research, potentially illuminating the otherwise opaque decision-making processes of LLMs.
