Offset Unlearning for Large Language Models (2404.11045v1)
Abstract: Despite the strong capabilities of LLMs to acquire knowledge from their training corpora, the memorization of sensitive information in those corpora, such as copyrighted, harmful, and private content, has raised ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either inapplicable to black-box LLMs, because they require access to the model's internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose $\delta$-unlearning, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, $\delta$-unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that $\delta$-unlearning effectively unlearns target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. $\delta$-unlearning also incorporates different unlearning algorithms, making our approach a versatile solution for adapting various existing unlearning algorithms to black-box LLMs.
Authors: James Y. Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, Muhao Chen