Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Published 6 Jul 2025 in cs.LG and cs.AI | (2507.04219v2)

Abstract: Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their training objectives. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method - Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from the model. Our core idea is to leverage this collapse for unlearning by triggering collapse partially on the sensitive data. We theoretically analyze that our approach converges to the desired outcome, i.e. the LLM unlearns the information in the forget set. We empirically demonstrate that PMC overcomes two key limitations of existing unlearning approaches that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.


Authors (4)

Summary

The paper introduces Partial Model Collapse (PMC) to unlearn sensitive data from LLMs by triggering controlled distribution collapse.
It iteratively finetunes the model on synthetic, model-generated answers for forget questions and on ground-truth answers for retain questions; the paper's analysis shows convergence to the retain distribution under certain assumptions.
Experimental results on the TOFU dataset validate PMC's ability to maintain generation coherence and reduce information leakage.

Targeted Information Removal in LLMs via Partial Model Collapse

The paper "Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs" (2507.04219) introduces Partial Model Collapse (PMC), a novel machine unlearning method for LLMs that leverages the principles of model collapse to selectively remove sensitive information without directly optimizing on private data. This approach addresses limitations in existing unlearning techniques, which often rely on incorporating unlearning targets into their training objectives, potentially reinforcing exposure to sensitive data and contradicting the principle of minimizing its use.

Core Concepts and Methodology

The PMC method draws inspiration from the observation that iterative finetuning of generative models on their own generations can lead to distribution collapse, effectively removing information from the model. By analogy, the method partially triggers distribution collapse on the sensitive data by iteratively finetuning the model on retain data augmented with the model's own generations.
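The collapse effect described above can be reproduced in a toy setting. The sketch below is an illustration only, not the paper's experiment: it assumes a four-category distribution stands in for the generative model and that "finetuning" means refitting the distribution to the empirical frequencies of its own samples. The function names, sample size, and round count are all hypothetical choices.

```python
import random
from collections import Counter


def refit_on_own_samples(probs, n_samples, rng):
    """Draw n_samples categories from probs and return their empirical frequencies."""
    # Sample only from categories that still have mass; a lost category cannot return.
    support = [c for c, p in enumerate(probs) if p > 0]
    weights = [probs[c] for c in support]
    draws = rng.choices(support, weights=weights, k=n_samples)
    counts = Counter(draws)
    return [counts[c] / n_samples for c in range(len(probs))]


def simulate_collapse(probs, n_samples=20, n_rounds=200, seed=0):
    """Repeatedly replace the distribution with a refit on its own samples."""
    rng = random.Random(seed)
    history = [list(probs)]
    for _ in range(n_rounds):
        probs = refit_on_own_samples(probs, n_samples, rng)
        history.append(probs)
    return history


history = simulate_collapse([0.4, 0.3, 0.2, 0.1])
# Number of categories with non-zero probability after each round.
support_sizes = [sum(p > 0 for p in probs) for probs in history]
print(support_sizes[0], "->", support_sizes[-1])
```

Because a category that receives zero samples in one round has zero probability in every later round, the support can only shrink. This is the information loss that PMC aims to trigger selectively on the forget data.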

Figure 1: We propose Partial Model Collapse (PMC), a novel unlearning method that leverages the principles of model collapse to remove private information from LLMs. By iteratively finetuning the model on retain data augmented with the model's own generations, we can partially trigger distribution collapse on the private data we want to unlearn. This approach allows us to achieve unlearning without directly optimizing on sensitive data, thus aligning with stricter privacy constraints.

The core methodology involves three steps: (1) generate synthetic answers for forget questions by sampling responses from the model; (2) select the best sampled response according to a reward function that models desired answers; and (3) finetune the LLM on ground-truth answers for retain questions and on the selected synthetic answers for forget questions. Because the forget-set answers never appear in the training objective, the model moves away from them only through its own generations. The theoretical analysis provides a formal basis for the method, showing convergence to a target distribution in which the influence of the private data is effectively eliminated.
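The three steps can be sketched with a toy stand-in for the LLM. Everything here is an assumption made for illustration: the "model" is a dictionary mapping each question to a categorical distribution over answers, `finetune_step` replaces a real gradient update, and `generic_reward` is a hypothetical reward that prefers a non-committal answer. The paper's actual reward function and training setup are not reproduced. Note that the reward never references the private answer.

```python
import random


def sample_answers(model, question, n, rng):
    """Step 1: sample n candidate answers from the model's own distribution."""
    answers = list(model[question])
    weights = [model[question][a] for a in answers]
    return rng.choices(answers, weights=weights, k=n)


def finetune_step(model, question, target, lr):
    """Toy stand-in for a gradient step: shift probability mass toward target."""
    dist = model[question]
    for answer in dist:
        goal = 1.0 if answer == target else 0.0
        dist[answer] += lr * (goal - dist[answer])


def pmc_epoch(model, retain_data, forget_questions, reward, n_samples, lr, rng):
    """One round: ground truth on retain questions, best-of-n samples on forget questions."""
    for question, truth in retain_data.items():
        finetune_step(model, question, truth, lr)
    for question in forget_questions:
        candidates = sample_answers(model, question, n_samples, rng)
        best = max(candidates, key=reward)         # Step 2: pick the best sample
        finetune_step(model, question, best, lr)   # Step 3: train on that sample


def generic_reward(answer):
    """Hypothetical reward: prefers a non-committal answer, never sees the private one."""
    return 1.0 if answer == "I am not sure." else 0.0


model = {
    "retain q": {"right": 0.5, "wrong": 0.5},
    "forget q": {"private fact": 0.6, "other guess": 0.2, "I am not sure.": 0.2},
}
rng = random.Random(0)
for _ in range(50):
    pmc_epoch(model, {"retain q": "right"}, ["forget q"], generic_reward, 16, 0.3, rng)
print(model)
```

In this toy, the forget question is trained only on answers the model itself produced, so the private answer loses mass without ever appearing as a training target or in the reward.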

Theoretical Analysis

The paper provides theoretical underpinnings for PMC, demonstrating its convergence properties and effectiveness in removing private information. The analysis begins with categorical distributions and extends to arbitrary distributions, adapting the method for practical use in question-answering tasks. Key theoretical results include a lemma demonstrating information loss about non-retain categories during iterative relearning and a theorem proving exponential convergence to the retain distribution under certain assumptions.
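To give a sense of why exponential convergence is plausible in the categorical case, consider a simplified mixing model. This is an assumption made for illustration and not the paper's theorem or its exact assumptions: let $p_t$ be the model's distribution after round $t$, let $q$ be the retain distribution (zero on non-retain categories), and suppose each round refits the model to a mixture with a hypothetical retain fraction $\lambda \in (0, 1]$, in the infinite-sample limit.

```latex
% Toy update (assumed): mix retain data with the model's own generations.
p_{t+1} = \lambda\, q + (1 - \lambda)\, p_t

% Subtracting q from both sides gives a linear contraction:
p_{t+1} - q = (1 - \lambda)\,(p_t - q)
\quad\Longrightarrow\quad
p_t - q = (1 - \lambda)^t\,(p_0 - q)

% For any non-retain category c, q(c) = 0, so its mass decays geometrically:
p_t(c) = (1 - \lambda)^t\, p_0(c)
```

Under this toy model the mass on every non-retain category shrinks by a factor of $1 - \lambda$ per round. With finite samples, the resampling noise of collapse adds a second effect: a category that draws no samples in a round contributes nothing to the self-generated part of the mixture.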

Experimental Validation and Results

The efficacy of PMC is empirically validated through extensive experiments conducted on the TOFU dataset, a collection of question-answering pairs designed for machine unlearning. The experimental results demonstrate that PMC effectively removes private information from model outputs while overcoming key limitations of existing unlearning approaches. Specifically, PMC preserves generation coherence by avoiding unintended degradation in unrelated contexts and reduces information leakage by preventing unnatural suppression of correct answers, mitigating vulnerability to probability-based attacks.
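The concern about probability-based attacks can be illustrated with a toy check. This is a hypothetical attacker written for illustration, not the evaluation used in the paper: it assumes the attacker can score a list of candidate answers with the unlearned model and flags a candidate whose log-probability sits far below the median. The function name, the 5-nat threshold, the candidate labels, and all numbers are made up.

```python
from statistics import median


def flag_suppressed_answer(logprobs, gap_nats=5.0):
    """Return the candidate whose log-probability is anomalously low, else None."""
    lowest = min(logprobs, key=logprobs.get)
    typical = median(logprobs.values())
    if typical - logprobs[lowest] >= gap_nats:
        return lowest
    return None


# Made-up scores: an unlearning method that pushes the true answer far below
# every plausible alternative reveals which answer was targeted.
over_suppressed = {"answer A": -30.0, "answer B": -4.1, "answer C": -4.6, "answer D": -5.0}
# Made-up scores: the true answer looks like any other unlikely candidate.
natural = {"answer A": -5.5, "answer B": -4.1, "answer C": -4.6, "answer D": -5.0}

print(flag_suppressed_answer(over_suppressed))  # answer A
print(flag_suppressed_answer(natural))          # None
```

An unlearned model whose scores resemble the second case gives such an attacker no outlier to exploit, which is the kind of leakage the summary says PMC reduces.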

Implications and Future Directions

The introduction of PMC offers a new paradigm for LLM unlearning, reframing model collapse as a tool for targeted information removal. By harnessing this mechanism, PMC enables new avenues towards more trustworthy machine learning under stricter privacy constraints.

However, the authors note two limitations: sampling from the model's distribution adds computational overhead, and the reward function must be tailored to the specific application. Future research directions include investigating more efficient sampling techniques, exploring pruned variants of the model, and developing comprehensive evaluation metrics for assessing unlearning effectiveness.

Conclusion

This work makes a significant contribution to the field of machine unlearning by introducing PMC, a theoretically grounded and empirically validated method for selectively removing sensitive information from LLMs. By leveraging the principles of model collapse and reframing it as a tool for targeted information removal, PMC offers a promising path towards more trustworthy and privacy-preserving AI systems.
