Free()LM: Self-Forgetting for LLMs
- The Free()LM framework introduces a trainable self-forgetting module that dynamically prunes redundant context to prevent saturation in LLMs.
- It utilizes a plug-and-play LoRA adapter that alternates between reasoning and cleaning modes to maintain a compact, high-signal working memory during extensive chain-of-thought tasks.
- Empirical evaluations reveal consistent Pass@1 accuracy improvements and significant token reduction across various LLM scales, validating its scalability and efficiency.
The Free()LM framework addresses a fundamental limitation of standard LLMs in complex reasoning: their inability to discard obsolete or redundant information during long-horizon Chain-of-Thought (CoT) reasoning. By introducing an explicit, trainable self-forgetting operation implemented as a plug-and-play LoRA adapter called the Free-Module, Free()LM enables LLMs to dynamically prune unusable context and maintain a compact, high-signal working memory. This self-forgetting mechanism is empirically shown to resolve context window saturation and accuracy collapse in long reasoning tasks, improving performance and reliability across scales spanning 8B to 685B parameters (Zheng et al., 8 Feb 2026).
1. Architectural Motivation and High-Level Overview
Standard LLMs operate as "malloc-only" engines: as they generate extended CoT trajectories, all intermediate outputs—valid or otherwise—are retained in the context buffer. Empirical evidence indicates that, when reasoning tokens consume 70–90% of the available context, accuracy stagnates or collapses due to information overload and self-reinforcing degeneration. Free()LM endows the model with a "free()" operation via the Free-Module, permitting the system to iteratively alternate between:
- Reasoning mode: Generates new CoT steps identical to the vanilla backbone.
- Cleaning mode: Identifies and excises redundant, no-longer-relevant context spans, emitting explicit deletion instructions.
This mode switching results in periodic memory reclamation, breaking cyclic context accumulation and sustaining accuracy in deep reasoning settings.
2. Formalization and Mechanism of the Free-Module
Let the current model context be C = {c_1, …, c_n} (each c_i a text chunk). The Free-Module F, implemented as a set of LoRA adapters embedded in every linear projection of the Transformer, maps this context to a set of JSON-style pruning commands D = F(C), where each command in D anchors a span of C for removal. The module computes per-chunk redundancy scores r_i; chunks whose score exceeds a threshold τ are marked for deletion.
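The scoring-and-thresholding step can be sketched as follows. This is a minimal illustration: in Free()LM the redundancy scores come from the trained LoRA adapters, whereas here `redundancy_score` is a hypothetical stand-in callable, and the command schema is an assumed JSON shape.

```python
from typing import Callable

def select_prune_spans(chunks: list[str],
                       redundancy_score: Callable[[str], float],
                       tau: float = 0.5) -> list[dict]:
    """Return JSON-style pruning commands for chunks scoring above tau.

    `redundancy_score` stands in for the Free-Module's learned scorer;
    the {"op": "delete", "chunk_index": i} schema is illustrative.
    """
    commands = []
    for i, chunk in enumerate(chunks):
        if redundancy_score(chunk) > tau:
            commands.append({"op": "delete", "chunk_index": i})
    return commands
```

With a toy length-based scorer, `select_prune_spans(["a", "bb"], lambda c: len(c) / 2, tau=0.6)` flags only the second chunk.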
At inference:
- Reasoning mode (unmerged LoRA): The model autoregressively generates new CoT tokens.
- Cleaning mode (merged LoRA): The backbone, with Free-Module weights merged, emits a set of pruning commands. Application of these commands reduces context to only retain high-utility content.
The system alternates between the two modes on a fixed token interval: after each cleaning pass, it resets to reasoning mode and generation resumes on the pruned context.
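The alternation above can be sketched as a simple control loop. The `reason` and `clean` callables are hypothetical stand-ins for the backbone's unmerged-LoRA generation pass and the merged-LoRA cleaning pass; the actual system operates on token buffers inside the serving stack.

```python
def free_lm_loop(context, reason, clean, max_cycles=50):
    """Alternate reasoning and cleaning modes.

    reason(context) -> (new_tokens, done): one reasoning burst (unmerged
    LoRA), returning freshly generated tokens and a completion flag.
    clean(context) -> pruned context: merged-LoRA pass that emits pruning
    commands and applies them. Both callables are stand-ins.
    """
    for _ in range(max_cycles):
        new_tokens, done = reason(context)
        context = context + new_tokens
        if done:
            break
        context = clean(context)  # reclaim memory, then resume reasoning
    return context
```

The loop bounds the number of cleaning cycles, mirroring the paper's cap on cleaning passes per generation.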
3. Implementation and Integration
The Free-Module uses LoRA with a typical configuration: low rank r, scaling factor α, and dropout 0.1. Backbone weights are frozen; only the adapters are updated during supervised training, with target pruning commands distilled from oracle annotations. At inference, toggling between unmerged and merged adapters switches between generation and cleaning. When switching to cleaning, the LoRA weights are added to the base weights; after pruning, they are subtracted again.
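The merge/unmerge toggle is just weight arithmetic: merging folds the adapter into the base as W' = W + (α/r)·B·A, and unmerging subtracts the same term, so no second copy of the backbone is needed. A pure-Python sketch with toy list-of-lists matrices (dimensions and values illustrative):

```python
def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha/r) * B @ A."""
    scale = alpha / r
    rows, cols, inner = len(W), len(W[0]), len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def unmerge_lora(Wm, A, B, alpha, r):
    """Subtract the adapter term, restoring the original base weight."""
    return merge_lora(Wm, A, B, -alpha, r)
```

Because unmerging is an exact inverse of merging, the system can switch modes repeatedly without accumulating drift in the base weights.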
Supported backbones include Qwen3-8B, Qwen3-30B-A3B, and Qwen3-235B-A22B. The 8B and 30B models run with their standard context windows; the 235B model requires a substantially smaller window than it does under vanilla operation. Cleaning is triggered at a fixed token interval, with up to 50 cleaning cycles per generation.
Serving integrates with vLLM (v0.8.5), CUDA 12.6, FlashAttention-2, and BF16 precision. A JSON parser executes the pruning commands, shrinking the context before generation resumes.
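A minimal version of such a parser might look like the following. The command schema ({"op": "delete", "start": i, "end": j} over inclusive chunk indices) is an assumption; the paper specifies only that the commands are JSON-style deletion instructions.

```python
import json

def apply_prune_commands(chunks: list[str], raw_commands: str) -> list[str]:
    """Parse JSON pruning commands and drop the referenced chunk spans."""
    commands = json.loads(raw_commands)
    doomed = set()
    for cmd in commands:
        if cmd.get("op") == "delete":
            # Inclusive start/end chunk indices: an assumed schema.
            doomed.update(range(cmd["start"], cmd["end"] + 1))
    return [c for i, c in enumerate(chunks) if i not in doomed]
```

For example, deleting span [1, 2] from `["q", "step1", "step2", "answer"]` retains only the question and the answer.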
4. Empirical Evaluation and Results
Extensive benchmark evaluation demonstrates that Free()LM yields a consistent average gain of 3.3% in Pass@1 accuracy over state-of-the-art reasoning baselines. Key results include:
| Model | Pass@1 (vanilla) | Pass@1 (Free()LM) | Δ Pass@1 (pp) | Avg. #Tokens (vanilla) | Avg. #Tokens (Free()LM) | Δ Tokens (%) |
|---|---|---|---|---|---|---|
| Qwen3-8B | 44.24% | 48.14% | +3.90 | 17.5k | 13.8k | –21.1% |
| Qwen3-30B | 57.47% | 62.30% | +4.83 | 18.1k | 15.9k | –12.2% |
| Qwen3-235B | 69.18% | 70.47% | +1.29 | 26.1k | 19.3k | –26.1% |
On IMOAnswerBench, DeepSeek V3.2-Speciale with Free()LM attains a new SOTA, with Pass@1 increasing from 83.54% to 85.87% and average token usage decreasing by 45.99%. In long-horizon settings (e.g., HLE at long generation budgets), vanilla Qwen3-235B collapses to 0% accuracy, whereas Free()LM sustains accuracy (Zheng et al., 8 Feb 2026).
Ablation studies show that heuristic compression baselines compromise accuracy, while in-context pruning with oracle impulses yields only marginal gains and does not prevent degeneration. The Free-Module trained on Qwen3-8B generalizes as a "universal pruning service," giving +1.5% Pass@1 on Qwen3-235B and +2.3% on DeepSeek V3.2.
5. Comparative Analysis and Generalization
Compared to purely heuristic or in-context compression, Free()LM avoids destructive context truncation and exhibits robust cross-model transfer. The LoRA-based Free-Module is plug-and-play, requiring no retraining of backbone weights. Efficiency evaluations indicate a +56% increase in per-sample latency due to cleaning and re-prefill, offset by a –45% reduction in key–value (KV) cache memory (6.14 GB → 3.34 GB). A further projected latency reduction (~20%) is attainable with native KV pruning.
Free()LM’s cleaning maintains the context within a 40–70k token "sweet spot," preventing context bloat and model drift associated with long CoT chains.
6. Significance and Implications
The central insight established by Free()LM is that sustainable LLM intelligence necessitates a controlled mechanism for self-forgetting—complementing the capacity for incremental reasoning. By completing the memory-management cycle (the malloc/free analogy), Free()LM provides a learnable, context-sensitive pruning solution that effectively overcomes saturation and accuracy collapse in deep, long-horizon reasoning tasks. It is scalable, generalizes across architectures, and is compatible with contemporary inference infrastructures (Zheng et al., 8 Feb 2026).
This suggests that future LLM design should treat memory management as an integral part of the reasoning process, not an engineering afterthought, and further, that plug-in modules like Free-Module adapters can bridge the gap between reasoning power and sustainable inference.