
Free()LM: Self-Forgetting for LLMs

Updated 24 February 2026
  • Free()LM introduces a trainable self-forgetting module that dynamically prunes redundant context and prevents context saturation in LLMs.
  • A plug-and-play LoRA adapter alternates between reasoning and cleaning modes to maintain a compact, high-signal working memory during long chain-of-thought tasks.
  • Empirical evaluations show consistent Pass@1 accuracy gains and substantial token reductions across LLM scales, validating the approach's scalability and efficiency.

The Free()LM framework addresses a fundamental limitation of standard LLMs in complex reasoning: their inability to discard obsolete or redundant information during long-horizon Chain-of-Thought (CoT) reasoning. By introducing an explicit, trainable self-forgetting operation implemented as a plug-and-play LoRA adapter called the Free-Module, Free()LM enables LLMs to dynamically prune unusable context and maintain a compact, high-signal working memory. This self-forgetting mechanism is empirically shown to resolve context window saturation and accuracy collapse in long reasoning tasks, improving performance and reliability across scales spanning 8B to 685B parameters (Zheng et al., 8 Feb 2026).

1. Architectural Motivation and High-Level Overview

Standard LLMs operate as "malloc-only" engines: as they generate extended CoT trajectories, all intermediate outputs—valid or otherwise—are retained in the context buffer. Empirical evidence indicates that, when reasoning tokens consume 70–90% of the available context, accuracy stagnates or collapses due to information overload and self-reinforcing degeneration. Free()LM endows the model with a "free()" operation via the Free-Module, permitting the system to iteratively alternate between:

  • Reasoning mode: Generates new CoT steps identical to the vanilla backbone.
  • Cleaning mode: Identifies and excises redundant, no-longer-relevant context spans, emitting explicit deletion instructions.

This mode switching results in periodic memory reclamation, breaking cyclic context accumulation and sustaining accuracy in deep reasoning settings.

2. Formalization and Mechanism of the Free-Module

Let the current model context be $C = \{u_1, u_2, \ldots, u_n\}$, where each $u_i$ is a text chunk. The Free-Module $F_\theta$, implemented as a set of LoRA adapters embedded in every linear projection of the Transformer, maps this context to a set of JSON-style pruning commands:

$$F_\theta : C \longmapsto \mathcal{O}, \qquad o_i = \{\text{"prefix"}: p_i,\ \text{"suffix"}: s_i\},$$

where each pair $(p_i, s_i)$ anchors a span of $C$ for removal. The module computes per-chunk redundancy scores $r_i = \mathrm{Score}_\theta(u_i \mid C)$; chunks whose scores exceed a threshold $\tau$ are marked for deletion.
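To make the mapping concrete, here is a minimal, purely illustrative sketch. It is not the paper's implementation: `score` is a toy lexical-overlap heuristic standing in for the learned $\mathrm{Score}_\theta$, and the threshold value is chosen arbitrarily.

```python
import json

def score(chunk: str, context: list[str]) -> float:
    """Toy redundancy score: fraction of the chunk's tokens already
    present elsewhere in the context (stand-in for Score_theta)."""
    seen = set(" ".join(c for c in context if c is not chunk).split())
    tokens = chunk.split()
    return sum(t in seen for t in tokens) / max(len(tokens), 1)

def free_module(context: list[str], tau: float = 0.7) -> list[dict]:
    """Map context C to JSON-style pruning commands O: one
    {"prefix", "suffix"} span anchor per chunk scored above tau."""
    commands = []
    for chunk in context:
        if score(chunk, context) > tau:
            commands.append({"prefix": chunk[:12], "suffix": chunk[-12:]})
    return commands

ctx = [
    "Let x = 3 and y = 4; compute x*y.",
    "Recall: x = 3 and y = 4.",   # restates earlier facts -> redundant
    "Therefore x*y = 12.",
]
print(json.dumps(free_module(ctx), indent=2))
```

With this toy scorer, only the restated middle chunk crosses the threshold and is anchored for deletion; the real module learns such judgments from oracle-annotated pruning targets.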

At inference:

  • Reasoning mode (unmerged LoRA): Model autoregressively generates new CoT tokens.
  • Cleaning mode (merged LoRA): The backbone, with Free-Module weights merged, emits a set of pruning commands. Application of these commands reduces context to only retain high-utility content.

The system alternates every $L_\text{clean}$ tokens, with each cleaning iteration followed by a reset to reasoning mode.

3. Implementation and Integration

The Free-Module uses LoRA with a typical configuration: rank $r = 128$, scaling $\alpha = 256$, dropout 0.1. Backbone weights are frozen; only the adapters are updated during supervised training, with target pruning commands distilled from oracle annotations. At inference, toggling between unmerged and merged adapters switches between generation and cleaning: when entering cleaning mode, the LoRA weights are added to the base weights, and after pruning they are removed again.
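The merge/unmerge toggle amounts to adding and subtracting a low-rank update. A minimal numerical sketch, assuming the standard LoRA parameterization $\Delta W = (\alpha/r)\,BA$ (the layer dimensions here are toy values, not the backbone's):

```python
import numpy as np

r, alpha = 128, 256                         # paper's LoRA configuration
d_out, d_in = 512, 512                      # toy layer dimensions

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen backbone weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable LoRA factors
B = rng.standard_normal((d_out, r)) * 0.01

# Low-rank update applied when switching into cleaning mode.
delta = (alpha / r) * (B @ A)

W_merged = W + delta           # cleaning mode: Free-Module active
W_restored = W_merged - delta  # reasoning mode: vanilla backbone again

assert np.allclose(W_restored, W)
```

Because merging is a plain weight addition, mode switching needs no extra parameters at serving time beyond the adapter factors themselves.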

Supported backbones include Qwen3-8B, Qwen3-30B-A3B, and Qwen3-235B-A22B. Context windows are 32k tokens for the 8B/30B models; the 235B model requires only 64k (down from 128k for vanilla operation). The cleaning interval is $L_\text{clean} = 5000$ tokens, with up to 50 cleaning cycles per generation.

Integration leverages serving with vLLM (v0.8.5), CUDA 12.6, FlashAttention-2, and BF16 precision. A JSON parser executes the pruning commands, reducing the context and then resuming generation.
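One possible shape for the command executor, assuming the prefix/suffix anchors are matched literally against the context string (an illustrative sketch, not the paper's parser):

```python
import json

def apply_pruning(context: str, commands_json: str) -> str:
    """Execute JSON pruning commands: delete each span running from
    its "prefix" anchor through its "suffix" anchor (inclusive).
    Unmatched anchors are skipped rather than raising, so a malformed
    command degrades to a no-op."""
    for cmd in json.loads(commands_json):
        start = context.find(cmd["prefix"])
        if start < 0:
            continue
        end = context.find(cmd["suffix"], start)
        if end < 0:
            continue
        context = context[:start] + context[end + len(cmd["suffix"]):]
    return context

cot = "Try factoring. [dead end: wrong factor pair] Use the quadratic formula."
cmds = json.dumps([{"prefix": "[dead end", "suffix": "pair]"}])
print(apply_pruning(cot, cmds))
```

Whitespace handling here is deliberately naive; a production executor would also tidy the seams left by deleted spans before resuming generation.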

4. Empirical Evaluation and Results

Extensive benchmark evaluation demonstrates that Free()LM yields a consistent average gain of 3.3% in Pass@1 accuracy over state-of-the-art reasoning baselines. Key results include:

| Model | Pass@1 (vanilla) | Pass@1 (Free()LM) | Δ Pass@1 | Avg. #Tokens (vanilla) | Avg. #Tokens (Free()LM) | Δ Tokens |
|---|---|---|---|---|---|---|
| Qwen3-8B | 44.24% | 48.14% | +3.90 | 17.5k | 13.8k | –21.1% |
| Qwen3-30B | 57.47% | 62.30% | +4.83 | 18.1k | 15.9k | –12.2% |
| Qwen3-235B | 69.18% | 70.47% | +1.29 | 26.1k | 19.3k | –26.1% |
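The Δ Tokens column is internally consistent with the reported averages, as a quick arithmetic check confirms:

```python
# Verify the token-reduction percentages: delta = (vanilla - free) / vanilla.
# Figures are the averaged token counts (in thousands) from the results table.
rows = {
    "Qwen3-8B":   (17.5, 13.8, 21.1),
    "Qwen3-30B":  (18.1, 15.9, 12.2),
    "Qwen3-235B": (26.1, 19.3, 26.1),
}
for model, (vanilla, free, reported) in rows.items():
    reduction = 100 * (vanilla - free) / vanilla
    assert abs(reduction - reported) < 0.1, model
    print(f"{model}: -{reduction:.1f}% tokens")
```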

On IMOAnswerBench, DeepSeek V3.2-Speciale with Free()LM attains a new SOTA, with Pass@1 increasing from 83.54% to 85.87% and average token usage decreasing by 45.99%. In long-horizon settings (e.g., HLE, >80k tokens), vanilla Qwen3-235B collapses to 0% accuracy, whereas Free()LM sustains ≈50% accuracy (Zheng et al., 8 Feb 2026).

Ablation studies show that heuristic compression baselines compromise accuracy, while in-context pruning with oracle impulses adds only ∼1% and does not prevent degeneration. The Free-Module trained on Qwen3-8B generalizes as a "universal pruning service," yielding +1.5% Pass@1 on Qwen3-235B and +2.3% on DeepSeek V3.2.

5. Comparative Analysis and Generalization

Compared to purely heuristic or in-context compression, Free()LM avoids destructive context truncation and exhibits robust cross-model transfer. The LoRA-based Free-Module is plug-and-play, requiring no retraining of backbone weights. Efficiency evaluations indicate a +56% increase in per-sample latency due to cleaning and re-prefill, offset by a –45% reduction in key–value (KV) cache memory (6.14 GB → 3.34 GB). A further projected latency reduction of ~20% is attainable with native KV pruning.
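The quoted cache saving follows directly from the measured sizes:

```python
# KV-cache reduction implied by the reported measurements (in GB).
before, after = 6.14, 3.34
saving = 100 * (before - after) / before
print(f"KV cache reduced by {saving:.1f}%")  # ~45%, matching the text
```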

Free()LM’s cleaning maintains the context within a 40–70k token "sweet spot," preventing context bloat and model drift associated with long CoT chains.

6. Significance and Implications

The central insight established by Free()LM is that sustainable LLM intelligence necessitates a controlled mechanism for self-forgetting—complementing the capacity for incremental reasoning. By completing the memory-management cycle (malloc & free analogy), Free()LM provides a learnable, context-sensitive pruning solution that effectively overcomes saturation and accuracy collapses in deep, long-horizon reasoning tasks. It is scalable, generalizes across architectures, and is compatible with contemporary inference infrastructures (Zheng et al., 8 Feb 2026).

This suggests that future LLM design should treat memory management as an integral part of the reasoning process, not an engineering afterthought, and further, that plug-in modules like Free-Module adapters can bridge the gap between reasoning power and sustainable inference.

References (1)
