Free()LM: Self-Forgetting LLM Architecture
- Free()LM is a large language model framework that incorporates an explicit self-forgetting mechanism via a Free-Module to prune redundant context.
- The architecture alternates between reasoning and cleaning modes using LoRA-based adapters to dynamically remove irrelevant information and prevent performance collapse.
- Empirical evaluations demonstrate improved accuracy on long-context reasoning tasks and a reduction in generated tokens by up to 46%.
Free()LM denotes a class of LLMs augmented with an intrinsic “self-forgetting” capability, instantiated by the integration of the Free-Module—a compact, plug-and-play adapter designed to address the architectural limitations of “malloc-only” reasoning engines. In standard LLM pipelines, context is an append-only buffer: each new step is appended, but obsolete or redundant information is never explicitly removed. This leads to context rot, degeneration loops, and performance collapse during long-horizon reasoning. Free()LM solves this by alternating between reasoning and cleaning modes, dynamically identifying and excising irrelevant context to maintain a compact, noise-free state. This architecture demonstrates consistent improvements across model scales and establishes new state-of-the-art results on challenging long-context and mathematical reasoning benchmarks (Zheng et al., 8 Feb 2026).
1. Rationale for Explicit Forgetting in LLMs
Standard LLMs maintain their working context as an append-only structure, growing monotonically as inference proceeds. This “malloc-only” paradigm results in several critical phenomena:
- Context Rot: As empirically observed (Hong et al., 2025), model accuracy often plateaus or regresses once the context window reaches 70–90% capacity.
- Degeneration Loops: Extended inference (>20k–100k tokens) accumulates not only valid reasoning steps but also obsolete branches, redundant corrections, and irrelevant information, leading to output degradation.
- Performance Collapse: In extreme long-horizon settings (e.g., Qwen3-235B on HLE sequences exceeding 80k tokens), pass@1 accuracy can drop to 0%.
Free()LM addresses these problems by training a model to actively “free” (delete) unnecessary context through data-driven, reward-filtered supervision—restoring model reliability and efficiency in settings where classical models degrade (Zheng et al., 8 Feb 2026).
2. Free-Module: Mechanism and Formalism
The Free-Module is a LoRA-based adapter layered on top of the backbone LLM. Its operation involves the following steps:
- Context Decomposition: The current context is partitioned into fixed-length chunks (the chunk length is a hyperparameter fixed during training).
- Chunk Embedding: Each chunk is embedded via mean pooling over its token hidden states or via a designated special hidden state.
- Retention Scoring: The adapter outputs a retention score s_i ∈ [0, 1] for each chunk i. A threshold τ is applied, and only chunks with s_i ≥ τ are retained.
- Anchor-Based Span Deletion: The module emits a list of (start, end) anchor pairs in JSON, specifying spans to be pruned via regex-based matching and replacement in the live context.
- Loss Function: Training supervision uses cross-entropy over serialized JSON tokens to match distilled “oracle” deletion commands. Only prunings that preserve or improve rollout accuracy are retained, eliminating the need for RL or regret-based objectives.
This approach completes the reasoning pipeline with an explicit free() operation, addressing both the efficiency and quality deficits of traditional approaches (Zheng et al., 8 Feb 2026).
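The scoring-and-deletion cycle described above can be sketched in plain Python. Everything concrete here is an illustrative assumption: the character-based chunking, the keyword stand-in for the scorer (the real Free-Module scores chunks with a LoRA adapter), and the exact anchor format. Only the overall flow — chunk, score, threshold, emit JSON anchors, regex-prune — follows the paper's description.

```python
import json
import re

TAU = 0.5  # illustrative retention threshold (not from the paper)

def split_chunks(context, chunk_len=200):
    """Partition the context into fixed-length chunks (characters here,
    tokens in the real system)."""
    return [context[i:i + chunk_len] for i in range(0, len(context), chunk_len)]

def retention_scores(chunks):
    """Stand-in scorer: the real Free-Module embeds each chunk and scores it
    with a LoRA adapter. Here we simply keep chunks that mention a 'result'."""
    return [1.0 if "result" in c else 0.0 for c in chunks]

def emit_anchor_commands(chunks, scores, tau=TAU):
    """Emit JSON deletion commands as (start, end) anchor pairs for chunks
    scoring below the threshold. Anchors are short prefixes/suffixes and
    must not overlap within a chunk."""
    commands = [
        {"start": c[:8], "end": c[-8:]}
        for c, s in zip(chunks, scores) if s < tau
    ]
    return json.dumps(commands)

def apply_anchors(context, commands_json):
    """Prune each anchored span from the live context via regex replacement."""
    for cmd in json.loads(commands_json):
        pattern = re.escape(cmd["start"]) + r".*?" + re.escape(cmd["end"])
        context = re.sub(pattern, "", context, count=1, flags=re.DOTALL)
    return context
```

Keeping the module's output to a few anchor pairs, rather than a rewritten context, is what makes the cleaning step cheap: the executor resolves each pair against the live context and deletes the enclosed span.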
3. Integration into Inference and Training Workflow
Free()LM operates by alternating between two modes:
- Reasoning Mode: The base LLM generates reasoning tokens without intervention from the Free-Module.
- Cleaning Mode: After every N generated tokens (default N = 5000), the Free-Module is merged via LoRA adapters, processes the context, emits deletion commands, and prunes the context accordingly.
The inference procedure can be summarized as follows:
- Initialize context and mode (“Reasoning”).
- While completion not reached:
- If in “Reasoning” mode: Generate tokens up to the pruning interval, then switch to “Cleaning.”
- If in “Cleaning” mode: Merge Free-Module, emit and apply pruning anchors, unmerge, and revert to “Reasoning.”
- Return the final output.
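The alternating procedure above can be expressed as a small state machine. The generator, pruner, and stopping condition below are caller-supplied stand-ins; in Free()LM the cleaning step merges the LoRA-based Free-Module, applies its JSON anchors, and unmerges before resuming.

```python
PRUNE_INTERVAL = 5000  # default cleaning interval in tokens (per the paper)

def free_lm_inference(generate_step, prune_context, done, context="",
                      prune_interval=PRUNE_INTERVAL):
    """Alternate between Reasoning and Cleaning modes until completion.

    generate_step(context) -> newly generated text (str)
    prune_context(context) -> pruned context (str)
    done(context)          -> bool, True when the task is complete
    """
    mode = "Reasoning"
    tokens_since_prune = 0
    while not done(context):
        if mode == "Reasoning":
            step = generate_step(context)
            context += step
            tokens_since_prune += len(step.split())  # crude token count
            if tokens_since_prune >= prune_interval:
                mode = "Cleaning"
        else:
            # Merge Free-Module, emit and apply anchors, unmerge (abstracted here).
            context = prune_context(context)
            tokens_since_prune = 0
            mode = "Reasoning"
    return context
```

Because the backbone is untouched while reasoning and the adapter is merged only during cleaning, the two modes never interfere with each other's weights.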
LoRA Integration Details:
- Adapters are injected on every linear layer (LoRA rank r and scaling α are fixed hyperparameters; dropout = 0.1).
- Less than 0.1% parameter overhead.
- Only LoRA weights are updated; the backbone is frozen throughout.
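As a rough sanity check on the parameter-overhead claim: LoRA adds r·(d_in + d_out) parameters to a d_in × d_out linear layer, versus d_in·d_out in the layer itself. The dimension and rank below are hypothetical, chosen only to illustrate the calculation; the paper's actual rank is not given here.

```python
def lora_overhead(d_in, d_out, rank):
    """Fraction of extra parameters LoRA adds to one linear layer:
    the adapter is two low-rank matrices, A (d_in x rank) and B (rank x d_out)."""
    base = d_in * d_out
    extra = rank * (d_in + d_out)
    return extra / base

# Hypothetical 8192-dim square projection with a small rank:
ratio = lora_overhead(8192, 8192, rank=4)  # = 2 * 4 / 8192, i.e. just under 0.1%
```

The ratio shrinks linearly as the hidden dimension grows, which is why sub-0.1% overhead is plausible at the 30B–235B scales the paper targets.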
Training is conducted on 6,648 reward-filtered, high-quality deletion instances from DeepMath-103k, using the AdamW optimizer for 5 epochs with a batch size of 16. Precision is set to BF16 with FlashAttention-2, and parallelism is managed via DeepSpeed ZeRO (Stage 3 for 8B, Stage 2 for 30B/235B) (Zheng et al., 8 Feb 2026).
4. Empirical Performance and Evaluation
Free()LM’s empirical evaluations demonstrate significant and consistent gains:
- Long-Horizon Mathematical Reasoning: On benchmarks such as AIME2425, BrUMO25, HMMT, BeyondAIME, HLE (text), and IMOAnswerBench, Free()LM exceeds baselines (vanilla Qwen3, heuristic methods, ICL) by 3.3% on average. On IMOAnswerBench (DeepSeek V3.2-Speciale), the state of the art improves from 83.54 to 85.87 (+2.33 points), with a 46% reduction in generated tokens.
- Long-Tail Recovery: On extended context trajectories (>80k tokens) where vanilla models collapse (0% accuracy), Free()LM restores pass@1 to approximately 50%.
- General Evaluation: Maintains or slightly improves (≤+0.2%) accuracy on general-purpose safety and knowledge benchmarks (BBH, MMLU-Pro, MMLU-STEM, GPQA).
- Systems Efficiency: On Qwen3-235B/A22B with HLE, the key-value cache is reduced from 6.14 GB to 3.34 GB (–45.6%), with inference latency increasing by 56% (which can be limited to 20% with direct KV-cache pruning).
Ablation studies reveal that heuristic compression degrades both accuracy and output length, while ICL alone yields only ~1% improvement. The Free-Module demonstrates notable cross-model generalization, with an 8B-trained variant delivering gains (+1.5% to +2.3%) when applied to larger backbones and new families (Zheng et al., 8 Feb 2026).
5. Insights, Best Practices, and Deployment
Deployment and operationalization of Free()LM emphasize:
- LoRA Adapters: Preferred for zero-backbone modification and efficient fine-tuning.
- Pruning Parameters: Set pruning intervals in the range of 4–6k tokens to balance persistence of reasoning vs. noise accumulation.
- Anchor-Based Deletion: Utilize anchor pairing to excise large spans for minimal module output cost.
- Context Management: Deploy either as an external JSON executor or internalize via KV-cache pruning for reduced re-prefilling overhead.
- Universal Pruning Service: One can train the Free-Module on a mid-scale model and deploy across different model families, enabling broad applicability.
- Cycle Tuning: The maximum number of cleaning cycles per query should be matched to application complexity (e.g., longer for mathematical tasks, shorter for dialogues).
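The deployment knobs listed above can be collected into a single configuration object. The field names and default values below are illustrative, not from the paper, except for the 5,000-token interval and the 4–6k recommended range.

```python
from dataclasses import dataclass

@dataclass
class FreeLMConfig:
    prune_interval: int = 5000        # tokens between cleaning passes (4-6k recommended)
    retention_threshold: float = 0.5  # keep chunks scoring at or above this value
    max_clean_cycles: int = 8         # cap on cleaning cycles per query; raise for math tasks
    use_kv_cache_pruning: bool = True # internalize deletions to reduce re-prefilling

def validate(cfg: FreeLMConfig) -> bool:
    """Basic sanity checks on the pruning parameters."""
    return (4000 <= cfg.prune_interval <= 6000
            and 0.0 <= cfg.retention_threshold <= 1.0
            and cfg.max_clean_cycles > 0)
```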
A plausible implication is that Free()LM can serve as the basis for a “Universal Context Pruning Service” in LLM-centric systems, addressing both accuracy and cost at scale (Zheng et al., 8 Feb 2026).
6. Significance in Contemporary LLM Architectures
The introduction of an explicit free() primitive closes the memory management lifecycle for LLMs, previously dominated by unbounded context growth. By training lightweight Free-Modules to detect and prune superfluous information, Free()LM mitigates the major pathologies of context rot and reasoning collapse, significantly boosting long-horizon task reliability while reducing resource demands by more than 20% on key benchmarks. This methodology departs from failed heuristics and prompt engineering approaches, establishing a scalable foundation for sustainable, self-managing, and adaptive reasoning agents (Zheng et al., 8 Feb 2026).