Agentic Self-Compression in LLM Agents
- Agentic self-compression is a mechanism where LLM agents autonomously condense long interaction histories to minimize computational costs while maintaining task accuracy.
- It integrates explicit policy controls and reward signals to balance information retention with resource constraints in multi-step and tool-augmented tasks.
- Empirical results show substantial token savings and near-baseline performance, demonstrating its effectiveness in long-horizon, cost-sensitive operations.
Agentic self-compression refers to mechanisms by which an autonomous agent—typically one powered by LLMs—actively governs the condensation of its long-term interaction history, observations, or memory into more compact representations, with the goal of minimizing computational cost (tokens, memory, inference latency) while maintaining high task performance. Unlike traditional, externally administered context summarization, agentic self-compression places compression decisions within the agent's own reasoning or action policy, allowing the system to adaptively balance information retention and resource constraints in real-time agency settings (Feng et al., 8 Jan 2026, Verma, 12 Jan 2026, He et al., 25 Dec 2025, Kang et al., 1 Oct 2025).
1. Motivation and Core Problem
The deployment of LLM-based agents in multi-step, interactive environments quickly results in the unbounded accumulation of textual or state history, leading to context bloat. Such growth inflates token budgets, challenges fixed-size attention mechanisms, and incurs memory and compute overheads—most acutely in long-horizon and tool-augmented reasoning tasks. Context bloat, as articulated in the context of autonomous software engineering agents, not only impacts compute and latency but also degrades reasoning quality due to distraction by irrelevant or outdated prior steps (Feng et al., 8 Jan 2026, Verma, 12 Jan 2026).
Traditional compression approaches (e.g., periodic summarization by a separate module) are passive, often external to the agent policy, and provide limited adaptability. By contrast, agentic self-compression empowers the agent to decide when and how aggressively to compress, and to optimize the trade-off between preserving task-relevant information and reducing memory footprint (Verma, 12 Jan 2026, He et al., 25 Dec 2025). This enables cost-efficient, long-horizon operation without distorting end-task accuracy.
2. Mechanisms for Agentic Self-Compression
2.1 Compression Rate Emission and Policy Control
Agentic self-compression is typically realized by extending the agent’s policy to emit both environment actions and explicit compression commands. In AgentOCR (Feng et al., 8 Jan 2026), the agent at each step t outputs not only an action but also a scalar compression factor c_t. This controls the downsampling of a rendered visual memory buffer:
```
<action>...</action> <compression>1.2</compression>
```
The environment wrapper then rescales the visual context by c_t, trading off image fidelity for token budget efficiency. Crucially, c_t is sampled from the agent’s policy distribution and learned jointly with task actions, permitting strategic, state-conditioned self-compression.
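A minimal sketch of this interface, assuming the tag format shown above; the function names, the default factor of 1.0, and the quadratic token-to-scale relationship are illustrative assumptions, not AgentOCR's actual implementation:

```python
import re

def parse_step(output: str) -> tuple[str, float]:
    """Extract the action and compression factor from a policy output.

    The tag format mirrors the example above; defaulting to 1.0
    (no compression) when the tag is absent is an assumption.
    """
    action = re.search(r"<action>(.*?)</action>", output, re.S)
    comp = re.search(r"<compression>(.*?)</compression>", output, re.S)
    factor = float(comp.group(1)) if comp else 1.0
    return (action.group(1).strip() if action else "", factor)

def rescaled_token_budget(token_budget: int, factor: float) -> int:
    """Downsampling an image by `factor` per side shrinks its token
    footprint roughly quadratically (illustrative approximation)."""
    return max(1, int(token_budget / (factor * factor)))

action, c_t = parse_step(
    "<action>open drawer</action> <compression>1.2</compression>"
)
```

Because c_t is part of the sampled output rather than a fixed hyperparameter, the policy gradient flows through the compression decision just as it does through the action tokens.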
In context-centric agents such as Focus (Verma, 12 Jan 2026), explicit toolcall primitives (e.g., start_focus(name)/complete_focus()) are invoked by the agent to delimit exploration phases, after which the agent consolidates raw history into a persistent knowledge block. The agent autonomously decides when to summarize and prune, often prompted by heuristics (e.g., tool call counts) or internal reasoning triggers.
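The phase-delimiting loop can be sketched as follows. Only the primitive names `start_focus`/`complete_focus` come from the paper; the class structure, the tool-call-count trigger, and the summarizer callable are illustrative stand-ins for the agent's actual LLM-driven consolidation:

```python
class FocusAgent:
    """Sketch of Focus-style phase delimiting (Verma, 12 Jan 2026).

    `summarize` stands in for an LLM summarization call; the
    tool-call-count threshold is one of the heuristic triggers
    mentioned in the text, with an illustrative default.
    """

    def __init__(self, summarize, tool_call_threshold: int = 8):
        self.summarize = summarize
        self.threshold = tool_call_threshold
        self.knowledge = []    # persistent knowledge blocks
        self.raw_history = []  # messages in the current focus phase
        self.focus_name = None

    def start_focus(self, name: str) -> None:
        self.focus_name, self.raw_history = name, []

    def record(self, message: str) -> None:
        self.raw_history.append(message)
        # Heuristic trigger: consolidate once the phase grows too long.
        if len(self.raw_history) >= self.threshold:
            self.complete_focus()

    def complete_focus(self) -> None:
        if self.raw_history:
            block = self.summarize(self.focus_name, self.raw_history)
            self.knowledge.append(block)
        self.focus_name, self.raw_history = None, []

agent = FocusAgent(lambda name, hist: f"{name}: {len(hist)} steps",
                   tool_call_threshold=3)
agent.start_focus("explore repo")
for msg in ["ls", "cat setup.py", "grep main"]:
    agent.record(msg)
```

After consolidation the raw exploration messages are dropped and only the compact knowledge block persists in context.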
2.2 Reward Signals and Objective Formulation
The learning objective for self-compressing agents augments the base reward (task success) with compression-aware terms. For example, AgentOCR employs Group Relative Policy Optimization (GRPO) and sparsely injects a compression reward r_comp, with total reward r = r_task + λ · r_comp applied every K steps, where λ is a small weight and K is the reward interval (Feng et al., 8 Jan 2026).
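The intermittent reward shaping can be sketched as below. The functional form of the compression bonus (linear in the emitted factor) and the default values of the weight and interval are assumptions for illustration; AgentOCR's exact formulation may differ:

```python
def total_reward(step: int, r_task: float, compression_factor: float,
                 lam: float = 0.05, interval: int = 10) -> float:
    """Illustrative reward shaping: add a compression bonus only every
    `interval` steps, so the agent cannot greedily chase compression
    at every turn. The bonus grows with the emitted compression
    factor beyond the no-compression baseline of 1.0 (an assumption).
    """
    r_comp = compression_factor - 1.0
    if step % interval == 0:
        return r_task + lam * r_comp
    return r_task
```

Setting `interval = 1` recovers the per-step injection regime that, per the ablations below, collapses task success.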
In Focus, the cost minimized is a weighted sum of total context tokens and information loss, operationalized through periodic summarization and history pruning (Verma, 12 Jan 2026).
2.3 Compression Algorithms and Guidelines
Self-compression may be realized via:
- Direct scalar rate control (as in AgentOCR, controlling visual fidelity)
- Explicit use of compression primitives by the agent, plus summarization of messages (as in Focus)
- LLM-based compressors working under optimized natural language "compression guidelines" (as in ACON (Kang et al., 1 Oct 2025)), refined through contrastive analysis of task success/failure upon compression
In all cases, the agent is the locus of compression decisions and may optimize corresponding parameters or policies through RL, policy gradients, or prompt-guided meta-optimization.
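For the guideline-based variant, the refinement step can be sketched as a meta-optimization loop. The `llm` callable and the prompt wording are placeholders, not ACON's actual prompts; only the idea of rewriting the natural-language guideline from post-compression failures comes from the source:

```python
def refine_guideline(guideline: str, failed_tasks: list[str], llm) -> str:
    """ACON-style guideline refinement sketch (Kang et al., 1 Oct 2025):
    an LLM inspects tasks that failed after their context was
    compressed and rewrites the compression guideline so the lost
    information is preserved next time. `llm` is a placeholder for
    any text-in/text-out model call; the prompt is illustrative.
    """
    prompt = (
        "Current compression guideline:\n" + guideline +
        "\n\nTasks that failed after compression:\n" +
        "\n".join(failed_tasks) +
        "\n\nRewrite the guideline so the information these tasks "
        "needed is preserved, while staying concise."
    )
    return llm(prompt)
```

Iterating this loop alternates improvements to utility (fewer compression-induced failures) and parsimony (shorter compressed contexts).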
3. Information-Theoretic Frameworks for Compression Quality
A central theoretical perspective models the compression process as a communication problem: the compressor (an LLM or explicit mechanism) maps the full context x to a compressed representation z, analogous to traversing a noisy channel (He et al., 25 Dec 2025). The relevant metric is the mutual information I(X; Z), quantifying the degree to which information in the original context survives compression.
Estimation is achieved via Monte Carlo, comparing the log-probabilities of compressed outputs under their paired input contexts against the marginal distribution. High mutual information per token—a measure of "information density"—tightly predicts downstream task accuracy across diverse agentic pipelines. Larger compressors, or richer compression policies, typically increase both information retention and token efficiency, yielding superior agentic self-compression (He et al., 25 Dec 2025).
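A minimal sketch of such an estimator, assuming the standard Monte Carlo form I(X; Z) ≈ E[log p(z|x) − log p(z)]; the input arrays stand in for log-probabilities obtained from a scoring LM, and the per-token normalization is the "information density" reading of the text:

```python
def mi_per_token(paired_logps: list[float],
                 marginal_logps: list[float],
                 n_tokens: int) -> float:
    """Monte Carlo estimate of mutual information per compressed token.

    paired_logps[i]:   log p(z_i | x_i), the compressed output scored
                       under its own input context.
    marginal_logps[i]: log p(z_i), the same output scored under the
                       marginal (e.g., averaged over other contexts).
    The estimator averages log p(z|x) - log p(z) over samples, then
    normalizes by the compressed length to get information density.
    """
    n = len(paired_logps)
    mi = sum(lp - lm for lp, lm in zip(paired_logps, marginal_logps)) / n
    return mi / n_tokens
```

Higher values indicate that each compressed token carries more information about the original context, which the cited study found to track downstream accuracy.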
4. Empirical Results and Performance Trade-Offs
Comprehensive empirical studies demonstrate that well-designed agentic self-compression yields substantial token and cost reductions with minimal performance degradation.
- In AgentOCR, agents retained over 95% of text-based performance on ALFWorld (99.5%) and search QA (95.0%) while saving up to 55% of tokens at moderate compression. Further compression (higher compression factors) leads to rapid performance drop, especially in information-dense settings (Feng et al., 8 Jan 2026).
- Ablation studies show that too-frequent or overly aggressive compression (e.g., injecting the compression reward at every step) causes collapse in task success, as the agent greedily maximizes the compression reward; intermittent reward injection instead enables stable savings (~17% tokens) without loss in success (Feng et al., 8 Jan 2026).
- Focus agents achieved a 22.7% reduction in average tokens on SWE-bench Lite with identical task accuracy, and up to 57% per-instance savings, by issuing an average of 6.0 autonomous compressions per task (Verma, 12 Jan 2026).
- ACON reduced peak memory by 26–54% across agentic benchmarks, with compressed or distilled agents preserving at least 95% of the uncompressed baseline performance. Compressors distilled into smaller LMs retained 95% of task accuracy and cut module inference cost by 4–10× (Kang et al., 1 Oct 2025).
- Information-theoretic studies report that 7B Qwen-2.5 compressors are 1.6× more accurate, 4.6× more concise, and 5.5× higher in mutual information per token than 1.5B baselines, recovering up to 99% of the accuracy of much larger LMs at 26% of API costs (He et al., 25 Dec 2025).
| Method | Token Saving | Task Accuracy Preservation | Typical Compression Artifact |
|---|---|---|---|
| AgentOCR | 50–55% | >95% | Visual buffer downsampling |
| Focus | 23–57% | 100% | Summarized knowledge block |
| ACON | 26–54% | ≥95% | Abstractive natural-language summary |
| Info-Theoretic | up to 74% | 97–99% | Bit-efficient context, mutual info |
Careful reward design and explicit control of when and how to compress are critical; unmoderated compression incentives rapidly reduce performance (Feng et al., 8 Jan 2026).
5. System Architectures and Practical Integration
Agentic self-compression has been integrated into various architectures:
- AgentOCR frames history as a visual stack, allowing compression via adjustable image scaling, with compression factors emitted in agent action heads and integrated into the observation pipeline (Feng et al., 8 Jan 2026).
- Focus operates over textual histories, introducing tool-calls for phase delimiting and decontextualized summaries, with in-prompt persistent “knowledge” blocks to retain learned facts (Verma, 12 Jan 2026).
- Compressor–Predictor Systems assign a dedicated compression module (potentially a local LM) to process raw context into high-density summaries, consumed by downstream predictor LMs. Pairing strategies aim to maximize mutual information per token and minimize compute cost (He et al., 25 Dec 2025).
- ACON uses LLM-as-guideline-optimizer to iteratively refine the abstraction policy (compression guideline), alternating improvements to utility and parsimony. When pre-trained compressors are costly, knowledge distillation to smaller models is employed (Kang et al., 1 Oct 2025).
Common architectural features include (a) explicit agent control over compression, (b) persistent or periodically updated abstraction buffers, (c) auxiliary mechanisms for reward/penalty injection governing compression rate, and (d) monitoring of information density and token budgets as first-class system metrics.
6. Limitations, Extensions, and Considerations
Several constraints and open questions characterize current agentic self-compression research:
- Trade-offs: Compression may break attention caches (KV-cache), sporadically increasing compute. Aggressive compression trades off memory savings for the risk of information loss and task failure (Kang et al., 1 Oct 2025).
- Model Specificity: Many frameworks are currently evaluated on select LM families (e.g., GPT, Qwen), and generalization to all open-source LMs is not guaranteed (Kang et al., 1 Oct 2025).
- Feedback Loops: Excessive or misaligned compression incentives may promote myopic memory policies unless reward schemes are properly tuned or intermittent (Feng et al., 8 Jan 2026).
- Guideline and Policy Optimization: For prompt-guided compressor agents, abstraction guidelines may need ongoing tuning through meta-optimization, especially as domains change (Kang et al., 1 Oct 2025).
- Hardware Constraints: On-device compressor size is limited by resources (VRAM, quantization), necessitating judicious model family selection (He et al., 25 Dec 2025).
- Future Directions: Extensions include integration with retrieval-augmented mechanisms, RL-driven context pruning, and end-to-end cache compression; iterative and multi-modal self-compression present additional areas for advancement (Kang et al., 1 Oct 2025, He et al., 25 Dec 2025).
7. Significance and Impact
Agentic self-compression fundamentally enables long-horizon, cost-aware LLM agents to scale to real-world settings. By internalizing memory management, agents remain adaptive under tight resource budgets, support persistent workflows, and offer predictable cost–performance trade-offs across diverse tasks. The combination of direct policy control, information-theoretic optimization, and practical engineering (visual buffers, explicit tool invocation, prompt-guided update, distillation) provides a robust repertoire for scalable agentic system design. Theoretical and empirical evidence demonstrates that agentic self-compression maintains, and sometimes even enhances, downstream accuracy, all while yielding sublinear scaling of memory and API cost relative to context length (Feng et al., 8 Jan 2026, Verma, 12 Jan 2026, He et al., 25 Dec 2025, Kang et al., 1 Oct 2025).