Papers
Topics
Authors
Recent
Search
2000 character limit reached

Self-Compacting Language Model Agents

Published 22 Jun 2026 in cs.CL | (2606.23525v1)

Abstract: Long agent traces composed of chains of thought and tool calls accumulate stale content that anchor subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate it with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.

Summary

  • The paper introduces a training-free, rubric-gated compaction mechanism that adaptively compresses context to enhance reasoning accuracy.
  • It demonstrates significant performance gains in competition math and agentic search, with up to 18.1-point improvements and reduced token costs.
  • The approach stabilizes long-horizon, multi-turn reasoning by reducing per-question token use by 30–70% and mitigating context rot.

Self-Compacting LLM Agents: Rubric-Gated Adaptive Context Management

Motivation and Problem Statement

Scaling LMs to competitive math and agentic search has introduced significant context length challenges. As trajectories increase in length due to complex multi-step reasoning and chains of tool-use, so does the accumulation of stale or erroneous intermediate content—termed context rot. This context rot actively impairs subsequent generations due to anchoring on flawed prior state, as documented in competitive math and agentic search domains where one session can easily span tens of thousands to millions of tokens.

Current deployed and academic scaffolds primarily address compaction using rigid, content-agnostic thresholds (e.g., fixed token intervals), discarding context when exceeding a budget without regard to the model's reasoning state. Such methods can trigger compaction mid-derivation, erasing partial results critical for ongoing reasoning, or fire late, retaining and over-attending to context rot. This instability limits both reasoning reliability and cost efficiency.

SelfCompact Scaffold: Technical Solution

SelfCompact introduces an inference-time, training-free mechanism for adaptive context compaction, scaffolded via two components:

  1. Inline Compaction Tool: The LM is exposed to a summarization tool capable of condensing accumulated context (y1:ty_{1:t}) into a compressed representation y~\tilde{y}.
  2. Rubric-Gated Trigger: At periodic intervals, a lightweight rubric—instantiated as a user message—prompts the model to decide, via explicit self-judgment, on issuing a compaction trigger. The rubric is designed to fire only when a sub-task has resolved or the trajectory is converging, and to suppress during mid-derivation or when the model is stuck.

The probe and summarizer both append instructions to the chat-based context, leveraging KV-cache reuse to keep additional inference cost minimal. Both elements (the tool and the rubric) are necessary: ablation studies demonstrate that without rubric-gating, tool invocation is uneven and often poorly timed. Figure 1

Figure 1: Comparison of trajectory-compression strategies on a hard BrowseComp question.

Empirical Results: Quantitative Analysis

SelfCompact is evaluated on six benchmarks spanning multi-turn competition math and three open agentic search tasks. Seven open-weight models are tested, including Qwen3.5 variants, GLM-4.7-Flash, MiniMax-M2.5, and Mimo-V2-Flash.

Key results:

  • Competition Math: Under budget-matched comparison, SelfCompact achieves the strongest results on 11/12 benchmark-model pairs. On Qwen3.5-9B, it improves accuracy over a no-compaction baseline by up to 18.1 points.
  • Agentic Search: SelfCompact outperforms both fixed-interval and no-compaction baselines on BrowseComp-Plus and DeepSearchQA, boosting accuracy by +5 to +9 points and reducing per-question token cost by 30–70%.
  • Ablation: Removing the rubric collapses gains (e.g., GLM-4.7-Flash mean accuracy drops by 5.4 points), confirming that knowing when to compact is critical and cannot be left to spontaneous model choice.
  • Headroom Analysis: An oracle that skips summarization when the current answer is correct achieves further gains, indicating significant headroom for more sophisticated adaptive policies.

SelfCompact's probe often fires well before context length thresholds, distilling compaction decisions to semantically meaningful transition points rather than purely token-based triggers. Figure 2

Figure 2: Distribution of context length when a summary fires in BrowseComp Plus; SelfCompact triggers are adaptively spread, unlike the rigid fixed-interval policy.

Difficulty scaling is also addressed: SelfCompact matches or surpasses fixed-budget approaches on easy problems, but offers up to 20 percentage points gain on the hardest bins for all three evaluated models in BrowseComp Plus. Figure 3

Figure 3: Accuracy by per-question difficulty on BrowseComp Plus; SelfCompact shows superior accuracy on hardest problems.

Theoretical and System Implications

SelfCompact advances LM agent context management by demonstrating that meta-cognitive capabilities—here, strategic context compaction—can be supplied through scaffolding alone, without finetuning or additional external supervision. The rubric design isolates the temporal aspect of compaction, shifting the onus of state monitoring and execution from weight-level adaptation to prompt-level protocol.

This approach offers a robust, interpretable, and deployment-ready technique for untrained model agents, circumventing the need for RL-based policy distillation or model modifications.

In agentic and content-intensive open-weight settings, these findings imply that context management—a bottleneck in current scaling—can be made adaptive, cheap, and effective with properly crafted scaffolding, rather than by further parameter expansion.

Limitations and Future Directions

The current evaluation is limited to open-weight, non-frontier LMs. Closed and more advanced models (e.g., GPT-5.5, Claude Opus 4.7) may possess improved intrinsic metacognition, potentially reducing dependence on explicit rubrics. Additionally, no reinforcement learning or external verification is involved—making this method orthogonal and potentially complementary to approaches targeting what content to summarize, not just when.

Open research directions:

  • Incorporation of RL to jointly learn compaction timing and content within the rubric framework.
  • Extending rubric criteria to support finer granularity and multi-modal tool-use.
  • Generalization to even harder and longer-horizon domains, including scientific discovery agents and full-stack program synthesis.

Conclusion

SelfCompact demonstrates that lightweight, rubric-gated, inference-time context compaction scaffolds can robustly overcome the limitations of fixed-interval compression in LM agents. Across diverse models and tasks, it achieves superior accuracy at reduced cost, reframing context management as an emergent capability supplied at deployment, not encoded at training. The rubric paradigm sets a new standard for adaptive context management in LLM agents.

(2606.23525)

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

What this paper is about (the big picture)

This paper looks at a problem that happens when AI assistants think for a long time. As they write long chains of thoughts and make many tool calls (like web searches), they keep everything in one big “memory” called a context. Old, wrong, or no-longer-useful bits pile up and start to mislead the AI—a problem the authors call “context rot.” The paper introduces a simple, training-free way for the AI to clean up its own notes at the right moments so it stays focused and accurate. They call this method SelfCompact.

The main questions the authors ask

The authors focus on one core idea:

  • Can an AI notice when its own notes are going stale and clean them up—by itself—without retraining?

More specifically, they ask:

  • When should an AI summarize and shrink its running notes?
  • How can it do that without cutting off important steps mid-thought?
  • Can a simple set of rules help the AI choose good times to clean up?

How their method works (in everyday terms)

Imagine you’re solving a big puzzle and keeping a messy notebook. If you never clean it, your notebook gets huge and confusing. If you clean it on a fixed timer (say, every 10 minutes), you might erase something important in the middle of a step. The best time to tidy up is right after you finish a small sub-task—when your work has a natural stopping point.

SelfCompact gives the AI two things to make that happen:

  • A “summarize” tool: When used, the AI writes a short, clean summary of what matters so far and continues from that smaller summary instead of the messy full history.
  • A tiny checklist (a rubric): Every so often, the AI asks itself simple questions like “Did I just finish a sub-task?” “Is my search converging on an answer?” “Am I in the middle of a calculation?” If the checklist says “now is a good stopping point,” it summarizes; if not, it keeps going.

A few key details, explained simply:

  • The AI reuses its recent “thinking shortcuts” (called a cache) so asking the checklist and writing the summary is cheap.
  • The AI doesn’t need to be retrained. Both the judge (using the checklist) and the summarizer are the same AI you’re already using—just with good instructions.
  • Timing matters: compressing right after a sub-task prevents chopping off a calculation mid-step and keeps important facts that were already verified.

What they tested and how

The authors tried SelfCompact on two types of tasks:

  • Competition math problems: these require multi-step reasoning (like long scratch work).
  • Agentic web search: the AI browses and checks facts across many pages before answering.

They compared four strategies:

  • No compaction (keep everything, which gets messy and expensive).
  • Fixed-interval compaction (summarize at a fixed size or time, regardless of what’s happening).
  • Simple deletes (throw away lots of history on a fixed schedule).
  • SelfCompact (summarize only when the checklist says it’s a good time).

They measured:

  • Accuracy: How often the AI gets the correct answer.
  • Cost: How many tokens (and therefore money) it spends per question.

The main results (what they found and why it matters)

Here are the main takeaways:

  • SelfCompact improves accuracy while costing less.
    • On math problems, it beat “no compaction” by up to about 18 points and matched or beat fixed schedules in most cases.
    • On web search tasks, it added around 5–9 accuracy points while using 30–70% less cost per question.
  • Timing is everything.
    • Fixed schedules often summarize at bad moments (like mid-derivation), which can erase useful steps.
    • SelfCompact’s checklist tends to trigger summaries earlier and at more natural breakpoints (right after a sub-goal is done), keeping key facts and dropping stale or distracting text.
  • The checklist (rubric) is crucial.
    • Giving the AI just the summary tool (without the checklist) led to messy timing—sometimes summarizing too often or not at all.
    • Adding a short, clear checklist made the AI choose better moments to compress, which consistently improved results.
  • It helps most on harder problems.
    • The tougher the question (the longer the AI needs to think), the bigger the benefit from smart, well-timed cleanups.

Why this is important (the impact)

As AIs tackle longer, more complex tasks, they need to manage their own “notebooks” wisely. This paper shows that:

  • A simple, training-free scaffold (a tool + a tiny checklist) can give AIs a kind of “meta-cognition”—the ability to notice when their notes are getting messy and clean them up at the right time.
  • This makes them more accurate, faster, and cheaper to run.
  • It also means you don’t have to retrain or build a new model—just layer this method onto the AI you already use.

In short, SelfCompact is a practical way to keep long-thinking AIs sharp: summarize when it makes sense, not just when a timer goes off.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide immediate follow-up research:

  • Frontier-model generalization: Does SelfCompact still help with frontier LMs (e.g., GPT, Claude, Gemini) that may exhibit stronger metacognitive abilities? Run controlled comparisons to quantify incremental gains (or redundancy) of rubrics for such models.
  • Rubric portability and sensitivity: How robust are outcomes to rubric wording, length, and structure? Systematically vary phrasing across models and tasks to measure sensitivity and establish best practices or invariances.
  • Automatic rubric induction: Can rubrics be learned or induced (e.g., from traces, heuristics, or feedback) rather than hand-crafted per task, and do learned rubrics transfer across domains and models?
  • Probe cadence optimization: The method still probes at fixed intervals N. What is the optimal, event-driven or uncertainty-aware probe scheduling, and how do different N (or adaptive N) trade off accuracy vs. cost/latency?
  • Faithfulness of summaries: What fraction of summaries omit or distort needed facts? Introduce summary faithfulness audits (factuality/entailment checks, unit tests in code tasks) and quantify downstream error propagation.
  • Summary length control: How does performance vary with target summary length and compression ratio L/ℓ? Develop decoding constraints or reward shaping to hit length budgets without losing critical content.
  • “Hard reset” design choice: Is fully replacing history with the summary optimal? Compare against hybrid memory (summary + last-k turns, retrieval buffers, structured fact stores) and quantify retention vs. cost.
  • Oracle-approximation gating: The oracle analysis shows large headroom. Evaluate practical proxies (self-consistency, verifier models, calculators/unit tests, web re-checks) to gate compaction near-oracle quality.
  • Latency and systems overhead: Beyond token cost, what are the real serving impacts (latency, throughput, GPU memory) in production stacks with/without KV-cache reuse? Benchmark across providers that differ in cache semantics and billing.
  • Domain coverage: Validate on code agents (e.g., SWE-bench variants), multimodal tasks, planning/robotics, and long software-debug sessions where traces span millions of tokens; report failure modes per domain.
  • Multilingual and domain-shift robustness: Do English rubrics generalize to other languages or technical subdomains (law, medicine)? Test multilingual variants and domain-specific rubrics.
  • Misfire analysis: When does the rubric wrongly compress (false positives) or suppress (false negatives)? Build labeled datasets of “safe-to-compact” states and report precision/recall of rubric decisions.
  • Tool-noise resilience: How does noisy, delayed, or adversarial tool feedback affect the rubric’s ability to detect subtask closure vs. “stuck” states? Stress-test with injected tool errors.
  • Adversarial or erroneous traces: If the trace embeds confident but wrong reasoning, does the rubric prematurely “lock in” a bad summary? Evaluate robustness and design counter-checks (evidence re-verification).
  • Training-based baselines: Compare against SFT/RL methods that learn when/what to compact under matched compute and cost budgets; explore distilling rubric behavior into model weights.
  • Formal characterization: Provide theoretical or empirical conditions under which compaction improves vs. harms performance (e.g., subgoal separability, verification availability, stability under repeated summarization).
  • Operationalizing “closed reasoning units”: Make the notion precise (e.g., via program structure, proof steps, search tree nodes) and evaluate automatic detectors for closure events independent of free-form text cues.
  • Long-horizon stability: Quantify cumulative information loss and drift over many compaction cycles (100+), including “summary-of-summary” degradation and mitigation strategies (periodic re-grounding).
  • Scaling laws: Map accuracy and cost as functions of context length, problem difficulty, compression frequency, and summary length; derive actionable scaling curves and recommended operating points.
  • Controllability knobs: Expose and evaluate user-tunable risk/cost settings (strict vs. permissive rubric criteria, summary granularity) and their impact on accuracy, cost, and latency.
  • Privacy and safety: Assess how summaries handle sensitive content (PII, secrets); develop redaction-aware compaction and measure privacy leakage reduction vs. performance.
  • Reproducibility of cost claims: The paper uses a single-rate cost approximation and assumes KV reuse; report per-provider billing differences, cache hit rates, and reproducible scripts that compute exact costs.
  • Integration with KV eviction: Benchmark SelfCompact combined with token-level KV eviction/compression methods; identify complementary regimes and joint policies that dominate either alone.
  • Multi-agent settings: Study shared-memory compaction across collaborating agents (consistency of summaries, conflict resolution) and whether rubrics need to be agent-specific or centralized.
  • Evidence quoting reliability: The rubric asks for verbatim evidence from the trajectory; measure exact-match quote rates, partial-match robustness, and whether stricter quote verification improves decision quality.
  • Edge cases in tool use: Prevent compaction mid-tool-execution or mid-code block; design guards and unit tests ensuring syntactic/state integrity across compaction boundaries.
  • Streaming and UX: Explore interactive/streaming settings where compaction occurs mid-output, and characterize how to communicate summary transitions to users without disrupting comprehension.

Practical Applications

Practical Applications of “Self-Compacting LLM Agents”

Below are actionable, real-world applications enabled by the paper’s findings and scaffold (SelfCompact: a rubric-gated, training-free compaction tool that compacts at closed reasoning units, improving accuracy while reducing token cost). Each item notes sectors, possible tools/workflows, and feasibility assumptions.

Immediate Applications

These can be deployed now with existing LLMs and agent frameworks using prompt-level rubrics and a summarization tool, without fine-tuning.

  • Software engineering (developer copilots, code agents)
    • Use case: Long-horizon coding agents (e.g., SWE-bench–style tasks) summarize after resolving sub-tasks (e.g., after passing a unit test, completing a refactor, or verifying a hypothesis), suppressing summaries mid-derivation to avoid losing partial reasoning.
    • Tools/workflows:
    • A “SelfCompact middleware” for LangChain/AutoGen/LlamaIndex agents to probe every N tool calls and summarize on rubric-verdict=compress.
    • Git/CI integration: snapshot compacted, auditable summaries with links to test logs.
    • Assumptions/dependencies:
    • Tool-use capable LLMs with KV cache reuse; task-specific rubrics (e.g., “unit test passed”).
    • Summarizer accuracy sufficient not to drop active constraints or TODOs.
  • Web research and enterprise search assistants
    • Use case: Browse/search agents summarize after verifying a fact cluster or closing a lead, preserving verified facts and citations while discarding stale exploration to cut cost and reduce “context rot.”
    • Tools/workflows:
    • Browser agent plug-in that fires rubric judgments every K search calls; compaction writes a concise, cite-backed evidence set.
    • “Research memory” snapshots to resume efficiently across turns.
    • Assumptions/dependencies:
    • Access to external tools (browser, retriever), citation extraction; rubric tuned to “verified vs. open” claims.
  • Customer support and CX chatbots
    • Use case: Long support threads compact after resolving a ticket step (identity verified, issue diagnosed, workaround verified) while suppressing compaction during troubleshooting.
    • Tools/workflows:
    • Inline rubric: “Is a step resolved, or are we mid-troubleshoot?”; compact into a case note with next-step plan.
    • Assumptions/dependencies:
    • Accurate detection of resolution states; data retention and privacy compliance for compacted state.
  • Knowledge management and meetings (PMO/ops)
    • Use case: Meeting assistants or project-bot agents compact after each agenda item into action-item summaries, preserving decisions and owners while collapsing discussion.
    • Tools/workflows:
    • Agenda-aware rubric (“agenda item closed?”) triggered at set intervals; per-item summary log with due dates and dependencies.
    • Assumptions/dependencies:
    • Agenda segmentation detection or calendar metadata; human-in-the-loop review for critical decisions.
  • Education and tutoring (math, STEM)
    • Use case: Tutors compact only after completing a sub-proof or solved sub-problem, preventing mid-derivation truncation that confuses learners and the model.
    • Tools/workflows:
    • Rubric for “closed reasoning unit” (proof step completed, sub-answer verified); summary with explicit formulas and intermediate results.
    • Assumptions/dependencies:
    • Task-specific math rubrics (as in the paper); guardrails to avoid hallucinated derivations.
  • Retrieval-augmented generation (RAG) and “conversation memory” modules
    • Use case: Long conversations or multi-document reading compress into verified, query-specific memory chunks; reduces prompt size while improving recall.
    • Tools/workflows:
    • Memory controller that queries a rubric (e.g., “is the current hypothesis supported by cited sources?”) to compact into a structured memory store.
    • Assumptions/dependencies:
    • RAG pipeline compatibility; citation tracking; stable summaries outperform raw-history retention.
  • Finance and analyst assistants
    • Use case: Compact after closing an analysis phase (data import validated, KPIs computed, anomaly triaged), maintaining auditable snapshots with formulas and source references.
    • Tools/workflows:
    • “Analysis phase rubric” and snapshot ledger of compacted states; cost-efficient scenario exploration.
    • Assumptions/dependencies:
    • Audit requirements (retain quotes/links); data governance; summary fidelity.
  • Legal review, e‑discovery, and compliance
    • Use case: Summarize after resolving sub-issues (e.g., establishing a fact, completing a statutory factor analysis) with citation-labeled facts; suppress mid-argument compactions.
    • Tools/workflows:
    • Rubric that demands quoted evidence snippets for each retained fact; compaction pipeline produces an evidence matrix.
    • Assumptions/dependencies:
    • High-precision citation extraction; legal safety review; data access controls.
  • Contact centers and field service workflows
    • Use case: Agents handling multi-step diagnosis compact after each verified step (device checked, configuration confirmed, symptom reproduced) to reduce cost and accelerate next steps.
    • Tools/workflows:
    • Rubric integrated into ticketing systems (e.g., ServiceNow) to auto-update case summaries.
    • Assumptions/dependencies:
    • Robust detection of “step complete”; seamless CRM integration.
  • AgentOps/MLOps cost control and observability
    • Use case: Introduce a “context-rot guardrail” that probes rubric every N turns; track compaction rate, tokens saved, and answer deltas.
    • Tools/workflows:
    • SDK for token accounting and KV-cache reuse; dashboards for cost/performance trade-offs.
    • Assumptions/dependencies:
    • Providers with cache-friendly billing and API semantics; ops instrumentation.
  • Personal assistants and daily productivity
    • Use case: Long email threads or travel planning sessions compress after each sub-decision (dates fixed, budget accepted), keeping the assistant fast and coherent.
    • Tools/workflows:
    • Rubric tuned to task milestones; snapshot timeline with decisions and open questions.
    • Assumptions/dependencies:
    • Calendar/email access; safeguards to prevent premature compaction when preferences are unsettled.

Long-Term Applications

These require further research, validation, scaling, or ecosystem changes.

  • Safety-critical clinical decision support (healthcare)
    • Use case: Compaction only after clinically “closed” assessments (lab interpretation finalized, differential narrowed), with verifiable references; suppress mid-diagnostic steps.
    • Potential tools/workflows:
    • Clinically validated rubrics; dual-model verification (summarizer + verifier); EHR integration with audit trails.
    • Assumptions/dependencies:
    • Regulatory approval, rigorous trials, bias/recall assessments; human oversight; robust citation tracking on clinical evidence.
  • Autonomous robotics and task planning
    • Use case: High-level LLM planner compacts when waypoints/subgoals complete, reducing memory load and plan drift during long missions.
    • Potential tools/workflows:
    • Planner–controller loop with rubric gating; summaries translated into formal subgoal graphs.
    • Assumptions/dependencies:
    • Reliable subgoal completion signals; tight integration with symbolic/optimization planners.
  • Scientific agents and automated research
    • Use case: Compaction after closing experiment phases (hypothesis set, protocol fixed, results verified), preserving provenance and versioned context across hundreds of trials.
    • Potential tools/workflows:
    • Lab notebook auto-compaction with evidence links; experiment orchestration systems that resume from compacted states.
    • Assumptions/dependencies:
    • Structured data logging; reproducibility standards; multi-agent coordination.
  • Enterprise multi-agent memory and knowledge curation
    • Use case: Organization-wide agents compact and exchange closed-unit summaries, maintaining lean, auditable cross-team context.
    • Potential tools/workflows:
    • Memory routers that accept only rubric-verified summaries; governance policies for evidence-backed memory ingestion.
    • Assumptions/dependencies:
    • Standardized summary schemas, access control, lineage tracking.
  • On-device/edge agents and low-resource deployments
    • Use case: Compaction to extend effective horizons on devices with small contexts or limited compute (call center kiosks, mobile assistants).
    • Potential tools/workflows:
    • Local KV caching and rubric gating; hybrid on-device/cloud fallback.
    • Assumptions/dependencies:
    • Efficient caching APIs on-device; compressed models; resilience to connectivity.
  • Training-time distillation of “when-to-compact”
    • Use case: Use the rubric as a behavioral target for RL/SFT so models internalize compaction timing, reducing inference overhead for probes.
    • Potential tools/workflows:
    • Offline logs with rubric labels; policy distillation into base weights.
    • Assumptions/dependencies:
    • Quality of supervision; avoidance of overfitting to narrow rubrics; safety/robustness guarantees.
  • Standards and policy for context management in public-sector AI
    • Use case: Procurement and governance frameworks mandate rubric-gated compaction, logging of summaries with citations, and reporting of compaction-induced answer changes.
    • Potential tools/workflows:
    • Compliance checklists; audit APIs for summary snapshots and evidence.
    • Assumptions/dependencies:
    • Interoperable specs; regulator and vendor alignment; privacy constraints.
  • Cost-aware, dynamic compaction controllers
    • Use case: Controllers that trade off accuracy vs. spend in real time, adapting probe intervals and rubric strictness to budget and task difficulty.
    • Potential tools/workflows:
    • Token-cost forecasters; difficulty estimators (as in the paper’s binning analysis) to modulate compaction policy.
    • Assumptions/dependencies:
    • Stable pricing signals; model introspection reliability on difficulty.
  • Cross-modal planning (vision, code, language)
    • Use case: Vision-language-code agents summarize closed units (e.g., a verified perception hypothesis, a tested code patch) before switching modalities.
    • Potential tools/workflows:
    • Modality-aware rubrics; structured summaries linking images/snippets to claims.
    • Assumptions/dependencies:
    • Robust cross-modal grounding; verifiers per modality.

Notes on Feasibility and Dependencies (common across applications)

  • Task-specific rubrics are essential: the paper shows the tool alone is uneven; simple, well-designed guidance closes the “meta-cognitive gap.”
  • Provider/API capabilities:
    • KV cache reuse across appended messages is crucial for cost benefits.
    • Some APIs bill caches differently; savings assume cache-friendly pricing.
  • Model behavior:
    • No fine-tuning is required, but model must follow tool instructions reliably.
    • Summary fidelity matters; hallucinated or overly aggressive compaction can harm downstream steps.
  • Auditing and safety:
    • For regulated domains, retain snapshots with verbatim quotes/citations; require human oversight.
  • Generalization:
    • Results are demonstrated on open-weight and several deployed “Flash” agents; behavior may vary on frontier models (which may have stronger metacognition), though the scaffold remains complementary.

By adopting rubric-gated, training-free compaction, organizations can deploy longer-horizon agents that are both cheaper and more accurate, especially on complex tasks where context rot and poorly-timed summaries are most damaging.

Glossary

  • Ablation: The experimental removal or modification of system components to assess their contribution. "Ablations find that the rubric is crucial for effective self-compaction"
  • Agentic search: An LLM-driven procedure that performs tool-based web/search actions to answer queries. "on agentic search (BrowseComp, BrowseComp-Plus, DeepSearch QA), it adds 5--9 points"
  • Autoregressive generation: Token-by-token generation where each token is conditioned on the prompt and all previous tokens. "generates a continuation y=(y1,y2,)y = (y_1, y_2, \ldots) autoregressively"
  • Compaction tool: A model-invoked mechanism that summarizes accumulated context during inference. "a compaction tool the model invokes to summarize the accumulated context"
  • Context compaction: Summarizing or condensing the dialogue/history to control length and mitigate degradation. "context compaction has become a standard built-in feature to prevent multi-rounds of long reasoning chains from blowing up the context window."
  • Context rot: Performance degradation caused by stale or erroneous prior content anchoring subsequent generations. "This phenomenon is known as context rot"
  • Context window: The maximum number of tokens a model can condition on at once. "eventually outgrow the context window."
  • Fine-tuning: Post-training adaptation of a model on task-specific data to refine behavior. "without any fine-tuning or external supervision."
  • Fixed-interval compaction: Triggering compaction based solely on a fixed token/turn threshold. "fixed-interval compaction triggered at a token threshold."
  • Frontier models: The most capable, often proprietary, state-of-the-art LLMs. "Context compaction in frontier models."
  • Hard reset: Replacing the existing context with its summary and resuming generation from that condensed state. "hard reset; resume from summary"
  • Inference-time: Operations performed during generation (not training), often controlled via prompts/tools. "pairs two inference-time elements"
  • KV cache: Cached key/value attention states that allow efficient reuse of prior computations. "To maximize KV-cache reuse, we implement S\mathcal{S}"
  • KV-cache eviction: Removing or compressing entries in the key/value cache during inference to save memory/compute. "evicting or compressing KV cache entries during inference."
  • Long-horizon tasks: Problems requiring many steps, turns, or extended reasoning/search over large contexts. "LM agents on long-horizon tasks accumulate tokens in a single rolling context"
  • Meta-cognitive gap: A shortfall in a model’s ability to monitor and control its own reasoning/context quality. "Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap"
  • Open-weight models: Models whose parameters are publicly available for local inference and customization. "Across seven open-weight models"
  • Oracle policy: A hypothetical ideal decision rule used to estimate the upper bound of possible performance. "an oracle policy which suppresses summarization calls whenever the current answer is correct"
  • Periodic compaction: Compaction that fires at regular intervals irrespective of trajectory content. "Periodic compaction fires on a fixed interval---every kk turns or every kk tokens---regardless of what the trajectory contains"
  • Prefill: The initial forward pass to encode prompt/context into the KV cache before decoding. "pays prefill only on the appended instruction"
  • Probe interval: The periodic step at which the rubric is consulted to decide whether to compact. "at periodic probe intervals"
  • Reactive compaction: Compaction triggered only when nearing the model’s context limit, as overflow prevention. "Reactive compaction triggers only when the rolling context approaches the model's token budget"
  • Rubric: A lightweight, explicit set of criteria guiding when to trigger or suppress compaction. "a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck)."
  • Scaffold: An external prompting/control structure that organizes a model’s behavior without modifying weights. "a scaffold that allows the model itself to decide when and how to compact."
  • Search–judge–summarize loop: An iterative cycle where the agent searches, evaluates, and summarizes before continuing. "on agentic search this yields a search--judge--summarize--search loop"
  • SelfCompact: The proposed rubric-gated, training-free method for adaptive context compaction. "We propose {SelfCompact, a scaffold that pairs two inference-time elements"
  • Summarizer: The model-invoked instruction/component that produces a condensed version of the trajectory. "The summarizer is the only real overhead"
  • Token budget: The allocation/limit of tokens per question or trajectory used for fair comparison or cost control. "a token budget matched to fixed-interval summarization"
  • Tool call: An invocation of an external capability (e.g., web search, code execution) within an agent trajectory. "chains of thought and tool calls"
  • Trajectory: The evolving sequence of thoughts, actions, tool outputs, and summaries generated during problem solving. "the trajectories LMs generate to solve them keep growing."

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 3 tweets with 47 likes about this paper.