- The paper introduces a novel asynchronous compaction method that validates agent summaries using future trajectory actions.
- It employs a trajectory-grounded judge to perform statement-level and plan-level checks, ensuring critical information is maintained.
- Empirical results on coding and web-browsing tasks show improvements of up to 8.8 percentage points in task success and latency reductions of up to 39.7%.
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Contemporary long-horizon LLM-based agents face the problem of context rot: as the history of observations, tool calls, and internal reasoning accumulates, even models with large context windows show significant accuracy degradation. The usual mitigation is context compaction, the periodic LLM-guided summarization of the ongoing agent trajectory, but compaction is hard to validate. Synchronous compaction, the dominant paradigm, pauses agent execution and commits directly to a summary, which introduces two deficiencies: (1) elevated latency due to blocking, and (2) silent accuracy regressions, since the compactor cannot see how the trajectory will evolve and may omit facts needed to complete the task. No validation signal is available at compaction time because all post-compaction actions are already conditioned on the summary and so cannot serve as an independent check of its sufficiency.
Slipstream System Design
Slipstream introduces an execution model that addresses both the efficiency and validation gaps of synchronous compaction. Compaction is executed asynchronously: the agent continues reasoning on the uncompacted trajectory while a parallel thread compacts the context. When the compactor produces a candidate summary, Slipstream employs a trajectory-grounded judge which compares the candidate against the next-k agent actions emerging during the async window. This comparison leverages two axes:
- Statement-level check: Validates preservation of all concrete facts, constraints, tool outputs, and subgoal artifacts actually referenced or consumed in next-k steps.
- Plan-level check: Assesses whether the agent's evolving high-level intent, as instantiated in upcoming actions, is faithfully supported by the candidate summary.
The judge either accepts the candidate or, if it detects a deviation, issues a targeted diagnosis that triggers a patch to the summary rather than a fallback to synchronous compaction. Notably, most compaction errors surface within the first few agent steps (88-100% within three steps across both the coding and web-browsing benchmarks), so the asynchronously overlapped window is long enough for reliable validation.
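The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `compact`, `agent_step`, `judge`, `patch`, and `slipstream_compaction` are hypothetical names, the LLM calls are replaced by toy string operations, and only a statement-level check is modeled (the plan-level check is omitted). The sketch also simplifies the timing by collecting exactly k actions before reading the compactor's result.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Verdict:
    accepted: bool
    # Diagnosis: facts the next-k actions needed but the summary dropped.
    missing: list[str] = field(default_factory=list)


async def compact(history: list[str]) -> str:
    """Hypothetical compactor standing in for an LLM summarization call."""
    await asyncio.sleep(0.01)  # simulated summarization latency
    # Deliberately lossy: keeps only the first and last history entries.
    return " | ".join([history[0], history[-1]])


async def agent_step(history: list[str], step: int) -> str:
    """Hypothetical agent: each action references one fact from the history."""
    await asyncio.sleep(0.005)  # simulated reasoning latency
    return f"use {history[step % len(history)]}"


def judge(summary: str, next_actions: list[str], history: list[str]) -> Verdict:
    """Toy statement-level check: every fact referenced by a next-k action
    must still appear in the candidate summary."""
    referenced = [f for f in history if any(f in a for a in next_actions)]
    missing = [f for f in referenced if f not in summary]
    return Verdict(accepted=not missing, missing=missing)


def patch(summary: str, missing: list[str]) -> str:
    """Targeted patch: re-add only the facts named in the diagnosis."""
    return " | ".join([summary, *missing])


async def slipstream_compaction(history: list[str], k: int = 3) -> tuple[str, list[str]]:
    snapshot = list(history)
    # Start compaction in the background; the agent is not blocked.
    compaction = asyncio.create_task(compact(snapshot))
    # The agent keeps acting on the uncompacted trajectory.
    next_actions = []
    for step in range(k):
        next_actions.append(await agent_step(snapshot, step))
    candidate = await compaction
    # Validate the candidate against the held-out next-k actions.
    verdict = judge(candidate, next_actions, snapshot)
    if not verdict.accepted:
        candidate = patch(candidate, verdict.missing)
    return candidate, next_actions
```

Calling `asyncio.run(slipstream_compaction(["fact_a", "fact_b", "fact_c", "fact_d"]))` exercises the rejection path: the toy compactor drops two facts that the next three actions reference, and the patch restores only those two instead of redoing the compaction.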
Empirical Results
Slipstream is systematically evaluated on two canonical long-horizon agent workloads:
- SWE-bench Verified: Multi-step code modification and debugging.
- BrowseComp: Web-browsing-based open-domain information gathering.
Using Qwen3.5-9B and Seed-OSS-36B-Instruct, Slipstream achieves:
- Task success rate improvements of up to 8.8 percentage points versus synchronous compaction.
- End-to-end per-query latency reductions of up to 39.7%.
These accuracy improvements are robust to compaction thresholds and system configuration, and they stem specifically from the trajectory-grounded validation rather than from overlapping computation alone: Async-only baselines obtain the latency wins but not the accuracy gains.
Additional experiments demonstrate that the judge model need not be as capable as the base agent: smaller models used as the judge retain most of the accuracy gains, broadening the practical deployment possibilities.
Theoretical and Practical Implications
Slipstream demonstrates that asynchrony is not merely a systems-level optimization but a necessary enabler for behavioral validation in agentic context management. By decoupling summary generation from agent progression, it creates a held-out behavioral trajectory that can be used to test summary sufficiency ex ante. Synchronous compaction forecloses this possibility—an issue common not only to summarization but to other forms of critical-path agent state mutation (e.g., tool clearing, external memory swap-in).
From a practical perspective, Slipstream’s validation method directly increases agent reliability, particularly in highly lossy or information-dense domains where traditional compaction fails silently. The system also uses modern server hardware efficiently: because agent serving is typically memory-bound rather than compute-bound, there is enough headroom to run compaction and judgment in parallel without incurring resource contention overhead. Shared-prefix KV-cache inference further reduces the memory overhead.
Contradictory and Strong Claims
Slipstream’s core assertion is that asynchronous scheduling is necessary to enable proper compaction validation for long-horizon agents, and that pairing it with a trajectory-grounded judge is sufficient in practice. The experiments show that validation, not mere parallelism, drives the accuracy improvements: the Async-only ablation does not achieve such gains. The authors also show empirically that compaction errors nearly always manifest within three agent steps, a strong claim that supports the feasibility of their approach.
Limitations and Future Directions
While Slipstream captures almost all compaction-induced failures within its next-k validation window, rare errors propagating outside this window remain undetected, matching the behavior of synchronous compaction. This inherent locality of compaction-induced errors is workload-dependent, and the approach may require further extension for workloads with longer error manifestation tails.
Broader implications include the extension of trajectory-grounded validation to other auxiliary agent tasks beyond compaction (e.g., structured memory swaps, agent state rollbacks). Future work might explore dynamic next-k window scaling based on workload-specific error profiles, as well as integration with more sophisticated targeted update mechanisms following judge rejections.
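One simple way to picture window scaling from an error profile is to pick the smallest k that covers a target fraction of observed errors. The sketch below is an illustration only, not something the paper describes: the function name `smallest_window`, the `coverage` parameter, and the sample step counts are all hypothetical.

```python
import math


def smallest_window(manifest_steps: list[int], coverage: float) -> int:
    """Return the smallest k such that at least `coverage` of the observed
    compaction errors surfaced within k agent steps after compaction."""
    if not manifest_steps:
        raise ValueError("need at least one observed error")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(manifest_steps)
    # Number of errors that must fall inside the window.
    needed = math.ceil(coverage * len(ordered))
    return ordered[needed - 1]


# Hypothetical profile: the step at which each logged error first surfaced.
profile = [1, 1, 1, 2, 2, 3, 3, 3, 5, 9]
half_window = smallest_window(profile, 0.5)
full_window = smallest_window(profile, 1.0)
```

A workload with a long tail, like the last two entries of this made-up profile, would need a much larger window for full coverage than for partial coverage, which is the trade-off the limitation above points to.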
Conclusion
Slipstream formalizes a trajectory-grounded approach to compaction validation, using asynchronous execution to validate summary sufficiency independently against held-out future agent actions. The method simultaneously reduces agent latency and improves downstream task accuracy over prevalent synchronous approaches. The underlying principle, using asynchrony to expose validation opportunities unavailable to synchronous control flow, suggests broader relevance for systems that manage working state in long-horizon agentic inference (2605.08580).