Long-Horizon Agent Trajectories
- Long-horizon agent trajectories are extended sequences of agent-environment interactions that span dozens to thousands of steps, involving complex multi-modal data and long-term dependencies.
- Advanced memory architectures such as hierarchical, optical, and UID-based schemes are developed to compress and manage context beyond conventional token limits.
- Benchmarks and learning paradigms like multi-turn RL and hierarchical control frameworks are used to assess planning accuracy and mitigate error accumulation over long durations.
A long-horizon agent trajectory is a sequence of agent-environment interactions that unfolds over a substantial number of steps (often dozens to thousands), frequently spanning multiple modalities, tools, and long-term dependencies. Such trajectories arise in settings where the agent must reason, plan, or act over extended temporal contexts—examples include productivity workflows, multi-stage search, embodied navigation, code synthesis, or complex reasoning chains. The principal challenges stem from context length limits, compounding errors, long-term evidence preservation, and the need for structured memory. Recent research has introduced new frameworks and benchmarks explicitly designed to analyze, model, and improve agent performance over long horizons, with diverse approaches spanning learning paradigms, memory architectures, and trajectory representations.
1. Formal Representations and Structural Properties
Formally, a long-horizon agent trajectory is a sequence of perception-action (or observation-action) pairs
$$\tau = \{(s_t, a_t)\}_{t=1}^{T},$$
where $T$ is the horizon length, $s_t$ is the agent’s state or context at time $t$, and $a_t$ is the action (which may include tool invocations, reasoning steps, or environment interactions) (Du et al., 14 Apr 2026, Zhao et al., 26 Feb 2026). In multimodal and tool-augmented scenarios, $s_t$ typically incorporates:
- Original user query
- History of textual observations
- A set of external visual or data assets referenced by lightweight identifiers (e.g., UIDs mapping to files or images)
- Environment or tool responses
Trajectory representation is further enriched with distinctions between planned versus reactive actions, stage or goal annotations, and externally-referenced memory (such as file proxies or symbolic keys).
The abstraction used for the state $s_t$ depends on the domain and system design:
- Structured text, code blocks, JSON/XML states, graphical elements, or perception feature maps (Yang et al., 9 Oct 2025, Ghoul et al., 2023).
- Multimodal data involving images or other content, referenced as UIDs to external assets to minimize context size (Du et al., 14 Apr 2026).
Actions $a_t$ may include:
- Internal language generation (“reasoning”)
- Tool or API invocation
- Final answer or conclusion
- Navigation/movement commands in RL/embodied scenarios (Hu et al., 12 Feb 2026)
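The state and action components listed above can be sketched as a small data structure. This is only an illustrative layout: the class names (`State`, `Action`, `ActionKind`, `Trajectory`) and field names are assumptions made for this example and do not come from any of the cited systems.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class ActionKind(Enum):
    REASONING = "reasoning"        # internal language generation
    TOOL_CALL = "tool_call"        # tool or API invocation
    FINAL_ANSWER = "final_answer"  # final answer or conclusion
    NAVIGATION = "navigation"      # movement command in embodied settings


@dataclass
class State:
    query: str                                              # original user query
    observations: List[str] = field(default_factory=list)   # textual history
    asset_uids: List[str] = field(default_factory=list)     # pointers to external assets
    tool_responses: List[str] = field(default_factory=list)  # environment/tool output


@dataclass
class Action:
    kind: ActionKind
    payload: str


@dataclass
class Trajectory:
    steps: List[Tuple[State, Action]] = field(default_factory=list)

    def append(self, state: State, action: Action) -> None:
        self.steps.append((state, action))

    def horizon(self) -> int:
        """Number of (state, action) pairs recorded so far."""
        return len(self.steps)

    def count(self, kind: ActionKind) -> int:
        """Number of recorded actions of the given kind."""
        return sum(1 for _, action in self.steps if action.kind is kind)


# Hypothetical three-step trajectory over a single image asset.
traj = Trajectory()
s0 = State(query="Which chart shows the largest drop?", asset_uids=["img-001"])
traj.append(s0, Action(ActionKind.REASONING, "Inspect img-001 first."))
traj.append(s0, Action(ActionKind.TOOL_CALL, "load_asset(img-001)"))
traj.append(s0, Action(ActionKind.FINAL_ANSWER, "The second chart."))
print(traj.horizon(), traj.count(ActionKind.TOOL_CALL))
```

Keeping the action kind explicit makes it straightforward to compute per-trajectory statistics such as tool-call counts, which several of the benchmarks discussed later report.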
2. Memory Architectures and Trajectory Compression
A defining challenge for long-horizon trajectories is surpassing the token-window and working memory available to current LLMs and agentic architectures. Key innovations include:
- File-based externalization and UID proxies: Visual or data artifacts are stored externally and referenced through short, cheap-to-contextualize tokens (UIDs) that act as pointers, with loading deferred until the asset is needed for reasoning (Du et al., 14 Apr 2026). This reduces the in-context token burden from $\mathcal{O}(\text{pixels})$ per image to $\mathcal{O}(1)$.
- Optical memory encoding: Trajectories are rendered as images (“screenshots”) annotated with visual indices and bounding boxes (Set-of-Mark), then retrieved by optical models that locate and transcribe only the relevant segments, compressing the memory bandwidth required (Li et al., 29 Apr 2026).
- Hierarchical and hybrid memory: Self-evolving experience modules leverage strategic, procedural, and tool memories, each with distinct update and retrieval policies (e.g., MUSE’s multi-tier memory bank) (Yang et al., 9 Oct 2025). Causality graphs (nodes as state snapshots, edges as causal or association links) enable traversable, tool-augmented retrieval pertinent to long-horizon queries (Zhao et al., 26 Feb 2026).
- Trajectory-splitting and curriculum-based SFT/RL: Long traces are chunked into overlapping sub-trajectories, with essential prefix context preserved and local overlap ensuring continuity. Progressive extension of episode timeouts in RL further scaffolds long-horizon competencies (Liu et al., 19 Feb 2026).
- Dual (neuro-symbolic) memory banks: Decoupling fuzzy semantic progress guidance from strict logical feasibility validation, with neural blueprint memories and symbolic, executable rule validators acting in synchronous control (Wen et al., 3 Apr 2026).
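As an illustration of the UID-proxy idea in the first item of this list, the following minimal sketch keeps large assets in an external store and places only a short identifier in the context. The `AssetStore` class, its methods, and the `uid-` prefix are hypothetical names chosen for this example; it is not the implementation used by Du et al.

```python
import hashlib
from typing import Dict, List


class AssetStore:
    """External store: the context holds short UIDs, never the raw bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.loads = 0  # counts how often an asset was actually materialized

    def put(self, data: bytes) -> str:
        # Derive a short, stable identifier from the content hash.
        uid = "uid-" + hashlib.sha256(data).hexdigest()[:8]
        self._blobs[uid] = data
        return uid

    def load(self, uid: str) -> bytes:
        # Deferred, on-demand loading: only called when reasoning needs the asset.
        self.loads += 1
        return self._blobs[uid]


store = AssetStore()
image = bytes(200_000)  # stand-in for a large image
uid = store.put(image)

# The context carries only the pointer, whose size does not depend on the image.
context: List[str] = [f"user uploaded image <{uid}>"]
print(len(uid), len(image), store.loads)
```

The design choice is that the context cost of an asset is the length of its identifier, while the full content is fetched only at the step that needs it.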
These memory solutions address both context budget limitations and the preservation of crucial information over extremely long trajectories.
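The trajectory-splitting strategy described above (overlapping sub-trajectories with a preserved prefix) can be sketched as follows. The function name `split_trajectory` and its parameters `prefix_len`, `window`, and `overlap` are assumptions for illustration; the cited work may chunk traces differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_trajectory(
    steps: Sequence[T], prefix_len: int, window: int, overlap: int
) -> List[List[T]]:
    """Split a long trace into chunks that all share the same prefix.

    Consecutive chunks share `overlap` body steps so that local context
    carries over from one sub-trajectory to the next.
    """
    if not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < window")
    prefix = list(steps[:prefix_len])  # essential context kept in every chunk
    body = list(steps[prefix_len:])
    stride = window - overlap
    chunks: List[List[T]] = []
    start = 0
    while True:
        chunks.append(prefix + body[start:start + window])
        if start + window >= len(body):
            break
        start += stride
    return chunks


# Ten steps: two prefix steps, then windows of four body steps overlapping by one.
chunks = split_trajectory(list(range(10)), prefix_len=2, window=4, overlap=1)
for chunk in chunks:
    print(chunk)
# [0, 1, 2, 3, 4, 5]
# [0, 1, 5, 6, 7, 8]
# [0, 1, 8, 9]
```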
3. Data Synthesis, Benchmark Construction, and Task Generation
Evaluating and training long-horizon agents necessitates benchmarks that stress planning, memory, and reasoning across large time and interaction scales. Approaches include:
- Synthesis pipelines for cross-modal/multi-hop queries: Construction of multi-hop task graphs with enforced path lengths and forced dependencies (e.g., 40% of edges must require visual nodes) to generate trajectories lasting 30–100+ steps (Du et al., 14 Apr 2026).
- Synthetic environment instantiation: Creation of realistic “synthetic computers” and productivity scenarios with thousands of folder structures, documents, and deliverable objectives, enabling agents to interact in over 2,000 turns per simulation (8+ hours wall time) (Ge et al., 30 Apr 2026).
- Pull Request (PR) chains as supervision structure: Mining authentic software evolution via PR chains yields natural decomposition, long-term consistency, and traceable refinement across multi-stage agent rollouts, resulting in trajectories with 685k tokens and over 100 tool calls (Jiang et al., 2 Feb 2026).
- Delayed-trigger and risk protocols: Benchmarks such as ATBench construct long-horizon safety studies via “setup” and “exploit” phases, isolating when risk is introduced and triggered across many steps (Li et al., 2 Apr 2026).
- Ultra-long horizon exploration: UltraHorizon systematically scales trajectory lengths to 35k–200k tokens and 60–400+ tool calls per trajectory, analyzing domains where humans outperform LLM agents by a substantial margin (Luo et al., 26 Sep 2025).
These data generation methodologies permit targeted stress-testing, as well as precise tracking of error propagation, memory failures, and constraint drift.
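To make the graph-based synthesis constraints concrete, the sketch below checks a candidate task chain against a minimum hop count and a minimum fraction of edges that touch a visual node. The helper names and the interpretation of "requires a visual node" (at least one endpoint is visual) are assumptions for this example, not the filtering rule of the cited pipeline.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def visual_edge_fraction(edges: List[Edge], modality: Dict[str, str]) -> float:
    """Fraction of edges with at least one visual endpoint (assumed criterion)."""
    if not edges:
        return 0.0
    visual = sum(1 for u, v in edges if "visual" in (modality[u], modality[v]))
    return visual / len(edges)


def chain_length(edges: List[Edge], start: str) -> int:
    """Hops along a simple chain from `start` (at most one successor per node)."""
    successor = dict(edges)
    hops, node, seen = 0, start, {start}
    while node in successor and successor[node] not in seen:
        node = successor[node]
        seen.add(node)
        hops += 1
    return hops


def accept_task(
    edges: List[Edge],
    modality: Dict[str, str],
    start: str,
    min_hops: int,
    min_visual_fraction: float = 0.4,
) -> bool:
    """Keep a synthesized task only if it is long and visual enough."""
    return (
        chain_length(edges, start) >= min_hops
        and visual_edge_fraction(edges, modality) >= min_visual_fraction
    )


# Hypothetical five-hop chain with two visual nodes.
modality = {"q": "text", "a": "visual", "b": "text",
            "c": "text", "d": "visual", "ans": "text"}
edges = [("q", "a"), ("a", "b"), ("b", "c"), ("c", "d"), ("d", "ans")]
print(chain_length(edges, "q"), visual_edge_fraction(edges, modality))
print(accept_task(edges, modality, "q", min_hops=4))
```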
4. Planning, Control, and Learning Paradigms
Agentic reasoning over long horizons requires explicit support in planning and learning architectures:
- Hierarchical control and multiscale planning: HM-Diffuser uses recursively nested diffusion models at multiple temporal scales, progressively generating subgoal plans and recursively refining to granular trajectories. Progressive Trajectory Extension (PTE) grows trajectory lengths well beyond the training distribution (Chen et al., 25 Mar 2025).
- Multi-turn RL and horizon-adaptive optimization: LongNav-R1 reformulates navigation as a multi-turn “conversation” with the environment, leveraging a horizon-adaptive policy optimization (HAPO) that dynamically estimates advantages based on step position in the overall horizon (Hu et al., 12 Feb 2026).
- Self-reflection and experience-driven evolution: MUSE’s “Plan-Execute-Reflect-Memorize” loop extracts sub-task experience and integrates it iteratively into hierarchical procedural and strategic memory, enabling measured self-improvement across repeated exposures (Yang et al., 9 Oct 2025).
- Supervision control and asynchronous human-in-the-loop: Frameworks such as Apollo enable sparse but highly targeted human interventions, with symbolic and LLM-based action-level filtering, summarized context, and efficient “patch” data annotation for training in domain-specialized, multi-hour rollouts (Fu et al., 31 Oct 2025).
Many of these approaches supplement classical end-to-end RL or SFT with structure-aware planning/constraint modules to support long-term coherence.
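The Plan-Execute-Reflect-Memorize loop mentioned above can be illustrated with a toy example. Everything here is hypothetical: `run_episode`, `toy_execute`, and the string-valued memory are stand-ins meant to show the control flow, not MUSE's actual memory design or update policy.

```python
from typing import Callable, Dict, List


def run_episode(
    task: List[str],
    memory: Dict[str, str],
    execute: Callable[[str, str], bool],
) -> int:
    """One Plan-Execute-Reflect-Memorize pass; returns successful sub-task count."""
    plan = list(task)  # Plan: in this toy, the plan is the given sub-task list
    successes = 0
    for sub_task in plan:
        hint = memory.get(sub_task, "")  # retrieve prior experience, if any
        if execute(sub_task, hint):      # Execute
            successes += 1
        else:
            # Reflect and Memorize: record a note that later episodes can reuse.
            memory[sub_task] = f"retry {sub_task} with validated inputs"
    return successes


def toy_execute(sub_task: str, hint: str) -> bool:
    # Toy environment: "deploy" succeeds only when a stored hint is available.
    return sub_task != "deploy" or bool(hint)


memory: Dict[str, str] = {}
task = ["build", "test", "deploy"]
first = run_episode(task, memory, toy_execute)
second = run_episode(task, memory, toy_execute)
print(first, second)  # 2 3
```

The second pass succeeds on the sub-task that failed in the first pass because the reflection step wrote an entry into memory, which is the kind of improvement across repeated exposures that the loop is meant to provide.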
5. Empirical Results, Metrics, and Failure Modes
Agent performance and error analysis in the long-horizon regime rely on both aggregate task success and stepwise/failure-mode decomposition:
- Scaling performance: LMM-Searcher sustains accuracy gains as the horizon scales to 100 turns on both MM-BrowseComp and MMSearch-Plus (Du et al., 14 Apr 2026). KLong shows a +11.28 pp gain over the next-best open-source model on 12 h research reproduction (Liu et al., 19 Feb 2026).
- Memory fidelity and effectiveness: AMA-Agent, with its causality graph and tool-augmented retrieval, sustains accuracies of 57–58% at horizons up to 128k tokens, where baseline memory agents and RAG models collapse (Zhao et al., 26 Feb 2026). OCR-Memory’s optical encoding delivers compression with 5–10% relative robustness to token-budget shrinkage (Li et al., 29 Apr 2026).
- Error/failure analysis: HORIZON, UltraHorizon, and others systematically identify dominant error sources: planning breakdown, catastrophic forgetting, compounding history errors, and memory/context overflow (Wang et al., 13 Apr 2026, Luo et al., 26 Sep 2025). Failure taxonomies distinguish process-level (e.g., environmental or instruction ambiguity) and design-level (planning, memory, catastrophic forgetting) risks. Notably, long-horizon failures show increasing dominance of memory and drift errors as task depth rises.
- Human–agent performance gap: Across ultra-long-horizon challenges, human participants outscore the best LLM agents by a large margin (e.g., 26.52 vs. 14.33 on UltraHorizon’s composite environments), while agent performance and efficiency degrade rapidly as the horizon grows (Luo et al., 26 Sep 2025).
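A simple way to see why compounding errors dominate at long horizons is to assume, purely for illustration, that every step succeeds independently with the same probability. Under that simplifying assumption (which real agents do not satisfy exactly), trajectory-level success decays geometrically with the horizon. The function names below are hypothetical.

```python
import math


def trajectory_success_rate(per_step_success: float, horizon: int) -> float:
    """Probability that all `horizon` steps succeed, assuming independent steps."""
    return per_step_success ** horizon


def max_horizon(per_step_success: float, target: float) -> int:
    """Largest horizon whose success rate still meets `target` under the same model."""
    return math.floor(math.log(target) / math.log(per_step_success))


for horizon in (10, 100, 1000):
    print(horizon, trajectory_success_rate(0.99, horizon))

# With 99% per-step reliability, how long can a trajectory be
# before the chance of a flawless run drops below 50%?
print(max_horizon(0.99, 0.5))  # 68
```

Even a 1% per-step failure rate leaves only about a 37% chance of a flawless 100-step run under this model, which is consistent with the emphasis on error containment in the failure analyses above.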
6. Implications, Limitations, and Future Directions
Long-horizon agent trajectories expose fundamental limitations in current LLM and agentic architectures—notably, memory constraints, error accumulation, context loss, and the absence of robust sub-task planning and validation. Research highlights several convergent principles for future development:
- Hierarchical, constraint-aware decomposition: Integrating explicit sub-plan staging, goal and feasibility monitoring, and checkpointing is critical (Wang et al., 13 Apr 2026, Wen et al., 3 Apr 2026).
- Execution-time plan repair and rollback: Execution monitors and correction mechanisms are needed to contain error propagation during trajectory rollout (Wang et al., 13 Apr 2026).
- Externalized, high-fidelity memory and retrieval: Optical or UID-based memory proxies, causality-aware retrieval, and hybrid neuro-symbolic memory are essential to bypass token budget restrictions (Li et al., 29 Apr 2026, Du et al., 14 Apr 2026).
- Data-efficient supervision and transfer: Mining real-world data sources with natural long-term dependencies (PR chains) or large-scale synthetic persona/world generation (synthetic computers) delivers more robust and transferable supervision at scale (Jiang et al., 2 Feb 2026, Ge et al., 30 Apr 2026).
- Automated benchmarking and open datasets: Modern diagnostic (e.g., HORIZON, UltraHorizon, ATBench, AMA-Bench) and synthetic world benchmarks are now foundational for systematic, quantifiable advancement.
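The checkpointing and rollback principles in this list can be sketched with a small execution monitor. This is a generic illustration under stated assumptions: `run_with_rollback`, the `(name, action, fallback)` step format, and the budget invariant are all invented for the example and are not taken from the cited frameworks.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
StepFn = Callable[[State], State]


def run_with_rollback(
    state: State,
    steps: List[Tuple[str, StepFn, StepFn]],
    is_valid: Callable[[State], bool],
) -> Tuple[State, List[str]]:
    """Run steps in order; on an invalid result, roll back and try the fallback."""
    log: List[str] = []
    for name, action, fallback in steps:
        checkpoint = copy.deepcopy(state)  # snapshot before acting
        state = action(copy.deepcopy(checkpoint))
        if is_valid(state):
            log.append(f"{name}: ok")
            continue
        # Roll back to the checkpoint, then attempt the repair action.
        state = fallback(copy.deepcopy(checkpoint))
        if is_valid(state):
            log.append(f"{name}: repaired")
        else:
            state = checkpoint  # contain the error and stop the rollout
            log.append(f"{name}: aborted")
            break
    return state, log


def spend(amount: int) -> StepFn:
    """Build a step that subtracts `amount` from the budget."""
    return lambda s: {**s, "budget": s["budget"] - amount}


steps = [
    ("buy_small", spend(4), spend(0)),
    ("buy_large", spend(8), spend(3)),  # primary overspends, fallback fits
]
final_state, log = run_with_rollback(
    {"budget": 10}, steps, is_valid=lambda s: s["budget"] >= 0
)
print(final_state, log)
# {'budget': 3} ['buy_small: ok', 'buy_large: repaired']
```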
Limitations remain regarding transparency, data efficiency, and scalability of current approaches, as well as the generalization of symbolic constraint modules to novel domains and the full automation of experience-driven memory updating (Wen et al., 3 Apr 2026). Open problems include efficient long-context management, automatic failure diagnosis, and continual adaptation to arbitrary horizon extensions.
Key References:
(Du et al., 14 Apr 2026, Hu et al., 12 Feb 2026, Li et al., 29 Apr 2026, Wen et al., 3 Apr 2026, Yang et al., 9 Oct 2025, Zhao et al., 26 Feb 2026, Fu et al., 31 Oct 2025, Luo et al., 26 Sep 2025, Wang et al., 13 Apr 2026, Liu et al., 19 Feb 2026, Ge et al., 30 Apr 2026, Jiang et al., 2 Feb 2026, Acharya et al., 2023, Chen et al., 25 Mar 2025, Ghoul et al., 2023, Zheng et al., 2017)