Long-Horizon ASR in Agentic Systems
- Long-Horizon ASR is a metric that quantifies the fraction of trials where agents fully complete all subgoals in extended tasks, emphasizing end-to-end success over partial progress.
- It is widely adopted across robotics, navigation, and agentic reasoning to benchmark performance while addressing challenges like compounded errors and memory limitations.
- Methodologies incorporating hierarchical planning, subgoal decomposition, and enhanced memory modules significantly improve ASR by facilitating failure recovery and efficient task decomposition.
Long-horizon Action Success Rate (ASR) quantifies the fraction of trials in which an agent successfully completes all required steps or subgoals in complex, temporally extended tasks. ASR serves as the principal metric for evaluating the ability of agentic systems—spanning LLM-based planners, Vision-Language-Action (VLA) stacks, and hierarchical controllers—to execute multi-step protocols, solve sequential tasks, and maintain performance as the task horizon increases. Unlike stepwise or local metrics, long-horizon ASR measures strict end-to-end success, making it sensitive to compounded errors, long-range memory limitations, partial observability, and failure recovery. The metric is now widely adopted in robotics, manipulation, navigation, and general agentic AI research as an anchor for benchmarking and diagnosis of system reliability under horizon scaling.
1. Formal Definition and Computation
The canonical definition of long-horizon Action Success Rate (ASR) is the episode-level completion metric

$$\mathrm{ASR} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\text{all } K_i \text{ subgoals of episode } i \text{ are achieved}\right],$$

where $N$ is the total number of evaluation episodes (trials), $K_i$ is the number of subgoals or atomic actions in episode $i$, and $\mathbb{1}[\cdot]$ is the indicator function returning 1 if all subgoals are achieved and 0 otherwise. In many domains, a subgoal is completed when a precise geometric, semantic, or kinematic predicate is met (e.g., object placed within target region, logical state transition reached). Some works also report step-wise ASR (the fraction of successful subgoals across all episodes) and average subtask success for finer-grained diagnosis (Shen et al., 22 Apr 2026, Tan et al., 4 Jan 2026, Zeng et al., 20 Apr 2026).
The metric generalizes naturally to more complex compositional horizons. For variable task depths,

$$\mathrm{ASR}(d) = \frac{1}{N_d}\sum_{i=1}^{N_d} \mathbb{1}\!\left[\text{all } H_d \text{ subgoals of episode } i \text{ are achieved}\right],$$

where $H_d$ is the intrinsic horizon at extension level $d$ in nested task families and $N_d$ is the number of episodes evaluated at that level (Wang et al., 13 Apr 2026).
Across domains such as manipulation, navigation, and agentic reasoning, long-horizon ASR consistently refers to the fraction of episodes in which all steps succeed, no matter the local reward structure or auxiliary sub-metrics.
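As a concrete illustration of the definitions above, the following is a minimal sketch of how episode-level ASR, step-wise ASR, and depth-stratified ASR could be computed from evaluation logs. The `Episode` record and its field names are hypothetical and not taken from any of the cited benchmarks.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Episode:
    # Hypothetical episode record; real benchmarks log per-subgoal predicates
    # in their own formats.
    subgoal_results: list[bool]  # one binary predicate outcome per subgoal
    depth: int = 0               # extension level d for compositional tasks

def episode_asr(episodes: list[Episode]) -> float:
    """Fraction of episodes in which every subgoal predicate holds."""
    return sum(all(e.subgoal_results) for e in episodes) / len(episodes)

def stepwise_asr(episodes: list[Episode]) -> float:
    """Fraction of successful subgoals pooled across all episodes."""
    hits = sum(sum(e.subgoal_results) for e in episodes)
    total = sum(len(e.subgoal_results) for e in episodes)
    return hits / total

def asr_by_depth(episodes: list[Episode]) -> dict[int, float]:
    """Episode-level ASR stratified by extension level d, i.e. ASR(d)."""
    buckets: dict[int, list[Episode]] = defaultdict(list)
    for e in episodes:
        buckets[e.depth].append(e)
    return {d: episode_asr(group) for d, group in buckets.items()}

# Example: two episodes at depth 1, one at depth 2.
logs = [
    Episode([True, True, True], depth=1),
    Episode([True, False, True], depth=1),
    Episode([True, True, True, True, True], depth=2),
]
print(episode_asr(logs))   # 2/3 ≈ 0.667
print(stepwise_asr(logs))  # 10/11 ≈ 0.909
print(asr_by_depth(logs))  # {1: 0.5, 2: 1.0}
```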
2. Benchmark Domains, Task Construction, and Success Criteria
ASR underpins evaluation protocols in diverse high-complexity agentic scenarios:
- Symbolic and embodied planning: AgentBoard games (Blocksworld, Tyreworld, Jericho), where full success requires satisfying all atomic environment goal conditions (Hu et al., 2024).
- Robotic manipulation: Multi-object interaction, protocol-based lab automation, or 3D rearrangement, with geometric and semantic constraints on all final placements (Zhang et al., 10 Sep 2025, Malik et al., 24 Mar 2026, Tan et al., 4 Jan 2026, Zeng et al., 20 Apr 2026).
- Navigation: Episodic success is determined by reaching the goal state and halting within physical and timing thresholds (Hu et al., 12 Feb 2026).
- Human-scene interaction: Ordered composite skills (Follow, Carry, Climb, Sit) chained together with stringent per-step criteria and global episode timeouts (Shen et al., 22 Apr 2026).
- Multimodal and database/Web domains: Cross-domain composition with depth/breadth extension, where ASR measures robustness as horizon grows (Wang et al., 13 Apr 2026).
Each setting enforces strict full-sequence requirements: an episode is only counted toward ASR if no subgoal (according to task-specific binary or graded predicates) fails. Protocol-level ASR often drops precipitously as horizon or scene complexity increases, exposing compounding bottlenecks absent from single-step metrics.
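The strict full-sequence requirement can be made concrete with a small sketch: an episode passes only if every task-specific predicate holds at its check point and the global timeout is respected. The predicate names and thresholds below are illustrative placeholders, not the criteria of any particular benchmark.

```python
import numpy as np

# Hypothetical per-subgoal predicates; real benchmarks substitute their own
# geometric, semantic, or kinematic checks (placement regions, logical state,
# halting and timing thresholds).
def within_region(obj_xyz, target_xyz, tol=0.05):
    """Geometric predicate: object placed within `tol` meters of the target."""
    return float(np.linalg.norm(np.asarray(obj_xyz) - np.asarray(target_xyz))) <= tol

def goal_state_reached(env_state, required_flags):
    """Semantic predicate: all required logical flags hold in the final state."""
    return all(env_state.get(flag, False) for flag in required_flags)

def episode_success(subgoal_checks, timeout_exceeded):
    """An episode counts toward ASR only if every subgoal predicate passed
    and the global episode timeout was not exceeded."""
    return (not timeout_exceeded) and all(subgoal_checks)

# Example episode: two placement subgoals and one logical-state subgoal.
checks = [
    within_region([0.42, 0.11, 0.03], [0.40, 0.10, 0.03]),
    within_region([0.80, 0.55, 0.03], [0.60, 0.55, 0.03]),   # fails: 0.20 m off
    goal_state_reached({"drawer_closed": True}, ["drawer_closed"]),
]
print(episode_success(checks, timeout_exceeded=False))  # False: one subgoal failed
```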
3. Representative Empirical Results and Baseline Comparisons
Long-horizon ASR reveals sharp delineations in the capabilities of various agentic architectures, particularly under scaling of the horizon and complexity:
| Method | SR/ASR (%) | Context |
|---|---|---|
| PCArena (HiAgent) (Hu et al., 2024) | 42 | 2x improvement over Standard (21), 5 long-horizon AgentBoard tasks |
| RAMP-3D (Malik et al., 24 Mar 2026) | 79.5 | 3D box rearrangement, 11 variants, 1–30 objects |
| Goal2Skill (Liu et al., 15 Apr 2026) | 32.4 | RMBench, adaptive subtask composition, vs. 9.8% best baseline |
| SD-VLA (Qiu et al., 3 Feb 2026) | 76.4 | Memory-dependent LIBERO-Memory, +39.8 pp over best prior |
| LiLo-VLA (Yang et al., 25 Feb 2026) | 69 | LIBERO-Long++, Ultra-Long, best baseline: 28 |
| HELM (Zeng et al., 20 Apr 2026) | 81.5 | LIBERO-LONG, +23.1 pp over OpenVLA-H=8 |
| Action-Sketcher (Tan et al., 4 Jan 2026) | 96.0 | LIBERO Long-Horizon (8–16 subtasks), best on complex manipulation |
| ALAS (Shen et al., 22 Apr 2026) | 72 | HSI-LH1, vs. TokenHSI 55, CML 30 |
| RoboChemist (Zhang et al., 10 Sep 2025) | 72 | Protocol SR, full chemical sequence, vs. highest baseline 38 |
| RoboClaw (Li et al., 12 Mar 2026) | 75 | Real-world manipulation, +25 pp over VLA open-loop (50) |
These results consistently show that naive or non-hierarchical models suffer exponential-like decay in ASR as tasks lengthen. In contrast, architectures with explicit memory, subgoal chunking, hierarchical working memory, and closed-loop recovery (e.g. HELM, PCArena, Goal2Skill, Action-Sketcher, ALAS) dramatically improve episode completion rates.
4. Mechanisms That Affect Long-Horizon ASR
Three primary system-level factors systematically impact ASR:
- Temporal Memory: Hierarchical, episodic, or layer-wise KV memory mitigates the “memory gap” failures typical of purely Markovian or short-context models (Hu et al., 2024, Sun et al., 8 Mar 2026, Zeng et al., 20 Apr 2026). Ablations show ASR losses of 8–23 pp when key episodic or chunked working-memory modules are disabled.
- Verification and Recovery: Learned state verifiers and recovery controllers (rollback, reflection, multi-policy orchestration) address cumulative subgoal errors and allow for local correction without global episode reset (Liu et al., 15 Apr 2026, Zeng et al., 20 Apr 2026, Li et al., 12 Mar 2026).
- Task/Plan Decomposition: Explicit subgoal chunking, plan summarization, and trajectory retrieval localize memory, reduce context overload, and enable targeted reasoning (Hu et al., 2024, Tan et al., 4 Jan 2026).
Ablation studies consistently show double-digit drops in ASR on disabling these components. For example, PCArena’s observation summarization and retrieval module yields 30–50% ASR swings; HELM’s episodic memory and state verifier contribute the majority of its 23.1-point gain.
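A minimal sketch of the memory mechanism described above, assuming a chunked working memory in which completed subgoals are collapsed into short summaries while the active subgoal retains its full observation history. The class and method names are hypothetical, not the interface of HiAgent, HELM, or TempoFit.

```python
class ChunkedWorkingMemory:
    """Keep full detail only for the active subgoal; summarize finished ones.

    Hypothetical interface for illustration, not any cited system's API.
    """

    def __init__(self, summarize):
        self.summaries = []        # one short summary per completed subgoal
        self.active_events = []    # full observation/action log for current subgoal
        self.summarize = summarize # callable: list[str] -> str (e.g., an LLM call)

    def record(self, event: str):
        self.active_events.append(event)

    def close_subgoal(self, subgoal_name: str):
        """Collapse the active subgoal's events into a single summary chunk."""
        self.summaries.append(f"[{subgoal_name}] {self.summarize(self.active_events)}")
        self.active_events = []

    def context(self) -> str:
        """Prompt context: compressed history plus the detailed active window."""
        return "\n".join(self.summaries + self.active_events)

# Usage with a trivial summarizer (a real system would call an LLM here).
mem = ChunkedWorkingMemory(summarize=lambda events: f"{len(events)} steps, done")
mem.record("pick red block"); mem.record("place on tray")
mem.close_subgoal("subgoal-1")
mem.record("open drawer")
print(mem.context())
# [subgoal-1] 2 steps, done
# open drawer
```

The design choice this illustrates is that context grows with the number of completed subgoals rather than with the raw step count, which is what localizes memory and reduces context overload over long horizons.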
5. Statistical Analysis and Horizon-Scaling Behavior
Empirical studies document nonlinear ASR collapse as horizon or step count grows. In HORIZON (Wang et al., 13 Apr 2026), both LLM agents and RL/VLA agents display sharp transition regions: high ASR is sustained up to a domain/model-specific critical extension, after which episode success rapidly collapses to zero. Domain compositional strategies (depth vs. breadth) reveal the susceptibility of each architecture to structurally distinct error accumulation.
ASR is typically reported as the mean ± standard deviation across random seeds or trial replicates. Bootstrapping over episodes or run-wise aggregation yields reliable uncertainty quantification, especially at deep compositional levels.
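Because episode-level ASR is a mean of binary outcomes, bootstrapping it is straightforward; the sketch below resamples episode outcomes to obtain a percentile confidence interval and is a generic illustration rather than the exact protocol of any cited paper.

```python
import numpy as np

def bootstrap_asr_ci(successes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for episode-level ASR.

    successes: array of 0/1 episode outcomes (1 = all subgoals achieved).
    Generic illustration; papers may aggregate per run or per seed instead.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(successes, dtype=float)
    n = len(outcomes)
    # Resample episodes with replacement and recompute ASR each time.
    boot = rng.choice(outcomes, size=(n_boot, n), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return outcomes.mean(), (lo, hi)

# 50 evaluation episodes, 36 full successes.
outcomes = [1] * 36 + [0] * 14
asr, (lo, hi) = bootstrap_asr_ci(outcomes)
print(f"ASR = {asr:.2f}, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")  # ~0.72, roughly [0.60, 0.84]
```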
6. Diagnosing and Mitigating ASR Degradation
Comprehensive failure attribution, as exemplified in HORIZON (Wang et al., 13 Apr 2026), employs LLM-as-a-Judge pipelines and error-taxonomy labeling to analyze breakdowns. As the horizon grows, failures shift from process-level issues (local subplan errors) to design-level issues (memory limits and catastrophic forgetting), coinciding with ASR collapse.
Mitigation strategies empirically shown to elevate ASR include:
- Hierarchical subplanning and plan repair (Hu et al., 2024, Wang et al., 13 Apr 2026)
- Enhanced working and episodic memory modules (Zeng et al., 20 Apr 2026, Sun et al., 8 Mar 2026, Hu et al., 2024)
- Closed-loop, reflection-based recovery (Liu et al., 15 Apr 2026, Li et al., 12 Mar 2026)
- Task- and memory-aware verification logic (Zeng et al., 20 Apr 2026)
- Explicit decomposition and subgoal summarization (Hu et al., 2024, Tan et al., 4 Jan 2026)
Simply scaling context length or model size without architectural improvements yields limited (often <6 pp) ASR gains.
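The closed-loop recovery pattern listed above can be sketched as a verify-then-repair loop around each subgoal: execute, check a state verifier, and retry with a locally repaired subgoal before giving up. Everything here (function names, retry budget) is an illustrative assumption, not the control logic of Goal2Skill, RoboClaw, or HELM.

```python
def run_episode(subgoals, execute, verify, repair, max_retries=2):
    """Execute subgoals in order with local verification and recovery.

    Illustrative skeleton only:
      execute(subgoal): attempt the subgoal, return resulting state.
      verify(subgoal, state): return True if the subgoal's predicate holds.
      repair(subgoal, state): produce a corrected subgoal (e.g., via reflection).
    Returns True only if every subgoal is eventually verified (the episode then
    counts toward ASR); a single unrecovered failure fails the whole episode.
    """
    for subgoal in subgoals:
        attempt, current = 0, subgoal
        while True:
            state = execute(current)
            if verify(current, state):
                break                          # local success, move to next subgoal
            attempt += 1
            if attempt > max_retries:
                return False                   # unrecovered failure ends the episode
            current = repair(current, state)   # local correction, no global reset
    return True

# Toy usage: "place" fails until repaired into "place-with-regrasp".
attempts = []
success = run_episode(
    subgoals=["pick", "place"],
    execute=lambda sg: attempts.append(sg) or sg,
    verify=lambda sg, st: st != "place",          # raw "place" never verifies
    repair=lambda sg, st: "place-with-regrasp",   # reflection proposes a fix
)
print(success, attempts)  # True ['pick', 'place', 'place-with-regrasp']
```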
7. Relation to Partial-Completion, Average Progress, and Other Metrics
While ASR strictly tracks full-sequence completion, several works also report:
- Average Progress (AP): Mean fraction of ordered subgoals completed before failure (Yang et al., 25 Feb 2026).
- Average subtask success: Mean over all subtask hits in all episodes (Shen et al., 22 Apr 2026, Tan et al., 4 Jan 2026).
- Progress Rate (PR): Fraction of goal conditions satisfied at episode end (Hu et al., 2024).
ASR remains the most stringent and informative measure for true end-to-end reliability, with significant implications for system deployment in robotics, automated lab environments, and decision-critical agentic domains.
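For comparison with ASR, the auxiliary metrics above admit equally simple estimators. The sketch below assumes each episode log records ordered subgoal outcomes and end-of-episode goal-condition flags, which is an illustrative data layout rather than any benchmark's actual format.

```python
def average_progress(episodes):
    """AP: mean fraction of ordered subgoals completed before the first failure."""
    fractions = []
    for subgoal_results in episodes:       # illustrative layout: list of bools per episode
        done = 0
        for ok in subgoal_results:
            if not ok:
                break
            done += 1
        fractions.append(done / len(subgoal_results))
    return sum(fractions) / len(fractions)

def progress_rate(goal_conditions_per_episode):
    """PR: mean fraction of goal conditions satisfied at episode end."""
    return sum(sum(g) / len(g) for g in goal_conditions_per_episode) / len(
        goal_conditions_per_episode
    )

episodes = [[True, True, False, True], [True, True, True]]
print(average_progress(episodes))                           # (0.5 + 1.0) / 2 = 0.75
print(progress_rate([[True, False], [True, True, True]]))   # (0.5 + 1.0) / 2 = 0.75
```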
References
- (Hu et al., 2024) "HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with LLM"
- (Liu et al., 15 Apr 2026) "Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection"
- (Malik et al., 24 Mar 2026) "Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement"
- (Hu et al., 12 Feb 2026) "LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation"
- (Tan et al., 4 Jan 2026) "Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation"
- (Zhang et al., 10 Sep 2025) "RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation"
- (Sun et al., 8 Mar 2026) "TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation"
- (Shen et al., 22 Apr 2026) "ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement"
- (Wang et al., 13 Apr 2026) "The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break"
- (Yang et al., 25 Feb 2026) "LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies"
- (Li et al., 12 Mar 2026) "RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks"
- (Qiu et al., 3 Feb 2026) "Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement"
- (Zeng et al., 20 Apr 2026) "HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation"