
Long-Horizon ASR in Agentic Systems

Updated 28 April 2026
  • Long-Horizon ASR is a metric that quantifies the fraction of trials where agents fully complete all subgoals in extended tasks, emphasizing end-to-end success over partial progress.
  • It is widely adopted across robotics, navigation, and agentic reasoning to benchmark performance while addressing challenges like compounded errors and memory limitations.
  • Methodologies incorporating hierarchical planning, subgoal decomposition, and enhanced memory modules significantly improve ASR by facilitating failure recovery and efficient task decomposition.

Long-horizon Action Success Rate (ASR) quantifies the fraction of trials in which an agent successfully completes all required steps or subgoals in complex, temporally extended tasks. ASR serves as the principal metric for evaluating the ability of agentic systems—spanning LLM-based planners, Vision-Language-Action (VLA) stacks, and hierarchical controllers—to execute multi-step protocols, solve sequential tasks, and maintain performance as the task horizon increases. Unlike stepwise or local metrics, long-horizon ASR measures strict end-to-end success, making it sensitive to compounded errors, long-range memory limitations, partial observability, and failure recovery. The metric is now widely adopted in robotics, manipulation, navigation, and general agentic AI research as an anchor for benchmarking and diagnosis of system reliability under horizon scaling.

1. Formal Definition and Computation

The canonical definition of long-horizon Action Success Rate (ASR) is the episode-level completion metric:

$\mathrm{ASR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left[\text{all } M_i \text{ subgoals succeeded in episode } i\right]$

where $N$ is the total number of evaluation episodes (trials), $M_i$ is the number of subgoals or atomic actions in episode $i$, and $\mathbb{I}[\cdot]$ is the indicator function returning 1 if all subgoals are achieved and 0 otherwise. In many domains, a subgoal is completed when a precise geometric, semantic, or kinematic predicate is met (e.g., object placed within a target region, logical state transition reached). Some works also report step-wise ASR (the fraction of successful subgoals across all episodes) and average subtask success for finer-grained diagnosis (Shen et al., 22 Apr 2026; Tan et al., 4 Jan 2026; Zeng et al., 20 Apr 2026).
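As a concrete illustration, both the episode-level and step-wise variants follow directly from per-episode subgoal outcomes. The minimal Python sketch below assumes outcomes arrive as lists of booleans, one list per episode; this representation is illustrative and not tied to any cited benchmark's data format.

```python
def long_horizon_asr(episodes):
    """Episode-level ASR: fraction of episodes in which *all* subgoals succeed.

    `episodes` is a list of lists of booleans, one inner list per episode,
    one boolean per subgoal outcome (illustrative representation).
    """
    if not episodes:
        return 0.0
    # An episode counts only if every subgoal predicate held (the indicator).
    return sum(all(ep) for ep in episodes) / len(episodes)


def stepwise_asr(episodes):
    """Step-wise ASR: fraction of successful subgoals pooled across episodes."""
    total = sum(len(ep) for ep in episodes)
    if total == 0:
        return 0.0
    return sum(sum(ep) for ep in episodes) / total
```

Note how the two diverge: for `[[True, True, True], [True, False, True]]` the episode-level ASR is 0.5 while the step-wise ASR is 5/6, which is exactly why episode-level ASR is the stricter diagnostic.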

The metric generalizes naturally to more complex compositional horizons. For variable task depths:

$\mathrm{ASR}(H^*(s)) = \frac{N_{\mathrm{succ}}(s)}{N_{\mathrm{att}}(s)}$

where $H^*(s)$ is the intrinsic horizon at extension level $s$ in nested task families, $N_{\mathrm{att}}(s)$ is the number of trials attempted at that level, and $N_{\mathrm{succ}}(s)$ is the number that fully succeed (Wang et al., 13 Apr 2026).
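The stratified form amounts to grouping trials by extension level before averaging. A sketch, assuming trials are supplied as `(s, success)` pairs (a hypothetical representation chosen for illustration):

```python
from collections import defaultdict


def asr_by_extension(trials):
    """Compute ASR(H*(s)) per extension level s.

    `trials` is an iterable of (s, success) pairs; the grouping mirrors
    N_succ(s) / N_att(s) from the formula above.
    """
    succ = defaultdict(int)  # N_succ(s)
    att = defaultdict(int)   # N_att(s)
    for s, ok in trials:
        att[s] += 1
        succ[s] += bool(ok)
    return {s: succ[s] / att[s] for s in att}
```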

Across domains such as manipulation, navigation, and agentic reasoning, long-horizon ASR consistently refers to the fraction of episodes in which all steps succeed, no matter the local reward structure or auxiliary sub-metrics.

2. Benchmark Domains, Task Construction, and Success Criteria

ASR underpins evaluation protocols across diverse high-complexity agentic scenarios, including robotic manipulation, embodied navigation, automated laboratory protocols, and general agentic reasoning.

Each setting enforces strict full-sequence requirements: an episode is only counted toward ASR if no subgoal (according to task-specific binary or graded predicates) fails. Protocol-level ASR often drops precipitously as horizon or scene complexity increases, exposing compounding bottlenecks absent from single-step metrics.
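The strict full-sequence requirement is an all-predicates-must-hold check over task-specific success predicates. The sketch below pairs a simple geometric predicate (object within a target region, as in the definition above) with that conjunction; the 5 cm tolerance and state layout are hypothetical examples, not taken from any cited benchmark.

```python
import math


def within_region(pos, target, tol=0.05):
    """Example geometric subgoal predicate: object centre within `tol`
    metres of the target position (the 5 cm tolerance is illustrative)."""
    return math.dist(pos, target) <= tol


def episode_counts_toward_asr(final_state, subgoal_predicates):
    """Strict full-sequence criterion: a single failed subgoal predicate
    zeroes the episode's contribution to ASR."""
    return all(pred(final_state) for pred in subgoal_predicates)
```

Usage with a hypothetical state dictionary:

```python
predicates = [
    lambda s: within_region(s["cup"], (0.10, 0.20)),  # cup placed at target
    lambda s: s["drawer_closed"],                      # drawer fully closed
]
state = {"cup": (0.12, 0.21), "drawer_closed": True}
episode_counts_toward_asr(state, predicates)  # → True
```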

3. Representative Empirical Results and Baseline Comparisons

Long-horizon ASR reveals sharp delineations in the capabilities of various agentic architectures, particularly under scaling of the horizon and complexity:

| Domain/Baseline | SR/ASR (%) | Context |
|---|---|---|
| PCArena (HiAgent) (Hu et al., 2024) | 42 | 2× improvement over Standard (21), 5 long-horizon AgentBoard tasks |
| RAMP-3D (Malik et al., 24 Mar 2026) | 79.5 | 3D box rearrangement, 11 variants, 1–30 objects |
| Goal2Skill (Liu et al., 15 Apr 2026) | 32.4 | RMBench, adaptive subtask composition, vs. 9.8% best baseline |
| SD-VLA (Qiu et al., 3 Feb 2026) | 76.4 | Memory-dependent LIBERO-Memory, +39.8 pp over best prior |
| LiLo-VLA (Yang et al., 25 Feb 2026) | 69 | LIBERO-Long++, Ultra-Long, best baseline: 28 |
| HELM (Zeng et al., 20 Apr 2026) | 81.5 | LIBERO-LONG, +23.1 pp over OpenVLA-H=8 |
| Action-Sketcher (Tan et al., 4 Jan 2026) | 96.0 | LIBERO Long-Horizon (8–16 subtasks), best on complex manipulation |
| ALAS (Shen et al., 22 Apr 2026) | 72 | HSI-LH1, vs. TokenHSI 55, CML 30 |
| RoboChemist (Zhang et al., 10 Sep 2025) | 72 | Protocol SR, full chemical sequence, vs. highest baseline 38 |
| RoboClaw (Li et al., 12 Mar 2026) | 75 | Real-world manipulation, +25 pp over VLA open-loop (50) |

These results consistently show that naive or non-hierarchical models suffer exponential-like decay in ASR as tasks lengthen. In contrast, architectures with explicit memory, subgoal chunking, hierarchical working memory, and closed-loop recovery (e.g., HELM, PCArena, Goal2Skill, Action-Sketcher, ALAS) dramatically improve episode completion rates.
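The exponential-like decay has a simple back-of-envelope model: if each subgoal succeeds independently with per-step probability $p$, full-episode success falls as $p^H$. The sketch below uses that idealized independence assumption (real policies exhibit correlated errors, so this is a heuristic, not a fitted model):

```python
def iid_failure_asr(p_step, horizon):
    """Expected ASR when each of `horizon` subgoals succeeds independently
    with probability `p_step` -- the exponential-like decay that open-loop,
    non-hierarchical policies tend to exhibit."""
    return p_step ** horizon
```

Even a strong 95% per-step success rate yields roughly 0.66 at a horizon of 8 subgoals but only about 0.21 at 30, which is why closed-loop recovery (rather than marginal per-step gains) dominates long-horizon ASR.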

4. Mechanisms That Affect Long-Horizon ASR

Three primary system-level factors systematically impact ASR: explicit memory and retrieval mechanisms, hierarchical planning with subgoal decomposition, and closed-loop verification with failure recovery.

Ablation studies consistently show double-digit ASR drops when these components are disabled. For example, PCArena's observation summarization and retrieval module accounts for 30–50% swings in ASR, and HELM's episodic memory and state verifier contribute the majority of its 23.1-point gain.

5. Statistical Analysis and Horizon-Scaling Behavior

Empirical studies document nonlinear ASR collapse as horizon or step count grows. In HORIZON (Wang et al., 13 Apr 2026), both LLM agents and RL/VLA agents display sharp transition regions: high ASR is sustained up to a domain/model-specific critical extension, after which episode success rapidly collapses to zero. Domain compositional strategies (depth vs. breadth) reveal the susceptibility of each architecture to structurally distinct error accumulation.
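A simple way to operationalize the "critical extension" from a measured ASR curve is to scan for the first level at which ASR falls below a chosen threshold. The sketch below uses an illustrative 0.5 threshold; the actual transition point is domain- and model-specific, as the studies above note.

```python
def critical_extension(asr_curve, threshold=0.5):
    """Return the smallest extension level s at which measured ASR drops
    below `threshold`, or None if the curve never crosses it.

    `asr_curve` maps extension level s -> ASR(H*(s)); the 0.5 threshold
    is an illustrative choice, not a standard from the literature.
    """
    for s in sorted(asr_curve):
        if asr_curve[s] < threshold:
            return s
    return None
```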

ASR is typically reported as mean ± standard deviation across random seeds or trial replicates. Extensive bootstrapping or runwise aggregation yields reliable uncertainty quantification, especially at deep compositional levels.
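A percentile-bootstrap sketch of this uncertainty quantification, assuming binary episode outcomes (the resample count, confidence level, and seed below are illustrative parameters):

```python
import random


def bootstrap_asr_ci(successes, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ASR.

    `successes` is a list of 0/1 (or bool) episode outcomes; resampling
    episodes with replacement approximates the sampling distribution of
    the mean success rate.
    """
    rng = random.Random(seed)
    n = len(successes)
    means = sorted(
        sum(rng.choice(successes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

At deep compositional levels, where only a handful of episodes succeed, intervals produced this way are wide, which is exactly the regime where point estimates of ASR are least trustworthy.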

6. Diagnosing and Mitigating ASR Degradation

Comprehensive failure attribution, as exemplified in HORIZON (Wang et al., 13 Apr 2026), employs LLM-as-a-Judge pipelines and error taxonomy labeling to analyze breakdowns. Failures transition from process-level (local subplan issues) to design-level (memory and catastrophic forgetting), aligning with ASR collapse.

Mitigation strategies empirically shown to elevate ASR include hierarchical planning with explicit subgoal decomposition, enhanced memory modules (episodic and working memory), and closed-loop verification with failure recovery.

Simply scaling context length or model size without architectural improvements yields limited (often <6 pp) ASR gains.

7. Relation to Partial-Completion, Average Progress, and Other Metrics

While ASR strictly tracks full-sequence completion, several works also report complementary metrics such as step-wise ASR, average subtask progress, and partial-completion scores.

ASR remains the most stringent and informative measure for true end-to-end reliability, with significant implications for system deployment in robotics, automated lab environments, and decision-critical agentic domains.


