- The paper introduces a hierarchical framework that reduces planning complexity by clustering error-prone tools into high-level agent tools.
- It employs asymmetric planner adaptation with backward reconstruction and forward refinement to ensure robust and efficient execution.
- HTAA improves accuracy by up to 13% and, in a real-world POI verification workflow, reduces manual verification workload by 84.5% and annotation costs by 81.25%.
The increasing integration of LLMs with external tools is essential for realizing robust agentic behavior in complex, real-world workflows. However, deploying LLM planners in heterogeneous, large-scale tool ecosystems is hindered by the planning complexity and error propagation associated with flat tool-calling architectures. In such configurations, the action space grows combinatorially with the toolset, substantially burdening the planner and resulting in long, fragile execution chains. This challenge is especially acute in industrial settings: for example, the point-of-interest (POI) verification workflow in large ride-hailing platforms may require coordinating over a dozen tools per task.
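To make the combinatorial growth concrete, the following minimal sketch counts candidate tool-call sequences under a simplifying assumption that is not taken from the paper: any tool may be called at any step, with repetition. The function name `flat_plan_count`, the horizons, and the 5-tool hierarchical toolset are hypothetical values chosen for illustration.

```python
def flat_plan_count(num_tools: int, horizon: int) -> int:
    """Count ordered tool-call sequences of exactly `horizon` steps.

    Assumes any tool may be called at any step (with repetition).
    """
    return num_tools ** horizon


# 15 tools is in the range the source mentions for a single task; the
# horizons and the 5-tool abstracted toolset are hypothetical.
flat_plans = flat_plan_count(num_tools=15, horizon=10)
hier_plans = flat_plan_count(num_tools=5, horizon=4)

print(f"flat: {flat_plans:,} candidate sequences")
print(f"hierarchical: {hier_plans:,} candidate sequences")
```

Under these assumed numbers, shrinking both the toolset and the chain length reduces the search space by many orders of magnitude, which is the intuition behind abstracting tools before planning.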
HTAA Framework: Hierarchical Abstraction and Coordination
The paper proposes HTAA, a hierarchical methodology for scaling tool-augmented LLM planning. HTAA is composed of two core innovations: toolset agentization and asymmetric planner adaptation.
Toolset Agentization
Instead of exposing every fine-grained tool to the LLM-based planner, HTAA applies a utility-driven clustering pipeline that groups frequent or error-prone tools into higher-level “agent tools.” Grouping combines an automated signal (LLM-estimated utility) with expert-driven curation, which lets the pipeline adapt to a domain and reduces functional redundancy among tools. Each agent tool encapsulates complex, multi-step internal decision-making (including aggregation and summarization across co-functional tools) and exposes a unified interface. This allows the global planner to delegate nuanced subtasks, significantly reducing its effective action space.
Crucially, agentization is selective: deterministic, lightweight tools are left as basic primitives to avoid unnecessary abstraction overhead, yielding a hybrid environment of agent and basic tools.
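The selective grouping described above can be sketched as follows. This is an illustrative simplification, not the paper's pipeline: the `Tool` fields, the `agentize` function, the utility scores, the 0.5 threshold, and the example tool names are all hypothetical, and the group labels stand in for whatever the automated and expert-driven clustering would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    group: str           # functional cluster label (hypothetical)
    utility: float       # estimated benefit of wrapping the tool, in [0, 1] (hypothetical)
    deterministic: bool  # lightweight deterministic tools stay as primitives


def agentize(tools, threshold=0.5):
    """Split a toolset into grouped agent tools and ungrouped basic tools."""
    agent_tools: dict[str, list[str]] = {}
    basic_tools: list[str] = []
    for tool in tools:
        if tool.deterministic or tool.utility < threshold:
            # Selective agentization: leave cheap, reliable tools flat.
            basic_tools.append(tool.name)
        else:
            agent_tools.setdefault(tool.group, []).append(tool.name)
    return agent_tools, basic_tools


TOOLS = [
    Tool("map_search", "geo", 0.9, False),
    Tool("street_view", "geo", 0.8, False),
    Tool("web_search", "web", 0.7, False),
    Tool("unit_convert", "util", 0.1, True),
]

agent_tools, basic_tools = agentize(TOOLS)
# The planner now chooses among agent tools plus basic tools,
# rather than among every fine-grained tool.
planner_actions = len(agent_tools) + len(basic_tools)
print(agent_tools, basic_tools, planner_actions)
```

In this toy toolset the planner sees three actions instead of four; the reduction would be larger wherever many co-functional tools collapse into one agent tool.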
Figure 1: Overview of the HTAA framework. Core components are Toolset Agentization, which abstracts fine-grained tools to reduce planning complexity, and Asymmetric Planner Adaptation, which aligns the planner with agent tools via a hybrid trajectory pipeline.
Asymmetric Planner Adaptation
The shift to hierarchical abstraction introduces a non-trivial coordination challenge: the planner only observes aggregated outputs from agent tools, with no visibility into their fine-grained execution, risking a semantic and policy mismatch. HTAA resolves this by introducing a trajectory-centric asymmetric adaptation protocol:
- Backward Reconstruction: Trajectories are generated by conditioning on both the input and the ground-truth label, producing high-quality, feasible tool-use chains aligned with the final outcome, even under weak task priors.
- Forward Refinement: A stronger teacher model then rewrites the reconstructed trajectories, converting verification-style paths into forward, executable reasoning. This improves logical consistency while preserving the valid tool execution trees.
- Hybrid Policy Optimization: The resulting structure-preserving data serves as supervision for end-to-end (E2E) behavioral cloning. Only the planner is updated; agent tools stay frozen, which keeps policy optimization stable as the planner and toolset scale.
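The three steps above can be sketched as a small data pipeline. This is a hedged illustration under stated assumptions, not the authors' implementation: the `Trajectory` type, the function names, the prompt wording, and the stub models `fake_generate` and `fake_rewrite` are hypothetical stand-ins for the LLM calls that a real system would make.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    task_input: str
    tool_calls: list[str]  # ordered tool invocations (the structure to preserve)
    reasoning: str
    label: str


def backward_reconstruct(task_input: str, label: str,
                         generate: Callable[[str], list[str]]) -> Trajectory:
    """Condition on both the input and the ground-truth label."""
    prompt = (f"Input: {task_input}\nKnown answer: {label}\n"
              "List the tool calls that verify it.")
    calls = generate(prompt)
    return Trajectory(task_input, calls, f"Verify that the answer is {label}.", label)


def forward_refine(traj: Trajectory, rewrite: Callable[[str], str]) -> Trajectory:
    """Rewrite only the reasoning; copy the tool calls unchanged."""
    return Trajectory(traj.task_input, list(traj.tool_calls),
                      rewrite(traj.reasoning), traj.label)


def to_bc_example(traj: Trajectory) -> dict:
    """Format one supervised (prompt, completion) pair for the planner."""
    completion = "\n".join([traj.reasoning, *traj.tool_calls, f"Answer: {traj.label}"])
    return {"prompt": traj.task_input, "completion": completion}


# Stub models so the sketch runs without any LLM (hypothetical outputs).
def fake_generate(prompt: str) -> list[str]:
    return ["geo_agent(query='Cafe A')", "web_agent(query='Cafe A hours')"]


def fake_rewrite(reasoning: str) -> str:
    return "Check the location first, then confirm opening hours."


raw = backward_reconstruct("Is 'Cafe A' still open?", "open", fake_generate)
refined = forward_refine(raw, fake_rewrite)
example = to_bc_example(refined)
print(example["completion"])
```

The design point the sketch tries to capture is the asymmetry: the label is visible when tool calls are reconstructed, the refinement step changes the reasoning text but not the tool structure, and only planner-side training examples are produced, so the agent tools are never updated.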
Experimental Evaluation
HTAA was extensively validated on both proprietary large-scale real-world workflows (InfoVerify) and established function-calling benchmarks (BFCL, Extended BFCL). Notably, the InfoVerify dataset reflects a rigorous industrial pipeline, where legacy manual POI verification (over 700 tasks, 15+ tools/task) is replaced by the automated HTAA stack.
Key empirical findings include:
- InfoVerify: Native LLMs (e.g., Claude-Sonnet-4.5, GPT-4o) achieve only 22–35% accuracy in long-horizon POI validation. HTAA consistently improves this by 1.5–13%, demonstrating compositional generalization and robustness. APA-8B, the dedicated HTAA-adapted model fine-tuned for the task, reaches 50–54% accuracy, surpassing all non-hierarchical (flat) tool-augmented methods by substantial margins. Operational metrics further indicate that HTAA reduces manual verification workload by 84.5% and annotation costs by 81.25%.
- Extended BFCL: On challenging synthetic tool-use scenarios with increased tool invocation depth, HTAA delivers performance gains of up to 10%, even in typical reasoning categories, supporting the planning decomposition advantage in simulated, long-horizon cases.

Figure 2: HTAA shows monotonic improvements with planner and agent tool model scaling, supporting both tool-based and planner-based generalization.
Additionally, scaling experiments (Figure 2) show monotonic benefits from larger base planners and more capable agent tools. In particular, when agent tool quality is increased, even a small planner can approach large-planner performance, because tool orchestration is offloaded to capable agents. Analyses of trajectory lengths and token usage confirm that HTAA yields significantly shorter, more efficient plans, reducing context overhead and computational cost.
Ablations and Mechanistic Analysis
Component ablations indicate that both agentization and asymmetric adaptation are essential: removing either substantially degrades accuracy, increases reasoning fragility, and causes error accumulation. Efficiency analyses indicate that trajectory compression (fewer decision steps and tokens per task) directly correlates with reduced cognitive load and error rates in long-horizon workflows.
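One way to see why shorter trajectories can reduce error accumulation is a simple compounding model. This is an assumption for illustration only, not an analysis from the paper: it treats each planner decision as succeeding independently with the same probability, and the 0.95 accuracy and the step counts are hypothetical.

```python
def chain_success(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_accuracy ** num_steps


# Hypothetical numbers: same per-step accuracy, different chain lengths.
long_chain = chain_success(step_accuracy=0.95, num_steps=15)
short_chain = chain_success(step_accuracy=0.95, num_steps=5)

print(f"15 planner steps: {long_chain:.2f}")
print(f"5 planner steps:  {short_chain:.2f}")
```

Under this toy model, cutting the chain from 15 to 5 planner decisions raises end-to-end success from roughly 0.46 to roughly 0.77. Real steps are neither independent nor equally reliable, and errors inside agent tools still count, so the figures only indicate the direction of the effect.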
Practical and Theoretical Implications
HTAA’s hierarchical abstraction addresses both the sample-efficiency and context bottlenecks endemic to large-toolset LLM planning. By decoupling high-level planning from low-level tool coordination, the framework offers a path toward modular agent architectures, mirroring the delegation structures of human organizations.
Practically, the approach facilitates scalable automation in real-world workflows (e.g., industrial verification, information aggregation, workflow orchestration) with minimal human oversight. The static (frozen) agent tool interfaces further support robust, composable pipeline deployment, reducing the risk from toolset drift or planner overfitting.
Theoretically, HTAA substantiates the hypothesis that LLM-based reasoning in multi-step, multi-tool tasks faces severe optimization and generalization barriers in flat action spaces. Hierarchical abstraction (with partial observability of the tool layer) and asymmetric, trajectory-based adaptation offer a robust template for future agentic LLM systems.
Future Directions
HTAA opens avenues for automatic, differentiable agent tool construction, learnable agentization policies, and more refined alignment protocols in semi-observable multi-agent settings. Improved interpretability of agent tool aggregation and cross-domain transferability are also important directions. The information hiding problem, where abstraction may occasionally hinder exploitation of low-level evidence, remains a challenge for future architectural refinement.
Conclusion
HTAA demonstrates that hierarchical abstraction via toolset agentization, coupled with asymmetric planner adaptation, yields a scalable, robust framework for tool-augmented LLM planning (2604.10917). Empirical evidence confirms significant efficiency, accuracy, and operational cost gains over flat alternatives. These findings indicate high potential for the deployment of agentized LLM systems in demanding, real-world environments requiring complex, multi-tool, long-horizon decision-making.