ChatHTN Planner: Hybrid HTN & LLM Integration
- ChatHTN Planner is a hybrid Hierarchical Task Network framework that integrates symbolic planning with LLM-based task decompositions while guaranteeing that the resulting plans are sound.
- The system interleaves classical HTN methods with ChatGPT queries, dramatically reducing expensive LLM calls while sustaining high planning success rates.
- An online method learner generalizes LLM-derived decompositions to update the method library, enhancing scalability and robustness in complex domains.
The ChatHTN planner is a hybrid Hierarchical Task Network (HTN) planning framework that tightly integrates symbolic HTN planning with on-demand decomposition queries to LLMs, notably ChatGPT. When no applicable method exists to decompose a compound task, ChatHTN prompts ChatGPT to produce a sequence of primitive subtasks; subsequent extensions of ChatHTN learn and generalize from these LLM-derived decompositions in an online fashion. This approach guarantees that resultant plans are provably sound with respect to task effects, while dramatically reducing reliance on costly LLM calls and maintaining high overall planning success rates (Xu et al., 17 Nov 2025, Munoz-Avila et al., 17 May 2025).
1. System Overview and Formal Definitions
ChatHTN operates on planning problems of the form $P = \langle T, s_0, M, O \rangle$:
- $T$: initial task list (ordered sequence of compound or primitive tasks)
- $s_0$: initial state (set of ground atoms)
- $M$: initial HTN method library
- $O$: set of primitive operators
Tasks are divided into:
- Primitive: $t$, associated with an operator $o \in O$
- Compound: $t$, decomposed via methods $m \in M$
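For concreteness, a minimal Python sketch of these structures; the field names (e.g., `add_effects`) are illustrative, not drawn from the papers:

```python
from dataclasses import dataclass

# A ground atom is a tuple such as ("at", "truck1", "locA"); a state is a set of atoms.

@dataclass(frozen=True)
class Task:
    name: str          # task symbol
    args: tuple        # ground arguments
    primitive: bool    # primitive tasks map to operators; compound tasks to methods

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset  # atoms that must hold before application
    add_effects: frozenset    # atoms added to the state
    del_effects: frozenset    # atoms removed from the state

@dataclass(frozen=True)
class Method:
    task_name: str            # compound task symbol this method decomposes
    preconditions: frozenset  # applicability condition in the current state
    subtasks: tuple           # ordered sequence of Task instances

@dataclass
class Problem:
    tasks: list       # T: initial ordered task list
    state: set        # s0: initial state
    methods: list     # M: initial method library
    operators: dict   # O: primitive operators, keyed by task name
```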
At each step, ChatHTN processes the head of the task list:
- If primitive, applies the operator if preconditions permit, updates state, and appends to the plan.
- If compound and an applicable method $m \in M$ exists, applies $m$, replacing the task with its subtasks.
- If no method applies, queries ChatGPT for a decomposition into primitive tasks, inserts a verifier subtask to check the intended effects, and proceeds recursively (Xu et al., 17 Nov 2025, Munoz-Avila et al., 17 May 2025).
Soundness is achieved via strict effect checking: every decomposition (symbolic or LLM) is immediately followed by a verifier primitive that ensures the original compound's effects are realized.
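Building on the dataclasses above, a schematic of this loop (not the authors' implementation: backtracking and variable binding are elided, `llm_decompose` is an assumed ChatGPT wrapper, and `make_verifier` is sketched in the next section):

```python
def chathtn_plan(problem, llm_decompose, make_verifier):
    """Schematic ChatHTN loop over the structures above (no backtracking)."""
    tasks = list(problem.tasks)
    state = set(problem.state)
    plan = []
    while tasks:
        task = tasks.pop(0)                          # head of the task list
        if task.primitive:
            op = problem.operators[task.name]
            if not op.preconditions <= state:
                return None                          # a full planner backtracks here
            state = (state - op.del_effects) | op.add_effects
            plan.append(task)
        else:
            m = next((m for m in problem.methods
                      if m.task_name == task.name and m.preconditions <= state),
                     None)
            if m is not None:                        # symbolic decomposition
                tasks = list(m.subtasks) + tasks
            else:                                    # LLM approximation + verifier
                subtasks = llm_decompose(task, state)
                tasks = list(subtasks) + [make_verifier(task, problem)] + tasks
    return plan
```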
2. LLM Interleaving and Decomposition Procedure
ChatHTN interleaves classical HTN planning with LLM-based approximations:
- When a compound task $t$ cannot be matched to any available method, ChatGPT is prompted with the domain, the current state, and $t$.
- The returned sequence of primitive subtasks is parsed and injected in place of $t$, followed by a special verifier task whose operator checks that the claimed effects hold in the resulting state.
- Planning backtracks if the decomposition fails the verifier.
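One way to realize the verifier, assuming (as the papers describe for annotated tasks) that each compound task's declared effects are available; `declared_effects` is a hypothetical accessor:

```python
def make_verifier(task, problem):
    """Install a primitive verifier whose preconditions are the compound
    task's declared effects: if the LLM decomposition failed to achieve
    them, the verifier's preconditions fail and the planner backtracks."""
    name = f"verify-{task.name}"
    problem.operators[name] = Operator(
        name=name,
        preconditions=frozenset(declared_effects(task)),  # hypothetical accessor
        add_effects=frozenset(),                          # the verifier changes nothing
        del_effects=frozenset(),
    )
    return Task(name=name, args=task.args, primitive=True)
```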
This integration is formalized through two inference schemas:
- Symbolic-method decomposition
- LLM-approximation decomposition
The resulting plan is guaranteed to satisfy the effects of every task in the hierarchy, even when decompositions are sourced from an approximate LLM (Munoz-Avila et al., 17 May 2025).
3. Online Learning of HTN Methods
A critical extension, the ChatHTN "method learner", allows the system to generalize from LLM-derived decompositions:
- After a successful LLM decomposition, the primitive sequence and corresponding state transitions are recorded.
- Preconditions are computed by regression: the task's effects are regressed backwards through the recorded action sequence to recover a minimal set of preconditions sufficient to guarantee those effects when the subtasks are executed.
- All constants in the method head, preconditions, and subtask sequence are lifted to variables, yielding a generalized method $m'$.
- The method library is updated with $m'$, so future analogous tasks can be decomposed symbolically without recourse to the LLM (Xu et al., 17 Nov 2025).
Termination methods are also systematically constructed for every annotated task, enabling do-nothing default behavior when the corresponding effects are already satisfied.
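A sketch of these steps over the structures from Section 1; the paper's exact regression and lifting formulations may differ in detail:

```python
def regress(goal_atoms, actions):
    """Goal regression: the weakest atom set that must hold *before* the
    ground action sequence so that `goal_atoms` hold after executing it."""
    needed = set(goal_atoms)
    for op in reversed(actions):          # walk the recorded operators backwards
        needed = (needed - op.add_effects) | op.preconditions
    return needed

def lift(atoms, subtasks):
    """Replace constants with variables, consistently across the regressed
    preconditions and the subtask sequence, to generalize the method."""
    var = {}
    def v(c):
        return var.setdefault(c, f"?x{len(var)}")
    lifted_pre = frozenset((a[0], *map(v, a[1:])) for a in atoms)
    lifted_subs = tuple(Task(t.name, tuple(map(v, t.args)), t.primitive)
                        for t in subtasks)
    return lifted_pre, lifted_subs

def termination_method(task, effects):
    """Do-nothing method, applicable when the task's effects already hold."""
    return Method(task_name=task.name,
                  preconditions=frozenset(effects),
                  subtasks=())
```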
4. Theoretical Properties and Complexity Analysis
Complexity
Classical HTN planning exhibits time complexity exponential in decomposition depth ($O(b^{d})$ for branching factor $b$ and depth $d$). ChatHTN bottlenecks on ChatGPT queries, but with learning, each compound task symbol in the domain triggers at most one LLM call (plus verification). Thus, the number of LLM calls per problem instance scales as $O(|T_c|)$, where $T_c$ is the set of distinct compound task symbols, sharply reducing repeated expensive queries relative to total task occurrences (Xu et al., 17 Nov 2025).
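As an illustrative bound (using the $b$, $d$, and $T_c$ above), the learner turns a per-occurrence query cost into a per-symbol one:

$$
\#\{\text{LLM calls per problem}\} \;\le\; |T_c| \;\ll\; \#\{\text{compound-task occurrences}\} \;=\; O(b^{d}).
$$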
Soundness and Completeness
- Soundness: Enforced through regression-verified methods and verifier tasks. No learned method is added unless executing its body is sufficient for the effects of the original compound task.
- Completeness (relative to oracle LLM): If ChatGPT produces a valid decomposition for each required compound symbol on first encounter, no future LLM queries are required for that symbol. The planner will eventually construct a plan if one exists and the LLM is correct at least once for each distinct compound (Xu et al., 17 Nov 2025).
5. Empirical Evaluation
ChatHTN and its learning extension were evaluated on domains including Logistics Transportation and Search & Rescue. Key evaluation criteria comprised:
- Number of ChatGPT queries per problem
- Percentage of instances solved
Empirical results demonstrate:
- The online learner consistently reduces LLM calls by approximately 50–70%.
- Planning success rates are maintained or improved, because memoization-like learning of decompositions reduces the accumulation of LLM errors.
- When only the highest-level method is missing (necessitating full-plan generation by ChatGPT), both baseline and learner perform worse, but the learner still shows modest gains.
Example summary table for calls and solve rates in Logistics:
| Method Removed | Avg. Calls (No Learner) | Avg. Calls (Learner) | Success % (No Learner) | Success % (Learner) |
|---|---|---|---|---|
| TM1 | 8.5 | 3.2 | 90 | 95 |
| TM2 | 9.1 | 2.9 | 85 | 92 |
| TPM2 (top) | 12.4 | 11.8 | 40 | 45 |
No statistical significance tests were reported; results are averaged over 30 trials per removal (Xu et al., 17 Nov 2025).
6. Limitations and Future Directions
The ChatHTN approach has several limitations:
- Learned methods are strictly linear (flat sequences of primitive tasks); no compound-subtask hierarchy, recursion, or loops are learned. This constrains generalization to tasks with fixed, non-recursive structure.
- When top-level methods are missing and LLMs are forced to produce end-to-end plans, errors compound and degrade results.
- Merging and generalizing multiple learned methods to minimize library size remains an open problem.
Proposed enhancements include:
- Allowing LLM-generated decompositions with mixed primitive and compound subtasks, enabling the discovery of multi-level HTN structures.
- Pattern mining on primitive sequences to induce recursive or iterative methods (e.g., for variable-size collections of entities).
- Improved generalization across learned methods through automated merging and structural induction (Xu et al., 17 Nov 2025).
7. Relationship to Broader HTN-LLM Planning Paradigms
ChatHTN's interleaving of symbolic and LLM-based decomposition contrasts with alternate LLM planning paradigms such as Hypertree Planning (HTP), which represent plans as hypertrees rather than strictly sequential task networks. In HTP, hierarchical decompositions are constructed as rooted, acyclic directed hypergraphs, enabling broader parallelism, divide-and-conquer, and constraint propagation through multi-chain expansion (Gui et al., 5 May 2025). Unlike ChatHTN, which is restricted to linearization via primitive task sequences, HTP enables more expressive multi-level, parallelizable planning at the expense of increased control complexity and resource demands.
The ChatHTN framework (Munoz-Avila et al., 17 May 2025) and its extensions provide a principled compromise between interpretability, provable soundness, and practical scalability for agentic LLM-integrated planning. Its blend of strict symbolic verification and flexible learning from LLM output sets a baseline for future developments in hybrid neuro-symbolic planning systems.