ChatHTN Planner: Hybrid HTN & LLM Integration
- ChatHTN Planner is a hybrid Hierarchical Task Network framework that integrates symbolic planning with LLM-based task decompositions while guaranteeing that the resulting plans are sound.
- The system interleaves classical HTN methods with ChatGPT queries, dramatically reducing expensive LLM calls while sustaining high planning success rates.
- An online method learner generalizes LLM-derived decompositions to update the method library, enhancing scalability and robustness in complex domains.
The ChatHTN planner is a hybrid Hierarchical Task Network (HTN) planning framework that tightly integrates symbolic HTN planning with on-demand decomposition queries to LLMs, notably ChatGPT. When no applicable method exists to decompose a compound task, ChatHTN prompts ChatGPT to produce a sequence of primitive subtasks; subsequent extensions of ChatHTN learn and generalize from these LLM-derived decompositions in an online fashion. This approach guarantees that resultant plans are provably sound with respect to task effects, while dramatically reducing reliance on costly LLM calls and maintaining high overall planning success rates (Xu et al., 17 Nov 2025, Munoz-Avila et al., 17 May 2025).
1. System Overview and Formal Definitions
ChatHTN operates on planning problems of the form $P = \langle T, s_0, M, O \rangle$:
- $T$: initial task list (ordered sequence of compound or primitive tasks)
- $s_0$: initial state (set of ground atoms)
- $M$: initial HTN method library
- $O$: set of primitive operators
Tasks are divided into:
- Primitive: $t$, associated with an operator $o \in O$
- Compound: $t$, decomposed via methods $m \in M$
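For concreteness, a minimal Python sketch of these structures; the field names (e.g., `add_effects`) are illustrative, not drawn from the papers:

```python
from dataclasses import dataclass

# A ground atom is a tuple such as ("at", "truck1", "locA"); a state is a set of atoms.

@dataclass(frozen=True)
class Task:
    name: str          # task symbol
    args: tuple        # ground arguments
    primitive: bool    # primitive tasks map to operators; compound tasks to methods

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset  # atoms that must hold before application
    add_effects: frozenset    # atoms added to the state
    del_effects: frozenset    # atoms removed from the state

@dataclass(frozen=True)
class Method:
    task_name: str            # compound task symbol this method decomposes
    preconditions: frozenset  # applicability condition in the current state
    subtasks: tuple           # ordered sequence of Task instances

@dataclass
class Problem:
    tasks: list       # T: initial ordered task list
    state: set        # s0: initial state
    methods: list     # M: initial method library
    operators: dict   # O: primitive operators, keyed by task name
```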
At each step, ChatHTN processes the head of the task list:
- If primitive, applies the operator if preconditions permit, updates state, and appends to the plan.
- If compound and an applicable method $m \in M$ exists, applies $m$, replacing the task with its subtasks.
- If no method applies, queries ChatGPT for a decomposition into primitive tasks, inserts a verifier subtask to check the intended effects, and proceeds recursively (Xu et al., 17 Nov 2025, Munoz-Avila et al., 17 May 2025).
Soundness is achieved via strict effect checking: every decomposition (symbolic or LLM) is immediately followed by a verifier primitive that ensures the original compound's effects are realized.
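Building on the dataclasses above, a schematic of this loop (not the authors' implementation: backtracking and variable binding are elided, `llm_decompose` is an assumed ChatGPT wrapper, and `make_verifier` is sketched in the next section):

```python
def chathtn_plan(problem, llm_decompose, make_verifier):
    """Schematic ChatHTN loop over the structures above (no backtracking)."""
    tasks = list(problem.tasks)
    state = set(problem.state)
    plan = []
    while tasks:
        task = tasks.pop(0)                          # head of the task list
        if task.primitive:
            op = problem.operators[task.name]
            if not op.preconditions <= state:
                return None                          # a full planner backtracks here
            state = (state - op.del_effects) | op.add_effects
            plan.append(task)
        else:
            m = next((m for m in problem.methods
                      if m.task_name == task.name and m.preconditions <= state),
                     None)
            if m is not None:                        # symbolic decomposition
                tasks = list(m.subtasks) + tasks
            else:                                    # LLM approximation + verifier
                subtasks = llm_decompose(task, state)
                tasks = list(subtasks) + [make_verifier(task, problem)] + tasks
    return plan
```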
2. LLM Interleaving and Decomposition Procedure
ChatHTN interleaves classical HTN planning with LLM-based approximations:
- When a compound task $t$ cannot be matched to any available method, ChatGPT is prompted with the domain, the current state, and $t$.
- The returned sequence of primitive subtasks is parsed and injected in place of $t$, followed by a special verifier task whose operator checks that the claimed effects hold in the resulting state.
- Planning backtracks if the decomposition fails the verifier.
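One way to realize the verifier, assuming (as the papers describe for annotated tasks) that each compound task's declared effects are available; `declared_effects` is a hypothetical accessor:

```python
def make_verifier(task, problem):
    """Install a primitive verifier whose preconditions are the compound
    task's declared effects: if the LLM decomposition failed to achieve
    them, the verifier's preconditions fail and the planner backtracks."""
    name = f"verify-{task.name}"
    problem.operators[name] = Operator(
        name=name,
        preconditions=frozenset(declared_effects(task)),  # hypothetical accessor
        add_effects=frozenset(),                          # the verifier changes nothing
        del_effects=frozenset(),
    )
    return Task(name=name, args=task.args, primitive=True)
```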
This integration is formalized through two inference schemas:
- Symbolic-method decomposition
- LLM-approximation decomposition
The resulting plan is guaranteed to satisfy the effects of every task in the hierarchy, even when decompositions are sourced from an approximate LLM (Munoz-Avila et al., 17 May 2025).
3. Online Learning of HTN Methods
A critical extension, the ChatHTN "method learner", allows the system to generalize from LLM-derived decompositions:
- After a successful LLM decomposition, the primitive sequence and corresponding state transitions are recorded.
- Preconditions are computed by regression: the task's effects are regressed backwards through the recorded action sequence to recover a minimal set of preconditions sufficient to guarantee those effects when the subtasks are executed.
- All constants in the method head, preconditions, and subtask sequence are lifted to variables, yielding a generalized method $m'$.
- The method library is updated with $m'$, so future analogous tasks can be decomposed symbolically without recourse to the LLM (Xu et al., 17 Nov 2025).
Termination methods are also systematically constructed for every annotated task, enabling do-nothing default behavior when the corresponding effects are already satisfied.
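A sketch of these steps over the structures from Section 1; the paper's exact regression and lifting formulations may differ in detail:

```python
def regress(goal_atoms, actions):
    """Goal regression: the weakest atom set that must hold *before* the
    ground action sequence so that `goal_atoms` hold after executing it."""
    needed = set(goal_atoms)
    for op in reversed(actions):          # walk the recorded operators backwards
        needed = (needed - op.add_effects) | op.preconditions
    return needed

def lift(atoms, subtasks):
    """Replace constants with variables, consistently across the regressed
    preconditions and the subtask sequence, to generalize the method."""
    var = {}
    def v(c):
        return var.setdefault(c, f"?x{len(var)}")
    lifted_pre = frozenset((a[0], *map(v, a[1:])) for a in atoms)
    lifted_subs = tuple(Task(t.name, tuple(map(v, t.args)), t.primitive)
                        for t in subtasks)
    return lifted_pre, lifted_subs

def termination_method(task, effects):
    """Do-nothing method, applicable when the task's effects already hold."""
    return Method(task_name=task.name,
                  preconditions=frozenset(effects),
                  subtasks=())
```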
4. Theoretical Properties and Complexity Analysis
Complexity
Classical HTN planning exhibits time complexity exponential in decomposition depth ($O(b^{d})$ for branching factor $b$ and depth $d$). ChatHTN bottlenecks on ChatGPT queries, but with learning, each compound task symbol in the domain triggers at most one LLM call (plus verification). Thus, the number of LLM calls per problem instance scales as $O(|T_c|)$, where $T_c$ is the set of distinct compound task symbols, sharply reducing repeated expensive queries relative to total task occurrences (Xu et al., 17 Nov 2025).
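As an illustrative bound (using the $b$, $d$, and $T_c$ above), the learner turns a per-occurrence query cost into a per-symbol one:

$$
\#\{\text{LLM calls per problem}\} \;\le\; |T_c| \;\ll\; \#\{\text{compound-task occurrences}\} \;=\; O(b^{d}).
$$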
Soundness and Completeness
- Soundness: Enforced through regression-verified methods and verifier tasks. No learned method is added unless executing its body is sufficient for the effects of the original compound task.
- Completeness (relative to oracle LLM): If ChatGPT produces a valid decomposition for each required compound symbol on first encounter, no future LLM queries are required for that symbol. The planner will eventually construct a plan if one exists and the LLM is correct at least once for each distinct compound (Xu et al., 17 Nov 2025).
5. Empirical Evaluation
ChatHTN and its learning extension were evaluated on domains including Logistics Transportation and Search & Rescue. Key evaluation criteria comprised:
- Number of ChatGPT queries per problem
- Percentage of instances solved
Empirical results demonstrate:
- The online learner consistently reduces LLM calls by approximately 50–70%.
- Planning success rates are maintained or improved, because memoization-like learning of decompositions reduces the accumulation of LLM errors.
- When only the highest-level method is missing (necessitating full-plan generation by ChatGPT), both baseline and learner perform worse, but the learner still shows modest gains.
Example summary table for calls and solve rates in Logistics:
| Method Removed | Avg. Calls (No Learner) | Avg. Calls (Learner) | Success % (No Learner) | Success % (Learner) |
|---|---|---|---|---|
| TM1 | 8.5 | 3.2 | 90 | 95 |
| TM2 | 9.1 | 2.9 | 85 | 92 |
| TPM2 (top) | 12.4 | 11.8 | 40 | 45 |
No statistical significance tests were reported; results are averaged over 30 trials per removal (Xu et al., 17 Nov 2025).
6. Limitations and Future Directions
The ChatHTN approach has several limitations:
- Learned methods are strictly linear (flat sequences of primitive tasks); no compound-subtask hierarchy, recursion, or loops are learned. This constrains generalization to tasks with fixed, non-recursive structure.
- When top-level methods are missing and LLMs are forced to produce end-to-end plans, errors compound and degrade results.
- Merging and generalizing multiple learned methods to minimize library size remains an open problem.
Proposed enhancements include:
- Allowing LLM-generated decompositions with mixed primitive and compound subtasks, enabling the discovery of multi-level HTN structures.
- Pattern mining on primitive sequences to induce recursive or iterative methods (e.g., for variable-size collections of entities).
- Improved generalization across learned methods through automated merging and structural induction (Xu et al., 17 Nov 2025).
7. Relationship to Broader HTN-LLM Planning Paradigms
ChatHTN's interleaving of symbolic and LLM-based decomposition contrasts with alternate LLM planning paradigms such as Hypertree Planning (HTP), which represent plans as hypertrees rather than strictly sequential task networks. In HTP, hierarchical decompositions are constructed as rooted, acyclic directed hypergraphs, enabling broader parallelism, divide-and-conquer, and constraint propagation through multi-chain expansion (Gui et al., 5 May 2025). Unlike ChatHTN, which is restricted to linearization via primitive task sequences, HTP enables more expressive multi-level, parallelizable planning at the expense of increased control complexity and resource demands.
The ChatHTN framework (Munoz-Avila et al., 17 May 2025) and its extensions provide a principled compromise between interpretability, provable soundness, and practical scalability for agentic LLM-integrated planning. Its blend of strict symbolic verification and flexible learning from LLM output sets a baseline for future developments in hybrid neuro-symbolic planning systems.