Task–Step–State Hierarchies
- Task–Step–State hierarchies are structured models that decompose complex goals into high-level tasks, intermediate steps, and explicit state transitions.
- They are implemented through formal mechanisms like tree/DAG-based memory, finite-state machines, and hierarchical task networks, ensuring modularity and clarity.
- Empirical studies show these hierarchies boost long-horizon coherence, efficiency, and generalization in systems ranging from dialogue agents to procedural video analysis.
A Task–Step–State hierarchy is a structured model that decomposes complex procedural, planning, or control problems into three distinct levels: an overarching Task (high-level goal), intermediate Steps (subgoals, actions, or modules), and explicit States (encodings of the environment or process at each transition). This hierarchical principle underlies a diverse range of systems spanning classical AI planning, reinforcement learning, LLM agents, dialogue systems, and multimodal instructional video understanding. Recent research demonstrates that imposing an explicit Task–Step–State structure improves interpretability, long-horizon coherence, generalization, and policy modularity across modalities and settings.
1. Formal Models of Task–Step–State Hierarchies
The Task–Step–State paradigm is instantiated in multiple formal systems across AI subfields:
- Tree and DAG-based memory and control: In the Task Memory Engine (TME) framework, the task hierarchy is encoded as a Task Memory Tree (TMT), a directed (sometimes graph-augmented) structure where each node represents an executable step, with explicit fields for actions, input context, outputs, status, parent/child links, and cross-branch dependencies. States are embedded as fields per node or as stepwise status vectors (Ye, 11 Apr 2025).
- Hierarchical finite-state machines: In dialogue systems such as HierTOD, workflows are represented as ordered trees or chains of goals and subgoals. Dialogue execution is governed by a hierarchical FSM, whose state tracks both high-level phases (pending/executing) and fine-grained substate indices (Mo et al., 2024).
- Hierarchical Task Networks (HTN): HTN planning formalizes a three-level mapping: the (compound) Task level decomposes into a network of Steps (method instantiations and primitive actions), which in turn drive sequential transitions between explicit World States. Task networks are partial orders over steps, and the execution induces a corresponding sequence of world states (Georgievski et al., 2014).
- Stack-augmented policies: The SteP framework for web automation models the agent's state as a stack of policies, where the root policy addresses the overall Task, stacks are dynamically manipulated (push/pop) to delegate Steps, and each stack frame tracks its own history (state) (Sodhi et al., 2023).
- Procedural video representation: SCHEMA and related works construct TSS (Task–Step–State) hierarchies for video understanding, mapping high-level procedure to sequences of Steps, each explicitly annotated by before/after/intermediate States, and aligning visual content to this triple via cross-modal contrastive learning (Zhao et al., 25 Nov 2025, Niu et al., 2024).
This tri-level structure is typically encoded as a directed acyclic graph (DAG), tree, or FSM, with Step-nodes mediating between Task and State layers and supporting operations such as decomposition, status tracking, dependency management, and prompt/context synthesis.
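The node structure described above can be sketched as a small Python class. This is an illustrative reconstruction, not the exact TME API: the field names (`action`, `input_context`, `status`, `depends_on`) mirror the fields listed for Task Memory Tree nodes, but the method names are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Status(Enum):
    WAITING = "waiting"
    ACTIVE = "active"
    DONE = "done"
    FAILED = "failed"

@dataclass
class StepNode:
    """One executable Step in a Task Memory Tree (illustrative fields)."""
    name: str
    action: str
    input_context: str = ""
    output: Optional[str] = None
    status: Status = Status.WAITING
    parent: Optional["StepNode"] = None
    children: list["StepNode"] = field(default_factory=list)
    depends_on: list["StepNode"] = field(default_factory=list)  # cross-branch edges

    def add_child(self, child: "StepNode") -> "StepNode":
        """Attach a sub-step with explicit parent/child links."""
        child.parent = self
        self.children.append(child)
        return child

    def path_to_root(self) -> list["StepNode"]:
        """Active root-to-leaf path, used for context synthesis."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return list(reversed(path))

# Usage: a Task node with one active Step beneath it
root = StepNode("book-trip", action="plan")
s1 = root.add_child(StepNode("search-flights", action="tool_call"))
s1.status = Status.ACTIVE
```

The `depends_on` list is what distinguishes a plain tree from the graph-augmented variant: cross-branch edges live there rather than in the parent/child links.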
2. Algorithmic Construction and Execution Mechanisms
The construction and runtime execution of a Task–Step–State hierarchy involves several algorithmic components:
- Insertion and linking: New steps (subgoals, actions) are inserted as nodes in the active hierarchy (e.g., INSERT_STEP in TME) with explicit parent/child links. Cross-branch or reusable substeps are managed using additional dependency edges to realize DAGs and shared execution contexts (Ye, 11 Apr 2025, Tianxing et al., 26 Jun 2025).
- Node update and state tracking: Each node maintains a vector of fields (action, input, output, status) and is updated as steps are executed (UPDATE_NODE). Execution status transitions (e.g., waiting/active/done/failed) are central to traversal and pruning (Ye, 11 Apr 2025).
- Stateful prompt/context synthesis: Systems like TME and StateFlow synthesize LLM prompts or action contexts by traversing the active node path (root-to-leaf or current FSM state), assembling only the relevant subpath and applying recency-based weighting (e.g., λ-depth weighting) to focus context on recent Steps and the current State (Ye, 11 Apr 2025, Wu et al., 2024).
- Decomposition and termination policies: Embodied agents (e.g., STEP Planner) employ paired models to recursively decompose high-level goals into Steps by prompting foundation models, and use environment-driven termination models to determine when a Step is atomic (i.e., executable as a primitive action) or requires further decomposition, based on both affordance and current state (Tianxing et al., 26 Jun 2025).
- Dynamic controller stacks: In stack-based control (SteP; Reward Machine hierarchies), Step transitions are implemented as stack operations (push sub-policy, pop on completion), and task-level rewards or transitions are mediated by stack unwinding (Sodhi et al., 2023, Furelos-Blanco et al., 2022).
A central theme is the integration of structured symbolic or neural state into the stepwise transition logic, enabling more granular progress monitoring, context-specific prompting, and systematic error recovery.
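The primitives above can be sketched together in a few lines. The dictionary layout and the exact decay rule are expository assumptions; only the primitive names (INSERT_STEP, UPDATE_NODE) and the idea of λ-depth recency weighting come from the source.

```python
# Illustrative TME-style primitives: INSERT_STEP, UPDATE_NODE, and
# λ-depth-weighted context synthesis over the active root-to-leaf path.

def insert_step(tree, parent_id, step_id, action):
    """INSERT_STEP: add a waiting node under parent_id with explicit links."""
    tree[step_id] = {"action": action, "output": None, "status": "waiting",
                     "parent": parent_id, "children": []}
    tree[parent_id]["children"].append(step_id)

def update_node(tree, step_id, output, status):
    """UPDATE_NODE: record a step's result and its status transition."""
    tree[step_id].update(output=output, status=status)

def synthesize_context(tree, leaf_id, lam=0.5):
    """Traverse root→leaf and weight each step by recency: the current
    step gets weight 1, each ancestor level decays by a factor of λ."""
    path, node = [], leaf_id
    while node is not None:
        path.append(node)
        node = tree[node]["parent"]
    path.reverse()
    depth = len(path)
    return [(sid, lam ** (depth - 1 - i)) for i, sid in enumerate(path)]

# Usage: build a two-step branch, finish step s1, assemble context at s2
tree = {"task": {"action": "root", "output": None, "status": "active",
                 "parent": None, "children": []}}
insert_step(tree, "task", "s1", "search")
insert_step(tree, "s1", "s2", "filter")
update_node(tree, "s1", output="10 results", status="done")
weighted = synthesize_context(tree, "s2")
```

Here `weighted` comes back as `[("task", 0.25), ("s1", 0.5), ("s2", 1.0)]`, so a downstream prompt builder would emphasize the current State and its immediate predecessor over distant ancestors.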
3. Empirical Impact and Benchmark Results
Across domains, explicit Task–Step–State hierarchies deliver improved robustness, interpretability, data efficiency, and generalization:
| System/Domain | Hierarchy Structure | Key Empirical Gains (SR = success rate) |
|---|---|---|
| TME (LLM agents) | Step-based trees/DAGs | Accuracy ↑, reduced hallucinations, prompt savings |
| StateFlow | FSM (states/steps) | +13–28 points SR, 3–5x cost reduction (SQL, ALFWorld) |
| STEP (embodied) | Subgoal trees | SR up to 34% (WAH-NL), 4–6× over SOTA for long tasks |
| HierTOD (dialogue) | Goal/step chains | Higher task completion, effective mixed initiative |
| Video TSS | Task–Step–State | +2–4% accuracy in task/step/next prediction |
| SteP (web RL) | Dynamic policy stack | 2.3× lower context, SR↑ (e.g. 0.23→0.36 WebArena) |
For instance, in StateFlow, ablation studies demonstrate that removing "Observe", "Error", or "Verify" macro-states individually reduces success rates by 1.5–6 points, confirming the necessity of each layer. STEP Planner demonstrates that removing the subgoal tree structure drops embodied task success rates from 40% to 8% (WAH-NL), directly illustrating the causal impact of the hierarchy (Wu et al., 2024, Tianxing et al., 26 Jun 2025). Similar improvements are found in procedural video understanding, where adding "state" supervision in TSS yields superior performance across recognition and next-step forecasting benchmarks (Zhao et al., 25 Nov 2025).
4. Cross-domain Instantiations: Planning, RL, Dialogue, Video
Hierarchical Task Network Planning
HTN planners formalize TSS by decomposing compound Tasks into method networks (Steps) whose preconditions and effects are defined over States; the execution semantics connect primitive actions (Steps) to State transitions (Georgievski et al., 2014). Key distinctions exist between plan-based and state-based models, but all maintain an explicit mapping from task objectives, through networks of steps, to induced world-state trajectories.
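This decomposition semantics can be sketched as a minimal total-order HTN planner. The domain (a door-entry toy task), the method table, and the precondition/effect encoding are all illustrative assumptions; the recursion itself is the standard HTN pattern of compound task → method steps → primitive state transitions.

```python
# Minimal HTN-style decomposition sketch (total-order, illustrative domain).
# A compound task maps to a method (an ordered network of sub-steps); a
# primitive step applies directly as a state transition if its precondition
# holds in the current state.

state = {"door_open": False, "inside": False}

primitives = {  # name -> (precondition, effect) over the world state
    "open_door": (lambda s: not s["door_open"], lambda s: {**s, "door_open": True}),
    "walk_in":   (lambda s: s["door_open"],     lambda s: {**s, "inside": True}),
}

methods = {
    "enter_room": ["open_door", "walk_in"],  # Task -> network of Steps
}

def plan(task, s):
    """Decompose `task` over state `s`; return (plan, final_state) or None."""
    if task in primitives:
        pre, effect = primitives[task]
        return ([task], effect(s)) if pre(s) else None
    steps, cur = [], s
    for sub in methods[task]:
        result = plan(sub, cur)
        if result is None:
            return None            # decomposition fails in this state
        sub_plan, cur = result     # thread the induced state forward
        steps += sub_plan
    return steps, cur

p, final = plan("enter_room", state)
```

The returned plan is the Step sequence `["open_door", "walk_in"]`, and `final` is the induced world state with both fluents true, making the Task–Step–State mapping explicit in code.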
Hierarchical RL and Reward Machines
Hierarchies in HRL are constructed via SARM-HSTRL (sequential association rule mining over RL trajectories) or as options hierarchies encoded by Reward Machines with call-edges (HRMs). These form DAGs where each subtask (Step) is associated with subgoal-exit states (State), and optimality of policies is preserved under the hierarchy (Ghazanfari et al., 2018, Furelos-Blanco et al., 2022).
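The call-edge stack semantics can be sketched as follows. The machine names, label alphabet, and eager call-expansion rule are assumptions for exposition; the essential behavior matches the description above: a call-edge pushes a child machine, and reaching the child's accepting state pops it and advances the caller.

```python
# Sketch of hierarchical reward machine (HRM) execution with call-edges.
# Transition tables and the two toy machines are illustrative.

TRANS = {  # (machine, state, label) -> next state
    ("root", 1, "deliver"): 2,
    ("fetch", 0, "goto_item"): 1,
    ("fetch", 1, "pickup"): 2,
}
CALLS = {("root", 0): "fetch"}   # state 0 of root calls the fetch machine
RETURN = {("root", 0): 1}        # caller resumes here when callee accepts
ACCEPT = {"root": 2, "fetch": 2}

def run(labels):
    """Return True iff the label sequence drives the hierarchy to acceptance."""
    stack = [["root", 0]]  # frames: [machine, current state]
    for lab in labels:
        if not stack:
            return False  # already accepted; trailing labels are rejected
        while (stack[-1][0], stack[-1][1]) in CALLS:   # expand call-edges
            stack.append([CALLS[(stack[-1][0], stack[-1][1])], 0])
        rm, q = stack[-1]
        nxt = TRANS.get((rm, q, lab))
        if nxt is None:
            return False
        stack[-1][1] = nxt
        while stack and stack[-1][1] == ACCEPT[stack[-1][0]]:  # pop on accept
            stack.pop()
            if stack:
                stack[-1][1] = RETURN[(stack[-1][0], stack[-1][1])]
    return not stack
```

With this toy hierarchy, `run(["goto_item", "pickup", "deliver"])` accepts: the root delegates the first two labels to `fetch` through its call-edge, resumes at state 1, and accepts on `deliver`. Each machine corresponds to a Step with its subgoal-exit state, as described above.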
LLM Agent and Dialogue Systems
TME, HierTOD, and SteP demonstrate LLM agent/task frameworks where workflows or dialogues are encoded as trees, FSMs, or dynamically stacked policies. Each step or subgoal may maintain its own context, slot-level belief state, or memory vector, and the control logic is expressed as state-driven transitions or traversal (Ye, 11 Apr 2025, Mo et al., 2024, Sodhi et al., 2023).
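The stack-of-policies pattern can be made concrete with a short sketch. The policy names, actions, and decision encoding are hypothetical; the structural point is SteP's: each frame keeps its own local history, a policy decision either acts, pushes a sub-policy to delegate a Step, or pops when its Step is done.

```python
# Minimal stack-of-policies controller sketch (illustrative, not SteP's API).
# A policy is a function of its frame's local history returning one of:
#   ("act", action)  -> emit an action
#   ("push", name)   -> delegate a Step to a sub-policy
#   ("pop", None)    -> the Step is complete

def run_task(policies, root):
    stack = [{"policy": root, "history": []}]
    actions = []
    while stack:
        frame = stack[-1]
        decision = policies[frame["policy"]](frame["history"])
        frame["history"].append(decision)        # each frame tracks its own state
        kind, arg = decision
        if kind == "act":
            actions.append(arg)
        elif kind == "push":
            stack.append({"policy": arg, "history": []})
        else:  # "pop"
            stack.pop()
    return actions

def root_policy(hist):
    if not hist:
        return ("push", "login")       # delegate the login Step first
    if len(hist) == 1:
        return ("act", "open_dashboard")
    return ("pop", None)

def login_policy(hist):
    if not hist:
        return ("act", "type_credentials")
    if len(hist) == 1:
        return ("act", "click_submit")
    return ("pop", None)

actions = run_task({"root": root_policy, "login": login_policy}, "root")
```

The resulting action trace is `["type_credentials", "click_submit", "open_dashboard"]`: the login sub-policy runs to completion, is popped, and the root policy resumes from its own history, never seeing the sub-policy's internal steps.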
Instructional Video Representation
In video, both SCHEMA and the TSS framework argue that explicit "states" (observable world or object configurations) are essential to bridge the semantic gap between high-level task/step labels and raw pixel data. Progressive learning curricula further enforce the hierarchical structure on representation learning (Zhao et al., 25 Nov 2025, Niu et al., 2024).
5. Graph-aware and DAG Extensions
Traditional tree-structured hierarchies are increasingly generalized to support cross-branch dependencies, shared substeps, rollbacks, and merges:
- TME DAG extension: Nodes may have multiple parents, supporting substep reuse and path merging. Edges E_dep capture "depends_on", "merge_with", or "rollback" relations. Formal acyclicity ensures well-foundedness (Ye, 11 Apr 2025).
- Reward Machine HRMs: Each RM can call others recursively, and a stack-based call semantics enforces hierarchical progress; this structure enables modular options learning and transfer (Furelos-Blanco et al., 2022).
- SteP recursion: Dynamic stacking allows for deep, unbounded function-like calls, enabling recursive web navigations or decision processes that are not possible in fixed-depth trees (Sodhi et al., 2023).
In all cases, the shift from trees to DAGs or stack-augmented graphs increases expressiveness and supports compositional reuse, while preserving the acyclicity constraints that keep execution well-founded.
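Checking that well-foundedness in practice reduces to a standard cycle test over the dependency edges. A minimal sketch using Kahn's algorithm (node names and edge encoding are illustrative):

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: return an execution order for step nodes that
    respects depends_on-style edges, or None if a cycle makes the
    hierarchy ill-founded."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for prereq, step in edges:      # edge (prereq, step): step depends on prereq
        succ[prereq].append(step)
        indeg[step] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order if len(order) == len(nodes) else None

# A shared substep "fetch_data" feeding two downstream steps (DAG, not a tree):
nodes = ["task", "fetch_data", "clean", "report"]
edges = [("task", "fetch_data"), ("fetch_data", "clean"),
         ("fetch_data", "report"), ("clean", "report")]
order = topological_order(nodes, edges)
```

A valid order such as `["task", "fetch_data", "clean", "report"]` comes back for the DAG, while a graph with a `depends_on` cycle returns `None`, which is exactly the condition a rollback or merge operation must re-verify after rewiring edges.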
6. Evaluation Protocols, Limitations, and Open Problems
Evaluation relies on a mix of human and automatic metrics:
- Script generation: Perplexity, ROUGE-L, BLEU, Distinct-3, and segment distance, plus human preference for goal achievement and subgoal faithfulness (Li et al., 2023).
- Planning/RL: Task and subgoal success rate, reward convergence, learning speed, and hierarchy validity (Ghazanfari et al., 2018, Tianxing et al., 26 Jun 2025).
- Video: Multi-level (task/step/state) classification or forecasting, cross-modal retrieval, and representation ablations, demonstrating key gains when state supervision is enforced (Zhao et al., 25 Nov 2025).
Known limitations include:
- Most hierarchical script generation and dialogue systems remain limited to two levels. Extension to deeper or more flexible nesting remains an open problem.
- High-fidelity step and state labeling requires either domain knowledge or high-capacity supervised/unsupervised models; segmentation and subgoal induction remain imperfect (Li et al., 2023).
- For LLM-agent paradigms, integration of rich world-state representations and robust environment feedback is nontrivial, as demonstrated by embodied and web-planning domains.
A plausible implication is that future research will focus on joint learning of decomposition (step segmentation), state induction, and flexible multi-level control, possibly under resource- or feedback-constrained settings.
7. Synthesis and Future Directions
Task–Step–State hierarchies unify planning, control, reasoning, and representation learning across classical AI, RL, LLM agents, and procedural video understanding. The formal separation of high-level intent, actionable plans, and concrete state transitions grounds execution, reduces error propagation, and enables interpretable monitoring and intervention. Empirical results consistently demonstrate substantial advantages in long-horizon coherence, efficiency, and transferability.
Future work may extend these hierarchies to deeper, more recursive, and cross-modal settings, refine methods for automated hierarchy extraction, and integrate continuous or structured state descriptors (e.g., graphs or spatial representations) at each layer. The emergence of hybrid neural-symbolic implementations and graph-aware extensions further suggests that Task–Step–State frameworks will underpin next-generation autonomous agents and embodied reasoning systems (Ye, 11 Apr 2025, Zhao et al., 25 Nov 2025).