
LLM-Based Hierarchical TODO Decomposition

Updated 25 November 2025
  • LLM-based Hierarchical TODO Decomposition is a paradigm that deconstructs complex tasks into structured subtasks, enabling robust orchestration with multi-agent systems.
  • It employs formal models, dependency graphs, and scoring functions to assign domain-specific subtasks and manage parallel execution.
  • Empirical benchmarks demonstrate significant gains in accuracy, efficiency, and cost savings across applications like robotics, survey generation, and 6G management.

LLM-based Hierarchical TODO Decomposition is a paradigm for orchestrating LLMs and agent systems to robustly solve complex, ambiguous, or multi-stage problems by systematically splitting them into hierarchically structured sub-tasks (“TODOs”), routing these to specialized agents or tools, and aggregating the results. This methodology overcomes context window limitations, enables parallel and modular execution, and delivers improved solution quality. Modern designs are grounded in formalisms from automated planning, multi-agent systems, computational graph theory, and empirical workflow management.

1. Formal Models and Notation

Formally, the decomposition process begins with a high-level task $T \in \mathcal{T}$, which is transformed into a set or hierarchy of sub-tasks via a decomposition function:

$$D(T) = \{ t_1, t_2, \ldots, t_n \}$$

A directed acyclic dependency graph $\mathrm{Dep} \subseteq \{ (t_i \to t_j) \}$ encodes prerequisite relations between subtasks. Each sub-task $t_i$ is annotated with:

  • $d(t_i)$: domain/expertise label (e.g., “math calculation”, “flight search”)
  • $c(t_i)$: complexity estimate (e.g., token budget, number of steps)
  • $\mathrm{agent}(t_i)$: assigned agent (LLM specialist or external tool)
  • $\mathrm{status}(t_i) \in \{\text{Pending}, \text{In-Progress}, \text{Done}, \text{Failed}\}$
  • $\mathrm{result}(t_i)$: the output of solving $t_i$

The global solution is reconstructed as:

$$S_\text{final} = \text{Aggregate}\left( \{ \mathrm{result}(t_i) \mid t_i \in D(T) \} \right)$$

In multi-agent or multi-LLM workflows, assignment and prioritization are governed by scoring functions:

$$\mathrm{agent}(t_i) = \arg\max_{A_j} \text{Score}(A_j, d(t_i))$$

$$\text{Score}(A_j, d) = w_\text{domain} \cdot \text{Match}(A_j.\text{domain}, d) + w_\text{perf} \cdot \text{historical\_accuracy}(A_j, d)$$

This abstraction generalizes to recursive and cross-domain scenarios, such as tree-based mission planning for robots (Gupta et al., 27 Jan 2025), debate-based subtask planning for 6G management (Lin et al., 6 Jun 2025), and compositional workflows in code generation (Nakkab et al., 23 Jul 2024, Tang et al., 6 Dec 2024).
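The scoring-based assignment above can be sketched in Python. This is a minimal illustration, not the cited systems' implementation: the `Agent` fields, the default weights, and the exact-match `match` heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    domain: str
    historical_accuracy: dict  # domain -> accuracy in [0, 1]

def match(agent_domain: str, task_domain: str) -> float:
    """Crude Match(): 1.0 for an exact domain match, else 0.0."""
    return 1.0 if agent_domain == task_domain else 0.0

def score(agent: Agent, task_domain: str,
          w_domain: float = 0.6, w_perf: float = 0.4) -> float:
    """Score(A_j, d) = w_domain * Match + w_perf * historical_accuracy."""
    return (w_domain * match(agent.domain, task_domain)
            + w_perf * agent.historical_accuracy.get(task_domain, 0.0))

def assign_agent(agents: list, task_domain: str) -> Agent:
    """agent(t_i) = argmax over agents A_j of Score(A_j, d(t_i))."""
    return max(agents, key=lambda a: score(a, task_domain))
```

In practice the `match` term would be a softer similarity (e.g., embedding-based), but the argmax routing structure is unchanged.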

2. Decomposition and Orchestration Algorithms

A canonical orchestration pipeline proceeds in five distinct phases (Rasal et al., 26 Feb 2024):

  1. Requirement Elicitation: The orchestrator LLM interacts with the user, posing clarifying questions until the specification is sufficient. This leverages chain-of-thought prompting to uncover ambiguous or missing requirements.
  2. Task Decomposition: The orchestrator applies an LLM-driven split to produce a structured TODO list, modeled as a tree or DAG with explicit dependencies and stepwise domain annotations.
  3. Agent Assignment: Each subtask is routed to the agent or tool best suited by capability and prior observed accuracy (domain-specific routing).
  4. Parallel Subproblem Solving: Using a dependency graph and work queue (priority determined by topological order and/or complexity), subtasks are dispatched to agents as soon as all dependencies are satisfied. Execution is asynchronous and exploits available parallelism.
  5. Aggregation: Final solutions are synthesized by prompting the orchestrator LLM with the collection of subtask results to produce a coherent, user-facing response.

The following pseudocode from (Rasal et al., 26 Feb 2024) exemplifies this loop, with additional application-specific modules—such as utility-based robot task allocation (Gupta et al., 27 Jan 2025) and DSE-driven prompt generation for IC design (Tang et al., 6 Dec 2024)—refining assignment and aggregation strategies.

# Phase 1: requirement elicitation via clarifying questions
context = user_input
while requirements_incomplete(context):
    q = Orch.generate_follow_up_question(context)
    user_answer = query_user(q)
    context |= user_answer          # merge the answer into the working context

# Phase 2: LLM-driven decomposition into a TODO DAG
root_task = context
subtasks = Orch.decompose(root_task)

# Phase 3: domain-specific agent routing
for t in subtasks:
    t.assignee = select_agent(t.domain)
    queue.add(t)

# Phase 4: dependency-aware parallel solving
while queue:
    t = queue.pop_ready()           # a task whose dependencies are all Done
    t.status = "In-Progress"
    t.result = t.assignee.solve(t.description)
    t.status = "Done"
    queue.enqueue_ready_dependents(t)

# Phase 5: aggregation into a user-facing answer
S_final = Orch.aggregate({t.result for t in subtasks})
return S_final

3. Data Structures and Hierarchy Representation

Task hierarchies are predominantly managed as trees or DAGs. Each node represents a subproblem:

class TaskNode:
    id: str
    description: str
    domain: str
    complexity: float
    deps: List[str]          # IDs of prerequisite tasks
    assignee: AgentHandle
    status: Literal["Pending", "InProgress", "Done", "Failed"]
    result: Optional[Any]

Orchestrators maintain a mapping of task IDs to TaskNodes, a dependency graph (adjacency list), and a priority work queue.
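A minimal sketch of such a dependency graph plus priority work queue follows; the `Task` fields mirror the `TaskNode` schema above, and the priority policy (ready tasks ordered by ascending complexity) is one illustrative choice, not a prescribed one.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    complexity: float
    deps: list = field(default_factory=list)   # IDs of prerequisite tasks

class WorkQueue:
    """A task becomes ready once all its dependencies are Done;
    among ready tasks, lower complexity is popped first."""

    def __init__(self, tasks):                 # tasks: dict id -> Task
        self.tasks = tasks
        self.pending = {tid: set(t.deps) for tid, t in tasks.items()}
        self.dependents = defaultdict(list)    # reverse adjacency list
        for tid, t in tasks.items():
            for d in t.deps:
                self.dependents[d].append(tid)
        self.ready = [(t.complexity, tid)
                      for tid, t in tasks.items() if not t.deps]
        heapq.heapify(self.ready)

    def pop_ready(self):
        _, tid = heapq.heappop(self.ready)
        return self.tasks[tid]

    def mark_done(self, tid):                  # unlock downstream tasks
        for dep in self.dependents[tid]:
            self.pending[dep].discard(tid)
            if not self.pending[dep]:
                heapq.heappush(self.ready,
                               (self.tasks[dep].complexity, dep))
```

Topological order emerges implicitly: a task can only enter the heap after every prerequisite has been marked done.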

In multi-agent or modular agent systems, the entire workflow is a tree of services (Chao et al., 13 Oct 2025):

  • Root: high-level task (e.g., “Survey Generation”)
  • Intermediate: phase/functional modules (e.g., AnalysisPhase, SkeletonPhase)
  • Leaves: atomic LLM or tool servers (e.g., SearchServer, DigestServer)

Each module exposes one or more functions as standard protocols (e.g., MCP tool calls), facilitating distributed orchestration and plug-and-play module insertion.
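Such a tree of services can be represented as nested mappings. The module and server names below mirror the survey-generation example from the text, but the exact structure (and the `OutlineServer` leaf) is an illustrative sketch rather than the cited system's schema.

```python
# Root task -> phase modules -> atomic LLM/tool servers (leaves).
service_tree = {
    "SurveyGeneration": {
        "AnalysisPhase": ["SearchServer", "DigestServer"],
        "SkeletonPhase": ["OutlineServer"],
    }
}

def leaf_servers(tree):
    """Collect the atomic tool servers reachable from the root."""
    return [srv
            for phases in tree.values()
            for servers in phases.values()
            for srv in servers]
```

In a real deployment each leaf would be addressed through a standard protocol call (e.g., an MCP tool invocation) rather than a plain list entry.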

4. Mathematical Criteria for Split, Assignment, and Aggregation

Although not always formalized as explicit closed-form equations, the decomposition process is driven by:

  • Chain-of-Thought (CoT) Decomposition: $T \Rightarrow_{\mathrm{link}} D(T)$, where $\Rightarrow_{\mathrm{link}}$ models LLM-generated, stepwise breaking-down via CoT prompting.
  • Complexity Measures: Subtasks are defined so as to keep $c(t_i)$ (tokens or steps per subtask) below agent-specific thresholds, ensuring each is LLM-manageable (Chen et al., 20 Jul 2024). Theoretical analysis relates depth $D$, branching factor $b$, and per-node error $\varepsilon_D$ to overall workflow accuracy:

$$E_0 \leq b^D \cdot \varepsilon_D$$

$$C_\text{total} \leq \sum_{l=0}^{D-1} b^l \left[ C_\text{pre}(L_\text{sys}+m_l) + C_\text{dec}(L_\text{sys}+m_l,\, L_\text{dec}(m_l)) \right]$$

Optimization aims to set $m_l$, $b$, and $D$ so that $E_0$ stays under a target while $C_\text{total}$ is minimized.
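Under those two bounds, picking a decomposition configuration reduces to a small search over candidate $(b, D)$ pairs. The sketch below assumes simplified forms: a caller-supplied $\varepsilon(D)$ and flat per-level cost callables stand in for the full $C_\text{pre}$/$C_\text{dec}$ terms.

```python
def error_bound(b: int, D: int, eps) -> float:
    """Worst-case compounded error: E_0 <= b**D * eps(D)."""
    return b ** D * eps(D)

def total_cost(b: int, D: int, c_pre, c_dec) -> float:
    """Cost bound: sum over levels l of b**l * (C_pre(l) + C_dec(l)).
    c_pre / c_dec are assumed per-node cost callables of the level."""
    return sum(b ** l * (c_pre(l) + c_dec(l)) for l in range(D))

def pick_config(candidates, eps, e_target, c_pre, c_dec):
    """Among (b, D) pairs meeting the error target, minimize the cost bound."""
    feasible = [bd for bd in candidates
                if error_bound(*bd, eps) <= e_target]
    return min(feasible,
               key=lambda bd: total_cost(*bd, c_pre, c_dec),
               default=None)
```

The tension the search resolves: deeper trees shrink per-node error $\varepsilon_D$ (smaller subtasks) but multiply the $b^D$ error-propagation factor and the per-level overhead.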

  • Assignment Score: As above, $\text{Score}(A_j, d) = w_\text{domain}\cdot \text{Match} + w_\text{perf}\cdot \text{accuracy}$ controls agent routing (Rasal et al., 26 Feb 2024).
  • Task-robot matching: In multi-robot planning, assignment maximizes

$$\max \sum_{r\in R} \sum_{a \in T_r} u_a(r)$$

subject to deadline and sequentiality constraints, with utility $u_a(r) = \alpha\, q_a(r) - \beta\, d_a(r) - \gamma\, c_a(r)$ (Gupta et al., 27 Jan 2025).
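A greedy sketch of this utility-maximizing allocation follows; it ignores the deadline and sequentiality constraints for brevity, and the weight values and `qdc` lookup (quality, delay, cost per robot–task pair) are illustrative assumptions.

```python
def utility(q: float, d: float, c: float,
            alpha: float = 1.0, beta: float = 0.5, gamma: float = 0.5) -> float:
    """u_a(r) = alpha*q_a(r) - beta*d_a(r) - gamma*c_a(r)."""
    return alpha * q - beta * d - gamma * c

def greedy_assign(tasks, robots, qdc):
    """Assign each task to the robot with the highest utility for it.
    qdc(robot, task) -> (quality, delay, cost)."""
    assignment = {}
    for a in tasks:
        assignment[a] = max(robots, key=lambda r: utility(*qdc(r, a)))
    return assignment
```

A full solver would instead optimize the summed utility jointly under the constraints (e.g., via integer programming or auction methods); the greedy pass is only the per-task special case.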

5. Runtime Protocols

At runtime, the orchestrator operates as a long-lived service, executing the following protocol (Rasal et al., 26 Feb 2024, Chao et al., 13 Oct 2025):

  1. Instantiation: Ingest user input; dynamically clarify via question–answer loop.
  2. Decomposition: Generate the task DAG/tree, possibly interacting with the user for further disambiguation.
  3. Agent Dispatch: Assign ready subtasks to available agent instances (via frameworks such as LangChain or MCP).
  4. Concurrency and Monitoring: Track task status; upon completion of dependencies, schedule downstream tasks.
  5. Checkpointing and Fault Tolerance: Periodically record partial results to persist progress and allow recovery.
  6. Aggregation and Finalization: Aggregate subresults via LLM prompt or symbolic function; deliver final output.
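The dispatch, monitoring, and checkpointing steps can be sketched with `asyncio`. The node interface (`.deps`, `.status`, `.result`) and the JSON checkpoint format are assumptions for illustration; production orchestrators add timeouts, retries, and failure propagation.

```python
import asyncio
import json

async def _run(node, solve, done):
    node.status = "InProgress"
    node.result = await solve(node)             # agent/tool call (assumed async)
    node.status = "Done"
    await done.put(node.id)

async def orchestrate(tasks, solve, checkpoint_path="progress.json"):
    """Dispatch tasks whose dependencies are finished; checkpoint after each
    completion so a crashed run can resume from partial results."""
    done, finished, launched, running = asyncio.Queue(), set(), set(), []

    def launch_ready():
        for tid, t in tasks.items():
            if tid not in launched and set(t.deps) <= finished:
                launched.add(tid)
                running.append(asyncio.create_task(_run(t, solve, done)))

    launch_ready()
    while len(finished) < len(tasks):
        finished.add(await done.get())
        with open(checkpoint_path, "w") as f:   # fault-tolerance checkpoint
            json.dump({i: tasks[i].result for i in finished}, f)
        launch_ready()
    return {tid: t.result for tid, t in tasks.items()}
```

Independent subtasks run concurrently because each ready node gets its own `asyncio` task; dependents are launched only after their prerequisites land in `finished`.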

In advanced systems, “orchestra” agents holistically plan next tool invocations based on execution history and user feedback (Chao et al., 13 Oct 2025). Human-in-the-loop intervention may occur at key decision points (topic scope, outline restructuring, etc.).

6. Empirical Results and Comparative Benchmarks

Empirical evaluation demonstrates substantive gains in accuracy, reliability, and efficiency:

  • On GSM8K math (2–8 steps per task), a GPT-4 orchestrator with GPT-3.5-turbo specialists achieved a 73% solve-rate, outperforming single-agent and flat multi-agent approaches by 8–23 percentage points (Rasal et al., 26 Feb 2024).
  • In hierarchical debate for 6G network management, MCR (macro coverage rate) improved as follows for GPT-4o + GPT-4o-mini: 39.62% (baseline) → 49.75% (regular debate) → 81.19% (hierarchical debate), with similar lifts for other model combinations (Lin et al., 6 Jun 2025).
  • In chip design, hierarchical prompting delivered >30% token and >45% runtime savings compared to flat prompting, with pass@5 rates rising from 0–10% to >90% for certain architectures (Nakkab et al., 23 Jul 2024, Tang et al., 6 Dec 2024).
  • In multi-robot mission planning, LLM-constructed hierarchical trees yielded tractable, near-optimal plan alternatives with cost sublinear in the number of abstract tree nodes, with demonstrable flexibility across diverse mission types (Gupta et al., 27 Jan 2025).

These results generalize to domains including programming education (DBox: +0.198 correctness, +2.33 self-efficacy) (Ma et al., 26 Feb 2025) and cross-task zero-shot generalization in reinforcement learning (ReflexGrad: 67% trial-0 success, zero action loops) (Kadu et al., 18 Nov 2025).

7. Applications and Illustrative Examples

LLM-based hierarchical TODO decomposition frameworks are deployed in scenarios such as:

  • Travel planning: Decomposing user requests (“Book me a return flight...”) into flight search, amenity check, booking, with agent routing and dependency management (Rasal et al., 26 Feb 2024).
  • Robotics: Multi-level decomposition of missions (“Reunite mom with her lost child”) into compound and primitive subroutines, capability-aware agent assignment, and utility-maximizing task allocation (Gupta et al., 27 Jan 2025).
  • 6G Management: Hierarchical debate among LLMs for sub-task extraction (“Optimize RIS placement...”) and per-step solution refinement (Lin et al., 6 Jun 2025).
  • HDL/IC Generation: Recursive submodule generation (“64-to-1 MUX” → “8 MUX8-1” → “MUX2-1”) with simulation feedback in each TODO iteration (Nakkab et al., 23 Jul 2024, Tang et al., 6 Dec 2024).
  • Survey Generation and Planning: Modular orchestration of MCP servers for search, clustering, outline generation, and content refinement (Chao et al., 13 Oct 2025).
  • Programming Education: Co-decomposition of algorithmic tasks; learner-LLM step-tree alignment with dynamic hints and scaffolded code mapping (Ma et al., 26 Feb 2025).

Each of these exemplifies the translation of high-level, often ambiguous, natural language instructions into a structured, agent-executable workflow that supports parallelism, modular failure recovery, and extendability to new domains.

