StackPlanner: Hierarchical Memory Control

Updated 27 May 2026

StackPlanner is a centralized, hierarchical multi-agent system that employs active task-level memory control to coordinate LLM collaboration.
It decouples high-level planning from subtask execution through explicit push, condense, and prune memory operations to mitigate error propagation.
Empirical evaluations on multi-hop QA benchmarks show that its reinforcement learning strategies significantly boost resource efficiency and long-horizon task performance.

StackPlanner is a centralized, hierarchical multi-agent system designed to enable robust, long-horizon collaboration among LLM-based agents through explicit task-experience memory management. The framework was introduced to address instability in multi-agent organization caused by context bloat, error accumulation, and poor cross-task generalization, limitations common in prior centralized LLM-agent architectures that lacked explicit memory control. By decoupling high-level coordination from subtask execution and introducing learnable, reinforcement-based memory management strategies, StackPlanner improves resource efficiency and collaborative problem-solving efficacy in agent systems (Zhang et al., 9 Jan 2026).

1. Central Concepts: Active Task-Level Memory Control

StackPlanner centers its design around "active task-level memory control" within a hierarchical agent architecture. The central coordinator operates strictly at the plan level, issuing one of three actions: Plan, Delegate, or Revise. All detailed subtask executions are delegated to specialized sub-agents (such as Search or Report agents). The principal innovation is that the central coordinator's own task memory is shielded from noisy, fine-grained logs, and instead tracks only explicitly filtered results and summarized actions.

Direct, explicit inspection and revision of the internal task stack is enabled through Revise actions. Rather than relying on incidental truncation or passive summarization, the coordinator can push, pop, condense, or prune frames in its task memory in response to coordination needs, detected errors, or context-length constraints. This establishes memory as a direct control parameter for long-horizon coordination rather than a passive outcome of context growth or ad-hoc summarization (Zhang et al., 9 Jan 2026).

2. Memory System Architecture

StackPlanner's memory system features three distinct memory structures:

Task Memory Stack (M): A dynamic stack maintained by the coordinator, where each entry (frame) records a coordination step (plan specification, sub-agent results) in a concise format. Stack operations include push (for new actions or results), pop/condense (replace a segment with an abstracted summary), and prune (drop selected frames and note the failure).
Subtask Frames: Each sub-agent maintains its own local context for reasoning and tool usage. Only filtered summaries are returned to and stored by the coordinator; raw logs are excluded from the stack.
Structured Experience Memory: This persistent, typed repository is indexed by user ID, task type, and semantic topic. It contains: (i) user profiles (preferences, behaviors), (ii) semantic memory (evidence, facts), and (iii) procedural SOPs (abstract plans). Entries are retrieved via an Experience Search agent using embedding-based similarity, then injected into the current stack as context for new tasks.
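The typed, indexed layout of the structured experience memory can be illustrated with a small in-memory store. This is a minimal sketch, not the paper's implementation: the names `ExperienceEntry`, `ExperienceStore`, `add`, and `lookup` are hypothetical, and the store keys entries only by user ID and task type, leaving the semantic-topic dimension to embedding-based retrieval.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Literal

# The three entry types named in the text; the string labels are assumptions.
Kind = Literal["profile", "semantic", "procedural"]


@dataclass(frozen=True)
class ExperienceEntry:
    kind: Kind        # user profile, semantic fact, or procedural SOP
    user_id: str
    task_type: str
    topic: str        # free-text topic label (hypothetical field)
    content: str


class ExperienceStore:
    """Hypothetical store indexed by (user_id, task_type)."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, entry):
        self._by_key[(entry.user_id, entry.task_type)].append(entry)

    def lookup(self, user_id, task_type, kind=None):
        # .get avoids creating empty buckets for unknown keys.
        entries = self._by_key.get((user_id, task_type), [])
        return [e for e in entries if kind is None or e.kind == kind]
```

Keeping the entry type as an explicit field lets a caller request only procedural SOPs when drafting a plan, or only semantic facts when answering.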

Experience retrieval is driven by cosine similarity between the current task embedding and memory entries, with attention weights assigned for relevance. Content is injected into the stack by weighted sum across top-K retrieved entries (Zhang et al., 9 Jan 2026).
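The retrieval step described above can be sketched as follows. The source does not specify how the attention weights are obtained, so this example assumes a softmax over the cosine scores of the top-K entries; the function names and the `temperature` parameter are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(query, entries, k=2, temperature=1.0):
    """entries: list of (text, embedding) pairs.

    Returns [(text, weight)] for the k most similar entries, with weights
    from a softmax over their cosine scores (an assumption of this sketch).
    """
    scored = sorted(
        ((cosine(query, emb), text) for text, emb in entries), reverse=True
    )[:k]
    if not scored:
        return []
    top = max(score for score, _ in scored)  # subtract max for stability
    exps = [math.exp((score - top) / temperature) for score, _ in scored]
    total = sum(exps)
    return [(text, e / total) for (_, text), e in zip(scored, exps)]


hits = retrieve_top_k(
    [1.0, 0.0],
    [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])],
    k=2,
)
print([text for text, _ in hits])  # ['a', 'c']
```

The returned weights sum to one, so they can serve directly as coefficients when the retrieved content is combined before injection into the stack.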

3. Memory Operations, Policy Learning, and Algorithmic Loop

Memory Operations

Update (Push): On new action or filtered result, append as new stack frame.
Condensation (Pop & Summarize): Replace a contiguous stack segment with a concise summary generated via LLM template prompt.
Pruning: Direct deletion of frames from the stack (replaced by a terse failure note), typically when error accumulation is detected.
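The three operations above can be sketched on a plain Python list. This is an illustrative simplification: the class and method names are hypothetical, frames are plain strings, condensation and pruning act only on the top `n` frames, and `summarize` is any callable standing in for the LLM template prompt.

```python
class TaskStack:
    """Hypothetical coordinator task memory with push, condense, and prune."""

    def __init__(self):
        self.frames = []

    def push(self, frame):
        # Update: append a new action or filtered result as a frame.
        self.frames.append(frame)

    def condense(self, n, summarize):
        # Condensation: pop the top n frames, push one summary in their place.
        n = min(n, len(self.frames))
        if n == 0:
            return
        block = self.frames[-n:]
        del self.frames[-n:]
        self.frames.append(summarize(block))

    def prune(self, n, note="[pruned: failed branch]"):
        # Pruning: drop the top n frames and leave a terse failure note.
        n = min(n, len(self.frames))
        if n == 0:
            return
        del self.frames[-n:]
        self.frames.append(note)


stack = TaskStack()
stack.push("plan: find the director")
stack.push("search: director is X")
stack.push("search: X was born in Y")
stack.condense(2, lambda block: f"summary of {len(block)} frames")
print(stack.frames)  # ['plan: find the director', 'summary of 2 frames']
```

Both condense and prune leave exactly one frame where several stood, which is what keeps the coordinator's context bounded as the task grows.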

The central agent’s learning objective is

$\max_\theta \mathbb{E}_{q,y\sim\pi_\theta}[r_\phi(q, y)] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta(\cdot) \| \pi_\mathrm{ref}(\cdot))$

where $q$ is sampled from the task distribution, $y$ is the action trajectory, and $r_\phi$ is the environment reward. Optimization employs Group Relative Policy Optimization (GRPO): for a batch of $K$ rollouts, token-level normalized advantages

$\hat{A}_i^{(k)} = [r_i^{(k)} - \mathrm{mean}(\mathcal{R}_G)] / \mathrm{std}(\mathcal{R}_G)$

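are computed and used with PPO-style clipping. Here $\mathcal{R}_G$ denotes the rewards of the rollouts in the same group, so each reward is standardized against its group's mean and standard deviation.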
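The group-relative normalization and the clipped surrogate can be illustrated numerically. This is a sketch under stated assumptions rather than the paper's training code: it uses the population standard deviation, adds a small `eps` for numerical safety, applies one advantage per rollout, and leaves out the KL penalty term of the objective.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one probability ratio."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts for one question, two of which earned reward 1.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])  # [1.0, -1.0, -1.0, 1.0]
print(clipped_term(1.5, 1.0))      # 1.2
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.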

Policy Loop Pseudocode

# Helper names are placeholders for coordinator, sub-agent, and memory routines.
M = empty stack                                  # coordinator task memory
E = ExperienceSearch(userID, initial_spec)       # retrieve relevant experience
push summaries of E onto M
subtask_spec = None                              # set by the first Plan action

for t in range(1, T+1):
    action = CoordinatorPolicy(q, M)             # Plan, Delegate, or Revise
    if action == 'Plan':
        subtask_spec = GeneratePlan(M)
    elif action == 'Delegate' and subtask_spec is not None:
        result = InvokeSubAgent(subtask_spec)
        push FilteredSummary(result) onto M      # raw logs stay with the sub-agent
    elif action == 'Revise':
        if TooLong(M) or StageComplete(M):       # condense
            block = PopBlock(M)
            m_prime = Summarize(block)
            push m_prime onto M
        if DetectedError(M):                     # prune
            PruneBlock(M)
            push FailureNote() onto M
    if TerminationCriteria(M):
        break
return FinalOutput(M)

Context bloat is managed actively through Revise operations, which condense or prune stale frames much as a garbage collector would; neither passive truncation nor a fixed-size window is used (Zhang et al., 9 Jan 2026).
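The coordination loop can also be exercised end to end with scripted stand-ins. The following is a runnable sketch, not the authors' code: `run_coordinator` and all of its callable parameters are hypothetical, the learned policy is replaced by a fixed action sequence, a result beginning with "ERROR" stands in for error detection, and pruning is checked before condensation so that a failure is not folded into a summary.

```python
def run_coordinator(question, policy, sub_agent, summarize, done,
                    max_steps=10, max_frames=4):
    stack = []       # task memory: list of (kind, text) frames
    subtask = None   # no plan exists until the first Plan action
    for _ in range(max_steps):
        action = policy(question, stack)
        if action == "Plan":
            subtask = f"subtask for: {question}"
            stack.append(("plan", subtask))
        elif action == "Delegate" and subtask is not None:
            # Only the returned summary enters the stack.
            stack.append(("result", sub_agent(subtask)))
        elif action == "Revise":
            if stack and stack[-1][1].startswith("ERROR"):
                stack.pop()                                  # prune
                stack.append(("note", "previous attempt failed"))
            elif len(stack) > max_frames:
                block = stack[:]                             # condense
                stack.clear()
                stack.append(("summary", summarize(block)))
        if done(stack):
            break
    return stack


# Scripted stand-ins for the policy and the sub-agent.
actions = iter(["Plan", "Delegate", "Revise", "Delegate"])
results = iter(["ERROR: timeout", "answer: Paris"])
final = run_coordinator(
    "capital of France?",
    policy=lambda q, m: next(actions, "Revise"),
    sub_agent=lambda spec: next(results),
    summarize=lambda block: f"{len(block)} frames condensed",
    done=lambda m: bool(m) and m[-1][1].startswith("answer"),
)
print([kind for kind, _ in final])  # ['plan', 'note', 'result']
```

The failed first attempt leaves only a one-line note in the final stack, which is the behavior pruning is meant to produce.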

4. Empirical Evaluation and Observed Impact

StackPlanner's evaluation targets memory efficiency, error accumulation, and long-range task generalization. On four multi-hop QA benchmarks (2Wiki, MuSiQue, GAIA, FRAMES) with a 3B model backbone, StackPlanner achieves F1 scores of 32.92, 16.48, 7.71, and 16.23, outperforming all comparison baselines. Ablations show that omitting task memory causes a 3–5 point F1 drop, while removing experience memory results in a 4–8 point loss; removing both leads to a 15–16 point degradation.

Out-of-distribution performance gains on the more complex tasks (MuSiQue, GAIA, FRAMES) are attributed to structured experience retrieval. Although the paper does not plot explicit context-length curves, StackPlanner's performance on long-horizon tasks is stable, which is consistent with active condensation and pruning controlling context bloat and restricting error propagation (Zhang et al., 9 Jan 2026).

5. Architectural Rationale and Significance

The StackPlanner architecture is designed to:

Decouple plan-level and execution-level memory, exposing task memory as an actionable and inspectable resource for the central coordinator.
Enable learning-based policies to determine when and how to perform memory operations (push, condense, prune), facilitating closed-loop, reinforcement-driven memory management.
Allow cross-task experience reuse and adaptation through structured memory retrieval and integration, thereby improving out-of-distribution generalization.
Minimize information dilution and error compounding by filtering and summarizing sub-agent results before inclusion into the plan-level memory stack.
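The separation between execution-level and plan-level memory in the list above can be sketched with a sub-agent that keeps its own log and returns only a short summary. Everything here is hypothetical: the `SubAgentFrame` class, the `tool` callable, and the first-line truncation, which is a crude stand-in for whatever filtering the sub-agents actually perform.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgentFrame:
    """Hypothetical sub-agent with private, execution-level context."""
    name: str
    log: list = field(default_factory=list)  # raw log, never sent upward

    def run(self, subtask, tool):
        self.log.append(f"received: {subtask}")
        raw = tool(subtask)
        self.log.append(f"tool output: {raw}")
        return self.filtered_summary(raw)

    def filtered_summary(self, raw, max_chars=80):
        # Keep only the first non-empty line, truncated; a placeholder
        # for LLM-based filtering and summarization.
        text = raw.strip()
        first_line = text.splitlines()[0] if text else "(no result)"
        return f"[{self.name}] {first_line[:max_chars]}"


agent = SubAgentFrame("Search")
summary = agent.run(
    "who directed the film?",
    tool=lambda spec: "Director: Jane Doe\nnoisy page content\nmore noise",
)
print(summary)  # [Search] Director: Jane Doe
```

The coordinator would push only `summary` onto its stack, while `agent.log` retains the full trace for the sub-agent's own reasoning.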

By transforming memory from a passive, unstructured history buffer into an actively controlled and reinforced subsystem, StackPlanner advances the long-horizon reliability and adaptability of multi-agent LLM-based systems (Zhang et al., 9 Jan 2026).

6. Limitations, Best Practices, and Future Extensions

Limitations:

No explicit compression formula is provided; summarization is template-driven, potentially introducing variability in stack size and detail.
Learned policies require sufficient training data to generalize decision-making about stack operations across diverse tasks.
The approach depends on high-quality, filtered summaries from sub-agents to prevent incomplete or noisy task stack frames.

Best Practices:

Expose memory control operations (push, condense, prune) as first-class policy decisions.
Train RL coordinator policies with structured rewards targeting both task success and minimal error propagation.
Use embedding-based retrieval for experience memory and dynamically inject relevant content into the stack upon task initialization.

Extensions:

Approaches such as dynamic, content-based stack frame reordering, probabilistic retrieval of cross-task experience, and more complex stack condensation strategies could enhance flexibility.
Integration with multi-modal sub-agents and support for non-textual experience memories would allow broader application domains.

StackPlanner's contributions demonstrate the centrality of explicit, actively managed task-experience memory to the scalability and reliability of LLM-based multi-agent systems operating on complex, knowledge-intensive tasks (Zhang et al., 9 Jan 2026).

References

StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management (2026)
