H²Memory Framework: Scalable AI Memory
- H²Memory Framework is a collection of protocols that combine hierarchical abstraction and harmonic retrieval to support scalable reasoning in AI systems.
- It employs multi-layered semantic organization with cue anchors and index-based routing to optimize memory retrieval and reduce query latency.
- Dynamic update mechanisms and hardware-aware data placement ensure efficient operation in both cognitive architectures and high-performance computing contexts.
H²Memory Framework
H²Memory refers to a set of distinct but conceptually related frameworks addressing memory abstraction, retrieval, and management for large-scale reasoning agents and high-performance systems. In contemporary computational research, the moniker appears in cognitive-architectural (hierarchical agent memory), synergistic hardware–software, and adaptive data-tiering contexts. Central instantiations include the hierarchical memory for LLM agents (Sun et al., 23 Jul 2025), the head-aware heterogeneous memory manager for LLM inference (Hwang et al., 21 Apr 2025), the online-guided data placement framework for heterogeneous hardware (Olson et al., 2021), and harmonic abstraction–specificity balancing memory (Xia et al., 3 Feb 2026).
1. Hierarchical and Harmonic Memory Architectures in LLM Agents
H²Memory (also referred to as H-MEM or Memora) implements multi-level semantic organization of long-term agent memory, optimizing both context-aware retrieval and efficient scaling.
Hierarchical Layering
- Four semantic layers:
- Domain (coarse topics: e.g., "Movies")
- Category (subdomains: e.g., "Action Movies")
- Memory Trace (salient entities: e.g., "Jackie Chan")
- Episode (full episodic text, user profile, timestamp)
At each level, entries are defined by a fixed-dimensional semantic embedding, a self-index, and pointers to child indices in the next layer.
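The layered index can be sketched as a minimal Python data structure. This is an illustrative reading of the description above, not code from the paper; names such as `MemoryEntry` and the layer numbering are assumptions.

```python
from dataclasses import dataclass, field

# Illustrative four-layer index: 0 = Domain, 1 = Category,
# 2 = Memory Trace, 3 = Episode (coarse to fine).
@dataclass
class MemoryEntry:
    layer: int                  # 0..3
    embedding: list[float]      # fixed-dimensional semantic embedding
    index: int                  # self-index within its layer
    children: list[int] = field(default_factory=list)  # indices in layer+1
    payload: str = ""           # episodic text (Episode layer only)

# A tiny two-entry chain: a Domain pointing at one Category.
domain = MemoryEntry(layer=0, embedding=[0.1, 0.9], index=0, children=[0])
category = MemoryEntry(layer=1, embedding=[0.2, 0.8], index=0)
```

Each entry carries only its own embedding plus child pointers, so retrieval can descend the hierarchy without scanning sibling subtrees.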
Harmonic Representation
Memora extends this principle by introducing:
- Primary abstraction: A canonical, semantically grouped identifier for concept-level memory buckets.
- Cue anchors: Short entity+aspect phrases providing fine-grained hooks, many-to-many linked across entries (Xia et al., 3 Feb 2026).
Integrating these, Memora strikes a formalized balance, sharding memory to maximize retrieval efficiency while maintaining specificity required for reasoning.
2. Retrieval Dynamics: Routing, Abstraction, and Policy
Index-Based Routing
H²Memory implements a top-down, index-routed retrieval process: Layer 1 identifies the top-k domains via embedding similarity; subsequent layers recurse into child pointers, restricting further similarity scoring to semantically filtered subregions.
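The routing step above can be sketched as follows. This is a minimal illustration of top-down, index-routed retrieval, assuming cosine similarity and a layers-of-dicts layout; the function and structure names are hypothetical.

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(query_emb, layers, k=2):
    """Top-down index-routed retrieval sketch.

    `layers` is a list of dicts: index -> (embedding, child_indices).
    At each layer only the children of the previously selected
    entries are scored, so similarity search is restricted to a
    semantically filtered subregion instead of the whole layer.
    """
    candidates = list(layers[0].keys())          # Layer 1: all domains
    scored = []
    for depth, layer in enumerate(layers):
        scored = sorted(candidates,
                        key=lambda i: cosine(query_emb, layer[i][0]),
                        reverse=True)[:k]        # keep top-k per layer
        if depth + 1 < len(layers):
            # Recurse into the child pointers of the selected entries.
            candidates = [c for i in scored for c in layer[i][1]]
    return scored                                # indices in final layer
```

Because only top-k subtrees are expanded at each level, the number of similarity evaluations per query is bounded by k times the depth times the average fan-out, not the corpus size.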
Policy-Guided Retrieval
Memora formulates retrieval as a Markov Decision Process, where an LLM parameterized policy navigates memory along abstraction and cue-anchor edges. The states encode current query, retrieved working set, frontier candidates, and remaining step budget. Actions include refinement, expansion, and stop—with the policy trained on group-relative trajectory rewards, balancing grounding, redundancy, and cost (Xia et al., 3 Feb 2026).
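The MDP loop described above can be illustrated with a stub policy. The real navigator is an LLM-parameterized policy trained on trajectory rewards; the rule-based `greedy_policy` below is a stand-in that only demonstrates the state/action interface (states with query, working set, frontier, and step budget; actions refine/expand/stop).

```python
def retrieve(query, frontier, budget, policy):
    """Policy-guided retrieval sketch over an MDP-style loop.

    The state exposes the current query, the retrieved working set,
    the frontier of candidate entries, and the remaining step budget.
    """
    working_set = []
    state = {"query": query, "working_set": working_set,
             "frontier": list(frontier), "budget": budget}
    while state["budget"] > 0:
        action = policy(state)           # "refine" | "expand" | "stop"
        if action == "stop" or not state["frontier"]:
            break
        if action == "expand":
            working_set.append(state["frontier"].pop(0))
        # "refine" would re-rank the frontier; a no-op in this stub.
        state["budget"] -= 1
    return working_set

def greedy_policy(state):
    # Stand-in for the learned policy: expand until two items
    # are gathered, then stop to save budget.
    return "expand" if len(state["working_set"]) < 2 else "stop"
```

The step budget and explicit stop action are what let the trained policy trade grounding against redundancy and retrieval cost.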
Scalability
Both hierarchical index-routing and harmonic retrieval exhibit sublinear, or even constant, query-time scaling when abstraction granularity is controlled: if the average abstraction size grows proportionally with the corpus, the number of abstraction-level candidates stays bounded and per-query cost becomes independent of corpus size.
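One way to make the constant-query-cost claim concrete (the symbols $N$, $s$, and $B$ here are illustrative, not the paper's notation):

```latex
Let $N$ be the number of stored episodes, $s(N)$ the average
abstraction size, and $B(N) = N / s(N)$ the number of abstraction
buckets scored per query. Then
\[
  s(N) = \Theta(N) \;\Longrightarrow\; B(N) = \frac{N}{s(N)} = \Theta(1),
\]
so the per-query scoring cost $O(B(N))$ is constant in $N$.
```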
3. Dynamic Memory Update and Plasticity
Each memory entry is assigned a scalar weight that encodes recency, reinforcement, and user feedback. This dynamic integrates Ebbinghaus-style forgetting with explicit reinforcement and discounting, prioritizing active and useful memories while gradually purging irrelevant or refuted knowledge (Sun et al., 23 Jul 2025).
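A minimal sketch of such a weight dynamic follows. The exponential-decay form and the parameter names (`decay`, `boost`) are assumptions; the paper's exact update rule is not reproduced here.

```python
import math

def update_weight(w, dt, reinforced=False, feedback=0.0,
                  decay=0.1, boost=0.5):
    """One Ebbinghaus-style update step (illustrative parameters).

    w          current scalar weight of the memory entry
    dt         time elapsed since the last access
    reinforced whether the entry was retrieved/used this step
    feedback   signed user feedback (+1 useful, -1 refuted)
    """
    w = w * math.exp(-decay * dt)    # forgetting-curve decay
    if reinforced:
        w += boost                    # retrieval strengthens the memory
    w += feedback                     # explicit user signal
    return max(w, 0.0)                # floor at zero; low weights get purged
```

Entries whose weight decays to zero become candidates for purging, while repeatedly reinforced entries stay hot.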
4. Heterogeneous and Asymmetric Hardware Memory Management
Asymmetric Memory Architecture
Another instantiation of H²Memory (H2M2) addresses hardware efficiency for very large LLMs via parallel, asymmetric memory:
- Bandwidth-centric memory: HBM3, 96 GB, 3 TB/s, co-located with four accelerator cores.
- Capacity-centric memory: LPDDR5X, 512 GB, 544 GB/s, on a peer accelerator.
- High-speed interconnect: 960 GB/s chip–chip link.
Head-Aware Mapping & Runtime Adaptation
Each transformer sublayer (qkv, attention, fc) is partitioned by attention head count between HBM and LPDDR. The split is chosen by a per-sublayer min-max sweep over head assignments, achieving near-optimal load balance.
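The per-sublayer sweep can be sketched as below. The linear latency model is a placeholder assumption; real per-head costs depend on memory bandwidth, head dimensions, and batch shape.

```python
def best_head_split(total_heads, hbm_time_per_head, lpddr_time_per_head):
    """Min-max sweep over head splits for one sublayer (illustrative).

    Assigning h heads to HBM and (total_heads - h) to LPDDR, the
    sublayer finishes when the slower partition does, so we pick the
    split minimizing the max of the two partition times.
    """
    best_h, best_cost = 0, float("inf")
    for h in range(total_heads + 1):
        cost = max(h * hbm_time_per_head,
                   (total_heads - h) * lpddr_time_per_head)
        if cost < best_cost:
            best_h, best_cost = h, cost
    return best_h, best_cost
```

With a large bandwidth gap the sweep naturally pushes most heads onto HBM while keeping both partitions busy, which is the load-balancing behavior the min-max objective encodes.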
A dynamic runtime mapping algorithm, triggered at each generation step, adapts to sequence-length and batch variability, automatically issuing page migrations and address translations within a TLB-page-based abstraction. Overheads remain small across all models tested (Hwang et al., 21 Apr 2025).
Performance
- Speedup over homogeneous LPDDR-only system: 1.46× (GPT3-175B), 1.55× (Chinchilla-70B), 2.94× (Llama2-70B), within 5% of oracle mapping.
- Memory energy and cost/GB improve due to offloading bulk storage (KV-cache, weights) to LPDDR while retaining critical compute on HBM.
5. Automated Online Data Placement for Heterogeneous Memory Systems
The original H²Memory (online application guidance) framework targets runtime feedback-driven tier assignment for hybrid DRAM+NVM configurations (Olson et al., 2021).
Automatic, Per-Allocation Site Profiling
- Each heap allocation is decorated with source-unique site ID and context.
- Modified jemalloc (via SICM) allocates in multi-tier arenas.
Ski-Rental Placement Heuristic
At every profiling interval, each site's cumulative access counts and resident-set size are updated. The online guide then compares two quantities:
- Rental cost: the cumulative penalty of continuing to serve the site's accesses from the slower tier (accesses that would benefit from DRAM residency).
- Purchase (migration) cost: the one-time cost of moving the site's resident pages into DRAM.
Memory pages are migrated once the rental cost reaches the purchase cost. This ski-rental rule amortizes migration overhead and tracks shifts in hot/cold page patterns over the program's lifecycle.
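The ski-rental test can be sketched in a few lines. The concrete cost model (a uniform per-access penalty and a per-page migration cost) is an assumption for illustration.

```python
def should_migrate(slow_tier_accesses, access_penalty,
                   resident_pages, migrate_cost_per_page):
    """Ski-rental style migration decision (illustrative cost model).

    Rental cost: cumulative penalty of keeping this allocation site's
    pages in the slow tier.  Purchase cost: one-time cost of migrating
    its resident set to DRAM.  Migrating only once renting has cost as
    much as buying bounds total overhead at roughly twice the optimum,
    the classic ski-rental guarantee.
    """
    rental = slow_tier_accesses * access_penalty
    purchase = resident_pages * migrate_cost_per_page
    return rental >= purchase
```

A site that stays cold never accumulates enough rental cost to trigger migration, so one-shot or streaming allocations remain in the capacity tier.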
Empirical Outcomes
- On CORAL and SPEC 2017, H²Memory online tiering achieves a 2.5× geometric-mean speedup on CORAL and an 8.6% improvement on SPEC over first-touch unguided placement, closely approaching the respective offline profiling-guided optima.
- Profiling and management overheads stay below 10% of wall-clock time for large HPC codes.
6. Empirical Benchmarks and Theoretical Relationships
LLM Agents and Memory Retrieval
- On the LoCoMo benchmark, hierarchical H²Memory outperforms five baselines in F1 (by +14.98 pp) and BLEU-1 (by +12.77 pp); gains are greatest in multi-hop and adversarial QA (+21.3 pp F1, +17.7 pp BLEU-1) (Sun et al., 23 Jul 2025).
- Memora demonstrates state-of-the-art retrieval effectiveness and context efficiency (e.g., 87.4% accuracy at 2.9k context length on LongMemEval), outperforming both flat RAG and neural memory baselines (Xia et al., 3 Feb 2026).
| Method | BLEU | F1 | LLM-Judge |
|---|---|---|---|
| Full Context | 0.487 | 0.565 | 0.825 |
| RAG (k=3) | 0.389 | 0.455 | 0.633 |
| Memora (P) | 0.466 | 0.553 | 0.863 |
Hardware/Systems Context
- H2M2 sustains at least 0.96× of oracle-mapping performance, with runtime mapping overhead of approximately 0.16%.
- Online H²Memory achieves near-offline-optimal speedups, with rapid convergence and low migration cost amortization.
7. Comparative Expressiveness and Applicability
Memora and H²Memory subsume prior vector-store (RAG) and (implicit or explicit) KG-based retrieval frameworks. Special cases include:
- Flat RAG: Each entry is indexed only by itself (no cue anchors), yielding standard chunk retrieval.
- Implicit KG: Cue anchors as approximate entities, retrieval as L-hop traversal in cue-similarity space.
- Explicit KG: Cue edges represent symbolic graph edges, tracing explicit multi-hop knowledge graph pathways.
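The mixed-key intersection claim can be illustrated with a toy cue-anchor index. The structure and cue strings below are hypothetical; the point is that a conjunction over several cue keys is a single set intersection here, whereas a flat RAG store indexes each chunk by one vector and cannot express the conjunction directly.

```python
# Toy many-to-many cue-anchor index: cue phrase -> set of entry ids.
cue_index = {
    "jackie chan + stunts": {1, 3},
    "action movies + 1990s": {2, 3},
    "hong kong cinema": {3, 4},
}

def intersect_retrieve(cues, index):
    """Retrieve entries matching ALL given cues (mixed-key predicate).

    Each cue's posting set is intersected, so only entries linked
    to every cue anchor survive.
    """
    ids = None
    for cue in cues:
        posting = index.get(cue, set())
        ids = posting if ids is None else ids & posting
    return sorted(ids or set())
```

A single-cue query degenerates to flat lookup, recovering the RAG special case above.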
The representational formalism is strictly more expressive: mixed-key intersection predicates realizable in Memora are unattainable by RAG or standard KG retrieval approaches (Xia et al., 3 Feb 2026).
H²Memory, in both algorithmic and hardware-aware variants, is a unifying set of frameworks for scalable, efficient, and context-rich memory management in modern AI systems—spanning from long-term reasoning agents to high-throughput, multi-tier hardware platforms (Sun et al., 23 Jul 2025, Hwang et al., 21 Apr 2025, Olson et al., 2021, Xia et al., 3 Feb 2026).