
H²Memory Framework: Scalable AI Memory

Updated 26 February 2026
  • H²Memory Framework is a collection of protocols that combine hierarchical abstraction and harmonic retrieval to support scalable reasoning in AI systems.
  • It employs multi-layered semantic organization with cue anchors and index-based routing to optimize memory retrieval and reduce query latency.
  • Dynamic update mechanisms and hardware-aware data placement ensure efficient operation in both cognitive architectures and high-performance computing contexts.

H²Memory Framework

H²Memory refers to a set of distinct but conceptually related frameworks addressing memory abstraction, retrieval, and management for large-scale reasoning agents and high-performance systems. In contemporary computational research, the moniker appears in cognitive-architectural (hierarchical agent memory), synergistic hardware–software, and adaptive data-tiering contexts. Central instantiations include the hierarchical memory for LLM agents (Sun et al., 23 Jul 2025), the head-aware heterogeneous memory manager for LLM inference (Hwang et al., 21 Apr 2025), the online-guided data placement framework for heterogeneous hardware (Olson et al., 2021), and harmonic abstraction–specificity balancing memory (Xia et al., 3 Feb 2026).

1. Hierarchical and Harmonic Memory Architectures in LLM Agents

H²Memory (also referred to as H-MEM or Memora) implements multi-level semantic organization of long-term agent memory, optimizing both context-aware retrieval and efficient scaling.

Hierarchical Layering

  • Four semantic layers:
  1. Domain (coarse topics: e.g., "Movies")
  2. Category (subdomains: e.g., "Action Movies")
  3. Memory Trace (salient entities: e.g., "Jackie Chan")
  4. Episode (full episodic text, user profile, timestamp)

At each level, entries are defined via a $D$-dimensional semantic embedding $e_k^{(\ell)}$, a self-index $p_k^{(\ell,0)}$, and pointers to child indices in the next layer:

$m_k^{(\ell)} = \left[\, e_k^{(\ell)};\; p_k^{(\ell,0)};\; p_k^{(\ell,1)}, \dots, p_k^{(\ell,K_\ell)} \,\right] \in \mathbb{R}^D \times \mathbb{Z} \times \mathbb{Z}^{K_\ell}$
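The layered entry structure above can be sketched as a small data class; the class, field names, and example values below are illustrative assumptions, not from the papers:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    """One entry m_k at layer ℓ: embedding, self-index, child pointers."""
    embedding: list[float]                              # e_k^(ℓ), D-dimensional
    self_index: int                                     # p_k^(ℓ,0)
    children: list[int] = field(default_factory=list)   # p_k^(ℓ,1..K_ℓ)

# The four semantic layers, coarse to fine (names from the text)
LAYERS = ["Domain", "Category", "Memory Trace", "Episode"]

# Example: a "Movies" domain entry pointing at two Category entries
movies = MemoryEntry(embedding=[0.1, 0.9], self_index=0, children=[3, 7])
```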

Harmonic Representation

Memora extends this principle by introducing:

  • Primary abstraction: A canonical, semantically grouped identifier for concept-level memory buckets.
  • Cue anchors: Short entity+aspect phrases providing fine-grained hooks, many-to-many linked across entries (Xia et al., 3 Feb 2026).

Integrating these, Memora strikes a formalized balance, sharding memory to maximize retrieval efficiency while maintaining specificity required for reasoning.

2. Retrieval Dynamics: Routing, Abstraction, and Policy

Index-Based Routing

H²Memory implements a top-down, index-routed retrieval process: Layer 1 identifies the top-$k$ domains via embedding similarity; subsequent layers recurse into child pointers, restricting further similarity scoring to semantically filtered subregions:

$\mathcal{S}^{(\ell)} = \mathrm{TopK}\left\{ \left(j,\, \mathrm{sim}(q, e_j^{(\ell)})\right) : j \in \bigcup_{(i,\,\_) \in \mathcal{S}^{(\ell-1)}} \mathrm{Children}(m_i^{(\ell-1)}) \right\}$
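The recursion above admits a compact sketch. The layer representation (index → (embedding, children)), the cosine `sim`, and all values are assumptions for illustration:

```python
import math

def sim(q, e):
    """Cosine similarity between query and entry embeddings."""
    dot = sum(a * b for a, b in zip(q, e))
    nq = math.sqrt(sum(a * a for a in q)) or 1.0
    ne = math.sqrt(sum(a * a for a in e)) or 1.0
    return dot / (nq * ne)

def route(query, layers, k=2):
    """Top-down index-routed retrieval: at each layer, score only the
    children of the previous layer's top-k survivors.
    layers[ℓ] maps index -> (embedding, child_indices)."""
    frontier = list(layers[0].keys())   # layer 1 scores every domain entry
    selected = []
    for layer in layers:
        selected = sorted(frontier, key=lambda j: -sim(query, layer[j][0]))[:k]
        # restrict the next layer's candidates to children of survivors
        frontier = [c for j in selected for c in layer[j][1]]
    return selected
```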

Policy-Guided Retrieval

Memora formulates retrieval as a Markov Decision Process in which an LLM-parameterized policy $\pi_\theta$ navigates memory along abstraction and cue-anchor edges. States encode the current query, the retrieved working set, frontier candidates, and the remaining step budget. Actions include refinement, expansion, and stop, with the policy trained on group-relative trajectory rewards that balance grounding, redundancy, and cost (Xia et al., 3 Feb 2026).
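A minimal sketch of that MDP loop, with a trivial stand-in policy (the real $\pi_\theta$ is an LLM; the graph shape, action set, and stopping rule here are illustrative assumptions):

```python
ACTIONS = ("refine", "expand", "stop")

def policy(state):
    """Stand-in for the LLM policy π_θ: stop when the budget is exhausted
    or the frontier is empty, otherwise expand. A real policy would score
    actions from the query and working set."""
    if state["budget"] <= 0 or not state["frontier"]:
        return "stop"
    return "expand"

def retrieve(query, graph, start, budget=3):
    """Navigate memory along abstraction / cue-anchor edges as an MDP.
    graph maps node -> neighbor nodes; names are illustrative."""
    state = {"query": query, "working": [], "frontier": [start], "budget": budget}
    while True:
        if policy(state) == "stop":
            return state["working"]
        node = state["frontier"].pop(0)
        state["working"].append(node)                   # add to retrieved set
        state["frontier"].extend(graph.get(node, []))   # expand neighbors
        state["budget"] -= 1
```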

Scalability

Both hierarchical index-routing and harmonic retrieval exhibit sublinear or even constant query-time scaling given controlled growth in abstraction granularity: the query cost $T_\text{Harmo} = O(\log(mN^2/B^2))$ becomes constant in $N$ if the average abstraction size is $B = N^{1/2+\epsilon}$ for $\epsilon > 0$.

3. Dynamic Memory Update and Plasticity

Each memory entry is assigned a scalar weight $w_i(t)$ encoding recency, reinforcement, and user feedback:

$w_i(t+\Delta) = w_i(t)\cdot \exp(-\lambda\Delta) \times \begin{cases} \alpha_+ & \text{if user feedback is positive} \\ 1 & \text{if no feedback} \\ \alpha_- & \text{if user negates} \end{cases}$

This dynamic integrates Ebbinghaus-style forgetting with explicit reinforcement and discounting, prioritizing active/useful memories and gradually purging irrelevant or refuted knowledge (Sun et al., 23 Jul 2025).
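The update rule can be sketched directly; the decay rate and the $\alpha$ reinforcement factors below are illustrative placeholders, not values from the paper:

```python
import math

def update_weight(w, dt, lam=0.01, feedback=None, alpha_pos=1.5, alpha_neg=0.5):
    """Ebbinghaus-style decay with feedback reinforcement:
    w(t+Δ) = w(t) · exp(-λΔ) · α, where α depends on user feedback."""
    decayed = w * math.exp(-lam * dt)   # exponential forgetting
    if feedback == "positive":
        return decayed * alpha_pos      # reinforce useful memories
    if feedback == "negative":
        return decayed * alpha_neg      # discount refuted memories
    return decayed                      # no feedback: decay only
```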

4. Heterogeneous and Asymmetric Hardware Memory Management

Asymmetric Memory Architecture

Another instantiation of H²Memory (H2M2) addresses hardware efficiency for very large LLMs via parallel, asymmetric memory:

  • Bandwidth-centric memory: HBM3, 96 GB, 3 TB/s, co-located with 4-accelerator cores.
  • Capacity-centric memory: LPDDR5X, 512 GB, 544 GB/s, on a peer accelerator.
  • High-speed interconnect: 960 GB/s chip–chip link.

Head-Aware Mapping & Runtime Adaptation

Each transformer sublayer (qkv, attention, fc) is partitioned by attention-head count ($n_\ell$ heads on HBM, $N - n_\ell$ on LPDDR). The mapping solves:

$n_\ell^* = \arg\min_{0 \le n \le N} \left\{ \max\left(T_{\text{HBM},\ell}(n),\, T_{\text{LPDDR},\ell}(N-n)\right) \right\}$

achieving near-optimal load balance via a per-sublayer $O(N)$ min-max sweep.
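The min-max sweep is straightforward to sketch; the linear per-head cost models in the example are assumptions for illustration, not profiled numbers:

```python
def partition_heads(N, t_hbm, t_lpddr):
    """Per-sublayer O(N) sweep: choose n heads on HBM minimizing
    max(T_HBM(n), T_LPDDR(N - n)). t_hbm / t_lpddr are per-tier
    time models as functions of head count."""
    best_n, best_cost = 0, float("inf")
    for n in range(N + 1):
        cost = max(t_hbm(n), t_lpddr(N - n))
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, best_cost

# Example: suppose HBM serves a head ~6x faster than LPDDR (illustrative)
n_star, balanced_cost = partition_heads(32, lambda n: n * 1.0, lambda n: n * 6.0)
```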

A dynamic runtime mapping algorithm, invoked at each generation step, adapts to sequence-length and batch variability, automatically triggering page migrations and address translations within a TLB-page-based abstraction system. Overheads are $\leq 5\%$ in all models tested (Hwang et al., 21 Apr 2025).

Performance

  • Speedup over homogeneous LPDDR-only system: 1.46× (GPT3-175B), 1.55× (Chinchilla-70B), 2.94× (Llama2-70B), within 5% of oracle mapping.
  • Memory energy and cost/GB improve due to offloading bulk storage (KV-cache, weights) to LPDDR while retaining critical compute on HBM.

5. Automated Online Data Placement for Heterogeneous Memory Systems

The original H²Memory (online application guidance) framework targets runtime feedback-driven tier assignment for hybrid DRAM+NVM configurations (Olson et al., 2021).

Automatic, Per-Allocation Site Profiling

  • Each heap allocation is decorated with source-unique site ID and context.
  • Modified jemalloc (via SICM) allocates in multi-tier arenas.

Ski-Rental Placement Heuristic

At every interval $\Delta T$, each site's cumulative access sample $A_i(T)$ and resident set size $RSS_i(T)$ are updated. The online guide then computes:

  • Rental cost:

$C_r = \max(0,\, a-b)\times \Delta_\ell$

where $a$ and $b$ sum the accesses needing DRAM migration.

  • Purchase (migration) cost:

$C_p = \left( \sum_{i:\, cur_i \neq rec_i} RSS_i \right) \times \Delta_m$

Memory pages are migrated when $C_r > C_p$. This fully amortizes migration overhead and tracks changes in the hot/cold page pattern over the program lifecycle.
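A sketch of the ski-rental decision under stated assumptions: the site fields and the exact accounting behind $a$ and $b$ are illustrative here, and the paper's definitions may differ.

```python
def should_migrate(sites, delta_l, delta_m):
    """Ski-rental check: migrate when rental cost C_r exceeds purchase
    (migration) cost C_p. Each site dict carries 'acc_fast' (accesses the
    recommended DRAM placement would serve), 'acc_cur' (accesses under
    the current placement), 'rss', 'cur', and 'rec'; only sites whose
    current tier differs from the recommendation contribute."""
    mismatched = [s for s in sites if s["cur"] != s["rec"]]
    a = sum(s["acc_fast"] for s in mismatched)
    b = sum(s["acc_cur"] for s in mismatched)
    c_r = max(0, a - b) * delta_l                       # cost of staying put
    c_p = sum(s["rss"] for s in mismatched) * delta_m   # cost of migrating
    return c_r > c_p, c_r, c_p
```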

Empirical Outcomes

  • On CORAL and SPEC 2017, H²Memory online tiering achieves geometric-mean speedups of 2.5× (CORAL) and 8.6% (SPEC 2017) over first-touch unguided placement, closely approaching the respective offline profiling-guided optima.
  • Profiling and management overheads are <10% of wall-clock time for large HPC codes.

6. Empirical Benchmarks and Theoretical Relationships

LLM Agents and Memory Retrieval

  • On the LoCoMo benchmark, hierarchical H²Memory outperforms five baselines in F1 (by +14.98 pp) and BLEU-1 (by +12.77 pp); gains are greatest in multi-hop and adversarial QA (+21.3 pp F1, +17.7 pp BLEU-1) (Sun et al., 23 Jul 2025).
  • Memora demonstrates state-of-the-art retrieval effectiveness and context efficiency (e.g., 87.4% accuracy at 2.9k context length on LongMemEval), outperforming both flat RAG and neural memory baselines (Xia et al., 3 Feb 2026).

Method        BLEU   F1     LLM-Judge
Full Context  0.487  0.565  0.825
RAG (k=3)     0.389  0.455  0.633
Memora (P)    0.466  0.553  0.863

Hardware/Systems Context

  • H2M2 achieves ≈0.96× oracle performance at <5% total overhead, supports dynamic-sequence batched inference, and maintains negligible internal fragmentation (≈0.16%).
  • Online H²Memory achieves near-offline-optimal speedups, with rapid convergence and low migration cost amortization.

7. Comparative Expressiveness and Applicability

Memora and H²Memory subsume prior vector-store (RAG) and (implicit or explicit) KG-based retrieval frameworks. Special cases include:

  • Flat RAG: Each entry is indexed only by itself (no cues), yielding standard chunk retrieval.
  • Implicit KG: Cue anchors as approximate entities, retrieval as L-hop traversal in cue-similarity space.
  • Explicit KG: Cue edges represent symbolic graph edges, tracing explicit multi-hop knowledge graph pathways.

The representational formalism is strictly more expressive: mixed-key intersection predicates realizable in Memora are unattainable by RAG or standard KG retrieval approaches (Xia et al., 3 Feb 2026).


H²Memory, in both algorithmic and hardware-aware variants, is a unifying set of frameworks for scalable, efficient, and context-rich memory management in modern AI systems—spanning from long-term reasoning agents to high-throughput, multi-tier hardware platforms (Sun et al., 23 Jul 2025, Hwang et al., 21 Apr 2025, Olson et al., 2021, Xia et al., 3 Feb 2026).
