
H^2Memory in Hierarchical & Heterogeneous Architectures

Updated 24 November 2025
  • H^2Memory is a hierarchical, heterogeneous memory architecture that integrates multi-tier physical and logical memory to optimize performance and efficiency.
  • In LLM inference, an asymmetric design couples high-bandwidth HBM with capacity-rich LPDDR5X, achieving 1.5–2.9× speedups and minimal virtualization overhead.
  • For linear algebra and multi-task agents, H^2Memory organizes data-sparse matrices and layered agent memories to enable fast solvers and improved planning through structured retrieval.

$H^2$Memory commonly refers to hierarchical and heterogeneous memory architectures and representations that combine multiple levels of abstraction or physical media types to improve performance, efficiency, and transfer across diverse computational workloads. The term arises in at least two prominent research threads: (1) the architectural management of heterogeneous memory hardware for large-scale inference (notably for LLMs), and (2) the hierarchical matrix representations ($\mathcal{H}^2$-matrices) central to fast solvers and data-sparse linear algebra. Each area exemplifies the general principle that memory systems benefit from multi-level organization, whether in silicon, linear-algebraic storage, or agent memory for learning and planning.

1. Hierarchical and Heterogeneous Memory in LLM Inference

LLM inference confronts conflicting demands: the enormous model footprint requires capacity-rich memory, while performance-critical kernels (e.g., attention GEMV) are bandwidth-bound (Hwang et al., 21 Apr 2025). The Hardware-based Heterogeneous Memory Management (H2M2) system addresses this with an asymmetric memory architecture comprising two tiers:

  • Bandwidth-Centric Tier: HBM3 (96 GB, 3 TB/s peak bandwidth)
  • Capacity-Centric Tier: LPDDR5X (512 GB, 544 GB/s peak bandwidth)
  • Interconnect: a 960 GB/s link joins the two memory domains

Each memory is paired with a compute unit (an accelerator chip with four cores; each core contains GEMM/GEMV/vector/SFU units and a 32 MB scratchpad built from two 16 MB banks). Accelerator₁ serves kernels mapped to HBM3; Accelerator₂ serves LPDDR5X.

Memory Mapping and Dynamic Allocation

Kernel-to-memory assignment is formalized as a binary optimization:

$$\min_{\{x_k\}} \; \sum_{k=1}^{K} \frac{\alpha_k w_k}{x_k B_{bw} + (1 - x_k) B_{cap}}$$

subject to capacity constraints on both tiers, where $x_k \in \{0, 1\}$ selects the memory tier, $w_k$ is the FLOP count, $f_k(t)$ is the per-kernel footprint, and $\alpha_k$ encodes arithmetic-intensity penalties. The mapping seeks to assign bandwidth-bound kernels (e.g., attention heads) to HBM3 and capacity-suited kernels to LPDDR5X, balancing utilization based on runtime state (KV cache size, batch size, etc.).

A dynamic mapping algorithm triggers per iteration or upon memory/batch changes, solving (greedily or nearly exactly due to moderate per-layer problem size) for optimal head partitioning within each sublayer (Hwang et al., 21 Apr 2025). Migration cost is tightly budgeted via page-level hardware copying amortized over the high-speed interconnect.
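
A minimal Python sketch of one plausible greedy heuristic for this assignment follows; the exact H2M2 algorithm is not reproduced in the source, and all names and constants here are illustrative. Kernels are ranked by the latency saved per byte of footprint when placed in the bandwidth tier, then assigned to HBM until its capacity is exhausted.

```python
# Hypothetical greedy sketch of kernel-to-tier mapping; the published
# H2M2 algorithm may differ in its details.
from dataclasses import dataclass

B_BW, B_CAP = 3e12, 544e9        # HBM3 vs. LPDDR5X bandwidth (bytes/s)
HBM_CAPACITY = 96 * 2**30        # bytes available in the bandwidth tier

@dataclass
class Kernel:
    name: str
    traffic: float    # alpha_k * w_k: bytes moved, weighted by intensity
    footprint: float  # f_k(t): resident bytes required by the kernel

def greedy_map(kernels):
    """Assign x_k = 1 (HBM) to kernels with the largest latency savings
    per byte of footprint, subject to the HBM capacity constraint."""
    def savings(k):
        return k.traffic / B_CAP - k.traffic / B_BW  # time saved in HBM
    mapping, used = {}, 0.0
    for k in sorted(kernels, key=lambda k: savings(k) / k.footprint,
                    reverse=True):
        if used + k.footprint <= HBM_CAPACITY:
            mapping[k.name], used = 1, used + k.footprint
        else:
            mapping[k.name] = 0   # falls back to the capacity tier
    return mapping

# Illustrative workload: 64 attention heads with equal traffic/footprint.
heads = [Kernel(f"attn_head_{i}", traffic=2e9, footprint=1.5 * 2**30)
         for i in range(64)]
print(greedy_map(heads))
```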

Memory Abstraction and Virtualization

Each accelerator’s MMU manages a page table (2 MB pages; the table itself occupies ~4 MB), implementing a formalized API so that virtual address spaces are transparently mapped to physical memory across tiers. Logical reads/writes and page migrations thus occur without kernel-code changes, incurring a measured latency overhead below 1.36%.
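
A toy sketch of tier-transparent translation with 2 MB pages follows; the real MMU is a hardware structure, and this dictionary-based model is only illustrative.

```python
# Illustrative 2 MB page table mapping virtual pages to (tier, frame);
# the actual H2M2 MMU format is not specified at this level of detail.
PAGE_SIZE = 2 * 2**20  # 2 MB pages

class PageTable:
    def __init__(self):
        self.entries = {}  # virtual page number -> (tier, physical frame)

    def map(self, vpn, tier, frame):
        self.entries[vpn] = (tier, frame)

    def translate(self, vaddr):
        tier, frame = self.entries[vaddr // PAGE_SIZE]
        return tier, frame * PAGE_SIZE + vaddr % PAGE_SIZE

    def migrate(self, vpn, new_tier, new_frame):
        # Page migration updates only the mapping; kernel code is unchanged.
        self.map(vpn, new_tier, new_frame)

pt = PageTable()
pt.map(0, "HBM3", 42)
print(pt.translate(0x1234))        # resolves into the bandwidth tier
pt.migrate(0, "LPDDR5X", 7)
print(pt.translate(0x1234))        # now resolves into the capacity tier
```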

Performance

Empirical end-to-end latency improvements include:

Model | Batch | H2M2 vs. LPDDR-only speedup
GPT3-175B | 32 | 1.46×
Chinchilla-70B | 64 | 1.55×
Llama2-70B (GQA, 128) | 128 | 2.94×

Bandwidth-bound attention GEMV dominates in standard (homogeneous, capacity-dominated) systems; H2M2’s mapping prioritizes offloading these kernels to HBM, as confirmed by bottleneck profiling. The mapping strategy’s overhead is <5% relative to an oracle exhaustive assignment (Hwang et al., 21 Apr 2025).

2. Hierarchical $\mathcal{H}^2$-Matrix Memory in Numerical Linear Algebra

The $\mathcal{H}^2$-matrix formalism provides a data-sparse, multi-level memory organization for dense linear operators, especially those arising from boundary element, covariance, or kernel methods (Boukaram et al., 14 Sep 2025, Christophersen, 21 Sep 2025). In this context, "$H^2$Memory" refers to this nested, block-structured memory representation and its exploitation in both direct and iterative solvers.

Structure and Complexity

An $\mathcal{H}^2$-matrix consists of:

  • Cluster Trees $T$: index partitionings of the basis sets, with $|T| = O(N)$ for matrix size $N$.
  • Block Trees $B$: Cartesian products of clusters partition the matrix into blocks; admissibility conditions assign far-field blocks to low-rank representation.
  • Leaf Bases, Transfer, and Coupling Matrices: explicit nested bases stored at the leaves ($V_t$, $W_s$), transfer matrices $E_{\text{child} \to \text{parent}}$ per non-leaf cluster, and small coupling matrices $S_b$ for admissible blocks.

Storage costs are dominated by $O(Nk)$ (bases) and $O(Nk^2)$ (coupling/transfer) terms, with $k$ the maximum numerical rank per block. Since $k \ll N$, practical memory usage is quasi-linear; factorization (including all orthogonal/compression factors) maintains $O(kN)$ scaling under strong admissibility (Boukaram et al., 14 Sep 2025, Christophersen, 21 Sep 2025). Empirical measurements for $N = 2^{14}, \ldots, 2^{20}$ confirm a log-log slope near 1.
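
A back-of-the-envelope storage model illustrates why these terms stay quasi-linear. This is a sketch assuming a balanced binary cluster tree with leaf size $m$ and uniform rank $k$; real implementations store additional metadata, and the block counts here are placeholders.

```python
# Rough storage model for a balanced binary cluster tree; illustrative only.
def h2_storage(N, k, m, n_admissible_blocks):
    n_leaves = N // m
    n_clusters = 2 * n_leaves - 1                    # balanced binary tree
    leaf_bases = 2 * n_leaves * m * k                # V_t and W_s: O(Nk)
    transfers = (n_clusters - n_leaves) * 2 * k * k  # E_{child->parent}
    couplings = n_admissible_blocks * k * k          # S_b per admissible block
    return leaf_bases + transfers + couplings        # entries, not bytes

for N in [2**14, 2**17, 2**20]:
    # assume O(N/m) admissible blocks with a modest sparsity constant
    entries = h2_storage(N, k=24, m=32, n_admissible_blocks=8 * N // 32)
    print(N, entries / N)  # per-row cost stays ~constant: quasi-linear
```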

Memory Management and Parallel Architectures

Efficient parallel implementations avoid per-cluster dynamic allocations via a prefix-sum memory manager:

  • Workspace Allocation: For each parallel color class (task group), compute workspace per cluster, prefix-sum offsets, and allocate a single contiguous buffer.
  • Global Factor Storage: Precompute per-cluster factor footprint and lay out a global contiguous buffer for all factors, using prefix-sum addressing.

This enables coalesced, lock-free access, which is essential on multi-core CPUs/GPUs when constructing, factorizing, or applying H2\mathcal{H}^2 matrices.
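
A minimal sketch of the prefix-sum allocation pattern, assuming per-cluster workspace sizes are known up front:

```python
import numpy as np

def prefix_sum_alloc(cluster_sizes):
    """Single contiguous buffer with per-cluster offsets computed by an
    exclusive prefix sum, avoiding per-cluster dynamic allocation."""
    sizes = np.asarray(cluster_sizes, dtype=np.int64)
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))  # exclusive scan
    buffer = np.empty(int(sizes.sum()), dtype=np.float64)
    return buffer, offsets

sizes = [4, 16, 8, 32]                  # workspace per cluster (floats)
buf, off = prefix_sum_alloc(sizes)
view2 = buf[off[2]: off[2] + sizes[2]]  # lock-free slice for cluster 2
view2[:] = 1.0
```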

Bandwidth and Compute Bottlenecks

Factorization is dominated by small GEMMs (partial LU/Schur updates, QR/SVD basis augmentation). For $k \approx 20$–$30$, compute dominates; for smaller $k$ or large sparsity constants $C_{sp}$ (many small blocks), memory bandwidth limits performance. The solve phase (GEMV/TRSV) is strongly bandwidth-bound, achieving up to 80% of system peak bandwidth.

Choosing single rather than double precision halves the memory footprint and can double effective bandwidth, at the cost of roughly 7–8 fewer digits of numerical accuracy.

Iterative Solvers and Block Krylov Algorithms

$\mathcal{H}^2$-memory representations are especially advantageous in block Krylov methods (block-CG, block-GMRES) (Christophersen, 21 Sep 2025). Matrix-matrix kernels (GEMM-level, as opposed to GEMV) are realized efficiently: for $m$ right-hand sides (see the sketch after this list),

  • $O(Nkm)$ arithmetic per iteration (matrix-matrix multiply, basis transfer)
  • $O(Nk^2)$ matrix memory, $O(Nm)$ for the Krylov state
  • Empirically: 10–13× wall-clock speedup for one block mat-mat versus $m$ matvecs, and 30–50× speedup for block-Krylov versus sequential solves (for $m \geq 50$).
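
The gap comes from replacing $m$ separate GEMV traversals with one GEMM-level traversal of the same data. The dense stand-in below illustrates the kernel-level contrast (a real $\mathcal{H}^2$ apply would traverse the cluster tree, but the memory-reuse argument is the same):

```python
import time
import numpy as np

N, m = 4096, 64
A = np.random.rand(N, N)   # dense stand-in for an H^2 operator
X = np.random.rand(N, m)   # m right-hand sides

t0 = time.perf_counter()
Y1 = np.column_stack([A @ X[:, j] for j in range(m)])  # m separate GEMVs
t1 = time.perf_counter()
Y2 = A @ X                                             # one blocked GEMM
t2 = time.perf_counter()

assert np.allclose(Y1, Y2)
print(f"GEMV loop: {t1 - t0:.3f}s, block GEMM: {t2 - t1:.3f}s")
```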

3. Hierarchical Memory in LLM-Based Multi-Task Agents

Hierarchical memory also denotes architectures and algorithms for multi-level agent memory, exemplified by Hierarchical Hindsight Reflection ($H^2$R) (Ye et al., 16 Sep 2025). Here, $H^2$Memory is defined as two distinct repositories:

  • High-Level Planning Memory ($M^H$): stores task-level embeddings as key–value pairs, the values being subgoal sequences plus abstract planning insights.
  • Low-Level Execution Memory ($M^L$): stores per-subgoal embeddings, with values containing atomic action/observation traces and fine-grained execution insights.

Hindsight-Reflection and Update Mechanism

Memories are constructed and refined by extracting subgoal sequences and insights from paired successful/failed agent-environment trajectories. Updates are distilled via LLM-driven prompts, implementing multi-stage contrastive hindsight reflection. Formally, updates minimize contrastive losses in the space of planning and execution insights.

Retrieval and Agent Integration

At test time, task or subgoal embeddings retrieve the top-$k$ most relevant memory units from each tier by cosine similarity. The planner receives high-level memories as context to emit the next subgoals; the executor receives low-level memories to guide atomic decisions.
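
A minimal sketch of this two-tier retrieval step follows; the embeddings and memory contents are random placeholders, where a real system would use learned task/subgoal embeddings.

```python
import numpy as np

def top_k(query, keys, values, k=3):
    """Retrieve the k memory values whose keys are most cosine-similar
    to the query embedding."""
    q = query / np.linalg.norm(query)
    K = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    idx = np.argsort(K @ q)[::-1][:k]
    return [values[i] for i in idx]

d = 128  # embedding dimension (placeholder)
M_high = (np.random.rand(100, d), [f"subgoal_seq_{i}" for i in range(100)])
M_low = (np.random.rand(500, d), [f"exec_trace_{i}" for i in range(500)])

task_emb = np.random.rand(d)
plan_ctx = top_k(task_emb, *M_high)  # context for the planner
exec_ctx = top_k(task_emb, *M_low)   # executor would query per subgoal
print(plan_ctx, exec_ctx[:1])
```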

Empirical Validation

$H^2$R achieves 75.9% and 80.5% success rates on AlfWorld and PDDLGame, respectively, outperforming single-level and monolithic baselines. Ablations confirm both memory levels are essential: removing the high-level or low-level memory reduces performance by 20–28 percentage points. Strengths include fine-grained decomposition, reduced retrieval interference, and improved transfer even in deep hierarchies (Ye et al., 16 Sep 2025).

4. Implementation Trade-Offs and Design Considerations

Hardware and Mapping

  • Accelerator design must accommodate different arithmetic intensities; vector and systolic-array units are tailored to GEMV/GEMM distribution across memory tiers.
  • The head-aware partitioning in attention layers exemplifies the necessity for workload-adaptive mapping.

Memory Format Tuning

  • $\mathcal{H}^2$ block sizes ($m$), maximum rank ($k$), and cluster-tree balance impact both memory use and speed; $m \approx k$ is a practical compromise.
  • Bandwidth saturation may require explicit recompression, custom GPU batched kernels, or mixed-precision variants.

System Overheads

  • In H2M2, virtualization costs under 1.4% in latency; mapping suboptimality incurs <5% overhead.
  • In $\mathcal{H}^2$ solvers, the bulk of memory is in factor storage; auxiliary buffer allocation must be managed carefully, especially at large scale or with deep trees.
  • In $H^2$R, memory capacity scales with training data, task count, and subgoal cardinality, which may necessitate memory pruning or adaptation in continual-learning scenarios.

5. Strengths, Limitations, and Future Directions

Strengths

  • Heterogeneous physical memory and $\mathcal{H}^2$ matrix formats exploit hierarchical structure to obtain near-linear scaling in storage and compute.
  • Fine-grained, multi-tiered logical memory supports highly context-specific retrieval and transfer in multi-task agents and learning systems.
  • Blocked and batched operations (GEMM) enable efficient hardware utilization and outperform traditional bandwidth-bound routines.

Limitations

  • Hardware/execution: Static two-level partitioning may require extension for broader device heterogeneity.
  • Linear algebra: $\mathcal{H}^2$ schemes presuppose low-rank structure and may not generalize to all dense problems; strong bandwidth dependence persists for key routines.
  • LLM agents: Reflection and insight distillation depend on LLM prompt cycles, incurring latency and cost. Static memory may become stale; memory growth is unmanaged absent additional pruning or consolidation protocols.

Future Directions

  • In hardware: Dynamic policy-driven tiering, online recompression, and integration with broader system resource allocators.
  • In linear algebra: Adaptive precision, real-time recompression, and more nuanced interplay with high-level algorithms.
  • For agent memory: Learnable read/write, online continual reflection, multi-agent distributed sharing, and reinforcement-weighted insight integration are all proposed as promising extensions (Ye et al., 16 Sep 2025).

6. Summary Table: $H^2$ Memory in Key Contexts

Context | Memory Type/Structure | Main Performance Benefit
H2M2 LLM inference (Hwang et al., 21 Apr 2025) | Two-tier (HBM3 + LPDDR5X), virtualized | 1.5–2.9× speedup, bandwidth utilization
$\mathcal{H}^2$ linear algebra (Boukaram et al., 14 Sep 2025; Christophersen, 21 Sep 2025) | Cluster/block trees, nested bases | $O(Nk)$–$O(Nk^2)$ storage, block-Krylov speedups
LLM multi-task agents ($H^2$R) (Ye et al., 16 Sep 2025) | High-/low-level key–value memories | Fine-grained transfer, generalization

In all scenarios, hierarchical or heterogeneous memory, whether physical or logical, enables modularity, transfer, and resource efficiency, with trade-offs determined by context-dependent requirements and system constraints.
