Two-level Memory Allocator
- Two-level memory allocators are dual-layer systems that separate fast thread-local allocation from global reclamation, enhancing performance and scalability.
- They employ fixed-size blocks, virtual spans, and dynamic paging to efficiently manage memory across heterogeneous hardware environments.
- Applications include multicore synchronization, persistent memory systems, real-time processing, and large-scale analytics with improved throughput and reduced contention.
A two-level memory allocator is a class of memory management systems in which allocation and reclamation strategies operate cooperatively at two distinct abstraction layers to achieve improved efficiency, scalability, and fragmentation control. In contemporary implementations, these layers are typically differentiated by locality (thread-private vs. global/shared), size (small allocations vs. large allocations), or memory hierarchy (e.g., fast DRAM vs. slower SCM/NVRAM), with the allocation logic for each level optimized for characteristic access patterns and hardware constraints. Two-level allocator designs are foundational in a variety of domains: multicore synchronization, persistent memory, hybrid volatile/nonvolatile storage, real-time systems, and large-scale data analytics.
1. Architectural Principles and Layer Separation
A common architectural paradigm for two-level allocators involves partitioning allocation state into thread-local or private pools for rapid, contention-free operations (first level) and a shared global pool responsible for balance and reclamation (second level). For example, in concurrent fixed-size allocation (Blelloch et al., 2020), each process maintains a private pool implemented as batches or stacks of blocks. The global shared pool (a concurrent stack) mediates transfers when local pools are depleted or overfull, amortizing global coordination and enabling constant-time operations.
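The local/global split described above can be sketched in a few lines of Python. This is a single-threaded illustration only: `BATCH`, `LOCAL_MAX`, and the deque standing in for the concurrent shared stack are assumptions of the sketch, not details from Blelloch et al.

```python
from collections import deque

BATCH = 4              # blocks moved per local/global transfer
LOCAL_MAX = 2 * BATCH  # local-pool size that triggers a spill

class TwoLevelPool:
    """Sketch of a two-level fixed-size block pool: a private per-thread
    free list (fast path, no synchronization) backed by a shared global
    pool that hands out and takes back whole batches of blocks."""
    def __init__(self, total_blocks):
        blocks = list(range(total_blocks))  # block ids stand in for addresses
        # Global pool holds batches of BATCH blocks (a concurrent stack
        # in the real design).
        self.global_pool = deque(blocks[i:i + BATCH]
                                 for i in range(0, total_blocks, BATCH))
        self.local = []  # thread-private pool

    def alloc(self):
        if not self.local:                  # depleted: pull one batch
            self.local.extend(self.global_pool.pop())
        return self.local.pop()             # fast path: local pop only

    def free(self, block):
        self.local.append(block)            # fast path: local push only
        if len(self.local) > LOCAL_MAX:     # overfull: push one batch back
            batch = [self.local.pop() for _ in range(BATCH)]
            self.global_pool.append(batch)
```

Because transfers move whole batches, the cost of touching the global pool is amortized over `BATCH` fast-path operations.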
In other variants, particularly in allocators for multicore and NUMA systems (e.g., scalloc (Aigner et al., 2015)), the first layer is realized through thread-local allocation buffers (LABs) and "hot" spans for fast-path allocation, while the second layer, a distributed span-pool of lock-free stacks, tracks empty spans and enables rapid reuse under high concurrency. This separation reduces contention and false sharing, allowing constant-time allocation modulo synchronization.
In persistent memory settings, the division is not only logical but physical: Metall (Iwabuchi et al., 2021) stores data structures in mapped files on NVRAM, optimizing allocation and durability at the persistent layer while providing a high-performance volatile interface closer to traditional allocators.
2. Allocation Strategies, Data Structures, and Algorithms
Two-level allocators typically employ fixed-size block allocations (slabs, spans, pages) organized in hierarchical pools or stacks. Scalloc (Aigner et al., 2015) introduces "virtual spans" of fixed virtual size (2 MB), each containing a real span suited for a particular size class. This uniformity ensures that small and large objects are allocated via the same mechanism, with physical backing enforced by on-demand paging—unmapped regions incurring no RAM cost.
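A consequence of fixed-size, aligned virtual spans is that any address can be mapped back to its span header by masking. The sketch below illustrates this; the `SIZE_CLASSES` table is illustrative, not scalloc's actual class table.

```python
VIRTUAL_SPAN = 2 * 1024 * 1024   # every span occupies 2 MiB of virtual memory

# Hypothetical size-class table: power-of-two block sizes, 16 B .. 32 KiB.
SIZE_CLASSES = [16 << i for i in range(12)]

def size_class(request):
    """Round a request up to its size class."""
    for c in SIZE_CLASSES:
        if request <= c:
            return c
    raise ValueError("large object: backed directly by the virtual span")

def span_base(addr):
    """Spans are 2 MiB-aligned, so any block address maps to its span
    header with one mask; free() needs no lookup table."""
    return addr & ~(VIRTUAL_SPAN - 1)
```

Only the pages of the real span inside each 2 MiB virtual span are ever touched, so the unmapped remainder costs no physical memory.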
Allocation logic involves state transitions; pseudocode for scalloc’s span acquisition showcases typical design:
```
function get_span(size_class):
    repeat:
        span = LAB.reusable_spans[size_class].get()
        if span ≠ NULL and try_mark_hot(span):
            return span
    until span is NULL
    return span_pool.get(size_class)
```
In concurrent, fixed-size block allocators (Blelloch et al., 2020), batch transfer between local and shared pools is initiated only on exhaustion or excess, thereby maintaining worst-case constant-time allocation and free operations while bounding the additive space overhead as a function of the number of processes.
For heterogeneous memory hierarchies, allocation managers must dynamically grade pages between levels. The Aging paging algorithm (Oren, 2017) maintains a $b$-bit reference counter per page; on each aging step the counter is shifted right and its top bit is set if the page was referenced, so the number of leading zeros in the counter measures how long ago the page was last used. A page's leading-zero count, taken relative to the counter width $b$ and the number of memory levels $m$, determines its demotion between fast and slow memory (for a two-level system, $m = 2$).
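The aging mechanics can be sketched as follows. The `tick` update is the standard aging step; the `level` grading rule, which maps leading zeros linearly onto levels, is an illustrative assumption rather than the paper's exact formula.

```python
COUNTER_BITS = 8   # b: width of each page's aging counter

def tick(counter, referenced):
    """One aging step: shift right, then set the top bit if the page
    was referenced during the last interval."""
    counter >>= 1
    if referenced:
        counter |= 1 << (COUNTER_BITS - 1)
    return counter & ((1 << COUNTER_BITS) - 1)

def leading_zeros(counter):
    return COUNTER_BITS - counter.bit_length()

def level(counter, m=2, b=COUNTER_BITS):
    """Hypothetical grading rule: more leading zeros means a colder
    page, which is demoted to a deeper (slower) level."""
    return min(m - 1, leading_zeros(counter) * m // b)
```

A just-referenced page (`tick(0, True)` = `0b10000000`) has zero leading zeros and stays in fast memory; a long-idle page decays toward zero and is demoted.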
3. Fragmentation, Reclamation, and Caching
Physical and logical fragmentation is controlled via reclamation protocols tailored to both levels. Scalloc (Aigner et al., 2015) uses madvise calls (e.g., MADV_DONTNEED) when real spans become empty, releasing physical pages. Writing $F$ for the fragmentation counter, $S_c$ for the real-span size of class $c$, and $p$ for the payload, fragmentation is tracked via:
- Allocation (no reusable span): $F \leftarrow F + (S_c - p)$
- Deallocation (last block): $F \leftarrow F - (S_c - p)$
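A minimal sketch of this bookkeeping, assuming fragmentation grows by a span's unused remainder when a fresh span is taken and shrinks by the same amount when the span's last block is freed (the paper's exact accounting may differ):

```python
class FragTracker:
    """Span-level fragmentation accounting sketch: F is the total
    committed-but-unused remainder across live spans."""
    def __init__(self):
        self.F = 0

    def on_new_span(self, span_size, payload):
        # Allocation found no reusable span: the span's slack is committed.
        self.F += span_size - payload

    def on_span_empty(self, span_size, payload):
        # Last block freed: the span is returned, reclaiming its slack.
        self.F -= span_size - payload
```

In scalloc this is the moment the allocator can also call madvise to return the span's physical pages.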
Modern systems (e.g., Jenga (Zhang et al., 24 Mar 2025)) address fragmentation from heterogeneous embeddings by selecting a large page size $P$ as the least common multiple (LCM) of the embedding sizes $e_1, \dots, e_n$:

$P = \operatorname{lcm}(e_1, e_2, \dots, e_n)$
Partitioning large pages into smaller, evenly divisible pages minimizes waste.
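The LCM construction is easy to check concretely; the function names below are illustrative, not Jenga's API.

```python
from math import gcd
from functools import reduce

def lcm_page(embedding_sizes):
    """Large page size as the LCM of the per-layer embedding sizes,
    so every embedding size divides the page evenly."""
    return reduce(lambda a, b: a * b // gcd(a, b), embedding_sizes)

def pages_per_large_page(large_page, embedding_size):
    """Split one large page into small pages of one embedding each;
    no space is wasted because embedding_size divides large_page."""
    assert large_page % embedding_size == 0
    return large_page // embedding_size
```

For example, embedding sizes 128 and 96 yield a 384-unit large page, which splits evenly into three 128-unit or four 96-unit small pages.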
Caching APIs in Jenga (update_last_access, set_prefix_length, get_possible_prefix) enable layer- and request-aware eviction, supporting batched serving, cache-hit maximization, and effective memory utilization; in evaluations, Jenga achieves up to 79.6% improvement in GPU utilization and up to 4.92× higher throughput than prior schemes.
4. Scalability, Concurrency, and Synchronization
By isolating fast-path allocations in local pools and using lock-free or wait-free shared data structures (e.g., Treiber stacks as in scalloc (Aigner et al., 2015), P-SIM construction as in (Blelloch et al., 2020)), two-level allocators achieve multicore scalability with minimal contention. Atomic single-word CAS, LL/SC simulation, and modular locking strategies allow high concurrency even for security-hardened allocators (e.g., StarMalloc (Reitz et al., 14 Mar 2024)).
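The shared-pool workhorse here is the Treiber stack: push and pop retry a single compare-and-swap on the head pointer. CPython exposes no user-level CAS, so the sketch below simulates one with a lock; on real hardware `_cas` is a single atomic instruction and the loops are lock-free.

```python
import threading

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value, self.next = value, next

class TreiberStack:
    """Treiber stack sketch: all synchronization is one CAS on head."""
    def __init__(self):
        self.head = None
        self._lock = threading.Lock()  # stands in for hardware CAS

    def _cas(self, expected, new):
        with self._lock:
            if self.head is expected:
                self.head = new
                return True
            return False

    def push(self, value):
        while True:                     # retry on contention
            old = self.head
            if self._cas(old, Node(value, old)):
                return

    def pop(self):
        while True:
            old = self.head
            if old is None:
                return None             # pool empty
            if self._cas(old, old.next):
                return old.value
```

In an allocator, the values pushed and popped are free spans or block batches, so contention is confined to slow-path transfers.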
StarMalloc further subdivides the regular allocation region into arenas mapped to threads via thread-local storage, with each size class guarded by fine-grained locks for concurrent allocation. Predictable performance (execution times within 0.84×–1.38× of competitor allocators) is achieved while maintaining low overhead under real-world, multiprocess workloads.
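The arena/TLS structure can be sketched as follows. The arena-selection policy (hashing the thread id) and the free-list representation are assumptions of the sketch; StarMalloc's actual mapping and slab layout are more involved.

```python
import threading

NUM_ARENAS = 4
SIZE_CLASSES = [16, 32, 64]

class Arena:
    """One arena: a free list per size class, each guarded by its own
    lock, so threads sharing an arena contend only within a class."""
    def __init__(self):
        self.locks = {c: threading.Lock() for c in SIZE_CLASSES}
        self.free_lists = {c: [] for c in SIZE_CLASSES}

arenas = [Arena() for _ in range(NUM_ARENAS)]
_tls = threading.local()

def my_arena():
    """Map the calling thread to an arena via thread-local storage."""
    if not hasattr(_tls, "arena"):
        _tls.arena = arenas[threading.get_ident() % NUM_ARENAS]
    return _tls.arena

def alloc(size_class):
    arena = my_arena()
    with arena.locks[size_class]:       # fine-grained: per size class
        fl = arena.free_lists[size_class]
        return fl.pop() if fl else object()  # fresh "block" if empty

def free(block, size_class):
    arena = my_arena()
    with arena.locks[size_class]:
        arena.free_lists[size_class].append(block)
```

Two threads in different arenas, or in the same arena but different size classes, never touch the same lock.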
5. Persistence, Recoverability, and Position Independence
Persistent memory allocators (e.g., Ralloc (Cai et al., 2020), Metall (Iwabuchi et al., 2021)) implement two-level designs where the volatile allocation logic is decoupled from infrequent persistent metadata updates. Ralloc’s recoverability criterion ensures that post-crash, only genuinely in-use blocks (reachable from persistent roots) remain allocated. Recovery uses tracing garbage collection, facilitated by user-supplied filter functions to avoid conservative GC false positives. Position-independent pointers (offset-based smart pointers) allow remapping heaps at arbitrary virtual addresses.
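Position independence hinges on storing offsets rather than raw addresses. The toy heap below (a hypothetical construction, not Ralloc's or Metall's actual smart-pointer type) shows why an offset-based reference survives remapping unchanged:

```python
class Heap:
    """Persistent-heap sketch: objects live at offsets inside one
    buffer, and references are stored as offsets from the heap base,
    never as absolute addresses."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.top = 0

    def alloc(self, n):
        off, self.top = self.top, self.top + n
        return off                       # the offset IS the "pointer"

    def write(self, off, data):
        self.buf[off:off + len(data)] = data

    def read(self, off, n):
        return bytes(self.buf[off:off + n])

# Remapping the heap (copying its bytes to a new buffer, standing in
# for a new virtual address range) needs no pointer fixups:
h = Heap(64)
node = h.alloc(8)
h.write(node, b"payload!")
relocated = Heap(64)
relocated.buf[:] = h.buf                 # same bytes, new "address"
assert relocated.read(node, 8) == b"payload!"
```

An absolute pointer saved in the first mapping would dangle after the move; the offset remains valid in any mapping of the same bytes.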
Metall leverages memory-mapped files with batch-synchronized mmap (bs-mmap) to minimize OS call overhead and fragmentation. Objects are allocated directly into mapped persistent regions, enabling data reuse across analytics invocations.
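The basic file-backed mapping that bs-mmap batches can be shown with the standard mmap interface; this sketch demonstrates only plain persistence across invocations, not Metall's batching or allocator layout.

```python
import mmap
import os
import tempfile

# Reserve one page of backing store in a file.
path = os.path.join(tempfile.mkdtemp(), "heap.bin")
with open(path, "wb") as f:
    f.truncate(4096)

# "Allocate" an object directly in the mapped persistent region.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as m:
        m[0:5] = b"graph"
        m.flush()          # persist the page before unmapping

# A later invocation reopens the file and reuses the data as-is.
with open(path, "rb") as f:
    assert f.read(5) == b"graph"
```

Stores through the mapping go to the file's pages, so data structures built in one analytics run are available to the next without serialization.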
6. Security Verification, Correctness, and Optimization
Recent systems such as StarMalloc (Reitz et al., 14 Mar 2024) employ formal verification (Steel separation logic and dependent types) for memory safety and heap integrity. Allocator metadata and user data are segregated, thwarting buffer overflows and heap exploits. Size classes, slab management, quarantine policies, guard pages, and canaries are specified and verified with separation logic predicates and type-level invariants.
Compiler-level optimization (e.g., dead-allocation elimination, pointer-to-integer cast removal) is facilitated by two-phase infinite/finite memory models (Beck et al., 24 Apr 2024): transformations proved correct under unbounded, infinite-memory semantics are refined to finite systems via a lifting argument.
This allows aggressive optimization and provably sound executable interpreters bridging high-level semantics and hardware realities.
7. Practical Applications and Impact
Two-level allocators are critical in environments characterized by concurrency, heterogeneity, persistence, or high-performance requirements:
- In serving LLMs (Jenga (Zhang et al., 24 Mar 2025)), two-level allocation with LCM-based partitioning and cache-aware APIs dramatically increases GPU throughput and utilization on platforms such as vLLM.
- In data analytic workloads with persistent memory (Metall (Iwabuchi et al., 2021)), two-level strategies accelerate graph construction and enable durable storage for large datasets.
- In multicore, general-purpose allocation (scalloc (Aigner et al., 2015)), two-level pool design balances memory reuse, minimizes fragmentation, and ensures scalability and low-latency allocation/deallocation.
- Verification-oriented allocators (StarMalloc (Reitz et al., 14 Mar 2024)) demonstrate formal correctness and security in real-world systems such as Firefox.
Through careful architectural separation and control at both levels, two-level memory allocators remain central to state-of-the-art solutions across high-performance, concurrent, and persistent memory management domains.