Exgen-Malloc: Efficient Single-Threaded Allocator
- Exgen-Malloc is a single-threaded memory allocator that streamlines metadata and free-list management to enhance allocation speed and reduce fragmentation.
- It aggregates page metadata and uses a unified free-block list per page, yielding 18–20% lower L1 data cache miss rates and up to 99% fewer data TLB misses than some state-of-the-art allocators.
- Benchmark results show geometric-mean speedups of up to 1.93× over dlmalloc and memory savings of up to 25.2% over mimalloc, making it well suited to high-frequency allocation in single-threaded environments.
Exgen-Malloc is a single-threaded memory allocator designed to minimize overhead and fragmentation while maximizing allocation speed and memory efficiency. Its architecture diverges from mainstream multi-threaded allocators by specializing for the single-threaded context, consolidating metadata and streamlining control logic. Exgen-Malloc retains key innovations of contemporary allocator design and introduces careful strategies for heap management, free-list organization, and resource commitment. Quantitative evaluation demonstrates substantial speedup and memory savings over legacy and modern allocators on several platforms and benchmark suites (Li et al., 11 Oct 2025).
1. Architectural Foundations
Exgen-Malloc implements a centralized heap architecture rather than partitioning memory into per-thread pools or complex hierarchical regions, as is customary in allocators targeting concurrent workloads. The heap is divided into segments—each typically 4 MiB—further subdivided into pages tailored for different block-size classes. Page metadata tracks block size, capacity, allocated and free block counts, and a single pointer to the free-block list.
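The page metadata fields listed above can be pictured as a small C record. This is a minimal sketch under stated assumptions: the type and field names (`page_meta_t`, `free_count`, and so on), the field widths, and the 64-byte cache-line figure are illustrative choices, not taken from the Exgen-Malloc source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A free block stores the link to the next free block in its own payload. */
typedef struct block { struct block *next; } block_t;

/* Hypothetical layout of per-page metadata (names and widths are assumptions). */
typedef struct page_meta {
    uint32_t block_size;   /* bytes per block for this page's size class */
    uint32_t capacity;     /* number of blocks the page can hold */
    uint32_t used;         /* blocks currently allocated */
    uint32_t free_count;   /* blocks currently free */
    block_t *free;         /* head of the single free-block list */
} page_meta_t;

/* The record fits inside one 64-byte cache line (a typical line size). */
_Static_assert(sizeof(page_meta_t) <= 64, "page metadata exceeds a cache line");

/* Bookkeeping invariant: every block is either allocated or free, and the
 * list head is NULL exactly when no block is free. */
static int page_meta_consistent(const page_meta_t *m) {
    assert(m != NULL);
    return m->used + m->free_count == m->capacity
        && (m->free_count == 0) == (m->free == NULL);
}
```

Keeping all counters and the list head in one compact record means a single cache line fetch is enough to decide whether a page can serve a request.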
Unlike multi-threaded allocators (e.g., mimalloc) that maintain multiple free lists per page to handle remote deallocation and synchronization, Exgen-Malloc uses a unified free-block list per page. In a single-threaded environment, this approach eliminates cross-thread coordination, reduces indirections, streamlines free-list traversal, and enhances the locality of reference throughout the allocation path.
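To illustrate why a single list suffices without remote deallocation, the sketch below shows a free path that is a plain push and an allocation path that is a plain pop. The names `page_t`, `page_push`, and `page_pop` are hypothetical; the real allocator's interfaces may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types; field names are assumptions, not the paper's. */
typedef struct block { struct block *next; } block_t;

typedef struct page {
    block_t *free;   /* the only free list: no thread-local or remote lists */
    size_t   used;   /* blocks currently handed out */
} page_t;

/* Free: push the block on the page's list. No atomics, no locks. */
static void page_push(page_t *pg, void *ptr) {
    assert(ptr != NULL && pg->used > 0);
    block_t *b = (block_t *)ptr;
    b->next  = pg->free;
    pg->free = b;
    pg->used--;
}

/* Allocate: pop the head, or return NULL when the page has no free block. */
static void *page_pop(page_t *pg) {
    block_t *b = pg->free;
    if (b == NULL) return NULL;
    pg->free = b->next;
    pg->used++;
    return b;
}
```

Because the list is last-in, first-out, the most recently freed block is the next one handed out, which is one way a unified list can keep recently touched memory in use.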
Exgen-Malloc optimizes memory commitment and reclamation by deferring physical allocation until needed (useful for short-lived or memory-light processes) and caching one recently-freed segment per page type. This segment reuse minimizes mmap/munmap system calls and thus overhead, contributing to both speed and reduced fragmentation.
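The segment cache can be sketched with POSIX `mmap` and `munmap`. This is an illustrative sketch, not the allocator's actual code: `segment_acquire`, `segment_release`, and the single global cache slot are assumptions (the text describes one cached segment per page type, which a fuller version would model as a small array). It also relies on the common behavior of anonymous private mappings, where physical pages are supplied on first touch, as a stand-in for deferred commitment.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define SEGMENT_SIZE ((size_t)4 << 20)   /* 4 MiB, per the text */

/* Single cache slot (assumption: one slot instead of one per page type). */
static void *cached_segment = NULL;

/* Obtain a segment, preferring the cached one to avoid a system call. */
static void *segment_acquire(void) {
    if (cached_segment != NULL) {
        void *seg = cached_segment;
        cached_segment = NULL;
        return seg;
    }
    void *seg = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return seg == MAP_FAILED ? NULL : seg;
}

/* Release a segment: park it in the cache if the slot is empty,
 * otherwise return it to the operating system. */
static void segment_release(void *seg) {
    assert(seg != NULL);
    if (cached_segment == NULL) {
        cached_segment = seg;
        return;
    }
    int rc = munmap(seg, SEGMENT_SIZE);
    assert(rc == 0);
    (void)rc;                      /* silence unused warning under NDEBUG */
}
```

A workload that repeatedly frees and reallocates its last segment would, under this scheme, make no `mmap` or `munmap` calls after the first acquisition.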
2. Metadata Minimization and Data Locality
Legacy allocators such as dlmalloc employ relatively simple header-based strategies but lack the data-locality benefits of aggregating metadata along access-critical paths. Exgen-Malloc compacts page metadata and places it adjacent to the blocks it describes, which minimizes pointer chasing and increases the cache residency of allocation control information. Fine-grained block-size classes (e.g., 8-byte steps) reduce internal fragmentation and accommodate the diverse range of allocation sizes typical of modern applications.
A single free-block list per page provides two major benefits: smaller metadata and more frequent reuse of spatially coherent blocks, which improves cache and TLB behavior. Hardware-counter measurements show 18–20% lower L1 data cache miss rates and up to 99% fewer data TLB misses relative to some state-of-the-art allocators.
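The effect of 8-byte size classes on internal fragmentation can be made concrete with a little arithmetic. In the sketch below, the helper names are hypothetical, and applying 8-byte steps uniformly to every request size is a simplifying assumption; production allocators often widen the spacing for larger sizes.

```c
#include <assert.h>
#include <stddef.h>

#define CLASS_STEP 8u   /* 8-byte steps, as in the text */

/* Round a request up to the size of its class (requires size > 0). */
static size_t class_size(size_t size) {
    assert(size > 0);
    return (size + CLASS_STEP - 1) / CLASS_STEP * CLASS_STEP;
}

/* Zero-based class index: 1..8 -> 0, 9..16 -> 1, and so on. */
static size_t class_index(size_t size) {
    return class_size(size) / CLASS_STEP - 1;
}

/* Bytes lost to rounding for one allocation (internal fragmentation). */
static size_t class_waste(size_t size) {
    return class_size(size) - size;
}
```

With this spacing, no request wastes more than 7 bytes: a 17-byte request lands in the 24-byte class and loses 7 bytes, while a 24-byte request loses none.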
3. Allocation Path and Performance Analysis
Exgen-Malloc’s allocation logic is characterized by its simplicity. Allocation proceeds by finding an appropriate page for the requested size, then taking the next available block from that page's single free-block list. If the list is empty, a new page is instantiated and initialized. The following pseudo-code captures this path:
```c
void *malloc(size_t size) {
    page *p = find_page_for(size);
    if (p->free != NULL) {
        void *block = p->free;
        p->free = ((block_header*)block)->next;
        return block;
    } else {
        p = new_page();
        void *block = p->free;
        p->free = ((block_header*)block)->next;
        return block;
    }
}
```
The absence of multi-list selection, lock-based coordination, and indirect metadata lookups shortens the control path, which is reflected in a lower average instruction count and improved branch prediction.
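The pseudo-code leaves `find_page_for` and `new_page` undefined. The following self-contained toy fills them in so the path can be compiled and exercised. Everything here is an assumption made for illustration: a static arena instead of segments, a single 64-byte size class, a linear page scan, and the names `toy_alloc`, `find_page`, and `slot_t`. Unlike the pseudo-code, it also handles the case where no new page can be created.

```c
#include <assert.h>
#include <stddef.h>

/* Toy parameters, chosen for illustration only. */
#define BLOCK_BYTES    64u
#define SLOTS_PER_PAGE 16u
#define MAX_PAGES      4u

/* A slot is a free-list link while free and raw payload while allocated. */
typedef union slot {
    union slot   *next;
    unsigned char bytes[BLOCK_BYTES];
} slot_t;

typedef struct page { slot_t *free; } page_t;

static slot_t arena[MAX_PAGES][SLOTS_PER_PAGE];
static page_t pages[MAX_PAGES];
static size_t pages_in_use = 0;

/* Thread every slot of a fresh page onto its free list, lowest address first. */
static page_t *new_page(void) {
    if (pages_in_use == MAX_PAGES) return NULL;      /* arena exhausted */
    page_t *pg = &pages[pages_in_use];
    slot_t *slots = arena[pages_in_use++];
    pg->free = NULL;
    for (size_t i = SLOTS_PER_PAGE; i-- > 0; ) {
        slots[i].next = pg->free;
        pg->free = &slots[i];
    }
    return pg;
}

/* Return a page that still has a free slot, or NULL if there is none. */
static page_t *find_page(void) {
    for (size_t i = 0; i < pages_in_use; i++)
        if (pages[i].free != NULL) return &pages[i];
    return NULL;
}

/* Allocation path: pop from an existing page, else start a new page. */
static void *toy_alloc(size_t size) {
    if (size == 0 || size > BLOCK_BYTES) return NULL;
    page_t *pg = find_page();
    if (pg == NULL && (pg = new_page()) == NULL) return NULL;
    slot_t *s = pg->free;
    pg->free = s->next;
    return s;
}
```

Consecutive allocations from a fresh page return adjacent slots, and a second page is created only once the first is exhausted. A real allocator would index pages by size class rather than scanning them.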
Benchmark results on two Intel Xeon platforms indicate Exgen-Malloc achieves geometric mean speedups of 1.17× (SPEC CPU2017), 1.10× (redis-benchmark), and 1.93× (mimalloc-bench) over dlmalloc. Compared to mimalloc (a modern, highly-optimized allocator), Exgen-Malloc demonstrates up to 1.05× speedup on SPEC CPU2017 and up to 1.02× on mimalloc-bench.
4. Memory Efficiency and Fragmentation Reduction
Exgen-Malloc is purpose-built to minimize its own footprint as well as fragmentation in the managed heap. Evaluation reveals 6.2% memory savings compared to mimalloc on SPEC CPU2017 and up to 25.2% on mimalloc-bench, with more modest gains (0.1%) observed on redis-benchmark due to smaller opportunities for fragmentation in uniformly-sized allocations.
The allocator achieves these reductions by limiting per-allocation metadata, closely packing blocks, optimizing block reuse, and employing deferred segment commitment. This design is particularly advantageous when minimizing datacenter energy consumption and lowering the “datacenter tax” incurred by memory inefficiencies—a consideration underscored in hyperscale deployment scenarios.
5. Comparison with Legacy Allocators
Although dlmalloc also uses a simple single-threaded design, it does not incorporate modern allocator innovations such as fine-grained block-size management, aggregated metadata, or adaptive segment policies. Exgen-Malloc applies techniques such as inlined functions (for lower call overhead), aggregated free-list management, and dynamic commitment and caching of segments, all of which are absent in legacy allocators.
A summary comparison:
| Feature | dlmalloc | Exgen-Malloc | mimalloc |
|---|---|---|---|
| Thread Model | Single-threaded | Single-threaded | Multi-threaded |
| Free-List Structure | Single global | Single per-page list | Multi-list per page |
| Metadata Locality | Fragmented | Aggregated | Aggregated |
| Commitment/Reclamation | Immediate | Deferred, cached | Deferred, cached |
| Block-Size Granularity | Coarse steps | 8 bytes | 8 bytes |
Exgen-Malloc thus combines the low overhead of legacy single-threaded allocators with the architectural innovations characteristic of modern systems, yielding improved locality, reduced fragmentation, and higher allocation speed.
6. Technical Implementation Details
Each segment is 4 MiB, subdivided into pages for different block-size classes. The page metadata contains key bookkeeping data:
- Block size
- Block capacity
- Allocated/free counts
- Free-block list pointer
For small allocations, each segment holds 64 pages of 64 KiB each (64 × 64 KiB = 4 MiB), organized for cache and TLB affinity.
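The segment and page geometry lends itself to simple address arithmetic. The sketch below assumes that segments are aligned to their own size, as in mimalloc-style designs, so the owning segment and page of any address can be found by masking and dividing. Whether Exgen-Malloc locates metadata exactly this way is an assumption, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEGMENT_SIZE      ((uintptr_t)4 << 20)    /* 4 MiB */
#define SMALL_PAGE_SIZE   ((uintptr_t)64 << 10)   /* 64 KiB */
#define PAGES_PER_SEGMENT (SEGMENT_SIZE / SMALL_PAGE_SIZE)

/* 64 pages of 64 KiB exactly fill one 4 MiB segment. */
_Static_assert(PAGES_PER_SEGMENT == 64, "segment/page geometry mismatch");

/* Start of the segment containing addr (assumes SEGMENT_SIZE alignment). */
static uintptr_t segment_base(uintptr_t addr) {
    return addr & ~(SEGMENT_SIZE - 1);
}

/* Index (0..63) of the small page containing addr within its segment. */
static size_t page_index(uintptr_t addr) {
    size_t idx = (size_t)((addr - segment_base(addr)) / SMALL_PAGE_SIZE);
    assert(idx < PAGES_PER_SEGMENT);
    return idx;
}
```

With power-of-two sizes, both lookups compile down to a mask and a shift, so finding the metadata for a block being freed needs no search.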
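Allocation speedup relative to the baseline (with $T_{\text{dlmalloc}}$ the execution time under dlmalloc and $T_{\text{Exgen}}$ the execution time under Exgen-Malloc) is expressed as:

$$\text{Speedup} = \frac{T_{\text{dlmalloc}}}{T_{\text{Exgen}}}$$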
Observed geometric-mean speedups over dlmalloc are 1.17× (SPEC CPU2017), 1.10× (redis-benchmark), and 1.93× (mimalloc-bench). Data TLB misses are often cut by 87–99%, as measured using platform hardware counters.
Memory savings are computed as the normalized ratio of maximum resident set size (RSS) against competitor allocators, with observed reductions matching or exceeding contemporaries in realistic applications.
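One way to reproduce this kind of measurement is to read the peak RSS of a process with POSIX `getrusage` and compare two runs. The sketch below is illustrative: the formula `1 - ours / baseline` is an assumed reading of "normalized ratio", and `memory_saving` and `peak_rss` are hypothetical helper names. Note that `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS, so both runs must be measured on the same platform.

```c
#include <assert.h>
#include <sys/resource.h>

/* Fractional saving of our peak RSS relative to a baseline allocator's
 * peak RSS (assumed definition: 1 - ours / baseline). */
static double memory_saving(double baseline_rss, double ours_rss) {
    assert(baseline_rss > 0.0);
    return 1.0 - ours_rss / baseline_rss;
}

/* Peak resident set size of the calling process, or -1 on failure.
 * Units are platform dependent (kilobytes on Linux, bytes on macOS). */
static long peak_rss(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_maxrss;
}
```

For example, a baseline peak of 1000 units against 748 units yields a saving of 0.252, i.e. 25.2%.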
7. Significance, Limitations, and Deployment Context
Exgen-Malloc is most effective in single-threaded applications with diverse and high-frequency allocation patterns—such as scientific computing, data processing, and traditional server workflows. It is less suitable for environments requiring thread concurrency or remote deallocation, as its architecture omits the control logic needed for multi-threaded safety. For high-concurrency or mixed-mode workloads, allocators like mimalloc remain preferable due to their extensive synchronization capabilities.
The simplicity and empirical efficiency of Exgen-Malloc identify a design direction for future single-threaded allocators: prioritize metadata compaction, data locality, and adaptive segment management over legacy simplicity or multi-threaded robustness.
Exgen-Malloc’s quantified performance and efficiency improvements underscore its relevance in hyperscale and embedded deployment, where marginal gains at the allocator level translate to substantial global resource savings and operational cost reduction (Li et al., 11 Oct 2025).