Fast, Multicore-Scalable, Low-Fragmentation Memory Allocation through Large Virtual Memory and Global Data Structures (1503.09006v2)

Published 31 Mar 2015 in cs.PL

Abstract: We demonstrate that general-purpose memory allocation involving many threads on many cores can be done with high performance, multicore scalability, and low memory consumption. For this purpose, we have designed and implemented scalloc, a concurrent allocator that generally performs and scales in our experiments better than other allocators while using less memory, and is still competitive otherwise. The main ideas behind the design of scalloc are: uniform treatment of small and big objects through so-called virtual spans, efficiently and effectively reclaiming free memory through fast and scalable global data structures, and constant-time (modulo synchronization) allocation and deallocation operations that trade off memory reuse and spatial locality without being subject to false sharing.

Authors (4)
  1. Martin Aigner (1 paper)
  2. Christoph M. Kirsch (5 papers)
  3. Michael Lippautz (2 papers)
  4. Ana Sokolova (18 papers)
Citations (32)

Summary

  • The paper presents scalloc, a concurrent memory allocator delivering constant-time allocation and deallocation via virtual spans.
  • It employs a scalable global backend with efficient data structures to reduce synchronization overhead in multicore environments.
  • Experimental results demonstrate that scalloc significantly reduces memory fragmentation while maintaining competitive speed against traditional allocators.

An Examination of "Fast, Multicore-Scalable, Low-Fragmentation Memory Allocation"

The paper presents scalloc, a concurrent memory allocator designed for high performance, multicore scalability, and low memory consumption in concurrent programs that allocate memory dynamically. The design rests on three principal ideas: uniform treatment of differently sized objects through "virtual spans," efficient reclamation of free memory via fast and scalable global data structures, and constant-time (modulo synchronization) allocation and deallocation operations that trade off memory reuse and spatial locality without being subject to false sharing. The authors position scalloc as generally outperforming or matching existing allocators while using less memory, making it a compelling option for high-performance concurrent programming.

Core Contributions

The paper makes several contributions to memory allocation, specifically:

  1. Virtual Spans: Scalloc introduces virtual spans, same-sized segments of virtual memory that enable uniform handling of small and large objects, reduce the need for synchronization, and keep memory consumption low. A virtual span occupies a large range of virtual memory, but only the portion actually in use is backed by physical memory, thanks to on-demand paging (a simplified sketch follows this list).
  2. Scalable Global Backend: The paper details a novel backend design that pools free spans in recently developed, efficient concurrent data structures such as lock-free stacks (a second sketch follows this list). The scalability of these structures is critical, as it allows the allocator to operate efficiently under high concurrency by minimizing coordination costs.
  3. Constant-Time Frontend: Scalloc's frontend performs allocation and deallocation in constant time (modulo synchronization), reducing the potential for performance bottlenecks. By eagerly returning empty spans to the backend, it balances memory reuse against fragmentation.
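To make the virtual-span idea concrete, the following C sketch reserves a large arena of uniformly sized spans with mmap and hands physical pages back to the OS with madvise. This is only an illustration under assumed parameters: the names and constants (VIRTUAL_SPAN_SIZE, ARENA_SPANS, the 2 MB span size) are illustrative rather than taken from scalloc's source, and the sketch omits scalloc's size classes, span headers, and thread-local frontends.

```c
/* Minimal sketch of the virtual-span idea, not scalloc's actual code.
 * Constants and names (VIRTUAL_SPAN_SIZE, ARENA_SPANS) are illustrative. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define VIRTUAL_SPAN_SIZE (2u * 1024 * 1024)  /* uniform size of every virtual span */
#define ARENA_SPANS       4096u               /* number of spans reserved up front  */

static uint8_t *arena;      /* base of the reserved virtual-memory arena */
static size_t   next_span;  /* bump index of the next unused span        */

/* Reserve a large region of virtual memory; physical pages are only
 * committed by the OS when the memory is first touched (demand paging). */
static int arena_init(void) {
    arena = mmap(NULL, (size_t)ARENA_SPANS * VIRTUAL_SPAN_SIZE,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return arena == MAP_FAILED ? -1 : 0;
}

/* Hand out the next virtual span; small and large objects are treated
 * uniformly because every span has the same virtual size. */
static void *span_acquire(void) {
    if (next_span >= ARENA_SPANS) return NULL;
    return arena + (next_span++ * (size_t)VIRTUAL_SPAN_SIZE);
}

/* Return the span's physical pages to the OS while keeping its virtual
 * address range reserved for later reuse. */
static void span_release(void *span) {
    madvise(span, VIRTUAL_SPAN_SIZE, MADV_DONTNEED);
}

int main(void) {
    if (arena_init() != 0) return 1;
    uint8_t *span = span_acquire();
    span[0] = 42;               /* first touch commits a physical page      */
    span_release(span);         /* physical memory goes back to the OS      */
    printf("span at %p recycled\n", (void *)span);
    return 0;
}
```

Releasing physical pages while keeping the virtual range reserved is what lets every span share one uniform virtual size without inflating the physical footprint.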
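The backend's role of pooling and recycling freed spans can be illustrated with a simple lock-free Treiber stack. This is a sketch of the principle only: scalloc's actual span-pool is a more elaborate, distributed structure, and a production-quality stack would additionally need ABA protection (e.g. tagged pointers or hazard pointers), which is omitted here.

```c
/* Minimal Treiber-stack sketch of a global pool of free spans.
 * scalloc's real backend distributes many such structures to reduce
 * contention; this single-stack version only shows the principle.
 * ABA protection is deliberately omitted for brevity. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct span_node {
    struct span_node *next;   /* link stored inside the free span itself */
} span_node;

static _Atomic(span_node *) free_spans = NULL;  /* top of the global stack */

/* Push a freed span onto the global pool (lock-free). */
static void pool_put(span_node *span) {
    span_node *top;
    do {
        top = atomic_load_explicit(&free_spans, memory_order_acquire);
        span->next = top;
    } while (!atomic_compare_exchange_weak_explicit(
                 &free_spans, &top, span,
                 memory_order_release, memory_order_acquire));
}

/* Pop a span for reuse, or return NULL if the pool is empty. */
static span_node *pool_get(void) {
    span_node *top;
    do {
        top = atomic_load_explicit(&free_spans, memory_order_acquire);
        if (top == NULL) return NULL;
    } while (!atomic_compare_exchange_weak_explicit(
                 &free_spans, &top, top->next,
                 memory_order_release, memory_order_acquire));
    return top;
}
```

The point of the sketch is that free spans can be recycled globally using only compare-and-swap loops on the stack top, avoiding locks on the common path; low coordination cost of this kind is what the paper's backend relies on for multicore scalability.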

Experimental Observations

The authors conducted an extensive set of experiments comparing scalloc against well-known allocators, including Hoard, jemalloc, and the TBB allocator, across varied synthetic benchmarks. The results show that scalloc either outperforms or is competitive with existing solutions in both single-threaded and multi-threaded settings. Notably, it significantly reduces memory consumption without sacrificing speed. On single-threaded workloads dominated by small objects (e.g., SPEC CPU2006's 483.xalancbmk), scalloc is competitive, while in multi-threaded and producer-consumer scenarios it offers superior scalability.

Theoretical and Practical Implications

The approach strengthens memory allocation strategies for concurrent programming environments, which are ubiquitous in modern computing. Virtual spans elegantly balance theoretical concerns of memory management, such as fragmentation and locality, with the practical demands of efficiency and scalability in multi-threaded workloads. Moreover, treating objects of all sizes uniformly simplifies the design and makes the allocator adaptable to different workloads without extensive custom optimization.

Forward-Looking Perspectives

Future research could explore integrating scalloc-like allocators into existing systems, potentially replacing or augmenting standard library allocators to raise baseline performance across the board. In light of continued hardware developments, further tuning of the design, particularly for NUMA architectures, could yield additional benefits on heterogeneous systems.

In summary, the scalloc allocator encapsulates a compelling mix of innovation, performance, and adaptability, marking it as a significant advancement in concurrent memory allocation. This paper provides both an immediate toolkit for improving application performance and a fertile ground for future exploration and optimization within the domain of memory management.
