MemShare: Efficient Memory Sharing
- MemShare is a suite of systems and algorithms that enable fine-grained, efficient shared-memory use across multicore analytics, deep learning inference, and caching environments.
- It employs techniques like ALTREP in R, zero-copy KV cache reuse for large models, and dynamic multi-tenant memory caching to reduce redundant data and improve throughput.
- The approach balances performance gains with safety trade-offs by requiring user-managed synchronization and careful namespace management to avoid data inconsistencies.
MemShare refers to a family of systems and algorithms for efficient, fine-grained memory sharing across processes or requests in multicore analytics, deep learning inference, web caching, and distributed proxy environments. MemShare designs target bottlenecks arising from redundant data replication, static partitioning, and inefficient memory management. Four lines of MemShare research are covered here: (1) in-memory data sharing for parallel analytics (notably R), (2) key–value (KV) cache reuse in large reasoning model inference, (3) multi-tenant memory cache managers for web/datacenter workloads, and (4) object sharing among caching proxies. All exploit application-level locality and sharing, but they vary in mechanisms and safety trade-offs.
1. Shared Memory for Parallel Analytics in R
Memshare (R package) provides true shared-memory support for large objects (e.g., matrices, vectors) across multicore R sessions. It is implemented as a C++ shared-memory buffer and exposed to R through the ALTREP (ALTernative REPresentations) framework. The C++ layer allocates anonymous memory pages using shm + mmap (POSIX) or MapViewOfFile (Windows), organized by a user-defined namespace and variable names to avoid session collisions. The ALTREP wrapper allows R processes ("master" and "worker" sessions in a parallel cluster) to map directly onto these shared memory buffers, avoiding both the serialization and the per-worker copies required by conventional parallel R (using PSOCK or FORK clusters).
The API exposes primitives to register and release shared variables, retrieve ALTREP views, and manage collective garbage collection. High-level wrappers (e.g., memApply, memLapply) mimic R's parallel apply family, but always operate in shared-memory mode over the registered buffers.
Compared to Bioconductor's SharedObject, which introduces copy-on-write and sharedSubset flags for safety at the cost of per-session duplication, Memshare adopts a strict page/view model. Pages are owned by the master, views are raw pointers, and no implicit copying occurs. Pages are deallocated only once all views (even across processes) are released, which prevents dangling references. Users are responsible for synchronization if concurrent writes are performed, and they must avoid having multiple masters access the same namespace, since the resulting behavior is undefined at the operating system level.
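The page/view discipline can be illustrated outside R. The following is a minimal Python sketch, not the Memshare API: the class PageRegistry and its methods are hypothetical names, and the standard-library multiprocessing.shared_memory module stands in for the package's C++ layer. It shows one owner per page, views that attach by name without copying, and a release that is refused while views remain. For brevity the view count lives in the master process only; a real cross-process count would need to live in shared state.

```python
import os
import struct
from multiprocessing import shared_memory


class PageRegistry:
    """Hypothetical master-side registry of named shared pages."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.pages = {}  # variable name -> SharedMemory owned by the master
        self.views = {}  # variable name -> count of open views

    def _segment(self, var):
        # Namespace plus variable name keeps sessions from colliding.
        return f"{self.namespace}_{var}"

    def register(self, var, values):
        """Copy a sequence of floats into a new shared page, once."""
        page = shared_memory.SharedMemory(
            name=self._segment(var), create=True, size=8 * len(values))
        struct.pack_into(f"{len(values)}d", page.buf, 0, *values)
        self.pages[var] = page
        self.views[var] = 0

    def retrieve_view(self, var):
        """Attach to an existing page by name; the data is not copied."""
        view = shared_memory.SharedMemory(name=self._segment(var))
        self.views[var] += 1
        return view

    def release_view(self, var, view):
        view.close()
        self.views[var] -= 1

    def release(self, var):
        """Free the page, but only once every view has been released."""
        if self.views[var] > 0:
            raise RuntimeError(f"{var!r} still has {self.views[var]} open view(s)")
        page = self.pages.pop(var)
        del self.views[var]
        page.close()
        page.unlink()


if __name__ == "__main__":
    registry = PageRegistry(namespace=f"demo{os.getpid()}")
    registry.register("x", [1.0, 2.0, 3.0])
    view = registry.retrieve_view("x")
    struct.pack_into("d", view.buf, 0, 10.0)  # unsynchronized in-place write
    print(struct.unpack_from("3d", registry.pages["x"].buf, 0))
    registry.release_view("x", view)
    registry.release("x")
```

The write through the view is visible through the master's page because both map the same memory. Nothing in the sketch serializes concurrent writers, which mirrors the user-managed synchronization described above.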
Performance in column-wise apply workloads shows Memshare achieves ~2× speedup and baseline-level resident set size (RSS) compared to SharedObject, which uses 2–10× the memory and may crash for very large (e.g., 10⁵×10⁵) matrices (Thrun et al., 10 Sep 2025).
2. KV Cache Reuse in Large Reasoning Models
In transformer-based large reasoning models (LRMs), inference—particularly with chain-of-thought (CoT) prompting—incurs substantial GPU memory overhead due to the accumulation of per-layer key and value (KV) caches for all generated tokens over long contexts (t ≈ 10⁴–10⁵). MemShare for KV cache reuse targets this by observing that many intermediate reasoning steps and corresponding KV states are highly similar or redundant.
The mechanism is a two-stage collaborative filtering algorithm:
- Stage 1 (Step-level similarity): Uses lexical similarity (cosine similarity of tokenized reasoning steps) to preselect candidate cache blocks.
- Stage 2 (Block-level distance): Among candidates, computes layer-wise L₂ distances of KV blocks to identify blocks sufficiently close (below a Euclidean threshold) for safe re-use.
If a suitable match is found, the system redirects the block table entry via zero-copy pointer reassignment (with PagedAttention-style block managers, e.g., vLLM, SGLang), rather than duplicating data or incurring bandwidth cost. This approach enables batch-level KV sharing under tight memory budgets, maximizing batch size and throughput.
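The two stages and the block-table redirection can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the bag-of-words tokenization, the thresholds (0.8 and 0.5), and the choice to compare blocks by their largest per-layer distance are all hypothetical. A KV block is modeled as a list of per-layer float vectors, and the block table as a list mapping logical block indices to physical block ids.

```python
import math
from collections import Counter


def cosine_similarity(step_a, step_b):
    """Stage 1 metric: bag-of-words cosine similarity of two reasoning steps."""
    a = Counter(step_a.lower().split())
    b = Counter(step_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def max_layer_distance(block_a, block_b):
    """Stage 2 metric: largest per-layer L2 distance between two KV blocks."""
    return max(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(layer_a, layer_b)))
        for layer_a, layer_b in zip(block_a, block_b))


def find_reusable_block(new_step, new_block, cached,
                        sim_threshold=0.8, dist_threshold=0.5):
    """Return the id of the closest cached block passing both stages, or None.

    cached maps physical block id -> (step text, KV block).
    """
    best_id, best_dist = None, dist_threshold
    for block_id, (step_text, kv_block) in cached.items():
        if cosine_similarity(new_step, step_text) < sim_threshold:
            continue  # stage 1: not a lexical candidate
        dist = max_layer_distance(new_block, kv_block)
        if dist < best_dist:  # stage 2: close enough, and closest so far
            best_id, best_dist = block_id, dist
    return best_id


def place_block(block_table, logical_idx, fresh_physical_id,
                new_step, new_block, cached):
    """Point a logical block at a matching cached block, else at a fresh one."""
    match = find_reusable_block(new_step, new_block, cached)
    block_table[logical_idx] = fresh_physical_id if match is None else match
    return match is not None
```

When place_block returns True, only the table entry changes and no KV data is copied, which is the zero-copy property described above. The cheap lexical test runs first so that the more expensive tensor comparison is limited to a small candidate set.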
Empirical results on NVIDIA A800 GPUs (vLLM 0.8.2) show throughput increases of up to 84.79% and KV memory reductions of ~21.7% on MATH-500 with DeepSeek-R1-Distill-Qwen-32B; QwQ-32B models reach similar improvements with ≤2% accuracy drop as measured by end-to-end logical reasoning benchmarks. Random sharing or aggressive reuse strategies, by contrast, degrade accuracy to 40–70%, highlighting the importance of semantic and geometric filtering (Chen et al., 29 Jul 2025).
3. Dynamic Multi-Tenant Web/Datacenter Memory Caches
Memshare is also instantiated as a log-structured, multi-tenant cache manager for memory-based web caches (e.g., memcached), optimizing hit-rate through dynamic partitioning and object sharing. The architecture replaces slab allocators with a global log segmented at 1 MB granularity; each application (tenant) is assigned a guaranteed "private" allocation with the rest of DRAM flexibly pooled as shared memory. A central arbiter tracks per-tenant usage and shadow-queue hits, dynamically assigning target memory based on instantaneous hit-rate gradients inferred via shadow queues.
Eviction policies are tenant-specific functions (e.g., LRU, LFU, or hybrids). The cleaner-arbiter feedback loop compacts, relocates, or evicts objects per-segment, optimizing tenant allocation. A "dynamic allocation algorithm" leverages shadow queues to redistribute shared resources responsively; additional variants ("idle-tax") reclaim capacity from idle tenants based on observed activity fraction.
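The arbiter's feedback loop can be illustrated with a simplified Python sketch. The names Tenant and Arbiter, the fixed credit size, and the rule for choosing a donor (the tenant holding the most memory above its private floor) are assumptions made for illustration; the source does not specify them. A shadow queue here stores only the keys of recently evicted objects, so a miss that hits the shadow queue signals that the tenant would have benefited from more memory.

```python
from collections import OrderedDict


class Tenant:
    def __init__(self, name, private_bytes, target_bytes, shadow_capacity=1024):
        self.name = name
        self.private_bytes = private_bytes  # guaranteed floor
        self.target_bytes = target_bytes    # current target, never below the floor
        self.shadow = OrderedDict()         # keys of evicted objects, oldest first
        self.shadow_capacity = shadow_capacity

    def record_eviction(self, key):
        """Remember an evicted key (not its value) in the shadow queue."""
        self.shadow[key] = True
        self.shadow.move_to_end(key)
        while len(self.shadow) > self.shadow_capacity:
            self.shadow.popitem(last=False)


class Arbiter:
    def __init__(self, tenants, credit_bytes=64 * 1024):
        self.tenants = {t.name: t for t in tenants}
        self.credit_bytes = credit_bytes

    def on_miss(self, tenant_name, key):
        """On a shadow-queue hit, move credit to the tenant; return bytes moved."""
        tenant = self.tenants[tenant_name]
        if key not in tenant.shadow:
            return 0
        del tenant.shadow[key]
        donors = [t for t in self.tenants.values()
                  if t is not tenant and t.target_bytes > t.private_bytes]
        if not donors:
            return 0
        # Assumed donor rule: take from whoever is furthest above its floor.
        donor = max(donors, key=lambda t: t.target_bytes - t.private_bytes)
        moved = min(self.credit_bytes, donor.target_bytes - donor.private_bytes)
        donor.target_bytes -= moved
        tenant.target_bytes += moved
        return moved
```

Total target memory is conserved by each transfer, and no tenant is pushed below its private allocation. In a full system the cleaner would then evict from tenants that exceed their new targets.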
Performance experiments on real traces (Memcachier, 32 GB DRAM) show Memshare (shared/idle-tax modes) increases aggregate hit rate from 84.66% (static partitioning) to 89.92–90.75%, reducing total cache misses by 34–40%. The increase in CPU and DRAM bandwidth for cleaning is modest (≤13% cycles under full write; <0.01% DDR4 bandwidth), with minimal increase in get latency (21.4→22.0 μs) (Cidon et al., 2016).
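The reported miss reduction follows directly from the hit rates, since the miss rate is the complement of the hit rate. A short Python check (the function name miss_reduction is chosen here for illustration) gives about 34.3% for the rise to 89.92% and about 39.7% for the rise to 90.75%, consistent with the quoted 34–40% range.

```python
def miss_reduction(hit_before, hit_after):
    """Relative drop in misses when the hit rate (in percent) rises."""
    miss_before = 100.0 - hit_before
    miss_after = 100.0 - hit_after
    return (miss_before - miss_after) / miss_before


for hit_after in (89.92, 90.75):
    print(f"{hit_after:.2f}% hit rate: "
          f"{miss_reduction(84.66, hit_after):.1%} fewer misses")
```

A gain of five to six percentage points in hit rate is a large relative change here because the baseline miss rate is only 15.34%.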
4. Object Sharing Among Caching Proxies
In edge-cloud settings with multiple proxy caches, MemShare can refer to a content-caching system where each proxy maintains its own virtual LRU list within a single (shared) physical cache. When the same object is present across multiple LRU lists, its size is divided equally among participating caches. The system dynamically readjusts "virtual lengths" and orchestrates eviction via a loop that compensates for the impact of shared objects as their reference count changes.
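The equal-division accounting and the eviction loop can be sketched as follows. This is a minimal Python illustration with hypothetical names (SharedLRUCache, virtual_length, access), not the MCD-OS code. Each proxy is charged size / (number of holders) for every object in its list, so when one proxy evicts a shared object the charge to the remaining holders rises, and the loop repeats until every list fits its allocation again.

```python
from collections import OrderedDict


class SharedLRUCache:
    def __init__(self, allocations):
        self.capacity = dict(allocations)  # proxy -> virtual allocation
        self.lists = {p: OrderedDict() for p in allocations}  # oldest key first
        self.sizes = {}    # key -> full object size
        self.holders = {}  # key -> set of proxies whose list holds the key

    def virtual_length(self, proxy):
        """Bytes charged to a proxy: each object's size split among its holders."""
        return sum(self.sizes[k] / len(self.holders[k]) for k in self.lists[proxy])

    def physical_length(self):
        """Bytes actually stored: each object counted once."""
        return sum(self.sizes.values())

    def access(self, proxy, key, size):
        """Return True on a hit in this proxy's list; insert on a miss."""
        lru = self.lists[proxy]
        if key in lru:
            lru.move_to_end(key)
            return True
        self.sizes.setdefault(key, size)
        self.holders.setdefault(key, set()).add(proxy)
        lru[key] = None
        self._rebalance()
        return False

    def _evict_lru(self, proxy):
        key, _ = self.lists[proxy].popitem(last=False)
        self.holders[key].discard(proxy)
        if not self.holders[key]:  # last holder gone: drop the object itself
            del self.holders[key], self.sizes[key]

    def _rebalance(self):
        # An eviction in one list can push another list over its allocation,
        # so keep sweeping until a full pass evicts nothing.
        changed = True
        while changed:
            changed = False
            for proxy in self.lists:
                while (self.lists[proxy]
                       and self.virtual_length(proxy) > self.capacity[proxy]):
                    self._evict_lru(proxy)
                    changed = True
```

Because shared objects are stored once but charged fractionally, the virtual lengths always sum to the physical length, which is what makes it possible to fit more per-proxy content into the same physical cache.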
A working-set approximation (extending Denning-Schwartz) provides analytical prediction of per-proxy hit probabilities and enables efficient admission control—allowing overbooking so that the sum of virtual cache allocations can exceed the physical cache size, enhancing overall utilization and hit rates. Enhanced eviction logic defers ripple evictions via tunable slack parameters, consolidating evictions and limiting latency overhead. Empirical results on the MCD-OS prototype show that sharing increases hit rates by up to 10% with only a small (~15%) set-latency overhead (Kesidis et al., 2019).
5. Quality, Safety, and Trade-Offs
Across the MemShare literature, the prevailing trade-off is between safety (e.g., copy-on-write, session isolation), performance (latency, throughput, memory savings), and application-level consistency.
- In analytic contexts (R package), Memshare favors read-mostly patterns for maximal speed/memory efficiency, permitting but not synchronizing in-place mutation. Synchronization is left to the user.
- In LLM inference, zero-copy reuse of semantically/geometrically similar KV blocks can yield substantial throughput/memory gains while maintaining accuracy within 95–98% of the dense baseline, provided similarity thresholds are chosen conservatively.
- In web caches and proxy LRU caches, statistical sharing of objects enables cache overbooking and higher hit rates, but may incur modest increases in operation latency and complexity of eviction handling.
All MemShare systems require careful handling of resource release/garbage collection, namespace management (to avoid collision and undefined behavior), and, for mutable data, explicit coordination to prevent data races or consistency violations.
6. Practical Applications and Usage Scenarios
- Large-scale feature selection in genomics: In RNA-seq analytics (N=10,446 samples × 19,637 genes), Memshare allows the entire task to execute with a single shared data copy (~10 GB), completing mutual information estimation with Pareto Density Estimation (PDE) across multicore R in ~2 hours using ~47 GB RAM. Naive parallelism would require O(n_threads × 10 GB), which is typically prohibitive or unworkable (Thrun et al., 10 Sep 2025).
- Memory-efficient LLM serving: Integrations with vLLM and SGLang enable MemShare's collaborative-filtering KV cache reuse to raise throughput by up to about 85% and reduce KV memory by roughly 20% for long CoT inferences, permitting larger batch sizes with an accuracy drop of at most about 2% on the reported benchmarks (Chen et al., 29 Jul 2025).
- Multi-tenant cloud/web caches: MemShare's log-structured design and arbiter-based dynamic allocation raise the aggregate hit rate by roughly 5 to 6 percentage points over static partitioning on the Memcachier traces (84.66% to 89.92–90.75%) and allow each tenant to employ its own eviction policy (Cidon et al., 2016).
- Edge-cloud caching systems: Equal-division object sharing with working-set–based hit-rate prediction offers finer-grained and more efficient cache utilization across large fleets of proxies or microservices (Kesidis et al., 2019).
7. Limitations and Recommendations
MemShare solutions are most effective in environments with large shared datasets that are accessed read-mostly across process or tenant boundaries. In-place mutation, though supported, requires external synchronization to ensure correctness. Small objects are shared less efficiently because of the fixed ALTREP/view metadata overhead. For analytics applications, releasing (or finalizing) all session views before detachment is essential to prevent orphaned memory. In distributed/proxy cache modes, ripple-eviction suppression and careful configuration of slack parameters are vital to avoid pathological eviction cascades and to keep access latencies predictable. Avoiding namespace collisions and following best practices in cluster setup are mandatory for safe and robust deployment.
References:
- "Memshare: Memory Sharing for Multicore Computation in R with an Application to Feature Selection by Mutual Information using PDE" (Thrun et al., 10 Sep 2025)
- "MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse" (Chen et al., 29 Jul 2025)
- "Memshare: a Dynamic Multi-tenant Memory Key-value Cache" (Cidon et al., 2016)
- "On a caching system with object sharing" (Kesidis et al., 2019)