
MCAS: Memory-Centric Active Storage

Updated 23 February 2026
  • MCAS is a paradigm that unifies byte-addressable persistent memory with near-data compute to enable efficient in-place data operations.
  • It employs Active Data Objects (ADOs) for programmable server-side computation, reducing data movement and ensuring crash consistency through logging mechanisms.
  • MCAS delivers significant performance improvements over traditional storage systems by leveraging low-latency persistent memory, scalable RDMA networking, and sharded processing.

Memory-Centric Active Storage (MCAS) is a computational and storage paradigm that unifies byte-addressable persistent memory (PM) and near-data compute within a network-accessible key-value abstraction. MCAS is built to exploit hardware such as Intel Optane Persistent Memory Modules (PMM) operating in App-Direct (DAX) mode, enabling in-place data operations, minimal data movement, and programmable server-side computation through Active Data Objects (ADOs) (Waddington et al., 2021, Barceló et al., 2021, Wood et al., 2022). By converging the memory and storage tiers, MCAS attains latency, bandwidth, and architectural properties that contrast sharply with traditional NVMe/SSD block-based or DRAM-only architectures.

1. Architectural Overview and Core Components

MCAS is fundamentally a networked key-value store underpinned by persistent memory. The core storage node exposes one or more DAX-backed PM regions that are mapped directly into the process address space, allowing load/store semantics at cache-line granularity (Optane-PM: ~300 ns read/write, up to 512 GB/DIMM) (Wood et al., 2022, Waddington et al., 2021). Sharding is used for scalable parallelism: each "shard" is a single-threaded server bound to one CPU core and PM pool, handling a partition of the K/V space (Waddington et al., 2021).
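A minimal sketch of key-to-shard routing under this model is shown below; the hash-modulo scheme is illustrative and not necessarily MCAS's exact routing logic.

```python
import zlib

NUM_SHARDS = 4  # one single-threaded shard per CPU core / PM pool

def shard_for_key(key: bytes) -> int:
    """Illustrative routing: hash the key and take it modulo the shard
    count, so each shard owns a fixed partition of the K/V space."""
    return zlib.crc32(key) % NUM_SHARDS

# The mapping is deterministic: the same key always reaches the same shard.
assert shard_for_key(b"sensor/42") == shard_for_key(b"sensor/42")
```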

Key components are as follows; an illustrative configuration sketch appears after the list:

  • Server process: Manages allocation, crash-consistent logging, metadata, and exposes hooks for ADOs.
  • Persistent media: Optane PMM DIMMs accessed via DAX (device or filesystem modes).
  • Network layer: Uses RDMA (Infiniband, RoCE) with libfabric for low-latency RPC, zero-copy buffer registration, and high-throughput transfers.
  • Client libraries: Available in C/C++ and Python, supporting synchronous/asynchronous key-value operations, bulk zero-copy (put_direct/get_direct), and ADO invocation (Wood et al., 2022, Waddington et al., 2021).
  • Indexing: A primary hash index (Hopscotch hash table) and an optional secondary index (red-black tree); both reside in PM and support variable-length keys and values (Waddington et al., 2021).
  • Persistence/Consistency: Guaranteed through software logs (undo/redo) or hardware write orderings (CLFLUSHOPT, CLWB, SFENCE).
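The sketch below renders such a deployment as a hypothetical configuration; the field names (core, port, pm_path) are illustrative rather than the exact MCAS schema, but they capture the shard-per-core, pool-per-shard binding described above.

```python
import json

# Hypothetical shard configuration (field names are illustrative, not
# the exact MCAS schema): each shard is pinned to one CPU core and one
# DAX-backed PM region, and listens on its own RDMA-capable endpoint.
config = {
    "shards": [
        {"core": 0, "port": 11911, "pm_path": "/dev/dax0.0"},
        {"core": 1, "port": 11912, "pm_path": "/dev/dax0.1"},
    ],
    "net_provider": "verbs",  # RDMA via libfabric (Infiniband/RoCE)
}

print(json.dumps(config, indent=2))
```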

2. Active Data Objects and In-Place Compute

A central distinction of MCAS is its programmable compute model via Active Data Objects (ADOs). ADOs are user-defined plugins deployed server-side, designed to execute on objects resident in PM without staging data to DRAM or client buffers (Barceló et al., 2021, Wood et al., 2022, Waddington et al., 2021). Each ADO is sandboxed at the pool boundary: the plugin process can access only the PM pool mapped into its address space, enforcing safety and security constraints (Waddington et al., 2021).

The ADO invocation API allows the client to send opaque requests (opcode + arguments) that the plugin interprets, granting zero-copy access to object buffers for modification or analysis. Persistence is the plugin author's responsibility: crash consistency must be implemented through established mechanisms such as undo/redo logs, atomic 64-bit operations, or libraries like PMDK (Waddington et al., 2021).
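The client side of such an invocation might look like the sketch below. The mcas module, the Session/pool objects, and the invoke_ado signature are assumptions for illustration; only the opaque opcode-plus-arguments request format follows the description above.

```python
import struct

import mcas  # hypothetical Python client bindings; the real module may differ

session = mcas.Session(ip="10.0.0.1", port=11911)
pool = session.open_pool("analytics")

# The request is opaque to MCAS itself: we pack an opcode plus one
# 64-bit argument, and the server-side ADO plugin interprets it against
# the object's buffer resident in persistent memory (zero-copy).
OP_APPEND_SAMPLE = 1
request = struct.pack("<IQ", OP_APPEND_SAMPLE, 42)

response = pool.invoke_ado("sensor/42", request)
```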

ADO plugins support pointer-based dynamic data structures (trees, lists) allocated and mutated in persistent memory. This model enables compute-near-storage, reducing data transfer (especially for large objects) and allowing complex in-place algorithms, including continuous data protection (CDP) metadata ingestion and summarization (Waddington et al., 2021).

3. Programming Model: APIs and Extensions

The MCAS programming interface exposes basic K/V operations, zero-copy direct APIs, and ADO-specific calls (Waddington et al., 2021, Wood et al., 2022); a client-side usage sketch follows the list:

  • K/V API: create_pool, put, get, erase, find, and their asynchronous variants.
  • Zero-Copy RDMA Bulk: register_direct_memory, put_direct/get_direct for high-throughput bulk transfers.
  • ADO Invocation: invoke_ado, invoke_put_ado for dispatching computation to storage-side plugins.
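A hedged end-to-end sketch of these calls from the Python client is below; the method names mirror the list above, but the exact signatures (e.g., create_pool's size argument, the registration handle) are assumptions.

```python
import mcas  # hypothetical Python client bindings; the real module may differ

session = mcas.Session(ip="10.0.0.1", port=11911)

# Basic K/V path: create a pool, write a value, read it back, erase it.
pool = session.create_pool("demo", size_mb=128)
pool.put("greeting", b"hello, persistent world")
assert pool.get("greeting") == b"hello, persistent world"
pool.erase("greeting")

# Bulk path: register a buffer once, then move it with zero-copy RDMA.
buf = bytearray(256 * 1024)
handle = session.register_direct_memory(buf)
pool.put_direct("blob", buf, handle)
```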

Notable extensions include PyMM (Python Micro-MCAS), a library providing native persistent types for Python:

  • Shelf abstraction: Persistent containers backed by fsdax/devdax devices, with object-like access.
  • pymm.ndarray: NumPy ndarray subclass whose storage resides on persistent PM; supports full in-place NumPy/scipy semantics (e.g., np.linalg.svd) (Wood et al., 2022).
  • pymm.tensor: Experimental PyTorch tensor integration.
  • Persistence: Arrays and tensors are resilient across process failures; reopening the shelf re-exposes all prior allocations.

A typical PyMM workflow involves instantiating a shelf, allocating persistent arrays, and executing standard data science or linear algebra operations in place, without further code adaptation (Wood et al., 2022).
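A minimal sketch of that workflow is shown below, assuming the shelf constructor parameters reported for PyMM (names such as size_mb and pmem_path may vary across versions):

```python
import numpy as np
import pymm

# Open (or create) a shelf backed by a DAX persistent-memory path.
s = pymm.shelf("experiment", size_mb=1024, pmem_path="/mnt/pmem0")

# Allocate a persistent ndarray directly on the shelf; it survives
# process crashes and is re-exposed when the shelf is reopened.
s.data = pymm.ndarray((1000, 64), dtype=np.float64)
s.data[:] = np.random.rand(1000, 64)

# Standard NumPy operations run in place on persistent memory.
u, sv, vt = np.linalg.svd(s.data, full_matrices=False)
```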

4. Performance Characterization

MCAS delivers order-of-magnitude improvements in latency and throughput compared to traditional storage:

  • Latency: Sub-10 µs median for small (16 B) GETs; 99th percentile ≈10 µs; 4 KiB GET medians ≈15 µs (Waddington et al., 2021). In-memory K/V systems match these figures only by adding a separate logging path for durability; MCAS guarantees persistence directly on PM.
  • Bandwidth: Direct RDMA APIs (≥128 KiB) saturate 100 GbE (≈12–50 GiB/s depending on workload/shards) (Waddington et al., 2021).
  • Scalability: Linear scaling up to at least 12–16 shards for GET/PUT operations; up to 7.7 M GETs/s and 2.7 M PUTs/s aggregate (Waddington et al., 2021).
  • ADO Throughput: Passthru plugins reach ~7.5 M invoke IOPS (single-key); key-set invokes with 100K keys attain ~4.3 M IOPS.
  • Application-level Speedups: In data-centric workloads (k-Means, matrix operations), active storage on NVM is 1.1–15× faster than non-active object stores due to data-locality and in-place execution. Memory-bound kernels maximize this advantage (Barceló et al., 2021).

In data-intensive Python-based linear algebra (PyMM/GMRA), despite Optane-PM’s ~3× DRAM latency, observed slowdowns in major kernels are in the 1.1×–1.2× range; loading stages slow down by ~1.4–1.6× (Wood et al., 2022). For large working sets exceeding DRAM, MCAS on PM enables progress where DRAM-only systems cannot proceed.

Dataset  | Load DRAM (s) | Load PyMM (s) | Wavelet DRAM (s) | Wavelet PyMM (s)
MNIST    | 0.33 ± 0.002  | 0.53 ± 0.002  | 62.4 ± 1.88      | 71.2 ± 1.55
CIFAR10  | 0.83 ± 0.003  | 1.20 ± 0.008  | 134.7 ± 5.07     | 140.6 ± 5.24

The wavelet stage dominates end-to-end runtime (>99%), highlighting the efficiency of in-place persistent computation (Wood et al., 2022).
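As a quick consistency check, the per-stage slowdown ratios (PyMM over DRAM) implied by the table are:

```latex
\text{load:}\quad \frac{0.53}{0.33} \approx 1.61 \ (\text{MNIST}),\qquad
\frac{1.20}{0.83} \approx 1.45 \ (\text{CIFAR10})

\text{wavelet:}\quad \frac{71.2}{62.4} \approx 1.14 \ (\text{MNIST}),\qquad
\frac{140.6}{134.7} \approx 1.04 \ (\text{CIFAR10})
```

The load ratios land in the reported ~1.4–1.6× range, while the wavelet ratios sit at or just below the ~1.1–1.2× kernel range.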

5. Deployment Modes, Memory Management, and Trade-Offs

MCAS supports multiple deployment and memory-tuning modes:

  • App-Direct Mode (DAX): PM appears as regular memory; all allocations and compute are fully persistent (Barceló et al., 2021, Wood et al., 2022).
  • Memory Mode: DRAM acts as a hardware-managed cache in front of PM, which is exposed as a single large volatile address space; performance is largely determined by the hardware caching policy (Barceló et al., 2021).
  • RDMA Networking: Enables multi-server pools and distributed persistent address spaces. MCAS exploits synchronous client-side replication to maintain durability with bounded throughput reductions (2-way: -20%, 3-way: -37%) (Waddington et al., 2021).
  • Region Allocation: Pools are managed as contiguous PM regions with coarse (32 MiB) granularity and undo-log updates for crash consistency (Waddington et al., 2021).
  • Indexing and Recovery: Upon restart, persistent metadata is scanned to reconstitute volatile heap structures; multi-write atomicity is enforced through software logs (Waddington et al., 2021).

Trade-offs involve:

  • Latency vs. Capacity: Optane-PM offers 8–12× greater capacity per socket than DRAM at ~3× the latency (Wood et al., 2022).
  • Region size: Coarse granularity reduces metadata but may waste space for small pools (see the sketch after this list).
  • Crash Consistency: Enabling fine-grained logging degrades throughput; batch or bulk durability models may mitigate overhead (Wood et al., 2022, Waddington et al., 2021).
  • Object Granularity: Too-small objects increase RPC/metadata overhead; too-large objects limit parallelism (Barceló et al., 2021).
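The region-size trade-off can be made concrete with a small sketch: with coarse 32 MiB regions, a pool request rounds up to whole regions, and the relative waste is largest for small pools (the round-up rule here is an assumption for illustration).

```python
import math

REGION_MIB = 32  # coarse allocation granularity

def allocated_mib(requested_mib: float) -> int:
    """Round a pool request up to whole 32 MiB regions (illustrative)."""
    return math.ceil(requested_mib / REGION_MIB) * REGION_MIB

for req in (5, 40, 1000):
    alloc = allocated_mib(req)
    waste = 100 * (alloc - req) / alloc
    print(f"request {req} MiB -> allocate {alloc} MiB ({waste:.0f}% waste)")
```

For a 5 MiB pool the round-up wastes ~84% of the region, versus ~2% for a 1000 MiB pool, which is why coarse granularity mainly penalizes small pools.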

6. Comparison with Conventional Architectures and Active Object Stores

MCAS demonstrates a convergence of memory and storage, in contrast to traditional NVMe-over-Fabric, SSD, or in-memory K/V systems:

  • Vs. NVMe-oF: MCAS delivers one to two orders of magnitude higher IOPS (millions versus ~150 K) at 10–50 µs latencies, whereas NVMe-oF operates at >100 µs (Waddington et al., 2021).
  • Vs. DRAM-centric K/V: In-memory systems (e.g., RAMCloud, HERD) reach high ops/s but rely on volatile DRAM and separate logging; MCAS achieves comparable throughput directly on persistent hardware (Waddington et al., 2021).
  • Vs. Non-active object stores: Active storage models (MCAS, dataClay) eliminate repeated serialization and network copies, yielding 10–90% lower latencies or up to 10× higher throughput, depending on access pattern and kernel (Barceló et al., 2021).

The active object paradigm co-locates code with data, exploiting byte-addressability of NVM to push data-locality to its architectural extreme (Barceló et al., 2021). This is particularly effective for large-scale, reuse-intensive, or memory-bound data analytics.

7. Limitations, Prospective Enhancements, and Future Work

Current limitations include:

  • Write Endurance: Optane DC supports on the order of 10⁶ writes per cell, motivating wear-leveling or write-reduction strategies (Waddington et al., 2021).
  • Asymmetric PM Bandwidth: NVM writes do not scale linearly beyond a few threads; pooling or scheduling techniques may be required (Waddington et al., 2021).
  • Crash Consistency Semantics: Only 64-bit atomicity is natively supported; software logging is mandatory for more complex transactional updates.
  • Failure Domain: Device-DAX striping means single-DIMM failures can cause pool-wide outages; finer-grained redundancy/erasure coding is identified as a future direction (Waddington et al., 2021).

Proposed future work includes:

  • New language bindings and richer client replication frameworks (Waddington et al., 2021).
  • Pool-level authentication, admission control, and service QoS (Waddington et al., 2021).
  • Advanced crash-consistent programming models, possibly compiler- or hardware-assisted.
  • Unified protocols for cross-pool ADO transactions, and scale-out to distributed clusters (Wood et al., 2022).
  • Integration with hardware accelerators (GPU/FPGA) via CXL or RDMA for offloaded execution (Wood et al., 2022).
  • Incorporating advanced, persistent concurrent data structures for algorithms such as CoverTree or GMRA (Wood et al., 2022).

These directions aim to further unify the persistent memory-compute model, unlock new classes of large-scale data applications, and systematically exploit the architectural benefits of memory-centric active storage (Waddington et al., 2021, Barceló et al., 2021, Wood et al., 2022).
