
Persistent Memory Architectures

Updated 8 January 2026
  • Persistent Memory Architectures are hardware and software design paradigms that use non-volatile, byte-addressable memory like Intel Optane to bridge the storage-memory gap.
  • They integrate through modes such as App-Direct and Memory Mode, leveraging cache-line flushes and transactional APIs to ensure crash consistency and durability.
  • These architectures drive innovations in data structures, file systems, and parallel computing, optimizing performance and reliability across diverse system-level applications.

Persistent memory architectures encompass a diverse set of hardware and software design paradigms enabling systems to exploit persistent, byte-addressable memory technologies (PM) such as Intel Optane DC Persistent Memory (DCPMM) and 3D XPoint. These architectures directly bridge the storage-memory gap, exposing non-volatility at near-DRAM speed and granularity, with profound implications for data structures, file systems, parallel computing, networking, and transactional systems. This article provides a comprehensive account of the architectural principles, system-level designs, software structures, performance trade-offs, reliability models, and future directions underpinning persistent memory architectures.

1. Hardware Foundations and Architectural Organization

Modern persistent memory devices offer byte-addressability (typically 64 B cache lines), non-volatility, and latency/bandwidth profiles intermediate between DRAM and NAND flash. Architecturally, PM devices are installed as DIMMs on DDR4/5 channels, co-existing with DRAM. Two major operational modes are supported:

  • App-Direct Mode: DRAM and PM are presented as logically distinct NUMA nodes; both are directly addressable via CPU load/store. Linux exposes PM regions via DAX (Direct Access) for file systems or via /dev/dax for device-mode memory mapping. Applications can mmap PM and use load/store directly (Jackson et al., 2018, Fridman et al., 2021). A minimal mmap sketch follows this list.
  • Memory Mode: PM is paged behind DRAM, which acts as a hardware-managed cache. The OS sees a large volatile address space, but all persistence guarantees are lost (Fridman et al., 2021).
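
To make the App-Direct path concrete, the following C sketch maps a file on a DAX filesystem and writes to persistent memory with ordinary stores. The file path is hypothetical and error handling is abbreviated; durability of the stores still depends on the flush/fence primitives described in Section 2.

```c
/* Minimal App-Direct sketch: map a file on a DAX filesystem and access
 * persistent memory with plain loads and stores. The path is hypothetical
 * and error handling is abbreviated. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    /* MAP_SYNC asks the kernel to refuse the mapping unless it is truly
     * DAX-backed, so user-space flushes are sufficient for durability. */
    void *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy((char *)pmem, "written directly to persistent memory");

    /* The data may still reside in volatile CPU caches; a cache-line
     * write-back plus fence (Section 2) is needed before it is durable. */
    munmap(pmem, len);
    close(fd);
    return 0;
}
```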

Integration at the memory controller requires handling higher device latency (PM read: 250–400 ns, write: 300–360 ns per channel) and asymmetric bandwidth (DRAM: 60–70 GB/s, PM: 12–15 GB/s/channel) (Marques et al., 2021). Internal PM granularity may be larger (e.g., 256 B XPLines), leading to read-modify-write phenomena if accesses are unaligned.

2. Software Stack, Addressing, and Persistence Primitives

The software stack for PM encompasses multiple layers:

  • System and OS Support: Linux manages PM and DRAM as separate NUMA domains. PM regions are exposed through DAX-aware filesystems (ext4-DAX, xfs-DAX, NOVA, SplitFS, etc.), supporting user-level mmap, page-cache bypass, and fine-grained synchronization (Jackson et al., 2018, Breukelen et al., 2023).
  • Programming Models and APIs:
    • SNIA PMDK (Persistent Memory Development Kit) provides transactional object allocators, flush/fence abstraction, and atomicity semantics via libpmemobj (Jackson et al., 2018, Choi et al., 2020).
    • Persistent Memory Objects (PMOs) extend the POSIX abstraction with object-based, crash-consistent primitives such as pcreate, attach, detach, and an object-centric crash-consistency barrier, psync, with minimal API friction and high concurrency (Greenspan et al., 2022).
    • Application integrations range from full-persistence (all metadata and values in PM) to hybrid modes (indices in DRAM, values in PM) for key-value stores (Choi et al., 2020).
  • Persistence Primitives: Stores to PM become durable only once the affected cache lines are written back to the device. On x86 this is done with cache-line write-back instructions (clflush, clflushopt, clwb) ordered by sfence, or with non-temporal stores that bypass the cache; libraries such as PMDK wrap these sequences behind flush/fence and persist abstractions. A minimal user-space persist helper is sketched below.
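
The following sketch shows such a persist barrier, assuming an x86 CPU with clwb support and a compiler that exposes the corresponding intrinsics:

```c
/* Minimal persist barrier: write back the cache lines covering
 * [addr, addr + len) with CLWB, then order them with SFENCE.
 * Assumes x86 with clwb (compile with -mclwb). */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

static inline void pm_persist(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clwb((void *)p);   /* write back the line without evicting it */

    _mm_sfence();              /* order the write-backs before later stores */
}
```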

3. Consistency, Correctness, and Fault Models

Crash consistency is a central challenge in PM architectures, compounded by the fact that volatile caches lose state on power loss while PM persists. Key mechanisms and models include:

  • Ordering constraints: Formal models govern store and durability semantics. For example, strict persistency enforces that volatile store order implies persistent order, while relaxed/epoch persistency admits batching of stores inside epochs with atomic visibility at barriers (Lin et al., 2019, Greenspan et al., 2022).
  • Transactional durability: Log-based approaches (undo/redo) wrap updates so that, after a crash, either all or none of a transaction's updates appear. Automatic flush/fence sequences enforce per-transaction atomicity and ordering constraints (Wu et al., 2019, Giles et al., 2018). A sketch using PMDK's transactional API appears after this list.
  • Fault tolerance for concurrency and parallelism: Algorithmic models such as the Parallel Persistent Memory (PPM) model structure computation into idempotent "capsules," each ending at a persistent checkpoint. After faults (soft or hard), processors resume at the last checkpoint; progress and correctness (durably linearizable structures) are ensured by capsule-based idempotence and recoverable compare-and-swap primitives (Blelloch et al., 2018, Ben-David et al., 2018).
  • Encryption integration: Secure persistent memory architectures (such as SecPM) manage counter-mode encryption in concert with persistence, ensuring that both data and encryption counters are persisted in atomically ordered fashion. Write-reduction techniques coalesce counter writes for spatial locality, reducing performance penalties (Zuo et al., 2019).
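
As an illustration of transactional durability, the sketch below uses PMDK's libpmemobj (introduced in Section 2): pmemobj_tx_add_range snapshots the old contents into an undo log, and the library issues the required flush/fence sequence on commit. The pool path and root-object layout are illustrative assumptions.

```c
/* Transactional update with libpmemobj; compile with -lpmemobj.
 * The pool path and root layout are illustrative. */
#include <libpmemobj.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct root {
    uint64_t counter;
    char     msg[64];
};

int main(void)
{
    PMEMobjpool *pop = pmemobj_create("/mnt/pmem/pool", "example",
                                      PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL)
        pop = pmemobj_open("/mnt/pmem/pool", "example");
    if (pop == NULL) { perror("pmemobj"); return 1; }

    PMEMoid root_oid = pmemobj_root(pop, sizeof(struct root));
    struct root *r = pmemobj_direct(root_oid);

    /* The added range is snapshotted to an undo log before modification;
     * on commit the library flushes and fences the dirty lines, and on
     * crash recovery the old contents are restored. */
    TX_BEGIN(pop) {
        pmemobj_tx_add_range(root_oid, 0, sizeof(struct root));
        r->counter += 1;
        strcpy(r->msg, "updated atomically");
    } TX_ONABORT {
        fprintf(stderr, "transaction aborted\n");
    } TX_END

    pmemobj_close(pop);
    return 0;
}
```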

4. Data Structures and Application-Specific Approaches

The adoption of PM motivates novel data-structure primitives and re-design of existing algorithms:

  • Low-level primitives: Manual persist via cache-line flush/fence, copy-on-write nodes, persistent root pointers, and multi-word CAS (PMwCAS) serve as the foundation for higher-level constructs (Götze et al., 2020).
  • Write-amplification reduction: To minimize device wear and latency, strategies include batching, indirection slots + bitmaps for validity, hybrid DRAM/PM placement (inner nodes in DRAM, leaves in PM), and unsorted append or bitmap layouts in leaves (Götze et al., 2020, Islam et al., 2024). The leaf-node sketch after this list illustrates the bitmap-commit technique.
  • Dynamic graph frameworks: DGAP demonstrates per-section edge logs (to avoid small in-place PM writes), per-thread undo logs (for crash-consistent rebalancing), and careful partitioning of hot-write metadata into DRAM. These techniques achieve up to 3.2×–3.8× better performance than XPGraph, LLAMA, and GraphOne (Islam et al., 2024).
  • Key-value stores: Full-persistence of all index and value data in PM enables sub-second recovery, at the cost of lower throughput and higher latency, whereas "hybrid" approaches that retain indices in DRAM (reconstructed at restart) deliver higher throughput and lower tail latency for write-dominant workloads, with slightly longer recovery (Choi et al., 2020).
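
The indirection-slot/bitmap idea can be made concrete with a small sketch (the node layout is hypothetical): an insert writes the new entry into a free slot, persists it, and only then sets the corresponding validity bit, so a persisted 8-byte store to the bitmap serves as the commit point.

```c
/* Unsorted PM leaf with a validity bitmap: the entry is persisted first,
 * and setting its bitmap bit (an atomic 8-byte store, then persisted) is
 * the commit point, so a crash never exposes a half-written slot.
 * Single-writer leaf assumed; layout illustrative; compile with -mclwb. */
#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEAF_SLOTS 32

struct pm_leaf {
    uint64_t bitmap;                         /* bit i set => slot i valid */
    struct { uint64_t key, value; } slot[LEAF_SLOTS];
};

static void pm_persist(const void *addr, size_t len)
{
    for (uintptr_t p = (uintptr_t)addr & ~63ULL;
         p < (uintptr_t)addr + len; p += 64)
        _mm_clwb((void *)p);
    _mm_sfence();
}

bool leaf_insert(struct pm_leaf *leaf, uint64_t key, uint64_t value)
{
    for (int i = 0; i < LEAF_SLOTS; i++) {
        if (leaf->bitmap & (1ULL << i))
            continue;                        /* slot already occupied */

        leaf->slot[i].key   = key;           /* write the entry out of place */
        leaf->slot[i].value = value;
        pm_persist(&leaf->slot[i], sizeof(leaf->slot[i]));

        /* Commit: one atomic 8-byte store to the bitmap, then persist it. */
        __atomic_store_n(&leaf->bitmap, leaf->bitmap | (1ULL << i),
                         __ATOMIC_RELEASE);
        pm_persist(&leaf->bitmap, sizeof(leaf->bitmap));
        return true;
    }
    return false;                            /* leaf full: split required */
}
```

Because entries stay unsorted, lookups scan the valid slots or consult a DRAM-resident index rebuilt at restart, which is exactly the write-amplification/lookup trade-off noted above.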

5. File Systems and System-Level Services

Persistent memory file systems (PMFS) are re-architected to exploit byte-addressability and bypass the overheads of traditional block I/O:

  • Design strategies:
    • DAX and kernel bypass: DAX awareness allows applications to bypass the page cache, mmap files directly, and reduce kernel mediation latency (Jackson et al., 2018, Breukelen et al., 2023).
    • Metadata acceleration: O(1) hash-based metadata indexing (HashFS), contiguous-page mappings (ctFS), and flat namespace lookups reduce small-file and metadata-intensive operation costs (Breukelen et al., 2023).
    • Hybrid logging and shadow-paging: Modern PMFS intersect traditional undo/redo journaling with atomic in-place updates and operation logs, balancing write amplification with simplicity and performance (Breukelen et al., 2023).
  • Consistency mechanisms: Atomic store + clwb + sfence suffice for small metadata changes. Larger, cross-node updates fall back on per-inode logs (NOVA), log-structured appends (Strata), or operation logs (SplitFS) to enforce consistency and durability (Breukelen et al., 2023). The log-append sketch after this list illustrates this pattern.
  • Trade-off analysis: Log-structured and hybrid-kernel/user designs scale better on concurrency, while copy-on-write–heavy file systems expose significant write amplification (Breukelen et al., 2023).
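
The per-inode and operation-log mechanisms share an append-then-commit idiom that the following sketch illustrates generically; the entry format is illustrative, not any particular file system's on-media layout.

```c
/* Generic append-then-commit pattern for PM logs: the entry is written
 * and persisted first; an atomic 8-byte update of the persisted tail
 * pointer is the commit point. Compile with -mclwb. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct log_entry {
    uint32_t type;             /* e.g. attribute update, block append */
    uint32_t len;              /* valid payload bytes                 */
    uint8_t  payload[56];      /* pads the entry to one 64 B line     */
};

struct pm_log {
    uint64_t tail;             /* byte offset of the first unused slot */
    uint8_t  data[1 << 20];    /* log area resident in PM              */
};

static void pm_persist(const void *addr, size_t len)
{
    for (uintptr_t p = (uintptr_t)addr & ~63ULL;
         p < (uintptr_t)addr + len; p += 64)
        _mm_clwb((void *)p);
    _mm_sfence();
}

int log_append(struct pm_log *log, const struct log_entry *e)
{
    uint64_t off = log->tail;
    if (off + sizeof(*e) > sizeof(log->data))
        return -1;                               /* log full */

    memcpy(&log->data[off], e, sizeof(*e));      /* write out of place */
    pm_persist(&log->data[off], sizeof(*e));     /* durable before commit */

    /* Commit: atomically advance the tail, then persist the new value. */
    __atomic_store_n(&log->tail, off + sizeof(*e), __ATOMIC_RELEASE);
    pm_persist(&log->tail, sizeof(log->tail));
    return 0;
}
```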

6. Parallel and Distributed Persistent Memory

  • Parallel computation models: The Parallel Persistent Memory (PPM) model defines a cost model—partitioning work and depth into idempotent, checkpointed capsules with explicit accounting for faults and recovery. Provable time bounds account for processor failure rates, number of checkpoints, and available parallelism (Blelloch et al., 2018).
  • Distributed PM architectures: Disaggregated PM (DPM) enables sharing PM pools across datacenter compute nodes via RDMA. Architectures vary: DPM-Direct employs local metadata with RDMA atomics, DPM-Central uses a coordinator for metadata, and DPM-Sep separates control and data planes. Each approach trades off write-commit latency, scalability, CPU overhead, and programming complexity. DPM-Sep, with lock-free out-of-place chaining, achieves the best aggregate throughput and avoids coordinator bottlenecks, at the cost of more complex garbage collection (Tsai et al., 2019).
  • Near-memory compute: Memory-centric active storage (MCAS) pushes key-value APIs and user-defined "active data objects" (ADOs) close to persistent data with RDMA support, enabling sub-10 μs operation latency, complex in-place analytics (e.g., trees, versioned indexes), and strong crash consistency via transactional ADO plugin protocols (Waddington et al., 2021, Waddington et al., 2021).

7. Performance Scalability, Hybrid Tiering, and Design Recommendations

  • Dynamic tiering strategies: Hybrid DRAM/PMM systems require software page migration to exploit the fast tier for hot, write-intensive pages and PM for cold, read-dominated pages. Empirical work demonstrates that DRAM-first or read/write-based policies outperform static hotness- or bandwidth-split schemes, with >5× throughput gains under bandwidth-aware migration policies (HyPlacer) (Marques et al., 2021). A page-migration sketch follows this list.
  • Programmability and maintainability: PM-centric architectures benefit from higher-level libraries (PMDK, object-based PMOs) that abstract low-level flush semantics. Manual cache-line handling yields subtle coherency bugs and code complexity. Transactional and object APIs yield easier integration and better correctness properties (Choi et al., 2020, Greenspan et al., 2022).
  • Application guidelines: For high-throughput, low-latency transactional systems, hybrid DRAM/PM techniques (placing indices in DRAM and values in PM) amortize latency and exploit PM's byte-addressability. For immediate recovery at scale, full-persistence trades performance for minimal recovery latency (Choi et al., 2020).
  • Reliability and wear leveling: Write minimization is crucial given limited device endurance. Write coalescing, grouping of small allocations (slabs), batching metadata updates, and indirection index tables reduce writes and maximize device lifetime (Götze et al., 2020, Wu et al., 2019).
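
To ground the tiering discussion, the sketch below promotes a page from the PM NUMA node to DRAM with the Linux move_pages(2) interface. The node numbering, the 4 KiB page size, and the choice of which pages are "hot" are assumptions; policies such as HyPlacer rely on richer access telemetry, and this path applies to pages allocated on the PM NUMA node in App-Direct mode, not to device-DAX mappings.

```c
/* Sketch of software page migration for DRAM/PM tiering: a page the policy
 * deems hot is moved to the DRAM NUMA node with move_pages(2).
 * Node numbers (DRAM = 0) are assumptions; link with -lnuma. */
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>

#define DRAM_NODE 0

int promote_page(void *addr)
{
    void *pages[1]  = { (void *)((uintptr_t)addr & ~4095UL) };
    int   nodes[1]  = { DRAM_NODE };
    int   status[1] = { 0 };

    /* pid 0 = calling process; MPOL_MF_MOVE migrates pages that are
     * mapped only by this process. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0 ||
        status[0] < 0) {
        fprintf(stderr, "migration failed (status %d)\n", status[0]);
        return -1;
    }
    return 0;  /* the page now resides on the DRAM tier */
}
```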

