Distributed Shared-Memory Databases

Updated 9 November 2025
  • Distributed Shared-Memory Databases are systems that leverage a unified address space across multiple nodes to combine shared-memory programmability with the scalability and resilience of distributed systems.
  • They utilize modern hardware advances such as RDMA interconnects and memory disaggregation to decouple compute and memory, achieving low latency and near-linear scalability.
  • Advanced protocols in concurrency control, fault tolerance, and type-guided ownership enable strong consistency and high throughput across distributed environments.

Distributed Shared-Memory Databases (DSM-DB) are database systems built on the distributed shared memory (DSM) abstraction, which exposes the main memory of a cluster of physically separate servers as a single, unified address space. DSM-DBs aim to combine the programmability and fine-grained data-sharing advantages of shared-memory models with the scalability and failure resilience of distributed architectures. The approach has been revitalized by hardware and software trends such as memory disaggregation, fast RDMA interconnects, modern language-level type systems, and advances in concurrency control protocols.

1. Foundations and System Architecture

DSM-DB systems are characterized by the decoupling of compute from memory, typically realized by allocating compute-only nodes (CNs) and memory-only nodes (MNs) interconnected via ultra-low-latency networks such as RDMA. CNs execute transaction and query logic, while MNs serve as a global memory pool, exporting raw addressable regions with access mediated via RDMA verbs. Memory disaggregation protocols allow the system to present a global logical address space, managed by a lightweight, distributed directory service maintaining mappings from logical addresses to physical memory locations, allocation status, and possible offload hooks for operators such as aggregation or scan (Wang et al., 2022).

The architectural organization is typically described as follows:

  • Network-Attached Memory (NAM): MNs export large, pinned memory; CNs coordinate access but do not hold persistent state themselves. Compute and memory scale independently (Zamanian et al., 2016).
  • Partitioned Global Address Space (PGAS): The global address space is explicitly partitioned; each node ("locale") is responsible for a subset, but the programming interface hides or abstracts remote/local boundaries (Dewan et al., 2021).
  • Ownership Models & Type Systems: DSM-DBs may exploit programming language semantics (e.g., Rust’s ownership and borrowing) to enforce access discipline and synchronize coherence at runtime, directly mapping language lifetimes to coherence epochs (Ma et al., 4 Jun 2024).

A typical data layout includes globally addressable pages, cache layers on CNs, and versioned/colored tags encoding the validity of cached data.
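
A minimal sketch of this layout is shown below, assuming a simple (memory-node id, offset) global address and a lightweight directory entry; all names and fields here are illustrative rather than taken from any of the cited systems.

```rust
use std::collections::HashMap;

/// Illustrative global address: which memory node (MN) holds the bytes, and where.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct GlobalAddr {
    mn_id: u32,  // memory node identifier
    offset: u64, // offset into that MN's RDMA-registered region
}

/// One entry of the lightweight directory service: physical placement,
/// allocation status, and an optional hook naming an operator (e.g. scan or
/// aggregation) that the MN could execute near the data.
struct DirEntry {
    addr: GlobalAddr,
    len: u64,
    allocated: bool,
    offload_hook: Option<&'static str>,
}

/// Directory mapping logical addresses (opaque u64 handles) to placement.
struct Directory {
    map: HashMap<u64, DirEntry>,
}

impl Directory {
    /// Resolve a logical address to its physical location, if allocated.
    fn resolve(&self, logical: u64) -> Option<&DirEntry> {
        self.map.get(&logical).filter(|e| e.allocated)
    }
}

fn main() {
    let mut dir = Directory { map: HashMap::new() };
    dir.map.insert(
        42,
        DirEntry {
            addr: GlobalAddr { mn_id: 3, offset: 0x1000 },
            len: 4096,
            allocated: true,
            offload_hook: Some("scan"),
        },
    );
    assert_eq!(dir.resolve(42).unwrap().addr, GlobalAddr { mn_id: 3, offset: 0x1000 });
}
```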

2. Memory Consistency and Coherence Protocols

DSM-DB implementations must ensure that concurrent accesses by distributed clients preserve well-defined consistency guarantees—sequential consistency (SC), snapshot isolation, or causal consistency—depending on the tradeoff between programmability, performance, and tolerance for staleness.

  • Ownership-Guided Coherence: Leverages compile-time type systems (as in Rust) to enforce single-writer multiple-reader (SWMR) epochs. Each Box<T> or &T/&mut T carries a version tag ("color") and an updated bit; coherence is enforced by incrementing the color on every mutable write, which makes stale cached copies unreachable (Ma et al., 4 Jun 2024). Reads fetch and locally cache the addressed value, keyed by (address, color), so readers never observe stale data.
  • Causal Consistency in Partial Replication: Employs vector-style timestamp metadata per replica, optimized for partial replication: each replica tracks only those causality relationships (edges) relevant to its local state, minimizing the size of both piggybacked timestamps and local metadata (Xiang et al., 2017).
  • Quorum-Based SC (ABD-family algorithms): Ensures that, in the presence of faults, every read and write operation consults overlapping majorities to guarantee SC semantics. Writes use a single round of broadcast, reads use two. Timestamps (logical clocks) guarantee monotonicity and proper linearization (Ekström et al., 2016).

A summary table of protocol approaches:

| Consistency Model | Protocol/Mechanism | Metadata Overhead |
| --- | --- | --- |
| Sequential Consistency | Quorum + Logical Clocks | O(n) timestamps |
| Ownership-Guided SWMR | Colored Address + Borrow | Per-pointer color/version tag |
| Causal Consistency | Edge-based Vector Clocks | O(d) per-replica edges (d = local replication degree) |
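
To make the ownership-guided SWMR scheme concrete, the sketch below (hypothetical types, not DRust's actual API) shows how incrementing a color on every exclusive write makes previously cached copies unreachable, because the reader-side cache is keyed by (address, color).

```rust
use std::collections::HashMap;

type Addr = u64;
type Color = u64;

/// Authoritative object state as held on a memory node: value plus current color.
struct Owned {
    color: Color,
    bytes: Vec<u8>,
}

/// Compute-node cache keyed by (address, color): once the owner bumps the
/// color, entries cached under the old color are simply never looked up again,
/// so readers cannot observe stale data.
struct ReaderCache {
    cache: HashMap<(Addr, Color), Vec<u8>>,
}

impl ReaderCache {
    fn read(&mut self, addr: Addr, remote: &Owned) -> Vec<u8> {
        let key = (addr, remote.color);
        self.cache
            .entry(key)
            .or_insert_with(|| remote.bytes.clone()) // fetch on miss
            .clone()
    }
}

/// A mutable write under SWMR discipline: the exclusive owner updates the
/// bytes and increments the color, opening a new coherence epoch.
fn write_exclusive(obj: &mut Owned, new_bytes: Vec<u8>) {
    obj.bytes = new_bytes;
    obj.color += 1;
}

fn main() {
    let mut obj = Owned { color: 1, bytes: b"v1".to_vec() };
    let mut cache = ReaderCache { cache: HashMap::new() };

    assert_eq!(cache.read(0x10, &obj), b"v1".to_vec());
    write_exclusive(&mut obj, b"v2".to_vec());
    // The old (0x10, 1) entry still exists but is unreachable under color 2.
    assert_eq!(cache.read(0x10, &obj), b"v2".to_vec());
}
```

Because every exclusive write opens a new color epoch, readers need no explicit invalidation messages; stale entries become garbage to reclaim rather than hazards to race against.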

3. Transaction Management and Concurrency Control

DSM-DBs must support scalable distributed transactions—even for multi-writer workloads—without imposing traditional scalability bottlenecks. Recent systems achieve this via:

  • RDMA-based SI/CC: DECOR (NAM-DB) assigns each transaction a vector timestamp, with each component assigned locally by the transaction's originating node and recorded in a global timestamp array via one-sided RDMA_WRITE. At transaction start, the vector is scanned to establish snapshot visibility; commit timestamps are installed by direct RDMA writes, so no central timestamp-server bottleneck exists (Zamanian et al., 2016). A simplified visibility check is sketched after this list.
  • Type-Guided Isolation: The Rust model’s immutable and mutable borrows can be mapped to read and write transactional "leases"; color tags serve as low-overhead MVCC version identifiers, enforcing isolation by associating each in-progress mutation with its own color epoch (Ma et al., 4 Jun 2024).
  • Fine-Grained Locking via RDMA Atomics: For index structures or tuple-granular operations, remote CAS (compare-and-swap) is used for locking/unlocking buckets or elements. This scales to high core counts because locks only serialize conflicting accesses, and can be pipelined/batched via asynchronous APIs (Dewan et al., 2021).
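
Below is a simplified sketch of vector-timestamp snapshot visibility of the kind described above; the types and the layout of the timestamp vector are assumptions for illustration, not NAM-DB's actual data structures.

```rust
/// Each committed version is tagged with the compute node that created it and
/// that node's commit counter at commit time.
#[derive(Clone, Copy)]
struct VersionTag {
    creator: usize, // compute node that committed this version
    commit_ts: u64, // that node's local commit counter
}

/// Snapshot taken at transaction start: one counter per compute node, read
/// from the global timestamp array (e.g., via one-sided RDMA reads).
struct Snapshot {
    vector: Vec<u64>,
}

impl Snapshot {
    /// A version is visible iff it was committed at or before the counter
    /// value this snapshot observed for the creating node.
    fn is_visible(&self, tag: VersionTag) -> bool {
        tag.commit_ts <= self.vector[tag.creator]
    }
}

fn main() {
    let snap = Snapshot { vector: vec![10, 7, 3] };
    assert!(snap.is_visible(VersionTag { creator: 1, commit_ts: 7 }));
    assert!(!snap.is_visible(VersionTag { creator: 2, commit_ts: 5 }));
}
```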

The primary implication is that hardware-based operations (RDMA CAS, FETCH-ADD) and vectorized logical clocks enable nearly linear scalability under load, provided that distributed metadata (e.g., timestamp or lock allocation) can itself be partitioned or decentralized.
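
The remote-locking pattern described above can be sketched as follows; for portability the example uses a process-local AtomicU64 as a stand-in for the 64-bit lock word in an MN's registered region, where a real system would issue a one-sided RDMA compare-and-swap instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for a 64-bit lock word living in a memory node's registered
/// region; in a DSM-DB the compare_exchange below would be an RDMA CAS verb.
const UNLOCKED: u64 = 0;

fn try_lock(lock_word: &AtomicU64, owner_id: u64) -> bool {
    lock_word
        .compare_exchange(UNLOCKED, owner_id, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn unlock(lock_word: &AtomicU64, owner_id: u64) {
    // Only the holder resets the word; a one-sided RDMA write suffices here.
    let prev = lock_word.swap(UNLOCKED, Ordering::Release);
    debug_assert_eq!(prev, owner_id);
}

fn main() {
    let bucket_lock = AtomicU64::new(UNLOCKED);

    // CN 7 acquires the bucket lock; CN 9's attempt fails until release.
    assert!(try_lock(&bucket_lock, 7));
    assert!(!try_lock(&bucket_lock, 9));
    unlock(&bucket_lock, 7);
    assert!(try_lock(&bucket_lock, 9));
}
```

Because the CAS only touches the lock word of the contended bucket or tuple, uncontended operations proceed without coordination, which is what lets this scheme scale with core count.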

4. Buffer Management, Indexes, and Data Structures

DSM-DBs require buffer managers and data structures optimized for RDMA’s unique latency and bandwidth characteristics. The gap between local RAM and RDMA-read is approximately 10×, not the ≥10⁴× gap between RAM and flash/SSD, which shifts buffer management strategies:

  • Buffer Management: Lightweight replacement policies (CLOCK, TinyLFU) are preferred, since heavier policies such as LRU or LRU-K can be too costly in software. Decision models must weigh serving data from the local CN cache against offloading functions to MNs so as to minimize data movement (Wang et al., 2022).
  • DSM-aware Indexes: Hash tables and B-trees are distributed. For hash tables (e.g., DIHT), top-level buckets are block-distributed and accesses are forwarded to the responsible node. Locks are managed via remote CAS. DIHT achieves >100× speedup vs. non-DSM approaches at 64 nodes × 44 cores (Dewan et al., 2021). For B-trees, caching of internal nodes at CNs and splitting of leaves on MNs are used. All designs must enable batched/asynchronous operations to amortize network latency.

Guidelines for DSM-DB data structure design include privatization of per-node metadata, single-lock semantics per leaf (to avoid deadlocks), batch aggregation, and epoch-based memory reclamation for safe element deletion.
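
As an illustration of the lightweight-replacement point, here is a minimal CLOCK (second-chance) eviction sketch of the kind a CN-side page cache might use; page identifiers and capacities are arbitrary.

```rust
/// Minimal CLOCK (second-chance) replacement over a fixed set of frames.
/// Each frame has a reference bit that is set on access and cleared as the
/// clock hand sweeps past; a frame is evicted only if its bit is already 0.
struct ClockCache {
    frames: Vec<Option<u64>>, // cached page ids (None = free frame)
    ref_bits: Vec<bool>,
    hand: usize,
}

impl ClockCache {
    fn new(capacity: usize) -> Self {
        Self { frames: vec![None; capacity], ref_bits: vec![false; capacity], hand: 0 }
    }

    /// Returns true on a hit; on a miss, evicts via CLOCK and installs the page.
    fn access(&mut self, page: u64) -> bool {
        if let Some(i) = self.frames.iter().position(|f| *f == Some(page)) {
            self.ref_bits[i] = true; // hit: give the frame a second chance
            return true;
        }
        // Miss: advance the hand until a frame with a cleared reference bit is found.
        loop {
            let i = self.hand;
            self.hand = (self.hand + 1) % self.frames.len();
            if self.frames[i].is_none() || !self.ref_bits[i] {
                self.frames[i] = Some(page); // free frame or chosen victim
                self.ref_bits[i] = true;
                return false;
            }
            self.ref_bits[i] = false; // consume the second chance
        }
    }
}

fn main() {
    let mut cache = ClockCache::new(2);
    assert!(!cache.access(1)); // miss, install in frame 0
    assert!(!cache.access(2)); // miss, install in frame 1
    assert!(cache.access(1));  // hit
    assert!(!cache.access(3)); // miss: the sweep clears both bits and evicts page 1
}
```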

5. Fault Tolerance and Replication

Maintaining data integrity under partial failures and ensuring rapid recovery are essential in DSM-DBs, especially as memory and compute are decoupled:

  • Replication: In-RAM replication (k-way, e.g., as in RAMCloud) ensures that loss of an MN does not imply data loss; periodic checkpoints are streamed to persistent storage for full durability (Wang et al., 2022).
  • Commit Logging: Ownership transfer and version/color updates serve as implicit commit logs in type/ownership-guided designs; these can be offloaded to a backup for minimal durability overhead (Ma et al., 4 Jun 2024).
  • Quorum Protocols: Write and read quorums guarantee that no two concurrent operations can diverge, so long as a majority is correct (Ekström et al., 2016).
  • Recovery: Replica promotion and log replay allow rapid restoration of consistency following MN or CN failure.

A plausible implication is that low-overhead strong consistency and recovery, combined with elastic resource provisioning, can enable "always-on" memory pools with sub-second failover.
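
The quorum-intersection reasoning behind ABD-style protocols can be sketched as below; this is a single-process illustration with assumed types, and it omits the read write-back phase and the failure handling a real implementation needs.

```rust
/// Timestamped register value as stored on each replica.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Tagged {
    ts: u64, // logical clock; higher wins
    value: u64,
}

/// Majority size for n replicas: any two majorities intersect, so a read
/// quorum always contains at least one replica that saw the latest write.
fn majority(n: usize) -> usize {
    n / 2 + 1
}

/// Write phase: install the tagged value on (at least) a majority of replicas.
fn write_quorum(replicas: &mut [Tagged], new: Tagged) {
    let q = majority(replicas.len());
    for r in replicas.iter_mut().take(q) {
        if new.ts > r.ts {
            *r = new;
        }
    }
}

/// Read phase 1: query a majority and take the highest-timestamped value.
/// (Phase 2, omitted here, writes that value back to a majority before
/// returning, which is what makes reads linearizable.)
fn read_quorum(replicas: &[Tagged]) -> Tagged {
    let q = majority(replicas.len());
    // Query the *last* q replicas to emphasize that any majority overlaps
    // the write quorum in at least one replica.
    *replicas.iter().rev().take(q).max().unwrap()
}

fn main() {
    let mut replicas = vec![Tagged { ts: 0, value: 0 }; 5];
    write_quorum(&mut replicas, Tagged { ts: 1, value: 42 });
    let seen = read_quorum(&replicas);
    assert_eq!(seen, Tagged { ts: 1, value: 42 });
}
```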

6. Performance Characteristics and Scalability

DSM-DBs enabled by RDMA disaggregation and new coherence protocols achieve performance profiles markedly different from legacy architectures:

  • Latency: One-sided RDMA read/write/invalidation costs roughly 0.8–1.8 μs, compared with on the order of 100 ns for local cache hits; batching and asynchronous APIs can mask round-trip times for RDMA atomics (see the worked example after this list).
  • Throughput: NAM-DB reaches up to 6.5 million distributed new-order transactions per second, and 14.5 million total, on 56 machines under the TPC-C benchmark (Zamanian et al., 2016).
  • Scaling: Linear scale-up with the number of CNs and MNs is possible for both OLTP and OLAP workloads, so long as there is no centralized bottleneck (e.g., timestamp oracle). Partial replication can reduce per-replica metadata cost to O(d), where d is local replication degree (Xiang et al., 2017).
  • Workload Diversity: DRust shows 2.64–29.16× throughput improvement over state-of-the-art DSMs for OLAP, microservices, GEMM, and KV workloads (Ma et al., 4 Jun 2024).
  • Efficiency: Single-node overhead in advanced DSM runtimes is <2.5% vs. native, and pointer dereference incurs a ~30-cycle penalty.
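
As an illustrative back-of-the-envelope estimate (not a reported measurement), assume a local hit costs about 0.1 μs, a one-sided RDMA access about 1.3 μs (mid-range of the figures above), and a hypothetical hit ratio h = 0.9; the expected access latency is then

```latex
E[t_{\text{access}}] \approx h \, t_{\text{hit}} + (1-h) \, t_{\text{RDMA}}
                     = 0.9 \times 0.1\ \mu\text{s} + 0.1 \times 1.3\ \mu\text{s}
                     \approx 0.22\ \mu\text{s},
```

roughly a 6× improvement over always going remote. This modest ratio, compared with the ≥10⁴× RAM-vs-flash gap, is why the buffer-management guidance in Section 4 favors cheap replacement policies over heavyweight ones.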

The dominant scaling bottlenecks in legacy designs—centralized lock managers, timestamp servers, and high write-latency—are mitigated by decentralization and hardware acceleration.

7. Open Questions, Challenges, and Research Directions

The modern DSM-DB landscape raises and re-frames several foundational system questions:

  • Concurrency Control Tradeoffs: Implementing fine-grained (e.g., row-level) locks via RDMA atomics can incur non-trivial round-trips; the balance between granularity, overhead, and contention is an active topic (Wang et al., 2022).
  • Coherence and Consistency Spectrum: Language-aided coherence (e.g., DRust), partial causality tracking (Xiang et al., 2017), and software broadcast protocols define a rich design space between strict SC, SI, and eventual consistency.
  • Data and Computation Affinity: Optimizing data locality—spawning queries on partition owners, affinity pointers—determines both responsiveness and communication cost (Ma et al., 4 Jun 2024).
  • Function Offloading: Offloading operators (aggregation, scan) directly onto MNs reduces network load, suggesting a convergence of data+compute at the interface between RAM and CPU (Wang et al., 2022).
  • Failure Domains: DSM-DBs must treat compute-node and memory-node failures as orthogonal; elastic failure boundaries and rapid rebinding of compute to recovered MNs are essential for meeting SLAs.
  • Programming Models: PGAS-style models demonstrate how fine-grained synchronization, memory reclamation, and locality can be unified in global-view data structures at massive scale (Dewan et al., 2021).

A plausible implication is that continued hardware/network advances and further integration of type-level reasoning into runtime protocols may close the gap between the semantic convenience of shared memory and the operational scale of modern distributed DBMSs.


In conclusion, DSM-DBs, supported by advances in RDMA, memory disaggregation, language-level ownership, and decentralized concurrency control, represent a viable approach to unifying scalability, high utilization, and fine-grained programmability. Research continues to refine the tradeoffs across protocol efficiency, consistency, and system manageability for future generations of data-intensive platforms.
