
Centralized KV Store

Updated 25 February 2026
  • Centralized KV store is a server-based system that manages key-value pairs under strict service-level guarantees and optimizes performance using advanced data structures.
  • It employs mechanisms like LSM-trees, learned indexes, and SmartNIC accelerators to deliver low latency, high throughput, and efficient range scan capabilities.
  • Design trade-offs focus on balancing update efficiency and query performance, making the system ideal for read-, write-, and scan-intensive workloads in modern hardware environments.

A centralized key-value (KV) store is a server-based system that provides efficient storage, retrieval, and management of ⟨key, value⟩ pairs under strong service-level guarantees, typically optimized for single-site deployment or tightly-coupled clusters. Distinct from distributed or sharded KV stores, the centralized model concentrates all data and access coordination within a single system image, allowing it to leverage sophisticated internal data structures (often LSM-trees or learned indexes), hardware accelerators, and specialized I/O paths for maximizing throughput, minimizing latency, and supporting complex query types such as range scans. Modern centralized KV stores incorporate recent advances in memory architectures, storage-class hardware, programmable networking devices (SmartNICs), and algorithmic optimizations to address a spectrum of real-world read-, write-, and scan-dominated workloads (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024, Li et al., 2018).

1. Architectural Models and Deployment Paradigms

Centralized KV stores can be categorized by their data-path realization: purely host-based, offloaded (e.g., SmartNIC-accelerated), and hybrid tiered architectures.

  • Host-Centric LSM-Tree Stores: The canonical deployment, featuring an in-memory Memtable and persistent Log-Structured Merge (LSM) tree on SSD or HDD. All reads/writes, garbage collection (GC), and compaction are orchestrated on general-purpose CPUs (Wang et al., 2024, Li et al., 2018).
  • SmartNIC-Accelerated Stores: Advancements in programmable NICs (e.g., NVIDIA BlueField-3) enable hosting full or partial KV indices directly on SmartNIC data-path accelerators (DPAs). These on-path systems handle request lookup/traversal within NIC-resident memory, bypassing the host OS, kernel network stack, and even PCIe, for much lower latencies and superior scalability. DPA-Store exemplifies this approach, supporting line-rate lookups and high-throughput scans while relegating heavy update logic (e.g., rebalancing, splits) to the host CPU (Schimmelpfennig et al., 9 Jan 2026).
  • Tiered KV Stores with Learned Indexes: Recent models such as LearnedKV implement a two-tier separation, where recent writes reside in a fast LSM tree and reads are accelerated by a static, read-optimized learned index over historical data (“Key List”). This approach offers improved GC, dramatically shrinks the active LSM state, and reduces compaction and write amplification (Wang et al., 2024).
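The two-tier read path described for LearnedKV can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mutable LSM tier is modeled as a dict, and the read-only learned-index tier is stood in for by a sorted array with binary search; the class and attribute names (`TwoTierStore`, `lsm`, `cold_keys`) are our own.

```python
# Sketch of a LearnedKV-style two-tier read path (illustrative only):
# recent writes live in a small "LSM" tier (modeled here as a dict), while
# older data is served by an immutable, read-optimized sorted tier standing
# in for the learned index over historical data ("Key List").

import bisect

class TwoTierStore:
    def __init__(self, cold_pairs):
        # Cold tier: immutable, sorted historical data.
        self.cold_keys = sorted(k for k, _ in cold_pairs)
        cold = dict(cold_pairs)
        self.cold_vals = [cold[k] for k in self.cold_keys]
        # Hot tier: mutable LSM stand-in absorbing all recent writes.
        self.lsm = {}

    def put(self, key, value):
        self.lsm[key] = value          # writes never touch the cold tier

    def get(self, key):
        if key in self.lsm:            # freshest data always wins
            return self.lsm[key]
        i = bisect.bisect_left(self.cold_keys, key)
        if i < len(self.cold_keys) and self.cold_keys[i] == key:
            return self.cold_vals[i]   # fall back to the read-optimized tier
        return None

store = TwoTierStore([(10, "a"), (20, "b")])
store.put(20, "b2")                    # update shadows the cold copy
print(store.get(20), store.get(10))
```

Because updates only ever land in the hot tier and lookups consult it first, the cold tier can stay static and read-optimized, which is the property that lets LearnedKV shrink the active LSM state.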

2. Data Structures: Indexing and Storage Strategies

Multiple key data structures have emerged in centralized KV designs, each with distinct trade-offs for read/write mix, value size, and scan support:

  • LSM-Tree: An LSM-tree organizes keys in k levels L_0, L_1, …, L_{k−1}, each exponentially larger than the last (T-factor growth, T ≈ 10). Random writes buffer in DRAM and are flushed sequentially to disk as immutable SSTables. Compaction merges overlapping ranges across levels to maintain order and freshness, supporting fast point and range queries but incurring high write amplification (WA ≃ (T^n − 1)/(T − 1)) (Li et al., 2018).
  • KV-Separation (WiscKey, HashKV): Value storage is decoupled from key management. The LSM holds only keys, pointers, and metadata, while the append-only value log (vLog) captures payloads separately—lowering LSM compaction and read/write amplification. HashKV introduces hash-based data grouping, mapping each key's value deterministically to one of M main segments, which streamlines GC and eliminates LSM-tree lookups during value log reclamation (Li et al., 2018).
  • Learned Indexes (PLR, PLA Models): In LearnedKV and DPA-Store, a piecewise linear regression (PLR) or piecewise linear approximation (PLA) model tracks the empirical CDF F(k) to map a key to an ordered position or index-file offset. Such indexes, characterized by per-segment parameters (a_s, b_s), allow efficient O(log S) binary search, prediction, and bounded chunk scans. By tuning error bounds (e.g., ε_inner = 4, ε_leaf = 8), the store balances model complexity against index memory size and query latency (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024).
  • Auxiliary Structures: Small, fast in-memory Bloom filters reduce unnecessary LSM traversals, while local read caches (e.g., per-thread hot-entry caches on DPA) absorb skewed, repetitive access patterns (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024).
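The learned-index lookup above can be made concrete with a deliberately simplified single-segment model: a line fitted over the sorted keys predicts a position, and a bounded local search within ±ε corrects the prediction. Real PLR/PLA indexes fit many segments, each with its own (a_s, b_s); the single segment, the endpoint-based fit, and the function names here are our own simplifying assumptions.

```python
# Illustrative one-segment learned index: a linear model of the key
# distribution predicts a position, and a bounded search within +/- eps
# corrects it. Real PLR/PLA indexes use many segments with per-segment
# slopes and intercepts (a_s, b_s).

def fit_segment(keys):
    # Crude fit: a line through the first and last (key, position) points.
    a = (len(keys) - 1) / (keys[-1] - keys[0])
    b = -a * keys[0]
    # eps bounds the worst-case prediction error over all keys.
    eps = max(abs((a * k + b) - i) for i, k in enumerate(keys))
    return a, b, int(eps) + 1

def lookup(keys, key, a, b, eps):
    pred = int(a * key + b)
    lo = max(0, pred - eps)
    hi = min(len(keys), pred + eps + 1)
    for i in range(lo, hi):            # bounded scan instead of full search
        if keys[i] == key:
            return i
    return -1                          # key not present

keys = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
a, b, eps = fit_segment(keys)
print(lookup(keys, 13, a, b, eps))
```

The trade-off the section describes is visible here: a looser ε means a simpler (smaller) model but a longer corrective scan per lookup, while a tighter ε requires more segments to stay within bound.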

3. Update, Delete, and Consistency Mechanisms

Update and delete processing in centralized KV stores reflects a design spectrum from immediate in-place changes to batched out-of-path maintenance:

  • Insert/Update in LSM-Tree-Based Designs: All writes transit through the Memtable and are persisted as immutable SSTables. Consistency is ensured because the LSM always holds the freshest keys, and a Bloom filter can divert reads directly to the learned index on an LSM miss, keeping the system strongly consistent (the LSM takes precedence over the learned index) (Wang et al., 2024).
  • Hash-Based Grouping with Hot/Cold Separation: HashKV groups value log entries per key, conducts GC per “segment group,” and can accelerate reclamation of cold values by moving them out of the main segments after a period of inactivity. This reduces repeated movement of stable (cold) values during update-intensive workloads (Li et al., 2018).
  • Offloaded Batched Writes on SmartNICs: DPA-Store defers value retrieval and update work, buffering mutations in DPA (NIC)-side insert buffers and staging host-driven structural maintenance (splits, pointer swaps) as atomic multi-node “stitches” upon buffer overflow. Host-side patching maintains fine-grained locks briefly, while traversers remain entirely lock-free due to read-copy-update (RCU) semantics and epoch-based reclamation (Schimmelpfennig et al., 9 Jan 2026).
  • Garbage Collection (GC): LearnedKV exploits non-blocking GC, redirecting new writes to fresh structures while sequentially scanning and compacting the Value Log, synchronized with Learned Index rebuilds. No front-end query is blocked by GC, and during transition, reads check both old and new structures to guarantee availability (Wang et al., 2024).
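The non-blocking GC pattern described for LearnedKV can be sketched as a log swap in which reads consult both the old and new structures during the transition. This is a schematic model, not the paper's code: the value log is reduced to a dict of latest entries, GC is split into explicit start/finish steps instead of running in a background thread, and all names (`ValueLog`, `Store`, `start_gc`) are our own.

```python
# Sketch of non-blocking GC for an append-only value log, LearnedKV-style:
# new writes are redirected to a fresh log while live entries are copied
# over, and reads check both logs during the transition so no query blocks.

class ValueLog:
    def __init__(self):
        self.entries = {}              # key -> latest value in this log

    def append(self, key, value):
        self.entries[key] = value

class Store:
    def __init__(self):
        self.active = ValueLog()       # receives all new writes
        self.old = None                # non-None only while GC is in flight

    def put(self, key, value):
        self.active.append(key, value)

    def get(self, key):
        if key in self.active.entries:
            return self.active.entries[key]
        if self.old is not None:       # during GC, fall back to the old log
            return self.old.entries.get(key)
        return None

    def start_gc(self):
        # Redirect writes to a fresh log; the old log stays readable.
        self.old, self.active = self.active, ValueLog()

    def finish_gc(self):
        # Keep only values not overwritten by writes that arrived mid-GC.
        for k, v in self.old.entries.items():
            if k not in self.active.entries:
                self.active.append(k, v)
        self.old = None

store = Store()
store.put("a", 1)
store.start_gc()                       # reads still see "a" via the old log
store.put("b", 2)                      # concurrent write during GC
store.finish_gc()
print(store.get("a"), store.get("b"))
```

The key invariant is that a value written during GC always shadows the copy being migrated, so the dual-read transition never returns stale data.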

4. Performance Characteristics and Evaluation

Empirical studies reveal sharp distinctions in throughput, latency, write amplification, and scalability among recent centralized KV store designs:

| Store / Design | Point Lookup Throughput | Range Scan Performance | Update/Insert Rate |
|---|---|---|---|
| DPA-Store (Schimmelpfennig et al., 9 Jan 2026) | 33 MOPS (GET, median ≤ 3.2 μs) | 13 MOPS for 10-key ranges | 12 MOPS (UPDATE); 1.7 MOPS (INSERT, bottlenecked by the host–DPA write path) |
| LearnedKV (Wang et al., 2024) | Up to 4.32× faster reads vs. LSM | 1.3×–5× faster, one-shot LI prediction | 1.43× write speedup; LSM reduced by α (fraction in LI) |
| HashKV (Li et al., 2018) | Matches/exceeds LSM on large KVs | Matches LSM on ≥ 4 KiB values; up to 70% slower for small KVs | 4.6× throughput over vLog; 1.3–1.4× over RocksDB/LDB |

Performance evaluation consistently uses metrics such as median and tail latency, MOPS (million operations per second), WA (write amplification), and DRAM footprint. DPA-Store is constrained primarily by DPA memory and PCIe DMA latency; with architectural refinements (e.g., lower DPA memory latency, block-DMA), GET rates could surpass 60 MOPS. LearnedKV’s main bottlenecks are conversion/compaction throughput and the efficiency of bulk data movement across the Bloom-filtered LSM and learned index tiers. HashKV’s gains are maximized for workloads with high update skew, significant value size, and tail-heavy “hot” value access distributions.

5. Design Trade-Offs, System Integration, and Best Practices

Key design tension points in centralized KV stores include index complexity, update path efficiency, storage cost, and scalability:

  • Separation of Concerns: Partitioning logic—placing only read-heavy and latency-sensitive structures on fast-path (e.g., DPA-resident learned indexes) and relegating complex or infrequent mutative operations to background or host-processing—yields both performance and simplicity (Schimmelpfennig et al., 9 Jan 2026).
  • Model-Driven Indexing: Employing learned indexes or PLA/PLR models applies statistical modeling principles (segmental regressions with strict error bounds) in lieu of traditional B+-trees, compressing metadata and minimizing memory/core pressure, especially in hardware-constrained environments (Wang et al., 2024).
  • Hash-Based Grouping: HashKV’s deterministic value grouping allows per-group GC and eliminates LSM lookups during reclamation, but this may complicate range-scan support or cold-item migration (Li et al., 2018).
  • Tunability: Error bounds for learned models, segment sizes, Bloom filter bits per key, and GC thresholds are all key parameters with workload-dependent optimal settings for throughput, memory, and amplification minimization (Wang et al., 2024, Schimmelpfennig et al., 9 Jan 2026).
  • Integration: HashKV integrates seamlessly with existing LSM-backed stores (LevelDB, RocksDB, HyperLevelDB, PebblesDB) by plugging in the segment-grouped value log and GC logic with minimal code changes, consistently improving throughput for write-heavy applications (Li et al., 2018).
  • Concurrency Control: DPA-Store enables lock-free traversers for fast-path reads and employs RCU/epoch-based reclamation, while patchers and stitchers serially manage updates to avoid races. GC in LearnedKV and HashKV runs in the background without observable latency spikes or correctness anomalies (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024).
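HashKV's deterministic value grouping, mentioned twice above, hinges on one property: a key's segment is a pure function of the key, so reclaiming a segment group never requires an LSM lookup to locate a value. A minimal sketch, with the segment count M and all helper names chosen for illustration:

```python
# HashKV-style deterministic grouping, sketched: each key hashes to one of
# M main segments, so per-group GC can find every version of a key's value
# without consulting the LSM tree.

import hashlib

M = 8                                   # number of main segments (assumed)

def segment_of(key: str) -> int:
    # A stable hash (unlike Python's per-process salted hash()) keeps
    # placement deterministic across runs, as the grouping requires.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % M

segments = [[] for _ in range(M)]       # append-only value-log segments

def append_value(key, value):
    segments[segment_of(key)].append((key, value))

def gc_segment(seg_id):
    # Per-group GC: keep only the newest entry per key in this segment.
    latest = {}
    for k, v in segments[seg_id]:
        latest[k] = v
    segments[seg_id] = list(latest.items())

append_value("user:1", "v1")
append_value("user:1", "v2")           # update lands in the same segment
gc_segment(segment_of("user:1"))
print(segments[segment_of("user:1")])  # only the newest value survives
```

The cost of this design, as noted above, is that keys adjacent in sort order scatter across segments, which is why range scans and cold-item migration become more complicated than in a purely sorted layout.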

6. Hardware and Media Considerations

Centralized KV-store performance is deeply tied to underlying hardware properties, with design choices reflecting specific platform constraints:

  • DRAM and Accelerator Memory: DPA-Store leverages 1 GiB of DDR5 DPA memory on BlueField-3; all fast-path logic is contained entirely in the DPA, with ARM cores disabled. LearnedKV and HashKV both require modest DRAM for Memtable, Bloom filter, and model segments (tens of MB for hundreds of millions of keys), but rely heavily on storage-class devices for value payloads and immutable index tiers (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024, Li et al., 2018).
  • Storage Media: SSDs offer random-access latencies of roughly 100 μs and sequential-access latencies of roughly 10 μs, suitable for learned-index chunk reads in LearnedKV and large-value storage in HashKV. On HDDs, range-read amplification is more problematic, but learned indexes mitigate disk-head movement by aligning chunk scans sequentially (Wang et al., 2024, Li et al., 2018).
  • SmartNICs and On-Path Acceleration: By integrating request processing onto programmable DPA cores, DPA-Store achieves full OS bypass, minimal PCIe crossing, and stateless client operation—an advantage over both host-only and RDMA-based distributed designs (Schimmelpfennig et al., 9 Jan 2026).
  • GC and Bulk Transfer Paths: Both LearnedKV and DPA-Store’s batched offload/update actors are throttled by storage/memory DMA efficiency and the concurrency of GC or “stitching” threads. Hardware support for faster block-DMA and further parallelism would remove observed bottlenecks, especially for large-scale bulk loads and insert-intense phases (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024).

7. Implications, Scalability, and Future Directions

Centralized KV stores continue to evolve along several axes:

  • Concurrency Scaling: Thread-per-port request handling (DPA-Store), per-group GC (HashKV), and background rebuilds (LearnedKV) scale linearly with thread/core count, yielding high MOPS on modern multi-core/multi-threaded hardware (Schimmelpfennig et al., 9 Jan 2026, Li et al., 2018).
  • Stateless Clients and Failover: Systems like DPA-Store eliminate client-side state/caching entirely, improving stateless recovery, load balancing, and simplifying horizontal scaling and fault tolerance (Schimmelpfennig et al., 9 Jan 2026).
  • Multi-Tenancy and Memory Efficiency: Compression of LSM tree (via LI in LearnedKV), value separation, and group-based meta-partitioning reduce DRAM footprint, enabling higher multi-tenant densities and better aggregate system resource utilization (Wang et al., 2024, Li et al., 2018).
  • Hardware Co-Design: The trend toward integrated SmartNIC and storage-class memory components is poised to further drive centralization, reduce datacenter PCIe and memory bottlenecks, and enable “line-rate” execution of previously host-bound data structures (Schimmelpfennig et al., 9 Jan 2026).
  • Software and System Design Principles: Offloading read-intensive traversals, strict modeling of index error bounds, multi-stage update batching, and RCU/epoch-based concurrency primitives are now best practices for both host and accelerator-resident centralized KV stores (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024).

Overall, the centralized KV store remains a fast-moving research area, with recent work demonstrating substantial gains in performance, simplicity, and resilience by combining algorithmic, system-architectural, and hardware-accelerated innovations (Schimmelpfennig et al., 9 Jan 2026, Wang et al., 2024, Li et al., 2018).
