
FlashMap: Flash-Optimized KV Store

Updated 18 November 2025
  • FlashMap is a flash-optimized key-value store that leverages an append-only, log-structured design with parallel I/O to achieve high throughput and low latency.
  • It utilizes a succinct in-memory index, large-scale caching, and TRIM-assisted garbage collection to optimize SSD performance and reduce write amplification.
  • Its architecture supports multi-threaded operations and adaptive scheduling, making it ideal for use cases like caching tiers, session management, and distributed infrastructures.

FlashMap is a high-performance, flash-optimized key-value (KV) store designed to maximize throughput and minimize latency on modern solid-state drives (SSDs). By leveraging a log-structured, append-only architecture with N-way parallelism, succinct in-memory indexing, and SSD-aware I/O strategies, FlashMap achieves up to 19.8 million inserts per second and 23.8 million random lookups per second on a single enterprise-grade server using 100-byte payloads. The engine integrates large-scale caching, write coalescing, background TRIM-assisted garbage collection, and parallel scheduling, targeting applications demanding high responsiveness and scalability such as caching tiers, session management, and distributed infrastructures (Guo et al., 11 Nov 2025).

1. System Architecture and Data Structures

FlashMap partitions an SSD into N “active” append-only strands and M spare strands, yielding a total of N + M contiguous, log-structured regions. Each active strand serves as an independent append stream, while spare strands are reserved for garbage collection (GC) and background data migration.

The mapping of keys to storage locations proceeds as follows:

  • A global in-memory index maps a key K to a physical flash address A within one of the N active strands.
  • Strand assignment uses a 64-bit hash: i = h(K) mod N.
  • Inserts, updates, and deletes are recorded as new appends to strand i. Lookups use the index to locate A and read the data directly (a code sketch follows this list).
  • Startup recovers the index by scanning the active strands; shutdown optionally serializes the index to persistent storage.
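
The following sketch illustrates this flow under simplifying assumptions: each strand is modeled as an in-memory append-only list standing in for a log-structured flash region, list positions stand in for flash addresses, and the index is a plain dict rather than the succinct tree described below. Class and method names are illustrative, not FlashMap's actual API.

import hashlib

N = 32  # number of active strands (the paper's configuration)

def strand_of(key: bytes) -> int:
    # 64-bit hash of the key, reduced modulo N to pick the append strand
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return h % N

class FlashMapSketch:
    def __init__(self):
        self.strands = [[] for _ in range(N)]   # toy append-only logs
        self.index = {}                         # key -> (strand id, offset)

    def put(self, key: bytes, value: bytes) -> None:
        i = strand_of(key)
        prev = self.index.get(key)              # stands in for link_ptr to the prior version
        self.strands[i].append((key, value, prev))
        self.index[key] = (i, len(self.strands[i]) - 1)

    def get(self, key: bytes):
        loc = self.index.get(key)
        if loc is None:
            return None                         # unknown or deleted key
        i, off = loc
        return self.strands[i][off][1]          # read the value at the mapped address

    def delete(self, key: bytes) -> None:
        i = strand_of(key)
        self.strands[i].append((key, None, self.index.get(key)))  # tombstone append
        self.index.pop(key, None)

kv = FlashMapSketch()
kv.put(b"user:42", b"alice")
assert kv.get(b"user:42") == b"alice"

Updates and deletes never modify data in place; they append a newer record and repoint the index, which is what keeps every strand a purely sequential write stream.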

On flash, each strand has size S ≈ SSD_capacity / (N + M) and hosts a sequential chain of records:

[ meta | key bytes | value bytes ]

The metadata includes a link pointer (link_ptr) to previous versions of the same key and the record lengths (key_len, val_len); tombstone deletes are signaled via val_len = -1. Version chains and strand scans underpin fast index recovery and efficient GC.
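
A minimal encoder/decoder for this record layout is sketched below, assuming fixed-width little-endian header fields (an 8-byte link_ptr, a 4-byte key_len, and a signed 4-byte val_len); the paper does not specify field widths, so these sizes are illustrative.

import struct

# Assumed header layout: link_ptr (8 B, offset of the prior version or 0),
# key_len (4 B unsigned), val_len (4 B signed; -1 marks a tombstone delete).
HEADER = struct.Struct("<QIi")

def encode_record(link_ptr, key, value):
    val_len = -1 if value is None else len(value)
    return HEADER.pack(link_ptr, len(key), val_len) + key + (value or b"")

def decode_record(buf, off=0):
    link_ptr, key_len, val_len = HEADER.unpack_from(buf, off)
    off += HEADER.size
    key = buf[off:off + key_len]
    off += key_len
    value = None if val_len < 0 else buf[off:off + val_len]
    return link_ptr, key, value

rec = encode_record(0, b"user:42", b"alice")
assert decode_record(rec)[1:] == (b"user:42", b"alice")

Using a signed val_len lets the value -1 double as the tombstone marker without an extra flag byte.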

The in-memory index is typically a succinct search tree with O(log N) lookup and update, costing approximately 12 bytes per entry. Supported operations include exact lookup, update (in-place mapping change), and predecessor/successor search.
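
As a rough functional stand-in for that index, the sketch below uses Python's bisect module over a sorted key list. It supports the same operation set (exact lookup, in-place update, predecessor/successor search) but with linear-time insertion and far more than 12 bytes per entry, so it is illustrative only.

import bisect

class OrderedIndexSketch:
    def __init__(self):
        self.keys = []        # sorted list of keys
        self.addrs = {}       # key -> (strand id, offset) on flash

    def upsert(self, key, addr):
        if key not in self.addrs:
            bisect.insort(self.keys, key)
        self.addrs[key] = addr            # in-place mapping change on update

    def lookup(self, key):
        return self.addrs.get(key)        # exact lookup

    def predecessor(self, key):
        i = bisect.bisect_left(self.keys, key)
        return self.keys[i - 1] if i > 0 else None

    def successor(self, key):
        i = bisect.bisect_right(self.keys, key)
        return self.keys[i] if i < len(self.keys) else None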

Summary formulas are central to FlashMap's design (a worked example follows the list):

  • Strand selection: i = h(K) mod N
  • Index memory use: M_index ≈ N_entries × 12 B
  • Space utilization: U = valid_data_bytes / (S · N)
  • Idealized write amplification: WA ≈ 1 + overhead_bytes / user_bytes
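
Plugging the paper's benchmark scale into these formulas gives a sense of the memory budget; the key and per-record metadata sizes below are assumptions chosen only to make the write-amplification figure concrete.

entries     = 1_000_000_000                 # benchmark keyspace (1 billion entries)
index_bytes = entries * 12                  # M_index ≈ N_entries × 12 B -> roughly 12 GB of DRAM

user_bytes  = 100                           # payload per record (benchmark object size)
key_bytes   = 16                            # assumed key size, not specified in the paper
meta_bytes  = 16                            # assumed per-record header size
wa          = 1 + (key_bytes + meta_bytes) / user_bytes   # idealized WA ≈ 1.32

print(f"index ~ {index_bytes / 2**30:.1f} GiB, idealized WA ~ {wa:.2f}")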

2. Flash-Optimized I/O and Write Path Strategies

FlashMap minimizes write amplification and maximizes device throughput via flash-conscious techniques:

  • Append-only, large-block writes: All writes are appended sequentially (≥64 KB per I/O), combining small updates in a 32 MB per-strand in-memory buffer. This minimizes costly SSD-internal read/modify/write cycles and leverages sequential bandwidth.
  • Parallel, strand-based I/O scheduling: With N = 32 active strands, up to 32 parallel append streams are interleaved, ensuring all SSD channels are saturated. Round-robin scheduling maintains high device utilization and fairness across strands (a simplified write-path sketch follows this list).
  • TRIM-assisted garbage collection and wear leveling: GC picks a retired strand and migrates only valid records to a fresh spare strand, atomically swapping index mappings and issuing TRIM commands to inform the SSD of freed regions, thereby cooperating with the device's wear-leveling firmware.
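
A simplified write path under these rules is sketched below; the 32 MB buffer size follows the paper, while the file-per-strand layout, the logical offsets, and the locking are assumptions made to keep the example self-contained.

import threading

BUFFER_SIZE = 32 * 1024 * 1024              # per-strand coalescing buffer (32 MB)

class StrandWriter:
    def __init__(self, path):
        self.f = open(path, "ab")           # one append-only file stands in for a strand
        self.end = self.f.seek(0, 2)        # logical end of the strand
        self.buf = bytearray()
        self.lock = threading.Lock()        # lightweight per-strand append lock

    def append(self, record: bytes) -> int:
        with self.lock:
            offset = self.end + len(self.buf)   # address the in-memory index will point to
            self.buf += record
            if len(self.buf) >= BUFFER_SIZE:
                self._flush_locked()
            return offset

    def _flush_locked(self):
        # One large sequential write per flush keeps I/O in the >=64 KB regime the
        # SSD prefers, instead of many small read/modify/write cycles.
        self.f.write(self.buf)
        self.f.flush()
        self.end += len(self.buf)
        self.buf.clear()

In the engine itself, each filled buffer becomes one large append on its strand, and the round-robin scheduler interleaves flushes across the 32 strands so that all SSD channels stay busy.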

GC is managed by a background thread (typically M = 1 spare strand), running at low priority. Contention with foreground I/O only occurs during the final atomic swap.
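
The GC cycle can be sketched as follows; the retired/spare strand objects, the record iterator, the index locking, and the trim callback are hypothetical interfaces (a real implementation would issue TRIM/discard against the freed device range).

def collect_strand(retired, spare, index, trim):
    # Migrate live records from the retired strand into the spare strand,
    # swap the index mappings atomically, then TRIM the freed region.
    moved = []                                    # (key, old location, new location)
    for offset, key, value in retired.records():  # hypothetical record iterator
        old_loc = (retired.id, offset)
        if index.lookup(key) == old_loc:          # skip stale versions and tombstones
            new_loc = (spare.id, spare.append(key, value))
            moved.append((key, old_loc, new_loc))

    with index.write_lock():                      # the only point of foreground contention
        for key, old_loc, new_loc in moved:
            # Repoint only entries the foreground has not updated since the scan.
            if index.lookup(key) == old_loc:
                index.upsert(key, new_loc)

    trim(retired.device_range())                  # tell the SSD these blocks are free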

3. Multi-Threaded Operation, Indexing, and Caching

Concurrency and caching are primary enablers of FlashMap’s performance:

  • Thread-safe operations: Each strand employs a lightweight append lock (e.g., atomic fetch-and-add) ensuring multi-threaded appends with minimal contention. Reads require only strand-local locks when traversing version chains.
  • Read caching: A 1 GB LRU cache per strand (partitioned into 32 MB segments) supports rapid lookups; on a 32-strand system the total cache size is 32 GB. For a uniform random workload, the cache-hit probability is modeled as

H ≈ 1 - e^(-C/W)

where C is the total cache size and W the working-set size (a worked example follows this list).

  • Efficient write buffering: Updates accumulate in 32 MB per-strand write buffers. Only when a buffer fills is a large append issued, optimizing for device-preferred access sizes and further suppressing write amplification.
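
For instance, with the 32 GB total cache described above, the model predicts roughly a 63% hit rate when the working set equals the cache and climbs quickly as the working set shrinks; the working-set sizes below are hypothetical.

import math

C = 32 * 2**30                       # total cache: 32 strands x 1 GB
for W_gb in (16, 32, 64, 128):       # hypothetical working-set sizes
    W = W_gb * 2**30
    H = 1 - math.exp(-C / W)         # H ≈ 1 - e^(-C/W)
    print(f"working set {W_gb:>3} GB -> modeled hit rate {H:.2f}")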

Eviction strategies reclaim the least-recently-used records per segment, maintaining high hit rates for small to moderate working sets.
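
A toy version of such a cache is sketched below; the 1 GB per-strand capacity and 32 MB segments follow the paper, while the address-hashed segment placement and the byte-level size accounting are assumptions.

from collections import OrderedDict

SEGMENT_BYTES = 32 * 2**20                        # 32 MB per segment
NUM_SEGMENTS  = (1 * 2**30) // SEGMENT_BYTES      # 1 GB per strand -> 32 segments

class SegmentedLRU:
    def __init__(self):
        self.segments = [OrderedDict() for _ in range(NUM_SEGMENTS)]
        self.sizes = [0] * NUM_SEGMENTS

    def _seg(self, addr: int) -> int:
        return hash(addr) % NUM_SEGMENTS          # assumption: place records by address hash

    def get(self, addr: int):
        seg = self.segments[self._seg(addr)]
        if addr in seg:
            seg.move_to_end(addr)                 # mark as most recently used
            return seg[addr]
        return None                               # miss: caller reads from flash

    def put(self, addr: int, record: bytes) -> None:
        s = self._seg(addr)
        seg = self.segments[s]
        if addr in seg:
            self.sizes[s] -= len(seg[addr])
        seg[addr] = record
        seg.move_to_end(addr)
        self.sizes[s] += len(record)
        while self.sizes[s] > SEGMENT_BYTES:      # evict least-recently-used records
            _, old = seg.popitem(last=False)
            self.sizes[s] -= len(old)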

4. Performance Characteristics

Benchmarks on an AWS i8g.4xlarge server (16 Graviton4 vCPUs, 128 GB DRAM, 3.75 TB Nitro SSD) under a workload of 1 billion 100-byte KV objects showed:

Threads | Inserts (M ops/s) | Seq. lookup (M ops/s) | Rnd. lookup (M ops/s) | Deletes (M ops/s)
1       | 1.3               | 1.8                   | 1.6                   | 1.5
4       | 4.8               | 6.6                   | 6.0                   | 5.2
8       | 10.1              | 13.9                  | 12.5                  | 11.2
16      | 19.8              | 26.5                  | 23.8                  | 21.2

On 16 threads, FlashMap achieves its maximum reported throughput: 19.8 million inserts/second and 23.8 million random lookups/second.

Latency percentile results for a 20% update/80% lookup mix (100 B records):

Threads | p95 (µs) | p99 (µs) | p99.9 (µs)
1       | 0.65     | 2.82     | 8.07
4       | 0.94     | 3.10     | 11.7
8       | 1.37     | 2.97     | 16.9
16      | 2.56     | 6.91     | 31.7

FlashMap sustains full SSD bandwidth even for workloads that exceed the cache, indicating minimal engine overhead as payload size increases.

Direct comparison under identical workloads shows FlashMap outperforming in-memory and LSM-style engines such as Redis and RocksDB, with approximately twice the insert throughput and three times the random-lookup throughput on a single node. These gains derive primarily from large, append-only I/O and flash-specific caching (Guo et al., 11 Nov 2025).

5. Limitations and Future Directions

Known constraints include:

  • Performance for very large keyspaces degrades as cache-hit rate drops, leading to higher read latencies and I/O amplification in cache-miss regimes.
  • Wear-leveling effectiveness is ultimately contingent on the SSD’s internal controller; suboptimal GC patterns could exacerbate device wear.
  • Transaction support is currently limited: FlashMap forks an entire in-memory child store per transaction, which may be unsuitable for long-lived or massive transactions.

Prospective enhancements focus on:

  • Adaptive or variable-length strand sizing to optimize between GC overhead and SSD fragmentation.
  • Multi-tier caching, such as NVMe-DIMM layers, to expand cacheable working sets.
  • Inline compaction and support for range queries or secondary indices.
  • Support for application-driven wear-leveling advisories beyond TRIM commands.

6. Context and Significance

FlashMap exemplifies an approach in KV store design that abandons traditional LSM-tree architectures in favor of log-structured, highly parallel write strategies attuned to flash SSD operational characteristics. Its architecture highlights the criticality of aligning data layout, caching, and concurrency controls with the physical realities of SSD bandwidth, write amplification, and garbage collection behaviors.

Its empirical performance results demonstrate that substantial gains can be realized by minimizing random writes, maximizing sequential throughput, and maintaining aggressive in-memory caching. As data center storage remains dominated by SSDs, such architectures suggest new frontiers for scaling KV stores to high-throughput, low-latency, and flash-efficient designs (Guo et al., 11 Nov 2025).
