FlashMap: Flash-Optimized KV Store
- FlashMap is a flash-optimized key-value store that leverages an append-only, log-structured design with parallel I/O to achieve high throughput and low latency.
- It utilizes a succinct in-memory index, large-scale caching, and TRIM-assisted garbage collection to optimize SSD performance and reduce write amplification.
- Its architecture supports multi-threaded operations and adaptive scheduling, making it ideal for use cases like caching tiers, session management, and distributed infrastructures.
FlashMap is a high-performance, flash-optimized key-value (KV) store designed to maximize throughput and minimize latency on modern solid-state drives (SSDs). By leveraging a log-structured, append-only architecture with N-way parallelism, succinct in-memory indexing, and SSD-aware I/O strategies, FlashMap achieves up to 19.8 million inserts per second and 23.8 million random lookups per second on a single enterprise-grade server using 100-byte payloads. The engine integrates large-scale caching, write coalescing, background TRIM-assisted garbage collection, and parallel scheduling, targeting applications demanding high responsiveness and scalability such as caching tiers, session management, and distributed infrastructures (Guo et al., 11 Nov 2025).
1. System Architecture and Data Structures
FlashMap partitions an SSD into “active” append-only strands and spare strands, together forming a set of contiguous, log-structured regions. Each active strand serves as an independent append stream, while spare strands are reserved for garbage collection (GC) and background data migration.
The mapping of keys to storage locations proceeds as follows:
- A global in-memory index maps a key to a physical flash address within one of the active strands.
- Strand assignment uses a 64-bit hash of the key, taken modulo the number of active strands.
- Inserts, updates, and deletes are recorded as new appends to the key's assigned strand. Lookups use the index to locate the record's physical address and read the data directly (see the sketch after this list).
- Startup recovers the index by scanning the active strands; shutdown optionally serializes the index to persistent storage.
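A minimal sketch of this key-to-strand mapping, assuming the hash-modulo assignment described above and an illustrative `(strand, offset)` address encoding; the names (`FlashAddr`, `append_record`) are hypothetical, not FlashMap's API:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative physical address: which strand holds the record and where.
struct FlashAddr {
    uint32_t strand;   // index of the active strand holding the record
    uint64_t offset;   // byte offset of the record within that strand
};

// Hash-modulo strand assignment, assuming a 64-bit hash over the key.
inline uint32_t strand_of(const std::string& key, uint32_t num_active_strands) {
    uint64_t h = std::hash<std::string>{}(key);   // stand-in for the 64-bit hash
    return static_cast<uint32_t>(h % num_active_strands);
}

// Global in-memory index: key -> physical flash address.
// (FlashMap uses a succinct search tree; a hash map serves only as a stand-in here.)
using Index = std::unordered_map<std::string, FlashAddr>;

// Insert/update path: append the record to its strand, then point the index at it.
FlashAddr append_record(Index& index, const std::string& key,
                        uint32_t num_active_strands, uint64_t append_offset) {
    FlashAddr addr{strand_of(key, num_active_strands), append_offset};
    index[key] = addr;    // lookups now resolve to the newest version
    return addr;
}
```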
On flash, each strand is a fixed-size region hosting a sequential chain of records, each carrying a metadata header alongside the key and value bytes. The metadata includes a link pointer to the previous version of the same key and the key and value lengths; tombstone deletes are signaled via a marker in the metadata. Version chains and strand scans underpin fast index recovery and efficient GC.
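The record layout can be pictured with a header like the following; the field widths and the tombstone encoding are assumptions for illustration (the text only states that a link pointer, key/value lengths, and a tombstone signal are present):

```cpp
#include <cstdint>

// Illustrative on-flash record header. Field sizes are assumptions;
// FlashMap's actual encoding may differ.
#pragma pack(push, 1)
struct RecordHeader {
    uint64_t prev_version;   // flash address of the previous version of this key
                             // (the link pointer used by version chains and GC)
    uint32_t key_len;        // length of the key bytes that follow the header
    uint32_t value_len;      // length of the value bytes; a sentinel marks a tombstone
};
#pragma pack(pop)

// Assumed tombstone convention: a reserved value length signals a delete.
constexpr uint32_t kTombstoneValueLen = 0xFFFFFFFFu;

inline bool is_tombstone(const RecordHeader& h) {
    return h.value_len == kTombstoneValueLen;
}
// A record on flash is then: [RecordHeader][key bytes][value bytes (absent for tombstones)].
```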
The in-memory index is typically a succinct search tree supporting fast lookup and update at a cost of approximately 12 bytes per entry. Supported operations include exact lookup, update (in-place mapping change), and predecessor/successor search.
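The operation set can be sketched with `std::map` standing in for the succinct tree (the real structure is far more compact, at roughly 12 bytes per entry); the function names are illustrative:

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <string>

struct FlashAddr { uint32_t strand; uint64_t offset; };

// std::map stands in for FlashMap's succinct search tree purely to show the
// operation set; it is much heavier than ~12 bytes per entry.
using OrderedIndex = std::map<std::string, FlashAddr>;

std::optional<FlashAddr> lookup(const OrderedIndex& idx, const std::string& key) {
    auto it = idx.find(key);
    if (it == idx.end()) return std::nullopt;
    return it->second;
}

void update(OrderedIndex& idx, const std::string& key, FlashAddr addr) {
    idx[key] = addr;   // in-place mapping change: key now resolves to the new location
}

// Predecessor search: largest key strictly less than `key`, if any.
std::optional<std::string> predecessor(const OrderedIndex& idx, const std::string& key) {
    auto it = idx.lower_bound(key);
    if (it == idx.begin()) return std::nullopt;
    return std::prev(it)->first;
}

// Successor search: smallest key strictly greater than `key`, if any.
std::optional<std::string> successor(const OrderedIndex& idx, const std::string& key) {
    auto it = idx.upper_bound(key);
    if (it == idx.end()) return std::nullopt;
    return it->first;
}
```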
Summary formulas are central to FlashMap's design (a small numerical sketch follows the list):
- Strand selection: $s(k) = \mathrm{hash}_{64}(k) \bmod N$ for $N$ active strands.
- Index memory use: $M_{\text{index}} \approx 12\,\text{bytes} \times n$ for $n$ resident keys.
- Space utilization: $U = \text{live bytes} / \text{total strand capacity}$.
- Idealized write amplification: $\mathit{WA} \approx 1/(1-u)$, where $u$ is the valid-data fraction of a strand at GC time.
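A small numerical sketch of these derived quantities, with illustrative constants taken from the benchmark configuration described later (1 billion 100-byte objects, 3.75 TB device); the $1/(1-u)$ expression is the standard idealized model for log-structured GC and is used here as an assumption:

```cpp
#include <cstdio>

// Illustrative calculation of the derived quantities above. Constants are
// assumptions chosen to mirror the benchmark described later in the text.
int main() {
    const double bytes_per_entry = 12.0;          // succinct index cost per key
    const double num_keys = 1e9;                  // 1-billion-key workload
    const double payload_bytes = 100.0;           // 100-byte objects
    const double strand_capacity = 3.75e12;       // 3.75 TB device given to strands

    // Index memory use: ~12 bytes per key.
    const double index_gib = bytes_per_entry * num_keys / (1024.0 * 1024.0 * 1024.0);

    // Space utilization: live bytes over total strand capacity.
    const double utilization = payload_bytes * num_keys / strand_capacity;

    // Idealized write amplification: if a retired strand still holds a valid
    // fraction u of its data at GC time, copying it out to reclaim the rest
    // costs WA = 1 / (1 - u). Here u is approximated by overall utilization.
    const double u = utilization;
    const double write_amp = 1.0 / (1.0 - u);

    std::printf("index ~= %.1f GiB, utilization = %.3f, idealized WA = %.2f\n",
                index_gib, utilization, write_amp);
    return 0;
}
```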
2. Flash-Optimized I/O and Write Path Strategies
FlashMap minimizes write amplification and maximizes device throughput via flash-conscious techniques:
- Append-only, large-block writes: All writes are appended sequentially (≥64 KB per I/O), combining small updates in a 32 MB per-strand in-memory buffer. This minimizes costly SSD-internal read/modify/write cycles and leverages sequential bandwidth.
- Parallel, strand-based I/O scheduling: With multiple active strands, up to 32 parallel append streams are interleaved, ensuring all SSD channels are saturated. Round-robin scheduling maintains high device utilization and fairness across strands.
- TRIM-assisted garbage collection and wear leveling: GC picks a retired strand and migrates only valid records to a fresh spare strand, atomically swapping index mappings and issuing TRIM commands to inform the SSD of freed regions, thereby cooperating with the device's wear-leveling firmware.
GC is managed by a low-priority background thread operating on a spare strand. Contention with foreground I/O occurs only during the final atomic swap.
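A simplified, in-memory model of this GC pass, where a `trim()` stub stands in for the TRIM command and validity is determined by checking whether the index still points at the record being scanned; this is a sketch of the mechanism, not FlashMap's implementation:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct FlashAddr { uint32_t strand; uint64_t offset; };
struct Record    { std::string key; std::string value; uint64_t offset; };

using Index  = std::unordered_map<std::string, FlashAddr>;
using Strand = std::vector<Record>;   // in-memory stand-in for an on-flash strand

// Stand-in for issuing TRIM over the retired strand's block range.
void trim(uint32_t strand_id) { (void)strand_id; /* e.g. a discard ioctl on the real device */ }

// Migrate only valid records from a retired strand into a fresh spare strand,
// then swap index mappings and TRIM the retired region.
void gc_strand(Index& index, Strand& retired, uint32_t retired_id,
               Strand& spare, uint32_t spare_id) {
    uint64_t spare_offset = 0;
    for (const Record& rec : retired) {
        auto it = index.find(rec.key);
        // A record is valid only if the index still points at this exact copy.
        bool valid = it != index.end() &&
                     it->second.strand == retired_id &&
                     it->second.offset == rec.offset;
        if (!valid) continue;                             // stale version or deleted: skip
        spare.push_back({rec.key, rec.value, spare_offset});
        it->second = FlashAddr{spare_id, spare_offset};   // the atomic swap point in the real engine
        spare_offset += rec.key.size() + rec.value.size();
    }
    retired.clear();
    trim(retired_id);   // tell the SSD the retired strand's blocks are free
}
```

In the real engine, the index update is the atomic swap point: once it completes, readers resolve the key to the spare strand and the retired strand can be TRIMmed safely.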
3. Multi-Threaded Operation, Indexing, and Caching
Concurrency and caching are primary enablers of FlashMap’s performance:
- Thread-safe operations: Each strand employs a lightweight append lock (e.g., atomic fetch-and-add), ensuring multi-threaded appends with minimal contention (sketched below). Reads require only strand-local locks when traversing version chains.
- Read caching: A 1 GB LRU cache per strand (partitioned into 32 MB segments) supports rapid lookups; on a 32-strand system, the total cache size is 32 GB. For a uniform random workload, the cache-hit probability is modeled as $p_{\text{hit}} \approx \min(1, C/W)$, where $C$ is the total cache size and $W$ the working-set size.
- Efficient write buffering: Updates accumulate in 32 MB per-strand write buffers. Only when a buffer fills is a large append issued, optimizing for device-preferred access sizes and further suppressing write amplification.
Eviction strategies reclaim the least-recently-used records per segment, maintaining high hit rates for small to moderate working sets.
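The append lock and write buffer can be pictured as an atomic fetch-and-add that reserves a disjoint slice of the 32 MB per-strand buffer, with a full buffer flushed as one large append; the class below is an illustrative sketch that omits boundary handling and durable flushing:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Per-strand write buffer with fetch-and-add reservations. Simplified: a real
// engine must also handle records that straddle the buffer boundary and make
// the flush itself thread-safe; this sketch only shows the reservation idea.
class StrandBuffer {
public:
    static constexpr size_t kBufferSize = 32ull * 1024 * 1024;   // 32 MB per strand

    StrandBuffer() : buf_(kBufferSize), tail_(0) {}

    // Reserve `len` bytes with an atomic fetch-and-add; returns the offset,
    // or SIZE_MAX if the buffer is full and must be flushed first.
    size_t reserve(size_t len) {
        size_t off = tail_.fetch_add(len, std::memory_order_relaxed);
        if (off + len > kBufferSize) return SIZE_MAX;
        return off;
    }

    // Copy a record into its reserved slice (no lock needed: slices are disjoint).
    void write(size_t off, const void* data, size_t len) {
        std::memcpy(buf_.data() + off, data, len);
    }

    // Flush: in the real engine this issues one large (>=64 KB) append to the
    // strand and resets the tail. Here it only resets the reservation counter.
    void flush() { tail_.store(0, std::memory_order_relaxed); }

private:
    std::vector<uint8_t> buf_;
    std::atomic<size_t> tail_;
};
```

Because each thread owns a disjoint byte range after its reservation, record copies proceed without further locking, which is the low-contention property attributed to the append lock above.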
4. Performance Characteristics
Benchmarks on an AWS i8g.4xlarge server (16 Graviton4 vCPUs, 128 GB DRAM, 3.75 TB Nitro SSD) under a workload of 1 billion 100-byte KV objects showed:
| Threads | Inserts (M/s) | Seq. Lookup (M/s) | Rnd. Lookup (M/s) | Deletes (M/s) |
|---|---|---|---|---|
| 1 | 1.3 | 1.8 | 1.6 | 1.5 |
| 4 | 4.8 | 6.6 | 6.0 | 5.2 |
| 8 | 10.1 | 13.9 | 12.5 | 11.2 |
| 16 | 19.8 | 26.5 | 23.8 | 21.2 |
On 16 threads, FlashMap achieves its maximum reported throughput: 19.8 million inserts/second and 23.8 million random lookups/second.
Latency percentile results for a 20% update/80% lookup mix (100 B records):
| Threads | p95 (µs) | p99 (µs) | p99.9 (µs) |
|---|---|---|---|
| 1 | 0.65 | 2.82 | 8.07 |
| 4 | 0.94 | 3.10 | 11.7 |
| 8 | 1.37 | 2.97 | 16.9 |
| 16 | 2.56 | 6.91 | 31.7 |
FlashMap sustains full SSD bandwidth even for working sets that exceed the cache, confirming minimal engine overhead as payload size increases.
Direct comparison under identical workloads reveals that FlashMap outperforms Redis, RocksDB, and other LSM-style engines with approximately twice the insert throughput and three times the random-lookup throughput on a single node. These gains derive primarily from large, append-only I/O and optimized flash-specific caching (Guo et al., 11 Nov 2025).
5. Limitations and Future Directions
Known constraints include:
- Performance for very large keyspaces degrades as cache-hit rate drops, leading to higher read latencies and I/O amplification in cache-miss regimes.
- Wear-leveling effectiveness is ultimately contingent on the SSD’s internal controller; suboptimal GC patterns could exacerbate device wear.
- Transaction support is currently limited: FlashMap forks an entire in-memory child store per transaction, which may be unsuitable for long-lived or massive transactions.
Prospective enhancements focus on:
- Adaptive or variable-length strand sizing to optimize between GC overhead and SSD fragmentation.
- Multi-tier caching, such as NVMe-DIMM layers, to expand cacheable working sets.
- Inline compaction and support for range queries or secondary indices.
- Support for application-driven wear-leveling advisories beyond TRIM commands.
6. Context and Significance
FlashMap exemplifies an approach in KV store design that abandons traditional LSM-tree architectures in favor of log-structured, highly parallel write strategies attuned to flash SSD operational characteristics. Its architecture highlights the criticality of aligning data layout, caching, and concurrency controls with the physical realities of SSD bandwidth, write amplification, and garbage collection behaviors.
Its empirical performance results demonstrate that substantial gains can be realized by minimizing random writes, maximizing sequential throughput, and maintaining aggressive in-memory caching. As data center storage remains dominated by SSDs, such architectures suggest new frontiers for scaling KV stores to high-throughput, low-latency, and flash-efficient designs (Guo et al., 11 Nov 2025).