
Tidehunter: Large-Value Storage With Minimal Data Relocation

Published 2 Feb 2026 in cs.DB | (2602.01873v2)

Abstract: Log-Structured Merge-Trees (LSM-trees) dominate persistent key-value storage but suffer from high write amplification, from 10x to 30x under random workloads, due to repeated compaction. This overhead becomes prohibitive for large values with uniformly distributed keys, a workload common in content-addressable storage, deduplication systems, and blockchain validators. We present Tidehunter, a storage engine that eliminates value compaction by treating the Write-Ahead Log (WAL) as permanent storage rather than a temporary recovery buffer. Values are never overwritten, and small, lazily-flushed index tables map keys to WAL positions. Tidehunter introduces (a) lock-free writes that saturate NVMe drives through atomic allocation and parallel copying, (b) an optimistic index structure that exploits uniform key distributions for single-roundtrip lookups, and (c) epoch-based pruning that reclaims space without blocking writes. On a 1 TB dataset with 1 KB values, Tidehunter achieves 830K writes per second, which is 8.4x higher than RocksDB and 2.9x higher than BlobDB, while improving point queries by 1.7x and existence checks by 15.6x. We validate real-world impact by integrating Tidehunter into Sui, a high-throughput blockchain, where it maintains stable throughput and latency under loads that cause RocksDB-backed validators to collapse. Tidehunter is production-ready and is being deployed in production within Sui.

Summary

  • The paper introduces a novel WAL-centric architecture that minimizes write amplification by treating the WAL as the primary store for large-value, hash-keyed data.
  • Experimental results demonstrate up to 8.4× faster writes and 15.6× improved existence queries compared to traditional LSM-tree systems.
  • Real-world deployment in blockchain validators confirms reduced I/O, enhanced latency predictability, and efficient garbage collection under heavy workloads.

Tidehunter: Redefining Large-Value Key-Value Storage with WAL-Centric Design

Overview and Motivation

The dominance of Log-Structured Merge-Trees (LSM-trees) in key-value storage is rooted in their ability to mitigate random write bottlenecks in HDD-centric architectures. However, their write amplification on large-value, hash-keyed, write-heavy workloads has become a critical limitation, especially with the transition to SSD/NVMe and the rise of content-addressed applications. Write amplification from multi-level compaction is substantial (10–30× under random workloads), constraining throughput, SSD longevity, and latency predictability. Large values with uniformly distributed keys are now common in content-addressable storage, deduplication, and blockchains, and LSM-trees are a poor fit for them.

Tidehunter introduces a storage paradigm that treats the Write-Ahead Log (WAL) as the permanent value store, removing compaction for values and moving indexing into compact, lazily persisted tables. The architecture is tailored specifically to workloads with large values and uniformly distributed keys. Its advantages are most apparent in blockchain validator deployments, which combine high throughput, aggressive pruning, and latency-sensitive queries. In such contexts, existing solutions like RocksDB (pure LSM-tree) and BlobDB (key-value separation atop an LSM-tree) are hindered by background I/O, compaction stalls, and non-deterministic latencies.

Architecture and Design Principles

Tidehunter's design follows three principles: (1) direct, immutable write placement; (2) index separation and laziness; (3) non-blocking space reclamation. These principles are realized in four components:

  • Append-Only WAL as Primary Store: Values are appended to a WAL segment and never overwritten, with each write assigned an atomic offset.
  • Separable, Sharded Index Tables: Keys are indexed in memory and batched for periodic, lazy flushing to disk. This index is organized into independent cells or shards, maximizing concurrency and reducing memory footprint by enabling partial residency.
  • Optimistic Indexing for Uniform Keys: For uniform (e.g., hash-based) keys, Tidehunter predicts where an entry sits in the index, enabling single-roundtrip lookups by exploiting the statistical properties of the key distribution. The on-disk format is a flat, sorted array, so an entry's approximate position can be estimated directly from the key.
  • Epoch-Based Pruning and Non-Blocking Relocation: Data expiration aligns with application epochs (as in blockchains), permitting space reclamation by dropping expired WAL segments. Background relocation migrates live records forward as necessary for space, utilizing an atomic compare-and-set to guarantee correctness amidst concurrent updates.
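The write path described in the bullets above can be sketched as follows. This is a minimal in-memory illustration, not Tidehunter's implementation: the class `WalStore` and its methods are hypothetical names, a `bytearray` stands in for a WAL segment, and a short lock stands in for the atomic offset allocation (Python has no lock-free fetch-and-add). The point it shows is that only the offset reservation is serialized, while the copy into the reserved region and the index update happen outside it.

```python
import threading


class WalStore:
    """Toy append-only store: values live in the log, the index maps key -> (offset, length)."""

    def __init__(self, capacity):
        self._log = bytearray(capacity)      # stands in for a WAL segment on disk
        self._next = 0                       # next free byte in the log
        self._alloc_lock = threading.Lock()  # assumption: stand-in for an atomic fetch-and-add
        self._index = {}                     # key -> (offset, length)

    def _reserve(self, size):
        # Only the offset bump is serialized; the copy in put() runs outside the lock.
        with self._alloc_lock:
            offset = self._next
            if offset + size > len(self._log):
                raise MemoryError("log segment full")
            self._next = offset + size
        return offset

    def put(self, key, value):
        offset = self._reserve(len(value))
        # Each writer copies into its own disjoint region, so copies can overlap in time.
        self._log[offset:offset + len(value)] = value
        # Publish the index entry only after the value bytes are in place.
        self._index[key] = (offset, len(value))
        return offset

    def get(self, key):
        entry = self._index.get(key)
        if entry is None:
            return None
        offset, length = entry
        return bytes(self._log[offset:offset + length])

    def exists(self, key):
        # Answered from the index alone; the log is never read.
        return key in self._index
```

Because `exists` never touches the log, this layout also illustrates why an existence check can be much cheaper than a point query that has to fetch the value.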

Crash recovery relies on periodic, lightweight state snapshots (tracking WAL/index positions but not the bulk data), making both failure recovery and cold starts efficient. Index management and memory use are finely tunable, with index flushes and cell residency governed by application-driven thresholds.
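One plausible reading of this recovery scheme is sketched below. It assumes that the snapshot records the WAL position up to which the index has been flushed, and that recovery replays only the records after that position. The summary does not spell out the replay step, so treat it as an assumption. The function name `recover_index` and the list-based WAL are illustrative only.

```python
def recover_index(wal, flushed_index, replay_from):
    """Rebuild the in-memory index after a crash.

    wal           -- list of (key, value) records; a record's position is its list offset
    flushed_index -- key -> WAL position, as last persisted to disk
    replay_from   -- first WAL position not covered by flushed_index (from the snapshot)
    """
    index = dict(flushed_index)
    # Only the unflushed tail of the log is scanned, not the bulk data before it.
    for position in range(replay_from, len(wal)):
        key, _value = wal[position]
        index[key] = position  # a later record for the same key wins
    return index


# Four records were appended; the index was flushed after the first two.
wal = [(b"a", b"1"), (b"b", b"2"), (b"a", b"3"), (b"c", b"4")]
recovered = recover_index(wal, flushed_index={b"a": 0, b"b": 1}, replay_from=2)
```

Under this assumption, recovery cost depends on how much of the log has been written since the last index flush rather than on total data size, which is consistent with the claim that cold starts are efficient.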

Experimental Analysis

Large-Value Workloads

On a 1 TB dataset with 1 KB values, Tidehunter achieves 830K writes/sec, which is 8.4× higher than RocksDB and 2.9× higher than BlobDB. For point queries, Tidehunter outpaces RocksDB by 1.7×. For "exists" queries it delivers a 15.6× improvement, because existence can be determined from the index alone without fetching the value. The underlying reason for the write gains is near-1× write amplification: each value is written once, whereas an LSM-tree rewrites it repeatedly during compaction.
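To see what the write-amplification factor means in device bandwidth, here is a back-of-the-envelope calculation. It assumes 1 KB = 1024 bytes, ignores index and metadata writes, and asks hypothetically what an LSM-tree would have to write to the device to sustain the same ingest rate at the 10–30× amplification quoted in the abstract. The helper `device_write_rate` is an illustrative name, and the LSM figures are not measurements from the paper.

```python
def device_write_rate(writes_per_sec, value_bytes, write_amplification):
    """Bytes per second reaching the device for a given application write rate."""
    return writes_per_sec * value_bytes * write_amplification


WRITES_PER_SEC = 830_000  # reported Tidehunter write rate
VALUE_BYTES = 1024        # assumption: 1 KB taken as 1024 bytes

single_write = device_write_rate(WRITES_PER_SEC, VALUE_BYTES, 1.0)
lsm_low = device_write_rate(WRITES_PER_SEC, VALUE_BYTES, 10.0)
lsm_high = device_write_rate(WRITES_PER_SEC, VALUE_BYTES, 30.0)

print(f"1x amplification:  {single_write / 1e9:.2f} GB/s")
print(f"10x amplification: {lsm_low / 1e9:.2f} GB/s")
print(f"30x amplification: {lsm_high / 1e9:.2f} GB/s")
```

At 1× the device sees roughly the application's own ingest rate (about 0.85 GB/s here), while the same ingest at 10–30× would require about 8.5 to 25.5 GB/s of device writes.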

Small-Value and Mixed Workloads

For 64 B–128 B values with uniform (non-skewed) access, RocksDB and BlobDB outperform Tidehunter, as index overhead dominates and LSM metadata amortization is efficient. The crossover occurs near 128 B. For highly skewed workloads (e.g., hot keys), Tidehunter's large table and value append-locality confer a cache advantage, enabling it to match or outperform RocksDB even on small values.

Garbage Collection and Storage Reclamation

Aggressive deletion workloads showcase the efficiency of non-blocking relocation. Relocation reduces the storage used for uniform workloads by 71% with only a 3–4% reduction in throughput, confirming that background GC is both effective and minimally intrusive.
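The compare-and-set step that keeps relocation safe under concurrent updates can be sketched as follows. This is a simplified model: the `Index` class, the `relocate` function, and the list-based log are hypothetical, and a lock emulates the atomic compare-and-set. The relocator copies a live record forward and repoints the index only if the entry still refers to the old position. If a writer updated the key in the meantime, the swap fails and the relocated copy is simply left as garbage.

```python
import threading


class Index:
    """Key -> log position, with an emulated atomic compare-and-set."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()  # assumption: stand-in for a hardware compare-and-set

    def set(self, key, position):
        with self._lock:
            self._entries[key] = position

    def get(self, key):
        return self._entries.get(key)

    def compare_and_set(self, key, expected, new):
        with self._lock:
            if self._entries.get(key) != expected:
                return False  # a concurrent write moved the key; keep its newer position
            self._entries[key] = new
            return True


def relocate(log, index, key):
    """Copy the live record for `key` to the log tail, then repoint the index."""
    old = index.get(key)
    if old is None:
        return False
    log.append(log[old])  # copy forward; the old segment can be dropped later
    return index.compare_and_set(key, old, len(log) - 1)


log = [(b"k", b"v1")]
index = Index()
index.set(b"k", 0)
moved = relocate(log, index, b"k")  # no concurrent writer, so the swap succeeds
```

Writers never wait on the relocator in this scheme: a fresh write always wins, and the only cost of losing the race is one wasted copy.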

Indexing Strategy

Tidehunter's optimistic index supports rapid lookup via probabilistic position estimation for uniformly distributed keys. In microbenchmarks, the optimistic index outpaces static header-based indices by 24% at high concurrency, especially with direct I/O. The single-roundtrip property is attained by configuring the index search window size to match hardware and workload characteristics.
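A rough sketch of position estimation over a flat, sorted array is given below. It assumes keys are hash outputs, so their leading 8 bytes are uniformly distributed. The window is counted in entries rather than bytes, and the widening fallback is an illustrative choice rather than the paper's documented behavior. The names `estimate_slot` and `lookup` are hypothetical. Each loop iteration models one read of a contiguous window from the index file.

```python
import bisect
import hashlib


def estimate_slot(key, n):
    """Predict the array slot of `key`, assuming its first 8 bytes are uniform."""
    prefix = int.from_bytes(key[:8], "big")
    return (prefix * n) >> 64  # scales the prefix from [0, 2**64) into [0, n)


def lookup(sorted_keys, positions, key, window):
    """Return (position or None, number of window reads)."""
    n = len(sorted_keys)
    if n == 0:
        return None, 0
    guess = estimate_slot(key, n)
    reads = 0
    while True:
        lo, hi = max(0, guess - window), min(n, guess + window + 1)
        reads += 1  # one contiguous read of entries [lo, hi)
        i = bisect.bisect_left(sorted_keys, key, lo, hi)
        if i < hi and sorted_keys[i] == key:
            return positions[i], reads
        # The key is absent if it would sort strictly inside the window
        # (or the window already touches the relevant end of the array).
        if (i > lo or lo == 0) and (i < hi or hi == n):
            return None, reads
        window *= 2  # rare miss: the estimate was off, so widen and retry


# Build a toy index over 10,000 hash keys; each key's WAL position is its insertion order.
entries = sorted((hashlib.sha256(str(i).encode()).digest(), i) for i in range(10_000))
sorted_keys = [k for k, _ in entries]
positions = [p for _, p in entries]

position, reads = lookup(sorted_keys, positions, hashlib.sha256(b"42").digest(), window=512)
```

With uniform keys the true slot stays close to the estimate, so a window of a few hundred entries is usually enough for a lookup to finish in a single read. Skewed keys would break that assumption and trigger the widening path.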

Real-World Deployment: Sui Blockchain

Within the Sui blockchain validator stack, migrating from RocksDB to Tidehunter resulted in stable throughput and latency even under loads that previously induced I/O collapse in RocksDB-backed nodes. Disk utilization dropped from multiple GB/s to under 100 MB/s for equivalent workloads, confirming the benchmark gains under production-scale, adversarial conditions.

Implications and Future Directions

Tidehunter's design marks a significant shift in large-value storage architecture, particularly by:

  • Eliminating the compaction-write-amplification tax for workloads featuring immutable, hash-keyed records. This directly extends SSD/NVMe device longevity and enables much higher sustainable throughput.
  • Redefining index management for uniform-key domains, which are prevalent in content-addressed stores, distributed ledgers, and advanced ML feature embedding stores.
  • Opening the door to more predictable latency and resource utilization, as background operations never stall writes and read paths are strictly bounded.

Practically, this architecture is well suited to hash-keyed, large-value stores, which include content-addressable databases, deduplication and backup systems, and modern blockchains. Theoretically, it exposes a dichotomy: for truly uniform, large-value workloads, separating indexing and version management from value movement is the better design, whereas compaction-based designs remain efficient for small values and locality-preserving keys.

Going forward, further research could explore dynamic workload adaptation—hybridizing compacting and non-compacting paths within the same store to better address mixed-data regimes. There is also scope for integrating more sophisticated learned indexes or auxiliary indices to extend Tidehunter's reach into secondary (e.g., range) queries, while retaining low write amplification.

Conclusion

Tidehunter introduces a WAL-centric storage paradigm suited to large-value, write-heavy, uniform-key workloads. Decoupling index persistence from value management, together with an optimistic indexing strategy, yields up to 8.4× higher write throughput than LSM-tree designs and 2.9× higher than key-value-separated designs on such workloads. Its production deployment in Sui demonstrates its maturity and impact, establishing a compelling architectural direction for content-addressed and blockchain-scale persistent key-value storage.


Reference: "Tidehunter: Large-Value Storage With Minimal Data Relocation" (2602.01873)
