Tiered Merge Policies in Storage Systems
- Tiered merge policies are a class of algorithms that organize, compact, and migrate data across multiple storage tiers to optimize system throughput and latency.
- They employ formal models like Bigtable Merge Compaction and competitive online algorithms to balance merge costs and read latencies under dynamic workloads.
- Empirical benchmarks demonstrate significant improvements in read throughput and adaptive data placement in systems such as NoSQL databases, distributed file systems, and LSM-trees.
Tiered merge policies are algorithmic strategies implemented in storage systems, particularly NoSQL databases, distributed file systems, and LSM-tree architectures, to efficiently organize, compact, and move data across multiple storage tiers—each with distinct performance and cost characteristics. These policies play a fundamental role in maintaining system throughput, minimizing read/write latency, and controlling resource utilization by identifying which data subsets to compact or move and when, under dynamic workloads and with incomplete future knowledge.
1. Formal Models for Tiered Merge Policies
Tiered merge policies are rigorously modeled through abstractions such as the Bigtable Merge Compaction (BMC) framework (Mathieu et al., 2014). In this model, a stack structure represents file tiers; at each time step $t$, a new file of length $\ell_t$ is added to the top of the stack. Compaction consists of merging a contiguous segment of files at the top of the stack, incurring a merge cost equal to the sum of their lengths, whereas the read cost depends on the stack size post-merge via a monotonic function $f$.
The overall cost for a schedule is defined as $\mathrm{cost} = \sum_{t=1}^{n} \big( m_t + f(s_t) \big)$, with $m_t$ being the cost of the merge at time $t$ (zero if no merge occurs) and $s_t$ the stack size after the operation; a minimal simulation of this cost model appears after the list below.
Two specific forms for $f$ are pivotal:
- BMC: $f(s) = 0$ if $s \le k$, $\infty$ otherwise, enforcing a fixed limit $k$ on the number of tiers.
- Linear BMC: $f(s) = s$, penalizing linearly with the number of stack levels and balancing merge cost against read latency.
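As a concrete illustration, the following minimal Python sketch simulates a merge schedule under the reconstructed cost model above; the file lengths, the schedule encoding, and the function names are illustrative assumptions rather than the paper's notation.

```python
import math

def schedule_cost(lengths, merges, f):
    """Simulate a merge schedule in the stack model.

    lengths[t] -- length of the file arriving at step t
    merges[t]  -- number of files to merge off the top of the stack
                  after the arrival at step t (values < 2 mean no merge)
    f          -- read-cost function of the post-merge stack size
    """
    stack, total = [], 0.0
    for length, m in zip(lengths, merges):
        stack.append(length)              # new file lands on top of the stack
        if m >= 2:                        # merge a contiguous top segment
            merged = sum(stack[-m:])
            stack[-m:] = [merged]
            total += merged               # merge cost = sum of merged lengths
        total += f(len(stack))            # read cost via the monotone function f
    return total

def f_bmc(k):
    """BMC: free while the stack fits within k tiers, forbidden otherwise."""
    return lambda s: 0.0 if s <= k else math.inf

f_linear = lambda s: float(s)             # Linear BMC: read cost = stack size

lengths = [3, 1, 4, 1, 5, 9, 2, 6]
merges  = [0, 2, 0, 2, 0, 3, 0, 4]        # e.g. merge the top 2 files at step 2
print(schedule_cost(lengths, merges, f_bmc(3)))   # BMC cost with k = 3
print(schedule_cost(lengths, merges, f_linear))   # Linear BMC cost
```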
A critical combinatorial development is the bijection between merge schedules and binary search trees. In the tree representation the cost function becomes $\mathrm{cost}(T) = \sum_{t=1}^{n} \big( \ell_t \cdot L_t + f(R_t) \big)$, where $L_t$ and $R_t$ denote the number of left and right children along the search path to key $t$, respectively.
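A short sketch of evaluating this tree-form cost, assuming the reconstruction above; the tree encoding (a dict mapping each key to its pair of children) and the left/right counting convention are illustrative.

```python
def path_child_counts(tree, root, key):
    """Count the left and right children traversed on the search path to key."""
    left = right = 0
    node = root
    while node is not None and node != key:
        l, r = tree[node]
        if key < node:
            node, left = l, left + 1      # descended into a left child
        else:
            node, right = r, right + 1    # descended into a right child
    return left, right

def tree_cost(tree, root, lengths, f):
    total = 0.0
    for t, length in lengths.items():
        L, R = path_child_counts(tree, root, t)
        total += length * L + f(R)        # merge term plus read term per key
    return total

# Keys are arrival times 1..5; None marks a missing child.
tree = {4: (2, 5), 2: (1, 3), 1: (None, None), 3: (None, None), 5: (None, None)}
lengths = {1: 3.0, 2: 1.0, 3: 4.0, 4: 1.0, 5: 5.0}
print(tree_cost(tree, 4, lengths, f=lambda s: float(s)))
```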
2. Algorithmic Developments and Online Tiered Merging
Online algorithms for tiered merge policies must make decisions without future workload knowledge. For BMC with tier limit $k$, a balanced rent-or-buy algorithm is shown to be exactly $k$-competitive in the worst case: its cost will not exceed $k$ times the optimum for any input sequence. The algorithm recursively determines when a full merge is "affordable," applying the same logic to subproblems down to the base case.
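The following is a hedged sketch of the rent-or-buy balancing idea only, collapsed to full merges of the entire stack (the paper's algorithm applies the same test recursively to subproblems): read cost accrues as "rent" until it matches the "buy" price of a full merge.

```python
def rent_or_buy_online(lengths, f):
    """Online schedule: merge the whole stack once accrued read cost
    ("rent") reaches the cost of a full merge ("buy")."""
    stack, total, rent = [], 0.0, 0.0
    for length in lengths:
        stack.append(length)
        buy = sum(stack)                  # price of merging everything now
        if len(stack) >= 2 and rent >= buy:
            total += buy                  # buy: perform one full merge
            stack = [buy]
            rent = 0.0                    # reset the rent meter
        read = f(len(stack))
        total += read
        rent += read                      # rent: read cost keeps accruing
    return total

# Unit-length files under the Linear BMC read cost trigger periodic merges.
print(rent_or_buy_online([1.0] * 8, f=lambda s: float(s)))
```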
For Linear BMC, the algorithm is constructed to preserve the following invariant in the binary search tree mapping: the total length of files in each node's left subtree is at least that in its right subtree. This achieves constant competitiveness in "read-heavy" regimes ($\ell_t \le 1$ for all $t$), dynamically balancing small frequent merges against large infrequent compactions.
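A minimal checker for this subtree-length invariant, reusing the ad hoc tree encoding from the earlier sketch; the lengths are illustrative.

```python
def subtree_length(tree, lengths, node):
    """Total file length stored in the subtree rooted at node."""
    if node is None:
        return 0.0
    l, r = tree[node]
    return lengths[node] + subtree_length(tree, lengths, l) + subtree_length(tree, lengths, r)

def invariant_holds(tree, lengths, node):
    """Check that every left subtree is at least as heavy as its sibling."""
    if node is None:
        return True
    l, r = tree[node]
    return (subtree_length(tree, lengths, l) >= subtree_length(tree, lengths, r)
            and invariant_holds(tree, lengths, l)
            and invariant_holds(tree, lengths, r))

tree = {4: (2, 5), 2: (1, 3), 1: (None, None), 3: (None, None), 5: (None, None)}
lengths = {1: 3.0, 2: 1.0, 3: 2.0, 4: 1.0, 5: 5.0}
print(invariant_holds(tree, lengths, 4))   # True for these lengths
```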
Dynamic programming recurrences underpin optimal offline solutions. Via the tree bijection, the minimum cost of a subproblem over a key interval satisfies a recurrence of the form $\mathrm{OPT}(i,j) = \min_{i \le r \le j} \big[ \mathrm{OPT}(i, r-1) + \mathrm{OPT}(r+1, j) + w(i,j,r) \big]$, where $w(i,j,r)$ accounts for the merge and read costs of rooting the interval at $r$; the resulting algorithms run in time polynomial in $n$, depending on the variant.
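A sketch of such an interval dynamic program, with a placeholder split-cost function `w(i, j, r)`; the paper's exact cost terms and speedups are not reproduced here.

```python
from functools import lru_cache

def optimal_schedule_cost(lengths, w):
    """Interval DP: pick the root r of each key interval [i, j] and recurse."""
    n = len(lengths)

    @lru_cache(maxsize=None)
    def opt(i, j):
        if i > j:                         # empty interval costs nothing
            return 0.0
        return min(opt(i, r - 1) + opt(r + 1, j) + w(i, j, r)
                   for r in range(i, j + 1))

    return opt(0, n - 1)

lengths = [3.0, 1.0, 4.0, 1.0, 5.0]
# Placeholder split cost: total length of the interval (optimal-BST flavor).
w = lambda i, j, r: sum(lengths[i:j + 1])
print(optimal_schedule_cost(lengths, w))
```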
3. Performance Analysis: Worst-Case, Average-Case, and Benchmarks
Tiered merge policy algorithms are analyzed in both worst-case and stochastic regimes. For BMC, no deterministic online algorithm can do better than $k$-competitive, matching the upper bound above. Average-case results for i.i.d. workloads with bounded support reveal that both BMC and Linear BMC can be made asymptotically $1$-competitive; closed-form asymptotic expressions for the expected cost hold in both cases, with the leading constant for Linear BMC determined by a root-finding equation arising in the competitive analysis.
Empirical evaluations using log-normal and exponential workload distributions demonstrate that these tiered merge policies not only adhere to theoretical predictions, but can achieve order-of-magnitude improvements over traditional fixed-threshold algorithms, especially when the number of levels and workload size are large (Mathieu et al., 2014).
4. Tiering in Distributed Storage and Adaptive Data Placement
Modern distributed file systems and cluster computing platforms such as HDFS, Hadoop, and Spark have evolved to incorporate physical storage tiering (e.g., NVRAM, SSD, HDD) (Herodotou et al., 2019). Tiered merge policies here encompass not only compaction but also dynamic movement (“upgrade” or “downgrade”) of data across physical tiers.
Machine learning classifiers, such as XGBoost, are integrated to predict access patterns and dynamically identify “hot” files for migration to faster storage and “cold” files for relegation to slower tiers. The framework defined in (Herodotou et al., 2019) employs incremental learning, continuously updating model parameters to capture evolving workload characteristics, allowing for highly adaptive tiered placement. Automated decision points include when to start/stop data movement processes and which files to promote or demote, based on predicted likelihood of future access. Empirical evaluations on production-like traces have shown job completion time reductions of 18–27% and cluster efficiency gains, with accuracy rates exceeding 98% sustained even under workload shifts.
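As a rough illustration of this approach, a hot/cold file classifier might be trained as below; the features (recency, access count, size), the labels, and the promotion/demotion thresholds are assumptions for the sketch, not the framework's exact design.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

# Per-file features: [hours since last access, accesses in last day, size in MB]
X = rng.random((1000, 3)) * [48.0, 200.0, 1024.0]
# Synthetic label: "hot" if recently accessed and frequently read.
y = ((X[:, 0] < 2.0) & (X[:, 1] > 50)).astype(int)

model = XGBClassifier(n_estimators=50, max_depth=4, eval_metric="logloss")
model.fit(X, y)

# Promote files whose predicted probability of near-future access is high;
# demote those unlikely to be read soon. Thresholds are assumptions.
p_hot = model.predict_proba(X[:5])[:, 1]
promote = p_hot > 0.8
demote = p_hot < 0.1
print(p_hot, promote, demote)
```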
5. Fine-Grained Promotion and Hot Data Retention
In Log-Structured Merge-tree (LSM-tree) environments, the tiered merge policy challenge intensifies: upper levels reside on fast storage for write/read performance, while base levels are relegated to slower, cheaper storage (Qiu et al., 2024). The HotRAP system implements a fine-grained tiered merge policy by logging every record access in an on-disk LSM structure ("RALT") and computing exponentially smoothed hotness scores per key. Keys with high sustained scores are identified as "hot."
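A minimal sketch of exponentially smoothed per-key hotness scoring in the spirit of RALT; the decay rate, the per-access bump, and the "hot" threshold are illustrative assumptions.

```python
import math

class HotnessTracker:
    def __init__(self, half_life=1000.0, hot_threshold=1.0):
        self.decay = math.log(2) / half_life   # per-tick exponential decay rate
        self.hot_threshold = hot_threshold
        self.scores = {}                       # key -> (score, last update tick)

    def access(self, key, tick):
        score, last = self.scores.get(key, (0.0, tick))
        score *= math.exp(-self.decay * (tick - last))   # decay since last access
        self.scores[key] = (score + 1.0, tick)           # bump score on access

    def is_hot(self, key, tick):
        score, last = self.scores.get(key, (0.0, tick))
        return score * math.exp(-self.decay * (tick - last)) >= self.hot_threshold

t = HotnessTracker()
for tick in range(10):
    t.access("user:42", tick)      # repeated accesses accumulate hotness
t.access("user:7", 9)              # a single access stays below the threshold
print(t.is_hot("user:42", 10), t.is_hot("user:7", 10))
```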
Promotion and retention of hot records utilize three mechanisms:
- Retention during compaction: During merges between fast and slow tiers, dual iterators scan both the key range and the hotness logs, retaining only hot records in fast storage (see the sketch after this list).
- Promotion by compaction: During regular slow-to-fast compaction, hot records detected in key range are proactively moved to fast storage.
- Promotion by flush: An in-memory promotion cache temporarily captures accessed slow-tier records, which are batch-promoted based on hotness as the cache fills.
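A hedged sketch of the retention step referenced in the first item above: two sorted iterators, one over the records being compacted and one over the hot-key log, decide per record whether it stays in the fast tier. The record format and interfaces are illustrative, not HotRAP's.

```python
def compact_with_retention(records, hot_log):
    """records: sorted iterable of (key, value) pairs from the fast tier.
    hot_log: sorted iterable of keys currently classified as hot."""
    fast_out, slow_out = [], []
    hot_iter = iter(hot_log)
    hot = next(hot_iter, None)
    for key, value in records:
        while hot is not None and hot < key:   # advance the hotness iterator
            hot = next(hot_iter, None)
        if hot == key:
            fast_out.append((key, value))      # retain hot record in fast tier
        else:
            slow_out.append((key, value))      # demote cold record to slow tier
    return fast_out, slow_out

records = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
fast, slow = compact_with_retention(records, hot_log=["b", "d"])
print(fast, slow)   # hot records stay fast; cold ones move down
```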
Eviction from fast storage is administered via weighted sampling and thresholding against the total "hot set" size. Empirical results show that HotRAP achieves substantial improvements in read throughput over prior LSM-based tiered systems and reaches fast-tier hit rates of 95% under canonical skewed workload patterns.
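A toy sketch of that eviction rule, assuming size-weighted candidate sampling and eviction of the coldest sampled key until the hot set fits a fast-tier budget; all parameters are illustrative.

```python
import random

def evict_to_budget(sizes, scores, budget, sample_size=5):
    """sizes: key -> bytes in the fast tier; scores: key -> hotness score.
    Evict until the retained hot set fits within budget bytes."""
    total = sum(sizes.values())
    evicted = []
    while total > budget and sizes:
        keys = list(sizes)
        weights = [sizes[k] for k in keys]          # sample big records more often
        candidates = random.choices(keys, weights=weights,
                                    k=min(sample_size, len(keys)))
        victim = min(candidates, key=lambda k: scores[k])   # coldest sampled key
        total -= sizes.pop(victim)
        evicted.append(victim)
    return evicted

random.seed(0)
sizes  = {"a": 100, "b": 300, "c": 200, "d": 400}
scores = {"a": 5.0, "b": 0.5, "c": 3.0, "d": 2.0}
print(evict_to_budget(sizes, scores, budget=500))
```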
6. Broader Applications and Tiered Mechanism Generalizations
Tiered merge strategies extend naturally to transaction fee mechanisms in blockchains and other domains where differentiated service levels are economically or operationally beneficial (Kiayias et al., 2023). In blockchain transaction fee management, a tiered mechanism stratifies transactions into multiple queues, each with distinct delays and prices. Parameters are chosen to enforce a monotonic ordering of delay and price (e.g., delays $d_1 < d_2 < \cdots < d_m$ paired with prices $p_1 > p_2 > \cdots > p_m$), enabling inclusivity for low-urgency requests at lower prices and reserving immediate processing at premium rates. This framework achieves stable prices in expectation and supports price discrimination without sacrificing overall revenue.
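A toy model of the user-facing side of such a mechanism, with illustrative tiers obeying the monotone delay/price ordering: the cheapest tolerable tier is always the slowest one the user can accept.

```python
# Tiers ordered by increasing delay and decreasing price (illustrative values).
tiers = [
    {"name": "express",  "delay_blocks": 1,  "price": 9.0},
    {"name": "standard", "delay_blocks": 10, "price": 3.0},
    {"name": "economy",  "delay_blocks": 60, "price": 1.0},
]

def choose_tier(max_acceptable_delay):
    """Pick the cheapest tier whose delay the user can tolerate."""
    eligible = [t for t in tiers if t["delay_blocks"] <= max_acceptable_delay]
    return eligible[-1] if eligible else None   # last eligible = cheapest

print(choose_tier(15))   # a 15-block budget selects the "standard" tier
```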
This suggests that tiered merge policies are an instance of a broader class of resource management mechanisms, applicable whenever urgent, high-value or frequently accessed items must compete with bulk or background data for premium resources.
7. Design Principles, Trade-offs, and Future Directions
The theoretical and empirical analysis of tiered merge policies reveals fundamental trade-offs:
- Aggressive merging minimizes read cost but increases write amplification and CPU burden.
- Deferred merging reduces merge cost but raises read latency, especially as the number of tiers increases.
Rigorous frameworks such as BMC allow system designers to quantitatively calibrate these tradeoffs via cost functions and competitive analysis, moving beyond heuristic or empirically tuned strategies. Adaptive policies—those which do not assume workload stationarity—are advantageous for systems with dynamic access distributions and evolving requirements.
A plausible implication is that future work will continue to integrate predictive analytics, finer-grained tracking (at record rather than file/table level), and combinatorial optimization techniques to maximize the efficiency and flexibility of tiered merge policies in even more complex heterogeneous storage infrastructures.
In summary, tiered merge policies encompass a collection of rigorously designed, adaptively managed algorithms and mechanisms that orchestrate the organization, compaction, and migration of data across multi-level, multi-tier storage architectures. They combine combinatorial analysis, online algorithms, competitive analysis, empirical benchmarking, and—more recently—machine learning to achieve robust trade-offs between merge and read costs, adaptability to workload shifts, and efficient utilization of heterogeneous resources. Their continued evolution reflects both the complexity and centrality of data management in modern, large-scale systems.