Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads (2006.13282v1)

Published 23 Jun 2020 in cs.DB and cs.LG

Abstract: Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes.

Summary

The paper introduces Tsunami, a self-optimizing learned index that enhances query performance on correlated data and skewed workloads.
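It uses two new structures, the Grid Tree and the Augmented Grid, to run queries up to 6x faster than existing learned multi-dimensional indexes and up to 11x faster than optimally-tuned traditional indexes, with index sizes up to 8x and 170x smaller, respectively.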
Evaluation on real-world datasets shows Tsunami outperforms both traditional and existing learned indexes, offering practical benefits for in-memory analytics.

The paper introduces Tsunami, a novel in-memory, read-optimized learned multi-dimensional index designed to enhance query performance on correlated data and skewed workloads. This addresses the limitations of existing learned indexes like Flood, which struggle with such real-world scenarios. Tsunami demonstrates performance gains of up to 6x faster query execution and up to 8x smaller index sizes compared to existing learned multi-dimensional indexes. It also achieves up to 11x faster query performance and 170x smaller index size than optimally-tuned traditional indexes.

The key contributions of Tsunami are:

The design and implementation of a self-optimizing, in-memory, read-optimized learned multi-dimensional index robust to correlated datasets and skewed workloads.
The introduction of two data structures: the Grid Tree and the Augmented Grid. These are coupled with new optimization procedures that allow Tsunami to adapt its index structure and data organization strategy to handle data correlation and query skew.
An evaluation against Flood and several traditional non-learned indexes on a variety of workloads over real datasets, demonstrating Tsunami's performance and adaptability across various conditions.

Tsunami builds upon the concept of clustered in-memory indexing over column stores, aligning with trends favoring incremental merges over in-place updates. It is envisioned as a potential building block for in-memory key-value stores or integration into commercial in-memory analytics accelerators like Oracle’s Database In-Memory (DBIM).

The two limitations of Flood that Tsunami addresses are:

Flood's grid cannot efficiently adapt to skewed query workloads in which query frequencies and filter selectivities vary across the data space.
If dimensions are correlated, then Flood cannot maintain uniformly sized grid cells, which degrades performance and memory usage.

Tsunami's Architecture

Tsunami's architecture comprises two main components:

Grid Tree: A lightweight decision tree that partitions the data space into non-overlapping regions to reduce query skew.
Augmented Grid: An index structure applied within each region of the Grid Tree that uses functional mappings and conditional Cumulative Distribution Functions (CDFs) to efficiently capture correlations.

During query processing, the Grid Tree is traversed to identify relevant regions, after which the corresponding Augmented Grids are queried. An offline optimization procedure tailors both the Grid Tree and Augmented Grid structures to the dataset and workload characteristics.
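The traversal step can be illustrated with a minimal sketch. It assumes a hypothetical `GridTreeNode` layout in which internal nodes hold a split dimension and sorted split values, and leaves hold a region id standing in for that region's Augmented Grid. None of these names come from the paper, and half-open query ranges are an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = List[Tuple[float, float]]  # one half-open [lo, hi) interval per dimension


@dataclass
class GridTreeNode:
    # Internal node: split_dim and split_values are set, and children has
    # len(split_values) + 1 entries. Leaf node: only region_id is set.
    split_dim: Optional[int] = None
    split_values: List[float] = field(default_factory=list)
    children: List["GridTreeNode"] = field(default_factory=list)
    region_id: Optional[int] = None


def find_regions(node: GridTreeNode, query: Box) -> List[int]:
    """Return the ids of all leaf regions that the query box intersects."""
    if node.region_id is not None:
        return [node.region_id]
    lo, hi = query[node.split_dim]
    bounds = [float("-inf")] + node.split_values + [float("inf")]
    hits: List[int] = []
    for child, (c_lo, c_hi) in zip(node.children, zip(bounds, bounds[1:])):
        # Two half-open intervals overlap iff each starts before the other ends.
        if lo < c_hi and c_lo < hi:
            hits.extend(find_regions(child, query))
    return hits


# Hypothetical tree: dimension 0 is split at 10 and 20; the middle slab is
# split again on dimension 1 at 5.
tree = GridTreeNode(split_dim=0, split_values=[10.0, 20.0], children=[
    GridTreeNode(region_id=0),
    GridTreeNode(split_dim=1, split_values=[5.0], children=[
        GridTreeNode(region_id=1),
        GridTreeNode(region_id=2),
    ]),
    GridTreeNode(region_id=3),
])

narrow = find_regions(tree, [(12.0, 18.0), (0.0, 3.0)])
wide = find_regions(tree, [(5.0, 25.0), (6.0, 7.0)])
```

In a full index, each returned region id would then be used to query the Augmented Grid stored for that region.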

Grid Tree Details

The Grid Tree mitigates performance challenges due to skewed workloads, where query characteristics like frequency or selectivity vary across the data space. The skew of a set of queries $Q$ with respect to a range $[a,b)$ in dimension $i$ is quantified using the Earth Mover's Distance (EMD) between a uniform distribution $Uni_i(a,b)$ and the empirical probability density function (PDF) of queries in $Q$ over $[a,b)$: $Skew_i(Q,a,b) = EMD(Uni_i(a,b), PDF_i(Q,a,b))$. This PDF is approximated using a histogram $Hist_i(Q,a,b,n)$.
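As a rough illustration of this skew measure, the sketch below computes a one-dimensional EMD between a query histogram and a uniform histogram of equal mass. It makes two simplifying assumptions that are not from the source: each query is reduced to a single representative value (for example, the midpoint of its filter range), and the distance is reported in units of bin widths.

```python
from typing import List, Sequence


def histogram(values: Sequence[float], a: float, b: float, n: int) -> List[float]:
    """Count how many values fall into each of n equal-width bins over [a, b)."""
    counts = [0.0] * n
    width = (b - a) / n
    for v in values:
        if a <= v < b:
            counts[min(int((v - a) / width), n - 1)] += 1.0
    return counts


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """1-D Earth Mover's Distance between two equal-mass histograms (in bins)."""
    carried, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi   # surplus mass that still has to cross this bin edge
        total += abs(carried)
    return total


def skew(values: Sequence[float], a: float, b: float, n: int = 8) -> float:
    """EMD between the query histogram and a uniform histogram of equal mass."""
    hist = histogram(values, a, b, n)
    uniform = [sum(hist) / n] * n
    return emd_1d(uniform, hist)


evenly_spread = skew([0.5, 1.5, 2.5, 3.5], 0.0, 4.0, n=4)
concentrated = skew([0.5, 0.5, 0.5, 0.5], 0.0, 4.0, n=4)
```

Queries spread evenly over the range give zero skew, while queries piled into one bin give a large value, which is the signal the Grid Tree uses to decide where a split is worthwhile.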

The Grid Tree is a space-partitioning decision tree, where each internal node splits space based on values in a specific dimension. The split dimension and values are chosen to minimize query skew. The reduction in query skew, $R_i(Q,0,X_d,V)$, for a dimension $i$ and split values $V = \{v_1, \ldots, v_k\}$, is calculated by comparing the skew before and after the split. A "skew tree," a balanced binary tree, is used to find the optimal split values $V$ that maximize $R_{d_s}$ for each candidate split dimension $d_s \in [0, d)$.
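The sketch below illustrates split selection in a deliberately simplified form. It assumes the reduction is the skew of the whole range minus the summed skew of the resulting pieces, and it replaces the skew tree with an exhaustive search over a single cut at histogram bin boundaries. Both choices, and all function names, are assumptions for illustration rather than the paper's procedure.

```python
from typing import List, Sequence, Tuple


def skew(hist: Sequence[float]) -> float:
    """EMD (in bin widths) between hist and a uniform histogram of equal mass."""
    if not hist:
        return 0.0
    target = sum(hist) / len(hist)
    carried, total = 0.0, 0.0
    for h in hist:
        carried += h - target
        total += abs(carried)
    return total


def skew_reduction(hist: Sequence[float], cuts: Sequence[int]) -> float:
    """Skew of the whole range minus the summed skew of the pieces after cutting."""
    edges = [0, *sorted(cuts), len(hist)]
    after = sum(skew(hist[lo:hi]) for lo, hi in zip(edges, edges[1:]))
    return skew(hist) - after


def best_single_split(hists_per_dim: Sequence[Sequence[float]]) -> Tuple[int, int, float]:
    """Return (dimension, cut bin index, reduction) with the largest reduction.

    Returns (-1, -1, 0.0) when no cut reduces skew in any dimension.
    """
    best = (-1, -1, 0.0)
    for dim, hist in enumerate(hists_per_dim):
        for cut in range(1, len(hist)):
            r = skew_reduction(hist, [cut])
            if r > best[2]:
                best = (dim, cut, r)
    return best


# Dimension 0: all query mass sits in the first half of the range.
# Dimension 1: query mass is already uniform, so no split helps.
choice = best_single_split([[4.0, 4.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]])
```

Here the best cut separates the busy half of dimension 0 from the idle half, after which both pieces are uniform on their own.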

Augmented Grid Details

The Augmented Grid addresses challenges posed by data correlations, where dimensions are not independent, i.e., $CDF(X) \ne CDF(X|Y)$. It augments the basic grid structure by letting each dimension $X$ use one of three partitioning strategies:

Independent partitioning of dimension $X$ uniformly in $CDF(X)$.
Removal of dimension $X$ from the grid by transforming query filters using a functional mapping $F:X\rightarrow Y$.
Partitioning dimension $X$ dependently on another dimension $Y$ uniformly in $CDF(X|Y)$.
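To see why the dependent strategy helps, the following sketch compares independent partitioning with conditional partitioning on perfectly correlated toy data. It uses exact empirical quantiles rather than stored histograms, and the helper names and data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y)


def quantile_bounds(values: Sequence[float], parts: int) -> List[float]:
    """Lower boundaries of partitions 1..parts-1 for an equal-count split."""
    s = sorted(values)
    if not s:
        return []
    return [s[(k * len(s)) // parts] for k in range(1, parts)]


def bucket(value: float, bounds: Sequence[float]) -> int:
    """Index of the partition that value falls in."""
    return sum(value >= b for b in bounds)


def independent_cells(points: Sequence[Point], p_x: int, p_y: int) -> Counter:
    """Partition X by CDF(X) and Y by CDF(Y), ignoring any correlation."""
    x_bounds = quantile_bounds([x for x, _ in points], p_x)
    y_bounds = quantile_bounds([y for _, y in points], p_y)
    return Counter((bucket(y, y_bounds), bucket(x, x_bounds)) for x, y in points)


def conditional_cells(points: Sequence[Point], p_x: int, p_y: int) -> Counter:
    """Partition Y by CDF(Y), then X by the empirical CDF(X | Y partition)."""
    y_bounds = quantile_bounds([y for _, y in points], p_y)
    xs_by_ypart: List[List[float]] = [[] for _ in range(p_y)]
    for x, y in points:
        xs_by_ypart[bucket(y, y_bounds)].append(x)
    # One set of X boundaries per Y partition.
    x_bounds = [quantile_bounds(xs, p_x) for xs in xs_by_ypart]
    cells: Counter = Counter()
    for x, y in points:
        yp = bucket(y, y_bounds)
        cells[(yp, bucket(x, x_bounds[yp]))] += 1
    return cells


# Toy data in which y is exactly 2 * x.
points = [(float(i), 2.0 * i) for i in range(16)]
independent = independent_cells(points, p_x=2, p_y=4)
conditional = conditional_cells(points, p_x=2, p_y=4)
```

With the independent grid, only four of the eight cells hold any points, four each. With conditional partitioning, all eight cells hold two points, so no cell is empty or oversized.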

For monotonically correlated dimensions $X$ and $Y$, functional mappings are implemented using linear regression ($LR$) with lower and upper error bounds ($e_l$, $e_u$). Given a range $[Y_{min},Y_{max}]$, the mapping function produces $X_{min} = LR(Y_{min})-e_l$ and $X_{max} = LR(Y_{max})+e_u$, so that every point whose $Y$ value lies in the range has its $X$ value in $[X_{min},X_{max}]$. For generic correlations, conditional CDFs partition $X$ uniformly in $CDF(X)$ and $Y$ uniformly in $CDF(Y|X)$, resulting in equally-sized cells. The $CDF(Y|X)$ is implemented by storing $p_X$ histograms over $Y$, where $p_X$ is the number of partitions in $X$.
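A functional mapping of this kind can be sketched with ordinary least squares plus residual-based error bounds. The version below assumes the regression is evaluated at both endpoints of the $Y$ range and that the bounds are the most negative and most positive residuals over the data. The function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Model = Tuple[float, float, float, float]  # (slope, intercept, e_l, e_u)


def fit_mapping(ys: Sequence[float], xs: Sequence[float]) -> Model:
    """Fit x ~ slope * y + intercept and record the worst residual on each side."""
    n = len(ys)
    mean_y, mean_x = sum(ys) / n, sum(xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys)
    slope = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs)) / var_y
    intercept = mean_x - slope * mean_y
    residuals = [x - (slope * y + intercept) for y, x in zip(ys, xs)]
    e_l = max(0.0, -min(residuals))  # how far points fall below the line
    e_u = max(0.0, max(residuals))   # how far points rise above the line
    return slope, intercept, e_l, e_u


def map_range(model: Model, y_min: float, y_max: float) -> Tuple[float, float]:
    """Map a filter over Y to a range over X that covers every matching point."""
    slope, intercept, e_l, e_u = model
    lo, hi = slope * y_min + intercept, slope * y_max + intercept
    if slope < 0:  # a decreasing fit swaps which endpoint is the lower one
        lo, hi = hi, lo
    return lo - e_l, hi + e_u


# Toy data: x is roughly 2 * y with a small alternating offset.
ys: List[float] = [0.0, 1.0, 2.0, 3.0, 4.0]
xs: List[float] = [0.0, 3.0, 4.0, 7.0, 8.0]
model = fit_mapping(ys, xs)
x_min, x_max = map_range(model, 1.0, 3.0)
```

Because the error bounds come from the worst residuals, any point with $Y$ inside the filter range is guaranteed (up to floating-point rounding) to have its $X$ value inside the mapped range. A tighter fit gives a narrower range and therefore less overscan.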

The optimization of the Augmented Grid involves finding the best skeleton $S$ (instantiation of partitioning strategies) and number of partitions $P$ in each dimension to minimize average query time. This is achieved using a cost model and Adaptive Gradient Descent (AGD). The cost model predicts query runtime based on the number of cell ranges scanned and the number of points scanned: $\text{Time} = w_0 (\text{\# cell ranges}) + w_1 (\text{\# scanned points})(\text{\# filtered dims})$. AGD iteratively optimizes $S$ and $P$ by taking gradient descent steps over $P$ and performing local searches over skeletons.
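The cost model itself is simple enough to sketch directly. The example below evaluates the formula per query and picks the candidate layout with the lowest average predicted time. The weights, the per-query statistics, and the layout names are made-up illustrative values, and choosing among two fixed candidates is only a stand-in for the AGD search, not an implementation of it.

```python
from typing import Dict, List, Tuple

QueryStats = Tuple[int, int, int]  # (# cell ranges, # scanned points, # filtered dims)


def predicted_time(stats: QueryStats, w0: float, w1: float) -> float:
    """Time = w0 * (# cell ranges) + w1 * (# scanned points) * (# filtered dims)."""
    cell_ranges, scanned_points, filtered_dims = stats
    return w0 * cell_ranges + w1 * scanned_points * filtered_dims


def average_time(workload: List[QueryStats], w0: float, w1: float) -> float:
    """Average predicted time over all queries in the workload."""
    return sum(predicted_time(s, w0, w1) for s in workload) / len(workload)


def pick_layout(candidates: Dict[str, List[QueryStats]], w0: float, w1: float) -> str:
    """Return the name of the candidate layout with the lowest average cost."""
    return min(candidates, key=lambda name: average_time(candidates[name], w0, w1))


# Hypothetical statistics for the same two queries under two layouts:
# a coarse grid touches few cell ranges but overscans many points,
# a fine grid touches many cell ranges but scans few extra points.
candidates = {
    "coarse": [(2, 10_000, 2), (1, 8_000, 1)],
    "fine": [(40, 1_500, 2), (25, 900, 1)],
}

cheap_lookups = pick_layout(candidates, w0=50.0, w1=1.0)
costly_lookups = pick_layout(candidates, w0=1000.0, w1=1.0)
```

The two calls show the trade-off the optimizer navigates: when per-range overhead ($w_0$) is small, finer grids win because they scan fewer points, and when it is large, coarser grids win.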

Evaluation Results

The evaluation compared Tsunami against Flood, K-d tree, Hyperoctree, Z-Order index, and a Clustered Single-Dimensional Index. The experiments used a mix of real-world and synthetic datasets, including TPC-H, Taxi, Perfmon, and Stocks. Tsunami consistently outperformed the other indexes. The evaluation showed that:

Tsunami achieves up to 6x higher query throughput than Flood and up to 11x higher query throughput than the fastest optimally-tuned non-learned index.
Tsunami uses up to 8x less memory than Flood and 7-170x less memory than the fastest tuned non-learned index.
Tsunami can adapt to changes in the query workload quickly, typically in under 4 minutes for a 300 million record dataset.
Tsunami's performance advantage over other indexes scales with dataset size, selectivity, and dimensionality.

A component analysis showed that both the Grid Tree and Augmented Grid contribute to Tsunami's performance.

Future Work

Future research directions include:

Handling complex correlations using more sophisticated partitioning strategies.
Optimizing the sort order of categorical dimensions based on co-access frequency.
Developing mechanisms to detect and adapt to data and workload shifts incrementally.
Extending Tsunami's techniques to support persistence and disk-based multi-dimensional indexes.
