RelJoin: Relative-cost-based Selection of Distributed Join Methods for Query Plan Optimization (2311.14311v1)
Abstract: Selecting appropriate distributed join methods for logical join operations in a query plan is crucial for the performance of data-intensive scalable computing (DISC). Different network communication patterns in the data exchange phase generate different network workloads and significantly affect distributed join performance. However, most cost-based query optimizers focus on the local computing cost and do not precisely model the network communication cost. We propose a cost model for various distributed join methods to optimize join queries on DISC platforms. Our method precisely measures the network and local computing workloads in the different execution phases, using the size and cardinality statistics of the datasets and the join parallelism of the cluster. Our cost model reveals the importance of the relative size of the joining datasets. We implement an efficient distributed join selection strategy, called RelJoin, in SparkSQL, a widely used distributed data processing framework. RelJoin uses adaptive runtime statistics for accurate cost estimation and selects the optimal distributed join method for each logical join to optimize the physical query plan. Evaluation results on the TPC-DS benchmark show that RelJoin performs best in 62 of the 97 queries and reduces the average query time by 21% compared with other strategies.
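To illustrate why the relative size of the joining datasets matters, the following is a minimal sketch that compares only the network traffic of two common distributed join methods. It is not the RelJoin cost model: the names `TableStats`, `broadcast_network_bytes`, `shuffle_network_bytes`, and `pick_join_method` are hypothetical, the formulas are textbook approximations that assume uniform hash partitioning, and local computing cost is ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Statistics an optimizer might know about one join input (hypothetical)."""
    size_bytes: int
    row_count: int


def broadcast_network_bytes(small: TableStats, parallelism: int) -> float:
    # Assumption: every one of the p join tasks receives a full copy
    # of the smaller table, so traffic grows linearly with parallelism.
    return float(small.size_bytes * parallelism)


def shuffle_network_bytes(left: TableStats, right: TableStats, parallelism: int) -> float:
    # Assumption: uniform hash partitioning, so a fraction (p - 1) / p of
    # each table's bytes is sent to a different task than the one holding it.
    moved_fraction = (parallelism - 1) / parallelism
    return (left.size_bytes + right.size_bytes) * moved_fraction


def pick_join_method(left: TableStats, right: TableStats, parallelism: int) -> str:
    # Compare estimated network traffic only; a real optimizer would also
    # weigh local build/probe or sort costs and memory limits.
    small = left if left.size_bytes <= right.size_bytes else right
    broadcast_cost = broadcast_network_bytes(small, parallelism)
    shuffle_cost = shuffle_network_bytes(left, right, parallelism)
    return "broadcast" if broadcast_cost < shuffle_cost else "shuffle"


if __name__ == "__main__":
    mb = 1024 * 1024
    fact = TableStats(size_bytes=10_000 * mb, row_count=100_000_000)
    dim = TableStats(size_bytes=10 * mb, row_count=100_000)
    peer = TableStats(size_bytes=10_000 * mb, row_count=100_000_000)
    print(pick_join_method(fact, dim, parallelism=8))
    print(pick_join_method(fact, peer, parallelism=8))
```

Under these assumptions the choice depends on the ratio of the two input sizes relative to the parallelism rather than on either absolute size: broadcasting wins only when the larger input is roughly more than p times the smaller one.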