
A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning (2403.00995v3)

Published 1 Mar 2024 in cs.DC

Abstract: As Spark becomes a common big data analytics platform, its growing complexity makes automatic tuning of numerous parameters critical for performance. Our work on Spark parameter tuning is particularly motivated by two recent trends: Spark's Adaptive Query Execution (AQE) based on runtime statistics, and the increasingly popular Spark cloud deployments that make cost-performance reasoning crucial for the end user. This paper presents our design of a Spark optimizer that controls all tunable parameters of each query in the new AQE architecture to explore its performance benefits and, at the same time, casts the tuning problem in the theoretically sound multi-objective optimization (MOO) setting to better adapt to user cost-performance preferences. To this end, we propose a novel hybrid compile-time/runtime approach to multi-granularity tuning of diverse, correlated Spark parameters, as well as a suite of modeling and optimization techniques to solve the tuning problem in the MOO setting while meeting the stringent time constraint of 1-2 seconds for cloud use. Evaluation results using TPC-H and TPC-DS benchmarks demonstrate the superior performance of our approach: (i) When prioritizing latency, it achieves 63% and 65% latency reduction for TPC-H and TPC-DS, respectively, with an average solving time of 0.7-0.8 sec, outperforming the most competitive MOO method, which reduces latency by only 18-25% with a solving time of 2.6-15 sec. (ii) When shifting preferences between latency and cost, our approach dominates the solutions of alternative methods, exhibiting superior adaptability to varying preferences.
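
To make the MOO framing in the abstract concrete, here is a minimal sketch of Pareto dominance over two minimized objectives (latency and cost) and of picking one configuration from the Pareto front according to user preference weights. It is not the paper's algorithm: the function names, the sample (latency, cost) values, and the normalized weighted-sum selection rule are all illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical objective vector: (latency_sec, cost_usd); both are minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates (input order preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_by_preference(front: List[Point], w_latency: float, w_cost: float) -> Point:
    """Pick the front point with the smallest weighted sum of normalized objectives.

    Assumption: a plain weighted sum stands in for whatever preference
    mechanism a real optimizer would use.
    """
    lo = [min(p[i] for p in front) for i in range(2)]
    hi = [max(p[i] for p in front) for i in range(2)]

    def norm(p: Point, i: int) -> float:
        span = hi[i] - lo[i]
        return 0.0 if span == 0 else (p[i] - lo[i]) / span

    return min(front, key=lambda p: w_latency * norm(p, 0) + w_cost * norm(p, 1))


# Made-up predicted (latency, cost) pairs for five candidate configurations.
candidates = [(10.0, 5.0), (6.0, 8.0), (12.0, 6.0), (4.0, 12.0), (7.0, 9.0)]
front = pareto_front(candidates)
latency_first = pick_by_preference(front, w_latency=0.9, w_cost=0.1)
cost_first = pick_by_preference(front, w_latency=0.1, w_cost=0.9)
```

In this toy data, shifting the weights from latency to cost moves the chosen point along the front, which is the kind of preference adaptation the abstract refers to.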
