Demeter: Resource-Efficient Distributed Stream Processing under Dynamic Loads with Multi-Configuration Optimization (2403.02129v1)

Published 4 Mar 2024 in cs.DC

Abstract: Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scaleout configurations which maximize resource utilization remains a challenge. This is especially true in environments where workloads change over time and node failures are all but inevitable. Furthermore, configuration parameters such as memory allocation and checkpointing intervals impact performance and resource usage as well. Sub-optimal configurations easily lead to high operational costs, poor performance, or unacceptable loss of service. In this paper, we present Demeter, a method for dynamically optimizing key DSP system configuration parameters for resource efficiency. Demeter uses Time Series Forecasting to predict future workloads and Multi-Objective Bayesian Optimization to model runtime behaviors in relation to parameter settings and workload rates. Together, these techniques allow us to determine whether or not enough is known about the predicted workload rate to proactively initiate short-lived parallel profiling runs for data gathering. Once trained, the models guide the adjustment of multiple, potentially dependent system configuration parameters ensuring optimized performance and resource usage in response to changing workload rates. Our experiments on a commodity cluster using Apache Flink demonstrate that Demeter significantly improves the operational efficiency of long-running benchmark jobs.
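The control loop described in the abstract (forecast the workload rate, check whether enough is known about it, then pick a configuration that balances several objectives) can be illustrated with a minimal Python sketch. This is not Demeter's implementation: simple exponential smoothing stands in for the Time Series Forecasting model, and a Pareto filter over raw profiling observations stands in for Multi-Objective Bayesian Optimization, which would instead fit surrogate models. The `Observation` fields, the function names, and the recovery-time constraint are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Observation:
    """One hypothetical profiling result for a configuration."""
    parallelism: int             # scale-out
    checkpoint_interval_s: int   # checkpointing interval in seconds
    max_rate: float              # highest workload rate (events/s) sustained
    cpu_cores: float             # observed resource usage
    recovery_time_s: float       # observed recovery time after a failure


def forecast_next(rates: Sequence[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast via simple exponential smoothing (a stand-in)."""
    level = rates[0]
    for rate in rates[1:]:
        level = alpha * rate + (1 - alpha) * level
    return level


def dominates(a: Observation, b: Observation) -> bool:
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = (a.cpu_cores <= b.cpu_cores
                and a.recovery_time_s <= b.recovery_time_s)
    better = (a.cpu_cores < b.cpu_cores
              or a.recovery_time_s < b.recovery_time_s)
    return no_worse and better


def pareto_front(candidates: List[Observation]) -> List[Observation]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]


def choose_config(observations: List[Observation],
                  predicted_rate: float,
                  max_recovery_s: float) -> Optional[Observation]:
    """Pick the cheapest non-dominated configuration that sustains the
    predicted rate within the recovery-time limit. Returns None when no
    observation covers the predicted rate, which in this sketch is the
    signal that more profiling data would be needed."""
    feasible = [o for o in observations
                if o.max_rate >= predicted_rate
                and o.recovery_time_s <= max_recovery_s]
    if not feasible:
        return None
    return min(pareto_front(feasible), key=lambda o: o.cpu_cores)
```

Returning `None` when nothing is feasible mirrors, in a very simplified way, the abstract's point that the system must decide whether enough is known about the predicted workload rate before adjusting configurations.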

