Evaluation of Data Enrichment Methods for Distributed Stream Processing Systems (2307.14287v2)

Published 26 Jul 2023 in cs.DC

Abstract: Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generated by sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. Distributed stream processing (DSP) systems address this challenge by offering high horizontal scalability, fault-tolerant execution, and the ability to process data streams from multiple sources in a single DSP job. Often, however, data streams need to be enriched with extra information before they can be processed correctly, which introduces additional dependencies and potential bottlenecks. In this paper, we present an in-depth evaluation of data enrichment methods for DSP systems and identify the different use cases for stream processing in modern systems. Using a representative DSP system and conducting the evaluation in a realistic cloud environment, we found that outsourcing enrichment data to the DSP system can improve performance for specific use cases. However, the increased resource consumption that accompanies this approach highlights the need for stream processing solutions specifically designed for the performance-intensive workloads of cloud-based applications.
