HEET: A Heterogeneity Measure to Quantify the Difference across Distributed Computing Systems (2312.03235v1)

Published 6 Dec 2023 in cs.DC and cs.PF

Abstract: Although system heterogeneity has been extensively studied in the past, there is yet to be a study on measuring the impact of heterogeneity on system performance. For this purpose, we propose a heterogeneity measure that can characterize the impact of the heterogeneity of a system on its performance behavior in terms of throughput or makespan. We develop a mathematical model to characterize a heterogeneous system in terms of its task and machine heterogeneity dimensions and then reduce it to a single value, called Homogeneous Equivalent Execution Time (HEET), which represents the execution time behavior of the entire system. We used AWS EC2 instances to implement a real-world machine learning inference system. Performance evaluation of the HEET score across different heterogeneous system configurations demonstrates that HEET can accurately characterize the performance behavior of these systems. In particular, the results show that our proposed method is capable of predicting the true makespan of heterogeneous systems without online evaluations with an average precision of 84%. This heterogeneity measure is instrumental for solution architects to configure their systems proactively to be sufficiently heterogeneous to meet their desired performance objectives.
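To make the idea concrete: a common way to represent task and machine heterogeneity (which the abstract refers to as the system's two heterogeneity dimensions) is an expected-time-to-compute (ETC) matrix. The sketch below is a hypothetical illustration of collapsing such a matrix into a single equivalent execution time, assuming a workload mix given by task-type probabilities and a cluster given by machine-type counts; it is not the authors' actual HEET formulation, whose exact definition is not reproduced here.

```python
def heet_score(etc, task_probs, machine_counts):
    """Hypothetical sketch of a homogeneous-equivalent execution time.

    etc[i][j]      -- expected execution time of task type i on machine type j
    task_probs[i]  -- probability that an arriving task is of type i
    machine_counts[j] -- number of machines of type j in the cluster
    """
    total_machines = sum(machine_counts)
    # Aggregate the cluster's service rate for the given workload mix:
    # each machine of type j processes task type i at rate 1 / etc[i][j].
    equiv_rate = 0.0
    for i, p in enumerate(task_probs):
        rate_i = sum(n / etc[i][j] for j, n in enumerate(machine_counts))
        equiv_rate += p * rate_i
    # Execution time of one "equivalent" machine in a homogeneous cluster
    # of the same size that would sustain the same aggregate throughput.
    return total_machines / equiv_rate
```

In a perfectly homogeneous system (one task type, one machine type with execution time `t`), the sketch reduces to `t` itself; across heterogeneous machine types it behaves like a workload-weighted harmonic mean, which is consistent with the abstract's claim that a single value can stand in for the whole system's execution-time behavior when predicting throughput or makespan.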
