Modeling and Controlling Many-Core HPC Processors: an Alternative to PID and Moving Average Algorithms (2405.18030v1)
Abstract: The race towards higher performance and computing power has led to chips with heterogeneous and complex designs that integrate an ever-growing number of cores on the same monolithic chip or chiplet silicon die. Higher integration density, compounded by the slowdown of technology-driven power reduction, makes power and thermal management increasingly relevant. Unfortunately, existing research lacks a detailed analysis and modeling of thermal, power, and electrical coupling effects, and of how they must be jointly considered to perform dynamic control of complex and heterogeneous Multi-Processor Systems-on-Chip (MPSoCs). To close this gap, we first provide a detailed thermal and power model targeting a modern High Performance Computing (HPC) MPSoC. We consider real-world coupling effects such as actuators' non-idealities and the exponential relation between the dissipated power, the temperature state, and the voltage level in a single processing element. We analyze how these factors affect the behavior of the control algorithm and the challenges they pose. Based on this analysis, we propose a thermal capping strategy inspired by fuzzy control theory to replace the state-of-the-art PID controller, as well as an iterative root-finding method to optimally choose the shared voltage value among cores grouped in the same voltage domain. We evaluate the proposed controller with model-in-the-loop and hardware-in-the-loop co-simulations. We show an improvement over state-of-the-art methods of up to 5x in the maximum exceeded temperature, while application execution is on average 3.56% faster across all the evaluation scenarios.
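To make the shared-voltage idea more concrete, the sketch below uses plain bisection to find the highest voltage at which a group of cores in one voltage domain stays within a power budget. It is only an illustration under stated assumptions, not the method proposed in the paper: the power model, all of its coefficients, the voltage range, and the function names (core_power, domain_power, shared_voltage) are hypothetical. The one feature taken from the abstract is that power depends exponentially on temperature and voltage.

```python
import math


def core_power(v, freq_ghz, temp_c):
    """Hypothetical per-core power in watts (all coefficients are assumptions)."""
    # Dynamic term: C_eff * f * V^2, with an assumed C_eff of 1.0.
    dynamic = 1.0 * freq_ghz * v * v
    # Leakage term: grows exponentially with voltage and temperature.
    leakage = 0.2 * v * math.exp(1.5 * (v - 1.0) + 0.02 * (temp_c - 60.0))
    return dynamic + leakage


def domain_power(v, freqs_ghz, temps_c):
    """Total power of all cores sharing the same supply voltage v."""
    return sum(core_power(v, f, t) for f, t in zip(freqs_ghz, temps_c))


def shared_voltage(freqs_ghz, temps_c, budget_w, v_min=0.5, v_max=1.2, tol=1e-4):
    """Bisection on domain_power(v) - budget_w over [v_min, v_max].

    Returns the highest voltage (within tol) whose domain power does not
    exceed the budget, clamped to the assumed actuator range.
    """
    if domain_power(v_max, freqs_ghz, temps_c) <= budget_w:
        return v_max  # budget is not binding
    if domain_power(v_min, freqs_ghz, temps_c) > budget_w:
        return v_min  # budget cannot be met by voltage alone
    lo, hi = v_min, v_max  # invariant: power(lo) <= budget < power(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if domain_power(mid, freqs_ghz, temps_c) <= budget_w:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    freqs = [2.0, 2.0, 1.5, 1.0]      # GHz, illustrative values
    temps = [70.0, 65.0, 60.0, 55.0]  # degrees Celsius, illustrative values
    v = shared_voltage(freqs, temps, budget_w=6.0)
    print(f"shared voltage: {v:.3f} V, "
          f"domain power: {domain_power(v, freqs, temps):.3f} W")
```

Bisection works here because the assumed power model increases monotonically with voltage, so the budget is crossed at most once in the search interval. A real controller would also have to respect the minimum voltage required by each core's frequency, which this sketch deliberately leaves out.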
- Giovanni Bambini
- Alessandro Ottaviano
- Christian Conficoni
- Andrea Tilli
- Luca Benini
- Andrea Bartolini