Supercomputers as a Continuous Medium (2405.05639v1)
Abstract: As supercomputers have grown in complexity, the traditional boundaries between processor, memory, network, and accelerator have blurred, making a homogeneous computer model, in which the overall system is modeled as a continuous medium with uniformly distributed computational power, memory, and data-movement capabilities, an intriguing and powerful abstraction. By applying the homogeneous computer model to algorithms with a given I/O complexity, we recover, from first principles, discrete computer models such as the roofline model, parallel computing laws such as Amdahl's and Gustafson's laws, and phenomenological observations such as super-linear speedup. A distinctive advantage of the homogeneous computer model is that it directly links an application's performance limits to the physical properties of a classical computer system. Applying the model to supercomputers such as Frontier, Fugaku, and the Nvidia DGX GH200 shows that applications such as the Conjugate Gradient (CG) method and the Fast Fourier Transform (FFT) are rapidly approaching fundamental classical computational limits, where the performance of even denser systems, in terms of compute and memory, is ultimately bounded by the speed of light.
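
For reference, the discrete models the abstract says are recovered have well-known closed forms. The block below states them in their standard textbook notation, not in the paper's continuous-medium derivation: attainable performance P for arithmetic intensity I, peak compute rate π, memory bandwidth β, processor count p, and serial fraction s.

```latex
% Roofline bound: attainable performance for arithmetic intensity I
% (flop/byte), peak compute \pi (flop/s), memory bandwidth \beta (byte/s):
P(I) = \min\bigl(\pi,\; \beta \cdot I\bigr)

% Amdahl's law: speedup on p processors with serial fraction s of a
% fixed-size workload:
S_{\mathrm{Amdahl}}(p) = \frac{1}{s + (1 - s)/p}

% Gustafson's law: scaled speedup when the parallel part grows with p:
S_{\mathrm{Gustafson}}(p) = s + (1 - s)\,p
```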
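As a minimal sketch of the abstract's central claim, that performance limits can be tied to the physical properties of the machine, the Python snippet below bounds the rate of globally synchronized iterations (e.g., the dot products in CG) by the light-speed round trip across a machine of a given physical extent. The machine sizes are illustrative assumptions, not measurements from the paper or vendor datasheets.

```python
# Minimal sketch: a speed-of-light bound on synchronization-limited work.
# Machine sizes below are illustrative assumptions, not values from the
# paper or from any vendor datasheet.

C = 299_792_458.0  # speed of light in vacuum, m/s


def light_latency(diameter_m: float) -> float:
    """One-way light-speed latency across a machine of the given extent."""
    return diameter_m / C


def max_sync_rate(diameter_m: float) -> float:
    """Upper bound on globally synchronized iterations per second,
    assuming one light-speed round trip across the machine per iteration."""
    return 1.0 / (2.0 * light_latency(diameter_m))


if __name__ == "__main__":
    for name, size_m in [("rack-scale system (assumed 2 m)", 2.0),
                         ("machine-room-scale system (assumed 50 m)", 50.0)]:
        print(f"{name}: <= {max_sync_rate(size_m):.3e} global syncs/s")
```

No increase in compute density or memory bandwidth lifts this ceiling; only shrinking the machine's physical extent does, which is the sense in which denser systems remain limited by the speed of light.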