
Accurate and Scalable Many-Node Simulation (2401.09877v1)

Published 18 Jan 2024 in cs.PF

Abstract: Accurate performance estimation of future many-node machines is challenging because it requires detailed simulation models of both node and network. However, simulating the full system in detail is infeasible in terms of compute and memory resources. State-of-the-art techniques use a two-phase approach that combines detailed simulation of a single node with network-only simulation of the full system. We show that these techniques, where the detailed node simulation is done in isolation, are inaccurate because they ignore two important node-level effects: compute time variability, and inter-node communication. We propose a novel three-stage simulation method to allow scalable and accurate many-node simulation, combining native profiling, detailed node simulation and high-level network simulation. By including timing variability and the impact of external nodes, our method leads to more accurate estimates. We validate our technique against measurements on a multi-node cluster, and report an average 6.7% error on 64 nodes (maximum error of 12%), compared to on average 27% error and up to 54% when timing variability and the scaling overhead are ignored. At higher node counts, the prediction error of ignoring variable timings and scaling overhead continues to increase compared to our technique, and may lead to selecting the wrong optimal cluster configuration. Using our technique, we are able to accurately project performance to thousands of nodes within a day of simulation time, using only a single or a few simulation hosts. Our method can be used to quickly explore large many-node design spaces, including node micro-architecture, node count and network configuration.
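The three-stage flow described in the abstract can be sketched as a toy pipeline. This is an illustrative assumption of how the stages compose, not the authors' implementation: the function names, the Gaussian per-phase timing model, and the constant link-latency term are all hypothetical. The key idea it demonstrates is that, with compute-time variability, each synchronizing phase costs the *maximum* over nodes' sampled times rather than the single-node mean, so ignoring variability underestimates runtime at scale.

```python
import random

# Hypothetical sketch of a three-stage many-node simulation pipeline:
# stage 1 (native profiling), stage 2 (detailed node simulation),
# stage 3 (high-level network simulation). All names and the timing
# model are illustrative assumptions.

def native_profile(num_phases):
    """Stage 1: profile the application natively, capturing per-phase
    compute-time variability as (mean_seconds, stddev_seconds) pairs."""
    return [(1.0 + 0.1 * p, 0.05) for p in range(num_phases)]

def detailed_node_sim(profile, slowdown=1.2):
    """Stage 2: detailed single-node simulation refines the profiled
    means for the target micro-architecture (here: a flat scale factor)."""
    return [(mean * slowdown, std) for mean, std in profile]

def network_sim(node_model, num_nodes, link_latency=0.002):
    """Stage 3: high-level network simulation. Each phase ends in a
    synchronizing collective, so the phase time is the maximum over all
    nodes' sampled compute times, plus a fixed network cost."""
    total = 0.0
    for mean, std in node_model:
        per_node = [max(0.0, random.gauss(mean, std)) for _ in range(num_nodes)]
        total += max(per_node) + link_latency
    return total

random.seed(0)
model = detailed_node_sim(native_profile(num_phases=4))

# Ignoring variability (std = 0) gives a lower, optimistic estimate.
t64 = network_sim(model, num_nodes=64)
t64_novar = network_sim([(m, 0.0) for m, _ in model], num_nodes=64)
print(f"with variability: {t64:.3f}s, without: {t64_novar:.3f}s")
```

Because the maximum of 64 Gaussian samples sits well above the mean, the variability-aware estimate exceeds the mean-only one, mirroring the paper's observation that isolated node simulation underestimates time at higher node counts.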
