Control Flow Management in Modern GPUs (2407.02944v1)

Published 3 Jul 2024 in cs.AR

Abstract: In GPUs, the control flow management mechanism determines which threads in a warp are active at any point in time. This mechanism monitors the control flow of the scalar threads within a warp to optimize thread scheduling and plays a critical role in the utilization of execution resources. The control flow management mechanism can be controlled or assisted by software through instructions. However, GPU vendors do not disclose details about their compilers, ISAs, or hardware implementations. This lack of transparency makes it challenging for researchers to understand how the control flow management mechanism functions, is implemented, or is assisted by software, which is crucial when it significantly affects their research. It is also problematic for performance modeling of GPUs, as one can only rely on traces from real hardware for control flow and cannot model or modify the functionality of the mechanism. This paper addresses this issue by defining plausible semantics for the control flow instructions in the Turing native ISA, based on insights gleaned from experimental data across various benchmarks. Building on these definitions, we propose Hanoi, a low-cost mechanism for efficient control flow management. Hanoi ensures correctness and generates a control flow that closely matches real hardware. Our evaluation shows that the discrepancy between the control flow trace of real hardware and that of our mechanism is only 1.03% on average. Furthermore, when comparing the Instructions Per Cycle (IPC) of GPUs employing Hanoi with the native control flow management of actual hardware, the average difference is just 0.19%.
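To make the abstract's central notion concrete, the sketch below simulates a per-warp reconvergence stack of the kind commonly described in the GPU architecture literature and in academic simulators: it shows how an active mask can be split at a divergent branch and restored at a reconvergence point. This is a generic illustration under simplifying assumptions, not the paper's Hanoi mechanism and not NVIDIA's actual Turing implementation; the names SimtStack, diverge, and reconverge are hypothetical.

```python
# Minimal sketch of a per-warp SIMT reconvergence stack (illustrative only;
# not the paper's Hanoi mechanism and not NVIDIA's actual implementation).
# Each stack entry records a reconvergence PC and the set of active lanes.

WARP_SIZE = 32

class SimtStack:
    def __init__(self):
        # Bottom entry: all lanes active, no reconvergence PC (never popped).
        self.stack = [{"rpc": None, "active": set(range(WARP_SIZE))}]

    @property
    def active_mask(self):
        # The top-of-stack entry defines which lanes execute right now.
        return self.stack[-1]["active"]

    def diverge(self, taken_lanes, reconv_pc):
        # Split the current active mask at a divergent branch.
        # reconv_pc is where both paths meet again (e.g. the branch's
        # immediate post-dominator chosen by the compiler or hardware).
        current = self.active_mask
        taken = current & set(taken_lanes)
        not_taken = current - taken
        if taken and not_taken:
            # Push both paths; the top of the stack executes first.
            self.stack.append({"rpc": reconv_pc, "active": not_taken})
            self.stack.append({"rpc": reconv_pc, "active": taken})

    def reconverge(self, pc):
        # When the executing path reaches its reconvergence PC, pop it and
        # switch to the next pending path (or back to the merged mask).
        if len(self.stack) > 1 and self.stack[-1]["rpc"] == pc:
            self.stack.pop()


# Example: a 32-lane warp where lanes 0-7 take a branch reconverging at 0x80.
warp = SimtStack()
warp.diverge(taken_lanes=range(8), reconv_pc=0x80)
print(len(warp.active_mask))  # 8  -> taken path runs first
warp.reconverge(0x80)
print(len(warp.active_mask))  # 24 -> not-taken path runs next
warp.reconverge(0x80)
print(len(warp.active_mask))  # 32 -> full warp active again
```

The sketch only illustrates the active-mask bookkeeping the abstract describes; the paper itself is concerned with how this management is expressed and assisted through the control flow instructions of the Turing native ISA and with the proposed Hanoi mechanism.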


