
From array algebra to energy efficiency on GPUs: Data and hardware shapes with dimension-lifting to optimize memory-processor layouts (2306.11148v1)

Published 19 Jun 2023 in cs.DC and cs.MS

Abstract: We present a new formulation for parallel matrix multiplication (MM) to outperform the standard row-column code design. This algorithm is formulated in the MoA formalism (A Mathematics of Arrays) and combines an array view of hardware (dimension-lifting), which extends indexing to physical memory/processing units, with a contiguous data layout derived from static transformations. This view of a hardware-software model is thus a bridging model in the sense of Valiant's BSP. OpenACC code was derived from the normal form of the MoA expressions, producing optimal block sizes using the static information of types and shapes. Experiments were run on Nvidia V100 GPUs and reveal energy consumption that is quadratic in N, i.e. linear in the size of the matrix. More generally, this approach may be an ideal way of formulating, optimizing, and mapping array algorithms to embedded hardware. This work builds upon recently published results of NREL scientists.
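As a rough illustration of the contrast the abstract draws between the row-column design and a contiguous data layout, the sketch below multiplies two N x N matrices stored as flat row-major lists. It is not the paper's MoA-derived OpenACC code: the function names and the choice of the i-k-j loop order are assumptions made here for illustration only.

```python
def matmul_row_column(a, b, n):
    """Row-column design: c[i][j] is the dot product of row i of a and column j of b."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # b is read down a column, i.e. with stride n in flat storage
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def matmul_contiguous(a, b, n):
    """Same product, reordered so b and c are both traversed with unit stride."""
    c = [0.0] * (n * n)
    for i in range(n):
        for k in range(n):
            scale = a[i * n + k]
            for j in range(n):
                # row k of b is scaled and accumulated into row i of c
                c[i * n + j] += scale * b[k * n + j]
    return c


if __name__ == "__main__":
    n = 2
    a = [1.0, 2.0, 3.0, 4.0]  # [[1, 2], [3, 4]] in row-major order
    b = [5.0, 6.0, 7.0, 8.0]  # [[5, 6], [7, 8]] in row-major order
    print(matmul_row_column(a, b, n))
    print(matmul_contiguous(a, b, n))
```

Both functions compute the same product; only the memory access order differs. Each matrix holds N*N elements, which is why a cost that is quadratic in N is linear in the number of matrix elements.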

