Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication (2402.19364v2)
Abstract: We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to suboptimal scalability and fails to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
- Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems. Parallel Comput. 59 (2016), 71–96.
- Statistical mechanics of complex networks. Rev. Mod. Phys. 74 (Jan 2002), 47–97.
- Diameter of the world-wide web. 130–131.
- Minimum projective linearizations of trees in linear time. Inf. Process. Lett. 174 (2022), 106204.
- A primer on statistical distributions. John Wiley & Sons, 2004.
- Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Trans. Parallel Comput. 3, 3 (2016), 18:1–18:34.
- A discrete truncated zipf distribution. Statistica Neerlandica (09 2022).
- Scalable matrix computations on large scale-free graphs using 2d graph partitioning. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC’13, Denver, CO, USA - November 17 - 21, 2013 (2013), W. Gropp and S. Matsuoka, Eds., ACM, pp. 50:1–50:12.
- Sparse matrix multiplication: The distributed block-compressed sparse row library. Parallel Comput. 40, 5-6 (2014), 47–58.
- Bandwidth, expansion, treewidth, separators and universality for bounded-degree graphs. Eur. J. Comb. 31, 5 (2010), 1217–1227.
- Optimizing sparse matrix-vector multiplication for large-scale data analytics. In Proceedings of the 2016 International Conference on Supercomputing, ICS 2016, Istanbul, Turkey, June 1-3, 2016 (2016), O. Ozturk, K. Ebcioglu, M. T. Kandemir, and O. Mutlu, Eds., ACM, pp. 37:1–37:12.
- Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Distributed Syst. 10, 7 (1999), 673–693.
- Collective communication: theory, practice, and experience. Concurr. Comput. Pract. Exp. 19, 13 (2007), 1749–1783.
- l22{}_{\mbox{2}}start_FLOATSUBSCRIPT 2 end_FLOATSUBSCRIPT22{}^{\mbox{2}}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT spreading metrics for vertex ordering problems. Algorithmica 56, 4 (2010), 577–604.
- The bandwidth problem for graphs and matrices - a survey. J. Graph Theory 6, 3 (1982), 223–254.
- Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 24th national conference, ACM 1969, USA, 1969 (1969), S. L. Pollack, T. R. Dines, W. C. Sangren, N. R. Nielsen, W. G. Gerkin, A. E. Corduan, L. Nowak, J. L. Mueller, J. H. III, P. S. T. Yuen, J. Stein, and M. M. Mueller, Eds., ACM, pp. 157–172.
- Mpi for python. Journal of Parallel and Distributed Computing 65, 9 (2005), 1108–1115.
- The university of florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1 (dec 2011).
- Parallel hypergraph partitioning for scientific computing. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006, Rhodes Island, Greece (2006), IEEE.
- Minimum linear arrangement of series-parallel graphs. In Approximation and Online Algorithms - 12th International Workshop, WAOA 2014, Wrocław, Poland, September 11-12, 2014, Revised Selected Papers (2014), E. Bampis and O. Svensson, Eds., vol. 8952 of Lecture Notes in Computer Science, Springer, pp. 168–180.
- Feige, U. Approximating the bandwidth via volume respecting embeddings. J. Comput. Syst. Sci. 60, 3 (2000), 510–539.
- Feige, U. Coping with the np-hardness of the graph bandwidth problem. In Algorithm Theory - SWAT 2000, 7th Scandinavian Workshop on Algorithm Theory, Bergen, Norway, July 5-7, 2000, Proceedings (2000), M. M. Halldórsson, Ed., vol. 1851 of Lecture Notes in Computer Science, Springer, pp. 10–19.
- I-GCN: A graph convolutional network accelerator with runtime locality enhancement through islandization. In MICRO ’21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18-22, 2021 (2021), ACM, pp. 1051–1063.
- Matrix computations. JHU press, 2013.
- Concrete Mathematics: A Foundation for Computer Science, second ed. Addison-Wesley, Reading, MA, 1994.
- Gravvanis, G. A. An approximate inverse matrix technique for arrowhead matrices. Int. J. Comput. Math. 70, 1 (1998), 35–45.
- Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362.
- A separator theorem in minor-closed classes. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA (2010), IEEE Computer Society, pp. 153–162.
- Analysis of partitioning models and metrics in parallel sparse matrix-vector multiplication. In Parallel Processing and Applied Mathematics - 10th International Conference, PPAM 2013, Warsaw, Poland, September 8-11, 2013, Revised Selected Papers, Part II (2013), R. Wyrzykowski, J. J. Dongarra, K. Karczewski, and J. Wasniewski, Eds., vol. 8385 of Lecture Notes in Computer Science, Springer, pp. 174–184.
- Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23-27, 2016 (2016), IEEE Computer Society, pp. 842–853.
- A discrete truncated pareto distribution. Statistical Methodology 26 (2015), 135–150.
- Lanczos, C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards 45, 4 (Oct. 1950).
- A separator theorem for planar graphs. SIAM Journal on Applied Mathematics 36, 2 (1979), 177–189.
- HYPE: massive hypergraph partitioning with neighborhood expansion. In IEEE International Conference on Big Data (IEEE BigData 2018), Seattle, WA, USA, December 10-13, 2018 (2018), N. Abe, H. Liu, C. Pu, X. Hu, N. K. Ahmed, M. Qiao, Y. Song, D. Kossmann, B. Liu, K. Lee, J. Tang, J. He, and J. S. Saltz, Eds., IEEE, pp. 458–467.
- Muradian, D. The bandwidth minimization problem for cyclic caterpillars with hair length 1 is np-complete. Theor. Comput. Sci. 307, 3 (2003), 567–572.
- Cupy: A numpy-compatible library for nvidia gpu calculations. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017).
- Scalability of hybrid spmv with hypergraph partitioning and vertex delegation for communication avoidance. In International Conference on High Performance Computing & Simulation (HPCS 2020) (2021).
- PAIGE, C. C. Computational Variants of the Lanczos Method for the Eigenproblem. IMA Journal of Applied Mathematics 10, 3 (12 1972), 373–381.
- Papadimitriou, C. H. The np-completeness of the bandwidth minimization problem. Computing 16, 3 (1976), 263–270.
- Shallow excluded minors and improved graph decompositions. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms. 23-25 January 1994, Arlington, Virginia, USA (1994), D. D. Sleator, Ed., ACM/SIAM, pp. 462–470.
- New approximation techniques for some linear ordering problems. SIAM J. Comput. 34, 2 (2004), 388–404.
- An optimal time algorithm for minimum linear arrangement of chord graphs. Inf. Sci. 238 (2013), 212–220.
- Graph minors. vi. disjoint paths across a disc. Journal of Combinatorial Theory, Series B 41, 1 (1986), 115–138.
- Parallel matrix multiplication: A systematic journey. SIAM J. Sci. Comput. 38, 6 (2016).
- Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication. In ICS ’21: 2021 International Conference on Supercomputing, Virtual Event, USA, June 14-17, 2021 (2021), H. Zhou, J. Moreira, F. Mueller, and Y. Etsion, Eds., ACM, pp. 431–442.
- Graphgrind: addressing load imbalance of graph partitioning. In Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA, June 14-16, 2017 (2017), W. D. Gropp, P. Beckman, Z. Li, and F. J. Cazorla, Eds., ACM, pp. 16:1–16:10.
- Reducing communication in graph neural network training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020 (2020), C. Cuicchi, I. Qualters, and W. T. Kramer, Eds., IEEE/ACM, p. 70.
- Turner, J. S. On the probable performance of heuristics for bandwidth minimization. SIAM J. Comput. 15, 2 (1986), 561–580.
- SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17 (2020), 261–272.
- Semi-external memory sparse matrix multiplication for billion-node graphs. IEEE Trans. Parallel Distributed Syst. 28, 5 (2017), 1470–1483.
- Zipf, G. K. Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley, Cambridge, Massachusetts, 1949.