Low-Depth Spatial Tree Algorithms (2404.12953v3)
Abstract: Contemporary accelerator designs exhibit a high degree of spatial localization, wherein two-dimensional physical distance determines communication costs between processing elements. This situation presents considerable algorithmic challenges, particularly when managing sparse data, a pivotal component in progressing data science. The spatial computer model quantifies communication locality by weighting processor communication costs by distance, introducing a term named energy. Moreover, it integrates depth, a widely used metric, to promote high parallelism. We propose and analyze a framework for efficient spatial tree algorithms within the spatial computer model. Our primary method constructs a spatial tree layout that optimizes the locality of the neighbors in the compute grid. This approach thereby enables locality-optimized messaging within the tree. Our layout achieves a polynomial factor improvement in energy compared to using a PRAM approach. Using this layout, we develop energy-efficient treefix sum and lowest common ancestor algorithms, which are both fundamental building blocks for other graph algorithms. With high probability, our algorithms exhibit near-linear energy and polylogarithmic depth. Our contributions augment a growing body of work demonstrating that computations can have both high spatial locality and low depth. Moreover, our work constitutes an advancement in the spatial layout of irregular and sparse computations.
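To make the energy notion concrete, the following minimal sketch computes the cost of sending one message from every vertex to its parent when the tree is placed on a square grid of processing elements. It rests on assumptions that are not taken from the abstract: distance is measured as Manhattan distance, every message has unit size, and the helper names `manhattan`, `messaging_energy`, and `place` are hypothetical. The three placements are simple illustrations of how layout affects energy; none of them is the paper's layout.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (row, column) of a processing element


def manhattan(a: Position, b: Position) -> int:
    """Grid distance between two processing elements (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def messaging_energy(parent: List[int], pos: Dict[int, Position]) -> int:
    """Energy of one unit-size message from every vertex to its parent.

    parent[v] is the parent of vertex v, or -1 for the root.
    """
    return sum(manhattan(pos[v], pos[p]) for v, p in enumerate(parent) if p >= 0)


def place(order: List[int], side: int, snake: bool = False) -> Dict[int, Position]:
    """Place vertices on a side x side grid in the given order, row by row.

    With snake=True every other row is reversed, so consecutive vertices in
    `order` always land on adjacent processing elements.
    """
    pos: Dict[int, Position] = {}
    for i, v in enumerate(order):
        row, col = divmod(i, side)
        if snake and row % 2 == 1:
            col = side - 1 - col
        pos[v] = (row, col)
    return pos


if __name__ == "__main__":
    n, side = 16, 4
    path = [-1] + list(range(n - 1))  # a path is a degenerate tree
    # Vertex v goes to slot (7 * v) % n, which scatters neighbors across the grid.
    scattered = sorted(range(n), key=lambda v: (7 * v) % n)
    for name, layout in [
        ("row-major", place(list(range(n)), side)),
        ("snake", place(list(range(n)), side, snake=True)),
        ("scattered", place(scattered, side)),
    ]:
        print(name, messaging_energy(path, layout))
```

The snake placement keeps every parent and child on adjacent processing elements, so each message costs one unit of distance. The scattered placement is closer to what a locality-oblivious assignment would produce and pays for longer routes on the same communication pattern.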