Tascade: Hardware Support for Atomic-free, Asynchronous and Efficient Reduction Trees (2311.15810v2)

Published 27 Nov 2023 in cs.AR and cs.DC

Abstract: Graph search and sparse data-structure traversal workloads contain challenging irregular memory patterns on global data structures that need to be modified atomically. Distributed processing of these workloads has relied on server threads operating on their own data copies that are merged upon global synchronization. As parallelism increases within each server, the communication challenges that arose in distributed systems a decade ago are now being encountered within large manycore servers. Prior work has achieved scalability for sparse applications up to thousands of PUs on-chip, but does not scale further due to increasing communication distances and load imbalance across PUs. To address these challenges we propose Tascade, a hardware-software co-design that offers support for storage-efficient data-private reductions as well as asynchronous and opportunistic reduction trees. Tascade introduces an execution model, along with a supporting hardware design, that coalesces data updates regionally and merges the data from these regions through cascaded updates. Together, Tascade's innovations minimize communication, improve work balance in task-based parallelization schemes, and scale up to a million PUs. We evaluate six applications and four datasets to provide a detailed analysis of Tascade's performance, power, and traffic-reduction gains over prior work. Our parallelization of Breadth-First Search with RMAT-26 across a million PUs -- the largest in the literature -- reaches over 7600 GTEPS.
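The core idea of the abstract — each region coalesces updates into a private copy (no atomics), and the partial copies are then merged pairwise up a reduction tree — can be sketched in software. This is a minimal, illustrative analogue only; the names (`local_coalesce`, `cascade_reduce`) are hypothetical and the paper's actual mechanism is implemented in hardware with asynchronous, opportunistic merging.

```python
def local_coalesce(updates):
    """Each region coalesces updates to shared keys into a private
    copy, so no atomic operations on global state are needed."""
    private = {}
    for key, val in updates:
        private[key] = private.get(key, 0) + val
    return private

def merge(a, b):
    """Merge two private copies; models one cascaded update
    between neighboring regions."""
    out = dict(a)
    for key, val in b.items():
        out[key] = out.get(key, 0) + val
    return out

def cascade_reduce(region_updates):
    """Pairwise tree merge: regions reduce locally, then partial
    copies cascade upward, halving their count at every level."""
    level = [local_coalesce(u) for u in region_updates]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(merge(level[i], level[i + 1]))
        if len(level) % 2:  # an odd region carries over to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Four regions, each with private updates to shared vertices:
result = cascade_reduce([
    [("v0", 1), ("v1", 2)],
    [("v0", 3)],
    [("v1", 4), ("v2", 5)],
    [("v2", 1)],
])
print(result)  # {'v0': 4, 'v1': 6, 'v2': 6}
```

Because addition is associative and commutative, the tree can merge partial copies in any order as they become available, which is what enables the asynchronous, opportunistic merging the paper describes.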
