Flip: Data-Centric Edge CGRA Accelerator (2309.10623v1)
Abstract: Coarse-Grained Reconfigurable Arrays (CGRAs) are promising edge accelerators because of their outstanding balance of flexibility, performance, and energy efficiency. Classic CGRAs statically map compute operations onto the processing elements (PEs) and route the data dependencies among the operations through the Network-on-Chip. However, CGRAs are designed for fine-grained static instruction-level parallelism and struggle to accelerate applications with dynamic and irregular data-level parallelism, such as graph processing. To address this limitation, we present Flip, a novel accelerator that enhances traditional CGRA architectures to boost the performance of graph applications. Flip retains the classic CGRA execution model while introducing a special data-centric mode for efficient graph processing. Specifically, it exploits the natural data parallelism of graph algorithms by mapping graph vertices, rather than operations, onto PEs and by dynamically routing temporary data according to the runtime evolution of the graph frontier. Experimental results demonstrate that Flip achieves up to 36$\times$ speedup with merely 19% more area compared to classic CGRAs. Compared to state-of-the-art large-scale graph processors, Flip has similar energy efficiency and 2.2$\times$ better area efficiency at a much-reduced power/area budget.
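To make the data-centric idea concrete, the following is a minimal software sketch of breadth-first search in which vertices, not operations, are placed on a grid of PEs, and messages are routed each step only from the current frontier. It is an illustration under stated assumptions, not Flip's actual hardware design: the modulo placement in `map_vertices_to_pes`, the Manhattan-distance hop count, and all function names are hypothetical choices made for this example.

```python
from collections import defaultdict


def map_vertices_to_pes(num_vertices, rows, cols):
    """Hypothetical placement: vertex v lives on PE number (v mod rows*cols)."""
    return {v: divmod(v % (rows * cols), cols) for v in range(num_vertices)}


def data_centric_bfs(adj, source, rows=2, cols=2):
    """BFS where each frontier vertex sends its level to its neighbors' PEs.

    adj is an adjacency list (list of lists). Returns the BFS level of every
    reached vertex and the total number of mesh hops, assuming (for this
    sketch only) that a message travels the Manhattan distance between PEs.
    """
    placement = map_vertices_to_pes(len(adj), rows, cols)
    level = {source: 0}
    frontier = [source]
    hops = 0
    while frontier:
        # Messages are generated at run time, driven by the current frontier.
        inbox = defaultdict(list)
        for u in frontier:
            for v in adj[u]:
                src_pe, dst_pe = placement[u], placement[v]
                hops += abs(src_pe[0] - dst_pe[0]) + abs(src_pe[1] - dst_pe[1])
                inbox[dst_pe].append((v, level[u] + 1))
        # Each PE updates the vertices it owns; newly reached ones form the
        # next frontier.
        next_frontier = []
        for pe in sorted(inbox):
            for v, dist in inbox[pe]:
                if v not in level:
                    level[v] = dist
                    next_frontier.append(v)
        frontier = next_frontier
    return level, hops
```

For example, `data_centric_bfs([[1, 2], [3], [3], []], 0)` places the four vertices on the four PEs of a 2x2 grid and expands the frontier from vertex 0. The destination of each message depends on which vertices are active in a given step, which is the kind of runtime-dependent traffic that a static operation-to-PE mapping does not capture.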