Parendi: Thousand-Way Parallel RTL Simulation (2403.04714v1)
Abstract: Hardware development relies on simulations, particularly cycle-accurate RTL (Register Transfer Level) simulations, which consume significant time. As single-processor performance grows only slowly, conventional single-threaded RTL simulation is becoming less practical for increasingly complex chips and systems. One solution is parallel RTL simulation, where, ideally, simulators could run on thousands of parallel cores. However, existing simulators can exploit only tens of cores. This paper studies the challenges inherent in running parallel RTL simulation on a multi-thousand-core machine (the Graphcore IPU, a 1472-core machine). Simulation performance requires balancing three factors: synchronization, communication, and computation. We experimentally evaluate each factor and analyze how it affects parallel simulation speed, drawing on contrasts between the large-scale IPU and smaller but faster x86 systems. Using this analysis, we build Parendi, an RTL simulator for the IPU. It distributes RTL simulation across 5888 cores on 4 IPU sockets. Parendi runs large RTL designs up to 4x faster than a powerful, state-of-the-art x86 multicore system.
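The balance between the three factors can be illustrated with a toy cost model. The sketch below is not Parendi's model or implementation. It assumes perfectly balanced partitions and fixed per-cycle communication and synchronization costs, and the names `cycle_time` and `speedup` and all cost values are hypothetical. Only the core counts come from the abstract.

```python
# Core counts stated in the abstract.
CORES_PER_IPU = 1472
IPU_SOCKETS = 4
TOTAL_CORES = CORES_PER_IPU * IPU_SOCKETS  # 4 * 1472 = 5888


def cycle_time(compute: float, cores: int, comm: float, sync: float) -> float:
    """Time to simulate one RTL cycle in a toy bulk-synchronous model.

    Assumptions (illustrative only): the computation divides evenly
    across cores, and communication and synchronization add a fixed
    cost to every simulated cycle.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return compute / cores + comm + sync


def speedup(compute: float, cores: int, comm: float, sync: float) -> float:
    """Speedup over a single core that pays no comm or sync cost."""
    return compute / cycle_time(compute, cores, comm, sync)
```

In this model, a design with 5888 units of computation per cycle and 3 units of combined communication and synchronization overhead reaches a speedup of 1472 on 5888 cores, not 5888. As the core count grows, the speedup is bounded by `compute / (comm + sync)`, which is why adding cores alone does not help unless the other two factors are kept small.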