MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems (2312.10244v2)
Abstract: The design space exploration of scaled-out manycores for communication-intensive applications (e.g., graph analytics and sparse linear algebra) is hampered because existing frameworks lack either the scalability or the accuracy needed to simulate data-dependent execution patterns. This paper presents MuchiSim, a novel parallel simulator designed to address these challenges when exploring the design space of distributed multi-chiplet manycore architectures. We evaluate MuchiSim on simulations of systems with up to a million interconnected processing units (PUs) while modeling data movement and communication cycle by cycle. In addition to performance, MuchiSim reports the energy, area, and cost of the simulated system. It also comes with a benchmark application suite and two data visualization tools. MuchiSim supports various parallelization strategies and communication primitives, such as task-based parallelization and message passing, making it highly relevant for architectures with software-managed coherence and distributed memory. Via a case study, we show that MuchiSim helps users explore the balance between memory and computation units and the constraints related to chiplet integration and inter-chip communication. MuchiSim enables the evaluation of new techniques and design parameters at scales that are more realistic for modern parallel systems, opening the door to further research in this area.