Tao: Re-Thinking DL-based Microarchitecture Simulation (2404.10921v2)
Abstract: Microarchitecture simulators are indispensable tools for microarchitecture designers to validate, estimate, and optimize new hardware that meets specific design requirements. While the quest for fast, accurate, and detailed microarchitecture simulation has been ongoing for decades, existing simulators excel and fall short in different respects: (i) Although execution-driven simulation is accurate and detailed, it is extremely slow and requires expert-level experience to design. (ii) Trace-driven simulation reuses execution traces in pursuit of fast simulation but faces accuracy concerns and fails to achieve significant speedup. (iii) Emerging deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy but fail to provide the low-level microarchitectural performance metrics that are crucial for microarchitectural bottleneck analysis. Additionally, they introduce substantial overheads from trace regeneration and model re-training when simulating a new microarchitecture. Re-thinking the advantages and limitations of these simulation paradigms, this paper introduces TAO, which redesigns DL-based simulation with three primary contributions: First, we propose a new training dataset design such that the subsequent simulation needs only functional traces as input, which can be rapidly generated and reused across microarchitectures. Second, we redesign the input features and the DL model using self-attention to support predicting various performance metrics. Third, we propose techniques to train a microarchitecture-agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and reduces the re-training overhead of conventional DL-based simulators. Our extensive evaluation shows that TAO can reduce the overall training and simulation time by 18.06x over state-of-the-art DL-based endeavors.
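To make the second and third contributions more concrete, the following is a minimal sketch, not the paper's model: a shared embedding feeds a single scaled dot-product self-attention layer over a window of functional-trace feature vectors, and one linear head per metric produces a prediction for each instruction. The window size, feature width, metric names, and random weights are all illustrative assumptions; the abstract does not specify any of them.

```python
import math
import random

def matmul(a, b):
    # Plain row-by-column matrix product on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    # Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
WINDOW, FEAT, DIM = 4, 6, 8  # assumed sizes, chosen only for illustration
METRICS = ["latency", "branch_mispredict", "l1d_miss"]  # hypothetical metric names

# Shared embedding: the part one would try to keep microarchitecture-agnostic.
embed = rand_matrix(FEAT, DIM, rng)
# Attention weights and per-metric heads: the parts one would re-train
# when moving to a new microarchitectural configuration (an assumption here).
wq, wk, wv = (rand_matrix(DIM, DIM, rng) for _ in range(3))
heads = {name: rand_matrix(DIM, 1, rng) for name in METRICS}

# Stand-in for a window of functional-trace features (random, not real trace data).
trace = rand_matrix(WINDOW, FEAT, rng)
hidden = matmul(trace, embed)
context, weights = self_attention(hidden, wq, wk, wv)
predictions = {name: [row[0] for row in matmul(context, w)] for name, w in heads.items()}

for name, values in predictions.items():
    print(name, [round(v, 3) for v in values])
```

In this sketch, transferring to a new configuration would amount to keeping `embed` fixed and fitting only `wq`, `wk`, `wv`, and `heads` again; whether TAO partitions its parameters this way is not stated in the abstract.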
Authors: Santosh Pandey, Amir Yazdanbakhsh, Hang Liu