MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations (2407.02238v1)
Abstract: One of the primary areas of interest in High Performance Computing is improving the performance of parallel workloads. Today, deep learning-based optimization tasks that operate on compilable source code often exploit LLVM Intermediate Representations (IRs) to extract features from the code. Most such works target specific tasks or rely on a pre-defined set of heuristics. Pre-trained models remain rare in this domain, although their potential has been widely discussed; in particular, approaches modeled after large language models (LLMs) have been proposed, but these carry prohibitively large training costs. In this paper, we propose MIREncoder, a Multi-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space for use in downstream machine learning-based tasks. A multi-modal approach enables us to better extract features from compilable programs and to better model code syntax, semantics, and structure. For code-based performance optimizations, these features are crucial when making optimization decisions. A pre-trained model/embedding implicitly enables transfer learning and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have low overhead and be easy to use. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables the use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations show that the proposed approach can outperform the state of the art while reducing overhead.
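To make the abstract's architecture concrete, the following is a minimal sketch (not the authors' implementation) of a multi-modal IR auto-encoder in PyTorch: one encoder reads the IR as a token sequence (syntax/semantics), another reads a dense adjacency view of the IR graph (structure), and the two are fused into a shared embedding that a small downstream task head can consume after pre-training. All class names, dimensions, and the toy reconstruction objective are assumptions for illustration only.

```python
# Hypothetical sketch of a multi-modal IR auto-encoder; not MIREncoder's actual code.
import torch
import torch.nn as nn


class TokenEncoder(nn.Module):
    """Encodes tokenized LLVM IR text (syntax/semantics modality)."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer IDs -> (batch, dim) pooled embedding
        return self.encoder(self.embed(tokens)).mean(dim=1)


class GraphEncoder(nn.Module):
    """Encodes a dense adjacency view of the IR flow graph (structure modality)."""
    def __init__(self, node_feat_dim: int, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(node_feat_dim, dim)
        self.msg = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (batch, n_nodes, node_feat_dim), adj: (batch, n_nodes, n_nodes)
        h = torch.relu(self.proj(node_feats))
        h = torch.relu(adj @ self.msg(h))   # one round of neighbor aggregation
        return h.mean(dim=1)                # (batch, dim) graph-level embedding


class MultiModalIRAutoEncoder(nn.Module):
    """Fuses both modalities and reconstructs each from the shared latent code."""
    def __init__(self, vocab_size: int, node_feat_dim: int, dim: int = 128):
        super().__init__()
        self.token_enc = TokenEncoder(vocab_size, dim)
        self.graph_enc = GraphEncoder(node_feat_dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)
        # Toy decoders: reconstruct per-modality summaries from the latent code.
        self.token_dec = nn.Linear(dim, dim)
        self.graph_dec = nn.Linear(dim, dim)

    def forward(self, tokens, node_feats, adj):
        z_tok = self.token_enc(tokens)
        z_gra = self.graph_enc(node_feats, adj)
        z = self.fuse(torch.cat([z_tok, z_gra], dim=-1))
        loss = nn.functional.mse_loss(self.token_dec(z), z_tok.detach()) \
             + nn.functional.mse_loss(self.graph_dec(z), z_gra.detach())
        return z, loss


# After pre-training, the frozen embedding can feed a small task head,
# e.g. a binary CPU-vs-GPU device-mapping classifier.
if __name__ == "__main__":
    model = MultiModalIRAutoEncoder(vocab_size=1000, node_feat_dim=16)
    tokens = torch.randint(0, 1000, (2, 64))
    node_feats = torch.randn(2, 10, 16)
    adj = torch.randint(0, 2, (2, 10, 10)).float()
    z, loss = model(tokens, node_feats, adj)
    task_head = nn.Linear(z.shape[-1], 2)   # downstream classifier on embeddings
    print(z.shape, loss.item(), task_head(z).shape)
```

The point of the sketch is the transfer-learning workflow the abstract describes: the auto-encoder is trained once on unlabeled IR, and downstream performance-optimization tasks reuse its (optionally frozen) embeddings instead of training task-specific models from scratch.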