MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations (2407.02238v1)

Published 2 Jul 2024 in cs.DC, cs.LG, and cs.PF

Abstract: One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. In particular, approaches mimicking LLMs have been proposed, but these have prohibitively large training costs. In this paper, we propose MIREncoder, a Multi-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs, allowing us to better model code syntax, semantics, and structure. For code-based performance optimization, these features are crucial when making optimization decisions. A pre-trained model/embedding implicitly enables the use of transfer learning and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead and be easily usable. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables the use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations show that our proposed approach can outperform the state of the art while reducing overhead.
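The abstract does not spell out MIREncoder's architecture, but the core multi-modal idea (separate encoders for the textual IR and for a program graph, fused into one embedding consumed by downstream models) can be illustrated with a toy sketch. This is not the paper's implementation: all dimensions, weight shapes, and the single message-passing step are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; all sizes here are hypothetical, not taken from the paper.
VOCAB, D_TOK, D_FEAT, D_GRAPH, D_JOINT, N_NODES = 50, 16, 8, 16, 32, 5

emb_table = rng.normal(size=(VOCAB, D_TOK))           # token embedding table
w_graph = rng.normal(size=(D_FEAT, D_GRAPH))          # graph encoder weights
w_fuse = rng.normal(size=(D_TOK + D_GRAPH, D_JOINT))  # fusion projection

def encode_tokens(token_ids):
    """Mean-pool token embeddings: a stand-in for a textual IR encoder."""
    return emb_table[token_ids].mean(axis=0)          # shape (D_TOK,)

def encode_graph(adj, feats):
    """One message-passing step, then mean-pool: a stand-in for a graph encoder."""
    h = np.tanh(adj @ feats @ w_graph)                # shape (N_NODES, D_GRAPH)
    return h.mean(axis=0)                             # shape (D_GRAPH,)

def joint_embedding(token_ids, adj, feats):
    """Fuse both modalities into one embedding for downstream ML models."""
    z = np.concatenate([encode_tokens(token_ids), encode_graph(adj, feats)])
    return np.tanh(z @ w_fuse)                        # shape (D_JOINT,)

# A fake "program": a few IR token ids plus a 5-node flow graph.
tokens = np.array([3, 17, 42, 8])
adj = np.eye(N_NODES) + np.eye(N_NODES, k=1)          # self-loops + chain edges
feats = rng.normal(size=(N_NODES, D_FEAT))

z = joint_embedding(tokens, adj, feats)
print(z.shape)  # (32,)
```

In an auto-encoder setting such as the one the abstract describes, a decoder would be trained to reconstruct both modalities from `z`; downstream optimization tasks would then consume `z` directly, which is how a single pre-trained embedding enables transfer learning across tasks.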

