MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks (2312.13322v3)
Abstract: With easier access to powerful compute resources, there is a growing trend in AI for software development to build large language models (LLMs) that address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by fine-tuning existing LLMs that support several natural and/or programming languages. We found this design choice puzzling: why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question the design choices made by existing LLMs by developing smaller language models (LMs) for specific domains, which we call domain-specific LMs. Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LLMs but delivers better performance on non-HPC and HPC codes. We pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub, and evaluated its performance against state-of-the-art multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LLMs, outperforms them on normalized-perplexity tests (in relation to model size) while also delivering competitive CodeBLEU scores for high-performance and parallel code generation. In other words, the results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs.
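To make the evaluation setup concrete, below is a minimal Python sketch of a perplexity measurement for a causal code LM on an HPC-style snippet, of the kind the abstract refers to. Everything here is illustrative rather than taken from the paper: the checkpoint name ("gpt2" as a stand-in for any causal code LM), the example loop, and in particular the size-normalization formula are assumptions, not the paper's exact protocol.

```python
# Minimal sketch: perplexity of a causal LM on a code snippet, plus one
# *possible* size-aware normalization. Checkpoint, snippet, and the
# normalization formula are illustrative assumptions, not the paper's method.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; substitute any causal code LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

code = """
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
}
"""

inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy over the sequence; perplexity is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = math.exp(loss.item())

# One plausible way to compare models of very different scales: divide
# log-perplexity by the log of the parameter count. This is an assumed,
# illustrative normalization, not necessarily the formula used in the paper.
n_params = model.num_parameters()
normalized = math.log(perplexity) / math.log(n_params)

print(f"perplexity = {perplexity:.2f}")
print(f"size-normalized score = {normalized:.4f} ({n_params / 1e6:.0f}M params)")
```

Lower perplexity at a much smaller parameter count is the kind of comparison the abstract's "normalized-perplexity tests (in relation to model size)" refers to; the exact normalization used in the paper should be taken from the paper itself.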
Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren