MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks (2312.13322v3)
Abstract: With easier access to powerful compute resources, there is a growing trend in AI for software development to build large language models (LLMs) that address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by fine-tuning existing LLMs that support several natural and/or programming languages. We found this design choice puzzling: why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question the design choices made by existing LLMs by developing smaller language models (LMs) for specific domains, which we call domain-specific LMs. Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LLMs but delivers better performance on non-HPC and HPC codes. We pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub, and evaluated its performance against state-of-the-art multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LLMs, outperforms them on normalized-perplexity tests (in relation to model size) while also delivering competitive CodeBLEU scores for high-performance and parallel code generation. In other words, the results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs.
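To make the evaluation setup concrete, below is a minimal Python sketch of a perplexity measurement for a causal code LM on an HPC-style snippet, of the kind the abstract refers to. Everything here is illustrative rather than taken from the paper: the checkpoint name ("gpt2" as a stand-in for any causal code LM), the example loop, and in particular the size-normalization formula are assumptions, not the paper's exact protocol.

```python
# Minimal sketch: perplexity of a causal LM on a code snippet, plus one
# *possible* size-aware normalization. Checkpoint, snippet, and the
# normalization formula are illustrative assumptions, not the paper's method.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; substitute any causal code LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

code = """
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
}
"""

inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy over the sequence; perplexity is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = math.exp(loss.item())

# One plausible way to compare models of very different scales: divide
# log-perplexity by the log of the parameter count. This is an assumed,
# illustrative normalization, not necessarily the formula used in the paper.
n_params = model.num_parameters()
normalized = math.log(perplexity) / math.log(n_params)

print(f"perplexity = {perplexity:.2f}")
print(f"size-normalized score = {normalized:.4f} ({n_params / 1e6:.0f}M params)")
```

Lower perplexity at a much smaller parameter count is the kind of comparison the abstract's "normalized-perplexity tests (in relation to model size)" refers to; the exact normalization used in the paper should be taken from the paper itself.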
Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren