Investigating the Performance of LLMs for Completing Code in Haskell
Introduction
This paper explores the performance of LLMs, specifically CodeGPT and UniXcoder, at automatic code completion for Haskell, a strongly typed functional programming language. Unlike imperative languages such as Python and JavaScript, which enjoy a wealth of research on code completion models, functional languages like Haskell have received far less attention. This paper aims to fill that gap by fine-tuning and evaluating CodeGPT and UniXcoder on Haskell code. Through evaluation on a publicly available Haskell dataset and on a version of the HumanEval dataset translated into Haskell, the research scrutinizes how well these models adapt to the syntactic and semantic constructs unique to Haskell.
Motivation
Functional programming languages, particularly Haskell, present unique challenges and opportunities for automatic code completion due to their concise syntax and advanced type systems. This paper is motivated by the underrepresentation of functional languages in existing research on code completion models. It investigates whether the knowledge LLMs acquire from imperative programming languages transfers effectively to a functional programming context. As functional programming concepts are increasingly integrated into mainstream languages, understanding code completion in Haskell holds broader implications for improving LLMs' performance across a range of programming paradigms.
Approach
The research strategy involves three principal phases: dataset creation, fine-tuning, and evaluation. Initially, Haskell code samples are collected and processed from the Blastwind dataset available on HuggingFace, alongside the creation of a new dataset by translating Python functions from HumanEval to Haskell. These datasets serve to train and evaluate two LLMs, CodeGPT and UniXcoder, both pre-trained on multiple programming languages. The fine-tuning process adapts these models to perform line completion for Haskell, with evaluation metrics focusing on Exact Match (EM) and Edit Similarity (ES) to assess performance.
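As a concrete illustration of the two metrics, the sketch below computes Exact Match (the fraction of predicted lines identical to the ground truth) and Edit Similarity (one minus the normalized Levenshtein distance, averaged over samples). The function names and the exact normalization are illustrative assumptions, not taken from the paper's evaluation code.

```haskell
-- Illustrative sketch of the EM and ES metrics; names and normalization
-- are assumptions for exposition, not the paper's actual evaluation code.
import Data.Array

-- Levenshtein edit distance via dynamic programming over a lazy array.
levenshtein :: String -> String -> Int
levenshtein a b = table ! (m, n)
  where
    m  = length a
    n  = length b
    as = listArray (1, m) a
    bs = listArray (1, n) b
    table = listArray ((0, 0), (m, n))
              [ cell i j | i <- [0 .. m], j <- [0 .. n] ]
    cell 0 j = j
    cell i 0 = i
    cell i j
      | as ! i == bs ! j = table ! (i - 1, j - 1)
      | otherwise        = 1 + minimum [ table ! (i - 1, j)
                                       , table ! (i, j - 1)
                                       , table ! (i - 1, j - 1) ]

-- Exact Match: fraction of predictions identical to the ground truth.
exactMatch :: [String] -> [String] -> Double
exactMatch preds refs =
  fromIntegral (length (filter id (zipWith (==) preds refs)))
    / fromIntegral (length refs)

-- Edit Similarity: 1 - normalized edit distance, averaged over samples.
editSimilarity :: [String] -> [String] -> Double
editSimilarity preds refs =
  sum (zipWith sim preds refs) / fromIntegral (length refs)
  where
    sim p r
      | null p && null r = 1.0
      | otherwise =
          1 - fromIntegral (levenshtein p r)
                / fromIntegral (max (length p) (length r))
```

Unlike EM, ES gives partial credit: a prediction that differs from the ground truth by a single token still scores close to 1, which is why the two metrics are reported together.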
Results and Discussion
Fine-tuning significantly enhanced both models' ability to complete Haskell code, with notable improvement over their base versions. Compared with results on imperative languages, however, Haskell remains considerably more challenging, as the lower performance metrics indicate. This outcome underscores a fundamental gap in how readily the models adapt from imperative to functional languages, suggesting a need for Haskell-specific adjustments or training strategies.
Manual evaluation on the translated HumanEval dataset reveals notable behavioral differences between CodeGPT and UniXcoder. CodeGPT tends to generate empty outputs or include unnecessary comments, while UniXcoder is prone to incomplete or incorrect predictions. Despite this, neither model demonstrates a consistent failure across specific Haskell features or constructs, indicating a general need for improvement in functional language support.
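For a sense of what such a translated sample looks like, the sketch below shows a Haskell rendering of a HumanEval-style task; this particular translation is hypothetical, not drawn from the paper's dataset. In line completion, the model sees the code up to some point on a line and must predict the remainder of that line.

```haskell
-- Hypothetical Haskell translation of a HumanEval-style task (not from
-- the paper's dataset): check whether any two numbers in the list are
-- closer to each other than the given threshold.
hasCloseElements :: [Double] -> Double -> Bool
hasCloseElements xs threshold =
  or [ abs (a - b) < threshold   -- e.g. the model completes this line
     | (i, a) <- zip [0 :: Int ..] xs
     , (j, b) <- zip [0 ..] xs
     , i /= j ]
```

Constructs like the list comprehension with multiple generators and a guard are exactly the kind of Haskell-specific syntax on which the two models' failure modes (empty outputs versus truncated or incorrect predictions) become visible.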
Implications
The findings highlight the potential and limitations of applying LLMs developed primarily for imperative languages to a functional programming context. For developers and researchers, understanding these dynamics is crucial for advancing code completion tools that support a wider range of programming languages, including Haskell. Future work should focus on developing and incorporating high-quality Haskell datasets into the training process of LLMs. This could improve their performance not only on Haskell but also on the functional programming principles that are increasingly relevant in modern software development.
Concluding Remarks
This paper illuminates the challenges and opportunities in extending LLMs' capabilities to Haskell code completion. While fine-tuning shows promise in enhancing model performance, the distinct nature of functional programming warrants specialized approaches for optimal code completion support. As the landscape of programming continues to evolve, with a greater fusion of imperative and functional paradigms, research such as this paves the way for more versatile and effective code completion tools that cater to the diverse needs of today's developers.