Investigating the Performance of LLMs for Completing Code in Haskell
Introduction
This paper explores the performance of LLMs, specifically CodeGPT and UniXcoder, at automatic code completion for Haskell, a strongly typed functional programming language. Unlike imperative languages such as Python and JavaScript, which enjoy a wealth of research on code completion models, functional languages like Haskell have received far less attention. This paper aims to fill that gap by fine-tuning and evaluating CodeGPT and UniXcoder on Haskell code. Through evaluation on a publicly available Haskell dataset and on a version of the HumanEval dataset translated into Haskell, the research scrutinizes how well these models adapt to the syntactic and semantic constructs unique to Haskell.
Motivation
Functional programming languages, particularly Haskell, present unique challenges and opportunities for automatic code completion due to their concise syntax and advanced type systems. This paper is motivated by the underrepresentation of functional languages in existing research on code completion models. It investigates whether the knowledge LLMs acquire from imperative programming languages transfers effectively to a functional programming context. As functional programming concepts are increasingly integrated into mainstream languages, understanding code completion in Haskell holds broader implications for improving LLMs' performance across a range of programming paradigms.
Approach
The research strategy involves three principal phases: dataset creation, fine-tuning, and evaluation. Initially, Haskell code samples are collected and processed from the Blastwind dataset available on HuggingFace, alongside the creation of a new dataset by translating Python functions from HumanEval to Haskell. These datasets serve to train and evaluate two LLMs, CodeGPT and UniXcoder, both pre-trained on multiple programming languages. The fine-tuning process adapts these models to perform line completion for Haskell, with evaluation metrics focusing on Exact Match (EM) and Edit Similarity (ES) to assess performance.
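As a concrete illustration of the two metrics, the sketch below computes Exact Match (the fraction of predicted lines identical to the ground truth) and Edit Similarity (one minus the normalized Levenshtein distance, averaged over samples). The function names and the exact normalization are illustrative assumptions, not taken from the paper's evaluation code.

```haskell
-- Illustrative sketch of the EM and ES metrics; names and normalization
-- are assumptions for exposition, not the paper's actual evaluation code.
import Data.Array

-- Levenshtein edit distance via dynamic programming over a lazy array.
levenshtein :: String -> String -> Int
levenshtein a b = table ! (m, n)
  where
    m  = length a
    n  = length b
    as = listArray (1, m) a
    bs = listArray (1, n) b
    table = listArray ((0, 0), (m, n))
              [ cell i j | i <- [0 .. m], j <- [0 .. n] ]
    cell 0 j = j
    cell i 0 = i
    cell i j
      | as ! i == bs ! j = table ! (i - 1, j - 1)
      | otherwise        = 1 + minimum [ table ! (i - 1, j)
                                       , table ! (i, j - 1)
                                       , table ! (i - 1, j - 1) ]

-- Exact Match: fraction of predictions identical to the ground truth.
exactMatch :: [String] -> [String] -> Double
exactMatch preds refs =
  fromIntegral (length (filter id (zipWith (==) preds refs)))
    / fromIntegral (length refs)

-- Edit Similarity: 1 - normalized edit distance, averaged over samples.
editSimilarity :: [String] -> [String] -> Double
editSimilarity preds refs =
  sum (zipWith sim preds refs) / fromIntegral (length refs)
  where
    sim p r
      | null p && null r = 1.0
      | otherwise =
          1 - fromIntegral (levenshtein p r)
                / fromIntegral (max (length p) (length r))
```

Unlike EM, ES gives partial credit: a prediction that differs from the ground truth by a single token still scores close to 1, which is why the two metrics are reported together.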
Results and Discussion
Fine-tuning significantly enhanced both models' ability to complete Haskell code, with notable improvement over their base versions. Compared with results on imperative languages, however, Haskell remains considerably more challenging, as the lower performance metrics indicate. This outcome underscores a fundamental gap in how readily the models adapt from imperative to functional languages, suggesting a need for Haskell-specific adjustments or training strategies.
Manual evaluation on the translated HumanEval dataset reveals notable behavioral differences between CodeGPT and UniXcoder. CodeGPT tends to generate empty outputs or include unnecessary comments, while UniXcoder is prone to incomplete or incorrect predictions. Despite this, neither model demonstrates a consistent failure across specific Haskell features or constructs, indicating a general need for improvement in functional language support.
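For a sense of what such a translated sample looks like, the sketch below shows a Haskell rendering of a HumanEval-style task; this particular translation is hypothetical, not drawn from the paper's dataset. In line completion, the model sees the code up to some point on a line and must predict the remainder of that line.

```haskell
-- Hypothetical Haskell translation of a HumanEval-style task (not from
-- the paper's dataset): check whether any two numbers in the list are
-- closer to each other than the given threshold.
hasCloseElements :: [Double] -> Double -> Bool
hasCloseElements xs threshold =
  or [ abs (a - b) < threshold   -- e.g. the model completes this line
     | (i, a) <- zip [0 :: Int ..] xs
     , (j, b) <- zip [0 ..] xs
     , i /= j ]
```

Constructs like the list comprehension with multiple generators and a guard are exactly the kind of Haskell-specific syntax on which the two models' failure modes (empty outputs versus truncated or incorrect predictions) become visible.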
Implications
The findings highlight the potential and limitations of applying LLMs developed primarily for imperative languages to a functional programming context. For developers and researchers, understanding these dynamics is crucial for advancing code completion tools that support a wider range of programming languages, including Haskell. Future work should focus on developing and incorporating high-quality Haskell datasets into the training process of LLMs. This could improve their performance not only on Haskell but also on the functional programming principles that are increasingly relevant in modern software development.
Concluding Remarks
This paper illuminates the challenges and opportunities in extending LLMs' capabilities to Haskell code completion. While fine-tuning shows promise in enhancing model performance, the distinct nature of functional programming warrants specialized approaches for optimal code completion support. As the landscape of programming continues to evolve, with a greater fusion of imperative and functional paradigms, research such as this paves the way for more versatile and effective code completion tools that cater to the diverse needs of today's developers.