- The paper demonstrates that ChatGPT models can generate compilable and often correct code for tasks like numerical integration and conjugate gradient solving.
- The methodology evaluated code across compilation, runtime, and correctness metrics in nine languages using three distinct computational tasks.
- Findings reveal that while simpler tasks perform well, complex parallel computing challenges expose limitations in code accuracy and robustness.
Evaluating ChatGPT's Code Generation Across Languages
Introduction
Large language models (LLMs) are increasingly used to automate programming tasks, including code generation. This paper assesses the performance of ChatGPT versions 3.5 and 4.0 in generating scientific code across a range of programming languages. The aim was to evaluate how effectively these models can help produce reliable and efficient code. The researchers focused on three programming tasks of varying complexity: numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver.
Methodology
To evaluate ChatGPT's code generation capabilities, the following three tasks were chosen:
- Numerical Integration: The models were prompted to compute the integral of the sine function from −π to 2π/3.
- Conjugate Gradient Solver: The models were tasked with generating an iterative solver for a linear system Ax = b.
- Parallel Heat Equation Solver: A parallel 1D heat equation solver using finite differencing was tested to assess parallel computation capabilities.
The languages assessed included C++, Fortran, Go, Julia, Java, Matlab, Python, R, and Rust. Both free and paid versions of ChatGPT were used.
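The paper does not reproduce the generated code, but to make the first task concrete, here is a minimal sketch of the integration problem using the composite trapezoidal rule in Python (the function name and step count are illustrative choices, not taken from the paper):

```python
import math

def trapezoid(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with n trapezoids."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

# Integral of sin(x) from -pi to 2*pi/3.
# Exact value: -cos(2*pi/3) + cos(-pi) = 0.5 - 1.0 = -0.5
result = trapezoid(math.sin, -math.pi, 2 * math.pi / 3)
```

The closed-form answer of −0.5 makes correctness easy to check automatically, which is presumably why a task like this is well suited to a pass/fail evaluation.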
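For the second task, a textbook conjugate gradient iteration for a symmetric positive-definite system can be sketched as follows; this is a generic reference implementation, not the code any of the models actually produced:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for symmetric positive-definite A via conjugate gradients."""
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small SPD test system: exact solution is (1/11, 7/11).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Correctness here can be verified by checking the residual ‖Ax − b‖, which matches the paper's compile/run/correctness evaluation pipeline.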
Quality of the Generated Software
The generated software was evaluated based on three metrics:
- Compilation: Whether the code could compile successfully with a recent compiler.
- Runtime: Whether the code executed without runtime errors.
- Correctness: Whether the code produced accurate outputs.
Here are the summarized results for the numerical integration and conjugate gradient solver tasks:
- Numerical Integration:
- Compilation: Generated codes compiled successfully in nearly every language; Fortran was the main exception.
- Runtime: Few runtime errors were observed.
- Correctness: ChatGPT 3.5 generated correct results for almost all languages, whereas ChatGPT 4.0 occasionally returned incorrect results, possibly from misinterpreting nuances of the problem statement.
- Conjugate Gradient Solver:
- Compilation: Successful in most languages, with minor hiccups in Fortran and Rust.
- Runtime: Generally good, but Python and some other languages had minor issues.
- Correctness: Most generated codes produced correct results, with only isolated exceptions.
The parallel heat equation solver proved to be the most difficult:
- Parallel Heat Equation Solver:
- Compilation errors were observed, particularly in Fortran, Rust, and C++.
- Many generated codes encountered runtime errors.
- Accuracy was an issue; most did not produce correct results, indicating difficulties in handling parallel computing constructs.
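To illustrate what the third task involves, here is a minimal serial sketch of an explicit finite-difference stencil for the 1D heat equation (the paper's versions were parallel; this simplified serial form, with illustrative parameter values, only shows the core stencil update the models had to get right):

```python
import numpy as np

def heat_1d(u0, alpha, dx, dt, steps):
    """Explicit finite-difference solver for u_t = alpha * u_xx
    with fixed (Dirichlet) boundary values."""
    u = u0.copy()
    r = alpha * dt / dx**2   # explicit scheme is stable only for r <= 0.5
    for _ in range(steps):
        # 3-point stencil applied to all interior points at once
        u[1:-1] = u[1:-1] + r * (u[2:] - 2 * u[1:-1] + u[:-2])
    return u

n = 101
x = np.linspace(0.0, 1.0, n)
u0 = np.sin(np.pi * x)       # initial temperature profile, zero at boundaries
u = heat_1d(u0, alpha=1.0, dx=x[1] - x[0], dt=4e-5, steps=250)
```

Even in serial form, subtle requirements like the stability bound on `r` and correct boundary handling leave room for error; parallelizing the stencil adds halo exchange and synchronization on top, which is plausibly where the generated codes broke down.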
Code Metrics
The paper also analyzed lines of code (LOC) and applied the Constructive Cost Model (COCOMO) to estimate development effort and cost as a proxy for code quality:
- Lines of Code:
- Matlab and R typically resulted in fewer lines of code.
- Python and Julia fell roughly in the middle.
- Low-level languages like C++, Fortran, and Rust generated more lines.
- Code Quality (using COCOMO metrics):
- C++ and Java generally showed robust performance across tasks.
- Matlab consistently demonstrated high code quality.
- Despite Python's wide adoption, its generated code scored on the lower end of the quality metrics.
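For context on the COCOMO-based metrics, the basic form of the model estimates development effort from code size alone. A small sketch of Boehm's basic COCOMO (the paper does not say which COCOMO variant or coefficients it used; these are the standard textbook values):

```python
def cocomo_basic(kloc, mode="organic"):
    """Basic COCOMO effort estimate in person-months: effort = a * KLOC^b.
    Coefficients (a, b) are Boehm's standard values for each project mode."""
    coeffs = {
        "organic":       (2.4, 1.05),  # small teams, familiar problems
        "semi-detached": (3.0, 1.12),  # intermediate size and constraints
        "embedded":      (3.6, 1.20),  # tight hardware/software constraints
    }
    a, b = coeffs[mode]
    return a * kloc ** b

# e.g. a 50-line generated program is 0.05 KLOC
effort = cocomo_basic(0.05)
```

Because effort grows superlinearly with KLOC, the LOC differences noted above (Matlab and R compact, C++/Fortran/Rust verbose) translate directly into diverging effort estimates across languages.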
Implications and Future Work
The paper highlighted both the capabilities and limitations of using LLMs for code generation. While simpler tasks like numerical integration and solving linear systems showed promising results, more complex parallel computing tasks revealed significant challenges. This underscores the need for continued refinement of these tools, especially in handling advanced computational tasks requiring parallelism.
Future Work:
- Further development to improve the handling of more complex code tasks, especially those involving parallel computations.
- Exploration of specialized models for high-performance computing (HPC) tasks to address the limitations found in this paper.
Conclusion
This paper provides a clear-eyed look at the current state of ChatGPT's code generation abilities, illustrating that while there's impressive potential, there are notable areas for improvement, particularly with more complex and parallel tasks. As these models continue to evolve, they offer exciting possibilities for automating and streamlining programming tasks, although careful validation and expert oversight remain crucial.