- The paper demonstrates that ChatGPT models can generate compilable and often correct code for tasks like numerical integration and conjugate gradient solving.
- The methodology evaluated code across compilation, runtime, and correctness metrics in nine languages using three distinct computational tasks.
- Findings reveal that while simpler tasks perform well, complex parallel computing challenges expose limitations in code accuracy and robustness.
Evaluating ChatGPT's Code Generation Across Languages
Introduction
Large language models (LLMs) are increasingly used to automate programming tasks, including code generation. This paper assesses the performance of ChatGPT versions 3.5 and 4.0 in generating scientific code across a range of programming languages. The aim was to evaluate how effectively these models can help produce reliable and efficient code. The researchers focused on three programming tasks of varying complexity: numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver.
Methodology
To evaluate ChatGPT's code generation capabilities, the following three tasks were chosen:
- Numerical Integration: The models were prompted to compute the integral of the sine function from −π to 2π/3.
- Conjugate Gradient Solver: The models were tasked with generating an iterative solver for a linear system Ax = b.
- Parallel Heat Equation Solver: A parallel 1D heat equation solver using finite differencing was tested to assess parallel computation capabilities.
The languages assessed included C++, Fortran, Go, Julia, Java, Matlab, Python, R, and Rust. Both free and paid versions of ChatGPT were used.
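The paper does not reproduce the generated code, but to make the first task concrete, here is a minimal sketch of the integration problem using the composite trapezoidal rule in Python (the function name and step count are illustrative choices, not taken from the paper):

```python
import math

def trapezoid(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with n trapezoids."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

# Integral of sin(x) from -pi to 2*pi/3.
# Exact value: -cos(2*pi/3) + cos(-pi) = 0.5 - 1.0 = -0.5
result = trapezoid(math.sin, -math.pi, 2 * math.pi / 3)
```

The closed-form answer of −0.5 makes correctness easy to check automatically, which is presumably why a task like this is well suited to a pass/fail evaluation.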
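For the second task, a textbook conjugate gradient iteration for a symmetric positive-definite system can be sketched as follows; this is a generic reference implementation, not the code any of the models actually produced:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for symmetric positive-definite A via conjugate gradients."""
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small SPD test system: exact solution is (1/11, 7/11).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Correctness here can be verified by checking the residual ‖Ax − b‖, which matches the paper's compile/run/correctness evaluation pipeline.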
Quality of the Generated Software
The generated software was evaluated based on three metrics:
- Compilation: Whether the code could compile successfully with a recent compiler.
- Runtime: Whether the code executed without runtime errors.
- Correctness: Whether the code produced accurate outputs.
Here are the summarized results for the numerical integration and conjugate gradient solver tasks:
- Numerical Integration:
- Compilation: Generated codes compiled successfully in nearly every language; Fortran was the main exception.
- Runtime: Few runtime errors were observed.
- Correctness: ChatGPT 3.5 generated correct results for almost all languages, whereas ChatGPT 4.0 occasionally returned incorrect results, possibly from misinterpreting nuances of the problem statement.
- Conjugate Gradient Solver:
- Compilation: Successful in most languages, with minor hiccups in Fortran and Rust.
- Runtime: Generally good, but Python and some other languages had minor issues.
- Correctness: Most generated codes produced correct results, with only isolated exceptions.
The parallel heat equation solver proved to be the most difficult:
- Parallel Heat Equation Solver:
- Compilation errors were observed, particularly in Fortran, Rust, and C++.
- Many generated codes encountered runtime errors.
- Accuracy was an issue; most did not produce correct results, indicating difficulties in handling parallel computing constructs.
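To illustrate what the third task involves, here is a minimal serial sketch of an explicit finite-difference stencil for the 1D heat equation (the paper's versions were parallel; this simplified serial form, with illustrative parameter values, only shows the core stencil update the models had to get right):

```python
import numpy as np

def heat_1d(u0, alpha, dx, dt, steps):
    """Explicit finite-difference solver for u_t = alpha * u_xx
    with fixed (Dirichlet) boundary values."""
    u = u0.copy()
    r = alpha * dt / dx**2   # explicit scheme is stable only for r <= 0.5
    for _ in range(steps):
        # 3-point stencil applied to all interior points at once
        u[1:-1] = u[1:-1] + r * (u[2:] - 2 * u[1:-1] + u[:-2])
    return u

n = 101
x = np.linspace(0.0, 1.0, n)
u0 = np.sin(np.pi * x)       # initial temperature profile, zero at boundaries
u = heat_1d(u0, alpha=1.0, dx=x[1] - x[0], dt=4e-5, steps=250)
```

Even in serial form, subtle requirements like the stability bound on `r` and correct boundary handling leave room for error; parallelizing the stencil adds halo exchange and synchronization on top, which is plausibly where the generated codes broke down.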
Code Metrics
The paper also analyzed lines of code (LOC) and applied the Constructive Cost Model (COCOMO) to estimate development effort and cost as a proxy for code quality:
- Lines of Code:
- Matlab and R typically resulted in fewer lines of code.
- Python and Julia fell roughly in the middle.
- Low-level languages like C++, Fortran, and Rust generated more lines.
- Code Quality (using COCOMO metrics):
- C++ and Java generally showed robust performance across tasks.
- Matlab consistently demonstrated high code quality.
- Despite Python's wide adoption, its generated code scored on the lower end of the quality metrics.
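For context on the COCOMO-based metrics, the basic form of the model estimates development effort from code size alone. A small sketch of Boehm's basic COCOMO (the paper does not say which COCOMO variant or coefficients it used; these are the standard textbook values):

```python
def cocomo_basic(kloc, mode="organic"):
    """Basic COCOMO effort estimate in person-months: effort = a * KLOC^b.
    Coefficients (a, b) are Boehm's standard values for each project mode."""
    coeffs = {
        "organic":       (2.4, 1.05),  # small teams, familiar problems
        "semi-detached": (3.0, 1.12),  # intermediate size and constraints
        "embedded":      (3.6, 1.20),  # tight hardware/software constraints
    }
    a, b = coeffs[mode]
    return a * kloc ** b

# e.g. a 50-line generated program is 0.05 KLOC
effort = cocomo_basic(0.05)
```

Because effort grows superlinearly with KLOC, the LOC differences noted above (Matlab and R compact, C++/Fortran/Rust verbose) translate directly into diverging effort estimates across languages.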
Implications and Future Work
The paper highlighted both the capabilities and limitations of using LLMs for code generation. While simpler tasks like numerical integration and solving linear systems showed promising results, more complex parallel computing tasks revealed significant challenges. This underscores the need for continued refinement of these tools, especially in handling advanced computational tasks requiring parallelism.
Future Work:
- Further development to improve the handling of more complex code tasks, especially those involving parallel computations.
- Exploration of specialized models for high-performance computing (HPC) tasks to address the limitations found in this paper.
Conclusion
This paper provides a clear-eyed look at the current state of ChatGPT's code generation abilities, illustrating that while there's impressive potential, there are notable areas for improvement, particularly with more complex and parallel tasks. As these models continue to evolve, they offer exciting possibilities for automating and streamlining programming tasks, although careful validation and expert oversight remain crucial.