Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation (2404.11160v1)

Published 17 Apr 2024 in cs.AI

Abstract: LLMs have become the go-to solution for many NLP tasks due to their ability to tackle various problems and produce high-quality results. Specifically, they are increasingly used to automatically generate code, easing the burden on developers by handling repetitive tasks. However, this improvement in quality has led to high computational and memory demands, making LLMs inaccessible to users with limited resources. In this paper, we focus on Central Processing Unit (CPU)-compatible models and conduct a thorough semi-manual evaluation of their strengths and weaknesses in generating Python code. We enhance their performance by introducing a Chain-of-Thought prompt that guides the model in problem-solving. Additionally, we propose a dataset of 60 programming problems with varying difficulty levels for evaluation purposes. Our assessment also includes testing these models on two state-of-the-art datasets: HumanEval and EvalPlus. We commit to sharing our dataset and experimental results publicly to ensure transparency.

Evaluation of Low-Cost CPU-Compatible Models for Python Code Generation

Introduction to CPU-Compatible Models in Python Code Generation

Within NLP, Python code generation has emerged as an essential task, driven by the language's widespread use and the need to automate coding tasks. LLMs have played a pivotal role in these advances; however, their resource-intensive nature often limits their accessibility. This paper contributes to the field by evaluating various CPU-compatible, open-source models specifically on Python code generation.

Experiment Setup and Models Evaluated

The evaluation is conducted on quantized models run with llama.cpp, an inference framework optimized for CPUs. The models examined include versions of LLaMA and Mistral as well as derivatives such as Dolphin and OpenHermes, quantized to between 2 and 8 bits. The paper leverages a custom dataset of sixty Python coding problems of varying difficulty, alongside the established HumanEval and EvalPlus benchmarks, to gauge the models' code-synthesis capabilities.
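
For concreteness, the sketch below shows how such a quantized GGUF model can be queried on a CPU through the llama-cpp-python bindings, using a Chain-of-Thought-style instruction of the kind the paper advocates. The file name, thread count, prompt wording, and generation parameters are assumptions for illustration, not the paper's exact setup.

```python
# A minimal sketch of CPU inference with a quantized GGUF model through the
# llama-cpp-python bindings. The model file name, thread count, generation
# parameters, and prompt wording are illustrative assumptions, not the
# paper's exact configuration.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical 4-bit GGUF file
    n_ctx=2048,    # context window in tokens
    n_threads=8,   # CPU threads used for inference
)

# A Chain-of-Thought-style instruction: reason first, then emit the solution
# in a fenced code block so it can be extracted automatically.
prompt = (
    "You are a Python programming assistant.\n"
    "First reason step by step about the problem, then write the final "
    "solution inside a ```python code block.\n\n"
    "Problem: write a function fib(n) that returns the n-th Fibonacci number.\n"
)

out = llm(prompt, max_tokens=512, temperature=0.2)
print(out["choices"][0]["text"])
```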

Key Outcomes and Model Comparisons

Performance Across Datasets

  • On the custom dataset, models generally struggled to produce output in the required format, in addition to producing correct solutions. Notably:
    • Mistral variants showed robust problem comprehension and adherence to output format requirements.
    • Dolphin and OpenHermes models excelled in code generation but often failed to align outputs with the expected formats.
  • On HumanEval and EvalPlus, Dolphin models notably surpassed the others, showing strength in raw code synthesis once format constraints are removed (a minimal version of the underlying correctness check is sketched below).
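
The HumanEval and EvalPlus comparison rests on functional correctness: a completion counts only if it passes the problem's unit tests. The snippet below is a minimal, assumption-laden sketch of that check and of pass@1 scoring; the real harnesses sandbox execution and support pass@k over multiple samples.

```python
# Minimal sketch of HumanEval/EvalPlus-style functional-correctness scoring.
# The function names are illustrative; real harnesses sandbox the untrusted
# code and compute pass@k from several samples per problem.
import subprocess
import sys
import tempfile


def passes_tests(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run a generated solution plus its unit tests in a fresh process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single generated sample passes all tests."""
    return sum(results) / len(results)
```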

Computational Efficiency

The paper meticulously considers the operational feasibility on standard CPUs, emphasizing models' storage, RAM requirements, and inference times:

  • Models like Mistral and Llama demonstrated a balance between performance and computational demands.
  • The smallest models required less than 6 GB of disk space and around 5 GB of RAM, manageable within a regular desktop environment (a rough size estimate is sketched after this list).
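
As a rough sanity check (an illustrative estimate, not a figure reported in the paper), a quantized model's footprint scales with parameter count times bits per weight, plus some overhead for quantization scales and metadata:

```python
# Back-of-the-envelope size estimate for a quantized model; the overhead
# factor is an assumption covering quantization scales and metadata.
def quantized_size_gb(n_params_billions: float, bits_per_weight: float,
                      overhead: float = 1.1) -> float:
    """Approximate on-disk/in-memory size of quantized weights, in GB."""
    total_bytes = n_params_billions * 1e9 * bits_per_weight / 8 * overhead
    return total_bytes / 1e9


# A 7B-parameter model at 4 bits per weight lands near 4 GB, consistent
# with the sub-6 GB figure quoted for the smallest models.
print(f"{quantized_size_gb(7, 4):.1f} GB")
```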

Challenges and Limitations

While CPU-compatible models offer an accessible alternative to GPU-dependent ones, they encounter specific challenges:

  • Output Format Compliance: Some models, though effective at raw code generation, struggle to adhere to strict output formats, incurring penalties in structured evaluations (a lenient code-extraction step of the kind sketched after this list is a common mitigation).
  • Resource Requirements: Despite optimizations, the most powerful configurations of models like Mixtral still demand resources beyond typical CPU capacities, limiting their practical utility.
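
Format-compliance failures are commonly mitigated with a lenient post-processing step before the generated code is executed; the sketch below shows one such heuristic. The fence delimiters and the fallback behaviour are assumptions for illustration, not the paper's evaluation protocol.

```python
# Lenient extraction heuristic: take the first fenced Python block from a
# model response, falling back to the raw text when no fence is present.
import re


def extract_python_code(response: str) -> str:
    """Return the first fenced Python code block, or the raw response."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()
```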

Future Research Directions

The continual evolution of CPU-friendly LLMs for coding tasks suggests several trajectories for future work:

  • Enhanced Model Training: Further refining model architectures and training paradigms to balance performance with resource efficiency.
  • Expanded Task Coverage: Investigating models' capabilities across a broader spectrum of coding-related tasks, such as code summarization, bug-fixing, or even cross-language translation.

Conclusion

This investigation underscores the significant potential of CPU-compatible models to democratize Python code generation, making it more accessible across varied computational environments. By highlighting specific strengths and weaknesses across different models and tasks, this research provides valuable insights that pave the way for future enhancements in the domain of AI-powered coding assistance.
