Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation (2404.11160v1)

Published 17 Apr 2024 in cs.AI

Abstract: LLMs have become the go-to solution for many NLP tasks due to their ability to tackle various problems and produce high-quality results. Specifically, they are increasingly used to automatically generate code, easing the burden on developers by handling repetitive tasks. However, this improvement in quality has led to high computational and memory demands, making LLMs inaccessible to users with limited resources. In this paper, we focus on Central Processing Unit (CPU)-compatible models and conduct a thorough semi-manual evaluation of their strengths and weaknesses in generating Python code. We enhance their performance by introducing a Chain-of-Thought prompt that guides the model in problem-solving. Additionally, we propose a dataset of 60 programming problems with varying difficulty levels for evaluation purposes. Our assessment also includes testing these models on two state-of-the-art datasets: HumanEval and EvalPlus. We commit to sharing our dataset and experimental results publicly to ensure transparency.

Evaluation of Low-Cost CPU-Compatible Models for Python Code Generation

Introduction to CPU-Compatible Models in Python Code Generation

Within NLP, Python code generation has emerged as an essential task, driven by the language's widespread use and the need to automate coding tasks. LLMs have played a pivotal role in these advances; however, their resource-intensive nature often limits their accessibility. This paper contributes to the field by evaluating various CPU-compatible, open-source models specifically on Python code generation.

Experiment Setup and Models Evaluated

The evaluation is conducted on quantized models run with llama.cpp, an inference framework optimized for CPUs. The models examined include versions of LLaMA and Mistral as well as derivatives such as Dolphin and OpenHermes, quantized to between 2 and 8 bits. The paper leverages a custom dataset of sixty Python coding problems of varying difficulty, alongside the established HumanEval and EvalPlus benchmarks, to gauge the models' code-synthesis capabilities.
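
For concreteness, the sketch below shows how such a quantized GGUF model can be queried on a CPU through the llama-cpp-python bindings, using a Chain-of-Thought-style instruction of the kind the paper advocates. The file name, thread count, prompt wording, and generation parameters are assumptions for illustration, not the paper's exact setup.

```python
# A minimal sketch of CPU inference with a quantized GGUF model through the
# llama-cpp-python bindings. The model file name, thread count, generation
# parameters, and prompt wording are illustrative assumptions, not the
# paper's exact configuration.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical 4-bit GGUF file
    n_ctx=2048,    # context window in tokens
    n_threads=8,   # CPU threads used for inference
)

# A Chain-of-Thought-style instruction: reason first, then emit the solution
# in a fenced code block so it can be extracted automatically.
prompt = (
    "You are a Python programming assistant.\n"
    "First reason step by step about the problem, then write the final "
    "solution inside a ```python code block.\n\n"
    "Problem: write a function fib(n) that returns the n-th Fibonacci number.\n"
)

out = llm(prompt, max_tokens=512, temperature=0.2)
print(out["choices"][0]["text"])
```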

Key Outcomes and Model Comparisons

Performance Across Datasets

  • On the custom dataset, models generally struggled to produce output in the required format, in addition to producing correct solutions. Notably:
    • Mistral variants showed robust problem comprehension and adherence to output format requirements.
    • Dolphin and OpenHermes models excelled in code generation but often failed to align outputs with the expected formats.
  • On HumanEval and EvalPlus, Dolphin models notably surpassed the others, showing strength in raw code synthesis once format constraints are removed (a minimal version of the underlying correctness check is sketched below).
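
The HumanEval and EvalPlus comparison rests on functional correctness: a completion counts only if it passes the problem's unit tests. The snippet below is a minimal, assumption-laden sketch of that check and of pass@1 scoring; the real harnesses sandbox execution and support pass@k over multiple samples.

```python
# Minimal sketch of HumanEval/EvalPlus-style functional-correctness scoring.
# The function names are illustrative; real harnesses sandbox the untrusted
# code and compute pass@k from several samples per problem.
import subprocess
import sys
import tempfile


def passes_tests(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run a generated solution plus its unit tests in a fresh process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single generated sample passes all tests."""
    return sum(results) / len(results)
```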

Computational Efficiency

The paper meticulously considers the operational feasibility on standard CPUs, emphasizing models' storage, RAM requirements, and inference times:

  • Models like Mistral and Llama demonstrated a balance between performance and computational demands.
  • The smallest models required less than 6 GB of disk space and around 5 GB of RAM, manageable within a regular desktop environment (a rough size estimate is sketched after this list).
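
As a rough sanity check (an illustrative estimate, not a figure reported in the paper), a quantized model's footprint scales with parameter count times bits per weight, plus some overhead for quantization scales and metadata:

```python
# Back-of-the-envelope size estimate for a quantized model; the overhead
# factor is an assumption covering quantization scales and metadata.
def quantized_size_gb(n_params_billions: float, bits_per_weight: float,
                      overhead: float = 1.1) -> float:
    """Approximate on-disk/in-memory size of quantized weights, in GB."""
    total_bytes = n_params_billions * 1e9 * bits_per_weight / 8 * overhead
    return total_bytes / 1e9


# A 7B-parameter model at 4 bits per weight lands near 4 GB, consistent
# with the sub-6 GB figure quoted for the smallest models.
print(f"{quantized_size_gb(7, 4):.1f} GB")
```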

Challenges and Limitations

While CPU-compatible models offer an accessible alternative to GPU-dependent ones, they encounter specific challenges:

  • Output Format Compliance: Some models, though effective at raw code generation, struggle to adhere to strict output formats, incurring penalties in structured evaluations (a lenient code-extraction step of the kind sketched after this list is a common mitigation).
  • Resource Requirements: Despite optimizations, the most powerful configurations of models like Mixtral still demand resources beyond typical CPU capacities, limiting their practical utility.
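
Format-compliance failures are commonly mitigated with a lenient post-processing step before the generated code is executed; the sketch below shows one such heuristic. The fence delimiters and the fallback behaviour are assumptions for illustration, not the paper's evaluation protocol.

```python
# Lenient extraction heuristic: take the first fenced Python block from a
# model response, falling back to the raw text when no fence is present.
import re


def extract_python_code(response: str) -> str:
    """Return the first fenced Python code block, or the raw response."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()
```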

Future Research Directions

The continual evolution of CPU-friendly LLMs for coding tasks suggests several trajectories for future work:

  • Enhanced Model Training: Further refining model architectures and training paradigms to balance performance with resource efficiency.
  • Expanded Task Coverage: Investigating models' capabilities across a broader spectrum of coding-related tasks, such as code summarization, bug-fixing, or even cross-language translation.

Conclusion

This investigation underscores the significant potential of CPU-compatible models to democratize Python code generation, making it more accessible across varied computational environments. By highlighting specific strengths and weaknesses across different models and tasks, this research provides valuable insights that pave the way for future enhancements in the domain of AI-powered coding assistance.
