Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation (2404.11160v1)
Abstract: Large Language Models (LLMs) have become the go-to solution for many natural language processing (NLP) tasks due to their ability to tackle diverse problems and produce high-quality results. In particular, they are increasingly used to generate code automatically, easing the burden on developers by handling repetitive tasks. However, this improvement in quality has come with high computational and memory demands, making LLMs inaccessible to users with limited resources. In this paper, we focus on Central Processing Unit (CPU)-compatible models and conduct a thorough semi-manual evaluation of their strengths and weaknesses in generating Python code. We enhance their performance by introducing a Chain-of-Thought prompt that guides the model through problem-solving. Additionally, we propose a dataset of 60 programming problems with varying difficulty levels for evaluation purposes. Our assessment also includes testing these models on two state-of-the-art benchmarks: HumanEval and EvalPlus. We commit to sharing our dataset and experimental results publicly to ensure transparency.
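To make the setup concrete, below is a minimal, illustrative sketch of how a CPU-compatible (GGUF-quantized) model could be prompted with a Chain-of-Thought style instruction to generate Python code. This is not the paper's published pipeline: the model path, prompt wording, and generation parameters are assumptions, and the example relies on the open-source llama-cpp-python bindings, which run inference entirely on CPU.

```python
# Illustrative sketch only: the paper does not publish its exact prompting code.
# Assumes a GGUF-quantized instruction model and the llama-cpp-python bindings
# (pip install llama-cpp-python), which perform CPU-only inference.
from llama_cpp import Llama

# Hypothetical model path; any CPU-compatible GGUF checkpoint could be substituted.
MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

# A Chain-of-Thought style prompt: the model is asked to reason step by step
# about the problem before emitting the final Python function.
COT_TEMPLATE = (
    "You are an expert Python programmer.\n"
    "Problem:\n{problem}\n\n"
    "First, think step by step: restate the inputs and outputs, outline the "
    "algorithm, and consider edge cases. Then write the final solution as a "
    "single Python function inside a ```python code block.\n"
)

def generate_solution(problem: str) -> str:
    """Generate a candidate Python solution for one programming problem."""
    llm = Llama(model_path=MODEL_PATH, n_ctx=2048, verbose=False)
    output = llm(
        COT_TEMPLATE.format(problem=problem),
        max_tokens=512,
        temperature=0.2,  # low temperature favors deterministic, testable code
    )
    return output["choices"][0]["text"]

if __name__ == "__main__":
    print(generate_solution("Return the n-th Fibonacci number."))
```

In an evaluation loop such as HumanEval or EvalPlus, the generated completion would then be extracted from the code block and executed against the benchmark's unit tests to compute functional-correctness metrics such as pass@1.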
Authors: Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Merieme Bouhandi, Walid Dahhane, El Hassane Ettifouri