DocuMint: Docstring Generation for Python using Small Language Models (2405.10243v1)
Abstract: Effective communication, specifically through documentation, is the beating heart of collaboration among contributors in software development. Recent advancements in language models (LMs) have enabled the introduction of a new type of actor in that ecosystem: LM-powered assistants capable of code generation, optimization, and maintenance. Our study investigates the efficacy of small language models (SLMs) for generating high-quality docstrings by assessing accuracy, conciseness, and clarity, benchmarking performance quantitatively through mathematical formulas and qualitatively through human evaluation using a Likert scale. Further, we introduce DocuMint, a large-scale supervised fine-tuning dataset with 100,000 samples. In quantitative experiments, Llama 3 8B achieved the best performance across all metrics, with conciseness and clarity scores of 0.605 and 64.88, respectively. However, under human evaluation, CodeGemma 7B achieved the highest overall score, with an average of 8.3 out of 10 across all metrics. Fine-tuning the CodeGemma 2B model on the DocuMint dataset led to significant improvements in performance across all metrics, with gains of up to 22.5% in conciseness. The fine-tuned model and the dataset can be found on HuggingFace, and the code can be found in the accompanying repository.
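To make the setup the abstract describes more concrete, here is a minimal sketch of prompting a small code LM for a docstring and scoring the output. The model id (`google/codegemma-2b`), the prompt template, greedy decoding, and the `conciseness_proxy` helper are all illustrative assumptions, not the paper's exact configuration or metric.

```python
# Minimal sketch: draft a docstring with a small code LM, then apply a
# toy conciseness check. All names below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/codegemma-2b"  # assumed base SLM; gated on HuggingFace

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

function_source = """def moving_average(xs, window):
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]
"""

# Hypothetical instruction-style prompt; the paper's template may differ.
prompt = (
    "Write a concise, clear Python docstring for this function:\n\n"
    f"{function_source}\nDocstring:\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
docstring = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],  # keep only new tokens
    skip_special_tokens=True,
).strip()
print(docstring)

# Toy conciseness proxy (NOT the paper's formula): fewer docstring words
# per source word reads as more concise, so lower is better.
def conciseness_proxy(doc: str, src: str) -> float:
    return len(doc.split()) / max(len(src.split()), 1)

print(f"conciseness proxy: {conciseness_proxy(docstring, function_source):.3f}")
```

Swapping `MODEL_ID` for the authors' fine-tuned CodeGemma 2B checkpoint on HuggingFace would reproduce the comparison the abstract reports, once that model id is known.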
Authors: Bibek Poudel, Adam Cook, Sekou Traore, Shelah Ameli