
Kotlin ML Pack: Technical Report (2405.19250v1)

Published 29 May 2024 in cs.SE, cs.AI, and cs.PL

Abstract: In this technical report, we present three novel datasets of Kotlin code: KStack, KStack-clean, and KExercises. We also describe the results of fine-tuning CodeLlama and DeepSeek models on this data. Additionally, we present a version of the HumanEval benchmark rewritten by human experts into Kotlin - both the solutions and the tests. Our results demonstrate that small, high-quality datasets (KStack-clean and KExercises) can significantly improve model performance on code generation tasks, achieving up to a 16-point increase in pass rate on the HumanEval benchmark. Lastly, we discuss potential future work in the field of improving language modeling for Kotlin, including the use of static analysis tools in the learning process and the introduction of more intricate and realistic benchmarks.


Summary

  • The paper introduces three main Kotlin datasets—KStack, KStack-clean, and KExercises—to enhance code generation capabilities.
  • The study demonstrates that fine-tuning CodeLlama and DeepSeek models on curated datasets raises the HumanEval pass rate by up to 16 points, with the best configuration reaching 55.28%.
  • The report features a human-translated Kotlin HumanEval benchmark, offering a robust evaluation framework for model performance.

Novel High-Quality Datasets and Advancements in Kotlin Code Generation

In this technical report, the authors introduce three novel datasets for Kotlin code generation: KStack, KStack-clean, and KExercises. The report also details the fine-tuning of CodeLlama and DeepSeek models on these datasets and the creation of a Kotlin version of the HumanEval benchmark, translated by human experts.

Datasets for Kotlin Code

KStack is a comprehensive, up-to-date collection of open-source Kotlin code, positioned as a Kotlin counterpart to pre-existing datasets such as The Stack. It gathers GitHub repositories containing Kotlin code, filters them for permissive licenses, and de-duplicates the files so the corpus stays representative. After this processing, KStack contains around four million files comprising roughly 3.1 billion tokens.
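To make the collection step concrete, the sketch below shows a minimal Kotlin pipeline of this kind. The SourceFile record, the license allow-list, and exact-duplicate hashing are illustrative assumptions, not the authors' actual implementation.

```kotlin
import java.security.MessageDigest

// Hypothetical record for one source file pulled from a GitHub repository.
data class SourceFile(val repo: String, val path: String, val license: String, val content: String)

// Licenses treated as permissive in this sketch; the paper's exact allow-list may differ.
val permissiveLicenses = setOf("MIT", "Apache-2.0", "BSD-3-Clause")

fun sha256(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray())
        .joinToString("") { "%02x".format(it) }

// Keep only Kotlin files under permissive licenses, then drop exact duplicates by content hash.
fun buildKStack(files: Sequence<SourceFile>): List<SourceFile> {
    val seenHashes = HashSet<String>()
    return files
        .filter { it.path.endsWith(".kt") || it.path.endsWith(".kts") }
        .filter { it.license in permissiveLicenses }
        .filter { seenHashes.add(sha256(it.content)) } // true only the first time a hash is seen
        .toList()
}
```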

To further raise quality, KStack-clean was derived by training a classifier to predict code quality from a small labeled subset. The classifier was then applied to the entire dataset, and the top 25,000 highest-scoring examples were retained, making the subset considerably more useful for fine-tuning.
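A minimal sketch of this selection step, assuming the classifier is available as a scoring function (its features and training are not shown here), might look like this:

```kotlin
// Hypothetical quality filter: score each example with a learned classifier and keep the top 25,000.
// `qualityScore` stands in for the paper's classifier; its internals are an assumption here.
fun <T> selectTopQuality(
    examples: List<T>,
    qualityScore: (T) -> Double,
    topK: Int = 25_000
): List<T> = examples.sortedByDescending(qualityScore).take(topK)
```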

KExercises, a novel instruction dataset, was generated by translating Python-based exercises into Kotlin using GPT-3.5-turbo. Each exercise pairs a natural-language instruction with a Kotlin solution, so the dataset strengthens the models' natural language comprehension in addition to their code generation capabilities, and it amounts to approximately 3.5 million tokens.
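The sketch below illustrates one plausible shape for an entry in such an exercise dataset; the field names and the sample task are assumptions for illustration, not the released schema.

```kotlin
// Illustrative shape of a single exercise: a natural-language instruction paired with a Kotlin solution.
data class KotlinExercise(val instruction: String, val solution: String)

val sampleExercise = KotlinExercise(
    instruction = "Write a function that returns the sum of the even numbers in a list.",
    solution = """
        fun sumOfEvens(numbers: List<Int>): Int =
            numbers.filter { it % 2 == 0 }.sum()
    """.trimIndent()
)
```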

Evaluation and Benchmarking

The HumanEval benchmark, both solutions and tests, is translated into Kotlin by human experts, addressing limitations of existing translations around type generality and consistent floating-point precision. This benchmark anchors the evaluation of the models and is complemented by metrics such as compilation error rate, runtime error rate, and syntax error rate.
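As an illustration of what such a task looks like, the following is a hand-written Kotlin rendering of HumanEval's first problem together with a couple of checks; it is an assumption for illustration and is not copied from the released benchmark.

```kotlin
import kotlin.math.abs

/**
 * Return true if any two distinct numbers in the list are closer to each
 * other than the given threshold.
 */
fun hasCloseElements(numbers: List<Double>, threshold: Double): Boolean {
    for (i in numbers.indices) {
        for (j in numbers.indices) {
            if (i != j && abs(numbers[i] - numbers[j]) < threshold) return true
        }
    }
    return false
}

// A couple of checks in the spirit of the benchmark's hidden tests.
fun main() {
    check(hasCloseElements(listOf(1.0, 2.0, 3.9, 4.0, 5.0, 2.2), 0.3))
    check(!hasCloseElements(listOf(1.0, 2.0, 3.9, 4.0, 5.0, 2.2), 0.05))
    println("All checks passed")
}
```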

Experimental Setup and Findings

Several base models, including CodeLlama-7B and DeepSeek-coder-6.7B, were fine-tuned on the different datasets. The results indicated significant performance improvements from the smaller, high-quality datasets. Specifically, fine-tuning DeepSeek-coder-6.7B on KExercises raised the Kotlin HumanEval pass rate to 55.28%, illustrating the impact of instructional data. KStack-clean also yielded considerable improvements, underscoring the importance of dataset curation and quality over sheer size.
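For reference, the per-task bookkeeping behind such numbers can be sketched as follows; the outcome labels are illustrative assumptions rather than the paper's exact categories.

```kotlin
// Each generated solution ends in exactly one outcome; pass rate and error rates are simple shares.
enum class Outcome { PASSED, COMPILATION_ERROR, RUNTIME_ERROR, FAILED_TESTS }

fun rate(outcomes: List<Outcome>, target: Outcome): Double =
    100.0 * outcomes.count { it == target } / outcomes.size

fun main() {
    val outcomes = listOf(Outcome.PASSED, Outcome.PASSED, Outcome.COMPILATION_ERROR, Outcome.FAILED_TESTS)
    println("pass rate: %.2f%%".format(rate(outcomes, Outcome.PASSED)))                      // 50.00%
    println("compilation errors: %.2f%%".format(rate(outcomes, Outcome.COMPILATION_ERROR)))  // 25.00%
}
```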

Implications and Future Directions

The implications of this research are twofold. Practically, it improves the quality of Kotlin code generation, making these methods applicable to real-world tasks. Theoretically, it lays a foundation for further research into low-resource programming languages, pointing to directions such as the use of static analysis tools during training, more realistic synthetic data generation, and benchmarks built around real issues.

Future research avenues that arise from this work include the integration of tools like compilers and linters into the training process, the development of more complex and production-oriented synthetic datasets, and the creation of diverse benchmarks reflecting real-world Kotlin applications. These steps would not only improve the quality of Kotlin code generation but also provide valuable insights for enhancing code generation models for other low-resource languages.

In conclusion, the paper fills a gap by providing datasets and models tailored to Kotlin, significantly improving the state of Kotlin code generation. The work exemplifies how focused data curation and model fine-tuning can bring substantial advances in language modeling for an underrepresented language, and it proposes an approach that is replicable for other low-resource languages.
