
Kotlin ML Pack: Technical Report (2405.19250v1)

Published 29 May 2024 in cs.SE, cs.AI, and cs.PL

Abstract: In this technical report, we present three novel datasets of Kotlin code: KStack, KStack-clean, and KExercises. We also describe the results of fine-tuning CodeLlama and DeepSeek models on this data. Additionally, we present a version of the HumanEval benchmark rewritten by human experts into Kotlin - both the solutions and the tests. Our results demonstrate that small, high-quality datasets (KStack-clean and KExercises) can significantly improve model performance on code generation tasks, achieving up to a 16-point increase in pass rate on the HumanEval benchmark. Lastly, we discuss potential future work in the field of improving language modeling for Kotlin, including the use of static analysis tools in the learning process and the introduction of more intricate and realistic benchmarks.


Summary

  • The paper introduces three main Kotlin datasets—KStack, KStack-clean, and KExercises—to enhance code generation capabilities.
  • The study demonstrates that fine-tuning CodeLlama and DeepSeek models on curated datasets raises the HumanEval pass rate by up to 16 points, with the best configuration reaching 55.28%.
  • The report features a human-translated Kotlin HumanEval benchmark, offering a robust evaluation framework for model performance.

Novel High-Quality Datasets and Advancements in Kotlin Code Generation

In this technical report, the authors introduce three novel datasets for Kotlin code generation: KStack, KStack-clean, and KExercises. The report also details the fine-tuning of CodeLlama and DeepSeek models on these datasets and the creation of a Kotlin version of the HumanEval benchmark, translated by human experts.

Datasets for Kotlin Code

KStack is a comprehensive, up-to-date collection of open-source Kotlin code, positioned as a Kotlin counterpart to pre-existing datasets such as The Stack. It gathers GitHub repositories containing Kotlin code, filters them for permissive licenses, and de-duplicates the files so the corpus stays representative. After this processing, KStack contains around four million files comprising roughly 3.1 billion tokens.
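To make the collection step concrete, the sketch below shows a minimal Kotlin pipeline of this kind. The SourceFile record, the license allow-list, and exact-duplicate hashing are illustrative assumptions, not the authors' actual implementation.

```kotlin
import java.security.MessageDigest

// Hypothetical record for one source file pulled from a GitHub repository.
data class SourceFile(val repo: String, val path: String, val license: String, val content: String)

// Licenses treated as permissive in this sketch; the paper's exact allow-list may differ.
val permissiveLicenses = setOf("MIT", "Apache-2.0", "BSD-3-Clause")

fun sha256(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray())
        .joinToString("") { "%02x".format(it) }

// Keep only Kotlin files under permissive licenses, then drop exact duplicates by content hash.
fun buildKStack(files: Sequence<SourceFile>): List<SourceFile> {
    val seenHashes = HashSet<String>()
    return files
        .filter { it.path.endsWith(".kt") || it.path.endsWith(".kts") }
        .filter { it.license in permissiveLicenses }
        .filter { seenHashes.add(sha256(it.content)) } // true only the first time a hash is seen
        .toList()
}
```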

To further raise quality, KStack-clean was derived by training a classifier to predict code quality from a small labeled subset. The classifier was then applied to the entire dataset, and the top 25,000 highest-scoring examples were retained, making the subset considerably more useful for fine-tuning.
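A minimal sketch of this selection step, assuming the classifier is available as a scoring function (its features and training are not shown here), might look like this:

```kotlin
// Hypothetical quality filter: score each example with a learned classifier and keep the top 25,000.
// `qualityScore` stands in for the paper's classifier; its internals are an assumption here.
fun <T> selectTopQuality(
    examples: List<T>,
    qualityScore: (T) -> Double,
    topK: Int = 25_000
): List<T> = examples.sortedByDescending(qualityScore).take(topK)
```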

KExercises, a novel instruction dataset, was generated by translating Python-based exercises into Kotlin using GPT-3.5-turbo. Each exercise pairs a natural-language instruction with a Kotlin solution, so the dataset strengthens the models' natural language comprehension in addition to their code generation capabilities, and it amounts to approximately 3.5 million tokens.
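The sketch below illustrates one plausible shape for an entry in such an exercise dataset; the field names and the sample task are assumptions for illustration, not the released schema.

```kotlin
// Illustrative shape of a single exercise: a natural-language instruction paired with a Kotlin solution.
data class KotlinExercise(val instruction: String, val solution: String)

val sampleExercise = KotlinExercise(
    instruction = "Write a function that returns the sum of the even numbers in a list.",
    solution = """
        fun sumOfEvens(numbers: List<Int>): Int =
            numbers.filter { it % 2 == 0 }.sum()
    """.trimIndent()
)
```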

Evaluation and Benchmarking

The HumanEval benchmark, both solutions and tests, is translated into Kotlin by human experts, addressing limitations of existing translations around type generality and consistent floating-point precision. This benchmark anchors the evaluation of the models and is complemented by metrics such as compilation error rate, runtime error rate, and syntax error rate.
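As an illustration of what such a task looks like, the following is a hand-written Kotlin rendering of HumanEval's first problem together with a couple of checks; it is an assumption for illustration and is not copied from the released benchmark.

```kotlin
import kotlin.math.abs

/**
 * Return true if any two distinct numbers in the list are closer to each
 * other than the given threshold.
 */
fun hasCloseElements(numbers: List<Double>, threshold: Double): Boolean {
    for (i in numbers.indices) {
        for (j in numbers.indices) {
            if (i != j && abs(numbers[i] - numbers[j]) < threshold) return true
        }
    }
    return false
}

// A couple of checks in the spirit of the benchmark's hidden tests.
fun main() {
    check(hasCloseElements(listOf(1.0, 2.0, 3.9, 4.0, 5.0, 2.2), 0.3))
    check(!hasCloseElements(listOf(1.0, 2.0, 3.9, 4.0, 5.0, 2.2), 0.05))
    println("All checks passed")
}
```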

Experimental Setup and Findings

Several base models, including CodeLlama-7B and DeepSeek-coder-6.7B, were fine-tuned on the different datasets. The results indicated significant performance improvements from the smaller, high-quality datasets. Specifically, fine-tuning DeepSeek-coder-6.7B on KExercises raised the Kotlin HumanEval pass rate to 55.28%, illustrating the impact of instructional data. KStack-clean also yielded considerable improvements, underscoring the importance of dataset curation and quality over sheer size.
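For reference, the per-task bookkeeping behind such numbers can be sketched as follows; the outcome labels are illustrative assumptions rather than the paper's exact categories.

```kotlin
// Each generated solution ends in exactly one outcome; pass rate and error rates are simple shares.
enum class Outcome { PASSED, COMPILATION_ERROR, RUNTIME_ERROR, FAILED_TESTS }

fun rate(outcomes: List<Outcome>, target: Outcome): Double =
    100.0 * outcomes.count { it == target } / outcomes.size

fun main() {
    val outcomes = listOf(Outcome.PASSED, Outcome.PASSED, Outcome.COMPILATION_ERROR, Outcome.FAILED_TESTS)
    println("pass rate: %.2f%%".format(rate(outcomes, Outcome.PASSED)))                      // 50.00%
    println("compilation errors: %.2f%%".format(rate(outcomes, Outcome.COMPILATION_ERROR)))  // 25.00%
}
```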

Implications and Future Directions

The implications of this research are twofold. Practically, it improves the quality of Kotlin code generation, making these methods applicable to real-world tasks. Theoretically, it lays a foundation for further research into low-resource programming languages, pointing to directions such as the use of static analysis tools during training, more realistic synthetic data generation, and benchmarks built around real issues.

Future research avenues that arise from this work include the integration of tools like compilers and linters into the training process, the development of more complex and production-oriented synthetic datasets, and the creation of diverse benchmarks reflecting real-world Kotlin applications. These steps would not only improve the quality of Kotlin code generation but also provide valuable insights for enhancing code generation models for other low-resource languages.

In conclusion, the paper fills a gap by providing datasets and models tailored to Kotlin, significantly improving the state of Kotlin code generation. The work exemplifies how focused data curation and model fine-tuning can bring substantial advances in language modeling for an underrepresented language, and it proposes an approach that is replicable for other low-resource languages.
