StarCoder 2 and The Stack v2: Advancing the Frontiers of Code Generation LLMs
The BigCode project, an open scientific collaboration focused on the responsible development of large language models for code (Code LLMs), recently introduced StarCoder2. The release marks a significant advance in code generation LLMs, extending the foundational work of the original StarCoder model and The Stack dataset. In partnership with Software Heritage, the project has also developed The Stack v2, a vastly expanded corpus for training code generation models. This blog post gives an overview of StarCoder2, the construction of The Stack v2, and the evaluations used to gauge the models' capabilities.
Introduction to StarCoder2
StarCoder2 is a family of models with 3B, 7B, and 15B parameters. They were trained on a dataset roughly four times larger than that of the original StarCoder, yielding significant performance improvements. The training set is rooted in the Software Heritage archive, supplemented with other high-quality data sources, and spans 619 programming languages.
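For readers who want to try the models, here is a minimal sketch of generating a completion with the Hugging Face transformers library. It assumes the checkpoints are published on the Hub under the bigcode organization (e.g. bigcode/starcoder2-3b); adjust the identifier if the released names differ.

```python
# Minimal sketch: generating a code completion with StarCoder2 via
# Hugging Face transformers. The checkpoint id is an assumption; the
# 7B and 15B variants would be loaded analogously.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumed Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Greedy decoding keeps the example deterministic; sampling
# (do_sample=True) tends to work better for code in practice.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```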
The Development of The Stack v2
The Stack v2 builds on the digital commons of Software Heritage’s source code archive, enhanced with additional data sources such as GitHub pull requests, Kaggle notebooks, and code documentation. The resulting dataset, carefully curated and cleaned, is roughly four times larger than the first version of The Stack, supporting the training of more capable models.
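To get a feel for the data, the sketch below streams a sample of records from the dataset on the Hugging Face Hub. It assumes the dataset lives at bigcode/the-stack-v2 (it is gated, so accept the terms on the Hub and authenticate first) and that each record carries per-file metadata such as a language field; treat the exact schema as an assumption and inspect a record before relying on it.

```python
# Sketch: streaming a sample of The Stack v2 and tallying languages.
# Dataset id and field names are assumptions -- verify against the Hub.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

langs = Counter()
for i, record in enumerate(ds):
    langs[record.get("language", "unknown")] += 1
    if i >= 999:  # look at the first 1,000 records only
        break
print(langs.most_common(10))
```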
Evaluation and Benchmarks
StarCoder2 models were evaluated on a suite of benchmarks covering code completion, code fixing and editing, mathematical reasoning, and more. The evaluations show that the small StarCoder2-3B model outperforms other models of similar size on many benchmarks and even surpasses the original, much larger StarCoderBase-15B. The largest model in the family, StarCoder2-15B, matches or outperforms models more than twice its size on several benchmarks.
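Many of these code completion benchmarks (HumanEval and its relatives) report pass@k, the probability that at least one of k sampled completions passes the unit tests. Below is a sketch of the standard unbiased estimator from Chen et al. (2021); the sample counts in the example are illustrative, not StarCoder2 results.

```python
# Unbiased pass@k estimator (Chen et al., 2021), widely used by
# code-completion benchmarks. Given n samples for a problem, of which
# c pass the tests, it estimates the chance that at least one of k
# samples passes: 1 - C(n-c, k) / C(n, k).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k).
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers only: 200 samples, 37 correct.
print([round(pass_at_k(200, 37, k), 3) for k in (1, 10, 100)])
```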
Repository-Level Code Completion
To probe practical use, the models were also assessed on code completion at the repository level, where they demonstrate significant improvements over earlier models. These gains are credited to the curation methodology behind The Stack v2 and to a training approach that preserves repository context within this expansive dataset.
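As a rough illustration of what repository-level context means at the prompt level, the sketch below concatenates several files from one repository into a single prompt. The special tokens (<repo_name>, <file_sep>) and the exact layout are assumptions based on the StarCoder2 training format described in the paper; check the released tokenizer's special tokens before relying on them.

```python
# Sketch of a repository-level prompt in the (assumed) StarCoder2
# pretraining format: repo name first, then each file prefixed by its
# path, separated by a special token.
def build_repo_prompt(repo_name: str, files: list[tuple[str, str]]) -> str:
    """Concatenate (path, content) pairs into one repo-context prompt."""
    parts = [f"<repo_name>{repo_name}"]
    for path, content in files:
        parts.append(f"<file_sep>{path}\n{content}")
    return "".join(parts)

prompt = build_repo_prompt(
    "example/calculator",  # hypothetical repository
    [
        ("calculator/ops.py", "def add(a, b):\n    return a + b\n"),
        # The model completes the last file given the earlier ones.
        ("calculator/cli.py", "from calculator.ops import add\n\ndef main():"),
    ],
)
print(prompt)
```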
Advancements and Social Impact
The development of StarCoder2 and The Stack v2 reflects the BigCode project’s commitment to open science, ethical data sourcing, and accelerating research on Code LLMs. By making the training data transparent and the model weights openly available, the project helps democratize AI advances and fosters responsible AI development. It also addresses challenges around privacy, security, and societal and representational biases, underscoring the importance of balanced, mindful technological progress.
Conclusion
StarCoder2 represents a leap forward in code generation with LLMs, built on the extensive dataset provided by The Stack v2. These advancements showcase the potential of collaborative, open scientific projects to push the boundaries of AI and lay the groundwork for future innovations. As the BigCode project continues to evolve, it remains centered on the pillars of responsible development, open access, and community engagement, paving the way for more inclusive and ethically considered advancements in AI.
Acknowledgements
This work is a testament to the collaborative spirit of the BigCode community, Software Heritage, and all contributors across the globe. It is a powerful example of what can be achieved when the scientific community comes together in pursuit of open, responsible technological advancement.