StarCoder 2 and The Stack v2: The Next Generation (2402.19173v1)

Published 29 Feb 2024 in cs.SE and cs.AI

Abstract: The BigCode project, an open-scientific collaboration focused on the responsible development of LLMs for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

StarCoder 2 and The Stack v2: Advancing the Frontiers of Code Generation LLMs

The BigCode project, an open scientific collaboration focused on the responsible development of LLMs for Code (Code LLMs), recently introduced StarCoder2. This initiative marks a significant advancement in code generation LLMs, extending the foundational work on the original StarCoder model and The Stack dataset. In partnership with Software Heritage, the project has developed The Stack v2, a vastly expanded corpus for training code generation models. This blog post presents a comprehensive overview of StarCoder2, the development of The Stack v2, and the evaluations performed to gauge the models' capabilities.

Introduction to StarCoder 2

StarCoder2 encompasses a family of models with 3B, 7B, and 15B parameters, each trained on 3.3 to 4.3 trillion tokens. The training set, rooted in the Software Heritage archive and supplemented with other high-quality datasets, is roughly four times larger than the original StarCoder dataset and spans 619 programming languages, yielding significant performance improvements over the first generation.
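Because the model weights are openly released, the models can be tried with standard open-source tooling. The following is a minimal sketch using the Hugging Face `transformers` library with the published `bigcode/starcoder2-3b` checkpoint; the precision and device settings are illustrative choices for the example, not recommendations from the paper.

```python
# Minimal sketch: load a StarCoder2 checkpoint and complete a code prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # 7B and 15B variants are also published

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # half precision to fit comfortably on one GPU
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```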

The Development of The Stack v2

The Stack v2 builds upon the digital commons of Software Heritage's source code archive, enhanced with additional data sources such as GitHub pull requests, Kaggle notebooks, and code documentation. This carefully curated and cleaned dataset is four times larger than the first version of The Stack, facilitating the training of more capable models.
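The dataset is published on the Hugging Face Hub as `bigcode/the-stack-v2`. A hedged sketch of streaming its records follows; note that access is gated behind the dataset's terms of use, the records carry source-code identifiers and metadata rather than raw file contents, and no particular schema is assumed here, so the code only lists whatever keys each record carries.

```python
# Sketch: stream a few metadata records from The Stack v2 without
# downloading the full index (requires accepting the dataset's terms
# and authenticating with the Hugging Face Hub).
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

for record in ds.take(3):
    # Exact keys depend on the published schema (e.g. repository, path,
    # license, SWHID-related identifiers); inspect rather than assume.
    print(sorted(record.keys()))
```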

Evaluation and Benchmarks

StarCoder2 models were evaluated against a suite of benchmarks covering code completion, code fixing and editing, mathematical reasoning, and more. These evaluations show that the smaller StarCoder2-3B outperforms other models of similar size on most benchmarks and even surpasses StarCoderBase-15B, a model five times its size. The largest in the family, StarCoder2-15B, matches or outperforms models more than twice its size, such as CodeLlama-34B, and outperforms DeepSeekCoder-33B on math and code reasoning benchmarks as well as several low-resource languages.
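Many code-completion benchmarks in such suites report pass@k: the probability that at least one of k sampled completions passes the unit tests. Below is a sketch of the standard unbiased estimator introduced with HumanEval (Chen et al., 2021); the sample counts in the usage example are illustrative only.

```python
# Unbiased pass@k estimator: given n samples per problem of which c pass,
# pass@k = 1 - C(n-c, k) / C(n, k), computed stably as a running product.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples per problem, 37 of them passing.
print(pass_at_k(200, 37, 1))   # 0.185 (equals c/n for k=1)
print(pass_at_k(200, 37, 10))  # ~0.88
```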

Repository-Level Code Completion

Focusing on practical applications, the models were assessed on their capability to perform code completion at the repository level, demonstrating significant improvements over earlier models. These improvements are credited to the methodology employed in creating The Stack v2 and the robust training approach that leveraged this expansive dataset.
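As a rough illustration of what repository-level completion involves, the sketch below concatenates several files from one repository into a single context using sentinel tokens. The literal token strings `<repo_name>` and `<file_sep>` are assumptions about the training format and should be checked against the released tokenizer's special-token map.

```python
# Hedged sketch: build a repository-level prompt by joining files from the
# same repository with assumed sentinel tokens.
REPO_TOKEN = "<repo_name>"   # assumed repository-name sentinel
FILE_SEP = "<file_sep>"      # assumed file-separator sentinel

def build_repo_prompt(repo: str, files: dict[str, str]) -> str:
    """Concatenate repository files into one training-style context."""
    parts = [f"{REPO_TOKEN}{repo}"]
    for path, code in files.items():
        parts.append(f"{FILE_SEP}{path}\n{code}")
    return "".join(parts)

prompt = build_repo_prompt(
    "example/calculator",  # hypothetical repository
    {
        "calculator/add.py": "def add(a, b):\n    return a + b\n",
        "calculator/cli.py": "from calculator.add import add\n\ndef main():",
    },
)
# The model is then asked to continue the final, incomplete file, with the
# other files available as cross-file context.
```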

Advancements and Social Impact

The development of StarCoder2 and The Stack v2 reflects the BigCode project's commitment to open science, ethical data sourcing, and accelerating research on Code LLMs. By ensuring transparency in the training data and providing open access to model weights, the project helps democratize AI advancements and fosters responsible AI development. It also addresses challenges around privacy, security, and societal and representational bias, underscoring the importance of balanced and mindful technological progress.
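That training-data transparency is concrete: the released SoftWare Heritage persistent IDentifiers (SWHIDs) let each source file be traced back to the archive. As a small illustration, the SWHID of a content object is defined in the public SWHID specification as the Git-style SHA-1 of the blob, prefixed with `swh:1:cnt:`, so it can be recomputed locally:

```python
# Compute the SWHID of a content object: SHA-1 over the Git blob header
# ("blob <length>\0") followed by the raw bytes, per the SWHID spec.
import hashlib

def content_swhid(data: bytes) -> str:
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

print(content_swhid(b"print('hello world')\n"))
# -> swh:1:cnt:<40 hex digits>, identical to the hash Git assigns the blob
```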

Conclusion

StarCoder2 represents a leap forward in the domain of code generation with LLMs, supported by the extensive dataset provided by The Stack v2. These advancements showcase the potential of collaborative, open scientific projects in pushing the boundaries of AI and providing the groundwork for future innovations. As the BigCode project continues to evolve, it remains centered on the pillars of responsible development, open access, and community engagement, paving the way for more inclusive and ethically considered advancements in AI.

Acknowledgements

This work is a testament to the collaborative spirit of the BigCode community, Software Heritage, and all contributors across the globe. It is a powerful example of what can be achieved when the scientific community comes together in pursuit of open, responsible technological advancement.

Authors (66)
  1. Anton Lozhkov
  2. Raymond Li
  3. Loubna Ben Allal
  4. Federico Cassano
  5. Joel Lamy-Poirier
  6. Nouamane Tazi
  7. Ao Tang
  8. Dmytro Pykhtar
  9. Jiawei Liu
  10. Yuxiang Wei
  11. Tianyang Liu
  12. Max Tian
  13. Denis Kocetkov
  14. Arthur Zucker
  15. Younes Belkada
  16. Zijian Wang
  17. Qian Liu
  18. Dmitry Abulkhanov
  19. Indraneil Paul
  20. Zhuang Li
Citations (174)