The Stack: 3 TB of permissively licensed source code (2211.15533v1)

Published 20 Nov 2022 in cs.CL and cs.AI

Abstract: LLMs play an ever-increasing role in the field of AI--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.

An Analysis of "The Stack: 3 TB of Permissively Licensed Source Code"

The paper introduces The Stack, a substantial dataset comprising 3.1 TB of source code across 30 programming languages, all of which is permissively licensed. This endeavor is pivotal in promoting open and responsible research on LLMs tailored for code, an increasingly vital application of AI in automating and enhancing programming tasks. By facilitating access to vast and diverse datasets, the research community can better analyze, reproduce, and innovate upon existing models.

Core Contributions

The paper delineates several significant contributions to the field:

  1. Dataset Compilation and Accessibility: The Stack provides a meticulously curated collection of permissively licensed code, accessible to the research community. This dataset serves as a foundation for developing and benchmarking code LLMs.
  2. Data Governance and Developer Autonomy: The authors have established a framework for data governance, allowing developers to remove their code from the dataset, echoing ongoing discussions about ethics in AI and data usage rights.
  3. Performance on Benchmarks: Significantly, the authors demonstrate that training on a permissively licensed Python subset of the dataset can match the previously reported HumanEval and MBPP results of prominent models such as Codex and CodeGen, with the near-deduplicated subset performing best.
  4. Effectiveness of Near-Deduplication: An intriguing finding of the research is the positive impact of near-deduplication on training dataset quality, resulting in notable performance improvements across evaluated benchmarks.

Methodology

The dataset was sourced primarily from GitHub repositories, focusing strictly on those with permissive licenses such as MIT and Apache 2.0. A robust licensing detection framework ensured compliance, minimizing ethical and legal risks involved in using open-source code for commercial or research purposes.
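The license-filtering step described above can be illustrated with a small sketch. This is a hypothetical simplification: the allow-list, the `repos` structure, and the `filter_permissive` helper are illustrative, not the paper's actual detection pipeline, which relied on automated license detectors run over repository metadata and files.

```python
# Hypothetical sketch: keep only repositories whose detected license
# appears on a permissive allow-list. The license identifiers follow
# SPDX-style names; the set shown here is intentionally incomplete.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}

def filter_permissive(repos):
    """Return the repositories whose detected license is permissive."""
    return [r for r in repos if r.get("license", "").lower() in PERMISSIVE_LICENSES]

repos = [
    {"name": "a", "license": "MIT"},
    {"name": "b", "license": "GPL-3.0"},     # copyleft: excluded
    {"name": "c", "license": "Apache-2.0"},
]
print([r["name"] for r in filter_permissive(repos)])  # ['a', 'c']
```

In practice, the hard part is the detection itself (licenses can live in README files, headers, or nonstandard filenames), which is why a dedicated detection framework matters for compliance.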

To enhance the dataset's utility, the authors implemented near-deduplication techniques to exclude near-identical content, since duplicated code is a common issue that degrades the performance of code LLMs. This processing step not only refines the dataset but also exemplifies best practices in data handling, promoting efficient and ethical AI model training.
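Near-deduplication of this kind is commonly done by estimating Jaccard similarity between documents with MinHash signatures. The sketch below is a minimal, illustrative version: the shingle size, hash count, and threshold are arbitrary choices, and a production pipeline (like the one used for The Stack) would pair MinHash with locality-sensitive hashing to avoid pairwise comparisons.

```python
# Minimal near-deduplication sketch: MinHash-estimated Jaccard similarity
# over token shingles, with a greedy keep/drop pass. Parameters are
# illustrative, not the paper's actual settings.
import hashlib

def shingles(text, k=5):
    """Set of k-token shingles from a non-empty document."""
    toks = text.split()
    return {" ".join(toks[i:i + k]) for i in range(max(1, len(toks) - k + 1))}

def minhash(shingle_set, num_hashes=64):
    """Signature: for each seed, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_dedup(docs, threshold=0.85):
    """Greedily keep each document unless it is near-duplicate of a kept one."""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash(shingles(doc))
        if all(est_jaccard(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept
```

A quadratic greedy pass like this only works at toy scale; the design point that makes near-deduplication feasible on terabytes of code is bucketing similar signatures so that only candidate pairs are ever compared.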

Implications and Future Directions

The implications of this research are multifold. Practically, The Stack may serve as a resource for developing more efficient code completion tools and programming assistants capable of handling sophisticated tasks ranging from auto-completion to full program synthesis. Theoretically, the dataset's availability and the open dialogue about data governance and ethical AI development pave the way for more transparent and collaborative advancements in the field.

The research also lays the groundwork for future studies into mitigating potential biases in training datasets, anticipating a broader representation of programming languages and styles. Furthermore, addressing challenges such as the inclusion of malicious code or personally identifiable information remains a priority to uphold model integrity and ethical standards.

Conclusion

The introduction of The Stack marks a substantial step toward fostering an environment of open, reproducible, and responsible AI research within code synthesis and understanding. By facilitating both access and control over training data, the authors support a community-driven approach to innovation in machine learning for software development. The findings not only elucidate the existing capabilities of LLMs but also suggest new directions for optimizing their performance and applicability.

Authors (13)
  1. Denis Kocetkov (5 papers)
  2. Raymond Li (24 papers)
  3. Loubna Ben Allal (12 papers)
  4. Jia Li (380 papers)
  5. Chenghao Mou (7 papers)
  6. Carlos Muñoz Ferrandis (8 papers)
  7. Yacine Jernite (46 papers)
  8. Margaret Mitchell (43 papers)
  9. Sean Hughes (7 papers)
  10. Thomas Wolf (117 papers)
  11. Dzmitry Bahdanau (46 papers)
  12. Leandro von Werra (19 papers)
  13. Harm de Vries (29 papers)
Citations (246)