Introduction
The BigCode community has released StarCoder and StarCoderBase, large language models trained on code. Each model has 15.5B parameters and an 8K-token context length, supports infilling, and enables fast large-batch inference via multi-query attention. StarCoderBase was trained on 1 trillion tokens drawn from The Stack, a collection of permissively licensed GitHub repositories; StarCoder is its fine-tuned counterpart, further trained on 35B Python tokens. A comprehensive evaluation shows that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches the performance of OpenAI's code-cushman-001 model, while StarCoder outperforms models fine-tuned on Python yet remains proficient in other programming languages.
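As a concrete illustration of the infilling capability, the sketch below shows how fill-in-the-middle prompting is typically done with the public bigcode/starcoder checkpoint on Hugging Face, assuming its documented FIM special tokens (<fim_prefix>, <fim_suffix>, <fim_middle>); exact token names and generation settings may need adjusting to the tokenizer you load.

```python
# Illustrative sketch of fill-in-the-middle (FIM) prompting with StarCoder.
# Assumes the public bigcode/starcoder checkpoint and its FIM special tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prefix = "def fibonacci(n):\n    "
suffix = "\n    return result\n"

# The model is asked to generate the span between prefix and suffix.
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)

# Keep only the newly generated middle, then reassemble the snippet.
middle = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(prefix + middle + suffix)
```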
Model Development
The StarCoder models reflect a commitment to responsible development: respecting copyright, protecting privacy, and involving the community throughout the process. To support legal compliance, the PII redaction pipeline has been improved and an attribution tool has been developed that traces model generations back to the training data. Open access is central to the BigCode project's community-driven approach: The Stack is a transparent pre-training dataset with governance tools that let developers verify whether their code is included and opt out if they wish. This openness enables external audits, invites contributions that improve the models, and sets an example of open scientific collaboration.
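To make the redaction idea concrete, here is a minimal, purely illustrative sketch that replaces detected spans with typed placeholders. It is not the BigCode pipeline, which relies on trained detectors for categories such as names, emails, keys, and IP addresses; this toy version only covers two easy pattern-based categories.

```python
# Toy regex-based PII redaction: replace detected spans with typed placeholders.
# Illustrative only; real pipelines use trained detectors for broader coverage.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_pii(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact_pii("Contact jane.doe@example.com from 192.168.0.1"))
# -> Contact <EMAIL> from <IP_ADDRESS>
```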
Empirical Analysis
Evaluation benchmarks form the core of Code LLM assessment. The evaluation strategy for StarCoder spans a diverse set of benchmarks covering language understanding, reasoning, and toxicity. Performance on GSM8K highlights StarCoderBase's reasoning ability, where it surpasses Code LLMs of similar size; MMLU and CoQA measure its natural-language understanding; and RealToxicityPrompts helps detect potential bias and toxicity in generated text, an essential safety consideration. Strong performance across these benchmarks secures StarCoder and StarCoderBase's standing among current Code LLMs.
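Code-generation benchmarks are conventionally scored with the unbiased pass@k estimator of Chen et al. (2021): given n sampled completions per problem of which c pass the unit tests, it estimates the probability that at least one of k samples passes. The sketch below implements that standard formula; the sample counts in the usage example are invented for illustration.

```python
# Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least 1 of k samples passes, given c of n passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 37 passing.
print(round(pass_at_k(200, 37, 1), 3))   # pass@1  ~ 0.185
print(round(pass_at_k(200, 37, 10), 3))  # pass@10
```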
Tools for Safe Deployment
The StarCoder models are released under an OpenRAIL-M license, which imposes use restrictions intended to avert misuse in critical scenarios. This addresses liability concerns by improving transparency and encouraging ethical use. Complementing the responsible-deployment effort, new tools for membership checking and a BM25 search index have been published, letting users trace model outputs back to the training set. These tools are early steps toward safeguarding responsible AI deployment, curbing misuse, and strengthening accountability for model-generated code.
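The core idea behind such an attribution search can be sketched in a few lines: index training snippets with BM25 and retrieve the closest matches for a generated completion. The example below uses the third-party rank_bm25 package and a made-up three-snippet corpus as a stand-in; the actual BigCode tooling indexes The Stack at scale and is served through a web interface.

```python
# Minimal BM25 attribution sketch: find training snippets closest to a generation.
from rank_bm25 import BM25Okapi

training_snippets = [
    "def add(a, b):\n    return a + b",
    "def quicksort(xs):\n    ...",
    "class LinkedList:\n    ...",
]
tokenized = [snippet.split() for snippet in training_snippets]
index = BM25Okapi(tokenized)

generated = "def add(x, y):\n    return x + y"
# Retrieve the two highest-scoring training snippets for this generation.
for snippet in index.get_top_n(generated.split(), training_snippets, n=2):
    print(snippet)
```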
In conclusion, the BigCode community's release of StarCoder and StarCoderBase marks a significant step toward the effective and safe application of Code LLMs. With open access, careful evaluation, and tools that support responsible use, these models represent real progress while encouraging community engagement and collaboration.