StarCoder2: Open Code LLM Family

Updated 14 April 2026

StarCoder2 is a family of open, code-specialized language models built on a causal transformer backbone that handles contexts of up to 16k tokens.
It uses rotary positional encodings, grouped-query attention, and FlashAttention-2 for efficiency and scalability.
Trained on large curated code corpora with a transparent curation process, it reports strong results among open models on code completion, editing, and comprehension benchmarks.

StarCoder2 is a family of open, code-specialized LLMs developed by the BigCode project and trained primarily on The Stack v2, a dataset built on the Software Heritage archive. The project aims to advance transparency, legal auditability, and performance across a broad range of code-centric tasks. Released in 2024, StarCoder2 combines large-scale curated code corpora, a modern transformer architecture, and evaluation on both classic and newer programming language (PL) benchmarks, and it serves as the foundation for later models such as StarCoder2-Instruct (Lozhkov et al., 2024).

1. Architecture, Model Variants, and Training Paradigm

StarCoder2 comprises three principal models (3B, 7B, and 15B parameters), each built on a causal decoder-only transformer. Key features include rotary positional encodings (RoPE) for scalable sequence length, grouped-query attention (GQA) to reduce inference cost, and FlashAttention-2, which makes context windows of up to 16k tokens practical (Lozhkov et al., 2024). The tokenizer has a vocabulary of 49,152 subword units, tailored for code and comments. Architectural parameters are summarized below:

Model	Params	Layers	Hidden Dim	Heads	KV-Heads	Context
SC2-3B	3B	30	3072	24	2	4k→16k
SC2-7B	7B	32	4608	36	4	4k→16k
SC2-15B	15B	40	6144	48	4	4k→16k
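To make the effect of grouped-query attention concrete, the sketch below estimates the key/value cache size implied by the table at the full 16,384-token context. The layer, hidden-size, and head counts come from the table; the 2 bytes per value (fp16/bf16 storage), the single-sequence batch, and the names `Config` and `kv_cache_bytes` are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    layers: int
    hidden: int
    heads: int
    kv_heads: int


# Values copied from the architecture table above.
CONFIGS = [
    Config("SC2-3B", 30, 3072, 24, 2),
    Config("SC2-7B", 32, 4608, 36, 4),
    Config("SC2-15B", 40, 6144, 48, 4),
]


def head_dim(cfg: Config) -> int:
    """Per-head dimension, assuming hidden size is split evenly across heads."""
    return cfg.hidden // cfg.heads


def kv_cache_bytes(cfg: Config, tokens: int, bytes_per_value: int = 2,
                   grouped: bool = True) -> int:
    """Bytes needed to cache keys and values for one sequence.

    With grouped=True only the KV heads are cached (GQA); with
    grouped=False every query head gets its own K/V (multi-head baseline).
    bytes_per_value=2 is an assumption (fp16/bf16 cache).
    """
    cached_heads = cfg.kv_heads if grouped else cfg.heads
    # Factor 2 accounts for storing both keys and values.
    return 2 * cfg.layers * cached_heads * head_dim(cfg) * tokens * bytes_per_value


if __name__ == "__main__":
    for cfg in CONFIGS:
        gqa = kv_cache_bytes(cfg, 16_384)
        mha = kv_cache_bytes(cfg, 16_384, grouped=False)
        print(f"{cfg.name}: GQA {gqa / 2**20:.0f} MiB, "
              f"MHA baseline {mha / 2**20:.0f} MiB, ratio {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is why the small KV-head counts in the table matter most at long context.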

Training follows a two-stage procedure: (1) base pretraining on 3.3–4.3T tokens at 4k context, then (2) long-context fine-tuning on a further 200B tokens at a 16,384-token context with a larger RoPE base period. Optimization uses AdamW (β₁=0.9, β₂=0.95) with a cosine-decay learning-rate schedule and batch sizes of up to 4.1M tokens per step. All models are trained for ≈5 epochs over The Stack v2; early stopping was applied to the 15B model (Lozhkov et al., 2024).
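The role of the RoPE base period in the long-context stage can be illustrated with a small pure-Python sketch. It uses the standard RoPE frequency formula and the interleaved-pair rotation convention; the base values 10,000 and 1,000,000 are illustrative assumptions, not the values reported for StarCoder2, and the function names are introduced here.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Inverse frequencies base^(-2i/d) for each rotated pair i."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


def apply_rope(vec: list[float], pos: int, base: float) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by pos * inv_freq."""
    out = []
    for i, freq in enumerate(rope_inv_freq(len(vec), base)):
        angle = pos * freq
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Illustrative bases only: a larger base stretches the longest wavelength,
    # so distant positions remain distinguishable at longer contexts.
    for base in (10_000.0, 1_000_000.0):
        print(base, round(longest_wavelength(128, base)))
```

Because each pair is rotated by an angle proportional to position, the attention score between a query and a key depends only on their relative offset; raising the base slows the rotations so that this encoding does not wrap around within a longer window.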

2. Data Sources, Curation Strategy, and Transparency

StarCoder2 is trained on The Stack v2, a digital commons comprising ≈900B unique tokens drawn from 619 programming languages. Core sources include deduplicated Software Heritage code, 19.5B tokens from GitHub pull requests, 11.1B from GitHub issues, 30B from arXiv LaTeX, StackOverflow Q&A, high-quality benchmarks, and LLVM intermediate representations (Lozhkov et al., 2024). Every file is traceable via Software Heritage persistent identifiers (SWHIDs), and files with non-permissive, copyleft, or commercial licenses are excluded at the file level using GitHub metadata and ScanCode. Over 1,500 opt-out requests were honored, PII is filtered with the StarPII model, and malicious files (0.009% of files) are removed.
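As an illustration of how file-level traceability works, the sketch below computes a content SWHID for a byte string. It relies on the fact that SWHID content identifiers (`swh:1:cnt:`) use the same SHA-1 over a `blob <length>\0` header plus contents that Git uses for blobs; the function name `content_swhid` is introduced here, and how The Stack v2 tooling itself computes or stores identifiers is not described in the source.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file's contents (Git-compatible blob hash)."""
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    # Identical bytes always map to the same identifier, regardless of
    # file name or repository, which is what makes provenance checks possible.
    print(content_swhid(b"print('hello')\n"))
```

Since the identifier depends only on the bytes, a user can hash a local file and look it up without uploading the file itself.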

Licensing is governed by the OpenRAIL-M license, which permits commercial and research use subject to ethical restrictions. Dedicated tooling ("Am I in the Stack") lets downstream users check whether their code is included in the training data, supporting provenance and compliance audits (Lozhkov et al., 2024).

3. Benchmarking Performance, Empirical Properties, and Task Coverage

StarCoder2 models achieve state-of-the-art or highly competitive results across diverse evaluation protocols:

Code completion: On HumanEval and MBPP, the 15B model achieves 46.3%/66.2% pass@1, exceeding other open models at comparable scale and matching or surpassing the larger CodeLlama-34B (Lozhkov et al., 2024).
MultiPL-E: StarCoder2-15B is highest among large models on 16/18 programming languages (pass@1).
Data science (DS-1000): 33.8% (15B model), outperforming all but DeepSeek-33B.
Code editing/fixing: StarCoder2-15B achieves 43.1% pass@1 on descriptive CanItEdit prompts, outperforming CodeLlama-34B-Instruct.
Math/reasoning: GSM8K (8-shot PAL): 65.1% (SC2-15B); CRUXEval: 48.1% pass@1 (best in class).
Repository-level/long-context: On RepoBench and CrossCodeEval, SC2-15B outperforms CodeLlama-13B and matches CodeLlama-34B on edit similarity and BLEU.
Security/safety: StarCoder2-3B reduces insecure code generation (12.2% on Asleep at the Keyboard) and is less toxic than equivalent general-purpose LLMs.
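Most of the figures above are pass@1 scores. For readers unfamiliar with the metric, the sketch below implements the unbiased pass@k estimator commonly used with HumanEval-style benchmarks: given n sampled completions for a problem, of which c pass the unit tests, it estimates the probability that at least one of k samples passes. The source does not state how many samples per problem were drawn for each benchmark, so the numbers in the usage lines are purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: completions sampled, c: completions that passed, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset contains no passing completion.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n, c)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative data: three problems with 10 samples each.
    print(benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

For k=1 the estimator reduces to the fraction of passing samples, c/n, averaged over problems.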

Performance scaling is largely monotonic with increased model and data size. The 15B model shows no overfitting at 5 epochs, consistent with compute-optimal scaling predictions (Lozhkov et al., 2024).

4. Self-Alignment, Instruction-Tuning, and StarCoder2-Instruct

A major advancement is the release of StarCoder2-Instruct, a variant trained with the SelfCodeAlign pipeline for instruction-following behavior (Wei et al., 2024). SelfCodeAlign generates instruction–response pairs exclusively from a base model (no closed LLMs or distillation), applies several code-concept extraction and validation stages, and keeps only those examples whose response passes its accompanying tests when executed in a Docker-based sandbox. The resulting StarCoder2-Instruct achieves:

HumanEval pass@1: 72.6% (greedy) (Wei et al., 2024)
DS-1000: 39.1% (best among fully transparent models)
CanItEdit: 39% (vs. OctoCoder-16B 30.2%)
ClassEval: 27% (class-level); 52.6% (method-level)
Licensing and dataset transparency identical to the base model.

StarCoder2-Instruct establishes a new baseline for open, transparent, and self-aligned code LLMs, outperforming both prior open-source and many proprietary models without using closed teacher data (Wei et al., 2024).
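The execution-based filtering step of a self-alignment pipeline can be sketched roughly as follows. This is a simplified stand-in, not the SelfCodeAlign implementation: it runs each candidate in a plain subprocess rather than a Docker sandbox (so it offers no real isolation), and the dictionary keys `response` and `tests`, the function names, and the 10-second timeout are all assumptions made for illustration.

```python
import subprocess
import sys


def passes_tests(response_code: str, test_code: str,
                 timeout_s: float = 10.0) -> bool:
    """Run the generated code followed by its tests in a fresh interpreter.

    Returns True only if the process exits with status 0 within the timeout.
    NOTE: a subprocess is NOT a security sandbox; a real pipeline would
    execute untrusted code inside a container or similar isolation.
    """
    program = response_code + "\n\n" + test_code + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def filter_pairs(candidates: list[dict]) -> list[dict]:
    """Keep only candidates whose response passes its own tests."""
    return [c for c in candidates
            if passes_tests(c["response"], c["tests"])]


# Hypothetical candidates: the second one contains a bug and is dropped.
CANDIDATES = [
    {"instruction": "Add two numbers.",
     "response": "def add(a, b):\n    return a + b",
     "tests": "assert add(2, 3) == 5"},
    {"instruction": "Add two numbers.",
     "response": "def add(a, b):\n    return a - b",
     "tests": "assert add(2, 3) == 5"},
]
```

Calling `filter_pairs(CANDIDATES)` keeps only the first entry; the surviving instruction–response pairs would then form the fine-tuning set.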

5. Specialized Evaluations: Synthesis, Editing, and Comprehension

StarCoder2 demonstrates incremental and task-specific strengths:

Code summarization: In few-shot, function-level summarization, StarCoder2-15B achieves higher BLEURT, METEOR, and SIDE scores than code-focused baselines. At class and repository level, however, large contexts require chunking and retrieval (RAG), and the resulting improvements are modest (Makharev et al., 23 Feb 2025).
Code refactoring: StarCoder2-15B outperforms developers in code smell reduction (+20.1 pp SRR), dominating systematic, syntactic refactorings (long statements, magic numbers). Context-dependent (architectural/design) smells are best handled by humans, suggesting hybrid integration (Cordeiro et al., 2024).
Hardware design synthesis: After CraftRTL fine-tuning, StarCoder2-15B achieves 81.9% pass@1 on VerilogEval-Machine (+3.8 pp over prior SOTA), with dramatic improvement on non-textual representations (Karnaugh maps, FSMs) when given correct-by-construction data (Liu et al., 2024).

Recent research also establishes StarCoder2’s code comprehension limitations. Under variable renaming and literal encryption obfuscations, description accuracy drops sharply (–28 and –26 pp, respectively), with best recovery (90% functional correctness) only for lexical obfuscations (Nikiema et al., 14 Apr 2025). This suggests a reliance on surface-level lexical regularities and limited abstract semantic representations.
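To make the variable-renaming obfuscation concrete, the following sketch rewrites local variable and parameter names in a Python snippet to meaningless identifiers while preserving behavior. It is an illustrative approximation built on the standard `ast` module, not the transformation used by Nikiema et al.; it ignores keyword-argument call sites, attributes, and scoping subtleties, and the names `obfuscate` and `SAMPLE` are introduced here.

```python
import ast


def _collect_names(tree: ast.AST) -> dict[str, str]:
    """Map every parameter or assigned variable name to a neutral name."""
    mapping: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        mapping.setdefault(name, f"v{len(mapping)}")
    return mapping


class _Renamer(ast.NodeTransformer):
    """Apply the mapping; names not in it (builtins, functions) are kept."""

    def __init__(self, mapping: dict[str, str]) -> None:
        self.mapping = mapping

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node


def obfuscate(source: str) -> str:
    """Return source with variables renamed (requires Python 3.9+)."""
    tree = ast.parse(source)
    renamed = _Renamer(_collect_names(tree)).visit(tree)
    return ast.unparse(renamed)


SAMPLE = '''
def mean(values):
    total = 0
    for item in values:
        total += item
    return total / len(values)
'''

if __name__ == "__main__":
    print(obfuscate(SAMPLE))
```

The obfuscated function computes the same result, but the descriptive identifiers (`values`, `total`, `item`) that a model could lean on when describing the code are gone.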

6. Interpretability, Compression, and Modifiability

Mechanistic interpretability methods applied to StarCoder2 reveal recurring and specialized attention motifs, such as induction heads and rare token trackers, using the AP-MAE vision transformer (masked autoencoder) framework. These attention patterns can be reconstructed with high accuracy (MSE loss ≈ 7.1×10⁻³), and interventions on salient heads can boost next-token accuracy by up to 13.6% before abrupt collapse (Katzy et al., 4 Apr 2026). Cross-model transfer of AP-MAE representations is robust, suggesting scalable pathways to model-level interpretability that generalize across StarCoder2 variants.

Compression and throughput trade-offs have been systematically assessed (Reus et al., 2024):

Method (SC2-3B, 128t)	pass@1	Throughput (tok/s)	ΔEnergy
None	7.9%	45.8	Baseline
8-bit quantization	6.7%	6.1 (↓7.5×)	+75%
4-bit quantization	6.1%	23.1 (↓2×)	+19%
Pruning 1 layer	–2 pp	marginal	<5%/layer

Quantization via bitsandbytes (4/8-bit) significantly reduces memory but increases wall-time (hence energy), due to dequantization overhead. 8-bit quantization is acceptable for accuracy if hardware supports native INT8 ops. Layer pruning rapidly degrades function (–2 pp pass@1 per layer).
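The memory-versus-overhead trade-off can be illustrated with a minimal pure-Python sketch of symmetric absmax 8-bit quantization. This is a simplification for intuition only: bitsandbytes uses more elaborate schemes (for example vector-wise scaling with outlier handling for 8-bit), and the function names, the per-tensor scale, and the nominal 3-billion parameter count are assumptions introduced here.

```python
def quantize_absmax(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127.0 if peak > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats; this extra step runs at inference time."""
    return [v * scale for v in q]


def weight_bytes(num_params: int, bits: int) -> int:
    """Storage for the weights alone, ignoring scales and activations."""
    return num_params * bits // 8


if __name__ == "__main__":
    w = [0.5, -1.27, 0.03, 1.0]
    q, scale = quantize_absmax(w)
    restored = dequantize(q, scale)
    print(q, [round(x, 4) for x in restored])
    # Nominal 3B-parameter model (assumed exact count for illustration).
    for bits in (16, 8, 4):
        print(bits, "bit:", weight_bytes(3_000_000_000, bits) / 1e9, "GB")
```

Weights stored as integers must be rescaled before (or during) each matrix multiplication unless the hardware computes directly on the integer format, which is consistent with the observation that memory falls while wall-clock time and energy rise.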

7. Limitations, Legal Considerations, and Future Work

StarCoder2’s use of permissively licensed and unlabeled code ensures broad access but does not guarantee that every file is free from legal encumbrance. While “Am I in the Stack” empowers user validation, full legal compliance, especially at the granularity of snippet-level licensing, remains an open issue (Lozhkov et al., 2024). Inference-time reverse engineering pipelines (“Chinese Wall”) that pair StarCoder2-Instruct with a strong proprietary annotator (e.g., Gemini) yield 20% code editing improvements, but the legal distinctness of outputs from proprietary logic is not formally assured, as classical clean-room rules may not fully apply (Hanmongkolchai, 21 Jul 2025). Further, StarCoder2’s summarization and semantic reasoning capabilities exhibit limitations when challenged on dense, heavily obfuscated, or highly abstract code.

Future directions identified include instruction- and RLHF-tuning for conversational workflows, improved scaling analysis, low-resource and domain adaptation, and advanced compression/interpretability methods (Lozhkov et al., 2024). Transparent, public-domain–only models with robust performance remain an aspirational goal for the community.

References:

(Lozhkov et al., 2024) StarCoder 2 and The Stack v2: The Next Generation
(Wei et al., 2024) SelfCodeAlign: Self-Alignment for Code Generation
(Cordeiro et al., 2024) An Empirical Study on the Code Refactoring Capability of LLMs
(Reus et al., 2024) An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2
(Makharev et al., 23 Feb 2025) Code Summarization Beyond Function Level
(Nikiema et al., 14 Apr 2025) The Code Barrier: What LLMs Actually Understand?
(Liu et al., 2024) CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models
(Hanmongkolchai, 21 Jul 2025) Applying the Chinese Wall Reverse Engineering Technique to LLM Code Editing
(Katzy et al., 4 Apr 2026) Automated Attention Pattern Discovery at Scale in LLMs