StarCoder: Open-Source Code LLM

Updated 29 December 2025
  • StarCoder is an open-access code language model built with a decoder-only transformer architecture and multi-query attention for efficient code-centric tasks.
  • It delivers competitive performance on benchmarks such as HumanEval, MBPP, and multilingual evaluations, excelling in code generation, infilling, and automated test case generation.
  • The model emphasizes transparency and responsible usage with advanced PII redaction, data traceability tools, and permissive licensing under OpenRAIL.

StarCoder is an open-access family of LLMs for code, developed by the BigCode community. Built on decoder-only transformer architectures, StarCoder integrates architectural efficiency, permissive data governance, and fine-tuned optimization for code-centric tasks. It has achieved strong competitiveness with proprietary and larger-scale models while maintaining a focus on transparency, responsible release, and comprehensive evaluation across programming languages, natural and adversarial prompts, and diverse code reasoning scenarios (Li et al., 2023, Lozhkov et al., 2024).

1. Model Architecture and Pretraining

StarCoder and its successor StarCoder2 employ a decoder-only transformer architecture, leveraging multi-query attention (MQA) for inference efficiency. The canonical StarCoderBase uses:

  • 15.5 billion parameters
  • 40 layers, hidden size 6144, 48 attention heads per layer
  • 8K context (StarCoderBase/StarCoder); StarCoder2 supports up to 16,384 tokens via long-context fine-tuning
  • Byte-level BPE vocabulary (49,152 tokens)

The model processes input sequences with learned absolute positional encodings (StarCoder) and rotary position encodings (StarCoder2) (Li et al., 2023, Lozhkov et al., 2024).
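
The multi-query attention listed above can be illustrated with a toy sketch: queries keep one projection per head, while a single key/value head is shared across all heads, which is what shrinks the KV cache at inference time. The PyTorch code below uses placeholder dimensions and is illustrative only, not StarCoder's production attention kernel.

```python
import torch
import torch.nn.functional as F


class MultiQueryAttention(torch.nn.Module):
    """Toy multi-query attention: per-head queries, a single shared key/value head."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = torch.nn.Linear(hidden_size, hidden_size)      # one projection per head
        self.k_proj = torch.nn.Linear(hidden_size, self.head_dim)    # shared by all heads
        self.v_proj = torch.nn.Linear(hidden_size, self.head_dim)    # shared by all heads
        self.o_proj = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = x.shape
        q = self.q_proj(x).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).unsqueeze(1)   # (bsz, 1, seq, head_dim): the KV cache stays small
        v = self.v_proj(x).unsqueeze(1)
        scores = q @ k.transpose(-1, -2) / self.head_dim ** 0.5
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        out = weights @ v                 # shared K/V broadcast across all query heads
        return self.o_proj(out.transpose(1, 2).reshape(bsz, seq_len, -1))


attn = MultiQueryAttention(hidden_size=64, num_heads=8)   # toy sizes, not 6144/48
print(attn(torch.randn(2, 16, 64)).shape)                 # torch.Size([2, 16, 64])
```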

MQA reduces memory requirements and increases throughput by sharing key/value projections across attention heads. StarCoder supports fill-in-the-middle (FIM) via sentinel tokens, enabling bidirectional infilling. Pretraining is performed on 1 trillion tokens (StarCoderBase) or up to 4.1 trillion tokens (StarCoder2-15B) sourced from The Stack (v1/v2): a corpus of permissively licensed, deduplicated code from GitHub and auxiliary software sources, including code documentation, notebooks, issues, and web data (Li et al., 2023, Lozhkov et al., 2024).
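
As a concrete illustration of FIM prompting, the sketch below uses StarCoder's sentinel-token format via Hugging Face transformers. The checkpoint id and generation settings are illustrative, and the weights are gated behind acceptance of the model license.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoderbase"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Prefix and suffix surround the hole; the model generates the missing middle.
prompt = (
    "<fim_prefix>def fibonacci(n):\n    "
    "<fim_suffix>\n    return a\n<fim_middle>"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```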

Fine-tuning is performed on language-specific corpora; e.g., StarCoder is fine-tuned on 35B Python tokens. Optimization uses Adam with β₁ = 0.9, β₂ = 0.95, weight decay 0.1, and a linear-warmup/cosine-decay schedule. Infrastructure involves large-scale training on A100 GPUs with Megatron-LM, BF16 precision, and FlashAttention kernels.
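
A minimal sketch of these optimizer settings in PyTorch follows. The learning rate and step counts are placeholders rather than the actual training configuration, and AdamW is used here to express the decoupled weight decay.

```python
import math
import torch

model = torch.nn.Linear(16, 16)   # stand-in for the transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps, total_steps = 2_000, 250_000   # placeholder schedule lengths


def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                     # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay toward zero


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```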

2. Evaluation Benchmarks and Performance

StarCoder and StarCoder2 have been evaluated on a comprehensive suite of code generation and reasoning benchmarks:

  • Python Generation: HumanEval (pass@1 33.6% for StarCoder, 46.3% for StarCoder2-15B), MBPP, and DS-1000 (26% for StarCoder, 33.8% for StarCoder2-15B); the pass@k metric behind these scores is sketched after this list
  • Multilingual Code: StarCoderBase matches or outperforms all open models on 19 translated HumanEval variants
  • Open-domain Code/Natural Language: ODEX (StarCoderBase achieves 46.5% pass@1 on English prompts and 44.7% on mixed NL inputs)
  • Security: StarCoderBase is competitive in valid completion rates on security-focused evaluations, with improved safety profiles over some peer models
  • Infilling: Outperforms previous models (e.g., InCoder-6B, SantaCoder) on JavaScript, Java, and Python FIM
  • Math and Reasoning: GSM8K (21.5% PAL for StarCoderBase, 65.1% for StarCoder2-15B), MMLU, CRUXEval
  • Long-context Handling: Perplexity reduced by 10–20% on 8K vs. 2K tokens, demonstrating effective context scaling (Li et al., 2023, Lozhkov et al., 2024)
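
The pass@1 figures above are typically computed with the unbiased pass@k estimator used for HumanEval-style evaluation: draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples passes. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples passes."""
    if n - c < k:          # fewer failures than k draws: some sampled solution must pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per problem, 40 of them pass the unit tests.
print(pass_at_k(n=200, c=40, k=1))    # 0.2
print(pass_at_k(n=200, c=40, k=10))   # close to 0.9
```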

StarCoder2-15B demonstrates performance gains:

  • Outperforms StarCoderBase-15B on code, math, and reasoning tasks (+58% HumanEval, +203% GSM8K)
  • Matches CodeLlama-34B on MBPP, outperforms it on math/reasoning
  • Exceeds comparable-size models on multilingual and low-resource languages

3. Syntax and Semantic Probing Analysis

Targeted probing tasks elucidate StarCoder’s capacity to encode code structure:

  • AST (Abstract Syntax Tree) Reconstruction: Peak Matthews Correlation Coefficient (MCC) ≈ 0.78–0.82 in shallow layers (4–6), declining to ≈ 0.65 in deeper layers, indicating strong but layer-local capture of syntax trees.
  • Token Syntax Tagging: Macro-F1 < 0.35 on Java250 token roles; StarCoder lags encoder-only models (CodeT5: 0.45–0.55), indicating limited fine-grained token-role separation (Ma et al., 2022).
  • Semantic Relation Prediction: Peak MCCs for control-flow (CFG ≈ 0.60), data-dependency (DDG ≈ 0.57), and control-dependency (CDG ≈ 0.52) links at early layers (4–6), but all degrade rapidly in deeper layers (CFG to ≈ 0.35, CDG to ≈ 0.28).
  • Long-range Semantic Propagation: MCC for CDG/DDG graph tasks drops from ~0.5 to ≤0.25 toward the final layer, indicating poor modeling of dependencies spanning more than 10 code statements.

Attention head analysis reveals that while ~70% of heads specialize in control/data-flow links, none focus exclusively on control-dependency edges (0/1920 CDG heads), limiting explicit semantic reasoning via direct attention (Ma et al., 2022).
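
The layer-wise probing methodology behind these results can be approximated with a simple recipe: freeze the model, extract per-layer hidden states, fit a linear probe for the structural property of interest, and score with MCC. The sketch below assumes scikit-learn and uses placeholder features and labels rather than the actual probing suites of Ma et al. (2022).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split


def probe_layer(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on one layer's frozen token representations; return MCC."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return matthews_corrcoef(y_test, clf.predict(X_test))


# hidden_states_per_layer: list of (num_tokens, hidden_size) arrays extracted from the
# frozen model (e.g., with output_hidden_states=True); token_labels: the structural
# property being probed, such as an AST node type or a CFG/DDG/CDG edge indicator.
# mcc_by_layer = [probe_layer(h, token_labels) for h in hidden_states_per_layer]
```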

4. Application to Automated Test Case Generation

StarCoder-15.5B has been systematically evaluated for automated test case generation from bug reports:

  • Cognitive Layered Evaluation (LIBRO Framework): Assessment follows Bloom’s taxonomy, measuring solution rates at increasing cognitive complexity:
    • Remember: 24.1% BRT (bug-reproducing test) on Defects4J baseline—reproduces prior results
    • Understand: Minimal robustness (~21% BRT) under paraphrase/translation; performance drop is statistically significant (p<0.01)
    • Apply: Severe degradation (>60% drop) under identifier masking (hash mask: 6.9% BRT; –71% relative)
    • Analyze/Component Sensitivity: Structured bug-report features (test code, method names) are far more predictive than natural language components for test generation success

Open-book prompting (BM25-retrieved few-shot examples) substantially boosts BRT rates, by a factor of 2.9–3.3× under mutation conditions, mitigating the model’s identifier dependence. A plausible implication is that StarCoder leverages surface-form similarity and explicit structure rather than robust, deep abstraction in this task setting (Qureshi et al., 6 Oct 2025).
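
A minimal sketch of such BM25-based few-shot retrieval follows, assuming the rank_bm25 package and a placeholder pool of (bug report, test) examples rather than the actual open-book setup.

```python
from rank_bm25 import BM25Okapi

# Placeholder pool of (bug report, test) pairs standing in for the retrieval corpus.
example_pool = [
    {"report": "NullPointerException when parsing an empty config file", "test": "..."},
    {"report": "Date formatter drops the timezone offset", "test": "..."},
    {"report": "List index out of range on empty input", "test": "..."},
]
bm25 = BM25Okapi([ex["report"].lower().split() for ex in example_pool])

new_report = "Crash with empty configuration file during startup"
scores = bm25.get_scores(new_report.lower().split())
top = sorted(range(len(example_pool)), key=lambda i: scores[i], reverse=True)[:2]

# Retrieved examples become few-shot demonstrations ahead of the new bug report.
prompt = "\n\n".join(
    f"Bug report: {example_pool[i]['report']}\nTest:\n{example_pool[i]['test']}" for i in top
) + f"\n\nBug report: {new_report}\nTest:\n"
print(prompt)
```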

5. In-Context Learning and Generalization to Novel Libraries

Extensive in-context learning experiments show that StarCoder and its derivatives (StarCoderPlus) can learn to use new libraries and APIs from in-context presentations such as usage demonstrations, docstring-style descriptions, or raw implementations:

  • Demonstrations: With 5–20 usage examples, StarCoderPlus achieves 33.0% (GQA) to 48.6% (image editing) pass@1 on VisProg tasks—well above zero-shot, but below models like GPT-4 (51.1%)
  • Descriptions (Docstring-style API): StarCoderPlus F1 drops to 8–15% (e.g., 8.7% on GQA), showing limited ability to translate from natural language API specification
  • Raw Implementations: Slightly better than descriptions (NLVR: 3.1%), but still weak compared to demonstration-based supervision
  • Novel Language Induction: On Isabelle algebra problems with demonstrations, 7.1% proof correctness (drops to 1.7% when keywords are aliased). Indicates partial reliance on prior knowledge of programming-language surface forms (Patel et al., 2023)

Demonstrations are the most effective mechanism for priming StarCoder for new APIs; raw implementations help more than descriptions, but compositional, robust understanding from non-example formats remains limited.
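
To make the format distinction concrete, the sketch below contrasts a demonstration-style prompt with a description-style prompt for a hypothetical crop_region API; neither prompt is taken from the evaluated benchmarks.

```python
# Two in-context formats for the same hypothetical crop_region API.

# (a) Demonstrations: concrete usage examples of the new function.
demonstration_prompt = """\
# Examples of the crop_region API in use:
img2 = crop_region(img1, box=(10, 10, 120, 90))
img3 = crop_region(img2, box=(0, 0, 64, 64))

# Task: crop the top-left quarter of `photo`
"""

# (b) Description: a docstring-style specification with no usage examples.
description_prompt = """\
# crop_region(image, box) -> image
#   Returns the sub-image inside `box`, given as (left, top, right, bottom) pixel coordinates.

# Task: crop the top-left quarter of `photo`
"""
```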

6. Model Transparency, Safety, and Governance

StarCoder’s release emphasizes responsible model usage:

  • PII Redaction: A hybrid StarEncoder + rules system redacts EMAIL, NAME, IP, KEY, PASSWORD, USERNAME entities with high recall (e.g., EMAIL: 98.9%). Compute: 800 GPU-hrs for whole-corpus annotation.
  • Data Attribution Tools: “Data Portrait” Bloom-filters for fast substring membership tests (26 GB artifact) and BM25 Elasticsearch index for full-file traceability. Enables rapid provenance checking to mitigate licensing and data-leak risk.
  • Licensing: Distributed under the OpenRAIL-M (Open Responsible AI) license for StarCoder/StarCoderBase and OpenRAIL for StarCoder2, permitting commercial and research use under usage restrictions (e.g., prohibition of malware, encouragement of transparency). The Stack supports an opt-out process; requests from 44–91 users/organizations have been honored and over 1,500 repositories removed (Li et al., 2023, Lozhkov et al., 2024).

StarCoder2 additionally provides training dataset hashes (SWHIDs) for data traceability and an “Am I in The Stack” query tool, reflecting a comprehensive governance infrastructure.
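
The membership-query idea behind the Data Portrait and the “Am I in The Stack” tooling can be sketched with a small Bloom filter over code n-grams; the sizes, hashing, and n-gram scheme below are illustrative, not the released artifact’s parameters.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: set membership with no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions per item from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ngrams(code: str, n: int = 50, stride: int = 10):
    """Overlapping character n-grams used as membership keys (parameters are arbitrary)."""
    return [code[i:i + n] for i in range(0, max(1, len(code) - n + 1), stride)]


portrait = BloomFilter()
corpus_file = "def add(a, b):\n    return a + b\n" * 5   # stand-in for a training file
for gram in ngrams(corpus_file):
    portrait.add(gram)

print(ngrams(corpus_file)[0] in portrait)   # True: this n-gram was indexed
print("x" * 50 in portrait)                 # almost certainly False: never indexed
```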

7. Comparative Landscape, Limitations, and Prominent Recommendations

Relative to encoder-only code models (e.g., CodeBERT, GraphCodeBERT) and newer LLMs (CodeLlama, DeepSeekCoder):

  • StarCoder matches or exceeds open competitors in multilingual and pre-2023 tasks, but is more identifier-dependent and sensitive to surface-form mutation in test generation tasks.
  • StarCoder2-15B narrows the gap to larger proprietary models and leads among open-source models in math and code reasoning, especially for low-resource languages (Lozhkov et al., 2024).
  • However, probing reveals deficits in local token-role tagging, control-dependency edge propagation, and robustness to API description formats.

Recommended improvements include:

  • Augmenting pretraining with masked-language modeling (MLM) for bidirectional context
  • Incorporating graph-aware or contrastive objectives (AST/CDG/DDG) to robustify semantic modeling
  • Dynamic retrieval for few-shot prompting in test generation
  • Fine-tuning with identifier anonymization or mutation to reduce surface-form overfitting (a surface-level anonymization sketch follows this list)
  • Encouraging bug-report templates emphasizing structured, executable information (Ma et al., 2022, Qureshi et al., 6 Oct 2025)
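
As referenced in the list above, identifier anonymization can be sketched at the surface level with Python’s tokenize module; a production pipeline would need scope- and language-aware handling rather than this blanket renaming.

```python
import io
import keyword
import tokenize


def anonymize_identifiers(source: str) -> str:
    """Consistently rename non-keyword NAME tokens (fib -> var_0, n -> var_1, ...)."""
    mapping = {}
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            mapping.setdefault(tok.string, f"var_{len(mapping)}")
            tok = tok._replace(string=mapping[tok.string])
        tokens.append(tok)
    return tokenize.untokenize(tokens)


snippet = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
print(anonymize_identifiers(snippet))
# Every occurrence of fib/n is renamed consistently to var_0/var_1
# (untokenize may adjust spacing slightly).
```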

This suggests that while StarCoder establishes a high-water mark for open code LLMs under permissive licensing and transparent data, sustained advances in abstraction, semantic robustness, and attention specialization remain open research priorities.
