
Magicoder Series: Open-Source Code LLMs

Updated 19 December 2025
  • Magicoder Series is a suite of open-source LLMs for code generation utilizing OSS-Instruct to create diverse instruction–code pairs from open-source snippets.
  • The models are fine-tuned on OSS-Instruct and Evol-Instruct data, achieving superior performance on benchmarks like HumanEval+ and MBPP+ with ≤7B parameters.
  • Their low-resource design enables deployment on standard hardware, democratizing advanced code generation with extensible, open-source methodologies.

Magicoder refers to a series of fully open-source LLMs for code generation, introduced in "Magicoder: Empowering Code Generation with OSS-Instruct" (Wei et al., 2023). All models, associated training data, and weights are released openly. The core technical innovation is OSS-Instruct, a synthetic instruction data-generation pipeline that leverages randomly sampled open-source code snippets to mitigate data bias and maximize instruction diversity and realism. Magicoder models, each constrained to ≤7B parameters, exhibit competitive or superior performance on standard code generation benchmarks relative to state-of-the-art models with considerably larger footprints.

1. Model Architecture and Variants

Magicoder models are implemented by fine-tuning pre-existing LLMs with OSS-Instruct-generated data. The main released variants follow two architectural baselines:

| Model | Base Model | Parameters | Instruction Data |
|---|---|---|---|
| Magicoder | CodeLlama-7B | 7B | 75K (OSS-Instruct) |
| Magicoder⁺ | CodeLlama-7B | 7B | 75K + 110K (Evol-Instruct/"evol-codealpaca-v1") |
| Magicoder-DS | DeepSeek-Coder-Base-6.7B | 6.7B | 75K (OSS-Instruct) |
| Magicoder-DS⁺ | DeepSeek-Coder-Base-6.7B | 6.7B | 75K + 110K (Evol-Instruct/"evol-codealpaca-v1") |

Magicoder models are fine-tuned on instruction–response pairs for two epochs per dataset: OSS-Instruct data serve as the initial fine-tuning set, and the Evol-Instruct dataset ("evol-codealpaca-v1") is used for a further two epochs in the "⁺" variants. (Greedy decoding is used at evaluation time, not during training.) The training setup employs 2× NVIDIA A100-80GB GPUs under PyTorch DDP, with the Adafactor optimizer, an initial learning rate of $5 \times 10^{-5}$, a short warmup (15 steps), and linear decay.
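For intuition, the warmup-plus-linear-decay schedule described above can be sketched as follows. This is a minimal illustration under the stated hyperparameters; the exact decay endpoint and step accounting in the original training code are assumptions:

```python
def lr_at_step(step, total_steps, base_lr=5e-5, warmup_steps=15):
    """Sketch of a warmup + linear-decay schedule: ramp linearly to
    base_lr over warmup_steps, then decay linearly toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - warmup_steps, 1)
    progress = (step - warmup_steps) / remaining
    return base_lr * max(0.0, 1.0 - progress)
```

For example, the learning rate reaches its peak of 5e-5 at the end of warmup and shrinks monotonically thereafter.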

2. OSS-Instruct Data Generation Pipeline

OSS-Instruct synthesizes instruction–code pairs for instruction tuning by:

  • Seed corpus and snippet sampling: Starting from starcoderdata—a de-duplicated subset of The Stack (open-source code, permissive licenses)—the pipeline randomly extracts 1–15 consecutive lines ("seed snippets" $s$) from documents in nine programming languages.
  • Prompting a teacher LLM: Each $s$ is fed via a prompt (see Figure 1 of the source paper) to gpt-3.5-turbo-1106 (the teacher LLM $T$), which greedily generates a pair $(x, y)$: a self-contained code-related problem statement $x$ and a valid solution $y$:

$$y^* = \arg\max_{y} p_T(y \mid \text{prompt}(s))$$

  • Data cleaning and decontamination: Duplicates and samples overlapping (exact docstring or solution match) with HumanEval, MBPP, APPS, DS-1000, and GSM8K are eliminated. This yielded ~75,000 high-quality entries, with only nine samples removed during decontamination.
  • Formal definition:

The resulting dataset is

$$\mathcal{D}_{OSS} = \left\{ (x_i, y_i) : (x_i, y_i) = T(\text{prompt}(s_i)),\ s_i \in \mathcal{S} \right\}$$

  • Instruction tuning: The student model is trained to minimize the negative next-token log-likelihood over $\mathcal{D}_{OSS}$:

$$\mathcal{L}(\theta) = -\frac{1}{|\mathcal{D}_{OSS}|} \sum_{(x, y) \in \mathcal{D}_{OSS}} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x)$$
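The objective above can be illustrated with a toy computation, where `token_prob` stands in for $p_\theta$. This is a sketch of the loss bookkeeping only, not the actual training loop:

```python
import math

def instruction_tuning_loss(dataset, token_prob):
    """Negative log-likelihood of response tokens y_t, conditioned on
    the instruction x and the response prefix y_<t, normalized by the
    number of (x, y) pairs as in the equation above.
    token_prob(x, prefix, token) is a stand-in for p_theta."""
    total = 0.0
    for x, y in dataset:
        for t in range(len(y)):
            total += -math.log(token_prob(x, y[:t], y[t]))
    return total / len(dataset)
```

With a uniform stand-in probability of 0.5 per token, two pairs with response lengths 2 and 3 give a loss of (2 + 3) · ln 2 / 2.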

OSS-Instruct achieves notable diversity and controllability in synthetic coding tasks, surpassing the variety and problem realism of prior self-instruct-style pipelines.
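The seed-sampling step of the pipeline above can be sketched as follows. Function and parameter names are illustrative, and the sketch assumes a non-empty source document:

```python
import random

def sample_seed_snippet(document, min_lines=1, max_lines=15, rng=None):
    """Extract 1-15 consecutive lines from a source file, mirroring
    the OSS-Instruct seed-sampling step (names are illustrative;
    assumes `document` contains at least one line)."""
    rng = rng or random.Random()
    lines = document.splitlines()
    n = rng.randint(min_lines, min(max_lines, len(lines)))
    start = rng.randint(0, len(lines) - n)
    return "\n".join(lines[start:start + n])
```

Each sampled snippet would then be embedded in a prompt and sent to the teacher LLM to elicit a problem–solution pair.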

3. Benchmark Evaluation and Performance

Performance is quantified via pass@1 accuracy (%) under greedy decoding across standard code benchmarks:

  • HumanEval+: Magicoder⁺ (OSS+Evol-Instruct, 7B) reaches 66.5%, outperforming ChatGPT (65.9%)—the first ≤7B open-source LLM to surpass a major proprietary baseline.
  • MBPP+: Magicoder⁺ achieves 56.6%, above all ≤16B open-source competitors.
  • MultiPL-E (multilingual): Magicoder⁺ (7B) matches/exceeds WizardCoder-34B (34B) in four out of six languages.
  • DS-1000 (data science): Magicoder⁺ (7B) yields 37.5%, +8.3pp above WizardCoder-15B (15B).
  • DeepSeek-Coder comparison: Magicoder-DS⁺ (6.7B) beats DeepSeek-Instruct-6.7B while using 8× fewer fine-tuning tokens.

Detailed result tables appear in the source paper; the empirical gains on HumanEval+, MBPP+, MultiPL-E, and DS-1000 indicate strong sample efficiency and competitive multilingual transfer. OSS-Instruct's diverse coverage and decontaminated construction contribute to reduced benchmark leakage.
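The pass@1 metric used throughout these comparisons can be sketched as follows, with exactly one greedy candidate per problem. `problems`, `solve`, and `checker` are illustrative stand-ins for a benchmark harness:

```python
def pass_at_1(problems, solve):
    """pass@1 under greedy decoding: generate one candidate per
    problem with `solve` and return the percentage whose candidate
    passes its checker (runtime errors count as failures)."""
    passed = 0
    for prompt, checker in problems:
        try:
            passed += bool(checker(solve(prompt)))
        except Exception:
            pass  # a crashing candidate scores zero
    return 100.0 * passed / len(problems)
```

In real benchmarks such as HumanEval+, the checker executes the candidate against an extended unit-test suite rather than a single predicate.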

4. Technical Innovations

  • OSS-Instruct: The first pipeline to leverage arbitrary open-source code snippets as seeds for instruction–response generation, achieving enhanced problem diversity and realism relative to Self-Instruct and Evol-Instruct.
  • Low-bias synthetic data: Cosine-similarity analyses show that Magicoder’s data are less HumanEval-similar than prior approaches, yet deliver superior downstream benchmark performance.
  • Orthogonality: OSS-Instruct can be combined with other approaches (Evol-Instruct), attaining compounded empirical gains notable in Magicoder⁺ and Magicoder-DS⁺.
  • Cross-language transfer: Empirical results confirm that training on mixed-language OSS-Instruct data yields effective generalization in multilingual and cross-domain coding tasks.
  • Ablation findings:
    • Direct snippet fine-tuning (using comment–function pairs) degrades performance, whereas OSS-Instruct tuning yields +21.4pp improvement on HumanEval+.
  • Problem diversity: OSS-Instruct covers ten topic categories evenly, with broad distributions for both problem statement and solution lengths.

5. Limitations and Future Research Directions

Magicoder’s main limitations and open questions, as documented:

  • Teacher LLM limitations: The teacher model (gpt-3.5-turbo) can hallucinate or produce incomplete solutions; adopting a higher-fidelity teacher (e.g., GPT-4) is anticipated to improve sample reliability.
  • Task coverage: OSS-Instruct currently targets code generation (not infilling or repair), and broadening the domain remains a prospective avenue.
  • Benchmark leakage: Despite thorough decontamination, undetected semantic overlap may persist.
  • Scalability: The sample-efficient nature of OSS-Instruct for ≤7B models may become inadequate for larger architectures without improved seed balancing or sampling strategies.
  • Future work: Research expansion includes domain-specific corpora (medical, embedded), open-source domains beyond code (mathematical proofs, scientific writing), and active sampling across reference snippet space.

6. Application, Deployment, and Extensibility

Magicoder models are directly downloadable with full code and weights. The low resource requirement (≤7B parameters) enables deployment on commodity GPU servers or standard cloud instances, supporting privacy-preserving, on-premises use. Practitioners may further fine-tune on proprietary code or APIs. The OSS-Instruct methodology is extensible to any domain characterized by large repositories of reference snippets—examples include legal statutes, financial data, and biological sequences—by adapting prompts for instruction–response generation.
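As a concrete starting point for local deployment, the Magicoder model cards document a simple instruction prompt format. The template below reproduces it from memory and should be verified against the official repository before use; the helper function name is illustrative:

```python
# Prompt template as documented on the Magicoder model cards
# (reproduced from memory; verify against the official ise-uiuc
# Magicoder repository before relying on it).
MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""

def build_prompt(instruction: str) -> str:
    """Illustrative helper: wrap a user instruction in the template."""
    return MAGICODER_PROMPT.format(instruction=instruction)
```

The resulting string can be passed to any standard text-generation interface loading the released weights.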

A plausible implication is the democratization of advanced code generation capabilities for research groups and enterprises free from closed-source constraints, underpinning broader adoption and continued methodological innovation. The open release model further encourages benchmarking, detailed error analysis, and the development of domain-adapted variants.

References

  1. Wei et al. (2023). "Magicoder: Empowering Code Generation with OSS-Instruct."