MagicoderS-CL Variant: Open Code LLM
- MagicoderS-CL is a fully open-source code-generation LLM built on the CodeLlama-7B architecture using synthetic instruction data from OSS-Instruct.
- It employs a two-stage instruction tuning process on 75,000 examples, with optional Evol-Instruct continuation, achieving state-of-the-art pass@1 scores on several benchmarks.
- The model demonstrates practical performance in diverse coding tasks including Python data science, system scripting, and Java engineering, while highlighting limitations in infilling support and dataset diversity.
MagicoderS-CL is a fully open-source code-generation LLM built by fine-tuning Meta's CodeLlama-7B foundation model on synthetic instruction data generated via the OSS-Instruct methodology. OSS-Instruct aims to create high-quality, diverse instruction-following code datasets from open-source code corpora; the resulting model demonstrates state-of-the-art performance among open models of 7B parameters and larger, surpassing even some proprietary models on rigorous code-generation benchmarks (Wei et al., 2023).
1. Model Architecture and Initialization
MagicoderS-CL adopts the CodeLlama-7B architecture, a single-stream, decoder-only Transformer without any structural modifications. The core model characteristics are as follows:
| Specification | Value | Source |
|---|---|---|
| Transformer layers | 32 | CodeLlama-7B |
| Hidden dimension ($d_{\text{model}}$) | 4,096 | CodeLlama-7B |
| Attention heads per layer | 32 | CodeLlama-7B |
| Feed-forward size | 11,008 | CodeLlama-7B |
| Context window | 16,384 tokens | CodeLlama-7B |
No adapters or new layers are introduced. Weight initialization is performed by loading the pretrained CodeLlama-7B weights directly, with no further re-initialization. This distinguishes MagicoderS-CL from the DS variant (MagicoderS-DS), which is instead initialized from DeepSeek-Coder-Base.
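Concretely, initialization amounts to loading the released checkpoint unchanged. A minimal sketch with the Hugging Face `transformers` library; the checkpoint name is Meta's public release, and the usage here is illustrative rather than the authors' script:

```python
# Minimal sketch: loading pretrained CodeLlama-7B weights directly, with
# no adapters, new layers, or re-initialization before instruction tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# The loaded decoder-only Transformer matches the table above:
# 32 layers, hidden size 4096, 32 attention heads, FFN size 11008.
print(model.config.num_hidden_layers, model.config.hidden_size)
```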
The training strategy involves two-stage instruction tuning:
- OSS-Instruct dataset (75,000 instances, 2 epochs).
- Optional Evol-Instruct continuation (2 further epochs), resulting in MagicoderS-CL⁺.
No additional pretraining or further architectural changes are employed (Wei et al., 2023).
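A compact sketch of the two-stage schedule above; the `finetune` helper and the dataset identifiers are hypothetical placeholders, not the authors' code:

```python
# Hypothetical driver for the two-stage schedule; `finetune` stands in for
# any supervised instruction-tuning routine (see Section 3 for the loss).
def train_magicoder_cl(model, finetune):
    # Stage 1: 2 epochs of supervised tuning on the 75K OSS-Instruct pairs.
    model = finetune(model, dataset="oss-instruct-75k", epochs=2)
    # Stage 2 (optional): 2 further epochs on Evol-Instruct-style data,
    # yielding the MagicoderS-CL+ variant.
    model = finetune(model, dataset="evol-instruct", epochs=2)
    return model
```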
2. OSS-Instruct Data Generation and Curation
OSS-Instruct is a methodology for producing diverse synthetic instruction data grounded in open-source code references. Its primary pipeline involves:
- Seed Corpus: The seed corpus is "starcoderdata", a filtered, permissively licensed subset of The Stack; random snippets of 1–15 consecutive lines are extracted from 80,000 documents.
- Prompting and Synthesis: Each seed snippet is formatted into a prompt and fed to gpt-3.5-turbo, requesting the creation of a realistic programming problem "inspired by" the snippet and a self-contained solution with minimal explanatory commentary. Greedy decoding is employed to maximize consistency.
- Data Cleaning/Decontamination: Duplicates and any overlap with evaluation benchmarks (HumanEval(+), MBPP(+), DS-1000, APPS, GSM8K) are removed; only 9 samples are filtered out in this step.
- Diversity Analysis: Ten manually defined code categories (e.g., algorithms, ML, scripts) ensure a balanced representation; prompt and solution lengths vary from a few tokens to ~1,000 tokens.
- Bias Measurements: Average TF-IDF cosine similarity to HumanEval is the lowest among compared synthetic datasets, indicating minimal prompt duplication or content drift; a sketch of this measurement follows the list.
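One plausible implementation of the similarity measurement with scikit-learn; the paper's exact procedure may differ, so this is a sketch of the idea rather than the reference code:

```python
# Sketch of a TF-IDF similarity check against HumanEval: for each synthetic
# sample, take its best-match cosine similarity to any benchmark problem,
# then average. Lower values suggest less benchmark overlap.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_best_match_similarity(synthetic_texts, benchmark_texts):
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(synthetic_texts + benchmark_texts)
    syn = matrix[: len(synthetic_texts)]
    ref = matrix[len(synthetic_texts):]
    sims = cosine_similarity(syn, ref)   # shape: (n_synthetic, n_benchmark)
    return sims.max(axis=1).mean()
```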
Formally, the data generation process is:
```python
for s in starcoderdata:                    # 80K seed documents
    prompt = fill_template(s)              # embed a 1-15 line seed snippet
    x, y = gpt_3_5_turbo.generate(prompt)  # greedy decoding: problem x, solution y
    if not duplicate(x, y):                # dedup + benchmark decontamination
        add_to_dataset(x, y)
```
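The `fill_template` step above glosses over two details: snippet extraction and the prompt itself. A hedged sketch of both; the template wording is a paraphrase, not the paper's verbatim prompt:

```python
import random

def extract_seed(document: str, min_lines: int = 1, max_lines: int = 15) -> str:
    """Draw 1-15 consecutive lines from a source document, per OSS-Instruct."""
    lines = document.splitlines()
    n = random.randint(min_lines, min(max_lines, len(lines)))
    start = random.randint(0, len(lines) - n)
    return "\n".join(lines[start : start + n])

def fill_template(seed: str) -> str:
    # Paraphrase of the OSS-Instruct prompt, not the verbatim template.
    return (
        "Gain inspiration from the following code snippet and create a "
        "self-contained programming problem together with a correct, "
        "self-contained solution.\n\nCode snippet:\n" + seed
    )
```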
The resulting 75,000 problem-solution pairs constitute the instruction-tuning corpus; the training objective applied to them is described in the next section.
3. Training Objective and Optimization
The principal objective is standard next-token prediction on solutions given the prompt, using token-level cross-entropy:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right),$$

where $x$ is the instruction prompt, $y$ the reference solution, and $\theta$ the model parameters.
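A PyTorch sketch of this loss follows. Masking the prompt tokens so that only solution tokens contribute is a common instruction-tuning choice and is assumed here rather than taken from the paper:

```python
# Minimal sketch of the cross-entropy objective above. Prompt tokens are
# set to -100 so the loss is computed only over solution tokens, matching
# next-token prediction on the response (prompt masking is an assumption).
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, input_ids, prompt_len):
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100                     # ignore prompt tokens
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for step t
        labels[:, 1:].reshape(-1),                    # targets shifted by one
        ignore_index=-100,
    )
```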
Key training parameters (an illustrative configuration sketch follows the list):
- Optimizer: Adafactor (no explicit weight decay)
- Learning Rate: Initial $5 \times 10^{-5}$, 15 warmup steps, linear decay to zero
- Batch Size: Global batch size of 512 sequences
- Max Sequence Length: 1,216 tokens during OSS-Instruct tuning (1,024 when using Evol-Instruct)
- Epochs and Compute: 2 epochs over 75k examples, using two NVIDIA A100-80GB GPUs with PyTorch DDP.
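For illustration, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows; the per-device batch split is an assumption chosen only to reach the 512-sequence global batch, not the authors' configuration:

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; not the authors' script.
args = TrainingArguments(
    output_dir="magicoder-s-cl",
    optim="adafactor",               # Adafactor, no explicit weight decay
    learning_rate=5e-5,
    warmup_steps=15,
    lr_scheduler_type="linear",      # linear decay to zero after warmup
    num_train_epochs=2,
    per_device_train_batch_size=4,   # assumed: 2 GPUs x 4 x 64 accum = 512 global
    gradient_accumulation_steps=64,
)
```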
No further self-supervised or task-general pretraining is performed; only supervised instruction-tuning is applied in these stages (Wei et al., 2023).
4. Evaluation on Code Generation Benchmarks
Performance is assessed on several industry-standard benchmarks for text-to-code generation, including HumanEval(+), MBPP(+), MultiPL-E, and DS-1000. MagicoderS-CL and MagicoderS-CL⁺ are compared to CodeLlama-7B, ChatGPT, and WizardCoder models. Core results, using greedy decoding for pass@1, are summarized below:
| Model | HumanEval | HumanEval+ | MBPP | MBPP+ |
|---|---|---|---|---|
| CodeLlama-7B | 37.8 | 34.1 | 57.6 | 45.4 |
| MagicoderS-CL | 60.4 | 55.5 | 64.2 | 52.6 |
| MagicoderS-CL⁺ | 70.7 | 66.5 | 68.4 | 56.6 |
| ChatGPT | 72.6 | 65.9 | 81.7 | 69.4 |
On MultiPL-E, MagicoderS-CL substantially outperforms CodeLlama-7B across several languages (Java, JavaScript, C++, PHP, Swift, Rust). On DS-1000, MagicoderS-CL achieves ≈29.9% and MagicoderS-CL⁺ reaches ≈37.5%, outperforming WizardCoder-15B. Notably, MagicoderS-CL⁺ surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 pass@1), and MagicoderS-CL and its enhanced variant outperform all open-source models of ≤16B parameters (Wei et al., 2023).
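For reference, pass@k on these benchmarks is computed with the unbiased estimator of Chen et al. (2021); under greedy decoding, n = k = 1 and pass@1 reduces to the fraction of problems whose single completion passes every test:

```python
# Unbiased pass@k estimator (Chen et al., 2021):
# pass@k = E[1 - C(n - c, k) / C(n, k)] over problems,
# where n samples are drawn per problem and c of them pass all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that pass, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```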
An ablation demonstrates that direct finetuning on raw comment–function pairs degrades downstream performance, underscoring the value of the richer OSS-Instruct data.
5. Usage Patterns and Limitations
MagicoderS-CL can be employed for practical code-synthesis tasks from minimal prompts. Typical usage scenarios include the following (a generation sketch follows the list):
- Python data science: a short initialization snippet (e.g., `import numpy as np; def foo(x):`) elicits a complete data-science assignment with a solution.
- System scripting: a shell snippet as the prompt yields a full Python script.
- Java engineering: incomplete class signatures yield Spring Boot application stubs.
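A hedged generation sketch with `transformers`; the checkpoint name and the `@@ Instruction`/`@@ Response` prompt format follow the public Magicoder model cards and should be treated as assumptions here:

```python
# Illustrative inference call; checkpoint name and prompt format assumed
# from the public Magicoder model cards, not verified against the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="ise-uiuc/Magicoder-S-CL-7B")

prompt = (
    "You are an exceptionally intelligent coding assistant that consistently "
    "delivers accurate and reliable responses to user instructions.\n\n"
    "@@ Instruction\nWrite a Python function that normalizes a NumPy array "
    "to zero mean and unit variance.\n\n@@ Response\n"
)
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
```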
Identified constraints and ongoing challenges:
- Dataset Size/Distribution: The instruction-tuning dataset is modest (75,000 examples) and predominantly Python (~57%).
- Solution Quality: Some generated solutions are noisy or incomplete; these are retained by design, in line with UnnaturalCode.
- Infilling Support: The current CL variant does not support infilling or insertion tasks.
- Extension Directions: Future work includes scaling OSS-Instruct to larger CodeLlama variants (13B/34B), leveraging GPT-4 as a teacher, and constructing targeted seed-selection strategies (Wei et al., 2023).
6. Context and Research Significance
MagicoderS-CL introduces a rigorous, scalable approach to instruction-data generation for code LLMs, leveraging abundant open-source repositories and the filtering steps of OSS-Instruct to control bias and enhance prompt diversity. Its empirical superiority over CodeLlama-7B and other open baselines, including on the contamination-controlled HumanEval+ and MBPP+ variants, demonstrates the potential of synthetic, reference-guided instruction datasets. The methodology and artifacts, including datasets and prompt templates, are fully open-sourced for reproducibility and further study. This suggests a trajectory whereby large pre-trained code models can be efficiently specialized with targeted, high-diversity synthetic tasks, and it motivates future research into open-source instruction tuning for code generation (Wei et al., 2023).