
MagicoderS-CL Variant: Open Code LLM

Updated 19 December 2025
  • MagicoderS-CL is a fully open-source code-generation LLM built on the CodeLlama-7B architecture using synthetic instruction data from OSS-Instruct.
  • It employs two-stage instruction tuning: two epochs on 75,000 OSS-Instruct examples, followed by an optional Evol-Instruct continuation, achieving state-of-the-art pass@1 scores on several benchmarks.
  • The model demonstrates practical performance in diverse coding tasks including Python data science, system scripting, and Java engineering, while highlighting limitations in infilling support and dataset diversity.

MagicoderS-CL Variant is a fully open-source code-generation LLM built by fine-tuning Meta's CodeLlama-7B foundation model on synthetic instruction data generated via the OSS-Instruct methodology. OSS-Instruct derives high-quality, diverse instruction-following code data from open-source corpora, and the resulting model achieves state-of-the-art performance among open models of comparable size, surpassing even some proprietary models on rigorous code-generation benchmarks (Wei et al., 2023).

1. Model Architecture and Initialization

MagicoderS-CL adopts the CodeLlama-7B architecture, a single-stream, decoder-only Transformer without any structural modifications. The core model characteristics are as follows:

Specification                 Value          Source
Transformer layers            32             CodeLlama-7B
Hidden dimension (d_model)    4,096          CodeLlama-7B
Attention heads per layer     32             CodeLlama-7B
Feed-forward size             11,008         CodeLlama-7B
Context window                16,384 tokens  CodeLlama-7B

No adapters or new layers are introduced. Weights are initialized by loading the pretrained CodeLlama-7B checkpoint directly, with no further re-initialization. This distinguishes the CL variant from MagicoderS-DS, which is instead initialized from DeepSeek-Coder-Base.
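In practice, initialization is simply loading the published checkpoint. A minimal sketch using the Hugging Face transformers API, assuming the public codellama/CodeLlama-7b-hf release as the base (the exact base checkpoint variant is an assumption, not stated in this summary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "codellama/CodeLlama-7b-hf"  # assumed base checkpoint id
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
# No adapters, no new layers, no re-initialization: the pretrained
# decoder-only Transformer is fine-tuned as-is.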

The training strategy involves two-stage instruction tuning:

  1. OSS-Instruct dataset (75,000 instances, 2 epochs).
  2. Optional Evol-Instruct continuation (2 further epochs), resulting in MagicoderS-CL⁺.

No additional pretraining or further architectural changes are employed (Wei et al., 2023).
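Both tuning corpora are publicly released. A minimal loading sketch, assuming the authors' Hugging Face dataset ids (an assumption relative to this summary):

from datasets import load_dataset

# Stage 1: the 75K OSS-Instruct corpus released by the authors.
oss_instruct = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")
# Stage 2 (optional): an open Evol-Instruct-style corpus.
evol_instruct = load_dataset("theblackcat102/evol-codealpaca-v1", split="train")
print(len(oss_instruct), oss_instruct.column_names)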

2. OSS-Instruct Data Generation and Curation

OSS-Instruct is a methodology for producing diverse synthetic instruction data grounded in open-source code references. Its primary pipeline involves:

  • Seed Corpus: "starcoderdata" is the seed, a filtered permissively licensed subset of The Stack. Random snippets of 1–15 consecutive lines are extracted from 80,000 documents.
  • Prompting and Synthesis: Each seed snippet is formatted into a prompt (see the template sketch after this list) and fed to gpt-3.5-turbo, which is asked to create a realistic programming problem "inspired by" the snippet together with a self-contained solution and minimal explanatory commentary. Greedy decoding is employed to maximize consistency.
  • Data Cleaning/Decontamination: Duplicates and any overlap with evaluation benchmarks (HumanEval(+), MBPP(+), DS-1000, APPS, GSM8K) are removed; only nine samples are filtered out at this step.
  • Diversity Analysis: Samples are classified into ten manually defined code categories (e.g., algorithms, ML, scripts), showing balanced coverage; prompt and solution lengths range from a few tokens to roughly 1,000 tokens.
  • Bias Measurement: The average TF-IDF cosine similarity to HumanEval is the lowest among the compared synthetic datasets, indicating minimal benchmark leakage and content drift.
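For illustration, the prompt-construction step might look as follows; fill_template is a hypothetical helper, and the template wording paraphrases the OSS-Instruct prompt rather than quoting it verbatim:

OSS_INSTRUCT_TEMPLATE = """\
Please gain inspiration from the following random code snippet to create a
high-quality programming problem. Present a complete, self-contained
solution that correctly solves the problem.

Code snippet for inspiration:
{snippet}
"""

def fill_template(snippet: str) -> str:
    # Hypothetical prompt builder; wording is a paraphrase, not a quotation.
    return OSS_INSTRUCT_TEMPLATE.format(snippet=snippet)

print(fill_template("def quicksort(xs): ..."))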

Schematically, the data generation loop is:

for snippet in starcoderdata_seeds:                     # 1-15 line seed snippets
    prompt = fill_template(snippet)                     # OSS-Instruct prompt
    problem, solution = gpt_3_5_turbo.generate(prompt)  # greedy decoding
    if not duplicate(problem, solution):                # dedup + decontamination
        add_to_dataset(problem, solution)
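One simple way to realize the duplicate(...) check, in the spirit of the decontamination and TF-IDF bias analysis above, is a similarity filter against benchmark prompts. The sketch below is illustrative rather than the paper's exact procedure; the 0.9 threshold is an assumption:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_contamination_filter(benchmark_prompts, threshold=0.9):
    # Flag a generated problem as contaminated if its TF-IDF cosine
    # similarity to any benchmark prompt exceeds the (assumed) threshold.
    vectorizer = TfidfVectorizer().fit(benchmark_prompts)
    bench_matrix = vectorizer.transform(benchmark_prompts)
    def is_contaminated(problem: str) -> bool:
        sims = cosine_similarity(vectorizer.transform([problem]), bench_matrix)
        return float(sims.max()) >= threshold
    return is_contaminated

check = build_contamination_filter(["Return the n-th Fibonacci number."])
print(check("Write a function returning the n-th Fibonacci number."))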

The instruction-tuning loss is standard cross-entropy:

L_{inst}(\theta) = -\sum_{(x, y) \in D} \log P_\theta(y \mid x)

3. Training Objective and Optimization

The principal objective is standard next-token prediction on solutions given the prompt, using token-level cross-entropy:

L = -\sum_{t} \log P(w_t \mid w_{<t}, x; \theta)
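Concretely, this is typically implemented by masking prompt tokens out of the loss so that only solution tokens w_t contribute. A minimal PyTorch sketch using the common -100 ignore-index convention (an implementation detail assumed here, not stated above):

import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, input_ids, prompt_len):
    # logits: (seq_len, vocab); input_ids: (seq_len,).
    labels = input_ids.clone()
    labels[:prompt_len] = -100            # mask the prompt x out of the loss
    shift_logits = logits[:-1]            # position t predicts token t+1
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

logits = torch.randn(16, 32000)           # toy values; vocab size assumed
input_ids = torch.randint(0, 32000, (16,))
print(instruction_tuning_loss(logits, input_ids, prompt_len=6))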

Key training parameters (a configuration sketch follows the list):

  • Optimizer: Adafactor (no explicit weight decay)
  • Learning Rate: Initial 5 × 10⁻⁵, 15 warmup steps, linear decay to zero
  • Batch Size: Global batch size of 512 sequences
  • Max Sequence Length: 1,216 tokens during OSS-Instruct tuning (1,024 when using Evol-Instruct)
  • Epochs and Compute: 2 epochs over 75k examples, using two NVIDIA A100-80GB GPUs with PyTorch DDP.
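These hyperparameters map naturally onto a standard Hugging Face TrainingArguments configuration. The sketch below is an illustration under stated assumptions: the per-device batch / gradient-accumulation split and bf16 precision are not specified above:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="magicoder-cl-oss-instruct",
    optim="adafactor",                 # Adafactor, no explicit weight decay
    learning_rate=5e-5,
    warmup_steps=15,
    lr_scheduler_type="linear",        # linear decay to zero
    num_train_epochs=2,
    per_device_train_batch_size=32,    # assumed: 32 x 2 GPUs x 8 accum = 512
    gradient_accumulation_steps=8,
    bf16=True,                         # assumed precision setting
)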

No further self-supervised or task-general pretraining is performed; only supervised instruction-tuning is applied in these stages (Wei et al., 2023).

4. Evaluation on Code Generation Benchmarks

Performance is assessed on several industry-standard text-to-code benchmarks, including HumanEval(+), MBPP(+), MultiPL-E, and DS-1000. MagicoderS-CL and MagicoderS-CL⁺ are compared against CodeLlama-7B, ChatGPT, and WizardCoder models. Core results, using greedy decoding for pass@1, are summarized below:

Model             HumanEval   HumanEval+   MBPP   MBPP+
CodeLlama-7B      37.8        34.1         57.6   45.4
MagicoderS-CL     60.4        55.5         64.2   52.6
MagicoderS-CL⁺    70.7        66.5         68.4   56.6
ChatGPT           72.6        65.9         81.7   69.4

On MultiPL-E, MagicoderS-CL substantially outperforms CodeLlama-7B across several languages (Java, JavaScript, C++, PHP, Swift, Rust). On DS-1000, MagicoderS-CL achieves ≈29.9% and MagicoderS-CL⁺ reaches ≈37.5%, outperforming WizardCoder-15B. Notably, MagicoderS-CL⁺ surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 pass@1), and both variants outperform all evaluated open-source models of ≤16B parameters (Wei et al., 2023).
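Because the pass@1 numbers above use greedy decoding, the metric reduces to the fraction of problems whose single deterministic completion passes all hidden tests. A minimal sketch, with generate_greedy and run_tests as illustrative placeholders:

def pass_at_1(problems, generate_greedy, run_tests) -> float:
    # One deterministic sample per problem; score is the pass fraction.
    passed = sum(run_tests(p, generate_greedy(p)) for p in problems)
    return passed / len(problems)

demo = ["p1", "p2"]
print(pass_at_1(demo, lambda p: "solution", lambda p, s: p == "p1"))  # 0.5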

An ablation demonstrates that direct finetuning on raw comment–function pairs degrades downstream performance, underscoring the value of the richer OSS-Instruct data.

5. Usage Patterns and Limitations

MagicoderS-CL can be employed for practical code synthesis given minimal code seeds. Typical usage scenarios include the following (a generation sketch follows the list):

  • Python data science: A short initialization snippet (e.g., "import numpy as np; def foo(x):") elicits a complete data-science problem and solution.
  • System scripting: A shell snippet used as the prompt yields a full Python script.
  • Java engineering: An incomplete class signature yields a Spring Boot application stub.
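A minimal generation sketch for such prompts, assuming the authors' released checkpoint id (ise-uiuc/Magicoder-S-CL-7B) and the @@ Instruction / @@ Response format from its public model card; both are assumptions relative to this summary:

from transformers import pipeline

PROMPT = """@@ Instruction
{instruction}

@@ Response
"""

generator = pipeline("text-generation", model="ise-uiuc/Magicoder-S-CL-7B",
                     device_map="auto")
output = generator(
    PROMPT.format(instruction="Write a Python function that returns the "
                  "n-th Fibonacci number."),
    max_new_tokens=256,
)
print(output[0]["generated_text"])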

Identified constraints and ongoing challenges:

  • Dataset Size/Distribution: The instruction-tuning dataset is modest (75,000 examples) and predominantly Python (~57%).
  • Solution Quality: Some outputs are noisy or incomplete; they are retained by design, in line with the Unnatural Instructions line of work on unfiltered synthetic data.
  • Infilling Support: The current CL variant does not support infilling or insertion tasks.
  • Extension Directions: Future work includes scaling OSS-Instruct to larger CodeLlama variants (13B/34B), leveraging GPT-4 as a teacher, and constructing targeted seed-selection strategies (Wei et al., 2023).

6. Context and Research Significance

MagicoderS-CL introduces a rigorous, scalable approach to instruction-data generation for code LLMs, leveraging abundant open-source repositories and the filtering steps of OSS-Instruct to control bias and enhance prompt diversity. Its empirical superiority over CodeLlama-7B and other open baselines, including on the stricter HumanEval+ and MBPP+ evaluations, demonstrates the potential of synthetic, reference-guided instruction datasets. The methodology and artifacts, including datasets and prompt templates, are fully open-sourced for reproducibility and further study. This suggests a trajectory in which large pretrained code models can be efficiently specialized with targeted, high-diversity synthetic tasks, and it motivates further research into open-source instruction tuning for code generation (Wei et al., 2023).

References

Wei, Y., Wang, Z., Liu, J., Ding, Y., and Zhang, L. (2023). Magicoder: Source Code Is All You Need. arXiv:2312.02120.
