MagicoderS-CL Variant: Open Code LLM
- MagicoderS-CL is a fully open-source code-generation LLM built on the CodeLlama-7B architecture using synthetic instruction data from OSS-Instruct.
- It employs a two-stage instruction tuning process on 75,000 examples, with optional Evol-Instruct continuation, achieving state-of-the-art pass@1 scores on several benchmarks.
- The model demonstrates practical performance in diverse coding tasks including Python data science, system scripting, and Java engineering, while highlighting limitations in infilling support and dataset diversity.
MagicoderS-CL is a fully open-source code-generation LLM built by fine-tuning Meta's CodeLlama-7B foundation model on synthetic instruction data generated via the OSS-Instruct methodology. OSS-Instruct aims to create high-quality, diverse instruction-following code datasets from open-source code corpora; the resulting model demonstrates state-of-the-art performance among open models of 7B parameters and larger, surpassing even some proprietary models on rigorous code-generation benchmarks (Wei et al., 2023).
1. Model Architecture and Initialization
MagicoderS-CL adopts the CodeLlama-7B architecture, a single-stream, decoder-only Transformer without any structural modifications. The core model characteristics are as follows:
| Specification | Value | Source |
|---|---|---|
| Transformer layers | 32 | CodeLlama-7B |
| Hidden dimension ($d_{\text{model}}$) | 4,096 | CodeLlama-7B |
| Attention heads per layer | 32 | CodeLlama-7B |
| Feed-forward size | 11,008 | CodeLlama-7B |
| Context window | 16,384 tokens | CodeLlama-7B |
No adapters or new layers are introduced. Weight initialization is performed by loading the pretrained CodeLlama-7B weights directly, with no further re-initialization. This distinguishes MagicoderS-CL from the DS variant (MagicoderS-DS), which is instead initialized from DeepSeek-Coder-Base.
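Concretely, initialization amounts to loading the released checkpoint unchanged. A minimal sketch with the Hugging Face `transformers` library; the checkpoint name is Meta's public release, and the usage here is illustrative rather than the authors' script:

```python
# Minimal sketch: loading pretrained CodeLlama-7B weights directly, with
# no adapters, new layers, or re-initialization before instruction tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# The loaded decoder-only Transformer matches the table above:
# 32 layers, hidden size 4096, 32 attention heads, FFN size 11008.
print(model.config.num_hidden_layers, model.config.hidden_size)
```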
The training strategy involves two-stage instruction tuning:
- OSS-Instruct dataset (75,000 instances, 2 epochs).
- Optional Evol-Instruct continuation (2 further epochs), resulting in MagicoderS-CL⁺.
No additional pretraining or further architectural changes are employed (Wei et al., 2023).
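A compact sketch of the two-stage schedule above; the `finetune` helper and the dataset identifiers are hypothetical placeholders, not the authors' code:

```python
# Hypothetical driver for the two-stage schedule; `finetune` stands in for
# any supervised instruction-tuning routine (see Section 3 for the loss).
def train_magicoder_cl(model, finetune):
    # Stage 1: 2 epochs of supervised tuning on the 75K OSS-Instruct pairs.
    model = finetune(model, dataset="oss-instruct-75k", epochs=2)
    # Stage 2 (optional): 2 further epochs on Evol-Instruct-style data,
    # yielding the MagicoderS-CL+ variant.
    model = finetune(model, dataset="evol-instruct", epochs=2)
    return model
```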
2. OSS-Instruct Data Generation and Curation
OSS-Instruct is a methodology for producing diverse synthetic instruction data grounded in open-source code references. Its primary pipeline involves:
- Seed Corpus: The seed corpus is "starcoderdata", a filtered, permissively licensed subset of The Stack; random snippets of 1–15 consecutive lines are extracted from 80,000 documents.
- Prompting and Synthesis: Each seed snippet is formatted into a prompt and fed to gpt-3.5-turbo, requesting the creation of a realistic programming problem "inspired by" the snippet and a self-contained solution with minimal explanatory commentary. Greedy decoding is employed to maximize consistency.
- Data Cleaning/Decontamination: Duplicates and any overlap with evaluation benchmarks (HumanEval(+), MBPP(+), DS-1000, APPS, GSM8K) are removed; only 9 samples are filtered out in this step.
- Diversity Analysis: Ten manually defined code categories (e.g., algorithms, ML, scripts) ensure a balanced representation; prompt and solution lengths vary from a few tokens to ~1,000 tokens.
- Bias Measurements: Average TF-IDF cosine similarity to HumanEval is the lowest among compared synthetic datasets, indicating minimal prompt duplication or content drift; a sketch of this measurement follows the list.
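One plausible implementation of the similarity measurement with scikit-learn; the paper's exact procedure may differ, so this is a sketch of the idea rather than the reference code:

```python
# Sketch of a TF-IDF similarity check against HumanEval: for each synthetic
# sample, take its best-match cosine similarity to any benchmark problem,
# then average. Lower values suggest less benchmark overlap.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_best_match_similarity(synthetic_texts, benchmark_texts):
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(synthetic_texts + benchmark_texts)
    syn = matrix[: len(synthetic_texts)]
    ref = matrix[len(synthetic_texts):]
    sims = cosine_similarity(syn, ref)   # shape: (n_synthetic, n_benchmark)
    return sims.max(axis=1).mean()
```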
Formally, the data generation process is:
```python
for s in starcoderdata:                    # 80K seed documents
    prompt = fill_template(s)              # embed a 1-15 line seed snippet
    x, y = gpt_3_5_turbo.generate(prompt)  # greedy decoding: problem x, solution y
    if not duplicate(x, y):                # dedup + benchmark decontamination
        add_to_dataset(x, y)
```
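The `fill_template` step above glosses over two details: snippet extraction and the prompt itself. A hedged sketch of both; the template wording is a paraphrase, not the paper's verbatim prompt:

```python
import random

def extract_seed(document: str, min_lines: int = 1, max_lines: int = 15) -> str:
    """Draw 1-15 consecutive lines from a source document, per OSS-Instruct."""
    lines = document.splitlines()
    n = random.randint(min_lines, min(max_lines, len(lines)))
    start = random.randint(0, len(lines) - n)
    return "\n".join(lines[start : start + n])

def fill_template(seed: str) -> str:
    # Paraphrase of the OSS-Instruct prompt, not the verbatim template.
    return (
        "Gain inspiration from the following code snippet and create a "
        "self-contained programming problem together with a correct, "
        "self-contained solution.\n\nCode snippet:\n" + seed
    )
```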
The resulting 75,000 problem-solution pairs constitute the instruction-tuning corpus; the training objective applied to them is described in the next section.
3. Training Objective and Optimization
The principal objective is standard next-token prediction on solutions given the prompt, using token-level cross-entropy:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right),$$

where $x$ is the instruction prompt, $y$ the reference solution, and $\theta$ the model parameters.
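A PyTorch sketch of this loss follows. Masking the prompt tokens so that only solution tokens contribute is a common instruction-tuning choice and is assumed here rather than taken from the paper:

```python
# Minimal sketch of the cross-entropy objective above. Prompt tokens are
# set to -100 so the loss is computed only over solution tokens, matching
# next-token prediction on the response (prompt masking is an assumption).
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, input_ids, prompt_len):
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100                     # ignore prompt tokens
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for step t
        labels[:, 1:].reshape(-1),                    # targets shifted by one
        ignore_index=-100,
    )
```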
Key training parameters (an illustrative configuration sketch follows the list):
- Optimizer: Adafactor (no explicit weight decay)
- Learning Rate: Initial $5 \times 10^{-5}$, 15 warmup steps, linear decay to zero
- Batch Size: Global batch size of 512 sequences
- Max Sequence Length: 1,216 tokens during OSS-Instruct tuning (1,024 when using Evol-Instruct)
- Epochs and Compute: 2 epochs over 75k examples, using two NVIDIA A100-80GB GPUs with PyTorch DDP.
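For illustration, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows; the per-device batch split is an assumption chosen only to reach the 512-sequence global batch, not the authors' configuration:

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; not the authors' script.
args = TrainingArguments(
    output_dir="magicoder-s-cl",
    optim="adafactor",               # Adafactor, no explicit weight decay
    learning_rate=5e-5,
    warmup_steps=15,
    lr_scheduler_type="linear",      # linear decay to zero after warmup
    num_train_epochs=2,
    per_device_train_batch_size=4,   # assumed: 2 GPUs x 4 x 64 accum = 512 global
    gradient_accumulation_steps=64,
)
```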
No further self-supervised or task-general pretraining is performed; only supervised instruction-tuning is applied in these stages (Wei et al., 2023).
4. Evaluation on Code Generation Benchmarks
Performance is assessed on several industry-standard benchmarks for text-to-code generation, including HumanEval(+), MBPP(+), MultiPL-E, and DS-1000. MagicoderS-CL and MagicoderS-CL⁺ are compared to CodeLlama-7B, ChatGPT, and WizardCoder models. Core results, using greedy decoding for pass@1, are summarized below:
| Model | HumanEval | HumanEval+ | MBPP | MBPP+ |
|---|---|---|---|---|
| CodeLlama-7B | 37.8 | 34.1 | 57.6 | 45.4 |
| MagicoderS-CL | 60.4 | 55.5 | 64.2 | 52.6 |
| MagicoderS-CL⁺ | 70.7 | 66.5 | 68.4 | 56.6 |
| ChatGPT | 72.6 | 65.9 | 81.7 | 69.4 |
On MultiPL-E, MagicoderS-CL substantially outperforms CodeLlama-7B across several languages (Java, JavaScript, C++, PHP, Swift, Rust). On DS-1000, MagicoderS-CL achieves ≈29.9% and MagicoderS-CL⁺ reaches ≈37.5%, outperforming WizardCoder-15B. Notably, MagicoderS-CL⁺ surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 pass@1), and MagicoderS-CL and its enhanced variant outperform all open-source models of ≤16B parameters (Wei et al., 2023).
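For reference, pass@k on these benchmarks is computed with the unbiased estimator of Chen et al. (2021); under greedy decoding, n = k = 1 and pass@1 reduces to the fraction of problems whose single completion passes every test:

```python
# Unbiased pass@k estimator (Chen et al., 2021):
# pass@k = E[1 - C(n - c, k) / C(n, k)] over problems,
# where n samples are drawn per problem and c of them pass all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that pass, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```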
An ablation demonstrates that direct finetuning on raw comment–function pairs degrades downstream performance, underscoring the value of the richer OSS-Instruct data.
5. Usage Patterns and Limitations
MagicoderS-CL can be employed for practical code-synthesis tasks from minimal prompts. Typical usage scenarios include the following (a generation sketch follows the list):
- Python data science: a short initialization snippet (e.g., `import numpy as np; def foo(x):`) elicits a complete data-science assignment with a solution.
- System scripting: a shell snippet as the prompt yields a full Python script.
- Java engineering: incomplete class signatures yield Spring Boot application stubs.
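A hedged generation sketch with `transformers`; the checkpoint name and the `@@ Instruction`/`@@ Response` prompt format follow the public Magicoder model cards and should be treated as assumptions here:

```python
# Illustrative inference call; checkpoint name and prompt format assumed
# from the public Magicoder model cards, not verified against the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="ise-uiuc/Magicoder-S-CL-7B")

prompt = (
    "You are an exceptionally intelligent coding assistant that consistently "
    "delivers accurate and reliable responses to user instructions.\n\n"
    "@@ Instruction\nWrite a Python function that normalizes a NumPy array "
    "to zero mean and unit variance.\n\n@@ Response\n"
)
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
```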
Identified constraints and ongoing challenges:
- Dataset Size/Distribution: The instruction-tuning dataset is modest (75,000 examples) and predominantly Python (~57%).
- Solution Quality: Some generated solutions are noisy or incomplete; these are retained by design, in line with UnnaturalCode.
- Infilling Support: The current CL variant does not support infilling or insertion tasks.
- Extension Directions: Future work includes scaling OSS-Instruct to larger CodeLlama variants (13B/34B), leveraging GPT-4 as a teacher, and constructing targeted seed-selection strategies (Wei et al., 2023).
6. Context and Research Significance
MagicoderS-CL introduces a rigorous, scalable approach to instruction-data generation for code LLMs, leveraging abundant open-source repositories and the filtering steps of OSS-Instruct to control bias and enhance prompt diversity. Its empirical superiority over CodeLlama-7B and other open baselines, including on the contamination-controlled HumanEval+ and MBPP+ variants, demonstrates the potential of synthetic, reference-guided instruction datasets. The methodology and artifacts, including datasets and prompt templates, are fully open-sourced for reproducibility and further study. This suggests a trajectory whereby large pre-trained code models can be efficiently specialized with targeted, high-diversity synthetic tasks, and it motivates future research into open-source instruction tuning for code generation (Wei et al., 2023).