- The paper demonstrates an iterative self-distillation framework that enables small-scale open-source LLMs to synthesize high-quality code instruction data.
- Code LLMs fine-tuned on the resulting synthetic data achieve competitive performance on benchmarks like HumanEval and MBPP while reducing dependency on expensive proprietary LLMs.
- The methodology, validated through ablation studies and theoretical analysis, offers a cost-efficient and rapidly converging approach to code LLM development.
SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs
Introduction
The paper introduces SCoder, a methodology and model family that addresses the high cost of code instruction data synthesis and its dependence on proprietary LLMs in code LLM development. The central contribution is an iterative self-distillation framework that enables small-scale open-source LLMs (7B–14B parameters) to serve as effective code instruction data synthesizers. This approach reduces reliance on large, closed-source models (e.g., GPT-3.5/4) and demonstrates that small models, when properly bootstrapped, can generate high-quality instruction data for code LLM fine-tuning.
Motivation and Problem Statement
Instruction tuning is critical for code LLMs, but the prevailing paradigm depends on large-scale, high-quality instruction datasets distilled from proprietary LLMs. This process is cost-prohibitive and limits accessibility. The paper investigates whether small-scale open-source LLMs can be transformed into competitive data synthesizers, thus democratizing the construction of code instruction datasets and reducing costs.
Methodology
Data Synthesizer Bootstrapping
The process begins by training small-scale LLMs (e.g., Qwen2.5-Coder-7B/14B, Llama3.1-8B) on a limited set of high-quality instruction data distilled from proprietary LLMs. This initial "enhanced synthesizer" is then iteratively improved via self-distillation, eliminating further dependence on proprietary data.
Iterative Self-Distillation Framework
Each iteration of the self-distillation process consists of:
- Multi-Checkpoint Sampling: For each code snippet and prompt, outputs are sampled from multiple checkpoints and multiple decoding runs, increasing diversity and robustness.
- Multi-Aspect Scoring: Candidate outputs are evaluated using a learned scorer that aggregates multiple quality aspects (e.g., problem-solution consistency, correctness) into a weighted score. The weights are optimized via ridge regression to maximize downstream code LLM performance; a minimal sketch of the sampling-and-scoring step follows this list.
- Gradient-Based Influence Estimation: To select the most influential samples, the cosine similarity between the gradient induced by a candidate sample and the average gradient of proprietary-LLM-distilled samples is computed (using LoRA-adapted reference models and Johnson-Lindenstrauss projections for efficiency). Samples with the highest influence are retained for the next training iteration; this selection step is sketched after the iteration summary below.
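To make the sampling and scoring steps concrete, here is a minimal sketch rather than the paper's implementation: the aspect names are assumed, `generate_fn` and `score_fn` are hypothetical callables standing in for the synthesizer's decoding routine and the learned aspect scorer, and the ridge regression fits aspect weights against whatever downstream-performance signal is available.

```python
# Minimal sketch of multi-checkpoint sampling + multi-aspect scoring.
# Aspect names and the generate_fn/score_fn callables are hypothetical
# stand-ins; only the overall flow follows the paper's description.
import numpy as np
from sklearn.linear_model import Ridge

ASPECTS = ["consistency", "correctness", "difficulty", "clarity"]  # assumed aspect set


def fit_aspect_weights(aspect_matrix: np.ndarray, downstream_perf: np.ndarray) -> np.ndarray:
    """Fit per-aspect weights with ridge regression so the weighted score
    tracks downstream code-LLM performance (the paper's optimization target)."""
    return Ridge(alpha=1.0).fit(aspect_matrix, downstream_perf).coef_


def best_candidate(prompt: str, checkpoints: list, weights: np.ndarray,
                   generate_fn, score_fn, runs_per_ckpt: int = 4):
    """Sample from several synthesizer checkpoints and decoding runs, score each
    candidate on every aspect, and keep the highest weighted-score candidate."""
    scored = []
    for ckpt in checkpoints:                      # multi-checkpoint sampling
        for _ in range(runs_per_ckpt):            # repeated stochastic decoding
            cand = generate_fn(ckpt, prompt)      # hypothetical decoding helper
            aspect_scores = np.array([score_fn(cand, a) for a in ASPECTS])
            scored.append((float(weights @ aspect_scores), cand))
    return max(scored, key=lambda pair: pair[0])
```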
This process is repeated, with each iteration generating a larger, higher-quality self-distilled dataset, which is then used to further train the synthesizer.
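The influence-based selection step can be sketched in the same spirit. The snippet below is an illustration under stated assumptions: the inputs are pre-flattened per-sample gradients from a LoRA-adapted reference model, and the Johnson-Lindenstrauss projection is a single fixed random Gaussian matrix (a real implementation would apply it in chunks to keep memory manageable).

```python
# Simplified sketch of gradient-based influence estimation. Inputs are assumed
# to be flattened per-sample LoRA gradients (one row per sample); the paper's
# exact implementation details may differ.
import torch
import torch.nn.functional as F


def jl_project(grads: torch.Tensor, proj_dim: int = 8192, seed: int = 0) -> torch.Tensor:
    """Johnson-Lindenstrauss projection: a fixed random Gaussian matrix compresses
    high-dimensional gradients while approximately preserving angles."""
    gen = torch.Generator().manual_seed(seed)
    proj = torch.randn(grads.shape[1], proj_dim, generator=gen) / proj_dim ** 0.5
    return grads @ proj


def influence_scores(candidate_grads: torch.Tensor,
                     proprietary_grads: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each candidate's projected gradient and the mean
    projected gradient of proprietary-LLM-distilled samples."""
    cand = jl_project(candidate_grads)
    ref = jl_project(proprietary_grads).mean(dim=0, keepdim=True)
    return F.cosine_similarity(cand, ref)  # shape: (num_candidates,)


def select_most_influential(candidates: list, candidate_grads: torch.Tensor,
                            proprietary_grads: torch.Tensor, k: int) -> list:
    """Keep the top-k candidates whose gradients align best with the reference."""
    top = torch.topk(influence_scores(candidate_grads, proprietary_grads), k).indices
    return [candidates[i] for i in top.tolist()]
```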
SCoder Model Family
Using the instruction datasets generated by the bootstrapped synthesizers, the authors fine-tune DeepSeek-Coder-6.7B-Base to produce the SCoder family. This family includes variants corresponding to different synthesizer backbones (e.g., SCoder-Q7-DS-6.7B, SCoder-Q14-DS-6.7B).
Experimental Results
Benchmarks and Baselines
SCoder models are evaluated on HumanEval, MBPP, LiveCodeBench, and BigCodeBench, using pass@1 as the primary metric. Baselines include both proprietary models (GPT-4-Turbo, GPT-o1) and state-of-the-art open-source models (DeepSeek-Coder-6.7B-Instruct, MagicoderS, WizardCoder-GPT-4, etc.).
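For context, pass@1 is conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval; the sketch below is that standard estimator, not code from the paper.

```python
# Unbiased pass@k estimator (standard HumanEval formulation); pass@1 reduces to
# the fraction of generated samples per problem that pass the unit tests.
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples passing the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


print(pass_at_k(n=10, c=4, k=1))  # 0.4; average over problems to report pass@1
```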
Main Findings
- Performance: SCoder models trained on 60K–80K self-distilled samples from small synthesizers match or outperform open-source baselines that rely on 75K–110K proprietary LLM-distilled samples. For example, SCoder-Q14-DS-6.7B achieves 80.5% on HumanEval and 81.0% on MBPP, surpassing all open-source baselines of comparable size.
- Ablation Studies: Removing multi-checkpoint sampling, multi-aspect scoring, or gradient-based influence estimation leads to significant performance drops (up to 8.9% on BigCodeBench), confirming the necessity of each component.
- Data Scaling: Increasing the size of self-distilled data leads to monotonic improvements, with diminishing returns after two iterations, indicating convergence of the self-distillation process.
- Cost Efficiency: The approach reduces proprietary LLM API usage by an order of magnitude (10K vs. 150K–200K samples), with the main cost being the one-time fine-tuning of the synthesizer. The total cost for synthesizer training is estimated at ~$260 on commodity cloud GPUs, compared to thousands of dollars for equivalent proprietary LLM API usage.
Data Quality Analysis
Human and LLM-based evaluations show that the self-distilled data from bootstrapped synthesizers scores higher across all quality aspects compared to standard open-source instruction datasets (e.g., evol-codealpaca-v1).
Theoretical Analysis
The paper provides a formal analysis of the iterative self-distillation process, modeling it as a contraction mapping in the space of model parameters. Under reasonable Lipschitz continuity and contraction assumptions, the process is shown to converge to a unique fixed point, which can be interpreted as a Nash equilibrium between the teacher (synthesizer) and student (target model). The process naturally balances exploration (diverse data generation) and exploitation (retraining from a fixed initialization), with empirical results supporting rapid convergence and stability.
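Written in generic notation (not necessarily the paper's), the convergence claim follows the standard Banach fixed-point argument: if the map T that carries the synthesizer's parameters from one iteration to the next is a contraction, a unique fixed point exists and the iterates approach it geometrically.

```latex
% Generic contraction-mapping statement behind the convergence claim;
% T maps synthesizer parameters at iteration t to iteration t+1.
\|T(\theta) - T(\theta')\| \le \gamma \,\|\theta - \theta'\| \quad (0 \le \gamma < 1)
\;\Longrightarrow\;
\exists!\;\theta^{\ast} \text{ with } T(\theta^{\ast}) = \theta^{\ast},
\qquad
\|\theta_{t} - \theta^{\ast}\| \le \gamma^{t}\,\|\theta_{0} - \theta^{\ast}\|.
```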
Implementation Considerations
- Synthesizer Training: Fine-tuning small LLMs on 10K proprietary-LLM-distilled samples, followed by two rounds of self-distillation (20K and 40K samples), is sufficient for strong performance.
- Sampling and Scoring: Multi-checkpoint and multi-aspect strategies require additional inference passes but are computationally tractable for 7B–14B models.
- Gradient Influence: LoRA adaptation and gradient projection make influence estimation feasible on a single A100 GPU within hours.
- Target Model Fine-Tuning: SCoder models are fine-tuned on a mix of standard open-source and self-distilled data, using standard SFT hyperparameters (a hedged example configuration is sketched below).
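As a rough illustration of what "standard SFT hyperparameters" typically means in this setting, the sketch below uses Hugging Face transformers; the base-model identifier is the assumed Hugging Face name for DeepSeek-Coder-6.7B-Base, and every hyperparameter value is an assumption chosen as typical for a ~7B model, not the paper's reported configuration.

```python
# Illustrative SFT setup; all hyperparameter values are typical assumptions for
# a ~7B code model, not the paper's reported configuration.
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments


def build_sft_trainer(train_dataset):
    """train_dataset: already-tokenized instruction data (mixed open-source and
    self-distilled), with prompt tokens masked out of the labels."""
    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/deepseek-coder-6.7b-base",  # assumed HF id for the base model
        torch_dtype=torch.bfloat16,
    )
    args = TrainingArguments(
        output_dir="scoder-sft",            # hypothetical output path
        num_train_epochs=2,                 # assumed
        per_device_train_batch_size=4,      # assumed
        gradient_accumulation_steps=16,     # assumed
        learning_rate=2e-5,                 # assumed
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=10,
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset)
```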
Implications and Future Directions
Practical Implications
- Democratization: The methodology enables organizations without access to proprietary LLMs to build competitive code LLMs using only small open-source models and a modest initial investment in proprietary data.
- Cost Reduction: The approach dramatically reduces the cost of instruction data synthesis, making large-scale code LLM development more accessible.
- Generalization: The framework is robust to the choice of reference model for influence estimation and generalizes across different target model architectures.
Theoretical Implications
- Self-Distillation Dynamics: The formal analysis provides a foundation for understanding convergence and stability in iterative self-distillation, with potential applications beyond code generation.
- Sample Selection: The integration of gradient-based influence estimation with multi-aspect scoring offers a principled approach to data selection in self-supervised and semi-supervised learning.
Future Work
- Extension to Other Domains: While the current paper focuses on code generation, the methodology may be adapted to other instruction-following tasks, though domain-specific challenges (e.g., data availability, evaluation) must be addressed.
- Integration with Alternative Paradigms: Combining self-distillation with methods like Self-Instruct or Evol-Instruct could further enhance data diversity and quality.
- Scaling Laws and Model Size: Further exploration of the relationship between synthesizer size, data quality, and downstream performance is warranted.
Conclusion
SCoder demonstrates that small-scale open-source LLMs, when bootstrapped via iterative self-distillation, can serve as effective and efficient code instruction data synthesizers. This approach enables the construction of high-quality instruction datasets at a fraction of the cost and dependency of prior methods, yielding code LLMs that match or exceed the performance of models trained on large-scale proprietary data. The methodology is theoretically grounded, empirically validated, and broadly applicable, representing a significant advance in scalable, accessible code LLM development.