- The paper demonstrates an iterative self-distillation framework that enables small-scale open-source LLMs to synthesize high-quality code instruction data.
- Code LLMs fine-tuned on the resulting synthetic data achieve competitive performance on benchmarks like HumanEval and MBPP while reducing dependency on expensive proprietary LLMs.
- The methodology, validated through ablation studies and theoretical analysis, offers a cost-efficient and rapidly converging approach to code LLM development.
SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs
Introduction
The paper introduces SCoder, a methodology and model family that addresses the high cost of code instruction data synthesis and its dependence on proprietary LLMs in code LLM development. The central contribution is an iterative self-distillation framework that enables small-scale open-source LLMs (7B–14B parameters) to serve as effective code instruction data synthesizers. This approach reduces reliance on large, closed-source models (e.g., GPT-3.5/4) and demonstrates that small models, when properly bootstrapped, can generate high-quality instruction data for code LLM fine-tuning.
Motivation and Problem Statement
Instruction tuning is critical for code LLMs, but the prevailing paradigm depends on large-scale, high-quality instruction datasets distilled from proprietary LLMs. This process is cost-prohibitive and limits accessibility. The paper investigates whether small-scale open-source LLMs can be transformed into competitive data synthesizers, thus democratizing the construction of code instruction datasets and reducing costs.
Methodology
Data Synthesizer Bootstrapping
The process begins by training small-scale LLMs (e.g., Qwen2.5-Coder-7B/14B, Llama3.1-8B) on a limited set of high-quality instruction data distilled from proprietary LLMs. This initial "enhanced synthesizer" is then iteratively improved via self-distillation, eliminating further dependence on proprietary data.
Iterative Self-Distillation Framework
Each iteration of the self-distillation process consists of:
- Multi-Checkpoint Sampling: For each code snippet and prompt, outputs are sampled from multiple checkpoints and multiple decoding runs, increasing diversity and robustness.
- Multi-Aspect Scoring: Candidate outputs are evaluated using a learned scorer that aggregates multiple quality aspects (e.g., problem-solution consistency, correctness) into a weighted score. The weights are optimized via ridge regression to maximize downstream code LLM performance; a minimal sketch of the sampling-and-scoring step follows this list.
- Gradient-Based Influence Estimation: To select the most influential samples, the cosine similarity between the gradient induced by a candidate sample and the average gradient of proprietary-LLM-distilled samples is computed (using LoRA-adapted reference models and Johnson-Lindenstrauss projections for efficiency). Samples with the highest influence are retained for the next training iteration; this selection step is sketched after the iteration summary below.
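To make the sampling and scoring steps concrete, here is a minimal sketch rather than the paper's implementation: the aspect names are assumed, `generate_fn` and `score_fn` are hypothetical callables standing in for the synthesizer's decoding routine and the learned aspect scorer, and the ridge regression fits aspect weights against whatever downstream-performance signal is available.

```python
# Minimal sketch of multi-checkpoint sampling + multi-aspect scoring.
# Aspect names and the generate_fn/score_fn callables are hypothetical
# stand-ins; only the overall flow follows the paper's description.
import numpy as np
from sklearn.linear_model import Ridge

ASPECTS = ["consistency", "correctness", "difficulty", "clarity"]  # assumed aspect set


def fit_aspect_weights(aspect_matrix: np.ndarray, downstream_perf: np.ndarray) -> np.ndarray:
    """Fit per-aspect weights with ridge regression so the weighted score
    tracks downstream code-LLM performance (the paper's optimization target)."""
    return Ridge(alpha=1.0).fit(aspect_matrix, downstream_perf).coef_


def best_candidate(prompt: str, checkpoints: list, weights: np.ndarray,
                   generate_fn, score_fn, runs_per_ckpt: int = 4):
    """Sample from several synthesizer checkpoints and decoding runs, score each
    candidate on every aspect, and keep the highest weighted-score candidate."""
    scored = []
    for ckpt in checkpoints:                      # multi-checkpoint sampling
        for _ in range(runs_per_ckpt):            # repeated stochastic decoding
            cand = generate_fn(ckpt, prompt)      # hypothetical decoding helper
            aspect_scores = np.array([score_fn(cand, a) for a in ASPECTS])
            scored.append((float(weights @ aspect_scores), cand))
    return max(scored, key=lambda pair: pair[0])
```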
This process is repeated, with each iteration generating a larger, higher-quality self-distilled dataset, which is then used to further train the synthesizer.
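The influence-based selection step can be sketched in the same spirit. The snippet below is an illustration under stated assumptions: the inputs are pre-flattened per-sample gradients from a LoRA-adapted reference model, and the Johnson-Lindenstrauss projection is a single fixed random Gaussian matrix (a real implementation would apply it in chunks to keep memory manageable).

```python
# Simplified sketch of gradient-based influence estimation. Inputs are assumed
# to be flattened per-sample LoRA gradients (one row per sample); the paper's
# exact implementation details may differ.
import torch
import torch.nn.functional as F


def jl_project(grads: torch.Tensor, proj_dim: int = 8192, seed: int = 0) -> torch.Tensor:
    """Johnson-Lindenstrauss projection: a fixed random Gaussian matrix compresses
    high-dimensional gradients while approximately preserving angles."""
    gen = torch.Generator().manual_seed(seed)
    proj = torch.randn(grads.shape[1], proj_dim, generator=gen) / proj_dim ** 0.5
    return grads @ proj


def influence_scores(candidate_grads: torch.Tensor,
                     proprietary_grads: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each candidate's projected gradient and the mean
    projected gradient of proprietary-LLM-distilled samples."""
    cand = jl_project(candidate_grads)
    ref = jl_project(proprietary_grads).mean(dim=0, keepdim=True)
    return F.cosine_similarity(cand, ref)  # shape: (num_candidates,)


def select_most_influential(candidates: list, candidate_grads: torch.Tensor,
                            proprietary_grads: torch.Tensor, k: int) -> list:
    """Keep the top-k candidates whose gradients align best with the reference."""
    top = torch.topk(influence_scores(candidate_grads, proprietary_grads), k).indices
    return [candidates[i] for i in top.tolist()]
```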
SCoder Model Family
Using the instruction datasets generated by the bootstrapped synthesizers, the authors fine-tune DeepSeek-Coder-6.7B-Base to produce the SCoder family. This family includes variants corresponding to different synthesizer backbones (e.g., SCoder-Q7-DS-6.7B, SCoder-Q14-DS-6.7B).
Experimental Results
Benchmarks and Baselines
SCoder models are evaluated on HumanEval, MBPP, LiveCodeBench, and BigCodeBench, using pass@1 as the primary metric. Baselines include both proprietary models (GPT-4-Turbo, GPT-o1) and state-of-the-art open-source models (DeepSeek-Coder-6.7B-Instruct, MagicoderS, WizardCoder-GPT-4, etc.).
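For context, pass@1 is conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval; the sketch below is that standard estimator, not code from the paper.

```python
# Unbiased pass@k estimator (standard HumanEval formulation); pass@1 reduces to
# the fraction of generated samples per problem that pass the unit tests.
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples passing the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


print(pass_at_k(n=10, c=4, k=1))  # 0.4; average over problems to report pass@1
```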
Main Findings
- Performance: SCoder models trained on 60K–80K self-distilled samples from small synthesizers match or outperform open-source baselines that rely on 75K–110K proprietary LLM-distilled samples. For example, SCoder-Q14-DS-6.7B achieves 80.5% on HumanEval and 81.0% on MBPP, surpassing all open-source baselines of comparable size.
- Ablation Studies: Removing multi-checkpoint sampling, multi-aspect scoring, or gradient-based influence estimation leads to significant performance drops (up to 8.9% on BigCodeBench), confirming the necessity of each component.
- Data Scaling: Increasing the size of self-distilled data leads to monotonic improvements, with diminishing returns after two iterations, indicating convergence of the self-distillation process.
- Cost Efficiency: The approach reduces proprietary LLM API usage by an order of magnitude (10K vs. 150K–200K samples), with the main cost being the one-time fine-tuning of the synthesizer. The total cost for synthesizer training is estimated at ~$260 on commodity cloud GPUs, compared to thousands of dollars for equivalent proprietary LLM API usage.
Data Quality Analysis
Human and LLM-based evaluations show that the self-distilled data from bootstrapped synthesizers scores higher across all quality aspects compared to standard open-source instruction datasets (e.g., evol-codealpaca-v1).
Theoretical Analysis
The paper provides a formal analysis of the iterative self-distillation process, modeling it as a contraction mapping in the space of model parameters. Under reasonable Lipschitz continuity and contraction assumptions, the process is shown to converge to a unique fixed point, which can be interpreted as a Nash equilibrium between the teacher (synthesizer) and student (target model). The process naturally balances exploration (diverse data generation) and exploitation (retraining from a fixed initialization), with empirical results supporting rapid convergence and stability.
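Written in generic notation (not necessarily the paper's), the convergence claim follows the standard Banach fixed-point argument: if the map T that carries the synthesizer's parameters from one iteration to the next is a contraction, a unique fixed point exists and the iterates approach it geometrically.

```latex
% Generic contraction-mapping statement behind the convergence claim;
% T maps synthesizer parameters at iteration t to iteration t+1.
\|T(\theta) - T(\theta')\| \le \gamma \,\|\theta - \theta'\| \quad (0 \le \gamma < 1)
\;\Longrightarrow\;
\exists!\;\theta^{\ast} \text{ with } T(\theta^{\ast}) = \theta^{\ast},
\qquad
\|\theta_{t} - \theta^{\ast}\| \le \gamma^{t}\,\|\theta_{0} - \theta^{\ast}\|.
```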
Implementation Considerations
- Synthesizer Training: Fine-tuning small LLMs on 10K proprietary-LLM-distilled samples, followed by two rounds of self-distillation (20K and 40K samples), is sufficient for strong performance.
- Sampling and Scoring: Multi-checkpoint and multi-aspect strategies require additional inference passes but are computationally tractable for 7B–14B models.
- Gradient Influence: LoRA adaptation and gradient projection make influence estimation feasible on a single A100 GPU within hours.
- Target Model Fine-Tuning: SCoder models are fine-tuned on a mix of standard open-source and self-distilled data, using standard SFT hyperparameters (a hedged example configuration is sketched below).
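As a rough illustration of what "standard SFT hyperparameters" typically means in this setting, the sketch below uses Hugging Face transformers; the base-model identifier is the assumed Hugging Face name for DeepSeek-Coder-6.7B-Base, and every hyperparameter value is an assumption chosen as typical for a ~7B model, not the paper's reported configuration.

```python
# Illustrative SFT setup; all hyperparameter values are typical assumptions for
# a ~7B code model, not the paper's reported configuration.
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments


def build_sft_trainer(train_dataset):
    """train_dataset: already-tokenized instruction data (mixed open-source and
    self-distilled), with prompt tokens masked out of the labels."""
    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/deepseek-coder-6.7b-base",  # assumed HF id for the base model
        torch_dtype=torch.bfloat16,
    )
    args = TrainingArguments(
        output_dir="scoder-sft",            # hypothetical output path
        num_train_epochs=2,                 # assumed
        per_device_train_batch_size=4,      # assumed
        gradient_accumulation_steps=16,     # assumed
        learning_rate=2e-5,                 # assumed
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=10,
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset)
```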
Implications and Future Directions
Practical Implications
- Democratization: The methodology enables organizations without access to proprietary LLMs to build competitive code LLMs using only small open-source models and a modest initial investment in proprietary data.
- Cost Reduction: The approach dramatically reduces the cost of instruction data synthesis, making large-scale code LLM development more accessible.
- Generalization: The framework is robust to the choice of reference model for influence estimation and generalizes across different target model architectures.
Theoretical Implications
- Self-Distillation Dynamics: The formal analysis provides a foundation for understanding convergence and stability in iterative self-distillation, with potential applications beyond code generation.
- Sample Selection: The integration of gradient-based influence estimation with multi-aspect scoring offers a principled approach to data selection in self-supervised and semi-supervised learning.
Future Work
- Extension to Other Domains: While the current paper focuses on code generation, the methodology may be adapted to other instruction-following tasks, though domain-specific challenges (e.g., data availability, evaluation) must be addressed.
- Integration with Alternative Paradigms: Combining self-distillation with methods like Self-Instruct or Evol-Instruct could further enhance data diversity and quality.
- Scaling Laws and Model Size: Further exploration of the relationship between synthesizer size, data quality, and downstream performance is warranted.
Conclusion
SCoder demonstrates that small-scale open-source LLMs, when bootstrapped via iterative self-distillation, can serve as effective and efficient code instruction data synthesizers. This approach enables the construction of high-quality instruction datasets at a fraction of the cost and dependency of prior methods, yielding code LLMs that match or exceed the performance of models trained on large-scale proprietary data. The methodology is theoretically grounded, empirically validated, and broadly applicable, representing a significant advance in scalable, accessible code LLM development.