X-Coder Framework: Synthetic Training for Code LLMs
- X-Coder Framework is a synthetic-task–driven approach that trains competitive programming LLMs using fully artificial problems, solutions, and tests.
- It employs a two-stage training process—supervised fine-tuning followed by reinforcement learning via GRPO—powered by the SynthSmith data synthesis pipeline.
- Empirical scaling laws show that increasing unique synthetically generated tasks yields notable performance gains over using multiple solutions per task.
The X-Coder framework is a synthetic-task–driven approach for training LLMs in competitive programming, replacing real-world data with feature-based, fully artificial problems, solutions, and tests. It centers on a two-stage training process (supervised fine-tuning and reinforcement learning), depends on an advanced pipeline called SynthSmith for data synthesis and verification, and establishes state-of-the-art performance among code LLMs of comparable scale. X-Coder introduces empirical scaling laws for synthetic data and demonstrates strong generalization on contemporary competitive programming benchmarks.
1. Model Series and Training Protocol
The X-Coder model family comprises two primary variants: X-Coder-Qwen2.5 (7B parameters, initialized from Qwen2.5-Coder-7B-Instruct) and X-Coder-Qwen3 (8B parameters, initialized from Qwen3-8B-Base). Both follow a two-stage training paradigm:
- Stage 1: Supervised Fine-Tuning (SFT). Models are trained to imitate long chain-of-thought (CoT) demonstrations supplied by synthetic [task, solution] pairs generated by SynthSmith.
- Stage 2: Reinforcement Learning (RL) via Group Relative Policy Optimization (GRPO). Policies are further refined to maximize test-case–based rewards, leveraging the synthetic evaluative infrastructure.
During SFT, the objective is minimization of the negative log-likelihood over authoritative, long-form CoT answers:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$

For RL, GRPO optimizes a clipped policy-gradient objective with a KL penalty:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

with $G=8$ parallel rollouts per prompt, importance ratio $r_i(\theta) = \pi_\theta(o_i \mid x)/\pi_{\theta_{\mathrm{old}}}(o_i \mid x)$, and group-relative advantage $\hat{A}_i = \big(R_i - \operatorname{mean}(\mathbf{R})\big)/\operatorname{std}(\mathbf{R})$. The reward is determined by compilation success and the fraction of passed test cases:

$$R = \begin{cases} \dfrac{\#\text{passed}}{\#\text{tests}} & \text{if the program compiles} \\[4pt] 0 & \text{otherwise} \end{cases}$$
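The reward shaping and group-relative advantage used by GRPO can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the convention of a zero reward on compile failure are assumptions consistent with the description above.

```python
import statistics

def reward(compiled: bool, passed: int, total: int) -> float:
    """Continuous reward: 0 if the program fails to compile,
    otherwise the fraction of test cases passed."""
    if not compiled or total == 0:
        return 0.0
    return passed / total

def group_relative_advantages(rewards: list) -> list:
    """GRPO normalizes each rollout's reward against its G-sample group:
    A_i = (R_i - mean(R)) / std(R)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# G = 8 rollouts for one prompt, each as (compiled, passed, total)
rollouts = [(True, 8, 10), (True, 10, 10), (False, 0, 10), (True, 5, 10),
            (True, 10, 10), (True, 2, 10), (True, 7, 10), (False, 0, 10)]
rewards = [reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)  # mean-zero within the group
```

The group normalization means a rollout is rewarded only relative to its siblings, which removes the need for a learned value baseline.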
2. SynthSmith: Feature-Based Synthetic Data Generation
SynthSmith is a feature-driven synthesis pipeline responsible for generating all tasks, solutions, and test cases. Its operation involves four key stages:
- 2.1 Feature Extraction and Evolution: Competition-relevant features (algorithms, data structures, complexity patterns, etc.) are extracted via LLM prompting from 10k TACO examples. Breadth is increased by adding sibling features, while depth is increased by generating sub-features using JSON-to-JSON LLM prompts. For example, the number of extracted algorithm features grows from 27,400 to 176,914 (growth ×6.46), and data structure features from 12,353 to 65,104 (×5.27).
- 2.2 Two-Stage Task Generation: Constructing competitive tasks proceeds in two substages:
- Stage 1: Compatible subtrees of the feature hierarchy are selected and an integration strategy is composed.
- Stage 2: Given selected features and strategy, the LLM is prompted to generate a formatted problem statement (e.g., Codeforces/LeetCode/AtCoder style).
The two-stage method produces a notable empirical gain in avg@4 (from 34.8 to 40.1, +5.3) over direct one-step generation.
- 2.3 Test Input Generation: Test suites are produced using both prompting-based strategies (to induce edge, boundary, random, and large cases) and tool-based approaches (CYaRon library) for programmatic diversity, including base, boundary, stress, and large random types.
- 2.4 Candidate Solutions and Dual Verification: Multiple candidate solutions are sampled via LLMs (DeepSeek-R1, Qwen3-A22B-Thinking), each filtered for completeness (full CoT, single Python code block, AST check, token length <25k). For $n$ test cases and $m$ solutions, a provisional label for each test input is obtained by majority vote over the candidates' outputs $o_{i,j}$:

  $$\hat{y}_j = \operatorname{mode}\{o_{i,j}\}_{i=1}^{m}, \qquad j = 1, \dots, n$$

  Each solution is then scored by its agreement rate with the voted labels:

  $$s_i = \frac{1}{n}\sum_{j=1}^{n} \mathbb{1}\big[o_{i,j} = \hat{y}_j\big]$$

  The best candidate solution is cross-validated on a holdout set; only those passing both weighted and unweighted metrics are accepted as “golden.”
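The majority-vote labeling and agreement scoring in stage 2.4 can be sketched in a few lines. This is an illustrative simplification under assumed function names; the paper's weighted scoring and holdout cross-validation are omitted.

```python
from collections import Counter

def vote_labels(outputs):
    """outputs[i][j] = output of candidate solution i on test input j.
    The provisional label for test j is the majority (mode) output."""
    n_tests = len(outputs[0])
    labels = []
    for j in range(n_tests):
        column = [row[j] for row in outputs]
        labels.append(Counter(column).most_common(1)[0][0])
    return labels

def score_solutions(outputs, labels):
    """Score each solution by its agreement rate with the voted labels."""
    n = len(labels)
    return [sum(row[j] == labels[j] for j in range(n)) / n for row in outputs]

# m = 3 candidate solutions evaluated on n = 4 synthetic test inputs
outs = [["1", "4", "9", "16"],
        ["1", "4", "9", "15"],   # disagrees on the last case
        ["1", "4", "9", "16"]]
labels = vote_labels(outs)              # ["1", "4", "9", "16"]
scores = score_solutions(outs, labels)  # [1.0, 0.75, 1.0]
```

Voting over outputs lets the pipeline assign labels without any ground-truth oracle, at the cost of being fooled when a majority of candidates share the same bug, which is what the holdout cross-validation step guards against.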
3. Construction of SFT and RL Datasets
- SFT (“Demo”) Dataset: Comprises 200,091 [task, solution] pairs, each with a single, long (>10k tokens) CoT answer. Problem statements typically have length ~N(μ=659, σ=258), solutions length ~N(μ=17,742, σ=7,296). Difficulty is controlled using a fine-tuned CF-rating classifier (84% validation accuracy), yielding a balanced difficulty profile: e.g., 31.9% at 1400–1600, 24.2% at 1800–2000, 2.5% at 2400+ ratings. Embedding-based analysis shows high diversity (mean inter-cluster distance 0.613 vs 0.507 for EpiCoder baseline).
- RL Dataset: A subset of 40,000 synthetic tasks with fully verified test sets, used for on-policy RL rollouts (G=8 parallel samples), with the continuous reward proportional to the fraction of passed tests.
4. Empirical Scaling Laws on Synthetic Data
Experiments establish the impact of dataset construction strategies on model performance:
- Scaling the number of unique tasks yields greater pass@8 gains than increasing the number of solutions per task for a fixed token budget. Subsets with one solution per task (32k, 64k, 128k, 200k samples; avg@8 of 43.7%→54.1%→58.4%→62.7%) outperform those with multiple solutions per task (16k×4→52.3%, 8k×8→45.1%).
- A logarithmic fit models the relationship between dataset size $N$ and pass@8:

  $$\text{pass@8}(N) \approx a \log N + b$$

  for fitted constants $a$ and $b$.
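A fit of this form can be reproduced from the one-solution-per-task points reported above. The data points are taken from the section; the fitted coefficients here are computed locally and are not the paper's reported values.

```python
import numpy as np

# (N, avg@8) points for the one-solution-per-task subsets in Section 4
N = np.array([32_000, 64_000, 128_000, 200_000], dtype=float)
y = np.array([43.7, 54.1, 58.4, 62.7])

# Least-squares fit of y ≈ a * log(N) + b
a, b = np.polyfit(np.log(N), y, 1)

def predict(n: float) -> float:
    """Predicted avg@8 (in percentage points) at dataset size n."""
    return a * np.log(n) + b
```

The positive slope $a$ quantifies the diminishing-but-steady returns from adding unique tasks: each doubling of $N$ buys roughly $a \ln 2$ points.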
5. Performance Evaluation and Benchmarks
- Metrics: Performance is measured via pass@k (the fraction of problems solved at least once among k samples) and avg@8 (average pass@8 across benchmarks). Confidence intervals are computed as $p \pm 1.96\sqrt{p(1-p)/n}$, where $n$ is the number of problems.
- Benchmarks: Evaluations are performed on LiveCodeBench v5 (Aug 2024–Feb 2025, 268 tasks) and v6 (Feb–May 2025, 240 tasks).
- Main Results:
- X-Coder-Qwen2.5 (7B, synthetic, SFT+RL) records avg@8 of 62.9±1.8 (v5) and 55.8±1.9 (v6).
- X-Coder-Qwen3 (8B, synthetic, SFT+RL) achieves avg@8 of 64.0±2.5 (v5) and 56.5±1.3 (v6).
- Both surpass real-data models of up to 14B parameters (e.g., DeepCoder-14B-Preview, 57.9 on v5), with statistically significant gains (e.g., +2.6 absolute on v5 from SFT, and a further +2.6 from RL).
| Model | Params | Data | SFT | RL | avg@8 (v5) | avg@8 (v6) |
|---|---|---|---|---|---|---|
| Qwen2.5-Coder-7B | 7B | real | ✗ | ✗ | 57.5 | 48.4 |
| DeepCoder-14B-Preview | 14B | real | ✗ | ✓ | 57.9 | 48.5 |
| AReal-boba2-14B | 14B | real | ✗ | ✓ | 58.1 | 56.7 |
| X-Coder-Qwen2.5-SFT | 7B | syn | ✓ | ✗ | 60.3±2.5 | 53.5±1.7 |
| X-Coder-Qwen2.5 | 7B | syn | ✓ | ✓ | 62.9±1.8 | 55.8±1.9 |
| X-Coder-Qwen3-SFT | 8B | syn | ✓ | ✗ | 59.4±2.0 | 55.4±2.3 |
| X-Coder-Qwen3 | 8B | syn | ✓ | ✓ | 64.0±2.5 | 56.5±1.3 |
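The evaluation metrics above can be made concrete with a short sketch. The pass@k estimator below follows the standard unbiased convention (probability that at least one of k samples, drawn from n with c correct, passes), and the interval uses the normal approximation; both are assumptions about the exact formulas, not the paper's verbatim code.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), given n samples
    per problem of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def ci_halfwidth(p: float, n_problems: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width: z * sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1.0 - p) / n_problems)

# e.g. 8 samples on one problem, 3 of them correct:
p1 = pass_at_k(8, 3, 1)   # single-sample solve rate, 3/8
p8 = pass_at_k(8, 3, 8)   # any of the 8 correct -> 1.0
hw = ci_halfwidth(0.629, 268)  # half-width at 62.9% over 268 problems
```

Averaging pass@k over all benchmark problems gives the reported avg@8 figures.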
6. Context, Implications, and Limitations
The X-Coder framework demonstrates that large-scale, feature-driven synthetic data can enable highly capable code LLMs for competitive programming, mitigating reliance on real-world coding data while matching or exceeding the performance of models trained exclusively on such data. The staged training and robust synthesis yield competitive models with reduced parameter counts relative to prior work.
A plausible implication is that, as synthetic data construction improves in feature coverage and realism, the necessity for large-scale real data in code reasoning tasks may continue to diminish. Nevertheless, the results depend fundamentally on the verification pipeline’s ability to generate and filter high-quality problem–solution–test triples; failures in dual verification or feature synthesis would potentially limit the representativeness or robustness of the resulting models (Wu et al., 11 Jan 2026).