Parallel-SFT: Cross-Language Code RL

Updated 24 April 2026
  • Parallel-SFT is a supervised fine-tuning paradigm that leverages parallel programs—functionally equivalent code across multiple languages—to support robust cross-language generalization in reinforcement learning.
  • It addresses the transfer gap by aligning algorithmic semantics through a curated parallel code dataset, enhancing metrics like pass@1 and embedding alignment.
  • Empirical results demonstrate that Parallel-SFT outperforms monolingual and non-parallel methods, achieving improvements up to 3.3 percentage points and stronger latent representation alignment.

Parallel-SFT is a supervised fine-tuning (SFT) training paradigm designed to optimize LLMs for zero-shot cross-programming-language transfer in the context of code reinforcement learning (RL). This approach addresses the limited transferability of code RL improvements across programming languages (PLs) by constructing and leveraging explicit “parallel programs”—functionally equivalent code solutions for the same specification, spanning multiple PLs—as a core training signal. Parallel-SFT induces functionality-centric representations, thus facilitating more effective cross-lingual generalization during subsequent RL finetuning (Wu et al., 22 Apr 2026).

1. Motivation and Theoretical Foundations

Standard LLM SFT on code exploits monolingual data in a source PL (e.g., Python), which leads models to internalize surface-level syntax rather than algorithmic semantics. As a result, RL-finetuned policies for code generation achieve improvements primarily within the training PL, with little or no benefit—and sometimes negative transfer—to other PLs. This breakdown in transfer is attributed to the lack of parallel “translation” data in code corpora, in contrast to the parallel texts that underpin success in multilingual NLP. Most code datasets do not contain semantically aligned instances of the same algorithm in divergent PLs, inhibiting alignment of internal model representations across PLs (Wu et al., 22 Apr 2026).

2. Formal Objective and Algorithmic Structure

Parallel-SFT organizes supervision around the concept of "parallel programs": for a set of tasks $Q = \{q_i\}_{i=1}^N$, the dataset contains, for each $q_i$, a verified solution set $\mathcal{C}_i^{\text{lang}}$ for every language $\text{lang} \in \mathcal{L}$. The SFT objective is

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{i=1}^N \sum_{\text{lang} \in \mathcal{L}} w^{\text{lang}} \cdot \mathbb{E}_{c \sim \mathcal{C}_i^{\text{lang}}} \left[\log p_\theta(c \mid q_i)\right]$$

where by default $w^{\text{lang}} = 1/|\mathcal{L}|$ for a uniform mix of $|\mathcal{L}|$ languages.
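The objective can be sketched numerically. The snippet below is an illustrative estimate of the loss for a single task, assuming per-solution sequence log-probabilities have already been computed by the model; the function name and input layout are this sketch's own, not the paper's.

```python
def sft_loss(logprobs_by_lang, weights=None):
    """Parallel-SFT objective for one task: weighted sum over languages of
    the average negative log-likelihood of its verified solutions.

    logprobs_by_lang: {lang: [log p_theta(c | q_i) for each solution c]}
    weights: {lang: w_lang}; defaults to the uniform mix 1/|L|.
    """
    langs = list(logprobs_by_lang)
    if weights is None:
        weights = {lang: 1.0 / len(langs) for lang in langs}  # w_lang = 1/|L|
    loss = 0.0
    for lang, logps in logprobs_by_lang.items():
        # E_{c ~ C_i^lang}[log p_theta(c | q_i)] estimated by the sample mean
        loss -= weights[lang] * sum(logps) / len(logps)
    return loss

# Toy example: two languages with hypothetical sequence log-probs
example = {"python": [-2.0, -1.5], "cpp": [-3.0]}
print(round(sft_loss(example), 4))  # 2.375
```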

After SFT, RL is performed using a standard policy-gradient method (e.g., GRPO), optimizing the expected reward in a single high-resource source PL $\ell_{\text{src}}$:

$$\mathcal{L}_{\mathrm{RL}}(\theta) = -\mathbb{E}_{q \sim D_{\mathrm{RL}}}\, \mathbb{E}_{c \sim p_\theta(\cdot \mid q, \ell_{\text{src}})} \left[R(c; q, \ell_{\text{src}})\right]$$

where $R(c; q, \ell_{\text{src}})$ is a binary verifier signal based on solution correctness in the source PL (Wu et al., 22 Apr 2026).
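A minimal sketch of the binary verifier reward and the group-relative advantages used by GRPO-style methods, assuming the task's test suite is available as a list of callables (this representation, and both function names, are illustrative assumptions rather than the paper's implementation):

```python
def reward(candidate: str, tests) -> float:
    """R(c; q, l_src): 1.0 iff the candidate passes every test, else 0.0.
    `tests` is assumed to be a list of callables that check the candidate."""
    return 1.0 if all(t(candidate) for t in tests) else 0.0

def group_relative_advantages(rewards):
    """GRPO normalizes rewards within a group of samples for the same
    prompt: A_j = (r_j - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Group of 4 sampled solutions: two pass, two fail
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing samples receive positive advantage and failing ones negative, so the policy gradient pushes probability mass toward verifier-approved solutions in the source PL only.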

3. Parallel-Program Dataset Construction

The dataset comprises problem statements (from APPS and CodeContests), post-filtered to remove overlapping or ambiguous items, with most original solutions authored in Python. For each verified Python solution, machine translation (Llama-4) generates candidate implementations in additional PLs (C++, Java, C#, JavaScript, Bash, Lua, etc.). Each translation is execution-filtered: only code passing the original test suite is retained. The constructed corpus comprises:

  • A pool of unique, deduplicated questions
  • Multiple verified solutions per language per question
  • Parallel code instances spanning 8 PLs

This synthetic augmentation provides explicit “Rosetta Stone”-style alignment, enabling models to associate natural language prompts, problem specifications, and functionally equivalent code across multiple languages (Wu et al., 22 Apr 2026).
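The construction loop above can be sketched as follows. The helpers `translate` (e.g., a call to Llama-4) and `passes_tests` (execution against the original test suite) are passed in as parameters because their concrete implementations are not specified here; both names are this sketch's assumptions.

```python
def build_parallel_set(python_solution, tests, target_langs, translate, passes_tests):
    """Translate one verified Python solution into each target PL and keep
    only translations that pass the original test suite (execution filter)."""
    kept = {"python": python_solution}
    for lang in target_langs:
        candidate = translate(python_solution, lang)   # machine translation
        if passes_tests(candidate, lang, tests):       # execution filter
            kept[lang] = candidate
    return kept

# Toy run with stand-in helpers: the "bash" translation fails its tests
kept = build_parallel_set(
    "print(1)", tests=[],
    target_langs=["cpp", "bash"],
    translate=lambda s, l: f"{l}:{s}",
    passes_tests=lambda c, l, t: l != "bash",
)
```

Because only execution-verified translations survive, every retained pair is functionally equivalent by construction, which is what licenses the "Rosetta Stone" interpretation.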

4. Empirical Results

The effectiveness of Parallel-SFT is quantified via both code generation accuracy (pass@1) and code validation accuracy on target PLs unseen during RL. Representative results are:

| Mixture | C++→others (pass@1) | Python→others (pass@1) | Oracle (in-PL) |
|---|---|---|---|
| 1-Lang (source) | 4.8% | 5.1% | 6.5% |
| 8-Lang non-parallel | 6.2% | 6.8% | |
| 8-Lang parallel | 8.1% | 9.3% | |
| 1-Lang (target) | | | 7.9% |

For code validation on unseen PLs:

| Mixture | C++→others (accuracy) | Python→others (accuracy) |
|---|---|---|
| 1-Lang (source) | 70.3% | 71.1% |
| 8-Lang non-parallel | 75.0% | 75.6% |
| 8-Lang parallel | 80.2% | 81.5% |

Parallel-SFT consistently outperforms both monolingual and non-parallel multi-PL SFT baselines for zero-shot transfer. When using Python as source, parallel transfer even exceeds in-PL (target) baseline performance (Wu et al., 22 Apr 2026).
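The pass@1 figures above are plausibly computed with the standard unbiased pass@k estimator (the usual convention in code-generation evaluation; the source does not spell out its estimator, so this is an assumption). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the fraction of correct samples:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```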

5. Representation Analysis

To determine whether Parallel-SFT induces PL-agnostic internal representations, alignment is assessed on a held-out set of 312 problems, each with Go, PHP, and Ruby solutions. Metrics include retrieval accuracy (does the embedding of a Go solution retrieve the correct PHP partner by cosine proximity?) and adjusted cosine similarity.

  • 1-Lang: retrieval ≈ 50%, adjusted-cosine ≈ 0.02
  • 8-Lang non-parallel: retrieval ≈ 65%, adjusted-cosine ≈ 0.04
  • 8-Lang parallel: retrieval ≈ 78%, adjusted-cosine ≈ 0.07

These alignment gains are most pronounced in mid-network layers (ℓ ≈ 24 of 32). All models show an "inverted-U" profile: alignment is low in early layers, peaks in the semantic mid-layers, and decreases again toward the output. Parallel-SFT substantially raises this peak alignment, indicating a functionality-centric latent space (Wu et al., 22 Apr 2026).
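The retrieval metric can be sketched as a nearest-neighbor lookup under cosine similarity, assuming solution embeddings for two languages are stacked row-wise in matching problem order (the function name and I/O layout are illustrative):

```python
import numpy as np

def retrieval_accuracy(src_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Fraction of source-language solution embeddings whose nearest
    target-language embedding (by cosine similarity) belongs to the same
    problem, i.e. the matching row index."""
    a = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    b = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = a @ b.T                  # pairwise cosine similarity matrix
    nearest = sims.argmax(axis=1)   # best target match for each source row
    return float((nearest == np.arange(len(a))).mean())
```

Chance level for a 312-problem pool is ≈ 0.3%, so even the 1-Lang baseline's ≈ 50% retrieval reflects some shared structure; the parallel mix pushes it substantially higher.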

6. Ablation Studies and Sensitivity

Ablation quantifies the contribution of parallel data. Non-parallel multi-PL SFT yields +1.4 percentage points (pp) compared to monolingual SFT, but parallel data increases improvement to +3.3 pp. Varying the proportion of parallel to natural language-only data shows optimal performance when using an equal mix; excessive reliance on parallel data can induce overfitting to the synthetic translations. Performance on the source language for RL remains stable, showing no significant negative impact of the parallel program data (Wu et al., 22 Apr 2026).

7. Conclusions and Recommendations

Parallel-SFT demonstrates that incorporating functionally equivalent code in multiple PLs into SFT builds semantic representations that support robust cross-PL RL transfer. Even with synthetic translations, execution filtering is sufficient for high-quality semantic alignment if a large, diverse pool of tasks is available. Recommended best practices include:

  • Constructing execution-verified parallel solutions for each task across a diverse set of PLs
  • Mixing languages uniformly in the SFT objective ($w^{\text{lang}} = 1/|\mathcal{L}|$)
  • Balancing parallel and non-parallel data roughly equally, since over-reliance on parallel data risks overfitting to synthetic translations

Future research directions identified include curriculum learning strategies, PL typology-sensitive sampling, and integration with multi-agent coding frameworks (Wu et al., 22 Apr 2026).
