Parallel-SFT: Cross-Language Code RL

Updated 24 April 2026
  • Parallel-SFT is a supervised fine-tuning paradigm that leverages parallel programs—functionally equivalent code across multiple languages—to support robust cross-language generalization in reinforcement learning.
  • It addresses the transfer gap by aligning algorithmic semantics through a curated parallel code dataset, enhancing metrics like pass@1 and embedding alignment.
  • Empirical results demonstrate that Parallel-SFT outperforms monolingual and non-parallel methods, achieving improvements up to 3.3 percentage points and stronger latent representation alignment.

Parallel-SFT is a supervised fine-tuning (SFT) training paradigm designed to optimize LLMs for zero-shot cross-programming-language transfer in the context of code reinforcement learning (RL). This approach addresses the limited transferability of code RL improvements across programming languages (PLs) by constructing and leveraging explicit “parallel programs”—functionally equivalent code solutions for the same specification, spanning multiple PLs—as a core training signal. Parallel-SFT induces functionality-centric representations, thus facilitating more effective cross-lingual generalization during subsequent RL finetuning (Wu et al., 22 Apr 2026).

1. Motivation and Theoretical Foundations

Standard LLM SFT on code exploits monolingual data in a source PL (e.g., Python), which leads models to internalize surface-level syntax rather than algorithmic semantics. As a result, RL-finetuned policies for code generation achieve improvements primarily within the training PL, with little or no benefit—and sometimes negative transfer—to other PLs. This breakdown in transfer is attributed to the lack of parallel “translation” data in code corpora, in contrast to the parallel texts that underpin success in multilingual NLP. Most code datasets do not contain semantically aligned instances of the same algorithm in divergent PLs, inhibiting alignment of internal model representations across PLs (Wu et al., 22 Apr 2026).

2. Formal Objective and Algorithmic Structure

Parallel-SFT organizes supervision around the concept of "parallel programs": for a set of tasks $Q = \{q_i\}_{i=1}^N$, the dataset contains, for each $q_i$, a verified solution set $\mathcal{C}_i^{\text{lang}}$ for every language $\text{lang} \in \mathcal{L}$. The SFT objective is

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{i=1}^N \sum_{\text{lang} \in \mathcal{L}} w^{\text{lang}} \cdot \mathbb{E}_{c \sim \mathcal{C}_i^{\text{lang}}} \left[\log p_\theta(c \mid q_i)\right]$$

where by default $w^{\text{lang}} = 1/|\mathcal{L}|$ for a uniform mix of $|\mathcal{L}|$ languages.
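The objective can be sketched numerically. The snippet below is an illustrative estimate of the loss for a single task, assuming per-solution sequence log-probabilities have already been computed by the model; the function name and input layout are this sketch's own, not the paper's.

```python
def sft_loss(logprobs_by_lang, weights=None):
    """Parallel-SFT objective for one task: weighted sum over languages of
    the average negative log-likelihood of its verified solutions.

    logprobs_by_lang: {lang: [log p_theta(c | q_i) for each solution c]}
    weights: {lang: w_lang}; defaults to the uniform mix 1/|L|.
    """
    langs = list(logprobs_by_lang)
    if weights is None:
        weights = {lang: 1.0 / len(langs) for lang in langs}  # w_lang = 1/|L|
    loss = 0.0
    for lang, logps in logprobs_by_lang.items():
        # E_{c ~ C_i^lang}[log p_theta(c | q_i)] estimated by the sample mean
        loss -= weights[lang] * sum(logps) / len(logps)
    return loss

# Toy example: two languages with hypothetical sequence log-probs
example = {"python": [-2.0, -1.5], "cpp": [-3.0]}
print(round(sft_loss(example), 4))  # 2.375
```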

After SFT, RL is performed using a standard policy-gradient method (e.g., GRPO), optimizing the expected reward in a single high-resource source PL $\ell_{\text{src}}$:

$$\mathcal{L}_{\mathrm{RL}}(\theta) = -\mathbb{E}_{q \sim D_{\mathrm{RL}}}\, \mathbb{E}_{c \sim p_\theta(\cdot \mid q, \ell_{\text{src}})} \left[R(c; q, \ell_{\text{src}})\right]$$

where $R(c; q, \ell_{\text{src}})$ is a binary verifier signal based on solution correctness in the source PL (Wu et al., 22 Apr 2026).
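A minimal sketch of the binary verifier reward and the group-relative advantages used by GRPO-style methods, assuming the task's test suite is available as a list of callables (this representation, and both function names, are illustrative assumptions rather than the paper's implementation):

```python
def reward(candidate: str, tests) -> float:
    """R(c; q, l_src): 1.0 iff the candidate passes every test, else 0.0.
    `tests` is assumed to be a list of callables that check the candidate."""
    return 1.0 if all(t(candidate) for t in tests) else 0.0

def group_relative_advantages(rewards):
    """GRPO normalizes rewards within a group of samples for the same
    prompt: A_j = (r_j - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Group of 4 sampled solutions: two pass, two fail
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing samples receive positive advantage and failing ones negative, so the policy gradient pushes probability mass toward verifier-approved solutions in the source PL only.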

3. Parallel-Program Dataset Construction

The dataset comprises problem statements (from APPS and CodeContests), post-filtered to remove overlapping or ambiguous items, with most original solutions authored in Python. For each verified Python solution, machine translation (Llama-4) generates candidate implementations in additional PLs (C++, Java, C#, JavaScript, Bash, Lua, etc.). Each translation is execution-filtered: only code passing the original test suite is retained. The constructed corpus comprises:

  • A pool of unique, deduplicated questions
  • Multiple verified solutions per language per question
  • Parallel code instances spanning 8 PLs

This synthetic augmentation provides explicit “Rosetta Stone”-style alignment, enabling models to associate natural language prompts, problem specifications, and functionally equivalent code across multiple languages (Wu et al., 22 Apr 2026).
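The construction loop above can be sketched as follows. The helpers `translate` (e.g., a call to Llama-4) and `passes_tests` (execution against the original test suite) are passed in as parameters because their concrete implementations are not specified here; both names are this sketch's assumptions.

```python
def build_parallel_set(python_solution, tests, target_langs, translate, passes_tests):
    """Translate one verified Python solution into each target PL and keep
    only translations that pass the original test suite (execution filter)."""
    kept = {"python": python_solution}
    for lang in target_langs:
        candidate = translate(python_solution, lang)   # machine translation
        if passes_tests(candidate, lang, tests):       # execution filter
            kept[lang] = candidate
    return kept

# Toy run with stand-in helpers: the "bash" translation fails its tests
kept = build_parallel_set(
    "print(1)", tests=[],
    target_langs=["cpp", "bash"],
    translate=lambda s, l: f"{l}:{s}",
    passes_tests=lambda c, l, t: l != "bash",
)
```

Because only execution-verified translations survive, every retained pair is functionally equivalent by construction, which is what licenses the "Rosetta Stone" interpretation.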

4. Empirical Results

The effectiveness of Parallel-SFT is quantified via both code generation accuracy (pass@1) and code validation accuracy on target PLs unseen during RL. Representative results are:

| Mixture | C++→others (pass@1) | Python→others (pass@1) | Oracle (in-PL) |
|---|---|---|---|
| 1-Lang (source) | 4.8% | 5.1% | 6.5% |
| 8-Lang non-parallel | 6.2% | 6.8% | |
| 8-Lang parallel | 8.1% | 9.3% | |
| 1-Lang (target) | | | 7.9% |

For code validation on unseen PLs:

| Mixture | C++→others (accuracy) | Python→others (accuracy) |
|---|---|---|
| 1-Lang (source) | 70.3% | 71.1% |
| 8-Lang non-parallel | 75.0% | 75.6% |
| 8-Lang parallel | 80.2% | 81.5% |

Parallel-SFT consistently outperforms both monolingual and non-parallel multi-PL SFT baselines for zero-shot transfer. When using Python as source, parallel transfer even exceeds in-PL (target) baseline performance (Wu et al., 22 Apr 2026).
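The pass@1 figures above are plausibly computed with the standard unbiased pass@k estimator (the usual convention in code-generation evaluation; the source does not spell out its estimator, so this is an assumption). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the fraction of correct samples:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```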

5. Representation Analysis

To determine whether Parallel-SFT induces PL-agnostic internal representations, alignment is assessed on a held-out set of 312 problems, each with Go, PHP, and Ruby solutions. Metrics include retrieval accuracy (does the embedding of a Go solution retrieve the correct PHP partner by cosine proximity?) and adjusted cosine similarity.

  • 1-Lang: retrieval ≈ 50%, adjusted-cosine ≈ 0.02
  • 8-Lang non-parallel: retrieval ≈ 65%, adjusted-cosine ≈ 0.04
  • 8-Lang parallel: retrieval ≈ 78%, adjusted-cosine ≈ 0.07

These alignment gains are most pronounced in mid-network layers (ℓ ≈ 24 of 32). All models show an "inverted-U" profile: alignment is low in early layers, peaks in the semantic mid-layers, and decreases again toward the output. Parallel-SFT substantially raises this peak alignment, indicating a functionality-centric latent space (Wu et al., 22 Apr 2026).
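The retrieval metric can be sketched as a nearest-neighbor lookup under cosine similarity, assuming solution embeddings for two languages are stacked row-wise in matching problem order (the function name and I/O layout are illustrative):

```python
import numpy as np

def retrieval_accuracy(src_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Fraction of source-language solution embeddings whose nearest
    target-language embedding (by cosine similarity) belongs to the same
    problem, i.e. the matching row index."""
    a = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    b = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = a @ b.T                  # pairwise cosine similarity matrix
    nearest = sims.argmax(axis=1)   # best target match for each source row
    return float((nearest == np.arange(len(a))).mean())
```

Chance level for a 312-problem pool is ≈ 0.3%, so even the 1-Lang baseline's ≈ 50% retrieval reflects some shared structure; the parallel mix pushes it substantially higher.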

6. Ablation Studies and Sensitivity

Ablation quantifies the contribution of parallel data. Non-parallel multi-PL SFT yields +1.4 percentage points (pp) compared to monolingual SFT, but parallel data increases improvement to +3.3 pp. Varying the proportion of parallel to natural language-only data shows optimal performance when using an equal mix; excessive reliance on parallel data can induce overfitting to the synthetic translations. Performance on the source language for RL remains stable, showing no significant negative impact of the parallel program data (Wu et al., 22 Apr 2026).

7. Conclusions and Recommendations

Parallel-SFT demonstrates that incorporating functionally equivalent code in multiple PLs into SFT builds semantic representations that support robust cross-PL RL transfer. Even with synthetic translations, execution filtering is sufficient for high-quality semantic alignment if a large, diverse pool of tasks is available. Recommended best practices include:

  • Constructing execution-verified parallel solutions for each task across a diverse set of PLs
  • Mixing languages uniformly in the SFT objective ($w^{\text{lang}} = 1/|\mathcal{L}|$)
  • Balancing parallel and non-parallel data roughly equally, since over-reliance on parallel data risks overfitting to synthetic translations

Future research directions identified include curriculum learning strategies, PL typology-sensitive sampling, and integration with multi-agent coding frameworks (Wu et al., 22 Apr 2026).
