
OpenRFT: Adaptation of Foundation Models

Updated 23 February 2026
  • OpenRFT is an open-source framework for domain adaptation, using Reinforcement Fine-Tuning to bridge generalist reasoning models with specialized scientific tasks.
  • The framework integrates synthetic reasoning trace SFT, data augmentation, and few-shot in-context learning to enhance performance on scientific MCQs with as few as 100 labeled examples.
  • Empirical results demonstrate an 11% accuracy improvement over baseline models, underscoring the value of process supervision and reward shaping in low-data environments.

OpenRFT is an open-source framework designed for domain adaptation of generalist reasoning foundation models under extreme data scarcity, leveraging Reinforcement Fine-Tuning (RFT) to bridge the gap between broad System-2 models and highly specialized scientific tasks. Developed as an analog to recent closed-source RFT paradigms, OpenRFT demonstrates notable gains in scientific multiple-choice question (MCQ) answering with as few as 100 domain-labeled examples per task, and is characterized by its use of question augmentation, reasoning-process synthesis, and few-shot in-context learning. The methodology, implementation, and empirical evaluation are comprehensively detailed in (Zhang et al., 2024).

1. Rationale and Problem Context

Large reasoning foundation models, typified by the Skywork-o1 series, are capable of multi-step inference across domains such as mathematics, programming, and logic. Nevertheless, these models frequently exhibit diminished performance on domain-specific tasks (for example, scientific MCQs in biology, chemistry, or materials science) due to domain shift and limited availability of task-specific data. Two central challenges impair naive adaptation:

  • The provided $(Q_i, A_i)$ pairs carry no annotated reasoning traces (process supervision), which makes RL reward signals highly unstable: correct outputs may arise from flawed or spurious reasoning chains.
  • Training sets are severely limited, on the order of $10^1{-}10^2$ examples, which restricts exploration and raises overfitting risk even though RL can generate its own rollouts.

OpenRFT addresses this by warming up the policy with synthetic step-by-step traces, stabilizing RL with a process-aware reward model, and maximally amplifying domain signals using data augmentation and retrieval-augmented few-shot prompt construction (Zhang et al., 2024).

2. Core Model Architecture and Training Regimen

OpenRFT utilizes two pretrained open-source modules from the Skywork-o1 suite:

  • Policy model $\pi_\theta$: based on Skywork-o1-Open-Llama-3.1-8B, an 8B-parameter System-2 LLM, fine-tuned via LoRA (rank 4) on both supervised (SFT) and PPO objectives.
  • Process Reward Model (PRM) $\rho_{\text{PRM}}$: instantiated as Skywork-o1-Open-PRM-Qwen-2.5-7B (7B parameters), independently pretrained to rate the rationality of each intermediate reasoning step $(s_{t-1}, a_t)$.

Training proceeds in three sequential stages:

  1. Data Augmentation (DA): expansion of the limited $(Q, A)$ pairs via paraphrasing and multiple-choice option shuffling.
  2. Supervised Fine-Tuning (SFT): policy initialization on synthesized reasoning traces $(Q, S, A)$ generated by teacher models, using a negative log-likelihood loss:

$$\mathcal{L}_\text{SFT} = -\sum_{(Q, S, A) \in \mathcal{D}_\text{process}} \log P(S, A \mid Q;\, \pi_\text{ori})$$

  3. Reinforcement Fine-Tuning (PPO): further adaptation with a reward that incorporates both final-answer correctness and stepwise process rationality:

$$R_i = \alpha \cdot or_i + (1-\alpha) \cdot f(pr_i^1, \dots, pr_i^m)$$

where $or_i$ is the binary outcome reward, $pr_i^t$ is the PRM score for step $t$, and $\alpha = 0.7$.
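The SFT objective of stage 2 can be sketched numerically. The sketch below treats the policy's per-token probabilities as given (`trace_token_probs` stands in for the softmax outputs of $\pi_\text{ori}$ over a synthesized trace); it is illustrative, not the paper's implementation:

```python
import math

def sft_nll(trace_token_probs):
    """Negative log-likelihood of one (Q, S, A) sample: the policy should
    assign high probability to every token of the synthesized trace S and
    the answer A, conditioned on the question Q. Each entry stands in for
    P(token_t | Q, tokens_<t) as read from the model's softmax output."""
    return -sum(math.log(p) for p in trace_token_probs)

def sft_loss(batch):
    """L_SFT summed over D_process, here a list of per-sample probability
    lists rather than real model outputs."""
    return sum(sft_nll(sample) for sample in batch)
```

A perfectly confident policy (all probabilities 1.0) incurs zero loss; any uncertainty on trace tokens contributes positively, which is what drives the policy toward the teacher's reasoning style during warm-up.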

The environment supplies each MDP episode with:

  • State $s_0 = \text{``Question: ''} \,\|\, Q_i$
  • Actions $a_t$: either a reasoning step $S_i^t$ or the final answer $A_i'$
  • State transitions: the chosen action is appended to the state with a newline.
  • PPO maximizes the conventional clipped surrogate objective with actor learning rate $3\times10^{-5}$, critic learning rate $6\times10^{-5}$, KL coefficient $0.01$, and maximum sequence length $1536$.
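Under these conventions, the episode mechanics reduce to a few lines. The sketch below is a minimal illustration of the state construction and newline-concatenation transition; the `Answer:` prefix used to detect termination is an assumed convention, not specified in the source:

```python
def initial_state(question):
    # s_0 = "Question: " || Q_i
    return "Question: " + question

def transition(state, action):
    # each emitted step or answer is appended with a newline
    return state + "\n" + action

def is_terminal(action):
    # the episode ends when the policy emits the final answer;
    # the "Answer:" prefix is an assumed convention for illustration
    return action.startswith("Answer:")

# roll out a fixed action sequence to watch the state evolve
state = initial_state("Which gas is fixed by nitrogenase?")
for a in ["Step 1: Nitrogenase catalyses N2 reduction.", "Answer: N2"]:
    state = transition(state, a)
    if is_terminal(a):
        break
```

In training, the actions would be sampled from $\pi_\theta$ and each intermediate $(s_{t-1}, a_t)$ pair would be scored by the PRM.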

3. Strategies for Exploiting Scarce Domain Data

Given only on the order of $10^2$ training examples, OpenRFT combines three mutually reinforcing strategies:

3.1 Question Augmentation (DA):

  • Each $(Q, A)$ pair is expanded through 5 paraphrases of $Q$ (using GPT-4o-mini), with random shuffling of choice options, yielding $k=6$ total variants per item.
  • All variants become independent PPO episodes, ensuring diverse state-space coverage and regularization against overfitting.
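A minimal sketch of this augmentation step; the `paraphrase` helper is a hypothetical stand-in for the GPT-4o-mini call:

```python
import random

def augment(question, options, n_paraphrases=5, seed=0):
    """Expand one (Q, A) item into k = n_paraphrases + 1 variants:
    the original plus paraphrases, each paraphrase with shuffled options."""
    rng = random.Random(seed)

    def paraphrase(q, i):
        # placeholder; a real pipeline would query an LLM here
        return f"(rephrased v{i}) {q}"

    variants = [(question, list(options))]
    for i in range(1, n_paraphrases + 1):
        shuffled = list(options)
        rng.shuffle(shuffled)
        variants.append((paraphrase(question, i), shuffled))
    return variants
```

Each of the six variants is then treated as an independent PPO episode, as described above.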

3.2 Reasoning-Process Synthesis and SFT:

  • Synthetic reasoning traces are generated for each sample by sampling rollouts from a teacher model; only those producing the correct answer are retained and the highest-confidence trace is selected.
  • These traces comprise $\mathcal{D}_\text{process}$ for SFT, enabling stable policy initialization when annotated traces are unavailable.
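This selection amounts to rejection sampling over teacher rollouts. A sketch, assuming each rollout is a (trace, predicted answer, confidence) triple; the confidence field is an assumed stand-in for whatever trace score the teacher exposes:

```python
def synthesize_trace(answer, teacher_rollouts):
    """Keep only rollouts whose final answer matches the gold answer,
    then return the single highest-confidence trace for D_process.
    Returns None when no rollout is correct (the sample is dropped)."""
    correct = [(trace, conf)
               for trace, pred, conf in teacher_rollouts
               if pred == answer]
    if not correct:
        return None
    best_trace, _ = max(correct, key=lambda tc: tc[1])
    return best_trace
```

Filtering on answer correctness is a weak proxy for trace quality, which is why the PRM is still needed during RL to penalize spurious-but-lucky chains.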

3.3 Few-Shot In-Context Learning (ICL):

  • For each target question during RL, the $k=3$ nearest $(Q, A)$ pairs (retrieved via SBERT embeddings) are prepended as shots.
  • This practice “warm-starts” domain reasoning and guides exploration during RL optimization.
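The retrieval step can be sketched as nearest-neighbor search over question embeddings. For self-containment the sketch substitutes a bag-of-words embedding for SBERT; a real pipeline would swap in `sentence-transformers` encodings:

```python
from collections import Counter
import math

def embed(text):
    """Bag-of-words stand-in for the SBERT embeddings used in the paper."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(target_q, train_pool, k=3):
    """Prepend the k nearest (Q, A) pairs from the train pool as shots."""
    ranked = sorted(train_pool,
                    key=lambda qa: cosine(embed(target_q), embed(qa[0])),
                    reverse=True)
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in ranked[:k])
    return f"{demos}\nQ: {target_q}\nA:"
```

The retrieved shots bias early exploration toward domain-plausible reasoning, which matters most in the first PPO epochs when the policy is least adapted.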

4. Empirical Benchmarking: SciKnowEval Tasks

OpenRFT is evaluated on SciKnowEval L3 datasets, with tasks spanning biology, chemistry, physics, and materials science. Each of the eight tasks (T1–T8) includes $100$ train and $100$ test samples (T7: $100/49$). OpenRFT is compared against:

  • Ceilings: GPT-4o-mini and o1-mini
  • Vanilla: Skywork-o1-Open-Llama-3.1-8B, without adaptation
  • ReFT/PRM: RL-based fine-tuning without/with process supervision
  • SFT: Only SFT on synthesized traces
  • SFT+RL(PRM), +DA, +ICL: Chained with data augmentation and ICL, culminating in full OpenRFT

All experiments apply LoRA (rank 4), batch size 8, augmentation factor 6, ICL retrieval $k=3$, and sampling temperature $0.6$ for the open models or $1.0$ for o1-mini.

Per-task and average exact match accuracies (mean of three runs):

| Model | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | 0.37 | 0.69 | 0.84 | 0.32 | 0.53 | 0.49 | 0.90 | 0.525 | 0.583 |
| o1-mini | 0.35 | 0.86 | 0.87 | 0.23 | 0.73 | 0.70 | 0.87 | 0.50 | 0.639 |
| Vanilla | 0.28 | 0.55 | 0.52 | 0.23 | 0.45 | 0.34 | 0.41 | 0.41 | 0.403 |
| ReFT | 0.27 | 0.50 | 0.52 | 0.23 | 0.44 | 0.33 | 0.41 | 0.50 | 0.402 |
| ReFT+PRM | 0.30 | 0.57 | 0.49 | 0.23 | 0.44 | 0.36 | 0.37 | 0.48 | 0.405 |
| SFT | 0.33 | 0.53 | 0.49 | 0.20 | 0.45 | 0.37 | 0.43 | 0.49 | 0.415 |
| SFT+RL(PRM) | 0.29 | 0.59 | 0.52 | 0.24 | 0.47 | 0.36 | 0.46 | 0.57 | 0.437 |
| SFT+RL(PRM)+DA | 0.29 | 0.63 | 0.53 | 0.21 | 0.47 | 0.38 | 0.48 | 0.59 | 0.447 |
| OpenRFT | 0.33 | 0.57 | 0.52 | 0.28 | 0.46 | 0.36 | 0.49 | 0.53 | 0.443 |

Key observations:

  • Pure RL-based ReFT (without process supervision) provides negligible gains over the baseline.
  • SFT with synthesized traces yields +1 to +2 points over the Vanilla baseline.
  • Adding RL(PRM) increases average performance by roughly 3 points, and data augmentation contributes roughly 1 further point.
  • OpenRFT achieves an overall +11% relative improvement in average accuracy over the unadapted (Vanilla) policy.
  • Ceiling models such as o1-mini and GPT-4o-mini outperform all adaptation variants, indicating substantial remaining headroom in open-source RFT optimization.

Ablation studies confirm the critical importance of process alignment: when reasoning traces are sourced from an action-incompatible teacher (e.g., QwQ-32B), task performance collapses, emphasizing the necessity of shared action space between teacher and student.

5. Process Supervision and Reward Shaping

OpenRFT’s reward function combines stepwise process supervision (via PRM) with outcome-based reward:

$$R_i = \alpha \cdot or_i + (1-\alpha) \cdot f(pr_i^1, \dots, pr_i^m)$$

$\alpha$ is set to $0.7$, emphasizing answer correctness while still incentivizing rational stepwise trajectories. The PRM assigns rationality scores to $(s_{t-1}, a_t)$ pairs, typically aggregated across steps by their mean or minimum. This dual-reward structure is designed to reduce reward hacking, where the model may arrive at correct answers via incorrect reasoning. The reward model, however, remains static throughout training, and dynamic joint policy-reward updates (such as self-play) are suggested as a future direction.
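A minimal sketch of this reward combination, with the aggregator $f$ exposed as a parameter so that mean and minimum aggregation can be compared:

```python
def shaped_reward(outcome_correct, step_scores, alpha=0.7, agg=min):
    """R_i = alpha * or_i + (1 - alpha) * f(pr_i^1, ..., pr_i^m).
    `agg` plays the role of f over the per-step PRM scores; min is the
    stricter choice, penalizing a single irrational step."""
    or_i = 1.0 if outcome_correct else 0.0
    pr = agg(step_scores) if step_scores else 0.0
    return alpha * or_i + (1 - alpha) * pr

def mean(xs):
    return sum(xs) / len(xs)
```

With `agg=min`, one weak step caps the process term, so a correct answer reached through a flawed chain earns at most $\alpha$ plus a small process bonus, which is the intended hedge against reward hacking.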

6. Current Limitations and Prospective Advancements

Limitations include:

  • Reliance on a strong, open-source reasoning foundation model and an appropriately matched PRM; higher performance is potentially gated by these base modules.
  • Static reward modeling; dynamically updating the PRM during policy training (e.g., in a self-play or iterative co-training regime) is unexplored.
  • Restriction to multiple-choice tasks; extension to free-form or long-document reasoning will necessitate changes to both state/action spaces and reward modeling.
  • Inconsistency between few-shot prompt construction in SFT versus RL, possibly limiting transfer.

Envisioned enhancements are:

  • Alternating PRM and policy improvement in a dynamic reward modeling loop.
  • Incorporating structured domain ontology embeddings and symbolic constraints into either policy or reward modeling.
  • Leveraging advanced data augmentation, including adversarial or knowledge-aware templates and unlabeled corpora.
  • Introducing unsupervised self-play, curriculum learning, and continuous tracking of margin-case pools.
  • Generalization to broader task formats, such as code synthesis or technical document comprehension.

7. Conclusion and Research Implications

OpenRFT validates that generalist System-2 models can be efficiently adapted to narrow scientific MCQ tasks using minimal domain data and a unified reinforcement fine-tuning paradigm. The synthesis of data augmentation, process-trace SFT, and reinforcement learning with both process and outcome supervision yields measurable improvements over baseline adaptation and competitive performance relative to (still superior) closed-source ceiling models. The methodology's modularity, transparency, and extensibility position it as a testbed for scaling domain expertise and reward-based alignment in open-source reasoning models. Future research will likely focus on co-evolution of the reward model and policy, transfer to varied domain and task settings, and integration of richer forms of supervision and domain knowledge (Zhang et al., 2024).
