OpenRFT: Adaptation of Foundation Models
- OpenRFT is an open-source framework for domain adaptation, using Reinforcement Fine-Tuning to bridge generalist reasoning models with specialized scientific tasks.
- The framework integrates synthetic reasoning trace SFT, data augmentation, and few-shot in-context learning to enhance performance on scientific MCQs with as few as 100 labeled examples.
- Empirical results demonstrate an approximately 11% relative accuracy improvement over the unadapted baseline, underscoring the value of process supervision and reward shaping in low-data environments.
OpenRFT is an open-source framework designed for domain adaptation of generalist reasoning foundation models under extreme data scarcity, leveraging Reinforcement Fine-Tuning (RFT) to bridge the gap between broad System-2 models and highly specialized scientific tasks. Developed as an analog to recent closed-source RFT paradigms, OpenRFT demonstrates notable gains in scientific multiple-choice question (MCQ) answering with as few as 100 domain-labeled examples per task, and is characterized by its use of question augmentation, reasoning-process synthesis, and few-shot in-context learning. The methodology, implementation, and empirical evaluation are comprehensively detailed in (Zhang et al., 2024).
1. Rationale and Problem Context
Large reasoning foundation models, typified by the Skywork-o1 series, are capable of multi-step inference across domains such as mathematics, programming, and logic. Nevertheless, these models frequently exhibit diminished performance on domain-specific tasks (for example, scientific MCQs in biology, chemistry, or materials science) due to domain shift and limited availability of task-specific data. Two central challenges impair naive adaptation:
- The absence of annotated reasoning traces (process supervision) in the provided $(q, a)$ pairs renders RL reward signals highly unstable: correct outputs may arise from flawed or spurious reasoning chains.
- Training set sizes are severely limited (on the order of $100$ examples per task), impeding exploration and raising the risk of overfitting, even with RL's capacity for self-generated rollouts.
OpenRFT addresses this by warming up the policy with synthetic step-by-step traces, stabilizing RL with a process-aware reward model, and maximally amplifying domain signals using data augmentation and retrieval-augmented few-shot prompt construction (Zhang et al., 2024).
2. Core Model Architecture and Training Regimen
OpenRFT utilizes two pretrained open-source modules from the Skywork-o1 suite:
- Policy model $\pi_\theta$: based on Skywork-o1-Open-Llama-3.1-8B, an 8B-parameter System-2 LLM, fine-tuned via LoRA (rank $4$) on both supervised (SFT) and PPO objectives.
- Process Reward Model (PRM): instantiated as Skywork-o1-Open-PRM-Qwen-2.5-7B (7B parameters), independently pretrained to rate the rationality of each intermediate reasoning step.
Training proceeds in three sequential stages:
- Data Augmentation (DA): expansion of the limited $(q, a)$ pairs via paraphrasing and multiple-choice option shuffling.
- Supervised Fine-Tuning (SFT): policy initialization on synthesized reasoning traces generated by teacher models, using a negative log-likelihood loss $\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta(y_t \mid y_{<t}, q)$ over trace tokens $y_t$.
- Reinforcement Fine-Tuning (PPO): further adaptation with a reward that incorporates both final-answer correctness and stepwise process rationality, $R = \alpha \, r_{\text{out}} + (1 - \alpha) \cdot \frac{1}{T} \sum_{t=1}^{T} r^{\text{proc}}_{t}$, where $r_{\text{out}} \in \{0, 1\}$ is the binary outcome reward, $r^{\text{proc}}_{t}$ is the PRM score for step $t$, and $\alpha = 0.7$.
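As a concrete illustration, the combined reward can be sketched in a few lines; this is a minimal reading of the formula above (function and variable names are illustrative, not from the OpenRFT codebase, and mean aggregation over steps is assumed):

```python
def combined_reward(outcome_correct, step_scores, alpha=0.7):
    """Blend the binary outcome reward with the mean per-step PRM score.

    outcome_correct: bool, whether the final answer matched the label.
    step_scores: list of PRM rationality scores in [0, 1], one per step.
    alpha: weight on answer correctness (0.7 in OpenRFT).
    """
    r_out = 1.0 if outcome_correct else 0.0
    r_proc = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return alpha * r_out + (1 - alpha) * r_proc
```

For example, a correct answer reached via steps rated $[0.9, 0.8, 0.7]$ receives $0.7 \cdot 1 + 0.3 \cdot 0.8 = 0.94$, while a correct answer via poorly rated steps is discounted accordingly.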
The environment supplies each MDP episode with:
- State $s_t$: the question prompt concatenated with all reasoning steps generated so far.
- Actions $a_t$: either an intermediate reasoning step or a final answer.
- State transitions: the chosen step is appended to the state via concatenation on a newline.
- PPO maximizes the conventional clipped surrogate objective, with separate actor and critic learning rates, KL coefficient $0.01$, and maximal sequence length $1536$.
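The episode dynamics described above can be sketched as a toy environment in which each action is a text step appended to the state on a newline; this is a simplified illustration of the MDP structure, not the actual implementation:

```python
class ReasoningEnv:
    """Toy MDP: state = question plus reasoning steps joined by newlines."""

    def __init__(self, question):
        self.state = question

    def step(self, action, is_final=False):
        # Transition: concatenate the new step onto the state with a newline.
        self.state = self.state + "\n" + action
        done = is_final  # the episode ends when the action is a final answer
        return self.state, done
```

In training, the PRM would score each intermediate `action`, and the outcome reward would be computed once `done` is reached.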
3. Strategies for Exploiting Scarce Domain Data
Given only on the order of $100$ training examples per task, OpenRFT combines three mutually reinforcing strategies:
3.1 Question Augmentation (DA):
- Each question $q$ is expanded through $5$ paraphrases (using GPT-4o-mini), with random shuffling of the choice options, yielding $6$ total variants per item (the original plus five paraphrases).
- All variants become independent PPO episodes, ensuring diverse state-space coverage and regularization against overfitting.
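The augmentation step can be sketched as follows; `paraphrase` stands in for the GPT-4o-mini call and is a hypothetical placeholder, as are the other names:

```python
import random

def augment(question, options, answer_idx, n_para=5, seed=0):
    """Expand one MCQ into 6 variants: the original plus n_para
    paraphrases, each with independently shuffled answer options."""
    rng = random.Random(seed)
    variants = []
    texts = [question] + [paraphrase(question, i) for i in range(n_para)]
    for q in texts:
        order = list(range(len(options)))
        rng.shuffle(order)
        shuffled = [options[i] for i in order]
        new_answer = order.index(answer_idx)  # track where the correct option moved
        variants.append((q, shuffled, new_answer))
    return variants

def paraphrase(question, i):
    # Placeholder for an LLM paraphrasing call (e.g. GPT-4o-mini).
    return f"[paraphrase {i}] {question}"
```

Tracking `new_answer` through the shuffle is the important detail: the label must follow the correct option into its new position, or the augmented episodes would train against corrupted supervision.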
3.2 Reasoning-Process Synthesis and SFT:
- Synthetic reasoning traces are generated for each sample by drawing multiple rollouts from a teacher model; only those producing the correct answer are retained, and the highest-confidence trace is selected.
- These synthesized $(q, \text{trace}, a)$ triples comprise the SFT corpus, enabling stable policy initialization when annotated traces are unavailable.
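This rejection-sampling filter can be sketched as below; the teacher interface and its confidence field are illustrative assumptions, not the actual OpenRFT API:

```python
def synthesize_trace(question, answer, teacher_sample, n_rollouts=8):
    """Sample teacher rollouts, keep only those reaching the correct
    answer, and return the highest-confidence surviving trace."""
    correct = []
    for _ in range(n_rollouts):
        trace, predicted, confidence = teacher_sample(question)
        if predicted == answer:
            correct.append((confidence, trace))
    if not correct:
        return None  # this sample is dropped from the SFT set
    return max(correct)[1]  # trace with the highest confidence
```

Filtering on answer correctness gives weak but cheap quality control over the synthetic traces; it cannot rule out right-answer-wrong-reasoning chains, which is precisely the gap the PRM is meant to close during RL.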
3.3 Few-Shot In-Context Learning (ICL):
- For each target question during RL, the nearest $(q, a)$ pairs from the training pool (retrieved via SBERT embeddings) are prepended as few-shot exemplars.
- This practice “warm-starts” domain reasoning and guides exploration during RL optimization.
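The retrieval step can be sketched with cosine similarity over precomputed embeddings; here plain NumPy vectors stand in for SBERT embeddings, and the function names are illustrative:

```python
import numpy as np

def retrieve_shots(query_emb, pool_embs, pool_pairs, k=3):
    """Return the k (question, answer) pairs whose embeddings are
    most cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity to every pool item
    top = np.argsort(-sims)[:k]       # indices of the k most similar items
    return [pool_pairs[i] for i in top]
```

In practice the pool embeddings would be computed once over the training set, so retrieval per RL episode reduces to a single matrix-vector product.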
4. Empirical Benchmarking: SciKnowEval Tasks
OpenRFT is evaluated on SciKnowEval L3 datasets, with tasks spanning biology, chemistry, physics, and materials science. Each of the eight tasks (T1–T8) includes $100$ train and $100$ test samples (T7: $100$ train / $49$ test). OpenRFT is compared against:
- Ceilings: GPT-4o-mini and o1-mini
- Vanilla: Skywork-o1-Open-Llama-3.1-8B, without adaptation
- ReFT/PRM: RL-based fine-tuning without/with process supervision
- SFT: Only SFT on synthesized traces
- SFT+RL(PRM), +DA, +ICL: Chained with data augmentation and ICL, culminating in full OpenRFT
All experiments apply LoRA (rank $4$), batch size $8$, augmentation factor $6$, SBERT-based ICL retrieval, and sampling temperatures of $0.6$ (open models) or $1.0$ (o1-mini).
Per-task and average exact match accuracies (mean of three runs):
| Model | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | 0.37 | 0.69 | 0.84 | 0.32 | 0.53 | 0.49 | 0.90 | 0.525 | 0.583 |
| o1-mini | 0.35 | 0.86 | 0.87 | 0.23 | 0.73 | 0.70 | 0.87 | 0.50 | 0.639 |
| Vanilla | 0.28 | 0.55 | 0.52 | 0.23 | 0.45 | 0.34 | 0.41 | 0.41 | 0.403 |
| ReFT | 0.27 | 0.50 | 0.52 | 0.23 | 0.44 | 0.33 | 0.41 | 0.50 | 0.402 |
| ReFT+PRM | 0.30 | 0.57 | 0.49 | 0.23 | 0.44 | 0.36 | 0.37 | 0.48 | 0.405 |
| SFT | 0.33 | 0.53 | 0.49 | 0.20 | 0.45 | 0.37 | 0.43 | 0.49 | 0.415 |
| SFT+RL(PRM) | 0.29 | 0.59 | 0.52 | 0.24 | 0.47 | 0.36 | 0.46 | 0.57 | 0.437 |
| SFT+RL(PRM)+DA | 0.29 | 0.63 | 0.53 | 0.21 | 0.47 | 0.38 | 0.48 | 0.59 | 0.447 |
| OpenRFT | 0.33 | 0.57 | 0.52 | 0.28 | 0.46 | 0.36 | 0.49 | 0.53 | 0.443 |
Key observations:
- Pure RL-based ReFT (without process supervision) provides negligible gains over baseline.
- SFT with synthesized traces yields about $1$ point over the Vanilla baseline ($0.403 \to 0.415$).
- Adding RL(PRM) increases average performance by roughly $2$ further points ($0.415 \to 0.437$), while data augmentation provides an additional $1$ point ($0.437 \to 0.447$).
- OpenRFT achieves an overall relative improvement of roughly $10$–$11\%$ in average accuracy over the unadapted (Vanilla) policy ($0.403 \to 0.443$–$0.447$).
- Ceiling models such as o1-mini and GPT-4o-mini outperform all adaptation variants, indicating substantial remaining headroom in open-system RFT optimization.
Ablation studies confirm the critical importance of process alignment: when reasoning traces are sourced from an action-incompatible teacher (e.g., QwQ-32B), task performance collapses, emphasizing the necessity of shared action space between teacher and student.
5. Process Supervision and Reward Shaping
OpenRFT’s reward function combines stepwise process supervision (via the PRM) with an outcome-based reward, $R = \alpha \, r_{\text{out}} + (1 - \alpha) \cdot \bar{r}^{\text{proc}}$.
The weight $\alpha$ is set to $0.7$, emphasizing answer correctness while still incentivizing rational step-wise trajectories. The PRM assigns rationality scores to (state, step) pairs, which are combined across the steps of a trajectory into $\bar{r}^{\text{proc}}$, typically by mean or minimum. This dual-reward structure is designed to reduce reward hacking, where the model arrives at correct answers via incorrect reasoning. The reward model, however, remains static throughout training, and dynamic joint policy-reward updates (such as self-play) are suggested as a future direction.
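The two step-aggregation choices mentioned above behave quite differently and can be compared directly; this is a sketch under the stated assumptions, with an illustrative function name:

```python
def aggregate_process_scores(step_scores, mode="mean"):
    """Collapse per-step PRM rationality scores into one process reward.

    'mean' credits overall trajectory quality, while 'min' penalizes a
    single irrational step, making the reward harder to hack."""
    if mode == "mean":
        return sum(step_scores) / len(step_scores)
    if mode == "min":
        return min(step_scores)
    raise ValueError(f"unknown aggregation mode: {mode}")
```

For a trajectory rated $[0.9, 0.2, 0.8]$, mean aggregation still awards about $0.63$, whereas minimum aggregation collapses the process reward to $0.2$, strongly discouraging any single spurious step.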
6. Current Limitations and Prospective Advancements
Limitations include:
- Reliance on a strong, open-source reasoning foundation model and an appropriately matched PRM; higher performance is potentially gated by these base modules.
- Static reward modeling; dynamically updating the PRM during policy training (e.g., in a self-play or iterative co-training regime) is unexplored.
- Restriction to multiple-choice tasks; extension to free-form or long-document reasoning will necessitate changes to both state/action spaces and reward modeling.
- Inconsistency between few-shot prompt construction in SFT versus RL, possibly limiting transfer.
Envisioned enhancements are:
- Alternating PRM and policy improvement in a dynamic reward modeling loop.
- Incorporating structured domain ontology embeddings and symbolic constraints into either policy or reward modeling.
- Leveraging advanced data augmentation, including adversarial or knowledge-aware templates and unlabeled corpora.
- Introducing unsupervised self-play, curriculum learning, and continuous tracking of margin-case pools.
- Generalization to broader task formats, such as code synthesis or technical document comprehension.
7. Conclusion and Research Implications
OpenRFT validates that generalist System-2 models can be efficiently adapted to narrow scientific MCQ tasks using minimal domain data and a unified reinforcement fine-tuning paradigm. The synthesis of data augmentation, process-trace SFT, and reinforcement learning with both process and outcome supervision yields measurable improvements over baseline adaptation and competitive performance relative to (still superior) closed-source ceiling models. The methodology’s modularity, transparency, and extensibility position it as a testbed for scaling domain expertise and reward-based alignment in open-source reasoning models. Future research will likely focus on co-evolution of the reward model and policy, transfer to varied domain/task settings, and integration of richer forms of supervision and domain knowledge (Zhang et al., 2024).