Reference-Guided Fine-Tuning (ReGFT)
- Reference-Guided Fine-Tuning (ReGFT) is a family of methods that actively integrates reference signals during training to improve optimization and overcome reward sparsity.
- It employs diverse reference objects such as partial human solutions, instruction-tuned teacher models, moving averages of global checkpoints, or example banks to steer gradient flow and sampling.
- Empirical results demonstrate that ReGFT enhances performance in mathematical reasoning, federated learning, and controllable generation by effectively addressing sparse feedback and model drift.
Reference-Guided Fine-Tuning (ReGFT) denotes a family of adaptation methods in which an explicit reference actively shapes fine-tuning or generation. In current usage, the term has both a narrow and a broad sense. Narrowly, it names a pre-RL supervised stage for mathematical reasoning that uses partial human reference solutions to synthesize verifier-approved positive trajectories before reinforcement learning. More broadly, it describes a design pattern in which a reference object—such as a human-written proof, an instruction-tuned teacher, a moving average of prior global models, or a bank of examples—guides optimization or generation without functioning merely as a passive merge target (Wu et al., 1 Mar 2026, Ruan et al., 2 May 2026, Yoon et al., 29 Jun 2025, Curvo et al., 11 May 2026).
1. Terminological scope and conceptual definition
In the mathematical-reasoning literature, ReGFT is introduced as a response to reward sparsity in RL with verifiable rewards. The central problem is that, on hard problems, a model may fail to sample any correct trajectory, so RL receives no meaningful positive feedback. The proposed remedy is to use a partial prefix of a human reference solution as a hint, induce the model to generate its own reasoning trace, keep only verifier-approved correct trajectories, and fine-tune on them before RL (Wu et al., 1 Mar 2026).
In later usage, the term is generalized beyond that specific pipeline. GIFT explicitly describes itself as “a concrete instantiation of a broader reference-guided fine-tuning paradigm,” in which an instruction-tuned model provides token-level confidence weights that guide training on a pretrained base model before the learned update is merged back into the instruction model (Ruan et al., 2 May 2026). FedRef presents a federated variant in which a moving reference model derived from previous global checkpoints acts as a Bayesian prior in a server-side MAP objective, thereby guiding each new round of optimization (Yoon et al., 29 Jun 2025). “Follow the Mean” extends the notion further: in flow matching, a reference bank of examples alters the conditional endpoint mean and therefore the velocity field itself, enabling adaptation through examples rather than parameter updates (Curvo et al., 11 May 2026).
Taken together, these formulations define ReGFT less by a single optimizer or architecture than by a common structural principle: the reference is active during adaptation. It changes where gradients flow, which trajectories are synthesized, which parameters are penalized, or which generative drift is followed.
2. Formal mechanisms of reference guidance
A defining feature of ReGFT methods is that the reference enters the learning problem through an explicit mathematical interface rather than as ordinary supervision alone.
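In RLVR mathematical reasoning, the reference is a human solution $y^{\mathrm{ref}}$ to a problem $x$, from which a prefix $h$ is extracted as a hint. The model is then sampled from $\pi_\theta(\cdot \mid x, h)$ rather than only from $\pi_\theta(\cdot \mid x)$, and only verifier-approved trajectories are retained. The resulting ReGFT loss mixes standard positive trajectories $\mathcal{D}^{+}$ and reference-guided positives $\mathcal{D}^{+}_{\mathrm{ref}}$:
$$\mathcal{L}_{\mathrm{ReGFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}^{+} \cup \mathcal{D}^{+}_{\mathrm{ref}}}\big[\log \pi_\theta(y \mid x)\big].$$
This differs from direct supervised imitation of the full human chain of thought, which would optimize $\log \pi_\theta(y^{\mathrm{ref}} \mid x)$ and is reported to be substantially weaker (Wu et al., 1 Mar 2026).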
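The data-synthesis step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_fn`, `verify_fn`, `hint_fraction`, and the prompt template are hypothetical names introduced here, and storing each accepted trajectory against the hint-free question is an assumption about how the fine-tuning pairs are formed.

```python
from typing import Callable, List, Tuple

def build_regft_dataset(
    problems: List[Tuple[str, str]],        # (question, human reference solution)
    sample_fn: Callable[[str], str],        # hypothetical: prompt -> model response
    verify_fn: Callable[[str, str], bool],  # hypothetical: (question, response) -> is correct
    hint_fraction: float = 0.5,             # assumed share of the reference used as the hint
    n_samples: int = 4,
) -> List[Tuple[str, str]]:
    """Collect verifier-approved trajectories from plain and hinted sampling."""
    dataset: List[Tuple[str, str]] = []
    for question, reference in problems:
        # Use a prefix of the human solution as a partial hint.
        hint = reference[: int(len(reference) * hint_fraction)]
        hinted_prompt = (
            f"{question}\n\nPartial hint:\n{hint}\n\nYou must solve it by yourself."
        )
        # Sample under both the standard prompt and the reference-guided prompt.
        for prompt in (question, hinted_prompt):
            for _ in range(n_samples):
                response = sample_fn(prompt)
                if verify_fn(question, response):
                    # Assumption: the training pair uses the question without the hint.
                    dataset.append((question, response))
    return dataset
```

The returned pairs are model-generated trajectories, so a later supervised stage would train on the model's own reasoning rather than on the human text.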
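In GIFT, the reference is an instruction-tuned model $\pi_{\mathrm{inst}}$ aligned to the same backbone as the pretrained base model $\pi_{\mathrm{base}}$. For each target token $y_t$, the instruction model supplies a confidence score
$$c_t = \pi_{\mathrm{inst}}(y_t \mid x, y_{<t}),$$
and the base-plus-adapter model $\pi_{\mathrm{base}+\Delta}$ is trained with a confidence-weighted loss
$$\mathcal{L}_{\mathrm{GIFT}}(\Delta) = -\sum_{t} c_t \,\log \pi_{\mathrm{base}+\Delta}(y_t \mid x, y_{<t}).$$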
The teacher therefore modulates the magnitude of token-level gradients but is not matched by KL divergence. The paper explicitly states that this is not knowledge distillation (Ruan et al., 2 May 2026).
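A confidence-weighted token loss of this kind can be sketched in a few lines. The sketch assumes the confidence is the probability the instruction model assigns to each target token; the function name and the unnormalized sum are illustrative choices, not details confirmed by the source.

```python
import math
from typing import Sequence

def confidence_weighted_nll(
    teacher_probs: Sequence[float],  # assumed: instruction-model probability of each target token
    student_probs: Sequence[float],  # base-plus-adapter probability of the same tokens
) -> float:
    """Return sum_t -c_t * log p_student(y_t); weights scale gradients, nothing is matched."""
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must score the same tokens")
    return -sum(c * math.log(p) for c, p in zip(teacher_probs, student_probs))

# Tokens the teacher finds implausible contribute almost nothing to the loss.
loss = confidence_weighted_nll([1.0, 0.0], [0.5, 0.1])
```

With all weights set to 1 the expression reduces to ordinary token-level negative log-likelihood, which makes the contrast with a KL-based distillation objective easy to see.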
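In FedRef, the reference is a moving average of recent global models,
$$\theta_{\mathrm{ref}}^{(r)} = \frac{1}{K} \sum_{k=1}^{K} \theta_{\mathrm{g}}^{(r-k)},$$
used as a Bayesian prior in a federated MAP objective. Conceptually, the optimization can be written as
$$\min_{\theta} \; \sum_{i} \mathcal{L}_i(\theta) + \frac{\lambda}{2}\,\big\lVert \theta - \theta_{\mathrm{ref}}^{(r)} \big\rVert_2^2,$$
where $\mathcal{L}_i$ is the loss of client $i$, $K$ is the reference window size, $\lambda$ is the regularization strength, and the quadratic penalty is the reference-guided component. The guidance acts on the server, not on clients, which continue to run plain local SGD (Yoon et al., 29 Jun 2025).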
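The server-side mechanics can be illustrated with a small sketch. It assumes a uniform average over the recent global models and a single gradient step on a loss-plus-quadratic-penalty objective; `loss_grad` is a hypothetical input standing in for whatever loss signal the server derives from clients, and `lam` and `lr` are illustrative hyperparameter names.

```python
from typing import List, Sequence

Vector = List[float]

def moving_reference(recent_globals: Sequence[Vector]) -> Vector:
    """Uniform average of the most recent global models (assumed weighting)."""
    k = len(recent_globals)
    return [sum(values) / k for values in zip(*recent_globals)]

def server_map_step(theta: Vector, loss_grad: Vector, reference: Vector,
                    lam: float, lr: float) -> Vector:
    """One gradient step on loss(theta) + (lam / 2) * ||theta - reference||^2."""
    return [w - lr * (g + lam * (w - r))
            for w, g, r in zip(theta, loss_grad, reference)]

reference = moving_reference([[0.0, 2.0], [2.0, 4.0]])
updated = server_map_step([2.0, 3.0], [0.0, 0.0], reference, lam=0.5, lr=1.0)
```

With a zero loss gradient the step simply pulls the current global model toward the reference, which is the role the quadratic penalty plays in the objective.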
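In reference-guided flow matching, the reference enters through the conditional endpoint mean. For linear interpolants $x_t = (1 - t)\,x_0 + t\,x_1$ with noise $x_0$ and data $x_1$, the velocity field satisfies
$$v(x_t, t) = \frac{\mathbb{E}[x_1 \mid x_t] - x_t}{1 - t}.$$
Replacing the base mean $\mu_{\mathrm{base}}(x_t, t) = \mathbb{E}[x_1 \mid x_t]$ by a reference-conditioned mean $\mu_{\mathrm{ref}}(x_t, t)$ changes the velocity by
$$\Delta v(x_t, t) = \frac{\mu_{\mathrm{ref}}(x_t, t) - \mu_{\mathrm{base}}(x_t, t)}{1 - t}.$$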
The reference thus controls generation by altering the endpoint mean, not by retraining the model (Curvo et al., 11 May 2026).
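The effect of a mean shift on the velocity can be checked numerically. The sketch below assumes the convention $x_t = (1-t)x_0 + t x_1$ with data at $t = 1$; the function names and the `scale` knob are illustrative and not taken from the source.

```python
from typing import List

Vector = List[float]

def velocity_from_mean(x_t: Vector, mean_x1: Vector, t: float) -> Vector:
    """Velocity implied by an endpoint mean for a linear interpolant, valid for t < 1."""
    return [(m - x) / (1.0 - t) for m, x in zip(mean_x1, x_t)]

def guided_velocity(x_t: Vector, base_mean: Vector, ref_mean: Vector,
                    t: float, scale: float = 1.0) -> Vector:
    """Swap the base endpoint mean for a (partially) reference-conditioned one."""
    shifted = [b + scale * (r - b) for b, r in zip(base_mean, ref_mean)]
    return velocity_from_mean(x_t, shifted, t)

base_v = velocity_from_mean([0.0], [1.0], t=0.5)
ref_v = guided_velocity([0.0], [1.0], [2.0], t=0.5)
```

The difference between the two velocities equals the mean difference divided by $1 - t$, so the same mean shift produces a larger velocity change later in the trajectory.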
3. Representative algorithmic families
Representative ReGFT formulations differ in what counts as the “reference” and in how that reference is used.
| Method | Reference object | Guidance mechanism |
|---|---|---|
| ReGFT for RLVR math | Partial human reference solution | Synthesizes verifier-approved positive trajectories before RL |
| GIFT | Instruction-tuned teacher model | Confidence-weighted token loss on a base model, then merge |
| FedRef | Moving average of prior global models | Server-side MAP objective with quadratic proximity penalty |
| Reference-Mean Guidance / Semi-Parametric Guidance | Reference bank of examples | Mean-shifted flow field or reference-conditioned endpoint prediction |
In the RLVR formulation, the full pipeline is base model → ReFT or ReGFT pre-training on hard problems → DAPO on full OmniMath. Hard problems are defined as OmniMath problems for which the base model’s pass@16 falls below a fixed threshold. The reference-guided prompt supplies the question, a partial hint, and the instruction “You must solve it by yourself,” which is intended to preserve model-generated reasoning rather than token-by-token copying (Wu et al., 1 Mar 2026).
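Selecting hard problems by pass@k can be sketched as below. The estimator is the standard unbiased pass@k formula; whether the source uses this estimator is an assumption, and `cutoff` is a placeholder value because the threshold is not recoverable from the text.

```python
from math import comb
from typing import Dict, List

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb returns 0 when fewer than k incorrect samples exist, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

def select_hard_problems(correct_counts: Dict[str, int], n: int,
                         k: int = 16, cutoff: float = 0.5) -> List[str]:
    """Keep problem ids whose estimated pass@k falls below the (hypothetical) cutoff."""
    return [pid for pid, c in correct_counts.items() if pass_at_k(n, c, k) < cutoff]

hard = select_hard_problems({"a": 0, "b": 5}, n=16)
```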
In GIFT, the architecture comprises a frozen pretrained base model, a frozen instruction-tuned model, and LoRA adapters attached to the base. Offline annotation computes token confidences once, guided training learns only the adapter, and the resulting low-rank delta is merged into the instruction model with the standard LoRA merge operator. The reference model is therefore active during training but unchanged by it (Ruan et al., 2 May 2026).
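The merge step can be illustrated with the usual LoRA update, in which the low-rank product is scaled by `alpha / rank` and added to a weight matrix. The sketch below applies that update to a single instruction-model matrix using plain Python lists; the function names are illustrative, and the specific scaling is the common LoRA convention rather than a detail stated in the source.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w_inst: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return W_inst + (alpha / rank) * B @ A without modifying the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_inst, delta)]

merged = merge_lora([[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], [[3.0, 4.0]],
                    alpha=2.0, rank=1)
```

Because the adapter is learned against the base weights and added to the instruction weights, the instruction model itself never receives a gradient update.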
FedRef retains standard client-side training and relocates the reference mechanism to the server. Each round aggregates client models into a current global model, aggregates recent global models into a reference model, and performs a Bayesian optimization step on the server using both the reference and the reported client losses. This design is presented as communication-efficient because clients transmit only model parameters and scalar loss values beyond the usual federated exchange (Yoon et al., 29 Jun 2025).
Reference-Mean Guidance is training-free: the base flow model is frozen, the reference bank is encoded into latent space, and the per-step correction is computed in closed form from softmax weights over the references. Semi-Parametric Guidance amortizes the same idea through an explicit mean anchor and a learned residual refiner, so that inference-time behavior still changes when the reference set is swapped (Curvo et al., 11 May 2026).
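One way to obtain softmax weights over a reference bank in closed form is the posterior mean of the endpoint under an empirical distribution on the references. The sketch below assumes $x_t = (1-t)x_0 + t x_1$ with standard Gaussian noise $x_0$ and works directly on vectors; the function name, the latent encoding step, and this specific weighting are assumptions for illustration and may differ from the paper's formula.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]

def reference_posterior_mean(x_t: Vector, t: float,
                             references: Sequence[Vector]) -> Tuple[Vector, List[float]]:
    """Softmax-weighted mean of references given x_t, valid for 0 <= t < 1."""
    sigma = 1.0 - t  # noise scale of x_t around t * reference
    logits = [
        -sum((x - t * r) ** 2 for x, r in zip(x_t, ref)) / (2.0 * sigma * sigma)
        for ref in references
    ]
    top = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]
    mean = [sum(w * ref[d] for w, ref in zip(weights, references))
            for d in range(len(x_t))]
    return mean, weights

mean, weights = reference_posterior_mean([0.5], 0.5, [[1.0], [-1.0]])
```

Swapping the reference list changes the returned mean, and therefore the correction, without touching any model parameters. The cost is linear in the number of references per step.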
4. Empirical behavior across domains
In mathematical reasoning, the reported gains are tied directly to reward sparsity. On OmniMath, even with 64 samples per problem, the raw model solves 68.58% of problems under standard sampling. Reference-guided sampling solves 70.82%, unlocks 5.85% of problems that are never solved under standard sampling, and leaves 3.61% solved only by standard sampling. Before RL, train pass@64 rises from 68.6 for the raw model to 72.5 for ReGFT. After DAPO with 64 responses per prompt, ReGFT reaches 70.0 on AIME24, 61.6 on AIME25, and 40.3 on BeyondAIME, which are the best reported results among the compared settings. The same study reports that direct SFT on raw reference solutions “fails to achieve competitive RL performance” (Wu et al., 1 Mar 2026).
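The coverage percentages above are mutually consistent, which a short calculation confirms. The union figure is derived here from the reported numbers and is not itself stated in the source.

```python
standard = 68.58       # % solved under standard sampling
only_standard = 3.61   # % solved only by standard sampling
only_guided = 5.85     # % solved only by reference-guided sampling

both = standard - only_standard   # solved under both sampling modes
guided = both + only_guided       # total solved under reference-guided sampling
union = standard + only_guided    # solved by at least one mode (derived)

print(round(both, 2), round(guided, 2), round(union, 2))
```

The guided total reproduces the reported 70.82%, and combining both sampling modes covers about 74.43% of problems, which is why the loss mixes standard and reference-guided positives rather than using only one source.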
In instruction-tuned transfer, the main empirical pattern is that direct fine-tuning of the instruction model is often harmful, whereas reference-guided weighting and merge improve target-domain performance while preserving broad behavior. On Llama3.1-8B-Instruct for mathematics, direct SFT drops average accuracy from 16.8 to 8.9, while GIFT reaches 22.0. On Llama3.1-8B-Instruct for medical QA, Instruct-SFT drops average from 62.6 to 57.3, while GIFT reaches 68.8. After merging, general metrics are reported as preserved or slightly improved: on Qwen2.5-7B, MMLU moves from 68.7 to 68.8 and IFEval from 71.2 to 72.1; on Llama3.1-8B, MMLU moves from 63.2 to 63.7 and IFEval from 73.8 to 74.7 (Ruan et al., 2 May 2026).
In federated learning, the available quantitative record is less complete, but the stated pattern is clear. FedRef is reported to achieve lower centralized loss over rounds than FedAvg, FedProx, and FedOpt on MNIST and CIFAR10, and to yield higher segmentation performance on FeTS2022, while keeping client-side computation identical to FedAvg because no proximal term is evaluated on the client (Yoon et al., 29 Jun 2025).
In controllable generation, the evidence emphasizes adaptation through examples. On GenEval with FLUX.2-klein, Reference-Mean Guidance improves the mean score from 80.10 to 91.17 while costing 1.02× base runtime and 1× NFE. Particularly large gains are reported for Position, from 65.25 to 94.00, along with improvements on Two-object, Colors, and Attribution. Semi-Parametric Guidance on AFHQv2 approximately matches unconditional DiT-B/4 quality, with FID 23.256 versus 23.111, KID 0.013 versus 0.012, and IS 6.227 versus 6.554 (Curvo et al., 11 May 2026).
Across these domains, a recurrent empirical pattern is visible: reference guidance is most valuable where plain fine-tuning or plain RL is constrained by sparse positive signals, brittle instruction-model updates, non-IID drift, or limited control interfaces. This suggests that ReGFT is particularly useful when the missing ingredient is not more raw optimization but a better way to expose existing structure to the model.
5. Relationship to adjacent methods and common misconceptions
A recurring misconception is that ReGFT is equivalent to direct imitation of the reference. The mathematical-reasoning formulation is explicit that the model should “solve it by yourself,” and the reported observation is that it “almost always derives its own reasoning independently even when exposed to the full reference solution.” Its purpose is to keep training targets in the model’s own reasoning space while still exploiting external guidance (Wu et al., 1 Mar 2026).
A second misconception is that reference guidance is simply a form of knowledge distillation. GIFT rejects that characterization. The instruction-tuned teacher provides only scalar token-level confidence weights, and there is no KL term or requirement that the student reproduce the teacher distribution. The guidance signal redistributes the learning signal rather than defining a teacher-student matching objective (Ruan et al., 2 May 2026).
A third misconception is that any transfer-based merge method is reference-guided. GIFT distinguishes itself from Shadow-FT, Chat Vector, Re-Adapt, LoRE-Adapt, and related post-hoc transfer methods precisely because the instruction-tuned model is active during adapter training rather than only at the final merge stage (Ruan et al., 2 May 2026). By the same logic, FedRef differs from client-side proximal methods such as FedProx because its reference-guided penalty is applied through a server-side MAP update around a moving reference model, not by modifying each client’s local objective (Yoon et al., 29 Jun 2025).
A fourth misconception is that fine-tuning must modify parameters. “Follow the Mean” shows a different possibility: in flow matching, behavior can be altered by changing the reference set that defines the endpoint mean correction while keeping prompt, seed, and weights fixed. In that setting, ReGFT is literally “adaptation through examples” (Curvo et al., 11 May 2026).
These contrasts indicate that ReGFT is best understood as a family of mechanisms for making references operative in learning dynamics. The operative part may be trajectories, token weights, proximal anchors, or endpoint means, but in each case the reference changes the effective optimization or generation process.
6. Limitations, failure modes, and open problems
The most immediate limitation is dependence on an appropriate reference source. The RLVR ReGFT method assumes access to human reference chains of thought for each training problem, which holds in AoPS-like and OmniMath settings but may not hold elsewhere. The same paper notes that even reference-guided sampling solves only 70.82% of OmniMath problems, leaving roughly 30% unsolved, and attributes the gap in part to limited ability to interpret advanced human mathematical reasoning and to verifier false negatives for open-ended proofs (Wu et al., 1 Mar 2026).
Reference quality also matters in other formulations. GIFT requires an instruction-tuned teacher whose confidence scores are meaningfully aligned; the ablation labeled GIFT-BaseT shows that using the base model as teacher recovers baseline behavior but does not exceed it. FedRef depends on choices of reference window size, regularization strength, and learning rates, all of which are identified as important but not systematically characterized. The FedRef paper also notes vulnerability to malicious clients and model inference attacks, as well as the absence of formal convergence guarantees (Ruan et al., 2 May 2026, Yoon et al., 29 Jun 2025).
Compute overhead remains a general concern. ReGFT for math adds standard sampling, reference-guided sampling, and an additional SFT stage before RL; RL itself remains expensive, with up to 64 rollouts per prompt in the reported experiments. GIFT requires an offline teacher pass over the dataset, although the paper describes the cost as modest for moderate datasets. Flow-based reference guidance can become expensive when the reference bank is large because the mean correction requires softmax weighting across all references unless subsampling or approximate retrieval is used (Wu et al., 1 Mar 2026, Ruan et al., 2 May 2026, Curvo et al., 11 May 2026).
Several failure modes are specific to particular interfaces. In mathematical reasoning, some problems are solved only without hints, suggesting that over-reliance on hints may bias the model away from solutions it can find alone. In flow matching, poorly curated reference sets transfer unwanted correlations, and late-time instability can arise if the correction term does not decay as the trajectory approaches the data endpoint. Semi-Parametric Guidance is introduced partly to suppress such nuisance artifacts, for example unwanted white backgrounds shared across the references (Wu et al., 1 Mar 2026, Curvo et al., 11 May 2026).
Open problems recur across the literature. They include adaptive hint selection or curriculum strategies, more informative verifiers for open-ended reasoning, stronger or more structured Bayesian priors than identity-FIM approximations, better merge operators, guidance signals beyond raw token probabilities, dynamic or multi-reference schemes, and extensions beyond mathematics to code, logic, planning, video, and non-verifiable tasks (Wu et al., 1 Mar 2026, Ruan et al., 2 May 2026, Yoon et al., 29 Jun 2025, Curvo et al., 11 May 2026).
In aggregate, these limitations show that ReGFT is not a single solved recipe. It is a research program organized around a consistent premise: when direct optimization is brittle or underspecified, an explicit reference can be used to reshape the learning problem itself.