
BREP ReFT: Bias-Restrained Prefix FineTuning

Updated 17 November 2025
  • The paper introduces BREP ReFT, which optimizes per-layer scaling and bias vectors to enhance mathematical reasoning accuracy with minimal parameter overhead.
  • BREP ReFT employs prefix-truncated training and PID-controlled bias constraints to prevent degradation of numerical encoding and reasoning prefixes.
  • Empirical results demonstrate that BREP ReFT outperforms standard PEFT methods on benchmarks like GSM8K, achieving superior accuracy with orders-of-magnitude fewer parameters.

Bias-Restrained Prefix Representation FineTuning (BREP ReFT) is a representation finetuning technique for LLMs designed to address limitations of standard ReFT on mathematical reasoning tasks. BREP ReFT achieves high parameter efficiency by freezing all pretrained model weights and optimizing only per-layer scaling and bias vectors that act directly on hidden states. The method introduces prefix-truncated training, early-stage intervention, and a PID-constrained bias magnitude objective to prevent degradation of mathematical inference accuracy commonly observed in conventional PEFT (Parameter-Efficient Finetuning) and unrestricted ReFT approaches. Empirical evidence demonstrates that BREP ReFT attains superior accuracy and generalization on chain-of-thought mathematical reasoning benchmarks, matching or exceeding the performance of weight-based PEFT methods with orders-of-magnitude fewer learned parameters (Liang et al., 13 Nov 2025).

1. Background and Motivation

Parameter-Efficient Finetuning (PEFT) schemes, such as LoRA and Prefix-tuning, update only a fraction of the model weights, enabling task adaptation without full weight updates. Representation Finetuning (ReFT) extends PEFT by solely learning lightweight transformations—elementwise scaling and bias—on intermediate hidden states while freezing pretrained weights. ReFT is highly parameter-efficient and succeeds on commonsense and instruction-following tasks. However, on mathematical benchmarks such as GSM8K, ReFT shows a notable accuracy drop (∼11.5% lower than PEFT).

Two principal failure modes are identified:

  • Misleading Reasoning Prefixes: ReFT-finetuned models tend to generate poor initial chain-of-thought (CoT) tokens (the "reasoning prefix"), leading to erroneous subsequent inference.
  • Numerical Encoding Degradation: The learned per-layer bias vectors $\bm{b}$ introduce deviations in the model’s internal linear encoding of numbers, with the effect compounding across autoregressive token generation steps. Empirical projection of these biases onto the original number-encoding direction indicates frequent excursions beyond a critical threshold, correlating with elevated addition error rates (a minimal probe of this kind is sketched below).
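
For illustration only, such a probe can be approximated by projecting each layer's learned bias onto a unit "number-encoding" direction and counting excursions past a threshold. The function below is a hypothetical sketch, not the paper's probing code, and assumes the direction vector has been estimated separately:

import torch

def bias_number_excursions(biases, number_dir, threshold):
    """Project learned per-layer ReFT biases onto a number-encoding direction.

    biases     : list of (d,) tensors, one learned bias vector per layer
    number_dir : (d,) vector approximating the model's linear number encoding
    threshold  : projections with magnitude above this count as excursions
    """
    direction = number_dir / number_dir.norm()
    projections = torch.stack([b @ direction for b in biases])
    return projections, int((projections.abs() > threshold).sum())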

These findings motivate BREP ReFT, which targets the initialization phase of mathematical reasoning and incorporates explicit constraints to control representational drift.

2. Mathematical Formulation

The BREP ReFT formulation combines a standard transformer architecture with the ReFT hidden-state intervention and a prefix-focused, bias-constrained optimization scheme.

  1. Transformer with ReFT:

For input $\bm{x}=(x_1,\ldots,x_n)$, let $\bm{h}^0$ be the initial embedding. At layer $j$, the transformer computes:

$$\bm{h}^j = \bm{h}^{j-1} + \mathbf{A}^j(\bm{h}^{j-1}) + \mathbf{F}^j(\bm{h}^{j-1})$$

where $\mathbf{A}^j$ and $\mathbf{F}^j$ denote the attention and feed-forward sublayers. Standard ReFT then applies:

$$\bm{h}^j_{\text{ReFT}} = \mathbf{W}^j \odot \left( \bm{h}^{j-1} + \mathbf{A}^j + \mathbf{F}^j \right) + \bm{b}^j, \qquad \mathbf{W}^j, \bm{b}^j \in \mathbb{R}^d$$

where $\odot$ denotes elementwise multiplication and $\mathbf{W}^j$, $\bm{b}^j$ are the learned per-layer scaling and bias vectors.
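
A minimal PyTorch sketch of this per-layer transform (module and parameter names are illustrative, not the authors' released implementation) initializes the scaling to ones and the bias to zeros so the intervention starts as the identity:

import torch
import torch.nn as nn

class ReFTIntervention(nn.Module):
    """Elementwise scaling W and bias b applied to one layer's hidden states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.W = nn.Parameter(torch.ones(d_model))   # identity scaling at init
        self.b = nn.Parameter(torch.zeros(d_model))  # zero bias at init

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) residual-stream output of layer j
        return self.W * h + self.b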

  2. Prefix-Focused Objective:

The training objective focuses on the initial $l_p$ tokens of each target response sequence $y_{1:T}$. At token position $t$, the per-token prefix reward is:

$$R_t = \frac{1}{l_p} \log p(y_t \mid x, y_{1:t-1}) - \frac{1}{l_f} \log p(y_t \mid x, y_{1:t-1})$$

where $l_f$ is the full sequence length. The cumulative prefix reward is maximized:

$$R_{\text{cum}} = \sum_{t=1}^{n} R_t, \qquad n \leq l_p$$

Equivalently, the loss minimized is the mean negative log-likelihood over the truncated prefix:

$$\mathcal{L}_{\text{ce}} = -\frac{1}{l_p} \sum_{t=1}^{l_p} \log p(y_t \mid x, y_{1:t-1})$$
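
This prefix-restricted cross-entropy is straightforward to compute; the sketch below (assumed shapes and names, not the released code) evaluates the mean negative log-likelihood over the first l_p target tokens only:

import torch.nn.functional as F

def prefix_ce_loss(logits, targets, l_p):
    """Mean negative log-likelihood over the first l_p target tokens.

    logits  : (batch, seq_len, vocab) next-token logits aligned with targets
    targets : (batch, seq_len) token ids of the reference response
    l_p     : prefix length over which the loss is computed
    """
    logits, targets = logits[:, :l_p], targets[:, :l_p]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))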

  3. Bias-Restraint via PID Control:

The average per-layer bias magnitude is

$$b(t) = \frac{1}{L} \sum_{j=1}^{L} \|\bm{b}^j(t)\|_2$$

where $L$ is the number of layers, and the instantaneous PID error is $e(t) = b_{\text{target}} - b(t)$. A PID controller computes:

$$\Delta w(t) = K_p e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d \frac{de(t)}{dt}$$

$$w(t+1) = \mathrm{clip}\big(w(t)\,[1 + \alpha \Delta w(t)],\ w_{\min},\ w_{\max}\big)$$

The total loss is then:

$$\mathcal{L}_{\text{total}} = w(t)\, \mathcal{L}_{\text{ce}}$$
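
In discrete training steps, the integral term becomes a running sum and the derivative a finite difference. The class below is a minimal sketch under those assumptions (names and the initial weight w_init are not from the paper):

class BiasPIDController:
    """Adjust the loss weight w(t) so the mean bias norm b(t) tracks b_target."""

    def __init__(self, b_target, k_p, k_i, k_d, alpha, w_min, w_max, w_init=1e-3):
        self.b_target, self.k_p, self.k_i, self.k_d = b_target, k_p, k_i, k_d
        self.alpha, self.w_min, self.w_max = alpha, w_min, w_max
        self.w, self.integral, self.prev_error = w_init, 0.0, 0.0

    def step(self, bias_norms):
        # b(t): mean L2 norm of the per-layer bias vectors at this training step
        b_t = sum(bias_norms) / len(bias_norms)
        error = self.b_target - b_t
        self.integral += error
        delta = (self.k_p * error
                 + self.k_i * self.integral
                 + self.k_d * (error - self.prev_error))
        self.prev_error = error
        # multiplicative update of the loss weight, clipped to [w_min, w_max]
        self.w = min(max(self.w * (1 + self.alpha * delta), self.w_min), self.w_max)
        return self.w  # multiply the prefix cross-entropy loss by this weight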

3. Algorithmic Procedures

BREP ReFT integrates three main algorithmic components: prefix-truncated training, two-stage inference, and magnitude constraint via PID control.

  • Prefix-Truncated Training: Each sample’s target response is truncated to its first $l_p$ tokens. The optimization occurs only over these tokens, directly shaping the model's initial reasoning behavior and sharpening prefix accuracy. The prefix-truncation training loop proceeds as follows:

for each batch {x, full response y[1..l_f]}:
    y_prefix = y[1..l_p]                                   # keep only the reasoning prefix
    compute p_t = model(x, y_prefix[1..t-1]) for t = 1..l_p
    L_ce = -(1/l_p) * sum_{t=1}^{l_p} log p_t(y_t)         # prefix cross-entropy
    update scaling W and bias b by minimizing L_total = w(t) * L_ce

  • Two-Stage Inference: During decoding, the ReFT transforms (scaling and bias) are applied only to the first $n \le l_p$ generated tokens. For subsequent positions, the base (unfinetuned) model representation is used, preventing error propagation through the CoT.

for t in 1..T:
    if t <= n:
        apply ReFT transforms          # scale and shift hidden states at each layer
    else:
        leave hidden states unchanged  # fall back to the frozen base model
    sample y_t ∼ p(· | x, y_{1:t-1})
    append y_t to output
return output sequence

  • Bias Magnitude Constraint: PID hyperparameters are initialized as $K_p = 10^{-1}$, $K_i = 10^{-4}$, $K_d = 10^{-2}$, $\alpha = 5$, $w_{\min} = 10^{-5}$, $w_{\max} = 10^{-1}$, and the target bias norm $b_{\text{target}}$ is set per model family.
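
With these values, the controller sketched in Section 2 would be instantiated roughly as follows; the b_target value is a placeholder, since the per-family targets are not listed here:

pid = BiasPIDController(
    b_target=0.1,        # placeholder: the per-model-family target is not listed above
    k_p=1e-1, k_i=1e-4, k_d=1e-2,
    alpha=5, w_min=1e-5, w_max=1e-1,
)
# at each training step: loss_weight = pid.step([b.norm().item() for b in bias_vectors])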

4. Experimental Setup and Evaluation

Models: Llama3-8B, Qwen2.5-7B, Qwen3-8B, and Qwen3-14B (pretrained weights frozen throughout).

Training and inference:

  • Typical prefix lengths ($l_p$, $n$): Llama3-8B (64, 8); Qwen2.5-7B (66, 10); Qwen3-8B (67, 11); Qwen3-14B (68, 12).
  • AdamW optimizer; learning rates $2\times10^{-4}$ (Llama) and $2\times10^{-5}$ (Qwen).
  • Computation: a single NVIDIA A100 80 GB for ~1 hour of training on 5K samples (a hypothetical configuration collecting the Llama choices is sketched below).
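
The snippet below gathers the Llama3-8B settings into one place purely for illustration; the dictionary keys and structure are assumptions, not the released training script:

llama3_8b_brep_config = dict(
    l_p=64,              # training prefix length (tokens)
    n=8,                 # number of generated tokens receiving the intervention
    optimizer="AdamW",
    learning_rate=2e-4,
    train_samples=5_000,
)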

Datasets:

  • Simple reasoning: MATH10K subsampled for GSM8K, SVAMP, MathQA
  • Complex reasoning: PRM800K (5K) for MATH500, AMC23

Benchmarks: GSM8K, SVAMP, MathQA, MATH500, AMC23

Baselines: Base (frozen), LoRA, RED (RepEdit), LoReFT

Metrics: Answer correctness (chain-of-thought verification)

Results summary:

Model       Method   GSM8K   SVAMP   MathQA   MATH500   AMC23
Llama3-8B   Base      80.0    88.9     55.0      40.4    57.5
            LoRA      81.1    90.0     54.0      39.3    53.8
            RED       73.8    88.9     51.3      41.5    56.4
            LoReFT    78.8    80.7     44.7      37.0    35.0
            BREP      82.8    89.5     54.3      42.8    52.5
Qwen3-8B    Base      95.1    96.7     86.5      82.0    85.0
            LoRA      95.1    96.8     86.2      81.8    87.5
            RED       87.9    91.8     77.3      54.2    35.0
            LoReFT    87.1    96.3     72.8      72.4    80.0
            BREP      95.3    97.4     86.3      82.6    87.5

BREP improves GSM8K accuracy by up to +2.8 points over base Llama3-8B, and shows consistent gains or parity across all benchmarks and model families.

Efficiency:

BREP introduces only $2d$ parameters per layer (scaling + bias), representing $\lesssim 0.01\%$ of total model parameters. Training is rapid (~1 hour for 5K examples), and inference time increases negligibly because the intervention is limited to the prefix.
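
As a rough illustrative check (assuming, e.g., hidden size $d = 4096$ and $L = 32$ layers, values typical of an 8B-parameter model but not stated above):

$$2dL = 2 \times 4096 \times 32 \approx 2.6\times10^{5}, \qquad \frac{2.6\times10^{5}}{8\times10^{9}} \approx 0.003\%$$

which is consistent with the $\lesssim 0.01\%$ figure.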

5. Analysis and Ablation

Ablation studies confirm the contributions of BREP’s distinct components:

Model    Variant                    GSM8K   MATH500
L3-8B    Full BREP                   82.8     42.8
         w/o Prefix Truncation       81.0     40.2
         w/o Bias Constraint         80.0     39.4
         w/o Early Intervention      80.4     37.6
Q3-8B    Full BREP                   95.3     82.0
         w/o Prefix Truncation       95.5     79.4
         w/o Bias Constraint         94.9     79.8
         w/o Early Intervention      95.1     81.6

Each component—prefix truncation, bias constraint, and early-stage intervention—substantially affects mathematical reasoning accuracy, especially on longer CoT benchmarks.

Probing internal representations demonstrates preservation or improvement of linear number encoding with BREP, in contrast to the degradation observed in unconstrained ReFT.

6. Implementation Guidelines

To adopt BREP ReFT:

  1. Freeze the pretrained LLM and insert per-layer learnable scaling ($\mathbf{W}^j$) and bias ($\bm{b}^j$) vectors.
  2. Prepare the dataset and truncate each example's target response to the first $l_p$ tokens suitable for the target model.
  3. Implement PID control to keep $\frac{1}{L} \sum_j \|\bm{b}^j\|_2$ near $b_{\text{target}}$ by updating the loss weight $w(t)$.
  4. Train on the ~5K-example dataset on a single high-memory GPU, optimizing the total loss as above.
  5. During inference, apply the ReFT transformations only for the first $n$ generated tokens, then revert to the unmodified base-model representations (a hook-based sketch follows this list).
  6. Decoding may use greedy or other preferred CoT decoding policies.
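
One possible realization of step 5 attaches forward hooks to the transformer blocks and applies the per-layer ReFT modules (such as the ReFTIntervention sketched in Section 2) only while fewer than n tokens have been generated; the wiring below is an assumption about how this could be done, not the released implementation:

def attach_prefix_reft(model_layers, interventions, n_prefix):
    """Apply ReFT transforms only while fewer than n_prefix tokens are generated.

    model_layers  : iterable of transformer blocks supporting register_forward_hook
    interventions : list of per-layer ReFT modules (elementwise scale + bias)
    n_prefix      : number of generated tokens that receive the intervention
    """
    state = {"generated": 0}  # the caller increments this after sampling each token

    def make_hook(reft):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if state["generated"] < n_prefix:
                hidden = reft(hidden)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return hook

    handles = [layer.register_forward_hook(make_hook(r))
               for layer, r in zip(model_layers, interventions)]
    return state, handles  # call handle.remove() on each handle to detach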

Reference code and scripts are publicly available at https://github.com/LiangThree/BREP.

7. Context, Significance, and Limitations

BREP ReFT establishes new accuracy and generalization standards for representation-efficient finetuning on mathematical reasoning tasks, mitigating the prefix misalignment and numerical encoding drift present in prior ReFT methods. By concentrating adaptation on the initial reasoning prefix and tightly regulating representational drift, BREP enables robust mathematical inference with minimal parameter overhead. For out-of-domain commonsense tasks (BoolQ, PIQA, GPQA), BREP maintains or improves generalization relative to baselines.

A plausible implication is that BREP’s separation of early-stage reasoning intervention from subsequent unperturbed token generation is applicable to other tasks with critical prefix dependencies. However, scaling to multi-turn dialogue or highly compositional mathematical settings may warrant further investigation, as may the optimal selection of prefix and intervention lengths for arbitrary architectures.

BREP ReFT provides a reproducible procedure and open-source codebase for investigation and deployment in high-stakes mathematical language modeling, offering a systematic framework to balance adaptation capacity and representational stability for rigorous downstream reasoning applications.
