BREP ReFT: Bias-Restrained Prefix FineTuning

Updated 17 November 2025

The paper introduces BREP ReFT, which optimizes per-layer scaling and bias vectors to enhance mathematical reasoning accuracy with minimal parameter overhead.
BREP ReFT employs prefix-truncated training and PID-controlled bias constraints to prevent degradation of numerical encoding and reasoning prefixes.
Empirical results demonstrate that BREP ReFT outperforms standard PEFT methods on benchmarks like GSM8K, achieving superior accuracy with orders-of-magnitude fewer parameters.

Bias-Restrained Prefix Representation FineTuning (BREP ReFT) is a representation finetuning technique for LLMs designed to address limitations of standard ReFT on mathematical reasoning tasks. BREP ReFT achieves high parameter efficiency by freezing all pretrained model weights and optimizing only per-layer scaling and bias vectors that act directly on hidden states. The method introduces prefix-truncated training, early-stage intervention, and a PID-constrained bias magnitude objective to prevent degradation of mathematical inference accuracy commonly observed in conventional PEFT (Parameter-Efficient Finetuning) and unrestricted ReFT approaches. Empirical evidence demonstrates that BREP ReFT attains superior accuracy and generalization on chain-of-thought mathematical reasoning benchmarks, matching or exceeding the performance of weight-based PEFT methods with orders-of-magnitude fewer learned parameters (Liang et al., 13 Nov 2025).

1. Background and Motivation

Parameter-Efficient Finetuning (PEFT) schemes, such as LoRA and Prefix-tuning, update only a fraction of the model weights, enabling task adaptation without full weight updates. Representation Finetuning (ReFT) extends PEFT by solely learning lightweight transformations—elementwise scaling and bias—on intermediate hidden states while freezing pretrained weights. ReFT is highly parameter-efficient and succeeds on commonsense and instruction-following tasks. However, on mathematical benchmarks such as GSM8K, ReFT shows a notable accuracy drop (∼11.5% lower than PEFT).

Two principal failure modes are identified:

Misleading Reasoning Prefixes: ReFT-finetuned models tend to generate poor initial chain-of-thought (CoT) tokens (the "reasoning prefix"), leading to erroneous subsequent inference.
Numerical Encoding Degradation: The learned per-layer bias vectors $\bm{b}$ introduce deviations in the model’s internal linear encoding of numbers, with the effect compounding across autoregressive token generation steps. Empirical projection of these biases onto the original number-encoding direction indicates frequent excursions beyond a critical threshold, correlating with elevated addition error rates.

These findings motivate BREP ReFT, which targets the initialization phase of mathematical reasoning and incorporates explicit constraints to control representational drift.

2. Mathematical Formulation

BREP ReFT formalism integrates a standard transformer architecture with modified ReFT and a prefix-focused, bias-constrained optimization schema.

Transformer with ReFT:

For input $\bm{x}=(x_1,\ldots,x_n)$ , let $\bm{h}^0$ be the initial embedding. At layer $j$ , the transformer computes:

$\bm{h}^j = \bm{h}^{j-1} + \mathbf{A}^j(\bm{h}^{j-1}) + \mathbf{F}^j(\bm{h}^{j-1})$

Standard ReFT then applies:

$\bm{h}^j_{\text{ReFT}} = \mathbf{W}^j \odot \left( \bm{h}^{j-1} + \mathbf{A}^j + \mathbf{F}^j \right) + \bm{b}^j, \qquad \mathbf{W}^j, \bm{b}^j \in \mathbb{R}^d$

where $\odot$ indicates elementwise multiplication.

Prefix-Focused Objective:

The training objective focuses on the initial $l_p$ tokens of each target response sequence $y_{1:T}$ . At token position $t$ , the per-token prefix reward is:

$R_t = \frac{1}{l_p} \log p(y_t | x, y_{1:t-1}) - \frac{1}{l_f} \log p(y_t | x, y_{1:t-1})$

where $l_f$ is the full sequence length. The cumulative prefix reward is maximized:

$R_{\text{cum}} = \sum_{t=1}^n R_t, \quad n\leq l_p$

Equivalently, the loss minimized is the mean negative log-likelihood over the truncated prefix:

$\mathcal{L}_{\text{ce}} = -\frac{1}{l_p} \sum_{t=1}^{l_p} \log p(y_t | x, y_{1:t-1})$

Bias-Restraint via PID Control:

The average per-layer bias magnitude is

$b(t) = \frac{1}{L} \sum_{j=1}^{L} \|\bm{b}^j(t)\|_2$

and the instantaneous PID error is $e(t) = b_{\text{target}} - b(t)$ . A PID controller computes:

$\Delta w(t) = K_p e(t) + K_i \int_0^t e(\tau)d\tau + K_d \frac{de(t)}{dt}$

$w(t+1) = \mathrm{clip}(w(t)[1 + \alpha \Delta w(t)], w_{\min}, w_{\max})$

The total loss is then:

$\mathcal{L}_{\text{total}} = w(t) \mathcal{L}_{\text{ce}}$

3. Algorithmic Procedures

BREP ReFT integrates three main algorithmic components: prefix-truncated training, two-stage inference, and magnitude constraint via PID control.

Prefix-Truncated Training: Each sample’s target response is truncated to its first $l_p$ tokens. The optimization occurs only over these tokens, directly shaping the model's initial reasoning behavior and sharpening prefix accuracy. The prefix-truncation training loop proceeds as follows:

for each batch {x, full_response y[1..l_f]}:
    y_prefix = y[1..l_p]
    compute p_t = model(x, y_prefix[1:t-1]) for t=1..l_p
    L_ce = -(1/l_p) * sum_{t=1}^{l_p} log p_t(y_t)
    update bias W, b via ∇ L_total = w(t) ∇L_ce

Two-Stage Inference: During decoding, ReFT transforms (scaling and bias) are applied only to the first $n \le l_p$ generated tokens. For subsequent positions, the base (unfinetuned) model representation is used, preventing error propagation through the CoT.

for t in 1..T:
    if t <= n:
        apply ReFT transforms
    else:
        leave hidden states unchanged
    sample y_t ∼ p(· | x, y_{1:t-1})
    append y_t to output
return output sequence

Bias Magnitude Constraint: PID hyperparameters are initialized as $K_p = 10^{-1}, K_i = 10^{-4}, K_d = 10^{-2}, \alpha = 5, w_{\min} = 10^{-5}, w_{\max} = 10^{-1}$ , and the target bias norm $b_{\text{target}}$ is set per model family.

4. Experimental Setup and Evaluation

Models:

Llama3-8B-Instruct, Llama3.1-8B-Instruct
Qwen2.5-Math-7B-Instruct, Qwen3-8B, Qwen3-14B

Training and inference:

Typical prefix lengths: Llama ( $l_p=64$ , $n=8$ ); Qwen2.5-7B (66,10); Qwen3-8B (67,11); Qwen3-14B (68,12).
AdamW optimizer; learning rates $2\times10^{-4}$ (Llama) and $2\times10^{-5}$ (Qwen).
Computation: Single NVIDIA A100 80 GB for $\sim$ 1 hour of training on 5K samples.

Datasets:

Simple reasoning: MATH10K subsampled for GSM8K, SVAMP, MathQA
Complex reasoning: PRM800K (5K) for MATH500, AMC23

Benchmarks: GSM8K, SVAMP, MathQA, MATH500, AMC23 Baselines: Base (frozen), LoRA, RED (RepEdit), LoReFT Metrics: Answer correctness (chain-of-thought verification)

Results summary:

Model	Method	GSM8K	SVAMP	MathQA	MATH500	AMC23
Llama3-8B	Base	80.0	88.9	55.0	40.4	57.5
	LoRA	81.1	90.0	54.0	39.3	53.8
	RED	73.8	88.9	51.3	41.5	56.4
	LoReFT	78.8	80.7	44.7	37.0	35.0
	BREP	82.8	89.5	54.3	42.8	52.5
Qwen3-8B	Base	95.1	96.7	86.5	82.0	85.0
	LoRA	95.1	96.8	86.2	81.8	87.5
	RED	87.9	91.8	77.3	54.2	35.0
	LoReFT	87.1	96.3	72.8	72.4	80.0
	BREP	95.3	97.4	86.3	82.6	87.5

BREP improves GSM8K accuracy by up to +2.8 points over base Llama3-8B, and shows consistent gains or parity across all benchmarks and model families.

Efficiency:

BREP introduces only $2d$ parameters per layer (scaling + bias), representing $\lesssim 0.01\%$ of total model parameters. Training is rapid ( $\sim$ 1 hour for 5K examples), and inference time increases negligibly due to intervention being limited to the prefix.

5. Analysis and Ablation

Ablation studies confirm the contributions of BREP’s distinct components:

Model	Variant	GSM8K	MATH500
L3-8B	Full BREP	82.8	42.8
	w/o Prefix Truncation	81.0	40.2
	w/o Bias Constraint	80.0	39.4
	w/o Early Intervention	80.4	37.6
Q3-8B	Full BREP	95.3	82.0
	w/o Prefix Truncation	95.5	79.4
	w/o Bias Constraint	94.9	79.8
	w/o Early Intervention	95.1	81.6

Each component—prefix truncation, bias constraint, and early-stage intervention—substantially affects mathematical reasoning accuracy, especially on longer CoT benchmarks.

Probing internal representations demonstrates preservation or improvement of linear number encoding with BREP, in contrast to the degradation observed in unconstrained ReFT.

6. Implementation Guidelines

To adopt BREP ReFT:

Freeze the pretrained LLM and insert per-layer learnable scaling ( $\mathbf{W}^j$ ) and bias ( $\bm{b}^j$ ) vectors.
Prepare dataset and truncate each example's target response to the first $l_p$ tokens suitable for the target model.
Implement PID control to maintain $\frac{1}{L} \sum_j \|\bm{b}^j\|_2$ near $b_{\text{target}}$ via updating loss weight $w(t)$ .
Train for $5$K steps on a single high-memory GPU, optimize total loss as above.
During inference, apply the ReFT transformations only for the first $n$ tokens, then revert to base model.
Decoding may use greedy or preferred CoT policies.

Reference code and scripts are publicly available at https://github.com/LiangThree/BREP.

7. Context, Significance, and Limitations

BREP ReFT establishes new accuracy and generalization standards for representation-efficient finetuning on mathematical reasoning tasks, mitigating the prefix misalignment and numerical encoding drift present in prior ReFT methods. By concentrating adaptation on the initial reasoning prefix and tightly regulating representational drift, BREP enables robust mathematical inference with minimal parameter overhead. For out-of-domain commonsense tasks (BoolQ, PIQA, GPQA), BREP maintains or improves generalization relative to baselines.

A plausible implication is that BREP’s separation of early-stage reasoning intervention from subsequent unperturbed token generation is applicable to other tasks with critical prefix dependencies. However, potential scaling to multi-turn dialogue or highly compositional mathematical contexts may warrant further investigation, as may optimal selection of prefix and intervention lengths for arbitrary architectures.

BREP ReFT provides a reproducible procedure and open-source codebase for investigation and deployment in high-stakes mathematical language modeling, offering a systematic framework to balance adaptation capacity and representational stability for rigorous downstream reasoning applications.

PDF Markdown Chat (Pro)

References (1)

Bias-Restrained Prefix Representation Finetuning for Mathematical Reasoning (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to Bias-REstrained Prefix Representation FineTuning (BREP ReFT).