
Bias-Efficient Fine-Tuning (BEFT) for LLMs

Updated 23 September 2025
  • Bias-Efficient Fine-Tuning (BEFT) is a parameter-efficient method that selectively updates bias terms in LLMs, focusing on a single projection bias (query, key, or value) for optimal performance.
  • BEFT introduces a dynamic importance score combining magnitude and directional changes to choose the most impactful bias, typically favoring the value bias for improved outcomes.
  • Empirical evaluations across tasks such as SST-2 demonstrate that BEFT achieves higher accuracy and faster adaptation than traditional methods while updating only about 0.01% of parameters.

Bias-Efficient Fine-Tuning (BEFT) is a parameter-efficient methodology for adapting LLMs that focuses on updating a strategically selected subset of bias terms rather than all parameters. The motivation stems from observations that while bias-only fine-tuning (e.g., BitFit) yields surprising effectiveness and efficiency in model adaptation, not all bias terms contribute equally to downstream task performance. The BEFT framework formalizes bias selection, introducing mechanisms to identify which bias projection—among query, key, or value biases—should be fine-tuned for optimal performance and efficiency.

1. Conceptual Foundation of Bias-Efficient Fine-Tuning

Bias-Efficient Fine-Tuning (BEFT) builds upon the principle that large neural models, especially Transformers, have expressive pre-trained representations and that much of the downstream adaptation can be achieved by adjusting only a small subset of parameters. Traditionally, BitFit and similar approaches update all bias vectors across the network’s linear layers; BEFT proposes a more selective approach. Specifically, BEFT focuses on tuning only one out of the three main projection biases in the self-attention mechanism—namely, the query (b_q), key (b_k), or value (b_v) bias vectors—across the Transformer layers.

Parameter efficiency is central: only on the order of 0.01% of all model parameters are updated, significantly fewer than with full fine-tuning or even full bias-only adaptation. This selectivity yields not only computational savings but also potentially improved generalization, due to a reduced risk of overfitting and catastrophic forgetting.
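As a concrete illustration, the following PyTorch sketch freezes every parameter except one selected projection bias (here b_v) across all attention layers. The parameter names follow the HuggingFace BERT naming convention, and keeping the classification head trainable is an assumption for illustration; this is not the authors' released implementation.

```python
import torch
from transformers import BertForSequenceClassification

# Minimal sketch: freeze everything, then unfreeze only the selected
# projection bias (here the value bias, b_v) in every attention layer.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

selected_bias = "attention.self.value.bias"  # or "...query.bias" / "...key.bias"

for name, param in model.named_parameters():
    # Keep the task head trainable (assumption); otherwise train only the chosen bias.
    param.requires_grad = name.endswith(selected_bias) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.5%}")
```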

2. Methodological Advances and Bias Term Selection

A core contribution of BEFT is an algorithmic method for identifying the most impactful bias term for fine-tuning. Prior approaches—such as selecting the bias with the largest magnitude change (L₁-norm difference) or the highest pre-tuning empirical Fisher information—are found to be insufficient, as they provide only static or unspecific rankings.

BEFT introduces an “importance score” mechanism that quantifies the effect of fine-tuning on each bias type by jointly considering both magnitude and directional changes (i.e., the angle between pre- and post-tuning bias vectors). For each bias type T (with T ∈ {q, k, v}), and for each layer l:

  • If the post-tuning bias norm is smaller than the pre-tuning norm:

I(b_T^{(l)}) = 1 - \frac{b_T^{(l, pre)} \cdot b_T^{(l, post)}}{||b_T^{(l, pre)}||^2}

  • Otherwise:

I(b_T^{(l)}) = 1 - \frac{b_T^{(l, pre)} \cdot b_T^{(l, post)}}{||b_T^{(l, post)}||^2}

The overall importance for each bias type is averaged across layers:

I(b_T) = \frac{1}{L} \sum_l I(b_T^{(l)})

The bias type with the highest I(b_T) is selected:

T_{target} = \arg\max_{T \in \{q, k, v\}} I(b_T)

This dynamically adapts the selection to the task and data regime; empirically, b_v (the value projection bias) frequently offers the highest downstream performance benefits.
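A minimal sketch of this selection procedure, assuming the pre- and post-tuning bias vectors have already been collected into per-layer lists keyed by bias type (the container format and function names are illustrative):

```python
import torch

def layer_importance(b_pre: torch.Tensor, b_post: torch.Tensor) -> float:
    """Per-layer importance I(b_T^(l)): the dot product of the pre- and
    post-tuning bias is normalized by the squared norm of the larger
    vector, following the case split above."""
    dot = torch.dot(b_pre, b_post)
    if b_post.norm() < b_pre.norm():
        denom = b_pre.norm() ** 2
    else:
        denom = b_post.norm() ** 2
    return float(1.0 - dot / denom)

def select_bias_type(pre_biases: dict, post_biases: dict) -> str:
    """pre_biases / post_biases map each bias type ('q', 'k', 'v') to a list
    of per-layer bias vectors (hypothetical container format)."""
    scores = {}
    for t in ("q", "k", "v"):
        per_layer = [layer_importance(pre, post)
                     for pre, post in zip(pre_biases[t], post_biases[t])]
        scores[t] = sum(per_layer) / len(per_layer)  # average I(b_T^(l)) over layers
    return max(scores, key=scores.get)  # argmax over bias types
```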

3. Experimental Evaluation and Comparative Results

BEFT was evaluated across a comprehensive suite of experiments covering classification (e.g., SST-2, RTE), multiple-choice (e.g., COPA, ReCoRD), and generation tasks (e.g., SQuAD, DROP), using encoder-only BERT and decoder-only OPT architectures ranging from 110M to 6.7B parameters.

Key findings include:

  • Accuracy Improvements: On GLUE tasks with BERT_BASE, fine-tuning the BEFT-selected bias (b_v) achieves higher accuracy (e.g., 85.8% on SST-2) compared to b_q (80.0%).
  • Parameter Efficiency: Only about 0.01% of parameters are updated, compared to 0.09% for tuning all biases or 100% for full fine-tuning (see the worked estimate after this list).
  • Speed: BEFT achieves adaptation in substantially reduced wall-clock time (e.g., 132.9s vs. 206.1s for full fine-tuning).
  • Model Generality: The method generalizes across model types and scales, remaining competitive with PEFT baselines such as LoRA and prefix-tuning even at scales up to 6.7B parameters.
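The ~0.01% figure can be sanity-checked with a back-of-the-envelope count for BERT_BASE; the layer count and hidden size below are the standard BERT_BASE values, and the total parameter count is approximate:

```python
# Back-of-the-envelope check of the ~0.01% figure for BERT_BASE.
num_layers, hidden_size = 12, 768      # standard BERT_BASE configuration
total_params = 110_000_000             # approximate BERT_BASE parameter count

value_bias_params = num_layers * hidden_size  # one b_v vector per layer -> 9,216
print(f"b_v parameters: {value_bias_params}")
print(f"fraction of model: {value_bias_params / total_params:.4%}")  # ~0.0084%
```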

The comparative evaluation also showed that BEFT’s bias selection method better tracks the actual downstream performance trends across low, medium, and high data regimes than magnitude- or Fisher-based methods, both of which can yield suboptimal or inconsistent bias selections.

4. Practical Applications and Downstream Adaptation

BEFT is designed for use cases requiring rapid, resource-efficient, and effective model adaptation. The following practical properties are highlighted:

  • Adaptation with Limited Data: BEFT is especially beneficial in low-data scenarios, as bias-only approaches tend to unlock pre-trained features without requiring massive updates or risking overfitting.
  • Multi-Task and Transfer Learning: Task-specific bias vectors (particularly b_v) exhibit properties that allow merge operations (e.g., arithmetic averaging) to further boost performance in multi-task transfer setups, as sketched after this list.
  • Generalization and Robustness: The minimal update footprint enforces a regularization effect that preserves most pre-trained capabilities while efficiently encoding task-specific information.
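A minimal sketch of the averaging-based merge mentioned above, assuming each task's fine-tuned b_v vectors are stored as one tensor per layer (the storage layout is hypothetical):

```python
import torch

def merge_value_biases(task_biases: list[list[torch.Tensor]]) -> list[torch.Tensor]:
    """Arithmetic averaging of task-specific value biases.
    task_biases[t][l] holds the fine-tuned b_v of layer l for task t
    (hypothetical storage layout)."""
    num_layers = len(task_biases[0])
    merged = []
    for layer in range(num_layers):
        stacked = torch.stack([biases[layer] for biases in task_biases], dim=0)
        merged.append(stacked.mean(dim=0))  # element-wise average across tasks
    return merged
```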

5. Comparison to Other Bias-Selection Approaches

The paper demonstrates that magnitude-based and empirical Fisher-based methods for bias selection provide only static or insufficiently differentiated prioritization. In contrast, BEFT's joint angle-and-magnitude importance scoring is dynamic and tracks downstream performance, as summarized below:

Approach | Selection Mechanism | Adaptivity | Alignment With Performance
Magnitude | L₁-norm of bias update | Static | Poor
Empirical Fisher | Pre-tuning gradient squares | Static | Task-agnostic
BEFT (projection-based) | Angle and magnitude change | Dynamic | Strong

A plausible implication is that the dynamic, direction-aware nature of BEFT’s importance score captures aspects of the loss landscape and representation adaptation that are not tracked by prior metrics.

6. Implications, Limitations, and Future Directions

BEFT's precise targeting of a single, empirically justified bias term enables effective downstream adaptation with drastic reductions in parameter, computational, and energy costs. This positions it for widespread use in resource-constrained settings, rapid deployment, and scenarios that require judicious parameter updating (e.g., privacy, continual learning).

Outstanding limitations include the need for deeper theoretical understanding of how the importance metric correlates with the adaptation capacity of each bias term. The authors note that the correlation between importance scores and generalization is not yet fully formalized. Further research is suggested to explore fine-grained dynamics of bias adaptation and to extend selective updating mechanisms to other parameter types or subspaces.

7. Summary

Bias-Efficient Fine-Tuning (BEFT) operationalizes the insight that selective, dynamically chosen bias-only adaptation in LLMs is not only sufficient for strong downstream performance but also optimal for parameter and resource efficiency. The core methodological contribution—a joint angular and magnitude importance scoring system for bias selection—empirically outperforms prior static heuristics and is validated across diverse architectures and tasks. BEFT provides a foundation for task-adaptive, resource-aware fine-tuning, with broad applicability across modern language modeling pipelines and potential for further methodological refinement (Huang et al., 19 Sep 2025).
