BitFit/BEFT: Bias-Only Fine-Tuning
- Bias-Only Fine-Tuning (BitFit/BEFT) is a strategy that updates only the bias vectors in network layers, enabling rapid adaptation with minimal parameter updates.
- It fine-tunes a very small fraction (~0.05–0.2%) of the parameters while freezing large weight matrices, drastically reducing memory, storage, and computation requirements.
- Empirical results across NLP, vision, medical imaging, and time series tasks show that BitFit/BEFT often matches full fine-tuning performance with a fraction of parameters.
Bias-Only Fine-Tuning (BitFit/BEFT) is a family of parameter-efficient adaptation strategies for large pretrained models, wherein only the additive bias vectors in network layers are updated during downstream training. All other parameters—including the potentially vast weight matrices in self-attention, MLP, or convolutional modules—are frozen. This selective fine-tuning regime enables rapid, memory- and storage-efficient transfer to new tasks while retaining the core modeling capacity of the originating foundation model. Recent research further refines these methods by recognizing that only specific subsets of bias terms may drive adaptation performance, leading to bias-efficient fine-tuning (BEFT). These approaches have demonstrated competitive results across natural language processing, vision, medical imaging, and time-series domains (Zaken et al., 2021, Gupta et al., 2024, Huang et al., 19 Sep 2025, Bu et al., 2022, Doering et al., 2024, Ruffini et al., 23 Jun 2025).
1. Theoretical and Algorithmic Foundations
BitFit, initially introduced for transformer-based masked language models, partitions the model parameters as $\theta = (W, b)$, where $W$ denotes all weight matrices and $b$ the bias vectors. During fine-tuning, $W$ is strictly frozen while $b$ is updated. The downstream training problem is then
$$\min_{b,\,\phi} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i;\, W, b, \phi),\, y_i\big),$$
where $\{(x_i, y_i)\}_{i=1}^{N}$ is the downstream dataset, $\mathcal{L}$ is the task loss, and $\phi$ is a typically small task-specific head. In BEFT, a further selection is made: only a single family of bias vectors (such as all $b_v$ in the value projections) is updated, minimizing the same objective but over a much smaller subspace (Huang et al., 19 Sep 2025).
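For each linear layer $\ell$, the forward computation is modified from $h_\ell = W_\ell x$ to $h_\ell = W_\ell x + b_\ell$, with $b_\ell$ initialized to zero and treated as the optimization variable. The gradient update is
$$b_\ell \leftarrow b_\ell - \eta\, \nabla_{b_\ell} \mathcal{L},$$
where $\eta$ is the learning rate and $\mathcal{L}$ is the downstream loss, with all other parameters held fixed.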
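The bias-only update can be illustrated on a toy problem. The following is a minimal sketch in plain Python, not the reference implementation: the single linear layer, the squared-error loss, the data, and all function names (`forward`, `bias_grad`, `bitfit_step`) are assumptions chosen for illustration. The weights are stored as immutable tuples to make the freezing explicit.

```python
# Toy bias-only fine-tuning: y = W x + b with a squared-error loss.
# W is frozen; only b receives gradient updates.

def forward(W, b, x):
    """Compute W x + b for a single input vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def loss(W, b, data):
    """Mean over examples of the summed squared error."""
    total = 0.0
    for x, target in data:
        y = forward(W, b, x)
        total += sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, target))
    return total / len(data)

def bias_grad(W, b, data):
    """Gradient of the loss with respect to b only; no gradient for W is formed."""
    grad = [0.0] * len(b)
    for x, target in data:
        y = forward(W, b, x)
        for i, (y_i, t_i) in enumerate(zip(y, target)):
            grad[i] += 2.0 * (y_i - t_i) / len(data)
    return grad

def bitfit_step(W, b, data, lr):
    """One update b <- b - lr * grad_b; W is never modified."""
    grad = bias_grad(W, b, data)
    return [b_i - lr * g_i for b_i, g_i in zip(b, grad)]

W = ((1.0, 0.0), (0.0, 1.0))   # frozen "pretrained" weights (illustrative values)
b = [0.0, 0.0]                 # bias initialized to zero
# Targets are the inputs shifted by (0.5, -1.0), so a pure bias shift solves the task.
data = [((1.0, 2.0), (1.5, 1.0)), ((3.0, -1.0), (3.5, -2.0))]

initial_loss = loss(W, b, data)
for _ in range(50):
    b = bitfit_step(W, b, data, lr=0.1)
final_loss = loss(W, b, data)
print(b, initial_loss, final_loss)
```

In this constructed case the task differs from the frozen mapping only by a constant shift, which is the situation in which a bias-only update is sufficient; it says nothing about tasks that require changes to $W$.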
2. Implementation Details and Variants
BitFit applies to all bias parameters in attention, feed-forward, and normalization layers: query/key/value/output projections, MLPs, and LayerNorms in transformers, and convolutional, linear, and normalization biases in CNNs (Li et al., 19 Mar 2025, Ruffini et al., 23 Jun 2025). Initial values are typically inherited from pretraining. Adam or SGD optimizers are common, with learning rates for bias-only steps set 5–10× higher than typical full fine-tuning rates, and lower for high-data regimes or larger models (Zaken et al., 2021, Bu et al., 2022).
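In practice the trainable set is usually chosen by parameter name. The sketch below shows one way this could look, assuming a hypothetical name-to-size inventory; the parameter names, the `classifier.` head prefix, and the `select_trainable` helper are illustrative and do not correspond to any particular library.

```python
# Hypothetical parameter inventory: name -> element count.
# The names imitate a common transformer naming scheme and are illustrative only.
HIDDEN, FFN, NUM_LABELS = 768, 3072, 2
params = {
    "encoder.layer.0.attention.query.weight": HIDDEN * HIDDEN,
    "encoder.layer.0.attention.query.bias": HIDDEN,
    "encoder.layer.0.attention.value.weight": HIDDEN * HIDDEN,
    "encoder.layer.0.attention.value.bias": HIDDEN,
    "encoder.layer.0.ffn.dense.weight": HIDDEN * FFN,
    "encoder.layer.0.ffn.dense.bias": FFN,
    "encoder.layer.0.layernorm.weight": HIDDEN,
    "encoder.layer.0.layernorm.bias": HIDDEN,
    "classifier.weight": HIDDEN * NUM_LABELS,
    "classifier.bias": NUM_LABELS,
}

def select_trainable(param_sizes, head_prefix="classifier."):
    """Return the names updated under bias-only tuning: every bias plus the task head."""
    return sorted(
        name for name in param_sizes
        if name.endswith(".bias") or name.startswith(head_prefix)
    )

trainable = select_trainable(params)
frozen = sorted(set(params) - set(trainable))
trainable_fraction = sum(params[n] for n in trainable) / sum(params.values())

print("trainable:", trainable)
print("frozen:", frozen)
print(f"trainable fraction: {trainable_fraction:.4%}")
```

The same name filter would typically decide which parameters are handed to the optimizer; everything in `frozen` is excluded from gradient computation.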
In differentially private settings (DP-BiTFiT), only the gradients with respect to $b$ are clipped and perturbed using DP-SGD; this eliminates the overhead of per-sample activation storage and weight gradient computation, greatly reducing time and memory complexity (Bu et al., 2022).
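The clip-and-perturb step restricted to bias gradients can be sketched as follows. This is a simplified illustration of the generic DP-SGD recipe (per-sample clipping, Gaussian noise, averaging), not the DP-BiTFiT code: the function names, the clipping norm, and the noise multiplier are assumptions, and privacy accounting is omitted entirely. For a layer $y = Wx + b$, the per-sample gradient with respect to $b$ equals the gradient with respect to the layer output (summed over positions for sequence inputs), so the input activations are not needed to form it.

```python
import math
import random

def clip(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def private_bias_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample bias gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        for i, v in enumerate(clip(grad, max_norm)):
            total[i] += v
    sigma = noise_multiplier * max_norm   # noise scale tied to the clipping norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_sample_grads) for n in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]   # illustrative per-sample bias gradients

# With the noise switched off, only the clipping is visible.
clipped_only = private_bias_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
# With noise, the result is a randomized estimate of the same quantity.
noisy = private_bias_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
print(clipped_only, noisy)
```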
BEFT introduces a bias selection mechanism based on a "projection-ratio" importance score, computed after a brief "pilot" epoch of all-bias fine-tuning. This score quantifies the geometric divergence between pre- and post-pilot bias vectors. Only the single bias type with the greatest change (typically value, $b_v$) is then retained for full fine-tuning, resulting in further parameter and runtime reductions (Huang et al., 19 Sep 2025).
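The selection step can be sketched as below. The exact projection-ratio formula is not given here, so the score in this sketch is a stand-in and should not be read as BEFT's definition: it measures the share of the post-pilot bias vector's norm that lies orthogonal to the pre-pilot vector. The bias values and the names `divergence_score` and `select_bias_family` are likewise hypothetical.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def divergence_score(pre, post):
    """Stand-in score (an assumption, not BEFT's formula): fraction of the
    post-pilot norm that is orthogonal to the pre-pilot vector."""
    pre_sq, post_sq = _dot(pre, pre), _dot(post, post)
    if post_sq == 0.0:
        return 0.0   # nothing to score: the vector is zero after the pilot
    if pre_sq == 0.0:
        return 1.0   # started at zero: the whole post-pilot vector is new
    coeff = _dot(post, pre) / pre_sq
    residual = [q - coeff * p for p, q in zip(pre, post)]
    return math.sqrt(_dot(residual, residual) / post_sq)

def select_bias_family(pre_biases, post_biases):
    """Score every bias family and return the highest-scoring one with all scores."""
    scores = {name: divergence_score(pre_biases[name], post_biases[name])
              for name in pre_biases}
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative flattened bias vectors before and after a pilot epoch.
pre_pilot = {"query": [1.0, 0.0], "key": [0.0, 1.0], "value": [1.0, 1.0]}
post_pilot = {"query": [1.1, 0.1], "key": [0.05, 0.9], "value": [1.5, 0.5]}

best, scores = select_bias_family(pre_pilot, post_pilot)
print(best, scores)
```

After selection, the procedure described above would reset the model and fine-tune only the chosen family.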
3. Empirical Performance and Comparative Analysis
NLP and Foundation Models
BitFit (or BEFT) consistently matches or closely approaches full fine-tuning (which updates all parameters) and other PEFT methods (adapters, LoRA, prefix-tuning) on sentence-level and token-level tasks for both small and medium data regimes (Zaken et al., 2021, Doering et al., 2024). On GLUE, BERT-Large achieves 84.2 (BitFit, 0.08% params) vs 84.8 (full FT, 100%). BitFit may outperform full FT for small datasets (e.g., RTE 73.2 vs 71.9) and shows comparable or improved generalization gaps.
In low-data settings, BEFT further outperforms all-bias BitFit and LoRA/Prefix, with <0.01% parameter updates. For BERT-Base on GLUE (low/medium/high data), updating only the value bias $b_v$ yields 64.5/70.8/71.7 accuracy, exceeding other bias types and matching or surpassing full BitFit (Huang et al., 19 Sep 2025).
Vision and Medical Imaging
In ViT-Base (86M parameters), BitFit adapts just ~0.13% of parameters (Li et al., 19 Mar 2025). Robustness-accuracy trade-offs reveal BitFit excels on simple vision tasks (CIFAR-10/100; up to 81.5% above average Pareto AUC) but becomes suboptimal for fine-grained problems (Caltech-256, CUB-200) where more informative or mid-level adaptations are required.
In chest X-ray prognosis prediction, BitFit (0.01% params) achieves moderate MCCs in balanced full-data regimes but is outperformed by LoRA and related methods in highly imbalanced or few-shot settings. BitFit is preferable when training and memory budgets are extremely limited and data are not severely scarce or imbalanced (Ruffini et al., 23 Jun 2025).
Time Series Foundation Models
In Chronos TSFM (8.3M–201M parameters), BitFit tunes only 200–700 parameters but delivers MeanBP forecasting MSEs that closely match or even surpass LoRA and full fine-tuning in large models (MSE=19.77 for BitFit vs 20.12 LoRA, 20.8 FT in Chronos Base) (Gupta et al., 2024).
| Model | Params Tuned | MeanBP MSE (Tiny/Base) |
|---|---|---|
| Full FT | 8.3M / 201M | 19.90 / 20.80 |
| LoRA | 0.049M / 0.442M | 19.79 / 20.12 |
| BitFit | 0.0002M / 0.0007M | 20.68 / 19.77 |
| FourierFT | 0.0024M / 0.007M | 19.51 / 20.98 |
4. Parameter Efficiency and Computational Benefits
The parameter savings of BitFit are substantial. Only ~0.05–0.2% of model parameters (sometimes an order of magnitude fewer for BEFT, down to 0.01‰) are updated per task (Bu et al., 2022, Huang et al., 19 Sep 2025). This drastically reduces storage requirements for multi-task or streaming adaptation setups. BitFit also accelerates wall-clock fine-tuning and, in DP settings (DP-BiTFiT), reduces memory and compute by 2–30× compared to full DP fine-tuning (Bu et al., 2022).
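A back-of-the-envelope count shows where these fractions come from. The sketch below assumes BERT-Large-like dimensions (hidden size 1024, feed-forward size 4096, 24 layers, vocabulary 30522) and deliberately ignores position and segment embeddings, the pooler, and the task head, so the figures are approximate. The function name and the float32 storage assumption are illustrative.

```python
def encoder_param_counts(hidden, ffn, layers, vocab):
    """Approximate parameter counts for a BERT-style encoder stack."""
    # Per layer: Q, K, V, O projections + two feed-forward matrices + two LayerNorm gains.
    weights_per_layer = 4 * hidden * hidden + 2 * hidden * ffn + 2 * hidden
    # Per layer: Q, K, V, O biases + two feed-forward biases + two LayerNorm biases.
    biases_per_layer = 4 * hidden + (ffn + hidden) + 2 * hidden
    total_weights = layers * weights_per_layer + vocab * hidden  # plus token embeddings
    all_biases = layers * biases_per_layer
    value_biases = layers * hidden   # the value-projection bias family only
    return total_weights, all_biases, value_biases

weights, all_biases, value_biases = encoder_param_counts(
    hidden=1024, ffn=4096, layers=24, vocab=30522
)
total = weights + all_biases

bitfit_fraction = all_biases / total        # about 0.08% for these dimensions
value_only_fraction = value_biases / total  # below 0.01% for these dimensions

BYTES_PER_PARAM = 4  # assumes float32 storage
print(f"all biases:  {all_biases:,} params ({bitfit_fraction:.4%}), "
      f"{all_biases * BYTES_PER_PARAM / 1e6:.2f} MB per task")
print(f"value only:  {value_biases:,} params ({value_only_fraction:.4%}), "
      f"{value_biases * BYTES_PER_PARAM / 1e6:.2f} MB per task")
print(f"full model:  {total:,} params, {total * BYTES_PER_PARAM / 1e9:.2f} GB")
```

Under these assumptions a per-task checkpoint holds about a megabyte of biases rather than more than a gigabyte of full weights, which is the source of the storage savings in multi-task setups.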
BitFit/BEFT training is inherently stateless for weights; pre-trained matrices can be deployed in hardware or immutably shared, while only per-task biases and heads must be managed. This property is especially attractive in resource-constrained environments, federated learning, and for privacy-sensitive deployment.
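The deployment pattern described above, one shared frozen backbone plus small per-task bias sets, might be organized as in the following sketch. The `SharedBackbone` class, its methods, and the task names are hypothetical, and a single linear layer stands in for a full model.

```python
class SharedBackbone:
    """Holds one frozen weight matrix and swaps in per-task bias vectors."""

    def __init__(self, weight):
        # Immutable copy: the pretrained weights are shared and never modified.
        self._weight = tuple(tuple(row) for row in weight)
        self._task_biases = {}

    def register_task(self, task, bias):
        """Store the small per-task state (here just one bias vector)."""
        if len(bias) != len(self._weight):
            raise ValueError("bias length must match the number of output units")
        self._task_biases[task] = list(bias)

    def forward(self, task, x):
        """Apply the shared weights, then the bias belonging to the requested task."""
        bias = self._task_biases[task]
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self._weight, bias)]

backbone = SharedBackbone([[1.0, 2.0], [3.0, 4.0]])
backbone.register_task("sentiment", [0.0, 0.0])    # illustrative task names
backbone.register_task("topic", [10.0, -10.0])

x = [1.0, 1.0]
out_sentiment = backbone.forward("sentiment", x)
out_topic = backbone.forward("topic", x)
print(out_sentiment, out_topic)
```

Adding a task only adds an entry to the bias table; the weight matrix is stored once regardless of how many tasks are served.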
5. Robustness, Generalization, and Analysis
BitFit preserves much of the upstream model's robustness on simple or highly overlapping downstream tasks. On CIFAR-10/100, its adversarial and OOD robustness closely tracks accuracy and remains superior to other PEFT methods for base-level or low-class-count tasks (Li et al., 19 Mar 2025). For more complex/fine-grained tasks, or with severe distributional shifts, BitFit's expressivity is restrictive and robustness-accuracy trade-offs become unfavorable compared to adapters or mid-layer updates.
Generalization gap analyses reveal that restricting adaptation to biases acts as implicit regularization. This yields lower overfitting in small- or medium-data regimes, but full fine-tuning can realize higher peak accuracy when abundant labeled data is available (Zaken et al., 2021, Doering et al., 2024).
The surprising effectiveness of bias-only adaptation supports the hypothesis that pretraining induces representations where only threshold and shift parameters need to be task-adapted, exposing rather than (re-)learning new knowledge (Zaken et al., 2021). BEFT's geometric projection-ratio scoring further clarifies which biases, such as value projections, most support rapid transfer, especially under low data (Huang et al., 19 Sep 2025).
6. Best Practices, Limitations, and Extensions
Usage Recommendations
- Start with BitFit (or BEFT) when data, compute, or memory budgets are tight, the domain shift is moderate, or rapid/streaming task adaptation is necessary.
- Select a bias learning rate 5–10× higher than standard FT; adapt epochs/steps as convergence is typically faster.
- In differentially private applications, BitFit's gradient isolation for biases minimizes DP costs (Bu et al., 2022).
- Use BEFT's pilot procedure to select the most effective bias subset, especially for large models and low-data tasks.
Limitations and Open Questions
- For tasks requiring deep feature modulation (fine-grained vision, extreme distributional shift, few-shot settings), BitFit can underperform richer PEFT approaches (LoRA, adapters, Compacter) (Ruffini et al., 23 Jun 2025, Li et al., 19 Mar 2025, Huang et al., 19 Sep 2025).
- Hybrid or two-phase approaches (brief full or LoRA fine-tuning followed by bias-only updates) partially mitigate these limitations.
- The theoretical underpinnings of why bias shifts suffice in many cases—particularly in overparameterized, privacy-constrained, or multitask settings—remain a subject of active investigation (Bu et al., 2022, Huang et al., 19 Sep 2025).
7. Extensions: Layer and Term Selection (BEFT)
Selective bias adaptation is formalized in BEFT, which leverages geometric measures to rank and identify the most informative bias subfamilies for target tasks. This dynamic, data-driven selection outperforms static magnitude or Fisher-based heuristics and pushes the resource gains of BitFit even further without loss in accuracy (Huang et al., 19 Sep 2025).
In practical terms, BEFT performs a brief (1–2 epoch) pilot fine-tuning on all bias types with a large learning rate, computes projection-ratio scores, resets, and then performs downstream fine-tuning on the single best bias family (e.g., $b_v$). This results in parameter and runtime savings by an additional order of magnitude, with empirical superiority across NLP and autoregressive LLM benchmarks in both low-data and standard regimes.
BitFit and BEFT constitute parameter-minimal, hyperparameter-robust, and “model-agnostic” tools for fine-tuning foundation models. Their appeal rests on empirical competitiveness across modalities and data regimes, extreme deployment efficiency, and theoretical insight into the structural sufficiency of bias terms for model adaptation (Zaken et al., 2021, Gupta et al., 2024, Huang et al., 19 Sep 2025, Bu et al., 2022, Li et al., 19 Mar 2025, Ruffini et al., 23 Jun 2025, Doering et al., 2024).