Fully Fine-Tuning (FFT) in Model Adaptation
- Fully Fine-Tuning (FFT) is defined as updating every trainable parameter of a pretrained model to maximize representational capacity and task performance.
- FFT employs full gradient-based optimization over the entire parameter space, with computational and memory costs that scale linearly with model size.
- While FFT achieves peak accuracy and robustness on benchmarks, it demands substantially more resources and can suffer catastrophic forgetting under certain constraints.
Fully fine-tuning (FFT) denotes the procedure of updating all trainable parameters in a foundation model during the adaptation phase for downstream tasks. This methodology stands as the canonical approach in model transfer, contrasting sharply with parameter-efficient fine-tuning (PEFT), which restricts updates to a small fraction of the parameters or introduces tuned adapters. FFT guarantees maximal representational flexibility, and empirical results across vision and language domains consistently demonstrate the highest achievable performance on task metrics when computational resources permit. However, FFT entails substantially greater computational and memory demands, and in certain regimes—such as strong differential privacy constraints or continual learning—may be suboptimal or subject to catastrophic forgetting.
1. Formal Definition and Optimization Framework
FFT is defined as gradient-based minimization over the full parameter space of a pretrained model with parameters $\theta_0$, given a dataset $\mathcal{D}$. The optimization objective in single-task supervised fine-tuning is

$$\theta^\ast = \arg\min_{\theta} \; \mathcal{L}(\theta; \mathcal{D}),$$

where $\mathcal{L}$ is the task-specific loss (e.g., negative log-likelihood for classification or code completion). For continual or multitask scenarios over tasks $k = 1, \dots, K$, the cumulative objective reads

$$\theta^\ast = \arg\min_{\theta} \; \sum_{k=1}^{K} \mathcal{L}_k(\theta; \mathcal{D}_k).$$

Gradient updates at each step $t$ are computed as

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \mathcal{L}(\theta_t; \mathcal{D}),$$

where $\eta$ is the learning rate and all parameters are updated (i.e., the trainable set equals the full parameter set, $|\theta_{\text{trainable}}| = |\theta|$).
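The full-parameter update rule above can be sketched on a toy model. The linear least-squares model, data, and hyperparameters below are illustrative stand-ins, not from the cited works; the point is only that every entry of $\theta$ receives the gradient update:

```python
import numpy as np

# Toy illustration of full fine-tuning (assumption: a linear model stands in
# for a pretrained network). FFT applies theta <- theta - eta * grad to ALL
# parameters, starting from the "pretrained" initialization theta_pre.

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=4)               # "pretrained" parameters theta_0
X = rng.normal(size=(32, 4))                 # downstream dataset D (inputs)
y = X @ np.array([1.0, -2.0, 0.5, 3.0])      # task targets

def loss(theta):
    r = X @ theta - y
    return 0.5 * np.mean(r ** 2)

def grad(theta):
    return X.T @ (X @ theta - y) / len(X)

theta = theta_pre.copy()
eta = 0.1                                    # learning rate (illustrative)
for _ in range(500):
    theta = theta - eta * grad(theta)        # every parameter is updated

print(loss(theta_pre), loss(theta))
```

Because no coordinate of $\theta$ is masked or frozen, the update explores the full parameter space, which is exactly what distinguishes FFT from the PEFT methods discussed below.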
Empirical cost models for FFT scale linearly with model size $N$:

$$\text{Cost} \propto N \cdot L \cdot B \cdot T,$$

with $L$ = input sequence length, $B$ = batch size, $T$ = number of gradient steps. In LLMs, FFT typically uses small learning rates, while PEFT variants often use quantized weights and higher rates (Zhuo et al., 1 Jan 2024).
2. Theoretical Properties: Capacity, Robustness, and Subset Relations
FFT encompasses the full hypothesis space $\mathcal{H}_{\text{FFT}}$. Parameter-efficient methods restrict updates to a low-dimensional subspace, establishing the strict set inclusion $\mathcal{H}_{\text{PEFT}} \subset \mathcal{H}_{\text{FFT}}$ (Liu et al., 28 May 2025). Theoretical analysis demonstrates:
- Representational capacity: FFT allows arbitrary parameter perturbations and maximizes expressiveness; by contrast, PEFT's function class is restricted by its injection mappings into the full parameter space.
- Marginal benefit of parameters: FFT yields diminishing returns as the fraction of updated parameters increases, with performance approaching a plateau well before all parameters are tuned (Zhuo et al., 1 Jan 2024).
- Robustness to perturbations: Second-order analyses show FFT’s loss landscapes are flatter and more robust, while PEFT subspaces (low-rank adapters) exhibit greater sensitivity to adversarial or noise-induced gradients.
- Data scaling: Generalization error decreases faster with dataset size for FFT than for PEFT; thus, FFT exploits larger datasets more efficiently (Liu et al., 28 May 2025).
3. Empirical Outcomes on Task Performance and Robustness
Astraios (Zhuo et al., 1 Jan 2024) and related studies benchmark FFT against PEFT algorithms including LoRA, Prefix-Tuning, and adapters. FFT consistently achieves the top normalized scores across code comprehension (accuracy, F1), code generation (Pass@1), and robustness/security metrics. Sample comparisons (16B model):
| Method | $C=\frac{\Delta P}{P}$ [%] | $M$ (mean over 5 tasks) | Key Metric |
|---|---|---|---|
| FFT | 100.0 | 100.0 | Highest overall |
| LoRA | 0.11 | 98.5 | Best PEFT trade-off |
| Parallel | 0.83 | 97.9 | Comparable adapter |
Absolute results on Code Synthesis (Python): FFT = 38.47%, LoRA = 38.08%, Parallel = 35.88%. Robustness measured by Robust Pass@1 (RP@1):
| Method | RP@1 (%) | Valid (%) | Insecure (%) |
|---|---|---|---|
| FFT | 53.05 | 84.1 | 38.3 |
| LoRA | 51.22 | 87.1 | 35.0 |
| Parallel | 50.00 | 86.0 | 32.6 |
FFT produces higher functional correctness but is more vulnerable to insecure code patterns, especially as model scale increases.
4. Cost–Performance Trade-offs and Resource Implications
Empirical and theoretical results quantify FFT's efficiency in terms of parameter and computation scaling. As the fraction of updated parameters increases, task performance follows a diminishing-return (saturating) curve, with FFT forming the Pareto frontier in attainable accuracy (Zhuo et al., 1 Jan 2024). FFT incurs 100% parameter cost (storage, computation, memory) compared to high-performing PEFT methods (LoRA: 0.1% parameters, KronA+: 0.056%) (Ligan et al., 21 May 2025). In resource-constrained settings, PEFT approaches close most of the gap (e.g., KronA+ within 3% overall accuracy of FFT in hyperspectral classification) but cannot universally match FFT in complex reasoning or adversarial robustness (Liu et al., 28 May 2025).
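A saturating curve of this kind can be sketched with a simple functional form. The exponential shape and constants below are an assumed stand-in for illustration, not the fitted curve from the cited work:

```python
import math

# Illustrative diminishing-returns curve: performance as a function of the
# trainable-parameter fraction rho. The saturating exponential and its
# constants are assumptions, not fitted values from Zhuo et al.

def perf(rho, p_max=100.0, k=50.0):
    return p_max * (1.0 - math.exp(-k * rho))

# Marginal gains shrink sharply as rho grows.
for rho in (0.001, 0.01, 0.1, 1.0):
    print(rho, round(perf(rho), 1))
```

The marginal gain from 0.1% to 1% of parameters dwarfs the gain from 10% to 100%, which is the qualitative behavior the diminishing-return claim describes.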
5. Continual Learning, Catastrophic Forgetting, and Hybrid Strategies
When models are subject to sequential multi-task adaptation, FFT may induce catastrophic forgetting due to unconstrained parameter drift from the pretrained initialization $\theta_0$. Absent explicit regularization, fine-tuning can overwrite prior knowledge (Hui et al., 29 Apr 2024). Regularization-based strategies or partial freezing (e.g., Half Fine-Tuning—HFT: updating a randomly selected 50% of parameters) dampen this effect, empirically yielding better knowledge retention and sometimes matching or surpassing FFT in overall performance. HFT's masking mechanism acts as an implicit regularizer, constraining the frozen parameters to remain at their pretrained values.
Ablation studies show optimal trade-offs occur at 30–50% trainable parameter ratios, combining plasticity for new tasks with retention of previous capabilities.
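The HFT masking idea can be sketched as gradient masking over a random half of the parameters. Per-parameter masking is a simplification for illustration (the cited work selects at a coarser granularity), and the gradient here is a stand-in:

```python
import numpy as np

# Half Fine-Tuning (HFT) sketch: freeze a random ~50% of parameters by
# masking their gradients; frozen entries stay exactly at pretrained values.
# Per-parameter masking is a simplifying assumption for this example.

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=100)       # pretrained parameters theta_0
theta = theta_pre.copy()
mask = rng.random(100) < 0.5           # True = trainable (~50% of entries)

grad = rng.normal(size=100)            # stand-in gradient for one step
eta = 0.1
theta -= eta * grad * mask             # frozen parameters receive no update

frozen_unchanged = bool(np.all(theta[~mask] == theta_pre[~mask]))
print(frozen_unchanged, mask.mean())
```

Keeping roughly half of $\theta$ pinned at $\theta_0$ is what limits drift and, with it, catastrophic forgetting.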
6. Differential Privacy and Privacy-Budget Allocation
FFT under differential privacy imposes substantial noise on all updated parameters, degrading convergence rates and final accuracy. Theoretical analysis of a two-layer linear model under DP-SGD shows that larger DP noise slows convergence and raises the nonzero loss plateau. Utility curves mapping privacy-budget allocation between LP (linear probing) and FFT are concave, with the optimal split varying as privacy constraints tighten. In regimes of strong privacy (small privacy budget $\varepsilon$), pure LP outperforms FFT, while under loose constraints FFT remains optimal (Ke et al., 29 Feb 2024).
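The mechanism behind the FFT penalty under DP can be sketched with the standard DP-SGD recipe: per-example gradient clipping followed by Gaussian noise on every coordinate. The dimensions, clipping norm, and noise multiplier below are toy values:

```python
import numpy as np

# DP-SGD sketch: clip each per-example gradient to norm C, then add Gaussian
# noise with multiplier sigma. Noise lands on EVERY coordinate, which is why
# updating the full parameter space (FFT) is costly under DP. Toy values only.

rng = np.random.default_rng(0)
per_example_grads = rng.normal(size=(16, 50))   # 16 examples, 50 parameters
C, sigma = 1.0, 1.5

norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
clipped = per_example_grads * np.minimum(1.0, C / norms)   # row norms <= C

noise = sigma * C * rng.normal(size=50)                    # noise on all dims
dp_grad = (clipped.sum(axis=0) + noise) / len(clipped)

print(float(np.linalg.norm(clipped, axis=1).max()), dp_grad.shape)
```

Since the added noise has magnitude proportional to the clipping norm on all coordinates, methods that update fewer coordinates (LP, PEFT) accumulate less total noise per unit of privacy budget, matching the concave utility trade-off described above.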
7. Federated Fine-Tuning under Resource Constraints
Federated settings present unique challenges for FFT due to client compute and communication limitations. While FFT offers maximal adaptation, techniques such as similarity group pruning (SGP) and orchestrated distillation alignment (ODA) compress models to enable almost-full fine-tuning by pruning layers and fine-tuning low-rank adapters on quantized sub-models (Zhu et al., 18 Aug 2025). The FedSODA framework achieves within 0.5–1% of full FFT accuracy, while reducing communication by 70.6% and storage by 75.6%. This demonstrates that hybrid resource-adapted FFT is feasible in distributed contexts, but always trails pure FFT in expressivity and accuracy.
This synthesis establishes FFT as the highest-capacity, most flexible adaptation strategy, though subject to practical limits in memory, computation, privacy constraints, and sequential training regimes. PEFT and hybrid approaches offer compelling alternatives, achieving near-FFT performance with greatly reduced parameter and resource footprints, but cannot universally match FFT's robustness, scalability, or downstream efficacy in complex, non-stationary, or adversarial settings.