
Full Parameter Fine-Tuning

Updated 1 December 2025
  • Full-parameter fine-tuning (FPFT) is a method that adapts pre-trained neural networks by updating every parameter, ensuring maximal model adaptability.
  • It achieves superior accuracy, especially in high-stakes tasks such as medical QA, but demands extensive computing and memory resources.
  • Recent innovations like LOMO, HiFT, and QFT optimize resource usage, enabling FPFT on commodity hardware while preserving performance.

Full-parameter fine-tuning (FPFT), also termed full fine-tuning (FFT), is the process of adapting a pre-trained neural network by updating all its parameters—typically numbering in the billions for contemporary language or vision models—on downstream tasks and datasets. FPFT offers the highest possible adaptation capacity, in contrast to parameter-efficient fine-tuning (PEFT), which restricts updates to a small, structured subset of parameters. FPFT continues to serve as the performance benchmark across domains and remains the method of choice when maximal accuracy, robustness, or representational coverage are required, albeit at considerable memory, compute, and engineering cost.

1. Mathematical Formulation and Algorithmic Workflow

The objective of full-parameter fine-tuning is to minimize a downstream task loss over all model parameters \theta \in \mathbb{R}^d. For a pre-trained model f(x; \theta) and data distribution D = \{(x, y)\}, the optimization problem is:

\theta^* = \arg\min_{\theta \in \mathbb{R}^d} \mathbb{E}_{(x,y)\sim D}\left[\mathcal{L}(f(x; \theta), y)\right],

where \mathcal{L} denotes the task-specific loss, commonly cross-entropy for classification or language-modeling tasks (Liu et al., 28 May 2025).

A canonical instantiation in LLMs appears in Med42 (Christophe et al., 23 Apr 2024):

L(\theta) = -\sum_i \sum_{t=1}^{|y_i|} \log p(y_{i,t} \mid y_{i,1:t-1}, x_i; \theta) + \lambda \|\theta\|_2^2,

where all \theta are optimized end-to-end, with the loss computed over the desired output subsequence (e.g., the masked response).
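This objective can be sketched on toy inputs. The snippet below is a minimal illustration only: the per-token probabilities and response mask are placeholder values, not Med42's actual data pipeline.

```python
import math

def masked_nll_loss(token_probs, response_mask, params, weight_decay=0.0):
    """Negative log-likelihood over response tokens only, plus an L2 penalty.

    token_probs   -- p(y_t | y_<t, x; theta) at each target position
    response_mask -- 1 where the token belongs to the desired output, 0 for prompt
    params        -- flat list of model parameters (for the lambda * ||theta||^2 term)
    """
    nll = -sum(math.log(p) for p, m in zip(token_probs, response_mask) if m)
    l2 = weight_decay * sum(w * w for w in params)
    return nll + l2

# Prompt tokens (mask 0) are excluded; only the response contributes to the loss.
loss = masked_nll_loss(
    token_probs=[0.9, 0.8, 0.5, 0.25],
    response_mask=[0, 0, 1, 1],
    params=[0.1, -0.2],
)
# loss = -(log 0.5 + log 0.25) = log 8
```

Masking the prompt positions this way is what "loss computed over the desired output subsequence" amounts to in practice.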

Algorithmic steps for FPFT typically include:

  • Loading pre-trained parameters into a compatible model architecture.
  • Packing input data according to model constraints (e.g., blockwise tokenization).
  • Configuring an optimizer (commonly AdamW with \beta_1 = 0.9, \beta_2 = 0.95, \epsilon = 10^{-8}; or alternatives for low-resource settings).
  • Iterative full-model backpropagation and parameter update on each batch.
  • Ancillary stabilization techniques: gradient clipping, weight decay, learning rate scheduling.
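The steps above can be sketched as a minimal loop on a toy one-parameter model. This is a pure-Python illustration under stated assumptions: the model, data, and learning rate are placeholders, while the AdamW update and global-norm clipping follow the hyperparameters listed.

```python
import math

def fpft_step(params, grads, state, lr=1e-2, b1=0.9, b2=0.95, eps=1e-8,
              weight_decay=0.0, clip_norm=1.0):
    """One AdamW update over *all* parameters, with global-norm gradient clipping."""
    # Global-norm clipping: rescale every gradient if the total norm is too large.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = clip_norm / norm if norm > clip_norm else 1.0
    grads = [g * scale for g in grads]

    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (w, g) in enumerate(zip(params, grads)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g          # momentum
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g      # variance
        m_hat = state["m"][i] / (1 - b1 ** t)                      # bias correction
        v_hat = state["v"][i] / (1 - b2 ** t)
        # Decoupled weight decay, as in AdamW.
        new_params.append(w - lr * (m_hat / (math.sqrt(v_hat) + eps)
                                    + weight_decay * w))
    return new_params

# Toy "model": fit y = 2x with a single weight; every parameter is updated.
params = [0.0]
state = {"t": 0, "m": [0.0] * len(params), "v": [0.0] * len(params)}
data = [(1.0, 2.0), (2.0, 4.0)]
for epoch in range(2000):
    for x, y in data:
        grad = [2 * (params[0] * x - y) * x]   # d/dw of (w*x - y)^2
        params = fpft_step(params, grad, state)
```

Note that, unlike PEFT, every entry of `params` receives a gradient and an update at every step; the per-parameter `m`/`v` lists are exactly the optimizer state whose memory cost Section 3 discusses.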

2. Theoretical Properties and Representational Capacity

FPFT occupies the entire parameter manifold \Theta = \mathbb{R}^d, allowing unconstrained movement in function space. A formal distinction from PEFT is given in (Liu et al., 28 May 2025):

  • Function class (FPFT): \mathcal{F}_{\text{full}} = \{f(x; \theta) : \theta \in \mathbb{R}^d\}.
  • Function class (PEFT): \mathcal{F}_{\text{peft}} = \{f(x; \theta_0 + g(\Phi)) : \Phi \in \mathbb{R}^k,\ k \ll d\}.

PEFT’s updates are restricted to a k-dimensional submanifold, strictly contained in the d-dimensional parameter space realized by FPFT (Theorem 1), since g is not surjective for k < d (via Sard's theorem).

  • Representational bounds: Theorem 2 quantifies that the representational shift in PEFT is controlled by the magnitude \|\Phi_k\|, and that overall flexibility is strictly less than FPFT's.
  • Rank constraints: For low-rank PEFT (e.g., LoRA), the error in approximating a full update is lower-bounded by spectral tail terms beyond the rank budget (Eckart–Young–Mirsky).
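The rank bound can be checked numerically. For a diagonal update matrix the singular values are just the diagonal entries, so the best rank-r approximation error is computable by hand; the values below are an illustrative toy spectrum, not data from any paper.

```python
import math

# A hypothetical full-parameter update with singular values 5, 3, 2, 1
# (taken diagonal so the SVD is trivial).
singular_values = [5.0, 3.0, 2.0, 1.0]

def best_rank_r_error(sigmas, r):
    """Frobenius error of the best rank-r approximation: by Eckart-Young-Mirsky,
    the square root of the sum of squared singular values beyond rank r."""
    return math.sqrt(sum(s * s for s in sorted(sigmas, reverse=True)[r:]))

# A rank-2 adapter (e.g., LoRA with r=2) can never beat the sigma_3, sigma_4 tail.
err = best_rank_r_error(singular_values, r=2)
# err = sqrt(2^2 + 1^2) = sqrt(5)
```

When the spectrum decays slowly, this tail stays large for any affordable rank budget, which is the mechanism behind the "LoRA saturates early" observation in Section 6.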

3. Practical Implementations and Memory Optimization

Conventional FPFT is memory-intensive, limiting accessibility:

  • AdamW keeps two optimizer states (momentum and variance) per parameter, adding twice the parameter memory on top of weights and gradients.
  • All gradients must be retained in memory until the optimizer step.
  • Large LLMs (e.g., Llama-2 70B) require distributed compute, often on AI supercomputers (e.g., Cerebras CG-1, with > 1 PB of system RAM) (Christophe et al., 23 Apr 2024).
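These figures follow from simple byte accounting. A back-of-the-envelope estimate (model states only, ignoring activations, fragmentation, and sharding overhead):

```python
def adamw_fp32_state_bytes(n_params):
    """Weights + gradients + two Adam moments, all fp32 (4 bytes each):
    4 tensors of n_params floats."""
    return n_params * 4 * (1 + 1 + 2)

# Llama-2 70B: model states alone, before any activation memory.
tb = adamw_fp32_state_bytes(70e9) / 1e12
# ~1.12 TB of model state, far beyond any single accelerator
```

Mixed-precision training changes the per-tensor byte counts but not the basic structure of this accounting, which is what LOMO, HiFT, and QFT each attack from a different direction.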

Recent algorithmic advances address resource barriers:

  • LOMO (LOw-Memory Optimization) (Lv et al., 2023): Fuses gradient computation and parameter update, eliminating optimizer state and peak gradient memory, reducing memory usage to 10.8% of AdamW/DeepSpeed baselines; enables end-to-end tuning of a 65B model on 8 × 24 GB GPUs.
  • HiFT (Hierarchical FPFT) (Liu et al., 26 Jan 2024): Activates only a subset (e.g., 1/k) of layer groups per step, reducing in-memory gradients and optimizer states by a factor of k. Enables 7B–13B models to be fine-tuned full-parameter on consumer GPUs (< 32 GB), with only a 1.1–1.3× increase in wall-clock time.
  • QFT (Quantized FPFT) (Li et al., 2023): Stores all training states (weights, gradients, optimizer moments) in INT8, leveraging quantizer-theoretic bounds and the Lion optimizer's sign-invariant update, achieving a model-state memory reduction to 21% of fp32 baselines while maintaining performance within 0.6 points of full-precision FPFT.

Method     Parameter Memory   Optimizer State   Activation/Gradient   Memory Saving
Standard   FP32               AdamW (2×)        FP32                  Baseline
LOMO       FP16               SGD (none)        Blockwise, in-place   ~10% of baseline
HiFT       FP32/FP16          Any               Subset per step       >60% on 7B models
QFT        INT8               Lion (INT8)       INT8, stack-based     21% of Adam-fp32
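LOMO's key trick can be sketched on a toy layered model: each block's gradient is applied and freed immediately instead of being accumulated for a later optimizer step. This is a simplification under stated assumptions; the real implementation fuses the update into the backward pass, and `grad_fn` here is a stand-in for autograd.

```python
def lomo_style_update(layers, grad_fn, lr=0.1):
    """Fused backward/update: each layer's gradient is used and discarded at once,
    so peak gradient memory is one layer's worth, not the whole model's."""
    peak_grad_params = 0
    for layer in reversed(layers):          # mimic backward-pass order
        grad = grad_fn(layer)               # gradient for this layer only
        peak_grad_params = max(peak_grad_params, len(grad))
        for i, g in enumerate(grad):
            layer[i] -= lr * g              # plain SGD: no optimizer state at all
        del grad                            # freed before the next layer's gradient
    return peak_grad_params

# Toy model: three "layers" of parameters; pretend the gradient equals the weights.
layers = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
peak = lomo_style_update(layers, grad_fn=lambda layer: list(layer))
# Peak gradient storage: 3 parameters (largest layer), vs. all 6 for standard FPFT.
```

The same structure explains why LOMO's savings compose with HiFT's layer-group scheduling and QFT's INT8 states: each reduces a different term of the memory budget.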

4. Comparative Accuracy and Empirical Findings

FPFT consistently outperforms PEFT on complex, high-capacity tasks and adversarial settings:

  • Med42 (Llama-2 70B, medical QA): FPFT achieves 71.9% average accuracy on USMLE vs. 68.3% for LoRA, and 60.9% on MedMCQA vs. 54.7% for LoRA, a 3–6 point advantage, at the cost of roughly 700× more parameters updated (Christophe et al., 23 Apr 2024).
  • Comparative studies (GLUE, reasoning, adversarial benchmarks): On difficult reasoning tasks (GSM8K, SQL, instruction following), FPFT exceeds LoRA by 2–4% absolute. In adversarial settings (AdvGLUE, SQuAD perturbations), PEFT's accuracy declines by 1–6% relative to FPFT (Liu et al., 28 May 2025).
  • PEFT performance on "easy" tasks: On standard GLUE, LoRA can match or slightly outperform FPFT, owing to task simplicity and low intrinsic dimensionality.
  • Scaling with data: In few-shot regimes (k < 20), LoRA can briefly match or exceed FPFT, but for k > 100, FPFT's per-sample gains outpace PEFT's, attributable to the higher VC-dimension of the unconstrained model class.

5. Memory/Compute Trade-offs and Resource-Aware Strategies

FPFT imposes significant demands:

  • A modern 7B LLM (e.g., Llama-2) requires > 100 GB of memory for all parameter, optimizer-state, and activation buffers under AdamW-fp32 (Liu et al., 26 Jan 2024).
  • Consumer-level GPUs necessitate algorithmic workarounds; the memory savings of LOMO, HiFT, and QFT are largely orthogonal and can be composed for further gains.

Memory-footprint formulas (HiFT, with k layer groups and model size W_1):

\text{M}_{\text{full}} \approx 4 W_1 + M_{\text{resid}},

\text{M}_{\text{HiFT}} \approx W_1 (1 + 3/k) + M'_{\text{resid}},

yielding > 60% memory reduction for k \approx 34 (one layer per group, Llama2-7B).

6. Full-Rank vs. Low-Rank Parameterization: Implications and Extensions

FPFT is unconstrained in rank: every weight matrix can evolve with full degrees of freedom. By contrast, LoRA and other PEFT approaches enforce a rank-r structure, intrinsically bounding the representational shift (Albert et al., 3 Feb 2025, Liu et al., 28 May 2025). If the singular-value spectrum of the full update \Delta W decays slowly (high effective rank), LoRA saturates early; full-rank approaches, whether by full \Delta W learning (FPFT) or via random-basis decompositions (RandLoRA), recover accuracy plateaus otherwise unreachable by PEFT.

Empirical comparison (vision, language, vision-language):

  • LoRA underperforms FPFT on high-complexity tasks (e.g., CLIP V+L, LLM reasoning), while full-rank parameter-efficient methods (e.g., RandLoRA) can close the gap, further confirming the intrinsic value of full-rank adaptation (Albert et al., 3 Feb 2025).

7. Guidance and Best Practices

FPFT is the preferred strategy when:

  • Maximum accuracy is necessary, particularly in clinical, legal, or other high-stakes settings where empirical gains are non-negligible (e.g., Med42's 3.6-point improvement on USMLE).
  • Data distribution shift, adversarial robustness, or out-of-distribution generalization are critical, as full-function class adaptability and noise-resilience can only be ensured by unconstrained updates (Liu et al., 28 May 2025).
  • Sufficient compute/memory resources are available, or when leveraging recent innovations in resource-efficient optimization (LOMO, HiFT, QFT) enables large-scale FPFT on commodity hardware.
  • Conversely, for rapid prototyping, resource-constrained environments, or “easy” tasks of limited complexity, PEFT may suffice at substantially reduced cost.

Recommended stabilizing strategies—e.g., masked-response loss, 4K-token sequence packing, global-norm gradient clipping, and a cosine decay schedule—further enhance FPFT outcomes on large instruction datasets (Christophe et al., 23 Apr 2024).
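Sequence packing, as mentioned above, can be sketched as a greedy concatenation of tokenized examples into fixed-length blocks. The block size of 4096 follows the 4K figure; the separator token and the decision to drop the final partial block are illustrative assumptions, not Med42's documented recipe.

```python
def pack_sequences(tokenized_examples, block_size=4096, sep_token=2):
    """Concatenate examples (separated by sep_token) into fixed-size blocks,
    so every training step sees a full block_size context with no padding."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(sep_token)  # hypothetical separator between documents
    # Split the flat stream into full blocks; the trailing partial block is dropped.
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

# Three short "documents" packed into blocks of 8 tokens.
blocks = pack_sequences([[1, 1, 1], [5, 5], [7, 7, 7, 7, 7]], block_size=8)
# One full block: [1, 1, 1, 2, 5, 5, 2, 7]; the leftover tail is discarded.
```

Packing keeps every position in every batch contributing to the loss, which matters for FPFT's throughput since each step pays the full-model backward cost regardless of how much of the batch is padding.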
