
Full Parameter Fine-Tuning

Updated 1 December 2025
  • Full-parameter fine-tuning (FPFT) is a method that adapts pre-trained neural networks by updating every parameter, ensuring maximal model adaptability.
  • It achieves superior accuracy, especially in high-stakes tasks such as medical QA, but demands extensive computing and memory resources.
  • Recent innovations like LOMO, HiFT, and QFT optimize resource usage, enabling FPFT on commodity hardware while preserving performance.

Full-parameter fine-tuning (FPFT), also termed full fine-tuning (FFT), is the process of adapting a pre-trained neural network by updating all its parameters—typically numbering in the billions for contemporary language or vision models—on downstream tasks and datasets. FPFT offers the highest possible adaptation capacity, in contrast to parameter-efficient fine-tuning (PEFT), which restricts updates to a small, structured subset of parameters. FPFT continues to serve as the performance benchmark across domains and remains the method of choice when maximal accuracy, robustness, or representational coverage are required, albeit at considerable memory, compute, and engineering cost.

1. Mathematical Formulation and Algorithmic Workflow

The objective of full-parameter fine-tuning is to minimize a downstream task loss over all model parameters \theta \in \mathbb{R}^d. For a pre-trained model f(x; \theta) and data distribution D = \{(x, y)\}, the optimization problem is:

\theta^* = \arg\min_{\theta \in \mathbb{R}^d} \mathbb{E}_{(x,y)\sim D}\left[\mathcal{L}(f(x; \theta), y)\right],

where \mathcal{L} denotes the task-specific loss, commonly cross-entropy for classification or language-modeling tasks (Liu et al., 28 May 2025).

A canonical instantiation in LLMs appears in Med42 (Christophe et al., 23 Apr 2024):

L(\theta) = -\sum_i \sum_{t=1}^{|y_i|} \log p(y_{i,t} \mid y_{i,1:t-1}, x_i; \theta) + \lambda \|\theta\|_2^2,

where all \theta are optimized end-to-end, with the loss computed over the desired output subsequence (e.g., the masked response).
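This objective can be sketched on toy inputs. The snippet below is a minimal illustration only: the per-token probabilities and response mask are placeholder values, not Med42's actual data pipeline.

```python
import math

def masked_nll_loss(token_probs, response_mask, params, weight_decay=0.0):
    """Negative log-likelihood over response tokens only, plus an L2 penalty.

    token_probs   -- p(y_t | y_<t, x; theta) at each target position
    response_mask -- 1 where the token belongs to the desired output, 0 for prompt
    params        -- flat list of model parameters (for the lambda * ||theta||^2 term)
    """
    nll = -sum(math.log(p) for p, m in zip(token_probs, response_mask) if m)
    l2 = weight_decay * sum(w * w for w in params)
    return nll + l2

# Prompt tokens (mask 0) are excluded; only the response contributes to the loss.
loss = masked_nll_loss(
    token_probs=[0.9, 0.8, 0.5, 0.25],
    response_mask=[0, 0, 1, 1],
    params=[0.1, -0.2],
)
# loss = -(log 0.5 + log 0.25) = log 8
```

Masking the prompt positions this way is what "loss computed over the desired output subsequence" amounts to in practice.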

Algorithmic steps for FPFT typically include:

  • Loading pre-trained parameters into a compatible model architecture.
  • Packing input data according to model constraints (e.g., blockwise tokenization).
  • Configuring an optimizer (commonly AdamW with \beta_1 = 0.9, \beta_2 = 0.95, \epsilon = 10^{-8}; or alternatives for low-resource settings).
  • Iterative full-model backpropagation and parameter update on each batch.
  • Ancillary stabilization techniques: gradient clipping, weight decay, learning rate scheduling.
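The steps above can be sketched as a minimal loop on a toy one-parameter model. This is a pure-Python illustration under stated assumptions: the model, data, and learning rate are placeholders, while the AdamW update and global-norm clipping follow the hyperparameters listed.

```python
import math

def fpft_step(params, grads, state, lr=1e-2, b1=0.9, b2=0.95, eps=1e-8,
              weight_decay=0.0, clip_norm=1.0):
    """One AdamW update over *all* parameters, with global-norm gradient clipping."""
    # Global-norm clipping: rescale every gradient if the total norm is too large.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = clip_norm / norm if norm > clip_norm else 1.0
    grads = [g * scale for g in grads]

    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (w, g) in enumerate(zip(params, grads)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g          # momentum
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g      # variance
        m_hat = state["m"][i] / (1 - b1 ** t)                      # bias correction
        v_hat = state["v"][i] / (1 - b2 ** t)
        # Decoupled weight decay, as in AdamW.
        new_params.append(w - lr * (m_hat / (math.sqrt(v_hat) + eps)
                                    + weight_decay * w))
    return new_params

# Toy "model": fit y = 2x with a single weight; every parameter is updated.
params = [0.0]
state = {"t": 0, "m": [0.0] * len(params), "v": [0.0] * len(params)}
data = [(1.0, 2.0), (2.0, 4.0)]
for epoch in range(2000):
    for x, y in data:
        grad = [2 * (params[0] * x - y) * x]   # d/dw of (w*x - y)^2
        params = fpft_step(params, grad, state)
```

Note that, unlike PEFT, every entry of `params` receives a gradient and an update at every step; the per-parameter `m`/`v` lists are exactly the optimizer state whose memory cost Section 3 discusses.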

2. Theoretical Properties and Representational Capacity

FPFT occupies the entire parameter manifold \Theta = \mathbb{R}^d, allowing unconstrained movement in function space. A formal distinction from PEFT is given in (Liu et al., 28 May 2025):

  • Function class (FPFT): \mathcal{F}_{\text{full}} = \{f(x; \theta) : \theta \in \mathbb{R}^d\}.
  • Function class (PEFT): \mathcal{F}_{\text{peft}} = \{f(x; \theta_0 + g(\Phi)) : \Phi \in \mathbb{R}^k,\ k \ll d\}.

PEFT’s updates are restricted to a k-dimensional submanifold, strictly contained in the d-dimensional parameter space realized by FPFT (Theorem 1), since g is not surjective for k < d (via Sard's theorem).

  • Representational bounds: Theorem 2 quantifies that the representational shift in PEFT is controlled by the magnitude \|\Phi_k\|, and that overall flexibility is strictly less than FPFT's.
  • Rank constraints: For low-rank PEFT (e.g., LoRA), the error in approximating a full update is lower-bounded by spectral tail terms beyond the rank budget (Eckart–Young–Mirsky).
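The rank bound can be checked numerically. For a diagonal update matrix the singular values are just the diagonal entries, so the best rank-r approximation error is computable by hand; the values below are an illustrative toy spectrum, not data from any paper.

```python
import math

# A hypothetical full-parameter update with singular values 5, 3, 2, 1
# (taken diagonal so the SVD is trivial).
singular_values = [5.0, 3.0, 2.0, 1.0]

def best_rank_r_error(sigmas, r):
    """Frobenius error of the best rank-r approximation: by Eckart-Young-Mirsky,
    the square root of the sum of squared singular values beyond rank r."""
    return math.sqrt(sum(s * s for s in sorted(sigmas, reverse=True)[r:]))

# A rank-2 adapter (e.g., LoRA with r=2) can never beat the sigma_3, sigma_4 tail.
err = best_rank_r_error(singular_values, r=2)
# err = sqrt(2^2 + 1^2) = sqrt(5)
```

When the spectrum decays slowly, this tail stays large for any affordable rank budget, which is the mechanism behind the "LoRA saturates early" observation in Section 6.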

3. Practical Implementations and Memory Optimization

Conventional FPFT is memory-intensive, limiting accessibility:

  • AdamW keeps two optimizer states (momentum and variance) per parameter, adding twice the parameter memory on top of weights and gradients.
  • All gradients must be retained in memory until the optimizer step.
  • Large LLMs (e.g., Llama-2 70B) require distributed compute, often on AI supercomputers (e.g., Cerebras CG-1, with > 1 PB of system RAM) (Christophe et al., 23 Apr 2024).
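These figures follow from simple byte accounting. A back-of-the-envelope estimate (model states only, ignoring activations, fragmentation, and sharding overhead):

```python
def adamw_fp32_state_bytes(n_params):
    """Weights + gradients + two Adam moments, all fp32 (4 bytes each):
    4 tensors of n_params floats."""
    return n_params * 4 * (1 + 1 + 2)

# Llama-2 70B: model states alone, before any activation memory.
tb = adamw_fp32_state_bytes(70e9) / 1e12
# ~1.12 TB of model state, far beyond any single accelerator
```

Mixed-precision training changes the per-tensor byte counts but not the basic structure of this accounting, which is what LOMO, HiFT, and QFT each attack from a different direction.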

Recent algorithmic advances address resource barriers:

  • LOMO (LOw-Memory Optimization) (Lv et al., 2023): Fuses gradient computation and parameter update, eliminating optimizer state and peak gradient memory, reducing memory usage to 10.8% of AdamW/DeepSpeed baselines; enables end-to-end tuning of a 65B model on 8 × 24 GB GPUs.
  • HiFT (Hierarchical FPFT) (Liu et al., 26 Jan 2024): Activates only a subset (e.g., 1/k) of layer groups per step, reducing in-memory gradients and optimizer states by a factor of k. Enables 7B–13B models to be fine-tuned full-parameter on consumer GPUs (< 32 GB), with only a 1.1–1.3× increase in wall-clock time.
  • QFT (Quantized FPFT) (Li et al., 2023): Stores all training states (weights, gradients, optimizer moments) in INT8, leveraging quantizer-theoretic bounds and the Lion optimizer's sign-invariant update, achieving a model-state memory reduction to 21% of fp32 baselines while maintaining performance within 0.6 points of full-precision FPFT.

Method     Parameter Memory   Optimizer State   Activation/Gradient   Memory Saving
Standard   FP32               AdamW (2×)        FP32                  Baseline
LOMO       FP16               SGD (none)        Blockwise, in-place   ~10% of baseline
HiFT       FP32/FP16          Any               Subset per step       >60% on 7B models
QFT        INT8               Lion (INT8)       INT8, stack-based     21% of Adam-fp32
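LOMO's key trick can be sketched on a toy layered model: each block's gradient is applied and freed immediately instead of being accumulated for a later optimizer step. This is a simplification under stated assumptions; the real implementation fuses the update into the backward pass, and `grad_fn` here is a stand-in for autograd.

```python
def lomo_style_update(layers, grad_fn, lr=0.1):
    """Fused backward/update: each layer's gradient is used and discarded at once,
    so peak gradient memory is one layer's worth, not the whole model's."""
    peak_grad_params = 0
    for layer in reversed(layers):          # mimic backward-pass order
        grad = grad_fn(layer)               # gradient for this layer only
        peak_grad_params = max(peak_grad_params, len(grad))
        for i, g in enumerate(grad):
            layer[i] -= lr * g              # plain SGD: no optimizer state at all
        del grad                            # freed before the next layer's gradient
    return peak_grad_params

# Toy model: three "layers" of parameters; pretend the gradient equals the weights.
layers = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
peak = lomo_style_update(layers, grad_fn=lambda layer: list(layer))
# Peak gradient storage: 3 parameters (largest layer), vs. all 6 for standard FPFT.
```

The same structure explains why LOMO's savings compose with HiFT's layer-group scheduling and QFT's INT8 states: each reduces a different term of the memory budget.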

4. Comparative Accuracy and Empirical Findings

FPFT consistently outperforms PEFT on complex, high-capacity tasks and adversarial settings:

  • Med42 (Llama-2 70B, medical QA): FPFT achieves 71.9% average accuracy on USMLE vs. 68.3% for LoRA, and 60.9% on MedMCQA vs. 54.7% for LoRA, a 3–6 point advantage, at the cost of roughly 700× more parameters updated (Christophe et al., 23 Apr 2024).
  • Comparative studies (GLUE, reasoning, adversarial benchmarks): On difficult reasoning tasks (GSM8K, SQL, instruction following), FPFT exceeds LoRA by 2–4% absolute. In adversarial settings (AdvGLUE, SQuAD perturbations), PEFT's accuracy declines by 1–6% relative to FPFT (Liu et al., 28 May 2025).
  • PEFT performance on "easy" tasks: On standard GLUE, LoRA can match or slightly outperform FPFT, owing to task simplicity and low intrinsic dimensionality.
  • Scaling with data: In few-shot regimes (k < 20), LoRA can briefly match or exceed FPFT, but for k > 100, FPFT's per-sample gains outpace PEFT's, attributable to the higher VC-dimension of the unconstrained model class.

5. Memory/Compute Trade-offs and Resource-Aware Strategies

FPFT imposes significant demands:

  • A modern 7B LLM (e.g., Llama-2) requires > 100 GB of memory for all parameter, optimizer-state, and activation buffers under AdamW-fp32 (Liu et al., 26 Jan 2024).
  • Consumer-level GPUs necessitate algorithmic workarounds; the memory savings of LOMO, HiFT, and QFT are largely orthogonal and can be composed for further gains.

Memory-footprint formulas (HiFT, with k layer groups and model size W_1):

\text{M}_{\text{full}} \approx 4 W_1 + M_{\text{resid}},

\text{M}_{\text{HiFT}} \approx W_1 (1 + 3/k) + M'_{\text{resid}},

yielding > 60% memory reduction for k \approx 34 (one layer per group, Llama2-7B).

6. Full-Rank vs. Low-Rank Parameterization: Implications and Extensions

FPFT is unconstrained in rank: every weight matrix can evolve with full degrees of freedom. By contrast, LoRA and other PEFT approaches enforce a rank-r structure, intrinsically bounding the representational shift (Albert et al., 3 Feb 2025, Liu et al., 28 May 2025). If the singular-value spectrum of the full update \Delta W decays slowly (high effective rank), LoRA saturates early; full-rank approaches, whether by full \Delta W learning (FPFT) or via random-basis decompositions (RandLoRA), recover accuracy plateaus otherwise unreachable by PEFT.

Empirical comparison (vision, language, vision-language):

  • LoRA underperforms FPFT on high-complexity tasks (e.g., CLIP V+L, LLM reasoning), while full-rank parameter-efficient methods (e.g., RandLoRA) can close the gap, further confirming the intrinsic value of full-rank adaptation (Albert et al., 3 Feb 2025).

7. Guidance and Best Practices

FPFT is the preferred strategy when:

  • Maximum accuracy is necessary, particularly in clinical, legal, or other high-stakes settings where empirical gains are non-negligible (e.g., Med42's 3.6-point improvement on USMLE).
  • Data distribution shift, adversarial robustness, or out-of-distribution generalization are critical, as full-function class adaptability and noise-resilience can only be ensured by unconstrained updates (Liu et al., 28 May 2025).
  • Sufficient compute/memory resources are available, or when leveraging recent innovations in resource-efficient optimization (LOMO, HiFT, QFT) enables large-scale FPFT on commodity hardware.
  • Conversely, for rapid prototyping, resource-constrained environments, or “easy” tasks of limited complexity, PEFT may suffice at substantially reduced cost.

Recommended stabilizing strategies—e.g., masked-response loss, 4K-token sequence packing, global-norm gradient clipping, and a cosine decay schedule—further enhance FPFT outcomes on large instruction datasets (Christophe et al., 23 Apr 2024).
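Sequence packing, as mentioned above, can be sketched as a greedy concatenation of tokenized examples into fixed-length blocks. The block size of 4096 follows the 4K figure; the separator token and the decision to drop the final partial block are illustrative assumptions, not Med42's documented recipe.

```python
def pack_sequences(tokenized_examples, block_size=4096, sep_token=2):
    """Concatenate examples (separated by sep_token) into fixed-size blocks,
    so every training step sees a full block_size context with no padding."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(sep_token)  # hypothetical separator between documents
    # Split the flat stream into full blocks; the trailing partial block is dropped.
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

# Three short "documents" packed into blocks of 8 tokens.
blocks = pack_sequences([[1, 1, 1], [5, 5], [7, 7, 7, 7, 7]], block_size=8)
# One full block: [1, 1, 1, 2, 5, 5, 2, 7]; the leftover tail is discarded.
```

Packing keeps every position in every batch contributing to the loss, which matters for FPFT's throughput since each step pays the full-model backward cost regardless of how much of the batch is padding.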
