Full-parameter fine-tuning (FPFT) is a method that adapts pre-trained neural networks by updating every parameter, ensuring maximal model adaptability.
It achieves superior accuracy, especially in high-stakes tasks such as medical QA, but demands extensive computing and memory resources.
Recent innovations like LOMO, HiFT, and QFT optimize resource usage, enabling FPFT on commodity hardware while preserving performance.
Full-parameter fine-tuning (FPFT), also termed full fine-tuning (FFT), is the process of adapting a pre-trained neural network by updating all its parameters—typically numbering in the billions for contemporary language or vision models—on downstream tasks and datasets. FPFT offers the highest possible adaptation capacity, in contrast to parameter-efficient fine-tuning (PEFT), which restricts updates to a small, structured subset of parameters. FPFT continues to serve as the performance benchmark across domains and remains the method of choice when maximal accuracy, robustness, or representational coverage are required, albeit at considerable memory, compute, and engineering cost.
1. Mathematical Formulation and Algorithmic Workflow
The objective of full-parameter fine-tuning is to minimize a downstream task loss over all model parameters θ∈Rd. For a pre-trained model f(x;θ) and data distribution D={(x,y)}, the optimization problem is:
\[
\theta^{*} \;=\; \arg\min_{\theta \in \mathbb{R}^{d}} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}(f(x;\theta),\, y)\big],
\]
where L denotes the task-specific loss, commonly cross-entropy for classification or language modeling tasks (Liu et al., 28 May 2025).
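As a concrete toy instance of this objective, the sketch below fine-tunes every parameter of a small "pre-trained" softmax classifier by gradient descent on the cross-entropy loss. The model, data, and learning rate are illustrative stand-ins, not taken from any cited paper:

```python
import numpy as np

# FPFT in miniature: ALL parameters theta = (W, b) of a "pre-trained"
# softmax classifier are updated to minimize cross-entropy on downstream data.
rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, b, X, y):
    p = softmax(X @ W + b)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

# "Pre-trained" parameters and downstream data (toy stand-ins).
W = rng.normal(size=(5, 3)); b = np.zeros(3)
X = rng.normal(size=(64, 5)); y = rng.integers(0, 3, size=64)

lr = 0.5
loss_before = cross_entropy(W, b, X, y)
for _ in range(200):
    p = softmax(X @ W + b)
    g = (p - np.eye(3)[y]) / len(y)   # dL/dlogits
    W -= lr * (X.T @ g)               # update every parameter (theta in R^d)
    b -= lr * g.sum(axis=0)
loss_after = cross_entropy(W, b, X, y)
assert loss_after < loss_before       # the downstream loss decreases
```

Unlike a PEFT method, no coordinate of θ is frozen or reparameterized; the update direction lives in the full d-dimensional space.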
2. Theoretical Properties and Representational Capacity
FPFT occupies the entire parameter manifold Θ=Rd, allowing unconstrained movement in function space. A formal distinction with PEFT is found in (Liu et al., 28 May 2025):
Function class (FPFT):
\[
\mathcal{F}_{\text{full}} = \{\, f(x;\theta) : \theta \in \mathbb{R}^{d} \,\}.
\]
Function class (PEFT):
\[
\mathcal{F}_{\text{peft}} = \{\, f(x;\theta_{0} + g(\Phi)) : \Phi \in \mathbb{R}^{k},\ k \ll d \,\}.
\]
PEFT’s updates are restricted to a k-dimensional submanifold, strictly contained in the d-dimensional parameter space realized by FPFT (Theorem 1), as g is not surjective for k<d (via Sard's theorem).
Representational bounds: Theorem 2 quantifies that the representational shift in PEFT is controlled by the magnitudes ∥Φk∥ and that the overall flexibility is strictly less than FPFT.
Rank constraints: For low-rank PEFT (e.g., LoRA), the error in approximating a full update is lower-bounded by spectral tail terms beyond the rank budget (Eckart–Young–Mirsky).
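The Eckart–Young–Mirsky bound can be checked numerically: for any rank budget r, the best rank-r approximation of a full update ΔW (a random stand-in here) leaves exactly the spectral tail beyond r as residual error, so no rank-r PEFT update can do better:

```python
import numpy as np

# Eckart-Young-Mirsky: the optimal rank-r approximation error (Frobenius)
# equals the spectral tail sqrt(sum of sigma_i^2 for i > r).
rng = np.random.default_rng(1)
dW = rng.normal(size=(32, 32))           # stand-in for a full FPFT update
U, s, Vt = np.linalg.svd(dW, full_matrices=False)

r = 4                                    # LoRA-style rank budget
dW_r = (U[:, :r] * s[:r]) @ Vt[:r, :]    # best possible rank-r approximation
err = np.linalg.norm(dW - dW_r, "fro")
tail = np.sqrt((s[r:] ** 2).sum())       # singular values beyond the budget

assert np.isclose(err, tail)             # the lower bound is attained exactly
```

When the spectrum of ΔW decays slowly, this tail stays large for any small r, which is precisely the regime where low-rank PEFT saturates below FPFT.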
3. Practical Implementations and Memory Optimization
Conventional FPFT is memory-intensive, limiting accessibility:
AdamW doubles memory for two optimizer states (momentum/variance) per parameter.
All gradients must be retained in memory until optimizer step.
Large LLMs (e.g., Llama-2 70B) require distributed compute, often on AI supercomputers (e.g., Cerebras CG-1; system RAM >1 PB) (Christophe et al., 23 Apr 2024).
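The >100 GB model-state figure for a 7B model follows from simple arithmetic on AdamW's fp32 state (a back-of-envelope sketch; activation and residual buffers are excluded):

```python
# AdamW in fp32 holds four tensors per parameter: the weight, its gradient,
# and two optimizer moments (momentum and variance), at 4 bytes each.
params = 7e9                     # a 7B-parameter model (e.g., Llama-2 7B)
bytes_per_param = 4 * 4          # weight + gradient + momentum + variance
model_state_gb = params * bytes_per_param / 1e9
print(round(model_state_gb))     # ~112 GB of model state alone
```

Activations, residual buffers, and framework overhead come on top of this, which is why a 7B model already exceeds any single consumer GPU under plain AdamW-fp32.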
LOMO (LOw-Memory Optimization) (Lv et al., 2023): Fuses gradient computation and update, eliminating optimizer state and peak gradient memory, reducing memory usage to 10.8% of AdamW/DeepSpeed baselines; enables end-to-end tuning of a 65B model on 8×24 GB GPUs.
HiFT (Hierarchical FPFT) (Liu et al., 26 Jan 2024): Activates only a subset (e.g., 1/k) of layer groups per step, reducing in-memory gradients and optimizer states by a factor of k. Enables 7B–13B models to be fine-tuned full-parameter on consumer GPUs (<32 GB), with only a 1.1–1.3× increase in wall-clock time.
QFT (Quantized FPFT) (Li et al., 2023): Stores all training states (weights, gradients, optimizer moments) in INT8, leveraging quantizer-theoretic bounds and the Lion optimizer's sign-invariant update, achieving a model-state memory reduction to 21% of fp32 baselines while staying within 0.6 points of full-precision FPFT.
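The fused compute-update-free pattern at the heart of LOMO can be sketched on a toy two-layer network. This is an assumed simplification in plain NumPy, not the actual LOMO implementation, which hooks into the framework's autograd backward pass:

```python
import numpy as np

# LOMO's core idea, simplified: during backprop, each layer's gradient is
# computed, applied as a plain SGD step, and freed immediately -- no optimizer
# state is kept, and all gradients are never resident at once.
rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(16, 1))
X = rng.normal(size=(32, 8))
ytrue = rng.normal(size=(32, 1))
lr = 0.01

def loss(W1, W2):
    return float(((np.maximum(X @ W1, 0) @ W2 - ytrue) ** 2).mean())

before = loss(W1, W2)
for _ in range(100):
    h = X @ W1
    a = np.maximum(h, 0)                 # ReLU
    yhat = a @ W2
    dy = 2 * (yhat - ytrue) / len(X)     # dL/dyhat
    da = dy @ W2.T                       # backward signal (pre-update weights)
    # Layer 2: compute grad, apply SGD update, free it immediately.
    gW2 = a.T @ dy
    W2 -= lr * gW2
    del gW2
    # Layer 1: same fused compute-update-free pattern.
    gW1 = X.T @ (da * (h > 0))
    W1 -= lr * gW1
    del gW1
after = loss(W1, W2)
assert after < before                    # full-parameter training still works
```

At any instant only one layer's gradient is alive, which is why peak gradient memory shrinks from the whole model to a single layer; dropping optimizer state entirely (SGD instead of AdamW) removes the other two state tensors.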
4. Empirical Performance: FPFT vs. PEFT
FPFT consistently outperforms PEFT on complex, high-capacity tasks and in adversarial settings:
Med42 (Llama-2 70B, medical QA): FPFT achieves 71.9% average accuracy on USMLE vs. 68.3% for LoRA, and 60.9% on MedMCQA vs. 54.7% for LoRA, a 3–6 point advantage, at the cost of updating ∼700× more parameters (Christophe et al., 23 Apr 2024).
Comparative studies (GLUE, reasoning, adversarial benchmarks): On difficult reasoning tasks (GSM8K, SQL, instruction following), FPFT exceeds LoRA by 2–4% absolute. In adversarial settings (AdvGLUE, SQuAD perturbations), PEFT's accuracy declines by 1–6% relative to FPFT (Liu et al., 28 May 2025).
PEFT performance on "easy" tasks: On standard GLUE, LoRA can match or slightly outperform FPFT, owing to task simplicity and low intrinsic dimensionality.
Scaling with data: In few-shot regimes (k < 20 examples), LoRA can briefly match or exceed FPFT, but for k > 100, FPFT's per-sample gains outpace PEFT's, attributable to the model's higher underlying VC-dimension.
5. Memory/Compute Trade-offs and Resource-Aware Strategies
FPFT imposes significant demands:
A modern 7B LLM (e.g., Llama-2) requires >100 GB of memory for all parameter, optimizer-state, and activation buffers under AdamW-fp32 (Liu et al., 26 Jan 2024).
Consumer-level GPUs necessitate algorithmic workarounds; the memory savings of LOMO, HiFT, and QFT are largely orthogonal and can be composed for further gains.
Memory-footprint formulas (HiFT, with k layer groups and W_1 the model size):
\[
M_{\text{full}} \approx 4 \cdot W_{1} + M_{\text{resid}},
\qquad
M_{\text{HiFT}} \approx W_{1}\,(1 + 3/k) + M'_{\text{resid}},
\]
yielding >60% memory reduction for k ≈ 34 (one layer per group, Llama-2 7B).
6. Full-Rank vs. Low-Rank Parameterization: Implications and Extensions
FPFT is unconstrained in rank: every weight matrix can evolve with full degrees of freedom. By contrast, LoRA and other PEFT approaches enforce a rank-r structure, intrinsically bounding the representational shift (Albert et al., 3 Feb 2025, Liu et al., 28 May 2025). If the singular-value spectrum of the full update ΔW decays slowly (high effective rank), LoRA saturates early; full-rank approaches, whether learning the full ΔW directly (FPFT) or via random-basis decompositions (RandLoRA), recover accuracy plateaus otherwise unreachable by PEFT.
Empirical comparisons (vision, language, vision-language): LoRA underperforms FPFT on high-complexity tasks (e.g., CLIP V+L, LLM reasoning), while full-rank parameter-efficient methods (e.g., RandLoRA) can close the gap, further confirming the intrinsic value of full-rank adaptation (Albert et al., 3 Feb 2025).
7. Guidance and Best Practices
FPFT is the preferred strategy when:
Maximum accuracy is necessary, particularly in clinical, legal, or other high-stakes settings where the empirical gains are non-negligible (e.g., Med42's 3–6 point improvement on USMLE).
Data distribution shift, adversarial robustness, or out-of-distribution generalization is critical, since full function-class adaptability and noise resilience can only be ensured by unconstrained updates (Liu et al., 28 May 2025).
Sufficient compute/memory resources are available, or when leveraging recent innovations in resource-efficient optimization (LOMO, HiFT, QFT) enables large-scale FPFT on commodity hardware.
For rapid prototyping, resource-constrained environments, or “easy” tasks with limited complexity, PEFT may suffice with substantially reduced cost.
Recommended stabilizing strategies—e.g., masked-response loss, 4K token sequence packing, global norm clipping, cosine decay schedule—further enhance FPFT outcomes on large instruction datasets (Christophe et al., 23 Apr 2024).
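The HiFT memory-footprint formulas from Section 5 can be sanity-checked numerically (residual terms M_resid are dropped for simplicity):

```python
# HiFT keeps only 1/k of the gradients and optimizer states resident,
# so model-state memory falls from 4*W1 toward W1 as k grows.
W1 = 1.0                      # model size, normalized
k = 34                        # layer groups (one layer per group, Llama-2 7B)
m_full = 4 * W1               # weights + gradients + two AdamW moments
m_hift = W1 * (1 + 3 / k)     # weights always resident; states for one group
reduction = 1 - m_hift / m_full
print(f"{reduction:.0%}")     # comfortably above the quoted 60% reduction
```

With the residual terms included the saving is smaller, which is consistent with the >60% figure quoted in Section 5 rather than the ~73% this idealized calculation gives.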
References
"Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches" (Christophe et al., 23 Apr 2024)
"Full Parameter Fine-tuning for Large Language Models with Limited Resources" (Lv et al., 2023)
"HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy" (Liu et al., 26 Jan 2024)
"QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources" (Li et al., 2023)
"RandLoRA: Full-rank parameter-efficient fine-tuning of large models" (Albert et al., 3 Feb 2025)