Orthogonal Finetuning Methods

Updated 6 May 2026

Orthogonal Finetuning is a parameter-efficient adaptation technique that applies orthogonal transformations to pretrained model weights while preserving their angular geometry.
The methodology employs efficient parameterizations such as block-diagonal, Givens rotations, and Householder reflections to maintain hyperspherical energy, reducing overfitting and catastrophic forgetting.
OFT variants have improved performance in language, vision, and diffusion models while lowering memory and computational costs and sustaining semantic stability.

Orthogonal Finetuning (OFT) encompasses a family of parameter-efficient adaptation methods for deep neural networks, in which adaptation is achieved via orthogonal transformations of the pretrained model’s weights rather than the conventional additive or low-rank approaches. The orthogonality constraint rigorously preserves the pairwise angular geometry (“hyperspherical energy”) among neurons, offering improved control over catastrophic forgetting and overfitting. Modern OFT research covers multiple efficient parameterizations, algorithmic variants, downstream domains (language, vision, diffusion), and mathematical analyses. This article surveys the principal theoretical foundations, algorithmic realizations, empirical highlights, and trade-offs of OFT.

1. Geometric Principle: Hyperspherical Energy and Orthogonal Transformations

OFT is characterized by strictly preserving the hyperspherical energy (HE) of weight matrices during adaptation. For a weight matrix $W = [w_1,\dots,w_n] \in \mathbb{R}^{d \times n}$, the neurons are normalized as $\hat w_i = w_i/\|w_i\|$. The HE is defined as:

$\mathrm{HE}(W) = \sum_{i \neq j} \|\hat w_i - \hat w_j\|^{-1}$

HE measures the angular “spread” of neurons on the unit sphere, with lower HE indicating more uniform arrangements. During standard finetuning (including Direct Preference Optimization, DPO), large shifts in HE are empirically found to signal representational collapse, overfitting (long, generic generations), and loss of expressiveness. Orthogonal adaptation applies a transformation $W \mapsto R W$ with $R^\top R = I$. Because $\|R w_i\| = \|w_i\|$, the normalized neurons of $RW$ are $R \hat w_i$, such that

$\|R \hat w_i - R \hat w_j\| = \|\hat w_i - \hat w_j\| \implies \mathrm{HE}(R W) = \mathrm{HE}(W)$

Thus, all pairwise neuron angles and spectral properties are exactly retained, ensuring semantic stability and bias control throughout adaptation (Yang et al., 2024, Qiu et al., 2023).
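
The invariance can be checked numerically. The following is a minimal NumPy sketch, not code from the cited papers: the helper name `hyperspherical_energy`, the matrix sizes, and the use of a QR factorization to draw a random orthogonal $R$ are all illustrative assumptions.

```python
import numpy as np

def hyperspherical_energy(W: np.ndarray) -> float:
    """Sum of inverse pairwise distances between the unit-normalized columns of W."""
    W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)  # normalize each neuron (column)
    diff = W_hat[:, :, None] - W_hat[:, None, :]           # shape (d, n, n)
    dist = np.linalg.norm(diff, axis=0)                    # pairwise distances, shape (n, n)
    off_diag = ~np.eye(W.shape[1], dtype=bool)             # exclude i == j
    return float(np.sum(1.0 / dist[off_diag]))

rng = np.random.default_rng(0)
d, n = 8, 5                                                # illustrative sizes
W = rng.standard_normal((d, n))

# Assumption: a random orthogonal matrix taken from the Q factor of a QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

he_before = hyperspherical_energy(W)
he_rotated = hyperspherical_energy(R @ W)                  # orthogonal update
he_additive = hyperspherical_energy(W + 0.5 * rng.standard_normal((d, n)))  # unconstrained update

print(he_before, he_rotated, he_additive)
```

Under these assumptions the first two printed values agree to floating-point precision, while an unconstrained additive update generally changes the energy.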

2. Algorithmic Realizations: Parameterizations and Efficient Implementations

Full dense orthogonal adaptation requires $O(d^2)$ parameters and does not scale to modern models. Recent OFT research introduces a variety of parameter-efficient and computationally expedient parameterizations:

Block-Diagonal and Givens Rotations: The orthogonal transform is often approximated by block-diagonal orthogonal matrices (e.g., $R = \mathrm{diag}(R_1,\dots,R_r)$ with each $R_k$ a $b \times b$ orthogonal matrix). Advanced schemes use sparse compositions of 2×2 Givens rotations (“BIG” or “quasi-Givens” structures), retaining expressivity with $O(d)$ parameters applied as a sequence of sparse matrix products. Soft orthogonality can be enforced to allow controlled scaling/rotation (Ma et al., 2024, Yang et al., 2024).
Cayley and Cayley–Neumann Parameterizations: The Cayley transform,

$R = (I + Q)(I - Q)^{-1}, \quad Q^\top = -Q$

provides a differentiable, unconstrained parameterization. The Cayley–Neumann variant approximates $(I - Q)^{-1}$ with a truncated Neumann series, yielding

$R \approx (I + Q)\left(I + \sum_{i=1}^{k} Q^i\right)$

for numerically stable and efficient updates (Qiu et al., 24 Jun 2025).
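
As an illustration, the sketch below builds the standard Cayley transform $R = (I + Q)(I - Q)^{-1}$ from a skew-symmetric $Q$ and compares it with a truncated Neumann approximation of $(I - Q)^{-1}$. The function names, the truncation order `k`, and the scale of $Q$ are assumptions for the example. The Neumann series converges only when the spectral norm of $Q$ is below 1, so the example keeps $Q$ small.

```python
import numpy as np

def skew(A: np.ndarray) -> np.ndarray:
    """Project an unconstrained square matrix onto the skew-symmetric matrices."""
    return 0.5 * (A - A.T)

def cayley(Q: np.ndarray) -> np.ndarray:
    """Exact Cayley transform; (I + Q) and (I - Q)^{-1} commute, so a linear solve suffices."""
    I = np.eye(Q.shape[0])
    return np.linalg.solve(I - Q, I + Q)

def cayley_neumann(Q: np.ndarray, k: int = 8) -> np.ndarray:
    """Approximate (I - Q)^{-1} by I + Q + ... + Q^k (valid for ||Q|| < 1)."""
    I = np.eye(Q.shape[0])
    series, term = I.copy(), I.copy()
    for _ in range(k):
        term = term @ Q
        series = series + term
    return (I + Q) @ series

rng = np.random.default_rng(0)
Q = 0.05 * skew(rng.standard_normal((6, 6)))   # small skew-symmetric parameter

R_exact = cayley(Q)
R_approx = cayley_neumann(Q, k=8)

orth_error = np.linalg.norm(R_exact.T @ R_exact - np.eye(6))
approx_error = np.linalg.norm(R_exact - R_approx)
print(orth_error, approx_error)
```

Setting $Q = 0$ gives $R = I$ in both variants, which is one way to start adaptation from the pretrained weights. The truncated form avoids the matrix inverse at the price of being only approximately orthogonal.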

Group-and-Shuffle (GS) Factorization: GSOFT generalizes prior structured approaches, composing the orthogonal update as a product of block-diagonal orthogonal matrices interleaved with permutation (“shuffle”) matrices. This allows high expressivity and low parameter cost with just two block-diagonal stages and two shuffles, encompassing block-diagonal, butterfly, and Monarch as special cases (Gorbunov et al., 2024, Liu et al., 2023).
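
The effect of interleaving block-diagonal factors with a shuffle can be seen in a small example. This is a simplified sketch and not the exact GSOFT factorization: it uses one fixed transpose-style permutation between two block-diagonal orthogonal factors, and all names and sizes are hypothetical.

```python
import numpy as np

def random_block_diag_orthogonal(num_blocks: int, block_size: int, rng) -> np.ndarray:
    """Block-diagonal matrix whose diagonal blocks are random orthogonal matrices."""
    d = num_blocks * block_size
    M = np.zeros((d, d))
    for k in range(num_blocks):
        Qk, _ = np.linalg.qr(rng.standard_normal((block_size, block_size)))
        s = k * block_size
        M[s:s + block_size, s:s + block_size] = Qk
    return M

def shuffle_permutation(num_blocks: int, block_size: int) -> np.ndarray:
    """Permutation matrix that interleaves coordinates across blocks."""
    d = num_blocks * block_size
    perm = np.arange(d).reshape(num_blocks, block_size).T.reshape(-1)
    return np.eye(d)[perm]

rng = np.random.default_rng(0)
num_blocks, block_size = 4, 4
d = num_blocks * block_size

L_blk = random_block_diag_orthogonal(num_blocks, block_size, rng)
R_blk = random_block_diag_orthogonal(num_blocks, block_size, rng)
P = shuffle_permutation(num_blocks, block_size)

A = L_blk @ P @ R_blk                                 # two block-diagonal stages, one shuffle

nnz_factor = int(np.count_nonzero(np.abs(L_blk) > 1e-12))
nnz_product = int(np.count_nonzero(np.abs(A) > 1e-12))
print(nnz_factor, nnz_product, d * d)
```

Each factor stores only `num_blocks * block_size**2` entries, yet the product is orthogonal and mixes coordinates across blocks, which a single block-diagonal matrix cannot do.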

Input-Centric OFTv2: Rather than materializing the transformed weight $RW$, OFTv2 applies $R^\top$ and $W^\top$ to the input sequentially as matrix-vector products, $y = W^\top (R^\top x)$, reducing runtime from cubic to quadratic in the layer dimension and lowering peak memory (Qiu et al., 24 Jun 2025).
Principal Subspace Adaptation (MOFT): Orthogonality is restricted to a low-rank principal subspace obtained from the SVD of the pretrained weight, so that the orthogonal transform acts only on the leading singular directions. Adaptation in this basis preserves low-rank hyperspherical energy at a memory cost comparable to low-rank adapters, subject to the method's commutativity constraints (Wu et al., 16 May 2025).
Householder Reflections (HOFT): Accumulating Householder reflections parameterizes orthogonal updates as a product $R = \prod_{k=1}^{m} H_k$ with $H_k = I - 2\,u_k u_k^\top / \|u_k\|^2$. The scaled variant SHOFT inserts a learnable diagonal scaling for additional expressivity while maintaining geometric constraints (Arcas et al., 22 May 2025).
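
To make the Householder construction concrete, the sketch below accumulates a product of reflections $H_k = I - 2 u_k u_k^\top / \|u_k\|^2$ without forming each $H_k$ explicitly. It illustrates the standard construction under assumed names and sizes and is not the HOFT implementation.

```python
import numpy as np

def householder_product(U: np.ndarray) -> np.ndarray:
    """Return H_1 H_2 ... H_m, where row k of U is the reflection vector u_k."""
    m, d = U.shape
    R = np.eye(d)
    for u in U:
        u = u / np.linalg.norm(u)              # unit vector, so H = I - 2 u u^T
        R = R - 2.0 * np.outer(R @ u, u)       # R @ H computed as a rank-one update
    return R

rng = np.random.default_rng(0)
m, d = 4, 6                                    # illustrative: 4 reflections in 6 dimensions
U = rng.standard_normal((m, d))                # m * d trainable numbers in this sketch

R = householder_product(U)
orth_error = np.linalg.norm(R.T @ R - np.eye(d))
det_R = np.linalg.det(R)
print(orth_error, det_R)
```

Each reflection has determinant $-1$, so an even number of reflections yields a rotation with determinant $+1$. The parameter count grows as $m \cdot d$ rather than $d^2$.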

3. Algorithmic Procedures and Integration

The modern OFT pipeline consists of:

Decomposing frozen weight $W$ via the chosen parameterization.
Initializing adaptation parameters (rotation angles, block matrices, Householder vectors, scaling vectors) to identity/zero to avoid sudden shifts.
For each mini-batch, applying the orthogonal transformation (matrix-matrix or sequential mat-vec) and—if present—learned scaling.
Performing the forward pass through the transformed weights, computing task loss (e.g., DPO loss, classification, generation).
Updating only the adaptation parameters (gradient-based, with manifold-aware retraction if needed).
Optionally projecting onto feasible sets (for constrained variants).

At inference, the learned orthogonal transform is merged into the base weights, incurring no additional computational cost.
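
A minimal sketch of the forward pass and the inference-time merge, assuming a Cayley parameterization with zero initialization. The gradient step is replaced by a random stand-in for the trained parameters, and all names (`rotation`, `forward_input_centric`, `merge`) are hypothetical rather than taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, batch = 6, 4, 3
W = rng.standard_normal((d, n))          # frozen pretrained weight, columns are neurons
X = rng.standard_normal((batch, d))      # a mini-batch of inputs, one per row

def rotation(A: np.ndarray) -> np.ndarray:
    """Cayley transform of the skew-symmetric part of the trainable matrix A."""
    Q = 0.5 * (A - A.T)
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - Q, I + Q)

def forward_input_centric(X: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply R to the inputs first, then the frozen W; the product R @ W is never formed."""
    return (X @ rotation(A)) @ W

def merge(A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Fold the learned transform into the base weight for inference."""
    return rotation(A) @ W

A_init = np.zeros((d, d))                        # zero init gives R = I
Y_init = forward_input_centric(X, A_init, W)     # equals the pretrained output X @ W

A_trained = 0.1 * rng.standard_normal((d, d))    # stand-in for parameters after training
Y_input_centric = forward_input_centric(X, A_trained, W)
Y_merged = X @ merge(A_trained, W)               # plain linear layer with merged weight

print(np.abs(Y_input_centric - Y_merged).max())
```

In this sketch only `A` would receive gradients. Once merged, the layer is an ordinary linear map with weight `merge(A_trained, W)`, which is why inference carries no extra cost.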

4. Empirical Efficacy and Domain Coverage

OFT and its variants consistently yield gains in regularization, expressivity, and parameter efficiency across domains:

LLMs: RoPO (orthogonal DPO) outperforms DPO by 8–10 points on MT-Bench and AlpacaEval 2, increases generation diversity by 6 points, and maintains knowledge retention (commonsense QA accuracy 85.43% vs DPO’s 83.84%), all with tuning only 0.0086% of parameters (Yang et al., 2024).
Vision and Language Understanding Models: On ViT (vision) and DeBERTaV3 (language understanding) backbones, block-diagonal and butterfly OFT, GSOFT, and qGOFT outperform or match LoRA with fewer parameters (Ma et al., 2024, Gorbunov et al., 2024, Liu et al., 2023, Yang et al., 17 Jul 2025).
Diffusion Models: OFT and COFT (constrained variant) preserve subject fidelity and prompt alignment in DreamBooth and Controllable generation tasks, improving DINO/CLIP metrics and stability compared to LoRA and vanilla fine-tuning (Qiu et al., 2023).
Mixture-of-Experts: Imposing a hard Gram–Schmidt projection on MoE experts (OMoE) enforces orthogonalization on the Stiefel manifold, maximizing expert diversity and improving multi-task generalization with fewer experts (Feng et al., 17 Jan 2025).
Representation Similarity: OFT-based RepSim achieves a 30% increase in representation similarity, a 42% reduction in sharpness, and accuracy within 1 point of full finetuning in medical image analysis (Zu et al., 10 Mar 2025).
Quantized Models: Input-centric and GS-structured OFT enable seamless integration with 4-bit quantized weights, outperforming QLoRA on both runtime and accuracy (Qiu et al., 24 Jun 2025).

5. Theoretical Guarantees and Properties

Hyperspherical Energy Invariance: Orthogonal transforms preserve $\mathrm{HE}(W)$ and hence the geometric configuration of neuron directions (Yang et al., 2024, Qiu et al., 2023).
Spectral Norm Preservation: For any input $x$, $\|R x\|_2 = \|x\|_2$, and hence $\|R W\|_2 = \|W\|_2$ (if $R^\top R = I$), avoiding activation drift and gradient explosion (Qiu et al., 2023).
Parameter/Memory Efficiency: Depending on parameterization, OFT can reduce the adaptation parameter count by orders of magnitude. Memory-efficient designs like MOFT limit stored activations to the low-rank subspace, matching LoRA's activation profile (Wu et al., 16 May 2025).
Expressivity–Efficiency Trade-off: GS-structured, butterfly, and Givens-based variants exhibit quantifiable trade-offs of density, expressivity, and sparsity (e.g., BOFT achieves dense coverage with $O(d \log d)$ parameters) (Liu et al., 2023, Gorbunov et al., 2024, Ma et al., 2024).
Generalization Bounds: For approximately orthogonal adapters, the model’s Rademacher-based generalization bound is tighter than for unconstrained low-rank adapters, due to the norm constraint on the adaptation (Yang et al., 17 Jul 2025).
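
The spectral property follows from $(RW)^\top (RW) = W^\top R^\top R W = W^\top W$: the Gram matrix, and therefore every singular value, is unchanged. The check below is a small illustrative sketch with assumed sizes and a random orthogonal $R$ drawn via QR.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 6
W = rng.standard_normal((d, n))
R, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthogonal matrix (assumption)

sv_W = np.linalg.svd(W, compute_uv=False)            # singular values, descending
sv_RW = np.linalg.svd(R @ W, compute_uv=False)

gram_gap = np.abs((R @ W).T @ (R @ W) - W.T @ W).max()
spectral_norm_W = sv_W[0]
spectral_norm_RW = sv_RW[0]

print(gram_gap, spectral_norm_W, spectral_norm_RW)
```

The whole singular spectrum is preserved, not only the largest singular value, so the layer's gain on any input direction stays within the range set by the pretrained weight.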

6. Extensions and Limitations

Adapter Fusion: The geometry of structured orthogonal parameterizations (e.g., Group-and-Shuffle) admits closed-form, training-free geodesic interpolation of multiple adapters, enabling direct fusion of task- and style-specific adaptations for diffusion models (Aliev et al., 6 Apr 2026).
Diversity Promotion: Orthogonality in Mixture-of-Experts mitigates expert collapse and maximizes angular separation, yielding stable specialization without additional loss terms (Feng et al., 17 Jan 2025).
Manifold Optimization: Several OFT variants rely on explicit optimization on the Stiefel manifold or approximate projections (e.g., QR-based retractions), ensuring update feasibility (Zu et al., 10 Mar 2025).
Scalability: Input-centric and block/grouped designs alleviate both compute and memory bottlenecks, with up to $10\times$ speedup and $3\times$ less GPU memory over naive weight-centric OFT (Qiu et al., 24 Jun 2025).

Limitations noted in the literature include:

Block-diagonal designs may limit expressivity when the number of blocks is very large, since each block then mixes only a few dimensions.
Householder and SVD subspace approaches add overhead for small layers.
The strict preservation of angles may constrain magnitude adaptation (relaxed via scaling vectors).
Some parameterizations still have residual cubic computational cost for large unstructured layers; ongoing work investigates further approximate structures (e.g., butterfly, GS, quasi-Givens) (Wu et al., 16 May 2025, Gorbunov et al., 2024, Arcas et al., 22 May 2025).

7. Summary Table: Key OFT Variants and Properties

Variant	Parameterization	Memory/Compute	Key Property
RoPO (Yang et al., 2024)	BIG (Givens)	0.0086% of parameters tuned	Hard HE preservation, DPO
OFTv2 (Qiu et al., 24 Jun 2025)	Input-centric Cayley	Quadratic-cost mat-vec forward	Scalable, quantization-aware
GSOFT (Gorbunov et al., 2024)	Group-and-Shuffle	Two block-diagonal stages, two shuffles	Dense, sparse, or hybrid
BOFT (Liu et al., 2023)	Butterfly	$O(d \log d)$ parameters	Expressivity/speed tradeoff
qGOFT (Ma et al., 2024)	Sequential Givens	$O(d)$ parameters	Fast, quasi-orthogonal
MOFT (Wu et al., 16 May 2025)	SVD+subspace	LoRA-like activation memory	Angle preservation, memory
HOFT (Arcas et al., 22 May 2025)	Householder	Added overhead on small layers	Full orthogonal coverage