
OrthoMerge: Orthogonal Model Merging

Updated 6 May 2026
  • OrthoMerge is a family of methods that use explicit orthogonality in parameter, data, or transformation spaces to mitigate destructive interference when merging task-specific model adaptations.
  • It leverages techniques like DO-Merging, OSRM, and Lie-Manifold Merge to decouple magnitude from direction, enforce subspace orthogonality, and preserve geometric structure.
  • Empirical evaluations across vision, language, and multi-modal benchmarks demonstrate significant accuracy gains and reduced merge loss compared to traditional model merging approaches.

OrthoMerge is a family of methods for model merging that employ explicit orthogonality—in parameter space, data space, or transformation space—to mitigate destructive interference when combining task-specific model adaptations. It is particularly relevant for merging low-rank adaptation (LoRA) modules and orthogonally fine-tuned models, and encompasses several distinct but related frameworks with rigorous theoretical underpinnings and broad empirical validation (Zheng et al., 21 May 2025, Yang et al., 5 Feb 2026, Zhang et al., 28 May 2025).

1. Motivation and Problem Setting

Model merging seeks to integrate multiple specialized models into a single unified set of weights, reducing deployment, training, and inference costs. Conventional approaches such as Task Arithmetic (simple averaging or weighted sum of parameter deltas) are effective for fully fine-tuned models but fail for LoRA or other structured adaptation methods, resulting in performance degradation (Zheng et al., 21 May 2025, Zhang et al., 28 May 2025). The underlying reasons include:

  • High column-wise magnitude variance in LoRA modules causing dominance by a single task
  • Interference between subspaces associated with different tasks, particularly when their supports overlap
  • Neglect of geometric properties such as preservation of hyperspherical energy or orthogonality inherent to some fine-tuning schemes

OrthoMerge addresses these pitfalls by bringing task decoupling, orthogonality, and geometric structure preservation into merging operations.

2. Theoretical Principles and Frameworks

Decoupling Magnitude and Direction (DO-Merging)

For LoRA, the weight update for task $i$ at a given layer is $\Delta W_i = B_i A_i$, where $B_i \in \mathbb{R}^{m \times r}$ and $A_i \in \mathbb{R}^{r \times n}$. DO-Merging decomposes each $\Delta W_i$ into:

  • Magnitude vector $\alpha_i[j] = \|(\Delta W_i)_{:,j}\|_2$, $j = 1, \ldots, n$
  • Direction matrix $\bar W_i$ with columns normalized by $\alpha_i$

Each update can thus be written as $\Delta W_i = \bar W_i \, \text{Diag}(\alpha_i)$, where right-multiplication by the diagonal matrix rescales each column. Decoupling isolates cross-task magnitude variance from the mixing of directions, preventing parameter dominance and information loss (Zheng et al., 21 May 2025).
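The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `decompose`, the `eps` guard against zero columns, and the toy shapes are assumptions made for the example.

```python
import numpy as np

def decompose(delta_w, eps=1e-12):
    """Split an update into per-column magnitudes and a unit-column direction matrix."""
    alpha = np.linalg.norm(delta_w, axis=0)        # shape (n,): L2 norm of each column
    direction = delta_w / np.maximum(alpha, eps)   # broadcasts over columns
    return alpha, direction

# Toy LoRA update (shapes are illustrative only)
rng = np.random.default_rng(0)
m, r, n = 6, 2, 4
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
delta_w = B @ A

alpha, direction = decompose(delta_w)

# Column-wise rescaling recovers the original update
reconstructed = direction * alpha
```

Because each column of `direction` has unit norm, any difference in scale between tasks lives entirely in `alpha` and can be handled separately from the directions.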

Orthogonality Constraints

To further reduce interference, OrthoMerge employs layer-wise orthogonalization of direction matrices. For tasks $i, j$, the directions $\bar W_i$ and $\bar W_j$ are adjusted via small perturbations chosen to minimize a pairwise overlap objective over the perturbed direction matrices.

This penalizes overlap between task-specific updates and is performed data-free, i.e., without input samples. Theoretical results guarantee reduction of merge loss (expected performance drop) due to magnitude variance and task conflict (Zheng et al., 21 May 2025).
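A data-free orthogonalization of this kind might look like the sketch below. The exact objective is an assumption for illustration: it uses the sum of squared Frobenius inner products between perturbed directions, plus a small penalty `lam` that keeps the perturbations small. The names `pairwise_overlap` and `orthogonalize`, the step size, and the step count are likewise hypothetical.

```python
import numpy as np

def pairwise_overlap(dirs):
    """Sum of squared Frobenius inner products over all task pairs."""
    total = 0.0
    for i in range(len(dirs)):
        for j in range(i + 1, len(dirs)):
            total += np.sum(dirs[i] * dirs[j]) ** 2
    return total

def orthogonalize(dirs, lr=0.005, steps=200, lam=0.1):
    """Gradient descent on the perturbations (assumed objective, see lead-in)."""
    deltas = [np.zeros_like(d) for d in dirs]
    for _ in range(steps):
        cur = [d + p for d, p in zip(dirs, deltas)]
        grads = []
        for i in range(len(cur)):
            g = 2.0 * lam * deltas[i]              # keeps each perturbation small
            for j in range(len(cur)):
                if j != i:
                    # d/dD_i of <D_i, D_j>^2 is 2 <D_i, D_j> D_j
                    g = g + 2.0 * np.sum(cur[i] * cur[j]) * cur[j]
            grads.append(g)
        deltas = [p - lr * g for p, g in zip(deltas, grads)]
    return [d + p for d, p in zip(dirs, deltas)]

# Three toy direction matrices with unit-norm columns; no input data is used
rng = np.random.default_rng(0)
dirs = []
for _ in range(3):
    w = rng.normal(size=(16, 8))
    dirs.append(w / np.linalg.norm(w, axis=0))

overlap_before = pairwise_overlap(dirs)
overlap_after = pairwise_overlap(orthogonalize(dirs))
```

Only the weight matrices enter the computation, which is what makes the procedure data-free.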

Orthogonal-Subspace Preconditioning

Orthogonal Subspaces for Robust Model Merging (OSRM) constrains the row space of each LoRA module $\Delta W_i = B_i A_i$ before fine-tuning, ensuring that the latent features of all other tasks are orthogonal to that row space. For task $i$:

  • The LoRA “input” matrix $A_i$ is initialized as the bottom-$r$ eigenvectors of the covariance of out-of-task latent features at the given layer.
  • The objective enforces that $A_i$ maps out-of-task latent features to approximately zero, systematically suppressing inter-task crosstalk in parameter updates (Zhang et al., 28 May 2025).
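The initialization step can be illustrated with a small NumPy sketch. The function name `osrm_init`, the use of an uncentered covariance, and the synthetic features are assumptions for the example. The comparison against a random orthonormal initialization is included only to show the effect.

```python
import numpy as np

def osrm_init(other_task_features, rank):
    """Return a (rank, n) matrix whose rows are the eigenvectors of the
    feature covariance with the smallest eigenvalues."""
    h = np.asarray(other_task_features)        # (num_samples, n)
    cov = h.T @ h / h.shape[0]                 # uncentered covariance (assumption)
    _, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    return eigvecs[:, :rank].T

# Synthetic out-of-task features concentrated in a low-dimensional subspace
rng = np.random.default_rng(0)
n, d_sub, rank, k = 16, 6, 4, 200
basis = rng.normal(size=(d_sub, n))
features = rng.normal(size=(k, d_sub)) @ basis + 0.01 * rng.normal(size=(k, n))

a_init = osrm_init(features, rank)
leak_osrm = np.linalg.norm(a_init @ features.T)   # how much other-task signal passes

# Baseline: random orthonormal rows
a_rand = np.linalg.qr(rng.normal(size=(n, rank)))[0].T
leak_rand = np.linalg.norm(a_rand @ features.T)
```

With this initialization, features from the other tasks are almost annihilated by the input matrix, so the adapter's update has little effect on their outputs.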

Geometric Manifold Merging via Lie Theory

When merging models fine-tuned by Orthogonal Finetuning (OFT), each adaptation is represented as an orthogonal matrix, and the merged adaptation is constructed on the orthogonal group manifold:

  1. Map each orthogonal matrix to the Lie algebra (the skew-symmetric matrices)
  2. Merge: combine the mapped elements in the algebra
  3. Map back: return the merged element to the orthogonal group

This approach exactly preserves geometric properties such as norm and inner product, preventing spectral-norm drift and hyperspherical energy loss (Yang et al., 5 Feb 2026).
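The three steps can be sketched as follows. One substitution is made for the sake of a dependency-free example: the Cayley transform, which OFT commonly uses to parameterize orthogonal matrices, stands in for the map between the group and the skew-symmetric matrices. The source does not say which map the paper uses, and a matrix logarithm and exponential would be the other natural choice. The function names are hypothetical.

```python
import numpy as np

def cayley(q):
    """Skew-symmetric Q -> orthogonal R = (I - Q)^{-1} (I + Q)."""
    eye = np.eye(q.shape[0])
    return np.linalg.solve(eye - q, eye + q)

def to_algebra(r):
    """Inverse Cayley map: orthogonal R (no eigenvalue -1) -> skew-symmetric Q."""
    eye = np.eye(r.shape[0])
    return np.linalg.solve(r + eye, r - eye)

def merge_orthogonal(rotations):
    """Map to skew-symmetric matrices, average there, map back to the group."""
    q_mean = np.mean([to_algebra(r) for r in rotations], axis=0)
    return cayley(q_mean)

# Three toy orthogonal adaptations
rng = np.random.default_rng(0)
algebra = []
for _ in range(3):
    g = rng.normal(scale=0.5, size=(4, 4))
    algebra.append(g - g.T)                    # skew-symmetric
rotations = [cayley(q) for q in algebra]

merged = merge_orthogonal(rotations)           # stays orthogonal
naive = np.mean(rotations, axis=0)             # plain average leaves the group
```

The contrast between `merged` and `naive` is the point of the construction: averaging orthogonal matrices directly does not give an orthogonal matrix, while averaging their skew-symmetric images and mapping back does.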

For general finetuned weights, the orthogonal Procrustes problem extracts the closest orthogonal matrix to an update, with the residual handled by standard merging.
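A minimal sketch of this split is shown below, under the assumption that the orthogonal part is defined relative to the base weights, so that the fine-tuned weights are approximated as an orthogonal matrix times the base weights plus a residual. The function names and this exact setup are illustrative, not taken from the paper. The SVD solution to the orthogonal Procrustes problem itself is standard.

```python
import numpy as np

def closest_orthogonal(mat):
    """Orthogonal polar factor U V^T from the SVD mat = U S V^T."""
    u, _, vt = np.linalg.svd(mat)
    return u @ vt

def orthogonal_residual_split(w_base, w_ft):
    """Solve min_Q ||Q w_base - w_ft||_F over orthogonal Q, then take the residual."""
    q = closest_orthogonal(w_ft @ w_base.T)
    residual = w_ft - q @ w_base
    return q, residual

# Toy check: a fine-tuned weight that is a rotated base plus tiny noise
rng = np.random.default_rng(0)
w_base = rng.normal(size=(5, 8))
q_true = np.linalg.qr(rng.normal(size=(5, 5)))[0]
w_ft = q_true @ w_base + 1e-5 * rng.normal(size=(5, 8))

q, residual = orthogonal_residual_split(w_base, w_ft)
```

The orthogonal factors from several tasks could then be merged on the group, and the residuals merged additively with any standard method.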

3. Merging Algorithms and Implementation

Three principal OrthoMerge algorithms are instantiated from these principles:

| Method | Key Step | Where Applied |
|---|---|---|
| DO-Merging | Decouple magnitude and direction, orthogonalize | LoRA module merging (post hoc, data-free) |
| OSRM | Orthogonalize LoRA subspace (pre fine-tuning) | LoRA module merging (pre-finetuning, data-driven) |
| Lie-Manifold Merge | Projection to Lie algebra, merge, map back | Orthogonal Finetuning, general finetuned adapters |

DO-Merging Algorithm (Zheng et al., 21 May 2025):

For each layer and task, compute $\Delta W_i = B_i A_i$, decompose it into magnitude and direction, orthogonalize the directions via small perturbations found by gradient descent, sum the magnitudes and the orthogonalized directions, reconstruct the merged update, and merge it with the base weights.

OSRM Procedure (Zhang et al., 28 May 2025):

Before fine-tuning, collect latent features per task; for each task and layer, initialize $A_i$ to the bottom eigenvectors of the covariance of all other tasks' features. Fine-tune as usual. Merge the LoRA adapters with any standard technique (e.g., Task Arithmetic, Fisher, RegMean).

Orthogonal Model Merging (Yang et al., 5 Feb 2026):

Given OFT-trained models, map orthogonal weight updates to Lie algebra, average, map back to the group. For general adapters, extract orthogonal part via SVD; residuals are merged additively.

4. Empirical Evaluation and Performance

Experimental studies validate OrthoMerge approaches across vision, language, and multi-modal domains:

  • DO-Merging (Zheng et al., 21 May 2025):
    • Vision (ViT-B/32, 8 tasks): Task Arithmetic 74.06%, DO-Merging 77.88% (+3.82%)
    • Medium NLP (T5-base, 8 tasks): Task Arithmetic 77.4%, DO-Merging 80.9% (+3.5%)
    • Large LLMs (LLaMa3-8B, 6 tasks): Task Arithmetic 83.55%, DO-Merging 87.11% (+3.56%)
    • Orthogonalization alone yields ~2%, decoupling alone ~1%, combined ~3% normalized accuracy gains
  • OSRM (Zhang et al., 28 May 2025):
    • On GLUE with RoBERTa-large, average improvement per base merging method: Task Arithmetic +6.6pp, RegMean +1.9pp, Fisher +7.0pp, TIES +5.3pp, EMR +2.1pp
    • Robust to hyperparameters: merge scaling, number of latent features, number of tasks, type of LoRA block
  • Orthogonal Model Merging (Yang et al., 5 Feb 2026):
    • When merging OFT models, in-domain accuracy: OrthoMerge 46.25% vs. baselines 44.10–44.97%; out-of-domain: OrthoMerge 41.80% vs. 40.78–40.97%
    • When applied to general adapters via Orthogonal-Residual Decoupling, consistently boosts all baselines by 0.2–2.4 points
    • Exact preservation of hyperspherical energy; mitigates catastrophic forgetting

5. Integration and Practical Considerations

OrthoMerge algorithms are designed to integrate seamlessly into existing merging pipelines:

  • Plug-and-play: Neither OSRM nor DO-Merging requires changes to existing post-hoc merging code. OSRM acts before fine-tuning; DO-Merging acts after fine-tuning and is data-free.
  • Computational cost: DO-Merging requires only inner products between task updates at each layer, with a count that depends on the number of tasks, and its overhead is measured in GPU-minutes. OSRM costs one eigendecomposition per layer per task.
  • Scalability: Each method is robust to the number of tasks and features; OSRM in particular maintains high performance as the number of tasks varies.
  • Hyperparameters: For LoRA, common settings for the rank and the number of latent features per task apply; the orthogonality constraint can be strict or softly enforced via fine-tuning.

6. Theoretical Guarantees and Analysis

  • Magnitude imbalance: Merge loss is minimized when the merged LoRA modules have matched magnitude vectors $\alpha_i$ (Zheng et al., 21 May 2025).
  • Benefit of decoupling: The expected merge loss is lower for decoupled-then-merged updates than for a naive linear merge when magnitudes differ.
  • Orthogonality reduces conflict: Stricter orthogonality between direction matrices reduces “sign conflicts,” minimizing destructive interference and preserving task-specific signal (Zheng et al., 21 May 2025, Zhang et al., 28 May 2025).
  • Group manifold averaging: For OFT models, Riemannian averaging via Lie algebra preserves norm and rotation, ensuring valid merged adapters (Yang et al., 5 Feb 2026).

A frequent misconception is that parameter-space orthogonality between LoRA deltas $\Delta W_i$ suffices to prevent task interference. However, unless the data features for different tasks are taken into account, latent cross-talk persists at inference because parameter-space separation does not guarantee output-space orthogonality. Data-driven or geometric orthogonality (as in OSRM and manifold-based OrthoMerge) is necessary for robust interference suppression (Zhang et al., 28 May 2025, Yang et al., 5 Feb 2026).
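A two-by-two toy case makes the distinction concrete. The matrices and the input vector below are invented for illustration. The two deltas have zero Frobenius inner product, yet the second one still responds to an input that belongs to the first task.

```python
import numpy as np

# Two task deltas that are orthogonal in parameter space
delta_w1 = np.array([[1.0, 0.0],
                     [0.0, 0.0]])
delta_w2 = np.array([[0.0, 0.0],
                     [1.0, 0.0]])
frob_inner = np.sum(delta_w1 * delta_w2)       # 0.0

# An input typical of task 1
x_task1 = np.array([1.0, 0.0])

# Task 2's delta still changes the output on task 1's input
crosstalk = delta_w2 @ x_task1                 # [0. 1.]

# A data-aware alternative: rows orthogonal to task 1's feature direction
delta_w2_aware = np.array([[0.0, 0.0],
                           [0.0, 1.0]])
no_crosstalk = delta_w2_aware @ x_task1        # [0. 0.]
```

Both versions of the second delta are orthogonal to the first in parameter space, but only the one built with knowledge of task 1's features leaves task 1's outputs untouched.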

OrthoMerge is distinct from pre-merging methods that rely solely on pruning or clustering, and from model soups that apply linear combinations without structure-awareness. Its key contribution is the explicit management of both algebraic and geometric subspace overlap in the merging process.

