TIES-Merging: Robust Model Integration

Updated 10 December 2025
  • TIES-Merging is a model-merging algorithm that fuses fine-tuned neural networks by mitigating redundant-parameter and sign-disagreement interference through trimming, sign election, and disjoint merging.
  • It leverages quantile-based pruning and sign election to selectively combine task-relevant parameter updates, outperforming naïve averaging in multitask and multi-domain settings.
  • Empirical and theoretical analyses showcase its robustness, scalability, and effectiveness in applications like continual pretraining, unlearning, and federated learning.

TIES-Merging is a model-merging algorithm designed to resolve parameter interference and destructive update interactions when fusing multiple fine-tuned neural network models into a single model capable of multitask or multi-domain generalization. The method is prominent within LLM research, continual pre-training, unlearning, and expert composition, offering a robust, data-free strategy for parameter-space integration. TIES-Merging leverages a sequence of “trim,” “elect,” and “disjoint merge” operations to preserve task-relevant expert adjustments while suppressing noisy or conflicting parameter updates, thereby outperforming naïve averaging and basic arithmetic merges in scenarios characterized by substantial expert divergence or limited model size.

1. Motivation, Principles, and Algorithmic Foundations

The primary motivation for TIES-Merging arises from two central types of parameter interference in model merging: redundant-parameter interference and sign-disagreement interference. Redundant-parameter interference occurs when averaging across models leads to dilution of large, task-relevant weight changes by many near-zero (irrelevant) entries. Sign-disagreement interference emerges when experts drive parameters in opposing directions, leading to destructive cancellation under vanilla averaging.

TIES-Merging systematically addresses these phenomena through three atomic operations:

  1. Trim: Prune small-magnitude entries in each expert's task vector (i.e., θ_expert − θ_base), zeroing out entries below a specified quantile or threshold.
  2. Elect: For each parameter dimension, identify a majority or dominant direction, often via majority vote or summing the pruned task vectors and taking the resulting sign.
  3. Disjoint Mean (Merge): Only update parameters where at least one expert’s pruned vector agrees with the elected sign; for each such case, average only those expert deltas sharing the dominant sign. All other coordinates revert to the base value.

The TIES-Merging algorithm is typically formalized as follows for N experts with parameters {θ_i} and a common initialization θ_base (Yadav et al., 2023, Ueda et al., 4 Nov 2025, Yadav et al., 4 Oct 2024):

$$T_{i,p} = \theta_{i,p} - \theta_{\mathrm{base},p}$$

$$\tilde{T}_{i,p} = T_{i,p} \cdot \mathbf{1}\!\left(|T_{i,p}| \geq \tau\right)$$

$$s_p = \mathrm{sgn}\!\left(\sum_{i=1}^{N} \tilde{T}_{i,p}\right)$$

$$P_p = \left\{\, i \mid \mathrm{sgn}(\tilde{T}_{i,p}) = s_p \,\right\}$$

$$\tilde{T}_{m,p} = \begin{cases} \dfrac{1}{|P_p|} \displaystyle\sum_{i \in P_p} \tilde{T}_{i,p}, & \text{if } P_p \neq \emptyset \\ 0, & \text{otherwise} \end{cases}$$

$$\theta_{\mathrm{merged},p} = \theta_{\mathrm{base},p} + \lambda \cdot \tilde{T}_{m,p}$$

Here τ is the trimming threshold and λ is a global scaling factor; both are tuned as hyperparameters.
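
A minimal NumPy sketch of these equations is given below. The function name, the flat-vector interface, and the quantile-based realization of the trimming threshold τ (via a keep-fraction `density`) are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def ties_merge(base, experts, density=0.2, lam=1.0):
    """Illustrative TIES-Merging on flat parameter vectors.

    base:    1-D array holding the shared initialization theta_base
    experts: list of 1-D arrays theta_i (same shape as base)
    density: fraction of largest-magnitude delta entries kept per expert
    lam:     global scaling factor lambda applied to the merged delta
    """
    base = np.asarray(base)
    # Task vectors T_i = theta_i - theta_base
    deltas = np.stack([np.asarray(e) - base for e in experts])   # (N, P)

    # Trim: keep only the top-`density` fraction of |T_i| per expert
    trimmed = np.zeros_like(deltas)
    for i, d in enumerate(deltas):
        tau = np.quantile(np.abs(d), 1.0 - density)              # magnitude threshold
        trimmed[i] = np.where(np.abs(d) >= tau, d, 0.0)

    # Elect: dominant sign per coordinate, s_p = sgn(sum_i T~_{i,p})
    elected_sign = np.sign(trimmed.sum(axis=0))                  # (P,)

    # Disjoint mean: average only the deltas agreeing with the elected sign
    agree = (np.sign(trimmed) == elected_sign) & (trimmed != 0)
    counts = agree.sum(axis=0)                                   # |P_p| per coordinate
    merged_delta = np.where(
        counts > 0,
        (trimmed * agree).sum(axis=0) / np.maximum(counts, 1),
        0.0,                                                     # no agreement -> keep theta_base
    )

    return base + lam * merged_delta
```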

2. Detailed Procedure, Hyperparameterization, and Variants

TIES-Merging operates in a parameter-space setting with a workflow as follows (Yadav et al., 2023, Ueda et al., 4 Nov 2025, Yadav et al., 4 Oct 2024, Xu et al., 27 Mar 2025):

  • Preprocessing: All models must share the same architecture, initialization, and layer ordering. Differences must be computed relative to the same θ_base.
  • Trimming: The density or threshold τ controls sparsity; typical configurations retain only the top 10–30% of entries by absolute delta magnitude per expert and zero the rest. Excessively aggressive pruning can underutilize expert knowledge, while overly dense retention can reintroduce interference.
  • Sign Election: For each parameter index, a majority sign is computed across experts’ trimmed deltas. Only those expert deltas with matching signs are included in the merge.
  • Disjoint Update: For each parameter, if no expert delta agrees with the elected sign, the coordinate is left unchanged (i.e., θ_base is used).
  • Scaling: The merged delta is optionally scaled by λ (commonly λ = 1.0).

Implementations typically flatten and merge all weight tensors, operating layer- or parameter-wise. The merge is one-shot, with no iterative optimization.
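
As a rough illustration of this layer-wise, one-shot workflow, the sketch below applies the `ties_merge` function from Section 1 tensor-by-tensor to PyTorch-style state dicts; the helper name and the CPU/NumPy round-trip are assumptions for clarity, not how any particular library implements it.

```python
import torch

def merge_state_dicts(base_sd, expert_sds, density=0.2, lam=1.0):
    """Apply the flat-vector `ties_merge` sketch above to each tensor of
    PyTorch-style state dicts sharing identical keys and shapes."""
    merged = {}
    for name, base_tensor in base_sd.items():
        base_flat = base_tensor.detach().cpu().double().numpy().ravel()
        expert_flats = [
            sd[name].detach().cpu().double().numpy().ravel() for sd in expert_sds
        ]
        merged_flat = ties_merge(base_flat, expert_flats, density=density, lam=lam)
        merged[name] = (
            torch.from_numpy(merged_flat)
            .reshape(base_tensor.shape)
            .to(base_tensor.dtype)
        )
    return merged
```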

Common variants include DARE-TIES, which adds random dropout to deltas before TIES, and ACM-TIES, which replaces the scalar λ with mutual-information-informed layerwise coefficients for improved retention and specialization (Yao et al., 20 May 2025).
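
For reference, DARE's drop-and-rescale preprocessing of the task vectors can be sketched as follows; the rescaled deltas would then be fed into the usual TIES trim/elect/merge steps. The default drop probability and the function name are illustrative.

```python
import numpy as np

def dare_preprocess(delta, drop_prob=0.9, rng=None):
    """Randomly zero delta entries with probability `drop_prob` and rescale
    the survivors by 1/(1 - drop_prob), keeping the expected delta unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(delta.shape) >= drop_prob
    return np.where(keep, delta / (1.0 - drop_prob), 0.0)
```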

3. Theoretical Properties, Assumptions, and Empirical Performance

  • Sign-Consistency and Conflict Avoidance: By grouping parameter updates according to sign and including only those that agree with the dominant direction, TIES mitigates destructive interference, which is critical when experts are heterogeneous or trained on divergent tasks (Yadav et al., 2023).
  • Sparsity and Robustness: Trimming small updates both denoises the task vectors and limits merge-induced norm inflation, promoting stable integration. This is especially effective when merging sparse or low-rank expert updates (Yadav et al., 4 Oct 2024, Ueda et al., 4 Nov 2025).
  • Empirical Scaling Laws: Cross-entropy loss for TIES-merged models empirically follows a “floor plus 1/k tail” scaling, with diminishing returns in the number of experts k and with performance floors set by model capacity N (Wang et al., 29 Sep 2025). For large N and high-quality bases, TIES merges converge toward simple averaging in held-in/held-out performance.
  • Stability: TIES is notably robust to the density parameter d and insulates against catastrophic collapse even as expert overlap decreases, provided pairwise cosine similarity is not too low (empirically, cosine ≥0.98) (Ueda et al., 4 Nov 2025).

Empirical results across NLP, vision, code-mixed, continual pretraining, zero-shot, and unlearning tasks consistently show TIES improving over basic averaging by 1–5 points in accuracy or F1, particularly when the number of experts or domains is moderate and the models are small to mid-scale (Yadav et al., 2023, Xu et al., 27 Mar 2025, Ueda et al., 4 Nov 2025, Kodali et al., 22 Oct 2025).

4. Comparison with Other Merging Methods

| Method | Noise Handling | Sign-Conflict Handling | Held-in Perf. Gain | Computational Cost | Data Needs |
| --- | --- | --- | --- | --- | --- |
| Averaging | None | None | Medium–poor | O(P) | None |
| Task Arithmetic | None | None | Unstable | O(P) | None |
| DARE-TIES | Dropout + TIES | Yes (strong, strict) | Mixed | O(P log P) (sort + mask) | None |
| ACM-TIES | MI-weighted TIES | Yes (as per TIES) | Best (“asymptotic”) | O(P + n calibration fwd passes) | Small calibration set |
| TIES-Merging | Trimmed by quantile | Disjoint mean by sign | Strong, robust | O(P) | None |

TIES consistently outperforms simple averaging and Task Arithmetic in small-to-midsize regimes and under significant expert-task conflict. DARE-TIES, which applies an additional dropout mask before sign election, can be too aggressive when model deltas are already well-aligned, erasing useful task-specific adaptations (Timilsina et al., 17 Nov 2025). At the largest scales (≥24B), performance among the principal merging strategies closely aligns, with diminishing returns from the more sophisticated conflict-resolution logic in TIES (Yadav et al., 4 Oct 2024).

5. Application Domains: Continual Pretraining, Unlearning, and Expert Composition

TIES-Merging has been applied in diverse contexts:

  • Unlearning in LLMs: Used to synthesize a “balanced” unlearned model by merging aggressive (over-forgetting) and conservative (under-forgetting) adapters. This outperformed linear and DARE-derived baselines on knowledge retention and membership inference metrics in SemEval-2025 Task 4 (Xu et al., 27 Mar 2025).
  • Continual Pretraining and Specialized LLMs: Demonstrably restores general-domain knowledge lost to sequential CPT, unlocking emergent cross-domain skills when merging, e.g., finance and math experts (Ueda et al., 4 Nov 2025).
  • Multilingual/Code-Mixed Adaptation: Enables robust cross-lingual/cross-codebase integration through per-parameter magnitude and sign filtering (Kodali et al., 22 Oct 2025).
  • General Model Soup Composition: Follows the same empirical scaling laws as Task Arithmetic and DARE but edges out the alternatives when merging a few strong experts into a mid-size model (Wang et al., 29 Sep 2025, Yadav et al., 4 Oct 2024).
  • Distributed/Federated Learning: Serves as a robust, data-free consolidation strategy in privacy-sensitive, cross-institution scenarios.

6. Hyperparameter Sensitivity, Limitations, and Practical Recommendations

TIES depends most critically on (a) the trimming threshold/density d, and (b) the scaling λ. In practice, performance is relatively robust to d in [0.2,0.6], and λ=1.0 suffices in most cases (Ueda et al., 4 Nov 2025, Yadav et al., 2023). Excessive pruning (low d) limits CPT utility, while excessive density (d→1) reintroduces conflicts.
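
Building on the `merge_state_dicts` sketch from Section 2, tuning d can be as simple as a brute-force sweep on a held-out dev set; in the sketch below, `evaluate_fn` is an assumed user-supplied callable that scores a merged state dict.

```python
def sweep_density(base_sd, expert_sds, evaluate_fn,
                  densities=(0.2, 0.3, 0.4, 0.5, 0.6)):
    """Pick the TIES density d by evaluating each merged model on a dev set.
    `evaluate_fn(state_dict) -> float` is assumed to return a score where
    higher is better."""
    scores = {}
    for d in densities:
        merged = merge_state_dicts(base_sd, expert_sds, density=d, lam=1.0)
        scores[d] = evaluate_fn(merged)
    best = max(scores, key=scores.get)
    return best, scores
```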

The effectiveness of TIES decays as the number of experts and the model size increase, converging toward simple averaging in large-scale, well-regularized settings (Yadav et al., 4 Oct 2024, Wang et al., 29 Sep 2025). TIES requires strict architectural compatibility and high pre-merge parameter similarity. With more than two experts, or with extremely divergent experts, the headroom for improvement that TIES can exploit shrinks (Ueda et al., 4 Nov 2025).

Best practices:

  • Use TIES for two-way expert merges; modularize merges for >2.
  • Tune density on a held-out dev set; use mergekit for implementation safety.
  • Monitor pre-merge cosine similarity (≥0.98 desirable).
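
A minimal check for the last recommendation is sketched below, assuming NumPy-compatible (CPU) state dicts; whether the similarity is measured on full parameter vectors (as here) or on task vectors varies across setups.

```python
import numpy as np

def pairwise_parameter_cosine(state_dicts):
    """Pairwise cosine similarity between the flattened parameter vectors of
    candidate experts; values well below ~0.98 indicate the models may have
    diverged too far from one another for a clean TIES merge."""
    vecs = []
    for sd in state_dicts:
        flat = np.concatenate(
            [np.asarray(sd[k], dtype=np.float64).ravel() for k in sorted(sd)]
        )
        vecs.append(flat / (np.linalg.norm(flat) + 1e-12))
    vecs = np.stack(vecs)
    return vecs @ vecs.T  # (N, N) symmetric matrix of pairwise cosines
```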

7. Extensions, Scaling Laws, and Theoretical Insights

TIES has inspired layerwise and information-theoretic extensions such as ACM-TIES, which uses mutual-information between base and expert activations to devise layer-specific scaling coefficients, further boosting performance, e.g., reducing response length by 55% while improving reasoning accuracy by +1.3 points in Qwen-7B L2S benchmarks (Yao et al., 20 May 2025). The method is compatible with the “floor plus 1/k” merging scaling law, which enables predictive planning, early stopping, and cost/performance tradeoff for expert composition (Wang et al., 29 Sep 2025).
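
Schematically, the “floor plus 1/k tail” behavior can be written as follows, where L∞(N) is a capacity-dependent floor and B(N) a diminishing-returns coefficient; these symbols are placeholders for exposition, not the fitted parametrization reported by Wang et al.:

$$\mathcal{L}_{\mathrm{merged}}(N, k) \;\approx\; \mathcal{L}_{\infty}(N) + \frac{B(N)}{k}, \qquad k = \text{number of merged experts.}$$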

Theoretical justification for the disjoint-mean merge step is grounded in mitigating the destructive impact of conflicting parameter perturbations, especially in the sparsely-updated regime typical of efficient fine-tuning. TIES merges exploit mode connectivity in parameter space but avoid the zero-average pitfall by isolating and reinforcing large, sign-consistent expert contributions.

