Test-Time Model Merging (TTMM)

Updated 2 December 2025
  • Test-Time Model Merging (TTMM) is a framework that constructs a unified model by merging the task-specific differences of pretrained experts using learned coefficients.
  • It employs strategies like parameter arithmetic, conflict-aware trimming, and adaptive entropy minimization to minimize catastrophic forgetting and optimize multi-task performance.
  • Empirical evaluations demonstrate that TTMM methods significantly improve accuracy and robustness across language, vision, and multi-modal tasks with minimal computational overhead.

Test-Time Model Merging (TTMM) is a family of inference-time algorithms for integrating multiple specialized neural models, often fine-tuned on different domains or objectives, into a single unified model without retraining or access to the original fine-tuning data. TTMM methods aim to compose expert capabilities efficiently, minimize catastrophic forgetting, and optimize trade-offs between task performance, generalization, and resource consumption. The paradigm underpins a wide array of recent advances in multi-task, continual, and controllable learning, especially for large-scale vision models and LLMs.

1. Core Problem Formulation and Design Objectives

TTMM is formalized as the construction of a multi-task model from a set of pretrained or fine-tuned experts $\{\theta_i\}_{i=1}^{K}$, each adapted from a common foundation $\theta_0$. The principal construction is

$$\theta_{\text{merged}} = \theta_0 + \sum_{i=1}^{K} \alpha_i t_i$$

where $t_i = \theta_i - \theta_0$ is the task vector for expert $i$, and $\alpha_i$ is a scalar or input-conditioned merging coefficient. The challenge is to design $\{\alpha_i\}$ and parameter-selection rules that reconcile conflicting task updates, maximize generalization, and avoid destructive interference, often in a training- and label-free manner; a minimal code sketch of this construction follows the list below. TTMM is motivated by:

  • The need to combine expert knowledge at deployment, without access to all task data or costly multi-task retraining.
  • Robustness to heterogeneous test-time distributions and efficient continual integration of new capabilities.
  • Efficient adaptation—ideally incurring only minimal computational and memory overhead compared to full fine-tuning or test-time training (Bertolissi et al., 20 May 2025, Yang et al., 2023).
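
As a concrete illustration of the construction above, the sketch below builds task vectors $t_i = \theta_i - \theta_0$ from state dicts and merges them with fixed coefficients. The helper names and the toy two-expert setup are assumptions for illustration, not code from any cited paper.

```python
import torch


def task_vectors(base_state, expert_states):
    """Compute t_i = theta_i - theta_0 for each expert, parameter by parameter."""
    return [
        {name: expert[name].detach() - base_state[name].detach() for name in base_state}
        for expert in expert_states
    ]


def merge(base_state, vectors, alphas):
    """Return theta_0 + sum_i alpha_i * t_i as a new state dict."""
    merged = {name: p.detach().clone() for name, p in base_state.items()}
    for alpha, t in zip(alphas, vectors):
        for name in merged:
            merged[name] += alpha * t[name]
    return merged


# Toy usage: two "experts" derived from the same two-layer base model.
base = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
experts = []
for _ in range(2):
    e = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
    e.load_state_dict(base.state_dict())
    with torch.no_grad():
        for p in e.parameters():
            p.add_(0.01 * torch.randn_like(p))  # stand-in for task-specific fine-tuning
    experts.append(e.state_dict())

vectors = task_vectors(base.state_dict(), experts)
merged_state = merge(base.state_dict(), vectors, alphas=[0.5, 0.5])
base.load_state_dict(merged_state)  # deploy the merged parameters in the base architecture
```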

2. Methodological Taxonomy

TTMM strategies can be broadly partitioned as follows:

(a) Parameter Arithmetic and Sparse/Averaged Merging

Classical approaches perform simple arithmetic—such as weight averaging ("Model Soup"), linear interpolation, or uniform task-vector summation (Task Arithmetic). Extensions include coefficient tuning (per-task or per-layer scaling), and random or magnitude-based sparsification (e.g., TIES, DARE, Localize-and-Stitch) to preserve salient but non-interfering parameters (2505.10833).
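
A minimal sketch of magnitude-based trimming in the spirit of TIES/DARE-style sparsification is shown below; the keep ratio is arbitrary, and the sign-election and rescaling steps of the actual methods are omitted.

```python
import torch


def trim_by_magnitude(task_vector, keep_ratio=0.2):
    """Zero out all but the top-`keep_ratio` fraction of entries (by magnitude) per tensor."""
    trimmed = {}
    for name, t in task_vector.items():
        flat = t.abs().flatten()
        k = max(1, int(keep_ratio * flat.numel()))
        threshold = flat.topk(k).values.min()
        trimmed[name] = torch.where(t.abs() >= threshold, t, torch.zeros_like(t))
    return trimmed


# The sparsified vectors are then combined exactly as in the core construction:
# theta_merged = theta_0 + sum_i alpha_i * trim(t_i).
```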

(b) Conflict-Aware and Data-Free Trimming

Recent advances suppress parameter conflicts during merging by detecting and eliminating directions of high disagreement. CAT Merging evaluates, per layer, sign and magnitude inconsistency among task vectors to construct a conflict subspace. For linear layers, it projects each $t_i$ onto the complement of this subspace; for normalization parameters, it masks out conflicting components. This suppresses destructive interference in the merged model without any further training or data (Sun et al., 11 May 2025).
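
The following is a deliberately simplified illustration of the conflict-aware idea: coordinates where task vectors disagree in sign are treated as the conflict set and masked before merging. CAT Merging's actual per-layer conflict subspace and projection are more involved; this sketch only conveys the flavor.

```python
import torch


def mask_sign_conflicts(task_vectors):
    """Zero out coordinates whose sign differs across task vectors, tensor by tensor."""
    masked = [dict(t) for t in task_vectors]
    for name in task_vectors[0]:
        signs = torch.stack([torch.sign(t[name]) for t in task_vectors])
        # A coordinate is conflict-free only if no pair of tasks has opposite signs there.
        agree = (signs.max(dim=0).values * signs.min(dim=0).values) >= 0
        for t in masked:
            t[name] = t[name] * agree
    return masked


# Toy check: the second coordinate conflicts in sign and is removed from both vectors.
t1 = {"w": torch.tensor([0.5, -0.2, 0.1])}
t2 = {"w": torch.tensor([0.3, 0.4, 0.2])}
print(mask_sign_conflicts([t1, t2])[0]["w"])   # tensor([0.5000, 0.0000, 0.1000])
```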

(c) Adaptive and Entropy-Minimizing Merging

Adaptive TTMM leverages unlabeled validation or test batches to set task- or layer-specific mixing coefficients by unsupervised objectives such as output entropy minimization (AdaMerging, AdaRank). By optimizing

$$L(\lambda) = \mathbb{E}_x\big[ H(f_{\text{merged}}(x;\lambda)) \big]$$

as a surrogate for task loss, the method finds coefficients $\lambda$ that promote prediction confidence, dynamically balancing the influence of each expert (Yang et al., 2023, Lee et al., 28 Mar 2025).
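
A hedged sketch of this entropy-minimization loop is given below, using `torch.func.functional_call` (PyTorch ≥ 2.0) to evaluate the merged parameters differentiably in $\lambda$; the initialization, optimizer, and step count are illustrative, not AdaMerging's exact recipe.

```python
import torch


def merged_forward(model, base_state, task_vectors, lambdas, x):
    """Evaluate the model with theta_0 + sum_i lambda_i * t_i, differentiable in lambdas."""
    params = {
        name: base_state[name] + sum(l * t[name] for l, t in zip(lambdas, task_vectors))
        for name in base_state
    }
    return torch.func.functional_call(model, params, (x,))


def adapt_coefficients(model, base_state, task_vectors, test_batches, steps=100, lr=1e-2):
    """Tune per-task merging coefficients by minimizing prediction entropy on unlabeled inputs."""
    lambdas = torch.full((len(task_vectors),), 0.3, requires_grad=True)
    opt = torch.optim.Adam([lambdas], lr=lr)
    for _ in range(steps):
        for x in test_batches:                       # unlabeled test/validation inputs only
            probs = merged_forward(model, base_state, task_vectors, lambdas, x).softmax(dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
            opt.zero_grad()
            entropy.backward()
            opt.step()
    return lambdas.detach()
```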

(d) Subspace and SVD-Based Compression

Frameworks such as MuDSC apply permutation alignment and dual-space similarity to maximize activation- and weight-space agreement before averaging, crucial for architectures with unit/group symmetries (Xu et al., 4 Mar 2024). Twin-Merging modularizes knowledge into shared versus exclusive (low-rank SVD-compressed) task-specific components, then deploys a learned router to dynamically recompose models at inference (Lu et al., 17 Jun 2024).
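
The exclusive-knowledge compression step can be illustrated with a truncated SVD of a single layer's task-vector slice, as below; the rank and shapes are arbitrary stand-ins rather than Twin-Merging's actual settings.

```python
import torch


def low_rank_compress(weight_delta, rank=8):
    """Return factors (A, B) with A @ B approximating a 2-D task-vector slice."""
    U, S, Vh = torch.linalg.svd(weight_delta, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (out_dim, rank)
    B = Vh[:rank, :]                  # (rank, in_dim)
    return A, B


delta = torch.randn(256, 128)         # stand-in for one linear layer's task vector
A, B = low_rank_compress(delta, rank=8)
approx = A @ B                        # compressed exclusive component, recomposed at inference
```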

(e) Interference Suppression via Task-Vector Geometry

WUDI-Merging observes that each task's updates span an approximately linear subspace of the layer input space. By minimizing the squared deviation of the merged update from each task vector, measured within that task's subspace and using only parameter arithmetic with no data, WUDI achieves state-of-the-art data-free merging performance by suppressing cross-task interference (Cheng et al., 11 Mar 2025).
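
A loose, data-free illustration of this idea for one linear layer is sketched below: the merged update is optimized so that its disagreement with each task vector, measured inside that task's own row subspace, stays small. The exact WUDI-Merging objective differs; this is an assumed simplification.

```python
import torch

# Low-rank stand-ins for per-task weight deltas of one linear layer (out=64, in=32, rank 4).
task_vecs = [torch.randn(64, 4) @ torch.randn(4, 32) for _ in range(3)]

# Precompute an orthonormal basis of each task's row space (data-free, parameters only).
bases = []
for t in task_vecs:
    U, S, Vh = torch.linalg.svd(t, full_matrices=False)
    r = int((S > 1e-6 * S[0]).sum())                 # numerical rank
    bases.append(Vh[:r].T)                           # (in_dim, r)

# Start from the plain task-vector average and refine it to reduce per-task disagreement.
delta = torch.stack(task_vecs).mean(dim=0).clone().requires_grad_(True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(200):
    loss = sum(((delta - t) @ V).pow(2).sum() for t, V in zip(task_vecs, bases))
    opt.zero_grad()
    loss.backward()
    opt.step()
# `delta` now plays the role of the merged update for this layer: theta_merged = theta_0 + delta.
```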

(f) Dynamic, Input-Conditioned and Continual Merging

Methods such as TTMM in MoE and continual learning settings (MINGLE) maintain a compact pool of low-rank experts and adaptively gate their contributions per-input using a test-time batch. MINGLE further introduces null-space constrained gradient projection to prevent new expert gates from interfering with prior tasks, using soft relaxation to balance stability against plasticity (Qiu et al., 17 May 2025). CodeMerge extends adaptive merging to highly dynamic domains (3D perception under severe test shift), computing merge coefficients from ridge leverage scores over dense feature “fingerprints” (Yang et al., 22 May 2025).
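
A hypothetical sketch of input-conditioned merging with low-rank experts and a lightweight gate is shown below; the module layout is a simplification of MoE-style TTMM/MINGLE routing, not either method's actual architecture.

```python
import torch
import torch.nn as nn


class GatedLowRankExperts(nn.Module):
    """A frozen base linear layer plus low-rank experts, blended per input by a learned gate."""

    def __init__(self, dim_in, dim_out, num_experts, rank=4):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out)
        self.down = nn.Parameter(torch.randn(num_experts, dim_in, rank) * 0.02)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, dim_out))
        self.gate = nn.Linear(dim_in, num_experts)

    def forward(self, x):                               # x: (batch, dim_in)
        weights = self.gate(x).softmax(dim=-1)          # per-input expert coefficients
        expert_out = torch.einsum("bd,edr,ero->beo", x, self.down, self.up)
        return self.base(x) + (weights.unsqueeze(-1) * expert_out).sum(dim=1)


layer = GatedLowRankExperts(dim_in=32, dim_out=16, num_experts=3)
y = layer(torch.randn(8, 32))                           # (8, 16)
```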

3. Algorithmic Schemes: Representative Approaches

| Approach | TTMM Mechanism | Key Feature |
|---|---|---|
| Task Arithmetic | $\theta_0 + \lambda\sum_i t_i$ | Uniform, global coefficient |
| CAT Merging | Project/trim task vectors by conflict set | Parameter-specific, data-free trimming |
| AdaMerging | Learn $\lambda$ via unsupervised entropy | Adaptation via unlabeled test data |
| Twin-Merging | SVD-twin exclusives + router gating | Modular, dynamic, input-conditioned |
| WUDI-Merging | Minimize interference in task subspace | Provable, offline, no data required |
| MuDSC | Dual-space (weight/activation) alignment | Permutation-based unit alignment |
| AdaRank | Prune SVD modes causing interference | Data-driven, adaptive rank selection |
| MINGLE | Low-rank MoE + null-space gate adaptation | Continual, test-time interference control |

Each methodology targets critical axes: computational efficiency, robustness to distribution shift, interference suppression, modularity/extensibility, and domain generalization. Notably, most approaches require only a small batch of unlabeled validation examples for coefficient/adapter estimation, if any (Yang et al., 2023, Lee et al., 28 Mar 2025, Sun et al., 11 May 2025).

4. Empirical Results and Benchmarking

TTMM methods are evaluated on language (GLUE, Qwen, Llama, T5), vision (CLIP ViT, ResNet), and multi-modal (VLM/VQA, medical imaging, 3D detection) tasks. Consistent findings include:

  • Substantial improvements over vanilla Task Arithmetic and weight averaging: e.g., AdaMerging boosts ViT-B/32 accuracy from 69.1% (task arithmetic) to up to 81.1% (layer-wise, trimmed) (Yang et al., 2023); CAT Merging adds +2.5% and +2.0% over state of the art on ViT-B/32 and ViT-L/14, respectively (Sun et al., 11 May 2025); WUDI-Merging attains an 85.2% average on 8-task ViT-B/32, exceeding both adaptive and static baselines by large margins (Cheng et al., 11 Mar 2025).
  • Input-adaptive merging (Twin-Merging) narrows the performance gap to or surpasses the fine-tuned upper bound, especially for generative tasks, with >28 pp improvement over task arithmetic on GLUE and normalized scores >100% on Qwen-14B generative benchmarks (Lu et al., 17 Jun 2024).
  • Dynamic and batch-wise merging (T³/T³_B) drastically improve medical VLM OOD accuracy and corruption robustness versus fixed-coefficient strategies (Imam et al., 31 Oct 2025).
  • Mixture-based and continual merging (MINGLE) achieves 7–9% average ACC gains and nearly zero BWT compared to sequential or static methods across 8–20 task orders (Qiu et al., 17 May 2025).
  • Computationally, TTMM approaches typically increase memory/test-time latency only minimally, with approaches like TTMM in MoEs being >100× faster than full test-time training while closely matching perplexity (Bertolissi et al., 20 May 2025).

5. Analysis: Conflict, Interference, and Generalization

A central TTMM challenge is mitigating destructive interference—where distinct task adaptations update parameters in incompatible ways. Methodologies to address this include:

  • Layer- and parameter-specific detection (CAT Merging) exploits sign/magnitude disagreements.
  • SVD-based strategies (AdaRank, Twin-Merging) prune dominant, but conflicting, singular modes.
  • Subspace orthogonalization (WUDI-Merging) projects merged updates away from the joint subspace of task vectors, grounded by precise theoretical bounds on parameter interference (Cheng et al., 11 Mar 2025).
  • Null-space gating (MINGLE) blocks gradient drift aligned with previous-task features (a minimal sketch of this projection follows the list).

These strategies yield substantial multi-task and OOD gains, reducing catastrophic forgetting and increasing Pareto coverage over competing objectives, as in Bone Soup (Xie et al., 15 Feb 2025).
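
A generic sketch of the null-space projection referenced above is given below: a gradient is projected away from the span of feature directions collected on previous tasks, so the update (approximately) leaves old behavior unchanged. Shapes and thresholds are illustrative, not MINGLE's actual configuration.

```python
import torch

# Stand-in: low-rank features collected on earlier tasks (100 samples, 32-dim, rank ~8).
prev_feats = torch.randn(100, 8) @ torch.randn(8, 32)

# Orthonormal basis of the previous-task feature span, via SVD with a rank cutoff.
U, S, Vh = torch.linalg.svd(prev_feats, full_matrices=False)
r = int((S > 1e-3 * S[0]).sum())
basis = Vh[:r].T                                     # (32, r)

# Project a gate gradient onto the null space of the old features before applying it.
grad = torch.randn(32)                               # stand-in gradient for one gate weight row
grad_null = grad - basis @ (basis.T @ grad)          # component orthogonal to previous-task span
```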

6. Specializations: Domain-Conditioned, Continual, and Multi-Objective TTMM

  • For domain-specialized and continual learning, approaches like MINGLE and Local Mixtures of Experts manage input-conditioned routing and continual adaptation, critical for real-time, evolving test distributions (Bertolissi et al., 20 May 2025, Qiu et al., 17 May 2025).
  • In multi-objective generation, two-stage schemes (Bone Soup) train base models to optimize mixtures of objectives, then invert the basis at test time to ensure controllability and Pareto optimality for any desired objective preference (Xie et al., 15 Feb 2025).
  • In high-variance or safety-critical domains (autonomous driving, medical VLM), inference-time calculation of blending coefficients is guided by data-driven metrics (codebook fingerprints, divergence/minimal entropy) that adaptively resolve robustness/precision under drift (Yang et al., 22 May 2025, Imam et al., 31 Oct 2025).

7. Limitations, Practical Recommendations, and Open Problems

  • Many methods require small unlabeled held-out/test batches for coefficient optimization or conflict detection; data-free methods (WUDI-Merging, CAT Merging) are effective when such data is unavailable (Sun et al., 11 May 2025, Cheng et al., 11 Mar 2025).
  • Scaling to very wide or deep models still presents a computational challenge (e.g., cubic assignment in MuDSC; storage of multiple SVD decompositions) (Xu et al., 4 Mar 2024, Lee et al., 28 Mar 2025).
  • Intelligent selection of merging coefficients and cutoffs (e.g., λ in linear interpolation, α in activation/weight similarity balance, SVD rank) remains empirical in most approaches; automating this is an active research area (2505.10833, Xu et al., 4 Mar 2024).
  • Theoretical analysis for nonlinear architectures, cross-layer matching, and task clustering is incomplete. Further, while most TTMM methods maintain or improve generalization, tasks with fundamentally incompatible objectives may admit only compromise solutions.

In sum, TTMM provides a spectrum of rigorously motivated methods for data-free, label-free, or unsupervised composition of expert models at inference. It is central to scalable, robust, and deployable large model systems across language, vision, and multi-modal domains, with algorithms and theory advancing rapidly on fronts of conflict resolution, continual learning, controllability, and resource efficiency (Sun et al., 11 May 2025, Cheng et al., 11 Mar 2025, 2505.10833, Lee et al., 28 Mar 2025, Lu et al., 17 Jun 2024, Imam et al., 31 Oct 2025, Yang et al., 22 May 2025, Bertolissi et al., 20 May 2025, Qiu et al., 17 May 2025).
