
Online Merging Optimizers

Updated 31 January 2026
  • Online merging optimizers are adaptive algorithms that dynamically merge new model updates with existing ones in streaming and resource-constrained environments.
  • They utilize history-aware averages, similarity-based selection, and contextual bandit frameworks to tackle challenges in continual learning, expert composition, and RLHF merging.
  • Empirical evidence shows these methods reduce latency and memory usage while improving task accuracy compared to static or naive merging strategies.

Online Merging Optimizers refer to a class of algorithms and frameworks that dynamically integrate, average, or select between multiple models, expert modules, or adapters as new components or task requirements arrive over time, without requiring retraining from scratch. These optimizers are devised to handle practical constraints in streaming, on-device, resource-constrained, or continually evolving environments, ensuring robust performance under dynamic or incremental task arrival while efficiently managing memory, compute, or other application constraints. The defining feature is the “online” nature of the merging: at each step, as new information (models, tasks, or feedback) becomes available, the optimizer decides how and what to merge in real time, adapting incrementally to evolving requirements.

1. Problem Formulations and Representative Settings

Online merging optimizers address a spectrum of settings arising from modern neural applications:

  • On-device continual learning of adapters: Given a fixed base model (e.g., a 1–2B parameter LLM), support for downstream tasks is provided via plug-in LoRA adapters. The device can store only up to $K$ adapters, where $K$ is far smaller than the number of possible tasks, and new single-task LoRAs are delivered as users request new tasks. The goal is to continually select, merge, or allocate LoRAs to maintain acceptable performance on all tasks seen so far, under hard storage and data-free constraints (Shenaj et al., 15 Oct 2025).
  • Online expert composition (Mixture-of-Experts): In inference over large-scale pre-trained MoEs, expert routing and merging for different task distributions must be performed efficiently under changing conditions. Here, the merger needs to select or weight experts at each time step based on task distribution estimates, while minimizing regret over cumulative task-specific rewards (Han et al., 24 Sep 2025).
  • Streaming model merging with operator selection: In large LLM checkpoint catalogs, new specialized models or merge operators frequently arise; merging is formulated as an online contextual bandit problem, selecting the best operator at each step based on similarity features and limited feedback (Bolton et al., 14 Jan 2026).
  • Online merging in streaming topic modeling: In continual unsupervised learning, as new document batches arrive, topic models must merge prior and current topics to adapt to dynamic content changes using optimal transport, ensuring both topic coherence and adaptability to new topics (Granese et al., 10 Apr 2025).
  • Online meta-optimizer merging: In online convex optimization, a master optimizer must adaptively combine (merge) the advice of multiple expert optimizers, tracking the best-performing member of the family over time (Masoudian et al., 2019).
  • Online merging of control or optimization features: In adaptive control (e.g., online bid optimization with multiple constraints), the optimizer merges task targets and feedback into a joint representation, adjusting parameters in real time (Wang et al., 2022).
  • Online gradient merging in RLHF: For LLMs undergoing reinforcement learning from human feedback, merging the update direction at each step with the SFT delta regularizes optimization and mitigates alignment tax in real time (Lu et al., 2024).

2. Core Algorithms and Theoretical Objectives

Several distinct algorithmic motifs recur in online merging optimizers:

  • History-aware running average: When merging two LoRA adapters $L_c$ and $L^{(t)}$ into a single slot, a running average weighted by the number of tasks previously included ensures order-invariant contributions (a worked check follows this list):

$$\mathrm{merge}(L_c, L^{(t)}) = \frac{|H_c| \cdot L_c + 1 \cdot L^{(t)}}{|H_c| + 1}$$

where $H_c$ is the set of tasks already merged into slot $c$ (Shenaj et al., 15 Oct 2025).

  • Data-free similarity-based selection: To determine with which stored adapter an incoming one should merge, a cosine similarity is computed over flattened LoRA updates (e.g., per transformer layer/key/query/value/output) (Shenaj et al., 15 Oct 2025).
  • Neural-linear contextual bandit for merge-operator selection: SimMerge represents the task, models, and their similarity features as a context vector for a neural-linear bandit, choosing among available merge operators (e.g., Linear, SLERP, TIES) with the goal of minimizing cumulative regret (Bolton et al., 14 Jan 2026). Online adaptation is performed via Bayesian posterior updates per action; a minimal Thompson-sampling sketch follows this list.
  • Neural bandit with adaptive binary tree partitioning: In online MoE inference, a binary partition tree over the mixture-weight simplex is maintained, and neural UCB guides the exploration-exploitation tradeoff. Theoretical guarantees (sublinear regret of $\mathcal{O}(\sqrt{T}\log T)$) are provided under neural tangent kernel assumptions (Han et al., 24 Sep 2025).
  • Online optimal transport merging: For online topic models, the current and candidate topic embedding clouds are merged using unbalanced OT with marginal relaxations, minimizing a cost that combines geometry with mass transport penalties (Granese et al., 10 Apr 2025).
  • Meta-algorithm for merging optimizers: Master gradient descent merges or samples among a finite set of base online convex optimizers, weighting them via full-information or bandit-feedback regret minimization (Masoudian et al., 2019).
  • Online merging in RLHF: Each RLHF gradient update is merged (elementwise, possibly under sparsification or sign-consensus) with the SFT delta at every iteration, with a blending hyperparameter $\alpha$ controlling the tradeoff between reward maximization and retention of prior capabilities (Lu et al., 2024).
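
As a quick worked check of the order-invariance claimed above for the history-aware average, merge three arbitrary adapters $L^{(1)}, L^{(2)}, L^{(3)}$ sequentially: after the first merge the slot holds $\tfrac{1}{2}(L^{(1)}+L^{(2)})$ with $|H_c|=2$, so the next step gives

$$\mathrm{merge}\!\left(\tfrac{1}{2}(L^{(1)}+L^{(2)}),\, L^{(3)}\right) = \frac{2\cdot\tfrac{1}{2}(L^{(1)}+L^{(2)}) + L^{(3)}}{2+1} = \frac{L^{(1)}+L^{(2)}+L^{(3)}}{3},$$

the uniform mean, which is exactly what any other arrival order produces.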
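
To make the operator-selection loop concrete, the following is a minimal linear Thompson-sampling sketch in the spirit of the bandit described above. It is illustrative only, not the SimMerge implementation: the context featurization, the set of merge operators, and the reward callback merge_and_evaluate are hypothetical placeholders, and the neural feature extractor of the true neural-linear model is omitted in favor of raw context vectors.

import numpy as np

class LinTSArm:
    """Per-operator linear reward model with a Gaussian posterior over its weights."""
    def __init__(self, dim, lam=1.0, v=0.25):
        self.A = lam * np.eye(dim)   # regularized design matrix (posterior precision)
        self.b = np.zeros(dim)       # accumulated reward-weighted contexts
        self.v = v                   # posterior scale controlling exploration

    def sample_reward(self, x):
        cov = np.linalg.inv(self.A)
        theta = np.random.multivariate_normal(cov @ self.b, self.v**2 * cov)
        return float(x @ theta)

    def update(self, x, reward):
        # Posterior update only for the arm that was actually played.
        self.A += np.outer(x, x)
        self.b += reward * x

def select_and_merge(arms, context, merge_and_evaluate):
    """One online step: sample each arm, play the best, observe reward, update."""
    chosen = int(np.argmax([arm.sample_reward(context) for arm in arms]))
    reward = merge_and_evaluate(chosen, context)  # e.g., held-out score of the merged model
    arms[chosen].update(context, reward)
    return chosen, reward

Warm-starting from logged full-information merges, as discussed for SimMerge below, would amount to pre-populating A and b of each arm before the online phase begins.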

3. Implementation Workflows and Pseudocode

Online merging optimizers are designed to be lightweight and incremental. Below is a stylized workflow for the storage-constrained, data-free LoRA adapter setting from K-Merge (Shenaj et al., 15 Oct 2025):

initialize L = []          # current adapter slots (at most K)
initialize H = []          # per-slot history of merged task ids
for t in arrival_sequence:
    receive new LoRA L_new
    if L is not empty:
        # data-free scoring against every stored slot
        sims = [cosine_similarity(L_new, L_i) for L_i in L]
        c = argmax(sims)
    if L is empty or (len(L) < K and sims[c] < s):
        # s: similarity threshold; open a new slot
        append L_new to L
        H.append({t})
    else:
        # history-aware running average into slot c
        L[c] = ( len(H[c])*L[c] + L_new ) / (len(H[c]) + 1)
        H[c].add(t)
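
The cosine_similarity call above is left abstract. A minimal data-free realization, assuming each adapter is held as a dictionary from module names to delta-weight arrays (the storage format and the per-module averaging are illustrative assumptions, not the K-Merge implementation), might look like:

import numpy as np

def cosine_similarity(lora_a, lora_b):
    """Average cosine similarity between two adapters' flattened per-module deltas.

    lora_a, lora_b: dicts mapping module names (e.g. "layer3.q_proj")
    to arrays of identical shape; both adapters must share the same keys.
    """
    sims = []
    for name in lora_a:
        a, b = lora_a[name].ravel(), lora_b[name].ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(a @ b / denom if denom > 0 else 0.0)
    # Collapse per-module scores into a single adapter-level similarity.
    return float(np.mean(sims))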

Additional workflow elements in other frameworks include:

  • Contextual bandit update: For every merge decision, compute context features, select an operator via Thompson sampling/UCB on the neural-linear model, and update the corresponding arm posterior only on observed reward (Bolton et al., 14 Jan 2026).
  • Neural UCB tree expansion: When the visitation count of a tree node exceeds a threshold, split its region, refine the candidate set, and update the neural weights and statistics accordingly (Han et al., 24 Sep 2025).
  • Online OT merging: After local ETM fitting, solve an OT problem (e.g., with a KL marginal penalty) and blend topic embeddings with memory parameter $\omega$ (Granese et al., 10 Apr 2025).
  • Gradient-based online merging: At each RLHF update, generate an optimizer delta, sparsify, linearly combine (or sign-consensus merge) with the SFT delta, and update model weights (Lu et al., 2024).
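
As a concrete illustration of the last item, below is a stripped-down sketch of a DARE-style online merging step, in which the per-step RLHF delta is randomly sparsified and blended with a cached SFT delta before being applied. The tensor names, keep fraction p, blend weight alpha, and the convex blending itself are placeholders chosen for clarity rather than the reference OnDARE/OnTIES settings.

import torch

def online_merge_step(param, rlhf_delta, sft_delta, alpha=0.5, p=0.1):
    """Apply one online-merged update to a parameter tensor in place (illustrative sketch).

    rlhf_delta: this step's reward-driven update direction
    sft_delta:  cached SFT delta used for capability retention
    p:          fraction of rlhf_delta entries kept (DARE-style drop-and-rescale)
    alpha:      weight on the reward-driven direction vs. the SFT delta
    """
    # Randomly keep a fraction p of the RLHF delta and rescale survivors by 1/p.
    mask = (torch.rand_like(rlhf_delta) < p).to(rlhf_delta.dtype)
    sparse_rlhf = mask * rlhf_delta / p
    # Linearly combine the sparsified update with the SFT delta, then apply it.
    merged = alpha * sparse_rlhf + (1.0 - alpha) * sft_delta
    param.data.add_(merged)
    return param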

4. Computational Efficiency and Memory Constraints

Online merging methods must operate under stringent storage and compute requirements:

  • Adapter storage: For LoRA merging, each adapter is $\mathcal{O}(r d^2)$ parameters (e.g., 23–37M parameters, 27–34 MB). For $K=5$ slots, the device uses $\approx$135–170 MB; storing all single-task adapters (>1 GB) is infeasible (Shenaj et al., 15 Oct 2025).
  • Similarity scoring: Cosine similarity over all layers and projections per incoming adapter, $O(K\,|\mathcal{N}|\cdot|P|\cdot d^2)$ per merge (e.g., $d=1024$, $|\mathcal{N}|=24$, $|P|=4$) (Shenaj et al., 15 Oct 2025).
  • Expert mixture merging: Tree updates in Tanbr scale as $O(K\log T)$ per inference slot, with inference time $<0.4$ ms for up to 192 experts (Han et al., 24 Sep 2025).
  • SimMerge: Bayesian linear posterior updates after each merge decision are $O(k^2)$ (with $k$ the dimension of the projected similarity features) (Bolton et al., 14 Jan 2026).
  • RLHF online merging: Per-step cost is dominated by standard optimizer steps, as the SFT delta is cached and the sparsify/merge step is cheap. Merging is robust down to $p=5\times 10^{-4}$ (dropping 99.95% of weights still preserves stability) (Lu et al., 2024).
  • Online topic modeling: Each OT solve (batch size $J\times K$) is modest, with training dominated by the local ETM optimization ($\sim$3000 epochs, batch size 1000, hidden dimension 800) (Granese et al., 10 Apr 2025).

5. Empirical Performance and Baseline Comparisons

Online merging optimizers have consistently outperformed traditional or naively merged baselines in both incremental and continual settings:

  • LoRA adapter merging (K-Merge++): Achieves a normalized aggregate score $S^{(\gamma)}$ of $0.81$ (Llama-3.2-1B, $K=5$) and $0.88$ (Qwen-2.5-1.5B, $K=5$), outperforming linear, TIES, DARE, and OPCM merging strategies. For $K=8$, $S^{(\gamma)}$ approaches $0.93$, i.e., 93% of the single-task upper bound (Shenaj et al., 15 Oct 2025).
  • Expert merging in MoE (Tanbr): Reduces inference latency by 47% and memory usage by up to 78% compared to full SMoE, and achieves higher or equal task accuracy relative to token-level routing, Switch, and other baselines; convergence is improved in dynamic conditions (Han et al., 24 Sep 2025).
  • Operator selection bandit (SimMerge): The neural-linear LinTS bandit achieves cumulative regret within 10% of the oracle and matches offline merging quality on held-out tasks, outperforming random and LinUCB policies. Warm-starting on logged full-information merges significantly reduces exploration on new tasks and models (Bolton et al., 14 Jan 2026).
  • Online RLHF merging (OnDARE/OnTIES): Achieves the highest average benchmark score and better alignment reward vs. AdamW, EMA, ChildTuning, LoRA, and offline merges. Tuning $\alpha$ balances reward against catastrophic forgetting (alignment tax) (Lu et al., 2024).
  • Topic modeling (StreamETM): Unbalanced OT achieves higher merging and discovery accuracy (harmonic mean $0.85$) and sustains topic coherence/diversity over time relative to Cosine/EU matching or non-merged online baselines (Granese et al., 10 Apr 2025).

6. Limitations, Insights, and Extensions

Key Insights:

  • Low-cost, data-free similarity metrics (cosine, functional probe-based) are effective in grouping and merging compatible modules in adapter or expert-based systems (Shenaj et al., 15 Oct 2025).
  • History-aware running-average merging ensures unbiased, order-independent aggregation.
  • Contextual bandit frameworks bring strong theoretical guarantees (low regret) to operator selection, allowing scalable adaptation in the face of unseen models and operators (Bolton et al., 14 Jan 2026).
  • Merging gradients/deltas online provides continual regularization or capability retention, as opposed to static, offline model interpolation (Lu et al., 2024).
  • Optimal transport frameworks (unbalanced OT) enable model alignment in non-i.i.d. streaming settings, improving both merging and topic discovery (Granese et al., 10 Apr 2025).

Limitations:

  • Most online merging optimizers are focused on a specific type of adapter or module (e.g., LoRA) and require retraining or extension to other adaptation or parameter-efficient techniques (Shenaj et al., 15 Oct 2025).
  • Selection thresholds in similarity-based merging frameworks are set on small held-out pools, making them susceptible to domain shift; richer, adaptive similarity functions are suggested as a future direction.
  • Some frameworks (e.g., Tanbr, SimMerge) may incur scalability bottlenecks in the LP step or candidate enumeration for very large $K$; hierarchical or heuristic candidate generation can mitigate the overhead (Han et al., 24 Sep 2025, Bolton et al., 14 Jan 2026).
  • RLHF online merging requires careful hyperparameter selection ($\alpha$, sparsity $p$), with ablations showing sensitivity to these values (Lu et al., 2024).
  • Some methods assume short-horizon stationarity, and coarse, sign-based updates may converge slowly when feedback is extremely sparse (Wang et al., 2022).

Possible Extensions:

  • Extending merging functions to learned non-linear gates or residuals, beyond affine/average merges.
  • Application to multimodal or cross-domain adapter types (prefix tuning, attention bias), or to non-NLP settings (vision, robotics, control).
  • Auto-adaptive similarity or merge thresholding via budget-aware RL or unsupervised exploration.
  • Generalizing online merging frameworks (e.g., SimMerge, Tanbr) to other structured action spaces, resource-constrained applications, or as meta-optimizers tracking families of base learners.

7. Theoretical Analyses and Guarantees

Many online merging frameworks provide formal regret or optimality guarantees:

  • Tanbr (MoE expert merging): Achieves sublinear regret $O(\tilde{d}\sqrt{T}\log T)$, where $\tilde{d}$ depends on the Cholesky rank of the neural tangent kernel Gram matrix, matching existing bandit methods despite operating in a continuous, high-dimensional simplex (Han et al., 24 Sep 2025).
  • SimMerge: Neural-linear contextual bandits guarantee regret within a small factor of the oracle; warm-starting with full-information logs further reduces online regret (Bolton et al., 14 Jan 2026).
  • Meta-algorithm for online optimizer amalgamation: For any base expert $i$, the master has an $O(\sqrt{\ln K})$ overhead in regret (full information) or an $O(\sqrt{K})$ overhead (bandit feedback), plus the expert's own problem-dependent bound, providing instance-optimal adaptation (Masoudian et al., 2019).
  • StreamETM: Change-point detection on the time series of merged topics achieves AUC $\approx 0.9$, validating robust and timely topic adaptation. Merging and discovery metrics are formally evaluated against ablations (Granese et al., 10 Apr 2025).

The mix of empirically robust, theoretically principled, and computationally efficient designs across the surveyed online merging optimizers points to a rapidly maturing area, directly addressing the increasing need for adaptive, low-overhead, real-time model composition in large-scale, continually evolving systems.
