
Reinforced Agent Merging (RAM)

Updated 4 July 2026
  • RAM is a reinforcement learning framework that adaptively merges shared and unique components of agentic models, tokens, or control policies.
  • It leverages sparsity and overlap analysis of task vectors to preserve critical behavior-specific updates while avoiding signal dilution.
  • Variants like RAM+ use distribution-aware rescaling to outperform traditional averaging methods, achieving improved accuracy and efficiency.

Reinforced Agent Merging (RAM) denotes several reinforcement-learning-related notions of “merging” in recent machine learning literature rather than a single universally standardized method. In the most specific contemporary usage, RAM is a distribution-aware parameter-space merging framework for RL-trained agentic models that distinguishes shared from task-specific task-vector updates and merges them selectively to avoid signal dilution (Yuan et al., 20 Jan 2026). Closely related work uses the same term, or an equivalent one, for RL policies that decide layer-wise model-merging actions without retraining source models (Han et al., 27 Mar 2025), for online token merging in Vision Transformers instantiated by DORA (He et al., 12 May 2026), and for cooperative vehicle merging in mixed-traffic MARL (Chen et al., 2021). The common thread is that merging is treated as a decision problem shaped by reinforcement learning, but the merged objects, objectives, and algorithms differ substantially across these lines of work.

1. Terminological scope and research contexts

In the cited literature, RAM appears in several distinct but structurally related senses. "Behavior Knowledge Merge in Reinforced Agentic Models" introduces RAM as a model-merging method tailored to RL-trained agentic LLMs, motivated by the mismatch between RL-induced and SFT-induced task vectors (Yuan et al., 20 Jan 2026). "Reinforced Model Merging" states explicitly that the idea referred to as Reinforced Agent Merging is the same idea introduced there as Reinforced Model Merging (RMM), with the “agent” being the RL policy that selects layer-wise merge actions (Han et al., 27 Mar 2025). The DORA paper frames token reduction in ViTs as a sequential decision problem and, in its detailed positioning, treats this as a RAM system in which a learned policy performs per-layer token merging online (He et al., 12 May 2026). Earlier traffic-control work also uses RAM terminology for a decentralized parameter-shared MARL system that learns cooperative highway on-ramp merging policies (Chen et al., 2021).

This terminological reuse is important because the acronym alone does not identify a unique algorithm. In current arXiv usage, RAM can denote a specific parameter-merging rule for RL-trained LLM agents, a training-free RL search procedure over layer-wise merge operators, an online token-merging controller for ViTs, or a multi-agent traffic-merging policy. A common misconception is therefore to treat RAM as a single canonical method; the cited literature does not support that interpretation.

2. RAM for RL-trained agentic models: motivation and problem structure

The parameter-space RAM method is defined over task vectors. Let the pre-trained base model parameters be $\theta_{\text{pre}} \in \mathbb{R}^d$. For task $t \in \{1,\ldots,N\}$, the fine-tuned model is $\theta_t$, and the task vector is

$$\boldsymbol{\tau}_t = \theta_t - \theta_{\text{pre}}.$$

The merged model is then $\theta_{\text{merged}} = \theta_{\text{pre}} + \boldsymbol{\tau}_{\text{merged}}$ (Yuan et al., 20 Jan 2026).

The central claim is that on-policy RL induces task vectors that are sparse, localized, and behavior-specific, whereas SFT-oriented merging methods assume dense and globally comparable task vectors. The paper quantifies sparsity as

$$\text{sparsity}(\theta_t,\theta_{\text{pre}}) := 1 - \frac{\|\theta_t - \theta_{\text{pre}}\|_0}{d},$$

where a parameter is considered unchanged if its absolute difference is $\le \epsilon$ with $\epsilon = 10^{-5}$. Empirically, reinforced agents trained from Qwen2.5-7B-Instruct exhibit heterogeneous sparsity and overlap patterns: the coding agent CURE modifies only 3.2% of parameters, ToolRL affects 46.2%, and MemAgent affects 54.3%. The fraction of non-zero elements in unique non-overlapping regions also differs sharply: coding, tool, and memory agents concentrate 6.3%, 40.8%, and 47.5% of their respective non-zero elements in unique regions. Additional analyses on more agents and on Llama3.2-3B show similar heterogeneity, including a Tool agent unique ratio of 54.0% and a Math agent that is denser and mostly shared (Yuan et al., 20 Jan 2026).
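
As a minimal sketch of the two definitions above, the following Python computes a task vector and its sparsity on toy data. The function names, the flat-list representation of parameters, and the toy values are illustrative assumptions rather than anything taken from the paper; only the threshold $\epsilon = 10^{-5}$ follows the text.

```python
def task_vector(theta_t, theta_pre):
    """Elementwise difference tau_t = theta_t - theta_pre."""
    if len(theta_t) != len(theta_pre):
        raise ValueError("parameter vectors must have the same length")
    return [a - b for a, b in zip(theta_t, theta_pre)]


def sparsity(theta_t, theta_pre, eps=1e-5):
    """Fraction of parameters whose absolute change is <= eps."""
    tau = task_vector(theta_t, theta_pre)
    changed = sum(1 for v in tau if abs(v) > eps)
    return 1.0 - changed / len(tau)


# Toy values (assumed): only the third parameter moves during fine-tuning.
theta_pre = [0.10, -0.20, 0.30, 0.40]
theta_t = [0.10, -0.20, 0.35, 0.40]
print(sparsity(theta_t, theta_pre))  # 0.75
```

Here one of four parameters changes, so the sparsity is 0.75. On real checkpoints the same computation would presumably be run per tensor or in chunks rather than on Python lists.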

Under this distributional structure, standard global averaging is suboptimal. A typical merge is

$$\boldsymbol{\tau}_{\text{merged}} = \frac{1}{N}\sum_{t=1}^N \boldsymbol{\tau}_t.$$

If a parameter update is unique to one agent, averaging with $N-1$ zeros reduces its magnitude to $1/N$ of its original value. The paper attributes the resulting degradation to signal dilution: unique RL updates encode behavior-specific knowledge that shows negligible cross-task interference but is critical for in-domain performance. This analysis is used to explain why several standard baselines underperform on RL-trained agents. Task Arithmetic downscales unique signals through global scaling; TIES-Merging uses uniform trimming over heterogeneous sparsity; Fisher merging normalizes by total Fisher precision across tasks; and DARE rescales for random dropout rather than shared-versus-unique structure, while still relying on global averaging (Yuan et al., 20 Jan 2026).
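
The dilution effect can be illustrated with a small example. The three task vectors below are made-up values in which the second coordinate is updated by one agent only; this is a sketch of plain global averaging, not of any specific baseline implementation.

```python
def global_average(task_vectors):
    """Uniform average of N task vectors, coordinate by coordinate."""
    n = len(task_vectors)
    return [sum(column) / n for column in zip(*task_vectors)]


# Toy task vectors (assumed values):
# coordinate 0 is updated by all three agents, coordinate 1 by agent 0 only.
task_vectors = [
    [0.5, 0.75, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
merged = global_average(task_vectors)
print(merged)  # [0.5, 0.25, 0.0]
```

The shared coordinate keeps its magnitude, while the unique update shrinks from 0.75 to 0.25, one third of its original value with three agents.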

3. Distribution-aware formulation of RAM and RAM+

RAM operationalizes the shared-versus-unique distinction by probing update distributions with binary masks. For each task vector,

$$\mathbf{m}_t = \mathbb{I}\left(|\boldsymbol{\tau}_t| > \epsilon\right),$$

and the overlap count vector is

$$\mathbf{c} = \sum_{t=1}^{N} \mathbf{m}_t.$$

For task $t$, updated parameters are partitioned into Shared regions with $c_i \ge 2$ and Unique regions with $c_i = 1$. The Overlap-Unique Ratio is

$$\rho_t = \frac{|\{i : m_{t,i} = 1,\ c_i \ge 2\}|}{|\{i : m_{t,i} = 1,\ c_i = 1\}|},$$

so higher $\rho_t$ indicates that task $t$ lies more in the shared subspace (Yuan et al., 20 Jan 2026).
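
The mask and overlap analysis described above can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the ratio is assumed to be the number of shared updated coordinates divided by the number of unique updated coordinates, and returning infinity when a task has no unique coordinates is a convention chosen here, not one stated in the source.

```python
def binary_mask(tau, eps=1e-5):
    """1 where the task vector changes a parameter by more than eps, else 0."""
    return [1 if abs(v) > eps else 0 for v in tau]


def overlap_counts(masks):
    """Number of tasks that update each coordinate."""
    return [sum(column) for column in zip(*masks)]


def shared_unique_sizes(mask, counts):
    """Sizes of the Shared (count >= 2) and Unique (count == 1) regions of one task."""
    shared = sum(1 for m, c in zip(mask, counts) if m and c >= 2)
    unique = sum(1 for m, c in zip(mask, counts) if m and c == 1)
    return shared, unique


def overlap_unique_ratio(mask, counts):
    """Assumed definition: |Shared| / |Unique| for one task."""
    shared, unique = shared_unique_sizes(mask, counts)
    return shared / unique if unique else float("inf")


# Toy task vectors (assumed values).
task_vectors = [
    [0.5, 0.75, 0.0, 0.0],
    [0.5, 0.0, 0.25, 0.0],
    [0.5, 0.0, 0.25, 0.0],
]
masks = [binary_mask(tau) for tau in task_vectors]
counts = overlap_counts(masks)
print(counts)                                  # [3, 1, 2, 0]
print(overlap_unique_ratio(masks[0], counts))  # 1.0
```

In this toy case the first task has one shared and one unique coordinate, while the other two tasks have only shared coordinates.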

The paper motivates rescaling through a functional decomposition of task-performance gain,

[equation lost in extraction]

with a merged approximation

[equation lost in extraction]

Under an isotropic importance assumption, the rescaling criterion becomes

[equation lost in extraction]

For numerical stability, RAM uses clipped linear scaling,

[equation lost in extraction]

with recommended defaults […] and […]. RAM denotes the […] special case with no rescaling, while RAM+ denotes the rescaled variant with […]. A soft-saturation alternative,

[equation lost in extraction]

also improves over RAM without rescaling, but RAM+ with clipped linear scaling performs better on average: 66.55 versus 65.46 for soft saturation and 64.82 for RAM (Yuan et al., 20 Jan 2026).

The selective merge is defined elementwise. Let $c_i$ denote the number of tasks that update parameter $i$. If no task updates parameter $i$ ($c_i = 0$), the merged update is $0$. If exactly one task updates $i$ ($c_i = 1$), the unique update is preserved and optionally amplified by that task's rescaling factor. If multiple tasks update $i$ ($c_i \ge 2$), their shared updates are averaged:

$$\tau_{\text{merged},i} = \frac{1}{c_i}\sum_{t:\,|\tau_{t,i}| > \epsilon} \tau_{t,i}.$$

This procedure averages consensus-bearing shared regions, preserves unique behavior-bearing regions, and leaves unchanged regions at zero so that base capabilities remain intact. The algorithmic cost is […] for masks, overlap counts, rescaling factors, and the merged vector, with space complexity […]; layerwise processing and chunking are supported to fit memory constraints (Yuan et al., 20 Jan 2026).
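
The three-case rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `ram_merge` and `scales` are hypothetical names, shared updates are assumed to be averaged over the tasks that actually update the coordinate, and the per-task rescaling factors are passed in directly rather than derived from the paper's criterion.

```python
def ram_merge(task_vectors, scales=None, eps=1e-5):
    """Selective elementwise merge of N task vectors (flat lists).

    scales[t] is an assumed per-task factor applied to unique updates only;
    the default of 1.0 corresponds to no rescaling.
    """
    n = len(task_vectors)
    if scales is None:
        scales = [1.0] * n
    merged = []
    for column in zip(*task_vectors):
        active = [t for t, v in enumerate(column) if abs(v) > eps]
        if not active:
            merged.append(0.0)  # nobody updates it: leave the base value alone
        elif len(active) == 1:
            t = active[0]
            merged.append(scales[t] * column[t])  # unique: preserve, optionally amplify
        else:
            merged.append(sum(column[t] for t in active) / len(active))  # shared: average
    return merged


# Toy task vectors (assumed values).
task_vectors = [
    [0.5, 0.75, 0.0, 0.0],
    [0.5, 0.0, 0.25, 0.0],
    [0.5, 0.0, 0.25, 0.0],
]
print(ram_merge(task_vectors))                          # [0.5, 0.75, 0.25, 0.0]
print(ram_merge(task_vectors, scales=[2.0, 1.0, 1.0]))  # [0.5, 1.5, 0.25, 0.0]
```

Unlike the global average, the unique update at the second coordinate keeps its full magnitude, and the merged model would then be obtained by adding the result to the base parameters.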

4. Empirical behavior, implementation, and performance of parameter-space RAM

The principal evaluation of parameter-space RAM uses transformer LLMs and RL-trained agents. On Qwen2.5-7B-Instruct, the agents are CURE for coding, ToolRL for tool use, and MemAgent for long-context memory, with extensions including ZeroSearch and AutoTIR. On Llama3.2-3B-Instruct, the evaluated agents are GRPO-Math, ToolRL, and ZeroSearch. RAM is applied elementwise across all parameters, without special casing of attention, MLP, embeddings, or layer norms. Coding is evaluated on LiveBench and LiveCodeBench with ACC and UT metrics; tool use on BFCL Live and Non-Live with Parallel and P_Mul subsets; memory on RULER HotpotQA and SQuAD across 7K–64K contexts, with additional results extending to 8K–896K (Yuan et al., 20 Jan 2026).

| Variant | Average score | Efficiency note |
|---|---|---|
| RAM | 64.82 | 75.4 seconds; ~5.5× faster than DARE+TA |
| RAM+ | 66.55 | SOTA performance; significantly faster than TIES/DARE variants (>400 s) |
| DARE+TA | 63.33 | Strongest reported baseline among standard merges |
The headline result is that RAM and RAM+ surpass standard merging baselines and often reach or exceed specialist performance. In the three-agent setting, RAM averages 64.82 and RAM+ averages 66.55, exceeding DARE+TA at 63.33. For coding, RAM+ beats the coding specialist on LiveBench ACC/UT, 40.23/52.57 versus 37.70/49.27, and on LiveCodeBench UT, 46.84 versus 45.76. For tool use, RAM+ reaches Live P_Mul 70.83 versus the Tool specialist’s 58.33, while RAM achieves Non-Live P_Mul 91.50, surpassing or matching specialists and baselines. For memory, RAM+ attains SQuAD 64K 82.03 versus the Memory specialist’s 77.34, and RAM scores HotpotQA 75.78–74.22 on multiple lengths, setting or matching the best merged-model scores (Yuan et al., 20 Jan 2026).

Pairwise merges remain robust. In Coding+Tool, RAM+ averages 60.04 versus 56.74 for the strongest baseline DARE+TIES. In Tool+Memory, RAM reaches 76.67 and RAM+ 75.86, both above baselines. In Coding+Memory, RAM+ obtains the highest average, 61.21, while RAM reaches SQuAD 82.03 and HotpotQA 78.12/82.03 on multiple lengths. On Llama3.2-3B, RAM and RAM+ outperform baselines across Math, Search, and Tool, with positive synergy where merged generalists surpass specialists in Math and Tool while retaining Search capabilities. Additional evidence shows that unique regions of reinforced task vectors drive domain-specific gains and cause little interference on other domains; artificially diluting unique magnitudes to […] degrades in-domain performance. Instruction-following on IFEval shows RAM is safer, in the sense of less forgetting, than trimming-heavy baselines such as TIES and DARE on smaller Llama models (Yuan et al., 20 Jan 2026).

In practice, the recommended procedure is to start with RAM at […] to avoid over-amplification, then sweep […] with […] and choose the value by average validation across domains. The paper reports that performance peaks at […], with overly large values disrupting general knowledge. Implementation guidance includes computing all task vectors from the same $\theta_{\text{pre}}$, verifying mask construction with the threshold $\epsilon$, maintaining consistent precision such as bfloat16 or float32 to avoid thresholding artifacts, and processing parameters layerwise or in chunks when memory is constrained (Yuan et al., 20 Jan 2026).

5. RL-driven merging beyond parameter-space RAM

A related but distinct line of work treats model merging itself as an MDP solved by an RL policy. In RMM, the state at each step is a merging map that records how often each action has been chosen at each layer. The action space contains model actions that copy a specific model’s layer as-is, merge actions that apply one of a fixed set of operators such as weight averaging, task arithmetic, Ties-Merging, or DARE, and layer actions such as skip or back. Rewards are computed only after a merged model is assembled, using a task-averaged validation metric on small data subsets. PPO is used to optimize the policy, while Dynamic Average Reward smooths noisy subset evaluations according to

[equation lost in extraction]

The method is training-free with respect to the original models: it performs no gradient computations on them, learns only the actor and critic, and is presented as feasible for edge devices. Reported acceleration is approximately 10.4× with 10% data and approximately 96.5× with 1% data, with near-identical average accuracy thanks to DAR. Reported results include ViT-S/32 average accuracy 79.53 versus a best baseline near 75.43, ViT-B/16 average 86.10 versus about 79.10, cross-domain ViT-B/16 merging of CUB-200 and Dogs at 73.80 versus 59.24 for the best baseline, and strong gains on T5-Small and T5-Base for QA and mixed tasks (Han et al., 27 Mar 2025).
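
To make the layer-wise action space more concrete, the sketch below assembles a merged model from a fixed action sequence and records a merging map. It is a simplified illustration under several assumptions: the PPO policy, the Dynamic Average Reward smoothing, and the skip/back layer actions are omitted; weight averaging is the only merge operator; and all names (`assemble`, `MERGE_OPS`, `reward`) are hypothetical.

```python
from collections import Counter


def average_merge(layers):
    """Weight averaging of one layer across models (each layer is a flat list)."""
    n = len(layers)
    return [sum(column) / n for column in zip(*layers)]


MERGE_OPS = {"average": average_merge}  # hypothetical operator registry


def assemble(models, actions, merge_ops=MERGE_OPS):
    """Apply one action per layer: ("model", k) copies model k's layer,
    ("merge", name) applies a merge operator across all models' layers.
    The merging map counts how often each action was chosen at each layer."""
    merged = []
    merging_map = [Counter() for _ in actions]
    for layer_idx, (kind, arg) in enumerate(actions):
        stack = [model[layer_idx] for model in models]
        if kind == "model":
            merged.append(list(stack[arg]))
        elif kind == "merge":
            merged.append(merge_ops[arg](stack))
        else:
            raise ValueError(f"unsupported action kind: {kind}")
        merging_map[layer_idx][(kind, arg)] += 1
    return merged, merging_map


def reward(merged, task_metrics):
    """Task-averaged validation score, computed only after assembly."""
    return sum(metric(merged) for metric in task_metrics) / len(task_metrics)


# Two toy models with two layers each (assumed values).
models = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[3.0, 3.0], [4.0, 4.0]],
]
actions = [("model", 1), ("merge", "average")]
merged, merging_map = assemble(models, actions)
print(merged)  # [[3.0, 3.0], [3.0, 3.0]]
```

In the actual method the action sequence would be chosen by the learned policy, and the reward callables would evaluate the merged model on small validation subsets.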

DORA extends the RAM idea to online token merging in Vision Transformers. It formulates token merging across Transformer blocks as a sequential MDP whose state at each block is

[equation lost in extraction]

where the layer embedding makes the state depth-aware. The action is a binary token mask indicating source tokens to be merged into destination tokens. The step reward is dense and combines efficiency with a non-linear distillation penalty,

[equation lost in extraction]

and the policy objective maximizes expected discounted return under PPO. DORA uses an asymmetric Actor-Critic architecture in which a high-capacity Critic is used only offline for stable training, while a minimal Actor head is retained for deployment. Merging itself is asymmetric many-to-one: each source token is routed to the most similar destination token by cosine similarity and aggregated by a weighted average. The paper explicitly positions DORA as functionally embodying RAM, even though it does not use the RAM label in the title (He et al., 12 May 2026).
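
The many-to-one merge step can be sketched as follows. This is a minimal illustration, not DORA's implementation: the source mask is given by hand instead of by a learned policy, tokens are plain Python lists, and the weighted average is assumed to use per-token sizes (the number of original tokens each one represents), which the source text does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_tokens(tokens, source_mask, sizes=None):
    """Route each source token (mask == 1) to its most similar destination
    token (mask == 0) and aggregate by a size-weighted average."""
    if sizes is None:
        sizes = [1.0] * len(tokens)
    dst = [i for i, m in enumerate(source_mask) if not m]
    if not dst:
        raise ValueError("at least one destination token is required")
    acc = {j: [sizes[j] * x for x in tokens[j]] for j in dst}
    weight = {j: sizes[j] for j in dst}
    for i, m in enumerate(source_mask):
        if not m:
            continue
        # Similarity is measured against the original destination tokens.
        j = max(dst, key=lambda d: cosine(tokens[i], tokens[d]))
        acc[j] = [a + sizes[i] * x for a, x in zip(acc[j], tokens[i])]
        weight[j] += sizes[i]
    merged = [[a / weight[j] for a in acc[j]] for j in dst]
    return merged, [weight[j] for j in dst]


# Toy tokens (assumed values): the third token is marked as a source.
tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
merged, new_sizes = merge_tokens(tokens, source_mask=[0, 0, 1])
print(merged)     # [[2.0, 0.0], [0.0, 1.0]]
print(new_sizes)  # [2.0, 1.0]
```

The source token points in the same direction as the first destination, so it is merged there, and the sequence shrinks from three tokens to two.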

The DORA results show that RL-controlled token merging improves the accuracy-efficiency Pareto front over static heuristics. Under aligned Top-1 accuracy constraints on ImageNet-1K, DORA reduces ViT-Tiny to 0.93 GFLOPs with 28.5% average token reduction, ViT-Small to 3.53 GFLOPs with 24.1%, ViT-Base to 12.64 GFLOPs with 28.4%, and ViT-Large to 47.74 GFLOPs with 22.7%, all improving over ToMe. Under negligible accuracy-drop constraints of […], DORA attains up to 12.66% token merging on DeiT-Small, and on ViT-Tiny it yields a 569.7% relative gain in token reduction over the strongest baseline, 10.18% versus 0–1.52%. On OOD benchmarks such as ImageNet-A and ImageNet-C, its relative efficiency advantages exceed 430%, while Actor-only deployment keeps overhead below 1% FLOPs on Small–Large and 2.59% on Tiny (He et al., 12 May 2026).
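
As a quick arithmetic check, the 569.7% figure is consistent with a relative gain computed against the upper end of the baseline range, 1.52%. That reading is an inference from the numbers quoted above, not a statement from the paper, and the helper name is illustrative.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Token reduction on ViT-Tiny: 10.18% versus a best baseline of 1.52%.
print(round(relative_gain_percent(10.18, 1.52), 1))  # 569.7
```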

These works broaden the meaning of RAM. In the 2026 agentic-model formulation, merging acts on task vectors in parameter space. In RMM, merging is a layer-wise RL search problem over candidate operators. In DORA, merging acts on token sets online during inference. The family resemblance lies in the use of RL to make merge decisions adaptively rather than by fixed global heuristics.

6. Other usages, misconceptions, and limitations

In mixed-traffic control, RAM has also been used for cooperative vehicle merging rather than model or token merging. The 2021 highway on-ramp work instantiates RAM as a decentralized parameter-shared actor–critic system for autonomous vehicles on merge and through lanes. Each agent observes nearby vehicles within 150 m using features […], acts in a discrete action space […], and optimizes local rewards combining collision, speed, time headway, and ramp-delay terms with […]. Action masking removes invalid actions, and a priority-based safety supervisor overrides unsafe actions using multi-step trajectory checks and right-of-way criteria. Over 30 epochs and 3 seeds, the reported collision rates are 0.00 in Easy, Medium, and Hard traffic, with average speeds 25.72, 24.08, and 22.73 m/s, respectively, outperforming or matching MAPPO, MAACKTR, MAA2C, and MPC baselines (Chen et al., 2021).

The main misconception surrounding RAM is therefore semantic rather than technical: the term does not name a single method across the literature. In one setting it is a distribution-aware merger of RL-specialized LLM checkpoints; in another it is the same idea as Reinforced Model Merging; in another it is an RL policy for dynamic token merging in ViTs; and in another it is a cooperative MARL system for road traffic. A plausible implication is that the most precise use of “RAM” requires immediate specification of domain and object of merging: parameters, layers, tokens, or vehicles.

The limitations are correspondingly domain-specific. For parameter-space RAM, the isotropic assumption used to derive the rescaling criterion may not hold per parameter; incorporating curvature such as Fisher information could refine rescaling but increases cost. As the number of agents grows, collisions in shared subspaces increase and simple averaging may require conflict resolution; extreme sparsity or heavy noise can destabilize […]; off-policy regimes, significant domain shift, and 70B+ models remain open validation targets (Yuan et al., 20 Jan 2026). For RMM, reward sparsity, subset sensitivity, incompatible architectures, and the absence of global optimality guarantees remain explicit limitations (Han et al., 27 Mar 2025). For DORA, potential limitations include sensitivity to the reward weights […] and […], reliance on a teacher signal for the KD penalty, and reduced gains on inputs with low redundancy where safe merges are scarce (He et al., 12 May 2026). For traffic RAM, the authors identify sim-to-real gaps and limitations of the HDV behavior model as open issues (Chen et al., 2021).

Taken together, the literature uses RAM as a family of RL-mediated merging ideas rather than a single settled formalism. Its most technically mature formulation for RL-trained agentic models is the shared-versus-unique task-vector merger of (Yuan et al., 20 Jan 2026), but the broader record shows that the same acronym has become a reusable label for adaptive merge decision-making across parameter fusion, architecture search, token reduction, and cooperative control.
