MATTRL: Test-Time Multi-Agent RL
- MATTRL is a framework that performs multi-agent decision making entirely at inference using dynamically injected experience pools and structured test-time deliberation.
- It replaces conventional MARL training with a consensus protocol, credit assignment via difference rewards, and experience retrieval to improve robustness and stability.
- Empirical results in medicine, mathematics, and education show MATTRL outperforms baseline methods, achieving higher accuracy and lower variance in performance metrics.
Multi-Agent Test-Time Reinforcement Learning (MATTRL) refers to a class of frameworks and algorithms for multi-agent systems that acquire and exploit structured experience at inference time, rather than through conventional training or policy gradient updates. The MATTRL paradigm replaces resource-intensive and unstable multi-agent reinforcement learning (MARL) training with dynamically injected experience pools, structured test-time deliberation, and credit assignment on dialogue turns. This approach enables robust, high-accuracy, and distribution-shift tolerant multi-agent reasoning, particularly in domains characterized by diverse agent expertise and sparse feedback signals (Hu et al., 14 Jan 2026).
1. Formal Problem Formulation
MATTRL models multi-agent deliberation as a finite-horizon cooperative decision process evaluated entirely at inference. Each reasoning instance, such as a medical diagnosis, mathematical proof, or educational episode, is described by a task context $c$, a fixed catalog of $N$ specialists (each with textual expertise), a coordinator agent, and a continually updated experience pool $\mathcal{E}$. For a specific instance, a team of $m \le N$ specialists is assembled. The system state at round $t$ comprises:
- The current context $c$,
- Each specialist's evolving set of opinions $O_i^t$,
- A shared bulletin $B^t$,
- The current experience pool $\mathcal{E}$.
Each non-converged agent $i$ selects an action $a_i^t$ (a textual utterance or opinion) conditioned on these elements plus retrieved experience $E_i^t$.
Team-level reward is defined post hoc. At the end of up to $T$ deliberation rounds, a terminal outcome $R$ (e.g., hit@$k$, exact match) is computed. Turn- and agent-level reward allocation uses a decay-weighted blending of per-utterance quality and terminal outcome: $r_i^t = \gamma^{\,T-t}\, q_i^t + w_i R$, where $q_i^t$ is the LLM-judge score of agent $i$ at turn $t$, $\gamma \in (0,1]$ is the decay factor, and $w_i$ assigns contribution ratios by normalizing each agent's accumulated credit across the team.
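The reward allocation above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function name `turn_rewards`, the default `gamma`, and the softmax-over-total-score choice for $w_i$ are assumptions for the sketch.

```python
import math

def turn_rewards(judge_scores, terminal_outcome, gamma=0.9):
    """Decay-weighted blend of per-utterance quality and terminal outcome.

    judge_scores: dict mapping agent id -> list of per-turn LLM-judge scores q_i^t.
    terminal_outcome: scalar team-level result R (e.g., 1.0 for a correct answer).
    Returns a dict mapping agent id -> list of shaped per-turn rewards r_i^t.
    """
    T = max(len(qs) for qs in judge_scores.values())
    # Contribution ratios w_i: softmax over each agent's total judge score
    # (one plausible normalization; the paper only specifies softmax renormalization).
    totals = {i: sum(qs) for i, qs in judge_scores.items()}
    z = sum(math.exp(v) for v in totals.values())
    w = {i: math.exp(v) / z for i, v in totals.items()}
    rewards = {}
    for i, qs in judge_scores.items():
        # Later turns decay less: exponent T - t with 1-indexed turn t.
        rewards[i] = [gamma ** (T - (t + 1)) * q + w[i] * terminal_outcome
                      for t, q in enumerate(qs)]
    return rewards
```

With a successful outcome ($R = 1$), an agent's later, well-judged turns receive more credit than its earlier ones, while weak agents still inherit a small share through $w_i$.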
2. Architectural Principles and Deliberation Process
MATTRL orchestrates inference in three sequential stages:
Stage I: Team Formation. The coordinator LLM parses the task context and specialist catalog to select $m$ roles, forming the team.
Stage II: Multi-Round Consensus with Experience Retrieval. For each round until all agents converge or the round budget $T$ is reached:
- Non-converged specialists retrieve the top-$k$ relevant experience entries $E_i^t$ from $\mathcal{E}$ (vectorized by dense embeddings, e.g., via FAISS).
- Each specialist generates a new opinion $o_i^t$ as a function of the context, opinion history, and $E_i^t$.
- Opinion updates $\Delta o_i^t$ are generated and merged through a meeting operator to broadcast a deduplicated incremental bulletin $B^t$.
- Convergence is flagged if no further opinion changes arise.
Stage III: Report Synthesis and Final Decision. The coordinator compiles and summarizes all specialist outputs, with optional additional experience retrieved and injected, before outputting a final answer.
No model weights are updated throughout; all adaptation arises from in-context conditioning and experience retrieval.
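The dense-retrieval step in Stage II can be sketched as a plain cosine-similarity lookup. The paper uses FAISS; this NumPy stand-in, with the hypothetical name `retrieve` and precomputed embedding rows, shows the same top-$k$ selection.

```python
import numpy as np

def retrieve(query_vec, pool_vecs, pool_entries, k=3):
    """Return the k experience entries whose embeddings are most similar
    to the query embedding (cosine similarity).

    query_vec: 1-D embedding of the current context/opinion state.
    pool_vecs: 2-D array, one embedding row per stored experience entry.
    pool_entries: the textual entries aligned with pool_vecs rows.
    """
    q = query_vec / np.linalg.norm(query_vec)
    P = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = P @ q                    # cosine similarity to each pool entry
    top = np.argsort(-sims)[:k]    # indices of the k most similar entries
    return [pool_entries[i] for i in top]
```

In production this lookup would be an approximate-nearest-neighbor index; the exact scan above is equivalent for small pools.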
3. Turn-Level Experience Pool and Credit Assignment Mechanisms
After each test-time consultation, all agent utterances are scored with LLM-based judgement and the observed terminal outcome. Reward-shaped experience entries are constructed:
- Only utterances with reward $r_i^t \ge \tau$ (a fixed threshold) are summarized and stored in the pool $\mathcal{E}$.
- Summaries are generated by a separate LLM, distilling dialogue state, utterance, and reward information.
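The pool-update rule reduces to threshold filtering plus summarization. A minimal sketch, in which the name `update_pool` and the stub `summarize` (a separate LLM call in the paper) are illustrative:

```python
def update_pool(pool, utterances, rewards, tau=0.5,
                summarize=lambda u, r: f"{u} (r={r:.2f})"):
    """Append reward-shaped experience entries to the pool.

    Only utterances whose shaped reward clears the threshold tau are kept;
    each surviving utterance is distilled by `summarize` (an LLM in the
    paper, a string stub here) before storage.
    """
    for u, r in zip(utterances, rewards):
        if r >= tau:
            pool.append(summarize(u, r))
    return pool
```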
Three credit assignment schemes enable effective agent turn attribution:
- Naïve Shared Credit: Evenly splits outcome credit within a turn.
- Difference Rewards: Marginalizes each agent's contribution by re-evaluating the team objective with and without the agent's new utterance.
- Shapley-Style Approximation: Estimates each agent's fair value using expected marginal contributions across permutations of team orderings (empirically, computationally heavier and prone to value dilution).
Contribution ratios $w_i$ are always renormalized using softmax exponentiation for stability.
4. Test-Time Learning and Deliberation Algorithm
The test-time inference loop in MATTRL is characterized by experience retrieval, specialized utterance generation, and consensus aggregation.
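The loop's control flow can be sketched in Python. All names here (`deliberate`, `specialists`, `meeting`) are illustrative; each specialist is modeled as a callable that maps context, bulletin, and retrieved experience to a textual opinion.

```python
def deliberate(context, specialists, retrieve, meeting, max_rounds):
    """Test-time deliberation: retrieve experience, generate opinions,
    merge updates into a shared bulletin, stop at consensus or budget.

    specialists: dict name -> generate(context, bulletin, retrieved) callable.
    retrieve: callable (context, name) -> list of experience entries.
    meeting: callable (bulletin, updates) -> merged bulletin.
    """
    bulletin = []
    opinions = {name: None for name in specialists}
    for _ in range(max_rounds):
        updates = {}
        for name, generate in specialists.items():
            retrieved = retrieve(context, name)
            opinion = generate(context, bulletin, retrieved)
            if opinion != opinions[name]:      # only changed opinions count
                opinions[name] = opinion
                updates[name] = opinion
        if not updates:                        # consensus: no opinion changed
            break
        bulletin = meeting(bulletin, updates)  # broadcast incremental bulletin
    return opinions, bulletin
```

Note that no gradient step appears anywhere in the loop; all adaptation enters through the `retrieve` argument.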
This design enables adaptation purely through contextual retrieval of structured, reward-shaped traces.
5. Consensus Protocol and Final-Answer Aggregation
Consensus is established through the MEETING operator, which formalizes abstracting and deduplicating incremental specialist opinion updates $\Delta o_i^t$ into a shared bulletin. Convergence is achieved when agents propose no further changes. Final answer aggregation is conducted by the coordinator, summarizing all agent outputs into a unified decision report. While other aggregation schemes are possible (e.g., majority vote, agent-confidence weighting), the primary implementation uses a single-pass summarizer.
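Stripped of the LLM-driven abstraction step, the deduplicating merge at the heart of the MEETING operator reduces to the following sketch (exact-match dedup is an assumption; the paper deduplicates at the semantic level):

```python
def meeting(bulletin, updates):
    """Merge a round's opinion updates into the shared bulletin,
    broadcasting only entries not already present (order-preserving)."""
    seen = set(bulletin)
    increment = []
    for u in updates:
        if u not in seen:
            increment.append(u)
            seen.add(u)
    return bulletin + increment
```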
6. Empirical Evaluation and Results
MATTRL was benchmarked across medicine, mathematics, and education:
- Medicine: RareBench Task 4 (2,185 rare-disease diagnosis cases, 421 labels). On hit@$k$ and MRR metrics, MATTRL outperforms MDAgent and RareAgents baselines by an average of ≈3.67% (hit@1: 0.39 for MATTRL vs 0.32–0.35 for baselines).
- Mathematics: HLE dataset (856 expert-level problems). MATTRL achieves accuracy of 0.36 (vs 0.33 multi-agent and 0.27 single-agent baselines, yielding +8.67% relative improvement).
- Education: On SuperGPQA, MATTRL achieves higher post-test learning gains compared to both single- and multi-teacher baselines.
Ablation studies demonstrate that difference-reward credit yields the highest top-rank precision and solution stability. MATTRL’s experience pool substantially mitigates distribution shift; variance in hit@1 is lower compared to RL-trained agents. There is no need for replay buffers, weight updates, or co-adaptation, thus avoiding typical MARL instability (Hu et al., 14 Jan 2026).
7. Comparative Analysis and Implications
Relative to end-to-end MARL and single-agent test-time RL (TTRL), MATTRL exhibits several advantages:
- No weights are updated; adaptation is achieved via context injection of high-value experience, eliminating non-stationarity and catastrophic forgetting.
- Multi-agent teams benefit from cross-checking, reducing high-variance failure from sparse rewards and enhancing robustness to distribution shift.
- Difference-reward credit assignment balances computational cost with assignment fidelity, outperforming naive shared and Shapley approaches in reported experiments.
- Structured experience pools can be relayed across tasks and domains, enabling persistent generalization.
- Adaptive routing between single- and multi-agent execution provides further efficiency and accuracy benefits.
A plausible implication is that test-time textual experience conditioning, as instantiated in MATTRL, offers a practical alternative for robust multi-agent reasoning in domains where reward feedback is structured, costly, or highly variable.
8. Theoretical Foundations and Extensions
The MATTRL approach is complemented by the Meta Representations for Agents (MRA) framework (Zhang et al., 2021), which offers a formal method for constructing meta-policy sets across varying agent populations in Markov games. MRA employs hierarchical latent-variable policies and maximizes a constrained mutual-information objective, guaranteeing that, under suitable conditions, the meta-policy set contains Nash equilibria for all games in a test suite. This underlies MATTRL's ability to generalize across tasks and adapt rapidly, with first-order meta-updates facilitating fast convergence without retraining. The theoretical analysis in (Zhang et al., 2021) thus provides convergence guarantees for multi-agent test-time reinforcement learning when the latent-space capacity and diversity-encouraging objectives are properly set.
For further technical and methodological details, see (Hu et al., 14 Jan 2026) and (Zhang et al., 2021).