Test-Time RSA for LLMs: Inference Aggregation Strategies
- The paper presents test-time RSA as a method to aggregate multiple LLM outputs using techniques like recursive self-aggregation and multi-model switching.
- It introduces explicit Bayesian-inspired scoring methods that enhance correctness and pragmatic informativeness by re-ranking candidate solutions.
- Empirical results demonstrate enhanced accuracy and compute efficiency in tasks such as code generation, mathematical reasoning, and natural language processing.
Test-Time RSA for LLMs encompasses a family of test-time (inference-time) strategies designed to improve model output quality by aggregating or re-ranking multiple candidate generations according to explicit principles of reasoning or pragmatic communication. This collection of techniques draws on classical Rational Speech Act (RSA) modeling from linguistics and cognitive science, as well as recent advances in repeated sampling, self-consistency, evolutionary aggregation, and multi-model switching. The fundamental aim is to leverage additional inference-time compute—over and above a single forward pass—either within a single model or across several models, yielding solutions that are more reliable, robust, and, in some formulations, more pragmatically informative.
1. Conceptual Foundations of Test-Time RSA in LLMs
Test-Time RSA originated as a mechanism to approximate rational pragmatic reasoning, most notably via the Rational Speech Act framework, wherein agents recursively model each other's beliefs and utterances. In the LLM context, RSA-inspired methods treat the model as a probabilistic speaker or listener and score (and sometimes generate) utterances or solutions by aggregating over multiple alternatives—often using explicit Bayesian or information-theoretic objectives.
Recent literature extends these ideas beyond strictly linguistic settings, encompassing domains such as code generation, mathematical reasoning, and multi-step problem solving. Key contributions include recursive self-aggregation (RSA), repeated sampling with majority voting, and multi-model sample pooling with dynamic stopping and weighting. These approaches are empirically motivated by the strong correlation between consistency across multiple outputs and correctness, as well as by the ability to exploit diverse failure modes among different models or reasoning trajectories (Venkatraman et al., 30 Sep 2025, Chen et al., 1 Apr 2025, Jian et al., 2024).
2. Formal Problem Setting and Baseline Paradigms
Let $x$ denote a test input (e.g., a math question or code task), and $p_\theta$ a reference LLM parameterizing a conditional distribution $p_\theta(\tau \mid x)$ over output chains $\tau$. Each output is scored via a reward function $r(\tau, y^*)$, where $y^*$ is a ground-truth answer. The goal is to maximize expected reward under a fixed inference compute budget (e.g., total LLM calls).
Standard baseline paradigms include:
- Parallel scaling: Drawing $N$ independent samples, each with a call to the reference LLM $p_\theta$, and aggregating (e.g., by majority vote among final answers).
- Sequential scaling (self-refinement): Iteratively prompting the model to refine its own output, performing $T$ sequential calls.
- Repeated-sampling-then-voting: Sampling $N$ output candidates and selecting the answer with maximal support, optionally weighting votes by confidence or entropy (Chen et al., 1 Apr 2025).
Test-time RSA generalizes these by allowing mixed parallel and sequential aggregation, and by using explicit scoring schemes motivated by Bayesian, pragmatic, or evolutionary principles.
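The parallel-scaling baseline with majority voting can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: `sample_answer` is a hypothetical callable standing in for one LLM call that returns a final answer string, and ties are broken in favor of the answer seen first.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]


def parallel_scaling(sample_answer: Callable[[str], str], question: str, n: int) -> str:
    """Draw n independent samples (one hypothetical LLM call each) and vote."""
    answers = [sample_answer(question) for _ in range(n)]
    return majority_vote(answers)
```

Sequential self-refinement would instead feed each output back into the next call; the two can be mixed, which is the setting the methods below address.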
3. Algorithmic Instantiations of Test-Time RSA
3.1 Recursive Self-Aggregation (RSA)
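Recursive Self-Aggregation, as described in (Venkatraman et al., 30 Sep 2025), maintains a population of $N$ candidate solutions, iteratively aggregating random subsets of size $K$ over $T$ steps:
- Initialization: $\tau_i^{(0)} \sim p_\theta(\cdot \mid x)$ for $i = 1, \dots, N$.
- Aggregation: At each step $t = 1, \dots, T$, for each $i = 1, \dots, N$:
  - Sample $S_i^{(t)} \subseteq \mathcal{P}_{t-1}$ ($|S_i^{(t)}| = K$, $\mathcal{P}_{t-1}$ current population)
  - Aggregate: $\tau_i^{(t)} \sim p_\theta(\cdot \mid x, S_i^{(t)})$
- Final selection: Output the final population $\mathcal{P}_T$, and optionally reduce to a final answer via majority vote.
Mathematical formalism:
$$\mathcal{P}_t = \{\tau_i^{(t)}\}_{i=1}^{N}, \qquad \tau_i^{(t)} \sim p_\theta(\cdot \mid x, S_i^{(t)}), \qquad S_i^{(t)} \subseteq \mathcal{P}_{t-1},\ |S_i^{(t)}| = K$$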
RSA exploits the information embedded in full solution chains (not merely final answers), enabling incremental improvement by leveraging partially correct reasoning trajectories (Venkatraman et al., 30 Sep 2025).
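The population loop can be sketched in a few lines. This is an illustrative skeleton under stated assumptions rather than the authors' implementation: `propose` and `aggregate` are hypothetical callables standing in for LLM calls (one that samples a fresh chain, one that is prompted with a subset of chains and writes an improved one), and subsets are drawn uniformly without replacement.

```python
import random
from typing import Callable, List, Sequence


def recursive_self_aggregation(
    propose: Callable[[str], str],
    aggregate: Callable[[str, Sequence[str]], str],
    question: str,
    n: int,
    k: int,
    steps: int,
    rng: random.Random,
) -> List[str]:
    """Return the final population after `steps` aggregation rounds."""
    if not 1 <= k <= n:
        raise ValueError("subset size k must satisfy 1 <= k <= n")
    # Initialization: n independent candidate chains.
    population = [propose(question) for _ in range(n)]
    for _ in range(steps):
        # Each new candidate is produced from a random size-k subset of the
        # previous population (the right-hand side is built before rebinding).
        population = [
            aggregate(question, rng.sample(population, k)) for _ in range(n)
        ]
    return population
```

The returned population can then be reduced to a single answer, for example by majority vote over final answers.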
3.2 Multi-LLM Aggregation: ModelSwitch
ModelSwitch (Chen et al., 1 Apr 2025) introduces multi-model sampling, distributing a fixed sample budget $B$ across $n$ diverse LLMs. Each LLM $M_i$ is allotted $k_i$ calls:
- After $k_i$ samples, if all answers from $M_i$ are consistent, that answer is returned.
- Otherwise, sampling proceeds to the next model; all responses are eventually aggregated via a weighted vote, where weights account for both a model's internal consistency (empirical entropy) and an external prior.
- Theoretical analysis demonstrates that ModelSwitch can achieve higher accuracy and reduced compute compared to single-model, high-sample baselines.
Weighted voting combines answer frequency, model strength, and consistency-derived weighting:
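$$\hat{y} = \arg\max_{y} \sum_{i=1}^{n} w_i^{\mathrm{int}}\, w_i^{\mathrm{ext}} \cdot \frac{1}{k_i} \sum_{j=1}^{k_i} \mathbb{1}[y_{i,j} = y]$$
where $y_{i,j}$ is the $j$-th of $k_i$ answers sampled from model $i$ (of $n$ models), $w_i^{\mathrm{int}}$ decreases with (normalized) entropy of model outputs, and $w_i^{\mathrm{ext}}$ reflects prior model confidence (Chen et al., 1 Apr 2025).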
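A small sketch of the switch-then-vote procedure follows. The exact weighting used in the paper is not reproduced here: the consistency weight is assumed to be one minus the normalized empirical entropy, the `prior` dictionary is a hypothetical per-model weight, and each sampler is a hypothetical callable standing in for one LLM call.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def normalized_entropy(answers: List[str]) -> float:
    """Empirical entropy of the answers, divided by its maximum log(len)."""
    counts = Counter(answers)
    total = len(answers)
    if total <= 1 or len(counts) == 1:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(total)


def weighted_vote(samples: Dict[str, List[str]], prior: Dict[str, float]) -> str:
    """Score each answer by per-model frequency times consistency and prior weights."""
    scores: Dict[str, float] = {}
    for model, answers in samples.items():
        w_int = 1.0 - normalized_entropy(answers)  # assumed form of the consistency weight
        w_ext = prior[model]                       # hypothetical external prior
        for ans, c in Counter(answers).items():
            scores[ans] = scores.get(ans, 0.0) + w_int * w_ext * c / len(answers)
    return max(scores, key=scores.get)


def model_switch(
    samplers: Dict[str, Callable[[str], str]],
    question: str,
    k_per_model: int,
    prior: Dict[str, float],
) -> str:
    """Query models in order; stop early when one model answers unanimously."""
    samples: Dict[str, List[str]] = {}
    for name, sampler in samplers.items():
        answers = [sampler(question) for _ in range(k_per_model)]
        samples[name] = answers
        if len(set(answers)) == 1:
            return answers[0]
    return weighted_vote(samples, prior)
```

Because a unanimous first model ends the loop, easy inputs cost only `k_per_model` calls, which is where the compute saving in this scheme would come from.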
3.3 Pragmatic RSA Re-ranking for Language Generation
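Within linguistic RSA (Jian et al., 2024), candidate utterances $u$ for state $s$ are scored according to a pragmatic speaker posterior:
$$S_1(u \mid s) \propto \exp\big(\alpha\,[\log L_0(s \mid u) - C(u)]\big), \qquad L_0(s \mid u) \propto [\![u]\!](s)\, P(s)$$
where $L_0$ is the literal listener and $\alpha$ the speaker rationality parameter.
Inputs to this re-ranking:
- $\mathcal{U}$: top-$k$ LLM generations plus logically constructed alternatives
- $[\![u]\!](s)$: meaning function (prompt or rule-based)
- $C(u)$: utterance cost (token length)
Final scores for candidate selection interpolate between LLM log-probability and the $S_1$ pragmatic score, with interpolation weight $\lambda \in [0, 1]$:
$$\mathrm{score}(u) = (1 - \lambda)\,\log P_{\mathrm{LLM}}(u \mid s) + \lambda\,\log S_1(u \mid s)$$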
This test-time RSA recipe enables more human-like informativeness and brevity in referential generation (Jian et al., 2024).
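The re-ranking can be illustrated on a toy reference game. Everything below is an assumption made for the sketch: a uniform prior over states, a rule-based word-overlap meaning function, a per-word cost, and hand-set LLM log-probabilities. The function names are hypothetical and do not come from the cited work.

```python
import math
from typing import Callable, Dict, List


def literal_listener(u: str, states: List[str], meaning: Callable[[str, str], bool]) -> Dict[str, float]:
    """L0(s | u): uniform over the states of which u is literally true."""
    mass = {s: 1.0 if meaning(u, s) else 0.0 for s in states}
    z = sum(mass.values())
    return {s: (m / z if z > 0 else 0.0) for s, m in mass.items()}


def pragmatic_speaker(target, utterances, states, meaning, cost, alpha=1.0) -> Dict[str, float]:
    """S1(u | s) proportional to exp(alpha * (log L0(s | u) - cost(u)))."""
    weights = {}
    for u in utterances:
        l0 = literal_listener(u, states, meaning)[target]
        weights[u] = math.exp(alpha * (math.log(l0) - cost(u))) if l0 > 0 else 0.0
    z = sum(weights.values())
    if z == 0:
        raise ValueError("no candidate utterance is true of the target state")
    return {u: w / z for u, w in weights.items()}


def rerank(target, lm_logprob: Dict[str, float], states, meaning, cost, alpha=1.0, lam=0.5) -> List[str]:
    """Sort candidates by (1 - lam) * LLM log-prob + lam * log S1, best first."""
    s1 = pragmatic_speaker(target, list(lm_logprob), states, meaning, cost, alpha)

    def score(u: str) -> float:
        if s1[u] == 0.0:
            return float("-inf")  # literally false candidates rank last
        return (1.0 - lam) * lm_logprob[u] + lam * math.log(s1[u])

    return sorted(lm_logprob, key=score, reverse=True)


# Toy scene: three objects, three candidate descriptions with equal LLM scores.
states = ["blue circle", "blue square", "green square"]
meaning = lambda u, s: all(w in s.split() for w in u.split())
lm_logprob = {"blue": -1.0, "square": -1.0, "blue square": -1.0}
ranking = rerank("blue square", lm_logprob, states, meaning, cost=lambda u: 0.1 * len(u.split()))
```

With a cheap per-word cost the unambiguous description ranks first; raising the cost shifts the ranking toward shorter but ambiguous utterances, which is the informativeness versus brevity trade-off the speaker model encodes.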
4. Compute-Efficiency and Scaling Trade-offs
Test-time RSA strategies exhibit different compute profiles:
- RSA (Recursive Self-Aggregation): With population size $N$, aggregation subset size $K$, and $T$ aggregation rounds, requires $N + NT = N(T+1)$ total LLM calls (initial plus all aggregation rounds). $N$ controls parallel breadth and $T$ sequential depth. The practical trade-off: parallelization is hardware-dependent, while increased $N$ and $K$ can slow convergence and inflate memory usage.
- ModelSwitch: Typically uses a small number of lightweight models, splitting the sample budget among them. Empirical results show fewer LLM calls needed to reach a given accuracy compared to single-model self-consistency.
- Pragmatic RSA Re-ranking: Dominant cost is in candidate generation (often using beam search or combinatorial logic). Meaning function and re-ranking add negligible overhead.
Ablation studies indicate: for RSA, increasing the number of aggregation steps $T$ yields near-monotonic gains; gains plateau as the aggregation subset size $K$ grows; a larger population $N$ raises the asymptotic upper bound but also increases resource demands (Venkatraman et al., 30 Sep 2025, Chen et al., 1 Apr 2025).
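For budgeting, the RSA cost can be tallied with simple arithmetic. The sketch below assumes a population of size `n`, subsets of size `k`, and `steps` aggregation rounds after initialization, with one LLM call per candidate per round. The `avg_chain_tokens` parameter and the token estimate are illustrative assumptions, not figures from the papers.

```python
from typing import Dict


def rsa_budget(n: int, k: int, steps: int, avg_chain_tokens: int) -> Dict[str, int]:
    """Rough cost accounting for one RSA run under the assumptions above."""
    return {
        # n calls to build the initial population, then n per aggregation round.
        "llm_calls": n * (steps + 1),
        # Rounds must run one after another; calls within a round can be batched.
        "sequential_rounds": steps + 1,
        # Each aggregation prompt carries k prior chains in its context.
        "chain_tokens_per_prompt": k * avg_chain_tokens,
    }
```

For example, `rsa_budget(16, 4, 10, 1000)` counts 176 calls in 11 sequential rounds, with about 4000 tokens of prior chains per aggregation prompt.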
5. Aggregation-Aware Reinforcement Learning Extensions
Standard RL fine-tuning focuses solely on optimizing the model's own chain generation. Aggregation-aware RL, as introduced for RSA, expands the RL objective to explicitly include aggregation contexts:
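- For each training tuple $(x, y^*)$ and candidate subset $S$, optimize expected reward under policies that aggregate multiple reasoning chains:
$$\max_{\theta}\; \mathbb{E}_{(x, y^*),\, S}\; \mathbb{E}_{\tau \sim p_\theta(\cdot \mid x, S)} \big[\, r(\tau, y^*) \,\big]$$
where $r$ is the task reward and $p_\theta(\cdot \mid x, S)$ is the policy conditioned on the input and the candidate subset.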
Policy-gradient training (e.g., PPO, RLOO) yields an aggregator optimized for the multi-chain, aggregation-aware scenario. Empirical results show that aggregation-aware RL improves Pass@1 over vanilla RL (Venkatraman et al., 30 Sep 2025).
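As one concrete piece of such training, the leave-one-out baseline used by RLOO can be computed as below. The sketch assumes the rewards belong to several chains sampled from the same aggregation prompt (the same input and candidate subset). It shows only the advantage computation, not a full training loop, and is not taken from the paper's code.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each sample: its reward minus the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out baseline needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

Each advantage would then weight the log-probability gradient of its chain, so chains that beat their siblings under the same aggregation context are reinforced.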
6. Empirical Comparisons and Practical Implications
Multi-step and multi-model RSA unlocks substantial performance gains across both math and code tasks, and across models and scales:
- RSA boosts Qwen3-4B-Instruct-2507 performance on both AIME-25 and HMMT-25.
- ModelSwitch, with two lightweight LLMs, achieves MATH accuracy equal to or surpassing much larger models with fewer calls: on average it needs fewer samples than self-consistency baselines to reach comparable accuracy (Chen et al., 1 Apr 2025).
- On linguistic referential tasks, pragmatic RSA achieves moderate (but not maximal) correlation with LLM scoring, indicating that current LLMs can be pushed closer to pragmatic competence but do not fully realize it by default (Jian et al., 2024).
RSA and its variants require no external verifier—a contrast with multi-agent debate or reward-model-based selection—and can be further improved by fine-tuning on aggregation-specific objectives.
7. Extensions, Limitations, and Future Directions
Test-time RSA methods are highly modular: aggregation operators, subset selection schemes, and scoring functions can all be adapted—from language-centric pragmatic formulations to complex reasoning over population chains. ModelSwitch demonstrates the value of error mixing across LLMs, while pragmatic RSA shows that explicit re-ranking by informativity and cost can shift model outputs closer to human-like communication.
Current limitations include diminishing returns at very large sample budgets, reliance on strong meaning functions or reward models, and dependence on the diversity and complementarity of model outputs. A plausible implication is that as LLMs become even more accurate and less diverse, the marginal benefit of repeated sampling and aggregation may decrease, shifting future focus to more data- and context-sensitive aggregation (e.g., via highly expressive reward models or interactive verification mechanisms).
The cited results report that RSA-inspired test-time strategies enable smaller LLMs to outperform larger models' single-call baselines in both accuracy and compute efficiency. Because these methods are architecture-agnostic, they should remain compatible with future model and inference pipeline innovations (Venkatraman et al., 30 Sep 2025, Chen et al., 1 Apr 2025, Jian et al., 2024).