
Test-Time RSA for LLMs: Inference Aggregation Strategies

Updated 4 April 2026
  • The paper presents test-time RSA as a method to aggregate multiple LLM outputs using techniques like recursive self-aggregation and multi-model switching.
  • It introduces explicit Bayesian-inspired scoring methods that enhance correctness and pragmatic informativeness by re-ranking candidate solutions.
  • Empirical results demonstrate enhanced accuracy and compute efficiency in tasks such as code generation, mathematical reasoning, and natural language processing.

Test-Time RSA for LLMs encompasses a family of test-time (inference-time) strategies designed to improve model output quality by aggregating or re-ranking multiple candidate generations according to explicit principles of reasoning or pragmatic communication. This collection of techniques draws on classical Rational Speech Act (RSA) modeling from linguistics and cognitive science, as well as recent advances in repeated sampling, self-consistency, evolutionary aggregation, and multi-model switching. The fundamental aim is to leverage additional inference-time compute—over and above a single forward pass—either within a single model or across several models, yielding solutions that are more reliable, robust, and, in some formulations, more pragmatically informative.

1. Conceptual Foundations of Test-Time RSA in LLMs

Test-Time RSA originated as a mechanism to approximate rational pragmatic reasoning, most notably via the Rational Speech Act framework, wherein agents recursively model each other's beliefs and utterances. In the LLM context, RSA-inspired methods treat the model as a probabilistic speaker or listener and score (and sometimes generate) utterances or solutions by aggregating over multiple alternatives—often using explicit Bayesian or information-theoretic objectives.

Recent literature extends these ideas beyond strictly linguistic settings, encompassing domains such as code generation, mathematical reasoning, and multi-step problem solving. Key contributions include recursive self-aggregation (RSA), repeated sampling with majority voting, and multi-model sample pooling with dynamic stopping and weighting. These approaches are empirically motivated by the strong correlation between consistency across multiple outputs and correctness, as well as by the ability to exploit diverse failure modes among different models or reasoning trajectories (Venkatraman et al., 30 Sep 2025, Chen et al., 1 Apr 2025, Jian et al., 2024).

2. Formal Problem Setting and Baseline Paradigms

Let $x \in X$ denote a test input (e.g., a math question or code task), and $p_{\text{ref}}(\cdot \mid x)$ a reference LLM parameterizing a conditional distribution over output chains $T$. Each output $T$ is scored via a reward function $r(T, y) \in [0,1]$, where $y$ is a ground-truth answer. The goal is to maximize expected reward $\mathbb{E}[r(T, y)]$ under a fixed inference compute budget $B$ (e.g., total LLM calls).

Standard baseline paradigms include:

  • Parallel scaling: Drawing $N$ independent samples, each with a call to $p_{\text{ref}}$, and aggregating (e.g., by majority vote among final answers).
  • Sequential scaling (self-refinement): Iteratively prompting the model to refine its own output over a fixed number of sequential calls.
  • Repeated-sampling-then-voting: Sampling multiple output candidates and selecting the answer with maximal support, optionally weighted by confidence or entropy (Chen et al., 1 Apr 2025).

Test-time RSA generalizes these by allowing mixed parallel and sequential aggregation, and by using explicit scoring schemes motivated by Bayesian, pragmatic, or evolutionary principles.
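The parallel-scaling baseline can be sketched in a few lines. This is a minimal illustration rather than any paper's implementation: `sample_answer` is a hypothetical stand-in for one call to the reference model that returns a final answer, and ties are resolved in favor of the answer seen first.

```python
from collections import Counter
from typing import Callable, Hashable, List


def majority_vote(answers: List[Hashable]) -> Hashable:
    """Return the most frequent answer (first-seen answer wins a tie)."""
    if not answers:
        raise ValueError("majority_vote needs at least one answer")
    return Counter(answers).most_common(1)[0][0]


def parallel_scaling(
    sample_answer: Callable[[str], Hashable],  # hypothetical: one LLM call -> final answer
    x: str,
    n: int,
) -> Hashable:
    """Draw n independent samples for input x and aggregate by majority vote."""
    answers = [sample_answer(x) for _ in range(n)]
    return majority_vote(answers)
```

In practice `sample_answer` would wrap a sampling call to the model plus answer extraction; the voting step itself needs no verifier or ground truth.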

3. Algorithmic Instantiations of Test-Time RSA

3.1 Recursive Self-Aggregation (RSA)

Recursive Self-Aggregation, as described in (Venkatraman et al., 30 Sep 2025), maintains a population of $N$ candidate solutions, iteratively aggregating random subsets of size $K$ over $R$ steps. The notation below extends that of Section 2, with $N$, $K$, and $R$ denoting population size, subset size, and number of steps, and $P^{(t)}$ the population after step $t$:

  • Initialization: $T_i^{(0)} \sim p_{\text{ref}}(\cdot \mid x)$ for $i = 1, \dots, N$.
  • Aggregation: At each step $t = 1, \dots, R$, for each $i = 1, \dots, N$:
    • Sample $S_i^{(t)} \subseteq P^{(t-1)}$ ($|S_i^{(t)}| = K$, $P^{(t-1)}$ current population)
    • Aggregate: $T_i^{(t)} \sim p_{\text{ref}}(\cdot \mid x, S_i^{(t)})$
  • Final selection: Output $P^{(R)}$, and optionally reduce to a final answer via majority vote.

Mathematical formalism:

$$P^{(t)} = \{T_i^{(t)}\}_{i=1}^{N}, \qquad T_i^{(t)} \sim p_{\text{ref}}(\cdot \mid x, S_i^{(t)}), \qquad S_i^{(t)} \subseteq P^{(t-1)}, \quad |S_i^{(t)}| = K$$

RSA exploits the information embedded in full solution chains (not merely final answers), enabling incremental improvement by leveraging partially correct reasoning trajectories (Venkatraman et al., 30 Sep 2025).
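The population loop described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `generate` and `aggregate` are hypothetical callables standing in for LLM calls (one samples a fresh chain, the other is prompted with the input plus a subset of chains), and the parameter names `n`, `k`, and `steps` are chosen here for readability.

```python
import random
from collections import Counter
from typing import Callable, List, Optional, Sequence

Chain = str
Generate = Callable[[str], Chain]                    # hypothetical: sample one chain for x
Aggregate = Callable[[str, Sequence[Chain]], Chain]  # hypothetical: merge a subset into one chain


def recursive_self_aggregation(
    x: str,
    generate: Generate,
    aggregate: Aggregate,
    n: int,
    k: int,
    steps: int,
    rng: Optional[random.Random] = None,
) -> List[Chain]:
    """Return the final population after `steps` aggregation rounds."""
    if not 1 <= k <= n:
        raise ValueError("subset size k must satisfy 1 <= k <= n")
    rng = rng or random.Random(0)
    # Initialization: n independent chains from the model.
    population = [generate(x) for _ in range(n)]
    for _ in range(steps):
        # Each new candidate aggregates a random size-k subset of the
        # previous population; the old population is read-only here.
        population = [aggregate(x, rng.sample(population, k)) for _ in range(n)]
    return population


def final_answer(population: Sequence[Chain], extract: Callable[[Chain], str]) -> str:
    """Optional reduction: majority vote over answers extracted from chains."""
    return Counter(extract(c) for c in population).most_common(1)[0][0]
```

A real `aggregate` would format the subset into a prompt asking the model to reconcile the candidate chains; here it is left abstract so the control flow stays visible.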

3.2 Multi-LLM Aggregation: ModelSwitch

ModelSwitch (Chen et al., 1 Apr 2025) introduces multi-model sampling, distributing a fixed sample budget across several diverse LLMs, each of which is allotted a share of the calls:

  • Once a model has produced its allotted samples, if all of its answers are consistent, that answer is returned.
  • Otherwise, sampling proceeds to the next model; all responses are eventually aggregated via a weighted vote, where weights account for both a model's internal consistency (empirical entropy) and an external prior.
  • Theoretical analysis demonstrates that ModelSwitch can achieve higher accuracy and reduced compute compared to single-model, high-sample baselines.

Weighted voting combines answer frequency, model strength, and consistency-derived weighting: each model's votes for an answer are scaled by a consistency weight that decreases with the (normalized) entropy of that model's outputs and by a prior weight that reflects confidence in the model (Chen et al., 1 Apr 2025).
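A minimal sketch of this switch-then-vote procedure is given below. Several details are assumptions made for illustration and are not taken from the paper: the consistency weight is taken to be one minus the normalized entropy, it is combined with the prior by multiplication, every model receives the same number of calls, and the `samplers` are hypothetical callables that each wrap one model.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Sampler = Callable[[str], Hashable]  # hypothetical: one call to one model -> final answer


def normalized_entropy(answers: Sequence[Hashable]) -> float:
    """Empirical entropy of the answers divided by its maximum, log(len(answers))."""
    n = len(answers)
    counts = Counter(answers)
    if n < 2 or len(counts) == 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


def model_switch(
    x: str,
    samplers: Sequence[Sampler],
    priors: Sequence[float],
    calls_per_model: int,
) -> Hashable:
    """Query models in order; stop early on agreement, else take a weighted vote."""
    if not samplers or len(samplers) != len(priors) or calls_per_model < 1:
        raise ValueError("need matching samplers/priors and calls_per_model >= 1")
    pooled: List[Tuple[float, List[Hashable]]] = []
    for sampler, prior in zip(samplers, priors):
        answers = [sampler(x) for _ in range(calls_per_model)]
        if len(set(answers)) == 1:
            return answers[0]  # fully consistent model: return its answer immediately
        # Assumed weighting: prior times (1 - normalized entropy).
        pooled.append((prior * (1.0 - normalized_entropy(answers)), answers))
    scores: Dict[Hashable, float] = defaultdict(float)
    for weight, answers in pooled:
        for a in answers:
            scores[a] += weight
    return max(scores, key=scores.get)
```

Because sampling stops at the first fully consistent model, easy inputs cost only one model's share of the budget, which is where the compute savings in this scheme come from.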

3.3 Pragmatic RSA Re-ranking for Language Generation

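Within linguistic RSA (Jian et al., 2024), candidate utterances $u$ for state $s$ are scored according to a pragmatic speaker posterior, which in the standard RSA formulation reads:

$$S_1(u \mid s) \propto \exp\big(\alpha\,[\log L_0(s \mid u) - C(u)]\big), \qquad L_0(s \mid u) \propto [\![u]\!](s)\,P(s)$$

Here $L_0$ is the literal listener, $P(s)$ a prior over states, $\alpha$ a rationality parameter, and the normalization of $S_1$ runs over the candidate set $U$.

Inputs to this re-ranking:

  • $U$: top-$k$ LLM generations plus logically constructed alternatives
  • $[\![u]\!](s)$: meaning function (prompt or rule-based)
  • $C(u) = |u|$ (token length)

Final scores for candidate selection interpolate between LLM log-probability and $S_1$ pragmatic score, for example with a mixing weight $\lambda \in [0, 1]$:

$$\text{score}(u) = \lambda \log p_{\text{LLM}}(u \mid s) + (1 - \lambda) \log S_1(u \mid s)$$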

This test-time RSA recipe enables more human-like informativeness and brevity in referential generation (Jian et al., 2024).
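The re-ranking recipe can be illustrated with the textbook RSA equations. The sketch below assumes a uniform prior over states, a boolean meaning function, a rationality parameter `alpha`, and a linear mixing weight `lam` between the LLM log-probability and the log pragmatic-speaker score; all function and parameter names are hypothetical and not taken from the cited work.

```python
import math
from typing import Callable, Dict, List, Sequence

Meaning = Callable[[str, str], bool]  # meaning(utterance, state) -> is it true of the state?


def literal_listener(u: str, states: Sequence[str], meaning: Meaning) -> Dict[str, float]:
    """L0(s | u): uniform over the states of which u is true."""
    truth = {s: float(meaning(u, s)) for s in states}
    total = sum(truth.values())
    return {s: (v / total if total > 0 else 0.0) for s, v in truth.items()}


def pragmatic_speaker(
    state: str,
    candidates: Sequence[str],
    states: Sequence[str],
    meaning: Meaning,
    cost: Callable[[str], float],
    alpha: float = 1.0,
) -> Dict[str, float]:
    """S1(u | s) proportional to exp(alpha * (log L0(s | u) - cost(u)))."""
    utility = {}
    for u in candidates:
        l0 = literal_listener(u, states, meaning)[state]
        utility[u] = alpha * (math.log(l0) - cost(u)) if l0 > 0 else -math.inf
    finite = [v for v in utility.values() if v != -math.inf]
    if not finite:
        raise ValueError("no candidate is true of the target state")
    top = max(finite)  # subtract the max for numerical stability
    unnorm = {u: (math.exp(v - top) if v != -math.inf else 0.0) for u, v in utility.items()}
    z = sum(unnorm.values())
    return {u: v / z for u, v in unnorm.items()}


def rerank(
    state: str,
    candidates: Sequence[str],
    states: Sequence[str],
    meaning: Meaning,
    cost: Callable[[str], float],
    llm_logprob: Dict[str, float],
    lam: float = 0.5,
    alpha: float = 1.0,
) -> List[str]:
    """Sort candidates by lam * LLM log-prob + (1 - lam) * log S1, best first."""
    s1 = pragmatic_speaker(state, candidates, states, meaning, cost, alpha)

    def score(u: str) -> float:
        if s1[u] == 0.0:  # utterances false of the target state rank last
            return -math.inf
        return lam * llm_logprob[u] + (1.0 - lam) * math.log(s1[u])

    return sorted(candidates, key=score, reverse=True)
```

In a three-object reference game (a blue square, a blue circle, a green square), the utterance "circle" singles out the blue circle while "blue" is ambiguous, so the pragmatic term favors "circle" even when the LLM assigns "blue" a somewhat higher log-probability.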

4. Compute-Efficiency and Scaling Trade-offs

Test-time RSA strategies have different compute profiles:

  • RSA (Recursive Self-Aggregation): Total LLM calls grow as the population size times the number of rounds (the initial sampling round plus all aggregation rounds). Population size controls parallel breadth and the number of steps controls sequential depth. The practical trade-off is that parallelization is hardware-dependent, while larger population and subset sizes can slow convergence and inflate memory usage.
  • ModelSwitch: Typically uses a small number of lightweight models and allocates a fixed share of calls to each LLM. Empirical results show that it needs fewer LLM calls to reach a given accuracy than single-model self-consistency.
  • Pragmatic RSA Re-ranking: Dominant cost is in candidate generation (often using beam search or combinatorial logic). Meaning function and re-ranking add negligible overhead.

Ablation studies indicate that, for RSA, increasing the number of aggregation steps yields near-monotonic gains, that gains plateau beyond a moderate subset size, and that a larger population raises the asymptotic upper bound but also increases resource demands (Venkatraman et al., 30 Sep 2025, Chen et al., 1 Apr 2025).
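The call-count arithmetic for Recursive Self-Aggregation can be made concrete with a small helper. The count used here, one call per population member in the initial round plus one per member in each aggregation round, is read off the algorithm description and is an assumption rather than a formula quoted from the paper; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def rsa_llm_calls(n: int, steps: int) -> int:
    """Total calls: n initial samples plus n aggregation calls per step."""
    return n * (1 + steps)


def configs_within_budget(
    budget: int,
    population_sizes: Sequence[int],
    max_steps: int,
) -> List[Tuple[int, int]]:
    """For each population size, the deepest step count (capped) that fits the budget."""
    configs = []
    for n in population_sizes:
        steps = budget // n - 1  # largest steps with n * (1 + steps) <= budget
        if steps >= 1:
            configs.append((n, min(steps, max_steps)))
    return configs
```

For a fixed budget, this makes the breadth-versus-depth trade-off explicit: doubling the population roughly halves the number of aggregation rounds that can be afforded.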

5. Aggregation-Aware Reinforcement Learning Extensions

Standard RL fine-tuning focuses solely on optimizing the model's own chain generation. Aggregation-aware RL, as introduced for RSA, expands the RL objective to explicitly include aggregation contexts:

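  • For each training tuple $(x, y)$ and candidate subset $S$, optimize expected reward under policies that aggregate multiple reasoning chains:

$$\max_{\theta}\; \mathbb{E}_{(x, y),\, S}\; \mathbb{E}_{T \sim p_{\theta}(\cdot \mid x, S)}\big[\, r(T, y) \,\big]$$

where $p_{\theta}$ is the policy being trained and $r$ is the reward function from Section 2.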

Policy gradient methods (e.g., PPO, RLOO) then yield an aggregator optimized for the multi-chain, aggregation-aware scenario. Empirical results show that aggregation-aware RL improves Pass@1 compared to vanilla RL (Venkatraman et al., 30 Sep 2025).
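To make the training setup more concrete, the sketch below shows two ingredients such a pipeline would plausibly need: a prompt that places a candidate subset in the model's context, and leave-one-out (RLOO) advantages computed over several aggregated rollouts for the same prompt. The prompt template and function names are hypothetical; only the leave-one-out baseline itself, each reward minus the mean of the other rewards, is standard.

```python
from typing import List, Sequence


def build_aggregation_prompt(x: str, candidates: Sequence[str]) -> str:
    """Format an input and a subset of candidate chains (hypothetical template)."""
    lines = [f"Problem:\n{x}", "", "Candidate solutions:"]
    for i, chain in enumerate(candidates, start=1):
        lines.append(f"[{i}] {chain}")
    lines += ["", "Combine the candidates above into a single improved solution."]
    return "\n".join(lines)


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out advantages: reward_i minus the mean reward of the other rollouts."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Each advantage would then weight the log-probability gradient of the corresponding aggregated rollout, so the policy is rewarded for aggregating well rather than only for solving from scratch.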

6. Empirical Comparisons and Practical Implications

Multi-step and multi-model RSA unlocks substantial performance gains across both math and code tasks, and across models and scales:

  • On AIME-25 and HMMT-25, RSA raises the accuracy of Qwen3-4B-Instruct-2507 over its baseline (Venkatraman et al., 30 Sep 2025).
  • ModelSwitch, with two lightweight LLMs, matches or surpasses the MATH accuracy of much larger models while using fewer calls, and needs fewer samples on average than self-consistency baselines (Chen et al., 1 Apr 2025).
  • On linguistic referential tasks, pragmatic RSA achieves moderate (but not maximal) correlation with LLM scoring, indicating that current LLMs can be pushed closer to pragmatic competence but do not fully realize it by default (Jian et al., 2024).

RSA and its variants require no external verifier—a contrast with multi-agent debate or reward-model-based selection—and can be further improved by fine-tuning on aggregation-specific objectives.

7. Extensions, Limitations, and Future Directions

Test-time RSA methods are highly modular: aggregation operators, subset selection schemes, and scoring functions can all be adapted—from language-centric pragmatic formulations to complex reasoning over population chains. ModelSwitch demonstrates the value of error mixing across LLMs, while pragmatic RSA shows that explicit re-ranking by informativity and cost can shift model outputs closer to human-like communication.

Current limitations include diminishing returns at very large aggregation budgets, reliance on strong meaning functions or reward models, and dependence on the diversity and complementarity of model outputs. A plausible implication is that as LLMs become more accurate and less diverse, the marginal benefit of repeated sampling and aggregation may decrease, shifting future focus to more data- and context-sensitive aggregation (e.g., via highly expressive reward models or interactive verification mechanisms).

Empirical results consistently show that RSA-inspired test-time strategies enable smaller LLMs to outperform larger models' single-call baselines in both accuracy and compute efficiency. The architectural agnosticism of these methods ensures compatibility with future model and inference pipeline innovations (Venkatraman et al., 30 Sep 2025, Chen et al., 1 Apr 2025, Jian et al., 2024).
