AggLM: RL-Trained Aggregation for Enhanced Reasoning
- AggLM is a method that employs reinforcement learning to synthesize diverse candidate outputs into a single, accurate solution.
- It uses a curriculum strategy to balance easy consensus cases with challenging minority-correct instances for robust decision-making.
- Empirical results demonstrate that AggLM achieves higher pass@1 accuracy and reduced token usage compared to classical aggregation methods.
AggLM refers to a family of methods and models that use machine learning to aggregate multiple candidate solutions, predictions, or individual model outputs. In the context of LLMs and advanced reasoning systems, AggLM typically denotes an explicit aggregation model trained to synthesize, reconcile, and select among diverse solution candidates, especially on high-stakes or ambiguous reasoning benchmarks. The dominant paradigm, exemplified by recent work, formulates aggregation as an explicit, trainable reasoning skill that can outperform majority voting and reward-model re-ranking, while generalizing robustly to solutions from diverse models and improving inference efficiency (Zhao et al., 8 Sep 2025).
1. Aggregation as a Reasoning Skill: Methodological Foundations
AggLM reconceptualizes aggregation as a learnable transformation, in which an explicit aggregator model reads a set of candidate solutions and produces a single final output. The aggregation process is cast as a supervised or reinforcement learning (RL) task: given a context $x$ (e.g., a math problem) and candidate solutions $y_1, \dots, y_k$ generated by a base solution model $\pi_{\text{sol}}$, the aggregator $\pi_{\text{agg}}$ is trained to output an aggregated solution $\hat{y} \sim \pi_{\text{agg}}(\cdot \mid x, y_1, \dots, y_k)$. The reward is typically binary, rewarding the aggregator only when its synthesized answer matches the ground-truth label $y^{\ast}$:

$$r(\hat{y}, y^{\ast}) = \mathbb{1}\big[\mathrm{answer}(\hat{y}) = y^{\ast}\big]$$

Parameter updates for $\pi_{\text{agg}}$ leverage RL with verifiable rewards, most prominently Group-Relative Policy Optimization (GRPO). This approach enables the aggregator to learn not only to select among candidates but also to synthesize a new, correct output when no single answer dominates the input set.
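To make the training signal concrete, the sketch below shows one plausible implementation of the binary verifiable reward and the group-relative advantage normalization used by GRPO. All function names, including the answer parser `extract_final_answer`, are illustrative assumptions rather than code from a released AggLM implementation.

```python
import re
from typing import List


def extract_final_answer(solution_text: str) -> str:
    """Illustrative answer parser: prefer the last \\boxed{...} expression,
    falling back to the last non-empty line of the solution."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", solution_text)
    if boxed:
        return boxed[-1].strip()
    lines = [line.strip() for line in solution_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def aggregation_reward(aggregated_solution: str, ground_truth_answer: str) -> float:
    """Binary verifiable reward: 1 only if the aggregator's final answer matches the label."""
    return 1.0 if extract_final_answer(aggregated_solution) == ground_truth_answer else 0.0


def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """Group-relative advantages: rewards of several aggregation rollouts for the
    same (problem, candidate-set) input are normalized within the group."""
    mean = sum(group_rewards) / len(group_rewards)
    std = (sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5
    if std == 0.0:  # identical rewards carry no relative learning signal
        return [0.0 for _ in group_rewards]
    return [(r - mean) / std for r in group_rewards]
```

The advantage normalization is what makes the group of rollouts, rather than an external value model, supply the baseline for policy updates.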
2. Curriculum Design: Balancing Easy and Hard Aggregation Instances
A central insight in practical AggLM training is the deliberate mixture of aggregation tasks that vary in "hardness". Easy examples are those where most or all candidate solutions agree, usually with the majority answer being correct. Hard instances involve cases where candidate solutions disagree and the majority answer may be incorrect, requiring the aggregator to "recover" minority-but-correct or individually rare correct solutions.
Robust AggLM policy learning requires a curriculum that samples all hard instances and an adjustable fraction of easy ones (e.g., 5%–50%). This ensures the model learns majority selection where appropriate but is additionally incentivized to identify and abstract correct reasoning steps from noisy, non-dominant solutions—a capability majority-vote or static re-ranking methods lack.
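As a concrete illustration of this curriculum, the following sketch partitions aggregation instances by candidate agreement and keeps every hard instance plus a sampled fraction of easy ones. The data layout and the easy/hard test (majority answer correct and untied) are one plausible operationalization assumed for illustration, not the paper's exact criteria.

```python
import random
from collections import Counter
from typing import List, Tuple

AggregationInstance = Tuple[str, List[str], str]  # (problem, candidate final answers, label)


def build_curriculum(
    instances: List[AggregationInstance],
    easy_fraction: float = 0.2,  # adjustable share of easy instances to keep
    seed: int = 0,
) -> List[AggregationInstance]:
    """Keep every hard instance (majority answer wrong or tied) and a random
    fraction of easy ones (majority answer correct), as described above."""
    rng = random.Random(seed)
    easy, hard = [], []
    for problem, answers, label in instances:
        counts = Counter(answers)
        top_answer, top_count = counts.most_common(1)[0]
        tied = sum(1 for c in counts.values() if c == top_count) > 1
        if top_answer == label and not tied:
            easy.append((problem, answers, label))
        else:
            hard.append((problem, answers, label))
    kept_easy = rng.sample(easy, k=int(easy_fraction * len(easy)))
    curriculum = hard + kept_easy
    rng.shuffle(curriculum)
    return curriculum
```

Tuning `easy_fraction` within the 5%–50% range mentioned above trades off reinforcing majority selection against emphasizing recovery of minority-correct solutions.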
3. Empirical Evaluation and Benchmarks
AggLM methodology has been empirically validated on standardized math competition datasets such as AIME24, AIME25, HMMT24, and HMMT25, aggregating outputs from models (e.g., Qwen3-1.7B) operating in "thinking" mode. In these experiments, the RL-trained aggregator (AggLM-1.7B) achieves superior pass@1 accuracy compared to majority voting or reward-model ranking. For example, on the AIME25 dataset, majority voting achieves 45.89% accuracy, while AggLM-1.7B attains 50.00%. These improvements persist when aggregating outputs from stronger models (e.g., Qwen3-8B) and across different generation regimes (thinking vs. non-thinking).
The empirical evidence reveals that AggLM excels not only in "majority correct" scenarios but especially in low-agreement setups—recovering correct minority solutions and scaling more efficiently with the number of input candidates.
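For reference, the majority-voting baseline against which AggLM is compared can be expressed in a few lines. This is a generic self-consistency implementation, not the paper's evaluation code, and it assumes final answers have already been extracted from the candidate solutions.

```python
from collections import Counter
from typing import List


def majority_vote(candidate_answers: List[str]) -> str:
    """Self-consistency baseline: return the most frequent final answer
    among the candidates (ties resolved by first occurrence)."""
    return Counter(candidate_answers).most_common(1)[0][0]


def pass_at_1(predicted_answers: List[str], labels: List[str]) -> float:
    """pass@1 as used here: fraction of problems whose single returned answer is correct."""
    correct = sum(p == l for p, l in zip(predicted_answers, labels))
    return correct / len(labels)
```

The low-agreement setups discussed above are precisely the cases where `majority_vote` returns a wrong answer even though a correct one exists somewhere in the candidate set.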
4. Generalization Beyond Training Distribution
A notable property of RL-trained AggLM systems is their ability to generalize aggregation skills acquired on solutions from weaker or structurally specific models to candidate sets generated by stronger or heterogeneously sourced solution models. This cross-model generalization capacity is demonstrated by the consistent performance of AggLM-1.7B when aggregating solutions not only from its own model distribution but also from Qwen3-8B and other settings, even when candidate generation style (e.g., "non-thinking" outputs) diverges from training.
This flexibility implies that a well-trained AggLM aggregator can be deployed as a modular post-processing layer across ensembles, diverse model portfolios, or multiple reasoning paradigms, synthesizing the strengths of various systems without overfitting to a single solution distribution.
5. Computational Efficiency and Token Usage
Traditional aggregation approaches such as majority voting often require a large number of candidate solutions to attain strong accuracy, incurring high inference-time token costs. In contrast, AggLM, by learning to "read" and synthesize from a fixed set of candidates, achieves competitive or superior pass@1 metrics with significantly fewer required tokens—about one-third compared to full majority-vote pipelines. This reduction translates to lower computational cost and latency, vital for production or resource-constrained deployments, particularly when aggregating long-form solutions or over large evaluation sets.
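The rough accounting behind that efficiency claim can be made explicit with a back-of-the-envelope calculation. All numbers below are illustrative placeholders chosen to show how an approximately one-third token budget can arise, not figures taken from the paper.

```python
def total_tokens_majority_vote(num_samples: int, avg_solution_tokens: int) -> int:
    """Generation cost of a majority-vote pipeline: every sampled solution is paid for."""
    return num_samples * avg_solution_tokens


def total_tokens_agglm(num_candidates: int, avg_solution_tokens: int,
                       avg_aggregation_tokens: int) -> int:
    """Generation cost with an aggregator: fewer candidates plus one aggregation pass.
    (Candidate tokens re-read by the aggregator are prompt tokens, which are much
    cheaper than generated tokens and are ignored in this rough accounting.)"""
    return num_candidates * avg_solution_tokens + avg_aggregation_tokens


# Hypothetical budgets: 16-sample majority vote vs. aggregating 4 candidates
# with an ~8k-token aggregation trace, both with ~6k-token solutions.
mv = total_tokens_majority_vote(num_samples=16, avg_solution_tokens=6_000)
agg = total_tokens_agglm(num_candidates=4, avg_solution_tokens=6_000,
                         avg_aggregation_tokens=8_000)
print(f"majority vote: {mv:,} generated tokens; AggLM: {agg:,} ({agg / mv:.0%} of the cost)")
```

Under these assumed budgets the aggregation pipeline generates roughly a third of the tokens of the majority-vote pipeline, matching the scale of savings described above.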
6. Comparison with Classical Aggregation and Re-ranking Strategies
Table: Comparative Features of Aggregation Methods
| Method | Recovers Minority-Correct | Token Efficiency | Generalization to Heterogeneous Inputs |
|---|---|---|---|
| Majority Voting | No | Low | Yes |
| Reward Re-Rank | Limited | Low | Mixed |
| RL-trained AggLM | Yes | High | Strong |
Majority voting is a fixed rule and reward re-ranking relies on a static scoring model; both fail in hard cases where the majority answer is incorrect or correct reasoning is scattered across otherwise flawed candidates. An RL-trained AggLM, by contrast, learns to reason over, reconcile, and sometimes synthesize new solutions, providing demonstrably improved effectiveness, especially in challenging aggregation scenarios.
7. Significance and Future Directions
AggLM establishes aggregation as a distinct reasoning skill trainable via reinforcement learning, rather than as a static or rule-based selection task. The paradigm shift has practical implications for scaling LLM-based test-time reasoning, compositional inference, and ensemble integration. Future research may extend AggLM to aggregation of more complex output types (e.g., structured proofs, program traces), integrate domain verification signals beyond exact match, or combine it with uncertainty estimation frameworks for calibrated aggregation in high-stakes domains.
This direction exemplifies the emerging view that meta-reasoning over solution sets—training systems to aggregate, reconcile, and abstract beyond the candidate pool—is a critical route to robust, sample-efficient, and generalizable performance in advanced LLM applications (Zhao et al., 8 Sep 2025).