AggLM: RL-Trained Aggregation for Enhanced Reasoning
- AggLM is a method that employs reinforcement learning to synthesize diverse candidate outputs into a single, accurate solution.
- It uses a curriculum strategy to balance easy consensus cases with challenging minority-correct instances for robust decision-making.
- Empirical results demonstrate that AggLM achieves higher pass@1 accuracy and reduced token usage compared to classical aggregation methods.
AggLM refers to a family of methods and models that use machine learning to aggregate multiple candidate solutions, predictions, or individual model outputs. In the context of LLMs and advanced reasoning systems, AggLM typically denotes an explicit aggregation model trained to synthesize, reconcile, and select among diverse solution candidates, especially on high-stakes or ambiguous reasoning benchmarks. The dominant paradigm, exemplified by recent work, formulates aggregation as an explicit, trainable reasoning skill that can outperform majority voting and reward-model re-ranking, while generalizing robustly to solutions from diverse models and improving inference efficiency (Zhao et al., 8 Sep 2025).
1. Aggregation as a Reasoning Skill: Methodological Foundations
AggLM reconceptualizes aggregation as a learnable transformation, in which an explicit aggregator model reads a set of candidate solutions and produces a single final output. The aggregation process is cast as a supervised or reinforcement learning (RL) task: given a context $x$ (e.g., a math problem) and candidate solutions $y_1, \dots, y_k$ generated by a base solution model $\pi_{\text{sol}}$, the aggregator $\pi_{\text{agg}}$ is trained to output an aggregated solution $\hat{y} \sim \pi_{\text{agg}}(\cdot \mid x, y_1, \dots, y_k)$. The reward is typically binary, rewarding the aggregator only when its synthesized answer matches the ground-truth label $y^{\ast}$:

$$r(\hat{y}, y^{\ast}) = \mathbb{1}\big[\mathrm{answer}(\hat{y}) = y^{\ast}\big]$$

Parameter updates for $\pi_{\text{agg}}$ leverage RL with verifiable rewards, most prominently Group-Relative Policy Optimization (GRPO). This approach enables the aggregator to learn not only to select among candidates but also to synthesize a new, correct output when no single answer dominates the input set.
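To make the training signal concrete, the sketch below shows one plausible implementation of the binary verifiable reward and the group-relative advantage normalization used by GRPO. All function names, including the answer parser `extract_final_answer`, are illustrative assumptions rather than code from a released AggLM implementation.

```python
import re
from typing import List


def extract_final_answer(solution_text: str) -> str:
    """Illustrative answer parser: prefer the last \\boxed{...} expression,
    falling back to the last non-empty line of the solution."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", solution_text)
    if boxed:
        return boxed[-1].strip()
    lines = [line.strip() for line in solution_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def aggregation_reward(aggregated_solution: str, ground_truth_answer: str) -> float:
    """Binary verifiable reward: 1 only if the aggregator's final answer matches the label."""
    return 1.0 if extract_final_answer(aggregated_solution) == ground_truth_answer else 0.0


def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """Group-relative advantages: rewards of several aggregation rollouts for the
    same (problem, candidate-set) input are normalized within the group."""
    mean = sum(group_rewards) / len(group_rewards)
    std = (sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5
    if std == 0.0:  # identical rewards carry no relative learning signal
        return [0.0 for _ in group_rewards]
    return [(r - mean) / std for r in group_rewards]
```

The advantage normalization is what makes the group of rollouts, rather than an external value model, supply the baseline for policy updates.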
2. Curriculum Design: Balancing Easy and Hard Aggregation Instances
A central insight in practical AggLM training is the deliberate mixture of aggregation tasks that vary in "hardness". Easy examples are those where most or all candidate solutions agree, usually with the majority answer being correct. Hard instances involve cases where candidate solutions disagree and the majority answer may be incorrect, requiring the aggregator to "recover" minority-but-correct or individually rare correct solutions.
Robust AggLM policy learning requires a curriculum that samples all hard instances and an adjustable fraction of easy ones (e.g., 5%–50%). This ensures the model learns majority selection where appropriate but is additionally incentivized to identify and abstract correct reasoning steps from noisy, non-dominant solutions—a capability majority-vote or static re-ranking methods lack.
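As a concrete illustration of this curriculum, the following sketch partitions aggregation instances by candidate agreement and keeps every hard instance plus a sampled fraction of easy ones. The data layout and the easy/hard test (majority answer correct and untied) are one plausible operationalization assumed for illustration, not the paper's exact criteria.

```python
import random
from collections import Counter
from typing import List, Tuple

AggregationInstance = Tuple[str, List[str], str]  # (problem, candidate final answers, label)


def build_curriculum(
    instances: List[AggregationInstance],
    easy_fraction: float = 0.2,  # adjustable share of easy instances to keep
    seed: int = 0,
) -> List[AggregationInstance]:
    """Keep every hard instance (majority answer wrong or tied) and a random
    fraction of easy ones (majority answer correct), as described above."""
    rng = random.Random(seed)
    easy, hard = [], []
    for problem, answers, label in instances:
        counts = Counter(answers)
        top_answer, top_count = counts.most_common(1)[0]
        tied = sum(1 for c in counts.values() if c == top_count) > 1
        if top_answer == label and not tied:
            easy.append((problem, answers, label))
        else:
            hard.append((problem, answers, label))
    kept_easy = rng.sample(easy, k=int(easy_fraction * len(easy)))
    curriculum = hard + kept_easy
    rng.shuffle(curriculum)
    return curriculum
```

Tuning `easy_fraction` within the 5%–50% range mentioned above trades off reinforcing majority selection against emphasizing recovery of minority-correct solutions.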
3. Empirical Evaluation and Benchmarks
AggLM methodology has been empirically validated on standardized math competition datasets such as AIME24, AIME25, HMMT24, and HMMT25, aggregating outputs from models (e.g., Qwen3-1.7B) operating in "thinking" mode. In these experiments, the RL-trained aggregator (AggLM-1.7B) achieves superior pass@1 accuracy compared to majority voting or reward-model ranking. For example, on the AIME25 dataset, majority voting achieves 45.89% accuracy, while AggLM-1.7B attains 50.00%. These improvements persist when aggregating outputs from stronger models (e.g., Qwen3-8B) and across different generation regimes (thinking vs. non-thinking).
The empirical evidence reveals that AggLM excels not only in "majority correct" scenarios but especially in low-agreement setups—recovering correct minority solutions and scaling more efficiently with the number of input candidates.
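For reference, the majority-voting baseline against which AggLM is compared can be expressed in a few lines. This is a generic self-consistency implementation, not the paper's evaluation code, and it assumes final answers have already been extracted from the candidate solutions.

```python
from collections import Counter
from typing import List


def majority_vote(candidate_answers: List[str]) -> str:
    """Self-consistency baseline: return the most frequent final answer
    among the candidates (ties resolved by first occurrence)."""
    return Counter(candidate_answers).most_common(1)[0][0]


def pass_at_1(predicted_answers: List[str], labels: List[str]) -> float:
    """pass@1 as used here: fraction of problems whose single returned answer is correct."""
    correct = sum(p == l for p, l in zip(predicted_answers, labels))
    return correct / len(labels)
```

The low-agreement setups discussed above are precisely the cases where `majority_vote` returns a wrong answer even though a correct one exists somewhere in the candidate set.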
4. Generalization Beyond Training Distribution
A notable property of RL-trained AggLM systems is their ability to generalize aggregation skills acquired on solutions from weaker or structurally specific models to candidate sets generated by stronger or heterogeneously sourced solution models. This cross-model generalization capacity is demonstrated by the consistent performance of AggLM-1.7B when aggregating solutions not only from its own model distribution but also from Qwen3-8B and other settings, even when candidate generation style (e.g., "non-thinking" outputs) diverges from training.
This flexibility implies that a well-trained AggLM aggregator can be deployed as a modular post-processing layer across ensembles, diverse model portfolios, or multiple reasoning paradigms, synthesizing the strengths of various systems without overfitting to a single solution distribution.
5. Computational Efficiency and Token Usage
Traditional aggregation approaches such as majority voting often require a large number of candidate solutions to attain strong accuracy, incurring high inference-time token costs. In contrast, AggLM, by learning to "read" and synthesize from a fixed set of candidates, achieves competitive or superior pass@1 metrics with significantly fewer required tokens—about one-third compared to full majority-vote pipelines. This reduction translates to lower computational cost and latency, vital for production or resource-constrained deployments, particularly when aggregating long-form solutions or over large evaluation sets.
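The rough accounting behind that efficiency claim can be made explicit with a back-of-the-envelope calculation. All numbers below are illustrative placeholders chosen to show how an approximately one-third token budget can arise, not figures taken from the paper.

```python
def total_tokens_majority_vote(num_samples: int, avg_solution_tokens: int) -> int:
    """Generation cost of a majority-vote pipeline: every sampled solution is paid for."""
    return num_samples * avg_solution_tokens


def total_tokens_agglm(num_candidates: int, avg_solution_tokens: int,
                       avg_aggregation_tokens: int) -> int:
    """Generation cost with an aggregator: fewer candidates plus one aggregation pass.
    (Candidate tokens re-read by the aggregator are prompt tokens, which are much
    cheaper than generated tokens and are ignored in this rough accounting.)"""
    return num_candidates * avg_solution_tokens + avg_aggregation_tokens


# Hypothetical budgets: 16-sample majority vote vs. aggregating 4 candidates
# with an ~8k-token aggregation trace, both with ~6k-token solutions.
mv = total_tokens_majority_vote(num_samples=16, avg_solution_tokens=6_000)
agg = total_tokens_agglm(num_candidates=4, avg_solution_tokens=6_000,
                         avg_aggregation_tokens=8_000)
print(f"majority vote: {mv:,} generated tokens; AggLM: {agg:,} ({agg / mv:.0%} of the cost)")
```

Under these assumed budgets the aggregation pipeline generates roughly a third of the tokens of the majority-vote pipeline, matching the scale of savings described above.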
6. Comparison with Classical Aggregation and Re-ranking Strategies
Table: Comparative Features of Aggregation Methods
| Method | Recovers Minority-Correct | Token Efficiency | Generalization to Heterogeneous Inputs |
|---|---|---|---|
| Majority Voting | No | Low | Yes |
| Reward Re-Rank | Limited | Low | Mixed |
| RL-trained AggLM | Yes | High | Strong |
Majority voting is a fixed rule and reward re-ranking relies on a static scoring model; both fail in hard cases where the majority answer is incorrect or correct reasoning is scattered across otherwise flawed candidates. An RL-trained AggLM, by contrast, learns to reason over, reconcile, and sometimes synthesize new solutions, providing demonstrably improved effectiveness, especially in challenging aggregation scenarios.
7. Significance and Future Directions
AggLM establishes aggregation as a distinct reasoning skill trainable via reinforcement learning, rather than as a static or rule-based selection task. The paradigm shift has practical implications for scaling LLM-based test-time reasoning, compositional inference, and ensemble integration. Future research may extend AggLM to aggregation of more complex output types (e.g., structured proofs, program traces), integrate domain verification signals beyond exact match, or combine it with uncertainty estimation frameworks for calibrated aggregation in high-stakes domains.
This direction exemplifies the emerging view that meta-reasoning over solution sets—training systems to aggregate, reconcile, and abstract beyond the candidate pool—is a critical route to robust, sample-efficient, and generalizable performance in advanced LLM applications (Zhao et al., 8 Sep 2025).