- The paper introduces the Sample Set Aggregator (SSA), a novel hybrid approach combining parallel and sequential reasoning methods to boost LLM accuracy.
- SSA leverages reinforcement learning to aggregate multiple candidate answers without modifying the underlying base model.
- Empirical results across diverse benchmarks demonstrate that SSA outperforms reward-based re-ranking and narrows the performance gap to oracle-best accuracy.
## Learning to Reason Across Parallel Samples for LLM Reasoning
The paper presents a method to enhance the reasoning capabilities of LLMs through test-time scaling. Traditional test-time scaling methods fall into two families: parallel and sequential. Parallel methods sample multiple reasoning paths independently and combine the results through mechanisms like majority voting. Sequential methods refine a single solution iteratively, often using self-reflection prompts or incentives for additional computation. The research proposes a hybrid approach that combines the strengths of both paradigms, synthesizing the final answer from a set of parallel samples that together approximate the LLM's output distribution.
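To make the parallel baseline concrete, the sketch below samples k reasoning paths and majority-votes on the extracted answers. `generate` and `extract_answer` are hypothetical stand-ins for a sampling call and a final-answer parser, not functions from the paper.

```python
from collections import Counter

def majority_vote(prompt: str, generate, extract_answer, k: int = 8) -> str:
    """Parallel baseline: sample k reasoning paths independently,
    then return the most common extracted final answer."""
    samples = [generate(prompt, temperature=0.8) for _ in range(k)]
    answers = [extract_answer(s) for s in samples]
    # Counter.most_common(1) returns [(answer, count)] for the modal answer.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```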
At the core of the proposed method is the Sample Set Aggregator (SSA), an LLM trained to process multiple solution samples jointly. Unlike conventional methods that treat generated samples in isolation, SSA reads the sample set as a representation of the base model's output distribution and is trained with reinforcement learning (RL) to maximize final-answer accuracy. Because answer generation is decoupled from analysis and aggregation, SSA can operate on outputs from black-box models without any modification or retraining of those base models.
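A minimal sketch of this decoupling, assuming a simple concatenation-based prompt format (the model handles, names, and template here are illustrative, not the paper's exact implementation):

```python
def aggregate(question: str, base_model, ssa_model, k: int = 5) -> str:
    """Two-stage inference: parallel sampling from a frozen base model,
    then one sequential aggregation pass by the SSA."""
    # Parallel stage: k independent samples approximate the base model's
    # output distribution for this question.
    candidates = [base_model.generate(question, temperature=1.0) for _ in range(k)]

    # Sequential stage: the SSA reads the whole sample set at once and
    # synthesizes a single final answer.
    prompt = question + "\n\nCandidate solutions:\n"
    for i, cand in enumerate(candidates, 1):
        prompt += f"\n[Candidate {i}]\n{cand}\n"
    prompt += "\nAnalyze the candidates above and give the final answer."
    return ssa_model.generate(prompt)
```

Because the base model is only ever called for sampling, it can be a black-box API; only the (much smaller) SSA needs to be trainable.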
Experiments across multiple reasoning datasets underscore SSA's advantage over existing test-time scaling methods. Notably, SSA consistently outperforms reward-model-based re-ranking strategies and substantially narrows the gap between realized model accuracy and oracle-best accuracy, i.e., the accuracy achieved by always selecting the best answer among the samples. SSA also generalizes well across base model families, scales, and tasks. These results suggest that SSA can serve as a lightweight alternative to the larger models traditionally used in sequential scaling, yielding computational and performance gains without training large models directly.
Several key contributions are highlighted:
- SSA Deployment: The paper introduces SSA as a lightweight LLM that reads the concatenated candidate answers sampled in parallel from a frozen base model and produces the final answer in a single sequential pass, with SSA itself trained via RL (see the training sketch after this list). This integration demonstrates strong performance across sample-set sizes and model scales.
- Reasoning Over Output Distribution: The research proposes optimizing over sampled outputs rather than tuning model internals for better reasoning. This conceptual shift means SSA depends only on the sampled answers, suggesting broader applicability in contexts where training the base model is infeasible.
- Empirical Gains: SSA delivers broad and consistent performance improvements across five math benchmarks and multiple LLM families and model sizes, and holds up against strong baselines.
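As referenced in the first contribution, SSA's RL training signal can be sketched as a verifiable outcome reward. The version below assumes a simple exact-match check against a reference answer; the paper's actual reward and RL recipe may differ, and `extract_answer` is again a hypothetical parser.

```python
def outcome_reward(ssa_output: str, reference_answer: str, extract_answer) -> float:
    """Binary correctness reward for one aggregated rollout: 1.0 if the
    SSA's extracted final answer matches the reference, else 0.0."""
    predicted = extract_answer(ssa_output)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.strip() == reference_answer.strip() else 0.0
```

A binary reward like this plugs into standard policy-gradient training; since the base model stays frozen, gradients update only the small aggregator.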
SSA's implications are significant for LLM reasoning development. Practically, deploying a small model like SSA can markedly reduce the computational overhead associated with larger models. Theoretically, SSA demonstrates that base models can remain general-purpose while a post hoc aggregator optimizes performance on specialized tasks. Future research directions include applications beyond mathematical reasoning, refining SSA's synthesis capabilities, and scaling the number of parallel samples for harder reasoning tasks.
In conclusion, SSA offers a promising way to leverage outputs from larger models, illustrating a pathway that combines efficient computation with enhanced reasoning capability. Its flexibility and efficiency point toward smaller yet smarter systems that harness the strengths of existing LLM architectures without model-specific fine-tuning.