Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models (2509.26626v1)

Published 30 Sep 2025 in cs.LG

Abstract: Test-time scaling methods improve the capabilities of LLMs by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. RSA exploits the rich information embedded in the reasoning chains -- not just the final answers -- and enables bootstrapping from partially correct intermediate steps within different chains of thought. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), while outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further demonstrate that training the model to combine solutions via a novel aggregation-aware reinforcement learning approach yields significant performance gains. Code available at https://github.com/HyperPotatoNeo/RSA.

Summary

  • The paper demonstrates that RSA recursively aggregates candidate solutions to enhance deep, multi-step reasoning in LLMs.
  • It combines parallel and sequential reasoning strategies to refine outputs and consistently outperforms existing test-time scaling methods.
  • Aggregation-aware RL further improves generalization and performance across diverse benchmarks, bridging the gap between lightweight and heavyweight models.

Recursive Self-Aggregation Unlocks Deep Thinking in LLMs

Introduction

The paper introduces Recursive Self-Aggregation (RSA), a hybrid test-time scaling framework for LLMs that leverages both parallel and sequential reasoning strategies. RSA is motivated by evolutionary algorithms and is designed to improve LLM reasoning capabilities by recursively aggregating candidate solutions, enabling the model to recombine and refine reasoning chains without external verifiers or model parameter updates. The approach is evaluated across diverse tasks and model architectures, demonstrating substantial improvements over existing test-time scaling methods.

Taxonomy of Test-Time Scaling Methods

Test-time scaling methods for LLMs are categorized by their verification strategy and reasoning control flow:

  • Verification Strategies:
    • External Verification: Uses external tools or learned reward models to score candidate solutions.
    • Self-Verification: Employs the LLM itself to judge correctness, exploiting the generation-verification gap.
    • Implicit Verification: Relies on the LLM to implicitly verify and improve solutions without explicit scoring.
  • Reasoning Control Flows:
    • Parallel Scaling: Generates multiple independent reasoning chains and combines them (e.g., majority voting, Best-of-N).
    • Sequential Scaling: Iteratively refines a single reasoning chain, suitable for deep, multi-step reasoning.
    • Hybrid Scaling: Combines parallel and sequential strategies, often using evolutionary or ensemble methods.

RSA is positioned as a hybrid scaling method, integrating recursive aggregation steps into a self-improvement loop, maintaining a population of candidate solutions and iteratively recombining subsets to produce improved solutions.

Recursive Self-Aggregation (RSA) Algorithm

RSA operates as follows:

  1. Initialization:
    • Generate an initial population P_1 of N candidate solutions for a given query x using the base LLM.
  2. Subsampling and Aggregation:
    • For each of N aggregation sets, sample K distinct candidates from the current population.
    • Prompt the LLM with the query and the sampled set to produce an improved solution.
    • Form the next population P_{t+1} from these aggregated solutions.
  3. Recursion:
    • Repeat the subsampling and aggregation for T steps, recursively refining the population.
  4. Termination:
    • The final solution is selected from the last population, either by random sampling or majority voting.

Key implementation details:

  • The aggregation prompt is designed to encourage the model to combine useful ideas and correct errors from multiple candidates.
  • K (aggregation set size), N (population size), and T (number of steps) are critical hyperparameters, with trade-offs between diversity, convergence speed, and compute requirements.
  • RSA can be implemented with any LLM inference pipeline and does not require external verification or model retraining.

Pseudocode

import random

def rsa(LLM, query, N, K, T):
    # Step 1: initialize a population of N candidate solutions.
    population = [LLM.generate(query) for _ in range(N)]
    # Steps 2-3: T rounds of subsampling and aggregation.
    for t in range(T):
        new_population = []
        for _ in range(N):
            # Sample K distinct candidates and ask the model to aggregate them.
            subset = random.sample(population, K)
            prompt = build_aggregation_prompt(query, subset)
            new_solution = LLM.generate(prompt)
            new_population.append(new_solution)
        population = new_population
    # Step 4: select the final solution (e.g., random sample or majority vote).
    return select_final_solution(population)
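
The pseudocode assumes two helpers that the summary does not spell out. The sketch below is a minimal, illustrative version of both, assuming a plain-text prompt format and an optional answer extractor; the paper's exact prompt wording and selection procedure may differ.

import random
from collections import Counter

def build_aggregation_prompt(query, candidates):
    # Hypothetical prompt template: show the query and the K sampled
    # candidate reasoning chains, then ask for a single improved solution
    # that reuses correct steps and fixes mistakes (the paper's exact
    # wording is not reproduced here).
    blocks = "\n\n".join(
        f"Candidate solution {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return (
        f"Problem:\n{query}\n\n{blocks}\n\n"
        "Using the candidate solutions above, write a single improved "
        "solution. Reuse their correct steps, fix any errors, and end "
        "with the final answer."
    )

def select_final_solution(population, extract_final_answer=None):
    # Termination: majority vote over extracted final answers when an
    # answer extractor is available, otherwise sample a member at random.
    if extract_final_answer is None:
        return random.choice(population)
    answers = [extract_final_answer(s) for s in population]
    winner, _ = Counter(answers).most_common(1)[0]
    # Return one full solution whose final answer matches the vote winner.
    return next(s for s, a in zip(population, answers) if a == winner)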

Aggregation-Aware Reinforcement Learning

Standard RL post-training for LLMs does not align with the aggregation task at inference, often degrading RSA performance due to distribution shift. The paper proposes aggregation-aware RL, which augments the training dataset with aggregation prompts containing multiple candidate solutions. The RL objective is modified to optimize both standard and aggregation prompts, enabling the model to learn aggregation skills directly.

  • Objective for Standard Prompts:

\max \mathbb{E}_{(x, y) \sim D} \left[ \mathbb{E}_{T \sim p_\theta(\cdot|x)} [r(T, y)] - \beta \mathrm{KL}(p_\theta(\cdot|x) \| p_{\text{ref}}(\cdot|x)) \right]

  • Objective for Aggregation Prompts:

\max \mathbb{E}_{(x, y) \sim D,\ S_0 \sim p_{\text{ref}}(\cdot|x)} \left[ \mathbb{E}_{T \sim p_\theta(\cdot|x, S_0)} [r(T, y)] - \beta \mathrm{KL}(p_\theta(\cdot|x, S_0) \| p_{\text{ref}}(\cdot|x, S_0)) \right]

This approach is compatible with standard policy gradient algorithms (e.g., PPO, RLOO) and can be implemented with parameter-efficient fine-tuning.
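
As an illustration of the data-augmentation strategy described above, the following sketch mixes standard prompts with aggregation prompts built from reference-policy samples. The sample_from_ref interface, the mixing fraction, and the reuse of the hypothetical build_aggregation_prompt helper from the pseudocode section are assumptions; reward computation and the policy-gradient update itself are left to the chosen RL framework (e.g., PPO or RLOO).

import random

def build_rl_prompt_set(dataset, sample_from_ref, K=3, aggregation_fraction=0.5):
    """Mix standard prompts with aggregation prompts for RL fine-tuning.

    dataset: iterable of (query, gold_answer) pairs.
    sample_from_ref: callable(query, n) -> list of n candidate solutions
        drawn from the reference policy (an assumed interface).
    Returns (prompt, gold_answer) pairs; the RL trainer scores rollouts
    against gold_answer and applies its usual policy-gradient update with
    optional KL regularization against the reference policy.
    """
    prompts = []
    for query, gold in dataset:
        if random.random() < aggregation_fraction:
            candidates = sample_from_ref(query, K)  # S_0 ~ p_ref(.|x)
            prompts.append((build_aggregation_prompt(query, candidates), gold))
        else:
            prompts.append((query, gold))
    return prompts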

Empirical Results

RSA is evaluated on math (AIME-25, HMMT-25), code generation (LiveCodeBench-v6), general reasoning (Reasoning Gym), and knowledge recall (SuperGPQA) benchmarks, using Qwen3-4B-Instruct-2507 and other models. Key findings:

  • Performance Gains:
    • RSA consistently outperforms sequential (self-refinement) and parallel (majority voting, rejection sampling) baselines.
    • RSA enables smaller models (e.g., Qwen3-4B-Instruct-2507) to match or exceed the performance of larger models (e.g., DeepSeek-R1, o3-mini (high)).
    • Aggregation-aware RL further amplifies RSA's benefits, yielding superior results compared to standard RL fine-tuning.
  • Scaling Behavior:
    • Increasing K (aggregation set size) improves performance, with diminishing returns beyond K = 3 due to context length constraints.
    • Larger N (population size) enhances diversity and asymptotic performance but requires more steps T for convergence.
    • Performance improves monotonically with T (number of recursive steps), except in cases where diversity is lost too quickly.
  • Generalization:
    • Aggregation-aware RL exhibits strong out-of-domain transferability, improving performance on tasks not present in the training set.

Implementation Considerations

  • Computational Requirements:
    • RSA increases inference-time compute linearly with N × T, but parallelization is straightforward (see the budget sketch after this list).
    • Context length constraints limit the effective aggregation set size K; prompt engineering can mitigate this.
  • Deployment:
    • RSA can be integrated into existing LLM serving pipelines with minimal changes.
    • Aggregation-aware RL fine-tuning requires additional data generation and training but is compatible with standard RLHF frameworks.
  • Limitations:
    • Excessive population size N without sufficient aggregation steps T can slow convergence.
    • Loss of diversity due to repeated aggregation may hinder exploration of alternative reasoning paths.
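
As a rough budgeting aid for the compute point above, the sketch below counts LLM calls per query under illustrative settings; it ignores the extra prompt tokens consumed by aggregation calls, which also grow with K.

def rsa_generation_budget(N, T, count_init_as_round=True):
    # Total LLM calls for one query: N generations per round. Whether the
    # initial population counts as one of the T rounds is a bookkeeping
    # choice; the budget-matched baselines cited in this summary use
    # N * T generations.
    rounds = T if count_init_as_round else T + 1
    return N * rounds

# Illustrative settings: N = 16, T = 10 -> 160 generations, so a
# budget-matched parallel baseline would sample 160 independent candidates.
assert rsa_generation_budget(16, 10) == 160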

Theoretical and Practical Implications

RSA demonstrates that test-time scaling via recursive aggregation can unlock deeper reasoning in LLMs, bridging the gap between lightweight and heavyweight models. The evolutionary perspective enables the reuse of partially correct intermediate steps, improving robustness and solution quality. Aggregation-aware RL aligns training objectives with inference strategies, mitigating distribution shift and enhancing generalization.

Future directions include integrating explicit fitness functions (e.g., self-verification), composing RSA with other test-time scaling methods, and developing multi-step RL policies for end-to-end RSA optimization.

Conclusion

Recursive Self-Aggregation provides a principled and effective framework for enhancing LLM reasoning at inference time, combining the strengths of parallel and sequential scaling. The method is simple to implement, generalizes across tasks and models, and benefits from aggregation-aware RL fine-tuning. RSA sets a new standard for test-time scaling, with clear implications for the development of more capable and efficient LLMs.


Explain it Like I'm 14

Overview

This paper introduces a new way to help AI LLMs “think deeper” when answering hard questions. The method is called Recursive Self-Aggregation (RSA). Instead of giving just one answer, the model first creates many possible answers with step-by-step reasoning, then repeatedly combines the best parts from different attempts to build better solutions. This process makes even smaller models perform more like much larger, smarter ones.

What questions did the researchers ask?

To make this easy to follow, here are the main questions the paper explores:

  • Can we boost a model’s reasoning at the moment it answers (without retraining it) by letting it think longer and combine its own ideas?
  • Is it better to improve one answer step by step, or to explore many different answers and then pick or merge them?
  • Can a model learn how to combine multiple solutions more effectively if we train it specifically for that task?

How did they do it?

Think of the model like a class of students solving a tough problem.

  • Parallel thinking: Many students try different approaches at the same time. Later, you pick the best one. This is breadth-first thinking.
  • Sequential thinking: One student writes a solution, then revises it over and over to fix mistakes. This is depth-first thinking.
  • RSA (their method): Mix both ideas. Many students try first. Then, in rounds, small groups come together to combine the best parts of their attempts into improved solutions. Repeat this several times. In each round, good ideas spread and bad ones get dropped.

Here’s how RSA works in everyday terms:

  1. Start with a “population” of attempts: The model writes several different step-by-step solutions for the same question.
  2. Group and aggregate: Take small groups of those solutions and ask the model to write a new, better solution that reuses correct steps and avoids mistakes. This is “self-aggregation.”
  3. Repeat (recursive part): Use the new, improved solutions as the next population. Form new groups and aggregate again. Do this for several rounds.
  4. Finish: At the end, pick an answer from the final set (or use a simple vote).

Key ideas made simple:

  • “Test-time scaling” means giving the model more time and compute to think when answering, instead of retraining it.
  • “Aggregation” means combining the useful pieces from different solutions.
  • “Population size” (N) is how many solutions you keep at once.
  • “Group size” (K) is how many solutions you combine at a time.
  • “Number of rounds” (T) is how many times you repeat the combine-and-improve step.
  • “Pass@1” is a score that means “got the correct answer on the first try.”

They also tried a training twist:

  • “Aggregation-aware RL” is a way to fine-tune the model so it practices not only solving problems but also merging multiple solutions. Regular RL often trains the model to produce one good answer, but not to combine several. The authors show that teaching the model to aggregate directly helps RSA work better.

What did they find, and why is it important?

In short, RSA consistently improved performance across many kinds of tasks:

  • Math competitions (AIME-25, HMMT-25)
  • Logic and puzzle games (Reasoning Gym)
  • Coding problems (LiveCodeBench)
  • Knowledge questions (SuperGPQA)

Highlights:

  • A smaller model using RSA became competitive with much bigger “reasoning” models like DeepSeek-R1 and o3-mini (high).
  • RSA beat both “sequential” methods (only revising one solution) and “parallel” methods (only picking the best among many) on most tests.
  • More rounds (larger T) usually led to steady improvements. Letting the model think through several aggregation steps helps.
  • Considering more solutions per group (larger K) often boosted results, especially when moving from K = 1 to K = 2. Combining multiple ideas is better than just revising one.
  • Keeping more solutions overall (larger N) can raise the ceiling on performance, but you may need more rounds to let the good ideas spread through the population.
  • Training the model to aggregate (aggregation-aware RL) gave extra gains. In contrast, standard RL that ignores aggregation sometimes made RSA worse.

Why it matters:

  • This shows that models can “think better” at answer time by combining their own attempts—not just by being bigger or retrained from scratch.
  • It unlocks stronger reasoning for smaller models, which can save cost, energy, and time.

What could this mean in the future?

This approach could make AI tools more reliable and affordable for schools, developers, and researchers by:

  • Helping smaller models handle challenging math, code, logic, and planning problems.
  • Reducing the need for expensive retraining by boosting performance “on the fly.”
  • Inspiring new methods that mix idea-sharing (aggregation) with careful, multi-step thinking.

The authors also note that powerful test-time methods should be used responsibly. Better reasoning can be very helpful, but it also needs careful evaluation to prevent misuse or overconfidence.

Overall, RSA is like running a friendly tournament of ideas inside the model: many attempts compete, the best parts get combined, and after several rounds, the model ends up with a clearer, smarter solution.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored, framed to guide actionable future research.

  • Lack of theoretical guarantees for RSA: no proofs or sufficient conditions for monotonic improvement, convergence, or stability as a function of K, N, and T.
  • Compute accounting not standardized: no token- or wall-clock–normalized cost–benefit curves comparing RSA vs. baselines (aggregation steps incur extra context reads and long prompts).
  • Latency and throughput trade-offs unreported: missing benchmarks on end-to-end runtime, memory footprint, and GPU utilization across K, N, T and context lengths.
  • Context-length constraints: no study of truncation effects, long-context efficiency, or attention degradation as K and chain lengths grow.
  • Diversity control is heuristic: no principled mechanisms to measure and preserve diversity during recursion; no diversity metrics reported beyond qualitative trends; no adaptive diversity regularization.
  • Subset construction is naive: only uniform random subsampling studied; no evaluation of disagreement-, clustering-, uncertainty-, or quality-aware set selection strategies.
  • Termination and selection strategy underdeveloped: final choice often uniform-random or simple voting; no study of self-verification–guided selection, adaptive stopping (early exit), or confidence calibration.
  • Failure mode analysis is limited: no systematic diagnostics when deeper steps hurt (e.g., RG Cognition+ARC downturn), nor criteria to detect and prevent degeneracy or error amplification.
  • Robustness to low-quality or adversarial candidates unknown: no stress tests where most candidates are misleading, off-distribution, or adversarial; no safeguards against aggregating spurious patterns.
  • Applicability to multiple-choice tasks unclear: RSA underperforms majority voting on SuperGPQA; missing adaptations that exploit option structure (e.g., option-aware aggregation, verifier-guided elimination).
  • Generalization across domains and modalities untested: no evaluation on tool-use, program synthesis with execution feedback loops, multi-modal reasoning, interactive planning, or real-world long-horizon tasks.
  • Limited model coverage in RL setting: aggregation-aware RL evaluated only with Qwen3-4B; no tests of transfer to other architectures, sizes, or MoE models.
  • Training–inference mismatch remains: only single-step aggregation trained; no multi-step (end-to-end) RL that optimizes the full RSA loop, credit assignment across steps, or step-wise rewards.
  • RL objective design underexplored: no ablations on RL algorithms (PPO/GRPO vs. RLOO), KL scaling, ratio of standard vs. aggregation prompts, or reward shaping beyond final correctness.
  • Aggregation prompt sensitivity unquantified: no systematic study of prompt phrasing, structure, number/ordering of candidates, language, or instruction strength on aggregation quality.
  • Decoding policy not tuned across steps: no exploration of temperature/top-p schedules across RSA iterations (e.g., higher exploration early, exploitation late) or chain length controls.
  • Compute-budget tuning lacks automation: heuristic guidance for K–N–T trade-offs but no budget-aware optimizer or learned policy to allocate parallel vs. sequential compute adaptively per instance.
  • No comparison to verifier-driven evolutionary methods under matched budgets: external-verifier genetic loops are excluded; open question where RSA stands when verifiers are available.
  • Missing integration with explicit verification: how self-aggregation + exact/learned verifiers (e.g., fitness filtering) affects performance, stability, and compute efficiency is untested.
  • Limited analysis of “partial-correctness reuse”: claims that RSA reuses correct steps are anecdotal; no quantitative step-level tracking, edit-distance analyses, or causal attribution of which fragments drive gains.
  • Calibration and truthfulness not assessed: no measures of confidence calibration, susceptibility to hallucinations, or plausibility-vs-correctness trade-offs in aggregated chains.
  • Sensitivity to seed variability and sampling hyperparameters: only four seeds for most tasks; no robustness analysis across broader randomness settings or decoding hyperparameter grids.
  • Code-specific concerns: majority voting omitted due to exact-match issues; no exploration of unit-test–aware aggregation, multi-hypothesis synthesis, or execution-guided filtering within RSA.
  • Cross-model aggregation unexplored: RSA uses a single model; benefits and risks of aggregating heterogeneous model outputs (Mixture-of-Agents without a strong aggregator) are unknown.
  • Security and safety implications not studied: potential for RSA to combine fragments that bypass content filters, amplify harmful content, or increase jailbreak success rates remains unassessed.
  • Data contamination and provenance audits are minimal: beyond high-level statements, no detailed contamination checks for all benchmarks or leakage between RL training and evaluation distributions.
  • Scaling laws uncharacterized: no formal or empirical scaling laws predicting accuracy as a function of inference-time compute (tokens or FLOPs), K, N, and T across tasks and models.
  • Early-stopping and per-instance adaptivity: no mechanism to detect when additional RSA steps won’t help; no per-problem policies to dynamically choose K, N, or T.
  • Aggregator specialization vs. shared weights: only same-model generator/aggregator considered; unclear whether a dedicated aggregator (or small adapter) outperforms shared-parameter setups.
  • Practical deployment guidance incomplete: no end-to-end system study of throughput under service constraints (batching, streaming, KV-caching), or interaction with long-context optimizations.

Practical Applications

Immediate Applications

The following applications can be deployed now by integrating Recursive Self-Aggregation (RSA) into existing LLM inference pipelines or model post-training workflows. Each item notes target sectors, potential tools/products/workflows, and assumptions or dependencies.

  • Education — AI math tutors with “deep-thinking mode”
    • Use RSA to aggregate and recursively refine multiple solution paths for competition-level problems (e.g., AIME/HMMT) before presenting a final, step-by-step explanation.
    • Tools/workflows: Tutor backend runs RSA with typical settings (e.g., N=16, K=4, T=10), caches intermediate chains, and exposes a “Think deeper” toggle to trade latency for correctness.
    • Assumptions/dependencies: User latency tolerance; sufficient inference compute; correct formatting of math answers; curriculum-aligned prompts; diminishing returns if K exceeds model’s effective context length.
  • Software — IDE code assistants with robust patch synthesis
    • Integrate RSA to produce and iteratively aggregate multiple code candidates, improving Pass@1 on tasks like those in LiveCodeBench-v6 and generating more reliable patches, tests, and refactors.
    • Tools/workflows: An IDE plugin orchestrating RSA steps, optional majority voting for small fixes, and test-time “fast vs deep” compute controls; server-side orchestrator for N, K, T scheduling.
    • Assumptions/dependencies: Unit tests or runtime sandboxes to quickly validate final outputs; memory/latency budgets; context length constraints for including K candidates in aggregation prompts.
  • Enterprise Knowledge Work — “Consensus drafting” for reports, emails, policies
    • RSA aggregates multiple candidate drafts and reasoning chains to output more accurate and consistent documents (e.g., strategy memos, SOPs).
    • Tools/workflows: Document editor add-on that generates N drafts, runs T aggregation iterations, then selects or majority-votes the best result; version diff viewer of chains-of-thought.
    • Assumptions/dependencies: Strict prompt templates for format compliance; audit logging of candidate chains; acceptable latency (seconds to minutes depending on T).
  • Research and Academia — Evidence synthesis and literature review assistants
    • RSA combines diverse lines of reasoning across multiple candidate summaries to produce higher-quality syntheses for literature reviews and method comparisons.
    • Tools/workflows: Research assistant that samples N diverse summaries, aggregates K at a time across T steps, and surfaces provenance links for each incorporated idea.
    • Assumptions/dependencies: High-quality retrieval grounding (citations); implicit verification in RSA is weaker than formal fact-checking—add optional external verifiers for claims.
  • Customer Support — Triage and resolution with recursive aggregation
    • Aggregate multiple resolution paths for complex tickets (configuration, billing edge cases) to improve correctness and consistency.
    • Tools/workflows: Support bot backend with RSA; policy templates for escalation; “retry with deeper thinking” button for difficult cases.
    • Assumptions/dependencies: Domain-specific knowledge bases; guardrails for actions; latency acceptable for complex tickets.
  • Operations and Planning — Improved task sequencing and puzzle-like planning
    • For discrete planning tasks analogous to Reasoning Gym (e.g., simple scheduling, task ordering), RSA refines multi-step plans across iterations to reduce dead-ends.
    • Tools/workflows: Planner backend orchestrating RSA with adaptive K based on context length; majority voting or random sampling for final plan selection; monitor Pass@N−Pass@1 gap to decide when to stop.
    • Assumptions/dependencies: Structured plan representations; implicit verification may miss feasibility constraints—pair with light external checks if available.
  • Cloud LLM Providers — Test-time compute “knobs” as an API feature
    • Offer RSA as a first-class inference option with tunable N (population size), K (aggregation batch), and T (steps), enabling users to trade cost/latency for accuracy.
    • Tools/workflows: RSA orchestration SDK; autoscaling schedulers; dynamic selection of K,N,T based on user SLA and context length.
    • Assumptions/dependencies: Robust memory management; diversity-preserving sampling; clear billing tied to total generations (N×T).
  • Model Developers — Aggregation-aware RL post-training
    • Adopt the paper’s RL augmentation strategy to align models with aggregation at inference, improving RSA effectiveness and robustness across domains (notably code generalization, despite training without code).
    • Tools/workflows: Dataset augmentation scripts to include aggregation prompts with K candidate chains; RLOO/PPO training pipelines; evaluation harness tracking Pass@1 over T.
    • Assumptions/dependencies: Clean separation of train/test to avoid contamination; reward design for correctness; compute for RL fine-tuning.
  • Cost Optimization — Small-model substitution for expensive reasoning models
    • Use RSA to elevate smaller open models (e.g., Qwen3-4B-Instruct) to match or approach larger reasoning models (e.g., DeepSeek-R1, o3-mini high) on target tasks, reducing inference cost.
    • Tools/workflows: Side-by-side benchmarking; RSA “deep mode” for high-stakes queries only; policy for when to fallback to larger models.
    • Assumptions/dependencies: Task fit and sensitivity; workload has tolerable latency budget; careful tuning of K,N,T per paper’s guidance.
  • Governance and Compliance — Multi-draft policy synthesis (low stakes)
    • Apply RSA to aggregate compliance interpretations and policy drafts before human review.
    • Tools/workflows: Governance writing assistant with RSA; audit trail of candidate chains; final human approval stage.
    • Assumptions/dependencies: Human-in-the-loop needed for legal/ethical correctness; provenance and logging; external verifiers recommended for high-stakes claims.
  • Personal Productivity — “Think deeper” assistants for planning, travel, cooking
    • RSA refines multiple candidate itineraries, recipes, or step-by-step plans to produce a more robust final recommendation.
    • Tools/workflows: Mobile app toggle for deeper reasoning; cached intermediate chains; adaptive steps until Pass@N−Pass@1 narrows.
    • Assumptions/dependencies: Latency tolerance; transparent settings; careful formatting of final outputs.
  • MLOps — RSA-aware inference schedulers and monitoring
    • Create orchestrators that optimize N,K,T under compute and latency budgets, and dashboards tracking mixing speed (Pass@N vs Pass@1) to auto-stop runs.
    • Tools/workflows: RSA controller; memory-efficient batching; configurable aggregation prompts; fallback strategies (e.g., majority voting for MCQ).
    • Assumptions/dependencies: GPU/TPU capacity; paged attention or similar memory management; reliable seed diversity to prevent premature convergence.

Long-Term Applications

These opportunities require further research, scaling, integration with verifiers, or regulatory approvals. They build on RSA’s recursive aggregation and the aggregation-aware RL training approach.

  • Healthcare — Clinical decision support with multi-path reasoning aggregation
    • RSA could combine diverse diagnostic or treatment rationales to reduce single-path bias.
    • Tools/workflows: RSA+explicit medical verifiers; provenance tracking; clinician-in-the-loop interfaces.
    • Assumptions/dependencies: Strong external validators; regulatory approval; rigorous safety testing.
  • Finance — Scenario aggregation for risk modeling and compliance
    • Aggregate multiple macro/micro risk narratives and stress-test rationales to produce robust briefs and action plans.
    • Tools/workflows: RSA with financial data pipelines; model risk management overlays; audit logs of reasoning chains.
    • Assumptions/dependencies: Verifiers for numerical consistency; governance frameworks; explainability requirements.
  • Legal — Contract and brief drafting via recursive aggregation
    • Merge multiple clause interpretations and argument lines to produce more complete drafts.
    • Tools/workflows: RSA+legal citation checkers; redlining visualization of aggregated changes; human attorney validation.
    • Assumptions/dependencies: High-quality legal knowledge bases; strong fact/precedent verification; liability safeguards.
  • Robotics and Edge AI — Deeper planning with smaller on-device models
    • Use RSA to enhance small models’ planning performance for constrained tasks (navigation, manipulation) via recursive aggregation of plan candidates.
    • Tools/workflows: Streaming/partial aggregation to fit edge memory; integration with formal planners and safety checks.
    • Assumptions/dependencies: Real-time constraints; reliability under distribution shift; external verifiers for feasibility and safety.
  • Energy and Operations Research — Grid scheduling and optimization assistants
    • Aggregate multiple candidate schedules or dispatch plans to improve robustness to uncertainties.
    • Tools/workflows: RSA layered with domain solvers (MILP/CP); post-aggregation feasibility checks; operator-in-the-loop dashboards.
    • Assumptions/dependencies: Strong external optimization/verifiers; time-critical SLAs; auditability.
  • Scientific Discovery — Hypothesis and experiment-plan aggregation
    • Combine diverse hypothesis chains to propose stronger experimental designs and analyses.
    • Tools/workflows: RSA with scientific retrieval and simulation backends; structured experiment DSL; lab notebook provenance.
    • Assumptions/dependencies: Domain verifiers; reproducibility standards; human expert review.
  • Education Systems — Aggregated feedback and grading at scale
    • Use RSA to merge multiple candidate feedback explanations and rubrics into consistent, adaptive guidance for students.
    • Tools/workflows: LMS integration; bias/consistency checks; student privacy safeguards.
    • Assumptions/dependencies: Fairness auditing; content standards; teacher-in-the-loop.
  • Multi-Agent and Genetic Algorithms — End-to-end RSA policy training
    • Extend aggregation-aware RL to train policies that plan the entire RSA loop (selection, crossover-like aggregation, mutation-like diversification).
    • Tools/workflows: Multi-step RL; curriculum learning across K,N,T; diversity control mechanisms and fitness functions.
    • Assumptions/dependencies: Stable training; scalable reward design; compute for long horizons.
  • High-Stakes Governance — Evidence synthesis in policymaking
    • RSA-based systems to iteratively aggregate multi-source evidence, expert opinions, and impact assessments.
    • Tools/workflows: RSA with robust fact-checkers and causal inference modules; transparent provenance; stakeholder review workflows.
    • Assumptions/dependencies: Strong verifiers; legal and ethical compliance; public accountability mechanisms.
  • Safety and Red Teaming — Aggregated threat modeling and defense planning
    • Combine multiple attack hypotheses and defense strategies to produce comprehensive risk assessments for AI and IT systems.
    • Tools/workflows: RSA plus formal verification and simulation; cross-team collaborative interfaces; audit trails.
    • Assumptions/dependencies: Access to high-fidelity simulators; secure handling of sensitive content; governance for dual-use concerns.
  • Platform Tooling — RSA SDKs, templates, and observability
    • Productize RSA orchestration libraries, prompt templates, and monitoring (e.g., Pass@N tracking, diversity metrics) for enterprise deployment.
    • Tools/workflows: Configurable controllers for K,N,T; context length-aware aggregation prompts; auto-tuning based on latency constraints.
    • Assumptions/dependencies: Vendor-neutral interfaces; compatibility with diverse model families; robust memory and batching strategies.
  • Hybrid Pipelines — RSA combined with external verifiers/self-verification
    • Compose RSA with code compilation, unit tests, math solvers, and learned reward models to introduce explicit fitness functions while preserving RSA’s depth and diversity.
    • Tools/workflows: Two-stage pipelines (RSA → verify → re-aggregate); adaptive set sampling; dynamic pruning of low-fitness chains.
    • Assumptions/dependencies: Reliable verifiers; careful trade-offs between exploration and exploitation; guardrails to prevent mode collapse.

Notes on Feasibility Assumptions and Dependencies

  • Compute and latency: RSA scales inference-time compute via N×T generations; deployment feasibility hinges on acceptable latency and GPU/TPU capacity. Offer “deep vs fast” modes.
  • Context length: Aggregation requires fitting K candidate chains in context. Gains diminish when K exceeds the model’s effective context; tune K (often 2–4) per the paper’s guidance.
  • Diversity: Maintain population diversity to avoid premature convergence. Use stochastic sampling, adequate N relative to K, and monitor the Pass@N − Pass@1 gap to gauge mixing speed (see the sketch after this list).
  • Selection: Majority voting works well for MCQ; for open-ended outputs, random sampling or external verifiers can improve final selection robustness.
  • Alignment: Standard RL may hurt RSA; aggregation-aware RL training aligns the model to aggregation prompts and improves performance and transfer.
  • Safety and governance: In high-stakes domains (healthcare, finance, legal, policy), pair RSA with strong external verification, provenance tracking, human oversight, and compliance frameworks.
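
To make the mixing-speed check above concrete, here is a minimal sketch for computing the Pass@N − Pass@1 gap from per-candidate correctness flags, assuming each candidate's final answer can be scored against a reference.

def pass_at_1(correct_flags):
    # Expected accuracy when picking one candidate uniformly at random.
    return sum(correct_flags) / len(correct_flags)

def pass_at_n(correct_flags):
    # 1.0 if at least one candidate in the population is correct.
    return 1.0 if any(correct_flags) else 0.0

def mixing_gap(correct_flags):
    # Pass@N - Pass@1: a large gap means correct answers exist in the
    # population but have not yet spread; a shrinking gap suggests further
    # RSA rounds will yield diminishing returns.
    return pass_at_n(correct_flags) - pass_at_1(correct_flags)

# Example: 3 correct out of 8 candidates -> gap = 1.0 - 0.375 = 0.625.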

Glossary

  • Aggregation-aware RL: A reinforcement learning fine-tuning approach that explicitly trains the model to aggregate multiple candidate solutions at inference. "To address this, we propose an aggregation-aware RL approach using a simple data-augmentation strategy to train LLMs to aggregate solutions (§4)."
  • Aggregation prompt: A special instruction format that presents the problem and selected candidate solutions to the model to elicit an improved, combined solution. "formatted using an aggregation prompt directing the LLM p_θ to generate a refined response, forming a new population of candidates P_{t+1}:"
  • Aggregation set size K: The number of candidate solutions included in each aggregation step; controls how many alternatives the model considers when combining. "Additionally, the choice of K defines the number of alternative responses to consider for aggregation, with K = 1 being equivalent to sequential self-refinement (Madaan et al., 2023)."
  • AIME-25: A competition-level math benchmark from MathArena used to evaluate LLM reasoning. "We use AIME-25 and HMMT-25 from MathArena (Balunović et al., 2025), each containing 30 challenging competition-level math problems."
  • Best-of-N: A parallel selection strategy that generates N candidate solutions and picks the one with the highest score from a verifier. "A simple strategy is Best-of-N (Gao et al., 2023a), where N candidates are generated and the highest-reward solution is selected."
  • Bootstrapping: Leveraging partially correct intermediate steps from different reasoning chains to construct better final solutions. "and enables bootstrapping from partially correct intermediate steps within different chains of thought."
  • Chain-of-thought 'thinking' models: LLMs trained or configured to produce long, explicit reasoning traces before final answers. "including long chain-of-thought 'thinking' models"
  • DeepSeek-R1: A strong reasoning-focused LLM used as a comparison target in evaluations. "RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high)"
  • Evolutionary algorithms: Optimization methods inspired by natural selection, involving populations, recombination, and mutation; used here to motivate RSA’s iterative aggregation. "integrating aggregation steps into a self-improvement loop motivated by evolutionary algorithms."
  • Generation-verification gap: The phenomenon where LLMs are better at judging correctness than generating correct solutions. "LLMs exhibit a generation-verification gap: they are more reliable at judging correctness of solutions than producing them (Li et al., 2024)."
  • GRPO: A policy gradient algorithm variant used for RL fine-tuning of LLMs. "such as PPO (Ouyang et al., 2022), GRPO (Shao et al., 2024), or RLOO (Ahmadian et al., 2024)"
  • Hybrid scaling: Test-time strategies that combine parallel branching and sequential refinement to leverage both breadth and depth. "Hybrid scaling. Sequential and parallel scaling strategies can be combined in hybrid frameworks that draw on the strengths of both paradigms."
  • Hybrid state-space models: Architectures that combine state-space components with other modeling approaches; included among evaluated model families. "and hybrid state-space models."
  • HMMT-25: A competition-level math benchmark from MathArena used to evaluate LLM reasoning. "We use AIME-25 and HMMT-25 from MathArena (Balunović et al., 2025), each containing 30 challenging competition-level math problems."
  • Implicit verification: Relying on the model’s ability to generate improved solutions without explicit scoring, effectively verifying during generation. "Implicit verification. Some methods bypass explicit verification by relying on the LLM to generate improved solutions, effectively performing verification of solutions without scoring them."
  • Inference-time compute: The amount of computation used during model inference to produce answers, scalable for improved performance. "Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement."
  • KL regularization: A regularization term that controls divergence from a reference policy during RL fine-tuning. "where β controls the optional KL regularization with the reference policy p_ref."
  • LiveCodeBench-v6: A large code-generation benchmark for evaluating LLMs on programming tasks. "We use LiveCodeBench-v6 (Jain et al., 2024) which contains 1055 problems."
  • Majority voting: A parallel aggregation method that selects the most common answer among multiple samples under a self-consistency assumption. "majority voting (Wang et al., 2023) works on the assumption of self-consistency: that the model produces correct answers more consistently than incorrect ones."
  • Mixture-of-Agents: A hybrid test-time scaling framework where multiple LLMs produce proposals that are recursively aggregated. "Another example of hybrid test-time scaling is Mixture-of-Agents (Wang et al., 2024), where an ensemble of LLMs generates improved proposals that are aggregated by a strong model into the seed solution for the next iteration."
  • Mixture-of-Experts (MoE): Sparse model architectures that route inputs to different expert sub-networks; included among evaluated models. "including long chain-of-thought 'thinking' models, sparse Mixture-of-Experts (MoE) architectures, and hybrid state-space models."
  • Pass@1: An evaluation metric measuring the fraction of problems correctly solved in a single attempt. "We report Pass@1 scores for RSA and other test-time scaling baselines."
  • Pass@N: An evaluation metric measuring the probability that at least one of N attempts is correct. "The Pass@N score for a population of N solutions is equal to 1 if at least one final answer out of the N is correct."
  • Parameter-efficient fine-tuning: Techniques that adapt models using few additional parameters rather than full model updates. "or using a parameter-efficient fine-tuning technique."
  • Policy gradient algorithm: RL optimization methods that update policies via gradients of expected reward. "This objective can be optimized using any off-the-shelf policy gradient algorithm, such as PPO (Ouyang et al., 2022), GRPO (Shao et al., 2024), or RLOO (Ahmadian et al., 2024)"
  • Population (in RSA): The set of candidate reasoning chains maintained and refined across iterations. "RSA maintains a population of candidate solutions and iteratively recombines subsets of the population to produce a new population of improved solutions (Fig. 3)."
  • Population size N: The number of candidate solutions in RSA’s population, controlling diversity and asymptotic performance. "Maintaining a large population size N relative to the aggregation size K helps ensure sufficient diversity for recombination."
  • PPO: Proximal Policy Optimization, a widely used policy gradient algorithm for RLHF-style training. "such as PPO (Ouyang et al., 2022), GRPO (Shao et al., 2024), or RLOO (Ahmadian et al., 2024)"
  • Qwen3-4B-Instruct-2507: A 4B-parameter instruction-tuned model used as a primary base model in experiments. "Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high)"
  • Reasoning Gym: A suite of tasks for general reasoning and planning used to evaluate LLMs. "We construct two datasets with 100 problems each from Reasoning Gym (Stojanovski et al., 2025), using tasks from the games category, and cognition + ARC categories."
  • Rejection sampling: A parallel method that samples multiple candidates and discards those failing a verification criterion. "We evaluate majority voting (Wang et al., 2023) and rejection sampling with self-verification (Weng et al., 2023), budget-matched with RSA by using N x T generations."
  • Reinforcement learning (RL) post-training: Using RL after pretraining to improve reasoning ability and alignment. "In addition to the test-time strategies discussed thus far, a model's reasoning ability can be improved by post-training it with reinforcement learning (RL) (Jaech et al., 2024; Guo et al., 2025)."
  • Recursive Self-Aggregation (RSA): A hybrid test-time scaling algorithm that iteratively aggregates subsets of candidate solutions to evolve better reasoning chains. "We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling."
  • Reward models: Learned evaluators that score candidate solutions using preference or correctness signals, enabling selection and improvement. "learned reward models, trained on preference data or correctness signals derived from reasoning chains"
  • Self-aggregation: Combining multiple candidate reasoning chains from the same model to produce an improved solution without external verifiers. "We study a general way to improve LLM reasoning chains through self- aggregation: providing the model with the query and a set of candidate solutions and prompting it to produce an improved solution."
  • Self-consistency: The assumption that correct answers appear more consistently than incorrect ones across samples, used by majority voting. "majority voting (Wang et al., 2023) works on the assumption of self-consistency: that the model produces correct answers more consistently than incorrect ones."
  • Self-refinement: Iteratively improving a single reasoning chain by correcting its mistakes over multiple steps. "Self-refinement methods - the quintessential form of sequential scaling - can improve a candidate solution by reusing its own correct parts, but do not leverage the information contained within other candidates."
  • Self-verification: Using the LLM itself to judge correctness of its outputs during inference. "This property can be exploited to enable test-time scaling by using the LLM as a verifier of its own outputs (e.g., Madaan et al., 2023; Weng et al., 2023)."
  • Sequential scaling: Increasing inference depth by performing multiple iterative model evaluations to refine solutions. "Sequential scaling instead increases the number of iterative model evaluations to produce higher- quality solutions"
  • SuperGPQA: A large, graduate-level knowledge benchmark assessing factual recall and reasoning. "We use SuperGPQA (M-A-P Team et al., 2025), a graduate-level knowledge-based reasoning benchmark, to test effectiveness of RSA on tasks requiring factual recall."
  • Test-time scaling: Techniques that improve LLM performance by spending more compute at inference without changing model weights. "Test-time scaling methods improve the capabilities of LLMs by increasing the amount of compute used during inference to make a prediction."
  • Verification strategy: The method used to assess candidate solution quality (external, self-, or implicit) within test-time scaling frameworks. "provide a taxonomy of test-time scaling frameworks based on the verification strategy and control flow they employ"
  • Verifier-guided Best-of-N selection: A parallel method that uses an external or learned verifier to choose the best among N generated solutions. "parallel scaling methods such as verifier-guided Best-of-N selection can identify the best candidate from a batch, but do not recombine candidates to produce improved solutions."