
SPIRAL: Sequential-Parallel-Aggregative RL

Updated 1 July 2026
  • SPIRAL is a reinforcement learning framework that unifies sequential, parallel, and aggregative reasoning to generate diverse solution traces and optimize aggregated rewards.
  • The method employs a two-component policy gradient strategy using set-RL and standard RL to effectively train both search and aggregation policies.
  • Empirical results highlight significant improvements in mathematical reasoning tasks, including up to 11× efficiency in pass@k and 13.5% absolute gain in pass@1 via recursive aggregation.

Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL) is a reinforcement learning framework designed to align language model (LM) training with realistic inference-time reasoning protocols. Unlike prior approaches, which optimize LMs only for sequential reasoning within a single trace, SPIRAL integrates three distinct reasoning primitives (sequential, parallel, and aggregative compute) within a unified architecture. The key innovation is an end-to-end learning pipeline in which the model generates multiple independently sampled (parallel) chain-of-thought traces and then combines them via a dedicated aggregation trace, which improves reward optimization and performance scaling as inference compute increases (Hamid et al., 22 Jun 2026).

1. Inference Primitives and Pipeline Structure

SPIRAL decomposes inference into three core primitives:

  • Sequential Compute: Within a single trace, the model performs chain-of-thought reasoning, involving intermediate reasoning tokens, sub-goal formulation, self-verification steps, and (optionally) tool calls, ultimately producing an answer. This reflects standard chain-of-thought prompting.
  • Parallel Compute: The model samples $n$ independent reasoning traces in parallel, conditioned only on the input problem. This enables exploration of diverse solution strategies and local optima.
  • Aggregative Compute: Given the problem and the set of $n$ search traces, the model synthesizes an aggregation trace. This trace integrates, compares, and verifies information across the sampled traces to generate a final decision.

The SPIRAL inference pipeline operates as follows:

  1. For an input $x$, sample $n$ search traces $y_1,\ldots,y_n \sim \pi_\theta(\cdot \mid x)$ in parallel.
  2. (Optionally) Perform recursive or groupwise aggregation steps.
  3. Condition on $x$ and all $n$ traces to generate an aggregation trace $y_* \sim \pi_\phi(\cdot \mid x, y_1, \ldots, y_n)$.
  4. Extract the answer from $y_*$ and compute the corresponding reward $r(x, y_*)$.

All components are trained end-to-end to maximize the aggregated response reward.
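
The four pipeline steps can be illustrated with a minimal sketch. Everything below is a toy stand-in rather than the authors' implementation: `toy_search_policy`, `toy_aggregation_policy`, `extract_answer`, and `exact_match_reward` are hypothetical names, the "problem" is a string such as "2+3", and the aggregator simply reports the most common answer instead of running a learned LM. The optional recursive step is omitted.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical type aliases; in a real system these would wrap LM sampling calls.
SearchPolicy = Callable[[str, random.Random], str]
AggPolicy = Callable[[str, List[str], random.Random], str]


def extract_answer(trace: str) -> str:
    """Return whatever follows the last 'answer=' marker in a trace."""
    return trace.rsplit("answer=", 1)[-1]


def toy_search_policy(x: str, rng: random.Random) -> str:
    """Toy search trace: adds the operands of 'a+b', sometimes off by one."""
    a, b = (int(t) for t in x.split("+"))
    noise = rng.choice([0, 0, 0, 1, -1])
    return f"reasoning... answer={a + b + noise}"


def toy_aggregation_policy(x: str, traces: List[str], rng: random.Random) -> str:
    """Toy aggregation trace: reports the most common answer among the traces."""
    answers = [extract_answer(t) for t in traces]
    best = max(sorted(set(answers)), key=answers.count)
    return f"compared {len(traces)} traces... answer={best}"


def exact_match_reward(x: str, y_star: str) -> float:
    """Reward 1.0 if the aggregated answer equals the true sum, else 0.0."""
    a, b = (int(t) for t in x.split("+"))
    return 1.0 if extract_answer(y_star) == str(a + b) else 0.0


def spiral_inference(x: str, n: int, search: SearchPolicy, aggregate: AggPolicy,
                     reward: Callable[[str, str], float],
                     seed: int = 0) -> Tuple[str, float]:
    rng = random.Random(seed)
    traces = [search(x, rng) for _ in range(n)]  # step 1: n independent search traces
    y_star = aggregate(x, traces, rng)           # step 3: aggregation trace over all n
    return y_star, reward(x, y_star)             # step 4: reward on the aggregated answer


y_star, r = spiral_inference("2+3", 8, toy_search_policy,
                             toy_aggregation_policy, exact_match_reward)
print(y_star, r)
```

The point of the sketch is the data flow: search traces depend only on the input, while the aggregation trace sees the input together with the whole set, and only the aggregated answer is scored.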

2. Training Algorithms and Objectives

SPIRAL employs a two-component policy gradient strategy:

  • Set-Reinforcement Learning (set-RL): The search-trace policy $\pi_\theta$ is trained using a set-level reward determined by the aggregated output. The objective is to maximize

$$J(\theta, \phi) = \mathbb{E}_{x}\, \mathbb{E}_{y_1,\ldots,y_n \sim \pi_\theta(\cdot \mid x)}\, \mathbb{E}_{y_* \sim \pi_\phi(\cdot \mid x, y_1,\ldots,y_n)} \big[ r(x, y_*) \big]$$

The gradient with respect to $\theta$ treats the $n$ traces as a set, propagating the aggregator's expected reward back to each search trace.

  • Standard Reinforcement Learning (RL): The aggregation policy $\pi_\phi$ is optimized via standard REINFORCE, with the reward for each aggregation trace computed directly, per set.
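
As a worked sketch, the two gradient terms follow from the score-function (REINFORCE) identity. The block below assumes separate parameters $\theta$ (search) and $\phi$ (aggregation), i.i.d. search traces, and a reward that does not depend on the parameters; $J$ denotes the expected aggregated reward $\mathbb{E}[r(x, y_*)]$, and $y_{1:n}$ abbreviates $y_1,\ldots,y_n$. Baselines and the set-RL credit assignment are omitted, so this is the plain identity rather than the estimator used in the paper.

```latex
\nabla_\theta J
  = \mathbb{E}\Big[\, r(x, y_*) \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(y_i \mid x) \Big]

\nabla_\phi J
  = \mathbb{E}\Big[\, r(x, y_*)\, \nabla_\phi \log \pi_\phi(y_* \mid x, y_{1:n}) \Big]
```

The sum over $i$ appears because the log-probability of an independently sampled set is the sum of per-trace log-probabilities. When a single LM plays both roles, the two expressions are added to give the gradient with respect to the shared parameters.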

To estimate these gradients, SPIRAL adopts the set-RL estimator of Orney et al. (2026), using the following steps:

  1. Sample $N$ candidate traces $y_1,\ldots,y_N \sim \pi_\theta(\cdot \mid x)$.
  2. Pick $K$ sets $S_1,\ldots,S_K$ of size $n$.
  3. For each $S_k$, sample $m$ aggregation traces, compute mean rewards $\bar{r}_k$, and calculate baselines.
  4. Compute set and marginal set advantages, distributing credit to individual search traces.
  5. For aggregation traces, compute per-set baselines and apply REINFORCE.

The resulting estimator is unbiased up to a constant scaling factor (absorbed in the optimizer's learning rate).
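
To make the credit-assignment steps concrete, here is a small sketch of one plausible reading of them. It is not the estimator of Orney et al.: the specific choices (a mean-over-sets baseline for set advantages, and per-trace credit as the average advantage of the sets containing that trace) are assumptions made for illustration, and all function names are hypothetical. Only the per-set baseline for aggregation traces is taken directly from the description above.

```python
from statistics import mean
from typing import List, Sequence


def set_advantages(set_rewards: Sequence[Sequence[float]]) -> List[float]:
    """Mean aggregator reward of each set minus the mean over all sets (assumed baseline)."""
    set_means = [mean(rs) for rs in set_rewards]
    baseline = mean(set_means)
    return [m - baseline for m in set_means]


def trace_credit(sets: Sequence[Sequence[int]], advantages: Sequence[float],
                 num_traces: int) -> List[float]:
    """Assumed rule: a trace's credit is the average advantage of the sets it belongs to."""
    credit = []
    for i in range(num_traces):
        member_adv = [a for s, a in zip(sets, advantages) if i in s]
        credit.append(mean(member_adv) if member_adv else 0.0)
    return credit


def aggregator_advantages(set_rewards: Sequence[Sequence[float]]) -> List[List[float]]:
    """Per-set baseline: each aggregation reward minus the mean reward within its own set."""
    return [[r - mean(rs) for r in rs] for rs in set_rewards]


# Toy example: 4 candidate traces, 3 sets of size 2, 2 aggregation traces per set.
sets = [(0, 1), (2, 3), (0, 2)]
set_rewards = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]

adv = set_advantages(set_rewards)          # [0.5, -0.5, 0.0]
credit = trace_credit(sets, adv, 4)        # [0.25, 0.5, -0.25, -0.5]
agg_adv = aggregator_advantages(set_rewards)  # [[0.0, 0.0], [0.0, 0.0], [0.5, -0.5]]
print(adv, credit, agg_adv)
```

In this toy case trace 1 receives the most credit because it only appears in the set whose aggregations always succeeded, while the aggregation traces in the first two sets receive zero advantage because their rewards do not vary within the set.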

3. Model Architecture and Parameterization

SPIRAL can employ either a single LM for both search and aggregation or separate models for each. The joint distribution is factorized as

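$$p(y_1,\ldots,y_n, y_* \mid x) = \Big[ \prod_{i=1}^{n} \pi_\theta(y_i \mid x) \Big]\, \pi_\theta(y_* \mid x, y_1,\ldots,y_n)$$

or, if using an independent aggregator, the second factor is replaced by $\pi_\phi(y_* \mid x, y_1,\ldots,y_n)$.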

For aggregation, the model conditions on the concatenation of the input problem and the set of search traces fed as a single textual prompt. The existing transformer-decoder architecture's attention mechanism natively supports this conditioning without modification beyond context window extension. Generation for both search and aggregation traces is performed via standard left-to-right token sampling.
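
A minimal sketch of how such a prompt might be assembled is shown below. The template wording, the section labels, and the function name `build_aggregation_prompt` are all hypothetical; the source only states that the problem and the search traces are concatenated into a single textual prompt.

```python
from typing import List


def build_aggregation_prompt(problem: str, traces: List[str]) -> str:
    """Concatenate the problem and all search traces into one prompt string.

    The labels and the closing instruction are illustrative assumptions,
    not the template used in the paper.
    """
    parts = [f"Problem:\n{problem}\n"]
    for i, trace in enumerate(traces, start=1):
        parts.append(f"Candidate solution {i}:\n{trace}\n")
    parts.append("Compare the candidates, verify their steps, and give a final answer.")
    return "\n".join(parts)


prompt = build_aggregation_prompt(
    "What is 2 + 3?",
    ["2 + 3 = 5, so the answer is 5.", "Adding gives 6."],
)
print(prompt)
```

Because the prompt length grows with the number and length of the traces, the set size and per-trace token cap together determine how much context the aggregator needs.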

4. Experimental Protocols and Empirical Results

Experiments focused on mathematical reasoning tasks, specifically a subset of POLARIS-53k, using Qwen3-4B-Instruct-2507 as the base LM. The main baseline is GRPO, a sequence-only RL method (Shao et al., 2024).

  • Training Details: SPIRAL uses batch size 256, with xx5 search traces (up to 4096 tokens each), xx6 sets of size xx7, and xx8 aggregator traces per set (each up to 4096 tokens). This yields approximately 98,304 tokens per problem per update, matching the baseline's resource allocation.
  • Quantitative Results:
    • Pass@k Scaling: SPIRAL demonstrates up to 11× higher scaling efficiency on pass@k (parallel compute with an oracle verifier), needing as few as 1/11 of the samples to match baseline performance.
    • Recursive Self-Aggregation (RSA): With recursive aggregative steps, SPIRAL achieves up to 13.5% absolute improvement in pass@1 after 2–3 levels of aggregation.
    • Sequential Compute Only: When only the length of sequential trajectories is increased, SPIRAL and GRPO show similar performance; distinguishing gains appear when parallel and aggregation primitives are activated.
    • Token Efficiency: SPIRAL with RSA outperforms both pure sequential scaling and majority voting in pass@1 per token. Sequential methods also hit a 32K context limit that does not constrain aggregation.
  • Ablations:
    • Search-trace entropy remains higher in SPIRAL than in GRPO, reflecting set-RL’s effectiveness in promoting diversity.
    • Per-set baselines for aggregator traces are more variance-reducing than global baselines.
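
For readers unfamiliar with the pass@k metric reported above, the sketch below implements the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), computed from `num_samples` generations of which `num_correct` pass the verifier. Whether the SPIRAL experiments use exactly this estimator is an assumption; the source does not say how pass@k was computed. The argument names are chosen here to avoid confusion with the set size $n$ used elsewhere in this article.

```python
from math import comb


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from num_samples generations, is correct."""
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie between 0 and num_samples")
    if not 1 <= k <= num_samples:
        raise ValueError("k must lie between 1 and num_samples")
    if num_samples - num_correct < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


print(pass_at_k(4, 1, 2))  # 1 - C(3,2)/C(4,2) = 1 - 3/6 = 0.5
```

Under this metric, an efficiency gain of 11× means reaching a given pass@k value with roughly one eleventh as many samples as the baseline needs.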

5. Strengths, Limitations, and Open Problems

  • Key Strengths: SPIRAL's integration of sequential, parallel, and aggregative compute closes the gap between LM training and test-time inference pipelines. Set-RL enables the production of search traces that are not just individually plausible but collectively beneficial for downstream aggregation. Empirically, scaling both parallel and aggregative primitives delivers substantially better scaling efficiency and performance than sequence-only RL.
  • Limitations:
    • Current training is limited to 4B–8B parameter LMs; larger scale behavior remains untested.
    • No exhaustive ablation exists attributing gains to each primitive in isolation.
    • The set-RL estimator assumes set symmetry, so ordering biases in transformer models could subtly affect credit assignment.
  • Future Directions: Prominent next steps include expanding to larger models (beyond 8B), analyzing token allocation dynamics between search and aggregation, integrating supervised decomposition objectives, extending to non-mathematical reasoning, and exploring dynamic set sizes and deeper recursive aggregation schemes.

6. Relation to Prior Work and Broader Implications

SPIRAL generalizes and subsumes standard sequential RL (e.g., GRPO) by enabling simultaneous optimization of parallel exploration and structured aggregation within a unified RL framework. By optimizing towards aggregated end rewards rather than solely per-trace rewards, it produces search distributions better aligned with the needs of realistic inference-time verifiers and aggregators. A plausible implication is that such architectures may be essential for closing the gap between current LM training regimes and the emerging standard practice of inference-time aggregation or majority-voting in practical deployments (Hamid et al., 22 Jun 2026).
