SPIRAL: Sequential-Parallel-Aggregative RL
- SPIRAL is a reinforcement learning framework that unifies sequential, parallel, and aggregative reasoning to generate diverse solution traces and optimize aggregated rewards.
- The method employs a two-component policy gradient strategy using set-RL and standard RL to effectively train both search and aggregation policies.
- Empirical results highlight significant improvements in mathematical reasoning tasks, including up to 11× efficiency in pass@k and 13.5% absolute gain in pass@1 via recursive aggregation.
Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL) is a reinforcement learning framework designed to align language model (LM) training with realistic inference-time reasoning protocols. Unlike prior approaches in which LMs are optimized only for sequential reasoning within a single trace, SPIRAL integrates three distinct reasoning primitives (sequential, parallel, and aggregative compute) within a unified architecture. The key innovation is an end-to-end learning pipeline in which the model generates multiple independently sampled (parallel) chain-of-thought traces and then aggregates them via a dedicated aggregation trace, enabling improved reward optimization and performance scaling when inference compute is increased (Hamid et al., 22 Jun 2026).
1. Inference Primitives and Pipeline Structure
SPIRAL decomposes inference into three core primitives:
- Sequential Compute: Within a single trace, the model performs chain-of-thought reasoning, involving intermediate reasoning tokens, sub-goal formulation, self-verification steps, and (optionally) tool calls, ultimately producing an answer. This reflects standard chain-of-thought prompting.
- Parallel Compute: The model samples independent reasoning traces in parallel, conditioned only on the input problem. This enables exploration of diverse solution strategies and local optima.
- Aggregative Compute: Given the problem and the set of search traces, the model synthesizes an aggregation trace. This trace integrates, compares, and verifies information across the sampled traces to generate a final decision.
The SPIRAL inference pipeline operates as follows:
- For an input $x$, sample $n$ search traces $y_1, \dots, y_n$ in parallel.
- (Optionally) Perform recursive or groupwise aggregation steps.
- Condition on $x$ and all traces $y_{1:n}$ to generate an aggregation trace $z$.
- Extract the answer from $z$ and compute the corresponding reward $r(x, z)$.
All components are trained end-to-end to maximize the aggregated response reward.
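The pipeline steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `spiral_inference`, `majority_aggregate`, `extract_answer`, `reward`, and the `answer:` trace format are all hypothetical names and conventions introduced here, and a simple majority vote stands in for the learned aggregation policy.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical stand-ins: in SPIRAL both callables would be LM samplers.
SearchFn = Callable[[str], str]                # x -> one search trace
AggregateFn = Callable[[str, List[str]], str]  # (x, traces) -> aggregation trace


def extract_answer(trace: str) -> str:
    """Assumed convention: the answer follows the last 'answer:' marker."""
    return trace.rsplit("answer:", 1)[-1].strip()


def reward(trace: str, target: str) -> float:
    """Binary reward on the extracted answer."""
    return 1.0 if extract_answer(trace) == target else 0.0


def majority_aggregate(x: str, traces: List[str]) -> str:
    """Toy aggregator: majority vote over extracted answers (ties: first seen)."""
    top = Counter(extract_answer(t) for t in traces).most_common(1)[0][0]
    return f"compared {len(traces)} traces; answer: {top}"


def spiral_inference(
    x: str,
    search: SearchFn,
    aggregate: AggregateFn,
    n: int = 4,
    group_size: Optional[int] = None,
    levels: int = 1,
) -> str:
    """Sample n search traces, optionally aggregate in groups, then aggregate all."""
    # Parallel compute: each trace is conditioned on the input only.
    traces = [search(x) for _ in range(n)]
    # Optional groupwise / recursive aggregation (one simple scheme among many).
    if group_size is not None:
        for _ in range(levels):
            groups = [traces[i:i + group_size] for i in range(0, len(traces), group_size)]
            traces = [aggregate(x, g) for g in groups]
    # Aggregative compute: final trace conditioned on x and the remaining traces.
    return aggregate(x, traces)
```

Passing a real sampler for `search` and `aggregate` leaves the control flow unchanged; only the two callables differ between the toy and a model-backed version.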
2. Training Algorithms and Objectives
SPIRAL employs a two-component policy gradient strategy:
- Set-Reinforcement Learning (set-RL): The search-trace policy $\pi_\theta(y \mid x)$ is trained using a set-level reward determined by the aggregated output. The objective is to maximize
$$J(\theta) = \mathbb{E}_{x}\, \mathbb{E}_{y_{1:n} \sim \pi_\theta(\cdot \mid x)}\, \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x,\, y_{1:n})} \big[ r(x, z) \big].$$
The gradient with respect to $\theta$ involves treating the $n$ traces as a set, propagating the aggregator's expected reward back to each search trace.
- Standard Reinforcement Learning (RL): The aggregation policy $\pi_\theta(z \mid x, y_{1:n})$ is optimized via standard REINFORCE, with the reward for aggregation traces calculated directly per set.
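To make the two components concrete, the score-function (REINFORCE) identity can be applied to the aggregated-reward objective. The decomposition below is a sketch under stated assumptions: a single LM with parameters $\theta$ serves as both search and aggregation policy, and the symbols $x$ (problem), $y_{1:n}$ (search traces), $z$ (aggregation trace), and $r$ (reward) are notation chosen here for illustration. Baselines are omitted.

$$\nabla_\theta\, \mathbb{E}\big[ r(x, z) \big] = \mathbb{E}\Big[ r(x, z) \Big( \underbrace{\sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(y_i \mid x)}_{\text{search traces (set-RL)}} + \underbrace{\nabla_\theta \log \pi_\theta(z \mid x, y_{1:n})}_{\text{aggregation trace (standard RL)}} \Big) \Big]$$

The first term assigns the same aggregated reward to every trace in the set, which is why a set-level credit-assignment scheme is needed; the second term is an ordinary per-trace policy gradient.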
To estimate these gradients, SPIRAL adopts the set-RL estimator of Orney et al. (2026), using the following steps:
- Sample $N$ candidate traces $y_1, \dots, y_N$.
- Pick $M$ sets $S_1, \dots, S_M$ of size $n$.
- For each $S_j$, sample $K$ aggregation traces, compute mean rewards $\bar{r}_j$, and calculate baselines.
- Compute set and marginal set advantages, distributing credit to individual search traces.
- For aggregation traces, compute per-set baselines and apply REINFORCE.
The resulting estimator is unbiased up to a constant scaling factor (absorbed in the optimizer's learning rate).
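The bookkeeping in the steps above can be illustrated with a small sketch. The text does not reproduce the exact estimator from the cited work, so the definitions below are assumptions made for illustration: the set advantage is the set's mean reward minus the mean over all sets, the per-trace credit is the mean reward of sets containing the trace minus the mean reward of sets that exclude it, and the aggregator advantage uses the per-set mean as its baseline. The function name `set_rl_advantages` is hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def set_rl_advantages(
    num_traces: int,
    sets: Sequence[Tuple[int, ...]],
    rewards: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float], List[List[float]]]:
    """Illustrative (assumed) advantage computation for set-RL training.

    sets[j]    -- indices of the candidate traces in set j
    rewards[j] -- rewards of the aggregation traces sampled for set j
    """
    # Mean aggregated reward per set.
    set_means = [mean(r) for r in rewards]

    # Set advantage: centre each set's mean reward on the mean over all sets.
    global_baseline = mean(set_means)
    set_adv = [m - global_baseline for m in set_means]

    # Per-trace credit (assumed form): sets that contain the trace vs. sets that do not.
    trace_credit = []
    for i in range(num_traces):
        inside = [m for s, m in zip(sets, set_means) if i in s]
        outside = [m for s, m in zip(sets, set_means) if i not in s]
        if inside and outside:
            trace_credit.append(mean(inside) - mean(outside))
        else:
            trace_credit.append(0.0)  # no contrast available for this trace

    # Aggregator advantage: per-set baseline, as in standard REINFORCE with a baseline.
    agg_adv = [[r - m for r in rs] for rs, m in zip(rewards, set_means)]
    return set_adv, trace_credit, agg_adv
```

In a training loop, `trace_credit[i]` would weight the log-probability gradient of search trace `i`, and `agg_adv[j][k]` would weight that of the `k`-th aggregation trace for set `j`.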
3. Model Architecture and Parameterization
SPIRAL can employ either a single LM for both search and aggregation or separate models for each. The joint distribution is factorized as
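$$\pi_\theta(y_{1:n}, z \mid x) = \Big[ \prod_{i=1}^{n} \pi_\theta(y_i \mid x) \Big]\, \pi_\theta(z \mid x, y_{1:n}),$$
or, if using an independent aggregator, the second factor is replaced by $\pi_\phi(z \mid x, y_{1:n})$.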
For aggregation, the model conditions on the concatenation of the input problem and the set of search traces fed as a single textual prompt. The existing transformer-decoder architecture's attention mechanism natively supports this conditioning without modification beyond context window extension. Generation for both search and aggregation traces is performed via standard left-to-right token sampling.
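Conditioning the aggregator on the problem and the search traces amounts to building one prompt string. The sketch below shows one way this could look; the template wording, the section labels, and the function name `build_aggregation_prompt` are assumptions for illustration, since the source does not specify the prompt format.

```python
from typing import List


def build_aggregation_prompt(problem: str, traces: List[str]) -> str:
    """Concatenate the problem and all search traces into a single text prompt."""
    parts = [f"Problem:\n{problem}\n"]
    # Each trace is labelled so the aggregator can refer to candidates by number.
    for i, trace in enumerate(traces, start=1):
        parts.append(f"Candidate solution {i}:\n{trace}\n")
    # Hypothetical instruction asking for comparison, verification, and a final answer.
    parts.append("Compare the candidates, verify their steps, and state a final answer.")
    return "\n".join(parts)
```

Because the prompt grows linearly with the number and length of traces, the set size is bounded in practice by the model's context window.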
4. Experimental Protocols and Empirical Results
Experiments focused on mathematical reasoning tasks, specifically a subset of POLARIS-53k, using Qwen3-4B-Instruct-2507 as the base LM. The main baseline is GRPO (sequence-only RL) [Shao et al. 2024].
- Training Details: SPIRAL uses batch size 256, with $N$ search traces (up to 4096 tokens each), $M$ sets of size $n$, and $K$ aggregator traces per set (each up to 4096 tokens). This yields approximately 98,304 tokens per problem per update, i.e., 24 traces at the 4096-token cap ($N + MK = 24$), matching the baseline's resource allocation.
- Quantitative Results:
- Pass@$k$ Scaling: SPIRAL demonstrates up to 11× higher scaling efficiency on pass@$k$ (parallel compute with oracle verifier), using only 1/11 the samples to match baseline performance.
- Recursive Self-Aggregation (RSA): With recursive aggregative steps, SPIRAL achieves up to 13.5% absolute improvement in pass@1 after 2–3 levels of aggregation.
- Sequential Compute Only: When only the length of sequential trajectories is increased, SPIRAL and GRPO show similar performance; distinguishing gains appear when parallel and aggregation primitives are activated.
- Token Efficiency: SPIRAL with RSA outperforms both pure sequential scaling and majority voting in pass@1 per token, and sequential methods hit a 32K context limit not present for aggregation.
- Ablations:
- Search-trace entropy remains higher in SPIRAL than in GRPO, reflecting set-RL’s effectiveness in promoting diversity.
- Per-set baselines for aggregator traces are more variance-reducing than global baselines.
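For readers who want to reproduce pass@$k$ curves like those discussed in the quantitative results, a commonly used unbiased estimator computes pass@$k$ from $n$ samples of which $c$ are correct as $1 - \binom{n-c}{k} / \binom{n}{k}$. The source does not state which estimator was used, so the sketch below is an assumption about evaluation practice rather than a description of the authors' code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct ones.

    Returns the probability that at least one of k samples drawn
    without replacement from the n samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k draw contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over problems for a range of `k` gives a scaling curve; a claim such as "1/11 the samples" corresponds to comparing the `k` at which two such curves reach the same value.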
5. Strengths, Limitations, and Open Problems
- Key Strengths: SPIRAL's integration of sequential, parallel, and aggregative compute closes the gap between LM training and test-time inference pipelines. Set-RL enables the production of search traces that are not just individually plausible but collectively beneficial for downstream aggregation. Empirically, scaling both parallel and aggregative primitives delivers substantially better scaling efficiency and performance than sequence-only RL.
- Limitations:
- Current training is limited to 4B–8B parameter LMs; larger scale behavior remains untested.
- No exhaustive ablation exists attributing gains to each primitive in isolation.
- The set-RL estimator assumes set symmetry, so ordering biases in transformer models could subtly affect credit assignment.
- Future Directions: Prominent next steps include expanding to larger models (beyond 8B parameters), analyzing token allocation dynamics between search and aggregation, integrating supervised decomposition objectives, extending to non-mathematical reasoning, and exploring dynamic set sizes and deeper recursive aggregation schemes.
6. Relation to Prior Work and Broader Implications
SPIRAL generalizes and subsumes standard sequential RL (e.g., GRPO) by enabling simultaneous optimization of parallel exploration and structured aggregation within a unified RL framework. By optimizing towards aggregated end rewards rather than solely per-trace rewards, it produces search distributions better aligned with the needs of realistic inference-time verifiers and aggregators. A plausible implication is that such architectures may be essential for closing the gap between current LM training regimes and the emerging standard practice of inference-time aggregation or majority voting in practical deployments (Hamid et al., 22 Jun 2026).