
AT-GRPO: Multi-Agent RL & Cosmic-Ray Detection

Updated 25 December 2025
  • AT-GRPO is a framework that extends GRPO to multi-agent systems by grouping experiences agent- and turn-wise, enabling more effective policy updates.
  • The method uses tree-structured sampling and mixed reward aggregation to achieve significant gains in tasks such as planning, coding, and mathematical reasoning.
  • The acronym also designates the GRAND@Auger prototype, which advances cosmic-ray detection through precise calibration, robust noise rejection, and improved timing accuracy.

AT-GRPO (Agent- and Turn-wise Grouped Relative Policy Optimization) refers to a specific algorithmic and systems framework for on-policy reinforcement learning (RL) in multi-agent systems (MAS), particularly for collaborative LLMs (Zhao et al., 13 Oct 2025). Its design addresses challenges unique to MAS environments, where heterogeneous role and turn structures invalidate standard RL grouping assumptions. AT-GRPO generalizes group-relative policy optimization (GRPO) to scenarios requiring agent- and turn-wise grouping, enabling substantial gains across symbolic, coding, planning, and mathematical domains. The acronym AT-GRPO is also used for the GRAND@Auger prototype, a repurposed array for ultra-high-energy cosmic-ray and neutrino detection that advances detection capabilities and calibration strategies for the GRAND experiment (Errico et al., 10 Jul 2025). The following sections discuss both the algorithmic and experimental facets of AT-GRPO within these respective contexts.

1. Mathematical Foundations and GRPO Extension in MAS

AT-GRPO builds on the Markov game framework, extending the RL objective to support $N$-agent, multi-policy environments. For agents indexed by $i$, each with local and team reward streams and unique histories, the game is specified by:

  • State space $S$, actions $A_1,\dots,A_N$, and transitions $s_{t+1} = T(s_t, a_{1,t},\ldots,a_{N,t})$.
  • Observations $o_{i,t,e} = o_i(s_{t,e}, h_{t,e})$ reflecting environment, role, and historical context.

Standard GRPO defines the advantage by normalizing rewards across policy continuations of a single prompt. In MAS, grouping is instead applied agent- and turn-wise: every group $g$ corresponds to a tuple $(e, i, t)$ (environment instance, agent index, turn), and all $K$ sampled candidates for $g$ share the prompt template $P_i(o_{i,t,e})$ determined by the agent's role.

The mixed reward mechanism aggregates team and local rewards:

$$r_{i,t,e} = \alpha\, r_{t,e}^{\mathrm{team}} + (1-\alpha)\, r_{i,t,e}^{\mathrm{loc}},$$

with $\alpha \in [0,1]$. Grouped advantages are computed per group, and the per-policy gradient update is performed over minibatches $\mathcal{B}_m$:

$$\nabla_{\theta^{(m)}} L(\theta^{(m)}) = -\,\mathbb{E}_{g \in \mathcal{B}_m} \left[ \frac{1}{K} \sum_{c=1}^{K} A_g^{(c)}\, \nabla \log \pi_{\theta^{(m)}}\!\left(a_g^{(c)} \mid \mathrm{prompt}_g\right) \right].$$

This structure enforces valid within-group comparisons and supports role specialization and credit assignment in environments with high prompt variation (Zhao et al., 13 Oct 2025).
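
The grouping and reward mixing can be made concrete with a short sketch. The following Python/NumPy code (the data layout and function names are illustrative, not from the paper) groups sampled candidates by $(e, i, t)$, mixes team and local rewards, and normalizes within each group to obtain GRPO-style advantages:

import numpy as np
from collections import defaultdict

def mixed_reward(team_r, local_r, alpha=1.0):
    """r_{i,t,e} = alpha * r_team + (1 - alpha) * r_local."""
    return alpha * team_r + (1.0 - alpha) * local_r

def grouped_advantages(samples, alpha=1.0, eps=1e-8):
    """samples: dicts with keys 'env', 'agent', 'turn', 'team_r', 'local_r', 'logprob'.
    Groups candidates by (e, i, t) and normalizes their mixed rewards to
    zero mean / unit variance within each group (GRPO-style advantage)."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["env"], s["agent"], s["turn"])].append(s)
    for members in groups.values():
        r = np.array([mixed_reward(m["team_r"], m["local_r"], alpha) for m in members])
        adv = (r - r.mean()) / (r.std() + eps)
        for m, a in zip(members, adv):
            m["advantage"] = float(a)
    return samples

def grouped_pg_loss(samples):
    """Surrogate loss L = -mean(A * log pi); minimizing it gives the grouped
    policy-gradient update when 'logprob' carries gradients."""
    return -float(np.mean([s["advantage"] * s["logprob"] for s in samples]))

# Toy usage: two candidates (K=2) for the same (env=0, agent=1, turn=0) group
batch = [
    {"env": 0, "agent": 1, "turn": 0, "team_r": 1.0, "local_r": 0.5, "logprob": -2.3},
    {"env": 0, "agent": 1, "turn": 0, "team_r": 0.0, "local_r": 0.2, "logprob": -1.9},
]
print(grouped_pg_loss(grouped_advantages(batch, alpha=0.5)))

In an actual trainer the log-probabilities would be autograd tensors (e.g., PyTorch), so that differentiating the surrogate loss reproduces the gradient displayed above.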

2. Algorithmic Workflow and Pseudocode

The AT-GRPO workflow is characterized by tree-structured sampling and agent-turn grouping. At each step:

  1. Parallel environments are reset and stepped for $T$ turns.
  2. For each agent and turn, $K$ candidates are sampled from the corresponding policy, and rewards/advantages are calculated.
  3. Experiences are grouped by $(e, i, t)$, stored, and the top-rewarded candidate is greedily executed to propagate the trajectory.
  4. At rollout completion, minibatches per policy model are formed from grouped data and subjected to the grouped-advantage policy gradient update.

The key pseudocode (as provided in (Zhao et al., 13 Oct 2025)) is:

Algorithm AT-GRPO for MAS
Inputs: Markov game M, policies Θ = {θ^(m)}, role map σ, branches K, batch size E, turns T, steps S, sample temperature T_samp, reward mix α.
for step = 1 … S:
    # On-policy rollouts
    initialize D_i = ∅ for i = 1 … N
    parallel over E environments e:
        s_{0,e} ← reset()
        for t = 0 … T−1:
            for i = 1 … N:
                o = o_{i,t,e}, group g = hash(e, i, t)
                sample {a^{(c)} ~ π_{θ^(σ(i))}(·|o; T_samp), c = 1 … K}
                compute r^{(c)}_{i,t,e}, advantage A^{(c)}
                store (g, o, {a^{(c)}}, {A^{(c)}}) in D_i
                execute a_{i,t,e} = a^{(c*)} with c* = argmax_c r^{(c)}
            s_{t+1,e} ← T(s_{t,e}, ...)
            if terminal: break
    # Policy updates
    parallel over m = 1 … M:
        build ℬ_m = ∪_{i: σ(i)=m} D_i
        compute L(θ^(m)), update θ^(m) ← θ^(m) − η ∇L(θ^(m))

Tree-structured sampling and agent-turn grouping preserve valid comparison groups, critical for advantage estimation and variance control.
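
Read as code, the rollout phase of the pseudocode could look like the following runnable Python sketch; the toy environment, its reward methods, and the policy-as-callable interface are placeholder assumptions, not the paper's actual system:

import random
from collections import defaultdict

class ToyEnv:
    """Placeholder two-agent environment with random rewards, purely illustrative."""
    def reset(self): self.t = 0
    def observe(self, agent): return f"obs(agent={agent}, turn={self.t})"
    def team_reward(self, action): return random.random()
    def local_reward(self, agent, action): return random.random()
    def execute(self, agent, action): pass
    def step(self):
        self.t += 1
        return self.t >= 4   # terminal after 4 turns

def rollout(envs, policies, role_of, K=4, T=4, alpha=1.0):
    """Tree-structured rollout with agent- and turn-wise grouping: for each
    (env e, agent i, turn t), sample K candidates, record them under the group
    key (e, i, t), and greedily execute the highest-reward candidate."""
    D = defaultdict(list)                                 # D[i]: experiences of agent i
    for e, env in enumerate(envs):                        # parallel in the real system
        env.reset()
        for t in range(T):
            for i in sorted(role_of):
                obs = env.observe(agent=i)                # o_{i,t,e}
                cands = [policies[role_of[i]](obs) for _ in range(K)]
                rewards = [alpha * env.team_reward(a)
                           + (1 - alpha) * env.local_reward(i, a) for a in cands]
                D[i].append({"group": (e, i, t), "obs": obs,
                             "candidates": cands, "rewards": rewards})
                best = max(range(K), key=lambda c: rewards[c])
                env.execute(agent=i, action=cands[best])  # greedy trajectory propagation
            if env.step():                                # terminal?
                break
    return D

if __name__ == "__main__":
    policies = {0: lambda obs: f"action<{random.randint(0, 9)}>"}  # one shared policy (M=1)
    role_of = {0: 0, 1: 0}                                         # both agents use policy 0
    buffers = rollout([ToyEnv() for _ in range(2)], policies, role_of, K=4, T=4)
    print({agent: len(exps) for agent, exps in buffers.items()})   # {0: 8, 1: 8}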

3. Systems Architecture for MAS Training

The AT-GRPO system is architected for scalable multi-policy training:

  • ModelPool: Each policy model $m$ is assigned a GPU-backed pool with RolloutWorkers (inference, $K$-branch sampling) and UpdateWorkers (batch gradient steps on $\theta^{(m)}$).
  • EnvWorkers and Router: Thousands of CPU EnvWorker instances execute sandboxed environments in parallel. The Router deterministically tags each experience by (envID, agent index) and forwards grouped batches to target UpdateWorkers according to role mapping.

This design supports concurrent on-policy rollouts, rigorous separation of inference and optimization operations, and clean integration of multi-policy regimes. Deterministic routing ensures reproducibility and efficiency in high-throughput settings (Zhao et al., 13 Oct 2025).
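
A minimal sketch of the Router's deterministic tagging and per-policy forwarding, under assumed class and queue names (none of which come from the paper):

from queue import Queue

class Router:
    """Tags each experience with (env_id, agent_idx, turn) and forwards it to the
    update queue of the policy model that owns that agent's role."""
    def __init__(self, role_map, num_models):
        self.role_map = role_map                            # agent index -> model index
        self.queues = [Queue() for _ in range(num_models)]  # one queue per UpdateWorker

    def route(self, env_id, agent_idx, turn, experience):
        model = self.role_map[agent_idx]                    # deterministic: same agent -> same model
        tagged = {"group": (env_id, agent_idx, turn), **experience}
        self.queues[model].put(tagged)
        return model

# Example: two agents mapped to two per-role policies
router = Router(role_map={0: 0, 1: 1}, num_models=2)
router.route(env_id=7, agent_idx=1, turn=3, experience={"obs": "...", "candidates": []})
print(router.queues[1].qsize())   # -> 1

Because the tag depends only on (envID, agent index, turn) and a fixed role map, identical experiences are always routed identically across runs, which is the reproducibility property noted above.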

4. Experimental Evaluation and Quantitative Results

AT-GRPO was evaluated across several domains—games, planning, coding, and mathematics—with the following protocol:

  • Models: Qwen3-1.7B, Qwen3-8B (“no-thinking” mode)
  • Baselines: single-agent prompt-only/GRPO, MAS prompt-only (frozen LLM), MAS RL with a shared policy ($M=1$), MAS RL with per-role policies ($M=N$)
  • Hyperparameters: $K=4$ branches, $T=4$ turns, $\alpha=1.0$ (team reward only), $S=150$ steps, PPO-style learning rate $1\times10^{-6}$, batch size 128, $\lambda=1$, $\gamma=1$, no entropy bonus.
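
For reference, the reported hyperparameters could be bundled into a single configuration object; the dataclass and field names below are illustrative, only the values come from the evaluation protocol above:

from dataclasses import dataclass

@dataclass
class ATGRPOConfig:
    # Values from the evaluation protocol (Zhao et al., 13 Oct 2025); field names are illustrative.
    branches_K: int = 4          # candidate completions per (env, agent, turn) group
    turns_T: int = 4             # environment turns per rollout
    alpha: float = 1.0           # reward mix: 1.0 = team reward only
    steps_S: int = 150           # training steps
    learning_rate: float = 1e-6  # PPO-style update
    batch_size: int = 128
    gae_lambda: float = 1.0
    gamma: float = 1.0
    entropy_bonus: float = 0.0   # no entropy bonus

config = ATGRPOConfig()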

Tables 1 & 2 report:

| Task     | Baseline (single-agent RL) | AT-GRPO (MAS, per-role) |
|----------|----------------------------|-------------------------|
| Planning | 14–47%                     | 96–99.5%                |
| Coding   | 3.87–7.62% improvement     | up to +12.4 pp          |
| Math     | 9.0–17.93% improvement     | up to +38.7 pp          |

Ablation analyses show joint MAS RL and role specialization are necessary for high gains; isolated agent training yields only marginal improvement (≈10–15%), and cross-role policy swapping causes near-complete performance collapse (Zhao et al., 13 Oct 2025).

5. AT-GRPO in Cosmic Ray and Neutrino Detection (GRAND@Auger Prototype)

The acronym AT-GRPO also designates the GRAND@Auger prototype array, deployed by the GRAND and Pierre Auger Collaborations for ultra-high energy particle detection (Errico et al., 10 Jul 2025). Key technical elements include:

  • Ten AERA stations in concentric hexagons ($\sim 0.5\,\mathrm{km}^2$), with three-polarization Horizon Antennas ($\mathrm{NS}$, $\mathrm{EW}$, $\mathrm{V}$).
  • Solar-powered front-end electronics, 500 MSPS ADCs, Xilinx SoC for flexible triggering, 30–200 MHz bandpass, FPGA-based notch filters.
  • GPS timestamping to $\pm 10$ ns accuracy, enabling $\lesssim 20$ ns inter-station timing for planar wave fits.
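
The planar-wavefront fit that such inter-station timing enables can be illustrated with a short sketch; the station geometry, timing jitter, and fitting code below are synthetic assumptions for illustration, not GRAND@Auger analysis code:

import numpy as np

C = 0.299792458  # speed of light in m/ns

def plane_wave_fit(xy, times):
    """Least-squares planar wavefront fit for a flat (ground-level) array.
    xy    : (N, 2) station positions in metres
    times : (N,)   trigger times in nanoseconds
    Model: t_i = t0 + s_x * x_i + s_y * y_i, where the horizontal slowness
    satisfies |s_h| = sin(zenith) / c for a plane wave.
    Returns (zenith, azimuth) in degrees."""
    A = np.hstack([xy, np.ones((len(times), 1))])
    (sx, sy, t0), *_ = np.linalg.lstsq(A, np.asarray(times, float), rcond=None)
    s_h = np.hypot(sx, sy)
    zenith = np.degrees(np.arcsin(np.clip(C * s_h, 0.0, 1.0)))
    azimuth = np.degrees(np.arctan2(sy, sx))   # horizontal propagation direction
    return zenith, azimuth

# Synthetic check: 10 stations, inclined shower at 75 deg zenith, +/-10 ns timing jitter
rng = np.random.default_rng(1)
xy = rng.uniform(-400.0, 400.0, size=(10, 2))
theta, phi = np.radians(75.0), np.radians(30.0)
s_true = np.sin(theta) / C * np.array([np.cos(phi), np.sin(phi)])
t = xy @ s_true + rng.normal(0.0, 10.0, size=10)
print(plane_wave_fit(xy, t))   # approximately (75, 30)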

Commissioning, calibration, and measurements demonstrate:

  • Broadband Galactic background characterization at high frequencies (first such detection for GRAND).
  • System noise floor sub-dominant to the diffuse sky signal above 100 MHz.
  • First self-triggered candidate cosmic-ray event, coincident with an Auger Surface Detector event of $E = 1.39 \times 10^{19}\,\mathrm{eV}$, with directional agreement within $3^\circ$ and a GPS timing offset within uncertainties.
  • Angular resolution of $\lesssim 3^\circ$ for very inclined showers, with an energy threshold near $10^{19}\,\mathrm{eV}$.

The hardware, firmware, and data-processing methodologies validated by GRAND@Auger have direct impact on future GRAND array designs, particularly in event triggering, calibration, and centralized analysis pipelines (Errico et al., 10 Jul 2025).

6. Significance, Implications, and Future Directions

AT-GRPO in collaborative LLMs establishes a principled grouped-advantage estimator for MAS RL, combining agent- and turn-wise grouping, tree-structured sampling, and mixed reward aggregation. This configuration achieves deep role specialization and tight team alignment that standard single-agent RL, or MAS RL without per-role optimization, cannot reach. Empirically, AT-GRPO delivers large accuracy improvements on symbolic, coding, planning, and math tasks, with a scalable systems architecture that supports large multi-policy deployments. In the physical sciences, the GRAND@Auger (AT-GRPO) prototype demonstrates the viability of self-triggered, broadband radio detection arrays for UHE particle showers, informing array geometry, DAQ requirements, and the next-generation design for GRAND. The combination of detailed calibration procedures, robust noise rejection, and precise timing lays the groundwork for future expansion and for precision searches for ultra-high-energy neutrinos and cosmic rays.

A plausible implication is that AT-GRPO, as both a methodological and experimental framework, will serve as a generalized template for future collaborative AI training and large-scale detection systems in high-energy astrophysics. Future research is likely to extend the group-based advantage estimator to settings with even greater heterogeneity and nonstationarity, and to further refine array designs based on lessons from GRAND@Auger and related prototypes.
