Pre-GRPO: Baseline Initialization for GRPO
- Pre-GRPO is a suite of baseline optimization methods, including supervised fine-tuning, maximum-likelihood training, and direct preference optimization, that prepares models for effective groupwise policy refinement.
- It lifts initial policy accuracy into a regime of high reward variance, avoiding gradient collapse during the subsequent GRPO phase.
- Empirical studies show that omitting pre-GRPO leads to significant performance degradation, underscoring its critical role in multi-stage RL optimization.
Pre-GRPO refers to the collection of initialization and baseline optimization methods employed before the application of Group Relative Policy Optimization (GRPO) in preference-based or reinforcement learning contexts. Across the literature, pre-GRPO regimes are essential for ensuring stable and effective GRPO, serving as the foundation upon which relative or groupwise policy refinement can proceed. The precise nature of “pre-GRPO” depends on the domain and research problem, but generally encompasses either supervised fine-tuning, maximum-likelihood (cross-entropy) training, or reward-free preference optimization, all prior to the introduction of the groupwise comparative structure central to GRPO.
1. Conceptual Overview and Purpose
Pre-GRPO denotes the initial phase in multi-stage optimization pipelines where the model is equipped with sufficient baseline capability or policy competence before groupwise relative optimization begins. Pre-GRPO serves several critical purposes:
- It lifts initial policy accuracy to a regime where the model can produce outputs sufficiently well-formed to support relative comparisons, i.e., so groupwise rewards are nontrivial and gradients do not vanish.
- It provides a reference policy (typically the model at the end of pre-GRPO) against which the GRPO loss is defined in KL-regularized settings.
- In reward-free RLHF, pre-GRPO typically refers to Direct Preference Optimization (DPO), providing a globally-aligned baseline before robust group-specific optimization.
Key metrics and empirical ablations repeatedly show that omitting pre-GRPO leads to severe performance degradation or optimization collapse in GRPO stages (Kang et al., 21 Sep 2025).
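Concretely, these purposes translate into a two-stage pipeline: a non-groupwise warm-up, followed by groupwise relative optimization anchored to the warm-up checkpoint. The skeleton below is a minimal sketch of that staging; `warmup_fn` and `grpo_fn` are hypothetical placeholders for the domain-specific procedures surveyed in the following sections, not APIs from any cited work.

```python
import copy

def pre_grpo_then_grpo(policy, warmup_data, rl_prompts, warmup_fn, grpo_fn):
    """Two-stage pipeline sketch: pre-GRPO warm-up, then groupwise RL.

    `warmup_fn` stands in for SFT / maximum-likelihood / DPO training;
    `grpo_fn` stands in for the downstream GRPO loop. Both are assumed to
    train `policy` in place and are placeholders, not real library calls.
    """
    # Stage 1 (pre-GRPO): lift baseline competence with a non-groupwise objective.
    warmup_fn(policy, warmup_data)

    # The end-of-warm-up snapshot becomes the frozen reference for KL regularization.
    reference = copy.deepcopy(policy)

    # Stage 2 (GRPO): groupwise relative optimization against the frozen reference.
    grpo_fn(policy, reference, rl_prompts)
    return policy, reference
```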
2. Typical Pre-GRPO Procedures by Domain
Pre-GRPO instantiations reflect the requirements of the downstream GRPO stage:
| Domain | Pre-GRPO Procedure | Purpose |
|---|---|---|
| Multimodal table reasoning (Kang et al., 21 Sep 2025) | Supervised fine-tuning (SFT) on perception/reasoning | Increase initial policy accuracy to raise reward variance in GRPO |
| Spatial planning (AlphaMaze) (Dao et al., 20 Feb 2025) | SFT via cross-entropy on tokenized sequences | Teach step-by-step movement and maze representation for robust GRPO rollouts |
| Reward-free RLHF (Ramesh et al., 30 May 2024) | Direct Preference Optimization (DPO) | Learn average/global preferences without group specificity; robustifies initial alignment |
| Text-to-image RL (Pref-GRPO) (Wang et al., 28 Aug 2025) | Maximum-likelihood or SFT followed by standard RL | Provide a valid policy prior and avoid collapse/degeneracy in comparative optimization |
In all cases, the pre-GRPO objective is maximum likelihood or preference-based, rather than groupwise-relative.
3. Mathematical Formulations and Optimization
Specific pre-GRPO optimization objectives take the following representative forms:
- Supervised Fine-Tuning (Table Reasoning, AlphaMaze):
Given an input $x$ (e.g., an image–question pair or a tokenized maze prompt) and a target token sequence $y = (y_1, \dots, y_T)$, optimize the autoregressive negative log-likelihood
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y)}\!\left[\sum_{t=1}^{T} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)\right].$$
This cross-entropy phase is run for a predefined number of steps with AdamW and moderate learning rates, typically with dataset filtering for sequence/image size and batch-size adjustments to fit accelerator memory (Kang et al., 21 Sep 2025, Dao et al., 20 Feb 2025). A minimal code sketch of this objective appears after this list.
- Direct Preference Optimization (DPO, reward-free RLHF):
Given preference triplets $(x, y_w, y_l)$ (prompt, preferred response, dispreferred response) and a reference policy $\pi_{\mathrm{ref}}$, minimize
$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$
with $\sigma$ the logistic (sigmoid) function and $\beta > 0$ controlling the strength of the implicit KL regularization toward $\pi_{\mathrm{ref}}$.
For log-linear policy classes this minimization is convex, ensuring a well-initialized policy with a global optimum before transitioning to group-robust GRPO (Ramesh et al., 30 May 2024). Minimal code sketches of both objectives follow this list.
4. Empirical Role and Necessity of Pre-GRPO
Pre-GRPO is empirically necessary to overcome two critical issues:
- Reward Variance Collapse: In preference-based RL (e.g., table understanding), if the initial policy accuracy $p$ is too low ($p \approx 0$) or too high ($p \approx 1$), the variance in groupwise rewards collapses and advantages vanish, leading to nearly zero informative policy gradients. The warm-up phase positions $p$ near $0.5$, maximizing variance and enabling effective GRPO (Kang et al., 21 Sep 2025); see the sketch at the end of this section.
- Alignment and Skill Retention: For complex reasoning, starting from SFT or DPO guarantees that the policy exhibits baseline skill in the relevant domains (e.g., stepwise spatial planning (Dao et al., 20 Feb 2025), or chain-of-thought math reasoning (Rajani et al., 13 Jul 2025)), on which GRPO can amplify performance.
Ablation studies consistently show substantial performance drops (up to 40%) when pre-GRPO is omitted (Kang et al., 21 Sep 2025). In AlphaMaze, the SFT-only policy already achieves 86% maze-solving accuracy, forming a platform for GRPO to further improve accuracy to 93% (Dao et al., 20 Feb 2025).
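The variance argument can be made concrete for binary (correct/incorrect) rewards: if the warm-started policy answers a prompt correctly with probability $p$, the reward variance within a group of rollouts is $p(1-p)$, which vanishes as $p \to 0$ or $p \to 1$ and peaks at $p = 0.5$. The toy sketch below shows how group-normalized advantages degenerate outside the mid-accuracy regime; it is illustrative only, not code from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
group_size = 8  # rollouts sampled per prompt, as in GRPO-style grouping

for p in (0.02, 0.5, 0.98):  # policy accuracy before / after a good warm-up
    # Binary rewards for one group of rollouts on the same prompt.
    rewards = rng.binomial(1, p, size=group_size).astype(float)
    std = rewards.std()
    # Group-relative advantages: (r - mean) / std. When every rollout gets the
    # same reward (std == 0), advantages are zero and the policy gradient vanishes.
    advantages = (rewards - rewards.mean()) / std if std > 0 else np.zeros(group_size)
    print(f"p={p:.2f}  theoretical var={p * (1 - p):.3f}  sample std={std:.3f}  "
          f"nonzero advantages={np.count_nonzero(advantages)}")
```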
5. Theoretical Properties and Guarantees
Pre-GRPO procedures—particularly DPO—possess beneficial theoretical properties under certain policy classes:
- Convex Landscapes in DPO: For log-linear policies, the DPO loss is convex and has Lipschitz continuous gradients, guaranteeing global convergence and avoidance of local minima (Ramesh et al., 30 May 2024).
- Initialization for KL Regularization: In multi-stage optimization, the pre-GRPO parameters typically initialize both the trainable policy and the reference policy against which KL-divergence is measured in downstream GRPO stages (Kang et al., 21 Sep 2025). This prevents large, destabilizing policy shifts.
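A minimal sketch of this initialization pattern, assuming a PyTorch-style model object and sampled-token log-probabilities; the k3-style KL estimator shown is one common choice for the per-token penalty and is an assumption here, not a detail drawn from the cited papers.

```python
import copy
import torch

def init_policies_from_pre_grpo(pre_grpo_model: torch.nn.Module):
    """Seed both the trainable policy and the frozen KL reference from the pre-GRPO checkpoint."""
    policy = pre_grpo_model                    # continues training under GRPO
    reference = copy.deepcopy(pre_grpo_model)  # frozen anchor for the KL penalty
    for param in reference.parameters():
        param.requires_grad_(False)
    reference.eval()
    return policy, reference

def kl_penalty(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token estimate of KL(pi_theta || pi_ref) from sampled-token log-probs.

    Uses the non-negative 'k3' estimator exp(d) - d - 1 with
    d = log pi_ref - log pi_theta, evaluated at the sampled tokens.
    """
    diff = ref_logprobs - policy_logprobs
    return torch.exp(diff) - diff - 1.0
```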
6. Practical Configurations and Design Choices
Pre-GRPO protocols are characterized by careful selection of data, optimization schemes, and scaling choices:
- Data Selection and Processing: Sequence-length caps (e.g., 2048 tokens for structured outputs), image-resolution filters, and joint SFT over multiple task domains (e.g., perception and reasoning) are employed to ensure convergence and fit within accelerator memory (Kang et al., 21 Sep 2025).
- Optimizer and Scheduling: AdamW with standard hyperparameters (e.g., default $\beta_1 = 0.9$, $\beta_2 = 0.999$), prescribed weight decay, and learning-rate warm-up are commonly used. Epoch count and batch size are chosen with respect to domain compute constraints (Dao et al., 20 Feb 2025).
- Reference Policy Inclusion: The model state at the end of pre-GRPO anchors the reference policy for later comparative policy optimization (e.g., DPO followed by GRPO, or an SFT warm-up followed by PA-GRPO/HC-GRPO).
- No Groupwise Structure: Pre-GRPO is always non-groupwise; all preference or supervised labels are averaged, with no explicit robustness or subpopulation weighting (Ramesh et al., 30 May 2024).
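As a concrete, purely illustrative summary of these choices, the configuration sketch below collects the knobs discussed in this section; the 2048-token cap comes from the text above, while every other value is an assumed placeholder rather than a hyperparameter reported in the cited papers.

```python
from dataclasses import dataclass

@dataclass
class PreGRPOConfig:
    """Illustrative pre-GRPO (warm-up) configuration; values are placeholders."""
    max_seq_len: int = 2048              # sequence-length cap for structured outputs (from text)
    max_image_pixels: int = 1024 * 1024  # image-resolution filter (assumed value)
    learning_rate: float = 1e-5          # moderate learning rate (assumed)
    weight_decay: float = 0.01           # AdamW weight decay (assumed)
    adam_betas: tuple = (0.9, 0.999)     # standard AdamW betas
    warmup_ratio: float = 0.03           # learning-rate warm-up fraction (assumed)
    batch_size: int = 32                 # chosen to fit accelerator memory
    num_epochs: int = 1
```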
7. Limitations of Pre-GRPO and Transition to Robust Groupwise Optimization
Pre-GRPO cannot guarantee equitable alignment across heterogeneous subpopulations or sub-tasks:
- One-Size-Fits-All Limitation: In preference optimization, the DPO objective averages over all labeler preferences, so the induced policy may be aligned only with the majority (the “global average”), leaving minority subgroups under-optimized or misaligned (Ramesh et al., 30 May 2024).
- No Worst-Case Control: There is no mechanism in pre-GRPO for optimizing or bounding worst-case group loss.
This motivates the transition to groupwise or robust preference optimization (GRPO), which directly targets such limitations by adaptively reweighting subgroups or maximizing minimum group reward.
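The contrast motivating this transition can be written explicitly. The notation below is generic (it assumes a per-group preference loss $\mathcal{L}_g$ over groups $g = 1, \dots, G$) rather than taken verbatim from the cited work.

```latex
% Pre-GRPO (e.g., DPO) minimizes the preference loss averaged over all groups:
\min_{\theta} \;\; \frac{1}{G} \sum_{g=1}^{G} \mathcal{L}_g(\theta)
% Group-robust optimization instead controls the worst-performing group:
\min_{\theta} \;\; \max_{g \in \{1, \dots, G\}} \mathcal{L}_g(\theta)
```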
In summary, pre-GRPO represents the set of globally averaged, reference-establishing training procedures that precede group-relative or robust preference optimization. These initialization stages are not only empirically critical for successful downstream GRPO, but also mathematically well-behaved and computationally efficient. Their fundamental limitation—lack of subgroup robustness or worst-case guarantees—directly motivates the need for GRPO approaches in modern RLHF and large-model optimization pipelines.