PS-GRPO: Paraphrased Set Group Relative Policy Optimization
- The paper’s main contribution is introducing PS-GRPO, a framework that enhances consistency and alignment of outputs by leveraging group-relative advantage signals across paraphrased inputs.
- It employs a novel methodology in which outputs sampled for multiple semantically equivalent queries are scored with group similarity rewards, encouraging consistent predictions on structured prediction tasks.
- Empirical results demonstrate significant improvements in accuracy and reliability on tasks like multi-hop and long-form QA, supported by scalable subsampled reward estimation.
Paraphrased Set Group Relative Policy Optimization (PS-GRPO) is a reinforcement learning (RL) algorithmic framework developed to improve consistency, alignment, and performance of LLMs and multi-output systems in scenarios where semantically equivalent inputs—such as paraphrased queries—should yield consistent, reliable outputs. By extending the principles of Group Relative Policy Optimization (GRPO), PS-GRPO leverages group-wise preference aggregation, consistency rewards, and scalable RL objectives to address core challenges in retrieval-augmented generation, reasoning, and other structured prediction domains.
1. Conceptual Foundations and Motivation
PS-GRPO generalizes GRPO to settings where multiple semantically equivalent input variations, termed a paraphrased set, are provided simultaneously. Standard GRPO computes the group-relative advantage by normalizing rewards obtained from a group of candidate outputs sampled for a single context, adjusting the policy toward higher-reward members while regularizing against excessive deviation from a reference policy. PS-GRPO expands this mechanism by explicitly organizing outputs not just for a single context, but across paraphrase sets—thereby directly targeting consistency in the face of linguistic, structural, or system-induced variability (Hamman et al., 5 Oct 2025).
The motivation for PS-GRPO arises from the requirement that AI systems, especially those deployed in safety-critical applications (e.g., healthcare, finance, legal search), must yield outputs that are invariant to minor input perturbations such as paraphrasing. Empirical observations show that conventional LLMs and retrieval-augmented generators can produce divergent answers for inputs that are semantically equivalent, undermining reliability and user trust.
2. Algorithmic Structure and Policy Objective
The PS-GRPO objective is rooted in the GRPO family, which fundamentally combines two terms in its learning objective: a normalized group-relative advantage signal and a penalty term that reduces divergence from a reference policy. In the paraphrased set extension, the procedure is as follows:
- For a canonical query $q$, a set of $P$ paraphrased variants $\{q_1, \dots, q_P\}$ is generated.
- For each paraphrase $q_p$, the model generates $G$ output samples $\{o_{p,1}, \dots, o_{p,G}\}$.
- The group similarity reward for output $o_{p,i}$ is computed by averaging its similarity to all outputs associated with the other paraphrases:

$$r_{p,i} = \frac{1}{(P-1)\,G} \sum_{p' \neq p} \sum_{j=1}^{G} \mathrm{sim}\big(o_{p,i},\, o_{p',j}\big),$$

where $\mathrm{sim}(\cdot,\cdot)$ is typically a lexical similarity metric such as BLEU; this can be extended to semantic metrics in future work.
- The group-relative advantage for output $o_{p,i}$ is then normalized within its paraphrase group:

$$A_{p,i} = \frac{r_{p,i} - \mathrm{mean}\big(\{r_{p,k}\}_{k=1}^{G}\big)}{\mathrm{std}\big(\{r_{p,k}\}_{k=1}^{G}\big)}.$$

- The optimization objective takes the PPO-style clipped update form:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{PG} \sum_{p=1}^{P} \sum_{i=1}^{G} \min\big(\rho_{p,i}\, A_{p,i},\ \mathrm{clip}(\rho_{p,i},\, 1-\epsilon,\, 1+\epsilon)\, A_{p,i}\big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

where $\rho_{p,i}$ is the token-level probability ratio relative to the reference policy, and $\beta$ tunes regularization toward the reference.
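To make the reward and advantage computations above concrete, the following is a minimal NumPy sketch, assuming a toy unigram-overlap similarity as a stand-in for the BLEU-style metric named in the text; the KL term and the actual token-level probability ratios are omitted (placeholder ratios are used), and all function and variable names are illustrative rather than taken from the paper.

```python
# Minimal sketch of the PS-GRPO group similarity reward and normalized advantage.
# The similarity function is a toy unigram-F1 stand-in for BLEU; names are illustrative.
import numpy as np

def sim(a: str, b: str) -> float:
    """Toy lexical similarity (unigram F1); a real system would use BLEU/ROUGE."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(tb), overlap / len(ta)
    return 2 * prec * rec / (prec + rec)

def ps_grpo_advantages(outputs: list[list[str]]) -> np.ndarray:
    """outputs[p][i] is the i-th sample for paraphrase p; returns A[p, i]."""
    P, G = len(outputs), len(outputs[0])
    r = np.zeros((P, G))
    for p in range(P):
        for i in range(G):
            # Average similarity to every output of the *other* paraphrases.
            vals = [sim(outputs[p][i], outputs[q][j])
                    for q in range(P) if q != p for j in range(G)]
            r[p, i] = np.mean(vals)
    # Normalize rewards within each paraphrase group (group-relative advantage).
    return (r - r.mean(axis=1, keepdims=True)) / (r.std(axis=1, keepdims=True) + 1e-8)

def clipped_objective(ratios: np.ndarray, A: np.ndarray, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate averaged over all (p, i) samples (KL term omitted)."""
    unclipped = ratios * A
    clipped = np.clip(ratios, 1 - eps, 1 + eps) * A
    return float(np.minimum(unclipped, clipped).mean())

# Example: 2 paraphrases x 3 sampled answers each.
outs = [["paris is the capital", "the capital is paris", "lyon"],
        ["paris", "the capital of france is paris", "marseille"]]
A = ps_grpo_advantages(outs)
ratios = np.ones_like(A)  # placeholder policy ratios
print(A.round(2), clipped_objective(ratios, A))
```

In a full training loop, the ratios would be computed from per-token log-probabilities of the current and reference policies, and the clipped surrogate minus the $\beta$-weighted KL term would be maximized.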
3. Preference Aggregation and Consistency Reward Construction
Unlike logarithmic pooling (as used in RLHF), which exponentially tilts the reference policy toward higher-reward candidates, GRPO—and by extension PS-GRPO—utilizes an inverse-linear weighting of the reference policy by the centered group preference, of the form

$$\pi(o \mid q) \;\propto\; \frac{\pi_{\mathrm{ref}}(o \mid q)}{1 - \tfrac{1}{\beta}\, \tilde{A}(o \mid q)},$$

where $\tilde{A}(o \mid q)$ denotes the normalized group-relative preference of output $o$. This means outputs with above-average group preference receive an amplified probability, but the amplification is modulated to avoid instability. In PS-GRPO, this group preference is specifically engineered to encourage output similarity across paraphrases (Vojnovic et al., 25 Feb 2025, Hamman et al., 5 Oct 2025).
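As a numerical illustration (not from the cited papers), the short NumPy snippet below contrasts exponential tilting with the inverse-linear weighting sketched above on a toy three-output reference policy; the distribution, preference values, and $\beta$ are invented for illustration, and only the stationary-policy forms are compared, not the training dynamics.

```python
# Illustrative comparison of two aggregation regimes on a toy categorical policy:
# exponential tilting (logarithmic pooling, RLHF-style) versus the inverse-linear
# weighting described above. All numbers are made up for illustration.
import numpy as np

pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over 3 candidate outputs
A = np.array([0.8, 0.0, -0.8])       # centered group preference per output
beta = 2.0                           # regularization strength

# Logarithmic pooling: pi(o) proportional to pi_ref(o) * exp(A(o) / beta)
log_pool = pi_ref * np.exp(A / beta)
log_pool /= log_pool.sum()

# Inverse-linear weighting: pi(o) proportional to pi_ref(o) / (1 - A(o) / beta),
# valid while A(o) < beta.
inv_lin = pi_ref / (1.0 - A / beta)
inv_lin /= inv_lin.sum()

print("log pooling   :", log_pool.round(3))
print("inverse-linear:", inv_lin.round(3))
```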
The reward function can flexibly utilize similarity metrics (e.g., BLEU, ROUGE), stepwise reasoning agreement, or LLM-based factual consistency judgments. The choice of normalization, reward scaling, and penalty term (reverse vs. direct KL) can be tuned to interpolate between different aggregation regimes.
4. Computational Scalability and Approximation
Direct computation of group similarity rewards for every $(p, i)$ pair entails quadratic complexity. To ensure scalability, PS-GRPO implements subsampled reward estimation:

$$\hat{r}_{p,i} = \frac{1}{|\mathcal{P}_s|\, |\mathcal{G}_s|} \sum_{p' \in \mathcal{P}_s} \sum_{j \in \mathcal{G}_s} \mathrm{sim}\big(o_{p,i},\, o_{p',j}\big),$$

where $\mathcal{P}_s \subseteq \{1, \dots, P\} \setminus \{p\}$ and $\mathcal{G}_s \subseteq \{1, \dots, G\}$ are randomly chosen subsets ($|\mathcal{P}_s| \ll P$, $|\mathcal{G}_s| \ll G$). This approach maintains a strong training signal while constraining computation during large-scale QA or generation tasks (Hamman et al., 5 Oct 2025).
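A minimal sketch of the subsampled estimator follows, assuming the same sim() function and outputs[p][i] layout as the earlier snippet and at least two paraphrases; the subset sizes k_para and k_out are illustrative hyperparameters, not values from the paper.

```python
# Sketch of the subsampled group similarity reward. Reuses a sim(a, b) function
# and the outputs[p][i] layout from the earlier snippet; assumes >= 2 paraphrases.
import random
import numpy as np

def subsampled_reward(outputs, p, i, sim, k_para=2, k_out=2, rng=random):
    """Estimate r[p, i] from a random subset of other paraphrases and their outputs."""
    P, G = len(outputs), len(outputs[0])
    other = [q for q in range(P) if q != p]
    para_subset = rng.sample(other, min(k_para, len(other)))
    vals = []
    for q in para_subset:
        out_subset = rng.sample(range(G), min(k_out, G))
        vals.extend(sim(outputs[p][i], outputs[q][j]) for j in out_subset)
    return float(np.mean(vals))
```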
5. Empirical Performance and Practical Applications
PS-GRPO yields marked improvements in output consistency on short-form, multi-hop, and long-form QA datasets. For example, on TriviaQA, consistency metrics rise from 53% (standard RAG) to 87% under PS-GRPO-driven Con-RAG. The improvement holds for both lexical and information-level consistency, as measured by similarity scores and LLM-based judges.
Importantly, PS-GRPO also exhibits accuracy gains, often in the absence of explicit ground-truth rewards. This suggests that the consistency reward has a regularizing effect analogous to data augmentation, making the generator more robust to input and retrieval variability.
6. Alignment Objective, Parametric Dependence, and Extensions
PS-GRPO maintains the core GRPO alignment objective: maximizing expected normalized group advantage—here, consistency—and penalizing policy drift from reference. The form of the aggregate preference and stationary policy depends on hyperparameters including the regularization constant $\beta$, the scale of similarity rewards, and group size. Modifications such as direct versus reverse KL penalties, omission of scale normalization, or dynamic entropy weighting (cf. Tan et al., 6 Aug 2025) further modulate sensitivity to reward differences and regularization strength.
Key extensions include:
- Hybrid consistency-semantic reward signals to encourage deeper semantic agreement beyond surface overlap.
- Joint retriever-generator optimization to mitigate inconsistencies arising from evidence variability.
- Application to other structured tasks: hyperparameter optimization (Guo et al., 21 Sep 2025), continuous control (Khanda et al., 25 Jul 2025), and reasoning in security (Simoni et al., 3 Jul 2025) and wireless systems (Zhang et al., 18 Sep 2025).
7. Limitations, Challenges, and Future Directions
While PS-GRPO robustly enhances consistency and accuracy, several limitations are noted:
- A trade-off exists between reward fidelity and computational tractability; heavy subsampling may underrepresent critical paraphrase divergences.
- Lexical reward metrics can over-penalize stylistically valid paraphrases; future research is needed on semantic and factual alignment scores.
- Policy stability depends critically on tuning $\beta$ and group size; excessive regularization can impede learning, while weak regularization can cause overfitting to surface form.
A plausible implication is that further refinements involving dynamic reward weighting, adaptive group formation, and integration with off-policy robustness principles (Yao et al., 29 Sep 2025) will be required for optimal deployment in open-ended, retrieval-intensive and safety-critical scenarios.
PS-GRPO constitutes a principled, scalable paradigm for improving output consistency and reliability in multi-input, multi-output systems. By leveraging group similarity rewards and GRPO-style normalized advantages, it provides a robust solution to the persistent challenge of aligning model behavior across paraphrased or variable inputs, with broad applicability across language, reasoning, and control domains (Hamman et al., 5 Oct 2025, Vojnovic et al., 25 Feb 2025).