Group Relative Policy Optimization (GRPO) for Image Captioning
Last updated: June 9, 2025
This article reviews Group Relative Policy Optimization (GRPO) for image captioning, following the evidence in "Group Relative Policy Optimization for Image Captioning" (Liang, 3 Mar 2025).
Introduction and Motivation
Image captioning presents unique optimization challenges due to exposure bias and the mismatch between training objectives (e.g., cross-entropy) and evaluation metrics (e.g., CIDEr, BLEU). While Self-Critical Sequence Training (SCST) has been the dominant reinforcement learning (RL) method for optimizing captioning models against sequence-level metrics, it suffers from notable limitations:
- High variance in advantage estimation due to reliance on a single greedy-decoding baseline.
- Limited diversity, as only a single sampled caption is compared to the greedy output per update.
- Lack of regularization: no explicit mechanism prevents the policy from overfitting or drifting too far from its learned distribution.
Group Relative Policy Optimization (GRPO) was introduced to overcome these challenges by leveraging multi-sample group comparisons and regularized policy updates.
The GRPO Training Framework
GRPO introduces a new RL formulation for the image captioning fine-tuning phase, with several key innovations:
1. Multi-Candidate Generation
For every input image $q$, the model samples a group of $G$ candidate captions $\{o_i\}_{i=1}^G$ from the current policy $\pi_{\theta_{old}}$.
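As a concrete illustration (not the authors' released code), the sampling step might look like the sketch below. It assumes a Hugging Face-style captioning model and processor; `model`, `processor`, and the decoding settings are placeholder choices.

```python
import torch

def sample_caption_group(model, processor, image, group_size=8, max_len=30):
    """Sample a group of G candidate captions for one image (sketch).

    Assumes a Hugging Face-style captioning model whose `generate` method
    supports sampling; `model` and `processor` are placeholders, not names
    from the paper.
    """
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        # do_sample=True draws stochastic candidates from the current policy;
        # num_return_sequences yields the group of size G.
        output_ids = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=max_len,
            num_return_sequences=group_size,
        )
    return processor.batch_decode(output_ids, skip_special_tokens=True)
```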
2. Intragroup Comparison and Advantage Calculation
Each candidate is evaluated using the desired reward function, typically a language metric such as CIDEr:
- For each candidate $o_i$, compute its reward $R_i$ under the chosen metric (e.g., the CIDEr score of $o_i$ against the reference captions).
- Compute the group-normalized advantage:

$A_i = \dfrac{R_i - \operatorname{mean}\left(\{R_1, \dots, R_G\}\right)}{\operatorname{std}\left(\{R_1, \dots, R_G\}\right)}$
This normalizes each sample's performance against its group's average, reducing variance and allowing learning signal even when all outputs are weak.
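A minimal sketch of this step, assuming a sentence-level reward function (the `reward_fn` parameter, e.g., a hypothetical CIDEr scorer):

```python
import numpy as np

def group_advantages(candidates, references, reward_fn, eps=1e-8):
    """Compute group-normalized advantages A_i for one group of candidates.

    `reward_fn(candidate, references)` stands in for a sentence-level metric
    such as CIDEr; `eps` guards against a zero standard deviation when all
    candidates receive the same reward.
    """
    rewards = np.array([reward_fn(c, references) for c in candidates], dtype=np.float64)
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    return rewards, advantages
```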
3. Policy Update with Constraints
Policy updates in GRPO are regularized for stability and steady progress:
- Clipped surrogate objective: caps the policy likelihood ratio to prevent excessively large updates, adopting the clipping technique from PPO.
- KL-divergence penalty: adds a term penalizing divergence from a reference policy (the previous policy or a pre-trained model) to avoid mode collapse or policy drift.
The GRPO objective is:
$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}} \left[ \frac{1}{G} \sum_{i=1}^G \left( \min\left(r_i A_i,\ \mathrm{clip}(r_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\right) - \beta\, \mathbb{D}_{KL}\left(\pi_{\theta} \,\|\, \pi_{ref}\right) \right) \right]$
where
- $r_i = \dfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}$ is the likelihood ratio for candidate $o_i$,
- $\epsilon$ is the clipping range that limits the step size,
- $\mathbb{D}_{KL}(\pi_\theta \,\|\, \pi_{ref})$ is the KL regularization term,
- $\epsilon$ and $\beta$ are hyperparameters.
The KL penalty is given by:
$\mathbb{D}_{KL} \left( \pi_\theta \| \pi_{ref} \right) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1$
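To make the update concrete, the following PyTorch sketch computes the per-group GRPO loss from pre-computed sequence-level log-probabilities. It is an illustrative reconstruction, not the authors' implementation; the default `clip_eps` and `beta` values are arbitrary placeholders.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """GRPO surrogate loss for one group of G candidates (sketch).

    logp_new   : log pi_theta(o_i | q), requires grad        -- shape (G,)
    logp_old   : log pi_theta_old(o_i | q), detached          -- shape (G,)
    logp_ref   : log pi_ref(o_i | q), detached                -- shape (G,)
    advantages : group-normalized A_i, detached                -- shape (G,)
    clip_eps, beta : illustrative values, not taken from the paper.
    """
    # Likelihood ratio r_i between the current and old policies.
    ratio = torch.exp(logp_new - logp_old)

    # PPO-style clipped surrogate (to be maximized).
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )

    # Unbiased KL estimator from the article:
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Average over the group; negate because optimizers minimize.
    return -(surrogate - beta * kl).mean()
```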
Core Advantages over SCST
1. Variance Reduction:
By normalizing advantages within groups, GRPO reduces the noise from outlier samples and the instability of using a single greedy baseline, leading to more reliable and smoother learning.
2. Enhanced Diversity:
GRPO samples multiple captions per image, expanding the explored solution space and promoting diversity in the generated language. This addresses the local-optimum problem prevalent when optimizing against a single baseline.
3. Policy Stability:
The PPO-style clipping and KL constraint regularize the policy update, reducing the risk of collapse or divergence and making learning less sensitive to poor initializations or weak baselines.
Empirical Results
GRPO was benchmarked on standard datasets:
| Method | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
|---|---|---|---|---|---|
| CE | 26.8 | 23.7 | 50.5 | 84.3 | 16.4 |
| SCST | 30.5 | 23.8 | 52.5 | 97.6 | 16.4 |
| GRPO | 31.4 | 24.4 | 53.2 | 100.0 | 17.1 |
Observations:
- Across all metrics, GRPO outperforms both SCST and cross-entropy (CE) training. CIDEr improves by 2.4 points over SCST (97.6 → 100.0), indicating the effectiveness of group-based reward optimization.
- Better efficiency: GRPO achieves these results in fewer RL fine-tuning epochs (5 vs. 20 for SCST).
- Stability: on challenging data (e.g., Flickr8k), GRPO produces steadier learning curves and avoids the metric drops sometimes observed with SCST.
Implementation Considerations
- Computational Requirements: GRPO's need to sample and process multiple outputs per image requires higher batch-level compute than single-sample baselines, but this overhead is offset by faster and more stable convergence (a minimal batching sketch follows this list).
- Hyperparameters: the key settings are the group size $G$, the clipping range $\epsilon$, and the KL coefficient $\beta$, with $\beta$ tuned to balance reward optimization against drift from the reference policy.
- Code and reproducibility: the authors provide a public codebase at https://github.com/liangxu-one/ms-models/tree/image_caption_grpo/research/arxiv_papers/Image_Caption_GRPO.
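Regarding the batch-level compute point above, a common batching pattern (an assumption, not taken from the paper) is to repeat each image's inputs $G$ times so the whole group can be scored in a single forward pass:

```python
import torch

def expand_for_group(pixel_values: torch.Tensor, group_size: int) -> torch.Tensor:
    """Repeat each image G times along the batch dimension.

    pixel_values: (B, C, H, W) image batch; returns (B * G, C, H, W) so the
    G sampled captions per image can be scored together.
    """
    return pixel_values.repeat_interleave(group_size, dim=0)
```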
Summary Table: GRPO vs. SCST
| Aspect | SCST | GRPO |
|---|---|---|
| Baseline | Single greedy caption | Group mean (multi-candidate) |
| Advantage | Sampled reward minus greedy reward | Group-normalized advantage |
| Update variance | High | Lower (group normalization) |
| Diversity | Low (one sample per update) | High (multiple samples per update) |
| Policy regularization | None | Clipped step + KL divergence |
| Convergence | Needs more epochs | Fewer epochs, more stable |
| Metric scores | Strong but improvable | Strongest in paper (+2.4 CIDEr over SCST) |
References
- Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., & Goel, V. "Self-Critical Sequence Training for Image Captioning." CVPR 2017. [rennie2017self] (SCST baseline)
- Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv, 2024. [shao2024deepseekmath] (original GRPO formulation)
- Liang. "Group Relative Policy Optimization for Image Captioning." arXiv, 3 Mar 2025. (primary source)
Conclusion
GRPO establishes a strong new approach to RL-based image captioning, providing a group-wise, diversity-promoting, and stability-regularized optimization process. Its empirical gains over SCST and its publicly available implementation make it a recommended method for advanced image captioning pipelines.