Group Relative Policy Optimization (GRPO) for Image Captioning

Last updated: June 9, 2025

This article surveys Group Relative Policy Optimization (GRPO) for image captioning, following the evidence presented in "Group Relative Policy Optimization for Image Captioning" (Liang, 3 Mar 2025).



Introduction and Motivation

Image captioning presents unique optimization challenges due to exposure bias and the mismatch between training objectives (e.g., cross-entropy) and evaluation metrics (e.g., CIDEr, BLEU). While Self-Critical Sequence Training (SCST) has been the dominant reinforcement learning (RL) method for optimizing captioning models against sequence-level metrics, it suffers from notable limitations:

  • High variance in advantage estimation due to reliance on a single greedy decoding baseline.
  • Limited diversity, as only a single sampled caption is compared to the greedy output per update.
  • Lack of regularization: no explicit mechanism prevents the policy from overfitting or drifting too far from its learned distribution.

Group Relative Policy Optimization (GRPO) was introduced to overcome these challenges by leveraging multi-sample group comparisons and regularized policy updates.


The GRPO Training Framework

GRPO introduces a new RL formulation for the image captioning fine-tuning phase, with several key innovations:

1. Multi-Candidate Generation

For every input image $q$, the model samples a group of $G$ candidate captions $\{o_1, o_2, \dots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$.
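
The paper does not provide reference code, so the following is a minimal PyTorch-style sketch of this sampling step. The `model.sample(image, max_len=...)` method is a hypothetical stand-in for whatever stochastic (multinomial) decoding routine the captioning model exposes, and the default group size and caption length are illustrative rather than values from the paper.

```python
import torch

def sample_candidate_group(model, image, group_size=8, max_len=20):
    """Sample a group of G candidate captions for one image from the
    current (old) policy. `model.sample` is assumed to perform stochastic
    decoding and return a sequence of token IDs."""
    captions = []
    with torch.no_grad():  # candidates are drawn from the frozen sampling policy
        for _ in range(group_size):
            caption_ids = model.sample(image, max_len=max_len)
            captions.append(caption_ids)
    return captions  # the group {o_1, ..., o_G}
```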

2. Intragroup Comparison and Advantage Calculation

Each candidate is evaluated using the desired reward function, typically a language metric such as CIDEr:

  • For candidate $o_i$, compute reward $r_i$.
  • Compute the group-normalized advantage:

$A_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$

This normalizes each sample's performance against its group's average, reducing variance and providing a learning signal even when all sampled captions score poorly.
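
As an illustration, here is a sketch of the group-normalized advantage computation. The small epsilon added to the denominator guards against a zero standard deviation; it is an implementation choice, not part of the paper's formula.

```python
import torch

def group_normalized_advantages(rewards, eps=1e-8):
    """Compute A_i = (r_i - mean(r)) / std(r) within one group of rewards,
    e.g., per-candidate CIDEr scores for the G sampled captions of one image."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: CIDEr scores for G = 4 sampled captions of a single image
advantages = group_normalized_advantages([0.92, 1.10, 0.75, 1.03])
```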

3. Policy Update with Constraints

Policy updates in GRPO are regularized for stability and steady progress:

The GRPO objective is:

$\begin{align} \mathcal{J}_{\text{GRPO}}(\theta) =\ & \mathbb{E}_{q,\, \{o_i\}_{i=1}^G} \left[ \frac{1}{G} \sum_{i=1}^G \left( \min\left(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\right) - \beta\, \mathbb{D}_{KL}(\pi_{\theta} \| \pi_{ref}) \right) \right] \end{align}$

where

  • $\rho_i = \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)}$ is the probability ratio between the updated policy and the sampling policy,
  • $\mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)$ limits the effective step size,
  • $\mathbb{D}_{KL}(\pi_\theta \| \pi_{ref})$ is the KL regularization toward a frozen reference policy,
  • $\epsilon$ and $\beta$ are hyperparameters.

The KL penalty is given by:

$\mathbb{D}_{KL} \left( \pi_\theta \| \pi_{ref} \right) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1$
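
A minimal sketch of the resulting loss is given below, assuming summed sequence-level log-probabilities of each candidate under the updated policy (`logp_new`), the sampling policy (`logp_old`), and the frozen reference policy (`logp_ref`). The default `clip_eps` and `beta` values are placeholders rather than settings from the paper.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """Negated per-group GRPO objective (so it can be minimized with SGD).
    All inputs are 1-D tensors of length G; only `logp_new` requires grad.
    The clip_eps and beta defaults are illustrative, not taken from the paper."""
    ratio = torch.exp(logp_new - logp_old)                       # rho_i
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = torch.min(ratio * advantages, clipped * advantages)

    # KL penalty in the form pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return -(policy_term - beta * kl).mean()
```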


Core Advantages over SCST

1. Variance Reduction:

By normalizing advantages within groups, GRPO reduces the noise from outlier samples and the instability of using a single greedy baseline, leading to more reliable and smoother learning.

2. Enhanced Diversity:

GRPO samples multiple captions per image, expanding the solution space and promoting diversity in generated language. This addresses the local optimum problem prevalent when optimizing against a single baseline.

3. Policy Stability:

The PPO-style clipping and KL constraint regularize the policy update, reducing the risk of collapse or divergence and making learning less sensitive to poor initializations or weak baselines.


Empirical Results

GRPO was benchmarked on standard datasets:

| Method | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
|--------|--------|--------|---------|-------|-------|
| CE (cross-entropy) | 26.8 | 23.7 | 50.5 | 84.3 | 16.4 |
| SCST | 30.5 | 23.8 | 52.5 | 97.6 | 16.4 |
| GRPO | 31.4 | 24.4 | 53.2 | 100.0 | 17.1 |

Observations:

  • Across all metrics, GRPO outperforms both SCST and the cross-entropy (CE) baseline. CIDEr improves by 2.4 points over SCST (97.6 → 100.0), indicating the effectiveness of group-based reward optimization.
  • Better efficiency: GRPO achieves these results in fewer RL fine-tuning epochs (5 vs. 20 for SCST).
  • Stability: on challenging data (e.g., Flickr8k), GRPO produces steadier learning curves and avoids the metric drops sometimes observed with SCST.

Implementation Considerations

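The paper does not include an implementation, but the pieces sketched above can be tied together into one GRPO update roughly as follows. Here `sequence_logprob(model, image, caption_ids)` (summing per-token log-probabilities) and `reward_fn(caption_ids, references)` (e.g., sentence-level CIDEr) are hypothetical helpers, and the structure is a sketch rather than the authors' code.

```python
import torch

def grpo_training_step(model, old_model, ref_model, optimizer, image,
                       references, reward_fn, group_size=8):
    """One GRPO update for a single image, reusing the sketches above."""
    # 1. Sample a group of candidate captions from the (frozen) old policy.
    captions = sample_candidate_group(old_model, image, group_size)

    # 2. Score each candidate and normalize rewards within the group.
    rewards = [reward_fn(c, references) for c in captions]
    advantages = group_normalized_advantages(rewards)

    # 3. Sequence log-probabilities under the new, old, and reference policies
    #    (computed with the hypothetical `sequence_logprob` helper).
    logp_new = torch.stack([sequence_logprob(model, image, c) for c in captions])
    with torch.no_grad():
        logp_old = torch.stack([sequence_logprob(old_model, image, c) for c in captions])
        logp_ref = torch.stack([sequence_logprob(ref_model, image, c) for c in captions])

    # 4. Clipped, KL-regularized policy update.
    loss = grpo_loss(logp_new, logp_old, logp_ref, advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, updates would be batched over images and the old policy periodically refreshed with the current weights; those details are omitted here.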

Summary Table: GRPO vs. SCST

| Aspect | SCST | GRPO |
|--------|------|------|
| Baseline | Single greedy caption | Group mean (multi-candidate) |
| Advantage | $r_{\text{sample}} - r_{\text{greedy}}$ | Normalized group advantage |
| Update variance | High | Lower (group normalization) |
| Diversity | Low (one sample per update) | High (multiple samples per update) |
| Policy regularization | None | Clipped step + KL divergence |
| Convergence | Needs more epochs | Fewer epochs, more stable |
| SOTA metrics | Strong but improvable | Strongest in paper (+2.4 CIDEr over SCST) |

References

  • Rennie et al. (2017), "Self-Critical Sequence Training for Image Captioning" [rennie2017self] – SCST baseline method.
  • Shao et al. (2024), "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" [shao2024deepseekmath] – original GRPO formulation.

Conclusion

GRPO establishes a new standard for RL-based image captioning, providing a group-wise, diversity-promoting, and stability-regularized optimization process. Its empirical superiority and robust implementation make it a recommended method for advanced image captioning pipelines.