Stage-Structured GRPO for Multimodal Tasks

Updated 16 May 2026
  • SS-GRPO is a multi-stage reinforcement learning framework that integrates supervised fine-tuning and staged GRPO to improve visual reasoning and structured outputs.
  • It employs a three-stage curriculum to sequentially enhance basic perceptual grounding, relational reasoning, and high-level semantic understanding in UAV and table-processing tasks.
  • By integrating domain-aware structured rewards and refined credit assignment, SS-GRPO overcomes cold-start and reward sparsity issues for stable, resource-efficient policy updates.

Stage-Structured Group Relative Policy Optimization (SS-GRPO) is a multi-stage reinforcement learning (RL) framework designed to enhance large vision-language models (LVLMs) on complex, structured multimodal tasks, including UAV-based visual reasoning and multimodal table understanding. It combines supervised fine-tuning (SFT) with a staged RL curriculum that uses Group Relative Policy Optimization (GRPO), group-normalized rewards, and task-specific credit assignment mechanisms. This approach mitigates the cold-start and reward sparsity problems that affect GRPO in high-complexity or low-accuracy policy regimes, achieving stable credit propagation and interpretable, structured outputs in low-latency, resource-constrained settings (Guan et al., 15 Aug 2025, Kang et al., 21 Sep 2025).

1. Training Pipeline and Stagewise Structure

SS-GRPO interleaves supervised and RL-based optimization across three curriculum stages, each targeting progressively harder perception and reasoning skills. Training starts with supervised fine-tuning (SFT) so that the initial policy accuracy falls in the regime where GRPO learning signals have sufficient variance. When accuracy is close to 0 or 1, nearly all sampled outputs in a group receive the same reward, the advantage estimates degenerate, and the policy can collapse.
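
To make the cold-start argument concrete, consider a binary (0/1) reward and a policy that answers correctly with probability p. If the G samples in a group are treated as independent (a simplifying assumption, not something the source states), all of them receive the same reward with probability p^G + (1 - p)^G, and in that case the group carries no learning signal. The sketch below tabulates this quantity; the helper name `degenerate_group_prob` and the group size of 8 are illustrative choices.

```python
def degenerate_group_prob(p: float, group_size: int) -> float:
    """Probability that all binary rewards in one group are identical.

    Assumes independent samples, each correct with probability `p`
    (an illustrative simplification, not taken from the source).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be positive")
    # Either every sample is correct (p^G) or every sample is wrong ((1-p)^G).
    return p ** group_size + (1.0 - p) ** group_size


if __name__ == "__main__":
    for p in (0.02, 0.10, 0.50, 0.90, 0.98):
        prob = degenerate_group_prob(p, 8)
        print(f"accuracy={p:.2f}  P(all rewards identical)={prob:.3f}")
```

Under these assumptions the probability of an uninformative group is highest at the extremes of accuracy and lowest near p = 0.5, which is consistent with using SFT to move the policy away from near-zero accuracy before RL begins.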

In the UAV-VL-R1 model (Guan et al., 15 Aug 2025), the curriculum is organized as follows:

  • Stage A (RL1): Basic Attributes. Tasks: color recognition, size comparison, yes/no questions. Objective: low-level perceptual grounding, binary structured reasoning.
  • Stage B (RL2): Object Reasoning & Counting. Tasks: object counting, shape classification, transportation recognition. Objective: arithmetic and relational reasoning.
  • Stage C (RL3): Spatial & Semantic Understanding. Tasks: location inference, scene classification. Objective: spatial layout and high-level semantic relationship understanding.

In multimodal table understanding (Table-R1 (Kang et al., 21 Sep 2025)), the stages are:

  • Stage 1: Warm-up (SFT). Bootstraps table perception and chain-of-thought (CoT) reasoning via cross-entropy over perception (markup reconstruction) and reasoning datasets.
  • Stage 2: Perception-Alignment GRPO (PA-GRPO). Refines structural parsing using dense, continuous Tree-Edit-Distance Similarity (TEDS) rewards.
  • Stage 3: Hint-Completion GRPO (HC-GRPO). Improves step-wise reasoning accuracy via fine-grained residual-step rewards, leveraging partial solution “hints.”

This curriculum ensures robust policy improvement by sequentially aligning vision and language modalities, then optimizing over increasingly complex skill sets.
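
The two curricula can be written down as ordered stage lists with a small driver that runs them in sequence. This is only a sketch: the `Stage` dataclass, the `run_curriculum` function, and the `train_stage` callback are hypothetical names, and the actual SFT and GRPO training steps are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

P = TypeVar("P")  # whatever object represents the policy


@dataclass(frozen=True)
class Stage:
    name: str    # label used in the text
    method: str  # "sft" or "grpo"
    focus: str   # skill the stage targets


# Stage lists transcribed from the descriptions in this section.
UAV_VL_R1: List[Stage] = [
    Stage("SFT", "sft", "initial accuracy and structured output"),
    Stage("Stage A (RL1)", "grpo", "basic attributes"),
    Stage("Stage B (RL2)", "grpo", "object reasoning and counting"),
    Stage("Stage C (RL3)", "grpo", "spatial and semantic understanding"),
]

TABLE_R1: List[Stage] = [
    Stage("Stage 1: Warm-up", "sft", "table perception and chain-of-thought reasoning"),
    Stage("Stage 2: PA-GRPO", "grpo", "structural parsing with TEDS rewards"),
    Stage("Stage 3: HC-GRPO", "grpo", "step-wise reasoning with residual-step rewards"),
]


def run_curriculum(policy: P, stages: List[Stage],
                   train_stage: Callable[[P, Stage], P]) -> P:
    """Run stages strictly in order; each stage starts from the previous policy."""
    for stage in stages:
        policy = train_stage(policy, stage)
    return policy


if __name__ == "__main__":
    # Stand-in trainer that only records which stages were visited.
    history = run_curriculum([], TABLE_R1, lambda p, s: p + [s.name])
    print(history)
```

Keeping the stage order as data makes it straightforward to run ablations such as dropping the warm-up or one of the GRPO stages.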

2. Mathematical Formulation

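The core of SS-GRPO is the GRPO update, which uses group-normalized, relative advantage estimates. For a given stage:

  • For input $q$ (e.g., an image-question pair), sample $G$ candidate outputs $\{o_1, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$.
  • Compute the task-specific reward $r_i$ for each $o_i$.
  • Calculate the group-relative advantage:

$$A_i = \frac{r_i - \mu_r}{\sigma_r}$$

where $\mu_r = \frac{1}{G} \sum_{j=1}^{G} r_j$ and $\sigma_r = \text{std}(\{r_1, \ldots, r_G\})$.

  • Define the policy ratio $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$.
  • The resulting GRPO objective is:

$$J(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(\rho_{i,t}(\theta)\, A_i,\ \text{clip}\left(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_i\right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right]$$

where $|o_i|$ is the length of output $o_i$, $\epsilon$ is the clipping constant, $\beta$ is the KL-penalty weight, and $\pi_{\text{ref}}$ is the SFT-initialized reference policy.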
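
A minimal pure-Python sketch of the group-relative advantage and a clipped, length-normalized surrogate may help fix ideas. It assumes per-token policy ratios, omits the KL term, and uses illustrative defaults that do not come from the source: the `eps` guard in the denominator, the population standard deviation, and `clip_eps=0.2`. All function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean) / (std + eps) over one group of sampled outputs.

    `eps` is an assumed numerical guard; the formula in the text divides by
    the standard deviation alone.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def grpo_surrogate(token_ratios: Sequence[Sequence[float]],
                   rewards: Sequence[float],
                   clip_eps: float = 0.2) -> float:
    """Clipped surrogate averaged over tokens, then over the group (no KL term).

    token_ratios[i] holds the per-token ratios for output o_i.
    """
    advantages = group_advantages(rewards)
    per_output = [
        fmean([clipped_term(rho, adv, clip_eps) for rho in ratios])
        for ratios, adv in zip(token_ratios, advantages)
    ]
    return fmean(per_output)


if __name__ == "__main__":
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # mixed group: non-zero signal
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # uniform group: all zeros
```

The second call shows the degenerate case: when every output in the group earns the same reward, all advantages are zero and the group contributes no gradient.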

This per-group normalization yields non-zero gradients whenever the rewards within a group differ, which in practice requires a policy whose performance is neither near-perfect nor abysmal. It is aimed at the coadaptation bottlenecks of standard RL for LVLMs.

3. Reward Engineering and Credit Assignment

SS-GRPO employs domain-aware, rule-guided rewards to enforce both output structure and semantic correctness. The design of the reward function is pivotal for RL stability and output interpretability:

  • Structured Format Reward:

Binary or fractional rewards for including required output tags (e.g., “<think>…</think>” and “<answer>…</answer>”).

  • Semantic/Accuracy Reward:

High rewards for exact matches to ground-truth answers or for structural similarity (e.g., TEDS for markup trees in table parsing).

  • Dense Task-Specific Reward:

For table structure alignment, rewards are computed as the TEDS score between the predicted and ground-truth markup trees, yielding a continuous scalar in $[0, 1]$.

  • Residual-Step Reward:

In Hint-Completion GRPO, the model receives credit for both correct answers and adherence to XML-style output conventions.
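
For reference, TEDS is commonly defined from the tree edit distance between two markup trees, normalized by the size of the larger tree. The notation below ($T_a$, $T_b$, $\mathrm{EditDist}$) follows that common definition and is not taken from the source, which does not spell out the formula:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Identical trees score 1, and the score falls toward 0 as more edits are needed. This graded behaviour is what allows it to serve as a dense reward rather than a pass/fail signal.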

This multifaceted reward architecture supports fine-grained, stable policy updates, maximizing learning signal propagation by leveraging grouped sampling and normalization.
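
A rule-guided reward of this kind can be sketched with regular expressions. The sketch assumes the required tags are literal `<think>` and `<answer>` tags. The half-credit split, the case-insensitive exact match, and the weights `w_format` and `w_accuracy` are illustrative assumptions rather than the values used in UAV-VL-R1 or Table-R1.

```python
import re

_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(output: str) -> float:
    """Fractional reward: 0.5 for a <think> block plus 0.5 for an <answer> block."""
    has_think = _THINK.search(output) is not None
    has_answer = _ANSWER.search(output) is not None
    return 0.5 * has_think + 0.5 * has_answer


def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the text inside <answer> equals the ground truth, else 0.0.

    Comparison ignores case and surrounding whitespace (an assumed convention).
    """
    match = _ANSWER.search(output)
    if match is None:
        return 0.0
    return float(match.group(1).strip().lower() == ground_truth.strip().lower())


def total_reward(output: str, ground_truth: str,
                 w_format: float = 0.5, w_accuracy: float = 1.0) -> float:
    """Weighted sum of the format and accuracy rewards (weights are placeholders)."""
    return (w_format * format_reward(output)
            + w_accuracy * accuracy_reward(output, ground_truth))


if __name__ == "__main__":
    sample = "<think>Two red cars are visible.</think><answer>Yes</answer>"
    print(total_reward(sample, "yes"))
```

Separating the format and accuracy terms means that a well-formed but wrong answer still earns partial credit, which keeps rewards within a group from collapsing to a single value early in training.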

4. Implementation Strategies

The SS-GRPO algorithm is implemented sequentially, following this workflow:

  1. Supervised Warm-up / SFT:
    • Models are initialized via LoRA-based SFT over appropriately formatted datasets, aligning vision-language modalities and enforcing structured output.
  2. Multi-Stage GRPO Optimization:
    • For each stage, the policy is updated using the clipped GRPO objective, sampling $G$ generations per question and computing group-relative advantages.
    • KL-regularization towards the SFT reference policy is incorporated throughout to prevent catastrophic policy drift.
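
GRPO implementations often estimate the KL term per token with the non-negative estimator exp(d) - d - 1, where d = log π_ref - log π_θ evaluated on the sampled tokens. Whether UAV-VL-R1 or Table-R1 use this exact estimator is not stated in the source, so the sketch below is illustrative only; the function names and the `beta` default are placeholders.

```python
import math
from typing import Sequence


def kl_estimate(logp_policy: Sequence[float], logp_ref: Sequence[float]) -> float:
    """Mean per-token estimate of KL(pi_theta || pi_ref) on sampled tokens.

    Uses exp(d) - d - 1 with d = logp_ref - logp_policy, which is always >= 0
    and equals 0 only when the two log-probabilities agree.
    """
    if len(logp_policy) != len(logp_ref) or len(logp_policy) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        d = lr - lp
        total += math.exp(d) - d - 1.0
    return total / len(logp_policy)


def penalised_objective(surrogate: float, kl: float, beta: float = 0.04) -> float:
    """Clipped surrogate minus the weighted KL penalty (beta is a placeholder)."""
    return surrogate - beta * kl


if __name__ == "__main__":
    same = kl_estimate([-1.0, -2.0], [-1.0, -2.0])
    drifted = kl_estimate([-0.2, -0.3], [-1.0, -2.0])
    print(same, drifted)
```

The penalty is zero while the policy matches the SFT reference and grows as the two diverge, which is the mechanism that limits drift across stages.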

The following table contrasts the RL stage design in UAV-VL-R1 and Table-R1:

| Model | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| UAV-VL-R1 | SFT | Basic Attributes | Object/Spatial |
| Table-R1 | SFT | PA-GRPO | HC-GRPO |

Warm-up strategies include “long-COT expansion” and prompt variation to diversify initial policy coverage and promote robust chain-of-thought generation (Kang et al., 21 Sep 2025).

5. Empirical Evidence and Ablation Analysis

Empirical studies confirm the efficacy of SS-GRPO on both aerial visual reasoning and multimodal table understanding:

  • UAV-VL-R1 (HRVQA-VL):

    • Zero-shot performance: Models trained with SS-GRPO on Stage A alone achieve 48.37% accuracy on Stage-B and 46.59% on Stage-C tasks. For comparison, the general-purpose baseline Qwen2-VL-72B achieves 51.39% on Stage-B and 40.05% on Stage-C, so by these figures the Stage-A model leads on Stage-C but trails the baseline on Stage-B.
    • Final accuracy: UAV-VL-R1 attains 68.94% (unconstrained) and 72.13% (structured prompt) on the eight-task evaluation, exceeding Qwen2-VL-72B’s 51.46% and 46.67%, respectively.
    • Ablations: SFT+GRPO delivers 71.30% overall (vs. 59.48% for GRPO-only), with major gains in high-level reasoning.

  • Table-R1 (Table Understanding):

    • Held-in/held-out performance: SS-GRPO yields 68.63% held-in (+3.93% vs. SFT; +16.38% vs. GRPO) and 55.43% held-out (+7.72% vs. SFT; +8.79% vs. GRPO), matching or surpassing much larger models such as Table-LLaVA 13B and the closed-source GPT-4o.
    • Ablations: Removing the warm-up stage drops performance by 12–40% across tasks. Omitting PA-GRPO causes negligible drops, whereas removing HC-GRPO causes significant decreases (15–20% on QA, 1–17% on Table Fact Verification).
    • Perception evaluation: PA-GRPO improves TEDS structural similarity by 0.10–0.28 over the raw model, outperforming open-source alternatives.
    • HC-GRPO, with residual-step rewards, further improves reasoning accuracy (e.g., raising TabMWP from 76.4% with solution-level GRPO to 83.0%).

6. Practical Constraints and Deployment Considerations

SS-GRPO produces RL-enhanced policies compact enough for real-time or edge deployment. For example, UAV-VL-R1 consumes 3.9GB of memory at FP16 and can be quantized to 2.5GB with INT8, which supports operation on resource-constrained UAV hardware (Guan et al., 15 Aug 2025). In Table-R1, batch size, sampling temperature, reward function hyperparameters, and chain-of-thought expansion are tuned to balance resource usage against accuracy (Kang et al., 21 Sep 2025).
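
As a rough sanity check on the memory figures, the reported footprint can be converted into an implied parameter count. This back-of-envelope sketch assumes the footprint is dominated by model weights and that 1 GB = 10^9 bytes. Neither assumption is confirmed by the source, and the helper name is hypothetical.

```python
def implied_params_billions(memory_gb: float, bytes_per_param: float) -> float:
    """Parameter count (in billions) implied by a weights-only footprint.

    Assumes 1 GB = 1e9 bytes, so GB divided by bytes-per-parameter gives billions.
    """
    if bytes_per_param <= 0:
        raise ValueError("bytes_per_param must be positive")
    return memory_gb / bytes_per_param


if __name__ == "__main__":
    fp16_params = implied_params_billions(3.9, 2)  # FP16: 2 bytes per parameter
    int8_weights_gb = fp16_params * 1              # INT8: 1 byte per parameter
    print(f"implied parameters: ~{fp16_params:.2f}B")
    print(f"weights-only INT8 footprint: ~{int8_weights_gb:.2f} GB (reported: 2.5 GB)")
```

Under these assumptions the FP16 figure implies roughly two billion parameters. A weights-only INT8 estimate comes out below the reported 2.5GB, and the source does not say what accounts for the difference.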

SS-GRPO’s multi-stage choreography, group-normalized advantage calculation, and rule-guided reward engineering collectively yield domain-robust, interpretable, and computationally efficient LVLMs for structured visual and multimodal reasoning.

7. Limitations and Implications

Experimental ablations indicate that SFT substantially enhances semantic alignment but may reduce diversity in mathematical reasoning. GRPO-based RL, especially when staged, compensates by strengthening logical flexibility and overall inference robustness. The precise sequence of stages and the shape of the rewards heavily influence downstream transfer and zero-shot performance, indicating that SS-GRPO’s curriculum and credit assignment design are critical levers for further advancement (Guan et al., 15 Aug 2025, Kang et al., 21 Sep 2025).

A plausible implication is that future extensions can generalize SS-GRPO beyond UAV and tabular settings, provided that task decomposition, staged curriculum, and dense reward formulation are appropriately defined. The framework’s efficacy in overcoming GRPO’s sample variance collapse, cold-start issues, and sparse-reward bottlenecks has been shown empirically in two distinct domains, which supports, but does not by itself establish, broader applicability in multimodal AI systems.
