
HC-GRPO: Hint-Completion Policy Optimization

Updated 1 January 2026
  • HC-GRPO is a reinforcement learning stage in Table-R1 that optimizes residual reasoning by splitting chain-of-thought answers into hints and completions.
  • It employs group-relative advantage normalization and multiple rollouts to provide dense feedback and mitigate reward sparsity in table question answering.
  • Empirical findings show significant improvements in QA accuracy, demonstrating HC-GRPO's effectiveness in enhancing multimodal table understanding.

Hint-Completion Group Relative Policy Optimization (HC-GRPO) is a reinforcement learning (RL) stage within the Table-R1 framework designed to advance multimodal table understanding by delivering fine-grained, stepwise feedback for reasoning tasks. HC-GRPO focuses on policy optimization for completing partially solved chains-of-thought in table question answering, where the model is rewarded on residual reasoning steps following a hint. This method addresses key challenges of coarse reward structures and reward sparsity in RL training for complex table reasoning, yielding substantial empirical gains over conventional approaches (Kang et al., 21 Sep 2025).

1. Position of HC-GRPO in Table-R1 Framework

Table-R1 is a three-stage training pipeline for multimodal table understanding, composed of:

  1. Warm-up (Supervised Finetuning, SFT): Initial supervised training to enhance baseline accuracy of perception and basic reasoning.
  2. Perception Alignment GRPO (PA-GRPO): RL stage leveraging continuous Tree-Edit-Distance Similarity (TEDS) rewards to align extracted table structures and contents.
  3. Hint-Completion GRPO (HC-GRPO): A stage that fine-tunes the model’s step-by-step reasoning by training on partial solution completions, extracting maximal benefit from fine-grained hint-completion pairs.

HC-GRPO precisely targets the latter part of chain-of-thought (CoT) solutions by splitting reasoning sequences into an initial “hint” and the “completion” steps to be learned. This zoomed-in approach increases reward density and focuses updates on the reasoning process itself, rather than rewarding only the final solution-level answer.

2. Formal Objective and Optimization Procedure

The HC-GRPO objective builds on the Group Relative Policy Optimization formalism, making use of rolled-out completions and groupwise advantage normalization. Let:

  • $I, Q$ denote the table image and original question.
  • $S = [s_1, s_2, \ldots, s_m]$ is an expanded chain-of-thought solution.
  • $j \sim \mathrm{Uniform}\{1, \ldots, m-1\}$ is the split index.
  • $\text{Hints} = [s_1, \ldots, s_j]$, $\text{Comps} = [s_{j+1}, \ldots, s_m]$.
  • $Q_r = \text{concatenate}(Q, \text{Hints})$ is the hint-augmented question; $S_r = \text{Comps}$ is the residual completion target.

For each training tuple, $G$ rollouts $\{S_r^i\}_{i=1}^G$ are sampled from the current policy $\pi_\theta(\cdot \mid I, Q_r)$. Rewards are computed as $R^i = R_{\text{acc}}^i + R_{\text{format}}^i$, where (a minimal sketch of both checks follows the list):

  • $R_{\text{acc}}^i = 1$ if the model answer $MA$ matches the ground-truth answer $GA$; $0$ otherwise.
  • $R_{\text{format}}^i = 1$ if the reasoning is wrapped in <think>...</think> tags and the answer in <answer>...</answer> tags; $0$ otherwise.
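A minimal sketch of these two reward checks, assuming rollouts follow the <think>/<answer> tag format described above; the helper names, regular expressions, and exact-string answer comparison are illustrative assumptions rather than details taken from the paper.

import re

THINK_ANSWER_RE = re.compile(r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(rollout: str) -> float:
    # 1.0 if the rollout is wrapped as <think>...</think><answer>...</answer>, else 0.0
    return 1.0 if THINK_ANSWER_RE.match(rollout) else 0.0

def accuracy_reward(rollout: str, ground_truth: str) -> float:
    # 1.0 if the extracted answer exactly matches the ground truth, else 0.0
    m = ANSWER_RE.search(rollout)
    answer = m.group(1).strip() if m else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(rollout: str, ground_truth: str) -> float:
    # R^i = R_acc^i + R_format^i, so rewards take values in {0, 1, 2}
    return accuracy_reward(rollout, ground_truth) + format_reward(rollout)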

Groupwise normalized advantage:

$$\hat{A}^i = \frac{R^i - \mathrm{mean}(\{R^k\})}{\mathrm{std}(\{R^k\})}$$
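A short sketch of this group-relative normalization, assuming the $G$ rollout rewards for one hint-augmented question are already collected in a list; the epsilon guard against a zero standard deviation and the use of the population standard deviation are implementation assumptions not specified in the source.

import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    # Normalize each reward by the mean and std of its own rollout group
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: G = 4 rollouts of one hint-completion tuple, rewards in {0, 1, 2}
print(group_relative_advantages([2.0, 1.0, 0.0, 2.0]))

If all rollouts in a group receive the same reward, the advantages collapse toward zero, which is why the within-group reward variance discussed in Section 4 matters.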

The optimization objective (clipped surrogate with a KL penalty), maximized with respect to $\theta$:

$$\mathcal{L}_{\mathrm{HC-GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^G \min\Bigl( r_i(\theta)\,\hat{A}^i,\; \mathrm{clip}\bigl(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}^i \Bigr) - \beta\, D_{\mathrm{KL}}\bigl[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr]$$

where $r_i(\theta) = \dfrac{\pi_\theta(S_r^i \mid I, Q_r)}{\pi_{\text{old}}(S_r^i \mid I, Q_r)}$ and $\pi_{\text{ref}}$ is a frozen copy of the policy at the start of HC-GRPO. Typical hyperparameters are $\epsilon = 0.2$ and $\beta = 0.04$.
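The following PyTorch-style sketch illustrates this objective for one rollout group, assuming the per-rollout sequence log-probabilities under the current, old, and reference policies have already been computed; the sample-based KL estimator used here is a common approximation and an assumption, not the paper's stated estimator.

import torch

def hc_grpo_loss(logp_new: torch.Tensor,   # (G,) log pi_theta(S_r^i | I, Q_r)
                 logp_old: torch.Tensor,   # (G,) log pi_old(S_r^i | I, Q_r)
                 logp_ref: torch.Tensor,   # (G,) log pi_ref(S_r^i | I, Q_r)
                 advantages: torch.Tensor, # (G,) group-normalized advantages
                 eps: float = 0.2,
                 beta: float = 0.04) -> torch.Tensor:
    # Negative clipped-surrogate objective with a KL penalty toward pi_ref
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages).mean()
    # Sample-based estimate of KL(pi_theta || pi_ref)
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1.0).mean()
    return -(surrogate - beta * kl)  # minimize the negative of the objective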

3. Hint-Guided Question Mechanism

The hint-completion configuration is constructed as follows:

  • Each ground-truth solution chain is expanded (if brief) using GPT-4o to produce a detailed stepwise chain of length $m$.
  • A random split index $j$ ($1 \leq j < m$) yields “hint” steps $[s_1, \ldots, s_j]$ and target “completion” steps $[s_{j+1}, \ldots, s_m]$.
  • The hint is appended to the original question to form the input $Q_r$; the model is then required to output only the residual steps $S_r$.

This decomposition ensures reward signals focus on the remaining reasoning rather than duplicating assessment of previously solved substeps.
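A minimal sketch of this construction for a single expanded chain, assuming the chain is available as a list of step strings; the way the hint is textually appended to the question (joining steps with newlines under a "Partial reasoning" marker) is an assumption for illustration, not a format specified in the source.

import random

def make_hint_completion(question: str, steps: list[str]) -> tuple[str, list[str]]:
    # Split a chain-of-thought into a hint prefix and a residual completion target
    m = len(steps)
    assert m >= 2, "need at least two steps to split"
    j = random.randint(1, m - 1)             # j ~ Uniform{1, ..., m-1}
    hint, completion = steps[:j], steps[j:]
    q_r = question + "\nPartial reasoning:\n" + "\n".join(hint)  # hint-augmented question Q_r
    return q_r, completion                   # completion is the residual target S_r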

4. Reward Design and Sparsity Mitigation

HC-GRPO overcomes reward sparsity and binary reward granularity by densely splitting solution chains into multiple hint-completion pairs per sample (up to three splits per CoT). While $R_{\text{acc}}$ and $R_{\text{format}}$ are individually binary, residual step-level completions admit a higher likelihood of partial correctness in the $G$ parallel rollouts. This augments reward variance in $\{R^i\}$, supports meaningful policy gradients, and results in a denser RL signal. Group-relative advantage normalization further amplifies reward differences, facilitating robust and stable learning.
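A short sketch of how one expanded chain can yield several hint-completion training tuples (up to three per chain, per the description above); the sampling of distinct split indices and the tuple format mirror the single-split sketch in Section 3 and are illustrative assumptions.

import random

def expand_to_tuples(question: str, steps: list[str], max_splits: int = 3) -> list[tuple[str, list[str]]]:
    # Build up to `max_splits` distinct hint-completion tuples from one chain-of-thought
    m = len(steps)
    split_indices = random.sample(range(1, m), k=min(max_splits, m - 1))
    tuples = []
    for j in split_indices:
        q_r = question + "\nPartial reasoning:\n" + "\n".join(steps[:j])
        tuples.append((q_r, steps[j:]))      # (hint-augmented question, residual completion)
    return tuples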

5. Specialized Pseudocode for HC-GRPO

The HC-GRPO training loop is as follows:

Input:
  π_θ    -- current policy model (after warm-up and PA-GRPO)
  π_ref  -- frozen copy of π_θ at the start of HC-GRPO
  D_r    -- dataset of (I, Q_r, S_r) hint-completion tuples
  G      -- number of rollouts per tuple (e.g. 4)
  ε, β   -- PPO clip range and KL-weight hyperparameters

for epoch = 1...N do
  for each batch (I, Q_r, S_r) in D_r do
    π_old = frozen snapshot of π_θ           # behavior policy for this batch
    GA    = extract_answer(S_r)              # ground-truth final answer
    # Generate G completions in parallel
    {Ŝ_r^i} = { sample π_old(·|I, Q_r) for i = 1..G }
    # Compute rewards
    for i = 1..G:
      MA^i    = extract_answer(Ŝ_r^i)
      R_acc^i = 1 if MA^i == GA else 0
      R_fmt^i = 1 if check_format(Ŝ_r^i) else 0
      R^i     = R_acc^i + R_fmt^i
    μ, σ = mean({R^i}), std({R^i})
    A^i  = (R^i - μ) / σ
    # Compute PPO-style loss (negative clipped surrogate plus KL penalty)
    L = 0
    for i = 1..G:
      r_i = π_θ(Ŝ_r^i|I,Q_r) / π_old(Ŝ_r^i|I,Q_r)
      surrogate = min( r_i * A^i, clip(r_i, 1-ε, 1+ε) * A^i )
      L += surrogate
    L = -(L/G) + β * D_KL[π_θ || π_ref]
    # Gradient step
    θ ← θ - lr · ∇_θ L
  end for
end for

return π_θ

6. Empirical Findings and Performance Impact

Empirical results highlight HC-GRPO's significance within Table-R1:

Key ablation effects:

Dataset      Table-R1    w/o HC-GRPO    Δ (pp)
Avg. QA_I    51.4 %      36.3 %         –19.7
TabFact_I    60.9 %      45.8 %         –15.1
  • Removal of HC-GRPO causes substantial reductions in final QA accuracy (−19.7 pp held-in QA, −15.1 pp TabFact).
  • Training with standard GRPO at the solution level reaches only ~76.4 % TabMWP; using hint-completion splits and HC-GRPO raises accuracy to ~83.0 %.

Data density and parameter effects:

  • Increasing hint-completion splits from 1 to 4 improves TabMWP accuracy from 81.5 % to 83.4 %.
  • Raising $G$ (rollout group size) from 2 to 6 shows diminishing returns (peak at ~83.5 % for $G = 6$); $G = 4$ is an effective trade-off.
  • Convergence is rapid, with 2–3 epochs sufficient for HC-GRPO due to dense feedback and normalized advantages.

A plausible implication is that hint-based feedback with group-relative normalization effectively mitigates reward sparsity and initialization bottlenecks found in RL for table reasoning.

7. Significance Within Multimodal Table Understanding

HC-GRPO constitutes the final, third stage of the Table-R1 pipeline, moving beyond standard RL approaches by reframing the learning task from solution-level to residual-step completion. This reconfiguration supplies fine-grained learning signals that alleviate sparsity and improve reasoning depth. Experimental validation demonstrates that HC-GRPO delivers outsized improvements in step-by-step table reasoning, driving Table-R1-trained models (e.g., Qwen2-VL-7B) to outperform larger specialized table models and approach state-of-the-art closed-source systems across multiple benchmarks. These findings confirm that RL frameworks benefit from isolating and optimizing residual reasoning tasks in complex, structured information domains (Kang et al., 21 Sep 2025).
