
Citation-Aware Group Relative Policy Optimization

Updated 12 January 2026
  • The paper introduces a citation-aware rubric reward (CaRR) decomposition that integrates outcome correctness with citation fidelity to enhance evidence-grounded decision making.
  • It employs group-level policy updates and stable PPO clipping, demonstrating significant improvements in factuality and citation quality over baselines.
  • Empirical results in deep search and legal reasoning contexts show robust gains, with up to +74% citation-F1 improvements validated against standard benchmarks.

Citation-aware Group Relative Policy Optimization (C-GRPO) is a reinforcement learning (RL) framework designed for the robust alignment of LLM agents in tasks where factual grounding, comprehensive reasoning, and citation fidelity are critical, particularly in domains requiring evidence chains or legal citation accuracy. C-GRPO extends standard Group-Relative Policy Optimization (GRPO) by integrating a fine-grained, context-sensitive reward architecture—specifically Citation-aware Rubric Rewards (CaRR)—that evaluates rollouts not only on final outcome correctness, but also on the explicit identification and citation of supporting entities and the construction of complete evidence chains. Empirical evidence across deep search and retrieval-augmented question answering indicates that C-GRPO yields significant improvements in factuality and citation quality relative to traditional outcome-based RL or instruction tuning baselines (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).

1. Foundations and Conceptual Architecture

The canonical RL protocol for deep search agents typically employs a binary outcome reward $R_o \in \{0, 1\}$, assigning credit only for answers matching the ground truth. This paradigm fails to capture partial factuality, grounded reasoning, and evidence connectivity. C-GRPO addresses these shortcomings by introducing CaRR, a reward decomposition that attributes real-valued rubric scores $R_r \in [0, 1]$ to agent rollouts. CaRR evaluates the trajectory along three axes: identification of hidden entities, citation grounding (verifiable support for each claim), and construction of a connected evidence chain linking intermediate facts to the answer. Only successful rollouts receive the mixed reward, which combines outcome correctness and the normalized rubric score:

$$R_i = (1 - \alpha)\, R_o^{(i)} + \alpha\, R_o^{(i)} \hat{R}_r^{(i)}$$

where $\hat{R}_r^{(i)}$ is the group-wise normalized rubric score and $\alpha \in [0, 1]$ modulates rubric emphasis (Zhang et al., 9 Jan 2026).
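A minimal Python sketch of this mixed reward, assuming min-max normalization of the rubric scores over the rollout group (the source only states that the rubric term is group-wise normalized, so the exact normalization scheme here is an assumption):

```python
import numpy as np

def carr_mixed_reward(outcome, rubric, alpha=0.3, eps=1e-8):
    """Sketch of the CaRR mixed reward R_i = (1-a)*R_o + a*R_o*R_hat_r.

    outcome: array of binary outcome rewards R_o in {0, 1}, one per rollout.
    rubric:  array of raw rubric scores R_r in [0, 1], one per rollout.
    alpha:   rubric emphasis; 0.3 is the best-performing value reported.
    The min-max group normalization below is an assumption.
    """
    outcome = np.asarray(outcome, dtype=float)
    rubric = np.asarray(rubric, dtype=float)
    r_hat = (rubric - rubric.min()) / (rubric.max() - rubric.min() + eps)
    # Rubric credit is gated by outcome correctness: only successful
    # rollouts (R_o = 1) receive the rubric bonus.
    return (1 - alpha) * outcome + alpha * outcome * r_hat
```

Under this sketch with $\alpha = 0.3$, a correct rollout with the highest rubric score in its group receives reward 1.0, while a correct rollout with the lowest rubric score receives 0.7; incorrect rollouts receive 0.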

In Thai legal reasoning contexts, the reward is further refined to target explicit citation fidelity via the citation-F1 metric, format adherence, and non-hallucination checks. Here, C-GRPO uses a multi-component scalar reward:

$$R(s, a) = \alpha \cdot R_{\text{citation}}(s, a) + \beta \cdot R_{\text{ans}}(s, a)$$

with $R_{\text{citation}}$ aggregating XML format checks, ground-truth citation overlap, and non-hallucination, and $R_{\text{ans}}$ derived via semantic similarity or coverage/consistency scoring (Akarajaradwong et al., 13 Jul 2025).
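As a sketch, the top-level combination is a simple weighted sum; the weights below are illustrative placeholders, and the component rewards are assumed to be precomputed scalars (their construction is detailed in Section 4):

```python
def legal_total_reward(r_citation, r_answer, alpha=0.5, beta=0.5):
    """Top-level shape of the legal-reasoning reward
    R(s, a) = alpha * R_citation + beta * R_ans.
    The weights 0.5/0.5 are illustrative only; the source summary does
    not state the values used in the paper."""
    return alpha * r_citation + beta * r_answer
```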

2. Mathematical Formulation and Training Objective

For a group of $G$ trajectories per query, denote the trajectories as $\{\mathcal{H}_1, \ldots, \mathcal{H}_G\}$, each assigned a scalar reward $R_i$. The token-level importance sampling ratio and group-normalized advantage are computed as:

$$\rho_{i, j} = \frac{\pi_\theta(\mathcal{H}_{i, j} \mid q, \mathcal{H}_{i, 1:j-1})}{\pi_{\theta_{\text{old}}}(\mathcal{H}_{i, j} \mid q, \mathcal{H}_{i, 1:j-1})}$$

$$\hat{A}_{i, j} = \frac{R_i - \mu_R}{\sigma_R}$$

where $\mu_R$ and $\sigma_R$ are the mean and standard deviation of rewards over the group.

The C-GRPO objective applies clipped PPO updates over group-token averages, masking out environment tokens:

$$\mathcal{J}(\theta) = \mathbb{E} \left[ \frac{1}{\sum_{i, j} I(\mathcal{H}_{i, j})} \sum_{i=1}^{G} \sum_{j=1}^{|\mathcal{H}_i|} I(\mathcal{H}_{i, j}) \, \min \left( \rho_{i, j} \hat{A}_{i, j},\; \operatorname{clip}(\rho_{i, j}, 1-\epsilon, 1+\epsilon) \, \hat{A}_{i, j} \right) \right]$$

This architecture, with per-group baseline subtraction ($b_i$ in (Akarajaradwong et al., 13 Jul 2025)), stabilizes gradient estimates and reduces variance compared to global baselines.
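A compact PyTorch-style sketch of this objective (expressed as a loss to minimize), assuming per-token log-probabilities have already been gathered for each rollout; the tensor shapes and masking convention are assumptions consistent with the formulation above:

```python
import torch

def c_grpo_loss(logp_new, logp_old, rewards, mask, eps_clip=0.2):
    """Minimal sketch of the clipped C-GRPO objective (loss = -J(theta)).

    logp_new: (G, T) log-probs of rollout tokens under the current policy.
    logp_old: (G, T) log-probs under the rollout policy (detached, no grad).
    rewards:  (G,) scalar mixed rewards R_i, one per trajectory in the group.
    mask:     (G, T) indicator I(H_ij): 1 for agent-generated tokens, 0 for
              environment/tool tokens that must not receive gradient.
    """
    mask = mask.to(logp_new.dtype)

    # Group-normalized advantage, broadcast to every token of trajectory i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1)  # (G, 1) -> broadcast over tokens

    # Token-level importance sampling ratio rho_ij.
    ratio = torch.exp(logp_new - logp_old)

    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
    per_token = torch.minimum(unclipped, clipped) * mask

    # Average over all unmasked tokens in the group; negate for descent.
    return -per_token.sum() / mask.sum().clamp_min(1.0)
```

Masking environment and tool-response tokens ensures the policy gradient flows only through tokens the agent itself generated.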

3. Training Protocol and Implementation Details

C-GRPO agents utilize backbone models such as Qwen3-4B and Qwen3-30B (for deep search), and Qwen2.5-7B-Instruct or Thai-CPT variants (for legal reasoning). Training involves cold-start supervised fine-tuning (SFT), followed by RL with C-GRPO over batched rollout groups (e.g., $G = 16$, batch size 128, learning rate $2 \times 10^{-6}$, context window up to 128k tokens). Rubric weights $\alpha$ are ablated to identify the optimal balance, peaking at $\alpha = 0.3$; excessive rubric emphasis ($\alpha > 0.5$) can detract from correctness (Zhang et al., 9 Jan 2026).
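The reported deep-search hyperparameters can be summarized in a configuration sketch; only the values quoted above come from the source, while the field names and the clip value are illustrative:

```python
# Hedged sketch of the deep-search C-GRPO run configuration; field names
# are illustrative and not tied to any specific training framework.
deep_search_config = {
    "backbone": "Qwen3-4B",          # or "Qwen3-30B"
    "stages": ["cold_start_sft", "c_grpo_rl"],
    "group_size": 16,                # G rollouts per query
    "batch_size": 128,
    "learning_rate": 2e-6,
    "max_context_tokens": 128_000,
    "rubric_alpha": 0.3,             # best ablated value; alpha > 0.5 hurt correctness
    "clip_epsilon": 0.2,             # assumed standard PPO clip value (not stated)
}
```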

Legal QA applications employ the WangchanX-Legal-ThaiCCL-RAG dataset spanning 8,000+ examples, with BGE-M3 embedding-based retrieval (Akarajaradwong et al., 13 Jul 2025). Rollout groups typically use $K = 10$ responses per prompt, with LoRA adapters (rank 256), AdamW optimization, and bfloat16 precision.
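A hedged sketch of the corresponding adapter setup using Hugging Face PEFT; only the LoRA rank (256), AdamW, and bfloat16 are stated above, so the remaining fields are assumptions for a Qwen2.5-style decoder:

```python
import torch
from peft import LoraConfig

# Only r=256, AdamW, and bfloat16 are stated in the source; the other
# fields below are illustrative assumptions.
lora_cfg = LoraConfig(
    r=256,
    lora_alpha=512,                  # assumed (2x rank); not given in the source
    lora_dropout=0.05,               # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
train_dtype = torch.bfloat16         # stated: bfloat16 precision
optimizer_name = "adamw_torch"       # stated: AdamW optimization
```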

Judge LLMs (e.g., DeepSeek-v3.2) are used for rubric scoring and, in legal tasks, for coverage/consistency evaluation. When semantic proxies are used, BGE-M3 embeddings offer cost-efficient similarity rewards with $> 2.5\times$ computational savings over LLM judges.
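When an embedding proxy replaces the judge LLM for answer quality, the reward reduces to a cosine-similarity computation. The sketch below treats the encoder as a black-box `embed_fn` (e.g., a BGE-M3 encoder); the rescaling to [0, 1] is an assumption:

```python
import numpy as np

def semantic_answer_reward(pred_answer, gold_answer, embed_fn):
    """Embedding-based answer reward sketch (cosine similarity).

    embed_fn is a placeholder for any sentence-embedding model; the exact
    model call and any score rescaling used in the paper are not specified.
    """
    a = np.asarray(embed_fn(pred_answer), dtype=float)
    b = np.asarray(embed_fn(gold_answer), dtype=float)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # Map cosine similarity from [-1, 1] into a [0, 1] reward range.
    return 0.5 * (cos + 1.0)
```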

4. Citation-aware Reward Engineering

Citation-aware rewards integrate the following components:

  • Format adherence: Binary check for required XML tags (e.g., <reasoning>, <answer>, <citation>); reward $R_{\text{format}}(s, a) \in \{0, 1\}$.
  • Non-hallucination: Partial reward ($0.5$) if all cited entities appear in the retrieved context.
  • Citation-F1: Strict overlap metric between model-generated and gold citations:

$$P = \frac{|C \cap G|}{|C|}, \quad R = \frac{|C \cap G|}{|G|}, \quad F_1 = \frac{2PR}{P + R}$$

This term both shapes policy during training and acts as the key evaluation metric.

  • Answer quality: Semantic similarity (cosine over embeddings) or LLM-based coverage/consistency scoring ($\in \{0, 0.5, 1.0\}$).

Only rollouts passing the format and non-hallucination thresholds receive a nonzero citation-F1 reward, ensuring the agent must satisfy structural and factual constraints before earning fidelity credit (Akarajaradwong et al., 13 Jul 2025).
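A minimal Python sketch of this gated aggregation, assuming citation identifiers are comma-separated inside the <citation> tag and that a passing rollout's reward is the sum of the 0.5 non-hallucination credit and a weighted citation-F1 term (the source does not pin down the exact aggregation or parsing):

```python
import re

def citation_reward(response, retrieved_ids, gold_ids):
    """Sketch of the gated citation reward described above. The tag names,
    the 0.5 non-hallucination credit, and the gating order follow the text;
    the parsing details and final weighting are assumptions."""
    # 1. Format adherence: all required XML tags must be present.
    has_format = all(
        re.search(fr"<{tag}>.*?</{tag}>", response, re.DOTALL)
        for tag in ("reasoning", "answer", "citation")
    )
    if not has_format:
        return 0.0

    # 2. Extract cited identifiers from the <citation> block
    #    (assumed comma-separated; real parsing may differ).
    block = re.search(r"<citation>(.*?)</citation>", response, re.DOTALL)
    cited = {c.strip() for c in block.group(1).split(",") if c.strip()}

    # 3. Non-hallucination: every cited entity must appear in the
    #    retrieved context; partial credit of 0.5 for passing this check.
    if not cited or not cited.issubset(retrieved_ids):
        return 0.0
    reward = 0.5

    # 4. Citation F1 against gold citations, computed only for rollouts
    #    that already passed the structural and factual gates.
    gold = set(gold_ids)
    tp = len(cited & gold)
    precision = tp / len(cited)
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return reward + 0.5 * f1   # assumed weighting of the F1 term
```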

5. Empirical Benchmarking and Performance

Empirical results across benchmarks demonstrate that C-GRPO produces superior citation fidelity and correctness relative to baseline RL and instruction tuning. On deep search tasks (BrowseComp at 64k and 128k tokens) (Zhang et al., 9 Jan 2026):

| Model | BrowseComp@64k | BrowseComp@128k |
|---|---|---|
| DeepDive-4B-SFT | 7.7 | 14.1 |
| + GRPO | 12.9 | 14.7 |
| + C-GRPO | 13.9 | 17.5 |
| DeepDive-30B-SFT | 12.2 | 20.5 |
| + GRPO | 16.0 | 18.9 |
| + C-GRPO | 17.9 | 24.8 |

For Thai legal reasoning, C-GRPO with the semantic reward achieves up to +90% citation-F1 gains and +31% joint-quality improvements over SFT (Akarajaradwong et al., 13 Jul 2025):

| Model + Reward | Citation F1 | Coverage | Consistency | Joint Score |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct (baseline) | 0.4103 | 0.5908 | 0.8402 | 0.6138 |
| + LoRA SFT | 0.5691 (+39%) | 0.5832 | 0.8341 | 0.6622 (+8%) |
| + LoRA GRPO (cov/cons) | 0.6796 (+66%) | 0.6322 | 0.8598 | 0.7239 (+18%) |
| + LoRA GRPO (semantic) | 0.7146 (+74%) | 0.7197 | 0.8232 | 0.7525 (+23%) |

C-GRPO’s improvements persist under out-of-domain generalization; for example, on the NitiBench-Tax test set, GRPO variants substantially mitigate SFT degradation and yield higher citation F1 (up to +66%).

6. Behavior Shaping: Shortcuts, Grounding, and Robustness

RL with only a weak outcome reward (as in standard GRPO or PPO) incentivizes shortcut exploitation: agents minimize tool calls and leave evidence chains incomplete. C-GRPO reverses this trend: agents increase citation gathering and rubric satisfaction over training, as evidenced by tool-call statistics and qualitative case analyses (Zhang et al., 9 Jan 2026).

C-GRPO’s rubric-driven training produces more comprehensive evidence chains: agents identify all relevant hidden entities and cite supporting URLs with high fidelity. In contrast, outcome-only baselines often halt at the final answer without verifying intermediate constraints.

Reward decomposition across format, non-hallucination, and citation-F1 criteria ensures agents cannot exploit a single criterion or “game” the system, focusing model updates on holistic, evidence-grounded reasoning (Akarajaradwong et al., 13 Jul 2025).

7. Limitations, Constraints, and Future Research Directions

C-GRPO’s rubric-based reward design leverages the compositional structure of multi-hop/synthetic QA; its applicability to open-ended, unstated-requirement QA is less direct (Zhang et al., 9 Jan 2026). Static rubric generation can limit adaptability; online evolution of rubrics via self-play or contrastive rollouts constitutes a plausible avenue for future work.

Reward assignment accuracy is bounded by the reliability of judge LLMs or embedding proxies. Model errors in entity identification or citation support can introduce noise or bias.

Potential extensions include human-in-the-loop rubric refinement, broader agentic task adaptation (e.g., code synthesis, planning), and further cost reduction via improved semantic proxies or data filtering (Akarajaradwong et al., 13 Jul 2025).


C-GRPO establishes a rigorous foundation for RL-fine-tuning of evidence-driven LLM agents, achieving demonstrable gains in reasoning fidelity and citation accuracy across deep search and legal domains by integrating comprehensive, context-structured rewards with stable, group-based updates (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).
