Citation-Aware Group Relative Policy Optimization
- The framework introduces Citation-aware Rubric Rewards (CaRR), a reward decomposition that integrates outcome correctness with citation fidelity to enhance evidence-grounded decision making.
- It employs group-relative policy updates with stable PPO-style clipping, demonstrating significant improvements in factuality and citation quality over baselines.
- Empirical results in deep search and legal reasoning contexts show robust gains, with up to +74% relative citation-F improvement validated against standard benchmarks.
Citation-aware Group Relative Policy Optimization (C-GRPO) is a reinforcement learning (RL) framework designed for the robust alignment of LLM agents in tasks where factual grounding, comprehensive reasoning, and citation fidelity are critical, particularly in domains requiring evidence chains or legal citation accuracy. C-GRPO extends standard Group-Relative Policy Optimization (GRPO) by integrating a fine-grained, context-sensitive reward architecture—specifically Citation-aware Rubric Rewards (CaRR)—that evaluates rollouts not only on final outcome correctness, but also on the explicit identification and citation of supporting entities and the construction of complete evidence chains. Empirical evidence across deep search and retrieval-augmented question answering indicates that C-GRPO yields significant improvements in factuality and citation quality relative to traditional outcome-based RL or instruction tuning baselines (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).
1. Foundations and Conceptual Architecture
The canonical RL protocol for deep search agents typically employs a binary outcome reward $r^{\text{out}} \in \{0, 1\}$, assigning credit only for answers matching the ground truth. This paradigm fails to capture partial factuality, grounded reasoning, and evidence connectivity. C-GRPO addresses these shortcomings by introducing CaRR, a reward decomposition that attributes real-valued rubric scores to agent rollouts. CaRR evaluates the trajectory along three axes: identification of hidden entities, citation grounding (verifiable support for each claim), and construction of a connected evidence chain linking intermediate facts to the answer. Only successful rollouts receive the mixed reward, which combines outcome correctness and the normalized rubric score:

$$r_i = r^{\text{out}}_i + \mathbb{1}\!\left[r^{\text{out}}_i = 1\right] \cdot \alpha \, \tilde{s}_i,$$

where $\tilde{s}_i$ is the group-wise normalized rubric score and $\alpha$ modulates rubric emphasis (Zhang et al., 9 Jan 2026).
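A minimal sketch of this mixed-reward computation follows, assuming binary outcome rewards, raw rubric scores in [0, 1], and z-score normalization within the rollout group; the normalization scheme and the default $\alpha$ are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def carr_mixed_rewards(outcomes, rubric_scores, alpha=0.5, eps=1e-8):
    """Combine binary outcome correctness with group-normalized rubric scores.

    outcomes:      array of 0/1 outcome rewards for the G rollouts of one query.
    rubric_scores: array of raw rubric scores in [0, 1] (entity identification,
                   citation grounding, evidence-chain completeness).
    alpha:         rubric emphasis weight (illustrative default, not from the paper).
    """
    outcomes = np.asarray(outcomes, dtype=float)
    rubric_scores = np.asarray(rubric_scores, dtype=float)

    # Normalize rubric scores within the rollout group (z-score is an assumption).
    s_tilde = (rubric_scores - rubric_scores.mean()) / (rubric_scores.std() + eps)

    # Only successful rollouts receive the rubric bonus; failed rollouts keep 0.
    return outcomes + (outcomes > 0) * alpha * s_tilde
```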
In Thai legal reasoning contexts, the reward is further refined to target explicit citation fidelity via the Citation-F metric, format adherence, and non-hallucination checks. Here, C-GRPO uses a multi-component scalar reward:

$$r = r_{\text{cite}} + r_{\text{ans}},$$

with $r_{\text{cite}}$ aggregating XML format checks, ground-truth citation overlap, and non-hallucination, and $r_{\text{ans}}$ derived via semantic similarity or coverage/consistency scoring (Akarajaradwong et al., 13 Jul 2025).
2. Mathematical Formulation and Training Objective
For a group of $G$ trajectories per query $q$, denote the trajectories as $\{\tau_1, \dots, \tau_G\}$, each assigned a scalar reward $r_i$. The token-level importance-sampling ratio and group-normalized advantage are computed as:

$$\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, \qquad \hat{A}_i = \frac{r_i - \mu_r}{\sigma_r},$$

where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of rewards over the group.
The C-GRPO objective applies clipped PPO updates averaged over group and token dimensions, masking out environment tokens:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{\sum_t m_{i,t}} \sum_{t} m_{i,t} \, \min\!\left( \rho_{i,t}\,\hat{A}_i,\; \operatorname{clip}\!\left(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_i \right) \right],$$

where $m_{i,t} \in \{0,1\}$ masks out environment (tool-response) tokens.
This architecture, with per-group baseline subtraction (the group-mean reward serving as the baseline in (Akarajaradwong et al., 13 Jul 2025)), stabilizes gradient estimates and reduces variance compared to global baselines.
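A compact sketch of this update for a single rollout group is given below, assuming per-token log-probabilities under the current and rollout policies and a loss mask that zeroes out environment tokens; tensor names and shapes are illustrative.

```python
import torch

def c_grpo_loss(logp_new, logp_old, rewards, loss_mask, clip_eps=0.2):
    """Clipped, group-relative policy loss for one query group.

    logp_new, logp_old: [G, T] token log-probs under current / rollout policy.
    rewards:            [G]    scalar rewards (outcome + citation rubric).
    loss_mask:          [G, T] 1 for policy-generated tokens, 0 for environment tokens.
    """
    # Group-normalized advantage: subtract the group mean, divide by group std.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)        # [G]
    adv = adv.unsqueeze(1)                                           # [G, 1], broadcast over tokens

    # Token-level importance ratio and PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)                           # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = torch.min(unclipped, clipped) * loss_mask

    # Average over generated tokens per trajectory, then over the group; negate for ascent.
    per_traj = per_token.sum(dim=1) / loss_mask.sum(dim=1).clamp(min=1)
    return -per_traj.mean()
```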
3. Training Protocol and Implementation Details
C-GRPO agents utilize backbone models such as Qwen3-4B, Qwen3-30B (for deep search), and Qwen2.5-7B-Instruct or Thai-CPT variants (for legal reasoning). Training involves cold-start supervised fine-tuning (SFT), followed by RL with C-GRPO over batched rollout groups (e.g., batch size 128, context windows up to 128k tokens). The rubric weight $\alpha$ is ablated to identify the optimal balance; an intermediate setting performs best, while excessive rubric emphasis (large $\alpha$) detracts from outcome correctness (Zhang et al., 9 Jan 2026).
Legal QA applications employ the WangchanX-Legal-ThaiCCL-RAG dataset spanning 8,000+ examples, with BGE-M3 embedding-based retrieval (Akarajaradwong et al., 13 Jul 2025). Rollouts are sampled in groups of multiple responses per prompt, with LoRA adapters (rank 256), AdamW optimization, and bfloat16 precision.
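For orientation, the reported settings can be gathered into a single configuration sketch; values not stated above (learning rate, rollout group size, rubric weight) are placeholders, not figures from the papers.

```python
# Configuration sketch; placeholder values are marked as assumptions.
deep_search_config = dict(
    backbone="Qwen3-4B",             # or Qwen3-30B
    stages=["cold-start SFT", "C-GRPO RL"],
    batch_size=128,
    max_context_tokens=128_000,
    rollout_group_size=8,            # assumption: group size not specified above
    rubric_weight_alpha=0.5,         # assumption: ablated; intermediate values work best
)

legal_qa_config = dict(
    backbone="Qwen2.5-7B-Instruct",  # or Thai-CPT variants
    dataset="WangchanX-Legal-ThaiCCL-RAG",
    retriever="BGE-M3",
    lora_rank=256,
    optimizer="AdamW",
    precision="bfloat16",
    learning_rate=1e-5,              # assumption: not specified above
)
```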
Judge LLMs (e.g., DeepSeek-v3.2) are used for rubric scoring and, in legal tasks, for coverage/consistency evaluation. When semantic proxies are used, BGE-M3 embeddings offer cost-efficient similarity rewards with computational savings over LLM judges.
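A minimal sketch of the embedding-based answer-quality proxy is shown below, assuming BGE-M3 is loaded through sentence-transformers; the loading path, clipping, and scaling are illustrative and may differ from the papers' exact scoring pipelines.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: BGE-M3 loaded via sentence-transformers; any dense encoder would work here.
_encoder = SentenceTransformer("BAAI/bge-m3")

def semantic_answer_reward(generated_answer: str, reference_answer: str) -> float:
    """Cosine similarity between generated and reference answers, clipped to [0, 1]."""
    emb = _encoder.encode([generated_answer, reference_answer], normalize_embeddings=True)
    cosine = float(np.dot(emb[0], emb[1]))
    return max(0.0, cosine)  # negative similarity contributes no reward
```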
4. Citation-aware Reward Engineering
Citation-aware rewards integrate the following components:
- Format adherence: binary check for the required XML tags (e.g., `<reasoning>`, `<answer>`, `<citation>`).
- Non-hallucination: partial reward (0.5) if all cited entities appear in the retrieved context.
- Citation-F: strict overlap metric between model-generated and gold citations (see the code sketch below),

  $$\text{Citation-F} = \frac{2\, P_c R_c}{P_c + R_c}, \qquad P_c = \frac{|C_{\text{pred}} \cap C_{\text{gold}}|}{|C_{\text{pred}}|}, \quad R_c = \frac{|C_{\text{pred}} \cap C_{\text{gold}}|}{|C_{\text{gold}}|}.$$

  This term both shapes the policy during training and serves as the key evaluation metric.
- Answer quality: semantic similarity (cosine over embeddings) or LLM-based coverage/consistency scoring.
Only rollouts passing format and non-hallucination thresholds receive nonzero citation-F reward, ensuring the agent must satisfy structural and factual constraints before earning fidelity credit (Akarajaradwong et al., 13 Jul 2025).
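A minimal sketch of this gated composition follows, assuming citations are compared as normalized string sets; the partial-credit weights and whether the answer-quality term is also gated are illustrative assumptions.

```python
def citation_f(predicted: set[str], gold: set[str]) -> float:
    """F-measure over the strict overlap of predicted and gold citation sets."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def citation_aware_reward(has_valid_xml: bool,
                          cited: set[str],
                          retrieved: set[str],
                          gold: set[str],
                          answer_quality: float) -> float:
    """Gated reward composition (sketch): structural and factual checks must pass
    before citation fidelity contributes; answer_quality in [0, 1] comes from the
    semantic or coverage/consistency scorer."""
    if not has_valid_xml:
        return 0.0                      # format gate: no credit without required tags
    reward = 0.0
    grounded = cited <= retrieved       # all cited entities appear in retrieved context
    if grounded:
        reward += 0.5                   # non-hallucination partial reward
        reward += citation_f(cited, gold)  # citation-F only after both gates pass
    # Assumption: answer quality is added regardless of the citation gates.
    return reward + answer_quality
```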
5. Empirical Benchmarking and Performance
Empirical results across benchmarks demonstrate that C-GRPO produces superior citation fidelity and correctness relative to baseline RL and instruction tuning. On deep search tasks (BrowseComp at 64k and 128k tokens) (Zhang et al., 9 Jan 2026):
| Model | BrowseComp@64k | BrowseComp@128k |
|---|---|---|
| DeepDive-4B-SFT | 7.7 | 14.1 |
| + GRPO | 12.9 | 14.7 |
| + C-GRPO | 13.9 | 17.5 |
| DeepDive-30B-SFT | 12.2 | 20.5 |
| + GRPO | 16.0 | 18.9 |
| + C-GRPO | 17.9 | 24.8 |
For Thai legal reasoning, C-GRPO with the semantic reward achieves up to +90% citation-F gains and +31% joint-quality improvements over SFT (Akarajaradwong et al., 13 Jul 2025):
| Model + Reward | Citation-F | Coverage | Consistency | Joint Score |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct (baseline) | 0.4103 | 0.5908 | 0.8402 | 0.6138 |
| + LoRA SFT | 0.5691 (+39%) | 0.5832 | 0.8341 | 0.6622 (+8%) |
| + LoRA GRPO (cov/cons) | 0.6796 (+66%) | 0.6322 | 0.8598 | 0.7239 (+18%) |
| + LoRA GRPO (semantic) | 0.7146 (+74%) | 0.7197 | 0.8232 | 0.7525 (+23%) |
C-GRPO’s improvements persist under out-of-domain generalization: on the NitiBench-Tax test set, GRPO variants substantially mitigate the degradation observed after SFT and yield higher citation-F (up to +66%).
6. Behavior Shaping: Shortcuts, Grounding, and Robustness
Outcome-only RL (e.g., standard GRPO or PPO) incentivizes shortcut exploitation: agents minimize tool calls and leave evidence chains incomplete. C-GRPO reverses this trend: agents increase citation gathering and rubric satisfaction over training, as evidenced by tool-call statistics and qualitative case analyses (Zhang et al., 9 Jan 2026).
C-GRPO’s rubric-satisfying protocol produces more comprehensive chains: agents identify all relevant hidden entities and cite supporting URLs with high fidelity. In contrast, outcome-only baselines often halt at the final answer without verifying intermediate constraints.
The reward decomposition across format adherence, non-hallucination, and citation-F ensures agents cannot exploit a single criterion or “game” the system, focusing model updates on holistic, evidence-grounded reasoning (Akarajaradwong et al., 13 Jul 2025).
7. Limitations, Constraints, and Future Research Directions
C-GRPO’s rubric-based reward design leverages the compositional structure of multi-hop/synthetic QA; its applicability to open-ended, unstated-requirement QA is less direct (Zhang et al., 9 Jan 2026). Static rubric generation can limit adaptability; online evolution of rubrics via self-play or contrastive rollouts constitutes a plausible avenue for future work.
Reward assignment accuracy is bounded by the reliability of judge LLMs or embedding proxies. Model errors in entity identification or citation support can introduce noise or bias.
Potential extensions include human-in-the-loop rubric refinement, broader agentic task adaptation (e.g., code synthesis, planning), and further cost reduction via improved semantic proxies or data filtering (Akarajaradwong et al., 13 Jul 2025).
C-GRPO establishes a rigorous foundation for RL-fine-tuning of evidence-driven LLM agents, achieving demonstrable gains in reasoning fidelity and citation accuracy across deep search and legal domains by integrating comprehensive, context-structured rewards with stable, group-based updates (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).