Citation-Aware Group Relative Policy Optimization
- The framework introduces Citation-aware Rubric Rewards (CaRR), a reward decomposition that integrates outcome correctness with citation fidelity to enhance evidence-grounded decision making.
- It employs group-relative policy updates with stable PPO-style clipping, demonstrating significant improvements in factuality and citation quality over baselines.
- Empirical results in deep search and legal reasoning contexts show robust gains, with up to +74% relative citation-F improvement validated against standard benchmarks.
Citation-aware Group Relative Policy Optimization (C-GRPO) is a reinforcement learning (RL) framework designed for the robust alignment of LLM agents in tasks where factual grounding, comprehensive reasoning, and citation fidelity are critical, particularly in domains requiring evidence chains or legal citation accuracy. C-GRPO extends standard Group-Relative Policy Optimization (GRPO) by integrating a fine-grained, context-sensitive reward architecture—specifically Citation-aware Rubric Rewards (CaRR)—that evaluates rollouts not only on final outcome correctness, but also on the explicit identification and citation of supporting entities and the construction of complete evidence chains. Empirical evidence across deep search and retrieval-augmented question answering indicates that C-GRPO yields significant improvements in factuality and citation quality relative to traditional outcome-based RL or instruction tuning baselines (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).
1. Foundations and Conceptual Architecture
The canonical RL protocol for deep search agents typically employs a binary outcome reward $r^{\text{out}} \in \{0, 1\}$, assigning credit only for answers matching the ground truth. This paradigm fails to capture partial factuality, grounded reasoning, and evidence connectivity. C-GRPO addresses these shortcomings by introducing CaRR, a reward decomposition that attributes real-valued rubric scores to agent rollouts. CaRR evaluates the trajectory along three axes: identification of hidden entities, citation grounding (verifiable support for each claim), and construction of a connected evidence chain linking intermediate facts to the answer. Only successful rollouts receive the mixed reward, which combines outcome correctness and the normalized rubric score:

$$r_i = r^{\text{out}}_i + \mathbb{1}\!\left[r^{\text{out}}_i = 1\right] \cdot \alpha \, \tilde{s}_i,$$

where $\tilde{s}_i$ is the group-wise normalized rubric score and $\alpha$ modulates rubric emphasis (Zhang et al., 9 Jan 2026).
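A minimal sketch of this mixed-reward computation follows, assuming binary outcome rewards, raw rubric scores in [0, 1], and z-score normalization within the rollout group; the normalization scheme and the default $\alpha$ are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def carr_mixed_rewards(outcomes, rubric_scores, alpha=0.5, eps=1e-8):
    """Combine binary outcome correctness with group-normalized rubric scores.

    outcomes:      array of 0/1 outcome rewards for the G rollouts of one query.
    rubric_scores: array of raw rubric scores in [0, 1] (entity identification,
                   citation grounding, evidence-chain completeness).
    alpha:         rubric emphasis weight (illustrative default, not from the paper).
    """
    outcomes = np.asarray(outcomes, dtype=float)
    rubric_scores = np.asarray(rubric_scores, dtype=float)

    # Normalize rubric scores within the rollout group (z-score is an assumption).
    s_tilde = (rubric_scores - rubric_scores.mean()) / (rubric_scores.std() + eps)

    # Only successful rollouts receive the rubric bonus; failed rollouts keep 0.
    return outcomes + (outcomes > 0) * alpha * s_tilde
```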
In Thai legal reasoning contexts, the reward is further refined to target explicit citation fidelity via the Citation-F metric, format adherence, and non-hallucination checks. Here, C-GRPO uses a multi-component scalar reward:

$$r = r_{\text{cite}} + r_{\text{ans}},$$

with $r_{\text{cite}}$ aggregating XML format checks, ground-truth citation overlap, and non-hallucination, and $r_{\text{ans}}$ derived via semantic similarity or coverage/consistency scoring (Akarajaradwong et al., 13 Jul 2025).
2. Mathematical Formulation and Training Objective
For a group of $G$ trajectories per query $q$, denote the trajectories as $\{\tau_1, \dots, \tau_G\}$, each assigned a scalar reward $r_i$. The token-level importance-sampling ratio and group-normalized advantage are computed as:

$$\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, \qquad \hat{A}_i = \frac{r_i - \mu_r}{\sigma_r},$$

where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of rewards over the group.
The C-GRPO objective applies clipped PPO updates averaged over group and token dimensions, masking out environment tokens:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{\sum_t m_{i,t}} \sum_{t} m_{i,t} \, \min\!\left( \rho_{i,t}\,\hat{A}_i,\; \operatorname{clip}\!\left(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_i \right) \right],$$

where $m_{i,t} \in \{0,1\}$ masks out environment (tool-response) tokens.
This architecture, with per-group baseline subtraction (the group-mean reward serving as the baseline in (Akarajaradwong et al., 13 Jul 2025)), stabilizes gradient estimates and reduces variance compared to global baselines.
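A compact sketch of this update for a single rollout group is given below, assuming per-token log-probabilities under the current and rollout policies and a loss mask that zeroes out environment tokens; tensor names and shapes are illustrative.

```python
import torch

def c_grpo_loss(logp_new, logp_old, rewards, loss_mask, clip_eps=0.2):
    """Clipped, group-relative policy loss for one query group.

    logp_new, logp_old: [G, T] token log-probs under current / rollout policy.
    rewards:            [G]    scalar rewards (outcome + citation rubric).
    loss_mask:          [G, T] 1 for policy-generated tokens, 0 for environment tokens.
    """
    # Group-normalized advantage: subtract the group mean, divide by group std.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)        # [G]
    adv = adv.unsqueeze(1)                                           # [G, 1], broadcast over tokens

    # Token-level importance ratio and PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)                           # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = torch.min(unclipped, clipped) * loss_mask

    # Average over generated tokens per trajectory, then over the group; negate for ascent.
    per_traj = per_token.sum(dim=1) / loss_mask.sum(dim=1).clamp(min=1)
    return -per_traj.mean()
```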
3. Training Protocol and Implementation Details
C-GRPO agents utilize backbone models such as Qwen3-4B, Qwen3-30B (for deep search), and Qwen2.5-7B-Instruct or Thai-CPT variants (for legal reasoning). Training involves cold-start supervised fine-tuning (SFT), followed by RL with C-GRPO over batched rollout groups (e.g., batch size 128, context windows up to 128k tokens). The rubric weight $\alpha$ is ablated to identify the optimal balance; an intermediate setting performs best, while excessive rubric emphasis (large $\alpha$) detracts from outcome correctness (Zhang et al., 9 Jan 2026).
Legal QA applications employ the WangchanX-Legal-ThaiCCL-RAG dataset spanning 8,000+ examples, with BGE-M3 embedding-based retrieval (Akarajaradwong et al., 13 Jul 2025). Rollouts are sampled in groups of multiple responses per prompt, with LoRA adapters (rank 256), AdamW optimization, and bfloat16 precision.
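For orientation, the reported settings can be gathered into a single configuration sketch; values not stated above (learning rate, rollout group size, rubric weight) are placeholders, not figures from the papers.

```python
# Configuration sketch; placeholder values are marked as assumptions.
deep_search_config = dict(
    backbone="Qwen3-4B",             # or Qwen3-30B
    stages=["cold-start SFT", "C-GRPO RL"],
    batch_size=128,
    max_context_tokens=128_000,
    rollout_group_size=8,            # assumption: group size not specified above
    rubric_weight_alpha=0.5,         # assumption: ablated; intermediate values work best
)

legal_qa_config = dict(
    backbone="Qwen2.5-7B-Instruct",  # or Thai-CPT variants
    dataset="WangchanX-Legal-ThaiCCL-RAG",
    retriever="BGE-M3",
    lora_rank=256,
    optimizer="AdamW",
    precision="bfloat16",
    learning_rate=1e-5,              # assumption: not specified above
)
```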
Judge LLMs (e.g., DeepSeek-v3.2) are used for rubric scoring and, in legal tasks, for coverage/consistency evaluation. When semantic proxies are used, BGE-M3 embeddings offer cost-efficient similarity rewards with computational savings over LLM judges.
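A minimal sketch of the embedding-based answer-quality proxy is shown below, assuming BGE-M3 is loaded through sentence-transformers; the loading path, clipping, and scaling are illustrative and may differ from the papers' exact scoring pipelines.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: BGE-M3 loaded via sentence-transformers; any dense encoder would work here.
_encoder = SentenceTransformer("BAAI/bge-m3")

def semantic_answer_reward(generated_answer: str, reference_answer: str) -> float:
    """Cosine similarity between generated and reference answers, clipped to [0, 1]."""
    emb = _encoder.encode([generated_answer, reference_answer], normalize_embeddings=True)
    cosine = float(np.dot(emb[0], emb[1]))
    return max(0.0, cosine)  # negative similarity contributes no reward
```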
4. Citation-aware Reward Engineering
Citation-aware rewards integrate the following components:
- Format adherence: binary check for the required XML tags (e.g., `<reasoning>`, `<answer>`, `<citation>`).
- Non-hallucination: partial reward (0.5) if all cited entities appear in the retrieved context.
- Citation-F: strict overlap metric between model-generated and gold citations (see the code sketch below),

  $$\text{Citation-F} = \frac{2\, P_c R_c}{P_c + R_c}, \qquad P_c = \frac{|C_{\text{pred}} \cap C_{\text{gold}}|}{|C_{\text{pred}}|}, \quad R_c = \frac{|C_{\text{pred}} \cap C_{\text{gold}}|}{|C_{\text{gold}}|}.$$

  This term both shapes the policy during training and serves as the key evaluation metric.
- Answer quality: semantic similarity (cosine over embeddings) or LLM-based coverage/consistency scoring.
Only rollouts passing format and non-hallucination thresholds receive nonzero citation-F reward, ensuring the agent must satisfy structural and factual constraints before earning fidelity credit (Akarajaradwong et al., 13 Jul 2025).
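A minimal sketch of this gated composition follows, assuming citations are compared as normalized string sets; the partial-credit weights and whether the answer-quality term is also gated are illustrative assumptions.

```python
def citation_f(predicted: set[str], gold: set[str]) -> float:
    """F-measure over the strict overlap of predicted and gold citation sets."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def citation_aware_reward(has_valid_xml: bool,
                          cited: set[str],
                          retrieved: set[str],
                          gold: set[str],
                          answer_quality: float) -> float:
    """Gated reward composition (sketch): structural and factual checks must pass
    before citation fidelity contributes; answer_quality in [0, 1] comes from the
    semantic or coverage/consistency scorer."""
    if not has_valid_xml:
        return 0.0                      # format gate: no credit without required tags
    reward = 0.0
    grounded = cited <= retrieved       # all cited entities appear in retrieved context
    if grounded:
        reward += 0.5                   # non-hallucination partial reward
        reward += citation_f(cited, gold)  # citation-F only after both gates pass
    # Assumption: answer quality is added regardless of the citation gates.
    return reward + answer_quality
```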
5. Empirical Benchmarking and Performance
Empirical results across benchmarks demonstrate that C-GRPO produces superior citation fidelity and correctness relative to baseline RL and instruction tuning. On deep search tasks (BrowseComp at 64k and 128k tokens) (Zhang et al., 9 Jan 2026):
| Model | BrowseComp@64k | BrowseComp@128k |
|---|---|---|
| DeepDive-4B-SFT | 7.7 | 14.1 |
| + GRPO | 12.9 | 14.7 |
| + C-GRPO | 13.9 | 17.5 |
| DeepDive-30B-SFT | 12.2 | 20.5 |
| + GRPO | 16.0 | 18.9 |
| + C-GRPO | 17.9 | 24.8 |
For Thai legal reasoning, C-GRPO with the semantic reward achieves up to +90% citation-F gains and +31% joint-quality improvements over SFT (Akarajaradwong et al., 13 Jul 2025):
| Model + Reward | Citation-F | Coverage | Consistency | Joint Score |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct (baseline) | 0.4103 | 0.5908 | 0.8402 | 0.6138 |
| + LoRA SFT | 0.5691 (+39%) | 0.5832 | 0.8341 | 0.6622 (+8%) |
| + LoRA GRPO (cov/cons) | 0.6796 (+66%) | 0.6322 | 0.8598 | 0.7239 (+18%) |
| + LoRA GRPO (semantic) | 0.7146 (+74%) | 0.7197 | 0.8232 | 0.7525 (+23%) |
C-GRPO’s improvements persist under out-of-domain generalization: on the NitiBench-Tax test set, GRPO variants substantially mitigate the degradation observed after SFT and yield higher citation-F (up to +66%).
6. Behavior Shaping: Shortcuts, Grounding, and Robustness
Outcome-only RL (e.g., standard GRPO or PPO) incentivizes shortcut exploitation: agents minimize tool calls and leave evidence chains incomplete. C-GRPO reverses this trend: agents increase citation gathering and rubric satisfaction over training, as evidenced by tool-call statistics and qualitative case analyses (Zhang et al., 9 Jan 2026).
C-GRPO’s rubric-satisfying protocol produces more comprehensive chains: agents identify all relevant hidden entities and cite supporting URLs with high fidelity. In contrast, outcome-only baselines often halt at the final answer without verifying intermediate constraints.
The reward decomposition across format adherence, non-hallucination, and citation-F ensures agents cannot exploit a single criterion or “game” the system, focusing model updates on holistic, evidence-grounded reasoning (Akarajaradwong et al., 13 Jul 2025).
7. Limitations, Constraints, and Future Research Directions
C-GRPO’s rubric-based reward design leverages the compositional structure of multi-hop/synthetic QA; its applicability to open-ended, unstated-requirement QA is less direct (Zhang et al., 9 Jan 2026). Static rubric generation can limit adaptability; online evolution of rubrics via self-play or contrastive rollouts constitutes a plausible avenue for future work.
Reward assignment accuracy is bounded by the reliability of judge LLMs or embedding proxies. Model errors in entity identification or citation support can introduce noise or bias.
Potential extensions include human-in-the-loop rubric refinement, broader agentic task adaptation (e.g., code synthesis, planning), and further cost reduction via improved semantic proxies or data filtering (Akarajaradwong et al., 13 Jul 2025).
C-GRPO establishes a rigorous foundation for RL-fine-tuning of evidence-driven LLM agents, achieving demonstrable gains in reasoning fidelity and citation accuracy across deep search and legal domains by integrating comprehensive, context-structured rewards with stable, group-based updates (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).