Citation-aware Rubric Rewards (CaRR)
- The paper demonstrates that integrating fine-grained rubric rewards with citation verification significantly enhances model reasoning and evidence chaining.
- CaRR is a framework that assigns dense rewards for verified intermediate reasoning steps, ensuring clear links between supporting evidence and the final answer.
- Empirical results show that agents trained with CaRR achieve higher citation fidelity and reduced shortcut behaviors, leading to improved overall performance.
Citation-aware Rubric Rewards (CaRR) is a fine-grained, citation-sensitive reward framework designed to improve the training and performance of deep search agents—especially those leveraging LLMs for multi-hop or retrieval-augmented tasks that demand rigorous factuality and evidence tracing. CaRR addresses the limitations of conventional binary reward schemes by constructing intermediate, verifiable checkpoints—dubbed “rubrics”—that enforce explicit reasoning, precise citation, and connectivity from supporting evidence to the final answer. CaRR is the basis for reinforcement learning (RL) optimization schemes such as Citation-aware Group Relative Policy Optimization (C-GRPO), which have empirically demonstrated enhanced robustness, reduced shortcut exploitation, and greater generalization than standard outcome-based RL (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).
1. Motivation and Rationale
Traditional RL training of LLM-based deep search agents typically relies on binary correctness signals—rewarding only the exact match of the final answer to a ground-truth label. This approach has several weaknesses:
- Lack of granularity: It does not capture intermediate correctness or factual grounding throughout multi-step reasoning chains.
- Shortcut exploitation: Agents may “jump to answers” without thoroughly verifying intermediate steps, as additional tool calls or evidence retrieval do not accrue reward if the final answer is correct.
- Hallucination and incomplete reasoning: Binary supervision provides no incentive to trace, cite, and connect sources for every logical hop, enabling superficial or unsubstantiated predictions.
Citation-aware Rubric Rewards counteract these deficits by assigning dense rewards for meeting a sequence of explicit rubrics. These rubrics typically require agents to:
- Identify and extract all “hidden entities” implicated in the question or reasoning chain.
- Provide citation-backed support for each entity or step, verified automatically or by judge LLMs.
- Maintain evidence connectivity—ensuring supporting material chains “link” to the predicted answer via breadth-first search or other traversal algorithms.
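Concretely, such a rubric can be pictured as a small checklist structure over hidden entities, their supporting citations, and the links between them. The following is a minimal sketch, assuming a simple in-memory representation; the field names (`entity`, `citations`, `linked_to`) are illustrative and not the authors' schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    """One verifiable checkpoint: a hidden entity that must be cited and linked."""
    entity: str                                          # hidden entity implicated in the reasoning chain
    citations: list[str] = field(default_factory=list)   # source IDs supporting this entity
    linked_to: list[str] = field(default_factory=list)   # entities this one connects to

@dataclass
class Rubric:
    """Full rubric for one question: all hidden entities plus the answer node the chain must reach."""
    question: str
    items: list[RubricItem]
    answer_entity: str
```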
2. Formalization and Mathematical Structure
Let $\pi_\theta$ denote the deep search agent parameterized by $\theta$. CaRR operates within the group-based RL architecture of C-GRPO as follows:
- For a given question $q$, sample $G$ trajectories $\{\tau_i\}_{i=1}^{G}$ from the current policy $\pi_{\theta_{\text{old}}}$.
- Compute the outcome reward for each rollout: $R^{\text{out}}_i = 1$ if the predicted answer matches the ground truth, else $0$.
- Compute the rubric reward as the fraction of satisfied rubrics, $R^{\text{rub}}_i = \frac{1}{K}\sum_{k=1}^{K} c_{i,k}$, where $c_{i,k} \in \{0,1\}$ indicates whether rubric $k$ (a cited hidden entity or evidence link) is satisfied in rollout $i$.
- The rubric reward is normalized within the group (e.g., by the group maximum) to yield $\tilde{R}^{\text{rub}}_i$.
- Define the mixed reward for each rollout:
$$R_i = R^{\text{out}}_i + \alpha \, \mathbb{1}\!\left[R^{\text{out}}_i = 1\right] \tilde{R}^{\text{rub}}_i,$$
where $\alpha$ controls the outcome-vs.-rubric trade-off, and the rubric reward is only granted to correct rollouts.
Within the group, compute the baseline $\mu = \frac{1}{G}\sum_{i=1}^{G} R_i$ and standard deviation $\sigma$. Generated tokens in rollout $i$ receive the group-relative advantage $A_i = (R_i - \mu)/\sigma$.
Training proceeds by optimizing a clipped PPO-style surrogate objective at the token level:
$$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|}\min\!\Big(\rho_{i,t} A_i,\ \operatorname{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big) A_i\Big)\right],$$
with $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ as the importance-sampling ratio and $1 \pm \epsilon$ the clipping bounds (Zhang et al., 9 Jan 2026, Akarajaradwong et al., 13 Jul 2025).
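To make the reward mixing and group-relative advantage concrete, here is a minimal sketch in Python; the function names, the max-normalization of rubric rewards, and the default $\alpha$ are expository assumptions, not the reference implementation.

```python
import numpy as np

def mixed_rewards(outcome, rubric, alpha=0.5):
    """Combine outcome and rubric rewards per rollout in a group.

    outcome: 0/1 outcome rewards, one per rollout.
    rubric:  rubric-satisfaction fractions in [0, 1].
    alpha:   outcome-vs.-rubric trade-off (assumed value).
    """
    outcome = np.asarray(outcome, dtype=float)
    rubric = np.asarray(rubric, dtype=float)
    # In-group normalization of rubric rewards (one plausible choice: divide by the group max).
    denom = rubric.max() if rubric.max() > 0 else 1.0
    rubric_norm = rubric / denom
    # Rubric reward is only granted to rollouts whose final answer is correct.
    return outcome + alpha * outcome * rubric_norm

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize mixed rewards within the group (GRPO-style baseline)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts; two correct, with different rubric satisfaction.
R = mixed_rewards(outcome=[1, 1, 0, 0], rubric=[1.0, 0.5, 0.8, 0.0])
A = group_relative_advantages(R)   # every token in rollout i receives advantage A[i]
```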
3. Rubric Construction and Evaluation Workflow
CaRR leverages a multi-stage rubric assessment procedure:
- Hidden-Entity Identification: Automatically enumerate latent entities or steps required to solve a question—these may not be explicitly mentioned but are essential for correct reasoning.
- Citation Support Verification: Deploy a judge LLM (e.g., DeepSeek-v3.2) or embedding-based method (e.g., BGE-M3) to check that each entity or reasoning hop is explicitly cited, and that citations are factual and drawn from context or corpora.
- Evidence Connectivity: Perform connectivity analysis (e.g., breadth-first search) to ensure evidence forms an uninterrupted chain from initial query to final answer. Only rollouts with well-structured, connected citations for all rubrics receive the maximum rubric reward.
Rollouts exhibiting format or length errors, or failing rubric satisfaction, are penalized.
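The evidence-connectivity check above can be viewed as a reachability test over a citation graph. Below is a minimal sketch, assuming evidence is represented as directed edges between cited entities and that the chain must reach the answer entity from the question's seed entities; the graph construction itself is simplified.

```python
from collections import deque

def evidence_connected(edges, seed_entities, answer_entity):
    """Breadth-first search: is the answer reachable from the question's seed
    entities through cited evidence links? `edges` is an iterable of
    (source_entity, target_entity) pairs extracted from the agent's citations."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)

    visited = set(seed_entities)
    queue = deque(seed_entities)
    while queue:
        node = queue.popleft()
        if node == answer_entity:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return False

# Example: question seed -> intermediate hidden entity -> answer.
edges = [("question", "hidden_entity_1"), ("hidden_entity_1", "final_answer")]
assert evidence_connected(edges, seed_entities=["question"], answer_entity="final_answer")
```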
4. Integration with Group-Relative Policy Optimization
Citation-aware rewards are operationalized within group-based RL frameworks such as C-GRPO or Thai Legal QA's GRPO (Akarajaradwong et al., 13 Jul 2025):
- Multiple rollouts per prompt are grouped for local, per-prompt baseline variance reduction.
- Rubric and outcome rewards are combined exclusively for correct rollouts, ensuring that intermediate step completion is incentivized only when global task success is achieved.
- Embedding-based or judge-model proxies can provide efficient, scalable rubric reward computation, substantially reducing RL compute costs.
This group-based approach yields unbiased gradient estimates, sharply reduced variance, and trust-region-stabilized training.
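As one way to picture an embedding-based rubric proxy (in place of a judge LLM), the sketch below scores citation support by cosine similarity between rubric statements and cited passages. The embedding backend, model name, and threshold are assumptions; the papers mention BGE-M3 as one option, and `sentence_transformers` stands in here for whatever backend is actually used.

```python
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# Model name is an assumption; BGE-M3 is one embedding model mentioned in the sources.
model = SentenceTransformer("BAAI/bge-m3")

def rubric_support_scores(rubric_statements, cited_passages, threshold=0.7):
    """Proxy rubric check: a rubric counts as supported if some cited passage
    is semantically close to it (cosine similarity above an assumed threshold)."""
    rub_emb = model.encode(rubric_statements, normalize_embeddings=True)
    cit_emb = model.encode(cited_passages, normalize_embeddings=True)
    sims = rub_emb @ cit_emb.T                     # cosine similarities (embeddings are normalized)
    supported = sims.max(axis=1) >= threshold      # best-matching citation per rubric
    return supported.mean()                        # fraction of rubrics supported

# Example: rubric-reward proxy for one rollout.
score = rubric_support_scores(
    ["The hidden entity X founded company Y in 1998."],
    ["According to the retrieved article, X founded Y in 1998."],
)
```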
5. Empirical Benchmarks and Observed Impact
The use of CaRR and C-GRPO has been validated across diverse domains and architectures:
- Deep search tasks (BrowseComp, BrowseComp-ZH, xbench-DS, GAIA, DeepResearch Bench): Substantial improvements in correctness and rubric satisfaction under increased context lengths (64k–128k tokens). C-GRPO outperforms standard GRPO/PPO by up to 8 points for 4B models and 6 points for 30B models at long contexts (Zhang et al., 9 Jan 2026).
- Evidence comprehensiveness: Agents trained with CaRR trace and cite more hops—on BrowseComp, 4.3 citations and 5.2 connected rubrics with CaRR, compared to 3.5 and 4.0 with standard GRPO.
- Legal QA: Citation-aware group optimization with semantic F1 metrics and formatting rewards produces 74–90% Citation-F1 gains and 22–31% improvements in joint QA metrics over instruction-tuned baselines, with up to 2.5× efficiency advantages when embedding proxies are used (Akarajaradwong et al., 13 Jul 2025).
Ablation studies confirm the necessity of hidden-entity identification and evidence-connectivity checks: removing either significantly degrades performance (−1.0 to −2.4 points), and allocating rubric rewards to all rollouts (not just correct ones) leads to severe collapse.
6. Mechanisms Preventing Shortcuts and Hallucinations
Standard outcome RL fosters shortcut behavior as agents learn to predict correct answers with minimal evidence gathering or tool usage. CaRR, in contrast, enforces thoroughness by:
- Rewarding only correct trajectories that also achieve maximal rubric satisfaction (e.g., all checklist items cited and connected).
- Proportionally incentivizing deeper tool use and evidence tracing—empirical training curves show that C-GRPO agents make more tool calls and reason more thoroughly rather than shortcutting or hallucinating.
- Ensuring symbolic and factual completeness of response chains, verified by connectivity and citation consistency.
Qualitative analyses demonstrate that CaRR-equipped agents trace complete chains with explicit source attribution, in contrast to shallow, unverified outputs generated by outcome-only baselines (Zhang et al., 9 Jan 2026).
7. Limitations and Extensions
Current CaRR implementations rely on explicit, decomposable rubrics derivable from the synthetic structure of multi-hop QA frameworks. Extensions to unconstrained, open-ended questions require dynamic, online, or evolving rubric construction, potentially via rollout contrast or novel compositional methods. Static rubrics may not capture all facets of reasoning in less-structured domains. Further, adaptive scheduling of the rubric-vs.-outcome weighting and the integration of alternative signals (factuality, consistency) remain open avenues.
A plausible implication is that future research may generalize CaRR to domains without well-defined intermediate hops, and explore curriculum-based or context-sensitive reward scheduling for maximal agent robustness and factuality (Zhang et al., 9 Jan 2026).
Key References:
- "Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards" (Zhang et al., 9 Jan 2026)
- "Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?" (Akarajaradwong et al., 13 Jul 2025)