- The paper introduces ARCO, a framework that co-evolves policy and rubric models to provide interpretable, step-specific reward signals in multi-step LLM-based agents.
- The methodology employs a hierarchical model that generates natural-language criteria and numeric scores per step, ensuring precise credit assignment through trajectory decomposition.
- Empirical evaluations on multi-hop QA benchmarks show that ARCO significantly improves exact-match performance and offers actionable insights for error analysis.
ARCO: Adaptive Rubric Co-Evolution for Multi-Step Language Agents
Problem Statement and Motivation
Multi-step LLM-based agents, typified by complex QA, tool use, and decomposition tasks, are conventionally optimized with sparse, scalar rewards, typically delivered only at trajectory completion. While straightforward, this practice fundamentally limits interpretability and step-level credit assignment: agents lack nuanced feedback about which decisions or actions contributed to failure or success, precluding effective reward tracing and opening the door to reward hacking. Recent trends toward rubric-based rewards, where interpretable natural-language criteria supplant opaque scalars, partially address interpretability but largely remain at the trajectory level and rely on static, closed-source external LLMs frozen during training. As a result, rubric criteria and the judge's assessment are decoupled from the agent's policy learning and unable to dynamically accommodate evolving agent behaviors or error modes.
Method: The ARCO Framework
ARCO (Adaptive Rubric CO-evolution) proposes a unified architecture resolving these limitations via the joint, co-evolutionary training of both the policy and a rubric model, each instantiated with an identical open-source backbone architecture. Crucially, ARCO localizes evaluation both semantically and temporally: at every step, the rubric model generates step-specific natural-language criteria and, conditioned on those, computes scalar scores for the corresponding action. This two-stage process is underpinned by a hierarchical model with a shared backbone: an LM head generates an ordered list of K criteria per step, and a score head predicts K numeric values, whose mean forms the step-level reward.
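The two-stage output described above can be pictured as a small data structure: K criteria paired with K scores, whose mean is the step-level reward. The sketch below is illustrative only. The class name `StepRubric`, its fields, the example criteria, and the score values are assumptions made here, not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class StepRubric:
    """Hypothetical container for one step's rubric output."""

    criteria: List[str]   # K natural-language criteria (LM head output)
    scores: List[float]   # K numeric scores, one per criterion (score head output)

    def __post_init__(self) -> None:
        if not self.criteria:
            raise ValueError("a rubric needs at least one criterion")
        if len(self.criteria) != len(self.scores):
            raise ValueError("expected exactly one score per criterion")

    @property
    def step_reward(self) -> float:
        # The step-level reward is the mean of the K criterion scores.
        return mean(self.scores)


# Invented example with K = 3 criteria for a single search step.
rubric = StepRubric(
    criteria=[
        "Query targets the bridging entity of the question",
        "Query avoids repeating an earlier search",
        "Query is specific enough to retrieve a single fact",
    ],
    scores=[1.0, 0.5, 0.0],
)
step_reward = rubric.step_reward
```

Keeping criteria and scores aligned by position is what lets a low reward be traced back to the specific criterion that caused it.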
The essential innovation is the trajectory decomposition constraint, which enforces that the step scores sum to the trajectory-level reward. No gold process labels are needed: credit propagation is achieved by aligning the summed step-level outputs with the observed trajectory reward. Both the agent's policy and the rubric model are fine-tuned with RL updates on on-policy trajectories, so the evaluation rubric and scoring function co-evolve with the policy's changing error modes and strategy refinements.
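The summary does not say how the decomposition constraint is enforced, so the following is a minimal sketch of two plausible options, both assumptions: a squared penalty on the gap between the summed step rewards and the trajectory reward, and a uniform shift that projects the step rewards onto the constraint. All function names are hypothetical.

```python
from typing import List, Sequence


def decomposition_gap(step_rewards: Sequence[float], trajectory_reward: float) -> float:
    """Signed violation of the constraint sum(step_rewards) == trajectory_reward."""
    return sum(step_rewards) - trajectory_reward


def decomposition_penalty(step_rewards: Sequence[float], trajectory_reward: float) -> float:
    """Squared-gap penalty (an assumed loss form, not confirmed by the source)."""
    return decomposition_gap(step_rewards, trajectory_reward) ** 2


def project_onto_constraint(step_rewards: Sequence[float], trajectory_reward: float) -> List[float]:
    """Euclidean projection onto {r : sum(r) == trajectory_reward}.

    Every step is shifted by the same amount, so the relative ordering of
    the steps is preserved while the total matches the observed reward.
    """
    if not step_rewards:
        raise ValueError("trajectory must contain at least one step")
    shift = decomposition_gap(step_rewards, trajectory_reward) / len(step_rewards)
    return [r - shift for r in step_rewards]


# Invented example: three steps whose raw scores overshoot a trajectory reward of 1.0.
raw_steps = [0.2, 0.5, 0.9]
projected = project_onto_constraint(raw_steps, trajectory_reward=1.0)
```

The penalty form would suit gradient-based training of the score head, while the projection could be used to repair teacher annotations before a supervised warmup. Which of these, if either, ARCO uses is not stated in the summary.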
Implementation Details
- Backbone architectures: Qwen3-4B-Instruct and Llama-3.2-3B-Instruct serve as both the policy and rubric model base, using LoRA adaptation.
- Training Regimen: A supervised warmup leverages API-based teacher LLMs to annotate initial trajectories with step-level rubrics and scores (ensuring initial rubric quality via decomposition projection). Subsequently, RL proceeds with alternating policy and rubric model updates, with the rubric model serving as both reward generator and dense annotator for on-policy rollouts.
- Rubric Model Input: Each rubric is generated conditioned on the full local trajectory prefix and candidate action, with the scoring head attending over a mean-pooled hidden representation of the input plus rubric criteria.
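The scoring path in the last bullet can be sketched as masked mean pooling followed by a linear head with K outputs. This is a pure-Python illustration under stated assumptions: the linear form of the head, the toy dimensions, and the names `mean_pool` and `score_head` are invented here, and a real implementation would operate on backbone hidden states in a tensor library.

```python
from typing import List, Sequence


def mean_pool(hidden: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the hidden vectors at positions where mask is 1 (padding is skipped)."""
    kept = [h for h, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


def score_head(pooled: Sequence[float],
               weights: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[float]:
    """Assumed linear head: one row of weights per criterion, giving K scores."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


# Toy input: three token positions (the last is padding), hidden size 2, K = 3.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
pooled = mean_pool(hidden_states, attention_mask)
scores = score_head(pooled,
                    weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                    bias=[0.0, 0.0, 0.0])
```

In this reading, the pooled sequence would contain the trajectory prefix, the candidate action, and the generated criteria, so that each of the K scores is conditioned on the rubric text it grades.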
Empirical Results
ARCO is evaluated on three multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue), which represent ascending reasoning complexity and hop count, using 2,000 training and 500 evaluation examples per dataset.
Main Results
Across all benchmarks and both backbones, ARCO achieves the best exact-match (EM) score in every (dataset, model) setting, outperforming:
- Outcome-level reward baselines (e.g., Search-R1, R1-Searcher),
- Rubric-based outcome reward models using static rubrics and closed-source judges (e.g., RaR, CARMO, RLER),
- Step-level scalar process reward models (e.g., AgentPRM).
Notably, ARCO consistently delivers both improved EM (up to 8 to 10 points over strong rubric-based baselines on HotpotQA) and interpretable step-level reward signals. On MuSiQue, the most compositional setting, ARCO surpasses all baselines in EM and essentially matches or exceeds the best F1.
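For readers unfamiliar with the two metrics, the sketch below computes EM and token-level F1 in the style commonly used for extractive QA evaluation. That the paper applies exactly this normalization (lowercasing, dropping punctuation and English articles) is an assumption, and the function names and example strings are chosen here for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("Eiffel Tower in Paris", "Eiffel Tower")
```

EM is all-or-nothing, whereas F1 gives partial credit for overlapping tokens, which is why a method can lead on EM while only roughly matching the best F1.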
Ablation and Diagnostic Analysis
Ablations demonstrate that each architectural choice (rubric text generation, adaptive per-step rubric content, and conditioning on the trajectory prefix) contributes to downstream performance. Removing rubric text, adaptivity, or prefix-awareness uniformly harms EM, confirming the necessity of these design principles for robust, step-discriminative process supervision.
The rubric's action specificity is empirically validated. Diagnostic evaluations using action selection conditioned solely on rubric text (with distractor actions sampled from the same trajectory) yield binding accuracies up to 54.8% (chance=25%) and high action specificity (scores around 4.0/5.0), confirming that ARCO's rubrics are indeed step- and action-bound, not generic.
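The binding diagnostic reduces to a multiple-choice accuracy. A chance level of 25% implies four candidates per trial, which is read here as one true action plus three distractors. The sketch below assumes that setup; the function names and the toy picks are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def chance_level(num_candidates: int) -> float:
    """Expected accuracy of uniform random guessing among the candidates."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return 1.0 / num_candidates


def binding_accuracy(picked: Sequence[int], correct: Sequence[int]) -> float:
    """Fraction of trials where the action chosen from rubric text alone is the true one."""
    if not picked or len(picked) != len(correct):
        raise ValueError("picked and correct must be non-empty and of equal length")
    return sum(p == c for p, c in zip(picked, correct)) / len(picked)


# Toy trials: each entry is the index (0 to 3) of the chosen or true action.
picked_indices = [0, 2, 1, 3]
true_indices = [0, 2, 3, 3]
accuracy = binding_accuracy(picked_indices, true_indices)
baseline = chance_level(4)
```

Comparing the measured accuracy with the chance level is what supports the claim that rubrics carry action-specific information rather than generic advice.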
Optimal rubric width K is found to be 3. Increasing K to 5 or more produces diminishing returns and increases redundancy without introducing new semantic themes, echoing findings on the efficiency/expressivity tradeoff in rubric annotation.
Experiments varying rubric- and policy-backbone sizes reveal negligible benefits from scaling up the rubric head, and cross-family experiments confirm that the quality of the rubric evaluator dominates over simple parameter sharing. Asymmetric backbones do not preclude transfer of reward signals, but best outcomes are achieved when policy and rubric model families are aligned in scale and representation.
Theoretical and Practical Implications
ARCO unifies rubric-based interpretability with dense, step-level process rewards while removing dependence on closed-source judges and trajectory-level scoring. By enabling the reward model to co-evolve, both in rubric content and scoring criteria, it directly regularizes the policy's exploration space, mitigating reward hacking and mode collapse typical in scalar setups.
Practically, ARCO's rubric decomposition enables actionable diagnosis during and after training. Missteps can be traced to concrete evaluation criteria, supporting error analysis and debugging in complex agent protocols. ARCO's design ensures that reward signals remain synchronized with agent progress, dynamically surfacing emergent error modes for both policy and evaluator adaptation.
Future Research Directions
Several avenues are indicated:
- Richer Process Supervision: Beyond trajectory decomposition, integrating auxiliary rubric-consistency or contrastive losses may further sharpen step-level discrimination.
- Adaptive Training Schedules: Varying the pace of policy and rubric co-evolution, possibly via meta-learning, could improve synchronization and reward stability.
- Efficiency: Reducing invocation overhead (e.g., via distillation, rubric caching, asynchronous updates) would make ARCO's paradigm applicable to longer-horizon or more resource-constrained domains.
- Contrastive Rubric Enhancement: For finer disambiguation between similar actions, rubrics can be further conditioned on negative or contrastive samples from the same trajectory.
Conclusion
ARCO provides a rigorously validated framework for interpretable, adaptive reward modeling in multi-step LLM agent training. Through co-evolutionary process supervision with action-bound, step-level reward criteria, ARCO achieves improved sample efficiency, transparency, and error localization over outcome-based and static rubric paradigms. Its architectural choices and empirical findings both illuminate open questions in reward-model/policy co-evolution and establish a reproducible baseline for the further study of interpretable agent alignment and credit assignment.
Reference: "ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents" (2606.21262)