ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

Published 19 Jun 2026 in cs.AI and cs.CL | (2606.21262v1)

Abstract: Reinforcement learning for multi-step LLM agents often relies on scalar rewards that indicate success but cannot explain why a trajectory is good or bad. Rubric-based rewards improve interpretability through natural-language criteria, but existing methods score at the trajectory level and freeze the scorer behind a closed-source judge, leaving step-level credit assignment unresolved and the judge itself static. We propose ARCO (Adaptive Rubric CO-evolution), a rubric framework in which a same-scale model $μ$ shares a backbone with two heads: a generation head that produces per-step criteria, and a score head that predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step-level labels, while $μ$ and the policy $π$ are jointly updated on on-policy data so that the rubric content and the scoring function co-evolve at the parameter level. Across HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open-source backbones, ARCO improves the best EM in every setting over strong outcome-, rubric-, and process-reward baselines, and analyses show that its rubrics are step-specific, robust to design choices, and useful for diagnosing agent behavior. Codes and data are available at https://github.com/zihangtian/ARCO.

Authors (6)

Summary

The paper introduces ARCO, a framework that co-evolves policy and rubric models to provide interpretable, step-specific reward signals in multi-step LLM-based agents.
The methodology employs a hierarchical model that generates natural-language criteria and numeric scores per step, ensuring precise credit assignment through trajectory decomposition.
Empirical evaluations on multi-hop QA benchmarks show that ARCO significantly improves exact-match performance and offers actionable insights for error analysis.

ARCO: Adaptive Rubric Co-Evolution for Multi-Step Language Agents

Problem Statement and Motivation

Multi-step LLM-based agents—typified by complex QA, tool use, and decomposition tasks—are conventionally optimized with sparse, scalar rewards, typically delivered only at trajectory completion. While straightforward, this practice fundamentally limits interpretability and step-level credit assignment: agents lack nuanced feedback about which decisions or actions contributed to failure or success, precluding effective reward tracing and opening the door to reward hacking. Recent trends toward rubric-based rewards, where interpretable natural-language criteria supplant opaque scalars, partially address interpretability but largely remain at the trajectory level and rely on static, closed-source external LLMs frozen during training. As a result, rubric criteria and the judge’s assessment are decoupled from the agent’s policy learning and unable to dynamically accommodate evolving agent behaviors or error modes.

Method: The ARCO Framework

ARCO (Adaptive Rubric CO-evolution) addresses these limitations by jointly training the policy and a rubric model, each instantiated from the same open-source backbone architecture. ARCO localizes evaluation both semantically and temporally: at every step, the rubric model generates step-specific natural-language criteria and, conditioned on them, computes scalar scores for the corresponding action. The rubric model is hierarchical, with two heads on a shared backbone: a generation (LM) head produces an ordered list of $K$ criteria per step, and a score head predicts $K$ numeric values, one per criterion, whose mean forms the step-level reward.
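The aggregation from $K$ criterion scores to one step-level reward can be illustrated with a minimal sketch. The `StepRubric` container, the `step_reward` function, and the example criteria and scores are hypothetical names and values chosen for illustration; only the "mean of $K$ scores" rule comes from the description above.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import List


@dataclass
class StepRubric:
    """Hypothetical container for the rubric model's output at one step."""
    criteria: List[str]   # K natural-language criteria (generation head)
    scores: List[float]   # K numeric scores, one per criterion (score head)


def step_reward(rubric: StepRubric) -> float:
    """Aggregate the K criterion scores into one step-level reward (their mean)."""
    if not rubric.scores:
        raise ValueError("a rubric needs at least one criterion")
    if len(rubric.criteria) != len(rubric.scores):
        raise ValueError("each criterion needs exactly one score")
    return fmean(rubric.scores)


# Illustrative values only; real criteria and scores come from the rubric model.
example = StepRubric(
    criteria=[
        "The query targets the bridging entity from the previous hop.",
        "The query does not repeat an earlier search.",
        "The query is specific enough to retrieve a single answer.",
    ],
    scores=[0.5, 0.25, 0.75],
)
print(step_reward(example))  # 0.5
```

Keeping the criteria and scores side by side is what makes the reward inspectable: a low step reward can be traced to the specific criterion that scored poorly.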

The key innovation is the trajectory decomposition constraint, which requires the step scores to sum to the trajectory-level reward. No gold process labels are needed: credit is assigned by aligning the summed step-level outputs with the observed trajectory reward. Both the agent's policy and the rubric model are fine-tuned and updated with RL on on-policy trajectories, so the rubric content and the scoring function co-evolve with the policy's changing error modes and strategies.
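The decomposition constraint can be written out explicitly. The notation here is assumed for illustration and is not taken from the paper: $s_{t,k}$ is the score for criterion $k$ at step $t$, $r_t$ the step-level reward, $T$ the number of steps, and $R(\tau)$ the observed reward of trajectory $\tau$.

$$
r_t = \frac{1}{K}\sum_{k=1}^{K} s_{t,k}, \qquad \sum_{t=1}^{T} r_t = R(\tau)
$$

One plausible way to enforce this softly during training, stated as an assumption rather than the paper's objective, is a squared-error penalty on the gap:

$$
\mathcal{L}_{\text{dec}} = \Big(\sum_{t=1}^{T} r_t - R(\tau)\Big)^{2}
$$

For example, with $T = 3$ and a successful trajectory with $R(\tau) = 1$, step rewards $(0.2, 0.3, 0.5)$ satisfy the constraint exactly, giving $\mathcal{L}_{\text{dec}} = 0$, whereas $(0.2, 0.3, 0.3)$ leaves a gap of $0.2$ and gives $\mathcal{L}_{\text{dec}} = 0.04$.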

Implementation Details

Backbone Architectures: Qwen3-4B-Instruct and Llama-3.2-3B-Instruct each serve as the base for both the policy and the rubric model, adapted with LoRA.
Training Regimen: A supervised warmup leverages API-based teacher LLMs to annotate initial trajectories with step-level rubrics and scores (ensuring initial rubric quality via decomposition projection). Subsequently, RL proceeds with alternating policy and rubric model updates, with the rubric model serving as both reward generator and dense annotator for on-policy rollouts.
Rubric Model Input: Each rubric is generated conditioned on the full local trajectory prefix and candidate action, with the scoring head attending over a mean-pooled hidden representation of the input plus rubric criteria.
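The "decomposition projection" mentioned for the warmup stage is not specified in this summary. One possible reading, sketched below purely as an assumption, is to adjust the teacher's step scores so that they sum to the trajectory outcome. The function `project_to_outcome` is a hypothetical name; it applies the Euclidean projection onto the constraint "scores sum to the outcome", which shifts every step by the same amount and therefore preserves the teacher's relative ranking of steps.

```python
from typing import List


def project_to_outcome(step_scores: List[float], outcome: float) -> List[float]:
    """Shift all step scores equally so that they sum to `outcome`.

    This is the closest point (in Euclidean distance) to `step_scores`
    among all score vectors whose entries sum to `outcome`.
    Assumed reading of "decomposition projection", not the paper's definition.
    """
    if not step_scores:
        raise ValueError("trajectory must contain at least one step")
    gap = (outcome - sum(step_scores)) / len(step_scores)
    return [score + gap for score in step_scores]


# Illustrative teacher annotations for a 3-step trajectory that succeeded.
teacher_scores = [0.5, 0.25, 0.75]   # sums to 1.5
projected = project_to_outcome(teacher_scores, outcome=1.0)
print(projected, sum(projected))
```

Other projections are equally conceivable (for instance, rescaling multiplicatively); the uniform shift is shown only because it is the simplest one that satisfies the constraint.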

Empirical Results

ARCO is evaluated on three multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue), ordered by increasing reasoning complexity and hop count, using 2,000 training and 500 evaluation examples per dataset.

Main Results

ARCO achieves the best exact-match (EM) score in every (dataset, backbone) setting, outperforming:

Outcome-level reward baselines (e.g., Search-R1, R1-Searcher),
Rubric-based outcome reward models using static rubrics and closed-source judges (e.g., RaR, CARMO, RLER),
Step-level scalar process reward models (e.g., AgentPRM).

ARCO improves EM (by up to 8–10 points over strong rubric-based baselines on HotpotQA) while also providing interpretable step-level reward signals. On MuSiQue, the most compositional setting, ARCO surpasses all baselines in EM and roughly matches or exceeds the best F1.
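The EM and F1 metrics reported above are not defined in this summary. A minimal sketch of the SQuAD-style definitions commonly used for multi-hop QA is given below; whether the paper uses exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))   # 1.0
print(exact_match("Eiffel Tower, Paris", "eiffel tower"))  # 0.0
```

EM is the stricter of the two: a prediction that contains the gold answer plus extra tokens scores zero on EM but still earns partial credit on F1, which is why the two metrics can rank systems differently.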

Ablation and Diagnostic Analysis

Ablations demonstrate that each architectural choice—rubric text generation, adaptive (per-step) rubric content, and conditioning on trajectory prefix—contributes to downstream performance. Removing rubric text, adaptivity, or prefix-awareness uniformly harms EM, confirming the necessity of these design principles for robust, step-discriminative process supervision.

The rubric's action specificity is empirically validated. Diagnostic evaluations using action selection conditioned solely on rubric text (with distractor actions sampled from the same trajectory) yield binding accuracies up to 54.8% (chance=25%) and high action specificity (scores around 4.0/5.0), confirming that ARCO’s rubrics are indeed step- and action-bound, not generic.
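The binding diagnostic can be sketched as a multiple-choice task: a judge sees only the rubric text and must pick the action it was written for from a set of candidates. Everything below is a hypothetical illustration. The `BindingItem` structure, the `binding_accuracy` function, and the word-overlap chooser are stand-ins (the actual judge is not described here); four candidates per item are assumed because the reported chance level is 25%.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class BindingItem:
    """One diagnostic question: a rubric and candidate actions (hypothetical)."""
    rubric: str
    candidates: List[str]   # true action plus distractors from the same trajectory
    gold_index: int         # position of the action the rubric was written for


def binding_accuracy(
    items: Sequence[BindingItem],
    choose: Callable[[str, List[str]], int],
) -> float:
    """Fraction of items where the chooser recovers the true action."""
    if not items:
        raise ValueError("need at least one item")
    hits = sum(choose(it.rubric, it.candidates) == it.gold_index for it in items)
    return hits / len(items)


def overlap_chooser(rubric: str, candidates: List[str]) -> int:
    """Toy stand-in for a judge: pick the candidate sharing the most words."""
    rubric_words = set(rubric.lower().split())
    overlaps = [len(rubric_words & set(c.lower().split())) for c in candidates]
    return overlaps.index(max(overlaps))


items = [
    BindingItem(
        rubric="The query should name the director of the film",
        candidates=[
            "search: director of the film Inception",
            "search: capital of France",
            "answer: 1999",
            "search: population of Tokyo",
        ],
        gold_index=0,
    ),
    BindingItem(
        rubric="The answer should state a year",
        candidates=[
            "search: capital of France",
            "answer: the year 1999",
            "search: director",
            "search: Tokyo",
        ],
        gold_index=1,
    ),
]
print(binding_accuracy(items, overlap_chooser))
```

With four candidates, a rubric that carries no action-specific information would leave the judge at the 1/4 chance level, so accuracy above that level is evidence that the rubric text is bound to its step.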

Optimal rubric width $K$ is found to be 3. Increasing $K$ to 5 or more produces diminishing returns and increases redundancy without introducing new semantic themes, echoing findings on the efficiency/expressivity tradeoff in rubric annotation.

Experiments varying rubric- and policy-backbone sizes reveal negligible benefits from scaling up the rubric head, and cross-family experiments confirm that the quality of the rubric evaluator dominates over simple parameter sharing. Asymmetric backbones do not preclude transfer of reward signals, but best outcomes are achieved when policy and rubric model families are aligned in scale and representation.

Theoretical and Practical Implications

ARCO unifies rubric-based interpretability with dense, step-level process rewards while removing dependence on closed-source judges and trajectory-level scoring. By enabling the reward model to co-evolve—both in rubric content and scoring criteria—it directly regularizes the policy’s exploration space, mitigating reward hacking and mode collapse typical in scalar setups.

Practically, ARCO's rubric decomposition enables actionable diagnosis during and after training. Missteps can be traced to concrete evaluation criteria, supporting error analysis and debugging in complex agent protocols. ARCO’s design ensures that reward signals remain synchronized with agent progress, dynamically surfacing emergent error modes for both policy and evaluator adaptation.

Future Research Directions

Several avenues are indicated:

Richer Process Supervision: Beyond trajectory decomposition, integrating auxiliary rubric-consistency or contrastive losses may further sharpen step-level discrimination.
Adaptive Training Schedules: Varying the pace of policy and rubric co-evolution, possibly via meta-learning, could improve synchronization and reward stability.
Efficiency: Reducing invocation overhead (e.g., via distillation, rubric caching, asynchronous updates) would make ARCO’s paradigm applicable to longer-horizon or more resource-constrained domains.
Contrastive Rubric Enhancement: For finer disambiguation between similar actions, rubrics can be further conditioned on negative or contrastive samples from the same trajectory.

Conclusion

ARCO provides a rigorously validated framework for interpretable, adaptive reward modeling in multi-step LLM agent training. Through co-evolutionary process supervision with action-bound, step-level reward criteria, ARCO achieves improved sample efficiency, transparency, and error localization over outcome-based and static rubric paradigms. Its architectural choices and empirical findings both illuminate open questions in reward-model/policy co-evolution and establish a reproducible baseline for the further study of interpretable agent alignment and credit assignment.

Reference: "ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents" (2606.21262)
