RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Abstract: Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
Explain it Like I'm 14
Overview: What is this paper about?
This paper is about teaching AI “research agents” to produce long, complex research reports using the web (planning, searching, checking sources, and writing) through reinforcement learning, a trial-and-error training method. The twist: many research tasks don’t have a single “correct” answer you can easily check. So the authors use rubrics, which are clear checklists and criteria like a teacher’s grading guide, to guide the AI step by step, not just to grade the final answer. Their system is called RubricEM.
Goals: What questions were the researchers asking?
They wanted to solve three problems that make training research agents hard:
- How can we train with “rewards” when there isn’t a single right answer to verify?
- How can we give credit to the right parts of a long process (planning, searching, reviewing, writing) instead of only judging the final output?
- How can the agent turn past attempts into useful experience that actually helps on future tasks?
Methods: How does RubricEM work?
Think of a student writing a research report with a teacher’s rubric in hand. RubricEM has the AI work the same way, built on three big ideas:
- Break the work into stages and use a rubric to guide each stage
  - Stages: Plan → Research → Review → Answer.
  - Plan: the agent writes a task-specific rubric and a plan (what facts to gather, what to avoid, how to reason).
  - Research: it searches the web and checks whether the findings satisfy the plan and rubric; it can update the plan as needed.
  - Review: it maps evidence back to the rubric and designs the final outline.
  - Answer: it writes the final report with citations.
  Why this helps: Instead of one long, messy stream of text, the AI follows a structured workflow where each stage has a clear purpose and criteria, like sections on a test.
- Give stage-by-stage feedback instead of only a final score
  - A judge AI (another model) scores each stage using rubrics tailored to that stage.
  - These stage scores are combined so earlier steps get credit for helping later outcomes (for example, a strong Plan gets some credit if it leads to a strong Answer).
  - The judge improves its own rubrics over time (it keeps a “rubric buffer” of the most useful, discriminating criteria).
  Why this helps: It’s much easier for the agent to learn when it gets specific, timely feedback, like getting separate grades for “research quality” and “writing clarity,” not just a final letter grade.
- Learn from experience by writing reflections and storing them in a “rubric bank”
  - After finishing and being judged, the agent writes a short reflection: what mattered, what worked, and what to avoid next time, all grounded in the rubric and the outcome.
  - A judge AI scores multiple reflection drafts; the best one is saved in a “rubric bank” (a memory of lessons learned).
  - Next time, the agent can retrieve helpful reflections:
    - Cross-episode: reuse lessons from similar past questions.
    - Within-episode: reuse its own earlier attempt on the same question.
  - The reflection-writing model shares the same “backbone” as the main agent, so learning better reflections also improves the agent’s general skills.
  - This reflection training runs asynchronously (in the background), so it doesn’t slow down the main training loop.
  Why this helps: It’s like keeping high-quality study notes you can reuse. The agent improves both in its parameters and through an external memory it can read.
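The reflection memory described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `RubricBank` and `Reflection` names, the word-overlap similarity used for retrieval, and the example queries are all assumptions made for this sketch, and the paper's acceptance check for reflections is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Reflection:
    query: str          # the question the reflection was written for
    text: str           # the natural-language lesson
    judge_score: float  # score the judge gave this reflection draft


def token_overlap(a, b):
    """Jaccard similarity over lowercase word sets (a stand-in for a real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


@dataclass
class RubricBank:
    entries: list = field(default_factory=list)

    def add_best(self, query, drafts):
        """Store only the highest-scored draft; drafts are (text, score) pairs."""
        text, score = max(drafts, key=lambda d: d[1])
        self.entries.append(Reflection(query, text, score))

    def within_episode(self, query):
        """Most recent reflection written for exactly this query, if any."""
        for entry in reversed(self.entries):
            if entry.query == query:
                return entry
        return None

    def cross_episode(self, query, k=2):
        """Top-k reflections written for other, similar queries."""
        others = [e for e in self.entries if e.query != query]
        others.sort(key=lambda e: token_overlap(e.query, query), reverse=True)
        return others[:k]


bank = RubricBank()
bank.add_best(
    "effects of caffeine on sleep",
    [("Check dosage and timing studies first.", 0.7), ("Cite sources.", 0.4)],
)
bank.add_best("history of the bicycle", [("Build a timeline before drafting.", 0.6)])

same = bank.within_episode("effects of caffeine on sleep")
similar = bank.cross_episode("effects of caffeine on memory", k=1)
```

Here `same` is the lesson saved for the identical question (within-episode reuse), while `similar` holds the lesson from the closest different question (cross-episode reuse). A real system would likely use embedding-based retrieval instead of word overlap.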
A few technical terms in simple language:
- Reinforcement learning: training by trial and error with feedback (“rewards”).
- Policy: the agent’s “strategy” for what to do next.
- Trajectory: the whole sequence of steps the agent takes (plan, searches, drafts, etc.).
- Credit assignment: figuring out which earlier decisions deserve credit for success.
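- GRPO (Group Relative Policy Optimization): a popular, PPO-like method that nudges the policy by comparing each attempt’s reward with the other attempts sampled for the same question; here it is adapted to handle stage-based scores.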
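To make stage-by-stage credit more concrete, here is a small sketch of how per-stage judge scores might be turned into per-stage learning signals. It rests on several assumptions that are not taken from the paper: scores lie in [0, 1], the weight table `LAMBDA` is hand-picked with a hypothetical downstream weight of 0.5, and normalization is a plain mean and standard deviation within each stage across the group of attempts. The paper's actual weights and training objective may differ.

```python
from statistics import mean, pstdev

STAGES = ["plan", "research", "review", "answer"]

# Hypothetical stage-dependence weights: LAMBDA[k][j] is how much the score of
# stage j counts toward the return of stage k. Entries with j < k are zero, so
# a stage never receives credit from stages that happened before it.
LAMBDA = [
    [1.0, 0.5, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]


def stage_returns(scores):
    """Combine one attempt's per-stage judge scores into per-stage returns."""
    n = len(scores)
    return [sum(LAMBDA[k][j] * scores[j] for j in range(k, n)) for k in range(n)]


def stagewise_advantages(group_scores, eps=1e-6):
    """Normalize returns separately within each stage across the group."""
    returns = [stage_returns(s) for s in group_scores]
    advantages = [[0.0] * len(STAGES) for _ in returns]
    for k in range(len(STAGES)):
        column = [r[k] for r in returns]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            advantages[i][k] = (value - mu) / (sigma + eps)
    return advantages


# Three attempts at the same question, each with one judge score per stage.
group = [
    [0.8, 0.6, 0.7, 0.9],
    [0.4, 0.5, 0.3, 0.2],
    [0.6, 0.7, 0.5, 0.6],
]
adv = stagewise_advantages(group)
```

Each row of `adv` belongs to one attempt and each column to one stage; in training, every token written during a stage would share that stage's value. Because the Plan row of `LAMBDA` includes downstream weights, a plan that leads to a strong answer earns extra credit even if its own score is average.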
Findings: What did they discover?
Using an 8-billion-parameter model (a relatively small modern model), RubricEM:
- Beat other open, comparable systems on four long-form research benchmarks: HealthBench, ResearchQA, DeepResearchBench, and ResearchRubrics.
- Came close to some proprietary (closed) deep-research systems—strong performance for a smaller, open setup.
- Needed fewer training steps than a strong prior method to reach better results.
Their analysis showed why it works:
- Stage-by-stage feedback (Stage-Structured GRPO) helps the agent learn more reliably than only giving a final score.
- The reflection meta-policy and rubric bank produce useful, reusable guidance that boosts future attempts.
- The structured stage scaffold (Plan → Research → Review → Answer) makes both learning and inference more stable and effective.
- Even though training focused on long-form tasks, the model also improved on several short-form search tasks, suggesting it learned general research skills (like better tool use and evidence grounding), not just report writing.
Implications: Why does this matter?
- Training beyond “verifiable” answers: Many real questions don’t have a single correct solution. RubricEM shows how to train agents for these open-ended tasks using clear criteria, stage-by-stage feedback, and experience reuse.
- Better, safer research agents: Structuring the process and reflecting on what worked can lead to more reliable, well-cited, and transparent answers—useful for school projects, journalism, market analysis, and scientific summaries.
- Reusable learning: The rubric bank is like an evolving playbook, helping the agent improve over time without needing tons of new labeled data.
- Practical training recipe: The approach works with a smaller model and reasonable training budget, which could make high-quality research agents more accessible.
- Future directions: Improve judge quality and fairness, expand to more domains and tools, and study how to detect and reduce bias in rubrics and reflections.
In short, RubricEM turns rubrics into the backbone of the whole learning process—guiding how the agent plans, searches, is graded, and learns from experience—so it can handle complex research tasks where there isn’t just one right answer.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper leaves the following concrete gaps and open questions that future work could address:
- Judge dependency and alignment: How sensitive is SS-GRPO to the choice and quality of the privileged LLM judge? Quantify judge–human agreement, cross-judge generalization, and failure cases where optimizing for the judge diverges from human preferences.
- Reward hacking risks: Does stagewise rubric optimization induce behaviors that exploit judge scoring without improving real report quality (e.g., rubric keyword parroting, superficial structure)? Develop adversarial probes and cross-evaluator audits to detect and mitigate this.
- Human-grounded evaluation: The benchmarks rely heavily on LLM-judged or rubric-based metrics. Include human expert assessments (especially for HealthBench) to validate factuality, safety, and usefulness, and report judge–human correlation.
- Evidence grounding and citation faithfulness: Beyond aggregate scores, measure citation precision/recall, support coverage, and claim–evidence faithfulness. Provide quantitative grounding metrics and error taxonomies for hallucinations or misattributed citations.
- Stage decomposition sensitivity: The framework fixes four stages (Plan/Research/Review/Answer). How does performance vary with alternative decompositions, different numbers of stages, or dynamic/learned stage boundaries?
- Credit propagation design: The stage-dependence matrix Λ is introduced but not systematically tuned. Explore how different Λ choices (e.g., stronger/weaker downstream credit) affect stability, sample efficiency, and final quality; investigate automatic or learned Λ.
- Critic-free choice: SS-GRPO is critic-free for simplicity. Compare against stage-aware critics, value baselines, or process reward models to assess stability, variance reduction, and sample efficiency trade-offs.
- Judge noise thresholds: Theoretical claims hinge on “bounded judge noise” and “alignment” assumptions. Empirically characterize noise levels under which stage returns help or hurt, and test the predicted thresholds from the analysis.
- Quality and correctness of self-generated rubrics: What happens when the Plan stage produces poor rubrics? Study failure cascades, mechanisms for rubric repair/revision, and robustness to low-quality self-rubrics.
- Evolving-rubric judge stability: The judge’s rubric buffer adapts online, but stability, drift, and reproducibility are not quantified. Analyze rubric turnover, discriminative power over time, and safeguards against forgetting or overfitting to transient artifacts.
- Reflection meta-policy negative transfer: Reflections can misguide future rollouts. Develop automatic detection of harmful reflections, confidence estimation for reflections, and gating/undo mechanisms to prevent or reverse negative transfer.
- Rubric bank scalability and freshness: The paper does not detail indexing, pruning, or freshness policies. Study retrieval quality at scale, staleness handling for time-sensitive topics, memory compaction, and forgetting strategies.
- Asynchronous training staleness: The reflection branch lags by one step; the impact of larger lags or different scheduling on convergence and stability is unreported. Provide ablations on lag size and throughput vs. quality trade-offs.
- Search backend confounds: Performance may be influenced by the stronger search backend and tool integration. Provide controlled comparisons across search configurations and report how much each component (search vs. policy) contributes.
- Proprietary dependencies: SFT uses Gemini-3.1-Pro traces and a privileged LLM judge. Assess reproducibility with fully open teachers/judges, and quantify performance drop-offs when replacing them with open alternatives.
- Compute and cost transparency: Report wall-clock time, token counts for rollouts and judging, GPU hours, and per-query inference costs. Analyze cost–quality trade-offs for SS-GRPO and the reflection branch.
- Robustness to web volatility and adversarial content: Evaluate resilience to changing pages, rate limits, prompt injections, SEO spam, and conflicting sources. Incorporate robustness benchmarks and defenses for tool-augmented browsing.
- Safety and domain-specific compliance: Especially on health tasks, include safety audits (harmful advice, disclaimers, scope of practice), and explore rubric items tailored for safety and risk mitigation.
- Multimodal and PDF-heavy evidence: The system appears text-focused. Investigate extending to multimodal evidence (figures, tables, PDFs) with parsing and grounding across formats.
- Generalization beyond chosen benchmarks: Test on additional domains, non-English queries, and user-driven multi-turn research workflows to assess robustness and ecological validity.
- Continual learning and distribution shift: How does the agent handle long-term evolution of its rubric bank and changing web distributions? Study catastrophic forgetting, memory decay, and adaptation over months.
- Interactions between shared backbone and task policy: Joint training of task and reflection on a shared backbone could cause interference. Measure representational drift, catastrophic forgetting, and explore partial parameter sharing or adapters.
- Automated stage detection vs. fixed scaffold: Explore learning stage boundaries from data (e.g., latent options/HRL) and compare to hand-crafted XML scaffolds for flexibility and transfer.
- Alternative meta-learning formulations: Compare reflection-based natural-language memory to parametric fast adaptation methods (e.g., context variables, learned optimizers) and hybrids that fuse textual memory with learned context.
- Uncertainty and abstention: Incorporate confidence estimation, calibrated uncertainty, and abstention or deferral mechanisms when evidence is insufficient to meet rubric standards.
- Fairness and bias in rubrics: Self- and judge-generated rubrics may encode biases. Audit rubric content and outcomes across topics, sources, and demographic-sensitive content; develop de-biasing procedures.
- Legal/privacy aspects of memory: The rubric bank stores prior queries and distilled reflections. Detail policies for PII handling, retention limits, user consent, and compliance with data regulations.
- Transparent failure analyses: Provide qualitative analyses of common failure modes per stage (e.g., flawed plans, shallow evidence, poor synthesis), with targeted interventions and rubrics that address these errors.
- Scaling laws: Characterize how performance scales with model size, RL steps, judge strength, and tool budget; identify regimes of diminishing returns and optimal allocations.
- Tool-chain generality: Assess whether the approach extends beyond web search and Semantic Scholar to broader tool ecosystems (APIs, databases, code execution) and how stage/rubric designs should evolve accordingly.
- Reproducibility artifacts: Release the scaffold schema, prompts, judge instructions, and (where licenses allow) distilled SFT traces or synthetic data generation recipes to enable third-party replication and stress testing.
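As one illustration of the grounding metrics suggested in the list above, the sketch below computes a simple citation precision and recall for a report. The definitions are assumptions chosen for this example (precision: share of citations that support their claim; recall: share of claims backed by at least one supporting citation), and the support labels are presumed to come from a human or judge annotation step that is not shown. Other definitions are possible.

```python
def citation_metrics(claims):
    """Return (precision, recall) for a list of annotated claims.

    Each claim is a dict whose 'citations' entry is a list of booleans,
    one per attached citation: True if that source supports the claim.
    """
    total_citations = sum(len(c["citations"]) for c in claims)
    supporting = sum(sum(c["citations"]) for c in claims)
    supported_claims = sum(1 for c in claims if any(c["citations"]))
    precision = supporting / total_citations if total_citations else 0.0
    recall = supported_claims / len(claims) if claims else 0.0
    return precision, recall


# Hypothetical annotations for a four-claim report.
report = [
    {"claim": "A", "citations": [True, False]},  # one good, one bad citation
    {"claim": "B", "citations": [True]},
    {"claim": "C", "citations": []},             # uncited claim
    {"claim": "D", "citations": [False]},        # misattributed citation
]
precision, recall = citation_metrics(report)
```

In this toy report, two of four citations support their claims and two of four claims are supported, so both metrics come out to 0.5. Reporting the two numbers separately distinguishes over-citing with weak sources from leaving claims unsupported.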
Practical Applications
Practical Applications of RubricEM
RubricEM introduces three core innovations—rubric-guided stage scaffolding (Plan → Research → Review → Answer), Stage-Structured GRPO (stagewise credit assignment with evolving judge rubrics), and a reflection meta-policy with a reusable rubric bank. These enable deployable workflows for high-quality, auditable long-form research and provide a general recipe for RL beyond verifiable rewards. Below are applications organized by immediacy.
Immediate Applications
The following can be deployed with current LLMs and tooling (search APIs, document stores, judge models), even without RL retraining:
- Enterprise and consulting research assistants — stage-structured report generation with citations
- Sectors: enterprise services, consulting, pharma, legal, financial services, journalism
- Tools/workflows: rubric-guided scaffold prompts; web/enterprise RAG; stagewise LLM judging for QA; rubric bank per client/account for recurring topics; auditable XML logs
- Assumptions/dependencies: access to reliable search/RAG; judge LLM quality; content licensing; human review for high-stakes outputs
- Systematic literature reviews and survey automation
- Sectors: academia, R&D, healthcare research, regulatory sciences
- Tools/workflows: Plan-stage rubrics encode inclusion/exclusion, quality criteria; Research-stage iterates queries and sources (e.g., Semantic Scholar, PubMed); Review-stage mapping to rubrics; Answer-stage synthesis with citations; reflection bank for topic-specific heuristics
- Assumptions/dependencies: domain-specific rubric templates; access to scholarly databases; compliance with publisher terms
- Competitive intelligence and market research
- Sectors: product management, marketing, strategy, sales ops
- Tools/workflows: rubric checklists for competitor profiling, TAM/SAM/SOM, risks; reflection bank captures reusable sector frameworks; stagewise judge flags gaps or out-of-date references
- Assumptions/dependencies: web data reliability; governance for internal/external data blending
- Legal research and e-discovery triage (assistive, not advisory)
- Sectors: law firms, in-house counsel, compliance
- Tools/workflows: rubrics specify jurisdiction, precedent hierarchy, relief sought; stagewise QA for relevance and Shepardizing checks; audit logs; reflection bank of issue-spotting patterns
- Assumptions/dependencies: access to legal databases; strict human-in-the-loop; liability controls
- Clinical/health information synthesis and patient education (assistive)
- Sectors: healthcare providers, payers, medical education
- Tools/workflows: rubrics enforce sourcing from guidelines and primary literature; safety and scope disclaimers; stagewise judge checks citations and contraindications; reflection bank for recurring conditions
- Assumptions/dependencies: high-quality medical sources; clinical oversight; privacy and regulatory compliance
- Policy analysis and brief generation
- Sectors: government, NGOs, think tanks
- Tools/workflows: rubrics encode stakeholder impact, equity, cost-benefit, feasibility; evidence-grounded synthesis; reflection bank for policy frameworks and jurisdictional nuances
- Assumptions/dependencies: access to legislation and datasets; peer review; transparent provenance
- Investment memos and due diligence checklists
- Sectors: VC/PE, corporate development, risk
- Tools/workflows: Plan-stage rubrics for moat, traction, risk; Research-stage structured calls on filings, news, benchmarks; Review-stage gap analysis; reflection bank of sector-specific diligence templates
- Assumptions/dependencies: data freshness and source reliability; conflict-of-interest governance
- Editorial research and fact-checking
- Sectors: media, publishing, knowledge platforms
- Tools/workflows: rubrics for source credibility, independence, and corroboration; stagewise QA for claim–source alignment; reflection bank of style/standards
- Assumptions/dependencies: editorial policies; source access; human oversight
- Educational research coach and formative assessment
- Sectors: secondary/higher education, writing centers
- Tools/workflows: learner-facing rubrics; guided Plan/Review stages for self-evaluation; feedback based on stagewise judge; reflection bank becomes personalized study tips
- Assumptions/dependencies: alignment with curricula/assessment standards; academic integrity safeguards
- Internal knowledge-base upkeep and customer support article drafting
- Sectors: SaaS, IT, support ops
- Tools/workflows: rubrics ensure versioning, affected products, reproduction steps; stagewise QA for broken links or deprecated APIs; reflection bank of recurring issues
- Assumptions/dependencies: accurate product metadata; access control and content lifecycle
- Agent QA and observability for LLM-based systems
- Sectors: software/ML platforms
- Tools/workflows: adopt the stagewise judge and rubric buffers to evaluate agent processes (not just outputs); dashboards showing stage scores; reflection bank for failure patterns; candidate acceptance thresholds
- Assumptions/dependencies: budget for LLM judging; noise-tolerant rubric design; log retention and privacy
- Synthetic data generation and curriculum building for RLHF/RLAIF
- Sectors: AI/ML
- Tools/workflows: collect stage-structured trajectories with stagewise scores to train process reward models; mine reflection bank for rationales and rubrics; generate high-quality SFT/RL datasets
- Assumptions/dependencies: judge consistency and bias control; deduplication and quality filtering
- Personal research assistant for daily decisions
- Sectors: consumer
- Tools/workflows: rubrics for purchases, travel, education choices (budget, reliability, sustainability); stagewise planning and review; reflection bank of personal preferences
- Assumptions/dependencies: source credibility; transparency about uncertainty; privacy for personal data
Long-Term Applications
These require additional research, scaling, or institutionalization (e.g., domain reward models, stronger safety, integrations, or RL training at scale):
- Domain-specialized deep-research agents for regulated settings
- Sectors: medicine, law, finance, energy, aviation
- Products/workflows: train domain PRMs/reward models from stagewise rubrics; integrate with EHRs, legal case systems, financial terminals; strict human-in-the-loop and auditing
- Dependencies: certified datasets; formal safety cases; monitoring and rollback mechanisms
- Multi-agent research teams with rubric-aligned roles
- Sectors: enterprise R&D, think tanks, investigative journalism
- Products/workflows: planner–researcher–reviewer–writer agents, each with role-specific rubrics; inter-agent SS-GRPO for credit assignment across roles; EM-like coordination
- Dependencies: orchestration frameworks; process governance; conflict resolution policies
- Judge cost reduction via learned process reward models
- Sectors: AI/ML platforms
- Products/workflows: distill evolving LLM-judge rubrics into stagewise PRMs for low-latency scoring; online calibration against periodic human/LLM audits
- Dependencies: high-quality labeled traces; drift detection; fairness and bias analysis
- Organization-wide “rubric bank” as institutional memory
- Sectors: large enterprises, academia, government
- Products/workflows: versioned reflections/rubrics with access control, provenance, and deprecation policies; vector and symbolic retrieval; cross-team reuse
- Dependencies: KM governance; security and privacy; lifecycle management
- Standards for process-level evaluation and auditing
- Sectors: standards bodies, regulators
- Products/workflows: publish stage-specific rubric templates and benchmarks for long-form agents; certification protocols using process audits
- Dependencies: stakeholder consensus; reproducible evaluation suites; red-teaming
- Personalized education with rubric-driven mastery learning
- Sectors: EdTech
- Products/workflows: auto-generation of course- and teacher-aligned rubrics; stagewise feedback; meta-policy builds learner-specific rubrics over time
- Dependencies: data privacy (minors); alignment with accreditation; teacher oversight
- Scientific discovery assistants beyond literature (lab integration)
- Sectors: biotech, materials, chemistry
- Products/workflows: extend stages to plan–experiment–analyze–synthesize; instrument APIs; stagewise credit from experimental outcomes; reflection bank of failed/successful protocols
- Dependencies: safe automation; experiment logging; causal inference safeguards
- Policy design and scenario planning with quant models
- Sectors: public policy, development economics
- Products/workflows: integrate microsimulation/ABMs into Research/Review stages; rubrics for equity, uncertainty, feasibility; stagewise credit for model-driven insights
- Dependencies: validated models; data access; ethical review
- Multi-modal deep research (text, tables, figures, code)
- Sectors: scientific/technical communication, finance
- Products/workflows: extend rubrics to require multi-modal evidence (charts, code reproductions); process evaluators that parse and validate artifacts
- Dependencies: multi-modal LLMs; toolchains for reproducibility; compute budgets
- Long-horizon planning in software/ops and robotics
- Sectors: DevOps, IT ops, field robotics
- Products/workflows: apply stagewise RL to plan–execute–diagnose–repair loops; human supervisors provide stage scores; reflection bank of runbooks
- Dependencies: safe execution; robust rollback; operator training and approval gates
- Open-source SDKs and “Judge-as-a-Service”
- Sectors: developer platforms
- Products/workflows: libraries for Plan/Research/Review/Answer scaffolds; SS-GRPO training hooks; managed stagewise judging with evolving rubric buffers
- Dependencies: cost-effective inference; standardized schemas; privacy-preserving logs
- Compliance, safety, and provenance tooling for AI governance
- Sectors: regulated enterprises
- Products/workflows: stagewise audit trails, rubric conformance checks, automated gap reports; policy-to-rubric compilers for domain standards
- Dependencies: mapping org policies to rubrics; audit integration; change management
- Cross-lingual and locale-sensitive research agents
- Sectors: global enterprises, international organizations
- Products/workflows: localized rubrics for sources, legal regimes, cultural norms; reflection banks per locale; process evaluators with multilingual capability
- Dependencies: multilingual models; local source access; cultural/legal expertise
Notes on feasibility across applications:
- Core dependencies: capable base and judge models; high-quality search/RAG; disciplined prompt/scaffold implementation; compute and budget for judging.
- Risks/assumptions: LLM judge noise and bias; hallucinations despite rubrics; legal/IP constraints; privacy/compliance for memory banks; human oversight needed for high-stakes domains.
- Migration path: start with scaffolded prompting + judge-based QA + reflection bank (no RL), then add stagewise PRMs and, finally, SS-GRPO fine-tuning as data and governance mature.
Glossary
- Advantage: A scalar signal indicating how much better an action was compared to a baseline, used to weight policy gradients. "All tokens in the same stage block share the advantage."
- Agent rubric bank: A memory store of accepted reflections/rubrics distilled from judged trajectories to guide future attempts. "The highest-scored accepted reflection is also written into an agent rubric bank as natural-language memory."
- Agent–judge co-evolution: A coupled training dynamic where the agent improves via policy and memory updates while the judge refines its rubric buffer over time. "Coupled agent–judge co-evolution."
- Asynchronous execution: Running reflection generation/judging and updates out of sync with rollouts to avoid blocking and improve throughput. "We designed an efficient asynchronous reflection branch to train this meta-policy alongside task-policy RL without adding a sequential bottleneck, a notable problem in prior meta-RL literature~\citep{jiang2026metarl}."
- Autoregressive sampling: Generating the next token/action conditioned on the history in sequence. "We consider a language-model-based agent that autoregressively samples the next step"
- Causal stage-dependence matrix: A matrix that specifies how downstream stage scores contribute to earlier stage returns, respecting causal order. "SS-GRPO uses a causal stage-dependence matrix Λ […]"
- Credit assignment: Determining which decisions in a trajectory are responsible for outcomes to assign learning signals appropriately. "How can reinforcement learning train deep research agents beyond verifiable rewards, while enabling long-horizon credit assignment and learning from experience?"
- Critic-free: An RL setup that avoids training a learned value function (critic), instead relying on advantages or normalized returns. "giving GRPO finer-grained credit signals while remaining critic-free."
- Cross-episode transfer: Reusing reflections or knowledge from past, related tasks to improve performance on new tasks. "while cross-episode transfer retrieves reflections from related questions."
- Denser returns: More frequent and informative reward signals assigned within a trajectory, not just at the end. "These stagewise scores define denser returns that combine local stage quality with downstream impact, giving GRPO finer-grained credit signals while remaining critic-free."
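- Expectation–Maximization (EM): An iterative algorithm that alternates between estimating latent quantities (E-step) and maximizing an objective given those estimates (M-step); it inspires the framework’s estimate–maximize view of latent task structure. "The name RubricEM reflects an Expectation–Maximization (EM)-inspired estimate–maximize view"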
- GRPO: A critic-free policy optimization method using normalized, group-relative advantages with clipping, akin to PPO variants. "We instantiate SS-GRPO as a critic-free stagewise variant of GRPO"
- Grounded answer: A final response explicitly supported by retrieved evidence. "and eventually produces a final long-form answer grounded in retrieved evidence."
- Long-horizon: Tasks or trajectories with many sequential decisions where feedback may be delayed. "provide denser semantic feedback for long-horizon optimization."
- Meta-policy: A policy that learns to produce reusable reflections or strategies that help future task performance. "RubricEM trains a shared-backbone reflection meta-policy"
- Meta-RL: Learning to learn across tasks by optimizing policies that improve with experience. "A related line of work trains meta-policies during reinforcement learning, often referred to as Meta-RL"
- On-policy: Using data sampled from the current policy during training and evaluation. "they guide search and synthesis, serve as on-policy references for the judge"
- Process-level supervision: Feedback on intermediate steps of reasoning or action, not only final answers. "where trajectories can be decomposed into subgoals with reliable process-level supervision."
- ReAct: A prompting strategy that interleaves reasoning and acting (tool calls) during problem solving. "our scaffold outperforms a standard ReAct (think → act) prompt on DRB."
- Rejection sampling: Filtering generated data by discarding samples that violate constraints or schema. "we apply rejection sampling to discard outputs that violate stage boundaries, tool-calling syntax, citation format, or grounding constraints."
- Rollout: A full sampled trajectory of actions and observations under a policy for a given query. "Given a query, we sample rollouts"
- Rubric buffer: A judge-maintained, evolving set of stage-specific rubrics used to score trajectories. "The judge maintains an evolving rubric buffer for each stage"
- Rubric-guided: Using explicit evaluation criteria (rubrics) to condition planning, search, feedback, and memory. "a rubric-guided reinforcement learning framework"
- Scaffold (Structured reasoning scaffold): A staged workflow that imposes explicit structure (Plan→Research→Review→Answer) on trajectories. "Rubric-guided structured reasoning scaffold in RubricEM."
- SFT (Supervised fine-tuning): Fine-tuning a pretrained model on labeled demonstrations before RL. "The SFT stage includes both short-form and long-form data, while the RL stage exclusively focuses on long-form queries."
- Shared backbone: A single model network used jointly for both task policy and reflection meta-policy. "The task policy and reflection meta-policy share one backbone"
- SS-GRPO (Stage-Structured GRPO): A GRPO variant that assigns stagewise rewards and advantages aligned with rubric-defined stages. "We propose Stage-Structured GRPO (SS-GRPO)"
- Stagewise normalization: Normalizing returns/advantages separately within each stage across sampled rollouts to stabilize training. "We instantiate SS-GRPO as a critic-free stagewise variant of GRPO by normalizing returns separately within each stage across the rollout group:"
- Tool-augmented rollouts: Trajectories that include structured tool calls (e.g., search) alongside text generation. "judge feedback is coarse and delayed over long tool-augmented rollouts"
- Trajectory: The sequence of actions and observations produced by the agent for a query. "produces a trajectory"
- Verifiable rewards: Objective correctness signals available for short-form tasks (e.g., exact answers), often absent in open-ended settings. "pushes reinforcement learning beyond the regime of verifiable rewards."
- Within-episode refinement: Reusing the reflection from a prior attempt on the same query to improve a subsequent attempt. "within-episode refinement retrieves the previous reflection for the same query"
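The "Rejection sampling" and "Scaffold" entries above can be tied together with a short sketch of a filter that discards malformed trajectories. Everything specific here is an assumption for illustration: the tag names (`<plan>`, `<research>`, `<review>`, `<answer>`), the `[n]` citation format, and the two checks chosen. The source only states that outputs violating stage boundaries, tool-calling syntax, citation format, or grounding constraints are discarded.

```python
import re

STAGE_TAGS = ["plan", "research", "review", "answer"]  # hypothetical tag names


def stages_in_order(text):
    """True if each stage block appears exactly once, in scaffold order."""
    position = 0
    for tag in STAGE_TAGS:
        blocks = list(re.finditer(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL))
        if len(blocks) != 1 or blocks[0].start() < position:
            return False
        position = blocks[0].end()
    return True


def citations_grounded(text, num_sources):
    """True if the answer cites at least one source and every [n] is in range."""
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if answer is None:
        return False
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer.group(1))]
    return bool(cited) and all(1 <= n <= num_sources for n in cited)


def keep(samples, num_sources):
    """Rejection sampling: keep only samples that pass every check."""
    return [
        s for s in samples
        if stages_in_order(s) and citations_grounded(s, num_sources)
    ]


good = "<plan>p</plan><research>r</research><review>v</review><answer>Claim [1].</answer>"
swapped = "<research>r</research><plan>p</plan><review>v</review><answer>Claim [1].</answer>"
ungrounded = "<plan>p</plan><research>r</research><review>v</review><answer>Claim [7].</answer>"

kept = keep([good, swapped, ungrounded], num_sources=3)
```

Only `good` survives: `swapped` breaks the stage order, and `ungrounded` cites a source index that was never retrieved. A fuller filter would also validate tool-call syntax, which is omitted here.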