Off-Policy Value-Based Reinforcement Learning for Large Language Models
Abstract: Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
Explain it Like I'm 14
What is this paper about?
This paper is about teaching LLMs to reason better using reinforcement learning (RL) in a way that wastes less data and time. The authors introduce a new method called ReVal that lets an LLM learn from past attempts (not just fresh ones) while staying fast and affordable to train.
What questions are the researchers trying to answer?
They focus on three simple questions:
- How can we make RL for LLMs more efficient, especially when generating long answers is slow and costly?
- Can we reuse old training examples safely and effectively instead of throwing them away after one use?
- Is there a way to use value-based RL (which learns “how good” each next choice is) with LLMs without adding extra heavy models?
How did they do it? (Methods explained simply)
First, some quick translations:
- Reinforcement Learning (RL): Like training by trial and error. The model tries answers, gets a score (reward), and improves.
- On-policy vs. Off-policy:
- On-policy is like practicing a sport and only learning from today’s games—you throw away yesterday’s recordings.
- Off-policy is like also studying a replay library—you learn from both new and old games, so you improve faster with less new practice.
- Value-based RL: Instead of directly pushing the model to pick certain actions, it teaches the model to estimate how good each possible next step is. Think of it as giving a “quality score” to each next word.
- Replay buffer: A big folder of past attempts that the model can revisit many times to learn more efficiently.
- Bellman update: A step-by-step rule that adjusts those “quality scores” by looking at how good the next steps turned out, like updating a player’s rating based on future plays.
- Logits: The model’s internal scores for each next token before turning them into probabilities. The key idea here is to treat these scores as the “quality scores” (Q-values) we need for value-based RL—so we don’t need a second model.
Here’s what ReVal does:
- Reuses past attempts (off-policy) via a replay buffer, so the model learns more from the same data.
- Treats the LLM’s internal scores (logits) as the “how good is this next token?” values (Q-values). This avoids training a separate value network, keeping things lightweight.
- Combines two kinds of feedback:
- Stepwise signals: small nudges that keep the model’s reasoning consistent at each step (like checking every line in a math solution).
- Trajectory-level signals: a final check on the whole answer—right or wrong—using an automatic verifier (great for math problems).
- Adds “safety belts” to keep training stable:
- KL regularization: keeps the model from drifting too far from a safe reference version.
- A clever “reward shaping” setup so that if there’s no useful reward yet, the model doesn’t wander off randomly.
- Periodically resetting the reference model so learning doesn’t slow down too much.
In short: ReVal is a single-model, value-based RL method that learns from replays, uses both step-by-step and final-answer checks, and stays stable and efficient.
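The replay idea above can be sketched in a few lines. This is an illustrative toy (class and field names are ours, not the authors' code): a FIFO buffer stores past trajectories and serves uniform samples, so each expensive generation round can feed several update steps.

```python
import collections
import random

class FIFOReplayBuffer:
    """First-in-first-out buffer of past trajectories (prompt, tokens, reward)."""
    def __init__(self, capacity):
        # deque with maxlen evicts the oldest entry automatically when full
        self.buf = collections.deque(maxlen=capacity)

    def add(self, trajectory):
        self.buf.append(trajectory)

    def sample(self, batch_size):
        # Uniform sampling, matching the paper's simple FIFO setup
        # (prioritized replay is left to future work).
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

buffer = FIFOReplayBuffer(capacity=4)
for step in range(6):
    buffer.add({"prompt": f"p{step}", "reward": step % 2})

# Capacity 4, so only the 4 most recent trajectories remain.
print(len(buffer.buf))          # 4
print(buffer.buf[0]["prompt"])  # p2 — oldest surviving entry
```

The point of the sketch is the reuse pattern: once trajectories are in the buffer, the model can take many gradient steps on them without generating anything new.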
What did they find, and why does it matter?
They tested ReVal on math reasoning benchmarks and compared it to popular methods like GRPO (a strong on-policy baseline).
Key results:
- Faster learning: Reusing past attempts speeds up training a lot—about 4.3× fewer steps to reach strong performance in their tests.
- Better accuracy: On a 1.5B-parameter model (DeepSeek-R1-Distill-1.5B), ReVal beat GRPO by:
- +2.7% on AIME24 (a math contest benchmark)
- +4.5% on GPQA (an out-of-domain test, which shows better generalization)
- Strong when data is scarce: When they limited new rollouts (N=1), ReVal pulled ahead even more—exactly where reuse matters most.
- Less wall-clock time: Because generating new answers is expensive, doing more learning per batch saves total time. In one setup, total training time dropped by about 18%.
Why it matters:
- It shows that value-based, off-policy RL isn’t just possible for LLMs—it can be better and cheaper.
- It makes training for long, complex tasks (like multi-step reasoning) more practical.
What could this change in the future?
- More capable reasoning models at lower cost: By learning more from less, research labs and companies can train better models without massive budgets.
- Better “agentic” LLMs: Tasks where the model plans or acts over many steps can benefit a lot from replay and value-based learning.
- Safer, steadier training: The paper’s stabilization tricks (reward shaping, KL control, reference resets) provide a toolkit for reliable RL training of LLMs.
- Room to grow: The authors used a simple replay buffer; smarter replay (like prioritizing the most helpful past attempts) could improve results further.
Overall, ReVal suggests a practical path forward: reuse what you’ve already generated, teach the model how good each step is, and keep training stable—so LLMs can learn to reason better, faster, and cheaper.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper. These items are intended to guide future research.
- Task/generalization scope
- Validation is restricted to mathematical reasoning with verifiable 0/1 rewards; it is unknown how ReVal performs on non-verifiable rewards, partial-credit grading, code generation, tool-use/agentic tasks, dialogue safety/alignment, or multimodal settings.
- Out-of-domain evaluation is limited (primarily GPQA); broader OOD robustness across diverse domains and datasets remains untested.
- The method assumes deterministic transitions and fixed-length horizons; real agentic settings have variable horizons, tool calls, and stochastic outcomes. How ReVal handles these is unclear.
- Scalability and model size
- Experiments use 1.5B and 7B models; scalability to larger frontier models (e.g., 30B–70B+) and distributed/asynchronous training regimes is not demonstrated.
- Memory/computation trade-offs of maintaining a large replay buffer with long sequences are not quantified (I/O, GPU memory pressure, serialization costs).
- Off-policy stability and correctness
- No off-policy correction is used (e.g., importance sampling, V-trace, or reweighting by behavior policy); the impact of distribution mismatch between replay data and current policy is unstudied.
- The approach lacks “target network” or delay mechanisms commonly used in value-based RL for stability; effects of target-free Bellman updates on divergence/oscillation are unknown.
- Overestimation/underestimation biases introduced by the log-sum-exp backup and function approximation are not analyzed; no double-Q or conservative mechanisms are explored.
- Replay staleness is not quantified; there is no analysis of how age of samples in the buffer affects bias/variance and learning dynamics.
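To make the missing-correction point concrete, here is a minimal sketch (not from the paper) of one standard remedy: truncated per-token importance weights in the spirit of V-trace, which downweight replayed tokens that the current policy now finds less likely than the behavior policy did.

```python
import math

def truncated_is_weights(logp_current, logp_behavior, clip=1.0):
    """Per-token weights rho_t = min(clip, pi(a_t|s_t) / mu(a_t|s_t)),
    computed from log-probabilities under the current and behavior policies."""
    return [min(clip, math.exp(lc - lb))
            for lc, lb in zip(logp_current, logp_behavior)]

# Token 1: current and behavior policies agree -> full weight.
# Token 2: current policy assigns lower log-prob -> weight exp(-1) ≈ 0.368.
weights = truncated_is_weights([-1.0, -2.0], [-1.0, -1.0], clip=1.0)
print(weights[0])  # 1.0
```

A correction of roughly this shape is what the bullet above notes is absent from ReVal; whether it would help or hurt the reported stability is exactly the open question.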
- Replay buffer design
- Only a FIFO, uniform sampler is used; prioritized replay, diversity sampling, deduplication, or recency/novelty-aware policies are suggested but untested.
- Sensitivity to buffer size, batch composition (on-policy vs off-policy mix), and number of updates per generation (K) is not systematically characterized.
- Effects of replay from heterogeneous sources (e.g., mixing logs from other models or older checkpoints) remain unexplored.
- Reward shaping and theoretical guarantees
- The shaped Bellman operator and Calibrated Initialization property are shown, but there are no convergence guarantees under function approximation and off-policy sampling.
- It is unclear whether the shaping induces any unintended fixed points or spurious optima when combined with KL regularization and replay.
- Theoretical justification for periodic reference-policy resets (frequency selection, stability guarantees, monotonic improvement) is absent.
- Logits-as-Q assumption
- The validity of interpreting logits as Q-values off-distribution (i.e., far from pretraining data) is assumed but not tested; calibration of logit-derived Q-values after RL updates is not measured.
- How RL fine-tuning affects the implicit Q-value interpretation (e.g., degradation of calibration or value semantics over time) is not analyzed.
- Stepwise vs trajectory-level signals
- Although the paper motivates combining stepwise and trajectory-level signals, the implementation largely relies on reference-policy-based shaping; there is no ablation isolating true step-level verification/process supervision.
- No experiments quantify the individual contribution of stepwise vs trajectory-level components to stability and performance.
- Handling of negative samples and credit assignment
- The paper notes that 0/1 rewards mostly pull incorrect samples toward the reference rather than explicitly penalizing them, and tries normalized advantages; a principled mechanism to downweight bad tokens/steps (e.g., token-level credit assignment, process-level error localization) remains undeveloped.
- Alternative negative-sample utilizations (e.g., margin losses, contrastive objectives, conservative penalties on incorrect continuations) are untested.
- Hyperparameterization and adaptivity
- The choice of β (reward scaling/KL weight) is hand-tuned and correlated with response length; there is no adaptive, per-sample or per-length schedule, nor a principled method to set β automatically.
- The frequency of reference resets is empirically chosen; adaptive strategies (e.g., based on KL growth, TD error, or performance plateaus) and their theoretical properties are not investigated.
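One hypothetical direction for the β gap: since the paper observes β correlates with response length, a simple length-aware schedule could hold per-token regularization pressure roughly constant. The function below is our illustrative sketch, not a method from the paper.

```python
def length_aware_beta(base_beta, response_len, ref_len=1024):
    """Hypothetical schedule: scale the KL weight inversely with response
    length so the total KL penalty grows roughly linearly, keeping the
    per-token pressure near base_beta at the reference length."""
    return base_beta * ref_len / max(response_len, 1)

# A response twice the reference length gets half the KL weight.
print(length_aware_beta(0.1, 2048))  # 0.05
```

Whether such a schedule actually stabilizes training off-policy is untested; it is offered only as one concrete instantiation of the "adaptive, per-length schedule" the bullet calls for.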
- Exploration and data quality
- Effects of sampling temperature, nucleus/top-k sampling, or entropy bonuses on replay quality and off-policy learning are not studied.
- No mechanism controls for distributional coverage vs. overfitting to high-reward but narrow modes; exploration-exploitation balance in off-policy buffers is uncharacterized.
- Comparisons and baselines
- Comparisons omit several strong or related baselines (e.g., DPO/IPO, DAPO variants, SAC-like or conservative value-based methods for LLMs, off-policy RL with learned reward models, ROVER-style formulations).
- Fairness of comparisons (e.g., hyperparameter tuning budgets per baseline, decoding strategies, evaluation seeds) and statistical significance (error bars/tests) are not reported.
- Practical training efficiency
- Reported wall-clock savings are for specific settings; broader profiling (GPU utilization, memory footprint, latency variance, scaling efficiency) across hardware and cluster setups is missing.
- Interaction between generation system optimizations (e.g., speculative decoding, cache reuse) and off-policy reuse is not evaluated.
- Robustness, safety, and reliability
- Potential for reward hacking due to shaping and KL dynamics is not discussed (e.g., exploiting reference-policy terms).
- Catastrophic forgetting or mode collapse risk when repeatedly resetting the reference to the current policy is not assessed.
- Safety/harms in off-policy training (e.g., amplifying unintended behaviors from replayed trajectories) are not examined.
- Evaluation methodology
- Evaluations use avg@16 samples with temperature 1.0; sensitivity to evaluation protocol, decoding strategies, and sample count is not tested.
- No multi-seed variance reporting or confidence intervals; robustness across random seeds is unknown.
- Potential benchmark contamination (e.g., overlaps with training data in DeepScaleR) is not checked.
- Extensibility and integration
- Integration with RLHF (learned reward models) vs RLVR remains unclear—can ReVal leverage noisy preference/reward models and remain stable off-policy?
- Use with partial-credit or programmatic verification that assesses intermediate steps is not explored.
- Application to asynchronous/streaming data settings (e.g., online logs) is not demonstrated.
- Implementation details to clarify
- Absence of target networks or EMA/Polyak averaging for stability warrants clarification or empirical justification.
- How variable-length sequences are handled in practice (beyond padding) and its effect on β, KL, and replay sampling is not detailed.
- Storage format, deduplication, and memory management strategies for long sequences in the buffer are unspecified.
Practical Applications
Immediate Applications
Below are deployable use cases that capitalize on ReVal’s off‑policy, value‑based RL for LLMs, its replay-buffer training, calibrated initialization, and mixed stepwise/trajectory signals. They prioritize settings with verifiable rewards and long or variable horizons where generation dominates training cost, leveraging the paper’s reported 4.3x faster convergence and ~18% lower wall‑clock time.
- Sector: Software engineering (developer tools)
- Application: Faster RL post‑training of code assistants using unit tests as verifiable rewards
- What it does: Fine‑tune code generation/repair models by reusing failed and successful code attempts (trajectory logs) with unit tests as outcome verifiers, extracting more learning from each expensive generation cycle.
- Potential tools/workflows:
- “Trajectory warehouse” for code attempts (inputs, patches, test outcomes) as a replay buffer
- “Verifier‑as‑a‑Service” to run unit tests at scale
- KL‑reset scheduler and length‑aware β auto‑tuner that plug into existing RLVR pipelines
- Integration into existing frameworks (e.g., Verl) for single‑model logit‑as‑Q updates
- Assumptions/dependencies: Adequate test coverage to provide dense rule‑based rewards; privacy/compliance for storing code and logs; compute budget for off‑policy updates; correct calibration of β and KL resets to prevent instability.
- Sector: Education (intelligent tutoring, assessment)
- Application: Cost‑efficient math/logic tutoring LLMs with verifiable solutions
- What it does: Use problem answer checkers and stepwise consistency signals to improve chain‑of‑thought correctness, replaying student interaction trajectories (correct and incorrect) to accelerate learning.
- Potential tools/workflows:
- Math equivalence checkers and solution verifiers
- Replay buffers of student attempts (with anonymization)
- Normalized‑advantage handling for negative samples to avoid “drift to reference”
- Assumptions/dependencies: High‑quality verifiers for final answers; policies for anonymization/FERPA/GDPR; alignment constraints to maintain helpfulness.
- Sector: Finance (report automation, reconciliation)
- Application: Verifier‑guided reasoning on financial documents with replayed workflows
- What it does: Fine‑tune LLMs on reconciliation, variance analysis, or rule‑driven compliance checks where outcomes are decisively verifiable (e.g., balance checks, validation rules), exploiting replay to reduce rollout costs.
- Potential tools/workflows:
- Rule checkers (e.g., reconciliation rules, schema validators)
- Secure, redacted replay buffer of historical processing runs
- Monitoring of KL and β to maintain stability and compliance
- Assumptions/dependencies: Strong deterministic verifiers; strict data governance and PII handling; domain coverage to avoid reward hacking.
- Sector: Customer support/IT Ops
- Application: Troubleshooting agents with testable outcomes (e.g., config diffs, playbook execution)
- What it does: Train support agents using environment simulations or sandbox tests as rule‑based verifiers; replay historical tickets and agent actions to improve resolution steps without constantly sampling new trajectories.
- Potential tools/workflows:
- Sandbox “health checks”/playbook simulators as verifiers
- FIFO replay of agent traces; prioritized replay can be added later
- Assumptions/dependencies: Availability of test harnesses; mapping outcomes to clear pass/fail signals; safe sandboxing.
- Sector: LLM platforms/cloud ML
- Application: Production RLVR pipelines with replay to reduce compute cost
- What it does: Offer a “ReVal mode” for customers running RL fine‑tuning: K updates per generation round, FIFO buffer, periodic reference resets, and β tuning reduce rollout frequency and wall‑clock time.
- Potential tools/workflows:
- Managed replay buffers with configurable capacity and retention
- Job schedulers that separate generation from update phases
- Telemetry that optimizes K based on Tgeneration ≫ Tupdate
- Assumptions/dependencies: Clear service boundaries for storing trajectories; customer consent and data partitioning; operational guardrails for KL reset and β scheduling.
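The "K updates per generation round" economics above can be sketched with a toy cost model (the function and timings are illustrative, not measured numbers from the paper): when generation dominates, raising K amortizes its cost across more learning.

```python
def train_time(rounds, k_updates, t_generate=10.0, t_update=1.0):
    """Simulated wall-clock time for a generate-then-update-K-times loop,
    assuming Tgeneration >> Tupdate (here 10:1, an arbitrary illustrative ratio)."""
    total = 0.0
    for _ in range(rounds):
        total += t_generate            # one expensive generation phase
        total += k_updates * t_update  # K cheap off-policy updates from replay

    return total

# Reaching the same 40 gradient updates:
# K=4 needs 10 generation rounds; K=1 (on-policy style) needs 40.
print(train_time(rounds=10, k_updates=4))  # 140.0
print(train_time(rounds=40, k_updates=1))  # 440.0
```

This is the core of the "ReVal mode" pitch: the same update budget at a fraction of the generation cost, with telemetry choosing K from the observed generation/update time split.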
- Sector: Research & Academia
- Application: Sample‑efficient open benchmarking and reproducible off‑policy RL for LLMs
- What it does: Allow smaller labs to attain strong reasoning performance by maximizing reuse of collected trajectories (e.g., AIME/MATH/GPQA), lowering compute costs.
- Potential tools/workflows:
- Open replay datasets (with PII stripping), “ReVal‑ready” training recipes
- Ablation templates for KL reset frequency and β vs. response length
- Assumptions/dependencies: Availability of RLVR‑friendly tasks with deterministic verifiers; community norms for sharing replay logs safely.
- Sector: Sustainability/IT policy within organizations
- Application: Energy and cost reduction in RL training operations
- What it does: Use off‑policy updates to cut generation rounds (dominant energy/cost driver) and track efficiency metrics as part of internal Green‑AI initiatives.
- Potential tools/workflows:
- Dashboards showing generation/update time splits, K vs. throughput
- Policies that standardize replay use for long‑horizon post‑training
- Assumptions/dependencies: Instrumentation to measure Tgeneration and Tupdate; governance to ensure replay buffers adhere to data retention policies.
- Sector: Safety & alignment engineering
- Application: Drift‑free bootstrapping and safer “no reward, no change” training
- What it does: Apply ReVal’s calibrated initialization and reward shaping to avoid spurious drift when reward signals are sparse or temporarily unavailable.
- Potential tools/workflows:
- Training guardrails: assert r=0 ⇒ no policy change
- Canary tests for calibration and KL growth
- Assumptions/dependencies: Correct implementation of the shaped Bellman objective; monitoring for mis‑calibration and reward hacking.
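The "no reward, no change" guardrail can be checked numerically on a toy KL-regularized objective. This sketch is our simplification, not the paper's shaped Bellman loss: with zero reward, the objective reduces to −β·KL, so the best policy is the reference itself.

```python
import math

def objective(p, p_ref, reward, beta):
    """J(pi) = E_pi[r] - beta * KL(pi || pi_ref), for a discrete 2-action policy."""
    kl = sum(pi * math.log(pi / pr) for pi, pr in zip(p, p_ref))
    expected_r = sum(pi * r for pi, r in zip(p, reward))
    return expected_r - beta * kl

p_ref = [0.7, 0.3]
zero_reward = [0.0, 0.0]
# Scan candidate policies; with r = 0 the maximizer should be p_ref itself,
# because any deviation only pays a KL penalty.
candidates = [[0.1, 0.9], [0.3, 0.7], [0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
best = max(candidates, key=lambda p: objective(p, p_ref, zero_reward, beta=0.1))
print(best)  # [0.7, 0.3]
```

A canary test of this shape (assert the optimizer stays at the reference when r = 0) is one concrete way to implement the guardrails the bullet list describes.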
Long-Term Applications
These use cases extend ReVal’s ideas (single‑model value learning with logit‑as‑Q, replay buffers, mixed stepwise/trajectory signals) into domains requiring additional research, tooling, or regulatory clearance.
- Sector: Autonomous agents (multi‑tool, long‑horizon workflows)
- Application: Off‑policy, log‑driven improvement of agents using real production traces
- What it could do: Train agents to plan and execute sequences (search + code + APIs) by replaying real traces and applying task verifiers (e.g., passed tests, successful API calls).
- Potential tools/products:
- “Trajectory commons” to pool and sample agent traces across tasks
- Prioritized experience replay for rare, high‑value traces
- Dependencies/assumptions: Robust verifiers across heterogeneous tools; methods to mitigate off‑policy distribution shift and reward hacking; privacy controls for production logs.
- Sector: Robotics & embodied AI
- Application: Language‑mediated long‑horizon planning with replayed robot logs
- What it could do: Use success detectors and simulators as verifiers; reuse long sequences to reduce expensive real‑world data collection.
- Potential tools/products:
- Replay buffers tied to robot telemetry; hybrid stepwise (subgoal consistency) + outcome signals
- Safety overseers for off‑policy updates
- Dependencies/assumptions: Reliable outcome verification (e.g., vision‑based success metrics); safe‑RL constraints; sim‑to‑real robustness.
- Sector: Healthcare
- Application: Verifier‑driven clinical reasoning and guideline adherence
- What it could do: Train clinical assistants on de‑identified logs; verify outputs via executable clinical rules/guideline checkers; reuse rare edge cases via prioritized replay.
- Potential tools/products:
- Guideline engines as verifiers; audit trails for KL resets and policy drift
- Dependencies/assumptions: Strict privacy and regulatory approvals; validated verifiers; clinical oversight; bias and safety evaluation.
- Sector: Governance & public policy
- Application: Standards for compute‑ and energy‑efficient RL fine‑tuning
- What it could do: Codify best practices (replay, off‑policy updates, calibrated initialization) into procurement or compliance guidelines for public sector AI projects.
- Potential tools/products:
- Reporting templates for generation/update splits, reuse factors, and energy metrics
- Dependencies/assumptions: Consensus measurement frameworks; third‑party audits; alignment with AI sustainability goals.
- Sector: Consumer devices/on‑device learning
- Application: Single‑model RL adaptation for personal assistants with limited memory
- What it could do: Leverage logit‑as‑Q to avoid separate value networks; reuse on‑device interaction traces for continual personalization with minimal generation.
- Potential tools/products:
- Lightweight replay buffers; privacy‑preserving schedulers for KL resets; dynamic β based on response length
- Dependencies/assumptions: Efficient on‑device training runtimes; strong privacy guarantees; throttled updates to manage power and wear.
- Sector: Foundation model operations (MLOps)
- Application: Continuous off‑policy RL from production logs with hybrid verification
- What it could do: Combine verifiable sub‑tasks (tests, schemas) with preference signals for others; schedule periodic reference resets for stable long‑running training.
- Potential tools/products:
- Unified “verifier suite” with pluggable checkers; β schedulers; curriculum strategies for resets; negative‑sample utilization policies
- Dependencies/assumptions: Reliable automated verifiers or human‑in‑the‑loop fallbacks; drift monitoring; legal frameworks for log use.
- Sector: Research
- Application: Advanced replay strategies and theory for value‑based RL in LLMs
- What it could do: Prioritized replay for rare reasoning patterns; distribution‑shift‑aware sampling; theoretical guarantees for off‑policy stability with language policies.
- Potential tools/products:
- Open benchmarks and leaderboards for off‑policy LLM RL; diagnostic suites for calibration and Bellman residuals
- Dependencies/assumptions: Community datasets of trajectories; standardized verifiers across domains.
- Sector: Cross‑domain assistants (generalization)
- Application: OOD‑robust assistants via off‑policy reuse of diverse traces
- What it could do: Exploit ReVal’s observed OOD gains (e.g., GPQA improvements) by curating diverse replay buffers that mix domains and difficulty.
- Potential tools/products:
- Replay buffer “mixers” that balance domain coverage and difficulty; OOD detectors to steer sampling
- Dependencies/assumptions: Rich, labeled/verifiable multi‑domain traces; safeguards against spurious correlations.
- Sector: Safety & alignment
- Application: Off‑policy RLHF hybrids with calibrated shaping
- What it could do: Unify human preference rewards (trajectory‑level) with stepwise consistency and reward shaping to limit drift and improve data efficiency.
- Potential tools/products:
- Human‑feedback integrators for ReVal objectives; auditing tools to detect KL spikes and mis‑calibration
- Dependencies/assumptions: Reliable preference data; clear protocols for mixing verifiable and subjective rewards; extensive evals to prevent reward hacking.
Notes on Assumptions and Dependencies (common across many applications)
- Verifiable rewards: ReVal thrives where outcomes can be deterministically or programmatically checked (tests, schemas, rules, success detectors). In tasks lacking verifiers, benefits diminish or require hybrid rewards with human feedback.
- Data governance: Storing and replaying trajectories requires consent, PII redaction, retention policies, and strong access controls.
- Stability controls: Proper β (scaled to response length) and periodic KL reference resets are critical to prevent training collapse or over‑constraint.
- Distribution shift: Off‑policy sampling must manage mismatch between replay data and current policy; prioritized or shift‑aware sampling may be needed for robust scaling.
- Compute profile: Gains rely on Tgeneration ≫ Tupdate; if generation is cheap relative to updates, replay benefits may be smaller.
- Implementation correctness: Calibrated initialization and shaped Bellman objectives must be implemented precisely to avoid spurious drift when r=0.
Glossary
- Actor-critic: A reinforcement learning architecture that learns both a policy (actor) and a value function (critic) jointly. Example: "ReMax (Li et al., 2024) was the first to move from actor-critic to actor-only RL, significantly reducing memory usage and training time for LLM post-training."
- Actor-only RL: A policy optimization approach that removes the critic/value network and updates only the policy to reduce compute/memory cost. Example: "ReMax (Li et al., 2024) was the first to move from actor-critic to actor-only RL, significantly reducing memory usage and training time for LLM post-training."
- Autoregressive (LLMs): A generative modeling setup where the model predicts the next token conditioned on all previous tokens. Example: "Because autoregressive LLMs are naturally parameterized as token-level policies, actor-critic policy optimization algorithms such as PPO (Schulman et al., 2017) initially became the dominant approach for RL-based post-training."
- Bellman operator: A mapping that defines the target for value function updates based on rewards and future values; central to value-based RL. Example: "Based on this modified reward, we define the Bellman operator as:"
- Bellman residual: The discrepancy between current value estimates and Bellman targets; minimizing it drives value consistency. Example: "The trajectory-level Bellman residual loss is then:"
- Calibrated Initialization: A property of an objective ensuring that when rewards are absent, the optimal policy equals the reference policy (no spurious drift). Example: "Definition 1 (Calibrated Initialization)."
- Endogenous reward: A reward signal derived from model or reference-policy terms (not the environment) used to guide training. Example: "the remaining terms form an endogenous reward (Li et al., 2025a) guided by the reference policy."
- FIFO replay buffer: A first-in-first-out storage for past trajectories enabling multiple off-policy updates from the same data. Example: "We adopt a first-in-first-out (FIFO) replay buffer of size M."
- GRPO: A low-cost policy optimization method for LLMs that follows an actor-only, largely on-policy paradigm. Example: "Following ReMax, a series of methods, including GRPO (Shao et al., 2024) and DAPO (Yu et al., 2025), further advanced this low-cost policy optimization paradigm."
- KL divergence: A measure of how one probability distribution diverges from another; used here to regularize policy updates. Example: "Besides, D_KL(πθ(·|x), πref(·|x)) = Σ_{a1:H} πθ(a1:H|x) log(πθ(a1:H|x) / πref(a1:H|x)) denotes the KL divergence, which prevents the learning model from deviating too far from the reference model, and β > 0 controls the regularization strength."
- KL regularization: Penalizing divergence from a reference policy during RL to maintain stability and controllability. Example: "Reinforcement Learning with Verifiable Reward (RLVR) has become a widely adopted paradigm in LLM reasoning following recent breakthroughs ... trains LLMs by maximizing a rule-based outcome reward with KL-regularization:"
- Logit-as-Q parameterization: Treating the LLM’s token logits as action-value (Q) estimates, unifying policy and value in one model. Example: "Similar to Li et al. (2025a), TBRM (Yuan et al., 2025) adopts the logit-as-Q parameterization:"
- Logits (as Q-values): The pre-softmax scores of the model interpreted as action-value estimates for tokens. Example: "It shows that the logits of a pretrained LLM can be interpreted as parameterizing action values of an endogenous reward, up to a state-dependent transformation."
- Markov decision process (MDP): The formal framework (states, actions, transitions, rewards, horizon) used to model LLM token generation as sequential decision-making. Example: "We adopt the Markov decision process (MDP) formulation of LLMs from (Li et al., 2024), defined by the tuple M = (S, V, r, P, ρ, H)."
- Off-policy: Learning from trajectories generated by a different policy than the one currently being updated, enabling experience reuse. Example: "The next fundamental requirement is the ability to reuse experience, that is, RL for LLMs must become off-policy."
- On-policy: Learning that requires fresh trajectories from the current policy for each update, limiting data reuse. Example: "However, these actor-only methods remain fundamentally on-policy."
- PPO: Proximal Policy Optimization, a popular stable policy-gradient algorithm often used in RL for LLMs. Example: "actor-critic policy optimization algorithms such as PPO (Schulman et al., 2017) initially became the dominant approach for RL-based post-training."
- Prioritized experience replay: A replay strategy that samples transitions with higher learning potential more frequently. Example: "We leave the exploration of more efficient sampling strategies, such as prioritized experience replay (Schaul et al., 2015), to future work."
- Q-function: The action-value function that estimates expected return for taking an action in a state and following a policy thereafter. Example: "A key challenge in applying value-based RL to LLMs is how to represent or initialize the Q-function."
- Reference policy: A fixed (periodically updated) policy used as an anchor for KL regularization and shaping terms. Example: "Input: Task prompt dataset Dtask, first-in-first-out (FIFO) replay buffer Dreplay = Ø, task reward r, reward scaling coefficient β, reference policy πref with parameters θref, number of iterations T."
- Reinforcement Learning from Human Feedback (RLHF): Training that uses human preference signals via a learned reward model to guide RL. Example: "Since the advent of reinforcement learning from human feedback (RLHF), reinforcement learning (RL) has become a central component of LLM post-training"
- Reinforcement Learning with Verifiable Reward (RLVR): Training that uses deterministic, rule-based outcome rewards (e.g., correctness checks) instead of learned reward models. Example: "Reinforcement Learning with Verifiable Reward (RLVR) has become a widely adopted paradigm in LLM reasoning"
- ReMax: An actor-only RL method that reduces memory and compute by removing the critic in LLM post-training. Example: "ReMax (Li et al., 2024) was the first to move from actor-critic to actor-only RL, significantly reducing memory usage and training time for LLM post-training."
- Replay buffer: A storage of past trajectories used to sample batches for off-policy updates, increasing sample efficiency. Example: "ReVal introduces a replay buffer Dreplay that stores historical trajectories and enables off-policy learning, which satisfy desirable properties of value-based RL."
- ReVal: A value-based, off-policy RL framework that unifies policy and value within the LLM via Bellman updates and replay. Example: "In this paper, we propose ReVal, a value-based RL framework for LLM post-training that preserves the efficiency advantages of ReMax while introducing the off-policy capability required for agentic and long-horizon learning."
- Reward normalization: Rescaling or centering reward signals (e.g., via advantages) to stabilize and improve learning dynamics. Example: "To mitigate this, we introduce reward normalization and periodic reference policy reset,"
- Reward shaping: Adding potential-based or structured terms to the reward to ease learning without changing optimal policies. Example: "To address this issue, we introduce reward shaping to redefine the Bellman objective."
- Soft Q-function: A maximum-entropy value function where action values incorporate an entropy term, aligning with softmax policies. Example: "It shows that a LLM trained via next-token prediction implicitly learns a soft Q-function:"
- TBRM: A value-based method using the model’s logits as Q-values and minimizing a trajectory-level Bellman residual, originally trained on-policy. Example: "TBRM showed empirical success with on-policy data."
- Trajectory-level signals: Outcome-level feedback aggregated over an entire generated sequence (trajectory), such as correctness verification. Example: "We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification."
- Value-based reinforcement learning (Value-based RL): RL methods that learn action values (Q) to derive policies, enabling off-policy updates via Bellman learning. Example: "we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning."
- V-function: The state value function giving the expected return from a state under a policy, used alongside Q in Bellman updates. Example: "Vθ(s1) is the induced V-function, Vθ(s1) = log Σ_{a∈A} exp Qθ(s1, a)."
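The log-sum-exp relation in the V-function entry can be demonstrated numerically. This is a toy with a three-token vocabulary (our illustration, not the paper's code): treat per-token logits as Q-values, take V as the log-sum-exp over the vocabulary, and the softmax policy π(a|s) = exp(Q(s,a) − V(s)) falls out.

```python
import math

def v_from_logits(logits):
    """Induced V-function: V(s) = log sum_a exp(Q(s, a)).
    Subtract the max for numerical stability before exponentiating."""
    m = max(logits)
    return m + math.log(sum(math.exp(q - m) for q in logits))

def policy_from_logits(logits):
    """Softmax policy induced by the Q-values: pi(a|s) = exp(Q(s,a) - V(s))."""
    v = v_from_logits(logits)
    return [math.exp(q - v) for q in logits]

logits = [2.0, 1.0, 0.5]  # toy "Q-values" for a 3-token vocabulary
probs = policy_from_logits(logits)
print(round(sum(probs), 6))  # 1.0 — the induced policy normalizes
```

This identity is what lets a single model serve as both policy and value function: no separate critic network is needed.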