Humanline: Online Alignment as Perceptual Loss (2509.24207v1)
Abstract: Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping -- originally introduced to just stabilize training -- recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO act as perceptual losses already. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility, since we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts on both verifiable and unverifiable tasks.
Explain it Like I'm 14
Overview
This paper looks at how to teach large language models (LLMs) to behave the way people want. It asks why “online” training methods (which keep generating new examples from the current model) often work better than “offline” methods (which train on a fixed dataset). The authors suggest a human-centered reason: people don’t see probabilities perfectly. We tend to overrate rare, exciting outcomes and underrate common, boring ones. The paper shows that popular online training methods already capture this human “probability bias,” and introduces a simple way—called “humanline”—to add the same effect to offline training so it can perform just as well, but faster and cheaper.
Key Objectives and Questions
- Why do online alignment methods like PPO/GRPO usually beat offline methods like DPO/KTO?
- Can we explain this advantage using how humans perceive chances and risks (from behavioral economics, known as “prospect theory”)?
- Do the technical tricks in PPO/GRPO (like clipping) already mimic human probability perception?
- Can we modify offline training to include human-like probability perception so it reaches online-level performance without the extra cost?
- Will this work on both open-ended tasks (like following instructions) and precise tasks (like math problems)?
Methods and Approach
Key ideas explained simply
- Alignment: Teaching a model to give answers that people prefer. You can do this:
- Offline: Train on a fixed set of examples scored by humans or a reward model. Methods include DPO and KTO.
- Online: Keep generating fresh examples from the current model, score them, and train repeatedly. Methods include PPO and GRPO.
- Prospect theory: A psychology and economics idea showing humans don’t treat probabilities exactly. We often:
- Overweight rare, extreme outcomes.
- Underweight common, moderate outcomes.
- The paper applies this idea to AI outputs: people perceive the “chance” of certain model outputs differently than the true numbers.
- Surprisal: Think of it as “how surprising” an output is to the reference model. If the policy (current model) makes something more likely than the reference model does, that output has positive surprisal. This surprisal acts like the “outcome” that humans judge.
- Clipping: In PPO/GRPO, the ratio of new probability to old probability is capped to a range. This stabilizes training. The authors show this also mimics how people distort probabilities (overrating extremes and underrating typical cases); a small sketch of such a probability-weighting function appears after this list.
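To make the distortion concrete, here is a minimal Python sketch of the classic Tversky–Kahneman inverse-S weighting function from prospect theory. This is the standard textbook form, not necessarily the exact capacity function used in the paper, and gamma = 0.61 is just a commonly cited estimate for gains.

```python
def tk_weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman inverse-S probability weighting:
    w(p) = p^gamma / (p^gamma + (1 - p)^gamma)^(1/gamma).
    Small probabilities are overweighted, large ones underweighted."""
    return p**gamma / (p**gamma + (1.0 - p)**gamma) ** (1.0 / gamma)

for p in [0.01, 0.10, 0.50, 0.90, 0.99]:
    print(f"objective p = {p:.2f}  ->  perceived w(p) = {tk_weight(p):.3f}")
# Rare outcomes (p = 0.01) are boosted; near-certain ones (p = 0.99) are discounted.
```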
What is “humanline”?
It’s a small design pattern that adds human-like probability perception into standard objectives (DPO/KTO/GRPO), even when training offline. It has two steps:
- Humanline syncing: Regularly update the reference model to match the previous version of the current model. This keeps the “surprisal” comparison meaningful as the model learns.
- Humanline clipping: Clip the per-token probability ratios upstream (before the loss function) and allow the clip range to be asymmetric. This directly imitates human probability distortion.
In plain terms: the model keeps its “yardstick” up-to-date and caps how much “surprise” counts so training doesn’t get skewed by extreme cases—similar to how people weigh probabilities.
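To show where these two steps sit in a training loop, here is a minimal PyTorch-style sketch. It is illustrative only: `token_logprobs` assumes a Hugging Face-style causal LM, `loss_fn` stands in for a DPO/KTO/GRPO-style objective, and the default bounds and sync cadence simply echo the starting points reported later in this summary.

```python
import torch

def token_logprobs(model, input_ids):
    """Per-token log-probabilities of the observed tokens under a causal LM
    (assumes a Hugging Face-style model whose forward pass returns .logits)."""
    logits = model(input_ids).logits[:, :-1, :]          # prefix -> next-token logits
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def humanline_clip(token_log_ratios, log_eps_p=-1.5, log_eps_r=1.5):
    """Humanline clipping (sketch): asymmetrically clamp per-token log-ratios
    log pi_theta - log pi_ref upstream of the loss; out-of-range tokens get
    zero gradient, much like detaching them."""
    return torch.clamp(token_log_ratios, min=log_eps_p, max=log_eps_r)

def humanline_train(policy, ref_model, batches, optimizer, loss_fn, sync_every=4):
    """Generic loop showing where humanline clipping and syncing fit."""
    ref_model.eval()
    for step, input_ids in enumerate(batches):
        with torch.no_grad():
            logp_ref = token_logprobs(ref_model, input_ids)
        log_ratios = humanline_clip(token_logprobs(policy, input_ids) - logp_ref)

        loss = loss_fn(log_ratios)        # placeholder for the chosen alignment loss
        optimizer.zero_grad()
        loss.backward()

        # Humanline syncing (sketch): every `sync_every` steps, after the loss is
        # computed but before the optimizer step, copy policy weights into the reference.
        if step % sync_every == 0:
            ref_model.load_state_dict(policy.state_dict())

        optimizer.step()
```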
Main Findings and Why They Matter
- Online alignment beats plain offline alignment, but the gap can disappear with humanline:
- Instruction-following: When training Llama3-8B, regular offline methods were 1.3–1.6x worse than online. The same offline data, run through humanline DPO/KTO/GRPO, matched the online performance.
- Math reasoning: With GRPO, humanline allowed sampling new data 64 times less frequently (so training can be more asynchronous and efficient) with no loss in accuracy on MATH500. Both humanline and online reached Pass@1 around 0.593.
- Clipping already acts like a perceptual loss:
- The paper proves that the “clipping” in PPO/GRPO recovers a special case of human probability bias. That helps explain why these online methods are strong.
- Humanline syncing does most of the work; humanline clipping adds extra boost:
- Syncing the reference model every step, or every few steps, gives most of the improvement.
- Clipping upstream improves stability and ensures offline methods can fully match online performance.
- Not a magic trick: Data quality still matters.
- You can’t just pick any offline dataset and expect humanline to match online. But often there exists some good offline data that works.
Implications and Potential Impact
- Faster and cheaper training: If offline data plus humanline can match online methods, teams can avoid constant on-policy sampling. This saves compute and time, and makes training more stable.
- More flexible pipelines: You can combine data from different sources (human demos, other models, reward models) and still get online-level results, thanks to humanline’s perception-aware design.
- Better scalability: For tasks like math, humanline enables less frequent data sampling without hurting performance, which helps parallelize training and reduce bottlenecks.
- Human-centered theory: Viewing alignment through how humans perceive probabilities explains why certain tricks (like clipping) work and guides future method design.
Caveats and future directions
- Prospect theory was built for money-related decisions; applying it to language outputs is an assumption. Future work might refine the perception model for generative AI.
- Syncing too often can cause instability; choosing the right sync frequency and learning rate is important.
- Personalizing the “probability distortion” to different user groups might help, but needs research.
Overall, the paper reframes alignment around human perception and shows a practical path—humanline—for making offline training match online power, potentially transforming post-training into a cheaper, faster, and more adaptable process.
Knowledge Gaps
Below is a concise list of the paper’s unresolved knowledge gaps, limitations, and open questions. Each item is framed to be concrete and actionable for future research.
- Empirical estimation of human probability weighting for generative outputs: develop methods to infer users’ perceived probability distortions (e.g., estimating the weighting and value-function parameters γ, λ, and α) from behavioral data in text generation; validate whether the typical inverse-S weighting from monetary gambles holds for language outputs.
- Personalization and heterogeneity of perception: assess whether user-specific or population-specific weighting functions (and reference points) improve alignment; design mechanisms to personalize or cluster humanline parameters per user/task/domain.
- Sensitivity to core assumptions: quantify performance and theoretical breakdowns when (a) policy and reference do not share support, (b) token likelihood ratios are unbounded, (c) the “tail regime” assumption fails, or (d) capacity functions deviate from the chosen form; provide diagnostics and robustness bounds.
- Practical estimation for Proposition 1: propose estimators to approximate the KL divergence between the perceived and sampling distributions (or usable surrogates) in large output spaces; analyze sample complexity and bias to make the bound operational.
- Mapping from prospect-theory parameters to clipping ranges: derive principled procedures to turn these parameters (e.g., γ, λ, α) into upstream clipping bounds (and schedules), rather than relying on fixed heuristic values.
- Adaptive humanline schedules: develop algorithms to adapt k (the sync frequency) and clipping bounds during training based on stability signals, loss curvature, or gradient statistics; characterize convergence and prevent collapse.
- Stability and convergence theory: provide formal analyses guaranteeing stability (or bounding instability) under humanline syncing and clipping for DPO/KTO/GRPO; characterize conditions on learning rates, advantage normalization, and KL penalties.
- Guidance on offline data quality: define measurable properties (coverage, divergence from the reference model, entropy, reward-model agreement) that predict when offline+humanline can match online performance; design data selection or filtering criteria.
- Effects of support mismatch and domain shift: study how off-policy distributions far from the reference model affect humanline’s efficacy; quantify degradation as divergence increases and propose mitigation strategies (e.g., reweighting, selective sampling).
- Sequence-level versus token-level trade-offs: analyze how token-level clipping changes sequence-level saturation, length biases, and coherence; assess interactions with KL regularization and advantage normalization in long contexts.
- Safety and bias implications: evaluate whether overweighting “extreme” outcomes increases sensational or unsafe generations; integrate safety constraints and measure trade-offs between perceptual alignment and safety guarantees.
- Human validation of the “human-centric” explanation: run controlled user studies to test whether humanline-trained models better match subjective utility than baselines, beyond proxy judges like GPT; quantify alignment to actual human judgments.
- Generality across tasks and modalities: test humanline on code generation, summarization, dialog, and multimodal models; identify tasks where perceptual weighting helps or harms performance, and why.
- Verifiable tasks theory: provide a principled account for why humanline helps in programmatically verifiable settings (e.g., math), and delineate conditions where perceptual bias should not help (or can hurt) correctness-focused objectives.
- Integration with RL coverage explanations: reconcile the prospect-theory account with existing RL-theoretic explanations (coverage, generation vs. discrimination, policy search space); identify regimes where each theory predicts performance gains.
- Reference syncing protocol and timing: precisely compare different syncing timings (pre/post optimizer step, asynchronous lags) and their impact; propose best-practice protocols with theoretical backing.
- Combining online and offline sources: design principled strategies to mix online on-policy and offline off-policy data under humanline (e.g., rejection sampling, reweighting); study how to set mixture proportions dynamically.
- Humanline sampling parameterization: explore the Beta-based sampling formulation beyond clipping; determine practical settings for the Beta parameters that balance exploration and exploitation, and map them to task characteristics.
- Baseline model choice in GRPO: analyze sensitivity to the baseline under humanline clipping and syncing; propose criteria for selecting or updating baselines to avoid drift or over-regularization.
- Systems-level gains: rigorously measure end-to-end throughput improvements from overlapping training/inference/labeling; characterize memory, compute, and communication costs for humanline at scale (e.g., 70B+ models, distributed training).
- Partial or selective syncing: evaluate layer-wise or module-wise syncing (e.g., only attention or MLP layers) to reduce cost; study how selective syncing affects stability and performance.
- Dynamic and multi-turn settings: extend humanline to interactive dialog where perceived probabilities evolve across turns; study whether dynamic weighting and reference updates per turn improve user experience.
- Robustness to reward misspecification: analyze how humanline behaves when reward models or verifiers are biased or noisy; develop correction mechanisms (e.g., calibration, debiasing) under perceptual weighting.
- Reproducibility and evaluation variance: assess sensitivity to different judges (e.g., GPT-4.1 vs. others), release code/data to enable replication, and quantify variability due to evaluation choice and prompt formatting.
Practical Applications
Immediate Applications
The following items can be deployed today with modest changes to existing LLM alignment pipelines, using the paper’s “humanline” design pattern (humanline syncing and humanline clipping) to match online alignment performance while relying on cheaper, more flexible offline data.
- Drop-in upgrade to alignment pipelines to reduce compute and cost
- What to do: Replace offline DPO/KTO/GRPO with offline+humanline variants by (1) syncing the reference model with the current policy every k steps and (2) asymmetrically clipping token-level log-probability ratios upstream of the loss (in log-space).
- Impact: Matches online alignment performance on instruction-following (1.3–1.6x winrate uplift vs offline baseline) and mathematical reasoning, while avoiding the instability and expense of fully online training.
- Sectors: Software/AI infrastructure, cloud providers, AI product teams; open-source model communities.
- Tools/workflows:
- Integration in alignment libraries (e.g., TRL/Hugging Face) via a “Humanline Trainer” module.
- Metrics dashboards to monitor clipping ranges and reference sync cadence (k).
- Learning-rate or gradient-norm tuning (roughly 0.1–4x the standard offline settings).
- Assumptions/dependencies:
- Alignment method must use a reference model (applies to DPO/KTO/GRPO; not applicable to SimPO-style losses).
- Offline data must be of high quality and not too divergent from the reference model (shared support and bounded likelihood ratios).
- Prospect-theory-style perceptual weighting is a reasonable proxy for human utility in generative modeling.
- Asynchronous, lower-frequency sampling for verifiable reasoning tasks
- What to do: For math/coding/verifiable tasks, sample new policy-generated data far less often (up to 64x less frequently) and train with humanline GRPO; overlap training, inference, and labeling asynchronously (a minimal scheduling sketch follows this list).
- Impact: Same performance as fully online GRPO with dramatically fewer synchronization barriers and lower inference bottlenecks during training.
- Sectors: Software (code assistants), education (math tutors), theorem-proving tools.
- Tools/workflows:
- “Offline Alignment Orchestrator” to schedule periodic sampling bursts and continuous training.
- Reward evaluators for correctness and formatting; batching inference in larger, less frequent jobs.
- Stability settings: humanline syncing every 12–24 steps often avoids collapse in math; clipping defaults of log εP = −1.5, log εR = 1.5 worked across tasks in the paper.
- Assumptions/dependencies:
- Verifiable reward signal; sufficient batch sizes when sampling less frequently.
- Proper tuning of k to avoid instability (too small may induce collapse).
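A minimal sketch of the scheduling idea, with sampling and update logic injected as caller-supplied functions (`sample_fn` and `train_step_fn` are hypothetical names); the once-per-64-steps cadence mirrors the figure above but should be tuned per task.

```python
def run_with_infrequent_sampling(policy, sample_fn, train_step_fn,
                                 total_steps=10_000, sample_every=64):
    """Scheduling sketch: take many humanline training steps per sampling burst.
    `sample_fn(policy)` should return a batch of generations scored by a verifier
    or reward model; `train_step_fn(policy, data)` should perform one humanline
    GRPO update. Sampling bursts can run asynchronously with training."""
    data = sample_fn(policy)                  # initial burst
    for step in range(1, total_steps + 1):
        train_step_fn(policy, data)
        if step % sample_every == 0:          # refresh data only occasionally
            data = sample_fn(policy)
```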
- Immediate efficiency and sustainability benefits for RLHF operations
- What to do: Shift alignment workloads from fully online RL to offline+humanline training on curated preference datasets.
- Impact: >6x faster wall-clock than online GRPO at similar quality; lower energy footprint and cloud cost.
- Sectors: Industry RLHF providers, startups, and research labs; sustainability-focused policy efforts.
- Tools/workflows:
- Cost and energy reporting tied to humanline deployment.
- Procurement-friendly budgeting for small labs (cheaper post-training with comparable outcomes).
- Assumptions/dependencies:
- Adequate offline data coverage for target tasks; reward models or judges consistent with task needs.
- Quality uplift for production assistants without full RL loops
- What to do: Apply offline+humanline DPO/KTO/GRPO to improve instruction-following assistants without maintaining expensive online sampling pipelines.
- Impact: Significant winrate uplift vs offline-only baselines with minimal systems complexity increase.
- Sectors:
- Finance/retail customer support (policy-controlled assistants).
- Software engineering (coding copilots).
- Education (study aids, tutoring).
- Healthcare triage and admin (non-diagnostic, carefully scoped use).
- Tools/workflows:
- A/B evaluation with humanline variants vs current offline methods (e.g., AlpacaEval2 or task-specific judges).
- Gradual rollout with conservative clipping ranges and sync cadence; add safety filters.
- Assumptions/dependencies:
- Domain-specific safety requirements (especially healthcare/finance).
- Offline data curated for domain style, tone, and compliance.
- Practical data curation heuristics for offline alignment
- What to do: Select offline datasets that are closer to the reference model distribution to satisfy bounded likelihood ratio assumptions and improve performance parity with online methods.
- Impact: Reduces variance in outcomes across sources; improves odds of matching online alignment.
- Sectors: Dataset marketplaces, academic benchmarks, enterprise data teams.
- Tools/workflows:
- “Perceptual Alignment Monitor” to track log-ratio distributions and detect divergence (a minimal monitoring sketch follows this list).
- Automated scoring of dataset closeness to a given reference model.
- Assumptions/dependencies:
- Clean, high-quality pairs/labels; consistency of reward models across datasets.
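A minimal sketch of such a monitor, assuming Hugging Face `transformers` models; the model names in the usage comment are placeholders, and summarizing per-token log-ratios is only one possible proxy for closeness to the reference model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_ratio_stats(texts, policy_name, ref_name, device="cpu"):
    """Summarize per-token log(pi_policy / pi_ref) over candidate offline data.
    Heavy positive tails suggest the data is far from the reference model."""
    tok = AutoTokenizer.from_pretrained(policy_name)
    policy = AutoModelForCausalLM.from_pretrained(policy_name).to(device).eval()
    ref = AutoModelForCausalLM.from_pretrained(ref_name).to(device).eval()

    per_token = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids.to(device)

            def token_logps(model):
                logps = torch.log_softmax(model(ids).logits[:, :-1, :], dim=-1)
                return logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

            per_token.append((token_logps(policy) - token_logps(ref)).flatten())
    r = torch.cat(per_token)
    return {"mean": r.mean().item(),
            "p95": r.quantile(0.95).item(),
            "p99": r.quantile(0.99).item()}

# Example (placeholder model names):
# print(log_ratio_stats(["The capital of France is Paris."],
#                       "my-org/policy-model", "my-org/reference-model"))
```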
- Academic reproducibility and fair comparisons across alignment modes
- What to do: Use humanline variants to compare offline vs online alignment under controlled data volumes/context sets.
- Impact: More reliable conclusions about data coverage vs perceptual-loss effects; reproducible baselines.
- Sectors: Academia, benchmarks, open-source communities.
- Tools/workflows:
- Shared configs for clipping ranges, sync cadence, and LR/gradient norms.
- Reporting the KL divergence between the perceived distribution and the sampling distribution as a diagnostic (per Proposition 1); a toy sketch of one such diagnostic follows this list.
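As a toy illustration of such a diagnostic, the sketch below builds a “perceived” distribution from an objective one using the standard cumulative prospect theory construction (a capacity function applied to tail cumulative probabilities, then differenced) and reports KL(ω∥Q). The paper’s exact definition of the perceived distribution may differ, so treat this as illustrative only.

```python
import numpy as np

def capacity(p, gamma=0.61):
    """Inverse-S capacity applied to cumulative probabilities (standard TK form)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def perceived_weights(q, outcomes, gamma=0.61):
    """Rank-dependent decision weights: order outcomes from most to least extreme,
    difference the capacity of tail cumulative probabilities, then renormalize."""
    order = np.argsort(-np.asarray(outcomes, dtype=float))   # most extreme first
    q_sorted = np.asarray(q, dtype=float)[order]
    tails = np.concatenate([[0.0], np.cumsum(q_sorted)])
    w = capacity(tails[1:], gamma) - capacity(tails[:-1], gamma)
    w = w / w.sum()
    omega = np.empty_like(w)
    omega[order] = w                                         # back to original order
    return omega

def kl(omega, q):
    omega, q = np.asarray(omega), np.asarray(q)
    return float(np.sum(omega * np.log(omega / q)))

q = np.array([0.70, 0.20, 0.08, 0.02])      # objective probabilities of four outcomes
outcomes = np.array([0.1, 0.5, 1.0, 2.0])   # surprisal-like outcome magnitudes
omega = perceived_weights(q, outcomes)
print("perceived:", omega.round(3), " KL(omega || q) =", round(kl(omega, q), 4))
```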
Long-Term Applications
The following items require further research, scaling, or product development to realize their full potential.
- Personalized perceptual alignment to different user populations
- Vision: Estimate or learn user-specific prospect-theory parameters (probability weighting γ, loss aversion λ, risk aversion α) and apply personalized humanline clipping/sync schedules per user or cohort.
- Impact: Models that better reflect diverse human perceptions and preferences; improved satisfaction across user segments.
- Sectors: Consumer AI, enterprise copilots, education (adaptive tutors), wellness apps.
- Dependencies:
- Privacy-preserving estimation of perceptual parameters.
- Robustness and fairness audits for differential weighting.
- Cross-modal perceptual losses for multimodal models
- Vision: Extend humanline concepts to speech, vision, and robotics policies where human perception of uncertainty differs from objective probabilities.
- Impact: Better alignment of multimodal agents to human utility, not just expected value.
- Sectors: Robotics (teleoperation, skill learning), autonomous systems, AR/VR.
- Dependencies:
- Generalization of prospect-theory weighting to modality-specific distributions.
- Safety constraints and simulation fidelity.
- Humanline sampling and data selection at scale
- Vision: Implement rejection-sampling-based humanline data selection (not just clipping) to simulate perceived probability distributions across large corpora; build data marketplaces that score and route data based on closeness to reference models.
- Impact: Task-agnostic alignment that can source data from anywhere while meeting perceptual criteria; reduced dependence on costly online generation.
- Sectors: Data platforms, MLOps providers, cloud AI.
- Dependencies:
- Efficient sampling and scoring algorithms for massive datasets.
- Tooling to manage stop-gradient strategies and stability.
- Federated and on-device alignment using offline+humanline methods
- Vision: Use the reduced compute requirements to enable edge alignment/personalization without always-on online sampling; periodically sync a lightweight reference model on-device.
- Impact: Private, local adaptation; better personalization and latency.
- Sectors: Mobile/edge AI, consumer devices, enterprise endpoints.
- Dependencies:
- Model compression and partial-weight syncing strategies.
- Secure aggregation and privacy guarantees.
- Risk-aware models for finance and high-stakes decision support
- Vision: Align models to human risk preferences using prospect-theoretic parameters (loss aversion, probability weighting), making outputs more consistent with human utility functions rather than expected value.
- Impact: Decision-support systems that better reflect human risk attitudes (e.g., portfolio explanations, scenario analysis).
- Sectors: Finance, operations research, strategic planning.
- Dependencies:
- Strong domain constraints and compliance; rigorous evaluation.
- Formalization of prospect-theory parameters in task-specific contexts.
- Robotics and offline preference learning with perceptual objectives
- Vision: Apply humanline variants to offline demonstrations and preference data in robotics, reducing the need for risky or costly online exploration.
- Impact: Safer and cheaper policy improvement from logs and demonstrations.
- Sectors: Industrial robotics, healthcare robotics, human-robot interaction.
- Dependencies:
- Reliable reward surrogates; sim-to-real transfer.
- Safety verification for perceptual-loss-induced exploration/exploitation trade-offs.
- Standards and policy for green, cost-efficient alignment
- Vision: Establish reporting standards that emphasize human utility gains per compute/energy unit; promote offline+humanline adoption as a best practice for sustainable alignment.
- Impact: Policy incentives for efficiency; democratization of alignment for smaller institutions.
- Sectors: Public policy, NGOs, research funding agencies.
- Dependencies:
- Agreement on metrics (e.g., winrate uplift vs energy cost, verifiable accuracy vs wall-clock).
- Third-party audits and benchmarks.
- Theory and measurement of perceived probability in generative modeling
- Vision: Empirically estimate capacity/weighting functions for language tasks via user studies, clicks, or preference trails; develop estimators for KL(ω∥Q) diagnostics at token/sequence level.
- Impact: Better-grounded perceptual losses; improved predictive value of humanline hyperparameters.
- Sectors: Academia; UX research for AI products.
- Dependencies:
- Ethical data collection; representative cohorts.
- Statistical identifiability and robust estimation.
- Full-stack training orchestration products
- Vision: Build turnkey “Humanline Alignment Platform” that schedules reference syncing, applies upstream clipping, manages asynchronous sampling, and provides stability monitors.
- Impact: Simplifies adoption across teams; reduces operational risk.
- Sectors: MLOps vendors, enterprise AI teams.
- Dependencies:
- Integration with major frameworks (Transformers/TRL/DeepSpeed/vLLM/Ray).
- Observability and governance features.
- Local personalization for daily-life applications
- Vision: Use humanline’s reduced compute demands to let individuals fine-tune assistants on personal data offline (e.g., writing style, scheduling preferences).
- Impact: More helpful personal assistants without sending private data to the cloud.
- Sectors: Consumer productivity, accessibility tools, education.
- Dependencies:
- UX for safe local fine-tuning; memory and storage constraints.
- Guardrails to prevent drift and maintain reliability.
Notes on feasibility across all applications:
- Data quality remains a decisive factor; the paper emphasizes that some—but not all—offline datasets enable parity with online alignment.
- The approach assumes a prospect-theoretic distortion (inverted S-shaped weighting) reasonably applies to human perception of generative model outcomes; more empirical work would strengthen this foundation.
- Stability management (choice of k for syncing, clipping ranges, and LR/grad norms) is critical; defaults observed in the paper (log εP = −1.5, log εR = 1.5; k ∈ [1, 4] for instruction-following and [12, 24] for math) are strong starting points but should be validated per task and model scale. These starting points are collected in the short configuration sketch below.
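For convenience, here are those starting points gathered into a hypothetical configuration; the field names are illustrative, and every value should be re-validated for your task and model scale.

```python
# Hypothetical starting configuration based on the defaults summarized above.
HUMANLINE_DEFAULTS = {
    "log_eps_p": -1.5,                    # lower (asymmetric) clip bound on token log-ratios
    "log_eps_r": 1.5,                     # upper clip bound on token log-ratios
    "sync_every_k": {                     # reference-sync cadence, in optimizer steps
        "instruction_following": (1, 4),
        "math_reasoning": (12, 24),
    },
    "lr_or_grad_norm_scale": (0.1, 4.0),  # rough retuning range vs. standard offline settings
}
```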
Glossary
- Advantage (sequence-level advantage): A normalized measure of how much a particular output sequence outperforms others, applied per token in policy optimization. "$A_{i,t} = (R_i - \mathrm{mean}(R))/\mathrm{std}(R)$ is the sequence-level advantage of output $y_i$ compared to other outputs (applied per token)"
- Baseline: A fixed distribution (e.g., the initial model) used to regularize the policy via KL divergence. "KL denotes the token-wise forward KL divergence between the policy and a fixed baseline "
- Beta distribution: A probability distribution on [0,1] used to randomize acceptance thresholds in humanline sampling.
- Capacity function: In prospect theory, a function that maps objective cumulative probabilities to perceived cumulative probabilities. "A typical functional form for the capacity function is"
- Clipping (PPO/GRPO-style): An RL technique that limits updates by clipping likelihood ratios to stabilize training. "PPO/GRPO-style clipping---originally introduced to just stabilize training---recovers a perceptual bias in how humans perceive probability."
- Cumulative probabilities: The total probability mass at or beyond an outcome, used to compute distorted weights. "these are our cumulative probabilities."
- Detaching (from the computational graph): An operation that stops gradients from flowing through selected tokens during training. "humanline sampling rejects tokens that meet the following rejection criteria by detaching them from the computational graph:"
- Direct Preference Optimization (DPO): An offline alignment objective that optimizes on pairs where one output is preferred over another. "DPO \citep{rafailov2023dpo}, which operates on preference pairs"
- Forward KL divergence: The Kullback–Leibler divergence measured from the policy to the baseline, applied token-wise. " denotes the token-wise forward KL divergence between the policy and a fixed baseline "
- Grouped Relative Policy Optimization (GRPO): A PPO-like objective that operates on grouped outputs and clips token-wise ratios. "Given that Grouped Relative Policy Optimization (GRPO) simplifies PPO while often improving performance \citep{shao2024deepseekmath}, we use it instead."
- Holm-Bonferroni correction: A statistical method for controlling family-wise error in multiple comparisons. "We apply the Holm-Bonferroni correction to adjust for multiple comparisons"
- Humanline clipping: A design that asymmetrically clips token-wise likelihood ratios upstream of the loss to reflect perceptual biases. "Humanline Clipping: Clip all token-wise likelihood ratios to the range $[\epsilon_P, \epsilon_R]$ even before they are fed into the loss"
- Humanline sampling: A modified rejection sampling scheme that accepts or rejects tokens based on likelihood ratios and Beta-distributed thresholds. "Given an output sequence, humanline sampling rejects tokens that meet the following rejection criteria by detaching them from the computational graph:"
- Humanline syncing: Periodically syncing the reference model with the current policy to track changing standards during training. "Humanline Syncing: Every k steps, after the loss is calculated but before the optimizer step is taken, sync the weights of the reference model with those of the policy"
- Humanline variant: A modified alignment objective that explicitly incorporates perceptual distortions of probability. "These simple changes, when applied correctly, create what we call the humanline variant of the original objective."
- Importance sampling: A technique for reweighting samples from a proposal distribution; noted as problematic in generative-model contexts. "Although importance sampling is another option, it comes with its own problems in the context of generative models, such as degenerate importance weights."
- Likelihood ratio: The ratio of policy to reference probabilities for a token, used in clipping and rejection criteria. " is a finite upper bound on the token-level likelihood ratio under the vocabulary"
- Loss aversion: A prospect theory property where losses loom larger than gains (λ > 1). "and greater sensitivity to relative losses than gains (λ > 1), known as loss aversion."
- Nats: Units of information used to measure outcomes such as surprisal (log probabilities in base e). "measured in nats \citep{ethayarajh2024model}."
- Offline Off-policy Alignment: Alignment using static data not generated by the current policy. "Offline Off-policy Alignment"
- On-policy sampling: Drawing training samples from the current policy itself during optimization. "online on-policy sampling better approximates the human-perceived distribution of what the model can produce"
- Online On-policy Alignment: Alignment that iteratively samples from and updates the current policy, often with clipping and KL regularization. "Online On-policy Alignment"
- Pass@1: An accuracy metric measuring correctness with a single sampled solution. "the Pass@1 accuracy on the MATH500 test set"
- Perceptual losses: Objectives that encode human probability distortions, aligning model training with human perception. "act as perceptual losses already."
- Preference pairs: Training tuples indicating a preferred output over a less preferred one for the same prompt. "operates on preference pairs"
- Proximal Policy Optimization (PPO): An on-policy RL algorithm using clipped likelihood ratios for stable updates. "Proximal Policy Optimization (PPO) \citep{schulman2017ppo} has long been the default, since its clipped objective helps reduce training instability."
- Reference model: A model used as an anchor to compute surprisal or ratios, typically not updated by backprop. "a reference model that serves as an anchor, whose weights are not backpropagated through"
- Rejection sampling: A Monte Carlo method for sampling from a target distribution by accepting/rejecting proposed samples. "modify the standard rejection sampling algorithm to capture the perceptual bias"
- Risk aversion: A prospect theory property where gains are evaluated with a concave function (α < 1). "concavity in relative gains (α < 1), known as risk aversion;"
- Surprisal: The log ratio of policy to reference probabilities, treated as the outcome in prospect theoretic optimization. "they treat the surprisal term as the outcome"
- Tail regime: An assumption that outcomes with higher absolute surprisal have negligible cumulative probability mass. "assume we are in the tail regime."
- Token-wise probability ratio: The per-token ratio of policy to reference likelihoods used in GRPO and clipping. "$r_\theta(i,t) = \pi_{\theta}(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$ is a token-wise probability ratio."
- Trust region-style syncing: A syncing scheme that sets reference equal to the updated policy; found to degrade performance in this setting. "trust region-style syncing \citep{gorbatovski2024learn}, which happens after the policy is updated---thus rendering the policy and reference equal---leads to worse results"
- Value function: In prospect theory, maps outcomes relative to a reference point to subjective value via a piecewise function. "A value function maps an outcome, relative to a reference point, to its subjective value as perceived by the human."
- Verifiable tasks: Tasks whose correctness can be checked programmatically (e.g., math reasoning). "The literature increasingly focuses on verifiable tasks whose correctness can be checked programmatically"
- Weighting function: The prospect-theoretic function that distorts objective probabilities into subjective weights. "The weighting function, when applied to an outcome, supplants its objective probability."
- Unverifiable rewards: Rewards based on human judgments or preferences rather than programmatic verification. "with unverifiable rewards;"