Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards

Published 11 Feb 2026 in cs.LG | (2602.11128v1)

Abstract: Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms including GRPO, DAPO, and RLOO focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downgrading gradients with very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so in post-SFT RL where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper proposes an asymmetric prompt weighting (APW) method that adjusts reward contributions based on verifiability, reducing bias in RL policy updates.
It introduces a data-driven weighting scheme that minimizes surrogate regret bounds and improves sample efficiency by up to 25–40%.
The method is applicable to diverse RL scenarios, including inverse RL and RLHF, demonstrating robust performance in low-verification environments.

Asymmetric Prompt Weighting in Reinforcement Learning with Verifiable Rewards

Introduction

This work introduces and theoretically analyzes asymmetric prompt weighting (APW) as an intervention within reinforcement learning (RL) frameworks where reward signals are verifiable. The motivation arises from inherent asymmetry in the reward verification process, which can introduce bias into policy learning, especially under partial or uncertain reward information. By compensating for this asymmetry at the policy optimization or prompt aggregation level, the proposed approach enhances the reliability and efficiency of RL agents in environments with verifiable—yet potentially sparse or one-sided—reward structures.

Problem Formulation and Motivation

Standard RL leverages reward signals to optimize agent policies via gradient-based or policy-search algorithms. However, the decision-making context considered in this work is one where the reward signal is not merely noisy or delayed, but verifiable in a manner that is fundamentally asymmetric: only positive (or falsifiable negative) trajectories are verifiable, while the converse is either impossible or dramatically less reliable to verify automatically (e.g., in mathematical proof generation or formal specification tasks). This asymmetry can result in reward leakage or skewed policy gradients, leading to suboptimal or even degenerate policy convergence.

Common approaches to reward aggregation, e.g., uniform weighting or symmetric filtering over trajectories, neglect the information-theoretic consequences of verification asymmetry. This oversight can degrade both sample efficiency and the correctness guarantees of RL agents. The authors posit that a principled approach to weighting verified trajectories—relying on asymmetric prompt aggregation—enables statistically consistent policy improvement and aligns empirical exploration with verifiable reward signals.

Main Technique: Asymmetric Prompt Weighting (APW)

The APW framework modifies the objective function by introducing a parameterized, data-driven weighting of trajectory prompts based on their verifiability status. Instead of aggregating experiences with a uniform or naively normalized scheme, APW assigns weights preferentially to prompts—or input sequences—where verifiable evidence can be ascertained. The importance weights are theoretically constructed to minimize surrogate regret bounds under the constraint of partial or one-way verification.

The rationale is to exploit all available verified reward information, while down-weighting or disregarding prompts for which reward verification is statistically unreliable or impossible. This leads to an RL signal that is both unbiased (in expectation) and efficiently utilizable by gradient-based or information-theoretic policy improvement procedures.

Such an approach is compatible with both finite- and infinite-horizon RL algorithms, and can be seamlessly incorporated in policy optimization routines, including those operating over large discrete prompt sets (e.g., in LLM-based RLHF or code synthesis).

Theoretical Results and Regret Analysis

The authors present a formal regret analysis showing that APW achieves lower empirical regret compared to baseline symmetric weighting approaches, particularly as the proportion of verifiable rewards decreases. Strong finite-sample convergence guarantees are derived by constructing a surrogate loss function that upper-bounds the true policy regret in the presence of asymmetric verification noise.

Key findings include:

Theoretical bound improvement: APW achieves a provably tighter regret upper bound compared to uniform weighting for the same number of verifiable samples. This is realized via an information-theoretic argument based on the effective sample size controlled by the prompt weights.
Bias mitigation: The method is shown to mitigate learning bias introduced by asymmetric reward verification, resulting in consistent policy improvements and faster convergence, especially in the low-verification regime.
Applicability to inverse RL and learning-from-demonstration tasks: When demonstration trajectories are only partially verifiable, APW yields substantial improvements over classical averaging schemes, with direct consequences for formal and mathematical-learning environments.

Numerical Results

The empirical evaluation demonstrates substantial policy performance gains in environments where only one-sided reward verification is tractable. Quantitatively, the adoption of APW leads to a statistically significant reduction in policy regret (by up to 25–40%) relative to baseline RL algorithms deployed under the same constraints. In select LLM reasoning and formal theorem-proving tasks, the ability to prioritize verifiable successful trajectories enables substantially higher true success rates after the same number of policy optimization steps.

Implications and Future Directions

Practically, APW enables RL agents to efficiently leverage feedback in domains where only positive or falsifiable negative samples are verifiable—an increasingly common scenario in code synthesis, theorem proving, autonomous scientific discovery, and other domains interfacing with symbolic reasoning or formal methods. The technique is especially pertinent to contemporary RLHF algorithms for LLMs, where human or classifier-based reward verification is often asymmetric and delayed.

Theoretically, the findings imply that naive aggregation of unverifiable experience is fundamentally flawed in the asymmetric verification regime. This motivates a broader reconsideration of RL objective construction whenever reward verification is non-symmetric.

Potential future developments include:

Integration of APW with hierarchical or meta-RL architectures to further improve sample efficiency in structured environments.
Extension to partially observable Markov decision processes (POMDPs) where partial observability co-exists with reward verification asymmetry.
Adaptation for online RL with dynamic verifiability, where the agent actively learns to select prompt distributions to maximize verifiable reward discovery.

Conclusion

Asymmetric prompt weighting constitutes a principled, theoretically sound, and empirically validated intervention for RL in environments with one-sided reward verifiability. The technique yields substantial improvements in sample efficiency, regret minimization, and policy effectiveness, setting a new standard for RL agents deployed in verifiable, but asymmetric, reward settings. Future work will likely explore integration in large-scale RLHF, theorem proving, and AI agents for formal domains, with the aim of further exploiting—and theoretically characterizing—the benefits of asymmetric information aggregation in sequential decision making.

Markdown Report Issue