Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data (2503.19618v2)

Published 25 Mar 2025 in cs.LG

Abstract: We propose to scale RL to unverifiable data with a novel algorithm, JEPO (Jensen's Evidence lower bound Policy Optimization). While most prior efforts on scaling RL for LLMs focus on verifiable data, where ground-truth answers are typically short-form and can be matched easily, we investigate the case where such assumptions are less valid (e.g., when answers are long-form, such as mathematical proofs). To scale RL training to unverifiable data under contemporary training constraints, we propose JEPO. JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound that views chain-of-thought as a latent variable in the generative process. We show that on verifiable data (math), JEPO is as effective as RL with verifiable rewards; on semi-verifiable data (numina), JEPO improves on soft-match-based evaluations compared to RL with verifiable rewards, which can only leverage a subset of the data source; finally, on unverifiable data (numina-proof), JEPO outperforms SFT and a few ablation baselines on likelihood evaluations.
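
To make the "Jensen's evidence lower bound" idea concrete, the following is a minimal numerical sketch, not the paper's implementation. It assumes that the likelihood of a reference answer is an average over sampled chains of thought of the probability the model assigns to that answer; Jensen's inequality then says that the average of the log-probabilities is never larger than the log of the average probability. The function names and the example probabilities are hypothetical, chosen only for illustration.

```python
import math


def log_marginal_likelihood(answer_probs):
    """Log of the average answer probability over sampled chains of thought.

    answer_probs[i] stands in for p(answer | question, chain_of_thought_i);
    the values are illustrative, not model outputs.
    """
    return math.log(sum(answer_probs) / len(answer_probs))


def jensen_lower_bound(answer_probs):
    """Average of the log answer probabilities over the same samples.

    By Jensen's inequality (log is concave) this is <= the log marginal.
    """
    return sum(math.log(p) for p in answer_probs) / len(answer_probs)


# Hypothetical answer probabilities under four sampled chains of thought.
probs = [0.9, 0.5, 0.1, 0.02]
lml = log_marginal_likelihood(probs)
jlb = jensen_lower_bound(probs)
print(f"log marginal likelihood: {lml:.4f}")
print(f"Jensen lower bound:      {jlb:.4f}")
```

The bound is tight when every sampled chain of thought gives the answer the same probability, and it loosens as those probabilities spread out. How JEPO turns this bound into a policy-gradient objective is described in the paper itself and is not reproduced here.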

Tweets

https://twitter.com/tianle_cai/status/1909636137255416142

https://twitter.com/fly51fly/status/1905007634543173682

https://twitter.com/rm_rafailov/status/1907444461292146868
