Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning (2505.15311v1)
Abstract: Policy-based methods currently dominate reinforcement learning (RL) pipelines for LLM reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines such as PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.
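The abstract only describes the objective at a high level, so the following is a minimal, hypothetical sketch of what a trajectory-level Bellman residual loss could look like, not the paper's actual implementation. It assumes a terminal-only reward, discount factor 1, the chosen-token logit standing in for $Q(s_t, a_t)$, and a soft value $V(s_t) = \beta \log\sum_a \exp(Q(s_t, a)/\beta)$; the names `tbrm_loss`, `beta`, and `terminal_reward` are illustrative and not taken from the paper.

```python
# Hypothetical sketch of a trajectory-level Bellman residual loss for one rollout.
# Assumptions (not from the paper): terminal-only reward, gamma = 1,
# Q(s_t, a_t) = logit of the generated token, V(s_t) = beta * logsumexp(logits_t / beta).
import torch


def tbrm_loss(logits: torch.Tensor, actions: torch.Tensor,
              terminal_reward: float, beta: float = 1.0) -> torch.Tensor:
    """Squared trajectory-level Bellman residual for a single trajectory.

    logits:  (T, V) per-step token logits from the model, treated as Q-values.
    actions: (T,)   token ids actually generated in the rollout.
    """
    # Q(s_t, a_t): logit of the token taken at step t.
    q_taken = logits.gather(1, actions.unsqueeze(1)).squeeze(1)      # (T,)
    # Soft value V(s_t) = beta * logsumexp(Q(s_t, .) / beta).
    values = beta * torch.logsumexp(logits / beta, dim=1)            # (T,)
    # Next-step values; after the final token only the terminal reward remains.
    next_values = torch.cat([values[1:], logits.new_tensor([terminal_reward])])
    # Per-step residuals r_t + V(s_{t+1}) - Q(s_t, a_t), summed over the trajectory
    # into a single residual, then squared (one interpretation of "trajectory-level").
    residual = (next_values - q_taken).sum()
    return residual ** 2


# Toy usage with random logits standing in for model outputs.
T, V = 8, 32
logits = torch.randn(T, V, requires_grad=True)
actions = torch.randint(0, V, (T,))
loss = tbrm_loss(logits, actions, terminal_reward=1.0)
loss.backward()
print(float(loss))
```

Note that this sketch needs no separate critic network, no importance-sampling ratios, and no clipping: the single squared residual per trajectory is computed directly from the model's own logits, which is the property the abstract highlights.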