- The paper introduces RLVR-World, a post-training paradigm that uses verifiable rewards to fine-tune world models for enhanced prediction accuracy.
- It employs autoregressive sequence modeling across modalities, directly optimizing task-specific metrics for both language and video applications.
- Experimental results show gains of up to +30.7% in text-game prediction accuracy and improved video perceptual quality, achieved with far fewer gradient steps than traditional MLE training.
RLVR-World: Training World Models with Reinforcement Learning
Introduction
The paper introduces RLVR-World, a framework that enhances world models through Reinforcement Learning with Verifiable Rewards (RLVR). Standard training objectives are often misaligned with task-specific goals such as transition prediction accuracy or perceptual quality. To address this, RLVR-World fine-tunes world models with RLVR, improving performance in diverse applications such as language-based text games and video-based robotic manipulation.
Background and Motivation
World models are pivotal for predicting state transitions under action interventions. These models are typically pre-trained with Maximum Likelihood Estimation (MLE), which optimizes fit to the data distribution rather than downstream prediction utility. This misalignment often results in suboptimal task-specific performance, such as blurry predictions from video models or repetitive outputs from LLMs. RLVR-World addresses these shortcomings by unifying modalities under autoregressive sequence modeling and directly optimizing verifiable task-specific metrics as rewards, rather than surrogate objectives (Figure 1).
Figure 1: Training world models with reinforcement learning. (Left) World models are typically pre-trained using surrogate objectives like MLE, which misalign with task-specific prediction metrics. (Right) We propose post-training world models with RLVR to directly optimize these metrics.
RLVR-World Framework
The task involves world models that predict state transitions in an environment formulated as a Markov Decision Process (MDP). RLVR-World casts this as a sequence modeling problem, tokenizing state-action pairs as inputs and predicted next states as outputs. It then fine-tunes the model with task-specific prediction metrics serving as verifiable rewards, rather than relying solely on likelihood-based objectives.
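The post-training recipe can be pictured as a policy-gradient loop in which the task metric replaces a learned reward model. Below is a minimal sketch assuming GRPO-style group-relative advantages (GRPO appears in the paper's analysis); the names `generate_fn`, `log_prob_fn`, `reward_fn`, and `optimizer` are hypothetical stand-ins, not the authors' code.

```python
# Minimal sketch of RLVR post-training with group-relative advantages (GRPO-style).
# All interfaces here are hypothetical stand-ins, not the paper's implementation.
import torch

def rlvr_step(generate_fn, log_prob_fn, reward_fn, optimizer, batch, group_size=8):
    """One policy-gradient step over a batch of (query, target) transition pairs.

    generate_fn(query) -> sampled next-state prediction (token ids or text)
    log_prob_fn(query, sample) -> summed token log-probability of the sample
    reward_fn(sample, target) -> scalar verifiable reward (e.g., accuracy, F1, -LPIPS)
    """
    optimizer.zero_grad()
    for query, target in batch:
        # Sample a group of candidate predictions for the same query.
        samples = [generate_fn(query) for _ in range(group_size)]
        # Score each candidate with the verifiable, task-specific metric.
        rewards = torch.tensor([reward_fn(s, target) for s in samples])
        # Group-relative advantages: normalize rewards within the group.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        # Reinforce predictions that beat their group average.
        for sample, adv in zip(samples, advantages):
            loss = -(adv * log_prob_fn(query, sample))
            loss.backward()
    optimizer.step()
```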
Sequence Modeling
RLVR-World unifies language, video, and other modalities under an autoregressive token-based framework. This involves encoding current states and actions into question tokens and next states into response tokens. The process aligns well with transformer architectures and enables efficient joint optimization across diverse domains.
Figure 2: Illustration of RLVR-World framework. World models across various modalities are unified under a sequence modeling formulation, and task-specific prediction metrics serve as verifiable rewards. (Top) Language-based world models predict verbal state transitions in response to verbal actions. (Bottom) Video-based world models, equipped with a visual tokenizer, predict future visual observations conditioned on action vectors.
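As an illustration of this formulation, the sketch below shows how a verbal transition might be flattened into question/response text and how a tokenized video transition could be packaged; the prompt template and field names are hypothetical, not the paper's exact format.

```python
# Illustrative formatting for the unified sequence-modeling view.
# The prompt template and field names are invented for this sketch.

def build_language_example(state: str, action: str, next_state: str) -> dict:
    """A verbal transition becomes question text (state + action) and response text (next state)."""
    question = f"Current state: {state}\nAction: {action}\nNext state:"
    return {"question": question, "response": next_state}

def build_video_example(frame_tokens: list[int], action_vec: list[float],
                        next_frame_tokens: list[int]) -> dict:
    """For video, a visual tokenizer first maps frames to discrete tokens; the action
    vector conditions the prediction of the next frames' tokens."""
    return {"question_tokens": frame_tokens,
            "action": action_vec,
            "response_tokens": next_frame_tokens}

example = build_language_example(
    state="You are standing in a kitchen. The fridge is closed.",
    action="open the fridge",
    next_state="The fridge is open.",
)
```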
Experimental Evaluation
Language World Models
The framework was tested on language world models using datasets from text-based games and web navigation tasks. RLVR-World delivered substantial gains over supervised fine-tuning alone: up to +30.7% in text-game state prediction accuracy and +15.1% in F1 score on web page state prediction. This highlights the framework's efficacy in fine-tuning LLMs directly on the prediction metrics of interest.
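For these two settings, the verifiable rewards can be as simple as the evaluation metrics themselves. A sketch under that assumption (the paper's exact matching and normalization rules may differ):

```python
# Hedged sketch of verifiable rewards for language world models:
# exact-match accuracy for text-game transitions, token-level F1 for web page states.
from collections import Counter

def accuracy_reward(prediction: str, target: str) -> float:
    """1.0 if the predicted next state matches the ground truth exactly, else 0.0."""
    return float(prediction.strip().lower() == target.strip().lower())

def f1_reward(prediction: str, target: str) -> float:
    """Token-level F1 between predicted and ground-truth state descriptions."""
    pred_tokens, tgt_tokens = prediction.split(), target.split()
    if not pred_tokens or not tgt_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(tgt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(tgt_tokens)
    return 2 * precision * recall / (precision + recall)
```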
Video World Models
In the domain of robotic manipulation, RLVR-World was applied to video world models trained on the RT-1 dataset. The post-trained model produced trajectory predictions with better perceptual quality and accuracy than models pre-trained with traditional objectives, and it reached these gains with significantly fewer gradient steps, illustrating the efficiency of the RLVR approach (Figure 3).
Figure 3: Learning curves of video world models. Note the significant difference in the x-axis scale between the pre-training and post-training stages.
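A perceptual metric can serve directly as the verifiable reward for video predictions. The sketch below assumes the open-source `lpips` package and a simple LPIPS-plus-MSE combination; the paper's exact reward definition may differ.

```python
# Hedged sketch: perceptual reward for predicted video frames, assuming the
# `lpips` package. Which metric (or mixture) is actually used as the reward
# is not pinned down here.
import torch
import lpips

_lpips_fn = lpips.LPIPS(net="alex")  # lower LPIPS distance = perceptually closer

def video_reward(pred_frames: torch.Tensor, true_frames: torch.Tensor) -> float:
    """pred_frames, true_frames: (T, 3, H, W) tensors scaled to [-1, 1].

    Returns a higher-is-better reward combining negative LPIPS and negative MSE.
    """
    with torch.no_grad():
        lpips_dist = _lpips_fn(pred_frames, true_frames).mean().item()
        mse = torch.mean((pred_frames - true_frames) ** 2).item()
    return -(lpips_dist + mse)
```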
Model Analysis
Key metrics such as LPIPS, MSE, PSNR, and SSIM improved across the board. RLVR-World also mitigated the repetition artifacts common in autoregressive generative models, and its performance continued to scale as the number of generated samples increased at test time. This suggests that RLVR-World can serve as a more effective post-training paradigm than traditional methods, especially when task-oriented evaluation is critical.
Figure 4: Model analysis. (a) Test-time scaling: best performance among different numbers of generated samples. (b) RL training scaling: learning curves using different group sizes in GRPO. (c) Metric-oriented optimization: RLVR-World trained and tested on different visual metrics.
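Figure 4(a)'s test-time scaling can be read as a best-of-N evaluation: sample N predictions per input and keep the best-scoring one. A minimal sketch, with `generate` and `metric` as hypothetical caller-supplied callables:

```python
# Sketch of best-of-N test-time scaling (Figure 4a): report the best metric
# value among N independent samples. `generate` and `metric` are hypothetical.
from typing import Callable, Dict, List

def best_of_n(generate: Callable[[], str], metric: Callable[[str], float], n: int) -> float:
    """Best metric value among n independent generations for one input."""
    return max(metric(generate()) for _ in range(n))

def scaling_curve(generate: Callable[[], str], metric: Callable[[str], float],
                  sample_counts: List[int]) -> Dict[int, float]:
    """Evaluate best-of-N performance for several values of N (e.g., 1, 2, 4, 8)."""
    return {n: best_of_n(generate, metric, n) for n in sample_counts}
```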
Applications and Implications
Beyond raw prediction quality, the post-trained world models enabled more reliable policy evaluation for real-world robotic manipulation, with the world model acting as a proxy environment and narrowing the gap between imagined rollouts and real outcomes. These findings position RLVR as a promising strategy for enhancing generative world models, with natural extensions to multimodal applications and more robust generative policies.
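One way to picture the policy-evaluation use case: roll candidate policies out inside the learned world model and rank them by predicted outcomes, avoiding real robot rollouts. Everything in this sketch (`world_model.predict`, `policy`, `success_score`) is a hypothetical stand-in for the paper's actual protocol.

```python
# Hedged sketch of world-model-based policy evaluation. All interfaces are
# hypothetical stand-ins; the paper's evaluation setup may differ.

def imagined_return(world_model, policy, initial_obs, horizon: int, success_score) -> float:
    """Average task score over a multi-step rollout imagined by the world model."""
    obs, total = initial_obs, 0.0
    for _ in range(horizon):
        action = policy(obs)                    # policy proposes an action
        obs = world_model.predict(obs, action)  # world model imagines the next observation
        total += success_score(obs)             # task-specific outcome score
    return total / horizon

def rank_policies(world_model, policies: dict, initial_obs, horizon: int, success_score):
    """Rank candidate policies by their imagined returns (best first)."""
    scores = {name: imagined_return(world_model, p, initial_obs, horizon, success_score)
              for name, p in policies.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```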
Conclusion
RLVR-World offers a refined post-training paradigm for world models, addressing prevalent shortcomings of conventional training objectives by focusing on real-world task metrics. The framework's adaptability across language and video tasks highlights its potential as a general method for improving generative model efficacy in prediction-centric applications. Future work should explore customizable reward functions tailored to specific domains and enhance out-of-distribution generalization for complex sequential decision-making.
Figure 5: Qualitative analysis: multi-step video prediction and policy evaluation on RT-1.