Beyond State Consistency: Behavior Consistency in Text-Based World Models

Published 15 Apr 2026 in cs.LG | (2604.13824v1)

Abstract: World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text-based environments, world models are typically evaluated and trained with single-step metrics such as Exact Match, aiming to improve the similarity between predicted and real-world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior-aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step-level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world-model-predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR-based training improves long-term alignment in several settings, with the clearest gains in WebShop and less movement in near-ceiling regimes, while preserving or improving single-step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference-time lookahead planning.

Abstract PDF Upgrade to Chat

Authors (11)

Summary

The paper demonstrates that integrating Behavior Consistency Reward (BehR) enhances decision preservation by aligning predicted states with agent actions.
It employs Group Relative Policy Optimization (GRPO) to optimize behavior-level objectives, resulting in robust calibration and reduced false positives in weak agents.
Empirical results show improved task-level consistency, with a notable increase in Pairwise Consistency Ratio (e.g., from 0.345 to 0.483) in key environments.

Behavior Consistency in Text-Based World Models: Moving Beyond Surface State Similarity

Introduction

Text-based world models (WMs) are increasingly central to enabling LLMs to function as surrogate environments for interactive agents operating in text-rich settings—such as web navigation and text-based games. Most established approaches optimize for state consistency: maximizing token-level or semantic similarity between next-state predictions and ground truth environment responses. However, this paradigm is shown to be insufficient: world models with high textual or semantic overlap can fail to preserve the actual agent behavior due to their inability to capture decision-critical function. This paper proposes a paradigm shift, introducing behavior consistency—a functional criterion—into both the evaluation and training of text-based world models (2604.13824).

Limitations of State Consistency and Metric Inversion

Contemporary WMs are conventionally trained to reconstruct the next environment state with high textual fidelity, leveraging metrics such as Exact Match (EM), BERTScore, or token-level F1. However, these state-level metrics can misalign with agent requirements. For example, a predicted page in a web environment may appear highly similar to the ground-truth page but omit a decision-critical element (e.g., the target product to be purchased). Such a deficiency can have catastrophic downstream effects on agent action space, even as surface metrics remain high. This failure mode—metric inversion—arises when the metric values are not sensitive to functionally crucial omissions or inclusions in next states.

Figure 1: Metric inversion in WebShop—the omission of a decision-critical product results in a catastrophic error that typical surface or semantic metrics do not penalize adequately; only behavior consistency reflects the true impact on agent action space.

Behavior Consistency Reward (BehR) and Training Paradigm

The paper introduces Behavior Consistency Reward (BehR) as a step-level proxy for functional consistency. Rather than scoring the predicted state based on token-level similarity to the reference, BehR evaluates whether the agent is likely to take the same action in the predicted state as in the real state. This is operationalized as follows:

Reference Agent: A frozen agent model computes, for the logged next action, its log-likelihood under (a) the real state and (b) the predicted world model state.
BehR: BehR is defined as the negative absolute difference between these two log-likelihoods, thus rewarding predictions that preserve the propensity to execute the correct next action.

Practically, BehR is implementable in offline settings using logged trajectories. It obviates the need for notoriously inaccessible full action distributions for black-box policies. BehR-based models are optimized via Group Relative Policy Optimization (GRPO), a reinforcement learning approach that leverages behavior-level rewards instead of only surface reconstruction signals.

Figure 2: Behavior Consistency Training pipeline—compared to traditional state consistency, the proposed approach aligns world model outputs to agent decision preservation signals derived from a reference agent’s action likelihoods.

Empirical Results: Single-Step and Long-Horizon Performance

Single-Step Prediction

BehR-optimized world models demonstrate parity or improvement in single-step exact match (EM) accuracy in three out of four evaluated settings (WebShop and TextWorld, with both Qwen2.5-7B and LLaMA3.1-8B backbones). This suggests that optimizing for functional consistency does not inherently degrade local next-state prediction quality.

Task-Level Consistency and Functional Evaluation

Task-level evaluation centers on Pairwise Consistency Ratio (CR $_{\text{pw}}$ )—the fraction of real-environment successful tasks that remain successful when the same agent trajectory is replayed in the world model and then in the real environment (W2R setting). The strongest numerical gains are observed in WebShop, especially in Qwen-based settings (e.g., CR $_{\text{pw}}$ increases from 0.345 to 0.483 for Qwen3-8B). The gains transfer to other domains and model families, albeit with reduced margin in near-ceiling regimes or when underlying agents compensate for modeling deficiencies.

Notably, reward ablation experiments (contrasting BehR to token-level F1 and structured factual rewards) demonstrate that the choice of optimization target substantively impacts long-horizon functional preservation. While all methods improve over SFT baselines, meaningful gains in CR $_{\text{pw}}$ are attributable to behavior-level objectives, not merely the reinforcement learning procedure per se.

Downstream Implications: Evaluation Calibration and Planning

Surrogate Evaluation

World models are often used as offline surrogates for agent evaluation owing to the prohibitive cost of real-environment interaction. However, calibration is essential: a poorly calibrated WM can inflate the success of weak agents through false positives. BehR-based WMs reduce false positives by factors of 3–5 for weak agents in TextWorld, raising agreement rates at the task level from ~60% to over 80–90%.

Lookahead Planning

In online planning experiments (WebShop), integrating BehR-optimized WMs yields modest but consistent improvements in agent performance under lookahead planning protocols, compared to both pure SFT and F1-optimized WMs. Most gains derive from the lookahead paradigm itself, with BehR’s contribution being incremental—suggesting higher-fidelity action rollouts but not eliminating the need for strong downstream planning heuristics.

Why Optimization Target Selection is Critical

Control ablations confirm that optimizing for token-level F1 (surface) or structured factual correctness alone fails to consistently preserve decision-critical behavior. While such objectives can improve state-level similarity or factual correctness, only behavior-level alignment (as instantiated by BehR) consistently boosts task-level functional preservation—particularly in regimes where strong downstream policies cannot compensate for world model deficiencies. Optimization target selection, therefore, is not a trivial choice but a primary axis of world model efficacy.

Limitations and Theoretical Implications

BehR leverages only the log-likelihood of a single logged action under a fixed (frozen) reference agent, not the full downstream action distribution. As a practical proxy, its success depends on the relevance of the logged next action and the representativeness of the reference agent. Transferability across diverse agent architectures may be imperfect; gains are most pronounced when baseline world models are miscalibrated or insufficiently functional. Furthermore, improvements taper off near the upper bounds of agent or world model capacity (near-ceiling regimes).

Theoretically, the paradigm refocuses world model evaluation and training from the ill-posed task of linguistic similarity to a criteria focused on agent usability and decision invariance—a shift paralleling similar movements in computer vision (e.g., perceptual metrics for image generation).

Future Directions

Future theoretical and practical research should advance:

Estimation of fuller action distributions or sampling-based behavior consistency proxies.
Family-invariant reference judging for robust downstream generalization.
Integration into multi-agent or open-ended tool-use domains where function preservation is not reducible to a single action’s likelihood.
Extension to non-textual, multimodal environments and non-Markovian agent policies.

Conclusion

This paper demonstrates that behavior-level functional alignment is essential for effective text-based world modeling. Surface or semantic similarity is necessary but insufficient for preserving agent task performance. The introduction of behavior-consistent training, instantiated via the BehR reward, results in robust improvements on downstream agent-task preservation and evaluation calibration, and offers preliminary benefits for planning. These findings advocate for a reorientation of field-wide objectives in world model research, prioritizing functional consistency over text reconstruction as the principal axis of progress.