- The paper introduces a novel paradigm for visual planning that performs reasoning solely with image sequences, bypassing text-based mediation.
- The paper presents VPRL, a two-stage reinforcement learning framework that improves planning accuracy by over 20% compared to supervised methods.
- The paper demonstrates robust performance across tasks like FrozenLake, Maze, and MiniBehavior, highlighting the benefits of direct visual reasoning.
This paper introduces a new paradigm called "Visual Planning," where reasoning and planning are performed entirely using sequences of images, without relying on textual mediation. The authors argue that language may not always be the most effective modality for reasoning, especially for tasks involving spatial and geometrical information. Traditional multimodal LLMs (MLLMs) often convert visual information into text before reasoning, which can create a modality gap and hinder performance in vision-centric tasks.
To address this, the paper proposes Visual Planning via Reinforcement Learning (VPRL), a novel two-stage reinforcement learning framework designed to train Large Vision Models (LVMs) for visual planning. LVMs are chosen because they are trained exclusively on images and video frames, eliminating potential confounding factors from language-based supervision.
Visual Planning Paradigm
The core idea is to generate a sequence of intermediate images $T = (\hat{v}_1, \dots, \hat{v}_n)$ that represent step-by-step visual states, leading from an initial visual state $v_0$ to a goal. Each subsequent image $\hat{v}_i$ is generated autoregressively by a generative vision model $\pi_\theta$:
$\hat{v}_i \sim \pi_\theta\bigl( v_i \mid v_0, \hat{v}_1, \dots, \hat{v}_{i-1} \bigr)$
This process is analogous to how humans might sketch or visualize steps to solve a problem.
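To make the rollout concrete, here is a minimal sketch of this autoregressive loop in Python. The names `generate_next_state` and `reached_goal` are hypothetical placeholders for the LVM's image sampler and a task-specific stopping check, not functions from the paper.

```python
# Minimal sketch of the visual planning rollout, assuming a hypothetical
# `generate_next_state` wrapper around an autoregressive image generator.
from typing import Any, Callable, List

Image = Any  # placeholder type for a decoded image / visual state


def visual_plan(
    generate_next_state: Callable[[List[Image]], Image],  # samples v_hat_i given the prefix
    v0: Image,
    reached_goal: Callable[[Image], bool] = lambda v: False,
    max_steps: int = 8,
) -> List[Image]:
    """Roll out a trajectory of visual states v0 -> v_hat_1 -> ... -> v_hat_n.

    Each new state is sampled conditioned on the initial state and all
    previously generated states, mirroring v_hat_i ~ pi_theta(. | v0, v_hat_<i).
    """
    trajectory = [v0]
    for _ in range(max_steps):
        next_state = generate_next_state(trajectory)  # conditions on the full prefix
        trajectory.append(next_state)
        if reached_goal(next_state):
            break
    return trajectory
```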
Visual Planning via Reinforcement Learning (VPRL)
VPRL is a two-stage training framework:
- Stage 1: Policy Initialization:
- The LVM ($\pi_\theta$) is initialized by training it on random trajectories generated by random walks in the environment.
- The goal is to enable the model to generate valid sequences of visual states and to encourage exploration.
- The model is trained to predict the next state $v_{i+1}^{(\ell)}$ given a prefix sequence $v_{\leq i}$ by minimizing the loss:
$\mathcal{L}_{\text{VPFT}}(\theta) = -\mathbb{E}_{(v_{\leq i},\, v_{i+1}^{(\ell)})} \Bigl[ \log \pi_\theta\!\bigl( v_{i+1}^{(\ell)} \,\big|\, v_{\leq i} \bigr) \Bigr]$
- This stage acts as a warm-up, focusing on visual coherence and generation quality.
- Stage 2: Reinforcement Learning for Visual Planning:
- This stage uses the initialized model from Stage 1 and applies reinforcement learning, specifically Group Relative Policy Optimization (GRPO), to optimize for visual planning.
- Given an input prefix $v_{\leq i}$, the behavior model $\pi_{\theta_{\text{old}}}$ samples a group of $G$ candidate next visual states $\{\hat{v}_{i+1}^{(1)}, \dots, \hat{v}_{i+1}^{(G)}\}$.
- Each candidate state $\hat{v}_{i+1}^{(k)}$ corresponds to a planned action.
- A rule-based parsing function $P(v_i, \hat{v}_{i+1}^{(k)})$ maps pairs of visual states to discrete actions (valid or invalid).
- Candidates are scored using a composite reward function $r(v_i, \hat{v}_{i+1}^{(k)})$.
- GRPO computes relative advantages $A^{(k)}$ for each candidate within the group.
- The policy $\pi_\theta$ is updated by maximizing the GRPO objective:
$\mathcal{J}_{\text{VPRL}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{k=1}^G \min \left( \rho^{(k)} A^{(k)},\; \text{clip} \left( \rho^{(k)},\, 1-\epsilon,\, 1+\epsilon \right) A^{(k)} \right) - \beta\, D_{\text{KL}}\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right]$
where $\rho^{(k)}$ is the importance sampling ratio between the current policy and the behavior policy, and $\pi_{\text{ref}}$ is the reference policy used for the KL penalty.
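The sketch below shows one common way to compute the group-relative advantages and the clipped surrogate for a single group of candidates. The tensors `rewards`, `logp_new`, and `logp_old` are assumed to come from the reward function and the two policies; the KL penalty term is omitted for brevity.

```python
# Illustrative GRPO step for a group of G candidate next states.
import torch


def grpo_loss(
    rewards: torch.Tensor,   # shape (G,), r(v_i, v_hat_{i+1}^{(k)})
    logp_new: torch.Tensor,  # shape (G,), log pi_theta(v_hat^{(k)} | v_<=i)
    logp_old: torch.Tensor,  # shape (G,), log pi_theta_old(...), detached
    epsilon: float = 0.2,
) -> torch.Tensor:
    # Group-relative advantage: standardize rewards within the candidate group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance sampling ratio rho^(k) between current and behavior policy.
    ratios = torch.exp(logp_new - logp_old)

    # Clipped surrogate objective; return its negation as a loss to minimize.
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```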
Reward Design:
The reward function is crucial for guiding the visual planner.
* A state-action parsing function $P: \mathcal{V} \times \mathcal{V} \to \mathcal{A} \cup \mathcal{E}$ interprets the intended action from the current state $v_i$ to a generated candidate state $\hat{v}_{i+1}^{(k)}$. Here $\mathcal{A}$ is the set of valid actions and $\mathcal{E}$ is the set of invalid transitions.
* A progress map $D(v)$ estimates the remaining steps to the goal from state $v$.
* Actions are categorized into:
* $\mathcal{A}_{\text{opt}}$: optimal actions that make progress toward the goal, i.e., $D(\hat{v}_{i+1}^{(k)}) < D(v_i)$.
* $\mathcal{A}_{\text{nopt}}$: non-optimal valid actions, i.e., $D(\hat{v}_{i+1}^{(k)}) \geq D(v_i)$.
* $\mathcal{E}_{\text{inv}}$: invalid actions.
* The progress reward function is:
$r(v_i, \hat{v}_{i+1}^{(k)}) = \alpha_{\text{opt}} \cdot \mathbb{I}\bigl[P(\cdot) \in \mathcal{A}_{\text{opt}}\bigr] + \alpha_{\text{nopt}} \cdot \mathbb{I}\bigl[P(\cdot) \in \mathcal{A}_{\text{nopt}}\bigr] + \alpha_{\text{inv}} \cdot \mathbb{I}\bigl[P(\cdot) \in \mathcal{E}_{\text{inv}}\bigr]$
* In the experiments, $\alpha_{\text{opt}} = 1$, $\alpha_{\text{nopt}} = 0$, and $\alpha_{\text{inv}} = -5$.
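A minimal sketch of this reward, using the coefficients reported above; `parse_action` and `progress_map` are hypothetical callables standing in for $P$ and $D$.

```python
# Progress reward with alpha_opt = 1, alpha_nopt = 0, alpha_inv = -5.
ALPHA_OPT, ALPHA_NOPT, ALPHA_INV = 1.0, 0.0, -5.0


def progress_reward(v_i, v_hat_next, parse_action, progress_map, valid_actions) -> float:
    action = parse_action(v_i, v_hat_next)
    if action not in valid_actions:                    # P(.) in E_inv: invalid transition
        return ALPHA_INV
    if progress_map(v_hat_next) < progress_map(v_i):   # P(.) in A_opt: closer to the goal
        return ALPHA_OPT
    return ALPHA_NOPT                                  # P(.) in A_nopt: valid but no progress
```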
System Variants for Comparison
Visual Planning via Fine-Tuning (VPFT): A supervised learning baseline that shares the architecture of VPRL Stage 1 but is trained on optimal planning trajectories instead of random walks.
Supervised Fine-Tuning (SFT) in Text: A traditional approach where the model, given a visual input and a textual prompt, generates a textual sequence of actions. The loss is cross-entropy for action prediction:
$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(v,\, t)} \Bigl[ \sum_{l=1}^{L} \log \pi_\theta\bigl( t_l \mid t_{<l},\, v,\, p \bigr) \Bigr]$
where $t = (t_1, \dots, t_L)$ is the target action sequence and $p$ is the textual prompt.
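Both supervised baselines reduce to the same next-token cross-entropy: over discrete visual tokens for VPFT, and over action-text tokens for SFT. A minimal sketch, assuming `logits` over the token vocabulary and `target_ids` with prompt/prefix positions masked out:

```python
import torch
import torch.nn.functional as F


def supervised_loss(logits: torch.Tensor, target_ids: torch.Tensor, ignore_id: int = -100) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); target_ids: (batch, seq_len), ignore_id at masked positions."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=ignore_id,  # skip prompt/prefix and padding positions
    )
```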
Experiments and Results
Key Findings:
- Visual Planning Surpasses Textual Planning:
- VPRL consistently achieved the best performance across all tasks.
- VPFT (visual planning with supervised fine-tuning) outperformed SFT in text by an average of over 22% in Exact Match (EM).
- This suggests that for visual-centric tasks, reasoning directly in the visual modality is more effective.
- Inference-only MLLMs (even advanced ones like Gemini 2.5 Pro) struggled without task-specific fine-tuning.
- Gains from Reinforcement Learning:
- VPRL significantly outperformed its supervised counterpart VPFT by more than 20% across all tasks.
- VPRL Stage 1 (policy initialization) achieved near-random performance, while Stage 2 (RL optimization) led to the best results, highlighting RL's effectiveness in learning planning strategies beyond imitation.
- Robustness with Scaling Complexity:
- As task complexity increased (e.g., larger grid sizes in FrozenLake), the performance of text-based reasoning models like Gemini 2.5 Pro dropped sharply.
- Visual planners (VPFT and VPRL) maintained higher accuracy and showed more gradual performance degradation, with VPRL being the most robust.
Discussion and Analysis
- Error Analysis: VPRL can still take non-optimal actions (detours) or invalid actions (violating environment constraints, e.g., walking through walls), but it is more flexible than VPFT. Visual planning also avoids the cascading errors seen in text-based systems that misinterpret visual information early on.
- Random Policy Initialization: Initializing the policy with random trajectories (VPRL Stage 1) is crucial for exploration. VPFT, trained on optimal paths, has limited exploration (low entropy) and struggles if used directly for RL, as it yields near-zero advantages for GRPO. VPRL Stage 1 maintains high entropy with a low invalid action ratio.
- VPRL Reduces Invalid Actions: VPRL significantly reduces the proportion of failed trajectories caused by invalid actions compared to VPFT (e.g., from 60-78% down to 25-37%).
Implementation Details
- LVM Backbone: LVM-3B uses a VQGAN-based tokenizer to encode images into 256 discrete visual tokens.
- State-Action Parsing for Reward: The rule-based parsing function P for reward calculation involves the following steps (a minimal parsing sketch follows this list):
- Converting images to grayscale and a coordinate-based representation.
- Computing Intersection-over-Union (IoU) to find the agent's predicted position.
- Inferring actions by comparing start and predicted positions against task rules.
- Using pixel-wise Mean Squared Error (MSE) to detect invalid transitions like agent disappearance.
- For MiniBehavior, IoU changes detect "pick" actions, and MSE changes in table regions detect "drop" actions.
- Progress Map for Reward: Breadth-First Search (BFS) is used to calculate the optimal number of steps to the goal from each position, forming the progress map $D(v)$ (a BFS sketch also follows this list).
- Training:
- Low-Rank Adaptation (LoRA) was applied for fine-tuning.
- VPRL Stage 1 trained for 10 epochs on random trajectories.
- VPRL Stage 2 trained for 10 epochs using GRPO with a group size of 10 candidate responses and a KL divergence penalty coefficient β=0.001.
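As referenced above, here is a hedged sketch of a rule-based state-action parser for a grid-navigation task. It locates the agent via IoU between binarized cell crops and an agent template and infers the action from the coordinate change; the cell size, threshold, and helper functions are illustrative assumptions, and a missing agent is flagged by the IoU threshold rather than the paper's pixel-MSE check.

```python
import numpy as np

CELL = 32  # assumed pixel size of one grid cell


def binarize(patch: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Grayscale patch in [0, 1] -> boolean foreground mask."""
    return patch > thresh


def crop_cell(img: np.ndarray, row: int, col: int) -> np.ndarray:
    return img[row * CELL:(row + 1) * CELL, col * CELL:(col + 1) * CELL]


def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0


def locate_agent(img: np.ndarray, agent_template: np.ndarray, grid_size: int):
    """Return the (row, col) whose binarized crop best overlaps the agent template."""
    best, best_score = None, 0.0
    for r in range(grid_size):
        for c in range(grid_size):
            score = iou(binarize(crop_cell(img, r, c)), binarize(agent_template))
            if score > best_score:
                best, best_score = (r, c), score
    return best if best_score > 0.5 else None  # None -> agent missing or corrupted


def parse_action(v_i: np.ndarray, v_next: np.ndarray, agent_template: np.ndarray, grid_size: int) -> str:
    start = locate_agent(v_i, agent_template, grid_size)
    end = locate_agent(v_next, agent_template, grid_size)
    if start is None or end is None:  # e.g. agent disappeared -> invalid transition
        return "invalid"
    moves = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right", (0, 0): "stay"}
    return moves.get((end[0] - start[0], end[1] - start[1]), "invalid")  # jumps -> invalid
```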
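And a sketch of the BFS progress map, which computes the optimal step count to the goal from every passable cell; the boolean `grid` occupancy array is an assumed representation of the environment layout.

```python
from collections import deque

import numpy as np


def bfs_progress_map(grid: np.ndarray, goal: tuple) -> np.ndarray:
    """Return the optimal number of steps to `goal` from each cell; unreachable cells get -1."""
    rows, cols = grid.shape
    dist = np.full((rows, cols), -1, dtype=int)
    dist[goal] = 0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            # Expand to in-bounds, passable, unvisited neighbors.
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] and dist[nr, nc] == -1:
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))
    return dist
```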
The paper concludes that Visual Planning is a viable and promising alternative to language-based reasoning for visually oriented tasks, opening new avenues for research in multimodal AI. The VPRL framework demonstrates significant improvements in planning performance and generalization.