- The paper introduces a novel paradigm for visual planning that performs reasoning solely with image sequences, bypassing text-based mediation.
- The paper presents VPRL, a two-stage reinforcement learning framework that improves planning accuracy by over 20% compared to supervised methods.
- The paper demonstrates robust performance across tasks like FrozenLake, Maze, and MiniBehavior, highlighting the benefits of direct visual reasoning.
This paper introduces a new paradigm called "Visual Planning," where reasoning and planning are performed entirely using sequences of images, without relying on textual mediation. The authors argue that language may not always be the most effective modality for reasoning, especially for tasks involving spatial and geometrical information. Traditional multimodal LLMs (MLLMs) often convert visual information into text before reasoning, which can create a modality gap and hinder performance in vision-centric tasks.
To address this, the paper proposes Visual Planning via Reinforcement Learning (VPRL), a novel two-stage reinforcement learning framework designed to train Large Vision Models (LVMs) for visual planning. LVMs are chosen because they are trained exclusively on images and video frames, eliminating potential confounding factors from language-based supervision.
Visual Planning Paradigm
The core idea is to generate a sequence of intermediate images $T = (\hat{v}_1, \dots, \hat{v}_n)$ that represent step-by-step visual states, leading from an initial visual state $v_0$ to a goal. Each subsequent image $\hat{v}_i$ is generated autoregressively by a generative vision model $\pi_\theta$:
$\hat{v}_i \sim \pi_\theta\bigl( v_i \mid v_0, \hat{v}_1, \dots, \hat{v}_{i-1} \bigr)$
This process is analogous to how humans might sketch or visualize steps to solve a problem.
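To make the rollout concrete, here is a minimal sketch of this autoregressive loop in Python. The names `generate_next_state` and `reached_goal` are hypothetical placeholders for the LVM's image sampler and a task-specific stopping check, not functions from the paper.

```python
# Minimal sketch of the visual planning rollout, assuming a hypothetical
# `generate_next_state` wrapper around an autoregressive image generator.
from typing import Any, Callable, List

Image = Any  # placeholder type for a decoded image / visual state


def visual_plan(
    generate_next_state: Callable[[List[Image]], Image],  # samples v_hat_i given the prefix
    v0: Image,
    reached_goal: Callable[[Image], bool] = lambda v: False,
    max_steps: int = 8,
) -> List[Image]:
    """Roll out a trajectory of visual states v0 -> v_hat_1 -> ... -> v_hat_n.

    Each new state is sampled conditioned on the initial state and all
    previously generated states, mirroring v_hat_i ~ pi_theta(. | v0, v_hat_<i).
    """
    trajectory = [v0]
    for _ in range(max_steps):
        next_state = generate_next_state(trajectory)  # conditions on the full prefix
        trajectory.append(next_state)
        if reached_goal(next_state):
            break
    return trajectory
```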
Visual Planning via Reinforcement Learning (VPRL)
VPRL is a two-stage training framework:
- Stage 1: Policy Initialization:
- The LVM ($\pi_\theta$) is initialized by training it on random trajectories generated by random walks in the environment.
- The goal is to enable the model to generate valid sequences of visual states and to encourage exploration.
- The model is trained to predict the next state $v_{i+1}^{(\ell)}$ given a prefix sequence $v_{\leq i}$ by minimizing the loss:
$\mathcal{L}_{\text{VPFT}}(\theta) = -\mathbb{E}_{(v_{\leq i},\, v_{i+1}^{(\ell)})} \Bigl[ \log \pi_\theta\!\bigl( v_{i+1}^{(\ell)} \,\big|\, v_{\leq i} \bigr) \Bigr]$
- This stage acts as a warm-up, focusing on visual coherence and generation quality.
- Stage 2: Reinforcement Learning for Visual Planning:
- This stage uses the initialized model from Stage 1 and applies reinforcement learning, specifically Group Relative Policy Optimization (GRPO), to optimize for visual planning.
- Given an input prefix $v_{\leq i}$, the behavior model $\pi_{\theta_{\text{old}}}$ samples a group of $G$ candidate next visual states $\{\hat{v}_{i+1}^{(1)}, \dots, \hat{v}_{i+1}^{(G)}\}$.
- Each candidate state $\hat{v}_{i+1}^{(k)}$ corresponds to a planned action.
- A rule-based parsing function $P(v_i, \hat{v}_{i+1}^{(k)})$ maps pairs of visual states to discrete actions (valid or invalid).
- Candidates are scored using a composite reward function $r(v_i, \hat{v}_{i+1}^{(k)})$.
- GRPO computes relative advantages $A^{(k)}$ for each candidate within the group.
- The policy $\pi_\theta$ is updated by maximizing the GRPO objective:
$\mathcal{J}_{\text{VPRL}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{k=1}^G \min \left( \rho^{(k)} A^{(k)},\; \text{clip} \left( \rho^{(k)},\, 1-\epsilon,\, 1+\epsilon \right) A^{(k)} \right) - \beta\, D_{\text{KL}}\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right]$
where $\rho^{(k)}$ is the importance sampling ratio between the current policy and the behavior policy, and $\pi_{\text{ref}}$ is the reference policy used for the KL penalty.
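The sketch below shows one common way to compute the group-relative advantages and the clipped surrogate for a single group of candidates. The tensors `rewards`, `logp_new`, and `logp_old` are assumed to come from the reward function and the two policies; the KL penalty term is omitted for brevity.

```python
# Illustrative GRPO step for a group of G candidate next states.
import torch


def grpo_loss(
    rewards: torch.Tensor,   # shape (G,), r(v_i, v_hat_{i+1}^{(k)})
    logp_new: torch.Tensor,  # shape (G,), log pi_theta(v_hat^{(k)} | v_<=i)
    logp_old: torch.Tensor,  # shape (G,), log pi_theta_old(...), detached
    epsilon: float = 0.2,
) -> torch.Tensor:
    # Group-relative advantage: standardize rewards within the candidate group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance sampling ratio rho^(k) between current and behavior policy.
    ratios = torch.exp(logp_new - logp_old)

    # Clipped surrogate objective; return its negation as a loss to minimize.
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```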
Reward Design:
The reward function is crucial for guiding the visual planner.
* A state-action parsing function $P: \mathcal{V} \times \mathcal{V} \to \mathcal{A} \cup \mathcal{E}$ interprets the intended action from the current state $v_i$ to a generated candidate state $\hat{v}_{i+1}^{(k)}$. Here $\mathcal{A}$ is the set of valid actions and $\mathcal{E}$ is the set of invalid transitions.
* A progress map $D(v)$ estimates the remaining steps to the goal from state $v$.
* Actions are categorized into:
* $\mathcal{A}_{\text{opt}}$: optimal actions that make progress toward the goal, i.e., $D(\hat{v}_{i+1}^{(k)}) < D(v_i)$.
* $\mathcal{A}_{\text{nopt}}$: non-optimal valid actions, i.e., $D(\hat{v}_{i+1}^{(k)}) \geq D(v_i)$.
* $\mathcal{E}_{\text{inv}}$: invalid actions.
* The progress reward function is:
$r(v_i, \hat{v}_{i+1}^{(k)}) = \alpha_{\text{opt}} \cdot \mathbb{I}\bigl[P(\cdot) \in \mathcal{A}_{\text{opt}}\bigr] + \alpha_{\text{nopt}} \cdot \mathbb{I}\bigl[P(\cdot) \in \mathcal{A}_{\text{nopt}}\bigr] + \alpha_{\text{inv}} \cdot \mathbb{I}\bigl[P(\cdot) \in \mathcal{E}_{\text{inv}}\bigr]$
* In the experiments, $\alpha_{\text{opt}} = 1$, $\alpha_{\text{nopt}} = 0$, and $\alpha_{\text{inv}} = -5$.
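A minimal sketch of this reward, using the coefficients reported above; `parse_action` and `progress_map` are hypothetical callables standing in for $P$ and $D$.

```python
# Progress reward with alpha_opt = 1, alpha_nopt = 0, alpha_inv = -5.
ALPHA_OPT, ALPHA_NOPT, ALPHA_INV = 1.0, 0.0, -5.0


def progress_reward(v_i, v_hat_next, parse_action, progress_map, valid_actions) -> float:
    action = parse_action(v_i, v_hat_next)
    if action not in valid_actions:                    # P(.) in E_inv: invalid transition
        return ALPHA_INV
    if progress_map(v_hat_next) < progress_map(v_i):   # P(.) in A_opt: closer to the goal
        return ALPHA_OPT
    return ALPHA_NOPT                                  # P(.) in A_nopt: valid but no progress
```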
System Variants for Comparison
Visual Planning via Fine-Tuning (VPFT): A supervised learning baseline that shares the architecture of VPRL Stage 1 but is trained on optimal planning trajectories instead of random walks.
Supervised Fine-Tuning (SFT) in Text: A traditional approach where the model, given a visual input and a textual prompt, generates a textual sequence of actions. The loss is cross-entropy for action prediction:
$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(v,\, t)} \Bigl[ \sum_{l=1}^{L} \log \pi_\theta\bigl( t_l \mid t_{<l},\, v,\, p \bigr) \Bigr]$
where $t = (t_1, \dots, t_L)$ is the target action sequence and $p$ is the textual prompt.
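Both supervised baselines reduce to the same next-token cross-entropy: over discrete visual tokens for VPFT, and over action-text tokens for SFT. A minimal sketch, assuming `logits` over the token vocabulary and `target_ids` with prompt/prefix positions masked out:

```python
import torch
import torch.nn.functional as F


def supervised_loss(logits: torch.Tensor, target_ids: torch.Tensor, ignore_id: int = -100) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); target_ids: (batch, seq_len), ignore_id at masked positions."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=ignore_id,  # skip prompt/prefix and padding positions
    )
```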
Experiments and Results
Key Findings:
- Visual Planning Surpasses Textual Planning:
- VPRL consistently achieved the best performance across all tasks.
- VPFT (visual planning with supervised fine-tuning) outperformed SFT in text by an average of over 22% in Exact Match (EM).
- This suggests that for visual-centric tasks, reasoning directly in the visual modality is more effective.
- Inference-only MLLMs (even advanced ones like Gemini 2.5 Pro) struggled without task-specific fine-tuning.
- Gains from Reinforcement Learning:
- VPRL significantly outperformed its supervised counterpart VPFT by more than 20% across all tasks.
- VPRL Stage 1 (policy initialization) achieved near-random performance, while Stage 2 (RL optimization) led to the best results, highlighting RL's effectiveness in learning planning strategies beyond imitation.
- Robustness with Scaling Complexity:
- As task complexity increased (e.g., larger grid sizes in FrozenLake), the performance of text-based reasoning models like Gemini 2.5 Pro dropped sharply.
- Visual planners (VPFT and VPRL) maintained higher accuracy and showed more gradual performance degradation, with VPRL being the most robust.
Discussion and Analysis
- Error Analysis: VPRL can still take non-optimal actions (detours) or invalid actions (violating environment constraints, e.g., walking through walls), but it is more flexible than VPFT. Visual planning also avoids the cascading errors seen in text-based systems that misinterpret visual information early on.
- Random Policy Initialization: Initializing the policy with random trajectories (VPRL Stage 1) is crucial for exploration. VPFT, trained on optimal paths, has limited exploration (low entropy) and struggles if used directly for RL, as it yields near-zero advantages for GRPO. VPRL Stage 1 maintains high entropy with a low invalid action ratio.
- VPRL Reduces Invalid Actions: VPRL significantly reduces the proportion of failed trajectories caused by invalid actions compared to VPFT (e.g., from 60-78% down to 25-37%).
Implementation Details
- LVM Backbone: LVM-3B uses a VQGAN-based tokenizer to encode images into 256 discrete visual tokens.
- State-Action Parsing for Reward: The rule-based parsing function P for reward calculation involves the following steps (a minimal parsing sketch follows this list):
- Converting images to grayscale and a coordinate-based representation.
- Computing Intersection-over-Union (IoU) to find the agent's predicted position.
- Inferring actions by comparing start and predicted positions against task rules.
- Using pixel-wise Mean Squared Error (MSE) to detect invalid transitions like agent disappearance.
- For MiniBehavior, IoU changes detect "pick" actions, and MSE changes in table regions detect "drop" actions.
- Progress Map for Reward: Breadth-First Search (BFS) is used to calculate the optimal number of steps to the goal from each position, forming the progress map $D(v)$ (a BFS sketch also follows this list).
- Training:
- Low-Rank Adaptation (LoRA) was applied for fine-tuning.
- VPRL Stage 1 trained for 10 epochs on random trajectories.
- VPRL Stage 2 trained for 10 epochs using GRPO with a group size of 10 candidate responses and a KL divergence penalty coefficient β=0.001.
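As referenced above, here is a hedged sketch of a rule-based state-action parser for a grid-navigation task. It locates the agent via IoU between binarized cell crops and an agent template and infers the action from the coordinate change; the cell size, threshold, and helper functions are illustrative assumptions, and a missing agent is flagged by the IoU threshold rather than the paper's pixel-MSE check.

```python
import numpy as np

CELL = 32  # assumed pixel size of one grid cell


def binarize(patch: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Grayscale patch in [0, 1] -> boolean foreground mask."""
    return patch > thresh


def crop_cell(img: np.ndarray, row: int, col: int) -> np.ndarray:
    return img[row * CELL:(row + 1) * CELL, col * CELL:(col + 1) * CELL]


def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0


def locate_agent(img: np.ndarray, agent_template: np.ndarray, grid_size: int):
    """Return the (row, col) whose binarized crop best overlaps the agent template."""
    best, best_score = None, 0.0
    for r in range(grid_size):
        for c in range(grid_size):
            score = iou(binarize(crop_cell(img, r, c)), binarize(agent_template))
            if score > best_score:
                best, best_score = (r, c), score
    return best if best_score > 0.5 else None  # None -> agent missing or corrupted


def parse_action(v_i: np.ndarray, v_next: np.ndarray, agent_template: np.ndarray, grid_size: int) -> str:
    start = locate_agent(v_i, agent_template, grid_size)
    end = locate_agent(v_next, agent_template, grid_size)
    if start is None or end is None:  # e.g. agent disappeared -> invalid transition
        return "invalid"
    moves = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right", (0, 0): "stay"}
    return moves.get((end[0] - start[0], end[1] - start[1]), "invalid")  # jumps -> invalid
```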
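And a sketch of the BFS progress map, which computes the optimal step count to the goal from every passable cell; the boolean `grid` occupancy array is an assumed representation of the environment layout.

```python
from collections import deque

import numpy as np


def bfs_progress_map(grid: np.ndarray, goal: tuple) -> np.ndarray:
    """Return the optimal number of steps to `goal` from each cell; unreachable cells get -1."""
    rows, cols = grid.shape
    dist = np.full((rows, cols), -1, dtype=int)
    dist[goal] = 0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            # Expand to in-bounds, passable, unvisited neighbors.
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] and dist[nr, nc] == -1:
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))
    return dist
```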
The paper concludes that Visual Planning is a viable and promising alternative to language-based reasoning for visually oriented tasks, opening new avenues for research in multimodal AI. The VPRL framework demonstrates significant improvements in planning performance and generalization.