Qwen2.5-VL Navigation Policy

Updated 29 December 2025

The paper presents a novel method combining the Qwen2.5-VL-3B architecture with dual-view visual prompts to enhance spatial context and navigation command accuracy.
It leverages SRGPO reinforcement learning to optimize policy performance, achieving up to 72.3% success rate in simulated navigation tasks.
The approach underscores the critical role of visual prompt components and geometric priors, outperforming traditional plug-and-play and behavior cloning baselines.

A Qwen2.5-VL-based navigation policy refers to a class of embodied vision-language navigation (VLN) agents that leverage the multimodal capabilities of the Qwen2.5-VL large vision-LLM (LVLM) to interpret natural language instructions and visual context, producing action commands in simulated or real environments. These policies are often instantiated using the Qwen2.5-VL-3B-Instruct backbone and may range from plug-and-play "frozen" models with lightweight planning logic to sophisticated, reinforcement-fine-tuned agents with advanced prompting and reward shaping. The following sections summarize the architectural foundations, prompting strategies, reinforcement learning optimizations, empirical results, and comparative context for Qwen2.5-VL-based navigation policies as documented in the recent literature.

1. Multimodal Policy Architecture

The canonical Qwen2.5-VL navigation agent consists of a large multimodal transformer (Qwen2.5-VL-3B-Instruct) that ingests text and visual tokens, producing autoregressive text culminating in an action token interpreted as a navigation command. Input modalities at step $t$ include:

Instruction $I$ : tokenized natural language route description.
Strategy Scaffold $S$ : optional human-specified hints ("You should..."), tokenized as text.
Visual Input $V'_t$ : a dual-view image ( $\mathbb{R}^{H\times 2W\times 3}$ ), concatenating a bird’s-eye (BEV) and front (FV) view overlaid with a multi-component Visual Prompt (see below).
History $H_t$ : serialized text buffer of the last $T_H$ steps, encoding (action, feedback) tuples.

The vision encoder projects $V'_t$ into a sequence of visual-token embeddings, which interleave with text embeddings (instruction, strategy, history) before processing in the LLM decoder. At each step, the model emits a chain-of-thought $c_t$ (text) and the final discrete action token $a_t\in\{0,\ldots,7\}$ , covering all elementary moves and rotations.

2. Visual Prompting and Spatial State Representation

Central to performance improvements is the dual-view Visual Prompt (VP) technique. Four overlaid modules are simultaneously rendered on both BEV and FV before concatenation:

Bounding Box (BB): demarcates the target in both views.
Navigation Line (NL): a red arrow indicating the vector from the agent to the target.
Agent Marker (AM): bidirectional circle with a forward arrow in BEV, encoding pose, position, and left/right context.
Action Projection (AP): candidate action arrows, each labeled (IDs $0$–$3$ for translation in BEV, $4$–$7$ for rotation in FV).
View Alignment (VA): BEV orientation is rotated such that its “forward” always matches the FV’s forward camera.

This augmentation improves the agent’s real-time awareness of spatial context, reduces perception hallucination, and yields high zero-shot performance. Critically, no model parameters are altered for visual prompt injection.

3. Reinforcement Fine-Tuning: Step Reward Group Policy Optimization (SRGPO)

Qwen2.5-VL-based navigation policies may be behavior-cloned from expert trajectories, but high performance is achieved by a post-training phase leveraging reinforcement learning—in particular, the Step Reward Group Policy Optimization (SRGPO) method, which uses bi-level advantage estimation for stable, sample-efficient policy optimization (Wang et al., 2 Dec 2025).

The navigation environment is modeled as an episodic MDP $(\mathcal{S},\mathcal{A},P,r,\gamma)$ , with:

State $o_t = (I, S, V_t, H_t)$
Action set $\mathcal{A} = \{0,\ldots,7\}$
Transition: deterministic given AI2-THOR simulator and $a_t$
Discount: $\gamma = 1$ for each episode (max 20 steps)

At each step, the state-independent process reward $r_t^s$ is:

$r_t^s = r_{t,\text{base}}^s - \lambda_{\text{valid}} r_{t,\text{valid}}^s$

where $r_{t,\text{base}}^s$ provides $+1$ if $p_t$ (agent’s position) is closer to $g$ (goal) than in the previous step, or if visibility of $g$ in $F_t$ has increased; otherwise $0$. Invalid actions are penalized by $r_{t,\text{valid}}^s$ , scaled by $\lambda_{\text{valid}}=0.1$ .

SRGPO Advantage Estimation:

Episode-level advantage: $A_E(\tau_i) = (R(\tau_i) - \text{mean}_j R(\tau_j))/\text{std}_j R(\tau_j)$
Step-level advantage: $A_S(o^{(i)}_t,a^{(i)}_t) = (r_t^{s(i)} - \text{mean}_j r_t^{s(j)})/\text{std}_j r_t^{s(j)}$ for randomly grouped steps.
Combined: $A_{i,t} = A_E(\tau_i) + \omega A_S(o^{(i)}_t,a^{(i)}_t)$ with $\omega=0.5$ .

The surrogate loss $J_{\mathrm{SRGPO}}(\theta)$ uses clipped PPO with KL penalty to ensure policy stability:

$J_{SRGPO}(\theta) = \mathbb{E}_{o, \{\tau_i\}} \left[ \frac{1}{N} \sum_{i=1}^N \frac{1}{T_i} \sum_{t=1}^{T_i} \min\left\{ \rho_{i,t}(\theta) A_{i,t}, \operatorname{clip}(\rho_{i,t}(\theta),1\pm\epsilon)A_{i,t} \right\} - \beta D_{KL}\left[\pi_\theta(\cdot|o) \parallel \pi_{\mathrm{ref}}(\cdot|o)\right] \right]$

with PPO ratio $\rho_{i,t}(\theta)$ and KL regularization ( $\epsilon=0.2$ , $\beta\sim0.01$ –$0.1$).

4. Training Protocols and Hyper-parameters

The policy optimization for Qwen2.5-VL-3B proceeds as follows:

Expert imitation phase: Optional supervised fine-tuning (SFT) on GPT-4.1-generated demonstration trajectories.
RL phase (SRGPO): Rollout groups of $N=4$ trajectories per task, group step samples into $N_S=16$ for step-level estimation, aggregate statistics as per the bi-level advantage, and perform a gradient step on $J_{SRGPO}$ . Action history buffer $T_H=5$ , max steps $T_{\text{max}}=20$ .
Optimizer: AdamW, learning rate $5\times10^{-6}$ , batch size 8 (trajectories).
Epochs: 150 (in-domain), 100 (out-of-domain).
Curriculum: None beyond random placement and agent spawn.

No network architecture changes are performed across dual-view prompt or advantage grouping; all improvements are input- and loss-driven.

5. Empirical Outcomes and Comparative Analysis

Empirical evaluations on the EmbodiedBench Navigation benchmark (Qwen2.5-VL-3B backbone) reveal the effectiveness of the above strategies (Wang et al., 2 Dec 2025):

Model/Regimen	Success Rate (%)
Qwen2.5-VL-3B-Ins zero-shot	16.7
+Visual Prompt (VP)	36.7
VP + SFT only	36.7
VP + GRPO	40.8 (±0.8)
VP + GiGPO (vanilla)	29.4 (±2.8)
VP + GiGPO w/ VPR	57.2 (±8.0)
VP + SRGPO (proposed)	72.3 (±0.8)

SRGPO substantially outperforms prior RFT baselines (GRPO, GiGPO), with $+31.7$ pp improvement over GRPO and $+14.5$ pp over GiGPO with Visual Process Reward. Convergence is rapid (around 50 epochs), with low variance and marked robustness in out-of-domain generalization.

Ablative studies confirm that all visual prompt components (especially BB, AP, VA) are critical: omitting any leads to significant performance drop, e.g., from 86.7% to as low as 45% on a GPT-4.1 backbone.

6. Comparative Frameworks and Plug-and-Play Baselines

Alternative Qwen2.5-VL navigation policies adopt either plug-and-play modular pipelines or behavior cloning strategies (Duan et al., 11 Jun 2025, Kåsene et al., 4 Aug 2025). A typical modular design comprises:

Vision Encoder: transforms one or two RGB images into embeddings (e.g., two-frame fusion).
Prompt Manager: assembles system prompts and maintains a rolling buffer of step-action-reflection history.
Planning Logic: fuses model output scores with geometric priors to select actions.
Action Executor: dispatches discrete navigation commands.

Prompt engineering leverages structured persona, action sets, common-sense priors, and history, asking the model to produce structured outputs (e.g., JSON with action scores and agent reflection). Planning logic then linearly combines LLM scores with geometric priors:

$S(a) = \alpha \cdot \mathrm{LLM\_score}(a) + \beta \cdot \mathrm{geom\_prior}(a)$

and selects $a^* = \arg\max_a S(a)$ . Performance on VLN-CE (Matterport3D, R2R val-unseen, first 20 paths) yields $5\%$ SR and $5\%$ SPL (Duan et al., 11 Jun 2025), far below actively fine-tuned methods, highlighting the necessity of joint RL and advanced state encoding.

Behavior-cloned Qwen2.5-VL-3B-Instruct policies using panoramic action space achieve up to $41\%$ SR on Room-to-Room (Kåsene et al., 4 Aug 2025), but trail models with active spatial reasoning or group-based RL.

7. Limitations and Prospective Extensions

While Qwen2.5-VL-based navigation policies with visual prompting and SRGPO achieve state-of-the-art results within the evaluated spectrum of LVLMs, certain challenges persist:

Generalization to strictly novel (o.o.d.) environments remains non-trivial without explicit mapping or geometric priors.
RL methods (e.g., SRGPO) introduce computational overhead relative to plug-and-play inference.
Vision-language alignment is achieved via simple projection/concatenation; explicit spatial or semantic fusion modules are absent.
In modular approaches, real-time deployment is hampered by LLM inference costs and limited by weak priors in unseen geometries.

Future directions identified in the literature emphasize integrating learned topological graphs, lightweight in-domain adapters, hierarchical planners coupling long-horizon embeddings with LLMs, and the use of depth or semantic modalities for improved environmental grounding (Duan et al., 11 Jun 2025). The success of panoramic action spaces (Kåsene et al., 4 Aug 2025) and explicit process reward shaping (Wang et al., 2 Dec 2025) suggest prioritizing spatial structure and dense feedback in future agent design.