Chain-of-View Prompting
- Chain-of-View Prompting (CoV) is a test-time reasoning framework that transforms static vision-language models into active explorers of 3D environments.
- It employs a two-stage, coarse-to-fine pipeline that interleaves LLM reasoning with discrete camera actions to dynamically gather question-relevant context.
- Empirical results indicate an average 11.56% improvement in spatial reasoning performance across four VLM backbones, underscoring the benefit of targeted view selection.
Chain-of-View (CoV) Prompting is a training-free, test-time reasoning framework for 3D Embodied Question Answering (EQA) that enables vision-language models (VLMs) to perform active viewpoint selection and spatial reasoning in 3D environments. The CoV framework addresses the limitations of conventional VLMs, which typically process a fixed and finite set of views, by transforming them into active viewpoint reasoners through a two-stage, coarse-to-fine exploration process that interleaves LLM reasoning with discrete camera actions. This approach systematically gathers question-relevant context by navigating the continuous space of possible observations, significantly improving spatial reasoning performance on standard benchmarks (Zhao et al., 8 Jan 2026).
1. Problem Motivation and Overview
Embodied QA tasks require agents to answer natural language questions about 3D environments by collecting visual context from viewpoints in a continuous space Ω (e.g., through RGB-D captures or rendered meshes). Standard VLMs are constrained to static, predefined image sets, resulting in several challenges:
- Context Distribution: Question-relevant information is scattered and often only accessible from a few key viewpoints.
- Partial Occlusion: Important scene elements may be hidden from any single view.
- Fixed Input Bottleneck: Current VLMs’ restriction to fixed input sets prevents dynamic, multi-step exploration and active spatial reasoning.
Chain-of-View Prompting addresses these deficiencies by introducing a dynamic pipeline capable of both filtering to retain informative anchors and actively acquiring new, targeted observations.
2. CoV Pipeline: Coarse-to-Fine Exploration
CoV operates as a training-free, test-time system for augmenting any off-the-shelf VLM. The process comprises two stages:
2.1 Coarse-Grained View Selection
Given a set of sampled frames V = {v₁, …, v_T} and a query Q, the agent selects a small candidate pool V′ ⊂ V of size K by filtering for relevance to the query via LLM-based scoring, with s(v; Q) as the prompt-based relevance score for each view. This reduces both redundancy and irrelevant context, focusing downstream exploration.
Algorithmic summary:
Algorithm 1: Coarse-Grained View Selection
Input: V = {v₁,...,v_T}, Q, K
Output: V′ = {v_{i₁},...,v_{i_K}}
1. for each v in V:
2. prompt LLM with (Q, v) to obtain s(v;Q)
3. sort V by descending s(v;Q)
4. V′ ← top K frames
5. return V′
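A minimal Python sketch of Algorithm 1, assuming a caller-supplied `score` callable that wraps the LLM relevance prompt (the function names and signatures here are illustrative, not the paper's API):

```python
from typing import Callable, List, TypeVar

View = TypeVar("View")  # a sampled frame: RGB image, RGB-D capture, etc.

def coarse_view_selection(
    views: List[View],
    question: str,
    k: int,
    score: Callable[[str, View], float],  # prompt-based relevance s(v; Q)
) -> List[View]:
    """Rank all sampled frames by LLM relevance score and keep the top K."""
    ranked = sorted(views, key=lambda v: score(question, v), reverse=True)
    return ranked[:k]
```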
2.2 Fine-Grained View Adjustment
CoV commences with one anchor from V′, then iteratively interleaves LLM-driven reasoning (“What should I do next?”) with discrete camera actions (e.g., move, rotate, or switch to another anchor), collecting new views until adequate context is established or the action budget is depleted.
The action space encompasses:
- Translational actions: forward/backward/left/right/up/down
- Rotational actions: yaw, pitch, roll in either direction
- View-switch: jump to any anchor
At each timestep t:
- The LLM produces the next action aₜ using the current context Cₜ.
- aₜ is executed as an SE(3) transform on the camera’s pose pₜ, yielding a new view vₜ₊₁.
- The context is updated: Cₜ₊₁ = Cₜ ∪ {vₜ₊₁}.
- The loop terminates once the LLM reports “enough information” or t ≥ N_max (where N_max is the step budget).
Explicitly: aₜ = LLM(Q, Cₜ), vₜ₊₁ = Render(T(aₜ) · pₜ), Cₜ₊₁ = Cₜ ∪ {vₜ₊₁}, where T(aₜ) is the rigid transform associated with action aₜ.
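The following sketch makes the loop concrete. `propose_action`, `apply_action`, `render`, and the `.pose` attribute are interface assumptions rather than the paper's code, and the anchor switch is simplified to round-robin instead of a free jump to any anchor:

```python
# Discrete action space from Section 2.2; "answer" is the stop signal.
TRANSLATIONS = ["forward", "backward", "left", "right", "up", "down"]
ROTATIONS = ["yaw+", "yaw-", "pitch+", "pitch-", "roll+", "roll-"]
ACTIONS = TRANSLATIONS + ROTATIONS + ["switch_anchor", "answer"]

def fine_grained_adjustment(question, anchors, propose_action,
                            apply_action, render, n_max=8):
    """Interleave LLM reasoning with camera actions until the LLM judges
    the context sufficient or the step budget n_max is spent."""
    anchor_idx = 0
    pose = anchors[anchor_idx].pose        # p_0: pose of the starting anchor
    context = [anchors[anchor_idx]]        # C_0 = {anchor}
    for t in range(n_max):
        action = propose_action(question, context, ACTIONS)  # a_t = LLM(Q, C_t)
        if action == "answer":             # LLM reports enough information
            break
        if action == "switch_anchor":      # simplified: cycle to next anchor
            anchor_idx = (anchor_idx + 1) % len(anchors)
            view = anchors[anchor_idx]
            pose = view.pose
        else:
            pose = apply_action(pose, action)  # p_{t+1} = T(a_t) · p_t
            view = render(pose)                # v_{t+1}
        context.append(view)               # C_{t+1} = C_t ∪ {v_{t+1}}
    return context
```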
3. View Selection Agent and Relevance Scoring
The relevance scoring function s(v; Q) quantifies alignment between a candidate frame v and the natural-language query Q. For each v ∈ V, the LLM is prompted with (Q, v) to estimate s(v; Q). Selecting the top-K views forms the reduced pool V′. Empirical results indicate that omitting this stage and feeding all frames to the agent results in a performance drop of 4.59% on OpenEQA, underscoring the necessity of question-targeted filtering (Zhao et al., 8 Jan 2026).
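One way such a score could be elicited in practice, as a hedged illustration: the prompt wording, the 0–10 scale, and the `vlm_chat` wrapper below are assumptions, not the published prompt.

```python
import re

# Hypothetical scoring prompt; the 0-10 scale is an assumption.
SCORING_PROMPT = (
    "Question: {q}\n"
    "On a scale of 0 to 10, how useful is this image for answering the "
    "question? Reply with a single number."
)

def relevance_score(question, view, vlm_chat):
    """Estimate s(v; Q) by prompting the VLM and parsing a numeric reply.
    `vlm_chat` stands in for any image+text chat call."""
    reply = vlm_chat(image=view, text=SCORING_PROMPT.format(q=question))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0  # unparseable -> irrelevant
```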
4. Experimental Setup and Empirical Results
CoV was benchmarked on OpenEQA, ScanQA, and SQA3D, using metrics tailored to each dataset’s modalities:
- OpenEQA: LLM-Match (%), using multi-view RGB-D and point clouds (from ScanNet and HM3D).
- ScanQA: CIDEr, BLEU-4, METEOR, ROUGE-L, EM@1.
- SQA3D: EM@1.
Key results:
| Method | OpenEQA LLM-Match Δ | ScanQA CIDEr | ScanQA EM@1 | SQA3D EM@1 |
|---|---|---|---|---|
| Qwen3-VL | +13.62% | | | |
| GLM-4.6V | +8.50% | | | |
| GPT-4o-Mini | +12.40% | | | |
| Gemini-2.5 | +11.70% | | | |
| CoV (avg) | +11.56% | 116 | 31.9 | 51.1 |
- CoV achieves state-of-the-art (SOTA) zero-shot results on ScanQA and SQA3D.
- On OpenEQA, a single CoV step confers an average +4.87% LLM-Match improvement; adapting the step count to the per-instance optimum yields an average +11.56% gain.
- Increasing the minimum step budget N_min further raises average accuracy by +2.51% (up to +3.73% on Gemini-2.5).
- Performance drops by 4.59% if the view selection stage is removed, confirming the crucial role of question-aligned anchor filtering.
5. Test-Time Scaling and Comparative Analysis
Analysis of the action–reasoning step distribution reveals that most questions require 1–3 steps, yet further steps improve accuracy. Enforcing a higher minimum step budget N_min at test time acts as a mechanism for test-time scaling: as N_min increases, so does performance. This behavior highlights the utility of allowing greater exploration before answer extraction.
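In loop terms, test-time scaling amounts to gating the stop action on a minimum step count. A minimal sketch, assuming the same loop structure as above (the n_min/n_max parameter names are illustrative):

```python
def should_stop(action: str, t: int, n_min: int, n_max: int) -> bool:
    """Honor the LLM's "answer" action only after n_min steps have been
    taken; always stop once the n_max budget is exhausted."""
    if t >= n_max:
        return True
    return action == "answer" and t >= n_min
```

Raising n_min forces extra exploration even when the LLM would stop early, which is the scaling knob this section describes.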
CoV operates as a completely model-agnostic framework, compatible with both open-source (Qwen3-VL, GLM, InternVL) and commercial VLMs (GPT-4.1, Gemini, etc.), requiring neither retraining nor fine-tuning. Compared to both fixed-view video-VLMs and specialized 3D-LMMs, CoV achieves superior zero-shot performance by leveraging multi-step, query-guided interaction with the scene.
6. Limitations and Future Directions
The framework exhibits certain limitations:
- In highly dynamic or cluttered environments, the coarse-to-fine pipeline may misidentify anchor views or stray from the question’s informational target.
- Prolonged action trajectories increase the risk of LLM hallucination or scene drift.
- Future improvements may involve advanced anchor-view selection algorithms, adaptive (possibly learned) budget strategies, or the incorporation of uncertainty-aware stopping criteria to mitigate over-exploration and reduce error accumulation.
This suggests further research into more robust and adaptive exploration policies will be valuable for continued progress in embodied spatial reasoning.
7. Significance and Implications
Chain-of-View Prompting provides a general, training-free methodology for augmenting existing VLMs with active, question-guided spatial reasoning capabilities. Empirical evidence demonstrates that strategic multi-step exploration—coordinated by LLM-driven prompting—enables effective acquisition of otherwise occluded or distributed context, dramatically narrowing the embodied reasoning gap in 3D EQA. The model-agnostic and training-free properties of CoV increase its practical applicability across a range of VLM platforms and tasks (Zhao et al., 8 Jan 2026).