
Chain-of-View Prompting for 3D EQA

Updated 15 January 2026
  • Chain-of-View prompting is a training-free, test-time framework for 3D embodied question answering that iteratively selects and refines views to overcome occlusions and spatial ambiguities.
  • It integrates a two-stage process with coarse view selection using relevance scores and fine-grained adjustments via discrete SE(3) camera actions to enhance active spatial reasoning.
  • Empirical evaluations on benchmarks like OpenEQA and ScanQA demonstrate significant accuracy and scalability improvements over fixed-view VLM approaches.

Chain-of-View (CoV) prompting is a test-time, training-free framework for embodied question answering (EQA) in 3D environments that enables vision–language models (VLMs) to reason actively about spatial relationships by selecting and exploring relevant viewpoints. The approach addresses a limitation of conventional fixed-view VLMs, which ingest a finite set of input frames and therefore cannot capture or dynamically acquire question-relevant context in occluded or spatially complex scenes. By interleaving iterative reasoning with viewpoint adjustment, CoV prompting turns a passive VLM into an active viewpoint reasoner via a coarse-to-fine exploration pipeline, yielding significant empirical improvements on spatial reasoning tasks (Zhao et al., 8 Jan 2026).

1. EQA in 3D: Problem Setting and Motivations

Embodied question answering (EQA) in 3D requires an agent to answer a natural-language query $Q$ about a scene $S$, represented as a point cloud or mesh. The agent is provided a sequence of egocentric RGB(D) frames $\mathcal{V} = \{v_1, \dots, v_T\}$ sampled from a traversal of $S$, along with their camera poses. The objective is to maximize the answer likelihood $A^* = \arg\max_A P(A \mid S, \mathcal{V}, Q)$. The main challenges arise when question-relevant information is spread across non-overlapping or occluded views, or requires multi-step spatial reasoning not achievable from any single fixed view or from passive frame aggregation. Spatial relations such as "behind" and "to the left of," as well as occlusions, require viewpoint changes. Traditional VLM paradigms ingest a fixed set of frames and produce an answer in one shot, often missing key spatial evidence.
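
The problem setup above can be sketched as a minimal data structure; the names (`Frame`, `EQAEpisode`) are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """One egocentric RGB(D) observation v_i with its camera pose."""
    image: bytes   # encoded RGB(D) frame
    pose: tuple    # camera pose, e.g. a flattened 4x4 extrinsic matrix

@dataclass
class EQAEpisode:
    """An EQA instance: a question Q about a scene, with sampled frames V."""
    question: str
    frames: list[Frame] = field(default_factory=list)

# The agent must produce the answer A* maximizing P(A | S, V, Q)
# from the question and the sampled frames alone.
episode = EQAEpisode(question="What is behind the sofa?")
episode.frames.append(Frame(image=b"", pose=(0.0,) * 16))
```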

2. Two-Stage Coarse-to-Fine View Reasoning

Chain-of-View prompting addresses the EQA challenge by implementing a two-stage process:

2.1 Coarse View Selection Agent

The first stage is coarse selection of relevant views. Given the set $\mathcal{V}$ and question $Q$, the agent selects a subset $\mathcal{V}'$ of $K$ views ($K \ll T$) most relevant to $Q$. For each $v_i \in \mathcal{V}$, a prompt instructs the VLM to estimate its relevance score $r_i = f_\text{sel}(v_i, Q)$. Views are ranked, and the top-$K$ anchor views form $\mathcal{V}'$. This stage filters out redundant or irrelevant frames, focusing subsequent exploration.
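
A minimal sketch of this ranking step, where `vlm_relevance` is a hypothetical stand-in for the prompted VLM call $f_\text{sel}(v_i, Q)$:

```python
# Coarse view selection: score each frame's relevance to the question
# with the VLM, then keep the top-K anchor views (K << T).

def select_anchor_views(frames, question, vlm_relevance, k=5):
    """Return the K frames most relevant to the question."""
    scored = [(vlm_relevance(f, question), i, f) for i, f in enumerate(frames)]
    scored.sort(key=lambda t: t[0], reverse=True)  # rank by relevance r_i
    return [f for _, _, f in scored[:k]]

# Toy usage with a dummy scorer that prefers frames mentioning the query word.
frames = ["kitchen", "sofa front", "sofa back", "hallway"]
anchors = select_anchor_views(frames, "sofa",
                              lambda f, q: float(q in f), k=2)
```

Because Python's sort is stable, ties in the relevance score preserve the original frame order, so anchor views keep their traversal ordering.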

2.2 Fine-Grained View Adjustment

The second stage employs fine-grained adjustment using discrete SE(3) camera actions: translations (forward, backward, left, right, up, down), rotations (yaw, pitch, roll), and anchor switches within $\mathcal{V}'$. The exploration is iterative: starting from a selected anchor, the VLM repeatedly chooses and executes an action based on the current context (frames revisited and actions taken so far), then reasons about the updated visual context. This loop continues up to a step budget $B$ or until the VLM signals readiness to answer. This active loop yields new observations that resolve occlusions or spatial ambiguities.

Mathematically, at iteration $t$, the policy is
$$a_t = \arg\max_{a \in \mathcal{A}} R(C_{t-1}, a; Q), \qquad s_t = \text{observe}(a_t),$$
with the context $C_t$ updated accordingly; the loop stops at the step limit or once the VLM judges it has sufficient information to answer.
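
The iterative loop above can be sketched as follows; `vlm_act` and `render_view` are hypothetical stand-ins for the prompted action policy and the observation function, and the action names are illustrative:

```python
# Fine-grained view adjustment: from an anchor view, repeatedly ask the
# VLM to pick a discrete SE(3) action (or stop), observe the new view,
# and grow the context, up to a step budget B.

ACTIONS = ["forward", "backward", "left", "right", "up", "down",
           "yaw+", "yaw-", "pitch+", "pitch-", "roll+", "roll-",
           "switch_anchor", "answer"]

def chain_of_view(anchor, question, vlm_act, render_view, budget=8):
    """Iterate choose-action / observe until 'answer' or budget exhausted."""
    context = [anchor]
    for _ in range(budget):
        action = vlm_act(context, question)  # a_t = argmax_a R(C_{t-1}, a; Q)
        if action == "answer":
            break
        context.append(render_view(action))  # s_t = observe(a_t)
    return context

# Toy run: a scripted "policy" that looks left twice, then answers.
script = iter(["left", "left", "answer"])
ctx = chain_of_view("anchor_view", "What is behind the sofa?",
                    lambda c, q: next(script),
                    lambda a: f"view_after_{a}")
```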

3. Model-Agnostic, Prompt-Based Implementation

CoV prompting is model-agnostic and training-free: all reasoning and control are executed via prompt design on a shared, off-the-shelf VLM backbone, without any fine-tuning or weight updates. Both stages, coarse view selection and fine-grained chaining, use custom prompts specifying tasks such as "select relevant views" and "think-then-act" cycles. Any VLM that can ingest images and text and return relevance scores or action suggestions suffices, broadening applicability across recent vision–language models.
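
As an illustration, the two stages could be driven by templates like the following; the exact wording in the paper may differ, and these strings are assumptions:

```python
# Illustrative prompt templates for the two stages. Any VLM accepting
# images plus text can fill these roles without fine-tuning.

SELECT_PROMPT = (
    "Question: {question}\n"
    "Rate how relevant the attached view is to answering the question "
    "on a scale of 0-10. Reply with a single number."
)

ACT_PROMPT = (
    "Question: {question}\n"
    "You have seen the attached views and taken actions: {history}.\n"
    "Think step by step, then reply with exactly one of: {actions}, "
    "or 'answer' if you have enough information."
)

prompt = ACT_PROMPT.format(question="What is behind the sofa?",
                           history="[forward]",
                           actions="forward, backward, left, right")
```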

4. Empirical Evaluation

4.1 Benchmarks and Metrics

CoV prompting has been evaluated on several benchmarks:

  • OpenEQA: 180+ real-world scenes (ScanNet, HM3D) with open-vocabulary EQA. Metric: LLM-Match (0–100%), computed from an LLM-judged ordinal score $\gamma_i \in \{1, \dots, 5\}$ per answer and averaged as $\text{LLM-Match} = \frac{1}{N} \sum_{i=1}^N \frac{\gamma_i - 1}{4} \times 100\%$.
  • ScanQA: 41k QA pairs, object-grounded 3D QA. Metrics: CIDEr, BLEU-4, METEOR, ROUGE-L, EM@1.
  • SQA3D: 33k situated reasoning questions. Metric: EM@1.
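
The LLM-Match formula above is a direct rescaling of the ordinal judge scores, which a few lines of Python make concrete:

```python
# LLM-Match: each answer gets an ordinal judge score gamma_i in
# {1, ..., 5}; the metric rescales it to [0, 100] and averages.

def llm_match(gammas):
    """Mean of (gamma_i - 1)/4, expressed as a percentage."""
    return sum((g - 1) / 4 for g in gammas) / len(gammas) * 100.0
```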

4.2 Performance and Test-Time Scaling

Empirical results demonstrate consistent gains:

| Model (OpenEQA) | Baseline | CoV (best) | Relative improvement (%) | Scaling gain (%) |
| --- | --- | --- | --- | --- |
| Qwen3-VL-Flash | 52.65 | 59.82 | +13.62 | +1.82 |
| GLM-4.6V | 62.40 | 67.70 | +8.50 | +1.04 |
| GPT-4o-Mini | 45.87 | 51.56 | +12.40 | +3.43 |
| Gemini-2.5-Flash | 52.30 | 59.23 | +11.70 | +3.73 |

Test-time scaling is also exhibited: increasing the minimum action budget yields further gains (on average +2.51% LLM-Match). On ScanQA, CoV achieves 116 CIDEr and 31.9 EM@1, with 51.1 EM@1 on SQA3D, outperforming both 2D zero-shot LMMs and several specialized 3D QA models.

5. Diagnostic Analysis and Ablation

Ablations indicate both the importance and the interaction of the two exploration stages. Omitting coarse view selection (CVS) reduces LLM-Match by ≈4.59% on average, showing the necessity of anchor-view filtering. Qualitative analysis of view trajectories shows that CoV incrementally resolves occlusions and spatial relationships with each reasoning and movement step. Multi-step search traces show how intermediate observations contribute to more precise context accumulation and correct answers.

6. Current Limitations and Future Prospects

Current limitations include susceptibility to dynamic or cluttered scenes, which may cause mis-selection or excessive exploration, introducing inaccuracies. Answer quality is sensitive to the selection of anchor views—suboptimal anchors degrade performance. The fundamentally discrete action space may be limiting for tasks requiring smoother exploration. Future directions include developing learned or adaptive view-selection policies conditioned on question types, expanding from discrete to continuous action spaces, implementing budget-aware stopping criteria that balance efficiency and accuracy, and extending the framework to multi-agent or interactive tasks in dynamic environments (Zhao et al., 8 Jan 2026).

7. Summary of Key Innovations

Chain-of-View prompting establishes a prompt-based, model-agnostic pipeline for 3D EQA that (i) decomposes the exploration into a coarse filtering of anchor views followed by fine-grained, iterative search and (ii) enables test-time active view reasoning in VLMs with no need for additional training or architectures. The test-time scaling observed suggests open-ended potential for further gains as exploration budgets increase and as future generalizations adopt more flexible control or interaction paradigms.
