
Prompt-Based Virtual View Selection

Updated 29 January 2026
  • Prompt-based virtual view selection is a method that uses language-driven prompts to determine optimal virtual camera poses in 3D environments.
  • It leverages structured prompt designs and chain-of-thought reasoning to reduce redundancy and resolve occlusions for task-specific performance.
  • Demonstrated improvements include faster training convergence, enhanced spatial reasoning, and higher QA accuracy in both simulated and real-world settings.

Prompt-based virtual view selection refers to the class of methods that utilize language-driven prompts—typically processed by large vision–language or multimodal foundation models—to guide the online selection or synthesis of camera viewpoints within real or simulated 3D environments. This paradigm enables agents or policy modules to acquire task-relevant visual evidence efficiently by “imagining” or moving a virtual camera to optimal, context-aligned positions, rather than aggregating redundant or irrelevant information from multi-view static rigs or random explorations. Approaches range from one-shot inference of a single informative viewpoint (as in VERM) to multi-step, chain-of-thought active view exploration (as in CoV and VG-AVS), and are deployed in applications spanning robotic 3D manipulation, embodied question answering, and spatial reasoning (Chen et al., 18 Dec 2025, Koo et al., 15 Dec 2025, Zhao et al., 8 Jan 2026).

1. Principles and Motivations

Prompt-based virtual view selection addresses several core challenges in 3D vision and robotics:

  • Redundancy and Information Bottleneck: Fixed multi-camera arrays or random movement generate large volumes of overlapping or uninformative visual data, leading to significant computational overhead and slower policy training. Virtual view selection permits targeted acquisition of views likely to maximize downstream task performance (Chen et al., 18 Dec 2025).
  • Task-aware Perception: Task or question context, encoded as a natural language prompt, enables the selection of views specifically relevant to the current goal (e.g., “insert peg into hole” or “count mugs on table”). This alignment between language and viewpoint selection is absent in classical view-planning (Chen et al., 18 Dec 2025, Koo et al., 15 Dec 2025).
  • Mitigation of Occlusion and Ambiguity: Imagined or actively selected virtual views can resolve occlusions and disambiguate spatial configurations not observable from any fixed real-world camera (Chen et al., 18 Dec 2025, Zhao et al., 8 Jan 2026).
  • Foundation Model Leverage: Modern large multimodal models (e.g., GPT-4o, Qwen3-VL, GLM-4.6V) can “reason” over scene, image, and language prompts in order to infer geometric or semantically informative camera actions. These models require appropriate prompt design to elicit correct view-selection behavior (Chen et al., 18 Dec 2025, Koo et al., 15 Dec 2025, Zhao et al., 8 Jan 2026).

2. Prompting Strategies and System Design

Prompt design is a critical component. VERM defines a structured, multimodal prompt comprising: (1) explicit definitions of the environment and fixed camera configuration; (2) the natural-language task description $L$; (3) in-context demonstration pairs linking camera directions to angular outputs; (4) hard constraints and preferences (e.g., an elevation cutoff for tabletop scenes); and (5) rich visual context, including both environmental schematics and live RGB images. The output is a pair of spherical angles $(\mathrm{elev}^*, \mathrm{azim}^*)$ specifying a virtual camera pose $v^*$ (Chen et al., 18 Dec 2025).
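A minimal sketch of how such a structured prompt might be assembled is shown below; the helper name `build_view_selection_prompt`, the section wording, and the exact output format are illustrative assumptions, not the prompt used in VERM.

```python
# Hypothetical sketch of a VERM-style structured prompt builder; field names and
# wording are illustrative. Images and schematics would be attached separately
# as visual inputs to the vision-language model.

def build_view_selection_prompt(task_description: str,
                                camera_setup: str,
                                demos: list[tuple[str, tuple[float, float]]],
                                constraints: list[str]) -> str:
    """Assemble the text portion of the multimodal view-selection prompt."""
    demo_lines = [
        f"- Task: {d_task} -> (elev={elev:.0f}, azim={azim:.0f})"
        for d_task, (elev, azim) in demos
    ]
    sections = [
        "## Environment and fixed cameras\n" + camera_setup,
        "## Task\n" + task_description,
        "## Examples (camera direction -> spherical angles)\n" + "\n".join(demo_lines),
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        "## Output format\nReturn exactly one line: elev=<degrees>, azim=<degrees>",
    ]
    return "\n\n".join(sections)


prompt = build_view_selection_prompt(
    task_description="Insert the peg into the hole on the left fixture.",
    camera_setup="Front, left-shoulder, right-shoulder, and wrist RGB-D cameras.",
    demos=[("Look at the tabletop from the front-left", (30.0, 135.0))],
    constraints=["Elevation must stay above 15 degrees for tabletop scenes."],
)
```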

Chain-of-View (CoV) and VG-AVS extend prompting to interactive sequences. CoV prompts a vision–language model to select from a potentially large pool of candidate views by ranking their relevance to the natural-language query; the selected "anchor" views then ground the subsequent fine-grained camera search, interleaving "think" (language reasoning) and "act" (camera movement) steps (Zhao et al., 8 Jan 2026). In VG-AVS, prompts instruct the model to output view parameters for "where to look next" to maximize question-answering reward, enforcing structured serialization (e.g., <H>, <D>, <V> tags for heading, distance, and vertical angle) and chain-of-thought blocks to improve policy stability (Koo et al., 15 Dec 2025).
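The fragment below sketches how a tag-serialized view action of this kind could be parsed; the `<think>` block handling, the numeric format, and the units are assumptions beyond the tag names mentioned above.

```python
import re

# Sketch of parsing a VG-AVS-style serialized view action. The <H>/<D>/<V> tag
# names come from the description above; the optional <think> block and the
# exact numeric grammar are assumptions for illustration.
ACTION_PATTERN = re.compile(
    r"<H>(?P<heading>-?\d+(?:\.\d+)?)</H>\s*"
    r"<D>(?P<distance>-?\d+(?:\.\d+)?)</D>\s*"
    r"<V>(?P<vertical>-?\d+(?:\.\d+)?)</V>"
)

def parse_view_action(model_output: str) -> dict | None:
    """Strip an optional chain-of-thought block, then read heading (deg),
    distance (m), and vertical angle (deg) from the tagged action string."""
    answer = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL)
    match = ACTION_PATTERN.search(answer)
    if match is None:
        return None  # malformed output; caller may re-prompt or fall back
    return {k: float(v) for k, v in match.groupdict().items()}

print(parse_view_action("<think>the mug is likely behind the counter</think>"
                        "<H>45</H> <D>1.5</D> <V>-10</V>"))
```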

Prompt-based view selection often leverages visual-language alignment modules (e.g., CLIP) to compute per-point or per-pixel task-relevance, which modulates the prompt result or training objective (Chen et al., 18 Dec 2025).
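As a rough illustration, the per-point task-relevance weight (the $w(p;L)$ of Section 3) could be computed as a cosine similarity between CLIP-aligned point features and the prompt embedding; how the 2D features are lifted to 3D points is left abstract here, and the rescaling to $[0,1]$ is an assumption.

```python
import numpy as np

# Sketch of computing per-point task-relevance weights w(p; L). We assume the
# point cloud already carries CLIP-aligned features (e.g., lifted from 2D image
# features by projection); that lifting step is not shown.

def point_saliency(point_features: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Cosine similarity between each point's feature and the prompt embedding,
    rescaled to [0, 1] so it can serve as a visibility weight."""
    pf = point_features / (np.linalg.norm(point_features, axis=1, keepdims=True) + 1e-8)
    te = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    sim = pf @ te                      # (N,) cosine similarities in [-1, 1]
    return (sim + 1.0) / 2.0           # (N,) weights w(p; L) in [0, 1]
```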

3. Formal Objectives and Policy Construction

A formalized objective for prompt-based virtual view selection can be stated as follows (see VERM):

$$v^* \;=\; \arg\max_{v \in V}\; \sum_{p \in P} w(p;L)\,\mathbf{1}_{\mathrm{vis}(p;v)} \;-\; \lambda\, R_{\mathrm{red}}(v)$$

where $P$ is the point cloud, $w(p;L)$ is the per-point saliency derived from the language prompt $L$, $\mathbf{1}_{\mathrm{vis}(p;v)}$ indicates whether point $p$ is visible from view $v$, and $R_{\mathrm{red}}(v)$ penalizes overlap with static real-camera coverage. In practice, this instance-weighted visibility maximization, subject to redundancy constraints and possibly hard physical rules, is only approximated: large foundation models are queried to produce $v^*$ directly via prompt interpretation (Chen et al., 18 Dec 2025).
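The sketch below scores a small discrete set of candidate views under this objective; the visibility test and redundancy term are passed in as opaque callables, and the whole routine is an explicit-search stand-in for the prompt-based approximation described above.

```python
import numpy as np

# Explicit-search sketch of the instance-weighted visibility objective, evaluated
# over a small discrete candidate set. VERM instead asks a foundation model to
# output v* directly from the prompt.

def score_view(weights, visible_mask, redundancy, lam=0.1):
    """weights: (N,) per-point saliency w(p;L); visible_mask: (N,) bool array of
    points visible from this view; redundancy: scalar overlap with static cameras."""
    return float(np.sum(weights * visible_mask)) - lam * redundancy

def select_view(candidate_views, weights, visibility_fn, redundancy_fn, lam=0.1):
    """Return the candidate view (e.g., an (elev, azim) pair) with the highest score."""
    scores = [
        score_view(weights, visibility_fn(v), redundancy_fn(v), lam)
        for v in candidate_views
    ]
    return candidate_views[int(np.argmax(scores))]
```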

In VG-AVS, the selection policy is denoted by $\pi_\theta(v \mid o_c, q)$, where $o_c$ is the current RGB observation and $q$ is the language query. Optimization is conducted by a combination of supervised fine-tuning (teacher-forcing loss) and reinforcement learning (policy gradient with QA accuracy as reward) (Koo et al., 15 Dec 2025).
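A toy sketch of the reinforcement-learning stage follows; it assumes the policy exposes the summed log-probability of its sampled view-action tokens and uses a plain REINFORCE update with a constant baseline, which may differ from the exact algorithm used in VG-AVS.

```python
import torch

# Toy sketch of the RL stage: a REINFORCE-style update where the reward is binary
# QA correctness as judged by a frozen verifier. Real training details (optimizer,
# baseline, batching) are assumptions.

def rl_step(policy, optimizer, observation, question, answer_is_correct: bool,
            baseline: float = 0.5):
    """policy(observation, question) is assumed to return the summed log-probability
    of the sampled view-action tokens (a scalar tensor with gradients)."""
    log_prob = policy(observation, question)
    reward = 1.0 if answer_is_correct else 0.0
    loss = -(reward - baseline) * log_prob   # policy gradient with a constant baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```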

CoV implements a two-stage policy: (1) anchor view selection by question relevance, using LLM prompt output; (2) iterative selection from a set of discrete camera actions expressed in SE(3) (translation/rotation/view-switch), mediated by closed-loop language-driven reasoning (Zhao et al., 8 Jan 2026).
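The loop below sketches this closed-loop think–act procedure; `vlm_generate`, `render_view`, and the discrete action vocabulary are hypothetical stand-ins for the model call, the renderer, and CoV's actual action set.

```python
# Sketch of a CoV-style closed perception-action loop with hypothetical stand-ins
# for the VLM call and the renderer; the action names are illustrative.

ACTIONS = {"move_left", "move_right", "move_forward", "rotate_left",
           "rotate_right", "switch_anchor"}

def chain_of_view(question, anchor_views, vlm_generate, render_view, max_steps=6):
    history = list(anchor_views)                  # start from prompt-selected anchor views
    for _ in range(max_steps):
        reply = vlm_generate(question, history)   # interleaved "think" + "act" output
        if reply.startswith("Answer:"):
            return reply[len("Answer:"):].strip()
        action = reply.strip()
        if action in ACTIONS:
            history.append(render_view(action))   # act: render the newly requested view
    # budget exhausted: force a final answer from the accumulated evidence
    return vlm_generate(question, history, force_answer=True)
```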

4. Model Architectures and Coarse-to-Fine Procedures

Prompt-based systems integrate multimodal transformers with specialized modules for image, depth, and language:

  • Depth-aware Attention (VERM): After rendering from $v^*$, a transformer module attends jointly to image-patch, language-token, and learnable depth-token embeddings. The architecture uses an 8-layer transformer with output heads for translation heatmap, depth, rotation, gripper, and collision states. Depth tokens (e.g., 36 trainable vectors) are critical for accurate 3D localization. All heads except translation are trained with cross-entropy; the translation head uses a Gaussian-heatmap MSE (Chen et al., 18 Dec 2025). A minimal sketch of this token layout follows the list.
  • Chain-of-View Reasoning (CoV): The system employs a shared VLM backbone that interleaves image patch and prompt tokens in the input stream. The complete context—language query, anchor frame indices, and observation history—enables the agent to issue discrete actions (translations, rotations, anchor switches) and to terminate with an “Answer: ...” string. Each new action triggers a render of a new view, closing the perception–action loop at test time (Zhao et al., 8 Jan 2026).
  • Refinement Predictors and Coarse-to-Fine: In VERM, a “refinement predictor” is attached to the policy: coarse actions are planned from the global virtual view unless the predictor signals the need for precision, in which case a localized crop of the point cloud is rendered and re-encoded for fine action output (Chen et al., 18 Dec 2025). Similarly, CoV increases reasoning steps adaptively, with evidence accumulating over multiple iterative interactions (Zhao et al., 8 Jan 2026).
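The following sketch illustrates the depth-token layout described in the first bullet above; the 256-dimensional embedding, mean pooling, and head sizes are illustrative assumptions, not the VERM implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a depth-aware attention trunk: image-patch, language, and 36
# learnable depth tokens share one 8-layer transformer. Embedding width, pooling,
# and output head sizes are assumptions for illustration.

class DepthAwarePolicy(nn.Module):
    def __init__(self, dim=256, n_depth_tokens=36, heatmap_hw=64):
        super().__init__()
        self.depth_tokens = nn.Parameter(torch.randn(n_depth_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=8)
        self.translation_head = nn.Linear(dim, heatmap_hw * heatmap_hw)  # Gaussian-heatmap MSE
        self.depth_head = nn.Linear(dim, 100)        # discretized depth bins, cross-entropy
        self.rotation_head = nn.Linear(dim, 72 * 3)  # discretized Euler bins, cross-entropy
        self.gripper_head = nn.Linear(dim, 2)        # open / close, cross-entropy
        self.collision_head = nn.Linear(dim, 2)      # collide / avoid, cross-entropy

    def forward(self, patch_tokens, lang_tokens):
        b = patch_tokens.size(0)
        depth = self.depth_tokens.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([patch_tokens, lang_tokens, depth], dim=1)
        fused = self.trunk(tokens)
        pooled = fused.mean(dim=1)                   # simple pooling for the global heads
        return {
            "translation": self.translation_head(pooled),
            "depth": self.depth_head(pooled),
            "rotation": self.rotation_head(pooled),
            "gripper": self.gripper_head(pooled),
            "collision": self.collision_head(pooled),
        }
```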

5. Training Algorithms, Datasets, and Performance

Prompt-based virtual view selection models have been evaluated across simulated and real settings:

  • VERM: Trained on RLBench (17 tasks, 100 demos each) and real-world tasks using point-cloud renders and multimodal prompts. The policy network is approximately 2× smaller and achieves 1.89× faster training convergence and 1.54× faster inference relative to RVT-2. Success rates reach 83.6% on RLBench and 78.8% in real-world rollouts, showing an advantage over predecessor architectures (Chen et al., 18 Dec 2025).
  • VG-AVS: Utilizes the ProcTHOR and HM3D simulation environments with data triples tying queries, language prompts, and images of view transitions. The combination of supervised fine-tuning and reinforcement learning (with a frozen VLM verifier providing reward) pushes QA accuracy from 50.2% (zero-shot Qwen2.5-VL) up to 83.7% (SFT+RL) on AVS-ProcTHOR. Prompt structure (chain-of-thought, tag-based serialization) is crucial for stability (Koo et al., 15 Dec 2025).
  • CoV: Operates in a training-free regime, relying entirely on prompt conditioning. On OpenEQA, CoV improves LLM-Match by up to +13.62% (Qwen3-VL-Flash). Increasing the allowable number of action steps leads to further improvement (test-time scaling), with the average gain rising by +2.51% from 1 to 6 steps. State-of-the-art results are reported on ScanQA (31.9% EM@1) and SQA3D (51.1% EM@1) (Zhao et al., 8 Jan 2026).

| Method | Training Regime | Main Evaluation Environments | Peak Reported Accuracy / Success | Notable Advantages |
| --- | --- | --- | --- | --- |
| VERM (Chen et al., 18 Dec 2025) | SFT (no RL) | RLBench, real-world | 83.6% success rate (RLBench) | ~2× smaller/faster policy, depth tokens |
| VG-AVS (Koo et al., 15 Dec 2025) | SFT + RL (QA reward) | ProcTHOR, HM3D | 83.7% QA accuracy (AVS-ProcTHOR) | Chain-of-thought output, supervised data |
| CoV (Zhao et al., 8 Jan 2026) | Training-free (prompting) | OpenEQA, ScanQA, SQA3D | 31.9% EM@1 (ScanQA) | Test-time scaling, multi-step reasoning |

6. Extensions, Limitations, and Open Problems

Current prompt-based virtual view selection systems offer several distinctive capabilities: test-time flexibility (especially in CoV), reduction of redundant visual inputs, explicit reference to language goals, and mitigation of scene occlusion via actively imagined virtual viewpoints. However, challenges persist:

  • Scene Memory and Multi-Turn Reasoning: VG-AVS and baseline architectures in this family discard past views, limiting their ability to exploit spatial memory. This suggests a need for explicit mapping or SLAM integration to handle occlusions, collisions, and longer-horizon policies (Koo et al., 15 Dec 2025).
  • Prompt Design and Output Serialization: Chain-of-thought blocks and custom output tags improve performance but introduce engineering complexity and demand careful ordering (e.g., staging RL after SFT in VG-AVS to prevent degenerate patterns) (Koo et al., 15 Dec 2025).
  • Test-time Scaling: CoV demonstrates that increasing view-selection steps at inference robustly yields higher spatial reasoning accuracy. No explicit retraining is required, highlighting the power and flexibility of prompt-based control (Zhao et al., 8 Jan 2026).
  • Handling Dynamic or Cluttered Scenes: Transfer from synthetic to real domains (e.g., from ProcTHOR to HM3D) sees diminished returns, and occlusions or scene changes may degrade performance. A plausible implication is that learned or adaptive domain-robustness modules, as well as lightweight heuristic augmentations, may be required for future work (Zhao et al., 8 Jan 2026, Koo et al., 15 Dec 2025).
  • Closed-loop Policy/Verifier Training: End-to-end policies that jointly optimize both the action selector and language-based verifier remain an open direction, as current systems often freeze one component during RL for stability (Koo et al., 15 Dec 2025).

Taken collectively, prompt-based virtual view selection occupies a central role in modern embodied AI, merging large-scale multimodal model capabilities with dynamic, task-aware 3D perception and action. Ongoing research into prompt engineering, dataset curation, policy-verifier co-training, and multi-step memory will define the future evolution of this paradigm.
