
Prompt-Based Virtual View Selection

Updated 29 January 2026
  • Prompt-based virtual view selection is a method that uses language-driven prompts to determine optimal virtual camera poses in 3D environments.
  • It leverages structured prompt designs and chain-of-thought reasoning to reduce redundancy and resolve occlusions for task-specific performance.
  • Demonstrated improvements include faster training convergence, enhanced spatial reasoning, and higher QA accuracy in both simulated and real-world settings.

Prompt-based virtual view selection refers to the class of methods that utilize language-driven prompts—typically processed by large vision–language or multimodal foundation models—to guide the online selection or synthesis of camera viewpoints within real or simulated 3D environments. This paradigm enables agents or policy modules to acquire task-relevant visual evidence efficiently by “imagining” or moving a virtual camera to optimal, context-aligned positions, rather than aggregating redundant or irrelevant information from multi-view static rigs or random explorations. Approaches range from one-shot inference of a single informative viewpoint (as in VERM) to multi-step, chain-of-thought active view exploration (as in CoV and VG-AVS), and are deployed in applications spanning robotic 3D manipulation, embodied question answering, and spatial reasoning (Chen et al., 18 Dec 2025, Koo et al., 15 Dec 2025, Zhao et al., 8 Jan 2026).

1. Principles and Motivations

Prompt-based virtual view selection addresses several core challenges in 3D vision and robotics:

  • Redundancy and Information Bottleneck: Fixed multi-camera arrays or random movement generate large volumes of overlapping or uninformative visual data, leading to significant computational overhead and slower policy training. Virtual view selection permits targeted acquisition of views likely to maximize downstream task performance (Chen et al., 18 Dec 2025).
  • Task-aware Perception: Task or question context, encoded as a natural language prompt, enables the selection of views specifically relevant to the current goal (e.g., “insert peg into hole” or “count mugs on table”). This alignment between language and viewpoint selection is absent in classical view-planning (Chen et al., 18 Dec 2025, Koo et al., 15 Dec 2025).
  • Mitigation of Occlusion and Ambiguity: Imagined or actively selected virtual views can resolve occlusions and disambiguate spatial configurations not observable from any fixed real-world camera (Chen et al., 18 Dec 2025, Zhao et al., 8 Jan 2026).
  • Foundation Model Leverage: Modern large multimodal models (e.g., GPT-4o, Qwen3-VL, GLM-4.6V) can “reason” over scene, image, and language prompts in order to infer geometric or semantically informative camera actions. These models require appropriate prompt design to elicit correct view-selection behavior (Chen et al., 18 Dec 2025, Koo et al., 15 Dec 2025, Zhao et al., 8 Jan 2026).

2. Prompting Strategies and System Design

Prompt design is a critical component. VERM defines a structured, multimodal prompt comprising: (1) explicit definitions of the environment and fixed camera configuration; (2) the natural-language task description $L$; (3) in-context demonstration pairs linking camera directions to angular outputs; (4) hard constraints and preferences (e.g., an elevation cutoff for tabletop scenes); and (5) rich visual context, including both environmental schematics and live RGB images. The output is a pair of spherical angles $(\mathrm{elev}^*, \mathrm{azim}^*)$ specifying a virtual camera pose $v^*$ (Chen et al., 18 Dec 2025).
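A minimal sketch of how such a structured prompt might be assembled is shown below; the helper name `build_view_selection_prompt`, the section wording, and the exact output format are illustrative assumptions, not the prompt used in VERM.

```python
# Hypothetical sketch of a VERM-style structured prompt builder; field names and
# wording are illustrative. Images and schematics would be attached separately
# as visual inputs to the vision-language model.

def build_view_selection_prompt(task_description: str,
                                camera_setup: str,
                                demos: list[tuple[str, tuple[float, float]]],
                                constraints: list[str]) -> str:
    """Assemble the text portion of the multimodal view-selection prompt."""
    demo_lines = [
        f"- Task: {d_task} -> (elev={elev:.0f}, azim={azim:.0f})"
        for d_task, (elev, azim) in demos
    ]
    sections = [
        "## Environment and fixed cameras\n" + camera_setup,
        "## Task\n" + task_description,
        "## Examples (camera direction -> spherical angles)\n" + "\n".join(demo_lines),
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        "## Output format\nReturn exactly one line: elev=<degrees>, azim=<degrees>",
    ]
    return "\n\n".join(sections)


prompt = build_view_selection_prompt(
    task_description="Insert the peg into the hole on the left fixture.",
    camera_setup="Front, left-shoulder, right-shoulder, and wrist RGB-D cameras.",
    demos=[("Look at the tabletop from the front-left", (30.0, 135.0))],
    constraints=["Elevation must stay above 15 degrees for tabletop scenes."],
)
```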

Chain-of-View (CoV) and VG-AVS extend prompting to interactive sequences. CoV prompts a vision–language model to select from a potentially large pool of candidate views by ranking their relevance to the natural-language query; the selected "anchor" views then ground the subsequent fine-grained camera search, interleaving "think" (language reasoning) and "act" (camera movement) steps (Zhao et al., 8 Jan 2026). In VG-AVS, prompts instruct the model to output view parameters for "where to look next" to maximize question-answering reward, enforcing structured serialization (e.g., <H>, <D>, <V> tags for heading, distance, and vertical angle) and chain-of-thought blocks to improve policy stability (Koo et al., 15 Dec 2025).
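The fragment below sketches how a tag-serialized view action of this kind could be parsed; the `<think>` block handling, the numeric format, and the units are assumptions beyond the tag names mentioned above.

```python
import re

# Sketch of parsing a VG-AVS-style serialized view action. The <H>/<D>/<V> tag
# names come from the description above; the optional <think> block and the
# exact numeric grammar are assumptions for illustration.
ACTION_PATTERN = re.compile(
    r"<H>(?P<heading>-?\d+(?:\.\d+)?)</H>\s*"
    r"<D>(?P<distance>-?\d+(?:\.\d+)?)</D>\s*"
    r"<V>(?P<vertical>-?\d+(?:\.\d+)?)</V>"
)

def parse_view_action(model_output: str) -> dict | None:
    """Strip an optional chain-of-thought block, then read heading (deg),
    distance (m), and vertical angle (deg) from the tagged action string."""
    answer = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL)
    match = ACTION_PATTERN.search(answer)
    if match is None:
        return None  # malformed output; caller may re-prompt or fall back
    return {k: float(v) for k, v in match.groupdict().items()}

print(parse_view_action("<think>the mug is likely behind the counter</think>"
                        "<H>45</H> <D>1.5</D> <V>-10</V>"))
```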

Prompt-based view selection often leverages visual-language alignment modules (e.g., CLIP) to compute per-point or per-pixel task-relevance, which modulates the prompt result or training objective (Chen et al., 18 Dec 2025).
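As a rough illustration, the per-point task-relevance weight (the $w(p;L)$ of Section 3) could be computed as a cosine similarity between CLIP-aligned point features and the prompt embedding; how the 2D features are lifted to 3D points is left abstract here, and the rescaling to $[0,1]$ is an assumption.

```python
import numpy as np

# Sketch of computing per-point task-relevance weights w(p; L). We assume the
# point cloud already carries CLIP-aligned features (e.g., lifted from 2D image
# features by projection); that lifting step is not shown.

def point_saliency(point_features: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Cosine similarity between each point's feature and the prompt embedding,
    rescaled to [0, 1] so it can serve as a visibility weight."""
    pf = point_features / (np.linalg.norm(point_features, axis=1, keepdims=True) + 1e-8)
    te = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    sim = pf @ te                      # (N,) cosine similarities in [-1, 1]
    return (sim + 1.0) / 2.0           # (N,) weights w(p; L) in [0, 1]
```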

3. Formal Objectives and Policy Construction

A formalized objective for prompt-based virtual view selection can be stated as follows (see VERM):

$$v^* \;=\; \arg\max_{v \in V}\; \sum_{p \in P} w(p;L)\,\mathbf{1}_{\mathrm{vis}(p;v)} \;-\; \lambda\, R_{\mathrm{red}}(v)$$

where $P$ is the point cloud, $w(p;L)$ is the per-point saliency derived from the language prompt $L$, $\mathbf{1}_{\mathrm{vis}(p;v)}$ indicates whether point $p$ is visible from view $v$, and $R_{\mathrm{red}}(v)$ penalizes overlap with static real-camera coverage. In practice, this instance-weighted visibility maximization, subject to redundancy constraints and possibly hard physical rules, is only approximated: large foundation models are queried to produce $v^*$ directly via prompt interpretation (Chen et al., 18 Dec 2025).
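The sketch below scores a small discrete set of candidate views under this objective; the visibility test and redundancy term are passed in as opaque callables, and the whole routine is an explicit-search stand-in for the prompt-based approximation described above.

```python
import numpy as np

# Explicit-search sketch of the instance-weighted visibility objective, evaluated
# over a small discrete candidate set. VERM instead asks a foundation model to
# output v* directly from the prompt.

def score_view(weights, visible_mask, redundancy, lam=0.1):
    """weights: (N,) per-point saliency w(p;L); visible_mask: (N,) bool array of
    points visible from this view; redundancy: scalar overlap with static cameras."""
    return float(np.sum(weights * visible_mask)) - lam * redundancy

def select_view(candidate_views, weights, visibility_fn, redundancy_fn, lam=0.1):
    """Return the candidate view (e.g., an (elev, azim) pair) with the highest score."""
    scores = [
        score_view(weights, visibility_fn(v), redundancy_fn(v), lam)
        for v in candidate_views
    ]
    return candidate_views[int(np.argmax(scores))]
```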

In VG-AVS, the selection policy is denoted by $\pi_\theta(v \mid o_c, q)$, where $o_c$ is the current RGB observation and $q$ is the language query. Optimization is conducted by a combination of supervised fine-tuning (teacher-forcing loss) and reinforcement learning (policy gradient with QA accuracy as reward) (Koo et al., 15 Dec 2025).
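A toy sketch of the reinforcement-learning stage follows; it assumes the policy exposes the summed log-probability of its sampled view-action tokens and uses a plain REINFORCE update with a constant baseline, which may differ from the exact algorithm used in VG-AVS.

```python
import torch

# Toy sketch of the RL stage: a REINFORCE-style update where the reward is binary
# QA correctness as judged by a frozen verifier. Real training details (optimizer,
# baseline, batching) are assumptions.

def rl_step(policy, optimizer, observation, question, answer_is_correct: bool,
            baseline: float = 0.5):
    """policy(observation, question) is assumed to return the summed log-probability
    of the sampled view-action tokens (a scalar tensor with gradients)."""
    log_prob = policy(observation, question)
    reward = 1.0 if answer_is_correct else 0.0
    loss = -(reward - baseline) * log_prob   # policy gradient with a constant baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```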

CoV implements a two-stage policy: (1) anchor view selection by question relevance, using LLM prompt output; (2) iterative selection from a set of discrete camera actions expressed in SE(3) (translation/rotation/view-switch), mediated by closed-loop language-driven reasoning (Zhao et al., 8 Jan 2026).
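The loop below sketches this closed-loop think–act procedure; `vlm_generate`, `render_view`, and the discrete action vocabulary are hypothetical stand-ins for the model call, the renderer, and CoV's actual action set.

```python
# Sketch of a CoV-style closed perception-action loop with hypothetical stand-ins
# for the VLM call and the renderer; the action names are illustrative.

ACTIONS = {"move_left", "move_right", "move_forward", "rotate_left",
           "rotate_right", "switch_anchor"}

def chain_of_view(question, anchor_views, vlm_generate, render_view, max_steps=6):
    history = list(anchor_views)                  # start from prompt-selected anchor views
    for _ in range(max_steps):
        reply = vlm_generate(question, history)   # interleaved "think" + "act" output
        if reply.startswith("Answer:"):
            return reply[len("Answer:"):].strip()
        action = reply.strip()
        if action in ACTIONS:
            history.append(render_view(action))   # act: render the newly requested view
    # budget exhausted: force a final answer from the accumulated evidence
    return vlm_generate(question, history, force_answer=True)
```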

4. Model Architectures and Coarse-to-Fine Procedures

Prompt-based systems integrate multimodal transformers with specialized modules for image, depth, and language:

  • Depth-aware Attention (VERM): After rendering from $v^*$, a transformer module attends jointly to image-patch, language-token, and learnable depth-token embeddings. The architecture uses an 8-layer transformer with output heads for translation heatmap, depth, rotation, gripper, and collision states. Depth tokens (e.g., 36 trainable vectors) are critical for accurate 3D localization. All heads except translation are trained with cross-entropy; the translation head uses a Gaussian-heatmap MSE (Chen et al., 18 Dec 2025). A minimal sketch of this token layout follows the list.
  • Chain-of-View Reasoning (CoV): The system employs a shared VLM backbone that interleaves image patch and prompt tokens in the input stream. The complete context—language query, anchor frame indices, and observation history—enables the agent to issue discrete actions (translations, rotations, anchor switches) and to terminate with an “Answer: ...” string. Each new action triggers a render of a new view, closing the perception–action loop at test time (Zhao et al., 8 Jan 2026).
  • Refinement Predictors and Coarse-to-Fine: In VERM, a “refinement predictor” is attached to the policy: coarse actions are planned from the global virtual view unless the predictor signals the need for precision, in which case a localized crop of the point cloud is rendered and re-encoded for fine action output (Chen et al., 18 Dec 2025). Similarly, CoV increases reasoning steps adaptively, with evidence accumulating over multiple iterative interactions (Zhao et al., 8 Jan 2026).
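The following sketch illustrates the depth-token layout described in the first bullet above; the 256-dimensional embedding, mean pooling, and head sizes are illustrative assumptions, not the VERM implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a depth-aware attention trunk: image-patch, language, and 36
# learnable depth tokens share one 8-layer transformer. Embedding width, pooling,
# and output head sizes are assumptions for illustration.

class DepthAwarePolicy(nn.Module):
    def __init__(self, dim=256, n_depth_tokens=36, heatmap_hw=64):
        super().__init__()
        self.depth_tokens = nn.Parameter(torch.randn(n_depth_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=8)
        self.translation_head = nn.Linear(dim, heatmap_hw * heatmap_hw)  # Gaussian-heatmap MSE
        self.depth_head = nn.Linear(dim, 100)        # discretized depth bins, cross-entropy
        self.rotation_head = nn.Linear(dim, 72 * 3)  # discretized Euler bins, cross-entropy
        self.gripper_head = nn.Linear(dim, 2)        # open / close, cross-entropy
        self.collision_head = nn.Linear(dim, 2)      # collide / avoid, cross-entropy

    def forward(self, patch_tokens, lang_tokens):
        b = patch_tokens.size(0)
        depth = self.depth_tokens.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([patch_tokens, lang_tokens, depth], dim=1)
        fused = self.trunk(tokens)
        pooled = fused.mean(dim=1)                   # simple pooling for the global heads
        return {
            "translation": self.translation_head(pooled),
            "depth": self.depth_head(pooled),
            "rotation": self.rotation_head(pooled),
            "gripper": self.gripper_head(pooled),
            "collision": self.collision_head(pooled),
        }
```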

5. Training Algorithms, Datasets, and Performance

Prompt-based virtual view selection models have been evaluated across simulated and real settings:

  • VERM: Trained on RLBench (17 tasks, 100 demos each) and real-world tasks using point-cloud renders and multimodal prompts. The policy network is approximately 2× smaller and achieves 1.89× faster training convergence and 1.54× faster inference relative to RVT-2. Success rates reach 83.6% on RLBench and 78.8% in real-world rollouts, showing an advantage over predecessor architectures (Chen et al., 18 Dec 2025).
  • VG-AVS: Utilizes the ProcTHOR and HM3D simulation environments with data triples tying queries, language prompts, and images of view transitions. The combination of supervised fine-tuning and reinforcement learning (with a frozen VLM verifier providing reward) pushes QA accuracy from 50.2% (zero-shot Qwen2.5-VL) up to 83.7% (SFT+RL) on AVS-ProcTHOR. Prompt structure (chain-of-thought, tag-based serialization) is crucial for stability (Koo et al., 15 Dec 2025).
  • CoV: Operates in a training-free regime, relying entirely on prompt conditioning. On OpenEQA, CoV improves LLM-Match by up to +13.62% (Qwen3-VL-Flash). Increasing the allowable number of action steps leads to further improvement (test-time scaling), with the average gain rising by +2.51% from 1 to 6 steps. State-of-the-art results are reported on ScanQA (31.9% EM@1) and SQA3D (51.1% EM@1) (Zhao et al., 8 Jan 2026).

| Method | Training Regime | Main Evaluation Environments | Peak Reported Accuracy / Success | Notable Advantages |
| --- | --- | --- | --- | --- |
| VERM (Chen et al., 18 Dec 2025) | SFT (no RL) | RLBench, real-world | 83.6% success rate (RLBench) | ~2× smaller/faster policy, depth tokens |
| VG-AVS (Koo et al., 15 Dec 2025) | SFT + RL (QA reward) | ProcTHOR, HM3D | 83.7% QA accuracy (AVS-ProcTHOR) | Chain-of-thought output, supervised data |
| CoV (Zhao et al., 8 Jan 2026) | Training-free (prompting) | OpenEQA, ScanQA, SQA3D | 31.9% EM@1 (ScanQA) | Test-time scaling, multi-step reasoning |

6. Extensions, Limitations, and Open Problems

Current prompt-based virtual view selection systems offer several distinctive capabilities: test-time flexibility (especially in CoV), reduction of redundant visual inputs, explicit reference to language goals, and mitigation of scene occlusion via actively imagined virtual viewpoints. However, challenges persist:

  • Scene Memory and Multi-Turn Reasoning: VG-AVS and baseline architectures in this family discard past views, limiting their ability to exploit spatial memory. This suggests a need for explicit mapping or SLAM integration to handle occlusions, collisions, and longer-horizon policies (Koo et al., 15 Dec 2025).
  • Prompt Design and Output Serialization: Chain-of-thought blocks and custom output tags improve performance but introduce engineering complexity and demand careful ordering (e.g., staging RL after SFT in VG-AVS to prevent degenerate patterns) (Koo et al., 15 Dec 2025).
  • Test-time Scaling: CoV demonstrates that increasing view-selection steps at inference robustly yields higher spatial reasoning accuracy. No explicit retraining is required, highlighting the power and flexibility of prompt-based control (Zhao et al., 8 Jan 2026).
  • Handling Dynamic or Cluttered Scenes: Transfer from synthetic to real domains (e.g., from ProcTHOR to HM3D) sees diminished returns, and occlusions or scene changes may degrade performance. A plausible implication is that learned or adaptive domain-robustness modules, as well as lightweight heuristic augmentations, may be required for future work (Zhao et al., 8 Jan 2026, Koo et al., 15 Dec 2025).
  • Closed-loop Policy/Verifier Training: End-to-end policies that jointly optimize both the action selector and language-based verifier remain an open direction, as current systems often freeze one component during RL for stability (Koo et al., 15 Dec 2025).

Taken collectively, prompt-based virtual view selection occupies a central role in modern embodied AI, merging large-scale multimodal model capabilities with dynamic, task-aware 3D perception and action. Ongoing research into prompt engineering, dataset curation, policy-verifier co-training, and multi-step memory will define the future evolution of this paradigm.
