E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes

Published 20 Apr 2026 in cs.CV | (2604.17969v2)

Abstract: Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators, thereby allowing the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning specifically under full 5-DoF viewpoint changes.

Summary

The paper presents E3VS-Bench, a rigorous benchmark requiring agents to actively explore photorealistic 3D Gaussian Splatting scenes for embodied visual search tasks.
Methodologically, it uses 5-DoF agent control and strict dataset filtering to ensure episodes demand genuine exploration and visual evidence localization, enabling nuanced evaluation of spatial reasoning and object attribute tasks.
Key findings show that current vision-language models (VLMs) lag behind humans in viewpoint planning and evidence acquisition, underscoring the need for improved multimodal memory and perception-action planning architectures.

Introduction and Motivation

E3VS-Bench advances embodied visual search by formulating the Embodied 3D Visual Search (E3VS) task within photorealistic 3D Gaussian Splatting (3DGS) reconstructions. Traditional benchmarks for visual search and embodied QA, such as REVERIE and EQA, constrain agent movement to planar navigation or limited viewpoint adjustment, thus omitting critical viewpoint-dependent phenomena—including occlusion resolution, vertical viewpoint shifts, and access to fine-grained object attributes—that are essential for real-world active perception. E3VS-Bench explicitly targets this deficiency, requiring agents to actively control their 5-DoF viewpoint to acquire visually grounded, task-critical evidence.

Figure 1: The E3VS task requires 5-DoF viewpoint control to resolve occlusions and access fine-grained visual evidence, with examples such as reading production area labels.

Benchmark Design and Dataset Construction

E3VS-Bench comprises 99 high-fidelity indoor scenes reconstructed from SceneSplat++ via 3D Gaussian Splatting, yielding photorealistic renderings that preserve minute visual details (e.g., small text, logos) often lost in mesh-based environments. Benchmark episodes (2,014 in total) are question-driven, necessitating active viewpoint manipulation for their resolution.

To ensure episodes demand genuine exploration, the dataset generation pipeline includes multi-stage filtering (Figure 2): (1) manual scene curation, (2) VLM-assisted QA pair generation, (3) human verification of QAs and viewpoint answerability, (4) annotation of answerable viewpoints, and (5) answerability filtering via a VLM-as-a-judge protocol that excludes episodes that can be solved without any viewpoint transition, i.e., from prior knowledge alone or from the initial observation.
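Step (5) of the pipeline can be illustrated with a minimal sketch. It assumes that each candidate episode has already received two judge scores on the 1 to 5 scale: one for an answer produced without any image, and one for an answer produced from the initial view only. The `EpisodeCheck` record, the field names, and the threshold of 4 are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeCheck:
    """Hypothetical record holding judge scores for one candidate episode."""
    episode_id: str
    blind_score: int  # judge score (1-5) for an answer given without any image
    start_score: int  # judge score (1-5) for an answer from the initial view only


# Assumed cut-off: a score at or above this value counts as "already solved".
PASS_THRESHOLD = 4


def requires_exploration(check: EpisodeCheck, threshold: int = PASS_THRESHOLD) -> bool:
    """Keep an episode only if neither shortcut answer reaches the threshold."""
    return check.blind_score < threshold and check.start_score < threshold


def filter_episodes(checks: List[EpisodeCheck], threshold: int = PASS_THRESHOLD) -> List[str]:
    """Return the ids of episodes that still need a viewpoint change."""
    return [c.episode_id for c in checks if requires_exploration(c, threshold)]


if __name__ == "__main__":
    candidates = [
        EpisodeCheck("ep_a", blind_score=5, start_score=1),  # answerable from prior knowledge
        EpisodeCheck("ep_b", blind_score=2, start_score=3),  # needs exploration
        EpisodeCheck("ep_c", blind_score=1, start_score=4),  # answerable from the start view
    ]
    print(filter_episodes(candidates))
```

Treating the two shortcut conditions as a conjunction mirrors the description above: an episode is dropped as soon as either prior knowledge or the initial observation is enough.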

Figure 2: The dataset construction pipeline integrates VLM-driven QA generation, human annotation, and viewpoint filtering to ensure episodes require active exploration.

Statistical analysis demonstrates scene diversity and coverage of various QA categories: object search, context-guided search, object state, object attribute, spatial reasoning, and counting. All action types—including those involving vertical and rotational viewpoint changes—are well represented, underscoring the necessity of full 5-DoF control for effective task completion.

Figure 3: Dataset statistics reveal balanced distributions in question types, scene categories, action types, and QA lengths.

Figure 4: Representative E3VS episodes for each reasoning category, highlighting the variation in required viewpoint control for task resolution.

Task Formulation and Evaluation Protocol

An E3VS episode consists of a triplet $(\mathcal{S}, q, v_0)$: a 3DGS scene $\mathcal{S}$, a natural-language question $q$, and an initial viewpoint $v_0$. The agent’s state is parameterized as $v_t = (x_t, y_t, z_t, \theta_t, \phi_t)$, i.e., three translational coordinates and two rotation angles in 3D space. The agent interacts with the environment through discrete actions until it issues a stop command, after which it generates an answer $\hat{y}$ from observation $O_T$.
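The state and interaction loop described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the action names, the step size `STEP`, the turn angle `TURN`, the choice of $z$ as the vertical axis, and the reading of $\theta$ as yaw and $\phi$ as pitch are all hypothetical, since the summary does not specify them. Rendering and collision checks are omitted.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Viewpoint:
    """5-DoF state v_t = (x, y, z, theta, phi); angles in degrees (assumed)."""
    x: float
    y: float
    z: float
    theta: float  # assumed: yaw, rotation about the vertical axis
    phi: float    # assumed: pitch, looking up or down


STEP = 0.25  # assumed translation per action, in scene units
TURN = 15.0  # assumed rotation per action, in degrees


def apply_action(v: Viewpoint, action: str) -> Viewpoint:
    """Apply one discrete action (hypothetical action set) and return the new state."""
    if action in ("forward", "backward"):
        sign = 1.0 if action == "forward" else -1.0
        yaw = math.radians(v.theta)
        return replace(v,
                       x=v.x + sign * STEP * math.cos(yaw),
                       y=v.y + sign * STEP * math.sin(yaw))
    if action == "up":
        return replace(v, z=v.z + STEP)
    if action == "down":
        return replace(v, z=v.z - STEP)
    if action == "turn_left":
        return replace(v, theta=(v.theta + TURN) % 360.0)
    if action == "turn_right":
        return replace(v, theta=(v.theta - TURN) % 360.0)
    if action == "look_up":
        return replace(v, phi=min(v.phi + TURN, 90.0))
    if action == "look_down":
        return replace(v, phi=max(v.phi - TURN, -90.0))
    raise ValueError(f"unknown action: {action}")


def run_episode(v0: Viewpoint,
                policy: Callable[[Viewpoint], str],
                max_steps: int = 50) -> List[Viewpoint]:
    """Roll out a policy until it returns 'stop' or the step limit is reached."""
    v = v0
    trajectory = [v]
    for _ in range(max_steps):
        action = policy(v)
        if action == "stop":
            break
        v = apply_action(v, action)
        trajectory.append(v)
    return trajectory


if __name__ == "__main__":
    scripted = iter(["up", "look_down", "forward", "stop"])
    path = run_episode(Viewpoint(0.0, 0.0, 1.5, 0.0, 0.0), lambda v: next(scripted))
    print(path[-1])
```

In a real agent the `policy` callable would wrap a VLM that receives the rendered observation and the question; here it only sees the state so that the sketch stays self-contained.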

Correctness is assessed using a VLM-as-a-judge that scores outputs on a 1–5 scale. The protocol combines semantic and visual grounding: both the agent's final image and the human-annotated goal image serve as references, which mitigates superficial matching in open-vocabulary QA. The judge's scores correlate moderately with those of human evaluators (Spearman’s $\rho = 0.54$).
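For readers who want to reproduce this kind of agreement check on their own annotations, the sketch below computes Spearman's rank correlation in plain Python. Because scores on a 1 to 5 scale contain many ties, it uses average ranks and then takes the Pearson correlation of the ranks. The score lists in the usage section are made-up toy values, not data from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(a), average_ranks(b))


if __name__ == "__main__":
    # Toy scores for illustration only.
    judge_scores = [5, 3, 1, 4, 2, 5, 1]
    human_scores = [4, 3, 2, 5, 1, 5, 2]
    print(round(spearman_rho(judge_scores, human_scores), 3))
```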

Baselines and Experimental Analysis

E3VS-Bench benchmarks both proprietary (Gemini 2.5/3.0, GPT 5.1) and open-source (Qwen3-VL-8B/30B, InternVL3.5-8B, Step3-VL-10B) VLMs under the E3VS agent paradigm, as well as in static settings (blind VLM, VQA at start, birdview, goal, and single-image 2D visual search). Experiments are conducted in a zero-shot evaluation regime, with no optimization on train/val splits.

Key Findings

Active exploration is essential: There is a marked gap between VQA at Start versus VQA at Goal settings, confirming the necessity of viewpoint exploration beyond static or privileged perspectives.
Viewpoint planning limitations: All current VLMs—including high-performing proprietary models—exhibit substantial deficits compared to human performance in E3VS, particularly in acquiring fine-grained, viewpoint-sensitive evidence.
Task-type performance variance: Spatial reasoning (SR) and object attribute (OA) tasks reveal model disparities: Gemini 3.0 Pro attains near VQA at Goal performance in SR tasks, whereas OA and counting (CNT) remain challenging due to evidence localization and accumulation requirements across a trajectory.
Efficiency gains with multi-frame input: Increasing contextual frames does not substantially affect correctness scores but consistently reduces navigation steps and collision rate.

Figure 5: Navigation efficiency and safety improve with additional observation frames, while judge score remains saturated.
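The quantities compared in these findings (judge score, navigation steps, collision rate, broken down by question category) can be aggregated from per-episode logs roughly as follows. The record fields and the definition of collision rate as collisions per executed step are assumptions made for this sketch; the summary does not state how the paper defines the metric.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarize(records: List[dict]) -> Dict[str, dict]:
    """Aggregate per-episode logs into per-category metrics.

    Each record is a hypothetical dict with keys:
      'category'   - question type label, e.g. 'SR', 'OA', 'CNT'
      'score'      - judge score on the 1-5 scale
      'steps'      - number of actions executed before stopping
      'collisions' - number of those actions that collided (assumed definition)
    """
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    summary = {}
    for category, rows in by_category.items():
        total_steps = sum(r["steps"] for r in rows)
        total_collisions = sum(r["collisions"] for r in rows)
        summary[category] = {
            "mean_score": mean(r["score"] for r in rows),
            "mean_steps": total_steps / len(rows),
            # Guard against categories where no action was executed.
            "collision_rate": total_collisions / total_steps if total_steps else 0.0,
        }
    return summary


if __name__ == "__main__":
    # Toy logs for illustration only.
    logs = [
        {"category": "SR", "score": 4, "steps": 10, "collisions": 1},
        {"category": "SR", "score": 2, "steps": 10, "collisions": 3},
        {"category": "OA", "score": 3, "steps": 8, "collisions": 0},
    ]
    for category, metrics in summarize(logs).items():
        print(category, metrics)
```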

Ablation Results

A larger internal reasoning budget improves GPT 5.1's viewpoint planning, whereas Gemini 3.0 Flash remains largely insensitive to the reasoning budget.
Initializing agents at goal viewpoints confirms that Gemini models recognize evidence robustly and stop more efficiently than Qwen3-VL-8B.
OA episodes are highly dependent on exploratory viewpoint discovery due to unpredictable evidence locations.

Qualitative Evaluation

Agent-selected viewpoints are frequently sub-optimal and fail to capture enough task-relevant evidence to justify the answer. Human-selected viewpoints, by contrast, are deliberately positioned for maximum visual clarity. The VLM-as-a-judge is strict: it penalizes answers that are not supported by explicit visual grounding.

Figure 6: Comparison of predicted answers and judge scores for Gemini 3.0 Flash versus human responses underscores agent limitations in viewpoint selection and evidence acquisition.

Advanced Dataset Filtering

The appendix details advanced viewpoint filtering with VLM protocols (Figure 7) and QA generation (Figure 8), as well as penetrated viewpoint exclusion (Figure 9), ensuring physical plausibility and answerable QA pairs.

Figure 7: Sample viewpoint-to-QA filtering results; green marks accepted viewpoints and red marks filtered-out viewpoints.

Figure 8: Examples of QA pairs generated by Gemini 2.5 Flash, illustrating the diversity and depth of instance-specific questions.

Figure 9: Penetrated viewpoint exclusion, where red denotes invalid camera positions due to geometric intersections.

Implications and Future Directions

E3VS-Bench establishes a rigorous evaluation platform for viewpoint-dependent embodied visual search, revealing persistent gaps between current VLM agents and human-level active perception. The benchmark shows how critical evidence localization, trajectory planning, and visual grounding are in complex 3D environments. Limitations in automated judging metrics, especially for viewpoint adequacy, suggest a need for refined evaluation protocols.

Practical: E3VS-Bench offers a standardized environment for testing embodied perception agents, with photorealistic details and strict open-vocabulary QA requirements. Deployment in robotic vision, AR, and navigation-centric AI systems is facilitated by the fidelity and granularity of benchmark tasks.

Theoretical: The dataset underscores the necessity for improved multimodal memory architectures, closed-loop perception-action planning, and compositional spatial reasoning capabilities in VLMs.

Future research paths: Incorporate trajectory-level evidence aggregation, memory-augmented architectures, reinforcement-driven viewpoint search, and more human-aligned evaluation criteria. Released splits enable learning-based agent optimization beyond zero-shot evaluation. Refining judge protocols and exploring the correlation of visual grounding with semantic correctness are critical for robust assessment.

Conclusion

E3VS-Bench systematically targets the limitations of embodied QA and visual search tasks by enforcing active, viewpoint-dependent exploration in high-fidelity 3DGS environments. Quantitative and qualitative analyses reveal persistent model deficits in spatial reasoning, evidence localization, and trajectory planning, with substantial performance divergence from humans. The benchmark sets an essential foundation for future advances in embodied active perception and multimodal reasoning (2604.17969).