ViSA: Verification through Spatial Assertions
- ViSA is a framework for spatial reasoning that uses frame-anchored micro-claims to verify local geometric and relational properties in 3D scenes.
- It replaces global or heuristic view selection with a two-stage process that generates and verifies atomic claims using an Evidence Quality (EQ) scoring mechanism.
- Experimental evaluations show ViSA can improve accuracy by up to 10% over baselines on embodied AI spatial tasks, highlighting its practical advantages.
Verification through Spatial Assertions (ViSA) is an operational and logical framework for spatial reasoning that replaces global or heuristic view selection with principled, claim-based verification grounded in atomic, frame-anchored assertions. ViSA addresses the challenge of spatial reasoning in embodied AI tasks, where answering queries about 3D scenes often demands synthesizing new views and explicitly verifying local geometric or relational properties. It is implemented both as a claim-generating pipeline for world-model-augmented vision-language reasoning (Jha et al., 5 Dec 2025) and, at a foundational level, as a logic-based approach unifying continuous and discrete spatial verification (Ciancia et al., 2014).
1. Formal Problem Setup and Notation
Let $I_0$ denote the input image of a static 3D scene and $q$ a spatial reasoning question with multiple-choice answers $\mathcal{A}$. Vision-language models (VLMs) are tasked with estimating the correct answer in $\mathcal{A}$ from $(I_0, q)$ but struggle with multi-view inference. At test time, a pre-trained video-diffusion world model is used as an autoregressive pixel-space planner to synthesize imagined trajectories. The action space consists of egocentric camera primitives with discrete parameters, such as a left turn by a fixed angle. Each action maps to a camera-pose transformation. An action-conditioned trajectory, that is, a sequence of actions together with the camera poses they induce, seeds frame synthesis: starting from $I_0$, the world model renders the synthesized frames along the trajectory. To select useful synthesized views, a reward is computed for each frame, forming the basis for trajectory pruning and evidence accumulation.
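The mapping from actions to camera poses can be illustrated with a minimal sketch. It assumes a planar pose (position plus heading) and three primitives named `forward`, `turn-left`, and `turn-right`; the names `Pose`, `Action`, `apply_action`, and `rollout` are hypothetical, and the actual pose parameterization used with the world model is not specified in the text above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Planar camera pose (assumed): position and heading in degrees, 0 = +y axis."""
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0


@dataclass(frozen=True)
class Action:
    """Hypothetical egocentric primitive with one discrete parameter."""
    kind: str      # "forward", "turn-left", or "turn-right"
    amount: float  # distance for "forward", degrees for turns


def apply_action(pose: Pose, action: Action) -> Pose:
    """Return the pose reached by applying one primitive to the current pose."""
    if action.kind == "forward":
        rad = math.radians(pose.heading_deg)
        # Heading is measured counter-clockwise from +y, so left is toward -x.
        return Pose(pose.x - action.amount * math.sin(rad),
                    pose.y + action.amount * math.cos(rad),
                    pose.heading_deg)
    if action.kind == "turn-left":
        return Pose(pose.x, pose.y, (pose.heading_deg + action.amount) % 360.0)
    if action.kind == "turn-right":
        return Pose(pose.x, pose.y, (pose.heading_deg - action.amount) % 360.0)
    raise ValueError(f"unknown action kind: {action.kind}")


def rollout(start: Pose, actions: list[Action]) -> list[Pose]:
    """Poses visited along an action sequence, including the start pose."""
    poses = [start]
    for action in actions:
        poses.append(apply_action(poses[-1], action))
    return poses
```

In this sketch, each pose in the rollout would condition one call to the world model; the rendering step itself is omitted.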
2. Frame-Anchored Micro-Claims: Generation and Verification
ViSA replaces heuristic or black-box utility scoring methods with a two-stage, localized verification protocol:
- Claim Generation: For each synthesized frame $x_t$, a VLM produces a set $C_t$ of question-conditioned micro-claims. Each atomic claim is a single-sentence, frame-anchored hypothesis of the form: "After performing action ‘turn-left 18°’, Object X moves closer to the left edge." The claim is generated conditioned on both the "before" and "after" images and is targeted to directly distinguish between the candidate answers.
- Claim Verification: Each micro-claim $c_i \in C_t$ is verified by a second VLM, which is prompted with the claim and the frames it refers to, and outputs a verdict $v_i$ (for example, whether the frames entail the claim) with an associated confidence $p_i$. The verifier focuses on local geometric or relational properties.
This decomposed protocol forces fine-grained inspection of spatial relations, contrasting with methods that assign a global helpfulness score to entire synthesized sequences.
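The two-stage protocol can be sketched as a small data flow. This is an illustrative outline only: the class names, the verdict labels (`"entailed"`, `"contradicted"`, `"unclear"`), and the callable signatures are assumptions, and the two VLM calls are replaced by toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MicroClaim:
    frame_id: int  # index of the synthesized "after" frame the claim is anchored to
    action: str    # action that produced the frame, e.g. "turn-left 18°"
    text: str      # single-sentence hypothesis


@dataclass(frozen=True)
class Verdict:
    claim: MicroClaim
    label: str         # assumed label set: "entailed", "contradicted", "unclear"
    confidence: float  # clipped to [0, 1]


# Hypothetical interfaces standing in for the generator and verifier VLMs.
ClaimGenerator = Callable[[int, str, str], list[str]]
ClaimVerifier = Callable[[MicroClaim], tuple[str, float]]


def generate_and_verify(frame_id: int, action: str, question: str,
                        generator: ClaimGenerator,
                        verifier: ClaimVerifier) -> list[Verdict]:
    """Stage 1: generate question-conditioned claims. Stage 2: verify each one."""
    claims = [MicroClaim(frame_id, action, text)
              for text in generator(frame_id, action, question)]
    verdicts = []
    for claim in claims:
        label, confidence = verifier(claim)
        verdicts.append(Verdict(claim, label, min(1.0, max(0.0, confidence))))
    return verdicts


def toy_generator(frame_id: int, action: str, question: str) -> list[str]:
    """Stand-in for the claim-generating VLM; returns one fixed claim."""
    return [f"After performing action '{action}', "
            "Object X moves closer to the left edge."]


def toy_verifier(claim: MicroClaim) -> tuple[str, float]:
    """Stand-in for the verifying VLM; always returns the same verdict."""
    return ("entailed", 0.9)
```

Keeping generation and verification behind separate callables mirrors the use of a second VLM as verifier, so that the model proposing a claim is not the one judging it.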
3. Evidence Quality (EQ) Reward Function
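For each candidate frame $x_t$, let its claims be $c_1, \dots, c_n$, with associated verdicts $v_1, \dots, v_n$ and confidences $p_1, \dots, p_n$. The frame’s "Evidence Quality" $\mathrm{EQ}(x_t)$ is a scalar score computed from these verdicts and confidences.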
Only frames rich in verifiable, high-confidence entailments are prioritized. This contrasts sharply with entropy or salience-based view selection, as EQ is both interpretable and tightly tied to task relevance.
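As an illustration of how such a score could behave, the sketch below uses one possible aggregation: the mean, over all claims of a frame, of the confidence of entailed claims, with non-entailed claims contributing zero, i.e. $\frac{1}{n}\sum_i \mathbb{1}[v_i = \text{entailed}]\, p_i$. This formula is an assumption chosen to match the description above, not necessarily the definition used by ViSA.

```python
def evidence_quality(verdicts: list[tuple[str, float]]) -> float:
    """Assumed EQ: mean confidence of entailed claims over all claims of a frame.

    `verdicts` holds one (label, confidence) pair per micro-claim. Claims that
    are not entailed add nothing to the numerator but still count in the
    denominator, so unsupported claims pull the score down.
    """
    if not verdicts:
        return 0.0  # a frame with no claims provides no evidence
    supported = sum(conf for label, conf in verdicts if label == "entailed")
    return supported / len(verdicts)
```

Under this assumed form, a frame with two confidently entailed claims scores higher than a frame with one entailed and one contradicted claim, which is the ordering the text describes.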
4. Algorithmic Pipeline
Test-time scaling with ViSA proceeds via a search through action-trajectories, maintaining a beam of candidate evidentiary trajectories. Key steps are as follows:
- For each beam trajectory, consider all possible next camera moves to extend the path.
- For each resulting trajectory, generate and verify micro-claims for every synthesized frame, compute EQ scores, and retain the top-$k$ evidence frames in a buffer.
- Prune to beam size $B$ by mean EQ of candidate trajectories.
- Collate evidence from all surviving trajectories after full search depth.
- Use the VLM, given the input image, the question, and the accumulated evidence buffer, to select the final answer among the multiple-choice options.
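The search steps above can be sketched as a small beam search. This is a simplified outline under stated assumptions: each action adds exactly one synthesized frame, `score_frame` is a hypothetical callable that stands in for world-model rendering plus claim generation, verification, and EQ scoring, and the top-$k$ selection is applied once at the end rather than incrementally. The final answering step by the VLM is left out.

```python
from typing import Callable, Sequence

Trajectory = tuple[str, ...]      # sequence of action names
Frame = tuple[float, Trajectory]  # (EQ score, trajectory prefix that produced the frame)


def visa_beam_search(actions: Sequence[str],
                     score_frame: Callable[[Trajectory], float],
                     depth: int, beam_size: int, top_k: int) -> list[Frame]:
    """Return the collated evidence frames of the surviving trajectories."""
    beam: dict[Trajectory, list[Frame]] = {(): []}
    for _ in range(depth):
        # Extend every beam trajectory by every possible next camera move.
        candidates: dict[Trajectory, list[Frame]] = {}
        for traj, frames in beam.items():
            for action in actions:
                new_traj = traj + (action,)
                candidates[new_traj] = frames + [(score_frame(new_traj), new_traj)]
        # Prune to the beam size by mean EQ over each trajectory's frames.
        ranked = sorted(
            candidates,
            key=lambda t: sum(eq for eq, _ in candidates[t]) / len(candidates[t]),
            reverse=True,
        )
        beam = {t: candidates[t] for t in ranked[:beam_size]}
    # Collate the top-k frames of each surviving trajectory, without duplicates.
    evidence: set[Frame] = set()
    for frames in beam.values():
        evidence.update(sorted(frames, reverse=True)[:top_k])
    return sorted(evidence, reverse=True)
```

A toy call such as `visa_beam_search(["forward", "turn-left", "turn-right"], lambda t: {"forward": 0.8, "turn-left": 0.5, "turn-right": 0.2}[t[-1]], depth=2, beam_size=2, top_k=1)` keeps only frames reached by forward moves, since those are the frames the toy scorer rates highest.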
5. Empirical Evaluation and Comparative Analysis
ViSA has been evaluated on two principal benchmarks:
- SAT-Real (150 real-image spatial questions, 5 categories):
  - Baseline InternVL3-14B: 41.3% average accuracy.
  - Random top-3 (γ=1): 63.3–66.0%.
  - MindJourney verifier (γ=1): 63.3–67.3%.
  - ViSA (γ=1): 65.3–72.7%, yielding up to ≈10% gain over baseline and 2–5% over MindJourney or random.
  - ViSA’s best-category gain is in Egocentric Move (EgoM): 95.7% vs. 73.9% for MindJourney (MJ) in the reported setting.
- MMSI-Bench (162-question subset, 11 fine-grained categories):
  - Baseline: 27.2%.
  - Random top-1 (γ=1): 33.3%.
  - MindJourney top-1 (γ=1): 32.7%.
  - ViSA top-1 (γ=1): 35.8%.
  - No consistent improvement for any verifier as the test-time search settings are scaled up; performance fluctuates between ~27–36%.
ViSA demonstrates smooth scaling of accuracy with increasing test-time search on SAT-Real, whereas MindJourney and random baselines plateau or decline. However, all verification methods plateau on MMSI-Bench, which is attributed to the lower perceptual quality of synthesized frames, as measured by LAION aesthetic scores (SAT-Real: 5.12; MMSI: 4.53) (Jha et al., 5 Dec 2025).
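To put the MMSI-Bench margins in perspective, the short calculation below converts the reported top-1 accuracies into percentage-point gains and into an approximate number of questions on the 162-question subset. It uses only the figures listed above; the variable and function names are illustrative.

```python
# Reported MMSI-Bench top-1 accuracies (percent), as listed above.
mmsi_top1 = {"Baseline": 27.2, "Random": 33.3, "MindJourney": 32.7, "ViSA": 35.8}
NUM_QUESTIONS = 162  # size of the MMSI-Bench subset


def gain_in_points(results: dict[str, float], method: str, reference: str) -> float:
    """Accuracy difference in percentage points, rounded to one decimal."""
    return round(results[method] - results[reference], 1)


# ViSA's gain over each other method, in percentage points.
gains = {ref: gain_in_points(mmsi_top1, "ViSA", ref)
         for ref in mmsi_top1 if ref != "ViSA"}

# The same gains expressed as an approximate number of questions.
questions = {ref: round(points * NUM_QUESTIONS / 100) for ref, points in gains.items()}

print(gains)      # {'Baseline': 8.6, 'Random': 2.5, 'MindJourney': 3.1}
print(questions)  # {'Baseline': 14, 'Random': 4, 'MindJourney': 5}
```

On a subset of this size, the margins over the random and MindJourney verifiers correspond to only a handful of questions, which is consistent with the observation that performance on MMSI-Bench fluctuates rather than scaling.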
6. Biases, Limitations, and Failure Modes
Systematic analysis reveals:
- MindJourney action biases: MindJourney’s verification strategy exhibits a left-turn bias (~50% of moves) and over-selection of large-angle turns, linked to suboptimal exploratory behavior and visual salience over task-relevance.
- Variance reduction: Entropy-based calibration shows MindJourney’s helpfulness scores barely reduce answer uncertainty compared to random scoring and can worsen it.
- ViSA mitigation: By decomposing evidence evaluation into micro-claims, ViSA achieves more balanced action selection (forward moves 34–66%, left 24–40%, right 10–26%, with even turn magnitudes). The EQ reward penalizes unsupported or low-confidence claims, reducing selection bias.
- Persistent bottlenecks: On MMSI-Bench, all approaches are performance-limited by information bottlenecks: the world model’s pixel-level synthesis lacks the fidelity needed for subtle relational reasoning. The inability to produce verifiable, fine-grained changes leads to evidence indistinguishability, stalling accuracy improvements (Jha et al., 5 Dec 2025).
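The two diagnostics discussed above, action-selection shares and entropy-based uncertainty reduction, can be computed with a few lines. This is a generic sketch using Shannon entropy over an answer distribution; the function names are hypothetical and the exact calibration procedure of the original analysis may differ.

```python
import math
from collections import Counter


def shannon_entropy(probs: list[float]) -> float:
    """Shannon entropy in bits of a probability distribution over answers."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_reduction(before: list[float], after: list[float]) -> float:
    """Positive when evidence made the answer distribution more peaked."""
    return shannon_entropy(before) - shannon_entropy(after)


def action_shares(actions: list[str]) -> dict[str, float]:
    """Fraction of selected moves per action type, e.g. to expose a turn bias."""
    counts = Counter(actions)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}
```

For example, going from a uniform distribution over four answers to a fully confident one gives a reduction of 2 bits, while a verifier whose selected evidence leaves the distribution unchanged gives 0. An `action_shares` value near 0.5 for left turns would reflect the kind of bias reported for MindJourney.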
7. Theoretical Foundations and Modal-Logic Connections
ViSA’s broader underpinnings trace to modal-logic-based approaches for spatial verification. The SLCS (Spatial Logic of Closure Spaces) framework (Ciancia et al., 2014) encapsulates declarative spatial assertions within a logic supporting both continuous (topological) and discrete (graph-based) environments. Core principles include:
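- Syntax: Atomic propositions $p$, Boolean connectives, a "one-step" neighborhood operator ($\Diamond \phi$), and a spatial until operator ($\phi \, \mathcal{U} \, \psi$).
- Semantics: Closure spaces $(X, \mathcal{C})$ (topological or graph-induced), a valuation of atomic propositions, and reachability properties expressed within the logic.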
- Linear-time model checking for spatial properties, enabling practical analysis of large models.
- Applications: Maze solving, map region detection, and spatial reachability in extensive graphs/images.
These logical tools provide a systematic foundation for expressing and verifying micro-claims as in ViSA, thereby establishing a rigorous link from intuitive spatial assertions to formal verification (Ciancia et al., 2014).
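A minimal sketch of the discrete case can make these operators concrete. It assumes a graph-induced closure space in which the closure of a set adds its direct successors, reads the one-step operator as membership in the closure of the satisfying set, and reads spatial until as membership in a region of $\phi$-points whose closure boundary satisfies $\psi$. The function names are hypothetical, and the naive fixpoint loop below is not the linear-time algorithm mentioned above.

```python
Graph = dict[str, set[str]]  # node -> set of direct successors


def closure(graph: Graph, points: set[str]) -> set[str]:
    """C(A): the set A together with all direct successors of its points."""
    result = set(points)
    for x in points:
        result.update(graph.get(x, ()))
    return result


def sat_near(graph: Graph, sat_phi: set[str]) -> set[str]:
    """Points satisfying the one-step operator applied to phi."""
    return closure(graph, sat_phi)


def sat_until(graph: Graph, sat_phi: set[str], sat_psi: set[str]) -> set[str]:
    """Largest region A inside sat_phi such that C(A) minus A lies in sat_psi."""
    region = set(sat_phi)
    changed = True
    while changed:
        changed = False
        for x in list(region):
            # x leaks out of the region through a successor that is not psi.
            if any(y not in region and y not in sat_psi for y in graph.get(x, ())):
                region.discard(x)
                changed = True
    return region


# Toy corridor a - b - c - d with symmetric edges.
corridor: Graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```

On this corridor, `sat_until(corridor, {"a", "b"}, {"c"})` returns `{"a", "b"}` because the only way out of the region is through `c`, whereas `sat_until(corridor, {"a", "b"}, {"d"})` returns the empty set because the region is not bounded by `d`. A micro-claim such as "the region containing Object X is enclosed by wall cells" has this until-style shape.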
8. Implications, Open Problems, and Future Directions
ViSA exemplifies how grounding test-time verification in explicit, frame-anchored spatial assertions improves interpretability and performance on spatial reasoning tasks given adequate world-model fidelity. However, several challenges and prospective avenues remain:
- Enhancement of world-models for higher-resolution synthesis enabling verification of small attribute and relational shifts.
- Integration of low-level perceptual or feature-space priors in claim verification protocols.
- Hybrid proposal strategies operating in abstract coordinate space rather than pixels, to better capture meaningful spatial novelty.
- Extensions to logic-based frameworks incorporating temporal, metric, or probabilistic properties and scalable automata-theoretic reasoning for recursive or infinite spatial domains (Ciancia et al., 2014).
A plausible implication is that as world models become more faithful and verification frameworks more expressive, ViSA-style protocols will become central to robust, interpretable spatial reasoning in both AI systems and formal verification settings (Jha et al., 5 Dec 2025, Ciancia et al., 2014).