
ViSA: Verification through Spatial Assertions

Updated 8 May 2026
  • ViSA is a framework for spatial reasoning that uses frame-anchored micro-claims to verify local geometric and relational properties in 3D scenes.
  • It replaces global or heuristic view selection with a two-stage process that generates and verifies atomic claims using an Evidence Quality (EQ) scoring mechanism.
  • Experimental evaluations show ViSA can improve accuracy by up to 10% over baselines on embodied AI spatial tasks, highlighting its practical advantages.

Verification through Spatial Assertions (ViSA) is an operational and logical framework for spatial reasoning that replaces global or heuristic view selection with principled, claim-based verification grounded in atomic, frame-anchored assertions. ViSA addresses the challenge of spatial reasoning in embodied AI tasks, where answering queries about 3D scenes often demands synthesizing new views and explicitly verifying local geometric or relational properties. It is implemented both as a claim-generating pipeline for world-model-augmented vision-language reasoning (Jha et al., 5 Dec 2025) and, at a foundational level, as a logic-based approach unifying continuous and discrete spatial verification (Ciancia et al., 2014).

1. Formal Problem Setup and Notation

Let $x_0$ denote the input image of a static 3D scene and $q$ a spatial reasoning question with $n$ multiple-choice answers $A = \{\alpha_1,\ldots,\alpha_n\}$. Vision-language models (VLMs) are tasked with estimating $P(\alpha \mid x_0, q)$ but struggle with multi-view inference. At test time, a pre-trained video-diffusion world model $\mathcal{W}$ is used as an autoregressive pixel-space planner to synthesize imagined trajectories. The action space $\mathcal{B}$ consists of egocentric camera primitives $a \in \{\mathrm{move\textrm{-}forward}(d),\ \mathrm{turn\textrm{-}left}(\theta_\ell),\ \mathrm{turn\textrm{-}right}(\theta_r)\}$ with discrete parameters. Each action maps to a camera-pose transformation $\psi(a)=f$. An action-conditioned trajectory $\tau_{1:t-1}=(f_1,\ldots,f_{t-1})$ seeds frame synthesis:

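The world model generates each new frame conditioned on the input image and on the camera poses accumulated along the trajectory so far.

To select useful synthesized views, a scalar reward is computed for each frame, forming the basis for trajectory pruning and evidence accumulation.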
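The action space and the pose trajectory it induces can be illustrated with a short Python sketch. The planar pose `(x, y, heading)`, the parameter grids, and the names `Action`, `apply_action`, and `rollout` are assumptions made for illustration; the text only states that the primitives take discrete parameters and map to camera-pose transformations.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Action:
    kind: str     # "move-forward", "turn-left", or "turn-right"
    value: float  # distance (arbitrary units) or angle in degrees


# Hypothetical discrete parameter grids; the actual values are not given in the text.
DISTANCES = (0.25, 0.5)
ANGLES = (9.0, 18.0, 27.0)

ACTION_SPACE = (
    [Action("move-forward", d) for d in DISTANCES]
    + [Action(kind, a) for kind, a in product(("turn-left", "turn-right"), ANGLES)]
)


def apply_action(pose, action):
    """Apply one egocentric primitive to a planar pose (x, y, heading in degrees)."""
    x, y, heading = pose
    if action.kind == "move-forward":
        rad = math.radians(heading)
        return (x + action.value * math.cos(rad), y + action.value * math.sin(rad), heading)
    if action.kind == "turn-left":
        return (x, y, (heading + action.value) % 360.0)
    if action.kind == "turn-right":
        return (x, y, (heading - action.value) % 360.0)
    raise ValueError(f"unknown action kind: {action.kind}")


def rollout(actions, start=(0.0, 0.0, 0.0)):
    """Return the sequence of poses visited by applying the actions in order."""
    poses = []
    pose = start
    for action in actions:
        pose = apply_action(pose, action)
        poses.append(pose)
    return poses
```

In this simplified setting, each pose returned by `rollout` would be the conditioning signal for one synthesized frame.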

2. Frame-Anchored Micro-Claims: Generation and Verification

ViSA replaces heuristic or black-box utility scoring methods with a two-stage, localized verification protocol:

  • Claim Generation: For each synthesized frame, a VLM produces a set of question-conditioned micro-claims. Each atomic claim is a single-sentence, frame-anchored hypothesis of the form: "After performing action ‘turn-left 18°’, Object X moves closer to the left edge." The claim is generated conditioned on both the "before" and "after" images and is targeted to discriminate directly between the candidate answers in $A$.
  • Claim Verification: Each micro-claim is verified by a second VLM, which is prompted with the claim and the frames it refers to and outputs a verdict with an associated confidence. The verifier focuses on local geometric or relational properties.

This decomposed protocol forces fine-grained inspection of spatial relations, contrasting with methods that assign a global helpfulness score to entire synthesized sequences.
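The two-stage protocol can be sketched as a small Python function that wires a claim generator to a claim verifier. Everything here is a hypothetical stand-in: the callables `generate` and `verify` represent the two VLM calls, images are represented by file-name strings, and the binary `entailed` verdict is a simplifying assumption rather than the verdict format used by Jha et al.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VerifiedClaim:
    text: str
    entailed: bool     # assumed binary verdict
    confidence: float  # assumed to lie in [0, 1]


def generate_and_verify(
    question: str,
    action: str,
    before: str,
    after: str,
    generate: Callable[[str, str, str, str], List[str]],
    verify: Callable[[str, str, str], Tuple[bool, float]],
) -> List[VerifiedClaim]:
    """Stage 1: propose micro-claims for one frame. Stage 2: verify each claim separately."""
    verified = []
    for claim in generate(question, action, before, after):
        entailed, confidence = verify(claim, before, after)
        verified.append(VerifiedClaim(claim, entailed, confidence))
    return verified


# Toy stand-ins for the two VLM calls, used only to exercise the control flow.
def toy_generate(question, action, before, after):
    return [f"After performing action '{action}', Object X moves closer to the left edge."]


def toy_verify(claim, before, after):
    return ("left edge" in claim, 0.8)
```

Keeping generation and verification behind separate callables mirrors the point of the protocol: the model that proposes a claim is not the one that judges it.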

3. Evidence Quality (EQ) Reward Function

For each candidate frame, consider its set of micro-claims together with the verdict and confidence that the verifier assigned to each claim. The frame’s "Evidence Quality" (EQ) aggregates these verdicts and confidences into a single scalar score.

Only frames rich in verifiable, high-confidence entailments are prioritized. This contrasts sharply with entropy or salience-based view selection, as EQ is both interpretable and tightly tied to task relevance.
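One way such a score could be computed is sketched below. The specific form, a mean over claims in which entailed claims contribute their confidence and all other claims contribute zero, is an assumption chosen for illustration and is not the EQ definition from Jha et al.; the verdict labels are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredClaim:
    verdict: str       # assumed labels: "entailed", "contradicted", "unknown"
    confidence: float  # assumed to lie in [0, 1]


def evidence_quality(claims: Sequence[ScoredClaim]) -> float:
    """Illustrative EQ: average confidence of entailed claims over all claims.

    Claims that are not entailed add nothing to the numerator but still count
    in the denominator, so unsupported or low-confidence claims pull the score down.
    """
    if not claims:
        return 0.0
    supported = sum(c.confidence for c in claims if c.verdict == "entailed")
    return supported / len(claims)
```

Under this assumed form, a frame with claims `[("entailed", 1.0), ("entailed", 0.5), ("unknown", 0.9)]` scores 0.5, while a frame whose claims are all unverifiable scores 0 regardless of how confident the verifier is.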

4. Algorithmic Pipeline

Test-time scaling with ViSA proceeds via a search through action-trajectories, maintaining a beam of candidate evidentiary trajectories. Key steps are as follows:

  1. For each beam trajectory, consider all possible next camera moves to extend the path.
  2. For each resulting trajectory, generate and verify micro-claims for every synthesized frame, compute EQ scores, and retain the top-$k$ evidence frames in a buffer.
  3. Prune the candidates to the beam size by mean EQ.
  4. Collate evidence from all surviving trajectories after the full search depth.
  5. Use the VLM, given $x_0$, $q$, and the accumulated evidence buffer, to select the final answer from $A$.

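The search steps listed in this section can be sketched in Python as a plain beam search. This is a simplified illustration under stated assumptions, not the authors' implementation: `synthesize` and `eq_score` are hypothetical callables standing in for the world model and the claim-based EQ scorer, the evidence buffer is collated once at the end from surviving trajectories, and the final VLM answering step is left out.

```python
import heapq
from statistics import mean


def visa_beam_search(actions, synthesize, eq_score, depth, beam_size, top_k):
    """Beam search over action trajectories, pruned by mean EQ.

    actions:    iterable of camera primitives (any hashable values)
    synthesize: trajectory (tuple of actions) -> synthesized frame
    eq_score:   frame -> float, the frame's Evidence Quality
    Returns (surviving trajectories, top_k evidence entries as (score, frame)).
    """
    # Each beam entry is (trajectory, ((score, frame), ...)) with one pair per step.
    beam = [((), ())]
    for _ in range(depth):
        candidates = []
        for trajectory, scored in beam:
            for action in actions:  # step 1: try every next camera move
                extended = trajectory + (action,)
                frame = synthesize(extended)
                candidates.append((extended, scored + ((eq_score(frame), frame),)))
        # step 3: keep the beam_size trajectories with the highest mean EQ
        beam = heapq.nlargest(
            beam_size, candidates, key=lambda c: mean(s for s, _ in c[1])
        )

    # step 4: collate frames from survivors, keyed by prefix to avoid duplicates
    buffer = {}
    for trajectory, scored in beam:
        for step, entry in enumerate(scored, start=1):
            buffer[trajectory[:step]] = entry
    evidence = heapq.nlargest(top_k, buffer.values(), key=lambda e: e[0])
    return [trajectory for trajectory, _ in beam], evidence
```

The returned evidence entries would then be passed, together with the original image and question, to the answering VLM.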

5. Empirical Evaluation and Comparative Analysis

ViSA has been evaluated on two principal benchmarks:

  • SAT-Real (150 real-image spatial questions, 5 categories):
    • Baseline InternVL3-14B: 41.3% average accuracy.
    • Random top-$k$ (γ=1): 63.3–66.0%.
    • MindJourney verifier (γ=1): 63.3–67.3%.
    • ViSA (γ=1): 65.3–72.7%, yielding up to ≈10% gain over baseline and 2–5% over MindJourney or random.
    • ViSA’s best-category gain is in Egocentric Move (EgoM): 95.7% vs. MindJourney’s 73.9%.
  • MMSI-Bench (162-question subset, 11 fine-grained categories):
    • Baseline: 27.2%.
    • Random top-1 (γ=1): 33.3%.
    • MindJourney top-1 (γ=1): 32.7%.
    • ViSA top-1 (γ=1): 35.8%.
    • No consistent improvement for any verifier as $k$ or $\gamma$ increases; performance fluctuates between ~27–36%.

ViSA demonstrates smooth scaling of accuracy with test-time budget on SAT-Real, whereas MindJourney and random baselines plateau or decline. However, all verification methods plateau on MMSI-Bench, which is attributed to the lower perceptual quality of synthesized frames, as measured by LAION aesthetic scores (SAT-Real: 5.12; MMSI: 4.53) (Jha et al., 5 Dec 2025).

6. Biases, Limitations, and Failure Modes

Systematic analysis reveals:

  • MindJourney action biases: MindJourney’s verification strategy exhibits a left-turn bias (~50% of moves) and over-selection of large-angle turns, linked to suboptimal exploratory behavior and visual salience over task-relevance.
  • Variance reduction: Entropy-based calibration shows MindJourney’s helpfulness scores barely reduce answer uncertainty compared to random scoring and can worsen it.
  • ViSA mitigation: By decomposing evidence evaluation into micro-claims, ViSA achieves more balanced action selection (forward moves 34–66%, left 24–40%, right 10–26%, with even turn magnitudes). The EQ reward penalizes unsupported or low-confidence claims, reducing selection bias.
  • Persistent bottlenecks: On MMSI-Bench, all approaches are performance-limited by information bottlenecks: the world model’s pixel-level synthesis lacks the fidelity needed for subtle relational reasoning. The inability to produce verifiable, fine-grained changes leads to evidence indistinguishability, stalling accuracy improvements (Jha et al., 5 Dec 2025).
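The entropy-based calibration mentioned in this section compares answer uncertainty before and after evidence is supplied. A minimal sketch of that comparison follows, assuming the answering model exposes a probability distribution over the answer options; the function names are hypothetical, and any distributions passed in are the caller's own, not results from the paper.

```python
import math


def answer_entropy(probs):
    """Shannon entropy in bits of a (possibly unnormalized) answer distribution."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


def entropy_reduction(prior, posterior):
    """Positive when the evidence made the answer distribution more peaked."""
    return answer_entropy(prior) - answer_entropy(posterior)
```

A useful verifier should yield a clearly positive `entropy_reduction` more often than a random scorer does; a value near zero or below means the selected evidence left the model as uncertain as before, or more so.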

7. Theoretical Foundations and Modal-Logic Connections

ViSA’s broader underpinnings trace to modal-logic-based approaches for spatial verification. The SLCS (Spatial Logic of Closure Spaces) framework (Ciancia et al., 2014) encapsulates declarative spatial assertions within a logic supporting both continuous (topological) and discrete (graph-based) environments. Core principles include:

  • Syntax: Atomic propositions, Boolean connectives, a "one-step" neighborhood operator, and a spatial until operator ($\mathcal{U}$).
  • Semantics: Closure spaces (topological or graph-induced), a valuation of atomic propositions, and reachability.
  • Linear-time model checking for spatial properties, enabling practical analysis of large models.
  • Applications: Maze solving, map region detection, and spatial reachability in extensive graphs/images.

These logical tools provide a systematic foundation for expressing and verifying micro-claims as in ViSA, thereby establishing a rigorous link from intuitive spatial assertions to formal verification (Ciancia et al., 2014).
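To make the graph-based case concrete, the sketch below checks two spatial properties on a small directed graph. It rests on an assumed, simplified reading of the closure-space semantics: the closure of a set adds its direct successors, and a point satisfies "phi until psi" when it lies in a region of phi-points whose one-step exits all satisfy psi. The function names and the adjacency-dict representation are illustrative and are not taken from Ciancia et al.

```python
from collections import deque


def closure(graph, points):
    """Graph closure: the given points plus all of their direct successors."""
    result = set(points)
    for p in points:
        result.update(graph.get(p, ()))
    return result


def spatial_until(graph, start, sat_phi, sat_psi):
    """Check 'phi until psi' at `start` under the simplified reading above.

    Grows a region from `start`. Any successor that does not satisfy psi must
    be absorbed into the region, and therefore must satisfy phi.
    """
    if start not in sat_phi:
        return False
    region = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in graph.get(x, ()):
            if y in region or y in sat_psi:
                continue  # already covered, or a valid psi boundary point
            if y not in sat_phi:
                return False  # the region leaks into a point that is neither phi nor psi
            region.add(y)
            queue.append(y)
    return True
```

For the path graph `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}` with phi holding on `{0, 1}`, the check succeeds at point 0 when psi holds on `{2}` and fails when psi holds only on `{3}`, because point 2 then lies outside both sets. Each edge is examined at most once per call, which is in keeping with the linear-time flavor of the model-checking result cited above.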

8. Implications, Open Problems, and Future Directions

ViSA exemplifies how grounding test-time verification in explicit, frame-anchored spatial assertions improves interpretability and performance on spatial reasoning tasks given adequate world-model fidelity. However, several challenges and prospective avenues remain:

  • Enhancement of world models for higher-resolution synthesis, enabling verification of small attribute and relational shifts.
  • Integration of low-level perceptual or feature-space priors in claim verification protocols.
  • Hybrid proposal strategies operating in abstract coordinate space rather than pixels, to better capture meaningful spatial novelty.
  • Extensions to logic-based frameworks incorporating temporal, metric, or probabilistic properties and scalable automata-theoretic reasoning for recursive or infinite spatial domains (Ciancia et al., 2014).

A plausible implication is that as world models become more faithful and verification frameworks more expressive, ViSA-style protocols will become central to robust, interpretable spatial reasoning in both AI systems and formal verification settings (Jha et al., 5 Dec 2025, Ciancia et al., 2014).

References (2)
