Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Published 28 May 2026 in cs.CV and cs.AI | (2605.30231v1)

Abstract: Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.

Authors (6)

Explain it Like I'm 14

Overview: What this paper is about

This paper is about helping AI systems better understand 3D space in videos and images. The authors focus on Vision-Language Models (VLMs), which are AIs that can look at pictures or videos and read text to answer questions. Today, many VLMs can describe scenes well, but they often struggle with 3D spatial reasoning: knowing where objects are, how far away they are, and recognizing the same object from different angles.

The paper introduces a new training method called GASP (Geometric-Aware Spatial Priors). Instead of teaching the model by giving it lots of question–answer pairs about 3D scenes, GASP teaches the model basic “rules of the 3D world,” like how the same point on an object should match across different views, and how depth (distance from the camera) can help avoid confusion. This makes the model’s inner understanding of space more solid and general, without needing special 3D inputs during use.

Key questions the paper asks

The paper aims to answer a few simple questions:

Can we make VLMs truly understand 3D space by teaching them basic geometric rules, instead of just drilling them on tons of 3D question–answer examples?
Can the model learn to recognize the same point on an object across different frames and angles (called “correspondence”)?
Will learning these geometric skills inside the model improve its performance on many different spatial tasks, even ones it wasn’t directly trained for?

How the method works (in everyday language)

Think of a VLM like a student who reads and watches videos to answer questions. Many current methods try to prepare the student by giving them lots of practice tests (3D VQA datasets). That can lead the student to memorize patterns instead of truly understanding 3D space. GASP takes a different route: it teaches the student the core principles of 3D vision.

Here’s the idea in simple terms:

Matching the “same point” across views:
- Imagine you have two photos of a room taken from different angles. If you put a small sticker on a table in one photo, you should be able to find the same spot in the other photo. This is called “correspondence.”
- GASP trains a small “correspondence head” inside the model to learn this skill: given a point in one frame, it should find the matching point in another frame.
Using depth as a tiebreaker:
- Sometimes two locations look very similar (like two identical chairs), and the model might mix them up.
- Depth (how far something is from the camera) helps: the closer chair and the farther chair have different depths. GASP uses depth information during training to nudge the model toward the correct match and away from look-alikes.
Teaching every layer, not just the final step:
- Modern AI models have many layers, like grades in school. GASP adds this matching lesson to many layers so that the sense of 3D consistency is built up step by step, from low-level details to high-level understanding.
Important: this extra “correspondence head” is only used while the model is learning.
- During training, it gives the model feedback about matching points and being consistent with depth.
- During use (inference), it’s removed. The model acts like a normal VLM—no extra inputs, no extra parts—just smarter about 3D space.
What data and training look like:
- The authors train on large collections of videos where they know which points match across frames and have depth maps (distance information).
- They use a simple push–pull learning rule: pull together the correct matches (make their features similar) and push apart the wrong ones (make them less similar). This is like teaching the model to recognize true pairs and ignore imposters.
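
The push–pull rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each frame yields a list of per-point embedding vectors and that `matches[i]` gives the index of the true match in the other frame. The function names and the temperature value of 0.07 are assumptions chosen for the example; the summary does not state the value the paper uses.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_similarity_matrix(emb_a, emb_b):
    # Pairwise cosine similarity between every point in frame A and frame B.
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def info_nce(emb_a, emb_b, matches, temperature=0.07):
    # matches[i] is the index in emb_b of the true match for emb_a[i].
    # All other points in frame B act as negatives (hypothetical setup).
    sims = cosine_similarity_matrix(emb_a, emb_b)
    total = 0.0
    for i, j in enumerate(matches):
        logits = [s / temperature for s in sims[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[j]  # -log softmax probability of the true match
    return total / len(matches)

# Toy embeddings (made up for illustration): two points seen in two frames.
frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_b = [[0.9, 0.1], [0.1, 0.9]]
loss_correct = info_nce(frame_a, frame_b, [0, 1])
loss_swapped = info_nce(frame_a, frame_b, [1, 0])
print(loss_correct < loss_swapped)
```

The loss is small when each point is most similar to its true match and large when a look-alike scores higher, which is the "pull together, push apart" behavior described in the text.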

Main findings and why they matter

Here are the most important results the authors report:

The model’s “inner” 3D understanding improves a lot:
- Before GASP, standard VLMs were very poor at matching the same point across frames inside their own layers (often below 5% accuracy).
- After GASP training, the best layers reach over 70% correct matches, and the model stays strong even when the two frames are far apart in time (over 85% robustness). This shows the model truly learned view-invariant, time-stable representations.
Big gains on spatial benchmarks without 3D Q&A training:
- The model gets significantly better at tasks that require solid 3D reasoning:
  - All-Angles Bench (camera viewpoint understanding): up to +18.2 percentage points.
  - VSI-Bench (object permanence and counting): up to +29.0 percentage points.
  - BLINK (multi-view reasoning): up to +15.0 percentage points.
- Importantly, the model did not train on any 3D question–answer datasets to get these gains—it learned from geometric principles instead.
No heavy add-ons during use:
- At test time, the model works just like a normal VLM. It doesn’t need special 3D encoders or extra 3D inputs (like point clouds). That makes it simpler and faster to use.
Small trade-offs:
- There’s a small drop on some general, action-focused video QA tasks, likely because some model capacity is now focused on precise spatial reasoning. But overall, the model’s performance on broad tasks stays strong, and it often improves on time-related video understanding thanks to better “object permanence.”
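
The matching accuracy quoted above is a PCK-style score (percentage of correct keypoints): a predicted point counts as correct if it lands within a threshold distance of the true point. The sketch below is a hedged illustration of that metric, plus one plausible way to express temporal robustness as PCK at a given frame gap divided by PCK at a gap of one frame. The function names, the threshold, and all numbers are made up for the example and are not taken from the paper.

```python
import math

def pck(predicted, ground_truth, threshold):
    # Fraction of predicted points within `threshold` of their ground-truth location.
    if not ground_truth:
        raise ValueError("need at least one ground-truth point")
    correct = sum(
        1 for p, g in zip(predicted, ground_truth) if math.dist(p, g) <= threshold
    )
    return correct / len(ground_truth)

def normalized_robustness(pck_by_gap):
    # Divide PCK at each frame gap by PCK at gap 1 (assumed normalization).
    base = pck_by_gap[1]
    if base == 0:
        raise ValueError("PCK at gap 1 is zero; normalization is undefined")
    return {gap: value / base for gap, value in pck_by_gap.items()}

# Toy coordinates (hypothetical): the second prediction misses by 4 units.
predicted = [(0, 0), (5, 5), (10, 10)]
ground_truth = [(0, 1), (5, 9), (10, 10)]
score = pck(predicted, ground_truth, threshold=2.0)
print(score)

# Toy PCK values per frame gap (hypothetical, not results from the paper).
robustness = normalized_robustness({1: 0.5, 24: 0.25})
print(robustness)
```

A normalized value tells you how much accuracy is retained as frames get farther apart, but it hides the absolute level, so both numbers are worth reporting together.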

What this could mean going forward

This work suggests a new path to building smarter, more reliable AI for the real world:

Stronger spatial intelligence:
- By teaching models the “rules of 3D,” we can make them better at understanding where things are, how they move, and how scenes look from different viewpoints. This is crucial for robots, AR/VR, drones, self-driving, and any task that needs reliable 3D awareness.
Better generalization:
- Instead of memorizing answers from specific datasets, the model learns principles that apply everywhere. That means it’s more likely to handle new places and unusual camera views.
Practical and efficient:
- Because the extra training parts are removed at test time, the improved model doesn’t become heavier or slower when you use it.
Next steps:
- Future work could mix these geometric lessons with other kinds of training to balance spatial skills with semantic and action understanding, and scale the approach to bigger models and more varied data.

In short, GASP shows that teaching simple but powerful geometric rules inside VLMs can unlock much better 3D reasoning—without bulky add-ons and without relying on tons of 3D question–answer training data.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, phrased to guide concrete follow‑up work:

Scalability without ground-truth geometry: The approach relies on point correspondences and depth from DL3DV; it’s unclear how well GASP trains with self-supervised or noisy pseudo‑labels derived from in‑the‑wild videos (e.g., monocular depth/flow, SfM tracks) and what noise-robust training strategies are needed.
Domain and scene diversity: Generalization to dynamic/non‑rigid scenes, significant occlusions, motion blur, low light, outdoor scenes, and camera models beyond pinhole (fisheye, 360°) is not evaluated.
Long-horizon temporal consistency: Temporal robustness is measured up to Δt=24; behavior over much longer gaps (minutes, scene changes) and across shot boundaries remains unexplored.
Out-of-domain spatial benchmarks: Despite citing MMSI-Bench, STI-Bench, and SpaceVista, the paper does not report results on these more challenging OOD spatial benchmarks to validate generalization claims.
Absolute vs. relative geometry: The training enforces depth consistency but not metric scale; performance on tasks requiring absolute distances/scale (e.g., metric depth, real-world distance estimation) is untested.
Effect on action-centric and semantic tasks: A modest drop on NextQA suggests trade-offs; strategies to mitigate degradation (e.g., curriculum, loss weighting schedules, adapter routing, or task-balanced optimization) are not investigated.
Mechanistic link to attention: While PCK improvements are reported, the paper does not analyze how GASP changes Q_V K_V^T (e.g., attention alignment with epipolar constraints, layer-wise CKA, emergence of geometry-sensitive heads).
Head architecture and initialization: The correspondence head’s design (depth, nonlinearity, d_emb size) and SVD initialization are not ablated; alternatives (e.g., attention-based heads, gating, residual adapters) and their impact on stability/performance remain open.
Injection locus: Only LLM-layer injection is explored; whether injecting (or jointly injecting) into the visual encoder, cross-modal layers, or specific attention heads yields better trade-offs is untested.
Hyperparameter sensitivity: Temperature τ, loss weights λ_c and λ_d, and d_emb are not systematically studied; robustness ranges and recommended defaults are missing.
Negative sampling strategy: The method uses per-frame negatives without hard negative mining or memory banks; efficacy of hard negatives (e.g., same-instance look-alikes, repetitive textures) and cross-sequence negatives is not explored.
Occlusion and non-rigid handling: The depth-consistency term presumes valid geometric matches; how the approach copes with occlusions, disocclusions, deformable objects, transparencies/reflectives, and specularities is not assessed.
Label-noise tolerance: Sensitivity to errors in pseudo depth and point tracks (e.g., SfM failures, depth holes) and techniques like robust losses, confidence reweighting, or self-training are not examined.
Sequence construction bias: The ±48-frame sampling window and 8–24 frame sequences may bias training; the effect of diversified sampling policies (e.g., wider windows, curriculum over baselines to hard gaps) is unknown.
Beyond correspondences and depth: Other geometric priors (e.g., epipolar consistency without depth, surface normals, cycle-consistency, camera pose constraints, optical flow) may complement current losses but are not studied.
Calibration beyond Pearson ρ: Confidence–accuracy analysis uses correlation only; comprehensive calibration metrics (e.g., ECE, reliability diagrams) and their link to downstream decisions are missing.
Absolute vs. normalized temporal metrics: Temporal robustness is normalized to Δt=1; absolute PCK across Δt and error distributions (e.g., median/percentile errors) are not reported, limiting interpretability.
Compute/latency profile: While inference has no extra head, the paper lacks quantitative training/inference cost comparisons (FLOPs, memory, latency) versus 3D-encoder baselines and versus standard SFT.
Fairness baseline design: The “DL3DV VQA” baseline may be underpowered due to formulation; alternatives (e.g., contrastive QA, ranking, chain-of-thought prompts, or synthetic counterfactuals) are not explored to fairly isolate objective vs. data effects.
Language–geometry interaction: How improved geometric features affect text-conditioned grounding (e.g., phrase localization across views) and complex spatial-language compositionality is not evaluated.
Persistence of benefits without the head: The longevity and stability of learned geometric priors once the head is discarded (e.g., across fine-tuning steps on other tasks or after RLHF) is not assessed.
Data efficiency: The approach uses ~1.75M sequences; minimal data requirements, scaling laws, and active data selection for correspondences/depth remain open.
Robustness to repetitive structures: The method relies on depth to disambiguate look-alikes; performance in environments with many same-depth duplicates (e.g., lattices, corridors) and potential need for object-level grouping is untested.
Integration with explicit 3D encoders: A controlled comparison with lightweight/frozen 3D encoders under matched compute budgets—and hybrid strategies (e.g., distillation from or weak fusion with 3D features)—is missing.
Evaluation breadth: Additional tasks (3D reconstruction consistency, view synthesis QA, 3D detection/relational grounding, embodied navigation/manipulation) are not used to stress-test geometric reasoning in realistic settings.
Training stability and optimization: Interactions between LoRA rank, learning-rate multipliers (4× for the head), gradient flow to Q/K projections, and mixed precision are not analyzed for stability or reproducibility.
Reproducibility specifics: Exact hyperparameters (τ, λ_c, λ_d, d_emb), data filtering criteria, and seeds are not fully specified; detailed release is needed to replicate results.
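
To make the calibration gap in the list above concrete, the sketch below compares a Pearson correlation between confidence and correctness with a simple Expected Calibration Error (ECE). It is an illustration only: the equal-width binning, the bin count, and the toy numbers are assumptions for the example, not the paper's evaluation protocol.

```python
import math

def pearson(xs, ys):
    # Linear correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(vx * vy)

def expected_calibration_error(confidences, correct, n_bins=5):
    # Weighted average gap between mean confidence and accuracy per confidence bin.
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# Toy predictions (hypothetical): confident matches are right, unsure ones are wrong.
confidences = [0.9, 0.8, 0.2, 0.1]
correct = [1, 1, 0, 0]
rho = pearson(confidences, correct)
ece = expected_calibration_error(confidences, correct)
print(rho, ece)
```

In this toy case the correlation is close to 1 while the ECE is still clearly above zero, which is why correlation alone does not establish that the confidence values are well calibrated.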

Practical Applications

Immediate Applications

The paper’s findings enable near-term deployments by fine-tuning existing VLMs (e.g., Qwen2.5-VL-7B, LLaVA-NeXT-Video-7B) with GASP to improve 3D spatial reasoning without adding heavy 3D encoders and without changing inference-time I/O.

Healthcare (non-clinical ops) — OR/video room asset tracking and utilization
- Use case: Persistent counting and localization of instruments and carts in OR suites or procedural rooms for logistics and turnover optimization.
- Why GASP helps: Stronger object permanence and multi-view consistency reduce double counting and missed items across cameras.
- Tools/workflows: Fine-tune house VLM with GASP; deploy as a video QA assistant on stored or live feeds; surface counts/locations via dashboards.
- Assumptions/dependencies: Non-diagnostic use; privacy safeguards; training data domain similarity; acceptance of small trade-offs on action-centric QA.
Education — Interactive spatial reasoning tutors
- Use case: A tutor that grades geometry/spatial-visualization assignments from multi-view images or short videos (e.g., “Is this prism net foldable into a cube?”).
- Why GASP helps: Viewpoint-invariant understanding and relative depth reasoning improve grading consistency across perspectives.
- Tools/workflows: Web app built on a GASP-fine-tuned VLM; prompt templates for spatial tasks; lightweight server-side inference.
- Assumptions/dependencies: Non-safety-critical; moderate-quality video/images; minimal calibration.
Software/Robotics — Language-driven manipulation and navigation
- Use case: Robot assistants perform “pick the front-most red cup on the left shelf” or “navigate behind the chair, then to the door” with fewer mis-localizations.
- Why GASP helps: Better relative direction, pose awareness, and cross-view identity tracking; inference latency unchanged.
- Tools/workflows: Integrate GASP-tuned VLM into ROS/MoveIt pipelines as a spatial-language planner; couple with existing grasp/motion planners.
- Assumptions/dependencies: Domain fine-tune on in-situ videos; safety interlocks; sensor calibration for control loop.
Retail/Operations Analytics — Multi-camera people/product counting
- Use case: Occupancy and queue length estimates, shelf audit (count and restock prompts) across time and viewpoints.
- Why GASP helps: +29% object counting gains (VSI-Bench) translate to more reliable tallies across cameras and hours.
- Tools/workflows: Queryable video assistant over CCTV; scheduled batch processing; alerts when counts deviate from planograms.
- Assumptions/dependencies: Privacy and consent; store-specific domain adaptation; acceptable error bounds for KPIs.
AR/VR/XR — Stable content anchoring and spatial QA
- Use case: Anchoring virtual labels to real objects across user motion; answering “What’s behind the sofa?” from short AR captures.
- Why GASP helps: Improved camera pose estimation and relative direction yield fewer anchor drifts and better spatial responses.
- Tools/workflows: Pair ARCore/ARKit tracking with a GASP-tuned VLM for semantic-spatial QA; mobile offload or on-device inference.
- Assumptions/dependencies: Mobile compute budget; variable lighting; privacy constraints.
Media Production — Continuity and scene-consistency checks
- Use case: Automated flagging of continuity errors (prop positions, counts) across takes and angles.
- Why GASP helps: Multi-view reasoning and object permanence reduce false positives/negatives in continuity QA.
- Tools/workflows: NLE plugin (Premiere/Resolve) calling a GASP-tuned VLM with shot metadata; shot-by-shot reports.
- Assumptions/dependencies: Access to shot slates/timecodes; GPU for batch passes.
Energy/Infrastructure Inspection — Drone/robot video triage
- Use case: Counting/locating assets (insulators, bolts), verifying relative positions along a line/tower from flyby videos.
- Why GASP helps: View-invariant correspondences stabilize counts amid parallax and repetitive textures.
- Tools/workflows: Post-flight triage assistant; export detections/notes into CMMS; integrate with existing inspection pipelines.
- Assumptions/dependencies: Adequate video quality; environmental variability; conservative thresholds for critical findings.
Insurance/Real Estate — Property walkthrough understanding
- Use case: Auto-count rooms/fixtures and summarize spatial relations (“the second bedroom is behind the kitchen”).
- Why GASP helps: Better relative direction and multi-view consistency across casual phone videos.
- Tools/workflows: Portal upload → batch GASP-VLM analysis → structured report for adjusters/agents.
- Assumptions/dependencies: Disclosure/consent; varied capture quality; tolerance for occasional layout ambiguities.
SLAM/Mapping Assist — Feature matching and pose hints
- Use case: Use internal correspondences from the VLM as an auxiliary matcher to stabilize visual SLAM and photogrammetry.
- Why GASP helps: Layer-wise PCK improvements (>70% peak) yield more robust matches under viewpoint/appearance changes.
- Tools/workflows: “VLM-SLAM Bridge” that exports token matches to ORB-SLAM/COLMAP; fallback to classic keypoints when uncertain.
- Assumptions/dependencies: Camera intrinsics/extrinsics availability; synchronization; confidence calibration.
Model DevOps (Industry/Academia) — Spatial QA of VLMs
- Use case: Procurement/QA teams benchmark and select VLMs with new internal metrics (PCK, confidence-accuracy correlation, temporal robustness).
- Why GASP helps: Offers a training recipe and diagnostic suite to build and verify geometry-aware VLMs without 3D encoders.
- Tools/workflows: Open-source “GASP plugin” for LoRA fine-tunes; analysis harness to compute layer-wise PCK and robustness curves.
- Assumptions/dependencies: Access to point tracks/depth for training (SfM or synthetic); model/data licensing.

Long-Term Applications

These require further research, domain-specific data, scaling, or regulatory approvals before deployment in safety-critical or large-scale settings.

Autonomous Driving/Robotics — 3D-aware perception and planning
- Use case: Robust multi-camera/multi-view perception for occlusion handling, map consistency, and language-grounded planning.
- Dependencies: Extensive closed-loop validation; integration with HD maps and sensor fusion; safety certification; adverse weather domain adaptation.
Clinical Healthcare — Surgical/endoscopic video understanding
- Use case: Instrument/lesion tracking and spatial referencing during procedures; postoperative scene summarization.
- Dependencies: Regulated datasets and annotations; rigorous performance and bias studies; FDA/CE approvals; robust domain generalization.
City-Scale Multi-Camera Analytics — Cross-camera re-identification and flow
- Use case: Privacy-preserving flow analysis and crowd management across a camera network using language queries.
- Dependencies: Strong privacy-by-design (on-edge, anonymization); fairness audits; legal frameworks for surveillance; calibration across heterogeneous cameras.
3D Reconstruction from Casual Video — “No-explicit-3D encoder” pipelines
- Use case: Structure-from-video with GASP-enhanced correspondences and depth-consistency priors for faster, lighter reconstructions.
- Dependencies: Integration with BA/SfM; camera calibration and scale handling; robustness to rolling shutter and motion blur.
General-Purpose Home Embodied Agents — Reliable spatial manipulation in clutter
- Use case: Assistive robots that reason over multi-view scenes via language to fetch, tidy, or assemble items.
- Dependencies: Scaling to larger VLM backbones; multimodal fusion (touch/force); long-horizon planning; human-in-the-loop safety.
Generative Media — 3D-consistent video generation/editing
- Use case: Controllable edits preserving object identity and depth relations across shots.
- Dependencies: Tight coupling of GASP-like geometric priors with diffusion/transformer video generators; evaluation protocols for 3D consistency.
AR Learning Environments — Spatially-aware lab companions
- Use case: Real-time guidance on lab setups (e.g., circuit assembly, physics experiments) via multi-view understanding.
- Dependencies: On-device performance; robust tracking under occlusion and clutter; teacher/admin controls and data governance.
Energy/Utilities Robotics — Fully autonomous inspection and repair
- Use case: Robots that localize faults and navigate complex structures using language instructions and robust 3D reasoning.
- Dependencies: Extreme environment robustness; fail-safe navigation; certification for critical infrastructure operations.
Finance/Insurance at Scale — Automated risk assessment from geo-tagged videos
- Use case: Large-scale property risk scoring from standardized capture flows.
- Dependencies: Standard capture protocols; bias mitigation; regulatory acceptance; explainability of spatial judgments.
Standards and Policy — Spatial robustness benchmarks for procurement
- Use case: Mandate spatial-consistency metrics (e.g., temporal robustness, confidence-accuracy correlation) in public-sector VLM RFPs.
- Dependencies: Community consensus on metrics; reference datasets with rights and privacy protection; certification bodies.

Cross-cutting assumptions and dependencies

Training data: GASP needs point correspondences and depth supervision; feasible with SfM/NeRF pipelines or synthetic data when ground truth is unavailable.
Compute and tooling: LoRA-based fine-tuning with a small correspondence head (discarded at inference) fits common GPU clusters; minimal inference overhead.
Domain shift: Spatial gains may trade off 1–2% on action-centric QA; consider multi-objective training when general VQA is critical.
Calibration and safety: For pose-sensitive or control applications, camera/lidar calibration and safety monitors remain essential.
Legal/ethical: Deployments involving people or private spaces must implement privacy, transparency, and bias audits.

Glossary

AdamW: An optimizer that decouples weight decay from the gradient-based update for better regularization. "AdamW optimizer with a cosine learning rate schedule (peak 1e-4)"
BEV maps: Bird’s-Eye-View top-down representations of 3D scenes used for spatial reasoning. "BEV maps"
bfloat16: A 16-bit floating-point format that preserves exponent range for stable mixed-precision training. "bfloat16 mixed-precision"
contrastive loss: A learning objective that pulls positives together and pushes negatives apart in embedding space. "a contrastive loss on ground-truth point correspondences"
cosine learning rate schedule: A schedule that varies the learning rate following a cosine curve to improve convergence. "a cosine learning rate schedule (peak 1e-4)"
cosine similarity matrix: A matrix of pairwise cosine similarities used to match features across frames or modalities. "pairwise cosine similarity matrix"
correspondence head: A lightweight module attached to model layers to produce embeddings specialized for matching correspondences. "a small correspondence head"
cross-modal attention: Attention operating across different modalities (e.g., vision and language) within a transformer. "and cross-modal attention."
depth consistency loss: A loss that enforces predicted correspondences to be consistent with ground-truth depth to resolve 3D ambiguities. "The depth consistency loss then measures"
dual-encoder architectures: Systems with separate encoders (e.g., for vision and language) whose outputs are fused for multimodal reasoning. "dual-encoder architectures or grounding agents"
GELU: Gaussian Error Linear Unit, an activation function that blends linear and nonlinear behaviors. "with GELU activation"
geometric priors: Prior knowledge or constraints about 3D geometry injected to bias learning toward spatially consistent representations. "injects geometric priors directly into the LLM's transformer layers"
gradient checkpointing: A memory-saving technique that recomputes activations during backpropagation to reduce GPU memory usage. "gradient checkpointing"
gradient norm clipping: A stabilization technique that caps the norm of gradients to prevent exploding updates. "gradient norm clipping of 1.0"
ground-truth depth maps: Depth annotations used as supervision for enforcing 3D geometric consistency. "ground-truth depth maps"
ground-truth point correspondences: Annotations linking the same physical point across images/frames used to supervise matching. "ground-truth point correspondences"
InfoNCE: A contrastive objective that normalizes over positives and negatives with a temperature to learn discriminative embeddings. "We employ the InfoNCE contrastive loss"
inductive bias: Built-in assumptions or constraints in a model that guide it toward particular solutions. "inject a robust inductive bias"
LLM backbone: The core LLM component that processes token sequences and supplies representations to the VLM. "LLM backbone"
LoRA rank: The rank parameter controlling the capacity of Low-Rank Adaptation modules used for efficient fine-tuning. "a LoRA rank of 512"
MLP: Multi-Layer Perceptron; a stack of linear layers with nonlinearities for feature projection. "a lightweight 2-layer MLP"
motion parallax: Apparent motion of objects between views due to camera movement, providing depth cues. "rich motion parallax"
object constancy: The ability to recognize the same object across viewpoint changes and occlusions. "Learning this object constancy"
Pearson correlation coefficient: A statistic measuring linear correlation between two variables, used here for confidence-accuracy calibration. "Pearson correlation coefficient"
PCK (Percentage of Correct Keypoints): A metric for correspondence accuracy based on whether predicted points fall within a threshold. "percentage of correct keypoints (PCK)"
positional bias: A systematic tendency to prefer certain positions regardless of content, leading to miscalibrated predictions. "a statistical signature of positional bias"
QK-matching: Comparing query and key token projections (QK) to quantify consistency or correspondence across tokens/frames. "QK-matching is a key metric"
reinforcement learning (RL): Learning via trial-and-error with rewards, used in some VLM post-training strategies. "reinforcement learning (RL)"
scaled dot-product attention: The core transformer mechanism that computes attention weights from scaled query-key dot products. "scaled dot-product attention"
self-attention: An attention mechanism where tokens attend to other tokens within the same sequence, here within and across modalities. "visual self-attention"
Soft-Argmax: A differentiable approximation of argmax that returns an expectation over positions using softmax weights. "Soft-Argmax formulation"
supervised fine-tuning (SFT): Post-training on labeled examples to specialize a model for downstream tasks. "supervised fine-tuning (SFT)"
SVD decomposition: Singular Value Decomposition; factorization used here to initialize projection weights. "SVD decomposition"
temperature hyperparameter: A scalar in contrastive/softmax objectives that controls the sharpness of the probability distribution. "temperature hyperparameter"
temporal robustness: Stability of performance as the time gap between frames increases. "temporal robustness"
view-invariant representations: Embeddings that remain consistent across different viewpoints of the same scene/object. "view-invariant 2D representations"
voxel space: A 3D grid representation where each cell (voxel) encodes properties of the volume. "voxel space"
visual tokens: Tokenized representations of image/video patches fed into a transformer alongside language tokens. "visual tokens"
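
As an illustration of the Soft-Argmax entry above, the sketch below turns a list of match scores into softmax weights and returns the weighted average of candidate positions. It is a generic version of the technique; the function names, the temperature parameter, and the toy inputs are assumptions for the example, and how the paper combines this with depth supervision is not shown here.

```python
import math

def softmax(scores, temperature=1.0):
    # Convert scores to probabilities; lower temperature gives a sharper distribution.
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def soft_argmax(scores, positions, temperature=1.0):
    # Expected position under the softmax distribution over candidate positions.
    weights = softmax(scores, temperature)
    dims = len(positions[0])
    return tuple(
        sum(w * p[d] for w, p in zip(weights, positions)) for d in range(dims)
    )

# Equal scores: the result is the midpoint of the two candidates.
midpoint = soft_argmax([0.0, 0.0], [(0.0, 0.0), (2.0, 4.0)])
print(midpoint)

# One dominant score at low temperature: the result approaches a hard argmax.
peaked = soft_argmax(
    [10.0, 0.0, 0.0], [(3.0, 7.0), (0.0, 0.0), (9.0, 9.0)], temperature=0.1
)
print(peaked)
```

Because the output is a smooth function of the scores, gradients can flow through the predicted location, which a hard argmax would not allow.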
