Token reduction that preserves semantic saliency and spatial coverage for 3D QA

Develop a token reduction strategy for multi-view, image-based 3D question answering with vision–language models that simultaneously preserves semantically salient visual tokens corresponding to key objects and maintains broad spatial coverage across the 3D scene, so that 3D reasoning remains robust under aggressive token budgets.
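
One way to make the stated trade-off concrete is as a budgeted subset-selection objective; the notation below ($s_i$, $\mathrm{cov}$, $\lambda$, $B$) is illustrative and not taken from the paper:

$$
\max_{S \subseteq \{1,\dots,N\},\; |S| = B} \;\; \sum_{i \in S} s_i \;+\; \lambda \,\mathrm{cov}(S)
$$

where $s_i$ is the semantic saliency of visual token $i$ (e.g., text-to-image attention mass), $\mathrm{cov}(S)$ rewards spatial dispersion of the selected tokens across the scene (e.g., the sum of nearest-neighbor distances among their back-projected 3D positions), and $\lambda$ trades off the two terms under token budget $B$.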

Background

Multi-view inputs to vision–language models create substantial token redundancy, and the problem is acute in 3D question answering, where both semantic cues and geometric context are crucial. Existing 2D pruning/merging methods lack spatial awareness: they tend to discard geometrically informative tokens or to concentrate the retained budget in a few image regions, either of which harms 3D reasoning.
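
As a point of reference, below is a minimal sketch of such a saliency-only 2D pruner; the function name and the assumption that per-token saliency scores are available are ours, not the paper's:

```python
import torch

def prune_by_saliency(tokens: torch.Tensor, saliency: torch.Tensor, budget: int) -> torch.Tensor:
    """Keep the `budget` highest-scoring visual tokens, ignoring geometry.

    tokens:   (N, D) visual token embeddings, flattened across all views.
    saliency: (N,)   per-token scores, e.g. text-to-image attention mass.
    """
    keep = torch.topk(saliency, k=budget).indices
    return tokens[keep]
```

Because selection depends only on the scores, tokens from a single salient view can consume the entire budget, leaving the rest of the scene unrepresented; this is the spatial blindness described above.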

Recent 3D-aware approaches either reduce inputs at the image level without modeling token-level redundancy, or treat geometry only as an auxiliary signal; neither explicitly balances retaining semantically critical tokens against ensuring broad spatial diversity. This gap motivates a reduction strategy that preserves both object-level evidence and global scene coverage, as sketched below.
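
For illustration, the sketch below greedily balances the two criteria in a maximal-marginal-relevance style over back-projected 3D token positions. It is a minimal sketch under our own assumptions (per-token saliency scores and 3D positions are available), not the method of the cited paper:

```python
import torch

def select_salient_and_covering(
    saliency: torch.Tensor,   # (N,) per-token semantic scores
    positions: torch.Tensor,  # (N, 3) back-projected 3D token positions (assumed available)
    budget: int,
    lam: float = 0.5,         # trade-off: 0 = saliency only; larger = more coverage
) -> torch.Tensor:
    """Greedily select tokens that are semantically salient AND spatially
    spread out, by rewarding distance to the nearest already-selected token."""
    selected = [int(torch.argmax(saliency))]  # seed with the most salient token
    # distance from every token to its nearest selected token
    min_dist = torch.cdist(positions, positions[selected]).squeeze(1)
    for _ in range(budget - 1):
        gain = saliency + lam * min_dist              # joint objective
        gain[torch.tensor(selected)] = float("-inf")  # never re-pick a token
        idx = int(torch.argmax(gain))
        selected.append(idx)
        # update each token's distance to its nearest selected token
        min_dist = torch.minimum(
            min_dist, torch.cdist(positions, positions[idx : idx + 1]).squeeze(1)
        )
    return torch.tensor(selected)  # indices of the kept tokens
```

With `lam = 0` this collapses to saliency-only pruning; increasing `lam` pushes the selection toward farthest-point-style coverage of the scene.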

Therefore, an open challenge remains to design a token reduction strategy that jointly preserves salient visual semantics and broad spatial coverage for robust 3D question answering.

References

Li et al., "SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering," arXiv:2603.29437, 31 Mar 2026. See Section 2.3 (Related Work: Token Reduction for VLMs).