Token reduction that preserves semantic saliency and spatial coverage for 3D QA
Develop a token reduction strategy for multi-view image–based 3D question answering with vision–language models that simultaneously preserves semantically salient visual tokens corresponding to key objects and ensures broad spatial coverage across the 3D scene, thereby supporting robust 3D reasoning under aggressive token budgets.
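To make the two requirements concrete, here is a minimal sketch of one possible baseline. It is not the method of the cited paper, and it is not a proposed solution to the open problem. It assumes each visual token already has a scalar saliency score and a 3D position (for example, a back-projected patch center); how those are obtained is left open. The names `select_tokens`, `saliency`, `positions`, and `salient_fraction` are hypothetical. Part of the budget goes to the highest-saliency tokens, and the remainder is filled by farthest point sampling so that the kept tokens spread across the scene.

```python
import numpy as np


def select_tokens(saliency, positions, budget, salient_fraction=0.5):
    """Return sorted indices of the tokens to keep.

    saliency:  (N,) per-token importance scores (assumed given).
    positions: (N, 3) per-token 3D locations (assumed given).
    budget:    maximum number of tokens to keep.
    salient_fraction: share of the budget reserved for top-saliency tokens
                      (a hypothetical knob, not taken from the source).
    """
    saliency = np.asarray(saliency, dtype=float)
    positions = np.asarray(positions, dtype=float)
    n = saliency.shape[0]
    budget = min(budget, n)
    if budget <= 0:
        return np.empty(0, dtype=int)

    # Stage 1: keep the most salient tokens (at least one).
    n_salient = min(budget, max(1, int(round(budget * salient_fraction))))
    order = np.argsort(-saliency, kind="stable")
    kept = [int(i) for i in order[:n_salient]]

    # Distance from every token to its nearest kept token.
    diffs = positions[:, None, :] - positions[kept][None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    min_dist[kept] = -1.0  # kept tokens can never be picked again

    # Stage 2: farthest point sampling fills the rest of the budget,
    # each time adding the token farthest from everything kept so far.
    while len(kept) < budget:
        nxt = int(np.argmax(min_dist))
        kept.append(nxt)
        d = np.linalg.norm(positions - positions[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
        min_dist[nxt] = -1.0

    return np.sort(np.array(kept, dtype=int))


# Toy scene: three tokens clustered near the origin, two far away.
saliency = [0.9, 0.8, 0.1, 0.2, 0.05]
positions = [[0, 0, 0], [0.1, 0, 0], [10, 0, 0], [0.2, 0, 0], [0, 10, 0]]
print(select_tokens(saliency, positions, budget=4))
```

In the toy scene, a pure top-4 saliency selection would keep all three clustered tokens and only one distant one. The coverage stage instead spends the second half of the budget on the two distant tokens. A fixed split like this is only a starting point: it does not adapt to the question or to the scene, which is part of what the open problem asks for.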
References
Therefore, an open challenge remains to design a token reduction strategy that jointly preserves salient visual semantics and broad spatial coverage for robust 3D question answering.
— SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering
(2603.29437 - Li et al., 31 Mar 2026) in Section 2.3 (Related Work: Token Reduction for VLMs)