Imaginative Perception Tokens (IPT)
- Imaginative Perception Tokens (IPT) are visually grounded representations that enable models to generate plausible, unseen spatial views for reasoning about occlusion and perspective changes.
- Integrated within the BAGEL architecture, IPT leverages dual expert pathways to produce and decode latent generation tokens, maintaining spatial computation in the visual modality.
- Empirical analyses demonstrate that IPT improves spatial reasoning accuracy across various tasks by aligning visual and geometric computations effectively.
Imaginative Perception Tokens (IPT) formalize and externalize the process by which vision-language models (VLMs) infer or “imagine” unobserved spatial structure to solve tasks whose answers cannot be read directly from the visible input. An IPT serves as an intermediate, visually grounded representation of what a model would perceive from an alternative, unobserved viewpoint or spatial configuration, providing a direct, learnable substrate for reasoning about occlusion, viewpoint changes, or cross-view aggregation. Implemented within the unified VLM architecture BAGEL, the approach differs from textual chain-of-thought (CoT) methods by keeping spatial computation in the visual modality, where it aligns natively with geometric transformation and inference (Bigverdi et al., 2 Jun 2026).
1. Formal Definition and Motivation
Let $I$ denote the set of observed images, $q$ a natural-language spatial query, $\hat{I}$ the imagined (unobserved) image view, and $a$ the answer. IPT decomposes spatial reasoning into two generative steps:
- $p(\hat{I} \mid I, q)$: production of a consistent imagined intermediate image.
- $p(a \mid I, q, \hat{I})$: answer prediction conditioned on both observed and imagined content.
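Read together, the two steps amount to a factorization of the joint distribution over the imagined view and the answer. The symbols used here ($I$ for the observed images, $q$ for the query, $\hat{I}$ for the imagined view, $a$ for the answer) are notation assumed for illustration, and the marginal in the second line is a standard probabilistic identity rather than a formula stated by the source.

```latex
\begin{align}
p(\hat{I}, a \mid I, q) &= p(\hat{I} \mid I, q)\; p(a \mid I, q, \hat{I}), \\
p(a \mid I, q) &= \int p(a \mid I, q, \hat{I})\; p(\hat{I} \mid I, q)\, d\hat{I}.
\end{align}
```

A pipeline that generates one imagined view and then conditions the answer on it can be read as a single-sample approximation of the second line.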
In BAGEL, image processing employs two expert pathways: a ViT encoder (SigLIP2) produces “understanding” tokens, while a FLUX VAE yields “generation” tokens. Imaginative Perception Tokens are specifically the generation-expert tokens predicted for the imagined view, arranged as a grid of VAE latents (e.g., at latent-64 resolution); decoding these tokens reconstructs the imagined intermediate image.
Unlike conventional visual embeddings, which are restricted to observed content, or CoT, which forces geometric operations into the linguistic modality, IPTs explicitly predict the visual content of an unobserved vantage point. This mechanism is motivated by human spatial reasoning, which often involves generating internal visualizations of alternative perspectives or occluded paths. IPT thus offers a modality-native mechanism for imagination, which is critical for spatial queries whose solutions are not photometrically explicit in the input (Bigverdi et al., 2 Jun 2026).
2. Model Architecture and Learning Framework
BAGEL is a decoder-only Transformer structured as a Mixture-of-Transformer-Experts (MoT):
- The understanding expert operates over text and ViT tokens.
- The generation expert controls the VAE tokens.
- Both experts participate in shared self-attention at all layers, ensuring lossless cross-modal information flow.
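The routing pattern in the list above can be illustrated with a toy layer. This is a minimal sketch, not BAGEL's implementation: the names `route`, `mot_layer`, and `EXPERT_SCALE` are hypothetical, each expert is reduced to a single scalar transform, and the attention here is unmasked and has no learned projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def shared_self_attention(vectors):
    """Every token attends to every token, regardless of modality (toy: no mask)."""
    scale = math.sqrt(len(vectors[0]))
    mixed = []
    for q in vectors:
        weights = softmax([dot(q, k) / scale for k in vectors])
        mixed.append([sum(w * v[d] for w, v in zip(weights, vectors))
                      for d in range(len(q))])
    return mixed

# Hypothetical stand-ins for expert-specific weights (assumption for illustration).
EXPERT_SCALE = {"understanding": 1.0, "generation": -1.0}

def route(modality):
    """Text and ViT tokens go to the understanding expert, VAE tokens to generation."""
    return "generation" if modality == "vae" else "understanding"

def mot_layer(tokens):
    """tokens: list of (modality, vector) pairs; returns the same structure."""
    mixed = shared_self_attention([vec for _, vec in tokens])
    out = []
    for (modality, _), hidden in zip(tokens, mixed):
        s = EXPERT_SCALE[route(modality)]
        out.append((modality, [s * x for x in hidden]))
    return out

if __name__ == "__main__":
    sequence = [("text", [1.0, 0.0]), ("vit", [0.0, 1.0]), ("vae", [1.0, 1.0])]
    for modality, vec in mot_layer(sequence):
        print(modality, route(modality), vec)
```

The point of the sketch is the split of responsibilities: attention mixes information across all modalities in one pass, while the per-token transform that follows depends on which expert owns the token.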
During training, the context (text tokens, ViT tokens, and VAE tokens of the observed images) is followed by a generation block in which BAGEL:
- Predicts a sequence of imagined generation tokens via iterative flow matching.
- Decodes these tokens to the imagined image, immediately re-encodes that image into both ViT and VAE tokens, and appends them to the context.
- Autoregressively generates the answer tokens.
The overall objective combines imagination supervision and answer supervision through two loss terms:
- $\mathcal{L}_{\text{flow}}$: flow-matching loss, enforcing the denoising velocity in latent space.
- $\mathcal{L}_{\text{LM}}$: standard language-modeling loss for the answer sequence.
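To make the flow-matching term concrete, here is a minimal sketch of a velocity-regression loss and the matching Euler sampler. It assumes a linear noise-to-data path with time running from noise at t=0 to data at t=1; the source does not specify BAGEL's parameterization, and the function names are illustrative only.

```python
import random

def interpolate(noise, data, t):
    """Point on the linear path x_t = (1 - t) * noise + t * data."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Along the linear path, dx_t/dt = data - noise for every t."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predict_velocity, data, rng):
    """Mean squared error between predicted and target velocity at a random t."""
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()
    x_t = interpolate(noise, data, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, data)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(data)

def euler_sample(predict_velocity, noise, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    latent = [0.5, -1.0, 2.0]  # stand-in for a flattened VAE latent

    def oracle(x, t):
        # Exact velocity when the target is known: (data - x_t) / (1 - t).
        return [(d - xi) / (1.0 - t) for d, xi in zip(latent, x)]

    rng = random.Random(0)
    print("oracle loss:", flow_matching_loss(oracle, latent, rng))
    print("sampled:", euler_sample(oracle, [0.0, 0.0, 0.0], steps=4))
```

In a real model the oracle would be replaced by the network's velocity prediction conditioned on the full multimodal context, and the loss would be averaged over many latents and time steps.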
Inference can proceed in “text-only” mode (directly predicting the answer) or in “imagination” mode, where the imagined latent is denoised and decoded, the resulting image is appended to the context, and then the answer is predicted (Bigverdi et al., 2 Jun 2026).
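The control flow of the two inference modes can be sketched as follows. Everything here is a hypothetical stand-in: `run_inference` and its callable parameters are not part of any published BAGEL API, and tokens are represented as plain strings purely to show the order of operations.

```python
from typing import Callable, List, Sequence

Tokens = List[str]

def run_inference(
    context: Sequence[str],
    mode: str,
    denoise_latent: Callable[[Sequence[str]], str],
    decode_image: Callable[[str], str],
    encode_vit: Callable[[str], Tokens],
    encode_vae: Callable[[str], Tokens],
    predict_answer: Callable[[Sequence[str]], str],
) -> str:
    """Answer a query either directly or after imagining an intermediate view."""
    ctx = list(context)
    if mode == "imagination":
        latent = denoise_latent(ctx)   # imagined generation tokens
        image = decode_image(latent)   # decoded imagined view
        # Re-encode the imagined image for both experts and extend the context.
        ctx += encode_vit(image) + encode_vae(image)
    elif mode != "text-only":
        raise ValueError(f"unknown mode: {mode!r}")
    return predict_answer(ctx)

if __name__ == "__main__":
    # Stub components; the stub "answer" is just the context length.
    stubs = dict(
        denoise_latent=lambda ctx: "latent",
        decode_image=lambda latent: "image",
        encode_vit=lambda image: ["vit_0", "vit_1"],
        encode_vae=lambda image: ["vae_0", "vae_1"],
        predict_answer=lambda ctx: str(len(ctx)),
    )
    base = ["text", "vit_obs", "vae_obs"]
    print(run_inference(base, "text-only", **stubs))
    print(run_inference(base, "imagination", **stubs))
```

The only difference between the modes is whether the answer predictor sees the extra tokens derived from the imagined view.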
3. Spatial Reasoning Tasks and Benchmark Datasets
IPT efficacy is established via three spatial reasoning paradigms, each dependent on the ability to “imagine” unobserved spatial structure. Training is primarily on AI2-THOR simulator data (additional sources are listed in the table below), with human-verified evaluation sets used to assess generalization.
| Task | Input / Query | Imagination Target | Train Size | Eval Sources |
|---|---|---|---|---|
| Perspective Taking (PET) | 1st-person RGB, target X | Render from novel camera pose | AI2-THOR: 20,531<br>Habitat: 19,998<br>VST: 15,000 | AI2-THOR† 238<br>Habitat† 300 |
| Path Tracing (PT) | Map+path, arr, or egocentric views | Side-view render at midpoint M₁ | AI2-THOR: 11,204 | AI2-THOR† 329<br>Matterport 332 |
| Multiview Counting (MVC) | 4–8 egocentric frames | Top-down BEV map with all object instances localized | AI2-THOR: 17,079<br>MessyTable: 1,880<br>ScanNet++: 540 | AI2-THOR† 260 |
Each task is designed such that the correct answer depends on inferences about unobserved geometry, viewpoint, or object aggregation, necessitating a reliable imagination mechanism (Bigverdi et al., 2 Jun 2026).
4. Empirical Results and Analysis
Experimental results show where IPT supervision improves performance. On in-domain AI2-THOR benchmarks (accuracy, %):
| Model | PET | PT | MVC |
|---|---|---|---|
| BAGEL (base) | 40.3 | 29.9 | 35.4 |
| BAGEL (label-only) | 97.5 | 65.7 | 63.9 |
| + Text CoT | 83.1 | 49.7 | 62.3 |
| + IPT | 96.8 | 49.0 | 67.3 |
| + Mixed Train | 97.8 | 66.7 | 62.3 |
- IPT outperforms label-only training by 3.4 pp on MVC (67.3 vs 63.9); mixed training adds 1.0 pp on PT (66.7 vs 65.7); and on out-of-domain PET (Habitat) the gain is 5.0 pp (87.0 vs 82.0).
- Forcing spatial computation into the linguistic modality via textual CoT notably degrades performance (PET drops from 97.5% to 83.1%).
- Ablation on VAE latent resolution reveals that increased resolution improves the fidelity of imagined views and systematically improves downstream accuracy (e.g., PET Latent-64: 96.8%/83.3% on AI2-THOR/Habitat).
- Mixed training (IPT + label-only) yields improved generalization across simulated and real-world evaluations (e.g., PT real 58.6% vs 54.7% label-only) (Bigverdi et al., 2 Jun 2026).
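The in-domain comparisons can be recomputed directly from the table. The snippet below is a small sketch that copies the reported accuracies and prints each variant's difference from label-only training in percentage points; the helper name `delta_vs` is illustrative.

```python
# Accuracy (%) on in-domain AI2-THOR benchmarks, copied from the results table.
RESULTS = {
    "base": {"PET": 40.3, "PT": 29.9, "MVC": 35.4},
    "label-only": {"PET": 97.5, "PT": 65.7, "MVC": 63.9},
    "+ Text CoT": {"PET": 83.1, "PT": 49.7, "MVC": 62.3},
    "+ IPT": {"PET": 96.8, "PT": 49.0, "MVC": 67.3},
    "+ Mixed Train": {"PET": 97.8, "PT": 66.7, "MVC": 62.3},
}

def delta_vs(baseline, results=RESULTS):
    """Per-task difference (pp, rounded to 0.1) of every other row from `baseline`."""
    ref = results[baseline]
    return {
        name: {task: round(score - ref[task], 1) for task, score in row.items()}
        for name, row in results.items()
        if name != baseline
    }

if __name__ == "__main__":
    for name, row in delta_vs("label-only").items():
        print(f"{name:>14}: {row}")
```

Laid out this way, the table shows IPT alone leading on MVC, while mixed training gives the highest in-domain PET and PT scores.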
5. Interpretability and Modality-Specific Calculation
IPT enables model interpretability by externalizing the “imagined” intermediate for human inspection. Visualizing generated views for PET, PT, and MVC reveals that the predictions often capture correct spatial configurations, object positions, and aggregation. Failure cases are informative, showing which relationships were mis-imagined, aiding in targeted error analysis.
Unlike textual chain-of-thought, IPT’s visually grounded representations are natively compatible with geometric operations, as the flow-matching loss aligns the learned denoising trajectory with real spatial transformations. This modality alignment both improves accuracy and yields inspectable images that illuminate the model’s internal spatial beliefs and “mental imagery” (Bigverdi et al., 2 Jun 2026).
6. Generalization, Cross-Domain, and Broader Impact
Performance transfers across simulation platforms and into real-world data, as indicated by:
- Cross-task transfer: SAT (external PET) accuracy improves from 34.9% (base) to 63.6% (mixed IPT); MessyTable (external counting) from 29.0% to 37.0%.
- Fine-tuning on AI2-THOR MVC with label-only data leads to improvements on disparate spatial benchmarks, including ScanNet multiview counting, MindCube, and All-Angles cross-view matching.
These results suggest that IPT supervision, especially in combination with diverse simulator data, establishes spatial priors that are broadly useful across spatial reasoning domains (Bigverdi et al., 2 Jun 2026).
In summary, Imaginative Perception Tokens provide an efficient, interpretable, and modality-aligned strategy for equipping VLMs with the capacity to reason about unobserved spatial structure, improving accuracy and domain robustness in spatial reasoning tasks.