Imaginative Perception Tokens (IPT)

Updated 3 July 2026

Imaginative Perception Tokens (IPT) are visually grounded representations that enable models to generate plausible, unseen spatial views for reasoning about occlusion and perspective changes.
Integrated within the BAGEL architecture, IPT leverages dual expert pathways to produce and decode latent generation tokens, maintaining spatial computation in the visual modality.
Empirical analyses demonstrate that IPT improves spatial reasoning accuracy across various tasks by aligning visual and geometric computations effectively.

Imaginative Perception Tokens (IPT) formalize and externalize the process by which vision-language models (VLMs) infer or “imagine” unobserved spatial structure to solve tasks whose answers cannot be read directly from the visible input. They serve as an intermediate, visually grounded representation of what a model would perceive from an alternative, unobserved viewpoint or spatial configuration, providing a direct, learnable substrate for reasoning about occlusion, viewpoint changes, or cross-view aggregation. This approach, implemented within the unified VLM architecture BAGEL, differs from textual chain-of-thought (CoT) approaches by keeping spatial computation in the visual modality, where it is natively aligned with geometric transformation and inference (Bigverdi et al., 2 Jun 2026).

1. Formal Definition and Motivation

Let $\mathcal{I}_{obs} = \{I_1, \ldots, I_k\}$ denote the set of observed images, $Q$ a natural-language spatial query, $\hat{I}_{imag}$ the imagined (unobserved) image view, and $A$ the answer. IPT decomposes spatial reasoning into two generative steps:

$P(\hat{I}_{imag}\mid \mathcal{I}_{obs}, Q)$ : production of a consistent imagined intermediate image.
$P(A\mid \mathcal{I}_{obs}, Q, \hat{I}_{imag})$ : answer prediction conditioned on both observed and imagined content.
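
One way to read the two steps together, offered here as an interpretation rather than a formula taken from the source, is as a marginalization over the imagined view. In practice this would presumably be approximated by a single generated sample $\hat{I}^{*}_{imag}$ (a symbol introduced only for this illustration):

```latex
P(A \mid \mathcal{I}_{obs}, Q)
  = \int P(A \mid \mathcal{I}_{obs}, Q, \hat{I}_{imag})\,
         P(\hat{I}_{imag} \mid \mathcal{I}_{obs}, Q)\, d\hat{I}_{imag}
  \approx P(A \mid \mathcal{I}_{obs}, Q, \hat{I}^{*}_{imag}),
\qquad
\hat{I}^{*}_{imag} \sim P(\hat{I}_{imag} \mid \mathcal{I}_{obs}, Q)
```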

In BAGEL, image processing employs two expert pathways: a ViT encoder (SigLIP2) generates “understanding” tokens $U$, while a FLUX VAE yields “generation” tokens $G$. Imaginative Perception Tokens are specifically the generation-expert tokens $G_{imag}\in\mathbb{R}^{M\times d_g}$, e.g., a $64 \times 64$ grid for latent-64 resolution; decoding these tokens reconstructs the imagined intermediate image.
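
To make the token-count arithmetic concrete, the short sketch below computes the shape $(M, d_g)$ of $G_{imag}$ for a square latent grid. The function name and the width `d_g = 16` are placeholders chosen for illustration; the source does not state the value of $d_g$.

```python
def ipt_token_shape(grid_side: int, d_g: int) -> tuple[int, int]:
    """Return (M, d_g) for a square latent grid with grid_side tokens per side."""
    if grid_side <= 0 or d_g <= 0:
        raise ValueError("grid_side and d_g must be positive")
    # A grid_side x grid_side grid flattens to M = grid_side**2 tokens.
    return grid_side * grid_side, d_g


# d_g = 16 is a placeholder width, not a value taken from the source.
M, d_g = ipt_token_shape(grid_side=64, d_g=16)
print(M, d_g)  # prints: 4096 16
```

A latent-64 grid therefore corresponds to $M = 4096$ generation tokens per imagined view.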

Unlike conventional visual embeddings restricted to observed content—or CoT, which forces geometric operations into the linguistic modality—IPT tokens explicitly predict the visual content of an unobserved vantage. This mechanism is motivated by human spatial reasoning, which often involves generating internal visualizations of alternative perspectives or occluded paths. IPT thus offers a modality-native mechanism for imagination, critical for spatial queries whose solutions are not photometrically explicit in the input (Bigverdi et al., 2 Jun 2026).

2. Model Architecture and Learning Framework

BAGEL is a decoder-only Transformer structured as a Mixture-of-Transformer-Experts (MoT):

The understanding expert operates over text tokens and ViT tokens $U$.
The generation expert controls VAE tokens $G$.
Both experts participate in shared self-attention at all layers, ensuring lossless and cross-modal information flow.
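
The routing rule described in this list can be sketched as a toy layer: all tokens are mixed by one shared attention step, then each token is processed by the expert assigned to its type. Everything here is hypothetical scaffolding (scalar "tokens", a mean-based stand-in for attention with no causal mask, and lambda experts); it illustrates only the dispatch pattern, not BAGEL's actual implementation.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, float]  # (kind, value); kinds are "text", "vit", or "vae"

# Assumed mapping from token kind to expert, following the description above.
EXPERT_FOR_KIND: Dict[str, str] = {
    "text": "understanding",
    "vit": "understanding",
    "vae": "generation",
}


def shared_attention(values: List[float]) -> List[float]:
    """Toy stand-in for shared self-attention: add the mean of all tokens to each token."""
    mean = sum(values) / len(values)
    return [v + mean for v in values]


def mot_layer(tokens: List[Token], experts: Dict[str, Callable[[float], float]]) -> List[Token]:
    """Mix all tokens jointly, then apply the expert that owns each token kind."""
    mixed = shared_attention([value for _, value in tokens])
    return [
        (kind, experts[EXPERT_FOR_KIND[kind]](value))
        for (kind, _), value in zip(tokens, mixed)
    ]


toy_experts = {
    "understanding": lambda v: 2.0 * v,  # placeholder expert transform
    "generation": lambda v: -v,          # placeholder expert transform
}
toy_tokens = [("text", 1.0), ("vit", 2.0), ("vae", 3.0)]
print(mot_layer(toy_tokens, toy_experts))
```

Because the mixing step sees every token regardless of type, information from text and ViT tokens can reach the VAE tokens (and vice versa) before the type-specific experts are applied.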

During training, the context $[\text{text tokens}, U_{obs}, G_{obs}]$ is followed by a generation block in which BAGEL:

Predicts a sequence of imagined generation tokens $G_{imag}$ via iterative flow-matching.
Decodes these to $\hat{I}_{imag}$, immediately re-encodes the result to both $U_{imag}$ and $G_{imag}$, and appends them to the context.
Autoregressively generates answer tokens $A$.
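
The "iterative flow-matching" step can be pictured as numerically integrating a learned velocity field from noise to a clean latent. The sketch below uses plain Euler steps and assumes the convention that $t=0$ is noise and $t=1$ is data; the source does not specify the sampler, the step count, or the time convention, and `toy_velocity` is a hand-written stand-in for the network.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity field that points straight at a fixed target latent.
# A trained model would predict this velocity from the multimodal context.
target = [1.0, -2.0, 0.5]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


sample = euler_sample(toy_velocity, x0=[0.0, 0.0, 0.0], num_steps=10)
print(sample)  # approximately equal to target
```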

The overall objective combines imagination supervision and answer supervision: $\mathcal{L} = \mathcal{L}_{FM} + \mathcal{L}_{LM}$, where:

$\mathcal{L}_{FM}$: flow-matching loss, enforcing denoising velocity in latent space.
$\mathcal{L}_{LM}$: standard language modeling loss for answer sequence.
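
As a rough illustration of the two terms, the sketch below implements one common form of each: a flow-matching loss that regresses the predicted velocity onto the straight-path target (clean latent minus noise), and a language-modeling loss as the mean negative log-likelihood of the correct answer tokens. The straight-path target, the function names, and the weights `w_flow` and `w_text` are assumptions made for this example; the source does not give the exact formulation or weighting.

```python
import math
from typing import Sequence


def flow_matching_loss(
    pred_velocity: Sequence[float],
    noise: Sequence[float],
    clean: Sequence[float],
) -> float:
    """Mean squared error between predicted velocity and the target (clean - noise)."""
    target = [c - n for c, n in zip(clean, noise)]
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, target)) / len(target)


def language_modeling_loss(correct_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the probabilities assigned to the correct tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


def total_loss(l_flow: float, l_text: float, w_flow: float = 1.0, w_text: float = 1.0) -> float:
    """Weighted sum of the two terms; the weights are hypothetical knobs."""
    return w_flow * l_flow + w_text * l_text


l_flow = flow_matching_loss(pred_velocity=[0.9, 2.1], noise=[0.0, 0.0], clean=[1.0, 2.0])
l_text = language_modeling_loss([0.5, 0.25])
print(total_loss(l_flow, l_text))
```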

Inference can proceed in “text-only” mode (directly predict $A$) or in “imagination” mode, where $G_{imag}$ is denoised and decoded, appended to the context, and then the answer is predicted (Bigverdi et al., 2 Jun 2026).
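
The two inference modes differ only in whether an imagined view is generated and appended before answering. The sketch below shows that control flow with stub components; `ToyComponents`, `predict`, and all the lambdas are hypothetical placeholders, not BAGEL's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyComponents:
    """Hypothetical stand-ins for the model parts used at inference time."""
    denoise: Callable[[List[str]], List[float]]           # context -> imagined latents
    decode: Callable[[List[float]], str]                  # latents -> imagined image
    encode_understanding: Callable[[str], List[str]]      # image -> ViT-style tokens
    encode_generation: Callable[[str], List[str]]         # image -> VAE-style tokens
    generate_answer: Callable[[List[str]], str]           # context -> answer text


def predict(parts: ToyComponents, context: List[str], imagine: bool) -> str:
    """Answer directly (text-only mode) or after appending an imagined view."""
    ctx = list(context)  # copy so the caller's context is not mutated
    if imagine:
        latents = parts.denoise(ctx)
        image = parts.decode(latents)
        ctx += parts.encode_understanding(image)
        ctx += parts.encode_generation(image)
    return parts.generate_answer(ctx)


toy = ToyComponents(
    denoise=lambda ctx: [0.0] * 4,
    decode=lambda latents: "imagined-view",
    encode_understanding=lambda image: ["U_imag"],
    encode_generation=lambda image: ["G_imag"],
    generate_answer=lambda ctx: f"answer conditioned on {len(ctx)} context items",
)
context = ["question", "U_obs", "G_obs"]
print(predict(toy, context, imagine=False))
print(predict(toy, context, imagine=True))
```

In imagination mode the answer is conditioned on a longer context, since the re-encoded imagined view contributes both understanding and generation tokens.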

3. Spatial Reasoning Tasks and Benchmark Datasets

IPT is evaluated on three spatial reasoning tasks, each of which depends on the ability to “imagine” unobserved spatial structure. Training draws mainly on AI2-THOR simulator data, with the additional sources listed in the table, and evaluation sets are human-verified to test generalization.

Task	Input / Query	Imagination Target	Train Size	Eval Sources
Perspective Taking (PET)	1st-person RGB, target X	Render from novel camera pose	AI2-THOR: 20,531; Habitat: 19,998; VST: 15,000	AI2-THOR† 238; Habitat† 300
Path Tracing (PT)	Map+path, arr, or egocentric views	Side-view render at midpoint M₁	AI2-THOR: 11,204	AI2-THOR† 329; Matterport 332
Multiview Counting (MVC)	4–8 egocentric frames	Top-down BEV map with all object instances localized	AI2-THOR: 17,079; MessyTable: 1,880; ScanNet++: 540	AI2-THOR† 260

Each task is designed such that the correct answer depends on inferences about unobserved geometry, viewpoint, or object aggregation, necessitating a reliable imagination mechanism (Bigverdi et al., 2 Jun 2026).

4. Empirical Results and Analysis

Experimental results demonstrate the performance gains enabled by IPT supervision. On in-domain AI2-THOR benchmarks:

Model	PET	PT	MVC
BAGEL (base)	40.3	29.9	35.4
BAGEL (label-only)	97.5	65.7	63.9
+ Text CoT	83.1	49.7	62.3
+ IPT	96.8	49.0	67.3
+ Mixed Train	97.8	66.7	62.3

IPT outperforms label-only training by +3.4 pp on MVC (67.3 vs 63.9), mixed training adds +1.0 pp on PT (66.7 vs 65.7), and IPT gains +5.0 pp on out-of-domain PET (87.0 vs 82.0 on Habitat).
Forcing spatial computation into the linguistic modality via textual CoT notably degrades performance relative to label-only training (PET drops from 97.5% to 83.1%, PT from 65.7% to 49.7%).
An ablation on VAE latent resolution shows that higher resolution improves both the fidelity of imagined views and downstream accuracy (e.g., PET at Latent-64: 96.8% on AI2-THOR and 83.3% on Habitat).
Mixed training (IPT + label-only) yields improved generalization across simulated and real-world evaluations (e.g., PT real 58.6% vs 54.7% label-only) (Bigverdi et al., 2 Jun 2026).
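
The percentage-point differences can be recomputed directly from the in-domain table above. The snippet below is a small bookkeeping aid; the dictionary keys are shorthand labels for the table rows, and `gain_pp` is a helper introduced here.

```python
# In-domain AI2-THOR accuracies (%) copied from the results table.
scores = {
    "base": {"PET": 40.3, "PT": 29.9, "MVC": 35.4},
    "label-only": {"PET": 97.5, "PT": 65.7, "MVC": 63.9},
    "text-cot": {"PET": 83.1, "PT": 49.7, "MVC": 62.3},
    "ipt": {"PET": 96.8, "PT": 49.0, "MVC": 67.3},
    "mixed": {"PET": 97.8, "PT": 66.7, "MVC": 62.3},
}


def gain_pp(model: str, baseline: str, task: str) -> float:
    """Percentage-point difference between two table rows, rounded to one decimal."""
    return round(scores[model][task] - scores[baseline][task], 1)


for model in ("text-cot", "ipt", "mixed"):
    deltas = {task: gain_pp(model, "label-only", task) for task in ("PET", "PT", "MVC")}
    print(model, deltas)
```

Read against the label-only row, the table shows that the gains are task-dependent: IPT alone helps most on MVC, while the PT improvement appears only in the mixed-training row.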

5. Interpretability and Modality-Specific Calculation

IPT enables model interpretability by externalizing the “imagined” intermediate for human inspection. Visualizing generated views for PET, PT, and MVC reveals that the predictions often capture correct spatial configurations, object positions, and aggregation. Failure cases are informative, showing which relationships were mis-imagined, aiding in targeted error analysis.

Unlike textual chain-of-thought, IPT’s visually grounded representations are natively compatible with geometric operations, as the flow-matching loss aligns the learned denoising trajectory with real spatial transformations. This modality alignment both improves accuracy and produces artifact images that illuminate the model’s internal spatial beliefs and “mental imagery” (Bigverdi et al., 2 Jun 2026).

6. Generalization, Cross-Domain, and Broader Impact

Performance transfers across simulation platforms and into real-world data, as indicated by:

Cross-task transfer: SAT (external PET) accuracy improves from 34.9% (base) to 63.6% (mixed IPT); MessyTable (external counting) from 29.0% to 37.0%.
Fine-tuning on AI2-THOR MVC with label-only data leads to improvements on disparate spatial benchmarks, including ScanNet multiview counting, MindCube, and All-Angles cross-view matching.

This suggests that IPT supervision, especially in combination with diverse simulator data, establishes spatial priors that are broadly useful across spatial reasoning domains (Bigverdi et al., 2 Jun 2026).

In summary, Imaginative Perception Tokens provide an efficient, interpretable, and modality-aligned strategy for equipping VLMs with the capacity to reason about unobserved spatial structure, improving accuracy and domain robustness in spatial reasoning tasks.

References (1)

Bigverdi et al. (2026). Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models.
