Pretraining a specialized tokenizer for camera maps

Develop and pretrain a tokenizer specialized for pixel-wise camera maps (e.g., Perspective Field encodings of up-vector and latitude), rather than reusing the image VAE tokenizer, to improve the fidelity of geometric conditioning in the Puffin framework.

Background

To condition the diffusion model, the framework normalizes the pixel-wise camera maps and encodes them with the image tokenizer (VAE). The reuse is possible because the maps have three channels, but geometric fields differ from natural images in both statistics and semantics.
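The normalization step can be illustrated with a minimal sketch. It shows one plausible way to pack an up-vector field and a latitude field into a three-channel map in the [-1, 1] range that image VAEs commonly expect. The function name `pack_camera_map`, the channel order, the latitude range of [-π/2, π/2] radians, and the scaling itself are all assumptions made for illustration; the source does not specify how Puffin normalizes its maps.

```python
import numpy as np


def pack_camera_map(up, lat):
    """Pack per-pixel up-vectors and latitude into an (H, W, 3) map in [-1, 1].

    Hypothetical helper, not the Puffin implementation.
    up:  (H, W, 2) array of 2D up-vectors in image coordinates.
    lat: (H, W) array of latitudes in radians, assumed in [-pi/2, pi/2].
    """
    up = np.asarray(up, dtype=np.float32)
    lat = np.asarray(lat, dtype=np.float32)
    if up.ndim != 3 or up.shape[-1] != 2 or up.shape[:2] != lat.shape:
        raise ValueError("expected up of shape (H, W, 2) and lat of shape (H, W)")

    # Rescale each up-vector to unit length so both components lie in [-1, 1].
    norm = np.linalg.norm(up, axis=-1, keepdims=True)
    up_unit = up / np.maximum(norm, 1e-8)

    # Map latitude from [-pi/2, pi/2] to [-1, 1] (assumed convention).
    lat_scaled = np.clip(lat / (np.pi / 2), -1.0, 1.0)

    # Channels: up_x, up_y, latitude.
    return np.concatenate([up_unit, lat_scaled[..., None]], axis=-1)


if __name__ == "__main__":
    h, w = 4, 6
    up = np.zeros((h, w, 2), dtype=np.float32)
    up[..., 1] = -1.0  # up points toward the top of the image
    lat = np.linspace(np.pi / 4, -np.pi / 4, h, dtype=np.float32)[:, None].repeat(w, axis=1)
    camera_map = pack_camera_map(up, lat)
    print(camera_map.shape, camera_map.min(), camera_map.max())
```

A map packed this way has the same shape and value range as a normalized RGB image, which is what allows it to be passed to an image tokenizer unchanged. Its channels, however, carry directions and angles rather than colors, which is the mismatch a dedicated tokenizer would address.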

The authors explicitly defer pretraining a dedicated tokenizer for camera maps to future work, which leaves a concrete architectural and training component unresolved.

References

Pretraining a specialized tokenizer for camera maps is left as future work.

— Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation (2510.08673 - Liao et al., 9 Oct 2025) in Section 6.1 (Implementation Details), Network Configuration
