MapFormers: Cognitive & Spatial Mapping Models
- MapFormers are Transformer-based models that learn and extract cognitive and spatial maps from complex observational data.
- They employ input-dependent positional encodings using rotation-based mechanisms to disentangle structure and content effectively.
- Empirical results show strong OOD generalization and state-of-the-art performance in autonomous mapping, remote sensing, and neuroscience applications.
MapFormers are a class of Transformer-based models designed to enable the learning, representation, and extraction of cognitive and spatial maps from complex observational or visual data. In the literature, the MapFormer nomenclature denotes multiple families of architectures, ranging from models for self-supervised structure-content disentanglement and path integration in sequential data (Rambaud et al., 24 Nov 2025), to large-scale, multi-task vision models for high-definition automated map vectorization (Ivanov et al., 18 Jun 2025), to end-to-end autoregressive polygons-from-image models for satellite remote sensing (Khomiakov et al., 25 Nov 2024). The unifying core across these variants is the use of advanced Transformers—augmented with task-specific structural or positional biases—to encode, disentangle, and decode spatial and relational structures for cognitive mapping, semantic segmentation, and vector representation across AI and neuroscience applications.
1. Structural Motivation: Cognitive Maps and Transformers
The motivation for MapFormers arises from the need to represent cognitive maps—internal models which encode the abstract spatial or relational structure of an environment. Standard attention-based architectures are permutation-invariant; they require explicit positional encoding to model structural relationships. However, cognitive mapping tasks in both neuroscience and robotics demand invariance of action semantics ("step right," "move up") and distinct separation between the effects of actions (spatial displacement) and observations (content at a location).
MapFormers address this by integrating input-dependent positional embeddings that are dynamically updated via the incoming stream of tokens, allowing actions to causally evolve abstract position, while observations only update local content. This explicit structure-content disentanglement is foundational for learning generalizable relational and spatial models that extrapolate beyond the training distribution, both in sequence length and spatial arrangement (Rambaud et al., 24 Nov 2025).
2. Core Mechanisms: Input-Dependent Positional Encoding
MapFormers implement input-dependent positional encodings using a rotation-based mechanism parameterized by the Lie algebra of SO(2):
- For each input token $x_t$, an integration increment $\delta_t = f_\theta(x_t)$ is computed via a learned bottleneck $f_\theta$.
- Head- and block-specific angular frequencies $\omega^{(h,\ell)}$ produce the rotation angle $\theta_t^{(h,\ell)} = \omega^{(h,\ell)} \delta_t$.
- The per-step update is a planar rotation $R(\theta_t) \in \mathrm{SO}(2)$, forming the cumulative path-integration matrix via $P_t = \prod_{s \le t} R(\theta_s) = R\big(\sum_{s \le t} \theta_s\big)$.
- For MapFormer-EM (episodic memory), absolute positional keys/queries are updated with the cumulative rotation, allowing the separation of "what" and "where" through joint and parallel attention mechanisms, while MapFormer-WM (working memory) instead utilizes relative (rotary) attention, depending only on trajectory difference (Rambaud et al., 24 Nov 2025).
This group-structured update guarantees invertibility, closure, and associativity, inducing an inductive bias for discovering the action manifold independent of content, thereby enabling systematic out-of-distribution (OOD) generalization to new trajectories or sequence lengths.
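The following PyTorch sketch illustrates the mechanism under assumed shapes; the module and parameter names (`PathIntegrator`, `delta_net`, `omega`) are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class PathIntegrator(nn.Module):
    """Sketch of a rotation-based, input-dependent positional update."""

    def __init__(self, d_model: int, n_freq: int):
        super().__init__()
        # Learned bottleneck: token -> scalar integration increment delta_t
        self.delta_net = nn.Sequential(
            nn.Linear(d_model, d_model // 4), nn.Tanh(),
            nn.Linear(d_model // 4, 1))
        # Per-head/block angular frequencies omega
        self.omega = nn.Parameter(torch.randn(n_freq))

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq, d_model)
        delta = self.delta_net(tokens)       # (batch, seq, 1)
        theta = delta * self.omega           # (batch, seq, n_freq)
        # SO(2) is abelian, so the cumulative product of planar rotations
        # R(theta_1)...R(theta_t) equals R(sum of angles): a cumsum suffices.
        phi = torch.cumsum(theta, dim=1)
        return torch.cos(phi), torch.sin(phi)  # cumulative rotation per step
```

Because the increment is produced from the token itself, action tokens can drive large rotations while observation tokens can leave the accumulated position nearly unchanged, which is exactly the structure-content split described above.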
3. Model Variants and Their Architectures
3.1 Self-Supervised Cognitive Mapping (MapFormer-EM/WM)
In the context of cognitive map learning from trajectories, MapFormer-EM employs absolute structure-content factorization through parallel attention, with separate keys/queries for structure ("where," path integration) and content ("what," observations), united via outer product or parallel attention mask:
- Keys/queries: $q_t^{\mathrm{pos}}, k_t^{\mathrm{pos}}$, rotated by the cumulative matrix $P_t$, for position; $q_t^{\mathrm{cont}}, k_t^{\mathrm{cont}}$ for content.
- Final attention: the two streams contribute additively, schematically $A = \mathrm{softmax}\big((Q^{\mathrm{pos}} K^{\mathrm{pos}\top} + Q^{\mathrm{cont}} K^{\mathrm{cont}\top})/\sqrt{d}\big)$, so that "where" and "what" enter as disentangled terms.
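A minimal sketch of this parallel, structure-content factorized attention (the function name and tensor shapes are assumptions, not the paper's code):

```python
import math
import torch

def parallel_attention(q_pos, k_pos, q_cont, k_cont, v):
    """Positional ("where") and content ("what") streams score separately,
    and their logits add before a single softmax.
    All tensors: (batch, heads, seq, d_head)."""
    d = q_pos.size(-1)
    logits = (q_pos @ k_pos.transpose(-2, -1)        # structural term
              + q_cont @ k_cont.transpose(-2, -1))   # content term
    attn = torch.softmax(logits / math.sqrt(d), dim=-1)
    return attn @ v
```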
MapFormer-WM uses a RoPE-style rotary encoding focused on relative path differences: queries and keys are rotated by their cumulative angles, $\tilde q_t = R(\phi_t) q_t$ and $\tilde k_s = R(\phi_s) k_s$, so the attention score depends only on the trajectory difference $\phi_t - \phi_s$.
Both are trained with next-token prediction in a self-supervised setup, receiving as input interleaved streams of action and observation tokens and discovering the semantic split between them (Rambaud et al., 24 Nov 2025).
3.2 End-to-End Vectorized Mapping in Computer Vision (MapFM)
MapFM (Ivanov et al., 18 Jun 2025) adapts the "MapFormer" paradigm to high-definition mapping for autonomous driving. It comprises:
- Foundation model encoder: A (frozen or readapted) large-scale ViT (e.g., DINOv2) encodes synchronized camera images into per-camera embeddings.
- BEV fusion: BEVFormer-style cross-attention projects multi-view image features into a shared bird’s-eye-view spatial grid, conditioned on geometric calibration.
- Multi-task heads: Dense BEV semantic segmentation and perspective-view segmentation heads (for road, lane, crosswalk, etc.) are jointly trained, inducing contextual feature richness through multi-objective loss (including Dice and cross-entropy).
- Vector map decoder: A DETR-style Transformer with instance queries decodes polylines in BEV, producing semantic primitives (e.g., lane dividers, boundaries).
This configuration enables direct vectorized map prediction from raw images, benefiting from the contextual expressiveness of foundation-model features and improved by auxiliary semantic supervision.
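As a rough illustration of the final stage, the sketch below implements a generic DETR-style decoder with learned instance queries over flattened BEV features; all dimensions and the head layout are assumptions, not MapFM's released configuration:

```python
import torch
import torch.nn as nn

class VectorMapDecoder(nn.Module):
    """Sketch of a DETR-style polyline decoder over fused BEV features."""

    def __init__(self, d_model=256, n_queries=50, n_points=20, n_classes=3):
        super().__init__()
        self.queries = nn.Embedding(n_queries, d_model)   # learned instance queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.point_head = nn.Linear(d_model, n_points * 2)  # polyline vertices
        self.class_head = nn.Linear(d_model, n_classes)     # semantic type

    def forward(self, bev_feats):
        # bev_feats: (batch, H*W, d_model) flattened BEV grid
        b = bev_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, bev_feats)        # queries cross-attend to BEV
        pts = self.point_head(h).sigmoid()    # normalized BEV coordinates
        return pts.view(b, h.size(1), -1, 2), self.class_head(h)
```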
3.3 Autoregressive Polygon Prediction (GeoFormer)
GeoFormer (Khomiakov et al., 25 Nov 2024) belongs to the "MapFormer" family focused on remote sensing, specifically multi-polygon building delineation from satellite images. It consists of:
- SWINv2 image encoder yielding a fused spatial feature map.
- Autoregressive Transformer decoder receiving coordinate tokens (quantized positions) with additional START, SEP, and STOP symbols, implementing both causal and cross-attention, and emitting polygon vertices end-to-end.
- Explicitly crafted positional encodings: a sum of learned absolute 2D grid biases, fixed meshgrid encoding, RoPE, and ALiBi biases for enhanced spatial modeling.
This architecture enables scale-invariant, single-likelihood optimization for multi-polygon vectorization without the need for manual post-processing or multi-stage segmentation (Khomiakov et al., 25 Nov 2024).
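A sketch of how such a coordinate-token sequence might be constructed; the grid size and special-token ids here are illustrative, not GeoFormer's actual vocabulary:

```python
import numpy as np

GRID = 224                                   # quantization bins per axis (assumed)
START, SEP, STOP = 2 * GRID, 2 * GRID + 1, 2 * GRID + 2  # special symbols

def polygons_to_tokens(polygons, img_size):
    """polygons: list of (N_i, 2) float vertex arrays in pixel coordinates."""
    tokens = [START]
    for i, poly in enumerate(polygons):
        if i > 0:
            tokens.append(SEP)               # delimit polygons in the sequence
        q = np.clip((poly / img_size * GRID).astype(int), 0, GRID - 1)
        for x, y in q:
            tokens += [int(x), GRID + int(y)]  # x and y use disjoint id ranges
    tokens.append(STOP)                       # terminate the multi-polygon sequence
    return tokens
```

Quantizing to a fixed grid regardless of input resolution is what gives the decoder its scale invariance: the same vocabulary covers images of any size.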
4. Probabilistic Objectives and Training Protocols
All MapFormer variants employ a sequence modeling paradigm with token-level autoregressive (or masked) likelihoods. For autoregressive models (GeoFormer), the likelihood of a token sequence $y = (y_1, \dots, y_T)$ given image features $x$ is factorized as
$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x),$$
and training minimizes the negative log-likelihood
$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x).$$
In the self-supervised cognitive map variant, the model receives an interleaved action-observation sequence $(x_1, \dots, x_T)$ and optimizes the analogous next-token objective
$$\mathcal{L} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t}),$$
with no auxiliary action/observation labels provided; the input-dependent positional encoding architecture compels the model to infer which tokens require positional integration and which represent stationary state updates (Rambaud et al., 24 Nov 2025).
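Both objectives reduce to a standard next-token cross-entropy, sketched here (the padding convention is an assumption):

```python
import torch.nn.functional as F

def next_token_nll(logits, targets, pad_id=-100):
    """Autoregressive NLL of each token given its prefix.
    logits: (batch, seq, vocab); targets: (batch, seq), already
    shifted by one position relative to the inputs."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)
```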
In MapFM, the total loss combines point-set regression, direction loss, and segmentation losses via a weighted sum of the form
$$\mathcal{L} = \mathcal{L}_{\mathrm{map}} + \lambda_{\mathrm{dir}} \mathcal{L}_{\mathrm{dir}} + \lambda_{\mathrm{seg}} \mathcal{L}_{\mathrm{seg}}.$$
Explicit equations, such as the Dice loss for ARSS segmentation or segment-wise direction loss, are detailed in the original works (Ivanov et al., 18 Jun 2025).
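As a generic illustration only (the published formulations differ in detail, per the note above), a soft-Dice term and a weighted combination might look like:

```python
import torch

def dice_loss(pred, target, eps=1.0):
    """Soft Dice over a (batch, H, W) segmentation map; pred in [0, 1]."""
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def multitask_loss(l_map, l_dir, l_seg_ce, l_seg_dice, w_dir=0.5, w_seg=1.0):
    """Weighted multi-task sum; the weights are placeholders,
    not the values reported in the paper."""
    return l_map + w_dir * l_dir + w_seg * (l_seg_ce + l_seg_dice)
```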
5. Empirical Performance and OOD Generalization
MapFormers demonstrate strong empirical results across all evaluated domains:
- Cognitive map learning: On 2D navigation tasks, MapFormer-EM and MapFormer-WM achieve near-perfect OOD generalization: on sequences longer than those seen in training and on larger grids, accuracy approaches 100%, in contrast to RoPE- or gating-based baselines, which fail to extrapolate (Rambaud et al., 24 Nov 2025). The key enabler is the structural (group-theoretic) bias of the rotation-based input-dependent positional encoding.
- Remote sensing: GeoFormer, using an autoregressive MapFormer paradigm, outperforms segmentation-plus-vectorization pipelines by 12–30 pp on mean AP, boundary AP, and closed-IoU on the Aicrowd building delineation benchmark, while maintaining robustness to pixel dropout and orientation changes. Scale-invariance is achieved through grid-based tokenization and model structure (Khomiakov et al., 25 Nov 2024).
- Automotive HD mapping: MapFM establishes state-of-the-art mean average precision (up to 69.0% mAP) on nuScenes, outperforming MapQR and MGMapNet baselines; the inclusion of auxiliary segmentation heads and foundation-model backbones yields consistent mAP improvements (+1.1–1.5%) (Ivanov et al., 18 Jun 2025).
Ablation studies further validate that removal of structural elements (e.g., pyramidal encoding, ALiBi, RoPE) catastrophically degrades performance, confirming their necessity.
6. Applications in Neuroscience and Artificial Intelligence
MapFormers have broad implications:
- Neuroscience: MapFormer-EM and MapFormer-WM reflect the division of hippocampal and prefrontal memory systems (episodic vs. working memory), and directly model grid/place cell path integration phenomena. The mathematical structure of their updates mirrors the algebraic properties identified in neural ensembles.
- AI and robotics: In robotics and autonomous driving, MapFormer models enable one-shot vectorized HD map prediction from multi-view imagery, improving planning and scene understanding. Cognitive mapping variants allow for interpretable, group-structured memory representations that generalize systematically beyond training data, a challenge that standard Transformers continue to face.
- Self-supervised learning: The self-supervised discovery of actions and observations in MapFormers permits large-scale relation modeling without explicit supervision, suitable for scaling to more abstract or relational domains.
7. Limitations and Future Directions
While MapFormers demonstrate marked advancements in OOD generalization and end-to-end structural modeling, current limitations include:
- Computational cost: Large parameter footprint (e.g., GeoFormer at ~97M params) and quadratic dependence on sequence length make inference relatively slow (up to 1.9s/image).
- Spatial resolution: Fixed output feature maps (e.g., 36×36 in GeoFormer) limit fidelity at very high or low scales.
- Tokenization scheme: Scalar tokenization of coordinates may limit joint (x, y) modeling efficiency; future work could allow vectorized predictions.
- Training protocols: Some approaches require careful sorting or canonicalization of map elements at training time.
Prospective research can explore higher-dimensional or non-Euclidean positional encoding groups, more efficient tokenization or decoding, and integration with large language and multimodal models for scalable, interpretable cognitive mapping (Rambaud et al., 24 Nov 2025, Khomiakov et al., 25 Nov 2024, Ivanov et al., 18 Jun 2025).