
GeoFormer: Autoregressive Polygon Prediction

Updated 22 February 2026
  • Autoregressive polygon prediction is a method that sequentially generates polygon vertices using transformer decoders for precise object delineation.
  • GeoFormer utilizes an encoder–decoder framework with spatial features and positional encodings to model geometric dependencies in images.
  • This approach supports applications like building footprint extraction and segmentation, achieving state-of-the-art performance on large-scale datasets.

Autoregressive polygon prediction refers to a class of sequence modeling frameworks that generate polygonal object representations in images by emitting spatial keypoints or vertices one at a time, each conditioned on the previously generated outputs. GeoFormer and closely related architectures (e.g., PolyFormer) formulate object delineation as a spatial sequence modeling task, where objects such as buildings are described as sequences of keypoint tokens, and a transformer decoder predicts these tokens autoregressively. This approach offers a direct path from raw pixels to geometric, vectorized object descriptions, supporting applications such as building footprint extraction and referring segmentation in remote sensing.

1. Architectural Foundations

GeoFormer employs a classical encoder–decoder transformer structure. The encoder processes satellite or aerial imagery into a dense, multi-scale feature map through hierarchical vision transformers (typically Swin V2 Pyramid), which partition the input image into non-overlapping patches and compute hierarchical feature representations at successive resolutions. The encoded feature map serves as the context for a deep, stacked transformer decoder (e.g., 8 layers, 24 heads, embedding size 512), which autoregressively predicts a tokenized sequence encoding the polygon’s vertices, special marker tokens (e.g., start, separation, stop), and (in some variants) auxiliary information such as textual queries (Khomiakov et al., 2024).

A defining characteristic of the decoder in these systems is masked multi-head self-attention (causal masking) ensuring that only previously generated tokens are visible to the prediction at each step. The decoder further incorporates cross-attention to the encoder’s rich spatial feature grid, augmented with positional encodings (learned 2D sinusoidal embeddings, local ALiBi biases, RoPE rotary embeddings) to enhance spatial sensitivity (Khomiakov et al., 2024).
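
The causal-masking mechanism can be illustrated with a minimal single-head attention sketch (NumPy, simplified: no learned projections, multi-head splitting, or cross-attention):

```python
import numpy as np

def causal_mask(T):
    # True where attention is allowed: token t may attend only to positions <= t
    return np.tril(np.ones((T, T), dtype=bool))

def masked_self_attention(x, mask):
    """Single-head scaled dot-product attention with a causal mask.
    x: (T, d) sequence of token embeddings (Q = K = V = x for simplicity)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (T, T) raw attention logits
    scores = np.where(mask, scores, -1e9)  # forbid attending to future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ x, w

T, d = 5, 8
rng = np.random.default_rng(0)
out, w = masked_self_attention(rng.normal(size=(T, d)), causal_mask(T))
# Each row's attention weights over future positions are (numerically) zero.
```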

Tokenization of polygon keypoints is typically performed via coordinate quantization (e.g., discretizing image coordinates to match the feature grid, such as a $36 \times 36$ or $224 \times 224$ grid), with each $(x, y)$ coordinate encoded as a pair of tokens in the sequence. Complex scenes with multiple polygons flatten all vertex sequences into a single token stream using separator tokens to delineate boundaries (Khomiakov et al., 2024, Alfieri et al., 2021).
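
A minimal sketch of this tokenization scheme (the grid size, special-token ids, and flattening order are illustrative assumptions, not GeoFormer's exact vocabulary):

```python
GRID = 224           # assumed quantization grid (matches the 224x224 example)
START, SEP, STOP = GRID * GRID, GRID * GRID + 1, GRID * GRID + 2  # special ids

def quantize(x, y, img_w, img_h, grid=GRID):
    """Map continuous image coordinates to discrete grid token ids."""
    qx = min(int(x / img_w * grid), grid - 1)
    qy = min(int(y / img_h * grid), grid - 1)
    return qx, qy

def polygons_to_tokens(polygons, img_w, img_h):
    """Flatten a list of polygons (each a list of (x, y) vertices) into one
    token stream: <start> x y x y ... <sep> x y ... <stop>."""
    tokens = [START]
    for i, poly in enumerate(polygons):
        if i > 0:
            tokens.append(SEP)
        for (x, y) in poly:
            tokens.extend(quantize(x, y, img_w, img_h))
    tokens.append(STOP)
    return tokens
```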

2. Autoregressive Sequence Modeling

The core operational principle is the factorization of the joint probability over the sequence of polygon tokens $s = (s_0, \dots, s_T)$:

$$p_\theta(s \mid I) = \prod_{t=0}^{T} p_\theta(s_t \mid I, s_{<t})$$

The model is trained to maximize the log-likelihood of ground-truth sequences under this factorization:

$$\mathcal{L}(\theta) = -\sum_{t=0}^{T} \log p_\theta(s_t \mid I, s_{<t})$$

In this paradigm, predicting each new polygon keypoint is conditioned not only on global image features, but also on the precise historically predicted sequence, allowing for direct modeling of geometric dependencies such as edge directionality or polygon closure. Special marker tokens ($\langle\mathsf{start}\rangle$, $\langle\mathsf{sep}\rangle$, $\langle\mathsf{stop}\rangle$) structure the output as stepwise emission of complete polygons, supporting arbitrary numbers and topologies of objects per image (Khomiakov et al., 2024, Liu et al., 2023).
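
Under teacher forcing, this objective is a plain next-token cross-entropy over the decoder's output distribution; a minimal NumPy sketch:

```python
import numpy as np

def sequence_nll(logits, targets):
    """Teacher-forced negative log-likelihood of a token sequence.
    logits: (T, V) unnormalized next-token scores from the decoder,
            where row t is conditioned on the image and tokens s_{<t}.
    targets: (T,) ground-truth token ids s_0 ... s_{T-1}."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()
```

For a uniform predictive distribution over a vocabulary of size $V$, this evaluates to $T \log V$, as expected from the factorized likelihood.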

The autoregressive approach, while $O(T^2)$ in decoder computation, imparts a strong inductive bias in scenarios such as multi-object delineation, where the problem can be naturally decomposed into subproblems (each sub-sequence of tokens corresponding to one polygon) (Alfieri et al., 2021).
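
The inference loop itself is a standard greedy autoregressive decode; in the sketch below, the assumed `step_fn` stands in for a full decoder forward pass over the prefix (and, implicitly, the encoded image features), which is what makes the overall cost quadratic in sequence length:

```python
def greedy_decode(step_fn, start_token, stop_token, max_len=200):
    """Greedy autoregressive decoding. `step_fn(seq)` returns next-token
    scores given the partial sequence; each step re-attends to the full
    prefix, hence the O(T^2) total decoder cost."""
    seq = [start_token]
    for _ in range(max_len):
        scores = step_fn(seq)                              # next-token scores
        next_tok = max(range(len(scores)), key=scores.__getitem__)
        seq.append(next_tok)
        if next_tok == stop_token:
            break
    return seq
```

Beam search or sampling can replace the `max` step without changing the overall structure.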

3. Loss Functions and Decoding Variants

The canonical loss used in GeoFormer is a categorical cross-entropy over the predicted next token, aligning fully with the likelihood factorization and avoiding the need for auxiliary segmentation or regression losses (Khomiakov et al., 2024). This is in contrast to hybrid or earlier polygon prediction models, where auxiliary objectives (e.g., mask IOU, coordinate regression) are common.

Variants such as PolyFormer (Liu et al., 2023) introduce regression-based decoders for floating-point keypoints, with a composite loss:

$$L_t = \lambda_t\, \mathbf{1}[\mathrm{true}_t = \langle\mathrm{COO}\rangle]\, L_1\bigl((x_t, y_t), (\hat{x}_t, \hat{y}_t)\bigr) + \lambda_{\mathrm{cls}}\, H(\mathrm{true}_t, \hat{p}_t)$$

where $L_1$ is the absolute error and $H$ is the smoothed cross-entropy for token-type classification. Weighting factors (e.g., $\lambda_{\mathrm{box}} = 0.1$ for box vertices, $\lambda_{\mathrm{poly}} = 1$ for polygon vertices, $\lambda_{\mathrm{cls}} = 5 \times 10^{-4}$) control the penalty assigned to different token types.
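
A per-step sketch of this composite loss (the function signature is illustrative, and plain rather than smoothed cross-entropy is used for brevity):

```python
import numpy as np

# Weights as reported in the text
LAMBDA_POLY, LAMBDA_BOX, LAMBDA_CLS = 1.0, 0.1, 5e-4

def composite_loss(pred_xy, true_xy, type_logits, true_type, is_coord, lam_coord):
    """Per-step PolyFormer-style loss: L1 regression on (x, y), applied only
    at coordinate-emitting steps, plus cross-entropy on the token type."""
    reg = 0.0
    if is_coord:
        reg = lam_coord * np.abs(np.asarray(pred_xy) - np.asarray(true_xy)).sum()
    z = type_logits - np.max(type_logits)            # stable log-softmax
    ce = -(z[true_type] - np.log(np.exp(z).sum()))
    return reg + LAMBDA_CLS * ce
```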

GeoFormer mitigates quantization error by aligning the token vocabulary to the encoder's spatial granularity, whereas PolyFormer composes $(x, y)$ coordinate embeddings continuously from a learned 2D codebook using bilinear interpolation, eliminating quantization bias entirely (Liu et al., 2023).
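
Bilinear interpolation over a 2D codebook can be sketched as follows (`codebook` stands in for the learned grid of embeddings; the coordinate-normalization convention is an assumption):

```python
import numpy as np

def coord_embedding(x, y, codebook):
    """Continuous (x, y) embedding via bilinear interpolation over a
    codebook of shape (G, G, D). x and y are assumed normalized to
    the grid range [0, G-1]."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, codebook.shape[1] - 1)
    y1 = min(y0 + 1, codebook.shape[0] - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * ((1 - wx) * codebook[y0, x0] + wx * codebook[y0, x1])
            + wy * ((1 - wx) * codebook[y1, x0] + wx * codebook[y1, x1]))
```

Because the interpolation weights vary smoothly with $(x, y)$, the embedding (and hence the loss) is differentiable in the continuous coordinates.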

4. Canonical Ordering, Tokenization, and Polygon Sequence Construction

Polygon prediction frameworks rely on a canonical decomposition of the object set for stability and train/test consistency. Common schemes include:

  • Sorting polygons by centroid position (left-to-right, top-to-bottom or by distance to image center).
  • Listing vertices in a fixed orientation (clockwise or counterclockwise) starting from a prescribed origin.
  • Flattening multi-polygon annotations into a single token sequence using separator tokens (e.g., end-of-polygon, separator, end-of-sequence).
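
One plausible concrete implementation of such a canonicalization (the specific tie-breaking rules are illustrative, not those of any particular paper):

```python
def signed_area(poly):
    """Shoelace formula; positive for counterclockwise vertex order
    (in a y-up coordinate frame)."""
    return 0.5 * sum(x0 * y1 - x1 * y0
                     for (x0, y0), (x1, y1) in zip(poly, poly[1:] + poly[:1]))

def canonicalize(polygons):
    """Orient each polygon counterclockwise, rotate it to start at its
    top-left-most vertex, then sort polygons by centroid
    (top-to-bottom, then left-to-right)."""
    out = []
    for poly in polygons:
        if signed_area(poly) < 0:               # flip clockwise polygons
            poly = poly[::-1]
        k = min(range(len(poly)), key=lambda i: (poly[i][1], poly[i][0]))
        out.append(poly[k:] + poly[:k])         # rotate to canonical start
    out.sort(key=lambda p: (sum(y for _, y in p) / len(p),
                            sum(x for x, _ in p) / len(p)))
    return out
```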

This structure plays a pivotal role in ensuring unique, permutation-invariant representations for identical scene content. Empirical findings confirm that arbitrarily permuted sequences (e.g., random vertex orders) dramatically reduce segmentation fidelity (e.g., mIoU drops from 68.35% to 55.92% in PolyFormer ablations) (Liu et al., 2023). Augmentation strategies such as random down-sampling of contours introduce robustness to polygon granularity.

Table: Canonical Sequence Elements in Autoregressive Polygon Prediction

| Purpose | Token Type | GeoFormer (Khomiakov et al., 2024) | PolyFormer (Liu et al., 2023) |
| --- | --- | --- | --- |
| Sequence start | Special | $\langle\mathsf{start}\rangle$ | $\langle\mathsf{BOS}\rangle$ |
| Polygon separation | Special | $\langle\mathsf{sep}\rangle$ | $\langle\mathsf{SEP}\rangle$ |
| Vertex representation | Coordinate | Discrete quantized $(x, y)$ integers | Floating-point $(x, y)$ |
| Termination | Special | $\langle\mathsf{stop}\rangle$ | $\langle\mathsf{EOS}\rangle$ |

A plausible implication is that the canonical sequence plays a crucial role in leveraging the Transformer’s capacity to model long-range dependencies over spatial structures.

5. Empirical Performance and Ablations

GeoFormer demonstrates state-of-the-art results on large-scale building delineation, e.g., Aicrowd Mapping Challenge (280,000+ images), outperforming HiSup and segmentation-vectorization baselines by substantial margins (AP: 91.5 vs. 79.4, bAP: 97.1 vs. 66.5, global IoU: 98.1% vs. 94.3%) (Khomiakov et al., 2024). Complexity-aware metrics and PoLiS distance indicate high-fidelity polygon predictions and near-optimal vertex count (N-ratio ≈ 1.01).
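
The PoLiS distance referenced above is commonly defined (following Avbelj et al.) as the symmetrized mean distance from each polygon's vertices to the other polygon's boundary; a straightforward sketch:

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def polis(A, B):
    """Symmetric PoLiS distance between two polygons given as vertex lists:
    mean vertex-to-boundary distance, averaged over both directions."""
    def one_way(P, Q):
        edges = list(zip(Q, Q[1:] + Q[:1]))
        return sum(min(point_segment_dist(p, a, b) for a, b in edges)
                   for p in P) / len(P)
    return 0.5 * (one_way(A, B) + one_way(B, A))
```

Unlike IoU, this metric directly penalizes vertex placement error, which is why it complements area-based scores for vectorized predictions.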

Ablation studies confirm the necessity of Swin-pyramidal features, ALiBi attention biases, rotary embeddings (RoPE), canonical sorting, and random token masking during training. Removing ALiBi or RoPE collapses AP to nearly zero; pyramidal features are necessary for spatial localization.

In "Investigating transformers in the decomposition of polygonal shapes as point collections," the auto-regressive approach yields a +22.8 percentage point mAP improvement over parallel Transformer baselines on challenging multi-polygon tasks, while remaining competitive (or slightly suboptimal) on simpler or unordered set prediction benchmarks (Alfieri et al., 2021).

PolyFormer, situated in the context of referring image segmentation, reports absolute mIoU improvements of +5.40% (RefCOCO+) and +4.52% (RefCOCOg) over prior art, with the regression-based decoding adding +1.86pp versus classification-based decoding (Liu et al., 2023). Sequence ordering and polygon augmentation further increase effectiveness.

6. Practical Applications and Robustness

Autoregressive polygon frameworks are directly applicable to building footprint extraction from satellite imagery, referring image segmentation, and any task requiring vectorized representations of complex spatial objects. GeoFormer in particular is robust to moderate spatial perturbations, including pixel dropout, rotation, and—up to a point—resolution downsampling (AP remains highest in most conditions except aggressive downscaling, where the encoder grid bottleneck appears) (Khomiakov et al., 2024).

By modeling the entire vectorization pipeline within a single, differentiable likelihood-based framework, these methods obviate the need for post-processing or heuristic segmentation-masking pipelines, offering compact and geometrically accurate object representations.

Scalability to variable numbers and cardinalities of polygons is a central feature, but performance still depends strongly on training/test distribution alignment; autoregressive transformers can degrade if presented with scene complexity (polygon number) unseen during training (Alfieri et al., 2021). Addressing this with permutation-invariant sequence modeling or learned ordering remains an open research direction.

7. Limitations and Future Prospects

While autoregressive polygon prediction achieves high accuracy and geometric fidelity, the approach has inherent trade-offs:

  • Sequence length grows linearly with the number of polygons and vertices, leading to increased decoding computation ($O(T^2)$) and potential compounding error on long sequences (Alfieri et al., 2021).
  • Reliance on explicit, fixed ordering heuristics for polygons and vertices, which may be brittle across domains or scene types.
  • Slightly reduced robustness under distribution shift in key object counts or granularity unless specifically augmented during training.

Ongoing efforts seek to extend these frameworks to handle unordered sets, exploit learned or task-specific ordering priors, and further integrate vector graphics principles directly into end-to-end spatial transformers. Applying these advances to broader object classes, annotation regimes, and resolution scales continues to be an active area of research (Khomiakov et al., 2024, Alfieri et al., 2021).
