Anchor-Based Autoregressive Decoder
- Anchor-based autoregressive decoders are neural models that integrate explicit anchors as semantic priors to direct and sharpen sequential predictions.
- They empower diverse applications, from trajectory forecasting with LSTM and MLP-Mixer modules to visual generation with transformer-based random order decoding.
- This approach leverages specialized training losses and architectural innovations to enhance multi-modal prediction accuracy and inference speed.
An anchor-based autoregressive decoder is a neural network module that conditions sequential predictions on one or more explicit, structured "anchors." These anchors serve as semantic priors (such as maneuver hypotheses in trajectory prediction or position-instruction tokens in image generation) that constrain the output space, support multi-modality, and sharpen the autoregressive process. The approach unifies batchwise or mode-conditional decoding with stepwise probabilistic forecasting, under frameworks as diverse as LSTM-based trajectory prediction (Hasan et al., 2021), MLP-Mixer modules for cooperative V2X fusion (Wu et al., 19 Sep 2025), and transformer-based visual generation in arbitrary orders (Pang et al., 2024).
1. Formal Definition and General Principles
Let $Y = (y_1, \dots, y_T)$ denote a sequence of target outputs conditioned on past observations $X$. Anchor-based autoregressive decoders posit a set of anchor states $\{a_k\}_{k=1}^{K}$ (each a vector, sequence, or embedding) with mixture probabilities $\pi_k(X)$. The conditional distribution factorizes as
$$p(Y \mid X) = \sum_{k=1}^{K} \pi_k(X)\, p(Y \mid X, a_k),$$
where each $p(Y \mid X, a_k)$ may itself be an autoregressive model, typically generating $y_t$ sequentially while conditioning on $a_k$ at each step (Hasan et al., 2021, Wu et al., 19 Sep 2025).
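To make the mixture factorization above concrete, the following minimal sketch evaluates the log-likelihood of a short 1-D sequence under two anchors. It assumes, purely for illustration, that each component is a per-step Gaussian centred on the anchor value. The function names, the fixed standard deviation, and the toy anchors are hypothetical and do not come from the cited papers.

```python
import math

def gaussian_logpdf(y, mean, std):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (y - mean) ** 2 / (2 * std ** 2)

def component_loglik(ys, anchor, std=1.0):
    """log p(Y | X, a_k) as a sum of per-step log-densities.

    Toy assumption: each step is Gaussian and centred on the anchor value.
    A real decoder would also condition on X and on earlier outputs.
    """
    return sum(gaussian_logpdf(y, a, std) for y, a in zip(ys, anchor))

def mixture_loglik(ys, anchors, weights, std=1.0):
    """log p(Y | X) = log sum_k pi_k * p(Y | X, a_k), computed with log-sum-exp."""
    terms = [math.log(w) + component_loglik(ys, a, std)
             for w, a in zip(weights, anchors)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Two hypothetical 1-D anchor sequences and their mixture probabilities
anchors = [[0.0, 1.0, 2.0], [0.0, -1.0, -2.0]]
weights = [0.7, 0.3]
observed = [0.1, 0.9, 2.1]
print(mixture_loglik(observed, anchors, weights))
```

The log-sum-exp step is a numerical-stability choice: it avoids underflow when the per-anchor likelihoods of a long sequence become very small.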
Anchors may be:
- Prototypical future trajectories for high-level intentions
- Embeddings of spatial position/progress cues for variable-order decoding
- Key spatial waypoints or prior consolidations in cooperative or multi-modal systems.
Anchors are either learned (initialized and optimized during training) or derived from domain structure, and they serve as mixture component supports or as auxiliary prompts.
2. Architectural Realizations
Trajectory Prediction: In maneuver-based roundabout prediction, anchors are maneuver-specific future trajectory templates computed by averaging human-labeled trajectories for each maneuver type (Hasan et al., 2021). The full distribution is:
$$p(Y \mid X) = \sum_{k=1}^{K} \pi_k(X) \prod_{t=1}^{T} \mathcal{N}\left(y_t \mid a_{k,t} + \mu_{k,t},\, \Sigma_{k,t}\right),$$
where $\pi_k(X)$ is the probability of maneuver $k$, with the autoregressive decoder LSTM, at each time step $t$, emitting an offset $\mu_{k,t}$ to the anchor $a_{k,t}$. The decoder is conditioned both on history and anchor, and produces residual Gaussians.
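The anchor-plus-residual idea described above can be sketched in a few lines. The example builds a template by point-wise averaging of labeled trajectories and then adds per-step offsets to it. The trajectories, the maneuver name, and the fixed offsets are hypothetical placeholders. In the described model the offsets would be emitted by the decoder LSTM, which is not implemented here.

```python
def build_anchor(trajectories):
    """Average equally long (x, y) trajectories point by point into one template."""
    n = len(trajectories)
    horizon = len(trajectories[0])
    return [
        (sum(t[i][0] for t in trajectories) / n,
         sum(t[i][1] for t in trajectories) / n)
        for i in range(horizon)
    ]

def decode_with_offsets(anchor, offsets):
    """Add per-step residual means to the anchor to obtain the predicted path."""
    return [(ax + dx, ay + dy) for (ax, ay), (dx, dy) in zip(anchor, offsets)]

# Hypothetical labeled trajectories for a single maneuver type
exit_right = [
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)],
    [(0.0, 0.0), (1.2, 0.3), (2.2, 1.1)],
]
anchor = build_anchor(exit_right)

# Stand-in for decoder outputs (assumption: fixed values instead of an LSTM)
offsets = [(0.0, 0.0), (0.1, -0.1), (0.2, -0.2)]
prediction = decode_with_offsets(anchor, offsets)
print(prediction)
```

Predicting residuals rather than absolute positions means the network only has to model deviations from a plausible maneuver, which is the usual motivation for anchor templates.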
Cooperative Prediction (V2X): In CoPAD (Wu et al., 19 Sep 2025), the Anchor-oriented Decoder (AoD) attaches a small set of anchors (e.g., midpoint and endpoint) to each prediction mode of each agent. Initial anchor embeddings are refined by per-agent, per-mode feature-conditioned offsets, and the regressed trajectory is decoded stepwise (optionally autoregressively) from the fused feature embedding and the refined anchors.
Visual Generation: RandAR (Pang et al., 2024) interleaves position-instruction (anchor) tokens into the autoregressive input sequence, where each such token specifies the next spatial position to decode. The anchor is a single learnable embedding transformed with 2D rotary positional encoding according to the target position, so that a cue placed before each image token selects the proper decoding position even under random permutation.
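The interleaving of position-instruction tokens with image tokens can be illustrated with a small sketch. It only builds the sequence layout under a random order. The learnable embedding and the rotary encoding are omitted, and each position cue is represented as a plain (row, col) pair. The function name and the tuple representation are assumptions made for illustration, not RandAR's actual interface.

```python
import random

def interleave_with_position_tokens(image_tokens, grid_width, seed=0):
    """Build a random-order sequence of (kind, payload) pairs.

    image_tokens: token ids in raster order.
    Every image token is preceded by a ("pos", (row, col)) entry that names
    the grid position of the token that follows it.
    """
    rng = random.Random(seed)
    order = list(range(len(image_tokens)))
    rng.shuffle(order)  # random decoding order over grid positions

    sequence = []
    for idx in order:
        row, col = divmod(idx, grid_width)
        sequence.append(("pos", (row, col)))        # position-instruction cue
        sequence.append(("img", image_tokens[idx]))  # token at that position
    return order, sequence

# A hypothetical 2x3 grid of token ids in raster order
tokens = [10, 11, 12, 13, 14, 15]
order, sequence = interleave_with_position_tokens(tokens, grid_width=3, seed=1)
for kind, payload in sequence:
    print(kind, payload)
```

Because every image token carries its own position cue, the raster-order image can be recovered from the sequence regardless of the permutation that was sampled.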
3. Training Objectives and Loss Functions
Anchor-based decoders employ task-aligned loss structures:
- Anchor regression loss: Supervises anchors or anchor refinements to match keypoints of ground-truth sequences (e.g., midpoints/endpoints in CoPAD via a smooth-$L_1$ loss (Wu et al., 19 Sep 2025)).
- Weighted mixture loss: In multi-anchored GMMs, assigns soft responsibilities for each anchor based on the fit between predicted and ground-truth trajectory, then minimizes a weighted combination of anchor-wise regression and (optionally) mode classification or negative log-likelihood losses (Hasan et al., 2021):
- Classification loss: Penalizes misalignment between predicted anchor probabilities (or mode scores) and the best-matching anchor/mode.
- Regression/likelihood loss: Negative log-likelihood of the final sequence under the mixture or Laplace/Gaussian predictions conditioned on the chosen anchor/mode.
For random-order image generation, training is by standard cross-entropy on the outputs conditioned on random anchor sequences (Pang et al., 2024).
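As a rough illustration of the weighted mixture loss described above, the sketch below assigns soft responsibilities with a softmax over negative squared errors and combines a responsibility-weighted regression term with a cross-entropy term on the mode probabilities. The softmax form, the temperature, the squared-error regression term, and the `cls_weight` parameter are all assumptions chosen for simplicity. The cited papers may define these terms differently.

```python
import math

def squared_distance(traj_a, traj_b):
    """Sum of squared differences between two equally long 1-D trajectories."""
    return sum((a - b) ** 2 for a, b in zip(traj_a, traj_b))

def soft_responsibilities(predictions, target, temperature=1.0):
    """Softmax over negative squared errors: better-fitting anchors get more weight."""
    scores = [-squared_distance(p, target) / temperature for p in predictions]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mixture_loss(predictions, mode_probs, target, cls_weight=1.0):
    """Responsibility-weighted regression plus cross-entropy on mode probabilities."""
    resp = soft_responsibilities(predictions, target)
    regression = sum(r * squared_distance(p, target)
                     for r, p in zip(resp, predictions))
    classification = -sum(r * math.log(q) for r, q in zip(resp, mode_probs))
    return regression + cls_weight * classification, resp

# Hypothetical per-anchor predictions, mode probabilities, and ground truth
predictions = [[0.0, 1.0], [0.0, -1.0]]
mode_probs = [0.9, 0.1]
target = [0.0, 0.9]
loss, resp = weighted_mixture_loss(predictions, mode_probs, target)
print(loss, resp)
```

Lowering the temperature pushes the responsibilities toward a hard best-anchor assignment, while a higher temperature spreads supervision across anchors.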
4. Decoding Algorithms and Inference Protocols
All implementations support multi-modal, stepwise decoding:
Trajectory:
- At inference, select the most probable mode, sample a mode from the predicted mixture probabilities, or ensemble over all anchors. Decode the sequence as the anchor trajectory plus per-step residuals (arising from the autoregressive LSTM or MLP-Mixer) (Hasan et al., 2021, Wu et al., 19 Sep 2025).
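The three inference options named above can be sketched as follows. The helper names are hypothetical, and "ensemble" is interpreted here as returning every mode's trajectory ranked by probability. That interpretation is an assumption, since the text does not specify how the ensemble is formed.

```python
import random

def pick_mode(mode_probs, strategy="argmax", rng=None):
    """Return one mode index, either the most probable one or a random draw."""
    indices = range(len(mode_probs))
    if strategy == "argmax":
        return max(indices, key=lambda k: mode_probs[k])
    if strategy == "sample":
        rng = rng or random.Random()
        return rng.choices(list(indices), weights=mode_probs, k=1)[0]
    raise ValueError(f"unknown strategy: {strategy}")

def rank_modes(trajectories, mode_probs):
    """Keep all modes, sorted from most to least probable (assumed 'ensemble')."""
    ranked = sorted(range(len(mode_probs)), key=lambda k: mode_probs[k], reverse=True)
    return [(k, mode_probs[k], trajectories[k]) for k in ranked]

# Hypothetical per-mode decoded trajectories and probabilities
trajectories = [[0.0, 1.0], [0.0, 0.0], [0.0, -1.0]]
mode_probs = [0.2, 0.5, 0.3]

print(pick_mode(mode_probs, "argmax"))
print(pick_mode(mode_probs, "sample", rng=random.Random(0)))
print(rank_modes(trajectories, mode_probs))
```

Keeping the modes separate rather than averaging them preserves the multi-modal structure, since an average of two distinct maneuvers may correspond to no feasible trajectory.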
Visual Generation:
- Autoregressive transformer inference over random orders using anchor tokens enables parallel decoding. Multiple spatial positions are decoded in batch, maintaining correct causal dependencies and updating the KV-Cache for efficiency (Pang et al., 2024).
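A minimal sketch of the parallel decoding loop described above is given below. It partitions a random order into fixed-size groups and decodes one group per step, so the number of model calls drops from the number of positions to the number of groups. The fixed group size, the `predict_group` callback, and the dictionary standing in for the cached context are all assumptions made for illustration. No real transformer or KV cache is involved.

```python
import random

def parallel_schedule(num_positions, group_size, seed=0):
    """Shuffle positions and split them into groups decoded in one step each."""
    rng = random.Random(seed)
    order = list(range(num_positions))
    rng.shuffle(order)
    return [order[i:i + group_size] for i in range(0, num_positions, group_size)]

def decode(num_positions, group_size, predict_group, seed=0):
    """Decode all positions group by group.

    predict_group(context, positions) -> list of tokens, one per position.
    `context` holds everything decoded so far (a stand-in for the KV cache).
    """
    decoded = {}
    steps = 0
    for group in parallel_schedule(num_positions, group_size, seed):
        tokens = predict_group(dict(decoded), group)  # one "forward pass"
        decoded.update(zip(group, tokens))            # extend the context
        steps += 1
    return decoded, steps

def dummy_predictor(context, positions):
    """Hypothetical model stub: the token is simply twice the position index."""
    return [2 * p for p in positions]

decoded, steps = decode(16, 4, dummy_predictor)
print(steps, decoded)
```

Positions inside a group are predicted without seeing each other, which is the trade-off that parallel decoding makes in exchange for fewer sequential steps.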
General properties:
- Anchor selection supports sharp mode separation, enabling the system to cover diverse high-level futures.
- Parallel decoding and anchor-conditional factorization enable significant speedups and richer context aggregation.
5. Structural Diversity and Multi-modality
Anchor-based architectures explicitly promote diversity:
- In trajectory prediction, different modalities correspond to different anchors (e.g., maneuvers, accelerations) and are enforced by anchor-wise responsibilities and mode-classification losses. The anchor mixing ensures that both common and rare outcomes are represented and sampled proportionally (Hasan et al., 2021, Wu et al., 19 Sep 2025).
- In CoPAD, mode-attention and anchor-diversity are integrated, and best-mode assignment per sample prevents mode collapse.
- In vision, training on all possible orders via anchor tokens induces learning of both local and long-range correlations, and enables bi-directional context extraction (Pang et al., 2024).
6. Hyperparameters, Ablations, and Empirical Outcomes
Key hyperparameters
- Number and type of anchors (K): trades off coverage against over-parameterization (e.g., 2 anchors per mode is reported as optimal in CoPAD) (Wu et al., 19 Sep 2025).
- Maneuver discretization granularity in trajectory prediction (number of location and acceleration types).
- Decoder hidden dimension, MLP/Transformer depth in visual models (e.g., 343M-1.4B parameters for RandAR (Pang et al., 2024)).
- Loss weights (anchor, regression, classification).
- Prediction horizon (e.g., 5 s at 10 Hz, i.e., 50 steps, in CoPAD).
- Training strategy: random order sampling in token predictions, level of token dropout for regularization (Pang et al., 2024).
Empirical results show that anchor-based autoregressive decoders achieve:
- Lower trajectory prediction error (e.g., a 28% RMSE reduction on the RounD benchmark relative to the best LSTM baseline (Hasan et al., 2021)).
- State-of-the-art performance on DAIR-V2X-Seq when employing sparse anchors and multi-modal prediction (Wu et al., 19 Sep 2025).
- Equivalent or superior FID/IS metrics, together with faster inference from parallel decoding in vision models (Pang et al., 2024).
7. Applications and Broader Impact
Anchor-based autoregressive decoders unify multi-modal sequence prediction with scalable, context-conditional generative modeling. Applications include:
- Motion forecasting for autonomous vehicles at intersections, roundabouts, and cooperative V2X environments (Hasan et al., 2021, Wu et al., 19 Sep 2025).
- Multi-agent trajectory fusion and prediction with robust diversity, exploiting early fusion of multiple sensor modalities (Wu et al., 19 Sep 2025).
- Visual generation tasks not restricted to left-to-right rasterization: arbitrary-order synthesis, inpainting, outpainting, and super-resolution (Pang et al., 2024).
The explicit representation and conditioning on anchors—be they maneuver templates, spatial tokens, or learned waypoints—facilitates interpretability, sample efficiency, and faster inference. Recent work suggests that this paradigm enables architectures to transcend conventional orderings, support task-flexible conditioning, and achieve state-of-the-art performance across domains. Potential extensions include anchor learning under domain transfer, probabilistic calibration, and further architectural generalization to arbitrary modality and granularity.