ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

Published 11 May 2026 in cs.CV | (2605.10045v1)

Abstract: Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.


Authors (7)

Summary

The paper presents a training-free, stage-aware RoPE remapping strategy that extrapolates resolution in visual autoregressive models.
It demonstrates that dynamic frequency adjustments across layout, local, and detail stages eliminate repetition and preserve high-fidelity details.
Quantitative and qualitative evaluations show significant improvements over existing methods, highlighting the value of adaptive attention calibration.


Motivation and Problem Analysis

Visual Autoregressive (VAR) models have become a competitive image synthesis paradigm, offering advantages in fidelity and compositionality over diffusion-based approaches. A core limitation persists, however: VAR models are conventionally trained at a fixed maximum spatial resolution, so direct inference at higher resolutions is nontrivial, both because self-attention cost scales quadratically with token count and because of the architecture's design. Existing training-free extrapolation techniques, developed for LLMs or diffusion models and relying mostly on static positional interpolation or remapping, produce three characteristic failure modes when applied naively to VAR: (1) global repetition of the large-scale layout, (2) local repetition, visible as duplicated mid-sized structures, and (3) degradation of fine-scale detail. These failures significantly degrade both visual fidelity and semantic alignment in high-resolution generation.

Through systematic controlled interventions on rotary position encoding (RoPE) frequency bands across the generative stages of VAR, the paper identifies a common underlying cause: at each stage of the coarse-to-fine VAR generation process, a distinct positional frequency band dominates. Disruptions in these bands' encoding at inappropriate stages precipitate the observed failure modes, invalidating the assumption that a single, static positional remapping suffices for all scales and stages.

Core Methodology

Band-Stage Mismatch Characterization

VAR generation is partitioned into three sequential stages: (1) Layout Construction, (2) Local Refinement, and (3) Detail Refinement. Each stage relies on a different dominant RoPE frequency band: low, mid, and high frequencies, respectively. The paper's empirical analyses reveal:

Very Low-Frequency Components: Effectively behave as no positional encoding (NoPE) but still contribute auxiliary cues that support both concept formation and detail refinement.
Low-Frequency Band: Governs global structure, with manipulations in the Layout Construction Stage leading to large-scale repetitive artifacts.
Mid-Frequency Band: Controls mid-sized composition and spatial consistency, with improper handling producing local object repetition.
High-Frequency Band: Encodes fine-grained detail, with errors in the Detail Refinement Stage resulting in loss of texture and edge blurring.

Importantly, the role and impact of each band are stage-dependent and often non-monotonic, necessitating a dynamic, stage-aware approach.
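
The band partition described above can be illustrated with a minimal sketch. It assumes the standard RoPE frequency schedule (theta_i = base^(-2i/d)) along a single spatial axis and classifies each frequency by its wavelength relative to the training grid side length. The function names and the three ratio thresholds are hypothetical choices made for illustration; the paper's actual band boundaries are not stated in this summary.

```python
import numpy as np

def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength (in token positions) of each RoPE frequency pair."""
    i = np.arange(0, head_dim, 2)
    theta = base ** (-i / head_dim)   # standard RoPE angular frequencies
    return 2.0 * np.pi / theta

def classify_bands(wavelengths, extent,
                   very_low_ratio=8.0, low_ratio=1.0, mid_ratio=0.25):
    """Label each frequency by wavelength / spatial extent.

    The ratio thresholds are illustrative assumptions, not values from the paper.
    """
    ratio = np.asarray(wavelengths) / float(extent)
    labels = np.full(ratio.shape, "high", dtype=object)
    labels[ratio >= mid_ratio] = "mid"            # a few cycles across the grid
    labels[ratio >= low_ratio] = "low"            # at most one cycle across the grid
    labels[ratio >= very_low_ratio] = "very_low"  # barely rotates: NoPE-like
    return labels

# Example: 64-dim heads on a 64-token training axis (hypothetical sizes).
wl = rope_wavelengths(64)
bands = classify_bands(wl, extent=64)
for name in ("high", "mid", "low", "very_low"):
    print(name, int(np.sum(bands == name)))
```

Measuring wavelength against the grid extent makes the labels depend on the training resolution rather than on the absolute frequency index, which matches the summary's statement that bands are delineated by wavelength relative to spatial extent.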

Stage-Aware RoPE Remapping

The principal methodological innovation is a training-free, progressive RoPE remapping strategy ("Stage-Aware RoPE Remapping") that adapts frequency assignments dynamically to each generative stage:

Very Low Frequencies: Uniformly replaced by NoPE.
Layout Construction Stage: Applies Position Interpolation (PI) for all other bands, aligning frequency scaling proportionally with extrapolated resolution and supporting faithful global structure formation.
Local Refinement Stage: Linearly interpolates between PI and YaRN (a frequency-dependent remapping originally developed for LLMs), shifting from global layout preservation toward a more localized focus.
Detail Refinement Stage: Fully switches to YaRN, optimizing high-frequency discrimination and detail rendering.

This band-stage mapping is formally expressed as an interpolation, where weights are stage-dependent and bands are delineated by wavelength relative to spatial extent.
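
A minimal sketch of how such a stage-dependent interpolation might look is given below. It is an illustration under stated assumptions, not the authors' implementation: PI is taken as dividing every frequency by the scale factor s, YaRN is reduced to its "NTK-by-parts" frequency ramp (its attention temperature term is omitted), the blend is applied directly to the frequencies, and NoPE is modeled by zeroing a frequency. The names `stage_weight`, `layout_end`, `local_end`, and `nope_rotations`, as well as the ramp limits `alpha` and `beta`, are hypothetical.

```python
import numpy as np

def pi_freqs(theta, s):
    """Position Interpolation: every frequency is divided by the scale factor."""
    return theta / s

def yarn_freqs(theta, s, train_len, alpha=1.0, beta=32.0):
    """YaRN-style ramp: interpolate low frequencies, leave high ones untouched."""
    rotations = train_len * theta / (2.0 * np.pi)   # full cycles over the training extent
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return (1.0 - gamma) * theta / s + gamma * theta

def stage_weight(scale_idx, layout_end, local_end):
    """0 in the layout stage, 1 in the detail stage, linear ramp in between."""
    if scale_idx < layout_end:
        return 0.0
    if scale_idx >= local_end:
        return 1.0
    return (scale_idx - layout_end + 1) / (local_end - layout_end + 1)

def stage_aware_freqs(theta, s, train_len, scale_idx,
                      layout_end, local_end, nope_rotations=0.125):
    """Blend PI and YaRN by stage; zero out very low frequencies (NoPE)."""
    w = stage_weight(scale_idx, layout_end, local_end)
    blended = (1.0 - w) * pi_freqs(theta, s) + w * yarn_freqs(theta, s, train_len)
    rotations = train_len * theta / (2.0 * np.pi)
    return np.where(rotations < nope_rotations, 0.0, blended)

# Example: 2x extrapolation of a 64-token axis, 11 scales (hypothetical numbers).
theta = 10000.0 ** (-np.arange(0, 64, 2) / 64)
f_layout = stage_aware_freqs(theta, 2.0, 64, scale_idx=0, layout_end=4, local_end=8)
f_local = stage_aware_freqs(theta, 2.0, 64, scale_idx=5, layout_end=4, local_end=8)
f_detail = stage_aware_freqs(theta, 2.0, 64, scale_idx=10, layout_end=4, local_end=8)
```

With these definitions the layout stage reproduces PI, the detail stage reproduces the YaRN ramp, and the local refinement scales fall strictly between the two.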

Entropy-Driven Adaptive Attention Calibration

Self-attention becomes more dispersed as resolution grows, which dilutes attention maps and further degrades detail. To address this, the paper introduces a resolution-invariant, entropy-driven adaptive attention calibration. Previous approaches relied on raw entropy or fixed, hand-tuned attention scaling factors, which neither adapt to the target resolution nor capture the actual extent of dispersion. The calibration has three components:

Normalized Attention Entropy: Defined to be invariant to token count, permitting direct, cross-resolution comparison.
Closed-Form Scaling Factor: For each attention head, the scaling factor required to match training-resolution entropy is estimated via a Taylor expansion, yielding an efficient solution for per-head calibration.
Selective Application: Calibration is activated primarily in the Detail Refinement Stage or when entropy falls below a threshold, ensuring stage-local and numerically stable adaptation.
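
The calibration idea can be sketched as follows. The exact formulas are not given in this summary, so the code makes two assumptions: normalized entropy is taken as Shannon entropy divided by log N (N being the number of attended tokens), and the scaling factor comes from a first-order Taylor expansion of the entropy around a scale of 1, using the identity dH/dλ = -λ·Var_p(z) for p = softmax(λz). This stands in for the paper's closed form and may differ from it; all function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def normalized_entropy(logits):
    """Shannon entropy of softmax(logits), divided by log(N) so it lies in [0, 1]."""
    p = softmax(logits)
    h = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return h / np.log(logits.shape[-1])

def first_order_scale(logits, target_norm_entropy):
    """Per-row logit scale that moves normalized entropy toward the target.

    First-order step: H(lam) ~ H(1) - (lam - 1) * Var_p(z), solved for lam.
    """
    p = softmax(logits)
    h = -(p * np.log(p + 1e-12)).sum(axis=-1)
    mean = (p * logits).sum(axis=-1, keepdims=True)
    var = (p * (logits - mean) ** 2).sum(axis=-1)   # variance of logits under p
    h_target = target_norm_entropy * np.log(logits.shape[-1])
    return 1.0 + (h - h_target) / np.maximum(var, 1e-8)

# Example: one head attending over 1024 tokens with synthetic logits.
logits = np.linspace(-3.0, 3.0, 1024)[None, :]
target = 0.85   # stand-in for the training-resolution normalized entropy
h_before = normalized_entropy(logits)
lam = first_order_scale(logits, target)
h_after = normalized_entropy(lam[:, None] * logits)
print(h_before, lam, h_after)
```

Because the entropy is normalized by log N, the same target value can be compared across resolutions with different token counts. A scale above 1 sharpens the attention distribution and pulls the entropy back toward the training-resolution level.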

Empirical Evaluation

Evaluation is performed on Infinity, a strong VAR-based text-to-image model, extrapolated up to $2560 \times 2560$ resolution. Comparison baselines span PE, PI, NTK-aware scaling, YaRN, RiFlex, and DyPE, the latter two representing strong recent training-free diffusion approaches.

Qualitative Results: ExtraVAR preserves global structure, eliminates local repetition, and retains fine details across all benchmarks, outperforming both LLM- and diffusion-based remapping approaches, which either succumb to repetitive artifacts or lose fidelity as scaling increases.
Quantitative Results: On GenEval, DPG-Bench, and HPSv2.1, ExtraVAR achieves the best scores among the compared methods, surpassing PI [5] by 0.13 on GenEval (overall) and by 0.89 on DPG-Bench (overall), with especially pronounced improvements on tasks requiring compositional and spatial alignment or fine detail preservation. On HPSv2.1, ExtraVAR improves over PI by 2.11 overall, a margin that is consistent across both photographic and artistic domains.
Ablation Studies: Isolate the contributions of Stage-Aware Remapping and Attention Calibration, confirming their additive, complementary effects across all major benchmarks and minimal sensitivity to the precise delineation of stage boundaries.

Implications and Future Directions

The findings have substantial implications for scaling VAR models beyond their training regime without incurring the computational cost of retraining. The stage-aware remapping shows that successful generalization in sequence models with spatial structure requires dynamic, context-adaptive positional encoding; static solutions developed for language do not suffice for hierarchical image generation. The entropy-calibrated attention mechanism adds a further layer of adaptability, mitigating the distributional shift caused by the larger context size.

Practically, ExtraVAR provides a highly effective, zero-shot, inference-time solution for applications in high-resolution content synthesis, offering utility in design, advertising, digital arts, and beyond. Theoretically, these results underscore the importance of characterizing and respecting the stage-wise structure and frequency-domain requirements of generative models, especially as research progresses towards even larger, multi-modal, or open-field autoregressive systems.

Future work may address adaptive, data-driven identification of generative stages, more sophisticated remapping families, or tighter integration with dynamic architectural scaling. Generalization to video-VAR, multi-modal transformers with complex spatial-temporal dependencies, or cascaded generation frameworks remains an open and promising avenue.

Conclusion

ExtraVAR presents a rigorous, task-specific, training-free protocol for resolution extrapolation in autoregressive vision models, addressing intrinsic failure modes via a principled band-stage decomposition and adaptive attention. It establishes a new state-of-the-art for structural and detail fidelity in upscaled image synthesis and provides a methodological template for further advances in scale-robust generative modeling.

Reference:

"ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models" (2605.10045)
