
Axial RoPE: Banded Scaling for Quantized LLMs

Updated 8 December 2025
  • Axial RoPE is a technique that partitions RoPE frequency channels into logarithmically-spaced bands with per-band scaling to address quantization challenges in long-context LLMs.
  • It applies diagnostic metrics like interpolation pressure and tail inflation ratio to adjust phase angles per band, effectively reducing aliasing and outlier amplification.
  • Empirical results on quantized LLaMA-2 models demonstrate that axial RoPE lowers perplexity and improves zero-shot accuracy without requiring re-training or architecture changes.

Axial RoPE refers to the practice of partitioning the frequency channels of Rotary Position Embeddings (RoPE) into logarithmically-spaced bands and performing per-band scaling, rather than applying a uniform scaling factor, particularly for position interpolation (PI) in quantized long-context LLMs. This approach, introduced in Q-ROAR, addresses the performance degradation observed when PI is combined with post-training quantization (PTQ), notably resolving dynamic range and outlier issues without requiring re-training or kernel modifications (Qiao et al., 17 Sep 2025).

1. Background and Motivation

Rotary Position Embeddings (RoPE) encode sequence position information in transformer models by applying rotations in the complex plane, parameterized by a set of geometric sequence frequencies $\{\omega_i\}$. Extending the effective input window beyond the pre-training context, which is critical for long-range LLM tasks, commonly relies on PI techniques that scale these frequencies. However, the use of post-training quantization (PTQ) for practical deployment introduces significant challenges when combined with PI. Notably, quantized models employing standard (uniform) PI exhibit severe accuracy and perplexity degradation due to two coupled failure modes:

  • Aliasing and Dynamic-range Dilation: Scaling the RoPE frequencies causes rapid phase wrapping, inflating the activation tails and driving pre-activations into quantizer clipping regions.
  • Axis-grid Anisotropy and Outlier Shifting: RoPE operates via $2 \times 2$ rotations, but quantizers, assuming axis-aligned distributions, suffer from axis misalignment and further outlier amplification, introducing logit noise and worsening perplexity and accuracy.
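As background for the two failure modes above, the rotation itself and uniform PI can be sketched in a few lines of NumPy. This is a minimal sketch, not the Q-ROAR code: the function names, the interleaved pairing of channels, and the base of 10000 (the common RoPE default) are assumptions made for illustration.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    # Assumption: the common RoPE choice omega_i = base**(-2i/D), i = 0..D/2-1,
    # which is a geometric sequence of D/2 frequencies.
    return base ** (-np.arange(0, dim, 2) / dim)

def apply_rope(x, positions, omega, scale=1.0):
    # x: (T, D) float array, positions: (T,), omega: (D/2,).
    # Each consecutive channel pair (2i, 2i+1) is rotated by position * omega_i / scale.
    # scale > 1 corresponds to uniform position interpolation (one factor for all channels).
    theta = positions[:, None] * omega[None, :] / scale
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because each pair is rotated, vector norms are unchanged, but the individual coordinates that an axis-aligned quantizer sees do change with position.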

In Q-ROAR, these limitations motivate the development of an axis-aware, banded scaling scheme for RoPE, termed axial RoPE (Qiao et al., 17 Sep 2025).

2. Frequency Band Partitioning in RoPE

RoPE represents the model’s hidden dimension $D$ as $D/2$ complex-valued channels, each associated with a frequency $\omega_i = \omega_{\min} \cdot \rho^{i-1}$, where $\rho = (\omega_{\max}/\omega_{\min})^{1/(D/2-1)}$. Axial RoPE partitions these $D/2$ channels into $B$ logarithmically-spaced bands, yielding band boundaries:

$$\Omega_b = \left\{ \omega_{\min} \cdot \rho^{(b-1)\frac{D}{2B}}, \ldots, \omega_{\min} \cdot \rho^{b\frac{D}{2B}} \right\}, \quad b=1,\ldots,B$$

Each frequency band $\mathcal{B}_b$ is defined as:

$$\mathcal{B}_b = \{ i=1, \ldots, D/2 : \omega_i \in [\Omega_{b-1}, \Omega_b) \}$$

Equivalently, the assignment from channel index $i$ to band index $b$ is given by:

$$b(i) = \left\lfloor (i-1) \cdot B / (D/2) \right\rfloor + 1$$
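The frequency ladder and the index-to-band map above translate directly into code. The following is a minimal sketch with hypothetical function names; it uses 1-based channel and band indices to match the formulas.

```python
import numpy as np

def geometric_frequencies(omega_min, omega_max, num_channels):
    # omega_i = omega_min * rho**(i-1), with rho = (omega_max/omega_min)**(1/(D/2 - 1)).
    rho = (omega_max / omega_min) ** (1.0 / (num_channels - 1))
    return omega_min * rho ** np.arange(num_channels)

def band_assignments(num_channels, num_bands):
    # b(i) = floor((i-1) * B / (D/2)) + 1 for i = 1..D/2 (1-based bands).
    i = np.arange(1, num_channels + 1)
    return (i - 1) * num_bands // num_channels + 1

# Example: 8 channels split into 4 bands of 2 channels each.
bands = band_assignments(num_channels=8, num_bands=4)
```

Since the frequencies are geometric in the channel index, equal-width index bands are equal-width bands in log-frequency.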

This structure underpins the mixed-frequency scaling used in Q-ROAR.

3. Per-Band Rescaling Methodology: Q-ROAR

For each frequency band $\mathcal{B}_b$, a scale factor $g_b$ is introduced. This allows differentiated adjustment of RoPE phase angles within each band, addressing over-inflation or compression of the activation distribution caused by uniform scaling. The per-band scaling is operationalized by rescaling the corresponding rows of the $W_Q$ and $W_K$ projection matrices. In shared mode:

$$W_Q^{(b)} \leftarrow g_b W_Q^{(b)}, \qquad W_K^{(b)} \leftarrow g_b W_K^{(b)}$$

resulting in an effective phase angle $\hat{\theta}_i = g_b \theta_i$ for $i \in \mathcal{B}_b$.

A symmetric variant scales $W_K$ by the inverse, preserving the product scale:

$$W_Q^{(b)} \leftarrow g_b W_Q^{(b)}, \qquad W_K^{(b)} \leftarrow g_b^{-1} W_K^{(b)}$$

This maintains $Q_i \cdot K_i^\top$ invariant, removing the need to tune softmax temperature or retrain.
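The two row-rescaling modes can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: it treats a single attention head, assumes projections of shape (D, d_model) whose rows (2i, 2i+1) form channel i, and uses hypothetical function names.

```python
import numpy as np

def per_row_scales(g, num_channels):
    # Expand per-band factors g (length B) to one factor per projection row.
    # Assumption: rows (2i, 2i+1) belong to channel i, so both rows share g_b.
    g = np.asarray(g, dtype=float)
    bands = np.arange(num_channels) * len(g) // num_channels  # 0-based band per channel
    return np.repeat(g[bands], 2)

def rescale_qk(W_q, W_k, g, symmetric=True):
    # Shared mode:    W_q <- g_b * W_q, W_k <- g_b * W_k
    # Symmetric mode: W_q <- g_b * W_q, W_k <- W_k / g_b
    s = per_row_scales(g, W_q.shape[0] // 2)[:, None]
    W_q_new = W_q * s
    W_k_new = W_k / s if symmetric else W_k * s
    return W_q_new, W_k_new
```

In the symmetric mode each coordinate of the query is multiplied by a factor that the matching key coordinate divides out, so the query-key dot product is unchanged; in shared mode the contribution of band b to the logit is multiplied by the square of its factor.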

Parameter selection employs a small grid search over the $g_b$ within intervals derived from diagnostics, evaluated using a length-weighted perplexity objective $\ell(\{g_b\})$ over a long-context dev set.
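One way such a search could look is sketched below. The source does not specify the search order, so the coordinate-wise sweep (one band at a time), the grid resolution, and the `objective` callable are all assumptions; `objective` stands in for evaluating the perplexity objective with a candidate set of band factors.

```python
import numpy as np

def grid_search_band_scales(objective, g_min, g_max, points=5, sweeps=2):
    # objective: hypothetical callable mapping a vector of band factors to a loss.
    # g_min, g_max: per-band interval bounds (length B each).
    g_min = np.asarray(g_min, dtype=float)
    g_max = np.asarray(g_max, dtype=float)
    g = np.clip(np.ones_like(g_min), g_min, g_max)  # start as close to 1 as allowed
    best = objective(g)
    for _ in range(sweeps):
        for b in range(len(g)):
            # Try each grid point for band b while holding the other bands fixed.
            for cand in np.linspace(g_min[b], g_max[b], points):
                trial = g.copy()
                trial[b] = cand
                loss = objective(trial)
                if loss < best:
                    best, g = loss, trial
    return g, best
```

A coordinate-wise sweep keeps the number of objective evaluations linear in the number of bands, whereas a full Cartesian grid grows exponentially with it.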

4. Diagnostics: Interpolation Pressure and Tail Inflation Ratios

Q-ROAR introduces two diagnostics to identify safe per-band scaling intervals and to prevent aliasing and tail inflation:

  • Interpolation Pressure (IP) measures the sensitivity of phase error to scaling for each channel $i$:

$$IP_i = |\partial \epsilon_i / \partial s_i| = \omega_i \cdot f(D) / s_i^2$$

High-frequency bands (large $\omega_i$) demand scaling close to unity to prevent excessive wrapping.

  • Tail Inflation Ratio (TIR) quantifies the proportionate increase in pre-activation tails after PI:

$$TIR_b^{W} = Q_{|W_b h|, \text{long}}(1-\varepsilon) / Q_{|W_b h|, \text{short}}(1-\varepsilon)$$

where $Q_{|W_b h|,\cdot}(1-\varepsilon)$ denotes the $1 - \varepsilon$ quantile (e.g. $\varepsilon = 10^{-3}$) of the indicated distribution. Bands with large TIR indicate dangerous outlier amplification and should be down-scaled. Analogous diagnostics are applied to activation tails ($TIR_b^{A}$).
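The TIR definition above is a ratio of two upper quantiles, which can be sketched with `numpy.quantile`. The function names and the layout of the inputs (samples in rows, projection outputs in columns, with a 0-based band label per column) are assumptions for illustration; how the short- and long-context samples are collected is left to the caller.

```python
import numpy as np

def tail_inflation_ratio(short_vals, long_vals, eps=1e-3):
    # Ratio of the (1 - eps) quantiles of |values| at long vs. short context.
    q_short = np.quantile(np.abs(short_vals), 1.0 - eps)
    q_long = np.quantile(np.abs(long_vals), 1.0 - eps)
    return q_long / q_short

def tir_per_band(short_acts, long_acts, band_of_column, num_bands, eps=1e-3):
    # short_acts, long_acts: (N, D) pre-activations; band_of_column: (D,) 0-based band labels.
    return np.array([
        tail_inflation_ratio(short_acts[:, band_of_column == b],
                             long_acts[:, band_of_column == b], eps)
        for b in range(num_bands)
    ])
```

A value near 1 means the tail of that band is about as heavy at long context as at short context; values well above 1 flag bands whose outliers grow under PI.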

These diagnostics guide the construction of scaling intervals:

$$g_b^{\min} = 1 / \left(1 + \tau / (1 + \log(\omega_{b,\operatorname{med}} / \omega_{\min}))\right), \qquad g_b^{\max} = \kappa / TIR_b^W$$

with $\tau \in [0.2, 0.5]$, $\kappa \in [1.0, 1.3]$. This strategy enforces stricter control over high-frequency bands while protecting low-frequency expressivity.
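The interval bounds can be evaluated directly from a band's median frequency and its TIR. The sketch below assumes the natural logarithm (the base is not stated in the text) and picks default values of tau and kappa from inside the quoted ranges; the function name is hypothetical.

```python
import numpy as np

def scale_interval(omega_med, omega_min, tir_w, tau=0.3, kappa=1.1):
    # g_min = 1 / (1 + tau / (1 + log(omega_med / omega_min)))
    # g_max = kappa / TIR_b^W
    # Assumption: natural log; tau in [0.2, 0.5] and kappa in [1.0, 1.3] per the text.
    g_lo = 1.0 / (1.0 + tau / (1.0 + np.log(omega_med / omega_min)))
    g_hi = kappa / tir_w
    return g_lo, g_hi

# Lowest band (median frequency equal to omega_min) vs. a band 1000x higher in frequency.
low_band = scale_interval(omega_med=1e-4, omega_min=1e-4, tir_w=1.1)
high_band = scale_interval(omega_med=1e-1, omega_min=1e-4, tir_w=1.1)
```

The lower bound moves toward 1 as the band's median frequency rises, so high-frequency bands get a narrower range below unity. Nothing in the formulas prevents the upper bound from falling below the lower bound when TIR is large; the sketch does not handle that case.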

5. Empirical Outcomes and Robustness

Evaluation on AWQ- and RTN-quantized LLaMA-2-7B models extended from 4K to 32K and 64K tokens via YaRN demonstrates the effectiveness of axial RoPE via Q-ROAR. Key empirical results (Qiao et al., 17 Sep 2025):

| Method | GovReport PPL (32K) | Zero-shot Acc (%, 32K) | GovReport PPL (64K) | Zero-shot Δ (abs.) |
| --- | --- | --- | --- | --- |
| AWQ + YaRN | 6.31 | 63.52 | - | - |
| RTN + YaRN | >6.19 | <63.96 | 6.71 | baseline |
| Q-ROAR | 6.19 | 63.96 | 5.83 | +0.7 |

Q-ROAR's mixed-frequency scaling reduces GovReport 32K-token perplexity by over 10% relative to RTN and by 2% over AWQ. At 64K, Q-ROAR lowers 32768-token PPL from 6.30 to 5.83 (AWQ) and from 6.71 to 5.83 (RTN). Short-context (4K) WikiText2 PPL remains within +0.02 of the FP16 baseline, demonstrating that Q-ROAR does not regress short-context performance. Per-band scaling confers uniform advantage across context sizes and quantization strategies.
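For readers who want to relate the quoted percentages to the perplexity figures, the relative reduction is simply the drop divided by the baseline. The snippet below only re-computes ratios from numbers already reported in this section; the helper name is hypothetical.

```python
def relative_reduction(baseline_ppl, new_ppl):
    # Fractional perplexity reduction relative to the baseline.
    return (baseline_ppl - new_ppl) / baseline_ppl

# Perplexity pairs (baseline, Q-ROAR) as reported in this section.
reported = {
    "AWQ + YaRN, 32K": (6.31, 6.19),
    "RTN + YaRN, 64K": (6.71, 5.83),
}

for name, (baseline, q_roar) in reported.items():
    print(f"{name}: {100 * relative_reduction(baseline, q_roar):.1f}% lower PPL")
```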

6. Implementation Details and Practical Considerations

Q-ROAR is weight-only, requires no fine-tuning, kernel changes, or architecture changes, and is fully compatible with pre-existing inference stacks. The search for per-band scaling factors $g_b$ leverages a minimal long-context dev set (≈10 documents, each ≥60K tokens), and evaluates candidate scalings using the length-weighted perplexity objective, reusing the KV-cache for computational efficiency. The symmetric variant is preferred if it preserves stability.
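The text does not spell out how the length-weighted perplexity objective is defined. One plausible reading, shown here purely as an assumption, averages the perplexity measured at several evaluation lengths with weights proportional to those lengths, so that long-context behaviour dominates the score. The function name and input format are hypothetical.

```python
import numpy as np

def length_weighted_perplexity(mean_nll_by_length):
    # mean_nll_by_length: dict mapping evaluation length (tokens) to the mean
    # per-token negative log-likelihood measured at that length.
    # Assumption: weights are proportional to length; the source does not specify them.
    lengths = np.array(list(mean_nll_by_length.keys()), dtype=float)
    ppl = np.exp(np.array(list(mean_nll_by_length.values()), dtype=float))
    weights = lengths / lengths.sum()
    return float(np.sum(weights * ppl))
```

A function of this shape could serve as the `objective` in a search over band factors, once wrapped around a model evaluation that produces the per-length losses.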

A plausible implication is that the diagnostics-driven, multi-band scaling approach of axial RoPE offers a general solution to aliasing and outlier issues in quantized long-context LLMs using RoPE-based PI, without incurring the deployment costs of retraining or system modification.

Axial RoPE sits at the intersection of position encoding techniques for transformers and the practical constraints of low-precision, high-throughput LLM deployment. Earlier PI approaches such as linear scaling apply a single scaling factor to all RoPE frequencies, while frequency-aware schemes such as YaRN vary the interpolation across dimensions; both degrade in the presence of quantization. Q-ROAR’s banded approach leverages the geometric structure of RoPE frequencies and hands off band-wise anisotropy and tail risk to automated diagnostics, thereby stabilizing both statistical and operational properties across extended context lengths.

This band-wise adjustment is distinct from prior uniform PI and frequency-aware methods by systematically controlling the impact of high-frequency bands on quantizer tail inflation and logit noise, all within post-training, weight-only adaptation. The empirical gains in perplexity and accuracy support the conclusion that axial RoPE is an essential refinement for robust, practical long-context quantized transformers (Qiao et al., 17 Sep 2025).
