
YaRN: Extending RoPE for Transformer Contexts

Updated 27 January 2026
  • YaRN is a frequency-aware methodology that interpolates low and mid rotation angles to extend RoPE-based transformer contexts while preserving local detail.
  • It employs adaptive attention temperature scaling to maintain effective dot-product variance, thereby reducing perplexity in long sequence modeling.
  • YaRN extends to multimodal settings with Partial YaRN and VLAT, significantly improving audio-language modeling and overall long-range performance.

YaRN (Yet another RoPE extensioN) is a methodology for extending the context window of rotary position embedding (RoPE)-based transformers, such as LLaMA and PaLM, while preserving or improving sequence modeling capacity and computational efficiency. YaRN comprises targeted frequency-wise interpolation of RoPE’s rotation angles and adaptive attention temperature scaling, with recent variants and extensions for multimodal (audio-language) settings.

1. Motivation: Limitations of Standard RoPE

RoPE parameterizes absolute token positions using rotations in subspaces of the embedding; for position $m$, a token embedding is rotated by $m \cdot \theta_i$ radians in each 2D subspace. The pairwise attention scores depend only on the relative displacement $n - m$. However, RoPE’s rotation frequencies $\theta_i$ are selected so that all positions $0 \leq m < L$ are well represented within an “envelope” determined at pre-training. When models are applied to inputs much longer than $L$, dot products become degenerate: high-frequency components wrap incoherently and low-frequency components provide insufficient resolution, leading to increased perplexity and degraded in-context learning (Peng et al., 2023).
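A minimal NumPy sketch of this rotation makes the relative-position property concrete (the helper name is illustrative; base 10000 is the conventional RoPE default):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Rotate each 2D subspace of embedding x (even length d)
    by m * theta_i radians, as in standard RoPE."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-subspace frequencies
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2D rotation, first component
    out[..., 1::2] = x1 * sin + x2 * cos        # 2D rotation, second component
    return out
```

The dot product of `rope_rotate(q, m)` and `rope_rotate(k, n)` depends only on $n - m$, which is the property any context-extension scheme must preserve.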

Context extension via naive position interpolation (PI) rescales all frequencies by a factor $s = L'/L$ for a new, longer window $L'$, but results in loss of either local detail or global coherence, since all frequencies are compressed uniformly. This motivates frequency-selective and scale-aware approaches such as YaRN.

2. Methodology: YaRN RoPE Modification

YaRN implements a two-part recipe:

A. Frequency-Wise Interpolation (NTK-by-Parts):

  • The RoPE spectrum is divided into disjoint low, mid, and high-frequency bands.
    • Low frequencies are interpolated by $\theta_i \rightarrow \theta_i / s$.
    • High frequencies are left unmodified: $\theta_i \rightarrow \theta_i$.
    • Mid frequencies are linearly interpolated, with an effective scaling factor $\alpha_i \in [1/s, 1]$.
  • The precise interpolation per dimension $i$ is governed by the ratio $r(i) = L/\lambda_i = L\theta_i / 2\pi$ with thresholds:

$$\gamma(r) = \begin{cases} 0 & r < \alpha \\ 1 & r > \beta \\ (r-\alpha)/(\beta-\alpha) & \text{otherwise} \end{cases}$$

and final frequency

$$\theta_i' = (1-\gamma(r)) \frac{\theta_i}{s} + \gamma(r)\, \theta_i$$
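The ramp $\gamma(r)$ and the interpolated frequencies can be sketched in a few lines of NumPy (the function name and the default thresholds $\alpha = 1$, $\beta = 32$ are illustrative, not prescribed by this summary):

```python
import numpy as np

def yarn_frequencies(dim, L, s, base=10000.0, alpha=1.0, beta=32.0):
    """NTK-by-parts sketch: interpolate RoPE frequencies per dimension.

    dim: head dimension (even); L: pre-training context length;
    s: scale factor L'/L; alpha/beta: ramp thresholds on r(i).
    """
    i = np.arange(0, dim, 2)
    theta = base ** (-i / dim)                  # standard RoPE frequencies
    r = L * theta / (2 * np.pi)                 # r(i) = L / lambda_i
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    # gamma = 0 (low frequency): fully interpolated to theta / s
    # gamma = 1 (high frequency): left unchanged
    return (1.0 - gamma) * theta / s + gamma * theta
```

For a LLaMA-style head (`dim = 128`, `L = 4096`), the highest-frequency dimensions land above $\beta$ and keep their original $\theta_i$, while the lowest land below $\alpha$ and are divided by $s$.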

B. Attention Temperature Scaling:

  • As $s$ increases, the average attention dot-product magnitude shrinks.
  • YaRN introduces a scalar temperature $t$ inside the attention softmax: $\text{softmax}((Q^\top K)/(t\sqrt{d}))$; for $s > 1$ the fitted $t$ is less than 1, so the logits are amplified.
  • In practice, $Q$ and $K$ are both rescaled by $1/\sqrt{t}$ (equivalently, the RoPE rotation matrices are scaled accordingly).
  • Empirical fit: $\sqrt{1/t} \approx 0.1 \cdot \ln(s) + 1$ for $s \in [2, 32]$ minimizes perplexity (Peng et al., 2023).
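A hedged NumPy sketch of how the temperature is typically folded into the projections (single head; function names are illustrative):

```python
import numpy as np

def yarn_qk_scale(s):
    # Empirical fit sqrt(1/t) = 0.1 * ln(s) + 1; multiplying Q and K
    # each by this factor divides the attention logits by t overall.
    return 0.1 * np.log(s) + 1.0

def attention(Q, K, V, s=1.0):
    """Single-head attention with the YaRN temperature folded into Q and K.
    Q, K, V: arrays of shape [n, d]."""
    m = yarn_qk_scale(s)
    logits = (Q * m) @ (K * m).T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Folding the factor into $Q$ and $K$ leaves the softmax kernel itself unmodified, which is one reason the change is minimally invasive.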

Implementation is minimally invasive: only the rotation frequency calculation and a scaling factor are patched in the attention blocks.

3. Extensions to Multimodal Models: Partial YaRN and VLAT

In large audio-LLMs (LALMs) with joint audio and text sequences, context window limitations primarily affect the audio region. Applying global YaRN disturbs language modeling capabilities tuned in pre-training. Partial YaRN therefore constrains the RoPE modification to the audio token segment $[p, p + L_\text{audio})$, leaving text tokens outside this window untouched (Chaichana et al., 17 Oct 2025). Concretely:

  • The audio position-ids $m \in [p, p + L_\text{audio})$ are mapped to $m' = p + (m - p) \cdot s$; text positions before and after are unchanged: $m' = m$.
  • Typically, a two-group frequency split is used (all low-frequency dimensions $i \leq k_\text{cutoff}$ interpolated, high-frequency dimensions $i > k_\text{cutoff}$ left unchanged).
  • Attention temperature scaling is applied analogously to unimodal YaRN.
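The position-id remapping above can be sketched as follows (the helper name is illustrative):

```python
def partial_yarn_position_ids(position_ids, p, audio_len, s):
    """Rescale only positions inside the audio span [p, p + audio_len):
    m' = p + (m - p) * s; text positions pass through unchanged."""
    out = []
    for m in position_ids:
        if p <= m < p + audio_len:
            out.append(p + (m - p) * s)   # compress/stretch audio span
        else:
            out.append(m)                 # text positions untouched
    return out
```

For example, with `p = 2`, `audio_len = 3`, and `s = 0.5`, positions 2, 3, 4 map to 2, 2.5, 3, while all other positions stay where they were.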

Virtual Longform Audio Training (VLAT) further generalizes Partial YaRN as a training-time positional augmentation scheme. During fine-tuning:

  • For each batch, a random “virtual” audio length $L_\text{virt}$ is sampled, and audio positions are remapped into the actual data window via $s = L_\text{data}/L_\text{virt}$, $m' = p + (m - p) \cdot s$.
  • The model thus learns to attend over a wide range of audio spans, enabling robust generalization to much longer contexts than those observed in training (Chaichana et al., 17 Oct 2025).
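A minimal sketch of the per-batch scale sampling (the names and the uniform sampling distribution are assumptions; the paper's exact sampling scheme may differ):

```python
import random

def vlat_scale(data_len, virt_min, virt_max, rng=random):
    """Sample a virtual audio length L_virt and return the positional
    scale s = L_data / L_virt used to remap audio positions."""
    l_virt = rng.uniform(virt_min, virt_max)   # assumed uniform sampling
    return data_len / l_virt
```

The resulting `s` is then applied to the audio span exactly as in Partial YaRN ($m' = p + (m - p) \cdot s$), so each batch trains the model on a different effective audio density.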

4. Empirical Results

Across full-text and audio-language modeling, YaRN demonstrates strong improvements in context window extension:

| Model/Setting | Context Length | Perplexity/Accuracy (Long Form) |
|---|---|---|
| LLaMA-2 7B + PI ($s=2$) | 8K–128K | >100 at 128K |
| LLaMA-2 7B + YaRN ($s=32$) | 128K | 2.37 |
| SALMONN + Partial YaRN | 5–10 min | +13–14 absolute MCQA gain |
| Qwen2-Audio + VLAT (10 min audio) | 10 min | 81.7% (with Partial PI at inference) |

YaRN preserves standard language modeling and reasoning accuracy (≤0.5% drop) on HuggingFace OpenLLM benchmarks at $L' = 64\text{K}$ and $128\text{K}$ (Peng et al., 2023).

Key ablations confirm:

  • Two-group frequency split and attention-temperature scaling are essential for stability under extreme extension (Chaichana et al., 17 Oct 2025).
  • VLAT provides dramatic accuracy increases on very long-form audio, e.g., from 32.8% (vanilla) to 81.7% (VLAT + Partial PI) at 10 min (Chaichana et al., 17 Oct 2025).

5. Theoretical and Practical Considerations

Conceptual Justification:

YaRN’s targeted interpolation is supported by the observation that crucial token relationships at long range are represented primarily in the low-frequency RoPE dimensions. Untouched high-frequency slots retain local resolution. The softmax temperature scaling counters the contraction in dot-product variance.

Practical Implementation:

  • YaRN requires minimal compute for fine-tuning: typical LLaMA extensions to $L' = 65{,}536$ or $131{,}072$ are achieved with 200–600 optimization steps and ~0.1% of the original pre-training token count.
  • Inference performance is unchanged; YaRN is compatible with key/value caching provided attention is properly re-applied on context changes.

Limitations and Edge Cases:

  • Extremely large scaling ($s \gg 32$) may “under-train” high frequencies; the $\alpha, \beta$ interpolation thresholds may require model-specific tuning.
  • In zero-shot (“no fine-tune”) settings, Dynamic-YaRN applies scaling and temperature on-the-fly, extending up to $2\times$ the original window (Peng et al., 2023).
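Dynamic scaling is straightforward to sketch: the scale factor is recomputed from the current sequence length at inference time (the function name is illustrative):

```python
def dynamic_scale(current_len, trained_len):
    """On-the-fly YaRN scale: 1 inside the trained window,
    current_len / trained_len beyond it."""
    return max(1.0, current_len / trained_len)
```

Within the pre-trained window the model behaves exactly as trained ($s = 1$); beyond it, the interpolation and temperature grow smoothly with the sequence length.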

6. Relation to Other RoPE Extensions

YaRN is contrasted with both:

  • Position Interpolation (PI): Uniform frequency scaling, which leads to sub-optimal retention of local or global structure.
  • Imaginary Extension (RoPE++): Re-introduces the imaginary component of the complex dot-product into attention (“dual-component” heads) to better preserve long-range dependencies and “semantic aggregation.” This approach is sometimes referred to in the literature interchangeably with YaRN, but differs mechanistically by augmenting attention heads and leveraging phase cues in the complex plane (Liu et al., 8 Dec 2025).

RoPE++ studies empirically confirm that the imaginary heads’ sine-integral decay provides robust long-range token-pair modeling, with two configurations trading memory for compute (equal-heads halves key-value cache, equal-cache doubles attention heads). This approach is complementary but distinct from YaRN’s frequency-wise interpolation and temperature scaling methodology.

7. Impact and Future Directions

YaRN establishes that principled, frequency-aware modification of positional embeddings permits efficient large-scale extension of transformer context windows with negligible degradation of core language/semantic understanding. Its extension to multimodal domains (Partial YaRN, VLAT) addresses a major bottleneck for audio-language and other long-form context modeling settings (Chaichana et al., 17 Oct 2025).

YaRN remains a state-of-the-art, compute-efficient, and theoretically grounded strategy for addressing transformer context limitations across both unimodal and multimodal sequence modeling applications (Peng et al., 2023, Chaichana et al., 17 Oct 2025, Liu et al., 8 Dec 2025).
