YaRN: Extending RoPE for Transformer Contexts
- YaRN is a frequency-aware methodology that interpolates low and mid rotation angles to extend RoPE-based transformer contexts while preserving local detail.
- It employs adaptive attention temperature scaling to maintain effective dot-product variance, thereby reducing perplexity in long sequence modeling.
- YaRN extends to multimodal settings with Partial YaRN and VLAT, significantly improving audio-language modeling and overall long-range performance.
YaRN (Yet another RoPE extensioN) is a methodology for extending the context window of rotary position embedding (RoPE)-based transformers, such as LLaMA and PaLM, while preserving or improving sequence modeling capacity and computational efficiency. YaRN comprises targeted frequency-wise interpolation of RoPE’s rotation angles and adaptive attention temperature scaling, with recent variants and extensions for multimodal (audio-language) settings.
1. Motivation: Limitations of Standard RoPE
RoPE parameterizes absolute token positions using rotations in 2D subspaces of the embedding: for position $m$, the components in subspace $d$ are rotated by $m\theta_d$ radians, with frequencies $\theta_d = b^{-2d/|D|}$ (base $b = 10000$). The pairwise attention scores then depend only on the relative displacement $m - n$. However, RoPE's rotation frequencies are selected so that all positions are well represented within an "envelope" determined by the pre-training context length $L$. When models are applied to inputs much longer than $L$, dot products become degenerate: high-frequency components wrap incoherently and low-frequency components provide insufficient resolution, leading to increased perplexity and degraded in-context learning (Peng et al., 2023).
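The rotation scheme and its relative-position property can be sketched in a few lines of plain Python (a pedagogical sketch, not a production implementation; `rope_rotate` and `dot` are illustrative helper names):

```python
import math

def rope_rotate(x, m, base=10000.0):
    """Rotate embedding x (even length |D|) at position m: each pair
    (x[2d], x[2d+1]) turns by m * theta_d radians, theta_d = base**(-2d/|D|)."""
    D = len(x)
    out = []
    for d in range(D // 2):
        theta = base ** (-2.0 * d / D)
        c, s = math.cos(m * theta), math.sin(m * theta)
        x1, x2 = x[2 * d], x[2 * d + 1]
        out += [x1 * c - x2 * s, x1 * s + x2 * c]
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# The attention score depends only on the displacement m - n:
q = [0.3, -1.2, 0.8, 0.5, -0.7, 0.1, 1.1, -0.4]
k = [1.0, 0.2, -0.3, 0.9, 0.4, -0.8, 0.6, 0.7]
assert abs(dot(rope_rotate(q, 5), rope_rotate(k, 3))
           - dot(rope_rotate(q, 12), rope_rotate(k, 10))) < 1e-9
```

The final assertion holds because rotations within a subspace compose: $\langle R(m)q, R(n)k\rangle = \langle q, R(n-m)k\rangle$, and both pairs have displacement $-2$.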
Context extension via naive position interpolation (PI) rescales all frequencies by a factor $1/s$ (equivalently, positions $m \mapsto m/s$), where $s = L'/L$ for a new longer window $L'$. This recovers the pre-training envelope but sacrifices either local detail or global coherence, since all frequencies are compressed uniformly. This motivates frequency-selective and scale-aware approaches such as YaRN.
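A minimal sketch of PI's uniform rescaling (`pi_frequencies` is an illustrative name) makes the drawback concrete:

```python
def pi_frequencies(D, s, base=10000.0):
    """Naive position interpolation: every RoPE frequency is divided by
    the same factor s, trading local resolution for range uniformly."""
    return [base ** (-2.0 * d / D) / s for d in range(D // 2)]

# The fastest-rotating dimension is slowed by the full factor s too,
# which is what blurs local (adjacent-token) detail.
assert pi_frequencies(8, s=4)[0] == 0.25
```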
2. Methodology: YaRN RoPE Modification
YaRN implements a two-part recipe:
A. Frequency-Wise Interpolation (NTK-by-Parts):
- The RoPE spectrum is divided into disjoint low, mid, and high-frequency bands.
- Low frequencies are interpolated by the full factor: $\theta'_d = \theta_d / s$.
- High frequencies are left unmodified: $\theta'_d = \theta_d$.
- Mid frequencies are linearly blended between the two via a ramp $\gamma \in [0, 1]$.
- The precise interpolation per dimension $d$ is governed by the ratio $r(d) = L\theta_d / (2\pi)$ (the number of full rotations completed over the pre-training window $L$) with thresholds $\alpha < \beta$:

$$\gamma(r) = \begin{cases} 0, & r < \alpha \\ 1, & r > \beta \\ \dfrac{r - \alpha}{\beta - \alpha}, & \text{otherwise,} \end{cases}$$

and final frequency

$$\theta'_d = \bigl(1 - \gamma(r(d))\bigr)\frac{\theta_d}{s} + \gamma(r(d))\,\theta_d.$$
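The NTK-by-parts schedule can be sketched directly from these formulas (using $\alpha = 1$, $\beta = 32$, the defaults reported for the LLaMA family in Peng et al., 2023; `yarn_frequencies` is an illustrative name):

```python
import math

def yarn_frequencies(D, s, L, base=10000.0, alpha=1.0, beta=32.0):
    """NTK-by-parts: per-dimension interpolated RoPE frequencies.

    r(d) = L * theta_d / (2*pi) counts full rotations over the
    pre-training window L; gamma ramps from 0 (interpolate by 1/s)
    to 1 (leave unchanged) between thresholds alpha and beta.
    """
    out = []
    for d in range(D // 2):
        theta = base ** (-2.0 * d / D)
        r = L * theta / (2 * math.pi)
        if r < alpha:
            gamma = 0.0      # low frequency: fully interpolated
        elif r > beta:
            gamma = 1.0      # high frequency: untouched
        else:
            gamma = (r - alpha) / (beta - alpha)
        out.append((1 - gamma) * theta / s + gamma * theta)
    return out

freqs = yarn_frequencies(D=128, s=32, L=4096)
assert math.isclose(freqs[0], 1.0)                 # fastest dim: unchanged
assert math.isclose(freqs[-1],                     # slowest dim: divided by s
                    (10000.0 ** (-2 * 63 / 128)) / 32)
```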
B. Attention Temperature Scaling:
- As the scale factor $s$ increases, the average attention dot-product magnitude shrinks, flattening the softmax distribution.
- YaRN introduces a scalar temperature $t$ inside the attention softmax: $\mathrm{softmax}\!\left(q_m^\top k_n \,/\, (t\sqrt{|D|})\right)$.
- In practice, $q$ and $k$ are both rescaled by $\sqrt{1/t}$ (equivalently, the RoPE rotation matrices are scaled accordingly), so the softmax itself needs no modification.
- Empirical fit: $\sqrt{1/t} = 0.1 \ln s + 1$ minimizes perplexity (Peng et al., 2023).
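The empirical fit is a one-liner; applied as a multiplier on both $q$ and $k$, it reproduces division of the logits by $t$ (`yarn_attention_scale` is an illustrative name):

```python
import math

def yarn_attention_scale(s):
    """sqrt(1/t) from the empirical fit sqrt(1/t) = 0.1 * ln(s) + 1.

    Multiplying both q and k by this factor is equivalent to dividing
    the pre-softmax logits by the temperature t."""
    return 0.1 * math.log(s) + 1.0

assert yarn_attention_scale(1) == 1.0          # s = 1: no rescaling
assert yarn_attention_scale(32) > yarn_attention_scale(8) > 1.0
```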
Implementation is minimally invasive: only the rotation frequency calculation and a scaling factor are patched in the attention blocks.
3. Extensions to Multimodal Models: Partial YaRN and VLAT
In large audio-language models (LALMs) with joint audio and text sequences, context window limitations primarily affect the audio region, while applying global YaRN disturbs language modeling capabilities tuned in pre-training. Partial YaRN therefore constrains the RoPE modification to the span of audio tokens, leaving text tokens outside this window untouched (Chaichana et al., 17 Oct 2025). Concretely:
- The audio position IDs are compressed by the interpolation factor (e.g., $m \mapsto a + (m - a)/s$ for an audio span starting at position $a$); text positions before and after the segment are unchanged: $m \mapsto m$.
- Typically, a two-group frequency split is used (all low-frequency interpolated, high-frequency left unchanged).
- Attention-temperature scaling is applied exactly as in unimodal YaRN.
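A minimal sketch of such a segment-restricted position mapping (the helper name and the exact mapping are illustrative assumptions, not the paper's reference implementation):

```python
def partial_yarn_positions(position_ids, audio_start, audio_end, s):
    """Sketch: compress only the audio span's position IDs by factor s;
    text positions pass through unchanged."""
    out = []
    for m in position_ids:
        if audio_start <= m < audio_end:
            out.append(audio_start + (m - audio_start) / s)  # audio: interpolated
        else:
            out.append(float(m))                             # text: untouched
    return out

pos = partial_yarn_positions(range(8), audio_start=2, audio_end=6, s=2.0)
assert pos == [0.0, 1.0, 2.0, 2.5, 3.0, 3.5, 6.0, 7.0]
```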
Virtual Longform Audio Training (VLAT) further generalizes Partial YaRN as a training-time positional augmentation scheme. During fine-tuning:
- For each batch, a random "virtual" audio length $L_v$ is sampled, and the real audio positions are remapped accordingly (e.g., assigned position IDs as if the clip occupied $L_v$ tokens).
- The model thus learns to attend over a wide range of audio spans, enabling robust generalization to much longer contexts than those observed in training (Chaichana et al., 17 Oct 2025).
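A sketch of the augmentation idea (the function name, sampling scheme, and exact remapping are illustrative assumptions about the approach, not the paper's code):

```python
import random

def vlat_virtual_positions(audio_len, text_prefix_len, v_max, rng):
    """Sketch of VLAT-style positional augmentation: sample a 'virtual'
    audio length L_v and assign the real audio tokens positions as if
    the clip were L_v tokens long."""
    v_len = rng.randint(audio_len, v_max)   # virtual length L_v >= real length
    stretch = v_len / audio_len
    return [text_prefix_len + i * stretch for i in range(audio_len)]

rng = random.Random(0)
pos = vlat_virtual_positions(audio_len=100, text_prefix_len=16, v_max=1000, rng=rng)
assert pos[0] == 16.0 and len(pos) == 100
assert all(b > a for a, b in zip(pos, pos[1:]))  # strictly increasing span
```

Each fine-tuning batch then sees a different positional span, so short real clips exercise the large-position regime the model must handle at inference.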
4. Empirical Results
Across text-only and audio-language modeling, YaRN demonstrates strong improvements in context window extension:
| Model/Setting | Context Length | Perplexity/Accuracy (Long Form) |
|---|---|---|
| LLaMA-2 7B + PI (s=2) | 8K-128K | 100 at 128K |
| LLaMA-2 7B YaRN s=32 | 128K | 2.37 |
| SALMONN Partial YaRN | 5–10 min | +13–14 absolute MCQA gain |
| Qwen2-Audio VLAT (10 min audio) | 10 min | 81.7% (with Partial PI at inference) |
YaRN preserves standard language modeling and reasoning accuracy (≤0.5% drop) on Hugging Face Open LLM benchmarks at $s = 16$ and $s = 32$ (Peng et al., 2023).
Key ablations confirm:
- Two-group frequency split and attention-temperature scaling are essential for stability under extreme extension (Chaichana et al., 17 Oct 2025).
- VLAT provides dramatic accuracy increases on very long-form audio, e.g., from 32.8% (vanilla) to 81.7% (VLAT + Partial PI) at 10 min (Chaichana et al., 17 Oct 2025).
5. Theoretical and Practical Considerations
Conceptual Justification:
YaRN’s targeted interpolation is supported by the observation that crucial token relationships at long range are represented primarily in the low-frequency RoPE dimensions. Untouched high-frequency slots retain local resolution. The softmax temperature scaling counters the contraction in dot-product variance.
Practical Implementation:
- YaRN requires minimal compute for fine-tuning: typical LLaMA extensions to $65{,}536$ or $131{,}072$ tokens are achieved with 200–600 optimization steps and ~0.1% of the original pre-training token count.
- Inference performance is unchanged; YaRN is compatible with key/value caching, provided the cached rotations are recomputed whenever the scale factor changes.
Limitations and Edge Cases:
- Extremely large scale factors may "under-train" high frequencies; the interpolation thresholds $\alpha, \beta$ may require model-specific tuning.
- In zero-shot ("no fine-tune") settings, Dynamic-YaRN applies scaling and temperature on the fly, recomputing $s$ from the current sequence length at each step and extending well beyond the original window (Peng et al., 2023).
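The on-the-fly scale computation amounts to one line (a sketch of the dynamic-scaling rule described in Peng et al., 2023; `dynamic_scale` is an illustrative name):

```python
def dynamic_scale(current_len, trained_len):
    """Dynamic scaling: recompute s each forward pass, so frequencies are
    only interpolated once the input outgrows the pre-training window."""
    return max(1.0, current_len / trained_len)

assert dynamic_scale(2048, 4096) == 1.0   # within the window: no change
assert dynamic_scale(16384, 4096) == 4.0  # 4x extension
```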
6. Relation to Other RoPE Extensions
YaRN is contrasted with both:
- Position Interpolation (PI): Uniform frequency scaling, which leads to sub-optimal retention of local or global structure.
- Imaginary Extension (RoPE++): Re-introduces the imaginary component of the complex dot-product into attention (“dual-component” heads) to better preserve long-range dependencies and “semantic aggregation.” This approach is sometimes referred to in the literature interchangeably with YaRN, but differs mechanistically by augmenting attention heads and leveraging phase cues in the complex plane (Liu et al., 8 Dec 2025).
RoPE++ studies empirically confirm that the imaginary heads’ sine-integral decay provides robust long-range token-pair modeling, with two configurations trading memory for compute (equal-heads halves key-value cache, equal-cache doubles attention heads). This approach is complementary but distinct from YaRN’s frequency-wise interpolation and temperature scaling methodology.
7. Impact and Future Directions
YaRN establishes that principled, frequency-aware modification of positional embeddings permits efficient large-scale extension of transformer context windows with negligible degradation of core language/semantic understanding. Its extension to multimodal domains (Partial YaRN, VLAT) addresses a major bottleneck for audio-language and other long-form context modeling settings (Chaichana et al., 17 Oct 2025).
Future research directions mentioned in related RoPE++/YaRN work include:
- Integrating imaginary extensions to further boost long-range dependence modeling.
- Adapting frequency partitioning heuristics for specific model architectures or downstream tasks.
- Exploring interpolation/extrapolation combinations with other positional encoding schemes (e.g., FoPE, PaTH) (Liu et al., 8 Dec 2025).
YaRN remains a state-of-the-art, compute-efficient, and theoretically grounded strategy for addressing transformer context limitations across both unimodal and multimodal sequence modeling applications (Peng et al., 2023, Chaichana et al., 17 Oct 2025, Liu et al., 8 Dec 2025).