
YaRN: Extending Transformer Context with RoPE

Updated 19 November 2025
  • The paper introduces YaRN, a method that extends transformer context windows using piecewise frequency scaling of RoPE to overcome limitations in long-context generalization.
  • YaRN employs a ramped scaling of lower frequencies combined with attention-temperature augmentation to maintain both local detail and long-range dependencies.
  • Empirical results demonstrate substantial perplexity reductions and higher retrieval accuracy with fewer fine-tuning steps, establishing YaRN as a baseline for long-context adaptation.

YaRN (Yet another RoPE extensioN) is a method for extending the context window of transformer models employing Rotary Position Embeddings (RoPE), designed to address the failure of such models to generalize beyond their pre-training context length. YaRN relies on a compute-efficient, piecewise frequency-scaling scheme and attention-temperature augmentation to enable robust extrapolation to long contexts. This approach has quickly become a baseline for long-context adaptation in both unimodal LLMs and, via partial adaptation, in large audio-LLMs.

1. Motivation and Background

Transformer models encode sequence position through various positional embedding mechanisms, with RoPE injecting position as a sequence of block-diagonal 2×2 rotations parameterized by linearly increasing angles. For a model trained with a maximum context window of $L$, RoPE-based attention becomes unreliable for positions $> L$ because high-frequency (fine-grained, local) and low-frequency (long-range) rotations reach angle values not sampled during training. This limitation severely restricts the direct extension of pre-trained models to much longer context windows.
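For illustration, the following minimal sketch applies RoPE to a query or key matrix as per-pair rotations; it is not taken from any particular codebase, and the interleaved channel-pairing convention and tensor shapes are assumptions made here for clarity:

```python
import torch

def rope_angles(D: int, base: float = 10_000.0) -> torch.Tensor:
    # One angle per 2-channel pair: theta_d = base^(-2d/D) for d = 0, 2, ..., D-2
    d = torch.arange(0, D, 2, dtype=torch.float32)
    return base ** (-d / D)

def apply_rope(x: torch.Tensor, positions: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # x: [seq, D] queries or keys; positions: [seq]; theta: [D/2]
    angles = positions[:, None].float() * theta[None, :]   # [seq, D/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                         # interleaved channel pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                      # 2x2 rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```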

Legacy strategies such as Position Interpolation (PI) remedied this only partially by globally re-scaling the RoPE frequency schedule, at the cost of blurring crucial high-frequency local cues and distorting attention entropy. The need to balance faithful extrapolation of both local and long-range dependencies, while maintaining model calibration, guided the development of YaRN (Peng et al., 2023).

2. Methodology: Piecewise RoPE Scaling and Attention Temperature

YaRN formalizes context extension as dimension-wise piecewise scaling of RoPE frequencies. For embedding dimension index $d$ (out of $D$), the original RoPE angles are $\theta_d = b^{-2d/D}$ (with base $b = 10\,000$), yielding wavelengths $\lambda_d = 2\pi/\theta_d$.

All RoPE-based context extension methods can be written as

$$f'_W(x_m, m, \theta_d) = f_W(x_m, g(m), h(\theta_d)),$$

with $(g, h)$ being method-specific.

In YaRN:

  • $g(m) = m$
  • $h(\theta_d) = (1 - \gamma(r(d)))\,\dfrac{\theta_d}{s} + \gamma(r(d))\,\theta_d$
  • $r(d) = L/\lambda_d$, $s = L'/L$ (the context scaling factor), and $\gamma(\cdot)$ is a linear ramp:

$$\gamma(r) = \begin{cases} 0 & r < \alpha \\ \dfrac{r - \alpha}{\beta - \alpha} & \alpha \le r \le \beta \\ 1 & r > \beta \end{cases}$$

In practice, $(\alpha, \beta) = (1, 32)$ for LLaMA-based models.

The intuition:

  • Low-frequency dimensions (large wavelength, $r(d) < \alpha$): pure scaling ($\theta \mapsto \theta/s$)
  • High-frequency dimensions (short wavelength, $r(d) > \beta$): unchanged
  • Mid-frequency dimensions: linear ramp between scaling and identity
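A minimal sketch of this piecewise scaling, assuming the definitions of $r(d)$, $\gamma$, and $h$ above (function and variable names are illustrative rather than the reference implementation):

```python
import math

def yarn_scaled_angles(D: int, L: int, s: float,
                       base: float = 10_000.0,
                       alpha: float = 1.0, beta: float = 32.0):
    """Return piecewise-scaled RoPE angles h(theta_d) for a scale factor s = L'/L."""
    thetas = []
    for d in range(0, D, 2):
        theta = base ** (-d / D)
        wavelength = 2 * math.pi / theta
        r = L / wavelength                      # r(d) = L / lambda_d
        if r < alpha:                           # low frequency: pure scaling
            gamma = 0.0
        elif r > beta:                          # high frequency: leave unchanged
            gamma = 1.0
        else:                                   # mid band: linear ramp
            gamma = (r - alpha) / (beta - alpha)
        thetas.append((1 - gamma) * theta / s + gamma * theta)
    return thetas
```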

To stabilize softmax entropy as sequence length increases, YaRN introduces a temperature $t$ into the attention logits:

$${\rm Attention}_{m,n} = \mathrm{softmax}_n\!\left(\frac{q_m^\top k_n}{t\sqrt{D}}\right)$$

with the temperature calibrated to the context scale factor via $\sqrt{1/t} \approx 0.1 \ln s + 1$.
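The temperature can equivalently be folded into the usual $1/\sqrt{D}$ softmax scale; a sketch under that assumption, using the calibration above:

```python
import math

def yarn_attention_scale(s: float, head_dim: int) -> float:
    """Combined softmax scale 1 / (t * sqrt(D)) for a context scale factor s >= 1."""
    inv_sqrt_t = 0.1 * math.log(s) + 1.0      # sqrt(1/t) ~= 0.1 ln s + 1
    return (inv_sqrt_t ** 2) / math.sqrt(head_dim)
```

The same effect can also be obtained by multiplying both queries and keys by $\sqrt{1/t}$, which leaves the attention kernel itself untouched.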

3. Training Protocol, Implementation, and Application to Multi-Modal Models

Fine-tuning with YaRN requires far fewer tokens and steps than PI-based baselines. The empirical protocol for LLaMA-2 adaptation:

  • Objective: next-token language modeling (cross-entropy)
  • Dataset: e.g. PG19, chunked to required context size (e.g. 64k)
  • Hyper-parameters: LR $2 \times 10^{-5}$, AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$), 20-step warmup, batch size 64, 400 steps for $s = 16$, plus 200 more steps for $s = 32$
  • Platform: PyTorch FSDP + FlashAttention 2

Integration is a drop-in replacement at the positional embedding, affecting only RoPE frequency calculation and softmax temperature.
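A hedged sketch of the optimizer and warmup schedule implied by the protocol above; the model, the chunked dataloader, and the YaRN-patched RoPE are assumed to exist elsewhere, and the post-warmup behavior (constant LR) is an assumption, since the protocol does not specify a decay:

```python
import torch

def make_optimizer(model: torch.nn.Module, warmup_steps: int = 20):
    # LR 2e-5, AdamW with betas (0.9, 0.95); constant LR after a 20-step linear warmup
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.95))
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
    )
    return optimizer, scheduler
```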

In audio-LLMs, Partial YaRN (Chaichana et al., 17 Oct 2025) extends only the audio-modal regions of the sequence. Positions inside an audio span are stretched or compressed via

$$m' = p + \frac{m - p}{L_a - 1}\,(L'_a - 1),$$

where $p$ is the start position of the audio span, $L_a$ its original length, and $L'_a$ its rescaled length; a two-group frequency partition is then applied at the RoPE level, with low frequencies interpolated and high frequencies left unchanged. VLAT (Virtual Longform Audio Training) combines Partial YaRN with virtual input-length augmentation during audio fine-tuning, further boosting robustness to unseen durations.
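A minimal sketch of the position remapping above; the treatment of positions outside the audio span is an assumption of this sketch, not a claim about the paper:

```python
def remap_audio_position(m: int, p: int, L_a: int, L_a_new: int) -> float:
    """Rescale position m inside an audio span starting at p from length L_a to L_a_new."""
    if not (p <= m < p + L_a) or L_a <= 1:
        return float(m)                          # non-audio positions left as-is in this sketch
    return p + (m - p) / (L_a - 1) * (L_a_new - 1)
```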

4. Empirical Results

YaRN achieves substantial improvements over previous RoPE extension methods, summarized for language and audio models in the following table:

| Model / Task | Vanilla | PI | YaRN | Partial YaRN (audio) |
|---|---|---|---|---|
| LLaMA-2 7B perplexity, 8k–10k | PPL > 6 | 3.34 @ 8k | 3.35 @ 8k, < 6 @ 10k | N/A |
| LLaMA-2 7B perplexity, 128k | Fail | n/a | 2.37 @ 131k | N/A |
| Passkey retrieval, 7B @ 128k | < 90% | n/a | > 99% | N/A |
| SALMONN (30 s → 10 min MCQA) | 23.5% | n/a | n/a | 32.9% (+9.5 pts) |
| Qwen2-A (30 s → 10 min MCQA) | 22.0% | n/a | n/a | 28.5% (+6.5 pts) |
| LoRA fine-tune, 10 min | 64.9% | n/a | n/a | 83.1% |
  • YaRN enables robust extrapolation up to and beyond 128k context length in LLaMA-2 with only 400–600 fine-tuning steps, roughly 10× fewer training tokens and 2.5× fewer steps than PI.
  • Passkey retrieval at 128k context yields >99% accuracy with YaRN, indicating strong position selectivity at extreme lengths.
  • Partial YaRN, without retraining, yields +6–9pt MCQA gains on long-form audio reasoning (Chaichana et al., 17 Oct 2025).
  • After augmented fine-tuning (VLAT), Partial YaRN reaches 81.7% MCQA accuracy on unseen 10-minute audio inputs (approximately +42 points over vanilla inference).

5. Extensions: Resonance RoPE and Train-Short-Test-Long

Resonance RoPE (Wang et al., 29 Feb 2024) sharpens YaRN’s handling of pre-critical RoPE dimensions (where wavelengths are smaller than the training length $L$ and phase offsets arise at OOD positions) by snapping wavelengths to exact integers:

$$\widetilde{\lambda}_j = \mathrm{round}(\lambda_j), \qquad \widetilde{\theta}_j = 2\pi/\widetilde{\lambda}_j$$

This removes phase gaps on pre-critical dimensions, complementing YaRN’s post-critical scaling. In synthetic modular addition tasks (PosGen), Resonance+YaRN reduces out-of-distribution error by 20–50% and shrinks modeling perplexity by 0.5–1.0 points at long context. Combined, this provides state-of-the-art generalization in train-short-test-long (TSTL) regimes (Wang et al., 29 Feb 2024).
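One way to compose this with YaRN is to round each base angle's wavelength before applying the piecewise scaling from Section 2; a sketch (the zero-guard is an implementation assumption):

```python
import math

def resonance_round(theta_j: float) -> float:
    """Snap the wavelength of a RoPE angle to the nearest integer and return the new angle."""
    wavelength = max(1, round(2 * math.pi / theta_j))   # lambda~_j = round(lambda_j), floored at 1
    return 2 * math.pi / wavelength                     # theta~_j = 2*pi / lambda~_j
```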

6. Practical Deployment and Limitations

YaRN is implemented as a modification at the RoPE embedding layer and can be activated via flags (e.g., --rope-scale-factor s, --rope-temp t) in publicly released repositories for LLaMA 2 and others. For dynamic scaling (when context length varies at inference), RoPE recomputation is necessary at each step, precluding naive key-value caching of RoPE’d keys/values. Hyperparameters $(\alpha, \beta)$ may require retuning for non-LLaMA architectures or extremely large scaling factors ($s \gg 32$). In partial audio adaptation (Partial YaRN), only audio spans are stretched and scaling is banded for efficiency.
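For dynamic scaling, a hedged sketch (not any repository's actual API) that recomputes the scale factor and reuses the `yarn_scaled_angles` sketch from Section 2 whenever the sequence grows:

```python
def rebuild_rope_if_needed(current_len: int, trained_len: int, D: int):
    # Scale factor grows with the sequence so positions stay within the adapted range
    s = max(1.0, current_len / trained_len)
    # Keys cached under an older s are rotated inconsistently, so the angle table
    # (and any RoPE'd cache entries) must be recomputed when s changes.
    return s, yarn_scaled_angles(D, trained_len, s)
```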

Core resource links and released models are hosted at https://github.com/jquesnelle/yarn (Peng et al., 2023).

7. Future Developments and Research Directions

Current limits include residual inefficiency at very high context lengths due to quadratic attention, and incomplete calibration for enormous scaling factors. Combinations with RoPE rounding (Resonance), advanced frequency partitioning, or integration with efficient sparse/linearized attention remain open research directions (Wang et al., 29 Feb 2024). For multi-modal scenarios, further exploring context extension within modality-specific spans (e.g., Partial YaRN, VLAT) promises continued improvement for cross-modal context understanding (Chaichana et al., 17 Oct 2025).
