RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Published 15 May 2026 in cs.CL, cs.AI, and cs.LG | (2605.15514v1)

Abstract: We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context LLMs. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context LLMs.


Authors (8)

Summary

The paper shows that RoPE’s attention scores degrade with longer contexts, causing both token and position distinctions to approach random chance.
Empirical tests in models like Llama 3.1–8B reveal thousands of aliased positions and near-random accuracy, underscoring severe performance degradation.
The study challenges long-held assumptions about RoPE, prompting exploration of novel positional encodings for robust long-context language modeling.

Limitations of Rotary Positional Embeddings in Long-Context Transformers

Overview

This paper presents rigorous theoretical and empirical analyses identifying intrinsic limitations of Rotary Positional Embeddings (RoPE), the predominant positional encoding used in modern Transformer-based LLMs, particularly in long-context settings. The authors prove that as context length increases, RoPE-based attention exhibits fundamental failures: it loses the ability to reliably distinguish either token positions or token identities, with the probability of failure approaching random chance (0.5) for both. This essay outlines the paper's key findings, highlights critical empirical results, discusses the mechanisms underlying these failures, addresses implications for long-context LLM development, and suggests research directions informed by these results.

Theoretical Analysis of RoPE at Long Contexts

The core theoretical contribution is a probabilistic characterization of RoPE's dot product ("RoPE product") as a normal random variable, with its mean determined by low-frequency components and its variance by high-frequency oscillations. As context length $M$ grows:

Position Inversion: The likelihood that RoPE-based attention assigns a higher score to a distant key than to a nearby one increases with $M$ and the RoPE base parameter $B$, approaching 0.5 at large $M$ and $B$. Thus, the locality inductive bias, essential to language modeling, fails in long contexts.
Position Aliasing: The probability that moving a key token to a different position yields an unchanged attention score converges rapidly to 1 as $M$ increases. This indicates that RoPE attention can become completely agnostic to token position in long sequences.
Token Inversion: The relative ranking of two key tokens for a given query, as reflected by their attention scores, can be arbitrarily reversed at different positions; the inversion probability also approaches 0.5 as $M$ increases.
Token Aliasing: The number of positions where attention cannot distinguish two different key tokens increases with $M$; for practical finite-precision arithmetic (e.g., BF16), this can result in thousands of aliased positions even at moderate context lengths.
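To make the "RoPE product" concrete, the following toy sketch computes the RoPE attention score as a function of relative distance and measures how often a farther position outscores a nearer one. It assumes the standard RoPE parameterization $\theta_i = B^{-2i/d}$ with interleaved dimension pairs. The function names, the head dimension of 128, and the correlated query/key construction are illustrative assumptions, not the paper's experimental setup or proof.

```python
import numpy as np

def rope_angles(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base**(-2i/dim)."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rotate(x, pos, base=10000.0):
    """Apply RoPE to a 1-D vector x at integer position pos (interleaved pairs)."""
    ang = pos * rope_angles(x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def rope_scores(q, k, distances, base=10000.0):
    """Score of query at position m against key at position n, for each m - n."""
    theta = rope_angles(q.shape[-1], base)
    a = q[0::2] * k[0::2] + q[1::2] * k[1::2]   # cosine coefficients
    b = q[0::2] * k[1::2] - q[1::2] * k[0::2]   # sine coefficients
    ang = np.outer(distances, theta)            # shape (len(distances), dim/2)
    return np.cos(ang) @ a + np.sin(ang) @ b

def inversion_rate(scores, rng, trials=10000):
    """Fraction of random (near, far) distance pairs where the far key scores higher."""
    i = rng.integers(0, len(scores), size=trials)
    j = rng.integers(0, len(scores), size=trials)
    near, far = np.minimum(i, j), np.maximum(i, j)
    valid = near < far                          # drop draws where i == j
    return float(np.mean(scores[far[valid]] > scores[near[valid]]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(128)
    k = q + 0.5 * rng.standard_normal(128)      # hypothetical key correlated with the query
    for M in (256, 4096, 65536):
        s = rope_scores(q, k, np.arange(M))
        print(M, round(inversion_rate(s, rng), 3))
```

The closed-form `rope_scores` depends only on the relative distance, which is why the whole score curve for one query/key pair can be computed without rotating each position individually. `rotate` is included so the closed form can be checked against explicit rotation.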

The authors demonstrate that adjusting the RoPE base parameter trades off one kind of error for another: increasing $B$ enhances token distinction but degrades position distinction, preventing simultaneous preservation of both objectives over long contexts.
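One way to see why the base matters is to look at the wavelengths it induces. The sketch below again assumes the standard parameterization $\theta_i = B^{-2i/d}$; the helper names, the head dimension of 128, and the listed base values are illustrative assumptions. It only shows how a larger $B$ stretches the slow rotations and is not a reproduction of the paper's trade-off proof.

```python
import numpy as np

def rope_wavelengths(dim, base):
    """Wavelength in tokens of each rotating pair: 2*pi / theta_i."""
    theta = base ** (-np.arange(0, dim, 2) / dim)
    return 2 * np.pi / theta

def pairs_completing_a_turn(dim, base, context_len):
    """Number of pairs whose wavelength fits inside the context window."""
    return int(np.sum(rope_wavelengths(dim, base) <= context_len))

if __name__ == "__main__":
    context_len = 131072  # 128K tokens
    for base in (1e4, 5e5, 1e7):
        lam = rope_wavelengths(128, base)
        n = pairs_completing_a_turn(128, base, context_len)
        print(f"base={base:.0e}  longest wavelength={lam[-1]:.3g} tokens  "
              f"pairs cycling within the window: {n}/64")
```

Pairs whose wavelength exceeds the context length never complete a rotation, so their contribution changes slowly from one position to the next. Raising $B$ moves more pairs into that regime.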

Empirical Evidence in State-of-the-Art LLMs

Empirical analysis, centered on Llama 3.1–8B (with a nominal 128K context window), confirms these theoretical predictions:

At an 8K context length, the authors find more than 75,000 position-aliasing pairs and roughly 150 instances of token aliasing, regardless of token proximity.
In controlled experiments on six open-source, RoPE-based LLMs (sizes from 7B to over 100B parameters), a simple position-indexing task (finding the $k^{\mathrm{th}}$ element in a list of four integers) reveals that all models, regardless of their scale, collapse to near-random (25%) accuracy beyond 4K-token contexts.

Adding architectural complexity (multiple heads and multiple layers) does not mitigate these pathologies. Once context length exceeds a head’s effective range, positional and token confusion emerges and persists through the model.

Mechanistic Explanation

RoPE encodes position by rotating pairs of query and key dimensions through position-dependent phase angles. A theoretical analysis leveraging the Central Limit Theorem shows that with large $M$, high-frequency components dominate the variance, resulting in unpredictable oscillatory behavior of the attention scores. The net result is that, in long contexts, both the mean (from low frequencies) and the variance (from high frequencies) lose their discriminative utility: the model cannot distinguish nearby from distant tokens (loss of locality bias) or reliably relate attention scores to token identities.
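For reference, the RoPE product can be written out explicitly. What follows is the standard RoPE identity in notation chosen here for illustration (head dimension $d$, query $q$ at position $m$, key $k$ at position $n$, relative distance $\Delta = m - n$, interleaved dimension pairs); the paper's own notation and normalization may differ.

$$
s(\Delta) = \sum_{i=0}^{d/2-1} \left[ a_i \cos(\Delta\,\theta_i) + b_i \sin(\Delta\,\theta_i) \right], \qquad \theta_i = B^{-2i/d},
$$

$$
a_i = q_{2i}\,k_{2i} + q_{2i+1}\,k_{2i+1}, \qquad b_i = q_{2i}\,k_{2i+1} - q_{2i+1}\,k_{2i}.
$$

Pairs with small $i$ rotate quickly (with a period close to $2\pi$ tokens), so their terms oscillate rapidly as $\Delta$ grows. Pairs with $i$ near $d/2$ rotate slowly and change little between neighboring positions. This matches the description above of high-frequency terms driving the variance and low-frequency terms setting the mean.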

Empirically, when numerical quantization (finite arithmetic precision) is taken into account, the aliasing phenomena (where different positions or tokens receive identical attention scores) become even more prevalent, demonstrating that aliasing is not an artifact of idealized, infinite-precision math but is exacerbated in practical deployments.
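The effect of finite precision on aliasing can be illustrated with a small experiment. The sketch below emulates BF16 by rounding float32 scores to the nearest value with a 16-bit representation, then counts position pairs whose scores collide exactly. It is an assumption-laden toy: the helper names are hypothetical, the query and key are random, and only the final score is rounded, whereas a real model accumulates rounding error throughout the computation.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to the nearest bfloat16, returned as float32.

    Bit-level emulation; assumes finite inputs."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Round-to-nearest-even on the 16 low bits that bfloat16 discards.
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

def rope_scores(q, k, distances, base=10000.0):
    """RoPE score for each relative distance (interleaved pairs)."""
    theta = base ** (-np.arange(0, q.shape[-1], 2) / q.shape[-1])
    a = q[0::2] * k[0::2] + q[1::2] * k[1::2]
    b = q[0::2] * k[1::2] - q[1::2] * k[0::2]
    ang = np.outer(distances, theta)
    return np.cos(ang) @ a + np.sin(ang) @ b

def aliased_pairs(values):
    """Count unordered index pairs whose values are exactly equal."""
    _, counts = np.unique(values, return_counts=True)
    return int(np.sum(counts * (counts - 1) // 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k = rng.standard_normal(128), rng.standard_normal(128)
    s = rope_scores(q, k, np.arange(8192)).astype(np.float32)
    print("float32 collisions: ", aliased_pairs(s))
    print("bfloat16 collisions:", aliased_pairs(to_bf16(s)))
```

Because rounding maps equal inputs to equal outputs, the BF16 collision count can never be lower than the float32 count. How much higher it is depends on the score range and the context length.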

Practical and Theoretical Implications

The findings refute the assumption that merely scaling context lengths via RoPE hyperparameter adjustments suffices for robust long-context modeling. Persistent performance gaps between claimed context limits and real-world utility arise from these intrinsic positional encoding pathologies, not from optimizable engineering or data issues. The analysis provides a concrete mechanistic basis for the recurring empirical failures in long-context reasoning, variable tracking, or dependency tasks even in recent LLMs evaluated on synthetic or retrieval-based benchmarks.

The results motivate a re-examination of foundational architectural design: any robust positional mechanism for long-context modeling must maintain both positional and token discriminatory ability as context grows. Adjusting the RoPE base is fundamentally a zero-sum trade-off, not a solution.

Future Directions

This work signals a need for radically new positional encoding or context management schemes beyond algebraic RoPE extensions or scaling. Promising directions include:

Positional representations with richer or more flexible inductive biases (e.g., learned, hierarchical, or compositional positions);
Alternative architectures that decompose or segment long contexts and explicitly manage context memory;
Agentic or recursive models that can refer to or summarize past contexts without relying on monolithic positional encodings.

Additionally, benchmarking and evaluation practices should reflect real, effective context usage, not nominal window size, for future LLMs.

Conclusion

The paper establishes, both theoretically and empirically, that Rotary Positional Embeddings fail to provide reliable positional or token distinction in long contexts, and that tuning existing hyperparameters cannot overcome this fundamental limitation. The implications are clear: developing long-context LLMs that truly use large contexts robustly will require fundamentally new positional mechanisms or architectures. These findings should instigate research into alternative encoding strategies and raise the standard for what constitutes genuine long-context capability in LLMs.

Reference: "RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably" (2605.15514)
