Full-Complex Attention in Long-Context LLMs

Updated 10 December 2025
  • The paper introduces Full-Complex Attention (RoPE++) which incorporates both real and imaginary components of rotary position embeddings to enhance long-context modeling.
  • It details a dual-head architecture with EC and EH variants that separately compute attention scores, balancing retrieval accuracy with memory and computational efficiency.
  • Empirical results show significant improvements in long-range dependency capture, with retrieval gains up to 7–8 points and graceful perplexity degradation beyond training windows.

Full-complex attention in long-context LLMs refers to attention mechanisms that leverage both the real and imaginary components of complex-valued interactions within rotary position embedding (RoPE) frameworks, enhancing the capture of positional and relational information over extended sequences. Recent work demonstrates that incorporating the previously discarded imaginary component of RoPE into attention, termed “RoPE++”, yields substantial gains in long-context modeling, particularly for tasks requiring robust retrieval and recall across sequences orders of magnitude longer than conventional LLM context windows (Liu et al., 8 Dec 2025).

1. Rotary Position Embeddings and the Complex Plane in Attention

RoPE is a widely adopted mechanism for encoding positional information by rotating query and key vectors in the complex plane. Given $q_t^{(n)}$ and $k_s^{(n)}$ as the $n$-th feature dimensions at positions $t$ and $s$, RoPE reformulates these as complex pairs:

$\tilde{q}_t^{(n)} = q_t^{2n} + i q_t^{2n+1}, \quad \tilde{k}_s^{(n)} = k_s^{2n} + i k_s^{2n+1}$

Each is rotated by a position- and head-dependent angle, yielding $q'_t^{(n)} = \tilde{q}_t^{(n)} e^{i \theta_n t}$, and similarly for $k'_s^{(n)}$. The classic attention score uses only the real part of their inner product:

$A_{t,s}^{\text{Re}} = \mathrm{Re} \sum_{n} q'_t^{(n)} (k'_s^{(n)})^*$

The imaginary component,

$A_{t,s}^{\text{Im}} = -\mathrm{Im} \sum_n q'_t^{(n)} (k'_s^{(n)})^*$

contains phase-sensitive relational information that standard RoPE ignores (Liu et al., 8 Dec 2025).
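
Expanding the rotated product $q'_t^{(n)} (k'_s^{(n)})^* = \tilde{q}_t^{(n)} (\tilde{k}_s^{(n)})^* e^{i \theta_n (t-s)}$ from the definitions above makes the relative-position dependence of both scores explicit (a routine algebraic expansion, not a formula quoted from the paper):

$A_{t,s}^{\text{Re}} = \sum_n \left[ \left( q_t^{2n} k_s^{2n} + q_t^{2n+1} k_s^{2n+1} \right) \cos \theta_n (t-s) - \left( q_t^{2n+1} k_s^{2n} - q_t^{2n} k_s^{2n+1} \right) \sin \theta_n (t-s) \right]$

$A_{t,s}^{\text{Im}} = -\sum_n \left[ \left( q_t^{2n} k_s^{2n} + q_t^{2n+1} k_s^{2n+1} \right) \sin \theta_n (t-s) + \left( q_t^{2n+1} k_s^{2n} - q_t^{2n} k_s^{2n+1} \right) \cos \theta_n (t-s) \right]$

Both scores depend on position only through the offset $t-s$ and re-weight the same two bilinear terms with complementary cosine and sine factors; $A_{t,s}^{\text{Im}}$ is exactly the combination that standard RoPE discards.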

2. Full-Complex Attention: Mathematical Formulation and Algorithms

Full-complex attention (RoPE++) incorporates both $A_{t,s}^{\text{Re}}$ and $A_{t,s}^{\text{Im}}$ as dual components. Rather than discarding $A_{t,s}^{\text{Im}}$, it forms attention outputs via two parallel “head groups”: one from the real scores, one from the imaginary scores. Attention is then computed as:

$\alpha_{t,s} = \mathrm{softmax}_s \left[ A_{t,s}^{\text{Re}} + \lambda A_{t,s}^{\text{Im}} \right]$

where $\lambda$ is a scalar hyperparameter (in practice, RoPE++ splits the scores into separate head groups and combines their outputs by concatenation or summation).

PyTorch-style pseudocode implements this by:

  • Projecting inputs to queries, keys, values.
  • Applying complex-valued RoPE rotations.
  • Computing real and imaginary score matrices via separate inner products.
  • Feeding the resulting heads through softmax and output projections (details match Figure 1 and the code block in (Liu et al., 8 Dec 2025)); a simplified sketch is given below.
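
As referenced above, here is a minimal PyTorch sketch of the dual-score computation. The helper names (rope_angles, complex_scores, rope_pp_attention), the tensor layout, the sharing of one value tensor across both score groups, and the concatenation of the two outputs are illustrative assumptions consistent with the “Equal Cache” description, not the reference implementation of (Liu et al., 8 Dec 2025); causal masking is omitted for brevity.

import torch
import torch.nn.functional as F


def rope_angles(seq_len: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles theta_n * t for every position t and pair index n, shape (seq_len, dim // 2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)   # theta_n
    positions = torch.arange(seq_len).float()                     # t
    return torch.outer(positions, inv_freq)


def complex_scores(q: torch.Tensor, k: torch.Tensor):
    """Real and imaginary attention scores from RoPE-rotated queries and keys.

    q, k: (batch, heads, seq_len, head_dim) with head_dim even.
    Returns (A_re, A_im), each of shape (batch, heads, seq_len, seq_len).
    """
    d = q.shape[-1]
    angles = rope_angles(q.shape[-2], d)                          # (L, d/2)
    rot = torch.polar(torch.ones_like(angles), angles)            # e^{i theta_n t}

    # Treat consecutive feature pairs (2n, 2n+1) as complex numbers and rotate them.
    qc = torch.view_as_complex(q.float().reshape(*q.shape[:-1], d // 2, 2)) * rot
    kc = torch.view_as_complex(k.float().reshape(*k.shape[:-1], d // 2, 2)) * rot

    # Complex inner product sum_n q'_t (k'_s)^*, split into Re and -Im components.
    prod = torch.einsum("bhtn,bhsn->bhts", qc, kc.conj())
    return prod.real, -prod.imag


def rope_pp_attention(q, k, v):
    """Dual head-group attention: real and imaginary scores each attend over the same values.

    Concatenating the two outputs doubles the output features while leaving the KV cache
    unchanged, mirroring the 'Equal Cache' variant described below.
    """
    scale = q.shape[-1] ** -0.5
    a_re, a_im = complex_scores(q, k)
    out_re = F.softmax(a_re * scale, dim=-1) @ v
    out_im = F.softmax(a_im * scale, dim=-1) @ v
    return torch.cat([out_re, out_im], dim=-1)

For example, with q = k = v = torch.randn(1, 4, 128, 64), rope_pp_attention returns a tensor of shape (1, 4, 128, 128), i.e., doubled output features over a shared KV cache.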

Variants provided include:

  • RoPE++_EC (“Equal Cache”): Doubled head count, doubled output features, unchanged KV-cache structure.
  • RoPE++_EH (“Equal Heads”): Single head group, halved parameters, increased memory efficiency.

3. Theoretical Properties and Positional Expressiveness

RoPE++’s reintroduction of the imaginary (phase) component enhances the model’s ability to differentiate relative positions with both magnitude and phase awareness. The real part decays with relative offset according to cosine integrals, favoring local attention, whereas the imaginary part decays more slowly, following a sine integral profile. This nontrivial imaginary contribution biases attention toward longer-range dependencies by sustaining nonzero sensitivity at greater positional distances.

Empirically, averaging over random queries shows that both components satisfy $E[\Delta A^{\text{Re}}] > 0$ and $E[\Delta A^{\text{Im}}] > 0$ for semantically similar data, with $A^{\text{Im}}$ decaying more gradually, a property critical for retrieval tasks in extremely long contexts.
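
As a quick numerical probe of these decay profiles, the sketch below evaluates the frequency-averaged cosine and sine kernels that govern the two components for a matching query/key pair with unit-variance features (under that assumption, $E[A^{\text{Re}}]$ is proportional to the cosine average and $E[A^{\text{Im}}]$ to minus the sine average). The head dimension and frequency base are illustrative choices, not the paper's configuration.

import torch

head_dim, base = 128, 10000.0
theta = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)   # RoPE frequencies theta_n

for offset in (1, 64, 1024, 32768):
    phase = theta * offset                      # theta_n * (t - s)
    re_kernel = torch.cos(phase).mean().item()  # proportional to E[A_Re] for a matching pair
    im_kernel = torch.sin(phase).mean().item()  # E[A_Im] is proportional to minus this value
    print(f"offset={offset:6d}  cos-kernel={re_kernel:+.4f}  sin-kernel={im_kernel:+.4f}")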

The approach incurs no additional learnable parameters beyond projections for expanded head counts in EC mode and maintains computational complexity at $O(N H L^2)$, matching vanilla attention (Liu et al., 8 Dec 2025).

4. Integration and Computational Considerations

Full-complex attention is a drop-in replacement for conventional RoPE attention. For EC, it requires only minor modifications: doubled attention heads and output projections. For EH, KV-cache and parameter costs are halved, boosting throughput at long sequence lengths. No new buffers are introduced, and fused-kernel implementations (e.g., FlashAttention 2/3) can handle the dual-head computation efficiently, maintaining practical inference speeds. The EH variant further improves memory efficiency: on 376M and 776M models, it runs 2–6% faster than vanilla attention at 32K context length.

Memory consumption is linearly proportional to head count; compute cost increases marginally for EC (10–15% additional matmuls) but is unchanged for EH. This allows scaling to arbitrarily long contexts without additional KV-cache overhead.
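
One way to see why off-the-shelf fused kernels apply: the contribution of each feature pair to $A_{t,s}^{\text{Im}}$, namely $-\mathrm{Im}\,[q'_t^{(n)} (k'_s^{(n)})^*]$, equals the ordinary dot product of the 90-degree-rotated query $i\, q'_t^{(n)}$ with $k'_s^{(n)}$ in their two-dimensional real representation, so both head groups reduce to standard attention over a shared KV cache. The sketch below uses this identity with PyTorch's scaled_dot_product_attention; the identity itself is elementary, but treating it as the paper's actual kernel strategy would be an assumption.

import torch
import torch.nn.functional as F


def rotate_90(x: torch.Tensor) -> torch.Tensor:
    """Multiply every (2n, 2n+1) feature pair by i: (a, b) -> (-b, a)."""
    a, b = x[..., 0::2], x[..., 1::2]
    return torch.stack((-b, a), dim=-1).flatten(-2)


def dual_group_fused_attention(q_rot, k_rot, v):
    """q_rot, k_rot: already RoPE-rotated queries/keys, shape (batch, heads, seq, head_dim).

    The real-score group runs as usual; the imaginary-score group reuses the same keys,
    values, and hence the same KV cache, but with 90-degree-rotated queries, since
    A_im = sum_n <i * q'_n, k'_n> in the real 2D representation of each pair.
    """
    out_re = F.scaled_dot_product_attention(q_rot, k_rot, v, is_causal=True)
    out_im = F.scaled_dot_product_attention(rotate_90(q_rot), k_rot, v, is_causal=True)
    return torch.cat([out_re, out_im], dim=-1)

Because the keys and values are untouched in this sketch, the extra cost is one additional fused-attention call per layer plus a cheap query rotation.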

5. Empirical Validation and Long-Context Performance

Across language modeling and retrieval benchmarks, RoPE++ demonstrates statistically significant improvements over standard RoPE, especially as context length increases:

  • On RULER (4K–64K synthetic retrieval), RoPE++_EC outperforms RoPE by 7–8 points at 64K.
  • On BABILong (2K–64K), improvements are 4–5 points at maximal context length.
  • Head-ablation experiments confirm that the imaginary heads are critical for long-distance retrieval: perturbing them with noise produces pronounced performance drops.
  • In short-context settings (4K test window), RoPE++ maintains parity or improves slightly in perplexity and downstream classification accuracy (Liu et al., 8 Dec 2025).

Perplexity extrapolation beyond the training window shows that RoPE++ degrades more gracefully with increasing context, consistent with its theoretical improvements in length-extrapolation capacity.

6. Relation to Chunking, Sparse Attention, and Dynamic Memory

Full-complex attention mechanisms complement approaches such as dynamic triangular attention and span-based retrieval frameworks (e.g., Ltri-LLM (Tang et al., 2024)). While Ltri-LLM decomposes long sequences into semantic spans using non-maximum suppression (NMS) and an offline retrieval index to select the most salient prior chunks, RoPE++ orthogonally augments the positional sensitivity and long-range expressiveness inside each attention head. Ltri-LLM addresses full attention’s prohibitive quadratic cost via span indexation and O(n) retrieval; RoPE++ refines how positional structure is maintained within this or any full-/partial-attention module. Combined, these strategies offer complementary gains: one in memory/computation, the other in retention and fidelity of long-range dependencies.

7. Summary and Implications

Full-complex attention (RoPE++) represents a minimal yet principled extension to rotary position embeddings, recovering the imaginary phase component typically discarded in self-attention calculations. By treating real and imaginary dot-products as parallel attention channels, RoPE++ increases the range and fidelity of positional relationships—especially crucial for LLMs with multi-hundred-thousand-token or million-token contexts. This approach yields gains in accuracy and retrieval performance over existing sparse and streaming paradigms, without introducing learning instability or significant computational overhead. Integration with chunked-attention frameworks further enables efficient, scalable long-context deployment in modern LLM architectures (Liu et al., 8 Dec 2025, Tang et al., 2024).
