
Underlying reason for RedLLM long-context behavior

Characterize the underlying reason for the long-context behavior of RedLLM, an encoder–decoder large language model that applies rotary positional embeddings with continuous positions across encoder self-attention, decoder self-attention, and cross-attention, and is pretrained with a prefix language modeling objective, when it extrapolates to sequences substantially longer than the pretraining context length.
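
The sketch below illustrates the positional setup referred to above: rotary position embeddings (RoPE) applied with positions that continue from the encoder input into the decoder, so the same rotation scheme is shared by encoder self-attention, decoder self-attention, and cross-attention. This is a minimal illustration under the assumption that "continuous positions" means decoder positions are offset by the source length; all names, shapes, and helper functions are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch (not the authors' code) of RoPE with continuous positions in an
# encoder-decoder stack: encoder tokens take positions 0..S-1 and decoder tokens
# continue at S..S+T-1, so cross-attention compares rotated decoder queries
# against rotated encoder keys on a single shared position axis.
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate features x of shape (seq, dim) by RoPE angles for the given positions."""
    seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))   # per-pair rotation frequencies
    angles = positions[:, None] * inv_freq[None, :]        # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def cross_attention_logits(dec_q, enc_k, src_len):
    """Cross-attention logits with decoder positions continuing after the source."""
    tgt_len = dec_q.shape[0]
    enc_pos = np.arange(src_len)                # encoder positions: 0 .. S-1
    dec_pos = src_len + np.arange(tgt_len)      # decoder positions: S .. S+T-1 (continuous)
    q = rope_rotate(dec_q, dec_pos)
    k = rope_rotate(enc_k, enc_pos)
    return q @ k.T / np.sqrt(dec_q.shape[-1])   # (T, S) attention logits

# Example: 8 source tokens, 4 target tokens, head dimension 64.
rng = np.random.default_rng(0)
logits = cross_attention_logits(rng.normal(size=(4, 64)), rng.normal(size=(8, 64)), src_len=8)
print(logits.shape)  # (4, 8)
```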


Background

The paper compares encoder–decoder (RedLLM) and decoder-only (DecLLM) architectures under scaling. With rotary embeddings, DecLLM is known to extrapolate beyond its pretraining context length, and the authors evaluate how RedLLM behaves on long sequences. They observe that perplexity rises with length for both models, but RedLLM's increase is smoother and exhibits reduced locality decay compared to DecLLM. Cross-attention in RedLLM attends stably to subsets of input tokens across long contexts, hinting at a different mechanism than decoder-only self-attention.

Despite these observations, the authors explicitly state that the mechanism driving RedLLM’s long-context behavior is not yet understood, marking it as an unresolved question for future investigation.

References

Still, the underlying reason behind RedLLM's long context behavior remains unclear, which we leave to the future.

Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model (2510.26622 - Zhang et al., 30 Oct 2025) in Section 5, paragraph "The decoder self- and cross-attention in RedLLM show intriguing patterns under long context"