Wavelet-based Positional Representation for Long Context

Published 4 Feb 2025 in cs.CL (arXiv:2502.02004v1)

Abstract: In the realm of large-scale LLMs, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model's attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field.

Summary

  • The paper presents a novel wavelet-based positional encoding method that overcomes limitations of conventional RoPE and ALiBi approaches.
  • It leverages multi-scale Ricker wavelets for dynamic signal capture, significantly reducing perplexity in extended sequence contexts.
  • Experimental results demonstrate robust improvements in long-range dependency modeling without imposing constraints on the attention mechanism.

Wavelet-based Positional Representation for Long Context

This essay examines a novel approach to positional representation in large-scale LLMs, focusing on the implementation and implications of wavelet-based methods for handling long contexts. The study reinterprets conventional positional encoding mechanisms, probing their limitations and proposing enhancements through wavelet transforms.

Positional Encoding Challenges

In LLMs based on Transformer architectures, positional encoding is crucial for accurately representing token order. Challenges arise when extending the sequence length beyond the maximum length $L_{\rm train}$ encountered during pre-training. Traditional encoding methods such as RoPE and ALiBi exhibit specific limitations in extrapolation. RoPE, which uses a rotation matrix for absolute position embedding, is constrained by its fixed scale parameter and performs sub-optimally beyond $L_{\rm train}$. ALiBi, which relies on windowed attention with varying window sizes, limits the receptive field and fails to capture long-range dependencies (Figure 1).

Figure 1: Overview of the wavelet-based relative positional representation, formulated as in RPE.
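
As a point of reference for the rotation-matrix mechanism mentioned above, the following is a minimal NumPy sketch of standard RoPE (independent of the paper's code). It rotates consecutive pairs of head dimensions and verifies that attention scores depend only on the relative offset between positions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Standard RoPE: rotate each consecutive pair (x[2i], x[2i+1]) of a
    head-dimension vector by the angle pos * base**(-2i/d).  The fixed pair
    width of 2 is what the paper interprets as a Haar-like wavelet with a
    single, fixed scale."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# The attention score q(m) . k(n) depends only on the relative offset m - n:
d = 8
q, k = np.random.randn(d), np.random.randn(d)
print(np.allclose(rope(q, 100) @ rope(k, 98),   # offset 2 at large positions
                  rope(q, 7) @ rope(k, 5)))     # offset 2 at small positions -> True
```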

Theoretical Analysis of RoPE and ALiBi

RoPE can be viewed as a wavelet-like transform, but one restricted to Haar-like wavelets with a fixed scale, which limits its ability to model dynamic changes in signals. Mathematically, RoPE can be expressed as a wavelet transform across the head dimension using Haar-like wavelets with a fixed scale of 2. In contrast, ALiBi admits multiple window sizes, analogous to the varying scales in a wavelet transform; this provides adaptability, but its linear biases restrict the receptive field (Figure 2).

Figure 2: Heatmap of attention scores after softmax normalization in ALiBi, without non-overlapping inference.
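
The linear-bias structure behind this windowed-attention interpretation can be sketched as follows; this is standard ALiBi with the usual geometric slope schedule, not code from the paper. Heads with steep slopes penalize distant keys so strongly that they behave like narrow attention windows:

```python
import numpy as np

def alibi_bias(seq_len, num_heads=8):
    """Standard ALiBi bias: head h adds -m_h * (i - j) to the attention logit
    of query i attending to key j (j <= i), with slopes m_h = 2**(-8h/H).
    Large slopes act like narrow attention windows."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # i - j
    dist = np.tril(dist)            # zero above the diagonal; the causal mask
                                    # itself is applied separately
    return -slopes[:, None, None] * dist        # shape (heads, seq, seq)

bias = alibi_bias(6)
print(bias[0])    # steepest-slope head: strongest penalty on distant keys
```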

Proposed Wavelet-based Positional Representation

This study proposes a wavelet-based method that leverages multiple window sizes (scale parameters) and flexible shift parameters, drawing inspiration from Relative Position Representation (RPE). Unlike RoPE, the approach applies wavelet transforms across the $d$ head dimensions, enabling dynamic signal capture akin to time-frequency analysis. By diversifying the wavelet functions and scale parameters, the method is resilient to varying sequence lengths and context shifts (Figure 3).

Figure 3: Heatmap of attention scores in the 4th head after softmax normalization, without non-overlapping inference.
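
To make the time-frequency intuition concrete, the toy sketch below (not taken from the paper) applies a naive Ricker-wavelet transform at several dyadic scales to a non-stationary signal; small scales respond to the fast-varying segment and large scales to the slow-varying one:

```python
import numpy as np

def ricker(t, a):
    """Ricker (Mexican hat) wavelet with scale parameter a, evaluated at t."""
    u = t / a
    return (2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)) * (1.0 - u**2) * np.exp(-0.5 * u**2)

def naive_cwt(signal, scales):
    """Naive continuous wavelet transform: correlate the signal with the
    wavelet at every scale and shift (quadratic cost, kept simple for clarity)."""
    t = np.arange(len(signal))
    return np.array([[np.sum(signal * ricker(t - b, a)) for b in t]
                     for a in scales])

# Toy non-stationary signal: fast oscillation first, slow oscillation later.
t = np.arange(256)
x = np.where(t < 128, np.sin(0.8 * t), np.sin(0.1 * t))
coeffs = naive_cwt(x, scales=[1, 2, 4, 8, 16])   # dyadic scales
print(coeffs.shape)                              # (5, 256): scale x shift
```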

Implementation Details

The implementation replaces Haar wavelets with Ricker wavelets, deployed at varying scales for better adaptability to context. The query and key vectors undergo a wavelet transformation based on relative positional shifts, which provides the robust signal parsing critical for extrapolation. The position representation uses the wavelet function evaluated at multiple scales $a = \{2^0, 2^1, \ldots, 2^s\}$ and shift parameters $b = \{0, 1, 2, \ldots, \frac{d}{s}-1\}$ (Figure 4).

Figure 4: Graph of compared Ricker wavelet functions with $a = [2^0, 2^1, 2^2, 2^3, 2^4]$.
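
A sketch of the multi-scale construction is given below. The function names, the enumeration of scales and shifts into a response matrix, and the interface are illustrative assumptions; how these responses are combined with the query and key vectors is the paper's contribution and is not reproduced here:

```python
import numpy as np

def ricker(t, a):
    """Ricker (Mexican hat) wavelet with scale parameter a, evaluated at t."""
    u = t / a
    return (2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)) * (1.0 - u**2) * np.exp(-0.5 * u**2)

def wavelet_position_responses(pos, d=64, s=4):
    """Multi-scale wavelet responses for a single token position
    (illustrative; names and interface are not the paper's).

    Scales follow the dyadic set a = {2^0, ..., 2^s} and shifts
    b = {0, 1, ..., d/s - 1}; entry [i, j] is psi(pos - b_j) at scale a_i."""
    scales = 2.0 ** np.arange(s + 1)     # a = {2^0, ..., 2^s}
    shifts = np.arange(d // s)           # b = {0, ..., d/s - 1}
    return ricker(pos - shifts[None, :], scales[:, None])

resp = wavelet_position_responses(pos=10, d=16, s=4)
print(resp.shape)   # (5, 4): one row per scale, one column per shift
```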

Experimental Results

Evaluations demonstrate that the wavelet-based approach outperforms conventional methods in perplexity, especially for contexts extending beyond $L_{\rm train}$. The use of wavelets markedly reduces perplexity, showing an improved ability to capture long-range dependencies without restricting the attention field. The reported curves show that, even as sequence length increases, the proposed approach maintains low perplexity, in sharp contrast to RoPE, whose perplexity escalates when longer sequences introduce unseen positional values (Figure 5).

Figure 5: Graph of compared wavelet functions. The case with scale parameter $a = 2^4$ and shift parameter $b = 0$ is shown.
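
Perplexity here carries its standard definition, the exponential of the mean per-token negative log-likelihood; a minimal sketch, assuming the model's token log-probabilities are available:

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) of the target tokens;
    lower is better.  `token_logprobs` are the model's log-probabilities of
    each ground-truth token in the evaluated sequence."""
    return float(np.exp(-np.mean(token_logprobs)))

# Illustrative values only: a model assigning ~60% probability to every token.
print(perplexity(np.log(np.full(1024, 0.6))))   # ~1.67
```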

Conclusion

The wavelet-based positional representation presents a significant advancement in handling long-context sequences in LLMs. By facilitating dynamic adaptation through diversified scale and shift parameters, this approach transcends conventional limitations, enabling effective extrapolation and comprehensive signal analysis without imposing constraints on the attention mechanism's receptive field. This research marks a pivotal step toward enabling more robust and scalable language modeling, opening avenues for further exploration in adaptive positional encoding strategies.
