- The paper introduces PaTH, a novel data-dependent position encoding framework using cumulative Householder-like transformations to enhance sequential reasoning.
- It relies on a compact UT-transform representation and an efficient FlashAttention-style blockwise algorithm, and improves state tracking and length extrapolation.
- Empirical results show that PaTH and its variant PaTH-FoX outperform RoPE and FoX on synthetic tasks and language modeling benchmarks.
This paper introduces PaTH (Position Encoding via Accumulating Householder Transformations), a novel data-dependent position encoding scheme for transformers designed to overcome limitations of existing methods like Rotary Position Embedding (RoPE). RoPE's transformations are data-independent, which limits its expressivity and its performance on tasks requiring sequential reasoning. PaTH addresses this by using accumulated products of data-dependent Householder-like transformations.
Core Idea: Data-Dependent Transformations
Unlike RoPE, where the transformation between a query $q_i$ and a key $k_j$ depends only on their relative position, PaTH introduces a dynamic transformation matrix $H_{ij}$ that is a function of the input data. The attention logit is parameterized as $q_i^\top H_{ij} k_j$, where $H_{ij}$ is a cumulative product of Householder-like matrices along the path from position $j$ to $i$:

$$H_{ij} = \prod_{t=j+1}^{i} H_t$$

Each $H_t$ is an identity-plus-rank-one matrix:

$$H_t = I - \beta_t W_t W_t^\top$$

where $W_t \in \mathbb{R}^d$ and $\beta_t = 2 \cdot \operatorname{sigmoid}(u^\top x_t + b) \in (0, 2)$ are functions of the input $x_t$ at position $t$. This data dependence lets PaTH adapt its transformations to the input sequence and capture more complex sequential patterns. Allowing $\beta_t \in (0, 2)$ permits negative eigenvalues, which has been shown to improve state-tracking performance in related linear RNN models. The vector $W_t$ is derived from $x_t$ via a low-rank linear layer, a short 1D convolution (filter size 3), and L2 normalization, adding minimal parameters.
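To make the parameterization concrete, here is a minimal NumPy sketch (not the paper's kernel) that materializes the cumulative Householder-like products explicitly and computes the PaTH logits $q_i^\top H_{ij} k_j$. The toy dimensions, the projection vector $u$, and the bias $b$ are placeholders, $W_t$ is a normalized random vector rather than the output of the low-rank-plus-convolution map described above, and one ordering of the product is assumed since the $\prod$ notation leaves it open.

```python
# Naive reference for PaTH attention logits; O(L^2 d^2), for exposition only.
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 8                                     # toy sequence length / head dim

x = rng.standard_normal((L, d))                 # token representations x_t
q = rng.standard_normal((L, d))                 # queries q_i
k = rng.standard_normal((L, d))                 # keys k_j

# Data-dependent Householder-like parameters (stand-ins for the learned maps):
W = x / np.linalg.norm(x, axis=-1, keepdims=True)        # L2-normalized W_t
u, b = rng.standard_normal(d), 0.0                       # placeholder u, b
beta = 2.0 / (1.0 + np.exp(-(x @ u + b)))                # 2*sigmoid(.) in (0, 2)

def H(t):
    """H_t = I - beta_t W_t W_t^T: identity-plus-rank-one, symmetric."""
    return np.eye(d) - beta[t] * np.outer(W[t], W[t])

# A[i, j] = q_i^T H_ij k_j with H_ij = H_i H_{i-1} ... H_{j+1} (one convention).
A = np.full((L, L), -np.inf)                    # -inf == masked (non-causal)
for i in range(L):
    for j in range(i + 1):
        Hij = np.eye(d)
        for s in range(j + 1, i + 1):
            Hij = H(s) @ Hij                    # accumulate H_{j+1}, ..., H_i
        A[i, j] = q[i] @ Hij @ k[j]

print(np.round(A, 3))                           # causal (lower-triangular) logits
```

The double loop is purely a readable reference; the blockwise algorithm described below avoids ever forming the $H_{ij}$ matrices explicitly.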
The paper shows theoretically that a one-layer PaTH transformer can solve an $\mathsf{NC}^1$-complete problem, suggesting it can extend transformers beyond the $\mathsf{TC}^0$ complexity class, unlike RoPE-based transformers.
PaTH-FoX: Combining with Forgetting Transformer
PaTH can be combined with other attention modifications. The paper explores PaTH-FoX, which integrates PaTH with the Forgetting Transformer (FoX). FoX scales the unnormalized attention weights with a data-dependent forget gate $f_s$ (equivalently, it adds the cumulative log-gate sum to the attention logits):

$$A_{ij} \propto \left( \prod_{s=j+1}^{i} f_s \right) \exp\left( k_j^\top q_i \right)$$
The combined PaTH-FoX attention is:
$$A_{ij} \propto \left( \prod_{s=j+1}^{i} f_s \right) \exp\left( k_j^\top \left( \prod_{s=j+1}^{i} H_s \right) q_i \right)$$
This combination often leads to further performance improvements, especially in length extrapolation.
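As a small illustration of how the two pieces compose, the sketch below (random placeholder gates and logits, not a trained model) adds the cumulative log of the forget gates to an already-computed matrix of PaTH logits, which is exactly the multiplicative gate product above written in log space.

```python
# Combine PaTH logits with FoX-style forget gates in log space (toy values).
import numpy as np

rng = np.random.default_rng(1)
L = 6
path_logits = rng.standard_normal((L, L))           # stand-in for q_i^T H_ij k_j
f = 1.0 / (1.0 + np.exp(-rng.standard_normal(L)))   # forget gates f_t in (0, 1)

cum = np.cumsum(np.log(f))                           # sum_{s<=t} log f_s
bias = cum[:, None] - cum[None, :]                   # log prod_{s=j+1}^{i} f_s

scores = path_logits + bias                          # gated logits
causal = np.tril(np.ones((L, L), dtype=bool))
scores = np.where(causal, scores, -np.inf)

attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)             # row-normalized attention
```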
Efficient Implementation
A key contribution is an efficient, FlashAttention-style blockwise algorithm for PaTH.
- UT Transform for Householder Products: A product of Householder-like matrices $P = \prod_{t=0}^{L-1} H_t$ can be represented compactly as $P = I - W^\top T^{-1} W$, where $W \in \mathbb{R}^{L \times d}$ stacks the $W_t^\top$ as rows and $T$ is triangular, so computing $T^{-1}$ amounts to a triangular matrix inverse (a numerical sketch appears after this list).
- Masked UT Transform for Sub-intervals: To compute products over arbitrary intervals $\prod_{t=s_0}^{e_0} H_t$ without recomputing the UT transform each time, the paper uses a masked formulation $I - (W \odot M_L)^\top T^{-1} (W \odot M_R)$, where $M_L$ and $M_R$ are binary masks selecting the $W_t$ vectors belonging to the interval.
- Blockwise Algorithm: A global $T^{-1}$ computation would cost $O(L^3)$, so the algorithm operates blockwise (a toy sketch of the blockwise scan also appears after this list).
  - Boundary-Adjusted Queries and Keys: For each block, queries $Q_{[i]}$ and keys $K_{[i]}$ are pre-transformed using the local Householder products within that block. For example, for a query $q_{iB+t}$ in block $i$:

    $$\tilde{q}_{iB+t} = \left( \prod_{m=iB+1}^{iB+t} H_m \right) q_{iB+t}$$

    These local products are computed efficiently by applying the UT transform only to the $W_t$ vectors within the block.
  - Cross-Block Transformation: The transformation across block boundaries, $P_{[i]} = \prod_{j=0}^{B-1} H_{iB+j}$, is accumulated. For attention between $Q_{[i]}$ and $K_{[j]}$ (where $j < i$), the queries $Q_{[i]}$ are further transformed by the product of $P_{[m]}$ for $m$ from $j+1$ to $i-1$.
  - FlashAttention-Style Processing: The algorithm processes query blocks $Q_{[i]}$ one by one. For each $Q_{[i]}$, it iterates through key/value blocks $K_{[j]}, V_{[j]}$ from $j = i-1$ down to $0$ (a right-to-left scan):
    - Load $Q_{[i]}$ (already boundary-adjusted).
    - Load $K_{[j]}, V_{[j]}$ (boundary-adjusted) and $P_{[j]}$.
    - Compute logits $A_{[i],[j]} = \tilde{Q}_{[i]} \tilde{K}_{[j]}^\top$.
    - Update online softmax statistics and accumulate the output.
    - Update $\tilde{Q}_{[i]} \leftarrow \tilde{Q}_{[i]} P_{[j]}^\top$ to incorporate the transformation from block $j$.
- The overall complexity becomes comparable to standard attention: $O(L^2 d + L d^2 / B)$ for the attention computation and $O(LB^2 + LBd)$ for preprocessing, assuming block size $B \sim d$.
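Below is a small NumPy check of the UT-transform identity under one set of assumptions: the product is taken with the newest factor on the left, and $\beta_t$ is kept as an explicit diagonal factor rather than folded into $W$ or $T$ (the released kernel may organize this differently). The point is only that the $L$-fold product of $d \times d$ matrices collapses to a triangular solve against the $L \times d$ matrix of $W_t$ vectors.

```python
# UT-transform style compact form for a product of identity-plus-rank-one
# matrices, verified numerically against the explicit product.
import numpy as np

rng = np.random.default_rng(2)
L, d = 7, 5

W = rng.standard_normal((L, d))
W /= np.linalg.norm(W, axis=-1, keepdims=True)          # unit-norm W_t rows
beta = 2.0 / (1.0 + np.exp(-rng.standard_normal(L)))    # beta_t in (0, 2)

# Explicit product, newest factor on the left: P = H_{L-1} ... H_0.
P_naive = np.eye(d)
for t in range(L):
    P_naive = (np.eye(d) - beta[t] * np.outer(W[t], W[t])) @ P_naive

# Compact representation: P = I - W^T T^{-1} diag(beta) W, T unit lower-triangular.
G = W @ W.T                                             # Gram matrix of the W_t
T = np.eye(L) + np.tril(beta[:, None] * G, k=-1)        # triangular "T" factor
C = np.linalg.solve(T, beta[:, None] * W)               # T^{-1} diag(beta) W
P_compact = np.eye(d) - W.T @ C

assert np.allclose(P_naive, P_compact)
```

The masked variant for sub-intervals follows the same pattern, with the masks restricting which $W_t$ rows participate.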
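And here is a toy version of the blockwise scan itself. It is a readability sketch, not the Triton kernel: the indexing convention is my own (so the exact placement of the block-boundary factor differs slightly from the formulas above), there is no online softmax, diagonal blocks are computed directly instead of via the masked UT transform, and the $P_{[m]}$ are materialized as dense $d \times d$ matrices. It only demonstrates the recurrence of computing logits against boundary-adjusted keys and then folding the next block product into the running queries, checked against the quadratic reference.

```python
# Toy blockwise PaTH logit computation, verified against the naive reference.
import numpy as np

rng = np.random.default_rng(3)
L, d, B = 8, 4, 4                              # B divides L
nb = L // B

q = rng.standard_normal((L, d))
k = rng.standard_normal((L, d))
W = rng.standard_normal((L, d))
W /= np.linalg.norm(W, axis=-1, keepdims=True)
beta = 2.0 / (1.0 + np.exp(-rng.standard_normal(L)))

def H(t):
    return np.eye(d) - beta[t] * np.outer(W[t], W[t])

def chain(lo, hi):
    """H_hi @ H_{hi-1} @ ... @ H_lo (identity if lo > hi)."""
    M = np.eye(d)
    for s in range(lo, hi + 1):
        M = H(s) @ M
    return M

# Quadratic reference: A[i, j] = q_i^T (H_i ... H_{j+1}) k_j for j <= i.
ref = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        ref[i, j] = q[i] @ chain(j + 1, i) @ k[j]

# Boundary-adjusted queries/keys and per-block products P_[b].
q_adj = np.zeros_like(q)
k_adj = np.zeros_like(k)
P = []
for b in range(nb):
    lo, hi = b * B, (b + 1) * B - 1
    P.append(chain(lo, hi))
    for p in range(lo, hi + 1):
        q_adj[p] = chain(lo, p).T @ q[p]       # fold block-local factors into q
        k_adj[p] = chain(p + 1, hi) @ k[p]     # ...and the remaining ones into k

out = np.zeros((L, L))
for i in range(nb):
    rows = slice(i * B, (i + 1) * B)
    # Diagonal block: computed directly here; the real kernel uses the masked
    # UT transform for this part instead of explicit per-pair products.
    for p in range(i * B, (i + 1) * B):
        for r in range(i * B, p + 1):
            out[p, r] = q[p] @ chain(r + 1, p) @ k[r]
    # Off-diagonal key blocks, scanned right to left.
    Qt = q_adj[rows].copy()                    # running transformed queries
    for j in range(i - 1, -1, -1):
        cols = slice(j * B, (j + 1) * B)
        out[rows, cols] = Qt @ k_adj[cols].T   # logits against key block j
        Qt = Qt @ P[j]                         # fold in block j's product
                                               # (transpose vs. not depends on
                                               # the product-ordering convention)

assert np.allclose(out, ref)
```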
Efficient Inference
For inference, historical keys $k_i^{(t-1)}$ can be updated in place at the current timestep $t$:

$$k_i^{(t)} \leftarrow \left( I - \beta_t W_t W_t^\top \right) k_i^{(t-1)} \quad \text{for all } i < t.$$

This avoids recomputing cumulative products. Before decoding, the initial keys $K_{[i]}$ are transformed by the suffix products $P_{[i]} P_{[i+1]} \cdots P_{[L/B-1]}$, which can also be computed efficiently. This makes PaTH compatible with existing optimized decoding kernels.
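A decode-time sketch of this update, again on toy random inputs with placeholder $W_t$ and $\beta_t$: each new step multiplies $H_t$ into every cached key in place, so the cache always holds $(H_t \cdots H_{j+1}) k_j$ and the logits reduce to plain dot products against $q_t$. Each step is cross-checked against the explicit cumulative product.

```python
# In-place key-cache update for PaTH decoding (toy sketch).
import numpy as np

rng = np.random.default_rng(4)
L, d = 6, 8

q = rng.standard_normal((L, d))
k = rng.standard_normal((L, d))
W = rng.standard_normal((L, d))
W /= np.linalg.norm(W, axis=-1, keepdims=True)
beta = 2.0 / (1.0 + np.exp(-rng.standard_normal(L)))

def H(t):
    return np.eye(d) - beta[t] * np.outer(W[t], W[t])

cache = np.zeros((0, d))                      # transformed keys seen so far
for t in range(L):
    # Apply H_t to every cached key, then append the raw key k_t.
    cache = cache @ H(t).T                    # row-wise k <- H_t k (H_t is symmetric)
    cache = np.vstack([cache, k[t]])
    logits = cache @ q[t]                     # q_t^T (H_t ... H_{j+1}) k_j

    # Cross-check against the explicit cumulative product for each position j.
    for j in range(t + 1):
        M = np.eye(d)
        for s in range(j + 1, t + 1):
            M = H(s) @ M
        assert np.isclose(logits[j], q[t] @ M @ k[j])
```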
Implementation Details and Availability
- A Triton-based kernel for PaTH attention is provided as part of the flash-linear-attention library: https://github.com/fla-org/flash-linear-attention/tree/main/fla/ops/path_attn.
- Figure 1 shows PaTH incurs a modest slowdown compared to RoPE but is faster than FoX in benchmarks on an H100 GPU.
Experimental Results
PaTH was evaluated on synthetic tasks and language modeling.
- Synthetic Tasks:
- Flip-Flop Language Modeling (FFLM): PaTH (1 layer, 2 heads) achieves near-perfect accuracy (0% in-distribution error, 0.0001% out-of-distribution error), significantly outperforming RoPE, Stick-Breaking Attention (SBA), and FoX.
- Word Problems ($A_5$ group, $\mathsf{NC}^1$-complete): PaTH solves the task with 2 layers at sequence length 20, while RoPE, SBA, and FoX require 4 layers.
- Multi-query Repeated Associative Recall with N-back (MQRAR-N): PaTH successfully tracks variable values for N up to 3 (N-back recall), while RoPE, SBA, and FoX struggle for N > 1.
- Language Modeling (760M-parameter models trained on 50B FineWeb-Edu tokens, 4096-token context):
- Standard Benchmarks: PaTH consistently outperforms RoPE, and usually FoX as well, on WikiText perplexity, LAMBADA, PIQA, HellaSwag, WinoGrande, and ARC. PaTH-FoX often achieves the best perplexity.
| Model    | Wiki. ppl ↓ | LMB. ppl ↓ | LMB. acc ↑ | PIQA acc ↑ | Hella. acc ↑ | Wino. acc ↑ | ARC-e acc ↑ | ARC-c acc ↑ |
| :------- | :---------- | :--------- | :--------- | :--------- | :----------- | :---------- | :---------- | :---------- |
| RoPE     | 19.01       | 19.77      | 40.4       | 70.2       | 50.3         | 54.9        | 67.2        | 33.3        |
| FoX      | 18.33       | 18.28      | 41.7       | 70.8       | 50.9         | 57.1        | 65.7        | 32.6        |
| PaTH     | 18.03       | 16.79      | 44.0       | 70.5       | 51.5         | 56.0        | 68.9        | 34.4        |
| PaTH-FoX | 17.35       | 16.23      | 44.1       | 70.8       | 52.2         | 57.1        | 67.3        | 33.9        |
- Length Extrapolation: On PG-19, CodeParrot, and NarrativeQA, PaTH-FoX consistently achieves the lowest perplexity up to 64K tokens. PaTH alone generalizes well up to 32K. RoPE fails abruptly beyond its 4K training length. The benefit is especially pronounced on code, where state tracking is crucial.
- Long-Context Benchmarks (RULER, BABILONG, PhoneBook, LongBench-E):
- PaTH-FoX excels at retrieval tasks (e.g., Multi-Needle-In-A-Haystack, PhoneBook).
- PaTH and PaTH-FoX show substantial gains in state-tracking tasks (RULER Variable Tracking, BABILONG logic queries). For instance, on RULER 16K, PaTH achieves 18.7% and PaTH-FoX 22.6%, compared to RoPE's 0.0% and FoX's 4.9%. On PhoneBook 8K, PaTH-FoX scores 66.6% vs RoPE's 0.0%.
Practical Implications
- Improved State Tracking: PaTH's data-dependent nature makes it more suitable for tasks requiring tracking of state or entities over long sequences (e.g., coding, narrative understanding).
- Enhanced Length Extrapolation: PaTH, especially PaTH-FoX, demonstrates significantly better generalization to sequence lengths beyond the training context window compared to RoPE.
- Hardware Efficiency: The proposed blockwise algorithm allows PaTH to be trained and run with efficiency comparable to standard attention mechanisms, making it a practical alternative.
- Compatibility: PaTH's design is compatible with context-parallelism techniques for distributed training (e.g., Ring Attention) by passing transformed keys, values, and local Householder product matrices between devices.
In summary, PaTH offers a more expressive and data-aware approach to position encoding in transformers. Its strong empirical performance, particularly on state-tracking and long-context tasks, coupled with an efficient implementation strategy, makes it a promising advancement for LLMs.