
PaTH Attention: Position Encoding via Accumulating Householder Transformations (2505.16381v1)

Published 22 May 2025 in cs.CL and cs.LG

Abstract: The attention mechanism is a core primitive in modern LLMs and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.


Summary

  • The paper introduces PaTH, a novel data-dependent position encoding framework using cumulative Householder-like transformations to enhance sequential reasoning.
  • It derives an efficient blockwise algorithm based on a UT-transform representation of Householder products, enabling FlashAttention-style training and efficient inference.
  • Empirical results show that PaTH and its variant PaTH-FoX outperform RoPE and FoX on synthetic tasks and language modeling benchmarks.

This paper introduces PaTH (Position Encoding via Accumulating Householder Transformations), a novel data-dependent position encoding scheme for transformers designed to overcome limitations of existing methods like Rotary Position Encoding (RoPE). RoPE's transformations are data-independent, limiting its expressivity and performance on tasks requiring sequential reasoning. PaTH addresses this by using accumulated products of data-dependent Householder-like transformations.

Core Idea: Data-Dependent Transformations

Unlike RoPE, where the transformation between a query $q_i$ and a key $k_j$ depends only on their relative position, PaTH introduces a dynamic transformation matrix $H_{ij}$ that is a function of the input data. The attention logit is parameterized as $q_i^T H_{ij} k_j$, where $H_{ij}$ is formed by a cumulative product of Householder-like matrices along the path from position $j$ to $i$:

$$H_{ij} = \prod_{s=j+1}^{i} H_s$$

Each $H_t$ is an identity-plus-rank-one matrix:

$$H_t = I - \beta_t W_t W_t^T$$

where $W_t \in \mathbb{R}^d$ and $\beta_t = 2 \cdot \mathrm{sigmoid}(u^T x_t + b) \in (0, 2)$ are functions of the input $x_t$ at position $t$. This data-dependent nature allows PaTH to adapt its transformations based on the input sequence, potentially capturing more complex sequential patterns. The choice of $\beta_t \in (0, 2)$ allows for negative eigenvalues, which has been shown to improve state-tracking performance in related linear RNN models. The vector $W_t$ is derived from $x_t$ via a low-rank linear layer, a short 1D convolution (filter size 3), and L2 normalization, adding minimal parameters.
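As a rough sketch of this parameterization (not the authors' implementation; the module structure, causal padding choice, and rank below are assumptions for illustration), the per-token pair $(W_t, \beta_t)$ could be computed as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PaTHParams(nn.Module):
    """Hypothetical module producing (W_t, beta_t) for each position t.

    Follows the description above: W_t comes from a low-rank linear layer,
    a short causal 1D convolution (filter size 3), and L2 normalization;
    beta_t = 2 * sigmoid(u^T x_t + b), so beta_t lies in (0, 2).
    """

    def __init__(self, d_model: int, rank: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)   # low-rank factor 1
        self.up = nn.Linear(rank, d_model, bias=False)     # low-rank factor 2
        # Depthwise conv with left padding so position t only sees t-2..t.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3,
                              padding=2, groups=d_model)
        self.beta_proj = nn.Linear(d_model, 1)              # u^T x_t + b

    def forward(self, x):                                   # x: (B, L, D)
        L = x.size(1)
        w = self.up(self.down(x))                           # (B, L, D)
        w = self.conv(w.transpose(1, 2))[..., :L]           # causal conv, crop lookahead
        w = F.normalize(w.transpose(1, 2), dim=-1)          # unit-norm W_t
        beta = 2.0 * torch.sigmoid(self.beta_proj(x)).squeeze(-1)  # (B, L), in (0, 2)
        return w, beta

# Each Householder-like factor (materialized only for checking small cases):
# H_t = torch.eye(D) - beta[b, t] * torch.outer(w[b, t], w[b, t])
```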

The paper shows theoretically that a one-layer PaTH transformer can solve an $\mathrm{NC}^1$-complete problem, suggesting it can extend transformers beyond the $\mathrm{TC}^0$ complexity class, unlike RoPE-based transformers.

PaTH-FoX: Combining with Forgetting Transformer

PaTH can be combined with other attention modifications. The paper explores PaTH-FoX, which integrates PaTH with the Forgetting Transformer (FoX). FoX additively modifies attention logits with a data-dependent "forget" gate $f_s$:

$$A_{ij} \propto \left( \prod_{s=j+1}^{i} f_s \right) \exp(k_j^T q_i)$$

The combined PaTH-FoX attention is:

$$A_{ij} \propto \left( \prod_{s=j+1}^{i} f_s \right) \exp \left( k_j^T \left( \prod_{s=j+1}^{i} H_s \right) q_i \right)$$

This combination often leads to further performance improvements, especially in length extrapolation.
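To pin down the semantics, here is a deliberately naive reference for the PaTH (and, with gates, PaTH-FoX) attention weights. This is a sketch for small inputs only: it materializes every $H_s$, omits softmax scaling, and assumes the cumulative product is taken left to right ($H_{j+1}$ leftmost), which is one possible ordering convention; the blockwise algorithm below never forms these $d \times d$ products.

```python
import torch

def naive_path_attention(q, k, v, w, beta, f=None):
    """Naive O(L^2 d^2) reference for single-head PaTH / PaTH-FoX attention.

    q, k, v: (L, d); w: (L, d) unit vectors; beta: (L,) in (0, 2);
    f: optional (L,) forget gates in (0, 1) for the FoX variant.
    """
    L, d = q.shape
    I = torch.eye(d, dtype=q.dtype)
    H = [I - beta[t] * torch.outer(w[t], w[t]) for t in range(L)]

    logits = torch.full((L, L), float("-inf"), dtype=q.dtype)  # causal mask
    for i in range(L):
        for j in range(i + 1):
            H_ij = I.clone()
            for s in range(j + 1, i + 1):          # H_{j+1} ... H_i, left to right
                H_ij = H_ij @ H[s]
            logits[i, j] = k[j] @ H_ij @ q[i]       # matches the expression above
            if f is not None:                       # FoX: add cumulative log-forget term
                logits[i, j] = logits[i, j] + torch.log(f[j + 1 : i + 1]).sum()
    return torch.softmax(logits, dim=-1) @ v
```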

Efficient Implementation

A key contribution is an efficient, FlashAttention-style blockwise algorithm for PaTH.

  1. UT Transform for Householder Products: Products of Householder-like matrices $P = \prod_{t=0}^{L-1} H_t$ can be compactly represented as $P = I - W T^{-1} W^T$, where $W = [W_0, ..., W_{L-1}]$ and $T^{-1}$ involves a triangular matrix inverse (see the numerical check after this list).
  2. Masked UT Transform for Sub-intervals: To compute products over arbitrary intervals $\prod_{t=s_0}^{e_0} H_t$ without recomputing the UT transform each time, the paper uses a masked formulation: $I - (W \odot M_L)^T T^{-1} (W \odot M_R)$, where $M_L$ and $M_R$ are binary masks selecting the appropriate $W_t$ vectors for the interval.
  3. Blockwise Algorithm:
    • A global $T^{-1}$ computation would be $O(L^3)$. To avoid this, the algorithm operates blockwise.
    • Boundary-Adjusted Queries and Keys: For each block, queries $Q_{[i]}$ and keys $K_{[i]}$ are pre-transformed using local Householder products within that block. For example, for a query $q_{iB+t}$ in block $i$:

      $$\tilde{q}_{iB+t} = \left( \prod_{m=iB+1}^{iB+t} H_m \right) q_{iB+t}$$

      These local products are computed efficiently using the UT transform applied only to the $W_t$ vectors within the block.

    • Cross-Block Transformation: The transformation across block boundaries, $P_{[i]} = \prod_{j=0}^{B-1} H_{iB+j}$, is accumulated. For attention between $Q_{[i]}$ and $K_{[j]}$ (where $j < i$), the queries $Q_{[i]}$ are further transformed by the product of $P_{[m]}$ for $m$ from $j+1$ to $i-1$.

    • FlashAttention-Style Processing: The algorithm processes query blocks $Q_{[i]}$ one by one. For each $Q_{[i]}$, it iterates through key/value blocks $K_{[j]}, V_{[j]}$ from $j = i-1$ down to $0$ (right-to-left scan).

      • Load $Q_{[i]}$ (already boundary-adjusted).
      • Load $K_{[j]}, V_{[j]}$ (boundary-adjusted) and $P_{[j]}$.
      • Compute logits $A_{[i],[j]} = \tilde{Q}_{[i]} \tilde{K}_{[j]}^T$.
      • Update online softmax statistics and accumulate output.
      • Update $\tilde{Q}_{[i]} \leftarrow \tilde{Q}_{[i]} P_{[j]}^T$ to incorporate the transformation from block $j$.
    • The overall complexity becomes comparable to standard attention: $O(L^2 d + L d^2 / B)$ for attention computation and $O(L B^2 + L B d)$ for preprocessing, assuming block size $B \sim d$.
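As a numerical sanity check on the UT transform (step 1 above) and its masked sub-interval variant (step 2), the following sketch uses one concrete reconstruction of the triangular matrix, $T = \mathrm{diag}(1/\beta_t) + \mathrm{triu}(W W^T, 1)$, valid under the left-to-right product convention used here; the paper's kernels may define $T$, the product order, and the masks differently (a single shared mask is used below for simplicity).

```python
import torch

torch.manual_seed(0)
L, d = 8, 16
w = torch.nn.functional.normalize(torch.randn(L, d, dtype=torch.float64), dim=-1)
beta = 2.0 * torch.sigmoid(torch.randn(L, dtype=torch.float64))   # in (0, 2)
I = torch.eye(d, dtype=torch.float64)

# Direct product P = H_0 @ H_1 @ ... @ H_{L-1} (left-to-right convention).
P_direct = I.clone()
for t in range(L):
    P_direct = P_direct @ (I - beta[t] * torch.outer(w[t], w[t]))

# UT-transform form P = I - W^T T^{-1} W with upper-triangular T
# (our reconstruction: T = diag(1/beta) + strictly-upper part of the Gram matrix).
T = torch.diag(1.0 / beta) + torch.triu(w @ w.T, diagonal=1)
P_ut = I - w.T @ torch.linalg.solve(T, w)
print(torch.allclose(P_direct, P_ut, atol=1e-10))   # True

# Masked variant: zeroing rows of W outside [s0, e0] turns those H_t into the
# identity, so the same formula yields the product over just that interval.
s0, e0 = 2, 5
mask = torch.zeros(L, 1, dtype=torch.float64)
mask[s0 : e0 + 1] = 1.0
wm = w * mask
T_sub = torch.diag(1.0 / beta) + torch.triu(wm @ wm.T, diagonal=1)
P_sub = I - wm.T @ torch.linalg.solve(T_sub, wm)

P_check = I.clone()
for t in range(s0, e0 + 1):
    P_check = P_check @ (I - beta[t] * torch.outer(w[t], w[t]))
print(torch.allclose(P_sub, P_check, atol=1e-10))    # True
```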

Efficient Inference

For inference, historical keys $k_i^{(t-1)}$ can be updated in-place for the current timestep $t$: $k_i^{(t)} \leftarrow (I - \beta_t W_t W_t^T) k_i^{(t-1)}$ for all $i < t$. This avoids recomputing cumulative products. Before decoding, initial keys $K_{[i]}$ are transformed by the suffix products $P_{[i]} P_{[i+1]} \cdots P_{[L/B-1]}$, which can also be done efficiently. This makes PaTH compatible with existing optimized decoding kernels.
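A minimal single-head decoding sketch of this in-place key update (a sketch only, assuming no FoX gate and no attention scaling; the cache layout and function signature are illustrative, not the library's API):

```python
import torch

def path_decode_step(q_t, k_t, v_t, w_t, beta_t, K_cache, V_cache):
    """One PaTH decoding step for a single head.

    q_t, k_t, v_t, w_t: (d,) tensors for the current position t; beta_t: scalar.
    K_cache, V_cache: (t, d) tensors of previously cached keys/values, where the
    keys have already absorbed all Householder factors up to step t - 1.
    """
    if K_cache.numel() > 0:
        # In-place style update k_i <- (I - beta_t w_t w_t^T) k_i for all i < t,
        # done as a rank-1 correction rather than a d x d matmul.
        K_cache = K_cache - beta_t * (K_cache @ w_t).unsqueeze(-1) * w_t

    # The current key is appended untouched (the empty product H_{tt} is I).
    K_cache = torch.cat([K_cache, k_t.unsqueeze(0)], dim=0)   # (t + 1, d)
    V_cache = torch.cat([V_cache, v_t.unsqueeze(0)], dim=0)

    # From here on, decoding is ordinary softmax attention over the cache.
    out = torch.softmax(K_cache @ q_t, dim=0) @ V_cache
    return out, K_cache, V_cache
```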

Implementation Details and Availability

  • A Triton-based kernel for PaTH attention is provided as part of the flash-linear-attention library: https://github.com/fla-org/flash-linear-attention/tree/main/fla/ops/path_attn.
  • Figure 1 shows PaTH incurs a modest slowdown compared to RoPE but is faster than FoX in benchmarks on an H100 GPU.

Experimental Results

PaTH was evaluated on synthetic tasks and language modeling.

  1. Synthetic Tasks:
    • Flip-Flop Language Modeling (FFLM): PaTH (1 layer, 2 heads) achieves near-perfect accuracy (0% error in-distribution, 0.0001% out-of-distribution), significantly outperforming RoPE, Stick-Breaking Attention (SBA), and FoX.
    • Word Problems ($A_5$ group, $\mathrm{NC}^1$-complete): PaTH solves the task with 2 layers for sequence length 20, while RoPE, SBA, and FoX require 4 layers.
    • Multi-query Repeated Associative Recall with N-back (MQRAR-N): PaTH successfully tracks variable values for N up to 3 (N-back recall), while RoPE, SBA, and FoX struggle for N > 1.
  2. Language Modeling (760M-parameter models trained on 50B FineWeb-Edu tokens, 4096-token context):
    • Standard Benchmarks: PaTH consistently outperforms RoPE and generally FoX on Wikitext perplexity, LAMBADA, PIQA, HellaSwag, WinoGrande, and ARC. PaTH-FoX often achieves the best perplexity.

      | Model    | Wiki. ppl ↓ | LMB. ppl ↓ | PIQA acc ↑ | Hella. acc ↑ | Wino. acc ↑ | ARC-e acc ↑ | ARC-c acc ↑ | Avg. ↑ |
      | :------- | :---------- | :--------- | :--------- | :----------- | :---------- | :---------- | :---------- | :----- |
      | RoPE     | 19.01       | 19.77      | 40.4       | 70.2         | 50.3        | 54.9        | 67.2        | 33.3   |
      | FoX      | 18.33       | 18.28      | 41.7       | 70.8         | 50.9        | 57.1        | 65.7        | 32.6   |
      | PaTH     | 18.03       | 16.79      | 44.0       | 70.5         | 51.5        | 56.0        | 68.9        | 34.4   |
      | PaTH-FoX | 17.35       | 16.23      | 44.1       | 70.8         | 52.2        | 57.1        | 67.3        | 33.9   |
    • Length Extrapolation: On PG-19, CodeParrot, and NarrativeQA, PaTH-FoX consistently achieves the lowest perplexity up to 64K tokens. PaTH alone generalizes well up to 32K. RoPE fails abruptly beyond its 4K training length. The benefit is especially pronounced on code, where state tracking is crucial.
    • Long-Context Benchmarks (RULER, BABILONG, PhoneBook, LongBench-E):
      • PaTH-FoX excels at retrieval tasks (e.g., Multi-Needle-In-A-Haystack, PhoneBook).
      • PaTH and PaTH-FoX show substantial gains in state-tracking tasks (RULER Variable Tracking, BABILONG logic queries). For instance, on RULER 16K, PaTH achieves 18.7% and PaTH-FoX 22.6%, compared to RoPE's 0.0% and FoX's 4.9%. On PhoneBook 8K, PaTH-FoX scores 66.6% vs RoPE's 0.0%.

Practical Implications

  • Improved State Tracking: PaTH's data-dependent nature makes it more suitable for tasks requiring tracking of state or entities over long sequences (e.g., coding, narrative understanding).
  • Enhanced Length Extrapolation: PaTH, especially PaTH-FoX, demonstrates significantly better generalization to sequence lengths beyond the training context window compared to RoPE.
  • Hardware Efficiency: The proposed blockwise algorithm allows PaTH to be trained and run with efficiency comparable to standard attention mechanisms, making it a practical alternative.
  • Compatibility: PaTH's design is compatible with context-parallelism techniques for distributed training (e.g., Ring Attention) by passing transformed keys, values, and local Householder product matrices between devices.

In summary, PaTH offers a more expressive and data-aware approach to position encoding in transformers. Its strong empirical performance, particularly on state-tracking and long-context tasks, coupled with an efficient implementation strategy, makes it a promising advancement for LLMs.
