
PaTH Attention: Position Encoding via Accumulating Householder Transformations (2505.16381v1)

Published 22 May 2025 in cs.CL and cs.LG

Abstract: The attention mechanism is a core primitive in modern LLMs and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.


Summary

  • The paper introduces PaTH, a novel data-dependent position encoding framework using cumulative Householder-like transformations to enhance sequential reasoning.
  • It derives an efficient blockwise algorithm based on a UT-transform representation of Householder products, enabling FlashAttention-style training and efficient inference.
  • Empirical results show that PaTH and its variant PaTH-FoX outperform RoPE and FoX on synthetic tasks and language modeling benchmarks.

This paper introduces PaTH (Position Encoding via Accumulating Householder Transformations), a novel data-dependent position encoding scheme for transformers designed to overcome limitations of existing methods like Rotary Position Encoding (RoPE). RoPE's transformations are data-independent, limiting its expressivity and performance on tasks requiring sequential reasoning. PaTH addresses this by using accumulated products of data-dependent Householder-like transformations.

Core Idea: Data-Dependent Transformations

Unlike RoPE, where the transformation between a query $q_i$ and a key $k_j$ depends only on their relative position, PaTH introduces a dynamic transformation matrix $H_{ij}$ that is a function of the input data. The attention logit is parameterized as $q_i^T H_{ij} k_j$, where $H_{ij}$ is formed by a cumulative product of Householder-like matrices along the path from position $j$ to $i$:

$$H_{ij} = \prod_{s=j+1}^{i} H_s$$

Each $H_t$ is an identity-plus-rank-one matrix:

$$H_t = I - \beta_t W_t W_t^T$$

where $W_t \in \mathbb{R}^d$ and $\beta_t = 2 \cdot \mathrm{sigmoid}(u^T x_t + b) \in (0, 2)$ are functions of the input $x_t$ at position $t$. This data-dependent nature allows PaTH to adapt its transformations based on the input sequence, potentially capturing more complex sequential patterns. The choice of $\beta_t \in (0, 2)$ allows for negative eigenvalues, which has been shown to improve state-tracking performance in related linear RNN models. The vector $W_t$ is derived from $x_t$ via a low-rank linear layer, a short 1D convolution (filter size 3), and L2 normalization, adding minimal parameters.
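As a rough sketch of this parameterization (not the authors' implementation; the module structure, causal padding choice, and rank below are assumptions for illustration), the per-token pair $(W_t, \beta_t)$ could be computed as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PaTHParams(nn.Module):
    """Hypothetical module producing (W_t, beta_t) for each position t.

    Follows the description above: W_t comes from a low-rank linear layer,
    a short causal 1D convolution (filter size 3), and L2 normalization;
    beta_t = 2 * sigmoid(u^T x_t + b), so beta_t lies in (0, 2).
    """

    def __init__(self, d_model: int, rank: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)   # low-rank factor 1
        self.up = nn.Linear(rank, d_model, bias=False)     # low-rank factor 2
        # Depthwise conv with left padding so position t only sees t-2..t.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3,
                              padding=2, groups=d_model)
        self.beta_proj = nn.Linear(d_model, 1)              # u^T x_t + b

    def forward(self, x):                                   # x: (B, L, D)
        L = x.size(1)
        w = self.up(self.down(x))                           # (B, L, D)
        w = self.conv(w.transpose(1, 2))[..., :L]           # causal conv, crop lookahead
        w = F.normalize(w.transpose(1, 2), dim=-1)          # unit-norm W_t
        beta = 2.0 * torch.sigmoid(self.beta_proj(x)).squeeze(-1)  # (B, L), in (0, 2)
        return w, beta

# Each Householder-like factor (materialized only for checking small cases):
# H_t = torch.eye(D) - beta[b, t] * torch.outer(w[b, t], w[b, t])
```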

The paper shows theoretically that a one-layer PaTH transformer can solve an $\mathrm{NC}^1$-complete problem, suggesting it can extend transformers beyond the $\mathrm{TC}^0$ complexity class, unlike RoPE-based transformers.

PaTH-FoX: Combining with Forgetting Transformer

PaTH can be combined with other attention modifications. The paper explores PaTH-FoX, which integrates PaTH with the Forgetting Transformer (FoX). FoX additively modifies attention logits with a data-dependent "forget" gate $f_s$:

$$A_{ij} \propto \left( \prod_{s=j+1}^{i} f_s \right) \exp(k_j^T q_i)$$

The combined PaTH-FoX attention is:

$$A_{ij} \propto \left( \prod_{s=j+1}^{i} f_s \right) \exp \left( k_j^T \left( \prod_{s=j+1}^{i} H_s \right) q_i \right)$$

This combination often leads to further performance improvements, especially in length extrapolation.
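To pin down the semantics, here is a deliberately naive reference for the PaTH (and, with gates, PaTH-FoX) attention weights. This is a sketch for small inputs only: it materializes every $H_s$, omits softmax scaling, and assumes the cumulative product is taken left to right ($H_{j+1}$ leftmost), which is one possible ordering convention; the blockwise algorithm below never forms these $d \times d$ products.

```python
import torch

def naive_path_attention(q, k, v, w, beta, f=None):
    """Naive O(L^2 d^2) reference for single-head PaTH / PaTH-FoX attention.

    q, k, v: (L, d); w: (L, d) unit vectors; beta: (L,) in (0, 2);
    f: optional (L,) forget gates in (0, 1) for the FoX variant.
    """
    L, d = q.shape
    I = torch.eye(d, dtype=q.dtype)
    H = [I - beta[t] * torch.outer(w[t], w[t]) for t in range(L)]

    logits = torch.full((L, L), float("-inf"), dtype=q.dtype)  # causal mask
    for i in range(L):
        for j in range(i + 1):
            H_ij = I.clone()
            for s in range(j + 1, i + 1):          # H_{j+1} ... H_i, left to right
                H_ij = H_ij @ H[s]
            logits[i, j] = k[j] @ H_ij @ q[i]       # matches the expression above
            if f is not None:                       # FoX: add cumulative log-forget term
                logits[i, j] = logits[i, j] + torch.log(f[j + 1 : i + 1]).sum()
    return torch.softmax(logits, dim=-1) @ v
```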

Efficient Implementation

A key contribution is an efficient, FlashAttention-style blockwise algorithm for PaTH.

  1. UT Transform for Householder Products: Products of Householder-like matrices $P = \prod_{t=0}^{L-1} H_t$ can be compactly represented as $P = I - W T^{-1} W^T$, where $W = [W_0, ..., W_{L-1}]$ and $T^{-1}$ involves a triangular matrix inverse (see the numerical check after this list).
  2. Masked UT Transform for Sub-intervals: To compute products over arbitrary intervals $\prod_{t=s_0}^{e_0} H_t$ without recomputing the UT transform each time, the paper uses a masked formulation: $I - (W \odot M_L)^T T^{-1} (W \odot M_R)$, where $M_L$ and $M_R$ are binary masks selecting the appropriate $W_t$ vectors for the interval.
  3. Blockwise Algorithm:
    • A global $T^{-1}$ computation would be $O(L^3)$. To avoid this, the algorithm operates blockwise.
    • Boundary-Adjusted Queries and Keys: For each block, queries $Q_{[i]}$ and keys $K_{[i]}$ are pre-transformed using local Householder products within that block. For example, for a query $q_{iB+t}$ in block $i$:

      $$\tilde{q}_{iB+t} = \left( \prod_{m=iB+1}^{iB+t} H_m \right) q_{iB+t}$$

      These local products are computed efficiently using the UT transform applied only to the $W_t$ vectors within the block.

    • Cross-Block Transformation: The transformation across block boundaries, $P_{[i]} = \prod_{j=0}^{B-1} H_{iB+j}$, is accumulated. For attention between $Q_{[i]}$ and $K_{[j]}$ (where $j < i$), the queries $Q_{[i]}$ are further transformed by the product of $P_{[m]}$ for $m$ from $j+1$ to $i-1$.

    • FlashAttention-Style Processing: The algorithm processes query blocks $Q_{[i]}$ one by one. For each $Q_{[i]}$, it iterates through key/value blocks $K_{[j]}, V_{[j]}$ from $j = i-1$ down to $0$ (right-to-left scan).

      • Load $Q_{[i]}$ (already boundary-adjusted).
      • Load $K_{[j]}, V_{[j]}$ (boundary-adjusted) and $P_{[j]}$.
      • Compute logits $A_{[i],[j]} = \tilde{Q}_{[i]} \tilde{K}_{[j]}^T$.
      • Update online softmax statistics and accumulate output.
      • Update $\tilde{Q}_{[i]} \leftarrow \tilde{Q}_{[i]} P_{[j]}^T$ to incorporate the transformation from block $j$.
    • The overall complexity becomes comparable to standard attention: $O(L^2 d + L d^2 / B)$ for attention computation and $O(L B^2 + L B d)$ for preprocessing, assuming block size $B \sim d$.
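As a numerical sanity check on the UT transform (step 1 above) and its masked sub-interval variant (step 2), the following sketch uses one concrete reconstruction of the triangular matrix, $T = \mathrm{diag}(1/\beta_t) + \mathrm{triu}(W W^T, 1)$, valid under the left-to-right product convention used here; the paper's kernels may define $T$, the product order, and the masks differently (a single shared mask is used below for simplicity).

```python
import torch

torch.manual_seed(0)
L, d = 8, 16
w = torch.nn.functional.normalize(torch.randn(L, d, dtype=torch.float64), dim=-1)
beta = 2.0 * torch.sigmoid(torch.randn(L, dtype=torch.float64))   # in (0, 2)
I = torch.eye(d, dtype=torch.float64)

# Direct product P = H_0 @ H_1 @ ... @ H_{L-1} (left-to-right convention).
P_direct = I.clone()
for t in range(L):
    P_direct = P_direct @ (I - beta[t] * torch.outer(w[t], w[t]))

# UT-transform form P = I - W^T T^{-1} W with upper-triangular T
# (our reconstruction: T = diag(1/beta) + strictly-upper part of the Gram matrix).
T = torch.diag(1.0 / beta) + torch.triu(w @ w.T, diagonal=1)
P_ut = I - w.T @ torch.linalg.solve(T, w)
print(torch.allclose(P_direct, P_ut, atol=1e-10))   # True

# Masked variant: zeroing rows of W outside [s0, e0] turns those H_t into the
# identity, so the same formula yields the product over just that interval.
s0, e0 = 2, 5
mask = torch.zeros(L, 1, dtype=torch.float64)
mask[s0 : e0 + 1] = 1.0
wm = w * mask
T_sub = torch.diag(1.0 / beta) + torch.triu(wm @ wm.T, diagonal=1)
P_sub = I - wm.T @ torch.linalg.solve(T_sub, wm)

P_check = I.clone()
for t in range(s0, e0 + 1):
    P_check = P_check @ (I - beta[t] * torch.outer(w[t], w[t]))
print(torch.allclose(P_sub, P_check, atol=1e-10))    # True
```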

Efficient Inference

For inference, historical keys $k_i^{(t-1)}$ can be updated in-place for the current timestep $t$: $k_i^{(t)} \leftarrow (I - \beta_t W_t W_t^T) k_i^{(t-1)}$ for all $i < t$. This avoids recomputing cumulative products. Before decoding, initial keys $K_{[i]}$ are transformed by the suffix products $P_{[i]} P_{[i+1]} \cdots P_{[L/B-1]}$, which can also be done efficiently. This makes PaTH compatible with existing optimized decoding kernels.
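A minimal single-head decoding sketch of this in-place key update (a sketch only, assuming no FoX gate and no attention scaling; the cache layout and function signature are illustrative, not the library's API):

```python
import torch

def path_decode_step(q_t, k_t, v_t, w_t, beta_t, K_cache, V_cache):
    """One PaTH decoding step for a single head.

    q_t, k_t, v_t, w_t: (d,) tensors for the current position t; beta_t: scalar.
    K_cache, V_cache: (t, d) tensors of previously cached keys/values, where the
    keys have already absorbed all Householder factors up to step t - 1.
    """
    if K_cache.numel() > 0:
        # In-place style update k_i <- (I - beta_t w_t w_t^T) k_i for all i < t,
        # done as a rank-1 correction rather than a d x d matmul.
        K_cache = K_cache - beta_t * (K_cache @ w_t).unsqueeze(-1) * w_t

    # The current key is appended untouched (the empty product H_{tt} is I).
    K_cache = torch.cat([K_cache, k_t.unsqueeze(0)], dim=0)   # (t + 1, d)
    V_cache = torch.cat([V_cache, v_t.unsqueeze(0)], dim=0)

    # From here on, decoding is ordinary softmax attention over the cache.
    out = torch.softmax(K_cache @ q_t, dim=0) @ V_cache
    return out, K_cache, V_cache
```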

Implementation Details and Availability

  • A Triton-based kernel for PaTH attention is provided as part of the flash-linear-attention library: https://github.com/fla-org/flash-linear-attention/tree/main/fla/ops/path_attn.
  • Figure 1 shows PaTH incurs a modest slowdown compared to RoPE but is faster than FoX in benchmarks on an H100 GPU.

Experimental Results

PaTH was evaluated on synthetic tasks and language modeling.

  1. Synthetic Tasks:
    • Flip-Flop Language Modeling (FFLM): PaTH (1 layer, 2 heads) achieves near-perfect accuracy (0% error in-distribution, 0.0001% out-of-distribution), significantly outperforming RoPE, Stick-Breaking Attention (SBA), and FoX.
    • Word Problems ($A_5$ group, $\mathrm{NC}^1$-complete): PaTH solves the task with 2 layers for sequence length 20, while RoPE, SBA, and FoX require 4 layers.
    • Multi-query Repeated Associative Recall with N-back (MQRAR-N): PaTH successfully tracks variable values for N up to 3 (N-back recall), while RoPE, SBA, and FoX struggle for N > 1.
  2. Language Modeling (760M-parameter models trained on 50B FineWeb-Edu tokens, 4096-token context):
    • Standard Benchmarks: PaTH consistently outperforms RoPE and generally FoX on Wikitext perplexity, LAMBADA, PIQA, HellaSwag, WinoGrande, and ARC. PaTH-FoX often achieves the best perplexity.

      | Model    | Wiki. ppl ↓ | LMB. ppl ↓ | PIQA acc ↑ | Hella. acc ↑ | Wino. acc ↑ | ARC-e acc ↑ | ARC-c acc ↑ | Avg. ↑ |
      | :------- | :---------- | :--------- | :--------- | :----------- | :---------- | :---------- | :---------- | :----- |
      | RoPE     | 19.01       | 19.77      | 40.4       | 70.2         | 50.3        | 54.9        | 67.2        | 33.3   |
      | FoX      | 18.33       | 18.28      | 41.7       | 70.8         | 50.9        | 57.1        | 65.7        | 32.6   |
      | PaTH     | 18.03       | 16.79      | 44.0       | 70.5         | 51.5        | 56.0        | 68.9        | 34.4   |
      | PaTH-FoX | 17.35       | 16.23      | 44.1       | 70.8         | 52.2        | 57.1        | 67.3        | 33.9   |
    • Length Extrapolation: On PG-19, CodeParrot, and NarrativeQA, PaTH-FoX consistently achieves the lowest perplexity up to 64K tokens. PaTH alone generalizes well up to 32K. RoPE fails abruptly beyond its 4K training length. The benefit is especially pronounced on code, where state tracking is crucial.
    • Long-Context Benchmarks (RULER, BABILONG, PhoneBook, LongBench-E):
      • PaTH-FoX excels at retrieval tasks (e.g., Multi-Needle-In-A-Haystack, PhoneBook).
      • PaTH and PaTH-FoX show substantial gains in state-tracking tasks (RULER Variable Tracking, BABILONG logic queries). For instance, on RULER 16K, PaTH achieves 18.7% and PaTH-FoX 22.6%, compared to RoPE's 0.0% and FoX's 4.9%. On PhoneBook 8K, PaTH-FoX scores 66.6% vs RoPE's 0.0%.

Practical Implications

  • Improved State Tracking: PaTH's data-dependent nature makes it more suitable for tasks requiring tracking of state or entities over long sequences (e.g., coding, narrative understanding).
  • Enhanced Length Extrapolation: PaTH, especially PaTH-FoX, demonstrates significantly better generalization to sequence lengths beyond the training context window compared to RoPE.
  • Hardware Efficiency: The proposed blockwise algorithm allows PaTH to be trained and run with efficiency comparable to standard attention mechanisms, making it a practical alternative.
  • Compatibility: PaTH's design is compatible with context-parallelism techniques for distributed training (e.g., Ring Attention) by passing transformed keys, values, and local Householder product matrices between devices.

In summary, PaTH offers a more expressive and data-aware approach to position encoding in transformers. Its strong empirical performance, particularly on state-tracking and long-context tasks, coupled with an efficient implementation strategy, makes it a promising advancement for LLMs.
