Lie Relative Encodings (LieRE)

Updated 16 March 2026
  • LieRE is a multidimensional positional encoding framework that generalizes RoPE using learnable Lie group rotations.
  • It employs D learnable skew-symmetric generators and matrix exponentiation to enable continuous, relative token rotations in embedding spaces.
  • Empirical results show LieRE improves accuracy and generalization on 2D/3D tasks while reducing data and compute requirements.

Lie Relative Encodings (LieRE) provide a mathematically principled and computationally efficient framework for high-dimensional positional encoding in transformer models. Generalizing Rotary Position Encoding (RoPE) from 1D commuting rotations to full multidimensional Lie group rotations, LieRE lets token positions in $D$-dimensional space induce continuous, learnable, relative rotations of the embedding space. The approach leverages the geometry of the special orthogonal group SO(n) and its Lie algebra, preserving the dot-product attention mechanism while overcoming the limitations of traditional block-diagonal positional encodings in both expressivity and transfer to new spatial resolutions (Ostmeier et al., 2024).

1. Mathematical Foundations

LieRE is grounded in the theory of Lie groups and their Lie algebras. In contrast to standard RoPE, which encodes each scalar position $p \in \mathbb{R}$ as a product of 2×2 rotations in embedding space (using fixed or learnable frequencies), LieRE defines a continuous map from $p \in \mathbb{R}^D$ to an $n \times n$ rotation matrix $R(p) \in \mathrm{SO}(n)$:

$$R(p) = \exp(A(p)) = \exp\left( \sum_{d=1}^{D} p_d G_d \right)$$

where each $G_d \in \mathbb{R}^{n \times n}$ is a learnable skew-symmetric generator ($G_d^T = -G_d$) and $A : \mathbb{R}^D \to \mathfrak{so}(n)$ is a linear map into the Lie algebra. Token embeddings (keys and queries) are rotated:

$$k' = R(p)\,k, \qquad q' = R(p)\,q$$

Attention scores between tokens at positions $p_i$ and $p_j$ are then

$$(q'_i)^T k'_j = q_i^T R(p_i)^T R(p_j)\, k_j \approx q_i^T \exp(A(p_j - p_i))\, k_j,$$

which (approximately) depends only on the relative spatial displacement $\Delta p = p_j - p_i$. This contrasts sharply with standard RoPE, where only 1D geometry is available and the block-diagonal structure prevents mixing across high-dimensional grids.
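A minimal PyTorch sketch (tensor names and sizes are illustrative, not from the paper's code) can verify this relative-position property numerically: the identity $R(p_i)^T R(p_j) = \exp(A(p_j - p_i))$ holds exactly when the generators commute, and only approximately for generic skew-symmetric generators.

```python
import torch

torch.manual_seed(0)
n, D = 8, 2  # head dimension and number of spatial dimensions (illustrative)

# Random skew-symmetric generators G_d = B_d - B_d^T
B = torch.randn(D, n, n)
G = B - B.transpose(-1, -2)

def A(p):
    """Linear map A(p) = sum_d p_d G_d into the Lie algebra so(n)."""
    return torch.einsum('d,dij->ij', p, G)

p_i = torch.tensor([0.3, -1.2])
p_j = torch.tensor([1.0, 0.5])

R_i = torch.linalg.matrix_exp(A(p_i))
R_j = torch.linalg.matrix_exp(A(p_j))

lhs = R_i.T @ R_j                            # what attention actually computes
rhs = torch.linalg.matrix_exp(A(p_j - p_i))  # the ideal relative rotation

# Nonzero for generic (noncommuting) generators; zero if all G_d commute.
print('approximation error:', (lhs - rhs).norm().item())
```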

2. Parametrization and Optimization

LieRE maintains $D$ learnable skew-symmetric matrices $G_d$ of size $n \times n$ (typically $n$ equals the attention head dimension, e.g., 64), either shared across heads or instantiated per head. Different block-diagonal structures can be imposed on $G_d$ for computational efficiency:

  • 2×2 blocks: recovers RoPE-Mixed and ensures commutativity.
  • Larger blocks, up to size $n$: enable full-rank capacity and richer representations.

All $G_d$ are optimized end-to-end with the full transformer under the standard attention-plus-classification loss. Gradients propagate through the matrix exponential, which automatic-differentiation frameworks handle efficiently (e.g., PyTorch's torch.linalg.matrix_exp, a Padé-approximation-based implementation).
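The following sketch shows one way to implement this parametrization in PyTorch; the module name, initialization scale, and block handling are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LieREEncoding(nn.Module):
    """Minimal LieRE sketch with learnable block-diagonal skew-symmetric
    generators. block_size=2 roughly recovers RoPE-Mixed-style commuting
    rotations; block_size=head_dim gives full LieRE capacity.
    (Names and initialization are illustrative, not from the paper.)"""

    def __init__(self, spatial_dim: int, head_dim: int, block_size: int = 2):
        super().__init__()
        assert head_dim % block_size == 0
        self.block_size = block_size
        n_blocks = head_dim // block_size
        # Unconstrained parameters; skew-symmetry is enforced in forward().
        self.raw = nn.Parameter(
            0.02 * torch.randn(spatial_dim, n_blocks, block_size, block_size))

    def forward(self, x: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        """Rotate keys/queries x of shape (T, head_dim) by positions p
        of shape (T, spatial_dim)."""
        G = self.raw - self.raw.transpose(-1, -2)     # skew-symmetric G_d
        A = torch.einsum('td,dbij->tbij', p, G)       # A(p) = sum_d p_d G_d
        R = torch.linalg.matrix_exp(A)                # blockwise exp into SO(n)
        xb = x.view(x.shape[0], -1, self.block_size)  # split into blocks
        return torch.einsum('tbij,tbj->tbi', R, xb).reshape_as(x)

# Usage: rotate queries and keys before dot-product attention.
enc = LieREEncoding(spatial_dim=2, head_dim=64, block_size=8)
p = torch.rand(196, 2)          # e.g., normalized (row, col) patch coordinates
q, k = torch.randn(196, 64), torch.randn(196, 64)
q_rot, k_rot = enc(q, p), enc(k, p)
```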

3. Computational Complexity and Practical Considerations

Computing the matrix exponential $\exp(A(p))$ has worst-case complexity $O(n^3)$, but with the moderate $n$ typical of attention heads, the overhead is practical. GPU-optimized implementations keep the cost under approximately 10–20% of a forward pass. $R(p)$ is computed once per token per forward pass and can be reused across attention layers, further amortizing the expense. Memory requirements for storing all $R(p)$ are $O(T n^2)$ for $T$ tokens, remaining comparable to the storage for attention head activations on modern hardware.
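As a rough worked example under the sizes above (illustrative numbers, not measurements from the paper): with $T = 196$ tokens and $n = 64$ in fp32, caching every $R(p)$ costs about 3 MiB per set of generators, and the computation can be hoisted out of the per-layer attention code:

```python
import torch

def precompute_rotations(G: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """G: (D, n, n) skew-symmetric generators; p: (T, D) token positions.
    Computed once per forward pass; every attention layer reuses the result."""
    A = torch.einsum('td,dij->tij', p, G)  # (T, n, n), one A(p) per token
    return torch.linalg.matrix_exp(A)      # O(T * n^3), amortized across layers

D, n, T = 2, 64, 196
B = torch.randn(D, n, n)
G = B - B.transpose(-1, -2)                # illustrative random generators
R = precompute_rotations(G, torch.rand(T, D))
print(R.element_size() * R.nelement() / 2**20, 'MiB')  # ~3.06 MiB in fp32
```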

4. Empirical Performance and Experimental Setup

Evaluations were conducted on canonical 2D and 3D classification tasks:

  • 2D image classification: CIFAR-100 (32×32, 100 classes), ImageNet-1K (224×224, 1,000 classes).
  • 3D classification: UCF101 video action recognition, RSNA CT scan intracranial hemorrhage detection.

Standard ViT-B backbones (12 layers, hidden dim 768, MLP dim 3072, 12 heads) were used with the Adam optimizer, a cosine learning-rate schedule, and standard augmentation pipelines. Training required less than 30 minutes on 8×NVIDIA L4 GPUs for CIFAR-100, and 1–2 days on 4–8×A100 GPUs for the 3D tasks.

The following performance metrics highlight the improvements conferred by LieRE over baselines:

Method               CIFAR-100 (%)   ImageNet (%)   UCF101 (%)   RSNA (%)
Absolute Pos. Enc.   63.9            66.1           44.4         80.7
RoPE-Mixed           66.7            68.5           48.6         81.9
LieRE (full)         69.4            68.8           51.1         82.7

On 2D tasks, LieRE improves accuracy by about 1.5% over state-of-the-art baselines; in 3D, gains are on the order of 1%. With only 40% of the CIFAR-100 data, LieRE outperforms all baselines by approximately 4.5%, with the gap widening as data decreases. To reach the accuracy of the absolute-PE baseline at 200 epochs, LieRE requires only about a third as many training steps (roughly 3.5× fewer). Increasing the Lie algebra block size $b$ in $G_d$ yields monotonic accuracy gains, empirically confirming the importance of algebraic capacity.

5. Generalization to Higher-Dimensional and Unseen Grids

The core structural benefit of LieRE lies in its continuous, smooth group parameterization:

  • $R(p) = \exp(A(p))$ enables interpolation to unseen positions or grid configurations, unlike RoPE's fixed sinusoidal tables.
  • By learning unrestricted generators $G_d$, LieRE captures arbitrary relative displacements in $D$ dimensions (as opposed to stacking independent 1D RoPE blocks).
  • The approximate relation $\exp(A(p))^T \exp(A(q)) \approx \exp(A(q - p))$, though exact only for commuting generators, is exploited via end-to-end learning, which keeps the approximation error small in ways advantageous for attention.

These properties confer marked advantages in tasks requiring resolution generalization or robust transfer to novel spatial layouts.
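A minimal sketch of how this interpolation can be exercised in practice, assuming positions are expressed as continuous normalized coordinates (the helper below is illustrative, not the paper's evaluation code):

```python
import torch

def normalized_grid(height: int, width: int) -> torch.Tensor:
    """Continuous (row, col) coordinates in [0, 1]^2 for an H x W patch grid.
    Because R(p) = exp(A(p)) is defined for any real p, the same learned
    generators apply to coordinates from a finer grid never seen in training."""
    ys = torch.linspace(0.0, 1.0, height)
    xs = torch.linspace(0.0, 1.0, width)
    yy, xx = torch.meshgrid(ys, xs, indexing='ij')
    return torch.stack([yy, xx], dim=-1).reshape(-1, 2)  # (H*W, 2)

p_train = normalized_grid(14, 14)  # e.g., 224px input with 16px patches
p_test  = normalized_grid(21, 21)  # a finer, unseen grid at test time
# Both feed directly into the LieREEncoding sketch from Section 2.
```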

6. Limitations and Prospective Developments

Several important limitations and open questions remain for LieRE:

  • Dot-product attention constraint: The current formulation is intrinsically linked to models utilizing inner-product attention. Extending LieRE-type encodings to architectures lacking dot-product attention (e.g., CNNs) would require novel mechanisms.
  • Restrictive group structure: $A : \mathbb{R}^D \to \mathfrak{so}(n)$ cannot directly encode full rigid-body motions (SE(3)) or arbitrary non-Euclidean geometry. Extensions to non-compact groups, matrix groups beyond SO(n), or semidirect-product constructions could provide richer geometric modeling, especially for robotics.
  • Matrix exponential cost: While manageable at $n \sim 64$, evaluating $\exp(A(p))$ at much larger $n$ or under resource constraints is expensive. Approximate methods (low-rank or sketched generators, truncated series) may mitigate the overhead.
  • Numerical issues for large contexts: As with RoPE, unbounded $\|A(p)\|$ at large spatial extents may cause numerical instability. Techniques from NTK scaling or periodic reindexing may provide robustness in very large domains; a minimal rescaling sketch follows this list.
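As one illustrative mitigation for the last point (an assumed heuristic, not a technique proposed in the paper), raw coordinates can be affinely rescaled to a fixed reference extent before entering $A(p)$:

```python
import torch

def rescale_positions(p: torch.Tensor, max_extent: float = 1.0) -> torch.Tensor:
    """Affinely map positions into [0, max_extent]^D so that ||A(p)|| stays
    bounded regardless of the raw spatial extent of the input.
    Illustrative heuristic only; NTK-style scaling is an alternative."""
    p_min = p.amin(dim=0, keepdim=True)
    p_max = p.amax(dim=0, keepdim=True)
    return max_extent * (p - p_min) / (p_max - p_min).clamp_min(1e-8)

p_large = torch.rand(4096, 3) * 1000.0  # e.g., a very large 3D volume
p_safe = rescale_positions(p_large)     # bounded inputs to A(p)
```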

This suggests future research may focus on expanding the underlying group structure, approximating computation for larger heads, and integrating with non-attention architectures.

7. Summary and Significance

LieRE introduces a unified, high-fidelity approach to positional encoding for transformers operating on multidimensional spatial data. By replacing RoPE’s block-diagonal 2D rotations with learnable, full-rank D-parameter Lie algebraic rotations, LieRE simultaneously achieves:

  • Applicability across all 1D, 2D, and 3D domains
  • Continuous grid and coordinate generalization
  • Superior accuracy (gains of approximately 2–6% over absolute position encoding baselines)
  • Reduced data and compute requirements (strong performance with only 40% of the training data, and baseline accuracy reached with roughly 30% of the training steps, i.e., 3.5× fewer)

The method thus establishes a new standard for general-purpose, geometrically grounded transformer positional encodings (Ostmeier et al., 2024).

References

Ostmeier et al. (2024). LieRE: Generalizing Rotary Position Encodings.
