LieRE 3D: Rotational Positional Encoding

Updated 16 April 2026
  • LieRE 3D is a 3D positional encoding scheme that uses the Lie group SO(3) to capture noncommutative, path-dependent rotational relationships.
  • It integrates closed-form Rodrigues computations with transformer attention by mapping 3D token positions to learned skew-symmetric matrices, ensuring translational equivariance.
  • Empirical evaluations on benchmarks like UCF101 and RSNA demonstrate that LieRE 3D outperforms traditional absolute and RoPE-based methods in classification accuracy.

LieRE 3D (Lie Rotational Positional Encodings in Three Dimensions) is a positional encoding scheme that generalizes the rotary position encoding (RoPE) mechanism to noncommutative Lie group rotations in three dimensions, enabling transformer models to effectively encode relative positions in 3D grid-structured data. LieRE is built upon the mathematical structure of SO(3), the Lie group of 3D rotations, and leverages its associated Lie algebra for high-capacity, geometry-aware encodings that are crucial for 3D vision, volumetric imaging, and video representation tasks (Ostmeier et al., 2024).

1. Limitations of Previous Positional Encoding Mechanisms

Traditional rotary positional encoding (RoPE) injects relative position by applying block-diagonal 2D rotations to attention keys and queries. RoPE is effective for 1D sequence modeling, enabling attention scores to depend only on the token offset $j - i$. However, key limitations arise in higher dimensions:

  • Dimensionality Constraint: RoPE inherently operates on 1D sequences; naive extensions to 2D/3D (independent block rotations) impose a commuting structure, which fails to capture path-dependent relationships. In 3D data, this prevents distinction between different routes between grid points (e.g., “move up then right” vs “right then up”).
  • Limited Representational Expressivity: The block structure of 2D RoPE cannot represent general 3D rotations or encode orientation-dependent cues essential for spatial or spatiotemporal data.
  • Loss of Spatial Locality: Flattening multidimensional grids into sequences discards natural adjacency and undermines geometric inductive biases (Ostmeier et al., 2024).
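The contrast between commuting planar rotations and noncommuting 3D rotations can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names (`rot2d`, `axial`, `rot_x`, `rot_y`), the frequency `freq`, and the angles are illustrative choices, not taken from the source.

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix, the building block of RoPE."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def axial(theta_x, theta_y):
    """Naive 2D extension: one independent 2x2 block per grid axis."""
    M = np.zeros((4, 4))
    M[:2, :2] = rot2d(theta_x)
    M[2:, 2:] = rot2d(theta_y)
    return M

def rot_x(theta):
    """3D rotation about the x-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(theta):
    """3D rotation about the y-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

# 1D RoPE: the score between rotated q and k depends only on the offset j - i.
rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
freq = 0.3  # illustrative frequency (assumption)
score_a = (rot2d(freq * 2) @ q) @ (rot2d(freq * 5) @ k)    # i = 2,  j = 5
score_b = (rot2d(freq * 10) @ q) @ (rot2d(freq * 13) @ k)  # i = 10, j = 13
rope_is_relative = np.isclose(score_a, score_b)

# Independent per-axis blocks commute: "up then right" equals "right then up".
up, right = axial(0.0, 0.7), axial(0.4, 0.0)
axial_commutes = np.allclose(up @ right, right @ up)

# Rotations about different 3D axes generally do not commute.
so3_commutes = np.allclose(rot_x(0.4) @ rot_y(1.1), rot_y(1.1) @ rot_x(0.4))
```

Here `rope_is_relative` and `axial_commutes` come out true while `so3_commutes` comes out false, which is the path dependence that the block-diagonal extension cannot express.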

2. Foundations in Lie Groups and Lie Algebras

LieRE leverages the properties of the Lie group SO(3) and its Lie algebra so(3):

  • SO(3): The group of all $3 \times 3$ real orthogonal matrices with determinant $+1$, representing all proper rotations in $\mathbb{R}^3$.
  • so(3): The Lie algebra of SO(3), consisting of all $3 \times 3$ real skew-symmetric matrices. Any $\omega \in \mathbb{R}^3$ is mapped to a skew-symmetric matrix:

$$\Omega = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix}$$

  • Exponential and Logarithm Maps: $\exp: \mathrm{so}(3) \to \mathrm{SO}(3)$ provides a local diffeomorphism; for small $X, Y \in \mathrm{so}(3)$, $\exp(X)\exp(Y) \approx \exp(X+Y)$, supporting relative positional encoding.
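  • Axis-Angle Parameterization: Any rotation in SO(3) can be parameterized by a vector $\omega \in \mathbb{R}^3$ whose direction is the rotation axis and whose norm $\theta = \|\omega\|$ is the rotation angle, with the Rodrigues formula providing a closed-form matrix exponential:

$$\exp(\Omega) = I + \frac{\sin\theta}{\theta}\,\Omega + \frac{1 - \cos\theta}{\theta^2}\,\Omega^2$$

where $\Omega$ is the skew-symmetric matrix generated by $\omega$.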
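The skew-symmetric map and the Rodrigues formula translate directly into a few lines of code. This is a minimal sketch, assuming NumPy; the function names `hat` and `rodrigues` and the small-angle threshold `eps` are illustrative choices rather than anything specified in the source.

```python
import numpy as np

def hat(omega):
    """Map a vector in R^3 to its skew-symmetric matrix in so(3)."""
    w1, w2, w3 = omega
    return np.array([[0.0, -w3, w2],
                     [w3, 0.0, -w1],
                     [-w2, w1, 0.0]])

def rodrigues(omega, eps=1e-8):
    """Closed-form exp(hat(omega)) via the Rodrigues formula."""
    theta = np.linalg.norm(omega)
    K = hat(omega)
    if theta < eps:
        # Second-order Taylor expansion avoids dividing by a tiny angle.
        return np.eye(3) + K + 0.5 * K @ K
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * K @ K)

# The result is a proper rotation: orthogonal with determinant +1.
R = rodrigues(np.array([0.3, -0.5, 0.8]))
is_orthogonal = np.allclose(R.T @ R, np.eye(3))
has_unit_det = np.isclose(np.linalg.det(R), 1.0)

# Rotating by pi/2 about the z-axis sends e_x to e_y.
Rz = rodrigues(np.array([0.0, 0.0, np.pi / 2]))
maps_x_to_y = np.allclose(Rz @ np.array([1.0, 0.0, 0.0]), [0.0, 1.0, 0.0])

# For small generators, exp(X) exp(Y) is close to exp(X + Y).
x, y = np.array([0.01, 0.0, 0.0]), np.array([0.0, 0.02, 0.0])
small_angle_gap = np.abs(rodrigues(x) @ rodrigues(y) - rodrigues(x + y)).max()
```

The residual `small_angle_gap` is driven by the commutator of the two generators, so it shrinks quadratically as the rotation vectors shrink.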

3. Construction of LieRE 3D Positional Encoding

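LieRE maps each token’s 3D coordinate $p = (x, y, z)$ to the Lie algebra so(3) using a learned linear generator:

$$A(p) = x\,A_x + y\,A_y + z\,A_z$$

where $A_x$, $A_y$, $A_z$ are learned $3 \times 3$ skew-symmetric matrices.

Each $A(p)$ is exponentiated to obtain a rotation matrix:

$$R(p) = \exp(A(p))$$

The rotation $R(p)$ is then applied to the query and key representations:

$$\tilde{q}_i = R(p_i)\,q_i, \qquad \tilde{k}_j = R(p_j)\,k_j$$

When attention is computed between tokens $i$ and $j$:

$$\tilde{q}_i^\top \tilde{k}_j = q_i^\top R(p_i)^\top R(p_j)\,k_j = q_i^\top \exp(-A(p_i))\exp(A(p_j))\,k_j \approx q_i^\top \exp(A(p_j - p_i))\,k_j$$

The final step is exact when $A(p_i)$ and $A(p_j)$ commute and otherwise holds in the small-generator approximation from Section 2. Under that condition the inner product is a function only of the position difference $p_j - p_i$, which gives the encoding its translational equivariance in 3D space.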
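The construction can be sketched end to end for a single 3-dimensional query/key block. The code below is an illustration under stated assumptions, using NumPy: the matrix `W` holds hand-picked stand-ins for learned generator parameters (column k is the rotation vector of the k-th generator), and the names `rotation` and `relative` are hypothetical. Because the skew-symmetric map is linear, the generator for position p is the skew-symmetric matrix of `W @ p`.

```python
import numpy as np

def hat(omega):
    """Map a vector in R^3 to its skew-symmetric matrix in so(3)."""
    w1, w2, w3 = omega
    return np.array([[0.0, -w3, w2], [w3, 0.0, -w1], [-w2, w1, 0.0]])

def rodrigues(omega, eps=1e-8):
    """Closed-form exp(hat(omega))."""
    theta = np.linalg.norm(omega)
    K = hat(omega)
    if theta < eps:
        return np.eye(3) + K + 0.5 * K @ K
    return (np.eye(3) + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * K @ K)

# ASSUMPTION: fixed illustrative values standing in for learned parameters.
W = np.array([[0.2, 0.0, 0.1],
              [0.0, 0.3, 0.1],
              [0.1, 0.0, 0.25]])

def rotation(p):
    """R(p) = exp(x A_x + y A_y + z A_z), using linearity of hat."""
    return rodrigues(W @ p)

def relative(p_i, p_j):
    """R(p_i)^T R(p_j): the matrix between q_i and k_j in the attention score."""
    return rotation(p_i).T @ rotation(p_j)

# Collinear positions give commuting generators, so only the offset matters.
d = np.array([1.0, 2.0, -1.0])
collinear_exact = np.allclose(relative(1.0 * d, 3.0 * d), relative(4.0 * d, 6.0 * d))

# Non-collinear positions: shifting both tokens by e_y changes the relative
# rotation slightly, because the generators no longer commute.
origin, e_x, e_y = np.zeros(3), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
shift_gap = np.abs(relative(origin, e_x) - relative(e_y, e_x + e_y)).max()
```

With these stand-in values, `collinear_exact` is true, while `shift_gap` is small but nonzero; its size is governed by the commutator of the generators involved.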

4. Integration into Transformer Architectures

Within a multi-head self-attention module, LieRE applies the encoding as follows:

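  • For $N$ tokens with 3D positions $\{p_i\}_{i=1}^{N}$,
    • Compute $R(p_i)$ for each source token,
    • Compute $R(p_j)$ for each target token.
  • Rotated queries and keys are used in the attention computation:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^\top}{\sqrt{d}}\right) V$$

where the rows of $\tilde{Q}$ and $\tilde{K}$ are the rotated queries $R(p_i)\,q_i$ and keys $R(p_j)\,k_j$, and $d$ is the head dimension.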

  • GPU implementation uses vectorized, closed-form Rodrigues computation for all token positions in batch. For a batch of 64 tokens (12-layer ViT-B, 12 heads), memory requirements are approximately 40 GB for a single forward/backward pass (Ostmeier et al., 2024).
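A vectorized version of these steps might look like the sketch below. It is written in NumPy for readability rather than for GPU execution, and it assumes a single head whose dimension is exactly 3 (one rotation block); how a full model tiles rotations across a larger head dimension is not specified here. The names `batched_rodrigues`, `liere_attention`, and `W` are hypothetical.

```python
import numpy as np

def batched_rodrigues(omega, eps=1e-8):
    """Closed-form exp(hat(omega)) for an (N, 3) array of rotation vectors."""
    theta = np.linalg.norm(omega, axis=-1)[:, None, None]   # (N, 1, 1)
    K = np.zeros(omega.shape[:-1] + (3, 3))
    K[:, 0, 1], K[:, 0, 2] = -omega[:, 2], omega[:, 1]
    K[:, 1, 0], K[:, 1, 2] = omega[:, 2], -omega[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -omega[:, 1], omega[:, 0]
    safe = np.maximum(theta, eps)
    # Fall back to the Taylor limits (1 and 1/2) when the angle is tiny.
    a = np.where(theta < eps, 1.0, np.sin(safe) / safe)
    b = np.where(theta < eps, 0.5, (1.0 - np.cos(safe)) / safe**2)
    return np.eye(3) + a * K + b * (K @ K)

def liere_attention(Q, K_, V, positions, W):
    """Single-head attention with per-token rotations applied to Q and K.

    Q, K_, V: (N, 3) arrays; positions: (N, 3); W: (3, 3) generator coefficients.
    """
    R = batched_rodrigues(positions @ W.T)           # (N, 3, 3), row n uses W @ p_n
    Qr = np.einsum("nij,nj->ni", R, Q)               # rotated queries
    Kr = np.einsum("nij,nj->ni", R, K_)              # rotated keys
    logits = Qr @ Kr.T / np.sqrt(Q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
N = 5
positions = rng.integers(0, 4, size=(N, 3)).astype(float)
Q, K_, V = (rng.normal(size=(N, 3)) for _ in range(3))
W = 0.1 * rng.normal(size=(3, 3))                    # stand-in for learned parameters
out, weights = liere_attention(Q, K_, V, positions, W)
```

Setting `W` to zero makes every rotation the identity, so the function reduces to ordinary scaled dot-product attention, which is a convenient sanity check.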

5. Empirical Evaluation in 3D Vision and Temporal Tasks

LieRE was empirically evaluated on volumetric and spatiotemporal benchmarks:

  • Datasets: UCF101 (3D video, 101 classes), RSNA (3D CT, binary brain hemorrhage detection).
  • Baseline Comparisons: Absolute Position Embedding (APE), RoPE-Mixed (separate RoPE on spatial and temporal axes), and LieRE (full SO(3)).
  • Results—Classification Accuracy:

| Method             | UCF101 (%) | RSNA (%) |
|--------------------|:----------:|:--------:|
| Absolute Pos. Emb. | 44.4       | 80.7     |
| RoPE-Mixed         | 48.6       | 81.9     |
| LieRE (SO(3))      | 51.1       | 82.7     |

LieRE delivered improvements of +6.7 percentage points over APE and +2.5 over RoPE-Mixed on UCF101; on RSNA, +2.0 and +0.8 points, respectively (Ostmeier et al., 2024).

  • Robustness and Inductive Bias: Under random patch shuffling at inference, LieRE suffered a 36.9 percentage point drop in accuracy, compared to 24.7 for RoPE-Mixed and 0.1 for APE, indicating stronger reliance on precise 3D relative cues.
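The quoted percentage-point gains follow directly from the accuracy figures reported above. A small sketch that recomputes them, with the dictionary keys and the helper `gain` being illustrative names:

```python
# Accuracies (%) copied from the reported results.
accuracy = {
    "APE": {"UCF101": 44.4, "RSNA": 80.7},
    "RoPE-Mixed": {"UCF101": 48.6, "RSNA": 81.9},
    "LieRE": {"UCF101": 51.1, "RSNA": 82.7},
}

def gain(dataset, baseline, method="LieRE"):
    """Percentage-point improvement of `method` over `baseline`."""
    return round(accuracy[method][dataset] - accuracy[baseline][dataset], 1)

gains = {
    (dataset, baseline): gain(dataset, baseline)
    for dataset in ("UCF101", "RSNA")
    for baseline in ("APE", "RoPE-Mixed")
}
```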

6. Implementation Considerations and Computational Efficiency

  • Closed-Form Exponentials: All matrix exponentials are explicit via Rodrigues’ formula, eliminating the need for general-purpose matrix-exponential routines.
  • Batching: All position-to-generator mappings and exponentials are vectorized per batch, adding negligible compute overhead within standard Vision Transformer (ViT) memory budgets.
  • Hyperparameters: Cosine learning rate decay from 1e-4, Adam optimizer (…), patch size 4×16×16 on 32×224×224 video inputs, dropout 0.1, and no positional-encoding–specific tuning.
  • Scalability: Accuracy on UCF101 increases monotonically with generator matrix block size (from … to …), with even … blocks capturing most of the benefit.
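The patch size and input size listed above determine the 3D token grid that the positional encoding operates on. The sketch below derives it with NumPy, assuming non-overlapping patches and integer patch indices as coordinates; whether coordinates are rescaled or normalized before encoding is not stated, and `patch_grid` is a hypothetical helper.

```python
import numpy as np

def patch_grid(input_shape, patch_shape):
    """Integer (t, h, w) coordinates for each non-overlapping patch."""
    counts = [s // p for s, p in zip(input_shape, patch_shape)]
    axes = [np.arange(c) for c in counts]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (T, H, W, 3)
    return grid.reshape(-1, 3), tuple(counts)

# 32x224x224 video with 4x16x16 patches gives an 8x14x14 grid of tokens.
positions, counts = patch_grid((32, 224, 224), (4, 16, 16))
num_tokens = positions.shape[0]   # 8 * 14 * 14 = 1568
```

Each row of `positions` is the 3D coordinate that would be fed to the position-to-generator mapping for the corresponding token.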

7. Significance, Impact, and Potential Extensions

LieRE establishes a unifying, group-theoretic framework for 3D positional encoding in attention-based architectures, offering several critical capabilities:

  • Relative Position Dependence: Attention depends on token positions through the offset $p_j - p_i$ in $\mathbb{R}^3$ (exactly when the generators involved commute, approximately otherwise), enforcing homogeneous locality priors in 3D grids.
  • Increased Geometric Expressivity: The noncommutative nature of SO(3) permits modeling of path-dependent transformations and more faithful spatial cues.
  • Empirical Gains: Outperforms both absolute and commutative block-diagonal (2D) positional encodings in 3D classification accuracy.
  • Practicality: Computationally efficient, minimal impact on standard ViT memory and runtime budgets, and readily implementable via closed-form operations.
  • Scalability Knob: Generator matrix block size offers a tunable trade-off between expressivity and compute.

A plausible implication is that LieRE could replace less expressive positional encoding schemes in video, volumetric 3D, or multi-modal transformer architectures, yielding higher fidelity in geometric structure modeling and improved task generalization (Ostmeier et al., 2024).
