
Lorentz-Equivariant Geometric Algebra Transformer

Updated 3 December 2025
  • L-GATr is a neural architecture that combines 4D Minkowski geometric algebra with Lorentz-equivariant operations, ensuring physical symmetry in high-energy physics tasks.
  • It integrates a scalable Transformer with multivector attention and locally canonicalized frames to support regression, classification, and generative modeling.
  • Experimental results demonstrate state-of-the-art performance with robust generalization and efficient handling of high token counts in challenging particle collision data.

The Lorentz-Equivariant Geometric Algebra Transformer (L-GATr) is a neural architecture that combines geometric algebra representations of four-dimensional Minkowski space with exactly Lorentz-equivariant operations, embedded within a scalable Transformer backbone. It is designed for learning tasks in high-energy physics, notably regression, classification, and generative modeling of particle collision data, where Lorentz symmetry is fundamental. L-GATr encodes each particle’s state as a multivector in the 16-dimensional Clifford algebra $\mathrm{Cl}(1,3)$ and defines all neural operations (linear layers, attention, and normalization) so as to remain equivariant under $\mathrm{SO}^+(1,3)$ transformations, ensuring that predictions properly reflect the underlying physical symmetries of relativistic kinematics (Spinner et al., 23 May 2024, Brehmer et al., 1 Nov 2024, Qureshi et al., 23 Feb 2025, Favaro et al., 20 Aug 2025).

1. Geometric Algebra Foundations and Lorentz Equivariance

L-GATr is built on the Clifford algebra $\mathrm{Cl}(1,3)$, where any element $x$ may be written as $x = \sum_{k=0}^{4} \langle x \rangle_k$, with grades corresponding to scalars ($k=0$), four-vectors ($k=1$), bivectors ($k=2$), trivectors ($k=3$), and the pseudoscalar ($k=4$). The underlying metric is $\eta = \operatorname{diag}(+1,-1,-1,-1)$, defining Minkowski space-time. For vectors, the geometric product $uv = u \cdot v + u \wedge v$ supplies the basic operations: $u \cdot v$ is the Lorentz inner product and $u \wedge v$ the exterior (bivector) part.
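
The following minimal sketch (PyTorch, with an assumed $(t,x,y,z)$ component ordering) illustrates these two ingredients for four-vectors: the Minkowski inner product and the antisymmetric wedge part of the geometric product.

```python
import torch

# Minkowski metric eta = diag(+1, -1, -1, -1); the (t, x, y, z) basis order is an assumption.
ETA = torch.diag(torch.tensor([1.0, -1.0, -1.0, -1.0]))

def minkowski_inner(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Lorentz inner product u . v = u^mu eta_{mu nu} v^nu (the grade-0 part of uv for vectors)."""
    return torch.einsum("...m,mn,...n->...", u, ETA, v)

def wedge(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Exterior product u ^ v: the antisymmetric (bivector) part of the geometric product uv."""
    outer = torch.einsum("...m,...n->...mn", u, v)
    return 0.5 * (outer - outer.transpose(-1, -2))

# For two four-vectors, uv decomposes as u.v + u^v (a scalar plus a bivector).
p = torch.tensor([10.0, 1.0, 2.0, 3.0])   # (E, px, py, pz)
q = torch.tensor([5.0, 0.5, -1.0, 2.0])
print(minkowski_inner(p, p))               # invariant mass squared, E^2 - |p|^2
print(wedge(p, q).shape)                   # bivector components as an antisymmetric 4x4 array
```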

Lorentz transformations $\Lambda \in \mathrm{SO}^+(1,3)$ act naturally on grade-1 vectors as $p^\mu \rightarrow \Lambda^\mu{}_\nu p^\nu$, extended to the full algebra as a homomorphism, so that $\Lambda \cdot (uv) = (\Lambda \cdot u)(\Lambda \cdot v)$. For neural network layers $f$, the strong equivariance constraint $f(\Lambda \cdot x) = \Lambda \cdot f(x)$ holds at every stage. This is enforced mathematically by only permitting “grade-preserving” linear maps,
$$\varphi(x) = \sum_{k=0}^{4} v_k \langle x \rangle_k + \sum_{k=0}^{4} w_k\, e_{0123} \langle x \rangle_k,$$
where $v_k, w_k \in \mathbb{R}$ are learnable parameters and $e_{0123}$ denotes the unique grade-4 pseudoscalar (Spinner et al., 23 May 2024, Brehmer et al., 1 Nov 2024, Qureshi et al., 23 Feb 2025).
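
As a sketch of how the grade-wise constraint can look in code, the hypothetical PyTorch layer below learns one weight per grade and per channel pair and implements only the first sum of the map above; the $e_{0123}$ term would require the full $\mathrm{Cl}(1,3)$ product table and is omitted. The 16-component basis ordering (1 scalar, 4 vectors, 6 bivectors, 4 trivectors, 1 pseudoscalar) is an assumption.

```python
import torch
import torch.nn as nn

# Assumed basis ordering for Cl(1,3): 1 scalar, 4 vectors, 6 bivectors, 4 trivectors, 1 pseudoscalar.
GRADES = torch.tensor([0] + [1] * 4 + [2] * 6 + [3] * 4 + [4])   # grade of each of the 16 components

class GradeLinear(nn.Module):
    """Sketch of an equivariant linear layer: channels mix with one learnable weight per grade,
    phi(x) = sum_k v_k <x>_k. (The additional e_0123 <x>_k term from the text would need the
    Cl(1,3) geometric-product table and is left out of this sketch.)"""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # one weight per (output channel, input channel, grade)
        self.weight = nn.Parameter(0.1 * torch.randn(out_channels, in_channels, 5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_channels, 16); broadcast the per-grade weights to all 16 components
        per_component = self.weight[..., GRADES]                  # (out, in, 16)
        return torch.einsum("...ic,oic->...oc", x, per_component)

x = torch.randn(8, 32, 16)        # 8 tokens, 32 multivector channels
layer = GradeLinear(32, 32)
print(layer(x).shape)             # torch.Size([8, 32, 16])
```

Because each grade is an invariant subspace of the Lorentz action, grade-wise rescaling and channel mixing commute with any $\Lambda \in \mathrm{SO}^+(1,3)$.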

Nonlinearities are restricted to functions that commute with Lorentz transformations, including the geometric product $\mathrm{GP}(x, y) = xy$, scalar-gated GELU activations $\operatorname{GELU}(\langle x \rangle_0)\, x$, and a grade-wise LayerNorm,
$$\operatorname{LayerNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{n} \sum_{c=1}^{n} \sum_{k=0}^{4} \left|\langle \langle x_c \rangle_k, \langle x_c \rangle_k \rangle\right| + \epsilon}}.$$
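
Both nonlinearities can be sketched as below, under the same assumed basis ordering; the per-blade signs of the invariant inner product depend on conventions and are listed explicitly as an assumption.

```python
import torch
import torch.nn.functional as F

# Assumed basis order: [1, e0..e3, e01,e02,e03,e12,e13,e23, e012,e013,e023,e123, e0123].
GRADES = torch.tensor([0] + [1] * 4 + [2] * 6 + [3] * 4 + [4])
ONE_HOT = F.one_hot(GRADES, num_classes=5).float()               # (16, 5) grade projector
# Signs of <e_I, e_I> for each basis blade under eta = diag(+,-,-,-) (convention-dependent).
BLADE_SIGN = torch.tensor([1., 1., -1., -1., -1.,
                           -1., -1., -1., 1., 1., 1.,
                           1., 1., 1., -1., -1.])

def scalar_gated_gelu(x: torch.Tensor) -> torch.Tensor:
    """GELU(<x>_0) * x: gate the whole multivector by its Lorentz-invariant scalar grade."""
    return F.gelu(x[..., :1]) * x                                # x: (..., channels, 16)

def grade_layernorm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Divide by sqrt of the channel-averaged sum over grades of |<<x_c>_k, <x_c>_k>|."""
    grade_ip = (BLADE_SIGN * x * x) @ ONE_HOT                    # (..., channels, 5) per-grade products
    norm_sq = grade_ip.abs().sum(-1).mean(-1, keepdim=True)      # (1/n) sum_c sum_k |...|
    return x / torch.sqrt(norm_sq + eps).unsqueeze(-1)

x = torch.randn(8, 32, 16)                                       # 8 tokens, 32 multivector channels
print(scalar_gated_gelu(x).shape, grade_layernorm(x).shape)
```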

2. Transformer Architecture and Multivector Attention

L-GATr integrates geometric algebra with a multi-head, self-attention Transformer. Each token represents a particle and includes $n$ channels of $\mathrm{Cl}(1,3)$ multivectors (e.g., $n=32$), plus $m$ scalar channels (e.g., for type, time, or other metadata). Four-momenta are encoded into the grade-1 part of one multivector channel; categorical tags are placed in the scalar channels.

Attention is carried out by forming queries, keys, and values as multivectors. The attention weights use the Lorentz-invariant inner product,

$$\text{score}_{ij} = \frac{\langle q_i, k_j \rangle}{\sqrt{16\, n_c}},$$

which ensures invariance under the group action. The architecture supports stacking of $B$ blocks:

  1. LayerNorm
  2. AttentionBlock: $\operatorname{Linear}(\operatorname{Attention}(\operatorname{Linear}(\bar x), \ldots)) + x$
  3. MLPBlock: includes geometric product and scalar-gated GELU
  4. Residual connection and equivariant linear mixing

All linear transformations adhere to the grade-wise constraints. Attention cost scales as $\mathcal{O}(N^2)$ in the number of tokens, and because the scores reduce to standard dot products the layers remain compatible with efficient backends such as FlashAttention (Spinner et al., 23 May 2024, Brehmer et al., 1 Nov 2024).
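
A single-head version of the invariant attention score can be sketched as follows (PyTorch; the blade signs encode the assumed invariant inner product, and the multi-head projections and mixing layers are omitted).

```python
import math
import torch

# Assumed blade signs of the invariant inner product on Cl(1,3) (same basis order as above).
BLADE_SIGN = torch.tensor([1., 1., -1., -1., -1.,
                           -1., -1., -1., 1., 1., 1.,
                           1., 1., 1., -1., -1.])

def multivector_attention(q, k, v):
    """Sketch of single-head multivector attention.
    q, k, v: (tokens, channels, 16). Scores use the Lorentz-invariant inner product, so the
    softmax weights are invariant and the output (a weighted sum of values) stays equivariant."""
    n_tokens, n_c, _ = q.shape
    scores = torch.einsum("icx,jcx,x->ij", q, k, BLADE_SIGN) / math.sqrt(16 * n_c)
    attn = torch.softmax(scores, dim=-1)               # (tokens, tokens), Lorentz-invariant
    return torch.einsum("ij,jcx->icx", attn, v)        # equivariant mixture of value multivectors

q = k = v = torch.randn(50, 32, 16)                    # 50 particles, 32 multivector channels
print(multivector_attention(q, k, v).shape)            # torch.Size([50, 32, 16])
```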

3. Lorentz Local Canonicalization and Generalization

The Lorentz Local Canonicalization (LLoCa) framework extends L-GATr by equipping each token or node with a learnable local Lorentz frame $L_i \in \mathrm{SO}^+(1,3)$, predicted by a small network (“Frames-Net”). Feature tensors $T_i^{\mu_1\dots\mu_k}$ are canonically mapped to local frames for exact equivariance, and messages are transported as

$$m_{j,L_i}^{a_1\dots a_k} = \left[\rho_m(R_{ij})\, m_{j,L_j}\right]^{a_1\dots a_k}, \qquad R_{ij} = L_i L_j^{-1},$$

ensuring that both local computations and aggregate updates are Lorentz-invariant. This permits mixing arbitrary tensor ranks within attention and feed-forward layers, and enables "drop-in" equivariant upgrades for any Transformer or GNN backbone with minimal computational (<20%) and parameter (<1%) overhead. Mixed representations, which split the total feature dimension into scalar, vector, and tensor channels, enable an optimal speed/performance tradeoff (Favaro et al., 20 Aug 2025).
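
A minimal sketch of the frame transport for rank-1 (four-vector) features follows, assuming local frames are given as $4\times 4$ Lorentz matrices; `boost_z` is only a toy stand-in for the Frames-Net prediction.

```python
import torch

# Minkowski metric; a Lorentz matrix L satisfies L^T eta L = eta, so L^{-1} = eta L^T eta.
ETA = torch.diag(torch.tensor([1.0, -1.0, -1.0, -1.0]))

def lorentz_inverse(L: torch.Tensor) -> torch.Tensor:
    """Inverse of a Lorentz transformation via the metric, avoiding a numerical matrix inverse."""
    return ETA @ L.transpose(-1, -2) @ ETA

def transport_vectors(L_i: torch.Tensor, L_j: torch.Tensor, m_j: torch.Tensor) -> torch.Tensor:
    """Re-express rank-1 (four-vector) features of node j in the local frame of node i:
    m_{j,L_i} = R_ij m_{j,L_j} with R_ij = L_i L_j^{-1}. Higher ranks apply rho_m(R_ij),
    i.e. one factor of R_ij per tensor index."""
    R_ij = L_i @ lorentz_inverse(L_j)                  # relative frame transformation
    return torch.einsum("mn,cn->cm", R_ij, m_j)        # apply to every feature channel

def boost_z(rapidity: float) -> torch.Tensor:
    """Toy local frame: a pure boost along z (illustration only, not the learned Frames-Net)."""
    y = torch.tensor(rapidity)
    B = torch.eye(4)
    B[0, 0] = B[3, 3] = torch.cosh(y)
    B[0, 3] = B[3, 0] = torch.sinh(y)
    return B

L_i, L_j = boost_z(0.3), boost_z(-0.5)
m_j = torch.randn(8, 4)                                # 8 four-vector feature channels at node j
print(transport_vectors(L_i, L_j, m_j).shape)          # torch.Size([8, 4])
```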

4. Task-Specific Adaptations: Regression, Classification, and Generation

L-GATr supports a unified pipeline for high-energy physics tasks:

  • Regression (e.g., matrix element surrogates for $q \bar q \to Z + n g$): Input tokens are one per incoming/outgoing particle, plus a global summary token. The model predicts standardized log-amplitudes, with the scalar grade of the global token as the output, minimizing an MSE loss (see the embedding sketch after this list).
  • Classification (e.g., top tagging): Events are tokenized as unordered point clouds, with type tags and optionally additional reference vectors that break the symmetry to subgroups such as $\mathrm{SO}(3)$. The output head reads out a scalar, trained with a sigmoid/BCE loss.
  • Generative modeling: L-GATr parameterizes continuous normalizing flows via Riemannian flow matching. Latents are mapped via ODE integration over physically valid manifolds (e.g., $(y_m, y_p, \eta, \phi)$ for each particle) and projected back using the network’s computed vector fields, respecting geometric constraints. Evaluations use negative log-likelihood and classifier two-sample tests, demonstrating robust phase-space coverage (Spinner et al., 23 May 2024, Brehmer et al., 1 Nov 2024, Favaro et al., 20 Aug 2025).
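
The sketch below illustrates the regression tokenization and readout described above; the function names and the choice of placing the four-momentum in channel 0 are hypothetical, and the L-GATr blocks that would act between embedding and readout are omitted.

```python
import torch

def embed_event(momenta: torch.Tensor, n_mv: int = 32) -> torch.Tensor:
    """Hypothetical tokenization: one token per particle plus a global summary token.
    momenta: (n_particles, 4) four-momenta -> tokens: (n_particles + 1, n_mv, 16)."""
    n = momenta.shape[0]
    tokens = torch.zeros(n + 1, n_mv, 16)
    tokens[:n, 0, 1:5] = momenta          # grade-1 (vector) slot of multivector channel 0
    tokens[n, 0, 0] = 1.0                 # flag the global summary token in the scalar grade
    return tokens

def read_out(tokens: torch.Tensor) -> torch.Tensor:
    """Lorentz-invariant prediction: the scalar grade of the global token (last token, channel 0)."""
    return tokens[-1, 0, 0]

event = torch.randn(6, 4)                 # e.g. 2 incoming + 4 outgoing particles
x = embed_event(event)
print(x.shape, read_out(x))               # the L-GATr blocks would transform x before the readout
```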

5. Symmetry Breaking and Flexibility

While L-GATr is maximally equivariant by construction, physical tasks at the LHC may require breaking the symmetry to a subgroup (e.g., by supplying beam-direction or time-like reference vectors as special tokens or channels). This reduces the equivariance to the residual symmetry appropriate to the problem. Empirically, providing reference objects as part of the architecture, rather than through the data, offers superior performance and flexibility. Symmetry breaking performed at the input level (such as explicit beam or time features within LLoCa) likewise allows the network to operate effectively in realistic detector environments (Brehmer et al., 1 Nov 2024, Favaro et al., 20 Aug 2025).
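
As a sketch of input-level symmetry breaking, one can append reference four-vectors as additional tokens; the specific reference choices and the helper below are illustrative assumptions, not the papers' exact interface.

```python
import torch

# Hypothetical symmetry-breaking inputs: reference four-vectors fed to the network as extra
# multivector tokens, so equivariance is reduced to the subgroup that leaves them fixed.
TIME_AXIS = torch.tensor([1.0, 0.0, 0.0, 0.0])   # breaks boosts; spatial rotations SO(3) remain
BEAM_AXIS = torch.tensor([0.0, 0.0, 0.0, 1.0])   # further breaks rotations away from the beam axis

def append_reference_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (n_tokens, n_mv, 16); append one token per reference vector (grade-1 slot, channel 0)."""
    n_mv = tokens.shape[1]
    refs = torch.zeros(2, n_mv, 16)
    refs[0, 0, 1:5] = TIME_AXIS
    refs[1, 0, 1:5] = BEAM_AXIS
    return torch.cat([tokens, refs], dim=0)

x = torch.zeros(50, 32, 16)                       # 50 particle tokens, 32 multivector channels
print(append_reference_tokens(x).shape)           # torch.Size([52, 32, 16])
```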

6. Experimental Benchmarking and Performance

Across regression, classification, and generative benchmarks, L-GATr consistently matches or outperforms strong baselines:

  • Regression: Achieves the lowest MSE on high-multiplicity QFT surrogate tasks ($n=4,5$ gluons), with performance robust to dataset size (Spinner et al., 23 May 2024, Brehmer et al., 1 Nov 2024, Favaro et al., 20 Aug 2025).
  • Classification: Matches or slightly trails top Lorentz-equivariant GNN architectures (LorentzNet, CGENN, PELICAN) in top tagging ($\mathrm{AUC}=0.9870$, background rejection $1/\varepsilon_B \approx 2200$ at $\varepsilon_S=0.3$), and achieves the best performance ($\mathrm{AUC}=0.9885$) on the 10-way JetClass task when pretrained (Brehmer et al., 1 Nov 2024, Favaro et al., 20 Aug 2025).
  • Generative Modeling: L-GATr-parameterized flows yield per-marginal distributions that best match the ground truth, especially in the distribution tails, with the lowest negative log-likelihood and classifier AUCs near $0.5$ (chance-level distinguishability) (Spinner et al., 23 May 2024, Brehmer et al., 1 Nov 2024, Favaro et al., 20 Aug 2025).
  • Computational Scaling: L-GATr incurs additional runtime overhead relative to vanilla Transformers at small particle multiplicities because of the grade-preserving operations, but it scales to $\mathcal{O}(10^3)$ tokens, outscaling equivariant graph-based architectures (Spinner et al., 23 May 2024).

| Task | L-GATr Result | Baseline Comparison |
| --- | --- | --- |
| QFT Regression | Lowest MSE for $n \geq 3$ | Beats CGENN/Transformer |
| Top Tagging | AUC 0.9870, $1/\varepsilon_B \approx 2200$ | Matches SOTA equivariant GNNs |
| JetClass Multiway | AUC 0.9885 | Surpasses ParticleNet, ParT |
| Event Generation | NLL $\approx -32.8$ | Superior to MLP, Transformer |
| Scaling | $N \gtrsim 1000$ tokens | Outscales CGENN, ParT |

Ablation studies demonstrate that removing geometric-algebra channels or enforcing symmetry at insufficient granularity significantly degrades performance and data efficiency.

7. Extensions and Theoretical Insights

Recent work integrates L-GATr with LLoCa, enabling per-particle frame prediction and exact equivariant transport at arbitrary tensorial rank, resulting in architectures that combine the flexibility and speed of standard transformers with guaranteed Lorentz symmetry. In this regime, L-GATr, when enhanced with local frames and explicit rotor transport, can achieve an order of magnitude speedup versus specialized architectures while retaining or surpassing SOTA accuracy, and exhibits optimal scaling with both sample size and multiplicity.

A theoretical implication is that building equivariance directly into the architecture obviates the need for networks to "learn" physical symmetries, leading to better generalization and sample efficiency, particularly for high-dimensional, small-sample, or symmetry-bound problems (Favaro et al., 20 Aug 2025).
