
Fast and Simplex: 2-Simplicial Attention in Triton (2507.02754v1)

Published 3 Jul 2025 in cs.LG and cs.AI

Abstract: Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern LLMs increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that $2$-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot product attention.


Summary

  • The paper introduces a Triton-optimized 2-simplicial Transformer that alters scaling laws and improves token efficiency for reasoning tasks.
  • It details a sliding-window approach and kernel optimizations that reduce the cubic complexity of trilinear attention.
  • Experimental results demonstrate enhanced performance on math, coding, and logic benchmarks compared to standard dot-product attention.

Fast and Simplex: 2-Simplicial Attention in Triton

The paper "Fast and Simplex: 2-Simplicial Attention in Triton" (2507.02754) introduces an efficient implementation of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions. By optimizing the implementation using Triton, the authors demonstrate improved token efficiency compared to standard Transformers, particularly on tasks requiring mathematics, coding, reasoning, and logic. The paper quantifies these gains by showing that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks relative to dot product attention.

Background and Motivation

Scaling laws in LLMs suggest that training loss decreases as a power law with both model size and the number of tokens. Compute-optimal models require scaling model size and token count together. However, modern LLMs increasingly rely on massive internet-scale datasets, making the assumption that they are compute-bound less valid. This shift highlights the need for architectures that prioritize token efficiency, and the paper revisits and optimizes the 2-simplicial Transformer [clift2019logic] to address this need. Prior research indicates that most architectural and optimizer improvements do not fundamentally alter the power-law exponent governing scaling, but the authors posit that 2-simplicial attention can offer such a change.

2-Simplicial Attention

The 2-simplicial Transformer extends dot-product attention from bilinear to trilinear forms. In standard attention, given a sequence $X \in \mathbb{R}^{n \times d}$, query $Q$, key $K$, and value $V$ projections are computed, and the attention logits are given by $A = QK^\top / \sqrt{d}$. The 2-simplicial Transformer introduces additional key and value projections, $K'$ and $V'$, and computes attention logits as a trilinear product:

$A_{ijk}^{(\text{2s})} = \frac{\langle q_i, k_j, k'_k \rangle}{\sqrt{d}} = \frac{1}{\sqrt{d}} \sum_{l=1}^{d} Q_{il} K_{jl} K'_{kl}$

The final output is a linear combination of values, weighted by attention scores.
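As a rough reference (not the paper's Triton kernel), the trilinear logits and the value combination can be expressed with a few einsums in PyTorch. The sketch below materializes the full $n \times n \times n$ logit tensor, so it is only usable for tiny sequences, and the elementwise combination of the two value projections is an assumption about how the output is formed.

```python
import torch

def two_simplicial_attention(q, k, k2, v, v2):
    """Naive 2-simplicial attention for a single head (reference only).

    q, k, k2, v, v2: (n, d) tensors for queries, the two key projections,
    and the two value projections. Materializes the full (n, n, n) logit
    tensor, so this is illustrative, not efficient.
    """
    n, d = q.shape
    # Trilinear logits: A[i, j, c] = sum_l Q[i, l] K[j, l] K'[c, l] / sqrt(d)
    logits = torch.einsum("il,jl,cl->ijc", q, k, k2) / d**0.5
    # Softmax jointly over both key axes
    weights = torch.softmax(logits.reshape(n, -1), dim=-1).reshape(n, n, n)
    # Output: attention-weighted elementwise product of the two value sets
    return torch.einsum("ijc,jl,cl->il", weights, v, v2)

# Tiny sanity check
q, k, k2, v, v2 = (torch.randn(8, 16) for _ in range(5))
print(two_simplicial_attention(q, k, k2, v, v2).shape)  # torch.Size([8, 16])
```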

Figure 1: Geometry of dot-product attention and 2-simplicial attention.

Rotation Invariant Trilinear Forms

The paper explores generalizations of RoPE to trilinear functions, presenting a rotation-invariant trilinear form. Standard RoPE captures positional information in a sequence by applying position-dependent rotations to queries and keys, such that the dot product is a function of the relative distance $i - j$. The authors note that the basic trilinear form is not invariant to rotation, and propose an alternative using determinants:

$\hat{f}_3(a,b,c) = \det \begin{pmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{pmatrix}$

The attention score is then computed as a sum of determinants:

$A_{i j_1 j_2}^{(\text{det})} = \sum_{l=1}^{p} \det\left(\left[\, q_i^{(l)},\; k_{j_1}^{(l)},\; k'^{(l)}_{j_2} \,\right]\right).$

This determinant-based approach involves six terms, requiring two einsums in the implementation, compared to one einsum for the simpler trilinear form.
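For intuition, the determinant-based score can be sketched by splitting the head dimension into $p = d/3$ chunks of size 3 and summing 3x3 determinants. The chunking convention and tensor names below are assumptions for illustration; the paper's actual implementation instead expands the determinant into the six signed terms computed with two einsums.

```python
import torch

def det_trilinear_scores(q, k1, k2):
    """Rotation-invariant trilinear logits via summed 3x3 determinants.

    q, k1, k2: (n, d) tensors with d divisible by 3. The head dimension is
    split into p = d // 3 chunks of size 3; the score for (i, j1, j2) sums
    det([q_i^(l); k1_{j1}^(l); k2_{j2}^(l)]) over chunks l.
    """
    n, d = q.shape
    p = d // 3
    qc = q.reshape(n, p, 3)
    k1c = k1.reshape(n, p, 3)
    k2c = k2.reshape(n, p, 3)
    # Broadcast the chunked vectors to (n, n, n, p, 3), stack them as rows
    # of 3x3 matrices, take determinants, and sum over chunks.
    a = qc[:, None, None].expand(n, n, n, p, 3)
    b = k1c[None, :, None].expand(n, n, n, p, 3)
    c = k2c[None, None, :].expand(n, n, n, p, 3)
    mats = torch.stack((a, b, c), dim=-2)          # (n, n, n, p, 3, 3)
    return torch.linalg.det(mats).sum(dim=-1)      # (n, n, n)

scores = det_trilinear_scores(*(torch.randn(6, 12) for _ in range(3)))
print(scores.shape)  # torch.Size([6, 6, 6])
```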

Model Design and Kernel Optimization

Due to the $O(n^3)$ scaling of 2-simplicial attention, the authors employ a sliding-window approach, parameterizing the computation as $O(n \times w_1 \times w_2)$, where $w_1$ and $w_2$ define the dimensions of the sliding window. A window size of (512, 32) is chosen to balance latency and quality, making the computational complexity comparable to dot-product attention at 48k context length. Inspired by Native Sparse Attention, the model architecture leverages a high GQA ratio of 64, enabling efficient tiling along query heads. The authors introduce kernel optimizations for 2-simplicial attention, building on Flash Attention and using online softmax. 2D tiling is performed by merging one of the inputs via elementwise multiplication and executing a matmul on the product. This allows overlapping the elementwise products $QK$ and $VV'$ on CUDA cores with the matmuls $(QK)@K'$ and $P@(VV')$ on Tensor Cores, achieving 520 TFLOPS in Triton (Figure 2).

Figure 2: Left: Visualization of sliding-window 2-simplicial attention. Each $Q_i$ attends to a $[w_1, w_2]$-shaped rectangle of $K$, $K'$. Right: Tiling to reduce the 2-simplicial einsum $QKK'$ to an elementwise multiplication $QK'$ on CUDA cores and a tiled matmul $(QK')@K$ on Tensor Cores.
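A minimal sketch of the sliding-window variant (looping over queries for clarity, far from the fused Triton kernel) shows where the $O(n \times w_1 \times w_2)$ cost comes from; the causal windowing convention below is an assumption.

```python
import torch

def sliding_window_2simplicial(q, k, k2, v, v2, w1=512, w2=32):
    """Sliding-window 2-simplicial attention, one query at a time.

    Query i attends to the last w1 positions of K and the last w2 positions
    of K', so the total cost is O(n * w1 * w2) rather than O(n^3). Purely
    illustrative: the real kernel tiles this computation on the GPU.
    """
    n, d = q.shape
    out = torch.empty_like(q)
    for i in range(n):
        j0 = max(0, i - w1 + 1)     # window over K / V
        c0 = max(0, i - w2 + 1)     # window over K' / V'
        # (w1, w2) trilinear logits for this query
        logits = torch.einsum("l,jl,cl->jc", q[i], k[j0:i + 1], k2[c0:i + 1]) / d**0.5
        w = torch.softmax(logits.reshape(-1), dim=-1).reshape(logits.shape)
        out[i] = torch.einsum("jc,jl,cl->l", w, v[j0:i + 1], v2[c0:i + 1])
    return out

out = sliding_window_2simplicial(*(torch.randn(64, 16) for _ in range(5)), w1=16, w2=4)
print(out.shape)  # torch.Size([64, 16])
```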

Figure 3: FLOPs and latencies of FAv3 vs. 2-simplicial attention.

For the backward pass, the authors decompose it into two distinct kernels: one for computing $dK$ and $dV$, and another for $dK'$, $dV'$, and $dQ$. This approach reduces overhead from atomic operations.

Experimental Results

The authors trained a series of MoE models, ranging from 1 billion to 3.5 billion active parameters, with interleaved sliding-window 2-simplicial attention. Every fourth layer is a 2-simplicial attention layer to distribute the computational load when using pipeline parallelism. The models were evaluated on GSM8k, MMLU, MMLU-pro, and MBPP benchmarks.

The experiments showed that the reduction in negative log-likelihood relative to the dot-product baseline grows as the active parameter count increases from 1.0 billion to 3.5 billion, while models smaller than 2.0 billion active parameters did not benefit from 2-simplicial attention. By fitting parametric functions to the loss, the authors estimated the exponents in the scaling laws for both the standard Transformer and the 2-simplicial attention models. The 2-simplicial attention model exhibited a steeper slope $\alpha$ in its scaling law than the dot-product attention Transformer, indicating more favorable scaling under token constraints.
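As a sketch of how such exponents are typically estimated (this is not the paper's exact fitting procedure, and the data points below are placeholder values, not the paper's measurements), one can fit a parametric form $L(N) = E + A N^{-\alpha}$ to observed losses and compare the fitted $\alpha$ across the two architectures.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_billion, E, A, alpha):
    """Parametric loss curve L(N) = E + A * N^(-alpha), N in billions of active params."""
    return E + A * n_billion ** (-alpha)

# Placeholder (active params in billions, loss) pairs for illustration only.
n_active = np.array([0.5, 1.0, 2.0, 3.5])
losses = np.array([2.45, 2.33, 2.23, 2.16])

# Fit the irreducible loss E, coefficient A, and exponent alpha; a steeper
# (larger) alpha means the loss falls faster as the model grows.
(E, A, alpha), _ = curve_fit(scaling_law, n_active, losses, p0=(1.8, 0.6, 0.5))
print(f"E={E:.3f}, A={A:.3f}, alpha={alpha:.3f}")
```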

Conclusions

The paper presents a Triton-optimized implementation of the 2-simplicial Transformer, demonstrating improved token efficiency on reasoning, math, and coding problems. The authors quantify these improvements by showing that 2-simplicial attention changes the exponent in the scaling law, leading to more favorable scaling under token constraints. While the Triton kernel is efficient for prototyping, the authors note that further co-design with hardware accelerators is needed for production use.
