Fast and Simplex: 2-Simplicial Attention in Triton
(2507.02754v1)
Published 3 Jul 2025 in cs.LG and cs.AI
Abstract: Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern LLMs increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that $2$-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot product attention.
Summary
The paper introduces 2-simplicial attention, a trilinear extension of dot-product attention that improves token efficiency for reasoning tasks.
It employs a sliding window mechanism and optimized Triton kernels to reduce cubic complexity to a manageable cost for long sequences.
Empirical results reveal higher scaling exponents and competitive latency, with the clearest gains appearing at larger model scales.
2-Simplicial Attention in Triton: Architecture, Implementation, and Scaling Implications
The paper "Fast and Simplex: 2-Simplicial Attention in Triton" (2507.02754) presents a comprehensive paper of 2-simplicial attention as a generalization of standard dot-product attention, with a focus on practical implementation in Triton and empirical evaluation of scaling laws under token constraints. The work is motivated by the increasing scarcity of high-quality training data for LLMs, which challenges the prevailing assumption that model scaling is primarily compute-bound. The authors argue for architectures that are more token-efficient, and demonstrate that 2-simplicial attention can improve the scaling exponent for knowledge and reasoning tasks.
2-Simplicial Attention: Theoretical and Practical Formulation
2-simplicial attention extends the bilinear dot-product attention mechanism to a trilinear form, allowing each query to attend to pairs of keys and values. Formally, for input sequence $X \in \mathbb{R}^{n \times d}$, the model computes projections $Q, K, V, K', V' \in \mathbb{R}^{n \times d}$ and defines the attention logits as:

$$A^{(2s)}_{ijk} = \frac{1}{\sqrt{d}} \sum_{l=1}^{d} Q_{il} K_{jl} K'_{kl}$$
The attention weights are obtained via a softmax over the $(j, k)$ indices, and the output for each query is a weighted sum over the Hadamard products $V_j \odot V'_k$. This trilinear structure increases the expressivity of the attention mechanism, enabling the model to capture higher-order interactions that are inaccessible to standard dot-product attention.
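To make the formulation concrete, below is a minimal reference sketch of the naive computation in PyTorch (our illustration under the notation above, not the authors' kernel); `Kp` and `Vp` stand in for $K'$ and $V'$, and the $1/\sqrt{d}$ scaling follows the logit definition:

```python
# Naive O(n^3) reference for 2-simplicial attention (single head),
# written for clarity rather than speed; a sketch, not the authors' Triton kernel.
import torch

def two_simplicial_attention(Q, K, Kp, V, Vp):
    """Q, K, Kp, V, Vp: (n, d) tensors; Kp/Vp play the role of K', V'."""
    n, d = Q.shape
    # Trilinear logits A[i, j, k] = (1/sqrt(d)) * sum_l Q[i,l] * K[j,l] * Kp[k,l]
    logits = torch.einsum("il,jl,kl->ijk", Q, K, Kp) / d ** 0.5
    # Softmax jointly over the pair of key indices (j, k)
    weights = torch.softmax(logits.reshape(n, -1), dim=-1).reshape(n, n, n)
    # Output: weighted sum of Hadamard products V[j] * Vp[k]
    return torch.einsum("ijk,jl,kl->il", weights, V, Vp)

# Example usage on random inputs
n, d = 128, 64
Q, K, Kp, V, Vp = (torch.randn(n, d) for _ in range(5))
out = two_simplicial_attention(Q, K, Kp, V, Vp)  # shape (n, d)
```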
Implementation in Triton
The cubic complexity of naive 2-simplicial attention ($O(n^3)$) is prohibitive for long sequences. The authors address this by introducing a sliding window mechanism, restricting each query to attend to a local $w_1 \times w_2$ region of the key and value tensors. This reduces the complexity to $O(n w_1 w_2)$, making the approach tractable for practical sequence lengths.
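The effect of the window restriction can be sketched as follows, assuming a causal-style local window in which query $i$ sees the $w_1$ most recent entries of $K$/$V$ and the $w_2$ most recent entries of $K'$/$V'$ (the paper's exact window layout may differ); the per-query loop makes the $O(n w_1 w_2)$ cost explicit, whereas the authors' Triton kernels tile this computation:

```python
# Sliding-window 2-simplicial attention with cost O(n * w1 * w2).
# Illustrative PyTorch sketch under an assumed causal-style window layout;
# not the authors' kernel.
import torch

def windowed_two_simplicial_attention(Q, K, Kp, V, Vp, w1=512, w2=32):
    n, d = Q.shape
    out = torch.empty_like(Q)
    for i in range(n):
        j0, k0 = max(0, i - w1 + 1), max(0, i - w2 + 1)
        Kw, Vw = K[j0:i + 1], V[j0:i + 1]        # at most w1 rows
        Kpw, Vpw = Kp[k0:i + 1], Vp[k0:i + 1]    # at most w2 rows
        # Local trilinear logits over the (j, k) window for query i
        logits = torch.einsum("l,jl,kl->jk", Q[i], Kw, Kpw) / d ** 0.5
        w = torch.softmax(logits.flatten(), dim=-1).reshape(logits.shape)
        # Weighted sum of Hadamard products restricted to the window
        out[i] = torch.einsum("jk,jl,kl->l", w, Vw, Vpw)
    return out
```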
The implementation leverages Triton for custom kernel development, with several optimizations:
2D Tiling and Fused Operations: The trilinear einsum is decomposed into elementwise multiplications and tiled matrix multiplications, maximizing utilization of CUDA and Tensor cores.
Backward Pass Decomposition: The backward computation is split into two kernels to avoid excessive atomic operations, which are costly in Triton due to its pipeline control granularity.
The provided Triton kernel code demonstrates the forward and backward passes, including online softmax and efficient memory access patterns. The authors report achieving 520 TFLOPS, competitive with state-of-the-art FlashAttention v3 implementations.
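The algebraic trick that makes the trilinear logits tensor-core friendly can be illustrated outside Triton: for a fixed third index $k$, the slice $A_{:,:,k}$ equals $(Q \odot K'_k)K^{\top}$, i.e. an elementwise product followed by an ordinary matmul, which a kernel can tile over blocks of queries and keys. The PyTorch check below is our illustration of that identity, not the paper's kernel:

```python
# Decomposition of the trilinear einsum into elementwise products + matmuls:
# for each fixed k, A[:, :, k] = (Q * Kp[k]) @ K.T.
# This is the structure a tiled Triton kernel exploits; shown here only as an
# algebraic sanity check in PyTorch.
import torch

n, d = 64, 32
Q, K, Kp = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

A_ref = torch.einsum("il,jl,kl->ijk", Q, K, Kp)                 # naive trilinear logits
A_dec = torch.stack([(Q * Kp[k]) @ K.T for k in range(n)], -1)  # decomposed form

assert torch.allclose(A_ref, A_dec, atol=1e-4)
```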
Empirical Results and Scaling Law Analysis
The experimental evaluation focuses on MoE models with up to 3.5B active parameters and 176B total parameters, trained with interleaved 2-simplicial and standard attention layers. The models are evaluated on GSM8K, MMLU, MMLU-Pro, and MBPP, benchmarks that emphasize reasoning, mathematics, and coding.
Key findings include:
Token Efficiency: For a fixed token budget, 2-simplicial models outperform standard Transformers on reasoning-heavy tasks as model size increases. Gains are negligible for models below 2B active parameters, but become significant at larger scales.
Scaling Exponent Improvement: The scaling law exponent $\alpha$ (in $L(N) = E' + A/N^{\alpha}$) is consistently higher for 2-simplicial attention, with improvements of 8–20% depending on the benchmark. This indicates that loss improves more rapidly with model size than under standard attention (see the fitting sketch after this list).
Latency and Throughput: With appropriate window sizes (e.g., $w_1 = 512$, $w_2 = 32$), the latency of 2-simplicial attention is comparable to dot-product attention at long context lengths, supporting its practical deployability (see the cost sketch after this list).
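The exponent comparison uses the standard saturating power-law form; a minimal sketch of such a fit with `scipy.optimize.curve_fit`, run on synthetic placeholder data rather than the paper's measurements, looks like this:

```python
# Fit L(N) = E' + A / N^alpha and report the exponent alpha.
# The (N, L) pairs below are synthetic placeholders generated from known
# parameters, NOT results from the paper; only the fitting procedure is the point.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, E, A, alpha):
    return E + A / N ** alpha

rng = np.random.default_rng(0)
N = np.array([0.5e9, 1e9, 2e9, 3.5e9, 7e9])          # model sizes (active params)
L = scaling_law(N, E=1.6, A=30.0, alpha=0.15) + rng.normal(0.0, 1e-3, N.size)

(E_hat, A_hat, alpha_hat), _ = curve_fit(scaling_law, N, L, p0=[1.5, 10.0, 0.1], maxfev=20000)
print(f"fitted exponent alpha ≈ {alpha_hat:.3f}")
```

The latency claim also has a simple back-of-the-envelope reading: per query, dot-product attention does work proportional to $nd$, while windowed 2-simplicial attention does work proportional to $w_1 w_2 d$, independent of the context length $n$. The constant factors below are our simplifying assumptions (softmax and other overheads ignored), not figures from the paper:

```python
# Rough per-query FLOP estimates (assumed constants, illustrative only).
def dot_product_flops(n, d):
    return 4 * n * d            # ~2nd for QK^T plus ~2nd for PV

def two_simplicial_flops(w1, w2, d):
    return 6 * w1 * w2 * d      # trilinear logits plus weighted Hadamard output

d = 128
for n in (16_384, 65_536, 131_072):
    print(f"n={n:>7}: dot-product ≈ {dot_product_flops(n, d):.2e} FLOPs, "
          f"2-simplicial (w1=512, w2=32) ≈ {two_simplicial_flops(512, 32, d):.2e} FLOPs")
```

At a 512×32 window the trilinear cost per query stays fixed as $n$ grows, which is consistent with the reported latency parity at long context lengths.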
Implications and Future Directions
The results demonstrate that 2-simplicial attention is a viable architectural modification for improving token efficiency and scaling behavior in LLMs, particularly in regimes where high-quality data is limited. The increase in the scaling exponent is a rare and notable property, as most architectural changes only affect the offset, not the exponent, of the scaling law.
Practical implications include:
Pretraining Under Data Constraints: 2-simplicial attention enables more effective use of limited data, potentially reducing the need for ever-larger datasets as models scale.
Reasoning and Logic Tasks: The architecture is particularly beneficial for tasks requiring higher-order reasoning, as supported by both theoretical expressivity results and empirical benchmarks.
Hardware Co-Design: While the Triton implementation is efficient for prototyping, further gains are possible with lower-level optimization (e.g., CUTLASS) and hardware-aware co-design.
Limitations and Open Questions:
Computational Overhead: Despite optimizations, 2-simplicial attention remains more expensive than standard attention, especially for large window sizes or long sequences.
Model Saturation: The benefits are most pronounced for larger models and may not justify the overhead for smaller architectures or tasks with less complex dependencies.
Generalization to Higher-Order Attention: The extension to k-simplicial attention (k>2) is theoretically appealing but likely to be computationally intractable without further algorithmic advances.
Speculation on Future Developments
The demonstrated improvement in scaling exponents suggests that further exploration of higher-order attention mechanisms could yield additional gains, particularly if combined with advances in efficient kernel design and hardware acceleration. The interplay between architectural expressivity and data efficiency is likely to become increasingly important as the field approaches the limits of available training data. Additionally, the integration of 2-simplicial attention with other efficiency-oriented techniques (e.g., retrieval augmentation, curriculum learning) may further enhance its practical impact.
In summary, this work provides both a theoretical and practical foundation for 2-simplicial attention as a means to improve the scaling behavior of LLMs under token constraints, and offers a blueprint for efficient implementation in modern deep learning frameworks.