
Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention (2508.20407v1)

Published 28 Aug 2025 in cs.LG

Abstract: The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its application in long-sequence tasks. To address this challenge, existing linear attention methods typically sacrifice model performance by relying on data-agnostic kernel approximations or restrictive context selection. This paper returns to the first principles of connectionism, starting from the topological structure of information flow, to introduce a novel linear attention architecture, TLinFormer. By reconfiguring neuron connection patterns, TLinFormer achieves strict linear complexity while computing exact attention scores and ensuring information flow remains aware of the full historical context. This design aims to bridge the performance gap prevalent between existing efficient attention methods and standard attention. Through a series of experiments, we systematically evaluate the performance of TLinFormer against a standard Transformer baseline on long-sequence inference tasks. The results demonstrate that TLinFormer exhibits overwhelming advantages in key metrics such as inference latency, KV cache efficiency, memory footprint, and overall speedup.


Summary

  • The paper introduces TLinFormer, a linear attention mechanism that preserves full context awareness while reducing complexity from quadratic to linear.
  • It employs unique neuron connectivity and context path encoding, achieving exact Softmax attention with strictly linear computational cost.
  • Experimental results show competitive perplexity and reduced memory footprint, enhancing efficiency for long-sequence tasks.

"Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention"

Introduction

The Transformer architecture is pivotal in modern AI, especially due to its self-attention mechanism, which dynamically computes context representations. However, the quadratic complexity of self-attention limits its scalability for long-sequence tasks. This paper introduces TLinFormer, a novel linear attention mechanism that maintains exact attention computations and ensures that the complete historical context is accessible.
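
To make the bottleneck concrete, the following is a minimal, self-contained sketch (not taken from the paper) of standard causal self-attention in PyTorch; the explicit n × n score matrix is the source of the quadratic cost that TLinFormer's connectivity is designed to avoid. All names and sizes are illustrative.

```python
# Minimal sketch (not from the paper): the O(n^2) bottleneck of standard
# self-attention comes from the full n x n score matrix below.
import torch

def standard_self_attention(x, w_q, w_k, w_v):
    """x: (n, d) token representations; w_*: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # (n, n): quadratic in sequence length
    # Causal mask: each position attends only to itself and earlier positions.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v  # (n, d)

n, d = 512, 64
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) / d**0.5 for _ in range(3))
print(standard_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([512, 64])
```

Both the time and memory of this computation grow with the square of the sequence length, which is what makes long-context training and inference expensive.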

Current Limitations and Alternatives

Efficient Transformer variants (e.g., Longformer, BigBird) address the self-attention bottleneck with sparse attention patterns or kernel approximations. These approaches often compromise performance because of their restricted attention patterns or data-agnostic kernel functions. TLinFormer tackles these limitations directly, achieving strictly linear complexity without approximating the attention scores.

TLinFormer Architecture

TLinFormer computes exact Softmax attention while maintaining linear computational complexity. This is achieved through a novel connectivity structure built from two information paths:

  • Context Path Encoding: Historical context is compressed using focused attention, followed by layer-specific processing.
  • Generation Path Computation: Utilizes causal self-attention paired with context integration, ensuring that all generation steps are informed by the complete historical context.

    Figure 1: Standard Transformer architecture.

This architecture ensures full context awareness without approximating the attention mechanism, using connectivity strategies that differ from traditional sparse or kernel-based approaches.
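
The paper's exact connectivity is not reproduced here, but the following conceptual sketch illustrates the two-path split described above under an assumed simplification: the historical context is encoded into a fixed number of slots (m, a hypothetical parameter) that the generation path attends over alongside causal self-attention within the current window. It is a rough illustration of the information flow, not TLinFormer's actual wiring.

```python
# Conceptual sketch only: not the paper's exact connectivity.
# Context path: compress the history into m slots.
# Generation path: causal self-attention within the window plus a read of the slots.
import torch

def attend(q, k, v, causal=False):
    """Plain softmax attention; q: (a, d), k/v: (b, d)."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    if causal:
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def two_path_step(history, window, m=32):
    """history: (n_hist, d) past tokens; window: (w, d) tokens being generated."""
    d = history.shape[-1]
    slots = torch.randn(m, d)                                # learned queries in a real model
    context = attend(slots, history, history)                # context path: compress history
    self_out = attend(window, window, window, causal=True)   # generation path: causal self-attention
    cross_out = attend(window, context, context)             # generation path: read compressed context
    return self_out + cross_out                              # combine; a real model mixes these more richly

out = two_path_step(torch.randn(1000, 64), torch.randn(8, 64))
print(out.shape)  # torch.Size([8, 64])
```

In this simplified picture, the generation path attends only over the m context slots and the current window, so its per-step cost no longer grows with the full history length.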

Complexity and Cache Efficiency

TLinFormer maintains linear complexity in both the Cache Miss and Cache Hit scenarios (a minimal sketch of the cache-hit path follows this list):

  1. Cache Miss: Every forward pass is computed from scratch, with cost that remains linear in sequence length.
  2. Cache Hit: Precomputed results are reused, drastically reducing the computation needed to generate each subsequent token.
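
The sketch below illustrates the cache-hit idea generically (it is not TLinFormer's specific cache layout): keys and values already computed for earlier tokens are stored and reused, so each decoding step only projects and attends with one new row rather than recomputing the whole sequence.

```python
# Generic KV-cache illustration (assumed for exposition, not TLinFormer's layout):
# old keys/values are reused, so each step costs O(t) instead of O(t^2).
import torch

def decode_step(x_t, cache_k, cache_v, w_q, w_k, w_v):
    """x_t: (1, d) new token; cache_k/cache_v: (t, d) reused from previous steps."""
    q = x_t @ w_q
    k = torch.cat([cache_k, x_t @ w_k], dim=0)   # cache hit: old keys are not recomputed
    v = torch.cat([cache_v, x_t @ w_v], dim=0)
    scores = q @ k.T / (q.shape[-1] ** 0.5)      # (1, t+1) instead of (t+1, t+1)
    out = torch.softmax(scores, dim=-1) @ v
    return out, k, v                             # return the updated cache

d = 64
w_q, w_k, w_v = (torch.randn(d, d) / d**0.5 for _ in range(3))
cache_k = cache_v = torch.empty(0, d)
for _ in range(5):                               # autoregressive loop: one token per step
    out, cache_k, cache_v = decode_step(torch.randn(1, d), cache_k, cache_v, w_q, w_k, w_v)
print(cache_k.shape)  # torch.Size([5, 64])
```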

Experimental Results

The experiments validate TLinFormer's efficiency, showcasing a significant reduction in memory footprint and inference latency compared to traditional Transformer models (Figure 2).

Figure 3: Perplexity (PPL) of each model over training epochs.

Results demonstrate that TLinFormer achieves competitive perplexity scores while maintaining efficiency. Notably, it supports significantly larger sequence lengths with reduced inference memory, making it suitable for long-context tasks.

Discussion

TLinFormer's "forced compression" approach represents a step towards more intelligent models by compelling a deeper compression and abstraction of information. This architecture could lead to more efficient and higher capacity models that operate effectively on large inputs.

Conclusion

TLinFormer provides an innovative framework that resolves the efficiency and performance trade-offs in existing Transformer models. By adhering to connectionist principles and optimizing information flow, TLinFormer sets a strong precedent for future research into scalable and efficient neural architectures.

This work demonstrates TLinFormer's potential for advancing models' ability to handle long sequences. Future work will involve scaling the architecture further and exploring its applications across diverse AI tasks.
