MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers (2411.12992v1)

Published 20 Nov 2024 in cs.CL

Abstract: In order to reduce the computational complexity of LLMs, great efforts have been made to improve the efficiency of transformer models, such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large amount of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to retrieve a correlated subset of vectors dynamically based on the input embedding. The retrieved vectors combined together will form the output embedding, which provides an estimation of the result of the matrix multiplication operation in a fully-connected layer. Compared to conducting matrix multiplication, retrieving data blocks from memory is a much cheaper operation which requires little computation. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.

Evaluating MemoryFormer: An Innovative Approach for Reducing Transformer Computational Complexity

The research paper "MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers" introduces the MemoryFormer architecture, which aims to address the computational challenges associated with large transformer models by minimizing Floating Point Operations (FLOPs) typically utilized in fully-connected (FC) layers. The proposal is a substantial step forward in optimizing transformer efficiency by replacing computationally intensive FC layers with an innovative memory-centric approach based on locality-sensitive hashing (LSH).

Core Innovations

Reduction in Computational Complexity

The transformer architecture, renowned for its success across domains like natural language processing and computer vision, has traditionally been marred by high computational costs, especially in its FC layers. In standard transformers, while multi-head attention (MHA) incurs a complexity of roughly $2s^2d$ FLOPs (where $s$ is the sequence length and $d$ is the hidden size), the FC layers require about $12sd^2$ FLOPs. MemoryFormer significantly reduces this burden. By removing FC layers and utilizing Memory Layers that replace matrix multiplication with memory retrieval operations, the model achieves dramatically lower FLOPs, effectively addressing inefficiencies inherent in previous transformer designs.
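To make the gap concrete, here is a back-of-the-envelope calculation using the approximate counts quoted above ($2s^2d$ for attention, $12sd^2$ for the FC layers). The specific values of $s$ and $d$ are illustrative choices, not figures from the paper, and the constants ignore implementation details:

```python
# Rough FLOP comparison for one transformer block, using the approximate
# counts 2*s^2*d (multi-head attention) and 12*s*d^2 (fully-connected layers).
# The values of s and d below are illustrative only.
s = 2048  # sequence length
d = 2048  # hidden size

attention_flops = 2 * s**2 * d   # attention scores + value aggregation
fc_flops = 12 * s * d**2         # QKV/output projections + feed-forward network

print(f"attention FLOPs: {attention_flops:.3e}")
print(f"FC-layer FLOPs:  {fc_flops:.3e}")
print(f"FC / attention ratio: {fc_flops / attention_flops:.1f}x")
```

For $s = d$, the FC layers account for roughly six times the attention FLOPs, which is why removing them dominates the overall savings.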

Novel Use of Locality-Sensitive Hashing

A distinctive feature of MemoryFormer is the application of LSH within memory layers. This technique performs feature transformation without traditional matrix multiplications, using hash tables to store and retrieve approximate results of the operations usually carried out by FC layers. Retrieving entries from the hash tables is computationally far cheaper than executing full matrix multiplications. The paper introduces a strategy of splitting each input vector into chunks and assigning each chunk to its own hash table, which balances memory footprint against computational efficiency. The stored vectors are learned through back-propagation, so the approximate projections remain consistent across inputs while adapting during training. A minimal sketch of this idea follows below.
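The sketch below illustrates the general mechanism described above: split the input embedding into chunks, hash each chunk with a simple sign-based LSH, retrieve one stored vector per chunk, and sum the retrievals in place of a matrix multiplication. The hashing scheme, table sizes, and training procedure are simplified assumptions for illustration and may differ from the paper's exact design:

```python
import numpy as np

# Illustrative hashing-based "memory layer" that approximates a linear
# projection with table lookups instead of a matrix multiplication.
d_in, d_out = 64, 64           # embedding sizes (illustrative)
num_chunks = 8                 # how many pieces the input is split into
chunk_dim = d_in // num_chunks
num_buckets = 2 ** chunk_dim   # one bucket per sign pattern of a chunk

rng = np.random.default_rng(0)
# One lookup table per chunk; each row plays the role of a learnable d_out vector.
tables = rng.normal(size=(num_chunks, num_buckets, d_out)) * 0.02

def hash_chunk(chunk: np.ndarray) -> int:
    """Sign-based locality-sensitive hash: each coordinate contributes one bit."""
    bits = (chunk > 0).astype(int)
    return int(bits @ (1 << np.arange(chunk_dim)))

def memory_layer(x: np.ndarray) -> np.ndarray:
    """Approximate a d_in -> d_out projection by summing retrieved vectors."""
    chunks = x.reshape(num_chunks, chunk_dim)
    out = np.zeros(d_out)
    for i, chunk in enumerate(chunks):
        out += tables[i, hash_chunk(chunk)]  # retrieval replaces the matmul
    return out

x = rng.normal(size=d_in)
print(memory_layer(x).shape)  # (64,)
```

Because nearby inputs tend to share sign patterns, they retrieve the same stored vectors, which is what lets the lookup behave like a (piecewise-constant) learned projection.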

Empirical Evaluation

The authors conducted experiments using the Pythia framework, benchmarking MemoryFormer against standard transformer architectures on a selection of NLP tasks: PIQA, WinoGrande, WSC, ARC-E, ARC-C, and LogiQA. Across different model sizes, MemoryFormer achieved comparable or superior average accuracy at a fraction of the computational cost. This empirical evidence suggests that MemoryFormer holds promise in real-world applications where computational resources are constrained but model performance cannot be compromised.

Implications and Future Directions

The incorporation of LSH and reduction of FLOPs without degrading model performance has profound implications for transformer application in resource-limited scenarios. MemoryFormer opens avenues for deploying powerful LLMs on devices with restricted computational capacity, such as mobile devices or edge computing nodes.

This work also motivates hardware design choices shaped by the new workload, such as wider memory buses and enhanced cache strategies to accommodate the memory retrieval operations that MemoryFormer relies upon. Future research could extend MemoryFormer by combining it with recent advances in attention mechanisms, such as FlashAttention or linear attention, which could further optimize both memory and computational workloads.

Conclusion

MemoryFormer represents a valuable progression in the development of efficient transformer architectures. By optimizing computation and memory use through innovative machine learning techniques, it addresses a critical bottleneck in NLP applications today. The thoughtful design that prioritizes both performance and computational efficiency offers avenues for practical deployment of AI models in diverse computing environments, promising a wide-reaching impact across multiple sectors using machine learning technologies.

Authors (9)
  1. Ning Ding (122 papers)
  2. Yehui Tang (63 papers)
  3. Haochen Qin (3 papers)
  4. Zhenli Zhou (1 paper)
  5. Chao Xu (283 papers)
  6. Lin Li (329 papers)
  7. Kai Han (184 papers)
  8. Heng Liao (5 papers)
  9. Yunhe Wang (145 papers)