Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers (2406.16450v2)

Published 24 Jun 2024 in cs.CL

Abstract: State-of-the-art results in LLMs often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called self-guided training, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. Interestingly, the scaling performance of structured matrices is explored, revealing steeper curves in scaling training FLOPs, along with a favorable scaling trend in the overtraining regime. Specifically, we show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off. Our code is available at https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.

Overview of Efficient Training of LLMs with Structured Feedforward Layers

The research paper "Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers" explores methods for making the training and deployment of Transformer-based LLMs more efficient. The focus is on reducing the parameter count and computational cost of the feedforward network (FFN) blocks, which dominate the overall parameter budget in Transformers yet are less studied than the attention mechanisms.

Research Context and Objectives

In an era where scaling LLMs demands substantial computational resources, designing more efficient architectures is paramount. While attention mechanisms have received significant optimization effort, FFNs, which account for a major portion of both parameters and computation, remain comparatively under-explored. The paper targets this gap by replacing the dense FFN layers with structured linear transformations, such as low-rank and block-diagonal matrices, trained from scratch rather than derived from a pretrained dense model.

Methodological Innovations

The authors assess three structured linear parameterizations of the FFN: LowRank, BlockShuffle, and BlockDense. Each replaces the dense linear layers with a cheaper structured form (a code sketch follows this list):

  • LowRank: factorizes each weight matrix into two low-rank factors, substantially reducing parameters and FLOPs when the rank is small.
  • BlockShuffle: stacks block-diagonal matrices interspersed with shuffle operations so that information mixes across feature groups.
  • BlockDense: combines a block-diagonal matrix followed by a dense or low-rank matrix, retaining interactions across feature groups without the overhead of a full reshuffle.
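To make the three parameterizations concrete, here is a minimal PyTorch-style sketch under assumed shapes; the class names, initialization, and exact shuffle ordering are illustrative assumptions and do not reproduce the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """A dense d_in x d_out weight replaced by two rank-r factors."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.u = nn.Linear(d_in, rank, bias=False)   # d_in -> rank
        self.v = nn.Linear(rank, d_out, bias=True)   # rank -> d_out

    def forward(self, x):
        return self.v(self.u(x))


class BlockDiagonalLinear(nn.Module):
    """Block-diagonal weight: features are split into `blocks` groups and
    each group is transformed by its own small dense matrix."""
    def __init__(self, d_in: int, d_out: int, blocks: int):
        super().__init__()
        assert d_in % blocks == 0 and d_out % blocks == 0
        self.blocks = blocks
        self.weight = nn.Parameter(
            torch.randn(blocks, d_in // blocks, d_out // blocks) * (d_in // blocks) ** -0.5
        )

    def forward(self, x):
        d = x.shape[-1]
        x = x.reshape(*x.shape[:-1], self.blocks, d // self.blocks)
        y = torch.einsum("...gi,gio->...go", x, self.weight)
        return y.flatten(-2)


class BlockShuffleLinear(nn.Module):
    """Two block-diagonal layers with a feature shuffle in between,
    so information mixes across blocks."""
    def __init__(self, d_in: int, d_out: int, blocks: int):
        super().__init__()
        self.blocks = blocks
        self.first = BlockDiagonalLinear(d_in, d_in, blocks)
        self.second = BlockDiagonalLinear(d_in, d_out, blocks)

    def forward(self, x):
        h = self.first(x)
        d = h.shape[-1]
        # Shuffle: transpose the (blocks, d/blocks) grouping so each block of
        # the second layer sees features coming from every input block.
        h = h.reshape(*h.shape[:-1], self.blocks, d // self.blocks) \
             .transpose(-2, -1).reshape(*h.shape[:-1], d)
        return self.second(h)


class BlockDenseLinear(nn.Module):
    """Block-diagonal projection followed by a dense projection, mixing
    information across blocks without an explicit shuffle."""
    def __init__(self, d_in: int, d_out: int, blocks: int, d_mid: int):
        super().__init__()
        self.block = BlockDiagonalLinear(d_in, d_mid, blocks)
        self.dense = nn.Linear(d_mid, d_out, bias=True)

    def forward(self, x):
        return self.dense(self.block(x))
```

In each case the structured module is a drop-in replacement for an nn.Linear inside the FFN; the parameter savings depend on the chosen rank, number of blocks, and intermediate width.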

Additionally, the paper proposes a self-guided training regime, analogous to homotopy methods, to mitigate the poor training dynamics these structured matrices exhibit when trained from initialization. A dense residual branch accompanies the structured layer during the early stages of training and is gradually phased out, handing learning over to the more efficient structured representation. A minimal sketch of the idea follows.
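
The sketch below assumes a PyTorch module and a simple linear decay for the dense branch; the class name, schedule, and exact placement are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class SelfGuidedLinear(nn.Module):
    """Structured layer plus a dense guidance branch whose weight `alpha`
    is annealed from 1 to 0 during training (illustrative schedule only)."""
    def __init__(self, structured: nn.Module, d_in: int, d_out: int):
        super().__init__()
        self.structured = structured
        self.dense = nn.Linear(d_in, d_out, bias=False)
        self.register_buffer("alpha", torch.tensor(1.0))

    def set_progress(self, step: int, total_steps: int) -> None:
        # Linear decay of the dense branch; the paper's actual schedule may differ.
        self.alpha.fill_(max(0.0, 1.0 - step / total_steps))

    def forward(self, x):
        y = self.structured(x)
        if self.alpha.item() > 0.0:
            # The dense branch guides optimization early in training.
            y = y + self.alpha * self.dense(x)
        return y
```

Once alpha reaches zero, only the structured parameters remain active, so the guidance branch adds no cost at inference time.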

Key Findings and Numerical Results

Empirical evaluations were performed on a dataset derived from the RefinedWeb corpus, with model sizes scaling up to 1.3 billion parameters. The experiments reveal:

  • A significant reduction in FFN parameters (up to 68% in some configurations) led to only a marginal increase in perplexity (less than 1 point in the larger models), supporting the efficiency claims; a rough parameter-count illustration follows this list.
  • Structured FFNs using 32% of the dense parameters delivered a 1.35x training speed-up on the large-scale models, with only a small trade-off in accuracy.
  • Scaling-law experiments showed steeper loss-versus-training-FLOPs curves for structured FFNs than for their dense counterparts, along with a favorable trend in the overtraining regime, suggesting that structured approaches scale well.
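
As a rough illustration of where such reductions come from (hypothetical sizes, not the paper's configuration): a dense projection of shape 2048 × 8192 holds about 16.8M parameters, whereas a rank-512 factorization holds 2048·512 + 512·8192 ≈ 5.2M, i.e. roughly 31% of the original, which is in the same ballpark as the parameter budgets reported above.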

Implications and Future Prospects

The implications of this research are considerable, with practical benefits in reducing computational costs and thereby democratizing access to state-of-the-art LLM training. Theoretically, the findings could inspire further exploration into structured paradigms across neural architectures. The insights on efficient training dynamics also pave the way for optimized training regimes applicable to other domains within machine learning.

Future developments may include automating the choice of structured parameters for specific tasks or datasets, further minimizing manual tuning. Additionally, integrating these structures with advanced attention mechanisms could unlock even higher efficiency gains.

Overall, this work presents a refined approach to LLM training, addressing both performance bottlenecks and resource constraints in model deployment.

Authors (4)
  1. Xiuying Wei (10 papers)
  2. Skander Moalla (4 papers)
  3. Razvan Pascanu (138 papers)
  4. Caglar Gulcehre (71 papers)