- The paper presents LASER, a layer-selective rank reduction method that can remove over 90% of the singular components in targeted weight matrices while improving model reasoning.
- LASER applies singular value decomposition to the weight matrices of specific Transformer layers, effectively denoising the model's stored knowledge and improving accuracy on facts that appear infrequently in the training data.
- Experiments confirm that LASER not only strengthens language understanding but also benefits reinforcement learning, all without any retraining or additional parameters.
Introduction
Transformer-based LLMs have ushered in a new era in machine learning. The Transformer architecture, initially developed for natural language processing tasks, has seen success in various other domains including computer vision and reinforcement learning. These LLMs are known for their extensive parameterization, which has historically been believed to be essential for their performance. However, researchers have discovered that significant pruning of these parameters can be performed without compromising the model's capabilities. This surprising resilience to parameter reduction has sparked interest in strategies that efficiently prune neural networks.
Layer-Selective Pruning
Recent research introduces a technique called LAyer SElective Rank reduction (LASER), which selectively removes the higher-order components, those associated with the smallest singular values, from the weight matrices of specific layers in a trained Transformer model. LASER operates via singular value decomposition (SVD): a chosen weight matrix is replaced by its low-rank approximation, simplifying the model post-training without any retraining or additional parameters. The intervention targets the weight matrices in the multi-layer perceptron (MLP) and self-attention layers of Transformer blocks. Remarkably, even when over 90% of a matrix's components are removed, LLMs often display significant performance gains on reasoning benchmarks. Interestingly, these gains are not limited to language tasks; they are also observed in reinforcement learning.
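A minimal sketch of the core rank-reduction step, assuming a PyTorch weight matrix; the helper name `rank_reduce` and the matrix shapes are illustrative, not taken from the paper's released code.

```python
import torch

def rank_reduce(weight: torch.Tensor, fraction: float) -> torch.Tensor:
    """Return a low-rank approximation of `weight`, keeping only the
    top `fraction` of singular components and discarding the rest."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(fraction * S.numel()))           # number of components kept
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Keep only 10% of the components, i.e. remove the other 90%.
W = torch.randn(4096, 1024)                         # stand-in weight matrix
W_low = rank_reduce(W, fraction=0.10)
```

Because the reduced matrix is simply a product of truncated SVD factors, it can be swapped in for the original weight as-is, which is why no retraining or extra parameters are needed.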
Insights into LASER's Effectiveness
Experiments demonstrate that LASER's benefits predominantly manifest on samples that are less frequently represented in the training data, suggesting that the technique acts as a denoising procedure that makes faintly learned facts more accessible. LASER also improves a model's robustness to paraphrases of questions it previously answered correctly. Probing this phenomenon further, the researchers observe that after LASER, models that previously produced generic, high-frequency responses begin providing correct answers. Conversely, when predictions are computed using only the higher-order components that LASER removes, the outputs often correspond to incorrect answers or high-frequency words. It appears that the noise from the higher-order components, when combined with the lower-order ones, produces an "average" answer that is more likely to be incorrect; those discarded components can be inspected directly, as sketched below.
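A short follow-on sketch, reusing `rank_reduce`, `W`, and `W_low` from the example above: the higher-order components the researchers inspect are exactly the residual that rank reduction discards.

```python
# The discarded higher-order components are the residual between the
# original matrix and its low-rank approximation.
W_high = W - W_low

# Evaluating the model with W_high substituted for W is one way to probe
# what the small-singular-value components encode; the paper reports that
# doing so tends to yield incorrect or generic high-frequency answers.
```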
Broader Implications and Future Work
Extensive evaluation of LASER across several datasets and models confirms its general effectiveness on language understanding tasks. The researchers show that in most cases the improvements come from reductions in the later layers of the model, especially in the MLP input matrices (a hedged example of targeting such a layer follows below). The technique also shows potential outside textual domains, as demonstrated by applying it to a reinforcement learning agent in a decision-making task. However, the full potential and implications of LASER require further exploration, including why higher-order components accumulate noisy answers and under which conditions LASER excels. Moving forward, it will be important to understand why pruning only certain layers is effective and to explore broader applications across machine learning domains.
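As one concrete illustration of targeting a later MLP input matrix, here is a hedged sketch assuming a Hugging Face GPT-2 checkpoint (the paper evaluates other models, and attribute paths differ across architectures); it reuses the `rank_reduce` helper sketched earlier.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
layer_idx = 10                                    # a later Transformer block
mlp_in = model.transformer.h[layer_idx].mlp.c_fc  # MLP input projection

with torch.no_grad():
    # Replace the weight with its low-rank approximation in place;
    # no retraining or extra parameters are introduced.
    mlp_in.weight.copy_(rank_reduce(mlp_in.weight, fraction=0.10))
```

The edited model can then be evaluated on the benchmark of interest directly, which is what makes layer-by-layer sweeps over (layer, fraction) choices practical.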