- The paper presents LASER, a layer-selective rank reduction method that can remove over 90% of the singular components in targeted weight matrices while improving model reasoning.
- LASER applies singular value decomposition to the weight matrices of specific Transformer layers, effectively denoising the model's stored knowledge and improving accuracy on facts that appear infrequently in the training data.
- Experiments confirm that LASER not only strengthens language understanding but also benefits reinforcement learning, all without any retraining or additional parameters.
Introduction
Transformer-based LLMs have ushered in a new era in machine learning. The Transformer architecture, initially developed for natural language processing tasks, has seen success in various other domains including computer vision and reinforcement learning. These LLMs are known for their extensive parameterization, which has historically been believed to be essential for their performance. However, researchers have discovered that significant pruning of these parameters can be performed without compromising the model's capabilities. This surprising resilience to parameter reduction has sparked interest in strategies that efficiently prune neural networks.
Layer-Selective Pruning
Recent research introduces a technique called LAyer SElective Rank reduction (LASER), which selectively removes the higher-order components, those associated with the smallest singular values, from the weight matrices of specific layers in a trained Transformer model. LASER operates via singular value decomposition (SVD): a chosen weight matrix is replaced by its low-rank approximation, simplifying the model post-training without any retraining or additional parameters. The intervention targets the weight matrices in the multi-layer perceptron (MLP) and self-attention layers of Transformer blocks. Remarkably, even when over 90% of a matrix's components are removed, LLMs often display significant performance gains on reasoning benchmarks. Interestingly, these gains are not limited to language tasks; they are also observed in reinforcement learning.
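A minimal sketch of the core rank-reduction step, assuming a PyTorch weight matrix; the helper name `rank_reduce` and the matrix shapes are illustrative, not taken from the paper's released code.

```python
import torch

def rank_reduce(weight: torch.Tensor, fraction: float) -> torch.Tensor:
    """Return a low-rank approximation of `weight`, keeping only the
    top `fraction` of singular components and discarding the rest."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(fraction * S.numel()))           # number of components kept
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Keep only 10% of the components, i.e. remove the other 90%.
W = torch.randn(4096, 1024)                         # stand-in weight matrix
W_low = rank_reduce(W, fraction=0.10)
```

Because the reduced matrix is simply a product of truncated SVD factors, it can be swapped in for the original weight as-is, which is why no retraining or extra parameters are needed.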
Insights into LASER's Effectiveness
Experiments demonstrate that LASER's benefits predominantly manifest on samples that are less frequently represented in the training data, suggesting that the technique acts as a denoising procedure that makes faintly learned facts more accessible. LASER also improves a model's robustness to paraphrases of questions it previously answered correctly. Probing this phenomenon further, the researchers observe that after LASER, models that previously produced generic, high-frequency responses begin providing correct answers. Conversely, when predictions are computed using only the higher-order components that LASER removes, the outputs often correspond to incorrect answers or high-frequency words. It appears that the noise from the higher-order components, when combined with the lower-order ones, produces an "average" answer that is more likely to be incorrect; those discarded components can be inspected directly, as sketched below.
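A short follow-on sketch, reusing `rank_reduce`, `W`, and `W_low` from the example above: the higher-order components the researchers inspect are exactly the residual that rank reduction discards.

```python
# The discarded higher-order components are the residual between the
# original matrix and its low-rank approximation.
W_high = W - W_low

# Evaluating the model with W_high substituted for W is one way to probe
# what the small-singular-value components encode; the paper reports that
# doing so tends to yield incorrect or generic high-frequency answers.
```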
Broader Implications and Future Work
Extensive evaluation of LASER across several datasets and models confirms its general effectiveness on language understanding tasks. The researchers show that in most cases the improvements come from reductions in the later layers of the model, especially in the MLP input matrices (a hedged example of targeting such a layer follows below). The technique also shows potential outside textual domains, as demonstrated by applying it to a reinforcement learning agent in a decision-making task. However, the full potential and implications of LASER require further exploration, including why higher-order components accumulate noisy answers and under which conditions LASER excels. Moving forward, it will be important to understand why pruning only certain layers is effective and to explore broader applications across machine learning domains.
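As one concrete illustration of targeting a later MLP input matrix, here is a hedged sketch assuming a Hugging Face GPT-2 checkpoint (the paper evaluates other models, and attribute paths differ across architectures); it reuses the `rank_reduce` helper sketched earlier.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
layer_idx = 10                                    # a later Transformer block
mlp_in = model.transformer.h[layer_idx].mlp.c_fc  # MLP input projection

with torch.no_grad():
    # Replace the weight with its low-rank approximation in place;
    # no retraining or extra parameters are introduced.
    mlp_in.weight.copy_(rank_reduce(mlp_in.weight, fraction=0.10))
```

The edited model can then be evaluated on the benchmark of interest directly, which is what makes layer-by-layer sweeps over (layer, fraction) choices practical.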