Efficient Low-Rank Hypercomplex Adapter Layers
The paper "Compacter: Efficient Low-Rank Hypercomplex Adapter Layers" addresses the challenges faced when fine-tuning large-scale pretrained LLMs (PLMs), a widely adopted approach to achieve state-of-the-art performance on NLP benchmarks. Despite its effectiveness, fine-tuning is often sample-inefficient, unstable in low-resource settings, and computationally demanding, as it requires tuning all model parameters for different tasks and maintaining separate copies of the model per task. To tackle these issues, the authors propose a novel approach called Compacter, which integrates parameter-efficient tuning methods, leveraging ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers.
Core Contributions and Methodology
The principal innovation of Compacter lies in inserting task-specific weight matrices into a pretrained model, computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. This allows adapting only about 0.047% of a PLM's parameters while achieving results on par with full fine-tuning on standard benchmarks such as GLUE and SuperGLUE, and superior performance in low-resource settings.
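Concretely, in the paper's formulation each Compacter layer's weight matrix $W \in \mathbb{R}^{k \times d}$ is never stored as a dense block; it is assembled from small factors:

$$
W \;=\; \sum_{i=1}^{n} A_i \otimes B_i, \qquad B_i = s_i t_i^{\top},
$$

where the $A_i \in \mathbb{R}^{n \times n}$ are "slow" weights shared across all adapter layers, and $s_i \in \mathbb{R}^{k/n \times 1}$, $t_i \in \mathbb{R}^{1 \times d/n}$ are the "fast" rank-one factors learned per layer. As a result, the number of trainable parameters per layer grows roughly linearly in $k + d$, rather than as the product $kd$ required by a standard adapter.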
The Kronecker product and parameterized hypercomplex multiplication (PHM) layers form the mathematical foundation of the method. The Kronecker product lets a large transformation matrix be expressed through much smaller factors, which the authors use to structure the adapter projections inside the PLM. The additional low-rank constraint on the "fast" factors limits parameter growth further, so only the essential task-specific adaptation is learned, keeping both computation and memory overhead small.
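As a minimal illustration of this construction, the PyTorch sketch below builds one low-rank PHM (Compacter-style) projection from shared "slow" matrices and per-layer rank-one "fast" factors. The class and parameter names are hypothetical and the initialization is simplified; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LowRankPHMProjection(nn.Module):
    """Compacter-style projection whose (d_in x d_out) weight is a sum of
    Kronecker products between shared "slow" matrices A_i (n x n) and
    per-layer "fast" rank-one matrices B_i = s_i t_i^T."""

    def __init__(self, d_in: int, d_out: int, n: int, shared_A: nn.Parameter):
        super().__init__()
        assert d_in % n == 0 and d_out % n == 0
        self.n = n
        self.shared_A = shared_A                                     # (n, n, n), reused by every layer
        self.s = nn.Parameter(torch.randn(n, d_in // n, 1) * 1e-2)   # "fast" factors s_i
        self.t = nn.Parameter(torch.randn(n, 1, d_out // n) * 1e-2)  # "fast" factors t_i

    def weight(self) -> torch.Tensor:
        B = self.s @ self.t  # rank-one B_i, shape (n, d_in//n, d_out//n)
        # Sum of Kronecker products reconstructs the full (d_in, d_out) matrix.
        return sum(torch.kron(self.shared_A[i], B[i]) for i in range(self.n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight()  # x: (..., d_in) -> (..., d_out)

# Usage: one shared "slow" parameter, reused by the down- and up-projection.
shared_A = nn.Parameter(torch.randn(4, 4, 4) * 1e-2)
down = LowRankPHMProjection(d_in=768, d_out=64, n=4, shared_A=shared_A)
up = LowRankPHMProjection(d_in=64, d_out=768, n=4, shared_A=shared_A)
h = torch.randn(2, 768)
print(up(torch.relu(down(h))).shape)  # torch.Size([2, 768])
```

Only the small factors (and the shared A_i) are trained, which is what keeps the per-layer parameter count far below that of a dense adapter projection.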
Numerical Results and Performance Analysis
Empirically, the paper shows that Compacter performs on par with, and in some cases better than, full fine-tuning while training only a small fraction of the parameters, substantially reducing storage requirements and computational footprint. On the GLUE benchmark, for example, Compacter reaches average scores close to full fine-tuning while being orders of magnitude more parameter-efficient.
The advantage is particularly pronounced in limited-data settings, where Compacter is more stable and less prone to overfitting. This is attributed to its low-rank formulation and the sharing of the "slow" weights across layers, which enables robust adaptation to diverse tasks without extensive retraining.
Theoretical and Practical Implications
Theoretically, Compacter makes a compelling case for hypercomplex representations in large-scale NLP models. By emphasizing low-rank decomposition, the method underscores the importance of efficient structural design in neural architectures, a concern traditionally overshadowed by raw performance. Practically, the approach opens up possibilities for deploying sophisticated models in resource-constrained environments where computational efficiency and storage limitations are critical, making advanced NLP capabilities more accessible.
Future Directions
While the paper establishes a robust framework for parameter-efficient tuning, future research might focus on further reducing memory overhead by investigating training methodologies that do not require layer normalization, as well as exploring the potential of combining Compacter with other compact neural network components. Additionally, insights gained from this line of work could be extended to other domains like computer vision or audio processing, where similar challenges in fine-tuning large pretrained models exist.
In conclusion, the paper offers a significant step toward a better trade-off between model size and performance, making powerful PLMs practical and accessible for a wide range of applications. Compacter represents a promising advance in parameter-efficient fine-tuning, suggesting new ways to combine efficient parameter management with advanced model architectures.