- The paper introduces DOTResize, a neuron-merging technique based on discrete optimal transport that preserves useful signal while reducing model width.
- It uses the Sinkhorn algorithm to compute soft neuron alignments and QR decomposition to keep the applied transformations functionally compatible with the Transformer's normalization layers.
- Across Llama 3.1, Mistral, and Phi-4, DOTResize achieves lower perplexity than competing width-reduction baselines and largely preserves zero-shot task performance.
DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging
Introduction
The paper introduces DOTResize, a new method for compressing LLMs by transforming neuron layers using discrete optimal transport (OT). The approach targets computational redundancy at the neuron level, a departure from traditional pruning techniques: where pruning discards neurons deemed unimportant, DOTResize consolidates similar neurons, preserving their useful signal and redistributing it across the reduced layer width. The motivation for using OT is to provide a principled framework for the neuron transformation, offering a robust alternative to existing methods that depend on specific hardware support or on heuristic pruning strategies.
Methodology
The DOTResize method reframes neuron width reduction as a discrete optimal transport problem. The neurons in a layer are treated as the source distribution, while a subset of them, selected by activation similarity, forms the target distribution. The transport plan is computed with entropic regularization, yielding soft alignments that recombine neurons linearly rather than forcing sparse one-to-one matchings.
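A minimal, self-contained sketch of this formulation is given below. It is not the paper's implementation: the synthetic activation matrix `X`, the activation-norm heuristic for picking target neurons, the cost function, and the way the merge map is folded into a weight matrix are all illustrative assumptions; only the entropy-regularized Sinkhorn iteration itself is standard.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.05, n_iters=500):
    """Entropy-regularized OT: return a soft transport plan T with row
    marginals a and column marginals b for cost matrix C."""
    K = np.exp(-C / reg)                      # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]        # T = diag(u) @ K @ diag(v)

rng = np.random.default_rng(0)
n_src, n_tgt, n_samples, d_model = 64, 32, 512, 128

# Calibration activations for the n_src neurons of one layer (illustrative).
X = rng.standard_normal((n_samples, n_src))

# Target distribution: a subset of neurons, chosen here by a simple
# activation-norm heuristic purely for illustration.
keep = np.argsort(-np.linalg.norm(X, axis=0))[:n_tgt]
X_tgt = X[:, keep]

# Cost matrix: distance between source and target activation profiles.
C = np.linalg.norm(X[:, :, None] - X_tgt[:, None, :], axis=0)   # (n_src, n_tgt)
C /= C.max()

a = np.full(n_src, 1.0 / n_src)               # uniform mass on source neurons
b = np.full(n_tgt, 1.0 / n_tgt)               # uniform mass on target neurons
T = sinkhorn(a, b, C)                         # soft alignments, shape (n_src, n_tgt)

# Column-normalize the plan so each kept neuron becomes a weighted
# combination of the source neurons transported onto it.
M = T / T.sum(axis=0, keepdims=True)

# Schematic weight update: merge the rows of a weight matrix that produces
# the n_src neurons; where such maps can be folded exactly in a Transformer
# is handled in the paper and not reproduced here.
W_up = rng.standard_normal((n_src, d_model))
W_up_merged = M.T @ W_up                      # reduced width: (n_tgt, d_model)
print(W_up_merged.shape)
```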
Key innovations include:
- Transportation Maps: Constructed with the Sinkhorn algorithm, which computes soft alignments between neuron activations and solves the entropy-regularized transport problem far more cheaply than exact OT solvers.
- Matrix Factorization: QR decomposition splits the applied transformations so that the factor crossing normalization layers is orthogonal, preserving functional invariance with respect to RMSNorm within the Transformer architecture (see the sketch after this list).
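The role of the orthogonal factor can be checked numerically. The sketch below is not the paper's exact construction; it only demonstrates the property the QR step relies on: an orthogonal map preserves row norms and therefore commutes with RMSNorm, while the remaining triangular factor does not and must be absorbed into neighbouring weights.

```python
import numpy as np

def rmsnorm(X, eps=1e-6):
    """Row-wise RMSNorm without a learned gain (gains can be folded into weights)."""
    return X / np.sqrt((X ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(1)
d = 64
A = rng.standard_normal((d, d))   # stand-in for a square map that must cross a norm layer

# QR factorization: A = Q @ R with Q orthogonal and R upper triangular.
Q, R = np.linalg.qr(A)

# Orthogonal maps preserve row norms, so RMSNorm commutes with Q.
X = rng.standard_normal((8, d))
print(np.allclose(rmsnorm(X @ Q), rmsnorm(X) @ Q, atol=1e-5))   # True: Q can move across the norm

# A generic (non-orthogonal) map does not commute with RMSNorm, so the
# R factor has to be folded into the adjacent weight matrices instead.
print(np.allclose(rmsnorm(X @ A), rmsnorm(X) @ A, atol=1e-5))   # False in general
```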
Results
Empirical evaluations show that DOTResize outperforms both simple magnitude pruning and state-of-the-art width-reduction methods such as SliceGPT. Across several LLM architectures (Llama 3.1, Mistral, and Phi-4), it achieves consistently lower WikiText-2 perplexity than the baselines at matched compression levels, indicating effective compression without substantial performance loss. Notably, the method recovers much of the performance that pruning alone would lose; Phi-4 in particular retains nearly its full zero-shot capability even at substantial width reductions.
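For reference, WikiText-2 perplexity for a compressed (or dense) causal LM is typically computed with a chunked log-likelihood evaluation like the sketch below. This is the common Hugging Face recipe, not necessarily the paper's exact evaluation setup; the model name and chunk length are placeholder assumptions.

```python
# pip install torch transformers datasets
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"   # placeholder; substitute the compressed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
model.eval()

# Concatenate the test split and tokenize it once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(device)

chunk_len, nlls, n_tokens = 2048, [], 0
with torch.no_grad():
    for start in range(0, ids.size(1), chunk_len):
        chunk = ids[:, start:start + chunk_len]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)          # loss = mean NLL over shifted targets
        nlls.append(out.loss.float() * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1

print(f"WikiText-2 perplexity: {math.exp(torch.stack(nlls).sum().item() / n_tokens):.2f}")
```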
Implications and Future Work
The implications of this approach are twofold:
- Theoretical: The paper advances the understanding of neuron-level redundancies, highlighting the potential of soft neuron alignment to improve compression while retaining functionality.
- Practical: The method reduces model latency and memory usage, making LLMs easier to deploy in computationally constrained environments. However, the reliance on calibration data remains a limitation, motivating further work on data-free or minimal-data settings.
Future research could explore integrating DOTResize with other compression techniques such as quantization, and adapting the method to multimodal and multilingual models. The ongoing challenge is to push compression further across diverse model architectures without eroding the models' underlying capabilities.
Conclusion
DOTResize demonstrates a novel and effective approach to neuron-level compression in large Transformer models, using optimal transport as the foundation for merging similar neurons. By preserving signal integrity while achieving significant reductions in model size and computational cost, the method points to a promising direction for future research on model compression and efficient deep learning architectures.