- The paper introduces DOTResize, a neuron-merging technique based on discrete optimal transport that preserves useful signal while reducing model width.
- It uses the Sinkhorn algorithm to compute soft neuron alignments and QR decomposition to keep the applied transformations functionally compatible with the Transformer's normalization layers.
- Across Llama 3.1, Mistral, and Phi-4, DOTResize achieves lower perplexity than competing width-reduction baselines and largely preserves zero-shot task performance.
DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging
Introduction
The paper introduces DOTResize, a new method for compressing LLMs by transforming neuron layers using discrete optimal transport (OT). The approach targets computational redundancy at the neuron level, a departure from traditional pruning techniques: where pruning discards neurons deemed unimportant, DOTResize consolidates similar neurons, preserving their useful signal and redistributing it across the reduced layer width. The motivation for using OT is to provide a principled framework for the neuron transformation, offering a robust alternative to existing methods that depend on specific hardware support or on heuristic pruning strategies.
Methodology
The DOTResize method reframes neuron width reduction as a discrete optimal transport problem. The neurons in a layer are treated as the source distribution, while a subset of them, selected by activation similarity, forms the target distribution. The transport plan is computed with entropic regularization, yielding soft alignments that recombine neurons linearly rather than forcing sparse one-to-one matchings.
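A minimal, self-contained sketch of this formulation is given below. It is not the paper's implementation: the synthetic activation matrix `X`, the activation-norm heuristic for picking target neurons, the cost function, and the way the merge map is folded into a weight matrix are all illustrative assumptions; only the entropy-regularized Sinkhorn iteration itself is standard.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.05, n_iters=500):
    """Entropy-regularized OT: return a soft transport plan T with row
    marginals a and column marginals b for cost matrix C."""
    K = np.exp(-C / reg)                      # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]        # T = diag(u) @ K @ diag(v)

rng = np.random.default_rng(0)
n_src, n_tgt, n_samples, d_model = 64, 32, 512, 128

# Calibration activations for the n_src neurons of one layer (illustrative).
X = rng.standard_normal((n_samples, n_src))

# Target distribution: a subset of neurons, chosen here by a simple
# activation-norm heuristic purely for illustration.
keep = np.argsort(-np.linalg.norm(X, axis=0))[:n_tgt]
X_tgt = X[:, keep]

# Cost matrix: distance between source and target activation profiles.
C = np.linalg.norm(X[:, :, None] - X_tgt[:, None, :], axis=0)   # (n_src, n_tgt)
C /= C.max()

a = np.full(n_src, 1.0 / n_src)               # uniform mass on source neurons
b = np.full(n_tgt, 1.0 / n_tgt)               # uniform mass on target neurons
T = sinkhorn(a, b, C)                         # soft alignments, shape (n_src, n_tgt)

# Column-normalize the plan so each kept neuron becomes a weighted
# combination of the source neurons transported onto it.
M = T / T.sum(axis=0, keepdims=True)

# Schematic weight update: merge the rows of a weight matrix that produces
# the n_src neurons; where such maps can be folded exactly in a Transformer
# is handled in the paper and not reproduced here.
W_up = rng.standard_normal((n_src, d_model))
W_up_merged = M.T @ W_up                      # reduced width: (n_tgt, d_model)
print(W_up_merged.shape)
```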
Key innovations include:
- Transportation Maps: Constructed with the Sinkhorn algorithm, which computes soft alignments between neuron activations and solves the entropy-regularized transport problem far more cheaply than exact OT solvers.
- Matrix Factorization: QR decomposition splits the applied transformations so that the factor crossing normalization layers is orthogonal, preserving functional invariance with respect to RMSNorm within the Transformer architecture (see the sketch after this list).
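The role of the orthogonal factor can be checked numerically. The sketch below is not the paper's exact construction; it only demonstrates the property the QR step relies on: an orthogonal map preserves row norms and therefore commutes with RMSNorm, while the remaining triangular factor does not and must be absorbed into neighbouring weights.

```python
import numpy as np

def rmsnorm(X, eps=1e-6):
    """Row-wise RMSNorm without a learned gain (gains can be folded into weights)."""
    return X / np.sqrt((X ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(1)
d = 64
A = rng.standard_normal((d, d))   # stand-in for a square map that must cross a norm layer

# QR factorization: A = Q @ R with Q orthogonal and R upper triangular.
Q, R = np.linalg.qr(A)

# Orthogonal maps preserve row norms, so RMSNorm commutes with Q.
X = rng.standard_normal((8, d))
print(np.allclose(rmsnorm(X @ Q), rmsnorm(X) @ Q, atol=1e-5))   # True: Q can move across the norm

# A generic (non-orthogonal) map does not commute with RMSNorm, so the
# R factor has to be folded into the adjacent weight matrices instead.
print(np.allclose(rmsnorm(X @ A), rmsnorm(X) @ A, atol=1e-5))   # False in general
```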
Results
Empirical evaluations show that DOTResize outperforms both simple magnitude pruning and state-of-the-art width-reduction methods such as SliceGPT. Across several LLM architectures (Llama 3.1, Mistral, and Phi-4), it achieves consistently lower WikiText-2 perplexity than the baselines at matched compression levels, indicating effective compression without substantial performance loss. Notably, the method recovers much of the performance that pruning alone would lose; Phi-4 in particular retains nearly its full zero-shot capability even at substantial width reductions.
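For reference, WikiText-2 perplexity for a compressed (or dense) causal LM is typically computed with a chunked log-likelihood evaluation like the sketch below. This is the common Hugging Face recipe, not necessarily the paper's exact evaluation setup; the model name and chunk length are placeholder assumptions.

```python
# pip install torch transformers datasets
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"   # placeholder; substitute the compressed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
model.eval()

# Concatenate the test split and tokenize it once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(device)

chunk_len, nlls, n_tokens = 2048, [], 0
with torch.no_grad():
    for start in range(0, ids.size(1), chunk_len):
        chunk = ids[:, start:start + chunk_len]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)          # loss = mean NLL over shifted targets
        nlls.append(out.loss.float() * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1

print(f"WikiText-2 perplexity: {math.exp(torch.stack(nlls).sum().item() / n_tokens):.2f}")
```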
Implications and Future Work
The implications of this approach are twofold:
- Theoretical: The paper advances the understanding of neuron-level redundancies, highlighting the potential of soft neuron alignment to improve compression while retaining functionality.
- Practical: The method reduces model latency and memory usage, making LLMs easier to deploy in computationally constrained environments. However, the reliance on calibration data remains a limitation, motivating further work on data-free or minimal-data settings.
Future research could explore integrating DOTResize with other compression techniques such as quantization, and adapting the method to multimodal and multilingual models. The ongoing challenge is to push compression further across diverse model architectures without eroding the models' underlying capabilities.
Conclusion
DOTResize demonstrates a novel and effective approach to neuron-level compression in large Transformer models, using optimal transport as the foundation for merging similar neurons. By preserving signal integrity while achieving significant reductions in model size and computational cost, the method points to a promising direction for future research on model compression and efficient deep learning architectures.