DOTResize: Optimal Model Compression
- DOTResize is a model efficiency method that employs discrete optimal transport for structured width reduction in Transformers by merging redundant neurons.
- A companion DRNet-based variant dynamically selects per-image input resolutions for CNNs, balancing per-instance compute cost with accuracy.
- Empirical evaluations show 10–44% compute savings and improved performance over conventional pruning methods, enabling practical resource-efficient inference.
DOTResize encompasses two distinct yet complementary lines of model efficiency research: (1) width reduction in Transformers via neuron merging based on discrete optimal transport (DOT) (Verma et al., 6 Jul 2025), and (2) dynamic input-resolution selection for convolutional neural networks (CNNs) using per-instance compute-accuracy optimization, as grounded in the Dynamic Resolution Network (DRNet) methodology (Zhu et al., 2021). Both approaches use the term "DOTResize" to denote data- and structure-aware reductions of model redundancy, targeting either internal representation (width) or external input adaptation (resolution), each underpinned by mathematically principled frameworks.
1. Neuron Merging in Transformers via Discrete Optimal Transport
DOTResize for Transformer compression reframes width reduction as a discrete optimal transport problem, targeting neuron-level redundancies and merging similar neurons rather than discarding information. Given a neural layer with hidden width $n$, the goal is to reduce the width to $m < n$ by merging the entire neuron inventory onto a smaller basis. The set of $n$ original neurons is treated as a discrete 'source' distribution $\mu$, and a chosen subset $T \subseteq \{1, \dots, n\}$ with $|T| = m$ (e.g., the neurons with the highest mean $\ell_2$ activation norm) forms the 'target' distribution $\nu$. The method utilizes calibration data to gather activations and constructs a ground-cost matrix $C \in \mathbb{R}^{n \times m}$ with entries $C_{ij} = \lVert \bar{a}_i - \bar{a}_j \rVert_2^2$ for $j \in T$, where $\bar{a}_i$ denotes the calibration-averaged activation profile of neuron $i$.
The unregularized discrete OT problem seeks a transport plan $P \in \mathbb{R}_{\ge 0}^{n \times m}$ minimizing the total cost $\langle P, C \rangle = \sum_{ij} P_{ij} C_{ij}$ under the marginal constraints $P \mathbf{1}_m = \mu$ and $P^{\top} \mathbf{1}_n = \nu$ defined by the source and target distributions. Empirically, entropic regularization (Sinkhorn) is introduced,

$$\min_{P \in \Pi(\mu, \nu)} \; \langle P, C \rangle - \varepsilon H(P),$$

where $H(P) = -\sum_{ij} P_{ij} (\log P_{ij} - 1)$ denotes the entropy and $\varepsilon > 0$ regulates 'softness' versus sparsity of the plan. Practical integration with linear weight matrices requires rescaling $P$ so that its rows sum to one, yielding the merge matrix $M$.
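As a concrete sketch, the entropic plan and row-rescaled merge matrix can be computed in a few lines of numpy; the neuron counts, target indices, and cost normalization below are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def sinkhorn_plan(C, mu, nu, eps=0.1, n_iter=200):
    """Entropy-regularized OT plan between histograms mu (n,) and nu (m,)
    for cost matrix C (n, m), via Sinkhorn-Knopp alternating scaling."""
    K = np.exp(-C / eps)                   # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)                 # enforce target marginal
        u = mu / (K @ v)                   # enforce source marginal
    return u[:, None] * K * v[None, :]     # plan P = diag(u) K diag(v)

def merge_matrix(P):
    """Rescale rows of the plan to sum to one, giving the map that
    merges the n source neurons onto the m target neurons."""
    return P / P.sum(axis=1, keepdims=True)

# Toy example: merge 6 'neurons' onto 3 hypothetical targets.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 16))               # mean activation per neuron
targets = [0, 2, 5]                        # assumed target index set T
C = ((A[:, None, :] - A[None, targets, :]) ** 2).sum(-1)
C /= C.max()                               # normalize cost scale for stability
P = sinkhorn_plan(C, np.full(6, 1 / 6), np.full(3, 1 / 3))
M = merge_matrix(P)                        # shape (6, 3), rows sum to one
```

Note the cost normalization: without it, the Gibbs kernel $\exp(-C/\varepsilon)$ can underflow to zero for large squared distances.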
2. Entropic OT and Integration with Transformer Normalization
The use of Sinkhorn-Knopp iterations allows efficient approximation of the entropic OT solution. For each Transformer layer, the (row-rescaled) OT map $M \in \mathbb{R}^{n \times m}$ is factored via thin QR decomposition, $M = QR$, where $Q \in \mathbb{R}^{n \times m}$ has orthonormal columns. Defining the projection by $Q$ and the unprojection by $Q^{+}$ (the Moore-Penrose pseudo-inverse, which equals $Q^{\top}$ for an orthonormal-column factor), this decomposition exploits the invariance of pre-norm RMSNorm under orthogonal maps: $\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)\,Q$ for orthogonal $Q$. Thus, the insertion of $Q$ and $Q^{+}$ into attention and feed-forward projections does not disrupt pre-norm stability, allowing for lossless transformation of signal through reduced-width layers.
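The commutation identity that RMSNorm invariance relies on can be checked numerically for a square orthogonal factor; a minimal numpy sketch (illustrative, not the paper's code):

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # Pre-norm RMSNorm without a learned scale: x / sqrt(mean(x^2))
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # batch of hidden states
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

lhs = rms_norm(x @ Q)                          # rotate, then normalize
rhs = rms_norm(x) @ Q                          # normalize, then rotate
# Orthogonal Q preserves norms and dimension, so the two paths agree.
```

Because RMSNorm depends only on each vector's norm and dimension, any orthogonal map commutes with it, which is precisely what lets the factors be absorbed into the neighboring projections.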
3. Workflow, Algorithmic Complexity, and Implementation
The DOTResize process is a one-shot, training-free pipeline. For each layer:
- Collect activations for calibration data.
- Select the top-$m$ target neurons by mean $\ell_2$ activation norm.
- Compute OT cost and transport plan via Sinkhorn iterations.
- Apply QR decomposition and generate project/unproject maps.
- Absorb the factors into all linear sub-layers.
- Output compressed weights with new width.
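The final absorption step can be sketched for the simplest case of one hidden dimension sandwiched between an 'up' and a 'down' projection; the real pipeline folds the QR factors into every attention and MLP matrix touching the hidden width, so the names and shapes here are illustrative assumptions:

```python
import numpy as np

def fold_merge(W_up, W_down, M):
    """Absorb an (n, m) merge map M into the weights around a hidden
    dimension of size n: W_up (d, n) produces the hidden activations,
    W_down (n, d) consumes them. Returns reduced-width weights."""
    M_pinv = np.linalg.pinv(M)             # (m, n) unprojection
    return W_up @ M, M_pinv @ W_down       # shapes (d, m) and (m, d)

rng = np.random.default_rng(1)
n, m, d = 8, 5, 4
W_up = rng.normal(size=(d, n))
W_down = rng.normal(size=(n, d))
M = rng.normal(size=(n, m))                # stand-in for the OT/QR factor
W_up2, W_down2 = fold_merge(W_up, W_down, M)   # hidden width n -> m
```

When the factor is square and invertible the folded pair computes exactly the original product; for $m < n$ the reconstruction is approximate, which is why the transport plan is designed to concentrate mass on genuinely redundant neurons.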
The total computational overhead is dominated by the OT (Sinkhorn) and QR steps, scaling as $O(nm)$ per Sinkhorn iteration and $O(nm^2)$ for the thin QR, respectively. Empirically, on the order of $2^{17}$–$2^{19}$ calibration tokens and a modest number of Sinkhorn iterations suffice for rapid compression on a single GPU per layer. On state-of-the-art LLMs (e.g., Llama-3.1 70B), inference device requirements drop from 8 to 6 V100 GPUs at 30% width reduction.
4. DOTResize versus Structured Pruning Strategies
Whereas structured pruning discards the 'least important' neurons and runs inference on the retained structured subnetwork, DOTResize merges the entire activation basis onto a smaller set, redistributing all useful information. This approach avoids the sharp performance drop observed in classical pruning and makes better use of pruned-out capacity by explicitly reprojecting signals that would otherwise be lost. Quantitatively, DOTResize consistently recovers 2–5 perplexity points over pruning baselines and outperforms the state-of-the-art SliceGPT, especially when composed with a PCA pre-projection.
5. Empirical Evaluation and Benchmarks
Empirical language modeling and zero-shot task experiments demonstrate robust performance of DOTResize. Wikitext-2 perplexity (lower is better), reported at 20% / 30% width reduction:

| Model | Magnitude-prune | DOTResize | SliceGPT | PCA+DOTResize |
|---|---|---|---|---|
| Llama-3.1-8B | 29.33 / 108.23 | 16.57 / 36.20 | 11.25 / 19.42 | 10.24 / 16.46 |
On zero-shot tasks at 20% sparsity, for Mistral-12B and Phi-4-12B, PCA+DOTResize provides +1–12 percentage points accuracy relative to SliceGPT. The method is effective across models and is easily composable with quantization schemes such as 4-bit post-training quantization for further size and speed improvements.
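The quantization composability can be illustrated with a generic per-channel symmetric 4-bit round-to-nearest scheme; this is a stand-in sketch, not the specific PTQ method evaluated in the paper:

```python
import numpy as np

def quantize_4bit(W):
    """Per-output-row symmetric round-to-nearest 4-bit quantization.
    Returns int codes in [-8, 7] plus per-row scales for dequantization."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
W = rng.normal(size=(16, 32)).astype(np.float32)   # e.g. a merged weight
q, s = quantize_4bit(W)
W_hat = dequantize(q, s)                 # elementwise error <= scale / 2
```

Since the merged weights are ordinary dense matrices, any such weight-only scheme applies after compression with no interaction beyond the usual quantization error.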
6. Dynamic-Resolution DOTResize for CNNs
The DRNet-based DOTResize module addresses compute redundancy in CNNs by dynamically selecting the smallest input resolution per image that maintains (or improves) accuracy. A lightweight resolution predictor is trained jointly with the backbone over a small candidate resolution set. The predictor's output, after softmax and straight-through Gumbel–Softmax reparameterization, yields a one-hot resolution assignment for each image. Each candidate resolution is handled by a private batch-normalization branch. The loss combines cross-entropy with a compute regularizer that constrains the average FLOPs to a target budget.
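The predictor's discrete-but-differentiable resolution choice can be sketched as follows; the candidate resolutions, logits, and per-branch FLOPs are hypothetical values for illustration:

```python
import numpy as np

def gumbel_softmax_onehot(logits, tau=1.0, rng=None):
    """Straight-through Gumbel-Softmax, forward pass only: a hard one-hot
    sample whose distribution follows softmax(logits / tau); the soft
    probabilities are what gradients would flow through during training."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    probs = y / y.sum(axis=-1, keepdims=True)             # soft (backward path)
    onehot = np.eye(logits.shape[-1])[probs.argmax(-1)]   # hard (forward path)
    return onehot, probs

resolutions = np.array([224, 168, 112])     # hypothetical candidate set
flops = np.array([4.1, 2.3, 1.0])           # hypothetical GFLOPs per branch
logits = np.array([[2.0, 0.5, -1.0],        # predictor output for 2 images
                   [-1.0, 0.3, 2.5]])
choice, probs = gumbel_softmax_onehot(logits, rng=np.random.default_rng(0))
chosen_res = resolutions[choice.argmax(-1)]      # one resolution per image
expected_flops = (probs * flops).sum(-1)         # differentiable compute term
```

The `expected_flops` term is what a compute regularizer would penalize against the target budget, keeping the objective differentiable despite the hard per-image choice.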
Practical integration requires minimal code overhead, leveraging on-the-fly resizing, BN multiplexing, and backpropagation through multitask losses. The system is compatible with any off-the-shelf CNN backbone.
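BN multiplexing can be as simple as keying normalization statistics by input resolution; a minimal inference-time sketch (hypothetical class, training-time running-statistic updates omitted):

```python
import numpy as np

class MultiBN:
    """One set of batch-norm statistics per candidate resolution, mirroring
    private per-resolution BN branches (inference path only)."""
    def __init__(self, channels, resolutions, eps=1e-5):
        self.eps = eps
        self.stats = {r: {"mean": np.zeros(channels),
                          "var": np.ones(channels)} for r in resolutions}

    def __call__(self, x, resolution):
        s = self.stats[resolution]         # select the branch for this size
        return (x - s["mean"]) / np.sqrt(s["var"] + self.eps)

bn = MultiBN(channels=4, resolutions=(224, 168, 112))
x = np.ones((2, 4))                        # dummy per-channel features
y = bn(x, 168)                             # normalized with the 168-px stats
```

Separate statistics per branch matter because feature magnitudes shift with input resolution, so shared BN statistics would mis-normalize the smaller inputs.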
Empirically, DRNet-based DOTResize achieves notable compute reductions without accuracy loss:
| Model | Params | FLOPs | FLOPs reduction | Top-1 | Top-5 |
|---|---|---|---|---|---|
| ResNet-50 (224) | 25.6M | 4.1G | — | 76.1% | 92.9% |
| DOTResize (default) | 30.5M | 3.7G | 10% | 77.5% | 93.5% |
| DOTResize (tighter compute target, 2.5) | 30.5M | 2.7G | 34% | 76.2% | 92.8% |
On MobileNetV2, DOTResize similarly outperforms fixed lower-resolution models.
7. Hyperparameters, Practical Tips, and Extensions
For Transformer DOTResize, the Sinkhorn regularization strength $\varepsilon$ (default 0.1) is effective, and 2¹⁷–2¹⁹ calibration tokens are sufficient. For CNN DOTResize, three-level resolution search spaces provide most of the gain, and the hyperparameter choices (the FLOPs penalty weight, the compute target, and an annealed Gumbel temperature) allow fine control of the FLOPs-accuracy tradeoff. OT and QR can be parallelized per layer and per GPU. Minimal engineering is required: 20–50 lines of predictor code, built-in PyTorch resizing, and batch-norm multiplexing.
A plausible implication is that DOTResize establishes a general paradigm for resource-efficient inference, where discrete optimal transport underlies principled width reduction and dynamic instance-aware input adaptation for deep neural networks. Both methodologies provide compute savings (10–44%) and, in tested cases, maintain or improve accuracy relative to baseline and state-of-the-art compression/pruning methods (Verma et al., 6 Jul 2025, Zhu et al., 2021).