DOTResize: Optimal Model Compression
- DOTResize is a model efficiency method that employs discrete optimal transport for structured width reduction in Transformers by merging redundant neurons.
- A companion DRNet-based variant dynamically selects per-image input resolutions for CNNs, balancing per-instance compute cost with accuracy.
- Empirical evaluations show 10–44% compute savings and improved performance over conventional pruning methods, enabling practical resource-efficient inference.
DOTResize encompasses two distinct yet complementary lines of model efficiency research: (1) width reduction in Transformers via neuron merging based on discrete optimal transport (DOT) (Verma et al., 6 Jul 2025), and (2) dynamic input-resolution selection for convolutional neural networks (CNNs) using per-instance compute-accuracy optimization, as grounded in the Dynamic Resolution Network (DRNet) methodology (Zhu et al., 2021). Both approaches use the term "DOTResize" to denote data- and structure-aware reductions of model redundancy, targeting either internal representation (width) or external input adaptation (resolution), each underpinned by mathematically principled frameworks.
1. Neuron Merging in Transformers via Discrete Optimal Transport
DOTResize for Transformer compression reframes width reduction as a discrete optimal transport problem, targeting neuron-level redundancies and merging similar neurons rather than discarding information. Given a neural layer with hidden width $n$, the goal is to reduce the width to $m < n$ by merging the entire neuron inventory onto a smaller basis. The set of $n$ original neurons is treated as a discrete 'source' distribution $\mu$, and a chosen subset $T \subseteq \{1, \dots, n\}$ with $|T| = m$ (e.g., the neurons with the highest mean $\ell_2$ activation norm) forms the 'target' distribution $\nu$. The method utilizes calibration data to gather activations and constructs a ground-cost matrix $C \in \mathbb{R}^{n \times m}$ with entries $C_{ij} = \lVert \bar{a}_i - \bar{a}_j \rVert_2^2$ for $j \in T$, where $\bar{a}_i$ denotes the calibration-averaged activation profile of neuron $i$.
The unregularized discrete OT problem seeks a transport plan $P \in \mathbb{R}_{\ge 0}^{n \times m}$ minimizing the total cost $\langle P, C \rangle = \sum_{ij} P_{ij} C_{ij}$ under the marginal constraints $P \mathbf{1}_m = \mu$ and $P^{\top} \mathbf{1}_n = \nu$ defined by the source and target distributions. Empirically, entropic regularization (Sinkhorn) is introduced,

$$\min_{P \in \Pi(\mu, \nu)} \; \langle P, C \rangle - \varepsilon H(P),$$

where $H(P) = -\sum_{ij} P_{ij} (\log P_{ij} - 1)$ denotes the entropy and $\varepsilon > 0$ regulates 'softness' versus sparsity of the plan. Practical integration with linear weight matrices requires rescaling $P$ so that its rows sum to one, yielding the merge matrix $M$.
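As a concrete sketch, the entropic plan and row-rescaled merge matrix can be computed in a few lines of numpy; the neuron counts, target indices, and cost normalization below are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def sinkhorn_plan(C, mu, nu, eps=0.1, n_iter=200):
    """Entropy-regularized OT plan between histograms mu (n,) and nu (m,)
    for cost matrix C (n, m), via Sinkhorn-Knopp alternating scaling."""
    K = np.exp(-C / eps)                   # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)                 # enforce target marginal
        u = mu / (K @ v)                   # enforce source marginal
    return u[:, None] * K * v[None, :]     # plan P = diag(u) K diag(v)

def merge_matrix(P):
    """Rescale rows of the plan to sum to one, giving the map that
    merges the n source neurons onto the m target neurons."""
    return P / P.sum(axis=1, keepdims=True)

# Toy example: merge 6 'neurons' onto 3 hypothetical targets.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 16))               # mean activation per neuron
targets = [0, 2, 5]                        # assumed target index set T
C = ((A[:, None, :] - A[None, targets, :]) ** 2).sum(-1)
C /= C.max()                               # normalize cost scale for stability
P = sinkhorn_plan(C, np.full(6, 1 / 6), np.full(3, 1 / 3))
M = merge_matrix(P)                        # shape (6, 3), rows sum to one
```

Note the cost normalization: without it, the Gibbs kernel $\exp(-C/\varepsilon)$ can underflow to zero for large squared distances.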
2. Entropic OT and Integration with Transformer Normalization
The use of Sinkhorn-Knopp iterations allows efficient approximation of the entropic OT solution. For each Transformer layer, the (row-rescaled) OT map $M \in \mathbb{R}^{n \times m}$ is factored via thin QR decomposition, $M = QR$, where $Q \in \mathbb{R}^{n \times m}$ has orthonormal columns. Defining the projection by $Q$ and the unprojection by $Q^{+}$ (the Moore-Penrose pseudo-inverse, which equals $Q^{\top}$ for an orthonormal-column factor), this decomposition exploits the invariance of pre-norm RMSNorm under orthogonal maps: $\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)\,Q$ for orthogonal $Q$. Thus, the insertion of $Q$ and $Q^{+}$ into attention and feed-forward projections does not disrupt pre-norm stability, allowing for lossless transformation of signal through reduced-width layers.
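The commutation identity that RMSNorm invariance relies on can be checked numerically for a square orthogonal factor; a minimal numpy sketch (illustrative, not the paper's code):

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # Pre-norm RMSNorm without a learned scale: x / sqrt(mean(x^2))
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # batch of hidden states
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

lhs = rms_norm(x @ Q)                          # rotate, then normalize
rhs = rms_norm(x) @ Q                          # normalize, then rotate
# Orthogonal Q preserves norms and dimension, so the two paths agree.
```

Because RMSNorm depends only on each vector's norm and dimension, any orthogonal map commutes with it, which is precisely what lets the factors be absorbed into the neighboring projections.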
3. Workflow, Algorithmic Complexity, and Implementation
The DOTResize process is a one-shot, training-free pipeline. For each layer:
- Collect activations for calibration data.
- Select the top-$m$ target neurons by mean $\ell_2$ activation norm.
- Compute OT cost and transport plan via Sinkhorn iterations.
- Apply QR decomposition and generate project/unproject maps.
- Absorb the factors into all linear sub-layers.
- Output compressed weights with new width.
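The final absorption step can be sketched for the simplest case of one hidden dimension sandwiched between an 'up' and a 'down' projection; the real pipeline folds the QR factors into every attention and MLP matrix touching the hidden width, so the names and shapes here are illustrative assumptions:

```python
import numpy as np

def fold_merge(W_up, W_down, M):
    """Absorb an (n, m) merge map M into the weights around a hidden
    dimension of size n: W_up (d, n) produces the hidden activations,
    W_down (n, d) consumes them. Returns reduced-width weights."""
    M_pinv = np.linalg.pinv(M)             # (m, n) unprojection
    return W_up @ M, M_pinv @ W_down       # shapes (d, m) and (m, d)

rng = np.random.default_rng(1)
n, m, d = 8, 5, 4
W_up = rng.normal(size=(d, n))
W_down = rng.normal(size=(n, d))
M = rng.normal(size=(n, m))                # stand-in for the OT/QR factor
W_up2, W_down2 = fold_merge(W_up, W_down, M)   # hidden width n -> m
```

When the factor is square and invertible the folded pair computes exactly the original product; for $m < n$ the reconstruction is approximate, which is why the transport plan is designed to concentrate mass on genuinely redundant neurons.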
The total computational overhead is dominated by the OT (Sinkhorn) and QR steps, scaling as $O(nm)$ per Sinkhorn iteration and $O(nm^2)$ for the thin QR, respectively. Empirically, on the order of $2^{17}$–$2^{19}$ calibration tokens and a modest number of Sinkhorn iterations suffice for rapid compression on a single GPU per layer. On state-of-the-art LLMs (e.g., Llama-3.1 70B), inference device requirements drop from 8 to 6 V100 GPUs at 30% width reduction.
4. DOTResize versus Structured Pruning Strategies
Whereas structured pruning discards the 'least important' neurons and runs inference on the retained structured subnetwork, DOTResize merges the entire activation basis onto a smaller set, redistributing all useful information. This approach avoids the sharp performance drop observed in classical pruning and makes better use of pruned-out capacity by explicitly reprojecting signals that would otherwise be lost. Quantitatively, DOTResize consistently recovers 2–5 perplexity points over pruning baselines and outperforms the state-of-the-art SliceGPT, especially when composed with a PCA pre-projection.
5. Empirical Evaluation and Benchmarks
Empirical language modeling and zero-shot task experiments demonstrate robust performance of DOTResize. Wikitext-2 perplexity (lower is better), reported at 20% / 30% width reduction:

| Model | Magnitude-prune | DOTResize | SliceGPT | PCA+DOTResize |
|---|---|---|---|---|
| Llama-3.1-8B | 29.33 / 108.23 | 16.57 / 36.20 | 11.25 / 19.42 | 10.24 / 16.46 |
On zero-shot tasks at 20% sparsity, for Mistral-12B and Phi-4-12B, PCA+DOTResize provides +1–12 percentage points accuracy relative to SliceGPT. The method is effective across models and is easily composable with quantization schemes such as 4-bit post-training quantization for further size and speed improvements.
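The quantization composability can be illustrated with a generic per-channel symmetric 4-bit round-to-nearest scheme; this is a stand-in sketch, not the specific PTQ method evaluated in the paper:

```python
import numpy as np

def quantize_4bit(W):
    """Per-output-row symmetric round-to-nearest 4-bit quantization.
    Returns int codes in [-8, 7] plus per-row scales for dequantization."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
W = rng.normal(size=(16, 32)).astype(np.float32)   # e.g. a merged weight
q, s = quantize_4bit(W)
W_hat = dequantize(q, s)                 # elementwise error <= scale / 2
```

Since the merged weights are ordinary dense matrices, any such weight-only scheme applies after compression with no interaction beyond the usual quantization error.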
6. Dynamic-Resolution DOTResize for CNNs
The DRNet-based DOTResize module addresses compute redundancy in CNNs by dynamically selecting the smallest input resolution per image that maintains (or improves) accuracy. A lightweight resolution predictor is trained jointly with the backbone over a small candidate resolution set. The predictor's output, after softmax and straight-through Gumbel–Softmax reparameterization, yields a one-hot resolution assignment for each image. Each candidate resolution is handled by a private batch-normalization branch. The loss combines cross-entropy with a compute regularizer that constrains the average FLOPs to a target budget.
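The predictor's discrete-but-differentiable resolution choice can be sketched as follows; the candidate resolutions, logits, and per-branch FLOPs are hypothetical values for illustration:

```python
import numpy as np

def gumbel_softmax_onehot(logits, tau=1.0, rng=None):
    """Straight-through Gumbel-Softmax, forward pass only: a hard one-hot
    sample whose distribution follows softmax(logits / tau); the soft
    probabilities are what gradients would flow through during training."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    probs = y / y.sum(axis=-1, keepdims=True)             # soft (backward path)
    onehot = np.eye(logits.shape[-1])[probs.argmax(-1)]   # hard (forward path)
    return onehot, probs

resolutions = np.array([224, 168, 112])     # hypothetical candidate set
flops = np.array([4.1, 2.3, 1.0])           # hypothetical GFLOPs per branch
logits = np.array([[2.0, 0.5, -1.0],        # predictor output for 2 images
                   [-1.0, 0.3, 2.5]])
choice, probs = gumbel_softmax_onehot(logits, rng=np.random.default_rng(0))
chosen_res = resolutions[choice.argmax(-1)]      # one resolution per image
expected_flops = (probs * flops).sum(-1)         # differentiable compute term
```

The `expected_flops` term is what a compute regularizer would penalize against the target budget, keeping the objective differentiable despite the hard per-image choice.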
Practical integration requires minimal code overhead, leveraging on-the-fly resizing, BN multiplexing, and backpropagation through multitask losses. The system is compatible with any off-the-shelf CNN backbone.
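BN multiplexing can be as simple as keying normalization statistics by input resolution; a minimal inference-time sketch (hypothetical class, training-time running-statistic updates omitted):

```python
import numpy as np

class MultiBN:
    """One set of batch-norm statistics per candidate resolution, mirroring
    private per-resolution BN branches (inference path only)."""
    def __init__(self, channels, resolutions, eps=1e-5):
        self.eps = eps
        self.stats = {r: {"mean": np.zeros(channels),
                          "var": np.ones(channels)} for r in resolutions}

    def __call__(self, x, resolution):
        s = self.stats[resolution]         # select the branch for this size
        return (x - s["mean"]) / np.sqrt(s["var"] + self.eps)

bn = MultiBN(channels=4, resolutions=(224, 168, 112))
x = np.ones((2, 4))                        # dummy per-channel features
y = bn(x, 168)                             # normalized with the 168-px stats
```

Separate statistics per branch matter because feature magnitudes shift with input resolution, so shared BN statistics would mis-normalize the smaller inputs.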
Empirically, DRNet-based DOTResize achieves notable compute reductions without accuracy loss:
| Model | Params | FLOPs | FLOPs reduction | Top-1 | Top-5 |
|---|---|---|---|---|---|
| ResNet-50 (224) | 25.6M | 4.1G | — | 76.1% | 92.9% |
| DOTResize (default) | 30.5M | 3.7G | 10% | 77.5% | 93.5% |
| DOTResize (tighter compute target, 2.5) | 30.5M | 2.7G | 34% | 76.2% | 92.8% |
On MobileNetV2, DOTResize similarly outperforms fixed lower-resolution models.
7. Hyperparameters, Practical Tips, and Extensions
For Transformer DOTResize, the Sinkhorn regularization strength $\varepsilon$ (default 0.1) is effective, and 2¹⁷–2¹⁹ calibration tokens are sufficient. For CNN DOTResize, three-level resolution search spaces provide most of the gain, and the hyperparameter choices (the FLOPs penalty weight, the compute target, and an annealed Gumbel temperature) allow fine control of the FLOPs-accuracy tradeoff. OT and QR can be parallelized per layer and per GPU. Minimal engineering is required: 20–50 lines of predictor code, built-in PyTorch resizing, and batch-norm multiplexing.
A plausible implication is that DOTResize establishes a general paradigm for resource-efficient inference, where discrete optimal transport underlies principled width reduction and dynamic instance-aware input adaptation for deep neural networks. Both methodologies provide compute savings (10–44%) and, in tested cases, maintain or improve accuracy relative to baseline and state-of-the-art compression/pruning methods (Verma et al., 6 Jul 2025, Zhu et al., 2021).