Dual-Modality Thresholding (DMT)
- Dual-Modality Thresholding is a method that applies distinct thresholding rules based on modality, optimizing sparsification and separation in both deep learning and signal processing.
- In MoE-MLLMs, DMT sets tailored thresholds for text versus vision tokens, enhancing inference accuracy and reducing computational load through optimized expert skipping.
- In harmonic analysis, DMT employs wavelet and curvelet thresholds to effectively decompose images into pointlike and curvelike components with sub–5% error in practical applications.
Dual-Modality Thresholding (DMT) encompasses algorithmic techniques that separately process modalities—either distinct data types such as text and vision tokens in deep learning, or morphological structures such as points and curves in harmonic analysis—via thresholding procedures adapted to the respective signal or activation characteristics of each modality. The methodology has emerged independently in large-scale multimodal neural inference (Huang et al., 19 Nov 2025) and in geometric component separation (Kutyniok, 2012), with each context developing rigorous mathematical justification and practical procedures for modality-aware threshold selection and application.
1. Theoretical Rationale and Problem Setting
In both deep learning and harmonic analysis, DMT addresses the problem of decomposing a complex input (either a multimodal token sequence or a multimodal signal) by leveraging the distinct statistical or geometric properties of its constituent modalities. In Mixture-of-Experts Multimodal LLMs (MoE-MLLMs), standard expert-skipping procedures degrade performance because they do not account for the modality-specific impact of skipping experts: feed-forward updates are more critical for text tokens than for vision tokens, and different MoE layers contribute unequally (Huang et al., 19 Nov 2025). In geometric signal separation, DMT partitions an image into pointlike and curvelike components through appropriately tuned thresholding in complementary frames (wavelets and curvelets), exploiting their differential sparsity and decay rates (Kutyniok, 2012). In both settings, the principle is to modulate thresholds or selection rules according to modality, thus optimizing sparsification or efficiency with little loss of fidelity.
2. Formal Algorithms and Mathematical Foundations
Deep Learning: DMT for MoE-MLLM Inference
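Let $x$ denote a token of modality $m \in \{\text{text}, \text{vision}\}$, and let each MoE layer $l$ have expert-routing softmax logits $z^{(l)}(x)$ and resulting routing probabilities $\pi_i^{(l)}(x)$ for expert $i$. A global modulation factor $\alpha^{(l)}$ is computed to quantify the sensitivity to omitting layer $l$. The importance score used for thresholding is $s_i^{(l)}(x) = \alpha^{(l)} \cdot \pi_i^{(l)}(x)$. Two thresholds, $\tau_t$ (text) and $\tau_v$ (vision), are set. At inference, for token $x$ of modality $m$,
- If $s_i^{(l)}(x) < \tau_m$, expert $i$ is skipped for $x$ in layer $l$.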
- Otherwise, the expert output is accumulated weighted by the routing score.
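The skipping rule above can be sketched for a single token as follows. This is a minimal illustration and not the authors' implementation: the function name `dmt_moe_layer`, the top-k routing step, the absence of renormalization over the kept experts, and the toy scalar experts are all assumptions made for the example.

```python
import numpy as np

def dmt_moe_layer(x, modality, router_logits, experts, alpha,
                  tau_text, tau_vision, top_k=2):
    """One MoE layer for one token, with modality-dependent expert skipping.

    Hypothetical sketch: top-k routing and the un-renormalized weighted sum
    are assumptions, not details taken from the source.
    """
    # Softmax over router logits gives the routing probabilities.
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    # Experts selected by the router (assumed top-k routing).
    selected = np.argsort(probs)[::-1][:top_k]
    # Pick the threshold that matches the token's modality.
    tau = tau_text if modality == "text" else tau_vision
    out = np.zeros_like(x)
    kept = []
    for i in selected:
        score = alpha * probs[i]        # globally modulated importance score
        if score < tau:
            continue                    # skip this expert for this token
        out += probs[i] * experts[i](x) # accumulate, weighted by routing score
        kept.append(int(i))
    return out, kept

# Toy setup: four scalar "experts" and one token routed identically
# under both modalities.
experts = [lambda v, w=w: w * v for w in (1.0, 2.0, 3.0, 4.0)]
logits = np.array([2.0, 1.0, 0.0, -1.0])
token = np.ones(3)

_, kept_text = dmt_moe_layer(token, "text", logits, experts,
                             alpha=1.0, tau_text=0.1, tau_vision=0.3)
_, kept_vision = dmt_moe_layer(token, "vision", logits, experts,
                               alpha=1.0, tau_text=0.1, tau_vision=0.3)
```

With the lower text threshold, the text token keeps both routed experts, while the same routing under the higher vision threshold drops the weaker one.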
Harmonic Analysis: DMT for Geometric Separation
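Given a signal $f = P + C$, with $P$ pointlike and $C$ curvelike, and subband $f_j$ at scale $j$, the DMT proceeds by:
- Thresholding wavelet coefficients at a scale-dependent threshold $\theta_{1,j}$ to extract points,
- Subtracting the thresholded wavelet reconstruction to form a residual,
- Thresholding curvelet coefficients of the residual at $\theta_{2,j}$ to extract curves.
Mathematically, these steps induce index sets $T_{1,j}$ and $T_{2,j}$ with
$$T_{1,j} = \{\lambda : |\langle \psi_\lambda, f_j \rangle| \ge \theta_{1,j}\},$$
$$T_{2,j} = \{\eta : |\langle \gamma_\eta, R_j \rangle| \ge \theta_{2,j}\},$$
where $\psi_\lambda$ and $\gamma_\eta$ denote the wavelet and curvelet frame elements and $R_j$ is the residual formed in the second step.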
Thresholds are set to exploit the clustered sparsity of points in wavelets and the slower decay of curves in curvelets. Theoretical results establish that, as scale $j \to \infty$, the index sets converge in phase space to the wavefront sets of the pointlike component $P$ and the curvelike component $C$, achieving precise geometric separation (Kutyniok, 2012).
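The threshold, subtract, threshold pattern can be illustrated with a one-dimensional analogue. The sketch below deliberately swaps the wavelet and curvelet frames for two simpler orthonormal bases (the identity basis for spikes and a DCT-II basis for oscillations), works at a single scale, and uses hand-picked thresholds. It is therefore a toy instance of single-pass alternating thresholding, not the wavelet-curvelet procedure itself, and all names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II analysis matrix D (rows are atoms), so D @ D.T = I."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    D[0] /= np.sqrt(2.0)
    return D

def single_pass_separation(signal, D, theta_points, theta_waves):
    """Single-pass alternating hard thresholding in two bases."""
    # Step 1: hard-threshold in the spike (identity) basis to extract points.
    point_est = np.where(np.abs(signal) >= theta_points, signal, 0.0)
    # Step 2: subtract the thresholded reconstruction to form the residual.
    residual = signal - point_est
    # Step 3: hard-threshold the DCT coefficients of the residual.
    coef = D @ residual
    wave_est = D.T @ np.where(np.abs(coef) >= theta_waves, coef, 0.0)
    return point_est, wave_est

# Synthetic signal: three spikes plus two cosine atoms.
n = 256
D = dct_matrix(n)
points = np.zeros(n)
points[[20, 100, 200]] = [5.0, -4.0, 6.0]
coef_true = np.zeros(n)
coef_true[[7, 31]] = [4.0, -3.0]
waves = D.T @ coef_true

point_est, wave_est = single_pass_separation(points + waves, D,
                                             theta_points=2.0, theta_waves=1.0)
rel_err = ((np.linalg.norm(point_est - points) + np.linalg.norm(wave_est - waves))
           / (np.linalg.norm(points) + np.linalg.norm(waves)))
```

The separation works here because each component is sparse in its own basis and spread thinly across the other, which is the same intuition that motivates the wavelet-curvelet pairing.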
3. Threshold Selection, Optimization, and Frontier Search
MoE-MLLMs: Frontier Search for Optimal DMT Thresholds
Thresholds $(\tau_t, \tau_v)$ for text and vision tokens are selected by solving a constrained optimization problem on calibration data:
- Minimize the mean output KL divergence $D_{\mathrm{KL}}$ between the original and skipped models.
- Subject to a minimum expert-skipping ratio $\rho(\tau_t, \tau_v) \ge \rho_0$ for prescribed $\rho_0$.
A frontier search algorithm exploits monotonicity properties: the expert-skipping ratio $\rho$ is non-decreasing in the text threshold $\tau_t$ for a fixed vision threshold $\tau_v$, and the feasible threshold set forms a well-structured frontier. This allows $O(NG)$ search complexity versus the naive $O(NG^2)$ (where $N$ is the calibration set size and $G$ is the threshold grid resolution), and reduces threshold tuning from days to a few hours for large models (Huang et al., 19 Nov 2025).
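One way such a frontier walk could look is sketched below. It is an illustrative staircase search under stated assumptions, not the published algorithm: `skip_ratio` and `kl_div` are hypothetical callables standing in for evaluations on calibration data, and both are assumed to be non-decreasing in each threshold, so that for every text threshold the best feasible vision threshold is the smallest feasible one.

```python
import numpy as np

def frontier_search(grid_t, grid_v, skip_ratio, kl_div, rho_min):
    """Walk the feasibility frontier of a 2-D threshold grid.

    Assumes skip_ratio and kl_div are non-decreasing in each threshold and
    that both grids are sorted in ascending order. Returns the tuple
    (kl, tau_t, tau_v) with the smallest KL among feasible frontier points,
    or None if no grid point meets the skip-ratio constraint.
    """
    best = None
    j = len(grid_v) - 1                  # start at the largest vision threshold
    for tau_t in grid_t:                 # ascending text thresholds
        if skip_ratio(tau_t, grid_v[j]) < rho_min:
            continue                     # still infeasible at this text threshold
        # Lower the vision threshold while the constraint still holds.
        while j > 0 and skip_ratio(tau_t, grid_v[j - 1]) >= rho_min:
            j -= 1
        cand = (kl_div(tau_t, grid_v[j]), tau_t, grid_v[j])
        if best is None or cand[0] < best[0]:
            best = cand
    return best

# Toy surrogates: text thresholds hurt the output more than vision thresholds.
grid = np.linspace(0.0, 1.0, 21)
toy_ratio = lambda t, v: 0.4 * t + 0.6 * v
toy_kl = lambda t, v: 3.0 * t**2 + v**2

best = frontier_search(grid, grid, toy_ratio, toy_kl, rho_min=0.5)
```

Because the vision index only ever moves downward as the text threshold increases, the walk performs a number of evaluations linear in the grid resolution rather than visiting every grid pair.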
Harmonic Analysis: Scale-Dependent Threshold Heuristics
Thresholds in the wavelet-curvelet context are derived from coefficient decay analyses:
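- The wavelet threshold $\theta_{1,j}$ is set below the decay rate of the wavelet coefficients of point singularities.
- The curvelet threshold $\theta_{2,j}$ is tuned below the decay rate of the curvelet coefficients of curve singularities in the residual.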
These parameters are theoretically justified to ensure separation error decays algebraically with scale (Kutyniok, 2012).
4. Empirical and Theoretical Performance
MoE-MLLMs
Experiments on models such as Kimi-VL-A3B-Instruct and Qwen3-VL-MoE-30B show that DMT combined with globally modulated local gating (GMLG), the combination labeled MoDES in the table, retains more accuracy at matched skip rates than the MC-MoE and DiEP baselines, which do not distinguish between modalities:
| Skip rate | MC-MoE | DiEP | MoDES (DMT+GMLG) |
|---|---|---|---|
| 50% | 97.7% | 98.2% | 99.9% |
| 67% | 95.5% | 94.8% | 98.5% |
| 83% | 88.3% | 87.6% | 96.3% |
Prefill and decode speedups for batch-8, 83% skip are ×2.03 and ×1.24, respectively. On Qwen3-VL-MoE-30B, 88% skip retains 97.3% accuracy versus 86.7%–88.7% for unimodal methods, with speedups ×2.16 (prefill) and ×1.26 (decode). Combination with 2.5-bit quantization at 67% skip yields only a 6.1% accuracy drop for MoDES/DMT versus 10.4% for the baseline (Huang et al., 19 Nov 2025).
Harmonic Analysis
Theoretical results guarantee exact wavefront-set separation and an asymptotic $L^2$ separation error tending to zero as scale $j \to \infty$. Numerical experiments demonstrate relative separation errors below 5% at moderate scales and efficient, scalable computations for megapixel images (Kutyniok, 2012).
5. Connections to Clustered Sparsity, Coherence, and Multimodal Architectures
DMT exploits clustered sparsity: point singularities generate clusters of large coefficients in wavelet frames, while curves do so in curvelet frames. In MoE-MLLMs, the principle is analogous; text tokens produce higher-sensitivity routing and must be treated conservatively to preserve accuracy, whereas vision tokens benefit from aggressive sparsification. The notion of cluster coherence, which ensures limited overlap between the supports of the two modalities in their respective representations, is critical in harmonic analysis for ensuring separation guarantees (Kutyniok, 2012). In MoE-MLLMs, heterogeneous token-wise gating renders cluster coherence manifest as global layer sensitivity, incorporated via the layer-wise modulation factor $\alpha^{(l)}$ (Huang et al., 19 Nov 2025).
6. Applications, Extensions, and Implementation Considerations
DMT is integral to efficient inference in contemporary MoE-MLLMs, enabling high expert skip rates without major accuracy degradation and with significant reduction in computational cost. The technique generalizes to any multimodal MoE setting with modality-specific activation statistics. In geometric signal separation, DMT provides an efficient, interpretable approach suitable for large-scale imaging tasks requiring precise morphological decomposition. Scale-free implementation and efficient multi-threaded frame transforms make the method applicable to high-throughput data settings (Huang et al., 19 Nov 2025, Kutyniok, 2012). Extension to other dictionary/frame pairs (e.g., Meyer wavelets and shearlets) is justified under analogous sparsity and coherence assumptions, and DMT remains robust in noisy environments when decay conditions are met.
7. Summary and Comparative Perspective
Dual-Modality Thresholding provides a rigorous methodology for modality-aware sparsification and separation in both neural and signal processing domains. In MoE-MLLMs, its use of distinct thresholds for text and vision tokens, combined with global gating importance, substantially outperforms unimodal or layer-agnostic skipping rules in both accuracy and computational efficiency. In harmonic analysis, DMT achieves asymptotically perfect morphological separation, with guaranteed recovery of singularity wavefront sets and practical error rates below 5% on synthetic and real data. The technique's unifying criterion is its exploitation of modality-specific properties—statistical or geometric—realized through theoretically justified thresholding procedures (Huang et al., 19 Nov 2025, Kutyniok, 2012).