Modality Selection Procedure
- A modality selection procedure is a systematic method for dynamically identifying and weighting sensor or data modalities based on utility functions and constraints.
- It employs techniques such as greedy submodular maximization, soft/hard attention, dynamic ensembling, and federated strategies to optimize performance and resource use.
- Empirical results highlight its ability to achieve high accuracy and efficiency in diverse applications including multimodal classification, object tracking, and semantic segmentation.
A modality selection procedure is a systematic approach for identifying, weighting, or routing among multiple sensor, data, or representation sources (modalities) to maximize informativeness, computational efficiency, or robustness for a specific learning, inference, or communication objective. Recent literature demonstrates diverse algorithmic realizations of modality selection across deep multimodal learning, federated systems, cognitive robotics, large-scale retrieval, medical imaging, and more. Central distinctions among approaches include the level of adaptivity (static, dynamic per-instance, per-timestep), selection granularity (hard subset, soft weighting, routing, gating), and the degree to which selection is entangled with representation learning or prediction.
1. Formal Problem Definitions and Selection Criteria
Common formulations cast modality selection as an optimization of a utility, informativeness, or performance metric subject to computational, resource, or robustness constraints. Let $\mathcal{M} = \{m_1, \dots, m_k\}$ denote the available modalities, and let $f(S)$ score any subset $S \subseteq \mathcal{M}$ for predictive informativeness, mutual information with the label $Y$, or expected reduction in uncertainty. A canonical objective is:

$$S^\star \in \arg\max_{S \subseteq \mathcal{M},\; |S| \le b} f(S),$$

with possible additional knapsack constraints on modality-specific costs (Cheng et al., 2022).
Several frameworks leverage monotonicity and (approximate) submodularity of the subset utility $f$, enabling provable greedy or approximate algorithms for subset selection. Other approaches (such as DeepSuM (Gao et al., 3 Mar 2025)) define selection in terms of distance covariance between learned per-modality representations and targets, selecting those for which an empirical dependence statistic exceeds a threshold.
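Under a monotone, (approximately) submodular utility, the greedy subset-selection loop can be sketched in a few lines. The modality names and coverage-style utility below are illustrative stand-ins, not any cited paper's implementation:

```python
def greedy_select(modalities, utility, budget):
    """Iteratively add the modality with the largest marginal utility
    gain until the size budget is exhausted or no gain remains."""
    selected, remaining = [], list(modalities)
    while remaining and len(selected) < budget:
        base = utility(selected)
        best, best_gain = None, float("-inf")
        for m in remaining:
            gain = utility(selected + [m]) - base
            if gain > best_gain:
                best, best_gain = m, gain
        if best_gain <= 0:          # no modality still improves the objective
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy submodular utility: how many distinct "information units" a subset covers.
info = {"rgb": {1, 2, 3}, "depth": {3, 4}, "audio": {5}, "thermal": {2, 3}}
covered = lambda S: len(set().union(*(info[m] for m in S))) if S else 0
print(greedy_select(info, covered, budget=2))  # → ['rgb', 'depth']
```

Because coverage is monotone submodular, this loop inherits the classical $(1 - 1/e)$ guarantee; knapsack-constrained variants replace the size check with a per-modality cost budget.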
In dynamic and adaptive settings, selection may be task-conditional, context-specific, or even inference-time adaptive. Notable examples include gating networks that route features per-timestep or per-request (as in hierarchical recurrent models (Weng et al., 2021), transformer attention mechanisms (Jiang et al., 20 Apr 2025), or LLM-based routers (Rosa, 12 Jul 2025)) and strategies that maximize task-relevant information for each sample (Du et al., 30 Jan 2026).
Utility/reward functions frequently represent predictive gain, cross-entropy loss reduction, mutual information, Shapley value of a modality in a cooperative game of prediction (Yuan et al., 2023, Yuan et al., 2024), prototype-based similarity to class means (Du et al., 30 Jan 2026), or structured sparsity/regularization terms in regression (Yu et al., 2022).
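Among these utilities, the Shapley value of a modality is naturally estimated by Monte Carlo sampling over coalitions. The sketch below assumes a generic `utility` callable over modality subsets; it illustrates the estimator itself, not the federated procedures of the cited papers:

```python
import random

def shapley_mc(modalities, utility, target, n_perm=2000, seed=0):
    """Monte Carlo Shapley estimate: average marginal gain of adding
    `target` to a uniformly random coalition of the other modalities."""
    rng = random.Random(seed)
    others = [m for m in modalities if m != target]
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(others)
        cut = rng.randrange(len(others) + 1)   # random coalition size
        coalition = others[:cut]
        total += utility(coalition + [target]) - utility(coalition)
    return total / n_perm

# Sanity check: for an additive utility each modality's Shapley value
# equals its own marginal contribution (here, exactly 1.0).
print(shapley_mc(["rgb", "depth", "audio"], lambda S: len(S), "depth"))
```

Sampling a shuffled prefix of the other players is equivalent to sampling the target's predecessors in a uniformly random permutation, which is the textbook Shapley definition.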
2. Algorithmic Frameworks for Modality Selection
Approaches for modality selection can be organized by where and how selection is performed and learned.
- Greedy Submodular Maximization: At training or deployment time, select modalities by iteratively picking the one with the largest marginal utility until the budget is exhausted. Provides a $(1 - 1/e)$ approximation guarantee under (approximate) submodularity (Cheng et al., 2022).
- Soft or Hard Attention (Gating/Masking): Transformer attention or gating modules compute per-modality (or per-modality-time) weights, used to route, amplify, or suppress features. Softmax-based mechanisms provide differentiable learning, while hard gating converts weights into binary routing decisions (Jiang et al., 20 Apr 2025, Weng et al., 2021, Yang et al., 9 Nov 2025).
- Dynamic Selection/Ensembling: Competence-weighted or meta-learned ensemble methods dynamically select or weight unimodal regressors for each instance based on local error or meta-classifier competence (Menon et al., 2024).
- Federated Modality Selection: In FL, clients select which local modality-models to upload based on Shapley impact, model size (communication cost), and optionally recency of updates (Yuan et al., 2023, Yuan et al., 2024). These are often aggregated into a per-modality priority score; knapsack or greedy selection applies (Yuan et al., 2023, Yuan et al., 2024).
- Unsupervised and Domain-Robust Selection: For domain adaptation, selection may be driven by unsupervised metrics—prediction correlation and mean maximum discrepancy—computed on unlabeled target data, with automatic thresholding via winsorized statistics (Marinov et al., 2022).
- Hierarchical Multi-Scale Selection: In semantic segmentation, selection can be performed hierarchically at multiple backbone levels, scoring each modality by similarity to the aggregated mean feature at every granularity (Zheng et al., 2024).
- Request-Aware and SLO-Constrained Routing: In large-scale inference or retrieval, selection/routing strategies are computed globally across possible combinations of batch size, modality subset, latency, and accuracy SLO, solved offline via integer linear programming and updated adaptively at serving time (Hu et al., 2023).
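The soft-versus-hard distinction in the gating bullet above can be made concrete in a few lines; the scores, temperature, and top-k rule are illustrative choices, not a specific paper's gate:

```python
import numpy as np

def soft_gate(scores, temperature=1.0):
    """Differentiable softmax weights over per-modality relevance scores."""
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()                       # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def hard_route(weights, k=1):
    """Binarize soft weights into a top-k routing mask (hard gating)."""
    mask = np.zeros_like(weights)
    mask[np.argsort(weights)[::-1][:k]] = 1.0
    return mask

w = soft_gate([2.0, 0.5, 1.0])         # e.g. rgb, audio, depth logits
mask = hard_route(w, k=2)              # w ≈ [0.629, 0.140, 0.231]; mask = [1, 0, 1]
```

Soft weights keep the pipeline differentiable for end-to-end training, while the hard mask lets inference skip computation for suppressed modalities; in practice the hard decision is often trained through a straight-through or Gumbel-softmax relaxation.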
3. Modular Architectures and Integration Points
Selection is operationalized at varying points in the processing pipeline. Representative integrations:
- Input Selection/Gating: Discriminative modules after patch or feature embedding identify the current modality (e.g., thermal, depth, event) and activate corresponding sub-adapters (Wang et al., 25 Feb 2025). The predicted modality label serves as a "gate" for per-modality latent processing.
- Latent Feature Selection: Adaptive selection may occur in the intermediate representation space, through attention modules or adapters that leverage scoring/gating signals at each block or layer (Wang et al., 25 Feb 2025, Zheng et al., 2024).
- Output Layer/Head Selection: Task-customized adapters at the head project the fused feature into head-specific spaces, filtering modality-specific noise before the final output layer (Wang et al., 25 Feb 2025).
- Federated/Distributed Elements: In FL, selection occurs client-side post-local training, using local statistics to determine upload sets, with aggregation and update policies determined server-side (Yuan et al., 2023, Yuan et al., 2024).
- Request/Query Routing: Large-scale serving or retrieval system routers—implemented by LLMs or lookup tables—assign modality subsets in response to each incoming request/query (Rosa, 12 Jul 2025, Hu et al., 2023).
A defining property of high-performing systems is unified parameter sharing with low selection overhead: e.g., UASTrack (Wang et al., 25 Feb 2025) achieves model and parameter unification with minimal extra parameters for adaptive selection, and MAGIC++ (Zheng et al., 2024) employs plug-and-play hierarchical selection modules compatible with a range of backbones.
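The input-gating pattern (identify the incoming modality, then activate only its sub-adapter) reduces to a dispatch over per-modality modules. The classifier and adapters below are toy stand-ins, not the UASTrack components:

```python
def route_to_adapter(x, identify, adapters):
    """Predict the input's modality and apply only the matching
    sub-adapter; shared-backbone processing would surround this call."""
    label = identify(x)                 # hypothetical modality classifier
    return adapters[label](x)

# Toy stand-ins: identification via a tag, adapters as simple transforms.
identify = lambda x: x["tag"]
adapters = {"thermal": lambda x: [2 * v for v in x["feat"]],
            "depth":   lambda x: [v + 1 for v in x["feat"]]}
print(route_to_adapter({"tag": "depth", "feat": [1, 2]}, identify, adapters))  # → [2, 3]
```

Only the selected adapter's parameters are exercised per input, which is what keeps the extra cost of modality discrimination small relative to the shared backbone.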
4. Optimization Procedures and Loss Functions
Training objectives in modality selection frameworks reflect the need to both learn informative representations and enforce the desired selection behavior:
- Main Prediction Losses: Cross-entropy, regression (L1/L2), or focal loss for the target task (classification, regression, detection, etc.) (Wang et al., 25 Feb 2025, Yang et al., 9 Nov 2025).
- Selection Regularizers: Auxiliary losses incentivize correct gate predictions (e.g., cross-entropy constraint for modality identification (Wang et al., 25 Feb 2025)), stability of selection (contrastive InfoNCE enforcing fused-pool similarity to unimodal embeddings (Yang et al., 9 Nov 2025)), or usage constraints (penalizing deviation from budgeted selection rates (Weng et al., 2021)).
- Structured Sparsity Penalties: $\ell_1$- or group-lasso-style penalties over feature-selection factors (e.g., in tensor regression, zeroing a modality's factors prunes that modality) (Yu et al., 2022).
- Prototype and Information Rewards: For dynamic selection with missing data, use reductions in cross-entropy or in Bregman-divergence to class prototypes as selection rewards, calibrated by intra-class similarity (Du et al., 30 Jan 2026).
- Submodular Function Maximization: Greedy maximization or stochastic greedy approaches, supported by theoretical guarantees when the objective is submodular (Cheng et al., 2022, Fan et al., 2023).
- Federated Consensus: Cross-client or global prototype alignment losses to mitigate modal bias and achieve balance in multimodal federated settings (Fan et al., 2023).
Selection thresholds and trade-off hyperparameters (e.g., weights on impact, size, and recency in FL (Yuan et al., 2024), the balance between accuracy and communication (Yuan et al., 2023), regularization multipliers) are typically set via cross-validation or an adaptive schedule.
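A training objective combining a task loss with a usage-budget regularizer of the kind listed above can be sketched as follows; the budget target and weight `lam` are hypothetical hyperparameters:

```python
def selection_loss(task_loss, gate_probs, budget=0.5, lam=0.1):
    """Total loss = task loss + quadratic penalty on the gap between the
    mean gate activation and a target usage budget."""
    mean_usage = sum(gate_probs) / len(gate_probs)
    return task_loss + lam * (mean_usage - budget) ** 2

# Gates averaging 0.6 against a 0.5 budget add a small quadratic penalty.
print(selection_loss(0.8, [0.9, 0.7, 0.2]))  # → ~0.801
```

The penalty pushes the learned gates toward the budgeted selection rate without forcing any single gate open or closed, which is why such terms pair naturally with differentiable soft gating.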
5. Practical Outcomes and Empirical Highlights
Table: Empirical Gains from Modality Selection (Selected Results)
| Paper (arXiv) | Setting/Task | Key Gains After Modality Selection |
|---|---|---|
| (Cheng et al., 2022) | Multimodal classification (Patch-MNIST, PEMS-SF, CMU-MOSI) | Greedy selection yields >98% accuracy with only 7/49 modalities, outperforms random and feature-importance ranking as #modalities grows |
| (Wang et al., 25 Feb 2025) | Single Object Tracking (multimodal, RGB-X) | Achieves competitive tracking with only +1.87M params and +1.95 GFLOPs for adaptive selection/discrimination; state-of-the-art on five benchmarks |
| (Du et al., 30 Jan 2026) | Incomplete Multimodal Classification | +5.7% (absolute) in accuracy at 80% missing (PolyMNIST); +4.1% on DVM; +1.9% AUC on UKBB (70% missing), robust to severe missingness |
| (Yuan et al., 2023) | Federated Learning, ActionSense | 4×–5× comm. reduction vs. baselines, same or higher accuracy; adaptive upload switches from small (eye) to informative (Myo/Xsens) modalities over time |
| (Fan et al., 2023) | Federated Audio-Visual Learning | Modal-balanced selection improves global accuracy and convergence speed vs. pure greedy fusion; achieves feature- and global-level modal diversity |
| (Zheng et al., 2024) | Semantic Segmentation (Modality-Agnostic) | Outperforms prior arts in both conventional and modality-agnostic settings; robust to sensor/environmental failures without dependence on RGB |
Empirical findings include: selection sharply reduces computational, communication, and data annotation costs; increases robustness to missing or noisy modalities; and—in many settings—outperforms naive late or early fusion baselines. In federated and serving settings, throughput and job-completion improvements of 3.6–11× are reported under latency/accuracy SLO constraints (Hu et al., 2023).
6. Special Considerations, Limitations, and Future Directions
- Adaptivity and Uncertainty: Fully dynamic selection requires reliable estimation of informativeness per instance or query. RL or meta-learning of selection policies is an open research area, as are explicit strategies for handling ambiguity and multi-modal necessity.
- Supervision and System Complexity: Some methods depend on labeled data for calibration, while others operate entirely unsupervised (e.g., correlation/discrepancy-based selection (Marinov et al., 2022)). The design must reflect available supervision, missing-data realities, and system constraints.
- Extensions to New Tasks/Modalities: Procedures such as cross-modality attention (Jiang et al., 20 Apr 2025), hierarchical selection (Zheng et al., 2024), and fully differentiable gating lend themselves to extension as modality counts and diversity grow, and can support compositional or skill-level selection for robotics or temporal tasks.
- Scalability and Efficiency: For large numbers of modalities, computational bottlenecks (e.g., Shapley-value computation) are addressed with Monte Carlo or approximate submodular methods (Yuan et al., 2023, Cheng et al., 2022). Stepwise and knapsack approximations are typical in real-time or bandwidth-constrained settings.
- Interpretability and Diagnostic Insights: Selection weights, gate activations, and attention scores provide per-sample or per-class interpretability about modality contribution, failure cases, and dataset biases (e.g., in TVQA, most questions answerable by one modality; attention-inspection reveals model weaknesses (Madvil et al., 2023)).
A plausible implication is continued unification of selection with adaptive representation learning and decision support, using both analytical utility structures and data-driven mechanism design. The goal is sustainable, transparent, and robust multimodal systems operating at scale, in which the value of each modality is dynamically and optimally realized.