Complementary Priority Masking
- Complementary Priority Masking is defined by paired masked views where one view’s visible regions complement the other's hidden regions, ensuring full input coverage.
- It integrates a priority mechanism that directs the masking budget to more informative or challenging areas, improving tasks like RGB–Thermal segmentation.
- The approach is applied across diverse domains—including point clouds, video captioning, and domain-adaptive segmentation—demonstrating robust performance improvements.
Complementary Priority Masking (“CPM”; Editor’s term) denotes a class of masking strategies in which multiple masked views are constructed so that their visible regions, tokens, or substructures are complementary, while the masking budget is optionally directed toward harder, more informative, or more reliable content. In RGB–Thermal semantic segmentation, the clearest early formulation is Complementary Random Masking, which masks disjoint patch regions in RGB and thermal streams so that every spatial location remains visible in at least one modality, and couples this with self-distillation between clean and masked inputs (Shin et al., 2023). Related formulations appear in rotation-invariant point-cloud masked autoencoding, weakly supervised dense video captioning, domain-adaptive segmentation, RGB–depth feature dropout, multiview circuit representation learning, and diffusion LLM training (Yin et al., 18 Sep 2025, Ge et al., 2024, Wang et al., 16 Jul 2025, Yang et al., 2024, Shi et al., 25 Sep 2025, Ma et al., 16 Mar 2026). The term “priority” is not used uniformly across these works; in several cases it is an explicit mechanism, whereas in others it is an interpretive description of curriculum, salience, or alignment scheduling.
1. Definition and scope
The defining property of complementary masking is that masked views are not sampled independently. Instead, one view’s hidden regions are paired with another view’s visible regions, or a positive mask is paired with its exact complement, so that the pair jointly covers the full input. In the RGB–Thermal formulation, complementarity means that if one modality is masked in some regions, the other modality is unmasked in those exact regions, guaranteeing that each spatial location is visible to the network in at least one modality (Shin et al., 2023). In weakly supervised dense video captioning, the positive temporal mask of an event is paired with its complementary negative mask, and the two masked videos are trained to yield complementary caption subsets that together form a complete video description (Ge et al., 2024). In domain-adaptive segmentation, the dual form uses a binary mask $D$ and its complement $1-D$ to create two masked target views whose union equals the original image and whose intersection is empty (Wang et al., 16 Jul 2025).
A second axis is prioritization. In some formulations this is explicit. The proposed Complementary Priority Masking variant for RGB–Thermal segmentation defines a patch-level priority map and samples a complementary mask under a fixed ratio according to normalized priority probabilities (Shin et al., 2023). In rotation-invariant point-cloud MAE, the “priority” component is realized by curriculum learning with dynamic weights that shift emphasis from geometric masking to semantic masking over training time (Yin et al., 18 Sep 2025). In diffusion LLMs, information-dense hubs are assigned higher mask probabilities under a mass-conserving scheduler, and the complement forms a syntax-oriented view (Ma et al., 16 Mar 2026).
A compact cross-domain summary is as follows.
| Setting | Complementary mechanism | Priority mechanism |
|---|---|---|
| RGB–Thermal segmentation | Patch masks $M$ and $1-M$ across modalities | Optional priority map over patches |
| Point-cloud MAE | Geometric and semantic masking streams | Curriculum weight |
| Dense video captioning | Positive mask and negative mask | Soft salience through learned mask magnitude |
| UDA segmentation / RGB–depth UDA | Dual masked views or complementary dropout | Optional weighting or scheduled retention |
| Diffusion LLMs | Token mask and its complement | Information-density-driven probabilities |
This suggests a unifying abstraction: complementarity supplies coverage and anti-shortcut pressure, whereas priority determines where that pressure is concentrated.
2. Complementarity as a masking invariant
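In the RGB–Thermal case, complementarity is implemented at the input-token level, aligned with Swin Transformer patchification. For tokenized inputs $X_{\mathrm{rgb}}$ and $X_{\mathrm{thr}}$, a random binary mask $M$ is sampled on one modality and its exact complement $1-M$ is applied to the other modality. Masked tokens are replaced by learnable mask-token vectors $e_{\mathrm{rgb}}$ and $e_{\mathrm{thr}}$, yielding
$$\tilde{X}_{\mathrm{rgb}} = (1-M)\odot X_{\mathrm{rgb}} + M\odot e_{\mathrm{rgb}}, \qquad \tilde{X}_{\mathrm{thr}} = M\odot X_{\mathrm{thr}} + (1-M)\odot e_{\mathrm{thr}},$$
where $M_i = 1$ marks a token that is masked in the RGB stream and therefore visible in the thermal stream.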
This ensures that at least one valid modality is available per patch (Shin et al., 2023). The same work explicitly contrasts this with independent random masking, which can erase the same region in both modalities and degrades performance.
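The complementary sampling step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names, the string placeholders standing in for learnable mask tokens, and the convention that `True` means "masked" are all assumptions made for the example.

```python
import random

def complementary_token_masks(num_tokens, mask_ratio, rng):
    """Return (rgb_masked, thermal_masked); True means the token is masked.

    Hypothetical helper: the thermal mask is the exact complement of the
    RGB mask, so no token index is hidden in both modalities.
    """
    num_masked = round(num_tokens * mask_ratio)
    masked_idx = set(rng.sample(range(num_tokens), num_masked))
    rgb_masked = [i in masked_idx for i in range(num_tokens)]
    thermal_masked = [not m for m in rgb_masked]
    return rgb_masked, thermal_masked

def apply_mask(tokens, masked, mask_token):
    """Replace masked tokens with a placeholder standing in for a learnable mask token."""
    return [mask_token if m else tok for tok, m in zip(tokens, masked)]

rng = random.Random(0)
rgb_tokens = [f"rgb{i}" for i in range(8)]
thr_tokens = [f"thr{i}" for i in range(8)]

rgb_masked, thr_masked = complementary_token_masks(8, 0.5, rng)
rgb_view = apply_mask(rgb_tokens, rgb_masked, "[MASK_RGB]")
thr_view = apply_mask(thr_tokens, thr_masked, "[MASK_THR]")
```

Independent sampling of the two masks would not give the same guarantee, because the same index could be drawn for both modalities.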
The same structural invariant appears in other domains, but with different carriers. In WSDVC, complementarity operates over time rather than space. The model predicts a normalized center $c_i$ and width $w_i$ for each event and constructs a Gaussian temporal mask $M_i$ centered at $c_i$ with spread governed by $w_i$, with the negative mask fixed as $1-M_i$ (Ge et al., 2024). The complement is therefore hard-wired rather than learned independently.
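A soft temporal mask of this kind, together with its hard-wired complement, might look as follows. The exact parameterization used in the paper is not reproduced here; the `steepness` divisor, the normalization of frame indices to $[0, 1]$, and the function name are assumptions chosen only to make the sketch concrete.

```python
import math

def gaussian_temporal_masks(num_frames, center, width, steepness=1.0):
    """Return (positive, negative) soft masks over frames.

    Assumed form: exp(-(t - center)^2 / (2 * sigma^2)) with
    sigma = width / steepness and t normalized to [0, 1].
    The negative mask is defined as 1 - positive, never predicted separately.
    """
    sigma = max(width / steepness, 1e-6)  # guard against a zero width
    positive = []
    for frame in range(num_frames):
        t = frame / (num_frames - 1) if num_frames > 1 else 0.0
        positive.append(math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2)))
    negative = [1.0 - p for p in positive]
    return positive, negative

pos_mask, neg_mask = gaussian_temporal_masks(num_frames=11, center=0.5, width=0.2)
```

Because the negative mask is computed from the positive one, the two weights sum to one at every frame by construction.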
In domain-adaptive segmentation, complementary masking is expressed directly as
$$x^{(1)} = D\odot x, \qquad x^{(2)} = (1-D)\odot x,$$
where the two masked views partition the image $x$ into disjoint parts that together cover the whole image (Wang et al., 16 Jul 2025). MICDrop applies the same logic in feature space rather than image space: blockwise DropBlock masks image features with a mask and depth features with its complement, so that exactly one modality is active at each spatial block (Yang et al., 2024).
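The partition property (disjoint views whose union is the original image) is easy to check on a toy example. The sketch below is an illustration under stated assumptions: a single-channel image stored as nested lists, square patches that divide the image evenly, and a hypothetical function name.

```python
import random

def dual_masked_views(image, patch, mask_ratio, rng):
    """Split an image into two complementary patch-masked views.

    Hidden patches are zeroed in view_a and kept in view_b; all other
    patches are kept in view_a and zeroed in view_b.
    Assumes height and width are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    cols = width // patch
    num_patches = (height // patch) * cols
    hidden = set(rng.sample(range(num_patches), round(num_patches * mask_ratio)))
    view_a = [[0] * width for _ in range(height)]
    view_b = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            patch_id = (y // patch) * cols + (x // patch)
            if patch_id in hidden:
                view_b[y][x] = image[y][x]
            else:
                view_a[y][x] = image[y][x]
    return view_a, view_b

rng = random.Random(1)
image = [[4 * r + c + 1 for c in range(4)] for r in range(4)]  # nonzero pixels 1..16
view_a, view_b = dual_masked_views(image, patch=2, mask_ratio=0.5, rng=rng)
```

Summing the two views pixel by pixel recovers the image, and their elementwise product is zero everywhere.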
Point-cloud MAE uses a more abstract notion of complementarity. The 3D Spatial Grid Masking stream and Progressive Semantic Masking stream target different, non-redundant priors—rotation-consistent geometry and semantic part coherence—and are mixed by a curriculum schedule rather than by strict pointwise complementation (Yin et al., 18 Sep 2025). This suggests that in some later uses of the term, complementarity refers not only to set-theoretic disjointness but also to non-redundant inductive biases.
3. Priority mechanisms
Priority mechanisms differ sharply across domains. In the proposed Complementary Priority Masking extension of RGB–Thermal segmentation, each patch $i$ is assigned a priority score $s_i$, which may be uncertainty-based, saliency/gradient-based, or derived from curriculum or scene priors. The scores are normalized to probabilities
$$p_i = \frac{s_i}{\sum_j s_j},$$
after which a subset of patches is sampled without replacement and complemented across modalities exactly as in CRM (Shin et al., 2023). The core property of complementary coverage is preserved while the masking budget is steered toward informative regions.
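Priority-weighted sampling without replacement can be sketched as repeated weighted draws over the remaining patches. The sum-normalization of scores, the function name, and the example scores are assumptions for illustration, and the scores are assumed strictly positive.

```python
import random

def sample_priority_mask(priorities, mask_ratio, rng):
    """Sample a priority-weighted mask and its cross-modal complement.

    Returns (probs, rgb_masked, thermal_masked); True means masked.
    Patches are drawn one at a time without replacement, with weights
    proportional to their normalized priority.
    """
    total = sum(priorities)
    probs = [s / total for s in priorities]
    num_masked = round(len(priorities) * mask_ratio)
    remaining = list(range(len(priorities)))
    chosen = set()
    for _ in range(num_masked):
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.add(pick)
        remaining.remove(pick)
    rgb_masked = [i in chosen for i in range(len(priorities))]
    thermal_masked = [not m for m in rgb_masked]  # complement keeps full coverage
    return probs, rgb_masked, thermal_masked

rng = random.Random(0)
scores = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8]  # hypothetical per-patch priority scores
probs, rgb_masked, thermal_masked = sample_priority_mask(scores, 0.5, rng)
```

The mask ratio fixes how many patches are hidden; the priorities only change which ones are likely to be chosen.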
In the point-cloud setting, priority is temporal rather than spatial. A semantic priority weight that depends on training progress combines the geometric and semantic masking scores into a single masking propensity (Yin et al., 18 Sep 2025). Early training emphasizes geometric structure; later training emphasizes semantic parts extracted from attention.
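One way to picture such a curriculum is a weight that ramps from 0 to 1 over training and blends two per-point masking scores. The schedule below (a linear ramp after a warmup fraction) and the convex combination are hypothetical choices for illustration and are not taken from the paper.

```python
def semantic_weight(step, total_steps, warmup_fraction=0.3):
    """Hypothetical schedule: 0 during warmup, then a linear ramp up to 1."""
    progress = step / total_steps
    if progress <= warmup_fraction:
        return 0.0
    return min(1.0, (progress - warmup_fraction) / (1.0 - warmup_fraction))

def combined_propensity(geometric, semantic, weight):
    """Blend per-point geometric and semantic masking scores (assumed convex mix)."""
    return [(1.0 - weight) * g + weight * s for g, s in zip(geometric, semantic)]

geometric_scores = [0.9, 0.1, 0.5]  # illustrative values only
semantic_scores = [0.2, 0.8, 0.5]

early = combined_propensity(geometric_scores, semantic_scores, semantic_weight(0, 100))
late = combined_propensity(geometric_scores, semantic_scores, semantic_weight(100, 100))
```

With this schedule the early propensity equals the geometric scores and the late propensity equals the semantic scores, with a smooth hand-over in between.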
The diffusion-LLM formulation makes priority fully probabilistic and token-specific. Let $b_i \in \{0,1\}$ indicate whether token $i$ is an information-dense hub. Given a global mask ratio $r$ and a bias weight $w$, hub tokens receive higher mask probabilities $p_i$ than non-hub tokens, subject to the mass-conservation constraint
$$\frac{1}{N}\sum_{i=1}^{N} p_i = r,$$
so that the expected fraction of masked tokens stays at the global ratio; a closed-form expression is given for the unsaturated regime. The sampled mask $m$ produces a reasoning-priority view, and the complement $1-m$ produces a syntax-priority view (Ma et al., 16 Mar 2026).
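A mass-conserving assignment of mask probabilities can be sketched by giving hub tokens a larger relative weight and rescaling so that the mean probability equals the global ratio. The specific weighting rule (hubs weighted by `bias`, other tokens by 1) is an assumption for illustration and may differ from the paper's scheduler; the sketch deliberately handles only the case where no probability exceeds 1.

```python
import random

def priority_mask_probs(is_hub, mask_ratio, bias):
    """Per-token mask probabilities whose mean equals mask_ratio.

    Assumed rule: hub tokens get relative weight `bias`, others weight 1.
    Raises if any probability would exceed 1 (saturated case not handled).
    """
    weights = [bias if hub else 1.0 for hub in is_hub]
    scale = mask_ratio * len(is_hub) / sum(weights)
    probs = [scale * w for w in weights]
    if max(probs) > 1.0:
        raise ValueError("saturated regime is outside the scope of this sketch")
    return probs

def sample_views(probs, rng):
    """Draw a Bernoulli mask per token and return it with its complement."""
    mask = [rng.random() < p for p in probs]
    complement = [not m for m in mask]
    return mask, complement

is_hub = [True, False, False, True, False, False, False, False]
probs = priority_mask_probs(is_hub, mask_ratio=0.5, bias=2.0)
mask, complement = sample_views(probs, random.Random(0))
```

In this example the hub tokens are masked with probability 0.8 and the rest with probability 0.4, which averages to the requested ratio of 0.5.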
Not all uses of “priority” are content-adaptive. MICDrop states explicitly that masks are not guided by semantic boundaries, depth discontinuities, or uncertainty maps; the priority signal comes purely from the schedule of the masking ratio, which controls whether depth or RGB is kept more often (Yang et al., 2024). Likewise, the dense-video paper does not use the term “priority”; the learned mask magnitude functions as a soft per-frame importance score, but this is an interpretive extension rather than the paper’s own terminology (Ge et al., 2024).
4. Objectives, architectures, and training paradigms
Complementary masking is typically embedded in multi-view consistency objectives rather than used as a stand-alone perturbation. In RGB–Thermal segmentation, the total objective combines three terms: a supervised term that supervises fused and clean single-modality predictions, a consistency term between clean and complementary-masked RGB–T inputs, and a self-distillation term that aligns clean RGB–T predictions with partially masked single-modality predictions on class logits (Shin et al., 2023). The architecture is Mask2Former with Swin Transformer backbones, winner-take-all per-channel max fusion, an MSDeformAttn pixel decoder, and a DETR-style transformer decoder with object queries.
The domain-adaptive segmentation framework MaskTwins uses a student–teacher scheme with EMA teacher updates, a supervised source loss, masked consistency to pseudo labels, and complementary prediction consistency between the two masked views.
Masking is applied only to target images, and the dual masked views are integrated directly into the main training pipeline rather than treated as separate pretraining (Wang et al., 16 Jul 2025).
MICDrop likewise plugs into existing UDA frameworks without introducing new losses of its own. It modifies forward features by masking image encoder features while inversely masking depth encoder features, then fuses them by a depth-guided global cross-attention branch and a local gated self-attention branch, with residual fusion into RGB features (Yang et al., 2024). The total segmentation objective follows the host framework, and the method adds an additional unmasked forward pass to reduce train–test distribution shift.
In weakly supervised dense video captioning, complementary masking underwrites implicit location–caption alignment. Training is two-stage: a global captioning stage trained with a captioning loss, followed by a localizing stage whose objective combines positive masked captioning, negative masked captioning, and diversity regularization. Positive masked captioning predicts the caption of event $i$ from video features masked by that event's positive mask, negative masked captioning predicts the remaining captions from the complement, and diversity regularization discourages different event masks from collapsing to the same temporal region (Ge et al., 2024).
In multiview circuit representation learning, masking becomes effective only after a shared function-aware latent space is established. MixGate therefore uses an alignment-first, three-stage curriculum: the loss terms of Stages 1 and 2 are optimized first, and the masked reconstruction term is added only in Stage 3 (Shi et al., 25 Sep 2025). This ordering is itself a priority rule: alignment is treated as the precondition for complementarity.
5. Empirical behavior across domains
The most direct robustness evidence comes from RGB–Thermal semantic segmentation. The paper compares mIoU against CMX on MFNet day–night (Swin-B and Swin-S variants), against GMNet on PST900 (Swin-B variant), and against CMX on the KAIST Multispectral semantic benchmark (Swin-B variant) (Shin et al., 2023). Modality-drop results on MFNet show that the reported method preserves performance much better than RTFNet and CMX under RGB-drop and THR-drop, which the paper attributes to reduced over-reliance on a single modality.
MICDrop reports consistent gains across several UDA baselines. On GTA→Cityscapes, DAFormer, MIC(DAFormer), HRDA, and MIC(HRDA) all improve; on SYNTHIA→Cityscapes, DAFormer, HRDA, and MIC(HRDA) improve (Yang et al., 2024). Boundary IoU on GTA→Cityscapes for MIC(HRDA) also increases, with reported category gains including signs, truck, bus, and poles.
MaskTwins reports an mIoU improvement over MIC on SYNTHIA→Cityscapes, with especially large reported gains on sidewalk and road; it also reports new state-of-the-art results on mitochondria segmentation and F1-score results on synapse detection, including post-synapse (Wang et al., 16 Jul 2025). In point-cloud pretraining, the dual-stream masking paper reports small but consistent gains on ModelNet40 and ScanObjectNN and larger gains on OmniObject3D across several rotation settings (Yin et al., 18 Sep 2025).
The dense-video formulation reports SODA, METEOR, CIDEr, ROUGE-L, and BLEU-4 on ActivityNet Captions for the CLIP-based model, and SODA, METEOR, and CIDEr for the C3D-based variant (Ge et al., 2024). Ablations show that removing positive masked captioning reduces CIDEr; the paper also reports CIDEr when negative masked captioning is removed, when the Gaussian is replaced with a hard binary mask, and when the diversity loss is removed.
The diffusion-LLM work reports HumanEval, MBPP, GSM8K, MATH500, and average scores for the CPM setting, compared with the averages of a baseline and of the original model (Ma et al., 16 Mar 2026). The paper attributes the gap to better allocation of optimization toward information-dense reasoning pivots while retaining complementary syntax supervision.
6. Limitations, misconceptions, and failure modes
A persistent misconception is that complementary masking is equivalent to generic dropout, Cutout, PatchDrop, or random erasing. The RGB–Thermal study explicitly argues that these methods do not enforce any cross-modality constraint and may erase the same spatial content across both modalities, whereas CRM guarantees complementarity and outperforms square masking and independently sampled masking (Shin et al., 2023). MICDrop likewise reports that independent masking of RGB and depth features is unstable and yields no gain relative to complementary masking (Yang et al., 2024).
Another misconception is that complementarity alone suffices. The multiview circuit results show the opposite: masked modeling without prior alignment worsens both Signal Probability Prediction and Truth-Table Distance Prediction, whereas masking after Equivalence Alignment Loss yields the best results (Shi et al., 25 Sep 2025). This makes alignment a substantive precondition rather than a cosmetic addition.
Sensitivity to mask granularity and ratio recurs across settings. In RGB–Thermal segmentation, MFNet mIoU depends on the patch size among those tested, and very large contiguous masks can impede non-local learning (Shin et al., 2023). In MaskTwins, the method is reported as robust across a range of masking hyperparameters, but very small mask ratios can degrade performance (Wang et al., 16 Jul 2025). The point-cloud study reports that a moderate mask ratio works well, whereas very low ratios may underconstrain reconstruction and very high ratios can destabilize training (Yin et al., 18 Sep 2025).
Priority mechanisms can also fail when their signal is unreliable. The point-cloud paper notes that early attention can be noisy, hence the need for progressive thresholds and clustering schedules (Yin et al., 18 Sep 2025). The dense-video paper reports sensitivity to event overlap and to the Gaussian steepness and diversity margin hyperparameters (Ge et al., 2024). The diffusion-LLM work reports that hard priority masking underperforms soft priority masking and can produce contextual collapse by creating “information black holes” in block diffusion (Ma et al., 16 Mar 2026). In RGB–Thermal segmentation, if one modality is globally unreliable, such as severe thermal noise everywhere, complementary masking cannot compensate (Shin et al., 2023).
7. Broader context and antecedents
Complementary masking in modern representation learning also has non-neural antecedents in coding-theoretic masking. Linear complementary pairs $(C, D)$ and complementary information set codes support direct sum masking for side-channel and fault-injection resistance by decomposing the ambient space into complementary subspaces (Bhowmick et al., 2023, Freibert, 2012). These works do not use the term “Complementary Priority Masking,” but they articulate an older complementarity principle: every vector admits a unique decomposition into two complementary code components, and the security parameter is governed by $\min\{d(C), d(D^{\perp})\}$ in the LCP setting (Bhowmick et al., 2023).
This older literature is conceptually distinct from neural masked modeling, because the complement is algebraic rather than stochastic. Even so, a plausible implication is that current CPM-style methods inherit a long-standing intuition: complementarity is valuable when it guarantees coverage without redundancy and blocks degenerate reliance on a single subspace, modality, or coordinate set.
Across contemporary machine learning, the common lesson is narrower and more technical. Complementary masks are most effective when they are synchronized, structurally meaningful, and coupled to an auxiliary principle—self-distillation, curriculum, pseudo-label consistency, positive/negative captioning, functional alignment, or information-density weighting. Where that auxiliary principle is absent, complementarity tends to reduce to a perturbation. Where it is present, the masking operation becomes a mechanism for enforcing cross-view dependence, exposing non-local cues, and regularizing representation learning under missing, degraded, or weakly supervised inputs.