
Complementary Priority Masking

Updated 4 July 2026
  • Complementary Priority Masking is defined by paired masked views in which one view’s visible regions are exactly the other’s hidden regions, so that the pair covers the full input.
  • It integrates a priority mechanism that directs the masking budget to more informative or challenging areas, improving tasks like RGB–Thermal segmentation.
  • The approach is applied across diverse domains—including point clouds, video captioning, and domain-adaptive segmentation—demonstrating robust performance improvements.

Complementary Priority Masking (“CPM”; Editor’s term) denotes a class of masking strategies in which multiple masked views are constructed so that their visible regions, tokens, or substructures are complementary, while the masking budget is optionally directed toward harder, more informative, or more reliable content. In RGB–Thermal semantic segmentation, the clearest early formulation is Complementary Random Masking, which masks disjoint patch regions in RGB and thermal streams so that every spatial location remains visible in at least one modality, and couples this with self-distillation between clean and masked inputs (Shin et al., 2023). Related formulations appear in rotation-invariant point-cloud masked autoencoding, weakly supervised dense video captioning, domain-adaptive segmentation, RGB–depth feature dropout, multiview circuit representation learning, and diffusion LLM training (Yin et al., 18 Sep 2025, Ge et al., 2024, Wang et al., 16 Jul 2025, Yang et al., 2024, Shi et al., 25 Sep 2025, Ma et al., 16 Mar 2026). The term “priority” is not used uniformly across these works; in several cases it is an explicit mechanism, whereas in others it is an interpretive description of curriculum, salience, or alignment scheduling.

1. Definition and scope

The defining property of complementary masking is that masked views are not sampled independently. Instead, one view’s hidden regions are paired with another view’s visible regions, or a positive mask is paired with its exact complement, so that the pair jointly covers the full input. In the RGB–Thermal formulation, complementarity means that if one modality is masked in some regions, the other modality is unmasked in those exact regions, guaranteeing that each spatial location is visible to the network in at least one modality (Shin et al., 2023). In weakly supervised dense video captioning, the positive temporal mask $M_i$ is paired with the negative mask $\check{M}_i = \mathbf{1} - M_i$, and the two masked videos are trained to yield complementary caption subsets that together form a complete video description (Ge et al., 2024). In domain-adaptive segmentation, the dual form uses a binary mask $D$ and its complement $1-D$ to create two masked target views whose union equals the original image and whose intersection is empty (Wang et al., 16 Jul 2025).

A second axis is prioritization. In some formulations this is explicit. The proposed Complementary Priority Masking variant for RGB–Thermal segmentation defines a patch-level priority map and samples a complementary mask under a fixed ratio according to normalized priority probabilities (Shin et al., 2023). In rotation-invariant point-cloud MAE, the “priority” component is realized by curriculum learning with dynamic weights that shift emphasis from geometric masking to semantic masking over training time (Yin et al., 18 Sep 2025). In diffusion LLMs, information-dense hubs are assigned higher mask probabilities under a mass-conserving scheduler, and the complement forms a syntax-oriented view (Ma et al., 16 Mar 2026).

A compact cross-domain summary is as follows.

| Setting | Complementary mechanism | Priority mechanism |
| --- | --- | --- |
| RGB–Thermal segmentation | Patch masks $M$ and $1-M$ across modalities | Optional priority map over patches |
| Point-cloud MAE | Geometric and semantic masking streams | Curriculum weight $\alpha(t)$ |
| Dense video captioning | Positive mask $M_i$ and negative mask $1-M_i$ | Soft salience through learned mask magnitude |
| UDA segmentation / RGB–depth UDA | Dual masked views or complementary dropout | Optional weighting or scheduled retention |
| Diffusion LLMs | Token mask $M$ and complement $1-M$ | Information-density-driven probabilities |

This suggests a unifying abstraction: complementarity supplies coverage and anti-shortcut pressure, whereas priority determines where that pressure is concentrated.

2. Complementarity as a masking invariant

In the RGB–Thermal case, complementarity is implemented at input-token level, aligned with Swin Transformer patchification. For tokenized inputs $X_{\mathrm{rgb}}$ and $X_{\mathrm{thr}}$, a random binary mask $M$ is sampled on one modality and its exact complement $1-M$ is applied to the other modality. Masked tokens are replaced by learnable mask-token vectors $e_{\mathrm{rgb}}$ and $e_{\mathrm{thr}}$, yielding

$$\tilde{X}_{\mathrm{rgb}} = (1-M) \odot X_{\mathrm{rgb}} + M \odot e_{\mathrm{rgb}}, \qquad \tilde{X}_{\mathrm{thr}} = M \odot X_{\mathrm{thr}} + (1-M) \odot e_{\mathrm{thr}}.$$

This ensures that at least one valid modality is available per patch (Shin et al., 2023). The same work explicitly contrasts this with independent random masking, which can erase the same region in both modalities and degrades performance.
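A minimal sketch of the complementary assignment described above, in plain Python. The function names, the string stand-ins for token embeddings, and the `[MASK_*]` placeholders are illustrative assumptions, not the paper's implementation; a real model would operate on tensors and use learnable mask-token vectors.

```python
import random

def complementary_masks(num_patches, mask_ratio, rng):
    """Return (rgb_masked, thr_masked) as boolean lists, True = patch is masked.

    The thermal mask is the exact complement of the RGB mask, so every
    patch stays visible in exactly one of the two modalities.
    """
    num_masked = round(mask_ratio * num_patches)
    masked_in_rgb = set(rng.sample(range(num_patches), num_masked))
    rgb_masked = [i in masked_in_rgb for i in range(num_patches)]
    thr_masked = [not m for m in rgb_masked]
    return rgb_masked, thr_masked

def apply_mask(tokens, masked, mask_token):
    """Replace masked tokens by a (here: constant) mask token."""
    return [mask_token if m else t for t, m in zip(tokens, masked)]

rng = random.Random(0)
rgb_tokens = [f"rgb{i}" for i in range(8)]   # hypothetical token stand-ins
thr_tokens = [f"thr{i}" for i in range(8)]
rgb_masked, thr_masked = complementary_masks(8, 0.5, rng)
rgb_view = apply_mask(rgb_tokens, rgb_masked, "[MASK_RGB]")
thr_view = apply_mask(thr_tokens, thr_masked, "[MASK_THR]")

# Coverage check: each patch is hidden in exactly one modality.
covered = all(r != t for r, t in zip(rgb_masked, thr_masked))
```

With two independently sampled masks, the coverage check would generally fail, because some patches could be hidden in both streams at once.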

The same structural invariant appears in other domains, but with different carriers. In weakly supervised dense video captioning (WSDVC), complementarity operates over time rather than space. The model predicts a normalized center and width for each event and constructs a Gaussian mask $M_i$ of the form

[…]

with the negative mask fixed as $\check{M}_i = \mathbf{1} - M_i$ (Ge et al., 2024). The complement is therefore hard-wired rather than learned independently.
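The pairing of a soft positive mask with its fixed complement can be sketched as follows. This assumes a standard Gaussian bump over normalized frame positions; the exact parameterization used in the paper (including any steepness term) is not reproduced here, and `gaussian_mask` is a hypothetical helper.

```python
import math

def gaussian_mask(num_frames, center, width):
    """Soft temporal mask in [0, 1], peaked at `center` (normalized to [0, 1]).

    Assumed form: exp(-(p - center)^2 / (2 * width^2)) at frame midpoints p.
    """
    positions = [(t + 0.5) / num_frames for t in range(num_frames)]
    return [math.exp(-((p - center) ** 2) / (2.0 * width ** 2)) for p in positions]

positive = gaussian_mask(10, center=0.25, width=0.1)
# The negative mask is derived, not predicted: it is exactly 1 - positive.
negative = [1.0 - m for m in positive]
```

Because the negative mask is computed from the positive one, the two always sum to one at every frame, whatever center and width the model predicts.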

In domain-adaptive segmentation, complementary masking is expressed directly as

$$x^{(1)} = D \odot x, \qquad x^{(2)} = (1 - D) \odot x,$$

where the two masked views partition the image into disjoint parts that together cover the whole image (Wang et al., 16 Jul 2025). MICDrop applies the same logic in feature space rather than image space: blockwise DropBlock masks image features with one mask and depth features with its complement, so that exactly one modality is active at each spatial block (Yang et al., 2024).
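The partition property can be checked directly on a toy image. The sketch below assumes a blockwise binary mask with a hypothetical keep probability; `block_mask` and `dual_views` are illustrative names, and nested lists stand in for image tensors.

```python
import random

def block_mask(height, width, block, keep_prob, rng):
    """Binary mask D (1 = keep) that is constant inside each block x block cell."""
    rows = -(-height // block)  # ceiling division
    cols = -(-width // block)
    coarse = [[1 if rng.random() < keep_prob else 0 for _ in range(cols)]
              for _ in range(rows)]
    return [[coarse[y // block][x // block] for x in range(width)]
            for y in range(height)]

def dual_views(image, mask):
    """Return (D * x, (1 - D) * x) for a 2D image and a binary mask D."""
    view_a = [[d * v for d, v in zip(drow, vrow)] for drow, vrow in zip(mask, image)]
    view_b = [[(1 - d) * v for d, v in zip(drow, vrow)] for drow, vrow in zip(mask, image)]
    return view_a, view_b

rng = random.Random(0)
image = [[10 * y + x + 1 for x in range(6)] for y in range(4)]  # all pixels > 0
D = block_mask(4, 6, block=2, keep_prob=0.5, rng=rng)
view_a, view_b = dual_views(image, D)
```

Every pixel is non-zero in exactly one of the two views, and adding the views pixel by pixel recovers the original image.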

Point-cloud MAE uses a more abstract notion of complementarity. The 3D Spatial Grid Masking stream and Progressive Semantic Masking stream target different, non-redundant priors—rotation-consistent geometry and semantic part coherence—and are mixed by a curriculum schedule rather than by strict pointwise complementation (Yin et al., 18 Sep 2025). This suggests that in some later uses of the term, complementarity refers not only to set-theoretic disjointness but also to non-redundant inductive biases.

3. Priority mechanisms

Priority mechanisms differ sharply across domains. In the proposed Complementary Priority Masking extension of RGB–Thermal segmentation, each patch $i$ is assigned a priority score $s_i$, which may be uncertainty-based, saliency/gradient-based, or derived from curriculum or scene priors. The scores are normalized to probabilities

$$p_i = \frac{s_i}{\sum_j s_j},$$

after which a subset of patches is sampled without replacement and complemented across modalities exactly as in CRM (Shin et al., 2023). The core property of complementary coverage is preserved while the masking budget is steered toward informative regions.
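A sketch of priority-guided sampling under a fixed mask ratio, followed by complementation. It assumes non-negative scores normalized by their sum, with at least as many positive-score patches as the masking budget; `sample_priority_mask` is a hypothetical helper, and sequential weighted draws are one of several ways to sample without replacement.

```python
import random

def sample_priority_mask(scores, mask_ratio, rng):
    """Pick round(mask_ratio * N) distinct patches, favoring high scores.

    Returns a boolean list, True = masked in the first modality.
    """
    total = sum(scores)
    probs = [s / total for s in scores]          # assumed sum-normalization
    budget = round(mask_ratio * len(scores))
    remaining = list(range(len(scores)))
    chosen = set()
    for _ in range(budget):
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.add(pick)
        remaining.remove(pick)                   # no replacement
    return [i in chosen for i in range(len(scores))]

rng = random.Random(0)
scores = [0, 0, 5, 1, 3, 0, 2, 4]                # hypothetical patch priorities
first_masked = sample_priority_mask(scores, 0.5, rng)
second_masked = [not m for m in first_masked]    # complementary modality mask
```

The mask ratio stays fixed regardless of the scores; priority only changes which patches consume the budget, and the complement step is unchanged from the random case.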

In the point-cloud setting, priority is temporal rather than spatial. The semantic priority weight is

$\alpha(t) = $ […]

and the combined masking propensity is

[…]

with reported values […], […], […], […], and […] (Yin et al., 18 Sep 2025). Early training emphasizes geometric structure; later training emphasizes semantic parts extracted from attention.
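The shift from geometric to semantic emphasis can be illustrated with a simple blend. Both the linear ramp and the convex combination below are assumptions made for illustration; the paper's actual schedule and hyperparameters are not reproduced here, and the function names are hypothetical.

```python
def semantic_weight(step, total_steps):
    """Assumed curriculum weight: linear ramp from 0 to 1 over training."""
    return min(1.0, max(0.0, step / total_steps))

def combined_propensity(geometric, semantic, step, total_steps):
    """Blend per-point masking propensities from the two streams."""
    alpha = semantic_weight(step, total_steps)
    return [(1.0 - alpha) * g + alpha * s for g, s in zip(geometric, semantic)]

geometric = [0.9, 0.1, 0.5]   # hypothetical per-point propensities
semantic = [0.2, 0.8, 0.5]
early = combined_propensity(geometric, semantic, step=0, total_steps=100)
middle = combined_propensity(geometric, semantic, step=50, total_steps=100)
late = combined_propensity(geometric, semantic, step=100, total_steps=100)
```

At step 0 the blend equals the geometric propensities, at the final step it equals the semantic ones, and intermediate steps interpolate between them.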

The diffusion-LLM formulation makes priority fully probabilistic and token-specific. Let a binary indicator mark the information-dense hub tokens. Given a global mask ratio and a bias weight, the mask probabilities satisfy

[…]

with mass conservation enforcing

[…]

In the unsaturated regime,

[…]

The sampled mask $M$ produces a reasoning-priority view, and the complement $1-M$ produces a syntax-priority view (Ma et al., 16 Mar 2026).
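One way to realize a mass-conserving, hub-biased scheduler is sketched below. The parameterization is an assumption, not the paper's formula: hub tokens get `bias` times the probability of other tokens, the mean probability is pinned to the global mask ratio, and hub probabilities are clipped at 1 with the remaining mass moved to non-hub tokens.

```python
import random

def mask_probabilities(is_hub, mask_ratio, bias):
    """Per-token mask probabilities whose mean equals mask_ratio (assumed form)."""
    if bias < 1.0:
        raise ValueError("this sketch assumes bias >= 1 (hubs are favored)")
    hub_frac = sum(is_hub) / len(is_hub)
    base = mask_ratio / (1.0 + hub_frac * (bias - 1.0))
    p_hub, p_other = bias * base, base           # unsaturated regime
    if p_hub > 1.0:                              # saturated: clip and rebalance
        p_hub = 1.0
        p_other = (mask_ratio - hub_frac) / (1.0 - hub_frac)
    return [p_hub if h else p_other for h in is_hub]

hubs = [1, 0, 0, 0]                              # hypothetical hub indicator
soft = mask_probabilities(hubs, mask_ratio=0.5, bias=2.0)
hard = mask_probabilities(hubs, mask_ratio=0.5, bias=10.0)

rng = random.Random(0)
mask = [rng.random() < p for p in soft]          # hub-biased view
complement = [not m for m in mask]               # complementary view
```

In both settings the average mask probability stays at the global ratio, so raising the bias redistributes masking toward hubs without changing the overall budget.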

Not all uses of “priority” are content-adaptive. MICDrop states explicitly that masks are not guided by semantic boundaries, depth discontinuities, or uncertainty maps; the priority signal comes purely from the schedule of the masking ratio, which controls whether depth or RGB is kept more often (Yang et al., 2024). Likewise, the dense-video paper does not use the term “priority”; the learned mask magnitude functions as a soft per-frame importance score, but this is an interpretive extension rather than the paper’s own terminology (Ge et al., 2024).

4. Objectives, architectures, and training paradigms

Complementary masking is typically embedded in multi-view consistency objectives rather than used as a stand-alone perturbation. In RGB–Thermal segmentation, the total objective is

[…]

It has three terms: a supervised loss on fused and clean single-modality predictions; a consistency loss between clean and complementary-masked RGB–T inputs; and a loss that aligns clean RGB–T predictions with partially masked single-modality predictions via […] on class logits (Shin et al., 2023). The architecture is Mask2Former with Swin Transformer backbones, winner-take-all per-channel max fusion, MSDeformAttn pixel decoder, DETR-style transformer decoder, and […] object queries.

The domain-adaptive segmentation framework MaskTwins uses a student–teacher scheme with EMA teacher updates, supervised source loss, masked consistency to pseudo labels, and complementary prediction consistency:

[…]

Masking is applied only to target images, and the dual masked views are integrated directly into the main training pipeline rather than treated as separate pretraining (Wang et al., 16 Jul 2025).
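Two ingredients of this scheme can be sketched generically: the exponential-moving-average teacher update and a consistency penalty between predictions on the two complementary views. The squared-difference penalty is a stand-in chosen for illustration, since the exact consistency loss is not reproduced here; flat lists stand in for parameters and per-pixel class probabilities.

```python
def ema_update(teacher, student, decay):
    """Teacher parameters drift toward the student: t <- decay*t + (1-decay)*s."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def consistency(pred_a, pred_b):
    """Mean squared difference between two lists of class-probability vectors."""
    total, count = 0.0, 0
    for probs_a, probs_b in zip(pred_a, pred_b):
        for a, b in zip(probs_a, probs_b):
            total += (a - b) ** 2
            count += 1
    return total / count

teacher = ema_update([1.0, 2.0], [3.0, 4.0], decay=0.5)
agree = consistency([[0.5, 0.5]], [[0.5, 0.5]])      # identical predictions
disagree = consistency([[1.0, 0.0]], [[0.0, 1.0]])   # opposite predictions
```

The penalty is zero only when the two masked views lead to the same prediction, which is the pressure that forces each view to infer what it cannot see.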

MICDrop likewise plugs into existing UDA frameworks without introducing new losses of its own. It modifies forward features by masking image encoder features while inversely masking depth encoder features, then fuses them by a depth-guided global cross-attention branch and a local gated self-attention branch, with residual fusion into RGB features (Yang et al., 2024). The total segmentation objective follows the host framework, and the method adds an additional unmasked forward pass to reduce train–test distribution shift.

In weakly supervised dense video captioning, complementary masking underwrites implicit location–caption alignment. Training is two-stage: a global captioning stage, followed by a localizing stage optimizing

[…]

where positive masked captioning predicts the $i$-th caption from $M_i$-masked video features, negative masked captioning predicts the remaining captions from the complement, and diversity regularization discourages different event masks from collapsing to the same temporal region (Ge et al., 2024).

In multiview circuit representation learning, masking becomes effective only after a shared function-aware latent space is established. MixGate therefore uses an alignment-first curriculum: Stage 1 optimizes […], Stage 2 adds […], and Stage 3 adds the masked reconstruction term (Shi et al., 25 Sep 2025). This ordering is itself a priority rule: alignment is treated as the precondition for complementarity.

5. Empirical behavior across domains

The most direct robustness evidence comes from RGB–Thermal semantic segmentation. On MFNet day–night, CMX reports […] mIoU, whereas the Swin-B variant reports […] and Swin-S reports […]; on PST900, GMNet reports […] and the Swin-B variant reports […] mIoU; on the KAIST Multispectral semantic benchmark, CMX reports […] and the Swin-B variant reports […] mIoU (Shin et al., 2023). Modality-drop results on MFNet show that the reported method preserves performance much better than RTFNet and CMX under RGB-drop and THR-drop, which the paper attributes to reduced over-reliance on a single modality.

MICDrop reports consistent gains across several UDA baselines. On GTA→Cityscapes, DAFormer improves from […] to […], MIC(DAFormer) from […] to […], HRDA from […] to […], and MIC(HRDA) from […] to […]; on SYNTHIA→Cityscapes, DAFormer improves from […] to […], HRDA from […] to […], and MIC(HRDA) from […] to […] (Yang et al., 2024). Boundary IoU on GTA→Cityscapes for MIC(HRDA) increases from […] to […], with reported category gains including signs […], truck […], bus […], and poles […].

MaskTwins reports […] mIoU on SYNTHIA→Cityscapes over […] classes, improving […] mIoU over MIC’s […], with especially large reported gains on sidewalk and road; it also reports new state-of-the-art results on mitochondria segmentation and an F1-score of […] on synapse detection, including […] on post-synapse (Wang et al., 16 Jul 2025). In point-cloud pretraining, the dual-stream masking paper reports small but consistent gains on ModelNet40 and ScanObjectNN and larger gains of roughly […] to […] on OmniObject3D across several rotation settings (Yin et al., 18 Sep 2025).

The dense-video formulation reports on ActivityNet Captions that the CLIP-based model reaches SODA […], METEOR […], CIDEr […], ROUGE-L […], and BLEU-4 […], while the C3D-based variant reaches SODA […], METEOR […], and CIDEr […] (Ge et al., 2024). Ablations show that removing positive masked captioning reduces CIDEr from […] to […], removing negative masked captioning gives […], replacing the Gaussian with a hard binary mask gives […], and removing diversity loss gives […].

The diffusion-LLM work reports HumanEval […], MBPP […], GSM8K […], MATH500 […], and average […] for the CPM setting with […], compared with baseline average […] and original average […] (Ma et al., 16 Mar 2026). The paper attributes the gap to better allocation of optimization toward information-dense reasoning pivots while retaining complementary syntax supervision.

6. Limitations, misconceptions, and failure modes

A persistent misconception is that complementary masking is equivalent to generic dropout, Cutout, PatchDrop, or random erasing. The RGB–Thermal study explicitly argues that these methods do not enforce any cross-modality constraint and may erase the same spatial content across both modalities, whereas CRM guarantees complementarity and outperforms square masking and independently sampled masking (Shin et al., 2023). MICDrop likewise reports that independent masking of RGB and depth features is unstable and yields no gain relative to complementary masking (Yang et al., 2024).

Another misconception is that complementarity alone suffices. The multiview circuit results show the opposite: masked modeling without prior alignment worsens both Signal Probability Prediction and Truth-Table Distance Prediction, whereas masking after Equivalence Alignment Loss yields the best results (Shi et al., 25 Sep 2025). This makes alignment a substantive precondition rather than a cosmetic addition.

Sensitivity to mask granularity and ratio recurs across settings. In RGB–Thermal segmentation, patch size […] yields the best MFNet mIoU among tested patch sizes […], and very large contiguous masks can impede non-local learning (Shin et al., 2023). In MaskTwins, the method is reported as robust for […] and […], but very small mask ratios can degrade performance (Wang et al., 16 Jul 2025). The point-cloud study reports that a mask ratio near […] works well, whereas very low ratios may underconstrain reconstruction and very high ratios can destabilize training (Yin et al., 18 Sep 2025).

Priority mechanisms can also fail when their signal is unreliable. The point-cloud paper notes that early attention can be noisy, hence the need for progressive thresholds and clustering schedules (Yin et al., 18 Sep 2025). The dense-video paper reports sensitivity to event overlap and to the Gaussian steepness and diversity margin (Ge et al., 2024). The diffusion-LLM work reports that hard priority masking with […] underperforms soft priority masking and can produce contextual collapse by creating “information black holes” in block diffusion (Ma et al., 16 Mar 2026). In RGB–Thermal segmentation, if one modality is globally unreliable, such as severe thermal noise everywhere, complementary masking cannot compensate (Shin et al., 2023).

7. Broader context and antecedents

Complementary masking in modern representation learning also has non-neural antecedents in coding-theoretic masking. Linear complementary pairs $(C, D)$ and complementary information set codes support direct sum masking for side-channel and fault-injection resistance by decomposing the ambient space into complementary subspaces (Bhowmick et al., 2023, Freibert, 2012). These works do not use the term “Complementary Priority Masking,” but they articulate an older complementarity principle: every vector admits a unique decomposition into two complementary code components, and the security parameter is governed by $\min\{d(C), d(D^{\perp})\}$ in the LCP setting (Bhowmick et al., 2023).

This older literature is conceptually distinct from neural masked modeling, because the complement is algebraic rather than stochastic. Even so, a plausible implication is that current CPM-style methods inherit a long-standing intuition: complementarity is valuable when it guarantees coverage without redundancy and blocks degenerate reliance on a single subspace, modality, or coordinate set.

Across contemporary machine learning, the common lesson is narrower and more technical. Complementary masks are most effective when they are synchronized, structurally meaningful, and coupled to an auxiliary principle—self-distillation, curriculum, pseudo-label consistency, positive/negative captioning, functional alignment, or information-density weighting. Where that auxiliary principle is absent, complementarity tends to reduce to a perturbation. Where it is present, the masking operation becomes a mechanism for enforcing cross-view dependence, exposing non-local cues, and regularizing representation learning under missing, degraded, or weakly supervised inputs.
