Complementary Masked Autoencoders (CoMA)
- Complementary Masked Autoencoders are self-supervised frameworks that use a dual-branch masking approach to ensure each token is visible in at least one branch.
- They pair this masking strategy with efficient backbones, DyViT for 2D images and ViT-style encoders for 3D point clouds, yielding faster convergence and improved downstream metrics.
- The method incorporates explicit contrastive losses and dual decoders to enforce semantic consistency and dense supervision, outperforming traditional masked autoencoding methods.
Complementary Masked Autoencoders (CoMA and CoMAE) are a family of masked modeling frameworks for self-supervised representation learning that utilize a dual-branch or multi-view “complementary” masking strategy. The core principle is to ensure that, over the course of each training iteration, every token (pixel, patch, or point) is visible in at least one branch. This method directly addresses inefficiencies in random masking found in standard Masked Autoencoders (MAE), achieving improved pre-training efficiency and superior or matched downstream performance in both 2D image and 3D point cloud domains. Architecturally, these models often integrate lightweight hierarchical backbones—such as Dynamic Vision Transformers (DyViT)—or ViT-style encoders, along with explicit or implicit contrastive constraints arising from complementary masked pairs.
1. Complementary Masking: Formalism and Motivation
The complementary masking strategy replaces traditional single-view random masking with a dual-branch framework in which two mutually exclusive, exhaustive masks are generated per sample. For an input $X$, one draws a random binary mask $M \in \{0,1\}^N$, where $M_i = 1$ marks a token (pixel, patch, or point) to be masked in the first branch, and constructs its complement $\bar{M} = \mathbf{1} - M$. The two masked inputs are

$$X_1 = \bar{M} \odot X, \qquad X_2 = M \odot X.$$

This guarantees that every element is visible in at least one of $X_1$ or $X_2$ and enables dense supervision across the entire input in a single iteration. For point cloud or tokenized representations $\{x_i\}_{i=1}^{N}$, the strategy generalizes by drawing two independent random masks $M_1, M_2$ and defining the co-masked set

$$\Omega = \{\, i : M_1(i) = 1 \text{ and } M_2(i) = 1 \,\},$$

over which further explicit constraints (such as the contrastive term below) can be imposed.
The principal motivation is twofold: (1) to accelerate convergence by improving token/patch coverage with fewer epochs, and (2) to counteract the sample inefficiency and feature instability induced by the stochastic omission of potentially critical patches in random masking schemes.
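A minimal PyTorch-style sketch of the dual-branch mask generation (the `mask_ratio` default, tensor shapes, and zeroing of masked tokens are illustrative assumptions, not the reference implementation):

```python
import torch

def complementary_masks(batch_size: int, num_tokens: int, mask_ratio: float = 0.6):
    """Draw a random binary mask M per sample and return (M, 1 - M).

    M[b, j] == 1 means token j is masked in branch 1 and therefore
    visible in branch 2, so every token is visible in exactly one branch.
    """
    num_masked = int(num_tokens * mask_ratio)
    # Random permutation per sample; the first `num_masked` indices are masked.
    noise = torch.rand(batch_size, num_tokens)
    ids = noise.argsort(dim=1)
    mask = torch.zeros(batch_size, num_tokens)
    mask.scatter_(1, ids[:, :num_masked], 1.0)
    return mask, 1.0 - mask

# Example: tokens seen by each branch (a real MAE encoder would drop the
# masked tokens rather than zeroing them; zeroing keeps the sketch short).
tokens = torch.randn(8, 196, 384)            # (B, N, C) patch embeddings
m1, m2 = complementary_masks(8, 196, 0.6)
visible_1 = tokens * (1 - m1).unsqueeze(-1)  # branch 1 sees the unmasked tokens
visible_2 = tokens * (1 - m2).unsqueeze(-1)  # branch 2 sees the complement
```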
2. Architecture and Network Design
The CoMA framework for 2D images couples the complementary masking strategy with the DyViT ("Dynamic Vision Transformer") backbone. DyViT is a hierarchical ViT with four stages, progressively downsampling spatial resolutions and increasing channel depth. Its distinguishing feature is the Dynamic Multi-Window Self-Attention (DM-MSA) block, which allows each transformer block to aggregate feature context over a range of dynamically selected local window sizes. The hierarchical structure is as follows:
- Stage 1: $\tfrac{H}{4}\times\tfrac{W}{4}$ resolution, 96/112 channels
- Stage 2: $\tfrac{H}{8}\times\tfrac{W}{8}$ resolution, 192/224 channels
- Stage 3: $\tfrac{H}{16}\times\tfrac{W}{16}$ resolution, 384/448 channels
- Stage 4: $\tfrac{H}{32}\times\tfrac{W}{32}$ resolution, 768/896 channels
Downsampling is performed by a stride-4 convolution in the first stage and max-pooling in subsequent stages. In DM-MSA, for each window size in a set of factor-of-two divisors of the patch grid, a strided convolution is applied to the key and value projections, and the resulting multi-scale local attentions are fused to form the block output. The last two stages revert to global self-attention.
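A simplified single-head sketch of the multi-window key/value downsampling idea (the window set, module names, and concatenation-based fusion are illustrative assumptions, not the published DM-MSA block):

```python
import torch
import torch.nn as nn

class MultiWindowAttention(nn.Module):
    """Attend to key/value maps pooled at several window sizes via strided
    convolutions, then fuse the multi-scale contexts with a linear projection."""
    def __init__(self, dim: int, window_sizes=(1, 2, 4)):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        # One strided conv per window size produces downsampled K and V maps.
        self.kv_convs = nn.ModuleList(
            [nn.Conv2d(dim, 2 * dim, kernel_size=s, stride=s) for s in window_sizes]
        )
        self.proj = nn.Linear(dim * len(window_sizes), dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (B, C, H, W), H and W divisible by the window sizes
        q = self.q(x.flatten(2).transpose(1, 2))           # (B, HW, C)
        outs = []
        for conv in self.kv_convs:
            kv = conv(x)                                    # (B, 2C, H/s, W/s)
            k, v = kv.flatten(2).transpose(1, 2).chunk(2, dim=-1)
            attn = (q @ k.transpose(1, 2)) * self.scale     # (B, HW, HW/s^2)
            outs.append(attn.softmax(dim=-1) @ v)           # (B, HW, C)
        return self.proj(torch.cat(outs, dim=-1))           # fuse multi-scale contexts
```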
For 3D Point Cloud learning ("CoMAE"/"Point-CMAE"), the encoder is a standard ViT (12 layers, hidden size 384, 6 heads), and two independently parameterized decoders reconstruct masked tokens in their respective branches. The use of dual decoders, rather than shared weights, encourages the encoder to generate semantically rich representations that satisfy multiple reconstruction contexts.
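A minimal sketch of the shared-encoder / dual-decoder arrangement (module and argument names are hypothetical):

```python
import torch.nn as nn

class DualDecoderMAE(nn.Module):
    """Shared encoder with two independently parameterized decoders,
    one per complementary branch."""
    def __init__(self, encoder: nn.Module, make_decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder1 = make_decoder()   # branch 1: not weight-shared
        self.decoder2 = make_decoder()   # branch 2: independent parameters

    def forward(self, visible_1, visible_2):
        z1 = self.encoder(visible_1)     # latents from branch-1 visible tokens
        z2 = self.encoder(visible_2)     # latents from branch-2 visible tokens
        return self.decoder1(z1), self.decoder2(z2)
```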
3. Training Protocols and Objective Functions
At each step, the masked inputs are processed by their respective branches. For image modeling, the Adaptive branch receives $X_1$ (and therefore reconstructs the region indicated by $M$) while the Evaluation branch receives the complementary view $X_2$; the two partial reconstructions are stitched into a full-image prediction

$$\hat{X} = M \odot \hat{X}_{\mathrm{A}} + \bar{M} \odot \hat{X}_{\mathrm{E}},$$

where $\hat{X}_{\mathrm{A}}$ and $\hat{X}_{\mathrm{E}}$ are the branch outputs, and the objective is the mean squared error (MSE) over all pixels,

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|X|}\,\lVert \hat{X} - X \rVert_2^2.$$

Only the Adaptive branch is updated per step; the Evaluation branch is frozen and synchronized with the Adaptive branch's weights from the previous iteration to avoid redundant computation.
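The protocol above can be summarized in a short training-step sketch (the branch call signature and the `patchify`/`make_masks` helpers are assumptions for illustration, not the reference code):

```python
import torch
import torch.nn.functional as F

def coma_image_step(adaptive, evaluation, optimizer, images, patchify, make_masks):
    """One CoMA-style update (illustrative sketch).

    `adaptive` / `evaluation` are the two branches; `patchify` turns images into
    (B, N, P) patch targets; `make_masks` returns complementary boolean masks of
    shape (B, N, 1), where True marks a patch the branch must reconstruct.
    """
    m, m_comp = make_masks(images)

    pred_a = adaptive(images, m)                  # trained branch, mask m
    with torch.no_grad():
        pred_e = evaluation(images, m_comp)       # frozen branch, complementary mask

    # Stitch the two partial reconstructions and compute MSE over all patches.
    target = patchify(images)
    pred = torch.where(m, pred_a, pred_e)
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The Evaluation branch lags the Adaptive branch by one iteration.
    evaluation.load_state_dict(adaptive.state_dict())
    return loss.item()
```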
For point cloud modeling, the training objective combines two Chamfer-distance-based reconstruction losses (one per branch) with an explicit contrastive loss defined on decoder-extracted features of the tokens masked by both branches:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{recon}}^{(k)} &= \frac{1}{|\Omega_k|}\sum_{i\in \Omega_k} \big\lVert D_k(Z_k)_i - X_i \big\rVert_2^2, \qquad k\in\{1,2\},\\
\mathcal{L}_{\mathrm{contrast}} &= \frac{1}{|\Omega|}\sum_{i\in \Omega}\big(1 - \cos(h'_{1,i},\, h'_{2,i})\big),\\
\mathcal{L} &= \mathcal{L}_{\mathrm{recon}}^{(1)} + \mathcal{L}_{\mathrm{recon}}^{(2)} + \lambda\, \mathcal{L}_{\mathrm{contrast}},
\end{aligned}
$$

where $\Omega_k$ is the set of tokens masked in branch $k$, $D_k$ and $Z_k$ are the branch decoder and encoder latents, $h'_{1,i}$ and $h'_{2,i}$ are the decoder features of co-masked token $i$, and $\lambda$ is a weight set according to the ablation studies.
Implementation details include patch construction via farthest point sampling (FPS) followed by KNN grouping, a PointNet-style patch embedding, the AdamW optimizer, and cosine learning-rate decay over 300 epochs of pre-training. The mask ratio and decoder depth are set to the values found empirically optimal in the ablations.
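A rough sketch of how the combined point-cloud objective can be assembled (tensor layouts, the `lam` weight, and the brute-force Chamfer helper are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (B, N, 3) and b: (B, M, 3)."""
    d = torch.cdist(a, b)                                  # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def point_cmae_loss(rec1, rec2, gt1, gt2, h1, h2, co_masked, lam=1.0):
    """Per-branch Chamfer reconstruction plus a cosine contrastive term on
    decoder features of the tokens masked by both branches.

    rec/gt: reconstructed and ground-truth point groups per branch;
    h1, h2: (B, N, C) decoder features; co_masked: (B, N) boolean mask of
    tokens masked in both branches; lam is a placeholder weight.
    """
    l_rec = chamfer(rec1, gt1) + chamfer(rec2, gt2)
    f1, f2 = h1[co_masked], h2[co_masked]                  # (K, C) co-masked features
    l_con = (1 - F.cosine_similarity(f1, f2, dim=-1)).mean()
    return l_rec + lam * l_con
```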
4. Empirical Results and Ablation Analysis
Image Model Pretraining and Transfer
On ImageNet-1K classification (224×224), CoMA pre-training on DyViT outperforms MAE, BEiT, and MultiMAE at the same or higher pre-training epochs. For DyViT-B, CoMA reaches 83.9% top-1 accuracy in 300 epochs, matching the baseline MAE trained for 800 epochs, and 84.6% at 800 epochs, the highest among the compared MIM methods. Pre-training time per epoch is reduced by 10% relative to MAE/ColorMAE. Semantic segmentation on ADE20K yields 51.5 mIoU (CoMA) vs. 48.1 (MAE), and instance segmentation on COCO yields 53.1 AP_box / 46.5 AP_mask (+2.8/+2.6 over MAE).
Ablation studies indicate the crucial role of the complementary masking strategy. Varying the masking ratio in the adaptive branch from 30% to 75% finds peak accuracy at 60%; higher masking ratios (>70%) degrade reconstruction and transfer. Decoder architecture analysis shows that 8-layer global-attention ViT decoders give superior transfer. The efficiency improvement is further demonstrated by CoMA reaching state-of-the-art performance in roughly half the epochs compared to MAE.
Point Cloud Pretraining and Transfer
On ScanObjectNN (object classification), Point-CMAE achieves overall accuracy of 85.95% (no augmentation), a +0.77% improvement over Point-MAE, and 88.75% with rotation augmentation (+3.57%). On ModelNet40 (full), accuracy matches Point-MAE at 93.6%, but the linear probe and MLP-3 settings provide +1.1% and +0.27% improvements, respectively. Few-shot settings consistently show gains of 0.4% over the Point-MAE baseline. On ShapeNetPart for part segmentation, Point-CMAE improves class mIoU by 0.7% over Point-MAE.
Integration of contrastive pre-training schemes (MoCo, BYOL) directly into MAE leads to severe performance drops; the performance gains in CoMAE derive from the specific arrangement of complementary masking, dual decoders, and explicit feature-level contrastive regularization.
5. Theoretical Insights and Impact
Complementary masking enforces dense token supervision, ensuring that every input region receives direct reconstruction feedback in at least one of the two branches per sample. This not only increases statistical coverage but also mitigates the “blind patch” phenomenon in high-ratio random masking, contributing to improved stability and generalization. The explicit contrastive term imposed on co-masked tokens in CoMAE regularizes the feature space, discouraging trivial Chamfer-minimizing solutions and encouraging semantic consistency in latent space.
Separate decoders in point cloud models enforce additional semantic richness in encoder representations, as the encoder must satisfy dual (potentially non-identical) reconstruction tasks. In ablations, the combination of dual-masking, dual decoders, and contrastive loss provided non-redundant gains; omitting any one component resulted in decreased classification and segmentation performance.
The efficiency of CoMA/CoMAE is also reflected in pre-training resource consumption: in the image domain, state-of-the-art transfer is reached in a fraction of the epochs required by MAE (300 vs. 800 in the reported comparisons) and with a 10% reduction in training time per epoch.
6. Extensions, Applications, and Limitations
CoMA and CoMAE have demonstrated empirical success in both 2D image and 3D point cloud settings. The complementary masking idea, although originally tailored for patch-based dense data, could plausibly generalize to other domains where uniform sample coverage is critical, or where masking redundancy undermines efficient supervision. In practice, the technique is simple to implement and compatible with hierarchical transformer architectures.
Potential limitations include diminished performance at very high masking ratios (above roughly 70% in the DyViT experiments) and the additional memory incurred by maintaining two decoders or branches, although the Evaluation branch is frozen to limit compute cost. The explicit contrastive regularization is also sensitive to the proportion and identity of co-masked tokens: excessive overlap or near-zero overlap could undermine the regularization effect.
The plausible implication is that complementary masking constitutes a general strategy for masked autoencoding tasks seeking improved training efficiency without sacrificing feature diversity or transfer performance. This approach has established empirical benchmarks in classification, detection, and segmentation while offering a blueprint for further extensions in masked modeling and self-supervised learning frameworks.