
MaskAnyNet: Mask Reuse in Deep Learning

Updated 23 November 2025
  • MaskAnyNet is a deep learning approach that repurposes masked image regions as auxiliary semantic information to enhance feature extraction.
  • It employs a dual-branch architecture that processes both visible and masked regions, leading to consistent performance gains in classification, detection, and segmentation tasks.
  • The ToST protocol for sparse networks introduces innovations like ghost skip connections and re-scaled initialization to improve training stability and accuracy at high sparsity.

MaskAnyNet refers to a class of methodologies in deep learning focused on leveraging masked regions of images—or masked network parameter structures—for supervised learning, rather than treating them as mere noise or ignored signals. Two distinct lines of research bear this name: (1) MaskAnyNet for masked region reuse in supervised computer vision tasks (Hong et al., 16 Nov 2025), and (2) MaskAnyNet (ToST: "Training on Sparse Topologies") for improved sparse network training (Jaiswal et al., 2022). Both approaches utilize “any mask” as a principle: rather than discarding masked content, information is reclaimed to enhance model training, semantic diversity, and generalization.

1. Motivation and Fundamental Principles

Conventional image masking techniques (e.g., CutOut, Random Erasing, GridMask) discard selected pixels during training, which can degrade the recognition of small or occluded features and underutilize available image context. In contrast, Masked Image Modeling (MIM) in self-supervised learning has demonstrated that masked patches can often be reconstructed from visible regions, showing that even occluded regions hold valuable semantic and contextual information. Motivated by these findings, MaskAnyNet proposes to treat masked regions as auxiliary semantic knowledge for reuse rather than as information to be ignored, aiming to enrich feature representations and boost performance on fine-grained or occlusion-robust tasks (Hong et al., 16 Nov 2025).

In parallel, sparse neural networks—often created via pruning—typically discard the weights associated with masked (pruned) connections, focusing entirely on the quality of the subnetwork. MaskAnyNet (ToST) demonstrates that orthogonal improvements in the training protocol, regardless of mask type, can provide consistent accuracy gains at high sparsity, further generalizing the "any mask" concept (Jaiswal et al., 2022).

2. Architectural Design and Dual-Branch Mechanism

MaskAnyNet’s architectural innovation is a dual-branch pipeline for supervised learning tasks:

  1. The primary branch processes the standard masked image, retaining global context.
  2. The auxiliary "mask-region reuse branch" extracts the masked-out regions from the original image, spatially stitches them into a new image (reuse image), and processes this through the same low-level backbone as the primary branch.

Low-level features from both branches are concatenated channel-wise, then aligned and fused through a stack of three convolutional layers with residual connections. These features are forwarded to a shared high-level backbone for task prediction (e.g., classification head). The approach can be applied to CNNs (e.g., ResNet-34, EfficientNet-V2, ConvNeXt-V2) and Vision Transformer backbones (e.g., ViT-B/16, Swin-Tiny) (Hong et al., 16 Nov 2025).
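The PyTorch sketch below illustrates this dual-branch fusion under stated assumptions: the module and argument names (DualBranchFusion, low_backbone, high_backbone, channels), the 1x1/3x3 layer choices, and the placement of the residual connection are illustrative, not the authors' reference implementation; normalization and activation layers are omitted for brevity.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Minimal sketch of the dual-branch reuse pipeline described above.
    Layer sizes and the backbone split point are assumptions."""

    def __init__(self, low_backbone: nn.Module, high_backbone: nn.Module, channels: int):
        super().__init__()
        self.low_backbone = low_backbone      # shared low-level stage (e.g. early ResNet blocks)
        self.high_backbone = high_backbone    # shared high-level stage plus task head
        # three convolutional layers that align and fuse the concatenated features
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x_masked: torch.Tensor, x_reuse: torch.Tensor) -> torch.Tensor:
        f_vis = self.low_backbone(x_masked)   # primary branch: masked image
        f_mask = self.low_backbone(x_reuse)   # auxiliary branch: stitched masked-out regions
        fused = self.fuse(torch.cat([f_vis, f_mask], dim=1))
        fused = fused + f_vis                 # residual connection onto the primary features (assumed placement)
        return self.high_backbone(fused)
```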

In sparse network training (ToST), a network with arbitrary pruning masks undergoes modified dense-protocol training. Ghost Soft Neurons (PSwish activations) replace ReLU in the early phase; extra ghost skip connections are temporarily added to stabilize sparse layers; layer-wise initialization is re-scaled before main training; and label smoothing reduces overconfidence (Jaiswal et al., 2022).
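As a rough illustration of the ghost soft neuron idea, the sketch below implements a Swish-style activation x * sigmoid(beta * x) whose beta is annealed so the unit approaches ReLU by the end of the ghost phase; the specific schedule and hyperparameters here are assumptions, not the ToST recipe.

```python
import torch
import torch.nn as nn

class PSwish(nn.Module):
    """Sketch of a 'ghost soft neuron': x * sigmoid(beta * x).
    A small beta gives a smooth, leaky response early in training; annealing
    beta upward makes the unit approach ReLU, after which it can be replaced
    by ReLU. The linear schedule below is an assumption."""

    def __init__(self, beta_start: float = 1.0, beta_end: float = 20.0, ghost_epochs: int = 90):
        super().__init__()
        self.beta_start, self.beta_end, self.ghost_epochs = beta_start, beta_end, ghost_epochs
        self.beta = beta_start

    def step_epoch(self, epoch: int) -> None:
        # anneal beta over the ghost phase (assumed to match the first learning-rate stage)
        t = min(epoch / self.ghost_epochs, 1.0)
        self.beta = self.beta_start + t * (self.beta_end - self.beta_start)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)
```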

3. Mathematical Formulation and Optimization

Let $X$ denote the input image with label $y$. A binary mask of ratio $m$ is sampled (optimal range 20%–30%, default 25%). This yields a masked image $X_m$ and a reuse image $R$ comprising the masked-out patches.
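A minimal sketch of one way to realize this split is shown below, assuming square patches and a simple square stitching layout for the reuse image; both the function name and the stitching scheme are assumptions, and the paper's exact patching may differ.

```python
import torch

def split_mask_and_reuse(x: torch.Tensor, patch: int = 16, ratio: float = 0.25):
    """Sample a binary patch mask of the given ratio, zero the selected patches
    to obtain the masked image X_m, and stitch the removed patches into a
    smaller reuse image R (square layout assumed)."""
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(round(ratio * n_patches))

    # view the image as a sequence of (patch x patch) tiles: (B, N, C, p, p)
    tiles = x.unfold(2, patch, patch).unfold(3, patch, patch)
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(b, n_patches, c, patch, patch)

    idx = torch.randperm(n_patches)[:n_masked]       # tiles to mask out

    x_masked = x.clone()
    for i in idx.tolist():                           # zero the masked tiles in X_m
        r, col = divmod(i, gw)
        x_masked[:, :, r*patch:(r+1)*patch, col*patch:(col+1)*patch] = 0.0

    # stitch the removed tiles into an approximately square grid -> reuse image R
    side = max(int(n_masked ** 0.5), 1)
    used = side * side
    reuse = tiles[:, idx[:used]].reshape(b, side, side, c, patch, patch)
    reuse = reuse.permute(0, 3, 1, 4, 2, 5).reshape(b, c, side * patch, side * patch)
    return x_masked, reuse
```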

Model predictions $p_{\mathrm{vis}} = \mathrm{softmax}(f_{\mathrm{vis}}(X_m))$ and $p_{\mathrm{mask}} = \mathrm{softmax}(f_{\mathrm{mask}}(R))$ correspond to the visible and reused branches, respectively. The overall objective combines both predictions:

$$L_{\mathrm{total}} = L_{\mathrm{vis}} + \lambda\, L_{\mathrm{mask}}$$

with $L_{\mathrm{vis}}$ and $L_{\mathrm{mask}}$ denoting cross-entropy losses and $\lambda = 1.0$ for equal weighting.
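In PyTorch terms the objective is simply two cross-entropy terms on the same label; the function name below is illustrative.

```python
import torch.nn.functional as F

def total_loss(logits_vis, logits_mask, y, lam: float = 1.0):
    # L_total = L_vis + lambda * L_mask, with lambda = 1.0 for equal weighting
    return F.cross_entropy(logits_vis, y) + lam * F.cross_entropy(logits_mask, y)
```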

Feature fusion strategies are benchmarked:

Fusion Level     Accuracy Gain (ResNet)   Parameter Increase   Notes
Image-level      Low                      Minimal              Concatenate inputs
Feature-level    +0.45% Top-1             +5–7%                Adopted: after low-level blocks
Decision-level   Poor alignment           High                 Fuse at logits

Feature-level fusion consistently yields the best accuracy/efficiency trade-off (Hong et al., 16 Nov 2025).

For sparse networks, ToST applies a binary mask $M \in \{0,1\}^n$ to the weights $\theta$, yielding the subnetwork $f(\theta \odot M;\, x)$. Four protocol modifications are applied in the early training stage, phasing out the ghost elements at the first learning-rate decay (Jaiswal et al., 2022).
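A common way to keep the subnetwork fixed during training is to re-apply the binary masks after every optimizer step; the sketch below assumes `masks` maps parameter names to {0,1} tensors of matching shape and is illustrative rather than the paper's exact implementation.

```python
import torch

@torch.no_grad()
def apply_masks(model: torch.nn.Module, masks: dict) -> None:
    """Enforce theta <- theta * M so the trained network stays f(theta ⊙ M; x)."""
    for name, param in model.named_parameters():
        if name in masks:
            param.mul_(masks[name])

# typical usage inside the training loop (assumed):
#   loss.backward(); optimizer.step(); apply_masks(model, masks)
```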

4. Implementation Details

MaskAnyNet uses a twofold masking approach that combines random fine patches with fixed-stride grid masks. Standard augmentation (random crop, flip, color jitter, normalization) is applied. Training hyperparameters for CNNs are the SGD optimizer with learning rate 0.01, momentum 0.937, weight decay $5\times10^{-4}$, and batch size 32; Transformer backbones use AdamW with learning rate $6\times10^{-5}$, weight decay $10^{-2}$, and batch size 16 (Hong et al., 16 Nov 2025). For sparse networks, the reference recipe uses SGD with momentum 0.9, weight decay $2\times10^{-4}$, and batch size 128, with the learning rate dropping at epochs 90 and 135 over 180 total epochs. Ghost soft neurons and ghost skip connections follow phased schedules during the first learning-rate stage. Layer-wise re-scaled initialization is performed by learning scalar gates $\gamma$ per block for a few iterations and then freezing them (Jaiswal et al., 2022).
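The sketch below wires up the quoted hyperparameters; anything not stated above (the function name, the base learning rate and decay factor for the sparse recipe) is an assumption.

```python
import torch

def build_optimizer(model: torch.nn.Module, setup: str):
    """Return (optimizer, scheduler) for the configurations quoted above."""
    if setup == "cnn":            # MaskAnyNet CNN backbones
        opt = torch.optim.SGD(model.parameters(), lr=0.01,
                              momentum=0.937, weight_decay=5e-4)
        return opt, None
    if setup == "transformer":    # MaskAnyNet ViT / Swin backbones
        opt = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)
        return opt, None
    if setup == "tost_sparse":    # ToST sparse-network recipe, 180 epochs
        opt = torch.optim.SGD(model.parameters(), lr=0.1,  # base LR assumed, not stated above
                              momentum=0.9, weight_decay=2e-4)
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[90, 135], gamma=0.1)
        return opt, sched
    raise ValueError(f"unknown setup: {setup}")
```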

5. Experimental Results and Benchmarking

MaskAnyNet (Region Reuse in Supervised Vision)

  • Classification:
    • CIFAR-10/100: ResNet-34 achieves +1.56%/+1.78% gain; EfficientNet-V2 +1.03%/+1.49%; ViT +0.42%/+0.97%; Swin +0.67%/+0.77%.
    • ImageNet-1K: ResNet-34 +1.45% Top-1; EfficientNet-V2 +1.12%; ViT-B/16 +1.56%; Swin-T +1.32%.
    • Tiny-ImageNet: Up to +2.08% (ResNet-34).
  • Object Detection:
  • Semantic Segmentation:
    • DeepLabV3+ (MobileNetV2): +1.83% mIoU; (ResNet-101): +1.22%.
    • SegFormer (MiT-B0): +1.28%; (MiT-B4): +0.71%.

Ablations confirm that the dual-branch reuse (+0.85% to +1.06%) and the feature fusion module (+0.06% to +0.12%) each provide non-trivial incremental gains. The best performance is observed at a mask ratio of roughly 25% (Hong et al., 16 Nov 2025).

MaskAnyNet (ToST for Sparse Networks)

  • Generalization Across Sparsity:
    • On CIFAR-10/100/Tiny-ImageNet and architectures spanning ResNet, VGG, and MobileNet, ToST provides +0.5–3% absolute accuracy gains over standard sparse-from-scratch protocols at 90–98% sparsity.
  • Mask-Invariance:
    • Gains are consistently realized for all major mask-selection schemes (Lottery Ticket, SNIP, GraSP, SynFlow, Random).
    • At 95% sparsity on TinyImageNet, SNIP rises from 51.19% to 53.21% and LTH from 53.19% to 54.76% using ToST (Jaiswal et al., 2022).
  • Component Analysis:
    • Ghost skip connections provide up to +2.15% at extreme sparsity; ghost soft neurons +1.28%; label smoothing and initialization independently offer ~+0.4–0.8%. Full protocol outperforms individual elements (Jaiswal et al., 2022).

6. Analysis: Information Diversity and Fine-Grained Preservation

Feature entropy and similarity metrics indicate that patch and grid masks increase feature entropy by approximately 0.92, compared with 0.61 for random masking, with patch masking delivering the best balance between diversity and feature reliability. Grad-CAM and feature visualizations indicate that MaskAnyNet enhances attention on object boundaries, suppresses background distraction, and preserves spatially fine-grained features in both shallow and deep network layers (Hong et al., 16 Nov 2025).
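For concreteness, one simple way to estimate such a feature-entropy score is sketched below; the paper's exact metric is not reproduced here, so this should be read only as an illustrative proxy.

```python
import torch

def feature_entropy(fmap: torch.Tensor) -> torch.Tensor:
    """Illustrative proxy: treat the softmax-normalized spatial activations of
    each channel as a distribution and average the Shannon entropy over
    samples and channels. The paper's definition may differ."""
    b, c, h, w = fmap.shape
    p = torch.softmax(fmap.reshape(b, c, h * w), dim=-1)
    ent = -(p * (p + 1e-12).log()).sum(dim=-1)   # entropy per (sample, channel)
    return ent.mean()
```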

ToST’s protocol modifications improve gradient flow (activation sparsity reduced from ~39% to 0.3%), flatten the loss landscape (lower top Hessian eigenvalues), and retain useful initializations (due to batch normalization absorbing scale changes). Label smoothing reduces logit overconfidence, enhancing calibration, especially in highly sparse networks (Jaiswal et al., 2022).

7. Limitations, Applications, and Future Directions

While MaskAnyNet’s dual-branch reuse offers systematic performance gains with minimal computational overhead (+5–7% parameters), limitations remain. For the sparse-network variant, improvements are marginal at very low sparsity, and the ghost elements rely on appropriate schedule tuning; abrupt protocol changes mid-training are a potential failure mode. Extending the sparse-protocol advances to transformer-based models may require additional tuning.

Applications include fine-grained image classification (e.g., medical imaging, biodiversity), robust object detection and segmentation under occlusion, and any supervised vision task where global-local feature fusion is critical. MaskAnyNet (ToST) provides a universal, zero-inference-overhead method enhancing the trainability of any fixed sparse mask. Both lines demonstrate that reframing masked-out content as an auxiliary source of semantic diversity measurably improves deep learning model generalization (Hong et al., 16 Nov 2025, Jaiswal et al., 2022).
