LGCOAMix: Superpixel Data Augmentation
- LGCOAMix is a context-aware, object-part-aware superpixel-based data augmentation technique that preserves semantic consistency in mixed images.
- It employs superpixel grid blending and integrated local-global context learning to maintain fine object details and improve training efficiency.
- Empirical results on benchmarks like CIFAR100 and CUB200-2011 show enhanced classification accuracy and robust performance for both CNNs and transformers.
LGCOAMix is a context-aware and object-part-aware superpixel-based data augmentation technique designed to overcome the generalization bottlenecks of existing cutmix-style augmentation methods for deep visual recognition. By generating augmented samples via superpixel-based grid blending and integrating both local and global context learning, LGCOAMix achieves improved semantic consistency of mixed images and labels, efficient training, and enhanced discriminative representation for both convolutional and transformer architectures. It is the first approach to propose a label mixing strategy using superpixel attention within the cutmix data augmentation paradigm and is unique in learning local features from discriminative superpixel-wise regions and across-image superpixel contrasts (Dornaika et al., 28 Nov 2025).
1. Motivation and Limitations of Prior Work
Cutmix-based data augmentation (such as CutMix, GridMix, SaliencyMix) constructs augmented images by patch-wise cut-and-paste operations, typically using rectangular or grid-shaped regions. While these strategies introduce global image-level variation and have exhibited strong generalization, they typically degrade local contextual focus and disrupt semantically coherent object parts (e.g., mixing distinct bird body regions), leading to performance ceilings—especially in fine-grained recognition. Furthermore, mixed label computation in prior art is primarily area-based, often resulting in label–image mismatches when background or non-discriminative regions dominate, requiring inefficient solutions like double forward propagation or dependencies on external saliency models. LGCOAMix directly addresses these issues by (1) employing superpixel-wise region mixing to preserve object parts, (2) learning both global and local (superpixel-based) context, and (3) realizing one-pass, semantically consistent label mixing without external models (Dornaika et al., 28 Nov 2025).
2. Superpixel-Based Grid Generation and Mixing
LGCOAMix utilizes the SLIC superpixel algorithm to partition each source image $x_A$ and $x_B$ into a randomly sampled number of superpixels $n$. This produces superpixel maps $S_A$ and $S_B$, where each pixel is associated with a superpixel label in $\{1, \dots, n\}$. The LGCOAMixer component then randomly samples each superpixel in $S_A$ with Bernoulli probability $p$ to generate a binary mask $M$. The mixed image $\tilde{x}$ and its corresponding superpixel map $\tilde{S}$ are computed as:

$$\tilde{x} = M \odot x_A + (1 - M) \odot x_B, \qquad \tilde{S} = M \odot S_A + (1 - M) \odot S_B.$$
This approach enables boundary-conforming region mixing, ensuring that object-part information and fine structures are preserved. Boundary-truncated superpixels are retained without degrading semantic consistency (Dornaika et al., 28 Nov 2025).
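A minimal sketch of this mixing step, assuming scikit-image's SLIC and NumPy; the function and parameter names (`superpixel_mix`, `n_sp`, `p_keep`) are ours, not the paper's:

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_mix(x_a, x_b, n_sp=35, p_keep=0.5, seed=None):
    """Mix two HxWx3 images along SLIC superpixel boundaries.

    Returns the mixed image, a mixed superpixel map with disjoint labels,
    and the binary pixel mask M (1 where pixels come from x_a).
    """
    rng = np.random.default_rng(seed)
    # Partition each source image into ~n_sp superpixels.
    s_a = slic(x_a, n_segments=n_sp, start_label=0)
    s_b = slic(x_b, n_segments=n_sp, start_label=0)
    # Keep each superpixel of image A with Bernoulli probability p_keep.
    keep = rng.random(s_a.max() + 1) < p_keep
    m = keep[s_a].astype(np.float32)[..., None]                # H x W x 1
    # Boundary-conforming mixing: x~ = M*x_a + (1-M)*x_b (same for maps).
    x_mix = (m * x_a + (1.0 - m) * x_b).astype(x_a.dtype)
    s_mix = np.where(m[..., 0] > 0, s_a, s_b + s_a.max() + 1)  # disjoint ids
    return x_mix, s_mix, m[..., 0]
```

Because superpixels follow image contours, even regions truncated at mask edges still align with object boundaries, which is what preserves part-level semantics.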
3. Joint Local and Global Context Representation Learning
LGCOAMix integrates both global and local context modeling in its training pipeline:
- Global Representation: The mixed image $\tilde{x}$ is processed through an encoder, yielding a feature map $F$. A global classifier produces class predictions that are supervised by the mixed label $\tilde{y}$ through the global classification loss $\mathcal{L}_{gl}$.
- Local/Superpixel Representation: The decoded high-resolution features are average-pooled within each superpixel in $\tilde{S}$, yielding a sequence of feature vectors $z_i$, one per superpixel.
- Superpixel Self-Attention: Each $z_i$ is projected into query, key, and value representations and updated using scaled dot-product self-attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^\top / \sqrt{d}\big)V$.
- Region Weights and Local Loss: Attention weights $w_i$ identify the most discriminative superpixels. The top-ranked fraction (typically $0.7$) is classified with a dedicated local head, contributing the local classification loss $\mathcal{L}_{loc}$.

Additionally, a superpixel-wise contrastive loss $\mathcal{L}_{con}$ is adopted, aligning selected superpixel features across batch samples of the same class to promote invariance and enhance representation discriminability (Dornaika et al., 28 Nov 2025). A condensed sketch of the pooling and attention steps follows.
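The sketch below assumes a $C \times H \times W$ feature map and the `s_mix` map from the earlier sketch; the single attention head and the use of mean incoming attention as the per-region importance weight are our simplifications:

```python
import torch

def superpixel_features(feats, s_mix):
    """Average-pool a C x H x W feature map within each superpixel."""
    labels = torch.as_tensor(s_mix, dtype=torch.long).flatten()    # H*W
    flat = feats.flatten(1).t()                                    # H*W x C
    n_sp = int(labels.max()) + 1
    sums = torch.zeros(n_sp, flat.size(1)).index_add_(0, labels, flat)
    counts = torch.bincount(labels, minlength=n_sp).clamp(min=1)
    return sums / counts[:, None].float()                          # n_sp x C

def superpixel_attention(z, wq, wk, wv):
    """Scaled dot-product self-attention over the superpixel sequence."""
    q, k, v = z @ wq, z @ wk, z @ wv
    a = torch.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)       # n_sp x n_sp
    # Per-superpixel importance: average attention each region receives.
    w = a.mean(dim=0)                                              # n_sp
    return a @ v, w
```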
4. Semantic Superpixel Attention for Label Mixing
LGCOAMix introduces a novel mixed-label calculation based on superpixel attention rather than area alone:

$$\lambda = \frac{\sum_{i \in \tilde{S}_A} w_i\,|s_i|}{\sum_{i \in \tilde{S}} w_i\,|s_i|},$$

where $i$ indexes superpixels from $\tilde{S}$ (with $\tilde{S}_A$ denoting those originating from $x_A$), $w_i$ is the attention weight, and $|s_i|$ is the area of superpixel $s_i$. The final mixed label is:

$$\tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B.$$

This process delivers semantic consistency between image regions and label proportions, eliminates reliance on multiple forward passes or pretrained saliency networks, and is applicable to both CNNs and Vision Transformers (Dornaika et al., 28 Nov 2025).
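A sketch of this computation under the reading above; the names (`mixed_label`, `from_a`) are ours:

```python
import torch

def mixed_label(y_a, y_b, w, areas, from_a):
    """
    w:        (n_sp,) attention weight per superpixel of the mixed image
    areas:    (n_sp,) pixel count of each superpixel
    from_a:   (n_sp,) bool tensor, True if the superpixel came from image A
    y_a, y_b: one-hot (or soft) label vectors
    """
    # Each superpixel contributes attention x area; lambda is image A's share.
    contrib = w * areas
    lam = contrib[from_a].sum() / contrib.sum()
    return lam * y_a + (1.0 - lam) * y_b
```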
5. Algorithmic Workflow
The LGCOAMix pipeline involves the following sequential steps:
| Step | Operation | Output |
|---|---|---|
| 1 | Superpixel sampling on inputs ($x_A$, $x_B$) | Superpixel maps $S_A$, $S_B$ |
| 2 | Bernoulli mask generation on $S_A$ (probability $p$) | Binary mask $M$ |
| 3 | Mixing (Eq. 4) | $\tilde{x}$, $\tilde{S}$ |
| 4 | Encoding and decoding | Feature map $F$, decoded features |
| 5 | Superpixel pooling + self-attention | Superpixel features $z_i$, weights $w_i$, top-ranked regions |
| 6 | Label mixing via attention (Eq. 5, 2) | Mixed label $\tilde{y}$ |
| 7–9 | Compute $\mathcal{L}_{gl}$, $\mathcal{L}_{loc}$, $\mathcal{L}_{con}$ | Total training loss |
| 10 | Backpropagation (inference: encoder + global head only) | Updated weights |
Batch formation and training utilize SGD with momentum 0.9 and weight decay, with learning rates and batch sizes tailored to each benchmark dataset. At inference, only the encoder and global classifier are required, ensuring runtime equivalence to OcCaMix (Dornaika et al., 28 Nov 2025). A condensed sketch of how the three losses combine is given below.
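This sketch covers steps 7–9 of the table; `alpha` and `gamma` are placeholder weights, and the contrastive term is a simplified same-class alignment surrogate rather than the paper's exact formulation:

```python
import torch.nn.functional as F

def lgcoamix_loss(logits_global, y_mix, logits_local, y_local,
                  z_sel, z_pos, alpha=1.0, gamma=1.0, tau=0.1):
    # Global classification: soft-label cross-entropy against the mixed label.
    l_gl = -(y_mix * F.log_softmax(logits_global, dim=-1)).sum(-1).mean()
    # Local classification on the top-ranked discriminative superpixels.
    l_loc = F.cross_entropy(logits_local, y_local)
    # Contrastive stand-in: pull same-class superpixel features together.
    sim = F.cosine_similarity(z_sel, z_pos, dim=-1) / tau
    l_con = -F.logsigmoid(sim).mean()
    return l_gl + alpha * l_loc + gamma * l_con
```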
6. Empirical Results
LGCOAMix demonstrates state-of-the-art performance on major visual recognition tasks. For classification (Top-1 accuracy):
- CIFAR100/ResNet18: Baseline 78.58, CutMix 79.69, AutoMix 82.04, LGCOAMix 82.34 (+0.30 over best prior)
- TinyImageNet/ResNet18: Baseline 61.66, CutMix 64.35, OcCaMix 67.35, LGCOAMix 68.27 (+0.92)
- CUB200-2011/ResNeXt50: Baseline 81.41, CutMix 82.63, AutoMix 83.52, LGCOAMix 84.37 (+0.85)
- Stanford Dogs/ResNet50: Baseline 61.46, CutMix 63.92, OcCaMix 69.34, LGCOAMix 70.95
- ViT-B/16 on CUB200-2011: Baseline 80.45, LGCOAMix 82.20
For weakly supervised object localization (CUB200-2011, ResNet50), localization accuracy improves from 50.21% (baseline) and 55.22% (CutMix) to 58.65% (LGCOAMix).
Ablation studies reveal that switching from a square grid to a superpixel grid yields +0.78%, adding local classification +0.76%, and contrastive learning +0.31%, for a total gain of +3.76% over the baseline (CIFAR100/ResNet18), supporting the method's core design choices (Dornaika et al., 28 Nov 2025).
7. Implementation, Hyperparameters, and Public Resources
Superpixel generation exploits SLIC (Achanta et al., 2012) with the number of superpixels $n$ set per dataset (e.g., $n \sim U(30, 40)$ for CIFAR100). The masking probability is fixed at $p = 0.5$, the value that maximizes mask diversity. The top-ranked region selection parameter ranges over $[0.6, 0.8]$ and is generally set to $0.7$. Loss weights for $\mathcal{L}_{loc}$ and $\mathcal{L}_{con}$ are tuned per dataset (values reported for CIFAR100). All major benchmarks report inference speed comparable to state-of-the-art methods. Source code is available at https://github.com/DanielaPlusPlus/LGCOAMix (Dornaika et al., 28 Nov 2025).
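For convenience, the reported settings can be collected in a small config sketch; the key names are ours, and the loss weights (not given numerically in this text) are left as placeholders:

```python
# Hyperparameters as reported for CIFAR100; None marks values elided here.
CIFAR100_CONFIG = {
    "n_superpixels": ("uniform", 30, 40),  # n ~ U(30, 40)
    "p_keep": 0.5,                         # Bernoulli masking probability
    "top_region_fraction": 0.7,            # chosen from [0.6, 0.8]
    "optimizer": "SGD",
    "momentum": 0.9,
    "alpha_local": None,                   # placeholder: see paper
    "gamma_contrastive": None,             # placeholder: see paper
}
```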