LGCOAMix: Superpixel Data Augmentation
- LGCOAMix is a context-aware, object-part-aware superpixel-based data augmentation technique that preserves semantic consistency in mixed images.
- It employs superpixel grid blending and integrated local-global context learning to maintain fine object details and improve training efficiency.
- Empirical results on benchmarks like CIFAR100 and CUB200-2011 show enhanced classification accuracy and robust performance for both CNNs and transformers.
LGCOAMix is a context-aware and object-part-aware superpixel-based data augmentation technique designed to overcome the generalization bottlenecks of existing cutmix-style augmentation methods for deep visual recognition. By generating augmented samples via superpixel-based grid blending and integrating both local and global context learning, LGCOAMix achieves improved semantic consistency of mixed images and labels, efficient training, and enhanced discriminative representation for both convolutional and transformer architectures. It is the first approach to propose a label mixing strategy using superpixel attention within the cutmix data augmentation paradigm and is unique in learning local features from discriminative superpixel-wise regions and across-image superpixel contrasts (Dornaika et al., 28 Nov 2025).
1. Motivation and Limitations of Prior Work
Cutmix-based data augmentation (such as CutMix, GridMix, SaliencyMix) constructs augmented images by patch-wise cut-and-paste operations, typically using rectangular or grid-shaped regions. While these strategies introduce global image-level variation and have exhibited strong generalization, they typically degrade local contextual focus and disrupt semantically coherent object parts (e.g., mixing distinct bird body regions), leading to performance ceilings—especially in fine-grained recognition. Furthermore, mixed label computation in prior art is primarily area-based, often resulting in label–image mismatches when background or non-discriminative regions dominate, requiring inefficient solutions like double forward propagation or dependencies on external saliency models. LGCOAMix directly addresses these issues by (1) employing superpixel-wise region mixing to preserve object parts, (2) learning both global and local (superpixel-based) context, and (3) realizing one-pass, semantically consistent label mixing without external models (Dornaika et al., 28 Nov 2025).
2. Superpixel-Based Grid Generation and Mixing
LGCOAMix utilizes the SLIC superpixel algorithm to partition each source image $x_A$ and $x_B$ into a randomly sampled number of superpixels $n$. This produces superpixel maps $S_A$ and $S_B$, where each pixel is associated with a superpixel label in $\{1, \dots, n\}$. The LGCOAMixer component then randomly samples each superpixel in $S_A$ with Bernoulli probability $p$ to generate a binary mask $M$. The mixed image $\tilde{x}$ and its corresponding superpixel map $\tilde{S}$ are computed as:

$$\tilde{x} = M \odot x_A + (1 - M) \odot x_B, \qquad \tilde{S} = M \odot S_A + (1 - M) \odot S_B.$$
This approach enables boundary-conforming region mixing, ensuring that object-part information and fine structures are preserved. Boundary-truncated superpixels are retained without degrading semantic consistency (Dornaika et al., 28 Nov 2025).
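A minimal sketch of this mixing step, assuming scikit-image's SLIC and NumPy; the function and parameter names (`superpixel_mix`, `n_sp`, `p_keep`) are ours, not the paper's:

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_mix(x_a, x_b, n_sp=35, p_keep=0.5, seed=None):
    """Mix two HxWx3 images along SLIC superpixel boundaries.

    Returns the mixed image, a mixed superpixel map with disjoint labels,
    and the binary pixel mask M (1 where pixels come from x_a).
    """
    rng = np.random.default_rng(seed)
    # Partition each source image into ~n_sp superpixels.
    s_a = slic(x_a, n_segments=n_sp, start_label=0)
    s_b = slic(x_b, n_segments=n_sp, start_label=0)
    # Keep each superpixel of image A with Bernoulli probability p_keep.
    keep = rng.random(s_a.max() + 1) < p_keep
    m = keep[s_a].astype(np.float32)[..., None]                # H x W x 1
    # Boundary-conforming mixing: x~ = M*x_a + (1-M)*x_b (same for maps).
    x_mix = (m * x_a + (1.0 - m) * x_b).astype(x_a.dtype)
    s_mix = np.where(m[..., 0] > 0, s_a, s_b + s_a.max() + 1)  # disjoint ids
    return x_mix, s_mix, m[..., 0]
```

Because superpixels follow image contours, even regions truncated at mask edges still align with object boundaries, which is what preserves part-level semantics.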
3. Joint Local and Global Context Representation Learning
LGCOAMix integrates both global and local context modeling in its training pipeline:
- Global Representation: The mixed image $\tilde{x}$ is processed through an encoder, yielding a feature map $F$. A global classifier produces class predictions that are supervised by the mixed label $\tilde{y}$ through the global classification loss $\mathcal{L}_{gl}$.
- Local/Superpixel Representation: The decoded high-resolution features are average-pooled within each superpixel in $\tilde{S}$, yielding a sequence of feature vectors $z_i$, one per superpixel.
- Superpixel Self-Attention: Each $z_i$ is projected into query, key, and value representations and updated using scaled dot-product self-attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^\top / \sqrt{d}\big)V$.
- Region Weights and Local Loss: Attention weights $w_i$ identify the most discriminative superpixels. The top-ranked fraction (typically $0.7$) is classified with a dedicated local head, contributing the local classification loss $\mathcal{L}_{loc}$.

Additionally, a superpixel-wise contrastive loss $\mathcal{L}_{con}$ is adopted, aligning selected superpixel features across batch samples of the same class to promote invariance and enhance representation discriminability (Dornaika et al., 28 Nov 2025). A condensed sketch of the pooling and attention steps follows.
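The sketch below assumes a $C \times H \times W$ feature map and the `s_mix` map from the earlier sketch; the single attention head and the use of mean incoming attention as the per-region importance weight are our simplifications:

```python
import torch

def superpixel_features(feats, s_mix):
    """Average-pool a C x H x W feature map within each superpixel."""
    labels = torch.as_tensor(s_mix, dtype=torch.long).flatten()    # H*W
    flat = feats.flatten(1).t()                                    # H*W x C
    n_sp = int(labels.max()) + 1
    sums = torch.zeros(n_sp, flat.size(1)).index_add_(0, labels, flat)
    counts = torch.bincount(labels, minlength=n_sp).clamp(min=1)
    return sums / counts[:, None].float()                          # n_sp x C

def superpixel_attention(z, wq, wk, wv):
    """Scaled dot-product self-attention over the superpixel sequence."""
    q, k, v = z @ wq, z @ wk, z @ wv
    a = torch.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)       # n_sp x n_sp
    # Per-superpixel importance: average attention each region receives.
    w = a.mean(dim=0)                                              # n_sp
    return a @ v, w
```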
4. Semantic Superpixel Attention for Label Mixing
LGCOAMix introduces a novel mixed-label calculation based on superpixel attention rather than area alone:

$$\lambda = \frac{\sum_{i \in \tilde{S}_A} w_i\,|s_i|}{\sum_{i \in \tilde{S}} w_i\,|s_i|},$$

where $i$ indexes superpixels from $\tilde{S}$ (with $\tilde{S}_A$ denoting those originating from $x_A$), $w_i$ is the attention weight, and $|s_i|$ is the area of superpixel $s_i$. The final mixed label is:

$$\tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B.$$

This process delivers semantic consistency between image regions and label proportions, eliminates reliance on multiple forward passes or pretrained saliency networks, and is applicable to both CNNs and Vision Transformers (Dornaika et al., 28 Nov 2025).
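A sketch of this computation under the reading above; the names (`mixed_label`, `from_a`) are ours:

```python
import torch

def mixed_label(y_a, y_b, w, areas, from_a):
    """
    w:        (n_sp,) attention weight per superpixel of the mixed image
    areas:    (n_sp,) pixel count of each superpixel
    from_a:   (n_sp,) bool tensor, True if the superpixel came from image A
    y_a, y_b: one-hot (or soft) label vectors
    """
    # Each superpixel contributes attention x area; lambda is image A's share.
    contrib = w * areas
    lam = contrib[from_a].sum() / contrib.sum()
    return lam * y_a + (1.0 - lam) * y_b
```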
5. Algorithmic Workflow
The LGCOAMix pipeline involves the following sequential steps:
| Step | Operation | Output |
|---|---|---|
| 1 | Superpixel sampling on inputs ($x_A$, $x_B$) | Superpixel maps $S_A$, $S_B$ |
| 2 | Bernoulli mask generation on $S_A$ (probability $p$) | Binary mask $M$ |
| 3 | Mixing (Eq. 4) | $\tilde{x}$, $\tilde{S}$ |
| 4 | Encoding and decoding | Feature map $F$, decoded features |
| 5 | Superpixel pooling + self-attention | Superpixel features $z_i$, weights $w_i$, top-ranked regions |
| 6 | Label mixing via attention (Eq. 5, 2) | Mixed label $\tilde{y}$ |
| 7–9 | Compute $\mathcal{L}_{gl}$, $\mathcal{L}_{loc}$, $\mathcal{L}_{con}$ | Total training loss |
| 10 | Backpropagation (inference: encoder + global head only) | Updated weights |
Batch formation and training utilize SGD with momentum 0.9 and weight decay, with learning rates and batch sizes tailored to each benchmark dataset. At inference, only the encoder and global classifier are required, ensuring runtime equivalence to OcCaMix (Dornaika et al., 28 Nov 2025). A condensed sketch of how the three losses combine is given below.
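This sketch covers steps 7–9 of the table; `alpha` and `gamma` are placeholder weights, and the contrastive term is a simplified same-class alignment surrogate rather than the paper's exact formulation:

```python
import torch.nn.functional as F

def lgcoamix_loss(logits_global, y_mix, logits_local, y_local,
                  z_sel, z_pos, alpha=1.0, gamma=1.0, tau=0.1):
    # Global classification: soft-label cross-entropy against the mixed label.
    l_gl = -(y_mix * F.log_softmax(logits_global, dim=-1)).sum(-1).mean()
    # Local classification on the top-ranked discriminative superpixels.
    l_loc = F.cross_entropy(logits_local, y_local)
    # Contrastive stand-in: pull same-class superpixel features together.
    sim = F.cosine_similarity(z_sel, z_pos, dim=-1) / tau
    l_con = -F.logsigmoid(sim).mean()
    return l_gl + alpha * l_loc + gamma * l_con
```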
6. Empirical Results
LGCOAMix demonstrates state-of-the-art performance on major visual recognition tasks. For classification (Top-1 accuracy):
- CIFAR100/ResNet18: Baseline 78.58, CutMix 79.69, AutoMix 82.04, LGCOAMix 82.34 (+0.30 over best prior)
- TinyImageNet/ResNet18: Baseline 61.66, CutMix 64.35, OcCaMix 67.35, LGCOAMix 68.27 (+0.92)
- CUB200-2011/ResNeXt50: Baseline 81.41, CutMix 82.63, AutoMix 83.52, LGCOAMix 84.37 (+0.85)
- Stanford Dogs/ResNet50: Baseline 61.46, CutMix 63.92, OcCaMix 69.34, LGCOAMix 70.95
- ViT-B/16 on CUB200-2011: Baseline 80.45, LGCOAMix 82.20
For weakly supervised object localization (CUB200-2011, ResNet50), localization accuracy improves from 50.21% (baseline) and 55.22% (CutMix) to 58.65% (LGCOAMix).
Ablation studies reveal that switching from a square grid to a superpixel grid yields +0.78%, adding local classification +0.76%, and contrastive learning +0.31%, for a total gain of +3.76% over the baseline (CIFAR100/ResNet18), supporting the method's core design choices (Dornaika et al., 28 Nov 2025).
7. Implementation, Hyperparameters, and Public Resources
Superpixel generation exploits SLIC (Achanta et al., 2012) with the number of superpixels $n$ set per dataset (e.g., $n \sim U(30, 40)$ for CIFAR100). The masking probability is fixed at $p = 0.5$, the value that maximizes mask diversity. The top-ranked region selection parameter ranges over $[0.6, 0.8]$ and is generally set to $0.7$. Loss weights for $\mathcal{L}_{loc}$ and $\mathcal{L}_{con}$ are tuned per dataset (values reported for CIFAR100). All major benchmarks report inference speed comparable to state-of-the-art methods. Source code is available at https://github.com/DanielaPlusPlus/LGCOAMix (Dornaika et al., 28 Nov 2025).
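For convenience, the reported settings can be collected in a small config sketch; the key names are ours, and the loss weights (not given numerically in this text) are left as placeholders:

```python
# Hyperparameters as reported for CIFAR100; None marks values elided here.
CIFAR100_CONFIG = {
    "n_superpixels": ("uniform", 30, 40),  # n ~ U(30, 40)
    "p_keep": 0.5,                         # Bernoulli masking probability
    "top_region_fraction": 0.7,            # chosen from [0.6, 0.8]
    "optimizer": "SGD",
    "momentum": 0.9,
    "alpha_local": None,                   # placeholder: see paper
    "gamma_contrastive": None,             # placeholder: see paper
}
```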