ResizeMix: Innovative Data Augmentation
- The paper demonstrates that resizing full images for patch mixing eliminates label misallocation and object loss compared to cut-based methods.
- ResizeMix leverages standard resize and paste operations, introducing no extra computational overhead while preserving complete semantic content.
- Empirical results on CIFAR and ImageNet benchmarks show that ResizeMix outperforms traditional methods like CutMix in both classification and detection tasks.
ResizeMix is a data mixing augmentation strategy for image recognition that addresses fundamental limitations of conventional cut-based mixing methods. ResizeMix operates by resizing an entire source image to a smaller patch and pasting it onto a randomly located region within a target image, subsequently mixing labels proportional to the area of the pasted patch. This approach systematically preserves object integrity and correct semantic labeling while introducing no additional computational overhead compared to prior cut-based mixing schemes. ResizeMix demonstrates superior performance relative to CutMix, saliency-guided variants, and most automatic augmentation approaches across standard classification and object detection benchmarks (Qin et al., 2020).
1. Motivation and Context
Data mixing augmentations, including Mixup and CutMix, have achieved widespread adoption due to their ability to improve the generalization of deep networks. Mixup forms convex combinations of entire image pairs, while CutMix pastes randomly cropped patches from one image onto another, mixing labels by spatial area. Saliency-guided extensions—such as PuzzleMix, SaliencyMix, FMix, and SuperMix—attempt to ensure that the crop contains informative content by leveraging pre-trained saliency detectors or by optimizing over mask locations. However, all conventional cutting-based approaches exhibit two intrinsic issues:
- Label misallocation: Random crops often contain only background, leading to incorrect label mixing.
- Object information missing: Cropping may truncate objects, resulting in incomplete semantic information in the mixed region.
Saliency-guided cropping partially addresses empty patches but often results in incomplete objects and reduced diversity. No existing cutting-based approach resolves both drawbacks simultaneously (Qin et al., 2020).
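The label misallocation problem can be made concrete with a toy CutMix-style sketch. The helper name `cutmix_pair`, the synthetic images, and the fixed crop window are assumptions chosen for illustration (CutMix itself samples the window at random); the point is only that the label weight depends on the pasted area, not on what the crop contains.

```python
import torch

def cutmix_pair(src, tgt, y_src, y_tgt, top, left, ph, pw):
    """Paste the window src[:, top:top+ph, left:left+pw] into the same window of tgt.

    Labels are mixed by pasted area only, regardless of the crop's content.
    """
    _, H, W = src.shape
    mixed = tgt.clone()
    mixed[:, top:top + ph, left:left + pw] = src[:, top:top + ph, left:left + pw]
    lam = (ph * pw) / (H * W)
    y_mix = lam * y_src + (1.0 - lam) * y_tgt
    return mixed, y_mix, lam

# Toy source image: the "object" is a bright 8x8 square in the top-left corner.
src = torch.zeros(3, 32, 32)
src[:, :8, :8] = 1.0
tgt = torch.zeros(3, 32, 32)
y_src = torch.tensor([1.0, 0.0])
y_tgt = torch.tensor([0.0, 1.0])

# A crop taken from the bottom-right quadrant contains only background.
mixed, y_mix, lam = cutmix_pair(src, tgt, y_src, y_tgt, top=16, left=16, ph=16, pw=16)
object_pixels = mixed[:, 16:, 16:].sum().item()
```

Here the source class still receives a quarter of the label mass even though no object pixel was pasted, which is the misallocation described above.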
2. Methodological Framework
ResizeMix introduces a conceptual shift: the source image is not cropped, but rather uniformly resized to form a patch encompassing the entire image content at reduced resolution. The algorithm proceeds as follows:
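- Patch Construction: Given images $I_s$ (source) and $I_t$ (target) of size $H \times W$, with labels $y_s$ and $y_t$, sample a resizing factor $\tau \sim U(\alpha, \beta)$.
- Patch Placement: Resize $I_s$ to $H_P \times W_P$, where $H_P = \lfloor \tau H \rfloor$ and $W_P = \lfloor \tau W \rfloor$, yielding patch $P$. Paste $P$ into $I_t$ at random coordinates $(x, y)$ such that $0 \le x \le W - W_P$, $0 \le y \le H - H_P$.
- Label Mixing: Compute area mixing coefficient $\lambda = \frac{H_P W_P}{H W}$. Assign soft label $y_m = \lambda y_s + (1 - \lambda) y_t$.
- Return: Output the mixed image $I_m$ and mixed label $y_m$.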
Because the pasted patch always contains a downscaled version of the entire source image, all semantic object and contextual information is preserved. This eliminates both label misallocation and object truncation, aligning the label mixing ratio with true semantic content (Qin et al., 2020).
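A small worked example may help fix the arithmetic of the label-mixing step. It is a sketch only: the helper name `resizemix_geometry`, the 32×32 canvas, the scale factor 0.5, and the three-class labels are all illustrative assumptions.

```python
def resizemix_geometry(H, W, tau):
    """Return the patch size and the area-based mixing coefficient for scale tau."""
    H_P, W_P = int(tau * H), int(tau * W)
    lam = (H_P * W_P) / (H * W)
    return H_P, W_P, lam

# A 32x32 target with the source resized by tau = 0.5 gives a 16x16 patch.
H_P, W_P, lam = resizemix_geometry(32, 32, 0.5)

# One-hot labels for a hypothetical 3-class problem.
y_s = [1.0, 0.0, 0.0]   # source class
y_t = [0.0, 0.0, 1.0]   # target class
y_m = [lam * a + (1 - lam) * b for a, b in zip(y_s, y_t)]
```

Because both sides are scaled by the same factor, the coefficient is roughly the square of the scale factor, so a patch covering half the width carries only a quarter of the label mass.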
3. Practical Implementation
The standard scale range for $\tau$ is $[\alpha, \beta] = [0.1, 0.8]$, empirically determined to be optimal across CIFAR and ImageNet. $\tau$ is independently sampled for each mini-batch element, and patch location is uniform across all feasible positions. Implementation requires only a single resize (bilinear or bicubic) and pixel copy operation per mixed sample, with no dependency on saliency estimation or mask optimization. This results in identical computational cost to CutMix, and significantly reduced overhead relative to saliency-driven methods, which may incur hundreds of additional GPU-hours (Qin et al., 2020).
A representative PyTorch-style code fragment is:
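```python
import torch
import torch.nn.functional as F

def resizemix(I_s, y_s, I_t, y_t, alpha=0.1, beta=0.8):
    """Paste a resized copy of each source image into its target image.

    I_s, I_t: float tensors of shape (B, C, H, W).
    y_s, y_t: one-hot or soft label tensors of shape (B, K).
    """
    B, C, H, W = I_s.size()
    # One scale factor per sample, drawn uniformly from [alpha, beta].
    tau = torch.empty(B).uniform_(alpha, beta)
    I_m, y_m = [], []
    for i in range(B):
        # Patch size in pixels (at least 1 so the resize stays valid).
        h_p = max(1, int(tau[i] * H))
        w_p = max(1, int(tau[i] * W))
        # Resize the whole source image; nothing is cropped away.
        p = F.interpolate(I_s[i:i + 1], size=(h_p, w_p),
                          mode='bilinear', align_corners=False)
        # Uniform top-left corner over all positions where the patch fits.
        x = int(torch.randint(0, W - w_p + 1, ()))
        y = int(torch.randint(0, H - h_p + 1, ()))
        im = I_t[i].clone()
        im[:, y:y + h_p, x:x + w_p] = p[0]
        # The label weight equals the pasted area fraction.
        lam = (h_p * w_p) / (H * W)
        I_m.append(im)
        y_m.append(lam * y_s[i] + (1 - lam) * y_t[i])
    return torch.stack(I_m), torch.stack(y_m)
```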
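The fragment above returns soft labels, so the training loss has to accept a probability vector per sample rather than a class index. The following is a minimal sketch of one way to do that, not the authors' training code; the helper name `soft_target_cross_entropy`, the random logits, and the example class indices and mixing coefficients are assumptions. It also checks the equivalent form that weights two ordinary cross-entropy terms by the mixing coefficient.

```python
import torch
import torch.nn.functional as F

def soft_target_cross_entropy(logits, soft_targets):
    """Cross-entropy against one probability vector per sample (rows sum to 1)."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

torch.manual_seed(0)
num_classes = 10
logits = torch.randn(4, num_classes)          # stand-in for model outputs
cls_s = torch.tensor([1, 2, 3, 4])            # source class indices
cls_t = torch.tensor([5, 6, 7, 8])            # target class indices
lam = torch.tensor([0.04, 0.25, 0.36, 0.64])  # per-sample area fractions

# Build the mixed soft labels exactly as in the label-mixing step.
y_s = F.one_hot(cls_s, num_classes).float()
y_t = F.one_hot(cls_t, num_classes).float()
y_m = lam[:, None] * y_s + (1 - lam[:, None]) * y_t

loss_soft = soft_target_cross_entropy(logits, y_m)

# Equivalent formulation: weight two hard-label losses per sample.
ce_s = F.cross_entropy(logits, cls_s, reduction="none")
ce_t = F.cross_entropy(logits, cls_t, reduction="none")
loss_pair = (lam * ce_s + (1 - lam) * ce_t).mean()
```

The second form avoids materializing one-hot vectors, which can matter when the number of classes is large.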
4. Empirical Evaluation
ResizeMix was evaluated on standard image classification and object detection benchmarks, using WideResNet-28-10 and Shake-Shake for CIFAR-10/100, and ResNet-50/101 for ImageNet 1k classification. Object detection was assessed by pretraining ResNet-50 backbones on ImageNet, then fine-tuning SSD and Faster-RCNN on COCO and Pascal VOC.
Image Classification Performance
| Dataset/Model | Cost | CutMix | FMix | SaliencyMix | AutoAugment | RandAugment | ResizeMix | ResizeMix+ (with RandAugment) |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 (WRN-28-10) | 0 | 97.10 | 96.38 | 97.24 | 97.32 (5000) | 97.30 | 97.60 | 98.10 (6) |
| CIFAR-100 (WRN-28-10) | 0 | 83.40 | 82.03 | 83.44 | 82.91 (5000) | 83.30 | 84.31 | 85.23 (6) |
| ImageNet (ResNet-50) | 0 | 78.60 | - | 78.74 (280) | 77.63 (15000) | 77.60 | 79.00 | - |
| ImageNet (ResNet-101) | 0 | 79.83 | - | 79.91 (280) | - | - | 80.54 | - |
Values in parentheses indicate estimated GPU cost relative to baseline (Qin et al., 2020).
Object Detection Results
| Backbone | ImageNet Top-1 | SSD@COCO | FRCNN@COCO | SSD@VOC | FRCNN@VOC |
|---|---|---|---|---|---|
| ResNet-50 (baseline) | 76.1 | 25.1 | 38.1 | 75.6 | 81.0 |
| +CutMix | 78.6 | 24.9 | 38.2 | 76.1 | 81.9 |
| +ResizeMix | 79.0 | 25.5 | 38.4 | 77.3 | 82.0 |
ResizeMix consistently produced higher classification and detection metrics—outperforming CutMix and saliency-guided approaches without introducing additional resource requirements.
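To make the size of the detection gains explicit, the short script below computes the differences over the baseline from the object detection table. The numbers are copied from that table; the dictionary layout and variable names are illustrative.

```python
# Detection scores copied from the table: (baseline, +CutMix, +ResizeMix).
results = {
    "SSD@COCO":   (25.1, 24.9, 25.5),
    "FRCNN@COCO": (38.1, 38.2, 38.4),
    "SSD@VOC":    (75.6, 76.1, 77.3),
    "FRCNN@VOC":  (81.0, 81.9, 82.0),
}

# Gain over the baseline backbone for each pretraining scheme.
gains = {
    name: (round(cutmix - base, 1), round(resizemix - base, 1))
    for name, (base, cutmix, resizemix) in results.items()
}

for name, (cutmix_gain, resizemix_gain) in gains.items():
    print(f"{name:11s} CutMix {cutmix_gain:+.1f}  ResizeMix {resizemix_gain:+.1f}")
```

The largest margin over CutMix appears for SSD on Pascal VOC, while CutMix pretraining slightly lowers the SSD score on COCO relative to the baseline.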
5. Ablation Experiments and Analysis
Ablations compared resizing-based with cutting-based patch construction. When networks were trained and validated on inputs produced by (a) a random crop or (b) resizing to half the image, resizing produced markedly higher top-1 accuracy on CIFAR-10 (92.1% vs 71.8%) and CIFAR-100 (71.9% vs 35.8%), and a slight improvement on ImageNet (63.9% vs 63.6%). This suggests that resizing retains global structure that cropping discards.
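The two input variants in this ablation can be sketched as follows. The exact protocol is not given here, so this assumes that "half" means half the height and half the width, and that bilinear interpolation is used; the function names `random_half_crop` and `half_resize` are hypothetical.

```python
import torch
import torch.nn.functional as F

def random_half_crop(img):
    """Keep a random window of half the height and width (content is cut away)."""
    _, H, W = img.shape
    h, w = H // 2, W // 2
    top = int(torch.randint(0, H - h + 1, ()))
    left = int(torch.randint(0, W - w + 1, ()))
    return img[:, top:top + h, left:left + w]

def half_resize(img):
    """Downscale the whole image to half the height and width (content is kept)."""
    _, H, W = img.shape
    out = F.interpolate(img[None], size=(H // 2, W // 2),
                        mode="bilinear", align_corners=False)
    return out[0]

img = torch.rand(3, 32, 32)   # stand-in for a CIFAR-sized image
cropped = random_half_crop(img)
resized = half_resize(img)
```

Both outputs have the same shape, so any accuracy difference between the two settings comes from what the pixels contain rather than from input size.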
Scale range analysis on CIFAR-100 showed optimal performance for $\alpha = 0.1$, $\beta = 0.8$. Additionally, pipeline placement with RandAugment yielded best results when ResizeMix was applied prior to RandAugment, further enhancing accuracy.
Importantly, ablations affirmed that ResizeMix entirely eliminates empty-patch label misallocation and semantic truncation: the label ratio always reflects the presence of a full (if small) object, and no semantic region is omitted (Qin et al., 2020).
6. Insights, Limitations, and Extensions
ResizeMix’s principal contribution is the demonstration that uniform downscaling of full source images yields regularization superior to cutting-based methods for supervised learning, without auxiliary overhead. Further, object integrity and semantic-label alignment are fully preserved at every mixing ratio.
Potential limitations include the effects of extreme scaling (very small $\tau$), which may produce objects too small for reliable learning, and the discrepancy between small resized patches and true small-object statistics. Plausible extensions involve adaptive or learned schedules, nonuniform or masked patch shapes, extension to semantic segmentation or video domains, and the integration of adversarial resizing protocols.
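The extreme-scaling concern is easy to quantify in pixels. The sketch below assumes square inputs of side 32 (CIFAR) and 224 (a common ImageNet training resolution, assumed here rather than taken from the text) and truncation of the patch size to an integer; `patch_side` is a hypothetical helper.

```python
def patch_side(side, tau):
    """Side length in pixels of the pasted patch, truncated to an integer."""
    return int(tau * side)

# Patch side length for a few scale factors at two input resolutions.
for side in (32, 224):
    sizes = {tau: patch_side(side, tau) for tau in (0.05, 0.1, 0.2, 0.8)}
    print(side, sizes)
```

At a scale factor of 0.1 the whole source image is squeezed into about 3 pixels per side on a 32-pixel input, which illustrates why much smaller factors are unlikely to carry usable object information.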
ResizeMix thus constitutes a simple methodology with no additional computational overhead relative to CutMix, yielding improved results on both classification and detection tasks by structurally resolving the label misallocation and object information loss found in cut-based image mixing techniques (Qin et al., 2020).