ResizeMix: Innovative Data Augmentation

Updated 18 March 2026
  • The paper demonstrates that resizing full images for patch mixing eliminates label misallocation and object loss compared to cut-based methods.
  • ResizeMix leverages standard resize and paste operations, introducing no extra computational overhead while preserving complete semantic content.
  • Empirical results on CIFAR and ImageNet benchmarks show that ResizeMix outperforms traditional methods like CutMix in both classification and detection tasks.

ResizeMix is a data mixing augmentation strategy for image recognition that addresses fundamental limitations of conventional cut-based mixing methods. ResizeMix operates by resizing an entire source image to a smaller patch and pasting it onto a randomly located region within a target image, subsequently mixing labels proportional to the area of the pasted patch. This approach systematically preserves object integrity and correct semantic labeling while introducing no additional computational overhead compared to prior cut-based mixing schemes. ResizeMix demonstrates superior performance relative to CutMix, saliency-guided variants, and most automatic augmentation approaches across standard classification and object detection benchmarks (Qin et al., 2020).

1. Motivation and Context

Data mixing augmentations, including Mixup and CutMix, have achieved widespread adoption due to their ability to improve the generalization of deep networks. Mixup forms convex combinations of entire image pairs, while CutMix pastes randomly cropped patches from one image onto another, mixing labels by spatial area. Saliency-guided extensions—such as PuzzleMix, SaliencyMix, FMix, and SuperMix—attempt to ensure that the crop contains informative content by leveraging pre-trained saliency detectors or by optimizing over mask locations. However, all conventional cutting-based approaches exhibit two intrinsic issues:

  1. Label misallocation: Random crops often contain only background, leading to incorrect label mixing.
  2. Object information missing: Cropping may truncate objects, resulting in incomplete semantic information in the mixed region.

Saliency-guided cropping partially addresses empty patches but often results in incomplete objects and reduced diversity. No existing cutting-based approach resolves both drawbacks simultaneously (Qin et al., 2020).
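To make the two failure modes concrete, the following toy sketch enumerates every position of a square crop on a small image and classifies each crop by how much of a single square object it captures. The grid size, the object placement, and the helper name `classify_crops` are illustrative assumptions, not taken from the paper.

```python
def classify_crops(img_size, crop_size, obj_top, obj_left, obj_size):
    """Count crops that miss, truncate, or fully contain a square object."""
    counts = {"empty": 0, "truncated": 0, "full": 0}
    for r in range(img_size - crop_size + 1):
        for c in range(img_size - crop_size + 1):
            # Overlap between crop rows/cols and object rows/cols, clamped at 0.
            rows = max(0, min(r + crop_size, obj_top + obj_size) - max(r, obj_top))
            cols = max(0, min(c + crop_size, obj_left + obj_size) - max(c, obj_left))
            covered = rows * cols
            if covered == 0:
                counts["empty"] += 1      # background only: label misallocation
            elif covered < obj_size * obj_size:
                counts["truncated"] += 1  # object cut off: information missing
            else:
                counts["full"] += 1
    return counts

# Toy setup: 8x8 image, 2x2 object in the top-left corner, 3x3 crop.
counts = classify_crops(img_size=8, crop_size=3, obj_top=0, obj_left=0, obj_size=2)
print(counts)
```

In this toy setting only one of the 36 crop positions captures the whole object, yet every crop would receive the same label weight under area-based mixing.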

2. Methodological Framework

ResizeMix introduces a conceptual shift: the source image is not cropped, but rather uniformly resized to form a patch encompassing the entire image content at reduced resolution. The algorithm proceeds as follows:

  1. Patch Construction: Given images $I_s \in \mathbb{R}^{H \times W \times 3}$ (source) and $I_t \in \mathbb{R}^{H \times W \times 3}$ (target), with labels $y_s$ and $y_t$, sample a resizing factor $\tau \sim U(\alpha, \beta)$.
  2. Patch Placement: Resize $I_s$ to $(H_P, W_P) = (\tau H, \tau W)$, yielding patch $P$. Paste $P$ into $I_t$ at random coordinates $(x, y)$ such that $x \in [0, W - W_P]$ and $y \in [0, H - H_P]$.
  3. Label Mixing: Compute the area mixing coefficient $\lambda = \tau^2$. Assign the soft label $y_m = \lambda y_s + (1 - \lambda) y_t$.
  4. Return: Output the mixed image $I_m$ and mixed label $y_m$.

Because the pasted patch always contains a downscaled version of the entire source image, all semantic object and contextual information is preserved. This eliminates both label misallocation and object truncation, aligning the label mixing ratio with true semantic content (Qin et al., 2020).
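The label-mixing step can be traced with a small worked example. This is a minimal sketch in plain Python; the helper name `resizemix_label`, the 32×32 image size, and the three-class one-hot labels are assumptions chosen for illustration. It computes the mixing coefficient from the integer patch size, which equals $\tau^2$ whenever $\tau H$ and $\tau W$ are whole numbers.

```python
def resizemix_label(y_s, y_t, tau, H, W):
    """Return the patch size, the area ratio lambda, and the mixed soft label."""
    h_p, w_p = int(tau * H), int(tau * W)   # patch size in whole pixels
    lam = (h_p * w_p) / (H * W)             # fraction of the target covered
    y_m = [lam * s + (1 - lam) * t for s, t in zip(y_s, y_t)]
    return (h_p, w_p), lam, y_m

# Source is class 0, target is class 1, and the source is resized to half size.
patch, lam, y_m = resizemix_label([1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                                  tau=0.5, H=32, W=32)
print(patch, lam, y_m)
```

With $\tau = 0.5$ the patch is 16×16, so the source class receives weight 0.25 and the target class 0.75.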

3. Practical Implementation

The standard scale range for $\tau$ is $[0.1, 0.8]$, empirically determined to be optimal across CIFAR and ImageNet. $\tau$ is independently sampled for each mini-batch element, and patch location is uniform across all feasible positions. Implementation requires only a single resize (bilinear or bicubic) and pixel copy operation per mixed sample, with no dependency on saliency estimation or mask optimization. This results in identical computational cost to CutMix, and significantly reduced overhead relative to saliency-driven methods, which may incur hundreds of additional GPU-hours (Qin et al., 2020).
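As a quick check on what this range implies for the labels, the expected source-label weight can be worked out in closed form. The derivation below is a sketch that assumes $\lambda = \tau^2$ exactly (ignoring integer rounding of the patch size) and $\tau \sim U(\alpha, \beta)$ as defined above.

```latex
\[
\mathbb{E}[\lambda] = \mathbb{E}[\tau^2]
= \frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} \tau^2 \, d\tau
= \frac{\alpha^2 + \alpha\beta + \beta^2}{3},
\qquad
\mathbb{E}[\lambda]\Big|_{\alpha = 0.1,\ \beta = 0.8}
= \frac{0.01 + 0.08 + 0.64}{3} \approx 0.243 .
\]
```

Under the standard range, the pasted source image therefore contributes roughly a quarter of the label mass on average.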

A representative PyTorch-style code fragment is:

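import torch
import torch.nn.functional as F

def resizemix(I_s, y_s, I_t, y_t, alpha=0.1, beta=0.8):
    # I_s, I_t: float image batches of shape (B, C, H, W).
    # y_s, y_t: one-hot or soft label batches of shape (B, num_classes).
    B, C, H, W = I_s.size()
    # One scale factor per sample, tau ~ U(alpha, beta).
    tau = torch.empty(B).uniform_(alpha, beta)
    I_m, y_m = [], []
    for i in range(B):
        # Patch size as plain ints, at least 1 pixel per side.
        h_p = max(1, int(tau[i] * H))
        w_p = max(1, int(tau[i] * W))
        # Downscale the whole source image instead of cropping it.
        p = F.interpolate(
            I_s[i:i+1], size=(h_p, w_p), mode='bilinear', align_corners=False
        )
        # Top-left corner, uniform over positions that keep the patch inside the target.
        x = int(torch.randint(0, W - w_p + 1, ()))
        y = int(torch.randint(0, H - h_p + 1, ()))
        im = I_t[i].clone()
        im[:, y:y+h_p, x:x+w_p] = p[0]
        # Pasted-area ratio; equals tau^2 up to integer rounding.
        lam = (h_p * w_p) / (H * W)
        ymix = lam*y_s[i] + (1-lam)*y_t[i]
        I_m.append(im)
        y_m.append(ymix)
    I_m = torch.stack(I_m)
    y_m = torch.stack(y_m)
    return I_m, y_m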
(Qin et al., 2020)
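Because the fragment above returns soft labels rather than class indices, the training loss has to accept a probability vector as its target. The sketch below shows one way this could look in plain Python for a single sample; the function names and the example logits are assumptions for illustration, not part of the paper. It also checks that cross-entropy against the mixed label equals the $\lambda$-weighted sum of the two hard-label losses, which is the form commonly used with CutMix-style training.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def soft_cross_entropy(logits, target):
    """Cross-entropy between predicted logits and a soft (probability) target."""
    log_p = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_p))

logits = [2.0, 0.5, -1.0]                      # hypothetical network output
y_s, y_t, lam = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.25
y_m = [lam * s + (1 - lam) * t for s, t in zip(y_s, y_t)]

loss_soft = soft_cross_entropy(logits, y_m)
loss_split = (lam * soft_cross_entropy(logits, y_s)
              + (1 - lam) * soft_cross_entropy(logits, y_t))
print(loss_soft, loss_split)
```

The two values agree because cross-entropy is linear in the target, so either formulation can be used depending on what the training loop already supports.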

4. Empirical Evaluation

ResizeMix was evaluated on standard image classification and object detection benchmarks, using WideResNet-28-10 and Shake-Shake for CIFAR-10/100, and ResNet-50/101 for ImageNet 1k classification. Object detection was assessed by pretraining ResNet-50 backbones on ImageNet, then fine-tuning SSD and Faster-RCNN on COCO and Pascal VOC.

Image Classification Performance

| Dataset/Model | Cost | CutMix | FMix | SaliencyMix | AutoAugment | RandAugment | ResizeMix | ResizeMix+ (with RandAugment) |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 (WRN-28-10) | 0 | 97.10 | 96.38 | 97.24 | 97.32 (5000) | 97.30 | 97.60 | 98.10 (6) |
| CIFAR-100 (WRN-28-10) | 0 | 83.40 | 82.03 | 83.44 | 82.91 (5000) | 83.30 | 84.31 | 85.23 (6) |
| ImageNet (ResNet-50) | 0 | 78.60 | - | 78.74 (280) | 77.63 (15000) | 77.60 | 79.00 | - |
| ImageNet (ResNet-101) | 0 | 79.83 | - | 79.91 (280) | - | - | 80.54 | - |

Values in parentheses indicate estimated GPU cost relative to baseline (Qin et al., 2020).

Object Detection Results

| Backbone | ImageNet Top-1 | SSD@COCO | FRCNN@COCO | SSD@VOC | FRCNN@VOC |
|---|---|---|---|---|---|
| ResNet-50 (baseline) | 76.1 | 25.1 | 38.1 | 75.6 | 81.0 |
| +CutMix | 78.6 | 24.9 | 38.2 | 76.1 | 81.9 |
| +ResizeMix | 79.0 | 25.5 | 38.4 | 77.3 | 82.0 |

ResizeMix produced higher classification and detection metrics than CutMix in every reported setting, and higher classification accuracy than the saliency-guided approaches, without introducing additional resource requirements.
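The margins over CutMix can be read off the two tables directly. The short sketch below only subtracts the reported numbers; the dictionary layout and the helper name `gains` are illustrative.

```python
# Values copied from the classification and detection tables above.
classification = {
    "CIFAR-10 (WRN-28-10)": {"CutMix": 97.10, "ResizeMix": 97.60},
    "CIFAR-100 (WRN-28-10)": {"CutMix": 83.40, "ResizeMix": 84.31},
    "ImageNet (ResNet-50)": {"CutMix": 78.60, "ResizeMix": 79.00},
    "ImageNet (ResNet-101)": {"CutMix": 79.83, "ResizeMix": 80.54},
}
detection = {
    "SSD@COCO": {"CutMix": 24.9, "ResizeMix": 25.5},
    "FRCNN@COCO": {"CutMix": 38.2, "ResizeMix": 38.4},
    "SSD@VOC": {"CutMix": 76.1, "ResizeMix": 77.3},
    "FRCNN@VOC": {"CutMix": 81.9, "ResizeMix": 82.0},
}

def gains(table):
    """ResizeMix minus CutMix for each row, rounded to two decimals."""
    return {k: round(v["ResizeMix"] - v["CutMix"], 2) for k, v in table.items()}

cls_gains = gains(classification)
det_gains = gains(detection)
print(cls_gains)
print(det_gains)
```

The classification gains range from 0.40 to 0.91 points, while the detection gains range from 0.1 to 1.2 points and are smallest for Faster-RCNN.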

5. Ablation Experiments and Analysis

Ablation experiments compared resizing-based against cutting-based patch construction. With training and validation inputs produced by either (a) a random crop or (b) resizing to half the image, resizing gave markedly higher top-1 accuracy on CIFAR-10 (92.1% vs 71.8%) and CIFAR-100 (71.9% vs 35.8%), and a slight improvement on ImageNet (63.9% vs 63.6%). This indicates that resizing retains global structure that cropping discards.

Scale range analysis on CIFAR-100 showed optimal performance for $\alpha = 0.1$, $\beta = 0.8$. Additionally, pipeline placement with RandAugment yielded best results when ResizeMix was applied prior to RandAugment, further enhancing accuracy.

Importantly, ablations affirmed that ResizeMix entirely eliminates empty-patch label misallocation and semantic truncation: the label ratio always reflects the presence of a full (if small) object, and no semantic region is omitted (Qin et al., 2020).

6. Insights, Limitations, and Extensions

ResizeMix’s principal contribution is the demonstration that uniform downscaling of full source images yields regularization superior to cutting-based methods for supervised learning, without auxiliary overhead. Further, object integrity and semantic-label alignment are fully preserved at every mixing ratio.

Potential limitations include the effects of extreme scaling ($\tau \to 0$), which may produce objects too small for reliable learning, and the discrepancy between small resized patches and true small-object statistics. Plausible extensions involve adaptive or learned $\tau$ schedules, nonuniform or masked patch shapes, extension to semantic segmentation or video domains, and the integration of adversarial resizing protocols.
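The small-patch concern is easy to quantify at the ends of the standard scale range. The sketch below assumes 32×32 CIFAR inputs and a 224×224 ImageNet training crop (the latter is a common convention assumed here, not stated above), and uses the same truncating integer rounding as the code fragment in Section 3.

```python
def patch_side(side, tau):
    """Side length in pixels of the resized patch for a square input."""
    return int(tau * side)

# Input sizes: 32 px for CIFAR; 224 px is an assumed ImageNet crop size.
for name, side in [("CIFAR", 32), ("ImageNet (assumed 224 px crop)", 224)]:
    for tau in (0.1, 0.8):
        p = patch_side(side, tau)
        area = (p * p) / (side * side)
        print(f"{name}: tau={tau} -> {p}x{p} patch, area fraction {area:.3f}")
```

At the lower bound, a CIFAR source image is reduced to a 3×3 patch, which illustrates why very small values of the scale factor may leave too little detail to learn from.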

ResizeMix thus constitutes a simple method that adds no computational overhead relative to CutMix and yields systematically improved results on both classification and detection tasks by structurally resolving the label misallocation and object information loss found in cut-based image mixing techniques (Qin et al., 2020).
