
ResNet-50 Large Meta: Efficient Training Techniques

Updated 8 June 2026
  • ResNet-50-large-meta is a meta category that standardizes training techniques for the canonical 50-layer ResNet-50 on the ImageNet-1K dataset.
  • It utilizes large-batch SGD with Layer-wise Adaptive Rate Scaling (LARS) to stabilize training with batch sizes ranging from 32K to 82K.
  • Meta-optimization strategies raise top-1 accuracy without altering the original architecture: Collapsed Ensemble reaches up to 77.5%, and MEAL V2 with knowledge distillation surpasses 80%.

ResNet-50-large-meta refers to the collection of methods, hyperparameters, and empirical results that define the state of the art for efficient large-scale training and meta-optimization of the canonical ResNet-50 architecture on the ImageNet-1K dataset. The meta-category covers two lines of work: algorithmic innovations for ultra-large-batch stochastic gradient descent (SGD), and knowledge distillation frameworks that raise accuracy well beyond standard training. Both keep the original ResNet-50 topology, so the category codifies approaches that yield either highly accurate or very rapidly trained ResNet-50 models without architectural modifications.

1. Canonical Topology and Dataset Standardization

ResNet-50 in the large-meta context is defined by the original topology: 50 layers built from bottleneck blocks, a 224×224×3 input, and a 1,000-way softmax output. The standard dataset is ILSVRC2012 (ImageNet-1K), with 1.28M training images and 50K validation images across 1,000 classes. No architecture changes or outside data are permitted in this category; all comparisons use single-center-crop top-1 accuracy unless otherwise stated (Codreanu et al., 2017).

2. Synchronous Large-Batch SGD and Layer-wise Rate Scaling

Large-scale distributed training is enabled by data-parallel synchronous SGD, with global minibatch sizes often scaling up to 32K–82K. However, naively increasing the global batch size degrades optimization and generalization beyond B ≈ 8K. The Layer-wise Adaptive Rate Scaling (LARS) algorithm provides adaptive, per-layer learning rates by scaling each layer's update by the ratio of the L2 norm of its weights to the L2 norm of its gradient:

$$\Delta w_l = -\eta \, \frac{\|w_l\|}{\|g_l\| + \epsilon} \, g_l$$

where $w_l$ are the weights and $g_l = \nabla_{w_l} L$ is the gradient for layer $l$, with $\epsilon$ set to a small constant (typically 1e-8 or 1e-6) for numerical stability. This enables stable training at extreme batch sizes and allows the global learning rate $\eta$ to scale linearly with batch size $B$ (You et al., 2017, Yamazaki et al., 2019).
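
As a minimal sketch, the per-layer update above can be written in plain Python. It follows only the formula as stated here; momentum, weight decay, and any trust coefficient used in published LARS implementations are omitted, and the function name `lars_update` and the toy layer values are illustrative assumptions.

```python
import math

def lars_update(weights, grads, eta, eps=1e-8):
    """Step for one layer: -eta * ||w|| / (||g|| + eps) * g."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    trust_ratio = w_norm / (g_norm + eps)
    return [-eta * trust_ratio * g for g in grads]

# Two hypothetical layers with identical weights but gradients that
# differ in magnitude by a factor of 100.
layer_a = lars_update([3.0, 4.0], [0.3, 0.4], eta=0.1)
layer_b = lars_update([3.0, 4.0], [30.0, 40.0], eta=0.1)

# Both steps have (almost exactly) the same length, eta * ||w||,
# regardless of the raw gradient scale.
print(layer_a, layer_b)
```

The step length depends on the weight norm rather than the gradient norm, which is why one global learning rate can serve layers whose gradients differ by orders of magnitude.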

3. Meta-Optimization Strategies and Collapsed Ensemble

Meta-level training techniques include learning rate warmup (a linear ramp over 5–10 epochs), polynomial or linear decay schedules, momentum (usually 0.9), and weight decay (typically 1e-4). Recent work introduces "Collapsed Ensemble" (CE), in which cyclical learning rate and weight decay scheduling within a fixed epoch budget (e.g., 120 epochs) yields multiple converged model snapshots from a single run. These snapshots are ensembled at inference, typically by averaging softmax outputs, which boosts top-1 accuracy without extra training cost (Codreanu et al., 2017). CE achieves up to 77.5% top-1 accuracy for vanilla ResNet-50 in 120 epochs.
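
The inference-time step of snapshot ensembling, averaging softmax outputs, can be sketched as follows. The logits are made-up values for a single image over three classes, and `ensemble_predict` is a hypothetical helper name, not an API from the cited work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(snapshot_logits):
    """Average per-snapshot softmax outputs; return (avg_probs, argmax)."""
    probs = [softmax(logits) for logits in snapshot_logits]
    n = len(probs)
    avg = [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]
    return avg, max(range(len(avg)), key=avg.__getitem__)

# Hypothetical logits from three snapshots for one image.
logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2], [1.8, 1.9, 0.0]]
avg_probs, predicted = ensemble_predict(logits)
print(predicted)  # class 1 wins after averaging
```

Here the first snapshot alone would predict class 0, but the averaged distribution favors class 1, illustrating how the ensemble can override an individual snapshot.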

4. Knowledge Distillation with Ensembles: MEAL V2

The MEAL V2 framework demonstrates that knowledge distillation, using a high-accuracy teacher ensemble (e.g., SENet-154, ResNet-152), enables vanilla ResNet-50 to exceed 80% top-1 accuracy on ImageNet. The process is summarized as follows:

  • Soft labels are obtained by averaging K teacher model softmax outputs for each input image.
  • The student ResNet-50 is initialized from a strong pre-trained hard-label checkpoint.
  • Training optimizes a KL-based similarity loss plus an adversarial loss from a small discriminator network distinguishing teacher vs. student logits.
  • No one-hot loss, label smoothing, architectural modification, or advanced augmentation is used.
  • Weight decay can be removed; soft targets provide sufficient regularization.
  • Ablation reveals that strong initialization and omitting weight decay/one-hot loss are critical to surpassing standard performance (Shen et al., 2020).

Notably, MEAL V2 achieves 80.67% top-1 accuracy with a single-crop (224×224) vanilla ResNet-50, outperforming previously reported results under the same structural constraints.
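
The soft-label construction and the KL-based similarity term described above can be sketched as below. This is an illustrative reduction, not the MEAL V2 implementation: the adversarial discriminator loss is left out, and the function names and logit values are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label(teacher_logits):
    """Average the softmax outputs of K teachers for one image."""
    probs = [softmax(logits) for logits in teacher_logits]
    k = len(probs)
    return [sum(p[c] for p in probs) / k for c in range(len(probs[0]))]

def kl_divergence(target, student_logits):
    """KL(target || softmax(student_logits)); no one-hot term is used."""
    q = softmax(student_logits)
    return sum(t * math.log(t / s) for t, s in zip(target, q) if t > 0)

# Hypothetical logits from K = 2 teachers and one student, 3 classes.
teachers = [[4.0, 1.0, 0.5], [3.0, 2.0, 0.1]]
target = soft_label(teachers)
loss = kl_divergence(target, [1.0, 1.0, 1.0])
print(loss)
```

The loss is zero only when the student's distribution matches the averaged teacher distribution, so the student is trained toward the full soft target rather than a single hard label.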

5. Hyperparameters, Scheduling, and Practical Regimes

The following summarizes concrete, empirically validated regimes for ResNet-50-large-meta:

Method | Batch Size | Epochs | Top-1 Accuracy | Notable Recipes
--- | --- | --- | --- | ---
Standard large-batch | Up to 32K | 90 | 75.3–75.4% | LARS, 5-epoch warmup, polynomial decay, standard augmentation
Akiba et al. (2017) | 32,768 | 90 | ~74.9% | RMSprop→SGD warmup, slow-start LR, BN statistics
MEAL V2 (2020) | 512 | 180 | 80.67% | Ensemble KD, soft labels, zero weight decay
Collapsed Ensemble | 32–64K | 120 | 77.5% (ensemble) | 5–6 cyclical LR drops, multi-snapshot averaging

  • Learning rate scaling: $\eta = 0.1 \times \mathrm{batch}/256$ after warmup.
  • Weight decay: the standard value is $10^{-4}$; MEAL V2 sets it to zero during distillation.
  • Data augmentation: MEAL V2 uses only basic random crop and horizontal flip; stronger regularization is needed at batch $\gtrsim$ 32K (Codreanu et al., 2017, You et al., 2017, Shen et al., 2020).
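
To make the scaling rule concrete, the sketch below combines the linear scaling formula with a linear warmup and a polynomial decay. The 5-epoch warmup and 90-epoch budget come from the table above; the decay power of 2 and the function names are assumptions chosen for illustration.

```python
def peak_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: eta = 0.1 * batch / 256."""
    return base_lr * batch_size / base_batch

def lr_at(epoch, batch_size, total_epochs=90, warmup_epochs=5, power=2.0):
    """Linear warmup to the peak, then polynomial decay toward zero.

    `power` is an assumed value; the text does not fix the exponent.
    """
    peak = peak_lr(batch_size)
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak * (1.0 - progress) ** power

# A 32,768-image batch gives a peak rate of about 12.8,
# i.e. 128 times the 256-image baseline of 0.1.
print(peak_lr(32768))
print([round(lr_at(e, 32768), 3) for e in (0, 4, 5, 45, 89)])
```

The warmup matters most at large batch sizes, where jumping straight to a peak rate two orders of magnitude above the baseline tends to destabilize early training.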

6. System-Level and Communication Optimizations

Successful large-batch runs rely on overlapping all-reduce with backward computation, bucketized gradient communication, mixed-precision arithmetic (FP16 for compute and all-reduce, FP32 for weights), and efficient kernels for the per-layer norm computations that LARS requires. MPI rank synchronization tricks (e.g., fixing the RNG seed per rank, batched norm computation) substantially reduce system overhead (Yamazaki et al., 2019). Scaling efficiency exceeds 80% for systems of up to 1,536 CPU nodes or 2,048 GPUs (Codreanu et al., 2017, You et al., 2017, Yamazaki et al., 2019).
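
Bucketized gradient communication can be illustrated with a small grouping routine: gradients become available in backward order and are packed into size-capped buckets, each of which could be all-reduced while earlier layers are still computing. The layer names, byte sizes, and 25 MB cap below are hypothetical, and the sketch covers only the grouping logic, not the communication itself.

```python
def bucketize(grad_sizes, bucket_bytes):
    """Greedily pack (name, size) pairs into buckets of at most
    bucket_bytes; a single oversized gradient gets its own bucket."""
    buckets, current, current_size = [], [], 0
    for name, size in grad_sizes:
        # Close the current bucket if adding this gradient would overflow it.
        if current and current_size + size > bucket_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

# Hypothetical gradient sizes in bytes, listed in backward order
# (output layer first, input layer last).
grads = [
    ("fc", 8_000_000),
    ("layer4", 30_000_000),
    ("layer3", 14_000_000),
    ("layer2", 2_400_000),
    ("layer1", 430_000),
    ("conv1", 19_000),
]
buckets = bucketize(grads, bucket_bytes=25_000_000)
print(buckets)
```

Grouping the many small early-layer gradients into one message amortizes per-message latency, while large late-layer gradients are sent as soon as they are ready.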

7. Accuracy, Throughput, and Recognized Trade-Offs

Standard large-batch pipelines with LARS reach 74.9–75.4% top-1 ImageNet accuracy in as little as 74.7 seconds on a 2,048-GPU system (Yamazaki et al., 2019) or in roughly 14–20 minutes on CPU clusters, without loss relative to standard-batch baselines. Collapsed Ensemble and MEAL V2 raise this benchmark to 77.0–80.67% when training time and compute are not constrained. The principal trade-off is that extreme batch sizes require meta-level regularization and precise learning rate control; beyond batch ≳ 32K, test accuracy can degrade unless mitigated by strong regularization, mixed-batch training, or distillation (Shen et al., 2020, Codreanu et al., 2017, You et al., 2017).

References

  • "MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks" (Shen et al., 2020)
  • "Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds" (Yamazaki et al., 2019)
  • "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes" (Akiba et al., 2017)
  • "ImageNet Training in Minutes" (You et al., 2017)
  • "Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train" (Codreanu et al., 2017)
