Revisiting ResNets: Improved Training and Scaling Strategies (2103.07579v1)

Published 13 Mar 2021 in cs.CV

Abstract: Novel computer vision architectures monopolize the spotlight, but the impact of the model architecture is often conflated with simultaneous changes to training methodology and scaling strategies. Our work revisits the canonical ResNet (He et al., 2015) and studies these three aspects in an effort to disentangle them. Perhaps surprisingly, we find that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models. We show that the best performing scaling strategy depends on the training regime and offer two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended (Tan & Le, 2019). Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. We recommend practitioners use these simple revised ResNets as baselines for future research.

Revisiting ResNets: Improved Training and Scaling Strategies

The paper "Revisiting ResNets: Improved Training and Scaling Strategies" presents an investigation into the performance enhancements achievable through refined training and scaling strategies for the well-established ResNet architecture in computer vision tasks. The aim is to disentangle the effects of architectural, training, and scaling improvements, with a significant emphasis placed on the latter two aspects.

The work systematically assesses how modern training approaches enhance ResNet performance, finding that these methods consistently improve accuracy. The authors conduct a comprehensive experimental study showing that contemporary regularization and data augmentation methods contribute significantly to the gains, underscoring the importance of training methodology alongside architectural innovation.

Key Findings

The paper yields a number of insights and recommendations:

  1. Training Strategies Matter: Adopting recent training methodologies, such as cosine learning rate decay, dropout, label smoothing, stochastic depth, and RandAugment, improves ResNet performance markedly. For instance, ResNet-200, initially achieving 79.0% top-1 ImageNet accuracy, improved to 82.2% solely through enhanced training, demonstrating that thoughtful training modifications alone can provide substantial gains (a code sketch of these techniques follows this list).
  2. Scaling Strategy Analysis: Scaling of model width, depth, and image resolution was scrutinized to optimize the speed-accuracy Pareto frontier. Crucially, the best scaling strategy depends on the training regime: depth scaling is more effective in longer-epoch regimes because it mitigates the overfitting that width scaling exacerbates through its faster parameter growth (see the scaling sketch after this list).
  3. ResNet-RS Architecture: A new family of ResNet variants, termed ResNet-RS, emerged from this work. These models incorporate the improved training and scaling strategies, strike a balance between speed and accuracy, and outperform EfficientNets on the speed-accuracy Pareto curve for both training and inference: ResNet-RS models are 1.7x to 2.7x faster on TPUs while offering similar or better accuracies.
  4. Improving EfficientNet: A further examination shows that even EfficientNets improve when image resolution is scaled more slowly, suggesting that EfficientNet's original compound scaling rule might be suboptimal.
  5. Transfer and Semi-supervised Learning: ResNet-RS models demonstrate robust performance in a semi-supervised learning context, achieving substantial speed-ups and competitive accuracies against Noisy Student models. They also perform strongly in transfer learning tasks, rivaling or surpassing self-supervised approaches like SimCLR in several downstream tasks.
  6. Video Classification: The methodologies applied also benefit video classification tasks, extending to 3D ResNets where they resulted in significant accuracy increases on the Kinetics-400 dataset.
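
A minimal PyTorch sketch of how the training refinements named in point 1 might be assembled. The framework choice and every hyperparameter value (RandAugment magnitude, label-smoothing factor, dropout and stochastic-depth rates, learning rate, weight decay) are illustrative assumptions, not settings reported in the paper.

```python
import torch
import torch.nn as nn
from torchvision import transforms
from torchvision.ops import StochasticDepth

# RandAugment applied in the input pipeline (num_ops/magnitude are assumed values).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# Label smoothing folded directly into the classification loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Dropout before the classifier head (rate and feature width are placeholders).
head = nn.Sequential(nn.Dropout(p=0.25), nn.Linear(2048, 1000))


class ResidualBlockWithStochasticDepth(nn.Module):
    """Wraps a residual branch so it is randomly dropped per sample during training."""

    def __init__(self, block: nn.Module, drop_prob: float = 0.1):
        super().__init__()
        self.block = block
        self.stochastic_depth = StochasticDepth(p=drop_prob, mode="row")

    def forward(self, x):
        return x + self.stochastic_depth(self.block(x))


def make_optimizer_and_schedule(model: nn.Module, epochs: int):
    """SGD with momentum plus cosine learning-rate decay over the full run."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=4e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```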

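To make the scaling guidance in points 2 and 4 concrete, here is a hypothetical Python helper encoding the qualitative rules. The 100-epoch threshold, the sub-linear resolution exponent, and the helper itself are assumptions for illustration; the paper establishes the qualitative behavior, not this exact formula.

```python
def choose_scaling(train_epochs: int, compute_multiplier: float,
                   base_depth: int = 50, base_width: float = 1.0,
                   base_resolution: int = 224):
    """Return (depth, width multiplier, image resolution) for a scaled-up model."""
    if train_epochs >= 100:
        # Long training regime: overfitting is the bottleneck, so scale depth.
        depth = int(base_depth * compute_multiplier)
        width = base_width
    else:
        # Short training regime: underfitting dominates, so scale width instead.
        depth = base_depth
        width = base_width * compute_multiplier
    # Grow image resolution slowly (sub-linearly in the compute multiplier) rather
    # than at the faster rate prescribed by EfficientNet's compound scaling.
    resolution = int(base_resolution * compute_multiplier ** 0.25)
    return depth, width, resolution


# Example: doubling compute in a 350-epoch regime favours a deeper model
# at only a modestly higher resolution.
print(choose_scaling(train_epochs=350, compute_multiplier=2.0))  # (100, 1.0, 266)
```
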
Implications and Future Directions

These findings have both practical and theoretical implications. Practically, revisiting established architectures with modernized training regimes yields computational efficiencies with direct applicability, especially in environments where computational resources are constrained. Theoretically, disentangling training and architectural impacts fosters a more nuanced understanding of model behavior, potentially guiding future architectural designs that better synergize with advanced training methodologies.

For future developments in AI, this research reinforces the need to critically evaluate and optimize training and scaling strategies, alongside architectural innovation. It suggests that future architectural explorations might benefit from a co-design strategy, where models are tailored to leverage hardware efficiently. This can lead to improved system performance without changing the underlying model architecture substantially, potentially influencing developments in both model scaling and deployment in resource-constrained environments, like mobile devices.

This paper notably calls for careful consideration of baseline settings in comparative studies, advocating for the use of consistent, modern training strategies across competing models to avoid conflating the benefits of architecture and training improvements. Lastly, the evidence suggests a promising avenue for further research into the scalability of these findings across broader types of machine learning tasks and data regimes.

Authors (8)
  1. Irwan Bello (12 papers)
  2. William Fedus (25 papers)
  3. Xianzhi Du (30 papers)
  4. Ekin D. Cubuk (37 papers)
  5. Aravind Srinivas (20 papers)
  6. Tsung-Yi Lin (49 papers)
  7. Jonathon Shlens (58 papers)
  8. Barret Zoph (38 papers)
Citations (272)