ResNet strikes back: An improved training procedure in timm (2110.00476v1)

Published 1 Oct 2021 in cs.CV and cs.LG

Abstract: The influential Residual Networks designed by He et al. remain the gold-standard architecture in numerous scientific publications. They typically serve as the default architecture in studies, or as baselines when new architectures are proposed. Yet there has been significant progress on best practices for training neural networks since the inception of the ResNet architecture in 2015. Novel optimization & data-augmentation have increased the effectiveness of the training recipes. In this paper, we re-evaluate the performance of the vanilla ResNet-50 when trained with a procedure that integrates such advances. We share competitive training settings and pre-trained models in the timm open-source library, with the hope that they will serve as better baselines for future work. For instance, with our more demanding training setting, a vanilla ResNet-50 reaches 80.4% top-1 accuracy at resolution 224x224 on ImageNet-val without extra data or distillation. We also report the performance achieved with popular models with our training procedure.

Authors (3)
  1. Ross Wightman (5 papers)
  2. Hugo Touvron (22 papers)
  3. Hervé Jégou (71 papers)
Citations (442)

Summary

An Improved Training Procedure for ResNet-50 in Image Classification

In pursuit of better neural network performance, this paper revisits the ResNet-50 architecture, originally introduced by He et al., to assess the impact of contemporary training techniques. Since the introduction of ResNets in 2015, significant strides have been made in training methodology, and this paper leverages those advances to improve ResNet-50's accuracy on image classification.

The paper focuses on optimizing the training of the standard ResNet-50 architecture with updated recipes. By incorporating modern optimization strategies and data-augmentation techniques, the authors report 80.4% top-1 accuracy on the ImageNet validation set at a resolution of 224×224 pixels. This result is achieved without additional training data or distillation, showing that improving the training procedure alone can yield substantial gains in model performance.
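
For readers who want to start from these weights, the snippet below is a minimal sketch of loading a pretrained ResNet-50 through timm and building the matching preprocessing. The specific weight tag 'resnet50.a1_in1k' assumes a recent timm release that ships the checkpoints from this paper; plain 'resnet50' with pretrained=True also resolves to an ImageNet-1k checkpoint.

```python
# Hedged sketch: load a pretrained ResNet-50 from timm for inference.
# The tag 'resnet50.a1_in1k' assumes a recent timm release; adjust if unavailable.
import torch
import timm

model = timm.create_model('resnet50.a1_in1k', pretrained=True)
model.eval()

# Build the preprocessing pipeline matching the model's pretraining config.
config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)

# Dummy forward pass at 224x224: output has 1000 ImageNet class logits.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```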

Three training procedures with differing computational budgets are proposed, spanning 100, 300, and 600 epochs, with hyper-parameters and augmentation strategies adapted to each. The longest, 600-epoch schedule targets the best final accuracy and takes roughly 4.6 days on a standard node equipped with 4 V100 32GB GPUs. Among the changes, the approach replaces the traditional cross-entropy loss with a binary cross-entropy loss when Mixup and CutMix data augmentations are used, treating each mixed target as an independent binary classification, i.e., a multi-label view of the task. The paper presents evidence that this change has a favorable impact on convergence and final performance. A minimal sketch of the loss swap follows below.
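
To make the loss change concrete, the following is a minimal sketch in plain PyTorch, not timm's actual implementation, of swapping softmax cross-entropy for binary cross-entropy on Mixup-style soft targets. The helper name `mixup_bce_loss` and the Beta(0.2, 0.2) mixing coefficient are illustrative assumptions, and details such as label smoothing and CutMix box sampling are omitted.

```python
# Minimal sketch of the CE -> BCE swap under Mixup (illustrative, not timm's code).
import torch
import torch.nn.functional as F

def mixup_bce_loss(logits, y_a, y_b, lam, num_classes):
    """BCE on mixed one-hot targets instead of softmax cross-entropy."""
    targets = lam * F.one_hot(y_a, num_classes).float() \
        + (1.0 - lam) * F.one_hot(y_b, num_classes).float()
    # Each class is scored independently, i.e. a multi-label view of the batch.
    return F.binary_cross_entropy_with_logits(logits, targets)

# Usage: mix two label sets with a coefficient drawn from a Beta distribution.
lam = torch.distributions.Beta(0.2, 0.2).sample().item()
logits = torch.randn(8, 1000)
y_a = torch.randint(0, 1000, (8,))
y_b = torch.randint(0, 1000, (8,))
loss = mixup_bce_loss(logits, y_a, y_b, lam, num_classes=1000)
```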

Extensive ablations confirm the contributions of individual components such as the learning rate schedule, weight decay, repeated augmentation (see the sketch after this paragraph), and the loss function. These settings were tuned to avoid overfitting while checking robustness across different random seeds. Notably, the paper addresses the variability in results caused by stochastic factors in training, underscoring the importance of evaluating over multiple seeds and validating on datasets beyond standard benchmarks like ImageNet.
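
Repeated augmentation, one of the ablated components, means that each batch contains several differently augmented views of the same images. The sketch below is a hypothetical sampler illustrating the idea in plain PyTorch; it is not timm's implementation, and the class name and the `num_repeats` default are assumptions.

```python
# Conceptual sketch of repeated augmentation (not timm's implementation):
# each image index is yielded several times per epoch, so batches contain
# multiple copies of the same images, each receiving its own random augmentation.
import torch
from torch.utils.data import Sampler

class RepeatedAugmentationSampler(Sampler):
    """Hypothetical sampler yielding each index `num_repeats` times, shuffled per epoch."""
    def __init__(self, dataset_len, num_repeats=3):
        self.dataset_len = dataset_len
        self.num_repeats = num_repeats

    def __iter__(self):
        order = torch.randperm(self.dataset_len)
        # Repeat each shuffled index consecutively so copies land in the same batch.
        return iter(order.repeat_interleave(self.num_repeats).tolist())

    def __len__(self):
        return self.dataset_len * self.num_repeats
```

Passing such a sampler to a `DataLoader` (with `shuffle=False`) produces batches in which each index appears `num_repeats` times, and the dataset's random transforms then generate a distinct view for each copy.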

The implications of the findings extend towards setting stronger baselines for future architecture studies. By revisiting and refining the training procedures, the paper emphasizes that performance gains can be achieved not only through novel architectures but also through optimized training methodologies.

Looking ahead, while the procedures demonstrate improved baselines for ResNet-50, they highlight the necessity of adapting training strategies specific to architectural nuances and task objectives. Continued development in adaptive and automated hyper-parameter searching methods might play a critical role in further optimizing both architecture-specific and universal training recipes.

In sum, this work underscores how modernizing the training procedure can substantially strengthen a time-tested architecture like ResNet-50, setting a benchmark that may change how baseline models are evaluated and adapted in image classification going forward.
