TResNet: High Performance GPU-Dedicated Architecture (2003.13630v3)

Published 30 Mar 2020 in cs.CV, cs.LG, and eess.IV

Abstract: Many deep learning models, developed in recent years, reach higher ImageNet accuracy than ResNet50, with fewer or comparable FLOPS count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks' accuracy, while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieve better accuracy and efficiency than previous ConvNets. Using a TResNet model, with similar GPU throughput to ResNet50, we reach 80.8 top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive single-label classification datasets such as Stanford cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). They also perform well on multi-label classification and object detection tasks. Implementation is available at: https://github.com/mrT23/TResNet.

Citations (197)

Summary

  • The paper presents innovative architectural modifications that enhance GPU throughput while achieving high accuracy on benchmarks like ImageNet.
  • It employs design techniques such as the SpaceToDepth stem, Anti-Alias Downsampling, and Inplace-ABN to reduce memory usage and boost performance.
  • Experimental results show TResNet-M reaching 80.8% top-1 accuracy and state-of-the-art transfer learning outcomes on diverse datasets.

TResNet: High Performance GPU-Dedicated Architecture

The paper introduces TResNet, a family of GPU-dedicated models designed to deliver both high accuracy and high efficiency, evaluated primarily on ImageNet. The authors highlight the limitations of FLOPs as the sole indicator of efficiency and argue that measured throughput is the more pertinent metric for practical GPU usage.
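
Because the argument turns on measured throughput rather than FLOPs, a minimal sketch of how such a measurement is typically taken is shown below. The function name and parameters are illustrative; the paper's own benchmarking protocol differs in details such as batch size and numeric precision.

```python
import time

import torch

@torch.no_grad()
def inference_throughput(model: torch.nn.Module,
                         batch_size: int = 64,
                         image_size: int = 224,
                         warmup: int = 10,
                         iters: int = 50) -> float:
    """Return forward-pass images/sec on a single GPU."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(warmup):      # let cuDNN select kernels, warm caches
        model(x)
    torch.cuda.synchronize()     # drain queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters * batch_size / (time.perf_counter() - start)
```

Training throughput is measured the same way, but with the loss, backward pass, and optimizer step inside the timed loop.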

Key Contributions

The authors propose several architecture modifications to enhance neural network performance while maintaining efficient GPU utilization:

  1. SpaceToDepth Stem: Replaces the convolution-based stem unit with a rearrangement that moves spatial blocks into the channel dimension, reducing resolution with minimal information loss and improving both accuracy and throughput (see the first sketch after this list).
  2. Anti-Alias Downsampling (AA): An economical variant that replaces stride-2 convolutions with stride-1 convolutions followed by a stride-2 blur filter, improving shift-equivariance and robustness at only a modest cost in GPU speed (also sketched below).
  3. In-Place Activated BatchNorm (Inplace-ABN): Fuses BatchNorm and the activation into a single in-place operation, reducing the memory footprint and enabling larger batch sizes, which improves GPU utilization.
  4. Novel Block-Type Selection: Uses BasicBlock layers in the earlier stages and Bottleneck layers in the later ones, balancing receptive field against computational cost instead of applying one block type uniformly as in traditional ResNet models.
  5. Optimized SE Layers: Squeeze-and-excitation layers are placed selectively and their hyper-parameters tuned to cut computational overhead, improving speed without compromising accuracy (see the second sketch below).
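
The first two items lend themselves to compact code. Below is a minimal PyTorch sketch of a SpaceToDepth rearrangement and a blur-based anti-aliased downsampling layer; the class names, the block size of 4, and the fixed 3x3 binomial kernel are illustrative choices consistent with the paper's description, not its reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceToDepth(nn.Module):
    """Move non-overlapping bs x bs spatial blocks into channels:
    (N, C, H, W) -> (N, C*bs*bs, H/bs, W/bs). With bs=4, a 3x224x224
    image becomes 48x56x56 with no information loss and no parameters."""
    def __init__(self, block_size: int = 4):
        super().__init__()
        self.bs = block_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x = x.view(n, c, h // self.bs, self.bs, w // self.bs, self.bs)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
        return x.view(n, c * self.bs * self.bs, h // self.bs, w // self.bs)

class AntiAliasDownsample(nn.Module):
    """Fixed stride-2 depthwise blur. Placed after a stride-1 convolution,
    the pair replaces a stride-2 convolution with an anti-aliased one."""
    def __init__(self, channels: int):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)                    # 3x3 binomial kernel
        k = k / k.sum()
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).contiguous())
        self.groups = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=2, groups=self.groups)
```

In the stem, SpaceToDepth followed by a single convolution replaces ResNet50's 7x7 stride-2 convolution plus max-pooling, which produces the same 4x spatial reduction.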

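For item 5, a generic squeeze-and-excitation module is sketched below; the reduction factor and minimum hidden width are illustrative, and which stages the module is attached to is among the hyper-parameters the paper tunes. For item 3, the implementation relies on the external inplace_abn package, whose InPlaceABN layer stands in for a BatchNorm-plus-activation pair; that dependency is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExcite(nn.Module):
    """Channel attention: global average pool -> bottleneck MLP -> sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)  # floor keeps narrow stages useful
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=(2, 3), keepdim=True)              # squeeze: (N, C, 1, 1)
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))  # excitation weights
        return x * s                                      # per-channel gating
```

Because the excitation path operates on a pooled 1x1 summary, its cost depends on channel width rather than spatial resolution, so selective placement mainly avoids the expensive 1x1 convolutions in the wide, late stages.
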
Numerical Results

The TResNet model series achieves significant performance gains:

  • TResNet-M reaches 80.8% top-1 accuracy on ImageNet at a GPU throughput similar to that of ResNet50, which scores 79.0%.
  • Transfer learning tests on various datasets demonstrate state-of-the-art accuracy, with notable improvements on Stanford Cars (96.0%) and Oxford-Flowers (99.1%).
  • On multi-label classification (MS-COCO), TResNet surpasses previous results with an mAP of 86.4%.
  • For object detection on MS-COCO, TResNet reaches an mAP of 44.0%, compared to 42.8% with ResNet50 as the backbone.

Implications and Future Directions

This work advances the understanding of architecture design for GPU performance, promoting a shift from FLOPs-centric evaluations to a holistic consideration of actual throughput for both training and inference phases. The TResNet models highlight that practical speed gains are achievable without sacrificing accuracy on large-scale tasks.

Future research may extend TResNet's optimizations to deep learning domains beyond image classification and detection. As AI frameworks evolve, these insights into GPU-efficient design can inform the development of even more efficient models, potentially influencing standard practice in network-architecture evaluation.

Conclusion

TResNet sets a precedent for designing networks that marry high accuracy with efficient GPU utilization. By addressing both theoretical performance metrics and practical deployment considerations, this work provides substantial contributions to the deep learning architectural landscape, underscoring the importance of throughput in the evaluation of model efficiency.