EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training (2405.08768v1)

Published 14 May 2024 in cs.CV, cs.AI, and cs.LG

Abstract: The superior performance of modern visual backbones usually comes with a costly training procedure. We contribute to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. These patterns, when observed through frequency and spatial domains, incorporate lower-frequency components, and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet the exposure to the 'easier-to-learn' patterns of each example is initiated first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these aspects and design curriculum schedules with tailored search algorithms. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. It reduces the training time of a wide variety of popular models by 1.5-3.0x on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).

Summary

  • The paper proposes a generalized curriculum learning framework that trains on all data throughout, revealing progressively harder patterns within each example rather than selecting easier-to-harder samples.
  • It uses frequency-domain cropping and scheduled data augmentation so that models learn simple, low-frequency patterns before tackling complex details.
  • The approach achieves 1.5–3.0x training speedups on ImageNet-1K/22K without sacrificing accuracy, and transfers well to tasks like object detection and semantic segmentation.

EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training

Introduction

Training sophisticated visual backbones, such as vision Transformers (ViTs), is typically computationally intensive: it involves optimizing hundreds of millions of parameters over massive datasets. For example, training ViT-H/14 on the JFT-300M dataset requires a staggering 2,500 TPUv3-core-days. This computational load translates into significant time and cost, as well as environmental impact from carbon emissions. EfficientTrain++, introduced by Yulin Wang et al., mitigates these challenges with a generalized curriculum learning framework.

Generalized Curriculum Learning

The Idea Behind Curriculum Learning

Curriculum learning is inspired by human education, where learning starts with simpler tasks before moving to more complex ones. Traditional curriculum learning progressively introduces harder training examples as training proceeds. However, designing this sample selection is difficult, and it does not work universally well across models and tasks.

Generalized Curriculum Learning Framework

EfficientTrain++ builds on curriculum learning by hypothesizing that each training sample contains both easier-to-learn and harder-to-learn patterns. Rather than selecting samples progressively, the approach uses all of the data at every stage while initially exposing only the low-complexity patterns within each example and gradually adding complexity, as in the sketch below.
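
To make the contrast with sample selection concrete, here is a minimal sketch of a training loop in this spirit. The blending transform is a hypothetical stand-in for the Fourier-based operations described later, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def curriculum_transform(x: torch.Tensor, progress: float) -> torch.Tensor:
    """Expose progressively harder patterns as `progress` goes 0 -> 1."""
    # Cheap low-pass proxy: average-pool, then upsample back to full size.
    blurred = F.interpolate(F.avg_pool2d(x, kernel_size=4),
                            size=x.shape[-2:], mode="bilinear",
                            align_corners=False)
    # Early training sees mostly the blurred (easier) content; the
    # original (harder) details are blended in as progress grows.
    return progress * x + (1.0 - progress) * blurred

# Every sample is used at every stage; only the exposed patterns change:
# for step, (images, labels) in enumerate(loader):
#     images = curriculum_transform(images, progress=step / total_steps)
#     loss = criterion(model(images), labels)
```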

Identifying Easier-to-Learn Patterns

Frequency Domain Approach

Observing Learning Dynamics

Empirical evidence shows that visual backbones first learn the lower-frequency components of images before moving on to more complex, higher-frequency details. Based on this, low-frequency versions of the training images were constructed by cropping their Fourier spectra. Models trained on these low-frequency images often matched the early-stage accuracy of models trained on the originals, suggesting that the low-frequency components carry the easier-to-learn patterns. A sketch of the cropping operation follows.
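
A minimal sketch of the Fourier-domain cropping, assuming square BCHW inputs; the bandwidth value and intensity rescaling are illustrative rather than the paper's exact code:

```python
import torch

def low_freq_crop(x: torch.Tensor, bandwidth: int) -> torch.Tensor:
    """Keep a centered 2*bandwidth window of the Fourier spectrum,
    then invert it back into a smaller image."""
    h, w = x.shape[-2:]
    # Center the spectrum so low frequencies sit in the middle.
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    cy, cx = h // 2, w // 2
    crop = spec[..., cy - bandwidth:cy + bandwidth,
                cx - bandwidth:cx + bandwidth]
    out = torch.fft.ifft2(torch.fft.ifftshift(crop, dim=(-2, -1))).real
    # Rescale intensities to compensate for the smaller inverse transform.
    return out * (2 * bandwidth) ** 2 / (h * w)

x = torch.randn(8, 3, 224, 224)
x_low = low_freq_crop(x, bandwidth=80)  # -> shape (8, 3, 160, 160)
```

Note that the output is also spatially smaller, which is what makes early training cheaper.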

Implementation and Results

By starting training with low-frequency images and progressively introducing higher-frequency content, models maintained competitive accuracy while reducing computational costs:

  • Savings: Up to 20% in computational cost.
  • Accuracy: Remained comparable to baseline models.

Spatial Domain Approach

Data Augmentation

Modern training pipelines rely on strong data augmentation. Weakening the augmentation during the early stages of training (e.g., reducing the magnitude of RandAugment) was found to help models learn more effectively; a scheduling sketch follows this list:

  • Final Accuracy: Increased by up to 1% in DeiT-Small.
  • Empirical Gains: Consistent across various models and datasets.
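
One way to schedule this is a linear ramp of the RandAugment magnitude; both the linear shape and the endpoints below are assumptions for illustration:

```python
from torchvision.transforms import RandAugment

def randaugment_for_epoch(epoch: int, total_epochs: int,
                          final_magnitude: int = 9) -> RandAugment:
    """Rebuild RandAugment each epoch with a linearly ramped magnitude."""
    m = round(final_magnitude * epoch / max(total_epochs - 1, 1))
    return RandAugment(num_ops=2, magnitude=m)

# epoch 0 -> magnitude 0 (weak), final epoch -> magnitude 9 (full strength)
```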

EfficientTrain Curriculum

A combination of frequency cropping and weaker-to-stronger data augmentation was employed. The frequency-cropping schedule was devised with a greedy search algorithm, sketched after this bullet:

  • Greedy-Search Algorithm: reduces the input bandwidth at each training stage wherever doing so does not degrade final accuracy, keeping training computationally efficient.
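
A pseudocode-level sketch of such a greedy search; `proxy_accuracy` is a hypothetical hook that cheaply estimates accuracy for a given stage and input bandwidth:

```python
def greedy_bandwidth_schedule(stages, bandwidths, proxy_accuracy,
                              tolerance=0.1):
    """For each stage, pick the smallest bandwidth whose estimated
    accuracy stays within `tolerance` of the full-bandwidth baseline."""
    full = max(bandwidths)
    schedule = []
    for stage in stages:
        baseline = proxy_accuracy(stage, full)
        chosen = full
        for b in sorted(bandwidths):  # try the cheapest options first
            if baseline - proxy_accuracy(stage, b) <= tolerance:
                chosen = b
                break
        schedule.append(chosen)
    return schedule
```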

EfficientTrain++

EfficientTrain++ introduces:

  1. Computation-Constrained Sequential Searching:
    • Motivation: minimize training cost under an explicit computational budget.
    • Implementation: search the curriculum schedule stage by stage under a fixed total computational budget, adjusting the number of training steps as the input size varies, without degrading model accuracy.
  2. Efficient Low-Frequency Downsampling:
    • Method: replace the Fourier-domain cropping with an equivalent low-pass filter followed by image down-sampling, which is far cheaper in the CPU data pipeline (a sketch follows this list).
  3. Implementation Techniques:
    • Larger Batch Sizes with Small Inputs: improves GPU utilization when early-stage inputs are small.
    • Replay Buffer: reduces the data pre-processing load on CPUs.
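
A minimal sketch of the cheaper spatial-domain replacement, assuming an anti-aliased resize as the low-pass-plus-downsample step (the exact filter used in the paper may differ):

```python
import torch
import torch.nn.functional as F

def low_freq_downsample(x: torch.Tensor, out_size: int) -> torch.Tensor:
    # antialias=True low-pass filters before subsampling, so only
    # lower-frequency content survives -- approximating Fourier
    # cropping without any FFT.
    return F.interpolate(x, size=(out_size, out_size), mode="bilinear",
                         antialias=True, align_corners=False)

x = torch.randn(8, 3, 224, 224)
x_small = low_freq_downsample(x, out_size=160)
```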

Results and Performance

Training Efficiency

EfficientTrain++ was benchmarked against various models on ImageNet-1K and ImageNet-22K:

  • Consistency: Maintained or improved accuracy across diverse model architectures.
  • Training Speedup: training cost reduced by 1.5–3.0x without sacrificing accuracy.
  • Wall-time Reduction: Significant practical reductions when training on GPUs.

Transferability

EfficientTrain++ also demonstrated excellent transferability to various downstream tasks:

  • Object Detection: consistent performance improvements on the COCO benchmark.
  • Semantic Segmentation: notable mIoU gains on ADE20K.

Conclusion

EfficientTrain++ presents a thoughtful evolution of curriculum learning, gradually introducing visual complexity to models during training. By combining low-frequency cropping, scheduled augmentation, and efficient implementation techniques, it delivers a substantial reduction in training cost while preserving model performance. This balance of efficiency and effectiveness makes it a valuable approach for modern deep network training, in both practical deployment and future research.