- The paper proposes a generalized curriculum learning framework, EfficientTrain++, that cuts the cost of training visual backbones while maintaining competitive accuracy.
- It uses frequency-domain cropping and progressively stronger data augmentation so that models learn simple patterns before tackling complex details.
- The approach achieves 1.5–3.0x training speedups and transfers well to downstream tasks such as object detection and semantic segmentation.
EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training
Introduction
Training sophisticated visual backbone models, such as vision Transformers (ViTs), is typically a computationally intensive process, involving hundreds of millions of parameters and expansive datasets. For example, training ViT-H/14 on the JFT-300M dataset requires a staggering 2,500 TPUv3-core-days. This computational load translates into significant time and cost, as well as environmental impact from carbon emissions. EfficientTrain++, introduced by Yulin Wang et al., proposes a generalized curriculum learning framework to mitigate these challenges.
Generalized Curriculum Learning
The Idea Behind Curriculum Learning
Curriculum learning is inspired by human education, where learning starts from simpler tasks before moving to more complex ones. Traditional curriculum learning progressively introduces harder training examples as training proceeds. However, designing a reliable difficulty measure for selecting those examples is hard, and such selection schemes are not universally effective.
Generalized Curriculum Learning Framework
EfficientTrain++ builds on curriculum learning by hypothesizing that every training sample contains both easier-to-learn and harder-to-learn patterns. Rather than merely selecting samples progressively, the approach always trains on all data, but initially exposes only each sample's low-complexity patterns and gradually adds complexity.
Identifying Easier-to-Learn Patterns
Frequency Domain Approach
Observing Learning Dynamics
Empirical evidence shows that visual backbones first learn the lower-frequency components of images before picking up higher-frequency details. Based on this observation, low-frequency versions of training images were constructed by cropping the central region of their Fourier spectrum. Models trained on these low-frequency images initially performed similarly to models trained on the originals, suggesting that low-frequency components are the easier-to-learn patterns.
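To make the frequency-domain idea concrete, here is a minimal sketch (not the paper's exact implementation) of constructing a low-frequency image by keeping only the central region of its Fourier spectrum; the `bandwidth` value is illustrative:

```python
import numpy as np

def low_frequency_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Keep only the central (low-frequency) bandwidth x bandwidth region
    of the image's 2D Fourier spectrum, then transform back to pixels."""
    h, w = image.shape[:2]
    out = np.zeros_like(image, dtype=float)
    cy, cx, b = h // 2, w // 2, bandwidth // 2
    for c in range(image.shape[2]):  # filter each channel independently
        spectrum = np.fft.fftshift(np.fft.fft2(image[:, :, c]))
        mask = np.zeros((h, w))
        mask[cy - b:cy + b, cx - b:cx + b] = 1.0  # keep low frequencies only
        out[:, :, c] = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return out

# Toy example: a 32x32 RGB image reduced to its lowest 8x8 frequency band.
img = np.random.rand(32, 32, 3)
low = low_frequency_crop(img, bandwidth=8)
```

With `bandwidth` equal to the full image size, the mask keeps the entire spectrum and the transform round-trips back to the original image, which is a handy sanity check.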
Implementation and Results
By starting training with low-frequency images and progressively introducing higher-frequency content, models maintained competitive accuracy while reducing computational costs:
- Savings: Up to 20% in computational cost.
- Accuracy: Remained comparable to baseline models.
Spatial Domain Approach
Data Augmentation
Modern training pipelines rely on strong data augmentation. Weakening the augmentation during the early stages of training (e.g., reducing the magnitude of RandAugment) was found to help models learn more effectively:
- Final Accuracy: Increased by up to 1% in DeiT-Small.
- Empirical Gains: Consistent across various models and datasets.
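A weaker-to-stronger augmentation strategy can be sketched as a simple ramp on the augmentation magnitude; the linear shape and the magnitude range below are illustrative assumptions, not the paper's exact schedule:

```python
def randaug_magnitude(step: int, total_steps: int,
                      min_m: float = 0.0, max_m: float = 9.0) -> float:
    """Linearly ramp the RandAugment magnitude from min_m to max_m over
    training, so early epochs see weak augmentation and later epochs strong.
    The linear ramp and the 0..9 range are illustrative choices."""
    progress = min(step / max(total_steps, 1), 1.0)
    return min_m + (max_m - min_m) * progress

# Early in training the augmentation is weak...
weak = randaug_magnitude(0, 100)      # -> 0.0
# ...and reaches full strength by the end.
strong = randaug_magnitude(100, 100)  # -> 9.0
```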
EfficientTrain Curriculum
A combination of frequency cropping and a weaker-to-stronger data augmentation strategy was employed. The frequency-cropping schedule was devised with a greedy search algorithm:
- Greedy search: at each stage of training, the input size (frequency bandwidth) is reduced as far as possible without degrading final model performance, which directly lowers computational cost.
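The flavor of such a greedy search can be sketched as follows; `evaluate` is a hypothetical proxy for "final accuracy when this stage is trained at this bandwidth", and the routine is an illustration rather than the paper's actual procedure:

```python
def greedy_bandwidth_schedule(stages, candidates, evaluate, tolerance=0.0):
    """For each training stage, greedily pick the smallest frequency
    bandwidth whose (estimated) accuracy stays within `tolerance` of
    full-bandwidth training. `evaluate(stage, b)` is a user-supplied proxy."""
    full = max(candidates)
    schedule = []
    for stage in stages:
        baseline = evaluate(stage, full)
        chosen = full
        for b in sorted(candidates):  # try the smallest bandwidths first
            if evaluate(stage, b) >= baseline - tolerance:
                chosen = b
                break
        schedule.append(chosen)
    return schedule

# Toy proxy: each stage has a minimum bandwidth it "needs" for full
# accuracy; earlier stages tolerate lower-frequency inputs.
needed = {0: 8, 1: 16, 2: 32}
proxy = lambda stage, b: 1.0 if b >= needed[stage] else 0.5
schedule = greedy_bandwidth_schedule([0, 1, 2], [8, 16, 32], proxy)
# schedule grows from low to full bandwidth: [8, 16, 32]
```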
EfficientTrain++
EfficientTrain++ introduces:
- Computational-Constrained Sequential Searching:
  - Motivation: To find the best curriculum under a fixed total training budget.
  - Implementation: When the input size is reduced, the number of training steps is adjusted correspondingly, so candidate schedules are compared at equal computational cost without degrading model accuracy.
- Efficient Low-Frequency Downsampling:
  - Method: Replace costly Fourier-domain cropping with CPU-friendly low-pass filtering followed by image down-sampling, an approximation that is far cheaper in the data pipeline.
- Implementation Techniques:
  - Larger batch sizes with small inputs: keeps GPUs saturated when low-resolution inputs are used.
  - Replay buffer: reduces the data pre-processing load on CPUs by reusing recently pre-processed samples.
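The low-frequency down-sampling idea can be approximated with an average-pool (a box low-pass filter) followed by sub-sampling; this sketch uses a box filter for clarity, whereas a real pipeline would likely use a better anti-aliasing filter:

```python
import numpy as np

def lowpass_downsample(image: np.ndarray, factor: int) -> np.ndarray:
    """Cheap low-frequency proxy: average-pool over factor x factor blocks
    (a box low-pass filter), which simultaneously down-samples the image.
    A box filter stands in here for a proper anti-aliasing filter."""
    h, w, c = image.shape
    h2, w2 = h // factor, w // factor
    cropped = image[:h2 * factor, :w2 * factor]  # drop ragged edge pixels
    return cropped.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))

img = np.random.rand(224, 224, 3)
small = lowpass_downsample(img, factor=2)  # 112x112 low-frequency view
```

Because the filtered image is also smaller, the model's per-step compute drops along with the high-frequency content, which is exactly what makes this variant attractive compared to Fourier cropping at full resolution.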
Results and Performance
Training Efficiency
EfficientTrain++ was benchmarked against various models on ImageNet-1K and ImageNet-22K:
- Consistency: Maintained or improved accuracy across diverse model architectures.
- Training Speedup: 1.5 to 3.0-fold speedups (i.e., a proportional reduction in training cost) without sacrificing accuracy.
- Wall-time Reduction: Significant practical reductions when training on GPUs.
Transferability
EfficientTrain++ also demonstrated excellent transferability to various downstream tasks:
- Object Detection: Consistent improvements in COCO dataset performance.
- Semantic Segmentation: Notable mIoU improvements in ADE20K.
Conclusion
EfficientTrain++ presents a thoughtful evolution of curriculum learning by focusing on gradually introducing visual complexity to models during training. By leveraging low-frequency cropping, optimized augmentation strategies, and efficient computational techniques, it provides a remarkable reduction in training costs while ensuring high model performance. This balance of efficiency and effectiveness makes it a valuable approach for advancing modern deep network training both in practical deployment and future research.