DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets (2404.02900v1)

Published 3 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNN), ViT's simple architecture has no informative inductive bias (e.g., locality). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT effectively on balanced datasets. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distillation from CNN via a distillation DIST token, by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, and the classifier CLS token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.

DeiT-LT: Enhancing Vision Transformer Training on Long-Tailed Datasets with Distillation

Introduction

"DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets" presents an innovative approach to training Vision Transformers (ViTs) effectively on long-tailed datasets by leveraging distillation techniques. The premise of the work is rooted in the challenge posed by long-tailed distributions prevalent in real-world datasets, where a small number of classes (the "head") possess a large number of examples, while a larger number of classes (the "tail") have relatively few examples. The authors propose DeiT-LT, a novel training framework designed to enhance the performance of ViTs on such imbalanced datasets without necessitating large-scale pre-training.

Key Contributions

  1. Distillation DIST Token: DeiT-LT introduces an efficient and effective way to distill knowledge from a CNN teacher through a dedicated distillation (DIST) token, using out-of-distribution images; this induces local, CNN-like features in the early ViT blocks and substantially improves performance on tail classes.
  2. Re-Weighting the Distillation Loss: DeiT-LT re-weights the distillation loss to focus training on tail classes, which is essential for mitigating the class imbalance (a minimal sketch of the resulting two-token objective follows this list).
  3. Dual Expertise within ViT: The distilled ViT models embody dual expertise, where the classifier (CLS) token becomes proficient with head classes and the distillation (DIST) token excels with the tail classes. This dual expertise is crucial for addressing the disparity in class representation within the training data.
  4. Generalization through Low-Rank Features: DeiT-LT further distills from flat CNN teachers trained via Sharpness-Aware Minimization (SAM), promoting the learning of low-rank, and hence more generalizable, features across all ViT blocks (a simplified SAM update is sketched below).
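
Under the assumptions that the student is a DeiT-style ViT emitting separate CLS and DIST logits and that a frozen CNN teacher is evaluated on the same out-of-distribution (strongly augmented) views, the following minimal sketch shows one plausible wiring of the two-token objective from contributions 1-3; the hard-label distillation and the inverse-frequency `class_weights` are illustrative choices, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def deit_lt_loss(cls_logits, dist_logits, targets, teacher_logits, class_weights):
    """Two-token objective: the CLS head trains on ground-truth labels
    (head-class expert), while the DIST head trains on hard teacher
    predictions with per-class re-weighting that up-weights tail classes."""
    # Standard cross-entropy on ground-truth labels for the CLS token.
    cls_loss = F.cross_entropy(cls_logits, targets)

    # Hard distillation: the DIST token matches the CNN teacher's argmax
    # labels, computed on the same out-of-distribution views.
    teacher_labels = teacher_logits.argmax(dim=-1)
    per_sample = F.cross_entropy(dist_logits, teacher_labels, reduction="none")

    # Re-weight the distillation loss so tail-class predictions contribute
    # more; class_weights could come from inverse (effective) class frequency.
    dist_loss = (class_weights[teacher_labels] * per_sample).mean()

    return 0.5 * cls_loss + 0.5 * dist_loss
```

At inference, DeiT-style models typically average the predictions of the two heads, letting the head-class expert (CLS) and the tail-class expert (DIST) complement each other.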

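Contribution 4 relies on the teacher being trained with Sharpness-Aware Minimization (SAM), which seeks flat minima and, as the paper argues, yields low-rank teacher features. The sketch below is a simplified single-batch SAM update following the original two-pass scheme of Foret et al. (2020); it is an illustration under those assumptions, not the paper's training code:

```python
import torch

def sam_step(model, loss_fn, optimizer, inputs, targets, rho=0.05):
    """One SAM update: (1) ascend to the worst-case nearby weights
    w + rho * g / ||g||, (2) take the optimizer step using the gradient
    computed at that perturbed point."""
    # First forward/backward pass at the current weights w.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))

    # Step 1: perturb the weights toward the locally worst direction.
    with torch.no_grad():
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (norm + 1e-12)
            p.add_(e)
            eps.append(e)

    # Second forward/backward pass at the perturbed weights w + eps.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then update w with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```
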
Numerical Results

The effectiveness of DeiT-LT is underscored by its performance across a range of benchmarks, from small-scale (CIFAR-10 LT, CIFAR-100 LT) to large-scale (ImageNet-LT, iNaturalist-2018) datasets. The gains are particularly pronounced on CIFAR-100 LT under severe imbalance, where the re-weighted distillation notably improves tail-class accuracy, demonstrating that DeiT-LT scales from small, heavily skewed datasets to large real-world ones.

Implications and Future Work

  • Practical Implications: The DeiT-LT framework holds considerable promise for practical applications, especially in domains where data imbalance is a perennial challenge. It mitigates the need for large-scale pre-training, making it a cost-effective solution for deploying ViTs in specialized areas such as medical imaging and satellite imagery analysis.
  • Theoretical Implications: From a theoretical viewpoint, the paper prompts further inquiry into the mechanisms through which distillation and loss re-weighting shape the learning dynamics of transformers relative to CNNs, particularly on imbalanced datasets.
  • Speculation on Future Developments: The paper also points toward refinements of distillation techniques and their integration within transformer architectures. Future work could explore the interplay between architectural modifications and distillation strategies to further close the performance gap between head and tail classes.

In conclusion, "DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets" presents a compelling methodology for enhancing the performance of Vision Transformers on imbalanced datasets. By introducing a novel distillation scheme and re-weighting the distillation loss, the authors set a precedent for training ViTs more effectively and efficiently under long-tailed distributions.

Authors (5)
  1. Harsh Rangwani (14 papers)
  2. Pradipto Mondal (1 paper)
  3. Mayank Mishra (38 papers)
  4. Ashish Ramayee Asokan (5 papers)
  5. R. Venkatesh Babu (108 papers)