Frozen Feature Augmentation for Few-Shot Image Classification (2403.10519v2)

Published 15 Mar 2024 in cs.CV

Abstract: Training a linear classifier or lightweight model on top of pretrained vision model outputs, so-called 'frozen features', leads to impressive performance on a number of downstream few-shot tasks. Currently, frozen features are not modified during training. On the other hand, when networks are trained directly on images, data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper, we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space, dubbed 'frozen feature augmentation (FroFA)', covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA, such as brightness, can improve few-shot performance consistently across three network architectures, three large pretraining datasets, and eight transfer datasets.


Summary

  • The paper demonstrates that applying data augmentation directly on frozen features significantly improves few-shot classification performance.
  • It categorizes twenty augmentation techniques, revealing that stylistic and per-channel transformations offer the greatest performance boosts.
  • Extensive experiments across various datasets and architectures validate the robustness and transfer learning potential of frozen feature augmentations.

Frozen Feature Augmentation Enhances Few-Shot Image Classification

Introduction

Advances in vision transformers (ViTs) and their strong performance on ImageNet and other benchmarks have steered recent research toward reusing these models for a wide range of applications. A growing trend is to pretrain on extensive datasets and then adapt the model to downstream tasks. Among these adaptation methods, training lightweight models on frozen features extracted from pretrained backbones has proven remarkably effective across numerous few-shot tasks. However, data augmentation, a pivotal ingredient when networks are trained directly on images, has so far not been applied in the frozen feature space. This paper fills that gap by introducing and extensively analyzing data augmentations applied directly to frozen features.

Theoretical Framework

The research is grounded in the hypothesis that data augmentations, when applied to frozen features, can improve a model's robustness and generalization much as they do in the image space. The paper categorizes twenty data augmentations into geometric, crop & drop, stylistic, and other classes, and tests how each affects few-shot image classification performance. Augmentations are applied after a pointwise scaling of features extracted from pretrained ViTs, aligning them with conventional image value ranges. Rather than predefining stochastic transformations on raw inputs, the paper examines how such transformations behave in the latent space of large-scale pretrained models.
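
To make this concrete, the sketch below applies a pointwise brightness shift to frozen ViT token features after rescaling them to an image-like value range. It is a minimal sketch under stated assumptions: the min-max scaling scheme, the [0, 255] target range, the additive form of the brightness offset, and the tensor shapes are illustrative choices, not the paper's exact recipe.

```python
# Minimal sketch of a pointwise brightness FroFA on frozen ViT token features of
# shape (batch, tokens, channels). The scaling scheme, value range, and additive
# offset are illustrative assumptions, not the paper's implementation.
import torch


def scale_to_image_range(feats, lo=0.0, hi=255.0):
    """Map each example's features to an image-like range; return inverse params."""
    f_min = feats.amin(dim=(1, 2), keepdim=True)
    f_max = feats.amax(dim=(1, 2), keepdim=True)
    scaled = (feats - f_min) / (f_max - f_min + 1e-8) * (hi - lo) + lo
    return scaled, (f_min, f_max, lo, hi)


def unscale(scaled, params):
    """Invert scale_to_image_range."""
    f_min, f_max, lo, hi = params
    return (scaled - lo) / (hi - lo) * (f_max - f_min + 1e-8) + f_min


def brightness_frofa(feats, max_delta=32.0):
    """Add one random brightness offset per example, shared across tokens and channels."""
    scaled, params = scale_to_image_range(feats)
    delta = torch.empty(feats.shape[0], 1, 1).uniform_(-max_delta, max_delta)
    augmented = (scaled + delta).clamp(0.0, 255.0)
    return unscale(augmented, params)


# Example: a batch of 8 frozen feature maps with 196 tokens and 768 channels.
tokens = torch.randn(8, 196, 768)
augmented = brightness_frofa(tokens)
```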

Methodology and Experimental Setup

The analysis covers eight few-shot classification transfer datasets and uses models pretrained on JFT-3B, ImageNet-21k, and WebLI. Downstream few-shot performance is evaluated with a lightweight multitask head trained on top of the augmented frozen features. Several classes of augmentations are explored, including geometric transformations (such as rotations), crop & drop operations, intensity and color adjustments (termed stylistic augmentations), and novel augmentations conceived specifically for the feature space.
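
The hypothetical sketch below illustrates this setup for a single task: a linear head is trained on pooled frozen features while a brightness-style jitter is re-drawn every epoch. The single linear head, AdamW optimizer, jitter magnitude, and the use of pooled rather than token-level features are simplifying assumptions; the paper's multitask head and training schedule are more involved.

```python
# Hedged sketch of few-shot training on augmented frozen features. The head
# architecture, optimizer, and the simplified additive jitter standing in for a
# FroFA brightness augmentation are illustrative assumptions.
import torch
import torch.nn as nn


def pointwise_jitter(feats, max_delta=0.2):
    """One random additive offset per example, shared across channels (brightness-style)."""
    delta = torch.empty(feats.shape[0], 1).uniform_(-max_delta, max_delta)
    return feats + delta


def train_linear_head(frozen_feats, labels, num_classes, epochs=100, lr=1e-3):
    """frozen_feats: (num_examples, dim) pooled features from a frozen backbone."""
    head = nn.Linear(frozen_feats.shape[-1], num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        # Augmentation happens in feature space; the frozen backbone is never run or updated.
        augmented = pointwise_jitter(frozen_feats)
        opt.zero_grad()
        loss = loss_fn(head(augmented), labels)
        loss.backward()
        opt.step()
    return head


# Example: a 10-way, 5-shot task with 768-dimensional pooled features.
feats = torch.randn(50, 768)
labels = torch.arange(10).repeat_interleave(5)
head = train_linear_head(feats, labels, num_classes=10)
```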

Key Insights and Observations

The analysis yields several intriguing observations:

  • Stylistic frozen feature augmentations predominantly outperform other classes, with linear adjustments such as brightness modifications leading to the most significant performance boosts across diverse settings.
  • The benefits of geometric and crop & drop augmentations are less pronounced, pointing towards a distinctive characteristic of feature space augmentations as opposed to traditional image augmentations.
  • Augmentations that apply channel-wise transformations offer a clear advantage, suggesting that per-channel variability is a key factor for enriching the representational capacity of frozen features (see the sketch after this list).
  • The paper also establishes the robustness of these findings across different network architectures, pretraining datasets, and transfer datasets, underscoring the broad applicability of feature space augmentation.
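
As a concrete reading of the per-channel observation above, the sketch below contrasts a shared (pointwise) brightness offset with a channel-wise variant that draws one offset per feature channel. The tensor layout and offset range are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged comparison of shared versus channel-wise brightness offsets on frozen
# ViT token features of shape (batch, tokens, channels).
import torch


def shared_brightness(tokens, max_delta=0.2):
    """One offset per example, applied identically to every token and channel."""
    delta = torch.empty(tokens.shape[0], 1, 1).uniform_(-max_delta, max_delta)
    return tokens + delta


def channelwise_brightness(tokens, max_delta=0.2):
    """One offset per example and per channel, shared only across tokens."""
    delta = torch.empty(tokens.shape[0], 1, tokens.shape[2]).uniform_(-max_delta, max_delta)
    return tokens + delta


tokens = torch.randn(8, 196, 768)
same_everywhere = shared_brightness(tokens)          # one scalar shift per example
varies_by_channel = channelwise_brightness(tokens)   # 768 independent shifts per example
```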

Practical Implications and Future Pathways

The findings from this paper hold substantive implications for the domain of transfer learning and few-shot learning. By establishing the effectiveness of frozen feature augmentations, this work opens up new avenues for leveraging large-scale pretrained models in data-constrained settings. The demonstrated augmentation strategies, especially the per-channel augmentations, present a low-cost, high-reward mechanism for enhancing the performance of lightweight models trained on frozen features.

Looking ahead, the research suggests the exploration of more nuanced augmentation strategies in the feature space, including channel-wise and element-wise transformations. The potential for combining frozen feature augmentations with novel pretraining and finetuning methodologies also emerges as a promising area for further investigation.

Conclusion

This extensive study elucidates the untapped potential of applying data augmentations in the frozen feature space, providing a fresh perspective on improving few-shot learning without retraining the backbone. The finding that simple stylistic augmentations offer substantial improvements establishes frozen feature augmentation as a practical tool for advancing few-shot image classification.
