StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention (2402.16092v2)

Published 25 Feb 2024 in cs.CV

Abstract: Utilizing large-scale pretrained models is a well-known strategy to enhance performance on various target tasks. It is typically achieved through fine-tuning pretrained models on target tasks. However, naïve fine-tuning may not fully leverage knowledge embedded in pretrained models. In this study, we introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures. This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning. Specifically, in each block, instead of self-attention, cross-attention is performed stochastically according to a predefined probability, where keys and values are extracted from the corresponding block of a pretrained model. By doing so, queries and channel-mixing multi-layer perceptron layers of a target model are fine-tuned to target tasks to learn how to effectively exploit rich representations of pretrained models. To verify the effectiveness of StochCA, extensive experiments are conducted on benchmarks in the areas of transfer learning and domain generalization, where the exploitation of pretrained models is critical. Our experimental results show the superiority of StochCA over state-of-the-art approaches in both areas. Furthermore, we demonstrate that StochCA is complementary to existing approaches, i.e., it can be combined with them to further improve performance. Our code is available at https://github.com/daintlab/stochastic_cross_attention
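
The block-level mechanism described in the abstract can be sketched as a drop-in variant of a standard pre-norm Transformer block. The snippet below is a minimal PyTorch sketch, not the authors' released implementation (see the GitHub link above): the class and argument names (`StochCABlock`, `p`, `x_pretrained`) are illustrative, and as a simplification the keys and values are computed by the target block's own attention projections from the frozen pretrained hidden states, whereas the paper extracts them from the corresponding pretrained block. The `self.training` guard reflects the assumption that cross-attention is applied only during fine-tuning.

```python
import torch
import torch.nn as nn


class StochCABlock(nn.Module):
    """Sketch of a pre-norm Transformer block with stochastic cross-attention.

    With probability `p` during training, keys and values come from the hidden
    states of the corresponding block of a frozen pretrained model; otherwise
    ordinary self-attention is used. Illustrative only, not the paper's code.
    """

    def __init__(self, dim: int, num_heads: int, p: float = 0.5):
        super().__init__()
        self.p = p
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, x_pretrained: torch.Tensor) -> torch.Tensor:
        # x:            hidden states of the target (fine-tuned) model
        # x_pretrained: hidden states of the same block in the frozen pretrained model
        q = self.norm1(x)
        if self.training and torch.rand(1).item() < self.p:
            # Cross-attention: queries from the target model,
            # keys/values derived from the frozen pretrained features.
            kv = self.norm1(x_pretrained).detach()
        else:
            # Standard self-attention.
            kv = q
        attn_out, _ = self.attn(q, kv, kv, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x
```

In this reading, the queries and the MLP of the target block are always trained, so the target model learns to query the pretrained representations rather than copy them, which matches the abstract's description of how pretrained knowledge is exploited.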

Authors (3)
  1. Seungwon Seo (3 papers)
  2. Suho Lee (2 papers)
  3. Sangheum Hwang (18 papers)
