
Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning (2403.06059v1)

Published 10 Mar 2024 in cs.CV

Abstract: Vision-Language Pre-trained (VLP) models such as CLIP have demonstrated remarkable effectiveness in learning generic visual representations. Several approaches attempt to efficiently adapt VLP models to downstream tasks with limited supervision, leveraging the knowledge these models acquire during pre-training. However, existing methods either introduce biased representations or incur high computational complexity, which limits their effectiveness in fine-tuning CLIP. Moreover, a model trained on data from a specific domain generalizes poorly to unseen domains. In this work, we propose the Test-Time Distribution LearNing Adapter (TT-DNA), which operates directly at test time. Specifically, we estimate Gaussian distributions over the visual features of the few-shot support images to capture the knowledge in the support set. The cosine similarity between the query image and the feature distribution of the support images serves as the visual adapter's prediction. This prediction is then merged with the original CLIP prediction via a residual connection to produce the final prediction. Extensive experiments on visual reasoning for human-object interaction show that TT-DNA outperforms existing state-of-the-art methods by large margins.
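The adapter described in the abstract can be summarized in a short sketch. This is a minimal illustration of the stated idea, not the authors' implementation: representing each class's Gaussian by its estimated mean, the input shapes, and the residual weight `alpha` are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def tt_dna_predict(clip_logits, query_feat, support_feats, support_labels,
                   num_classes, alpha=0.5):
    """Sketch of the TT-DNA prediction step described in the abstract.

    clip_logits:    (C,) zero-shot CLIP logits for the query image.
    query_feat:     (D,) CLIP visual feature of the query image.
    support_feats:  (N, D) CLIP visual features of the few-shot support set.
    support_labels: (N,) class index of each support image.
    alpha:          residual mixing weight (an assumption; the paper may
                    set or tune this differently).
    """
    adapter_logits = torch.zeros(num_classes)
    for c in range(num_classes):
        feats_c = support_feats[support_labels == c]
        # Estimate a Gaussian over this class's support features; here only
        # the mean is used as the class representative (a simplification).
        mu = feats_c.mean(dim=0)
        # Cosine similarity between the query and the class distribution
        # gives the adapter's score for class c.
        adapter_logits[c] = F.cosine_similarity(query_feat, mu, dim=0)
    # Residual connection: blend the adapter's prediction with CLIP's
    # original zero-shot prediction to form the final logits.
    return clip_logits + alpha * adapter_logits
```

Under these assumptions the method needs no gradient updates: everything is computed from the frozen CLIP features of the support set at test time, which is consistent with the abstract's claim that TT-DNA works directly during the testing period.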

Authors (2)
  1. Yi Zhang (994 papers)
  2. Ce Zhang (215 papers)
Citations (1)
