
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment (2404.18253v5)

Published 28 Apr 2024 in cs.CV and cs.LG

Abstract: With the rise of Visual and Language Pretraining (VLP), an increasing number of downstream tasks are adopting the paradigm of pretraining followed by fine-tuning. Although this paradigm has demonstrated potential in various multimodal downstream tasks, its implementation in the remote sensing domain encounters some obstacles. Specifically, the tendency for same-modality embeddings to cluster together impedes efficient transfer learning. To tackle this issue, we review the aim of multimodal transfer learning for downstream tasks from a unified perspective, and rethink the optimization process based on three distinct objectives. We propose "Harmonized Transfer Learning and Modality Alignment (HarMA)", a method that simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment, while minimizing training overhead through parameter-efficient fine-tuning. Remarkably, without the need for external training data, HarMA achieves state-of-the-art performance on two popular multimodal retrieval tasks in the field of remote sensing. Our experiments reveal that HarMA achieves competitive and even superior performance to fully fine-tuned models with only minimal adjustable parameters. Due to its simplicity, HarMA can be integrated into almost all existing multimodal pretraining models. We hope this method can facilitate the efficient application of large models to a wide range of downstream tasks while significantly reducing resource consumption. Code is available at https://github.com/seekerhuang/HarMA.
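The three objectives named in the abstract (task constraints, cross-modal alignment, and single-modality uniform alignment) can be illustrated as a combined training loss. The following is a minimal NumPy sketch under our own assumptions, not the authors' released implementation (see the linked GitHub repository for that): the function names, the loss weights `w_align` and `w_unif`, and the specific InfoNCE and hypersphere-uniformity formulations are illustrative choices, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(img, txt, temperature=0.07):
    """Symmetric cross-modal contrastive loss: matched (image, text)
    pairs sit on the diagonal of the similarity matrix."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()               # diagonal = positives

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def uniformity(x, t=2.0):
    """Intra-modal uniformity on the hypersphere: lower values mean
    embeddings are spread out rather than clustered together."""
    x = l2_normalize(x)
    sq_dists = np.sum((x[:, None] - x[None, :]) ** 2, axis=-1)
    off_diag = ~np.eye(len(x), dtype=bool)
    return np.log(np.exp(-t * sq_dists[off_diag]).mean())

def combined_objective(img, txt, task_loss, w_align=1.0, w_unif=0.5):
    """Task loss + cross-modal alignment + per-modality uniformity.
    The weights here are hypothetical, not values from the paper."""
    unif = 0.5 * (uniformity(img) + uniformity(txt))
    return task_loss + w_align * info_nce(img, txt) + w_unif * unif
```

In a parameter-efficient setup, only lightweight adapter parameters producing `img` and `txt` would receive gradients from this objective; the pretrained backbone would stay frozen, which is what keeps the number of adjustable parameters minimal.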



Authors (1)
