
What explains the success of cross-modal fine-tuning with ORCA? (2403.13537v1)

Published 20 Mar 2024 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and then fine-tuning the embedder and model. Despite ORCA's high performance on a variety of downstream tasks, it is not precisely understood how each of these components contributes to its success. We therefore run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary, but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
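To make the two-stage recipe described in the abstract concrete, the sketch below (a rough illustration, not the authors' code) shows what an ORCA-like pipeline looks like: an embedder first maps the new modality into the pre-trained transformer's token-embedding space and is trained with a distribution-alignment objective, after which the embedder and model are fine-tuned jointly on the downstream task. The class and function names (CustomEmbedder, train_embedder, fine_tune) and the simple mean-matching alignment loss are stand-in assumptions; ORCA itself aligns the embedded target data with the source modality using an optimal-transport dataset distance (Alvarez-Melis and Fusi, 2020).

```python
# Minimal, illustrative sketch of an ORCA-style two-stage pipeline in PyTorch.
# Names and the mean-matching loss are assumptions for illustration only.
import torch
import torch.nn as nn


class CustomEmbedder(nn.Module):
    """Maps inputs from the new (1D) modality into the pre-trained model's embedding space."""

    def __init__(self, in_channels: int, embed_dim: int, patch: int = 4):
        super().__init__()
        self.proj = nn.Conv1d(in_channels, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                     # x: (batch, channels, length)
        return self.proj(x).transpose(1, 2)   # -> (batch, seq_len, embed_dim)


def train_embedder(embedder, loader, reference_embeddings, steps=1000, lr=1e-3):
    """Stage 1: train only the embedder so target embeddings resemble the source modality's.
    A crude mean-matching loss stands in for ORCA's OT-based dataset distance."""
    opt = torch.optim.Adam(embedder.parameters(), lr=lr)
    ref_mean = reference_embeddings.mean(dim=(0, 1))           # (embed_dim,)
    for _, (x, _) in zip(range(steps), loader):
        z = embedder(x)                                         # (batch, seq, embed_dim)
        loss = ((z.mean(dim=(0, 1)) - ref_mean) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()


def fine_tune(embedder, body, head, loader, epochs=5, lr=1e-4):
    """Stage 2: fine-tune the embedder, pre-trained transformer body, and task head jointly.
    Assumes `body` is a Hugging Face encoder (e.g. RobertaModel) that accepts `inputs_embeds`."""
    params = list(embedder.parameters()) + list(body.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            hidden = body(inputs_embeds=embedder(x)).last_hidden_state  # (batch, seq, dim)
            loss = loss_fn(head(hidden.mean(dim=1)), y)                  # mean-pool, then classify
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Under this reading, the paper's ablations amount to skipping or shortening stage 1 (embedder training) versus varying how much of the model is updated in stage 2 (fine-tuning), which is how the abstract attributes most of the gains to the latter.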

References (15)
  1. David Alvarez-Melis and Nicolo Fusi. 2020. Geometric dataset distances via optimal transport. In Advances in Neural Information Processing Systems, volume 33, pages 21428–21439. Curran Associates, Inc.
  2. Li Deng. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142.
  3. Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical report.
  4. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  5. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022.
  6. Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. 2021. Pretrained transformers as universal computation engines. arXiv preprint.
  7. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
  8. Satellite image time series analysis under time warping. IEEE Transactions on Geoscience and Remote Sensing, 50(8):3081–3095.
  9. Junhong Shen, Liam Li, Lucio M. Dery, Corey Staten, Mikhail Khodak, Graham Neubig, and Ameet Talwalkar. 2023. Cross-modal fine-tuning: Align then refine. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 31030–31056. PMLR.
  10. Junhong Shen, Tanya Marwah, and Ameet Talwalkar. 2024. UPS: Towards foundation models for PDE solving via cross-modal adaptation. arXiv preprint arXiv:2403.07187.
  11. Movements classification of multi-channel sEMG based on CNN and stacking ensemble learning. IEEE Access, 7:137489–137500.
  12. OmniPred: Language models as universal regressors. arXiv preprint.
  13. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  14. Renbo Tu, Nicholas Roberts, Mikhail Khodak, Junhong Shen, Frederic Sala, and Ameet Talwalkar. 2022. NAS-Bench-360: Benchmarking neural architecture search on diverse tasks. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  15. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online. Association for Computational Linguistics.
