
Cross-Modal Fine-Tuning: Align then Refine (2302.05738v2)

Published 11 Feb 2023 in cs.LG

Abstract: Fine-tuning large-scale pretrained models has led to tremendous progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other modalities due to a lack of relevant pretrained models. In this work, we propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. ORCA adapts to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities. Through extensive experiments, we show that ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities, outperforming a wide range of hand-designed, AutoML, general-purpose, and task-specific methods. We highlight the importance of data alignment via a series of ablation studies and demonstrate ORCA's utility in data-limited regimes.

Citations (31)

Summary

  • The paper introduces Orca, a three-stage framework that aligns data distributions for effective cross-modal fine-tuning.
  • It pairs a custom embedder with the OTDD alignment metric to preserve the pretrained transformer's knowledge, which proves especially valuable in data-scarce settings.
  • Empirical evaluations across 12 modalities and over 60 datasets demonstrate Orca’s superior performance compared to traditional and AutoML methods.

Cross-Modal Fine-Tuning: Align then Refine

The paper "Cross-Modal Fine-Tuning: Align then Refine" introduces a framework named Orca that extends large-scale pretrained models to modalities far from their original training domains. Conventional fine-tuning of such models, developed primarily for well-explored areas like NLP and vision, does not translate readily to less-studied modalities such as genomics or physical simulations. Orca addresses this with a three-stage process: architecture design for dimensionality alignment, embedding-network learning for distributional alignment, and a full fine-tuning phase.
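
To make the workflow concrete, below is a minimal PyTorch sketch of the align-then-refine loop. This is an illustrative assumption, not the paper's implementation: the pretrained body is replaced by a small `TransformerEncoder` stand-in, the alignment loss is a simple moment-matching proxy rather than the OTDD objective Orca actually minimizes, and all names (`Embedder`, `alignment_loss`) and hyperparameters are hypothetical.

```python
# Minimal sketch of an align-then-refine workflow, assuming a PyTorch setup.
# The "pretrained body" is a small TransformerEncoder stand-in, and the
# alignment loss is a moment-matching proxy for the paper's OTDD objective.
import torch
import torch.nn as nn

HIDDEN_DIM, SEQ_LEN, NUM_CLASSES = 64, 16, 10

class Embedder(nn.Module):
    """Stage 1: map a target-modality input (here a flat vector) to a
    token sequence in the pretrained model's embedding space."""
    def __init__(self, in_dim, hidden_dim=HIDDEN_DIM, seq_len=SEQ_LEN):
        super().__init__()
        self.proj = nn.Linear(in_dim, seq_len * hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)
        self.seq_len, self.hidden_dim = seq_len, hidden_dim

    def forward(self, x):
        tokens = self.proj(x).view(x.size(0), self.seq_len, self.hidden_dim)
        return self.norm(tokens)

# Stand-ins for the pretrained transformer body and a task head.
body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN_DIM, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(HIDDEN_DIM, NUM_CLASSES)
embedder = Embedder(in_dim=128)

def alignment_loss(target_emb, source_emb):
    """Stage 2 proxy loss: match first and second moments of the embedded
    target features to reference source-modality embeddings. Orca instead
    minimizes OTDD, which also uses label information."""
    t, s = target_emb.flatten(0, 1), source_emb.flatten(0, 1)
    mean_gap = (t.mean(0) - s.mean(0)).pow(2).sum()
    var_gap = (t.var(0) - s.var(0)).pow(2).sum()
    return mean_gap + var_gap

# Toy data: target-modality inputs/labels and reference source embeddings.
x_tgt = torch.randn(32, 128)
y_tgt = torch.randint(0, NUM_CLASSES, (32,))
src_reference = torch.randn(32, SEQ_LEN, HIDDEN_DIM)

# Stage 2: train only the embedder to align distributions.
opt_embed = torch.optim.Adam(embedder.parameters(), lr=1e-3)
for _ in range(10):
    opt_embed.zero_grad()
    loss = alignment_loss(embedder(x_tgt), src_reference)
    loss.backward()
    opt_embed.step()

# Stage 3: fine-tune embedder, body, and head on the target task.
params = list(embedder.parameters()) + list(body.parameters()) + list(head.parameters())
opt_all = torch.optim.Adam(params, lr=1e-4)
for _ in range(10):
    opt_all.zero_grad()
    logits = head(body(embedder(x_tgt)).mean(dim=1))  # mean-pool tokens
    loss = nn.functional.cross_entropy(logits, y_tgt)
    loss.backward()
    opt_all.step()
```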

Key Contributions

  1. Embedder Architecture and Distribution Alignment: Orca designs a custom embedder architecture compatible with any pretrained transformer. The embedder transforms inputs from the target modality into token sequences the pretrained transformer can process. In the embedder-learning stage, Orca minimizes the optimal transport dataset distance (OTDD) between the embedded target data and a reference dataset from the pretraining modality, aligning both feature and label distributions. This alignment limits distortion of the pretrained weights, which is crucial when adapting models to drastically different domains.
  2. Empirical Evaluation with Broad Tasks: Orca's efficacy is validated across an extensive set of tasks spanning 12 modalities and over 60 datasets. It outperforms both hand-designed models and automated machine learning (AutoML) architectures previously established for specific domains. The empirical results highlight the importance of the embedding alignment stage, showing improved accuracy in downstream applications.
  3. Superior Results in Data-Scarce Environments: A distinguishing claim is Orca's effectiveness when training data is limited. Experiments that simulate reduced-data conditions show Orca consistently outperforming naive fine-tuning strategies by leveraging the pretrained model's knowledge.
  4. Traditional Model Comparisons and Fine-Tuning Strategy: In contrast to methods like Frozen Pretrained Transformers (FPT), which adapt only a small subset of model parameters, Orca's results show that full-parameter fine-tuning, when preceded by data alignment, yields better performance. The paper also compares alignment metrics, with OTDD providing more consistent improvements than alternatives such as maximum mean discrepancy (MMD); a simplified illustration of the two metrics follows this list.
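
The sketch below contrasts a feature-only MMD with a simplified, label-aware optimal-transport distance in the spirit of OTDD. It is a hedged approximation rather than the paper's actual OTDD computation: the class-conditional term uses diagonal-Gaussian approximations, the OT problem is solved with a hand-rolled Sinkhorn iteration, and the function names (`rbf_mmd`, `otdd_like`) and hyperparameters are illustrative assumptions.

```python
# Contrast a feature-only MMD with a simplified label-aware OT distance.
# Real OTDD couples a feature cost with a Wasserstein distance between
# class-conditional distributions; here the class term uses diagonal
# Gaussians and a basic Sinkhorn solver.
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Maximum mean discrepancy with an RBF kernel (ignores labels)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def class_stats(x, y, num_classes):
    """Mean and diagonal variance of each class-conditional distribution."""
    return [(x[y == c].mean(0), x[y == c].var(0)) for c in range(num_classes)]

def gaussian_w2(stat_a, stat_b):
    """Squared 2-Wasserstein distance between diagonal Gaussians."""
    (ma, va), (mb, vb) = stat_a, stat_b
    return (ma - mb).pow(2).sum() + (va.sqrt() - vb.sqrt()).pow(2).sum()

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropic OT cost between uniform marginals over the cost matrix."""
    n, m = cost.shape
    scaled = cost / cost.max()              # rescale for numerical stability
    K = torch.exp(-scaled / eps)
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = u[:, None] * K * v[None, :]
    return (plan * cost).sum()

def otdd_like(x_s, y_s, x_t, y_t, num_classes, label_weight=1.0):
    """Simplified label-aware dataset distance: the ground cost combines a
    feature distance with the distance between class-conditionals."""
    stats_s = class_stats(x_s, y_s, num_classes)
    stats_t = class_stats(x_t, y_t, num_classes)
    feat_cost = torch.cdist(x_s, x_t).pow(2)
    label_cost = torch.zeros_like(feat_cost)
    for i in range(x_s.size(0)):
        for j in range(x_t.size(0)):
            label_cost[i, j] = gaussian_w2(stats_s[y_s[i]], stats_t[y_t[j]])
    return sinkhorn(feat_cost + label_weight * label_cost)

# Toy comparison on random features with 3 classes per dataset.
torch.manual_seed(0)
xs, ys = torch.randn(60, 16), torch.randint(0, 3, (60,))
xt, yt = torch.randn(60, 16) + 0.5, torch.randint(0, 3, (60,))
print("MMD:", rbf_mmd(xs, xt).item())
print("OTDD-like:", otdd_like(xs, ys, xt, yt, num_classes=3).item())
```

Unlike MMD, the label-aware cost penalizes matching a source sample to a target sample whose class-conditional distribution looks different, which is the intuition behind using OTDD for alignment.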

Practical and Theoretical Implications

Practically, Orca allows researchers to unlock the potential of existing pretrained models for new, less-studied applications in the physical sciences, healthcare, and finance without building bespoke architectures that demand extensive domain expertise. Theoretically, it extends the known boundaries of transfer learning by demonstrating systematic model adaptation beyond the intra-domain settings traditionally studied.

Anticipated Directions in AI Research

The methodology laid out by Orca can guide future advancements towards truly general AI systems capable of learning across a spectrum of tasks with varied input-output dimensions and modalities. Further innovations might involve integrating Orca’s alignment-centric technique with more sophisticated transfer learning paradigms, such as domain-specific adapters or modular multi-modal pretraining, to enhance performance in even broader application areas and more efficiently tackle high-dimensional problems.

Conclusion

The research offers a careful treatment of cross-modal model transfer, a domain ripe with opportunity yet sparsely explored. Orca not only challenges the conventional scope of pretrained models' applicability but also presents a practical workflow for harnessing the general-purpose knowledge embedded in these models across fields. Through its alignment-based approach to fine-tuning on diverse data, Orca sets a methodological precedent for subsequent work aimed at scalable and versatile machine learning applications.
