Stabilizing full fine-tuning of fully autoregressive vision-language models
Develop an optimization procedure that enables stable end-to-end training of the fully autoregressive vision-language model architecture used in Idefics2 when all parameters—including the pre-trained SigLIP-SO400M vision encoder, the pre-trained Mistral-7B language model, and the newly initialized modality projection and perceiver pooling layers—are unfrozen and fully fine-tuned, without relying on parameter-efficient adapters such as LoRA.
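For context, a baseline full fine-tuning setup usually combines per-module learning rates (small for the pre-trained SigLIP-SO400M and Mistral-7B backbones, larger for the newly initialized connector layers), learning-rate warmup, and gradient-norm clipping. The sketch below illustrates such a setup in PyTorch; the module attribute names (vision_encoder, language_model, modality_projection, perceiver_pooler) and all hyperparameter values are illustrative assumptions rather than values from the paper.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model, total_steps, warmup_steps=500):
    """Per-module learning rates with linear warmup and cosine decay."""
    # Lower learning rates for the pre-trained backbones, a higher one for the
    # newly initialized modality projection and perceiver pooling layers.
    # Module names and learning rates are illustrative assumptions.
    param_groups = [
        {"params": model.vision_encoder.parameters(), "lr": 1e-6},
        {"params": model.language_model.parameters(), "lr": 5e-6},
        {"params": model.modality_projection.parameters(), "lr": 1e-4},
        {"params": model.perceiver_pooler.parameters(), "lr": 1e-4},
    ]
    optimizer = AdamW(param_groups, betas=(0.9, 0.999), weight_decay=0.1)

    def lr_lambda(step):
        # Linear warmup to each group's base LR, then cosine decay to zero.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)
    return optimizer, scheduler


def training_step(model, batch, optimizer, scheduler, max_grad_norm=1.0):
    optimizer.zero_grad(set_to_none=True)
    loss = model(**batch).loss  # assumes an HF-style forward returning .loss
    loss.backward()
    # Clip the global gradient norm to damp the spikes that precede divergence.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
    return loss.detach()
```

Any candidate procedure would need to go beyond this baseline: the authors report (see the quote below) that lowering the learning rate and gradually unfreezing components did not stabilize training, and the task rules out LoRA-style adapters.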
References
"Under these conditions, training the fully autoregressive architecture would yield loss divergences, and we were not successful in stabilizing the training even by aggressively lowering the learning rate or gradually unfreezing various components."
From "What matters when building vision-language models?" (Laurençon et al., 3 May 2024, arXiv:2405.02246), Section 3.2, "How does the fully autoregressive architecture compare to the cross-attention architecture?"