Stabilizing full fine-tuning of fully autoregressive vision-language models

Develop an optimization procedure that enables stable end-to-end training of the fully autoregressive vision-language model architecture used in Idefics2 when all parameters—including the pre-trained SigLIP-SO400M vision encoder, the pre-trained Mistral-7B language model, and the newly initialized modality projection and perceiver pooling layers—are unfrozen and fully fine-tuned, without relying on parameter-efficient adapters such as LoRA.
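To make the setting concrete, the sketch below shows what "all parameters unfrozen and fully fine-tuned" amounts to in practice. It assumes the Hugging Face Transformers implementation of Idefics2 (the `HuggingFaceM4/idefics2-8b` checkpoint) and a generic AdamW setup; the checkpoint name, dtype, and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the full fine-tuning regime, assuming the Hugging Face
# Transformers Idefics2 implementation; checkpoint, dtype, and optimizer
# hyperparameters are illustrative assumptions, not the paper's recipe.
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
)

# Unfreeze everything: the SigLIP-SO400M vision encoder, the Mistral-7B backbone,
# and the newly initialized modality projection / perceiver pooling layers.
for param in model.parameters():
    param.requires_grad = True

# Even with a conservative learning rate, the paper reports loss divergences in
# this regime; the optimizer below is only a placeholder configuration.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
```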

Background

The paper compares cross-attention and fully autoregressive architectures for vision-language models under different training regimes. With frozen unimodal backbones, the cross-attention architecture performs better; however, when the backbones are adapted with LoRA, the fully autoregressive architecture outperforms the cross-attention one.

The authors attempted full fine-tuning of all parameters for the fully autoregressive setup but encountered training instabilities and loss divergences. They resorted to LoRA to stabilize training, leaving unresolved how to achieve stable full fine-tuning without parameter-efficient adapters.
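For contrast, the sketch below illustrates the kind of LoRA-based stabilization the authors fell back on, using the PEFT library: low-rank adapters on the backbones' attention projections while the newly initialized layers train in full. The target module names, the rank/alpha values, and the `connector` name for the new layers are assumptions for illustration; the paper does not specify this exact configuration.

```python
# Minimal sketch of LoRA-stabilized training, assuming PEFT-style adapters on the
# attention projections of both backbones while the newly initialized connector
# (modality projection + perceiver pooling) is trained in full. Module names,
# rank, and alpha are illustrative assumptions, not the paper's configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    modules_to_save=["connector"],  # hypothetical name for the new projection/pooling layers
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters and the connector remain trainable
```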

References

Under these conditions, training the fully autoregressive architecture would yield loss divergences, and we were not successful in stabilizing the training even by aggressively lowering the learning rate or gradually unfreezing various components.

What matters when building vision-language models? (2405.02246 - Laurençon et al., 3 May 2024) in Section 3.2, “How does the fully autoregressive architecture compare to the cross-attention architecture?”