Information Theoretic Text-to-Image Alignment (2405.20759v1)

Published 31 May 2024 in cs.LG and cs.CV

Abstract: Diffusion models for Text-to-Image (T2I) conditional generation have seen tremendous success recently. Despite their success, accurately capturing user intentions with these models still requires a laborious trial-and-error process. This challenge is commonly identified as a model alignment problem, an issue that has attracted considerable attention from the research community. Instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-LLMs to steer image generation, in this work we present a novel method that relies on an information-theoretic alignment measure. In a nutshell, our method uses self-supervised fine-tuning and relies on point-wise mutual information (MI) between prompts and images to define a synthetic training set to induce model alignment. Our comparative analysis shows that our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate MI and a lightweight fine-tuning strategy.


Authors (5)

Chao Wang (555 papers)
Giulio Franzese (18 papers)
Alessandro Finamore (19 papers)
Massimo Gallo (8 papers)
Pietro Michiardi (58 papers)

