Convert VLM 2AFC judgments into an optimizable perceptual metric for GAN-based compression

Determine a principled procedure to convert binary two-alternative-forced-choice (2AFC) judgments of visual similarity produced by vision-language models into an optimizable, differentiable perceptual metric that can be used to train generative adversarial network (GAN)–based perceptually oriented image compression models.

Background

The paper shows that state-of-the-art vision-language models (e.g., Gemini 2.5 Flash) can replicate human 2AFC judgments of perceptual image similarity in a zero-shot manner. Existing perceptual compression systems typically rely on differentiable perceptual metrics (such as LPIPS) that are trained on human judgments and used directly as losses, but these metrics have known limitations, including potential null-space exploitation and limited generalization beyond their calibration datasets.
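For context, metrics like LPIPS convert human 2AFC judgments into a differentiable distance by fitting a learned weighting of feature distances with a logistic (Bradley–Terry style) loss on distance differences. The sketch below illustrates this conversion on synthetic data; the feature distances, channel count, and `true_w` weights are invented for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: each 2AFC trial compares two candidates A and B against
# a reference. Each candidate is summarized by per-channel feature
# distances to the reference; the metric is a learned non-negative channel
# weighting w (an LPIPS-style linear calibration layer).
n_pairs, n_channels = 2000, 8
true_w = np.array([3.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])  # hypothetical

d0 = rng.random((n_pairs, n_channels))  # distances reference -> candidate A
d1 = rng.random((n_pairs, n_channels))  # distances reference -> candidate B

# Simulated binary 2AFC labels: 1 if B was judged closer to the reference.
margin_true = d0 @ true_w - d1 @ true_w
p_true = 1.0 / (1.0 + np.exp(-4.0 * margin_true))
y = (rng.random(n_pairs) < p_true).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fit w by logistic regression on distance differences:
#   P(B preferred) = sigmoid(D(ref, A) - D(ref, B)),
# where D is differentiable in w, so the fitted D can serve as a loss.
w = np.zeros(n_channels)
lr = 0.5
for _ in range(500):
    diff = d0 - d1                      # (n_pairs, n_channels)
    p = sigmoid(diff @ w)               # predicted prob that B is preferred
    grad = diff.T @ (p - y) / n_pairs   # gradient of binary cross-entropy
    w -= lr * grad
    w = np.maximum(w, 0.0)              # keep channel weights non-negative

acc = np.mean((sigmoid((d0 - d1) @ w) > 0.5) == (y > 0.5))
```

The open question the paper raises is precisely whether an analogous fitting procedure remains sound when the 2AFC labels come from a VLM rather than from humans.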

Motivated by the difficulty of turning binary VLM judgments into a differentiable objective suitable for gradient-based training of GAN-based compression systems, the authors instead propose a diffusion-based compression framework (VLIC) that can use preference-based post-training (Diffusion DPO) without requiring a differentiable reward. The passage quoted below explicitly flags the unresolved question of how to translate VLM 2AFC outputs into an optimizable perceptual metric for GAN-based methods.
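DPO-style post-training sidesteps the need for a differentiable reward: it consumes only binary preferences, which is exactly the form of a VLM 2AFC judgment. The sketch below shows the generic DPO objective on one preference pair; Diffusion DPO replaces the exact log-likelihoods with denoising-loss surrogates, which this simplified form does not attempt to reproduce.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO objective on a single preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred ("winner")
    and dispreferred ("loser") outputs; ref_* are the same quantities
    under a frozen reference model. Only a binary preference is needed,
    not a differentiable reward model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization (policy == reference) the loss is log(2); it falls as
# the policy raises the preferred output's likelihood relative to the
# reference, and rises if it favors the dispreferred output instead.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(1.0, -1.0, 0.0, 0.0)
```

This illustrates why the diffusion route is attractive for VLM judges, and why no comparable recipe is yet established for GAN-based pipelines, which expect a differentiable perceptual loss.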

References

However, it is not clear how to convert the binary 2AFC judgments produced by VLMs into an optimizable perceptual metric which can be exploited by existing GAN-based perceptually oriented compression systems.

VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression (2512.15701 - Sargent et al., 17 Dec 2025) in Section 1 (Introduction)