Faithful recovery from true latent features in text-guided reconstruction

Establish whether text-guided visual image reconstruction pipelines built on CLIP features and text-to-image diffusion models (e.g., Stable Diffusion or Versatile Diffusion) can recover an original target image with high perceptual similarity when given that image's true latent features. Passing this check is a fundamental requirement for accurate reconstruction from brain activity.

Background

The authors (Shirakawa et al., 2024) argue that a minimal requirement for any visual image reconstruction method is the ability to recover a target image when its latent features are known exactly. This check gives an upper bound on the fidelity the generator can achieve. They perform such a recovery check and find that, for text-guided methods, reconstructions generated by diffusion models from true CLIP features are semantically similar to the targets but not perceptually faithful, whereas iCNN-based approaches recover the images more accurately.
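The logic of such a recovery check can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. The `encode` and `generate` callables are hypothetical placeholders for a feature extractor (e.g., a CLIP encoder) and a generator (e.g., a diffusion model), and pixel correlation is used only as a simple stand-in for whatever perceptual similarity measure one prefers. The toy encoder/generator pairs exist only to show how a lossless latent and a lossy latent behave under the same check.

```python
import numpy as np


def pixel_correlation(a, b):
    """Pearson correlation between two images, flattened to vectors.

    Assumption: a simple stand-in for a perceptual similarity metric.
    """
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])


def recovery_check(images, encode, generate, similarity=pixel_correlation):
    """Compute true latents, regenerate each image, and score it against the original.

    `encode` and `generate` are hypothetical callables supplied by the caller.
    """
    return [similarity(img, generate(encode(img))) for img in images]


# Toy data: four random 8x8 RGB "images".
rng = np.random.default_rng(0)
images = [rng.random((8, 8, 3)) for _ in range(4)]

# Toy pair 1 (hypothetical): the latent keeps every pixel, so recovery is exact.
lossless_scores = recovery_check(
    images,
    encode=lambda x: x.ravel(),
    generate=lambda z: z.reshape(8, 8, 3),
)

# Toy pair 2 (hypothetical): the latent keeps only the mean color per channel,
# so the generator can reproduce coarse content but not the spatial layout.
lossy_scores = recovery_check(
    images,
    encode=lambda x: x.mean(axis=(0, 1)),
    generate=lambda z: np.tile(z, (8, 8, 1)),
)

print("lossless:", np.round(lossless_scores, 3))
print("lossy:   ", np.round(lossy_scores, 3))
```

In this sketch the lossless pair scores near 1 for every image, while the lossy pair scores much lower even though it is given the exact latent of each target. That gap is the kind of failure the recovery check is designed to expose.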

They explicitly state that it has been unclear whether recent text-guided reconstruction methods meet this basic recovery criterion. Resolving this uncertainty is crucial to validate whether such pipelines can genuinely reconstruct perceived images rather than primarily generating semantically plausible outputs (hallucinations).

References

To ensure that a visual image reconstruction method has the potential to faithfully reproduce an individual's perceived visual experiences, it is crucial that the method can recover the original images with a high degree of perceptual similarity when the neural translation from brain activity to latent features is perfect. However, it has been unclear whether recent text-guided reconstruction methods meet this fundamental requirement.

— Spurious reconstruction from brain activity (2405.10078 - Shirakawa et al., 2024) in Results — Case study — Failed recovery of a stimulus from its latent features
