- The paper demonstrates that training with perceptual similarity metrics such as MS-SSIM yields image reconstructions that human observers significantly prefer over those from models trained with traditional pixel-wise losses.
- It introduces Expected-Loss Variational Autoencoders (EL-VAE), which extend standard VAE frameworks to incorporate differentiable perceptual loss functions for enhanced image synthesis.
- The study highlights that perceptual loss functions improve downstream tasks, including image super-resolution, by better capturing fine details and textures compared to conventional metrics.
An Evaluation of Perceptual Loss Functions in Image Generation
The paper, "Learning to Generate Images With Perceptual Similarity Metrics", addresses image synthesis through artificial neural networks by exploring the utility of perceptually-based loss functions over standard pixel-wise loss functions, like mean squared error (MSE) and mean absolute error (MAE). The exploration primarily revolves around using the multiscale structural similarity score (MS-SSIM) as a substitute for pixel-based metrics in training image generation models. Their central claim is that optimizing perceptual loss functions aligned with human judgments, such as MS-SSIM, can yield improved image reconstructions and encoded representations.
Key Findings
The authors conducted experiments with both deterministic and probabilistic autoencoders. For deterministic autoencoders, the results indicate that models trained with an MS-SSIM loss produce reconstructions that human observers prefer over those optimized with either MSE or MAE. This finding is consistent across datasets, architectures, and image sizes; in particular, human observers preferred the MS-SSIM models' reconstructions at statistically significant rates on the CIFAR-10 and STL-10 datasets.
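In practice, the only difference between the two training regimes is the objective being minimized. The sketch below shows a hypothetical training step for a deterministic autoencoder, reusing the ssim_loss function sketched above; encoder, decoder, batch, and optimizer are placeholder names and not the paper's exact architecture.

```python
# Hypothetical training step contrasting a pixel-wise MSE objective with the
# perceptual SSIM-based objective sketched earlier. `encoder`, `decoder`,
# `batch`, and `optimizer` are placeholders, not the paper's exact models.
import torch

def train_step(encoder, decoder, batch, optimizer, use_perceptual=True):
    optimizer.zero_grad()
    recon = decoder(encoder(batch))              # reconstruction, assumed in [0, 1]
    if use_perceptual:
        loss = ssim_loss(recon, batch)           # perceptual objective (SSIM here; MS-SSIM in the paper)
    else:
        loss = torch.mean((recon - batch) ** 2)  # standard pixel-wise MSE baseline
    loss.backward()
    optimizer.step()
    return loss.item()
```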
Additionally, the research introduces Expected-Loss Variational Autoencoders (EL-VAEs), which extend the VAE framework to accommodate arbitrary differentiable, non-probabilistic losses such as MS-SSIM. EL-VAEs trained with MS-SSIM demonstrated superior reconstruction quality compared to those trained with MSE or MAE, and the quantitative evaluations were complemented by qualitative human judgments that further favored the perceptual approach.
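Under the EL-VAE view, the VAE's reconstruction log-likelihood term is replaced by the expectation of a differentiable loss under the approximate posterior, estimated with the reparameterization trick, while the KL regularizer is retained. The sketch below illustrates such an objective with a single Monte Carlo sample; encoder, decoder, recon_loss, and the beta weighting are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of an EL-VAE-style objective: the expected value of an arbitrary
# differentiable reconstruction loss (e.g., the ssim_loss sketched earlier, or a
# full MS-SSIM) under q(z|x), estimated with one reparameterized sample, plus the
# standard KL term. The beta weighting is an illustrative knob, not from the paper.
import torch

def el_vae_loss(x, encoder, decoder, recon_loss, beta=1.0):
    mu, logvar = encoder(x)                  # parameters of the approximate posterior q(z|x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)     # reparameterization trick
    x_hat = decoder(z)                       # reconstruction, assumed in [0, 1]
    # Single-sample Monte Carlo estimate of the expected perceptual loss.
    expected_loss = recon_loss(x_hat, x)
    # KL divergence between q(z|x) = N(mu, std^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    return expected_loss + beta * kl
```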
Practical Implications
The paper shows that perceptual similarity metrics, particularly MS-SSIM, when used as training objectives in neural networks, can refine both deterministic and probabilistic autoencoders. This refinement yields models whose encoded representations align more closely with human perception, improving downstream tasks such as image classification and image super-resolution. Indeed, the perceptual models captured fine details and textures in super-resolution better than models trained with pixel-wise losses, with notable improvements in SSIM scores on standard benchmark datasets (Set5, Set14, BSD200).
Theoretical Implications and Future Directions
Theoretically, the paper underscores the potential of integrating perceptual criteria into machine learning objectives for image synthesis, suggesting that further benefits could be gained from more sophisticated perceptual metrics beyond MS-SSIM. Future research could investigate other perceptually grounded loss functions and possibly develop composite losses that combine differentiable and non-differentiable perceptual metrics. The authors also suggest applying perceptual losses to fine-grained classification tasks, which could extend to more complex images and datasets where textural or contextual detail is increasingly important.
Conclusion
In summary, the findings presented in this paper offer meaningful insights into the role of perceptual similarity metrics in improving the quality of images synthesized by neural networks. By aligning training objectives with human perception, the researchers demonstrate notable advantages over traditional pixel-wise error metrics, paving the way for more effective neural image generation techniques.