Generating Images with Perceptual Similarity Metrics based on Deep Networks (1602.02644v2)

Published 8 Feb 2016 in cs.LG, cs.CV, and cs.NE

Abstract: Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric better reflects perceptual similarity of images and thus leads to better results. We show three applications: autoencoder training, a modification of a variational autoencoder, and inversion of deep convolutional networks. In all cases, the generated images look sharp and resemble natural images.

Citations (1,106)

Summary

  • The paper introduces DeePSiM, a novel loss function that combines feature, adversarial, and image-space components to enhance perceptual similarity in generated images.
  • Its application in autoencoders, variational autoencoders, and network inversion consistently yields sharper and more natural image outputs.
  • Experimental results demonstrate that using deep network features significantly improves the realism and detail of image reconstructions.

Deep Perceptual Similarity Metrics for Image Generation

The paper "Generating Images with Perceptual Similarity Metrics based on Deep Networks" by Alexey Dosovitskiy and Thomas Brox addresses a prevalent issue in image-generating machine learning models, specifically the challenge of over-smoothed results due to loss functions computed in the image space. The authors propose an alternative approach centered around deep perceptual similarity metrics (DeePSiM) which compute distances between image features extracted by deep neural networks rather than raw pixel values. This novel metric aims to better reflect perceptual similarities, thereby producing more visually authentic images.

Overview of the Paper

The primary contribution of the paper is the introduction of DeePSiM, a set of loss functions that integrate three key terms: feature loss, adversarial loss, and image space loss. This is substantiated through three practical applications: autoencoder training, a modified variational autoencoder (VAE), and the inversion of deep convolutional networks (DCNs). The experimental results demonstrate that this combined loss framework yields sharper and more natural images compared to traditional image-space loss functions.

Methodology

Loss Functions

The DeePSiM loss is a weighted sum of three components, written out in the formula after this list:

  1. Feature Loss: This term measures the squared Euclidean distance between the features of generated and real images as extracted by a comparator neural network (e.g., layers from AlexNet or VideoNet). This is designed to capture perceptually relevant features while remaining invariant to minor deformations.
  2. Adversarial Loss: Borrowing from the GAN framework, this term involves training a discriminator to differentiate between generated and real images, while the generator strives to produce images indistinguishable from real ones, thus enforcing realism.
  3. Image Space Loss: To stabilize training, the authors incorporate a standard image-space loss, which guides the generator toward reproducing low-level details of the target image.
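
Using the paper's notation loosely (generator G, fixed comparator C, discriminator D, and training pairs (x_i, y_i)), the combined objective can be sketched as below; this is a light paraphrase of the paper's equations, and the weights λ are hyperparameters:

```latex
% DeePSiM objective, lightly paraphrased from the paper's formulation.
\mathcal{L} \;=\; \lambda_{\mathrm{feat}}\,\mathcal{L}_{\mathrm{feat}}
          \;+\; \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}
          \;+\; \lambda_{\mathrm{img}}\,\mathcal{L}_{\mathrm{img}},
\qquad
\begin{aligned}
\mathcal{L}_{\mathrm{feat}} &= \textstyle\sum_i \lVert C(G(x_i)) - C(y_i)\rVert_2^2,\\
\mathcal{L}_{\mathrm{img}}  &= \textstyle\sum_i \lVert G(x_i) - y_i\rVert_2^2,\\
\mathcal{L}_{\mathrm{adv}}  &= -\textstyle\sum_i \log D(G(x_i)).
\end{aligned}
```

The discriminator itself is trained with the usual GAN objective, $-\sum_i [\log D(y_i) + \log(1 - D(G(x_i)))]$, so the generator and discriminator are updated in alternation.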

Applications

  1. Autoencoder Training: Applied to an autoencoder, the DeePSiM loss preserves fine structures much better than traditional squared-error (SE) or L1 losses; the reconstructed images show reduced blurriness and richer textures (a minimal training-loop sketch follows this list).
  2. Variational Autoencoder (VAE): When integrating DeePSiM into a VAE framework, the generated images retained realistic textures and sharper details compared to the baseline VAE using just SE loss. This was achieved without requiring supervised comparator training.
  3. Inversion of AlexNet: The method's efficacy was further demonstrated via inversion tasks where deep features from layers of AlexNet were successfully mapped back to the pixel space, preserving perceptually significant information across layers, even from high-level representations like class probabilities.
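
To make the training procedure concrete, here is a minimal PyTorch sketch of one DeePSiM-style update in the autoencoder setting. The tiny networks, optimizer settings, and loss weights here are placeholders chosen for illustration, not the paper's architectures or values; in the paper the generator is a larger encoder-decoder and the comparator is a frozen pretrained network such as AlexNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in generator (the paper uses a much larger encoder-decoder).
generator = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
)
# Stand-in comparator; in the paper this is a frozen pretrained network.
comparator = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
for p in comparator.parameters():  # the comparator is never updated
    p.requires_grad_(False)
# Stand-in discriminator, sized here for 32x32 RGB inputs.
discriminator = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 16 * 16, 1),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
lambda_feat, lambda_adv, lambda_img = 1.0, 1e-2, 1e-3  # hypothetical weights

def train_step(real):
    """One generator update and one discriminator update on a batch."""
    # Generator update: weighted sum of feature, adversarial, image-space losses.
    fake = generator(real)
    loss_feat = F.mse_loss(comparator(fake), comparator(real))
    loss_img = F.mse_loss(fake, real)
    loss_adv = F.binary_cross_entropy_with_logits(
        discriminator(fake), torch.ones(real.size(0), 1))
    g_loss = lambda_feat * loss_feat + lambda_adv * loss_adv + lambda_img * loss_img
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Discriminator update: distinguish real images from (detached) generations.
    d_real = discriminator(real)
    d_fake = discriminator(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    return g_loss.item(), d_loss.item()

batch = torch.rand(4, 3, 32, 32)  # dummy images in [0, 1]
print(train_step(batch))
```

Because the comparator is frozen, gradients flow through its activations to the generator without changing its weights, which is what lets the feature loss act as a fixed perceptual yardstick.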

Experimental Results

The experiments show that DeePSiM consistently outperforms traditional loss functions across the evaluated settings. Notably, autoencoder reconstructions using DeePSiM maintained a high level of detail, and the VAE samples were markedly more realistic. Additionally, inverting deep features from AlexNet layers yielded detailed and largely faithful reconstructions, a testament to the effectiveness of the DeePSiM loss function.

Theoretical and Practical Implications

The implications of this research span both theoretical and practical realms:

  1. Feature Space Utilization: The work underscores the benefits of leveraging deep neural network features for perceptual similarity, moving beyond pixel space distortions to more meaningful, human-aligned metrics of image quality.
  2. Enhanced Generative Models: Practically, DeePSiM facilitates the development of generative models that produce higher quality images, which can be pivotal for applications in image synthesis, compression, and even video prediction—areas where preserving perceptual fidelity is paramount.
  3. Future Research Directions: While the paper primarily focuses on convolutional architectures, future developments might explore the use of transformer-based comparators or investigate the integration of more sophisticated feature representations. Additionally, refining adversarial training stability and exploring alternative priors for natural image generation could further enhance the quality and versatility of generative models.

In conclusion, DeePSiM represents a significant advance in loss functions for image generation tasks, balancing faithful feature reconstruction with adversarial realism. This framework has the potential to influence a wide range of generative applications, steering future research toward more perceptually aligned methodologies.
