RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval
The paper "RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval" introduces a semi-parametric model for image synthesis that creates images from scene descriptions by retrieving relevant image patches from an external bank and fusing them. The proposed framework, RetrieveGAN, addresses a key limitation of existing retrieval-based methods: ensuring that the selected patches are mutually compatible so that the synthesized image is coherent and realistic.
Contribution Summary
The key contributions of the paper can be outlined as follows:
- Differentiable Retrieval Module: RetrieveGAN introduces a differentiable retrieval module that allows the retrieval process of image patches to be integrated into an end-to-end trainable pipeline. This end-to-end property enables the model to learn better feature embeddings, improving the patch selection process for image generation.
- Iterative Retrieval Process: To ensure that retrieved patches are mutually compatible, the paper proposes an iterative retrieval process: patches are selected one at a time, with each selection conditioned on the patches already chosen. The discrete selection step is made differentiable with the Gumbel-softmax trick, so the retrieval module can be trained end-to-end with image generation as the surrogate task.
- Use of Scene Graphs: Scene graphs, which offer a clear and structured object-relation representation, are employed as input data for RetrieveGAN. This facilitates a more accurate understanding and representation of complex scene configurations, contributing to the generation of more coherent and contextually appropriate images.
- Co-occurrence and Selection Losses: The model incorporates ground-truth selection and co-occurrence losses to enhance the compatibility of selected patches. These losses ensure that the relationships among the elements in the generated images are contextually and visually coherent.
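The differentiable selection described above can be illustrated with a minimal sketch. This is not the authors' code; the function names, embedding dimensions, and the dot-product similarity are illustrative assumptions. In the actual model the same computation runs inside an autodiff framework, so gradients flow through the soft selection weights back to the feature embeddings.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Soft, nearly one-hot selection weights over candidates.
    Adding Gumbel(0, 1) noise makes the argmax a sample from
    softmax(logits); dividing by a small temperature tau pushes the
    weights toward one-hot while keeping them differentiable."""
    rng = rng if rng is not None else np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())          # stable softmax
    return y / y.sum()

def retrieve_patch(query, patch_embeddings, tau=0.5, rng=None):
    """Differentiably pick one patch embedding for a query object.
    query: (d,) object embedding; patch_embeddings: (n, d) candidate bank."""
    logits = patch_embeddings @ query           # similarity scores, (n,)
    weights = gumbel_softmax(logits, tau, rng)  # nearly one-hot, (n,)
    return weights @ patch_embeddings           # soft-selected patch, (d,)

# Toy usage: 5 candidate patches with 8-dim embeddings.
rng = np.random.default_rng(0)
bank = rng.normal(size=(5, 8))
query = rng.normal(size=8)
patch = retrieve_patch(query, bank, rng=rng)
print(patch.shape)  # (8,)
```

In the iterative setting, this selection would be repeated once per object, with each query embedding updated to reflect the patches already retrieved, which is how compatibility across selections is encouraged.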
Experimental Results
RetrieveGAN was evaluated on two datasets, COCO-Stuff and Visual Genome, against several state-of-the-art models, including sg2im, AttnGAN, layout2im, and PasteGAN. The evaluation used standard metrics: Fréchet Inception Distance (FID), Inception Score (IS), and Diversity Score (DS), which together assess the realism and diversity of generated images.
- Performance: RetrieveGAN achieved lower (better) FID and higher IS scores than the baselines, indicating that it generates more realistic and diverse images. In particular, its differentiable, iterative patch retrieval was shown to select compatible patches effectively, improving the quality of the resulting images.
- User Studies: A user study further validated the mutual compatibility of the patches selected by RetrieveGAN, with users expressing a preference for the patch sets produced by the model over those generated by methods like PasteGAN.
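For intuition on the FID metric used above: it fits a Gaussian to deep features of the real and generated image sets and measures the Fréchet distance between the two Gaussians. The sketch below is not the paper's evaluation code; in practice the features come from an Inception-v3 pooling layer, whereas here any feature matrix is accepted, and the matrix square root is computed via eigenvalues for a dependency-free illustration.

```python
import numpy as np

def fid(feats_real, feats_fake):
    """Fréchet Inception Distance between two (n, d) feature sets:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2))."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    # Tr((c1 c2)^(1/2)) = sum of sqrt of the eigenvalues of c1 @ c2,
    # which are real and non-negative for PSD covariance matrices.
    eigvals = np.linalg.eigvals(c1 @ c2)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return float(((mu1 - mu2) ** 2).sum()
                 + np.trace(c1) + np.trace(c2) - 2 * covmean_trace)

# Toy usage with synthetic 4-dim "features".
rng = np.random.default_rng(0)
real = rng.normal(size=(400, 4))
fake = rng.normal(loc=0.5, size=(400, 4))
print(fid(real, fake))  # positive; grows as the distributions diverge
```

Lower FID means the generated feature distribution is closer to the real one, which is why RetrieveGAN's lower FID scores indicate more realistic outputs.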
Implications and Future Work
The RetrieveGAN model represents a significant step forward in the domain of conditional image synthesis, particularly through its innovative integration of a differentiable retrieval component. The implications of RetrieveGAN are both practical and theoretical, offering a structured method for creating images based on scene descriptions. This advances capabilities in content creation and image editing, enabling more controlled and context-aware image synthesis.
For future exploration, the paper suggests several directions: enhancing the patch pre-filtering method to handle a larger pool of candidate patches, refining the loss functions to better guide the retrieval process, and extending the approach to conditional video generation and manipulation tasks. Each of these could further broaden the effectiveness and applicability of RetrieveGAN.
Overall, RetrieveGAN sets a foundation for future research by elegantly merging retrieval and generative processes within a cohesive framework, highlighting the evolving interplay between parametric and non-parametric methodologies in image synthesis.