- The paper introduces two novel architectures, X-Fork and X-Seq, that preserve the semantics of the source scene when synthesizing images across drastically different viewpoints.
- It leverages conditional GANs to generate both images and semantic segmentation maps, achieving significant improvements in metrics like SSIM and PSNR.
- The methods demonstrate practical potential in urban planning, autonomous navigation, and remote sensing by accurately bridging aerial and street views.
Cross-View Image Synthesis using Conditional GANs
The paper, "Cross-View Image Synthesis using Conditional GANs" by Krishna Regmi and Ali Borji, presents an insightful approach to generating images across drastically different viewpoints, using conditional Generative Adversarial Networks (cGANs). The authors propose innovative architectures tailored for synthesizing images between aerial and street views, namely Crossview Fork (X-Fork) and Crossview Sequential (X-Seq). These approaches tackle the challenging task of maintaining the semantics of source objects when transforming images across different perspectives.
The authors begin by highlighting the complexity inherent in view synthesis, particularly when the viewpoints, such as aerial and street views, differ drastically. Single-object scenes pose fewer challenges because of their uniform backgrounds, whereas complex scenes require inferring and transforming details that are occluded or appear very different across views. The authors identify traditional image-to-image translation methods, which focus primarily on visual appearance, as insufficient for preserving semantics across views, motivating their new architectures.
The paper introduces two specific architectures. X-Fork uses a single generator that forks into two outputs, jointly producing the target-view image and its semantic segmentation map. X-Seq connects two cGANs sequentially: the first synthesizes the target-view image and the second produces a segmentation map from that synthesized image, so the segmentation loss acts as additional feedback that sharpens the generated image. Through extensive evaluations, the authors demonstrate that both approaches yield sharper, more detailed images than existing methods, confirming the utility of semantic information in improving visual synthesis.
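To make the two layouts concrete, here is a minimal PyTorch-style sketch. It is not the authors' code: the module names, layer counts, and channel widths are illustrative assumptions, and the real networks are much deeper encoder-decoder cGANs with their own discriminators and losses.

```python
# Minimal sketch of the two generator layouts described above.
# Layer sizes and the shallow encoder/decoder are illustrative assumptions.
import torch
import torch.nn as nn

class ForkGenerator(nn.Module):
    """X-Fork-style generator: a shared encoder/decoder trunk that forks
    into an image head and a segmentation head."""
    def __init__(self, in_ch=3, img_ch=3, seg_ch=3, feat=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat, feat * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat * 2, feat, 4, stride=2, padding=1), nn.ReLU(),
        )
        # The fork: two separate output heads on top of the shared features.
        self.to_image = nn.Sequential(
            nn.ConvTranspose2d(feat, img_ch, 4, stride=2, padding=1), nn.Tanh())
        self.to_segmap = nn.Sequential(
            nn.ConvTranspose2d(feat, seg_ch, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        shared = self.decoder(self.encoder(x))
        return self.to_image(shared), self.to_segmap(shared)

class SeqPipeline(nn.Module):
    """X-Seq-style layout: the first generator synthesizes the target-view
    image, the second consumes that image and produces the segmentation map,
    so segmentation supervision also flows back to the first generator."""
    def __init__(self, g1: nn.Module, g2: nn.Module):
        super().__init__()
        self.g1, self.g2 = g1, g2

    def forward(self, src_view):
        fake_img = self.g1(src_view)   # e.g. aerial -> street-view image
        fake_seg = self.g2(fake_img)   # street-view image -> segmentation map
        return fake_img, fake_seg

# Usage (dummy input):
# aerial = torch.randn(1, 3, 256, 256)
# img, seg = ForkGenerator()(aerial)
```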
Quantitative evaluation includes Inception Scores computed with learned image labels, classification accuracy on real versus synthesized images, and pixel-level metrics such as SSIM and PSNR. X-Fork and X-Seq outperform the baselines by clear margins in image quality, diversity, and alignment with the real data distribution, with X-Seq delivering particularly strong results when synthesizing street views from aerial imagery.
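As a rough illustration of how such pixel-level metrics are computed, the sketch below assumes uint8 RGB arrays of identical shape and a recent scikit-image (>= 0.19, for the `channel_axis` argument); it is not the paper's evaluation code.

```python
# Sketch of the PSNR and SSIM comparisons between real and synthesized images.
import numpy as np
from skimage.metrics import structural_similarity

def psnr(real: np.ndarray, fake: np.ndarray, data_range: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB: higher means closer to the real image."""
    mse = np.mean((real.astype(np.float64) - fake.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def ssim(real: np.ndarray, fake: np.ndarray) -> float:
    """Structural Similarity in [-1, 1]: higher means more structurally similar."""
    return structural_similarity(real, fake, channel_axis=-1, data_range=255)

# Usage with dummy images:
# real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
# fake = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
# print(psnr(real, fake), ssim(real, fake))
```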
The implications of this research extend to practical applications in urban planning, autonomous navigation, and remote sensing, where understanding and visualizing scenes from different viewpoints is paramount. Theoretically, it aligns with ongoing advancements in cross-domain learning and transformations, highlighting cGANs’ potential in transcending visual domain limitations.
Future directions may explore higher-resolution synthesis, incorporation of temporal elements for transient objects, or expansion into other view-dependent tasks. Equally, further refinement of semantic learning, perhaps leveraging larger, annotated datasets, could bolster synthesis accuracy, especially in clutter-rich environments. As the understanding of complex, semantic-driven transformations in AI advances, the methodologies presented in this paper offer a significant contribution to both the practical utility and foundational theory of cross-domain image synthesis.