- The paper introduces Compositional GAN, a generative model built on a composition-by-decomposition (CoDe) framework that synthesizes composite images without requiring paired inputs.
- It employs spatial transformer networks, relative appearance flow, and mask prediction to ensure accurate spatial alignment and occlusion handling.
- Example-specific meta-refinement (ESMR) enhances test-time fidelity, yielding superior results on both synthetic and real-world compositing tasks.
Compositional GAN: Learning Image-Conditional Binary Composition
Introduction and Motivation
"Compositional GAN: Learning Image-Conditional Binary Composition" (1807.07560) presents a generative framework for synthesizing composite images conditioned on explicit object pairs. Conventional GAN and cGAN models typically map from a single input domain, lacking mechanisms to explicitly model multi-object 2D or 3D interactions, spatial layout, occlusion ordering, and viewpoint transformations. The proposed framework directly addresses these limitations by generating a realistic composite image from two input objects sampled from distinct marginal distributions and modeling their joint distribution.
This work introduces a self-consistent Composition-by-Decomposition (CoDe) network, leveraging decomposition as both supervision and test-time refinement. The formulation enables composition without requiring paired data and allows example-specific meta-refinement (ESMR) for improved test-time fidelity. The method is evaluated across synthetic and real domains, including shape-object, face-accessory, and urban scene compositing tasks.
Figure 1: Binary composition examples—top row: background image, middle row: foreground object, bottom row: generated composite.
Model Architecture
The architecture centers on several key modules: (i) a composition generator, (ii) a decomposition network for self-consistency, (iii) relative Spatial Transformer Network (STN) and Relative Appearance Flow Network (RAFN) for spatial and viewpoint alignment, (iv) a mask prediction network for object segmentation, and (v) inpainting mechanisms to address unpaired training scenarios. At a high level, the generation pipeline operates by spatially shifting and scaling input objects, synthesizing their relative viewpoints if necessary, and conditionally generating the composite. The system enforces that the composite can be decomposed back to the constituent objects—a form of cycle consistency that constrains realism and content preservation.
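As a rough illustration of how these modules could fit together, the following PyTorch-style sketch wires up placeholder components; the class and module names (`stn`, `rafn`, `composer`, `decomposer`, `mask_net`) are assumptions for exposition, not the authors' released code.

```python
import torch.nn as nn

class CompositionalPipeline(nn.Module):
    """Illustrative wiring of the main modules; each argument is any
    image-to-image network with the indicated inputs and outputs."""
    def __init__(self, stn, rafn, composer, decomposer, mask_net):
        super().__init__()
        self.stn = stn                # relative spatial transformer (affine placement)
        self.rafn = rafn              # relative appearance flow (viewpoint synthesis)
        self.composer = composer      # composition generator: (x, y) -> composite
        self.decomposer = decomposer  # decomposition network: composite -> (x, y)
        self.mask_net = mask_net      # per-pixel masks for occlusion ordering

    def forward(self, x, y):
        y_view = self.rafn(y, x)            # align y's viewpoint with x if needed
        x_t, y_t = self.stn(x, y_view)      # relative affine shift/scale of both objects
        composite = self.composer(x_t, y_t)
        mask = self.mask_net(composite)     # soft assignment of pixels to each object
        x_rec, y_rec = self.decomposer(composite)  # decomposition for self-consistency
        return composite, mask, (x_t, y_t), (x_rec, y_rec)
```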
Figure 2: (a) CoDe architecture: the composition network receives a supervisory signal from the decomposition network. (b) ESMR: test-time fine-tuning on a single input pair.
For training, the mapping $\{x, y\} \rightarrow c$ is modeled, with an adversarial loss and multiple reconstruction losses applied in both the composed and decomposed spaces. For unpaired data, an inpainting network restores occluded segments, effectively converting unpaired samples into pseudo-paired ones. The architecture and data flow are summarized below.
Figure 3: Schematic overview—training pipeline incorporates inpainting, RAFN, STN, and CoDe modules, with masks and viewpoints as auxiliary signals; test-time ESMR refines the predicted composite.
Spatial Alignment and Mask Prediction
The relative STN explicitly learns affine transformations for both objects, conditioned on their joint context. For object domains requiring viewpoint alignment, the RAFN predicts appearance flow fields that warp one object into a viewpoint compatible with the other. Mask prediction provides a per-pixel assignment for occlusion-aware decomposition, informing both the $L_1$ and cross-entropy objectives.
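The snippet below sketches, under assumed PyTorch conventions, how a relative STN could predict one affine transform per object from the concatenated pair, and how a predicted mask yields an occlusion-aware blend; the layer sizes, identity initialization, and the `masked_composite` helper are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSTN(nn.Module):
    """Illustrative relative spatial transformer: predicts one affine transform
    per object, conditioned on both inputs jointly (channel-wise concatenation)."""
    def __init__(self, in_ch=6):
        super().__init__()
        self.localizer = nn.Sequential(
            nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 12),  # 2 objects x 6 affine parameters
        )
        # Start from identity transforms so training begins with no-op placement.
        self.localizer[-1].weight.data.zero_()
        self.localizer[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x, y):
        theta = self.localizer(torch.cat([x, y], dim=1)).view(-1, 2, 2, 3)
        return self._warp(x, theta[:, 0]), self._warp(y, theta[:, 1])

    @staticmethod
    def _warp(img, theta):
        grid = F.affine_grid(theta, img.size(), align_corners=False)
        return F.grid_sample(img, grid, align_corners=False)

def masked_composite(x_t, y_t, mask):
    """Occlusion-aware blend: the mask assigns each pixel to the first object."""
    return mask * x_t + (1.0 - mask) * y_t
```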
Composition-by-Decomposition (CoDe) and Self-Consistency
A salient innovation is the integration of decomposition supervision: the composition generator's output $\hat{c}$ is decomposed back into $(\hat{x}^T, \hat{y}^T)$, and an explicit $L_1$ self-consistency loss is imposed. This approach not only regularizes generation but also enables end-to-end training even without corresponding input-output pairs.
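A minimal sketch of this constraint, with `composer` and `decomposer` as stand-in names for the two generators and `x_t`, `y_t` denoting the spatially aligned inputs:

```python
import torch.nn.functional as F

def code_self_consistency(composer, decomposer, x_t, y_t):
    """Compose the aligned pair, decompose the result, and penalize the
    mismatch with the inputs (the CoDe cycle, expressed as an L1 loss)."""
    c_hat = composer(x_t, y_t)        # generated composite
    x_hat, y_hat = decomposer(c_hat)  # recovered constituent objects
    loss = F.l1_loss(x_hat, x_t) + F.l1_loss(y_hat, y_t)
    return c_hat, loss
```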
To further improve test-time realism and better preserve input object characteristics (color, texture, geometry), example-specific meta-refinement is employed. ESMR uses the decomposition network as a source of gradient signals, fine-tuning the composition network on the fly for each test instance. During ESMR, the weights of the STN, RAFN, and mask network are frozen; only the CoDe layers are updated, using adversarial and decomposition-based self-consistency losses.
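The loop below is a hypothetical sketch of ESMR built on the pipeline sketch above: the alignment and mask modules stay frozen, and only the composition/decomposition weights are updated for a single test pair. The step count, learning rate, and loss weighting are assumptions, and `disc` stands in for a discriminator over composites.

```python
import itertools
import torch
import torch.nn.functional as F

def esmr_refine(pipeline, disc, x, y, steps=100, lr=1e-5):
    """Fine-tune the CoDe generators on one test pair (x, y)."""
    # Freeze the alignment/mask modules; only CoDe layers receive updates.
    for p in itertools.chain(pipeline.stn.parameters(),
                             pipeline.rafn.parameters(),
                             pipeline.mask_net.parameters()):
        p.requires_grad_(False)

    opt = torch.optim.Adam(
        itertools.chain(pipeline.composer.parameters(),
                        pipeline.decomposer.parameters()), lr=lr)

    for _ in range(steps):
        composite, _, (x_t, y_t), (x_rec, y_rec) = pipeline(x, y)
        logits = disc(composite)
        adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
        self_consistency = F.l1_loss(x_rec, x_t) + F.l1_loss(y_rec, y_t)
        loss = adv + 10.0 * self_consistency  # weighting is an assumed choice
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return pipeline(x, y)[0]  # refined composite for this test example
```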
Experimental Evaluation
Synthetic Datasets
The framework is evaluated on composite scenes involving ShapeNet objects (chairs+tables, baskets+bottles), entailing challenges in 3D rotation, scaling, placement, and occlusion. Results indicate the approach produces realistic composites that qualitatively and quantitatively resemble ground truth, outperforming baseline CycleGAN and Pix2Pix translations, which lack explicit compositional mechanisms.
Figure 4: Test results on chair-table and basket-bottle domains; comparison of generator outputs (before/after ESMR), outputs without inpainting, and nearest neighbor samples from the training data.
Ablation studies demonstrate that the inpainting and mask modules are critical for handling occlusions, and that removing ESMR or the self-consistency loss leads to color bleeding, misaligned objects, or incorrect occlusion ordering.
Real-World Datasets
The model generalizes to real-world tasks such as adding sunglasses to faces (CelebA) and placing cars into street scenes (Cityscapes), domains characterized by strong spatial dependencies and variable occlusion.
Figure 5: Test examples from the face-sunglasses composition task; the model's outputs under paired and unpaired training are compared with ST-GAN results.
In all setups, the model is data-efficient, training from fewer or unpaired examples, and achieves higher AMT user preference scores than ST-GAN.
Figure 6: Test street-scene and car compositions; shown are the input objects, outputs before and after ESMR, the summation of masked inputs, ST-GAN results, and nearest neighbor visualizations.
Ablation and Baseline Comparisons
Head-to-head visualizations show clear qualitative gains over the domain-translation baselines (Pix2Pix and CycleGAN), which are not designed for explicit binary composition and fail to capture plausible spatial and occlusion relationships.
Figure 7: Ablation—effect of removing loss terms or whole modules (pixel loss, mask, GAN, decomposition cycle); baseline Pix2Pix and CycleGAN outputs for direct comparison.
Implications and Future Directions
The compositional GAN architecture demonstrates that decomposition consistency is a powerful self-supervisory signal, enabling robust, flexible composition across varied datasets and under both paired and unpaired supervision. This work suggests that future generative systems should incorporate more structured multi-entity modeling. The explicit modeling of spatial relations and occlusion hierarchies will be instrumental for controllable scene synthesis, image editing, and cross-domain translation tasks.
Potential extensions include scaling to compositions of more than two objects, handling non-rigid or articulated entities, and integrating photometric interaction modeling (e.g., inter-object lighting/shadow). Furthermore, the ESMR test-time adaptation paradigm is general and may benefit other conditional generation tasks suffering from generalization bottlenecks or appearance drift.
Conclusion
This paper introduces a structured binary compositional generation framework that goes beyond prior GAN approaches by integrating explicit modules for spatial arrangement, viewpoint alignment, and object-aware self-consistency. The strong empirical results on both synthetic and real datasets highlight the effectiveness of the composition-by-decomposition and meta-refinement strategies in generative image modeling. The approach provides a scalable foundation for future work on compositional and multi-object image synthesis.