
Make-A-Scene: Controllable Image Synthesis

Updated 27 November 2025
  • Make-A-Scene is a controllable text-to-image framework that employs semantic scene representations and human-prior tokenization for precise image synthesis.
  • It utilizes a multi-channel segmentation map, a VQ-SEG tokenizer, and a 4B-parameter autoregressive transformer to effectively fuse text, scene, and image modalities.
  • The model achieves state-of-the-art performance with improved FID scores and human evaluation metrics, supporting applications like story illustration and scene editing.

The Make-A-Scene model is a scene-controllable text-to-image generation framework that integrates semantic scene representations, human-prior-driven tokenization, and classifier-free guidance within a large autoregressive transformer. This approach enables precise and user-controllable image synthesis, supporting advances in image fidelity, compositional flexibility, and robustness to diverse prompts and editing tasks. The model achieves state-of-the-art Fréchet Inception Distance (FID) and strong human evaluation metrics on benchmark datasets, while introducing novel capabilities such as direct scene control and story illustration generation (Gafni et al., 2022).

1. Scene Representation and Encoding

Make-A-Scene models scenes as multi-channel semantic segmentation maps that capture layered human priors about the visual world. Scene representations comprise:

  • Panoptic semantic classes ($m_p = 133$)
  • Human parsing classes ($m_h = 20$)
  • Face parsing classes ($m_f = 5$)
  • An additional “edge” channel marking class/instance boundaries

The resulting input is a scene tensor $i_y \in \mathbb{R}^{h_y \times w_y \times m}$ with $m = m_p + m_h + m_f + 1$.
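As a concrete illustration, the layered scene tensor can be assembled from per-pixel label maps roughly as follows (a minimal NumPy sketch; the helper name is ours, and shapes follow the channel counts above):

```python
import numpy as np

def build_scene_tensor(panoptic, human, face, edges, m_p=133, m_h=20, m_f=5):
    """Stack one-hot panoptic/human/face label maps with an edge channel.

    panoptic, human, face: (H, W) integer label maps
    edges: (H, W) binary map marking class/instance boundaries
    Returns an (H, W, m_p + m_h + m_f + 1) scene tensor i_y.
    """
    one_hot = lambda labels, m: np.eye(m, dtype=np.float32)[labels]  # (H, W, m)
    i_y = np.concatenate([
        one_hot(panoptic, m_p),               # panoptic semantic classes
        one_hot(human, m_h),                  # human parsing classes
        one_hot(face, m_f),                   # face parsing classes
        edges[..., None].astype(np.float32),  # boundary "edge" channel
    ], axis=-1)
    return i_y

# tiny example: m = 133 + 20 + 5 + 1 = 159 channels
rng = np.random.default_rng(0)
h, w = 4, 4
i_y = build_scene_tensor(
    rng.integers(0, 133, (h, w)),
    rng.integers(0, 20, (h, w)),
    rng.integers(0, 5, (h, w)),
    rng.integers(0, 2, (h, w)),
)
print(i_y.shape)  # (4, 4, 159)
```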

Scene tokenization is performed by a vector-quantized VAE (VQ-SEG): the encoder $E_s$ produces latent codes, which are quantized against a codebook of $K_s$ vectors. For grid location $p$, the quantized code is

$$t_y[p] = \arg\min_j \| E_s(i_y)_p - e_j \|_2, \quad t_y \in \{1, \ldots, K_s\}^{h' \times w'}$$

Each $t_y[p]$ is then mapped to a $d$-dimensional embedding before being processed by the transformer.
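The nearest-neighbour quantization step can be sketched in NumPy as follows (toy grid and embedding sizes; `quantize` is our name, not an API from the paper):

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour VQ: map each grid location p of the encoder
    output E_s(i_y) to the index of its closest codebook vector e_j."""
    # squared L2 distance from every location to every codebook entry
    dists = ((latents[..., None, :] - codebook) ** 2).sum(axis=-1)  # (h', w', K_s)
    t_y = dists.argmin(axis=-1)       # t_y[p] = argmin_j ||E_s(i_y)_p - e_j||
    return t_y, codebook[t_y]         # token grid and its d-dim embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))  # K_s = 1024 entries (VQ-SEG codebook size); toy d = 64
latents = rng.normal(size=(8, 8, 64))   # toy 8x8 latent grid standing in for E_s(i_y)
t_y, emb = quantize(latents, codebook)
print(t_y.shape, emb.shape)  # (8, 8) (8, 8, 64)
```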

2. Model Architecture and Multimodal Fusion

The generator is a 4B-parameter autoregressive GPT-style transformer with 48 layers, each with 48 self-attention heads and a hidden dimension $d = 2560$. Three modalities (text, scene, and image) are tokenized independently and concatenated to form the transformer’s input sequence:

  • $n_x = 256$ tokens for text (tokenized via BPE)
  • $n_y = 256$ tokens for scene (VQ-SEG)
  • $n_z = 1024$ tokens for image (VQ-IMG)

The training pipeline is as follows:

  1. Text inputs: $t_x = \mathrm{BPE}(i_x)$
  2. Scene: $t_y = \text{VQ-SEG}(i_y)$
  3. Image: $t_z = \text{VQ-IMG}(i_z)$
  4. Concatenate to $t = [t_x; t_y; t_z]$
  5. Feed each token embedding $E_\text{tok}(t[p]) + E_\text{pos}(p)$ to the transformer
  6. Autoregressively predict $t[p+1]$, optimizing a cross-entropy loss.

Modal fusion is performed implicitly: the transformer self-attends over the complete concatenated sequence, learning to integrate information across modalities.
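The pipeline above can be sketched at the token level (toy vocabulary sizes; in the real model each modality has its own codebook, and token indices are offset into a shared embedding table):

```python
import numpy as np

n_x, n_y, n_z = 256, 256, 1024        # text / scene / image token counts

def make_training_example(t_x, t_y, t_z):
    """Concatenate the three token streams and build next-token targets.

    The transformer consumes t[:-1] (with token + position embeddings)
    and is trained with cross-entropy to predict t[1:], i.e. t[p+1]
    from everything up to position p."""
    t = np.concatenate([t_x, t_y, t_z])   # t = [t_x; t_y; t_z], length 1536
    inputs, targets = t[:-1], t[1:]       # autoregressive shift
    return inputs, targets

rng = np.random.default_rng(0)
t_x = rng.integers(0, 50000, n_x)   # BPE text tokens (toy vocab)
t_y = rng.integers(0, 1024, n_y)    # VQ-SEG scene tokens
t_z = rng.integers(0, 8192, n_z)    # VQ-IMG image tokens
inputs, targets = make_training_example(t_x, t_y, t_z)
print(inputs.shape, targets.shape)  # (1535,) (1535,)
```

Because all three streams sit in one sequence, the self-attention layers see text, scene, and image tokens jointly, which is what makes the fusion implicit.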

3. Human-Prior Tokenization and Specialized Objectives

Tokenization for both scene and image leverages human priors to preserve spatial and semantic detail:

  • VQ-IMG (image tokenizer) extends VQGAN with dedicated perceptual losses, including a face loss

    $$\mathcal{L}_\text{Face} = \sum_{k=1}^{k_f} \sum_l \alpha_f^l \| \mathrm{FE}^l(\hat{S}(c_f^k)) - \mathrm{FE}^l(c_f^k) \|_2$$

    where $\mathrm{FE}^l$ denotes layer-$l$ activations of VGGFace2, $c_f^k$ are up to $k_f$ face crops, and $\hat{S}$ denotes the reconstructed crop.
  • Object-aware VQ: an analogous perceptual loss computed over object crops using VGG features.
  • The complete VQ-IMG loss combines reconstruction, codebook, adversarial, global perceptual, face, and object terms.

  • VQ-SEG (scene tokenizer) adds a weighted binary cross-entropy term emphasizing face-part labels:

    $$\mathcal{L}_\text{WBCE} = \alpha_\text{cat} \cdot \mathrm{BCE}(s, \hat{s})$$

    with $\alpha_\text{cat} = 20$ for facial parts and $1$ otherwise. The complete VQ-SEG objective is

    $$\mathcal{L}_\text{SEG} = \mathcal{L}_\text{recon} + \mathcal{L}_\text{commit} + \mathcal{L}_\text{codebook} + \mathcal{L}_\text{WBCE}$$

This approach ensures that minor semantic and facial details are preserved throughout tokenization and decoding.
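A minimal NumPy sketch of the two specialized loss terms above (the feature extractor is a stand-in for VGGFace2, and all helper names are ours):

```python
import numpy as np

def weighted_bce(s, s_hat, face_mask, alpha_face=20.0, eps=1e-7):
    """Weighted binary cross-entropy over scene channels: pixels in
    face-part channels get weight alpha_cat = 20, all others weight 1."""
    s_hat = np.clip(s_hat, eps, 1 - eps)
    bce = -(s * np.log(s_hat) + (1 - s) * np.log(1 - s_hat))
    weights = np.where(face_mask, alpha_face, 1.0)
    return (weights * bce).mean()

def face_loss(crops, recon_crops, feature_extractor, alphas):
    """Perceptual face loss: weighted L2 distances between per-layer
    activations on original vs. reconstructed face crops (up to k_f crops)."""
    total = 0.0
    for c, c_hat in zip(crops, recon_crops):
        feats, feats_hat = feature_extractor(c), feature_extractor(c_hat)
        for alpha_l, f, f_hat in zip(alphas, feats, feats_hat):
            total += alpha_l * np.linalg.norm(f - f_hat)
    return total

# toy check: face-masked pixels contribute 20x the unweighted loss
s, s_hat = np.ones((2, 2)), np.full((2, 2), 0.9)
ratio = (weighted_bce(s, s_hat, np.ones((2, 2), bool))
         / weighted_bce(s, s_hat, np.zeros((2, 2), bool)))
print(round(ratio, 6))  # ratio of face-weighted to unweighted loss, ~20
```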

4. Classifier-Free Guidance in Autoregressive Generation

The Make-A-Scene transformer is fine-tuned for classifier-free guidance via scheduled unconditional sampling: with probability $p_{CF} = 0.2$ during the final 30k training iterations, the text stream is replaced with padding tokens.

At inference, each autoregressive decoding step computes two logit streams:

  • $\mathrm{logits}_\text{cond} = T(\text{scene, image so far} \mid \text{text})$
  • $\mathrm{logits}_\text{uncond} = T(\text{scene, image so far} \mid \varnothing)$

These are fused as

$$\mathrm{logits}_\text{cf} = \mathrm{logits}_\text{uncond} + \alpha_c \cdot (\mathrm{logits}_\text{cond} - \mathrm{logits}_\text{uncond})$$

with guidance scale $\alpha_c = 5$ (or 3). Sampling proceeds via $\mathrm{softmax}(\mathrm{logits}_\text{cf})$ with a truncated multinomial over the top 50% of logits.

This method brings the controllability of classifier-free diffusion models to the autoregressive transformer paradigm, improving sampling sharpness and alignment with user prompts.
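The fusion rule itself is a one-liner; a small sketch with illustrative values:

```python
import numpy as np

def cf_logits(logits_cond, logits_uncond, alpha_c=5.0):
    """Classifier-free guidance fusion: extrapolate from the unconditional
    stream toward the conditional one by the guidance scale alpha_c."""
    return logits_uncond + alpha_c * (logits_cond - logits_uncond)

cond = np.array([2.0, 0.0, -1.0])
uncond = np.array([1.0, 0.5, -0.5])
print(cf_logits(cond, uncond, alpha_c=1.0))  # [ 2.  0. -1.] — alpha_c = 1 recovers the conditional logits
print(cf_logits(cond, uncond, alpha_c=5.0))  # [ 6. -2. -3.] — larger alpha_c sharpens toward the prompt
```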

5. Empirical Performance and Ablative Analysis

Quantitative evaluation demonstrates improved generation fidelity and human preference:

  • 256×256 model: FID $= 7.55$ (unfiltered), $11.84$ (filtered)
  • With scene input and classifier-free guidance ($\alpha_c$) at inference: FID $= 4.69$
  • Ground-truth FID (COCO validation): $\sim 2.47$

Human evaluation over 500 prompts (5 AMT workers per prompt) indicates $\approx 92\%$ preference for Make-A-Scene over CogView at 256×256 across image quality, photorealism, and text-alignment dimensions.

Ablative studies reveal the contribution of each component:

| Method Variant | FID | Human Preference (%) |
| --- | --- | --- |
| Base (no scene / no CF) | 18.01 | -- |
| + Scene tokens only | 19.16 | >50 |
| + Face-aware VQ | 14.45 | -- |
| + Classifier-free guidance (CF) | 7.55 | 76.8 (quality) |
| + Object-aware VQ @512 | 8.70 | -- |
| + Scene at inference | 4.69 | -- |

Values as reported in the source.

These results establish new benchmarks for text-to-image models at the time of publication (Gafni et al., 2022).

6. Capabilities and Applications

The Make-A-Scene architecture enables multiple generation and editing modalities:

  • Text-only synthesis: Generates high-fidelity 512×512 images purely from textual descriptors.
  • Out-of-distribution prompt handling: User-sketched scenes (e.g., “mouse hunting lion”) can direct the transformer to synthesize compositions with rare or previously unseen class combinations.
  • Scene editing: Segmentation maps can be extracted from existing images, edited (e.g., relabeling sky to sea), and re-rendered into consistent new images.
  • Text editing with anchor scenes: Scene layout can be fixed; varying the text prompt yields a controlled set of semantically consistent image variants.
  • Story illustration: Sequentially generates consistent illustrations for narrative text using user-provided scene sketches.

This set of capabilities highlights the model’s explicit and implicit control mechanisms and generalization to creative and task-driven workflows (Gafni et al., 2022).

7. Implementation Specifics

Key implementation parameters include:

  • VQ-SEG: 600k training iterations, batch size 48, codebook size 1024, face-part class weight $\alpha_\text{cat} = 20$.
  • VQ-IMG: 256×256 (800k iterations, batch 192) and 512×512 (940k iterations, batch 128), codebook size 8192, variable channel depth.
  • Transformer: 48 layers, 48 heads, $d_\text{model} = 2560$, $n_x = 256$, $n_y = 256$, $n_z = 1024$, 170k iterations, batch 1024, Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.96$, weight decay $4.5 \times 10^{-4}$, image:text loss ratio $7{:}1$.
  • Classifier-free fine-tuning: last 30k iterations, $p_{CF} = 0.2$; inference guidance scale $\alpha_c = 5$ (or 3).
  • Sampling: Top 50% logit truncation per step, multinomial sampling.
  • Output resolutions: Available models for 256×256 and 512×512 outputs.
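The truncated sampling scheme listed above can be sketched as follows (our helper name; toy logits):

```python
import numpy as np

def sample_token(logits, keep_frac=0.5, rng=None):
    """Sample the next token from a softmax restricted to the top
    keep_frac fraction of logits (top-50% truncation per decoding step)."""
    if rng is None:
        rng = np.random.default_rng()
    k = max(1, int(len(logits) * keep_frac))
    top = np.argsort(logits)[-k:]             # indices of the kept logits
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()                              # renormalized softmax over kept logits
    return int(top[rng.choice(k, p=p)])       # multinomial draw, mapped back to vocab index

logits = np.random.default_rng(0).normal(size=8192)  # toy vocabulary (VQ-IMG codebook size)
tok = sample_token(logits)
print(0 <= tok < 8192)  # True
```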

This architectural configuration supports Make-A-Scene’s state-of-the-art quantitative and qualitative performance, and its novel suite of editing and illustration features (Gafni et al., 2022).

References

  • Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022). Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv:2203.13131.
