
Make-A-Scene: Controllable Image Synthesis

Updated 27 November 2025
  • Make-A-Scene is a controllable text-to-image framework that employs semantic scene representations and human-prior tokenization for precise image synthesis.
  • It utilizes a multi-channel segmentation map, a VQ-SEG tokenizer, and a 4B-parameter autoregressive transformer to effectively fuse text, scene, and image modalities.
  • The model achieves state-of-the-art performance with improved FID scores and human evaluation metrics, supporting applications like story illustration and scene editing.

The Make-A-Scene model is a scene-controllable text-to-image generation framework that integrates semantic scene representations, human-prior-driven tokenization, and classifier-free guidance within a large autoregressive transformer. This approach enables precise and user-controllable image synthesis, supporting advances in image fidelity, compositional flexibility, and robustness to diverse prompts and editing tasks. The model achieves state-of-the-art Fréchet Inception Distance (FID) and strong human evaluation metrics on benchmark datasets, while introducing novel capabilities such as direct scene control and story illustration generation (Gafni et al., 2022).

1. Scene Representation and Encoding

Make-A-Scene models scenes as multi-channel semantic segmentation maps that capture layered human priors about the visual world. Scene representations comprise:

  • Panoptic semantic classes ($m_p = 133$)
  • Human parsing classes ($m_h = 20$)
  • Face parsing classes ($m_f = 5$)
  • An additional “edge” channel marking class/instance boundaries

The resulting input is a scene tensor $i_y \in \mathbb{R}^{h_y \times w_y \times m}$ with $m = m_p + m_h + m_f + 1$.
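As a concrete illustration, the layered scene tensor can be assembled from per-pixel label maps roughly as follows (a minimal NumPy sketch; the helper name is ours, and shapes follow the channel counts above):

```python
import numpy as np

def build_scene_tensor(panoptic, human, face, edges, m_p=133, m_h=20, m_f=5):
    """Stack one-hot panoptic/human/face label maps with an edge channel.

    panoptic, human, face: (H, W) integer label maps
    edges: (H, W) binary map marking class/instance boundaries
    Returns an (H, W, m_p + m_h + m_f + 1) scene tensor i_y.
    """
    one_hot = lambda labels, m: np.eye(m, dtype=np.float32)[labels]  # (H, W, m)
    i_y = np.concatenate([
        one_hot(panoptic, m_p),               # panoptic semantic classes
        one_hot(human, m_h),                  # human parsing classes
        one_hot(face, m_f),                   # face parsing classes
        edges[..., None].astype(np.float32),  # boundary "edge" channel
    ], axis=-1)
    return i_y

# tiny example: m = 133 + 20 + 5 + 1 = 159 channels
rng = np.random.default_rng(0)
h, w = 4, 4
i_y = build_scene_tensor(
    rng.integers(0, 133, (h, w)),
    rng.integers(0, 20, (h, w)),
    rng.integers(0, 5, (h, w)),
    rng.integers(0, 2, (h, w)),
)
print(i_y.shape)  # (4, 4, 159)
```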

Scene tokenization is performed by a vector-quantized VAE (VQ-SEG): the encoder $E_s$ produces latent codes, which are quantized against a codebook of $K_s$ vectors. For grid location $p$, the quantized code is

$$t_y[p] = \arg\min_j \| E_s(i_y)_p - e_j \|_2, \quad t_y \in \{1, \ldots, K_s\}^{h' \times w'}$$

Each $t_y[p]$ is then mapped to a $d$-dimensional embedding before being processed by the transformer.
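The nearest-neighbour quantization step can be sketched in NumPy as follows (toy grid and embedding sizes; `quantize` is our name, not an API from the paper):

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour VQ: map each grid location p of the encoder
    output E_s(i_y) to the index of its closest codebook vector e_j."""
    # squared L2 distance from every location to every codebook entry
    dists = ((latents[..., None, :] - codebook) ** 2).sum(axis=-1)  # (h', w', K_s)
    t_y = dists.argmin(axis=-1)       # t_y[p] = argmin_j ||E_s(i_y)_p - e_j||
    return t_y, codebook[t_y]         # token grid and its d-dim embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))  # K_s = 1024 entries (VQ-SEG codebook size); toy d = 64
latents = rng.normal(size=(8, 8, 64))   # toy 8x8 latent grid standing in for E_s(i_y)
t_y, emb = quantize(latents, codebook)
print(t_y.shape, emb.shape)  # (8, 8) (8, 8, 64)
```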

2. Model Architecture and Multimodal Fusion

The generator is a 4B-parameter autoregressive GPT-style transformer with 48 layers, each with 48 self-attention heads and a hidden dimension $d = 2560$. Three modalities (text, scene, and image) are tokenized independently and concatenated to form the transformer’s input sequence:

  • $n_x = 256$ tokens for text (tokenized via BPE)
  • $n_y = 256$ tokens for scene (VQ-SEG)
  • $n_z = 1024$ tokens for image (VQ-IMG)

The training pipeline is as follows:

  1. Text inputs: $t_x = \mathrm{BPE}(i_x)$
  2. Scene: $t_y = \text{VQ-SEG}(i_y)$
  3. Image: $t_z = \text{VQ-IMG}(i_z)$
  4. Concatenate to $t = [t_x; t_y; t_z]$
  5. Feed each token embedding $E_\text{tok}(t[p]) + E_\text{pos}(p)$ to the transformer
  6. Autoregressively predict $t[p+1]$, optimizing a cross-entropy loss.

Modal fusion is performed implicitly: the transformer self-attends over the complete concatenated sequence, learning to integrate information across modalities.
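The pipeline above can be sketched at the token level (toy vocabulary sizes; in the real model each modality has its own codebook, and token indices are offset into a shared embedding table):

```python
import numpy as np

n_x, n_y, n_z = 256, 256, 1024        # text / scene / image token counts

def make_training_example(t_x, t_y, t_z):
    """Concatenate the three token streams and build next-token targets.

    The transformer consumes t[:-1] (with token + position embeddings)
    and is trained with cross-entropy to predict t[1:], i.e. t[p+1]
    from everything up to position p."""
    t = np.concatenate([t_x, t_y, t_z])   # t = [t_x; t_y; t_z], length 1536
    inputs, targets = t[:-1], t[1:]       # autoregressive shift
    return inputs, targets

rng = np.random.default_rng(0)
t_x = rng.integers(0, 50000, n_x)   # BPE text tokens (toy vocab)
t_y = rng.integers(0, 1024, n_y)    # VQ-SEG scene tokens
t_z = rng.integers(0, 8192, n_z)    # VQ-IMG image tokens
inputs, targets = make_training_example(t_x, t_y, t_z)
print(inputs.shape, targets.shape)  # (1535,) (1535,)
```

Because all three streams sit in one sequence, the self-attention layers see text, scene, and image tokens jointly, which is what makes the fusion implicit.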

3. Human-Prior Tokenization and Specialized Objectives

Tokenization for both scene and image leverages human priors to preserve spatial and semantic detail:

  • VQ-IMG (image tokenizer) extends VQGAN with dedicated perceptual losses, including a face loss

    $$\mathcal{L}_\text{Face} = \sum_{k=1}^{k_f} \sum_l \alpha_f^l \| \mathrm{FE}^l(\hat{S}(c_f^k)) - \mathrm{FE}^l(c_f^k) \|_2$$

    where $\mathrm{FE}^l$ denotes layer-$l$ activations of VGGFace2, $c_f^k$ are up to $k_f$ face crops, and $\hat{S}$ denotes the reconstructed crop.
  • Object-aware VQ: an analogous perceptual loss computed over object crops using VGG features.
  • The complete VQ-IMG loss combines reconstruction, codebook, adversarial, global perceptual, face, and object terms.

  • VQ-SEG (scene tokenizer) adds a weighted binary cross-entropy term emphasizing face-part labels:

    $$\mathcal{L}_\text{WBCE} = \alpha_\text{cat} \cdot \mathrm{BCE}(s, \hat{s})$$

    with $\alpha_\text{cat} = 20$ for facial parts and $1$ otherwise. The complete VQ-SEG objective is

    $$\mathcal{L}_\text{SEG} = \mathcal{L}_\text{recon} + \mathcal{L}_\text{commit} + \mathcal{L}_\text{codebook} + \mathcal{L}_\text{WBCE}$$

This approach ensures that minor semantic and facial details are preserved throughout tokenization and decoding.
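A minimal NumPy sketch of the two specialized loss terms above (the feature extractor is a stand-in for VGGFace2, and all helper names are ours):

```python
import numpy as np

def weighted_bce(s, s_hat, face_mask, alpha_face=20.0, eps=1e-7):
    """Weighted binary cross-entropy over scene channels: pixels in
    face-part channels get weight alpha_cat = 20, all others weight 1."""
    s_hat = np.clip(s_hat, eps, 1 - eps)
    bce = -(s * np.log(s_hat) + (1 - s) * np.log(1 - s_hat))
    weights = np.where(face_mask, alpha_face, 1.0)
    return (weights * bce).mean()

def face_loss(crops, recon_crops, feature_extractor, alphas):
    """Perceptual face loss: weighted L2 distances between per-layer
    activations on original vs. reconstructed face crops (up to k_f crops)."""
    total = 0.0
    for c, c_hat in zip(crops, recon_crops):
        feats, feats_hat = feature_extractor(c), feature_extractor(c_hat)
        for alpha_l, f, f_hat in zip(alphas, feats, feats_hat):
            total += alpha_l * np.linalg.norm(f - f_hat)
    return total

# toy check: face-masked pixels contribute 20x the unweighted loss
s, s_hat = np.ones((2, 2)), np.full((2, 2), 0.9)
ratio = (weighted_bce(s, s_hat, np.ones((2, 2), bool))
         / weighted_bce(s, s_hat, np.zeros((2, 2), bool)))
print(round(ratio, 6))  # ratio of face-weighted to unweighted loss, ~20
```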

4. Classifier-Free Guidance in Autoregressive Generation

The Make-A-Scene transformer is fine-tuned for classifier-free guidance via scheduled unconditional sampling: with probability $p_{CF} = 0.2$ during the final 30k training iterations, the text stream is replaced with padding tokens.

At inference, each autoregressive decoding step computes two logit streams:

  • $\mathrm{logits}_\text{cond} = T(\text{scene, image so far} \mid \text{text})$
  • $\mathrm{logits}_\text{uncond} = T(\text{scene, image so far} \mid \varnothing)$

These are fused as

$$\mathrm{logits}_\text{cf} = \mathrm{logits}_\text{uncond} + \alpha_c \cdot (\mathrm{logits}_\text{cond} - \mathrm{logits}_\text{uncond})$$

with guidance scale $\alpha_c = 5$ (or 3). Sampling proceeds via $\mathrm{softmax}(\mathrm{logits}_\text{cf})$ with a truncated multinomial over the top 50% of logits.

This method brings the controllability of classifier-free diffusion models to the autoregressive transformer paradigm, improving sampling sharpness and alignment with user prompts.
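The fusion rule itself is a one-liner; a small sketch with illustrative values:

```python
import numpy as np

def cf_logits(logits_cond, logits_uncond, alpha_c=5.0):
    """Classifier-free guidance fusion: extrapolate from the unconditional
    stream toward the conditional one by the guidance scale alpha_c."""
    return logits_uncond + alpha_c * (logits_cond - logits_uncond)

cond = np.array([2.0, 0.0, -1.0])
uncond = np.array([1.0, 0.5, -0.5])
print(cf_logits(cond, uncond, alpha_c=1.0))  # [ 2.  0. -1.] — alpha_c = 1 recovers the conditional logits
print(cf_logits(cond, uncond, alpha_c=5.0))  # [ 6. -2. -3.] — larger alpha_c sharpens toward the prompt
```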

5. Empirical Performance and Ablative Analysis

Quantitative evaluation demonstrates improved generation fidelity and human preference:

  • 256×256 model: FID $= 7.55$ (unfiltered), $11.84$ (filtered)
  • With scene input and classifier-free guidance ($\alpha_c$) at inference: FID $= 4.69$
  • Ground-truth FID (COCO validation): $\sim 2.47$

Human evaluation over 500 prompts (5 AMT workers per prompt) indicates $\approx 92\%$ preference for Make-A-Scene over CogView at 256×256 across image quality, photorealism, and text-alignment dimensions.

Ablative studies reveal the contribution of each component:

| Method Variant | FID | Human Preference (%) |
| --- | --- | --- |
| Base (no scene / no CF) | 18.01 | -- |
| + Scene tokens only | 19.16 | >50 |
| + Face-aware VQ | 14.45 | -- |
| + Classifier-free guidance (CF) | 7.55 | 76.8 (quality) |
| + Object-aware VQ @512 | 8.70 | -- |
| + Scene at inference | 4.69 | -- |

Values as reported in the source.

These results establish new benchmarks for text-to-image models at the time of publication (Gafni et al., 2022).

6. Capabilities and Applications

The Make-A-Scene architecture enables multiple generation and editing modalities:

  • Text-only synthesis: Generates high-fidelity 512×512 images purely from textual descriptors.
  • Out-of-distribution prompt handling: User-sketched scenes (e.g., “mouse hunting lion”) can direct the transformer to synthesize compositions with rare or previously unseen class combinations.
  • Scene editing: Segmentation maps can be extracted from existing images, edited (e.g., relabeling sky to sea), and re-rendered into consistent new images.
  • Text editing with anchor scenes: Scene layout can be fixed; varying the text prompt yields a controlled set of semantically consistent image variants.
  • Story illustration: Sequentially generates consistent illustrations for narrative text using user-provided scene sketches.

This set of capabilities highlights the model’s explicit and implicit control mechanisms and generalization to creative and task-driven workflows (Gafni et al., 2022).

7. Implementation Specifics

Key implementation parameters include:

  • VQ-SEG: 600k training iterations, batch size 48, codebook size 1024, face-part class weight $\alpha_\text{cat} = 20$.
  • VQ-IMG: 256×256 (800k iterations, batch 192) and 512×512 (940k iterations, batch 128), codebook size 8192, variable channel depth.
  • Transformer: 48 layers, 48 heads, $d_\text{model} = 2560$, $n_x = 256$, $n_y = 256$, $n_z = 1024$, 170k iterations, batch 1024, Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.96$, weight decay $4.5 \times 10^{-4}$, image:text loss ratio $7{:}1$.
  • Classifier-free fine-tuning: last 30k iterations, $p_{CF} = 0.2$; inference guidance scale $\alpha_c = 5$ (or 3).
  • Sampling: Top 50% logit truncation per step, multinomial sampling.
  • Output resolutions: Available models for 256×256 and 512×512 outputs.
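The truncated sampling scheme listed above can be sketched as follows (our helper name; toy logits):

```python
import numpy as np

def sample_token(logits, keep_frac=0.5, rng=None):
    """Sample the next token from a softmax restricted to the top
    keep_frac fraction of logits (top-50% truncation per decoding step)."""
    if rng is None:
        rng = np.random.default_rng()
    k = max(1, int(len(logits) * keep_frac))
    top = np.argsort(logits)[-k:]             # indices of the kept logits
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()                              # renormalized softmax over kept logits
    return int(top[rng.choice(k, p=p)])       # multinomial draw, mapped back to vocab index

logits = np.random.default_rng(0).normal(size=8192)  # toy vocabulary (VQ-IMG codebook size)
tok = sample_token(logits)
print(0 <= tok < 8192)  # True
```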

This architectural configuration supports Make-A-Scene’s state-of-the-art quantitative and qualitative performance, and its novel suite of editing and illustration features (Gafni et al., 2022).

References

  • Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022). Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv:2203.13131.
