DogLayout: Diffusion GAN for Discrete Layouts
- The paper introduces DogLayout, a novel framework that integrates denoising diffusion processes with GAN-style training to overcome challenges in mixed discrete and continuous layout generation.
- The model leverages Transformer-based generators and discriminators to achieve high sample efficiency and robust structural fidelity, as demonstrated on the PubLayNet dataset.
- DogLayout achieves significant performance gains by reducing sampling steps and overlap metrics while maintaining competitive FID and IoU scores, offering practical value for layout synthesis.
DogLayout is a generative framework for discrete and continuous layout generation that integrates denoising diffusion processes with adversarial (GAN-style) training. Its design targets challenges inherent in layout generation tasks where layouts are described by both discrete categorical labels (e.g., object classes) and continuous geometric parameters (e.g., bounding boxes). DogLayout addresses the sampling inefficiency of conventional diffusion models, as well as the limitations of GANs on discrete data, by conditioning GAN training on a denoising diffusion process while maintaining full end-to-end differentiability for both discrete and continuous layout components (Gan et al., 2024).
1. Model Architecture and Design
DogLayout constructs a hybrid generative model, narrowing the gap between the control and sample quality of diffusion models and the sampling efficiency of GANs. The architecture consists of a generator G and a discriminator D, both built around Transformer encoders and fully connected (FC) projection branches.
- Generator (G): Accepts a noisy layout $x_t$ of $M$ elements, each carrying $N$ label logits (pre-softmax) and $4$ box parameters. Latent noise $z$ is jointly embedded with $x_t$. The model stacks multi-head Transformer encoder layers, producing an output of shape $(M \times (N+4))$: unnormalized logits for the discrete labels plus continuous coordinates.
```
h_z = FC_z(z)                    # (M × d)
h_x = FC_x(x_t)                  # (M × d)
h_cat = concat(h_z, h_x)         # (2M × d) or interleaved tokens
h_out = TransformerEnc_g(h_cat)  # L_g layers, H_g heads
x0_pred = FC_out(h_out)          # (M × (N+4))
```
- Discriminator and Decoder (D + De): Receives either a real pair $(x_{t-1}, x_t)$ or a fake pair $(x'_{t-1}, x_t)$, where $x'_{t-1}$ is produced by passing the generator's prediction $x'_0$ back through the diffusion kernel. Inputs are FC-embedded, concatenated, and processed through Transformer layers together with a global special token. The discriminator outputs a real/fake score via a sigmoid over an FC head on the global token. An attached decoder head De further reconstructs the original $x_0$ from the global token, enforcing structural awareness and preventing trivial solutions.
```
inp = concat(x_prev, x_t)              # (2M × (N+4))
h_in = FC_D(inp)                       # (2M × d)
h_all = TransformerEnc_d([h_s; h_in])  # prepend global token h_s
h_glob, h_tok = split(h_all)           # h_glob = (1 × d)
p_real = sigmoid(FC_logit(h_glob))
x0_rec = De(h_glob)
```
The model eschews non-differentiable operations (e.g., argmax) during training. Instead, both G and D manipulate real-valued logits, with discrete labels recovered only at inference via argmax.
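This trick can be sketched in a few lines. A minimal NumPy illustration (the class count `N_CLASSES` and the logit values are illustrative, not from the paper): during training the label channels stay as real-valued logits, and the hard class is extracted only at the end.

```python
import numpy as np

N_CLASSES = 5  # assumed number of label classes (illustrative)

def softmax(logits, axis=-1):
    """Numerically stable softmax; a differentiable surrogate for labels."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Training-time: logits flow through the losses unchanged (no argmax).
logits = np.array([[2.0, 0.1, -1.0, 0.5, 0.0]])
probs = softmax(logits)

# Inference-time only: recover the discrete class label.
label = int(np.argmax(probs, axis=-1)[0])
```

Because argmax appears only after the last denoising step, gradients reach the label channels throughout training.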
2. Denoising Diffusion Process
DogLayout leverages a Gaussian forward–reverse process inherited from Denoising Diffusion Probabilistic Models (DDPMs), with modifications for adversarial sampling.
- Forward process: For a noise schedule $\beta_t$, set $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, with
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ \beta_t I\right),$$
or equivalently,
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right).$$
- Reverse process: Standard DDPMs use parameterized Gaussians:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right).$$
DogLayout instead adversarially matches the conditional reverse kernel $q(x_{t-1} \mid x_t, x_0)$ for small $T$, relying on the closed-form posterior:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right),$$
with
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t.$$
The diffusion noise $\beta_t$ is scheduled linearly over $t = 1, \dots, T$, identically for all channels.
- Discrete label handling: The forward kernel treats label channels as real-valued logits until final argmax extraction, maintaining overall differentiability.
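The schedule quantities and closed-form posterior above can be sketched in NumPy. The linear schedule endpoints below are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

T = 4
betas = np.linspace(1e-4, 0.02, T)   # β_1..β_T, assumed linear endpoints
alphas = 1.0 - betas                 # α_t = 1 − β_t
alpha_bars = np.cumprod(alphas)      # ᾱ_t = ∏ α_s

def posterior_params(x_t, x0, t):
    """Closed-form q(x_{t-1} | x_t, x_0): returns (mean, variance).

    t is a 0-based index into the precomputed schedule arrays.
    """
    a_bar_t = alpha_bars[t]
    a_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_t = betas[t]
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mean, var
```

Precomputing `alphas` and `alpha_bars` once up front keeps both training and sampling loops cheap.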
3. Objective Functions and Optimization
DogLayout's loss formulation amalgamates adversarial and denoising objectives. The core losses are:
- Discriminator loss:
$$\mathcal{L}_D = -\,\mathbb{E}\big[\log D(x_{t-1}, x_t)\big] - \mathbb{E}\big[\log\big(1 - D(x'_{t-1}, x_t)\big)\big] + \lambda_{\mathrm{rec}}\, \mathcal{L}_{\mathrm{rec}}\big(x_0, \mathrm{De}(h)\big),$$
with $\mathcal{L}_{\mathrm{rec}}$ a reconstruction loss (e.g., $L_1$ or $L_2$) weighted by $\lambda_{\mathrm{rec}}$.
- Generator loss:
$$\mathcal{L}_G = -\,\mathbb{E}\big[\log D(x'_{t-1}, x_t)\big].$$
Optionally, a reconstruction term on $x'_0$ is added to stabilize generator predictions.
- Min–max training objective:
$$\min_G \max_D\ \mathbb{E}\big[\log D(x_{t-1}, x_t)\big] + \mathbb{E}\big[\log\big(1 - D(x'_{t-1}, x_t)\big)\big].$$
The timestep $t$ is implicit in the sampling of $x_t$ and $x_{t-1}$.
Regularization and architectural features such as the decoder De in D are crucial for enforcing nontrivial structure learning, since pure GANs on discrete layouts degenerate to trivial solutions.
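These objectives can be sketched numerically. A minimal NumPy version, where `score_real` and `score_fake` stand in for the discriminator's pre-sigmoid outputs on the real and fake pairs, and `rec_err` for the decoder's reconstruction error (all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_loss(score_real, score_fake, rec_err=0.0, lam_rec=1.0):
    """Discriminator: push real pairs toward 1, fake pairs toward 0,
    plus the weighted decoder reconstruction penalty."""
    return (-np.log(sigmoid(score_real))
            - np.log(1.0 - sigmoid(score_fake))
            + lam_rec * rec_err)

def g_loss(score_fake):
    """Generator: non-saturating loss that rewards fooling D
    on the fake (x'_{t-1}, x_t) pair."""
    return -np.log(sigmoid(score_fake))
```

A confident-and-correct discriminator (high `score_real`, low `score_fake`) drives `d_loss` toward zero, while the generator lowers `g_loss` by raising `score_fake`.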
4. Training Algorithm and Procedures
Practical training is characterized by short diffusion chains (e.g., $T = 4$) and high-throughput batch sizes.
```
for epoch in 1..E:
    for batch x0 ∼ data:
        # 1) sample a random t ∈ [1, T]
        t = Uniform({1,…,T})
        # 2) noise x0 → x_{t-1}, then x_{t-1} → x_t
        x_{t-1} ∼ q(x_{t-1} | x0)
        ε ∼ N(0, I)
        x_t = sqrt(α_t) * x_{t-1} + sqrt(β_t) * ε
        # 3) G predicts clean x'₀ from (x_t, z)
        z ∼ N(0, I)
        x0_pred = G(x_t, z)
        # 4) compute fake x'_{t-1} via q(x_{t-1} | x_t, x0_pred)
        μ_q, σ_q² = posterior_params(x_t, x0_pred, t)
        x_prev_fake ∼ N(μ_q, σ_q²)
        # 5) D update: real vs fake & reconstruction
        loss_D = -log D(x_{t-1}, x_t) - log(1 - D(x_prev_fake, x_t)) + λ_rec · L_rec(x0, De(h))
        D.optimizer.zero_grad(); loss_D.backward(); D.optimizer.step()
        # 6) G update: fool D
        loss_G = -log D(x_prev_fake, x_t)
        G.optimizer.zero_grad(); loss_G.backward(); G.optimizer.step()
```
No explicit time embedding is used; the noise scale itself conveys the diffusion step. Warming up G with a reconstruction-only loss may aid early stability. The decoder in D inhibits shortcut solutions on the discrete channels.
5. Sampling and Inference
DogLayout enables sampling chains substantially shorter than those of standard diffusion models, operating with as few as $T = 4$ steps.
```
x ∼ N(0, I)
for t in T..1:
    # 1) predict clean layout
    z ∼ N(0, I)
    x0_pred = G(x, z)
    # 2) compute posterior mean & variance
    μ_q, σ_q² = posterior_params(x, x0_pred, t)
    # 3) step to x_{t-1}
    if t > 1:
        x = μ_q + sqrt(σ_q²) * ε      # ε ∼ N(0, I)
    else:
        x = x0_pred                   # final step
labels = argmax(softmax(x[:, 0:N]))
boxes = x[:, N:N+4]
```
Posterior parameters follow the DDPM closed-form.
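The loop above can be made runnable end-to-end with a stub generator. Everything below is an illustrative assumption rather than the paper's configuration: the stub `G_stub` (which ignores its inputs and predicts an all-zero "clean" layout), the linear schedule endpoints, and the toy dimensions $M=3$ elements with $N=5$ label classes.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def G_stub(x_t, z):
    """Stand-in for the Transformer generator: predicts x'_0 = 0."""
    return np.zeros_like(x_t)

x = rng.standard_normal((3, 9))             # M=3 elements, N=5 labels + 4 box params
for t in range(T - 1, -1, -1):              # t = T-1 .. 0 (0-based schedule index)
    z = rng.standard_normal(x.shape)
    x0_pred = G_stub(x, z)
    if t > 0:
        a_bar_prev = alpha_bars[t - 1]
        coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - alpha_bars[t])
        coef_xt = np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - alpha_bars[t])
        var = (1.0 - a_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]
        x = coef_x0 * x0_pred + coef_xt * x + np.sqrt(var) * rng.standard_normal(x.shape)
    else:
        x = x0_pred                         # last step: take the clean prediction
labels = np.argmax(x[:, :5], axis=-1)       # discrete labels only at the very end
boxes = x[:, 5:9]
```

Swapping `G_stub` for a trained generator yields the actual sampler; the posterior coefficients match the closed-form DDPM expressions.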
6. Empirical Results
DogLayout demonstrates performance improvements in layout quality and efficiency. Quantitative experiments on PubLayNet are summarized below.
| Model | Overlap (C→S+P) ↓ | FID (C→S+P) ↓ | Max IoU ↑ | Time/sample (ms) |
|---|---|---|---|---|
| LayoutGAN++ | 22.8 | — | — | 0.0327 (T=1) |
| LayoutDM | 16.43 | 8.96 | 0.308 | 23.3 (T=50) |
| DogLayout | 9.59 | 9.62 | 0.287 | 0.133 (T=4) |
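As a quick sanity check on the efficiency claim, the table's own numbers imply roughly a $175\times$ per-sample speedup over LayoutDM alongside a ~42% relative reduction in overlap:

```python
# Values copied from the table above.
layoutdm_ms, doglayout_ms = 23.3, 0.133        # time per sample
layoutdm_ovl, doglayout_ovl = 16.43, 9.59      # overlap (C→S+P)

speedup = layoutdm_ms / doglayout_ms           # ≈ 175×
overlap_drop = (layoutdm_ovl - doglayout_ovl) / layoutdm_ovl
print(round(speedup), round(overlap_drop * 100, 1))
```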
Additional metrics:
- DogLayout overlap (C+S→P): 12.5
- DogLayout per-sample times (in ms) are reported for multiple $T$ settings.
- GANs without diffusion are unstable on discrete labels (discriminator accuracy saturates).
An ablation over $T$ shows a sweet spot for unconditional PubLayNet: FID drops from $199$ to $14.7$ as $T$ increases from its smallest setting, with alignment improving in parallel.
7. Implementation, Limitations, and Future Extensions
- Hyperparameters: The generator, decoder, and discriminator are multi-head Transformer encoders (layer counts, head counts, and embedding width as specified in the paper), each with feedforward width $2048$, GELU activations, and LayerNorm. Training uses the Adam optimizer (learning rate and momentum terms per the paper) with batch size $512$ for 200 epochs.
- Design choices: No explicit time embedding; noise magnitude conveys diffusion step.
- Limitations: Automatic metrics still fall short of human design quality, and the model has no image–layout cross-attention; content-aware layout generation is a direction for future research. The approach may generalize to other structured domains mixing discrete and continuous variables.
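The "no time embedding" design choice can be illustrated directly: because the marginal noise level of $x_t$ differs at every step, $t$ is recoverable from the input's statistics alone. A small NumPy demonstration (the schedule endpoints and the constant toy signal are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)     # ᾱ_t

x0 = np.full(10000, 0.5)                 # a constant "clean" signal
stds = []
for t in range(T):
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    stds.append(float(x_t.std()))        # empirical noise level at step t
```

The standard deviation of `x_t` increases monotonically with `t`, so the network can infer the diffusion step from the noise magnitude without an explicit embedding.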
For reproducibility, precomputing the schedule quantities ($\alpha_t$, $\bar\alpha_t$, and the posterior coefficients) is suggested, and warming up with a pure reconstruction loss stabilizes generator learning in early epochs (Gan et al., 2024).