
DogLayout: Diffusion GAN for Discrete Layouts

Updated 20 February 2026
  • The paper introduces DogLayout, a novel framework that integrates denoising diffusion processes with GAN-style training to overcome challenges in mixed discrete and continuous layout generation.
  • The model leverages Transformer-based generators and discriminators to achieve high sample efficiency and robust structural fidelity, as demonstrated on the PubLayNet dataset.
  • DogLayout achieves significant performance gains by reducing sampling steps and overlap metrics while maintaining competitive FID and IoU scores, offering practical value for layout synthesis.

DogLayout is a generative framework for discrete and continuous layout generation that integrates denoising diffusion processes with adversarial (GAN-style) training. Its design targets challenges inherent in layout generation tasks where layouts are described by both discrete categorical labels (e.g., object classes) and continuous geometric parameters (e.g., bounding boxes). DogLayout addresses the sampling inefficiency of conventional diffusion models, as well as the limitations of GANs with discrete data, by conditioning GAN training on a denoising diffusion process while maintaining full end-to-end differentiability for both discrete and continuous layout components (Gan et al., 2024).

1. Model Architecture and Design

DogLayout constructs a hybrid generative model, narrowing the gap between the control and sample quality of diffusion models and the sampling efficiency of GANs. The architecture consists of a generator $G_\theta$ and a discriminator $D_\phi$, both built around Transformer encoders and fully connected (FC) projection branches.

  • Generator ($G$): Accepts a noisy layout $x_t \in \mathbb{R}^{M \times (N+4)}$, corresponding to $M$ elements, $N$ label logits (pre-softmax), and $4$ box parameters per element. Latent noise $z \sim \mathcal{N}(0, I)$ with shape $(M \times d_z)$ is jointly embedded with $x_t$. The model stacks $L_g$ layers of a multi-head Transformer encoder, producing an output $x'_0 \in \mathbb{R}^{M \times (N+4)}$: unnormalized logits for discrete labels and continuous coordinates.

h_z = FC_z(z)                   # (M × d)
h_x = FC_x(x_t)                 # (M × d)
h_cat = concat(h_z, h_x)        # (2M × d) or interleaved tokens
h_out = TransformerEnc_g(h_cat) # L_g layers, H_g heads
x0_pred = FC_out(h_out)         # (M × (N+4))

  • Discriminator and Decoder ($D$ + $De$): Receives either a real pair $(x_{t-1}, x_t)$ or a fake pair $(x'_{t-1}, x_t)$ produced by feeding $x'_0$ through the forward diffusion kernel. Inputs are FC-embedded, concatenated, and processed via $L_d$ Transformer layers together with a global special token $h_s$. The discriminator outputs a real/fake score $p = D(h) \in [0, 1]$. An attached decoder head further reconstructs the original $x_0$ from the global token, enforcing structural awareness and preventing trivial solutions.

inp = concat(x_prev, x_t)        # (2M × (N+4))
h_in = FC_D(inp)                 # (2M × d)
h_all = TransformerEnc_d([h_s; h_in])
h_glob, h_tok = split(h_all)     # h_glob = (1 × d)
p_real = sigmoid(FC_logit(h_glob))
x0_rec = De(h_glob)

The model eschews non-differentiable operations (e.g., argmax) during training. Instead, both $G$ and $D$ manipulate real-valued logits, with discrete labels recovered only at inference via $\arg\max_i\,\mathrm{softmax}(x_0[i])$.
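Since labels stay in logit space throughout training, inference-time recovery is a plain softmax-argmax over the label channels. A minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def recover_layout(x0, num_labels):
    """Split a denoised layout x0 of shape (M, N+4) into labels and boxes."""
    logits = x0[:, :num_labels]
    # numerically stable softmax; argmax over probs equals argmax over logits
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    labels = probs.argmax(axis=1)               # discrete class per element
    boxes = x0[:, num_labels:num_labels + 4]    # continuous box parameters
    return labels, boxes

# two elements (M=2), three label classes (N=3), four box parameters each
x0 = np.array([[ 2.0, 0.1, -1.0, 0.1, 0.2, 0.3, 0.4],
               [-0.5, 3.0,  0.0, 0.5, 0.5, 0.2, 0.2]])
labels, boxes = recover_layout(x0, num_labels=3)
print(labels)        # → [0 1]
print(boxes.shape)   # → (2, 4)
```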

2. Denoising Diffusion Process

DogLayout leverages a Gaussian forward–reverse process inherited from Denoising Diffusion Probabilistic Models (DDPMs), with modifications for adversarial sampling.

  • Forward process: For noise schedule $\{\beta_t\}$, set $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^t \alpha_s$.

$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ \beta_t I)$

or equivalently,

$x_t \sim \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I)$

  • Reverse process: Standard DDPMs use parameterized Gaussians:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t))$

DogLayout instead adversarially matches the conditional reverse kernel for small $T$, relying on the closed-form posterior:

$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde\mu(x_t, x_0),\ \tilde\beta_t I\right)$

with

$q(x_{t-1} \mid x_t, x_0) = \dfrac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$

The diffusion noise is scheduled linearly ($\beta_t \in [10^{-4}, 0.02]$), identically for all channels.

  • Discrete label handling: The forward kernel treats label channels as real-valued logits until final argmax extraction, maintaining overall differentiability.
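The forward kernel and the closed-form posterior above can be sketched in NumPy with the $\bar\alpha_t$ products precomputed (a minimal illustration; function and variable names are ours, and $t$ is 0-indexed here rather than running from $1$ to $T$):

```python
import numpy as np

T = 4
betas = np.linspace(1e-4, 0.02, T)   # linear schedule, identical for all channels
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # precomputed cumulative products

def q_sample(x0, t, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)

def posterior_params(x_t, x0, t):
    """Mean and variance of the DDPM posterior q(x_{t-1} | x_t, x0)."""
    a_prev = alpha_bar[t - 1] if t > 0 else 1.0   # abar_{t-1} = 1 at the first step
    denom = 1.0 - alpha_bar[t]
    mean = (np.sqrt(a_prev) * betas[t] * x0
            + np.sqrt(alphas[t]) * (1.0 - a_prev) * x_t) / denom
    var = (1.0 - a_prev) / denom * betas[t]
    return mean, var

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 9))     # M = 5 elements, N = 5 label logits + 4 box params
x_t = q_sample(x0, t=3, rng=rng)
mean, var = posterior_params(x_t, x0, t=3)
```

At the first step the posterior collapses onto the predicted clean sample (zero variance), which is why sampling loops return the prediction directly at the end.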

3. Objective Functions and Optimization

DogLayout's loss formulation amalgamates adversarial and denoising objectives. The core losses are:

  • Discriminator loss:

$L_D = \sum_{t=1}^T \mathbb{E}_{q(x_t)}\Big[ \mathbb{E}_{q(x_{t-1} \mid x_t)}[-\log D(x_{t-1}, x_t)] + \mathbb{E}_{x'_{t-1} \sim p_\theta(x_{t-1} \mid x_t)}[-\log(1 - D(x'_{t-1}, x_t))] + \lambda_\mathrm{rec}\, \mathbb{E}_{q(x_0 \mid x_t)}[L_\mathrm{rec}(x_0, De(h))] \Big]$

with $L_\mathrm{rec}$ an $\ell_2$ or $\ell_1$ loss and $\lambda_\mathrm{rec} \approx 1$.

  • Generator loss:

$L_G = \sum_{t=1}^T \mathbb{E}_{q(x_t), z}[-\log D(x'_{t-1}(x_t, z), x_t)]$

Optionally, $L_\mathrm{rec}$ is added to stabilize generator predictions.

  • Min–max training objective:

$\min_\theta \max_\phi\ L_\mathrm{GAN}(G_\theta, D_\phi) + L_\mathrm{diff}(G_\theta) + L_\mathrm{rec}(D_\phi)$

$L_\mathrm{diff}$ is implicit in the sampling of $x_t$ and $x_{t-1}$.

Regularization and architectural features such as the decoder in DD are crucial for enforcing nontrivial structure learning, as pure GANs on discrete layouts become trivially degenerate.
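The adversarial part of these objectives can be illustrated on scalar discriminator scores (a toy NumPy sketch with our own function names; the reconstruction term $\lambda_\mathrm{rec} L_\mathrm{rec}$ and the expectation over $t$ are omitted):

```python
import numpy as np

def d_loss(p_real, p_fake, eps=1e-12):
    """Discriminator term: score real pairs toward 1 and fake pairs toward 0."""
    return -np.log(p_real + eps) - np.log(1.0 - p_fake + eps)

def g_loss(p_fake, eps=1e-12):
    """Generator term: push D's score on fake pairs toward 1."""
    return -np.log(p_fake + eps)

# a discriminator that separates real from fake well incurs a small loss ...
print(d_loss(0.9, 0.1))   # small
# ... while a generator whose fakes are scored low incurs a large one
print(g_loss(0.1))        # large
```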

4. Training Algorithm and Procedures

Practical training is characterized by short diffusion chains ($T \in \{4, 8, 12\}$) and high-throughput batch sizes.

for epoch in 1..E:
  for batch x0 ∈ data:
    # 1) sample a random t ∈ {1, …, T}
    t = Uniform({1, …, T})
    # 2) sample the real pair: x_{t-1} from q(x_{t-1}|x0), then x_t from q(x_t|x_{t-1})
    ε, ε' ∼ N(0, I)
    x_prev = sqrt(ᾱ_{t-1}) * x0 + sqrt(1 - ᾱ_{t-1}) * ε
    x_t    = sqrt(α_t) * x_prev + sqrt(β_t) * ε'
    # 3) G predicts clean x'₀ from (x_t, z)
    z ∼ N(0, I)
    x0_pred = G(x_t, z)
    # 4) compute fake x'_{t-1} via the posterior q(x_{t-1}|x_t, x0_pred)
    μ_q, σ_q² = posterior_params(x_t, x0_pred, t)
    x_prev_fake ∼ N(μ_q, σ_q² I)
    # 5) D update: real vs. fake, plus reconstruction of x0 from the global token
    loss_D = -log D(x_prev, x_t) - log(1 - D(x_prev_fake, x_t)) + λ_rec · L_rec(x0, De(h))
    D.optimizer.zero_grad(); loss_D.backward(); D.optimizer.step()
    # 6) G update: fool D
    loss_G = -log D(x_prev_fake, x_t)
    G.optimizer.zero_grad(); loss_G.backward(); G.optimizer.step()

No explicit time embedding is used; the noise scale suffices. Warm-up of $G$ with a reconstruction-only loss may aid early stability. The decoder in $D$ inhibits shortcut solutions on discrete channels.

5. Sampling and Inference

DogLayout enables sampling chains up to $175\times$ shorter than standard diffusion models by operating with as few as $T = 4$ steps.

# 1) start from pure noise
x ∼ N(0, I)
for t in T..1:
  # 2) predict the clean layout
  z ∼ N(0, I)
  x0_pred = G(x, z)
  # 3) compute the posterior mean & variance
  μ_q, σ_q² = posterior_params(x, x0_pred, t)
  # 4) step to x_{t-1}
  if t > 1:
    ε ∼ N(0, I)
    x = μ_q + sqrt(σ_q²) * ε
  else:
    x = x0_pred               # final denoised layout
labels = argmax(softmax(x[:, 0:N]))
boxes  = x[:, N:N+4]

Posterior parameters follow the DDPM closed-form.
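Written out, the DDPM closed-form posterior parameters are:

$\tilde\mu(x_t, x_0) = \dfrac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \dfrac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \dfrac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t$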

6. Empirical Results

DogLayout demonstrates performance improvements in layout quality and efficiency. Quantitative experiments on PubLayNet are summarized below.

Model       | Overlap (C→S+P) ↓ | FID (C→S+P) ↓ | Max IoU ↑ | Time/sample (ms)
LayoutGAN++ | 22.8              | —             | —         | 0.0327 (T=1)
LayoutDM    | 16.43             | 8.96          | 0.308     | 23.3 (T=50)
DogLayout   | 9.59              | 9.62          | 0.287     | 0.133 (T=4)

Additional metrics:

  • DogLayout overlap (C+S→P): 12.5
  • DogLayout sampling: $T=4$: 0.133 ms, $T=8$: 0.255 ms, $T=12$: 0.377 ms per sample
  • GANs without diffusion are unstable on discrete labels (discriminator accuracy saturates).
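A quick arithmetic check on the reported timings shows an approximately constant per-step cost, i.e., sampling latency scales linearly with $T$:

```python
# Per-sample latencies (ms) reported for DogLayout at different step counts T
times = {4: 0.133, 8: 0.255, 12: 0.377}
per_step = {t: ms / t for t, ms in times.items()}
print(per_step)  # each value is roughly 0.031-0.033 ms per step
```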

Ablation over $T$ shows a sweet spot at $T = 8$ for unconditional PubLayNet: FID drops from $199$ ($T=2$) to $14.7$ ($T=8$), with alignment improving in parallel.

7. Implementation, Limitations, and Future Extensions

  • Hyperparameters: Generator and decoder Transformers use $L_g = 4$ layers, $H_g = 8$ heads; the discriminator uses $L_d = 8$ layers, $H_d = 4$ heads; all with $d = 256$, feedforward width $2048$, GELU activations, and LayerNorm. Adam optimizer with $\beta_1 = 0.5$, $\beta_2 = 0.999$, learning rate $10^{-5}$, batch size $512$, $\approx 200$ epochs.
  • Design choices: No explicit time embedding; noise magnitude conveys diffusion step.
  • Limitations: Automatic metrics still fall short of human design quality, and no image–layout cross-attention is present; content-aware layout generation is a direction for future research. The approach may generalize to other mixed discrete-and-continuous structured domains.
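When reimplementing, the hyperparameters above can be gathered into a single configuration object (a plain Python sketch; the key names are our own shorthand, not from the paper):

```python
# Hyperparameters as reported in the paper; key names are our own
config = dict(
    d_model=256, ffn_width=2048, activation="gelu", norm="layernorm",
    g_layers=4, g_heads=8,         # generator / decoder transformers
    d_layers=8, d_heads=4,         # discriminator transformer
    optimizer="adam", beta1=0.5, beta2=0.999,
    lr=1e-5, batch_size=512, epochs=200,
    beta_min=1e-4, beta_max=0.02,  # linear diffusion noise schedule
)
print(config["lr"])  # → 1e-05
```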

For reproducibility, precomputing $\bar\alpha_t$ is suggested, and warm-up with a pure reconstruction loss stabilizes generator learning in early epochs (Gan et al., 2024).

References (1)
