
Scale-Aware Prompt Decoder

Updated 7 December 2025
  • Scale-Aware Prompt Decoder is a lightweight, inference-time module that adaptively selects optimal classifier-free guidance scales for each prompt.
  • It leverages a data-driven synthetic oracle and a compact multi-layer perceptron to model how generation quality depends on prompt semantics and guidance scale, without retraining the backbone.
  • Empirical validation on image and audio tasks shows enhanced generative fidelity, improved alignment, and superior perceptual quality compared to fixed guidance scales.

A Scale-Aware Prompt Decoder is a lightweight, inference-time module that predicts and selects the optimal classifier-free guidance scale (CFG scale) for each input prompt in text-to-image or text-to-audio diffusion models. Traditional CFG uses a fixed guidance scale, which has limited ability to generalize across prompts of varying semantic complexity. A scale-aware approach adaptively chooses the guidance strength per prompt, enhancing generative fidelity, prompt alignment, and perceptual quality without additional backbone retraining. The methodology is underpinned by modeling the dependence of multi-metric generation quality on both prompt semantics and the guidance scale, realized through a fully data-driven synthetic oracle and a trained multi-layer perceptron predictor (Zhang et al., 25 Sep 2025).

1. Construction of the Synthetic Oracle Dataset

To capture the relationship between prompt, guidance scale, and generation quality, a synthetic dataset is constructed as follows:

  • Prompt Pool: For image tasks, approximately 8,000 prompts are randomly sampled from MSCOCO 2014 captions. For audio tasks, about 6,000 from AudioCaps.
  • Guidance Scales ($S=\{s_1,\ldots,s_K\}$): A discrete set of candidate scales (e.g., $1.0$ to $10.0$ in steps of $0.5$, giving $K\approx 19$).
  • Sample Generation: For each prompt $p$ and each scale $s\in S$, a pretrained diffusion model (SDXL for images or AudioLDM2 for audio) is run $N_g$ times (e.g., $N_g=4$) under CFG with guidance weight $s$, generating outputs $I^{(j)}_{p,s}$.
  • Quality Metric Evaluation:
    • Images: Metrics $M_{\text{image}}$ include KID, CLIP, ImageReward, Precision, and Recall.
    • Audio: Metrics $M_{\text{audio}}$ include AudioBox-Aesthetics and, optionally, FAD and CLAP score.
  • Oracle Aggregation: For each metric $m\in M$,

$$q_m(p,s) = \frac{1}{N_g} \sum_{j=1}^{N_g} M_m\big(I^{(j)}_{p,s},\, p\big)$$

with $q(p,s)\in\mathbb{R}^{|M|}$ the oracle quality vector for the prompt/scale pair.

This dataset forms the empirical basis for supervised training of the predictor.
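
To make the construction concrete, here is a minimal Python sketch of the oracle loop. The generator and metric functions are random stand-ins for SDXL/AudioLDM2 and the KID/CLIP/ImageReward/Precision/Recall suite, so the structure runs end to end without heavy dependencies; a real implementation would swap in the actual models and metrics.

```python
# Sketch of the oracle-construction loop; all heavy components are stubbed.
import numpy as np

rng = np.random.default_rng(0)

def generate(prompt: str, scale: float) -> np.ndarray:
    """Stand-in for one CFG-guided diffusion sample I^{(j)}_{p,s}."""
    return rng.standard_normal(4)

METRICS = ["KID", "CLIP", "ImageReward", "Precision", "Recall"]

def evaluate(sample: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for the |M| quality metrics M_m(I, p)."""
    return rng.random(len(METRICS))

prompts = ["a red bicycle leaning on a fence", "two dogs playing in snow"]
scales = np.arange(1.0, 10.5, 0.5)  # K = 19 candidate guidance scales
N_g = 4                             # generations per (prompt, scale) pair

oracle = {}  # (prompt, scale) -> q(p, s) in R^{|M|}
for p in prompts:
    for s in scales:
        samples = [generate(p, float(s)) for _ in range(N_g)]
        oracle[(p, float(s))] = np.mean(
            [evaluate(x, p) for x in samples], axis=0)
```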

2. Architecture of the Lightweight Predictor

The scale-aware decoder is a compact, feed-forward multi-layer perceptron with the following components:

  • Inputs:
    • Semantic embedding $e(p)$: extracted from a frozen vision-language encoder (CLIP for images, $d_e=768$; CLAP for audio, $d_e=512$).
    • Complexity features $r(p)$: metrics such as prompt length, token entropy, perplexity, modifier diversity, and punctuation count, projected via $c(p)=W_c r(p) + b_c$ with $d_c=32$.
    • Scale $s$, normalized to $[0,1]$.
    • Concatenated vector: $h(p,s) = [e(p);\, c(p);\, s] \in \mathbb{R}^{d_e + d_c + 1}$.
  • Network:
    • 2–3 fully connected layers (hidden sizes $512 \to 256$), ReLU activations, optional dropout ($p=0.1$).
    • Output layer returns $d_q = |M|$ estimated metric values.
  • Output:
    • $\hat{q}(p,s) = g_\phi(h(p,s)) \approx q(p,s)$, per metric.

The total parameter count is approximately $2$–$4$ million, a lightweight profile suitable for inference-time use.
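
A minimal PyTorch reconstruction of the predictor $g_\phi$ is sketched below, using the dimensions quoted above ($d_e=768$ for CLIP, $d_c=32$, hidden sizes $512\to256$, $|M|=5$ outputs). The number of raw complexity features ($d_r=8$ here) is not stated in the source and is an illustrative choice.

```python
# Illustrative reconstruction of the predictor; not released code.
import torch
import torch.nn as nn

class ScaleAwarePromptDecoder(nn.Module):
    def __init__(self, d_e: int = 768, d_r: int = 8, d_c: int = 32,
                 n_metrics: int = 5, dropout: float = 0.1):
        super().__init__()
        # Projects raw complexity features r(p) to c(p) = W_c r(p) + b_c.
        self.complexity_proj = nn.Linear(d_r, d_c)
        self.mlp = nn.Sequential(
            nn.Linear(d_e + d_c + 1, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, n_metrics),  # one estimated value per metric
        )

    def forward(self, e_p: torch.Tensor, r_p: torch.Tensor,
                s: torch.Tensor) -> torch.Tensor:
        c_p = self.complexity_proj(r_p)
        h = torch.cat([e_p, c_p, s.unsqueeze(-1)], dim=-1)  # [e(p); c(p); s]
        return self.mlp(h)  # q_hat(p, s)
```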

3. Mathematical Formulation and Guidance Scale Selection

Key formulations underlying the scale selection process:

  • Classifier-Free Guidance: For the conditional noise estimator $\epsilon_\theta(x_t,p,t)$ and unconditional estimator $\epsilon_\theta(x_t,\emptyset,t)$,

$$\hat{\epsilon}_\theta(x_t,p,t;s) = (1-s)\,\epsilon_\theta(x_t,\emptyset,t) + s\,\epsilon_\theta(x_t,p,t)$$

and the sampled distribution is reweighted as $p^s(x\mid p) \propto p(x\mid p)\,[p(x\mid p)/p(x)]^s$. (Both this combination and the utility maximization below are sketched in code at the end of this section.)

  • Multi-Metric Quality Curves:

$$Q_m(s\mid p) \triangleq \mathbb{E}_{I\sim p_\theta(\cdot\mid p,s)}\big[M_m(I,p)\big], \quad \text{approximated by } q_m(p,s)$$

The predictor estimates $\hat{Q}_m(s\mid p) = \hat{q}_m(p,s)$.

  • Utility Function and Scale Optimization: Nonnegative metric weights $w\in\mathbb{R}^{d_q}_{\geq 0}$ and a quadratic regularizer centered at an anchor $\mu_s$ (the default scale) are used:

$$U(s\mid p) = \sum_{m} w_m\,\hat{Q}_m(s\mid p) - \alpha\,(s-\mu_s)^2, \qquad s^*(p) = \arg\max_{s\in S} U(s\mid p)$$
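
The formulas in this section map directly to code. The NumPy sketch below implements the CFG noise combination and the regularized utility argmax; the dummy predictions, the equal metric weights, and the values $\alpha=0.1$ and $\mu_s=5.0$ are illustrative placeholders, not settings reported in the paper.

```python
# NumPy sketch of the CFG combination and utility-based scale selection.
import numpy as np

def cfg_epsilon(eps_uncond: np.ndarray, eps_cond: np.ndarray,
                s: float) -> np.ndarray:
    # eps_hat = (1 - s) * eps(x_t, empty, t) + s * eps(x_t, p, t)
    return (1.0 - s) * eps_uncond + s * eps_cond

def select_scale(q_hat: np.ndarray, scales: np.ndarray, w: np.ndarray,
                 alpha: float = 0.1, mu_s: float = 5.0) -> float:
    """q_hat: (K, |M|) predicted metrics, one row per candidate scale."""
    utility = q_hat @ w - alpha * (scales - mu_s) ** 2  # U(s | p)
    return float(scales[int(np.argmax(utility))])       # s*(p)

scales = np.arange(1.0, 10.5, 0.5)
q_hat = np.random.default_rng(0).random((len(scales), 5))  # dummy predictions
s_star = select_scale(q_hat, scales, np.ones(5) / 5)
```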

4. Training Procedure

  • Dataset: All $(p,s)$ pairs from the synthetic oracle are used.
  • Inputs: $h(p,s)$; targets: $q(p,s)$ (the multi-metric oracle vector).
  • Loss Function:

$$L(\phi) = \sum_{(p,s)} \big\|g_\phi(h(p,s)) - q(p,s)\big\|_2^2 + \lambda\,\|\phi\|_2^2$$

with $\lambda \approx 10^{-4}$.

  • Optimization: Adam with learning rate $10^{-4}$, batch size 64, 20 epochs, and early stopping on a validation set.
  • Regularization: Optional label-smoothing or target noise.

This regime ensures the predictor generalizes multi-metric quality as a function of prompt and scale.
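
A hedged sketch of this training regime follows, assuming a DataLoader that yields precomputed $(e(p), r(p), s, q)$ tensors from the oracle. Adam's weight_decay stands in for the $\lambda\|\phi\|_2^2$ term, and early stopping on the validation split is omitted for brevity.

```python
# Minimal training loop: MSE to the oracle targets, L2 via weight decay.
import torch

def train(model: torch.nn.Module, loader, epochs: int = 20,
          lr: float = 1e-4, weight_decay: float = 1e-4) -> None:
    opt = torch.optim.Adam(model.parameters(), lr=lr,
                           weight_decay=weight_decay)
    loss_fn = torch.nn.MSELoss()  # squared error to the oracle targets
    model.train()
    for _ in range(epochs):
        for e_p, r_p, s, q in loader:  # batches of 64 assumed in the loader
            opt.zero_grad()
            loss = loss_fn(model(e_p, r_p, s), q)
            loss.backward()
            opt.step()
```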

5. Inference Workflow

The deployment process is as follows:

  1. Feature Extraction: For an incoming prompt $p$, extract $e(p)$ and $r(p)$, and compute $c(p)$.
  2. Metric Prediction: For each guidance scale $s\in S$, form $h(p,s)$ and compute $\hat{q}(p,s) = g_\phi(h(p,s))$.
  3. Utility Maximization: Compute $U(s\mid p) = w^\top \hat{q}(p,s) - \alpha(s-\mu_s)^2$ and select $s^* = \arg\max_{s\in S} U(s\mid p)$.
  4. Sample Generation: Run the diffusion model under CFG with guidance scale $s^*$ to generate the final sample.

This workflow replaces run-time grid search with a single per-prompt prediction, selecting the guidance scale without extra diffusion passes.
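
The four steps can be wired together as below, reusing the ScaleAwarePromptDecoder sketch from Section 2. The helpers `clip_embed`, `complexity_features`, and `run_diffusion` are hypothetical placeholders for a real CLIP encoder, feature extractor, and diffusion backbone; $\alpha$, $\mu_s$, and the metric weights are illustrative.

```python
# End-to-end inference sketch; encoder and backbone calls are stubbed.
import numpy as np
import torch

def clip_embed(prompt: str) -> torch.Tensor:
    return torch.randn(768)  # placeholder for a frozen CLIP text encoder

def complexity_features(prompt: str) -> torch.Tensor:
    return torch.randn(8)    # placeholder for length, entropy, etc.

def choose_scale(model, prompt: str, scales: np.ndarray, w: np.ndarray,
                 alpha: float = 0.1, mu_s: float = 5.0) -> float:
    e_p, r_p = clip_embed(prompt), complexity_features(prompt)      # step 1
    model.eval()
    with torch.no_grad():                                           # step 2
        q_hat = torch.stack([
            model(e_p, r_p, torch.tensor(s, dtype=torch.float32))
            for s in scales
        ]).numpy()
    utility = q_hat @ w - alpha * (scales - mu_s) ** 2              # step 3
    return float(scales[int(np.argmax(utility))])

model = ScaleAwarePromptDecoder()  # Section 2 sketch
s_star = choose_scale(model, "a foggy harbor at dawn",
                      np.arange(1.0, 10.5, 0.5), np.ones(5) / 5)
# step 4 (placeholder): sample = run_diffusion(prompt, guidance_scale=s_star)
```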

6. Empirical Validation and Performance

Experiments on MSCOCO 2014 (images, SDXL backbone; 3,000 validation captions) and AudioCaps (audio, AudioLDM2 backbone; 1,000 validation captions) demonstrate quantifiable improvements:

Image Generation (Table 1)

| Method | FID↓ | CLIP↑ |
|---|---|---|
| No guidance ($s=1.0$) | 62.44 | 0.27 |
| Vanilla CFG (default $s\approx 5$) | 31.04 | 0.31 |
| Prompt-aware ($s^*(p)$) | 30.74 | 0.33 |

Audio Generation (Table 2)

| Method | CE↑ | CU↑ | PC↑ | PQ↑ |
|---|---|---|---|---|
| No guidance ($s=1.0$) | 3.62 | 5.18 | 3.13 | 5.76 |
| Vanilla CFG (default $s\approx 2.5$) | 3.66 | 5.25 | 3.04 | 5.79 |
| Prompt-aware ($s^*(p)$) | 3.68 | 5.22 | 3.16 | 5.81 |

Here CE, CU, PC, and PQ are the four AudioBox-Aesthetics axes: Content Enjoyment, Content Usefulness, Production Complexity, and Production Quality.

  • Ablation: Using all four training metrics (KID, CLIP, ImageReward, Precision) outperforms KID+CLIP or vanilla CFG, with FID 30.74 versus 31.81 (KID+CLIP only) and 31.04 (CFG default).
  • Perceptual Preference: Human raters prefer prompt-aware outputs over vanilla CFG in approximately 60% of comparisons.

7. Trade-offs and Parameter Tuning

  • Scale Dynamics: Increasing the guidance scale $s$ generally increases semantic alignment (CLIP, Audio-CLAP) but can reduce diversity (Recall) and, past a mid-range $s$, degrade fidelity through over-sharpening.
  • Quadratic Penalty: The regularizer $\alpha(s-\mu_s)^2$ discourages extreme scale choices when predicted utility gains are marginal (see the worked example below).
  • Utility Weights:
    • Increase $w_{\text{align}}/w_{\text{fidelity}}$ for alignment-sensitive tasks.
    • For highly detailed prompts, use a lower $\mu_s$ and higher $\alpha$ to mitigate over-guidance.
    • Dev-set sweeps allow empirical tuning of $w$ and $\alpha$ to match task priorities.

These mechanisms enable controlled trade-off navigation across fidelity, alignment, and diversity in prompt-conditional generation.
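
A worked example makes the penalty's effect concrete. Suppose, purely as an illustration, that the predicted utility term is locally quadratic around its peak $\hat{s}$, i.e., $\sum_m w_m \hat{Q}_m(s\mid p) \approx C - \beta(s-\hat{s})^2$ with curvature $\beta > 0$. Then, ignoring the restriction to the discrete set $S$,

$$U(s\mid p) \approx C - \beta(s-\hat{s})^2 - \alpha(s-\mu_s)^2, \qquad s^* = \frac{\beta\,\hat{s} + \alpha\,\mu_s}{\beta + \alpha},$$

so the selected scale is a curvature-weighted average of the predictor's preferred scale and the default anchor: a larger $\alpha$ pulls $s^*$ toward $\mu_s$, while a sharply peaked prediction (large $\beta$) keeps $s^*$ near $\hat{s}$.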


By directly modeling prompt-scale dependencies with a learned Quality Curve Predictor and utility-based scale selection, the Scale-Aware Prompt Decoder provides a practical, computationally efficient enhancement over fixed CFG weighting, delivering improved generation metrics and perceptual outcomes across diverse prompts (Zhang et al., 25 Sep 2025).
