Scale-Aware Prompt Decoder
- Scale-Aware Prompt Decoder is a lightweight, inference-time module that adaptively selects optimal classifier-free guidance scales for each prompt.
- It leverages a data-driven synthetic oracle and a compact multi-layer perceptron to model the dependence of generation quality on prompt semantics and guidance scale, without retraining the backbone.
- Empirical validation on image and audio tasks shows enhanced generative fidelity, improved alignment, and superior perceptual quality compared to fixed guidance scales.
A Scale-Aware Prompt Decoder is a lightweight, inference-time module that predicts and selects the optimal classifier-free guidance scale (CFG scale) for each input prompt in text-to-image or text-to-audio diffusion models. Traditional CFG applies a single fixed guidance scale to all prompts, which cannot adapt to prompts of varying semantic complexity. A scale-aware approach adaptively chooses the guidance strength per prompt, enhancing generative fidelity, prompt alignment, and perceptual quality without additional backbone retraining. The methodology is underpinned by modeling the dependence of multi-metric generation quality on both prompt semantics and the guidance scale, realized through a fully data-driven synthetic oracle and a trained multi-layer perceptron predictor (Zhang et al., 25 Sep 2025).
1. Construction of the Synthetic Oracle Dataset
To capture the relationship between prompt, guidance scale, and generation quality, a synthetic dataset is constructed as follows:
- Prompt Pool: For image tasks, approximately 8,000 prompts are randomly sampled from MSCOCO 2014 captions. For audio tasks, about 6,000 from AudioCaps.
- Guidance Scales ($\mathcal{W}$): A discrete set of candidate scales (e.g., $1.0$ to $10.0$ in $0.5$ steps, with $K = 19$ candidates).
- Sample Generation: For each prompt $p$ and each scale $w \in \mathcal{W}$, a pretrained diffusion model (SDXL for images or AudioLDM2 for audio) is run $N$ times under CFG with guidance weight $w$, generating outputs $x_{p,w}^{(1)}, \ldots, x_{p,w}^{(N)}$.
- Quality Metric Evaluation:
- Images: Metrics include KID, CLIP, ImageReward, Precision, Recall.
- Audio: Metrics include AudioBox-Aesthetics, and optionally FAD and CLAP score.
- Oracle Aggregation: For each metric $m$, the scores are averaged over the $N$ generations:
$$q_m(p, w) = \frac{1}{N} \sum_{n=1}^{N} m\!\left(x_{p,w}^{(n)}\right),$$
with $\mathbf{q}(p, w) = \big(q_1(p, w), \ldots, q_M(p, w)\big)$ the oracle quality vector for the prompt/scale pair.
This dataset forms the empirical basis for supervised training of the predictor.
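As a concrete illustration, the oracle-construction loop can be sketched as below. The `generate` wrapper around the pretrained pipeline, the `metrics` scoring functions, and the sample count `N_SAMPLES` are hypothetical placeholders, not values fixed by the source:

```python
import numpy as np

# Hypothetical stand-ins: `generate` wraps the pretrained pipeline (SDXL or
# AudioLDM2) and `metrics` maps metric names to scoring functions; the sample
# count N_SAMPLES is illustrative.
SCALES = np.arange(1.0, 10.0 + 0.5, 0.5)   # candidate guidance scales W (19 values)
N_SAMPLES = 4

def build_oracle(prompts, generate, metrics):
    """Return {(prompt, w): {metric_name: mean score over N_SAMPLES outputs}}."""
    oracle = {}
    for p in prompts:
        for w in SCALES:
            outputs = [generate(p, guidance_scale=float(w)) for _ in range(N_SAMPLES)]
            oracle[(p, float(w))] = {
                name: float(np.mean([score(p, x) for x in outputs]))
                for name, score in metrics.items()
            }
    return oracle
```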
2. Architecture of the Lightweight Predictor
The scale-aware decoder is a compact, feed-forward multi-layer perceptron with the following components:
- Inputs:
- Semantic embedding $e_p \in \mathbb{R}^{d_e}$: Extracted from a frozen vision-language encoder (CLIP for images; CLAP for audio).
- Complexity features $c_p$: Metrics such as prompt length, token entropy, perplexity, modifier diversity, punctuation count; projected via a learned linear layer to $\tilde{c}_p$.
- Scale $w$, normalized to $\tilde{w} \in [0, 1]$.
- Concatenated vector: $z = [\,e_p;\ \tilde{c}_p;\ \tilde{w}\,]$.
- Network:
- 2–3 fully connected layers with ReLU activations and optional dropout.
- Output layer returns the estimated value of each metric.
- Output:
- Predicted quality vector $\hat{\mathbf{q}}(p, w) = f_\theta(z)$, one value per metric.
The total parameter count is under roughly $4$ million, reflecting a lightweight profile suitable for inference.
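A minimal PyTorch sketch of such a predictor follows; the embedding dimension, projection width, hidden sizes, and dropout rate are illustrative assumptions rather than the reported configuration:

```python
import torch
import torch.nn as nn

class QualityCurvePredictor(nn.Module):
    """MLP mapping [semantic embedding; projected complexity features; scale]
    to per-metric quality estimates. Dimensions below are assumptions."""

    def __init__(self, d_emb=512, d_complex=5, d_proj=32, n_metrics=4,
                 hidden=(256, 128), dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(d_complex, d_proj)       # complexity-feature projection
        layers, d_in = [], d_emb + d_proj + 1          # +1 for the normalized scale
        for h in hidden:
            layers += [nn.Linear(d_in, h), nn.ReLU(), nn.Dropout(dropout)]
            d_in = h
        layers.append(nn.Linear(d_in, n_metrics))      # one output per metric
        self.mlp = nn.Sequential(*layers)

    def forward(self, emb, complexity, scale01):
        # emb: (B, d_emb); complexity: (B, d_complex); scale01: (B, 1) in [0, 1]
        z = torch.cat([emb, self.proj(complexity), scale01], dim=-1)
        return self.mlp(z)                             # (B, n_metrics)
```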
3. Mathematical Formulation and Guidance Scale Selection
Key formulations underlying the scale selection process:
- Classifier-Free Guidance: For a conditional noise estimator $\epsilon_\theta(x_t, c)$ and an unconditional estimator $\epsilon_\theta(x_t, \varnothing)$,
$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \big( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \big),$$
and the sampling distribution is reweighted as $\tilde{p}_w(x \mid c) \propto p(x)\, p(c \mid x)^{w}$, where $c$ denotes the prompt conditioning.
- Multi-Metric Quality Curves: The predictor estimates $\hat{q}_m(p, w) \approx q_m(p, w)$ for each metric $m$, tracing a predicted quality curve over $w$ for each prompt.
- Utility Function and Scale Optimization: Nonnegative metric weights $\lambda_m \geq 0$ and a quadratic regularizer centered at an anchor $w_0$ (the default scale) define
$$U(w; p) = \sum_{m=1}^{M} \lambda_m\, \hat{q}_m(p, w) - \gamma\, (w - w_0)^2, \qquad w^*(p) = \arg\max_{w \in \mathcal{W}} U(w; p).$$
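Given predicted quality curves, scale selection reduces to a vectorized argmax over the candidates. A minimal sketch, assuming metrics are oriented so that higher is better and with illustrative values for $\gamma$ and the anchor $w_0$:

```python
import numpy as np

def select_scale(q_hat, scales, lam, gamma=0.1, w0=7.5):
    """q_hat: (K, M) predicted metric values per candidate scale; lam: (M,)
    nonnegative weights. Metrics are assumed oriented so higher is better
    (negate lower-is-better metrics such as KID). gamma, w0 are illustrative."""
    scales = np.asarray(scales, dtype=float)
    utility = q_hat @ np.asarray(lam) - gamma * (scales - w0) ** 2
    return float(scales[int(np.argmax(utility))])
```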
4. Training Procedure
- Dataset: All $(p, w)$ pairs from the synthetic oracle are used.
- Inputs: $z = [\,e_p;\ \tilde{c}_p;\ \tilde{w}\,]$; Targets: $\mathbf{q}(p, w)$ (the multi-metric oracle values).
- Loss Function: A regression loss over the oracle dataset $\mathcal{D}$, e.g. mean-squared error,
$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(p, w) \in \mathcal{D}} \big\| f_\theta(z_{p,w}) - \mathbf{q}(p, w) \big\|_2^2,$$
with $z_{p,w}$ the concatenated input for the pair $(p, w)$.
- Optimization: Adam, batch size $64$, $20$ epochs, early stopping on a validation set.
- Regularization: Optional label-smoothing or target noise.
This regime trains the predictor to generalize multi-metric quality as a joint function of prompt and scale.
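A compact training loop consistent with this recipe might look as follows, reusing the `QualityCurvePredictor` sketch from Section 2; the learning rate is an assumption, and early stopping is omitted for brevity:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_predictor(model, emb, feats, scales01, targets, epochs=20, lr=1e-3):
    """emb: (N, d_emb); feats: (N, d_complex); scales01: (N, 1); targets: (N, M)
    oracle quality vectors. lr is an assumption; early stopping omitted."""
    loader = DataLoader(TensorDataset(emb, feats, scales01, targets),
                        batch_size=64, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for e, c, w, q in loader:
            opt.zero_grad()
            loss_fn(model(e, c, w), q).backward()
            opt.step()
    return model
```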
5. Inference Workflow
The deployment process is as follows:
- Feature Extraction: For an incoming prompt $p$, extract $e_p$ and $c_p$, and compute the projected complexity features $\tilde{c}_p$.
- Metric Prediction: For each candidate scale $w \in \mathcal{W}$, concatenate $z = [\,e_p;\ \tilde{c}_p;\ \tilde{w}\,]$ and compute $\hat{\mathbf{q}}(p, w) = f_\theta(z)$.
- Utility Maximization: Compute $U(w; p)$ for every $w \in \mathcal{W}$ and select $w^*(p) = \arg\max_{w} U(w; p)$.
- Sample Generation: Run the diffusion model under CFG with $w^*(p)$ to generate the final sample.
This workflow eliminates the need for run-time grid search over generations, selecting the predicted-optimal scale per prompt at negligible cost.
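An end-to-end sketch of the selection step, assuming the predictor and utility above plus hypothetical frozen `encode_prompt` (CLIP/CLAP text encoder) and `complexity_features` extractors:

```python
import torch

@torch.no_grad()
def predict_best_scale(prompt, model, encode_prompt, complexity_features,
                       scales, lam, gamma=0.1, w0=7.5):
    """encode_prompt and complexity_features are hypothetical frozen extractors;
    lam is an (M,) tensor of metric weights. Returns the utility-maximizing scale."""
    e = encode_prompt(prompt)                           # (1, d_emb)
    c = complexity_features(prompt)                     # (1, d_complex)
    w = torch.tensor(scales, dtype=torch.float32)       # (K,) candidate scales
    w01 = (w / w.max()).view(-1, 1)                     # normalization scheme assumed
    q_hat = model(e.expand(len(w), -1), c.expand(len(w), -1), w01)  # (K, M)
    utility = q_hat @ lam - gamma * (w - w0) ** 2
    return float(w[int(torch.argmax(utility))])

# The chosen scale then drives a single CFG generation, e.g.
# pipe(prompt, guidance_scale=predict_best_scale(...)).
```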
6. Empirical Validation and Performance
Experiments on MSCOCO 2014 (images, SDXL backbone; 3,000 validation captions) and AudioCaps (audio, AudioLDM2 backbone; 1,000 validation captions) demonstrate quantifiable improvements:
Image Generation (Table 1)
| Method | FID↓ | CLIP↑ |
|---|---|---|
| No guidance ($w{=}1$) | 62.44 | 0.27 |
| Vanilla CFG (fixed default $w_0$) | 31.04 | 0.31 |
| Prompt-aware ($w^*(p)$) | 30.74 | 0.33 |
Audio Generation (Table 2)
| Method | CE↑ | CU↑ | PC↑ | PQ↑ |
|---|---|---|---|---|
| No guidance ($w{=}1$) | 3.62 | 5.18 | 3.13 | 5.76 |
| Vanilla CFG (fixed default $w_0$) | 3.66 | 5.25 | 3.04 | 5.79 |
| Prompt-aware ($w^*(p)$) | 3.68 | 5.22 | 3.16 | 5.81 |

CE, CU, PC, and PQ denote the four AudioBox-Aesthetics axes: content enjoyment, content usefulness, production complexity, and production quality.
- Ablation: Using all four training metrics (KID, CLIP, ImageReward, Precision) outperforms KID+CLIP or vanilla CFG, with FID 30.74 versus 31.81 (KID+CLIP only) and 31.04 (CFG default).
- Perceptual Preference: Human raters prefer prompt-aware outputs over vanilla CFG in approximately 60% of comparisons.
7. Trade-offs and Parameter Tuning
- Scale Dynamics: Increasing the guidance scale generally increases semantic alignment (CLIP, Audio-CLAP) but can reduce diversity (Recall) and, past a mid-range $w$, lead to fidelity loss via over-sharpening.
- Quadratic Penalty: The regularizer $\gamma\,(w - w_0)^2$ discourages extreme scale choices when predicted utility gains are marginal.
- Utility Weights:
- Increase the weights on alignment metrics (e.g., $\lambda_{\text{CLIP}}$) for alignment-sensitive tasks.
- For highly detailed prompts, use lower alignment weights and a higher $\gamma$ to mitigate over-guidance.
- Dev-set sweeps allow empirical setting of $\{\lambda_m\}$ and $\gamma$ to match task priorities (see the sketch below).
These mechanisms enable controlled trade-off navigation across fidelity, alignment, and diversity in prompt-conditional generation.
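Such a dev-set sweep can be a plain grid search; in the sketch below, `evaluate_dev` is a hypothetical callback that runs the selection pipeline on development prompts and returns a scalar score, and the grid values are illustrative:

```python
import itertools

def sweep_hyperparams(evaluate_dev,
                      lam_align_grid=(0.5, 1.0, 2.0),
                      gamma_grid=(0.0, 0.05, 0.1, 0.2)):
    """Grid-search the alignment weight and gamma; `evaluate_dev` is a
    hypothetical callback returning a scalar dev-set score (higher is better)."""
    return max(itertools.product(lam_align_grid, gamma_grid),
               key=lambda hp: evaluate_dev(lam_align=hp[0], gamma=hp[1]))
```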
By directly modeling prompt-scale dependencies with a learned Quality Curve Predictor and utility-based scale selection, the Scale-Aware Prompt Decoder provides a practical, computationally efficient enhancement over fixed CFG weighting, delivering improved generation metrics and perceptual outcomes across diverse prompts (Zhang et al., 25 Sep 2025).