DominanceBench: Unified Benchmark Framework
- DominanceBench is a unified framework spanning uncertainty quantification and generative modeling that evaluates lower-prevision decision criteria and the DvD phenomenon.
- It leverages rigorous algorithms and linear programming solvers to efficiently identify optimal gambles via Γ-maximin, Γ-maximax, and interval dominance criteria.
- It quantifies concept suppression in diffusion models using metrics such as DvD and Focus Scores to diagnose model collapse and guide diversity improvements.
DominanceBench is a unified benchmarking and testbed framework serving distinct research domains in uncertainty quantification and generative modeling. Its two principal instantiations are: (1) as a systematic environment for evaluating algorithms for lower-prevision-based decision criteria under severe uncertainty (Nakharutai et al., 2021, Nakharutai et al., 2019); and (2) as a diagnostic benchmark for quantifying concept dominance in multi-concept text-to-image generative models, specifically addressing the Dominant-vs-Dominated (DvD) phenomenon (Jeong et al., 19 Dec 2025). Both usages share rigorous dataset construction, controlled problem specification, and the explicit comparison of algorithms or model behaviors via formal metrics.
1. Formal Characterization and Scope
In uncertainty-driven decision analysis, DominanceBench comprises instance generators, algorithmic solvers, and metric suites to evaluate the identification of optimal gambles according to three criteria: Γ-maximin, Γ-maximax, and interval dominance. These criteria prioritize decisions whose reward is robust under partially specified probability models, formalized using lower previsions and their natural extensions. The instance space and the set of candidate gambles $\mathcal{K}$ are constructed randomly, and the lower prevision $\underline{P}$ is defined on a finite domain $\mathrm{dom}\,\underline{P}$.
In the generative modeling context, DominanceBench evaluates diffusion models' propensity for single-token dominance in multi-concept prompts. The DvD phenomenon—where one concept visually suppresses another—is assessed using specific prompt and image generation protocols, concept-specific visual question answering (VQA), and cross-attention analytical metrics.
2. Mathematical Foundations and Benchmarked Criteria
Lower-Prevision Decision Criteria
DominanceBench instantiates the following criteria (Nakharutai et al., 2021):
- Γ-maximin: seeks $\arg\max_{f \in \mathcal{K}} \underline{E}(f)$, where $\underline{E}$ is the natural extension of the lower prevision $\underline{P}$.
- Γ-maximax: seeks $\arg\max_{f \in \mathcal{K}} \overline{E}(f)$, with $\overline{E}(f) = -\underline{E}(-f)$ the conjugate upper prevision.
- Interval dominance: $f$ interval-dominates $g$ if $\underline{E}(f) > \overline{E}(g)$. The benchmark identifies the undominated set $\{f \in \mathcal{K} : \text{no } g \in \mathcal{K} \text{ satisfies } \underline{E}(g) > \overline{E}(f)\}$.
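Once lower and upper natural-extension values are available for every candidate gamble, the three criteria reduce to simple comparisons. A minimal Python sketch with illustrative values (not drawn from the benchmark):

```python
import numpy as np

# Hypothetical precomputed bounds for four candidate gambles f_0..f_3:
# lower[i] = E_lower(f_i), upper[i] = E_upper(f_i) (illustrative values).
lower = np.array([1.0, 0.8, 1.4, 0.2])
upper = np.array([2.0, 3.1, 1.9, 0.9])

gamma_maximin = int(np.argmax(lower))  # robust: best worst-case expectation
gamma_maximax = int(np.argmax(upper))  # optimistic: best best-case expectation

# Interval dominance: f_i is dominated if some g has E_lower(g) > E_upper(f_i),
# so the undominated set keeps every i with E_upper(f_i) >= max_g E_lower(g).
undominated = [i for i in range(len(lower)) if upper[i] >= lower.max()]
```

Here `f_3` is interval-dominated (its upper bound 0.9 falls below `f_2`'s lower bound 1.4), while Γ-maximin and Γ-maximax select different gambles, illustrating how the criteria can disagree.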
Each value $\underline{E}(f)$ or $\overline{E}(f)$ is solved via repeated linear programming (LP), using both primal and dual formulations. Early feasible-point identification and stopping conditions are critical to efficiency.
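The primal LP behind each evaluation can be sketched with `scipy.optimize.linprog`. This is a minimal illustration of the natural-extension LP on a toy two-outcome assessment, not the papers' MATLAB implementation:

```python
import numpy as np
from scipy.optimize import linprog

def natural_extension(f, G, Plow):
    """Lower natural extension E_lower(f) of assessments Plow[i] on gambles G[i].

    Solves: max alpha  s.t.  alpha + sum_i lam_i * (g_i(w) - Plow_i) <= f(w)
    for every outcome w, with lam >= 0; the optimum equals E_lower(f).
    """
    m, n = G.shape
    c = np.zeros(m + 1)
    c[0] = -1.0  # linprog minimizes, so minimize -alpha
    # One row per outcome: [1, g_1(w) - Plow_1, ..., g_m(w) - Plow_m] <= f(w)
    A_ub = np.hstack([np.ones((n, 1)), (G - Plow[:, None]).T])
    res = linprog(c, A_ub=A_ub, b_ub=f,
                  bounds=[(None, None)] + [(0, None)] * m)
    return -res.fun

# Toy assessments on two outcomes: P_lower(I_w1) = 0.3, P_lower(I_w2) = 0.4.
G = np.array([[1.0, 0.0], [0.0, 1.0]])
Plow = np.array([0.3, 0.4])
val = natural_extension(np.array([2.0, 1.0]), G, Plow)  # 1.3 up to solver tolerance
```

By LP duality this equals the lower envelope $\min_p p \cdot f$ over probability vectors with $p(\omega_1) \ge 0.3$ and $p(\omega_2) \ge 0.4$, attained at $p = (0.3, 0.7)$.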
DvD Analysis in Diffusion Models
DominanceBench evaluates the dominance collapse in generative diffusion models by:
- DvD Score: quantifies concept-wise visual presence for two-concept prompts via binary VQA probes.
- Focus Score: Quantifies early timesteps' cross-attention concentration via spatial and head aggregation.
- Attention Deviation and Temporal Dynamics: tracks the deviation of per-concept cross-attention over timesteps, and its increments, to expose suppression patterns.
- Interference Index (optional): compares aggregate attention weights across concepts over timesteps.
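These metrics can be sketched as follows. The functions below are illustrative stand-ins, not the exact formulas of Jeong et al.: `dvd_score` assumes binary per-image VQA answers per concept, and `focus_score` measures concentration of a nonnegative cross-attention map via normalized spatial entropy.

```python
import numpy as np

def dvd_score(vqa_a, vqa_b):
    """Illustrative dominance score: contrast of per-concept VQA
    presence rates over a batch of generated images (assumed form)."""
    pa, pb = np.mean(vqa_a), np.mean(vqa_b)
    return abs(pa - pb)

def focus_score(attn_map):
    """Illustrative concentration measure: 1 - normalized spatial entropy
    of a cross-attention map (1 = fully peaked, 0 = uniform)."""
    p = attn_map.ravel() / attn_map.sum()
    ent = -np.sum(p * np.log(p + 1e-12))
    return 1.0 - ent / np.log(p.size)

# Concept A rendered in 9/10 images, concept B in only 2/10 -> score ~0.7:
score = dvd_score([1] * 9 + [0], [1] * 2 + [0] * 8)
```

A uniform attention map yields a focus score near 0, while a map concentrated on a few pixels approaches 1, matching the intuition that dominance shows up as early, peaked attention.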
3. Algorithm Families and Optimization Strategies
Three algorithmic families were subjected to benchmarking for each decision criterion (Nakharutai et al., 2021):
| Algorithm Group | Stopping Rule | Advantages |
|---|---|---|
| Original (Algorithm 1) | Full LP solution | Baseline; robust for large $\mathrm{dom}\,\underline{P}$ |
| Warm start + early stop (Algorithm 2) | Primal/dual gap intervals | 5–100× speedup; best when $\mathrm{dom}\,\underline{P}$ is small or moderate |
| Simultaneous elimination (Algorithm 3) | Parallel bound updating | Can enumerate multiple optimizers; efficient tie handling |
Early stopping leverages primal-dual bounds: if the upper bound for the current candidate falls below the incumbent's lower bound, computation halts for that candidate. "Warm start" uses a shared feasible point for all LPs in the batch, obtained by a single phase-I LP. The simultaneous-elimination approach iterates over the candidate pool with updated bounds and drops irrecoverably suboptimal gambles.
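The bound-based pruning shared by early stopping and simultaneous elimination can be sketched with a hypothetical `prune` helper over interval bounds:

```python
def prune(bounds):
    """Drop candidates provably suboptimal for Gamma-maximin.

    bounds: {candidate: (lower, upper)} partial LP bounds after one
    early-stopped iteration. A candidate whose upper bound falls below
    the best proven lower bound (the incumbent) can never be optimal.
    """
    incumbent = max(lo for lo, hi in bounds.values())
    return {c for c, (lo, hi) in bounds.items() if hi >= incumbent}

# Candidate values lie in the given [lower, upper] intervals:
pool = prune({"f1": (1.0, 2.0), "f2": (0.1, 0.9), "f3": (1.4, 1.8)})
# "f2" is eliminated: its upper bound 0.9 < "f3"'s lower bound 1.4
```

In the simultaneous-elimination scheme, surviving candidates get their LPs advanced further (tightening `lo`/`hi`) and `prune` is reapplied until one candidate, or a set of tied optimizers, remains.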
In generative modeling, ablation studies involve single- and multi-head suppression in UNet cross-attention to parse distributed versus localized dominance mechanisms.
4. Benchmark Construction and Dataset Specifications
For lower-prevision criteria:
- Problem instances span a grid of candidate-gamble counts $|\mathcal{K}|$ and domain sizes $|\mathrm{dom}\,\underline{P}|$.
- Gambles are sampled i.i.d. from a fixed continuous distribution.
- Lower previsions are generated using a random coherent-mixture procedure, yielding sure-loss-avoiding domains.
- Testing scales to large numbers of gambles and outcomes per instance.
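A lower-envelope construction gives one way to realize the coherent-mixture idea: the pointwise minimum of finitely many linear previsions is a coherent lower prevision that avoids sure loss. A sketch of such a generator (an assumed procedure, not necessarily the papers' exact one):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_lower_prevision(n_outcomes, n_gambles, n_extreme=5):
    """Generate random gambles and a coherent lower prevision on them.

    Draws a finite credal set of probability mass functions and takes the
    lower envelope P_lower(g) = min_p E_p[g], which is coherent and
    therefore avoids sure loss.
    """
    pmfs = rng.dirichlet(np.ones(n_outcomes), size=n_extreme)  # credal-set vertices
    gambles = rng.uniform(0.0, 1.0, size=(n_gambles, n_outcomes))
    lower = (pmfs @ gambles.T).min(axis=0)  # envelope over the sampled pmfs
    return gambles, lower

gambles, lower = random_lower_prevision(n_outcomes=4, n_gambles=10)
```

Every envelope value necessarily lies between each gamble's minimum and maximum payoff, which gives a cheap sanity check on generated instances.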
For DvD generative analysis (Jeong et al., 19 Dec 2025):
- 300 balanced multi-concept prompts, with object/concept pairs sampled from LAION and stratified by diversity statistics (CLIP cosine distances).
- Each prompt evaluated on 10 images generated by Stable Diffusion 1.4 and 2.1, each image annotated via 10 binary VQA probes.
- Fine-grained DreamBooth variations constructed for controlled diversity analysis.
- Per-image and per-seed records ensure reproducibility and statistical robustness.
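A per-image record might look like the following; the schema and field names are hypothetical illustrations, not taken from the benchmark:

```python
from dataclasses import dataclass, asdict

@dataclass
class DvDRecord:
    """Hypothetical per-image record for reproducible DvD evaluation."""
    prompt_id: int
    seed: int
    model: str            # e.g. "sd-1.4" or "sd-2.1"
    vqa_concept_a: list   # 10 binary probe answers for concept A
    vqa_concept_b: list   # 10 binary probe answers for concept B

rec = DvDRecord(prompt_id=42, seed=7, model="sd-1.4",
                vqa_concept_a=[1] * 10, vqa_concept_b=[0] * 10)
row = asdict(rec)  # plain dict, ready for JSON/CSV per-seed logging
```

Fixing the seed per record lets any individual image, and hence any aggregate DvD statistic, be regenerated exactly.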
5. Evaluation Metrics and Empirical Results
Lower-Prevision Benchmarks
- Total wall-clock time is tracked across 1,000 repetitions (for Γ-maximin/Γ-maximax) and 500 (for interval dominance).
- Key findings:
  - Warm-start and early-stopping algorithms (2, 3) vastly outperform Algorithm 1 when $\mathrm{dom}\,\underline{P}$ is not excessively large.
  - Algorithm 3 uniquely returns all Γ-maximin solutions in case of ties.
  - For interval dominance, Algorithm 2 is most efficient except when $\mathrm{dom}\,\underline{P}$ is very large, where the original naive algorithm can sometimes be preferable due to phase-I overhead.
Generative Model Benchmarks
- DvD Score thresholds delineate dominance; scores beyond the threshold are counted as failures.
- Focus Score and cross-attention heatmaps visualize attention collapse.
- Key findings:
- Low-diversity concept priors strongly induce DvD behavior; increasing training diversity mitigates collapse.
- Cross-attention concentration emerges early in semantic layers and persists through the generative process.
  - Distributed multi-head mechanisms underlie DvD failures, as verified by ablation studies: ablating a single head mitigates fewer DvD cases than it does memorization collapse, and even multi-head ablations leave part of the failure pool unchanged.
- DvD persists (though at lower levels) across Stable Diffusion versions.
6. Implementation Guidelines and Extensions
- For lower-prevision decision analysis:
  - Use warm-start + early-stop algorithms when $\mathrm{dom}\,\underline{P}$ is small or moderate.
- Enumerate all maximizers with simultaneous elimination when multiplicity occurs.
  - In the large-$\mathrm{dom}\,\underline{P}$ regime, naive algorithms may be justified.
- The benchmark can be extended to other decision criteria (e.g., maximality, E-admissibility) and to alternative LP solvers (GPU-accelerated or commercial interior-point methods).
- For DvD generative evaluation:
- DominanceBench enables model selection and diagnosis of internal collapse mechanisms.
- Both output-level (DvD Score) and internal-level (Focus Score, attention deviation) metrics are critical for comprehensive analysis.
- Architectural solutions require attention rebalancing and diversity augmentation rather than mere head pruning.
- All provided code is in MATLAB, with potential for C/C++/Python and hardware-optimized implementations.
7. Illustrative Examples and Application Scenarios
- In Γ-maximin benchmarking, instance generators produce diverse configurations, enabling precise comparisons and targeted algorithmic improvements.
- In the generative domain, example cases such as "Neuschwanstein Castle coaster" consistently demonstrate one-sided dominance (DvD Score 80). Controlled DreamBooth variants with single-breed concept training exhibit near-complete suppression of secondary concepts, directly linked to training diversity statistics.
DominanceBench establishes itself as a modular, reproducible, and extensible reference point for benchmarking decision-theoretic algorithms under uncertainty as well as for diagnosing and quantifying dominance failures in state-of-the-art generative models.