DominanceBench: Unified Benchmark Framework
- DominanceBench is a unified framework spanning uncertainty quantification and generative modeling that evaluates lower-prevision decision criteria and the DvD phenomenon.
- It leverages rigorous algorithms and linear programming solvers to efficiently identify optimal gambles via Γ-maximin, Γ-maximax, and interval dominance criteria.
- It quantifies concept suppression in diffusion models using metrics such as DvD and Focus Scores to diagnose model collapse and guide diversity improvements.
DominanceBench is a unified benchmarking and testbed framework serving distinct research domains in uncertainty quantification and generative modeling. Its two principal instantiations are: (1) as a systematic environment for evaluating algorithms for lower-prevision-based decision criteria under severe uncertainty (Nakharutai et al., 2021, Nakharutai et al., 2019); and (2) as a diagnostic benchmark for quantifying concept dominance in multi-concept text-to-image generative models, specifically addressing the Dominant-vs-Dominated (DvD) phenomenon (Jeong et al., 19 Dec 2025). Both usages share rigorous dataset construction, controlled problem specification, and the explicit comparison of algorithms or model behaviors via formal metrics.
1. Formal Characterization and Scope
In uncertainty-driven decision analysis, DominanceBench comprises instance generators, algorithmic solvers, and metric suites to evaluate the identification of optimal gambles according to three criteria: Γ-maximin, Γ-maximax, and interval dominance. These criteria prioritize decisions whose reward is robust under partially specified probability models, formalized using lower previsions and their natural extensions. The instance space and the set of candidate gambles $\mathcal{K}$ are constructed randomly, and the lower prevision $\underline{P}$ is defined on a finite domain $\mathrm{dom}\,\underline{P}$.
In the generative modeling context, DominanceBench evaluates diffusion models' propensity for single-token dominance in multi-concept prompts. The DvD phenomenon—where one concept visually suppresses another—is assessed using specific prompt and image generation protocols, concept-specific visual question answering (VQA), and cross-attention analytical metrics.
2. Mathematical Foundations and Benchmarked Criteria
Lower-Prevision Decision Criteria
DominanceBench instantiates the following criteria (Nakharutai et al., 2021):
- Γ-maximin: seeks $\arg\max_{f \in \mathcal{K}} \underline{E}(f)$, where $\underline{E}$ is the natural extension of the lower prevision $\underline{P}$.
- Γ-maximax: seeks $\arg\max_{f \in \mathcal{K}} \overline{E}(f)$, with $\overline{E}(f) = -\underline{E}(-f)$ the conjugate upper prevision.
- Interval dominance: $f$ interval-dominates $g$ if $\underline{E}(f) > \overline{E}(g)$. The benchmark identifies the undominated set $\{f \in \mathcal{K} : \text{no } g \in \mathcal{K} \text{ satisfies } \underline{E}(g) > \overline{E}(f)\}$.
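Once lower and upper natural-extension values are available for every candidate gamble, the three criteria reduce to simple comparisons. A minimal Python sketch with illustrative values (not drawn from the benchmark):

```python
import numpy as np

# Hypothetical precomputed bounds for four candidate gambles f_0..f_3:
# lower[i] = E_lower(f_i), upper[i] = E_upper(f_i) (illustrative values).
lower = np.array([1.0, 0.8, 1.4, 0.2])
upper = np.array([2.0, 3.1, 1.9, 0.9])

gamma_maximin = int(np.argmax(lower))  # robust: best worst-case expectation
gamma_maximax = int(np.argmax(upper))  # optimistic: best best-case expectation

# Interval dominance: f_i is dominated if some g has E_lower(g) > E_upper(f_i),
# so the undominated set keeps every i with E_upper(f_i) >= max_g E_lower(g).
undominated = [i for i in range(len(lower)) if upper[i] >= lower.max()]
```

Here `f_3` is interval-dominated (its upper bound 0.9 falls below `f_2`'s lower bound 1.4), while Γ-maximin and Γ-maximax select different gambles, illustrating how the criteria can disagree.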
Each value $\underline{E}(f)$ or $\overline{E}(f)$ is solved via repeated linear programming (LP), using both primal and dual formulations. Early feasible-point identification and stopping conditions are critical to efficiency.
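The primal LP behind each evaluation can be sketched with `scipy.optimize.linprog`. This is a minimal illustration of the natural-extension LP on a toy two-outcome assessment, not the papers' MATLAB implementation:

```python
import numpy as np
from scipy.optimize import linprog

def natural_extension(f, G, Plow):
    """Lower natural extension E_lower(f) of assessments Plow[i] on gambles G[i].

    Solves: max alpha  s.t.  alpha + sum_i lam_i * (g_i(w) - Plow_i) <= f(w)
    for every outcome w, with lam >= 0; the optimum equals E_lower(f).
    """
    m, n = G.shape
    c = np.zeros(m + 1)
    c[0] = -1.0  # linprog minimizes, so minimize -alpha
    # One row per outcome: [1, g_1(w) - Plow_1, ..., g_m(w) - Plow_m] <= f(w)
    A_ub = np.hstack([np.ones((n, 1)), (G - Plow[:, None]).T])
    res = linprog(c, A_ub=A_ub, b_ub=f,
                  bounds=[(None, None)] + [(0, None)] * m)
    return -res.fun

# Toy assessments on two outcomes: P_lower(I_w1) = 0.3, P_lower(I_w2) = 0.4.
G = np.array([[1.0, 0.0], [0.0, 1.0]])
Plow = np.array([0.3, 0.4])
val = natural_extension(np.array([2.0, 1.0]), G, Plow)  # 1.3 up to solver tolerance
```

By LP duality this equals the lower envelope $\min_p p \cdot f$ over probability vectors with $p(\omega_1) \ge 0.3$ and $p(\omega_2) \ge 0.4$, attained at $p = (0.3, 0.7)$.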
DvD Analysis in Diffusion Models
DominanceBench evaluates the dominance collapse in generative diffusion models by:
- DvD Score: quantifies concept-wise visual presence for two-concept prompts via binary VQA probes.
- Focus Score: Quantifies early timesteps' cross-attention concentration via spatial and head aggregation.
- Attention Deviation and Temporal Dynamics: tracks the deviation of per-concept cross-attention over timesteps, and its increments, to expose suppression patterns.
- Interference Index (optional): compares aggregate attention weights across concepts over timesteps.
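These metrics can be sketched as follows. The functions below are illustrative stand-ins, not the exact formulas of Jeong et al.: `dvd_score` assumes binary per-image VQA answers per concept, and `focus_score` measures concentration of a nonnegative cross-attention map via normalized spatial entropy.

```python
import numpy as np

def dvd_score(vqa_a, vqa_b):
    """Illustrative dominance score: contrast of per-concept VQA
    presence rates over a batch of generated images (assumed form)."""
    pa, pb = np.mean(vqa_a), np.mean(vqa_b)
    return abs(pa - pb)

def focus_score(attn_map):
    """Illustrative concentration measure: 1 - normalized spatial entropy
    of a cross-attention map (1 = fully peaked, 0 = uniform)."""
    p = attn_map.ravel() / attn_map.sum()
    ent = -np.sum(p * np.log(p + 1e-12))
    return 1.0 - ent / np.log(p.size)

# Concept A rendered in 9/10 images, concept B in only 2/10 -> score ~0.7:
score = dvd_score([1] * 9 + [0], [1] * 2 + [0] * 8)
```

A uniform attention map yields a focus score near 0, while a map concentrated on a few pixels approaches 1, matching the intuition that dominance shows up as early, peaked attention.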
3. Algorithm Families and Optimization Strategies
Three algorithmic families were subjected to benchmarking for each decision criterion (Nakharutai et al., 2021):
| Algorithm Group | Stopping Rule | Advantages |
|---|---|---|
| Original (Algorithm 1) | Full LP solution | Baseline; robust for large $\mathrm{dom}\,\underline{P}$ |
| Warm start + early stop (Algorithm 2) | Primal/dual gap intervals | 5–100× speedup; best when $\mathrm{dom}\,\underline{P}$ is small or moderate |
| Simultaneous elimination (Algorithm 3) | Parallel bound updating | Can enumerate multiple optimizers; efficient tie handling |
Early stopping leverages primal-dual bounds: if the upper bound for the current candidate falls below the incumbent's lower bound, computation halts for that candidate. "Warm start" uses a shared feasible point for all LPs in the batch, obtained by a single phase-I LP. The simultaneous-elimination approach iterates over the candidate pool with updated bounds and drops irrecoverably suboptimal gambles.
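The bound-based pruning shared by early stopping and simultaneous elimination can be sketched with a hypothetical `prune` helper over interval bounds:

```python
def prune(bounds):
    """Drop candidates provably suboptimal for Gamma-maximin.

    bounds: {candidate: (lower, upper)} partial LP bounds after one
    early-stopped iteration. A candidate whose upper bound falls below
    the best proven lower bound (the incumbent) can never be optimal.
    """
    incumbent = max(lo for lo, hi in bounds.values())
    return {c for c, (lo, hi) in bounds.items() if hi >= incumbent}

# Candidate values lie in the given [lower, upper] intervals:
pool = prune({"f1": (1.0, 2.0), "f2": (0.1, 0.9), "f3": (1.4, 1.8)})
# "f2" is eliminated: its upper bound 0.9 < "f3"'s lower bound 1.4
```

In the simultaneous-elimination scheme, surviving candidates get their LPs advanced further (tightening `lo`/`hi`) and `prune` is reapplied until one candidate, or a set of tied optimizers, remains.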
In generative modeling, ablation studies involve single- and multi-head suppression in UNet cross-attention to parse distributed versus localized dominance mechanisms.
4. Benchmark Construction and Dataset Specifications
For lower-prevision criteria:
- Problem instances span a grid of candidate-gamble counts $|\mathcal{K}|$ and domain sizes $|\mathrm{dom}\,\underline{P}|$.
- Gambles are sampled i.i.d. from a fixed continuous distribution.
- Lower previsions are generated using a random coherent-mixture procedure, yielding sure-loss-avoiding domains.
- Testing scales to large numbers of gambles and outcomes per instance.
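A lower-envelope construction gives one way to realize the coherent-mixture idea: the pointwise minimum of finitely many linear previsions is a coherent lower prevision that avoids sure loss. A sketch of such a generator (an assumed procedure, not necessarily the papers' exact one):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_lower_prevision(n_outcomes, n_gambles, n_extreme=5):
    """Generate random gambles and a coherent lower prevision on them.

    Draws a finite credal set of probability mass functions and takes the
    lower envelope P_lower(g) = min_p E_p[g], which is coherent and
    therefore avoids sure loss.
    """
    pmfs = rng.dirichlet(np.ones(n_outcomes), size=n_extreme)  # credal-set vertices
    gambles = rng.uniform(0.0, 1.0, size=(n_gambles, n_outcomes))
    lower = (pmfs @ gambles.T).min(axis=0)  # envelope over the sampled pmfs
    return gambles, lower

gambles, lower = random_lower_prevision(n_outcomes=4, n_gambles=10)
```

Every envelope value necessarily lies between each gamble's minimum and maximum payoff, which gives a cheap sanity check on generated instances.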
For DvD generative analysis (Jeong et al., 19 Dec 2025):
- 300 balanced multi-concept prompts, with object/concept pairs sampled from LAION and stratified by diversity statistics (CLIP cosine distances).
- Each prompt evaluated on 10 images generated by Stable Diffusion 1.4 and 2.1, each image annotated via 10 binary VQA probes.
- Fine-grained DreamBooth variations constructed for controlled diversity analysis.
- Per-image and per-seed records ensure reproducibility and statistical robustness.
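A per-image record might look like the following; the schema and field names are hypothetical illustrations, not taken from the benchmark:

```python
from dataclasses import dataclass, asdict

@dataclass
class DvDRecord:
    """Hypothetical per-image record for reproducible DvD evaluation."""
    prompt_id: int
    seed: int
    model: str            # e.g. "sd-1.4" or "sd-2.1"
    vqa_concept_a: list   # 10 binary probe answers for concept A
    vqa_concept_b: list   # 10 binary probe answers for concept B

rec = DvDRecord(prompt_id=42, seed=7, model="sd-1.4",
                vqa_concept_a=[1] * 10, vqa_concept_b=[0] * 10)
row = asdict(rec)  # plain dict, ready for JSON/CSV per-seed logging
```

Fixing the seed per record lets any individual image, and hence any aggregate DvD statistic, be regenerated exactly.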
5. Evaluation Metrics and Empirical Results
Lower-Prevision Benchmarks
- Total wall-clock time is tracked across 1,000 repetitions (for Γ-maximin/Γ-maximax) and 500 (for interval dominance).
- Key findings:
  - Warm-start and early-stopping algorithms (2, 3) vastly outperform Algorithm 1 when $\mathrm{dom}\,\underline{P}$ is not excessively large.
  - Algorithm 3 uniquely returns all Γ-maximin solutions in case of ties.
  - For interval dominance, Algorithm 2 is most efficient except when $\mathrm{dom}\,\underline{P}$ is very large, where the original naive algorithm can sometimes be preferable due to phase-I overhead.
Generative Model Benchmarks
- DvD Score thresholds delineate dominance; scores beyond the threshold are counted as failures.
- Focus Score and cross-attention heatmaps visualize attention collapse.
- Key findings:
- Low-diversity concept priors strongly induce DvD behavior; increasing training diversity mitigates collapse.
- Cross-attention concentration emerges early in semantic layers and persists through the generative process.
  - Distributed multi-head mechanisms underlie DvD failures, as verified by ablation studies: ablating a single head mitigates fewer DvD cases than it does memorization collapse, and even multi-head ablations leave part of the failure pool unchanged.
- DvD persists (though at lower levels) across Stable Diffusion versions.
6. Implementation Guidelines and Extensions
- For lower-prevision decision analysis:
  - Use warm-start + early-stop algorithms when $\mathrm{dom}\,\underline{P}$ is small or moderate.
- Enumerate all maximizers with simultaneous elimination when multiplicity occurs.
  - In the large-$\mathrm{dom}\,\underline{P}$ regime, naive algorithms may be justified.
- The benchmark can be extended to other decision criteria (e.g., maximality, E-admissibility) and to alternative LP solvers (GPU-accelerated or commercial interior-point methods).
- For DvD generative evaluation:
- DominanceBench enables model selection and diagnosis of internal collapse mechanisms.
- Both output-level (DvD Score) and internal-level (Focus Score, attention deviation) metrics are critical for comprehensive analysis.
- Architectural solutions require attention rebalancing and diversity augmentation rather than mere head pruning.
- All provided code is in MATLAB, with potential for C/C++/Python and hardware-optimized implementations.
7. Illustrative Examples and Application Scenarios
- In Γ-maximin benchmarking, instance generators produce diverse configurations, enabling precise comparisons and targeted algorithmic improvements.
- In the generative domain, example cases such as "Neuschwanstein Castle coaster" consistently demonstrate one-sided dominance (DvD Score 80). Controlled DreamBooth variants with single-breed concept training exhibit near-complete suppression of secondary concepts, directly linked to training diversity statistics.
DominanceBench establishes itself as a modular, reproducible, and extensible reference point for benchmarking decision-theoretic algorithms under uncertainty as well as for diagnosing and quantifying dominance failures in state-of-the-art generative models.