SustainDiffusion: Fair & Energy-Efficient SD3
- SustainDiffusion is a search-based framework that improves social and environmental sustainability in SD3 through multi-objective evolutionary search over prompt and hyperparameter configurations.
- It minimizes gender and ethnic bias and reduces energy consumption by optimizing image generation, achieving up to 68% bias reduction and 48% energy savings.
- The method functions as a black-box wrapper, preserving or enhancing image quality while balancing trade-offs using NSGA-II across multiple measurable objectives.
SustainDiffusion is a search-based framework designed to enhance the social and environmental sustainability of text-to-image generation using Stable Diffusion (SD), specifically SD3-Medium. It achieves substantial reductions in gender and ethnic bias, as well as energy consumption, without any modification or fine-tuning of the underlying model architecture. SustainDiffusion functions as a black-box wrapper, optimizing over prompt structure and hyperparameters to return configurations that provide superior trade-offs across multiple competing objectives, while preserving or improving image quality relative to default SD3 (d'Aloisio et al., 21 Jul 2025).
1. Methodological Overview
SustainDiffusion runs an automated evolutionary search over combinations of prompt-engineering variables and core SD3 hyperparameters. The inputs and search space are:
- Base prompt: For example, ‘Photo portrait of a Software Engineer that codes’ or variants known to reveal demographic bias.
- Hyperparameters:
- Guidance Scale: searched in 0.1 increments
- Inference Steps:
- Prompt engineering variables:
- Up to 20 positive (“quality- and fairness-enhancing”) and 25 negative keywords, drawn from curated lists
- Uniform prompt weight (applied to all keywords via SD's “++” syntax)
Each candidate solution is a five-dimensional vector specifying the above. The search proceeds using NSGA-II, a widely-used multi-objective evolutionary algorithm, with a population size of 30 over 25 generations.
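The five-dimensional encoding described above can be sketched as a small Python data structure plus a prompt builder. This is a minimal illustration rather than the authors' implementation: the class and function names, the example keywords, the numeric values, and the reading of the prompt weight as a count of trailing "+" characters are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One point in the search space (field names are illustrative)."""
    guidance_scale: float          # searched in 0.1 increments
    inference_steps: int
    positive_keywords: List[str]   # up to 20, from a curated list
    negative_keywords: List[str]   # up to 25, from a curated list
    prompt_weight: int             # one weight shared by all keywords


def weighted(keyword: str, weight: int) -> str:
    # Assumption: a weight of w is rendered as w trailing '+' characters.
    return f"({keyword}){'+' * weight}" if weight > 0 else keyword


def build_prompts(base_prompt: str, cand: Candidate) -> Tuple[str, str]:
    """Return (positive_prompt, negative_prompt) for one candidate."""
    if len(cand.positive_keywords) > 20 or len(cand.negative_keywords) > 25:
        raise ValueError("too many keywords for the described search space")
    positive_parts = [base_prompt] + [
        weighted(k, cand.prompt_weight) for k in cand.positive_keywords
    ]
    negative_parts = [weighted(k, cand.prompt_weight) for k in cand.negative_keywords]
    return ", ".join(positive_parts), ", ".join(negative_parts)


# Hypothetical candidate: the keywords and numeric values are made up.
example = Candidate(7.0, 28, ["diverse"], ["blurry"], 2)
positive, negative = build_prompts(
    "Photo portrait of a Software Engineer that codes", example
)
```

Here `positive` is the base prompt followed by `(diverse)++`, and `negative` is `(blurry)++`; the guidance scale and step count would be passed to the SD3 pipeline separately.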
At each iteration, for each configuration:
- Inject guidance scale and inference steps into SD3.
- Construct the prompt with selected keywords and weights.
- Generate 20 images.
- Evaluate images on four pre-defined objectives (see section 2).
Pareto-optimal fronts and corresponding configurations are returned at run completion, ready for direct deployment.
2. Multi-Objective Optimization Formulation
The search problem is formulated with four explicit objectives, each grounded in measurable proxies:
- Image Quality (to maximize):
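For each image in the batch, YOLOv8 returns confidences for detected objects. The objective is the average over all such detections and images, $\text{imageQuality} = \frac{1}{\sum_i n_i} \sum_i \sum_{j=1}^{n_i} c_{ij}$, where $c_{ij}$ is the confidence for object $j$ in image $i$ and $n_i$ is the number of objects detected in image $i$.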
- Gender Bias (to minimize):
Images are labeled automatically by asking BLIP-VQA “Is the person in this image Male or Female?”. Let $p_M$ and $p_F$ be the observed fractions of images labeled male and female. The absolute difference is the bias proxy:
$\text{genderBias} = |p_M - p_F|$
- Ethnic Bias (to minimize):
BLIP-VQA is queried with “Is the person Arab, Asian, Black, or White?”. Let $p_e$ be the observed fraction for ethnicity $e$; the ethnic bias proxy is computed from these four fractions.
- CPU Energy (to minimize):
CodeCarbon, via pynvml and pyrapl, measures CPU energy consumption in kWh for the batch. The median value is used per generation. Pre-studies established this to be a suitable proxy for total compute energy.
The four objectives are optimized concurrently: maximize image quality, minimize gender bias, minimize ethnic bias, and minimize CPU energy.
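Two of these proxies are simple enough to sketch directly. The snippet below assumes the detector confidences and the VQA answers have already been collected as plain Python lists. The function names, the pooling of all detections into one mean, and the 0.0 fallback for empty inputs are assumptions made for illustration. The ethnic-bias proxy is omitted because its formula is not stated here.

```python
from collections import Counter
from typing import Sequence


def image_quality(confidences_per_image: Sequence[Sequence[float]]) -> float:
    """Mean detection confidence pooled over all detections in all images."""
    pooled = [c for image in confidences_per_image for c in image]
    # Assumption: a batch with no detections scores 0.0.
    return sum(pooled) / len(pooled) if pooled else 0.0


def gender_bias(labels: Sequence[str]) -> float:
    """Absolute difference between the male and female fractions."""
    counts = Counter(label.strip().lower() for label in labels)
    n = counts["male"] + counts["female"]
    if n == 0:
        return 0.0  # Assumption: no usable labels means no measured bias.
    return abs(counts["male"] - counts["female"]) / n


# A batch of 20 images labelled all male gives the maximum bias of 1.0.
all_male = gender_bias(["Male"] * 20)
# 13 male and 7 female labels give |0.65 - 0.35| = 0.3.
mixed = gender_bias(["male"] * 13 + ["female"] * 7)
```

Quality is maximized while the bias value is minimized, so a search loop would either negate the quality score or record the direction of each objective explicitly.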
3. Search Algorithm and Implementation
SustainDiffusion employs the elitist NSGA-II algorithm with the following setup:
- Crossover: Single-point uniform, 80% probability.
- Mutation: Applied with 20% probability per individual per generation; when applied, each attribute is mutated with 20% probability.
- Population: 30 individuals
- Generations: 25
The evaluation pipeline for each individual proceeds as described above, yielding a fitness vector $(\text{imageQuality}, \text{genderBias}, \text{ethnicBias}, \text{cpuEnergy})$. After the evolutionary loop, the Pareto front is extracted, providing a diverse set of optimal trade-off configurations.
This outer-loop optimization treats SD3 as a strict black box; no network weights are altered and no architectural changes are made. Optimization is performed only over prompt text and runtime hyperparameters.
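The final Pareto-front extraction can be illustrated with a plain non-dominated filter. This sketch covers only the dominance check, not the full NSGA-II machinery (ranking, crowding distance, selection). The fitness ordering, the function names, and the example fitness values are assumptions chosen for illustration.

```python
from typing import List, Sequence

# Assumed fitness order: (imageQuality, genderBias, ethnicBias, cpuEnergy).
# Only the first objective is maximized; the rest are minimized.
MAXIMIZE = (True, False, False, False)


def dominates(a: Sequence[float], b: Sequence[float],
              maximize: Sequence[bool] = MAXIMIZE) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    strictly_better = False
    for x, y, is_max in zip(a, b, maximize):
        if not is_max:
            x, y = -x, -y  # turn minimization into maximization
        if x < y:
            return False
        if x > y:
            strictly_better = True
    return strictly_better


def pareto_front(fitnesses: Sequence[Sequence[float]]) -> List[int]:
    """Indices of fitness vectors not dominated by any other vector."""
    return [
        i for i, f in enumerate(fitnesses)
        if not any(dominates(g, f) for j, g in enumerate(fitnesses) if j != i)
    ]


# Illustrative fitness vectors only, not measured results.
population = [
    (0.64, 1.00, 0.76, 0.00020),
    (0.69, 0.32, 0.31, 0.00015),
    (0.60, 0.30, 0.40, 0.00018),
]
front = pareto_front(population)
```

In this toy population the second vector dominates the first, while the third survives because it has the lowest gender bias, so `front` holds the indices of the second and third vectors.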
4. Empirical Evaluation and Baselines
SustainDiffusion is evaluated on a curated set of 56 “Software Engineer” prompts, each intentionally exposing axes of bias. The baseline comparisons include:
- SD3 default (unmodified)
- Random Search (matching evaluation budget)
- SustainDiffusion ablations:
- No prompt engineering (hyperparameters only)
- Image Quality only
- Image Quality + Bias
- Image Quality + Energy
Each baseline is run 10 times, and all runs are compared on the mean and standard deviation of each objective. Statistical significance is assessed using the Wilcoxon signed-rank test with Bonferroni correction ($p < 0.05/6$), and effect size is measured via the Vargha–Delaney $\hat{A}_{12}$ statistic. Multi-objective comparison uses hypervolume indicators and the number of Pareto-optimal solutions returned.
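For readers unfamiliar with the effect-size measure, the Vargha–Delaney statistic can be computed in a few lines. The sketch below uses the standard pairwise definition (probability that a value from the first sample exceeds one from the second, with ties counted as one half). The function name and sample values are illustrative, and the significance test itself is not reproduced.

```python
from typing import Sequence


def vargha_delaney_a12(x: Sequence[float], y: Sequence[float]) -> float:
    """Probability that a draw from x exceeds a draw from y (ties count 0.5)."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))


# Bonferroni-corrected significance threshold for six comparisons.
BONFERRONI_ALPHA = 0.05 / 6

# Illustrative samples: every optimized bias value is below every baseline value.
optimized_bias = [0.30, 0.20, 0.40]
baseline_bias = [1.00, 1.00, 0.90]
effect = vargha_delaney_a12(optimized_bias, baseline_bias)
```

A value of 0.5 means the two samples are stochastically indistinguishable. For a minimized objective such as bias, values near 0 indicate that the first sample is consistently lower.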
5. Key Outcomes and Quantitative Results
On the canonical prompt “Photo portrait of a Software Engineer that codes”, the main results are as follows:
| Metric | SD3 Default (mean ± σ) | SustainDiffusion (mean ± σ) | Relative Change |
|---|---|---|---|
| Gender Bias | 1.00 ± 0.00 | 0.32 ± 0.29 | –68% |
| Ethnic Bias | 0.76 ± 0.11 | 0.31 ± 0.14 | –59% |
| CPU Energy (kWh) | 0.00020 ± 2e–6 | 0.00015 ± 3e–5 | –25% |
| GPU Energy (kWh) | 0.0019 ± 2.3e–5 | 0.00094 ± 2e–4 | –50% |
| Total Energy (CPU+GPU) | — | — | ≈–48% |
| Image Quality (YOLO avg conf) | 0.64 ± 0.04 | 0.69 ± 0.07 | +5.3% |
Statistical significance is reported for all primary objectives ($p < 0.05/6$, large $\hat{A}_{12}$ effect size for bias; small for image quality). Pareto analysis shows that SustainDiffusion produces 28 of 30 non-dominated solutions per run, versus 0 for SD3 default and 2 for Random Search.
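The total-energy figure can be checked from the CPU and GPU rows of the table. The short calculation below uses the reported means; the variable names are illustrative, and summing the CPU and GPU means is an assumption about how the total was derived.

```python
def relative_change(baseline: float, optimized: float) -> float:
    """Signed relative change of optimized with respect to baseline."""
    return (optimized - baseline) / baseline


# Mean energy per batch in kWh, taken from the table above.
cpu_default, cpu_optimized = 0.00020, 0.00015
gpu_default, gpu_optimized = 0.0019, 0.00094

cpu_change = relative_change(cpu_default, cpu_optimized)
gpu_change = relative_change(gpu_default, gpu_optimized)
total_change = relative_change(
    cpu_default + gpu_default, cpu_optimized + gpu_optimized
)

print(f"CPU {cpu_change:+.1%}, GPU {gpu_change:+.1%}, total {total_change:+.1%}")
# prints: CPU -25.0%, GPU -50.5%, total -48.1%
```

Because GPU energy is roughly ten times CPU energy in these figures, the total change sits close to the GPU change.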
6. Consistency, Generalisation, and Limitations
Consistency across independent runs is high, with genderBias, ethnicBias, and imageQuality exhibiting no significant run-to-run variation (Kruskal-Wallis $p > 0.05$); CPU and GPU energy and generation duration are more variable, but consistent in at least 60% of pairwise comparisons. When all Pareto configurations are applied to the full test set, SustainDiffusion dominates SD3 default on all four objectives for 37.5% of prompts, and on at least two objectives for 58.9%, attesting to generalizability. Similar superiority is observed against Random Search and a hand-crafted “Fair” prompt.
Image quality, as measured by YOLO-confidence, is preserved or slightly improved; qualitative inspection confirms that photorealism and content fidelity are visually retained, despite lower sampling step counts and altered prompt structure.
Limitations of SustainDiffusion include reliance on BLIP-VQA for bias measurement (restricted to binary or four-class decisions), dependence on hardware-specific energy readings (despite protocolized measurement), and a computation cost of approximately 20 hours per search run (however, this is quickly amortized over ∼2,880 subsequent prompt usages).
7. Broader Implications and Integration
SustainDiffusion demonstrates that multi-objective, black-box search over prompt engineering and sampling hyperparameters can realize substantial joint advances in bias mitigation and environmental impact for text-to-image models, while preserving output quality. As it makes no demands on model retraining or architectural access, it is directly compatible with standard SD3 inference pipelines. The empirical success of joint optimization over social (fairness) and environmental (energy) axes shows that trade-offs are not inevitable; both can be optimized without regression on image quality.
A plausible implication is that such prompt/hyperparameter-level methods may become a standard toolkit wherever model retraining is impractical, and that similar frameworks may extend to other generative architectures and fairness/efficiency metrics (d'Aloisio et al., 21 Jul 2025).