Multi-CMGAN+/+: Multi-Objective GANs

Updated 26 December 2025
  • The paper introduces Multi-CMGAN+/+, a GAN architecture that integrates multiple loss functions with dynamic weighting to optimize perceptual and fidelity metrics.
  • It employs principled multi-objective optimization methods like hypervolume maximization and multiple gradient descent to achieve balanced trade-offs among competing objectives.
  • Empirical results on CHiME challenge datasets show improved DNSMOS scores and highlight trade-offs in speech enhancement performance, demonstrating both strengths and limitations.

Multi-objective GANs, and specifically the Multi-CMGAN+/+ family, refer to adversarial neural architectures that explicitly integrate multiple, potentially competing, objective functions into the generative modeling process. This is motivated by applications where a single criterion (such as pixel-level fidelity or realism) is insufficient, and several task-specific or perceptual metrics must be optimized jointly. The evolution of multi-objective GANs has produced advanced designs that target diverse problem domains including image synthesis, tabular data synthesis, optimization of engineering designs, diverse coverage of data modes, and real-world audio enhancement.

1. Multi-Objective GAN Problem Formulation

Traditional GANs define a single minimax game between a generator and a discriminator, with one scalar loss for each side. Multi-objective GANs extend this central formulation to encompass a vector of objectives, $\{f_i(\theta)\}_{i=1}^k$, which may reflect various task-based losses (e.g., adversarial, perceptual, pixel, privacy, diversity, or downstream performance metrics).

In classic image enhancement with GANs, objective terms include adversarial loss ($\mathcal{L}_{\mathrm{GAN}}$), a pixel-wise loss ($\mathcal{L}_{\mathrm{pix}}$), and a perceptual feature loss ($\mathcal{L}_{\mathrm{fea}}$), among others. The naive approach is to aggregate these losses by a static weighted sum,

$$\mathcal{L}_\mathrm{sum}(\theta)=\sum_{i=1}^k \alpha_i f_i(\theta), \quad \alpha_i > 0,$$

but tuning $\{\alpha_i\}$ is empirically challenging and typically fails to represent the true Pareto front for non-convex loss landscapes (Su et al., 2020).
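The static aggregation above can be sketched in a few lines (the loss values and weights are toy examples, not figures from the cited papers); the weights stay fixed regardless of how the individual losses evolve during training, which is precisely what makes hand-tuning them difficult:

```python
# Minimal sketch of static weighted-sum loss aggregation.
# Loss values and weights below are illustrative only.

def weighted_sum(losses, alphas):
    """Aggregate a vector of losses with static positive weights."""
    assert len(losses) == len(alphas) and all(a > 0 for a in alphas)
    return sum(a * f for a, f in zip(alphas, losses))

# Example: adversarial, pixel, and feature losses at some training step.
losses = [0.9, 0.05, 0.3]
alphas = [1.0, 100.0, 10.0]          # hand-tuned, as is typical in practice
print(weighted_sum(losses, alphas))  # ~8.9; the weights never adapt
```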

Similarly, in settings with multiple discriminators or multi-headed architectures (e.g., for realism and privacy), the generator faces $K$ loss functions, each associated with one discriminator or metric prediction task (Albuquerque et al., 2019, DeSmet et al., 2021).

2. Principled Multi-Objective Optimization Methods in GANs

The field has produced algorithmic solutions that treat the multi-loss problem as an explicit multi-objective optimization (MOO) challenge. The key approaches include:

  • Hypervolume Maximization (HV): The generator's performance is judged by the hypervolume in objective space between the loss vector and a nadir/reference point $\mathbf{r}$:

$$H(\theta) = \prod_{i=1}^k (r_i - f_i(\theta)).$$

The training loss is then the negative log hypervolume, yielding autobalancing weights for each loss dimension:

$$\mathcal{L}_\mathrm{HV}(\theta) = -\sum_{i=1}^k \log(r_i - f_i(\theta)),$$

leading to dynamic, loss-adaptive gradient weighting (Su et al., 2020, Albuquerque et al., 2019).
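A minimal sketch of the hypervolume loss and its implied adaptive weights, assuming a hand-picked nadir point (the values are illustrative, not from the cited work): since the gradient of $\mathcal{L}_\mathrm{HV}$ with respect to $f_i$ is $1/(r_i - f_i)$, objectives closer to the nadir automatically receive larger weight.

```python
import math

# Sketch of hypervolume-based loss aggregation (notation from the text;
# the nadir point and loss values are made-up examples).

def hv_loss(losses, nadir):
    """Negative log hypervolume; requires f_i < r_i for every objective."""
    assert all(f < r for f, r in zip(losses, nadir))
    return -sum(math.log(r - f) for f, r in zip(losses, nadir))

def hv_weights(losses, nadir):
    """d(hv_loss)/d(f_i) = 1/(r_i - f_i): objectives closer to the
    nadir point automatically receive larger gradient weight."""
    return [1.0 / (r - f) for f, r in zip(losses, nadir)]

nadir = [2.0, 2.0, 2.0]
losses = [1.9, 0.5, 1.0]          # first objective is near the nadir
print(hv_weights(losses, nadir))  # first weight dominates (~10 vs ~0.67, 1.0)
```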

  • Multiple Gradient Descent (MGD): The update direction is computed by solving for a Pareto-stationary direction in the loss vector space, ensuring no objective can be improved without worsening another (Albuquerque et al., 2019).
  • Game-theoretic Multi-Head/Agent Approaches: Architectures such as HydraGAN instantiate multiple generators and discriminators, each specializing in a distinct objective, and define training as a coupled multi-agent game with equilibrium guarantees (DeSmet et al., 2021). Similarly, Multi-Generator GANs maximize sample diversity across generators, effectively maximizing the Jensen-Shannon divergence between their output distributions (Hoang et al., 2017).
  • Reinforcement-Learned Action Selection: In data synthesis, evolutionary or reinforcement learning controllers select among loss functions/optimizers per generator instance, and multi-objective Pareto selection is performed periodically via NSGA-II (Ran et al., 15 Apr 2024).
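For the two-objective case, the MGD update direction admits a closed form: the min-norm point on the segment between the two gradients. A pure-Python sketch with toy gradient vectors (a real implementation would operate on framework tensors):

```python
# Two-objective multiple gradient descent (MGD) direction, closed form.
# Toy vectors; illustrative only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mgd_direction(g1, g2):
    """Min-norm convex combination alpha*g1 + (1-alpha)*g2.
    Descending along its negative decreases both objectives
    (or certifies Pareto stationarity if the result is ~0)."""
    diff = [a - b for a, b in zip(g1, g2)]   # g1 - g2
    denom = dot(diff, diff)
    if denom == 0.0:                          # identical gradients
        return list(g1)
    # Minimizer of ||alpha*g1 + (1-alpha)*g2||^2, clipped to [0, 1].
    alpha = max(0.0, min(1.0, dot(g2, [-d for d in diff]) / denom))
    return [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]

d = mgd_direction([1.0, 0.0], [0.0, 1.0])
print(d)  # [0.5, 0.5]: positive inner product with both gradients
```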

3. The Multi-CMGAN+/+ Model: Architecture and Training

Multi-CMGAN+/+ is a multi-objective extension of the Conformer-based MetricGAN (CMGAN) family for speech enhancement, engineered for settings where speech quality can only be assessed by non-intrusive, potentially competing, metrics (Close et al., 2023).

  • Components:
    • Generator $\mathcal{G}$: Convolutional encoder–conformer–decoder that takes the noisy STFT and produces an enhanced STFT.
    • Discriminator/Metric Predictor $\mathcal{D}$: Takes HuBERT front-end features and predicts multiple differentiable, normalized speech quality scores $Q'_i$.
    • Pseudo-Generator $\mathcal{N}$: BLSTM-based, generates augmentation outputs used to adversarially challenge $\mathcal{D}$.
  • Loss Structure:
    • $\mathcal{D}$ loss: Multi-label MSE for each target metric, trained on clean, noisy, enhanced ($\mathcal{G}$), and pseudo-enhanced ($\mathcal{N}$) samples with historical replay for robustness.
    • $\mathcal{N}$ loss: MSE to push all metric predictions to the upper bound.
    • $\mathcal{G}$ loss: Sum of adversarial metric loss (drives $\mathcal{D}$'s metric outputs high), waveform L1 distance, and SI-SDR loss.
  • Training alternates updates for $\mathcal{D}$, $\mathcal{N}$, and $\mathcal{G}$ within each batch, maintaining a dynamic loss balancing regime and leveraging a historical buffer of $\mathcal{G}$/$\mathcal{N}$ outputs for stability.
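The alternating update scheme can be sketched schematically; the stub update functions, buffer size, and replay count below are illustrative placeholders, not details from the paper:

```python
import random
from collections import deque

# Schematic of the alternating D / N / G update scheme with a historical
# replay buffer. Stubs stand in for the real optimization steps.

HISTORY_SIZE = 100                     # illustrative buffer capacity
history = deque(maxlen=HISTORY_SIZE)   # past G/N outputs replayed to D

def update_discriminator(batch, replay):
    """Stub: train D on clean/noisy/enhanced/pseudo samples plus replay."""
    pass

def update_pseudo_generator(batch):
    """Stub: N pushes D's metric predictions toward the upper bound."""
    return f"N({batch})"

def update_generator(batch):
    """Stub: G's adversarial metric + waveform L1 + SI-SDR losses."""
    return f"G({batch})"

def train_step(batch):
    replay = random.sample(list(history), min(8, len(history)))
    update_discriminator(batch, replay)             # 1. metric predictor D
    history.append(update_pseudo_generator(batch))  # 2. pseudo-generator N
    history.append(update_generator(batch))         # 3. generator G

for step in range(3):
    train_step(f"batch{step}")
print(len(history))  # two buffered outputs per step
```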

4. Empirical Results and Performance Characterization

Multi-CMGAN+/+ was evaluated on the CHiME-7 UDASE challenge (real and simulated datasets). Its performance was characterized across several metric triplets (DNSMOS SIG, BAK, OVR; PESQ):

  • On the real CHiME-5 set, optimizing all three DNSMOS components yields the highest overall DNSMOS (OVR 3.42), at the cost of a small decrease in SIG relative to the single-objective baseline, but with overall improvement in OVR and BAK.
  • For simulated datasets, SI-SDR degrades under most multi-metric objectives except (BAK, OVR, PESQ), evidencing the classic tradeoff between perceptual metric improvement and time-domain fidelity.
  • Historical buffer training for $\mathcal{D}$ prevents overfitting to the current generator's output distribution.
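Since SI-SDR is the time-domain fidelity side of these trade-offs, a pure-Python sketch of its standard definition may help (toy signals; the estimate must not equal the reference exactly, or the residual energy is zero):

```python
import math

# Scale-invariant signal-to-distortion ratio (SI-SDR), in dB.
# Toy signals for illustration.

def si_sdr(estimate, reference):
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]            # projection onto reference
    noise = [e - t for e, t in zip(estimate, target)]  # residual distortion
    return 10 * math.log10(
        sum(t * t for t in target) / sum(n * n for n in noise)
    )

ref = [0.0, 1.0, 0.5, -0.5]
est = [0.1, 0.9, 0.6, -0.4]           # close to the reference -> high SI-SDR
print(round(si_sdr(est, ref), 2))     # the score is invariant to rescaling est
```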

A tabular summary:

| Variant | OVR (real) | BAK (real) | SIG (real) | SI-SDR (sim) |
|---|---|---|---|---|
| Unprocessed | 2.84 | 2.92 | 3.48 | 6.59 |
| CMGAN+/+ (SIG) | 3.29 | 3.85 | 3.76 | 4.71 |
| Multi-CMGAN+/+ (SIG, BAK, OVR) | 3.42 | 3.86 | 3.56 | 3.36 |
| Multi-CMGAN+/+ (BAK, OVR, PESQ) | 3.12 | 3.86 | 3.49 | 6.95 |

This demonstrates that multi-objective learning with non-intrusive metric predictors leads to empirically improved real-world perceptual scores.

5. Extensions and Limitations in Multi-Objective GANs

The hypervolume maximization approach is broadly applicable to GANs with any set of differentiable losses, and can in principle handle arbitrary numbers of constraints or objectives (e.g., cycle-consistency, style, identity losses in image translation) (Su et al., 2020). For tabular data, evolutionary schemes with Pareto-based archiving enable balancing utility with disclosure risk, with early stopping via an Improvement Score to capture the "sweet spot" on the risk–utility frontier (Ran et al., 15 Apr 2024).
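The Pareto-based archiving step can be illustrated with a simple non-dominated filter (toy points on a lower-is-better plane, e.g. disclosure risk vs. negated utility; NSGA-II adds non-dominated sorting into ranked fronts and crowding-distance selection on top of this):

```python
# Non-dominated (Pareto) filter over candidate objective vectors,
# where lower is better in every coordinate. Toy points only.

def dominates(a, b):
    """a dominates b if it is no worse everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only candidates not dominated by any other candidate."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

candidates = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2), (0.6, 0.6)]
print(pareto_front(candidates))  # (0.6, 0.6) is dominated by (0.5, 0.5)
```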

Limitations of these methods include:

  • Trade-off curves imply that improving one objective may actively harm another (e.g., SI-SDR drop under perceptual metric optimization in Multi-CMGAN+/+).
  • The selection and dynamic adjustment of loss weights, reference points, or reward parameters in MOO is often heuristic and sensitive to hyperparameters.
  • Surrogate-based losses (as in MO-PaDGAN) can drive generators out-of-distribution unless auxiliary validity constraints are imposed (Chen et al., 2020).

6. Future Directions and Open Challenges

Recent work proposes the integration of automatic loss weighting, dynamic scheduling of objectives, and inclusion of intelligibility or downstream-use metrics (such as ASR accuracy) in the multi-objective stack (Close et al., 2023), and these directions remain under active investigation.

A plausible implication is that future multi-objective GANs will increasingly leverage both architectural and algorithmic innovations—incorporating multi-head, multi-discriminator, evolutionary, and game-theoretic approaches—to handle complex trade-offs in real-world generative modeling tasks.
