Geneses: Unified Speech Enhancement & Separation
- The paper introduces a unified generative framework that merges speech enhancement and separation via latent flow matching.
- It leverages a multi-modal diffusion Transformer and a VAE-based latent space to efficiently restore clean speech from noisy mixtures.
- Results under complex degradations show near-ground-truth performance, highlighting the method's robustness and scalability.
Geneses is a unified generative framework for performing both speech enhancement (SE) and speech separation (SS) in the presence of challenging real-world distortions, including both additive noise and complex non-linear degradations. Its core is a latent flow matching approach that leverages a multi-modal diffusion Transformer conditioned on self-supervised learning features, providing a scalable and robust solution to multi-speaker and degraded-audio scenarios (Asai et al., 26 Jan 2026).
1. Motivation and Unified Problem Formulation
The traditional paradigm in speech processing treats enhancement (background noise and artifact removal) and separation (demixing of overlapping speakers) as distinct, often sequential, tasks. Such modular approaches fail to robustly generalize under complex degradations (e.g., reverberation, clipping, packet loss) frequently encountered in real environments. Geneses forgoes explicit target mask estimation or supervised regression in favor of a continuous, conditional, generative modeling framework that unifies SE and SS as latent-variable inference over speaker-specific clean representations conditioned on mixture observations.
This approach provides substantial gains in parameter and architectural efficiency, simplifying pipelines and enabling joint training for robust end-to-end inference. By learning to generate clean speech features for each speaker directly from noisy mixtures and handling diverse degradations, Geneses advances beyond previous deterministic or mask-based models (Scheibler et al., 2022, Wang et al., 11 Aug 2025, Park et al., 7 Dec 2025).
2. System Architecture and Latent Generative Modeling
The Geneses architecture consists of three primary components:
- Input Feature Extractor: Applies a fine-tuned w2v-BERT 2.0 model to extract self-supervised learning (SSL) representations from the input mixture. LoRA fine-tuning enhances robustness to domain shift.
- Variational Autoencoder (VAE): Encodes clean speech waveforms into a low-dimensional latent space (16-dimensional per speaker at 25 Hz, decoder based on DAC GAN, frozen during main training). The VAE stabilizes modeling by imposing a structured latent prior and preserving detailed speech characteristics.
- Flow Predictor (Multi-Modal Diffusion Transformer, MM-DiT): A 12-layer Transformer (768 hidden, 12 heads, FlashAttention) that operates on the concatenation of VAE latents, SSL features, and timestep embeddings. It predicts the velocity field for latent flow matching, serving as a conditional generator for clean speech latents.
Geneses frames both SE and SS as conditional generative modeling of latent trajectories: given the mixture, it samples two independent noise latents (one per speaker), integrates a continuous latent flow ODE driven by MM-DiT to transport the noise to clean representations for each speaker, and finally decodes these with the VAE decoder.
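To make the size of the latent representation concrete, the following minimal sketch computes the latent tensor shape for a clip of a given duration, using the figures quoted above (16 dimensions per speaker, 25 Hz, two speakers). The function name `latent_shape` and the round-up-to-a-whole-frame rule are assumptions of this sketch, not details from the paper.

```python
import math

LATENT_RATE_HZ = 25   # latent frames per second (from the text)
LATENT_DIM = 16       # latent channels per speaker (from the text)
NUM_SPEAKERS = 2      # two-speaker mixtures (from the text)


def latent_shape(duration_s: float) -> tuple[int, int, int]:
    """Return (speakers, frames, channels) for a clip of the given length.

    Rounding up to a whole frame is an assumption of this sketch.
    """
    frames = math.ceil(duration_s * LATENT_RATE_HZ)
    return (NUM_SPEAKERS, frames, LATENT_DIM)


shape = latent_shape(4.0)
values = shape[0] * shape[1] * shape[2]
print(shape, values)  # (2, 100, 16) 3200
```

Under these assumptions, a four-second mixture is described by 3,200 latent values in total, which is the space in which the flow predictor operates.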
3. Latent Flow Matching and Multi-Modal Conditioning
At the core is a latent flow matching strategy, implemented as follows:
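- Let each clean speaker signal $x^{(k)}$, $k \in \{1, 2\}$, map to a VAE latent $z_1^{(k)}$; sample Gaussian noise $z_0^{(k)} \sim \mathcal{N}(0, I)$.
- Define the linearly interpolated path in latent space:
$$z_t^{(k)} = (1 - t)\, z_0^{(k)} + t\, z_1^{(k)}, \qquad t \in [0, 1].$$
- A vector field predictor $v_\theta$, conditioned on the SSL features $c$ of the mixture, is trained to match the ideal trajectory:
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, z_0,\, z_1} \left\| v_\theta(z_t, t, c) - (z_1 - z_0) \right\|^2 .$$
- The conditional ODE,
$$\frac{dz_t}{dt} = v_\theta(z_t, t, c),$$
is numerically solved (Euler’s method, step size 0.01) at inference to transport noise to speaker-specific clean latents $\hat{z}_1^{(k)}$.
- Each $\hat{z}_1^{(k)}$ is decoded by the frozen VAE to yield the clean waveform $\hat{x}^{(k)}$.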
- SSL features provide multi-modal and temporally aligned conditioning, enhancing discrimination under noise or speaker overlap.
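The steps in this list can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: it assumes the path runs from noise at t = 0 to the clean latent at t = 1, treats a latent as a flat list of floats, and replaces the learned MM-DiT with a hypothetical `oracle_velocity` that is handed the clean latent as its conditioning. All function names are made up for this example.

```python
import random


def interpolate(noise, clean, t):
    """Point on the straight path from noise (t=0) to the clean latent (t=1)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]


def flow_matching_loss(predict_velocity, clean, cond, rng):
    """One-sample estimate of the flow matching loss for a single latent."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    t = rng.random()
    z_t = interpolate(noise, clean, t)
    target = [c - n for c, n in zip(clean, noise)]  # straight-path velocity
    pred = predict_velocity(z_t, t, cond)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(clean)


def euler_sample(predict_velocity, cond, dim, rng, step=0.01):
    """Integrate dz/dt = v(z, t, cond) from t=0 to t=1 with Euler steps."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n_steps = round(1.0 / step)
    for i in range(n_steps):
        t = i * step
        v = predict_velocity(z, t, cond)
        z = [zi + step * vi for zi, vi in zip(z, v)]
    return z


def oracle_velocity(z, t, cond):
    # Toy stand-in for the learned predictor: `cond` is the clean latent
    # itself, so the exact straight-line velocity can be returned.
    return [(c - zi) / (1.0 - t) for zi, c in zip(z, cond)]


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(16)]  # one 16-dim latent frame
loss = flow_matching_loss(oracle_velocity, clean, clean, rng)
sample = euler_sample(oracle_velocity, clean, len(clean), rng)
max_err = max(abs(s - c) for s, c in zip(sample, clean))
print(loss, max_err)
```

With the oracle, the loss is zero up to rounding and 100 Euler steps land on the clean latent. In the real system the predictor only sees SSL features of the mixture, so neither quantity would be exactly zero.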
Crucially, no permutation-invariance is needed during training, as explicit speaker order is maintained in the latent representations.
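For readers unfamiliar with the distinction, the sketch below contrasts a fixed-order loss with the permutation-invariant loss commonly used in separation training. It is a generic illustration with made-up function names and toy data, not code from the paper; it only shows what the extra search over speaker assignments would compute.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fixed_order_loss(preds, targets):
    """Average MSE with prediction slot i matched to target speaker i."""
    return sum(mse(p, t) for p, t in zip(preds, targets)) / len(targets)


def permutation_invariant_loss(preds, targets):
    """Minimum of the fixed-order loss over all speaker assignments."""
    return min(
        fixed_order_loss([preds[i] for i in perm], targets)
        for perm in permutations(range(len(preds)))
    )


targets = [[1.0, 1.0], [-1.0, -1.0]]
swapped = [[-1.0, -1.0], [1.0, 1.0]]  # correct sources, wrong output slots
fixed = fixed_order_loss(swapped, targets)
pit = permutation_invariant_loss(swapped, targets)
print(fixed, pit)  # 4.0 0.0
```

A fixed-order loss penalizes swapped outputs, while the permutation-invariant variant does not. A model trained with the former must therefore resolve speaker order consistently on its own.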
4. Performance Evaluation and Comparative Results
Geneses is evaluated using two-speaker mixtures from LibriTTS-R under two distortion regimes: additive noise and complex degradations (including reverberation, bandwidth constraints, clipping, and packet loss, with the combination randomized per utterance). Objective evaluation employs both reference-free metrics (DNSMOS, NISQA, UTMOSv2, WER) and reference-aware metrics (ESTOI, MCD, LSD, SpeechBERTScore (SBS), SpkSim). Conventional mask-based SE-SS methods from ICASSP 2023 serve as the baseline (listed as "Conv." in the table below).
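Of the metrics listed, WER is the simplest to state exactly: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a generic textbook implementation with a hypothetical function name; the paper's ASR system and text normalization are not specified here and are not modeled.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
wer_insertions = word_error_rate("hello", "hello there big world")
print(wer_deletion, wer_insertions)
```

The first call drops one of six reference words, giving 1/6. The second adds three words to a one-word reference, giving 3.0. Because insertions count as errors, the ratio is not bounded by 1.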
Key results:
| Condition | DNSMOS ↑ | NISQA ↑ | UTMOSv2 ↑ | WER ↓ | ESTOI ↑ | MCD ↓ | LSD ↓ | SBS ↑ | SpkSim ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Ground Truth | 3.37 | 4.73 | 3.65 | 0.11 | – | – | – | – | – |
| Conv. (noise only) | 2.91 | 2.32 | 1.75 | 0.35 | 0.72 | 7.17 | 3.96 | 0.80 | 0.95 |
| Geneses (noise only) | 3.40 | 4.44 | 3.44 | 0.39 | 0.75 | 7.60 | 4.65 | 0.83 | 0.99 |
| Conv. (complex) | 2.08 | 1.34 | 0.84 | 5.54 | 0.42 | 9.24 | 5.79 | 0.61 | 0.91 |
| Geneses (complex) | 3.39 | 4.44 | 3.40 | 0.43 | 0.72 | 8.09 | 5.00 | 0.82 | 0.98 |
Under complex degradations, conventional systems experience catastrophic failure in recognition and intelligibility (WER 5.54, ESTOI 0.42), whereas Geneses keeps perceptual quality close to ground truth (DNSMOS 3.39 vs. 3.37, NISQA 4.44 vs. 4.73, UTMOSv2 3.40 vs. 3.65) while holding WER at 0.43 and ESTOI at 0.72. For additive noise, Geneses clearly outperforms the conventional baseline on perceptual metrics (e.g., NISQA 4.44 vs. 2.32) but exhibits known generative trade-offs in precise signal fidelity: it is slightly worse on WER (0.39 vs. 0.35), MCD (7.60 vs. 7.17), and LSD (4.65 vs. 3.96). This suggests that Geneses achieves high perceptual quality and robustness but, like other generative approaches, may smooth fine spectral cues.
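The "near-ground-truth" reading can be checked with simple arithmetic on the reference-free scores. The snippet below copies the complex-degradation rows from the results table above and computes each system's gap to ground truth. The dictionary layout and function name are choices made for this example.

```python
# Values copied from the results table above (complex-degradation condition).
RESULTS = {
    "ground_truth": {"DNSMOS": 3.37, "NISQA": 4.73, "UTMOSv2": 3.65},
    "conv_complex": {"DNSMOS": 2.08, "NISQA": 1.34, "UTMOSv2": 0.84},
    "geneses_complex": {"DNSMOS": 3.39, "NISQA": 4.44, "UTMOSv2": 3.40},
}


def gap_to_ground_truth(system: str) -> dict[str, float]:
    """Ground-truth score minus the system score, per metric (2 decimals)."""
    gt = RESULTS["ground_truth"]
    return {m: round(gt[m] - RESULTS[system][m], 2) for m in gt}


conv_gap = gap_to_ground_truth("conv_complex")
geneses_gap = gap_to_ground_truth("geneses_complex")
print(conv_gap)
print(geneses_gap)
```

The conventional system trails ground truth by roughly 1.3 to 3.4 points depending on the metric, while Geneses stays within about 0.3 points on all three and is marginally above ground truth on DNSMOS.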
5. Methodological Relationships and Comparison to Prior Work
Geneses builds upon and extends prior lines of unified, generative modeling for speech front-end tasks:
- Diffusion and Score-Based Separation: Earlier frameworks such as DiffSep apply continuous-time diffusion (SDE-based) to model mixture-to-source transformations, using score-based learning and permutational loss corrections to jointly solve speech enhancement and separation (Scheibler et al., 2022).
- Latent-Space Diffusion/Flow Models: Approaches like UniFlow generalize to a broader class of speech tasks (SE, SS, TSE, AEC, LASS) by operating in VAE-compressed latent space using diffusion, flow matching, or mean-flow strategies, controlled via conditional embeddings and multimodal context (Wang et al., 11 Aug 2025).
- Token-Based and LM-Driven Techniques: Works such as UniSE formulate SE and SS as conditional discrete token generation using neural audio codecs and decoder-only LMs (e.g., LLaMA-style Transformers), exploiting prompt tokens for mode selection (Yan et al., 23 Oct 2025).
- Audio-Visual Guidance and Wasserstein Autoencoders: UniVoiceLite introduces unsupervised, lightweight integration of visual cues (lip motion, face identity) using a Wasserstein autoencoder architecture, supporting SE and SS by leveraging cross-modal priors and Wasserstein latent regularization (Park et al., 7 Dec 2025).
The principal innovation of Geneses is its fusion of latent flow transport and multi-modal SSL feature conditioning, specifically enhancing resilience to complex, real-world degradations beyond the reach of mask-based or discriminative pipelines.
6. Design Insights, Limitations, and Future Research Directions
Geneses demonstrates that latent flow matching, particularly when combined with SSL conditioning and a robust VAE prior, is key to robust performance under diverse degradation paths. Ablations confirm that SSL removal or replacement of latent flows with direct diffusion degrades both separation quality and noise robustness.
Several limitations and avenues for further investigation remain:
- Slight degradation in fidelity metrics under simple noise reflects the “hallucination” effect observed in modern generative SE.
- Current models are restricted to two speakers; generalization to higher-order mixtures, conversational or real-recorded audio, and integration with more structured priors remain to be fully explored.
- Tighter coupling of VAE priors and latent conditional flows may further stabilize and improve sample quality.
A plausible implication is that as generative latent modeling frameworks mature, unified and multimodal architectures such as Geneses will replace complex, heavily parametrized, and supervised pipelines for real-world SE and SS.
7. Summary and Impact
Geneses establishes a principled unified generative model for both enhancement and separation, leveraging latent flow matching, robust multi-modal feature conditioning, and a stable VAE architecture (Asai et al., 26 Jan 2026). Its performance, especially under complex, real-world degradations, surpasses conventional mask-based pipelines and aligns with the emerging trend of scalable, task-agnostic generative models in speech processing. These results suggest that latent generative architectures, notably those integrating SSL and conditional flows, provide a strong foundation for the next generation of unified speech front-end solutions.