Geneses: Unified Speech Enhancement & Separation

Updated 20 February 2026

The paper introduces a unified generative framework that merges speech enhancement and separation via latent flow matching.
It leverages a multi-modal diffusion Transformer and a VAE-based latent space to efficiently restore clean speech from noisy mixtures.
Results under complex degradations show near-ground-truth perceptual quality, highlighting the method's robustness and scalability.

Geneses is a unified generative framework for performing both speech enhancement (SE) and speech separation (SS) in the presence of challenging real-world distortions, including both additive noise and complex non-linear degradations. Its core is a latent flow matching approach that leverages a multi-modal diffusion Transformer conditioned on self-supervised learning features, providing a scalable and robust solution to multi-speaker and degraded-audio scenarios (Asai et al., 26 Jan 2026).

1. Motivation and Unified Problem Formulation

The traditional paradigm in speech processing treats enhancement (background noise and artifact removal) and separation (demixing of overlapping speakers) as distinct, often sequential, tasks. Such modular approaches fail to robustly generalize under complex degradations (e.g., reverberation, clipping, packet loss) frequently encountered in real environments. Geneses forgoes explicit target mask estimation or supervised regression in favor of a continuous, conditional, generative modeling framework that unifies SE and SS as latent-variable inference over speaker-specific clean representations conditioned on mixture observations.

This unified formulation is intended to improve parameter and architectural efficiency: a single jointly trained model replaces a multi-stage pipeline and supports end-to-end inference. By learning to generate clean speech features for each speaker directly from noisy mixtures and handling diverse degradations, Geneses advances beyond previous deterministic or mask-based models (Scheibler et al., 2022, Wang et al., 11 Aug 2025, Park et al., 7 Dec 2025).

2. System Architecture and Latent Generative Modeling

The Geneses architecture consists of three primary components:

Input Feature Extractor: Applies a fine-tuned w2v-BERT 2.0 to extract self-supervised learning (SSL) representations $\bm c$ from the input mixture $y(t)$. LoRA fine-tuning enhances robustness to domain shift.
Variational Autoencoder (VAE): Encodes clean speech waveforms into a low-dimensional latent space (16-dimensional per speaker at 25 Hz, decoder based on DAC GAN, frozen during main training). The VAE stabilizes modeling by imposing a structured latent prior and preserving detailed speech characteristics.
Flow Predictor (Multi-Modal Diffusion Transformer, MM-DiT): A 12-layer Transformer (768 hidden, 12 heads, FlashAttention) that operates on the concatenation of VAE latents, SSL features, and timestep embeddings. It predicts the velocity field for latent flow matching, serving as a conditional generator for clean speech latents.
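
To make the latent dimensions above concrete, the following minimal sketch computes the latent tensor shape for an utterance of a given duration. The class and field names are hypothetical, and rounding the frame count to the nearest integer is an assumption; only the values 16 dimensions, 25 Hz, and two speakers come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentSpec:
    """Latent layout as described in the text; names are hypothetical."""
    latent_dim: int = 16      # latent dimensions per speaker
    frame_rate_hz: int = 25   # latent frames per second
    num_speakers: int = 2     # two-speaker mixtures

    def shape(self, duration_s: float) -> tuple:
        """Return (speakers, frames, dims) for an utterance of this length."""
        # Assumption: frame count is rounded to the nearest integer.
        frames = int(round(duration_s * self.frame_rate_hz))
        return (self.num_speakers, frames, self.latent_dim)


spec = LatentSpec()
print(spec.shape(4.0))  # (2, 100, 16)
```

Under these assumptions, a 4-second mixture corresponds to 100 latent frames of 16 dimensions per speaker, which is the sequence the flow predictor operates on.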

Geneses frames both SE and SS as conditional generative modeling of latent trajectories: given the mixture, it samples two independent Gaussian noise latents (one per speaker), integrates a latent flow ODE driven by MM-DiT to transport each to a clean speaker representation, and decodes the results with the VAE decoder.

3. Latent Flow Matching and Multi-Modal Conditioning

At the core is a latent flow matching strategy, implemented as follows:

Let each clean speaker $s$ map to a VAE latent $\bm x_1$; sample Gaussian noise $\bm x_0 \sim \mathcal{N}(0,I)$.
Define the linearly interpolated path in latent space:

$\bm x_t = (1-t)\bm x_0 + t \bm x_1,\quad t \in [0,1].$

A vector field predictor $\bm v_\theta(\bm x_t, \bm c, t)$ is trained to match the ideal trajectory:

$\mathcal{L}_{\mathrm{flow}}(\theta) = \mathbb{E}_{t, \bm x_0, \bm x_1} \left\| \bm v_\theta(\bm x_t, \bm c, t) - (\bm x_1 - \bm x_0) \right\|^2.$
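
The loss above can be estimated per batch as in the following NumPy sketch. It is an illustration under stated assumptions, not the authors' training code: `flow_matching_loss`, `oracle`, and `zero` are hypothetical names, $t$ is assumed to be drawn uniformly from $[0,1)$ (the source does not state the sampling distribution), and the squared norm is summed over the latent dimension and averaged over batch and frames.

```python
import numpy as np


def flow_matching_loss(v_theta, x1, c, rng):
    """Monte Carlo estimate of the flow matching loss for one batch.

    v_theta: callable (x_t, c, t) -> predicted velocity, same shape as x_t
    x1:      clean latents, shape (batch, frames, dim)
    c:       conditioning features, passed through to v_theta unchanged
    rng:     numpy.random.Generator
    """
    x0 = rng.standard_normal(x1.shape)                    # Gaussian noise sample
    # Assumption: t ~ Uniform[0, 1), one value per batch element.
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1, 1))
    x_t = (1.0 - t) * x0 + t * x1                         # linear interpolation path
    target = x1 - x0                                      # constant velocity of the path
    err = v_theta(x_t, c, t) - target
    # Squared norm over the latent dimension, averaged over batch and frames.
    return float(np.mean(np.sum(err ** 2, axis=-1)))


rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 100, 16))

# Toy predictors: an oracle that knows x1, and a predictor that outputs zeros.
oracle = lambda x_t, c, t: (x1 - x_t) / (1.0 - t)
zero = lambda x_t, c, t: np.zeros_like(x_t)

print(flow_matching_loss(oracle, x1, None, rng))
print(flow_matching_loss(zero, x1, None, rng))
```

The oracle works because $\bm x_1 - \bm x_t = (1-t)(\bm x_1 - \bm x_0)$ on the linear path, so its loss is zero up to floating-point error, while the zero predictor is penalized by the full squared norm of the target velocity.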

The conditional ODE,

$\frac{d\bm x_t}{dt} = \bm v_\theta(\bm x_t, \bm c, t),$

is solved numerically at inference (Euler’s method, step size 0.01, i.e., 100 steps over $t \in [0,1]$) to transport noise to speaker-specific clean latents.
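
A minimal sketch of this Euler integration is given below, assuming a generic callable velocity field. The function name `euler_sample` and the toy field are hypothetical; a real system would call the trained MM-DiT in place of `toy_field` and then pass the result to the frozen VAE decoder, which is omitted here.

```python
import numpy as np


def euler_sample(v_theta, c, shape, step=0.01, rng=None):
    """Integrate dx/dt = v_theta(x, c, t) from t=0 to t=1 with Euler's method."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(shape)          # start from Gaussian noise x_0
    n_steps = int(round(1.0 / step))        # step 0.01 -> 100 steps
    for i in range(n_steps):
        t = i * step                        # current time in [0, 1)
        x = x + step * v_theta(x, c, t)     # explicit Euler update
    return x


# Toy velocity field that points at a fixed target latent (illustration only).
target = np.full((2, 100, 16), 0.5)
toy_field = lambda x, c, t: (target - x) / (1.0 - t)

x_hat = euler_sample(toy_field, c=None, shape=target.shape,
                     rng=np.random.default_rng(0))
print(np.max(np.abs(x_hat - target)))
```

For this toy field the trajectory is a straight line, so Euler's method reaches the target up to floating-point error; a learned field is generally curved, which is why the step size matters in practice.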

Each transported latent $\bm x_1^k$ (one per speaker $k$) is decoded by the frozen VAE decoder to yield the clean waveform estimate $\hat{s}_k(t)$.
SSL features provide multi-modal and temporally aligned conditioning, enhancing discrimination under noise or speaker overlap.

Crucially, no permutation-invariance is needed during training, as explicit speaker order is maintained in the latent representations.

4. Performance Evaluation and Comparative Results

Geneses is evaluated on two-speaker mixtures from LibriTTS-R under two distortion regimes: additive noise only, and complex degradations (including reverberation, bandwidth constraints, clipping, and packet loss, applied in randomized combinations per utterance). Objective evaluation uses reference-free metrics (DNSMOS, NISQA, UTMOSv2, WER) and reference-aware metrics (ESTOI, MCD, LSD, SpeechBERTScore (SBS), SpkSim). Conventional mask-based SE-SS methods (ICASSP 2023) serve as baselines.

Key results:

System (condition)	DNSMOS ↑	NISQA ↑	UTMOSv2 ↑	WER ↓	ESTOI ↑	MCD ↓	LSD ↓	SBS ↑	SpkSim ↑
Ground Truth	3.37	4.73	3.65	0.11	–	–	–	–	–
Conventional (noise only)	2.91	2.32	1.75	0.35	0.72	7.17	3.96	0.80	0.95
Geneses (noise only)	3.40	4.44	3.44	0.39	0.75	7.60	4.65	0.83	0.99
Conventional (complex)	2.08	1.34	0.84	5.54	0.42	9.24	5.79	0.61	0.91
Geneses (complex)	3.39	4.44	3.40	0.43	0.72	8.09	5.00	0.82	0.98

Under complex degradations, the conventional baseline collapses in recognition accuracy and intelligibility (WER 5.54, ESTOI 0.42), whereas Geneses keeps reference-free quality scores close to ground truth (DNSMOS 3.39 vs. 3.37, NISQA 4.44 vs. 4.73, UTMOSv2 3.40 vs. 3.65) and holds WER at 0.43 and ESTOI at 0.72; its WER nevertheless remains above the ground-truth value of 0.11. Under additive noise alone, Geneses exceeds the conventional baseline on the perceptual metrics but is slightly worse on WER, MCD, and LSD (0.39 vs. 0.35, 7.60 vs. 7.17, and 4.65 vs. 3.96), a known trade-off of generative models in precise signal fidelity. This suggests that Geneses achieves high perceptual quality and robustness but, like other generative approaches, may smooth fine spectral cues.
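
The comparison against ground truth can be checked directly from the table. The short sketch below recomputes the gaps for the reference-free metrics under complex degradations; the numbers are copied from the table above, and the helper name `gap_to_reference` is hypothetical.

```python
# Reference-free metrics copied from the results table (complex degradations).
ground_truth = {"DNSMOS": 3.37, "NISQA": 4.73, "UTMOSv2": 3.65, "WER": 0.11}
geneses_complex = {"DNSMOS": 3.39, "NISQA": 4.44, "UTMOSv2": 3.40, "WER": 0.43}
conv_complex = {"DNSMOS": 2.08, "NISQA": 1.34, "UTMOSv2": 0.84, "WER": 5.54}


def gap_to_reference(system, reference):
    """Per-metric difference (system minus reference), rounded to 2 decimals."""
    return {name: round(system[name] - reference[name], 2) for name in reference}


geneses_gap = gap_to_reference(geneses_complex, ground_truth)
conv_gap = gap_to_reference(conv_complex, ground_truth)
print("Geneses:", geneses_gap)
print("Conventional:", conv_gap)
```

Read this way, Geneses is within 0.02 of ground truth on DNSMOS and within about 0.3 on NISQA and UTMOSv2, while its WER is still 0.32 above ground truth; the conventional baseline is off by more than 1 point on every quality score.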

5. Methodological Relationships and Comparison to Prior Work

Geneses builds upon and extends prior lines of unified, generative modeling for speech front-end tasks:

Diffusion and Score-Based Separation: Earlier frameworks such as DiffSep apply continuous-time diffusion (SDE-based) to model mixture-to-source transformations, using score-based learning and permutational loss corrections to jointly solve speech enhancement and separation (Scheibler et al., 2022).
Latent-Space Diffusion/Flow Models: Approaches like UniFlow generalize to a broader class of speech tasks (SE, SS, TSE, AEC, LASS) by operating in VAE-compressed latent space using diffusion, flow matching, or mean-flow strategies, controlled via conditional embeddings and multimodal context (Wang et al., 11 Aug 2025).
Token-Based and LM-Driven Techniques: Works such as UniSE formulate SE and SS as conditional discrete token generation using neural audio codecs and decoder-only LMs (e.g., LLaMA-style Transformers), exploiting prompt tokens for mode selection (Yan et al., 23 Oct 2025).
Audio-Visual Guidance and Wasserstein Autoencoders: UniVoiceLite introduces unsupervised, lightweight integration of visual cues (lip motion, face identity) using a Wasserstein autoencoder architecture, supporting SE and SS by leveraging cross-modal priors and Wasserstein latent regularization (Park et al., 7 Dec 2025).

The principal innovation of Geneses is its fusion of latent flow transport and multi-modal SSL feature conditioning, specifically enhancing resilience to complex, real-world degradations beyond the reach of mask-based or discriminative pipelines.

6. Design Insights, Limitations, and Future Research Directions

Geneses demonstrates that latent flow matching, particularly when combined with SSL conditioning and a robust VAE prior, is key to robust performance under diverse degradation paths. Ablations confirm that SSL removal or replacement of latent flows with direct diffusion degrades both separation quality and noise robustness.

Several limitations and avenues for further investigation remain:

Slight degradation in fidelity metrics under simple noise reflects the “hallucination” effect observed in modern generative SE.
Current models are restricted to two speakers; generalization to higher-order mixtures, conversational or real-recorded audio, and integration with more structured priors remain to be fully explored.
Tighter coupling of VAE priors and latent conditional flows may further stabilize and improve sample quality.

A plausible implication is that, as generative latent modeling frameworks mature, unified and multimodal architectures such as Geneses could replace complex, heavily parametrized, and supervised pipelines for real-world SE and SS.

7. Summary and Impact

Geneses establishes a principled unified generative model for both enhancement and separation, leveraging latent flow matching, robust multi-modal feature conditioning, and a stable VAE architecture (Asai et al., 26 Jan 2026). Its performance, especially under complex, real-world degradations, surpasses conventional mask-based pipelines and aligns with the emerging trend of scalable, task-agnostic generative models in speech processing. These results suggest that latent generative architectures, notably those integrating SSL and conditional flows, provide a strong foundation for the next generation of unified speech front-end solutions.

References

Geneses: Unified Generative Speech Enhancement and Separation (2026)

Diffusion-based Generative Speech Source Separation (2022)

UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling (2025)

Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation (2025)

UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement (2025)
