CodeSep: Codec-Driven Speech Separation & Compression
- CodeSep is a codec-driven architecture that jointly performs low-bitrate speech separation and compression by disentangling mixed-speech signals into base and auxiliary tokens.
- The model combines a neural RVQ-based codec, base-token disentanglement, and auxiliary-token serial prediction to achieve high perceptual quality at 1 kbps per speaker.
- Experimental results show that CodeSep outperforms conventional pipelines like FCTS and FSTC, delivering superior objective and subjective speech quality metrics.
CodeSep is a codec-driven architecture for joint low-bitrate speech separation and compression, targeting scenarios where mixed-speech signals must be disentangled and efficiently encoded for transmission or storage. By combining a residual vector quantizer (RVQ)-based neural codec, a base-token disentanglement (BTD) module, and auxiliary-token serial prediction (ATSP) modules, CodeSep enables the reconstruction of separated speech signals from minimal discrete representations, achieving high perceptual quality at bitrates as low as 1 kbps per speaker. Applications include online meetings and multi-speaker dialogue archiving, where bandwidth efficiency and source separation are critical (Du et al., 19 Jan 2026).
1. Model Architecture
The CodeSep system consists of three principal components: the MDCT-based neural codec, the BTD module, and the ATSP modules. The architecture processes time-frequency representations in a multistage manner:
- MDCTCodec (RVQ-based Codec):
Operates in the MDCT (Modified Discrete Cosine Transform) domain, using a block size of, e.g., 20 ms with 50% overlap. The encoder comprises ConvNeXt v2 blocks mapping MDCT frames to latent features. Quantization is staged via RVQ: at each stage $j$, the current residual $r_j$ is quantized against a codebook $\mathcal{C}_j = \{c_{j,1}, \dots, c_{j,K}\}$, selecting the index $d_j = \arg\min_k \lVert r_j - c_{j,k} \rVert_2^2$. The selected codewords are summed across stages and decoded via a mirrored ConvNeXt v2 stack back to MDCT coefficients, followed by an inverse MDCT to the waveform $\hat{x}$.
- Base-Token Disentanglement (BTD) Module:
Given a mixed-speech mel-spectrogram $Y$, processing proceeds through:
1. A 3-layer convolutional stack (stride 2) that downsamples $Y$ in time;
2. 4 self-attention Transformer blocks;
3. Duplication into two branches, with anti-consistency biases added per branch to prevent the two outputs from collapsing onto the same speaker;
4. 4 cross-attention Transformer blocks per branch;
5. Two linear+softmax heads that yield code distributions; argmax over each distribution selects the base-token indices $d_{\text{base}}^{(1)}, d_{\text{base}}^{(2)}$.
The anti-consistency biases, generated by a trainable anti-consistency bias generator (ACBG), are essential to enforce separation across branches.
- Auxiliary-Token Serial Prediction (ATSP) Modules:
For each speaker $s$, a parallel ATSP branch predicts the auxiliary token sequence $d_2^{(s)}, \dots, d_J^{(s)}$ given the base token $d_{\text{base}}^{(s)}$. The conditional factorization is:
$$P(d_2^{(s)}, \dots, d_J^{(s)} \mid d_{\text{base}}^{(s)}) = \prod_{j=2}^{J} P\big(d_j^{(s)} \mid d_{\text{base}}^{(s)}, d_2^{(s)}, \dots, d_{j-1}^{(s)}\big)$$
At each stage $j$, the previously decoded tokens are embedded, processed by a 2-layer LSTM and 3 Conformer blocks, and mapped to logits over the stage-$j$ codebook, yielding the distribution $P(d_j^{(s)} \mid d_{\text{base}}^{(s)}, d_2^{(s)}, \dots, d_{j-1}^{(s)})$.
- Waveform Reconstruction:
For speaker $s$, the codebook embeddings of all quantization stages (the base token plus the predicted auxiliary tokens) are summed to form the quantized latent fed to the codec decoder, which reconstructs the waveform $\hat{x}^{(s)}$.
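The staged RVQ mechanics underlying the codec can be sketched as follows. This is a minimal NumPy illustration of the quantize/dequantize loop only, not the paper's implementation; the codebook count, size, and embedding dimension here are illustrative:

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual VQ: each stage snaps the current residual to its nearest
    codeword; the chosen indices are the discrete tokens."""
    indices, residual = [], z.copy()
    for C in codebooks:                                   # C: (K, D) codewords
        d = int(np.argmin(((residual[None, :] - C) ** 2).sum(axis=-1)))
        indices.append(d)
        residual = residual - C[d]                        # pass residual onward
    return indices, residual

def rvq_decode(indices, codebooks):
    """Sum the selected codewords across stages to rebuild the latent."""
    return sum(C[d] for d, C in zip(indices, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 64)) for _ in range(8)]  # illustrative sizes
z = rng.normal(size=64)
idx, res = rvq_encode(z, codebooks)
z_hat = rvq_decode(idx, codebooks)       # by construction, z == z_hat + res
```

The first index per frame plays the role of the base token; later indices are the auxiliary tokens that ATSP learns to predict.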
Data flow diagram (simplified, per time frame, for two sources):

```
          Mixed mel Y(t)
                |
              [BTD]
                v
   d_base^(1)       d_base^(2)
       |                |
    [ATSP]           [ATSP]
       |                |
   aux tokens       aux tokens
       |                |
       v                v
  decoded waveform  decoded waveform
    (speaker 1)       (speaker 2)
```
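The serial prediction step of ATSP amounts to greedy autoregressive decoding over quantization stages. A minimal sketch follows, where `toy_logits` is a hypothetical stand-in for the LSTM+Conformer stack, used only to show the decoding loop:

```python
import numpy as np

def serial_predict(d_base, num_stages, logits_fn):
    """Greedily predict auxiliary tokens stage by stage, each stage
    conditioned on the base token and all previously predicted tokens."""
    tokens = [d_base]
    for j in range(2, num_stages + 1):
        logits = logits_fn(tokens, j)          # stand-in for the real model
        tokens.append(int(np.argmax(logits)))
    return tokens[1:]                          # auxiliary tokens only

def toy_logits(tokens, stage, K=1024):
    """Hypothetical scorer: deterministic pseudo-logits from the context."""
    rng = np.random.default_rng(31 * sum(tokens) + stage)
    return rng.normal(size=K)

aux = serial_predict(d_base=17, num_stages=8, logits_fn=toy_logits)
```

Because each stage conditions on all earlier tokens, the receiver can regenerate the full token stack from the base token alone.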
2. Training Objectives
Each CodeSep module is trained independently with its own objective:
- Codec Training (single-speaker data):
- Spectral loss: a multi-resolution reconstruction loss between the original and decoded spectra.
- Adversarial loss: a multi-scale GAN objective.
- RVQ commitment loss: $\mathcal{L}_{\text{commit}} = \sum_j \lVert r_j - \text{sg}(c_{j,d_j}) \rVert_2^2$, with $\text{sg}(\cdot)$ denoting stop-gradient.
- BTD Module (Permutation-Invariant Cross-Entropy):
Accounts for speaker-order ambiguity by minimizing over speaker permutations:
$$\mathcal{L}_{\text{BTD}} = \min_{\pi} \sum_{s=1}^{2} \text{CE}\big(P(d_{\text{base}}^{(s)}), \, d_{\text{base}}^{(\pi(s))}\big)$$
where $\pi$ ranges over permutations of the speakers and CE is the cross-entropy between a predicted code distribution and the ground-truth base-token index.
- ATSP Modules (Teacher-Forcing Cross-Entropy):
During training, ATSP receives ground-truth code indices for all earlier stages (teacher forcing):
$$\mathcal{L}_{\text{ATSP}} = \sum_{j=2}^{J} \text{CE}\big(\hat{P}_j, \, d_j\big)$$
with $\hat{P}_j$ the predicted distribution at stage $j$ and $d_j$ the corresponding ground-truth code.
- Total Loss:
$$\mathcal{L} = \mathcal{L}_{\text{codec}} + \mathcal{L}_{\text{BTD}} + \mathcal{L}_{\text{ATSP}}$$
where each term governs a different module's training; because the modules are trained independently, the terms do not exchange gradients.
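The permutation-invariant objective can be made concrete for the two-speaker case. A minimal sketch, with toy distributions over $K=4$ codes:

```python
import numpy as np
from itertools import permutations

def cross_entropy(p, target):
    """CE between a predicted distribution p over K codes and a
    ground-truth code index (small epsilon for numerical safety)."""
    return -np.log(p[target] + 1e-9)

def pit_ce(pred_dists, targets):
    """Permutation-invariant CE: score every assignment of branches to
    speakers and keep the cheapest, so branch order is never penalised."""
    S = len(targets)
    return min(
        sum(cross_entropy(pred_dists[s], targets[perm[s]]) for s in range(S))
        for perm in permutations(range(S))
    )

# Branch 0 puts its mass on code 1, branch 1 on code 0: the swapped
# assignment is cheaper, and PIT selects it automatically.
p = [np.array([0.1, 0.7, 0.1, 0.1]), np.array([0.8, 0.1, 0.05, 0.05])]
loss = pit_ce(p, targets=[0, 1])
```

Without the permutation minimum, the loss would punish the model whenever its two branches happen to emit the speakers in the opposite order.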
3. Low-Bitrate Transmission Strategy
A key feature of CodeSep is its ability to operate at extremely low bitrates:
- Only the base tokens (one per speaker, per frame) are transmitted.
- With a codebook size of $K = 1024$, each token is 10 bits; the frame shift is 10 ms ($100$ frames/sec).
- The achieved bitrate is therefore $10 \times 100 = 1000$ bits/sec, i.e., $1$ kbps per speaker.
- Auxiliary tokens are omitted during transmission; instead, they are inferred at the receiver using the ATSP modules.
This design allows extremely compact representations without directly transmitting detailed quantization stages, thus significantly reducing bandwidth requirements.
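The bitrate arithmetic is easy to verify directly:

```python
import math

codebook_size = 1024                              # entries per base-token codebook
bits_per_token = int(math.log2(codebook_size))    # 10 bits per base token
frames_per_sec = 1000 // 10                       # 10 ms frame shift -> 100 frames/s
bitrate_bps = bits_per_token * frames_per_sec     # per transmitted speaker
```

At 10 bits per token and 100 frames/s this gives 1000 bit/s, i.e., 1 kbps per speaker; the auxiliary stages add nothing to the payload because they are regenerated at the receiver.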
4. Experimental Evaluation
- Dataset and Setup:
Libri2Mix-clean (16 kHz) is used: 270 h of training data, with 11 h each for development and test. MDCTCodec uses $1024$-entry codebooks (10-bit base tokens); the number of RVQ stages and the embedding dimension follow the original configuration. The BTD and ATSP modules are as described in the architecture.
- Baselines:
- FCTS (Front-end Codec Then Separation): the mixed speech is first encoded at 1 kbps, then separated via Sepformer.
- FSTC (Front-end Separation Then Codec): separation is performed first, then each separated stream is coded at 0.5 kbps per speaker.
- Uncompressed Sepformer separation ($\infty$ kbps) is used as an upper bound.
- Metrics:
- UTMOS and DNSMOS: Non-intrusive objective speech quality metrics (higher is better)
- NMOS and SMOS: Subjective mean opinion scores for naturalness and speaker similarity
- NABX and SABX: ABX pairwise preference tests
Table 1: Objective & Subjective Scores at 1 kbps
| Method | UTMOS ↑ | DNSMOS ↑ | NMOS ↑ | SMOS ↑ |
|---|---|---|---|---|
| CodeSep | 3.14 | 3.67 | 3.65 ± 0.08 | 3.43 ± 0.09 |
| FCTS | 1.34 | 3.03 | 2.96 ± 0.09 | 2.86 ± 0.09 |
| FSTC | 1.99 | 3.33 | 3.24 ± 0.09 | 3.15 ± 0.09 |
| Sepformer (∞) | 3.54 | 3.55 | – | – |
At only 1 kbps, CodeSep demonstrates superior performance over FCTS and FSTC baselines (p < 0.01).
Table 2: CodeSep vs. FSTC at Higher Bitrates
| Method | Bitrate | UTMOS ↑ | DNSMOS ↑ |
|---|---|---|---|
| CodeSep | 1 kbps | 3.14 | 3.67 |
| FSTC | 2 kbps | 2.30 | 3.44 |
| FSTC | 4 kbps | 2.87 | 3.53 |
| FSTC | 8 kbps | 3.11 | 3.56 |
CodeSep at 1 kbps outperforms FSTC up to 8 kbps in objective quality.
Table 3: ABX Preference (NABX, SABX) vs FSTC
| Comparison | CodeSep pref. | FSTC pref. | No pref. | p-value |
|---|---|---|---|---|
| 1 vs 2 kbps NABX | 55.8% | 41.9% | 2.3% | <0.01 |
| 1 vs 4 kbps NABX | 52.8% | 43.0% | 4.2% | <0.01 |
| 1 vs 8 kbps NABX | 38.6% | 53.6% | 7.9% | <0.01 |
| 1 vs 2 kbps SABX | 54.3% | 41.8% | 3.9% | <0.01 |
Preference tests favor CodeSep at 1 kbps over FSTC at 2 and 4 kbps in both naturalness and speaker similarity; only at 8 kbps, an eightfold bitrate, do listeners prefer FSTC for naturalness.
5. Strengths, Limitations, and Potential Extensions
Strengths:
- Achieves joint speech separation and compression ("JSAC") in a single token-level model at exceptionally low bitrates.
- The BTD and ATSP modules factorize token representations into "which speaker" (base tokens) and "fine detail" (auxiliary tokens) aspects.
- Enables high perceptual quality for separated speech at only 1 kbps, surpassing naive compress/separate pipelines.
Limitations:
- Current implementation demonstrated only for 2-speaker mixtures.
- Codec and separation modules are trained independently; there is no fully end-to-end quantization-gradient propagation.
- Evaluation focuses on non-intrusive metrics; SI-SDR/PESQ are not reported for direct separation-plus-codec comparison.
Potential Extensions:
- Generalization to mixtures of three or more speakers by increasing the number of anti-consistency branches in BTD.
- Joint fine-tuning of the codec, BTD, and ATSP using straight-through estimators for quantization.
- Replacing MDCTCodec with alternative discrete codecs or learned tokenizers (e.g., SoundStream).
- Augmenting for noise robustness and far-field speech via room simulation.
- Implementing adaptive bitrate by reducing base token transmission in silence.
A plausible implication is that CodeSep's token-level disentanglement and auxiliary prediction strategy can facilitate future ultra-low-bitrate applications in speech communication, scale to more complex acoustic environments, and integrate flexibly with emerging neural audio codecs (Du et al., 19 Jan 2026).