
CodeSep: Codec-Driven Speech Separation & Compression

Updated 21 January 2026
  • CodeSep is a codec-driven architecture that jointly performs low-bitrate speech separation and compression by disentangling mixed-speech signals into base and auxiliary tokens.
  • The model combines a neural RVQ-based codec, base-token disentanglement, and auxiliary-token serial prediction to achieve high perceptual quality at 1 kbps per speaker.
  • Experimental results show that CodeSep outperforms conventional pipelines like FCTS and FSTC, delivering superior objective and subjective speech quality metrics.

CodeSep is a codec-driven architecture for joint low-bitrate speech separation and compression, targeting scenarios where mixed-speech signals must be disentangled and efficiently encoded for transmission or storage. By combining a residual vector quantizer (RVQ)-based neural codec, a base-token disentanglement (BTD) module, and auxiliary-token serial prediction (ATSP) modules, CodeSep enables the reconstruction of separated speech signals from minimal discrete representations, achieving high perceptual quality at bitrates as low as 1 kbps per speaker. Applications include online meetings and multi-speaker dialogue archiving, where bandwidth efficiency and source separation are critical (Du et al., 19 Jan 2026).

1. Model Architecture

The CodeSep system consists of three principal components: the MDCT-based neural codec, the BTD module, and the ATSP modules. The architecture processes time-frequency representations in a multistage manner:

  • MDCTCodec (RVQ-based Codec):

Operates in the MDCT (Modified Discrete Cosine Transform) domain with a fixed block size (e.g., 20 ms with 50% overlap). The encoder $\varphi_e$ comprises ConvNeXt v2 blocks mapping MDCT frames $x \in \mathbb{R}^d$ to features $z_0 \in \mathbb{R}^{T \times K}$. Quantization proceeds in stages via RVQ: at each stage $n$, the input $r_{n-1}$ is quantized against a codebook $W_n = \{w_{n,m} \in \mathbb{R}^K \mid m = 1, \dots, M\}$, selecting $\hat{z}_n = w_{n,m^\star}$ with $m^\star = \arg\min_m \| r_{n-1} - w_{n,m} \|_2$. The quantized representations are summed and decoded by a mirrored ConvNeXt v2 stack back to MDCT coefficients, followed by an inverse MDCT to the waveform $\hat{x}$.
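The staged nearest-codeword selection can be sketched in a few lines of NumPy. This is an illustrative toy with random codebooks, not the trained MDCTCodec:

```python
import numpy as np

def rvq_quantize(z0, codebooks):
    # Residual vector quantization: each stage quantizes the residual
    # left by the previous stage against its own codebook.
    residual = z0
    indices, quantized = [], []
    for W in codebooks:                             # W: (M, K) codewords
        dists = np.linalg.norm(residual[None, :] - W, axis=1)
        m_star = int(np.argmin(dists))              # nearest codeword
        indices.append(m_star)
        quantized.append(W[m_star])
        residual = residual - W[m_star]             # pass residual onward
    z_hat = np.sum(quantized, axis=0)               # summed code fed to the decoder
    return indices, z_hat

# Toy usage with the paper's configuration: N=4 stages, M=1024, K=32.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 32)) for _ in range(4)]
z0 = rng.standard_normal(32)
indices, z_hat = rvq_quantize(z0, codebooks)
```

The per-stage indices are exactly the discrete tokens the rest of the system manipulates: stage 1 yields the base token, stages 2 to $N$ the auxiliary tokens.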

  • Base-Token Disentanglement (BTD) Module:

Given a mixed-speech mel-spectrogram $Y(t) \in \mathbb{R}^{80}$, $Y$ is processed through:

1. $\varphi_{meld}$: a 3-layer convolutional stack (stride 2), producing $Z_{meld}(t) \in \mathbb{R}^{256}$
2. $\varphi_{intra}$: 4 self-attention Transformer blocks
3. Duplication of the features and addition of anti-consistency biases $\delta^{(1)}$, $\delta^{(2)}$ to prevent output collapse
4. $\varphi_{inter}$: 4 cross-attention Transformer blocks
5. Two linear+softmax heads yielding code distributions $p_{base}^{(i)}(t) \in \Delta^M$; argmax selection gives base indices $d_{base}^{(i)}(t) \in \{1, \dots, M\}$

The anti-consistency biases, generated by a trainable anti-consistency bias generator (ACBG), are essential to enforce separation across the two branches.
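A minimal sketch of the duplicate-and-bias step, with the per-branch cross-attention blocks and heads collapsed into one shared linear head purely for illustration (all shapes and the `w_head` matrix are toy assumptions):

```python
import numpy as np

def btd_branches(z_shared, delta1, delta2, w_head):
    # The shared intra-mixture features are duplicated; a distinct
    # anti-consistency bias is added to each copy before the head,
    # so the two branches cannot collapse onto identical outputs.
    def head(z):
        logits = z @ w_head                            # (T, M) code logits
        p = np.exp(logits - logits.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)             # softmax over the codebook
        return p, p.argmax(axis=-1)                    # distribution, base indices
    return head(z_shared + delta1), head(z_shared + delta2)

# Toy shapes: T=5 frames, 256-dim features, M=1024 codes.
rng = np.random.default_rng(1)
z = rng.standard_normal((5, 256))
d1, d2 = rng.standard_normal(256), rng.standard_normal(256)
(p1, idx1), (p2, idx2) = btd_branches(z, d1, d2, rng.standard_normal((256, 1024)))
```

Because $\delta^{(1)} \neq \delta^{(2)}$, the two heads see different inputs even though the upstream features are shared, which is the mechanism that lets training push the branches toward different speakers.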

  • Auxiliary-Token Serial Prediction (ATSP) Modules:

For each speaker $i \in \{1, 2\}$, a parallel ATSP branch predicts the auxiliary token sequence $d_{aux}^{(i)} = [d_{aux,1}^{(i)}, \dots, d_{aux,N-1}^{(i)}]$ given $d_{base}^{(i)}$. The conditional factorization is:

$$p(d_{aux}^{(i)} \mid d_{base}^{(i)}) = \prod_{n=1}^{N-1} p(d_{aux,n}^{(i)} \mid d_{base}^{(i)}, d_{aux,1}^{(i)}, \dots, d_{aux,n-1}^{(i)})$$

Each stage uses the embedding $e_n^{(i)} = L(d_{base}^{(i)}) + \sum_{m<n} L(d_{aux,m}^{(i)})$, processed by a 2-layer LSTM and 3 Conformer blocks, which output logits and the distribution $p_{aux,n}^{(i)}$.
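The serial conditioning can be sketched as follows; `predict_stage` is a hypothetical stand-in for the LSTM+Conformer stack, and the embedding tables are toy random matrices:

```python
import numpy as np

def atsp_decode(d_base, embed_tables, predict_stage):
    # Serial prediction: stage n conditions on the summed embeddings of
    # the base token and all previously predicted auxiliary tokens.
    e = embed_tables[0][d_base].copy()        # L(d_base)
    d_aux = []
    for n in range(1, len(embed_tables)):
        p = predict_stage(e, n)               # stand-in for LSTM + Conformer
        d_n = int(np.argmax(p))               # greedy auxiliary token choice
        d_aux.append(d_n)
        e = e + embed_tables[n][d_n]          # accumulate L(d_aux,n)
    return d_aux

# Toy usage: N=4 stages, M=16 codes, K=8 dims, with a dummy predictor.
rng = np.random.default_rng(2)
tables = [rng.standard_normal((16, 8)) for _ in range(4)]
dummy_predictor = lambda e, n: rng.random(16)   # hypothetical stage predictor
d_aux = atsp_decode(3, tables, dummy_predictor)
```

The key point is that nothing beyond the base token needs to be transmitted: the receiver regenerates the auxiliary tokens stage by stage from its own previous predictions.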

  • Waveform Reconstruction:

For speaker $i$, the embeddings of all $N$ quantization stages (the base token plus the predicted auxiliary tokens) are summed to form $\hat{z}^{(i)}$, which the decoder $\varphi_d$ maps to the reconstructed waveform $\hat{x}_i$.

Data Flow Diagram (simplified, per time frame for two sources):

  Mixed mel Y(t)
      |
   [BTD]
      V
d_base^(1), d_base^(2)
   |           |
[ATSP]      [ATSP]
   |           |
aux tokens   aux tokens
   |           |
   V           V
Decoded waveforms for each speaker

2. Training Objectives

Each CodeSep module is trained independently with its own objective:

  • Codec Training (single-speaker data):
    • Spectral loss: $L_{spec} = \| \text{MDCT}(x) - \text{MDCT}(\hat{x}) \|_1$
    • Multi-scale GAN adversarial loss: $L_{adv}$
    • RVQ commitment loss: $L_q = \sum_n \| \text{sg}[z_n] - w_n \|^2 + \beta \| z_n - \text{sg}[w_n] \|^2$, with $\text{sg}[\cdot]$ denoting the stop-gradient operator.
  • BTD Module (Permutation-Invariant Cross-Entropy):

Accounts for speaker order ambiguity via:

$$L_{PI\text{-}CE} = -\mathbb{E}_{(y, x_1, x_2)} \min_{\pi \in S_2} \sum_{i=1}^{2} \log p_{base}^{(i)}\big[d_1^{(\pi(i))}\big]$$
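For two branches, the permutation-invariant loss simply evaluates the cross-entropy under both speaker assignments and keeps the cheaper one. A minimal sketch over toy log-probabilities:

```python
import numpy as np
from itertools import permutations

def pi_ce_loss(log_p, targets):
    # log_p: (2, T, M) per-branch log-probabilities over base codes.
    # targets: (2, T) ground-truth base indices per speaker.
    T = log_p.shape[1]
    best = np.inf
    for perm in permutations(range(2)):
        ce = 0.0
        for i, j in enumerate(perm):     # branch i scored against speaker j
            ce -= log_p[i, np.arange(T), targets[j]].sum()
        best = min(best, ce)
    return best

# Branch 0 fits speaker 1's codes and branch 1 fits speaker 0's:
# the loss stays low despite the swapped speaker order.
log_p = np.full((2, 3, 4), np.log(0.01))
targets = np.array([[0, 1, 2], [3, 3, 3]])
log_p[0, np.arange(3), targets[1]] = np.log(0.97)   # branch 0 -> speaker 1
log_p[1, np.arange(3), targets[0]] = np.log(0.97)   # branch 1 -> speaker 0
loss = pi_ce_loss(log_p, targets)
```

Without the `min` over permutations, the swapped assignment in this example would be penalized heavily even though the separation itself is correct.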

  • ATSP Modules (Teacher-Forcing Cross-Entropy):

During training, ATSP receives ground-truth code indices:

$$L_{TF\text{-}CE} = -\mathbb{E}_{x \sim D_s} \sum_{n=1}^{N-1} \log \tilde{p}_{aux,n}[d_{n+1}]$$

with $\tilde{p}_{aux,n}$ the prediction at stage $n$ and $d_{n+1}$ the corresponding ground-truth code.
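A minimal sketch of the teacher-forcing objective over toy distributions (the stage conditioning on ground-truth earlier codes is assumed to have already produced `stage_log_probs`):

```python
import numpy as np

def tf_ce_loss(stage_log_probs, gt_aux_codes):
    # Teacher forcing: each stage is conditioned on ground-truth codes of
    # all earlier stages (not its own predictions), and the loss is the
    # negative log-probability assigned to the next ground-truth code.
    return -sum(lp[d] for lp, d in zip(stage_log_probs, gt_aux_codes))

# N-1 = 3 auxiliary stages over a codebook of 4 entries; ground truth is
# index 0 at every stage, to which the model assigns probability 0.7.
lps = [np.log(np.array([0.7, 0.1, 0.1, 0.1])) for _ in range(3)]
loss = tf_ce_loss(lps, [0, 0, 0])
```

At inference time the ground-truth codes are unavailable, so the ATSP branch switches to conditioning on its own greedy predictions, as described in the architecture section.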

  • Total Loss:

$$L_{total} = L_{adv} + \lambda_{spec} L_{spec} + \lambda_q L_q + \mu L_{PI\text{-}CE} + \nu L_{TF\text{-}CE}$$

where each term controls a different module’s training.

3. Low-Bitrate Transmission Strategy

A key feature of CodeSep is its ability to operate at extremely low bitrates:

  • Only the base tokens $d_{base}^{(i)}(t) \in \{1, \dots, M\}$ (one per speaker, per frame) are transmitted.
  • With $M = 1024$, each token costs 10 bits; the frame shift is 10 ms (100 frames/s).
  • The resulting bitrate is 1 kbps per speaker: $10 \text{ bits} \times 100 \text{ frames/s} = 1000$ bits/s.
  • Auxiliary tokens are omitted during transmission; instead, they are inferred at the receiver using the ATSP modules.

This design allows extremely compact representations without directly transmitting detailed quantization stages, thus significantly reducing bandwidth requirements.
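The bitrate arithmetic can be checked directly:

```python
import math

def base_token_bitrate(codebook_size=1024, frame_shift_ms=10.0):
    # Bits per second when only base tokens are sent: one token per frame,
    # log2(M) bits per token.
    bits_per_token = math.log2(codebook_size)   # 10 bits for M = 1024
    frames_per_sec = 1000.0 / frame_shift_ms    # 100 frames/s at a 10 ms shift
    return bits_per_token * frames_per_sec

rate = base_token_bitrate()   # 1000 bits/s = 1 kbps per speaker
```

Transmitting all four RVQ stages instead would quadruple this to 4 kbps per speaker, which is what the ATSP prediction at the receiver avoids.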

4. Experimental Evaluation

  • Dataset and Setup:

Libri2Mix-clean (16 kHz) is used: 270 h of training data and 11 h each for development and test. MDCTCodec is configured with $N = 4$ RVQ stages, codebook size $M = 1024$, and embedding dimension $K = 32$. The BTD and ATSP modules follow the architecture described above.

  • Baselines:
    • FCTS (Front-end Codec Then Separation): Mixed speech is encoded at 1 kbps before separation via Sepformer.
    • FSTC (Front-end Separation Then Codec): The mixture is separated first; each separated speaker is then coded as an individual 0.5 kbps stream.
    • Sepformer without compression (denoted ∞ kbps) serves as an upper bound.
  • Metrics:
    • UTMOS and DNSMOS: Non-intrusive objective speech quality metrics (higher is better)
    • NMOS and SMOS: Subjective mean opinion scores for naturalness and speaker similarity
    • NABX and SABX: ABX pairwise preference tests

Table 1: Objective & Subjective Scores at 1 kbps

Method          UTMOS ↑   DNSMOS ↑   NMOS ↑        SMOS ↑
CodeSep         3.14      3.67       3.65 ± 0.08   3.43 ± 0.09
FCTS            1.34      3.03       2.96 ± 0.09   2.86 ± 0.09
FSTC            1.99      3.33       3.24 ± 0.09   3.15 ± 0.09
Sepformer (∞)   3.54      3.55       –             –

At only 1 kbps, CodeSep demonstrates superior performance over FCTS and FSTC baselines (p < 0.01).

Table 2: CodeSep v. FSTC at Higher Bitrates

Method    Bitrate   UTMOS ↑   DNSMOS ↑
CodeSep   1 kbps    3.14      3.67
FSTC      2 kbps    2.30      3.44
FSTC      4 kbps    2.87      3.53
FSTC      8 kbps    3.11      3.56

CodeSep at 1 kbps outperforms FSTC up to 8 kbps in objective quality.

Table 3: ABX Preference (NABX, SABX) vs FSTC

Comparison           CodeSep pref.   FSTC pref.   No pref.   p-value
1 vs 2 kbps, NABX    55.8%           41.9%        2.3%       <0.01
1 vs 4 kbps, NABX    52.8%           43.0%        4.2%       <0.01
1 vs 8 kbps, NABX    38.6%           53.6%        7.9%       <0.01
1 vs 2 kbps, SABX    54.3%           41.8%        3.9%       <0.01

Preference tests favor CodeSep at 1 kbps over FSTC at two and four times the bitrate; only against FSTC at 8 kbps, i.e., eight times the bitrate, does the naturalness preference shift to the baseline.

5. Strengths, Limitations, and Potential Extensions

Strengths:

  • Achieves joint speech separation and compression ("JSAC") in a single token-level model at exceptionally low bitrates.
  • The BTD and ATSP modules factorize token representations into "which speaker" (base tokens) and "fine detail" (auxiliary tokens) aspects.
  • Enables high perceptual quality for separated speech at only 1 kbps, surpassing naive compress/separate pipelines.

Limitations:

  • Current implementation demonstrated only for 2-speaker mixtures.
  • Codec and separation modules are trained independently; there is no fully end-to-end quantization-gradient propagation.
  • Evaluation focuses on non-intrusive metrics; SI-SDR/PESQ are not reported for direct separation-plus-codec comparison.

Potential Extensions:

  • Generalization to mixtures of three or more speakers by increasing the number of anti-consistency branches in BTD.
  • Joint fine-tuning of the codec, BTD, and ATSP using straight-through estimators for quantization.
  • Replacing MDCTCodec with alternative discrete codecs or learned tokenizers (e.g., SoundStream).
  • Augmenting for noise robustness and far-field speech via room simulation.
  • Implementing adaptive bitrate by reducing base token transmission in silence.

A plausible implication is that CodeSep’s token-level disentanglement and auxiliary prediction strategy can facilitate future ultra-low-bitrate applications in speech communication, scalable to more complex acoustic environments, and flexible integration with emerging neural audio codecs (Du et al., 19 Jan 2026).
