
CodeSep: Codec-Driven Speech Separation & Compression

Updated 21 January 2026
  • CodeSep is a codec-driven architecture that jointly performs low-bitrate speech separation and compression by disentangling mixed-speech signals into base and auxiliary tokens.
  • The model combines a neural RVQ-based codec, base-token disentanglement, and auxiliary-token serial prediction to achieve high perceptual quality at 1 kbps per speaker.
  • Experimental results show that CodeSep outperforms conventional pipelines like FCTS and FSTC, delivering superior objective and subjective speech quality metrics.

CodeSep is a codec-driven architecture for joint low-bitrate speech separation and compression, targeting scenarios where mixed-speech signals must be disentangled and efficiently encoded for transmission or storage. By combining a residual vector quantizer (RVQ)-based neural codec, a base-token disentanglement (BTD) module, and auxiliary-token serial prediction (ATSP) modules, CodeSep enables the reconstruction of separated speech signals from minimal discrete representations, achieving high perceptual quality at bitrates as low as 1 kbps per speaker. Applications include online meetings and multi-speaker dialogue archiving, where bandwidth efficiency and source separation are critical (Du et al., 19 Jan 2026).

1. Model Architecture

The CodeSep system consists of three principal components: the MDCT-based neural codec, the BTD module, and the ATSP modules. The architecture processes time-frequency representations in a multistage manner:

  • MDCTCodec (RVQ-based Codec):

Operates in the MDCT (Modified Discrete Cosine Transform) domain with a fixed block size (e.g., 20 ms with 50% overlap). The encoder $\varphi_e$ comprises ConvNeXt v2 blocks mapping MDCT frames $x \in \mathbb{R}^d$ to features $z_0 \in \mathbb{R}^{T \times K}$. Quantization proceeds in stages via RVQ: at each stage $n$, the input $r_{n-1}$ is quantized against a codebook $W_n = \{w_{n,m} \in \mathbb{R}^K \mid m = 1, \dots, M\}$, selecting $\hat{z}_n = w_{n,m^\star}$ with $m^\star = \arg\min_m \| r_{n-1} - w_{n,m} \|_2$. The quantized representations are summed and decoded by a mirrored ConvNeXt v2 stack back to MDCT coefficients, followed by an inverse MDCT to the waveform $\hat{x}$.
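The staged nearest-codeword selection can be sketched in a few lines of NumPy. This is an illustrative toy with random codebooks, not the trained MDCTCodec:

```python
import numpy as np

def rvq_quantize(z0, codebooks):
    # Residual vector quantization: each stage quantizes the residual
    # left by the previous stage against its own codebook.
    residual = z0
    indices, quantized = [], []
    for W in codebooks:                             # W: (M, K) codewords
        dists = np.linalg.norm(residual[None, :] - W, axis=1)
        m_star = int(np.argmin(dists))              # nearest codeword
        indices.append(m_star)
        quantized.append(W[m_star])
        residual = residual - W[m_star]             # pass residual onward
    z_hat = np.sum(quantized, axis=0)               # summed code fed to the decoder
    return indices, z_hat

# Toy usage with the paper's configuration: N=4 stages, M=1024, K=32.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 32)) for _ in range(4)]
z0 = rng.standard_normal(32)
indices, z_hat = rvq_quantize(z0, codebooks)
```

The per-stage indices are exactly the discrete tokens the rest of the system manipulates: stage 1 yields the base token, stages 2 to $N$ the auxiliary tokens.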

  • Base-Token Disentanglement (BTD) Module:

Given a mixed-speech mel-spectrogram $Y(t) \in \mathbb{R}^{80}$, $Y$ is processed through:

1. $\varphi_{meld}$: a 3-layer convolutional stack (stride 2), producing $Z_{meld}(t) \in \mathbb{R}^{256}$
2. $\varphi_{intra}$: 4 self-attention Transformer blocks
3. Duplication of the features and addition of anti-consistency biases $\delta^{(1)}$, $\delta^{(2)}$ to prevent output collapse
4. $\varphi_{inter}$: 4 cross-attention Transformer blocks
5. Two linear+softmax heads yielding code distributions $p_{base}^{(i)}(t) \in \Delta^M$; argmax selection gives base indices $d_{base}^{(i)}(t) \in \{1, \dots, M\}$

The anti-consistency biases, generated by a trainable anti-consistency bias generator (ACBG), are essential to enforce separation across the two branches.
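A minimal sketch of the duplicate-and-bias step, with the per-branch cross-attention blocks and heads collapsed into one shared linear head purely for illustration (all shapes and the `w_head` matrix are toy assumptions):

```python
import numpy as np

def btd_branches(z_shared, delta1, delta2, w_head):
    # The shared intra-mixture features are duplicated; a distinct
    # anti-consistency bias is added to each copy before the head,
    # so the two branches cannot collapse onto identical outputs.
    def head(z):
        logits = z @ w_head                            # (T, M) code logits
        p = np.exp(logits - logits.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)             # softmax over the codebook
        return p, p.argmax(axis=-1)                    # distribution, base indices
    return head(z_shared + delta1), head(z_shared + delta2)

# Toy shapes: T=5 frames, 256-dim features, M=1024 codes.
rng = np.random.default_rng(1)
z = rng.standard_normal((5, 256))
d1, d2 = rng.standard_normal(256), rng.standard_normal(256)
(p1, idx1), (p2, idx2) = btd_branches(z, d1, d2, rng.standard_normal((256, 1024)))
```

Because $\delta^{(1)} \neq \delta^{(2)}$, the two heads see different inputs even though the upstream features are shared, which is the mechanism that lets training push the branches toward different speakers.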

  • Auxiliary-Token Serial Prediction (ATSP) Modules:

For each speaker $i \in \{1, 2\}$, a parallel ATSP branch predicts the auxiliary token sequence $d_{aux}^{(i)} = [d_{aux,1}^{(i)}, \dots, d_{aux,N-1}^{(i)}]$ given $d_{base}^{(i)}$. The conditional factorization is:

$$p(d_{aux}^{(i)} \mid d_{base}^{(i)}) = \prod_{n=1}^{N-1} p(d_{aux,n}^{(i)} \mid d_{base}^{(i)}, d_{aux,1}^{(i)}, \dots, d_{aux,n-1}^{(i)})$$

Each stage uses the embedding $e_n^{(i)} = L(d_{base}^{(i)}) + \sum_{m<n} L(d_{aux,m}^{(i)})$, processed by a 2-layer LSTM and 3 Conformer blocks, which output logits and the distribution $p_{aux,n}^{(i)}$.
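The serial conditioning can be sketched as follows; `predict_stage` is a hypothetical stand-in for the LSTM+Conformer stack, and the embedding tables are toy random matrices:

```python
import numpy as np

def atsp_decode(d_base, embed_tables, predict_stage):
    # Serial prediction: stage n conditions on the summed embeddings of
    # the base token and all previously predicted auxiliary tokens.
    e = embed_tables[0][d_base].copy()        # L(d_base)
    d_aux = []
    for n in range(1, len(embed_tables)):
        p = predict_stage(e, n)               # stand-in for LSTM + Conformer
        d_n = int(np.argmax(p))               # greedy auxiliary token choice
        d_aux.append(d_n)
        e = e + embed_tables[n][d_n]          # accumulate L(d_aux,n)
    return d_aux

# Toy usage: N=4 stages, M=16 codes, K=8 dims, with a dummy predictor.
rng = np.random.default_rng(2)
tables = [rng.standard_normal((16, 8)) for _ in range(4)]
dummy_predictor = lambda e, n: rng.random(16)   # hypothetical stage predictor
d_aux = atsp_decode(3, tables, dummy_predictor)
```

The key point is that nothing beyond the base token needs to be transmitted: the receiver regenerates the auxiliary tokens stage by stage from its own previous predictions.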

  • Waveform Reconstruction:

For speaker $i$, the embeddings of all $N$ quantization stages (the base token plus the predicted auxiliary tokens) are summed to form $\hat{z}^{(i)}$, which the decoder $\varphi_d$ maps to the reconstructed waveform $\hat{x}_i$.

Data Flow Diagram (simplified, per time frame for two sources):

  Mixed mel Y(t)
      |
   [BTD]
      V
d_base^(1), d_base^(2)
   |           |
[ATSP]      [ATSP]
   |           |
aux tokens   aux tokens
   |           |
   V           V
Decoded waveforms for each speaker

2. Training Objectives

Each CodeSep module is trained independently with its own objective:

  • Codec Training (single-speaker data):
    • Spectral loss: $L_{spec} = \| \text{MDCT}(x) - \text{MDCT}(\hat{x}) \|_1$
    • Multi-scale GAN adversarial loss: $L_{adv}$
    • RVQ commitment loss: $L_q = \sum_n \| \text{sg}[z_n] - w_n \|^2 + \beta \| z_n - \text{sg}[w_n] \|^2$, with $\text{sg}[\cdot]$ denoting the stop-gradient operator.
  • BTD Module (Permutation-Invariant Cross-Entropy):

Accounts for speaker order ambiguity via:

$$L_{PI\text{-}CE} = -\mathbb{E}_{(y, x_1, x_2)} \min_{\pi \in S_2} \sum_{i=1}^{2} \log p_{base}^{(i)}\big[d_1^{(\pi(i))}\big]$$
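For two branches, the permutation-invariant loss simply evaluates the cross-entropy under both speaker assignments and keeps the cheaper one. A minimal sketch over toy log-probabilities:

```python
import numpy as np
from itertools import permutations

def pi_ce_loss(log_p, targets):
    # log_p: (2, T, M) per-branch log-probabilities over base codes.
    # targets: (2, T) ground-truth base indices per speaker.
    T = log_p.shape[1]
    best = np.inf
    for perm in permutations(range(2)):
        ce = 0.0
        for i, j in enumerate(perm):     # branch i scored against speaker j
            ce -= log_p[i, np.arange(T), targets[j]].sum()
        best = min(best, ce)
    return best

# Branch 0 fits speaker 1's codes and branch 1 fits speaker 0's:
# the loss stays low despite the swapped speaker order.
log_p = np.full((2, 3, 4), np.log(0.01))
targets = np.array([[0, 1, 2], [3, 3, 3]])
log_p[0, np.arange(3), targets[1]] = np.log(0.97)   # branch 0 -> speaker 1
log_p[1, np.arange(3), targets[0]] = np.log(0.97)   # branch 1 -> speaker 0
loss = pi_ce_loss(log_p, targets)
```

Without the `min` over permutations, the swapped assignment in this example would be penalized heavily even though the separation itself is correct.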

  • ATSP Modules (Teacher-Forcing Cross-Entropy):

During training, ATSP receives ground-truth code indices:

$$L_{TF\text{-}CE} = -\mathbb{E}_{x \sim D_s} \sum_{n=1}^{N-1} \log \tilde{p}_{aux,n}[d_{n+1}]$$

with $\tilde{p}_{aux,n}$ the prediction at stage $n$ and $d_{n+1}$ the corresponding ground-truth code.
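A minimal sketch of the teacher-forcing objective over toy distributions (the stage conditioning on ground-truth earlier codes is assumed to have already produced `stage_log_probs`):

```python
import numpy as np

def tf_ce_loss(stage_log_probs, gt_aux_codes):
    # Teacher forcing: each stage is conditioned on ground-truth codes of
    # all earlier stages (not its own predictions), and the loss is the
    # negative log-probability assigned to the next ground-truth code.
    return -sum(lp[d] for lp, d in zip(stage_log_probs, gt_aux_codes))

# N-1 = 3 auxiliary stages over a codebook of 4 entries; ground truth is
# index 0 at every stage, to which the model assigns probability 0.7.
lps = [np.log(np.array([0.7, 0.1, 0.1, 0.1])) for _ in range(3)]
loss = tf_ce_loss(lps, [0, 0, 0])
```

At inference time the ground-truth codes are unavailable, so the ATSP branch switches to conditioning on its own greedy predictions, as described in the architecture section.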

  • Total Loss:

$$L_{total} = L_{adv} + \lambda_{spec} L_{spec} + \lambda_q L_q + \mu L_{PI\text{-}CE} + \nu L_{TF\text{-}CE}$$

where each term controls a different module’s training.

3. Low-Bitrate Transmission Strategy

A key feature of CodeSep is its ability to operate at extremely low bitrates:

  • Only the base tokens $d_{base}^{(i)}(t) \in \{1, \dots, M\}$ (one per speaker, per frame) are transmitted.
  • With $M = 1024$, each token costs 10 bits; the frame shift is 10 ms (100 frames/s).
  • The resulting bitrate is 1 kbps per speaker: $10 \text{ bits} \times 100 \text{ frames/s} = 1000$ bits/s.
  • Auxiliary tokens are omitted during transmission; instead, they are inferred at the receiver using the ATSP modules.

This design allows extremely compact representations without directly transmitting detailed quantization stages, thus significantly reducing bandwidth requirements.
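The bitrate arithmetic can be checked directly:

```python
import math

def base_token_bitrate(codebook_size=1024, frame_shift_ms=10.0):
    # Bits per second when only base tokens are sent: one token per frame,
    # log2(M) bits per token.
    bits_per_token = math.log2(codebook_size)   # 10 bits for M = 1024
    frames_per_sec = 1000.0 / frame_shift_ms    # 100 frames/s at a 10 ms shift
    return bits_per_token * frames_per_sec

rate = base_token_bitrate()   # 1000 bits/s = 1 kbps per speaker
```

Transmitting all four RVQ stages instead would quadruple this to 4 kbps per speaker, which is what the ATSP prediction at the receiver avoids.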

4. Experimental Evaluation

  • Dataset and Setup:

Libri2Mix-clean (16 kHz) is used: 270 h of training data and 11 h each for development and test. MDCTCodec is configured with $N = 4$ RVQ stages, codebook size $M = 1024$, and embedding dimension $K = 32$. The BTD and ATSP modules follow the architecture described above.

  • Baselines:
    • FCTS (Front-end Codec Then Separation): Mixed speech is encoded at 1 kbps before separation via Sepformer.
    • FSTC (Front-end Separation Then Codec): The mixture is separated first; each separated speaker is then coded as an individual 0.5 kbps stream.
    • Sepformer without compression (denoted ∞ kbps) serves as an upper bound.
  • Metrics:
    • UTMOS and DNSMOS: Non-intrusive objective speech quality metrics (higher is better)
    • NMOS and SMOS: Subjective mean opinion scores for naturalness and speaker similarity
    • NABX and SABX: ABX pairwise preference tests

Table 1: Objective & Subjective Scores at 1 kbps

Method          UTMOS ↑   DNSMOS ↑   NMOS ↑        SMOS ↑
CodeSep         3.14      3.67       3.65 ± 0.08   3.43 ± 0.09
FCTS            1.34      3.03       2.96 ± 0.09   2.86 ± 0.09
FSTC            1.99      3.33       3.24 ± 0.09   3.15 ± 0.09
Sepformer (∞)   3.54      3.55       –             –

At only 1 kbps, CodeSep demonstrates superior performance over FCTS and FSTC baselines (p < 0.01).

Table 2: CodeSep v. FSTC at Higher Bitrates

Method    Bitrate   UTMOS ↑   DNSMOS ↑
CodeSep   1 kbps    3.14      3.67
FSTC      2 kbps    2.30      3.44
FSTC      4 kbps    2.87      3.53
FSTC      8 kbps    3.11      3.56

CodeSep at 1 kbps outperforms FSTC up to 8 kbps in objective quality.

Table 3: ABX Preference (NABX, SABX) vs FSTC

Comparison           CodeSep pref.   FSTC pref.   No pref.   p-value
1 vs 2 kbps, NABX    55.8%           41.9%        2.3%       <0.01
1 vs 4 kbps, NABX    52.8%           43.0%        4.2%       <0.01
1 vs 8 kbps, NABX    38.6%           53.6%        7.9%       <0.01
1 vs 2 kbps, SABX    54.3%           41.8%        3.9%       <0.01

Preference tests favor CodeSep at 1 kbps over FSTC at two and four times the bitrate; only against FSTC at 8 kbps, i.e., eight times the bitrate, does the naturalness preference shift to the baseline.

5. Strengths, Limitations, and Potential Extensions

Strengths:

  • Achieves joint speech separation and compression ("JSAC") in a single token-level model at exceptionally low bitrates.
  • The BTD and ATSP modules factorize token representations into "which speaker" (base tokens) and "fine detail" (auxiliary tokens) aspects.
  • Enables high perceptual quality for separated speech at only 1 kbps, surpassing naive compress/separate pipelines.

Limitations:

  • Current implementation demonstrated only for 2-speaker mixtures.
  • Codec and separation modules are trained independently; there is no fully end-to-end quantization-gradient propagation.
  • Evaluation focuses on non-intrusive metrics; SI-SDR/PESQ are not reported for direct separation-plus-codec comparison.

Potential Extensions:

  • Generalization to mixtures of three or more speakers by increasing the number of anti-consistency branches in BTD.
  • Joint fine-tuning of the codec, BTD, and ATSP using straight-through estimators for quantization.
  • Replacing MDCTCodec with alternative discrete codecs or learned tokenizers (e.g., SoundStream).
  • Augmenting for noise robustness and far-field speech via room simulation.
  • Implementing adaptive bitrate by reducing base token transmission in silence.

A plausible implication is that CodeSep’s token-level disentanglement and auxiliary prediction strategy can facilitate future ultra-low-bitrate applications in speech communication, scalable to more complex acoustic environments, and flexible integration with emerging neural audio codecs (Du et al., 19 Jan 2026).
