Discrete Speech Units Overview

Updated 14 December 2025
  • Discrete Speech Units (DSU) are quantized speech representations derived via self-supervised neural encoders that convert continuous features into discrete indices.
  • They are generated by extracting frame-level features using models like CPC and HuBERT and then applying clustering or VQ-VAE techniques for quantization.
  • Recent advances with soft DSUs preserve finer phonetic details and improve intelligibility and naturalness in applications such as voice conversion and alignment.

Discrete Speech Units (DSU) are symbolic, frame-level representations derived from speech by quantizing high-dimensional continuous features into a finite set of cluster indices. Generated from self-supervised neural encoders such as CPC or HuBERT, DSUs encode coarse phonetic information while discarding speaker and prosodic cues. In voice conversion, DSUs offer powerful speaker anonymization but suffer from phonetic collapses and pronunciation errors. This motivates extensions such as soft units—graded, uncertainty-aware distributions over discrete codebooks—which more faithfully preserve linguistic content, enhance intelligibility, and improve naturalness in downstream synthesis (Niekerk et al., 2021).

1. Self-Supervised Extraction and Quantization of DSUs

The standard DSU pipeline consists of two main steps: (1) extracting frame-level features from a large, unlabeled corpus via self-supervised encoders, and (2) quantizing these features into discrete token indices. Two backbones are commonly employed (a minimal pipeline sketch follows the list):

  • CPC-big: Convolutional encoder with a stack of LSTMs; features taken from the final LSTM.
  • HuBERT-Base: Convolutional frontend followed by Transformer layers; features typically extracted from layer 7.
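A minimal sketch of this two-step pipeline, assuming a hypothetical `extract_features` function standing in for a frozen CPC or HuBERT forward pass (layer choice as noted above) and using scikit-learn's k-means for the codebook:

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_features(wav: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a frozen SSL encoder forward pass
    (e.g., HuBERT-Base layer 7 or the final CPC-big LSTM).
    Returns (T, D) frame-level features."""
    raise NotImplementedError  # plug in a real encoder here

def build_codebook(unlabeled_wavs, k: int = 100) -> KMeans:
    """Step (2): fit a K-entry codebook on pooled corpus features."""
    feats = np.concatenate([extract_features(w) for w in unlabeled_wavs])
    return KMeans(n_clusters=k, n_init=10).fit(feats)

def to_discrete_units(wav: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Inference: assign each frame its nearest centroid index."""
    return codebook.predict(extract_features(wav))  # (T,) token stream
```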

Given frame-level features $z_t \in \mathbb{R}^D$, clustering (most often $K$-means) is applied:

$$L_{\text{kmeans}} = \sum_t \|z_t - e_{c_t}\|^2, \quad c_t = \arg\min_k \|z_t - e_k\|^2$$

At inference, each frame is assigned a discrete index $d_t = c_t$, yielding a token stream $\langle d_1, \ldots, d_T \rangle$. Alternatively, Vector-Quantized Variational Autoencoders (VQ-VAEs) perform joint encoder and codebook learning with a differentiable commitment loss:

$$L_{\text{VQ}} = \| \mathrm{sg}[E(x)] - e_k \|^2 + \beta \| E(x) - \mathrm{sg}[e_k] \|^2$$
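A compact PyTorch sketch of this quantization step, offered as an illustration rather than the original implementation: `detach()` plays the role of the stop-gradient operator sg, and a straight-through estimator passes decoder gradients back to the encoder (codebook size, feature dimension, and β are illustrative defaults):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """VQ layer: nearest-codebook lookup plus the two loss terms above."""
    def __init__(self, num_codes: int = 100, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # codebook vectors e_k
        self.beta = beta

    def forward(self, z):  # z = E(x), shape (T, dim)
        dists = torch.cdist(z, self.codebook.weight)  # (T, K) distances
        idx = dists.argmin(dim=-1)                    # c_t = argmin_k ||z_t - e_k||
        e = self.codebook(idx)                        # quantized vectors e_{c_t}
        # Codebook term ||sg[E(x)] - e_k||^2 plus commitment term beta*||E(x) - sg[e_k]||^2:
        loss = F.mse_loss(z.detach(), e) + self.beta * F.mse_loss(z, e.detach())
        z_q = z + (e - z).detach()  # straight-through: forward uses e, gradient flows to z
        return z_q, idx, loss
```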

Soft DSUs represent distributions over codebook vectors using a temperature-scaled softmax of cosine similarity, which allows downstream models to access richer probabilistic content (Niekerk et al., 2021).
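Concretely, a formulation consistent with this description assigns each frame a distribution over the $K$ codebook entries via a temperature $\tau$:

$$p_t(k) = \frac{\exp\big(\cos(z_t, e_k)/\tau\big)}{\sum_{k'=1}^{K} \exp\big(\cos(z_t, e_{k'})/\tau\big)}$$

As $\tau \to 0$ this recovers the hard assignment $c_t$; larger temperatures retain more of the fine phonetic detail that hard quantization discards.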

2. Linguistic Content, Speaker-Invariance, and Information Loss

DSUs act as an information bottleneck: complex, speaker-conditioned feature spaces are compressed into a small number of clusters. This compression is highly effective for speaker anonymization: HuBERT-Discrete achieves an equal error rate (EER) of 49.8%, near the 50% chance level, indicating that negligible speaker information remains (Niekerk et al., 2021). However, hard quantization discards fine phonetic details, increasing phoneme error rate (PER) and word error rate (WER):

  • HuBERT-Discrete: PER = 10.4%, WER = 5.4%
  • Ground truth: PER ≈ 7.9%, WER ≈ 2.0%

Phonetic misassignments are concentrated in confusable fricatives and affricates (e.g., /f/ vs /θ/, /ʃ/ vs /tʃ/, /ʒ/), with elevated PER for those classes.

Soft units recover much of the lost content, halving WER for HuBERT (5.4% → 2.6%) and improving MOS by 0.46 points over hard quantization (Niekerk et al., 2021).

3. DSUs in Voice Conversion, Synthesis, and Alignment

In voice conversion pipelines, DSUs serve as bottleneck features that strip out speaker identity while retaining an encoding of the input content. A Tacotron-style encoder-decoder maps DSUs to mel-spectrograms, which a HiFi-GAN vocoder then converts to waveforms. Systems are trained and evaluated with PER/WER computed via ASR backbones and with MOS for naturalness. For alignment tasks (e.g., automatic voice-over), DSU supervision enables direct, per-frame classification rather than indirect acoustic reconstruction, yielding substantial gains in lip-speech synchronization and naturalness versus baseline mel-based objectives (Lu et al., 2023).
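As a schematic illustration (not the authors' exact architecture; the LSTM encoder and all layer sizes are assumptions), a minimal DSU-to-mel acoustic model embeds the token indices, contextualizes them, and projects to mel bins for a downstream vocoder:

```python
import torch
import torch.nn as nn

class DSUAcousticModel(nn.Module):
    """Schematic DSU-to-mel model; layer sizes are illustrative."""
    def __init__(self, num_units: int = 100, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)  # DSU index -> vector
        self.encoder = nn.LSTM(dim, dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.to_mel = nn.Linear(2 * dim, n_mels)   # per-frame mel prediction

    def forward(self, units):  # units: (B, T) integer DSU stream
        h, _ = self.encoder(self.embed(units))
        return self.to_mel(h)  # (B, T, n_mels), fed to a vocoder such as HiFi-GAN
```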

Soft units, created via distributional prediction over the codebook, not only improve intelligibility and speaker similarity but also model prosodic uncertainty, further boosting naturalness (Niekerk et al., 2021).

4. Empirical Evaluation, Cross-Lingual Generalization, and Limitations

Experimental results demonstrate:

  • Intelligibility: Soft units roughly halve WER compared to discrete units; CPC-based systems drop from 8.1% to 3.7%.
  • Speaker similarity: EER for soft units remains near the chance-level anonymization of hard units (HuBERT-Soft: 45.6%, CPC-Soft: 41.3%).
  • Naturalness: HuBERT-Discrete (MOS = 3.69±0.13), HuBERT-Soft (MOS = 4.15±0.12), ground truth (MOS = 4.57±0.10).
  • Cross-lingual transfer: On French, HuBERT-Discrete WER = 64.6% versus HuBERT-Soft = 28.2%; on Afrikaans, discrete = 24.7% versus soft = 12.9%. Soft units dramatically improve cross-lingual intelligibility at the cost of slight accent leakage.

Key limitations include:

  • Hard DSUs discard fine phonetic details, affecting pronunciation fidelity.
  • Soft DSUs partially recover those details at the cost of minor speaker or accent leakage.
  • The clustering-based codebook design may be suboptimal for out-of-domain or cross-lingual setups; end-to-end learning of codebooks is a potential direction.

Guidelines supported by evidence:

  • Use discrete units (via clustering or VQ-VAE) for tasks where speaker anonymization is critical, accepting a moderate increase in phonetic error rates.
  • Prefer soft DSUs (trained via cross-entropy against discrete codes; see the sketch after this list) for downstream tasks prioritizing intelligibility and naturalness, especially where prosodic fidelity is desired and minor speaker leakage is permissible (Niekerk et al., 2021).
  • For voice-conversion and alignment tasks, direct DSU classification yields robust, fast convergence and better alignment than indirect acoustic reconstruction (Lu et al., 2023).
  • In zero-resource or cross-lingual systems, soft units facilitate better transfer of linguistic content and generalization to unseen languages.
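A minimal sketch of the soft-DSU training recipe referenced above, assuming per-frame k-means indices from the hard pipeline as targets; the linear head, feature dimension, and temperature are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftUnitHead(nn.Module):
    """Predicts a distribution over K discrete codes from backbone features;
    the softened distribution itself is the 'soft unit' used downstream."""
    def __init__(self, dim: int = 768, num_codes: int = 100, tau: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(dim, num_codes)
        self.tau = tau

    def forward(self, feats):  # feats: (B, T, dim) backbone features
        return self.proj(feats)  # (B, T, K) logits over codes

    def soft_units(self, feats):
        return F.softmax(self.forward(feats) / self.tau, dim=-1)

# Training step: cross-entropy against the discrete k-means targets.
head = SoftUnitHead()
feats = torch.randn(4, 200, 768)            # dummy backbone features
targets = torch.randint(0, 100, (4, 200))   # dummy k-means indices
loss = F.cross_entropy(head(feats).flatten(0, 1), targets.flatten())
```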

Recommended research directions include integrating prosody prediction (F0, energy) with DSUs, exploring multi-speaker and cross-lingual codebooks, and developing joint codebook learning via differentiable quantization objectives to further optimize the trade-off between discreteness, fidelity, and speaker invariance (Niekerk et al., 2021, Lu et al., 2023).

5. Summary Table: DSU Extraction and Performance Metrics

| Backbone | Clustering Method | K (Codebook Size) | PER (%) | WER (%) | EER (%) | MOS |
|---|---|---|---|---|---|---|
| HuBERT | k-means | 100 | 10.4 | 5.4 | 49.8 | 3.69±0.13 |
| CPC | k-means | 100 | -- | 8.1 | -- | -- |
| HuBERT-Soft | softmax (cosine) | 100 | 7.8 | 2.6 | 45.6 | 4.15±0.12 |
| CPC-Soft | softmax (cosine) | 100 | -- | 3.7 | 41.3 | -- |

In conclusion, Discrete Speech Units are a foundational representation for non-textual, self-supervised speech modeling—enabling controllable, speaker-independent encoding of linguistic content, with the capability to recover intelligibility and naturalness via soft, uncertainty-aware extensions (Niekerk et al., 2021).
