Speech to Units (S2U) in Speech Processing
- Speech-to-Units (S2U) is a paradigm that converts continuous speech waveforms into discrete, text-like symbolic tokens for robust downstream processing.
- It leverages self-supervised learning models and quantization techniques such as k-means, with codebook size chosen to balance granularity and efficiency in tasks such as translation and synthesis.
- Post-processing methods like run-length deduplication and subword modeling further compress unit sequences, enabling effective control over bitrate and modeling complexity.
Speech-to-Units (S2U) is a paradigm in speech processing that transforms continuous speech waveforms into sequences of discrete symbols ("units"), enabling a highly compressed, text-like representation of spoken language suitable for downstream modeling tasks. S2U leverages self-supervised learning (SSL) representations and differentiable or non-differentiable quantization schemes to replace high-dimensional, redundant acoustic features with compact, symbolic tokens. This discrete representation underpins advances in speech-to-speech translation (S2ST), textless speech processing, speech synthesis, robust modeling for low-resource languages, and efficient speech understanding.
1. Principles and Formalization of Speech-to-Units
S2U is defined by its pipeline architecture: raw speech input is mapped—typically via a deep SSL model such as HuBERT or wav2vec 2.0—to a sequence of continuous acoustic embeddings. Each embedding is subsequently discretized, most often by k-means clustering, yielding integer-valued unit tokens. Mathematically, for a sequence of frame-level embeddings $\mathbf{x}_1, \dots, \mathbf{x}_T \in \mathbb{R}^d$ (with $d = 768$–$1024$ depending on the encoder), discrete code indices are assigned as $u_t = \arg\min_{k \in \{1, \dots, K\}} \lVert \mathbf{x}_t - \mathbf{c}_k \rVert_2^2$ over cluster centroids $\{\mathbf{c}_k\}_{k=1}^{K}$ determined by minimizing the k-means objective (Duret et al., 2024).
The mapped sequence $u_1, \dots, u_T$ constitutes the unitized representation of the input waveform. Downstream tasks treat this symbolic sequence analogously to text in natural language processing pipelines.
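The discretization step above can be sketched as a nearest-centroid lookup over a pretrained codebook. The following is a minimal NumPy illustration; the centroid matrix here is a toy stand-in for one actually trained by k-means on SSL features:

```python
import numpy as np

def assign_units(frames: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map frame-level SSL embeddings (T, d) to discrete unit indices (T,)
    by nearest-centroid assignment, i.e. u_t = argmin_k ||x_t - c_k||^2."""
    # (T, K) matrix of squared Euclidean distances to each centroid
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy example: four 2-D "embeddings" against a codebook of K=2 centroids
frames = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
units = assign_units(frames, centroids)
print(units.tolist())  # [0, 0, 1, 1]
```

In practice `frames` would come from a mid-layer of the SSL encoder and `centroids` from k-means fit on a large unlabeled corpus.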
2. Quantization Strategies and Unit Inventory Design
Unitization hinges on the choice of quantization mechanism and hyperparameters, particularly the codebook size $K$:
- K-means Clustering: The dominant unsupervised strategy, with codebook sizes typically in the hundreds to low thousands. Smaller $K$ yields coarse representations that may under-segment phonetic distinctions, while larger $K$ provides finer granularity but risks overfitting to speaker and acoustic variation, as well as increased model size and token rates (Duret et al., 2024, Chang et al., 2023).
- Vector Quantization Variants: Some frameworks employ learnable VQ and FSQ (finite scalar quantization; as in ToneUnit) (Tao et al., 2024). VQ may achieve high codebook usage if well-regularized, but can collapse in low-data regimes, while FSQ guarantees full codebook occupancy, yielding stable token distributions that enhance both downstream intelligibility and naturalness.
- CTC-Supervised Discretization: In the tone-aware regime (e.g., Mandarin), units can be informed by explicit sequence alignment between quantized embeddings and annotated syllabic or tonal labels, eliminating the "tone shift" phenomenon in TTS synthesis (Tao et al., 2024).
- Coarse Unit Extraction: For extremely efficient representations, segmentation can be driven by self-supervised loss prediction (e.g., LossPred in SyllableLM), producing syllable-like or semantic units at token rates an order of magnitude lower than frame-level quantization (Baade et al., 2024).
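As an illustration of the FSQ idea mentioned above (a sketch of the general technique, not ToneUnit's exact implementation), each latent dimension is bounded and snapped to a small fixed grid of levels, so every code in the implicit codebook of size $L^d$ is reachable by construction:

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: int = 5) -> np.ndarray:
    """Finite scalar quantization sketch: bound each dimension to (-1, 1)
    with tanh, then round to one of `levels` evenly spaced values.
    The implicit codebook size is levels ** z.shape[-1], and full codebook
    occupancy is guaranteed because every grid point is reachable."""
    half = (levels - 1) / 2
    bounded = np.tanh(z)                    # squash to (-1, 1)
    return np.round(bounded * half) / half  # snap to the level grid

z = np.array([[-2.0, 0.1, 3.0]])  # one 3-D latent vector
print(fsq_quantize(z, levels=5))  # [[-1.  0.  1.]]
```

A straight-through estimator would be added around the rounding step to make this trainable; that detail is omitted here.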
3. Post-Processing and Unit Sequence Compression
Following quantization, further sequence length reduction is achieved by:
- Run-length Deduplication: Consecutive identical tokens are collapsed into a single instance, reducing redundancy inherent in stationary speech sections.
- Subword Modeling (BPE/SentencePiece): Learned subword merges can reduce sequence length by 20–30%, with vocabulary sizes ranging from 2,000 to 10,000 unit-words. N-gram language modeling over unit sequences, as in unit language modeling frameworks, organizes units into pseudo-text facilitating downstream LLM compatibility (Chang et al., 2023, Zhang et al., 21 May 2025).
These steps lower sequence length and modeling cost, and control the effective bitrate (bits/s) of the representation.
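The deduplication step and the resulting effective bitrate can be sketched as follows (the 50 Hz frame rate and $K=1000$ codebook are illustrative values, not prescribed by the source):

```python
from itertools import groupby
from math import log2

def deduplicate(units):
    """Run-length deduplication: collapse consecutive identical tokens."""
    return [u for u, _ in groupby(units)]

def bitrate(num_tokens, duration_s, codebook_size):
    """Effective bitrate in bits/s of a unit sequence over a clip."""
    return num_tokens * log2(codebook_size) / duration_s

units = [3, 3, 3, 7, 7, 1, 3, 3]   # 8 frames of a toy unit stream
dedup = deduplicate(units)
print(dedup)  # [3, 7, 1, 3]

# 8 frames at 50 Hz with K=1000: dedup halves the bitrate for this clip
duration = len(units) / 50
print(round(bitrate(len(units), duration, 1000)))  # 498 bits/s raw
print(round(bitrate(len(dedup), duration, 1000)))  # 249 bits/s after dedup
```

BPE merges over the deduplicated stream would reduce the token count (and hence bitrate) further, at the cost of a larger vocabulary.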
4. Evaluation Metrics and Empirical Trade-Offs
Assessing S2U systems involves both resynthesis metrics and downstream task performance:
| Metric | Description | Key Use |
|---|---|---|
| Mel-cepstral Distortion (MCD) | Quantifies spectral error between original and resynthesized speech | Synthesis |
| Mean Opinion Score (MOS) | Subjective/human-rated naturalness of synthesized waveform | Synthesis |
| Character Error Rate (CER) | Character-level ASR error when decoding unit sequences | ASR, translation |
| BLEU Score | Translation quality scored on ASR transcripts of generated speech | Translation |
| Equal Error Rate (EER) | Speaker verification error | Speaker verif. |
| Classification accuracy | Emotion or attribute recognition over units | Paralinguistics |
A central empirical finding is that the optimal configuration for one metric may not transfer to another: maximal MOS (e.g., K=512 with HuBERT-Base, MOS≈3.45) does not maximize BLEU for translation (which peaks at K=1024, BLEU≈20.1) (Duret et al., 2024). The weak correlation (ρ≈+0.2) between MOS and BLEU indicates that synthesis quality is a poor proxy for translation efficacy, while BLEU and CER are moderately inversely correlated (ρ≈−0.5), as expected since lower character error accompanies better translation.
For S2ST applications, S2U configurations that best preserve linguistic content (minimizing CER) in moderate codebook sizes strike a practical balance, as opposed to maximizing acoustic fidelity or paralinguistic attribute retention (Duret et al., 2024).
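For reference, a common form of the MCD computation from the table above can be sketched as follows (exact variants differ in coefficient range, DTW alignment, and $c_0$ handling; this assumes pre-aligned mel-cepstra with the energy coefficient $c_0$ already dropped):

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref: np.ndarray, mcep_syn: np.ndarray) -> float:
    """Frame-averaged MCD in dB between time-aligned mel-cepstra (T, M),
    using the standard (10 / ln 10) * sqrt(2 * sum of squared diffs) form."""
    diff = mcep_ref - mcep_syn
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return float(per_frame.mean())

# Identical cepstra give zero distortion; larger values mean worse resynthesis
ref = np.random.default_rng(0).normal(size=(100, 13))
print(mel_cepstral_distortion(ref, ref))  # 0.0
```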
5. Model Architectures and Downstream Integration
The canonical S2U-involving model takes the following form:
- Encoder: Conformer, E-Branchformer, or Transformer stack processes (optionally subsampled) units. Pre-trained SSL models (HuBERT, Wav2Vec 2.0, SAMU-XLSR, or WavLM) are often used for initial feature extraction (Inaguma et al., 2022, Chang et al., 2023, Zhang et al., 2022).
- Decoder: Transformer or causal attention-based decoder produces target sequences—discrete units, subwords, or text—via autoregressive sequence-to-sequence modeling (Lee et al., 2021, Rashidi et al., 16 Nov 2025).
- Multi-Task Heads: Additional tasks such as unit language modeling, sentiment, or entity recognition can leverage S2U embeddings directly (Zhang et al., 21 May 2025, Chou et al., 2023).
Speech synthesis from discrete units operates via a unit-based vocoder, typically a HiFi-GAN variant, which embeds unit tokens and generates waveform samples (Rashidi et al., 16 Nov 2025, Inaguma et al., 2022).
S2U backbones are used as drop-in replacements for continuous features in models for ASR, spoken translation, language understanding, and TTS—yielding substantial compute and memory reductions (Duret et al., 2024, Chang et al., 2023, Baade et al., 2024).
6. Multilingual and Specialized Extensions
Multilingual S2U systems (S2MU, S2U with masking) employ family- or language-specific codebooks and vocabulary masking to minimize cross-lingual interference. Decoder outputs are constrained to valid target language units during training and inference (Gong et al., 2023).
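The vocabulary-masking idea can be sketched as constraining decoder logits so that only the target language's unit inventory receives probability mass (the unit IDs and codebook size here are illustrative):

```python
import numpy as np

def mask_logits(logits: np.ndarray, valid_ids: set) -> np.ndarray:
    """Constrain decoding to a target language's unit inventory by setting
    logits of out-of-language units to -inf before softmax/argmax."""
    masked = np.full_like(logits, -np.inf)
    idx = sorted(valid_ids)
    masked[..., idx] = logits[..., idx]
    return masked

logits = np.array([2.0, 5.0, 1.0, 4.0])  # scores over a 4-unit global codebook
masked = mask_logits(logits, valid_ids={0, 2})  # target language owns units 0, 2
print(int(masked.argmax()))  # 0  (unmasked argmax would pick unit 1)
```

Applying the mask at both training and inference time mirrors the constraint described above.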
Tonal-aware unitization is accomplished by supervised CTC heads forcing code separability across tonal variants, critical for languages such as Mandarin (Tao et al., 2024).
Unit language modeling (n-gram merges, SentencePiece) enables the composition of S2U sequences into units analogous to text tokens, facilitating cross-modal and mixed-modality language modeling (Zhang et al., 21 May 2025, Chou et al., 2023).
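A toy sketch of the BPE-style merging that composes frequent unit pairs into "unit words" (the IDs and corpus are illustrative; production systems would use SentencePiece or a full BPE trainer):

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent unit pairs across sequences; the top pair becomes
    the next merged 'unit word', as in BPE over pseudo-text."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

corpus = [[5, 9, 5, 9, 2], [5, 9, 7]]
pair = most_frequent_pair(corpus)  # (5, 9) occurs three times
merged = [merge_pair(s, pair, new_id=100) for s in corpus]
print(merged)  # [[100, 100, 2], [100, 7]]
```

Iterating this merge step builds the unit-word vocabulary over which n-gram or neural language models are then trained.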
In specialized clinical settings, S2U is integrated into dysarthric speech normalization and reconstruction, leveraging CTC-based normalization onto reference speaker unit vocabularies for superior content restoration and intelligibility (Wang et al., 2024).
7. Limitations, Open Questions, and Best Practices
- Granularity Selection: Determining the optimal codebook size $K$ remains empirical, with recommendations converging on values up to $2000$ for monolingual tasks and higher for cross-lingual setups, but often requiring validation against end-to-end task scores rather than synthesis metrics alone (Duret et al., 2024).
- Semantic Compression vs. Prosodic Detail: Coarse, syllable-level units drastically reduce sequence length and wall-clock modeling cost, but may obscure local prosody or tonal information, motivating ongoing research into hybrid tokenization schemes (Baade et al., 2024).
- Speaker and Emotion Factors: High-fidelity preservation of non-verbal attributes (speaker, emotion) is weakly correlated or even orthogonal to translation/ASR accuracy; specialized approaches—multistream quantization, explicit embeddings—may be necessary for paralinguistic-sensitive applications (Duret et al., 2024).
- Low-Resource and Tonal Languages: S2U frameworks can be adapted to low-resource and tonal contexts with minimal labeled data via tailored quantization and supervision (Tao et al., 2024, Rashidi et al., 16 Nov 2025).
- Bitrate Efficiency: Symbolic tokenization allows rigorous control of the trade-off between bitrate, compute, and output fidelity, with fine-grained tunability achieved through BPE/subword merges and segment rate adjustment (Chang et al., 2024, Baade et al., 2024).
Empirical guidelines recommend: (1) extracting units from mid-to-high SSL layers; (2) setting $K=1024$ for translation-centric pipelines and $K=512$ for TTS/ASR generality; (3) validating configurations via full end-to-end BLEU or WER; (4) leveraging BPE/sequence compaction to control modeling complexity (Duret et al., 2024, Chang et al., 2023, Zhang et al., 21 May 2025).
The S2U paradigm, propelled by advances in SSL model transfer and downstream discrete modeling, bridges the gap between speech and text processing. It enables new regimes of model efficiency, cross-lingual transfer, unsupervised discovery, and application in scenarios where text transcripts are unavailable or impractical. Ongoing research continues to delineate best practices for codebook design, hybrid tokenization, and task-specific optimization.