Codec2Vec: Discrete Speech Representation
- Codec2Vec is a self-supervised speech representation framework that uses discrete codec units as input, enhancing efficiency, storage, and privacy.
- It replaces continuous acoustic features with precomputed codec tokens, enabling a streamlined Transformer-based masked prediction approach.
- The framework achieves competitive performance on SUPERB evaluations while significantly reducing training time and computational resources.
Codec2Vec is a self-supervised speech representation learning framework that uses only discrete neural speech codec units as input during pretraining, rather than raw waveforms or continuous acoustic features such as log-Mel spectrograms. It was introduced as the first speech representation learning framework to rely exclusively on discrete audio codec units, with the explicit aim of testing whether standard masked prediction remains effective in a purely discrete setting while improving storage, training speed, and privacy (Tseng et al., 20 Nov 2025). Within the broader shift toward treating neural codecs as universal acoustic feature extractors, Codec2Vec recasts codec-token sequences from systems such as DAC as the substrate for general-purpose speech SSL, and thereby separates representation learning from continuous acoustic front ends.
1. Conceptual basis and research motivation
Codec2Vec is motivated by a contrast between two established tendencies in speech representation learning. On one side, modern speech SSL systems such as HuBERT, Wav2Vec 2.0, WavLM, and DinoSR operate on continuous inputs, typically raw waveforms or continuous spectral features. That regime yields strong performance, but it is expensive: raw audio is large to store and transmit, waveforms require a convolutional feature extractor during training, and large-scale SSL becomes both I/O-heavy and compute-heavy (Tseng et al., 20 Nov 2025).
On the other side, neural audio codecs such as DAC, EnCodec, and SoundStream compress speech into discrete token sequences. These tokens are much smaller to store, easier to transmit, and potentially more privacy-preserving because reconstructing the exact original waveform requires the codec model and is lossy. Prior work had shown that codec units can be useful for ASR, speaker tasks, and speech generation, but mostly in supervised or task-specific settings. Codec2Vec formulates the broader question of whether general-purpose self-supervised speech representation learning can be carried out directly on discrete codec units, with no continuous input at all (Tseng et al., 20 Nov 2025).
A central clarification is that Codec2Vec does not introduce a brand-new SSL objective. Its novelty lies in replacing continuous input entirely with codec units, removing the online acoustic feature extractor during SSL pretraining, and shifting expensive feature computation to a one-time offline preprocessing step. This makes the framework more modular and efficient, and better suited to settings where storage and transmission are constrained. A common misconception is therefore that Codec2Vec is primarily an alternative loss function; the paper instead presents it as a discrete-input reformulation of masked-prediction SSL.
2. Input representation and model architecture
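The paper uses DAC as the main codec, specifically its 16 kHz variant. DAC produces $N$ codebook sequences at a rate of 50 Hz. For an input speech signal $x$, the codec maps it to
$$C = \left(c^{(1)}, c^{(2)}, \dots, c^{(N)}\right),$$
where each codebook stream is
$$c^{(n)} = \left(c^{(n)}_1, c^{(n)}_2, \dots, c^{(n)}_T\right),$$
and $T$ is the number of frames (Tseng et al., 20 Nov 2025).
Each discrete code is embedded using a codebook-specific embedding table $E^{(n)}$, yielding
$$e^{(n)}_t = E^{(n)}\!\left[c^{(n)}_t\right].$$
These per-codebook embeddings are then aggregated into a single frame-level sequence,
$$z_t = \mathrm{Agg}\!\left(e^{(1)}_t, \dots, e^{(N)}_t\right), \qquad Z = (z_1, \dots, z_T),$$
and this aggregated sequence is the input to the Transformer encoder (Tseng et al., 20 Nov 2025).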
Two implementation choices are emphasized. First, the embedding layer is initialized with the codec’s own codebook embeddings. Second, quantizer dropout is applied during training by randomly dropping some codebook streams to improve robustness. The pretraining model itself is a Base-sized Transformer encoder with 12 layers and 768-dimensional embeddings. After pretraining, a lightweight downstream head is attached for task-specific evaluation (Tseng et al., 20 Nov 2025).
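The embedding, aggregation, and quantizer-dropout steps can be illustrated with a minimal NumPy sketch. Several choices here are assumptions rather than details from the source: the aggregation is taken to be a sum, the dropout keeps a random prefix of the codebook streams, the embedding tables are random stand-ins (the paper initializes them from the codec's own codebooks), and all sizes and names such as `embed_codec_frames` are hypothetical.

```python
import numpy as np

def embed_codec_frames(codes, tables, n_keep=None):
    """Embed codec tokens and aggregate them into one vector per frame.

    codes:  integer array of shape (N, T), one row per codebook stream.
    tables: list of N arrays of shape (V, D), one embedding table per codebook.
    n_keep: number of leading streams to keep (quantizer dropout); None keeps all.
    """
    n_codebooks, n_frames = codes.shape
    if n_keep is None:
        n_keep = n_codebooks
    dim = tables[0].shape[1]
    frames = np.zeros((n_frames, dim))
    for n in range(n_keep):
        # Row lookup: tables[n][codes[n]] has shape (T, D).
        frames += tables[n][codes[n]]  # assumption: aggregation by summation
    return frames

rng = np.random.default_rng(0)
N, V, D, T = 4, 16, 8, 50  # toy sizes, not the paper's configuration
codes = rng.integers(0, V, size=(N, T))
tables = [rng.normal(size=(V, D)) for _ in range(N)]

full = embed_codec_frames(codes, tables)

# Quantizer dropout (assumed form): keep a random number of leading streams.
n_keep = int(rng.integers(1, N + 1))
dropped = embed_codec_frames(codes, tables, n_keep=n_keep)
```

Either output has shape `(T, D)` and could be passed to a Transformer encoder as the frame-level input sequence.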
This architecture establishes the distinctive modality shift of Codec2Vec. The contextual encoder remains recognizably aligned with contemporary speech SSL practice, but the signal entering that encoder is already discretized and compressed. A plausible implication is that the framework inherits much of the modeling capacity of Transformer-based SSL while relocating the costly acoustic front-end computation to preprocessing.
3. Masked prediction and target derivation
Codec2Vec adopts a masked prediction objective over codec-token embeddings. Given the aggregated input sequence $Z = (z_1, \dots, z_T)$, a subset of time indices $M \subset \{1, \dots, T\}$ is selected randomly, and the corresponding frames are replaced by a learnable mask embedding to form a corrupted sequence $\tilde{Z}$. The paper uses a mask span of 10 frames and selects 8% of the input representation as mask starting points. The Transformer outputs hidden states
$$(h_1, \dots, h_T) = \mathrm{Transformer}(\tilde{Z}),$$
and a projection layer produces the target distribution
$$p(k \mid h_t) = \frac{\exp\left(w_k^{\top} h_t / \tau\right)}{\sum_{k'=1}^{K} \exp\left(w_{k'}^{\top} h_t / \tau\right)},$$
where $k$ is the target class, $w_k$ is the classifier weight for class $k$, $\tau$ is a temperature parameter, and $K$ is the number of classes or targets (Tseng et al., 20 Nov 2025).
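A small sketch may help make the masking and prediction steps concrete. It uses the stated span of 10 frames and 8% start probability. The function names, the dot-product logits, the temperature value, and the toy vectors are assumptions chosen for illustration, not the paper's implementation.

```python
import math
import random

def sample_span_mask(n_frames, start_prob=0.08, span=10, seed=0):
    """Return a boolean list marking masked frames.

    Each frame is chosen as a span start with probability start_prob;
    every start masks the next `span` frames (spans may overlap).
    """
    rng = random.Random(seed)
    masked = [False] * n_frames
    for t in range(n_frames):
        if rng.random() < start_prob:
            for u in range(t, min(t + span, n_frames)):
                masked[u] = True
    return masked

def target_distribution(hidden, class_weights, temperature=0.1):
    """Temperature-scaled softmax over classes for one hidden state.

    Assumption: logits are dot products between the hidden state and
    each class weight vector.
    """
    logits = [
        sum(h * w for h, w in zip(hidden, w_k)) / temperature
        for w_k in class_weights
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

mask = sample_span_mask(500)  # 500 frames is 10 s of audio at 50 Hz
probs = target_distribution(
    [0.2, -0.1, 0.4],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
)
```

In training, the cross-entropy between such a distribution and the target class would typically be computed only at the masked positions.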
A major focus of the paper is that, in a discrete-input SSL setting, the choice of prediction targets matters substantially. The authors compare three strategies.
| Strategy | Mechanism | Reported interpretation |
|---|---|---|
| Reconstruction-based targets | Predict the original masked codec units themselves; multiple projection layers are used to predict multiple codebook sequences | The most “self-contained” discrete formulation |
| Iterative clustering targets | Train first with reconstruction-based targets, extract latent representations, run k-means, use cluster assignments as new frame-level targets, and repeat | Encourages the model to learn more abstract units than raw codec reconstruction |
| Online clustering targets | A teacher model generates cluster assignments online from its own intermediate representations; the student predicts these clusters | The most dynamic and abstract target strategy |
For iterative clustering, the paper uses a 10-hour subset for k-means, 500 clusters, and two rounds of iterative clustering. For online clustering, masking is applied only to the student input, the teacher sees unmasked input, clustering is done on teacher layers 5–12, each layer has a codebook size of 256, and codebook decay is 0.9. The teacher is updated as an exponential moving average of the student (Tseng et al., 20 Nov 2025).
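The target-derivation step of iterative clustering can be sketched as follows. This is a toy version under stated assumptions: plain NumPy Lloyd iterations stand in for the faiss k-means the authors use, random vectors stand in for latent representations extracted from a pretrained model, and the sizes are far smaller than the paper's 500 clusters on a 10-hour subset.

```python
import numpy as np

def kmeans_targets(features, n_clusters, n_iters=20, seed=0):
    """Cluster frame-level features and return (labels, centroids).

    features: array of shape (num_frames, dim).
    The labels serve as new frame-level prediction targets.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    init = rng.choice(len(features), size=n_clusters, replace=False)
    centroids = features[init].copy()
    for _ in range(n_iters):
        # Squared Euclidean distances, shape (num_frames, n_clusters).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:  # leave empty clusters unchanged
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1), centroids

rng = np.random.default_rng(1)
features = rng.normal(size=(200, 16))  # stand-in for latent representations
targets, centroids = kmeans_targets(features, n_clusters=5)
```

Each round of iterative clustering would retrain the model on `targets`, re-extract features, and repeat this step.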
These design choices support one of the paper’s principal interpretive claims: direct reconstruction of codec units is a useful baseline, but clustering-derived targets are more effective for learning semantically useful contextual representations. The framework therefore distinguishes between codec units as compressed observations and the target space required for abstraction.
4. Training protocol and evaluation setting
Pretraining is conducted on the 960-hour LibriSpeech corpus. For the reconstruction-based and iterative clustering experiments, the setup follows HuBERT-style training with a batch size equivalent to 47 minutes of audio, 400k training steps, and a learning rate that warms up to a peak value over the first 32k steps and then decays linearly to zero. To make iterative clustering efficient, the authors use faiss for k-means, reducing clustering time from over a day to a few hours (Tseng et al., 20 Nov 2025).
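The warmup-then-linear-decay schedule can be written as a short function. This is a sketch that assumes a linear warmup; `peak_lr` is a hypothetical parameter, since the peak value is not given here, and the example uses a normalized peak of 1.0.

```python
def learning_rate(step, peak_lr, warmup_steps=32_000, total_steps=400_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Normalized schedule (peak_lr = 1.0 is a placeholder, not the paper's value).
schedule = [learning_rate(s, peak_lr=1.0) for s in (0, 16_000, 32_000, 400_000)]
```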
For online clustering, the setup follows DinoSR with a batch size equivalent to 63 minutes of audio and 400k steps. The EMA decay rate starts at 0.999 and increases to 0.9999 over the first 30k steps. After 200k steps the teacher is frozen, which corresponds to a decay rate of 1.0 (Tseng et al., 20 Nov 2025).
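The teacher schedule and update rule can be sketched in a few lines. The linear ramp between the two decay values is an assumption, as are the function names and the flat-list parameter representation.

```python
def ema_decay(step, start=0.999, end=0.9999, ramp_steps=30_000, freeze_step=200_000):
    """Decay rate for the teacher update at a given training step."""
    if step >= freeze_step:
        return 1.0  # teacher frozen: the update leaves it unchanged
    if step >= ramp_steps:
        return end
    # Assumption: linear interpolation during the ramp.
    return start + (end - start) * step / ramp_steps

def ema_update(teacher, student, decay):
    """Move each teacher parameter toward the corresponding student parameter."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [1.0, 2.0]
student = [3.0, 4.0]
frozen = ema_update(teacher, student, ema_decay(250_000))
```

With a decay of 1.0 the update returns the teacher parameters unchanged, which is what freezing amounts to in this formulation.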
Evaluation is performed on SUPERB, covering phoneme recognition (PR), automatic speech recognition (ASR), keyword spotting (KS), intent classification (IC), slot filling (SF), speaker diarization (SD), speaker verification (SV), and emotion recognition (ER). The tasks are grouped into Content, Semantic, Speaker, and Paralinguistic domains. Baselines include DeCoAR 2.0, HuBERT, and DinoSR, along with a control condition in which a HuBERT-style model is trained with the same target but uses codec tokens instead of waveform input. That control is methodologically important because it isolates the effect of input modality (Tseng et al., 20 Nov 2025).
The paper explicitly states that the goal is not necessarily to beat the best continuous-input SSL models, but to show that discrete-only pretraining can be competitive while being much more efficient. This is an important framing device for interpreting the reported results: performance is analyzed together with computational and storage characteristics rather than as a pure leaderboard exercise.
5. Empirical findings and efficiency profile
The main empirical finding is that discrete input can replace waveform input surprisingly well. In the key control experiment comparing HuBERT target plus waveform input against the same HuBERT target plus discrete codec input, the codec-input version is reported as highly competitive across all tasks, indicating that discrete codec units can serve as a viable replacement for continuous input in SSL (Tseng et al., 20 Nov 2025).
The paper further reports that the plain reconstruction-based version of Codec2Vec performs below the strongest continuous baselines, especially on some content tasks, but that iterative clustering Codec2Vec becomes competitive with waveform HuBERT on most tasks. It even outperforms HuBERT on SF, SD, and ER, while lagging somewhat on ASR and IC. The online clustering version generally surpasses waveform-based HuBERT and approaches DinoSR, but remains slightly below the strongest continuous baseline in some settings (Tseng et al., 20 Nov 2025).
Efficiency gains are a central quantitative result. Using discrete codec units instead of waveform files reduces the LibriSpeech dataset size from 60.4 GB to 3.6 GB, corresponding to a 16.5× storage reduction. For the same 400k-step HuBERT-style training setup, continuous input requires 830 GPU hours while discrete input requires 356 GPU hours, including a one-time codec extraction cost of about 6 GPU hours; this yields a 2.3× training-time reduction (Tseng et al., 20 Nov 2025). The paper attributes the speedup to two factors: the absence of an expensive online convolutional feature extractor and substantially reduced I/O because the data are smaller and easier to cache or load into RAM.
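The reported ratios can be checked with a few lines of arithmetic using the rounded figures quoted above. The recomputed storage ratio comes out slightly above the reported 16.5×; this is presumably due to rounding in the quoted dataset sizes, which is an assumption, since the source does not explain the difference.

```python
waveform_gb, codec_gb = 60.4, 3.6
continuous_gpu_hours, discrete_gpu_hours = 830, 356
extraction_gpu_hours = 6  # one-time codec extraction, included in the 356

storage_ratio = waveform_gb / codec_gb
speedup = continuous_gpu_hours / discrete_gpu_hours
extraction_share = extraction_gpu_hours / discrete_gpu_hours

print(f"storage ratio from rounded sizes: {storage_ratio:.1f}x")
print(f"training-time reduction: {speedup:.1f}x")
print(f"extraction share of discrete budget: {extraction_share:.1%}")
```

The one-time extraction cost is under 2% of the discrete-input budget, so it barely affects the overall speedup.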
The codec choice itself matters. Inputs derived from DAC perform better than the EnCodec variant on the tested tasks. This indicates that not all codec tokenizers are equally suitable for SSL, and that the information preserved by the codec strongly affects downstream quality (Tseng et al., 20 Nov 2025). A common overgeneralization would be to treat “discrete codec units” as a uniform input class; the reported DAC-versus-EnCodec comparison argues against that simplification.
6. Relation to codec-based tokenization and learned audio representations
Codec2Vec belongs to a broader research direction in which neural codecs are treated not only as compressors but also as representation-learning substrates. In this sense, it is closely related to work that evaluates codec token sequences as compact discrete representations beyond waveform reconstruction. A relevant adjacent example is Q2D2, a geometry-aware neural audio codec quantizer that can be interpreted as a learned audio tokenizer or codec representation learner because it produces compact discrete audio tokens and evaluates them for both reconstruction quality and semantic usefulness as learned audio representations (Shuster et al., 1 Dec 2025).
The relationship, however, is structurally asymmetric. Codec2Vec is a self-supervised speech representation learning framework whose pretraining input consists only of discrete codec units, whereas Q2D2 is a geometry-aware quantization scheme that modifies the tokenization layer itself by quantizing pairs of latent channels jointly in 2D rather than introducing a new semantic encoder or contrastive representation model (Shuster et al., 1 Dec 2025). This suggests a useful taxonomy: Codec2Vec studies how to learn contextual speech representations from discrete codec outputs, while Q2D2 studies how to improve the construction of those discrete outputs through quantization geometry.
That distinction is important because “codec-based representation learning” can refer either to downstream SSL over codec sequences or to codec design choices that alter the representational content of the sequence being produced. Codec2Vec occupies the former category. A plausible implication is that stronger tokenizers of the Q2D2 type could eventually improve discrete-only SSL frameworks of the Codec2Vec type, although that integration is not claimed in either paper.
7. Limitations, interpretive boundaries, and broader significance
The paper explicitly identifies several limitations. Codec selection remains open because different codecs preserve different information and the best codec for SSL is not known. Task-specific trade-offs remain, especially for ASR and some other demanding tasks. There is no extensive noisy-condition study, so robustness under real-world noise is not deeply explored. In addition, existing SSL objectives were designed mainly for continuous inputs, which suggests that better objectives may be needed for discrete-token speech SSL (Tseng et al., 20 Nov 2025).
These limitations help delimit what Codec2Vec establishes. It shows that standard masked prediction works in a fully discrete setting and that discrete-only pretraining can achieve competitive SUPERB performance relative to continuous-input baselines while delivering substantial efficiency gains. It does not show that codec-token SSL has closed the performance gap for every speech task, nor does it resolve which codec or objective is optimal (Tseng et al., 20 Nov 2025).
Its broader significance lies in demonstrating that speech foundation models do not necessarily need to begin from raw waveforms. By using codec tokens, pretraining can become more scalable, datasets can be drastically compressed, SSL can be decoupled from continuous acoustic front ends, and discrete speech units become a plausible substrate for general-purpose speech learning (Tseng et al., 20 Nov 2025). For research on large-scale speech modeling, this reframes codec outputs from mere compression artifacts into a computationally economical interface between speech data and contextual representation learning.