CTCBERT: Robust Speech Pretraining
- The paper demonstrates a novel training strategy that replaces the traditional cross-entropy loss with a CTC loss on masked spans, thereby reducing misalignment errors.
- The methodology leverages deduplication of pseudo-label sequences and employs a blank symbol to better handle temporal uncertainties during speech pretraining.
- Empirical results show up to an 11% relative reduction in word error rate, highlighting the practical benefits of integrating CTC loss into HuBERT-based models.
CTCBERT is an extension of the Hidden-unit BERT (HuBERT) self-supervised speech pretraining framework, in which the frame-level cross-entropy (CE) objective traditionally used in HuBERT is replaced with a Connectionist Temporal Classification (CTC) loss applied over masked spans with deduplicated target sequences. This modification is designed to address spurious alignment errors in pseudo-labels produced via clustering or hybrid models and to increase robustness to temporal misalignments during pretraining. CTCBERT presents consistent improvements in empirical word error rate (WER) over both the original HuBERT and PBERT systems across multiple settings (Fan et al., 2022).
1. Model Architecture and Training Objectives
The HuBERT pretraining architecture consists of a convolutional front-end that converts an input waveform into frame-level features $x_t$, followed by a Transformer encoder operating with random span masking, a linear projection, and a codebook of learnable ID embeddings. Conventional HuBERT applies a frame-wise softmax and is trained using a cross-entropy loss over the masked frames,
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t \in M} \log p_t(z_t),$$
where $M$ is the set of masked frame indices, $p_t(k)$ is the softmax probability of ID $k$ at frame $t$, and $z_t$ is a pseudo-label (e.g., K-means cluster) for frame $t$.
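CTCBERT, in contrast, trains on masked spans using a CTC loss, after deduplicating repeated IDs within each span. For a masked span of $T$ frames with target ID sequence $z = (z_1, \dots, z_T)$, let $\tilde{z}$ denote its version with consecutive duplicates removed. The CTC loss is
$$\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(\tilde{z})} \prod_{t=1}^{T} p_t(\pi_t),$$
where $p_t(k)$ is the model's probability of symbol $k$ (an ID or the blank) at frame $t$ of the span, $\mathcal{B}$ denotes the standard CTC collapse map (merging of repeats followed by removal of blanks), and $\mathcal{B}^{-1}(\tilde{z})$ enumerates all frame-level alignments yielding the target sequence. This formulation marginalizes over all monotonic alignments and thus accommodates noise and temporal shifts in ID labels.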
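To make the marginalization concrete, the following minimal sketch deduplicates a span's frame labels and sums the probability of every alignment that collapses to the deduplicated target. It is a brute-force illustration, not the paper's implementation: the blank index `BLANK = 0`, the function names, and the toy posteriors are all assumptions made for this example.

```python
import itertools
import math

BLANK = 0  # assumed index of the CTC blank symbol


def deduplicate(ids):
    """Remove consecutive duplicates: [3, 3, 5, 5, 5, 3] -> [3, 5, 3]."""
    return [k for k, _ in itertools.groupby(ids)]


def collapse(alignment):
    """Standard CTC collapse map: merge repeats, then drop blanks."""
    return [k for k in deduplicate(alignment) if k != BLANK]


def ctc_loss_bruteforce(probs, target):
    """Negative log of the summed probability of all alignments collapsing to target.

    probs[t][k] is the posterior of symbol k at frame t (each row sums to 1).
    The cost is exponential in the span length, so this only suits tiny examples.
    """
    num_symbols = len(probs[0])
    total = 0.0
    for alignment in itertools.product(range(num_symbols), repeat=len(probs)):
        if collapse(alignment) == target:
            total += math.prod(probs[t][k] for t, k in enumerate(alignment))
    return -math.log(total)


# Toy masked span: 3 frames, symbols {0: blank, 1, 2}, frame labels [1, 1, 2].
probs = [[0.1, 0.8, 0.1],
         [0.2, 0.4, 0.4],
         [0.1, 0.1, 0.8]]
target = deduplicate([1, 1, 2])  # the CTC target is [1, 2]
loss = ctc_loss_bruteforce(probs, target)
```

In this toy case five alignments collapse to the target, so the loss rewards any of them rather than a single frame-by-frame labeling.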
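The CTC procedure introduces a blank symbol and employs the standard forward-backward algorithm, recursively defined by the forward variable $\alpha_t(s)$,
$$\alpha_t(s) = \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \mathbb{1}[\ell_s \neq \varnothing \wedge \ell_s \neq \ell_{s-2}]\,\alpha_{t-1}(s-2)\big)\, p_t(\ell_s),$$
where $\ell$ is the deduplicated target with a blank $\varnothing$ inserted before, between, and after its symbols, and $p_t(k)$ is the probability of symbol $k$ at frame $t$, with suitable initialization and backward recursion as in standard CTC.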
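The forward pass of standard CTC can be sketched in a few lines of plain Python. This is a textbook version written in probability space for readability, with an assumed blank index of 0 and hypothetical toy posteriors; practical implementations work in log space for numerical stability.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_forward_loss(probs, target):
    """CTC negative log-likelihood via the forward recursion.

    probs[t][k]: posterior of symbol k at frame t; target: list of non-blank IDs.
    """
    # Extended sequence: a blank before, between, and after the target symbols.
    ext = [BLANK]
    for k in target:
        ext += [k, BLANK]
    S, T = len(ext), len(probs)

    # Initialization: a path may start in the leading blank or the first symbol.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        prev, alpha = alpha, [0.0] * S
        for s in range(S):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # A blank may be skipped only between two different symbols.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][ext[s]]

    # Valid paths end in the last symbol or in the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood)


# Toy span: 3 frames over symbols {0: blank, 1, 2}, deduplicated target [1, 2].
probs = [[0.1, 0.8, 0.1],
         [0.2, 0.4, 0.4],
         [0.1, 0.1, 0.8]]
loss = ctc_forward_loss(probs, [1, 2])
```

The recursion visits each (frame, position) pair once, so the cost grows linearly with the span length instead of exponentially as in explicit enumeration of alignments.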
2. Alignment Robustness and Motivating Analysis
A central motivation is the observation that pseudo-labels for masked spans, whether obtained via clustering or forced alignment, often exhibit spurious run-length errors or misalignments. For example, two frame-level sequences that differ only in where a boundary falls, such as $(a, a, b)$ and $(a, b, b)$, collapse to the same sequence $(a, b)$. Frame-level CE loss requires every frame to match its own label, so the model is penalized whenever it disagrees with a misplaced boundary. CTC, by contrast, sums over all valid monotonic alignments, allowing the model to extract correct contextual information despite ID misplacements.
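The contrast can be illustrated with a small sketch in which two labelings of the same span differ only by a one-frame boundary shift. The posteriors and label values are hypothetical, chosen only to show that frame-level CE changes with the shift while the deduplicated CTC target does not.

```python
import itertools
import math


def deduplicate(ids):
    """Remove consecutive duplicates from a label sequence."""
    return [k for k, _ in itertools.groupby(ids)]


def frame_ce(probs, labels):
    """Frame-level cross-entropy: every frame must match its own label."""
    return -sum(math.log(probs[t][k]) for t, k in enumerate(labels))


# Hypothetical posteriors for a 3-frame span over symbols {0: blank, 1, 2}.
probs = [[0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6],
         [0.1, 0.1, 0.8]]

labels_a = [1, 1, 2]  # boundary placed one frame late
labels_b = [1, 2, 2]  # same content, boundary one frame earlier

ce_a = frame_ce(probs, labels_a)  # penalizes the model's choice at frame 1
ce_b = frame_ce(probs, labels_b)  # agrees with the model at frame 1
same_ctc_target = deduplicate(labels_a) == deduplicate(labels_b)
```

Because both labelings share one deduplicated target, a CTC objective assigns them the same loss, whereas the CE values differ purely because of the boundary position.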
Posterior probability analysis supports this greater flexibility. Models trained with CE and with CTC were evaluated on the log-posterior they assign to ground-truth alignments on the train-clean-100h and train-960h subsets. CE-trained models showed a 3.1% relative drop in log-posterior, whereas CTC-trained models showed a 2.9% drop. This smaller degradation for CTC suggests enhanced tolerance to training-label noise (Fan et al., 2022).
3. Experimental Protocol and Stabilization Strategies
Pretraining employed the full 960 hours of LibriSpeech audio without transcripts, followed by finetuning on 100 hours of train-clean-100 with associated labels. Evaluation included both dev/test clean subsets and the more challenging "other" subsets.
Pseudo-labels for pretraining were derived in three ways:
- HuBERT Iter1: K-means clusters (500) on MFCCs,
- HuBERT Iter2: K-means on mid-layer HuBERT representations,
- PBERT: Context-dependent phoneme IDs produced by a TDNN-F hybrid model with LF-MMI training.
Key pretraining parameters included the HuBERT-Base architecture (12-layer Transformer), a masked span length of 10 frames, 8% masking probability, a total batch of up to 87.5 s of audio per GPU across 32 GPUs, and 400,000 training steps. AdamW was used for optimization, with the learning rate linearly warmed up to its peak value and then decayed.
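As a rough illustration of span masking with these settings, the sketch below treats the 8% figure as the probability that a frame starts a 10-frame span, following the usual HuBERT convention; that reading, the independent per-frame sampling, and the function names are assumptions rather than details confirmed by the source. The second helper groups masked frames into contiguous spans, which is the unit a span-level CTC loss would operate on.

```python
import random


def sample_span_mask(num_frames, start_prob=0.08, span_len=10, seed=None):
    """Return a boolean list where True marks a masked frame.

    Each frame becomes a span start with probability start_prob, and the
    span_len frames from that start are masked; spans may overlap.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def masked_spans(mask):
    """Group masked frames into (start, end) index pairs, end exclusive."""
    spans, start = [], None
    for t, is_masked in enumerate(mask + [False]):
        if is_masked and start is None:
            start = t
        elif not is_masked and start is not None:
            spans.append((start, t))
            start = None
    return spans


mask = sample_span_mask(200, seed=0)
spans = masked_spans(mask)
```

Overlapping starts merge into longer spans, so the contiguous masked segments can exceed 10 frames.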
To address CTC convergence instability, two strategies were effective:
- CE Warmup: Initial training (32,000 steps) with CE, then switching to pure CTC.
- Joint CE+CTC Training: Simultaneously optimizing a weighted combination of the CE and CTC losses, with the weighting chosen empirically.
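The two strategies can be sketched as a single loss-selection function. The convex-combination form and the `ce_weight` parameter are assumptions made for illustration, and the default value of 0.5 is a placeholder, not the setting reported in the paper.

```python
def pretraining_loss(ce_loss, ctc_loss, step, strategy,
                     warmup_steps=32_000, ce_weight=0.5):
    """Combine per-batch CE and CTC losses under one stabilization strategy.

    ce_weight is a hypothetical mixing weight; its default is a placeholder.
    """
    if strategy == "ce_warmup":
        # Pure CE for the first warmup_steps updates, pure CTC afterwards.
        return ce_loss if step < warmup_steps else ctc_loss
    if strategy == "joint":
        # Assumed form: convex combination of both objectives at every step.
        return ce_weight * ce_loss + (1.0 - ce_weight) * ctc_loss
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For example, `pretraining_loss(2.0, 4.0, step=0, strategy="ce_warmup")` returns the CE value, while the same call at `step=32_000` returns the CTC value.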
Finetuning was conducted with frozen convolutional features, CTC loss, tri-stage learning rate schedule, and beam search decoding (beam=1500) with a 4-gram LM.
CTC requires an explicit blank symbol, which introduces a distinct embedding and projection. The blank's parameters can be extracted from the pretrained model's output layer and used to initialize the blank in the finetuning output layer, providing small but consistent gains.
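A minimal sketch of this transfer is shown below, assuming each output layer is stored as a plain dictionary with one weight row and one bias value per output symbol. That layout, the function name, and the default blank indices are hypothetical and stand in for whatever parameter format a real toolkit uses.

```python
def init_blank_from_pretrained(pretrained_out, finetune_out,
                               pretrain_blank=0, finetune_blank=0):
    """Copy the blank's weight row and bias from one output layer to another.

    Each layer is a dict with "weight" (list of rows, one per output symbol)
    and "bias" (one value per output symbol). Other rows are left untouched.
    """
    src_row = pretrained_out["weight"][pretrain_blank]
    dst_row = finetune_out["weight"][finetune_blank]
    if len(src_row) != len(dst_row):
        raise ValueError("hidden sizes of the two output layers differ")
    finetune_out["weight"][finetune_blank] = list(src_row)
    finetune_out["bias"][finetune_blank] = pretrained_out["bias"][pretrain_blank]
    return finetune_out


# Toy layers: 2 pretraining symbols and 3 finetuning symbols, hidden size 2.
pretrained = {"weight": [[1.0, 2.0], [3.0, 4.0]], "bias": [0.5, -0.5]}
finetune = {"weight": [[0.0, 0.0], [9.0, 9.0], [8.0, 8.0]], "bias": [0.0, 0.1, 0.2]}
finetune = init_blank_from_pretrained(pretrained, finetune)
```

Only the blank entry is shared between the two vocabularies, so the remaining finetuning rows keep their fresh initialization.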
4. Empirical Results
Performance was assessed by WER on the LibriSpeech "test-other" subset, both without a language model (No LM) and with a 4-gram LM. Key results are summarized below.
| Model | test-other No LM | Relative Δ | test-other 4-gram LM | Relative Δ |
|---|---|---|---|---|
| HuBERT iter1 (CE) | 16.5% | – | 9.5% | – |
| CTCBERT | 15.8% | 4.2% | 9.4% | 1.1% |
| CTCBERT +blank init | 15.7% | 4.8% | 9.3% | 2.1% |
| HuBERT iter2 (CE) | 13.6% | – | 8.7% | – |
| CTCBERT | 12.1% | 11.0% | 7.9% | 9.2% |
| CTCBERT +blank init | 12.1% | 11.0% | 7.9% | 9.2% |
| PBERT (CE) | 7.7% | – | 7.7% | – |
| CTCBERT | 7.5% | 2.6% | 7.5% | 2.6% |
| CTCBERT +blank init | 7.4% | 3.9% | 7.4% | 3.9% |
CTCBERT yields consistent relative WER reductions of 2–11% compared to baseline models, with blank parameter initialization providing small additional improvements in most settings. Among the convergence strategies, joint CE+CTC training at its best weighting always performed best.
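The relative reductions in the table follow from the absolute WER values. The short sketch below recomputes the No LM column for plain CTCBERT; the helper name is an assumption, and the WER figures are copied from the table.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, rounded to one decimal place."""
    return round(100.0 * (baseline_wer - new_wer) / baseline_wer, 1)


# (baseline, CTCBERT) WER pairs on test-other without an LM, from the table.
pairs = {
    "HuBERT iter1": (16.5, 15.8),
    "HuBERT iter2": (13.6, 12.1),
    "PBERT": (7.7, 7.5),
}
reductions = {name: relative_wer_reduction(b, c) for name, (b, c) in pairs.items()}
# reductions == {"HuBERT iter1": 4.2, "HuBERT iter2": 11.0, "PBERT": 2.6}
```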
5. Interpretation and Future Directions
The segment-level CTC loss confers robustness to timing offsets and run-length errors arising from imperfect clustering or alignment in label generation. By marginalizing over monotonic alignments, CTC enables the underlying Transformer context representations to better accommodate label noise and temporal uncertainty.
CTCBERT’s design also naturally introduces a blank embedding, which transfers effectively to downstream CTC finetuning stages—yielding modest additional improvements.
Several future research directions are identified:
- Extension of the CTC objective to continuous target spaces, such as codebook or feature representations.
- Identification of optimal CE/CTC mixing schedules and curricula for accelerated convergence.
- Evaluation of CTCBERT in low-resource or cross-lingual conditions with substantial alignment noise.
- Integration with sequence-to-sequence finetuning regimes to combine the strengths of both approaches.
In summary, CTCBERT demonstrates that incorporating the alignment-marginalizing properties of CTC loss in masked span prediction brings measurable performance gains and improved robustness in self-supervised speech pretraining (Fan et al., 2022).