Joint CTC–USDM Decoding Framework
- The paper introduces a hybrid joint CTC–USDM decoding framework that fuses framewise acoustic scores with sequence-level language modeling to improve recognition accuracy.
- The method employs a reverse diffusion process and joint scoring function with optimized hyperparameters, balancing CTC and USDM contributions.
- Empirical evaluations on LibriSpeech dev-other show WER improvements over greedy CTC decoding and USDM rescoring, with the joint approach reaching a WER as low as 4.71% under optimal tuning.
The Joint CTC–USDM decoding framework is a hybrid speech recognition decoding method that integrates Connectionist Temporal Classification (CTC) acoustic models and Uniform-State Diffusion Models (USDMs) to produce hypotheses benefiting from both strong framewise acoustic information and contextual sequence-level language modeling. This method constructs candidate hypotheses by fusing the CTC-derived framewise probability distributions with the labelwise probability outputs of USDM at each step of the reverse diffusion process, resulting in improved recognition accuracy relative to either subsystem alone (Naveriani et al., 15 Apr 2026).
1. Constituent Models: CTC and USDM
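Connectionist Temporal Classification (CTC) defines, for every acoustic frame $t \in \{1, \dots, T\}$, a probability distribution $p_{\text{CTC}}(\cdot \mid x, t)$ over an alphabet $V$ augmented with the blank symbol $\varnothing$, where $x$ denotes the sequence of input acoustic features. The probability of a label sequence $y$ (post-collapse) is given by summing over all alignments $\pi$ such that $\mathcal{B}(\pi) = y$, where $\mathcal{B}$ is the collapse function that merges repeated symbols and removes blanks:
$$p_{\text{CTC}}(y \mid x) = \sum_{\pi \,:\, \mathcal{B}(\pi) = y} \; \prod_{t=1}^{T} p_{\text{CTC}}(\pi_t \mid x, t)$$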
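As a small worked illustration of the sum over alignments, the sketch below enumerates every alignment for a toy two-frame input and adds up the probabilities of those that collapse to the target. The names `collapse` and `ctc_sequence_prob`, the `"_"` stand-in for the blank, and the toy posteriors are assumptions made for this example; real systems use the forward algorithm rather than brute-force enumeration.

```python
import itertools
import math

BLANK = "_"  # stand-in for the CTC blank symbol (an assumption of this sketch)

def collapse(path):
    """CTC collapse: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)

def ctc_sequence_prob(frame_probs, target):
    """Sum the product of framewise probabilities over all alignments
    that collapse to `target` (exponential cost; illustration only)."""
    symbols = list(frame_probs[0].keys())
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path) == tuple(target):
            total += math.prod(frame_probs[t][s] for t, s in enumerate(path))
    return total

# Toy posteriors for T = 2 frames over the symbols {a, blank}.
frame_probs = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}]

# Alignments (a,a), (a,_), (_,a) all collapse to "a": 0.3 + 0.3 + 0.2.
p_a = ctc_sequence_prob(frame_probs, ["a"])
p_empty = ctc_sequence_prob(frame_probs, [])
```

Here the two possible outputs, "a" and the empty sequence, together account for all of the probability mass.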
The Uniform-State Diffusion Model (USDM) is a discrete diffusion-based language model. The forward corruption process at each step $s$ randomly replaces each token $z_{s-1}^i$ with a token drawn from the uniform distribution $\mathcal{U}(V)$ with probability $\beta_s$, and retains it with probability $1 - \beta_s$. The transition distribution is
$$q(z_s^i \mid z_{s-1}^i) = (1 - \beta_s)\, \mathbb{1}[z_s^i = z_{s-1}^i] + \frac{\beta_s}{|V|}$$
During the denoising reverse process, a transformer $f_\theta$ predicts, for each position $i$ in the sequence $z_s$, a categorical distribution over $V$:
$$p_\theta(v \mid z_s, i), \quad v \in V$$
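The forward corruption step described above can be sketched in a few lines. This is a minimal illustration, assuming integer token ids and a single replacement probability `beta` per step; the function names and the example values are hypothetical and not taken from the paper.

```python
import random

def corrupt(tokens, beta, vocab_size, rng):
    """One forward step: each token is replaced by a uniform draw from
    {0, ..., vocab_size - 1} with probability beta, otherwise kept."""
    return [
        rng.randrange(vocab_size) if rng.random() < beta else tok
        for tok in tokens
    ]

def transition_prob(new_tok, old_tok, beta, vocab_size):
    """q(new | old) = (1 - beta) * [new == old] + beta / vocab_size.
    The uniform draw can land on the old token, hence the shared term."""
    return (1.0 - beta) * (new_tok == old_tok) + beta / vocab_size

rng = random.Random(0)
clean = [3, 1, 4, 1, 5]

unchanged = corrupt(clean, beta=0.0, vocab_size=10, rng=rng)  # no corruption
noised = corrupt(clean, beta=1.0, vocab_size=10, rng=rng)     # fully resampled

# Each row of the transition matrix sums to one.
row_sum = sum(transition_prob(v, 3, beta=0.3, vocab_size=10) for v in range(10))
```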
2. Construction of the Joint Scoring Function
The framework aligns the CTC encoder’s predictions to token positions and fuses these distributions with the per-position USDM predictions during the denoising process, as follows.
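- A greedy CTC pass produces a collapsed label sequence $\hat{y} = (\hat{y}_1, \dots, \hat{y}_N)$.
- Each token $\hat{y}_i$ is assigned its first corresponding frame index $t_i$ from the CTC alignment path.
- The relevant CTC framewise distribution (renormalized) is $\tilde{p}_{\text{CTC}}(v \mid x, t_i) = p_{\text{CTC}}(v \mid x, t_i) \big/ \sum_{v' \in V} p_{\text{CTC}}(v' \mid x, t_i)$ for $v \in V$.
At each USDM denoising step $s$ with noised input $z_s$, the joint log-probability for position $i$ is formed as a weighted sum:
$$\log p_{\text{joint}}(v \mid z_s, x, i) = \lambda_{\text{CTC}} \log \tilde{p}_{\text{CTC}}(v \mid x, t_i) + \lambda_{\text{USDM}} \log p_\theta(v \mid z_s, i)$$
where $p_\theta(v \mid z_s, i)$ is the USDM prediction for position $i$. The interpolation weights $\lambda_{\text{CTC}}$ and $\lambda_{\text{USDM}}$ balance the acoustic and language model contributions. This fused distribution is used to resample each sequence element for the next denoising step:
$$z_{s-1}^i \sim \operatorname{softmax}_{v \in V}\big(\log p_{\text{joint}}(v \mid z_s, x, i)\big)$$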
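To make the weighted log-linear fusion concrete, the sketch below combines two per-position distributions and renormalizes the result. It is only an illustration: `fuse`, the weight names `lam_ctc` and `lam_usdm`, the small `eps` guard against `log(0)`, and the three-token toy distributions are all assumptions of this example rather than details from the paper.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def fuse(ctc_probs, usdm_probs, lam_ctc, lam_usdm, eps=1e-12):
    """Weighted sum of log-probabilities, renormalized to a distribution.
    `eps` avoids log(0) and is an assumption of this sketch."""
    joint = [
        lam_ctc * math.log(pc + eps) + lam_usdm * math.log(pu + eps)
        for pc, pu in zip(ctc_probs, usdm_probs)
    ]
    return [math.exp(lp) for lp in log_softmax(joint)]

ctc = [0.7, 0.2, 0.1]    # toy renormalized CTC distribution at one position
usdm = [0.2, 0.7, 0.1]   # toy USDM prediction at the same position

fused = fuse(ctc, usdm, lam_ctc=1.0, lam_usdm=1.0)     # equal weighting
ctc_only = fuse(ctc, usdm, lam_ctc=1.0, lam_usdm=0.0)  # language model ignored
```

With equal weights the two sources disagree symmetrically here, so the first two tokens end up equally likely; setting one weight to zero recovers the other distribution.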
3. Decoding Procedure
The complete decoding method initializes from a greedy CTC sequence at a specified noise level and iteratively denoises using the hybrid probability fusion until a clean sequence is produced. The loop for each denoising step synthesizes both acoustic and contextual probabilities at the token level.
Key implementation details include ancestral sampling at each denoising step and the option to generate multiple chains for selection based on the final CTC score or to approximate beam search effects.
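The decoding loop described in this section might look roughly like the following. This is a sketch under stated assumptions, not the authors' implementation: `joint_decode`, `fuse`, and `uniform_denoiser` are hypothetical names, the denoiser is a stub standing in for the USDM transformer, and `init_tokens` stands in for the greedy CTC sequence after corruption to the initial noise level.

```python
import math
import random

def fuse(ctc_probs, lm_probs, lam_ctc, lam_lm, eps=1e-12):
    """Log-linear fusion of two distributions, renormalized."""
    logs = [lam_ctc * math.log(a + eps) + lam_lm * math.log(b + eps)
            for a, b in zip(ctc_probs, lm_probs)]
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    z = sum(weights)
    return [w / z for w in weights]

def joint_decode(ctc_rows, init_tokens, denoiser, num_steps,
                 lam_ctc, lam_lm, rng):
    """ctc_rows[i] is the renormalized CTC distribution for position i.
    denoiser(tokens, step) returns one distribution per position
    (a hypothetical interface for the USDM)."""
    tokens = list(init_tokens)
    for step in range(num_steps, 0, -1):
        lm_rows = denoiser(tokens, step)
        # Ancestral sampling: resample every position from the fused distribution.
        tokens = [
            rng.choices(range(len(row)),
                        weights=fuse(row, lm, lam_ctc, lam_lm))[0]
            for row, lm in zip(ctc_rows, lm_rows)
        ]
    return tokens

def uniform_denoiser(tokens, step):
    """Stub in place of the USDM transformer: uninformative predictions."""
    return [[1.0 / 3] * 3 for _ in tokens]

# Two positions, three tokens; the CTC rows are one-hot for clarity.
ctc_rows = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
hyp = joint_decode(ctc_rows, [1, 1], uniform_denoiser, num_steps=4,
                   lam_ctc=1.0, lam_lm=1.0, rng=random.Random(0))
```

Because the stub language model is uninformative and the toy CTC rows are one-hot, the chain settles on the CTC-preferred tokens; with a real denoiser the two sources would trade off according to the weights. Running several chains with different seeds and keeping the one with the best final CTC score would correspond to the multi-chain option mentioned above.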
4. Hyperparameters and Tuning Strategies
Optimal performance is highly dependent on three main hyperparameters:
- 2, 3 (acoustic-language balance): Best results were achieved with 4 (5).
- 6 (initial noise index): Controls initial corruption. A 7 value accelerates convergence without sacrificing accuracy.
- Number of denoising steps 8 (equivalently 9): Experimentation covered 0, with diminishing returns beyond 1.
The recommended tuning procedure involves:
- Fixing the USDM checkpoint;
- Grid search over 2 (dev set, 3);
- Sweeping 4;
- Varying 5 for speed-accuracy trade-off.
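The grid-search step of this procedure can be sketched as below. The callback `evaluate_wer`, the helper `grid_search`, the synthetic `toy_wer` function, and the grid values are all hypothetical placeholders; in practice the callback would run joint decoding on the dev set with a fixed USDM checkpoint and return the measured WER.

```python
import itertools

def grid_search(evaluate_wer, ctc_weights, lm_weights):
    """Return (best_wer, lam_ctc, lam_lm) over a grid of weight pairs.
    evaluate_wer(lam_ctc, lam_lm) -> dev-set WER (hypothetical callback)."""
    best = None
    for lam_ctc, lam_lm in itertools.product(ctc_weights, lm_weights):
        wer = evaluate_wer(lam_ctc, lam_lm)
        if best is None or wer < best[0]:
            best = (wer, lam_ctc, lam_lm)
    return best

def toy_wer(lam_ctc, lam_lm):
    """Synthetic stand-in for a real dev-set evaluation (not real data)."""
    return 5.0 + (lam_ctc - 1.0) ** 2 + (lam_lm - 0.5) ** 2

best = grid_search(toy_wer, [0.5, 1.0, 1.5], [0.25, 0.5, 0.75])
```

The later sweeps over the initial noise index and the number of denoising steps would follow the same pattern, holding the selected weights fixed.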
5. Computational Complexity and Practical Optimizations
The per-step complexity is 6 for the computation and normalization of joint distributions, with 7 additional for sampling. The total decoding cost is 8. Typical problems feature 9 (sequence length shorter than frame count) but large vocabularies (0 in the range of 10,000).
Optimization techniques include:
- Restricting candidate tokens per position to the top-$k$ entries of the CTC distribution,
- Caching static per-position CTC probabilities,
- Employing half-precision (FP16) inference for the USDM transformer,
- Early termination if the sequence remains unchanged over several denoising steps.
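Two of these optimizations, candidate restriction and early termination, are easy to illustrate. The sketch below is an assumption-laden example: `top_k_candidates`, `StabilityMonitor`, the `patience` parameter, and the toy values are hypothetical and do not come from the paper.

```python
def top_k_candidates(probs, k):
    """Indices and renormalized probabilities of the k most likely tokens."""
    idx = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)[:k]
    mass = sum(probs[v] for v in idx)
    return idx, [probs[v] / mass for v in idx]

class StabilityMonitor:
    """Signals a stop once the hypothesis has stayed unchanged
    for `patience` consecutive denoising steps."""

    def __init__(self, patience):
        self.patience = patience
        self.prev = None
        self.unchanged = 0

    def should_stop(self, tokens):
        self.unchanged = self.unchanged + 1 if tokens == self.prev else 0
        self.prev = list(tokens)
        return self.unchanged >= self.patience

# Keep only the two most likely tokens at one position.
idx, renorm = top_k_candidates([0.1, 0.6, 0.05, 0.25], k=2)

# The hypothesis stops changing after the second step.
monitor = StabilityMonitor(patience=2)
stops = [monitor.should_stop(t) for t in ([1, 2], [1, 3], [1, 3], [1, 3])]
```

Since the per-position CTC distributions do not change across denoising steps, the top-k index sets could be computed once and cached alongside them.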
6. Empirical Outcomes and Comparative Analysis
Evaluation on LibriSpeech dev-other gives the following word error rates (WER):
| System | WER (%) |
|---|---|
| CTC greedy (no LM) | 5.08 |
| USDM rescoring (K=256) | 4.82 |
| Joint CTC+USDM decoding | 4.71–4.77 |
| MDLM rescoring | 4.52 |
| AR LM joint decoding | 3.86 |
For joint CTC–USDM decoding with tuned hyperparameters and with USDM training extended to 25 epochs, the best obtained WER was 4.71%. This is an absolute reduction of 0.37 percentage points relative to the CTC greedy baseline (5.08%) and 0.11 percentage points relative to USDM rescoring (4.82%).
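The reductions implied by the table can be checked with a few lines of arithmetic. The snippet below uses only the WER values listed in the table; the variable names are chosen for this example.

```python
# WER values (%) copied from the table above.
baselines = {"CTC greedy": 5.08, "USDM rescoring": 4.82}
best_joint = 4.71

# Absolute reduction in percentage points, and relative reduction in percent.
absolute = {name: round(wer - best_joint, 2) for name, wer in baselines.items()}
relative = {name: round(100.0 * (wer - best_joint) / wer, 1)
            for name, wer in baselines.items()}
```

This gives absolute reductions of 0.37 and 0.11 points, or roughly 7.3% and 2.3% in relative terms.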
This fusion approach consistently surpasses static USDM rescoring (4.71–4.77% versus 4.82%). Per-step joint fusion enables the decoder to recover from certain CTC errors. Most of the performance gain is captured with 32–48 denoising steps; longer chains yield little additional benefit. Although MDLM rescoring (4.52%) and state-of-the-art autoregressive LM joint decoding (3.86%) still achieve lower WER, the gap to autoregressive decoding narrows as the USDM is trained longer and leveraged in joint decoding. The findings indicate that stepwise fusion of CTC and USDM probabilistic information yields better recognition hypotheses than greedy CTC decoding or USDM rescoring alone (Naveriani et al., 15 Apr 2026).