
Joint CTC–USDM Decoding Framework

Updated 2 May 2026
  • The paper introduces a hybrid joint CTC–USDM decoding framework that fuses framewise acoustic scores with sequence-level language modeling to improve recognition accuracy.
  • The method employs a reverse diffusion process and joint scoring function with optimized hyperparameters, balancing CTC and USDM contributions.
  • Empirical evaluations on LibriSpeech dev-other show WER improvements over CTC greedy decoding and USDM rescoring, with the joint approach reaching a WER as low as 4.71% under optimal tuning.

The Joint CTC–USDM decoding framework is a hybrid speech recognition decoding method that integrates Connectionist Temporal Classification (CTC) acoustic models and Uniform-State Diffusion Models (USDMs) to produce hypotheses benefiting from both strong framewise acoustic information and contextual sequence-level language modeling. This method constructs candidate hypotheses by fusing the CTC-derived framewise probability distributions with the labelwise probability outputs of USDM at each step of the reverse diffusion process, resulting in improved recognition accuracy relative to either subsystem alone (Naveriani et al., 15 Apr 2026).

1. Constituent Models: CTC and USDM

Connectionist Temporal Classification (CTC) defines, for every acoustic frame $t$, a probability distribution $P_{\mathrm{CTC},t}(v \mid X)$ over an alphabet $V$ augmented with the blank symbol $\varnothing$, where $X = x_{1:T}$ denotes the sequence of input acoustic features. The probability of a label sequence $y_{1:S}$ (post-collapse) is given by summing over all alignments $\pi_{1:T} \in (V \cup \{\varnothing\})^T$ such that $\mathrm{Collapse}(\pi) = y$:

$$P_{\mathrm{CTC}}(y \mid X) = \sum_{\pi:\,\mathrm{Collapse}(\pi)=y} \prod_{t=1}^T P_{\mathrm{CTC},t}(\pi_t \mid X)$$
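The collapse operation and the sum over alignments can be illustrated with a brute-force sketch. It assumes the blank symbol sits at index 0 (a convention chosen here, not stated in the source) and enumerates every path, so it is only usable on toy inputs; practical systems use the forward algorithm instead.

```python
import itertools
import numpy as np

BLANK = 0  # assumption for this sketch: index 0 is the blank symbol


def collapse(path):
    """Standard CTC collapse: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)


def ctc_sequence_prob(frame_probs, labels):
    """Sum the product of framewise probabilities over all alignments
    that collapse to `labels`.

    frame_probs: array of shape (T, |V| + 1), one distribution per frame.
    Enumerates all (|V| + 1)^T paths, so this is for toy sizes only.
    """
    T, n_sym = frame_probs.shape
    target = tuple(labels)
    total = 0.0
    for path in itertools.product(range(n_sym), repeat=T):
        if collapse(path) == target:
            total += float(np.prod(frame_probs[np.arange(T), list(path)]))
    return total
```

For two frames with a uniform distribution over {blank, a}, the paths (a, a), (blank, a) and (a, blank) all collapse to the single label a, so `ctc_sequence_prob` returns 3 × 0.25 = 0.75 for that label sequence.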

The Uniform-State Diffusion Model (USDM) is a discrete diffusion-based language model. The forward corruption process at each diffusion step $t$ randomly replaces each token $z_{t-1,s}$ with a token drawn from the uniform distribution over $V$ with probability $\beta_t$, and retains it with probability $1 - \beta_t$. The transition distribution is

$$q(z_{t,s} = v \mid z_{t-1,s}) = (1 - \beta_t)\,\mathbb{1}[v = z_{t-1,s}] + \frac{\beta_t}{|V|}$$
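A minimal sketch of one forward corruption step follows, assuming only what the prose states: each token is independently replaced by a uniform draw with some probability and kept otherwise. The names `corrupt_uniform`, `transition_probs`, and `replace_prob` are hypothetical, and the actual noise schedule used by USDM is not specified here.

```python
import numpy as np


def corrupt_uniform(tokens, replace_prob, vocab_size, rng):
    """Apply one uniform-state corruption step to a token sequence.

    Each position is replaced by a uniformly drawn token with probability
    `replace_prob` and left unchanged otherwise.
    """
    tokens = np.asarray(tokens)
    replace = rng.random(tokens.shape) < replace_prob
    random_tokens = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(replace, random_tokens, tokens)


def transition_probs(token, replace_prob, vocab_size):
    """Distribution over the next token implied by the step above.

    The uniform draw can land on the original token, so the retained
    token receives (1 - replace_prob) + replace_prob / vocab_size.
    """
    probs = np.full(vocab_size, replace_prob / vocab_size)
    probs[token] += 1.0 - replace_prob
    return probs
```

With `replace_prob = 0` the sequence is returned unchanged, and with `replace_prob = 1` every position is an independent uniform draw.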

During the denoising reverse process, a transformer $f_\theta$ predicts, for each position $s$ in the noised sequence $z_{t,1:S}$, a categorical distribution over $V$: $P_{\mathrm{USDM}}(v \mid z_{t,1:S}, s)$.

2. Construction of the Joint Scoring Function

The framework aligns the CTC encoder’s predictions to token positions and fuses these distributions with the per-position USDM predictions during the denoising process, as follows.

  • A greedy CTC pass produces a collapsed label sequence $\hat{y}_{1:S}$.
  • Each token $\hat{y}_s$ is assigned its first corresponding frame index $t_s$ from the CTC alignment path.
  • The relevant CTC framewise distribution (renormalized) is $\tilde{P}_{\mathrm{CTC},s}(v) \propto P_{\mathrm{CTC},t_s}(v \mid X)$ for $v \in V$.
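The three alignment steps above can be sketched as follows. This is an illustration under two assumptions not confirmed by the source: the blank is at index 0, and "renormalized" means zeroing the blank probability and rescaling the remaining mass to sum to one. The function name `greedy_ctc_with_frames` is hypothetical.

```python
import numpy as np

BLANK = 0  # assumption for this sketch: index 0 is the blank symbol


def greedy_ctc_with_frames(frame_probs):
    """Greedy CTC pass that also records where each token first appears.

    frame_probs: float array of shape (T, |V| + 1).
    Returns (labels, first_frames, per_token_probs), where row s of
    per_token_probs is the framewise distribution at the first frame of
    token s, with the blank removed and the rest renormalized.
    """
    best = frame_probs.argmax(axis=1)
    labels, first_frames = [], []
    prev = BLANK
    for t, sym in enumerate(best):
        # A new token starts when the argmax is non-blank and differs
        # from the previous frame's symbol.
        if sym != BLANK and sym != prev:
            labels.append(int(sym))
            first_frames.append(t)
        prev = sym
    per_token = frame_probs[first_frames].copy()
    per_token[:, BLANK] = 0.0  # assumed meaning of "renormalized"
    per_token /= per_token.sum(axis=1, keepdims=True)
    return labels, first_frames, per_token
```

Keeping the blank column in place (with zero mass) preserves the vocabulary indexing, so row entries line up with the USDM output indices if both models share a token inventory.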

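At each USDM denoising step $t$ with noised input $z_{t,1:S}$, the joint log-probability for position $s$ is formed as a weighted sum:

$$\log P_{\mathrm{joint},s}(v) = \lambda_{\mathrm{CTC}} \log \tilde{P}_{\mathrm{CTC},s}(v) + \lambda_{\mathrm{USDM}} \log P_{\mathrm{USDM}}(v \mid z_{t,1:S}, s)$$

where $\tilde{P}_{\mathrm{CTC},s}$ is the renormalized CTC distribution aligned to position $s$ and $P_{\mathrm{USDM}}$ is the USDM prediction for that position. The interpolation weights $\lambda_{\mathrm{CTC}}$ and $\lambda_{\mathrm{USDM}}$ balance the acoustic and language model contributions. This fused distribution, normalized over $v$, is used to resample each sequence element for the next denoising step: $z_{t-1,s} \sim P_{\mathrm{joint},s}(\cdot)$.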
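The weighted log-linear fusion and resampling described above might look like the sketch below. The names `fuse_and_sample`, `w_ctc`, and `w_usdm` are hypothetical, and both inputs are assumed to be per-position log-probabilities over the same vocabulary.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def fuse_and_sample(log_p_ctc, log_p_usdm, w_ctc, w_usdm, rng):
    """Fuse per-position log-probabilities and draw one token per position.

    log_p_ctc, log_p_usdm: arrays of shape (S, |V|).
    Returns (samples, log_probs), where log_probs is the normalized
    fused distribution and samples has shape (S,).
    """
    joint = w_ctc * log_p_ctc + w_usdm * log_p_usdm  # weighted sum of log-probs
    log_probs = log_softmax(joint, axis=-1)          # renormalize over the vocabulary
    probs = np.exp(log_probs)
    samples = np.array([rng.choice(probs.shape[1], p=row) for row in probs])
    return samples, log_probs
```

Setting `w_usdm = 0` recovers sampling from the CTC distribution alone, which is a convenient sanity check. Exact zeros in either input produce `-inf` log-probabilities, so a small floor may be needed if a weight can be zero.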

3. Decoding Procedure

The complete decoding method initializes from a greedy CTC sequence at a specified noise level and iteratively denoises using the hybrid probability fusion until a clean sequence is produced. The loop for each denoising step synthesizes both acoustic and contextual probabilities at the token level.

Key implementation details include ancestral sampling at each denoising step and the option to generate multiple chains, either for selection based on the final CTC score or to approximate beam search effects.
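A rough sketch of a single decoding chain is given below. It is not the authors' algorithm: the USDM transformer is replaced by a hypothetical callable `usdm_log_probs(tokens, step)` returning an (S, |V|) array of log-probabilities, the initial noise level is modeled as a single replacement probability `init_replace_prob`, and each step simply resamples every position from the fused distribution.

```python
import numpy as np


def joint_decode(ctc_tokens, log_p_ctc, usdm_log_probs, num_steps,
                 init_replace_prob, w_ctc, w_usdm, rng):
    """Run one chain of joint CTC-USDM denoising (illustrative only).

    ctc_tokens:     greedy CTC label sequence, shape (S,)
    log_p_ctc:      per-position CTC log-probabilities, shape (S, |V|)
    usdm_log_probs: hypothetical stand-in for the USDM transformer
    """
    ctc_tokens = np.asarray(ctc_tokens)
    S, V = log_p_ctc.shape

    # Initialize from the greedy CTC sequence, corrupted with uniform noise.
    replace = rng.random(S) < init_replace_prob
    tokens = np.where(replace, rng.integers(0, V, size=S), ctc_tokens)

    for step in range(num_steps, 0, -1):
        # Fuse acoustic and language-model scores for every position.
        joint = w_ctc * log_p_ctc + w_usdm * usdm_log_probs(tokens, step)
        joint = joint - joint.max(axis=1, keepdims=True)
        probs = np.exp(joint)
        probs /= probs.sum(axis=1, keepdims=True)
        # Ancestral sampling: draw the next, less noisy sequence.
        tokens = np.array([rng.choice(V, p=p) for p in probs])
    return tokens
```

Running several chains with different random seeds and keeping the one with the best final CTC score would correspond to the multi-chain option mentioned above.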

4. Hyperparameters and Tuning Strategies

Optimal performance is highly dependent on three main hyperparameters:

  • $\lambda_{\mathrm{CTC}}$, $\lambda_{\mathrm{USDM}}$ (acoustic-language balance): the interpolation weights applied to the CTC and USDM log-probabilities. The best results were obtained at a tuned setting of these weights.
  • Initial noise index: controls the initial corruption applied to the greedy CTC sequence. An appropriately chosen value accelerates convergence without sacrificing accuracy.
  • Number of denoising steps: experiments covered a range of step counts, with diminishing returns beyond roughly 32–48 steps.

The recommended tuning procedure involves:

  1. Fixing the USDM checkpoint;
  2. Grid search over the interpolation weights on the dev set;
  3. Sweeping the initial noise index;
  4. Varying the number of denoising steps for speed-accuracy trade-off.
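The grid-search step of this procedure could be organized as in the sketch below. The callable `evaluate_wer` is hypothetical: it stands for a full dev-set decoding run with a fixed USDM checkpoint and returns the resulting WER. The candidate grids are left to the caller, since the source's values are not reproduced here.

```python
import itertools


def grid_search_weights(evaluate_wer, ctc_weights, usdm_weights):
    """Exhaustively search the Cartesian grid of interpolation weights.

    evaluate_wer: hypothetical callable (w_ctc, w_usdm) -> dev-set WER.
    Returns (best_wer, best_w_ctc, best_w_usdm).
    """
    best = None
    for w_ctc, w_usdm in itertools.product(ctc_weights, usdm_weights):
        wer = evaluate_wer(w_ctc, w_usdm)
        # Keep the first configuration that achieves the lowest WER.
        if best is None or wer < best[0]:
            best = (wer, w_ctc, w_usdm)
    return best
```

The same loop shape applies to the later sweeps over the initial noise index and the number of denoising steps, holding the already-selected weights fixed.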

5. Computational Complexity and Practical Optimizations

The per-step complexity is $O(S\,|V|)$ for the computation and normalization of joint distributions over $S$ positions and a vocabulary of size $|V|$, with an additional cost for sampling. The total decoding cost scales as the number of denoising steps times this per-step cost. Typical problems feature $S < T$ (sequence length shorter than frame count) but large vocabularies ($|V|$ in the range of 10,000).

Optimization techniques include:

  • Restricting candidate tokens per position to the top-$k$ from the CTC distribution,
  • Caching static per-position CTC probabilities,
  • Employing half-precision (FP16) inference for the USDM transformer,
  • Early termination if the sequence remains unchanged over several denoising steps.
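Two of the optimizations listed above, top-k candidate restriction and early termination, can be sketched as follows. The function names and the `patience` parameter are hypothetical, and the values of `k` and `patience` are left to the caller.

```python
import numpy as np


def restrict_to_top_k(log_p_ctc, k):
    """Keep the k highest-scoring tokens per position; mask the rest to -inf.

    log_p_ctc: float array of shape (S, |V|), with 1 <= k <= |V|.
    """
    S, V = log_p_ctc.shape
    # argpartition places the indices of the k largest entries in the last k slots.
    keep = np.argpartition(log_p_ctc, V - k, axis=1)[:, V - k:]
    masked = np.full_like(log_p_ctc, -np.inf)
    np.put_along_axis(masked, keep,
                      np.take_along_axis(log_p_ctc, keep, axis=1), axis=1)
    return masked


def should_stop(history, patience):
    """Return True if the sequence was unchanged over the last `patience` steps.

    history: list of token arrays, one per completed denoising step.
    """
    if len(history) < patience + 1:
        return False
    last = history[-1]
    return all(np.array_equal(last, h) for h in history[-(patience + 1):-1])
```

Because the per-position CTC distributions do not change across denoising steps, the masked array can be computed once and reused, which is the caching optimization in the list.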

6. Empirical Outcomes and Comparative Analysis

Evaluation on LibriSpeech dev-other gives the following word error rates (WER):

System | WER (%)
CTC greedy (no LM) | 5.08
USDM rescoring (K=256) | 4.82
Joint CTC+USDM decoding | 4.71–4.77
MDLM rescoring | 4.52
AR LM joint decoding | 3.86

For joint CTC–USDM decoding with tuned hyperparameters and with USDM training extended to 25 epochs, the best obtained WER was 4.71%. Across the reported range of 4.71–4.77%, the absolute WER reduction is 0.31–0.37 percentage points relative to CTC greedy decoding (5.08%) and 0.05–0.11 percentage points relative to USDM rescoring (4.82%).
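The reductions implied by the table can be recomputed directly from its WER values. The short sketch below uses only those numbers; the helper names are hypothetical.

```python
# WER values (%) taken from the LibriSpeech dev-other table
wer = {
    "ctc_greedy": 5.08,
    "usdm_rescoring": 4.82,
    "joint_best": 4.71,
    "joint_worst": 4.77,
}


def absolute_reduction(baseline, system):
    """Absolute WER reduction in percentage points, rounded to 2 decimals."""
    return round(baseline - system, 2)


def relative_reduction(baseline, system):
    """Relative WER reduction in percent of the baseline, rounded to 1 decimal."""
    return round(100.0 * (baseline - system) / baseline, 1)


if __name__ == "__main__":
    for base in ("ctc_greedy", "usdm_rescoring"):
        for joint in ("joint_best", "joint_worst"):
            print(base, joint,
                  absolute_reduction(wer[base], wer[joint]),
                  relative_reduction(wer[base], wer[joint]))
```

The 0.31-point figure corresponds to the upper end of the joint range (4.77%) measured against CTC greedy decoding; at the best setting (4.71%) the gain over CTC greedy is 0.37 points and the gain over USDM rescoring is 0.11 points.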

This fusion approach consistently surpasses static USDM rescoring. Per-step joint fusion enables the decoder to recover from certain CTC errors. Most of the performance gain is captured with 32–48 denoising steps; longer chains yield little additional benefit. MDLM rescoring (4.52%) and autoregressive LM joint decoding (3.86%) still achieve lower WER, but the gap narrows as the USDM is trained longer and used in joint decoding. The findings indicate that stepwise fusion of CTC and USDM probabilistic information yields better recognition hypotheses than CTC greedy decoding or USDM rescoring alone (Naveriani et al., 15 Apr 2026).
