Insertion and Deletion of Interruption Token (IDIT)
- IDIT is a supervised technique that aligns LLM subword outputs with the token granularity required for code-switched ASR.
- It segments mixed-language inputs into distinct units, ensuring each Chinese character and English word is represented by exactly one token.
- IDIT integrates into a MoE-enhanced speech-to-text pipeline, improving accuracy and stability without altering the core model architecture.
The Insertion and Deletion of Interruption Token (IDIT) mechanism is a supervised label construction technique designed to address the misalignment between subword tokenization in LLMs and the token-level requirements of code-switching automatic speech recognition (ASR). IDIT enables effective transfer of LLM text generation capabilities into ASR for mixed-language (e.g., Mandarin/English) transcripts by enforcing explicit character- and word-level granularity at the label construction stage. It is introduced as a central component within a speech-conditioned LLM pipeline enhanced by a Mixture of Experts (MoE) connector (Zhang et al., 2024).
1. Formal Definition and Mathematical Formulation
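Let $S$ be a ground-truth code-switched string composed of a sequence of Mandarin Chinese characters and English words. The IDIT method introduces a dedicated “interruption” token $\langle\mathrm{int}\rangle$ (vocabulary index $k$) and defines the following operators:
- $\mathrm{Ins}(\cdot)$: injects $\langle\mathrm{int}\rangle$ after every Chinese character and English word in $S$.
- $\mathrm{Tok}(\cdot)$: the LLM’s pretrained subword tokenizer (e.g., a 151,643-token BPE vocabulary).
- $\mathrm{Del}(\cdot)$: eliminates all $\langle\mathrm{int}\rangle$ tokens from a tokenized sequence.
Define $\tilde{S} = \mathrm{Ins}(S)$. The final sequence of training labels $Y$ is constructed by tokenizing $\tilde{S}$ and removing all occurrences of $\langle\mathrm{int}\rangle$:
$$Y = \mathrm{Del}\big(\mathrm{Tok}(\tilde{S})\big)$$
The training objective is standard autoregressive next-token cross-entropy:
$$\mathcal{L} = -\sum_{t=1}^{|Y|} \log p_\theta\big(y_t \mid y_{<t}, X\big)$$
where $X$ denotes the input speech and $y_t$ is the $t$-th element of $Y$.
No auxiliary regularization is added; the sole innovation is the construction of $Y$ so that the model outputs exactly one token per Chinese character or English word.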
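The next-token cross-entropy mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code: the function name, the list-of-lists logits layout, and the choice to average over positions (a summed variant differs only by a constant factor) are all assumptions made for the example.

```python
import math

def next_token_cross_entropy(logits, labels):
    """Mean negative log-likelihood of labels[t] under softmax(logits[t]).

    logits: one list of vocabulary scores per target position, assumed to
            come from a model conditioned on the speech input and on
            labels[:t] (teacher forcing).
    labels: integer token ids, e.g. IDIT-constructed targets.
    """
    total = 0.0
    for scores, y in zip(logits, labels):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[y]
    return total / len(labels)

# Hypothetical toy input: uniform scores over a 4-token vocabulary,
# so every target has probability 1/4 and the loss equals log(4).
logits = [[0.0] * 4 for _ in range(3)]
labels = [1, 2, 3]
loss = next_token_cross_entropy(logits, labels)
```

Because IDIT changes only the integer ids in `labels`, the loss computation itself is unaffected.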
2. Algorithmic Construction and Implementation
The IDIT label construction algorithm segments the code-switched string into single Chinese characters or complete English words, appends the interruption token after each, tokenizes the resulting sequence, and then removes all interruption tokens. During both training and inference, no interruption tokens are produced by the model itself; IDIT solely modifies the target sequence. This procedure is applied to each training example. The model architecture (speech encoder, MoE connector, and LLM with LoRA adapters) remains unchanged, and gradients propagate normally through all trainable parameters, since IDIT only affects the integer label targets in the loss function.
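The four steps described above (segment, insert, tokenize, delete) can be sketched as follows. This is an illustrative sketch under stated assumptions: the `<int>` surface form, the six-entry toy vocabulary, the greedy longest-match tokenizer standing in for the LLM's BPE tokenizer, and the sample string are all hypothetical and not taken from the source.

```python
import re

INT = "<int>"  # hypothetical surface form of the interruption token

# Toy vocabulary standing in for the LLM's BPE vocabulary (assumption).
# It includes a merged two-character Chinese token to show the kind of
# merge that inserting the interruption token prevents.
VOCAB = {"我想": 0, "我": 1, "想": 2, "要": 3, "apple": 4, INT: 5}
INT_ID = VOCAB[INT]

def segment(text):
    """Split into single CJK characters and whole Latin-script words."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def insert_int(units):
    """Append the interruption token after every unit."""
    return "".join(unit + INT for unit in units)

def tokenize(text):
    """Greedy longest-match tokenizer over VOCAB (a stand-in for BPE)."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    ids, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = next((p for p in pieces if text.startswith(p, i)), None)
        if match is None:
            raise ValueError(f"no vocabulary entry matches at position {i}")
        ids.append(VOCAB[match])
        i += len(match)
    return ids

def delete_int(ids):
    """Remove every interruption-token id from a tokenized sequence."""
    return [t for t in ids if t != INT_ID]

def idit_labels(text):
    return delete_int(tokenize(insert_int(segment(text))))

text = "我想要 apple"
native = tokenize(text)     # the merged token "我想" is used
labels = idit_labels(text)  # one id per character or word
```

In this toy setup the native tokenization merges the first two characters into a single id, while the IDIT labels contain one id per segmented unit. With a real BPE tokenizer the interruption token would typically be registered as a special token so that it is never split or merged with its neighbours.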
3. Integration with Mixture-of-Experts Connector and Training Regimen
The speech-conditioned LLM framework utilizing IDIT features a Mixture-of-Experts connector with two expert feedforward networks ($N = 2$): $E_{\mathrm{zh}}$ for Mandarin and $E_{\mathrm{en}}$ for English. Frame-wise expert probabilities are computed by a weight-shared router,
$$p_t = \mathrm{softmax}(W_r h_t + b_r)$$
where $h_t$ are $d$-dimensional speech frame embeddings.
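A frame-wise softmax router over two experts can be sketched in plain Python as below. Several parts are assumptions for illustration only: the router is taken to be a single linear projection followed by a softmax, the expert outputs are combined as a probability-weighted sum (the text above does not state the combination rule), and the two lambda "experts" are toy stand-ins for the Mandarin and English feedforward networks.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def route(frame, W, b):
    """Expert probabilities for one frame: softmax(W @ frame + b)."""
    logits = [sum(w * x for w, x in zip(row, frame)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

def moe_connector(frames, W, b, experts):
    """Mix expert outputs per frame, weighted by router probabilities.

    The same W and b are applied to every frame (a shared router).
    The weighted-sum combination is an assumption of this sketch.
    """
    mixed = []
    for h in frames:
        p = route(h, W, b)
        outs = [expert(h) for expert in experts]
        mixed.append([sum(p_i * out[j] for p_i, out in zip(p, outs))
                      for j in range(len(outs[0]))])
    return mixed

# Toy stand-ins for the two expert networks (hypothetical).
expert_zh = lambda h: [2.0 * x for x in h]
expert_en = lambda h: [-1.0 * x for x in h]

W = [[0.0, 0.0], [0.0, 0.0]]  # zero router weights give uniform routing
b = [0.0, 0.0]
frames = [[1.0, 2.0]]
probs = route(frames[0], W, b)
mixed = moe_connector(frames, W, b, [expert_zh, expert_en])
```

With zero router weights both experts receive probability 0.5, so the mixed output is the average of the two expert outputs.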
The two-stage progressive training schedule is as follows:
| Stage | LLM/LoRA | IDIT Used | Expert Routing | Label Granularity | Modules Trained |
|---|---|---|---|---|---|
| 1: Alignment | Frozen | No | Deterministic (language class) | Native subword tokenization | MoE connector only |
| 2: Fine-tune | Unfrozen | Yes | Softmax router (see above) | Character-level (zh), word-level (en) | MoE connector + LoRA adapters |
This curriculum allows the connector to initially learn a robust cross-modal mapping before the LLM is adapted to the enforced label granularity.
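The two-stage schedule in the table can be written down as a small configuration object. This is a hedged sketch: the class, field names, and module names such as `moe_connector` and `lora_adapters` are hypothetical labels chosen for the example, not identifiers from the original system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    name: str
    use_idit: bool        # whether labels are built with IDIT
    routing: str          # "deterministic" (by language class) or "softmax"
    trainable: frozenset  # hypothetical module names that receive gradients

STAGES = {
    1: StageConfig("alignment", False, "deterministic",
                   frozenset({"moe_connector"})),
    2: StageConfig("fine-tune", True, "softmax",
                   frozenset({"moe_connector", "lora_adapters"})),
}

def is_trainable(stage, module):
    """Return True if `module` is updated during the given stage."""
    return module in STAGES[stage].trainable
```

A training loop could consult `is_trainable` when deciding which parameter groups to pass to the optimizer, and `use_idit` when building label sequences for a batch.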
4. Hyperparameters and Training Protocol
The IDIT-enhanced pipeline is trained with the following settings:
- LLM tokenizer vocabulary size: 151,643 (Qwen2-7B).
- Interruption token ID: a single additional vocab item.
- Number of experts: 2 (one for Mandarin, one for English).
- Expert FFN hidden dimension: 2048.
- Router projection: a linear layer (weight $W_r$, bias $b_r$) mapping each frame embedding to two expert logits.
- Downsampling (frame splicing): factor 5.
- LoRA adapter hyperparameters: rank $r$ and scaling factor $\alpha$.
- Effective batch size: 18 (batch size 6, grad-accumulation 3).
- Optimizer: AdamW with momentum coefficients $(\beta_1, \beta_2)$ and learning rate $\eta$; no weight decay.
- Learning rate warmup: 1000 steps.
- Hardware: 8 × NVIDIA A800 GPUs; each training stage runs for roughly 25k steps.
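Two entries in the list above lend themselves to a short worked sketch: frame splicing with factor 5 and the effective batch size. The splicing function below is an assumption about how such downsampling is commonly done (concatenating consecutive frames along the feature axis and dropping any leftover frames); the source does not describe the handling of remainders, and the function name and toy input are hypothetical.

```python
def splice_frames(frames, factor=5):
    """Concatenate every `factor` consecutive frames into one longer frame.

    Trailing frames that do not fill a complete group are dropped
    (an assumption of this sketch).
    """
    usable = len(frames) - len(frames) % factor
    return [
        [x for frame in frames[i:i + factor] for x in frame]
        for i in range(0, usable, factor)
    ]

# Hypothetical toy input: 12 frames of dimension 2.
frames = [[float(t), float(t) + 0.5] for t in range(12)]
spliced = splice_frames(frames)  # 2 frames of dimension 10

# Effective batch size from the list: per-step batch times accumulation steps.
batch_size, grad_accumulation = 6, 3
effective_batch = batch_size * grad_accumulation
```

Splicing by a factor of 5 shortens the sequence fivefold while multiplying the per-frame feature dimension by 5, which reduces the number of speech positions passed to the LLM.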
The staged curriculum, specifically the omission of IDIT and LoRA adaptation in stage 1, is critical for stable convergence and effective representation learning.
5. Illustrative Example of the IDIT Mechanism
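For a schematic code-switched segment $S = c_1 c_2\, w_1$, consisting of two Chinese characters $c_1, c_2$ followed by one English word $w_1$, and writing $\langle\mathrm{int}\rangle$ for the interruption token:
- Segmenting $S$ yields $[c_1, c_2, w_1]$.
- Insertion: $c_1 \langle\mathrm{int}\rangle\, c_2 \langle\mathrm{int}\rangle\, w_1 \langle\mathrm{int}\rangle$.
- Tokenization (BPE or similar): $[c_1, \langle\mathrm{int}\rangle, c_2, \langle\mathrm{int}\rangle, w_1, \langle\mathrm{int}\rangle]$.
- Removal of $\langle\mathrm{int}\rangle$ tokens: $[c_1, c_2, w_1]$.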
The final label sequence yields exactly one token per Chinese character and one per English word, tightly controlling output granularity and aligning token emission with spoken segments.
6. Effect and Significance in ASR for Code-switching
The IDIT mechanism eliminates spurious insertions and deletions that typically arise when LLM subword tokenization is misaligned with spoken unit boundaries, particularly in code-switched utterances. It achieves character-level resolution for Chinese and word-level for English, precisely matching linguistic intuition for mixed transcripts. Importantly, IDIT operates entirely outside the model forward pass, yielding a fully differentiable pipeline and avoiding complexity in model architecture or loss computation.
Within the MoE-enhanced speech-LM pipeline, IDIT enables the model to leverage pretrained LLM text generation capabilities in an ASR context without requiring structural changes to the model or introduction of extra loss terms. The approach is thus both lightweight and effective for code-switched speech recognition (Zhang et al., 2024).